UIMA Annotation
The Range Annotator extension turns Semantic Turkey into an interactive annotation system, which uses ontologies to express annotation schemata.
The UIMAST extension allows Semantic Turkey users to export their RDF/OWL annotations into a UIMA-compliant format, making it possible to use them in a UIMA workflow.
The corpus format
The corpus is serialized as a collection of XMI files, one for each annotated document, which contain:
- the URI of the document
- the metadata describing that document
The XMI format was chosen because the UIMA specification developed at OASIS adopts it to serialize the data in a CAS.
The data model
In Semantic Turkey there is a deep separation between the domain knowledge (WHAT) and the associated documents (WHERE).
The concept ann:RangeAnnotation (which extends ann:SemanticAnnotation) represents a range within a (structured) document, identified by an XPointer expression.
An individual of the class ann:RangeAnnotation holds a regional reference to a portion of a document, but it has no domain-specific meaning.
Every ann:RangeAnnotation individual is the object of the property ann:annotation of a domain object, which assigns domain-specific semantics to the labelled region.
In a UIMA type system there is an annotation type for each domain entity to be identified. Those types extend a common base type, which holds the regional reference to a portion of the SOFA (Subject Of Analysis).
At present the system exports the raw XPointer used by Semantic Turkey, so a custom base type is needed to hold the XPointer expression.
Once the execution of an Analysis Engine is supported, we will implement the automatic extraction of the unformatted text and the conversion of the regional reference from an XPointer expression to integer offsets.
A UIMA annotation corresponds to the pair <domain object, range annotation> = <domainObj, annot> in Semantic Turkey, since
- the range annotation holds a regional reference (but has no meaning of its own)
- the domain object assigns application-specific semantics to the range referred to by the annotation
Each pair <domainObj, annot> may be transformed into one or more UIMA annotations, which depend only on domainObj, since annot has no domain-specific meaning.
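The correspondence above can be illustrated with a minimal Python sketch that collects the <domainObj, annot> pairs from a list of triples. The in-memory triple list, the ex: resources and the function name annotation_pairs are assumptions made for illustration; only ann:annotation and ann:RangeAnnotation come from the data model described here.

```python
# Hypothetical in-memory triples (subject, predicate, object).
# The ex: resources are illustrative and not part of Semantic Turkey.
TRIPLES = [
    ("ex:Rome", "rdf:type", "ex:City"),
    ("ex:Rome", "ann:annotation", "ex:annot1"),
    ("ex:Rome", "ann:annotation", "ex:annot2"),
    ("ex:annot1", "rdf:type", "ann:RangeAnnotation"),
    ("ex:annot2", "rdf:type", "ann:RangeAnnotation"),
    ("ex:Rome", "rdfs:label", "Rome"),
]


def annotation_pairs(triples):
    """Return every <domainObj, annot> pair linked by ann:annotation."""
    # Range annotations carry only the regional reference.
    range_annots = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "ann:RangeAnnotation"
    }
    # The domain object is the subject, the range annotation the object.
    return [
        (s, o) for s, p, o in triples
        if p == "ann:annotation" and o in range_annots
    ]


pairs = annotation_pairs(TRIPLES)
```

Each resulting pair is a candidate for one or more UIMA annotations whose types are determined by the domain object alone.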
The ontology projection
Once a UIMA type system has been chosen, it is necessary to project some ontology classes into types in the type system.
Let C be a class in the ontology; T(C) denotes the associated type in the type system.
An object domainObj such that domainObj rdf:type C is transformed into a feature structure of type T(C).
Given two classes A and B, if A rdfs:subClassOf B, then the system requires that T(A) extends or is equal to T(B).
The above requirement guarantees that a domain object does not lose its identity during the export process, since a feature structure cannot have two unrelated types.
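If A owl:equivalentClass B, then T(A) is equal to T(B): equivalence entails rdfs:subClassOf in both directions, so each type would have to extend or equal the other, and there cannot be cycles in the type system.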
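The requirement on subclasses can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the type system is represented as a dictionary mapping each type to its single supertype, the projection T as a dictionary from class to type, and an equivalence is written as two subclass axioms. All class names, type names and function names are hypothetical.

```python
def extends_or_equals(sub_type, super_type, parent_of):
    """True if sub_type equals super_type or inherits from it."""
    current = sub_type
    while current is not None:
        if current == super_type:
            return True
        # Walk up the (acyclic, single-inheritance) type hierarchy.
        current = parent_of.get(current)
    return False


def projection_violations(subclass_axioms, T, parent_of):
    """List the axioms (A, B), read as A rdfs:subClassOf B, for which
    T(A) neither extends nor equals T(B)."""
    return [
        (a, b) for a, b in subclass_axioms
        if a in T and b in T
        and not extends_or_equals(T[a], T[b], parent_of)
    ]


# Hypothetical type system: each type maps to its supertype.
PARENT_OF = {
    "XPointerAnnotation": None,
    "Place": "XPointerAnnotation",
    "City": "Place",
    "Person": "XPointerAnnotation",
}
# Hypothetical projection T from ontology classes to types.
T = {"ex:Place": "Place", "ex:City": "City",
     "ex:Town": "City", "ex:Mayor": "Person"}
AXIOMS = [
    ("ex:City", "ex:Place"),
    ("ex:City", "ex:Town"),   # ex:City and ex:Town are equivalent,
    ("ex:Town", "ex:City"),   # written as two subclass axioms
    ("ex:Mayor", "ex:Place"),  # deliberately inconsistent with T
]

violations = projection_violations(AXIOMS, T, PARENT_OF)
```

In this example the two equivalent classes share the type City, so they pass the check, while the last axiom is reported because Person is unrelated to Place.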
When projecting a class into a type, the user can define initializers for the features of that type. An initializer can be either an immediate value or the values of a property.
The ontology projection is described in an XML file.
The projection process
The following explanation describes the process from a theoretical point of view and may not reflect the actual implementation.
For every annotated page P, the system considers every pair <domain object, range annotation> = <domainObj, annot> such that annot is located on that page.
Let CC be the set of the projected classes C such that one can prove that domainObj rdf:type C.
Let M be the set of the minimal elements of CC (keeping one representative for each group of equivalent classes) with respect to the partial order among classes induced by the ontology property rdfs:subClassOf.
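As an illustration of how M could be computed, here is a minimal Python sketch. It assumes that the provable rdfs:subClassOf relation is available as a reflexive and transitive predicate is_subclass; the set PROVABLE, the ex: classes and the function names are hypothetical and do not describe the actual implementation.

```python
def minimal_classes(cc, is_subclass):
    """Return the minimal elements of cc, keeping one representative
    for each group of equivalent classes."""
    minimal = []
    for c in sorted(cc):  # sorted only to make the result deterministic
        # c is not minimal if some class d lies strictly below it.
        if any(is_subclass(d, c) and not is_subclass(c, d) for d in cc):
            continue
        # Skip c if an equivalent class has already been kept.
        if any(is_subclass(c, m) and is_subclass(m, c) for m in minimal):
            continue
        minimal.append(c)
    return minimal


# Hypothetical provable subclass pairs (already transitively closed).
PROVABLE = {
    ("ex:Capital", "ex:City"), ("ex:Capital", "ex:Town"),
    ("ex:Capital", "ex:Place"),
    ("ex:City", "ex:Town"), ("ex:Town", "ex:City"),  # equivalent classes
    ("ex:City", "ex:Place"), ("ex:Town", "ex:Place"),
}


def is_subclass(a, b):
    """Reflexive, transitive subclass test over the hypothetical data."""
    return a == b or (a, b) in PROVABLE


CC = {"ex:Place", "ex:City", "ex:Town", "ex:Capital", "ex:Seat"}
M = minimal_classes(CC, is_subclass)
```

Here ex:Seat is unrelated to the other classes, so M contains two elements and the domain object would yield two feature structures.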
For each class C belonging to M the system creates a feature structure fs of type T(C); then it initializes fs with the initializers associated with the classes D belonging to CC such that one can prove that C rdfs:subClassOf D.
The classes D must be considered from the most general to the most specific to guarantee a proper order of initialization for those features which are assigned in several contexts.
The procedure is well-defined, since if C rdfs:subClassOf D, then it is required that T(C) extends or is equal to T(D), thus every feature defined for T(D) must be defined also for T(C).
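The initialization order can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the actual implementation: a feature structure is modelled as a plain dictionary, initializers are immediate values only, and is_subclass is a reflexive and transitive test for provable rdfs:subClassOf. All class, type and feature names are hypothetical.

```python
def build_feature_structure(c, cc, is_subclass, T, initializers):
    """Create a feature structure of type T(c) and apply the initializers
    of every class D in cc with c rdfs:subClassOf D, general first."""
    # A more general class has fewer superclasses within cc, so sorting
    # by that count visits classes from most general to most specific.
    ancestors = sorted(
        (d for d in cc if is_subclass(c, d)),
        key=lambda d: (sum(is_subclass(d, e) for e in cc), d),
    )
    fs = {"type": T[c], "features": {}}
    for d in ancestors:
        # Later (more specific) initializers override earlier ones.
        fs["features"].update(initializers.get(d, {}))
    return fs


# Hypothetical provable subclass pairs (already transitively closed).
PROVABLE = {
    ("ex:Capital", "ex:City"),
    ("ex:Capital", "ex:Place"),
    ("ex:City", "ex:Place"),
}


def is_subclass(a, b):
    """Reflexive, transitive subclass test over the hypothetical data."""
    return a == b or (a, b) in PROVABLE


CC = {"ex:Place", "ex:City", "ex:Capital"}
T = {"ex:Place": "Place", "ex:City": "City", "ex:Capital": "City"}
INITIALIZERS = {
    "ex:Place": {"kind": "place", "confidence": 1.0},
    "ex:City": {"kind": "city"},
    "ex:Capital": {"isCapital": True},
}

fs = build_feature_structure("ex:Capital", CC, is_subclass, T, INITIALIZERS)
```

The feature kind is assigned in two contexts; because ex:Place is processed before ex:City, the more specific value wins, while confidence keeps the value set by the most general class.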