UIMA Annotation

The Range Annotator extension turns Semantic Turkey into an interactive annotation system, which uses ontologies to express annotation schemata.

The UIMAST extension allows Semantic Turkey users to export their RDF/OWL annotations into a UIMA-compliant format, making it possible to use them in a UIMA workflow.

The corpus format

The corpus is serialized as a collection of XMI files, one for each annotated document, each of which contains:

the URI of the document
the metadata describing that document

The XMI format was chosen because the UIMA specification developed at OASIS adopts it to serialize the data in a CAS.

The data model

In Semantic Turkey there is a deep separation between the domain knowledge (WHAT) and the associated documents (WHERE).

The concept ann:RangeAnnotation (which extends ann:SemanticAnnotation) represents a range within a (structured) document, identified by an XPointer expression.

An individual of the class ann:RangeAnnotation holds a regional reference to a portion of a document, but it has no domain-specific meaning of its own.

Every ann:RangeAnnotation individual is the object of the property ann:annotation of a domain object, which assigns domain-specific semantics to the labelled region.

The Semantic Turkey annotation model

In a UIMA type system there is an annotation type for each domain entity to be identified. Those types extend a common base type, which holds the regional reference to a portion of the SOFA (Subject Of Analysis).

The CAS annotation model

At present, the system exports the raw XPointer used by Semantic Turkey, so a custom base type is needed to hold the XPointer expression.

Once the execution of an Analysis Engine is supported, we will implement the automatic extraction of the unformatted text and the conversion of the regional reference from an XPointer expression to integer offsets.

A UIMA annotation corresponds to the pair <domain object, range annotation> = <domainObj, annot> in Semantic Turkey, since

the range annotation: holds a regional reference (but has no meaning of its own)
the domain object: assigns application-specific semantics to the range referred to by the annotation

Each pair <domainObj, annot> may be transformed into one or more UIMA annotations, which depend only on domainObj, since annot has no domain-specific meaning.
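As a minimal sketch of how such pairs could be enumerated, the Python fragment below scans a list of RDF triples for ann:annotation links whose object is an ann:RangeAnnotation individual. The ex: resources are invented for illustration, and the plain-tuple representation is an assumption made for brevity; it is not how Semantic Turkey stores its data.

```python
# Hypothetical triples (subject, predicate, object); the ex: names are invented.
TRIPLES = [
    ("ex:Rome", "rdf:type", "ex:City"),
    ("ex:Rome", "ann:annotation", "ex:annot1"),
    ("ex:annot1", "rdf:type", "ann:RangeAnnotation"),
]


def annotation_pairs(triples):
    """Return every <domainObj, annot> pair linked by ann:annotation."""
    # Individuals explicitly typed as range annotations (no inference here).
    range_annots = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "ann:RangeAnnotation"
    }
    # A domain object is the subject of ann:annotation pointing at one of them.
    return [
        (s, o) for s, p, o in triples
        if p == "ann:annotation" and o in range_annots
    ]


pairs = annotation_pairs(TRIPLES)
```

Each returned pair carries the two halves described above: the second element locates the region, while the first determines which UIMA annotations are produced.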

The ontology projection

Once a UIMA type system has been chosen, it is necessary to project some ontology classes into types in the type system.

Let C be a class in the ontology; T(C) denotes the associated type in the type system.

An object domainObj such that domainObj rdf:type C is transformed into a feature structure of type T(C).

Given two classes A and B, if A rdfs:subClassOf B, then the system requires that T(A) extends or is equal to T(B).

The above requirement guarantees that a domain object won't lose its identity during the export process, since a feature structure cannot have two unrelated types.

If A owl:equivalentClass B, then T(A) is equal to T(B): equivalence implies that each class is a subclass of the other, so T(A) and T(B) would each have to extend or equal the other, and since there cannot be cycles in the type system, they must be the same type.

In the context of the projection of a class into a type, the user can define initializers for the features of that type. An initializer can use either an immediate value or the values of a property.
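The subclass requirement can be checked mechanically. The following Python sketch assumes a single-inheritance type system encoded as a child-to-parent dictionary and a projection T encoded as a class-to-type dictionary; all class and type names are hypothetical, and the function names are not part of UIMAST.

```python
def extends_or_equals(type_parent, sub, sup):
    """True if type `sub` equals `sup` or inherits from it."""
    current = sub
    while current is not None:
        if current == sup:
            return True
        current = type_parent.get(current)  # None once the root is passed
    return False


def violations(subclass_axioms, projection, type_parent):
    """List the axioms A rdfs:subClassOf B whose projected types break the rule."""
    return [
        (a, b)
        for a, b in subclass_axioms
        if a in projection and b in projection
        and not extends_or_equals(type_parent, projection[a], projection[b])
    ]


# Hypothetical type system: CityAnnot -> PlaceAnnot -> BaseAnnot <- PersonAnnot
TYPE_PARENT = {
    "CityAnnot": "PlaceAnnot",
    "PlaceAnnot": "BaseAnnot",
    "PersonAnnot": "BaseAnnot",
}
AXIOMS = [("ex:City", "ex:Place")]

good = violations(AXIOMS, {"ex:City": "CityAnnot", "ex:Place": "PlaceAnnot"}, TYPE_PARENT)
bad = violations(AXIOMS, {"ex:City": "PersonAnnot", "ex:Place": "PlaceAnnot"}, TYPE_PARENT)
```

Here `good` is empty, while `bad` reports the axiom because PersonAnnot neither equals nor extends PlaceAnnot. Checking only the directly asserted axioms is enough, since "extends or is equal to" is transitive.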

The ontology projection is described in an XML file.

The projection process

The following explanation describes the process from a theoretical point of view and may differ from the actual implementation.

For every annotated page P, the system considers every pair <domain object, range annotation> = <domainObj, annot> such that annot is located on that page.

Let CC be the set of the projected classes C such that one can prove that domainObj rdf:type C.

Let M be the set of the minimal elements of CC (with equivalent classes collapsed into one) with respect to the partial order relation among classes induced by the ontology property rdfs:subClassOf.

For each class C belonging to M the system creates a feature structure fs of type T(C); then it initializes fs with the initializers associated with the classes D belonging to CC such that one can prove that C rdfs:subClassOf D.

The classes D must be considered from the most general to the most specific to guarantee a proper order of initialization for those features which are assigned in several contexts.

The procedure is well-defined: if C rdfs:subClassOf D, then it is required that T(C) extends or is equal to T(D), so every feature defined for T(D) is also defined for T(C).
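The procedure can be sketched in Python as follows. Like the explanation above, this is a theoretical illustration rather than the actual implementation. It assumes that subclass axioms are given as a dictionary of direct superclasses, that feature structures are plain dictionaries, and that initializers carry immediate values only (property-based initializers are left out). All class, type and feature names are hypothetical.

```python
def superclasses(cls, direct_super):
    """All classes D with cls rdfs:subClassOf D provable, including cls itself."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in direct_super.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def project(asserted_types, direct_super, projection, initializers):
    """Return the feature structures (as dicts) generated for one domain object."""
    def sup(c):
        return superclasses(c, direct_super)

    # CC: projected classes the object provably belongs to.
    inferred = set().union(*(sup(c) for c in asserted_types))
    cc = sorted(c for c in inferred if c in projection)

    # M: minimal elements of CC, keeping one representative per equivalence class.
    minimal = []
    for c in cc:
        strictly_below = any(c in sup(d) and d not in sup(c) for d in cc)
        equivalent_kept = any(c in sup(m) and m in sup(c) for m in minimal)
        if not strictly_below and not equivalent_kept:
            minimal.append(c)

    result = []
    for c in minimal:
        # Superclasses of C in CC, most general first: a strictly more specific
        # class always has strictly more superclasses.
        chain = sorted((d for d in cc if d in sup(c)),
                       key=lambda d: (len(sup(d)), d))
        features = {}
        for d in chain:
            features.update(initializers.get(d, {}))  # later (specific) wins
        result.append({"type": projection[c], "features": features})
    return result


DIRECT_SUPER = {"ex:Capital": ["ex:City"], "ex:City": ["ex:Place"]}
PROJECTION = {"ex:City": "CityAnnot", "ex:Place": "PlaceAnnot"}
INITIALIZERS = {
    "ex:Place": {"kind": "place", "source": "ontology"},
    "ex:City": {"kind": "city"},
}

structures = project({"ex:Capital"}, DIRECT_SUPER, PROJECTION, INITIALIZERS)
```

In this example ex:Capital is not projected, so CC contains ex:City and ex:Place, and M contains only ex:City. The single feature structure has type CityAnnot; its `kind` feature is first set by the ex:Place initializer and then overridden by the more specific ex:City one, while `source` keeps the inherited value.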