Metadata Registry

The content of the metadata registry conforms to an application profile of DCAT that combines notable (mostly metadata) vocabularies such as VoID, Dublin Core Metadata Terms, LIME and FOAF.

The Model

The discussion of the model below assumes the following prefix declarations:


@prefix : <...> . # the default namespace of the catalog

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lime: <http://www.w3.org/ns/lemon/lime#> .
@prefix mdr: <http://semanticturkey.uniroma2.it/ns/mdr#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix stmdr: <http://semanticturkey.uniroma2.it/ns/stmdr#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

The namespace of the metadata registry ontology was http://semanticturkey.uniroma2.it/ns/mdreg# prior to Semantic Turkey 8.0. An update routine will take care of applying the new namespace to existing catalogs when upgrading from a previous version.

The metadata registry should contain exactly one dcat:Catalog, as follows:


:catalog a dcat:Catalog .

The addition of a dataset to the catalog requires a dcat:CatalogRecord that references the dataset via the foaf:primaryTopic property and provides metadata about its record in the catalog (e.g., the date the dataset was cataloged).


:catalog a dcat:Catalog;
  dcat:record :agrovoc_record .

:agrovoc_record a dcat:CatalogRecord;
  dcterms:issued "2019-02-11T16:04:47.369+01:00"^^xsd:dateTime;
  foaf:primaryTopic :agrovoc_dataset .

:agrovoc_dataset a dcat:Dataset ;
  …

A concrete dataset is associated with a set of triples (a void:Dataset) representing its intellectual expression. These are further modeled as instances of mdr:RDFRepository, which in turn is a subclass of dcat:Distribution. This class is actually an abstract class for different RDF storage technologies, which can provide different means of access to the same triples (whose availability is bound together, since they are provided by the same installation). The MDR ontology defines a concrete subclass, mdr:HTTPSPARQLProvider, for deployments that provide access via the SPARQL HTTP protocol. The STMDR ontology defines another concrete subclass, stmdr:Project, for datasets hosted by a (local) Semantic Turkey project.


  :agrovoc_lod_dataset a dcat:Dataset ;
    dcat:distribution :agrovoc_lod_distribution .

  :agrovoc_lod_distribution a mdr:HTTPSPARQLProvider ;
    void:sparqlEndpoint <https://agrovoc.fao.org/sparql>
    .
  
  :agrovoc_master_dataset a dcat:Dataset ;
    dcat:distribution :agrovoc_master_distribution .
    
  :agrovoc_master_distribution a stmdr:Project ;
    foaf:name "AGROVOC MASTER" .

The description of a dataset may contain various metadata, but it should contain at least the namespace (void:uriSpace) and other metadata describing the different means of access to the data. While the dereferenceability of HTTP URIs is a "reasonable assumption" according to the specifications of VoID, the metadata registry uses the property mdr:dereferenciationSystem to represent this fact explicitly (mdr:standardDereferenciation) or to state the absence of this means of access (mdr:noStandardDereferenciation). The knowledge model (dcterms:conformsTo) is instead associated with the distribution, as the intellectual content of a dataset can have different expressions. This is particularly true for the lexicalizations, which will be discussed later.


:agrovoc_lod_dataset a dcat:Dataset ;
  void:uriSpace "http://aims.fao.org/aos/agrovoc/" ;
  mdr:dereferenciationSystem mdr:standardDereferenciation ;
  dcat:distribution :agrovoc_lod_distribution .

:agrovoc_lod_distribution a mdr:HTTPSPARQLProvider ;
  dcterms:conformsTo <http://www.w3.org/2004/02/skos/core>;
  void:sparqlEndpoint <https://agrovoc.fao.org/sparql>
  .

The namespace is used to find the (unique) dataset that defines URIs with a given namespace. This is useful when someone has a URI, and wants to identify the dataset it belongs to. The knowledge model is required to differentiate, for example, between an ontology and a thesaurus, thus allowing the selection of the appropriate strategy to browse the semantic content (e.g. class/property hierarchy vs concept hierarchy). The SPARQL endpoint of a dataset may provide a richer alternative to HTTP resolution as an access mechanism.
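The namespace-based lookup described above can be sketched in plain Java as follows. This is a minimal illustration and not the registry's actual implementation: the class NamespaceIndex and its methods are hypothetical, and the second registered namespace is made up for the example. The sketch maps each declared void:uriSpace to a dataset identifier and resolves a URI to the dataset whose namespace is its longest matching prefix.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical helper, not part of the Semantic Turkey API.
final class NamespaceIndex {
    // Maps a namespace (void:uriSpace) to the identifier of the dataset declaring it.
    private final Map<String, String> datasetsByNamespace = new TreeMap<>();

    void register(String uriSpace, String datasetId) {
        datasetsByNamespace.put(uriSpace, datasetId);
    }

    // Returns the dataset whose namespace is the longest prefix of the given URI, if any.
    Optional<String> findDatasetForUri(String uri) {
        String bestNamespace = null;
        for (String ns : datasetsByNamespace.keySet()) {
            if (uri.startsWith(ns) && (bestNamespace == null || ns.length() > bestNamespace.length())) {
                bestNamespace = ns;
            }
        }
        return Optional.ofNullable(bestNamespace).map(datasetsByNamespace::get);
    }

    public static void main(String[] args) {
        NamespaceIndex index = new NamespaceIndex();
        index.register("http://aims.fao.org/aos/agrovoc/", "agrovoc_lod_dataset");
        index.register("http://example.org/other/", "other_dataset"); // made-up entry
        // An example URI inside the AGROVOC namespace, then one that no dataset declares.
        System.out.println(index.findDatasetForUri("http://aims.fao.org/aos/agrovoc/c_12332"));
        System.out.println(index.findDatasetForUri("http://unknown.example/x"));
    }
}
```

A linear scan is enough for a small catalog; a larger one would probably call for a prefix tree, but that choice is outside the scope of this sketch.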

A SPARQL endpoint may be further described to provide metadata about it. Currently, the model offers the property mdr:sparqlEndpointLimitation, which holds a resource describing a limitation of the given endpoint as perceived by an application using the MDR; the application is responsible for providing the concrete modeling terms. The sole value defined so far is stmdr:noAggregation, which is defined operationally as the inability of the endpoint to support aggregation to the extent required by services of Semantic Turkey (e.g. the resource view).
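One way an application could react to such a limitation is sketched below. This is only an illustration under stated assumptions: the class CountQueryPlanner, the enum EndpointLimitation and the query shapes are hypothetical and do not reflect how Semantic Turkey actually builds its queries. The idea is to use a SPARQL aggregate when the endpoint supports it, and otherwise to fall back to a plain SELECT whose rows are counted by the client.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; not the Semantic Turkey implementation.
final class CountQueryPlanner {
    // Mirrors the value stmdr:noAggregation of mdr:sparqlEndpointLimitation.
    enum EndpointLimitation { NO_AGGREGATION }

    // Builds a query for counting the instances of a class.
    static String buildCountQuery(String classIri, Set<EndpointLimitation> limitations) {
        if (limitations.contains(EndpointLimitation.NO_AGGREGATION)) {
            // No aggregates: retrieve the distinct instances and count the rows on the client.
            return "SELECT DISTINCT ?s WHERE { ?s a <" + classIri + "> }";
        }
        return "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s a <" + classIri + "> }";
    }

    public static void main(String[] args) {
        String skosConcept = "http://www.w3.org/2004/02/skos/core#Concept";
        System.out.println(buildCountQuery(skosConcept, EnumSet.noneOf(EndpointLimitation.class)));
        System.out.println(buildCountQuery(skosConcept, EnumSet.of(EndpointLimitation.NO_AGGREGATION)));
    }
}
```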

The distribution of a dataset, which is a void:Dataset, may define subsets (void:subset) when parts of the dataset deserve specific metadata. As an example, a dataset may define various lexicalization sets as subsets.


:agrovoc_lod_distribution a mdr:HTTPSPARQLProvider ;
  void:subset :agrovoc_lod_distribution_it_lexicalization_set .

:agrovoc_lod_distribution_it_lexicalization_set a lime:LexicalizationSet;
  dcterms:language <http://id.loc.gov/vocabulary/iso639-1/it>, <http://lexvo.org/id/iso639-3/ita>;
  lime:avgNumOfLexicalizations 0.7686822;
  lime:language "it"^^xsd:language ;
  lime:lexicalizationModel <http://www.w3.org/2008/05/skos-xl>;
  lime:lexicalizations 32443;
  lime:percentage 0.59477323;
  lime:referenceDataset :agrovoc_lod_distribution;
  lime:references 25103 .

The description of the lexicalization set reports the natural language and the lexicalization model (in the example, SKOS-XL) that it conforms to. This information is important to instruct a dataset matching system on how to consume the available lexical material. In addition to this general metadata, a number of metrics can be used to evaluate the coverage and expressiveness of the lexicalization set, and thus its relevance as a source of evidence in a matching scenario. We refer to the LIME specifications and to the section on the formal definition of properties for further information on the metrics.
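To make two of these metrics more concrete, the following sketch reproduces the arithmetic behind lime:percentage and lime:avgNumOfLexicalizations, following their definitions in LIME as ratios over the entities of the reference dataset. The helper class is hypothetical. Note that the total number of entities (42206) is an assumption: it is not stated in the example above and has merely been inferred from the ratios reported there.

```java
// Hypothetical helper illustrating the arithmetic behind two LIME metrics.
final class LexicalizationMetrics {
    // Fraction of the entities of the reference dataset that have at least one lexicalization.
    static double percentage(long references, long entities) {
        return (double) references / entities;
    }

    // Average number of lexicalizations per entity of the reference dataset.
    static double avgNumOfLexicalizations(long lexicalizations, long entities) {
        return (double) lexicalizations / entities;
    }

    public static void main(String[] args) {
        long references = 25103;      // lime:references in the example above
        long lexicalizations = 32443; // lime:lexicalizations in the example above
        long entities = 42206;        // ASSUMPTION: not stated in the example, inferred from its ratios
        System.out.printf("percentage = %.4f%n", percentage(references, entities));
        System.out.printf("avgNumOfLexicalizations = %.4f%n",
                avgNumOfLexicalizations(lexicalizations, entities));
    }
}
```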

Other potential subsets of a dataset are void:Linksets describing the axioms connecting the dataset to other datasets.


:agrovoc_lod_distribution a mdr:HTTPSPARQLProvider ;
  void:subset :agrovoc_lod_distribution_it_lexicalization_set, :agrovoc_dbpedia .
      
:agrovoc_dbpedia a void:Linkset ;
  void:linkPredicate skos:exactMatch ;
  void:objectsTarget <http://dbpedia.org/void/Dataset> ;
  void:subjectsTarget <http://aims.fao.org/aos/agrovoc/void.ttl#Agrovoc> ;
  void:triples 11057 .

The description of a language resource such as Open Multilingual Wordnet has some peculiarities that are not found in other generic datasets. Firstly, it contains the following subsets:

One or more ontolex:ConceptSet describing the set of synsets (possibly shared by multiple languages)
One lime:Lexicon for each collection of words in some language
One lime:ConceptualizationSet describing the binding of the words in a given lexicon to the synsets in the concept set.

The description of the concept set should provide the number of synsets (lexical concepts):


<http://art.uniroma2.it/pmki/omw/pwn30-conceptset> a ontolex:ConceptSet;
  lime:concepts 117659 .

As with the lexicalization set, the description of a lexicon states the natural language in which the lexicon is expressed and the number of words (lexical entries):


  <http://art.uniroma2.it/pmki/omw/Princeton_WordNet-en-lexicon> a
    lime:Lexicon;
    dcterms:language <http://id.loc.gov/vocabulary/iso639-1/en>, <http://lexvo.org/id/iso639-3/eng> ;
    lime:language "en" ;
    lime:lexicalEntries 156584 .

Finally, the conceptualization set tells which lexicon and concept set it refers to, and provides useful metrics about the conceptualization itself:


:Princeton_WordNet-en-lexicon_pwn30-conceptset_conceptualization_set a lime:ConceptualizationSet;
  lime:avgAmbiguity 1.322;
  lime:avgSynonymy 1.76;
  lime:concepts 117659;
  lime:conceptualDataset <http://art.uniroma2.it/pmki/omw/pwn30-conceptset>;
  lime:conceptualizations 206978;
  lime:lexicalEntries 156584;
  lime:lexiconDataset <http://art.uniroma2.it/pmki/omw/Princeton_WordNet-en-lexicon> .
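The two averages in the example above can be derived from the reported counts. The sketch below, with a hypothetical helper class, follows the LIME definitions: the average ambiguity is the number of conceptualizations per lexical entry, and the average synonymy is the number of conceptualizations per concept. With the counts from the example, the results round to the values shown there.

```java
// Hypothetical helper illustrating how the conceptualization metrics relate to the counts.
final class ConceptualizationMetrics {
    // Average number of lexical concepts evoked by each lexical entry.
    static double avgAmbiguity(long conceptualizations, long lexicalEntries) {
        return (double) conceptualizations / lexicalEntries;
    }

    // Average number of lexical entries evoking each lexical concept.
    static double avgSynonymy(long conceptualizations, long concepts) {
        return (double) conceptualizations / concepts;
    }

    public static void main(String[] args) {
        long conceptualizations = 206978; // lime:conceptualizations
        long lexicalEntries = 156584;     // lime:lexicalEntries
        long concepts = 117659;           // lime:concepts
        System.out.printf("avgAmbiguity = %.3f%n", avgAmbiguity(conceptualizations, lexicalEntries));
        System.out.printf("avgSynonymy = %.2f%n", avgSynonymy(conceptualizations, concepts));
    }
}
```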

It is noteworthy that certain elements that have been described as subsets (e.g. lexicalization sets or conceptualization sets) can actually be described as root datasets, when they are distributed separately.

Many datasets, such as AGROVOC in the examples above, may change over time, with discrete snapshots of them taken as different versions. Consequently, we should be able to differentiate between a dataset in general, irrespective of specific versions, and its specific (immutable) versions. Furthermore, we should be able to refer to the master development copy of the dataset and to the (ever-changing, most current) data published at the dataset's namespace following the LOD best practices. To that end, the MDR models the dataset per se, irrespective of its specific content, as an abstract dataset (i.e., a dataset without distributions), which is in turn linked to different concrete datasets that play different roles, such as versions (for which we rely on the DCAT-3 model for versioning), LOD dataset (mdr:lod), and master copy (mdr:master).


:agrovoc_dataset a dcat:Dataset ;
  dcat:hasVersion :agrovoc_v1_dataset ;
  dcat:hasVersion :agrovoc_v2_dataset ;
  mdr:lod :agrovoc_lod_dataset ;
  mdr:master :agrovoc_master_dataset .

The description of a concrete dataset that is a version of a given dataset has to contain a version number, expressed with the owl:versionInfo property. Additionally, the different versions can be linked together following the chaining model, using the properties dcat:hasNextVersion, dcat:hasPreviousVersion, and dcat:hasCurrentVersion.

Example of Metadata

EuroVoc, TESEO, Open Multilingual WordNet

The Metadata Persistence

The metadata registry managing a dataset catalog inside Semantic Turkey aggregates metadata from two sources:

${ST_DATA_DIR}/metadataRegistry/catalog.ttl (where ${ST_DATA_DIR} denotes the data directory of Semantic Turkey): metadata about remote datasets
project-scoped settings of the settings manager it.uniroma2.art.semanticturkey.settings.metadata.ProjectMetadataStore: metadata about a locally managed project
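As a small illustration of the first location, the following sketch resolves the path of the catalog file from a given data directory. The helper class is hypothetical (the registry resolves this location internally), and passing the data directory as a command-line argument is an assumption made only for this example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; the registry resolves this location internally.
final class CatalogLocation {
    // Returns <dataDir>/metadataRegistry/catalog.ttl
    static Path catalogFile(Path dataDir) {
        return dataDir.resolve("metadataRegistry").resolve("catalog.ttl");
    }

    public static void main(String[] args) {
        // ASSUMPTION: the data directory is passed as the first argument; "." is a fallback for the demo.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : ".");
        Path catalog = catalogFile(dataDir);
        System.out.println(catalog + " exists: " + Files.exists(catalog));
    }
}
```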

Application Programming Interface

The metadata registry can be used at three different levels: as a library in any Java application, as a managed dependency within Semantic Turkey, or as a web service consumed by external applications.

The metadata registry has been factored into a dedicated subsystem since Semantic Turkey 8.0. This subsystem is further decomposed into a core, a bindings and a services module.

Metadata Registry as a library

Since Semantic Turkey 8.0, the core module of the metadata registry can be used as a library in any Java application.

The first step is to add a dependency on the needed artifact (where ${st.version} is a variable containing the version of Semantic Turkey).


<dependency>
	<groupId>it.uniroma2.art.semanticturkey</groupId>
	<artifactId>st-metadata-registry-core</artifactId>
	<version>${st.version}</version>
</dependency>

The following Java program shows how to instantiate and initialize the metadata registry, add a dataset (record) to the catalog and then shut down the registry.

	
import java.io.File;
import java.io.IOException;

import it.uniroma2.art.semanticturkey.mdr.core.*;
import it.uniroma2.art.semanticturkey.mdr.core.vocabulary.METADATAREGISTRY;
import org.eclipse.rdf4j.model.util.Values;
import org.eclipse.rdf4j.repository.RepositoryException;
import org.eclipse.rdf4j.rio.RDFParseException;

import it.uniroma2.art.maple.orchestration.MediationFramework;
import it.uniroma2.art.semanticturkey.mdr.core.impl.MetadataRegistryBackendImpl;

public class Main {
    public static void main(String[] args) throws RDFParseException, RepositoryException,
            MetadataRegistryCreationException, MetadataRegistryIntializationException,
            IllegalArgumentException, MetadataRegistryWritingException {
        File baseDir = new File(".");
        MediationFramework mediationFramework = null;
        MetadataRegistryBackend mdr = new MetadataRegistryBackendImpl(baseDir, mediationFramework);
        mdr.initialize();
        try {
            mdr.createConcreteDataset("agrovoc_lod",
                    "http://aims.fao.org/aos/agrovoc/",
                    Values.literal("AGROVOC"),
                    Values.literal("The AGROVOC thesaurus contains more than 41 000 concepts [..]", "en"),
                    true,
                    new Distribution(null,
                            /* or METADATAREGISTRY.HTTP_SPARQL_PROVIDER starting from the upcoming v. 14.0 release */
                            Values.iri(METADATAREGISTRY.NS, "HTTPSPARQLProvider"),
                            Values.iri("http://agrovoc.fao.org/sparql"), null),
                    null,
                    false
            );
        } finally {
            mdr.destroy();
        }
    }
}

The program above persists the content of the catalog inside the file metadataRegistry/catalog.ttl within the current working directory. The variable mediationFramework should contain a reference to a MAPLE mediation framework. In this simple example, it is set to null, as we are not interested in certain capabilities related to information discovery and analysis that require this dependency.

The class MetadataRegistryBackendImpl contains several protected methods that can be overridden to customize this implementation of the registry.

Metadata Registry as a managed dependency within Semantic Turkey

Semantic Turkey manages a single instance of the metadata registry (i.e. a singleton), which is published in the root Spring context to make it generally available within the system, as it is inherited by the contexts associated with each plugin. The general metadata registry described previously is bound to Semantic Turkey by i) providing a specific location for the persistence file within the Semantic Turkey data directory and ii) using the metadata stored in each project.

The metadata registry can be injected using Spring's inversion-of-control mechanism, using the type it.uniroma2.art.semanticturkey.mdr.bindings.STMetadataRegistryBackend:


/**
* This service class allows the management of the metadata about remote datasets.
*/
@STService
public class MetadataRegistry extends STServiceAdapter {

  private STMetadataRegistryBackend metadataRegistryBackend;

  @Autowired
  public void setMetadataRegistry(STMetadataRegistryBackend metadataRegistryBackend) {
    this.metadataRegistryBackend = metadataRegistryBackend;
  }

  ...

}

Prior to Semantic Turkey 8.0, the metadata registry was part of the core-framework and had to be injected using the type it.uniroma2.art.semanticturkey.resources.MetadataRegistryBackend.

Metadata Registry as a web service

External applications may interact with the metadata registry through the REST API defined by these service classes:

it.uniroma2.art.semanticturkey.mdr.services.MetadataRegistry (artifactId: st-metadata-registry-services)
it.uniroma2.art.semanticturkey.services.core.MAPLE (artifactId: st-core-services)

For instructions on how to interact with these services, please refer to the documentation on the Web API of Semantic Turkey.

Prior to Semantic Turkey 8.0, the metadata registry services were also part of the core-services bundle.

Related Software

LIME API: an API for working with metadata conforming to the LIME metadata module of OntoLex-Lemon. It also provides a profiler for the automatic generation of metadata by analyzing the actual data.