The first step was to transfer the ontology, provided in Web Ontology Language (OWL) format, into GraphDB VERTEX TYPES and EDGES. For this purpose we implemented a parser that reads the OWL file, converts it into a class model and can export that model as a GQL CREATE VERTEX TYPES statement.
The ontology currently contains 273 classes (DBPedia 3.6) and thousands of datatype properties and object properties. The following examples show its main structures:
OWL-Class:
<owl:Class rdf:about="http://dbpedia.org/ontology/Island">
    <rdfs:label xml:lang="en">island</rdfs:label>
    <rdfs:label xml:lang="el">νησί</rdfs:label>
    <rdfs:label xml:lang="fr">île</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:subClassOf>
</owl:Class>
- represents the class Island, a subclass of PopulatedPlace.
OWL-DatatypeProperty:
<owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
    <rdfs:label xml:lang="en">number of islands</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"></rdfs:range>
</owl:DatatypeProperty>
- describes the non-negative integer attribute “number of islands” for the class Island.
OWL-ObjectProperty:
<owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:label xml:lang="en">highest state</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:range>
</owl:ObjectProperty>
- represents the “highest state” relation from an Island to a PopulatedPlace.
The conversion creates:
- one VERTEX TYPE for each class,
- with one PROPERTY for each datatype property of that class
- and one EDGE for each object property of that class.
Within the data schema there is a large number of mutual dependencies between the vertex types. The CREATE VERTEX TYPES statement resolves all of them and creates a valid data schema.
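The mapping from OWL elements to a class model can be sketched as follows. This is a minimal Python illustration, not the parser from the project: the names VertexType, build_class_model and SAMPLE_OWL are hypothetical, the sample contains only the three elements shown above, and the GQL export step is left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, Optional

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF = "{" + NS["rdf"] + "}"  # ElementTree spells qualified attribute names this way


@dataclass
class VertexType:
    """Hypothetical class-model entry: one per owl:Class."""
    name: str
    parent: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)  # from datatype properties
    edges: Dict[str, str] = field(default_factory=dict)       # from object properties


def _resource(element, tag):
    """Return the rdf:resource of the first child with the given tag, if any."""
    child = element.find(tag, NS)
    return None if child is None else child.get(RDF + "resource")


def build_class_model(owl_xml):
    root = ET.fromstring(owl_xml)
    model = {}
    # One vertex type per class; the superclass becomes the parent type.
    for cls in root.findall("owl:Class", NS):
        uri = cls.get(RDF + "about")
        model[uri] = VertexType(uri, parent=_resource(cls, "rdfs:subClassOf"))
    # Properties are attached to the vertex type named in rdfs:domain.
    for tag, target in (("owl:DatatypeProperty", "properties"),
                        ("owl:ObjectProperty", "edges")):
        for prop in root.findall(tag, NS):
            domain = _resource(prop, "rdfs:domain")
            if domain in model:
                getattr(model[domain], target)[prop.get(RDF + "about")] = \
                    _resource(prop, "rdfs:range")
    return model


SAMPLE_OWL = """<rdf:RDF
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/Island">
    <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:Class>
  <owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"/>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:ObjectProperty>
</rdf:RDF>"""

if __name__ == "__main__":
    for vertex_type in build_class_model(SAMPLE_OWL).values():
        print(vertex_type)
```

Collecting all classes first and attaching properties afterwards means a property can refer to a class that appears later in the file, which is one simple way to cope with the mutual dependencies mentioned above.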
In addition to the ontology from the OWL file, we added some vertex types to fix problems we ran into and to extend the functionality slightly:
- First, the VERTEX TYPE Thing is not described in the ontology. It is the base class on which all other VERTEX TYPES are based.
- To reflect disambiguation, we created a VERTEX TYPE Instance with an EDGE to a SET of Thing. In case of a disambiguation, an Instance refers to all corresponding nodes in the GraphDB.
- Within the RDF files, labels are stored in dedicated triples. We also added a dedicated VERTEX TYPE for labels, to avoid a mix-up when one label refers to multiple Instances.
Currently, the GraphDB restricts the characters allowed in the names of VERTEX TYPES, their ATTRIBUTES and EDGES. OWL and RDF generally use URLs as identifiers, but GraphDB cannot handle colons, dots and slashes (both slash and backslash) in names. Our simple workaround was to keep the URL and remove all occurrences of these characters. This turns http://dbpedia.org/ontology/Island into httpdbpediaorgontologyIsland.
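The workaround amounts to a small string filter. The following Python sketch shows the idea; the function name sanitize_identifier is hypothetical and not taken from the project.

```python
# Characters named in the text: colon, dot, slash and backslash.
FORBIDDEN_CHARACTERS = ":./\\"

# Translation table that deletes every forbidden character.
_DELETE_FORBIDDEN = str.maketrans("", "", FORBIDDEN_CHARACTERS)


def sanitize_identifier(url: str) -> str:
    """Keep the URL but drop all characters GraphDB does not accept in names."""
    return url.translate(_DELETE_FORBIDDEN)


if __name__ == "__main__":
    print(sanitize_identifier("http://dbpedia.org/ontology/Island"))
    # prints: httpdbpediaorgontologyIsland
```

Note that deleting characters is not reversible, and two different URLs could in principle collapse into the same name, so a tool using this approach may want to check the generated names for duplicates.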
Another challenge is the type mapping between OWL and GraphDB. GraphDB supports the simple C# data types, whereas the DBPedia OWL uses 9 datatypes from XML Schema plus DBPedia-specific datatypes for area units, speed units, density units, time units, volume units, distance units and several others. We ended up with a large switch statement that does the mapping. All properties could be represented with C# data types without data loss.
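A table-driven version of such a mapping might look like the sketch below. The concrete pairs are assumptions chosen for illustration and are not the project's actual switch; in particular, the prefix used for DBPedia unit datatypes and the choice of Double for all of them are assumed.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
# Assumed prefix of the DBPedia-specific unit datatypes (area, speed, ...).
DBPEDIA_DATATYPE = "http://dbpedia.org/datatype/"

# Assumed mapping from XML Schema datatypes to C# type names.
XSD_TO_CSHARP = {
    XSD + "string": "String",
    XSD + "boolean": "Boolean",
    XSD + "integer": "Int64",
    XSD + "nonNegativeInteger": "UInt64",
    XSD + "double": "Double",
    XSD + "date": "DateTime",
}


def map_range_to_csharp(range_uri: str) -> str:
    """Return the C# type name assumed for an rdfs:range URI."""
    if range_uri in XSD_TO_CSHARP:
        return XSD_TO_CSHARP[range_uri]
    if range_uri.startswith(DBPEDIA_DATATYPE):
        # Assumption: unit values are plain numbers and fit into a Double.
        return "Double"
    # Fail loudly so that an unmapped type is never dropped silently.
    raise ValueError("no type mapping for " + range_uri)


if __name__ == "__main__":
    print(map_range_to_csharp(XSD + "nonNegativeInteger"))
```

Raising an error for unknown ranges, rather than falling back to a default type, makes gaps in the mapping visible as soon as a new ontology version introduces another datatype.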
Wikipedia is available in multiple languages. The DBPedia export is currently provided in 99 of them.
Later, during the next steps, we found that the data differs slightly between languages, since the articles are written by different authors. This is relevant for the data schema, because there are several options for handling these differences.
One option is to let the application logic of the data importer decide how to handle them. We decided instead to make the data schema language-specific and to provide separate, language-specific attributes. This enlarges the data schema somewhat but does not lead to any data loss. In addition, application logic can be implemented later to check the data quality for each node.
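One way to picture the language-specific schema is to expand every attribute once per language. The sketch below assumes that the language suffix is simply appended to the attribute name; the function name localize_attributes is hypothetical.

```python
from typing import Iterable, List


def localize_attributes(attribute_names: Iterable[str],
                        language_suffixes: Iterable[str]) -> List[str]:
    """Create one attribute name per (attribute, language) pair."""
    suffixes = list(language_suffixes)
    return [name + suffix for name in attribute_names for suffix in suffixes]


if __name__ == "__main__":
    print(localize_attributes(["numberOfIslands"], ["_en", "_de"]))
    # prints: ['numberOfIslands_en', 'numberOfIslands_de']
```

The number of attributes grows linearly with the number of languages, which matches the observation that the schema gets larger while every language-specific value keeps its own slot.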
The command-line tool “1_CreateGqlSchemaFromOntology”, part of the Visual Studio solution available on GitHub (https://github.com/sones/sones-dbpedia), creates the CREATE VERTEX TYPES statements as described above, based on the ontology of DBPedia 3.6. Later versions have not been tested yet.
The command-line executable has to be started with 2 parameters:
- the .owl filename (either an absolute path or a file located in the executable's directory),
- the result .gql filename, i.e. the file into which all queries are written.
At runtime, the user is asked for all languages that should be reflected in the schema. We suggest suffixes based on 2-letter language codes, such as “_en” or “_de”. Entering an empty string ends the prompt loop.
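The prompt loop described here can be sketched in a few lines of Python. The function name, the prompt text and the handling of duplicates are assumptions; the actual tool is a C# executable and may behave differently in detail.

```python
from typing import Callable, List


def read_language_suffixes(read_line: Callable[[str], str] = input) -> List[str]:
    """Ask for language suffixes until an empty string is entered."""
    suffixes: List[str] = []
    while True:
        entry = read_line("Language suffix (empty to finish): ").strip()
        if not entry:
            return suffixes
        if entry not in suffixes:  # assumption: ignore repeated entries
            suffixes.append(entry)


if __name__ == "__main__":
    print(read_language_suffixes())
```

Passing the reader as a parameter keeps the loop testable without a console, for example by feeding it a fixed list of answers.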
After execution, the resulting .gql file can easily be imported via the IMPORT GQL statement.
