GraphDBPedia — Creating the schema — From Ontology to VERTEX TYPES and EDGES

The first step was to transfer the ontology, provided in Web Ontology Language (OWL) format, into GraphDB VERTEX TYPES and EDGES. To do this, we implemented a parser that reads the OWL file, converts it into a class model, and exports that model as a GQL CREATE VERTEX TYPES statement.
The ontology currently contains 273 classes (DBPedia 3.6) and thousands of datatype properties and object properties. Its main structures are demonstrated below:

OWL-Class:

<owl:Class rdf:about="http://dbpedia.org/ontology/Island">
   <rdfs:label xml:lang="en">island</rdfs:label>
    <rdfs:label xml:lang="el">νησί</rdfs:label>
    <rdfs:label xml:lang="fr">île</rdfs:label>
    <rdfs:subClassOf
           rdf:resource="http://dbpedia.org/ontology/PopulatedPlace">
    </rdfs:subClassOf>
</owl:Class>

- represents an Island (a subclass of PopulatedPlace)

OWL-DatatypeProperty:

<owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
   <rdfs:label xml:lang="en">number of islands</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"></rdfs:range>
</owl:DatatypeProperty>

- describes the non-negative integer attribute “number of islands” for the class Island.

OWL-ObjectProperty:

 <owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:label xml:lang="en">highest state</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:range>
</owl:ObjectProperty>

- represents the highest “PopulatedPlace” on an island.

The conversion creates

  • a VERTEX TYPE for each class,
  • with multiple PROPERTIES, one for each datatype property,
  • and multiple EDGES, one for each object property.
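The mapping above can be sketched in a few lines. This is only an illustration in Python, not the parser from the project (which is part of a Visual Studio solution); the names `VertexType`, `build_class_model` and `SAMPLE` are hypothetical, and the sketch assumes that every property has exactly one `rdfs:domain` and one `rdfs:range`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, Optional

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


@dataclass
class VertexType:
    """Hypothetical class-model entry: one per owl:Class."""
    uri: str
    parent: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)  # property URI -> datatype URI
    edges: Dict[str, str] = field(default_factory=dict)       # property URI -> target class URI


def _resource(element, tag):
    """Return the rdf:resource of the first child with the given tag, if any."""
    child = element.find(tag, NS)
    return child.get(RDF_RESOURCE) if child is not None else None


def build_class_model(owl_xml: str) -> Dict[str, VertexType]:
    root = ET.fromstring(owl_xml)
    model = {}
    # One vertex type per class, remembering its base class.
    for cls in root.findall("owl:Class", NS):
        uri = cls.get(RDF_ABOUT)
        model[uri] = VertexType(uri, parent=_resource(cls, "rdfs:subClassOf"))
    # Datatype properties become properties, object properties become edges.
    for tag, target in (("owl:DatatypeProperty", "properties"),
                        ("owl:ObjectProperty", "edges")):
        for prop in root.findall(tag, NS):
            domain = _resource(prop, "rdfs:domain")
            rng = _resource(prop, "rdfs:range")
            # Assumption: properties without a known domain class are skipped.
            if domain in model and rng is not None:
                getattr(model[domain], target)[prop.get(RDF_ABOUT)] = rng
    return model


SAMPLE = """<rdf:RDF
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/Island">
    <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:Class>
  <owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"/>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:ObjectProperty>
</rdf:RDF>"""

if __name__ == "__main__":
    for vertex_type in build_class_model(SAMPLE).values():
        print(vertex_type)
```

Run against the three snippets shown earlier, the sketch yields one vertex type for Island with one property and one edge.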

Within the data schema, there is a large number of mutual dependencies between types. The CREATE VERTEX TYPES statement resolves all of them and creates a valid data schema.

In addition to the ontology from the OWL file, we added some vertex types to fix problems we ran into and to extend the functionality a little:

  • The VERTEX TYPE Thing was not described in the ontology, so we added it. It is the base class that all other VERTEX TYPES are based upon.
  • To reflect disambiguation, we created a VERTEX TYPE Instance with an EDGE to a SET of Thing. In case of a disambiguation, an Instance refers to the corresponding nodes in the GraphDB.
  • Within the RDF files, labels are stored in dedicated triples. We added a dedicated VERTEX TYPE for labels as well, to avoid a mix-up when one label refers to multiple Instances.

Currently, GraphDB restricts the characters allowed in the names of VERTEX TYPES, their ATTRIBUTES and EDGES. The OWL and RDF formats generally use URLs as identifiers, but GraphDB cannot handle colons, dots and slashes (both slash and backslash) in names. Our simple workaround was to keep the URL and remove all occurrences of these characters. This turns http://dbpedia.org/ontology/Island into httpdbpediaorgontologyIsland.
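As a minimal sketch of this workaround (written in Python for illustration; the function name `to_graphdb_name` is hypothetical and not taken from the project):

```python
# Characters named in the text: colon, dot, slash and backslash.
FORBIDDEN = ":./\\"
_STRIP_TABLE = str.maketrans("", "", FORBIDDEN)


def to_graphdb_name(uri: str) -> str:
    """Remove all forbidden characters from a URI, keeping everything else."""
    return uri.translate(_STRIP_TABLE)


if __name__ == "__main__":
    print(to_graphdb_name("http://dbpedia.org/ontology/Island"))
    # prints: httpdbpediaorgontologyIsland
```

Note that plain removal is not reversible, and two different URLs could in principle collapse to the same name (for example `a.b` and `ab`).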

Another challenge is the type mapping between OWL and GraphDB. GraphDB supports the C# simple data types, whereas the DBPedia OWL uses 9 datatypes from XML Schema plus DBPedia-specific datatypes for area units, speed units, density units, time units, volume units, distance units and several others. This led to a huge switch statement that does the mapping. All properties could be represented with C# data types without data loss.
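The shape of such a mapping can be sketched as a lookup table. The entries below are assumptions chosen for illustration, not the project's actual switch: the text does not list the 9 XML Schema datatypes or their C# targets, and treating every DBPedia unit datatype (assumed here to live under `http://dbpedia.org/datatype/`) as a `Double` is a simplification.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
DBPEDIA_DATATYPE_PREFIX = "http://dbpedia.org/datatype/"  # assumed namespace for unit datatypes

# Illustrative subset only; the target type names are assumptions.
XSD_TO_CSHARP = {
    XSD + "string": "String",
    XSD + "boolean": "Boolean",
    XSD + "integer": "Int64",
    XSD + "nonNegativeInteger": "UInt64",
    XSD + "double": "Double",
    XSD + "float": "Single",
    XSD + "date": "DateTime",
}


def map_datatype(range_uri: str) -> str:
    """Map the rdfs:range of a datatype property to a C# type name."""
    if range_uri in XSD_TO_CSHARP:
        return XSD_TO_CSHARP[range_uri]
    if range_uri.startswith(DBPEDIA_DATATYPE_PREFIX):
        # Assumption: unit values (area, speed, density, ...) are stored as numbers.
        return "Double"
    # Fail loudly instead of guessing, so no property is mapped silently.
    raise ValueError("no mapping for datatype: " + range_uri)


if __name__ == "__main__":
    print(map_datatype(XSD + "nonNegativeInteger"))
```

Raising an error for unknown datatypes makes gaps in the table visible when a newer ontology version introduces additional types.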

Wikipedia is available in multiple languages, and the DBPedia export is currently provided in 99 of them.
Later, during the next steps, we found that the data sometimes differs slightly between languages, since the articles have different authors. This is relevant for the data schema, because there are several options for handling these differences.

One option is to let the application logic of the data importer decide how to handle them. We decided instead to make the data schema language-specific and provide separate attributes for each language. This enlarges the data schema a little, but does not lead to any data loss. In addition, application logic can be implemented later to check the data quality of each node.
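One way to picture the language-specific schema is that every attribute is repeated once per language. The sketch below assumes the language is encoded as a suffix on the attribute name; the function name `localize_attributes` is hypothetical.

```python
def localize_attributes(attribute_names, language_suffixes):
    """Return one attribute name per (attribute, language) combination."""
    return [name + suffix
            for name in attribute_names
            for suffix in language_suffixes]


if __name__ == "__main__":
    print(localize_attributes(["numberOfIslands"], ["_en", "_de"]))
    # prints: ['numberOfIslands_en', 'numberOfIslands_de']
```

The schema grows linearly with the number of languages, which is the price for keeping every language's value.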

The command-line tool “1_CreateGqlSchemaFromOntology”, part of the Visual Studio solution available on GitHub (https://github.com/sones/sones-dbpedia), creates the CREATE VERTEX TYPES statements as described above, based on the ontology of DBPedia 3.6. Later versions have not been tested yet.
The command-line executable has to be started with 2 parameters:

  • .owl filename: the path of the ontology file, which has to be either an absolute path or a file located in the executable's directory.
  • result .gql file: the name of the file into which all queries will be written.

At runtime, the user is prompted for all languages that should be reflected in the schema. Our suggestion is to use suffixes built from 2-letter language codes, like “_en” or “_de”. An empty string ends the input loop.
After the execution, the resulting .gql file can easily be imported via the IMPORT GQL statement.
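The language prompt described above follows a simple pattern: collect entries until an empty one arrives. A minimal sketch, assuming the entries arrive line by line (the function name `read_languages` is hypothetical, and skipping duplicates is an addition not mentioned in the text):

```python
def read_languages(lines):
    """Collect language suffixes until the first empty entry."""
    languages = []
    for line in lines:
        suffix = line.strip()
        if not suffix:
            break  # an empty string ends the loop
        if suffix not in languages:  # assumption: duplicates are ignored
            languages.append(suffix)
    return languages


if __name__ == "__main__":
    import sys
    print(read_languages(sys.stdin))
```

Passing any iterable of strings, such as a list in a test or `sys.stdin` at the console, keeps the loop independent of how the input is obtained.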
