
The first step was to transfer the ontology, provided in Web Ontology Language (OWL) format, into GraphDB VERTEX TYPES and EDGES. To do this, we implemented a parser that reads the OWL file, converts it into a class model and can export that model as a GQL CREATE VERTEX TYPES statement.
The ontology currently contains 273 classes (DBpedia 3.6) and thousands of datatype properties and object properties. The following excerpts show its main structures:

OWL-Class:

<owl:Class rdf:about="http://dbpedia.org/ontology/Island">
    <rdfs:label xml:lang="en">island</rdfs:label>
    <rdfs:label xml:lang="el">νησί</rdfs:label>
    <rdfs:label xml:lang="fr">île</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:subClassOf>
</owl:Class>

- rep­re­sents an Island (based on a PopulatedPlace)

OWL-DatatypeProperty:

<owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
    <rdfs:label xml:lang="en">number of islands</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"></rdfs:range>
</owl:DatatypeProperty>

- describes the non-negative Inte­ger attribute “num­ber of islands” for the class Island.

OWL-ObjectProperty:

<owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:label xml:lang="en">highest state</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:range>
</owl:ObjectProperty>

- represents the highest “PopulatedPlace” on an Island.

The conversion creates

  • a VERTEX TYPE for each class,
  • with multiple PROPERTIES, derived from the datatype properties,
  • and multiple EDGES, derived from the object properties.
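
The mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of our C# tool: the names VertexType and build_class_model are made up for this example, and the sample document is assembled from the excerpts shown above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, Optional

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Sample input built from the three excerpts in this article.
SAMPLE_OWL = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:about="http://dbpedia.org/ontology/Island">
    <rdfs:label xml:lang="en">island</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:subClassOf>
  </owl:Class>
  <owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"></rdfs:range>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:range>
  </owl:ObjectProperty>
</rdf:RDF>
"""


@dataclass
class VertexType:
    """Hypothetical class-model entry: one per owl:Class."""
    uri: str
    parent: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)  # property URI -> datatype URI
    edges: Dict[str, str] = field(default_factory=dict)       # property URI -> target class URI


def _resource(element, tag):
    """Return the rdf:resource of the rdfs:<tag> child, or None if absent."""
    child = element.find(f"{{{RDFS}}}{tag}")
    return None if child is None else child.get(f"{{{RDF}}}resource")


def build_class_model(owl_xml: str) -> Dict[str, VertexType]:
    root = ET.fromstring(owl_xml)
    model: Dict[str, VertexType] = {}
    # One vertex type per class, remembering its base class.
    for cls in root.iter(f"{{{OWL}}}Class"):
        uri = cls.get(f"{{{RDF}}}about")
        model[uri] = VertexType(uri, parent=_resource(cls, "subClassOf"))
    # Datatype properties become properties, object properties become edges.
    for tag, target in (("DatatypeProperty", "properties"), ("ObjectProperty", "edges")):
        for prop in root.iter(f"{{{OWL}}}{tag}"):
            domain = _resource(prop, "domain")
            if domain in model:
                getattr(model[domain], target)[prop.get(f"{{{RDF}}}about")] = _resource(prop, "range")
    return model


model = build_class_model(SAMPLE_OWL)
island = model["http://dbpedia.org/ontology/Island"]
print(island.parent)
print(island.properties)
print(island.edges)
```

The sketch stops at the class model on purpose. Turning the model into GQL text is a separate export step, and properties whose domain is not a known class are simply skipped here.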

The data schema contains a large number of mutual dependencies between types. The CREATE VERTEX TYPES statement resolves all of them and creates a valid data schema.

In addition to the ontology from the OWL file, we added some vertex types to fix a few problems we ran into and to extend the functionality a little:

  • First, the VERTEX TYPE Thing was not described in the ontology. It is the base class that all other VERTEX TYPES are based upon.
  • To reflect disambiguation, we created a VERTEX TYPE Instance with an EDGE to a SET of Thing. In case of a disambiguation, an Instance refers to the corresponding NODEs in the GraphDB.
  • Within the RDF files, labels are saved in dedicated triples. We also added a dedicated VERTEX TYPE for them, to avoid a mix-up in case one label refers to multiple Instances.

Currently, GraphDB has some limitations regarding the characters allowed in the names of VERTEX TYPES, their ATTRIBUTES and EDGES. The OWL and RDF formats generally use URLs as identifiers, but GraphDB cannot work with colons, dots and slashes (both forward slash and backslash). Our simple workaround was to keep the URL and remove all occurrences of these characters. This turns http://dbpedia.org/ontology/Island into httpdbpediaorgontologyIsland.
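
As a small illustration, the workaround can be written as follows in Python. The function name to_graphdb_name is made up for this sketch; our tool itself is written in C#.

```python
# Characters named in the article as problematic: colon, dot, slash, backslash.
FORBIDDEN = ":./\\"

# Translation table that deletes every forbidden character.
_DELETE_FORBIDDEN = str.maketrans("", "", FORBIDDEN)


def to_graphdb_name(uri: str) -> str:
    """Keep the URL text but drop all forbidden characters."""
    return uri.translate(_DELETE_FORBIDDEN)


print(to_graphdb_name("http://dbpedia.org/ontology/Island"))
# prints: httpdbpediaorgontologyIsland
```

Note that simply deleting characters is not reversible, and two different URLs could in principle end up with the same name.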

Another challenge is the type mapping between OWL and GraphDB. GraphDB supports the simple C# data types, whereas the DBpedia OWL uses a list of 9 datatypes from XML Schema plus DBpedia-specific area units, speed units, density units, time units, volume units, distance units and several others. This led us to a huge switch statement that does the mapping. All properties could be represented with C# data types without data loss.
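
To give an idea of what such a mapping looks like, here is a heavily shortened Python sketch. The concrete target types are assumptions chosen for illustration and are not copied from our switch statement. The same goes for the http://dbpedia.org/datatype/ prefix and for the rule that every unit type becomes a Double.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
# Assumed namespace of the DBpedia unit datatypes (area, speed, density, ...).
DBPEDIA_DATATYPE = "http://dbpedia.org/datatype/"

# Illustrative subset only; the target type names are assumptions.
XSD_TO_CSHARP = {
    "string": "String",
    "boolean": "Boolean",
    "integer": "Int64",
    "nonNegativeInteger": "UInt64",
    "double": "Double",
    "date": "DateTime",
}


def map_range_to_csharp(range_uri: str) -> str:
    """Map the rdfs:range of a datatype property to a C# type name."""
    if range_uri.startswith(XSD):
        local_name = range_uri[len(XSD):]
        if local_name in XSD_TO_CSHARP:
            return XSD_TO_CSHARP[local_name]
    elif range_uri.startswith(DBPEDIA_DATATYPE):
        # Assumption: unit values are plain numbers, the unit is implied by the type.
        return "Double"
    # Fail loudly so that no datatype is mapped silently and lossily.
    raise ValueError(f"no mapping defined for {range_uri}")


print(map_range_to_csharp(XSD + "nonNegativeInteger"))
# prints: UInt64
```

Raising an error for unknown datatypes, instead of falling back to a default, makes sure every new datatype gets an explicit decision.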

Wikipedia is available in multiple languages, and the DBpedia export is currently provided in 99 of them.
Later, during the next steps, we found out that the data sometimes differs slightly between languages, since the articles have different authors. This is relevant for the data schema, because there are several options for handling these differences.

One option is to let the application logic of the data importer decide how to handle them. We decided instead to make the data schema language-specific and to provide separate, language-specific attributes. This enlarges the data schema a little, but does not lead to any data loss. In addition, application logic can be implemented later on to check the data quality of each node.
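
A minimal Python sketch of this idea: every attribute is multiplied by the list of languages. The naming scheme (attribute name plus a suffix such as “_en”) and the function name are assumptions made for this example.

```python
from typing import Iterable, List


def language_specific_attributes(attribute_names: Iterable[str],
                                 language_suffixes: Iterable[str]) -> List[str]:
    """Return one attribute name per (attribute, language) combination."""
    suffixes = list(language_suffixes)
    return [name + suffix for name in attribute_names for suffix in suffixes]


print(language_specific_attributes(["numberOfIslands"], ["_en", "_de"]))
# prints: ['numberOfIslands_en', 'numberOfIslands_de']
```

With n attributes and m languages the schema grows to n * m attributes, which is the price for keeping every language version of a value.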

The command-line tool “1_CreateGqlSchemaFromOntology”, part of the Visual Studio solution available on GitHub (https://github.com/sones/sones-dbpedia), creates the CREATE VERTEX TYPES statements as described above, based on the ontology of DBpedia 3.6. Later versions have not yet been tested.
The command-line executable has to be started with 2 parameters:

  • .owl filename (either an absolute path or a file located in the executable’s directory)
  • result .gql file: the name of the file into which all queries will be written.

At runtime, the user is asked for all languages that have to be reflected in the schema. We suggest using 2-letter language codes as suffixes, like “_en” or “_de”. An empty string ends the iteration.
After execution, the resulting .gql file can easily be imported via an IMPORT GQL statement.

DBpedia is already stored in a machine-readable format (RDF). We started a proof of concept to show that GraphDB can meet these requirements too, and to find out the differences, advantages and disadvantages of the two concepts.
In RDF, the data model stands next to the data. In sones GraphDB there is a close connection between each object (node) and its (vertex) type. For example, the node “Homer Simpson” knows that it is a “FictionalCharacter”.
Our expectation was that GraphDB would require less hard-disk space and also offer a better data store, since all information about an object is saved in a single node instead of several triple data files. In addition, any relationship between two objects (e.g. a person and their birthplace) is saved directly on that object, so when a node is loaded, all its information is available from a single location.
During the project we discovered several problems that can be solved with this idea. The resulting data network lets customers discover complex relationships between any nodes using graph algorithms. Words can be disambiguated using the schema information (e.g. Tuareg can be either nomads living in the Sahara or a vehicle built by a German car manufacturer).

Our first contact with DBpedia was back in May 2010. A prospect asked us whether GraphDB is the best way to reflect the data schema and import all data. After getting a first impression from the DBpedia website:

from www.dbpedia.org/About:

“DBpe­dia is a com­mu­nity effort to extract struc­tured infor­ma­tion from Wikipedia and to make this infor­ma­tion avail­able on the Web. DBpe­dia allows you to ask sophis­ti­cated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it eas­ier for the amaz­ing amount of infor­ma­tion in Wikipedia to be used in new and inter­est­ing ways, and that it might inspire new mech­a­nisms for nav­i­gat­ing, link­ing and improv­ing the ency­clopae­dia itself.”

We decided: yes, it is!

 from www.dbpedia.org/Datasets:

DBpe­dia uses the Resource Descrip­tion Frame­work (RDF) as a flex­i­ble data model for rep­re­sent­ing extracted infor­ma­tion and for pub­lish­ing it on the Web. We use the SPARQL query lan­guage to query this data. Please refer to the Devel­op­ers Guide to Seman­tic Web Toolk­its to find a devel­op­ment toolkit in your pre­ferred pro­gram­ming lan­guage to process DBpe­dia data.

The DBpe­dia knowl­edge base cur­rently describes more than 3.64 mil­lion things, out of which 1.83 mil­lion are clas­si­fied in a con­sis­tent Ontol­ogy, includ­ing 416,000 per­sons, 526,000 places (includ­ing 360,000 pop­u­lated places), 106,000 music albums, 60,000 films, 17,500 video games, 169,000 orga­ni­za­tions (includ­ing 40,000 com­pa­nies and 38,000 edu­ca­tional insti­tu­tions), 183,000 species and 5,400 diseases.

At that time we did not yet have much experience with the Semantic Web, so there was probably some work to do.

The following blog articles describe our work and refer to the source code available at www.github.com/sones/sones-dbpedia.