
GraphDBPedia — Importing data – From Triples to Vertices and Edges

DBPedia data is provided as several RDF triple files. Each line in each file is a “complete” statement consisting of subject, predicate and object, e.g.

mappingbased_properties_en.nt: (some line)
<http://dbpedia.org/resource/12_Monkeys>
<http://dbpedia.org/ontology/editing>
<http://dbpedia.org/resource/Mick_Audsley> .       
stands for: “12 Monkeys” has an “editor”, “Mick Audsley”.
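
As a rough illustration, a line like the one above could be split into its three parts as follows. This is a minimal Python sketch, not the parser from the sones-dbpedia project: the name `parse_triple` and the simplified regular expression are assumptions, and the pattern does not cover the full N-Triples grammar (escapes, blank nodes).

```python
import re

# Simplified pattern: <subject> <predicate> <resource-or-literal> .
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

def parse_triple(line):
    """Return (subject, predicate, object, is_resource), or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subject, predicate, raw_object = match.groups()
    # An object in angle brackets is a resource (a later EDGE target);
    # anything else is treated as a literal (a later property value).
    if raw_object.startswith('<') and raw_object.endswith('>'):
        return subject, predicate, raw_object[1:-1], True
    return subject, predicate, raw_object, False

line = ('<http://dbpedia.org/resource/12_Monkeys> '
        '<http://dbpedia.org/ontology/editing> '
        '<http://dbpedia.org/resource/Mick_Audsley> .')
triple = parse_triple(line)
```

The `is_resource` flag is what later decides whether a statement becomes a property or an EDGE.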

In other files there is additional information available, e.g. that

  • “12 Monkeys” is a film
  • “Mick Audsley” is a person
  • … probably more information about “12 Monkeys” and “Mick Audsley”

What we want to do in sones GraphDB is to create a VERTEX for the film “12 Monkeys”. This includes

  • type information – “12 Monkeys” is a film
  • a set of properties – e.g. its budget
  • EDGES to related information, e.g. the editor Mick Audsley.

There is a single point of information (the VERTEX “12 Monkeys”) that holds all properties and relations in a single instance. To import the VERTEX “12 Monkeys”, we had to write a parser that runs over all available triple files and gathers all related information from the DBPedia data set.
At this point we had two options for implementing this parser. The first was to read all triple files in a fixed order to ensure data validity (we need to know that “12 Monkeys” is a movie to be able to assign the predicate “editor” unambiguously). The second was to add an intermediate step that collects all data in a temporary file without validation and to do the import afterwards.
We decided on the intermediate step because it allows some synchronization while reading the triple files and avoids creating invalid data, since the exported data can easily be cross-checked.
This step is represented by the project “2_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia. The parser reads only a subset of the offered data files, to show the functionality and focus on the added value.
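
The idea of the intermediate step can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code of the project mentioned above: `collect_records` and the record layout are made up for this example, and the input is assumed to be already parsed into (subject, predicate, object, is_resource) tuples. Because everything is grouped by subject first, the order in which the triple files are read no longer matters.

```python
from collections import defaultdict

# Standard RDF predicate that carries the type of a resource.
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

def new_record():
    return {'type': None, 'properties': {}, 'edges': defaultdict(list)}

def collect_records(triples):
    """Group parsed triples by subject into one record per future VERTEX."""
    records = defaultdict(new_record)
    for subject, predicate, obj, is_resource in triples:
        record = records[subject]
        if predicate == RDF_TYPE:
            record['type'] = obj                    # becomes the VERTEX TYPE
        elif is_resource:
            record['edges'][predicate].append(obj)  # becomes an EDGE
        else:
            record['properties'][predicate] = obj   # becomes a property
    return records

apollo = 'http://dbpedia.org/resource/Apollo_8'
triples = [
    # The type statement deliberately comes last: ordering does not matter here.
    (apollo, 'http://dbpedia.org/ontology/booster',
     'http://dbpedia.org/resource/Saturn_V', True),
    (apollo, 'http://dbpedia.org/ontology/crewSize', '3', False),
    (apollo, RDF_TYPE, 'http://dbpedia.org/ontology/SpaceMission', True),
]
records = collect_records(triples)
```

Validation (for example, rejecting a record whose type is still unknown) can then run over complete records instead of single lines.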
The result of the export for “Apollo 8” looks like this:

1   VertexID=-9223372036854775808
2   http://dbpedia.org/resource/Apollo_8=http://dbpedia.org/ontology/SpaceMission
3   LongAbstract_de=viel text
4   LongAbstract_en=a lot of text
5   http://dbpedia.org/ontology/commandModule_en=CM-103
6   http://dbpedia.org/ontology/missionDuration_en=529242.0
7   http://dbpedia.org/ontology/lunarOrbitTime_en=72613.0
8   http://dbpedia.org/ontology/crewSize_en=3
9   http://dbpedia.org/ontology/lunarModule_en=Ballast: Lunar Test Article B
10  http://dbpedia.org/ontology/serviceModule_en=SM-103
11  http://dbpedia.org/ontology/nextMission_en=http://dbpedia.org/resource/Apollo-9-patch.png
12  http://dbpedia.org/ontology/booster_en=http://dbpedia.org/resource/Saturn_V
13  http://dbpedia.org/ontology/previousMissions_en=http://dbpedia.org/resource/AP7lucky7.png
14  http://dbpedia.org/ontology/launchPad_en=http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39
15  ShortAbstract_en=some text
16  Name_en=http://dbpedia.org/resource/Apollo_8
17  http://dbpedia.org/ontology/SpaceMission/lunarOrbitTime_en=20.170277777777777
18  http://dbpedia.org/ontology/SpaceMission/missionDuration_en=6.125486111111111

Apart from one property, all data was exported from the triple files. The exception is the VertexID: while importing the ontology information (line 2 in this example), we also created a VertexID that is unique within the corresponding VERTEX TYPE. This allows unique and fast linking during the data import (which happens later) by referring to this ID.
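
How such per-type IDs could be handed out is sketched below in Python. The export above only shows a single value (-9223372036854775808, the smallest signed 64-bit integer), so the scheme here is an assumption: one counter per VERTEX TYPE, each starting at that value. The class name `VertexIdAllocator` is hypothetical.

```python
from collections import defaultdict
from itertools import count

INT64_MIN = -2**63  # -9223372036854775808, the value shown in the export above

class VertexIdAllocator:
    """Hands out IDs that are unique within one vertex type (assumed scheme)."""

    def __init__(self):
        # One independent counter per vertex type.
        self._counters = defaultdict(lambda: count(INT64_MIN))
        self._ids = {}

    def id_for(self, vertex_type, resource):
        """Return the ID of a resource, creating it on first use."""
        key = (vertex_type, resource)
        if key not in self._ids:
            self._ids[key] = next(self._counters[vertex_type])
        return self._ids[key]

allocator = VertexIdAllocator()
first = allocator.id_for('SpaceMission', 'http://dbpedia.org/resource/Apollo_8')
again = allocator.id_for('SpaceMission', 'http://dbpedia.org/resource/Apollo_8')
```

Asking twice for the same resource returns the same ID, which is what makes the ID usable as a link target later.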

After this intermediate step, the real import can be done. sones GraphDB offers GraphQL (GQL) as a simple and intuitive language. Based on the data structure prepared above, two steps have to be done with GQL: first create all VERTICES including all their properties, then link the VERTICES to each other.
Therefore, for the example above, two statements would be created:

INSERT INTO httpwwwdbpediaorgontologySpaceMission VALUES (
   VertexID=-9223372036854775808,
   LongAbstract_de='viel text',
   LongAbstract_en='a lot of text',
   Name_en='http://dbpedia.org/resource/Apollo_8',
   httpdbpediaorgontologycommandModule_en='CM-103',
   httpdbpediaorgontologymissionDuration_en=529242.0,
   httpdbpediaorgontologylunarOrbitTime_en=72613.0,
   httpdbpediaorgontologycrewSize_en=3,
   httpdbpediaorgontologylunarModule_en='Ballast: Lunar Test Article B',
   httpdbpediaorgontologyserviceModule_en='SM-103',
   ShortAbstract_en='some text',
   httpdbpediaorgontologySpaceMissionlunarOrbitTime_en=20.170277777777777,
   httpdbpediaorgontologySpaceMissionmissionDuration_en=6.125486111111111
)

UPDATE httpwwwdbpediaorgontologySpaceMission SET
(
   httpdbpediaorgontologynextMission_en=SETOF(Name_en='http://dbpedia.org/resource/Apollo-9'),
   httpdbpediaorgontologybooster_en=SETOF(Name_en='http://dbpedia.org/resource/Saturn_V'),
   httpdbpediaorgontologypreviousMission_en=SETOF(Name_en='http://dbpedia.org/resource/AP7'),
   httpdbpediaorgontologylaunchPad_en=SETOF(Name_en='http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39')
)
WHERE VertexID=-9223372036854775808
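
Statements of the first kind could be generated from an exported record roughly as follows. This Python sketch is an illustration, not the generator used in the project: the helper names are hypothetical, the rule "drop every character that is not a letter, digit or underscore" is inferred from the attribute names shown above, and quote escaping inside string values is not handled. The type name `SpaceMission` in the usage is a placeholder.

```python
import re

def to_identifier(uri):
    """Drop every character that is not a letter, digit or underscore (inferred rule)."""
    return re.sub(r'[^A-Za-z0-9_]', '', uri)

def format_value(value):
    """Numbers are written bare, everything else as a single-quoted string."""
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value) + "'"  # assumption: values contain no single quotes

def build_insert(type_name, vertex_id, properties):
    """Build one INSERT statement in the shape shown above."""
    assignments = ['VertexID=' + str(vertex_id)]
    for key, value in properties.items():
        assignments.append(to_identifier(key) + '=' + format_value(value))
    return 'INSERT INTO ' + type_name + ' VALUES (' + ', '.join(assignments) + ')'

statement = build_insert(
    'SpaceMission',  # placeholder type name
    -9223372036854775808,
    {
        'http://dbpedia.org/ontology/crewSize_en': 3,
        'http://dbpedia.org/ontology/commandModule_en': 'CM-103',
    },
)
```

Only literal properties belong in this statement; resource-valued entries are left for the second (linking) statement.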

The problem with this approach is that EDGES are set via a WHERE condition that may not be unique, or whose attribute may not be set at all on the target VERTEX. One way to solve this is to verify the ID of the target vertex first and do the linking via this ID.
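
Such a check could look like the following Python sketch. It is an assumption about how the verification might be done, not a description of sones GraphDB behaviour: the functions `build_name_index` and `resolve_target` and the dictionary layout of the vertices are made up for this example. A link is only resolved if exactly one vertex carries the requested name.

```python
from collections import defaultdict

def build_name_index(vertices):
    """Map each Name_en value to the (type, VertexID) pairs of all vertices carrying it."""
    index = defaultdict(list)
    for vertex in vertices:
        name = vertex.get('Name_en')
        if name is not None:  # vertices without the attribute cannot be link targets
            index[name].append((vertex['VertexType'], vertex['VertexID']))
    return index

def resolve_target(index, name):
    """Return (type, VertexID) if exactly one vertex has this name, else None."""
    matches = index.get(name, [])
    return matches[0] if len(matches) == 1 else None

vertices = [
    {'VertexType': 'SpaceMission', 'VertexID': -9223372036854775808,
     'Name_en': 'http://dbpedia.org/resource/Apollo_8'},
    {'VertexType': 'Rocket', 'VertexID': -9223372036854775808,
     'Name_en': 'http://dbpedia.org/resource/Saturn_V'},
    {'VertexType': 'Rocket', 'VertexID': -9223372036854775807},  # Name_en not set
]
index = build_name_index(vertices)
booster = resolve_target(index, 'http://dbpedia.org/resource/Saturn_V')
missing = resolve_target(index, 'http://dbpedia.org/resource/Apollo-9')
```

Because a VertexID is only unique within its type, the sketch returns the type together with the ID. Unresolved or ambiguous targets come back as `None` and can be logged instead of producing a wrong EDGE.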

sones GraphDB also offers another way to import data: XmlBulkImport. It is faster than GraphQL (because it uses the graph file system interfaces) and it organizes the INSERTING and LINKING of data itself. Instead of generating GraphQL statements, a proprietary XML structure has to be created, and the import is done via a single IMPORT GQL statement.
A description of this format and its usage can be found at: http://developers.sones.de/wiki/doku.php?id=importexport:xmlbulkimport

This XmlBulkImport data file is created by the project “3_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia.

 
