GraphDBPedia — Importing data – From Triples to Vertices and Edges
DBPedia data is provided in several RDF triple files. Each line in each file gives a “complete” information set consisting of subject, predicate and object, e.g.
mappingbased_properties_en.nt: (some line)
<http://dbpedia.org/resource/12_Monkeys>
<http://dbpedia.org/ontology/editing>
<http://dbpedia.org/resource/Mick_Audsley> .
This stands for: “12 Monkeys” has an “editor”, “Mick Audsley”.
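To make the structure of such a line concrete, here is a minimal Python sketch that splits one triple into subject, predicate and object. It is an illustration only, not the parser from the GraphDBPedia solution; the names `TRIPLE_RE`, `parse_iri_triple` and `local_name` are made up for this example, and lines whose object is a literal are deliberately not handled.

```python
import re

# Matches one N-Triples line whose subject, predicate and object are all IRIs.
# Literal objects (quoted text, typed values) are deliberately not handled here.
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def parse_iri_triple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None

def local_name(iri):
    """Return the last path segment of an IRI, e.g. '12_Monkeys'."""
    return iri.rsplit('/', 1)[-1]

line = ('<http://dbpedia.org/resource/12_Monkeys> '
        '<http://dbpedia.org/ontology/editing> '
        '<http://dbpedia.org/resource/Mick_Audsley> .')
subject, predicate, obj = parse_iri_triple(line)
print(local_name(subject), local_name(predicate), local_name(obj))
```

A real import would also need to handle literal objects such as abstracts and numbers, which this sketch skips.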
In other files there is additional information available, e.g. that
- “12 Monkeys” is a film
- “Mick Audsley” is a person
- … probably more information about “12 Monkeys” and “Mick Audsley”
What we want to do in sones GraphDB is to create a VERTEX for the film “12 Monkeys”. This includes
- type information – 12 Monkeys is a film
- a set of properties – e.g. its budget
- EDGES to related information, e.g. the editor Mick Audsley.
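As a rough mental model, the three parts of a VERTEX listed above could be represented like this in Python. This is only a sketch: the class `Vertex`, its field names and the type IRI `http://dbpedia.org/ontology/Film` are assumptions for illustration and are not taken from sones GraphDB or from the DBPedia files.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str                                       # resource IRI of the vertex
    vertex_type: str = ''                           # ontology type, e.g. a film
    properties: dict = field(default_factory=dict)  # plain values, e.g. a budget
    edges: dict = field(default_factory=dict)       # predicate -> list of target IRIs

    def add_edge(self, predicate, target):
        """Record a relation to another vertex, identified by its IRI."""
        self.edges.setdefault(predicate, []).append(target)

# The type IRI below is an assumed value for the example.
film = Vertex(name='http://dbpedia.org/resource/12_Monkeys',
              vertex_type='http://dbpedia.org/ontology/Film')
film.add_edge('http://dbpedia.org/ontology/editing',
              'http://dbpedia.org/resource/Mick_Audsley')
print(film.edges)
```

No property values are filled in here because the text does not give any for “12 Monkeys”.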
There is a single point of information (the VERTEX “12 Monkeys”) that holds all information and relations in a single instance. To import the VERTEX “12 Monkeys”, we had to write a parser that runs over all available triple files and gives us all related information from the DBPedia data set.
At this point we had two options for implementing this parser. The first was to read all triple files in a dedicated order to ensure data validity (we need to know that “12 Monkeys” is a movie to be able to assign the predicate “editor” unambiguously). The second was to add an intermediate step: create a temporary file that collects all data without validation, and do the import afterwards.
We decided on the intermediate step because it allows some synchronization while reading the triple files and avoids creating invalid data, since the exported data can be cross-checked easily.
This step is represented by the project “2_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia. The parser reads only a subset of the offered data files, enough to show the functionality and to focus on the added value.
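The idea of the intermediate step can be sketched in Python as follows: triples arrive in arbitrary order from several files, are grouped per subject without validation, and are then written out as `key=value` lines with the type information first. This is not the code of the project mentioned above; the function names are hypothetical, and details of the real export such as the `_en` suffix on keys are left out.

```python
from collections import defaultdict

# Standard RDF predicate that carries the type of a resource.
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

def collect_by_subject(triples):
    """Group (subject, predicate, object) tuples into one record per subject."""
    records = defaultdict(lambda: {'types': [], 'values': []})
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            records[subject]['types'].append(obj)
        else:
            records[subject]['values'].append((predicate, obj))
    return dict(records)

def to_lines(subject, record):
    """Render one record as key=value lines, type information first."""
    lines = [f'{subject}={t}' for t in record['types']]
    lines += [f'{p}={o}' for p, o in record['values']]
    return lines

APOLLO_8 = 'http://dbpedia.org/resource/Apollo_8'
triples = [
    (APOLLO_8, 'http://dbpedia.org/ontology/crewSize', '3'),
    (APOLLO_8, RDF_TYPE, 'http://dbpedia.org/ontology/SpaceMission'),
    (APOLLO_8, 'http://dbpedia.org/ontology/booster',
     'http://dbpedia.org/resource/Saturn_V'),
]
records = collect_by_subject(triples)
for text in to_lines(APOLLO_8, records[APOLLO_8]):
    print(text)
```

Because nothing is validated while collecting, the order in which the triple files are read no longer matters; the cross-check happens on the grouped records.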
The result of the export for “Apollo 8” looks like this:
1 VertexID=-9223372036854775808
2 http://dbpedia.org/resource/Apollo_8=http://dbpedia.org/ontology/SpaceMission
3 LongAbstract_de=viel text
4 LongAbstract_en=a lot of text
5 http://dbpedia.org/ontology/commandModule_en=CM-103
6 http://dbpedia.org/ontology/missionDuration_en=529242.0
7 http://dbpedia.org/ontology/lunarOrbitTime_en=72613.0
8 http://dbpedia.org/ontology/crewSize_en=3
9 http://dbpedia.org/ontology/lunarModule_en=Ballast: Lunar Test Article B
10 http://dbpedia.org/ontology/serviceModule_en=SM-103
11 http://dbpedia.org/ontology/nextMission_en=http://dbpedia.org/resource/Apollo-9-patch.png
12 http://dbpedia.org/ontology/booster_en=http://dbpedia.org/resource/Saturn_V
13 http://dbpedia.org/ontology/previousMissions_en=http://dbpedia.org/resource/AP7lucky7.png
14 http://dbpedia.org/ontology/launchPad_en=http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39
15 ShortAbstract_en=some text
16 Name_en=http://dbpedia.org/resource/Apollo_8
17 http://dbpedia.org/ontology/SpaceMission/lunarOrbitTime_en=20.170277777777777
18 http://dbpedia.org/ontology/SpaceMission/missionDuration_en=6.125486111111111
Apart from one property, the VertexID, all of this data was exported from the triple files. While importing the ontology information (line 2 in this example), we also created a VertexID that is unique within the corresponding VERTEX TYPE. Referring to this ID allows unique and performant linking during the data import, which happens later.
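One way to hand out such IDs is a separate counter per VERTEX TYPE. The Python sketch below assumes that counting starts at the smallest signed 64-bit value, which is only inferred from the VertexID shown in the example; the class `VertexIdAllocator` is hypothetical, as are the second resource's role and the film type used to show an independent counter.

```python
from collections import defaultdict
from itertools import count

INT64_MIN = -2**63  # -9223372036854775808, the VertexID shown in the example

class VertexIdAllocator:
    """Hands out IDs that are unique within each vertex type."""

    def __init__(self):
        # One independent counter per vertex type (assumed start: INT64_MIN).
        self._counters = defaultdict(lambda: count(INT64_MIN))
        self._ids = {}

    def id_for(self, vertex_type, resource):
        """Return the ID of a resource, allocating one on first use."""
        key = (vertex_type, resource)
        if key not in self._ids:
            self._ids[key] = next(self._counters[vertex_type])
        return self._ids[key]

allocator = VertexIdAllocator()
mission = 'http://dbpedia.org/ontology/SpaceMission'
first = allocator.id_for(mission, 'http://dbpedia.org/resource/Apollo_8')
second = allocator.id_for(mission, 'http://dbpedia.org/resource/Apollo-9')
print(first, second)
```

Since the counters are per type, an ID is only meaningful together with its VERTEX TYPE.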
After this intermediate step, the real import can be done. Sones GraphDB offers GraphQL (GQL) as a simple and intuitive language. Based on the data structure we prepared above, two steps have to be done with GQL: first, create all VERTICES including all their properties; afterwards, link the VERTICES to each other.
Therefore, for the example above, two statements would be created:
INSERT INTO httpwwwdbpediaorgontologySpaceMisson VALUES (
VertexID=-9223372036854775808,
LongAbstract_de='viel text',
LongAbstract_en='a lot of text',
Name_en='http://dbpedia.org/resource/Apollo_8',
httpdbpediaorgontologycommandModule_en='CM-103',
httpdbpediaorgontologymissionDuration_en=529242.0,
httpdbpediaorgontologylunarOrbitTime_en=72613.0,
httpdbpediaorgontologycrewSize_en=3,
httpdbpediaorgontologylunarModule_en='Ballast: Lunar Test Article B',
httpdbpediaorgontologyserviceModule_en='SM-103',
ShortAbstract_en='some text',
httpdbpediaorgontologySpaceMissionlunarOrbitTime_en=20.170277777777777,
httpdbpediaorgontologySpaceMissionmissionDuration_en=6.125486111111111
)
UPDATE httpwwwdbpediaorgontologySpaceMisson SET
(
httpdbpediaorgontologynextMission_en=SETOF(Name_en='http://dbpedia.org/resource/Apollo-9'),
httpdbpediaorgontologybooster_en=SETOF(Name_en='http://dbpedia.org/resource/Saturn_V'),
httpdbpediaorgontologypreviousMission_en=SETOF(Name_en='http://dbpedia.org/resource/AP7'),
httpdbpediaorgontologylaunchPad_en=SETOF(Name_en='http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39')
)
WHERE VertexID=-9223372036854775808
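Statements of this shape could be generated from the intermediate records roughly as in the following Python sketch. It assumes that attribute names are derived from the exported keys by dropping every character that is not a letter, digit or underscore (which matches the names shown above), that assignments are separated by commas, and that string values never contain a single quote. The helper names are hypothetical, and the type name is simply passed in as text.

```python
import re

def gql_name(key):
    """Drop every character that is not a letter, digit or underscore."""
    return re.sub(r'[^0-9A-Za-z_]', '', key)

def gql_value(value):
    """Numbers are written as-is, everything else in single quotes (no escaping)."""
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value) + "'"

def build_insert(type_name, properties):
    """Build the INSERT statement that creates one vertex with its properties."""
    assignments = ',\n'.join(
        f'{gql_name(key)}={gql_value(value)}' for key, value in properties.items())
    return f'INSERT INTO {type_name} VALUES (\n{assignments}\n)'

def build_update(type_name, vertex_id, edges):
    """Build the UPDATE statement that links one vertex to its targets by name."""
    assignments = ',\n'.join(
        f"{gql_name(key)}=SETOF(Name_en='{target}')" for key, target in edges.items())
    return (f'UPDATE {type_name} SET\n(\n{assignments}\n)\n'
            f'WHERE VertexID={vertex_id}')

type_name = 'httpwwwdbpediaorgontologySpaceMisson'
print(build_insert(type_name, {
    'VertexID': -9223372036854775808,
    'http://dbpedia.org/ontology/commandModule_en': 'CM-103',
    'http://dbpedia.org/ontology/crewSize_en': 3,
}))
print(build_update(type_name, -9223372036854775808, {
    'http://dbpedia.org/ontology/booster_en': 'http://dbpedia.org/resource/Saturn_V',
}))
```

Splitting the work into two functions mirrors the two steps: all INSERT statements are emitted first, and the UPDATE statements follow once every target VERTEX exists.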
The problem with this approach is that EDGES are set via a condition that may not be unique, or the attribute may not be set at all on the target VERTEX. One way to solve this is to verify the ID of the target vertex and do the linking via that ID.
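The check described here could look like the following Python sketch: before an EDGE is written, the target name is looked up in an index, and the link is only created when exactly one VERTEX carries that name. The helper names, the `Rocket` type and the `SETOF(VertexID=...)` form are assumptions made by analogy with the statements above, not confirmed GQL.

```python
def build_name_index(vertices):
    """Map each name to the (vertex type, VertexID) pairs that carry it."""
    index = {}
    for vertex_type, vertex_id, name in vertices:
        index.setdefault(name, []).append((vertex_type, vertex_id))
    return index

def resolve_target(index, name):
    """Return the single (type, VertexID) for a name, or None if missing or ambiguous."""
    matches = index.get(name, [])
    return matches[0] if len(matches) == 1 else None

def edge_condition(index, name):
    """Build an ID-based link condition, or None if the target cannot be verified."""
    target = resolve_target(index, name)
    if target is None:
        return None
    # VertexIDs are unique per type only, so the edge must point at target[0].
    return f'SETOF(VertexID={target[1]})'

# Types and IDs below are illustrative values, not taken from a real export.
vertices = [
    ('SpaceMission', -9223372036854775808, 'http://dbpedia.org/resource/Apollo_8'),
    ('Rocket', -9223372036854775808, 'http://dbpedia.org/resource/Saturn_V'),
]
index = build_name_index(vertices)
print(edge_condition(index, 'http://dbpedia.org/resource/Saturn_V'))
print(edge_condition(index, 'http://dbpedia.org/resource/Apollo-9'))
```

Targets that resolve to None can be logged and skipped instead of producing an EDGE that points nowhere or at several vertices.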
Sones GraphDB also offers another import option, XmlBulkImport. It is faster than GraphQL (because it uses the Graph filesystem interfaces) and it organizes the INSERTING and LINKING of data itself. Instead of generating GraphQL statements, a proprietary XML structure has to be created, and the import is done via a single IMPORT GQL statement.
A description of this format and its usage can be found at: http://developers.sones.de/wiki/doku.php?id=importexport:xmlbulkimport
This XmlBulkImport data file is created by the project “3_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia.
