
Category: use-case

DBpedia data is provided in several RDF triple files. Each line in each file is a self-contained statement made up of subject, predicate and object, e.g.

mappingbased_properties_en.nt: (some line)
<http://dbpedia.org/resource/12_Monkeys>
<http://dbpedia.org/ontology/editing>
<http://dbpedia.org/resource/Mick_Audsley> .       
stands for: “12 Monkeys” has an “editor” named “Mick Audsley”.
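A minimal sketch of how such a line can be split into its three parts. It assumes that all three parts are URIs in angle brackets, as in the example above; literal objects (quoted text with a language tag) are not handled, and the function name `parse_uri_triple` is made up for this illustration.

```python
import re

# Matches one N-Triples line whose subject, predicate and object are all URIs.
# Literal objects ("text"@en) are deliberately not handled in this sketch.
TRIPLE_RE = re.compile(r'^\s*<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')


def parse_uri_triple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None


line = ('<http://dbpedia.org/resource/12_Monkeys> '
        '<http://dbpedia.org/ontology/editing> '
        '<http://dbpedia.org/resource/Mick_Audsley> .')
triple = parse_uri_triple(line)
```

Lines that do not fit the pattern yield `None`, so a caller can skip or log them instead of failing.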

In other files there is addi­tional infor­ma­tion avail­able, e.g. that

  • “12 Monkeys” is a film
  • “Mick Audsley” is a person
  • … prob­a­bly more infor­ma­tion about “12 Mon­keys” and “Mick Audsley”

What we want to do in sones GraphDB is to cre­ate a VERTEX for the film “12 Mon­keys”. This includes

  • type infor­ma­tion – 12 Mon­keys is a film
  • a set of prop­er­ties – e.g. its budget
  • EDGES to related infor­ma­tion, e.g. the edi­tor Mick Audsley.

There is a single point of information (the VERTEX “12 Monkeys”) that holds all properties and relations in a single instance. To import the VERTEX “12 Monkeys”, we had to write a parser that runs over all available triple files and collects all related information from the DBpedia data set.
At this point we had two options for implementing this parser. The first was to read all triple files in a dedicated order to ensure data validity (we need to know that “12 Monkeys” is a movie to be able to assign the predicate “editor” unambiguously). The second was to add an intermediate step that collects all data in a temporary file without validation and to run the import afterwards.
We decided on the intermediate step because it allows some synchronization while reading the triple files and avoids creating invalid data, since the exported data can easily be cross-checked.
This step is implemented by the project “2_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia. The parser reads only a subset of the offered data files, enough to show the functionality and to focus on the added value.
The result of the export for “Apollo 8” looks like this:

1  VertexID=-9223372036854775808
2  http://dbpedia.org/resource/Apollo_8=http://dbpedia.org/ontology/SpaceMission
3  LongAbstract_de=viel text
4  LongAbstract_en=a lot of text
5  http://dbpedia.org/ontology/commandModule_en=CM-103
6  http://dbpedia.org/ontology/missionDuration_en=529242.0
7  http://dbpedia.org/ontology/lunarOrbitTime_en=72613.0
8  http://dbpedia.org/ontology/crewSize_en=3
9  http://dbpedia.org/ontology/lunarModule_en=Ballast: Lunar Test Arti­cle B
10  http://dbpedia.org/ontology/serviceModule_en=SM-103
11  http://dbpedia.org/ontology/nextMission_en=http://dbpedia.org/resource/Apollo-9-patch.png
12   http://dbpedia.org/ontology/booster_en=http://dbpedia.org/resource/Saturn_V
13   http://dbpedia.org/ontology/previousMissions_en= http://dbpedia.org/resource/AP7lucky7.png
14   http://dbpedia.org/ontology/launchPad_en=http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39
15   ShortAbstract_en=some text
16   Name_en=http://dbpedia.org/resource/Apollo_8
17    http://dbpedia.org/ontology/SpaceMission/lunarOrbitTime_en=20.170277777777777
18   http://dbpedia.org/ontology/SpaceMission/missionDuration_en=6.125486111111111

Apart from one property, all data has been exported from the triple files. While importing the ontology information (line 2 in this example), we also created a VertexID (line 1) that is unique for the corresponding VERTEX TYPE. Referring to this ID allows unique and fast linking during the data import, which happens later.
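The intermediate step can be pictured roughly as follows. This is only a sketch under assumptions, not the code of the project: the function names are invented, multi-valued predicates are simply overwritten, and counting IDs upwards from the smallest 64-bit integer is inferred from the VertexID shown in the export.

```python
from collections import defaultdict

INT64_MIN = -2**63  # equals the VertexID shown in the export above


def collect_by_subject(triples):
    """Group (subject, predicate, object) triples into one record per subject.

    A later triple with the same predicate overwrites an earlier one.
    """
    records = defaultdict(dict)
    for subject, predicate, obj in triples:
        records[subject][predicate] = obj
    return dict(records)


def assign_vertex_ids(subject_types):
    """Give each subject an ID that is unique within its vertex type.

    subject_types maps subject URI -> type URI. Counting upwards from
    INT64_MIN per type is an assumption made for this sketch.
    """
    next_id = defaultdict(lambda: INT64_MIN)
    ids = {}
    for subject, type_uri in subject_types.items():
        ids[subject] = next_id[type_uri]
        next_id[type_uri] += 1
    return ids


mission = "http://dbpedia.org/ontology/SpaceMission"
apollo8 = "http://dbpedia.org/resource/Apollo_8"
apollo9 = "http://dbpedia.org/resource/Apollo_9"

records = collect_by_subject([
    (apollo8, "http://dbpedia.org/ontology/crewSize_en", "3"),
    (apollo8, "http://dbpedia.org/ontology/commandModule_en", "CM-103"),
])
vertex_ids = assign_vertex_ids({apollo8: mission, apollo9: mission})
```

Because every record carries its own ID, the later linking step can refer to a target vertex without searching for it by name.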

After this intermediate step, the actual import can be done. The sones GraphDB offers GraphQL (GQL) as a simple and intuitive query language. Based on the data structure prepared above, the import with GQL takes two steps: first, create all VERTICES including all their properties; afterwards, create the links between the VERTICES.
For the example above, two statements would therefore be created:

 INSERT INTO httpdbpediaorgontologySpaceMission VALUES (
   VertexID=-9223372036854775808,
   LongAbstract_de='viel text',
   LongAbstract_en='a lot of text',
   Name_en='http://dbpedia.org/resource/Apollo_8',
   httpdbpediaorgontologycommandModule_en='CM-103',
   httpdbpediaorgontologymissionDuration_en=529242.0,
   httpdbpediaorgontologylunarOrbitTime_en=72613.0,
   httpdbpediaorgontologycrewSize_en=3,
   httpdbpediaorgontologylunarModule_en='Ballast: Lunar Test Article B',
   httpdbpediaorgontologyserviceModule_en='SM-103',
   ShortAbstract_en='some text',
   httpdbpediaorgontologySpaceMissionlunarOrbitTime_en=20.170277777777777,
   httpdbpediaorgontologySpaceMissionmissionDuration_en=6.125486111111111
 )

 UPDATE httpdbpediaorgontologySpaceMission SET
 (
   httpdbpediaorgontologynextMission_en=SETOF(Name_en='http://dbpedia.org/resource/Apollo-9'),
   httpdbpediaorgontologybooster_en=SETOF(Name_en='http://dbpedia.org/resource/Saturn_V'),
   httpdbpediaorgontologypreviousMission_en=SETOF(Name_en='http://dbpedia.org/resource/AP7'),
   httpdbpediaorgontologylaunchPad_en=SETOF(Name_en='http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39')
 )
 WHERE VertexID=-9223372036854775808

The problem with this approach is that EDGES are set via a matching condition (here on Name_en) that may not be unique, or the attribute may not be set at all on the target VERTEX. One way to solve this is to look up the ID of the target vertex first and to do the linking via that ID.
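The two statements above can be generated mechanically from an exported record. The following is a sketch only: the helper names are invented, the statement layout is copied from the example above rather than from a GQL reference, and quote characters inside values are not escaped.

```python
def gql_literal(value):
    """Render numbers bare and everything else as a single-quoted string.

    Quote characters inside the value are NOT escaped in this sketch.
    """
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    return "'" + str(value) + "'"


def build_insert(type_name, properties):
    """First pass: one INSERT per vertex with all of its plain properties."""
    assignments = ", ".join(
        f"{name}={gql_literal(value)}" for name, value in properties.items())
    return f"INSERT INTO {type_name} VALUES ({assignments})"


def build_update(type_name, vertex_id, edges):
    """Second pass: link the vertex to its targets.

    edges maps edge name -> Name_en value of the target vertex.
    """
    assignments = ", ".join(
        f"{edge}=SETOF(Name_en={gql_literal(target)})"
        for edge, target in edges.items())
    return (f"UPDATE {type_name} SET ({assignments}) "
            f"WHERE VertexID={vertex_id}")


TYPE_NAME = "httpdbpediaorgontologySpaceMission"
insert_stmt = build_insert(TYPE_NAME, {
    "VertexID": -9223372036854775808,
    "httpdbpediaorgontologycrewSize_en": 3,
    "httpdbpediaorgontologycommandModule_en": "CM-103",
})
update_stmt = build_update(TYPE_NAME, -9223372036854775808, {
    "httpdbpediaorgontologybooster_en": "http://dbpedia.org/resource/Saturn_V",
})
```

Running all INSERT statements before the first UPDATE makes sure that every link target already exists when it is referenced.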

The sones GraphDB also offers another import option, XmlBulkImport. It is faster than GraphQL (because it uses the graph file system interfaces directly) and it organizes the INSERTING and LINKING of data itself. Instead of GraphQL statements, a proprietary XML structure has to be created, and the import is done via a single IMPORT GQL statement.
A descrip­tion of this for­mat and its usage can be found at: http://developers.sones.de/wiki/doku.php?id=importexport:xmlbulkimport

This XmlBulkImport data file is created by the project “3_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia.

 

The first step was to transfer the ontology, provided in Web Ontology Language (OWL) format, into GraphDB VERTEX TYPES and EDGES. For this, we implemented a parser that reads the OWL file, converts it into a class model and exports that model as a GQL CREATE VERTEX TYPES statement.
The ontology currently contains 273 classes (DBpedia 3.6) and thousands of datatype properties and object properties. Its main structures are briefly demonstrated here:

OWL-Class:

<owl:Class rdf:about="http://dbpedia.org/ontology/Island">
   <rdfs:label xml:lang="en">island</rdfs:label>
    <rdfs:label xml:lang="el">νησί</rdfs:label>
    <rdfs:label xml:lang="fr">île</rdfs:label>
    <rdfs:subClassOf
           rdf:resource="http://dbpedia.org/ontology/PopulatedPlace">
    </rdfs:subClassOf>
</owl:Class>

- rep­re­sents an Island (based on a PopulatedPlace)

OWL-DatatypeProperty:

<owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
   <rdfs:label xml:lang="en">number of islands</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"></rdfs:range>
</owl:DatatypeProperty>

- describes the non-negative Inte­ger attribute “num­ber of islands” for the class Island.

OWL-ObjectProperty:

 <owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:label xml:lang="en">highest state</rdfs:label>
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:range>
</owl:ObjectProperty>

- represents the highest “PopulatedPlace” on an island.

The con­ver­sion cre­ates a

  • VERTEX TYPE  – one for each class,
  • hav­ing mul­ti­ple PROPERTIES – from datatype properties
  • and mul­ti­ple EDGES – from object-properties

Within the data schema there is a large number of mutual dependencies. The CREATE VERTEX TYPES statement resolves all of them and creates a valid data schema.
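The conversion from OWL to a class model can be sketched with Python's standard XML parser. This is an illustration under assumptions, not the parser from the project: it only looks at `rdfs:subClassOf`, `rdfs:domain` and `rdfs:range`, the function name `read_ontology` is invented, and the sample document is a shortened version of the three snippets shown above.

```python
import xml.etree.ElementTree as ET

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def _resource(element, tag):
    """Return the rdf:resource of the first child with the given tag, if any."""
    child = element.find(tag, NS)
    return child.get(RDF_RESOURCE) if child is not None else None


def read_ontology(owl_xml):
    """Map each class URI to its parent, datatype properties and edges."""
    root = ET.fromstring(owl_xml)
    classes = {}
    for cls in root.findall("owl:Class", NS):
        classes[cls.get(RDF_ABOUT)] = {
            "parent": _resource(cls, "rdfs:subClassOf"),
            "properties": {},  # from owl:DatatypeProperty
            "edges": {},       # from owl:ObjectProperty
        }
    for tag, slot in (("owl:DatatypeProperty", "properties"),
                      ("owl:ObjectProperty", "edges")):
        for prop in root.findall(tag, NS):
            domain = _resource(prop, "rdfs:domain")
            if domain in classes:
                classes[domain][slot][prop.get(RDF_ABOUT)] = _resource(prop, "rdfs:range")
    return classes


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/Island">
    <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:Class>
  <owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"/>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:ObjectProperty>
</rdf:RDF>"""

island = read_ontology(SAMPLE)["http://dbpedia.org/ontology/Island"]
```

From such a model, one vertex type per class can be emitted, with the `properties` entries becoming attributes and the `edges` entries becoming edges to other vertex types.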

In addition to the ontology from the OWL file, we added some vertex types to fix a few problems we ran into and to enhance the functionality a little:

  • First, the VERTEX TYPE Thing was not described in the ontology. It is the base class that all other VERTEX TYPES are based upon.
  • To reflect disambiguation, we created a VERTEX TYPE Instance with an EDGE to a SET of Thing. In case of a disambiguation, an Instance refers to the corresponding nodes in the GraphDB.
  • Within the RDF files, labels are stored in dedicated triples. We also added a dedicated VERTEX TYPE for them, to avoid a mix-up in case one label refers to multiple Instances.

Currently, the GraphDB has some limitations regarding the characters allowed in the names of VERTEX TYPES, their ATTRIBUTES and EDGES. The OWL and RDF formats are generally based on URLs as identifiers, but GraphDB cannot handle colons, dots and slashes (both slash and backslash) in names. Our simple workaround was to keep the URL and remove all occurrences of these characters. This leads us from http://dbpedia.org/ontology/Island to httpdbpediaorgontologyIsland.
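The workaround amounts to deleting four characters from every URL. A small sketch (the function name `to_gql_name` is made up here):

```python
# Characters that may not appear in type, attribute or edge names:
# colon, dot, slash and backslash.
FORBIDDEN = str.maketrans("", "", ":./\\")


def to_gql_name(url):
    """Remove all forbidden characters from a URL to get a usable name."""
    return url.translate(FORBIDDEN)


island_name = to_gql_name("http://dbpedia.org/ontology/Island")
```

Note that this mapping is not reversible, and two different URLs could in principle collapse into the same name.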

Another challenge is the type mapping between OWL and GraphDB. GraphDB supports the simple C# data types, whereas the DBpedia OWL uses a list of 9 datatypes from XML Schema plus DBpedia area units, speed units, density units, time units, volume units, distance units and several others. This led us to a huge switch statement that does the mapping; all properties could be represented with C# data types without data loss.
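In Python, such a switch is usually written as a lookup table. The table below is purely illustrative: the selection of XML Schema types, the chosen target type names and the treatment of the DBpedia unit datatypes (assumed to live under `http://dbpedia.org/datatype/` and to carry numeric values) are assumptions for this sketch, not the mapping used in the project.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
DBPEDIA_DATATYPE_PREFIX = "http://dbpedia.org/datatype/"  # assumed prefix

# Illustrative mapping from OWL range URIs to .NET type names.
TYPE_MAP = {
    XSD + "string": "String",
    XSD + "boolean": "Boolean",
    XSD + "integer": "Int64",
    XSD + "nonNegativeInteger": "UInt64",
    XSD + "double": "Double",
    XSD + "float": "Double",
    XSD + "date": "DateTime",
}


def map_range(range_uri):
    """Pick a target type name for the range of a datatype property."""
    if range_uri in TYPE_MAP:
        return TYPE_MAP[range_uri]
    if range_uri.startswith(DBPEDIA_DATATYPE_PREFIX):
        return "Double"  # assumption: unit datatypes hold numeric values
    return "String"      # fallback that keeps the raw text


islands_type = map_range(XSD + "nonNegativeInteger")
```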

Wikipedia is available in multiple languages. The DBpedia export is currently provided in 99 of them.
Later, during the next steps, we found out that the data sometimes differs slightly between languages, since there are different authors. This is relevant for the data schema because there are several options for handling it.

One option is to let the application logic of the data importer decide how to handle this. We decided instead to make the data schema language-specific and to provide separate, language-specific attributes. This enlarges the data schema a little, but does not lead to any data loss. Additionally, some application logic can be implemented later on to check the data quality of each node.

The command-line tool “1_CreateGqlSchemaFromOntology”, part of the Visual Studio solution available at GitHub (https://github.com/sones/sones-dbpedia), creates the CREATE VERTEX TYPES statements described above, based on the ontology of DBpedia 3.6; later versions have not been tested yet.
The command-line executable has to be started with 2 parameters:

  • .owl filename (the filename has to be either an absolute path or located within the executable’s directory)
  • result .gql file: name of the file that all queries will be written to

At runtime, the user is asked for all languages that should be reflected in the schema. Our suggestion is to use 2-letter language codes with a leading underscore, like “_en” or “_de”. An empty string ends the iteration.
After the execution, the resulting .gql file can easily be imported via the IMPORT GQL statement.
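The language handling described above can be sketched in a few lines. The function names are invented for this illustration, and the answers are passed in as a list instead of being read from the console:

```python
def read_language_suffixes(answers):
    """Collect suffixes until the first empty answer ends the iteration."""
    suffixes = []
    for answer in answers:
        answer = answer.strip()
        if not answer:
            break
        suffixes.append(answer)
    return suffixes


def localize(attribute_names, suffixes):
    """Create one attribute name per base attribute and language suffix."""
    return [name + suffix for name in attribute_names for suffix in suffixes]


suffixes = read_language_suffixes(["_en", "_de", "", "_fr"])
attributes = localize(["LongAbstract", "ShortAbstract"], suffixes)
```

With two languages, every base attribute appears twice in the schema, which matches attribute names such as LongAbstract_en and LongAbstract_de in the export shown earlier.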

DBpedia is already stored in a machine-readable format (RDF). We started a proof of concept to show that GraphDB is able to meet these requirements too, and to find out the differences, advantages and disadvantages of the two concepts.
In RDF, the data model stands next to the data. Within sones GraphDB there is a close connection between each object (node) and its (vertex) type. For example, the node “Homer Simpson” knows that it is a “FictionalCharacter”.
Our expectation was that GraphDB requires less hard-disk space and also offers a better data store, since all information about an object is saved in a single node instead of several triple data files. Besides, any relationship between two objects (e.g. a person and their birthplace) is saved directly on the object. When a node is loaded, all its information is available from a single location.
During the project we discovered several problems that can be solved with this idea. The resulting data network enables customers to discover complex relationships between any nodes using graph algorithms. Disambiguation of words is possible using the schema information (e.g. Tuareg can be either nomads living in the Sahara or a vehicle built by a German car vendor).

We had our first contact with DBpedia back in May 2010. A prospect asked us whether GraphDB is the best way to reflect the data schema and import all data. After getting a first impression from the DBpedia website:

from www.dbpedia.org/About:

“DBpe­dia is a com­mu­nity effort to extract struc­tured infor­ma­tion from Wikipedia and to make this infor­ma­tion avail­able on the Web. DBpe­dia allows you to ask sophis­ti­cated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it eas­ier for the amaz­ing amount of infor­ma­tion in Wikipedia to be used in new and inter­est­ing ways, and that it might inspire new mech­a­nisms for nav­i­gat­ing, link­ing and improv­ing the ency­clopae­dia itself.”

we decided: Yes, it is!

 from www.dbpedia.org/Datasets:

DBpe­dia uses the Resource Descrip­tion Frame­work (RDF) as a flex­i­ble data model for rep­re­sent­ing extracted infor­ma­tion and for pub­lish­ing it on the Web. We use the SPARQL query lan­guage to query this data. Please refer to the Devel­op­ers Guide to Seman­tic Web Toolk­its to find a devel­op­ment toolkit in your pre­ferred pro­gram­ming lan­guage to process DBpe­dia data.

The DBpe­dia knowl­edge base cur­rently describes more than 3.64 mil­lion things, out of which 1.83 mil­lion are clas­si­fied in a con­sis­tent Ontol­ogy, includ­ing 416,000 per­sons, 526,000 places (includ­ing 360,000 pop­u­lated places), 106,000 music albums, 60,000 films, 17,500 video games, 169,000 orga­ni­za­tions (includ­ing 40,000 com­pa­nies and 38,000 edu­ca­tional insti­tu­tions), 183,000 species and 5,400 diseases.

At that time we did not have much experience with the Semantic Web yet, so there was probably some work to do.

The following blog articles will describe our work and refer to the source code available at www.github.com/sones/sones-dbpedia

 

We are always thinking about new ways to integrate GraphDB into existing environments. One kind of environment our users are working with right now is the Enterprise Service Bus (ESB), of which several are available.

One big player in the ESB envi­ron­ment is the Mule Open Source ESB:

“Mule is a lightweight enterprise service bus (ESB) and integration framework. It can handle services and applications using disparate transport and messaging technologies. The platform is Java-based, but can broker interactions between other platforms such as .NET using web services or sockets.

The archi­tec­ture is a scal­able, highly-distributable object bro­ker that can seam­lessly han­dle inter­ac­tions across legacy sys­tems, in-house appli­ca­tions and almost all mod­ern trans­ports and protocols.”

In order to show how a GraphDB inte­grates into those typ­i­cal ESB envi­ron­ments we cre­ated a small example.

The archi­tec­ture of this exam­ple is like this:

[Figure: mule-esb architecture diagram]

The idea is that an example message web app posts a message to the Mule ESB; this message is then transformed and finally consumed by a sones RESTful web service hosted by a GraphDB.

You can read more in the tutorial (Source 3) and download the source code from GitHub (Source 2).

Source 1: http://www.mulesoft.org/
Source 2: https://github.com/sones/sones-mule
Source 3: http://developers.sones.de/wiki/doku.php?id=tutorials:muleexampleapp

For many scenarios it’s important to know how a database performs, especially these days, when the number of databases seems to grow by the day and a choice is hard to make.

To demonstrate how sones GraphDB performs for given use cases, we created a benchmark framework and tool which basically divides benchmarking into two steps:

  1. Gen­er­ate and/or Import use-case spe­cific data and mea­sure the performance

  2. Exe­cute use-case spe­cific algo­rithms on the graph and mea­sure the performance

Because there are many different use cases, both steps are implemented as plug-ins, which can be addressed through the command line integrated into the benchmark tool.
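The two-step structure can be sketched as follows. This is not the API of the benchmark framework; the plug-ins are modelled as plain Python callables with invented names, and the "graph" is just an adjacency dictionary.

```python
import time


def timed(step, *args):
    """Run one step and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = step(*args)
    return result, time.perf_counter() - start


def run_benchmark(import_plugin, algorithm_plugin):
    """Step 1: generate/import data. Step 2: run an algorithm on the graph."""
    graph, import_seconds = timed(import_plugin)
    result, algorithm_seconds = timed(algorithm_plugin, graph)
    return {
        "result": result,
        "import_seconds": import_seconds,
        "algorithm_seconds": algorithm_seconds,
    }


def ring_import(size=1000):
    """Hypothetical import plug-in: a ring of nodes, one outgoing edge each."""
    return {node: [(node + 1) % size] for node in range(size)}


def count_edges(graph):
    """Hypothetical algorithm plug-in: count all edges in the graph."""
    return sum(len(targets) for targets in graph.values())


report = run_benchmark(ring_import, count_edges)
```

Keeping both steps behind the same small interface is what allows a new use case to be added without touching the measuring code.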

The framework, tool and plug-ins are released as AGPLv3-licensed open source software and can be downloaded from GitHub (Source 1).

We distribute the source code mainly because it’s the best way for you to reproduce the results and to look at what is actually being tested. The other main reason is that we want everybody to be able to benchmark and test their own algorithms on GraphDB.

Source 1: https://github.com/sones/benchmark
Source 2: http://developers.sones.de/wiki/doku.php?id=benchmarks

In the previous article of this series a short introduction to graph databases was provided. In article 3 of the series an initial scheme of the Crunchbase use case was created. In this article this initial scheme is going to be extended by more complex attributes. The possibility to alter a scheme whenever necessary is a feature of the sones GraphDB. In traditional relational database management systems such a scheme alteration is, if possible at all, very slow when dealing with large data sets. As a result, typical relational tables contain “reserved” columns which are filled with information later.

Once articles 3 and 4 have been applied successfully, the database should contain the basic attributes and data of the 5 main objects (nodes). At this point only 2 example relations (edges) have been inserted. The next step is to take a closer look at how the 5 node types can be linked together. A good example and starting point is the many-to-many (m:n) relationship of persons to companies and financial organizations.

In the sones GraphDB many-to-many (m:n) relationships are implicitly formed with one-to-many (1:n) relationships: one object holds a named SET<> attribute which then stores the edges to a number of objects. In the context of the example this means that a person has edges pointing to the companies and financial organizations this person has worked for.

Since the information about the relationship is held in the node type Relationship, this node type will be extended with edges to the node types Company and FinancialOrganisation. The edge to the person was already part of the original scheme. The corresponding GraphQL statement is:

  • ALTER VERTEX Rela­tion­ship ADD ATTRIBUTES(Company Com­pa­nyRe­la­tion­ship, Finan­cialOr­gan­i­sa­tion FinancialOrganisationRelationship)

Next, backward edges can be defined. This means that when a person is connected to a company, an edge from the company back to the person becomes available as well.

The GraphQL instruc­tions are:

  • ALTER VERTEX Com­pany ADD BACKWARDEDGES (Relationship.CompanyRelationship Relationships)
  • ALTER VERTEX Finan­cialOr­gan­i­sa­tion ADD BACKWARDEDGES ( Relationship.FinancialOrganisationRelationship Relationships)
  • ALTER VERTEX Per­son ADD BACKWARDEDGES (Relationship.Person Relationships)
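Conceptually, a backward edge is the inverse of an existing forward edge. The following sketch derives it explicitly from a dictionary of relationships; in GraphDB this happens inside the database, so the function and the data layout here are only an illustration.

```python
from collections import defaultdict


def backward_edges(relationships, edge_name):
    """Map each target of edge_name to the relationships pointing at it."""
    incoming = defaultdict(list)
    for relationship_id, relationship in relationships.items():
        target = relationship.get(edge_name)
        if target is not None:
            incoming[target].append(relationship_id)
    return dict(incoming)


# Hypothetical relationship nodes, keyed by an invented ID.
relationships = {
    "r1": {"Person": "andrew-cheung",
           "CompanyRelationship": "01-communique",
           "Title": "President & CEO"},
}
company_view = backward_edges(relationships, "CompanyRelationship")
person_view = backward_edges(relationships, "Person")
```

Here `company_view` plays the role of the Relationships attribute on Company, and `person_view` the one on Person.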

Now we have an extended scheme – only the data is still missing.

To extract it from the already exported JSON objects, a small JSON parser tool was written. This tool reads and deserializes all the previously mirrored JSON files into corresponding .NET objects. These objects are later used by the IScriptWriter implementations. Each implementation works on exactly one relation in the scheme and generates GraphQL queries from the information in the .NET objects.

This tool can be down­loaded as source code and pre-compiled binary. If run with­out para­me­ters, it tries to find the fold­ers com­pany, financial-organization, per­son, service-provider and prod­uct in the cur­rent folder. The result­ing scripts are then writ­ten to the cur­rent folder.

If the desired input and output folders differ from the current folder, the input and output directories can be passed to the program:

Con­nect­ing Nodes.exe [INPUT-FOLDER [OUTPUT-FOLDER]]

The out­put will look like this:

[Image: output of the Connecting Nodes tool]

The imple­men­ta­tion of the IScriptWriter inter­face for the rela­tion­ship from per­sons to com­pa­nies and finan­cial orga­ni­za­tions is shown in this pic­ture / code:

[Image: IScriptWriter implementation for the person relationships]

The interested reader is of course free to implement connections between nodes that are not implemented yet, for example the relationship of companies to products.

For example, a query from the file Step_4_Relationships.gql looks like this:

  • INSERT INTO Relationship VALUES (Person = REF(Permalink = 'andrew-cheung'), Title = 'President & CEO', IsPast = False, CompanyRelationship = REF(Permalink = '01-communique'))

In this case, the person “Andrew Cheung” is connected with the company “01 Communique”. The relationship is enriched with additional information, such as the job title he had or has in that company.

The backward edge from the company “01 Communique” to “Andrew Cheung” was generated automatically due to the defined backward edge and can be queried immediately.

This article demonstrated by example how connections can be created in GraphDB even after the import of data, linking existing data with each other. The following part 6 will show how easy it is to write and run complex queries in GraphQL on the sones GraphDB.

It’s about time to import some data into our previously established object scheme. If you want to do this yourself, first run the Crunchbase mirroring tool to create your own mirror on your hard disk.

In the next step another small tool needs to be written: a tool that creates nice, clean GraphQL import scripts for our data. Since every data source is different there’s not really a way around this step; in the end you’ll need to extract data there and import it here. An alternative would be to implement a dedicated importer for the GraphDB, but I’ll leave that for another article series. Back to our tool: it’s called “First-Import” and its only purpose is to create a first small graph out of the mirrored Crunchbase data and to fill the mainly primitive data attributes. The tool is available for download.

This is why in this first step we mainly focus on the fol­low­ing object types:

  • Com­pany
  • Finan­cialOr­ga­ni­za­tion
  • Per­son
  • Prod­uct
  • Ser­vi­ce­Provider

Additionally, all edges that point to a Company object, including the competition edges, will be imported in this part of the article series.

So what does the first-import tool do? Simple:

  1. it deserializes the JSON data into a usable object; in this case it’s written in C# and uses .NET’s own JavaScript deserializer
  2. it then maps all attributes of that deserialized JSON object to attribute names in our graph data object scheme, and it does so by outputting a simple query
    1. Simple attribute types like String and Integer are simply assigned using the “=” operator in the Graph Query Language
    2. 1:1 references are assigned by assigning a REF(…) to the attribute, for example: INSERT INTO Product VALUES (Company = REF(Permalink='companyname'))
    3. 1:n references are assigned by assigning a SETOF(…) to the attribute. Because we are not using a bulk import interface but the standard GraphQL REST interface, the object(s) we’re going to reference must already exist. We therefore chose to do this 1:n linking after creating the objects themselves, in a separate UPDATE step. The UPDATE looks like this: UPDATE Company SET (ADD TO Competitions SETOF(permalink='…',permalink='…')) WHERE Permalink = 'companyname'

For the most part it’s copy-and-paste work to get the first-import tool together. It could have been done in a more sophisticated way (like using reflection on the deserialized JSON objects), but that’s most probably part of another article.
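The steps of the first-import tool can be sketched like this. The real tool is written in C#; the JSON field names used here (`permalink`, `number_of_employees`, and `competitors` as a flat list of permalinks) are simplified assumptions for the illustration, and values are not escaped.

```python
import json


def company_statements(raw_json):
    """Return the INSERT and the optional linking UPDATE for one company."""
    company = json.loads(raw_json)
    permalink = company["permalink"]
    # Simple attribute types are assigned with "=".
    insert = (f"INSERT INTO Company VALUES (Permalink = '{permalink}', "
              f"NumberOfEmployees = {company['number_of_employees']})")
    # 1:n references become a SETOF(...) in a separate UPDATE.
    targets = ",".join(f"Permalink='{p}'" for p in company.get("competitors", []))
    update = None
    if targets:
        update = (f"UPDATE Company SET (ADD TO Competitions SETOF({targets})) "
                  f"WHERE Permalink = '{permalink}'")
    return insert, update


def build_script(raw_documents):
    """Emit all INSERTs first, then all UPDATEs, so link targets exist."""
    inserts, updates = [], []
    for raw in raw_documents:
        insert, update = company_statements(raw)
        inserts.append(insert)
        if update:
            updates.append(update)
    return inserts + updates


script = build_script([
    '{"permalink": "company1", "number_of_employees": 12, "competitors": ["company2"]}',
    '{"permalink": "company2", "number_of_employees": 7}',
])
```

The ordering in `build_script` mirrors the decision described above: linking happens only after all objects have been created.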

When run in the “crunchbase” directory created by the Crunchbase mirroring tool, the first-import tool generates GraphQL scripts, 6 of them to be precise:

[Image: crunchbase-first-import]

[Image: gql-scripts-part-4]

The last script is named “Step_3” because it’s sup­posed to come after all the others.

These scripts can easily be imported after establishing the object scheme. The thing is, though: it won’t be that fast. Why is that? We’re creating several thousand nodes and the edges between them. To create such an edge, the query language needs to identify the node the edge originates from and the node the edge should point to. To find these nodes the user is free to specify matching criteria, just like in a WHERE clause.

So if you run UPDATE Company SET (ADD TO Competitions SETOF(Permalink='company1',Permalink='company2')) WHERE Permalink = 'companyname', the GraphDB needs to access the node identified by the Permalink attribute with the value “companyname” and the two nodes with the values “company1” and “company2” to create the two edges. All the scripts will work as they are, but not as fast as they could. What helps to speed things up are indices. Indices are used by the GraphDB to identify and find specific objects, mainly in the evaluation of a WHERE clause.

The sones GraphDB offers a number of integrated indices, one of which is HASHTABLE, which we are going to use in this example. Furthermore, anyone interested can implement their own index plugin. We will have a tutorial on how to do that online in the future; if you’re interested now, just ask how we can help you make it happen!

Back to the indices in our example:

The syntax for creating an index is quite easy: the only thing you have to do is tell the CREATE INDEX query on which type and attribute the index should be created and which index type it should have. Since we’re using the Permalink attribute of the Crunchbase objects as an identifier in the example (it could be any other attribute or group of attributes that identifies one particular object), we want to create indices on the Permalink attribute for the full speed-up. This looks like this:

  • CREATE INDEX ON Com­pany (Perma­link) INDEXTYPE HashTable
  • CREATE INDEX ON Finan­cialOr­ga­ni­za­tion (Perma­link) INDEXTYPE HashTable
  • CREATE INDEX ON Per­son (Perma­link) INDEXTYPE HashTable
  • CREATE INDEX ON Ser­vi­ce­Provider (Perma­link) INDEXTYPE HashTable
  • CREATE INDEX ON Prod­uct (Perma­link) INDEXTYPE HashTable

Looks easy, is easy! To take full advantage, this index creation should of course be done before creating the first nodes and edges.
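Why a hash table index helps can be shown with plain Python data structures. This is a conceptual sketch, not how GraphDB implements its indices: without an index every lookup scans all nodes, with an index it is a single dictionary access.

```python
def find_by_scan(nodes, attribute, value):
    """Without an index: look at every node and compare the attribute."""
    return [node for node in nodes if node[attribute] == value]


def build_hash_index(nodes, attribute):
    """With an index: one pass to build, then constant-time lookups."""
    index = {}
    for node in nodes:
        index.setdefault(node[attribute], []).append(node)
    return index


# Hypothetical company nodes identified by their Permalink.
companies = [{"Permalink": f"company{i}"} for i in range(10000)]
permalink_index = build_hash_index(companies, "Permalink")

scanned = find_by_scan(companies, "Permalink", "company9999")
indexed = permalink_index.get("company9999", [])
```

Both lookups return the same node, but the scan has to compare up to 10000 permalinks for every edge that is created, which adds up quickly when thousands of edges are linked.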

After we got that sorted the only thing that’s left is to run the scripts. This will, depend­ing on your machine, take a minute or two.

So after running those scripts, the following has happened:

  1. all Company, FinancialOrganization, Person, ServiceProvider and Product objects are created and filled with primitive data types
  2. all attributes which are essentially references (1:1 or 1:n) to a Company object are set, these are
    1. Company.Competitions
    2. Product.Company

That’s it for this part – in the next part of the series we will dive deeper into con­nect­ing nodes with edges. There is a ton of things that can be done with the data – stay tuned for the next part.

After the overview and the first use-case intro­duc­tion it’s about time to play with some data objects.

So how can one actually access the data of Crunchbase? Easy as pie: Crunchbase offers an easy-to-use interface to get all information out of their database in a fairly structured JSON format. So we wrote a tool that downloads all the available data to a local machine so we can play with it as we like in the following steps.

This small tool is called MirrorCrunchbase and is available as binary and source code. Like all source code and tools in this series, it runs on Windows and Linux (Mono). You can use the source code to get an impression of what’s going on, or just use the included binaries (in bin/Debug) to mirror the data of Crunchbase.

To say a few words about what the MirrorCrunchbase tool actually does, here is a small source code excerpt:

[Image: codesnippet_1]

So first it gets the list of all objects, like the company names, and then it retrieves each company object according to its name and stores everything in .js files. Easy, eh?
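The two-stage mirroring can be sketched as follows. The actual tool is a C# program that talks to the Crunchbase interface; to stay self-contained, this sketch takes the two download steps as functions (`list_names` and `fetch_object` are hypothetical stand-ins) and feeds them from an in-memory dictionary.

```python
import json
import os
import tempfile


def mirror(list_names, fetch_object, target_dir):
    """Fetch the list of names, then store each object as <name>.js."""
    os.makedirs(target_dir, exist_ok=True)
    written = []
    for name in list_names():
        path = os.path.join(target_dir, name + ".js")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(fetch_object(name), handle)
        written.append(path)
    return written


# Stand-in data source instead of real HTTP requests.
FAKE_COMPANIES = {
    "company1": {"permalink": "company1", "number_of_employees": 12},
    "company2": {"permalink": "company2", "number_of_employees": 7},
}

company_dir = os.path.join(tempfile.mkdtemp(), "company")
files = mirror(lambda: sorted(FAKE_COMPANIES), FAKE_COMPANIES.__getitem__, company_dir)
```

Running `mirror` once per object kind (company, person, product and so on) with its own target folder would give one folder per kind.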

When it’s running you get an output similar to this:

[Image: mirror_run_linux]

And after the successful completion you should end up with a directory structure like this:

[Image: crunchbase_directory_structure]

The .js files store basically all information according to the data scheme overview picture of part 2. So what we want to do now is to transform this overview into a GraphQL data scheme we can start to work with. A main concept of sones GraphDB is to allow the user to evolve a data scheme over time. That way the user does not need to have the final data scheme before the first create statement. Instead the user can start with a basic data scheme representing only standard data types and add complex user-defined types as the migration goes along. That’s fundamentally different from what database administrators and users are used to today.

Today’s user-generated data evolves and grows, and it’s not possible to foresee in which way attributes will need to be added, removed or renamed. Maybe the scheme changes completely. Until now, every time it became necessary to change anything in an established and populated data scheme, a complex and costly migration process had to be started. Substantially reducing, or in some cases even eliminating, the need for such a process is a design goal of the sones GraphDB.

In the Crunch­base use-case this results in a fairly straight-forward process to estab­lish and fill the data scheme. First we cre­ate all types with their cor­rect name and add only those attrib­utes which can be filled from the start – like prim­i­tives or direct ref­er­ences. All Lists and Sets of Edges can be added later on.

So these would be the Create-Type State­ments to start with in this use-case:

  • CREATE TYPE Company ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, String Category, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At, Set<Company> Competitions )
  • CREATE TYPE FinancialOrganization ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, DateTime Created_At, String CrunchbaseURL, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, String Name, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At )
  • CREATE TYPE Product ATTRIBUTES ( String BlogFeedURL, String BlogURL, Company Company, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String HomepageURL, String InviteShareURL, DateTime Launched_At, String Name, String Overview, String Permalink, String StageCode, String Tags, String TwitterUsername, DateTime Updated_At )
  • CREATE TYPE ExternalLink ATTRIBUTES ( String ExternalURL, String Title )
  • CREATE TYPE EmbeddedVideo ATTRIBUTES ( String Description, String EmbedCode )
  • CREATE TYPE Image ATTRIBUTES ( String Attribution, Integer SizeX, Integer SizeY, String ImageURL )
  • CREATE TYPE IPO ATTRIBUTES ( DateTime Published_At, String StockSymbol, Double Valuation, String ValuationCurrency )
  • CREATE TYPE Acquisition ATTRIBUTES ( DateTime Acquired_At, Company Company, Double Price, String PriceCurrency, String SourceDestination, String SourceURL, String TermCode )
  • CREATE TYPE Office ATTRIBUTES ( String Address1, String Address2, String City, String CountryCode, String Description, Double Latitude, Double Longitude, String StateCode, String ZipCode )
  • CREATE TYPE Milestone ATTRIBUTES ( String Description, String SourceDescription, String SourceURL, DateTime Stoned_At )
  • CREATE TYPE Fund ATTRIBUTES ( DateTime Funded_At, String Name, Double RaisedAmount, String RaisedCurrencyCode, String SourceDescription, String SourceURL )
  • CREATE TYPE Person ATTRIBUTES ( String AffiliationName, String Alias_List, String Birthplace, String BlogFeedURL, String BlogURL, DateTime Birthday, DateTime Created_At, String CrunchbaseURL, String FirstName, String HomepageURL, Image Image, String LastName, String Overview, String Permalink, String Tags, String TwitterUsername, DateTime Updated_At )
  • CREATE TYPE Degree ATTRIBUTES ( String DegreeType, DateTime Graduated_At, String Institution, String Subject )
  • CREATE TYPE Relationship ATTRIBUTES ( Boolean Is_Past, Person Person, String Title )
  • CREATE TYPE ServiceProvider ATTRIBUTES ( String Alias_List, DateTime Created_At, String CrunchbaseURL, String EMailAdress, String HomepageURL, Image Image, String Name, String Overview, String Permalink, String PhoneNumber, String Tags, DateTime Updated_At )
  • CREATE TYPE Providership ATTRIBUTES ( Boolean Is_Past, ServiceProvider Provider, String Title )
  • CREATE TYPE Investment ATTRIBUTES ( Company Company, FinancialOrganization FinancialOrganization, Person Person )
  • CREATE TYPE FundingRound ATTRIBUTES ( Company Company, DateTime Funded_At, Double RaisedAmount, String RaisedCurrencyCode, String RoundCode, String SourceDescription, String SourceURL )
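
Statements in this format do not have to be typed by hand. The following is a minimal Python sketch, not part of the sones tooling, that assembles a CREATE TYPE statement from a list of attributes and sets aside List and Set attributes for later, following the "primitives and direct references first" idea described above. The helper names `create_type_statement` and `split_attributes` are made up for this illustration, and the `List<Relationship> Relationships` attribute is a hypothetical example rather than part of the script.

```python
def create_type_statement(type_name, attributes):
    """Build a CREATE TYPE statement in the format used in the list above.

    attributes is a list of (attribute_type, attribute_name) pairs.
    """
    attribute_list = ", ".join(
        f"{attr_type} {attr_name}" for attr_type, attr_name in attributes
    )
    return f"CREATE TYPE {type_name} ATTRIBUTES ( {attribute_list} )"


def split_attributes(attributes):
    """Separate attributes that can be filled right away from List/Set attributes."""
    initial, deferred = [], []
    for attr_type, attr_name in attributes:
        is_collection = attr_type.startswith(("List<", "Set<"))
        (deferred if is_collection else initial).append((attr_type, attr_name))
    return initial, deferred


# Reproduces the ExternalLink statement from the list above.
external_link = create_type_statement(
    "ExternalLink",
    [("String", "ExternalURL"), ("String", "Title")],
)
print(external_link)

# Hypothetical attribute list: the List attribute is made up for illustration.
person_attributes = [
    ("String", "FirstName"),
    ("String", "LastName"),
    ("List<Relationship>", "Relationships"),
]
initial, deferred = split_attributes(person_attributes)
person_statement = create_type_statement("Person", initial)
print(person_statement)
print("add later:", deferred)
```

The deferred attributes are only collected here; how they are added to an existing type afterwards depends on the GraphQL statements sones GraphDB offers for changing types, which this sketch does not try to reproduce.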

You can directly download the corresponding GraphQL script here. If you use the sonesExample application from our open source distribution, you can create a subfolder “scripts” in the binary directory and put the downloaded script file there. When you are using the integrated WebShell, which is launched on port 9975 by default and can be accessed by browsing to http://localhost:9975/WebShell, you can execute the script using the command “execdbscript” followed by the filename of the script.

As you can see, it is largely a straightforward copy-and-paste from the graphical scheme. References do not need a cumbersome relational helper either: if you want to reference a Company object you can just do that (we actually did, look for example at the last line of the GraphQL script above, where FundingRound has a Company attribute). As a result, when you execute the above script you get all the types needed to fill in data in the next step.

That's it for this part. In the next part of this series we will start the initial data import using a small tool that reads the mirrored data and outputs GraphQL insert queries.

Where to start: exist­ing data scheme and API

The name of this series already tells you what the use case is: CrunchBase. Their website explains what it is in their own words: “CrunchBase is the free database of technology companies, people, and investors that anyone can edit.” There are many reasons why it was chosen as a use case. One important reason is that all data behind the CrunchBase service is licensed under the Creative Commons Attribution (CC-BY) license, so it is freely available data about high-tech companies, people and investors.

Currently there are more than 40,000 companies, 51,000 people and 4,200 investors in the database. The amount of information is large and the degree of connectivity even larger. The graph formed by these nodes could be bigger still, but the limitations of current relational database technology make it impractical to attempt that.

sones GraphDB comes to the rescue because it is optimized to handle huge datasets of strongly connected data. Since the CrunchBase data could be used as a starting point to drive connectivity to even greater detail, it is a great use case for demonstrating the migration and handling of such data.

Thankfully the developers at CrunchBase already took one or two steps into an object-oriented world by offering an API that answers queries in JSON format. Using this API, everyone can access the complete data set in a very structured way. That is both good and bad. Because the technologies used do not offer a way to represent linked objects, they had to use what we call “relational helpers”. For example: a person founded a company (person and company each being a JSON object). There is no standardized way to model a relationship between those two. So the CrunchBase developers added a unique identifier to each object, and they added a new object that is used as a “relational helper” object. The only purpose of these helper objects is to point to the unique identifier of another object type. In our example, the relationship attribute of the person object does not point directly to a specific company or relationship; it points to the helper object, which stores which unique identifier of which object type the link refers to.
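
To make the helper pattern concrete, here is a small Python sketch with made-up data. The key names (`type`, `id`, `firm`) and the sample person and company are assumptions chosen for illustration, not the actual field names of the CrunchBase API. It shows that the helper object carries only enough information to find the target, so a second lookup is needed before any company data is available.

```python
import json

# Illustrative data only: key names and values are assumptions,
# not the real API fields.
person_json = """
{
  "type": "person",
  "id": "jane-doe",
  "name": "Jane Doe",
  "relationships": [
    {"title": "Founder", "is_past": false,
     "firm": {"type": "company", "id": "example-corp", "name": "Example Corp"}}
  ]
}
"""
company_json = """
{"type": "company", "id": "example-corp", "name": "Example Corp",
 "category": "software"}
"""

# Stand-in for the remote data set, indexed by (object type, unique identifier).
objects = {}
for raw in (person_json, company_json):
    obj = json.loads(raw)
    objects[(obj["type"], obj["id"])] = obj


def resolve(helper):
    """Follow a relational helper to the full object it points at."""
    return objects[(helper["type"], helper["id"])]


person = objects[("person", "jane-doe")]
helper = person["relationships"][0]["firm"]  # the "relational helper"
company = resolve(helper)                    # second lookup

print(helper)               # only type, id and name
print(company["category"])  # available only after resolving the helper
```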

To visualize this, here is the data scheme behind CrunchBase (including all currently available links):

CrunchbaseRelations

As you can see, there are many “relational helper” dead ends in the scheme. Until now, an application had to resolve these dead ends by going the extra mile. Instead of retrieving a person together with all relationships, and with them all the data one would expect, the application has to issue many separate queries and internally build a structure that is essentially a graph.

Another example is the company object. As the name implies, all data of a company is stored there. It holds an attribute called investments, which is not a primitive data type (like a number or text) but a user-defined complex data type called List<FundingRoundStructure>, that is, a simple list of FundingRoundStructure objects.

When we take a look at FundingRoundStructure, there is an attribute called company, which has the user-defined data type CompanyStructure. This CompanyStructure is one of these dead ends because it holds just a name and a unique ID. The application now needs to retrieve the right company object with this unique ID to access the company information.

Put simply: no matter where you start, you will always end up in a dead end that forces you to start over with the information you found there. It is neither user-friendly nor easy to implement.
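
The "start over at every dead end" effect can be sketched in a few lines of Python. Everything here is hypothetical: `RECORDS` is an in-memory stand-in for the remote API, the company names are invented, and `fetch` merely simulates one request per call. The point is that the application ends up crawling stub by stub and building the graph itself.

```python
from collections import deque

# Hypothetical in-memory stand-in for the remote API. Each funding round
# carries only a stub (type + id + name) for the company it refers to.
RECORDS = {
    ("company", "alpha"): {
        "name": "Alpha",
        "investments": [{"company": {"type": "company", "id": "beta", "name": "Beta"}}],
    },
    ("company", "beta"): {
        "name": "Beta",
        "investments": [{"company": {"type": "company", "id": "gamma", "name": "Gamma"}}],
    },
    ("company", "gamma"): {
        "name": "Gamma",
        "investments": [{"company": {"type": "company", "id": "alpha", "name": "Alpha"}}],
    },
}

fetch_log = []  # records every simulated request


def fetch(key):
    """Simulate one API request for a full object."""
    fetch_log.append(key)
    return RECORDS[key]


def crawl(start):
    """Build an adjacency mapping by fetching every stub reachable from start."""
    graph = {}
    queue = deque([start])
    while queue:
        key = queue.popleft()
        if key in graph:  # already resolved, avoids endless loops on cycles
            continue
        record = fetch(key)
        neighbours = [
            (funding_round["company"]["type"], funding_round["company"]["id"])
            for funding_round in record["investments"]
        ]
        graph[key] = neighbours
        queue.extend(neighbours)
    return graph


graph = crawl(("company", "alpha"))
print(len(fetch_log), "requests to assemble", len(graph), "nodes")
```

Even in this tiny example every node costs one request, and the application has to keep its own bookkeeping to avoid fetching the same object twice.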

The good news is that this kind of data, and the links within it, can be handled very easily. sones GraphDB provides a rich set of features to make life easier for developers and users. For instance, if we wanted to know which companies received funding from the same investor as, say, the company “facebook”, one short query would be all that is needed. Besides, those “relational helpers” are redundant information: in a graph database this information is stored in the form of edges, not in helper objects.
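
The following Python sketch is not a sones GraphDB query; it only illustrates why the question becomes simple once relationships are stored as direct edges. The investor and company names other than “facebook” are invented, and `co_funded` is a hypothetical helper written for this example.

```python
# Hypothetical funding edges: investor -> set of companies it funded.
funded_by = {
    "investor-a": {"facebook", "company-x", "company-y"},
    "investor-b": {"facebook", "company-z"},
    "investor-c": {"company-q"},
}


def co_funded(company, edges):
    """Return all companies that share at least one investor with `company`."""
    result = set()
    for companies in edges.values():
        if company in companies:
            result |= companies
    result.discard(company)  # the company itself is not its own peer
    return result


peers = co_funded("facebook", funded_by)
print(sorted(peers))  # ['company-x', 'company-y', 'company-z']
```

With edges available directly, the whole question is one traversal over the investors of a company and no helper objects have to be resolved along the way.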

The reason the CrunchBase developers had to use these helpers is that JSON and the relational tables behind it cannot store this information directly or query it directly. To learn more about relational tables and databases, try this link.

I want to end this part of the series with a pic­ture of the above rela­tional dia­gram (with­out the arrows and connections).

Crunchbase

The next part of the series will show how we can access the avail­able infor­ma­tion and how a graph scheme starts to evolve.