
the “Crunchbase use-case” part 4 – the initial data import

It’s about time to import some data into our previously established object scheme. If you want to do this yourself, first run the Crunchbase mirroring tool and create your own mirror on your hard disk.

In the next step another small tool needs to be written: a tool that creates nice, clean GraphQL import scripts for our data. Since every data source is different, there’s not really a way around this step; in the end you’ll need to extract data from one place and import it into another. One possible alternative would be to implement a dedicated importer for the GraphDB, but I’ll leave that for another article series. Back to our tool: it’s called “First-Import” and its only purpose is to create a first small graph out of the mirrored Crunchbase data and fill the mainly primitive data attributes. Download this tool here.

This is why, in this first step, we mainly focus on the following object types:

  • Company
  • FinancialOrganization
  • Person
  • Product
  • ServiceProvider

Additionally, all edges that point to a Company object, including the competition edges, will be imported in this part of the article series.

So what does the first-import tool do? Simple:

  1. It deserializes the JSON data into a usable object. The tool is written in C# and uses .NET’s own JavaScript deserializer.
  2. It then maps all attributes of that deserialized JSON object to attribute names in our graph data object scheme by outputting a simple query:
    1. Simple attribute types like String and Integer are simply assigned using the “=” operator in the Graph Query Language.
    2. 1:1 references are assigned by assigning a REF(…) to the attribute, for example: INSERT INTO Product VALUES (Company = REF(Permalink='companyname'))
    3. 1:n references are assigned by assigning a SETOF(…) to the attribute. Because we are not using a bulk import interface but the standard GraphQL REST interface, the object(s) we’re going to reference must already exist. We therefore chose to do this 1:n linking after creating the objects themselves, in a separate UPDATE step. The UPDATE looks like this: UPDATE Company SET (ADD TO Competitions SETOF(Permalink='…',Permalink='…')) WHERE Permalink = 'companyname'
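The mapping steps above can be sketched roughly as follows. The First-Import tool itself is written in C#; this is only a Python illustration of the idea, not the tool's actual code. The JSON field names (`name`, `permalink`, `company`, `competitions`, `competitor`), the `Name` attribute, the helper function names and the string escaping rule are all assumptions made for this example.

```python
import json


def gql_string(value):
    """Quote a value as a GQL string literal.

    The backslash escaping used here is an assumption, not confirmed GQL behaviour.
    """
    return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"


def insert_company(company):
    """Primitive attributes are assigned with '='."""
    values = [
        "Name = " + gql_string(company["name"]),
        "Permalink = " + gql_string(company["permalink"]),
    ]
    return "INSERT INTO Company VALUES (" + ", ".join(values) + ")"


def insert_product(product):
    """A 1:1 reference is assigned with REF(...)."""
    values = [
        "Permalink = " + gql_string(product["permalink"]),
        "Company = REF(Permalink=" + gql_string(product["company"]["permalink"]) + ")",
    ]
    return "INSERT INTO Product VALUES (" + ", ".join(values) + ")"


def update_competitions(company):
    """1:n references are linked in a separate UPDATE, once all nodes exist."""
    targets = [c["competitor"]["permalink"] for c in company.get("competitions", [])]
    if not targets:
        return None  # nothing to link for this company
    setof = ",".join("Permalink=" + gql_string(p) for p in targets)
    return ("UPDATE Company SET (ADD TO Competitions SETOF(" + setof + ")) "
            "WHERE Permalink = " + gql_string(company["permalink"]))


# Hypothetical sample records standing in for files from the mirror.
company = json.loads(
    '{"name": "Example Co", "permalink": "example-co", "competitions": ['
    '{"competitor": {"permalink": "rival-one"}}, '
    '{"competitor": {"permalink": "rival-two"}}]}'
)
product = json.loads('{"permalink": "example-app", "company": {"permalink": "example-co"}}')

insert_stmt = insert_company(company)
product_stmt = insert_product(product)
update_stmt = update_competitions(company)

print(insert_stmt)
print(product_stmt)
print(update_stmt)
```

Writing all INSERT statements to the earlier scripts and all UPDATE statements to a later one keeps the guarantee that every referenced object already exists when it is linked.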

For the most part, putting the first-import tool together is copy-and-paste work. It could have been done in a more sophisticated way (like using reflection on the deserialized JSON objects), but that’s most probably part of another article.

When run in the “crunchbase” directory created by the Crunchbase mirroring tool, the first-import tool generates GraphQL scripts, 6 of them to be precise:

[Image: crunchbase-first-import]

[Image: gql-scripts-part-4]

The last script is named “Step_3” because it’s supposed to come after all the others.

These scripts can easily be imported after establishing the object scheme. The thing is, though, it won’t be that fast. Why is that? We’re creating several thousand nodes and the edges between them. To create such an edge, the Query Language needs to identify the node the edge originates from and the node the edge should point to. To find these nodes the user is free to specify matching criteria just like in a WHERE clause.

So if you run UPDATE Company SET (ADD TO Competitions SETOF(Permalink='company1',Permalink='company2')) WHERE Permalink = 'companyname', the GraphDB needs to access the node identified by the Permalink attribute with the value “companyname” and the two nodes with the values “company1” and “company2” to create the two edges. The scripts will work just as they are, but not as fast as they could. What can help to speed things up are indices. Indices are used by the GraphDB to identify and find specific objects, mainly in the evaluation of a WHERE clause.
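To see why an index helps with this kind of lookup, here is a purely conceptual Python sketch. It does not describe how the sones GraphDB works internally; the `Node` class and the function names are made up for illustration. It contrasts finding a node by scanning every object with finding it through a hash table keyed on the Permalink value.

```python
class Node:
    """Hypothetical stand-in for a Company object in the graph."""

    def __init__(self, permalink):
        self.permalink = permalink
        self.competitions = []  # outgoing edges to other nodes


def find_by_scan(nodes, permalink):
    """Without an index: compare every node until one matches."""
    for node in nodes:
        if node.permalink == permalink:
            return node
    return None


def build_index(nodes):
    """With a hash table index: one dictionary lookup per Permalink value."""
    return {node.permalink: node for node in nodes}


nodes = [Node("company%d" % i) for i in range(10000)]
index = build_index(nodes)

# Creating two edges needs three lookups: the source node and both targets.
source = index["company1"]
for target in ("company2", "company3"):
    source.competitions.append(index[target])

# Both approaches find the same node; the scan just touches far more objects.
last = find_by_scan(nodes, "company9999")
print(last is index["company9999"])
```

With several thousand nodes and an UPDATE per company, the scan cost is paid again for every edge, which is where the import time goes.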

The sones GraphDB offers a number of integrated indices, one of which is HASHTABLE, which we are going to use in this example. Furthermore, everyone interested can implement their own index plugin. We will have a tutorial on how to do that online in the future; if you’re interested now, just ask how we can help you to make it happen!

Back to the indices in our example:

The syntax for creating an index is quite easy: the only thing you have to do is tell the CREATE INDEX query on which type and attribute the index should be created and which index type it should use. Since we’re using the Permalink attribute of the Crunchbase objects as an identifier in the example (it could be any other attribute or group of attributes that identifies one particular object), we want to create indices on the Permalink attribute for the full speed-up. This looks like this:

  • CREATE INDEX ON Company (Permalink) INDEXTYPE HashTable
  • CREATE INDEX ON FinancialOrganization (Permalink) INDEXTYPE HashTable
  • CREATE INDEX ON Person (Permalink) INDEXTYPE HashTable
  • CREATE INDEX ON ServiceProvider (Permalink) INDEXTYPE HashTable
  • CREATE INDEX ON Product (Permalink) INDEXTYPE HashTable

Looks easy, is easy! To take advantage of it, of course, this index creation should be done before creating the first nodes and edges.
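If you generate your import scripts with a tool anyway, the index statements can be generated the same way and placed first. The following is a minimal Python sketch under that assumption; the function names and the sample INSERT and UPDATE statements are hypothetical, and only the CREATE INDEX syntax is taken from the list above.

```python
INDEXED_TYPES = [
    "Company",
    "FinancialOrganization",
    "Person",
    "ServiceProvider",
    "Product",
]


def create_index_statements(types, attribute="Permalink", index_type="HashTable"):
    """Build one CREATE INDEX statement per object type."""
    return [
        "CREATE INDEX ON %s (%s) INDEXTYPE %s" % (type_name, attribute, index_type)
        for type_name in types
    ]


def assemble_script(index_stmts, insert_stmts, update_stmts):
    """Order matters: indices first, then nodes, then the edges between them."""
    return index_stmts + insert_stmts + update_stmts


index_stmts = create_index_statements(INDEXED_TYPES)

# Hypothetical sample statements standing in for the generated import scripts.
insert_stmts = [
    "INSERT INTO Company VALUES (Permalink = 'company1')",
    "INSERT INTO Company VALUES (Permalink = 'companyname')",
]
update_stmts = [
    "UPDATE Company SET (ADD TO Competitions SETOF(Permalink='company1')) "
    "WHERE Permalink = 'companyname'",
]

script = assemble_script(index_stmts, insert_stmts, update_stmts)
print("\n".join(script))
```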

Once that’s sorted, the only thing left is to run the scripts. This will, depending on your machine, take a minute or two.

So after running those scripts, what happened is:

  1. all Company, FinancialOrganization, Person, ServiceProvider and Product objects are created and filled with primitive data types
  2. all attributes which are essentially references (1:1 or 1:n) to a Company object are set; these are
    1. Company.Competitions
    2. Product.Company

That’s it for this part. In the next part of the series we will dive deeper into connecting nodes with edges. There is a ton of things that can be done with the data, so stay tuned for the next part.

4 comments on “the “Crunchbase use-case” part 4 – the initial data import”

  • developers.sones.de » The “CrunchBase use-case” – part 1 – Overview

    […] part 4: The initial data import […]

  • Obed

    I am a computer engineering student and am very interested in learning to use sones GraphDB. Could you send me some examples of how to integrate my web application with the solution you have developed?

  • merando

    Nice explanation. I hope you will launch the next part soon, because it’s nearly 2 months since the last one.

    Maybe you can answer the following question:
    Is it possible to connect the nodes with edges which hold strings?
    Std syntax e.g.: CREATE TYPE Profile ATTRIBUTES(SET<WEIGHTED(Double, DEFAULT = 1.0, SORTED DESC)> Value

    I want to do something like this:
    CREATE TYPE Profile ATTRIBUTES(SET<WEIGHTED(String, DEFAULT = "") > Value

    Thank you for your help

  • bietiekay

    we’re on it!
