The “Crunchbase use-case” part 3 – How does a graph data scheme start?

After the overview and the first use-case introduction, it’s about time to play with some data objects.

So how can one actually access the Crunchbase data? Easy as pie: Crunchbase offers an easy-to-use interface that returns all the information in its database in a fairly structured JSON format. So we wrote a tool that downloads all the available data to a local machine, so we can play with it as we like in the following steps.

This small tool is called MirrorCrunchbase and can be downloaded as binary and source code here. Like all source code and tools in this series, it runs on Windows and Linux (Mono). You can read the source code to get an impression of what’s going on there, or just use the included binaries (in bin/Debug) to mirror the Crunchbase data.

To say a few words about what the MirrorCrunchbase tool actually does, here is a small source code excerpt first:

codesnippet_1

So first it gets the list of all objects, such as the company names, and then it retrieves each company object by its name and stores everything in .js files. Easy, eh?
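The two steps described above (fetch the list, then fetch each object and store it) can be sketched roughly as follows. This is not the MirrorCrunchbase source; it is a minimal Python sketch in which the `fetch` callable, the API paths, the `permalink` key and the file layout are all assumptions made for illustration.

```python
import json
import os


def mirror(fetch, kind, target_dir):
    """Mirror all objects of one kind (e.g. "company") into .js files.

    `fetch(path)` is a hypothetical callable that returns the JSON text for
    a given API path. The paths and the "permalink" key used below are
    assumptions, not the documented Crunchbase interface.
    """
    os.makedirs(target_dir, exist_ok=True)

    # Step 1: get the list of all objects of this kind.
    listing = json.loads(fetch(f"{kind}/list"))

    stored = []
    for entry in listing:
        name = entry["permalink"]
        # Step 2: retrieve each object by its name and store it as <name>.js.
        body = fetch(f"{kind}/{name}")
        path = os.path.join(target_dir, f"{name}.js")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
        stored.append(path)
    return stored
```

In a real run `fetch` would wrap an HTTP GET against the Crunchbase interface; passing it in as a parameter keeps the loop itself independent of the network.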

While it’s running you get output similar to this:

mirror_run_linux

And after successful completion you should end up with a directory structure like this:

crunchbase_directory_structure

The .js files store basically all the information shown in the data scheme overview picture of part 2. So what we want to do now is transform this overview into a GQL data scheme we can start to work with. A main concept of sones GraphDB is to let the user evolve a data scheme over time. That way the user does not need to have the final data scheme before the first create statement. Instead, the user can start with a basic data scheme that uses only standard data types and add complex user-defined types as the migration goes along. That’s a fundamentally different approach from what database administrators and users are used to today.

Today’s user-generated data evolves and grows, and it’s not possible to foresee which attributes will need to be added, removed or renamed. Maybe the scheme changes completely. Until now, every time it became necessary to change anything in an established and populated data scheme, it was time to start a complex and costly migration process. Substantially reducing, or in some cases even eliminating, the need for such a complex process is a design goal of sones GraphDB.

In the Crunchbase use-case this results in a fairly straightforward process to establish and fill the data scheme. First we create all types with their correct names and add only those attributes which can be filled from the start, like primitives or direct references. All Lists and Sets of Edges can be added later on.
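To make that split concrete, here is a small Python sketch that separates the attributes usable from the start from the ones to add later and renders a create statement for the first group. It assumes that a collection attribute can be recognised by a type written as `Set<...>` or `List<...>`; the helper names and the `List<Degree> Degrees` attribute are made up for illustration. The statements used in this series are written by hand and do not follow this rule mechanically.

```python
def split_attributes(attributes):
    """Split (type, name) pairs into initial and deferred attributes.

    Assumption for this sketch: a collection attribute is recognisable by
    a type written as Set<...> or List<...>; everything else (primitives
    and direct references) can be created right away.
    """
    initial, deferred = [], []
    for attr_type, attr_name in attributes:
        if attr_type.startswith(("Set<", "List<")):
            deferred.append((attr_type, attr_name))
        else:
            initial.append((attr_type, attr_name))
    return initial, deferred


def create_type(type_name, attributes):
    """Render a CREATE TYPE statement in the layout used in this series."""
    body = ", ".join(f"{attr_type} {attr_name}" for attr_type, attr_name in attributes)
    return f"CREATE TYPE {type_name} ATTRIBUTES ( {body} )"


# Illustrative subset only; "List<Degree> Degrees" is a hypothetical attribute.
person = [
    ("String", "FirstName"),
    ("String", "LastName"),
    ("Image", "Image"),
    ("List<Degree>", "Degrees"),
]

initial, deferred = split_attributes(person)
print(create_type("Person", initial))
print("to be added later:", deferred)
```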

So these would be the CREATE TYPE statements to start with in this use-case:

  • CREATE TYPE Company ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, String Category, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At, Set<Company> Competitions )
  • CREATE TYPE FinancialOrganization ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, DateTime Created_At, String CrunchbaseURL, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, String Name, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At )
  • CREATE TYPE Product ATTRIBUTES ( String BlogFeedURL, String BlogURL, Company Company, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String HomepageURL, String InviteShareURL, DateTime Launched_At, String Name, String Overview, String Permalink, String StageCode, String Tags, String TwitterUsername, DateTime Updated_At )
  • CREATE TYPE ExternalLink ATTRIBUTES ( String ExternalURL, String Title )
  • CREATE TYPE EmbeddedVideo ATTRIBUTES ( String Description, String EmbedCode )
  • CREATE TYPE Image ATTRIBUTES ( String Attribution, Integer SizeX, Integer SizeY, String ImageURL )
  • CREATE TYPE IPO ATTRIBUTES ( DateTime Published_At, String StockSymbol, Double Valuation, String ValuationCurrency )
  • CREATE TYPE Acquisition ATTRIBUTES ( DateTime Acquired_At, Company Company, Double Price, String PriceCurrency, String SourceDestination, String SourceURL, String TermCode )
  • CREATE TYPE Office ATTRIBUTES ( String Address1, String Address2, String City, String CountryCode, String Description, Double Latitude, Double Longitude, String StateCode, String ZipCode )
  • CREATE TYPE Milestone ATTRIBUTES ( String Description, String SourceDescription, String SourceURL, DateTime Stoned_At )
  • CREATE TYPE Fund ATTRIBUTES ( DateTime Funded_At, String Name, Double RaisedAmount, String RaisedCurrencyCode, String SourceDescription, String SourceURL )
  • CREATE TYPE Person ATTRIBUTES ( String AffiliationName, String Alias_List, String Birthplace, String BlogFeedURL, String BlogURL, DateTime Birthday, DateTime Created_At, String CrunchbaseURL, String FirstName, String HomepageURL, Image Image, String LastName, String Overview, String Permalink, String Tags, String TwitterUsername, DateTime Updated_At )
  • CREATE TYPE Degree ATTRIBUTES ( String DegreeType, DateTime Graduated_At, String Institution, String Subject )
  • CREATE TYPE Relationship ATTRIBUTES ( Boolean Is_Past, Person Person, String Title )
  • CREATE TYPE ServiceProvider ATTRIBUTES ( String Alias_List, DateTime Created_At, String CrunchbaseURL, String EMailAdress, String HomepageURL, Image Image, String Name, String Overview, String Permalink, String PhoneNumber, String Tags, DateTime Updated_At )
  • CREATE TYPE Providership ATTRIBUTES ( Boolean Is_Past, ServiceProvider Provider, String Title )
  • CREATE TYPE Investment ATTRIBUTES ( Company Company, FinancialOrganization FinancialOrganization, Person Person )
  • CREATE TYPE FundingRound ATTRIBUTES ( Company Company, DateTime Funded_At, Double RaisedAmount, String RaisedCurrencyCode, String RoundCode, String SourceDescription, String SourceURL )

You can directly download the corresponding GQL script here. If you use the sonesExample application from our open source distribution, you can create a subfolder “scripts” in the binary directory and put the downloaded script file there. When you’re using the integrated WebShell, which is launched on port 9975 by default and can be accessed by browsing to http://localhost:9975/WebShell, you can execute the script using the command “execdbscript” followed by the filename of the script.

As you can see, it’s pretty much a straightforward copy-and-paste job from the graphical scheme. Even references don’t need a complicated relational helper construct: if you want to reference a Company object you can just do that (we actually did; look, for example, at the last line of the GQL script above). As a result, when you execute the above script you get all the types necessary to fill in data in the next step.
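Because a reference is simply an attribute whose type is another user-defined type, the references can be read straight off the attribute lists. The following is a minimal Python sketch of that idea, assuming the statement layout used in this post and treating String, Integer, Double, Boolean and DateTime as the primitive types; the helper name `referenced_types` is made up for illustration and is not part of sones GraphDB.

```python
import re

# Primitive type names as they appear in the statements of this post.
PRIMITIVES = {"String", "Integer", "Double", "Boolean", "DateTime"}


def referenced_types(statement):
    """Return the type name and the user-defined types it references."""
    match = re.match(r"\s*CREATE TYPE (\w+) ATTRIBUTES \((.*)\)\s*$", statement)
    if match is None:
        raise ValueError(f"not a CREATE TYPE statement: {statement!r}")
    type_name, body = match.groups()

    refs = set()
    for attribute in body.split(","):
        attr_type, _attr_name = attribute.split()
        # Unwrap Set<X> / List<X> so the element type is reported.
        inner = re.fullmatch(r"(?:Set|List)<(\w+)>", attr_type)
        if inner:
            attr_type = inner.group(1)
        if attr_type not in PRIMITIVES:
            refs.add(attr_type)
    return type_name, refs


statement = (
    "CREATE TYPE FundingRound ATTRIBUTES ( Company Company, DateTime Funded_At, "
    "Double RaisedAmount, String RaisedCurrencyCode, String RoundCode, "
    "String SourceDescription, String SourceURL )"
)
name, refs = referenced_types(statement)
print(name, "references", sorted(refs))
```

Running the same helper over all statements gives a quick check that every referenced type is actually defined somewhere in the script.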

So that’s it for this part – in the next part of this series we will start the initial data import using a small tool which reads the mirrored data and outputs GQL insert queries.
