The “CrunchBase use-case” – part 2 – A short introduction

Where to start: existing data scheme and API

The name of this series already gives away the use-case: CrunchBase. Their website explains what it is in their own words: “CrunchBase is the free database of technology companies, people, and investors that anyone can edit.” There are many reasons why this was chosen as a use-case. One important reason is that all data behind the CrunchBase service is licensed under a Creative Commons Attribution (CC-BY) license. So it's freely available data on high-tech companies, people and investors.


Currently there are more than 40,000 different companies, 51,000 different people and 4,200 different investors in the database. The flood of information is big and the scale of connectivity even bigger. The graph formed by these nodes could be bigger still, but the limitations of current relational database technology make it infeasible to try that.

sones GraphDB comes to the rescue because it's optimized to handle huge datasets of strongly connected data. Since the CrunchBase data could be used as a starting point to drive connectivity to even greater detail, it's a great use-case to show this kind of migration and handling.

Thankfully the developers at CrunchBase already made one or two steps into an object-oriented world by offering an API which answers queries in JSON format. By using this API everyone can access the complete data set in a very structured way. That's both good and bad. Because the technologies used don't offer a way to represent linked objects, they had to use what we call “relational helpers”. For example: a person founded a company (person and company each being a JSON object). There's no standardized way to model a relationship between those two. So the CrunchBase developers added a unique identifier to each object, and they added a new object which is used as a “relational helper” object. The only purpose of these helper objects is to point to the unique identifier of another object type. So in our example the relationship attribute of the person object does not point directly to a specific company or relationship; it points to the helper object, which stores which unique identifier of which object type is meant by that link.
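To make the “relational helper” idea a bit more tangible, here is a minimal Python sketch. The payload and all field names in it (`permalink`, `relationships`, `firm`, `type`) are assumptions made up for illustration and are not taken from the real CrunchBase API; the point is only that the helper holds a type and an identifier, not the linked object itself.

```python
import json

# Hypothetical person payload; field names are illustrative only,
# not the actual CrunchBase schema.
person_json = """
{
  "permalink": "jane-doe",
  "name": "Jane Doe",
  "relationships": [
    {
      "title": "Founder",
      "firm": {"type": "company", "permalink": "example-corp", "name": "Example Corp"}
    }
  ]
}
"""

person = json.loads(person_json)

# The "relational helper": it only says which identifier of which
# object type is meant, it does not contain the company's data.
helper = person["relationships"][0]["firm"]

print(helper["type"], helper["permalink"])
print(sorted(helper.keys()))
```

Anything beyond the name and identifier of the company would need a second lookup using the information stored in the helper.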

To visualize this, here's the data scheme behind CrunchBase (plus all currently available links):

[Figure: CrunchbaseRelations, the CrunchBase data scheme with all links]

As you can see, there are many more “relational helper” dead-ends in the scheme. Up until now an application had to resolve these dead-ends by going the extra mile. So instead of retrieving a person together with all relationships, and with them all the data one would expect, the application has to split the work into many queries to internally build a structure which essentially is a graph.
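The “extra mile” described above can be sketched roughly like this. It is not the real API: `FAKE_API` is a made-up in-memory stand-in, `fetch` simulates a single round trip, and all object names are hypothetical. The sketch just shows how an application follows one helper reference after another until it has assembled a graph on its own side.

```python
from collections import deque

# Made-up stand-in for the remote API: (object type, unique id) -> object.
FAKE_API = {
    ("person", "jane-doe"): {
        "name": "Jane Doe",
        "links": [("company", "example-corp")],
    },
    ("company", "example-corp"): {
        "name": "Example Corp",
        "links": [("person", "jane-doe"), ("investor", "seed-fund")],
    },
    ("investor", "seed-fund"): {"name": "Seed Fund", "links": []},
}


def fetch(ref):
    """Simulate one API query that resolves a single helper reference."""
    return FAKE_API[ref]


def build_graph(start):
    """Follow helper references breadth-first and count the queries needed."""
    graph = {}
    queries = 0
    queue = deque([start])
    while queue:
        ref = queue.popleft()
        if ref in graph:
            continue  # already resolved, no need to query again
        obj = fetch(ref)
        queries += 1
        graph[ref] = list(obj["links"])
        queue.extend(obj["links"])
    return graph, queries


graph, queries = build_graph(("person", "jane-doe"))
print(queries, "queries to assemble", len(graph), "nodes")
```

Even in this tiny example every node costs its own query, and the graph only exists after the application has stitched it together.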

Another example would be the company object. As the name implies, all data of a company is stored there. It holds an attribute called investments which isn't a primitive data type (like a number or text) but a user-defined complex data type. This user-defined data type is called List<FundingRoundStructure>. So it's a simple list of FundingRoundStructure objects.

When we take a look at the FundingRoundStructure, there's an attribute called company which has the user-defined data type CompanyStructure. This CompanyStructure is one of these dead-ends because it holds just a name and a unique ID. The application now needs to retrieve the right company object with this unique ID to access the company information.
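Expressed as Python data classes, the structure described above might look like the following sketch. Only the type names and the attributes investments, company, name and unique ID come from the scheme; the attribute spellings, the `resolve` helper and the sample companies are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompanyStructure:
    """The dead-end: only a name and a unique id, no company data."""
    name: str
    unique_id: str


@dataclass
class FundingRoundStructure:
    company: CompanyStructure


@dataclass
class Company:
    unique_id: str
    name: str
    investments: List[FundingRoundStructure] = field(default_factory=list)


def resolve(stub: CompanyStructure, companies: Dict[str, Company]) -> Company:
    """Hypothetical second lookup: fetch the full company behind the stub."""
    return companies[stub.unique_id]


# Made-up sample data: "Alpha Inc" has invested in "Beta Inc".
beta = Company(unique_id="beta", name="Beta Inc")
alpha = Company(
    unique_id="alpha",
    name="Alpha Inc",
    investments=[FundingRoundStructure(CompanyStructure("Beta Inc", "beta"))],
)
companies = {c.unique_id: c for c in (alpha, beta)}

stub = alpha.investments[0].company
full = resolve(stub, companies)
print(stub.unique_id, "->", full.name)
```

The stub alone never gives access to the investments of the referenced company; that always takes the additional lookup.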

Put simply: no matter where you start, you will always end up in a dead-end which forces you to start over with the information you found in that dead-end. That is neither user-friendly nor easy to implement.

The good news is that there is a very easy way to handle this type of data and the links between the data. sones GraphDB provides a rich set of features to make the life of developers and users easier. In that context: if we would like to know which companies received funding from the same investor as, let's say, the company “facebook”, the only thing necessary would be one short query. Besides that, those “relational helpers” are redundant information. That means in a graph database this information would be stored in the form of edges and not in any helper objects.
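To illustrate why edges make this kind of question short, here is a small Python sketch. It is not sones GraphDB query syntax, just a plain dictionary acting as an edge list, and the investor and company names other than “facebook” are made-up placeholders.

```python
# Edges "company --funded by--> investor", stored directly instead of
# going through helper objects. All names except "facebook" are placeholders.
funded_by = {
    "facebook": {"investor-a", "investor-b"},
    "company-x": {"investor-a"},
    "company-y": {"investor-c"},
    "company-z": {"investor-b", "investor-c"},
}


def co_funded(company, edges):
    """Return all other companies sharing at least one investor with `company`."""
    investors = edges[company]
    return sorted(
        other
        for other, other_investors in edges.items()
        if other != company and other_investors & investors
    )


print(co_funded("facebook", funded_by))
```

Because the links are stored as edges, the question becomes a single traversal instead of a chain of lookups through helper objects.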

The reason why the developers of CrunchBase had to use these helpers is that JSON and the relational tables behind it aren't able to store this information directly or to query it directly. To learn more about those relational tables and databases try this link.

I want to end this part of the series with a picture of the above relational diagram (without the arrows and connections).

[Figure: Crunchbase, the data scheme without arrows and connections]

The next part of the series will show how we can access the available information and how a graph scheme starts to evolve.
