Category: use-case
DBPedia data is provided in several RDF triple files. Each line in each file gives one “complete” piece of information, made up of subject, predicate and object, e.g.
mappingbased_properties_en.nt: (some line)
<http://dbpedia.org/resource/12_Monkeys>
<http://dbpedia.org/ontology/editing>
<http://dbpedia.org/resource/Mick_Audsley> .
This stands for: “12 Monkeys” has the “editor” “Mick Audsley”.
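As a minimal sketch of how such a line can be split into its three parts, the following Python snippet parses the triple shown above. It only handles lines in which subject, predicate and object are all URLs in angle brackets; literal objects are out of scope. The function names `parse_triple` and `local_name` are made up for this illustration and are not taken from the sones parser.

```python
import re

# One triple line whose subject, predicate and object are all <...> URLs.
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def parse_triple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None

def local_name(url):
    """Return the part after the last slash, with underscores shown as spaces."""
    return url.rsplit('/', 1)[-1].replace('_', ' ')

line = ('<http://dbpedia.org/resource/12_Monkeys> '
        '<http://dbpedia.org/ontology/editing> '
        '<http://dbpedia.org/resource/Mick_Audsley> .')

subject, predicate, obj = parse_triple(line)
print(local_name(subject), '|', local_name(predicate), '|', local_name(obj))
```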
In other files there is additional information available, e.g. that
- “12 Monkeys” is a film
- “Mick Audsley” is a person
- … probably more information about “12 Monkeys” and “Mick Audsley”
What we want to do in sones GraphDB is to create a VERTEX for the film “12 Monkeys”. This includes
- type information – 12 Monkeys is a film
- a set of properties – e.g. its budget
- EDGES to related information, e.g. the editor Mick Audsley.
The VERTEX “12 Monkeys” is then a single point of information that holds all properties and relations in a single instance. To import the VERTEX “12 Monkeys”, we had to write a parser over all available triple files that gives us all related information from the DBPedia data set.
At this point we had two options for implementing this parser. The first was to read all triple files in a dedicated order to ensure data validity (we need to know that “12 Monkeys” is a movie to be able to assign the predicate “editor” unambiguously). The second was to add an intermediate step: create a temporary file that collects all data without validation, and do the import afterwards.
We decided on the intermediate step, because it allows some synchronization while reading the triple files and avoids creating invalid data, since the exported data can be cross-checked easily.
This step is represented by project “2_ParseAndConvertTripleDataFiles” in solution GraphDBPedia available at http://github.com/sones/sones-dbpedia. The parser reads only a subset of the offered data files to show the functionality and focus on the added value.
The result of the export for “Apollo 8” looks like this:
1 VertexID=-9223372036854775808
2 http://dbpedia.org/resource/Apollo_8=http://dbpedia.org/ontology/SpaceMission
3 LongAbstract_de=viel text
4 LongAbstract_en=a lot of text
5 http://dbpedia.org/ontology/commandModule_en=CM-103
6 http://dbpedia.org/ontology/missionDuration_en=529242.0
7 http://dbpedia.org/ontology/lunarOrbitTime_en=72613.0
8 http://dbpedia.org/ontology/crewSize_en=3
9 http://dbpedia.org/ontology/lunarModule_en=Ballast: Lunar Test Article B
10 http://dbpedia.org/ontology/serviceModule_en=SM-103
11 http://dbpedia.org/ontology/nextMission_en=http://dbpedia.org/resource/Apollo-9-patch.png
12 http://dbpedia.org/ontology/booster_en=http://dbpedia.org/resource/Saturn_V
13 http://dbpedia.org/ontology/previousMissions_en= http://dbpedia.org/resource/AP7lucky7.png
14 http://dbpedia.org/ontology/launchPad_en=http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39
15 ShortAbstract_en=some text
16 Name_en=http://dbpedia.org/resource/Apollo_8
17 http://dbpedia.org/ontology/SpaceMission/lunarOrbitTime_en=20.170277777777777
18 http://dbpedia.org/ontology/SpaceMission/missionDuration_en=6.125486111111111
Apart from one property, all data has been exported from the triple files. While importing the ontology information (line 2 in this example), we also created a VertexID that is unique for the corresponding VERTEX TYPE. Referring to this ID allows unique and fast linking during the data import, which happens later.
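The intermediate step can be pictured as grouping all triples by their subject and writing one key=value record per subject. The Python sketch below illustrates the idea only. The function names, the fixed language suffix and the way IDs are counted up from the smallest 64-bit value are assumptions for this example; in particular it numbers all records in one sequence, whereas the export described above keeps the ID unique per VERTEX TYPE.

```python
# Predicate used in RDF to state the type of a resource.
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

def collect(triples, language='en'):
    """Group (subject, predicate, object) triples into one record per subject."""
    records = {}
    for subject, predicate, obj in triples:
        pairs = records.setdefault(subject, [])
        if predicate == RDF_TYPE:
            # The type information goes first, as in line 2 of the export.
            pairs.insert(0, (subject, obj))
        else:
            pairs.append((predicate + '_' + language, obj))
    return records

def render(records, first_id=-9223372036854775808):
    """Turn the records into key=value lines (assumed ID scheme: simple counter)."""
    lines = []
    for offset, pairs in enumerate(records.values()):
        lines.append('VertexID=%d' % (first_id + offset))
        lines.extend('%s=%s' % pair for pair in pairs)
    return lines

triples = [
    ('http://dbpedia.org/resource/Apollo_8',
     'http://dbpedia.org/ontology/crewSize', '3'),
    ('http://dbpedia.org/resource/Apollo_8',
     RDF_TYPE, 'http://dbpedia.org/ontology/SpaceMission'),
    ('http://dbpedia.org/resource/Apollo_8',
     'http://dbpedia.org/ontology/booster', 'http://dbpedia.org/resource/Saturn_V'),
]

for line in render(collect(triples)):
    print(line)
```

Because the triples arrive in arbitrary order across files, collecting first and validating afterwards keeps the reader simple: the type line can be moved to the front whenever it shows up.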
After this intermediate step, the real import can be done. sones GraphDB offers GraphQL (GQL) as a simple and intuitive query language. Based on the data structure prepared above, two steps have to be done with GQL: first create all VERTICES including all their properties, and afterwards create the links between the VERTICES.
For the example above, two statements would therefore be created:
INSERT INTO httpdbpediaorgontologySpaceMission VALUES (
VertexID=-9223372036854775808,
LongAbstract_de='viel text',
LongAbstract_en='a lot of text',
Name_en='http://dbpedia.org/resource/Apollo_8',
httpdbpediaorgontologycommandModule_en='CM-103',
httpdbpediaorgontologymissionDuration_en=529242.0,
httpdbpediaorgontologylunarOrbitTime_en=72613.0,
httpdbpediaorgontologycrewSize_en=3,
httpdbpediaorgontologylunarModule_en='Ballast: Lunar Test Article B',
httpdbpediaorgontologyserviceModule_en='SM-103',
ShortAbstract_en='some text',
httpdbpediaorgontologySpaceMissionlunarOrbitTime_en=20.170277777777777,
httpdbpediaorgontologySpaceMissionmissionDuration_en=6.125486111111111
)
UPDATE httpdbpediaorgontologySpaceMission SET
(
httpdbpediaorgontologynextMission_en=SETOF(Name_en='http://dbpedia.org/resource/Apollo-9'),
httpdbpediaorgontologybooster_en=SETOF(Name_en='http://dbpedia.org/resource/Saturn_V'),
httpdbpediaorgontologypreviousMission_en=SETOF(Name_en='http://dbpedia.org/resource/AP7'),
httpdbpediaorgontologylaunchPad_en=SETOF(Name_en='http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39')
)
WHERE VertexID=-9223372036854775808
The problem with this approach is that EDGES are set via a WHERE condition that may not be unique, or that refers to an attribute that is not set at all on the target VERTEX. One way to solve this is to look up the ID of the target VERTEX first and do the linking via that ID.
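To make the two-step idea concrete, here is a rough Python sketch that builds an INSERT and an UPDATE statement of the shape shown above from plain dictionaries. The helper names are hypothetical, attribute names are expected to be already stripped of forbidden characters, and quote characters inside string values are not escaped, so this is an illustration rather than a usable generator.

```python
def gql_value(value):
    """Write numbers as-is and everything else as a single-quoted string."""
    if isinstance(value, (int, float)):
        return repr(value)
    return "'%s'" % value  # no escaping of quotes in this sketch

def insert_statement(vertex_type, vertex_id, properties):
    """Step 1: create the VERTEX with all of its plain properties."""
    parts = ['VertexID=%d' % vertex_id]
    parts.extend('%s=%s' % (name, gql_value(value))
                 for name, value in properties.items())
    return 'INSERT INTO %s VALUES (%s)' % (vertex_type, ', '.join(parts))

def update_statement(vertex_type, vertex_id, edges):
    """Step 2: link the VERTEX to its targets, which are looked up by Name_en."""
    parts = ["%s=SETOF(Name_en='%s')" % (name, target)
             for name, target in edges.items()]
    return 'UPDATE %s SET (%s) WHERE VertexID=%d' % (
        vertex_type, ', '.join(parts), vertex_id)

vertex_type = 'httpdbpediaorgontologySpaceMission'
vertex_id = -9223372036854775808

print(insert_statement(vertex_type, vertex_id, {
    'httpdbpediaorgontologycommandModule_en': 'CM-103',
    'httpdbpediaorgontologycrewSize_en': 3,
}))
print(update_statement(vertex_type, vertex_id, {
    'httpdbpediaorgontologybooster_en': 'http://dbpedia.org/resource/Saturn_V',
}))
```

Keeping the two generators separate mirrors the ordering constraint: all INSERT statements have to run before the first UPDATE, because an edge can only point to a VERTEX that already exists.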
sones GraphDB also offers another import option, XmlBulkImport. It is faster than GraphQL (because it uses the graph file system interfaces directly) and it organizes the INSERTING and LINKING of data itself. Instead of generating GraphQL, a proprietary XML structure has to be created, and the import is done via a single IMPORT GQL statement.
A description of this format and its usage can be found at: http://developers.sones.de/wiki/doku.php?id=importexport:xmlbulkimport
This XmlBulkImport data file is created by project “3_ParseAndConvertTripleDataFiles” in solution GraphDBPedia available at http://github.com/sones/sones-dbpedia.
The first step was to transfer the ontology, provided in Web Ontology Language (OWL) format, into GraphDB VERTEX TYPES and EDGES. For this, a parser was implemented that reads the OWL file, converts it into a class model and exports it as a GQL CREATE VERTEX TYPES statement.
The ontology currently contains 273 classes (DBPedia 3.6) and thousands of datatype properties and object properties. A short demonstration of its main structures is shown here:
OWL-Class:
<owl:Class rdf:about="http://dbpedia.org/ontology/Island">
<rdfs:label xml:lang="en">island</rdfs:label>
<rdfs:label xml:lang="el">νησί</rdfs:label>
<rdfs:label xml:lang="fr">île</rdfs:label>
<rdfs:subClassOf
rdf:resource="http://dbpedia.org/ontology/PopulatedPlace">
</rdfs:subClassOf>
</owl:Class>
- represents an Island (based on a PopulatedPlace)
OWL-DatatypeProperty:
<owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
<rdfs:label xml:lang="en">number of islands</rdfs:label>
<rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
<rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"></rdfs:range>
</owl:DatatypeProperty>
- describes the non-negative Integer attribute “number of islands” for the class Island.
OWL-ObjectProperty:
<owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
<rdfs:label xml:lang="en">highest state</rdfs:label>
<rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"></rdfs:domain>
<rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"></rdfs:range>
</owl:ObjectProperty>
- represents the highest “PopulatedPlace” on an island.
The conversion creates a
- VERTEX TYPE – one for each class,
- having multiple PROPERTIES – from datatype properties
- and multiple EDGES – from object-properties
Within the data schema there is a large number of mutual dependencies. The CREATE VERTEX TYPES statement resolves all of them and creates a valid data schema.
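The conversion from OWL elements to a class model can be sketched in a few lines of Python using the standard library XML parser. The sketch reads the three element kinds shown above and sorts datatype properties into PROPERTIES and object properties into EDGES of their domain class. The function names and the dictionary layout are assumptions for this example, and it does not emit the GQL statement itself.

```python
import xml.etree.ElementTree as ET

NS = {
    'owl': 'http://www.w3.org/2002/07/owl#',
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
}
RDF = '{%s}' % NS['rdf']  # prefix for namespaced attributes such as rdf:about

def resource(element, tag):
    """Return the rdf:resource of the first child with the given tag, if any."""
    child = element.find(tag, NS)
    return None if child is None else child.get(RDF + 'resource')

def read_ontology(owl_xml):
    """Build {class URL: {'parent', 'properties', 'edges'}} from an OWL document."""
    root = ET.fromstring(owl_xml)
    types = {}
    for cls in root.findall('owl:Class', NS):
        types[cls.get(RDF + 'about')] = {
            'parent': resource(cls, 'rdfs:subClassOf'),
            'properties': {},
            'edges': {},
        }
    for tag, slot in (('owl:DatatypeProperty', 'properties'),
                      ('owl:ObjectProperty', 'edges')):
        for prop in root.findall(tag, NS):
            domain = resource(prop, 'rdfs:domain')
            if domain in types:
                types[domain][slot][prop.get(RDF + 'about')] = resource(prop, 'rdfs:range')
    return types

OWL = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/Island">
    <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:Class>
  <owl:DatatypeProperty rdf:about="http://dbpedia.org/ontology/numberOfIslands">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"/>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:about="http://dbpedia.org/ontology/highestState">
    <rdfs:domain rdf:resource="http://dbpedia.org/ontology/Island"/>
    <rdfs:range rdf:resource="http://dbpedia.org/ontology/PopulatedPlace"/>
  </owl:ObjectProperty>
</rdf:RDF>'''

island = read_ontology(OWL)['http://dbpedia.org/ontology/Island']
print(island['parent'])
print(island['properties'])
print(island['edges'])
```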
In addition to the ontology from the OWL file, we added some vertex types to fix some problems we ran into and to enhance the functionality a little:
- First, the VERTEX TYPE Thing was not described in the ontology. It is the base class that all other VERTEX TYPES are based upon.
- To reflect disambiguation, we created a VERTEX TYPE Instance with an EDGE to a SET of Thing. In case there is a disambiguation, an Instance refers to the corresponding NODEs in the GraphDB.
- Within the RDF files, labels are saved in dedicated triples. We added a dedicated VERTEX TYPE for them as well, to avoid a mix-up in case one label refers to multiple Instances.
Currently, GraphDB has some limitations regarding the characters allowed in the names of VERTEX TYPES, their ATTRIBUTES and EDGES. The OWL and RDF formats generally use URLs as identifiers, but GraphDB cannot work with colons, dots and slashes (both slash and backslash) in names. Our simple workaround was to keep the URL and remove all occurrences of these characters. This leads us from http://dbpedia.org/ontology/Island to httpdbpediaorgontologyIsland.
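The workaround is small enough to show in full. The following Python sketch removes the four characters named above; the function name `to_graphdb_name` is made up for this example.

```python
# Characters that cannot be used in GraphDB names: colon, dot, slash, backslash.
FORBIDDEN = ':./\\'

def to_graphdb_name(url):
    """Keep the URL text but drop every forbidden character."""
    return ''.join(ch for ch in url if ch not in FORBIDDEN)

print(to_graphdb_name('http://dbpedia.org/ontology/Island'))
```

One thing to keep in mind with this kind of mapping is that it cannot be reversed and that two different URLs can in principle collapse into the same name, so a uniqueness check over the generated names is a cheap safeguard.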
Another challenge is the type mapping between OWL and GraphDB. GraphDB supports the simple C# data types, whereas the DBPedia OWL uses 9 datatypes from XML Schema plus DBPedia-specific area units, speed units, density units, time units, volume units, distance units and several others. This led us to a huge switch statement that does the mapping. All properties could be represented with C# data types without data loss.
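In spirit, such a mapping is a lookup table from range URL to target type. The Python sketch below shows the shape of it with a handful of XML Schema types. The concrete target types, the treatment of every DBPedia unit type as a floating-point number and the unit URL prefix are assumptions made for this illustration, not the mapping used in the project.

```python
XSD = 'http://www.w3.org/2001/XMLSchema#'
# Assumed URL prefix of the DBPedia-specific unit datatypes.
UNIT_PREFIX = 'http://dbpedia.org/datatype/'

# Assumed mapping from XML Schema datatypes to C# type names.
TYPE_MAP = {
    XSD + 'string': 'String',
    XSD + 'boolean': 'Boolean',
    XSD + 'integer': 'Int64',
    XSD + 'nonNegativeInteger': 'UInt64',
    XSD + 'double': 'Double',
    XSD + 'float': 'Double',
    XSD + 'date': 'DateTime',
}

def map_type(range_url):
    """Return the C# type name for an OWL range, or fail loudly if unknown."""
    if range_url in TYPE_MAP:
        return TYPE_MAP[range_url]
    if range_url.startswith(UNIT_PREFIX):
        return 'Double'  # units are treated as plain numbers in this sketch
    raise ValueError('no mapping for ' + range_url)

print(map_type(XSD + 'nonNegativeInteger'))
```

Raising an error for unknown ranges, instead of silently falling back to a string, makes gaps in the table visible the first time a new ontology version introduces another datatype.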
Wikipedia is available in multiple languages. The DBPedia export is currently provided in 99 of them.
Some time later (during the next steps) we found out that the data sometimes differs slightly between languages, since there are different authors. This is relevant for the data schema, because there are several options for handling it.
One option is to let the application logic of the data importer decide. We decided instead to make the data schema language specific and provide separate, language-specific attributes. This makes the data schema somewhat larger, but does not lead to any data loss. Additionally, application logic can be implemented later on to check the data quality for each node.
The command-line tool “1_CreateGqlSchemaFromOntology”, part of the Visual Studio solution available at GitHub (https://github.com/sones/sones-dbpedia), creates the CREATE VERTEX TYPES statements as described above, based on the ontology of DBPedia 3.6. Later versions have not been tested yet.
The command-line executable has to be started with 2 parameters:
- .owl filename: the filename has to be either an absolute path or located within the executable’s directory.
- result .gql file: name of the file that all queries will be written to.
At runtime, the user is asked for all languages that have to be reflected in the schema. Our suggestion is to use two-letter codes like “_en” or “_de”. An empty string ends the iteration.
After the execution, the resulting .gql file can easily be imported via the IMPORT GQL statement.
DBPedia is already stored in a machine-readable format (RDF). We started a proof of concept to show that GraphDB is able to meet these requirements too, and to find out the differences, advantages and disadvantages of the two concepts.
In RDF, the data model stands next to the data. Within sones GraphDB there is a close connection between each object (node) and its (vertex) type. For example, the node “Homer Simpson” knows that he’s a “FictionalCharacter”.
Our expectation was that GraphDB requires less hard-disk space and also offers a better data store, since all information about an object is saved in a single node instead of several triple data files. Besides, any relationship between two objects (e.g. a person and their birthplace) is saved directly on that object. When a node is loaded, all information is available from a single location.
During the project we discovered several problems that can be solved with this idea. The resulting data network enables customers to find complex relationships between any nodes using graph algorithms. Disambiguation of words is possible using the schema information (e.g. Tuareg can be either nomads living in the Sahara or a vehicle built by a German car vendor).
We had our first contact with DBPedia back in May 2010. A prospect asked us whether or not GraphDB is the best way to reflect the data schema and import all data. After getting a first impression from the DBPedia website:
from www.dbpedia.org/About:
“DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.”
We decided: Yes, it is!
from www.dbpedia.org/Datasets:
DBpedia uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data. Please refer to the Developers Guide to Semantic Web Toolkits to find a development toolkit in your preferred programming language to process DBpedia data.
The DBpedia knowledge base currently describes more than 3.64 million things, out of which 1.83 million are classified in a consistent Ontology, including 416,000 persons, 526,000 places (including 360,000 populated places), 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations (including 40,000 companies and 38,000 educational institutions), 183,000 species and 5,400 diseases.
At that time we did not yet have much experience with the Semantic Web, so there was probably some work to do.
The following blog articles will describe our work and refer to the source code available at www.github.com/sones/sones-dbpedia.
We are always thinking about new ways to integrate GraphDB into existing environments. One kind of environment our users are working with right now is the Enterprise Service Bus (ESB), of which several are available.
One big player in the ESB environment is the Mule Open Source ESB:
“Mule is a lightweight enterprise service bus (ESB) and integration framework. It can handle services and applications using disparate transport and messaging technologies. The platform is Java-based, but can broker interactions between other platforms such as .NET using web services or sockets.
The architecture is a scalable, highly-distributable object broker that can seamlessly handle interactions across legacy systems, in-house applications and almost all modern transports and protocols.”
In order to show how a GraphDB integrates into those typical ESB environments we created a small example.
The architecture of this example is like this:

The idea is that an example message web app posts a message to the Mule ESB. This message then gets transformed and is finally consumed by a sones RESTful web service hosted by a GraphDB.
You can read more in the tutorial (Source 3) and download the source code from GitHub (Source 2).
Source 1: http://www.mulesoft.org/
Source 2: https://github.com/sones/sones-mule
Source 3: http://developers.sones.de/wiki/doku.php?id=tutorials:muleexampleapp
For many scenarios it’s important to know how a database performs, especially these days, when the number of databases seems to grow by the day and a choice is hard to make.
To demonstrate how sones GraphDB performs in given use cases, we created a benchmark framework and tool which basically divides benchmarking into two steps:
- Generate and/or import use-case specific data and measure the performance
- Execute use-case specific algorithms on the graph and measure the performance
Because there are many different use cases, both steps are implemented as plug-ins which can be addressed using the command line integrated into the benchmark tool.
The framework, tool and plug-ins are released as AGPLv3-licensed open source software and can be downloaded from GitHub (Source 1).
We distribute the source code mainly because it’s the best way for you to reproduce the results and look at what is actually being tested. The other main reason is that we want everybody to be able to benchmark and test their own algorithms on GraphDB.

Source 1: https://github.com/sones/benchmark
Source 2: http://developers.sones.de/wiki/doku.php?id=benchmarks
In the previous article of this series a short introduction to graph databases was provided. In article 3 of the series an initial scheme of the Crunchbase use case was created. In this article this initial scheme is going to be extended with more complex attributes. The possibility to alter a scheme whenever necessary is a feature of the sones GraphDB. In traditional relational database management systems such a scheme alteration is, if possible at all, very slow when dealing with large data sets. As a result, typical relational tables contain “reserved” columns which are filled with information later.
Once articles 3 and 4 have been applied successfully, the database should contain the basic attributes and data of the 5 main objects (nodes). At this point only 2 example relations (edges) have been inserted. The next step is to take a closer look at how the 5 node types can be linked together. A good example and starting point is the many-to-many (m:n) relationship of persons to companies and financial organizations.
In the sones GraphDB many-to-many (m:n) relationships are implicitly formed with one-to-many (1:n) relationships: one object holds a named SET<> attribute which then stores the edges to a number of objects. In the context of the example this means that a person has edges pointing to the companies and financial organizations this person has worked for.
Since the information about the relationship is held in the node type Relationship, this node type will be extended with edges to the node types Company and FinancialOrganisation. The edge to the person was already in the original scheme. The corresponding GraphQL expression is:
- ALTER VERTEX Relationship ADD ATTRIBUTES(Company CompanyRelationship, FinancialOrganisation FinancialOrganisationRelationship)
Now backward edges can be set up. This means that if a person is connected to a company, an edge from the company back to the person appears automatically.
The GraphQL instructions are:
- ALTER VERTEX Company ADD BACKWARDEDGES (Relationship.CompanyRelationship Relationships)
- ALTER VERTEX FinancialOrganisation ADD BACKWARDEDGES ( Relationship.FinancialOrganisationRelationship Relationships)
- ALTER VERTEX Person ADD BACKWARDEDGES (Relationship.Person Relationships)
Now we have an extended scheme – only the data is still missing.
To extract this from the already exported JSON objects, a small JSON parser tool was written. This tool reads and deserializes all the previously mirrored JSON files into corresponding .NET objects. These objects are later used by the IScriptWriter implementations. Each implementation works on exactly one relation in the scheme and generates GraphQL queries using the information of the .NET objects.
This tool can be downloaded as source code and pre-compiled binary. If run without parameters, it tries to find the folders company, financial-organization, person, service-provider and product in the current folder. The resulting scripts are then written to the current folder.
If the desired input and output folders differ from the current folder, they can be passed to the program:
Connecting Nodes.exe [INPUT-FOLDER [OUTPUT-FOLDER]]
The output will look like this:

The implementation of the IScriptWriter interface for the relationship from persons to companies and financial organizations is shown in this picture / code:

The interested reader is of course free to implement connections between nodes that are not implemented yet, for example the relationship of companies to products.
For example, a query from the file Step_4_Relationships.gql looks like this:
- INSERT INTO Relationship VALUES (Person = REF(Permalink = 'andrew-cheung'), Title = 'President & CEO', IsPast = False, CompanyRelationship = REF(Permalink = '01-communique'))
In this case, the person “Andrew Cheung” is connected with the company “01 Communique”. The relationship is enriched with additional information, such as the job title he had or has in that company.
The backward edge from the company “01 Communique” to “Andrew Cheung” was generated automatically due to the defined reverse relationship (backward edge) and can be queried immediately.
This article demonstrated by example how connections can be created in GraphDB even after the import of data, and how existing data can be linked. The following part 6 will show how easy it is to write and run complex queries in GraphQL on the sones GraphDB.
It’s about time to import some data into our previously established object scheme. If you want to do this yourself, first run the Crunchbase mirroring tool and create your own mirror on your hard disk.
In the next step another small tool needs to be written: a tool that creates nice, clean GraphQL import scripts for our data. Since every data source is different, there’s not really a way around this step. In the end you’ll need to extract data from one place and import it into another. One possible alternative would be to implement a dedicated importer for the GraphDB, but I’ll leave that for another article series. Back to our tool: it’s called “First-Import” and its only purpose is to create a first small graph out of the mirrored Crunchbase data and fill the mainly primitive data attributes. Download this tool here.
This is why in this first step we mainly focus on the following object types:
- Company
- FinancialOrganization
- Person
- Product
- ServiceProvider
Additionally all edges to a company object and the competition will be imported in this part of the article series.
So what does the first-import tool do? Simple:
- it deserializes the JSON data into a usable object; in this case it’s written in C# and uses .NET’s own JavaScript deserializer
- it then maps all attributes of that deserialized JSON object to attribute names in our graph data object scheme, and it does so by outputting a simple query
- Simple attribute types like String and Integer are simply assigned using the “=” operator in the Graph Query Language
- 1:1 references are assigned by assigning a REF(…) to the attribute, for example: INSERT INTO Product VALUES (Company = REF(Permalink='companyname'))
- 1:n references are assigned by assigning a SETOF(…) to the attribute. Because we are not using a bulk import interface but the standard GraphQL REST interface, the object(s) we’re going to reference must already exist. Therefore we chose to do this 1:n linking in a separate UPDATE step after creating the objects themselves. Knowing this, the UPDATE looks like this: UPDATE Company SET (ADD TO Competitions SETOF(permalink='…',permalink='…')) WHERE Permalink = 'companyname'
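The mapping step described in the list above can be sketched as follows. The original tool is written in C#, so this Python version is only an illustration of the idea. The JSON field names in the sample, the mapping table and the function name are assumptions, and quote characters inside values are not escaped.

```python
import json

def insert_query(type_name, json_text, mapping):
    """Deserialize one JSON object and emit an INSERT for its simple attributes.

    mapping translates JSON field names to attribute names of the graph scheme.
    Fields that are missing or null in the JSON are skipped.
    """
    data = json.loads(json_text)
    assignments = []
    for json_key, attribute in mapping.items():
        value = data.get(json_key)
        if value is None:
            continue
        if isinstance(value, str):
            value = "'%s'" % value  # strings are quoted, numbers are not
        assignments.append('%s = %s' % (attribute, value))
    return 'INSERT INTO %s VALUES (%s)' % (type_name, ', '.join(assignments))

# Hypothetical excerpt of a mirrored company file.
sample = '{"permalink": "01-communique", "number_of_employees": 20, "homepage_url": null}'

# Hypothetical mapping from JSON fields to attributes of the Company type.
company_mapping = {
    'permalink': 'Permalink',
    'number_of_employees': 'NumberOfEmployees',
    'homepage_url': 'HomepageURL',
}

print(insert_query('Company', sample, company_mapping))
```

Driving the conversion from a mapping table is one way to avoid the copy-and-paste per object type; the reflection-based approach mentioned in the text would be another.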
For the most part, putting the first-import tool together is copy and paste. It could have been done in a more sophisticated way (for example by using reflection on the deserialized JSON objects), but that’s most probably part of another article.
When run in the “crunchbase” directory created by the Crunchbase Mirroring tool the first-import tool generates GraphQL scripts – 6 of them to be precise:


The last script is named “Step_3” because it’s supposed to come after all the others.
These scripts can easily be imported after establishing the object scheme. The thing is, though, that it won’t be that fast. Why is that? We’re creating several thousand nodes and the edges between them. To create such an edge, the Query Language needs to identify the node the edge originates from and the node the edge should point to. To find these nodes, the user is free to specify matching criteria just like in a WHERE clause.
So if you do a UPDATE Company SET (ADD TO Competitions SETOF(Permalink='company1',Permalink='company2')) WHERE Permalink = 'companyname', the GraphDB needs to access the node identified by the Permalink attribute with the value “companyname” and the two nodes with the values “company1” and “company2” to create the two edges. The scripts will work just as they are, but they won’t be as fast as they could be. What can help to speed things up are indices. Indices are used by the GraphDB to identify and find specific objects. They are used mainly in the evaluation of a WHERE clause.
The sones GraphDB offers a number of integrated indices, one of which is HASHTABLE, which we are going to use in this example. Furthermore, everyone interested can implement their own index plugin. We will have a tutorial on how to do that online in the future. If you’re interested now, just ask how we can help you to make it happen!
Back to the indices in our example:
The syntax for creating an index is quite easy: the only thing you have to do is tell the CREATE INDEX query on which type and attribute the index should be created and which index type it should use. Since we’re using the Permalink attribute of the Crunchbase objects as an identifier in the example (it could be any other attribute or group of attributes that identifies one particular object), we want to create indices on the Permalink attribute for the full speed-up. This would look like this:
- CREATE INDEX ON Company (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON FinancialOrganization (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON Person (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON ServiceProvider (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON Product (Permalink) INDEXTYPE HashTable
Looks easy, is easy! To take full advantage, the index creation should of course be done before creating the first nodes and edges.
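Why a hash table index helps can be seen in a small, purely conceptual Python sketch. It is not how GraphDB implements its indices; it only contrasts a lookup that scans all nodes with a lookup through a dictionary keyed by Permalink. All names in it are made up for this example.

```python
def find_by_scan(nodes, permalink):
    """Without an index: inspect nodes one by one until a match is found."""
    for node in nodes:
        if node['Permalink'] == permalink:
            return node
    return None

def build_hash_index(nodes):
    """With an index: one pass to build it, then one dictionary access per lookup."""
    return {node['Permalink']: node for node in nodes}

# A few thousand synthetic nodes, roughly the size of the import described above.
nodes = [{'Permalink': 'company%d' % number} for number in range(5000)]
index = build_hash_index(nodes)

# Both lookups return the same node; only the amount of work differs.
print(find_by_scan(nodes, 'company4999') is index['company4999'])
```

Each UPDATE in the import scripts needs at least two such lookups, one for the WHERE clause and one per SETOF entry, which is why the effect adds up quickly over thousands of statements.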
After we got that sorted the only thing that’s left is to run the scripts. This will, depending on your machine, take a minute or two.
So after running those scripts, the following has happened:
- all Company, FinancialOrganization, Person, ServiceProvider and Product objects have been created and filled with primitive data types
- all attributes which are essentially references (1:1 or 1:n) to a Company object have been set, these are
- Company.Competitions
- Product.Company
That’s it for this part – in the next part of the series we will dive deeper into connecting nodes with edges. There is a ton of things that can be done with the data – stay tuned for the next part.
After the overview and the first use-case introduction it’s about time to play with some data objects.
So how can one actually access the data of Crunchbase? Easy as pie: Crunchbase offers an easy-to-use interface to get all information out of their database in a fairly structured JSON format. So what we did was write a tool that downloads all the available data to a local machine, so we can play with it as we like in the following steps.
This small tool is called MirrorCrunchbase and can be downloaded in binary and source code form here. Like all source code and tools in this series, it runs on Windows and Linux (Mono). You can use the source code to get an impression of what’s going on, or just use the included binaries (in bin/Debug) to mirror the data of Crunchbase.
To say a few words about what the MirrorCrunchbase tool actually does, first a small source code excerpt:

So first it gets the list of all objects, like the company names, and then it retrieves each company object according to its name and stores everything in .js files. Easy, eh?
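The two-stage loop just described can be sketched like this. It is not the MirrorCrunchbase source (which is C#), and it deliberately leaves out the HTTP part: the two fetch functions are passed in as parameters, and the ones used in the demonstration are stand-ins that return canned data. All names are assumptions for this example.

```python
import json
import os
import tempfile

def mirror(fetch_list, fetch_object, namespaces, target_dir):
    """For every namespace: get the list of entries, then store each object as .js."""
    for namespace in namespaces:
        folder = os.path.join(target_dir, namespace)
        os.makedirs(folder, exist_ok=True)
        for entry in fetch_list(namespace):
            permalink = entry['permalink']
            path = os.path.join(folder, permalink + '.js')
            with open(path, 'w', encoding='utf-8') as out:
                json.dump(fetch_object(namespace, permalink), out)

# Stand-ins for the real download functions, so the sketch runs offline.
def fake_list(namespace):
    return [{'permalink': '01-communique'}]

def fake_object(namespace, permalink):
    return {'permalink': permalink, 'namespace': namespace}

target = tempfile.mkdtemp()
mirror(fake_list, fake_object, ['company', 'person'], target)
print(sorted(os.listdir(target)))
```

Passing the download functions in as parameters keeps the file layout logic testable without network access and makes it easy to add throttling or retries around the real calls.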
When it’s running you get an output similar to that:

And after the successful completion you should end up with a directory structure

The .js files store basically all information according to the data scheme overview picture of part 2. So what we want to do now is to transform this overview into a GraphQL data scheme we can start to work with. A main concept of sones GraphDB is to allow the user to evolve a data scheme over time. That way the user does not need to have the final data scheme before the first create statement. Instead the user can start with a basic data scheme representing only standard data types and add complex user-defined types as the migration goes along. That’s a fundamentally different approach from what database administrators and users are used to today.
Today’s user-generated data evolves and grows, and it’s not possible to foresee in which way attributes will need to be added, removed or renamed. Maybe the scheme changes completely. Until now, every time it became necessary to change anything in an established and populated data scheme, a complex and costly migration process had to be started. To substantially reduce or, in some cases, even eliminate the need for such a complex process is a design goal of the sones GraphDB.
In the Crunchbase use-case this results in a fairly straight-forward process to establish and fill the data scheme. First we create all types with their correct name and add only those attributes which can be filled from the start – like primitives or direct references. All Lists and Sets of Edges can be added later on.
So these would be the Create-Type Statements to start with in this use-case:
- CREATE TYPE Company ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, String Category, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At, Set<Company> Competitions )
- CREATE TYPE FinancialOrganization ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, DateTime Created_At, String CrunchbaseURL, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, String Name, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At )
- CREATE TYPE Product ATTRIBUTES ( String BlogFeedURL, String BlogURL, Company Company, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String HomepageURL, String InviteShareURL, DateTime Launched_At, String Name, String Overview, String Permalink, String StageCode, String Tags, String TwitterUsername, DateTime Updated_At)
- CREATE TYPE ExternalLink ATTRIBUTES ( String ExternalURL, String Title )
- CREATE TYPE EmbeddedVideo ATTRIBUTES ( String Description, String EmbedCode )
- CREATE TYPE Image ATTRIBUTES ( String Attribution, Integer SizeX, Integer SizeY, String ImageURL )
- CREATE TYPE IPO ATTRIBUTES ( DateTime Published_At, String StockSymbol, Double Valuation, String ValuationCurrency )
- CREATE TYPE Acquisition ATTRIBUTES ( DateTime Acquired_At, Company Company, Double Price, String PriceCurrency, String SourceDestination, String SourceURL, String TermCode )
- CREATE TYPE Office ATTRIBUTES ( String Address1, String Address2, String City, String CountryCode, String Description, Double Latitude, Double Longitude, String StateCode, String ZipCode )
- CREATE TYPE Milestone ATTRIBUTES ( String Description, String SourceDescription, String SourceURL, DateTime Stoned_At )
- CREATE TYPE Fund ATTRIBUTES ( DateTime Funded_At, String Name, Double RaisedAmount, String RaisedCurrencyCode, String SourceDescription, String SourceURL )
- CREATE TYPE Person ATTRIBUTES ( String AffiliationName, String Alias_List, String Birthplace, String BlogFeedURL, String BlogURL, DateTime Birthday, DateTime Created_At, String CrunchbaseURL, String FirstName, String HomepageURL, Image Image, String LastName, String Overview, String Permalink, String Tags, String TwitterUsername, DateTime Updated_At )
- CREATE TYPE Degree ATTRIBUTES ( String DegreeType, DateTime Graduated_At, String Institution, String Subject )
- CREATE TYPE Relationship ATTRIBUTES ( Boolean Is_Past, Person Person, String Title )
- CREATE TYPE ServiceProvider ATTRIBUTES ( String Alias_List, DateTime Created_At, String CrunchbaseURL, String EMailAdress, String HomepageURL, Image Image, String Name, String Overview, String Permalink, String PhoneNumber, String Tags, DateTime Updated_At )
- CREATE TYPE Providership ATTRIBUTES ( Boolean Is_Past, ServiceProvider Provider, String Title )
- CREATE TYPE Investment ATTRIBUTES ( Company Company, FinancialOrganization FinancialOrganization, Person Person )
- CREATE TYPE FundingRound ATTRIBUTES ( Company Company, DateTime Funded_At, Double RaisedAmount, String RaisedCurrencyCode, String RoundCode, String SourceDescription, String SourceURL )
You can download the corresponding GraphQL script directly here. If you use the sonesExample application from our open source distribution, you can create a subfolder “scripts” in the binary directory and put the downloaded script file there. When you’re using the integrated WebShell, which is launched on port 9975 by default and can be accessed by browsing to http://localhost:9975/WebShell, you can execute the script using the command “execdbscript” followed by the filename of the script.
As you can see, it’s pretty much a straightforward copy-and-paste from the graphical scheme. References don’t need a cumbersome relational helper either: if you want to reference a Company object, you can just do that (we actually did, see for example the last line of the GraphQL script above). As a result, when you execute the above script you get all the types necessary to fill in data in the next step.
So that’s it for this part – in the next part of this series we will start the initial data import using a small tool which reads the mirrored data and outputs graphql insert queries.
Where to start: existing data scheme and API
The name of this series already tells you what the use case is: the “CrunchBase”. Their website explains best what it is: “CrunchBase is the free database of technology companies, people, and investors that anyone can edit.” There are many reasons why this was chosen as a use case. One important reason is that all data behind the CrunchBase service is licensed under the Creative Commons Attribution (CC-BY) license. So it’s freely available data about high-tech companies, people and investors.

Currently there are more than 40,000 different companies, 51,000 different people and 4,200 different investors in the database. The flood of information is big and the scale of connectivity even bigger. The graph represented by the nodes could be even bigger than that, but because of the limiting factors of current relational database technology it’s not feasible to try that.
sones GraphDB comes to the rescue because it’s optimized to handle huge datasets of strongly connected data. Since the CrunchBase data could be used as a starting point to drive connectivity to even greater detail, it’s a great use case to show this migration and handling.
Thankfully the developers at CrunchBase already took one or two steps into an object-oriented world by offering an API which answers queries in JSON format. By using this API everyone can access the complete data set in a very structured way. That’s both good and bad. Because the technologies used don’t offer a way to represent linked objects, they had to use what we call “relational helpers”. For example: a person founded a company (person and company each being a JSON object). There’s no standardized way to model a relationship between those two. So what the CrunchBase developers did is add a unique identifier to each object, plus a new object which is used as a “relational helper” object. The only purpose of these helper objects is to point towards the unique identifier of another object type. So in our example the relationship attribute of the person object is not pointing directly to a specific company or relationship; it’s pointing to the helper object, which stores which unique identifier of which object type is meant by that link.
To visualize this here’s the data scheme behind the CrunchBase (+all currently available links):

As you can see there are many more “relational helper” dead-ends in the scheme. What an application had to do up until now is to resolve these dead-ends by going the extra mile. So instead of retrieving a person and all relationships, and with them all data that one would expect, the application has to split the data into many queries to internally build a structure which essentially is a graph.
Another example would be the company object. Like the name implies all data of a company is stored there. It holds an attribute called investments which isn’t a primitive data type (like a number or text) but a user defined complex data type. This user defined data type is called List<FundingRoundStructure>. So it’s a simple list of FundingRoundStructure objects.
When we take a look at the FundingRoundStructure there’s an attribute called company which is made up of the user defined data type CompanyStructure. This CompanyStructure is one of these dead-ends because there’s just a name and a unique-id. The application now needs to retrieve the right company object with this unique-id to access the company information.
Simple things told in a simple way: no matter where you start, you will always end up in a dead-end which forces you to start over with the information you found in that dead-end. It’s neither user-friendly nor easy to implement.
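The extra round trip described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names follow the scheme (CompanyStructure, FundingRoundStructure), but the field names, the use of a permalink as the unique id, the in-memory `STORE` standing in for the API and all sample records are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CompanyStructure:
    """Dead-end helper: only a name and a unique id, no company data."""
    name: str
    permalink: str  # assumption: the permalink serves as the unique id


@dataclass
class FundingRoundStructure:
    round_code: str
    company: CompanyStructure


@dataclass
class Company:
    name: str
    permalink: str
    overview: str
    investments: list = field(default_factory=list)


# Hypothetical stand-in for the remote API: one lookup per unique id.
STORE = {}
lookups = 0


def fetch_company(permalink):
    """Simulates one API request and counts how many were needed."""
    global lookups
    lookups += 1
    return STORE[permalink]


# Made-up sample records, not real CrunchBase data.
STORE["company-a"] = Company("Company A", "company-a", "Overview of A")
STORE["company-b"] = Company("Company B", "company-b", "Overview of B")
STORE["company-x"] = Company(
    "Company X", "company-x", "Overview of X",
    investments=[
        FundingRoundStructure("a", CompanyStructure("Company A", "company-a")),
        FundingRoundStructure("b", CompanyStructure("Company B", "company-b")),
    ],
)


def invested_company_overviews(permalink):
    """Collects the overview of every company referenced by the investments."""
    company = fetch_company(permalink)
    overviews = []
    for funding_round in company.investments:
        # The helper holds only name and id, so a second lookup is required.
        target = fetch_company(funding_round.company.permalink)
        overviews.append(target.overview)
    return overviews


result = invested_company_overviews("company-x")
print(result, "lookups:", lookups)
```

Every helper that is followed costs one more lookup, so the number of requests grows with the number of links the application wants to resolve.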
The good news is that there is a very easy way to handle this type of data and the links between it. The sones GraphDB provides a rich set of features to make the life of developers and users easier. In that context: if we would like to know which companies received funding from the same investor as, let’s say, the company “facebook”, the only thing necessary would be one short query. Besides that, those “relational helpers” are redundant information: in a graph database this information would be stored in the form of edges, not in helper objects.
The reason why the developers of CrunchBase had to use these helpers is that JSON and the relational tables behind it aren’t able to store this information directly or to query it directly. To learn more about those relational tables and databases try this link.
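To show why edges make such a question cheap, here is a minimal Python sketch of the idea. It is not sones GraphDB and not its query language: the `FundingGraph` class, its method names and the placeholder investors and companies are all assumptions made for this illustration. Funding is stored as edges in both directions, so the "funded by the same investor" question becomes a two-hop traversal with no helper objects to resolve.

```python
from collections import defaultdict


class FundingGraph:
    """Tiny in-memory graph: funding relations are stored as edges."""

    def __init__(self):
        self._funded = defaultdict(set)     # investor -> companies
        self._investors = defaultdict(set)  # company -> investors

    def add_funding(self, investor, company):
        """Adds one funding edge and its reverse direction."""
        self._funded[investor].add(company)
        self._investors[company].add(investor)

    def co_funded(self, company):
        """Companies that share at least one investor with `company`."""
        result = set()
        for investor in self._investors.get(company, ()):
            result |= self._funded[investor]
        result.discard(company)  # the company itself is not an answer
        return result


# Placeholder data, not real funding information.
graph = FundingGraph()
graph.add_funding("investor-1", "company-a")
graph.add_funding("investor-1", "company-b")
graph.add_funding("investor-2", "company-a")
graph.add_funding("investor-2", "company-c")
graph.add_funding("investor-3", "company-d")

print(sorted(graph.co_funded("company-a")))
```

A graph database keeps such edges as part of the stored data, which is why the question can be answered in one query instead of a chain of lookups.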
I want to end this part of the series with a picture of the above relational diagram (without the arrows and connections).

The next part of the series will show how we can access the available information and how a graph scheme starts to evolve.