Graph data modeling example
Details of creating a larger data model.
Let’s consider the world of food further to create a complex data model. This example walks through some of the thinking behind creating a graph database data model. Many different decisions are made to optimize the schema for a graph. Considerations include the following:
vertex vs property
vertex property vs edge property
properties with multiple values
properties associated with other properties
edge uniqueness (single edge vs multiple edges)
indexes - why use them and which ones
In general, if an entity is a thing, it will be a vertex. If it describes an action on a thing, it’s an edge. Lastly, if it is a qualifier of a thing, it is a property.
For instance, what is a possible additional type of vertex besides recipe in the food data model?
Recipes use ingredients, so ingredient vertices will be required. Recipes are generally published in cookbooks, so book vertices will also be added.
What are some edges that will connect these vertices?
Each ingredient is included_in a recipe, and each recipe is likely included_in at least one book.
And finally, all of these vertices and edges have properties.
An ingredient will have an id, a book will have a publish_year, and an included_in edge can identify the amount of an ingredient used in a recipe.
A recipe can be included_in many books, creating multiple edges between vertices.
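The vertices, edges, and properties described above can be sketched in plain Python. This is only an in-memory illustration, not DSG schema syntax: the Vertex and Edge classes, the name property, and all sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                                   # e.g. 'ingredient', 'recipe', 'book'
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str                                   # e.g. 'included_in'
    out_v: Vertex                                # vertex the edge starts from
    in_v: Vertex                                 # vertex the edge points to
    properties: dict = field(default_factory=dict)


# Things become vertices; qualifiers become properties (sample values are made up).
salt = Vertex('ingredient', {'id': 1, 'name': 'salt'})
soup = Vertex('recipe', {'name': 'onion soup'})
book_a = Vertex('book', {'name': 'Book A', 'publish_year': 1961})
book_b = Vertex('book', {'name': 'Book B', 'publish_year': 1975})

# Actions on things become edges; the amount qualifies the edge itself.
edges = [
    Edge('included_in', salt, soup, {'amount': '1 tsp'}),
    Edge('included_in', soup, book_a),
    Edge('included_in', soup, book_b),
]

# One recipe included_in many books: several edges leave the same recipe vertex.
books_with_soup = [e.in_v.properties['name'] for e in edges
                   if e.label == 'included_in' and e.out_v is soup]
print(books_with_soup)
```

The same included_in label is reused for both ingredient-to-recipe and recipe-to-book connections, as in the text; only the endpoints differ.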
Sometimes, a property may seem to be the best schema option for an entity. But in some cases, making the property into another vertex label would be a better choice. Let’s look at the following vertex:
image::dataModelExample1.png
It seems perfectly obvious that an ingredient should have a category property. But should category instead be a vertex, with an edge connecting an ingredient to several categories?
Generally, vertex properties are easily queried, and so are edges between vertices.
What is the deciding factor in which option to use?
One key data model feature you want to avoid is a super-node, a single vertex with billions of connections to other vertices.
With ingredients, there are unlikely to be billions of ingredients in any category, unless the category is absurdly broad.
Another deciding factor is whether the category vertex would have any properties of its own.
Perhaps a category is a member of another category, branching out from a broad category to more sub-categories.
In this case, however, it seems that category has no definite requirements, so creating it either as a property or a vertex is reasonable.
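The two options for category can be compared with a small sketch. This is plain Python rather than DSG syntax; the ingredient names, the category values, and the belongs_to edge name are hypothetical. It also counts incoming edges per category, which is the number to watch when worrying about super-nodes.

```python
from collections import Counter

# Option 1: category stored as a multi-valued property on each ingredient.
ingredients = {
    'basil': {'category': ['herb', 'leafy green']},
    'spinach': {'category': ['vegetable', 'leafy green']},
    'thyme': {'category': ['herb']},
}

# Option 2: category as its own vertex, linked by hypothetical 'belongs_to'
# edges, written here as (ingredient, category) pairs.
belongs_to = [(name, cat)
              for name, props in ingredients.items()
              for cat in props['category']]

# Incoming-edge count per category vertex; a very large count on a single
# category would be the warning sign of a super-node.
degree = Counter(cat for _, cat in belongs_to)

# Both options can answer "which ingredients are herbs?"
herbs_by_property = sorted(n for n, p in ingredients.items()
                           if 'herb' in p['category'])
herbs_by_edge = sorted(n for n, c in belongs_to if c == 'herb')
```

Both lookups return the same ingredients, which mirrors the point above: when category has no requirements of its own, either modeling choice is reasonable.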
Vertex and edge properties can be searched equally well, starting with a specific set of vertices or edges based on property key:value pairs.
For instance, if I want to find all the cookbook authors in France, I can search all the vertices with the vertex label person who have lived in the country of France. But I can also search all the edges between a vertex label person and a vertex label country, filtering on an edge property.
DSG can search either scenario equally well, and often a particular query must be tested to see which is optimal.
For a different query, you can find all the cookbook authors in France who know Julia Child, but the query begins with the person Julia Child and traverses outward.
An edge property on the know edges can give us additional information, such as when Julia Child met an author who lives in France, but starting the search by looking for whom Julia Child knew in 1955 would not be performant.
An edge property is often the only choice. For instance, look at an ingredient that is included in a recipe.
Knowing the ingredients in a recipe is important, but the amount used is also critical.
One cup of salt instead of one teaspoon of salt will make a big difference in the results.
But an ingredient is used in many recipes, and each recipe has a specified amount.
In this case, adding the edge property amount is the right choice.
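A minimal sketch of why amount belongs on the edge: the same ingredient appears in several recipes, each with its own quantity, so the value cannot live on the ingredient vertex. The recipe names, the amounts, and the amount_of helper are made up for illustration, and the dictionary stands in for included_in edges.

```python
# Each (ingredient, recipe) pair is one included_in edge; amount lives on the edge.
included_in = {
    ('salt', 'bread'): {'amount': '1 tsp'},
    ('salt', 'brine'): {'amount': '1 cup'},
}


def amount_of(ingredient, recipe):
    """Return the amount recorded on the included_in edge, or None if no edge exists."""
    edge = included_in.get((ingredient, recipe))
    return edge['amount'] if edge else None


print(amount_of('salt', 'bread'))
print(amount_of('salt', 'brine'))
```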
Properties can have multiple values and are useful for storing similar information.
For instance, a nickname property can store all the nicknames that a person might have.
Collections, tuples, and UDTs are the best method of associating properties with other properties.
For instance, if you want to assign a badge that consists of the level and the date on which the badge was awarded, a map collection is an excellent choice.
A UDT is a good choice if a specific group of data is required, such as an address and multiple phone numbers for a home or business.
For example, location_details is composed of the UDT address plus an additional property, telephone, of the List data type.
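The shapes described above (a multi-valued nickname, a badge map, and location_details built from an address UDT plus a telephone list) can be sketched with Python dataclasses. This is not DSG schema syntax; the fields inside Address and every sample value are assumptions for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Address:
    """Stand-in for a UDT: a fixed group of fields that travel together."""
    street: str
    city: str
    postal_code: str


@dataclass
class LocationDetails:
    address: Address                                     # the UDT
    telephone: List[str] = field(default_factory=list)   # list of phone numbers


@dataclass
class Person:
    name: str
    nickname: List[str] = field(default_factory=list)     # multi-valued property
    badge: Dict[str, date] = field(default_factory=dict)  # map: level -> award date


person = Person('Jane Doe', nickname=['JD', 'Janie'])
person.badge['gold'] = date(2016, 1, 1)

home = LocationDetails(Address('1 Main St', 'Springfield', '00000'),
                       telephone=['555-0100', '555-0101'])
```

The map keeps each badge level tied to its own date, which a pair of independent properties could not guarantee.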
Edge directionality can play a role in the performance of queries.
Edges are unidirectional by default to avoid the unfortunate possibility of super-nodes, nodes that have too many edges.
If your queries generally look to find the ingredients included_in a recipe, rather than what recipes use a specific ingredient, then designing the edge to connect ingredient->recipe is the right choice.
If bidirectionality is required for particular edges, then the special indexing step inverse() can be used to create a materialized index that adds the opposite-direction edges.
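The idea of a materialized inverse copy can be illustrated in plain Python. This sketch does not use or reproduce the DSG inverse() step; the IncludedIn class and the sample edges are hypothetical, and the point is only that a second copy keyed by the other endpoint makes both directions a keyed lookup.

```python
from collections import defaultdict


class IncludedIn:
    """Directed ingredient->recipe edges plus a materialized inverse copy."""

    def __init__(self):
        self.forward = defaultdict(set)   # keyed by the 'from' vertex
        self.inverse = defaultdict(set)   # same edges, keyed by the 'to' vertex

    def add(self, ingredient, recipe):
        # Every write updates both copies; that is the cost of the inverse copy.
        self.forward[ingredient].add(recipe)
        self.inverse[recipe].add(ingredient)


edges = IncludedIn()
for ingredient, recipe in [('salt', 'bread'), ('flour', 'bread'), ('salt', 'brine')]:
    edges.add(ingredient, recipe)

recipes_using_salt = sorted(edges.forward['salt'])
ingredients_in_bread = sorted(edges.inverse['bread'])
```

Keeping the second copy only for edges that truly need both directions avoids paying the extra write and storage cost everywhere.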
Edge uniqueness must be defined if multiple edges between two vertices are required.
For instance, if a person can review a recipe more than once, a property that will identify the unique instances of those edges must be created in the edge label schema.
An edge property review_date makes clear that different reviews can be made at different times, and should be a clustering key for the edge label.
Additionally, if a celebrity vertex, or super-node, is present with millions of incoming edges, the data model will benefit from breaking up the incoming edges with an additional partition key and no indexing.
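The review example can be sketched with dictionaries standing in for edge storage. This is an illustration under the assumption that an edge is identified by its key, so a second write with the same key replaces the first; the reviewer name, the stars property, and the dates are made up.

```python
from datetime import date

# Key is (person, recipe) only: a second review replaces the first.
reviewed_single = {}
reviewed_single[('alice', 'bread')] = {'stars': 3}
reviewed_single[('alice', 'bread')] = {'stars': 5}

# Adding review_date to the key keeps each review as its own edge.
reviewed_multi = {}
reviewed_multi[('alice', 'bread', date(2020, 1, 5))] = {'stars': 3}
reviewed_multi[('alice', 'bread', date(2020, 6, 1))] = {'stars': 5}

# All of one person's reviews of one recipe, ordered by review_date.
alice_bread_reviews = sorted(
    (key[2], props['stars'])
    for key, props in reviewed_multi.items()
    if key[:2] == ('alice', 'bread')
)
```

Sorting by the date component mirrors what a clustering key provides: the edges between the same two vertices stay distinct and ordered.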
Indexes play a significant role in making DSG queries performant.
Graph queries that must traverse the entire graph to find information will have poor performance, which explains why full-scan queries are disallowed in production environments.
DSG implements three types of indexes (materialized view indexes, secondary indexes, and search indexes) to address these different aspects of query processing.
Indexes are used to find the starting point in a graph and involve finding a matching vertex or edge property value.
Queries that require indexes will not execute without an index unless a development bypass mechanism, like g.with('allow-filtering'), is used.
An index analyzer can be used to discover what indexes are required, by running the analyzer on any query. The analyzer results will return an index that can be applied or state that indexes required for the query already exist.
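The difference between a full scan and an index lookup for finding a traversal's starting point can be shown with a small sketch. This is plain Python, not a DSG index definition; the vertex data and the function names are hypothetical.

```python
from collections import defaultdict

vertices = [
    {'label': 'person', 'name': 'Alice', 'country': 'France'},
    {'label': 'person', 'name': 'Bob', 'country': 'Italy'},
    {'label': 'person', 'name': 'Chloe', 'country': 'France'},
]


def full_scan(country):
    """Check every vertex: the cost grows with the size of the whole graph."""
    return [v['name'] for v in vertices if v['country'] == country]


# An index-like structure built ahead of time: property value -> matching vertices.
country_index = defaultdict(list)
for v in vertices:
    country_index[v['country']].append(v['name'])


def indexed_lookup(country):
    """Jump straight to the matching vertices to start the traversal."""
    return country_index.get(country, [])
```

Both functions return the same vertices; the index trades extra storage and write work for not touching every vertex at query time.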
Today, people publish their recipes online and in cookbooks. Restaurants create fixed-price meals from recipes. Consumers review the recipes they try. The result is an intertwined graph of data. The additional vertices and edges that can be added to this graph are numerous. For instance, the gender of the recipe authors and reviewers can be included. Nutritional information for the ingredients can be derived from the calories for a recipe. The number of servings that a recipe makes is useful to cooks. The resulting web of data can grow quickly.
Add a hundred authors, a thousand recipes, ten thousand reviews, and the enormity of the graph becomes obvious. However, as you will see in later sections, DSG can transform complex searches and pattern matching into simple and powerful solutions.
The data model is the first step in creating a graph. Using the data model, a schema can be created that defines how DSG will store the data.