Graph data modeling example

Details of a larger data model creation.

Let’s consider the world of food further to create a complex data model. This example walks through some of the thinking behind creating a graph database data model. Many decisions go into optimizing a schema for a graph. The considerations covered here include:

  • vertex vs property

  • vertex property vs edge property

  • properties with multiple values

  • properties associated with other properties

  • edge directionality

  • edge uniqueness (single edge vs multiple edges)

  • indexes - why use them and which ones

  • complexity

Vertex, edge or property?

In general, if an entity is a thing, it will be a vertex. If it describes an action on a thing, it’s an edge. Lastly, if it is a qualifier of a thing, it is a property.

For instance, what is a possible additional type of vertex besides person and recipe in the food data model? Recipes use ingredients, so ingredient vertices will be required. Recipes are generally published in cookbooks, so book vertices will also be added.

What are some edges that will connect these vertices? Each ingredient is included_in a recipe, and each recipe is likely included_in at least one book.

And finally, all of these vertices and edges have properties. An ingredient will have a name and id, and a book will have a publish_year. An included_in edge can identify the amount of an ingredient used in a recipe. A recipe can be included_in many books, creating one edge from the recipe vertex to each book vertex.
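
The thing/action/qualifier rule above can be sketched in a few lines of plain Python. This is an in-memory illustration only, not DSG schema or Gremlin syntax; the class names Vertex and Edge, the sample values, and the publish years are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                                   # the kind of thing: ingredient, recipe, book
    properties: dict = field(default_factory=dict)  # qualifiers of the thing


@dataclass
class Edge:
    label: str                                   # the action, e.g. included_in
    out_v: Vertex                                # vertex the edge leaves
    in_v: Vertex                                 # vertex the edge enters
    properties: dict = field(default_factory=dict)  # qualifiers of the action


# Hypothetical sample data: things become vertices, qualifiers become properties.
salt = Vertex("ingredient", {"id": 1, "name": "salt"})
stew = Vertex("recipe", {"id": 10, "name": "beef stew"})
book_a = Vertex("book", {"id": 100, "publish_year": 1961})
book_b = Vertex("book", {"id": 101, "publish_year": 1975})

edges = [
    Edge("included_in", salt, stew, {"amount": "1 tsp"}),  # ingredient -> recipe
    Edge("included_in", stew, book_a),                     # recipe -> book
    Edge("included_in", stew, book_b),                     # same recipe, second book
]

# One recipe included in two books yields two edges leaving the recipe vertex.
books_with_stew = [
    e.in_v for e in edges if e.label == "included_in" and e.out_v is stew
]
```

Here books_with_stew holds both book vertices, while the amount lives on the ingredient-to-recipe edge rather than on either vertex.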

Vertex vs property

Sometimes, a property may seem to be the best schema option for an entity. But in some cases, making the property into another vertex label would be a better choice. Let’s look at the following vertex:

dataModelExample1

It seems perfectly obvious that an ingredient should have a category property. But could category be a vertex, with edges connecting an ingredient to several categories? Generally, vertex properties are easily queried, and so are edges between vertices. What is the deciding factor in which option to use? One key data model feature you want to avoid is a super-node, a single vertex with billions of connections to other vertices. With ingredients, there are unlikely to be billions of ingredients in any category, unless the category is absurdly broad, like hot_food. Another deciding factor is whether the category vertex would have any properties of its own. Perhaps a category is a member of another category, branching out from a broad category into narrower sub-categories. In this case, however, category has no definite requirements, so creating it either as a property or as a vertex is reasonable.
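
To make the trade-off concrete, the following Python sketch stores the same data both ways and counts the edges arriving at each category vertex, which is one rough way to spot a potential super-node. The sample pairs, the belongs_to edge label, and the threshold are hypothetical, and the threshold is tiny only so the example produces a result.

```python
from collections import Counter

# Hypothetical sample data: (ingredient, category) pairs.
pairs = [
    ("salt", "seasoning"),
    ("pepper", "seasoning"),
    ("beef", "meat"),
    ("chili", "seasoning"),
    ("chili", "vegetable"),
]

# Option 1: category is a (multi-valued) property on each ingredient vertex.
as_property = {}
for ingredient, category in pairs:
    vertex = as_property.setdefault(ingredient, {"name": ingredient, "category": []})
    vertex["category"].append(category)

# Option 2: category is its own vertex, linked by an edge (label is an assumption).
edges = [(ingredient, "belongs_to", category) for ingredient, category in pairs]

# Count incoming edges per category vertex.
in_degree = Counter(category for _, _, category in edges)


def super_node_candidates(degrees, threshold):
    """Return vertices whose edge count meets or exceeds the threshold."""
    return sorted(v for v, count in degrees.items() if count >= threshold)


# Illustrative threshold only; a real super-node has vastly more edges.
candidates = super_node_candidates(in_degree, threshold=3)
```

With this data, seasoning is the only category flagged. Running the same count against realistic data volumes shows whether a category vertex is likely to become a problem.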

Vertex property vs edge property

Vertex and edge properties can be searched equally well, starting with a specific set of vertices or edges based on property key:value pairs. For instance, to find all the cookbook authors in France, you can search all the vertices with the vertex label person who have lived in the country of France. But you can also search all the edges between a vertex label person and a vertex label country with an edge property of lived_in. DSG can search these two scenarios equally well, and often a particular query must be tested to see which is optimal. A different query can find all the cookbook authors in France who know Julia Child, but that query begins with the person Julia Child and traverses outward. An edge property on the know edges can give additional information, such as when Julia Child met an author who lives in France, but starting the search from that edge property, to see who Julia Child knew in 1955, would not be performant.
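
The Julia Child query can be sketched as a traversal that starts at one vertex and then filters on either a vertex property or an edge property. This is a plain Python illustration rather than Gremlin; the other author names, the since property, and all year values are hypothetical.

```python
# Hypothetical vertices keyed by name, with a country vertex property.
people = {
    "Julia Child": {"country": "USA"},
    "Author A": {"country": "France"},
    "Author B": {"country": "USA"},
}

# Hypothetical know edges: (from, to, edge properties).
know = [
    ("Julia Child", "Author A", {"since": 1949}),
    ("Julia Child", "Author B", {"since": 1955}),
]


def known_by(start):
    """Start at one vertex and follow its outgoing know edges."""
    return [(target, props) for source, target, props in know if source == start]


def friends_in(start, country):
    """Filter the traversal on a vertex property of the adjacent vertex."""
    return [t for t, _ in known_by(start) if people[t]["country"] == country]


def met_in(start, year):
    """Filter the traversal on an edge property of the traversed edge."""
    return [t for t, props in known_by(start) if props["since"] == year]
```

Both filters begin at the Julia Child vertex, so only her edges are examined. Beginning from the year instead would mean examining every know edge in the graph.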

An edge property is often the only choice. For instance, look at an ingredient that is included in a recipe.

dataModelExample2

Knowing the ingredients in a recipe is important, but the amount used is also critical. One cup of salt instead of one teaspoon of salt will make a big difference in the results. But an ingredient is used in many recipes, and each recipe has a specified amount. In this case, adding the edge property amount is the right choice:

dataModelExample3
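
A short Python sketch shows why amount cannot live on the ingredient vertex: the same ingredient needs a different value for each recipe, so the value belongs to the pair. The recipes and amounts below are hypothetical sample data.

```python
# included_in edges keyed by (ingredient, recipe); the value holds edge properties.
included_in = {
    ("salt", "bread"): {"amount": "1 tsp"},
    ("salt", "brine"): {"amount": "1 cup"},
    ("flour", "bread"): {"amount": "3 cups"},
}


def amounts_for(recipe):
    """Collect each ingredient's amount for one recipe from the edge properties."""
    return {
        ingredient: props["amount"]
        for (ingredient, target), props in included_in.items()
        if target == recipe
    }


bread = amounts_for("bread")
```

The salt vertex appears in two recipes with two different amounts, which a single vertex property could not express.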

Properties with multiple values

Properties can have multiple values and are useful for storing similar information. For instance, a nickname property can store all the nicknames that a person might have, or an email property can store all the email addresses a person owns. When choosing among collections, tuples, and user-defined types (UDTs), consider how you will access the data in your data model.

Properties associated with other properties

Collections, tuples, and UDTs are the best method of associating properties with other properties. For instance, if you want to assign a person a badge that consists of the level and the date on which the badge was awarded, a map collection is an excellent choice. A UDT is a good choice if a specific group of data is required, such as an address and multiple phone numbers for a home or business. The UDT location_details is composed of the UDT address plus an additional property, telephone, of the List data type.
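
The shapes described above can be mirrored with Python dataclasses as a rough analogy: sets for multi-valued properties, a dict for the badge map, and nested classes for the UDTs. The field names inside Address, the person, and every sample value are assumptions; this is not DSG schema syntax.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Address:
    """Stands in for the UDT address; these field names are assumptions."""
    street: str
    city: str
    postal_code: str


@dataclass
class LocationDetails:
    """Stands in for the UDT location_details: an address plus a telephone list."""
    address: Address
    telephone: list = field(default_factory=list)


@dataclass
class Person:
    name: str
    nickname: set = field(default_factory=set)   # multi-valued property
    email: set = field(default_factory=set)      # multi-valued property
    badge: dict = field(default_factory=dict)    # map: level -> date awarded
    location: Optional[LocationDetails] = None


# Hypothetical sample person.
person = Person(
    name="Jane Doe",
    nickname={"JD", "Janie"},
    email={"jane@example.com", "jd@example.com"},
    badge={"gold": date(2020, 1, 1)},
    location=LocationDetails(
        Address("1 Main St", "Springfield", "00000"),
        ["555-0100", "555-0101"],
    ),
)
```

The map keeps each badge level tied to its award date, and the nested type keeps the phone numbers tied to the address they belong to.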

Edge directionality

Edge directionality can play a role in the performance of queries. Edges are unidirectional by default to avoid the unfortunate possibility of super-nodes, which are nodes that have too many edges. If your queries generally look to find the ingredients included_in a recipe, rather than what recipes use a specific ingredient, then designing the edge to connect from ingredient->recipe is the right choice. If bidirectionality is required for particular edges, then the special indexing step inverse() can be used to create a materialized index that adds the opposite direction edges.
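
The cost of traversing against the stored direction can be illustrated with two lookup tables in Python. The second table plays a role loosely analogous to an inverse index; it is a conceptual sketch only and makes no claim about how DSG implements inverse(). The edge data is hypothetical.

```python
from collections import defaultdict

# Hypothetical edges stored in one direction only: ingredient -> recipe.
edges = [("salt", "bread"), ("flour", "bread"), ("salt", "brine")]

# Lookup that follows the stored direction.
out_index = defaultdict(list)
for ingredient, recipe in edges:
    out_index[ingredient].append(recipe)


def recipes_using(ingredient):
    """Follow edges in their stored direction: a direct lookup."""
    return out_index.get(ingredient, [])


def ingredients_of_scan(recipe):
    """Without a reverse lookup, every edge must be examined."""
    return [ingredient for ingredient, target in edges if target == recipe]


# A second, separately maintained lookup for the opposite direction.
in_index = defaultdict(list)
for ingredient, recipe in edges:
    in_index[recipe].append(ingredient)


def ingredients_of(recipe):
    """Follow edges against their stored direction using the reverse lookup."""
    return in_index.get(recipe, [])
```

The reverse lookup returns the same answer as the scan without touching unrelated edges, at the price of storing and maintaining a second structure.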

Edge uniqueness

Edge uniqueness is required if multiple edges between two vertices are required. For instance, if a person can review a recipe more than once, a property that identifies the unique instances of those edges must be created in the edge label schema. An edge property review_date makes clear that different reviews can be made at different times, and should be a clustering key for the edge label reviewed. Additionally, if a celebrity or other super-node is present with millions of incoming edges, the data model will benefit from breaking up the incoming edges with an additional partition key and no indexing.
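
The effect of adding review_date to the edge key can be sketched with two Python dictionaries, one keyed by the vertex pair alone and one keyed by the pair plus the date. The reviewer name, dates, and the stars property are hypothetical.

```python
from datetime import date

# Keyed by the vertex pair only: a second review overwrites the first.
reviewed_pair_key = {}

# Keyed by the vertex pair plus review_date: each review is its own edge.
reviewed_with_date = {}


def add_review(person, recipe, review_date, stars):
    """Record one reviewed edge under both keying schemes for comparison."""
    reviewed_pair_key[(person, recipe)] = {"review_date": review_date, "stars": stars}
    reviewed_with_date[(person, recipe, review_date)] = {"stars": stars}


# Hypothetical data: the same person reviews the same recipe twice.
add_review("Jane Doe", "beef stew", date(2014, 1, 1), 3)
add_review("Jane Doe", "beef stew", date(2014, 2, 1), 5)
```

After both calls, the pair-keyed dictionary holds one entry and the date-keyed dictionary holds two, which is the behavior the clustering key is meant to provide.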

Indexes

Indexes play a significant role in making DSG queries performant. Graph queries that must traverse the entire graph to find information will have poor performance, which explains why full-scan queries are disallowed in production environments. DSG implements three types of indexes to address different aspects of query processing: materialized view indexes, secondary indexes, and search indexes. Indexes are used to find the starting point in a graph and involve finding a matching vertex or edge property value. Queries that require indexes will not execute without an index unless a development bypass mechanism, like dev or g.with('allow-filtering'), is used.

An index analyzer can be used to discover what indexes are required, by running the analyzer on any query. The analyzer results will return an index that can be applied or state that indexes required for the query already exist.
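
Why an index matters for finding a starting point can be shown with a Python comparison of a full scan and a dictionary lookup. This is a conceptual sketch with hypothetical vertices; it does not model any of the three DSG index types specifically.

```python
# Hypothetical vertices.
vertices = [
    {"label": "person", "name": "Julia Child"},
    {"label": "person", "name": "Author A"},
    {"label": "recipe", "name": "beef stew"},
    {"label": "person", "name": "Author B"},
]


def find_by_scan(label, key, value):
    """Full scan: examine every vertex and report how many were examined."""
    examined = 0
    matches = []
    for vertex in vertices:
        examined += 1
        if vertex["label"] == label and vertex.get(key) == value:
            matches.append(vertex)
    return matches, examined


# Index on person.name, built once and consulted for every lookup.
name_index = {}
for vertex in vertices:
    if vertex["label"] == "person":
        name_index.setdefault(vertex["name"], []).append(vertex)


def find_by_index(value):
    """Index lookup: go straight to the matching vertices."""
    return name_index.get(value, [])


scan_matches, scan_examined = find_by_scan("person", "name", "Julia Child")
index_matches = find_by_index("Julia Child")
```

Both approaches return the same vertex, but the scan examines all four vertices to do so, and that cost grows with the size of the graph.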

Complexity

Today, people publish their recipes online and in cookbooks. Restaurants create fixed price meals from recipes. Consumers review the recipes they try. The result is an intertwined graph of data. The additional vertices and edges that can be added to this graph are numerous. For instance, the gender of the recipe authors and reviewers can be included. Nutritional information for the ingredients can be derived from the calories for a recipe. The number of servings that a recipe makes is useful to cooks. The resulting web of data can grow quickly.

dataModelExample4

Add a hundred authors, a thousand recipes, ten thousand reviews, and the enormity of the graph becomes obvious. However, as you will see in later sections, DSG can transform complex searches and pattern matching into simple and powerful solutions.

The data model is the first step in creating a graph. Using the data model, a schema can be created that defines how DSG will store the data.

