Anatomy of a graph traversal

Structure of a graph traversal

Simple traversals can be complex, but generally do not employ specialized techniques such as recursion or branching.

Break down the chain of a graph traversal into traversal steps:

g.V().hasLabel('recipe').count()

This graph traversal to find the number of recipes in the graph has four parts:

The graph traversal g

g will return an error if run alone

All vertices are gathered with V()

All the vertices will be returned. A sample of the result:

gremlin>  g.V()
==>v[{~label=ingredient, member_id=18, community_id=1989847424}]
==>v[{~label=ingredient, member_id=19, community_id=1989847424}]
==>v[{~label=ingredient, member_id=16, community_id=1989847424}]
==>v[{~label=ingredient, member_id=17, community_id=1989847424}]
==>v[{~label=ingredient, member_id=22, community_id=1989847424}]
==>v[{~label=ingredient, member_id=13, community_id=1989847424}]
==>v[{~label=meal, member_id=25, community_id=1989847424}]
==>v[{~label=ingredient, member_id=24, community_id=1989847424}]
==>v[{~label=recipe, member_id=14, community_id=1878171264}]
==>v[{~label=recipe, member_id=21, community_id=1878171264}]
==>v[{~label=recipe, member_id=19, community_id=1878171264}]
==>v[{~label=meal, member_id=27, community_id=1989847424}]
==>v[{~label=recipe, member_id=20, community_id=1878171264}]
==>v[{~label=meal, member_id=26, community_id=1989847424}]
==>v[{~label=book, member_id=13, community_id=1878171264}]
==>v[{~label=book, member_id=10, community_id=1878171264}]
==>v[{~label=book, member_id=11, community_id=1878171264}]
==>v[{~label=author, member_id=1, community_id=1878171264}]
==>v[{~label=author, member_id=0, community_id=1878171264}]
==>v[{~label=author, member_id=3, community_id=1878171264}]
==>v[{~label=ingredient, member_id=2, community_id=1989847424}]
==>v[{~label=author, member_id=2, community_id=1878171264}]
Filter out the vertices labeled as a recipe with hasLabel('recipe')

Only the vertices that are recipes will be returned:

gremlin>  g.V().hasLabel('recipe')
==>v[{~label=recipe, member_id=14, community_id=1878171264}]
==>v[{~label=recipe, member_id=21, community_id=1878171264}]
==>v[{~label=recipe, member_id=19, community_id=1878171264}]
==>v[{~label=recipe, member_id=20, community_id=1878171264}]
==>v[{~label=recipe, member_id=17, community_id=1878171264}]
==>v[{~label=recipe, member_id=18, community_id=1878171264}]
==>v[{~label=recipe, member_id=15, community_id=1878171264}]
==>v[{~label=recipe, member_id=16, community_id=1878171264}]
Count the number of vertices with count()

The number of vertices returned from the last traversal step is totalled:

gremlin>  g.V().hasLabel('recipe').count()
==>8

Standard vertex ids are auto-generated, and are guaranteed to be unique. The standard vertex id consists of three parts:

member_id

vertex ID within a group

community_id

community ID within a graph

label

The specified vertex label

Standard vertex ids are synthetic and have a small footprint. The composition is not tied to a domain and are more flexible. Graph partitioning is an important aspect for retrieving graph objects. DSE Graph uses an optimizing algorithm to set the member_id and community_id for each vertex. The relationship is:

  • A graph is a collection of disjoint communities

  • A community is a collection of disjoint member vertices

Disjoint sets have no element in common. Therefore, a vertex is a member of exactly one community. In the example above, all vertices are in a couple of communities. The member_id is set to a value within each community.

Custom vertex ids can also be created using natural, or externally generated keys. However, applications using custom vertex ids must be manually partitioned and the guarantee of unique keys are up to the user.

Graph traversal with edges

Before trying the traversals displayed below, run the following script either in Studio (copy and paste) or Gremlin console (:load /tmp/generateReviews.groovy):

// reviewer vertices
johnDoe = graph.addVertex(label, 'reviewer', 'name','John Doe')
johnSmith = graph.addVertex(label, 'reviewer', 'name', 'John Smith')
janeDoe = graph.addVertex(label, 'reviewer', 'name','Jane Doe')
sharonSmith = graph.addVertex(label, 'reviewer', 'name','Sharon Smith')
betsyJones = graph.addVertex(label, 'reviewer', 'name','Betsy Jones')

beefBourguignon = g.V().has('recipe', 'name','Beef Bourguignon').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'name', 'Beef Bourguignon')}
spicyMeatLoaf = g.V().has('recipe', 'name','Spicy Meatloaf').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'name', 'Spicy Meatloaf')}
carrotSoup = g.V().has('recipe', 'name','Carrot Soup').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'name', 'Carrot Soup')}

// reviewer - recipe edges
johnDoe.addEdge('rated', beefBourguignon, 'timestamp', Instant.parse('2014-01-01T00:00:00.00Z'), 'stars', 5, 'comment', 'Pretty tasty!')
johnSmith.addEdge('rated', beefBourguignon, 'timestamp', Instant.parse('2014-01-23T00:00:00.00Z'), 'stars', 4)
janeDoe.addEdge('rated', beefBourguignon, 'timestamp', Instant.parse('2014-02-01T00:00:00.00Z'), 'stars', 5, 'comment', 'Yummy!')
sharonSmith.addEdge('rated', beefBourguignon, 'timestamp', Instant.parse('2015-01-01T00:00:00.00Z'), 'stars', 3, 'comment', 'It was okay.')
johnDoe.addEdge('rated', spicyMeatLoaf, 'timestamp', Instant.parse('2015-12-31T00:00:00.00Z'), 'stars', 4, 'comment', 'Really spicy - be careful!')
sharonSmith.addEdge('rated', spicyMeatLoaf, 'timestamp',Instant.parse('2014-07-23T00:00:00.00Z'), 'stars', 3, 'comment', 'Too spicy for me. Use less garlic.')
janeDoe.addEdge('rated', carrotSoup, 'timestamp', Instant.parse('2015-12-30T00:00:00.00Z'), 'stars', 5, 'comment', 'Loved this soup! Yummy vegetarian!')

Any number of traversal steps can be chained into a traversal, filtering and transforming the graph data as required. In some cases, edges will be the result, and perhaps unexpected. Consider the following traversal:

g.V().hasLabel('recipe').has('name', 'Beef Bourguignon').inE().values('comment')

This graph traversal begins as the last traversal did with g.V().hasLabel('recipe'). It is then followed by:

A traversal step to pick only the vertices with the recipe title specified

The filter should capture one recipe if recipe titles are unique.

gremlin>  g.V().hasLabel('recipe').has('name', 'Beef Bourguignon')
==>v[{~label=recipe, member_id=14, community_id=1878171264}]
A traveral step that retrieves incoming edges

Notice from the two edges sampled from the complete result that edges with any label are filtered with this step. Using inE('rated') would be more precise if the target result are only ratings.

gremlin>  g.V().hasLabel('recipe').has('name', 'Beef Bourguignon').inE()
==>e[{out_vertex={~label=reviewer, member_id=4, community_id=857859584},
local_id=ca461ec0-0e7e-11e6-b5e4-0febe4822aa4,
in_vertex={~label=recipe, member_id=14, community_id=1878171264},
~type=rated}]
[{~label=reviewer, member_id=4, community_id=857859584}-rated->{~label=recipe, member_id=14, community_id=1878171264}]
==>e[{out_vertex={~label=reviewer, member_id=5, community_id=857859584},
local_id=ca5bf0b0-0e7e-11e6-b5e4-0febe4822aa4,
in_vertex={~label=recipe, member_id=14, community_id=1878171264},
~type=rated}]
[{~label=reviewer, member_id=5, community_id=857859584}-rated->{~label=recipe, member_id=14, community_id=1878171264}]
==>e[{out_vertex={~label=reviewer, member_id=6, community_id=857859584},
local_id=ca72d410-0e7e-11e6-b5e4-0febe4822aa4,
in_vertex={~label=recipe, member_id=14, community_id=1878171264},
~type=rated}]
[{~label=reviewer, member_id=6, community_id=857859584}-rated->{~label=recipe, member_id=14, community_id=1878171264}]
==>e[{out_vertex={~label=reviewer, member_id=7, community_id=857859584},
local_id=ca8a0590-0e7e-11e6-b5e4-0febe4822aa4,
in_vertex={~label=recipe, member_id=14, community_id=1878171264},
~type=rated}]
[{~label=reviewer, member_id=7, community_id=857859584}-rated->{~label=recipe, member_id=14, community_id=1878171264}]
==>e[{out_vertex={~label=author, member_id=0, community_id=1878171264},
local_id=524504c0-0e7b-11e6-b5e4-0febe4822aa4,
in_vertex={~label=recipe, member_id=14, community_id=1878171264},
~type=created}]
[{~label=author, member_id=0, community_id=1878171264}-created->{~label=recipe, member_id=14, community_id=1878171264}]
Parsing out the comment property from the rated edges

Here, the inE() is specified with the edge label rated. The property values are retrieved for the property key comment:

gremlin>  g.V().hasLabel('recipe').has('name', 'Beef Bourguignon').inE('rated').values('comment')
==>Yummy!
==>Pretty tasty!
==>It was okay.

Building graph traversals one step at a time can yield interesting results and insight into how to create traversals.

The path of a graph traversal

A traversal step exists that will show the path taken by a graph traversal. First, find the results for a traversal that answers the question about what recipes that list beef and carrots as ingredients are included in the cookbooks, given the cookbook and recipe title?

gremlin>  g.V().hasLabel('ingredient').has('name',within('beef','carrots')).in().as('Recipe').
  out().hasLabel('book').as('Book').
  select('Book','Recipe').by('name').
  by('name')
==>[Book:The Art of French Cooking, Vol. 1, Recipe:Beef Bourguignon]
==>[Book:The Art of Simple Food: Notes, Lessons, and Recipes from a Delicious Revolution, Recipe:Carrot Soup]

One expects that the traversal path from ingredient to recipe to book. To check if this assumption is correct, add path() to the end of the traversal.

gremlin>  g.V().hasLabel('ingredient').has('name',within('beef','carrots')).
  in().as('Recipe').
  out().hasLabel('book').as('Book').
  select('Book','Recipe').
  by('name').by('name').path()
==>[v[{~label=ingredient, member_id=22, community_id=1878171264}],
v[{~label=recipe, member_id=14, community_id=1878171264}],
v[{~label=book, member_id=10, community_id=1878171264}],
{Book=The Art of French Cooking, Vol. 1, Recipe=Beef Bourguignon}]
==>[v[{~label=ingredient, member_id=21, community_id=1989847424}],
v[{~label=recipe, member_id=20, community_id=1878171264}],
v[{~label=book, member_id=13, community_id=1878171264}],
{Book=The Art of Simple Food: Notes, Lessons, and Recipes from a Delicious Revolution, Recipe=Carrot Soup}]

For each case, notice that the traversal does follow the expected path.

Traversal metrics

In addition to tracing the output of each graph traversal step, metrics can produce interesting insights as well. To add metrics to the last traversal shown, use some additional chained steps:

gremlin>  g.V().hasLabel('recipe').has('name', 'Beef Bourguignon').inE('rated').values('comment').profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep([~label.eq(recipe), name.eq(Beef ...                     1           1           0.979    73.00
  query-optimizer                                                                              0.184
  retrieve-new                                                                                 0.115
  iterator-setup                                                                               0.390
DsegVertexStep(IN,[rated],edge)                                        4           4           0.286    21.37
  query-optimizer                                                                              0.080
  retrieve-new                                                                                 0.014
  iterator-setup                                                                               0.062
DsegPropertiesStep([comment],value)                                    3           3           0.075     5.63
                                            >TOTAL                     -           -           1.342        -

The type of traversal step is listed, along with the number of traversers and the time to complete the traversal step. If a traversal step can be processed in parallel, multiple traversers will be employed to retrieve data. Some traversal steps are graph-global requiring retrieval from the entire graph; DsegGraphStep is a graph-global retrieval that finds vertices that match certain conditions. Other traversal steps are graph-local walks and can be processed in parallel; DsegVertexStep is a graph-local walk that walks through the graph along constrained paths. DSE Graph uses automatic query optimization to determine the traversal strategies to efficiently use any index structures that exist.

Looking at the metrics, the question of performance comes to mind. For instance, is there any way to optimize the traversal shown above? In fact, a simple modification results in a time savings:

==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep([~label.eq(recipe), name.eq(Beef ...                     1           1           0.733    70.62
  query-optimizer                                                                              0.143
  retrieve-new                                                                                 0.059
  iterator-setup                                                                               0.289
DsegVertexStep(IN,[rated],edge)                                        4           4           0.241    23.29
  query-optimizer                                                                              0.083
  retrieve-new                                                                                 0.006
  iterator-setup                                                                               0.049
DsegPropertiesStep([comment],value)                                    3           3           0.063     6.10
                                            >TOTAL                     -           -           1.038        -

The change made is subtle. Two traversal steps, hasLabel('recipe').has('name', 'Beef Bourguignon') have been replaced by one traversal step, has('recipe', 'name', 'Beef Bourguignon'). Although each measurement can vary, generally the second traversal will outperform the first traversal.

In DSE 5.1 and later, DSE Studio 2.0 provides more information on metrics such as index-query, showing the type of query used (Search, Materialized, Secondary). The two examples shown here display a materialized view index and a search index in use:

Query anatomy
Query anatomy

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com