Anatomy of a graph traversal

The anatomy of a graph traversal explores the results of each traversal step.

Structure of a graph traversal

Simple traversals can be complex, but generally do not employ specialized techniques such as recursion or branching.

Break down the chain of a graph traversal into traversal steps:

g.V().hasLabel('recipe').count()

This graph traversal to find the number of recipes in the graph has four parts:

The graph traversal g

g will return an error if run alone

All vertices are gathered with V()

All the vertices will be returned. A sample of the result:

gremlin>  g.V()==>v[{~label=recipe, recipeId=2003}]
==>v[{~label=recipe, recipeId=2005}]
==>v[{~label=ingredient, ingredId=3017}]
==>v[{~label=ingredient, ingredId=3028}]
==>v[{~label=person, personId=1}]
==>v[{~label=person, personId=3}]
==>v[{~label=book, bookId=1004}]
==>v[{~label=book, bookId=1001}]
==>v[{~label=meal, type="lunch", mealId=4003}]

Filter out the vertices labeled as a recipe with hasLabel('recipe')

Only the vertices that are recipes will be returned:

gremlin>  g.V().hasLabel('recipe')
==>v[{~label=recipe, recipeId=2003}]
==>v[{~label=recipe, recipeId=2005}]
==>v[{~label=recipe, recipeId=2006}]
==>v[{~label=recipe, recipeId=2007}]
==>v[{~label=recipe, recipeId=2001}]
==>v[{~label=recipe, recipeId=2002}]
==>v[{~label=recipe, recipeId=2004}]
==>v[{~label=recipe, recipeId=2008}]

Count the number of vertices with count()

The number of vertices returned from the last traversal step is totalled:

gremlin>  g.V().hasLabel('recipe').count()
==>8

Graph traversal with edges

Before trying the traversals displayed below, run the following script either in Studio (copy and paste) or Gremlin console (:load /tmp/generateReviews.groovy):

// reviewer vertices
johnDoe = graph.addVertex(label, 'person', 'personId', 11, 'name','John Doe')
johnSmith = graph.addVertex(label, 'person', 'personId', 12, 'name', 'John Smith')
janeDoe = graph.addVertex(label, 'person', 'personId', 13, 'name','Jane Doe')
sharonSmith = graph.addVertex(label, 'person', 'personId', 14, 'name','Sharon Smith')
betsyJones = graph.addVertex(label, 'person', 'personId', 15, 'name','Betsy Jones')

beefBourguignon = g.V().has('recipe', 'recipeId', 2001, 'name','Beef Bourguignon').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'recipeId', 2001, 'name', 'Beef Bourguignon')}
spicyMeatLoaf = g.V().has('recipe', 'recipeId', 2005, 'name','Spicy Meatloaf').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'recipeId', 2005, 'name', 'Spicy Meatloaf')}
carrotSoup = g.V().has('recipe', 'recipeId', 2007, 'name','Carrot Soup').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'recipedId', 2007, 'name', 'Carrot Soup')}

// reviewer - recipe edges
johnDoe.addEdge('reviewed', beefBourguignon, 'timestamp', Instant.parse('2014-01-01T00:00:00.00Z'), 'stars', 5, 'comment', 'Pretty tasty!')
johnSmith.addEdge('reviewed', beefBourguignon, 'timestamp', Instant.parse('2014-01-23T00:00:00.00Z'), 'stars', 4)
janeDoe.addEdge('reviewed', beefBourguignon, 'timestamp', Instant.parse('2014-02-01T00:00:00.00Z'), 'stars', 5, 'comment', 'Yummy!')
sharonSmith.addEdge('reviewed', beefBourguignon, 'timestamp', Instant.parse('2015-01-01T00:00:00.00Z'), 'stars', 3, 'comment', 'It was okay.')
johnDoe.addEdge('reviewed', spicyMeatLoaf, 'timestamp', Instant.parse('2015-12-31T00:00:00.00Z'), 'stars', 4, 'comment', 'Really spicy - be careful!')
sharonSmith.addEdge('reviewed', spicyMeatLoaf, 'timestamp',Instant.parse('2014-07-23T00:00:00.00Z'), 'stars', 3, 'comment', 'Too spicy for me. Use less garlic.')
janeDoe.addEdge('reviewed', carrotSoup, 'timestamp', Instant.parse('2015-12-30T00:00:00.00Z'), 'stars', 5, 'comment', 'Loved this soup! Yummy vegetarian!')

Any number of traversal steps can be chained into a traversal, filtering and transforming the graph data as required. In some cases, edges will be the result, and perhaps unexpected. Consider the following traversal:

g.V().hasLabel('recipe').has('name', 'Beef Bourguignon').inE().values('comment')

This graph traversal begins as the last traversal did with g.V().hasLabel('recipe'). It is then followed by:

A traversal step to pick only the vertices with the recipe title specified

The filter should capture one recipe if recipe titles are unique.

gremlin>  g.V().hasLabel('recipe').has('name', 'Beef Bourguignon')
==>v[{~label=recipe, recipeId=2001}]

A traveral step that retrieves incoming edges

Notice from the two edges sampled from the complete result that edges with any label are filtered with this step. Using inE('reviewed') would be more precise if the target result are only ratings.

gremlin>  g.V().hasLabel('recipe').has('name', 'Beef Bourguignon').inE()
==>e[{~label=includedIn, ~out_vertex={~label=ingredient, ingredId=3001}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=95672dd0-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=ingredient, ingredId=3001}-includedIn->{~label=recipe, recipeId=2001}]
==>e[{~label=includedIn, ~out_vertex={~label=ingredient, ingredId=3002}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=9567ca17-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=ingredient, ingredId=3002}-includedIn->{~label=recipe, recipeId=2001}]
==>e[{~label=includedIn, ~out_vertex={~label=ingredient, ingredId=3003}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=9567ca16-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=ingredient, ingredId=3003}-includedIn->{~label=recipe, recipeId=2001}]
==>e[{~label=includedIn, ~out_vertex={~label=ingredient, ingredId=3004}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=9567ca14-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=ingredient, ingredId=3004}-includedIn->{~label=recipe, recipeId=2001}]
==>e[{~label=includedIn, ~out_vertex={~label=ingredient, ingredId=3005}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=956754e0-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=ingredient, ingredId=3005}-includedIn->{~label=recipe, recipeId=2001}]
==>e[{~label=created, ~out_vertex={~label=person, personId=1}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=956495c4-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=person, personId=1}-created->{~label=recipe, recipeId=2001}]
==>e[{~label=reviewed, ~out_vertex={~label=person, personId=11}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=95775a70-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=person, personId=11}-reviewed->{~label=recipe, recipeId=2001}]
==>e[{~label=reviewed, ~out_vertex={~label=person, personId=12}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=95773361-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=person, personId=12}-reviewed->{~label=recipe, recipeId=2001}]
==>e[{~label=reviewed, ~out_vertex={~label=person, personId=13}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=95773360-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=person, personId=13}-reviewed->{~label=recipe, recipeId=2001}]
==>e[{~label=reviewed, ~out_vertex={~label=person, personId=14}, 
~in_vertex={~label=recipe, recipeId=2001}, ~local_id=95778180-2e3d-11e8-8043-6bfe97ac83b9}]
[{~label=person, personId=14}-reviewed->{~label=recipe, recipeId=2001}]

Parsing out the comment property from the rated edges

Here, the inE() is specified with the edge label reviewed. The property values are retrieved for the property key comment:

gremlin>  g.V().hasLabel('recipe').has('name', 'Beef Bourguignon').inE('reviewed').values('comment')
==>Yummy!
==>Pretty tasty!
==>It was okay.

Building graph traversals one step at a time can yield interesting results and insight into how to create traversals.

The path of a graph traversal

A traversal step exists that will show the path taken by a graph traversal. First, find the results for a traversal that answers the question about what recipes that list beef and carrots as ingredients are included in the cookbooks, given the cookbook and recipe title?

gremlin>  g.V().hasLabel('ingredient').has('name',within('beef','carrots')).in().as('Recipe').
  out().hasLabel('book').as('Book').
  select('Book','Recipe').by('name').
  by('name')
==>{Book=The Art of French Cooking, Vol. 1, Recipe=Beef Bourguignon}
==>{Book=The Art of Simple Food: Notes, Lessons, and Recipes from a Delicious Revolution, Recipe=Carrot Soup}

One expects that the traversal path will be from ingredient to recipe to book. To check if this assumption is correct, add path() to the end of the traversal.

gremlin>  g.V().hasLabel('ingredient').has('name',within('beef','carrots')).
  in().as('Recipe').
  out().hasLabel('book').as('Book').
  select('Book','Recipe').
  by('name').by('name').path()
==>[v[{~label=ingredient, ingredId=3001}], v[{~label=recipe, recipeId=2001}], v[{~label=book, bookId=1001}], 
      {Book=The Art of French Cooking, Vol. 1, Recipe=Beef Bourguignon}]
==>[v[{~label=ingredient, ingredId=3028}], v[{~label=recipe, recipeId=2007}], v[{~label=book, bookId=1004}], 
      {Book=The Art of Simple Food: Notes, Lessons, and Recipes from a Delicious Revolution, Recipe=Carrot Soup}]

For each case, notice that the traversal does follow the expected path.

Traversal metrics

In addition to tracing the output of each graph traversal step, metrics can produce interesting insights as well. To add metrics to the last traversal shown, use some additional chained steps:

gremlin> g.V().has('recipe', 'name', 'Beef Bourguignon').inE().values('comment').profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep(vertex,[],(label = recipe & name ...                     1           1           9.710    77.19
  query-optimizer                                                                              0.946
    \_condition=((label = recipe & name = Beef Bourguignon) & (true))
  query-setup                                                                                  0.051
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
  index-query                                                                                  7.131
    \_indexType=Search
    \_usesCache=false
    \_statement=SELECT "recipeId" FROM "newComp"."recipe_p" WHERE "solr_query" = '{"q":"*:*", "fq":["name:Bee
                f\\ Bourguignon"]}' LIMIT ?; with params (java.lang.Integer) 50000
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Option
              al.empty, pagingState=null, pageSize=-1, user=Optional.empty, waitForSchemaAgreement=true, asyn
              c=true}
DsegVertexStep(IN,edge,direction = IN,Unordered)                      10          10           2.275    18.09
  query-optimizer                                                                              0.421
    \_condition=((true) & direction = IN)
  query-setup                                                                                  0.016
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
  vertex-query                                                                                 0.475
    \_usesCache=false
    \_statement=SELECT * FROM "newComp"."recipe_e" WHERE "recipeId" = ? LIMIT ? ALLOW FILTERING; with params
                (java.lang.Integer) 2001, (java.lang.Integer) 50000
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Option
              al.empty, pagingState=null, pageSize=-1, user=Optional.empty, waitForSchemaAgreement=true, asyn
              c=true}
    \_usesIndex=false
DsegPropertiesStep([comment],value,label = comm...                     3           3           0.593     4.71
                                            >TOTAL                     -           -          12.579        -

The type of traversal step is listed, along with the number of traversers and the time to complete the traversal step. If a traversal step can be processed in parallel, multiple traversers will be employed to retrieve data. Some traversal steps are graph-global requiring retrieval from the entire graph; DsegGraphStep is a graph-global retrieval that finds vertices that match certain conditions. Other traversal steps are graph-local walks and can be processed in parallel; DsegVertexStep is a graph-local walk that walks through the graph along constrained paths. DSE Graph uses automatic query optimization to determine the traversal strategies to efficiently use any index structures that exist.

Looking at the metrics, the question of performance comes to mind. For instance, is there any way to optimize the traversal shown above? In fact, a simple modification results in a time savings:

gremlin> g.V().has('recipe', 'name', 'Beef Bourguignon').inE('reviewed').values('comment').profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep(vertex,[],(label = recipe & name ...                     1           1           7.109    83.34
  query-optimizer                                                                              0.342
    \_condition=((label = recipe & name = Beef Bourguignon) & (true))
  query-setup                                                                                  0.013
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
  index-query                                                                                  6.110
    \_indexType=Search
    \_usesCache=false
    \_statement=SELECT "recipeId" FROM "DSE_GRAPH_QUICKSTART"."recipe_p" WHERE "solr_query" = '{"q":"*:*", "fq":["name:Bee
                f\\ Bourguignon"]}' LIMIT ?; with params (java.lang.Integer) 50000
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Option
              al.empty, pagingState=null, pageSize=-1, user=Optional.empty, waitForSchemaAgreement=true, asyn
              c=true}
DsegVertexStep(IN,[reviewed],edge,(direction = ...                     4           4           1.234    14.47
  query-optimizer                                                                              0.187
    \_condition=((direction = IN & label = reviewed) & (true))
  query-setup                                                                                  0.009
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
  vertex-query                                                                                 0.474
    \_usesCache=false
    \_statement=SELECT * FROM "DSE_GRAPH_QUICKSTART"."recipe_e" WHERE "recipeId" = ? AND "~~edge_label_id" = ? LIMIT ? ALL
                OW FILTERING; with params (java.lang.Integer) 2001, (java.lang.Integer) 65633, (java.lang.Int
                eger) 50000
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Option
              al.empty, pagingState=null, pageSize=-1, user=Optional.empty, waitForSchemaAgreement=true, asyn
              c=true}
    \_usesIndex=false
DsegPropertiesStep([comment],value,label = comm...                     3           3           0.186     2.19
                                            >TOTAL                     -           -           8.530        -

The change made is subtle. The traversal steps, inE() has been replaced by inE('reviewed'), to find only the edges that are reviews. Although each measurement can vary, generally the second traversal will outperform the first traversal.

DataStax Studio provides more easy to use metrics, in which scrolling over or selecting any phase of the traversal can be explored: