Simple Traversals
Returning to the Recipe Toy Graph, let’s expand the graph to include reviewers and ratings.
Load the following script to add the reviewer vertices and recipe-reviewer edges.
You must have run the generateRecipe.groovy script previously, so that the recipe vertices exist before loading this script:
// Generates review vertices and edges for Recipe Toy Graph
// :load /tmp/generateReviews.groovy
// reviewer vertices
johnDoe = graph.addVertex(label, 'reviewer', 'name','John Doe')
johnSmith = graph.addVertex(label, 'reviewer', 'name', 'John Smith')
janeDoe = graph.addVertex(label, 'reviewer', 'name','Jane Doe')
sharonSmith = graph.addVertex(label, 'reviewer', 'name','Sharon Smith')
betsyJones = graph.addVertex(label, 'reviewer', 'name','Betsy Jones')
beefBourguignon = g.V().has('recipe', 'name','Beef Bourguignon').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'name', 'Beef Bourguignon')}
spicyMeatLoaf = g.V().has('recipe', 'name','Spicy Meatloaf').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'name', 'Spicy Meatloaf')}
carrotSoup = g.V().has('recipe', 'name','Carrot Soup').tryNext().orElseGet {graph.addVertex(label, 'recipe', 'name', 'Carrot Soup')}
// reviewer - recipe edges
johnDoe.addEdge('rated', beefBourguignon, 'timestamp', '2014-01-01T05:15:00.00Z', 'stars', 5, 'comment', 'Pretty tasty!')
johnSmith.addEdge('rated', beefBourguignon, 'timestamp', '2014-01-23T00:00:00.00Z', 'stars', 4)
janeDoe.addEdge('rated', beefBourguignon, 'timestamp', '2014-02-01T00:00:00.00Z', 'stars', 5, 'comment', 'Yummy!')
sharonSmith.addEdge('rated', beefBourguignon, 'timestamp', '2015-01-01T00:00:00.00Z', 'stars', 3, 'comment', 'It was okay.')
johnDoe.addEdge('rated', spicyMeatLoaf, 'timestamp', '2015-12-31T10:56:00.00Z', 'stars', 4, 'comment', 'Really spicy - be careful!')
sharonSmith.addEdge('rated', spicyMeatLoaf, 'timestamp', '2014-07-23T00:30:00.00Z', 'stars', 3, 'comment', 'Too spicy for me. Use less garlic.')
janeDoe.addEdge('rated', carrotSoup, 'timestamp', '2015-12-30T01:20:00.00Z', 'stars', 5, 'comment', 'Loved this soup! Yummy vegetarian!')
Run the script from the Gremlin console by passing its path to :load, which then executes the script remotely:
gremlin> :load /tmp/generateReviews.groovy
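The script first looks up the recipes that were entered previously and assigns each result to a recipe variable; if a recipe is not found, the vertex is created.
The variables are then used to create the reviewer-recipe edges.
Each lookup calls the Apache TinkerPop method tryNext(), which returns a Java Optional, and then the Optional method orElseGet(), which supplies a new vertex only when the lookup is empty; see the Apache TinkerPop Java API for more information.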
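The find-or-create pattern behind tryNext().orElseGet { ... } can be sketched in plain Python. This is an illustration only, not TinkerPop code: the vertices list and the get_or_create helper are hypothetical stand-ins for the graph and for the Optional-based lookup.

```python
# Hypothetical in-memory stand-in for the graph's vertices (not a TinkerPop API).
vertices = [{"label": "recipe", "name": "Beef Bourguignon"}]


def get_or_create(label, name):
    """Return the first vertex matching label and name, creating it if absent."""
    # next(..., None) plays the role of tryNext(): a match, or an "empty" result.
    found = next(
        (v for v in vertices if v["label"] == label and v["name"] == name),
        None,
    )
    if found is None:
        # The orElseGet branch: runs only when the lookup found nothing.
        found = {"label": label, "name": name}
        vertices.append(found)
    return found


beef = get_or_create("recipe", "Beef Bourguignon")  # existing vertex is reused
soup = get_or_create("recipe", "Carrot Soup")       # missing vertex is created
print(len(vertices))
```

Calling get_or_create a second time with the same arguments returns the same object, which is why the script can be loaded whether or not the recipe vertices already exist.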
Exploring recipe ratings
Check that the vertices were created by counting the number of vertices with the reviewer label.
gremlin> g.V().hasLabel('reviewer').count()
==>5
List all the reviewers using values('name'):
// Get the names of all the reviewers
gremlin> g.V().hasLabel('reviewer').values('name')
==>John Smith
==>Sharon Smith
==>Betsy Jones
==>Jane Doe
==>John Doe
Verifying that the reviewers are created is useful, but creating traversals that answer queries is more important. For instance, what does John Doe say about recipes?
Start the query at the vertex that has the label reviewer and a name value of John Doe.
g.V().has('reviewer', 'name','John Doe').outE('rated').values('comment')
The outgoing edges step outE('rated') finds every rated edge that leaves John Doe's vertex, and values('comment') retrieves the comment property of each of those edges:
==>Pretty tasty!
==>Really spicy - be careful!
To see which recipes John Doe reviewed, add inV() to move from each rated edge to the recipe vertex it points to, then list the names.
g.V().has('reviewer', 'name','John Doe').outE('rated').inV().values('name')
resulting in:
==>Beef Bourguignon
==>Spicy Meatloaf
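The difference between reading an edge property and stepping to the adjacent vertex can be sketched in plain Python. The tuple layout and the out_edges helper are assumptions made for this illustration; the data is copied from the review script above.

```python
# 'rated' edges as (reviewer, recipe, stars, comment); None means no comment.
rated = [
    ("John Doe", "Beef Bourguignon", 5, "Pretty tasty!"),
    ("John Smith", "Beef Bourguignon", 4, None),
    ("Jane Doe", "Beef Bourguignon", 5, "Yummy!"),
    ("Sharon Smith", "Beef Bourguignon", 3, "It was okay."),
    ("John Doe", "Spicy Meatloaf", 4, "Really spicy - be careful!"),
    ("Sharon Smith", "Spicy Meatloaf", 3, "Too spicy for me. Use less garlic."),
    ("Jane Doe", "Carrot Soup", 5, "Loved this soup! Yummy vegetarian!"),
]


def out_edges(reviewer):
    """Like outE('rated'): every rating edge that leaves the reviewer."""
    return [edge for edge in rated if edge[0] == reviewer]


# values('comment') reads a property stored on the edge itself.
comments = [edge[3] for edge in out_edges("John Doe") if edge[3] is not None]
# inV().values('name') moves on to the vertex the edge points at.
recipes = [edge[1] for edge in out_edges("John Doe")]
print(comments)
print(recipes)
```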
A reasonable next question is which reviews give a recipe more than 3 stars.
Try a traversal that uses gt(3), or greater than 3, to filter on the stars values:
gremlin> g.E().hasLabel('rated').has('stars', gt(3)).valueMap()
==>[stars:4, timestamp:2014-01-23T00:00:00Z]
==>[comment:Loved this soup! Yummy vegetarian!, stars:5, timestamp:2015-12-30T00:00:00Z]
==>[comment:Yummy!, stars:5, timestamp:2014-02-01T00:00:00Z]
==>[comment:Pretty tasty!, stars:5, timestamp:2014-01-01T00:00:00Z]
==>[comment:Really spicy - be careful!, stars:4, timestamp:2015-12-31T00:00:00Z]
The traversal shown finds each edge that is labeled rated and filters the edges found to output only those edges with a stars rating of 4 or 5.
But this traversal outputs the ratings, not the recipes that earned them.
To name the recipes, modify the traversal to move to the vertex that each edge points into with inV(), and list those vertices by name with values('name'):
gremlin> g.E().hasLabel('rated').has('stars', gt(3)).inV().values('name')
==>Beef Bourguignon
==>Spicy Meatloaf
==>Beef Bourguignon
==>Carrot Soup
==>Beef Bourguignon
The results indicate that Beef Bourguignon has been rated 4 stars or higher three times, although we don’t have any reviewer information, just duplication of the recipe title in the list.
Returning to the previous query, let’s look for more recent reviews.
Adding a traversal step that filters by the timestamp finds the 4 and 5 star ratings, using gte(4) or greater than or equal to 4, that have a review date of Jan 1, 2015 or later.
gremlin> g.E().hasLabel('rated').has('stars',gte(4)).has('timestamp', gte(Instant.parse('2015-01-01T00:00:00.00Z'))).valueMap()
==>[comment:Loved this soup! Yummy vegetarian!, timestamp:2015-12-30T00:00:00Z, stars:5]
==>[comment:Really spicy - be careful!, timestamp:2015-12-31T00:00:00Z, stars:4]
Chaining traversal steps together can yield very exacting results.
For instance, appending inV().values('name') to the last query would list the names of the recipes that received a 4 or 5 star review since the beginning of the year 2015.
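The chained filter can be sketched in plain Python as two conditions applied to each edge. The tuple layout is an assumption for this illustration, and the timestamps from the review script are written with a +00:00 offset instead of Z so that datetime.fromisoformat accepts them on older Python versions.

```python
from datetime import datetime, timezone

# (recipe, stars, timestamp) for each 'rated' edge in the review script.
ratings = [
    ("Beef Bourguignon", 5, "2014-01-01T05:15:00+00:00"),
    ("Beef Bourguignon", 4, "2014-01-23T00:00:00+00:00"),
    ("Beef Bourguignon", 5, "2014-02-01T00:00:00+00:00"),
    ("Beef Bourguignon", 3, "2015-01-01T00:00:00+00:00"),
    ("Spicy Meatloaf", 4, "2015-12-31T10:56:00+00:00"),
    ("Spicy Meatloaf", 3, "2014-07-23T00:30:00+00:00"),
    ("Carrot Soup", 5, "2015-12-30T01:20:00+00:00"),
]

cutoff = datetime(2015, 1, 1, tzinfo=timezone.utc)

# has('stars', gte(4)) and has('timestamp', gte(...)) become two conditions;
# taking the recipe name corresponds to inV().values('name').
recent_high = [
    recipe
    for recipe, stars, timestamp in ratings
    if stars >= 4 and datetime.fromisoformat(timestamp) >= cutoff
]
print(recent_high)
```

The 3 star Beef Bourguignon review dated exactly Jan 1, 2015 passes the date condition but fails the stars condition, so both filters matter.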
Manipulating the ratings with statistical functions yields interesting answers. For instance, what is the mean value of all the recipe ratings?
gremlin> g.E().hasLabel('rated').values('stars').mean()
==>4.142857142857143
The mean of about 4.1 stars suggests that the reviewers generally liked the recipes they reviewed; no rating in this sample is lower than 3 stars.
Perhaps a prolific reviewer would have a wider range of reviews. Find the maximum number of reviews that a single reviewer has written.
gremlin> g.V().hasLabel('reviewer').map(outE('rated').count()).max()
==>2
For each reviewer, this traversal uses map() to count the outgoing rated edges with outE('rated').count(), then finds the highest of those counts using max().
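The same per-reviewer count followed by a maximum can be sketched in plain Python. The two lists are assumptions for this illustration, built from the reviewer names and edges in the review script.

```python
# All reviewer vertices, including one with no outgoing 'rated' edges.
reviewers = ["John Doe", "John Smith", "Jane Doe", "Sharon Smith", "Betsy Jones"]

# The reviewer at the tail of each 'rated' edge, in script order.
rated_by = [
    "John Doe", "John Smith", "Jane Doe", "Sharon Smith",
    "John Doe", "Sharon Smith", "Jane Doe",
]

# map(outE('rated').count()): one count per reviewer, zero included.
counts = {name: rated_by.count(name) for name in reviewers}

# max(): the largest of those counts.
most_reviews = max(counts.values())
print(counts)
print(most_reviews)
```

The maximum reports only the count, not who holds it; here three reviewers share the top value.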
Another measure that can be investigated is the mean rating of each reviewer. This traversal query uses a number of Apache TinkerPop traversal steps.
The as() step creates display labels for the two items that will be listed, the reviewer’s name and the mean stars value for each reviewer.
These display labels, reviewer and starCount, are then used in a select() step that gets each value, first the reviewer’s name using by('name') and then the starCount using by(outE('rated').values('stars').mean()).
Despite its name, starCount holds a mean, not a count.
The select() step checks each reviewer vertex and then traverses to discover the associated starCount value.
gremlin> g.V().hasLabel('reviewer').as('reviewer','starCount').
select('reviewer','starCount').
by('name').
by(outE('rated').values('stars').mean())
==>[reviewer:Jane Doe, starCount:5.0]
==>[reviewer:Betsy Jones, starCount:NaN]
==>[reviewer:John Doe, starCount:4.5]
==>[reviewer:John Smith, starCount:4.0]
==>[reviewer:Sharon Smith, starCount:3.0]
Notice that Betsy Jones is listed as a reviewer, but has not reviewed any recipes.
Her starCount lists NaN (not a number).
The results show that Jane Doe gave the highest mean rating, 5.0, while Sharon Smith gave the lowest, 3.0.
Ordering the results by the starCount, or mean star rating, can allow the highest rater and the lowest rater to be discovered.
Here, the traversal steps order().by(select('starCount'), decr) use the output of the select('starCount') step to sort the display in descending order.
gremlin> g.V().hasLabel('reviewer').as('reviewer','starCount').
select('reviewer','starCount').
by('name').
by(outE('rated').values('stars').mean()).
order().by(select('starCount'), decr)
==>[reviewer:Betsy Jones, starCount:NaN]
==>[reviewer:Jane Doe, starCount:5.0]
==>[reviewer:John Doe, starCount:4.5]
==>[reviewer:John Smith, starCount:4.0]
==>[reviewer:Sharon Smith, starCount:3.0]
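Betsy Jones and her lack of ratings cause the listing to be incorrect, because her NaN value sorts ahead of every real rating.
If Betsy were not listed, adding the traversal step limit(1) would return the highest rater, Jane Doe.
The coalesce() traversal step returns the results of the first traversal it is given that produces any; here it substitutes constant(0) when a reviewer has no rated edges, so the mean is 0.0 rather than NaN.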
gremlin> g.V().hasLabel('reviewer').as('reviewer','starCount').
select('reviewer','starCount').
by('name').
by(coalesce(outE('rated').values('stars'),constant(0)).mean()).
order().by(select('starCount'), decr)
==>[reviewer:Jane Doe, starCount:5.0]
==>[reviewer:John Doe, starCount:4.5]
==>[reviewer:John Smith, starCount:4.0]
==>[reviewer:Sharon Smith, starCount:3.0]
==>[reviewer:Betsy Jones, starCount:0.0]
Note that now Betsy Jones has a starCount of 0.0, a placeholder for a reviewer with no ratings, and sorts to the end of the list.
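The effect of the fallback value can be sketched in plain Python. The dictionary and both helper functions are hypothetical, written for this illustration; the star values come from the review script.

```python
import math

# Stars given by each reviewer; Betsy Jones has no 'rated' edges.
stars_by_reviewer = {
    "Jane Doe": [5, 5],
    "Betsy Jones": [],
    "John Doe": [5, 4],
    "John Smith": [4],
    "Sharon Smith": [3, 3],
}


def mean_or_nan(values):
    """Mean of an empty list is reported as NaN, as in the first listing."""
    return sum(values) / len(values) if values else math.nan


def mean_with_default(values, default=0):
    """Like coalesce(..., constant(0)): use one default value when empty."""
    values = values or [default]
    return sum(values) / len(values)


# NaN does not compare meaningfully with numbers, so sort on the
# defaulted mean instead.
ranked = sorted(
    ((name, mean_with_default(stars)) for name, stars in stars_by_reviewer.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Choosing 0 as the default is a presentation decision: it pushes reviewers without ratings to the end of a descending list, but it should not be read as a rating they actually gave.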
Find the star rating each reviewer has given to recipes:
g.V().hasLabel('reviewer').as('reviewer').outE('rated').as('rating').inV().as('recipe').
select('reviewer','rating','recipe').
by('name').
by('stars').
by('name')
Note how each part of the path is labeled as it is traversed: the reviewer vertex with as('reviewer'), the rated edge with as('rating'), and the recipe vertex with the step modulator as('recipe').
The rating is read from the same edge that leads to the recipe; computing it from the reviewer vertex with outE('rated').values('stars') would repeat the reviewer's first rating for every recipe.
The first item in the output listing is retrieved from the reviewer vertex, the second from the rated edge, and the third from the adjacent recipe vertex.
==>{reviewer=John Doe, rating=5, recipe=Beef Bourguignon}
==>{reviewer=John Doe, rating=4, recipe=Spicy Meatloaf}
==>{reviewer=John Smith, rating=4, recipe=Beef Bourguignon}
==>{reviewer=Jane Doe, rating=5, recipe=Beef Bourguignon}
==>{reviewer=Jane Doe, rating=5, recipe=Carrot Soup}
==>{reviewer=Sharon Smith, rating=3, recipe=Beef Bourguignon}
==>{reviewer=Sharon Smith, rating=3, recipe=Spicy Meatloaf}
In general, the most interesting statistic from the reviews answers two questions: how many people rated a particular recipe, and what is the mean rating for that recipe?
The graph traversal starts from a recipe vertex this time, and retrieves the recipe name, the number of reviews by counting the incoming edges with inE('rated').count(), and the mean stars value of the incoming edges with inE('rated').values('stars').mean().
The coalesce() step shown earlier could be used to change all NaN values for meanRating into zeroes.
g.V().hasLabel('recipe').as('recipe','numberOfReviews','meanRating').
select('recipe','numberOfReviews','meanRating').
by('name').
by(inE('rated').count()).
by(inE('rated').values('stars').mean())
==>{recipe=Beef Bourguignon, numberOfReviews=4, meanRating=4.25}
==>{recipe=Wild Mushroom Stroganoff, numberOfReviews=0, meanRating=NaN}
==>{recipe=Spicy Meatloaf, numberOfReviews=2, meanRating=3.5}
==>{recipe=Rataouille, numberOfReviews=0, meanRating=NaN}
==>{recipe=Salade Nicoise, numberOfReviews=0, meanRating=NaN}
==>{recipe=Roast Pork Loin, numberOfReviews=0, meanRating=NaN}
==>{recipe=Oysters Rockefeller, numberOfReviews=0, meanRating=NaN}
==>{recipe=Carrot Soup, numberOfReviews=1, meanRating=5.0}
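The per-recipe count and mean can be sketched in plain Python by collecting the stars of the incoming edges for each recipe. The data structures are assumptions for this illustration; only one of the unreviewed recipes is included to keep the sketch short.

```python
import math
from collections import defaultdict

recipes = ["Beef Bourguignon", "Spicy Meatloaf", "Carrot Soup", "Roast Pork Loin"]

# (recipe, stars) for each 'rated' edge in the review script.
rated = [
    ("Beef Bourguignon", 5), ("Beef Bourguignon", 4),
    ("Beef Bourguignon", 5), ("Beef Bourguignon", 3),
    ("Spicy Meatloaf", 4), ("Spicy Meatloaf", 3),
    ("Carrot Soup", 5),
]

# Collect the stars arriving at each recipe, like inE('rated').values('stars').
stars_in = defaultdict(list)
for recipe, stars in rated:
    stars_in[recipe].append(stars)

summary = {}
for recipe in recipes:
    stars = stars_in.get(recipe, [])
    summary[recipe] = {
        "numberOfReviews": len(stars),
        # No incoming edges means there is nothing to average.
        "meanRating": sum(stars) / len(stars) if stars else math.nan,
    }
print(summary)
```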
Searching recipes
A common query for recipes is finding recipes that contain a certain ingredient.
g.V().hasLabel('recipe').out().has('name','beef').in().hasLabel('recipe').values('name')
==>Beef Bourguignon
Replacing the single name with within('beef','carrots') finds recipes that include either one ingredient or the other.
g.V().hasLabel('recipe').out().has('name',within('beef','carrots')).in().hasLabel('recipe').values('name')
==>Beef Bourguignon
==>Carrot Soup
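The "either ingredient" test can be sketched in plain Python as a set intersection. The includes mapping and the recipes_with_any helper are hypothetical; the mapping holds only ingredients that this page confirms, so the Carrot Soup entry is deliberately partial.

```python
# Recipe -> ingredients, standing in for the 'includes' edges.
includes = {
    "Beef Bourguignon": {"tomato paste", "beef", "onion", "mashed garlic", "butter"},
    "Carrot Soup": {"carrots"},  # partial: only the ingredient confirmed here
}


def recipes_with_any(*wanted):
    """Recipes that include at least one of the wanted ingredients."""
    wanted = set(wanted)
    return sorted(
        name for name, ingredients in includes.items() if ingredients & wanted
    )


print(recipes_with_any("beef"))
print(recipes_with_any("beef", "carrots"))
```

Unlike the traversal, which emits one result for each matching ingredient it passes through, this sketch returns each recipe once even if it includes several of the wanted ingredients.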
Finding all the ingredients for a particular recipe is a common query.
g.V().match(
__.as('a').hasLabel('ingredient'),
__.as('a').in('includes').has('name','Beef Bourguignon')).
select('a').by('name')
This query uses a match() step to find a match for the ingredients used to make Beef Bourguignon.
The traversal starts by filtering all vertices to find the ingredients, then traverses to the recipe vertices along the includes edges using in('includes').
The double underscore, __, is the Apache TinkerPop class that starts each anonymous traversal inside match(); it is required here because as is a reserved word in Groovy.
The results are:
==>tomato paste
==>beef
==>onion
==>mashed garlic
==>butter
Although inside() is most commonly used for geospatial searches, the method can be used to find anything that falls within a particular range of values; both endpoints are excluded from the range.
An example is finding books that have a publishing date after 1960 and before 1970:
g.V().has('book', 'year', inside(1960,1970)).valueMap()
The results are:
==>{ISBN=[0-394-40135-2], year=[1968], name=[The French Chef Cookbook]}
==>{year=[1961], name=[The Art of French Cooking, Vol. 1]}
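The range test can be sketched in plain Python as a predicate that excludes both endpoints. The inside function here is a hypothetical stand-in written for this illustration, and the book list uses the titles and years shown elsewhere on this page.

```python
books = [
    {"name": "The French Chef Cookbook", "year": 1968},
    {"name": "The Art of French Cooking, Vol. 1", "year": 1961},
    {"name": "Simca's Cuisine: 100 Classic French Recipes for Every Occasion", "year": 1972},
    {"name": "The Art of Simple Food: Notes, Lessons, and Recipes from a Delicious Revolution", "year": 2007},
]


def inside(low, high):
    """Build a predicate that is true strictly between low and high."""
    return lambda value: low < value < high


in_sixties = inside(1960, 1970)
matches = [book["name"] for book in books if in_sixties(book["year"])]
print(matches)
```

A book published in exactly 1960 or 1970 would not match, so widen the bounds when the endpoints should be included.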
Grouping output
Group output from a graph traversal using the group() traversal step.
For example, display all the vertices by name, grouped by label:
g.V().group().by(label).by('name')
Note that the meals, ingredients, authors, books, recipes, and reviewers are all grouped in the results:
==>[meal:[JuliaDinner, Saturday Feast, EverydayDinner], ingredient:[olive oil, chicken broth,
eggplant, pork sausage, green bell pepper, yellow onion, celery, hard-boiled egg, shallots,
zucchini, butter, green beans, mashed garlic, onion, mushrooms, bacon, parsley, oyster,
tomato, thyme, pork loin, tuna, tomato paste, ground beef, red wine, fennel, Pernod,
chervil, egg noodles, carrots, beef], author:[Louisette Bertholie, Kelsie Kerr,
Alice Waters, Julia Child, Emeril Lagasse, Simone Beck, Patricia Curtan, Patricia Simon,
James Beard, Fritz Streiff], book:[Simca's Cuisine: 100 Classic French Recipes for Every
Occasion, The French Chef Cookbook, The Art of Simple Food: Notes, Lessons, and Recipes
from a Delicious Revolution, The Art of French Cooking, Vol. 1], recipe:[Wild Mushroom
Stroganoff, Roast Pork Loin, Spicy Meatloaf, Rataouille, Beef Bourguignon, Oysters
Rockefeller, Salade Nicoise, Carrot Soup], reviewer:[Sharon Smith, John Smith, Jane Doe,
Betsy Jones, John Doe]]
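Grouping by one value and collecting another can be sketched in plain Python with a dictionary of lists. The vertices list is a small hypothetical sample of (label, name) pairs taken from the output above, not the whole graph.

```python
from collections import defaultdict

# A few (label, name) pairs from the graph, in arbitrary order.
vertices = [
    ("meal", "JuliaDinner"),
    ("reviewer", "John Doe"),
    ("meal", "Saturday Feast"),
    ("reviewer", "Jane Doe"),
    ("meal", "EverydayDinner"),
]

# The first by() chooses the key (the label); the second by() chooses
# what is collected under that key (the name).
groups = defaultdict(list)
for label, name in vertices:
    groups[label].append(name)

print(dict(groups))
```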
Another example groups all books by year and displays a listing of each year books were published followed by the book titles:
g.V().hasLabel('book').group().by('year').by('name')
and lists:
==>{1968=[The French Chef Cookbook, The French Chef Cookbook],
1972=[Simca's Cuisine: 100 Classic French Recipes for Every Occasion, Simca's Cuisine: 100
Classic French Recipes for Every Occasion], 2007=[The Art of Simple Food: Notes, Lessons,
and Recipes from a Delicious Revolution, The Art of Simple Food: Notes, Lessons, and Recipes
from a Delicious Revolution], 1961=[The Art of French Cooking, Vol. 1, The Art of French
Cooking, Vol. 1]}
Grouping for processing using local()
Oftentimes, a particular step in a graph traversal must be processed locally, once for each incoming element.
The next two examples use the limit() step to show how local() changes the scope of a step from the whole stream of the traversal to each element that enters local().
First, find authors and the years in which they published books, keeping just the first two results:
g.V().hasLabel('author').as('author').out().properties('year').as('year').
select('author','year').
by('name').
by().
limit(2)
This query returns only the first two results of the whole traversal, both for the same author:
==>{author=Julia Child, year=vp[year->1961]}
==>{author=Julia Child, year=vp[year->1968]}
Using local(), change this query to find the years of up to two books for each author in the graph:
g.V().hasLabel('author').as('author').
local(out().properties('year').as('year').limit(2)).
select('author','year').
by('name').
by()
Note that up to two books are displayed for each author:
==>{author=Julia Child, year=vp[year->1961]}
==>{author=Julia Child, year=vp[year->1968]}
==>{author=Simone Beck, year=vp[year->1961]}
==>{author=Simone Beck, year=vp[year->1972]}
==>{author=Louisette Bertholie, year=vp[year->1961]}
==>{author=Patricia Simon, year=vp[year->1972]}
==>{author=Alice Waters, year=vp[year->2007]}
==>{author=Patricia Curtan, year=vp[year->2007]}
==>{author=Kelsie Kerr, year=vp[year->2007]}
==>{author=Fritz Streiff, year=vp[year->2007]}
The traversal step local() has many applications for processing a subsection of a graph within a graph traversal to return results before moving on to further processing.
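The difference between a limit on the whole stream and a limit inside local() can be sketched in plain Python. The function names and the dictionary are assumptions for this illustration; the authors and years are taken from the output above.

```python
from itertools import islice

# Author -> publication years, for three of the authors listed above.
years_by_author = {
    "Julia Child": [1961, 1968],
    "Simone Beck": [1961, 1972],
    "Louisette Bertholie": [1961],
}


def pairs(author_years):
    """Yield every (author, year) result, one author after another."""
    for author, years in author_years.items():
        for year in years:
            yield author, year


def global_limit(author_years, n):
    """limit(n) at the end: keep the first n results of the whole stream."""
    return list(islice(pairs(author_years), n))


def local_limit(author_years, n):
    """limit(n) inside local(): keep up to n results for each author."""
    return [
        (author, year)
        for author, years in author_years.items()
        for year in islice(years, n)
    ]


print(global_limit(years_by_author, 2))
print(local_limit(years_by_author, 2))
```

With a limit of 1, local_limit still returns one row for every author, whereas global_limit returns a single row in total.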