Using search indexes
Basics of using search indexes
DataStax Graph (DSG) leverages DSE Search indexes to efficiently filter vertices and edges by properties, reducing query latency. DSE Search uses a modified Apache Solr to create the search indexes. Graph search indexes can be created using textual, numeric and geospatial data.
Using search indexes requires DSE Search as well as DSG to be running on the nodes in a cluster. It is important to know that search indexing operates on a per-datacenter basis, so if DSE Search and DSG are running in different datacenters, the behavior of indexing may not match what is expected.
In general, if a query requires textual search or geospatial search, a search index must be used.
Search indexes are created for both full-text and string searches by default, but a property can be restricted to one or the other using asText() or asString(), respectively.
Textual search indexes are by default indexed in both tokenized (TextField) and non-tokenized (StrField) forms.
This means that all textual predicates (token, tokenPrefix, tokenRegex, eq, neq, regex, prefix) will be usable with all textual vertex or edge properties indexed.
Practically, search indexes should be created using the asString()
method only in cases where there is absolutely no use for tokenization and text analysis, such as for inventory categories (silverware, shoes, clothing).
The asText()
method is used if searching tokenized text, such as long multi-sentence descriptions.
The query optimizer will choose whether to use analyzed or non-analyzed indexing based on the textual predicate used.
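Property key indexes defined with asText(), or left undefined (since this is the default), can use the following predicates for search: token(), tokenPrefix(), tokenRegex(), tokenFuzzy(), and phrase().
Property key indexes defined with asString() can use the following predicates for search: eq(), neq(), prefix(), regex(), and fuzzy().
In addition, property key indexes defined with asString() can use the following Apache TinkerPop predicates for search: containing(), notContaining(), startingWith(), notStartingWith(), endingWith(), and notEndingWith().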
Creating a textual search index
The following example search index, from Creating indexes, is defined for the vertex label recipe and is used for the recipe examples below:
schema.vertexLabel('recipe').
searchIndex().
ifNotExists().
by('instructions').asText().
by('name').
by('cuisine').
waitForIndex(30).
create()
This search index uses DSE Search to index instructions as full text using tokenization, and name and cuisine as both text and string.
Only those properties that specifically should be indexed as non-tokenized data must specify asString()
.
If there are properties that specifically should be indexed only as tokenized data, specify asText().
Search using token methods on full text
token()
In a traversal query, use a token search to list the names of all recipes that have the word Saute in the instructions.
The method token()
is used with a supplied word.
g.V().has('recipe','instructions', token('Saute')).values('name')
results in:
==>Oysters Rockefeller
==>Beef Bourguignon
==>Wild Mushroom Stroganoff
This search finds these three recipes because the instructions for each include the word Saute.
tokenPrefix()
In a traversal query, use a token prefix search to list the names of all recipes that have a word beginning with the prefix Sea in the instructions.
The method tokenPrefix()
is used with a supplied prefix (a set of alphanumeric characters).
g.V().hasLabel('recipe').has('instructions', tokenPrefix('Sea')).values('name','instructions')
results in:
==>Oysters Rockefeller
==>Saute the shallots, celery, herbs, and seasonings in 3 tablespoons of the butter for 3 minutes. Add the watercress and let it wilt.
==>Roast Pork Loin
==>The day before, separate the meat from the ribs, stopping about 1 inch before the end of the bones. Season the pork liberally inside and out with salt and pepper and refrigerate overnight.
Two recipes are returned, one with the word Season in the instructions, and one with the word seasonings in the instructions.
Matching with tokenPrefix() is case-insensitive.
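The behavior of token() and tokenPrefix() can be illustrated outside the graph. The following Python sketch is not how DSE Search is implemented: it assumes that text analysis amounts to lowercasing and splitting on non-alphanumeric characters, and the helper names tokenize, token, and token_prefix are hypothetical. The two instruction strings are the ones returned in the results above.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token(text, word):
    """True if any token equals the word, ignoring case."""
    return word.lower() in tokenize(text)

def token_prefix(text, prefix):
    """True if any token starts with the prefix, ignoring case."""
    wanted = prefix.lower()
    return any(t.startswith(wanted) for t in tokenize(text))

oysters = ("Saute the shallots, celery, herbs, and seasonings in 3 tablespoons "
           "of the butter for 3 minutes. Add the watercress and let it wilt.")
pork = ("The day before, separate the meat from the ribs, stopping about 1 inch "
        "before the end of the bones. Season the pork liberally inside and out "
        "with salt and pepper and refrigerate overnight.")

print(token(oysters, "Saute"))       # True: the token "saute" is present
print(token(pork, "Saute"))          # False: no such token
print(token_prefix(oysters, "Sea"))  # True: "seasonings" starts with "sea"
print(token_prefix(pork, "Sea"))     # True: "season" starts with "sea"
```

Under these assumptions, both instruction texts satisfy the prefix Sea, which mirrors the two recipes returned by the traversal.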
tokenRegex()
In a traversal query, use a token regular expression (regex) search to find all recipes whose instructions have a word that matches the specified regular expression.
The regex, .*sea.*in.*, looks for the letters sea preceded by any number of other characters and followed by any number of other characters until the letters in are found, which in turn may be followed by any number of other characters.
The method tokenRegex()
is used with a supplied regex.
g.V().hasLabel('recipe').has('instructions', tokenRegex('.*sea.*in.*')).values('name','instructions')
results in:
==>Oysters Rockefeller
==>Saute the shallots, celery, herbs, and seasonings in 3 tablespoons of the butter for 3 minutes. Add the watercress and let it wilt.
Note that in this query, only the Oysters Rockefeller
recipe is returned because the word Season in the Roast Pork Loin recipe does not meet the requirements for the regular expression.
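The difference between the two recipes can be reproduced with a small Python sketch. It rests on two assumptions that are not stated in the text above: that tokens are produced by lowercasing and splitting on non-alphanumeric characters, and that the regex must match an entire token rather than a span of the whole text. The helper names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_regex(text, pattern):
    """True if the pattern matches one whole token (assumed anchoring)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return any(rx.fullmatch(t) for t in tokenize(text))

oysters = ("Saute the shallots, celery, herbs, and seasonings in 3 tablespoons "
           "of the butter for 3 minutes. Add the watercress and let it wilt.")
pork = ("The day before, separate the meat from the ribs, stopping about 1 inch "
        "before the end of the bones. Season the pork liberally inside and out "
        "with salt and pepper and refrigerate overnight.")

pattern = ".*sea.*in.*"
print(token_regex(oysters, pattern))  # True: the token "seasonings" matches
print(token_regex(pork, pattern))     # False: "season" has no "in" after "sea"

# Matching against the untokenized text would behave differently, because
# "Season ... inside" spans several words of the Roast Pork Loin instructions.
print(bool(re.search(pattern, pork.lower())))  # True
```

The last line shows why per-token matching matters: the pattern is satisfied by the full sentence but not by any single word in it.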
Search using non-tokenized methods on strings
eq()
In a traversal query, use a non-token search to list all recipes that have Carrot Soup
in the recipe name.
Note that this search is case-sensitive, so using carrot soup
would not find a vertex.
The method eq()
is used with a supplied name.
g.V().hasLabel('recipe').has('name', eq('Carrot Soup')).values('name')
results in:
==>Carrot Soup
In a traversal query, use a non-token search to list all recipes that have Carrot
in the recipe name.
The method eq()
is used with a supplied name.
g.V().hasLabel('recipe').has('name', eq('Carrot')).values('name')
No match is found, because only a partial name was specified.
For asString() indexes, the entire string must match.
There is an alternative predicate, within(), that works similarly to eq() but accepts one or more values and matches a property that equals any of them.
neq()
In a traversal query, use a non-token search to list all recipes that do not have the word Saute
in recipe instructions using neq()
:
g.V().has('recipe', 'instructions', neq('Saute')).values('name')
results in:
==>Salade Nicoise
==>Spicy Meatloaf
==>Rataouille
==>Carrot Soup
==>Roast Pork Loin
There is an alternative predicate, without(), that works like neq() when given a single value; it also accepts several values and excludes a property that equals any of them.
phrase()
The phrase() predicate is used with properties designated as TextFields.
Find the exact phrase Wild Mushroom Stroganoff in a recipe name:
g.V().hasLabel('recipe').has('name', phrase('Wild Mushroom Stroganoff',0))
The 0 designates that the result must be an exact phrase.
The query results in:
==>v[dseg:/recipe/2004]
The vertex id for the correct recipe is returned.
The phrase()
predicate can be used for proximity searches, to discover phrases that have terms that are within a certain distance of one another in the tokenized text.
g.V().hasLabel('recipe').has('name', phrase('Wild Mushroom Stroganoff',1))
The value of 1 designates that the terms of the phrase may be at most one term away from one another in the recipe name.
The query results in:
==>v[dseg:/recipe/2004]
The vertex for the correct recipe is returned.
The query g.V().hasLabel('recipe').has('name', phrase('Wild Mushroom',1))
also returns the correct vertex, but g.V().hasLabel('recipe').has('name', phrase('Mushroom Wild',1))
does not.
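The effect of the proximity value can be approximated with a short Python sketch. It is a simplified model, not the DSE Search implementation: it assumes whitespace tokenization, lowercase comparison, and that each phrase term occurs at most once in the text. The function name phrase_match is hypothetical. For every phrase term it computes how far the term's position in the text is shifted from its position in the phrase, and accepts the match when the spread of those shifts does not exceed the proximity value.

```python
def phrase_match(text, phrase, slop):
    """Approximate proximity matching for phrases whose terms are unique in the text."""
    doc = text.lower().split()
    terms = phrase.lower().split()
    shifts = []
    for i, term in enumerate(terms):
        if term not in doc:
            return False
        # Shift = position in the text minus position in the phrase.
        shifts.append(doc.index(term) - i)
    return max(shifts) - min(shifts) <= slop

name = "Wild Mushroom Stroganoff"
print(phrase_match(name, "Wild Mushroom Stroganoff", 0))  # True
print(phrase_match(name, "Wild Mushroom", 1))             # True
print(phrase_match(name, "Mushroom Wild", 1))             # False
print(phrase_match(name, "Mushroom Wild", 2))             # True
```

In this model, reversing two adjacent terms produces a spread of 2, which is why the reversed phrase fails with a value of 1.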
prefix()
In a traversal query, use a non-token search to find all recipes that have a name beginning with the letter R
.
The method prefix()
is used with a supplied string.
g.V().hasLabel('recipe').has('name', prefix('R')).values('name')
results in:
==>Roast Pork Loin
==>Rataouille
Matches are found for each recipe name that begins with R
, provided the recipe name was designated with asString()
in the search index.
regex()
In a traversal query, use a non-token search to find all recipes that have a name that includes a specified regular expression.
The method regex()
is used with a supplied regex.
g.V().hasLabel('recipe').has('name', regex('.*ee.*')).values('name')
results in:
==>Beef Bourguignon
Matches are found for each recipe name that matches the regex .*ee.*, that is, any string that includes ee preceded and followed by any number of other characters, provided the recipe name was designated with asString() in the search index.
Search using fuzzy methods
fuzzy()
The fuzzy()
predicate uses optimal string alignment distance calculations to match properties designated as StrFields.
Variations in the letters used in words, such as misspellings, are the focus of this predicate.
The edit distance specified refers to the number of transpositions of letters, with a single transposition of letters constituting one edit.
Find the exact person name of James Beard:
g.V().hasLabel('person').has('name', fuzzy('James Beard', 0)).values('name')
The 0
designates that the result must be an exact match.
This query results in:
==>James BEARD
Changing the last value in a fuzzy()
predicate will find misspellings:
g.V().hasLabel('person').has('name', fuzzy('James Beard', 1)).values('name')
The 1
designates that the result matches with an edit distance of at most one.
This query results in:
==>James BEARD
==>Jmaes BEARD
If a person
vertex exists with the misspelling Jmaes Beard, the query shown will find both vertices.
The value of 1 finds this misspelling because of the single transposition of the letters a and m.
Note that searching for a misspelling will find the records with the correct spelling, as well as the misspelled name:
g.V().hasLabel('person').has('name', fuzzy('Jmase Beard', 2)).values('name')
The 2
designates that the result must match with at most two transpositions.
This query results in:
==>James BEARD
==>Jmaes BEARD
==>Jmase BEARD
If person vertices exist with the misspellings Jmaes Beard and Jmase Beard, the query shown will find all three vertices.
The value of 2 finds the misspelling Jmaes Beard because of the single transposition of the letters e and s, and it finds the correct spelling James Beard because of a second transposition of the letters a and m.
Specifying an edit distance of 3 or greater matches too many terms for useful results. The resulting search index will be too large to efficiently filter queries.
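The edit distances quoted in the fuzzy() examples can be checked with a small Python sketch of the optimal string alignment distance. This is a textbook dynamic-programming version written for illustration, not the code that DSE Search runs, and the function name osa_distance is hypothetical. Besides adjacent transpositions, the standard definition also counts single-character insertions, deletions, and substitutions as one edit each.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("James Beard", "Jmaes Beard"))  # 1
print(osa_distance("Jmase Beard", "Jmaes Beard"))  # 1
print(osa_distance("Jmase Beard", "James Beard"))  # 2
```

The three values line up with the examples: a distance of 1 separates James from Jmaes, while the search term Jmase is one edit from Jmaes and two edits from James.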
tokenFuzzy()
The tokenFuzzy() predicate is similar to fuzzy(), but searches for variation across individual tokens in analyzed textual data (TextFields).
Find the recipe name that includes the word Wild while searching for the word with a one-letter misspelling:
g.V().hasLabel('recipe').has('name', tokenFuzzy('Wlid',1)).values('name')
The 1 designates that a one-letter misspelling (one transposition) is acceptable.
This query results in:
==>Wild Mushroom Stroganoff
Search using Apache TinkerPop methods on strings
Using Apache TinkerPop predicates with DSE Search indexes
Apache TinkerPop predicates can be used with DSE Search indexes in DataStax Graph (DSG).
Note that this kind of search on strings is case-sensitive, so using Pork
or pork
will find different results.
The supported predicates are:
- containing()
- notContaining()
- startingWith()
- notStartingWith()
- endingWith()
- notEndingWith()
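As a rough mental model, the six predicates listed above behave like case-sensitive substring, prefix, and suffix tests. The Python sketch below mirrors them with hypothetical helper functions over the recipe names that appear in this page's results; it illustrates the semantics only and does not query a graph.

```python
names = [
    "Salade Nicoise", "Wild Mushroom Stroganoff", "Spicy Meatloaf",
    "Oysters Rockefeller", "Rataouille", "Carrot Soup",
    "Beef Bourguignon", "Roast Pork Loin",
]

# Hypothetical stand-ins for the TinkerPop text predicates (all case-sensitive).
def containing(value, s):      return s in value
def not_containing(value, s):  return s not in value
def starting_with(value, s):   return value.startswith(s)
def not_starting_with(value, s): return not value.startswith(s)
def ending_with(value, s):     return value.endswith(s)
def not_ending_with(value, s): return not value.endswith(s)

print([n for n in names if containing(n, "Pork")])     # ['Roast Pork Loin']
print([n for n in names if containing(n, "pork")])     # []
print([n for n in names if starting_with(n, "Bee")])   # ['Beef Bourguignon']
print([n for n in names if ending_with(n, "oup")])     # ['Carrot Soup']
print(len([n for n in names if not_containing(n, "Pork")]))  # 7
```

The empty result for the lowercase pork shows the case sensitivity noted above.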
containing()
In a traversal query, use a string search to list all recipes that have the word Pork
in the recipe name.
A partial match of the recipe name returns a result, unlike eq()
.
Note that this search is case-sensitive, so using pork
would not find a vertex.
g.V().has('recipe', 'name', containing('Pork')).values('name')
results in:
==>Roast Pork Loin
notContaining()
In a traversal query, use a string search to list all recipes that do not have the word Pork
in the recipe name using notContaining()
.
A partial match of the recipe name will exclude the name from the result, unlike neq()
.
g.V().has('recipe', 'name', notContaining('Pork')).values('name')
results in:
==>Salade Nicoise
==>Wild Mushroom Stroganoff
==>Spicy Meatloaf
==>Oysters Rockefeller
==>Rataouille
==>Carrot Soup
==>Beef Bourguignon
startingWith()
In a traversal query, use a string search to find all recipes that have a name beginning with Beef
.
The method startingWith()
is used with a supplied string.
g.V().has('recipe', 'name', startingWith('Beef')).values('name')
results in:
==>Beef Bourguignon
Matches are found for each recipe name that begins with Beef
, provided the recipe name was designated with asString()
in the search index.
A search using startingWith('Bee')
will find the same result;
full words are not necessary.
notStartingWith()
In a traversal query, use a string search to find all recipes that do not have a name beginning with Beef
.
The method notStartingWith()
is used with a supplied string.
g.V().has('recipe', 'name', notStartingWith('Beef')).values('name')
results in:
==>Salade Nicoise
==>Wild Mushroom Stroganoff
==>Spicy Meatloaf
==>Oysters Rockefeller
==>Rataouille
==>Carrot Soup
==>Roast Pork Loin
endingWith()
In a traversal query, use a string search to find all recipes that have a name that ends with Soup
.
The method endingWith()
is used with a supplied string.
A search using endingWith('oup')
will find the same result;
full words are not necessary.
g.V().has('recipe', 'name', endingWith('Soup')).values('name')
results in:
==>Carrot Soup
notEndingWith()
In a traversal query, use a string search to find all recipes that do not have a name that ends with Soup
.
The method notEndingWith()
is used with a supplied string.
g.V().has('recipe', 'name', notEndingWith('Soup')).values('name')
results in:
==>Salade Nicoise
==>Wild Mushroom Stroganoff
==>Spicy Meatloaf
==>Oysters Rockefeller
==>Rataouille
==>Beef Bourguignon
==>Roast Pork Loin
Search using tuple and user-defined type (UDT) values
Tuple search
Tuple search is used to discover values stored within tuples. Search indexes make such searches possible. First, a search index must be created.
schema.vertexLabel('person').searchIndex().ifNotExists().by('country').create()
This query is looking for all persons that lived in some country until the date 1960-01-01
, with country.field2
storing the value of end_date
for a tuple that stores country, start_date, end_date
.
g.V().has('person', 'country.field2', '1960-01-01' as LocalDate)
results in:
==>v[dseg:/person/e7cd5752-bc0d-4157-a80f-7523add8dbcd]
who happens to be Julia CHILD
.
So, to search for a matching value in a tuple, the tuple name and the field number must be supplied.
The fields begin with field0
.
UDT search
UDT search is used to discover values stored within UDTs or nested UDTs. Search indexes make such searches possible. First, a search index must be created.
schema.vertexLabel('location').searchIndex().ifNotExists().by('loc_details').create()
This query is looking for all locations with the street address 213 F St, with loc_details.loc_address.address1 identifying the field address1 of the UDT loc_address, which is nested in the UDT loc_details.
g.V().has('location', 'loc_details.loc_address.address1' , '213 F St')
results in:
==>v[dseg:/location/g13]
which happens to be Zippy Mart
.
So, to search for a matching value in a UDT, the UDT property name and each of the nested field names must be supplied.
In this case, the property loc_details
is a UDT with its own property loc_address
which is also a UDT that has a property address1
that is a street address.
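The dotted names used for tuple and UDT searches can be read as paths into nested values. The Python sketch below illustrates that reading with a hypothetical resolve helper: a tuple is modeled as a Python tuple addressed by field0, field1, and so on, and a UDT as a dictionary addressed by field name. The country name and start date in the sample data are placeholders invented for the example; only the end date, the street address, and the vertex names come from the text above.

```python
import re
from datetime import date

def resolve(value, path):
    """Walk a dotted path through nested dicts (UDTs) and tuples (fieldN)."""
    for part in path.split("."):
        m = re.fullmatch(r"field(\d+)", part)
        if m and isinstance(value, tuple):
            value = value[int(m.group(1))]   # tuple fields are numbered from field0
        else:
            value = value[part]              # UDT fields are addressed by name
    return value

person = {
    "name": "Julia CHILD",
    # (country, start_date, end_date); the first two values are placeholders.
    "country": ("SomeCountry", date(1950, 1, 1), date(1960, 1, 1)),
}
location = {
    "name": "Zippy Mart",
    "loc_details": {"loc_address": {"address1": "213 F St"}},
}

print(resolve(person, "country.field2") == date(1960, 1, 1))                 # True
print(resolve(location, "loc_details.loc_address.address1") == "213 F St")   # True
```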
Search using geospatial values
Geospatial search
Geospatial search is used to discover geospatial relationships. Search indexes make such searches possible. First, a search index must be created.
schema.vertexLabel('location').
searchIndex().
ifNotExists().
by('loc_id').asString().
by('geo_point').by('loc_details').
create()
This query looks for all fridge sensors located at homes that lie inside a circle with a center at (118, 34) and a radius of 20 degrees, using the method Geo.inside().
The in().in()
steps allow the query to traverse from the location
vertices to the home
vertices and then to the fridge_sensor
vertices.
g.V().hasLabel('location').
has('geo_point', Geo.inside(Geo.point(118,34),20, Geo.Unit.DEGREES)).
in().in()
Other indexes will probably be required for the |
results in:
==>v[dseg:/fridge_sensor/31/100/55555/1]
==>v[dseg:/fridge_sensor/31/200/55556/3]
More information on geospatial queries can be found in Geospatial traversals. The main point here is that the geospatial portion of the query can only be met using a search index.
Search using numerical values
Numeric search
Search indexes can also be used for non-textual values such as the integer values of cal_goal:
schema.vertexLabel('person').
searchIndex().
ifNotExists().
by('person_id').
by('badge').
by('cal_goal').
by('country').
by('gender').
by('macro_goal').
by('name').
create()
This example illustrates that only one search index can exist for each vertex or edge label and includes seven properties that are indexed.
With this search index, a numeric query to find all persons who have a calorie goal greater than 1200 calories is:
g.V().has('person', 'cal_goal', gt(1200)).values('name', 'cal_goal')
results in five people with calorie goals greater than 1200 calories:
==>Sharon SMITH
==>1600
==>Betsy JONES
==>1700
==>John DOE
==>1750
==>John SMITH
==>1800
==>Jane DOE
==>1500
To sort the previous search, add additional methods order().by('cal_goal', incr)
to sort in increasing order by calorie goals, and fold()
to create a more readable result:
g.V().has('person', 'cal_goal', gt(1200)).
order().
by('cal_goal', incr).
values('cal_goal', 'name').
fold()
results in:
==>[1500, Jane DOE, 1600, Sharon SMITH, 1700, Betsy JONES, 1750, John DOE, 1800, John SMITH]
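The filter, sort, and fold steps can be mirrored in plain Python to show how the flattened result is assembled. This sketch uses the names and calorie goals from the results above; the extra entry with a goal of exactly 1200 is hypothetical and is included only to show that gt(1200) excludes it.

```python
people = [
    ("Sharon SMITH", 1600),
    ("Betsy JONES", 1700),
    ("John DOE", 1750),
    ("John SMITH", 1800),
    ("Jane DOE", 1500),
    ("Hypothetical PERSON", 1200),  # invented entry, not in the source data
]

# has('person', 'cal_goal', gt(1200))
matches = [(name, goal) for name, goal in people if goal > 1200]

# order().by('cal_goal', incr)
ordered = sorted(matches, key=lambda p: p[1])

# values('cal_goal', 'name').fold(): emit goal then name, collected into one list
folded = [item for name, goal in ordered for item in (goal, name)]

print(folded)
# [1500, 'Jane DOE', 1600, 'Sharon SMITH', 1700, 'Betsy JONES', 1750, 'John DOE', 1800, 'John SMITH']
```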
Search using two search indexes for a single traversal query
Using two search indexes in one traversal
Create a second search index, such as the example search index from Creating indexes, for the vertex label person.
schema.vertexLabel('person').index('search').search().
by('name').asString().
by('nick_name').ifNotExists().add()
This search index will use DSE Search to index nick_name as full text using tokenization, and name as a string.
This traversal query demonstrates a mid-traversal V()
that allows a search index for person
as well as a search index for recipe
to be used to execute the query.
The first index is searched with tokenRegex() to find recipe instructions that include a word starting with Braise;
this part of the query is labeled as r for use later in the query.
Then the search index for person is searched for a person name that starts with the letter J, and the traversal follows the outgoing edges from each person, keeping only the adjacent vertices that equal the recipe labeled r by using where(eq('r')).
g.V().has('recipe', 'instructions', tokenRegex('Braise.*')).as('r').
V().has('person', 'name', prefix('J')).out().where(eq('r')).values('name')
results in:
==>Beef Bourguignon
==>Beef Bourguignon
==>Beef Bourguignon
==>Beef Bourguignon
This query traversal finds the recipe Beef Bourguignon
four times, and illustrates some of the complexity that can be successfully used with search indexes.
A modified query that gets the path from recipe ->person->recipe
finds that Julia CHILD
created the recipe Beef Bourguignon
, but also finds the three reviews written about Beef Bourguignon by John DOE
, John SMITH
, and Jane DOE
:
g.V().has('recipe', 'instructions', tokenRegex('Braise.*')).as('r').
V().has('person', 'name', prefix('J')).out().where(eq('r')).path().unfold().values('name')
with results:
==>Beef Bourguignon
==>John DOE
==>Beef Bourguignon
==>Beef Bourguignon
==>Julia CHILD
==>Beef Bourguignon
==>Beef Bourguignon
==>John SMITH
==>Beef Bourguignon
==>Beef Bourguignon
==>Jane DOE
==>Beef Bourguignon
Each group of three lines in the results represents one recipe->person->recipe path.
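The way the mid-traversal V() combines the two searches can be pictured as nested loops: every recipe found by the first search is paired with every person found by the second, and a path survives only when one of that person's outgoing edges leads back to the same recipe. The Python sketch below models this with hand-written data. The four edges to Beef Bourguignon follow the description above; the remaining edges and the person Sharon SMITH's edge are hypothetical and are there only to show what gets filtered out.

```python
# Result of the first search (tokenRegex on instructions), as reported above.
recipes_labeled_r = ["Beef Bourguignon"]

# Outgoing edges per person; targets other than Beef Bourguignon are hypothetical.
out_edges = {
    "John DOE": ["Beef Bourguignon"],
    "Julia CHILD": ["Beef Bourguignon", "Salade Nicoise"],
    "John SMITH": ["Beef Bourguignon"],
    "Jane DOE": ["Beef Bourguignon", "Carrot Soup"],
    "Sharon SMITH": ["Spicy Meatloaf"],
}

paths = []
for r in recipes_labeled_r:                  # ...as('r')
    for person, targets in out_edges.items():   # mid-traversal V()
        if not person.startswith("J"):       # has('person', 'name', prefix('J'))
            continue
        for target in targets:               # out()
            if target == r:                  # where(eq('r'))
                paths.append((r, person, target))

for path in paths:
    print(path)
print(len(paths))  # 4
```

Each tuple corresponds to one recipe->person->recipe path, so the four tuples match the four repetitions of Beef Bourguignon in the first result.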