Using search indexes

Basics of using search indexes

DataStax Graph (DSG) leverages DSE Search indexes to efficiently filter vertices and edges by properties, reducing query latency. DSE Search uses a modified Apache Solr to create the search indexes. Graph search indexes can be created using textual, numeric and geospatial data.

Using search indexes requires both DSE Search and DSG to be running on the nodes in a cluster. Search indexing operates on a per-datacenter basis, so if DSE Search and DSG are running in different datacenters, indexing may not behave as expected.

In general, if a query requires textual search or geospatial search, a search index must be used.

Search indexes are created for both full text and string searches by default, but properties can be designated with either option using asText or asString, respectively. Textual search indexes are by default indexed in both tokenized (TextField) and non-tokenized (StrField) forms. This means that all textual predicates (token, tokenPrefix, tokenRegex, eq, neq, regex, prefix) will be usable with all textual vertex or edge properties indexed. Practically, search indexes should be created using the asString() method only in cases where there is absolutely no use for tokenization and text analysis, such as for inventory categories (silverware, shoes, clothing). The asText() method is used if searching tokenized text, such as long multi-sentence descriptions. The query optimizer will choose whether to use analyzed or non-analyzed indexing based on the textual predicate used.
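As a rough illustration of the paragraph above, the following Python sketch models which index form a predicate would run against for each designation. The function and set names are hypothetical and are not part of DSG; only the predicate names and the TextField/StrField forms come from the text.

```python
# Hypothetical model, not DSG code: which index forms a property keeps and
# which form a textual predicate would be served from.
TOKENIZED = {"token", "tokenPrefix", "tokenRegex"}
NON_TOKENIZED = {"eq", "neq", "regex", "prefix"}

def forms_for(designation=None):
    """Return the index forms kept for 'asText', 'asString', or the default (None)."""
    if designation == "asText":
        return {"TextField"}
    if designation == "asString":
        return {"StrField"}
    if designation is None:
        return {"TextField", "StrField"}  # default: both forms are indexed
    raise ValueError(f"unknown designation: {designation}")

def form_used(predicate, designation=None):
    """Return the form the predicate needs, or None if that form is not indexed."""
    if predicate in TOKENIZED:
        wanted = "TextField"
    elif predicate in NON_TOKENIZED:
        wanted = "StrField"
    else:
        raise ValueError(f"unknown predicate: {predicate}")
    return wanted if wanted in forms_for(designation) else None

print(form_used("token"))               # default index: TextField
print(form_used("eq", "asText"))        # None: no non-tokenized form exists
print(form_used("prefix", "asString"))  # StrField
```

Under this model, a default index can serve every listed predicate, while a property restricted with asText() or asString() can serve only one family.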

Property key indexes defined with asText() or undefined (since this is the default) can use the following options for search: token(), tokenPrefix(), tokenRegex(), tokenFuzzy(), and phrase().

Property key indexes defined with asString() can use the following options for search: eq(), neq(), prefix(), regex(), and fuzzy().

The eq() search cannot be used with property key indexes created with asText() because they contain tokenized data and are therefore not suitable for exact text matches.

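In addition, property key indexes defined with asString() can use the following Apache TinkerPop options for search: containing(), notContaining(), startingWith(), notStartingWith(), endingWith(), and notEndingWith().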

Creating a textual search index

The following example search index, from Creating indexes for the vertex label recipe, is used for all examples below:

schema.vertexLabel('recipe').
  searchIndex().
  ifNotExists().
  by('instructions').asText().
  by('name').
  by('cuisine').
  waitForIndex(30).
  create()

This search index uses DSE Search to index instructions as full text using tokenization, and name and cuisine as both text and string. Only those properties that specifically should be indexed as non-tokenized data must specify asString(). If there are properties that specifically should be indexed only as tokenized data, specify asText().

Search using token methods on full text

token()

In a traversal query, use a token search to list the names of all recipes that have the word Saute in the instructions. The method token() is used with a supplied word.

g.V().has('recipe','instructions', token('Saute')).values('name')

results in:

==>Oysters Rockefeller
==>Beef Bourguignon
==>Wild Mushroom Stroganoff

These three recipes are returned because the instructions for each contain the word Saute.

tokenPrefix()

In a traversal query, use a token prefix search to list the names of all recipes that have a word that includes a prefix of Sea in the instructions. The method tokenPrefix() is used with a supplied prefix (a set of alphanumeric characters).

g.V().hasLabel('recipe').has('instructions', tokenPrefix('Sea')).values('name','instructions')

results in:

==>Oysters Rockefeller
==>Saute the shallots, celery, herbs, and seasonings in 3 tablespoons of the butter for 3 minutes. Add the watercress and let it wilt.
==>Roast Pork Loin
==>The day before, separate the meat from the ribs, stopping about 1 inch before the end of the bones. Season the pork liberally inside and out with salt and pepper and refrigerate overnight.

Two recipes are returned, one with the word Season in the instructions, and one with the word seasonings in the instructions. Matching with tokenPrefix() is case-insensitive.

tokenRegex()

In a traversal query, use a token regular expression (regex) search to list the names of all recipes whose instructions have a word that matches the specified regular expression. The regex, .*sea.*in.*, looks for the letters sea preceded by any number of other characters, followed by any number of other characters until the letters in are found, and then followed by any number of other characters. The method tokenRegex() is used with a supplied regex.

g.V().hasLabel('recipe').has('instructions', tokenRegex('.*sea.*in.*')).values('name','instructions')

results in:

==>Oysters Rockefeller
==>Saute the shallots, celery, herbs, and seasonings in 3 tablespoons of the butter for 3 minutes. Add the watercress and let it wilt.

Note that in this query, only the Oysters Rockefeller recipe is returned because the word Season in the Roast Pork Loin recipe does not meet the requirements for the regular expression.
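The three token predicates above can be approximated with a short Python sketch. This is not how DSE Search is implemented: the tokenizer here (lowercase, split on non-alphanumeric characters) and the case-insensitive regex matching are simplifying assumptions, and the function names are hypothetical. The two instruction strings are taken from the results shown above.

```python
import re

def tokenize(text):
    """Assumed analyzer: lowercase, then split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def token(text, word):
    """True if any token equals the supplied word."""
    return word.lower() in tokenize(text)

def token_prefix(text, prefix):
    """True if any token starts with the supplied prefix."""
    p = prefix.lower()
    return any(t.startswith(p) for t in tokenize(text))

def token_regex(text, pattern):
    """True if any single token matches the whole pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return any(rx.fullmatch(t) for t in tokenize(text))

oysters = ("Saute the shallots, celery, herbs, and seasonings in 3 tablespoons "
           "of the butter for 3 minutes. Add the watercress and let it wilt.")
pork = ("The day before, separate the meat from the ribs, stopping about 1 inch "
        "before the end of the bones. Season the pork liberally inside and out "
        "with salt and pepper and refrigerate overnight.")

print(token(oysters, "Saute"), token(pork, "Saute"))
print(token_prefix(oysters, "Sea"), token_prefix(pork, "Sea"))
print(token_regex(oysters, ".*sea.*in.*"), token_regex(pork, ".*sea.*in.*"))
```

Because the regex is tested against one token at a time, the Roast Pork Loin instructions do not match even though the words Season and inside both appear in the text.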

Search using non-tokenized methods on strings

eq()

In a traversal query, use a non-token search to list all recipes that have the recipe name Carrot Soup. Note that this search is case-sensitive, so using carrot soup would not find a vertex. The method eq() is used with a supplied name.

g.V().hasLabel('recipe').has('name', eq('Carrot Soup')).values('name')

results in:

==>Carrot Soup

In a traversal query, use a non-token search to list all recipes that have Carrot in the recipe name. The method eq() is used with a supplied name.

g.V().hasLabel('recipe').has('name', eq('Carrot')).values('name')

No match is found, because only a partial name was specified. For asString() indexes, the entire string must match.

There is an alternative predicate, within(), that works similarly to eq() but accepts a list of values and matches any one of them.

neq()

In a traversal query, use a non-token search to list all recipes that do not have the word Saute in recipe instructions using neq():

g.V().has('recipe', 'instructions', neq('Saute')).values('name')

results in:

==>Salade Nicoise
==>Spicy Meatloaf
==>Rataouille
==>Carrot Soup
==>Roast Pork Loin

There is an alternative predicate, without(), that works identically to neq() when given a single value, and can also exclude several values at once.

phrase()

The phrase() predicate is used with properties designated as TextFields. Find the exact phrase Wild Mushroom Stroganoff in a recipe name:

g.V().hasLabel('recipe').has('name', phrase('Wild Mushroom Stroganoff',0))

The 0 designates that the result must be an exact phrase. The query results in:

==>v[dseg:/recipe/2004]

The vertex id for the correct recipe is returned.

The phrase() predicate can be used for proximity searches, to discover phrases that have terms that are within a certain distance of one another in the tokenized text.

g.V().hasLabel('recipe').has('name', phrase('Wild Mushroom Stroganoff',1))

The value of 1 designates that the terms of the phrase may be at most one term away from one another in the recipe name.

==>v[dseg:/recipe/2004]

The vertex for the correct recipe is returned.

A query using g.V().hasLabel('recipe').has('name', phrase('Wild Mushroom',1)) will also return the correct vertex, but g.V().hasLabel('recipe').has('name', phrase('Mushroom Wild',1)) will not.
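The proximity behaviour described for phrase() can be sketched in Python under strong simplifications. This hypothetical matcher requires the terms to appear in the supplied order and counts only the extra words between them; it does not reproduce the full proximity rules of the underlying search engine, which can also match reordered terms at larger distances.

```python
import re

def tokenize(text):
    """Assumed analyzer: lowercase, then split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def phrase(text, query, distance):
    """In-order phrase match allowing at most `distance` extra words in between."""
    tokens, terms = tokenize(text), tokenize(query)
    if not terms:
        return False
    for start, tok in enumerate(tokens):
        if tok != terms[0]:
            continue
        pos, found = start, True
        for term in terms[1:]:
            try:
                # earliest next occurrence keeps the matched span as short as possible
                pos = tokens.index(term, pos + 1)
            except ValueError:
                found = False
                break
        extra_words = (pos - start) - (len(terms) - 1)
        if found and extra_words <= distance:
            return True
    return False

name = "Wild Mushroom Stroganoff"
print(phrase(name, "Wild Mushroom Stroganoff", 0))
print(phrase(name, "Wild Mushroom", 1))
print(phrase(name, "Mushroom Wild", 1))
```

With this model, Wild Stroganoff would match at a distance of 1 but not at 0, because one word sits between the two terms.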

prefix()

In a traversal query, use a non-token search to find all recipes that have a name beginning with the letter R. The method prefix() is used with a supplied string.

g.V().hasLabel('recipe').has('name', prefix('R')).values('name')

results in:

==>Roast Pork Loin
==>Rataouille

Matches are found for each recipe name that begins with R, provided the recipe name was designated with asString() in the search index.

regex()

In a traversal query, use a non-token search to find all recipes that have a name that includes a specified regular expression. The method regex() is used with a supplied regex.

g.V().hasLabel('recipe').has('name', regex('.*ee.*')).values('name')

results in:

==>Beef Bourguignon

Matches are found for each recipe name that matches the regex .*ee.*, which finds all strings that include ee preceded and followed by any number of other characters, provided the recipe name was designated with asString() in the search index.
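For comparison with the token predicates, the non-tokenized predicates eq(), prefix(), and regex() can be modelled as case-sensitive tests against the whole stored string. The helper names below are hypothetical Python stand-ins, and the recipe names are the ones that appear in the results on this page.

```python
import re

NAMES = [
    "Salade Nicoise", "Wild Mushroom Stroganoff", "Spicy Meatloaf",
    "Oysters Rockefeller", "Rataouille", "Carrot Soup",
    "Beef Bourguignon", "Roast Pork Loin",
]

def eq(value):
    """Whole-string, case-sensitive equality."""
    return lambda s: s == value

def prefix(start):
    """Whole-string prefix test."""
    return lambda s: s.startswith(start)

def regex(pattern):
    """The pattern must match the entire string, not one word of it."""
    rx = re.compile(pattern)
    return lambda s: rx.fullmatch(s) is not None

def has_name(predicate):
    """Return the names that satisfy the predicate, in list order."""
    return [n for n in NAMES if predicate(n)]

print(has_name(eq("Carrot Soup")))
print(has_name(eq("Carrot")))       # partial name: no match
print(has_name(prefix("R")))
print(has_name(regex(".*ee.*")))
```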

Search using fuzzy methods

fuzzy()

The fuzzy() predicate uses optimal string alignment distance calculations to match properties designated as StrFields. Variations in the letters used in words, such as misspellings, are the focus of this predicate. The edit distance specified is the number of single-character edits needed to turn one string into the other, where an edit is an insertion, a deletion, a substitution, or a transposition of two adjacent letters. The examples below use transpositions, with a single transposition of letters constituting one edit.

Find the exact person name of James Beard:

g.V().hasLabel('person').has('name', fuzzy('James Beard', 0)).values('name')

The 0 designates that the result must be an exact match. This query results in:

==>James BEARD

Changing the last value in a fuzzy() predicate will find misspellings:

g.V().hasLabel('person').has('name', fuzzy('James Beard', 1)).values('name')

The 1 designates that the result matches with an edit distance of at most one. This query results in:

==>James BEARD
==>Jmaes BEARD

If a person vertex exists with the misspelling Jmaes Beard, the query shown will find both vertices. The value of 1 finds this misspelling because of the single transposition of the letters a and m.

Note that searching for a misspelling will find the records with the correct spelling, as well as the misspelled name:

g.V().hasLabel('person').has('name', fuzzy('Jmase Beard', 2)).values('name')

The 2 designates that the result must match with an edit distance of at most two. This query results in:

==>James BEARD
==>Jmaes BEARD
==>Jmase BEARD

If person vertices exist with the misspellings Jmaes Beard and Jmase Beard, the query shown will find all three vertices. Jmase Beard matches the search term exactly. The value of 2 finds the misspelling Jmaes Beard because of a single transposition of the letters s and e, and finds the correct spelling because two transpositions (m and a, then s and e) turn Jmase Beard into James Beard.

Specifying an edit distance of 3 or greater matches too many terms for useful results. The resulting search index will be too large to efficiently filter queries.

tokenFuzzy()

The tokenFuzzy() predicate is similar to fuzzy(), but searches for variation across individual tokens in analyzed textual data (TextFields).

Find the recipe name that includes the word Wild while searching for the word with a one-letter misspelling:

g.V().hasLabel('recipe').has('name', tokenFuzzy('Wlid',1)).values('name')

The 1 designates that a one-letter misspelling (one transposition) is acceptable. This query results in:

==>Wild Mushroom Stroganoff
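The edit distances used in the fuzzy() and tokenFuzzy() examples can be checked with a small Python implementation of optimal string alignment distance. The distance function is the standard dynamic-programming formulation; the fuzzy and token_fuzzy wrappers are hypothetical, and their lowercasing and whitespace tokenization are assumptions made to mirror the results shown above.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute,
    or transpose two adjacent characters, each costing one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def fuzzy(value, query, max_edits):
    """Whole-string comparison (assumed case-insensitive)."""
    return osa_distance(value.lower(), query.lower()) <= max_edits

def token_fuzzy(value, query, max_edits):
    """Per-token comparison over whitespace-separated, lowercased tokens."""
    q = query.lower()
    return any(osa_distance(t, q) <= max_edits for t in value.lower().split())

print(osa_distance("james beard", "jmaes beard"))  # 1
print(osa_distance("jmase beard", "james beard"))  # 2
print(token_fuzzy("Wild Mushroom Stroganoff", "Wlid", 1))  # True
```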

Search using Apache TinkerPop methods on strings

Using Apache TinkerPop predicates with DSE Search indexes

Apache TinkerPop predicates can be used with DSE Search indexes in DataStax Graph (DSG). Note that this kind of search on strings is case-sensitive, so using Pork or pork will find different results. The supported predicates are:

  • containing()

  • notContaining()

  • startingWith()

  • notStartingWith()

  • endingWith()

  • notEndingWith()

containing()

In a traversal query, use a string search to list all recipes that have the word Pork in the recipe name. A partial match of the recipe name returns a result, unlike eq(). Note that this search is case-sensitive, so using pork would not find a vertex.

g.V().has('recipe', 'name', containing('Pork')).values('name')

results in:

==>Roast Pork Loin

notContaining()

In a traversal query, use a string search to list all recipes that do not have the word Pork in the recipe name using notContaining(). A partial match of the recipe name will exclude the name from the result, unlike neq().

g.V().has('recipe', 'name', notContaining('Pork')).values('name')

results in:

==>Salade Nicoise
==>Wild Mushroom Stroganoff
==>Spicy Meatloaf
==>Oysters Rockefeller
==>Rataouille
==>Carrot Soup
==>Beef Bourguignon

startingWith()

In a traversal query, use a string search to find all recipes that have a name beginning with Beef. The method startingWith() is used with a supplied string.

g.V().has('recipe', 'name', startingWith('Beef')).values('name')

results in:

==>Beef Bourguignon

Matches are found for each recipe name that begins with Beef, provided the recipe name was designated with asString() in the search index. A search using startingWith('Bee') will find the same result; full words are not necessary.

prefix() and startingWith() will return the same results.

notStartingWith()

In a traversal query, use a string search to find all recipes that do not have a name beginning with Beef. The method notStartingWith() is used with a supplied string.

g.V().has('recipe', 'name', notStartingWith('Beef')).values('name')

results in:

==>Salade Nicoise
==>Wild Mushroom Stroganoff
==>Spicy Meatloaf
==>Oysters Rockefeller
==>Rataouille
==>Carrot Soup
==>Roast Pork Loin

endingWith()

In a traversal query, use a string search to find all recipes that have a name that ends with Soup. The method endingWith() is used with a supplied string. A search using endingWith('oup') will find the same result; full words are not necessary.

g.V().has('recipe', 'name', endingWith('Soup')).values('name')

results in:

==>Carrot Soup

notEndingWith()

In a traversal query, use a string search to find all recipes that do not have a name that ends with Soup. The method notEndingWith() is used with a supplied string.

g.V().has('recipe', 'name', notEndingWith('Soup')).values('name')

results in:

==>Salade Nicoise
==>Wild Mushroom Stroganoff
==>Spicy Meatloaf
==>Oysters Rockefeller
==>Rataouille
==>Beef Bourguignon
==>Roast Pork Loin
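The six TinkerPop predicates in this section behave like ordinary case-sensitive substring tests, with each not variant returning the complement of its positive form. A minimal Python sketch, using hypothetical helper names and the recipe names shown in the results above:

```python
NAMES = [
    "Salade Nicoise", "Wild Mushroom Stroganoff", "Spicy Meatloaf",
    "Oysters Rockefeller", "Rataouille", "Carrot Soup",
    "Beef Bourguignon", "Roast Pork Loin",
]

def containing(part):
    return lambda s: part in s

def starting_with(part):
    return lambda s: s.startswith(part)

def ending_with(part):
    return lambda s: s.endswith(part)

def negate(predicate):
    """Model notContaining(), notStartingWith(), and notEndingWith()."""
    return lambda s: not predicate(s)

def names(predicate):
    """Return the names that satisfy the predicate, in list order."""
    return [n for n in NAMES if predicate(n)]

print(names(containing("Pork")))
print(names(containing("pork")))      # case-sensitive: no match
print(names(starting_with("Bee")))    # full words are not necessary
print(names(negate(ending_with("Soup"))))
```

Every name lands in exactly one of the two lists for a predicate and its negation, which is why the not results above each list seven of the eight recipes.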

Search using tuple and user-defined type (UDT) values

Tuple search is used to discover values stored within tuples. Search indexes make such searches possible. First, a search index must be created.

schema.vertexLabel('person').searchIndex().ifNotExists().by('country').create()

This query is looking for all persons that lived in some country until the date 1960-01-01, with country.field2 storing the value of end_date for a tuple that stores country, start_date, end_date.

g.V().has('person', 'country.field2', '1960-01-01' as LocalDate)

results in:

==>v[dseg:/person/e7cd5752-bc0d-4157-a80f-7523add8dbcd]

This vertex is Julia CHILD. To search for a matching value in a tuple, the tuple name and the field number must be supplied. The fields begin with field0.

UDT search is used to discover values stored within UDTs or nested UDTs. Search indexes make such searches possible. First, a search index must be created.

schema.vertexLabel('location').searchIndex().ifNotExists().by('loc_details').create()

This query is looking for all locations with a street address of 213 F St, with loc_details.loc_address.address1 storing the street address in a UDT that is nested inside the loc_details UDT.

g.V().has('location', 'loc_details.loc_address.address1', '213 F St')

results in:

==>v[dseg:/location/g13]

This vertex is Zippy Mart. To search for a matching value in a UDT, the UDT name and each of the nested field names must be supplied. In this case, the property loc_details is a UDT with its own property loc_address, which is also a UDT that has a property address1 that is a street address.
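The dotted property names used for tuples and UDTs can be pictured as a path lookup. The Python sketch below is only an analogy: the resolve function is hypothetical, tuples are modelled as Python tuples addressed by fieldN, and UDTs as nested dictionaries. The country name and start date are placeholder values invented for the example; only the end date and the street address come from the queries above.

```python
from datetime import date

def resolve(vertex, path):
    """Follow a dotted path such as 'country.field2' or
    'loc_details.loc_address.address1' through tuples and nested dicts."""
    head, *rest = path.split(".")
    value = vertex[head]
    for part in rest:
        if isinstance(value, tuple):
            if not part.startswith("field"):
                raise KeyError(f"tuple fields are addressed as fieldN, got {part!r}")
            value = value[int(part[len("field"):])]  # field0 is the first element
        else:
            value = value[part]
    return value

# Tuple of (country, start_date, end_date); first two values are placeholders.
person = {"name": "Julia CHILD",
          "country": ("Example Country", date(1950, 1, 1), date(1960, 1, 1))}
location = {"name": "Zippy Mart",
            "loc_details": {"loc_address": {"address1": "213 F St"}}}

print(resolve(person, "country.field2") == date(1960, 1, 1))
print(resolve(location, "loc_details.loc_address.address1"))
```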

Search using geospatial values

Geospatial search is used to discover geospatial relationships. Search indexes make such searches possible. First, a search index must be created.

schema.vertexLabel('location').
  searchIndex().
  ifNotExists().
  by('loc_id').asString().
  by('geo_point').by('loc_details').
  create()

This query is looking for all fridge sensors that are located at homes inside a circle with a center at (118, 34) and a radius of 20 degrees, described with the method Geo.inside(). The in().in() steps allow the query to traverse from the location vertices to the home vertices and then to the fridge_sensor vertices.

g.V().hasLabel('location').
  has('geo_point', Geo.inside(Geo.point(118,34),20, Geo.Unit.DEGREES)).
  in().in()

Other indexes will probably be required for the in().in() portion of this query; if an index is missing, the error message displays the indexes that must be created.

The query results in:

==>v[dseg:/fridge_sensor/31/100/55555/1]
==>v[dseg:/fridge_sensor/31/200/55556/3]

More information on geospatial queries can be found in Geospatial traversals. The main point here is that the geospatial portion of the query can only be met using a search index.
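To make the idea of a circle measured in degrees concrete, the following Python sketch tests whether a point lies within a given angular distance of a center. It assumes points are (longitude, latitude) pairs and uses the haversine formula for the central angle; how DSE Search actually evaluates Geo.inside() is not described in this document, so treat this purely as an illustration with hypothetical function names.

```python
import math

def angular_distance_deg(lon1, lat1, lon2, lat2):
    """Central angle between two points on a sphere, in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def inside(point, center, radius_deg):
    """True if the point is within radius_deg degrees of the center."""
    return angular_distance_deg(point[0], point[1], center[0], center[1]) <= radius_deg

center = (118, 34)
print(inside((118, 44), center, 20))  # 10 degrees from the center
print(inside((118, 60), center, 20))  # 26 degrees from the center
```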

Search using numerical values

Search indexes can also be used for non-textual values such as the integer values of cal_goal:

schema.vertexLabel('person').
  searchIndex().
  ifNotExists().
  by('person_id').
  by('badge').
  by('cal_goal').
  by('country').
  by('gender').
  by('macro_goal').
  by('name').
  create()

Only one search index can exist for each vertex or edge label, so this single index includes all seven properties that are indexed.

With this search index, a numeric query to find all persons who have a calorie goal greater than 1200 calories is:

g.V().has('person', 'cal_goal', gt(1200)).values('name', 'cal_goal')

results in five people who have calorie goals greater than 1200 calories:

==>Sharon SMITH
==>1600
==>Betsy JONES
==>1700
==>John DOE
==>1750
==>John SMITH
==>1800
==>Jane DOE
==>1500

To sort the previous search, add additional methods order().by('cal_goal', incr) to sort in increasing order by calorie goals, and fold() to create a more readable result:

g.V().has('person', 'cal_goal', gt(1200)).
  order().
    by('cal_goal', incr).
  values('cal_goal', 'name').
  fold()

results in:

==>[1500, Jane DOE, 1600, Sharon SMITH, 1700, Betsy JONES, 1750, John DOE, 1800, John SMITH]
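The filter, sort, and fold steps of the numeric query can be mirrored in plain Python to show what each step contributes. The five names and calorie goals are the ones in the results above; the sixth entry is a made-up person added only so that the filter has something to exclude.

```python
people = [
    ("Sharon SMITH", 1600),
    ("Betsy JONES", 1700),
    ("John DOE", 1750),
    ("John SMITH", 1800),
    ("Jane DOE", 1500),
    ("Example PERSON", 1200),  # hypothetical entry, not in the source data
]

# has('person', 'cal_goal', gt(1200)): keep goals strictly greater than 1200
matches = [(name, goal) for name, goal in people if goal > 1200]

# order().by('cal_goal', incr): sort in increasing order of calorie goal
ordered = sorted(matches, key=lambda p: p[1])

# values('cal_goal', 'name').fold(): emit goal then name, collected in one list
folded = [value for name, goal in ordered for value in (goal, name)]

print(folded)
```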

Search using two search indexes for a single traversal query

Using two search indexes in one traversal

Create a second search index, similar to the example search index from Creating indexes for the vertex label person.

schema.vertexLabel('person').
  searchIndex().
  ifNotExists().
  by('name').asString().
  by('nick_name').asText().
  create()

This search index will use DSE Search to index nick_name as full text using tokenization, and name as a string.

This traversal query demonstrates a mid-traversal V() that allows a search index for person as well as a search index for recipe to be used to execute the query. The first part uses tokenRegex() to find recipes whose instructions contain a word that starts with Braise; this part of the query is labeled as r for use later in the query. Then the search index for person is searched for a person name that starts with the letter J, and the query traverses an outgoing edge from each person to a vertex, keeping only the vertices that match the recipe found in the first part of the query with where(eq('r')).

g.V().has('recipe', 'instructions', tokenRegex('Braise.*')).as('r').
  V().has('person', 'name', prefix('J')).out().where(eq('r')).values('name')

results in:

==>Beef Bourguignon
==>Beef Bourguignon
==>Beef Bourguignon
==>Beef Bourguignon

This traversal finds the recipe Beef Bourguignon four times, and illustrates some of the complexity that can be successfully used with search indexes. A modified query that gets the path from recipe -> person -> recipe finds that Julia CHILD created the recipe Beef Bourguignon, but also finds the three reviews written about Beef Bourguignon by John DOE, John SMITH, and Jane DOE:

g.V().has('recipe', 'instructions', tokenRegex('Braise.*')).as('r').
  V().has('person', 'name', prefix('J')).out().where(eq('r')).path().unfold().values('name')

with results:

==>Beef Bourguignon
==>John DOE
==>Beef Bourguignon
==>Beef Bourguignon
==>Julia CHILD
==>Beef Bourguignon
==>Beef Bourguignon
==>John SMITH
==>Beef Bourguignon
==>Beef Bourguignon
==>Jane DOE
==>Beef Bourguignon

Each group of three lines in the results represents one recipe -> person -> recipe path.
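The shape of the two-index traversal can be imitated with in-memory data to show why Beef Bourguignon appears four times. Everything in this Python sketch is a simplified stand-in: the instruction texts are invented, the Carrot Soup and Sharon SMITH edges are invented for contrast, and only the four person-to-Beef-Bourguignon relationships come from the text above.

```python
import re

RECIPES = {
    "Beef Bourguignon": "Braise the beef in red wine.",    # hypothetical text
    "Carrot Soup": "Simmer the carrots until tender.",     # hypothetical text
}
# Outgoing (person, recipe) edges, covering both created and reviewed.
EDGES = [
    ("John DOE", "Beef Bourguignon"),
    ("Julia CHILD", "Beef Bourguignon"),
    ("John SMITH", "Beef Bourguignon"),
    ("Jane DOE", "Beef Bourguignon"),
    ("Jane DOE", "Carrot Soup"),            # invented edge
    ("Sharon SMITH", "Beef Bourguignon"),   # invented edge
]

def token_regex(text, pattern):
    """True if any single word of the text matches the whole pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return any(rx.fullmatch(t) for t in re.findall(r"[0-9A-Za-z]+", text))

# First search: recipes labeled r
r = {name for name, text in RECIPES.items() if token_regex(text, "Braise.*")}

# Second search plus out() and where(eq('r')): keep edges from J people into r
paths = [(recipe, person, recipe)
         for person, recipe in EDGES
         if person.startswith("J") and recipe in r]

for path in paths:
    print(" -> ".join(path))
```

Each matching person contributes one path, so four people whose names start with J and who have an edge to the recipe produce four results.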
