Indexing

Explain indexes and how they affect DataStax Graph (DSG) performance.

Indexes play a significant role in making DataStax Graph (DSG) queries performant. Graph queries that must traverse the entire graph will have poor performance with high latency, which explains why full-scan queries are a bad idea in production environments. Indexes can narrow down the starting point for a graph traversal, ignoring either vertices or edges that do not require scanning to complete the query.

In addition, if a query does not use the partition and clustering keys, the query will throw an exception if an appropriate index doesn't exist. Although for development, there are other options to evaluate the query, for production, indexes are required. Thus, the query g.V('dseg:/person/4ce9caf1-25b8-468e-a983-69bad20c017a') is the only acceptable query if the partition key is the person_id. A query to find James Beard fails if an index based on name is not present. Once that index exists, the query g.V().has('person', 'name', 'James BEARD') finds the same vertex.

DSG uses three types of indexes to increase the performance of graph queries: Each type has a well-defined role that allows users to pick the right choice. In addition, DSG has a new index analyzer to identify a required index based on a query. If you are new to graph querying, the index analyzer is a useful learning tool that analyzes a query, makes a recommendation, and can apply the recommended index to the graph.
Table 1. DSG Index Types
Index Use Predicates
Materialized view (MV) Most efficient index for high cardinality, high selectivity properties and equality predicates. Most appropriate if collections (set, list, map) are not part of the query. Equality: eq, within
Secondary Efficient index for low cardinality, low selectivity properties and equality predicates. Most appropriate if collections (set, list, map) are part of the query. Predicates on list, set, or map collections: contains,containsKey, containsValue, entryEq
Search Efficient and versatile index for properties with a wide range of cardinality and selectivity. Most appropriate when text search conditions or multiple condition are part of the query, especially fuzzy matching.
  • Text and String-specific: token, tokenPrefix, tokenRegex, tokenFuzzy, phrase, regex, prefix, fuzzy
  • TinkerPop String-specific:containing, notContaining, startingWith, notStartingWith, endingWith, notEndingWith
  • Geospatial: inside
  • Inequality: neq, without, gt, gte, lt, lte, inside, outside, between
  • Multiple conditions, where some of the conditions are against a list or set collection, tuple, or user-defined type (UDT)
inverse() Speciality index for creating an edge MV index supporting traversals against the edge's natural direction, i.e., bidirectional edges. Allows use of vertex steps: in, out, both, inE, outE, bothE, inV, outV, and bothV

When queries must use an index, DSG uses a two-step process to decide on the appropriate index to use. First, if any search-only predicates are part of the query, a search index will be used. If a query does not use a search-only predicate, then DSG will select an existing index in the order: MV index > secondary index > DSE Search index.

Index Analyzer

The index analyzer uses the indexFor() step along with either analyze() to analyze and recommend an index or apply() to apply a recommended index. The type of index recommended is based on the entered query and is based on the suggested index use decribed for each index type below. In addition, for complex queries, the index analyzer may recommend multiple indexes.

To use the index analyzer, enter a graph query:
schema.indexFor(g.V().has('person', 'name', 'Julia CHILD')).analyze()
and get a typical return:
==>Traversal requires that the following indexes are created:
schema.vertexLabel('person').materializedView('person_by_name').ifNotExists().partitionBy('name').clusterBy('person_id', Asc).create()
Notice that the type of index, materializedView is part of the recommendation, as well as a recommended index name and partition and clustering keys. Both vertex labels and edge labels can be analyzed for indexes.
To use the index analyzer for an edge property query:
schema.indexFor(g.E().hasLabel('reviewed').has('stars', 5)).analyze()
with the results:
==>Traversal requires that the following indexes are created:
schema.edgeLabel('reviewed').
  from('person').to('recipe').
  materializedView('person__reviewed__recipe_by_stars').
  ifNotExists().
  partitionBy('stars').
  clusterBy(OUT, 'person_id', Asc).
  clusterBy(IN, 'recipe_id', Asc).
  create()

The index analyzer will create an index after analysis if apply() is used. For exploration, use analyze() to see what recommendations are suggested, and use apply() if the index suggested is the best index. Another way to use analyze() is to get an index suggestion, then manually create the index if optimizations are desired.

Materialized view indexes

A materialized view (MV) index will generally be recommended if predicates in the query use general property columns and do not use collections (set, list, map) or tuples. This index type will also be recommended if the query is not search-specific. Materialized views are tables generated from a base table to provide a query based on a different primary key than the base table. Materialized view indexes also support the use of multiple partition keys to address the issue of partitions growing too large. Searching materialized views yields similar response times to searching base tables, although writing the data incurs a small time penalty. When data is written or updated in the graph, the index information is updated in the MV table along with the graph tables. A consequence of using a MV table is higher write latencies, but results in lower read latencies for graph traversals.

Secondary indexes

A secondary index (2i) will generally be recommended if predicates in the query are using collections. Collections can be indexed and queried to find a collection containing a particular value. Maps are indexed a bit differently from sets and lists, given the key-value nature of maps. Each specific predicate used in a query will necessitate an index using the following options:
Table 2. Secondary index query-index options
Query predicate Index option Applies to
contains(x) indexValues() Set, List
containsKey(x) indexKeys() Map
containsValue(x) indexValues() Map
entryEq(x, y) indexEntries() Map
eq(x) indexFull() Frozen Set, List, Map

Sets and lists can index all values found by indexing the collection column. Maps can index a map key, map value, or map entry using the methods shown below. Multiple indexes can be created on the same map column in a table so that map keys, values, or entries can be queried. In addition, frozen collections can be indexed using FULL to index the full content of a frozen collection.

Search indexes

A search index is recommended if a query is search-specific or has multiple conditions that a query must meet and some of the conditions are for list or set collection, tuples, or user-defined types (UDTs). Search-specific queries include textual, numeric or geospatial properties and rely on DSE Search; all data types except map collections can be indexed. DSE Search uses one search index per vertex or edge label, so all properties indexed must be defined in one index. DSE Search utilitzes two types of indexing:
  • Full text indexing (asText()) performs tokenization and secondary processing such as case normalization. Full text indexing is useful for queries where partial match of text is required, and lends itself to regular expressing (regEx) searching.
  • String indexing (asString()) is useful for queries where an exact string is sought and no tokenization is required, similar to Solr faceting.

Search indexes are created for both full text and string searches by default, but properties can be designated with either option using asText or asString, respectively. Textual search indexes are by default indexed in both tokenized (TextField) and non-tokenized (StrField) forms. This means that all textual predicates (token, tokenPrefix, tokenRegex, eq, neq, regex, prefix) will be usable with all textual vertex or edge properties indexed. Practically, search indexes should be created using the asString() method only in cases where there is absolutely no use for tokenization and text analysis, such as for inventory categories (silverware, shoes, clothing). The asText() method is used if searching tokenized text, such as long multi-sentence descriptions. The query optimizer will choose whether to use analyzed or non-analyzed indexing based on the textual predicate used.

Queries that need search indexes use search predicates like:
Table 3. Text and String search index uses
Use Tokenized (full text) Non-tokenized (string)
A comparison looking for a match of a token containing the value: 'Wild Stroganoff' or a phrase containing either an exact match: 'Wild Stroganoff',0or a proximal match: 'Wild Stroganoff',1 token phrase
A comparison using some text followed by a wildcard: John* tokenPrefix prefix
A comparison using a regular expression: .*sea.*in.* tokenRegex regex
Uses optimal string alignment distance calculations to match properties: 'Wlid',1 or 'Jmase Beard', 2 tokenFuzzy fuzzy
Additional predicates that require a search index are neq / without and geospatial predicate inside.
CAUTION: tokenRegex will display case insensitivity in queries.

Graph indexing best practices

The most important fact to remember is that a search index is the only choice for indexing if search-specific predicates are defined. For instance, g.V().has('person', 'gender', 'F').has('country.field1', 'France') will use a search index for both properties,country and gender to narrow the query, whereas materialized view and secondary indexes are created for a single property.

Both secondary and materialied view index data are stored in Cassandra tables, with secondary indexes stored locally on cluster nodes, while MVs are distributed across the cluster. Secondary indexes are the only choice if a query needs to lookup a collection key or value. MVs are a good choice if a non-collection value is sought. Both types of index have limitations that must be considered. Secondary indexes can accumulate tombstones that affect read performance if frequent data mutation occurs. MVs add write latency, so a small number of MV tables per base table is critical. If too many secondary indexes or MV indexes are created for a single vertex or edge label, revisit the data modeling for the graph and consider adding more vertex labels with fewer properties for each.

Edge indexes must be considered carefully if a graph has any celebrity, or super-nodes, vertices that have many incoming edges. While a celebrity may have outgoing edges that number in the hundreds, they may have millions of incoming edges. Creating an inverse index on the millions of edges will have a poor performance effect on the graph, while an inverse index on the outgoing edges is fine. In general, the data model will benefit from breaking the incoming edges with an additional partition key and no indexing.

DSG automatically uses the appropriate index when processing a query; designation of which index type to use is not a feature. As mentioned earlier, if a query does not use a search-only predicate, then DSG will select an existing index in the order: MV index > secondary index > DSE Search index. However, choosing the optimal type of index is key to good performance, and is a good reason to check queries with the index analyzer.

Indexes do take time to build, and until an index is available, queries that depend on an index can fail. Applications that create schema, immediately followed by data insertion that require search indexes will likely experience errors. The waitFor() option when creating indexes manually can prevent failure by giving the index time to completely build. If search indexes are used, also be aware that the graph queries should be run on DSE Search-enabled nodes in the cluster. DSE Search requires extra resources for each index built: a minimum of 256MB of memory by default, and two physical cores. For a typical 32GB node, 16 search indexes would be a reasonable number to create.