Indexing

Explains indexes and how they affect DSE Graph performance.

Indexes play a significant role in making DSE Graph queries performant. Graph queries that must traverse the entire graph to find information perform poorly, which is why full-scan queries are disallowed in production environments. Indexing can improve two aspects of querying a graph: finding the initial vertex or vertices from which to start a traversal, and narrowing the edges and vertices to traverse from that starting point. DSE Graph implements two types of indexes, global indexes and vertex-centric indexes (VCIs), to address these different aspects of query processing. Global indexes find the starting point for a query by matching a vertex property value. Vertex-centric indexes narrow the scope of a query after a starting point is defined.

Global indexing overview

Global indexes identify the starting point for a graph traversal query using a vertex label and a property. It is important to understand that graph queries must start from a vertex, not an edge. Although a vertex-centric index using an edge property can be used to narrow a traversal, a traversal cannot start from an edge. In a distributed graph database like DSE Graph, the most efficient traversal starts with a vertex identified by its vertex id, such as this query that uses the vertex id for Julia Child:
g.V(['~label':'person', 'personId':1])
However, identifying a vertex by vertex id is rather restrictive. Using a vertex label and a property in a traversal allows DSE Graph to identify the DSE node where the vertex data resides without reading all data from all DSE nodes. Most graph queries will first use a global index to find a starting vertex with a friendlier property:
g.V().has('person', 'name', 'Julia Child')
Since the property name is not part of the vertex id, an index is required to match the search conditions with the correct vertex, and that index is a global index.

Global indexing in DSE Graph can be accomplished with one of three DSE indexing methods: a materialized view (MV), a search index, or a secondary index.

Materialized views are tables generated from a base table to support queries based on a different primary key than the base table. This type of index is best used for high cardinality (nearly unique) values, which also means high selectivity. Selectivity is derived from cardinality, using the following formula:
selectivity = ( cardinality / number of rows ) * 100%
In general, low cardinality results in low selectivity, and high cardinality results in high selectivity. Searching materialized views yields similar response times to searching base tables, although writing the data incurs a small time penalty: when data is written or updated in the graph, the index information is updated in the MV table along with the graph tables. The trade-off of using an MV table is therefore higher write latency in exchange for lower read latency for graph traversals.
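
As a quick illustration of the selectivity formula, the following Groovy sketch (runnable in the Gremlin console) computes selectivity for two hypothetical properties. The row count and cardinalities are invented for illustration only. A nearly unique name property comes out at about 95%, which suggests a materialized view index, while a country property with 200 distinct values comes out at about 0.02%.

```groovy
// selectivity = ( cardinality / number of rows ) * 100%
// All figures below are hypothetical and used only to illustrate the formula.
def selectivity = { long cardinality, long rows ->
    (cardinality / rows) * 100
}

long rows = 1000000L                 // assumed number of 'person' vertices

// Nearly unique values: high cardinality, high selectivity.
println "name:    ${selectivity(950000L, rows)}%"

// Few distinct values: low cardinality, low selectivity.
println "country: ${selectivity(200L, rows)}%"
```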

Search indexes are used when textual, numeric, or geospatial indexing is required, and they rely on DSE Search. Since graph data is stored in DSE database tables, one search core is available per vertex label. For each vertex label that will be indexed with search, all properties must be added to a single search index named search. Because search is implemented with DSE Search, all data types can be indexed. For the two textual indexing options, full text and string, the option must be defined for the property key, because each produces different indexing results. Full text indexing performs tokenization and secondary processing such as case normalization. Full text indexing is useful for queries where a partial match of text is required, and lends itself to regular expression (regex) searching. String indexing is useful for queries where an exact string is sought and no tokenization is required, similar to Solr faceting. This type of index is best for low selectivity, but lends itself to fuzzy matching for both tokenized and non-tokenized indexing.

Secondary indexing in DSE Graph follows the same rule of thumb as DSE secondary indexing. This type of index is meant for lower cardinality values, or alternatively, for low selectivity values. The number of values for indexing should number in the tens to hundreds at most; for instance, searching by country is a good candidate for secondary indexing. In addition, only equality conditions can be used to match values, and no ordering or range queries on values can be used. If more complex value matching is required, search indexes are the superior choice.

To summarize global indexes:
Index type: Use
Materialized view: Most efficient index for high cardinality, high selectivity vertex properties and equality predicates.
Secondary index: Efficient index for low cardinality, low selectivity vertex properties and equality predicates.
Search index: Efficient and versatile index for vertex properties with a wide range of cardinality and selectivity. A search index supports a variety of predicates:

  • Full text and string searches

  • Fuzzy search, both tokenized and non-tokenized

  • Phrase search

  • Spatial (geospatial, Cartesian) searches

Composite index keys are not currently supported in DSE Graph.
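
The three global index types might be declared as in the following sketch, written for the Gremlin console against an empty development graph. The property keys, the recipe vertex label, and the index names personByName and personByGender are assumptions chosen for illustration. Only the search index name is fixed, since it must be search.

```groovy
// Property keys and vertex labels assumed for this sketch.
schema.propertyKey('name').Text().ifNotExists().create()
schema.propertyKey('gender').Text().ifNotExists().create()
schema.propertyKey('instructions').Text().ifNotExists().create()
schema.vertexLabel('person').properties('name', 'gender').ifNotExists().create()
schema.vertexLabel('recipe').properties('name', 'instructions').ifNotExists().create()

// Materialized view index: high cardinality property, equality lookups.
schema.vertexLabel('person').index('personByName').materialized().by('name').add()

// Secondary index: low cardinality property, equality lookups only.
schema.vertexLabel('person').index('personByGender').secondary().by('gender').add()

// Search index: one per vertex label, named 'search'; relies on DSE Search.
schema.vertexLabel('recipe').index('search').search().by('instructions').asText().add()

// A traversal that can start from the materialized view index.
g.V().has('person', 'name', 'Julia Child')
```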

Vertex-centric indexing (VCI) overview

Vertex-centric indexes (VCIs) are used to narrow the traversal based on an additional property. Global indexes can be applied across all vertices with a specified vertex label, as opposed to VCIs, which apply to a filtered set of vertices. Vertex-centric indexes are especially important in reducing the complexity of a traversal from O(n) to O(1) or O(log n), using Big O notation. Two types of VCI exist: edge indexes and property indexes. Edge indexes are useful for traversing edges based on associated properties. They avoid linear scans of all incident edges of a vertex, which can quickly compound the cost of a traversal if many incident edges exist. For instance, an edge index is useful in picking just certain edges once a global index has initiated the traversal at a particular vertex (in this case, Julia Child):
g.V().has('person', 'name', 'Julia Child').outE('created').has('create_date', gt('1960-01-01'))
Property indexes are created to index meta-properties. Property indexes can support both equality and inequality predicates, and are useful in cases where a range of values must be returned by a query. This example will find all the countries that Fritz Streiff lived in and order them by the year he started living in the country:
g.V().has('person', 'name', 'Fritz Streiff').properties('country').order().by('start_year', decr)

Vertex-centric indexing in DSE Graph is accomplished with materialized views (MVs) for both edge and property indexes, and has the same characteristics as described above for materialized view global indexes.
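
A minimal sketch of how the two kinds of vertex-centric index might be declared for the Julia Child and Fritz Streiff examples follows, assuming an empty development graph. The property types, the recipe vertex label, and the index names createdByDate and countryByStartYear are assumptions. In particular, create_date is stored here as an ISO-8601 text value, which may differ from a real schema.

```groovy
// Schema assumed for this sketch; names and types are illustrative.
schema.propertyKey('name').Text().ifNotExists().create()
schema.propertyKey('create_date').Text().ifNotExists().create()   // ISO-8601 strings (assumption)
schema.propertyKey('start_year').Int().ifNotExists().create()
// 'country' is multi-valued and carries 'start_year' as a meta-property.
schema.propertyKey('country').Text().multiple().properties('start_year').ifNotExists().create()
schema.vertexLabel('person').properties('name', 'country').ifNotExists().create()
schema.vertexLabel('recipe').properties('name').ifNotExists().create()
schema.edgeLabel('created').properties('create_date').connection('person', 'recipe').ifNotExists().create()

// Edge index: narrows outgoing 'created' edges of a person by 'create_date'.
schema.vertexLabel('person').index('createdByDate').outE('created').by('create_date').add()

// Property index: indexes the meta-property 'start_year' on the 'country' property.
schema.vertexLabel('person').index('countryByStartYear').property('country').by('start_year').add()
```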

Indexing best practices

The most important fact to remember is that a search index is the only choice for indexing two or more properties that define the starting point for a query. Multiple materialized view or secondary indexes cannot be combined for global indexing. For instance, g.V().has('person', 'gender', 'F').has('person', 'country', 'France') will use only one index, not both, if the indexes are materialized view or secondary indexes. If a search index is defined, both properties, country and gender, are used. Once the starting point is defined, a vertex-centric index can be used to narrow the query.
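
For the gender and country example, a single search index covering both properties might look like the following sketch. It assumes an empty development graph with DSE Search enabled, and the property types and the asString() choice are assumptions made for illustration.

```groovy
// Property keys and vertex label assumed for this sketch.
schema.propertyKey('gender').Text().ifNotExists().create()
schema.propertyKey('country').Text().ifNotExists().create()
schema.vertexLabel('person').properties('gender', 'country').ifNotExists().create()

// One search index per vertex label, named 'search', covering both properties.
schema.vertexLabel('person').index('search').search().by('gender').asString().by('country').asString().add()

// Both has() conditions can be answered from the same search index.
g.V().has('person', 'gender', 'F').has('person', 'country', 'France')
```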

More than one index can be created on the same property, such as creating both a materialized view (MV) index and a search index on the property amount. The DSE Graph query optimizer automatically uses the appropriate index when processing a query; there is no way to designate which index type to use. To ensure the best performance, DSE Graph prefers indexes in the order MV index > secondary index > DSE Search index. However, choosing the optimal type of index is key to good performance. For instance, it is important to understand the limitations of materialized views, and to base the number of MV indexes on that understanding. Different index types may be created on different properties as appropriate, based on the selectivity. In general, secondary indexes in DSE Graph are limited in usefulness, for the same reasons that restrict their general use in DSE. Materialized view indexing should be the first choice, unless textual search is required, in which case a search index should be selected.

If a search index is created, be aware that building the index can take time, and that until the index is available, queries that depend on the index can fail. Applications that create a schema and then immediately insert data that requires search indexes will likely experience errors. Also, queries that use search indexes should be run on DSE Search-enabled nodes in the cluster. Search indexes also require extra resources. Each index allocates a minimum of 256MB of memory by default, and each index requires two physical cores. For a typical 32GB node, 16 search indexes would be a reasonable number to create.

Queries that use textual predicates (regex, tokenRegex, prefix, tokenPrefix, token, and eq/neq) can be run without DSE Search indexes. However, such queries will not make use of secondary or materialized view indexes and will instead use full graph scans to return results. By default, Production mode does not allow full graph scans, so such queries will fail. If such matching search methods are required, search indexes are strongly suggested.
CAUTION: tokenRegex will display case insensitivity in queries, whether a search index is used or not.

Textual search indexes are by default indexed in both tokenized (TextField) and non-tokenized (StrField) forms. This means that all textual predicates (token, tokenPrefix, tokenRegex, eq, neq, regex, prefix) will be usable with all textual vertex properties indexed. Practically, search indexes should be created using the asString() method only in cases where there is absolutely no use for tokenization and text analysis, such as for inventory categories (silverware, shoes, clothing). The asText() method is used if searching tokenized text, such as long multi-sentence descriptions. The query optimizer will choose whether to use analyzed or non-analyzed indexing based on the textual predicate used.
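
The difference between asString() and asText() can be sketched as follows. The product vertex label, its category and description properties, and the search terms are hypothetical. The Search predicates are assumed to be available in the Gremlin console, as they are in a standard DSE Graph console session.

```groovy
// Hypothetical vertex label and properties, used only for illustration.
schema.propertyKey('category').Text().ifNotExists().create()
schema.propertyKey('description').Text().ifNotExists().create()
schema.vertexLabel('product').properties('category', 'description').ifNotExists().create()

// 'category' needs no text analysis, so index it as a plain string;
// 'description' is multi-sentence text, so index it as tokenized text.
schema.vertexLabel('product').index('search').search().by('category').asString().by('description').asText().add()

// Non-tokenized predicates match against the whole stored string.
g.V().has('product', 'category', 'silverware')
g.V().has('product', 'category', Search.prefix('silver'))

// Tokenized predicates match against individual analyzed terms.
g.V().has('product', 'description', Search.token('stainless'))
g.V().has('product', 'description', Search.tokenPrefix('stain'))
```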

It is possible to modify the search index schema to change search characteristics. Although DSE Graph will not overwrite these out-of-band changes, it is recommended that you do not add or remove fields in this manner; only DSE Graph commands should be used for that. This feature is mainly used to change the behavior of a search, such as adding case sensitivity to a type of search.