Indexing graph overview

Explain indexes and how they affect DSE Graph performance.

DSE Graph implements two types of indexes, vertex-centric indexes and global indexes. Vertex-centric indexes are local and specific to a single vertex. Global indexes are specific to a vertex label and property and are graph-wide. All indexes contribute to the performance of graph traversals on large distributed graphs. The type of index lookup will affect performance, and each has pros and cons.

Vertex-centric indexing (VCI) overview

Vertex-centric indexes (VCI) are created locally for a specific vertex, unlike global indexes which are global to the graph and index elements for fast global lookups. VCIs are used once a query has been filtered down to a specific instance of a vertex label, meaning specific vertices. VCIs sort and index the incident edges and adjacent vertices of a vertex according to the incident edge labels or properties. When a vertex is queried, its index is consulted to avoid linear scans of all incident edges. Traversals can be reduced to O(1) or O(log n) from O(n). A typical graph traversal touches numerous vertices, compounding the cost of each incident edge scan if indexes are not consulted.

In DSE Graph, vertex-centric indexing is maintained as materialized views (MVs). Materialized views are tables generated from a base table to provide a query based on a different primary key than the base table. This type of index is best used for values of high cardinality of nearly unique values, or high selectivity. Selectivity is derived from cardinality, using the following formula:
selectivity = ( cardinality / number of rows ) * 100%
In general, low cardinality results in low selectivity, and high cardinality results in high selectivity.

Searching materialized views yields similar response times to searching base tables, although writing the data incurs a small time penalty. When data is written or updated in the graph, the index information is updated in the MV table along with the graph tables. A consequence of using a MV table is higher write latencies, but results in lower read latencies for graph traversals. Edge indexes and property indexes are vertex-centric indexes.

Vertex-centric indexing also plays a role in solving the super-node issue. A super-node is a vertex that has an exponentially larger number of incident edges. The example generally given is to compare the number of followers that reader has to those of a celebrity - hundreds or thousands of followers, compared to millions of followers. A graph traversal checking the index of a super-node will take an outsized amount of time just to read the index table. Using vertex-centric indexing in conjunction with a partitioned vertex table (PVT), the index can be stored on multiple partitions and distributed across the DSE cluster. For vertices that have in excess of one million edges, graph partitioning is necessary due to the storage limitations of a single DSE database table. Distributing the index tables also enables better response to graph traversals.
Note: Partitioned vertex tables are an experimental feature for handling supernodes that are deprecated in DSE 5.1 and will be removed in DSE 6.0. However, data modeling techniques are currently a better avenue for mitigating supernode issues.

Global indexing overview

Indexes can affect traversal query performance. Decreasing the number of starting points for a graph traversal can greatly reduce the latency for a query result. If a traversal must start by checking all the vertices in a graph, time is lost finding the right starting point. If a starting vertex can be identified, that time is not required. Global indexing improves the performance of queries by identifying the starting location of a query using the vertex label and property.

Global indexing in DSE Graph uses DSE secondary indexing or DSE Search indexing. Global indexes can be applied across all vertices with a specified vertex label, as opposed to VCIs which apply to a filtered set of vertices.

Secondary indexing in DSE Graph follows the same rule of thumb as DSE secondary indexing. This type of index is meant for lower cardinality values, or alternatively, for low selectivity values. The number of values for indexing should number in the tens to hundreds at most; for instance, searching by country is a good candidate for secondary indexing. In addition, only equality conditions can be used to match values, and no ordering or range queries on values can be used. If more complex value matching is required, search indexes are the superior choice.

Search indexes are used when textual, numeric or geospatial indexing are required and rely on DSE Search. Since graph data is stored in DSE database tables, one search core is available per vertex label. For each vertex label that will be indexed with search, all properties must be added to a single search index named search. Because search is implemented with DSE Search, all data types can be indexed. For two indexing options, full text and string, the property key must be defined, as different indexing results. Full text indexing performs tokenization and secondary processing such as case normalization. Full text indexing is useful for queries where partial match of text is required, and lends itself to regular expressing (regEx) searching. String indexing is useful for queries where an exact string is sought and no tokenization is required, similar to Solr faceting. This type of index is best for low selectivity, but lends itself to fuzzy matching. DSE 5.1 adds fuzzy search for both tokenized and non-tokenized indexing.

Composite index keys are not currently supported in DSE Graph.

Indexing best practices

The most important fact to remember is that a search index is the only choice for indexing two or more properties that define the starting point for a query. Multiple materialized view or secondary indexes cannot be used for global indexing. For instance, g.V().has('author', 'gender', 'F').has('author', 'country', 'France') will only use one index, not both, if the indexes are materialized view or secondary indexes. If a search index is defined, both properties,country and gender, are used. Once the starting point is defined, a vertex-centric index can be used to narrow the query.

More than one index can be created on the same property, such as creating both a materialized index and a search index on the property amount. The DSE Graph query optimizer will automatically use the appropriate index when processing a query; designation of an index type to use is not a feature. The order of preference that DSE Graph uses is MV index > secondary index > DSE Search index to ensure best performance. Different index types may be created on different properties as appropriate, based on the selectivity. A special case exists for indexing vertices created with composite keys; a search index is the only choice for indexing two or more properties, especially for graph loading with the DSE Graph Loader. Separate materialized view indexes will not be used for the property keys that make up the composite key (custom vertex id) and the DSE Graph Loader will fail to create the vertices.

In general, secondary indexes in DSE Graph are limited in usefulness, for the same reasons that constrict their general use in DSE. Materialized view indexing should be considered

If a search index is created, be aware that building the index can take time, and that until the index is available, queries that depend on the index can fail. Applications that create schema, immediately followed by data insertion that require search indexes will likely experience errors. Also, queries that use search indexes should be run on DSE Search-enabled nodes in the cluster.

Search indexes do require resources. Each index allocates a minimum of 256MB by default, and each index will require two physical cores. For a typical 32GB node, 16 search indexes would be a reasonable number to create.

Queries that use textual predicates (regex, tokenRegex, prefix, tokenPrefix, token, and eq/neq) can be accomplished without DSE search indexes. However, such queries will not make use of secondary or materialized indexes and will instead use full graph scans to return results. By default, Production mode does not allow full graph scans, so such queries will fail. If such matching search methods are required, search indexes are strongly suggested.
CAUTION: tokenRegex will display case insensitivity in queries, whether a search index is used or not.
In DSE 5.1 and later, textual search indexes are by default indexed in both tokenized (TextField) and non-tokenized (StrField) forms. This means that all textual predicates (token, tokenPrefix, tokenRegex, eq, neq, regex, prefix) will be usable with all textual vertex properties created. Practically, search indexes should be created using the asString() method only in cases where there is absolutely no use for tokenization and text analysis, such as for inventory categories (silverware, shoes, clothing). The asText() method is used if searching tokenized text, such as long multi-sentence descriptions. The query optimizer will choose whether to use analyzed or non-analyzed indexing based on the textual predicate used.
Note: Prior to DSE 5.1, search indexes defaulted to asText() for textual property data, if not specified as asString().

It is possible to modify the search index schema to change search characteristics. Although DSE Graph will not overwrite these out-of-band changes, it is recommended that you do not add or remove fields in this manner - only DSE Graph commands should be used. The general use of this feature is mainly to change the behavior of a search, such as adding case sensitivity to a type of search.