Indexing graph overview

Explain indexes and how they affect DSE Graph performance.

DSE Graph implements two types of indexes, vertex-centric indexes and global indexes. Vertex-centric indexes are local and specific to a single vertex. Global indexes are specific to a vertex label and property and are graph-wide. All indexes contribute to the performance of graph traversals on large distributed graphs.

Vertex-centric indexing (VCI) overview

Vertex-centric indexes (VCI) are specific to a vertex, unlike global indexes which are global to the graph and index elements for fast global lookups. Vertex-centric indexes sort and index the incident edges and adjacent vertices of a vertex according to the incident edge labels or properties. When a vertex is queried, its index is consulted to avoid linear scans of all incident edges. Traversals can be reduced to O(1) or O(log n) from O(n). A typical graph traversal touches numerous vertices, compounding the cost of each incident edge scan if indexes are not consulted.

In DSE Graph, vertex-centric indexing is maintained as a materialized view (MV) table. When data is written or updated in the graph, the index is updated rather than utilizing the read repair functionality of Cassandra. A consequence of using a MV table is higher write latencies, but results in lower read latencies for graph traversals. Edge indexes and property indexes are vertex-centric indexes.

Vertex-centric indexing also plays a role in solving the super-node issue. A super-node is a vertex that has an exponentially larger number of incident edges. The example generally given is to compare the number of followers that reader has to those of a celebrity - hundreds or thousands of followers, compared to millions of followers. A graph traversal checking the index of a super-node will take an outsized amount of time just to read the index table. Using vertex-centric indexing in conjunction with a partitioned vertex table (PVT), the index can be stored on multiple partitions and distributed across the DSE cluster. For vertices that have in excess of one million edges, graph partitioning is necessary due to the storage limitations of a single Cassandra table. Distributing the index tables also enables better response to graph traversals.

Note: Partitioned vertex tables are an experimental feature for handling supernodes that will be removed in DSE 6.0.. However, data modeling techniques are currently a better avenue for mitigating supernode issues.

Global indexing overview

Indexes can affect traversal query performance. Decreasing the number of starting points for a graph traversal can greatly reduce the latency for a query result. If a traversal must start by checking all the vertices in a graph, time is lost finding the right starting point. If a starting vertex can be identified, that time is not required. Indexing the location of vertices based on the vertex label and property value improves the performance.

Global indexing uses the built-in indexing features of DSE. DSE Graph is stored in Cassandra, so two types of indexing available are based on Cassandra secondary indexing capability and materialized views. In addition, DSE Graph can take advantage of DSE Search for full text and string indexing. Global indexing can be accomplished with any of these three indexing types. The type selected depends on the data itself. The type of index lookup will also affect performance.

Secondary indexing in DSE Graph follows the same rule of thumb as secondary indexing in Cassandra. This type of index is meant for lower cardinality values, alternatively, for low selectivity values. Selectivity is derived from cardinality, using the following formula:

selectivity = ( cardinality / number of rows ) * 100%

In general, low cardinality results in low selectivity, and high cardinality results in high selectivity. Searching by country is a good candidate for secondary indexing.

Materialized view indexing uses Cassandra materialized views to store data. Materialized views are tables generated from a base table to provide a query based on a different primary key than the base table. This type of index is best used for values of high cardinality or nearly unique values or high selectivity. Searching materialized views yields similar response times to searching base tables, although writing the data incurs a small time penalty.

Search indexes rely on DSE Search. Since graph data is stored in Cassandra tables, one search core is available per vertex label. These indexes differ from secondary and materialized view indexes in their syntax due to this unique property. For each vertex label that will be indexed with search, all properties must be added to a single search index named search. Because search is implemented with DSE Search, two indexing options are available, full text and string. Full text indexing performs tokenization and secondary processing such as case normalization. Full text indexing is useful for queries where partial match of text is required. String indexing is useful for queries where an exact string is sought and no tokenization must be performed, such as for Solr faceting. This type of index is best for low selectivity, like secondary indexes.

Composite index keys are not currently supported in DSE Graph.

Indexing best practices

More than one index can be created on the same property, such as creating both a materialized index and a search index on the property amount. The DSE Graph query optimizer will automatically use the appropriate index when processing a query; designation of an index type to use is not a feature. The order of preference that DSE Graph uses is MV index > secondary index > DSE Search index to ensure best performance. Different index types may be created on different properties as appropriate, based on the selectivity. A special case exists for indexing vertices created with composite keys; a search index is the only choice for indexing two or more properties, especially for graph loading with the DSE Graph Loader. Separate materialized view indexes will not be used for the property keys that make up the composite key (custom vertex id) and the DSE Graph Loader will fail to create the vertices.

In general, secondary indexes in DSE Graph are limited in usefulness, for the same reasons that constrict their general use in DSE. Materialized view indexing should be considered

If a search index is created, be aware that building the index can take time, and that until the index is available, queries that depend on the index can fail. Applications that create schema, immediately followed by data insertion that require search indexes will likely experience errors. Also, queries that use search indexes should be run on DSE Search-enabled nodes in the cluster.

Search indexes do require resources. Each index allocates a minimum of 256MB by default, and each index will require two physical cores. For a typical 32GB node, 16 search indexes would be a reasonable number to create.

Queries that use textual predicates (regex, tokenRegex, prefix, tokenPrefix, token, and eq/neq) can be accomplished without DSE search indexes. However, such queries will not make use of secondary or materialized indexes and will instead use full graph scans to return results. By default, Production mode does not allow full graph scans, so such queries will fail. If such matching search methods are required, search indexes are strongly suggested.

CAUTION:

tokenRegex will display case insensitivity in queries, whether a search index is used or not.