Indexing
Explain indexes and how they affect DataStax Graph (DSG) performance.
Indexes play a significant role in making DataStax Graph (DSG) queries performant. Graph queries that must traverse the entire graph will have poor performance with high latency, which explains why full-scan queries are a bad idea in production environments. Indexes can narrow down the starting point for a graph traversal, ignoring either vertices or edges that do not require scanning to complete the query.
In addition, if a query does not use the partition and clustering keys, the query will throw an
exception if an appropriate index doesn't exist. Although for development, there are other
options to evaluate the query, for production, indexes are required. Thus, the query
g.V('dseg:/person/4ce9caf1-25b8-468e-a983-69bad20c017a')
is the only acceptable
query if the partition key is the person_id. A query to find James Beard fails if an index
based on name
is not present. Once that index exists, the query
g.V().has('person', 'name', 'James BEARD')
finds the same vertex.
Index | Use | Predicates |
---|---|---|
Materialized view (MV) | Most efficient index for high cardinality, high selectivity properties and equality predicates. Most appropriate if collections (set, list, map) are not part of the query. | Equality: eq , within |
Secondary | Efficient index for low cardinality, low selectivity properties and equality predicates. Most appropriate if collections (set, list, map) are part of the query. | Predicates on list , set , or map
collections: contains ,containsKey ,
containsValue , entryEq |
Search | Efficient and versatile index for properties with a wide range of cardinality and selectivity. Most appropriate when text search conditions or multiple condition are part of the query, especially fuzzy matching. |
|
inverse() |
Speciality index for creating an edge MV index supporting traversals against the edge's natural direction, i.e., bidirectional edges. | Allows use of vertex steps: in , out ,
both , inE , outE , bothE ,
inV , outV , and bothV |
When queries must use an index, DSG uses a two-step process to decide on the appropriate index
to use. First, if any search-only predicates are part of the query, a search index will be used.
If a query does not use a search-only predicate, then DSG will select an existing index in the
order: MV index > secondary index > DSE Search index
.
Index Analyzer
The index analyzer uses the indexFor()
step along with either
analyze()
to analyze and recommend an index or apply()
to
apply a recommended index. The type of index recommended is based on the entered query and is
based on the suggested index use decribed for each index type below. In addition, for complex
queries, the index analyzer may recommend multiple indexes.
schema.indexFor(g.V().has('person', 'name', 'Julia CHILD')).analyze()and get a typical return:
==>Traversal requires that the following indexes are created: schema.vertexLabel('person').materializedView('person_by_name').ifNotExists().partitionBy('name').clusterBy('person_id', Asc).create()Notice that the type of index,
materializedView
is part of the recommendation, as well
as a recommended index name and partition and clustering keys. Both vertex labels and edge
labels can be analyzed for indexes. schema.indexFor(g.E().hasLabel('reviewed').has('stars', 5)).analyze()with the results:
==>Traversal requires that the following indexes are created: schema.edgeLabel('reviewed'). from('person').to('recipe'). materializedView('person__reviewed__recipe_by_stars'). ifNotExists(). partitionBy('stars'). clusterBy(OUT, 'person_id', Asc). clusterBy(IN, 'recipe_id', Asc). create()
The index analyzer will create an index after analysis if apply()
is used.
For exploration, use analyze()
to see what recommendations are suggested, and
use apply()
if the index suggested is the best index. Another way to use
analyze()
is to get an index suggestion, then manually create the index if
optimizations are desired.
Materialized view indexes
A materialized view (MV) index will generally be recommended if predicates in the query use general property columns and do not use collections (set, list, map) or tuples. This index type will also be recommended if the query is not search-specific. Materialized views are tables generated from a base table to provide a query based on a different primary key than the base table. Materialized view indexes also support the use of multiple partition keys to address the issue of partitions growing too large. Searching materialized views yields similar response times to searching base tables, although writing the data incurs a small time penalty. When data is written or updated in the graph, the index information is updated in the MV table along with the graph tables. A consequence of using a MV table is higher write latencies, but results in lower read latencies for graph traversals.
Secondary indexes
Query predicate | Index option | Applies to |
---|---|---|
contains(x) | indexValues() | Set, List |
containsKey(x) | indexKeys() | Map |
containsValue(x) | indexValues() | Map |
entryEq(x, y) | indexEntries() | Map |
eq(x) | indexFull() | Frozen Set, List, Map |
Sets and lists can index all values found by indexing the collection column. Maps can index a
map key, map value, or map entry using the methods shown below. Multiple indexes can be created
on the same map column in a table so that map keys, values, or entries can be queried. In
addition, frozen collections can be indexed using FULL
to index the full
content of a frozen collection.
Search indexes
- Full text indexing (
asText()
) performs tokenization and secondary processing such as case normalization. Full text indexing is useful for queries where partial match of text is required, and lends itself to regular expressing (regEx) searching. - String indexing (
asString()
) is useful for queries where an exact string is sought and no tokenization is required, similar to Solr faceting.
Search indexes are created for both full text and string searches by
default, but properties can be designated with either option using
asText
or asString
, respectively. Textual
search indexes are by default indexed in both tokenized (TextField) and
non-tokenized (StrField) forms. This means that all textual predicates (token,
tokenPrefix, tokenRegex, eq, neq, regex, prefix) will be usable with all textual
vertex or edge properties indexed. Practically, search indexes should be created
using the asString()
method only in cases where there is absolutely
no use for tokenization and text analysis, such as for inventory categories
(silverware, shoes, clothing). The asText()
method is used if
searching tokenized text, such as long multi-sentence descriptions. The query
optimizer will choose whether to use analyzed or non-analyzed indexing based on the
textual predicate used.
Use | Tokenized (full text) | Non-tokenized (string) |
---|---|---|
A comparison looking for a match of a token containing the value: 'Wild
Stroganoff' or a phrase containing either an exact match: 'Wild
Stroganoff',0 or a proximal match: 'Wild Stroganoff',1 |
token |
phrase |
A comparison using some text followed by a wildcard: John* |
tokenPrefix |
prefix |
A comparison using a regular expression: .*sea.*in.* |
tokenRegex |
regex |
Uses optimal string alignment distance calculations to match properties:
'Wlid',1 or 'Jmase Beard', 2 |
tokenFuzzy |
fuzzy |
neq
/
without
and geospatial predicate inside
.tokenRegex
will display case insensitivity in queries.Graph indexing best practices
The most important fact to remember is that a search index is the only choice for indexing if
search-specific predicates are defined. For instance, g.V().has('person', 'gender',
'F').has('country.field1', 'France')
will use a search index for both
properties,country
and gender
to narrow the query, whereas
materialized view and secondary indexes are created for a single property.
Both secondary and materialied view index data are stored in Cassandra tables, with secondary indexes stored locally on cluster nodes, while MVs are distributed across the cluster. Secondary indexes are the only choice if a query needs to lookup a collection key or value. MVs are a good choice if a non-collection value is sought. Both types of index have limitations that must be considered. Secondary indexes can accumulate tombstones that affect read performance if frequent data mutation occurs. MVs add write latency, so a small number of MV tables per base table is critical. If too many secondary indexes or MV indexes are created for a single vertex or edge label, revisit the data modeling for the graph and consider adding more vertex labels with fewer properties for each.
Edge indexes must be considered carefully if a graph has any celebrity, or super-nodes, vertices that have many incoming edges. While a celebrity may have outgoing edges that number in the hundreds, they may have millions of incoming edges. Creating an inverse index on the millions of edges will have a poor performance effect on the graph, while an inverse index on the outgoing edges is fine. In general, the data model will benefit from breaking the incoming edges with an additional partition key and no indexing.
DSG automatically uses the appropriate index when processing a query; designation of which
index type to use is not a feature. As mentioned earlier, if a query does not use a search-only
predicate, then DSG will select an existing index in the order: MV index > secondary
index > DSE Search index
. However, choosing the optimal type of index is key to good
performance, and is a good reason to check queries with the index analyzer.
Indexes do take time to build, and until an index is available, queries that depend on an
index can fail. Applications that create schema, immediately followed by data insertion that
require search indexes will likely experience errors. The waitFor()
option when
creating indexes manually can prevent failure by giving the index time to completely build. If
search indexes are used, also be aware that the graph queries should be run on DSE
Search-enabled nodes in the cluster. DSE Search requires extra resources for each index built: a
minimum of 256MB of memory by default, and two physical cores. For a typical 32GB node, 16
search indexes would be a reasonable number to create.