Indexing concepts
The data stored in CQL tables can be queried by a variety of methods. The main method uses the partition key defined for a table, and is called primary indexing. However, often a query must be selected using another column of a table, and secondary indexing is required. Secondary indexing use fast, efficient lookup of data that matches a given condition.
CQL has the following types of indexing available:
Indexing type | Astra DB Serverless | Astra DB Classic | Apache Cassandra | DataStax Enterprise |
---|---|---|---|---|
Primary indexing |
✓ |
✓ |
✓ |
✓ |
Storage-attached indexing (SAI) |
✓ |
✓ |
✓ for 5.0+ |
✓ |
Secondary indexing (2i) |
✓ |
✓ |
✓ |
|
SSTable-attached indexing (SASI) |
✓ |
✓ but not recommended |
||
DSE Search indexing |
✓ |
✓ |
After any index is created, data can be queried using that index.
Primary indexing
The primary index is the partition key in CQL. The storage engine of CQL uses the partition key to store rows of data, and the most efficient and fast lookup of data matches the partition key.
Storage-attached indexing (SAI)
SAI uses indexes for non-partition columns, and attaches the indexing information to the SSTables that store the rows of data. SAI is the most appropriate indexing method for most use cases.
Secondary indexing (2i)
Secondary indexing (2i) are the original built-in indexes for CQL. These indexes are a local index, stored in a hidden table on each node of a CQL cluster, separate from the table that contains the values being indexed. This indexing method is only recommended when used in conjunction with a partition key.
SSTable-attached indexing (SASI)
SASI uses indexes for non-partition columns, and creates an index file for each SSTable that store the rows of data.
SASI is the most appropriate indexing method for LIKE
full text searches.
DSE Search indexing
DataStax Enterprise also features a Apache Solr-Apache Lucene search indexing method. DSE Search offers support for the simplest keyword searching through to complex queries on multiple fields and faceted search results. This indexing method also features tokenized text search for use with analyzers.
Indexing comparison
SAI is the recommended indexing method for most use cases. There are some use cases where 2i or DSE Search is more appropriate. SASI is never recommended, and is only available for backward compatibility.
For example, if you are using Leveled Compaction Strategy (LCS) with a small file size, you will have many SSTables. In this case, 2i is probably better than SAI.
If you are using DSE Search, you’ll to use DSE Search indexing for any full text search applications.
That being said, you should still plan on an increase of 2x latency and throughput on indexed tables compared to non-indexed tables. To quote a long time Cassandra contributor, "There is no magic, just good engineering."
Storage-attached indexing (SAI)
SAI was designed to be a more general indexing method that could be used for any type of query. It was also designed to address the issues with 2i and SASI, namely write amplification and index size.
SASI proved that in-memory indexing and flushing indexes with SSTables was the right way to use the Cassandra write path, while adding new functionality. With SAI, when the mutation is acknowledged, meaning fully committed, the data is indexed. Optimizations gained from a lot of testing, vastly improved the write performance. Compared to 2i, SAI results in a 40% increase in throughput and over 200% better write latencies.
SAI created the most dramatic improvement in index size. It improves the storage requirements by 8x compared to SASI and 5x compared to 2i. SAI uses two different types of indexing schemes based on the data type.
-
Text - Inverted indexes are created with terms broken into a dictionary. The biggest improvement is from the use of Trie-based indexing which offers much better compression which means smaller index sizes.
-
Numeric - Utilizing a data structure called block kd-trees, taken from Lucene, offers excellent range query performance. A separate row ID list is maintained to optimize for token order queries.
Secondary indexing (2i)
Indexing is never free - the more you add, the more you affect write performance. In Cassandra, that means write-amplification issues. When a mutation on an indexed column occurs, an indexing operation triggers reindexing data in a separate index file. More indexes on a CQL table can dramatically increase disk activity during write operations. If a single Cassandra node gets too many writes, there is saturated disk activity. That type of activity destabilizes individual nodes, creating Cassandra cluster problems. 2i should be used sparingly for this reason. Index size is fairly linear, but with re-indexing, the amount of disk space needed can be hard to plan for in an active cluster.
Do not use a 2i index in these situations:
-
On high-cardinality columns for a query of a huge volume of records for a small number of results. See Problems using a high-cardinality column index below.
-
In tables that use a counter column.
-
On a frequently updated or deleted column. See Problems using an index on a frequently updated or deleted column below.
-
To look for a row in a large partition unless narrowly queried. See Problems using an index to look for a row in a large partition unless narrowly queried below.
include:develop:partial$2i-not-allowed.adoc[]
SSTable-attached indexing (SASI)
SSTable-attached Secondary Indexing (SASI) — SASI was not originally designed as a general indexing method, but rather to address two query types:
-
Finding rows based on partial data matching. Wildcard, or LIKE queries.
-
Range queries on sparse data, specifically timestamps. How many records fit in a time range type queries.
It did both of those operations quite well and it also reduced the issue of write amplification with legacy 2i. As mutations are written to a Cassandra node, the data is indexed in-memory during the initial write, much like how memtables are written. No disk activity is required on a permutation, a huge improvement. When memtables are flushed to SSTables, the corresponding index for the data is flushed. Every index file written is immutable and attached to the SSTable, hence the name SSTable-attached. When compaction occurs, data is reindexed and written to a new file as new SSTables are created. This action is a major improvement in disk activity, but the index size can be an issue. The on-disk index format uses an enormous amount of disk space for each column indexed. In addition, SASI is marked as experimental and there have never been any feature improvements. Many bugs have been found over time with expensive fixes that have brought on the discussion of whether SASI should be removed altogether. However, SASI provides functionality that is not currently available in SAI, so it remains a viable option for some use cases.
SASI indexes must be enabled for use in Hyper-Converged Database 1.0 and Apache Cassandra 5.0.
SASI indexes are not supported in DataStax Enterprise 6.9 and DataStax Astra DB.
DSE Search
DSE Search is only available in DataStax Astra DB DataStax Astra DB Classic and DataStax Enterprise {dse-version-6.8} and 6.9. DSE Search is a full text search engine built on top of Apache Solr and Apache Lucene. When your searching needs are complex, DSE Search is a good option. However, it is not recommended for simple queries, as the overhead of a separate search engine is not worth the resources and latency overhead.
cassandra.yaml settings for indexing
The following settings in the cassandra.yaml
file control indexing:
SASI indexes are deprecated in DataStax Enterprise 6.8 and later.
SASI indexes are enabled by default and no options are defined in the cassandra.yaml
file.
See also:
- Building and maintaining indexes
-
How to rebuild all secondary indexes manually.