Indexing concepts

The data stored in CQL tables can be queried by a variety of methods. The main method uses the partition key defined for a table, and is called primary indexing. However, often a query must be selected using another column of a table, and secondary indexing is required. Secondary indexing use fast, efficient lookup of data that matches a given condition.

CQL has the following types of indexing available:

Indexing type Astra DB Serverless Astra DB Classic Apache Cassandra DataStax Enterprise

Primary indexing

Storage-attached indexing (SAI)

✓ for 5.0+

Secondary indexing (2i)

SSTable-attached indexing (SASI)

✓ but not recommended

DSE Search indexing

After any index is created, data can be queried using that index.

Primary indexing

The primary index is the partition key in CQL. The storage engine of CQL uses the partition key to store rows of data, and the most efficient and fast lookup of data matches the partition key.

Storage-attached indexing (SAI)

SAI uses indexes for non-partition columns, and attaches the indexing information to the SSTables that store the rows of data. SAI is the most appropriate indexing method for most use cases.

Secondary indexing (2i)

Secondary indexing (2i) are the original built-in indexes for CQL. These indexes are a local index, stored in a hidden table on each node of a CQL cluster, separate from the table that contains the values being indexed. This indexing method is only recommended when used in conjunction with a partition key.

SSTable-attached indexing (SASI)

SASI uses indexes for non-partition columns, and creates an index file for each SSTable that store the rows of data. SASI is the most appropriate indexing method for LIKE full text searches.

DSE Search indexing

DataStax Enterprise also features a Apache Solr-Apache Lucene search indexing method. DSE Search offers support for the simplest keyword searching through to complex queries on multiple fields and faceted search results. This indexing method also features tokenized text search for use with analyzers.

Indexing comparison

SAI is the recommended indexing method for most use cases. There are some use cases where 2i or DSE Search is more appropriate. SASI is never recommended, and is only available for backward compatibility.

For example, if you are using Leveled Compaction Strategy (LCS) with a small file size, you will have many SSTables. In this case, 2i is probably better than SAI.

If you are using DSE Search, you’ll to use DSE Search indexing for any full text search applications.

That being said, you should still plan on an increase of 2x latency and throughput on indexed tables compared to non-indexed tables. To quote a long time Cassandra contributor, "There is no magic, just good engineering."

Storage-attached indexing (SAI)

SAI was designed to be a more general indexing method that could be used for any type of query. It was also designed to address the issues with 2i and SASI, namely write amplification and index size.

SASI proved that in-memory indexing and flushing indexes with SSTables was the right way to use the Cassandra write path, while adding new functionality. With SAI, when the mutation is acknowledged, meaning fully committed, the data is indexed. Optimizations gained from a lot of testing, vastly improved the write performance. Compared to 2i, SAI results in a 40% increase in throughput and over 200% better write latencies.

SAI created the most dramatic improvement in index size. It improves the storage requirements by 8x compared to SASI and 5x compared to 2i. SAI uses two different types of indexing schemes based on the data type.

  • Text - Inverted indexes are created with terms broken into a dictionary. The biggest improvement is from the use of Trie-based indexing which offers much better compression which means smaller index sizes.

  • Numeric - Utilizing a data structure called block kd-trees, taken from Lucene, offers excellent range query performance. A separate row ID list is maintained to optimize for token order queries.

Secondary indexing (2i)

Indexing is never free - the more you add, the more you affect write performance. In Cassandra, that means write-amplification issues. When a mutation on an indexed column occurs, an indexing operation triggers reindexing data in a separate index file. More indexes on a CQL table can dramatically increase disk activity during write operations. If a single Cassandra node gets too many writes, there is saturated disk activity. That type of activity destabilizes individual nodes, creating Cassandra cluster problems. 2i should be used sparingly for this reason. Index size is fairly linear, but with re-indexing, the amount of disk space needed can be hard to plan for in an active cluster.

Do not use a 2i index in these situations:

include:develop:partial$2i-not-allowed.adoc[]

SSTable-attached indexing (SASI)

SSTable-attached Secondary Indexing (SASI) — SASI was not originally designed as a general indexing method, but rather to address two query types:

  • Finding rows based on partial data matching. Wildcard, or LIKE queries.

  • Range queries on sparse data, specifically timestamps. How many records fit in a time range type queries.

It did both of those operations quite well and it also reduced the issue of write amplification with legacy 2i. As mutations are written to a Cassandra node, the data is indexed in-memory during the initial write, much like how memtables are written. No disk activity is required on a permutation, a huge improvement. When memtables are flushed to SSTables, the corresponding index for the data is flushed. Every index file written is immutable and attached to the SSTable, hence the name SSTable-attached. When compaction occurs, data is reindexed and written to a new file as new SSTables are created. This action is a major improvement in disk activity, but the index size can be an issue. The on-disk index format uses an enormous amount of disk space for each column indexed. In addition, SASI is marked as experimental and there have never been any feature improvements. Many bugs have been found over time with expensive fixes that have brought on the discussion of whether SASI should be removed altogether. However, SASI provides functionality that is not currently available in SAI, so it remains a viable option for some use cases.

SASI indexes must be enabled for use in Hyper-Converged Database 1.0 and Apache Cassandra 5.0.

SASI indexes are not supported in DataStax Enterprise 6.9 and DataStax Astra DB.

DSE Search is only available in DataStax Astra DB DataStax Astra DB Classic and DataStax Enterprise {dse-version-6.8} and 6.9. DSE Search is a full text search engine built on top of Apache Solr and Apache Lucene. When your searching needs are complex, DSE Search is a good option. However, it is not recommended for simple queries, as the overhead of a separate search engine is not worth the resources and latency overhead.

cassandra.yaml settings for indexing

The following settings in the cassandra.yaml file control indexing:

SASI indexes are deprecated in Cassandra 5.0. They are disabled by default. To enable SASI indexes, set the enable_sasi_indexes option to true.

Cassandra 5.0 introduces the following settings for secondary indexing:

default_secondary_index

The default value is legacy_local_table which enables 2i indexes. This allows the CQL CREATE INDEX to create 2i indexes without a USING clause. If this setting is switched to sai, then the CQL CREATE INDEX will create SAI indexes without a USING clause.

default_secondary_index_enabled

The default value is true, allowing a default secondary index implementation. If this setting is changed to false, then the CQL CREATE INDEX must specify an index implementation via USING.

If you do not have a use for 2i indexes, you can disable them and use the simpler syntax of CREATE INDEX for SAI indexes.

See also:

Building and maintaining indexes

How to rebuild all secondary indexes manually.

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com