SAI FAQs

Frequently asked questions about Storage-Attached Indexing (SAI).

cassandra.yaml

The location of the cassandra.yaml file depends on the type of installation:
Package installations /etc/dse/cassandra/cassandra.yaml
Tarball installations installation_location/resources/cassandra/conf/cassandra.yaml

Use this FAQ to find answers to common questions and get help with Storage-Attached Indexing (SAI).

What is SAI?
SAI is a highly-scalable, globally-distributed index for Apache Cassandra® that is available for DataStax Astra and DataStax Enterprise (DSE) databases. SAI combines:
  • the storage-attached architecture of open source SSTable Attached Secondary Indexes (SASI)
  • a number of highly optimized on-disk index structures
  • the DSE thread-per-core (TPC) infrastructure
Which databases are supported?
Supported databases:
  • DataStax Astra
  • DataStax Enterprise 6.8.3 and later

After creating your database, a keyspace, and one or more tables, use CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex' to define one or more SAI indexes on the table. For DataStax Astra databases, use the CQL Console and enter the commands from the cqlsh prompt. For DSE databases, use cqlsh. The same CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex' command is available for both. See SAI quick start.

What configuration settings should I use with SAI?
Compared with most indexing environments, SAI configuration and related settings are much simpler. Key points:
  • Increase --XX:MaxDirectMemorySize, leaving approximately 15-20% of memory for the OS and other in-memory structures.
  • In cassandra.yaml, explicitly set file_cache_size_in_mb to 75% of that value.
  • Heavy mixed read/write workloads may want to:
    • Decrease range_request_timeout_in_ms
    • Increase write_request_timeout_in_ms
  • If the memtable_flush_writers value is set too low, writes may stall. If this occurs in your environment, increase memtable_flush_writers.
Aside from memory, SAI uses the same tunable parameters for Cassandra, such as compaction throughput and compaction executors. This matters for write performance. For read performance, again, maximizing use of the Chunk Cache will benefit SAI reads because all on-disk index components are accessed through this mechanism. Refer to Configure SAI indexes.
What computing challenges does SAI solve?

Oftentimes, developers ask: "How can I query additional fields outside of the Cassandra partition key?" SAI implements efficient indexes based on a table's columns, such as parts of a composite partition key. Before SAI, you could index clustering keys, but you could not index parts of a composite partition. The development of SAI was inspired by SASI to achieve the goal of efficient and simpler filtering via the creation of secondary indexes.

SAI also makes data modeling easier because you do not need to create custom tables just to cater to particular query patterns. You can create a table that is most natural for you, write to just that table, and query it any way you want.

What are the advantages of using SAI?

SAI makes it possible to define multiple indexes on the same database table. Each SAI index can be based on any column in the table. Exception: there is no need to define an SAI index based on the partition key when it's comprised of only one column; in this case, SAI issues an invalid query message. You can also define an SAI index using a single column in the table's composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify just one of the columns that comprises the composite partition key.

For developers, SAI removes several previous pain points, including the need to duplicate denormalized data to query non-PrimaryKey columns.

For operators, SAI has several advantages, including the use of significantly less disk space for indexing; fewer failure points; easier uptime due to the simplified architecture of SAI; and fewer copies of data to secure.

Does SAI replace Solr-based DSE Search?

SAI is not an enterprise search engine. While it does provide some of the same functionality, SAI is not a complete replacement for DSE Search. At its core, SAI is a filtering engine, and simplifies data modeling and client applications that would otherwise rely heavily on maintaining multiple query-specific tables.

How does schema management compare between DSE Search and SAI?

SAI is an index, not a search engine. Unlike the Solr-based DSE Search, SAI has no need for schema management. SAI configuration is simpler and is tuned with existing database parameters, such as in cassandra.yaml. With SAI, there is no Solr commit log to accept writes during bootstrap; unlike Solr, SAI does not need to wait for bootstrap to read the database configuration. With SAI, schema/indexing options reside in the index metadata, which is handled by native database schema management.

How do I use SAI features?

With Storage-Attached Indexing, one difference compared to DSE Search is that queries are entirely CQL-based. The features, by design, are intentionally simple and easy to use.

At a high level, SAI indexes are:
  1. Created and dropped per column via CQL CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex' commands and DROP INDEX commands. Start in SAI quick start.
  2. Rebuilt and backed up via nodetool. See nodetool.
  3. Monitored via a combination of nodetool, CQL virtual tables, system metrics, JMX, and Grafana dashboards. See Monitor SAI indexes.
Can I have a Solr-based DSE Search index and an SAI index on the same database table?

No. You cannot have a DSE Search index and an SAI index defined for the same database table.

In your development environment, DataStax recommends that you use SAI indexes on a few tables and observe the results. For existing tables with Solr indexes, in development, consider using DROP SEARCH INDEX to first remove the DSE Search indexes. Then use CREATE CUSTOM INDEX .. USING 'StorageAttachedIndex' to observe the performance of queries, including the additional functionality of using any column in the SAI indexes. See CREATE CUSTOM INDEX.

Are there limits on the number of SAI indexes per table, and total per cluster?
By default in DataStax Astra and DSE 6.8.3 and later:
  • The maximum number of SAI indexes per table is 10. The limit is set by sai_indexes_per_table_failure_threshold in cassandra.yaml.
  • The maximum number of SAI indexes in the entire cluster is 100, as set by sai_indexes_total_failure_threshold in cassandra.yaml.
See the guardrails section of the cassandra.yaml reference topic.
On which column in a database table can I base an SAI index?

Define each SAI index on any table column. Exception: there is no need to define an SAI index based on the partition key when it's comprised of only one column; in this case, SAI issues an invalid query message.

You can also define an SAI index using a single column in the table's composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify just one of the columns that comprises the composite partition key.

With collection maps, starting in DSE 6.8.4, you can define one or more SAI indexes on the same column, specifying keys, values, and entries as map types. SAI also supports list and set collections.

For examples, refer to Examine SAI column index and query rules.
Tip: Also note the SAI limits per table and per cluster, as listed in this FAQ.
When I DROP and recreate an SAI index on the same column, does that block any read operations? And is there a way to check the indexing status?
When you DROP / recreate an SAI index, you are not blocked from entering queries that do not use the index. However, you cannot use that SAI index (based on the same column) until it has finished building and is queryable. To determine the current state of a given index, query the system_views.indexes virtual table. Example:
SELECT is_queryable,is_building FROM system_views.indexes WHERE keyspace_name='keyspace' 
       AND table_name='table' AND index_name='index';
See DSE virtual tables and Virtual tables for SAI indexes and SSTables.
What are the write and read paths used by SAI indexes?

SAI indexes Memtables and SSTables as they are written, resolving the differences between those indexes at read time. See SAI write path and read path.

What on-disk index formats does SAI support?
SAI supports two on-disk index formats, optimized for:
  • Equality and non-exact matching on strings.

    Strings are indexed on-disk using the trie data structure, in conjunction with postings (term/row pairs) lists. The trie is heap friendly, providing string prefix compression for terms, and can match any query that can be expressed as a deterministic finite automaton. The feature minimizes on-disk footprint and supports simple token skipping.

  • Equality and range queries on numeric and non-literal types.

    Numeric values and the other non-literal CQL types (timestamp, date, UUID) are indexed on-disk using k-dimensional tree, a balanced structure that provides fast lookups across one or more dimensions, and compression for both values and postings.

What is the disk footprint overhead for SAI indexes?
SAI requires significantly lower disk usage compared to other native or bolt-on Cassandra index solutions. SAI produces an additional 20-35% disk usage compared with unindexed data. The SAI disk usage is largely dependent on the underlying data model and the number of columns indexed. For a comparison with other indexing methods, see this DataStax blog post, and especially the Total Cost of Ownership section: Better Cassandra Indexes for a Better Data Model: Introducing Storage-Attached Indexing.
What are the supported column data types for SAI indexing?
The supported types are: ASCII, BIGINT, DATE, DECIMAL, DOUBLE, FLOAT, INET, INT, SMALLINT, TEXT, TIME, TIMESTAMP, TIMEUUID, TINYINT, UUID, VARCHAR, VARINT.
Note:
  • INET support for IPv4 and IPv6 was new starting with DSE 6.8.2.
  • The DECIMAL and VARINT support was new starting with DSE 6.8.3.
  • SAI also supports collections – see the next FAQ.
Does SAI support indexes on a collection column?
What are the supported query operators?
For queries on tables with SAI indexes:
  • Numerics: =, <, >, <=, >=, AND
  • Strings: =, CONTAINS, CONTAINS key, CONTAINS VALUES, AND

Not supported: LIKE, IN, OR

Example:
SELECT * FROM cycling.cyclist_semi_pro WHERE registration > '2010-01-01' AND registration < '2015-12-31' LIMIT 10;
On the CREATE CUSTOM INDEX command for SAI, what options are available?
Use the WITH OPTIONS clause only for string-based SAI indexes, such as when you specify a column to index that is a TEXT data type. Example with a string column, lastname:
CREATE CUSTOM INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname) 
USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': false, 'normalize': true }; 
See CREATE CUSTOM INDEX and examples in the SAI quick start and Examine SAI column index and query rules topics.
Does SAI support composite indexing; meaning, a single index on multiple columns?

No. There is a 1-to-1 mapping of an SAI index to a column. However, you can create a separate index on each column in a given table, up to the configured maximums – see this related FAQ. Also, SAI can use multiple defined indexes within a single read query.

How can I view SAI memory usage metrics?

SAI follows the Threads Per Core (TPC) memory model for DSE. The SAI memory footprint is divided between the JVM heap and the Chunk Cache. The heap stores memtable indexes, and the chunk cache stores recently accessed on-disk index components as well as other SSTable components. SAI provides metrics for both the heap and the chunk cache. For each index, SAI also provides metrics for determining the size in bytes of memory used by the on-disk data structure, as well as disk usage. Refer to Index group metrics. SAI also provides Table state metrics that give you visibility into the disk usage, the percentage of disk usage of the base table, the index builds in progress, and related metrics. See Table state metrics. These metrics and many others are accessible via DSE OpsCenter. Also, DataStax provides preconfigured Grafana dashboards.

What is the performance impact of adding SAI columns to a read query? How many AND clauses can I add?
There is no limit on the number of index columns that can be used in a single query. The sai_indexes_per_table_failure_threshold setting in cassandra.yaml controls the maximum number of SAI indexes allowed in a single table (10, by default). However, querying against multiple indexed columns incurs a cost that is related to the increased number of index components processed. When evaluating multiple indexed columns in a query, SAI performs a workflow (1: Traverse. 2: Merged. 3: Intersect) that ultimately coalesces data from multiple memtables and SSTables. DataStax performance testing has observed an (up to) 2x increase in latency and a 30% decrease in throughput when the number of index columns queried doubles. The relative impact on performance is largely dependent on the data model and distribution. DataStax recommends that you test specific use cases in your development environment.
Note: In a query, AND queries will process up to two SAI indexes; if more than two SAI indexes are used by the query, this circumstance will result in SAI performing post-filtering on the remaining clauses.
For related information, see the match streaming and post filtering example.
Are SAI write operations asynchronous, or does SAI wait before acknowledging the write to the user?

The SAI write path is actually very simple. The indexes live with the data, both in memtables and SSTables. When a write is acknowledged to the client, the data is already indexed. This is a synchronous process. When the memtable is flushed, the indexes are flushed as well. See SAI write path and read path.

The on-disk index components are broken down into per-SSTable index files and per-column index files. The column indexes do not store the primary keys or tokens; instead, they store compressible row IDs. The per-SSTable index files link the row IDs from the column-indexes to their backing SSTables. This SAI design allows all column indexes within a single SSTable to share per-SSTable index files, which further helps reduce the disk footprint.

With regard to write performance, how does SAI compare with Solr-based DSE Search?

All DSE Search (Solr) index updates first perform a "read-before-write" procedure against the partition or row being indexed. As described above, the SAI write path is much simpler; basically, the indexes follow along with the data. DataStax testing has observed SAI write throughput perform 86% better than DSE Search (Solr), and perform 670% better with respect to latency. In general, DataStax testing observed SAI produce single-digit millisecond (ms) latencies for mutations, and single-digit to low-teen ms latencies for reads.

What are the guidelines regarding column cardinality with SAI indexes?

Column cardinality can affect read performance when it comes to range queries among replicas. The number of rows matching a value of a high-cardinality column, such as credit card numbers, is more likely to be isolated on very few nodes (or even isolated to one node), while the rows matching a value on a low-cardinality column are more likely to reside on numerous nodes. If a query does not specify a partition key, the Cassandra coordinator scans the token ring and group token ranges by endpoints (nodes). The coordinator then concurrently execute read commands for all participating endpoints. In the worst case where the indexed column has very high cardinality, an entire cluster scan may be required before finding a match. With low-cardinality columns, be aware that if your LIMIT is higher than the number of values in your targeted column, Cassandra has to search all replicas again before determining that the LIMIT cannot be satisfied. In this case, Cassandra returns only the number of matching results.

What are the circumstances under which SAI applies post filtering?
SAI applies post-filtering in numerous scenarios. For example, consider a simple table, and an SAI index on just one of the two non-PK columns:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE test.mytable (id int PRIMARY KEY,
       col1 text,
       col2 timestamp);
CREATE CUSTOM INDEX mytable_col1_idx ON test.mytable (col1) USING 'StorageAttachedIndex';
Given a query such as the following:
SELECT * FROM test.mytable WHERE col1 = 'hello world' and col2 < toTimestamp(now()) ALLOW FILTERING;
For this query, DSE narrows down the search by the indexed column (col1) first, then applies post-filtering on col2. (Use the ALLOW FILTERING clause with caution.) In this scenario, no additional replica roundtrips are needed; the post filtering on col2 is carried out on the replicas themselves.

Another case were post-filtering comes into play is when constructing a query that involves more than two SAI indexes. Refer to this related FAQ about AND queries.

Can I create an SAI index based on a static column?
Yes. For example, consider a transaction_by_customer table where you have a primary key customer_id, plus static columns to contain each customer's address, phone_number, and date_of_birth. Given a query such as:
SELECT * from transaction_by_customer where customer_id = 'xyz123';
If there are 100,000 transaction_by_customer rows, because you defined those three static fields, this query runs against a table that uses significantly less disk space, as compared to an environment where writes had inserted the per-customer values (address, phone_number, date_of_birth) in every row. SAI indexes that are based on static columns use much less disk space, and perform better than Solr-based indexes on non-static fields.
Note: DSE Search (Solr) does not allow you to create search indexes on tables that contain static columns. See Unsupported features for DSE Search. SAI delivers the option and advantage of creating indexes based on static columns, while also achieving the benefit of conserving table space.
For indexed strings, how does SAI handle Unicode characters in the column data?

When you create an SAI index based on a string column, set the normalize option to true if you want SAI to perform Unicode normalization on the column data. SAI supports Normalization Form C (NFC) Unicode. When set to true, SAI normalizes the different versions of a given Unicode character to a single version, retaining all the marks and symbols in the index. For example, SAI would change the character Å (U+212B) to Å (U+00C5). See CREATE CUSTOM INDEX.

Can the index's column name have special characters?

SAI validates the column name on which an index is being defined. SAI allows alphanumeric characters and underscores only. SAI returns InvalidRequestException if you try to define an index on a column name that contains other characters, and does not create the index.

What partitioner does SAI support?

SAI supports only the Murmur3Partitioner.

Is there a migration approach to move from DSE Search to SAI?
In your development environment, a general approach:
  1. Establish baseline performance test numbers
  2. Determine which queries that currently depend on DSE Search can migrate
  3. For those queries, add SAI indexing to tables/columns to satisfy column-based data requests
  4. Update the application to use CQL SAI; if relevant, remove redundant tables that were defined to duplicate denormalized data so that you could previously query non-PrimaryKey fields
  5. Remove Solr Cores
  6. Remove Solr Nodes
  7. Run performance tests again and compare results
How do I provide feedback and get support?

Send your feedback through the DataStax Community, via product.feedback@datastax.com, or both.