SAI FAQs

Frequently asked questions about Storage-Attached Indexing (SAI).

cassandra.yaml

The location of the cassandra.yaml file depends on the type of installation:
Package installations /etc/dse/cassandra/cassandra.yaml
Tarball installations installation_location/resources/cassandra/conf/cassandra.yaml

Use this FAQ to find answers to common questions and get help with Storage-Attached Indexing (SAI).

What is SAI?
SAI is a highly-scalable, globally-distributed index for Apache Cassandra® that is available for DataStax Astra and DataStax Enterprise (DSE) databases. SAI combines:
  • the storage-attached architecture of open source SSTable Attached Secondary Indexes (SASI)
  • a number of highly optimized on-disk index structures
  • the DSE thread-per-core (TPC) infrastructure
Which databases are supported?
Supported databases:
  • DataStax Astra
  • DataStax Enterprise 6.8.3 and later

After creating your database, a keyspace, and one or more tables, use CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex' to define one or more SAI indexes on the table. For DataStax Astra databases, use the CQL Console and enter the commands from the cqlsh prompt. For DSE databases, use cqlsh. The same CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex' command is available for both. See SAI quick start.

What configuration settings should I use with SAI?
Compared with most indexing environments, SAI configuration and related settings are much simpler. Key points:
  • Increase --XX:MaxDirectMemorySize, leaving approximately 15-20% of memory for the OS and other in-memory structures.
  • In cassandra.yaml, explicitly set file_cache_size_in_mb to 75% of that value.
  • Heavy mixed read/write workloads may want to:
    • Decrease range_request_timeout_in_ms
    • Increase write_request_timeout_in_ms
  • If the memtable_flush_writers value is set too low, writes may stall. If this occurs in your environment, increase memtable_flush_writers.
Refer to Configure SAI indexes.
What computing challenges does SAI solve?

Oftentimes, developers ask: "How can I query additional fields outside of the Cassandra partition key?" SAI implements efficient indexes based on a table's columns, such as parts of a composite partition key. Before SAI, you could index clustering keys, but you could not index parts of a composite partition. The development of SAI was inspired by SASI to achieve the goal of efficient and simpler filtering via the creation of secondary indexes.

SAI also makes data modeling easier because you do not need to create custom tables just to cater to particular query patterns. You can create a table that is most natural for you, write to just that table, and query it any way you want.

What are the advantages of using SAI?

SAI makes it possible to define multiple indexes on the same database table. Each SAI index can be based on any column in the table. Exception: there is no need to define an SAI index based on the partition key when it's comprised of only one column; in this case, SAI issues an invalid query message. You can also define an SAI index using a single column in the table's composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify just one of the columns that comprises the composite partition key.

For developers, SAI removes several previous pain points, including the need to duplicate denormalized data to query non-PrimaryKey columns.

For operators, SAI has several advantages, including the use of significantly less disk space for indexing; fewer failure points; easier uptime due to the simplified architecture of SAI; and fewer copies of data to secure.

Does SAI replace Solr-based DSE Search?

SAI is not an enterprise search engine. While it does provide some of the same functionality, SAI is not a complete replacement for DSE Search. At its core, SAI is a filtering engine, and simplifies data modeling and client applications that would otherwise rely heavily on maintaining multiple query-specific tables.

How does schema management compare between DSE Search and SAI?

SAI is an index, not a search engine. Unlike the Solr-based DSE Search, SAI has no need for schema management. SAI configuration is simpler and is tuned with existing database parameters, such as in cassandra.yaml. With SAI, there is no Solr commit log to accept writes during bootstrap; unlike Solr, SAI does not need to wait for bootstrap to read the database configuration. With SAI, schema/indexing options reside in the index metadata, which is handled by native database schema management.

How do I use SAI features?

With Storage-Attached Indexing, one difference compared to DSE Search is that queries are entirely CQL-based. The features, by design, are intentionally simple and easy to use.

At a high level, SAI indexes are:
  1. Created and dropped per column via CQL CREATE CUSTOM INDEX ... USING 'StorageAttachedIndex' commands and DROP INDEX commands. Start in SAI quick start.
  2. Rebuilt and backed up via nodetool. See nodetool.
  3. Monitored via a combination of nodetool, CQL virtual tables, system metrics, JMX, and Grafana dashboards. See Monitor SAI indexes.
Can I have a Solr-based DSE Search index and an SAI index on the same database table?

No. You cannot have a DSE Search index and an SAI index defined for the same database table.

In your development environment, DataStax recommends that you use SAI indexes on a few tables and observe the results. For existing tables with Solr indexes, in development, consider using DROP SEARCH INDEX to first remove the DSE Search indexes. Then use CREATE CUSTOM INDEX .. USING 'StorageAttachedIndex' to observe the performance of queries, including the additional functionality of using any column in the SAI indexes. See CREATE CUSTOM INDEX.

Are there limits on the number of SAI indexes per table, and total per cluster?
By default in DataStax Astra and DSE 6.8.3 and later:
  • The maximum number of SAI indexes per table is 10. The limit is set by sai_indexes_per_table_failure_threshold in cassandra.yaml.
  • The maximum number of SAI indexes in the entire cluster is 100, as set by sai_indexes_total_failure_threshold in cassandra.yaml.
See the guardrails section of the cassandra.yaml reference topic.
Does SAI support indexes on a collection column?

Yes – see Using collections with SAI and CREATE CUSTOM INDEX, including SAI collection map examples with keys, values, and entries.

On which column in a database table can I base an SAI index?

Define each SAI index on any table column. Exception: there is no need to define an SAI index based on the partition key when it's comprised of only one column; in this case, SAI issues an invalid query message.

You can also define an SAI index using a single column in the table's composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify just one of the columns that comprises the composite partition key.

With collection maps, starting in DSE 6.8.4, you can define one or more SAI indexes on the same column, specifying keys, values, and entries as map types.

For examples, refer to Examine SAI column index and query rules.
Tip: Also note the SAI limits per table and per cluster, as listed in this FAQ.
When I DROP and recreate an SAI index on the same column, does that block any read operations? And is there a way to check the indexing status?
When you DROP / recreate an SAI index, you are not blocked from entering queries that do not use the index. However, you cannot use that SAI index (based on the same column) until it has finished building and is queryable. To determine the current state of a given index, query the system_views.indexes virtual table. Example:
SELECT is_queryable,is_building FROM systemviews.indexes WHERE keyspace_name='keyspace' 
       AND table_name='table' AND index_name='index';
See DSE virtual tables and About virtual tables for SAI indexes and SSTables.
What are the write and read paths used by SAI indexes?

SAI indexes Memtables and SSTables as they are written, resolving the differences between those indexes at read time. See SAI write path and read path.

What on-disk index formats does SAI support?
SAI supports two on-disk index formats, optimized for:
  • Equality and non-exact matching on strings.

    Strings are indexed on-disk using the trie data structure, in conjunction with postings (term/row pairs) lists. The trie is heap friendly, providing string prefix compression for terms, and can match any query that can be expressed as a deterministic finite automaton. The feature minimizes on-disk footprint and supports simple token skipping.

  • Equality and range queries on numeric and non-literal types.

    Numeric values and the other non-literal CQL types (timestamp, date, UUID) are indexed on-disk using k-dimensional tree, a balanced structure that provides fast lookups across one or more dimensions, and compression for both values and postings.

What are the supported column data types for SAI indexing?
The supported types are: ASCII, BIGINT, DATE, DECIMAL, DOUBLE, FLOAT, INET, INT, SMALLINT, TEXT, TIME, TIMESTAMP, TIMEUUID, TINYINT, UUID, VARCHAR, VARINT.
Note: INET support for IPv4 and IPv6 is new starting with DSE 6.8.2. The DECIMAL and VARINT support is new starting with DSE 6.8.3.
What are the supported query operators?
For queries on tables with SAI indexes:
  • Numerics: =, <, >, <=, >=, AND
  • Strings: =, CONTAINS, CONTAINS key, CONTAINS VALUES, AND

Not supported: LIKE, IN, OR

Example:
SELECT * FROM cycling.cyclist_semi_pro WHERE registration > '2010-01-01' AND registration < '2015-12-31' LIMIT 10;
On the CREATE CUSTOM INDEX command for SAI, what options are available?
Use the WITH OPTIONS clause only for string-based SAI indexes, such as when you specify a column to index that is a TEXT data type. Example with a string column, lastname:
CREATE CUSTOM INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname) 
USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': false, 'normalize': true }; 
See CREATE CUSTOM INDEX and examples in the SAI quick start and Examine SAI column index and query rules topics.
For indexed strings, how does SAI handle Unicode characters?

When you create an SAI index based on a string column, set the normalize option to true if you want SAI to perform Unicode normalization. SAI supports Normalization Form C (NFC) Unicode. When set to true, SAI normalizes the different versions of a given Unicode character to a single version, retaining all the marks and symbols in the index. For example, SAI would change the character Å (U+212B) to Å (U+00C5). See CREATE CUSTOM INDEX.

What partitioner does SAI support?

SAI supports only the Murmur3Partitioner.

What options do I have to monitor the health of indexes that I create with SAI?

Monitor your SAI indexes by enabling TRACING; viewing the data in system virtual tables implemented for SAI; and examining SAI metrics. See Monitor SAI indexes.

What specific metrics does SAI provide?

See the metrics topic in this guide for details about SAI metrics. DataStax provides preconfigured Grafana dashboards on this GitHub site that integrate with DSE Metrics Collector. Use DSE Metrics Collector to export DSE metrics to a monitoring tool like Prometheus, and then visualize the DSE metrics in Grafana dashboards.

Is there a migration approach to move from DSE Search to SAI?
In your development environment, a general approach:
  1. Establish baseline performance test numbers
  2. Determine which queries that currently depend on DSE Search can migrate
  3. For those queries, add SAI indexing to tables/columns to satisfy column-based data requests
  4. Update the application to use CQL SAI; if relevant, remove redundant tables that were defined to duplicate denormalized data so that you could previously query non-PrimaryKey fields
  5. Remove Solr Cores
  6. Remove Solr Nodes
  7. Run performance tests again and compare results
How do I provide feedback and get support?

Send your feedback through the DataStax Community, via product.feedback@datastax.com, or both.