SAI FAQ

Use this FAQ to find answers to common questions and get help with Storage-Attached Indexing (SAI).

What is SAI?

Storage-Attached Indexing (SAI) is a highly-scalable, globally-distributed index for Cassandra, DataStax DataStax Astra DB, and DataStax Enterprise databases. SAI combines a storage-attached architecture with a number of highly optimized on-disk index structures.

Which databases are supported?

Supported databases:

  • DataStax DataStax Astra DB (Vector)

  • DataStax DataStax Astra DB DataStax Astra DB DataStax Astra DB Serverless

  • DataStax DataStax Astra DB DataStax Astra DB DataStax Astra DB Classic

  • HCD 1.0

  • DataStax Enterprise 6.9.0 and later

  • DataStax Enterprise 6.8.3 and later

After creating a database, keyspace, and one or more tables,use CREATE INDEX to define one or more SAI indexes on the table. Examples are in SAI examples.

What configuration settings should I use with SAI?

Compared with most indexing environments, SAI configuration and related settings are much simpler. Key points:

  • Increase --XX:MaxDirectMemorySize, leaving approximately 15-20% of memory for the OS and other in-memory structures.

  • In cassandra.yaml, explicitly set file_cache_size to 75% of that value.

  • Heavy mixed read/write workloads may want to:

    • Decrease range_request_timeout

    • Increase write_request_timeout

  • If the memtable_flush_writers value is set too low, writes may stall. If this occurs in your environment, increase memtable_flush_writers.

Aside from memory, SAI uses the same tunable parameters for Cassandra, such as compaction throughput and compaction executors. This matters for write performance. For read performance, again, maximizing use of the Chunk Cache will benefit SAI reads because all on-disk index components are accessed through this mechanism. Refer to Configure SAI indexes.

What computing challenges does SAI solve?

Oftentimes, developers ask: "How can I query additional fields outside of the CQL partition key?"

SAI implements efficient indexes based on a table’s columns, such as parts of a composite partition key. Before SAI, you could index clustering keys, but you could not index parts of a composite partition. The development of SAI was written to achieve the goal of efficient and simpler filtering via the creation of secondary indexes.

+ SAI also makes data modeling easier because you do not need to create custom tables just to cater to particular query patterns. You can create a table that is most natural for you, write to just that table, and query it any way you want.

What are the advantages of using SAI?

SAI makes it possible to define multiple indexes on the same database table. Each SAI index can be based on any column in the table. Exception: There is no need to define an SAI index based on the partition key when it’s comprised of only one column; in this case, SAI issues an invalid query message. You can also define an SAI index using a single column in the table’s composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify just one of the columns that comprises the composite partition key.

For developers, SAI removes several previous pain points, including the need to duplicate denormalized data to query non-primary key columns.

For operators, SAI has several advantages, including the use of significantly less disk space for indexing, fewer failure points, easier uptime due to the simplified architecture of SAI, and fewer copies of data to secure.

How do I use SAI features?

SAI indexes and queries are entirely CQL-based. The features, by design, are intentionally simple and easy to use.

At a high level, SAI indexes are:

  • Created and dropped per column via CQL. Start in SAI quickstart.

  • Rebuilt and backed up via nodetool. See nodetool.

Are there limits on the number of SAI indexes per table, and total per cluster?

SAI has limits that may or may not affect a cluster.

  • sai_sstable_indexes_per_query_warn_threshold: commented out, set to 32 Maximum number of referenced SAI indexes on a replica when executing a SELECT query before emitting a warning (defaults to 32)

  • sai_sstable_indexes_per_query_fail_threshold: commented out, set to -1 Maximum number of referenced SAI indexes on a replica when executing a SELECT query before emitting a failure (defaults to -1 to disable)

  • non_partition_restricted_index_query_enabled: commented out, defaults to true This is a guardrail enabling secondary index queries that do not restrict on partition key.

    On which column in a database table can I base an SAI index?

    Define each SAI index on any table column. Exception: there is no need to define an SAI index based on the partition key when it’s comprised of only one column; in this case, SAI issues an invalid query message.

    You can also define an SAI index using any single column in the table’s composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify any of the columns that comprises the composite partition key.

    With collection maps, starting in DSE 6.8.4, you can define one or more SAI indexes on the same column, specifying keys, values, and entries as map types. SAI also supports list and set collections.

    In CQL queries of database tables with SAI indexes, the CONTAINS clauses are supported with, and specific to:

    • SAI collection maps with keys, values, and entries

    • SAI collections with list and set types

    When I DROP and recreate an SAI index on the same column, does that block any read operations? And is there a way to check the indexing status?

    When you DROP or recreate an SAI index, you are not blocked from entering queries that do not use the index. However, you cannot use that SAI index (based on the same column) until it has finished building and is queryable. To determine the current state of a given index, query the system_views.indexes virtual table. Example:

    SELECT is_queryable,is_building FROM system_views.indexes
      WHERE keyspace_name='<keyspace>'
      AND table_name='<table>'
      AND index_name='<index>';

    ?// LLP: check for oss; not true for astra See DSE virtual tables and Virtual tables for SAI indexes and SSTables.

    What are the write and read paths used by SAI indexes?

    SAI indexes Memtables and SSTables as they are written, resolving the differences between those indexes at read time. See SAI write path and read path.

    What on-disk index formats does SAI support?

    SAI supports two on-disk index formats, optimized for:

  • Equality and non-exact matching on strings.

    Strings are indexed on-disk using the trie data structure, in conjunction with postings (term/row pairs) lists. The trie is heap-friendly, providing string prefix compression for terms, and can match any query that can be expressed as a deterministic finite automaton. The feature minimizes on-disk footprint and supports simple token skipping.

  • Equality and range queries on numeric and non-literal types.

    Numeric values and the other non-literal CQL types (timestamp, date, UUID) are indexed on-disk using k-dimensional tree, a balanced structure that provides fast lookups across one or more dimensions, and compression for both values and postings.

    What is the disk footprint overhead for SAI indexes?

    SAI requires significantly lower disk usage compared to other native or bolt-on Cassandra index solutions. SAI produces an additional 20-35% disk usage compared with unindexed data. The SAI disk usage is largely dependent on the underlying data model and the number of columns indexed. For a comparison with other indexing methods, see this DataStax blog post, and especially the Total Cost of Ownership section: Better Cassandra Indexes for a Better Data Model: Introducing Storage-Attached Indexing.

    What are the supported column data types for SAI indexing?

    The supported types are: ASCII, BIGINT, DATE, DECIMAL, DOUBLE, FLOAT, INET, INT, SMALLINT, TEXT, TIME, TIMESTAMP, TIMEUUID, TINYINT, UUID, VARCHAR, VARINT.

    • SAI also supports collections — see the next FAQ.

Does SAI support indexes on a collection column?

Yes — SAI supports collections of type map, list, and set.

In CQL queries of database tables with SAI indexes, the CONTAINS clauses are supported with, and specific to:

  • SAI collection maps with keys, values, and entries

  • SAI collections with list and set types

What are the supported query operators?

For queries on tables with SAI indexes:

  • Numerics: =, <, >, , >=, AND, IN, OR (for vector databases)

  • Strings: =, AND, IN

  • Collections: = CONTAINS, CONTAINS KEY

    The unsupported query operators are:

  • Strings or Numerics: LIKE, OR (for non-vector databases)

    Examples:

    • CQL command

    • Result

    SELECT * FROM cycling.cyclist_semi_pro
      WHERE registration > '2010-01-01' AND registration < '2015-12-31' LIMIT 10;
    WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!
    
     id | affiliation         | age | country | firstname | lastname | registration
    ----+---------------------+-----+---------+-----------+----------+--------------
      5 |       Como Velocità |  24 |     ITA |     Irene |  Cantona |   2012-07-22
     16 | CU Alums Crankworkz |  28 |     USA |     Jenny |   Hamler |   2012-07-22
     15 |      Exeter Academy |  27 |     USA |    Thomas |   Fulton |   2012-12-15
     20 |     London Cyclists |  18 |     GBR |    Leslie |     Boyd |   2012-12-15
      9 |       Normandy Club |  24 |     FRA |    Richie |  Draxler |   2011-02-26
    
    (5 rows)
    • CQL command

    • Result

    SELECT firstname,lastname,teams FROM cycling.cyclist_teams 
      WHERE teams CONTAINS KEY 2015;
    WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!
    
     firstname | lastname   | teams
    -----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
      Marianne |        VOS |                                                                                                                                   {2015: 'Rabobank-Liv Woman Cycling Team'}
    
    (2 rows)

    For query examples with CONTAINS clauses that take advantage of SAI collection maps, lists, and sets, be sure to see SAI collection map examples with keys, values, and entries and SAI collection examples with list and set types.

On the CREATE CUSTOM INDEX command for SAI, what options are available?

Use the WITH OPTIONS clause to indicate how SAI should handle case sensitivity and special characters in the index. For example, given a string column lastname:

CREATE CUSTOM INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname)
  USING 'StorageAttachedIndex' WITH OPTIONS =
  {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

SAI added the ascii option with the DSE 6.8.7 release. The default is false. When set to true, SAI converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to the ASCII equivalent, if one exists. For example, this option changes à to a.

+ See CREATE CUSTOM INDEX and examples in the SAI quickstart topic.

Does SAI support composite indexing: meaning, a single index on multiple columns?

No. There is a 1-to-1 mapping of an SAI index to a column. However, you can create a separate index on each column in a given table, up to the configured maximums — see this related FAQ. Also, SAI can use multiple defined indexes within a single read query.

How can I view SAI memory usage metrics?

SAI follows the Threads Per Core (TPC) memory model for DSE. The SAI memory footprint is divided between the JVM heap and the Chunk Cache. The heap stores memtable indexes, and the chunk cache stores recently accessed on-disk index components as well as other SSTable components. SAI provides metrics for both the heap and the chunk cache. For each index, SAI also provides metrics for determining the size in bytes of memory used by the on-disk data structure, as well as disk usage. Refer to Index group metrics. SAI also provides Table state metrics that give you visibility into the disk usage, the percentage of disk usage of the base table, the index builds in progress, and related metrics. See Table state metrics. These metrics and many others are accessible via DSE OpsCenter. Also, DataStax provides preconfigured Grafana dashboards.

What is the performance impact of adding SAI columns to a read query? How many AND clauses can I add?

There is no limit on the number of index columns that can be used in a single query. The sai_indexes_per_table_failure_threshold setting in cassandra.yaml controls the maximum number of SAI indexes allowed in a single table (10, by default). However, querying against multiple indexed columns incurs a cost that is related to the increased number of index components processed. When evaluating multiple indexed columns in a query, SAI performs a workflow (1: Traverse. 2: Merged. 3: Intersect) that ultimately coalesces data from multiple memtables and SSTables. DataStax performance testing has observed an (up to) 2x increase in latency and a 30% decrease in throughput when the number of index columns queried doubles. The relative impact on performance is largely dependent on the data model and distribution. DataStax recommends that you test specific use cases in your development environment.

In a query, AND queries will process up to two SAI indexes; if more than two SAI indexes are used by the query, this circumstance will result in SAI performing post-filtering on the remaining clauses.

For related information, see the match streaming and post filtering example.

Are SAI write operations asynchronous, or does SAI wait before acknowledging the write to the user?

The SAI write path is actually very simple. The indexes live with the data, both in memtables and SSTables. When a write is acknowledged to the client, the data is already indexed. This is a synchronous process. When the memtable is flushed, the indexes are flushed as well. See SAI write path and read path.

The on-disk index components are broken down into per-SSTable index files and per-column index files. The column indexes do not store the primary keys or tokens; instead, they store compressible row IDs. The per-SSTable index files link the row IDs from the column-indexes to their backing SSTables. This SAI design allows all column indexes within a single SSTable to share per-SSTable index files, which further helps reduce the disk footprint.

With regard to write performance, how does SAI compare with Solr-based DSE Search?

All DSE Search (Solr) index updates first perform a "read-before-write" procedure against the partition or row being indexed. As described above, the SAI write path is much simpler; basically, the indexes follow along with the data. DataStax testing has observed SAI write throughput perform 86% better than DSE Search (Solr), and perform 670% better with respect to latency. In general, DataStax testing observed SAI produce single-digit millisecond (ms) latencies for mutations, and single-digit to low-teen ms latencies for reads.

What are the guidelines regarding column cardinality with SAI indexes?

Column cardinality can affect read performance when it comes to range queries among replicas. The number of rows matching a value of a high-cardinality column, such as credit card numbers, is more likely to be isolated on very few nodes (or even isolated to one node), while the rows matching a value on a low-cardinality column are more likely to reside on numerous nodes. If a query does not specify a partition key, the Cassandra coordinator scans the token ring and group token ranges by endpoints (nodes). The coordinator then concurrently execute read commands for all participating endpoints. In the worst case where the indexed column has very high cardinality, an entire cluster scan may be required before finding a match. With low-cardinality columns, be aware that if your LIMIT is higher than the number of values in your targeted column, Cassandra has to search all replicas again before determining that the LIMIT cannot be satisfied. In this case, Cassandra returns only the number of matching results.

What are the circumstances under which SAI applies post filtering?

SAI applies post-filtering in numerous scenarios. For example, consider a simple table, and an SAI index on just one of the two non-PK columns:

CREATE KEYSPACE test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE test.mytable (id int PRIMARY KEY,
       col1 text,
       col2 timestamp);
CREATE CUSTOM INDEX mytable_col1_idx ON test.mytable (col1) USING 'StorageAttachedIndex';

Given a query such as the following:

SELECT * FROM test.mytable WHERE col1 = 'hello world' and col2 < toTimestamp(now()) ALLOW FILTERING;

For this query, DSE narrows down the search by the indexed column (col1) first, then applies post-filtering on col2. (Use the ALLOW FILTERING clause with caution.) In this scenario, no additional replica roundtrips are needed; the post filtering on col2 is carried out on the replicas themselves.

Another case were post-filtering comes into play is when constructing a query that involves more than two SAI indexes. Refer to this related FAQ about AND queries.

Can I create an SAI index based on a static column?

Yes. For example, consider a transaction_by_customer table where you have a primary key customer_id, plus static columns to contain each customer’s address, phone_number, and date_of_birth. Given a query such as:

SELECT * from transaction_by_customer where customer_id = 'xyz123';

If there are 100,000 transaction_by_customer rows, because you defined those three static fields, this query runs against a table that uses significantly less disk space, as compared to an environment where writes had inserted the per-customer values (address, phone_number, date_of_birth) in every row. SAI indexes that are based on static columns use much less disk space, and perform better than Solr-based indexes on non-static fields.

 DSE Search (Solr) does not allow you to create search indexes on tables that contain static columns.
See https://docs.datastax.com/en/dse/6.8/dse-admin/datastax_enterprise/search/unsupportedSearchFeatures.html[Unsupported features for DSE Search].
SAI delivers the option and advantage of creating indexes based on static columns, while also achieving the benefit of conserving table space.
For indexed strings, how does SAI handle Unicode characters in the column data?

When you create an SAI index based on a string column, set the normalize option to true if you want SAI to perform Unicode normalization on the column data. SAI supports Normalization Form C (NFC) Unicode. When set to true, SAI normalizes the different versions of a given Unicode character to a single version, retaining all the marks and symbols in the index. For example, SAI would change the character Å (U+212B) to Å (U+00C5). See CREATE CUSTOM INDEX.

Can the index’s column name have special characters?

SAI validates the column name on which an index is being defined. SAI allows alphanumeric characters and underscores only. SAI returns InvalidRequestException if you try to define an index on a column name that contains other characters, and does not create the index.

What partitioner does SAI support?

SAI supports only the Murmur3Partitioner.

Is there a migration approach to move from DSE Search to SAI?

In your development environment, a general approach:

  1. Establish baseline performance test numbers

  2. Determine which queries that currently depend on DSE Search can migrate

  3. For those queries, add SAI indexing to tables/columns to satisfy column-based data requests

  4. Update the application to use CQL SAI; if relevant, remove redundant tables that were defined to duplicate denormalized data so that you could previously query non-PrimaryKey fields

  5. Remove Solr Cores

  6. Remove Solr Nodes

  7. Run performance tests again and compare results

How do I provide feedback and get support?

Send your feedback through the DataStax Community, via product.feedback@datastax.com, or both.

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com