SAI FAQ
Use this FAQ to find answers to common questions and get help with Storage-Attached Indexing (SAI).
- What is SAI?
-
Storage-Attached Indexing (SAI) is a highly-scalable, globally-distributed index for Cassandra, DataStax DataStax Astra DB, and DataStax Enterprise databases. SAI combines a storage-attached architecture with a number of highly optimized on-disk index structures. SAI is built on the DataStax Enterprise thread-per-core (TPC) infrastructure in DataStax Enterprise 6.8.
- Which databases are supported?
-
Supported databases:
-
DataStax DataStax Astra DB (Vector)
-
DataStax DataStax Astra DB DataStax Astra DB DataStax Astra DB Serverless
-
DataStax DataStax Astra DB DataStax Astra DB DataStax Astra DB Classic
-
HCD 1.0
-
DataStax Enterprise 6.9.0 and later
-
DataStax Enterprise 6.8.3 and later
-
After creating a database, keyspace, and one or more tables,use CREATE INDEX to define one or more SAI indexes on the table. Examples are in SAI examples.
- What configuration settings should I use with SAI?
-
Compared with most indexing environments, SAI configuration and related settings are much simpler. Key points:
-
Increase
--XX:MaxDirectMemorySize
, leaving approximately 15-20% of memory for the OS and other in-memory structures. -
In cassandra.yaml, explicitly set
file_cache_size_in_mb
to 75% of that value. -
Heavy mixed read/write workloads may want to:
-
Decrease
range_request_timeout_in_ms
-
Increase
write_request_timeout_in_ms
-
-
If the
memtable_flush_writers
value is set too low, writes may stall. If this occurs in your environment, increasememtable_flush_writers
.
Aside from memory, SAI uses the same tunable parameters for Cassandra, such as compaction throughput and compaction executors. This matters for write performance. For read performance, again, maximizing use of the Chunk Cache will benefit SAI reads because all on-disk index components are accessed through this mechanism. Refer to Configure SAI indexes.
-
- What computing challenges does SAI solve?
-
Oftentimes, developers ask: "How can I query additional fields outside of the CQL partition key?"
SAI implements efficient indexes based on a table’s columns, such as parts of a composite partition key. Before SAI, you could index clustering keys, but you could not index parts of a composite partition. The development of SAI was written to achieve the goal of efficient and simpler filtering via the creation of secondary indexes.
+ SAI also makes data modeling easier because you do not need to create custom tables just to cater to particular query patterns. You can create a table that is most natural for you, write to just that table, and query it any way you want.
- What are the advantages of using SAI?
-
SAI makes it possible to define multiple indexes on the same database table. Each SAI index can be based on any column in the table. Exception: There is no need to define an SAI index based on the partition key when it’s comprised of only one column; in this case, SAI issues an
invalid query
message. You can also define an SAI index using a single column in the table’s composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify just one of the columns that comprises the composite partition key.For developers, SAI removes several previous pain points, including the need to duplicate denormalized data to query non-primary key columns.
For operators, SAI has several advantages, including the use of significantly less disk space for indexing, fewer failure points, easier uptime due to the simplified architecture of SAI, and fewer copies of data to secure.
- Does SAI replace Solr-based DSE Search?
-
SAI is not an enterprise search engine. While it does provide some of the same functionality, SAI is not a complete replacement for DSE Search. At its core, SAI is a filtering engine, and simplifies data modeling and client applications that would otherwise rely heavily on maintaining multiple query-specific tables.
- How does schema management compare between SAI and DSE Search?
-
SAI is an index, not a search engine. Unlike the Solr-based DSE Search, SAI has no need for schema management. SAI configuration is simpler and is tuned with existing database parameters, such as in
cassandra.yaml
. With SAI, there is no Solr commit log to accept writes during bootstrap. SAI does not need to wait for bootstrap to read the database configuration like Solr does. With SAI, schema/indexing options reside in the index metadata, which is handled by native database schema management. - How do I use SAI features?
-
SAI indexes and queries are entirely CQL-based. The features, by design, are intentionally simple and easy to use.
At a high level, SAI indexes are:
-
Created and dropped per column via CQL. Start in SAI quickstart.
-
Rebuilt and backed up via
nodetool
. See nodetool. -
Monitored via a combination of
nodetool
, CQL virtual tables, system metrics, JMX, and Grafana dashboards. See Monitor SAI indexes.
-
- Can you have a Solr-based DSE Search index and an SAI index on the same database table?
-
No. You cannot have a DSE Search index and an SAI index defined for the same database table.
In your development environment, DataStax recommends that you use SAI indexes on a few tables and observe the results. For existing tables with Solr indexes, in development, consider using
DROP SEARCH INDEX
to first remove the DSE Search indexes. Then useCREATE CUSTOM INDEX … USING 'StorageAttachedIndex'
to observe the performance of queries, including the additional functionality of using any column in the SAI indexes. See CREATE CUSTOM INDEX.
- Are there limits on the number of SAI indexes per table, and total per cluster?
-
SAI has limits that may or may not affect a cluster.
-
sai_indexes_total_failure_threshold: commented out, set to 100
-
sai_indexes_per_table_failure_threshold: commented out, set to 10
- On which column in a database table can I base an SAI index?
-
Define each SAI index on any table column. Exception: there is no need to define an SAI index based on the partition key when it’s comprised of only one column; in this case, SAI issues an
invalid query
message.You can also define an SAI index using any single column in the table’s composite partition key. A composite partition key means that the partition is based on two or more columns. In this case with an SAI index, you would specify any of the columns that comprises the composite partition key.
With collection maps, starting in DSE 6.8.4, you can define one or more SAI indexes on the same column, specifying
keys
,values
, andentries
as map types. SAI also supportslist
andset
collections.In CQL queries of database tables with SAI indexes, the
CONTAINS
clauses are supported with, and specific to:-
SAI collection maps with
keys
,values
, andentries
-
SAI collections with
list
andset
types
-
- When I
DROP
and recreate an SAI index on the same column, does that block any read operations? And is there a way to check the indexing status? -
When you
DROP
or recreate an SAI index, you are not blocked from entering queries that do not use the index. However, you cannot use that SAI index (based on the same column) until it has finished building and is queryable. To determine the current state of a given index, query thesystem_views.indexes
virtual table. Example:SELECT is_queryable,is_building FROM system_views.indexes WHERE keyspace_name='<keyspace>' AND table_name='<table>' AND index_name='<index>';
?// LLP: check for oss; not true for astra See DSE virtual tables and Virtual tables for SAI indexes and SSTables.
- What are the write and read paths used by SAI indexes?
-
SAI indexes Memtables and SSTables as they are written, resolving the differences between those indexes at read time. See SAI write path and read path.
- What on-disk index formats does SAI support?
-
SAI supports two on-disk index formats, optimized for:
-
Equality and non-exact matching on strings.
Strings are indexed on-disk using the trie data structure, in conjunction with postings (term/row pairs) lists. The trie is heap-friendly, providing string prefix compression for terms, and can match any query that can be expressed as a deterministic finite automaton. The feature minimizes on-disk footprint and supports simple token skipping.
-
Equality and range queries on numeric and non-literal types.
Numeric values and the other non-literal CQL types (
timestamp
,date
,UUID
) are indexed on-disk using k-dimensional tree, a balanced structure that provides fast lookups across one or more dimensions, and compression for both values and postings.- What is the disk footprint overhead for SAI indexes?
-
SAI requires significantly lower disk usage compared to other native or bolt-on Cassandra index solutions. SAI produces an additional 20-35% disk usage compared with unindexed data. The SAI disk usage is largely dependent on the underlying data model and the number of columns indexed. For a comparison with other indexing methods, see this DataStax blog post, and especially the Total Cost of Ownership section: Better Cassandra Indexes for a Better Data Model: Introducing Storage-Attached Indexing.
- What are the supported column data types for SAI indexing?
-
The supported types are:
ASCII, BIGINT, DATE, DECIMAL, DOUBLE, FLOAT, INET, INT, SMALLINT, TEXT, TIME, TIMESTAMP, TIMEUUID, TINYINT, UUID, VARCHAR, VARINT
.-
INET
support for IPv4 and IPv6 was new starting with DSE 6.8.2. -
The
DECIMAL
andVARINT
support was new starting with DSE 6.8.3. -
SAI also supports collections — see the next FAQ.
-
- Does SAI support indexes on a collection column?
-
Yes — SAI supports collections of type
map
,list
, andset
.In CQL queries of database tables with SAI indexes, the
CONTAINS
clauses are supported with, and specific to:-
SAI collection maps with
keys
,values
, andentries
-
SAI collections with
list
andset
types
-
- What are the supported query operators?
-
For queries on tables with SAI indexes:
-
Numerics:
=
,<
,>
,⇐
,>=
,AND
-
Strings:
=
,AND
-
Collections:
=
,CONTAINS
,CONTAINS KEY
The unsupported query operators are:
-
Strings or Numerics:
LIKE
,OR
,IN
Examples:
-
CQL command
-
Result
SELECT * FROM cycling.cyclist_semi_pro WHERE registration > '2010-01-01' AND registration < '2015-12-31' LIMIT 10;
id | affiliation | age | country | firstname | lastname | registration ----+---------------------+-----+---------+-----------+----------+-------------- 5 | Como Velocità | 24 | ITA | Irene | Cantona | 2012-07-22 16 | CU Alums Crankworkz | 28 | USA | Jenny | Hamler | 2012-07-22 15 | Exeter Academy | 27 | USA | Thomas | Fulton | 2012-12-15 20 | London Cyclists | 18 | GBR | Leslie | Boyd | 2012-12-15 9 | Normandy Club | 24 | FRA | Richie | Draxler | 2011-02-26 (5 rows) id | affiliation | age | country | firstname | lastname | registration ----+---------------------+-----+---------+-----------+----------+-------------- 5 | Como Velocità | 24 | ITA | Irene | Cantona | 2012-07-22 16 | CU Alums Crankworkz | 28 | USA | Jenny | Hamler | 2012-07-22 15 | Exeter Academy | 27 | USA | Thomas | Fulton | 2012-12-15 20 | London Cyclists | 18 | GBR | Leslie | Boyd | 2012-12-15 9 | Normandy Club | 24 | FRA | Richie | Draxler | 2011-02-26 (5 rows)
-
CQL command
-
Result
SELECT firstname,lastname,teams FROM cycling.cyclist_teams WHERE teams CONTAINS KEY 2015;
firstname | lastname | teams -----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'} Marianne | VOS | {2015: 'Rabobank-Liv Woman Cycling Team'} (2 rows) firstname | lastname | teams -----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'} Marianne | VOS | {2015: 'Rabobank-Liv Woman Cycling Team'} (2 rows)
For query examples with
CONTAINS
clauses that take advantage of SAI collection maps, lists, and sets, be sure to see SAI collection map examples with keys, values, and entries and SAI collection examples with list and set types. -
-
- On the
CREATE CUSTOM INDEX
command for SAI, what options are available? -
Use the
WITH OPTIONS
clause to indicate how SAI should handle case sensitivity and special characters in the index. For example, given a string columnlastname
:CREATE CUSTOM INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
SAI added the
ascii
option with the DSE 6.8.7 release. The default isfalse
. When set totrue
, SAI converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to the ASCII equivalent, if one exists. For example, this option changes à to a.+ See CREATE CUSTOM INDEX and examples in the SAI quickstart topic.
- Does SAI support composite indexing: meaning, a single index on multiple columns?
-
No. There is a 1-to-1 mapping of an SAI index to a column. However, you can create a separate index on each column in a given table, up to the configured maximums — see this related FAQ. Also, SAI can use multiple defined indexes within a single read query.
- How can I view SAI memory usage metrics?
-
SAI follows the Threads Per Core (TPC) memory model for DSE. The SAI memory footprint is divided between the JVM heap and the Chunk Cache. The heap stores memtable indexes, and the chunk cache stores recently accessed on-disk index components as well as other SSTable components. SAI provides metrics for both the heap and the chunk cache. For each index, SAI also provides metrics for determining the size in bytes of memory used by the on-disk data structure, as well as disk usage. Refer to Index group metrics. SAI also provides Table state metrics that give you visibility into the disk usage, the percentage of disk usage of the base table, the index builds in progress, and related metrics. See Table state metrics. These metrics and many others are accessible via DSE OpsCenter. Also, DataStax provides preconfigured Grafana dashboards.
- What is the performance impact of adding SAI columns to a read query? How many
AND
clauses can I add? -
There is no limit on the number of index columns that can be used in a single query. The
sai_indexes_per_table_failure_threshold
setting in cassandra.yaml controls the maximum number of SAI indexes allowed in a single table (10, by default). However, querying against multiple indexed columns incurs a cost that is related to the increased number of index components processed. When evaluating multiple indexed columns in a query, SAI performs a workflow (1: Traverse. 2: Merged. 3: Intersect) that ultimately coalesces data from multiple memtables and SSTables. DataStax performance testing has observed an (up to) 2x increase in latency and a 30% decrease in throughput when the number of index columns queried doubles. The relative impact on performance is largely dependent on the data model and distribution. DataStax recommends that you test specific use cases in your development environment.In a query, AND queries will process up to two SAI indexes; if more than two SAI indexes are used by the query, this circumstance will result in SAI performing post-filtering on the remaining clauses.
For related information, see the match streaming and post filtering example.
- Are SAI write operations asynchronous, or does SAI wait before acknowledging the write to the user?
-
The SAI write path is actually very simple. The indexes live with the data, both in memtables and SSTables. When a write is acknowledged to the client, the data is already indexed. This is a synchronous process. When the memtable is flushed, the indexes are flushed as well. See SAI write path and read path.
The on-disk index components are broken down into per-SSTable index files and per-column index files. The column indexes do not store the primary keys or tokens; instead, they store compressible row IDs. The per-SSTable index files link the row IDs from the column-indexes to their backing SSTables. This SAI design allows all column indexes within a single SSTable to share per-SSTable index files, which further helps reduce the disk footprint.
- With regard to write performance, how does SAI compare with Solr-based DSE Search?
-
All DSE Search (Solr) index updates first perform a "read-before-write" procedure against the partition or row being indexed. As described above, the SAI write path is much simpler; basically, the indexes follow along with the data. DataStax testing has observed SAI write throughput perform 86% better than DSE Search (Solr), and perform 670% better with respect to latency. In general, DataStax testing observed SAI produce single-digit millisecond (ms) latencies for mutations, and single-digit to low-teen ms latencies for reads.
- What are the guidelines regarding column cardinality with SAI indexes?
-
Column cardinality can affect read performance when it comes to range queries among replicas. The number of rows matching a value of a high-cardinality column, such as credit card numbers, is more likely to be isolated on very few nodes (or even isolated to one node), while the rows matching a value on a low-cardinality column are more likely to reside on numerous nodes. If a query does not specify a partition key, the Cassandra coordinator scans the token ring and group token ranges by endpoints (nodes). The coordinator then concurrently execute read commands for all participating endpoints. In the worst case where the indexed column has very high cardinality, an entire cluster scan may be required before finding a match. With low-cardinality columns, be aware that if your
LIMIT
is higher than the number of values in your targeted column, Cassandra has to search all replicas again before determining that theLIMIT
cannot be satisfied. In this case, Cassandra returns only the number of matching results. - What are the circumstances under which SAI applies post filtering?
-
SAI applies post-filtering in numerous scenarios. For example, consider a simple table, and an SAI index on just one of the two non-PK columns:
CREATE KEYSPACE test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE test.mytable (id int PRIMARY KEY, col1 text, col2 timestamp);
CREATE CUSTOM INDEX mytable_col1_idx ON test.mytable (col1) USING 'StorageAttachedIndex';
Given a query such as the following:
SELECT * FROM test.mytable WHERE col1 = 'hello world' and col2 < toTimestamp(now()) ALLOW FILTERING;
For this query, DSE narrows down the search by the indexed column (
col1
) first, then applies post-filtering oncol2
. (Use theALLOW FILTERING
clause with caution.) In this scenario, no additional replica roundtrips are needed; the post filtering oncol2
is carried out on the replicas themselves.Another case were post-filtering comes into play is when constructing a query that involves more than two SAI indexes. Refer to this related FAQ about
AND
queries. - Can I create an SAI index based on a static column?
-
Yes. For example, consider a
transaction_by_customer
table where you have a primary keycustomer_id
, plus static columns to contain each customer’saddress
,phone_number
, anddate_of_birth
. Given a query such as:SELECT * from transaction_by_customer where customer_id = 'xyz123';
If there are 100,000
transaction_by_customer
rows, because you defined those three static fields, this query runs against a table that uses significantly less disk space, as compared to an environment where writes had inserted the per-customer values (address
,phone_number
,date_of_birth
) in every row. SAI indexes that are based on static columns use much less disk space, and perform better than Solr-based indexes on non-static fields.DSE Search (Solr) does not allow you to create search indexes on tables that contain static columns. See https://docs.datastax.com/en/dse/6.8/dse-admin/datastax_enterprise/search/unsupportedSearchFeatures.html[Unsupported features for DSE Search]. SAI delivers the option and advantage of creating indexes based on static columns, while also achieving the benefit of conserving table space.
- For indexed strings, how does SAI handle Unicode characters in the column data?
-
When you create an SAI index based on a string column, set the
normalize
option totrue
if you want SAI to perform Unicode normalization on the column data. SAI supports Normalization Form C (NFC) Unicode. When set totrue
, SAI normalizes the different versions of a given Unicode character to a single version, retaining all the marks and symbols in the index. For example, SAI would change the character Å (U+212B) to Å (U+00C5). See CREATE CUSTOM INDEX. - Can the index’s column name have special characters?
-
SAI validates the column name on which an index is being defined. SAI allows alphanumeric characters and underscores only. SAI returns
InvalidRequestException
if you try to define an index on a column name that contains other characters, and does not create the index. - What partitioner does SAI support?
-
SAI supports only the Murmur3Partitioner.
- Is there a migration approach to move from DSE Search to SAI?
-
In your development environment, a general approach:
-
Establish baseline performance test numbers
-
Determine which queries that currently depend on DSE Search can migrate
-
For those queries, add SAI indexing to tables/columns to satisfy column-based data requests
-
Update the application to use CQL SAI; if relevant, remove redundant tables that were defined to duplicate denormalized data so that you could previously query non-PrimaryKey fields
-
Remove Solr Cores
-
Remove Solr Nodes
-
Run performance tests again and compare results
-
- How do I provide feedback and get support?
-
Send your feedback through the DataStax Community, via product.feedback@datastax.com, or both.