SASI concepts

SSTable-Attached Secondary Indexing (SASI) is a highly scalable, per-SSTable index for Cassandra and DataStax Enterprise databases. SASI is implemented as a corresponding index file for each SSTable created on disk. SSTables are created during normal flush from memtable to disk, during compaction, and during streaming operations (node joining or being decommissioned). SASI enables full-text search as well as faster multi-criteria search in CQL.

SASI can be less resource-intensive, using less memory, disk, and CPU than the built-in secondary index (2i) implementation. It provides different functionality than 2i or SAI: SASI allows queries that filter using partial or full text matches. SASI enables querying strings by prefix, suffix, or substring, similar to the SQL patterns LIKE 'foo%', LIKE '%foo', and LIKE '%foo%' in SELECT queries. It also supports SPARSE indexing to improve performance of querying large, dense number ranges such as time series data.

SASI indexes can be created on any non-partition key columns of any CQL data type, except collections.

SASI enables queries that filter based on:

  • AND logic

  • numeric range

  • text type equality

  • CONTAINS logic for LIKE queries

  • PREFIX logic for LIKE queries

  • tokenized data

  • row-aware query path

  • case sensitivity (optional)

  • unicode normalization (optional)

SASI does not work with ByteOrderedPartitioner or RandomPartitioner, only with Murmur3Partitioner.

SASI does not support the NOT EQUALS or OR operators.

Supported query patterns include equality, prefix, suffix, compound predicates, a SASI-indexed column combined with a non-indexed column, and delimiter-based tokenization analysis.

SASI takes advantage of Cassandra’s write-once immutable ordered data model to build indexes when data is flushed from the memtable to disk. The SASI index data structures are built in memory as the SSTable is written and flushed to disk as sequential writes before the SSTable writing completes. One index file is written for each indexed column.

SASI supports all queries already supported by CQL, and adds the LIKE operator for indexes in PREFIX and CONTAINS mode, alongside SPARSE mode for equality and range queries. If ALLOW FILTERING is used, SASI also supports queries with multiple predicates using AND. With SASI, the usual performance pitfalls of filtering do not apply, because no filtering is actually performed even when ALLOW FILTERING is specified.
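A multi-predicate query of the kind described above might look like the following minimal sketch. The keyspace demo (assumed to exist already), the table users, its columns, and the index names are all hypothetical and chosen only for illustration.

```cql
-- Hypothetical table; the keyspace "demo" is assumed to exist.
CREATE TABLE IF NOT EXISTS demo.users (
    id uuid PRIMARY KEY,
    first_name text,
    last_name text,
    age int
);

-- SASI index on a text column (PREFIX mode).
CREATE CUSTOM INDEX IF NOT EXISTS users_first_name_idx ON demo.users (first_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};

-- SASI index on a numeric column; PREFIX is the default mode.
CREATE CUSTOM INDEX IF NOT EXISTS users_age_idx ON demo.users (age)
USING 'org.apache.cassandra.index.sasi.SASIIndex';

-- Two predicates combined with AND; ALLOW FILTERING is still required by CQL.
SELECT * FROM demo.users
WHERE first_name LIKE 'M%' AND age < 30
ALLOW FILTERING;
```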

SASI is implemented using memory mapped B+ trees, an efficient data structure for indexes. B+ trees allow range queries to perform quickly. SASI generates an index for each SSTable. Some key features that arise from this design are: SASI can reference offsets in the data file, skipping the Bloom filter and partition indexes to go directly to where data is stored. When SSTables are compacted, new indexes are generated automatically. Currently, SASI does not support collections. Regular secondary indexes can be built for collections. Static columns are supported in Cassandra 3.6 and later.

Cassandra SASI Index Technical Deep Dive

Author: doanduyhai, April 25, 2016

A) What is SASI?

SASI stands for SSTable-Attached Secondary Index, i.e. the life-cycle of SASI index files is the same as that of the corresponding SSTables. SASI is a contribution from a team of engineers. The contributors are:

  • Pavel Yaskevich

  • Jordan West

  • Jason Brown

  • Mikhail Stepura

  • Michael Kjellman

SASI is not yet another implementation of the Cassandra secondary index interface; it introduces a new idea: let the index file follow the life-cycle of the SSTable. It means that whenever an SSTable is created on disk, a corresponding SASI index file is also created. When are SSTables created?

  • during normal flush

  • during compaction

  • during streaming operations (node joining or being decommissioned)

To enable this new architecture, the Cassandra source code had to be modified to introduce the new SSTableFlushObserver class, whose goal is to intercept SSTable flushing and generate the corresponding SASI index file.

B) SASI Syntax and Usage

SASI uses the standard CQL syntax to create a custom secondary index. Let’s see all the available index options.
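As a minimal sketch of that syntax, the statement below creates a SASI index with CREATE CUSTOM INDEX and passes options through WITH OPTIONS. The keyspace demo (assumed to exist), the table albums, and the index name are hypothetical; the index class name is the one shipped with Cassandra.

```cql
-- Hypothetical table used only to illustrate the statement shape.
CREATE TABLE IF NOT EXISTS demo.albums (
    id uuid PRIMARY KEY,
    title text,
    year int
);

-- General shape: CREATE CUSTOM INDEX ... USING '<index class>' WITH OPTIONS = { ... }
CREATE CUSTOM INDEX IF NOT EXISTS albums_title_idx ON demo.albums (title)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};
```

Option keys and values are passed as quoted strings inside the OPTIONS map.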

1) For text data types (text, varchar & ascii)

Indexing mode:

  • PREFIX: allows matching text value by:
      - prefix using the LIKE 'prefix%' syntax
      - exact match using equality (=)

  • CONTAINS: allows matching text value by:
      - prefix using the LIKE 'prefix%' syntax (if org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer is used)
      - suffix using the LIKE '%suffix' syntax
      - substring using the LIKE '%substring%' syntax
      - exact match using equality (=) (if org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer is used)
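The two modes listed above could be exercised as in the sketch below. The keyspace demo (assumed to exist), the table songs, its columns, and the index names are hypothetical; no analyzer is configured, so only the match types that do not require one are shown for CONTAINS.

```cql
-- Hypothetical table for illustration.
CREATE TABLE IF NOT EXISTS demo.songs (
    id uuid PRIMARY KEY,
    artist text,
    title text
);

-- PREFIX mode: prefix and exact matches.
CREATE CUSTOM INDEX IF NOT EXISTS songs_artist_idx ON demo.songs (artist)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};

-- CONTAINS mode: suffix and substring matches.
CREATE CUSTOM INDEX IF NOT EXISTS songs_title_idx ON demo.songs (title)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS'};

SELECT * FROM demo.songs WHERE artist LIKE 'Smi%';    -- prefix
SELECT * FROM demo.songs WHERE artist = 'Smith';      -- exact match
SELECT * FROM demo.songs WHERE title LIKE '%night';   -- suffix
SELECT * FROM demo.songs WHERE title LIKE '%love%';   -- substring
```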

Text analysis:

  • analyzed (true/false): activates text analysis. Warning: lower-case/upper-case normalization requires an analyzer.

Analyzer class (analyzer_class):

  • org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer with options:
      - case_sensitive (true/false): search using case sensitivity
      - normalize_lowercase (true/false): store text as lowercase
      - normalize_uppercase (true/false): store text as uppercase

  • org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer with options:
      - tokenization_locale: locale to be used for tokenization, stemming and stop words skipping
      - tokenization_enable_stemming (true/false): enable stemming (locale dependent)
      - tokenization_skip_stop_words (true/false): skip indexing stop words (locale dependent)
      - tokenization_normalize_lowercase (true/false): store text as lowercase
      - tokenization_normalize_uppercase (true/false): store text as uppercase
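The analyzer options above might be combined as in the following sketch, with one index per analyzer class. The keyspace demo (assumed to exist), the table articles, its columns, the index names, and the search terms are hypothetical; the option keys are the ones listed above.

```cql
-- Hypothetical table for illustration.
CREATE TABLE IF NOT EXISTS demo.articles (
    id uuid PRIMARY KEY,
    author text,
    body text
);

-- NonTokenizingAnalyzer: the whole value is one term, matched case-insensitively.
CREATE CUSTOM INDEX IF NOT EXISTS articles_author_idx ON demo.articles (author)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'PREFIX',
    'analyzed': 'true',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

-- StandardAnalyzer: the value is tokenized, stemmed, and stripped of stop words.
CREATE CUSTOM INDEX IF NOT EXISTS articles_body_idx ON demo.articles (body)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzed': 'true',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
    'tokenization_locale': 'en',
    'tokenization_enable_stemming': 'true',
    'tokenization_skip_stop_words': 'true',
    'tokenization_normalize_lowercase': 'true'
};

SELECT * FROM demo.articles WHERE author LIKE 'doan%';    -- case-insensitive prefix
SELECT * FROM demo.articles WHERE body LIKE 'indexing';   -- term match after analysis
```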

2) For other data types (int, date, uuid …)

Indexing mode:

  • PREFIX: allows matching values by:
      - equality (=)
      - range ( <, ≤, >, ≥ )

  • SPARSE: allows matching sparse index values by:
      - equality (=)
      - range ( <, ≤, >, ≥ )

There is an important remark about SPARSE mode. Sparse means that for each indexed value, there are very few matching rows (at most 5). If there are more than 5 matching rows, an exception similar to the one below will be thrown:

Term - 'xxx' belongs to more than 5 keys in SPARSE mode, which is not allowed.

SPARSE mode has been designed primarily to index highly unique values and to allow efficient storage and efficient range queries. For example, if you’re storing user accounts and create an index on the account_creation_date column (millisecond precision), it’s likely that you’ll have very few matching users for a given date. However, you’ll be able to search for users whose accounts were created within a wide range of dates (WHERE account_creation_date > xxx AND account_creation_date < yyy) in a very efficient manner.
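The account example above could be sketched as follows. The keyspace demo (assumed to exist), the table user_accounts, the login column, the index name, and the date bounds are hypothetical; only the account_creation_date column name comes from the text.

```cql
-- Hypothetical table for illustration.
CREATE TABLE IF NOT EXISTS demo.user_accounts (
    user_id uuid PRIMARY KEY,
    login text,
    account_creation_date timestamp
);

-- SPARSE mode: each indexed value is expected to match very few rows.
CREATE CUSTOM INDEX IF NOT EXISTS user_accounts_created_idx
ON demo.user_accounts (account_creation_date)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'SPARSE'};

-- Range query over a wide window of creation dates.
SELECT * FROM demo.user_accounts
WHERE account_creation_date > '2016-01-01 00:00:00+0000'
  AND account_creation_date < '2016-02-01 00:00:00+0000';
```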
