Using an SSTable Attached Secondary Index (SASI)

Using CQL to create an SSTable Attached Secondary Index (SASI) on a column after defining a table.

SSTable Attached Secondary Indexes (SASI) have improved the performance of secondary indexes but should be used with caution.
Attention: SASI indexes in DSE are experimental. DataStax does not support SASI indexes for production.
Using CQL, SSTable Attached Secondary Indexes (SASI) can be created on a non-collection column defined in a table. Secondary indexes are used to query a table that uses a column that is not normally possible to query, such as a non-primary key column. SASI implements the following types of indexes: PREFIX, CONTAINS, and SPARSE.

Procedure

PREFIX index
  • Create an index fn_prefix for the table cyclist_name on the column firstname. PREFIX is the default mode, so it does not need to be explicitly specified.
    CREATE TABLE IF NOT EXISTS cycling.cyclist_name (
      id UUID PRIMARY KEY,
      lastname text,
      firstname text
    );
    CREATE CUSTOM INDEX IF NOT EXISTS fn_prefix
    ON cycling.cyclist_name (firstname)
    USING 'org.apache.cassandra.index.sasi.SASIIndex';
    SELECT *
    FROM cycling.cyclist_name;
     id                                   | firstname | lastname
    --------------------------------------+-----------+-----------------
     e7ae5cf3-d358-4d99-b900-85902fda9bb0 |      Alex |           FRAME
     fb372533-eb95-4bb4-8685-6ef61e994caa |   Michael |        MATTHEWS
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |             VOS
     220844bf-4860-49d6-9a4b-6b5d3a79cbfb |     Paolo |       TIRALONGO
     6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 |    Steven |      KRUIKSWIJK
     e7cd5752-bc0d-4157-a80f-7523add8dbcd |      Anna | VAN DER BREGGEN
    
    (6 rows)
  • Queries can find exact matches for values in firstname. Note that indexing is used for this query because the primary key id is not specified:
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname = 'Marianne';
     id                                   | firstname | lastname
    --------------------------------------+-----------+----------
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS
    
    (1 rows)
  • Queries can find matches for values in firstname based on partial matches. The use of LIKE specifies that the match is looking for a word that starts with the letter M. The % after the letter M matches any characters after M and returns any matching values. Note that indexing is used for this query because the primary key id is not specified:
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname LIKE 'M%';
     id                                   | firstname | lastname
    --------------------------------------+-----------+----------
     fb372533-eb95-4bb4-8685-6ef61e994caa |   Michael | MATTHEWS
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS
    
    (2 rows)
  • Many queries will fail to find matches based on the partial string. The following queries do not return a match or return an error.
    The following query does not return a match. The first name is stored as 'Marianne'. The query provides 'MARIANNE', which does not match the alphabetic case sensitivity for 'Marianne':
    SELECT * FROM cycling.cyclist_name WHERE firstname = 'MARIANNE';
    The placement of the % characters are critical. Because the index uses the PREFIX mode, only a trailing % yields results when coupled with LIKE. This query uses LIKE and provides a trailing %, but the lowercase m at the start does not match any rows:
    SELECT * FROM cycling.cyclist_name WHERE firstname LIKE 'm%';
    These queries specify patterns that do not match any rows:
    SELECT * FROM cycling.cyclist_name WHERE firstname = 'M%';
    SELECT * FROM cycling.cyclist_name WHERE firstname = '%M';
    SELECT * FROM cycling.cyclist_name WHERE firstname = '%M%';
    SELECT * FROM cycling.cyclist_name WHERE firstname = 'm%';
    These queries generate errors stating that the column is not properly indexed (after the fn_contains index is created in the next section, the queries run):
    SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%m%';
    SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%m%' ALLOW FILTERING;
    SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%M%';
CONTAINS index
  • Create an index fn_contains for the table cyclist_name on the column firstname. CONTAINS is the specified mode, so that partial patterns specified, and not just the prefix, are matched.
    CREATE CUSTOM INDEX IF NOT EXISTS fn_contains
    ON cycling.cyclist_name (firstname)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = { 'mode': 'CONTAINS' };
  • Queries can find exact matches for values in firstname. Note that indexing is used for the following query, because the primary key id is not specified in the query. For queries that use CONTAINS indexing, the ALLOW FILTERING phrase must be included, although the database does not actually perform the filtering.
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname = 'Marianne'
    ALLOW FILTERING;
     id                                   | firstname | lastname
    --------------------------------------+-----------+----------
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS
    
    (1 rows)

    The same results are returned as those in the first PREFIX index query example previously shown, using a slightly modified query.

  • Queries can find matches for values in firstname based on partial matches. The use of LIKE specifies that the match is looking for a word that contains the letter M. The % before and after the letter M matches any characters that can return a matching value. Note that indexing is used for this query because the primary key id is not specified:
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname LIKE '%M%';
     id                                   | firstname | lastname
    --------------------------------------+-----------+----------
     fb372533-eb95-4bb4-8685-6ef61e994caa |   Michael | MATTHEWS
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS
    
    (2 rows)
    The same results are returned as those in the second PREFIX index query example previously shown, using a slightly modified query.
  • The CONTAINS index has a more versatile matching algorithm than PREFIX. The following queries return the same row. The first query specifies %arianne, which matches firstname values that end with arianne. The second query specifies %arian%, which matches firstname values that contain arian.
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname LIKE '%arianne';
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname LIKE '%arian%';
     id                                   | firstname | lastname
    --------------------------------------+-----------+----------
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS
    
    (1 rows)
  • With a CONTAINS index, inequality pattern matching is possible. Note again the use of ALLOW FILTERING in the following query, which is required but causes no latency in the query response:
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname > 'Mar'
    ALLOW FILTERING;
     id                                   | firstname | lastname
    --------------------------------------+-----------+------------
     fb372533-eb95-4bb4-8685-6ef61e994caa |   Michael |   MATTHEWS
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |        VOS
     220844bf-4860-49d6-9a4b-6b5d3a79cbfb |     Paolo |  TIRALONGO
     6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 |    Steven | KRUIKSWIJK
    
    (4 rows)
  • Like with PREFIX indexing, many queries will fail to find matches based on the partial string. These queries fail to find matches:
    SELECT * FROM cycling.cyclist_name WHERE firstname = 'MariAnne' ALLOW FILTERING;
    SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%m%';
    SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%M';
    SELECT * FROM cycling.cyclist_name WHERE firstname LIKE 'm%';

    The first two queries fail because of case sensitivity. 'MariAnne' has one uppercase letter, whereas the stored value does not. The other queries fail due to placement of the %.

  • Either the PREFIX index or the CONTAINS index can be created with case sensitivity by adding an analyzer_class and case_sensitive option.
    CREATE CUSTOM INDEX IF NOT EXISTS fn_suffix_allcase
    ON cycling.cyclist_name (firstname)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {
      'mode': 'CONTAINS',
      'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
      'case_sensitive': 'false'
    };
    The analyzer_class used here is the non-tokenizing analyzer that does not perform analysis on the text in the specified column. The option case_sensitive is set to false to make the indexing case-insensitive.
  • With the addition of the analyzer class and option, this query now works with the lowercase m:
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname LIKE '%m%';
     id                                   | firstname | lastname
    --------------------------------------+-----------+----------
     fb372533-eb95-4bb4-8685-6ef61e994caa |   Michael | MATTHEWS
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS
    
    (2 rows)
  • If queries are narrowed with an indexed column value, non-indexed columns can be specified. Compound queries can also be created with multiple indexed columns. The following example alters the table to add a column named age, which is included in a query at the end. The age column is not indexed.
    ALTER TABLE cycling.cyclist_name
    ADD age int;
    
    UPDATE cycling.cyclist_name
    SET age = 23
    WHERE id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;
    
    INSERT INTO cycling.cyclist_name (
      id, age, firstname, lastname
    ) VALUES (
      8566eb59-07df-43b1-a21b-666a3c08c08a, 18, 'Marianne', 'DAAE'
    );
    SELECT *
    FROM cycling.cyclist_name
    WHERE firstname = 'Marianne'
      AND age > 20
    ALLOW FILTERING;
     id                                   | age | firstname | lastname
    --------------------------------------+-----+-----------+----------
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  23 |  Marianne |      VOS
    
    (1 rows)
SPARSE index
  • The SPARSE index is meant to improve performance of querying large, dense number ranges like timestamps for data inserted every millisecond. If the data is numeric, millions of columns values with a small number of partition keys might be included in the data. If range queries will be performed against the index, then SPARSE is the best index type. For numeric data that does not meet this criteria, PREFIX is the best index type.
    CREATE CUSTOM INDEX IF NOT EXISTS fn_sparse
    ON cycling.comments (created_at)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = { 'mode': 'SPARSE' };

    Use SPARSE indexing for data that is sparse (every term/column value has less than 5 matching keys). Indexing the created_at field in time series data (where there is typically few matching rows/events per created_at timestamp) is a good use case. SPARSE indexing is primarily an optimization for range queries, especially large ranges that span large timespans.

  • To illustrate the use of the SPARSE index, create a table and insert some time series data:
    CREATE TABLE IF NOT EXISTS cycling.comments (
      record_id timeuuid,
      id uuid,
      commenter text,
      comment text,
      created_at timestamp,
      PRIMARY KEY (id, created_at)
    )
    WITH CLUSTERING ORDER BY (created_at DESC);
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-02-14 12:43:20-0800', 'Raining too hard should have postponed', 'Alex');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-02-14 12:43:20.234-0800', 'Raining too hard should have postponed', 'Alex');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-03-21 13:11:09.999-0800', 'Second rest stop was out of water', 'Alex');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-04-01 06:33:02.16-0800', 'LATE RIDERS SHOULD NOT DELAY THE START', 'Alex');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, totimestamp(now()), 'The gift certificate for winning was the best', 'Amy');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2017-02-17 12:43:20.234+0400', 'Glad you ran the race in the rain', 'Amy');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2017-03-22 5:16:59.001+0400', 'Great snacks at all reststops', 'Amy');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2017-04-01 17:43:08.030+0400', 'Last climb was a killer', 'Amy');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, totimestamp(now()), 'Fastest womens time ever way to go amy!', 'Maryanne');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, '2017-02-13 11:20:17.020-0600', 'Great race on a crappy day', 'Maryanne');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, '2017-03-20 15:45:10.101-0600', 'Saggers really rocked it', 'Maryanne');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, '2017-04-14 05:16:52.009-0600', 'Not bad for a flatlander', 'Maryanne');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, totimestamp(now()), 'Great course', 'Michael');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-02-15 18:22:11-0800', 'Some entries complain a lot', 'Michael');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-03-16 19:43:01.030-0800', 'Getting read for the race', 'Michael');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-03-22 1:19:44.060-0800', 'Awesome race glad you held it anyway', 'Michael');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-04-07 11:21:14.001-0800', 'Thanks for waiting for me!', 'Michael');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 9011d3be-d35c-4a8d-83f7-a3c543789ee7, totimestamp(now()), 'Can''t wait for the next race', 'Katarzyna');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 9011d3be-d35c-4a8d-83f7-a3c543789ee7, '2017-01-01 11:20:17.020-0600', 'Gearing up for the seaon', 'Katarzyna');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 5b6962dd-3f90-4c93-8f61-eabfa4a803e2, totimestamp(now()), 'Thanks for all your hard work', 'Marianne');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 220844bf-4860-49d6-9a4b-6b5d3a79cbfb, totimestamp(now()), 'A for effort!', 'Paolo');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c4b65263-fe58-4846-83e8-f0e1c13d518f, totimestamp(now()), 'Closing ceremony was a little lame', 'Rossella');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 38ab64b6-26cc-4de9-ab28-c257cf011659, totimestamp(now()), 'Next time guys!', 'Marcia');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 38ab64b6-26cc-4de9-ab28-c257cf011659, '2017-02-11 22:09:56+0800', 'First race was amazing, can''t wait for more', 'Marcia');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, totimestamp(now()), 'So many great races thanks y''all', 'Steven');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, '2017-02-02 09:49:00.02+0800', 'Best of luck everybody I can''t make it', 'Steven');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, '2017-04-05 12:01:00.003', 'Bike damaged in transit bummer', 'Steven');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7cd5752-bc0d-4157-a80f-7523add8dbcd, totimestamp(now()), 'Go team, you rocked it', 'Anna');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6d5f1663-89c0-45fc-8cfd-60a373b01622, totimestamp(now()), 'Next year the tour of california!', 'Melissa');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 95addc4c-459e-4ed7-b4b5-472f19a67995, totimestamp(now()), 'Next year for sure!', 'Vera');
    INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 95addc4c-459e-4ed7-b4b5-472f19a67995, '2017-02-13 11:40:16.123-0600', 'I can do without the rain@@@@', 'Vera');
  • This query returns the comments made before the timestamp 2017-02-13 11:40:16.123:
    SELECT *
    FROM cycling.comments
    WHERE created_at < '2017-02-13 11:40:16.123';
     id                                   | created_at                      | comment                                     | commenter | record_
    id
    --------------------------------------+---------------------------------+---------------------------------------------+-----------+--------
    ------------------------------
     9011d3be-d35c-4a8d-83f7-a3c543789ee7 | 2017-01-01 17:20:17.020000+0000 |                    Gearing up for the seaon | Katarzyna | c591c6e
    1-0c9f-11ea-bd8d-9f9b8a53b5f5
     38ab64b6-26cc-4de9-ab28-c257cf011659 | 2017-02-11 14:09:56.000000+0000 | First race was amazing, can't wait for more |    Marcia | c597934
    1-0c9f-11ea-bd8d-9f9b8a53b5f5
     6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 | 2017-02-02 01:49:00.020000+0000 |      Best of luck everybody I can't make it |    Steven | c59aa08
    1-0c9f-11ea-bd8d-9f9b8a53b5f5
    
    (3 rows)
    The query returns all the results where created_at is found to be less than the timestamp supplied. The inequalities >=, > and <= are all valid operators.

    SPARSE indexing is used only for numeric data, so LIKE queries do not apply.

Using analyzers
  • Analyzers can be specified that will analyze the text in the specified column. The NonTokenizingAnalyzer is used for cases where the text is not analyzed, but case normalization or sensitivity is required. The StandardAnalyzer is used for analysis that involves stemming, case normalization, case sensitivity, skipping common words like "and", "the", and "or", and localization of the language used to complete the analysis. This example adds a text column named comment to the cyclist_name table to provide details about the cyclist, adds some comments, and then creates the index:
    ALTER TABLE cycling.cyclist_name
    ADD comment text;
    UPDATE cycling.cyclist_name
    SET comment = 'Rides hard, gets along with others, a real winner'
    WHERE id = fb372533-eb95-4bb4-8685-6ef61e994caa;
    
    UPDATE cycling.cyclist_name
    SET comment = 'Rides fast, does not get along with others, a real dude'
    WHERE id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;
    CREATE CUSTOM INDEX IF NOT EXISTS stdanalyzer_idx
    ON cycling.cyclist_name (comment) USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {
      'mode': 'CONTAINS',
      'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
      'analyzed': 'true',
      'tokenization_skip_stop_words': 'and, the, or',
      'tokenization_enable_stemming': 'true',
      'tokenization_normalize_lowercase': 'true',
      'tokenization_locale': 'en'
    };
  • This query searches for the presence of a designated string, using the analyzed text to return a result:
    SELECT *
    FROM cycling.cyclist_name
    WHERE comment LIKE 'ride';
     id                                   | comment                                                 | firstname | lastname
    --------------------------------------+---------------------------------------------------------+-----------+----------
     fb372533-eb95-4bb4-8685-6ef61e994caa |       Rides hard, gets along with others, a real winner |   Michael | MATTHEWS
     5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Rides fast, does not get along with others, a real dude |  Marianne |      VOS
    
    (2 rows)
    The query returns all the results where ride is found either as an exact word or as a stem for another word, which is rides in this example.