Using an SSTable Attached Secondary Index (SASI)
Using CQL to create an SSTable Attached Secondary Index (SASI) on a column after defining a table.
PREFIX
, CONTAINS
, and
SPARSE
.Procedure
-
Create an index
fn_prefix
for the tablecyclist_name
on the columnfirstname
.PREFIX
is the default mode, so it does not need to be explicitly specified.CREATE TABLE IF NOT EXISTS cycling.cyclist_name ( id UUID PRIMARY KEY, lastname text, firstname text );
CREATE CUSTOM INDEX IF NOT EXISTS fn_prefix ON cycling.cyclist_name (firstname) USING 'org.apache.cassandra.index.sasi.SASIIndex';
SELECT * FROM cycling.cyclist_name;
id | firstname | lastname --------------------------------------+-----------+----------------- e7ae5cf3-d358-4d99-b900-85902fda9bb0 | Alex | FRAME fb372533-eb95-4bb4-8685-6ef61e994caa | Michael | MATTHEWS 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS 220844bf-4860-49d6-9a4b-6b5d3a79cbfb | Paolo | TIRALONGO 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 | Steven | KRUIKSWIJK e7cd5752-bc0d-4157-a80f-7523add8dbcd | Anna | VAN DER BREGGEN (6 rows)
-
Queries can find exact matches for values in
firstname
. Note that indexing is used for this query because the primary keyid
is not specified:SELECT * FROM cycling.cyclist_name WHERE firstname = 'Marianne';
id | firstname | lastname --------------------------------------+-----------+---------- 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS (1 rows)
-
Queries can find matches for values in
firstname
based on partial matches. The use ofLIKE
specifies that the match is looking for a word that starts with the letterM
. The%
after the letterM
matches any characters after M and returns any matching values. Note that indexing is used for this query because the primary keyid
is not specified:SELECT * FROM cycling.cyclist_name WHERE firstname LIKE 'M%';
id | firstname | lastname --------------------------------------+-----------+---------- fb372533-eb95-4bb4-8685-6ef61e994caa | Michael | MATTHEWS 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS (2 rows)
-
Many queries will fail to find matches based on the partial string. The
following queries do not return a match or return an error.
The following query does not return a match. The first name is stored as
'Marianne'
. The query provides'MARIANNE'
, which does not match the alphabetic case sensitivity for'Marianne'
:SELECT * FROM cycling.cyclist_name WHERE firstname = 'MARIANNE';
The placement of the%
characters are critical. Because the index uses thePREFIX
mode, only a trailing%
yields results when coupled withLIKE
. This query usesLIKE
and provides a trailing%
, but the lowercasem
at the start does not match any rows:SELECT * FROM cycling.cyclist_name WHERE firstname LIKE 'm%';
These queries specify patterns that do not match any rows:SELECT * FROM cycling.cyclist_name WHERE firstname = 'M%'; SELECT * FROM cycling.cyclist_name WHERE firstname = '%M'; SELECT * FROM cycling.cyclist_name WHERE firstname = '%M%'; SELECT * FROM cycling.cyclist_name WHERE firstname = 'm%';
These queries generate errors stating that the column is not properly indexed (after thefn_contains
index is created in the next section, the queries run):SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%m%'; SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%m%' ALLOW FILTERING; SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%M%';
-
Create an index
fn_contains
for the tablecyclist_name
on the columnfirstname.
CONTAINS
is the specified mode, so that partial patterns specified, and not just the prefix, are matched.CREATE CUSTOM INDEX IF NOT EXISTS fn_contains ON cycling.cyclist_name (firstname) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'mode': 'CONTAINS' };
-
Queries can find exact matches for values in
firstname
. Note that indexing is used for the following query, because the primary keyid
is not specified in the query. For queries that useCONTAINS
indexing, theALLOW FILTERING
phrase must be included, although the database does not actually perform the filtering.SELECT * FROM cycling.cyclist_name WHERE firstname = 'Marianne' ALLOW FILTERING;
id | firstname | lastname --------------------------------------+-----------+---------- 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS (1 rows)
The same results are returned as those in the first
PREFIX
index query example previously shown, using a slightly modified query. -
Queries can find matches for values in
firstname
based on partial matches. The use ofLIKE
specifies that the match is looking for a word that contains the letterM
. The%
before and after the letterM
matches any characters that can return a matching value. Note that indexing is used for this query because the primary keyid
is not specified:SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%M%';
The same results are returned as those in the secondid | firstname | lastname --------------------------------------+-----------+---------- fb372533-eb95-4bb4-8685-6ef61e994caa | Michael | MATTHEWS 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS (2 rows)
PREFIX
index query example previously shown, using a slightly modified query. -
The
CONTAINS
index has a more versatile matching algorithm thanPREFIX
. The following queries return the same row. The first query specifies%arianne
, which matchesfirstname
values that end witharianne
. The second query specifies%arian%
, which matchesfirstname
values that containarian
.SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%arianne';
SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%arian%';
id | firstname | lastname --------------------------------------+-----------+---------- 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS (1 rows)
-
With a
CONTAINS
index, inequality pattern matching is possible. Note again the use ofALLOW FILTERING
in the following query, which is required but causes no latency in the query response:SELECT * FROM cycling.cyclist_name WHERE firstname > 'Mar' ALLOW FILTERING;
id | firstname | lastname --------------------------------------+-----------+------------ fb372533-eb95-4bb4-8685-6ef61e994caa | Michael | MATTHEWS 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS 220844bf-4860-49d6-9a4b-6b5d3a79cbfb | Paolo | TIRALONGO 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 | Steven | KRUIKSWIJK (4 rows)
-
Like with
PREFIX
indexing, many queries will fail to find matches based on the partial string. These queries fail to find matches:SELECT * FROM cycling.cyclist_name WHERE firstname = 'MariAnne' ALLOW FILTERING; SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%m%'; SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%M'; SELECT * FROM cycling.cyclist_name WHERE firstname LIKE 'm%';
The first two queries fail because of case sensitivity.
'MariAnne'
has one uppercase letter, whereas the stored value does not. The other queries fail due to placement of the%
. -
Either the
PREFIX
index or theCONTAINS
index can be created with case sensitivity by adding ananalyzer_class
andcase_sensitive
option.CREATE CUSTOM INDEX IF NOT EXISTS fn_suffix_allcase ON cycling.cyclist_name (firstname) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false' };
Theanalyzer_class
used here is the non-tokenizing analyzer that does not perform analysis on the text in the specified column. The optioncase_sensitive
is set tofalse
to make the indexing case-insensitive. -
With the addition of the analyzer class and option, this query now works with
the lowercase
m
:SELECT * FROM cycling.cyclist_name WHERE firstname LIKE '%m%';
id | firstname | lastname --------------------------------------+-----------+---------- fb372533-eb95-4bb4-8685-6ef61e994caa | Michael | MATTHEWS 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS (2 rows)
-
If queries are narrowed with an indexed column value, non-indexed columns can
be specified. Compound queries can also be created with multiple indexed
columns. The following example alters the table to add a column named
age
, which is included in a query at the end. Theage
column is not indexed.ALTER TABLE cycling.cyclist_name ADD age int; UPDATE cycling.cyclist_name SET age = 23 WHERE id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2; INSERT INTO cycling.cyclist_name ( id, age, firstname, lastname ) VALUES ( 8566eb59-07df-43b1-a21b-666a3c08c08a, 18, 'Marianne', 'DAAE' );
SELECT * FROM cycling.cyclist_name WHERE firstname = 'Marianne' AND age > 20 ALLOW FILTERING;
id | age | firstname | lastname --------------------------------------+-----+-----------+---------- 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | 23 | Marianne | VOS (1 rows)
-
The
SPARSE
index is meant to improve performance of querying large, dense number ranges like timestamps for data inserted every millisecond. If the data is numeric, millions of columns values with a small number of partition keys might be included in the data. If range queries will be performed against the index, thenSPARSE
is the best index type. For numeric data that does not meet this criteria,PREFIX
is the best index type.CREATE CUSTOM INDEX IF NOT EXISTS fn_sparse ON cycling.comments (created_at) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'mode': 'SPARSE' };
Use
SPARSE
indexing for data that is sparse (every term/column value has less than 5 matching keys). Indexing thecreated_at
field in time series data (where there is typically few matching rows/events percreated_at
timestamp) is a good use case.SPARSE
indexing is primarily an optimization for range queries, especially large ranges that span large timespans. -
To illustrate the use of the
SPARSE
index, create a table and insert some time series data:CREATE TABLE IF NOT EXISTS cycling.comments ( record_id timeuuid, id uuid, commenter text, comment text, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-02-14 12:43:20-0800', 'Raining too hard should have postponed', 'Alex'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-02-14 12:43:20.234-0800', 'Raining too hard should have postponed', 'Alex'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-03-21 13:11:09.999-0800', 'Second rest stop was out of water', 'Alex'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2017-04-01 06:33:02.16-0800', 'LATE RIDERS SHOULD NOT DELAY THE START', 'Alex'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, totimestamp(now()), 'The gift certificate for winning was the best', 'Amy'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2017-02-17 12:43:20.234+0400', 'Glad you ran the race in the rain', 'Amy'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2017-03-22 5:16:59.001+0400', 'Great snacks at all reststops', 'Amy'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2017-04-01 17:43:08.030+0400', 'Last climb was a killer', 'Amy'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, totimestamp(now()), 'Fastest womens time ever way to go amy!', 'Maryanne'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, '2017-02-13 11:20:17.020-0600', 'Great race on a crappy day', 'Maryanne'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, '2017-03-20 15:45:10.101-0600', 'Saggers really rocked it', 'Maryanne'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 8566eb59-07df-43b1-a21b-666a3c08c08a, '2017-04-14 05:16:52.009-0600', 'Not bad for a flatlander', 'Maryanne'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, totimestamp(now()), 'Great course', 'Michael'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-02-15 18:22:11-0800', 'Some entries complain a lot', 'Michael'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-03-16 19:43:01.030-0800', 'Getting read for the race', 'Michael'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-03-22 1:19:44.060-0800', 'Awesome race glad you held it anyway', 'Michael'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), fb372533-eb95-4bb4-8685-6ef61e994caa, '2017-04-07 11:21:14.001-0800', 'Thanks for waiting for me!', 'Michael'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 9011d3be-d35c-4a8d-83f7-a3c543789ee7, totimestamp(now()), 'Can''t wait for the next race', 'Katarzyna'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 9011d3be-d35c-4a8d-83f7-a3c543789ee7, '2017-01-01 11:20:17.020-0600', 'Gearing up for the seaon', 'Katarzyna'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 5b6962dd-3f90-4c93-8f61-eabfa4a803e2, totimestamp(now()), 'Thanks for all your hard work', 'Marianne'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 220844bf-4860-49d6-9a4b-6b5d3a79cbfb, totimestamp(now()), 'A for effort!', 'Paolo'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), c4b65263-fe58-4846-83e8-f0e1c13d518f, totimestamp(now()), 'Closing ceremony was a little lame', 'Rossella'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 38ab64b6-26cc-4de9-ab28-c257cf011659, totimestamp(now()), 'Next time guys!', 'Marcia'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 38ab64b6-26cc-4de9-ab28-c257cf011659, '2017-02-11 22:09:56+0800', 'First race was amazing, can''t wait for more', 'Marcia'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, totimestamp(now()), 'So many great races thanks y''all', 'Steven'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, '2017-02-02 09:49:00.02+0800', 'Best of luck everybody I can''t make it', 'Steven'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, '2017-04-05 12:01:00.003', 'Bike damaged in transit bummer', 'Steven'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), e7cd5752-bc0d-4157-a80f-7523add8dbcd, totimestamp(now()), 'Go team, you rocked it', 'Anna'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 6d5f1663-89c0-45fc-8cfd-60a373b01622, totimestamp(now()), 'Next year the tour of california!', 'Melissa'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 95addc4c-459e-4ed7-b4b5-472f19a67995, totimestamp(now()), 'Next year for sure!', 'Vera'); INSERT INTO cycling.comments (record_id, id, created_at, comment, commenter) VALUES (now(), 95addc4c-459e-4ed7-b4b5-472f19a67995, '2017-02-13 11:40:16.123-0600', 'I can do without the rain@@@@', 'Vera');
-
This query returns the comments made before the timestamp
2017-02-13 11:40:16.123
:SELECT * FROM cycling.comments WHERE created_at < '2017-02-13 11:40:16.123';
The query returns all the results whereid | created_at | comment | commenter | record_ id --------------------------------------+---------------------------------+---------------------------------------------+-----------+-------- ------------------------------ 9011d3be-d35c-4a8d-83f7-a3c543789ee7 | 2017-01-01 17:20:17.020000+0000 | Gearing up for the seaon | Katarzyna | c591c6e 1-0c9f-11ea-bd8d-9f9b8a53b5f5 38ab64b6-26cc-4de9-ab28-c257cf011659 | 2017-02-11 14:09:56.000000+0000 | First race was amazing, can't wait for more | Marcia | c597934 1-0c9f-11ea-bd8d-9f9b8a53b5f5 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 | 2017-02-02 01:49:00.020000+0000 | Best of luck everybody I can't make it | Steven | c59aa08 1-0c9f-11ea-bd8d-9f9b8a53b5f5 (3 rows)
created_at
is found to be less than the timestamp supplied. The inequalities>=
,>
and<=
are all valid operators.SPARSE
indexing is used only for numeric data, soLIKE
queries do not apply.
-
Analyzers can be specified that will analyze the text in the specified column.
The
NonTokenizingAnalyzer
is used for cases where the text is not analyzed, but case normalization or sensitivity is required. TheStandardAnalyzer
is used for analysis that involves stemming, case normalization, case sensitivity, skipping common words like "and", "the", and "or", and localization of the language used to complete the analysis. This example adds a text column namedcomment
to thecyclist_name
table to provide details about the cyclist, adds some comments, and then creates the index:ALTER TABLE cycling.cyclist_name ADD comment text;
UPDATE cycling.cyclist_name SET comment = 'Rides hard, gets along with others, a real winner' WHERE id = fb372533-eb95-4bb4-8685-6ef61e994caa; UPDATE cycling.cyclist_name SET comment = 'Rides fast, does not get along with others, a real dude' WHERE id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;
CREATE CUSTOM INDEX IF NOT EXISTS stdanalyzer_idx ON cycling.cyclist_name (comment) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer', 'analyzed': 'true', 'tokenization_skip_stop_words': 'and, the, or', 'tokenization_enable_stemming': 'true', 'tokenization_normalize_lowercase': 'true', 'tokenization_locale': 'en' };
-
This query searches for the presence of a designated string, using the analyzed
text to return a result:
SELECT * FROM cycling.cyclist_name WHERE comment LIKE 'ride';
The query returns all the results whereid | comment | firstname | lastname --------------------------------------+---------------------------------------------------------+-----------+---------- fb372533-eb95-4bb4-8685-6ef61e994caa | Rides hard, gets along with others, a real winner | Michael | MATTHEWS 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Rides fast, does not get along with others, a real dude | Marianne | VOS (2 rows)
ride
is found either as an exact word or as a stem for another word, which isrides
in this example.