Query with SAI
The SAI quickstart focuses only on defining multiple indexes based on non-primary key columns (a very useful feature). Let’s explore other options using some examples of how you can run queries on tables that have differently defined SAI indexes.
Vector search
This example uses the following table and index:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX IF NOT EXISTS ann_index
ON cycling.comments_vs(comment_vector) USING 'StorageAttachedIndex';
Simple query
To query data using Vector Search, use a SELECT
query:
-
CQL
-
Result
SELECT * FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] | Alex | 616e77e0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-02-17 08:43:20.234000+0000 | Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] | Amy | 6170c1d0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-04-01 13:43:08.030000+0000 | Last climb was a killer | [0.3, 0.75, 0.2, 0.2, 0.5] | Amy | 62105d30-22a2-11ee-b99d-1f350647414a
The limit has to be 1,000 or fewer. |
Scrolling to the right on the results shows the comments from the table that most closely matched the embeddings used for the query.
Similarity query
To obtain the similarity calculation of the best scoring node closest to the query data as part of the results, use a modified SELECT query:
- CQL
-
SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 3;
- Result
-
comment | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) -----------------------------------------------+----------------------------------------------------------------------- Second rest stop was out of water | 0.949701 The gift certificate for winning was the best | 0.86062 The gift certificate for winning was the best | 0.86062 (3 rows)
The supported functions for this type of query are:
-
similarity_dot_product
-
similarity_cosine
-
similarity_euclidean
with the parameters of (<vector_column>, <embedding_value>). Both parameters represent vectors.
Vector Search utilizes Approximate Nearest Neighbor (ANN) that in most cases yields results almost as good as the exact match. The scaling is superior to Exact Nearest Neighbor (KNN).
Least-similar searches are not supported.
Vector Search works optimally on tables with no overwrites or deletions of the comment_vector column. For an comment_vector column with changes, expect slower search results.
The embeddings were randomly generated in this example. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.
Single index match on a column
This example uses the following table and indexes:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'StorageAttachedIndex';
The column commenter
is not the partition key in this table, so an index is required to query on it.
Query for a match on that column:
-
Query
-
Result
SELECT * FROM cycling.comments_vs
WHERE commenter = 'Alex';
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] | Alex | 6d0cdaa0-272b-11ee-859f-b9098002fcac
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
Single index match on a column with options
This example uses the following table and indexes:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX commenter_cs_idx ON cycling.comments_vs (commenter)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'case_sensitive': 'true', 'normalize': 'true', 'ascii': 'true'};
Case-sensitivty
The column commenter
is not the partition key in this table, so an index is required to query on it.
If we want to check commenter
as a case-sensitive value, we can use the case_sensitive
option set to true
.
Note that no results are returned if you use an inappropriately case-sensitive value in the query:
-
Query
-
Result
SELECT * FROM comments_vs WHERE commenter ='alex';
id | created_at | comment | comment_vector | commenter | record_id
----+------------+---------+----------------+-----------+-----------
(0 rows)
When we switch the case of the cyclist’s name to match the case in the index, the query succeeds:
-
Query
-
Result
SELECT comment,commenter FROM comments_vs WHERE commenter ='Alex';
comment | commenter
----------------------------------------+-----------
LATE RIDERS SHOULD NOT DELAY THE START | Alex
Second rest stop was out of water | Alex
(2 rows)
Index match on a composite partition key column
This example uses the following table and indexes:
CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name (
race_year int,
race_name text,
cyclist_name text,
rank int,
PRIMARY KEY ((race_year, race_name), rank)
);
CREATE CUSTOM INDEX race_name_idx
ON cycling.rank_by_year_and_name (race_name)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX race_year_idx
ON cycling.rank_by_year_and_name (race_year)
USING 'StorageAttachedIndex';
Composite partition keys have a partition defined by multiple columns in a table.
Normally, you would need to specify all the columns in the partition key to query the table with a WHERE
clause.
However, an SAI index makes it possible to define an index using a single column in the table’s composite partition key.
You can create an SAI index on each column in the composite partition key, if you need to query based on just one column.
SAI indexes also allow you to query tables without using the inefficient ALLOW FILTERING
directive.
The ALLOW FILTERING
directive requires scanning all the partitions in a table, leading to poor performance.
The race_year
and race_name
columns comprise the composite partition key for the cycling.rank_by_year_and_name
table.
Query for a match on the column race_name
:
-
Query
-
Result
SELECT * FROM cycling.rank_by_year_and_name
WHERE race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu';
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Adam PHELAN
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Thomas LEBAS
Query for a match on the column race_year
:
-
Query
-
Result
SELECT * FROM cycling.rank_by_year_and_name
WHERE race_year = 2014;
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | 4th Tour of Beijing | 1 | Phillippe GILBERT
2014 | 4th Tour of Beijing | 2 | Daniel MARTIN
2014 | 4th Tour of Beijing | 3 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES
Multiple indexes matched with AND
This example uses the following table and indexes:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'StorageAttachedIndex';
Several indexes are created for the table to demonstrate how to query for matches on more than one column.
Query for matches on more than one column, and both columns must match:
-
Query
-
Result
SELECT * FROM cycling.comments_vs
WHERE
created_at='2017-03-21 21:11:09.999000+0000'
AND commenter = 'Alex';
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
Multiple indexes matched with OR
This example uses the following table and indexes:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'StorageAttachedIndex';
Several indexes are created for the table to demonstrate how to query for matches on more than one column.
Query for a match on either one column or the other:
-
Query
-
Result
SELECT * FROM cycling.comments_vs
WHERE
created_at='2017-03-21 21:11:09.999000+0000'
OR created_at='2017-03-22 01:16:59.001000+0000';
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-03-22 01:16:59.001000+0000 | Great snacks at all reststops | [0.1, 0.4, 0.1, 0.52, 0.09] | Amy | 6d0fc0d0-272b-11ee-859f-b9098002fcac
Multiple indexes matched with IN
This example uses the following table and indexes:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'StorageAttachedIndex';
Several indexes are created for the table to demonstrate how to query for matches on more than one column.
Query for match with column values in a list of values:
-
Query
-
Result
SELECT * FROM cycling.comments_vs
WHERE created_at IN
('2017-03-21 21:11:09.999000+0000'
,'2017-03-22 01:16:59.001000+0000');
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-03-22 01:16:59.001000+0000 | Great snacks at all reststops | [0.1, 0.4, 0.1, 0.52, 0.09] | Amy | 6d0fc0d0-272b-11ee-859f-b9098002fcac
User-defined type
SAI can index either a single field of a user-defined type (UDT) or a list of UDTs. This example shows how to index a list of UDTs.
This example uses the following user-defined type (UDT), table and index:
CREATE TYPE IF NOT EXISTS cycling.race (
race_title text,
race_date timestamp,
race_time text
);
CREATE TABLE IF NOT EXISTS cycling.cyclist_races (
id UUID PRIMARY KEY,
lastname text,
firstname text,
races list<FROZEN <race>>
);
CREATE CUSTOM INDEX races_idx
ON cycling.cyclist_races (races)
USING 'StorageAttachedIndex';
An index is created on the list of UDTs column races
in the cycling.cyclist_races
table.
Query with CONTAINS
from the list races
column:
-
CQL
-
Result
SELECT * FROM cycling.cyclist_races
WHERE races CONTAINS {
race_title:'Rabobank 7-Dorpenomloop Aalburg',
race_date:'2015-05-09',
race_time:'02:58:33'};
id | firstname | lastname | races
--------------------------------------+-----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS | [{race_title: 'Rabobank 7-Dorpenomloop Aalburg', race_date: '2015-05-09 00:00:00.000000+0000', race_time: '02:58:33'}, {race_title: 'Ronde van Gelderland', race_date: '2015-04-19 00:00:00.000000+0000', race_time: '03:22:23'}]
(1 rows)
SAI indexing with collections
SAI supports collections of type map
, list
, and set
.
Collections allow you to group and store data together in a column.
In a relational database, a grouping such as a user’s multiple email addresses is achieved via many-to-one joined relationship between (for example) a user
table and an email
table.
DSE avoids joins between two tables by storing the user’s email addresses in a collection column in the user
table.
Each collection specifies the data type of the data held.
A collection is appropriate if the data for collection storage is limited. If the data has unbounded growth potential, like messages sent or sensor events registered every second, do not use collections. Instead, use a table with a compound primary key where data is stored in the clustering columns.
In CQL queries of database tables with SAI indexes, the
|
Using the set type
This example uses the following table and index:
CREATE TABLE IF NOT EXISTS cycling.cyclist_career_teams (
id UUID PRIMARY KEY,
lastname text,
teams set<text>
);
CREATE CUSTOM INDEX teams_idx
ON cycling.cyclist_career_teams (teams)
USING 'StorageAttachedIndex';
An index is created on the set column teams
in the cyclist_career_teams
table.
Query with CONTAINS
from the set teams
column:
-
CQL
-
Result
SELECT * FROM cycling.cyclist_career_teams
WHERE teams CONTAINS 'Rabobank-Liv Giant';
id | lastname | teams
--------------------------------------+----------+------------------------------------------------------------------------------------------------------
5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | VOS | {'Nederland bloeit', 'Rabobank Women Team', 'Rabobank-Liv Giant', 'Rabobank-Liv Woman Cycling Team'}
Using the list type
This example uses the following table and index:
CREATE TABLE IF NOT EXISTS cycling.upcoming_calendar (
year int,
month int,
events list<text>,
PRIMARY KEY (year, month)
);
CREATE CUSTOM INDEX events_idx
ON cycling.upcoming_calendar (events)
USING 'StorageAttachedIndex';
An index is created on the list column events
in the upcoming_calendar
table.
Query with CONTAINS
from the list events
column:
-
CQL
-
Result
SELECT * FROM cycling.upcoming_calendar
WHERE events CONTAINS 'Criterium du Dauphine';
year | month | events
------+-------+-----------------------------------------------
2015 | 6 | ['Criterium du Dauphine', 'Tour de Sui\nsse']
A slightly more complex query selects rows that either contain a particular event or have a particular month date:
-
CQL
-
Result
SELECT * FROM cycling.upcoming_calendar
WHERE
events CONTAINS 'Criterium du Dauphine'
OR month = 7;
year | month | events
------+-------+-----------------------------------------------
2015 | 6 | ['Criterium du Dauphine', 'Tour de Sui\nsse']
2015 | 7 | ['Tour de France']
Using the map type
This example uses the following table and indexes:
CREATE TABLE IF NOT EXISTS cycling.cyclist_teams (
id uuid PRIMARY KEY,
firstname text,
lastname text,
teams map<int, text>
);
// Create an index on a map key to find all cyclist/team combos for a year
CREATE INDEX IF NOT EXISTS team_year_keys_idx
ON cycling.cyclist_teams ( KEYS (teams) );
// Create an index on a map key to find all cyclist/team combos for a year
CREATE INDEX IF NOT EXISTS team_year_values_idx
ON cycling.cyclist_teams ( VALUES (teams) );
// Create an index on a map key to find all cyclist/team combos for a year
CREATE INDEX IF NOT EXISTS team_year_entries_idx
ON cycling.cyclist_teams ( ENTRIES (teams) );
Indexes created on the map column teams
in the cyclist_career_teams
table target the keys, values, and full entries of the column data.
Query with KEYS
from the map teams
column:
-
CQL
-
Result
SELECT * FROM cyclist_teams WHERE teams CONTAINS KEY 2014;
id | firstname | lastname | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS | {2014: 'Rabobank-Liv Woman Cycling Team', 2015: 'Rabobank-Liv Woman Cycling Team'}
Query a value from the map teams
column, noting that only the keyword CONTAINS
is included:
-
CQL
-
Result
SELECT * FROM cyclist_teams WHERE teams CONTAINS 'Team Garmin - Cervelo';
id | firstname | lastname | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
Query entries from the map teams
column, noting the difference in the WHERE
clause:
-
CQL
-
Result
SELECT * FROM cyclist_teams
WHERE
teams[2014] = 'Boels:Dolmans Cycling Team'
AND teams[2015] = 'Boels:Dolmans Cycling Team';
id | firstname | lastname | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
This example looks for a row where two entries are present in the map teams
column.
For more information, see: