SAI examples
SAI indexing can be used to index and query both primary key columns and non-primary key columns of almost any type. Find the type of index you want to create and follow the examples to create and query the index.
Each example starts with a table creation, followed by index creation, so that you can see the table schema and the index schema. Then sample queries for using the index are presented in a tabbed codeblock, to make it easier to examine queries side-by-side.
A keyspace must be created before tables and indexes can be created. The following keyspace is used for all examples on this page:
CREATE KEYSPACE IF NOT EXISTS cycling
WITH REPLICATION = {
'class' : 'SimpleStrategy',
'replication_factor' : 1
};
Primary key columns
Make an index on primary key columns, either composite partition key columns or clustering columns.
Single primary key column
This index creation will fail, because the table has a single PRIMARY KEY
column.
Such indexes are not allowed.
The following example demonstrates the failure.
-
Create a table called
cycling.alt_stats
with a primary key if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats ( id UUID PRIMARY KEY, lastname text, birthday date, nationality text, weight float, w_units text, height float, first_race date, last_race date );
-
Create a SAI index on the
id
column, which is a primary key.CREATE CUSTOM INDEX id_idx ON cycling.cyclist_alt_stats (id) USING 'StorageAttachedIndex';
Results
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot create secondary index on the only partition key column id"
This index creation will fail, because the
id
column is the single primary key, as well as the single partition key. Such indexes are not allowed.
Composite partition key columns in primary key
A composite partition key uses two or more columns to define the partition key in the PRIMARY KEY
definition.
Normally, you would need to specify all the columns in the partition key to query the table with a WHERE
clause.
However, an SAI index makes it possible to define an index using a single column in the table’s composite partition key.
You can create an SAI index on each column in the composite partition key, if you need to query based on just one column at a time.
SAI indexes also allow you to query tables without using the inefficient ALLOW FILTERING
directive.
The ALLOW FILTERING
directive requires scanning all the partitions in a table, leading to poor performance.
To filter a query using a column that is not the full partition key, you must create a SAI index on the column that you wish to use for the query.
-
Create a table called
cycling.rank_by_year_and_name
with a composite partition key and a clustering column if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name ( race_year int, race_name text, cyclist_name text, rank int, PRIMARY KEY ((race_year, race_name), rank) );
-
Create two SAI indexes on the
race_year
andrace_name
columns. These two columns are the composite partition key in thePRIMARY KEY
definition.CREATE CUSTOM INDEX IF NOT EXISTS race_year_idx ON cycling.rank_by_year_and_name (race_year) USING 'StorageAttachedIndex'; CREATE CUSTOM INDEX IF NOT EXISTS race_name_idx ON cycling.rank_by_year_and_name (race_name) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the table schema with the
DESCRIBE TABLE
CQL command.DESCRIBE INDEX cycling.race_year_idx; DESCRIBE INDEX cycling.race_name_idx;
Results
CREATE CUSTOM INDEX race_year_idx ON cycling.rank_by_year_and_name (race_year) USING 'StorageAttachedIndex'; CREATE CUSTOM INDEX race_name_idx ON cycling.rank_by_year_and_name (race_name) USING 'StorageAttachedIndex';
-
Now that you have created the table and indexes, try some
SELECT
queries.-
The first example tab,
SELECT 1
, shows a query with only one of the partition key columns. -
The second example tab,
SELECT 2 - AND
, shows a query with both of the partition key columns usingAND
.-
SELECT 1
-
SELECT 2 - AND
Query the table using only the
race_year`column. A similar query using only the `race_name
column can also be performed.SELECT * FROM cycling.rank_by_year_and_name WHERE race_year = 2014 AND race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu';
Results
race_year | race_name | rank | cyclist_name -----------+--------------------------------------------+------+---------------------- 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES (3 rows)
This shows the distinct advantage of SAI over 2i. For 2i indexes, you must use the
ALLOW FILTERING
directive when querying a table using a column that is not the full partition key, but SAI indexes do not requireALLOW FILTERING
.You may want to query the table using both the
race_year
andrace_name
columns.SELECT * FROM cycling.rank_by_year_and_name WHERE race_year = 2014 AND race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu';
+ .Results
Details
race_year | race_name | rank | cyclist_name -----------+--------------------------------------------+------+---------------------- 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES (3 rows)
-
-
Clustering column in primary key
A clustering column sorts the rows within a partition. Clustering columns are part of the PRIMARY KEY
definition.
To filter a query using just a clustering column, you must create a 2i index on the column that you wish to use for the query.
If you wish to use a clustering column in a query, it is best to consider restricting a query to a single partition. Using a partition key column plus a clustering column will have better performance.
-
Create a table called
cycling.rank_by_year_and_name
with a composite partition key and a clustering column if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name ( race_year int, race_name text, cyclist_name text, rank int, PRIMARY KEY ((race_year, race_name), rank) );
-
Create a SAI index on the
rank
clustering column.CREATE CUSTOM INDEX IF NOT EXISTS rank_idx ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the schema with the
DESCRIBE INDEX
CQL command.DESCRIBE INDEX cycling.rank_idx;
Results
CREATE CUSTOM INDEX rank_idx ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';
-
Now that you have created the table and index, try some queries.
-
The first example tab shows a query with only the clustering column.
-
The second example tab shows a query with the clustering column and the partition key column.
-
SELECT 1 - rank
-
SELECT 2 - AND
Query the table using only the
rank
column.SELECT * FROM cycling.rank_by_year_and_name WHERE rank = 3;
Results
race_year | race_name | rank | cyclist_name -----------+--------------------------------------------+------+---------------------- 2014 | 4th Tour of Beijing | 3 | Johan Esteban CHAVES 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES 2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Thomas LEBAS (3 rows)
Query the table using the
race_year
andrank
columns.SELECT * FROM cycling.rank_by_year_and_name WHERE race_year = 2014 AND rank = 3 ALLOW FILTERING;
Results
race_year | race_name | rank | cyclist_name -----------+--------------------------------------------+------+---------------------- 2014 | 4th Tour of Beijing | 3 | Johan Esteban CHAVES 2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES (2 rows)
-
-
Non-primary key columns
Indexes can be created on non-primary key columns of almost any type.
Text column
Text columns can be indexed and searched both for equality and inequality.
-
Create a table called
cycling.alt_stats
with a non-primary key TEXT column callednationality
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats ( id UUID PRIMARY KEY, lastname text, birthday date, nationality text, weight float, w_units text, height float, first_race date, last_race date );
-
Create a SAI index on the
nationality
column, a non-primary key TEXT column.CREATE CUSTOM INDEX nationality_idx ON cycling.cyclist_alt_stats (nationality) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the index schema:
DESCRIBE INDEX cycling.nationality_idx;
Results
CREATE CUSTOM INDEX nationality_idx ON cycling.cyclist_alt_stats (nationality) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
-
Query the table using the
nationality
column.SELECT first_race, last_race, birthday FROM cycling.cyclist_alt_stats WHERE nationality = 'France';
Results
first_race | last_race | birthday ------------+------------+------------ 2006-03-15 | 2017-04-16 | 1990-05-27 (1 rows)
Date column
-
Create a table called
cycling.alt_stats
with DATE column calledbirthday
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats ( id UUID PRIMARY KEY, lastname text, birthday date, nationality text, weight float, w_units text, height float, first_race date, last_race date );
-
Create a SAI index on the
birthday
column, a non-primary key TEXT column.CREATE CUSTOM INDEX birthday_idx ON cycling.cyclist_alt_stats (birthday) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the index:
DESCRIBE INDEX cycling.birthday_idx;
Results
CREATE CUSTOM INDEX birthday_idx ON cycling.cyclist_alt_stats (birthday) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
-
Query the table using the
birthday
column.SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats WHERE birthday = '1981-03-29';
Results
lastname | birthday | nationality -----------+------------+------------- VEIKKANEN | 1981-03-29 | Finland (1 rows)
Now that you have created indexes on the columns birthday
and nationality
, there are a number of other queries that you can perform.
Multiple matches with AND
, OR
, and IN
operators are a common use case for SAI indexes.
Inequality ranges are another useful case of SAI indexes.
-
The first example tab,
SELECT 1 - AND
, shows a query with both thebirthday
andnationality
columns. -
The second example tab,
SELECT 2 - OR
, shows that a query with either thebirthday
ornationality
columns. Depending on your product, this query may or may not succeed. -
The third example tab,
SELECT 3 - IN
, shows how to query with a list ofnationality
values. -
The fourth example tab,
SELECT 4 - RANGE
, shows how to query with an inequality range ofbirthday
values.
-
SELECT 1 - AND
-
SELECT 2 - OR
-
SELECT 3 - IN
-
SELECT 4 - RANGE
Query the table using both a nationality and a birthday.
SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
WHERE birthday = '1991-08-25' AND nationality = 'Ethiopia'
ALLOW FILTERING;
SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
WHERE birthday = '1991-08-25' AND nationality = 'Ethiopia';
Results
lastname | birthday | nationality
----------+------------+-------------
GRMAY | 1991-08-25 | Ethiopia
(1 rows)
lastname | birthday | nationality
----------+------------+-------------
GRMAY | 1991-08-25 | Ethiopia
(1 rows)
Query the table using a nationality or a birthday.
SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
WHERE birthday = '1991-08-25' OR nationality = 'Ethiopia';
Results
lastname | birthday | nationality
----------+------------+-------------
GRMAY | 1991-08-25 | Ethiopia
(1 rows)
Query the table using nationality with IN
.
SELECT * FROM cycling.cyclist_alt_stats
WHERE nationality IN ('Finland', 'France');
Results
id | birthday | first_race | height | last_race | lastname | nationality | w_units | weight
--------------------------------------+------------+------------+--------+------------+-----------+-------------+---------+--------
c09e9451-50da-483d-8108-e6bea2e827b3 | 1981-03-29 | 1996-05-21 | 1.78 | 2012-10-02 | VEIKKANEN | Finland | kg | 66
4ceb495c-55ab-4f71-83b9-81117252bb13 | 1990-05-27 | 2006-03-15 | null | 2017-04-16 | DUVAL | France | null | null
(2 rows)
Query the table using a birthday range.
SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
WHERE birthday >'1981-03-29';
Results
lastname | birthday | nationality
-----------+------------+-------------
BRUTT | 1982-01-29 | Russia
ISAYCHEV | 1986-04-21 | Russia
TSATEVICH | 1989-07-05 | Russia
PAULINHO | 1990-05-27 | Portugal
GRMAY | 1991-08-25 | Ethiopia
BELKOV | 1985-01-09 | Russia
DUVAL | 1990-05-27 | France
(7 rows)
Inequalities are a useful feature of SAI indexes.
Integer column
An integer column can be indexed similarly to a text column.
-
Create a table called
cycling.rank_by_year_and_name
with INT column calledrank
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name ( race_year int, race_name text, cyclist_name text, rank int, PRIMARY KEY ((race_year, race_name), rank) );
-
Create a SAI index on the
rank
column, a non-primary key INT column.CREATE CUSTOM INDEX IF NOT EXISTS rank_idx ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the index schema.
DESCRIBE INDEX cycling.rank_idx;
Results
CREATE CUSTOM INDEX rank_idx ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
-
SELECT 1 - equality
-
SELECT 2 - inequality range
Query the table using the rank
column in an equality.
SELECT * FROM cycling.rank_by_year_and_name
WHERE rank = 3;
Results
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | 4th Tour of Beijing | 3 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Thomas LEBAS
(3 rows)
Query the table using the rank
column in an inequality range.
SELECT * FROM cycling.rank_by_year_and_name
WHERE rank < 3;
Results
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | 4th Tour of Beijing | 1 | Phillippe GILBERT
2014 | 4th Tour of Beijing | 2 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES
2015 | Giro d'Italia - Stage 11 - Forli > Imola | 1 | Ilnur ZAKARIN
2015 | Giro d'Italia - Stage 11 - Forli > Imola | 2 | Carlos BETANCUR
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Adam PHELAN
(8 rows)
Float column
Float columns can be indexed and searched.
-
Create a table called
cycling.cyclist_alt_stats
with FLOAT column calledweight
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats ( id UUID PRIMARY KEY, lastname text, birthday date, nationality text, weight float, w_units text, height float, first_race date, last_race date );
-
Create a SAI index on the
weight
column, a non-primary key FLOAT column.CREATE CUSTOM INDEX weight_idx ON cycling.cyclist_alt_stats (weight) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the table schema with the
DESCRIBE TABLE
CQL command.DESCRIBE INDEX cycling.weight_idx;
Results
CREATE CUSTOM INDEX weight_idx ON cycling.cyclist_alt_stats (weight) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
-
The first example tab shows that a query with the
weight
column can be performed using an equality. -
The second example tab shows that a query with the
weight
column can be performed using an inequality range.
-
SELECT 1 - equality
-
SELECT 2 - inequality range
Query the table using the weight
column in an equality.
SELECT lastname, nationality, weight FROM cycling.cyclist_alt_stats
WHERE weight = 66;
Results
lastname | nationality | weight
-----------+-------------+--------
VEIKKANEN | Finland | 66
(1 rows)
Query the table using the weight
column in an inequality range.
SELECT lastname, nationality, weight FROM cycling.cyclist_alt_stats
WHERE weight > 70;
Results
lastname | nationality | weight
----------+-------------+--------
ISAYCHEV | Russia | 80
BELKOV | Russia | 71
(2 rows)
Timestamp column
Timestamp columns can be indexed and searched.
-
Create a table called
cycling.comments
with TIMESTAMP column calledcreated_at
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.comments ( record_id timeuuid, id uuid, commenter text, comment text, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create a SAI index on the
comments
column, a non-primary key TIMESTAMP column.CREATE CUSTOM INDEX created_at_idx ON cycling.comments (created_at) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the index schema.
DESCRIBE INDEX cycling.created_at_idx;
Results
CREATE CUSTOM INDEX created_at_idx ON cycling.comments (created_at) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
-
The first example tab shows that a query with the
created_at
column can be performed using an equality. -
The second example tab shows that a query with the
created_at
column can be performed using an inequality range.
-
SELECT 1 - equality
-
SELECT 2 - inequality range
Query the table using the created_at
column in an equality.
SELECT commenter, comment, created_at FROM cycling.comments
WHERE created_at = '2024-06-07 03:13:40.268';
Results
commenter | comment | created_at
-----------+----------------------+---------------------------------
Amy | Ride in any weather! | 2024-06-07 03:13:40.268000+0000
(1 rows)
Query the table using the created_at
column in an inequality range.
SELECT commenter, comment, created_at FROM cycling.comments
WHERE created_at < '2024-06-07 03:13:40.268';
Results
commenter | comment | created_at
-----------+----------------------------------------+---------------------------------
Alex | Raining too hard should have postponed | 2023-04-01 11:21:59.001000+0000
(1 rows)
Tuple column
Tuples columns can be indexed and searched, but only the full tuple can be indexed.
-
Create a table called
cycling.nation_rank
with TUPLE column calledinfo
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.nation_rank ( nation text PRIMARY KEY, info tuple<int, text, int> );
-
Create a SAI index on the
info
column, a non-primary key TUPLE column.CREATE CUSTOM INDEX info_idx ON cycling.nation_rank (info) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the table schema with the
DESCRIBE TABLE
CQL command.DESCRIBE INDEX cycling.info_idx;
Results
CREATE CUSTOM INDEX info_idx ON cycling.nation_rank (info) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
Query the table using the info
column in an equality.
Tuples can only be filtered using the full tuple value.
SELECT * FROM cycling.nation_rank
WHERE info = (3, 'Phillippe GILBERT', 6222);
Results
nation | info
---------+--------------------------------
Belgium | (3, 'Phillippe GILBERT', 6222)
(1 rows)
User-defined type (UDT) column
SAI can index either a user-defined type (UDT) or a list of UDTs. This example shows how to index a list of UDTs.
-
Create a UDT called
cycling.race
if it doesn’t already exist.CREATE TYPE IF NOT EXISTS cycling.race ( race_title text, race_date timestamp, race_time text ); DESCRIBE TYPE cycling.race;
-
Create a table called
cycling.cyclist_races
with a list of UDTs column calledraces
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_races ( id UUID PRIMARY KEY, lastname text, firstname text, races list<FROZEN <race>> );
-
Create a SAI index on the
races
column, a non-primary key list of UDTs column.CREATE CUSTOM INDEX IF NOT EXISTS races_idx ON cycling.cyclist_races (races) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the table schema with the
DESCRIBE TABLE
CQL command.CREATE CUSTOM INDEX races_idx ON cycling.cyclist_races (values(races)) USING 'StorageAttachedIndex';
Results
CREATE CUSTOM INDEX races_idx ON cycling.cyclist_races (values(races)) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
-
Query with
CONTAINS
from the listraces
column:SELECT * FROM cycling.cyclist_races WHERE races CONTAINS { race_title:'Rabobank 7-Dorpenomloop Aalburg', race_date:'2015-05-09', race_time:'02:58:33'};
Results
id | firstname | lastname | races --------------------------------------+-----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS | [race(race_title='Rabobank 7-Dorpenomloop Aalburg', race_date=datetime.datetime(2015, 5, 9, 0, 0), race_time='02:58:33'), race(race_title='Ronde van Gelderland', race_date=datetime.datetime(2015, 4, 19, 0, 0), race_time='03:22:23')] (1 rows)
Non-primary key collection column
Collections can be indexed and queried to find a collection containing a particular value. Sets and lists are indexed a bit differently from maps, given the key-value nature of maps.
Sets and lists can index all values found by indexing the collection column.
Maps can index a map key, map value, or map entry using the methods shown below.
Multiple indexes can be created on the same map column in a table so that map keys, values, or entries can be queried.
In addition, frozen collections can be indexed using FULL
to index the full content of a frozen collection.
Set
Create an index on a set to find all the cyclists that have been on a particular team.
-
Create a table called
cycling.cyclist-career-teams
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_career_teams ( id UUID PRIMARY KEY, lastname text, teams set<text> );
-
Create a SAI index on the
teams
column.CREATE CUSTOM INDEX IF NOT EXISTS teams_idx ON cycling.cyclist_career_teams (teams) USING 'StorageAttachedIndex';
-
Verify the created indexes by checking the index schema.
DESCRIBE INDEX cycling.teams_idx;
Results
CREATE CUSTOM INDEX teams_idx ON cycling.cyclist_career_teams (values(teams)) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
Query the table using the teams column with a CONTAINS
phrase to get the rows where a set contains a particular value.
SELECT lastname, teams FROM cycling.cyclist_career_teams
WHERE teams CONTAINS 'Rabobank-Liv Giant';
Results
lastname | teams
----------+----------------------------------------------------------------------------------------------------
BRAND | {'AA Drink - Leontien.nl', 'Leontien.nl', 'Rabobank-Liv Giant', 'Rabobank-Liv Woman Cycling Team'}
(1 rows)
List
Create an index on a set to find all the cyclists that have a particular sponsor.
-
Create a table called
cycling.race_sponsors
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.race_sponsors ( race_year int, race_name text, sponsorship list<text>, PRIMARY KEY (race_year, race_name) );
-
Create a SAI index on the
sponsorship
list column.CREATE CUSTOM INDEX sponsorship_idx ON cycling.race_sponsors (sponsorship) USING 'StorageAttachedIndex';
-
Verify the created index by checking the index schema.
DESCRIBE INDEX cycling.sponsorship_idx;
Results
CREATE CUSTOM INDEX sponsorship_idx ON cycling.race_sponsors (values(sponsorship)) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
Query the table using the sponsorship column with a CONTAINS
phrase to get the rows where a set contains a particular value.
SELECT * FROM cycling.race_sponsors WHERE sponsorship CONTAINS 'Carrefour';
Results
race_year | race_name | sponsorship
-----------+----------------+-------------------------------------------------
2018 | Tour de France | ['LCL', 'Carrefour', 'Skoda', 'Vittel', 'Krys']
(1 rows)
For map collections, create an index on the map key, map value, or map entry.
Map - keys
Create an index on a map key to find all cyclist/team combinations for a particular year.
-
Create a table called
cycling.cyclist_teams
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_teams ( id uuid PRIMARY KEY, firstname text, lastname text, teams map<int, text> );
-
Create a SAI index on the
teams
map column.CREATE CUSTOM INDEX IF NOT EXISTS team_keys_idx ON cycling.cyclist_teams ( KEYS (teams) ) USING 'StorageAttachedIndex';
-
Verify the created index by checking the index schema.
DESCRIBE index cycling.team_keys_idx;
Results
CREATE CUSTOM INDEX team_keys_idx ON cycling.cyclist_teams (keys(teams)) USING 'StorageAttachedIndex';
-
Now that you have created the table and index, query using a key for the map column. Query the table using the
teams
column with aCONTAINS KEY
phrase to get the rows where a map contains a particular key.SELECT firstname,lastname,teams FROM cycling.cyclist_teams WHERE teams CONTAINS KEY 2015;
Results
firstname | lastname | teams -----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'} Marianne | VOS | {2015: 'Rabobank-Liv Woman Cycling Team'} (2 rows)
-
Note that not all map columns are designed to make search via the map key useful. For example, look at this next query. Create a table called
cycling.birthday
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.birthday_list ( cyclist_name text PRIMARY KEY, blist map<text, text> );
-
Create a SAI index on the
blist
map column.CREATE CUSTOM INDEX IF NOT EXISTS blist_keys_idx ON cycling.birthday_list ( KEYS(blist) ) USING 'StorageAttachedIndex';
-
Verify the created index by checking the index schema.
DESCRIBE INDEX cycling.blist_keys_idx;
Results
CREATE CUSTOM INDEX blist_keys_idx ON cycling.birthday_list (keys(blist)) USING 'StorageAttachedIndex';
Query the table using the blist
column with a CONTAINS KEY
phrase to get the rows where a map contains a particular key.
SELECT * FROM cycling.birthday_list WHERE blist CONTAINS KEY 'age';
Results
cyclist_name | blist
------------------+--------------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Claudio VANDELLI | {'age': '54', 'bday': '27/07/1961', 'nation': 'ITALY'}
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
Allan DAVIS | {'age': '35', 'bday': '27/07/1980', 'nation': 'AUSTRALIA'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(6 rows)
In this case, the map key is not a useful index because the map key is not a unique identifier, but the same value for all rows.
Map - values
Create an index on the map values and find cyclists whose teams
column has a particular value found in the specified map.
-
Create a table called
cycling.cyclist_teams
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_teams ( id uuid PRIMARY KEY, firstname text, lastname text, teams map<int, text> );
-
Create a SAI index on the
teams
map column.CREATE CUSTOM INDEX IF NOT EXISTS team_values_idx ON cycling.cyclist_teams ( VALUES (teams) ) USING 'StorageAttachedIndex';
+
DESCRIBE index cycling.team_values_idx;
+ .Results
Details
CREATE CUSTOM INDEX team_values_idx ON cycling.cyclist_teams (values(teams)) USING 'StorageAttachedIndex';
-
Now that you have created the table and index, query using a value for the map column.
SELECT firstname,lastname,teams FROM cycling.cyclist_teams WHERE teams CONTAINS 'Team Garmin - Cervelo';
Results
firstname | lastname | teams -----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Jamie | BENNETT | {2011: 'Team Garmin - Cervelo', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team'} Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'} (2 rows)
-
Because the values of a map column are commonly unique, querying on a map value is more generally useful. Create a table called
cycling.birthday
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.birthday_list ( cyclist_name text PRIMARY KEY, blist map<text, text> );
-
Create a SAI index on the
blist
map column.CREATE CUSTOM INDEX IF NOT EXISTS blist_values_idx ON cycling.birthday_list ( VALUES(blist) ) USING 'StorageAttachedIndex';
-
Verify the created index by checking the table schema.
DESCRIBE INDEX cycling.blist_values_idx;
Results
CREATE CUSTOM INDEX blist_values_idx ON cycling.birthday_list (values(blist)) USING 'StorageAttachedIndex';
-
Now that you have created the table and index, query using a value for the map column.
SELECT * FROM cycling.birthday_list WHERE blist CONTAINS 'NETHERLANDS';
Results
cyclist_name | blist ---------------+-------------------------------------------------------------- Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'} Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'} (2 rows)
Map - entries
Create an index on the map entries and find cyclists who are the same age.
An index using ENTRIES
is only valid for maps.
-
Create a table called
cycling.birthday
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.birthday_list ( cyclist_name text PRIMARY KEY, blist map<text, text> );
-
Create a SAI index on the
blist
map column.CREATE CUSTOM INDEX IF NOT EXISTS blist_entries_idx ON cycling.birthday_list ( ENTRIES(blist) ) USING 'StorageAttachedIndex';
-
Verify the created index by checking the index schema.
DESCRIBE INDEX cycling.blist_entries_idx;
Results
CREATE CUSTOM INDEX blist_entries_idx ON cycling.birthday_list (entries(blist)) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a key and a value for the map column.
-
SELECT 1
-
SELECT 2
-
DATE
-
AND
Query the table using both a key and a value to get the rows where a map contains a particular entry.
SELECT * FROM cycling.birthday_list WHERE blist[ 'age' ] = '23';
Results
cyclist_name | blist
------------------+----------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(2 rows)
Query the table using both a key and a value to get the rows where a map contains cyclists from the same country in the entry for each.
SELECT * FROM cycling.birthday_list WHERE blist[ 'nation' ] = 'NETHERLANDS';
Results
cyclist_name | blist
---------------+--------------------------------------------------------------
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
(2 rows)
Any number of different queries can be made using the ENTRIES
index, such as selecting a matching entry with a DATE
data type.
SELECT * FROM cycling.birthday_list WHERE blist [ 'bday' ] = '27/07/1992'
ALLOW FILTERING;
Results
cyclist_name | blist
------------------+----------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(2 rows)
Two different matches can be made using AND
and the ENTRIES
index.
SELECT * FROM cycling.birthday_list
WHERE blist[ 'nation' ] = 'CANADA' AND blist[ 'age' ] = '23';
Results
// tag::table-drop[]
// end::table-drop[]
// tag::table-create[]
// end::table-create[]
// tag::table-describe-base[]
CREATE TABLE cycling.birthday_list (
cyclist_name text PRIMARY KEY,
blist map<text, text>
) WITH additional_write_policy = '99PERCENTILE'
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND nodesync = {'enabled': 'true', 'incremental': 'true'}
AND read_repair = 'BLOCKING'
AND speculative_retry = '99PERCENTILE';
// end::table-describe-base[]
// tag::index-drop-blist-keys-idx[]
// end::index-drop-blist-keys-idx[]
// tag::index-drop-blist-values-idx[]
// end::index-drop-blist-values-idx[]
// tag::index-drop-blist-entries-idx[]
// end::index-drop-blist-entries-idx[]
// tag::index-create-blist-keys-idx-v1[]
// end::index-create-blist-keys-idx-v1[]
// tag::index-describe-blist-keys-idx-v1[]
CREATE INDEX blist_keys_idx ON cycling.birthday_list (keys(blist));
// end::index-describe-blist-keys-idx-v1[]
// tag::index-create-blist-values-idx-v1[]
// end::index-create-blist-values-idx-v1[]
// tag::index-describe-blist-values-idx-v1[]
CREATE INDEX blist_values_idx ON cycling.birthday_list (values(blist));
// end::index-describe-blist-values-idx-v1[]
// tag::index-create-blist-entries-idx-v1[]
// end::index-create-blist-entries-idx-v1[]
// tag::index-describe-blist-entries-idx-v1[]
CREATE INDEX blist_entries_idx ON cycling.birthday_list (entries(blist));
// end::index-describe-blist-entries-idx-v1[]
// tag::data-insert-one[]
// end::data-insert-one[]
// tag::data-insert-all[]
// end::data-insert-all[]
// tag::select-age-equals-23-2i[]
cyclist_name | blist
------------------+----------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(2 rows)
// end::select-age-equals-23-2i[]
// tag::select-birthday-equals-27071992-2i[]
cyclist_name | blist
------------------+----------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(2 rows)
// end::select-birthday-equals-27071992-2i[]
// tag::select-nation-equals-NETHERLANDS-2i[]
cyclist_name | blist
---------------+--------------------------------------------------------------
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
(2 rows)
// end::select-nation-equals-NETHERLANDS-2i[]
// tag::select-nation-equals-CANADA-and-age-equals-23-2i[]
cyclist_name | blist
------------------+---------------------------------------------------------
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(1 rows)
// end::select-nation-equals-CANADA-and-age-equals-23-2i[]
// tag::select-key-is-age-2i[]
cyclist_name | blist
------------------+--------------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Claudio VANDELLI | {'age': '54', 'bday': '27/07/1961', 'nation': 'ITALY'}
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
Allan DAVIS | {'age': '35', 'bday': '27/07/1980', 'nation': 'AUSTRALIA'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(6 rows)
// end::select-key-is-age-2i[]
// tag::select-nation-contains-NETHERLANDS-2i[]
cyclist_name | blist
---------------+--------------------------------------------------------------
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
(2 rows)
// end::select-nation-contains-NETHERLANDS-2i[]
// tag::index-create-blist-keys-idx-v3[]
// end::index-create-blist-keys-idx-v3[]
// tag::index-describe-blist-keys-idx-v3[]
CREATE CUSTOM INDEX blist_keys_idx ON cycling.birthday_list (keys(blist)) USING 'StorageAttachedIndex';
// end::index-describe-blist-keys-idx-v3[]
// tag::index-create-blist-values-idx-v3[]
// end::index-create-blist-values-idx-v3[]
// tag::index-describe-blist-values-idx-v3[]
CREATE CUSTOM INDEX blist_values_idx ON cycling.birthday_list (values(blist)) USING 'StorageAttachedIndex';
// end::index-describe-blist-values-idx-v3[]
// tag::index-create-blist-entries-idx-v3[]
// end::index-create-blist-entries-idx-v3[]
// tag::index-describe-blist-entries-idx-v3[]
CREATE CUSTOM INDEX blist_entries_idx ON cycling.birthday_list (entries(blist)) USING 'StorageAttachedIndex';
// end::index-describe-blist-entries-idx-v3[]
// tag::data-insert-one[]
// end::data-insert-one[]
// tag::data-insert-all[]
// end::data-insert-all[]
// tag::select-all[]
cyclist_name | blist
------------------+--------------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Claudio VANDELLI | {'age': '54', 'bday': '27/07/1961', 'nation': 'ITALY'}
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
Allan DAVIS | {'age': '35', 'bday': '27/07/1980', 'nation': 'AUSTRALIA'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(6 rows)
// end::select-all[]
// tag::select-age-equals-23[]
cyclist_name | blist
------------------+----------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(2 rows)
// end::select-age-equals-23[]
// tag::select-birthday-equals-27071992[]
cyclist_name | blist
------------------+----------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(2 rows)
// end::select-birthday-equals-27071992[]
// tag::select-nation-equals-NETHERLANDS[]
cyclist_name | blist
---------------+--------------------------------------------------------------
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
(2 rows)
// end::select-nation-equals-NETHERLANDS[]
// tag::select-nation-equals-CANADA-and-age-equals-23[]
cyclist_name | blist
------------------+---------------------------------------------------------
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(1 rows)
// end::select-nation-equals-CANADA-and-age-equals-23[]
// tag::select-key-is-age[]
cyclist_name | blist
------------------+--------------------------------------------------------------
Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
Claudio VANDELLI | {'age': '54', 'bday': '27/07/1961', 'nation': 'ITALY'}
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
Allan DAVIS | {'age': '35', 'bday': '27/07/1980', 'nation': 'AUSTRALIA'}
Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}
(6 rows)
// end::select-key-is-age[]
// tag::select-nation-contains-NETHERLANDS[]
cyclist_name | blist
---------------+--------------------------------------------------------------
Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
(2 rows)
// end::select-nation-contains-NETHERLANDS[]
// tag::select-json[]
[json]
----------------------------------------------------------------------------------------------------------
{"cyclist_name": "Claudio HEINEN", "blist": {"age": "23", "bday": "27/07/1992", "nation": "GERMANY"}}
{"cyclist_name": "Claudio VANDELLI", "blist": {"age": "54", "bday": "27/07/1961", "nation": "ITALY"}}
{"cyclist_name": "Luc HAGENAARS", "blist": {"age": "28", "bday": "27/07/1987", "nation": "NETHERLANDS"}}
{"cyclist_name": "Toine POELS", "blist": {"age": "52", "bday": "27/07/1963", "nation": "NETHERLANDS"}}
{"cyclist_name": "Allan DAVIS", "blist": {"age": "35", "bday": "27/07/1980", "nation": "AUSTRALIA"}}
{"cyclist_name": "Laurence BOURQUE", "blist": {"age": "23", "bday": "27/07/1992", "nation": "CANADA"}}
(6 rows)
// end::select-json[]
The queries using a map entry where the key is a unique identifier do look slightly different. Study the example below and compare it to the previous example.
-
Create a table called
cycling.cyclist-teams
if it doesn’t already exist.CREATE TABLE IF NOT EXISTS cycling.cyclist_teams ( id uuid PRIMARY KEY, firstname text, lastname text, teams map<int, text> );
-
Create a SAI index on the
teams
map column.CREATE CUSTOM INDEX IF NOT EXISTS team_entries_idx ON cycling.cyclist_teams ( ENTRIES (teams) ) USING 'StorageAttachedIndex';
-
Verify the created index by checking the index schema.
DESCRIBE index cycling.team_entries_idx;
Results
CREATE CUSTOM INDEX team_entries_idx ON cycling.cyclist_teams (entries(teams)) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a key and a value for the map column.
-
SELECT 1
-
AND
Query the table using both a key and a value to get the rows where a map contains a particular entry.
SELECT firstname,lastname,teams FROM cycling.cyclist_teams
WHERE teams[2014] = 'Boels:Dolmans Cycling Team';
Results
firstname | lastname | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Jamie | BENNETT | {2011: 'Team Garmin - Cervelo', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team'}
Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
(2 rows)
Two different matches can be made using AND
and the ENTRIES
index.
SELECT firstname,lastname,teams FROM cycling.cyclist_teams
WHERE teams[2014] = 'Boels:Dolmans Cycling Team'
AND teams[2015] = 'Boels:Dolmans Cycling Team';
Results
firstname | lastname | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
(1 rows)
Frozen collection
Create an index on the full content of a FROZEN
list.
The table in this example stores the number of Pro wins, Grand Tour races, and Classic races that a cyclist has competed in.
The SELECT
statement finds any cyclist who has 39 Pro race wins, 7 Grand Tour starts, and 14 Classic starts.
The FROZEN
keyword is required to index the full content of a frozen collection and applies to any collection type: set, list, or map.
-
Create a table called
race_starts
if it does not yet exist.CREATE TABLE IF NOT EXISTS cycling.race_starts ( cyclist_name text PRIMARY KEY, rnumbers FROZEN<LIST<int>> );
-
Create a SAI index on the
rnumbers
frozen list column.CREATE CUSTOM INDEX IF NOT EXISTS rnumbers_idx ON cycling.race_starts ( FULL(rnumbers) ) USING 'StorageAttachedIndex';
-
Verify the created index by checking the index schema.
DESCRIBE INDEX cycling.rnumbers_idx;
Results
CREATE CUSTOM INDEX rnumbers_idx ON cycling.race_starts (full(rnumbers)) USING 'StorageAttachedIndex';
Now that you have created the table and index, query using a value for the column.
Query the table using the rnumbers
column with a CONTAINS
phrase to get the rows where a frozen list contains a particular value:
SELECT * FROM cycling.race_starts WHERE rnumbers = [39, 7, 14];
Results
cyclist_name | rnumbers
----------------+-------------
John DEGENKOLB | [39, 7, 14]
(1 rows)
Vector search
Vector search is a feature that allows you to search for similar vectors in a table and compare the vector embeddings using a similarity function.
Simple vector ANN example
In its simplest form, a vector index is created on a table with a vector column. That column is then queried with a approximate nearest neighbor (ANN) algorithm to find the most similar vectors to a given query vector.
In this example, a table is created with a vector column called comment_vector
.
An index is created on the comment_vector
column.
The SELECT
query returns the rows that are most similar to the embedding vector in the SELECT
query.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column.CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs(comment_vector) USING 'StorageAttachedIndex';
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_ann_idx;
Results
CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex';
-
Query the table to find the most similar vectors to a given query vector.
SELECT * FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55] LIMIT 3;
Results
id | created_at | comment | comment_vector | commenter | record_id --------------------------------------+---------------------------------+----------------------------------------+---------------------------------------------------+-----------+-------------------------------------- e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John | 3d42f050-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Alex | 3d3d9921-37da-11ef-81ed-f92c3c7170c3 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane | 3d433e71-37da-11ef-81ed-f92c3c7170c3 (3 rows)
Vector search with source model
Vector indexes also allow a defined source_model
option for the CREATE INDEX
command.
Parameter | Description | Default |
---|---|---|
|
Configures the index for optimal performance for your vectors.
Options are: |
|
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with the source model set tobert
.CREATE CUSTOM INDEX comment_source_model_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex' WITH OPTIONS = { 'source_model': 'bert'};
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_source_model_idx;
Results
CREATE CUSTOM INDEX comment_source_model_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex' WITH OPTIONS = {'source_model': 'bert'};
-
Query the table to find the most similar vectors to a given query vector.
SELECT * FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55] LIMIT 3;
Results
id | created_at | comment | comment_vector | commenter | record_id --------------------------------------+---------------------------------+----------------------------------------+---------------------------------------------------+-----------+-------------------------------------- e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John | 3d42f050-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Alex | 3d3d9921-37da-11ef-81ed-f92c3c7170c3 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane | 3d433e71-37da-11ef-81ed-f92c3c7170c3 (3 rows)
Vector search with similarity function
When you specify a |
Vector indexes also allow a defined similarity_function
option for the CREATE INDEX
command.
Parameter | Description | Default |
---|---|---|
|
Specified similarity function for your index.
Options are: |
|
You can calculate the similarity of the best scoring row in a table using a vector search query. For applications where similarity and relevance are crucial, this calculation helps you make informed decisions. This calculation enables algorithms to provide more tailored and accurate results.
When you write the SELECT
query, supply both the vector column and the vector embeddings to compare.
The SELECT
query returns the row that is most similar to the vector in the search query.
This example creates an index on the comment_vector
column with the similarity function set to dot_product
:
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with the similarity function set todot_product
.CREATE CUSTOM INDEX comment_sim_function_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT'};
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_sim_function_idx;
Results
CREATE CUSTOM INDEX comment_sim_function_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex' WITH OPTIONS = {'similarity_function': 'DOT_PRODUCT'};
-
Query the table to find all the vectors.
SELECT comment,comment_vector,commenter FROM cycling.comments_vs;
Results
comment | comment_vector | commenter ------------------------------------------------------+---------------------------------------------------------------------------------+----------- rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Alex Second rest stop was out of water | b'?}p\\xa4?\\x00\\x00\\x00?}p\\xa4=\\xcc\\xcc\\xcd>\\xae\\x14{' | Alex Raining too hard should have postponed | b'>\\xe6ff=\\xb8Q\\xec<#\\xd7\\n>L\\xcc\\xcd=\\xe1G\\xae' | Alex The gift certificate for winning was the best | b'>\\x05\\x1e\\xb8?L\\xcc\\xcd>\\xb333>.\\x14{<\\xf5\\xc2\\x8f' | Amy Last climb was a killer | b'>\\x99\\x99\\x9a?@\\x00\\x00>L\\xcc\\xcd>L\\xcc\\xcd?\\x00\\x00\\x00' | Amy Great snacks at all reststops | b'=\\xcc\\xcc\\xcd>\\xcc\\xcc\\xcd=\\xcc\\xcc\\xcd?\\x05\\x1e\\xb8=\\xb8Q\\xec' | Amy Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy THE RACE WAS FABULOUS! | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy The <B>gift certificate</B> for winning was the best | b'>\\x05\\x1e\\xb8?L\\xcc\\xcd>\\xb333>.\\x14{<\\xf5\\xc2\\x8f' | Amy Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane Boy, was it a drizzle out there! | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Jane (12 rows)
-
Query the table to find the most similar vectors to a given query vector.
SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 3;
Results
comment | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) ----------------------------------------+----------------------------------------------------------------------- Second rest stop was out of water | 0.949701 rain, rain,rain, go away! | 0.789776 LATE RIDERS SHOULD NOT DELAY THE START | 0.789776 (3 rows)
Vector search with text analyzer
Analyzers process the text in a column to enable term matching on strings. Combined with vector-based search algorithms, term matching makes it easier to find relevant information in a table. Analyzers semantically filter the results of a vector search by specific terms.
Analyzers are built on the Lucene Java Analyzer API. SAI uses the Lucene Java Analyzer API to transform text columns into tokens for indexing and querying which can use built-in or custom analyzers.
For example, if you generate a vector embedding from the phrase “tell me about rain comments”, you can use a vector search to get a list of rows with similar vectors. These rows will likely correlate with rain-related strings.
SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
LIMIT 3;
Alternatively, you can filter these search results by a specific keyword, such as rain
:
SELECT comment,comment_vector,commenter FROM cycling.comments_vs
WHERE comment : 'rain';
An analyzed index stores values derived from the raw column values. The stored values are dependent on the analyzer [Configuration options], which include tokenization, filtering, and charFiltering.
Restrictions
|
Non-tokenizing filters
These filters cannot be combined with an index_analyzer
, but can be chained in a pipeline with other non-tokenizing filters.
|
Normalize input using Normalization Form C (NFC) |
|
Transform all inputs to lowercase |
|
same as the ASCIIFoldingFilter from Lucene; converts ASCII characters to their associated UTF-8 values |
This example looks for a comment that contains the word rain
, but uses the word Rain
in the query:
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thecase_sensitive
filter.CREATE CUSTOM INDEX comment_case_insensitive_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'case_sensitive': false};
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_case_insensitive_idx;
Results
CREATE CUSTOM INDEX comment_case_insensitive_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false'};
-
Query the table to find the matches for the word
rain
in thecomment
column.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'Glad you ran the Race in the Rain';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy (1 rows)
Analyzer using a built-in analyzer
There are several built-in analyzers from the Lucene project.
To use a text analyzer, you must configure the index_analyzer
in the CREATE INDEX
statement.
This analyzer determines how a column value is analyzed before indexing occurs.
The analyzer is applied to the query term search as well.
Built-in analyzers
The following built-in analyzers are available:
Name |
Description |
standard |
Filters StandardTokenizer output that divides text into terms on word boundaries and then uses the LowerCaseFilter. |
simple |
Filters LetterTokenizer output that divides text into terms whenever it encounters a character which is not a letter and then uses the LowerCaseFilter. |
whitespace |
Analyzer that uses WhitespaceTokenizer to divide text into terms whenever it encounters any whitespace character. |
stop |
Filters LetterTokenizer output with LowerCaseFilter and removes Lucene’s default English stop words. |
lowercase |
Normalizes input by applying LowerCaseFilter (no additional tokenization is performed). |
keyword |
Analyzer that uses KeywordTokenizer, which is an identity function ("noop") on input values and tokenizes the entire input as a single token. |
<language> |
Analyzers for specific languages. For example, |
To configure the OPTION index_analyzer
, either a single string that identifies a built-in analyzer is used, or if a more complex index analyzer configuration is required, a JSON object is used.
The JSON object configures a single tokenizer
, with zero or more optional filters
and/or charFilters
.
Tokenizers
A tokenizer breaks up a stream of characters into individual tokens (usually individual words) and outputs a stream of tokens. Different tokenizers use different rules to break up the stream of characters. For example, a tokenizer might break text into terms based on word boundaries, converting the sentence "How now, brown cow?" into the terms [How, now, brown, cow].
The built-in index analyzers in the table above use the following tokenizers:
Name | Orientation | Description |
---|---|---|
|
word |
Divides text into terms on word boundaries. Removes most punctuation symbols. |
|
word |
Divides text into terms whenever it encounters a character which is not a letter. |
|
word |
Divides text into terms whenever it encounters a character which is not a letter, and then lowercases all terms. |
|
word |
Divides text into terms whenever it encounters any whitespace character. |
|
Partial word (small fragments of words) |
Divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word.
For example, |
|
Partial word (small fragments of words) |
Divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word which are anchored to the start of the word.
For example, |
|
structured |
Tokenizes the entire input as a single token. |
Token filters
A token filter receives the token stream that the tokenizer creates, and may add, remove, or change tokens. Common token filters are:
Name | Description |
---|---|
|
Converts all tokens to lowercase for case-insensitive search. |
|
Removes common words (stop words) such as |
|
Introduces synonyms such as |
|
Add tokens that span multiple positions, such as multi-word synonyms. This is a graph token filter. |
Character filters (charFilters)
A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. Common character filters are:
Name | Description |
---|---|
|
Strips out HTML elements like |
|
Replaces any occurrences of the specified strings with the specified replacements. |
|
Replaces any characters matching a regular expression with the specified replacement. |
Simple STANDARD analyzer example
This example uses the standard
analyzer to search for the word rain
in the comment
column.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thestandard
analyzer.CREATE CUSTOM INDEX comment_standard_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': 'STANDARD'};
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_standard_idx;
Results
CREATE CUSTOM INDEX comment_standard_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': 'STANDARD'};
-
Query the table to find the matches for the word
rain
in thecomment
column using thestandard
analyzer.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'Rain';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane (3 rows)
This query returns the rows with the word
rain
orRain
in thecomment
column, but not the row withraining
. Thestandard
analyzer splits the text into case-independent terms, but does not perform stemming. It is also case-insensitive so that both uppercase and lowercase are returned.
STANDARD with lowercase analyzer example
This example uses the standard
analyzer to search for the word rain
in the comment
column.
It also includes the filter lowercase
to ensure that the search is case-insensitive.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thestandard
analyzer and thelowercase
filter.CREATE CUSTOM INDEX comment_standard_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : { "name" : "standard", "args" : {} }, "filters" : [ { "name" : "lowercase", "args": {} } ], "charFilters" : [] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_standard_lowercase_idx;
Results
CREATE CUSTOM INDEX comment_standard_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : { "name" : "standard", "args" : {} }, "filters" : [ { "name" : "lowercase", "args": {} } ], "charFilters" : [] }'};
-
Query the table to find the matches for the word
rain
in thecomment
column using thestandard
analyzer and thelowercase
filter.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'Glad you ran the Race in the Rain';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy (1 rows)
This query returns the rows with the words in the phrase selected, regardless of the word case. Stemming is still not performed.
STANDARD analyzer with stemming example
This example uses the standard
analyzer to search for the word rain
in the comment
column.
It also includes the filter porterstem
to ensure that the search can find stemming text, such as rain
and raining
.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thestandard
analyzer and theporterstem
filter.CREATE CUSTOM INDEX comment_stemming_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "porterstem"}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_stemming_idx;
Results
CREATE CUSTOM INDEX comment_stemming_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "porterstem"}] }'};
-
Query the table to find the matches for the word
rain
in thecomment
column using thestandard
analyzer and theporterstem
filter.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'rain';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy (2 rows)
This query returns the rows with the word
rain
orRain
in thecomment
column, but not the row withraining
. You would expect the row withraining
to be returned because theporterstem
filter is used. However, that word in the phrase is actually mixed-case,Raining
, so that row is not returned. -
Query the table to find the matches for the words
rain
andrace
in thecomment
column using thestandard
analyzer and theporterstem
filter.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'rain' AND comment: 'race';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy (1 rows)
Once again, the lack of lowercase filtering means that the row with
Racing
is not returned, but only the row with bothrain
andrace
in the comment in lowercase.
STANDARD analyzer with stemming and case-insensitive example
If another filter is added to the last example, now the stemming and case-insensitive filters are combined.
With this filter, the query returns the rows with the words rain
and Raining
in the comment
column for the search for rain
.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thestandard
analyzer, theporterstem
,lowercase
and case-insensitive filters.CREATE CUSTOM INDEX comment_stemming_lowercase_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "lowercase"}, {"name" : "porterstem"}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_stemming_lowercase_idx;
Results
CREATE CUSTOM INDEX comment_stemming_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "lowercase"}, {"name" : "porterstem"}] }'};
-
Query the table to find the matches for the words
rain
andraining
in thecomment
column using thestandard
analyzer, theporterstem
, andlowercase
filters. To ensure tokens are lowercased, add thelowercase
filter before theporterstem
filter in the analyzer configuration.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'Glad you ran the Race in the Rain';
Results
comment | comment_vector | commenter ----------------------------------------+-------------------------------------------------------------------------+----------- rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John Raining too hard should have postponed | b'>\\xe6ff=\\xb8Q\\xec<#\\xd7\\n>L\\xcc\\xcd=\\xe1G\\xae' | Alex Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane (4 rows)
Four rows! Now we get the results for 'rain',
Rain
, and 'Raining'!
TOKENIZER whitespace
An example that uses the whitespace
tokenizer.
When you use the whitespace
tokenizer, the text is split into terms whenever it encounters any whitespace character, but not other characters like commas.
The query below shows that difference from the standard
tokenizer.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thewhitespace
tokenizer.CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer":{"name" : "whitespace"}} }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_whitespace_idx;
Results
CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer":{"name" : "whitespace"}} }'};
-
Query the table to find the matches for the word
rain
in thecomment
column but notrain, rain,rain
using thewhitespace
tokenizer.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'rain';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy (1 rows)
See how the row with
rain
is returned, but not the row withrain, rain,rain
. Sometimes thewhitespace
tokenizer is useful when you want to split on whitespace, but not on other characters. Otherwise, thestandard
tokenizer is more useful.
WHITESPACE analyzer alternative
Using the whitespace
analyzer is simpler, but it is not configurable.
If you wish to figure and still tokenize on whitespace, you can use the whitespace
tokenizer with the lowercase
filter.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thewhitespace
tokenizer.CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer':'whitespace'};
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_whitespace_idx;
Results
CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': 'whitespace'};
-
Query the table to find the matches for
rain
but notrain, rain,rain
in thecomment
column.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'rain';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy (1 rows)
Same response as the last
whitespace
tokenizer example.
WHITESPACE analyzer with lowercase
This example uses the whitespace
tokenizer with a lowercase
filter to search for the word rain
or Rain
between whitespaces in the comment
column.
The lowercase
filter ensures that the search is case-insensitive.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thewhitespace
tokenizer andlowercase
.CREATE CUSTOM INDEX comment_whitespace_lowercase_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer":{"name" : "whitespace"}, "filters":[{"name" : "lowercase"}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_whitespace_lowercase_idx;
Results
CREATE CUSTOM INDEX comment_whitespace_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer":{"name" : "whitespace"}, "filters":[{"name" : "lowercase"}] }'};
-
Query the table to find the matches for
rain
orRain
in between whitespaces in thecomment
column.
SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'rain';
+ .Results
Details
comment | comment_vector | commenter
-----------------------------------+-------------------------------------------------------------------------+-----------
Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy
Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane
(2 rows)
This query returns the rows with the word rain
or Rain
in the comment
column, but not the row with raining
.
Two rows are returned for this query
WHITESPACE analyzer with stemming
You can also use the whitespace
tokenizer with stemming.
This example is case-sensitive, since the lowercase
filter is not included.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thewhitespace
tokenizer and stemming.CREATE CUSTOM INDEX comment_whitespace_stemming_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer":{"name" : "whitespace"}, "filters":[{"name" : "porterstem"}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_whitespace_stemming_idx;
Results
CREATE CUSTOM INDEX comment_whitespace_stemming_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer":{"name" : "whitespace"}, "filters":[{"name" : "porterstem"}] }'};
-
Query the table to find the matches for
Rain
and any stemming words likeRaining
in thecomment
column.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'Rain';
Results
comment | comment_vector | commenter ----------------------------------------+-----------------------------------------------------------+----------- Raining too hard should have postponed | b'>\\xe6ff=\\xb8Q\\xec<#\\xd7\\n>L\\xcc\\xcd=\\xe1G\\xae' | Alex Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane (2 rows)
This query returns the rows with the word
Rain
orRaining
in thecomment
column, but not the rows withrain
orrain, rain,rain
.
N-GRAM with lowercase
This example uses the n-gram
tokenizer to search for the letters ra
in the comment
column.
The n-gram
tokenizer divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word.
For example ra
in rain
would be [ra, ai, in]
.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thengram
tokenizer andlowercase
filter.CREATE CUSTOM INDEX comment_ngram_lowercase_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}}, "filters" : [{"name" : "lowercase"}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_ngram_lowercase_idx;
Results
CREATE CUSTOM INDEX comment_ngram_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}}, "filters" : [{"name" : "lowercase"}] }'};
-
Query the table to find the matches for the letters
ra
in thecomment
column using then-gram
tokenizer. This query will return all 6 rows, since all of them contain the lettersra
orRa
.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'ra';
Results
comment | comment_vector | commenter ----------------------------------------+-------------------------------------------------------------------------+----------- rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John Raining too hard should have postponed | b'>\\xe6ff=\\xb8Q\\xec<#\\xd7\\n>L\\xcc\\xcd=\\xe1G\\xae' | Alex Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy THE RACE WAS FABULOUS! | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane (5 rows)
All the rows are returned. It is an interesting example of how the
n-gram
tokenizer works. For your curiosity, what would you expect the results to be for the queryini
? If you guessed one row, you are correct!
STANDARD analyzer with pattern
This example uses the standard
analyzer to search for a pattern in the text.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column with thewhitespace
tokenizer.CREATE CUSTOM INDEX comment_pattern_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : {"name" : "simplepattern", "args" : {"pattern":"[0123456789]{3}"}} }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_pattern_idx;
Results
CREATE CUSTOM INDEX comment_pattern_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : {"name" : "simplepattern", "args" : {"pattern":"[0123456789]{3}"}} }'};
-
Query the table to find the matches ??
SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment: '335';
Results
comment | comment_vector | commenter ------------------+----------------+----------- fd-786-335-552-x | null | Amy fd-786-335-514-x | null | Amy fd-333-335-514-x | null | Amy (3 rows)
STANDARD analyzer with keyword htmlstrip
This example uses the standard
analyzer to search for a keyword in the text and will strip out HTML elements to match.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column.CREATE CUSTOM INDEX comment_keyword_htmlstrip_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : {"name" : "keyword"}, "charFilters" : [{"name" : "htmlstrip"}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_keyword_htmlstrip_idx;
Results
CREATE CUSTOM INDEX comment_keyword_htmlstrip_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : {"name" : "keyword"}, "charFilters" : [{"name" : "htmlstrip"}] }'};
-
Query the table to do the HTML strip.
SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'The gift certificate for winning was the best';
Results
comment | comment_vector | commenter ------------------------------------------------------+-----------------------------------------------------------------+----------- The gift certificate for winning was the best | b'>\\x05\\x1e\\xb8?L\\xcc\\xcd>\\xb333>.\\x14{<\\xf5\\xc2\\x8f' | Amy The <B>gift certificate</B> for winning was the best | b'>\\x05\\x1e\\xb8?L\\xcc\\xcd>\\xb333>.\\x14{<\\xf5\\xc2\\x8f' | Amy (2 rows)
See how this query returns two rows, one with the phrase
gift certificate
and one with the same phrase wrapped in bold HTML tags. If you are analyzing text that might contain HTML, this is a useful filter.
LANGUAGE analyzer with lowercase
This example uses the czech
analyzer to search for the Czech word pánové
with accents.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column.CREATE CUSTOM INDEX comment_language_lowercase_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "lowercase"}, {"name" : "czechstem"}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_language_lowercase_idx;
Results
CREATE CUSTOM INDEX comment_language_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "lowercase"}, {"name" : "czechstem"}] }'};
Two rows are returned, one with the word
pánové
and one withPánové
. The sentence without the word is not returned. -
Query the table for word matches in Czech. The query is case-insensitive, so the results are the same for
pánové
andPánové
.SELECT comment,comment_vector,commenter FROM cycling.comments_vs WHERE comment : 'pánové';
Results
comment | comment_vector | commenter ---------------------+----------------+----------- pánové, to je dobré | null | Michal Pánové, to je dobré | null | Martina (2 rows)
Note that the only row not returned does not include the word
pánové
.
KEYWORD TOKENIZER with synonym and lowercase
This example uses the keyword
tokenizer with the synonym
and lowercase
filters to search for the words drizzle
and rain
as synonyms in the comment
column.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column.CREATE CUSTOM INDEX comment_synonym_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer": {"name" : "whitespace"}, "filters" : [ {"name" : "lowercase"}, {"name" : "synonym", "args" : {"synonyms" : "rain, drizzle => rain"}}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_synonym_lowercase_idx;
Results
CREATE CUSTOM INDEX comment_synonym_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer": {"name" : "whitespace"}, "filters" : [ {"name" : "lowercase"}, {"name" : "synonym", "args" : {"synonyms" : "rain, drizzle => rain"}}] }'};
-
Query the table to match synonyms.
SELECT comment, comment_vector, commenter FROM cycling.comments_vs WHERE comment : 'drizzle'; SELECT comment, comment_vector, commenter FROM cycling.comments_vs WHERE comment : 'rain';
Results
comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane Boy, was it a drizzle out there! | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Jane (3 rows) comment | comment_vector | commenter -----------------------------------+-------------------------------------------------------------------------+----------- Glad you ran the race in the rain | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Amy Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane Boy, was it a drizzle out there! | b'>\\x99\\x99\\x9a>\\xae\\x14{>L\\xcc\\xcd?G\\xae\\x14>\\x80\\x00\\x00' | Jane (3 rows)
In these queries,
rain
anddrizzle
are synonyms, so the query returns all rows with either word.
KEYWORD TOKENIZER with synonym and keyword
This example uses the keyword
tokenizer with the synonyms
of Alex, Alexander, and alexandria, and the keyword
filter to search for the words alex
in the comment
column.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column.CREATE CUSTOM INDEX commenter_synonym_keyword_idx ON cycling.comments_vs(commenter) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': '{ "tokenizer": {"name" : "keyword"}, "filters" : [ {"name" : "synonym", "args" : {"synonyms" : "Alex, alex, Alexander, alexander => Alex"}}] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.commenter_synonym_keyword_idx;
Results
CREATE CUSTOM INDEX commenter_synonym_keyword_idx ON cycling.comments_vs (commenter) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer": {"name" : "keyword"}, "filters" : [ {"name" : "synonym", "args" : {"synonyms" : "Alex, alex, Alexander, alexander => Alex"}}] }'};
-
Query the table to match synonyms.
SELECT * FROM cycling.comments_vs WHERE commenter : 'Alex'; SELECT * FROM cycling.comments_vs WHERE commenter : 'alex'; SELECT * FROM cycling.comments_vs WHERE commenter : 'Alexander'; SELECT * FROM cycling.comments_vs WHERE commenter : 'alexander';
Results
id | created_at | comment | comment_vector | commenter | record_id --------------------------------------+---------------------------------+----------------------------------------+-----------------------------------------------------------------+-----------+-------------------------------------- e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Alex | 3d3d9921-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | b'?}p\\xa4?\\x00\\x00\\x00?}p\\xa4=\\xcc\\xcc\\xcd>\\xae\\x14{' | Alex | 3d3c3991-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-02-14 20:43:20.000000+0000 | Raining too hard should have postponed | b'>\\xe6ff=\\xb8Q\\xec<#\\xd7\\n>L\\xcc\\xcd=\\xe1G\\xae' | Alex | 3d390540-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-e900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT BE ALLOWED | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | alexander | 84f658b1-37da-11ef-81ed-f92c3c7170c3 (4 rows) id | created_at | comment | comment_vector | commenter | record_id --------------------------------------+---------------------------------+-----------------------+-----------------------------------------------------------------+-----------+-------------------------------------- e7ae5cf3-d358-4d99-d900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Rest stops were great | b'?}p\\xa4?\\x00\\x00\\x00?}p\\xa4=\\xcc\\xcc\\xcd>\\xae\\x14{' | alex | 84f5e381-37da-11ef-81ed-f92c3c7170c3 (1 rows) id | created_at | comment | comment_vector | commenter | record_id --------------------------------------+---------------------------------+----------------------+-----------------------------------------------------------+-----------+-------------------------------------- e7ae5cf3-d358-4d99-c900-85902fda9bb0 | 2017-02-14 20:43:20.000000+0000 | Too easy to postpone | b'>\\xe6ff=\\xb8Q\\xec<#\\xd7\\n>L\\xcc\\xcd=\\xe1G\\xae' | Alexander | 84f34b70-37da-11ef-81ed-f92c3c7170c3 (1 rows) id | created_at | comment | comment_vector | commenter | record_id --------------------------------------+---------------------------------+----------------------------------------+-----------------------------------------------------------------+-----------+-------------------------------------- e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Alex | 3d3d9921-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | b'?}p\\xa4?\\x00\\x00\\x00?}p\\xa4=\\xcc\\xcc\\xcd>\\xae\\x14{' | Alex | 3d3c3991-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-02-14 20:43:20.000000+0000 | Raining too hard should have postponed | b'>\\xe6ff=\\xb8Q\\xec<#\\xd7\\n>L\\xcc\\xcd=\\xe1G\\xae' | Alex | 3d390540-37da-11ef-81ed-f92c3c7170c3 e7ae5cf3-d358-4d99-e900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT BE ALLOWED | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | alexander | 84f658b1-37da-11ef-81ed-f92c3c7170c3 (4 rows)
ANALYZER with mapping charFilter and lowercase
This example uses the mapping
charFilter to mapping happy and sad to emojis in the comment
column.
-
Create a table with a vector column if it does not already exist.
CREATE TABLE IF NOT EXISTS cycling.comments_vs ( record_id timeuuid, id uuid, commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
-
Create an index on the
comment_vector
column.CREATE CUSTOM INDEX comment_mapping_lowercase_idx ON cycling.comments_vs(comment) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer' : '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "lowercase"}], "charFilters" : [ { "name" :"mapping", "args": {"mapping" : "\":)\" => \"_happy_\", \":(\" => \"_sad_\"" } }] }' };
-
Verify that the index was created.
DESCRIBE INDEX cycling.comment_mapping_lowercase_idx;
Results
CREATE CUSTOM INDEX comment_mapping_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "lowercase"}], "charFilters" : [ { "name" :"mapping", "args": {"mapping" : "\":)\" => \"_happy_\", \":(\" => \"_sad_\"" } }] }'};
-
Query the table to see if the mapping works. This query looks for the equivalent of
sad
in emojis.SELECT comment FROM cycling.comments_vs WHERE comment : '_sad_';
Results
comment ------------------------------------------- LATE RIDERS SHOULD NOT DELAY THE START :( (1 rows)
-
Query the table to see if the searching the emoji directly works.
SELECT comment FROM cycling.comments_vs WHERE comment : ':(';
Results
comment ------------------------------------------- LATE RIDERS SHOULD NOT DELAY THE START :( (1 rows)
See also:
-
CREATE INDEX for more information about creating indexes.