SAI examples

SAI indexing can be used to index and query both primary key columns and non-primary key columns of almost any type. Find the type of index you want to create and follow the examples to create and query the index.

Each example starts with a table creation, followed by index creation, so that you can see the table schema and the index schema. Then sample queries for using the index are presented in a tabbed codeblock, to make it easier to examine queries side-by-side.

You must create a keyspace before you can create tables and indexes. To follow along with these examples, create a keyspace named cycling.

CREATE KEYSPACE IF NOT EXISTS cycling
WITH REPLICATION = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 1
};

If you’re new to SAI, see the SAI quickstart.

Primary key columns

Make an index on primary key columns, either composite partition key columns or clustering columns.

Single primary key column

This index creation will fail, because the table has a single PRIMARY KEY column. Such indexes are not allowed. The following example demonstrates the failure.

In the cycling keyspace, create a table called alt_stats:

CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats (
  id UUID PRIMARY KEY,
  lastname text,
  birthday date,
  nationality text,
  weight float,
  w_units text,
  height float,
  first_race date,
  last_race date
);

Create a SAI index on the id column, which is a primary key:

CREATE CUSTOM INDEX id_idx ON cycling.cyclist_alt_stats (id)
  USING 'StorageAttachedIndex';

Index creation fails because the id column is the only primary key and the only partition key. Such indexes are not allowed.

Result

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot create secondary index on the only partition key column id"

Composite partition key columns in primary key

A composite partition key uses two or more columns to define the partition key in the PRIMARY KEY definition. Normally, you would need to specify all the columns in the partition key to query the table with a WHERE clause.

However, an SAI index makes it possible to define an index using a single column in the table’s composite partition key. You can create an SAI index on each column in the composite partition key, if you need to query based on just one column at a time. SAI indexes also allow you to query tables without using the inefficient ALLOW FILTERING directive. The ALLOW FILTERING directive requires scanning all the partitions in a table, leading to poor performance.

To filter a query using a column that is not the full partition key, you must create a SAI index on the column that you wish to use for the query.

Create a table called cycling.rank_by_year_and_name with a composite partition key and a clustering column if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name (
  race_year int,
  race_name text,
  cyclist_name text,
  rank int,
  PRIMARY KEY ((race_year, race_name), rank)
);

Create two SAI indexes on the composite partition key columns in the PRIMARY KEY definition. For this example, these columns are race_year and race_name.

CREATE CUSTOM INDEX IF NOT EXISTS race_year_idx
  ON cycling.rank_by_year_and_name (race_year) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX IF NOT EXISTS race_name_idx
  ON cycling.rank_by_year_and_name (race_name) USING 'StorageAttachedIndex';

Use the DESCRIBE INDEX command to get information about the new indexes:

DESCRIBE INDEX cycling.race_year_idx;
DESCRIBE INDEX cycling.race_name_idx;

Result

CREATE CUSTOM INDEX race_year_idx ON cycling.rank_by_year_and_name (race_year) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX race_name_idx ON cycling.rank_by_year_and_name (race_name) USING 'StorageAttachedIndex';

Now that you have created the table and indexes, try some SELECT queries:

Example: SELECT 1

This example shows a query with only one of the partition key columns.

Query the table using only the race_year`column. A similar query using only the `race_name column can also be performed.

SELECT * FROM cycling.rank_by_year_and_name
  WHERE race_year = 2014;

This example shows the advantage of SAI over 2i: SAI indexes do not require ALLOW FILTERING when querying on a column that is not the full partition key. In contrast, with 2i indexes, you must use ALLOW FILTERING for such queries.

Result

 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 |                        4th Tour of Beijing |    1 |    Phillippe GILBERT
      2014 |                        4th Tour of Beijing |    2 |        Daniel MARTIN
      2014 |                        4th Tour of Beijing |    3 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |        Daniel MARTIN
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |      Benjamin PRADES

(6 rows)

Example: SELECT 2 - AND

This example shows a query with both of the partition key columns using AND.

Query the table using both the race_year and race_name columns:

SELECT * FROM cycling.rank_by_year_and_name
  WHERE race_year = 2014
  AND race_name  = 'Tour of Japan - Stage 4 - Minami > Shinshu';

Result

 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |        Daniel MARTIN
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |      Benjamin PRADES

(3 rows)

Clustering column in primary key

A clustering column sorts the rows within a partition. Clustering columns are part of the PRIMARY KEY definition. To filter a query using just a clustering column, you must create a 2i index on the column that you wish to use for the query.

If you wish to use a clustering column in a query, it is best to consider restricting a query to a single partition. Using a partition key column plus a clustering column will have better performance.

Create a table called cycling.rank_by_year_and_name with a composite partition key and a clustering column if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name (
  race_year int,
  race_name text,
  cyclist_name text,
  rank int,
  PRIMARY KEY ((race_year, race_name), rank)
);

Create a SAI index on the rank clustering column.

CREATE CUSTOM INDEX IF NOT EXISTS rank_idx
ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';

Verify the created indexes by checking the schema with the DESCRIBE INDEX CQL command.

DESCRIBE INDEX cycling.rank_idx;

Result

CREATE CUSTOM INDEX rank_idx ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';

Now that you have created the table and index, try some queries:

Example: SELECT 1 - rank

This example shows a query with only the clustering column.

Query the table using only the rank column:

SELECT * FROM cycling.rank_by_year_and_name
  WHERE rank = 3;

Result

 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 |                        4th Tour of Beijing |    3 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |      Benjamin PRADES
      2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |         Thomas LEBAS

(3 rows)

Example: SELECT 2 - AND

This example shows a query with the clustering column and the partition key column.

Query the table using the race_year and rank columns:

SELECT * FROM cycling.rank_by_year_and_name
  WHERE race_year = 2014 AND rank = 3
ALLOW FILTERING;

Result

 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 |                        4th Tour of Beijing |    3 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |      Benjamin PRADES

(2 rows)

Non-primary key columns

Indexes can be created on non-primary key columns of almost any type.

Text column

Text columns can be indexed and searched both for equality and inequality.

Create a table called cycling.alt_stats with a non-primary key TEXT column called nationality if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats (
  id UUID PRIMARY KEY,
  lastname text,
  birthday date,
  nationality text,
  weight float,
  w_units text,
  height float,
  first_race date,
  last_race date
);

Create a SAI index on the nationality column, a non-primary key TEXT column.

CREATE CUSTOM INDEX nationality_idx ON cycling.cyclist_alt_stats (nationality)
  USING 'StorageAttachedIndex';

Verify the created indexes by checking the index schema:

DESCRIBE INDEX cycling.nationality_idx;

Result

CREATE CUSTOM INDEX nationality_idx ON cycling.cyclist_alt_stats (nationality) USING 'StorageAttachedIndex';

Query the table using the nationality column.

SELECT first_race, last_race, birthday FROM cycling.cyclist_alt_stats
  WHERE nationality = 'France';

Result

 first_race | last_race  | birthday
------------+------------+------------
 2006-03-15 | 2017-04-16 | 1990-05-27

(1 rows)

Date column

Create a table called cycling.alt_stats with DATE column called birthday if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats (
  id UUID PRIMARY KEY,
  lastname text,
  birthday date,
  nationality text,
  weight float,
  w_units text,
  height float,
  first_race date,
  last_race date
);

Create a SAI index on the birthday column, a non-primary key TEXT column.

CREATE CUSTOM INDEX birthday_idx ON cycling.cyclist_alt_stats (birthday)
  USING 'StorageAttachedIndex';

Verify the created indexes by checking the index:

DESCRIBE INDEX cycling.birthday_idx;

Result

CREATE CUSTOM INDEX birthday_idx ON cycling.cyclist_alt_stats (birthday) USING 'StorageAttachedIndex';

Query the table using the birthday column.

SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
  WHERE birthday = '1981-03-29';

Result

 lastname  | birthday   | nationality
-----------+------------+-------------
 VEIKKANEN | 1981-03-29 |     Finland

(1 rows)

Now that you have created indexes on the columns birthday and nationality, try these additional queries that demonstrate common use cases for SAI indexes, including multiple matches with logical operators and inequality ranges:

Example: SELECT 1 - AND

This example shows a query with both the birthday and nationality columns.

Query the table using both a nationality and a birthday:

SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
  WHERE birthday = '1991-08-25' AND nationality = 'Ethiopia'

Result

 lastname | birthday   | nationality
----------+------------+-------------
    GRMAY | 1991-08-25 |    Ethiopia

(1 rows)

Example: SELECT 2 - OR

The OR operator in WHERE clauses for SAI queries isn’t supported for DSE 6.8.

Example: SELECT 3 - IN

This example shows how to query with a list of nationality values.

Query the table using nationality with IN:

SELECT * FROM cycling.cyclist_alt_stats
  WHERE nationality IN ('Finland', 'France');

Result

 id                                   | birthday   | first_race | height | last_race  | lastname  | nationality | w_units | weight
--------------------------------------+------------+------------+--------+------------+-----------+-------------+---------+--------
 c09e9451-50da-483d-8108-e6bea2e827b3 | 1981-03-29 | 1996-05-21 |   1.78 | 2012-10-02 | VEIKKANEN |     Finland |      kg |     66
 4ceb495c-55ab-4f71-83b9-81117252bb13 | 1990-05-27 | 2006-03-15 |   null | 2017-04-16 |     DUVAL |      France |    null |   null

(2 rows)

Example: SELECT 4 - RANGE

Inequalities are a useful feature of SAI indexes. This example shows how to query with an inequality range of birthday values.

Query the table using a birthday range:

SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
  WHERE birthday >'1981-03-29';

Result

 lastname  | birthday   | nationality
-----------+------------+-------------
     BRUTT | 1982-01-29 |      Russia
  ISAYCHEV | 1986-04-21 |      Russia
 TSATEVICH | 1989-07-05 |      Russia
  PAULINHO | 1990-05-27 |    Portugal
     GRMAY | 1991-08-25 |    Ethiopia
    BELKOV | 1985-01-09 |      Russia
     DUVAL | 1990-05-27 |      France

(7 rows)

Integer column

An integer column can be indexed similarly to a text column:

Create a table called cycling.rank_by_year_and_name with INT column called rank if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name (
  race_year int,
  race_name text,
  cyclist_name text,
  rank int,
  PRIMARY KEY ((race_year, race_name), rank)
);

Create a SAI index on the rank column, a non-primary key INT column.

CREATE CUSTOM INDEX IF NOT EXISTS rank_idx
ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';

Verify the created indexes by checking the index schema.

DESCRIBE INDEX cycling.rank_idx;

Result

CREATE CUSTOM INDEX rank_idx ON cycling.rank_by_year_and_name (rank) USING 'StorageAttachedIndex';

Now that you have created the table and index, query using a value for the column:

Example: SELECT 1 - equality

Query the table using the rank column in an equality:

SELECT * FROM cycling.rank_by_year_and_name
  WHERE rank = 3;

Result

 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 |                        4th Tour of Beijing |    3 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |      Benjamin PRADES
      2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |         Thomas LEBAS

(3 rows)

Example: SELECT 2 - inequality range

Query the table using the rank column in an inequality range:

SELECT * FROM cycling.rank_by_year_and_name
  WHERE rank < 3;

Result

 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 |                        4th Tour of Beijing |    1 |    Phillippe GILBERT
      2014 |                        4th Tour of Beijing |    2 |        Daniel MARTIN
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |        Daniel MARTIN
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 | Johan Esteban CHAVES
      2015 |   Giro d'Italia - Stage 11 - Forli > Imola |    1 |        Ilnur ZAKARIN
      2015 |   Giro d'Italia - Stage 11 - Forli > Imola |    2 |      Carlos BETANCUR
      2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |      Benjamin PRADES
      2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 |          Adam PHELAN

(8 rows)

Float column

Float columns can be indexed and searched:

Create a table called cycling.cyclist_alt_stats with FLOAT column called weight if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats (
  id UUID PRIMARY KEY,
  lastname text,
  birthday date,
  nationality text,
  weight float,
  w_units text,
  height float,
  first_race date,
  last_race date
);

Create a SAI index on the weight column, a non-primary key FLOAT column.

CREATE CUSTOM INDEX weight_idx ON cycling.cyclist_alt_stats (weight)
  USING 'StorageAttachedIndex';

Verify the created indexes by checking the table schema with the DESCRIBE TABLE CQL command.

DESCRIBE INDEX cycling.weight_idx;

Result

CREATE CUSTOM INDEX weight_idx ON cycling.cyclist_alt_stats (weight) USING 'StorageAttachedIndex';

Now that you have created the table and index, query using a value for the column:

Example: SELECT 1 - equality

This example shows that a query with the weight column can be performed using an equality.

Query the table using the weight column in an equality:

SELECT lastname, nationality, weight FROM cycling.cyclist_alt_stats
 WHERE weight = 66;

Result

 lastname  | nationality | weight
-----------+-------------+--------
 VEIKKANEN |     Finland |     66

(1 rows)

Example: SELECT 2 - inequality range

This example shows that a query with the weight column can be performed using an inequality range.

Query the table using the weight column in an inequality range:

SELECT lastname, nationality, weight FROM cycling.cyclist_alt_stats
 WHERE weight > 70;

Result

 lastname | nationality | weight
----------+-------------+--------
 ISAYCHEV |      Russia |     80
   BELKOV |      Russia |     71

(2 rows)

Timestamp column

Timestamp columns can be indexed and searched:

Create a table called cycling.comments with TIMESTAMP column called created_at if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.comments (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create a SAI index on the comments column, a non-primary key TIMESTAMP column.

CREATE CUSTOM INDEX created_at_idx ON cycling.comments (created_at)
  USING 'StorageAttachedIndex';

Verify the created indexes by checking the index schema.

DESCRIBE INDEX cycling.created_at_idx;

Result

CREATE CUSTOM INDEX created_at_idx ON cycling.comments (created_at) USING 'StorageAttachedIndex';

Now that you have created the table and index, query using a value for the column:

Example: SELECT 1 - equality

This example shows that a query with the created_at column can be performed using an equality.

Query the table using the created_at column in an equality:

SELECT commenter, comment, created_at FROM cycling.comments
  WHERE created_at = '2024-06-07 03:13:40.268';

Result

 commenter | comment              | created_at
-----------+----------------------+---------------------------------
       Amy | Ride in any weather! | 2024-06-07 03:13:40.268000+0000

(1 rows)

Example: SELECT 2 - inequality range

This example shows that a query with the created_at column can be performed using an inequality range.

Query the table using the created_at column in an inequality range:

SELECT commenter, comment, created_at FROM cycling.comments
  WHERE created_at < '2024-06-07 03:13:40.268';

Result

 commenter | comment                                | created_at
-----------+----------------------------------------+---------------------------------
      Alex | Raining too hard should have postponed | 2023-04-01 11:21:59.001000+0000

(1 rows)

Tuple column

Tuples columns can be indexed and searched, but only the full tuple can be indexed:

Create a table called cycling.nation_rank with TUPLE column called info if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.nation_rank (
  nation text PRIMARY KEY,
  info tuple<int, text, int>
);

Create a SAI index on the info column, a non-primary key TUPLE column.

CREATE CUSTOM INDEX info_idx ON cycling.nation_rank (info) USING 'StorageAttachedIndex';

Verify the created indexes by checking the table schema with the DESCRIBE TABLE CQL command.

DESCRIBE INDEX cycling.info_idx;

Result

CREATE CUSTOM INDEX info_idx ON cycling.nation_rank (info) USING 'StorageAttachedIndex';

Query the table using the info column in an equality.

Tuples can only be filtered using the full tuple value.

SELECT * FROM cycling.nation_rank
  WHERE info = (3, 'Phillippe GILBERT', 6222);

Result

 nation  | info
---------+--------------------------------
 Belgium | (3, 'Phillippe GILBERT', 6222)

(1 rows)

User-defined type (UDT) column

SAI can index either a user-defined type (UDT) or a list of UDTs. This example shows how to index a list of UDTs:

Create a UDT called cycling.race if it doesn’t already exist.

CREATE TYPE IF NOT EXISTS cycling.race (
  race_title text,
  race_date timestamp,
  race_time text
);

Create a table called cycling.cyclist_races with a list of UDTs column called races if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_races (
  id UUID PRIMARY KEY,
  lastname text,
  firstname text,
  races list<FROZEN <race>>
);

Create a SAI index on the races column, a non-primary key list of UDTs column.

CREATE CUSTOM INDEX IF NOT EXISTS races_idx
  ON cycling.cyclist_races (races) USING 'StorageAttachedIndex';

Verify the created indexes by checking the table schema with the DESCRIBE TABLE CQL command.

DESCRIBE INDEX cycling.races_idx;

Result

CREATE CUSTOM INDEX races_idx ON cycling.cyclist_races (values(races)) USING 'StorageAttachedIndex';

Query with CONTAINS from the list races column:

SELECT * FROM cycling.cyclist_races
  WHERE races CONTAINS {
    race_title:'Rabobank 7-Dorpenomloop Aalburg',
    race_date:'2015-05-09',
    race_time:'02:58:33'};

Result

 id                                   | firstname | lastname | races
--------------------------------------+-----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS | [race(race_title='Rabobank 7-Dorpenomloop Aalburg', race_date=datetime.datetime(2015, 5, 9, 0, 0), race_time='02:58:33'), race(race_title='Ronde van Gelderland', race_date=datetime.datetime(2015, 4, 19, 0, 0), race_time='03:22:23')]

(1 rows)

Non-primary key collection column

Collections (map, list, and set) can be indexed and queried to find a collection containing a particular value.

A collection is appropriate if the data for collection storage is limited. If the data has unbounded growth potential, like messages sent or sensor events registered every second, do not use collections. Instead, use a table with a compound primary key where data is stored in the clustering columns.

Using these features with SAI indexes was introduced in DSE 6.8.3. Also, starting in DSE 6.8.5, you have the option of creating multiple SAI collection indexes with different map types on the same column.

Because maps use key-value pairs, there is a slight difference between indexing for maps compared to sets and lists:

Sets and lists can index all values found by indexing the collection column.
Maps can index a map key, map value, or map entry. Multiple indexes can be created on the same map column in a table so that map keys, values, and entries can be queried.

Frozen collections can be indexed using FULL to index the full content of a frozen collection.

When querying with SAI indexes on collections, the CONTAINS clause is supported for the following:

SAI collection maps with keys, values, and entries
SAI collections with list and set types

Set

Create an index on a set to find all the cyclists that have been on a particular team:

Create a table called cycling.cyclist-career-teams if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_career_teams (
  id UUID PRIMARY KEY,
  lastname text,
  teams set<text>
);

Create a SAI index on the teams column.

CREATE CUSTOM INDEX IF NOT EXISTS teams_idx ON cycling.cyclist_career_teams (teams) USING 'StorageAttachedIndex';

Verify the created indexes by checking the index schema.

DESCRIBE INDEX cycling.teams_idx;

Result

CREATE CUSTOM INDEX teams_idx ON cycling.cyclist_career_teams (values(teams)) USING 'StorageAttachedIndex';

Query the table using the teams column with a CONTAINS phrase to get the rows where a set contains a particular value.

SELECT lastname, teams FROM cycling.cyclist_career_teams
  WHERE teams CONTAINS 'Rabobank-Liv Giant';

Result

 lastname | teams
----------+----------------------------------------------------------------------------------------------------
    BRAND | {'AA Drink - Leontien.nl', 'Leontien.nl', 'Rabobank-Liv Giant', 'Rabobank-Liv Woman Cycling Team'}

(1 rows)

List

Create an index on a list to find all the cyclists that have a particular sponsor:

Create a table called cycling.race_sponsors if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.race_sponsors (
  race_year int,
  race_name text,
  sponsorship list<text>,
  PRIMARY KEY (race_year, race_name)
);

Create a SAI index on the sponsorship list column.

CREATE CUSTOM INDEX sponsorship_idx ON cycling.race_sponsors (sponsorship)
  USING 'StorageAttachedIndex';

Verify the created index by checking the index schema.

DESCRIBE INDEX cycling.sponsorship_idx;

Result

CREATE CUSTOM INDEX sponsorship_idx ON cycling.race_sponsors (values(sponsorship)) USING 'StorageAttachedIndex';

Query the table using the sponsorship column with a CONTAINS phrase to get the rows where a list contains a particular value.

SELECT * FROM cycling.race_sponsors WHERE sponsorship CONTAINS 'Carrefour';

Result

 race_year | race_name      | sponsorship
-----------+----------------+-------------------------------------------------
      2018 | Tour de France | ['LCL', 'Carrefour', 'Skoda', 'Vittel', 'Krys']

(1 rows)

A more complex query can select rows that contain either a particular event or a particular month date:

SELECT * FROM cycling.upcoming_calendar
  WHERE events CONTAINS 'Criterium du Dauphine'
    OR month = 7;

Result

 year | month | events
------+-------+----------------------------------------------------------------------------
 2015 |     6 | ['Tour de France', 'Criterium du Dauphine', 'Vuelta Ciclista a Venezuela']
 2015 |     7 |                                                         ['Tour de France']

(2 rows)

Map - keys

For map collections, create an index on the map key, map value, or map entry.

Create an index on a map key to find all cyclist/team combinations for a particular year:

Create a table called cycling.cyclist_teams if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_teams (
  id uuid PRIMARY KEY,
  firstname text,
  lastname text,
  teams map<int, text>
);

Create a SAI index on the teams map column.

CREATE CUSTOM INDEX IF NOT EXISTS team_keys_idx
  ON cycling.cyclist_teams ( KEYS (teams) ) USING 'StorageAttachedIndex';

Verify the created index by checking the index schema.

DESCRIBE INDEX cycling.team_keys_idx;

Result

CREATE CUSTOM INDEX team_keys_idx ON cycling.cyclist_teams (keys(teams)) USING 'StorageAttachedIndex';

Query the table using the teams column with a CONTAINS KEY phrase to get the rows where a map contains a particular key.

SELECT firstname,lastname,teams FROM cycling.cyclist_teams
  WHERE teams CONTAINS KEY 2015;

Result

 firstname | lastname   | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
  Marianne |        VOS |                                                                                                                                   {2015: 'Rabobank-Liv Woman Cycling Team'}

(2 rows)

Not all map columns are designed for useful searches on map keys. In some cases, indexing on map values or map entries produces more useful results. The following example demonstrates how to query with a map key index, but the map key index used in this example doesn’t produce useful results.

Create a table called cycling.birthday if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.birthday_list (
  cyclist_name text PRIMARY KEY,
  blist map<text, text>
);

Create a SAI index on the blist map column.

CREATE CUSTOM INDEX IF NOT EXISTS blist_keys_idx
  ON cycling.birthday_list ( KEYS(blist) )
  USING 'StorageAttachedIndex';

Verify the created index by checking the index schema.

DESCRIBE INDEX cycling.blist_keys_idx;

Result

CREATE CUSTOM INDEX blist_keys_idx ON cycling.birthday_list (keys(blist)) USING 'StorageAttachedIndex';

Query the table using the blist column with a CONTAINS KEY phrase to get the rows where a map contains a particular key.

SELECT * FROM cycling.birthday_list WHERE blist CONTAINS KEY 'age';

Result

 cyclist_name     | blist
------------------+--------------------------------------------------------------
   Claudio HEINEN |     {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
 Claudio VANDELLI |       {'age': '54', 'bday': '27/07/1961', 'nation': 'ITALY'}
    Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
      Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}
      Allan DAVIS |   {'age': '35', 'bday': '27/07/1980', 'nation': 'AUSTRALIA'}
 Laurence BOURQUE |      {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}

(6 rows)

In this case, the map key is not a useful index because the map key is not a unique identifier; it has the same value for all rows.

Map - values

For map collections, create an index on the map key, map value, or map entry.

Create an index on the map values and find cyclists whose teams column has a particular value found in the specified map:

Create a table called cycling.cyclist_teams if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_teams (
  id uuid PRIMARY KEY,
  firstname text,
  lastname text,
  teams map<int, text>
);

Create a SAI index on the teams map column.

CREATE CUSTOM INDEX IF NOT EXISTS team_values_idx
  ON cycling.cyclist_teams ( VALUES (teams) ) USING 'StorageAttachedIndex';

Verify the created index by checking the table schema.

DESCRIBE INDEX cycling.team_values_idx;

Result

CREATE CUSTOM INDEX team_values_idx ON cycling.cyclist_teams (values(teams)) USING 'StorageAttachedIndex';

Now that you have created the table and index, query using a value for the map column.

SELECT firstname,lastname,teams FROM cycling.cyclist_teams
  WHERE teams CONTAINS 'Team Garmin - Cervelo';

Result

 firstname | lastname   | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Jamie |    BENNETT |                                                                     {2011: 'Team Garmin - Cervelo', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team'}
 Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}

(2 rows)

Because the values of a map column are often unique, querying on a map value can be more useful than querying on map keys. This is demonstrated in the following example.

Create a table called cycling.birthday if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.birthday_list (
  cyclist_name text PRIMARY KEY,
  blist map<text, text>
);

Create a SAI index on the blist map column.

CREATE CUSTOM INDEX IF NOT EXISTS blist_values_idx
  ON cycling.birthday_list ( VALUES(blist) )
  USING 'StorageAttachedIndex';

Verify the created index by checking the table schema.

DESCRIBE INDEX cycling.blist_values_idx;

Result

CREATE CUSTOM INDEX blist_values_idx ON cycling.birthday_list (values(blist)) USING 'StorageAttachedIndex';

Now that you have created the table and index, query using a value for the map column.

SELECT * FROM cycling.birthday_list WHERE blist CONTAINS 'NETHERLANDS';

Result

 cyclist_name  | blist
---------------+--------------------------------------------------------------
 Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
   Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}

(2 rows)

Map - entries

For map collections, create an index on the map key, map value, or map entry.

Create an index on the map entries and find cyclists who are the same age. An index using ENTRIES is only valid for maps.

Create a table called cycling.birthday if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.birthday_list (
  cyclist_name text PRIMARY KEY,
  blist map<text, text>
);

Create a SAI index on the blist map column.

CREATE CUSTOM INDEX IF NOT EXISTS blist_entries_idx
  ON cycling.birthday_list ( ENTRIES(blist) )
  USING 'StorageAttachedIndex';

Verify the created index by checking the index schema.

DESCRIBE INDEX cycling.blist_entries_idx;

Result

CREATE CUSTOM INDEX blist_entries_idx ON cycling.birthday_list (entries(blist)) USING 'StorageAttachedIndex';

Now that you have created the table and index, query using a key and a value for the map column:

Example: SELECT 1

Query the table using both a key and a value to get the rows where a map contains a particular entry:

SELECT * FROM cycling.birthday_list WHERE blist[ 'age' ] = '23';

Result

 cyclist_name     | blist
------------------+----------------------------------------------------------
   Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
 Laurence BOURQUE |  {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}

(2 rows)

Example: SELECT 2

Query the table using both a key and a value to get the rows where a map contains cyclists from the same country in the entry for each:

SELECT * FROM cycling.birthday_list WHERE blist[ 'nation' ] = 'NETHERLANDS';

Result

 cyclist_name  | blist
---------------+--------------------------------------------------------------
 Luc HAGENAARS | {'age': '28', 'bday': '27/07/1987', 'nation': 'NETHERLANDS'}
   Toine POELS | {'age': '52', 'bday': '27/07/1963', 'nation': 'NETHERLANDS'}

(2 rows)

Example: DATE

Any number of different queries can be made using the ENTRIES index, such as selecting a matching entry with a DATE data type:

SELECT * FROM cycling.birthday_list WHERE blist [ 'bday' ] = '27/07/1992';

Result

 cyclist_name     | blist
------------------+----------------------------------------------------------
   Claudio HEINEN | {'age': '23', 'bday': '27/07/1992', 'nation': 'GERMANY'}
 Laurence BOURQUE |  {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}

(2 rows)

Example: AND

Two different matches can be made using AND and the ENTRIES index:

SELECT * FROM cycling.birthday_list
  WHERE blist[ 'nation' ] = 'CANADA' AND blist[ 'age' ] = '23';

Result

 cyclist_name     | blist
------------------+---------------------------------------------------------
 Laurence BOURQUE | {'age': '23', 'bday': '27/07/1992', 'nation': 'CANADA'}

(1 rows)

Queries on a map entry where the key is a unique identifier are structured differently, as shown in the following example:

Create a table called cycling.cyclist-teams if it doesn’t already exist.

CREATE TABLE IF NOT EXISTS cycling.cyclist_teams (
  id uuid PRIMARY KEY,
  firstname text,
  lastname text,
  teams map<int, text>
);

Create a SAI index on the teams map column.

CREATE CUSTOM INDEX IF NOT EXISTS team_entries_idx
  ON cycling.cyclist_teams ( ENTRIES (teams) ) USING 'StorageAttachedIndex';

Verify the created index by checking the index schema.

DESCRIBE INDEX cycling.team_entries_idx;

Result

CREATE CUSTOM INDEX team_entries_idx ON cycling.cyclist_teams (entries(teams)) USING 'StorageAttachedIndex';

Now that you have created the table and index, query using a key and a value for the map column:

Example: SELECT 1

Query the table using both a key and a value to get the rows where a map contains a particular entry:

SELECT firstname,lastname,teams FROM cycling.cyclist_teams
  WHERE teams[2014] = 'Boels:Dolmans Cycling Team';

Result

 firstname | lastname   | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Jamie |    BENNETT |                                                                     {2011: 'Team Garmin - Cervelo', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team'}
 Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}

(2 rows)

Example: AND

Different matches can be made using AND and the ENTRIES index:

SELECT firstname,lastname,teams FROM cycling.cyclist_teams
  WHERE teams[2014] = 'Boels:Dolmans Cycling Team'
    AND teams[2015] = 'Boels:Dolmans Cycling Team';

Result

 firstname | lastname   | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}

(1 rows)

Frozen collection

Create an index on the full content of a FROZEN list. The table in this example stores the number of Pro wins, Grand Tour races, and Classic races that a cyclist has competed in. The SELECT statement finds any cyclist who has 39 Pro race wins, 7 Grand Tour starts, and 14 Classic starts.

The FROZEN keyword is required to index the full content of a frozen collection and applies to any collection type (set, list, or map).

Create a table called race_starts if it does not yet exist.

CREATE TABLE IF NOT EXISTS cycling.race_starts (
  cyclist_name text PRIMARY KEY,
  rnumbers FROZEN<LIST<int>>
);

Create a SAI index on the rnumbers frozen list column.

CREATE CUSTOM INDEX IF NOT EXISTS rnumbers_idx ON cycling.race_starts ( FULL(rnumbers) )
USING 'StorageAttachedIndex';

Verify the created index by checking the index schema.

DESCRIBE INDEX cycling.rnumbers_idx;

Result

CREATE CUSTOM INDEX rnumbers_idx ON cycling.race_starts (full(rnumbers)) USING 'StorageAttachedIndex';

Query the table using the rnumbers column with a CONTAINS phrase to get the rows where a frozen list contains a particular value:

SELECT * FROM cycling.race_starts WHERE rnumbers = [39, 7, 14];

Result

 cyclist_name   | rnumbers
----------------+-------------
 John DEGENKOLB | [39, 7, 14]

(1 rows)

Vector search

Vector search is a feature that allows you to search for similar vectors in a table and compare the vector embeddings using a similarity function.

Simple vector ANN example

In its simplest form, a vector index is created on a table with a vector column. That column is then queried with a approximate nearest neighbor (ANN) algorithm to find the most similar vectors to a given query vector.

In this example, a table is created with a vector column called comment_vector. An index is created on the comment_vector column. The SELECT query returns the rows that are most similar to the embedding vector in the SELECT query.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column.

CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs(comment_vector)
  USING 'StorageAttachedIndex';

Verify that the index was created.

DESCRIBE INDEX cycling.comment_ann_idx;

Result

CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex';

Query the table to find the most similar vectors to a given query vector.

SELECT * FROM cycling.comments_vs
  ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
  LIMIT 3;

Result

 id                                   | created_at                      | comment                                | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
 e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |              rain, rain,rain, go away! | [0.9, 0.54, 0.12, 0.1, 0.95] |      John | 6711e6c0-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 670c4170-2f6a-11ef-bd2f-836fa334e187
 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |                    Rain like a monsoon | [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane | 6712d120-2f6a-11ef-bd2f-836fa334e187

(3 rows)

Vector search with source model

Vector indexes also allow a defined source_model option for the CREATE INDEX command. This option configures the index for optimal performance for your vectors. Supported values include ada002, bert, cohere-v3, gecko, nv-qa-4, openai_v3_large, openai_v3_small, and other.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the source model set to bert.

CREATE CUSTOM INDEX comment_source_model_idx
  ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex'
    WITH OPTIONS = { 'source_model': 'bert'};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_source_model_idx;

Result

CREATE CUSTOM INDEX comment_source_model_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex' WITH OPTIONS = {'source_model': 'bert'};

Query the table to find the most similar vectors to a given query vector.

SELECT * FROM cycling.comments_vs
  ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
  LIMIT 3;

Result

 id                                   | created_at                      | comment                                | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
 e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |              rain, rain,rain, go away! | [0.9, 0.54, 0.12, 0.1, 0.95] |      John | 6711e6c0-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 670c4170-2f6a-11ef-bd2f-836fa334e187
 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |                    Rain like a monsoon | [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane | 6712d120-2f6a-11ef-bd2f-836fa334e187

(3 rows)

Vector search with similarity function

When you specify a source_model in the CREATE INDEX command, you don’t need to define a similarity_function. The database will apply the correct one automatically.

Similarity search (or vector search) finds rows that are most similar to a query by comparing stored vectors with the query vector. To do this, it calculates similarity scores using a specified similarity metric. For applications where similarity and relevance are crucial, this calculation helps you make informed decisions, and it enables algorithms to provide more tailored and accurate results.

To specify the similarity metric to use for these calculations, use the similarity_function option when creating a vector index with the CREATE INDEX command. Supported values include cosine (default), dot_product, and euclidean.

When you write the SELECT query, supply both the vector column and the vector embeddings to compare. The SELECT query returns the row that is most similar to the vector in the search query.

This example creates an index on the comment_vector column with the similarity function set to dot_product:

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the similarity function set to dot_product.

CREATE CUSTOM INDEX comment_sim_function_idx
  ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex'
    WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT'};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_sim_function_idx;

Result

CREATE CUSTOM INDEX comment_sim_function_idx ON cycling.comments_vs (comment_vector) USING 'StorageAttachedIndex' WITH OPTIONS = {'similarity_function': 'DOT_PRODUCT'};

Query the table to find all the vectors.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs;

Result

 comment                                              | comment_vector                | commenter
------------------------------------------------------+-------------------------------+-----------
                            rain, rain,rain, go away! |  [0.9, 0.54, 0.12, 0.1, 0.95] |      John
               LATE RIDERS SHOULD NOT DELAY THE START |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex
                    Second rest stop was out of water |  [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex
               Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex
        The gift certificate for winning was the best | [0.13, 0.8, 0.35, 0.17, 0.03] |       Amy
                              Last climb was a killer |    [0.3, 0.75, 0.2, 0.2, 0.5] |       Amy
                        Great snacks at all reststops |   [0.1, 0.4, 0.1, 0.52, 0.09] |       Amy
                    Glad you ran the race in the rain |  [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
                               THE RACE WAS FABULOUS! |  [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
 The <B>gift certificate</B> for winning was the best | [0.13, 0.8, 0.35, 0.17, 0.03] |       Amy
                                  Rain like a monsoon |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane
                     Boy, was it a drizzle out there! |  [0.3, 0.34, 0.2, 0.78, 0.25] |      Jane

(12 rows)

Query the table to find the most similar vectors to a given query vector.

SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
    FROM cycling.comments_vs
    ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
    LIMIT 3;

Result

 comment                                | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
----------------------------------------+-----------------------------------------------------------------------
      Second rest stop was out of water |                                                              0.949701
 LATE RIDERS SHOULD NOT DELAY THE START |                                                              0.789776
                    Rain like a monsoon |                                                              0.789776

(3 rows)

Vector search with text analyzer

Analyzers process the text in a column to enable term matching on strings. Combined with vector-based search algorithms, term matching makes it easier to find relevant information in a table. Analyzers semantically filter the results of a vector search by specific terms.

Analyzers are built on the Apache Lucene™ Java Analyzer API. SAI uses the Lucene Java Analyzer API to transform text columns into tokens for indexing and querying which can use built-in or custom analyzers.

For example, if you generate a vector embedding from the phrase "tell me about rain comments", you can use a vector search to get a list of rows with similar vectors. These rows will likely correlate with rain-related strings.

SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
    FROM cycling.comments_vs
    ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
    LIMIT 3;

Alternatively, you can filter these search results by a specific keyword, such as rain:

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'rain';

An analyzed index stores values derived from the raw column values. The stored values are dependent on the analyzer configuration options, which include tokenization, filtering, and charFiltering.

Restrictions

Only SAI indexes support the : operator.
The analyzed column cannot be part of the primary key, including the partition key and clustering columns.
The : operator can be used with only SELECT statements.
The : operator cannot be used with light-weight transactions (LWTs), such as a condition for an IF clause.

Non-tokenizing filters

These filters cannot be combined with an index_analyzer, but can be chained in a pipeline with other non-tokenizing filters.

normalize

Normalize input using Normalization Form C (NFC)

case_sensitive

Transform all inputs to lowercase

ascii

same as the ASCIIFoldingFilter from Lucene; converts ASCII characters to their associated UTF-8 values

This example looks for a comment that contains the word rain, but uses the word Rain in the query:

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the case_sensitive filter.

CREATE CUSTOM INDEX comment_case_insensitive_idx ON cycling.comments_vs (comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = { 'case_sensitive': false};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_case_insensitive_idx;

Result

CREATE CUSTOM INDEX comment_case_insensitive_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false'};

Query the table to find the matches for the word rain in the comment column.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'Glad you ran the Race in the Rain';

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy

(1 rows)

Analyzer using a built-in analyzer

There are several built-in analyzers from the Lucene project. To use a text analyzer, you must configure the index_analyzer in the CREATE INDEX statement. This analyzer determines how a column value is analyzed before indexing occurs. The analyzer is applied to the query term search as well.

Built-in analyzers

The following built-in analyzers are available:

Name

Description

standard

Filters StandardTokenizer output that divides text into terms on word boundaries and then uses the LowerCaseFilter.

simple

Filters LetterTokenizer output that divides text into terms whenever it encounters a character which is not a letter and then uses the LowerCaseFilter.

whitespace

Analyzer that uses WhitespaceTokenizer to divide text into terms whenever it encounters any whitespace character.

stop

Filters LetterTokenizer output with LowerCaseFilter and removes the default English stop words for Lucene.

lowercase

Normalizes input by applying LowerCaseFilter (no additional tokenization is performed).

keyword

Analyzer that uses KeywordTokenizer, which is an identity function ("noop") on input values and tokenizes the entire input as a single token.

Analyzers for specific languages. For example, english and french. Options are: ARABIC, ARMENIAN, BASQUE, BENGALI, BRAZILIAN, BULGARIAN, CATALAN, CJK, CZECH, DANISH, DUTCH, ENGLISH, ESTONIAN, FINNISH, FRENCH, GALICIAN, GERMAN, GREEK, HINDI, HUNGARIAN, INDONESIAN, IRISH, ITALIAN, LATVIAN, LITHUANIAN, NORWEGIAN, PERSIAN, PORTUGUESE, ROMANIAN, RUSSIAN, SORANI, SPANISH, SWEDISH

To configure the OPTION index_analyzer, either a single string that identifies a built-in analyzer is used, or if a more complex index analyzer configuration is required, a JSON object is used. The JSON object configures a single tokenizer, with zero or more optional filters and/or charFilters.

Tokenizers

A tokenizer breaks up a stream of characters into individual tokens (usually individual words) and outputs a stream of tokens. Different tokenizers use different rules to break up the stream of characters. For example, a tokenizer might break text into terms based on word boundaries, converting the sentence "How now, brown cow?" into the terms [How, now, brown, cow].

The built-in index analyzers in the table above use the following tokenizers:

Name Orientation Description

Name	Orientation	Description
`standard`	word	Divides text into terms on word boundaries. Removes most punctuation symbols.
`letter`	word	Divides text into terms whenever it encounters a character which is not a letter.
`lowercase`	word	Divides text into terms whenever it encounters a character which is not a letter, and then lowercases all terms.
`whitespace`	word	Divides text into terms whenever it encounters any whitespace character.
`n-gram`	Partial word (small fragments of words)	Divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word. For example, `cow → [co, ow]`.
`edge_n-gram`	Partial word (small fragments of words)	Divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word which are anchored to the start of the word. For example, `cow → [c, co, cow]`.
`keyword`	structured	Tokenizes the entire input as a single token.

standard

word

Divides text into terms on word boundaries. Removes most punctuation symbols.

letter

word

Divides text into terms whenever it encounters a character which is not a letter.

lowercase

word

Divides text into terms whenever it encounters a character which is not a letter, and then lowercases all terms.

whitespace

word

Divides text into terms whenever it encounters any whitespace character.

n-gram

Partial word (small fragments of words)

Divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word. For example, cow → [co, ow].

edge_n-gram

Partial word (small fragments of words)

Divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word which are anchored to the start of the word. For example, cow → [c, co, cow].

keyword

structured

Tokenizes the entire input as a single token.

Token filters

A token filter receives the token stream that the tokenizer creates, and may add, remove, or change tokens. Common token filters are:

Name Description

Name	Description
`lowercase`	Converts all tokens to lowercase for case-insensitive search.
`stop`	Removes common words (stop words) such as `the` and `and` from the token stream.
`synonym`	Introduces synonyms such as `bad`, `awful` and `terrible` into the token stream.
`synonym_graph`	Add tokens that span multiple positions, such as multi-word synonyms. This is a graph token filter.

lowercase

Converts all tokens to lowercase for case-insensitive search.

stop

Removes common words (stop words) such as the and and from the token stream.

synonym

Introduces synonyms such as bad, awful and terrible into the token stream.

synonym_graph

Add tokens that span multiple positions, such as multi-word synonyms. This is a graph token filter.

Character filters (charFilters)

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. Common character filters are:

Name Description

Name	Description
`html_strip`	Strips out HTML elements like `<b>` and decodes HTML entities like `&`.
`mapping`	Replaces any occurrences of the specified strings with the specified replacements.
`pattern_replace`	Replaces any characters matching a regular expression with the specified replacement.

html_strip

Strips out HTML elements like <b> and decodes HTML entities like &.

mapping

Replaces any occurrences of the specified strings with the specified replacements.

pattern_replace

Replaces any characters matching a regular expression with the specified replacement.

Simple STANDARD analyzer example

This example uses the standard analyzer to search for the word rain in the comment column.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the standard analyzer.

CREATE CUSTOM INDEX comment_standard_idx ON cycling.comments_vs (comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {'index_analyzer': 'STANDARD'};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_standard_idx;

Result

CREATE CUSTOM INDEX comment_standard_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': 'STANDARD'};

Query the table to find the matches for the word rain in the comment column using the standard analyzer.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'Rain';

This query returns the rows with the word rain or Rain in the comment column, but not the row with raining. The standard analyzer splits the text into case-independent terms, but does not perform stemming. It is also case-insensitive so that both uppercase and lowercase are returned.

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
               Rain like a monsoon | [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane
  Boy, was it a drizzle out there! | [0.3, 0.34, 0.2, 0.78, 0.25] |      Jane

(3 rows)

STANDARD with lowercase analyzer example

This example uses the standard analyzer to search for the word rain in the comment column. It also includes the filter lowercase to ensure that the search is case-insensitive.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the standard analyzer and the lowercase filter.

CREATE CUSTOM INDEX comment_standard_lowercase_idx ON cycling.comments_vs (comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer" : { "name" : "standard", "args" : {} },
  "filters" : [ { "name" : "lowercase", "args": {} } ],
  "charFilters" : []
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_standard_lowercase_idx;

Result

CREATE CUSTOM INDEX comment_standard_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer" : { "name" : "standard", "args" : {} },
  "filters" : [ { "name" : "lowercase", "args": {} } ],
  "charFilters" : []
  }'};

Query the table to find the matches for the word rain in the comment column using the standard analyzer and the lowercase filter.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'Glad you ran the Race in the Rain';

This query returns the rows with the words in the phrase selected, regardless of the word case. Stemming is not performed.

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy

(1 rows)

STANDARD analyzer with stemming example

This example uses the standard analyzer to search for the word rain in the comment column. It also includes the filter porterstem to ensure that the search can find stemming text, such as rain and raining.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the standard analyzer and the porterstem filter.

CREATE CUSTOM INDEX comment_stemming_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "porterstem"}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_stemming_idx;

Result

CREATE CUSTOM INDEX comment_stemming_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "porterstem"}]
  }'};

Query the table to find the matches for the word rain in the comment column using the standard analyzer and the porterstem filter.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'rain';

This query returns the rows with the word rain or Rain in the comment column, but not the row with raining. You would expect the row with raining to be returned because the porterstem filter is used. However, that word in the phrase is actually mixed-case, Raining, so that row is not returned.

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
         rain, rain,rain, go away! | [0.9, 0.54, 0.12, 0.1, 0.95] |      John
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy

(2 rows)

Query the table to find the matches for the words rain and race in the comment column using the standard analyzer and the porterstem filter.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'rain' AND comment: 'race';

As in the previous example, the lack of lowercase filtering means that the row with Racing is not returned, but only the row with both rain and race in the comment in lowercase:

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy

(1 rows)

STANDARD analyzer with stemming and case-insensitive example

If another filter is added to the last example, now the stemming and case-insensitive filters are combined. With this filter, the query returns the rows with the words rain and Raining in the comment column for the search for rain.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the standard analyzer, the porterstem, lowercase and case-insensitive filters.

CREATE CUSTOM INDEX comment_stemming_lowercase_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "lowercase"}, {"name" : "porterstem"}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_stemming_lowercase_idx;

Result

CREATE CUSTOM INDEX comment_stemming_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "lowercase"}, {"name" : "porterstem"}]
  }'};

Query the table to find the matches for the words rain and raining in the comment column using the standard analyzer, the porterstem, and lowercase filters.

To ensure tokens are lowercased, add the lowercase filter before the porterstem filter in the analyzer configuration.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'Glad you ran the Race in the Rain';

This configuration returns rows matching 'rain', Rain, and 'Raining':

Result

 comment                                | comment_vector                | commenter
----------------------------------------+-------------------------------+-----------
              rain, rain,rain, go away! |  [0.9, 0.54, 0.12, 0.1, 0.95] |      John
 Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex
      Glad you ran the race in the rain |  [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
                    Rain like a monsoon |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane

(4 rows)

TOKENIZER whitespace

An example that uses the whitespace tokenizer. When you use the whitespace tokenizer, the text is split into terms whenever it encounters any whitespace character, but not other characters like commas. The query below shows that difference from the standard tokenizer.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the whitespace tokenizer.

CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer":{"name" : "whitespace"}}
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_whitespace_idx;

Result

CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer":{"name" : "whitespace"}}
  }'};

Query the table to find the matches for the word rain in the comment column but not rain, rain,rain using the whitespace tokenizer.
```
SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'rain';
```
This configuration returns the row with rain, but not the row with rain, rain,rain. The whitespace tokenizer can be useful when you want to split on whitespace but not on other characters. Otherwise, the standard tokenizer is more useful.
Result
```
 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy

(1 rows)
```

WHITESPACE analyzer alternative

Using the whitespace analyzer is simpler, but it is not configurable. If you wish to figure and still tokenize on whitespace, you can use the whitespace tokenizer with the lowercase filter.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the whitespace tokenizer.

CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {'index_analyzer':'whitespace'};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_whitespace_idx;

Result

CREATE CUSTOM INDEX comment_whitespace_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': 'whitespace'};

Query the table to find the matches for rain but not rain, rain,rain in the comment column.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'rain';

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy

(1 rows)

WHITESPACE analyzer with lowercase

This example uses the whitespace tokenizer with a lowercase filter to search for the word rain or Rain between whitespaces in the comment column. The lowercase filter ensures that the search is case-insensitive.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the whitespace tokenizer and lowercase.

CREATE CUSTOM INDEX comment_whitespace_lowercase_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer":{"name" : "whitespace"},
  "filters":[{"name" : "lowercase"}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_whitespace_lowercase_idx;

Result

CREATE CUSTOM INDEX comment_whitespace_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer":{"name" : "whitespace"},
  "filters":[{"name" : "lowercase"}]
  }'};

Query the table to find the matches for rain or Rain in between whitespaces in the comment column.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'rain';

This query returns the rows with the word rain or Rain in the comment column but not the row with raining:

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
               Rain like a monsoon | [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane

(2 rows)

WHITESPACE analyzer with stemming

You can also use the whitespace tokenizer with stemming. This example is case-sensitive, since the lowercase filter is not included.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the whitespace tokenizer and stemming.

CREATE CUSTOM INDEX comment_whitespace_stemming_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer":{"name" : "whitespace"},
  "filters":[{"name" : "porterstem"}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_whitespace_stemming_idx;

Result

CREATE CUSTOM INDEX comment_whitespace_stemming_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer":{"name" : "whitespace"},
  "filters":[{"name" : "porterstem"}]
  }'};

Query the table to find the matches for Rain and any stemming words like Raining in the comment column.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'Rain';

This query returns the rows with the word Rain or Raining in the comment column but not the rows with rain or rain, rain,rain.

Result

 comment                                | comment_vector                | commenter
----------------------------------------+-------------------------------+-----------
 Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex
                    Rain like a monsoon |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane

(2 rows)

N-GRAM with lowercase

This example uses the n-gram tokenizer to search for the letters ra in the comment column. The n-gram tokenizer divides text into terms whenever it encounters any whitespace character, then returns n-grams of each word. For example ra in rain would be [ra, ai, in].

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the ngram tokenizer and lowercase filter.

CREATE CUSTOM INDEX comment_ngram_lowercase_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
  "filters" : [{"name" : "lowercase"}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_ngram_lowercase_idx;

Result

CREATE CUSTOM INDEX comment_ngram_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
  "filters" : [{"name" : "lowercase"}]
  }'};

Query the table to find the matches for the letters ra in the comment column using the n-gram tokenizer. This query will return all 6 rows, since all of them contain the letters ra or Ra.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'ra';

All the rows are returned. It is an interesting example of how the n-gram tokenizer works. Try the query with ini to see how the result changes.

Result

 comment                                | comment_vector                | commenter
----------------------------------------+-------------------------------+-----------
              rain, rain,rain, go away! |  [0.9, 0.54, 0.12, 0.1, 0.95] |      John
 Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex
      Glad you ran the race in the rain |  [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
                 THE RACE WAS FABULOUS! |  [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
                    Rain like a monsoon |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane

(5 rows)

STANDARD analyzer with pattern

This example uses the standard analyzer to search for a pattern in the text:

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column with the whitespace tokenizer.

CREATE CUSTOM INDEX comment_pattern_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer" : {"name" : "simplepattern", "args" : {"pattern":"[0123456789]{3}"}}
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_pattern_idx;

Result

CREATE CUSTOM INDEX comment_pattern_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer" : {"name" : "simplepattern", "args" : {"pattern":"[0123456789]{3}"}}
  }'};

Query the table:

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment: '335';

Result

 comment          | comment_vector | commenter
------------------+----------------+-----------
 fd-786-335-552-x |           null |       Amy
 fd-786-335-514-x |           null |       Amy
 fd-333-335-514-x |           null |       Amy

(3 rows)

STANDARD analyzer with keyword htmlstrip

This example uses the standard analyzer to search for a keyword in the text and will strip out HTML elements to match:

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column.

CREATE CUSTOM INDEX comment_keyword_htmlstrip_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer" : {"name" : "keyword"},
  "charFilters" : [{"name" : "htmlstrip"}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_keyword_htmlstrip_idx;

Result

CREATE CUSTOM INDEX comment_keyword_htmlstrip_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer" : {"name" : "keyword"},
  "charFilters" : [{"name" : "htmlstrip"}]
  }'};

Query the table to do the HTML strip.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'The gift certificate for winning was the best';

This query returns two rows, one with the phrase gift certificate and one with the same phrase wrapped in bold HTML tags. If you are analyzing text that might contain HTML, this is a useful filter.

Result

 comment                                              | comment_vector                | commenter
------------------------------------------------------+-------------------------------+-----------
        The gift certificate for winning was the best | [0.13, 0.8, 0.35, 0.17, 0.03] |       Amy
 The <B>gift certificate</B> for winning was the best | [0.13, 0.8, 0.35, 0.17, 0.03] |       Amy

(2 rows)

LANGUAGE analyzer with lowercase

This example uses the czech analyzer to search for the Czech word pánové with accents:

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column.

CREATE CUSTOM INDEX comment_language_lowercase_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer" : {"name" : "standard"},
  "filters" : [{"name" : "lowercase"}, {"name" : "czechstem"}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_language_lowercase_idx;

Two rows are returned, one with the word pánové and one with Pánové. The sentence without the word is not returned.

Result

CREATE CUSTOM INDEX comment_language_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer" : {"name" : "standard"},
  "filters" : [{"name" : "lowercase"}, {"name" : "czechstem"}]
  }'};

Query the table for word matches in Czech. The query is case-insensitive, so the results are the same for pánové and Pánové.

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'pánové';

All rows are returned except the one row without the word pánové:

Result

 comment             | comment_vector | commenter
---------------------+----------------+-----------
 pánové, to je dobré |           null |    Michal
 Pánové, to je dobré |           null |   Martina

(2 rows)

KEYWORD TOKENIZER with synonym and lowercase

This example uses the keyword tokenizer with the synonym and lowercase filters to search for the words drizzle and rain as synonyms in the comment column:

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column.

CREATE CUSTOM INDEX comment_synonym_lowercase_idx ON cycling.comments_vs (comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer": {"name" : "whitespace"},
  "filters" : [ {"name" : "lowercase"},
  {"name" : "synonym", "args" : {"synonyms" : "rain, drizzle => rain"}}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_synonym_lowercase_idx;

Result

CREATE CUSTOM INDEX comment_synonym_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer": {"name" : "whitespace"},
  "filters" : [ {"name" : "lowercase"},
  {"name" : "synonym", "args" : {"synonyms" : "rain, drizzle => rain"}}]
  }'};

Query the table to match synonyms.

SELECT comment, comment_vector, commenter FROM cycling.comments_vs
  WHERE comment : 'drizzle';

SELECT comment,comment_vector,commenter FROM cycling.comments_vs
  WHERE comment : 'rain';

In these queries, rain and drizzle are synonyms, so the query returns all rows with either word:

Result

 comment                           | comment_vector               | commenter
-----------------------------------+------------------------------+-----------
 Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy
               Rain like a monsoon | [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane
  Boy, was it a drizzle out there! | [0.3, 0.34, 0.2, 0.78, 0.25] |      Jane

(3 rows)

KEYWORD TOKENIZER with synonym and keyword

This example uses the keyword tokenizer with the synonyms of Alex, Alexander, and alexandria, and the keyword filter to search for the words alex in the comment column:

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column.

CREATE CUSTOM INDEX commenter_synonym_keyword_idx ON cycling.comments_vs(commenter)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer': '{
  "tokenizer": {"name" : "keyword"},
  "filters" : [
  {"name" : "synonym", "args" : {"synonyms" : "Alex, alex, Alexander, alexander => Alex"}}]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.commenter_synonym_keyword_idx;

Result

CREATE CUSTOM INDEX commenter_synonym_keyword_idx ON cycling.comments_vs (commenter) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer": {"name" : "keyword"},
  "filters" : [
  {"name" : "synonym", "args" : {"synonyms" : "Alex, alex, Alexander, alexander => Alex"}}]
  }'};

Query the table to match synonyms.

SELECT * FROM cycling.comments_vs WHERE commenter : 'alexander';

Result

 id                                   | created_at                      | comment                                | comment_vector                | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+-------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 700ebed0-4084-11ef-893f-0fd142cea825
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 |      Second rest stop was out of water |  [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex | 700d8650-4084-11ef-893f-0fd142cea825
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-02-14 20:43:20.000000+0000 | Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex | 6fe2f3e0-4084-11ef-893f-0fd142cea825
 e7ae5cf3-d358-4d99-e900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |      LATE RIDERS SHOULD NOT BE ALLOWED |  [0.9, 0.54, 0.12, 0.1, 0.95] | alexander | c812c9f0-4084-11ef-893f-0fd142cea825

(4 rows)

ANALYZER with mapping charFilter and lowercase

This example uses the mapping charFilter to mapping happy and sad to emojis in the comment column.

Create a table with a vector column if it does not already exist.

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Create an index on the comment_vector column.

CREATE CUSTOM INDEX comment_mapping_lowercase_idx ON cycling.comments_vs(comment)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {
  'index_analyzer' : '{
  "tokenizer" : {"name" : "standard"},
  "filters" : [{"name" : "lowercase"}],
  "charFilters" : [ { "name" :"mapping", "args": {"mapping" : "\":)\" => \"_happy_\", \":(\" => \"_sad_\"" } }]
  }'
};

Verify that the index was created.

DESCRIBE INDEX cycling.comment_mapping_lowercase_idx;

Result

CREATE CUSTOM INDEX comment_mapping_lowercase_idx ON cycling.comments_vs (comment) USING 'StorageAttachedIndex' WITH OPTIONS = {'index_analyzer': '{
  "tokenizer" : {"name" : "standard"},
  "filters" : [{"name" : "lowercase"}],
  "charFilters" : [ { "name" :"mapping", "args": {"mapping" : "\":)\" => \"_happy_\", \":(\" => \"_sad_\"" } }]
  }'};

Query the table to see if the mapping works. This query looks for the equivalent of sad in emojis.

SELECT comment FROM cycling.comments_vs WHERE comment : '_sad_';

Result

 comment
-------------------------------------------
 LATE RIDERS SHOULD NOT DELAY THE START :(

(1 rows)

Query the table to see if the searching the emoji directly works.

SELECT comment FROM cycling.comments_vs WHERE comment : ':(';

Result

 comment
-------------------------------------------
 LATE RIDERS SHOULD NOT DELAY THE START :(

(1 rows)

SAI examples

Was this helpful?