Query with SAI

Query operators

SAI supports the following query operators for tables with SAI indexes:

  • Numerics: =, <, >, , >=, AND, IN, OR (for vector databases)

  • Strings: =, AND, IN

  • Collections: = CONTAINS, CONTAINS KEY

SAI does not supports the following query operators for tables with SAI indexes:

  • Strings or Numerics: LIKE, OR (for non-vector databases)

The SAI quickstart focuses only on defining multiple indexes based on non-primary key columns (a very useful feature). A few examples are provided here to show how to query tables with SAI indexes.

The SAI examples page provides more extensive examples.

SAI only supports SELECT queries, but not UPDATE or DELETE queries.

This example uses the following table and index:

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs(comment_vector) 
  USING 'StorageAttachedIndex';

Simple query

To query data using Vector Search, use a SELECT query:

SELECT * FROM cycling.comments_vs 
  ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55] 
  LIMIT 3;
Results
WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!

 id                                   | created_at                      | comment                                | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
 e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |              rain, rain,rain, go away! | [0.9, 0.54, 0.12, 0.1, 0.95] |      John | 6711e6c0-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 670c4170-2f6a-11ef-bd2f-836fa334e187
 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |                    Rain like a monsoon | [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane | 6712d120-2f6a-11ef-bd2f-836fa334e187

(3 rows)

The query must have a limit of 1,000 or fewer.

Scrolling to the right on the results shows the comments from the table that most closely matched the embeddings used for the query.

Similarity query

To obtain the similarity calculation of the best scoring node closest to the query data as part of the results, use a modified SELECT query:

SELECT  comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) 
    FROM cycling.comments_vs
    ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] 
    LIMIT 3;
Results
WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!

 comment                                | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
----------------------------------------+-----------------------------------------------------------------------
      Second rest stop was out of water |                                                              0.949701
 LATE RIDERS SHOULD NOT DELAY THE START |                                                              0.789776
                    Rain like a monsoon |                                                              0.789776

(3 rows)

The supported functions for this type of query are:

  • similarity_dot_product

  • similarity_cosine

  • similarity_euclidean

with the parameters of (<vector_column>, <embedding_value>). Both parameters represent vectors.

Vector Search utilizes Approximate Nearest Neighbor (ANN) that in most cases yields results almost as good as the exact match. The scaling is superior to Exact Nearest Neighbor (KNN).

Least-similar searches are not supported.

Vector Search works optimally on tables with no overwrites or deletions of the comment_vector column. For an comment_vector column with changes, expect slower search results.

The embeddings were randomly generated in this example. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.

Single index match on a column

This example uses the following table and indexes:

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX commenter_idx ON cycling.comments_vs (commenter)
  USING 'StorageAttachedIndex' 
  WITH OPTIONS = {'case_sensitive': 'true', 'normalize': 'true', 'ascii': 'true'};
CREATE CUSTOM INDEX created_at_idx ON cycling.comments_vs (created_at) 
  USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs(comment_vector) 
  USING 'StorageAttachedIndex';

The column commenter is not the partition key in this table, so an index is required to query on it.

Query for a match on that column:

SELECT * FROM cycling.comments_vs WHERE commenter = 'Alex';
Results
WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!

 id                                   | created_at                      | comment                                | comment_vector                | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+-------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 670c4170-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 |      Second rest stop was out of water |  [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex | 670ba530-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-02-14 20:43:20.000000+0000 | Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex | 670b3000-2f6a-11ef-bd2f-836fa334e187

(3 rows)

Note that a failure will occur if you try this query before creating the index:

SELECT * FROM cycling.comments_vs WHERE commenter = 'Alex';
Results
WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!

 id                                   | created_at                      | comment                                | comment_vector                | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+-------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 670c4170-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 |      Second rest stop was out of water |  [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex | 670ba530-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-02-14 20:43:20.000000+0000 | Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex | 670b3000-2f6a-11ef-bd2f-836fa334e187

(3 rows)

Single index match on a column with options

This example uses the following table and indexes:

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE CUSTOM INDEX commenter_idx ON cycling.comments_vs (commenter)
  USING 'StorageAttachedIndex' 
  WITH OPTIONS = {'case_sensitive': 'true', 'normalize': 'true', 'ascii': 'true'};

Case-sensitivity

The column commenter is not the partition key in this table, so an index is required to query on it. If we want to check commenter as a case-sensitive value, we can use the case_sensitive option set to true.

Note that no results are returned if you use an inappropriately case-sensitive value in the query:

SELECT comment,comment_vector,commenter FROM cycling.comments_vs 
  WHERE commenter : 'alex';
Results
WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!

 comment | comment_vector | commenter
---------+----------------+-----------


(0 rows)

When we switch the case of the cyclist’s name to match the case in the index, the query succeeds:

SELECT comment,comment_vector,commenter FROM cycling.comments_vs 
  WHERE commenter : 'Alex';
Results
WARNING: cqlsh was built against 5.0-beta1, but this server is 5.0.  All features may not work!

 comment                                | comment_vector                | commenter
----------------------------------------+-------------------------------+-----------
 LATE RIDERS SHOULD NOT DELAY THE START |  [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex
      Second rest stop was out of water |  [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex
 Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex

(3 rows)

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com