Vector search quickstart

Vector search is a foundational use case for vector databases.

This guide demonstrates how to perform a vector search with CQL.

Create schema and write vector data

To run a vector search with CQL, you need to prepare your database for vector search and then run a vector search query.

Use an existing keyspace or create a keyspace for this quickstart. If you want to follow along with the examples, name your keyspace cycling.

Use the CREATE KEYSPACE command to create a keyspace named cycling:
```
CREATE KEYSPACE IF NOT EXISTS cycling
WITH REPLICATION = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 1
};
```
Select the keyspace that you want to use for this quickstart:
```
USE cycling;
```
Create a table called comments_vs to store the demo data for this quickstart.

Vector data is stored alongside related non-vector data, which is also known as metadata. The vector embeddings are stored in a column of type vector. For more information, see Vector concepts: Embeddings.
```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
```
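Alternatively, if you already have a table without a vector column, add one with ALTER TABLE. Skip this step if you created the table with the preceding statement, because the column already exists: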
```
ALTER TABLE cycling.comments_vs
  ADD comment_vector VECTOR <FLOAT, 5>;
```
In this example, the vector column uses the float data type with a dimension of 5 to store the embeddings. In Apache Cassandra 5.0 and later, vector is a built-in data type that supports float vectors as well as vectors of arbitrary subtype.

Index the vector column by creating a custom index with Storage Attached Indexing (SAI). For this example, the index is named comment_ann_idx.
```
CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs(comment_vector)
  USING 'StorageAttachedIndex';
```
For more information, see Vector concepts: Indexing.

Insert vector and non-vector data into the table:

```
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e7ae5cf3-d358-4d99-b900-85902fda9bb0, 'Alex','Raining too hard should have postponed','2017-02-14 12:43:20-0800',[0.45, 0.09, 0.01, 0.2, 0.11]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e7ae5cf3-d358-4d99-b900-85902fda9bb0,'Alex','Second rest stop was out of water','2017-03-21 13:11:09.999-0800',[0.99, 0.5, 0.99, 0.1, 0.34]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e7ae5cf3-d358-4d99-b900-85902fda9bb0,'Alex','LATE RIDERS SHOULD NOT DELAY THE START','2017-04-01 06:33:02.16-0800',[0.9, 0.54, 0.12, 0.1, 0.95]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f,'Amy','The gift certificate for winning was the best',totimestamp(now()),[0.13, 0.8, 0.35, 0.17, 0.03]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-7207-9494-a29f9809de6f,'Amy','The <B>gift certificate</B> for winning was the best',totimestamp(now()),[0.13, 0.8, 0.35, 0.17, 0.03]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f,'Amy','Glad you ran the race in the rain','2017-02-17 12:43:20.234+0400',[0.3, 0.34, 0.2, 0.78, 0.25]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9594-a29f9809de6f,'Jane','Boy, was it a drizzle out there!','2017-02-17 12:43:20.234+0400',[0.3, 0.34, 0.2, 0.78, 0.25]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-3207-9494-a29f9809de6f,'Amy','THE RACE WAS FABULOUS!','2017-02-17 12:43:20.234+0400',[0.3, 0.34, 0.2, 0.78, 0.25]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f,'Amy','Great snacks at all reststops','2017-03-22 5:16:59.001+0400',[0.1, 0.4, 0.1, 0.52, 0.09]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f,'Amy','Last climb was a killer','2017-04-01 17:43:08.030+0400',[0.3, 0.75, 0.2, 0.2, 0.5]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e8ae5cf3-d358-4d99-b900-85902fda9bb0,'John','rain, rain,rain, go away!','2017-04-01 06:33:02.16-0800',[0.9, 0.54, 0.12, 0.1, 0.95]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e8ae5df3-d358-4d99-b900-85902fda9bb0,'Jane','Rain like a monsoon','2017-04-01 06:33:02.16-0800',[0.9, 0.54, 0.12, 0.1, 0.95]);
```

Note the format of the vector data type. Vector data must be stored in a valid format so that it can be indexed and searched correctly. Additionally, your embeddings must all originate from the same embedding model and match the dimensionality of your vector index. If embeddings originate from different models, the vector search won’t represent an accurate comparison.
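The dimensionality requirement above can be checked on the client before a row is written. The following is a minimal Python sketch, not part of any Cassandra driver: the helper name to_cql_vector and its dimension parameter are hypothetical, and the default of 5 simply mirrors the VECTOR <FLOAT, 5> column in this quickstart.

```python
def to_cql_vector(embedding, dimension=5):
    """Format an embedding as a CQL vector literal after checking its size.

    Hypothetical helper for illustration; `dimension` must match the
    dimension declared on the vector column (5 in this quickstart).
    """
    values = [float(x) for x in embedding]
    if len(values) != dimension:
        raise ValueError(
            f"expected {dimension} dimensions, got {len(values)}"
        )
    # CQL vector literals are bracketed, comma-separated numbers.
    return "[" + ", ".join(repr(v) for v in values) + "]"


literal = to_cql_vector([0.45, 0.09, 0.01, 0.2, 0.11])
print(literal)  # [0.45, 0.09, 0.01, 0.2, 0.11]
```

Rejecting a mismatched embedding on the client gives a clearer error than a failed INSERT, but it does not verify that all embeddings come from the same model; that still has to be enforced by your pipeline.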

This example uses randomly generated embeddings to demonstrate the vector search functionality. In a production scenario, you would produce embeddings specifically for your data and your search query.

For more information, see Vector concepts: Embeddings.

Run vector search queries

Vector search works best on tables with no overwrites or deletions of the vector column. If the vector column is frequently changed, expect slower searches.

Vector search uses approximate nearest neighbor (ANN) search, which in most cases returns results almost as good as an exact k-nearest neighbor (KNN) search while scaling much better. Least-similar searches are not supported. For more information, see Vector concepts: Vector search.

Use a SELECT query to run a standard vector search:

```
SELECT * FROM cycling.comments_vs
  ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
  LIMIT 3;
```

The query returns the rows that are most similar to the given query vector, up to the number set by LIMIT (3 in this example). A vector search can return at most 1,000 rows.

Result

```
 id                                   | created_at                      | comment                                | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
 e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |              rain, rain,rain, go away! | [0.9, 0.54, 0.12, 0.1, 0.95] |      John | 6711e6c0-2f6a-11ef-bd2f-836fa334e187
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 670c4170-2f6a-11ef-bd2f-836fa334e187
 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |                    Rain like a monsoon | [0.9, 0.54, 0.12, 0.1, 0.95] |      Jane | 6712d120-2f6a-11ef-bd2f-836fa334e187

(3 rows)
```
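To see what the ANN query approximates, the following Python sketch runs an exact (brute-force) nearest neighbor ranking over the twelve sample vectors. It assumes cosine similarity as the ranking metric, which is an assumption here because the index in this guide does not set a similarity function explicitly. The names rows, cosine, and exact_top_k are illustrative only.

```python
import math

# (comment, embedding) pairs copied from the INSERT statements above.
rows = [
    ("Raining too hard should have postponed", [0.45, 0.09, 0.01, 0.2, 0.11]),
    ("Second rest stop was out of water", [0.99, 0.5, 0.99, 0.1, 0.34]),
    ("LATE RIDERS SHOULD NOT DELAY THE START", [0.9, 0.54, 0.12, 0.1, 0.95]),
    ("The gift certificate for winning was the best", [0.13, 0.8, 0.35, 0.17, 0.03]),
    ("The <B>gift certificate</B> for winning was the best", [0.13, 0.8, 0.35, 0.17, 0.03]),
    ("Glad you ran the race in the rain", [0.3, 0.34, 0.2, 0.78, 0.25]),
    ("Boy, was it a drizzle out there!", [0.3, 0.34, 0.2, 0.78, 0.25]),
    ("THE RACE WAS FABULOUS!", [0.3, 0.34, 0.2, 0.78, 0.25]),
    ("Great snacks at all reststops", [0.1, 0.4, 0.1, 0.52, 0.09]),
    ("Last climb was a killer", [0.3, 0.75, 0.2, 0.2, 0.5]),
    ("rain, rain,rain, go away!", [0.9, 0.54, 0.12, 0.1, 0.95]),
    ("Rain like a monsoon", [0.9, 0.54, 0.12, 0.1, 0.95]),
]


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, candidates, k):
    """Compare the query with every candidate and keep the k most similar."""
    ranked = sorted(candidates, key=lambda r: cosine(query, r[1]), reverse=True)
    return ranked[:k]


top = exact_top_k([0.15, 0.1, 0.1, 0.35, 0.55], rows, 3)
for comment, _ in top:
    print(comment)
```

Under that assumption, the exact ranking selects the same three comments as the ANN result above, though possibly in a different order because the three rows share one embedding. The exact scan compares the query with every row, which is why it scales worse than an ANN index on large tables.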

To include the similarity score in the results, use a modified SELECT query.

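The supported functions for this type of query are similarity_dot_product, similarity_cosine, and similarity_euclidean. Each takes two vector parameters: (<vector_column>, <embedding_value>). The score is computed against the embedding passed to the function, whereas the row order is determined by the ANN OF vector. In the following example the two vectors differ, so pass the same vector to both if you want the score to reflect the search ranking.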

```
SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
    FROM cycling.comments_vs
    ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
    LIMIT 3;
```

Result

```
 comment                                | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
----------------------------------------+-----------------------------------------------------------------------
      Second rest stop was out of water |                                                              0.949701
 LATE RIDERS SHOULD NOT DELAY THE START |                                                              0.789776
                    Rain like a monsoon |                                                              0.789776

(3 rows)
```
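The scores in this result can be reproduced by hand. The following Python sketch assumes that similarity_cosine maps the cosine of the angle between the two vectors into the range 0 to 1 as (1 + cos) / 2. That formula is an assumption inferred from the values in the table, which it matches to four decimal places; the function name similarity_cosine_score is illustrative.

```python
import math


def similarity_cosine_score(a, b):
    """Cosine similarity rescaled from [-1, 1] to [0, 1] as (1 + cos) / 2.

    The rescaling is an assumption that reproduces the scores shown above.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return (1 + dot / (norm_a * norm_b)) / 2


# The embedding passed to similarity_cosine in the query above.
score_vector = [0.2, 0.15, 0.3, 0.2, 0.05]

rest_stop = similarity_cosine_score([0.99, 0.5, 0.99, 0.1, 0.34], score_vector)
monsoon = similarity_cosine_score([0.9, 0.54, 0.12, 0.1, 0.95], score_vector)

print(round(rest_stop, 4))  # 0.9497
print(round(monsoon, 4))    # 0.7898
```

Small differences in the last digits are expected, because this sketch uses double-precision arithmetic while the column stores single-precision floats.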

See also

CassIO for AI workloads