Use vector search with Cassandra Query Language (CQL)

This example demonstrates how to use the vector search with Cassandra Query Language (CQL) with Hyper-Converged Database (HCD) 1.2.

To enable your machine learning model, vector search uses data compared by similarity within a database, even if it is not explicitly defined by a connection. A vector is an array of floating point type that represents a specific object or entity.

The foundation of vector search lies within the embeddings, which are compact representations of text as vectors of floating-point numbers. These embeddings are generated by feeding the text through an API, which uses a neural network to transform the input into a fixed-length vector. Embeddings capture the semantic meaning of the text, providing a more nuanced understanding than traditional term-based approaches. The vector representation allows for input that is substantially similar to produce output vectors that are geometrically close; inputs that are not similar are geometrically further apart.

To enable vector search, a vector data type is available in your HCD database with vector search. For more information, see vector search examples in the CQL documentation.

The embeddings were randomly generated in this tutorial. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.

Install HCD

Go install HCD if you haven’t already. For exploration, use the Docker installation option.

Install a terminal to run your client

The clients can be tested by running them in a terminal. You’ll want Xterm, Terminal, or another terminal emulator.

Identify your credentials

You need to identify the credentials, or token, for your database. For initial exploration, you can use the default superuser credentials set in the database. The default superuser credentials are: cassandra is the username and cassandra is the password. These values will be used with a token provider in the client to generate a TOKEN used for authentication to run the client.
1. Before going much further, you should create a new user with a secure password. This new user, once created, will be used to generate a token for authentication.
You need the API endpoint that your client will connect to. The DB_API_ENDPOINT is the URL of the database you installed. The port number is 8181 by default. For example, if you installed the database using a Docker container, the API endpoint is http://localhost:8181. This value will be your DB_API_ENDPOINT.

You may either assign your username/password and API endpoint to environment variables in your terminal, or modify the client code to include them directly, as shown in the following examples. Another value that you will want to set is the OPENAI_API_KEY. This is the API key that you received when you signed up for the OpenAI API. This key is used to authenticate your requests to the OpenAI API, and the clients use it to vectorize the text that you provide.

Linux or macOS
Windows
Google Colab

export DB_API_ENDPOINT=DB_API_ENDPOINT # Your database API endpoint
export DB_USERNAME=USERNAME # Your username to run the client
export DB_PASSWORD=PASSWORD # Your password for the username

set DB_DB_API_ENDPOINT=DB_API_ENDPOINT # Your database API endpoint

set DB_USERNAME=USERNAME # Your username to run the client

set DB_PASSWORD=PASSWORD # Your password for the username

os.environ["DB_API_ENDPOINT"] = "DB_API_ENDPOINT" # Your database API endpoint
os.environ["DB_USERNAME"] = "USERNAME" # Your username to run the client
os.environ["DB_PASSWORD"] = "PASSWORD" # Your password for the username

Create vector search keyspace

Create a new vector search keyspace called cycling:

CREATE KEYSPACE IF NOT EXISTS cycling
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

Use vector search keyspace

Select the keyspace cycling for creating the vector search example:

USE cycling;

Create vector search table

Create a new table called comments_vs to store comments information:

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);

Alternatively, if you didn’t have a vector column in your original table, you could alter the original table to add a vector column comment_vector to store the embeddings:

ALTER TABLE cycling.comments_vs
  ADD comment_vector VECTOR <FLOAT, 5>; (1)

1	Notice that the vector uses the `float` data type and specifies the array dimension of 5 to store the embeddings.

Create vector search index

Create the custom index called comment_vector with Storage Attached Indexing (SAI):

CREATE CUSTOM INDEX comment_ann_idx ON cycling.comments_vs(comment_vector)
  USING 'StorageAttachedIndex';

For more about SAI, see the SAI documentation.

Load vector search data into your database

Insert data into the table:

INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e7ae5cf3-d358-4d99-b900-85902fda9bb0, 'Alex','Raining too hard should have postponed','2017-02-14 12:43:20-0800',[0.45, 0.09, 0.01, 0.2, 0.11]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e7ae5cf3-d358-4d99-b900-85902fda9bb0,'Alex','Second rest stop was out of water','2017-03-21 13:11:09.999-0800',[0.99, 0.5, 0.99, 0.1, 0.34]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e7ae5cf3-d358-4d99-b900-85902fda9bb0,'Alex','LATE RIDERS SHOULD NOT DELAY THE START','2017-04-01 06:33:02.16-0800',[0.9, 0.54, 0.12, 0.1, 0.95]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f,'Amy','The gift certificate for winning was the best',totimestamp(now()),[0.13, 0.8, 0.35, 0.17, 0.03]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-7207-9494-a29f9809de6f,'Amy','The <B>gift certificate</B> for winning was the best',totimestamp(now()),[0.13, 0.8, 0.35, 0.17, 0.03]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f,'Amy','Glad you ran the race in the rain','2017-02-17 12:43:20.234+0400',[0.3, 0.34, 0.2, 0.78, 0.25]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9594-a29f9809de6f,'Jane','Boy, was it a drizzle out there!','2017-02-17 12:43:20.234+0400',[0.3, 0.34, 0.2, 0.78, 0.25]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(), c7fceba0-c141-3207-9494-a29f9809de6f,'Amy','THE RACE WAS FABULOUS!','2017-02-17 12:43:20.234+0400',[0.3, 0.34, 0.2, 0.78, 0.25]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f, 'Amy','Great snacks at all reststops','2017-03-22 5:16:59.001+0400',[0.1, 0.4, 0.1, 0.52, 0.09]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),c7fceba0-c141-4207-9494-a29f9809de6f,'Amy','Last climb was a killer','2017-04-01 17:43:08.030+0400',[0.3, 0.75, 0.2, 0.2, 0.5]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e8ae5cf3-d358-4d99-b900-85902fda9bb0,'John','rain, rain,rain, go away!','2017-04-01 06:33:02.16-0800',[0.9, 0.54, 0.12, 0.1, 0.95]);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector) VALUES (now(),e8ae5df3-d358-4d99-b900-85902fda9bb0,'Jane','Rain like a monsoon','2017-04-01 06:33:02.16-0800',[0.9, 0.54, 0.12, 0.1, 0.95]);

Note the format of the new vector data type.

Query vector data with CQL

This example demonstrates how to use the vector search with CQL with HCD.

Simple query

To query data using vector search, use a SELECT query:

SELECT * FROM cycling.comments_vs
  ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
  LIMIT 3;

Result

 id                                   | created_at                      | comment                                | comment_vector                                    | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+---------------------------------------------------+-----------+--------------------------------------
 e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |              rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' |      John | a2ad6561-2759-11ef-81d8-676f970e6feb
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' |      Alex | a2a9e2f1-2759-11ef-81d8-676f970e6feb
 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |                    Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' |      Jane | a2adda91-2759-11ef-81d8-676f970e6feb

(3 rows)

The limit has to be 1,000 or fewer.

Scrolling to the right on the results shows the comments from the table that most closely matched the embeddings used for the query.

Similarity query

To obtain the similarity calculation of the best scoring node closest to the query data as part of the results, use a modified SELECT query:

SELECT  comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
    FROM cycling.comments_vs
    ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
    LIMIT 3;

Result

 comment                                | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
----------------------------------------+-----------------------------------------------------------------------
      Second rest stop was out of water |                                                              0.949701
              rain, rain,rain, go away! |                                                              0.789776
 LATE RIDERS SHOULD NOT DELAY THE START |                                                              0.789776

(3 rows)

The supported functions for this type of query are:

similarity_dot_product
similarity_cosine
similarity_euclidean

with the parameters of (<vector_column>, <embedding_value>). Both parameters represent vectors.

Vector search utilizes Approximate Nearest Neighbor (ANN) that in most cases yields results almost as good as the exact match. The scaling is superior to Exact Nearest Neighbor (KNN).

Least-similar searches are not supported.

Vector search works optimally on tables with no overwrites or deletions of the comment_vector column. For an comment_vector column with changes, expect slower search results.

The embeddings were randomly generated in this example. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.

With the code examples, you have a working example of our vector search. Load your own data and use the search function.

CassIO for AI workloads

CassIO abstracts away the details of accessing the Cassandra database for the typical needs of generative artificial intelligence (AI) or other machine learning workloads. CassIO offers a low-boilerplate, ready-to-use set of tools for seamless integration of Cassandra in most AI-oriented applications.

For more information, see CassIO.