Vector Search Quickstart
To enable your machine learning model, Vector Search uses data to be compared by similarity within a database, even if it is not explicitly defined by a connection. A vector is an array of floating point type that represents a specific object or entity.
The foundation of Vector Search lies within the embeddings, which are compact representations of text as vectors of floating-point numbers. These embeddings are generated by feeding the text through an API, which uses a neural network to transform the input into a fixed-length vector. Embeddings capture the semantic meaning of the text, providing a more nuanced understanding than traditional term-based approaches. The vector representation allows for input that is substantially similar to produce output vectors that are geometrically close; inputs that are not similar are geometrically further apart.
To enable Vector Search, a new vector
data type is available in your
Astra DB, Apache Cassandra®, or DataStax Enterprise (DSE) database with Vector Search.
Prerequisites
There are no prerequisite tasks.
In general, to use Vector Search with DSE, you’ll follow these instructions: The embeddings were randomly generated in this quickstart. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.
Create vector search keyspace
Create a new vector search keyspace called cycling
:
CREATE KEYSPACE IF NOT EXISTS cycling
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };
Use vector search keyspace
Select the keyspace cycling
for creating the vector search example:
USE cycling;
Create vector search table
Create a new table called comments_vs
to store comments information:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
Alternatively, if you didn’t have a vector column in your original table, you could alter the original table to add a vector column comment_vector
to store the embeddings:
-- ALTER TABLE cycling.comments_vs
-- ADD comment_vector VECTOR <FLOAT, 5>; (1)
ALTER TABLE cycling.comments_vs
ADD comment_vector VECTOR <FLOAT, 5>;
1 | Notice that the vector uses the float data type and specifies the array dimension of 5 to store the embeddings. |
Create vector search index
Create the custom index called comment_vector
with Storage Attached Indexing (SAI):
CREATE CUSTOM INDEX IF NOT EXISTS ann_index
ON cycling.comments_vs(comment_vector) USING 'StorageAttachedIndex';
For more about SAI, see the Storage Attached Index documentation.
Load Vector Search data into your database
Insert data into the table:
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
e7ae5cf3-d358-4d99-b900-85902fda9bb0,
'Alex',
'Raining too hard should have postponed',
'2017-02-14 12:43:20-0800',
[0.45, 0.09, 0.01, 0.2, 0.11]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
e7ae5cf3-d358-4d99-b900-85902fda9bb0,
'Alex',
'Second rest stop was out of water',
'2017-03-21 13:11:09.999-0800',
[0.99, 0.5, 0.99, 0.1, 0.34]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
e7ae5cf3-d358-4d99-b900-85902fda9bb0,
'Alex',
'LATE RIDERS SHOULD NOT DELAY THE START',
'2017-04-01 06:33:02.16-0800',
[0.9, 0.54, 0.12, 0.1, 0.95]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'The gift certificate for winning was the best',
totimestamp(now()),
[0.13, 0.8, 0.35, 0.17, 0.03]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'Glad you ran the race in the rain',
'2017-02-17 12:43:20.234+0400',
[0.3, 0.34, 0.2, 0.78, 0.25]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'Great snacks at all reststops',
'2017-03-22 5:16:59.001+0400',
[0.1, 0.4, 0.1, 0.52, 0.09]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'Last climb was a killer',
'2017-04-01 17:43:08.030+0400',
[0.3, 0.75, 0.2, 0.2, 0.5]
);
Note the format of the new vector
data type.
Simple query
To query data using Vector Search, use a SELECT
query:
-
CQL
-
Result
SELECT * FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] | Alex | 616e77e0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-02-17 08:43:20.234000+0000 | Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] | Amy | 6170c1d0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-04-01 13:43:08.030000+0000 | Last climb was a killer | [0.3, 0.75, 0.2, 0.2, 0.5] | Amy | 62105d30-22a2-11ee-b99d-1f350647414a
The limit has to be 1,000 or fewer. |
Scrolling to the right on the results shows the comments from the table that most closely matched the embeddings used for the query.
Similarity query
To obtain the similarity calculation of the best scoring node closest to the query data as part of the results, use a modified SELECT query:
- CQL
-
SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 3;
- Result
-
comment | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) -----------------------------------------------+----------------------------------------------------------------------- Second rest stop was out of water | 0.949701 The gift certificate for winning was the best | 0.86062 The gift certificate for winning was the best | 0.86062 (3 rows)
The supported functions for this type of query are:
-
similarity_dot_product
-
similarity_cosine
-
similarity_euclidean
with the parameters of (<vector_column>, <embedding_value>). Both parameters represent vectors.
Vector Search utilizes Approximate Nearest Neighbor (ANN) that in most cases yields results almost as good as the exact match. The scaling is superior to Exact Nearest Neighbor (KNN).
Least-similar searches are not supported.
Vector Search works optimally on tables with no overwrites or deletions of the comment_vector column. For an comment_vector column with changes, expect slower search results.
The embeddings were randomly generated in this example. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.
With the code examples, you have a working example of our Vector Search. Load your own data and use the search function. The embeddings were randomly generated in this quickstart. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.
Create vector search keyspace
Create a new vector search keyspace called cycling
:
CREATE KEYSPACE IF NOT EXISTS cycling
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };
Use vector search keyspace
Select the keyspace cycling
for creating the vector search example:
USE cycling;
Create vector search table
Create a new table called comments_vs
to store comments information:
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
Alternatively, if you didn’t have a vector column in your original table, you could alter the original table to add a vector column comment_vector
to store the embeddings:
-- ALTER TABLE cycling.comments_vs
-- ADD comment_vector VECTOR <FLOAT, 5>; (1)
ALTER TABLE cycling.comments_vs
ADD comment_vector VECTOR <FLOAT, 5>;
1 | Notice that the vector uses the float data type and specifies the array dimension of 5 to store the embeddings. |
Create vector search index
Create the custom index called comment_vector
with Storage Attached Indexing (SAI):
CREATE CUSTOM INDEX IF NOT EXISTS ann_index
ON cycling.comments_vs(comment_vector) USING 'StorageAttachedIndex';
For more about SAI, see the Storage Attached Index documentation.
Load Vector Search data into your database
Insert data into the table:
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
e7ae5cf3-d358-4d99-b900-85902fda9bb0,
'Alex',
'Raining too hard should have postponed',
'2017-02-14 12:43:20-0800',
[0.45, 0.09, 0.01, 0.2, 0.11]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
e7ae5cf3-d358-4d99-b900-85902fda9bb0,
'Alex',
'Second rest stop was out of water',
'2017-03-21 13:11:09.999-0800',
[0.99, 0.5, 0.99, 0.1, 0.34]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
e7ae5cf3-d358-4d99-b900-85902fda9bb0,
'Alex',
'LATE RIDERS SHOULD NOT DELAY THE START',
'2017-04-01 06:33:02.16-0800',
[0.9, 0.54, 0.12, 0.1, 0.95]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'The gift certificate for winning was the best',
totimestamp(now()),
[0.13, 0.8, 0.35, 0.17, 0.03]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'Glad you ran the race in the rain',
'2017-02-17 12:43:20.234+0400',
[0.3, 0.34, 0.2, 0.78, 0.25]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'Great snacks at all reststops',
'2017-03-22 5:16:59.001+0400',
[0.1, 0.4, 0.1, 0.52, 0.09]
);
INSERT INTO cycling.comments_vs (record_id, id, commenter, comment, created_at, comment_vector)
VALUES (
now(),
c7fceba0-c141-4207-9494-a29f9809de6f,
'Amy',
'Last climb was a killer',
'2017-04-01 17:43:08.030+0400',
[0.3, 0.75, 0.2, 0.2, 0.5]
);
Note the format of the new vector
data type.
Simple query
To query data using Vector Search, use a SELECT
query:
-
CQL
-
Result
SELECT * FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] | Alex | 616e77e0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-02-17 08:43:20.234000+0000 | Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] | Amy | 6170c1d0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-04-01 13:43:08.030000+0000 | Last climb was a killer | [0.3, 0.75, 0.2, 0.2, 0.5] | Amy | 62105d30-22a2-11ee-b99d-1f350647414a
The limit has to be 1,000 or fewer. |
Scrolling to the right on the results shows the comments from the table that most closely matched the embeddings used for the query.
Similarity query
To obtain the similarity calculation of the best scoring node closest to the query data as part of the results, use a modified SELECT query:
- CQL
-
SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 3;
- Result
-
comment | system.similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05]) -----------------------------------------------+----------------------------------------------------------------------- Second rest stop was out of water | 0.949701 The gift certificate for winning was the best | 0.86062 The gift certificate for winning was the best | 0.86062 (3 rows)
The supported functions for this type of query are:
-
similarity_dot_product
-
similarity_cosine
-
similarity_euclidean
with the parameters of (<vector_column>, <embedding_value>). Both parameters represent vectors.
Vector Search utilizes Approximate Nearest Neighbor (ANN) that in most cases yields results almost as good as the exact match. The scaling is superior to Exact Nearest Neighbor (KNN).
Least-similar searches are not supported.
Vector Search works optimally on tables with no overwrites or deletions of the comment_vector column. For an comment_vector column with changes, expect slower search results.
The embeddings were randomly generated in this example. Generally, you would run both your source documents/contents through an embeddings generator, as well as the query you were asking to match. This example is simply to show the mechanics of how to use CQL to create vector search data objects.
With the code examples, you have a working example of our Vector Search. Load your own data and use the search function.
CassIO for AI workloads
CassIO abstracts away the details of accessing the Cassandra database for the typical needs of generative artificial intelligence (AI) or other machine learning workloads. CassIO offers a low-boilerplate, ready-to-use set of tools for seamless integration of Cassandra in most AI-oriented applications.
For more, see CassIO.