Python driver quickstart
DataStax recommends using the Python client with HCD databases. Use the Python driver only if you are working with an existing application that previously used a CQL-based driver or if you plan to explicitly use CQL.
Review the Connection methods comparison page to determine the option that best suits your use case.
This quickstart walks through an end-to-end workflow with the Python driver: connect to your database, load a set of vector embeddings, and run a similarity search that finds the vectors closest to the one in your query.
Prerequisites
You need the following items to complete this quickstart:
- A running HCD database
- Python 3.7+
Install the cassandra-driver package
Identify your pip version and upgrade if needed before installing the cassandra-driver package.
- Verify that pip is version 23.0 or higher.
  pip --version
- Upgrade pip if needed.
  python -m pip install --upgrade pip
- Install the cassandra-driver package.
  pip install cassandra-driver
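After the install step, it can help to confirm that the package is visible to the interpreter you plan to use. The following is a minimal sketch, not part of the official quickstart. The helper name installed_driver_version is hypothetical, and importlib.metadata is assumed to be available, which requires Python 3.8 or later.

```python
from importlib import metadata  # standard library in Python 3.8 and later


def installed_driver_version(package="cassandra-driver"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_driver_version()
if version is None:
    print("cassandra-driver is not installed in this environment")
else:
    print(f"cassandra-driver {version} is installed")
```

If the check reports that the package is missing, the pip command you ran may belong to a different Python installation than the interpreter running the script.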
Import libraries and connect to the database
Import the necessary libraries and establish a connection to your database.
This configuration is suitable for basic use cases only. For proofs of concept or production use, see Production configuration.
import os
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
cluster = Cluster(auth_provider=auth_provider)
session = cluster.connect()
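The connection code above hardcodes the default credentials and relies on the driver's default contact point. As a sketch of one alternative, the settings can be read from environment variables. The variable names HCD_CONTACT_POINTS, HCD_PORT, HCD_USERNAME, and HCD_PASSWORD and both helper functions are hypothetical, and the defaults are assumed to mirror the values used in this quickstart (a local node on port 9042).

```python
import os


def load_connection_settings(env=None):
    """Collect connection settings from environment variables.

    The variable names are hypothetical; the defaults mirror the quickstart.
    """
    env = os.environ if env is None else env
    hosts = env.get("HCD_CONTACT_POINTS", "127.0.0.1")
    return {
        # Comma-separated host list, with blanks and empty entries dropped
        "contact_points": [h.strip() for h in hosts.split(",") if h.strip()],
        "port": int(env.get("HCD_PORT", "9042")),
        "username": env.get("HCD_USERNAME", "cassandra"),
        "password": env.get("HCD_PASSWORD", "cassandra"),
    }


def connect(settings):
    """Open a session using the same driver classes as the quickstart."""
    # Imported here so the settings helper works without the driver installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    auth_provider = PlainTextAuthProvider(
        username=settings["username"], password=settings["password"]
    )
    cluster = Cluster(
        contact_points=settings["contact_points"],
        port=settings["port"],
        auth_provider=auth_provider,
    )
    return cluster.connect()
```

With a database running, session = connect(load_connection_settings()) would produce a session equivalent to the one created above when no variables are set.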
Create a table and vector-compatible Storage Attached Index (SAI)
Create a table named vector_test in your database with columns for an integer id, text, and a 5-dimensional vector. The example assumes that the cycling keyspace already exists.
Then, create a custom index on the vector column of this table using a storage-attached index with the cosine similarity function for efficient vector searches.
keyspace = "cycling"
v_dimension = 5
session.execute((
"CREATE TABLE IF NOT EXISTS {keyspace}.vector_test (id INT PRIMARY KEY, "
"text TEXT, vector VECTOR<FLOAT,{v_dimension}>);"
).format(keyspace=keyspace, v_dimension=v_dimension))
session.execute((
"CREATE CUSTOM INDEX IF NOT EXISTS idx_vector_test "
"ON {keyspace}.vector_test "
"(vector) USING 'StorageAttachedIndex' WITH OPTIONS = "
"{{'similarity_function' : 'cosine'}};"
).format(keyspace=keyspace))
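Because the statements above are built with string formatting, a typo in the keyspace name or similarity function only surfaces when the database rejects the statement. The following sketch builds the same index statement with basic checks first. The helper name build_vector_index_cql is hypothetical, and the set of accepted similarity functions (cosine, dot_product, euclidean) is an assumption that you should confirm against the CQL reference for your database version.

```python
import re

# Assumed set of valid values for the similarity_function index option
ALLOWED_SIMILARITY = {"cosine", "dot_product", "euclidean"}

# Unquoted CQL identifiers: a letter followed by letters, digits, or underscores
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def build_vector_index_cql(keyspace, table, column, similarity="cosine"):
    """Return a CREATE CUSTOM INDEX statement for a vector column."""
    for name in (keyspace, table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain CQL identifier: {name!r}")
    if similarity not in ALLOWED_SIMILARITY:
        raise ValueError(f"unsupported similarity function: {similarity!r}")
    return (
        f"CREATE CUSTOM INDEX IF NOT EXISTS idx_{table} "
        f"ON {keyspace}.{table} ({column}) "
        f"USING 'StorageAttachedIndex' WITH OPTIONS = "
        f"{{'similarity_function' : '{similarity}'}};"
    )


print(build_vector_index_cql("cycling", "vector_test", "vector"))
```

The returned string can be passed to session.execute in place of the formatted literal.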
Load data
Insert a few rows with embeddings into the vector_test table.
text_blocks = [
(1, "ChatGPT integrated sneakers that talk to you", [0.1, 0.15, 0.3, 0.12, 0.05]),
(2, "An AI quilt to help you sleep forever", [0.45, 0.09, 0.01, 0.2, 0.11]),
(3, "A deep learning display that controls your mood", [0.1, 0.05, 0.08, 0.3, 0.6]),
]
for row_id, text, vector in text_blocks:
    session.execute(
        f"INSERT INTO {keyspace}.vector_test (id, text, vector) VALUES (%s, %s, %s)",
        (row_id, text, vector)
    )
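Each embedding must have exactly as many values as the vector column's declared dimension, which is 5 in this quickstart. As a sketch of a client-side check that could run before the insert loop, the following helper rejects rows with the wrong length or non-finite values. The function name validate_rows is hypothetical and not part of the driver.

```python
import math


def validate_rows(rows, dimension):
    """Raise ValueError if an embedding has the wrong length or a bad value.

    Returns the number of rows checked.
    """
    for row_id, text, vector in rows:
        if len(vector) != dimension:
            raise ValueError(
                f"row {row_id}: expected {dimension} values, got {len(vector)}"
            )
        if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
            raise ValueError(f"row {row_id}: embedding has a non-numeric or non-finite value")
    return len(rows)


sample_rows = [
    (1, "ChatGPT integrated sneakers that talk to you", [0.1, 0.15, 0.3, 0.12, 0.05]),
    (2, "An AI quilt to help you sleep forever", [0.45, 0.09, 0.01, 0.2, 0.11]),
    (3, "A deep learning display that controls your mood", [0.1, 0.05, 0.08, 0.3, 0.6]),
]
print(validate_rows(sample_rows, 5))
```

Checking on the client reports the offending row id directly instead of relying on the database's error for a rejected insert.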
Perform a similarity search
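Find the rows whose embeddings are closest to a given query vector. The ORDER BY ... ANN OF clause runs an approximate nearest neighbor search, LIMIT 2 keeps the two closest rows, and similarity_cosine returns each row's similarity score.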
ann_query = (
f"SELECT id, text, similarity_cosine(vector, [0.15, 0.1, 0.1, 0.35, 0.55]) as sim FROM {keyspace}.vector_test "
"ORDER BY vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55] LIMIT 2"
)
for row in session.execute(ann_query):
    print(f"[{row.id}] \"{row.text}\" (sim: {row.sim:.4f})")
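To see which rows the search should favor, you can rank the three sample embeddings against the query vector in plain Python. This is a brute-force sketch that needs no database, not a description of how the index works internally. It computes the raw cosine of the angle between vectors, and the sim value reported by CQL may be scaled differently, so compare the ordering rather than the exact numbers.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [0.15, 0.1, 0.1, 0.35, 0.55]
embeddings = {
    1: [0.1, 0.15, 0.3, 0.12, 0.05],
    2: [0.45, 0.09, 0.01, 0.2, 0.11],
    3: [0.1, 0.05, 0.08, 0.3, 0.6],
}

# Sort row ids from most to least similar, then keep the two best
ranked = sorted(embeddings, key=lambda i: cosine(embeddings[i], query), reverse=True)
top_two = ranked[:2]
for i in top_two:
    print(i, round(cosine(embeddings[i], query), 4))
```

For this data the exact ranking puts row 3 first and row 2 second. An approximate search can differ from an exact ranking on larger datasets.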
You have now connected the Python driver to your database, loaded a set of vector embeddings, and run a similarity search that finds the vectors closest to the one in your query.
Resources
See the Python driver documentation for details about APIs, statements, connection pooling, load balancing, retry policies, and other topics.