Python driver quickstart
This quickstart walks through an end-to-end workflow for connecting to your database with the Python driver. For demonstration purposes, it also shows how to use the driver to run a vector search with Cassandra Query Language (CQL) statements.
Prerequisites
You need the following items to complete this quickstart:
- A running DSE cluster
- Python 3.7 or later installed
- pip 23.0 or later installed
Install the Python driver
Install the cassandra-driver package:
pip install cassandra-driver
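To confirm that the installation succeeded before going further, a quick check such as the following can help. This is a minimal sketch: the helper name `driver_is_installed` is made up for illustration, and it relies only on the fact that the cassandra-driver package is imported as `cassandra`.

```python
import importlib.util


def driver_is_installed() -> bool:
    """Return True if the `cassandra` module (provided by cassandra-driver) can be found."""
    return importlib.util.find_spec("cassandra") is not None


if driver_is_installed():
    import cassandra

    # The driver exposes its version string as cassandra.__version__
    print("cassandra-driver version:", cassandra.__version__)
else:
    print("cassandra-driver is not installed; run: pip install cassandra-driver")
```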
Connect the Python driver
Import the necessary libraries and establish a connection to your database.
This section shows two options: a production configuration, followed by a basic configuration.
When using the Python driver in production environments or with simulated production workloads, DataStax recommends robust session configuration with profile and cluster details to help optimize driver performance.
The following code initializes a session to connect to your database with the cassandra-driver, and it sets up the connection with authentication details sourced from environment variables.
Additionally, it includes options for connection timeout, request timeout, and protocol version.
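import os
from cassandra import ProtocolVersion
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
# Read credentials from environment variables (the variable names are examples)
auth_provider = PlainTextAuthProvider(
    username=os.environ["DSE_USERNAME"],
    password=os.environ["DSE_PASSWORD"],
)
profile = ExecutionProfile(request_timeout=30)  # request timeout, in seconds
cluster = Cluster(
    auth_provider=auth_provider,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    connect_timeout=10,  # connection timeout, in seconds
    protocol_version=ProtocolVersion.V4,
)
session = cluster.connect()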
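The description above refers to credentials sourced from environment variables. One way to read them with an explicit failure when something is missing is sketched below. The variable names `DSE_USERNAME` and `DSE_PASSWORD` and the helper `load_credentials` are assumptions made for this example, not names defined by the driver.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) from the environment, or raise if either is missing."""
    # DSE_USERNAME and DSE_PASSWORD are hypothetical names chosen for this sketch.
    required = ("DSE_USERNAME", "DSE_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["DSE_USERNAME"], env["DSE_PASSWORD"]
```

The returned pair can then be passed to `PlainTextAuthProvider(username=..., password=...)`, which fails early with a readable message instead of a `KeyError` deep in the connection code.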
You can use a minimal session configuration for testing or lower environments where you don’t need to optimize the cluster details for production workloads.
The following code initializes a session to connect to your database with the cassandra-driver. It reads authentication details from environment variables and falls back to the default cassandra credentials if they aren't set:
import os
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster
# Example variable names; falls back to the default cassandra/cassandra login
auth_provider = PlainTextAuthProvider(
    username=os.environ.get("DSE_USERNAME", "cassandra"),
    password=os.environ.get("DSE_PASSWORD", "cassandra"),
)
cluster = Cluster(auth_provider=auth_provider)
session = cluster.connect()
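After connecting, it can be useful to confirm that the session actually reaches the database. The sketch below queries the `release_version` column of the `system.local` table. The helper name `server_release_version` is made up for this example, and it assumes the driver's default row format, where columns are available as attributes.

```python
def server_release_version(session):
    """Run a lightweight query and return the server's reported release version."""
    row = session.execute("SELECT release_version FROM system.local").one()
    return row.release_version


# Usage, assuming `session` was created as shown above:
# print("Connected, server version:", server_release_version(session))
```

When the application is finished with the database, calling `cluster.shutdown()` closes the driver's connections.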
Create a table and vector-compatible Storage Attached Index (SAI)
Create a table named vector_test in your database with columns for an integer id, text, and a 5-dimensional vector.
Then, create a custom index on the vector column of this table using a storage-attached index with a cosine similarity function for efficient vector searches.
# Assumes the cycling keyspace already exists
keyspace = "cycling"
v_dimension = 5
# Create the table with an integer key, a text column, and a vector column
session.execute((
    "CREATE TABLE IF NOT EXISTS {keyspace}.vector_test (id INT PRIMARY KEY, "
    "text TEXT, vector VECTOR<FLOAT,{v_dimension}>);"
).format(keyspace=keyspace, v_dimension=v_dimension))
# Create the SAI index on the vector column; doubled braces escape str.format
session.execute((
    "CREATE CUSTOM INDEX IF NOT EXISTS idx_vector_test "
    "ON {keyspace}.vector_test "
    "(vector) USING 'StorageAttachedIndex' WITH OPTIONS = "
    "{{'similarity_function' : 'cosine'}};"
).format(keyspace=keyspace))
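The statements above assume that the `cycling` keyspace already exists. If it does not, something like the following sketch could build a statement to create it first. The helper `create_keyspace_cql` is hypothetical, and the `SimpleStrategy` replication with a factor of 1 is an assumption that only suits a single-node development cluster. Production clusters typically use `NetworkTopologyStrategy` with per-datacenter factors.

```python
import re


def create_keyspace_cql(keyspace: str, replication_factor: int = 1) -> str:
    """Build a CREATE KEYSPACE statement for a development cluster."""
    # Keyspace names cannot be bound as query parameters, so validate the
    # name before formatting it into the statement.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", keyspace):
        raise ValueError(f"Invalid keyspace name: {keyspace!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {int(replication_factor)}}}"
    )


# Usage, assuming `session` was created as shown earlier:
# session.execute(create_keyspace_cql("cycling"))
```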
Load data
Insert a few documents with embeddings into the table.
text_blocks = [
    (1, "ChatGPT integrated sneakers that talk to you", [0.1, 0.15, 0.3, 0.12, 0.05]),
    (2, "An AI quilt to help you sleep forever", [0.45, 0.09, 0.01, 0.2, 0.11]),
    (3, "A deep learning display that controls your mood", [0.1, 0.05, 0.08, 0.3, 0.6]),
]
# Insert each row; the %s placeholders are filled in by the driver
for block in text_blocks:
    row_id, text, vector = block
    session.execute(
        f"INSERT INTO {keyspace}.vector_test (id, text, vector) VALUES (%s, %s, %s)",
        (row_id, text, vector)
    )
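Because the vector column is declared with a fixed dimension of 5, every embedding must contain exactly 5 values. A client-side check before inserting can surface mistakes earlier. This is a minimal sketch, and the helper name `check_dimensions` is made up for illustration.

```python
def check_dimensions(blocks, dimension):
    """Raise ValueError if any (id, text, vector) tuple has the wrong vector length."""
    for block_id, _text, vector in blocks:
        if len(vector) != dimension:
            raise ValueError(
                f"Row {block_id}: expected {dimension} values, got {len(vector)}"
            )


# Usage with the names from the example above:
# check_dimensions(text_blocks, v_dimension)
```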
Perform a similarity search
Find documents that are close to a specific vector embedding.
ann_query = (
    f"SELECT id, text, similarity_cosine(vector, [0.15, 0.1, 0.1, 0.35, 0.55]) as sim FROM {keyspace}.vector_test "
    "ORDER BY vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55] LIMIT 2"
)
# Print the two nearest rows with their similarity scores
for row in session.execute(ann_query):
    print(f"[{row.id}] \"{row.text}\" (sim: {row.sim:.4f})")
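To build intuition for which rows the search should return, the plain cosine similarity between the query vector and each sample embedding can be computed locally. The sketch below uses only the standard library, and the function names are made up for this example. The score reported by the database may be scaled differently from the raw cosine, so compare the ordering rather than the exact numbers.

```python
import math

# Sample data copied from the example above
text_blocks = [
    (1, "ChatGPT integrated sneakers that talk to you", [0.1, 0.15, 0.3, 0.12, 0.05]),
    (2, "An AI quilt to help you sleep forever", [0.45, 0.09, 0.01, 0.2, 0.11]),
    (3, "A deep learning display that controls your mood", [0.1, 0.05, 0.08, 0.3, 0.6]),
]
query_vector = [0.15, 0.1, 0.1, 0.35, 0.55]


def cosine(a, b):
    """Raw cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(blocks, query):
    """Return (id, similarity) pairs sorted from most to least similar."""
    scored = [(block_id, cosine(vector, query)) for block_id, _text, vector in blocks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_cosine(text_blocks, query_vector)
for block_id, score in ranked[:2]:
    print(f"[{block_id}] raw cosine: {score:.4f}")
```

With this sample data, rows 3 and 2 rank highest, in that order.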
You have now connected the Python driver to your database, loaded a set of vector embeddings, and run a similarity search to find the vectors closest to the one in your query.