Integrate LangChain with Astra DB Serverless
LangChain can use Astra DB Serverless to store and retrieve vectors for ML applications.
Prerequisites
This guide requires the following:
- An active Astra account
- An active Serverless (Vector) database
- An application token with the Database Administrator role
- Python 3.8 or later
- pip 23.0 or later
- The required Python packages:
pip install "langchain>=0.2" "langchain-astradb>=0.4" \
  "langchain-openai>=0.1" "datasets>=3.0" "pypdf>=5.0" \
  "python-dotenv>=1.0"
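Before installing, you can confirm that the interpreter meets the version floor above. This is a minimal optional check, not part of the integration itself; `check_python` is an invented helper name.

```python
import sys

def check_python(min_version=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version

if not check_python():
    raise RuntimeError("Python 3.8 or later is required for this guide.")
```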
Connect to the Serverless (Vector) database
- Import libraries and connect to the database.

  Local install

  Create a .env file in the folder where you will create your Python script. Populate the file with the Astra DB application token and endpoint values from the Database Details section of your database’s Overview tab, and your OpenAI API key.

  .env
  ASTRA_DB_APPLICATION_TOKEN="TOKEN"
  ASTRA_DB_API_ENDPOINT="API_ENDPOINT"
  ASTRA_DB_KEYSPACE="default_keyspace" # A keyspace that exists in the database
  OPENAI_API_KEY="API_KEY"

  Google Colab

  Set the same values as environment variables in the notebook:

  import os
  from getpass import getpass

  os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("ASTRA_DB_APPLICATION_TOKEN = ")
  os.environ["ASTRA_DB_API_ENDPOINT"] = input("ASTRA_DB_API_ENDPOINT = ")
  if _desired_keyspace := input("ASTRA_DB_KEYSPACE (optional) = "):
      os.environ["ASTRA_DB_KEYSPACE"] = _desired_keyspace
  os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")
  The database endpoint format is https://ASTRA_DB_ID-ASTRA_DB_REGION.apps.astra.datastax.com.
- Import your dependencies. To avoid a namespace collision, don’t name the file langchain.py.

  Local install

  integrate.py
  import os

  from langchain_astradb import AstraDBVectorStore
  from langchain_core.documents import Document
  from langchain_openai import OpenAIEmbeddings
  from datasets import load_dataset
  from dotenv import load_dotenv

  Google Colab

  from langchain_astradb import AstraDBVectorStore
  from langchain_core.documents import Document
  from langchain_openai import OpenAIEmbeddings
  from datasets import load_dataset
- Load your environment variables.

  Local install

  integrate.py
  load_dotenv()

  ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
  ASTRA_DB_API_ENDPOINT = os.environ["ASTRA_DB_API_ENDPOINT"]
  ASTRA_DB_KEYSPACE = os.environ.get("ASTRA_DB_KEYSPACE")
  OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

  Google Colab

  ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
  ASTRA_DB_API_ENDPOINT = os.environ["ASTRA_DB_API_ENDPOINT"]
  ASTRA_DB_KEYSPACE = os.environ.get("ASTRA_DB_KEYSPACE")
  OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

  If you’re using Microsoft Azure OpenAI, include these additional environment variables:

  OPENAI_API_TYPE="azure"
  OPENAI_API_VERSION="2023-05-15"
  OPENAI_API_BASE="https://RESOURCE_NAME.openai.azure.com"
  OPENAI_API_KEY="API_KEY"
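For intuition, load_dotenv() effectively turns simple KEY="VALUE" lines like the .env file above into environment variables. The stdlib-only sketch below illustrates that idea; it is not the real python-dotenv implementation (which also handles comments, quoting, and interpolation), and `parse_env_file` is an invented helper name.

```python
import tempfile

def parse_env_file(path):
    """Tiny parser for simple KEY="VALUE" lines — illustration only."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and anything without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    return values

# Write a throwaway file shaped like the .env above, then parse it back.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write('ASTRA_DB_APPLICATION_TOKEN="TOKEN"\n')
    f.write('ASTRA_DB_API_ENDPOINT="https://example.apps.astra.datastax.com"\n')
    env_path = f.name

env = parse_env_file(env_path)
print(env["ASTRA_DB_API_ENDPOINT"])
```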
Create embeddings from text

- Specify the embeddings model, database, and collection to use. If the collection does not exist, it is created automatically.

  integrate.py
  embedding = OpenAIEmbeddings()
  vstore = AstraDBVectorStore(
      collection_name="test",
      embedding=embedding,
      token=ASTRA_DB_APPLICATION_TOKEN,
      api_endpoint=ASTRA_DB_API_ENDPOINT,
      namespace=ASTRA_DB_KEYSPACE,
  )
- Load a small dataset of philosophical quotes with the Python datasets library.

  integrate.py
  philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
  print("An example entry:")
  print(philo_dataset[16])
- Process metadata and convert to LangChain documents.

  integrate.py
  docs = []
  for entry in philo_dataset:
      metadata = {"author": entry["author"]}
      if entry["tags"]:
          # Add metadata tags to the metadata dictionary
          for tag in entry["tags"].split(";"):
              metadata[tag] = "y"
      # Add a LangChain document with the quote and metadata tags
      doc = Document(page_content=entry["quote"], metadata=metadata)
      docs.append(doc)
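The tag-to-metadata conversion can be checked on a single hand-written entry. The sample below is hypothetical (the author, quote, and "knowledge;ethics" tag string are invented to match the shape of the dataset rows, not taken from the actual dataset):

```python
# A made-up entry in the same shape as a philosopher-quotes row.
entry = {
    "author": "aristotle",
    "quote": "Quality is not an act, it is a habit.",
    "tags": "knowledge;ethics",
}

metadata = {"author": entry["author"]}
if entry["tags"]:
    # Each semicolon-separated tag becomes a flag-style metadata key.
    for tag in entry["tags"].split(";"):
        metadata[tag] = "y"

print(metadata)
```

The resulting dictionary carries the author plus one "y" flag per tag, which is what enables metadata filtering at query time.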
- Compute embeddings for each document and store them in the database.

  integrate.py
  inserted_ids = vstore.add_documents(docs)
  print(f"\nInserted {len(inserted_ids)} documents.")
Verify integration
Show quotes that are similar to a specific quote.
results = vstore.similarity_search("Our life is what we make of it", k=3)
for res in results:
print(f"* {res.page_content} [{res.metadata}]")
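Conceptually, similarity_search embeds the query text and returns the k stored documents whose vectors are most similar to it. The pure-Python sketch below illustrates that ranking with cosine similarity over toy three-dimensional vectors (the vectors and document names are invented; real OpenAI embeddings have far more dimensions, and the actual search runs inside Astra DB):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" standing in for real stored document vectors.
toy_docs = {
    "quote_a": [0.9, 0.1, 0.0],
    "quote_b": [0.1, 0.9, 0.0],
    "quote_c": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

# Rank documents by similarity to the query, highest first (k=2).
top2 = sorted(toy_docs, key=lambda d: cosine(query_vector, toy_docs[d]), reverse=True)[:2]
print(top2)
```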
Run the code
Run the code you defined earlier.
python integrate.py