Integrate LangChain with Astra DB Serverless

15 min

LangChain can use Astra DB Serverless to store and retrieve vectors for ML applications.

Prerequisites

This guide requires the following:

Connect to the Serverless (Vector) database

  1. Import libraries and connect to the database.

    • Local install

    Create a .env file in the folder where you will create your Python script. Populate the file with the Astra token and endpoint values from the Database Details section of your database’s Overview tab, and your OpenAI API key.

    .env
    ASTRA_DB_APPLICATION_TOKEN="TOKEN"
    ASTRA_DB_API_ENDPOINT="API_ENDPOINT"
    ASTRA_DB_NAMESPACE="default_keyspace" # A namespace that exists in the database
    OPENAI_API_KEY="API_KEY"

    • Google Colab

    import os
    from getpass import getpass
    os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("ASTRA_DB_APPLICATION_TOKEN = ")
    os.environ["ASTRA_DB_API_ENDPOINT"] = input("ASTRA_DB_API_ENDPOINT = ")
    if _desired_namespace := input("ASTRA_DB_NAMESPACE (optional) = "):
        os.environ["ASTRA_DB_NAMESPACE"] = _desired_namespace
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")

    The endpoint format is https://ASTRA_DB_ID-ASTRA_DB_REGION.apps.astra.datastax.com.

  2. Import your dependencies. To avoid a namespace collision, don’t name the file langchain.py.

    • Local install

    integrate.py
    import os
    from langchain_astradb import AstraDBVectorStore
    from langchain_core.documents import Document
    from langchain_openai import OpenAIEmbeddings

    from datasets import load_dataset
    from dotenv import load_dotenv

    • Google Colab

    from langchain_astradb import AstraDBVectorStore
    from langchain_core.documents import Document
    from langchain_openai import OpenAIEmbeddings

    from datasets import load_dataset

  3. Load your environment variables.

    • Local install

    integrate.py
    load_dotenv()

    ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
    ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
    ASTRA_DB_NAMESPACE = os.environ.get("ASTRA_DB_NAMESPACE")
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

    • Google Colab

    ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
    ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
    ASTRA_DB_NAMESPACE = os.environ.get("ASTRA_DB_NAMESPACE")
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

    If you’re using Microsoft Azure OpenAI, include these additional environment variables:

    OPENAI_API_TYPE="azure"
    OPENAI_API_VERSION="2023-05-15"
    OPENAI_API_BASE="https://RESOURCE_NAME.openai.azure.com"
    OPENAI_API_KEY="API_KEY"
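Because os.environ.get returns None for a missing variable, a typo in the .env file only surfaces later as a connection error. The following is a minimal sketch of an early check, not part of the original guide: the check_config helper is hypothetical, ASTRA_DB_NAMESPACE is treated as optional, and the pattern assumes the database ID is a UUID and the region uses lowercase letters, digits, and hyphens.

```python
import re

# Variables this guide reads. Treating ASTRA_DB_NAMESPACE as optional is an assumption.
REQUIRED_VARS = (
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "OPENAI_API_KEY",
)

# Matches https://ASTRA_DB_ID-ASTRA_DB_REGION.apps.astra.datastax.com
# Assumption: the ID is a lowercase UUID; the region is letters, digits, hyphens.
ENDPOINT_PATTERN = re.compile(
    r"^https://[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
    r"-[a-z0-9-]+\.apps\.astra\.datastax\.com/?$"
)


def check_config(env):
    """Return a list of problems found in a mapping of environment variables."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    endpoint = env.get("ASTRA_DB_API_ENDPOINT")
    if endpoint and not ENDPOINT_PATTERN.match(endpoint):
        problems.append("ASTRA_DB_API_ENDPOINT does not match the expected format")
    return problems


# Placeholder values for illustration only; the OpenAI key is deliberately missing.
sample_env = {
    "ASTRA_DB_APPLICATION_TOKEN": "TOKEN",
    "ASTRA_DB_API_ENDPOINT": (
        "https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com"
    ),
}
print(check_config(sample_env))
```

In a script, calling check_config(os.environ) right after load_dotenv() and stopping when the list is non-empty would report configuration mistakes before any network call is made.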

Create embeddings from text

  1. Specify the embeddings model, database, and collection to use. If the collection does not exist, it is created automatically.

    integrate.py
    embedding = OpenAIEmbeddings()
    vstore = AstraDBVectorStore(
        embedding=embedding,
        namespace=ASTRA_DB_NAMESPACE,
        collection_name="test",
        token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
        api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
    )

  2. Load a small dataset of philosophical quotes with the Hugging Face datasets library.

    integrate.py
    philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
    print("An example entry:")
    print(philo_dataset[16])

  3. Process the metadata and convert each entry to a LangChain document.

    integrate.py
    docs = []
    for entry in philo_dataset:
        metadata = {"author": entry["author"]}
        if entry["tags"]:
            # Add metadata tags to the metadata dictionary
            for tag in entry["tags"].split(";"):
                metadata[tag] = "y"
        # Add a LangChain document with the quote and metadata tags
        doc = Document(page_content=entry["quote"], metadata=metadata)
        docs.append(doc)

  4. Compute embeddings for each document and store them in the database.

    integrate.py
    inserted_ids = vstore.add_documents(docs)
    print(f"\nInserted {len(inserted_ids)} documents.")
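To make the metadata step concrete, the sketch below applies the same tag-splitting logic to a single entry using plain dictionaries, so it runs without LangChain or the dataset. The build_metadata helper and the sample entry are hypothetical and only illustrate the shape of the result.

```python
def build_metadata(entry):
    """Mirror the metadata step: the author plus one "y" flag per tag."""
    metadata = {"author": entry["author"]}
    if entry["tags"]:
        # Tags arrive as one semicolon-separated string, for example "ethics;knowledge"
        for tag in entry["tags"].split(";"):
            metadata[tag] = "y"
    return metadata


# Hypothetical entry for illustration; not taken from the dataset.
sample_entry = {
    "author": "example_author",
    "quote": "An example quote.",
    "tags": "ethics;knowledge",
}
print(build_metadata(sample_entry))
# prints {'author': 'example_author', 'ethics': 'y', 'knowledge': 'y'}
```

One plausible reason for this layout is that each tag becomes its own key, so a later metadata filter can be a simple equality check on that key.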

Verify integration

Show quotes that are similar to a specific quote.

integrate.py
results = vstore.similarity_search("Our life is what we make of it", k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
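Conceptually, a similarity search embeds the query text and ranks the stored vectors by how close they are to it. The sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors. It is not how Astra DB indexes or searches data, the metric a collection actually uses depends on its configuration, and the cosine_similarity and top_k helpers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=3):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings" for illustration; real embedding vectors have many more dimensions.
toy_store = [
    ("quote A", [1.0, 0.0, 0.0]),
    ("quote B", [0.9, 0.1, 0.0]),
    ("quote C", [0.0, 1.0, 0.0]),
    ("quote D", [0.0, 0.0, 1.0]),
]
print(top_k([0.8, 0.2, 0.0], toy_store, k=2))
# prints ['quote B', 'quote A']
```

In the guide's code, the k=3 argument plays the same role as k here: it limits the results to the three closest matches.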

Run the code

Run the code you defined earlier.

python integrate.py

© 2024 DataStax | Privacy policy | Terms of use