Integrate LangChain with Astra DB Serverless (Vector)


The DataStax Astra DB Serverless (Vector) documentation site is currently in Public Preview and is provided on an “AS IS” basis, without warranty or indemnity of any kind. For more, see the DataStax Preview Terms.

LangChain can use Astra DB Serverless (Vector) to store and retrieve vectors for ML applications.

To get started, you need an active Astra account.

Set up the integration

Install dependencies

  1. Verify that pip is version 23.0 or higher.

    pip --version
  2. Upgrade pip if needed.

    python -m pip install --upgrade pip
  3. Install the dependencies. The pinned versions below require Python 3.8 or higher.

    pip install "langchain==0.0.339" "astrapy==0.6.0" \
        "datasets==2.14.7" "openai==1.3.0" "pypdf==3.17.1" \
        "tiktoken==0.5.1"
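As an optional sanity check (a sketch, not part of the tutorial code), you can confirm that each dependency is importable on your Python path before continuing:

```python
import importlib.util

# The packages installed by the pip command above
PACKAGES = ["langchain", "astrapy", "datasets", "openai", "pypdf", "tiktoken"]

# find_spec returns None when a package cannot be located
status = {pkg: importlib.util.find_spec(pkg) is not None for pkg in PACKAGES}
for pkg, ok in status.items():
    print(f"{pkg}: {'found' if ok else 'MISSING'}")
```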

Create a vector database

  1. Create a vector-enabled Astra database in the Astra Portal, and note your database’s API endpoint URL.

  2. Create a token with Database Administrator permissions in the Astra Connect tab.

  3. Set your environment variables.

    • Local install

      Create a .env file in the root of your project with the values from your Astra Connect tab.

      .env
      ASTRA_DB_APPLICATION_TOKEN="<AstraCS:...>"
      ASTRA_DB_API_ENDPOINT="<Astra DB API endpoint>"
      OPENAI_API_KEY="sk-..."

    • Google Colab

      Set the values in your notebook environment, entering them when prompted.

      import os
      from getpass import getpass

      os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("ASTRA_DB_APPLICATION_TOKEN = ")
      os.environ["ASTRA_DB_API_ENDPOINT"] = input("ASTRA_DB_API_ENDPOINT = ")
      os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")

    The endpoint format is https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com.

  4. Import your dependencies.

    • Local install

      integrate.py
      import os
      import langchain.vectorstores
      from langchain.schema import Document
      from langchain.embeddings import OpenAIEmbeddings

      from datasets import load_dataset
      from dotenv import load_dotenv

    • Google Colab

      import langchain.vectorstores
      from langchain.schema import Document
      from langchain.embeddings import OpenAIEmbeddings

      from datasets import load_dataset
  5. Load your environment variables.

    • Local install

      load_dotenv()

      ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
      ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
      OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

    • Google Colab

      The variables were set directly in os.environ in step 3, so no file loading is needed.

      ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
      ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
      OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

    See Advanced configuration for Azure OpenAI values.

    Don’t name your file langchain.py; it would shadow the langchain package and cause a namespace collision.
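Either way, the same three variables end up in scope. As an optional sketch (the helper names here are our own, not part of the integration), you can fail fast on missing values and loosely check the endpoint against the documented shape:

```python
import os
import re

# Documented endpoint shape:
# https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com
ENDPOINT_RE = re.compile(
    r"^https://[0-9a-f-]+-[a-z0-9-]+\.apps\.astra\.datastax\.com$"
)

def missing_vars(names):
    """Return the required environment variables that are unset or empty."""
    return [n for n in names if not os.environ.get(n)]

def looks_like_astra_endpoint(url):
    """Loose check that url matches the documented API endpoint shape."""
    return bool(ENDPOINT_RE.match(url))

REQUIRED = ["ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_API_ENDPOINT", "OPENAI_API_KEY"]
print("missing:", missing_vars(REQUIRED))
```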

Create embeddings from text

  1. Use LangChain to create embeddings from text.

    integrate.py
    embedding = OpenAIEmbeddings()
    vstore = langchain.vectorstores.AstraDB(
        embedding=embedding,
        collection_name="test",
        token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
        api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
    )
  2. Load a small dataset of philosophical quotes with the Hugging Face datasets library.

    integrate.py
    philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
    print("An example entry:")
    print(philo_dataset[16])
  3. Process metadata and convert to LangChain documents.

    integrate.py
    docs = []
    for entry in philo_dataset:
        metadata = {"author": entry["author"]}
        if entry["tags"]:
            # Add metadata tags to the metadata dictionary
            for tag in entry["tags"].split(";"):
                metadata[tag] = "y"
        # Add a LangChain document with the quote and metadata tags
        doc = Document(page_content=entry["quote"], metadata=metadata)
        docs.append(doc)
  4. Compute embeddings for each document and store in the vector database.

    integrate.py
    inserted_ids = vstore.add_documents(docs)
    print(f"\nInserted {len(inserted_ids)} documents.")
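The tag handling in step 3 is easiest to see on a single row. Here is a sketch with a hypothetical entry shaped like the dataset’s rows (no download required):

```python
# Hypothetical entry shaped like a philosopher-quotes row
entry = {"author": "aristotle", "quote": "An example quote.", "tags": "knowledge;ethics"}

metadata = {"author": entry["author"]}
if entry["tags"]:
    # Each semicolon-separated tag becomes its own metadata key with value "y"
    for tag in entry["tags"].split(";"):
        metadata[tag] = "y"

print(metadata)
# {'author': 'aristotle', 'knowledge': 'y', 'ethics': 'y'}
```

Storing each tag as its own key makes the tags usable later as metadata filters on similarity searches.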

Verify integration

Show quotes that are similar to a specific quote.

integrate.py
results = vstore.similarity_search("Our life is what we make of it", k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

Run the code

Run the code you defined above.

python integrate.py

Advanced configuration

If you’re using Azure OpenAI, include these additional environment variables:

OPENAI_API_TYPE="azure"
OPENAI_API_VERSION="2023-05-15"
OPENAI_API_BASE="https://<your resource name>.openai.azure.com"
OPENAI_API_KEY="<openai-api-key>"

Next steps

  • Tutorial: Build a chatbot with LangChain

    Learn how to use Astra DB Serverless (Vector) with LangChain to do retrieval augmented generation (RAG) on a documentation site.
