Knowledge Base Search on Proprietary Data powered by Astra DB Serverless

This notebook guides you through setting up RAGStack with Astra DB Serverless Search, OpenAI, and CassIO to implement generative Q&A over your own documentation.

ChatGPT excels at answering questions and offers a nice dialog interface to ask questions and get answers, but it only knows about topics from its training data.

What do you do when you have your own documents? How can you use generative AI and LLMs to get insights from them? You can use Retrieval-Augmented Generation (RAG) to create a Q&A bot that answers specific questions about your documentation.

You can create this in two steps:

  1. Analyze and store existing documentation.
  2. Provide search capabilities for the LLM to retrieve your documentation.

Ideally, you embed the data as vectors and store them in a vector database, then use the LLM on top of that database.

This notebook demonstrates a basic two-step RAG technique that enables GPT to answer questions about your own documentation, using Astra DB Serverless Search as the reference library.
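To make the embed, store, and retrieve idea concrete, here is a minimal in-memory sketch. It is not how Astra DB or OpenAI embeddings work internally: the `embed` function is a toy bag-of-words stand-in for a real embedding model, `ToyVectorStore` is a hypothetical class invented for this illustration, and the two sample sentences are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows = []  # list of (vector, original text) pairs

    def add(self, text: str) -> None:
        # Step 1: analyze (embed) and store the document.
        self.rows.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list:
        # Step 2: embed the query and return the k most similar documents.
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add("The return policy allows refunds within 30 days")
store.add("Support is available by email on weekdays")
top = store.search("what is the return policy", k=1)
print(top)
```

A real embedding model captures meaning rather than word overlap, but the flow is the same: documents and queries become vectors, and retrieval is a nearest-neighbor lookup.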

Get started with this notebook

  1. Install the following libraries.

    pip install \
        "ragstack-ai" \
        "openai" \
        "pypdf" \
        "python-dotenv"
  2. Import dependencies.

    import os
    from dotenv import load_dotenv
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_community.vectorstores import Cassandra
    from langchain_community.document_loaders import TextLoader
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.indexes.vectorstore import VectorStoreIndexWrapper
    from cassandra.cluster import Cluster
    from cassandra.auth import PlainTextAuthProvider

    You will need a secure connect bundle and a user with access permission. For demo purposes, the "administrator" role will work fine. For more, see Prerequisites.

  3. Initialize the environment variables.

    # Load variables from a local .env file, if one exists, into the environment.
    load_dotenv()

    ASTRA_DB_SECURE_BUNDLE_PATH = os.getenv("ASTRA_DB_SECURE_BUNDLE_PATH")
    ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
    ASTRA_DB_KEYSPACE = os.getenv("ASTRA_DB_NAMESPACE")
    ASTRA_DB_TABLE_NAME = os.getenv("ASTRA_DB_COLLECTION")
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
  4. Retrieve the text of a short story to index in the vector store, and set it as the sample data. The story is "The Cask of Amontillado" by Edgar Allan Poe.

    # In a notebook cell, the leading "!" runs curl as a shell command.
    !curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
    SAMPLEDATA = ["amontillado.txt"]
  5. Connect to Astra DB Serverless.

    cluster = Cluster(cloud={"secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH},
                      auth_provider=PlainTextAuthProvider("token", ASTRA_DB_APPLICATION_TOKEN))
    session = cluster.connect()
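Because `os.getenv` returns `None` for a variable that is not set, a missing value only shows up later as a connection or authentication error. The sketch below is one possible pre-flight check to run before connecting. The helper `missing_settings` and the `example_env` dictionary are hypothetical names introduced for this illustration; the variable names are the ones read in step 3.

```python
import os
from typing import Mapping

# Environment variable names read in step 3 of this notebook.
REQUIRED_SETTINGS = (
    "ASTRA_DB_SECURE_BUNDLE_PATH",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_NAMESPACE",
    "ASTRA_DB_COLLECTION",
    "OPENAI_API_KEY",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list:
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with a hypothetical, partially filled environment.
example_env = {"ASTRA_DB_APPLICATION_TOKEN": "AstraCS:example", "OPENAI_API_KEY": ""}
missing = missing_settings(example_env)
print(missing)
```

In the notebook itself, you could call `missing_settings()` with no arguments and stop early if the returned list is not empty.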

Read files, create embeddings, and store in Astra DB Serverless

CassIO seamlessly integrates with RAGStack and LangChain, offering Cassandra-specific tools for many tasks. This example uses vector stores, indexers, embeddings, and queries, with OpenAI for LLM services.

  1. Loop through each file and load it into the vector store.

    documents = []
    for filename in SAMPLEDATA:
        path = os.path.join(os.getcwd(), filename)
        new_docs = []  # Reset per file so an unsupported file adds nothing.

        if filename.endswith(".pdf"):
            loader = PyPDFLoader(path)
            new_docs = loader.load_and_split()
            print(f"Processed pdf file: {filename}")
        elif filename.endswith(".txt"):
            loader = TextLoader(path)
            new_docs = loader.load_and_split()
            print(f"Processed txt file: {filename}")
        else:
            print(f"Unsupported file type: {filename}")

        if new_docs:
            documents.extend(new_docs)
  2. Initialize the vector store with the documents and the OpenAI embeddings.

    cass_vstore = Cassandra.from_documents(
        documents=documents,
        embedding=OpenAIEmbeddings(),
        session=session,
        keyspace=ASTRA_DB_KEYSPACE,
        table_name=ASTRA_DB_TABLE_NAME,
    )
  3. Empty the list of file names so that the same files are not accidentally loaded again.

    SAMPLEDATA = []
    print("\nProcessing done.")
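The loaders' `load_and_split()` call breaks each file into smaller chunks before they are embedded, so that each stored vector represents a passage rather than a whole file. The sketch below illustrates the general idea with a fixed-size character window and overlap. It is a simplification for illustration only, not LangChain's splitter; the function name `split_text` and its parameters are assumptions.

```python
def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap keeps a sentence that straddles a chunk boundary partly visible in both neighboring chunks, which tends to help retrieval at the cost of storing some text twice.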

Query the vector store and execute some "searches" against it

  1. Start with a similarity search using the vector store's implementation.

    prompt = "Who is Luchesi?"
    
    matched_docs = cass_vstore.similarity_search(query=prompt, k=1)
    
    for i, d in enumerate(matched_docs):
        print(f"\n## Document {i}\n")
        print(d.page_content)
  2. To implement Q/A over documents, you need to perform some additional steps. Create an Index on top of the vector store.

    index = VectorStoreIndexWrapper(vectorstore=cass_vstore)
  3. Create a retriever from the Index. A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever.

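  4. Query the index with the prompt. The index retrieves the documents most relevant to the prompt and passes them to the LLM to generate an answer:

    prompt = "Who is Luchesi?"
    # Pass the LLM explicitly; recent LangChain versions require it.
    index.query(question=prompt, llm=ChatOpenAI())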
  5. Alternatively, use a retrieval chain with a custom prompt:

    from langchain.chains import RetrievalQA
    from langchain_openai import OpenAI
    from langchain.prompts import ChatPromptTemplate

    template = """
    You are Marv, a sarcastic but factual chatbot. End every response with a joke related to the question.
    Context: {context}
    Question: {question}
    Your answer:
    """
    prompt = ChatPromptTemplate.from_template(template)

    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        retriever=cass_vstore.as_retriever(),
        chain_type_kwargs={"prompt": prompt},
    )

    # Pass the question as plain text; the chain fills in {context} and {question}.
    result = qa.run("Who is Luchesi?")
    result
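Conceptually, a retrieval chain of this kind does three things: it asks the retriever for the documents most relevant to the question, inserts them into the prompt's `{context}` slot, and sends the completed prompt to the LLM. The following sketch shows that flow with plain Python stand-ins. It is not LangChain's implementation: `answer`, `fake_retrieve`, and `echo_llm` are hypothetical names, and the "LLM" here simply echoes the prompt so that the assembled text is visible.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Your answer:"
)


def answer(
    question: str,
    retrieve: Callable[[str, int], list],
    llm: Callable[[str], str],
    k: int = 2,
) -> str:
    """Retrieve k documents, place them in the prompt, and call the LLM."""
    docs = retrieve(question, k)
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(docs), question=question)
    return llm(prompt)


def fake_retrieve(query: str, k: int) -> list:
    """Stand-in retriever: returns the first k passages of a fixed, made-up corpus."""
    corpus = ["first passage", "second passage", "third passage"]
    return corpus[:k]


def echo_llm(prompt: str) -> str:
    """Stand-in LLM: returns the prompt unchanged so you can inspect it."""
    return prompt


result = answer("What do the passages say?", fake_retrieve, echo_llm, k=2)
print(result)
```

Because the retriever is only a callable that returns documents, anything that can do so can take its place, which is why a retriever is a more general interface than a vector store.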


© 2024 DataStax