Evaluating RAG Pipelines with LangChain

This notebook demonstrates how to evaluate a RAG pipeline using LangChain’s QA Evaluator. This evaluator helps measure the correctness of a response given some context, making it ideally suited for evaluating a RAG pipeline. By the end of this notebook, you will have a RAG question-answering pipeline whose responses you can measure.

In this tutorial, you will use an Astra DB Serverless vector store, an OpenAI embedding model, an OpenAI LLM, LangChain, and LangSmith.

Prerequisites

You will need a vector-enabled Astra DB Serverless database and an OpenAI account.

See the Notebook Prerequisites page for more details.

  1. Create a vector-enabled Astra DB Serverless database.

  2. Create an OpenAI account.

  3. Within your database, create an Astra DB keyspace.

  4. Within your database, create an Astra DB Access Token with Database Administrator permissions.

  5. Get your Astra DB Serverless API Endpoint: https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com

  6. Initialize the environment variables in a .env file.

    ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
    ASTRA_DB_API_ENDPOINT=https://9d9b9999-999e-9999-9f9a-9b99999dg999-us-east-2.apps.astra.datastax.com
    ASTRA_DB_COLLECTION=test
    OPENAI_API_KEY=sk-f99...
  7. Enter your settings for Astra DB Serverless and OpenAI:

    import os
    from dotenv import load_dotenv

    load_dotenv()  # load the variables defined in the .env file
    astra_token = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
    astra_endpoint = os.getenv("ASTRA_DB_API_ENDPOINT")
    collection = os.getenv("ASTRA_DB_COLLECTION")
    openai_api_key = os.getenv("OPENAI_API_KEY")

You will also need a LangSmith account and the following environment variables set. LANGCHAIN_PROJECT defaults to "default" if not specified.

LANGCHAIN_TRACING_V2="true"
LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_API_KEY="<your-api-key>"
LANGCHAIN_PROJECT="Project:"
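If you prefer to set these from Python (for example, at the top of the notebook) rather than in your shell or a .env file, a minimal sketch using os.environ looks like the following. The project name shown is a placeholder; substitute your own values.

    import os

    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
    os.environ["LANGCHAIN_PROJECT"] = "rag-eval-demo"  # placeholder project name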

Setup

ragstack-ai includes all the packages you need to build a RAG pipeline.

The additional langchain[openai] package includes LangSmith.

  1. Install necessary libraries:

    pip install ragstack-ai "langchain[openai]"
  2. Import dependencies:

    import os
    from dotenv import load_dotenv
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.chat_models import ChatOpenAI
    from langchain.prompts import ChatPromptTemplate
    from langchain.schema.output_parser import StrOutputParser
    from langchain.schema.runnable import RunnablePassthrough
    from langchain_astradb import AstraDBVectorStore
    from langchain_community.document_loaders import TextLoader
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.evaluation import EvaluatorType
    from langchain.smith import RunEvalConfig, run_on_dataset
    from langsmith import Client
    from langsmith.utils import LangSmithError
    from langchain.chains import RetrievalQA

Configure embedding model and populate vector store

  1. Configure your embedding model and vector store:

    embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
    vstore = AstraDBVectorStore(
        collection_name=collection,
        embedding=embedding,
        token=astra_token,
        api_endpoint=astra_endpoint,
    )
    print("Astra vector store configured")
  2. Retrieve and process text for the vector store:

    curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
    SAMPLEDATA = ["amontillado.txt"]
  3. Loop through each file and load it into the vector store. You downloaded amontillado.txt in the previous step, but this loop can also process PDFs.

    SAMPLEDATA = [] clears the list so the same files aren’t processed twice.

    documents = []
    for filename in SAMPLEDATA:
        path = os.path.join(os.getcwd(), filename)

        if filename.endswith(".pdf"):
            loader = PyPDFLoader(path)
            new_docs = loader.load_and_split()
            print(f"Processed pdf file: {filename}")
        elif filename.endswith(".txt"):
            loader = TextLoader(path)
            new_docs = loader.load_and_split()
            print(f"Processed txt file: {filename}")
        else:
            # Skip unsupported files so new_docs from a previous iteration
            # is not reused (and the first file cannot raise a NameError).
            print(f"Unsupported file type: {filename}")
            continue

        if len(new_docs) > 0:
            documents.extend(new_docs)

    SAMPLEDATA = []

    print("\nProcessing done.")
  4. Create embeddings by inserting your documents into the vector store. The final print statement verifies that the documents were embedded.

    inserted_ids = vstore.add_documents(documents)
    print(f"\nInserted {len(inserted_ids)} documents.")
    
    print(vstore.astra_db.collection(collection).find())
  5. Retrieve context from your vector database, and pass it to the model with the prompt.

    retriever = vstore.as_retriever(search_kwargs={"k": 3})
    
    prompt_template = """
    Answer the question based only on the supplied context. If you don't know the answer, say you don't know the answer.
    Context: {context}
    Question: {question}
    Your answer:
    """
    prompt = ChatPromptTemplate.from_template(prompt_template)
    model = ChatOpenAI(openai_api_key=openai_api_key)
    
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
    )
    
    chain.invoke(
        "In the given context, what motivates the narrator, Montresor, to seek revenge against Fortunato?"
    )
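
    As an optional check (not part of the original notebook), you can inspect what the retriever returns for a query before relying on it in the chain. This sketch assumes the retriever configured above and uses the retriever's standard get_relevant_documents method.

    # Preview the top-k chunks retrieved for a sample question.
    docs = retriever.get_relevant_documents(
        "What motivates Montresor to seek revenge against Fortunato?"
    )
    for doc in docs:
        print(doc.page_content[:200], "...")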

Evaluate RAG responses

LangChain offers several built-in evaluators to test the efficacy of your RAG pipeline. Because you’ve now created a RAG pipeline, the QA Evaluator is a good fit.

Remember that LLMs are probabilistic: responses are not exactly the same for each invocation. Evaluation results will differ between invocations, and they may be imperfect. Use these metrics as part of a larger, holistic testing strategy for your RAG application.
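
If run-to-run variation is a concern, one common mitigation (an option, not something this notebook requires) is to pin the answering model's sampling temperature to 0, which reduces, but does not eliminate, variance between invocations:

    # Optional: a lower-variance model for more repeatable evaluation runs.
    model = ChatOpenAI(openai_api_key=openai_api_key, temperature=0)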

  1. Set up LangSmith for evaluation.

    LANGCHAIN_TRACING_V2 = os.getenv("LANGCHAIN_TRACING_V2")
    LANGCHAIN_ENDPOINT = os.getenv("LANGCHAIN_ENDPOINT")
    LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
    LANGCHAIN_PROJECT = os.getenv("LANGCHAIN_PROJECT")
  2. Define evaluation questions and reference answers for your data. The first two reference answers are intentionally incorrect, so you can see how the evaluators behave when a response does not match the reference.

    eval_questions = [
        "What motivates the narrator, Montresor, to seek revenge against Fortunato?",
        "What are the major themes in this story?",
        "What is the significance of the story taking place during the carnival season?",
        "What literary techniques does Poe use to create suspense and tension in the story?",
    ]
    
    eval_answers = [
        "Montresor is insulted by Lenora and seeks revenge.",  # Incorrect Answer
        "The major themes are happiness and trustworthiness.",  # Incorrect Answer
        "The carnival season is a time of celebration and merriment, which contrasts with the sinister events of the story.",
        "Poe uses foreshadowing, irony, and symbolism to create suspense and tension.",
    ]
    
    examples = zip(eval_questions, eval_answers)
  3. Create your dataset in LangSmith. This code first checks if the dataset exists, and if not, creates one with your evaluation questions.

    client = Client()
    dataset_name = "test_eval_dataset"
    
    try:
        dataset = client.read_dataset(dataset_name=dataset_name)
        print("using existing dataset: ", dataset.name)
    except LangSmithError:
        dataset = client.create_dataset(
            dataset_name=dataset_name,
            description="sample evaluation dataset",
        )
        for question, answer in examples:
            client.create_example(
                inputs={"input": question},
                outputs={"answer": answer},
                dataset_id=dataset.id,
            )
    
        print("Created a new dataset: ", dataset.name)
  4. Since chains and agents can be stateful (they can have memory), create a constructor function and pass it to the run_on_dataset method so that any state in the chain is not reused when evaluating individual examples.

    def create_qa_chain(llm, vstore, return_context=True):
        qa_chain = RetrievalQA.from_chain_type(
            llm,
            retriever=vstore.as_retriever(),
            return_source_documents=return_context,
        )
        return qa_chain
  5. Run evaluation.

    evaluation_config = RunEvalConfig(
        evaluators=[
            "qa",
            "context_qa",
            "cot_qa",
        ],
        prediction_key="result",
    )
    
    client = Client()
    run_on_dataset(
        dataset_name=dataset_name,
        llm_or_chain_factory=lambda: create_qa_chain(llm=model, vstore=vstore),
        client=client,
        evaluation=evaluation_config,
        verbose=True,
    )

The evaluators selected above perform different measurements against your LLM responses.

  • context_qa instructs the LLM chain to use the provided reference context in determining correctness.

  • qa instructs an LLMChain to directly grade a response as "correct" or "incorrect" based on the reference answer.

  • cot_qa instructs the LLM chain to use chain of thought "reasoning" before determining a final verdict. This tends to lead to responses that better correlate with human labels, for a slightly higher token and runtime cost.
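
The string names passed to RunEvalConfig above correspond to members of the EvaluatorType enum imported in the setup step. If you prefer the typed form, an equivalent configuration looks like this:

    evaluation_config = RunEvalConfig(
        evaluators=[
            EvaluatorType.QA,          # "qa"
            EvaluatorType.CONTEXT_QA,  # "context_qa"
            EvaluatorType.COT_QA,      # "cot_qa"
        ],
        prediction_key="result",
    )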

For more on LangChain evaluators, see Evaluation Overview.

What’s next?

Having set up a RAG pipeline and run evaluation over it, you can explore more advanced queries, use internal documentation for evaluation, implement advanced RAG techniques, and evaluate with external evaluation tools.

Find related details by opening the RAGStack Examples Index.
