Evaluating RAG Pipelines with LangChain
This notebook demonstrates how to evaluate a RAG pipeline using LangChain’s QA Evaluator. This evaluator helps measure the correctness of a response given some context, making it ideally suited for evaluating a RAG pipeline. At the end of this notebook, you will have a measurable QA model using RAG.
In this tutorial, you will use an Astra DB Serverless vector store, an OpenAI embedding model, an OpenAI LLM, LangChain, and LangSmith.
Prerequisites
You will need a vector-enabled Astra DB Serverless database and an OpenAI account.
See the Notebook Prerequisites page for more details.
- Create a vector-enabled Astra DB Serverless database.
- Create an OpenAI account.
- Within your database, create an Astra DB keyspace.
- Within your database, create an Astra DB Access Token with Database Administrator permissions.
- Get your Astra DB Serverless API endpoint: https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com
- Initialize the environment variables in a .env file:
ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
ASTRA_DB_API_ENDPOINT=https://9d9b9999-999e-9999-9f9a-9b99999dg999-us-east-2.apps.astra.datastax.com
ASTRA_DB_COLLECTION=test
OPENAI_API_KEY=sk-f99...
- Enter your settings for Astra DB Serverless and OpenAI:
astra_token = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
astra_endpoint = os.getenv("ASTRA_DB_API_ENDPOINT")
collection = os.getenv("ASTRA_DB_COLLECTION")
openai_api_key = os.getenv("OPENAI_API_KEY")
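The os.getenv calls above only return values once the .env file has been loaded; the Setup section below does this with load_dotenv. If you want to verify your configuration up front, a minimal optional sketch (the explicit check is an illustration, not part of the original notebook):

import os
from dotenv import load_dotenv

# Read the .env file into the process environment before calling os.getenv
load_dotenv()

# Illustrative check: fail fast if any required setting is missing
for name in (
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_COLLECTION",
    "OPENAI_API_KEY",
):
    if not os.getenv(name):
        raise ValueError(f"Missing environment variable: {name}")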
You will also need a LangSmith account and the following environment variables set. LANGCHAIN_PROJECT defaults to default if not specified.
LANGCHAIN_TRACING_V2="true"
LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_API_KEY="<your-api-key>"
LANGCHAIN_PROJECT="Project:"
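LangSmith reads these settings from the environment at run time. If you prefer to configure them from inside the notebook rather than a .env file, a small equivalent sketch (the project name below is an arbitrary placeholder):

import os

# Equivalent in-notebook configuration; replace the placeholders with your own values
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "rag-evaluation"  # arbitrary example project name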
Setup
The ragstack-ai package includes all the packages you need to build a RAG pipeline. The additional langchain[openai] package includes LangSmith.
- Install necessary libraries:
pip install ragstack-ai langchain[openai]
- Import dependencies and load your environment variables:
import os
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.evaluation import EvaluatorType
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.smith import RunEvalConfig, run_on_dataset
from langchain_astradb import AstraDBVectorStore
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langsmith import Client
from langsmith.utils import LangSmithError

load_dotenv()
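If you want to confirm what the install step pulled in, a small optional sketch (the distribution names are assumptions about how the packages register themselves with pip):

# Print the installed versions of the key distributions (raises if one is missing)
from importlib.metadata import version

for dist in ("ragstack-ai", "langchain", "langsmith"):
    print(dist, version(dist))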
Configure embedding model and populate vector store
- Configure your embedding model and vector store:
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
vstore = AstraDBVectorStore(
    collection_name=collection,
    embedding=embedding,
    token=astra_token,
    api_endpoint=astra_endpoint,
)
print("Astra vector store configured")
- Retrieve and process text for the vector store:
curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA = ["amontillado.txt"]
- Loop through each file and load it into the documents list. You loaded amontillado.txt in the previous step, but this processor can also handle PDFs. Setting SAMPLEDATA = [] at the end clears the list so the same files aren't processed twice.
documents = []
for filename in SAMPLEDATA:
    path = os.path.join(os.getcwd(), filename)
    new_docs = []
    if filename.endswith(".pdf"):
        loader = PyPDFLoader(path)
        new_docs = loader.load_and_split()
        print(f"Processed pdf file: {filename}")
    elif filename.endswith(".txt"):
        loader = TextLoader(path)
        new_docs = loader.load_and_split()
        print(f"Processed txt file: {filename}")
    else:
        print(f"Unsupported file type: {filename}")
    if len(new_docs) > 0:
        documents.extend(new_docs)

SAMPLEDATA = []
print("\nProcessing done.")
- Create embeddings by inserting your documents into the vector store. The final print statement verifies that the documents were embedded.
inserted_ids = vstore.add_documents(documents)
print(f"\nInserted {len(inserted_ids)} documents.")
print(vstore.astra_db.collection(collection).find())
- Retrieve context from your vector database and pass it to the model with the prompt:
retriever = vstore.as_retriever(search_kwargs={"k": 3})

prompt_template = """
Answer the question based only on the supplied context. If you don't know the answer, say you don't know the answer.
Context: {context}
Question: {question}
Your answer:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
model = ChatOpenAI(openai_api_key=openai_api_key)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

chain.invoke(
    "In the given context, what motivates the narrator, Montresor, to seek revenge against Fortunato?"
)
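Before moving on to evaluation, it can help to see what the retriever actually hands to the prompt. A short optional sketch (the question is just an example):

# Inspect the top-k chunks retrieved for a sample question
docs = retriever.get_relevant_documents(
    "What motivates Montresor to seek revenge against Fortunato?"
)
for i, doc in enumerate(docs):
    print(f"--- chunk {i} ---")
    print(doc.page_content[:200])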
Evaluate RAG responses
LangChain offers several built-in evaluators to test the efficacy of your RAG pipeline. Because you’ve now created a RAG pipeline, the QA Evaluator is a good fit.
Remember that LLMs are probabilistic: responses will not be exactly the same for each invocation. Evaluation results will differ between invocations and may be imperfect, so use these metrics as part of a larger, holistic testing strategy for your RAG application.
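One way to reduce (though not eliminate) this variance is to pin the chat model's temperature to 0 so the pipeline's answers are more repeatable; the LLM-based graders still introduce some variability of their own. A hedged sketch that re-creates the model from the previous section:

# Lower temperature makes the chain's answers more repeatable, though not fully deterministic
model = ChatOpenAI(openai_api_key=openai_api_key, temperature=0)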
- Set up LangSmith for evaluation:
LANGCHAIN_TRACING_V2 = os.getenv("LANGCHAIN_TRACING_V2")
LANGCHAIN_ENDPOINT = os.getenv("LANGCHAIN_ENDPOINT")
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
LANGCHAIN_PROJECT = os.getenv("LANGCHAIN_PROJECT")
- Set evaluation questions and reference answers for your data:
eval_questions = [
    "What motivates the narrator, Montresor, to seek revenge against Fortunato?",
    "What are the major themes in this story?",
    "What is the significance of the story taking place during the carnival season?",
    "What literary techniques does Poe use to create suspense and tension in the story?",
]

eval_answers = [
    "Montresor is insulted by Lenora and seeks revenge.",  # Incorrect Answer
    "The major themes are happiness and trustworthiness.",  # Incorrect Answer
    "The carnival season is a time of celebration and merriment, which contrasts with the sinister events of the story.",
    "Poe uses foreshadowing, irony, and symbolism to create suspense and tension.",
]

examples = zip(eval_questions, eval_answers)
- Create your dataset in LangSmith. This code first checks if the dataset exists, and if not, creates one with your evaluation questions.
client = Client()
dataset_name = "test_eval_dataset"

try:
    dataset = client.read_dataset(dataset_name=dataset_name)
    print("using existing dataset: ", dataset.name)
except LangSmithError:
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="sample evaluation dataset",
    )
    for question, answer in examples:
        client.create_example(
            inputs={"input": question},
            outputs={"answer": answer},
            dataset_id=dataset.id,
        )
    print("Created a new dataset: ", dataset.name)
- Since chains and agents can be stateful (they can have memory), create a constructor to pass to the run_on_dataset method. This ensures that state in the chain is not reused when evaluating individual examples.
def create_qa_chain(llm, vstore, return_context=True):
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=vstore.as_retriever(),
        return_source_documents=return_context,
    )
    return qa_chain
- Run the evaluation:
evaluation_config = RunEvalConfig(
    evaluators=[
        "qa",
        "context_qa",
        "cot_qa",
    ],
    prediction_key="result",
)

client = Client()
run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=create_qa_chain(llm=model, vstore=vstore),
    client=client,
    evaluation=evaluation_config,
    verbose=True,
)
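If your chain does hold state (for example, conversation memory), you can pass a zero-argument factory instead of an already-constructed chain so that run_on_dataset builds a fresh chain for each example. A hedged variant of the call above:

# Passing a callable means each dataset example is evaluated against a freshly built chain
run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=lambda: create_qa_chain(llm=model, vstore=vstore),
    client=client,
    evaluation=evaluation_config,
    verbose=True,
)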
The evaluators selected above perform different measurements against your LLM responses:
- context_qa instructs the LLM chain to use the provided reference context in determining correctness.
- qa instructs an LLMChain to directly grade a response as "correct" or "incorrect" based on the reference answer.
- cot_qa instructs the LLM chain to use chain-of-thought "reasoning" before determining a final verdict. This tends to lead to responses that better correlate with human labels, for a slightly higher token and runtime cost.
For more on LangChain evaluators, see Evaluation Overview.
What’s next?
Having set up a RAG pipeline and run evaluation over it, you can explore more advanced queries, use internal documentation for evaluation, implement advanced RAG techniques, and evaluate with external evaluation tools.
Find related details by opening the RAGStack Examples Index.