Quickstart: Q&A Search with LangChain
Overview
Use the CassIO LangChain integration to run a text similarity search on data from the Hugging Face datasets library. You don't need to download any dataset manually; the example script fetches the data at runtime.
Prerequisites
- You must have a Serverless Cassandra with Vector Search database. If you already have one, skip to the next section, or take a couple of minutes to create one.
- Ensure you have Python v3.8 or later installed on your computer. You can verify your version with the snippet after this list.
- Create a `mini-qa.py` file in your local environment. Optionally, you can use this example as a Google Colab.
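To verify that your Python version meets the requirement, you can run a quick check. This is a minimal sketch; any v3.8+ interpreter will pass:

```python
# Confirm the interpreter satisfies the v3.8+ requirement
import sys

assert sys.version_info >= (3, 8), "Python 3.8 or later is required"
print(sys.version)
```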
Connection Parameters
After creating your database, prepare it for use.
- Go to your database dashboard, copy your Database ID, and keep it ready. It's located below the database name at the top of the dashboard.
- Go to Connect on your database dashboard and generate a token with the Database Administrator role.
- Get an OpenAI API key to generate embeddings. Go to your OpenAI account, select your profile in the top right, and select View API keys.
- Select Create new secret key and copy the key to a separate location.
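If you'd rather not hardcode secrets in the script, one option is to export them as environment variables and read them with `os.environ`. A minimal sketch; the variable names below are illustrative:

```python
# Optional: load credentials from environment variables instead of
# hardcoding them in the script. The variable names here are illustrative.
import os

ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
ASTRA_DB_ID = os.environ["ASTRA_DB_ID"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
```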
Create and run the Python script using LangChain and CassIO
This Python script transforms data into embeddings, stores them in an Astra database, and lets you query them with a prompt. The embeddings are used to answer text-based questions by retrieving the most relevant documents and generating an answer from them, showcasing the benefits of pairing a vector database with LLMs.
- Install the required libraries on your local machine with the following command.

```bash
pip install "cassio>=0.1.3" datasets langchain openai tiktoken
```
- Copy the code below and add your OpenAI key, your token, and your Database ID.

```python
# Vector support using LangChain, Apache Cassandra (Astra DB is built using
# Cassandra), and OpenAI (to generate embeddings)

# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..."  # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "01234567-..."  # enter your Database ID
OPENAI_API_KEY = "sk-..."  # enter your OpenAI key

cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)
```
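Optionally, you can confirm that the connection was established by resolving the session that `cassio.init(...)` created. This sketch assumes cassio's `config` helpers, which ship with recent cassio releases:

```python
# Optional sanity check: resolve the session created by cassio.init().
# Assumption: the cassio.config helpers are available in the installed version.
from cassio.config import check_resolve_session

session = check_resolve_session()  # raises if no session was initialized
print("Connected to Astra DB:", session is not None)
```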
- The OpenAI key (`OPENAI_API_KEY`) adds support for generating embeddings. Copy and append this code to your `mini-qa.py` file.

```python
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
```
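If you want to sanity-check the key before going further, you can embed a short string and inspect the resulting vector. This is optional and makes a small billable API call; the expected length of 1536 assumes OpenAI's default embedding model:

```python
# Optional: verify the OpenAI key by embedding a short string.
# Note: this makes a small billable API call. A length of 1536 assumes
# OpenAI's default embedding model (text-embedding-ada-002).
vector = embedding.embed_query("hello world")
print("Embedding length:", len(vector))
```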
- When creating the LangChain Cassandra vector store, backed by Astra, provide the embeddings object you created (`embedding`) and the name of the database table, which is created automatically (in this case, `qa_mini_demo`). The other parameters are passed as `None`, signaling the store to use the default settings from the earlier `cassio.init(...)` call.

```python
astra_vector_store = Cassandra(
    embedding=embedding,
    session=None,
    keyspace=None,
    table_name="qa_mini_demo",
)
```
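If you prefer not to rely on those defaults, you can pin the keyspace explicitly instead. A sketch, where `"my_keyspace"` is a placeholder for a keyspace that exists in your database:

```python
# Alternative: pass an explicit keyspace rather than relying on the
# cassio.init() default. "my_keyspace" is a placeholder; substitute a
# keyspace that actually exists in your database.
astra_vector_store = Cassandra(
    embedding=embedding,
    session=None,  # still resolved from cassio.init()
    keyspace="my_keyspace",
    table_name="qa_mini_demo",
)
```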
- The script next loads a dataset from Hugging Face, generates embeddings for the returned headlines, and stores them in your Astra vector database. Copy and append the following partial Python code into your `mini-qa.py` file. The example pulls 100 headlines; to use more, update the `NUM_HEADLINES` variable.

```python
NUM_HEADLINES = 100

print("Loading data from huggingface ... ", end="")
onion_dataset = load_dataset("Biddls/Onion_News", split="train")
headlines = onion_dataset["text"][:NUM_HEADLINES]
print("Done.")

print("\nGenerating embeddings and storing headlines in AstraDB ... ", end="")
astra_vector_store.add_texts(headlines)
print("Inserted %i headlines." % len(headlines))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)
```
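If you later want to filter results or trace them back to their source, you can attach metadata to each text at insert time. A minimal sketch; the metadata keys (`source`, `row`) are illustrative, not a required schema:

```python
# Optional: attach metadata to each headline on insert. Use this in place
# of the plain add_texts(headlines) call above to avoid inserting twice.
# The keys used here ("source", "row") are illustrative examples.
metadatas = [
    {"source": "Biddls/Onion_News", "row": i}
    for i in range(len(headlines))
]
astra_vector_store.add_texts(headlines, metadatas=metadatas)
```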
- Copy and append the question-and-answer code into your `mini-qa.py` file.

```python
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))
```
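As an aside, you could build the same question answering on LangChain's `RetrievalQA` chain instead of the `VectorStoreIndexWrapper`. A sketch under that assumption, reusing the `llm` and vector store defined earlier:

```python
# Alternative: answer a question with a RetrievalQA chain instead of the
# VectorStoreIndexWrapper used above; reuses llm and astra_vector_store.
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff the retrieved documents into one prompt
    retriever=astra_vector_store.as_retriever(search_kwargs={"k": 4}),
)
print(qa_chain.run("Did ChatGPT take the bar exam?"))
```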
- Your copy of the Python script is now configured with the updated variable values. Run the `mini-qa.py` script.

```console
$> python mini-qa.py
```
- Ask a question to test the code. Here are some suggestions:
  - What are scientists doing with amoebas?
  - Did ChatGPT take the bar exam?
  - Are gas stoves a controversial item in a household?
What’s Next?
You can also try the other examples to see just how easy it is to query your data.