Quickstart: Q&A Search with LangChain
Overview
Use Astra DB with Vector Search and the CassIO LangChain integration to run a directed text similarity search on data from the Hugging Face datasets library. The example on this page does not require you to download any datasets.
New to these concepts? See details.
The following Colab Python notebook runs a more comprehensive and complex version of this example:
Prerequisites
An Astra Vector Search database is required. If you already have one, skip to Prepare for using your vector database; otherwise, take a couple of minutes to create one with the following instructions.
Create a serverless database with Vector Search
You must create a serverless Astra database with Vector Search before you can use its capabilities to work with data. Details to consider precede the procedural steps.
Considerations
As you create a serverless Astra database with Vector Search, fill out the required fields according to these rules:
Database Name: Name your database something meaningful. The database name cannot be altered after the database is created.
Keyspace Name: Name your keyspace to reflect your data model; all of your tables are stored within this keyspace. You cannot name your keyspace dse or system. Use only alphanumeric characters and no more than 48 total characters. The Vector Search examples use vsearch as the keyspace name.
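The naming rules above can be sketched as a small validation helper. This is a hypothetical illustration (the `is_valid_keyspace_name` function is not part of the quickstart script or any Astra API):

```python
# Hypothetical helper illustrating the keyspace naming rules above;
# not part of the quickstart script.
RESERVED_KEYSPACES = {"dse", "system"}

def is_valid_keyspace_name(name: str) -> bool:
    """Return True if the name follows the keyspace rules in this guide."""
    if not name or len(name) > 48:
        return False              # no more than 48 total characters
    if name.lower() in RESERVED_KEYSPACES:
        return False              # 'dse' and 'system' are reserved
    return name.isalnum()         # alphanumeric characters only

print(is_valid_keyspace_name("vsearch"))  # the keyspace used in these examples
print(is_valid_keyspace_name("system"))   # a reserved name
```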
Provider: Choose one of the three major cloud providers: Amazon Web Services (AWS), Google Cloud (GCP), or Microsoft Azure.
When using the free plan, only the Google Cloud region is available for a serverless Astra database with Vector Search.
Region: The region associated with the chosen provider automatically populates this field.
Deploy a serverless Astra database with Vector Search
Configure the basic details to create a serverless Astra database with Vector Search.
1. In your Astra DB dashboard, select Create Database.
2. Select Serverless (with Vector).
3. Enter your database details: Database Name and Keyspace Name.
4. Choose a Provider.
5. Select Create Database.
When this database is active, a green notification appears at the top of your screen. Your Astra database with Vector Search is ready to use for your content.
Prepare for using your vector database
1. Create a token and download your Secure Connect Bundle (SCB) for your application.
   1. Select your database, then click the Connect tab.
   2. In the Quick Start section, click the Generate token button to get an application token. Make sure you download the <db-name>-token.json file to use later.
   3. Use the Get Bundle button to generate and download your Secure Connect Bundle (SCB).
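The downloaded token file is JSON; later in this quickstart, the script reads its "token" field. The following standalone sketch mirrors that extraction using a stand-in file with a placeholder value (never hard-code your real token like this):

```python
import json
import tempfile

# Stand-in for the downloaded <db-name>-token.json file. The real file
# holds your actual credentials; this placeholder is for illustration only.
sample_token_json = {"token": "AstraCS:placeholder"}

with tempfile.NamedTemporaryFile("w", suffix="-token.json", delete=False) as f:
    json.dump(sample_token_json, f)
    token_path = f.name

# This mirrors how mini-qa.py extracts the token from the JSON file.
with open(token_path) as f:
    secrets = json.load(f)

astra_token = secrets["token"]
print(astra_token)
```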
Get an OpenAI API key to generate embeddings.
This example requires you to generate text embeddings. You can do this using many services, but this example uses the OpenAI API, which requires its own key. Get the key by visiting platform.openai.com, or by logging into your OpenAI account, selecting your profile in the top right, and selecting API Keys.
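Rather than pasting the key directly into the script, a common alternative (an optional sketch, not required by this quickstart) is to read it from an environment variable:

```python
import os

def get_openai_api_key(default=""):
    """Read the key from the OPENAI_API_KEY environment variable."""
    return os.environ.get("OPENAI_API_KEY", default)

# For demonstration only; in practice, export the variable in your shell.
os.environ["OPENAI_API_KEY"] = "sk-placeholder"
print(get_openai_api_key())
```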
Create a Python script using LangChain and CassIO
This Python script leverages OpenAI models to transform Onion_News data into embeddings stored in an Astra database and incorporates a text-based query system. The script uses these embeddings to answer user queries and retrieve relevant documents. It showcases the combination of machine learning, databases, and user interaction for real-time information extraction.
1. Use your local environment and ensure that Python 3.8 or later is installed and ready.
2. Install the required libraries with the following command on your localhost:

   pip install cassio datasets langchain openai tiktoken
3. Copy and configure the example Python code.

   Access the <db-name>-token.json and SCB that you previously downloaded in Prepare for using your vector database.

   Code for the Python script is provided for you to copy, step-by-step, along with instructions for configuration updates. As you copy sections, update the following variable values:

   - ASTRA_DB_SECURE_BUNDLE_PATH: Enter the path to your secure connect bundle 'secure-connect-<db-name>.zip'.
   - ASTRA_DB_TOKEN_JSON_PATH: Enter the path to your token details JSON file '<db-name>-token.json'. The script extracts the token details from this file.
   - ASTRA_DB_KEYSPACE: Enter the Astra database keyspace you would like to use.
   - OPENAI_API_KEY: Enter your OpenAI API key.
   - headlines = myDataset["text"][:50]: To pull more than 50 headlines from the Onion_News dataset, change the numerical value in the headlines variable.
Copy the following partial Python code into a file named mini-qa.py and update the variables. Provide the path information to your download locations for <db-name>-token.json and the Secure Connect Bundle (SCB). The download instructions are in Prepare for using your vector database.

ASTRA_DB_SECURE_BUNDLE_PATH = ""  # enter the path to your secure connect bundle 'secure-connect***.zip'
ASTRA_DB_TOKEN_JSON_PATH = ""  # enter the path to your token details json file '***-token.json'
ASTRA_DB_KEYSPACE = ""  # enter the Astra database keyspace you would like to use
OPENAI_API_KEY = ""  # enter your OpenAI key
Now that your variables are updated, the next section of code imports the needed libraries. Copy and append the following partial Python code into your mini-qa.py file:

# Vector support using LangChain, Apache Cassandra (Astra DB is built using
# Cassandra), and OpenAI (to generate embeddings)
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# These are used to authenticate with Astra DB
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

import json
Create a connection to your Astra database using the secure connect bundle (SCB) and Astra token variables downloaded in Prepare for using your vector database. Together they contain everything needed to securely connect and establish communications with your Astra database. Copy and append the following partial Python code into your mini-qa.py file:

cloud_config = {
    "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH
}

with open(ASTRA_DB_TOKEN_JSON_PATH) as f:
    secrets = json.load(f)

ASTRA_DB_APPLICATION_TOKEN = secrets["token"]  # the token is pulled from your token json file

auth_provider = PlainTextAuthProvider("token", ASTRA_DB_APPLICATION_TOKEN)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
astraSession = cluster.connect()
Add support to generate embeddings using OpenAI. Copy and append the following partial Python code into your mini-qa.py file:

llm = OpenAI(openai_api_key=OPENAI_API_KEY)
myEmbedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
LangChain now puts this all together into a vector store object that includes OpenAI embedding support (myEmbedding) and your Astra database with Vector Search (astraSession). A table (qa_mini_demo) with vector search support is automatically created in the keyspace provided for the ASTRA_DB_KEYSPACE variable. Copy and append the following partial Python code into your mini-qa.py file:

myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=astraSession,
    keyspace=ASTRA_DB_KEYSPACE,
    table_name="qa_mini_demo",
)
At this point in the Python script, setup is complete and the required services are configured. The script next loads Hugging Face's Onion_News dataset, generates embeddings for the returned data, and stores them in your Astra vector database. This example pulls 50 headlines; to use more, update the value in the headlines variable. More headlines yields more accurate results, but also increases the duration of the embedding process. Copy and append the following partial Python code into your mini-qa.py file:

print("Loading data from huggingface")
myDataset = load_dataset("Biddls/Onion_News", split="train")
headlines = myDataset["text"][:50]

print("\nGenerating embeddings and storing in AstraDB")
myCassandraVStore.add_texts(headlines)
print("Inserted %i headlines.\n" % len(headlines))

vectorIndex = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)
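The [:50] above is ordinary Python list slicing; a tiny standalone illustration with made-up headlines standing in for the dataset:

```python
# Made-up stand-ins for myDataset["text"]; the real dataset is much larger.
texts = ["headline %d" % i for i in range(200)]

headlines = texts[:50]   # same pattern as headlines = myDataset["text"][:50]
print(len(headlines))    # 50

# Increasing the slice bound pulls more headlines:
more_headlines = texts[:120]
print(len(more_headlines))  # 120
```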
This is the question and answer part of the script. The script prompts the user to ask a question, then prints the question, an answer, and related documents by relevance. Copy and append the following partial Python code into your mini-qa.py file:

first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ")
        first_question = False
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ")

    if query_text.lower() == "quit":
        break

    print("QUESTION: \"%s\"" % query_text)
    answer = vectorIndex.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("DOCUMENTS BY RELEVANCE:")
    for doc, score in myCassandraVStore.similarity_search_with_score(query_text, k=4):
        print("    %0.4f \"%s ...\"" % (score, doc.page_content[:60]))
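The relevance loop truncates each document to its first 60 characters before printing. Here is that formatting in isolation, using made-up documents and scores (the real pairs come from similarity_search_with_score):

```python
# Made-up (document, score) pairs standing in for
# similarity_search_with_score results.
results = [
    ("Biologists Torture Amoeba For Information On Where Life Came From #~# ...", 0.9270),
    ("5,000-Mile-Wide Blob Of Seaweed Heading Towards Florida #~# ...", 0.8821),
]

lines = []
for content, score in results:
    # Same format string as the script: 4-decimal score, first 60 characters.
    lines.append("    %0.4f \"%s ...\"" % (score, content[:60]))

print("\n".join(lines))
```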
Your copy of the Python script is now configured with the updated variable values. The Python code in its entirety:

ASTRA_DB_SECURE_BUNDLE_PATH = ""  # enter the path to your secure connect bundle 'secure-connect***.zip'
ASTRA_DB_TOKEN_JSON_PATH = ""  # enter the path to your token details json file '***-token.json'
ASTRA_DB_KEYSPACE = ""  # enter the Astra database keyspace you would like to use
OPENAI_API_KEY = ""  # enter your OpenAI key

# Vector support using LangChain, Apache Cassandra (Astra DB is built using
# Cassandra), and OpenAI (to generate embeddings)
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# These are used to authenticate with Astra DB
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

import json

cloud_config = {
    "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH
}

with open(ASTRA_DB_TOKEN_JSON_PATH) as f:
    secrets = json.load(f)

ASTRA_DB_APPLICATION_TOKEN = secrets["token"]  # the token is pulled from your token json file

auth_provider = PlainTextAuthProvider("token", ASTRA_DB_APPLICATION_TOKEN)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
astraSession = cluster.connect()

llm = OpenAI(openai_api_key=OPENAI_API_KEY)
myEmbedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=astraSession,
    keyspace=ASTRA_DB_KEYSPACE,
    table_name="qa_mini_demo",
)

print("Loading data from huggingface")
myDataset = load_dataset("Biddls/Onion_News", split="train")
headlines = myDataset["text"][:50]

print("\nGenerating embeddings and storing in AstraDB")
myCassandraVStore.add_texts(headlines)
print("Inserted %i headlines.\n" % len(headlines))

vectorIndex = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)

first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ")
        first_question = False
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ")

    if query_text.lower() == "quit":
        break

    print("QUESTION: \"%s\"" % query_text)
    answer = vectorIndex.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("DOCUMENTS BY RELEVANCE:")
    for doc, score in myCassandraVStore.similarity_search_with_score(query_text, k=4):
        print("    %0.4f \"%s ...\"" % (score, doc.page_content[:60]))
Run the mini-qa.py script:

$> python mini-qa.py

Sample result:

Loading data from huggingface
Found cached dataset text (/Users/<user-name>/.cache/huggingface/datasets/Biddls___text/Biddls--Onion_News-68e4388dcc8b1aec/0.0.0/cb1e9bd71a82ad27976be3b12b...2)

Generating embeddings and storing in AstraDB
Inserted 50 headlines.
The script prompts you for a question, which you type in. Here are some suggestions:

- What are the biggest questions in science?
- What should I know about Silicon Valley banks?
- Are amoebas really our overlords?

This example uses the question "Are amoebas really our overlords?"

Enter your question (or type 'quit' to exit): Are amoebas really our overlords?

Sample result:

QUESTION: "Are amoebas really our overlords?"
ANSWER: "No, amoebas are not our overlords."

DOCUMENTS BY RELEVANCE:
    0.9270 "Biologists Torture Amoeba For Information On Where Life Came ..."
    0.8821 "5,000-Mile-Wide Blob Of Seaweed Heading Towards Florida #~# ..."
    0.8808 "Study Finds More Americans Turning To Own Feverish Imaginati ..."
    0.8799 "Whites Ousted From Role As Master Race After Racist Past Com ..."

What's your next question (or type 'quit' to exit):
What’s next?
Explore and learn more from our Examples.