Build a Wikipedia chatbot with LangChain
Wikipedia is a top search result for search engines because it’s a trusted site. DataStax built WikiChat, a way to ask Wikipedia questions and get back natural language answers, using Next.js, LangChain, Vercel, OpenAI, Cohere, and Astra DB Serverless.
Astra DB Serverless can concurrently ingest any Wikipedia updates, reindex them, and make them available for users to query without any delay to rebuild indexes.
For a video demo of this tutorial, see WikiChat: Build a real-time RAG app on Wikipedia with LangChain.js, Vercel, and Astra DB. For a narrative walkthrough of this tutorial, see Building a Wikipedia Chatbot with Astra DB, LangChain, and Vercel. For the full code, see the WikiChat GitHub repository.
Objective
In this tutorial, you will build a chatbot with the 1,000 most popular Wikipedia pages and use Wikipedia’s real-time updates feed to update its store of information. Similar to many retrieval-augmented generation (RAG) applications, you can build and deploy a Wikipedia-based chatbot with two parts:
-
Data ingest: WikiChat includes a list of sources to scrape. LangChain chunks the text data, Cohere creates the embeddings, and Astra DB Serverless stores the data.
-
Web app: WikiChat's conversational interface is built on Next.js, Vercel's AI library, Cohere, and OpenAI. When a user asks a question, the app uses Cohere to create an embedding for that question, queries Astra DB Serverless using vector search, and then feeds those results into OpenAI to generate a conversational response for the user.
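The two parts above can be sketched end to end with the standard library alone. This is a minimal illustration of the flow, not WikiChat's code: `chunk_text` is a naive stand-in for a LangChain text splitter, `toy_embed` is a hypothetical word-count "embedding" standing in for Cohere, and a plain Python list stands in for an Astra DB Serverless collection.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 20) -> list[str]:
    """Split text into fixed-size word chunks (stand-in for a real text splitter)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> Counter:
    """Hypothetical embedding: a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(store: list[dict], article: str) -> None:
    """Data ingest: chunk the article, embed each chunk, and store both."""
    for chunk in chunk_text(article):
        store.append({"text": chunk, "vector": toy_embed(chunk)})


def search(store: list[dict], question: str, limit: int = 3) -> list[str]:
    """Web app: embed the question and return the most similar stored chunks."""
    query_vector = toy_embed(question)
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:limit]]
```

In the real app, the chunks returned by the search step would be passed to the language model as context for the answer. The toy embedding only matches exact words, which is why a trained embedding model is used in practice.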
Prerequisites
To complete this tutorial, you’ll need the following:
-
An active Astra account
-
A paid Cohere account
-
A paid OpenAI account
You should also be proficient in the following tasks:
-
Interacting with databases.
-
Running a basic Python script.
-
Entering parameters in a user interface for Vercel.
Clone the Git repository
Clone the chatbot Git repository and switch to that directory.
git clone https://github.com/datastax/wikichat.git
cd wikichat
Create your Serverless (Vector) database
This tutorial uses a Serverless (Vector) database to store the WikiChat app data and embeddings. If you’re new to vector databases, see What are vector databases?.
-
In the Astra Portal navigation menu, click Databases, and then click Create Database.
-
Select the Serverless (Vector) deployment type.
-
Enter a meaningful, human-readable Database name.
Database names are permanent: after you create a database, you can't change its name. Names must start and end with a letter or number, and they can contain no more than 50 characters, including letters, numbers, and the special characters & + - _ ( ) < > . , @.
-
Select a Provider and Region to host your database.
On the Free plan, you can access a limited set of supported regions. To access locked regions, you must upgrade your subscription plan.
To minimize latency in production databases, select a region that is close to your application’s users.
-
Click Create Database.
New databases start in Pending status, and then move to Initializing. Your database is ready to use when it reaches Active status.
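Because a database name can't be changed later, it can help to check a candidate name against the naming rules before you create the database. The following is a small sketch of those rules as stated in this tutorial. The function name is hypothetical, and it assumes that "letters" means ASCII letters and that spaces are not allowed, neither of which the tutorial states explicitly.

```python
import re

MAX_NAME_LENGTH = 50

# Assumption: only ASCII letters and digits count as "letters" and "numbers".
# The name must start and end with a letter or digit; in between, it may also
# contain the special characters & + - _ ( ) < > . , @
_NAME_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9&+\-_()<>.,@]*[A-Za-z0-9])?")


def is_valid_database_name(name: str) -> bool:
    """Return True if the name satisfies the length and character rules."""
    return len(name) <= MAX_NAME_LENGTH and _NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_database_name("wikichat_db")` returns `True`, while `is_valid_database_name("-wikichat")` returns `False` because the name starts with a special character.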
Set your environment variables
-
In your wikichat directory, copy the .env.example file at the root of the project to .env. You will store the credentials and configuration information for the APIs used to build this app in .env.
-
In the Astra Portal navigation menu, select your database.
-
In the Database Details section, copy the database's API Endpoint, and then set it as the ASTRA_DB_API_ENDPOINT environment variable.
-
Click Generate Token to create an application token for the database, and then set it as the ASTRA_DB_APPLICATION_TOKEN environment variable.
-
Set the ASTRA_DB_KEYSPACE environment variable to default_keyspace.
-
In the OpenAI Platform, create an API key, and then set it as the OPENAI_API_KEY environment variable.
-
In the Cohere Dashboard, create an API key, and then set it as the COHERE_API_KEY environment variable.
-
Make sure the following environment variables are set in .env:
ASTRA_DB_API_ENDPOINT=ENDPOINT
ASTRA_DB_APPLICATION_TOKEN=APPLICATION_TOKEN
ASTRA_DB_KEYSPACE=default_keyspace
OPENAI_API_KEY=OPENAI_API_KEY
COHERE_API_KEY=COHERE_API_KEY
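Before moving on, you may want to confirm that none of the five variables is missing or empty. The following is a small optional sketch that is not part of the WikiChat repository: the helper names are hypothetical, and the parser handles only simple KEY=VALUE lines, not the full syntax that dotenv-style loaders accept.

```python
REQUIRED_VARIABLES = (
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_KEYSPACE",
    "OPENAI_API_KEY",
    "COHERE_API_KEY",
)


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(values: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]
```

To use it, read the file with `pathlib.Path(".env").read_text()`, pass the text to `parse_env_file`, and check that `missing_variables` returns an empty list.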
Install the dependencies
-
In the project root directory, create a virtual environment.
python3 -m venv .venv
-
Activate the environment for the current terminal session.
source .venv/bin/activate
-
Install the project package and dependencies.
npm install
pip3 install -r requirements.txt
Load the data
-
Run the wiki_data.py script with the load-and-listen command to ingest the articles and listen for changes.
python3 scripts/wiki_data.py load-and-listen
The output resembles the following:
2024-01-23 14:03:34.058 - INFO - root - unknown_worker - Running command load-and-listen with args LoadPipelineArgs(max_articles=2000, truncate_first=False, rotate_collections_every=100000, max_file_lines=0, file='scripts/data/wiki_links.txt')
2024-01-23 14:03:34.058 - INFO - root - unknown_worker - Starting...
2024-01-23 14:03:34.058 - INFO - root - unknown_worker - Reading links from file scripts/data/wiki_links.txt limit is 0
2024-01-23 14:03:34.059 - INFO - root - unknown_worker - Read 978 links from file scripts/data/wiki_links.txt
2024-01-23 14:03:34.060 - INFO - root - unknown_worker - Starting to listen for changes
2024-01-23 14:03:34.067 - INFO - root - unknown_worker -
Processing:
    Total Time (h:mm:s): 0:00:02.189130
    Report interval (s): 10
    Wikipedia Listener:
        Total events: 0 (total) 0.0 (op/s)
        Canary events: 0 (total) 0.0 (op/s)
        Bot events: 0 (total) 0.0 (op/s)
        Skipped events: 0 (total) 0.0 (op/s)
        enwiki edits: 0 (total) 0.0 (op/s)
    Chunks:
        Chunks created: 0 (total) 0.0 (op/s)
        Chunk diff new: 0 (total) 0.0 (op/s)
        Chunk diff deleted: 0 (total) 0.0 (op/s)
        Chunk diff unchanged: 0 (total) 0.0 (op/s)
        Chunks vectorized: 0 (total) 0.0 (op/s)
    Database:
        Rotations: 0 (total) 0.0 (op/s)
        Chunks inserted: 0 (total) 0.0 (op/s)
        Chunks deleted: 0 (total) 0.0 (op/s)
        Chunk collisions: 0 (total) 0.0 (op/s)
        Articles read: 0 (total) 0.0 (op/s)
        Articles inserted: 0 (total) 0.0 (op/s)
    Pipeline: {'load_article': 968, 'chunk_article': 0, 'calc_chunk_diff': 0, 'vectorize_diff': 0, 'store_article_diff': 0}
    Errors: None
    Articles:
        Skipped - redirect: 0 (total) 0.0 (op/s)
        Skipped - zero vector: 0 (total) 0.0 (op/s)
        Recent URLs: None
-
Open a new terminal and start a development server.
npm run dev
-
Open http://localhost:3000 to view the chatbot in your browser.
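The load-and-listen output reports counters such as "Chunk diff new", "Chunk diff deleted", and "Chunk diff unchanged", which suggests that only changed chunks are re-embedded and written when an article is updated. The following sketch shows one way such a diff could be computed by comparing content hashes. It is an illustrative assumption, not the logic in wiki_data.py, and the function names are hypothetical.

```python
import hashlib


def chunk_hash(chunk: str) -> str:
    """Return a stable content hash that identifies a chunk by its text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def diff_chunks(old_chunks: list[str], new_chunks: list[str]) -> dict[str, list[str]]:
    """Classify chunks as new, deleted, or unchanged between two article versions."""
    old = {chunk_hash(chunk): chunk for chunk in old_chunks}
    new = {chunk_hash(chunk): chunk for chunk in new_chunks}
    return {
        # Present only in the updated article: embed and insert these.
        "new": [chunk for digest, chunk in new.items() if digest not in old],
        # Present only in the stored article: delete these.
        "deleted": [chunk for digest, chunk in old.items() if digest not in new],
        # Present in both versions: no embedding or write needed.
        "unchanged": [chunk for digest, chunk in new.items() if digest in old],
    }
```

With a diff like this, an edit that touches one paragraph of a long article costs one embedding call and one write instead of reprocessing the whole page.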
Deploy your chatbot
You can deploy WikiChat to a serverless environment, such as Vercel.
-
In the Vercel Dashboard, search for and import the third-party Git repo from https://github.com/datastax/wikichat.
-
Select the Next.js Framework Preset option.
-
Set the Environment Variables to match the ones you defined above.
-
Click Deploy.
After a few minutes, you can see your deployed WikiChat app.
After you deploy in Vercel the first time, auto-deployment triggers for each subsequent commit.
For more about using Vercel, see the Vercel documentation.
Next steps
In this tutorial, you built a chatbot that can answer questions about the most popular and recently updated pages on Wikipedia. Here’s what you accomplished:
-
Loaded an initial dataset by scraping the 1,000 most popular Wikipedia articles.
-
Listened for real-time updates and only processed diffs.
-
Chunked text data using LangChain and generated embeddings using Cohere.
-
Stored application and vector data in Astra DB Serverless.
-
Built a web-based chatbot UI using Vercel AI SDK.
-
Performed vector search with Astra DB Serverless.
-
Generated accurate and context-aware responses using OpenAI.
Next, try customizing the code to scrape your own sites, use different prompts, and redesign the frontend.
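As a starting point for experimenting with prompts, the sketch below shows one way retrieved chunks and a user question could be combined into a single prompt string. The function name, the prompt wording, and the `max_chunks` parameter are all hypothetical. WikiChat's actual prompt lives in the repository and may be structured differently.

```python
def build_prompt(question: str, context_chunks: list[str], max_chunks: int = 3) -> str:
    """Combine the top retrieved chunks and the question into one prompt string."""
    # Number each chunk so the model (and a reader) can tell the passages apart.
    context = "\n\n".join(
        f"[{index}] {chunk}"
        for index, chunk in enumerate(context_chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Changing the instruction sentence or the number of chunks passed in is a quick way to see how the prompt affects the tone and accuracy of responses.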
You can also try building apps with the Astra DB Data API clients. To get started with the clients, see the Astra DB Serverless quickstart.