Build a Graph RAG system with LangChain and GraphRetriever
Graph RAG is an enhancement to retrieval-augmented generation (RAG). It uses vector search to find semantically similar documents, and then uses graph traversal to find connected documents through relationships like hyperlinks, citations, or references. This helps find documents that might not be semantically similar but are contextually connected. As in standard RAG, the retrieved documents serve as context for a large language model (LLM).
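To make the two-phase retrieval flow concrete before introducing the libraries, here is a minimal, library-free sketch. The vector_search and get_linked_docs callables are hypothetical stand-ins for a real vector store query and a metadata-based link lookup; the actual tutorial uses GraphRetriever for this:

# Minimal sketch of two-phase graph RAG retrieval (illustrative only).
# vector_search and get_linked_docs are hypothetical stand-ins for a real
# vector store query and a link lookup based on document metadata.
def graph_rag_retrieve(question, vector_search, get_linked_docs,
                       start_k=3, select_k=10):
    # Phase 1: vector search finds semantically similar documents.
    selected = list(vector_search(question, k=start_k))
    # Phase 2: graph traversal adds documents connected to those results,
    # for example through hyperlinks, citations, or references.
    for doc in list(selected):
        for linked in get_linked_docs(doc):
            if linked not in selected and len(selected) < select_k:
                selected.append(linked)
    # The combined set becomes the context passed to the LLM.
    return selected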
In this tutorial, you will build a simple graph RAG system. First, you will build a graph from a small set of cross-linked HTML pages. Then, you will use the graph during the retrieval step of RAG to provide extended context to the LLM.
Prerequisites
- A Serverless (vector) database. If you don’t already have one, you can create one.
- An OpenAI API key associated with a paid OpenAI account.
- Python 3.11 or later.
Install dependencies
Install the dependencies used in this tutorial. For example:
pip install \
langchain-astradb==1.0.0 \
langchain-openai==1.1.7 \
langchain-graph-retriever==0.8.0 \
beautifulsoup4==4.14.3
Although this tutorial doesn’t use beautifulsoup4 directly, langchain-graph-retriever requires it to parse the HTML pages in this tutorial.
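To confirm that all packages installed correctly, you can run a quick import check (beautifulsoup4 is imported as bs4):

python -c "import langchain_astradb, langchain_openai, langchain_graph_retriever, bs4; print('Dependencies OK')"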
Set environment variables
Set the following environment variables:
export API_ENDPOINT=API_ENDPOINT
export APPLICATION_TOKEN=APPLICATION_TOKEN
export OPENAI_API_KEY=OPENAI_API_KEY
Replace the following:
- API_ENDPOINT: Your database’s API endpoint.
- APPLICATION_TOKEN: An application token for your database.
- OPENAI_API_KEY: Your OpenAI API key.
Build the graph
- Download the graph_rag_dataset.json sample dataset. This dataset is a JSON array describing a small set of cross-linked HTML pages.
- Copy the following code into a Python file, and replace the PATH_TO_DATA_FILE placeholder with the path to the JSON data file.

  This code processes the raw JSON dataset into a list of documents. Each document includes a metadata.hyperlink field, which lists the links from that document’s HTML content, and a metadata.url field, which contains the URL of the document. These fields are used to build the graph during retrieval in the next section.

  Then, the code creates a vector store that uses Astra DB as the backend and OpenAI as the embedding service. Finally, the code inserts the processed documents into the vector store.

  import json
  import os

  from langchain_core.documents import Document
  from langchain_graph_retriever.transformers.html import HyperlinkTransformer
  from langchain_astradb import AstraDBVectorStore
  from langchain_openai import OpenAIEmbeddings

  endpoint = os.environ.get("API_ENDPOINT")  # (1)
  application_token = os.environ.get("APPLICATION_TOKEN")
  openai_api_key = os.environ.get("OPENAI_API_KEY")

  if not endpoint or not application_token or not openai_api_key:
      raise RuntimeError(
          "Environment variables API_ENDPOINT, APPLICATION_TOKEN, OPENAI_API_KEY must be defined."
      )

  data_file_path = "PATH_TO_DATA_FILE"  # (2)

  # Read the JSON file and parse it into a JSON array
  with open(data_file_path, "r", encoding="utf8") as file:
      json_data = json.load(file)

  # Convert the JSON array into LangChain Documents
  documents = [
      Document(page_content=data["html_doc"], metadata={"url": data["url"]})
      for data in json_data
  ]

  # Extract hyperlinks from the HTML and store them in the Document metadata
  html_transformer = HyperlinkTransformer()
  documents_with_links = html_transformer.transform_documents(documents)

  # Create a vector store that uses
  # Astra DB as the backend and
  # OpenAI as the embedding service
  print("Creating vector store...")
  vector_store = AstraDBVectorStore(
      collection_name="graph_rag_tutorial",
      token=application_token,
      api_endpoint=endpoint,
      embedding=OpenAIEmbeddings(api_key=openai_api_key),
  )

  # In case a collection with this name already existed,
  # delete any documents in the collection
  vector_store.clear()

  # Insert the documents into the collection
  print("Inserting documents...")
  vector_store.add_documents(documents_with_links)

  1 Store your database’s endpoint, application token, and OpenAI key in environment variables named API_ENDPOINT, APPLICATION_TOKEN, and OPENAI_API_KEY, as instructed in Set environment variables.
  2 Replace PATH_TO_DATA_FILE with the path to the JSON data file.
- Execute the code. You should see printed messages indicating that the vector store is being created and that documents are being inserted.
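Optionally, you can add the following line to the script to inspect the metadata of one transformed document and verify the fields that the graph traversal will use. The output shown in the comment is illustrative, not from the actual dataset:

# Optional: inspect the graph-related metadata of the first document.
print(documents_with_links[0].metadata)
# Illustrative output:
# {'url': 'https://example.com/space-needle',
#  'hyperlink': ['https://example.com/seattle-center', ...]}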
Use the graph for retrieval and generation
- Copy the following code into a Python file.

  This code uses a graph retriever that performs a vector search to find the documents that are most similar to a given string, then traverses the graph to find connected documents. The documents found by the graph retriever are passed along with the original question to the LLM.

  import os
  from pprint import pprint

  from langchain_core.prompts import ChatPromptTemplate
  from langchain_core.output_parsers import StrOutputParser
  from langchain_core.runnables import RunnablePassthrough
  from graph_retriever.strategies import Eager
  from langchain_graph_retriever import GraphRetriever
  from langchain_astradb import AstraDBVectorStore
  from langchain_openai import ChatOpenAI, OpenAIEmbeddings

  endpoint = os.environ.get("API_ENDPOINT")  # (1)
  application_token = os.environ.get("APPLICATION_TOKEN")
  openai_api_key = os.environ.get("OPENAI_API_KEY")

  if not endpoint or not application_token or not openai_api_key:
      raise RuntimeError(
          "Environment variables API_ENDPOINT, APPLICATION_TOKEN, OPENAI_API_KEY must be defined."
      )

  # Initialize AstraDBVectorStore
  # based on the collection that you created and populated earlier
  vector_store = AstraDBVectorStore(
      collection_name="graph_rag_tutorial",
      token=application_token,
      api_endpoint=endpoint,
      embedding=OpenAIEmbeddings(api_key=openai_api_key),
      autodetect_collection=True,
  )

  # Initialize the LLM
  llm = ChatOpenAI(model="gpt-4o", api_key=openai_api_key)

  # Define the prompt template
  template = """Answer the question based only on the following context:
  {context}

  Question: {question}
  """
  prompt = ChatPromptTemplate.from_template(template)

  # Initialize GraphRetriever.
  # This retriever first uses vector search to find relevant documents,
  # then uses graph traversal to explore their connections.
  retriever = GraphRetriever(
      store=vector_store,
      edges=[("hyperlink", "url")],
      strategy=Eager(
          # Number of documents to fetch via vector search for starting the traversal.
          start_k=3,
          # Maximum total documents to retrieve during traversal.
          select_k=10,
          # Maximum traversal depth.
          # A value of 0 only performs vector search, but does not do any graph traversal.
          max_depth=1,
      ),
  )

  # Helper function to format the retrieved documents for LLM context
  def format_docs(docs):
      return "\n\n".join([d.page_content for d in docs])

  # Build the RAG chain:
  # 1. Take the input question, retrieve relevant documents from a vector store,
  #    and format the documents into a string.
  #    Create a dictionary with the formatted documents and the original question.
  # 2. Create the prompt by injecting the dictionary from step 1 into the prompt template.
  # 3. Pass the formatted prompt to the LLM.
  # 4. Clean and return the LLM output.
  chain = (
      {"context": retriever | format_docs, "question": RunnablePassthrough()}
      | prompt
      | llm
      | StrOutputParser()
  )

  # Try these questions to explore the knowledge graph:
  QUESTION = "What is close to the Space Needle?"
  # Alternative questions:
  # - "What is in the Lower Queen Anne neighborhood?"
  # - "What is in the same neighborhood as the Space Needle?"
  # - "What connects the 1962 World's Fair to modern Seattle?"
  # - "Where is Chihuly Garden and Glass?"

  print(f"\nQuestion: {QUESTION}\n")

  try:
      response = chain.invoke(QUESTION)
      print("Answer:")
      pprint(response)
  except Exception as e:
      print(f"Error during RAG query: {e}")

  1 Store your database’s endpoint, application token, and OpenAI key in environment variables named API_ENDPOINT, APPLICATION_TOKEN, and OPENAI_API_KEY, as instructed in Set environment variables.
- Execute the code. You should see the question printed to the console, followed by the answer from the LLM.
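If you want to see which documents the graph retriever selects before they reach the LLM, you can also invoke the retriever on its own. This is a small optional addition to the script above; the url metadata field is the one populated in the previous section:

# Optional: run the retriever by itself to inspect the selected documents.
docs = retriever.invoke(QUESTION)
for doc in docs:
    print(doc.metadata.get("url"))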
Next steps
- Ask different questions to see how the graph retriever performs.
- Tune the start_k, select_k, and max_depth parameters to see how this affects the results; a sketch with tuned values follows this list. Note that increasing these values increases the number of documents retrieved and passed to the LLM, which increases the cost of the operation.
  - start_k is the number of documents to retrieve via vector search for starting the graph traversal. Increasing start_k can help with questions that might match multiple documents.
  - select_k is the maximum total number of documents to retrieve during graph traversal. Increasing select_k can help with questions that require broad context.
  - max_depth is the maximum traversal depth. Increasing max_depth can help with questions that require more distant connections, but might also retrieve too many loosely related documents.
- Try using a larger dataset.
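As an example of such tuning, you could change the retriever initialization in the earlier script to start from more vector-search hits and follow links two hops away. The values below are illustrative, not recommendations:

retriever = GraphRetriever(
    store=vector_store,
    edges=[("hyperlink", "url")],
    strategy=Eager(
        start_k=5,    # start the traversal from more vector-search results
        select_k=20,  # allow more documents in total
        max_depth=2,  # follow hyperlinks up to two hops from the start set
    ),
)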
Cleanup
After completing the tutorial, you can erase the tutorial data from your Astra organization:
- You can delete the entire database.
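Alternatively, if you want to keep the database and remove only the tutorial data, here is a sketch that drops the graph_rag_tutorial collection created earlier, assuming the same environment variables from Set environment variables:

import os

from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

# Connect to the collection created in this tutorial.
vector_store = AstraDBVectorStore(
    collection_name="graph_rag_tutorial",
    token=os.environ["APPLICATION_TOKEN"],
    api_endpoint=os.environ["API_ENDPOINT"],
    embedding=OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"]),
    autodetect_collection=True,
)

# Delete the collection and all documents in it.
vector_store.delete_collection()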