Introduction to Graph-Based Knowledge Extraction and Traversal

The ragstack-ai-knowledge-graph and ragstack-ai-knowledge-store libraries are no longer under development.

Instead, you can find the latest tools and techniques for working with knowledge graphs and graph RAG in the Graph RAG project.

If you have further questions, contact DataStax Support.

A knowledge graph represents information as nodes connected by edges that indicate the relationships between them. Each edge includes a source (for example, "Marie Curie", the person), a target ("Nobel Prize", the award), and a type that indicates how the source relates to the target (for example, "won").
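As a rough illustration of the structure described above, the following sketch models an edge as a source, a target, and a type. The class and method names (Edge, KnowledgeGraph, add_edge) are assumptions made for this example and are not the API of any particular library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str  # node the relationship starts from
    target: str  # node the relationship points to
    type: str    # how the source relates to the target


@dataclass
class KnowledgeGraph:
    # Hypothetical in-memory container; a real store would persist these.
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, source, target, edge_type):
        # Adding an edge also registers both endpoints as nodes.
        self.nodes.update((source, target))
        self.edges.add(Edge(source, target, edge_type))


graph = KnowledgeGraph()
graph.add_edge("Marie Curie", "Nobel Prize", "won")
```

Because Edge is frozen and hashable, adding the same relationship twice leaves a single edge in the set.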

Retrieval augmented generation based on chunking and vector similarity search has some weaknesses.

  1. Similarity search only looks for information most similar to the question. This makes it harder to answer questions with multiple topics, or address cases where less-similar information is still relevant to the question.

  2. Similarity search limits the number of chunks retrieved. What if there is similar information spread across multiple places? A similarity search must choose between retrieving multiple copies of the information (inefficient), or choosing only one copy (less context).
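The second weakness can be seen in a small sketch of top-k retrieval. The chunk names and two-dimensional vectors below are invented for illustration, and real embeddings have far more dimensions. With k set to 2, both result slots go to near-identical copies, and the less-similar but still relevant chunk is left out.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k):
    # Rank chunk names by similarity to the query and keep the best k.
    ranked = sorted(chunks, key=lambda name: cosine(query, chunks[name]), reverse=True)
    return ranked[:k]


# Hypothetical chunks and embeddings.
chunks = {
    "reset-password (copy 1)": [0.9, 0.1],
    "reset-password (copy 2)": [0.9, 0.1],  # same content found in a second place
    "account-lockout-policy": [0.6, 0.8],   # less similar, but still relevant
    "billing-faq": [0.0, 1.0],
}

results = top_k([1.0, 0.0], chunks, k=2)
```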

Knowledge graphs address these shortcomings. For example, if multiple sources have similar information, that knowledge is stored as one node instead of as disparate chunks.
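A minimal sketch of this deduplication, assuming facts have already been extracted as (source, type, target) triples: identical triples from different documents collapse into one entry that records where it came from. The document IDs and the triples themselves are made up for the example.

```python
from collections import defaultdict

# Hypothetical facts extracted from three documents.
extracted = [
    ("doc-a", ("Marie Curie", "won", "Nobel Prize")),
    ("doc-b", ("Marie Curie", "won", "Nobel Prize")),  # same fact, second source
    ("doc-c", ("Marie Curie", "born_in", "Warsaw")),
]

# Map each distinct fact to the set of documents that support it.
facts = defaultdict(set)
for doc_id, triple in extracted:
    facts[triple].add(doc_id)

# Each entity appears once as a node, however many documents mention it.
nodes = {name for source, _, target in facts for name in (source, target)}
```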

From a developer’s perspective, a knowledge graph is built into a RAG pipeline similarly to a vector search. The difference is in the underlying data structure and how the information is stored and retrieved.

For example, consider a tech support system where you find an article that is similar to your question, and it says, "If you have trouble with step 4, see this article for more information." Even if the linked article is not similar to your original question, it likely contains information that is relevant to it.

The article’s HTML links are examples of edges in a knowledge graph. These edges connect the initial article to additional information, indicating that they are related. This relationship would not be captured in a vector similarity search.
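One way to turn HTML links into edges is to parse the anchor tags in each article, sketched here with Python's standard html.parser module. The URLs, the link_edges helper, and the "links_to" edge type are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_edges(source_url, html):
    # Each hyperlink becomes a (source, type, target) edge.
    parser = LinkExtractor()
    parser.feed(html)
    return [(source_url, "links_to", href) for href in parser.hrefs]


# Hypothetical article body and URLs.
article = '<p>If you have trouble with step 4, see <a href="/kb/step-4">this article</a>.</p>'
edges = link_edges("/kb/install", article)
```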

These edges also increase the diversity of results. Within the same tech support system, if you retrieve 100 chunks that are highly similar to the question, you have retrieved 100 chunks that are also highly similar to each other. Following edges to linked information increases diversity.

How is Knowledge Graph RAG different from RAG?

Short answer: it isn’t. Knowledge graphs are a method of doing RAG, but with a different representation of the information.

RAG with similarity search creates a vector representation of information based on chunks of text. The question is compared to the stored chunks, and the most similar chunks are returned as context for the answer.

Knowledge graph RAG extracts a knowledge graph from information, and stores the graph representation in a vector or graph knowledge store.

Instead of a similarity search query, the graph store is traversed to extract a sub-graph of the knowledge graph’s edges and properties. For example, a query for "Marie Curie" returns a sub-graph of nodes representing her relationships, accomplishments, and other relevant information. That sub-graph is the context.

You’re telling the graph store to "start with this node, and show me the relationships to a depth of 2 nodes outwards."
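A depth-limited traversal like this can be sketched as a breadth-first search over the edges. The edge list and the subgraph function below are illustrative assumptions and not a specific graph store's query API.

```python
from collections import deque

# Illustrative (source, type, target) edges.
edges = [
    ("Marie Curie", "won", "Nobel Prize"),
    ("Marie Curie", "married_to", "Pierre Curie"),
    ("Pierre Curie", "won", "Nobel Prize"),
    ("Nobel Prize", "awarded_by", "Royal Swedish Academy of Sciences"),
    ("Royal Swedish Academy of Sciences", "located_in", "Stockholm"),
]


def subgraph(edges, start, depth):
    """Return the edges reachable from start within depth hops, following edges outward."""
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)

    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand nodes that sit at the depth limit
        for edge in outgoing.get(node, []):
            found.append(edge)
            target = edge[2]
            if target not in seen:
                seen.add(target)
                queue.append((target, dist + 1))
    return found


context = subgraph(edges, "Marie Curie", depth=2)
```

With a depth of 2, the edge from the academy to Stockholm is excluded because it starts three hops away from the starting node.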

What’s the difference between entity-centric and content-centric knowledge graphs?

Entity-centric knowledge graphs capture edge relationships between entities. A knowledge graph is extracted with an LLM from unstructured information, and its entities and their edge relationships are stored in a vector or graph store.

However, extracting this entity-centric knowledge graph from unstructured information is difficult, time-consuming, and error-prone. A user has to guide the LLM on the kinds of nodes and relationships to be extracted with a schema, and if the knowledge schema changes, the graph has to be processed again. The context advantages of entity-centric knowledge graphs are great, but the cost to build and maintain them is much higher than just chunking and embedding content to a vector store.

Content-centric knowledge graphs offer a compromise between the ease and scalability of vector similarity search, and the context and relationships of entity-centric knowledge graphs.

The content-centric approach starts with nodes that represent content (a specific document about Seattle), instead of concepts or entities (a node representing Seattle). A node may represent a table, an image, or a section of a document. Since the node represents the original content, the nodes are exactly what is stored when using vector search.

Unstructured content is loaded, chunked, and written to a vector store. Each chunk can be run through a variety of analyses to identify links. For example, links in the content may turn into links_to edges, and keywords may be extracted from the chunk to link up with other chunks on the same topic.

To add edges, each chunk may be annotated with URLs that its content represents, or each chunk may be associated with keywords.
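As a sketch of the keyword approach, the following links any two chunks that share at least one keyword. The chunk IDs and keywords are hypothetical, and keyword extraction itself is assumed to have already happened.

```python
from itertools import combinations

# Hypothetical chunks, each already annotated with extracted keywords.
chunk_keywords = {
    "chunk-1": {"seattle", "coffee"},
    "chunk-2": {"seattle", "rain"},
    "chunk-3": {"espresso", "coffee"},
    "chunk-4": {"hiking"},
}


def keyword_edges(chunk_keywords):
    # Connect every pair of chunks that has at least one keyword in common.
    edges = []
    for a, b in combinations(sorted(chunk_keywords), 2):
        shared = chunk_keywords[a] & chunk_keywords[b]
        if shared:
            edges.append((a, b, sorted(shared)))
    return edges


edges = keyword_edges(chunk_keywords)
```

Comparing every pair is quadratic in the number of chunks, so a larger corpus would more likely index chunks by keyword and look up neighbors at query time.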

Retrieval is where the benefits of vector search and content-centric traversal come together. The query’s initial starting points in the knowledge graph are identified based on vector similarity to the question, and then additional chunks are selected by following edges from those nodes. Including nodes that are related both by embedding distance (similarity) and graph distance (related) leads to a more diverse set of chunks with deeper context and fewer hallucinations.
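The two-stage retrieval described above can be sketched as follows: pick the k most similar chunks as starting points, then follow edges outward to a fixed depth. The retrieve function, its parameters, and the sample embeddings and links are assumptions for illustration only.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, embeddings, links, k=1, depth=1):
    # Stage 1: starting points chosen by vector similarity.
    seeds = sorted(embeddings, key=lambda c: cosine(query, embeddings[c]), reverse=True)[:k]
    selected = list(seeds)
    frontier = seeds
    # Stage 2: add chunks reached by following edges from the current frontier.
    for _ in range(depth):
        next_frontier = []
        for chunk in frontier:
            for neighbor in links.get(chunk, []):
                if neighbor not in selected:
                    selected.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return selected


# Hypothetical chunk embeddings and outgoing links.
embeddings = {
    "install-guide": [0.9, 0.1],
    "step-4-troubleshooting": [0.1, 0.9],
    "release-notes": [0.5, 0.5],
}
links = {"install-guide": ["step-4-troubleshooting"]}

chunks = retrieve([1.0, 0.0], embeddings, links, k=1, depth=1)
```

Here the troubleshooting chunk is the least similar to the query, yet it is retrieved because the most similar chunk links to it.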


© 2025 DataStax