Introduction to Graph-Based Knowledge Extraction and Traversal

RAGStack offers two libraries supporting knowledge graph extraction and traversal, ragstack-ai-knowledge-graph and ragstack-ai-knowledge-store.

A knowledge graph represents information as nodes. Nodes are connected by edges indicating relationships between them. Each edge includes the source (for example, "Marie Curie" the person), the target ("Nobel Prize" the award) and a type, indicating how the source relates to the target (for example, “won”).

A graph database isn’t required to use the knowledge graph libraries - RAGStack uses Astra DB or Apache Cassandra to store and retrieve graphs.

The ragstack-ai-knowledge-graph library offers entity-centric knowledge graph extraction and traversal. It extracts a knowledge graph from unstructured information and creates nodes from entities, or concepts (for example, "Seattle").

The ragstack-ai-knowledge-store library offers content-centric knowledge graph extraction and traversal. It extracts a knowledge graph from unstructured information and creates nodes from content (for example, a specific document about Seattle).

This feature is currently under development and has not been fully tested. It is not supported for use in production environments. Please use this feature in testing and development environments only.

Retrieval augmented generation based on chunking and vector similarity search has some weaknesses.

  1. Similarity search only looks for information most similar to the question. This makes it harder to answer questions with multiple topics, or address cases where less-similar information is still relevant to the question.

  2. Similarity search limits the number of chunks retrieved. What if there is similar information spread across multiple places? A similarity search must choose between retrieving multiple copies of the information (inefficient), or choosing only one copy (less context).

Knowledge graphs address these shortcomings. For example, if multiple sources have similar information, that knowledge is stored as one node instead of as disparate chunks.

From a developer’s perspective, a knowledge graph is built into a RAG pipeline similarly to a vector search. The difference is in the underlying data structure and how the information is stored and retrieved.

For example: consider a tech support system, where you find an article that is similar to your question, and it says. "If you have trouble with step 4, see this article for more information". Even if "more information" is not similar to your original question, it likely provides more information.

The article’s "see more information" is an example of an edge in a knowledge graph. The edge connects the initial article to additional information, indicating that the two are related. This relationship would not be captured in a similarity search.

These edges also increase the diversity of results. Within the same tech support system, if you retrieve 100 chunks that are highly similar to the question, you have retrieved 100 chunks that are also highly similar to themselves. Following edges to linked information increases diversity.

The ragstack-ai-knowledge-graph library

The ragstack-ai-knowledge-graph library contains functions for the extraction and traversal of entity-centric knowledge graphs.

To install the package, run:

pip install ragstack-ai-knowledge-graph

To install the library as an extra with the RAGStack Langchain package, run:

pip install "ragstack-ai-langchain[knowledge-graph]"

For more information, see Knowledge Graph RAG.

The ragstack-ai-knowledge-store library

The ragstack-ai-knowledge-store library contains functions for creating a content-centric vector-and-graph store. This store combines the benefits of vector stores with the context and relationships of a related edges.

To install the package, run:

pip install ragstack-ai-knowledge-store

To install the library as an extra with the RAGStack Langchain package, run:

pip install "ragstack-ai-langchain[knowledge-store]"

For more information, see Graph Store.

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com