ColBERT in RAGStack with Astra

colab badge

Use ColBERT, Astra DB, and RAGStack to:

  1. Create ColBERT embeddings

  2. Index embeddings on Astra

  3. Retrieve embeddings with RAGStack and Astra

  4. Use the LangChain ColBERT retriever plugin

Prerequisites

Import the ragstack-ai-colbert package:

pip install ragstack-ai-colbert

Prepare data and create embeddings

  1. Prepare documents for chunking.

    arctic_botany_dict = {
        "Introduction to Arctic Botany": "Arctic botany is the study of plant life in the Arctic, a region characterized by extreme cold, permafrost, and minimal sunlight for much of the year. Despite these harsh conditions, a diverse range of flora thrives here, adapted to survive with minimal water, low temperatures, and high light levels during the summer. This introduction aims to shed light on the resilience and adaptation of Arctic plants, setting the stage for a deeper dive into the unique botanical ecosystem of the Arctic.",
        "Arctic Plant Adaptations": "Plants in the Arctic have developed unique adaptations to endure the extreme climate. Perennial growth, antifreeze proteins, and a short growth cycle are among the evolutionary solutions. These adaptations not only allow the plants to survive but also to reproduce in short summer months. Arctic plants often have small, dark leaves to absorb maximum sunlight, and some species grow in cushion or mat forms to resist cold winds. Understanding these adaptations provides insights into the resilience of Arctic flora.",
        "The Tundra Biome": "The Arctic tundra is a vast, treeless biome where the subsoil is permanently frozen. Here, the vegetation is predominantly composed of dwarf shrubs, grasses, mosses, and lichens. The tundra supports a surprisingly rich biodiversity, adapted to its cold, dry, and windy conditions. The biome plays a crucial role in the Earth's climate system, acting as a carbon sink. However, it's sensitive to climate change, with thawing permafrost and shifting vegetation patterns.",
        "Arctic Plant Biodiversity": "Despite the challenging environment, the Arctic boasts a significant variety of plant species, each adapted to its niche. From the colorful blooms of Arctic poppies to the hardy dwarf willows, these plants form a complex ecosystem. The biodiversity of Arctic flora is vital for local wildlife, providing food and habitat. This diversity also has implications for Arctic peoples, who depend on certain plant species for food, medicine, and materials.",
        "Climate Change and Arctic Flora": "Climate change poses a significant threat to Arctic botany, with rising temperatures, melting permafrost, and changing precipitation patterns. These changes can lead to shifts in plant distribution, phenology, and the composition of the Arctic flora. Some species may thrive, while others could face extinction. This dynamic is critical to understanding future Arctic ecosystems and their global impact, including feedback loops that may exacerbate global warming.",
        "Research and Conservation in the Arctic": "Research in Arctic botany is crucial for understanding the intricate balance of this ecosystem and the impacts of climate change. Scientists conduct studies on plant physiology, genetics, and ecosystem dynamics. Conservation efforts are focused on protecting the Arctic's unique biodiversity through protected areas, sustainable management practices, and international cooperation. These efforts aim to preserve the Arctic flora for future generations and maintain its role in the global climate system.",
        "Traditional Knowledge and Arctic Botany": "Indigenous peoples of the Arctic have a deep connection with the land and its plant life. Traditional knowledge, passed down through generations, includes the uses of plants for nutrition, healing, and materials. This body of knowledge is invaluable for both conservation and understanding the ecological relationships in Arctic ecosystems. Integrating traditional knowledge with scientific research enriches our comprehension of Arctic botany and enhances conservation strategies.",
        "Future Directions in Arctic Botanical Studies": "The future of Arctic botany lies in interdisciplinary research, combining traditional knowledge with modern scientific techniques. As the Arctic undergoes rapid changes, understanding the ecological, cultural, and climatic dimensions of Arctic flora becomes increasingly important. Future research will need to address the challenges of climate change, explore the potential for Arctic plants in biotechnology, and continue to conserve this unique biome. The resilience of Arctic flora offers lessons in adaptation and survival relevant to global challenges."
    }
    arctic_botany_chunks = list(arctic_botany_dict.values())
  2. Start the ColBERT configuration and create embeddings.

    from ragstack_colbert import ColbertEmbeddingModel, ChunkData
    # colbert stuff starts
    colbert = ColbertEmbeddingModel()
    
    chunks = [ChunkData(text=text, metadata={}) for text in arctic_botany_chunks]
    
    embedded_chunks = colbert.embed_chunks(chunks=chunks, doc_id="arctic botany")
  3. Examine the embeddings.

    assert len(embedded_chunks) == 8

Create a vector store in Astra

  1. Ingest embeddings and create a vector store in Astra.

    from ragstack_colbert import CassandraVectorStore
    from getpass import getpass
    import cassio
    
    keyspace="default_keyspace"
    database_id=getpass("Enter your Astra Database Id:")
    astra_token=getpass("Enter your Astra Token:")
    
    cassio.init(token=astra_token, database_id=database_id, keyspace=keyspace)
    session=cassio.config.resolve_session()
    
    db = CassandraVectorStore(
        keyspace = keyspace,
        table_name="colbert_embeddings4",
        session = session,
    )
  2. Create an index in Astra.

    db.put_chunks(chunks=embedded_chunks, delete_existing=True)

Retrieve embeddings from the Astra index

Create a RAGStack retriever and ask questions against the indexed embeddings. The library provides:

  • Embed query tokens

  • Generate candidate documents using Astra ANN search

  • Max similarity scoring

  • Ranking

import logging
import nest_asyncio
nest_asyncio.apply()

logging.getLogger('cassandra').setLevel(logging.ERROR) # workaround to suppress logs
from ragstack_colbert import ColbertRetriever
retriever = ColbertRetriever(
    vector_store=db, embedding_model=colbert
)

answers = retriever.retrieve("What's artic botany", k=2)
for answer in answers:
  print(f"Rank: {answer.rank} Score: {answer.score} Text: {answer.data.text}\n")
Result
Rank: 1 Score: 35.225955963134766 Text: Arctic botany is the study of plant life in the Arctic, a region characterized by extreme cold, permafrost, and minimal sunlight for much of the year. Despite these harsh conditions, a diverse range of flora thrives here, adapted to survive with minimal water, low temperatures, and high light levels during the summer. This introduction aims to shed light on the resilience and adaptation of Arctic plants, setting the stage for a deeper dive into the unique botanical ecosystem of the Arctic.

Rank: 2 Score: 29.655662536621094 Text: Research in Arctic botany is crucial for understanding the intricate balance of this ecosystem and the impacts of climate change. Scientists conduct studies on plant physiology, genetics, and ecosystem dynamics. Conservation efforts are focused on protecting the Arctic's unique biodiversity through protected areas, sustainable management practices, and international cooperation. These efforts aim to preserve the Arctic flora for future generations and maintain its role in the global climate system.

LangChain retriever

Alternatively, use the ColBERT extra with your RAGStack package to retrieve documents.

  1. Install the RAGStack LangChain package with the ColBERT extra.

    pip install ragstack-ai-langchain[colbert]
  2. Run the LangChain retriever against the indexed embeddings.

    from ragstack_langchain.colbert import ColbertLCRetriever
    
    lc_retriever = ColbertLCRetriever(retriever, k=2)
    docs = lc_retriever.get_relevant_documents("what kind fish lives shallow coral reefs atlantic, india ocean, red sea, gulf of mexico, pacific, and arctic ocean")
    print(f"first answer: {docs[0].page_content}")
    Result
    ....
    first answer: Despite the challenging environment, the Arctic boasts a significant variety of plant species, each adapted to its niche. From the colorful blooms of Arctic poppies to the hardy dwarf willows, these plants form a complex ecosystem. The biodiversity of Arctic flora is vital for local wildlife, providing food and habitat. This diversity also has implications for Arctic peoples, who depend on certain plant species for food, medicine, and materials.
    ....

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com