Introduction to ColBERT

ColBERT stands for "Contextualized Late Interaction over BERT".

"Contextualized Late Interaction" describes a method of interacting with BERT, the language model developed at Google. ColBERT itself was developed at Stanford University.

ColBERT is a machine learning retrieval model that improves the computational efficiency and contextual depth of information retrieval tasks.

How do BERT and ColBERT work together?

BERT embeds text chunks as matrices of token-level vectors, enabling much deeper context matching than a single vector embedding per chunk.
BERT manages this additional depth by pre-processing documents and queries into uniform lengths with the WordPiece tokenizer, which is ideal for batch processing on GPUs.
"Contextualized Late Interaction" first retrieves the top-k chunks with the highest similarity scores to the query tokens. These chunks are then re-ranked: the query tokens are compared to every token in each chunk, and the chunks are ordered by aggregate similarity score.

To get started using ColBERT with RAGStack and Astra DB, see the ColBERT example code.

RAGStack-ai-colbert packages

ragstack-ai-colbert contains the implementation of ColBERT retrieval.

The colbert module provides a vanilla implementation for ColBERT retrieval. It is not tied to any specific framework and can be used with any of the RAGStack packages.

To install the ragstack-ai-colbert package:

pip install ragstack-ai-colbert

To use ColBERT with LangChain or LlamaIndex, install ColBERT as an extra:

pip install "ragstack-ai-langchain[colbert]"
pip install "ragstack-ai-llamaindex[colbert]"

How is ColBERT different from RAG?

In common RAG usage, a standard embedding model represents each chunk as a single vector embedding. This is called a single-vector (pooled) embedding. A cosine similarity search matches document and query embeddings, and the top-k results are returned. This is fast and straightforward, but some context is lost because the whole chunk is compressed into one vector.
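
The single-vector search described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and toy 3-dimensional vectors. The function name `cosine_top_k` is hypothetical and is not part of RAGStack or any embedding library.

```python
import numpy as np


def cosine_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int) -> list:
    """Return the indices of the k chunks most similar to the query, best first."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q  # one score per chunk
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order]


# Toy embeddings: one vector per chunk (assumed values, for illustration only).
chunks = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([1.0, 0.1, 0.0])
top = cosine_top_k(query, chunks, k=2)
```

Each chunk contributes exactly one score, so any detail that the single vector fails to capture cannot influence the ranking.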

In the ColBERT model, each chunk is represented as a list of token-level embedding vectors. This is called a multi-vector (token-level) embedding. This per-token "bag of embeddings" within a chunk offers far deeper context than a single vector per chunk. Document embeddings are pre-computed and indexed with a uniform length to facilitate batch processing.

ColBERT queries are performed in two stages:

The query is embedded token by token, and an Approximate Nearest Neighbor (ANN) search is run for each query token vector against the indexed chunk token vectors. Recall that the chunks have an embedding for each token, so this is a token-level comparison. The chunks that contain the closest matches are returned as the top-k chunks.
Contextualized Late Interaction ranks the top-k chunks by a fine-grained similarity score. For each query token embedding, the score function takes the maximum dot product between that query token vector and all the token embeddings in the chunk. The sum of these maximum scores across all the query tokens is the overall similarity score of that chunk.

A vector index in the database significantly improves the speed of this comparison.
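
The second-stage scoring described above can be sketched as follows. This is a minimal illustration, assuming NumPy and that the candidate chunks from the first stage are already available as token matrices. The names `maxsim_score` and `rerank` are hypothetical and are not the API of ragstack-ai-colbert.

```python
import numpy as np


def maxsim_score(query_tokens: np.ndarray, chunk_tokens: np.ndarray) -> float:
    """Sum, over query tokens, of the max dot product with any chunk token."""
    sim = query_tokens @ chunk_tokens.T  # shape: (n_query_tokens, n_chunk_tokens)
    return float(sim.max(axis=1).sum())


def rerank(query_tokens: np.ndarray, candidates: dict) -> list:
    """Sort candidate chunks (id -> token matrix) by late-interaction score, best first."""
    scored = [(cid, maxsim_score(query_tokens, mat)) for cid, mat in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional token embeddings (assumed values, for illustration only).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = {
    "chunk_a": np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),
    "chunk_b": np.array([[1.0, 0.0], [0.9, 0.1]]),
}
ranking = rerank(query, candidates)
```

In this toy data, chunk_a matches both query tokens well, while chunk_b matches only the first, so chunk_a ranks higher.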

ColBERT, RAGStack, and Astra DB

The ColBERT v2.0 library transforms a text chunk into a matrix of token-level embeddings. The output is a 128-dimensional vector for each token in a chunk. This results in a two-dimensional matrix per chunk, which doesn’t align with the current LangChain embeddings interface that returns a single list of floats per text.

To solve this problem, the ragstack-ai-colbert packages and extras include new classes for mapping token-level embedding vectors to the Astra DB vector database.
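
One way to picture this mapping is to flatten each chunk's matrix into one row per token, so that every stored row holds a plain list of floats. The sketch below illustrates the idea only. The row layout and the name `token_rows` are assumptions for illustration and are not the actual schema or API of ragstack-ai-colbert.

```python
from typing import Iterator, List, Tuple

# Hypothetical row layout: (chunk_id, token_index, vector)
Row = Tuple[str, int, List[float]]


def token_rows(chunk_id: str, token_matrix: List[List[float]]) -> Iterator[Row]:
    """Yield one row per token so each row carries a single vector."""
    for token_index, vector in enumerate(token_matrix):
        yield (chunk_id, token_index, list(vector))


# Toy matrix: 3 tokens x 2 dimensions (real ColBERT vectors have 128 dimensions).
matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
rows = list(token_rows("doc1-chunk0", matrix))
```

Keeping the chunk identifier on every row is what lets a later step regroup token-level matches back into chunks for scoring.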

The CassandraVectorStore class extends the BaseVectorStore class to store and retrieve vector embeddings generated by ColBERT.

The Contextualized Late Interaction retrieval logic is defined in the ColbertRetriever class, which asynchronously retrieves and scores chunks.

Together, these classes enable ColBERT to be used with Astra DB and the RAGStack ecosystem. But why expend all this effort to use ColBERT with Astra DB?

Advantages of ColBERT on Astra DB

Our testing with ColBERT has shown that it delivers significantly better recall than single-vector encodings, but this comes at the cost of a significantly larger dataset size.

For a dataset with 10 million passages (which is ~25% of the English-language Wikipedia), the OpenAI-v3-small model requires 61.44GB, while the ColBERT model requires 768GB.

Most vector indexes can’t scale to this size due to problems with index segmentation and memory footprint. Astra DB incorporates new vector indexing techniques to address these issues.

Index segmentation problems degrade query time. Most vector databases can’t index data larger than available RAM in a single physical index, so larger logical indexes are created by splitting the dataset into memory-sized segments. The problem with this approach is that searching within a segment is a logarithmic-time operation in the number of vectors, while combining results across multiple segments is linear-time in the number of segments. So, as your dataset grows past the maximum size of a single segment, query time quickly degrades.
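
The effect can be illustrated with a toy cost model. The sketch below assumes that each segment costs roughly log2 of its size to search and that every segment must be searched. It ignores constants, caching, and result merging, so it illustrates the growth pattern and is not a benchmark of any database.

```python
import math


def segmented_query_cost(total_vectors: int, segment_capacity: int) -> float:
    """Toy cost: one logarithmic search per segment, summed across all segments."""
    n_segments = math.ceil(total_vectors / segment_capacity)
    per_segment = total_vectors / n_segments
    return n_segments * math.log2(per_segment)


# Assumed sizes, for illustration only.
single = segmented_query_cost(1_000_000, 1_000_000)        # fits in one segment
split = segmented_query_cost(16_000_000, 1_000_000)        # forced into 16 segments
one_big = segmented_query_cost(16_000_000, 16_000_000)     # one larger segment
```

Under these assumptions, 16 times more data in fixed-size segments costs 16 times more per query, while the same data in a single larger segment costs only slightly more than the original.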

Astra DB has larger-than-memory index construction that allows over an order of magnitude more vectors in a single index segment.

Memory footprint problems are expensive. Most vector databases require memory proportional to the number of vectors to serve requests. This is done with either the original full-resolution vectors, or quantized (compressed) vectors. But even 16x compression (which must be done with appropriate reranking to avoid destroying recall) requires 48GB of RAM dedicated to just the compressed vectors of a ColBERT dataset of 10M passages (about 1.5B vectors). Adding in other indexes, (graph index) edge caching, and row caching can easily require expensive 128GB server instances.
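
The sizes quoted above can be reproduced with back-of-envelope arithmetic. The sketch below assumes 4-byte float32 components, 1536 dimensions for the single-vector model, decimal gigabytes, and about 150 tokens per passage. The token count is inferred from the 1.5B vectors quoted for 10M passages.

```python
BYTES_PER_FLOAT32 = 4
PASSAGES = 10_000_000


def storage_gb(num_vectors: int, dims: int) -> float:
    """Raw size of float32 vectors in decimal gigabytes."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9


# Single-vector model: one 1536-dimensional vector per passage (assumed dimension).
single_vector_gb = storage_gb(PASSAGES, 1536)

# ColBERT: one 128-dimensional vector per token, ~150 tokens per passage (inferred).
colbert_vectors = PASSAGES * 150
colbert_gb = storage_gb(colbert_vectors, 128)

# 16x compression of the ColBERT vectors.
compressed_gb = colbert_gb / 16
```

With these assumptions, the three values come out at roughly 61.44GB, 768GB, and 48GB, matching the figures in the text.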

Astra DB has fused Asymmetric Distance Computation (ADC) graph traversal that reduces the in-memory footprint of a vector index to near-zero.

On top of these improvements, Astra DB preserves the nonblocking index structure and synchronous, real-time index updates that are the hallmark of Astra’s non-vector indexes.

DataStax has open-sourced the underlying index technology as JVector.

For more on the challenges of vector indexing at scale, see: