Introduction to vector databases

DataStax Enterprise (DSE) 6.9 introduces a new type of database that enables you to store and search high-dimensional vectors. Vector databases enable use cases that require efficient similarity search.

Data stored in a database is useful, but the context of that data is critical to applications. Vector Search compares stored data by similarity to discover connections in the data that may not be explicitly defined.

Data representation

Vector search relies on representing data points as high-dimensional vectors. The choice of vector representation depends on the nature of the data.

For data that consists of text documents, techniques like word embeddings (e.g., Word2Vec) or document embeddings (e.g., Doc2Vec) can be used to convert text into vectors. Embeddings can also be generated with more complex models, such as Large Language Models (LLMs) like OpenAI GPT-4 or Meta LLaMA 2.

Word2Vec is a relatively simple model that uses a shallow neural network to learn embeddings for words based on their context. The key concept is that Word2Vec generates a single fixed vector for each word, regardless of the context in which the word is used.

LLMs are much more complex models that use deep neural networks, specifically transformer architectures, to learn embeddings for words based on their context. Unlike Word2Vec, these models generate contextual embeddings, meaning the same word can have different embeddings depending on the context in which it is used.

Images can be represented using deep learning techniques like convolutional neural networks (CNNs) or pre-trained models such as Contrastive Language Image Pre-training (CLIP). Select a vector representation that captures the essential features of the data.
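
As a minimal sketch, text embeddings can be generated with the sentence-transformers library. The library and model name here are one choice among many (any embedding library or hosted API works similarly):

    # Sketch: generating text embeddings with sentence-transformers (an assumption;
    # substitute any embedding library or hosted embedding API).
    from sentence_transformers import SentenceTransformer

    # bge-small-en-v1.5 produces 384-dimensional vectors; other models differ.
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    sentences = [
        "Vector databases store high-dimensional vectors.",
        "Similarity search finds related content.",
    ]
    embeddings = model.encode(sentences)
    print(embeddings.shape)  # (2, 384)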

Embeddings

Embeddings are vectors, often generated by machine learning models, that capture semantic relationships between concepts or objects. Related objects are positioned close to each other in the embedding space.

Preprocess embeddings

You may need to normalize or standardize your vectors before writing them to the database.

Two common methods are described below, along with their key features:

Normalizing

Scale data to a length of one by dividing each element in a vector by the vector’s length, which is also known as its Euclidean norm or L2 norm.

  • Eliminates the impact of vector scale.

  • Makes high-dimensional data consistent.

  • Allows you to use the dot product similarity metric, which is about 50% faster than cosine.

Standardizing

Shift and scale data for a mean of zero and a standard deviation of one.

  • Gives vectors the properties of a Gaussian distribution.

  • Ensures all features contribute equally to distance calculations.

If embeddings are not normalized, the dot product silently returns meaningless query results.

When you use OpenAI, PaLM, or SimCSE to generate your embeddings, they are normalized by default. If you use a different library, you may need to normalize your vectors before using the dot product metric.

An example of normalizing vectors is shown below:

Original:

[0.1, 0.15, 0.3, 0.12, 0.05]
[0.45, 0.09, 0.01, 0.2, 0.11]
[0.1, 0.05, 0.08, 0.3, 0.6]

Normalized:

[0.27, 0.40, 0.80, 0.32, 0.13]
[0.88, 0.18, 0.02, 0.39, 0.21]
[0.15, 0.07, 0.12, 0.44, 0.88]
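
A minimal NumPy sketch of L2 normalization that reproduces the results above:

    import numpy as np

    vectors = np.array([
        [0.1, 0.15, 0.3, 0.12, 0.05],
        [0.45, 0.09, 0.01, 0.2, 0.11],
        [0.1, 0.05, 0.08, 0.3, 0.6],
    ])

    # Divide each vector by its Euclidean (L2) norm so every row has length 1.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    normalized = vectors / norms

    print(np.round(normalized, 2))
    # [[0.27 0.4  0.8  0.32 0.13]
    #  [0.88 0.18 0.02 0.39 0.21]
    #  [0.15 0.07 0.12 0.44 0.88]]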

Define a vector field

It’s important to define the right type and embedding model for your vector fields.

Type

Vector fields use the VECTOR type with a fixed dimensionality. The dimensionality is the number of floats in the vector; for example, a 768-dimensional field is declared as VECTOR<FLOAT, 768>. The dimension value is defined by the embedding model you use.
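
For example, a table with a 768-dimensional vector column might be created as in the following sketch, which uses the DataStax Python driver; the contact point, keyspace, and table names are hypothetical:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # assumed contact point
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # The dimension (768 here) must match the embedding model,
    # e.g. bge-base-en-v1.5 from the table below.
    session.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id int PRIMARY KEY,
            description text,
            embedding VECTOR<FLOAT, 768>
        )
    """)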

Embedding model

Select an embedding model for your dataset that creates good structure by ensuring related objects are near each other in the embedding space. You may need to test different embedding models. You must embed the query with the same embedding model you used for the data.

There are many embedding models. Here are some of the most popular models to get you started:

Model                                  Dimensions  Link
bge-large-en-v1.5                      1024        huggingface.co
bge-base-en-v1.5                       768         huggingface.co
bge-small-en-v1.5                      384         huggingface.co
distiluse-base-multilingual-cased-v2   512         huggingface.co
e5-small-v2                            384         huggingface.co
ember-v1                               1024        huggingface.co
glove.6B.300d                          300         huggingface.co
gte-large                              1024        huggingface.co
gte-base                               768         huggingface.co
gte-small                              384         huggingface.co
instructor-xl                          768         huggingface.co
jina-embeddings-v2-base-en             768         jina.ai
komninos                               300         github.com
text-embedding-ada-002                 1536        openai.com

Similarity metrics

Similarity metrics are used to compute the similarity of two vectors. When you create a vector index, you can choose one of three metrics:

Cosine and dot product are equivalent for normalized vectors.

However, if your embeddings are not normalized, don't use dot product: it will silently return meaningless query results.

Cosine metric

When the metric is set to cosine, the database uses cosine similarity to determine how similar two vectors are. Cosine does not require vectors to be normalized.

Given two vectors A and B, the cosine similarity is computed as the dot product of the vectors divided by the product of their magnitudes (lengths). The formula for cosine similarity is:

\[ \text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} \]

Where:

  • A⋅B is the dot product of vectors A and B.

  • ∥A∥ is the magnitude of vector A.

  • ∥B∥ is the magnitude of vector B.

When returned by DSE 6.9, the result is a similarity score, a number between 0 and 1:

  • A value of 0 indicates that the vectors are diametrically opposed.

  • A value of 0.5 indicates the vectors are orthogonal (perpendicular), meaning they are unrelated.

  • A value of 1 indicates that the vectors are identical in direction.
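
A minimal NumPy sketch of computing cosine similarity and scaling it to the 0-to-1 score described above (the (1 + cos θ) / 2 scaling is an assumption consistent with the score ranges given):

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Raw cosine similarity in [-1, 1]: dot product divided by the
        # product of the vectors' magnitudes.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
        # Scale from [-1, 1] to the [0, 1] score range described above.
        return (1.0 + cosine_similarity(a, b)) / 2.0

    a = np.array([0.1, 0.15, 0.3, 0.12, 0.05])
    b = np.array([0.45, 0.09, 0.01, 0.2, 0.11])
    print(cosine_score(a, b))  # closer to 1 means more similar in direction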

Dot product metric

When the metric is set to dot_product, the database uses the dot product to determine how similar two vectors are. The dot product algorithm is about 50% faster than cosine, but it requires vectors to be normalized.

Given two vectors

\[ A = (a_1, a_2, \ldots, a_n), \quad B = (b_1, b_2, \ldots, b_n) \]

in an n-dimensional space, their dot product is calculated as:

\[ A \cdot B = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{i=1}^{n} a_i b_i \]

The dot product gives a scalar (single number) result. It has important geometric implications: if the dot product is zero, the two vectors are orthogonal (perpendicular) to each other. When the vectors are normalized, the dot product represents the cosine of the angle between the two vectors.

In the context of a DSE database, the dot product can be used for similarity searches for the following reasons:

  • In high-dimensional vector spaces, such as those produced by embedding algorithms or neural networks, similar items are represented by vectors that are close to each other.

  • The cosine similarity between two vectors is a measure of their directional similarity, regardless of their magnitude. If you compute the dot product of two normalized vectors, you get the cosine similarity.

By computing the dot product between a query vector and the vectors in a DSE database, you can efficiently find items in the database that are similar to the query.
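
The following sketch illustrates why normalization matters for the dot product: once both vectors have length 1, their dot product equals their cosine similarity.

    import numpy as np

    a = np.array([0.1, 0.15, 0.3, 0.12, 0.05])
    b = np.array([0.45, 0.09, 0.01, 0.2, 0.11])

    # Unnormalized: the dot product mixes direction with magnitude.
    print(np.dot(a, b))

    # Normalized: the dot product equals the cosine of the angle between a and b.
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    assert np.isclose(np.dot(a_unit, b_unit), cosine)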

Euclidean metric

When the metric is set to euclidean, the database uses the Euclidean distance to determine how similar two vectors are. The Euclidean distance is the most common way of measuring the "ordinary" straight-line distance between two points in Euclidean space.

Given two points P and Q in an n-dimensional space with the following coordinates:

\[ P = (p_1, p_2, \ldots, p_n), \quad Q = (q_1, q_2, \ldots, q_n) \]

The Euclidean distance between these two points is defined by the following formula:

\[ d(P, Q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \]

The Euclidean similarity value is derived from the Euclidean distance with the following formula:

\[ \text{similarity}(P, Q) = \frac{1}{1 + d(P, Q)^2} \]

As the Euclidean distance increases from zero to infinity, the Euclidean similarity decreases from one to zero.
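
A short sketch of the distance and the derived similarity, using the 1 / (1 + d²) form reconstructed above:

    import numpy as np

    def euclidean_similarity(p: np.ndarray, q: np.ndarray) -> float:
        # Straight-line distance between the two points.
        distance = np.linalg.norm(p - q)
        # Map a [0, infinity) distance to a (0, 1] similarity, as described above.
        return 1.0 / (1.0 + distance**2)

    p = np.array([0.1, 0.15, 0.3])
    q = np.array([0.45, 0.09, 0.01])
    print(euclidean_similarity(p, q))  # 1.0 means identical points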

In the context of a DSE database, the following apply:

Vectors as points

Each vector in the database can be thought of as a point in some high-dimensional space.

Distance between vectors

When you want to find how "close" two vectors are, the Euclidean distance is one of the most intuitive and commonly used metrics. If two vectors have a small Euclidean distance between them, they are close in the vector space; if they have a large Euclidean distance, they are far apart.

Querying and operations

When you set the metric to euclidean, DSE can use the Euclidean distance as the metric for any operations that require comparing vectors. For instance, if you’re performing a nearest neighbor search, DSE returns vectors that have the smallest Euclidean distance to the query vector.

At its core, a vector database is about efficient vector search, which allows you to find similar content.

Here’s how vector search works:

  1. Create a collection of embeddings for some content.

  2. Pick a new piece of content.

  3. Generate an embedding for that piece of content.

  4. Run a similarity search on the collection.

You’ll get a list of the content in your collection with embeddings that are most similar to this new content.
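
As a minimal illustration of these steps, here is a brute-force version in NumPy. A real vector database replaces the exhaustive scan with an approximate nearest neighbor (ANN) index, and the vectors here are hypothetical placeholders for embedding-model output:

    import numpy as np

    # Step 1: a collection of embeddings for existing content (placeholder
    # values; in practice these come from an embedding model).
    collection = {
        "post-1": np.array([0.27, 0.40, 0.80, 0.32, 0.13]),
        "post-2": np.array([0.88, 0.18, 0.02, 0.39, 0.21]),
        "post-3": np.array([0.15, 0.07, 0.12, 0.44, 0.88]),
    }

    # Steps 2-3: embed the new piece of content (placeholder vector here),
    # normalized so the dot product equals cosine similarity.
    query = np.array([0.30, 0.35, 0.75, 0.30, 0.15])
    query = query / np.linalg.norm(query)

    # Step 4: rank the collection by similarity to the query.
    ranked = sorted(
        collection.items(),
        key=lambda item: float(np.dot(item[1], query)),
        reverse=True,
    )
    for content_id, _ in ranked:
        print(content_id)  # most similar first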

To use vector search effectively, you need to pair it with metadata and the right embedding model.

  • Store relevant metadata about a vector in other fields in your table. For example, if your vector is an image, store a reference to the original image in the same table.

  • Select an embedding model based on your data and the queries you will make. Embedding models exist for text, images, audio, video, and more.

While vector embeddings can replace or augment some functions of a traditional database, vector embeddings are not a replacement for other data types. Vector search is best used as a supplement to existing search techniques because of its limitations:

  • Vector embeddings are not human-readable.

  • Embeddings are not well suited to directly retrieving data from a table. However, you can pair a vector search with a traditional search. For example, you can find the most similar blog posts by a particular author.

  • The embedding model might not be able to capture all relevant information from the data, leading to incorrect or incomplete results.

Indexing

DataStax Enterprise (DSE) uses multiple indexing techniques to speed up searches:

JVector

The DSE database uses the JVector vector search engine to construct a graph index. JVector adds new documents to the graph immediately, so you can efficiently search right away. To save space and improve performance, JVector can compress vectors with quantization.

Storage-Attached Index (SAI)

SAI is an indexing technique to efficiently find rows that satisfy query predicates. DataStax Enterprise (DSE) provides numeric-, text-, and vector-based indexes to support different kinds of searches. You can customize indexes based on your requirements (e.g. a specific similarity function or text transformation).

When you run a search, SAI loads a superset of all possible results from storage based on the predicates you provide. SAI then evaluates the search criteria, sorts the results by vector similarity, and returns the top results, up to the query's LIMIT, to the user.
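
Continuing the hypothetical table from earlier, the following sketch creates an SAI vector index and runs an ANN search with the Python driver. The index options and query syntax are shown as commonly documented for SAI vector search; check your DSE version's reference for the exact syntax:

    # Create an SAI index on the vector column; the similarity function is
    # configurable (cosine, dot_product, or euclidean).
    session.execute("""
        CREATE CUSTOM INDEX IF NOT EXISTS products_embedding_idx
        ON products (embedding)
        USING 'StorageAttachedIndex'
        WITH OPTIONS = {'similarity_function': 'cosine'}
    """)

    # ANN search: return the rows whose vectors are most similar to the query.
    query_vector = [0.1] * 768  # placeholder; use the same embedding model as the data
    rows = session.execute(
        "SELECT id, description FROM products ORDER BY embedding ANN OF %s LIMIT 5",
        [query_vector],
    )
    for row in rows:
        print(row.id, row.description)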

Common use cases

Vector search is important for LLM use cases, including Retrieval-Augmented Generation (RAG) and AI agents.

Retrieval-Augmented Generation (RAG)

RAG is a technique for improving the accuracy of an LLM. RAG accomplishes this by adding relevant content directly to the LLM’s context window. Here’s how it works:

  1. Pick an embedding model.

  2. Generate embeddings from your data.

  3. Store these embeddings in a vector database.

  4. When the user submits a query, generate an embedding from the query using the same model.

  5. Run a vector search to find data that’s similar to the user’s query.

  6. Pass this data to the LLM so it’s available in the context window.

Now, when the LLM generates a response, it is less likely to make things up (hallucinate).

The RAGStack example using DSE demonstrates how to use vector search to improve the accuracy of an LLM.
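
A highly simplified sketch of the RAG flow, where embed(), vector_search(), and llm() are hypothetical stand-ins for your embedding model, vector database query, and LLM call:

    def answer_with_rag(question: str) -> str:
        # Steps 4-5: embed the query with the same model used for the data,
        # then retrieve the most similar stored content.
        query_embedding = embed(question)                        # hypothetical embedding call
        context_docs = vector_search(query_embedding, limit=5)   # hypothetical DB query

        # Step 6: put the retrieved content into the LLM's context window.
        prompt = (
            "Answer the question using only the context below.\n\n"
            "Context:\n" + "\n".join(context_docs) + "\n\n"
            "Question: " + question
        )
        return llm(prompt)  # hypothetical LLM call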

AI agents

An AI agent provides an LLM with the ability to take different actions depending on the goal. In the preceding RAG example, a user might submit a query unrelated to your content. You can build an agent to take the necessary actions to fetch relevant content.

For example, you might design an agent to run a Google search with the user’s query. It can pass the results of that search to the LLM’s context window. It can also generate embeddings and store both the content and the embeddings in a vector database. In this way, your agent can build a persistent memory of the world and its actions.
