Intro to vector databases
In Astra DB Serverless, you can create Serverless (Vector) and Serverless (Non-Vector) databases.
Vector databases enable LLM use cases that require efficient similarity search, including Retrieval-Augmented Generation (RAG) and AI agents:
Retrieval-Augmented Generation (RAG) |
RAG is a technique for improving the accuracy of an LLM by adding relevant content directly to the LLM’s context window. Here’s how RAG works:
|
AI agents |
An AI agent provides an LLM with the ability to take different actions depending on the goal. In the preceding RAG example, a user might submit a query unrelated to your content. You can build an agent to take the necessary actions to fetch relevant content. For example, you might design an agent to run a Google search with the user’s query. It can pass the results of that search to the LLM’s context window. It can also generate embeddings, and then store both the content and the embeddings in a vector database. In this way, your agent can build a persistent memory of the world and its actions. For an example of an AI agent implementation, see Integrate Semantic Kernel with Astra DB Serverless. |
Embeddings
Embeddings are vectors, often generated by machine learning (ML) models, that capture semantic relationships between concepts or objects. Related objects are positioned close to each other in the embedding space.
Learn more about embeddings and vectors
The process of generating embeddings transforms data (such as text or images) into vectors.
Vectors are numerical representations of data in relationship to other data in the same space. The set of vectors for an entire collection of data is the embedding.
Vectors are like individual addresses, and the embedding is a map of an entire neighborhood or city. By comparing vectors, an ML model can understand the degree of similarity of the data.
For a more detailed introduction to embeddings, see the DataStax guide What are Vector Embeddings.
Preprocess embeddings
You can use feature scaling to prepare your data before you use it to train a model. Feature scaling balances or scales vectors consistently so that no one feature of your data unintentionally outweighs others.
Normalizing and standardizing are two feature scaling methods that you can use to preprocess your embeddings before writing them to a database. The method you use depends on factors like algorithm or data type.
Method | Mathematical definition | Features |
---|---|---|
Normalizing | $\hat{x} = x / \lVert x \rVert_2$ | Scale data to a length of one by dividing each element in a vector by the vector’s length, which is also known as its Euclidean norm or L2 norm. Each element of a normalized vector lies between -1 and 1. |
Standardizing | $z = (x - \mu) / \sigma$ | Shift and scale data for a mean of zero and a standard deviation of one. Standardized datasets always have a mean of 0 and a standard deviation of 1. They can also have any upper and lower values. |
If embeddings are not normalized, the dot product silently returns meaningless query results. When you use OpenAI, PaLM, or SimCSE to generate your embeddings, they are normalized by default. If you use a different library, you may need to normalize your vectors to use the dot product.
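The two feature scaling methods can be sketched in a few lines of plain Python. This is a minimal illustration, not part of any Astra DB API: the function names `l2_normalize` and `standardize` are hypothetical, and the standardization uses the population standard deviation as an assumption.

```python
import math

def l2_normalize(vector):
    """Divide each element by the vector's L2 norm so the result has length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    if std == 0.0:
        raise ValueError("cannot standardize constant values")
    return [(x - mean) / std for x in values]

unit = l2_normalize([3.0, 4.0])        # length-1 vector pointing the same way
scaled = standardize([1.0, 2.0, 3.0])  # mean 0, standard deviation 1
```

Checking that the norm of each embedding is close to 1 before writing it to the database is a cheap way to confirm whether a given model's output is already normalized.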
Chunking
Another way to prepare your data for use in ML apps is chunking.
Chunking is the process of breaking up large, usually contiguous blocks of data into smaller pieces. For example, instead of generating an embedding for one large paragraph of text, you could break the paragraph into 150-character chunks. Each chunk then has its own embedding, rather than a single embedding for the entire paragraph.
Chunking can help control costs because you pass fewer, more relevant objects to the LLM.
For more information about chunking, see Chunking: Let’s Break It Down.
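As a rough sketch of the fixed-size approach described above, the following splits a string into 150-character chunks. The function name `chunk_text` is hypothetical, and splitting purely by character count is an assumption made for brevity; production pipelines often split on sentence or token boundaries and may overlap adjacent chunks.

```python
def chunk_text(text, chunk_size=150):
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

paragraph = "word " * 80          # a 400-character stand-in for real content
chunks = chunk_text(paragraph)    # each chunk would get its own embedding
```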
Define a vector field
Vector fields use the VECTOR type with a fixed dimensionality.
The dimensionality refers to the number of floats in the vector, which could be represented as VECTOR<FLOAT, 768>.
The dimension value is defined by the embedding model you use.
Embedding models
Embedding models translate data into vectors.
It’s important to select an embedding model for your dataset that creates good structure and ensures related objects are near each other in the embedding space.
You must embed the query with the same embedding model you used for the data.
Embedding models process data differently, and some models are better suited to certain data types. You may need to test different embedding models to find the best one for your needs.
Many embedding models are available. Here are some of the most popular models:
Model | Dimensions |
---|---|
bge-large-en-v1.5 | 1024 |
bge-base-en-v1.5 | 768 |
bge-small-en-v1.5 | 384 |
distiluse-base-multilingual-cased-v2 | 512 |
e5-small-v2 | 384 |
ember-v1 | 1024 |
glove.6B.300d | 300 |
gte-large | 1024 |
gte-base | 768 |
gte-small | 384 |
instructor-xl | 768 |
jina-embeddings-v2-base-en | 768 |
komninos | 300 |
text-embedding-ada-002 | 1536 |
Embedding providers are embeddings services that can help you generate embeddings for your data.
Similarity metrics
Similarity metrics compute the similarity of two vectors.
Vectors, such as those produced by embedding algorithms, inherently reflect similarity. Similar items, such as two pieces of related text, result in embedding vectors that are similar, or near, to each other.
During a vector search, a query vector is compared against the vectors in the database. The search results represent the pieces of data, such as text or images, that are most relevant to the query vector.
Vector search results are based on similarity scores, which are calculated using a similarity metric.
When you create a collection, you can choose one of three metric types:
-
Cosine (default)
-
Dot product
-
Euclidean
Cosine and dot product are equivalent for vectors normalized to unit norm. However, if your embeddings are not normalized, do not use dot product, because it silently returns meaningless query results.
Cosine metric
When the metric is set to cosine, the database uses cosine similarity to determine the closeness (similarity) of two vectors.
Cosine similarity compares the direction that two vectors point towards, and it ignores their length (also known as norm). For this reason, cosine doesn’t require vectors to be normalized.
This computation results in a value between 0 and 1 for each pair of vectors, where 1 represents maximal similarity (vectors that point in the exact same direction).
Mathematical details for cosine similarity
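Given two vectors A and B, their cosine similarity is computed based on the dot product of the vectors divided by the product of their magnitudes (norms), rescaled to lie between 0 and 1. The formula for cosine similarity is:
$$\mathrm{sim}_{\cos}(A, B) = \frac{1}{2}\left(1 + \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}\right)$$
Where: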
-
A ⋅ B is the dot product of vectors A and B (see Dot product for a defining formula).
-
∥A∥ is the magnitude (norm) of vector A.
-
∥B∥ is the magnitude (norm) of vector B.
The resulting similarity score, ranging from 0 to 1, indicates how close (similar) one vector is to another:
-
A value of 0 indicates that the vectors are diametrically opposed, having the same direction but opposite sense, regardless of their magnitude.
-
A value of 0.5 denotes that the vectors are orthogonal (perpendicular) to each other.
-
A value of 1 indicates that the vectors are identical in both direction and sense, but they do not necessarily have the same magnitude.
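The three reference values listed above can be reproduced with a short sketch. It assumes the score is the ordinary cosine of the angle rescaled as (1 + cos θ) / 2, which is the mapping consistent with the 0, 0.5, and 1 cases described here; the function name `cosine_similarity` is hypothetical and is not an Astra DB API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, rescaled from [-1, 1] to [0, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_theta = dot / (norm_a * norm_b)
    return (1.0 + cos_theta) / 2.0  # assumed rescaling, see lead-in

same_direction = cosine_similarity([1.0, 0.0], [3.0, 0.0])   # magnitudes differ
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 2.0])
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])
```

Because the norms are divided out, the vectors in this example do not need to be normalized first.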
Dot product metric
When the metric is set to dot_product, the database uses the dot product to determine the closeness (similarity) of two vectors.
The dot product algorithm is about 50% faster than cosine, but it requires vectors to be normalized to unit norm to give meaningful results. Although most embedding models yield normalized vectors, it is a best practice to verify before switching to this similarity metric.
Mathematical details for dot product similarity
Consider two vectors A and B of dimension d. Expressed with their numeric components, these look like:
$$A = (a_1, a_2, \ldots, a_d), \qquad B = (b_1, b_2, \ldots, b_d)$$
Their dot product is calculated as follows:
$$A \cdot B = \sum_{i=1}^{d} a_i b_i$$
The dot product gives a scalar result (a number). This has important geometric implications: If the dot product is zero, then the two vectors are orthogonal (perpendicular) to each other. When the vectors are normalized to unit norm, the dot product represents the cosine of the angle between them and is always between -1 and +1.
The dot product similarity, built on the above, applies the same rescaling as for the cosine case:
$$\mathrm{sim}_{\mathrm{dot}}(A, B) = \frac{1 + A \cdot B}{2}$$
The dot product similarity is designed to work like cosine similarity for the special case of vectors with unit norm. In practice, dot product doesn’t divide by the vectors' norms, which are assumed to be equal to one, and this saves a fair amount of CPU. For vectors of arbitrary norm, however, the outcome is unpredictable and might not even be bounded in any finite numeric range.
In the context of a Serverless (Vector) database, the dot product similarity yields the same results as the cosine similarity, as long as the vectors have unit norm. In other words, if you compute the dot product similarity of two normalized vectors, you get their cosine similarity. In this case, dot product is preferred because it is computationally faster.
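The following sketch illustrates both points: for unit-norm vectors the two similarities agree, and for unnormalized vectors the dot product score can leave the zero-to-one range. The function names are hypothetical, and the (1 + x) / 2 rescaling is taken from the description above rather than from any published API.

```python
import math

def dot_product_similarity(a, b):
    """Rescaled dot product; only meaningful when a and b have unit norm."""
    return (1.0 + sum(x * y for x, y in zip(a, b))) / 2.0

def cosine_similarity(a, b):
    """Rescaled cosine; divides by the norms, so any vector length works."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return (1.0 + dot / norms) / 2.0

a = [0.6, 0.8]   # unit norm
b = [1.0, 0.0]   # unit norm
unit_dot = dot_product_similarity(a, b)
unit_cos = cosine_similarity(a, b)

# Same direction as `a`, but ten times longer: the score is no longer in [0, 1].
unnormalized_dot = dot_product_similarity([6.0, 8.0], b)
```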
Euclidean metric
When the metric is set to euclidean, the database uses the Euclidean distance to determine the closeness (similarity) of two vectors.
The Euclidean distance is the most common way of measuring the ordinary, straight-line distance between two points in Euclidean space.
Mathematical details for Euclidean similarity
Given two vectors A and B in a d-dimensional space, expressed in components as
$$A = (a_1, a_2, \ldots, a_d), \qquad B = (b_1, b_2, \ldots, b_d)$$
the Euclidean distance between them is defined by the relation:
$$\mathrm{dist}(A, B) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$
The Euclidean similarity is derived from the Euclidean distance through the following formula:
$$\mathrm{sim}_{\mathrm{euc}}(A, B) = \frac{1}{1 + \mathrm{dist}(A, B)^2}$$
The Euclidean similarity, based on the Euclidean distance, increases as the distance decreases.
In order to lie in the zero-to-one interval, and to increase as the distance decreases, Euclidean similarity features an inverse relationship with the (squared) distance, as well as a cutoff value to prevent an explosion to infinity when the two vectors approach each other.
As the Euclidean distance increases from zero to infinity, the Euclidean similarity decreases from one to zero. For vectors normalized to unit norm, the Euclidean distance always lies between zero and two: correspondingly, the similarity never drops below one-fifth.
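A small sketch can make the one-fifth lower bound concrete. It assumes the similarity is 1 / (1 + distance²), which is the form consistent with the properties described above (inverse in the squared distance, equal to one at distance zero, and one-fifth at distance two); the function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def euclidean_similarity(a, b):
    """Assumed form: 1 / (1 + squared distance), which lies in (0, 1]."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared)

identical = euclidean_similarity([1.0, 0.0], [1.0, 0.0])

# Opposite unit vectors are as far apart as unit vectors can be (distance 2).
farthest_unit = euclidean_similarity([1.0, 0.0], [-1.0, 0.0])
```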
In the context of a Serverless (Vector) database, the following apply:
- Vectors as points
-
Each vector in the database can be thought of as a point in some high-dimensional space.
- Distance between vectors
-
When you want to find how close two vectors are, the Euclidean distance is one of the most intuitive and commonly used metrics.
If two vectors have a small Euclidean distance between them, they are close in the vector space, and therefore similar according to Euclidean similarity. If they have a large Euclidean distance, they are far apart, and therefore dissimilar.
- Querying and operations
-
When you set the metric to euclidean, Astra DB uses the Euclidean distance as the metric for any operations that require comparing vectors. For instance, if you’re performing a nearest neighbor search, Astra DB returns vectors that have the smallest Euclidean distance to the query vector.
Vector search
At its core, a vector database is about efficient vector search, which allows you to find similar content. Here’s how vector search works:
-
Generate embeddings for a collection of content.
-
Generate an embedding for a new piece of content outside the original collection.
-
Use the new embedding to run a similarity search on the collection, which returns the content from the collection that is most similar to the new content.
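The three steps above can be sketched as a brute-force search over a tiny in-memory collection. Everything here is illustrative: the three-dimensional vectors are hand-written stand-ins for real model output, `vector_search` is a hypothetical helper, and a real database uses an index rather than scanning every vector.

```python
import math

def cosine(a, b):
    """Plain cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vector_search(query, collection, limit=2):
    """Return the `limit` documents whose embeddings are most similar to `query`."""
    ranked = sorted(collection.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:limit]]

# Step 1: embeddings for the existing collection (made-up values).
collection = {
    "cat care tips": [0.9, 0.1, 0.0],
    "dog training guide": [0.8, 0.3, 0.1],
    "stock market report": [0.0, 0.2, 0.9],
}

# Step 2: embedding for new content, produced by the same (hypothetical) model.
query_embedding = [1.0, 0.1, 0.0]

# Step 3: similarity search against the collection.
results = vector_search(query_embedding, collection, limit=2)
```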
To learn more about running a vector search in Astra DB Serverless, see Perform a vector search and Vector Search on Astra DB from the DataStax Developers YouTube channel.
Best practices for vector search
To use vector search effectively, you need to pair it with metadata and the right embedding model.
-
Store relevant metadata about a vector in other fields in your table. For example, if your vector is an image, store a reference to the original image in the same table.
-
Select an embedding model based on your data and the queries you will make. Embedding models exist for text, images, audio, video, and more.
Limitations of vector search
While vector embeddings can replace or augment some functions of a traditional database, vector embeddings are not a replacement for other data types. Vector search is best used as a supplement to existing search techniques because of its limitations:
-
Vector embeddings are not human-readable.
-
Embeddings are not well suited to directly retrieving data from a table. However, you can pair a vector search with a traditional search. For example, you can find the most similar blog posts by a particular author.
-
The embedding model might not be able to capture all relevant information from the data, leading to incorrect or incomplete results.
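Pairing a vector search with a traditional filter, as in the blog post example above, can be sketched as a filter on a metadata field followed by a similarity ranking. The row layout, the field names, and the helper `filtered_vector_search` are all hypothetical, and the two-dimensional embeddings are made up for illustration.

```python
import math

def cosine(a, b):
    """Plain cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_vector_search(query, rows, author, limit=1):
    """Keep rows by the given author, then rank the survivors by similarity."""
    candidates = [row for row in rows if row["author"] == author]
    candidates.sort(key=lambda row: cosine(query, row["embedding"]), reverse=True)
    return [row["title"] for row in candidates[:limit]]

rows = [
    {"title": "Intro to graphs", "author": "ana", "embedding": [0.9, 0.1]},
    {"title": "Graph indexes in depth", "author": "ben", "embedding": [1.0, 0.0]},
    {"title": "Baking bread", "author": "ana", "embedding": [0.0, 1.0]},
]

# The closest post overall is by "ben", but the filter restricts results to "ana".
hits = filtered_vector_search([1.0, 0.0], rows, author="ana", limit=1)
```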
Indexing
Astra DB uses multiple indexing techniques to speed up searches:
JVector |
Serverless (Vector) databases use the JVector vector search engine to construct graph indexes. JVector adds new documents to the graph immediately, so you can efficiently search right away. To save space and improve performance, JVector can compress vectors with quantization. For more information, see Why Vector Size Matters. |
Storage-Attached Indexing (SAI) |
Storage-Attached Indexing is an indexing technique to efficiently find rows that satisfy query predicates. Astra DB provides numeric-, text-, and vector-based indexes to support different kinds of searches. You can customize indexes based on your requirements, such as specific similarity functions or text transformations. When you run a search, SAI loads a superset of all possible results from storage based on the predicates you provide.
SAI evaluates the search criteria, sorts the results by vector similarity, and then returns the top results.
For more information about indexing, see Database limits and The indexing option.