This page is a collection of terms and concepts used to describe vector databases within the context of Generative AI, with links for further reading.


Agent

A system that can make decisions based on its inputs or environment. Intelligent agents, such as those employed in machine learning and artificial intelligence systems, often use vector databases for fast search, comparison, and retrieval of high-dimensional data. Read More

Approximate Nearest Neighbors (ANN) search

A method used to quickly find approximate nearest neighbors in large datasets, often with high-dimensional features, sacrificing some accuracy for speed. Read More
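To make the speed-for-accuracy trade concrete, here is a minimal sketch of one ANN technique, random-hyperplane locality-sensitive hashing. It is an illustration only, not how any particular vector database implements ANN; the function names (`make_hyperplanes`, `signature`, `build_buckets`, `ann_query`) are hypothetical.

```python
import math
import random

def make_hyperplanes(dim, n_planes, seed=0):
    """Draw random hyperplane normals; each one splits the space in two."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def signature(vec, planes):
    """Hash a vector to a tuple of booleans: which side of each plane it lies on."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)

def build_buckets(vectors, planes):
    """Group vector indices by signature so nearby vectors tend to share a bucket."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, planes), []).append(idx)
    return buckets

def ann_query(query, vectors, planes, buckets, k=1):
    """Rank only the vectors in the query's bucket instead of the whole dataset."""
    candidates = buckets.get(signature(query, planes), [])
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

# Example usage with seeded random data (assumed, for illustration).
rng = random.Random(42)
data = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(1000)]
planes = make_hyperplanes(dim=8, n_planes=4)
buckets = build_buckets(data, planes)
neighbors = ann_query(data[0], data, planes, buckets, k=3)
```

Because only one bucket is scanned, a true nearest neighbor that falls on the other side of a hyperplane is missed. That is the accuracy given up in exchange for scanning far fewer vectors.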


Chunking

Chunking breaks text into chunks (subsets of tokens) that each represent a piece of information. In techniques like RAG, documents are chunked, embeddings are generated from the chunks and stored in a vector database, and relevant chunks are retrieved as part of the prompting process. Read More
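As a rough illustration of chunking, the sketch below splits a token list into fixed-size windows that overlap slightly so that context is not cut off at chunk boundaries. The helper name `chunk_tokens` and the whitespace split used to produce tokens are assumptions for this example; production pipelines typically use a model-specific tokenizer and may split on sentence or section boundaries instead.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks

text = "vector databases store embeddings so similar items can be found quickly"
tokens = text.split()  # naive whitespace tokens, for illustration only
chunks = chunk_tokens(tokens, chunk_size=5, overlap=2)
```

Each resulting chunk would then be embedded and stored separately, so retrieval can return the passage that matches a query rather than the whole document.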


Collection

Datasets that contain various forms of data, such as vector embeddings. The terms "dataset" and "collection" are often used interchangeably; this guide uses "collection". Read More


Dataset

A collection of data points or records used for analysis. The terms "dataset" and "collection" are often used interchangeably; this guide uses "collection". Read More


Embedding

Turning data, like words or images, into vectors that capture their meaning, so that similar items end up close together in vector space. Read More

FLARE pattern

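A retrieval-augmented generation method in which the model anticipates what it is about to say and retrieves supporting documents whenever its confidence in the upcoming text is low. Read More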


Indexing

Organizing data to make retrieval more efficient. Read More

k-Nearest Neighbors (kNN)

A supervised machine learning algorithm that classifies an item based on the majority class of its 'k' most similar items in the dataset. Read More
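The majority-vote idea can be sketched in a few lines. This is a brute-force version using Euclidean distance; the function name `knn_classify` and the toy data are hypothetical, and libraries such as scikit-learn provide tuned implementations.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k=3):
    """Return the most common label among the k points closest to query."""
    nearest = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    # On a tie, Counter keeps the label that was counted first (the closer point).
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_classify((0.5, 0.5), points, labels, k=3)
```

Note the contrast with ANN search: kNN here compares the query against every stored point, which is exact but scales linearly with the dataset size.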

Large Language Models (LLMs)

Models trained on large volumes of text that can interpret and generate natural language, including long passages of text. Read More


Normalization

The process of adjusting data values to a common scale so that different features carry comparable weight in machine learning algorithms. Read More
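Two common forms of normalization are sketched below: min-max scaling, which maps a feature's values onto the range 0 to 1, and L2 normalization, which rescales a vector to unit length. The function names are hypothetical and the choice of these two variants is an assumption; other schemes, such as z-score standardization, are also widely used.

```python
import math

def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def l2_normalize(vector):
    """Rescale a vector to length 1 while keeping its direction."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vector]

scaled = min_max_scale([10, 20, 30])
unit = l2_normalize([3, 4])
```

For unit-length vectors, the dot product equals the cosine similarity, which is one reason embeddings are often L2-normalized before being stored.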

Prompt engineering

Crafting the right questions to get desired answers from AI. Read More

RAG (Retrieval Augmented Generation)

A method that retrieves documents relevant to a query and supplies them to a language model as context when it generates a response. Read More
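The retrieve-then-generate flow can be sketched as follows. This toy version ranks documents by word overlap rather than by embedding similarity, and it stops at building the prompt, since the generation step depends on whichever model is used. The names `retrieve` and `build_prompt` and the prompt wording are assumptions for illustration.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, context_docs):
    """Place the retrieved documents ahead of the question in a single prompt."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

documents = [
    "Cassandra is a distributed database.",
    "Bananas are yellow.",
]
question = "what is cassandra"
prompt = build_prompt(question, retrieve(question, documents))
# The prompt would then be sent to a language model to generate the answer.
```

In a real RAG pipeline, `retrieve` would typically embed the question and run a similarity search against a vector database.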


The ability of an AI agent to iteratively inspect its own code, evaluate its performance, and correct mistakes. Read More

Similarity metric/function

A function that quantifies how similar two objects or datasets are, commonly used in machine learning and data analysis. Read More
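Three metrics that commonly appear in vector search are the dot product, cosine similarity, and Euclidean distance. The sketch below implements them with the standard library only; the function names are hypothetical.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot_product(a, b) / norm

def euclidean_distance(a, b):
    """Straight-line distance; smaller values mean more similar vectors."""
    return math.dist(a, b)

same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
distance = euclidean_distance([0, 0], [3, 4])
```

Cosine similarity and dot product are similarities (higher is closer), while Euclidean distance is a distance (lower is closer), so a ranking step has to sort in the matching direction.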


Tokenizer

A tool or process that breaks down input data, such as text, into smaller units, often called tokens, that a model can process. Read More
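A deliberately simple tokenizer is sketched below: it splits text into words and punctuation marks, then maps each token to an integer ID. This is an assumption-laden toy; tokenizers used with LLMs generally split text into subword units (for example with byte-pair encoding) and ship with a fixed vocabulary. The names `tokenize`, `build_vocab`, and `encode` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Replace each token with its integer ID."""
    return [vocab[tok] for tok in tokens]

tokens = tokenize("To be or not to be")
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
```

The integer IDs, not the raw strings, are what a model consumes as input.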


Transformer

A type of deep learning architecture used for processing sequences of data. Read More


Vector

An ordered list of numbers, frequently used in AI. Embeddings are a specific type of vector that encodes semantic meaning. Read More

Vector database

A database designed for storing vectors and retrieving them by similarity. Read More
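The core operations of a vector database, storing vectors with an identifier and returning the stored items most similar to a query, can be sketched with an in-memory class. `TinyVectorStore` and its methods are hypothetical and do not mirror any product's API; a real vector database adds persistence, indexing, filtering, and distribution.

```python
import math

class TinyVectorStore:
    """In-memory toy store: maps an ID to a vector and an optional payload."""

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, vector, payload=None):
        """Insert a vector, or overwrite the one already stored under doc_id."""
        self._rows[doc_id] = (list(vector), payload)

    def search(self, query, top_k=3):
        """Return (doc_id, payload) pairs ranked by cosine similarity to query."""
        def cosine(a, b):
            denom = math.hypot(*a) * math.hypot(*b)
            return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

        ranked = sorted(
            self._rows.items(),
            key=lambda item: cosine(query, item[1][0]),
            reverse=True,
        )
        return [(doc_id, payload) for doc_id, (_vec, payload) in ranked[:top_k]]

store = TinyVectorStore()
store.upsert("a", [1.0, 0.0], "first document")
store.upsert("b", [0.0, 1.0], "second document")
hits = store.search([0.9, 0.1], top_k=1)
```

This sketch scans every stored vector on each search; a vector index is what lets a real database avoid that full scan.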

Vector index

A data structure used to efficiently store and query high-dimensional vectors for similarity or distance-based retrievals. Read More

© 2024 DataStax | Privacy policy | Terms of use
