Astra Vector Search
Data stored in a database is useful, but the context of that data is critical to applications. Machine learning lets applications offer product recommendations, match similar images, and deliver a host of other capabilities. A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. To power a machine learning model in an application, Vector Search performs similarity comparisons of stored database data to discover connections in the data that may not be explicitly defined.
One key to doing similarity comparisons in a machine learning model is the ability to store vector embeddings, arrays of floating-point numbers that represent the similarity of specific objects or entities. Astra Vector Search brings that functionality to the high availability, serverless Astra database.
The foundation of Vector Search lies in embeddings: compact representations of text or images as high-dimensional vectors of floating-point numbers. For text processing, embeddings are generated by feeding the text to a machine learning model. These models generally use a neural network to transform the input into a fixed-length vector. When words are represented as high-dimensional vectors, the aim is to arrange the vectors so that similar words end up closer together in the vector space and dissimilar words end up farther apart. Creating vectors in this manner is referred to as preserving semantic or structural similarity. Embeddings capture the semantic meaning of the text, which, in turn, allows queries to rely on a more nuanced understanding of the text than traditional term-based approaches.
Large Language Models (LLMs) generate contextual embeddings for the data, and optimize embeddings for queries. Trained embeddings like those produced by LLMs can be used in Natural Language Processing (NLP) tasks such as text classification, sentiment analysis, and machine translation.
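A minimal sketch of this idea, assuming the open-source sentence-transformers package (the model name here is one common choice, not a requirement): semantically similar sentences produce vectors that score close to 1.0 under cosine similarity, while unrelated sentences score lower.

```python
# Sketch: generate fixed-length embeddings and compare them with cosine
# similarity. Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to 384-dim vectors

sentences = [
    "I love fresh pastries",
    "Warm croissants are my favorite",
    "The server crashed overnight",
]
embeddings = model.encode(sentences)  # one fixed-length float vector per sentence

print(util.cos_sim(embeddings[0], embeddings[1]))  # higher: both about baked goods
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: unrelated topics
```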
Storage Attached Indexing (SAI)
SAI is a required feature that provides high I/O throughput for serverless Astra databases with Vector Search. SAI is a highly scalable, globally distributed index that adds column-level indexes to any vector data type column.
SAI provides extensive indexing functionality, indexing both queries and content (large inputs such as documents, words, and images) to capture semantics.
For more about SAI, see Storage Attached Index documentation.
You cannot change index settings without dropping and rebuilding the index. For that reason, create the index first and then load the data; this avoids building the index concurrently while data loads.
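As a hedged sketch of that recommendation, assuming the DataStax Python driver (cassandra-driver): the bundle path, credentials, keyspace, table, and vector dimension (1536 matches many embedding models) are placeholders, not fixed names.

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Placeholder Astra connection: the username is the literal string "token"
# and the password is an application token.
session = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-bundle.zip"},
    auth_provider=PlainTextAuthProvider("token", "AstraCS:..."),
).connect("my_keyspace")

# Table with a vector column sized to the embedding model's output.
session.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id int PRIMARY KEY,
        description text,
        item_vector vector<float, 1536>
    )
""")

# Create the SAI index before loading data, per the note above, so the
# index is not built concurrently with the load.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS ann_index
    ON products (item_vector) USING 'StorageAttachedIndex'
""")
```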
CassIO for AI workloads
For typical generative artificial intelligence (AI) or other machine learning workloads, CassIO abstracts away the details of accessing the Cassandra database. CassIO offers a low-boilerplate, ready-to-use set of tools for seamless integration of Cassandra in most AI-oriented applications.
For more, see CassIO.
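A hedged sketch of the CassIO workflow: cassio.init() accepts an Astra token and database ID, but the table class, method, and parameter names below are assumptions drawn from CassIO's table abstraction and should be verified against the CassIO docs.

```python
import cassio
from cassio.table import VectorCassandraTable

# Placeholder credentials; the dimension matches many embedding models.
cassio.init(token="AstraCS:...", database_id="01234567-...")

table = VectorCassandraTable(table="demo_vectors", vector_dimension=1536)

# Store one row: the raw text plus its (stand-in) embedding.
table.put(row_id="doc-1", body_blob="example text", vector=[0.1] * 1536)

# Approximate nearest-neighbor search for the 3 most similar rows.
for hit in table.ann_search(vector=[0.1] * 1536, n=3):
    print(hit)
```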
Integrations with Astra Vector Search
Third-party integrations connect your serverless Astra database with Vector Search to various Large Language Model (LLM) frameworks. LLMs are stateless and need a database to store embeddings. Use any of the following frameworks to streamline your vector-based similarity searches and to aid in developing LLM-powered applications.
- CassIO LangChain: The Astra integration for LangChain builds on top of the open-source CassIO library, providing a set of standardized facilities to interact with Astra DB (and Cassandra) through typical patterns needed by Machine Learning (ML) and LLM applications. This integration takes advantage of Astra DB with Vector Search capabilities and makes it possible to run advanced LLM workloads based on semantic similarity without leaving your Astra DB storage backend (see the sketch after this list).
- Feast: An open-source feature store for ML that uses Python. Feast manages infrastructure for MLOps and data engineers and emphasizes data management and model versioning.
- GCP Dataflow: A serverless, fast, and cost-effective managed service for batch and streaming data processing pipelines, based on open-source Apache Beam.
- Microsoft Power Query: A data preparation and transformation Extract, Transform, and Load (ETL) engine that lets you connect to various data sources.
For the full list of third-party integrations, see Astra DB integrations.
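For the CassIO LangChain integration above, a hedged sketch: LangChain's Cassandra vector store (built on CassIO) embeds texts on insert and embeds each query for an approximate nearest-neighbor lookup. The connection details, keyspace, and table name are placeholders, and OpenAIEmbeddings assumes an OPENAI_API_KEY in the environment.

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra

# Placeholder Astra connection, as in the SAI example above.
session = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-bundle.zip"},
    auth_provider=PlainTextAuthProvider("token", "AstraCS:..."),
).connect()

vstore = Cassandra(
    embedding=OpenAIEmbeddings(),
    session=session,
    keyspace="my_keyspace",
    table_name="langchain_demo",
)

# Texts are embedded and stored; similarity_search embeds the query and
# runs an ANN lookup against the stored vectors.
vstore.add_texts(["Astra DB stores vector embeddings for LLM applications."])
for doc in vstore.similarity_search("Where do LLM apps keep embeddings?", k=1):
    print(doc.page_content)
```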
Accessing your data to query
Several methods are available to access your data.
- Use a Python script to load embeddings data into your serverless Astra database with Vector Search. See Q&A Search with LangChain.
- Use DSBulk Loader with CSV or JSON data.
- Use the new vector data type, available in your serverless Astra database with Vector Search, to store embeddings; Cassandra Query Language (CQL) queries this data type directly (see the sketch after this list). For more, see Query vector data with CQL.
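A hedged sketch of such a CQL vector query, run through the Python driver. The table and column names follow the earlier SAI example, the query vector stands in for a real embedding, and binding a plain Python list to the vector type assumes a driver version with vector support.

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Placeholder Astra connection, as in the earlier examples.
session = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-bundle.zip"},
    auth_provider=PlainTextAuthProvider("token", "AstraCS:..."),
).connect("my_keyspace")

# ANN query: return the 3 rows whose item_vector is closest to the query
# embedding (here a stand-in 1536-dimension vector).
rows = session.execute(
    "SELECT id, description FROM products ORDER BY item_vector ANN OF %s LIMIT 3",
    ([0.12, 0.45, 0.89] + [0.0] * 1533,),
)
for row in rows:
    print(row.id, row.description)
```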
What’s next?
- Create your serverless Astra database with Vector Search.
- Set up your database schema, create a custom index with Storage Attached Indexing (SAI), load data, and query that data with a supplied Python script or with CQL commands.
- Run the Quickstart or any of the available Examples with its accompanying Colab.