Astra DB glossary

Approximate Nearest Neighbor (ANN)

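A class of machine learning algorithms that locate the vectors most similar to a given item in a dataset. As the name indicates, the result is approximate: the search trades a small amount of accuracy for speed.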

authenticate

To establish the identity of a user or application.

authorization

The process of establishing permissions to database resources through roles.

capacity unit

Represents three database instances grouped together with three replicas.

Classless inter-domain routing (CIDR)

Set of internet protocol (IP) standards used to create unique identifiers for networks and individual devices.
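
CIDR notation writes a network as a base address followed by a prefix length, for example 10.0.0.0/24. A minimal sketch using Python's standard ipaddress module shows how such a block identifies both a network and the individual addresses inside it. The 10.0.0.0/24 range and the in_block helper are illustrative assumptions, not values taken from Astra DB.

```python
import ipaddress

# Hypothetical private range chosen only for illustration.
network = ipaddress.ip_network("10.0.0.0/24")

# The first 24 bits identify the network; the remaining 8 bits identify devices.
prefix_length = network.prefixlen
address_count = network.num_addresses  # 2 ** (32 - 24)

def in_block(ip: str) -> bool:
    """Return True if the address falls inside the CIDR block."""
    return ipaddress.ip_address(ip) in network
```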

clustering column

In a table definition, a clustering column is a column that is part of the compound primary key but is not the first column, which is the position reserved for the partition key. Clustering columns order the rows within a single partition. The clustering order is determined by the position of the columns in the compound primary key definition.

column

The smallest increment of data, which contains a name, a value, and a timestamp. Also known as a cell.

cosine similarity

A metric measuring the similarity between two non-zero vectors in a multi-dimensional space. It is the cosine of the angle between the vectors, so it reflects their orientation relative to each other and not their magnitude. One (1) indicates the same orientation (complete similarity). Zero (0) indicates orthogonal vectors (no similarity). Negative one (-1) indicates exactly opposite orientations.
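
As a minimal sketch of the calculation, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. The function name and the choice to reject zero vectors are assumptions made for illustration; this is not how Astra DB computes the metric internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Vectors that point the same way score close to 1 regardless of length, perpendicular vectors score 0, and opposite vectors score -1.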

CQL shell

The Cassandra Query Language shell (cqlsh) utility.

database

A group of distributed instances for storing data. Each paid Astra DB database has at least three instances.

embeddings

A mathematical technique in machine learning where complex, high-dimensional data is represented as points in a lower-dimensional space. The process of creating an embedding preserves the relevant properties of the original data, such as distance and similarity, enabling easier computational processing. For instance, words with similar meanings in Natural Language Processing (NLP) can be set close to each other in the reduced space, facilitating their use in machine learning models.

Euclidean distance

A non-negative distance metric from coordinate geometry: the straight-line distance between two points. It quantifies the similarity or dissimilarity of data points represented as vectors, with smaller distances indicating greater similarity. Use it to compare generated samples to real data points.
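
A minimal sketch of the calculation: square the difference in each dimension, add the squares, and take the square root. The function name is an assumption for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For the points (0, 0) and (3, 4) this is the hypotenuse of a 3-4-5 right triangle, so the distance is 5.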

index

A native capability for finding a column in the database that does not involve using the primary key.

instances

The basic database infrastructure component where you store your data. Commonly referred to as a "node" in Cassandra terminology.

integration

Integrations connect a third-party tool of your choice to Astra DB for data management, machine learning, analytics, and more.

Jaccard similarity

A measure of similarity between two sets of features or elements in generated data and real data. The mathematical calculation is the size of the intersection of two sets divided by the size of their union, and ranges from zero (0) to one (1). One (1) indicates identical sets.
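
The calculation can be sketched in a few lines. The function name is hypothetical, and treating two empty sets as identical is an assumed convention, because the ratio is otherwise undefined when the union is empty.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)
```

For example, {a, b, c} and {b, c, d} share two of four distinct elements, which gives 0.5.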

keyspace

The defining container for replication, similar to a schema in a relational database. All tables belong to a keyspace. See the Guardrails & Limits page for more information.

Machine Learning (ML)

A branch of artificial intelligence (AI) and computer science that uses and develops computer systems capable of learning and adapting without explicit instruction. ML uses algorithms and statistical models to analyze data, identify patterns, make decisions, and improve its own performance.

Natural Language Processing (NLP)

A branch of artificial intelligence (AI) that helps computers interpret and produce human language so that applications can respond usefully to users.

partition index

A list of primary keys and the start position of data.

partition key

The first column declared in the PRIMARY KEY definition or, in the case of a composite partition key, the group of columns declared together as the first element of the PRIMARY KEY definition.

partition summary

A subset of the partition index. By default, 1 partition key out of every 128 is sampled.
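
The sampling idea can be sketched as taking every 128th entry of an ordered index. Everything in this example is hypothetical, including the entry layout (key, byte offset) and the function name; the real storage engine's on-disk structures are more involved.

```python
SAMPLING_INTERVAL = 128  # default stated above: 1 partition key out of every 128

def build_summary(partition_index, interval=SAMPLING_INTERVAL):
    """Keep every `interval`-th entry of the ordered partition index."""
    return partition_index[::interval]

# Hypothetical index of 1,000 entries: (partition key, start position of data).
partition_index = [(f"key-{i:05d}", i * 64) for i in range(1000)]
summary = build_summary(partition_index)
```

With 1,000 index entries the summary keeps entries 0, 128, 256, and so on, for 8 entries in total, which is small enough to scan quickly before consulting the full index.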

primary key

One or more columns that uniquely identify a row in a table. The primary key consists of the partition key and any optional clustering columns.

region

A group of related nodes configured together within a database for replication purposes. A region is a virtual datacenter hosted on your selected cloud provider. Using separate regions prevents transactions from being impacted by other workloads and lowers latency. Depending on the replication factor, data can be written to multiple regions. Regions cannot span physical locations. A region in Astra DB is the same concept as a "datacenter" in Apache Cassandra™ or DataStax Enterprise.

replication

The process of storing copies of data on multiple instances. Replication ensures reliability and fault tolerance. All replicas are equally important; there is no primary or master replica.

replication factor

Determines the number of copies of the data that are stored across instances. Higher replication factors provide increased reliability and fault tolerance.

row

1) Columns that have the same primary key. 2) A collection of cells per combination of columns in the storage engine.

service account

Allows you to manage your databases with the DevOps API, which can be used to create, terminate, resize, park, and unpark your database. The service account is created at the organization level.

table

Stores data based on a primary key, which consists of a partition key and optional clustering columns. The partition key determines the node on which the data is stored and divides the data into logical groups. Define partition keys that evenly distribute the data and also satisfy specific queries. Avoid query and write requests that span multiple partitions where possible. A clustering column defines the sort order of rows within a partition. When defining a clustering column, consider the purpose of the data. For example, clustering on the transaction date in descending order supports retrieving the most recent transactions first.
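
The roles of the two key parts can be modeled in plain Python: the partition key groups rows, and the clustering column sorts rows inside each group. This is only an in-memory illustration under assumed names (account_id as the partition key, the transaction date as the clustering column); it is not how the database stores data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (account_id, transaction_date, amount)
rows = [
    ("acct-1", date(2024, 1, 5), 20.0),
    ("acct-2", date(2024, 1, 3), 75.5),
    ("acct-1", date(2024, 3, 9), 12.5),
    ("acct-1", date(2024, 2, 1), 40.0),
]

def group_and_sort(rows):
    """Group rows by partition key, then sort each group by date, newest first."""
    partitions = defaultdict(list)
    for account_id, tx_date, amount in rows:
        partitions[account_id].append((tx_date, amount))
    for partition in partitions.values():
        partition.sort(key=lambda row: row[0], reverse=True)
    return dict(partitions)

def most_recent(partitions, account_id, limit):
    """Read the newest rows from a single partition."""
    return partitions.get(account_id, [])[:limit]

partitions = group_and_sort(rows)
latest = most_recent(partitions, "acct-1", 2)
```

Because the rows are already ordered inside the partition, reading the most recent transactions touches one partition and takes the first few rows, which is the access pattern the definition recommends.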

Vector

An array of floating-point numbers that represents a specific object or entity.

Vector Search

Compares vectors stored in a database by measuring the distance between them. The smaller the distance, the more similar the data. The greater the distance, the less similar the data.
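
The idea can be sketched as an exhaustive comparison: measure the distance from a query vector to every stored vector and return the closest ones. This example assumes Euclidean distance and a small in-memory dictionary with made-up names and values. It performs an exact search, whereas a production system would typically rely on an approximate index.

```python
import math

def nearest(query, vectors, k=1):
    """Return the names of the k stored vectors closest to the query."""
    ranked = sorted(vectors.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical two-dimensional vectors for illustration only.
catalog = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.3],
    "car": [0.1, 0.9],
}
result = nearest([1.0, 0.0], catalog, k=2)
```

Here "cat" and "dog" lie closest to the query, so they are returned ahead of "car".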