DataStax glossary

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X, Y, Z

A

Adjacency list

A collection of unordered lists used to represent a finite graph. Each list describes the set of neighbors of a vertex in the graph.
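
As a minimal sketch of this structure, the following Python snippet stores a small, hypothetical undirected graph as a dictionary that maps each vertex to the list of its neighbors. The vertex names and helper functions are illustrative assumptions and are not part of any DataStax API.

```python
# Hypothetical undirected graph stored as an adjacency list:
# each key is a vertex, each value is the unordered list of its neighbors.
adjacency = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["c"],
}


def neighbors(graph, vertex):
    """Return the vertices adjacent to `vertex` (empty list if unknown)."""
    return graph.get(vertex, [])


def vertex_degree(graph, vertex):
    """Return the number of edges incident to `vertex`."""
    return len(neighbors(graph, vertex))


def graph_degree(graph):
    """Return the largest vertex degree in the graph."""
    return max((len(adj) for adj in graph.values()), default=0)
```

The same dictionary also yields the graph degree (the largest vertex degree) without any extra bookkeeping.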

Adjacent vertex

A vertex directly attached to another vertex by an edge in a graph.

Agent

A system that can make decisions based on its inputs or environment.

In a machine learning context, agents often use vector databases to facilitate rapid and efficient search, comparison, and retrieval of high-dimensional data. See What are vector databases?.

A DataStax Agent is a specific type of agent used with DSE OpsCenter. These agents must be installed on every managed node in a cluster, and they are necessary to perform most DSE OpsCenter functionality. See also Lifecycle Manager (LCM).

Anti-entropy

The synchronization of replica data on nodes to ensure that the data is fresh.

Approximate nearest neighbor (ANN)

A machine learning algorithm that locates the most similar vectors to a given item in a dataset. It can reduce the amount of time required to search large datasets, but it sacrifices some accuracy for speed. See What are vector databases?.

B

Back pressure

Pausing or blocking the buffering of incoming requests after reaching the threshold until the internal processing of buffered requests catches up.

Bloom filter

An off-heap structure associated with each SSTable that checks if any data for the requested row exists in the SSTable before doing any disk I/O.
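
The general idea can be sketched in Python as follows. This is a toy illustration only, not the database's implementation: the class name, the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several bit positions by salting a SHA-256 hash of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Record that `key` is present."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """Return False only if `key` was definitely never added."""
        return all(self.bits[pos] for pos in self._positions(key))
```

A negative answer means the key is certainly absent, so the disk read can be skipped. A positive answer only means the key might be present.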

Bootstrap

The process by which new nodes join a cluster and transparently gather necessary data from existing nodes.

C

Cardinality

The number of unique values in a column. For example, a column of ID numbers unique for each employee would have high cardinality while a column of employee ZIP codes would have low cardinality because multiple employees can have the same ZIP code.

An index on a column with low cardinality can boost read performance because the index is significantly smaller than the column. An index for a high-cardinality column may reduce performance. If your application requires a search on a high-cardinality column, a materialized view is ideal.

CassandraDatacenter

A Kubernetes custom resource (CR) representing a DataStax Enterprise (DSE) or an Apache Cassandra® logical datacenter and rack configuration. An operator reconciles the declarative state within the CassandraDatacenter CR against nodes within a cluster. A single MissionControlCluster can be related to more than one CassandraDatacenter CR.

Cell

The smallest increment of stored data. Contains a value in a row-column intersection.

Chunking

A technique used to prepare text data for processing by breaking it into smaller, manageable pieces called chunks. Chunks are subsets of tokens that represent a piece of information. In techniques like Retrieval-Augmented Generation (RAG), embeddings are generated from chunks rather than an entire document. Chunks are stored in a database and retrieved based on relevance to a given query. See What are vector databases?.
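
A minimal sketch of fixed-size chunking with overlap is shown below. It splits on whitespace as a stand-in for real tokenization, and the function name `chunk_words` and its default sizes are assumptions for illustration rather than a recommended configuration.

```python
def chunk_words(text, chunk_size=5, overlap=1):
    """Split `text` into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()  # whitespace split, standing in for a tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Overlapping chunks are one common way to avoid cutting a thought in half at a chunk boundary. Each chunk would then be embedded and stored separately.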

Cluster

Two or more database instances that exchange messages using the gossip protocol.

Clustering

The storage engine process that creates an index and keeps data in order based on the index.

Clustering column

In the table definition, a clustering column is a column that is part of the compound primary key definition. Note that the clustering column cannot be the first column because that position is reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.

Coalescing strategy

Strategy to combine multiple network messages into a single packet for outbound TCP connections to nodes in the same datacenter (intra-DC) or to nodes in a different datacenter (inter-DC). A coalescing strategy is provided with a blocking queue of pending messages and an output collection for messages to send.

Collection

In vector databases, collections are datasets that contain various forms of data, such as vector embeddings. The terms datasets and collections are often used interchangeably.

In Cassandra Query Language (CQL), a collection is a data type that defines a group of objects represented as a single unit. Collections can be maps, lists, or sets.

Column

The smallest increment of data. Contains a name, a value, and a timestamp.

Column family

A container for rows, similar to the table in a relational system. Known as a table in Cassandra Query Language (CQL) 3.x and later.

Commit log

A file to which the database appends changed data for recovery in the event of a hardware failure.

Compaction

The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index. The available compaction strategies are:

DateTieredCompactionStrategy (DTCS) (deprecated)
LeveledCompactionStrategy (LCS)
SizeTieredCompactionStrategy (STCS)
TimeWindowCompactionStrategy (TWCS)

Composite partition key

A partition key consisting of multiple columns.

Compound primary key

A primary key consisting of the partition key, which determines the node on which data is stored, and one or more additional columns that determine clustering.

Consistency

The synchronization of data on replicas in a cluster. Consistency is categorized as weak or strong.

Consistency level

A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.

Container

Self-contained applications and all the dependencies needed to run them. For example, a DSE container includes DSE, management API, and Java Runtime Environment. Additionally, each container is cryptographically verified at runtime to ensure there have been no modifications after being packaged by DataStax. All Mission Control components are packaged as containers and are scheduled as Pods. See Pod.

Control plane

The management layer of a Mission Control installation. After administrators define a MissionControlCluster custom resource (CR) through the Mission Control UI or CLI, the resources are distributed across the appropriate Data Planes. This layer contains the Mission Control UI and API services, a number of operators that reconcile cluster-level resources, and observability components.

Converge

The process of aligning the real-world state of the node, datacenter, or cluster with the desired state by successfully executing a Lifecycle Manager (LCM) job.

Coordinator node

The node that determines which nodes in the ring should get the request, based on the snitch configured for the cluster.

Cosine similarity

A metric measuring the similarity between two non-zero vectors in a multi-dimensional space. It is the cosine of the angle between the vectors, so it reflects their orientation relative to each other rather than their magnitude. One (1) indicates that the vectors point in the same direction (complete similarity). Zero (0) indicates that the vectors are orthogonal (no similarity). Negative one (-1) indicates exactly opposite orientation. See What are vector databases?.
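
The calculation can be sketched in plain Python as the dot product of the two vectors divided by the product of their lengths. The function name is an illustrative assumption. A production system would typically rely on a numerical library or the database's built-in similarity functions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Parallel vectors score close to 1, perpendicular vectors score 0, and vectors pointing in opposite directions score close to -1.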

CQL shell

The Cassandra Query Language (CQL) shell utility, also known as cqlsh.

Cross-datacenter forwarding

A technique for optimizing replication across datacenters by sending data from one datacenter to a node in another datacenter.

The receiving node then forwards the data to other nodes in its datacenter.

D

Datacenter

A group of related nodes that are configured together within a cluster for replication and workload segregation purposes. Not necessarily a separate location or physical datacenter. Datacenter names are case sensitive and cannot be changed.

Data plane

An infrastructure layer where DataStax Enterprise (DSE) and Apache Cassandra® workloads are deployed. These resources operate at the datacenter or region level of globally deployed DSE clusters. Services within the data plane include operators for reconciliation of local resources, observability ingestion and routing components, and DSE or Cassandra nodes.

Dataset

A collection of data points or records, such as the total contents of a database or data used to train a machine learning model.

Although collection can be a synonym for dataset, this isn’t the same as a CQL collection.

Data type

A particular kind of data item, defined by the values it can take or the operations that can be performed on it.

DateTieredCompactionStrategy (DTCS)

DateTieredCompactionStrategy (DTCS) is deprecated starting in Apache Cassandra® 3.8 and in DataStax Enterprise (DSE). This strategy is particularly useful for time series data and stores data written within a certain period of time in the same SSTable. For example, Cassandra can store your last hour of data in one SSTable time window while the next 4 hours of data are stored in another time window, etc. The most common queries for time series workloads retrieve the last hour/day/month of data.

Deep query

Graph queries that traverse a dense graph (a large number of connected vertices) or a graph with a high branching factor.

Denormalization

Denormalization refers to the process of optimizing the read performance of a database by adding redundant data or by grouping data. This process is accomplished by duplicating data in multiple tables or by grouping data for queries.

Directed graph

A set of vertices and a set of arcs (ordered pairs of vertices). In DSE Graph, edges are directional and the term "arcs" is not used.

E

Edge

A connection between graph vertices. Edges can be unordered (no directional orientation) or ordered (directional). An edge can also be described as an object that has a vertex at its tail and head.

Element

A graph element is a vertex, edge, or property.

Embedding

Turning data, like words or images, into vectors to capture their meaning. See What are vector databases?.

Also describes a mathematical technique in machine learning where complex, high-dimensional data is represented as points in a lower-dimensional space. The process of creating an embedding preserves the relevant properties of the original data, such as distance and similarity, enabling easier computational processing. For instance, words with similar meanings in natural language processing can be set close to each other in the reduced space, facilitating their use in machine learning models.

Euclidean distance

A non-negative distance metric from coordinate geometry that measures the straight-line distance between two points. It quantifies the similarity or dissimilarity between data points represented as vectors: the smaller the distance, the more similar the points. Use it to compare generated samples to real data points.
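
As a short sketch, the distance is the square root of the sum of squared coordinate differences. The function name below is an assumption for illustration.

```python
import math


def euclidean_distance(a, b):
    """Return the straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, the points (0, 0) and (3, 4) are 5 units apart, and any point is at distance 0 from itself.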

Eventual consistency

The database maximizes availability and partition tolerance. The database ensures eventual data consistency by updating all replicas during read operations and periodically checking and updating any replicas not directly accessed. The updating and checking ensures that any query always returns the most recent version of the result set and that all replicas of any given row eventually become completely consistent with each other.

Extended Backus-Naur Form (EBNF)

A syntax expressing a context-free grammar that formally describes a language. EBNF extends its precursor Backus-Naur Form (BNF) with additional operators allowed in expansions. Syntax (railroad) diagrams graphically depict EBNF grammars.

F

Faceted search

Faceted search is the dynamic clustering of items or search results into categories. Faceted search uses any value in any field to drill into search results or even skip searching entirely.

Forward-Looking Active REtrieval augmented generation (FLARE)

An advanced retrieval technique that combines retrieval and generation in LLMs. It enhances the accuracy of responses by iteratively predicting the upcoming sentence to anticipate future content when the model encounters a token it is uncertain about. See Active Retrieval Augmented Generation.

Frozen

A Cassandra Query Language (CQL) Collection or user-defined type (UDT) that treats the value as an immutable blob. To update or delete an item, redefine the entire value of the item because individual elements of the item cannot be updated or deleted.

An example is a frozen list: ['bicycle', 'treadmill', 'unicycle', 'bicycle'].

If you want to update or append a value to this frozen list, the entire list must be rewritten: UPDATE items SET list_of_items = ['bicycle', 'treadmill', 'unicycle', 'bicycle', 'rower'] WHERE id = 1;

G

Garbage collector

A Java background process that frees heap memory when it is no longer in use by the program. The main Java algorithms to allocate and clean up memory are Concurrent Mark Sweep (CMS) and Garbage-First (G1). DataStax Enterprise (DSE) 5.1 and higher use the G1 garbage collector by default.

Global index

An index structure over an entire graph.

Gossip

A peer-to-peer communication protocol for exchanging location and state information between nodes.

Graph

A collection of vertices and edges.

Graph degree

The largest vertex degree of a graph.

Graph index

A data structure that allows for the fast retrieval of elements by a particular key-value pair.

Graph partitioning

A process that consists of dividing a graph into components of about the same size with few connections between the components.

Graph traversal

An algorithmic walk across the elements of a graph according to the referential structure explicit within the graph data structure.

H

Hadoop Distributed File System (HDFS)

A file system that stores data on nodes to improve performance. HDFS is a necessary component in addition to MapReduce in an Apache Hadoop® distribution.

Headroom

The amount of disk space required by a process (such as compaction) in addition to the space occupied by the data being processed.

I

Incident edge

An edge that has a given vertex as one of its endpoints. An edge is incident to a vertex when that vertex is an endpoint of the edge.

Index

A native capability for finding a column in the database that does not involve using the primary key.

Indexing

Organizing data to make retrieval more efficient. There are many ways to index data depending on the use case.

J

Jaccard similarity

A measure of similarity between two sets of features or elements in generated data and real data. The mathematical calculation is the size of the intersection of two sets divided by the size of their union, and ranges from zero (0) to one (1). One (1) indicates identical sets.
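
The calculation can be sketched as follows. The function name is illustrative, and treating two empty sets as identical (returning 1) is an assumed convention, since the ratio is otherwise undefined in that case.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, {'bicycle', 'treadmill', 'unicycle'} and {'bicycle', 'rower'} share one of four distinct items, giving a similarity of 0.25.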

Job Tracker

Used for analytics nodes that analyze data using Apache Hadoop®. Data is analyzed using both DSE Hadoop and external Apache Hadoop systems. Within a datacenter, the Job Tracker monitors the execution and status of distributed tasks that comprise a MapReduce job.

K

Kafka Struct

A record using the Apache Kafka® structure that contains a set of named fields with values. Each field in a Kafka Struct uses an independent Schema.

The DataStax Apache Kafka Connector supports both generic structs and advanced struct types backed by the schema registry, such as the Avro format.

K8ssandraCluster

A Kubernetes custom resource (CR) representing a DataStax Enterprise (DSE) or an Apache Cassandra® cluster. This CR is automatically generated after creating a MissionControlCluster resource.

Keyspace

A container for collections or tables in a database
A namespace container that defines how data is replicated on nodes in each datacenter

Keytab

A file containing pairs of Kerberos principals and encrypted keys. Allows authentication with a Kerberos-enabled DataStax Enterprise (DSE) database without entering a password.

K-nearest neighbors (KNN)

Also known as exact nearest neighbors.

A supervised machine learning algorithm that classifies an item based on the majority class of its 'k' most similar items in the dataset. KNN is stricter but more resource intensive than ANN. Cassandra-based databases don’t support KNN. See What is the k-nearest neighbors algorithm.
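
A brute-force sketch of the classification step is shown below. It assumes Euclidean distance as the similarity measure and majority voting among the k closest items. The function name and the sample data in the usage note are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, labeled_points, k=3):
    """Classify `query` by majority vote among its k nearest labeled points.

    `labeled_points` is a list of (vector, label) pairs. Every point is
    compared with the query, which is what makes exact search expensive.
    """
    if k <= 0 or not labeled_points:
        raise ValueError("need k > 0 and at least one labeled point")
    nearest = sorted(
        labeled_points, key=lambda item: math.dist(query, item[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Given three points labeled "a" near the origin and two points labeled "b" near (5, 5), a query at (0.5, 0.5) is classified as "a" and a query at (5.5, 5) as "b" when k is 3.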

KOTS

An open-source application providing an Admin Console and Command Line Interface (CLI) to handle installing, managing, and updating software packages such as Mission Control.

Kurl

Provides a kubeadm-based Kubernetes distribution with add-ons for common cloud-native components. Mission Control utilizes Kurl as its embedded Kubernetes runtime.

krb5.conf

File that contains the Kerberos configuration used by clients for connection and ticket generation. See MIT Kerberos krb5.conf documentation. The default location is /etc. If krb5.conf is in another location, override the default location by setting the environment variable KRB5_CONFIG. To use multiple configuration files, set a colon-separated filename list in KRB5_CONFIG; all files are read.

kubectl

A Kubernetes-specific command line interface (CLI) that allows communication and control of K8s clusters through their API server.

kubelet

A process run on each piece of infrastructure within a Mission Control deployment. kubelet is responsible for coordinating the running of containers and reporting back the status of various tasks and processes.

L

Large language model (LLM)

A machine learning system designed to generate natural language text. These models are trained on text data to learn grammar and context, enabling them to produce coherent and contextually relevant text based on given prompts. The response, also known as a prediction, is based on the model’s training. The model calculates the expected response that would follow a given prompt.

LLMs can be generalized or specialized. For example, domain-specific LLMs are trained on data from a particular field, such as medicine or law, to provide more accurate and relevant responses in that domain.

The term language model (without large) is often used in contexts where either a large (broadly trained) or small (narrowly trained) model could be used.

Language models are generally separate from other types of models, such as embedding models, which are designed to generate embeddings.

LeveledCompactionStrategy (LCS)

This compaction strategy creates SSTables of a fixed, relatively small size that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2, and so on) is ten times as large as the previous level. Disk I/O is more uniform and predictable on higher levels than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables in the next level. This process improves performance for reads because the database can determine which SSTables in each level to check for the existence of row key data.

Lifecycle Manager (LCM)

LCM is a web-based provisioning and configuration management system for installing, configuring, and managing DataStax Enterprise (DSE) clusters and DSE nodes.

LCM stores and enforces the cluster configuration definition, including datacenter and node topology, and it integrates deeply with the full spectrum of DSE settings.

Linearizable consistency

Also called serializable consistency, linearizable consistency is the restriction that one operation cannot be executed unless and until another operation has completed.

The database supports Lightweight transactions to ensure linearizable consistency in writes. The first phase of a Lightweight transaction works at SERIAL consistency and follows the Paxos protocol to ensure that the required operation succeeds. If this phase succeeds, the write is performed at the consistency level specified for the operation. Reads performed at the SERIAL consistency level execute without database built-in read repair operations.

List

A Cassandra Query Language (CQL) collection that consists of an ordered list of values. An example of list data is ['bicycle', 'treadmill', 'unicycle', 'bicycle']. Because the list is ordered, a value may appear more than once in the list.

M

Machine learning (ML)

A branch of artificial intelligence (AI) and computer science that uses and develops computer systems capable of learning and adapting without explicit instruction. ML uses algorithms and statistical models to analyze data and identify patterns, make decisions, and improve its system.

Map

A Cassandra Query Language (CQL) collection that consists of an ordered list of key-value pairs. An example of map data is {'john':'bicycle', 'jane':'treadmill', 'devon':'unicycle'}. Each item in the map can be indexed by key, value, or full entry.

MapReduce

The Apache Hadoop® parallel processing engine that quickly processes large datasets. A necessary component in addition to Hadoop Distributed File System (HDFS) in an Apache Hadoop distribution.

Materialized view (MV)

A table with data that is automatically inserted and updated from another base table. The MV has a primary key that differs from the base table, allowing the implementation of different queries.

memtable

A database table-specific, in-memory data structure that resembles a write-back cache.

Meta-property

A property that describes some attribute of another property in a graph.

MissionControlCluster

A Kubernetes Custom Resource representing a DataStax Enterprise (DSE) or an Apache Cassandra® cluster. There is a one-to-one relationship between this resource and a cluster deployed. This top-level resource includes the cluster type (dse versus cassandra), the version, the desired topology (datacenter and rack definitions), and configuration overrides.

milliCPU (millicore)

An absolute, metric specification for a CPU resource. For example, 1,000 milliCPU (or 1,000m CPU) equals 1 CPU unit. The suffix m means milli.

Fractional values are allowed but not relative quantities. Precision finer than 1m is not allowed.

The specification is used to measure CPU usage. 0.1 CPU specifies the same amount of CPU on a single-core, dual-core, or 48-core machine.

In Kubernetes, one CPU is equivalent to:

1 AWS vCPU
1 GCP Core
1 Azure vCore
1 Hyperthread on a bare-metal Intel processor with Hyperthreading
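
The relationship between CPU units and milliCPU can be sketched with a small converter. The function `cpu_to_millicores` is a hypothetical helper for illustration, not a Kubernetes API. It accepts either a plain CPU value such as "0.1" or a value with the m suffix such as "100m", and rejects anything finer than 1m.

```python
from decimal import Decimal


def cpu_to_millicores(quantity):
    """Convert a CPU quantity string such as '0.1' or '100m' to milliCPU."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        millis = Decimal(quantity[:-1])  # already expressed in milliCPU
    else:
        millis = Decimal(quantity) * 1000  # 1 CPU unit = 1,000 milliCPU
    if millis != millis.to_integral_value():
        raise ValueError("precision finer than 1m is not allowed")
    return int(millis)
```

With this helper, "1" and "1000m" both convert to 1000, and "0.1" converts to 100.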

Mixed workload

When a single cluster can run transactional, search, and analytics nodes.

Mutation

A change to data, either an insertion or a deletion.

N

Namespace

A schema for a logical grouping, for example:

A logical grouping of topics in a streaming tenant.
A container for collections of documents in a database. See also Keyspace.

Natural language processing (NLP)

A branch of artificial intelligence that enables computers to process commands given in plain human language (sentences, phrases, and paragraphs) and, depending on the task, generate a response in human language.

Node

A Java virtual machine (a platform-independent execution environment that converts Java bytecode into machine language and executes it) that runs an instance of the Licensed Software.

Node repair

A process that makes all data on a replica consistent.

Non-frozen

A Cassandra Query Language (CQL) collection or user-defined type (UDT) that treats the value as a mutable blob. To update or delete an item, redefine individual values of the item. An example is a non-frozen list: ['bicycle', 'treadmill', 'unicycle', 'bicycle']. To update or append a value to a non-frozen list, append the value.

UPDATE items SET list_of_items = list_of_items + ['rower']
  WHERE id = 1;

Normalization

In a vector database, vector normalization is the process of scaling vectors to unit length, typically as a part of preprocessing embeddings. Because normalized vectors have the same length, ANN searches can use the efficient dot product similarity metric.
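
As a brief sketch of the vector case, the snippet below scales a vector to unit length and computes a dot product. For unit-length vectors, the dot product equals the cosine of the angle between them, which is why normalization makes the cheaper metric usable. The function names are illustrative assumptions.

```python
import math


def normalize(vector):
    """Scale `vector` to unit length (L2 norm of 1)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]


def dot_product(a, b):
    """Return the dot product of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))
```

For example, [3, 4] normalizes to approximately [0.6, 0.8], and the dot product of the normalized forms of [3, 4] and [4, 3] is approximately 0.96, which matches their cosine similarity.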

In a non-vector context, normalization is a series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database’s schema.

In a non-vector context, relational databases use normalization to reduce redundancy through joins. However, Cassandra-based databases, such as DSE, HCD, and Astra DB, don’t support joins. This means the data models aren’t normalized, and they include redundant data, trading additional storage for improved read performance.

O

Online analytical processing (OLAP)

Processing that performs a multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. Compare to OLTP.

Online transaction processing (OLTP)

Processing characterized by a large number of short on-line transactions for data entry and retrieval. Compare to OLAP.

Operator

A software controller which manages the reconciliation of any custom resource (CR), like MissionControlCluster and CassandraDatacenter, against running processes on worker hosts. This includes creating new resources and recovering processes that have gone down.

Order

The magnitude of the number of edges to the number of vertices.

P

Partition

A partition is a collection of data addressable by a key. This data resides on one node in a Cassandra cluster. A partition is replicated on as many nodes as the replication factor specifies.

Partition index

A list of primary keys and the start position of data.

Partition key

A partition key represents a logical entity that helps a Cassandra cluster know on which node some requested data resides.

The partition key is the first column declared in the primary key definition.

In a compound primary key, multiple columns are declared to form the primary key.

Partition range

The limits of the partition that differ depending on the configured partitioner. Murmur3Partitioner (default) range is -2⁶³ to +2⁶³-1 and RandomPartitioner range is 0 to 2¹²⁷-1.

Partition summary

A subset of the partition index. By default, 1 partition key out of every 128 is sampled.

Partitioned vertex

A portion of a vertex’s data resulting from dividing the vertex into smaller components for graph database storage. A partitioned vertex is used for vertices that have a large number of edges.

Partitioner

Distributes data across a cluster. The types of partitioners are Murmur3Partitioner (default), RandomPartitioner, and OrderPreservingPartitioner.

PersistentVolume (PV)

A Kubernetes resource representing a piece of storage available for use within a cluster. It can represent persistent physical drives or ephemeral block volumes to be provisioned on demand. See StorageClass and PersistentVolumeClaim (PVC).

PersistentVolumeClaim (PVC)

A Kubernetes resource that maps storage (each PersistentVolume (PV)) to compute (Pods), with the associated configuration provided by a StorageClass. Mission Control names PVCs following a pattern, such as name: premium-rwo-retain, and can pre-allocate them prior to provisioning a cluster.

Pod

A group of containers. See Node.

Primary key

The partition key. One or more columns that uniquely identify a row in a table.

Prompt engineering

Crafting queries to get desired answers from LLMs.

Property

In a graph context, a key-value pair that describes an attribute of a vertex or edge. Property key describes the key in the key-value pair. All properties are global in DSE Graph, meaning that a property can be used for any vertices. For example, name can be used for all vertices in a graph.

Q

Quorum

A quorum is the minimum number of replicas that must acknowledge a read or write operation executed at consistency levels QUORUM, LOCAL_QUORUM, or EACH_QUORUM, for the operation to succeed.

A quorum is calculated as (replication factor / 2) + 1.

If the result is not an integer, it is rounded down to the previous whole number.

For example, if the replication factor is 3, quorum is 2. If the replication factor is 4, quorum is 3.
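
The calculation above can be sketched with integer division, which performs the rounding down. The function name is an assumption for illustration.

```python
def quorum(replication_factor):
    """Return (replication factor / 2) + 1, rounded down to a whole number."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1
```

This gives 2 for a replication factor of 3 and 3 for a replication factor of 4, matching the examples above.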

R

Rack

A physical or logical grouping of nodes within a datacenter. Group nodes into racks to represent their physical proximity, such as nodes located within the same rack or in neighboring racks. You can also use racks to group nodes that are logically close to each other, such as nodes in the same network segment or with similar workloads.

Range movement

A change in the expanse of tokens assigned to a node.

Read repair

A process that updates database replicas with the most recent version of frequently-read data.

Reflexion

The ability of an AI agent to iteratively inspect its own code, evaluate its performance, and correct mistakes.

Replica placement strategy

A specification that determines the replicas for each row of data.

Replication factor (RF)

The total number of replicas across a cluster, abbreviated as RF. A replication factor of 1 means that there is only one copy of each row in a cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 indicates two copies of each row and that each copy is on a different node. All replicas are equally important; there is no primary or master replica.

Replication group

See Datacenter.

Resilient Distributed Datasets (RDD)

A fundamental data structure of Apache Spark™. RDD is an immutable distributed collection of objects. RDD actions return values, and transformations return pointers to new RDDs.

Retrieval-Augmented Generation (RAG)

A technique for building AI applications that modifies the response generated by an LLM with relevant context retrieved with vector search. See What are vector databases?.

Rolling restart

A procedure that is performed during upgrading nodes in a cluster for zero downtime. Nodes are upgraded and restarted one at a time while other nodes continue to operate online.

Row

Columns that have the same primary key.
A group of cells per combination of columns in the storage engine.

Row cache

A database component for improving the performance of read-intensive operations. In off-heap memory, the row cache holds the most recently read rows from the local SSTables. Each local read operation stores its result set in the row cache and sends it to the coordinator node. The next read first checks the row cache. If the required data is there, the database returns it immediately. This initial read can save further seeks in the Bloom filter, partition key cache, partition summary, partition index, and SSTables.

The database uses LRU (least-recently-used) eviction to ensure that the row cache is refreshed with the most recently accessed rows. The size of the row cache can be configured in the cassandra.yaml file.
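
LRU eviction in general can be sketched as follows. This is a generic illustration of the policy, not the database's row cache implementation. The class name and the capacity measured in rows are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._rows = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        """Return the cached row, or None on a cache miss."""
        if key not in self._rows:
            return None
        self._rows.move_to_end(key)  # mark as most recently used
        return self._rows[key]

    def put(self, key, row):
        """Store a row, evicting the least recently used one if needed."""
        self._rows[key] = row
        self._rows.move_to_end(key)
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # drop the oldest entry
```

With a capacity of two, storing rows a and b, reading a, and then storing c evicts b, because b is the entry that was used least recently.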

S

Scan query

A graph query that traverses an entire graph or large sections of the graph.

Scheme

Authentication scheme: Defines a service, such as Kerberos or LDAP, that is used for authentication, role assignment, or both.
Database scheme: Describes all database resources.

SearchAnalytics

In DSE, nodes started as stand-alone processes or services in SearchAnalytics mode allow you to create analytics queries that use search indexes.

Search index

In DataStax Enterprise (DSE), a search index is an Apache Solr™ core. Each DSE Search index uses an internally stored index configuration pair that is automatically generated when the index is created. See Search index configuration and Search index schema.

Secret

A Kubernetes (K8s) object that stores sensitive information, such as passwords, API keys, and Secure Shell Protocol (SSH) keys, and that enables pods to use that information without the data being exposed. Sensitive data is exposed to containers either as a file in a volume mount or through environment variables. Examples within Mission Control are TLS certificates (including Java Key Stores), Java Key Store passwords, and cluster superuser credentials.

Seed

A node used to bootstrap the gossip process for new nodes joining a cluster. The node provides no other function, and it isn’t a single point of failure for a cluster. Also known as a seed node.

Segment

A small token range of a table used by NodeSync to validate and repair data across replicas. NodeSync uses the size of the entire table to determine how many segments (depth) to divide the table into so that segments are approximately 200 MB.

Serializable consistency

See Linearizable consistency.

Set

A Cassandra Query Language (CQL) collection that consists of an unordered list of unique values. An example of set data is {'bicycle', 'treadmill', 'unicycle'}.

Similarity metric

A function that quantifies how similar two objects or datasets are, commonly used in machine learning and data analysis. Also known as similarity function or similarity measure. See What are vector databases?.
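
Cosine similarity is one widely used similarity metric. The sketch below is a plain-Python illustration; the function name `cosine_similarity` is chosen here for the example and is not a DataStax API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Values near 1 indicate vectors that point in nearly the same direction, values near 0 indicate unrelated directions, and values near -1 indicate opposite directions.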

SizeTieredCompactionStrategy (STCS)

The default compaction strategy. This strategy triggers a minor compaction when the number of similarly sized SSTables on disk reaches the value configured by the table subproperty min_threshold. A minor compaction does not involve all the tables in a keyspace.

For more information, see STCS compaction subproperties.
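
The trigger condition can be sketched as a bucketing step followed by a threshold check. This is a simplified illustration, not the database's compaction code: the function names are hypothetical, and the `bucket_low` and `bucket_high` bounds and the `min_threshold` value of 4 are assumptions of the sketch.

```python
def stcs_buckets(sstable_sizes, bucket_low=0.5, bucket_high=1.5):
    """Group sizes so each bucket holds values near the bucket's running average."""
    buckets = []  # each bucket is a list of sizes
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no existing bucket is close enough in size
    return buckets


def compaction_candidates(sstable_sizes, min_threshold=4):
    """Return the buckets that contain at least min_threshold SSTables."""
    return [b for b in stcs_buckets(sstable_sizes) if len(b) >= min_threshold]


sizes_mb = [10, 11, 12, 13, 100, 105, 900]
print(compaction_candidates(sizes_mb))  # only the four small SSTables qualify
```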

Slice

A set of clustered columns in a partition that you query as a set using, for example, a conditional WHERE clause.

Snitch

The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and datacenters. The request routing mechanism is affected by which of the several types of snitches is used.

Sorted string table (SSTable)

An immutable data file to which the database writes memtables periodically. SSTables are stored on disk sequentially and maintained for each database table.

Static column

A special column that is shared by all rows of a partition.

StorageClass

A Kubernetes resource providing a configuration template to a backend storage provisioner. This may include rules around reclaiming disks after they are no longer bound to a particular workload or may identify when disks should be scheduled in relation to workloads.

View StorageClass details from the command line with kubectl get storageclass.

Streaming

A component that handles data exchange among nodes in a cluster. The data exchanged consists of parts of SSTable files.

Examples include:

When bootstrapping a new node, the new node gets data from existing nodes using streaming.
When running nodetool repair, nodes exchange out-of-sync data using streaming.
When bulkloading data from backup, sstableloader uses streaming to complete a task.

Strong consistency

As a database reads data, it performs a read repair before returning results.

Superuser

A role attribute that provides root database access. Superusers have all permissions on all objects. DataStax Enterprise (DSE) databases include the superuser role cassandra with password cassandra by default. This account runs queries, including logins, with a consistency level of QUORUM. DataStax recommends creating a superuser for deployments and removing the cassandra role.

T

Table

A group of columns ordered by name and fetched by row. A row consists of columns and has a primary key; the first part of the key is a column name. Subsequent parts of a compound key are other column names that define the order of columns in the table.

Task Tracker

For each node, one Task Tracker service handles the Apache Hadoop® MapReduce tasks that are scheduled for a Hadoop-enabled node.

Time-to-live (TTL)

An optional expiration date for values that are inserted into a column. See Expiring data with time-to-live.

TimeWindowCompactionStrategy (TWCS)

This compaction strategy compacts SSTables based on a series of time windows. During the current time window, the SSTables are compacted into one or more SSTables. At the end of the current time window, all SSTables are compacted into a single larger SSTable. The compaction process repeats at the start of the next time window. Each TWCS time window contains data within a specified range and contains varying amounts of data.
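
The grouping step can be sketched as assigning each SSTable to the window that contains its newest data. The tuple layout, the SSTable names, and the function `group_by_window` are hypothetical, and a real implementation takes more into account than this sketch does.

```python
from collections import defaultdict


def group_by_window(sstables, window_seconds):
    """Map each window start time to the SSTables whose newest data falls in it.

    sstables is a list of (name, max_timestamp_seconds) pairs (hypothetical layout).
    """
    windows = defaultdict(list)
    for name, max_ts in sstables:
        window_start = max_ts - (max_ts % window_seconds)
        windows[window_start].append(name)
    return dict(windows)


one_hour = 3600
sstables = [("sst-1", 10), ("sst-2", 1800), ("sst-3", 3700), ("sst-4", 7300)]
for start, names in sorted(group_by_window(sstables, one_hour).items()):
    print(start, names)
```

SSTables that share a window are candidates for being compacted together; SSTables in different windows are left apart.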

Token

In a cluster management context, a token is an element on a ring whose range depends on the partitioner. A token determines a node’s position on the ring and the portion of data for which the node is responsible. The range for the Murmur3Partitioner (default) is -2⁶³ to +2⁶³-1. The range for the RandomPartitioner is 0 to 2¹²⁷-1.

For other contexts, see Tokenizer.
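
For the cluster management sense, the arithmetic behind evenly spaced tokens on a Murmur3Partitioner ring can be sketched as follows. The function name is hypothetical, and the sketch assumes one token per node on a ring of 2⁶⁴ positions that starts at -2⁶³.

```python
def evenly_spaced_murmur3_tokens(node_count):
    """Spread node_count tokens evenly over a 2**64-position ring starting at -2**63."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [(2**64 // node_count) * i - 2**63 for i in range(node_count)]


for token in evenly_spaced_murmur3_tokens(4):
    print(token)
```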

Tokenizer

A tool or process that breaks down input data, such as text, into smaller units that are semantically relevant for processing in a model, often called tokens. See Tokenization.

Tombstone

A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.

For more information, see What are tombstones?.

Transformers

A type of deep learning architecture used for processing sequences of data. See The Illustrated Transformer.

Traversal source

A domain specific language (DSL) that specifies the traversal methods used by a graph traversal.

Tunable consistency

The database ensures that all replicas of any given row eventually become completely consistent. For situations requiring immediate and complete consistency, the database can be tuned to provide 100% consistency for specified operations, datacenters, or clusters. The database cannot be tuned to complete consistency for all data and operations.
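
One common way to reason about tuning is to compare the number of replicas that acknowledge a write with the number consulted on a read. The sketch below assumes the majority rule for a quorum and the usual overlap condition (read replicas + write replicas > replication factor); the function names are hypothetical.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def overlap_guaranteed(read_replicas, write_replicas, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                      # 2
print(overlap_guaranteed(quorum(rf), quorum(rf), rf))  # True
print(overlap_guaranteed(1, 1, rf))                    # False
```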

Tuple

A fixed-length set of one or more positional fields, each with a defined data type. An example of tuple data is ('Italy', 'Fabio Fabio', 163), where the three values have the data types text, text, and integer, respectively. Tuples are always frozen, meaning the whole value must be updated or deleted, and individual elements cannot be changed. However, defining a tuple does not require the keyword frozen.

U

Undirected graph

A set of vertices and a set of edges (unordered pairs of vertices).

Upsert

A change in the database that updates a specified column in a row if the column exists. If the column does not exist, then that column is inserted.

User-defined type (UDT)

A Cassandra Query Language (CQL) data type that consists of a user-defined object of objects. An example of UDT data is an address stored as {name:'john doe', street:'999 High St', city:'Oxford', country:'UK'}. Each UDT includes the column names and data types for all the columns included.

V

Vector

An ordered list of numbers, frequently used in neural networks and machine learning applications. Embeddings are a specific type of vector that encodes semantic meaning. See What are vector databases?.

Also refers to an array of floating point type that represents a specific object or entity.

Vector database

A database designed for storing vectors. See What are vector databases?.

Vector index

A data structure used to efficiently store and query high-dimensional vectors for similarity or distance-based retrievals. See What are vector databases?.

Vector search

A search that compares vectors stored in a database by the distance between them. The closer two vectors are, the more similar the data they represent. The greater the distance, the less similar the data. See What are vector databases?.
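
The distance comparison can be sketched as an exhaustive scan that ranks stored vectors by Euclidean distance from a query vector. The catalog contents and the function `nearest_neighbors` are made up for illustration; a vector database would typically use a vector index rather than scanning every vector.

```python
import math


def nearest_neighbors(query, vectors, k=1):
    """Return the k (name, distance) pairs closest to query by Euclidean distance.

    This is an exhaustive scan, not an index-backed or approximate search.
    """
    scored = [(name, math.dist(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


catalog = {
    "bicycle": [1.0, 0.0],
    "unicycle": [0.9, 0.1],
    "treadmill": [0.0, 1.0],
}
print(nearest_neighbors([1.0, 0.05], catalog, k=2))
```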

Vertex

A vertex is the fundamental unit from which graphs are formed. A vertex can also be described as an object that has incoming and outgoing edges.

Vertex-centric index

A local index structure built per vertex in a graph.

Vertex degree

The number of edges incident to a vertex in a graph.
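
As a small illustration, the degree of every vertex in an undirected graph can be counted from a list of edges. The function name and the edge list are hypothetical.

```python
def vertex_degrees(edges):
    """Count incident edges per vertex for an undirected graph given as vertex pairs."""
    degrees = {}
    for u, v in edges:
        degrees[u] = degrees.get(u, 0) + 1
        degrees[v] = degrees.get(v, 0) + 1
    return degrees


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(vertex_degrees(edges))
```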

Virtual node (Vnode)

By default, nodes are responsible for a single partitioning range in the full token range of a cluster. With vnodes enabled, each node is responsible for several virtual nodes, effectively spreading a partitioning range across more nodes in a cluster. Enabling vnodes can reduce the risk of hotspotting or straining one node in a cluster.
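
The difference between one range per node and several can be sketched by assigning random tokens to each node and counting the ranges each node owns. The node names, the functions, and the random token assignment are assumptions of this sketch and do not reflect how the database allocates tokens.

```python
import random


def assign_tokens(nodes, tokens_per_node, seed=0):
    """Return a sorted ring of (token, node) pairs with tokens_per_node tokens each."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    ring = []
    for node in nodes:
        for _ in range(tokens_per_node):
            ring.append((rng.randrange(-2**63, 2**63), node))
    return sorted(ring)


def ranges_per_node(ring):
    """Count how many token ranges each node owns."""
    counts = {}
    for _, node in ring:
        counts[node] = counts.get(node, 0) + 1
    return counts


nodes = ["node_a", "node_b", "node_c"]
print(ranges_per_node(assign_tokens(nodes, tokens_per_node=1)))
print(ranges_per_node(assign_tokens(nodes, tokens_per_node=8)))
```

With several tokens per node, each node's share of the ring is made up of many small, scattered ranges instead of one contiguous range.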

W

Weak consistency

When reading data, the database performs read repair after returning results.

Wide row

A data partition that, in Cassandra Query Language (CQL) version 3.0, transposes into familiar row-based resultsets.

X, Y, Z

Zombie

A row or cell that reappears in a database table after deletion. This can happen if a node goes down for a long period of time and is then restored without being repaired.

Deleted data is not erased from database tables; it is marked with tombstones until compaction. The tombstones created on one node must be propagated to the nodes containing the deleted data. If one of these nodes goes down before this happens, the node may not receive the most up-to-date tombstones. If the node is not repaired before it comes back online, the database finds the non-tombstoned items and propagates them to other nodes as new data. To avoid this problem in DSE, HCD, or open-source Apache Cassandra®, run nodetool repair on any restored node before rejoining it to its cluster.

For more information, see Delete data.
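
The sequence of events that produces a zombie can be sketched with two toy replicas held in dictionaries. Everything here is hypothetical: the cell layout, the `merge` function standing in for repair, and the `purge_tombstones` function standing in for compaction.

```python
def merge(replicas, key):
    """Newest-timestamp-wins sync of one key across replicas (a stand-in for repair)."""
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return
    newest = max(versions, key=lambda cell: cell["ts"])
    for r in replicas:
        r[key] = dict(newest)


def purge_tombstones(replica):
    """Drop tombstoned cells, as compaction eventually does."""
    for key in [k for k, cell in replica.items() if cell["deleted"]]:
        del replica[key]


def scenario(repair_before_purge):
    live = {"k": {"value": "v1", "ts": 1, "deleted": False}}
    down = {"k": {"value": "v1", "ts": 1, "deleted": False}}
    # The delete reaches only the live replica because the other node is down.
    live["k"] = {"value": None, "ts": 2, "deleted": True}
    if repair_before_purge:
        merge([live, down], "k")  # the restored node receives the tombstone
    purge_tombstones(live)
    purge_tombstones(down)
    merge([live, down], "k")      # later sync between the replicas
    return "k" in live


print(scenario(repair_before_purge=False))  # True: the deleted row is back (zombie)
print(scenario(repair_before_purge=True))   # False: the delete holds
```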