DataStax glossary

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | R | S | T | U | V | W | X, Y, Z

A

adjacency list

A collection of unordered lists used to represent a finite graph. Each list describes the set of neighbors of a vertex in the graph.
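
As a rough illustration of the definition above, the following Python sketch builds an adjacency list for a small undirected graph. The vertex names, the `edges` data, and the helper `build_adjacency_list` are hypothetical and chosen only for this example.

```python
# Hypothetical example graph: each pair is an undirected edge between two vertices.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

def build_adjacency_list(edge_pairs):
    """Map each vertex to the unordered set of its neighbors."""
    adjacency = {}
    for u, v in edge_pairs:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

adjacency = build_adjacency_list(edges)
```

Each key is a vertex and each value is the set of vertices adjacent to it, so `adjacency["c"]` holds the neighbors of vertex `c`.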

adjacent vertex

A vertex directly attached to another vertex by an edge in a graph.

agents

Vector agent: A system that can make decisions based on its inputs or environment. Intelligent agents, such as those employed in machine learning and artificial intelligence systems, often use vector databases to facilitate rapid and efficient search, comparison, and retrieval of high-dimensional data.

DataStax agents: Must be installed on every managed node in a cluster and are necessary to perform most of the functionality within OpsCenter. See Lifecycle Manager (LCM).

anti-entropy

The synchronization of replica data on nodes to ensure that the data is fresh.

Approximate Nearest Neighbor (ANN)

A machine learning algorithm that locates the most similar vectors to a given item in a dataset.

The ANN search method is used to quickly find approximate nearest neighbors in large datasets, often with high-dimensional features, sacrificing some accuracy for speed.

authentication

Process of establishing the identity of a user, DSE tool, or application.

authorization

Process of establishing permissions to database resources through roles.

B

back pressure

Pausing or blocking the buffering of incoming requests after reaching the threshold until the internal processing of buffered requests catches up.

bloom filter

An off-heap structure associated with each SSTable that checks if any data for the requested row exists in the SSTable before doing any disk I/O.
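
The idea can be sketched in a few lines of Python. This is a minimal, illustrative Bloom filter and not the database's implementation: the class name, the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was definitely never added; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

A `False` answer lets the caller skip the disk read entirely, which is the property the definition above relies on. A `True` answer still requires checking the underlying data.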

bootstrap

The process by which new nodes join the cluster, transparently gathering the data they need from existing nodes.

C

cardinality

The number of unique values in a column. For example, a column of ID numbers unique for each employee would have high cardinality while a column of employee ZIP codes would have low cardinality because multiple employees can have the same ZIP code.

An index on a column with low cardinality can boost read performance because the index is significantly smaller than the column. An index for a high-cardinality column may reduce performance. If your application requires a search on a high-cardinality column, a materialized view is ideal.

CassandraDatacenter

A Kubernetes Custom Resource (CR) representing a DataStax Enterprise or an Apache Cassandra logical datacenter and rack configuration. An operator reconciles the declarative state within the CassandraDatacenter CR against nodes within the cluster. Note that a single MissionControlCluster may be related to multiple CassandraDatacenter custom resources. See MissionControlCluster.

cell

The smallest increment of stored data. Contains a value in a row-column intersection.

chunking

Chunking breaks text into chunks (subsets of tokens), each of which represents a piece of information. In techniques like RAG, documents undergo chunking: embeddings are generated from the chunks, stored in a vector database, and retrieved as part of the prompting process.
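
A minimal Python sketch of one possible chunking approach follows. It assumes whitespace-separated words stand in for tokens and uses fixed-size chunks with a small overlap. The function name `chunk_tokens` and its parameters are hypothetical, and real pipelines typically use a model-specific tokenizer.

```python
def chunk_tokens(text, chunk_size=5, overlap=1):
    """Split text into chunks of at most chunk_size tokens, sharing overlap tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace-separated words act as tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks
```

For example, `chunk_tokens("a b c d e f g", chunk_size=3, overlap=1)` yields three chunks, each sharing one token with its neighbor. Each chunk would then be passed to an embedding model.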

cluster

Two or more database instances that exchange messages using the gossip protocol.

clustering

The storage engine process that creates an index and keeps data in order based on the index.

clustering column

In the table definition, a clustering column is a column that is part of the compound primary key definition. Note that the clustering column cannot be the first column because that position is reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.

coalescing strategy

Strategy to combine multiple network messages into a single packet for outbound TCP connections to nodes in the same data center (intra-DC) or to nodes in a different data center (inter-DC). A coalescing strategy is provided with a blocking queue of pending messages and an output collection for messages to send.

collection

Vector collection: Datasets that contain various forms of data, such as vector embeddings. Datasets and collections are often used interchangeably.

CQL collection: A data type that is a group of objects that is represented as a single unit. Collections in CQL are: set, list, and map.

column

The smallest increment of data. Contains a name, a value, and a timestamp.

column family

A container for rows, similar to the table in a relational system. Called a table in CQL 3.

commit log

A file to which the database appends changed data for recovery in the event of a hardware failure.

compaction

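The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index. Compaction strategies defined in this glossary include DateTieredCompactionStrategy (DTCS), LeveledCompactionStrategy (LCS), and SizeTieredCompactionStrategy (STCS).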

composite partition key

A partition key consisting of multiple columns.

compound primary key

A primary key consisting of the partition key, which determines the node on which data is stored, and one or more additional columns that determine clustering.

consistency

The synchronization of data on replicas in a cluster. Consistency is categorized as weak or strong.

consistency level

A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
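
To make the definition concrete, the sketch below computes how many replicas must respond for three common consistency levels, given a replication factor. The helper name `replicas_required` is hypothetical, and only `ONE`, `QUORUM`, and `ALL` are modeled. A quorum is taken to be a majority of replicas.

```python
def replicas_required(level, replication_factor):
    """Return how many replicas must acknowledge a request at the given level."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    required = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # majority of replicas
        "ALL": replication_factor,
    }
    return required[level]
```

With a replication factor of 3, `QUORUM` needs two replicas, so one replica can be unavailable without failing the request.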

Container

Self-contained applications and all the dependencies needed to run them. For example, a DataStax Enterprise container includes DSE, management API, and Java Runtime Environment. Additionally, each container is cryptographically verified at runtime to ensure there have been no modifications after being packaged by DataStax. All DataStax Mission Control components are packaged as containers and are scheduled as Pods. See Pod.

Control Plane

The management layer of a DataStax Mission Control installation. After administrators define MissionControlCluster custom resources through the User Interface (UI) or Command Line Interface (CLI), the resources are distributed across the appropriate Data Planes. This layer contains the Mission Control UI and API services, a number of operators that reconcile cluster-level resources, and observability components.

converge

The process of aligning the real-world state of the node, datacenter, or cluster with the desired state by successfully executing a Lifecycle Manager job.

coordinator node

The node that determines which nodes in the ring should get the request, based on the snitch configured for the cluster.

cosine similarity

A metric measuring the similarity between two non-zero vectors in a multi-dimensional space. It quantifies the cosine of the angle between the vectors, where the angle reflects the orientation of the vectors relative to each other. One (1) indicates that the vectors point in the same direction (complete similarity). Zero (0) indicates that the vectors are orthogonal (complete dissimilarity). Negative one (-1) indicates that the vectors point in exactly opposite directions.
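
The calculation can be sketched in Python as the dot product of the two vectors divided by the product of their lengths. The function name `cosine_similarity` is chosen for illustration, and plain lists stand in for vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Because only the angle matters, scaling a vector does not change the result: `[1, 2]` and `[2, 4]` have a similarity of approximately 1.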

CQL shell

The Cassandra Query Language shell (cqlsh) utility.

cross-data center forwarding

A technique for optimizing replication across datacenters by sending data from one datacenter to a node in another datacenter. The receiving node then forwards the data to other nodes in its data center.

D

datacenter

A group of related nodes that are configured together within a cluster for replication and workload segregation purposes. Not necessarily a separate location or physical data center. Datacenter names are case sensitive and cannot be changed.

Data Plane

An infrastructure layer where DSE and Apache Cassandra workloads are deployed. These resources operate at the datacenter or region level of globally deployed DataStax Enterprise clusters. Services within the Data Plane include operators for reconciliation of local resources, observability ingestion and routing components, and DSE or Cassandra nodes.

dataset

A collection of data points or records used for analysis. Datasets and collections are often used interchangeably.

data type

A particular kind of data item, defined by the values it can take or the operations that can be performed on it.

DateTieredCompactionStrategy (DTCS)

DateTieredCompactionStrategy (DTCS) is deprecated starting in Apache Cassandra 3.8 and in DataStax Enterprise. This strategy is particularly useful for time series data and stores data written within a certain period of time in the same SSTable. For example, Apache Cassandra can store your last hour of data in one SSTable time window while the next 4 hours of data are stored in another time window, etc. The most common queries for time series workloads retrieve the last hour/day/month of data.

deep query

Graph queries that traverse a dense graph (a large number of connected vertices) or a graph with a high branching factor.

denormalization

Denormalization refers to the process of optimizing the read performance of a database by adding redundant data or by grouping data. This process is accomplished by duplicating data in multiple tables or by grouping data for queries.

directed graph

A set of vertices and a set of arcs (ordered pairs of vertices). In DSE Graph, edges are directional and the term "arcs" is not used.

E

EBNF

EBNF (Extended Backus-Naur Form) syntax expresses a context-free grammar that formally describes a language. EBNF extends its precursor BNF (Backus-Naur Form) with additional operators allowed in expansions. Syntax (railroad) diagrams graphically depict EBNF grammars.

edge

A connection between graph vertices. Edges can be unordered (no directional orientation) or ordered (directional). An edge can also be described as an object that has a vertex at its tail and head.

element

A graph element is a vertex, edge, or property.

embedding

Turning data, like words or images, into vectors to capture their meaning.

Also describes a mathematical technique in machine learning where complex, high-dimensional data is represented as points in a lower-dimensional space. The process of creating an embedding preserves the relevant properties of the original data, such as distance and similarity, enabling easier computational processing. For instance, words with similar meanings in Natural Language Processing (NLP) can be set close to each other in the reduced space, facilitating their use in machine learning models.

Euclidean distance

A non-negative distance metric from coordinate geometry that measures the straight-line distance between two points. It quantifies the similarity or dissimilarity between data points represented as vectors: the smaller the distance, the more similar the points. Use it to compare generated samples to real data points.
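
As a small illustration, the distance is the square root of the sum of squared coordinate differences. The function name `euclidean_distance` is chosen for this sketch, and plain lists stand in for vectors.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

The result is zero only when the two points coincide and grows as the points move apart.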

eventual consistency

The database maximizes availability and partition tolerance. The database ensures eventual data consistency by updating all replicas during read operations and periodically checking and updating any replicas not directly accessed. The updating and checking ensures that any query always returns the most recent version of the result set and that all replicas of any given row eventually become completely consistent with each other.

F

faceted search

Faceted search is the dynamic clustering of items or search results into categories. Faceted search uses any value in any field to drill into search results or even skip searching entirely.

FLARE pattern

An advanced retrieval technique that combines retrieval and generation in LLMs. It enhances the accuracy of responses by iteratively predicting the upcoming sentence to anticipate future content when the model encounters a token it is uncertain about.

frozen

A CQL collection or user-defined type (UDT) that treats the value as an immutable blob. To update or delete an item, redefine the entire value of the item because individual elements of the item cannot be updated or deleted.

An example is a frozen list: ['bicycle', 'treadmill', 'unicycle', 'bicycle']. If you want to update or append a value to this frozen list, the entire list must be rewritten:

UPDATE items SET list_of_items = ['bicycle', 'treadmill', 'unicycle', 'bicycle', 'rower']
  WHERE id = 1;

G

garbage collector

A Java background process that frees heap memory when it is no longer in use by the program. The main Java algorithms to allocate and clean up memory are Concurrent Mark Sweep (CMS) and Garbage-First (G1). DataStax Enterprise 5.1 and higher use the G1 garbage collector by default.

global index

An index structure over an entire graph.

gossip

A peer-to-peer communication protocol for exchanging location and state information between nodes.

graph

A collection of vertices and edges.

graph degree

The largest vertex degree of a graph.
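
A short Python sketch of this definition, assuming the graph is given as a mapping from each vertex to the set of its neighbors (a hypothetical representation chosen for the example):

```python
def graph_degree(adjacency):
    """Return the largest vertex degree, or 0 for an empty graph."""
    return max((len(neighbors) for neighbors in adjacency.values()), default=0)

# Hypothetical star-shaped graph: "a" is connected to "b" and "c".
star = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
```

Here `graph_degree(star)` is 2, the degree of vertex `a`.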

graph index

A data structure that allows for the fast retrieval of elements by a particular key-value pair.

graph partitioning

A process that consists of dividing a graph into components of about the same size with few connections between the components.

graph traversal

An algorithmic walk across the elements of a graph according to the referential structure explicit within the graph data structure.

H

HDD

A hard disk drive (HDD) or spinning disk is a data storage device used for storing and retrieving digital information using one or more rigid rapidly rotating disks. Compare to SSD.

HDFS

Hadoop Distributed File System (HDFS) stores data on nodes to improve performance. HDFS is a necessary component in addition to MapReduce in a Hadoop distribution.

headroom

The amount of disk space required by a process (such as compaction) in addition to the space occupied by the data being processed.

I

idempotent

An operation that can occur multiple times without changing the result, such as performing the same update multiple times without affecting the outcome.
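
The contrast below is a minimal sketch using plain Python dictionaries as stand-ins for records. The functions `set_status` and `increment_visits` are hypothetical. Setting a field to a fixed value is idempotent, while incrementing a counter is not.

```python
def set_status(record, status):
    """Idempotent: applying it repeatedly gives the same record as applying it once."""
    updated = dict(record)
    updated["status"] = status
    return updated

def increment_visits(record):
    """Not idempotent: every application changes the result."""
    updated = dict(record)
    updated["visits"] = updated.get("visits", 0) + 1
    return updated
```

An idempotent operation is safe to retry after a timeout, because a duplicate application leaves the data unchanged.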

immutable

Data on a disk that cannot be overwritten.

incident edge

An edge is incident to a vertex when that vertex is an endpoint of the edge in a graph.

index

A native capability for finding a column in the database that does not involve using the primary key.

Indexing

Organizing data to make retrieval more efficient.

integration

Integrations connect a third-party tool of your choice to Astra DB for data management, machine learning, analytics, and more.

J

Jaccard similarity

A measure of similarity between two sets of features or elements in generated data and real data. The mathematical calculation is the size of the intersection of two sets divided by the size of their union, and ranges from zero (0) to one (1). One (1) indicates identical sets.
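
The calculation described above can be sketched in Python as follows. The function name `jaccard_similarity` is chosen for illustration, and treating two empty sets as identical is an assumption, since the ratio is otherwise undefined in that case.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(set_a & set_b) / len(union)
```

For example, `{"a", "b", "c"}` and `{"b", "c", "d"}` share two of four distinct elements, giving a similarity of 0.5.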

Job Tracker

Used for analytics nodes that analyze data using Hadoop. Data is analyzed using both DSE Hadoop and external Hadoop systems. Within a data center, the Job Tracker monitors the execution and status of distributed tasks that comprise a MapReduce job.

K

Kafka Struct

An Apache Kafka-structured record containing a set of named fields with values. Each field in a Kafka Struct uses an independent Schema. DataStax Apache Kafka Connector supports both generic structs and advanced struct types backed by the schema registry, such as the Avro format.

K8ssandraCluster

A Kubernetes Custom Resource representing a DataStax Enterprise or an Apache Cassandra cluster. This custom resource is automatically generated after creation of a MissionControlCluster resource.

keyspace

A namespace container that defines how data is replicated on nodes in each datacenter.

keytab

A file containing pairs of Kerberos principals and encrypted keys. Allows authentication with a Kerberos-enabled DataStax database without entering a password.

k-Nearest Neighbors (kNN)

A supervised machine learning algorithm that classifies an item based on the majority class of its 'k' most similar items in the dataset.
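
A minimal Python sketch of the classification step follows. It assumes Euclidean distance as the similarity measure and a brute-force scan of the training data. The function name `knn_classify` and the `(vector, label)` data layout are hypothetical.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Label the query with the majority class among its k nearest training items.

    training is a list of (vector, label) pairs.
    """
    if k <= 0 or k > len(training):
        raise ValueError("k must be between 1 and the number of training items")
    # Sort by distance to the query and keep the k closest items.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

This exact search compares the query against every training item. Approximate methods such as ANN trade some accuracy to avoid that full scan on large datasets.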

KOTS

An open-source application providing an Admin Console and Command Line Interface (CLI) to handle installing, managing, and updating software packages such as DataStax Mission Control.

Kurl

Provides a kubeadm-based Kubernetes distribution with add-ons for common cloud-native components. DataStax Mission Control utilizes Kurl as its embedded Kubernetes runtime.

krb5.conf

File that contains the Kerberos configuration used by clients for connection and ticket generation. See MIT Kerberos krb5.conf documentation. The default location is /etc. If krb5.conf is in another location, override the default location by setting the environment variable KRB5_CONFIG. To use multiple configuration files, set a colon-separated filename list in KRB5_CONFIG; all files are read.

kubectl

A Kubernetes-specific command line interface (CLI) that allows communication and control of K8s clusters through their API server.

Kubelet

A process run on each piece of infrastructure within a DataStax Mission Control deployment. Kubelet is responsible for coordinating the running of containers and reporting back the status of various tasks and processes.

L

Large Language Models (LLMs)

Models that can generate long passages of text.

Lifecycle Manager (LCM)

Lifecycle Manager (LCM) is a provisioning and configuration management system for easily managing DataStax Enterprise (DSE) clusters. This management system is a web interface enabling efficient installation and configuration of DataStax Enterprise nodes. The LCM completely defines the cluster configuration including datacenter and node topology and integrates deeply with the full spectrum of DataStax Enterprise settings.

LeveledCompactionStrategy (LCS)

This compaction strategy creates SSTables of a fixed, relatively small size that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2, and so on) is ten times as large as the previous level. Disk I/O is more uniform and predictable on higher levels than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables in the next level. This process improves performance for reads because the database can determine which SSTables in each level to check for the existence of row key data.
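
Taking the ten-times ratio above at face value, the growth of level capacities can be sketched with simple arithmetic. The function `level_capacities` and the base size of 160 MB are hypothetical values chosen for illustration, not settings confirmed by this glossary.

```python
def level_capacities(base_size_mb, num_levels, fanout=10):
    """Capacity of each level when every level is fanout times the previous one."""
    return [base_size_mb * fanout ** level for level in range(num_levels)]

# Hypothetical base size of 160 MB for the first level.
capacities = level_capacities(160, 4)
```

The capacities grow geometrically, which is why a small number of levels is enough to hold a large amount of data.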

linearizable consistency

Also called serializable consistency, linearizable consistency is the restriction that one operation cannot be executed unless and until another operation has completed.

The database supports Lightweight transactions to ensure linearizable consistency in writes. The first phase of a Lightweight transaction works at SERIAL consistency and follows the Paxos protocol to ensure that the required operation succeeds. If this phase succeeds, the write is performed at the consistency level specified for the operation. Reads performed at the SERIAL consistency level execute without database built-in read repair operations.

list

A CQL collection that consists of an ordered list of values. An example of list data is ['bicycle', 'treadmill', 'unicycle', 'bicycle']. Because the list is ordered, a value may appear more than once in the list.

M

Machine Learning (ML)

A branch of artificial intelligence (AI) and computer science that uses and develops computer systems capable of learning and adapting without explicit instruction. ML uses algorithms and statistical models to analyze data and identify patterns, make decisions, and improve its system.

map

A CQL collection that consists of an ordered list of key-value pairs. An example of map data is {'john':'bicycle', 'jane':'treadmill', 'devon':'unicycle'}. Each item in the map can be indexed by key, value, or full entry.

MapReduce

Hadoop’s parallel processing engine that quickly processes large data sets. A necessary component in addition to HDFS in a Hadoop distribution.

materialized view

A materialized view is a table with data that is automatically inserted and updated from another base table. Has a primary key that differs from the base table, allowing the implementation of different queries.

memtable

A database table-specific, in-memory data structure that resembles a write-back cache.

meta-property

A property that describes some attribute of another property in a graph.

MissionControlCluster

A Kubernetes Custom Resource representing a DataStax Enterprise or an Apache Cassandra cluster. There is a one-to-one relationship between this resource and a cluster deployed. This top-level resource includes the cluster type (dse versus cassandra), the version, the desired topology (datacenter and rack definitions), and configuration overrides.

milliCPU

An absolute, metric specification for a CPU resource. 1,000 milliCPU (or 1,000m CPU) equals 1 CPU unit. The suffix m means milli. Fractional values such as 0.1 CPU are allowed, but CPU is always specified as an absolute quantity, never a relative one, and precision finer than 1m is not allowed. The specification is used to measure CPU usage. 0.1 CPU specifies the same amount of CPU on a single-core, dual-core, or 48-core machine.

In Kubernetes, one CPU is equivalent to:

  • 1 AWS vCPU

  • 1 GCP Core

  • 1 Azure vCore

  • 1 Hyperthread on a bare-metal Intel processor with Hyperthreading

Synonym: millicore
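
The unit rules above can be sketched as a small Python conversion helper. The function name `to_millicpu` is hypothetical. It accepts either a quantity with the m suffix (such as "500m") or a fractional CPU value (such as "0.5"), and rejects precision finer than 1m.

```python
from decimal import Decimal

def to_millicpu(quantity):
    """Convert a CPU quantity string to a whole number of milliCPU."""
    text = quantity.strip()
    if text.endswith("m"):
        millis = Decimal(text[:-1])       # already expressed in milliCPU
    else:
        millis = Decimal(text) * 1000     # 1 CPU unit equals 1,000 milliCPU
    if millis != millis.to_integral_value():
        raise ValueError("precision finer than 1m is not allowed")
    return int(millis)
```

`Decimal` is used instead of floating point so that values such as "0.1" convert to exactly 100 milliCPU.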

mixed workload

When a single cluster can run transactional, search, and analytics nodes.

mutation

A mutation is either an insertion or a deletion.

N

namespace

A container to hold collections of documents.

Natural Language Processing (NLP)

Techniques that help computers interpret and produce human language so that they are more useful to the user.

node

A Java virtual machine (a platform-independent execution environment that converts Java bytecode into machine language and executes it) that runs an instance of the Licensed Software.

node repair

A process that makes all data on a replica consistent.

non-frozen

A CQL collection or user-defined type (UDT) that treats the value as a mutable blob. To update or delete an item, redefine individual values of the item. An example is a non-frozen list: ['bicycle', 'treadmill', 'unicycle', 'bicycle']. To update or append a value to a non-frozen list, append the value:

UPDATE items SET list_of_items = list_of_items + ['rower']
  WHERE id = 1;

normalization

Vector normalization: The process of adjusting data values to a common scale to ensure that different features have equal importance in machine learning algorithms.

DSE normalization: A series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database’s schema. In DataStax Enterprise, this process is inefficient because joining data in multiple tables for queries requires accessing more nodes.

O

OLAP

Online Analytical Processing (OLAP) performs a multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. Compare to OLTP.

OLTP

Online transaction processing (OLTP) is characterized by a large number of short on-line transactions for data entry and retrieval. Compare to OLAP.

Operator

A software controller that manages the reconciliation of custom resources like MissionControlCluster and CassandraDatacenter against running processes on worker hosts. An operator's duties include the creation of new resources and logic to recover processes that have gone down.

order

The magnitude of the number of edges to the number of vertices.

P

partition

A partition is a collection of data addressable by a key. This data resides on one node in a Cassandra cluster. A partition is replicated on as many nodes as the replication factor specifies.

partition index

A list of primary keys and the start position of data.

partition key

A partition key represents a logical entity that helps a Cassandra cluster know on which node some requested data resides.

The partition key is the first column declared in the primary key definition. In a composite partition key, multiple columns together form the partition key.

partition range

The limits of the partition that differ depending on the configured partitioner. Murmur3Partitioner (default) range is -2^63 to +2^63 and RandomPartitioner range is 0 to 2^127 - 1.
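
Read as powers of two (2 to the 63rd and 2 to the 127th), the bounds stated above can be written out with Python's arbitrary-precision integers. The constant names are hypothetical, and the values simply restate the ranges given in this entry.

```python
# Murmur3Partitioner (default) bounds as stated in this entry.
MURMUR3_MIN = -(2 ** 63)
MURMUR3_MAX = 2 ** 63

# RandomPartitioner bounds as stated in this entry.
RANDOM_MIN = 0
RANDOM_MAX = 2 ** 127 - 1
```

The Murmur3Partitioner bounds have the magnitude of a signed 64-bit integer, while the RandomPartitioner upper bound needs 127 bits.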

partition summary

A subset of the partition index. By default, 1 partition key out of every 128 is sampled.
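
The sampling can be pictured with a short Python sketch that keeps one entry out of every 128. The function `build_partition_summary` and the list-of-tuples index layout are hypothetical simplifications, not the on-disk format.

```python
def build_partition_summary(partition_index, sampling_interval=128):
    """Keep every sampling_interval-th entry of the partition index."""
    return partition_index[::sampling_interval]

# Hypothetical index: (partition key, start position of the data) for 300 partitions.
index = [(f"key{i}", i * 100) for i in range(300)]
summary = build_partition_summary(index)
```

Because the summary is much smaller than the index, it can be searched first to narrow down where in the partition index to look.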

partitioned vertex

A portion of a vertex’s data resulting from dividing the vertex into smaller components for graph database storage. A partitioned vertex is used for vertices that have a large number of edges.

Partitioner

Distributes data across a cluster. The types of partitioners are Murmur3Partitioner (default), RandomPartitioner, and OrderPreservingPartitioner.

PersistentVolume (PV)

A Kubernetes resource representing a piece of storage available for use within the cluster. It can represent persistent physical drives or ephemeral block volumes to be provisioned on demand. See StorageClass and PersistentVolumeClaim (PVC).

PersistentVolumeClaim (PVC)

A Kubernetes resource that maps storage (a PersistentVolume (PV)) to compute (a Pod), with the associated configuration provided by a StorageClass. DataStax Mission Control names PVCs following a pattern (for example, name: premium-rwo-retain), and may pre-allocate them prior to provisioning a cluster.

Pod

A collection of containers. See node.

primary key

One or more columns that uniquely identify a row in a table. The primary key consists of the partition key and, in a compound primary key, one or more clustering columns.

prompt engineering

Crafting the right questions to get desired answers from AI.

property

A key-value pair that describes some attribute of either a vertex or an edge. Property key is used to describe the key in the key-value pair. All properties are global in DSE Graph, meaning that a property can be used for any vertices. For example, "name" can be used for all vertices in a graph.

R

range movement

A change in the expanse of tokens assigned to a node.

RDD

A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. An RDD is an immutable distributed collection of objects. RDD actions return values, and RDD transformations return pointers to new RDDs.

read repair

A process that updates database replicas with the most recent version of frequently-read data.

reflexion

The ability of an AI agent to iteratively inspect its own code, evaluate its performance, and correct mistakes.

replica placement strategy

A specification that determines the replicas for each row of data.

replication factor (RF)

The total number of replicas across the cluster, abbreviated as RF. A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 indicates two copies of each row and that each copy is on a different node. All replicas are equally important; there is no primary or master replica.

Retrieval-Augmented Generation (RAG)

A method that retrieves relevant documents and then generates a response. See the article "Retrieval-Augmented Generation (RAG) Explained: Understanding Key Concepts".

role

A set of permissions assigned to users that limits their access to database resources. When using internal authentication, roles can also have passwords and represent a single user, DSE client tool, or application.

rolling restart

A procedure performed when upgrading the nodes in a cluster to achieve zero downtime. Nodes are upgraded and restarted one at a time while other nodes continue to operate online.

row

1) Columns that have the same primary key.
2) A collection of cells per combination of columns in the storage engine.

row cache

A database component for improving the performance of read-intensive operations. In off-heap memory, the row cache holds the most recently read rows from the local SSTables. Each local read operation stores its result set in the row cache and sends it to the coordinator node. The next read first checks the row cache. If the required data is there, the database returns it immediately. This initial read can save further seeks in the Bloom filter, partition key cache, partition summary, partition index, and SSTables.

The database uses LRU (least-recently-used) eviction, so the rows that have gone longest without being read are removed first and the row cache retains the most recently accessed rows. The size of the row cache can be configured in the cassandra.yaml file.
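
The LRU eviction policy can be sketched in Python with an ordered dictionary. This is an illustrative in-process cache and not the database's off-heap row cache. The class name `LRURowCache` and its methods are hypothetical.

```python
from collections import OrderedDict

class LRURowCache:
    """Illustrative LRU cache: evicts the least recently used row when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._rows = OrderedDict()

    def get(self, key):
        if key not in self._rows:
            return None  # cache miss: the caller would read from the SSTables
        self._rows.move_to_end(key)  # mark the row as most recently used
        return self._rows[key]

    def put(self, key, row):
        self._rows[key] = row
        self._rows.move_to_end(key)
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # drop the least recently used row
```

Reading a row refreshes its position, so rows that are read often tend to stay cached while rarely read rows are evicted.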

S

scan query

A graph query that traverses an entire graph or large sections of the graph.

scheme

1) Authentication: Defines a service used either for authentication or for a role assignment such as Kerberos or LDAP, or both.
2) Database: Describes all database resources.

SearchAnalytics

Nodes started as stand-alone processes or services in SearchAnalytics mode allow you to create analytics queries that use search indexes.

search index

In DataStax Enterprise, a search index is an Apache Solr™ core. Each DSE Search index uses an internally stored index configuration pair that is automatically generated when the index is created. See Search index configuration 6.8 and Search index schema 6.8.

Secret

A Kubernetes (K8s) object that stores sensitive information such as passwords, API keys, and Secure Shell Protocol (SSH) keys, and that enables pods to use that information without the data being exposed. Sensitive data is exposed to containers either as a file in a volume mount or through environment variables. Examples within DataStax Mission Control are TLS certificates (including Java Key Stores), Java Key Store passwords, and cluster super user credentials.

seed

A seed, or seed node, is used to bootstrap the gossip process for new nodes joining a cluster. A seed node provides no other function and is not a single point of failure for a cluster.

segment

A segment is a small token range of a table used by NodeSync to validate and repair data across replicas. NodeSync uses the size of the entire table to determine how many segments (depth) to divide the table into so that segments are ~200MB.

set

A CQL collection that consists of an unordered list of unique values. An example of set data is {'bicycle', 'treadmill', 'unicycle'}.

similarity metric/function

A function that quantifies how similar two objects or datasets are, commonly used in machine learning and data analysis.

SizeTieredCompactionStrategy (STCS)

The default compaction strategy. This strategy triggers a minor compaction when there are a number of similar sized SSTables on disk as configured by the table subproperty, min_threshold. A minor compaction does not involve all the tables in a keyspace. Also see STCS compaction subproperties in the relevant CQL documentation.

slice

A set of clustered columns in a partition that you query as a set using, for example, a conditional WHERE clause.

Snitch

The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and datacenters. The request routing mechanism is affected by which of the several types of snitches is used.

SSD

A solid-state drive (SSD) is a solid-state storage device that uses integrated circuits to persistently store data. Compare to HDD.

SSTable

A sorted string table (SSTable) is an immutable data file to which the database writes memtables periodically. SSTables are stored on disk sequentially and maintained for each database table.

static column

A special column that is shared by all rows of a partition.

StorageClass

A Kubernetes resource providing a configuration template to a backend storage provisioner. This may include rules around reclaiming disks after they are no longer bound to a particular workload or may identify when disks should be scheduled in relation to workloads.

View StorageClass details from the command line with kubectl get storageclass.

streaming

A component that handles data exchange among nodes in a cluster. It is part of the SSTable file.

Examples include:

  • When bootstrapping a new node, the new node gets data from existing nodes using streaming.

  • When running nodetool repair, nodes exchange out-of-sync data using streaming.

  • When bulkloading data from backup, sstableloader uses streaming to complete a task.

strong consistency

When reading data, the database performs read repair before returning results.

superuser

Superuser is a role attribute that provides root database access. Superusers have all permissions on all objects. DataStax Enterprise databases include the superuser role cassandra with password cassandra by default. This account runs queries, including logins, with a consistency level of QUORUM. DataStax recommends creating a superuser for deployments and removing the cassandra role.

T

table

A collection of columns ordered by name and fetched by row. A row consists of columns and has a primary key; the first part of the key is a column name. Subsequent parts of a compound key are other column names that define the order of columns in the table.

Task Tracker

One Task Tracker service per node handles the Hadoop MapReduce tasks that are scheduled for a Hadoop-enabled node.

TimeWindowCompactionStrategy (TWCS)

This compaction strategy compacts SSTables based on a series of time windows. During the current time window, the SSTables are compacted into one or more SSTables. At the end of the current time window, all SSTables are compacted into a single larger SSTable. The compaction process repeats at the start of the next time window. Each TWCS time window contains data within a specified range and contains varying amounts of data.
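
The grouping of SSTables into time windows can be sketched in Python as below. The sketch assumes each SSTable is assigned to a window by its newest timestamp and that windows are aligned to multiples of the window size; both the function name and the (name, timestamp) input shape are hypothetical.

```python
from collections import defaultdict


def group_by_time_window(sstables, window_seconds):
    """Group (name, newest_timestamp_seconds) pairs by the start of their time window."""
    windows = defaultdict(list)
    for name, timestamp in sstables:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        windows[window_start].append(name)
    return dict(windows)
```

In this model, the SSTables collected under one window start are the ones that would eventually be compacted together into a single SSTable once that window closes.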

token

An element on the ring that depends on the partitioner. Determines the node’s position on the ring and the portion of data for which it is responsible. The range for the Murmur3Partitioner (default) is -2^63 to +2^63-1. The range for the RandomPartitioner is 0 to 2^127-1.
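
To make the ranges concrete, the Python sketch below writes them out as integers and computes evenly spaced Murmur3 tokens for a given number of single-token nodes. The Murmur3 bounds are taken as the signed 64-bit integer range (-2^63 to 2^63-1); the constant and function names are chosen for illustration.

```python
MURMUR3_MIN = -(2 ** 63)       # smallest Murmur3Partitioner token
MURMUR3_MAX = 2 ** 63 - 1      # largest Murmur3Partitioner token
RANDOM_MIN = 0                 # smallest RandomPartitioner token
RANDOM_MAX = 2 ** 127 - 1      # largest RandomPartitioner token


def evenly_spaced_murmur3_tokens(node_count):
    """Return one token per node, spaced evenly across the Murmur3 range."""
    step = 2 ** 64 // node_count
    return [MURMUR3_MIN + step * i for i in range(node_count)]
```

For a four-node cluster this yields tokens at -2^63, -2^62, 0, and 2^62, so each node is responsible for a quarter of the ring.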

tokenizer

A tool or process that breaks down input data, such as text, into smaller units that are semantically relevant for processing in a model, often called tokens.

tombstone

A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.

transformers

A type of deep learning architecture used for processing sequences of data.

traversal source

A domain specific language (DSL) that specifies the traversal methods used by a graph traversal.

TTL

Time-to-live (TTL) is an optional expiration period, in seconds, for values that are inserted into a column. Also see Expiring columns in the relevant CQL documentation.

tunable consistency

The database ensures that all replicas of any given row eventually become completely consistent. For situations requiring immediate and complete consistency, the database can be tuned to provide 100% consistency for specified operations, datacenters, or clusters. The database cannot be tuned to complete consistency for all data and operations.

tuple

A fixed-length group of one or more positional fields, each with a defined data type. An example of tuple data is ('Italy', 'Fabio Fabio', 163), where the three values have the data types text, text, and integer, respectively. Tuples are always frozen, meaning the whole value must be updated or deleted, and individual elements cannot be changed. However, defining a tuple does not require the keyword frozen.

U

undirected graph

A set of vertices and a set of edges (unordered pairs of vertices).

upsert

A change in the database that updates a specified column in a row if the column exists. If the column does not exist, then that column is inserted.

user-defined type (UDT)

A CQL data type that consists of a user-defined object of objects. An example of UDT data is {name:'john doe', street:'999 High St', city:'Oxford', country:'UK'} as a person’s address. Each UDT includes the column names and data types for all the columns included.

V

vector

An ordered list of numbers, frequently used in AI. Embeddings are a specific type of vector that encodes semantic meaning.

Also refers to an array of floating point type that represents a specific object or entity.

vector database

A database designed for storing vectors.

vector index

A data structure used to efficiently store and query high-dimensional vectors for similarity or distance-based retrievals.

A vector index compares the vectors stored in a database by the distance between them. The smaller the distance, the more similar the data; the greater the distance, the less similar the data.
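
The distance comparison can be illustrated with a brute-force Python sketch that scans every stored vector and returns the closest one by Euclidean distance. A real vector index exists precisely to avoid this exhaustive scan, so the sketch only shows what "closer means more similar" looks like; the function name and the dictionary input are assumptions for illustration.

```python
import math


def nearest(query, vectors):
    """Return (key, distance) for the stored vector closest to the query."""
    best_key, best_distance = None, math.inf
    for key, vector in vectors.items():
        distance = math.dist(query, vector)  # Euclidean distance (Python 3.8+)
        if distance < best_distance:
            best_key, best_distance = key, distance
    return best_key, best_distance
```

Swapping `math.dist` for another metric, such as one based on cosine similarity, changes which stored vector counts as most similar without changing the overall structure.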

vertex

A vertex is the fundamental unit from which graphs are formed. A vertex can also be described as an object that has incoming and outgoing edges.

vertex-centric index

A local index structure built per vertex in a graph.

vertex degree

The number of edges incident to a vertex in a graph.
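
For an undirected graph given as a list of edges, vertex degrees can be counted with a short Python sketch like the one below. The edge-list representation and the function name are assumptions for illustration; a self-loop contributes 2 to its vertex, following the usual convention.

```python
from collections import Counter


def vertex_degrees(edges):
    """Count the edges incident to each vertex in an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return dict(degrees)
```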

Vnode

Vnode is a virtual node. Normally, nodes are responsible for a single partitioning range in the full token range of a cluster. With vnodes enabled, each node is responsible for several virtual nodes, effectively spreading a partitioning range across more nodes in the cluster. Enabling vnodes can reduce the risk of hotspotting or straining one node in the cluster.
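
The effect of vnodes can be sketched in Python by giving each node several tokens on the ring instead of one. The sketch assumes random token assignment over the Murmur3 range and that a token is owned by the first vnode token at or after it, wrapping around the ring; the function names and the seeded random generator are illustrative choices, not the database's allocation algorithm.

```python
import bisect
import random

RING_MIN, RING_MAX = -(2 ** 63), 2 ** 63 - 1


def build_ring(nodes, num_tokens, seed=0):
    """Assign num_tokens random tokens to every node and return the sorted ring."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    ring = [(rng.randint(RING_MIN, RING_MAX), node)
            for node in nodes
            for _ in range(num_tokens)]
    return sorted(ring)


def owner(ring, token):
    """Return the node whose vnode token is the first at or after the given token."""
    tokens = [vnode_token for vnode_token, _ in ring]
    index = bisect.bisect_left(tokens, token) % len(ring)  # wrap past the last token
    return ring[index][1]
```

Because each node's tokens are scattered around the ring, every node ends up holding many small ranges rather than one large contiguous range.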

W

weak consistency

When reading data, the database performs read repair after returning results.

wide row

A data partition that CQL 3 transposes into familiar row-based resultsets.

X, Y, Z

zombie

A row or cell that reappears in a database table after deletion. This can happen if a node goes down for a long period of time and is then restored without being repaired.

Deleted data is not erased from database tables; it is marked with tombstones until compaction. The tombstones created on one node must be propagated to the nodes containing the deleted data. If one of these nodes goes down before this happens, the node may not receive the most up-to-date tombstones. If the node is not repaired before it comes back online, the database finds the non-tombstoned items and propagates them to other nodes as new data.

To avoid this problem, run nodetool repair on any restored node before rejoining it to its cluster.
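
The sequence of events can be modeled with a small Python sketch. It represents each replica as a dictionary of key to (timestamp, value) cells, uses a sentinel object for tombstones, and reconciles replicas by keeping the cell with the newest timestamp. All names here are hypothetical and the model ignores most details of real compaction and repair.

```python
TOMBSTONE = object()  # sentinel marking a deleted cell


def merge(replica_a, replica_b):
    """Reconcile two replicas, keeping the newest cell for each key."""
    merged = {}
    for key in replica_a.keys() | replica_b.keys():
        cells = [replica[key] for replica in (replica_a, replica_b) if key in replica]
        merged[key] = max(cells, key=lambda cell: cell[0])  # highest timestamp wins
    return merged


def purge_tombstones(replica):
    """Model compaction dropping tombstones once they are eligible for removal."""
    return {key: cell for key, cell in replica.items() if cell[1] is not TOMBSTONE}


def visible(replica):
    """Return the live key/value pairs a reader would see."""
    return {key: cell[1] for key, cell in replica.items() if cell[1] is not TOMBSTONE}
```

In this model, merging a replica that still holds the tombstone with a stale replica keeps the data deleted, whereas merging after the tombstone has been purged lets the stale value win and reappear as a zombie.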

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com