DataStax glossary
A
adjacency list
A collection of unordered lists used to represent a finite graph. Each list describes the set of neighbors of a vertex in the graph.
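As an illustration of the definition above, here is a minimal Python sketch of an adjacency list for an undirected graph. The function name `build_adjacency_list` and the sample edges are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def build_adjacency_list(edges):
    """Build an undirected adjacency list from (u, v) edge pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        # Each endpoint is recorded as a neighbor of the other.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


# Hypothetical graph: a triangle a-b-c plus one extra vertex d attached to c.
example_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
example_graph = build_adjacency_list(example_edges)
```

Each key is a vertex and each value is the unordered set of its neighbors, so `example_graph["d"]` contains only `"c"`.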
adjacent vertex
A vertex directly attached to another vertex by an edge in a graph.
agents
Vector agent: A system that can make decisions based on its inputs or environment. Intelligent agents, such as those employed in machine learning and artificial intelligence systems, often use vector databases to facilitate rapid and efficient search, comparison, and retrieval of high-dimensional data.
DataStax agents: Processes that must be installed on every managed node in a cluster and that are required for most OpsCenter functionality. See Lifecycle Manager (LCM).
anti-entropy
The synchronization of replica data on nodes to ensure that the data is fresh.
Approximate Nearest Neighbor (ANN)
A machine learning algorithm that locates the most similar vectors to a given item in a dataset.
The ANN search method is used to quickly find approximate nearest neighbors in large datasets, often with high-dimensional features, sacrificing some accuracy for speed.
authentication
Process of establishing the identity of a user, DSE tool, or application.
authorization
Process of establishing permissions to database resources through roles.
B
back pressure
Pausing or blocking the buffering of incoming requests after reaching the threshold until the internal processing of buffered requests catches up.
bloom filter
An off-heap structure associated with each SSTable that checks whether any data for the requested row might exist in the SSTable before doing any disk I/O. A bloom filter can report false positives but never false negatives.
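The following Python sketch illustrates the general bloom filter technique, not the database's own implementation. The class name, the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: membership answers are 'possibly' or 'definitely not'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True means it may have been.
        return all(self.bits[pos] for pos in self._positions(key))
```

A reader that gets `False` from `might_contain` can skip the corresponding file without any disk I/O, which is the purpose described in the entry above.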
bootstrap
The process by which new nodes join the cluster transparently, gathering the data they need from existing nodes.
C
cardinality
The number of unique values in a column. For example, a column of ID numbers unique for each employee would have high cardinality while a column of employee ZIP codes would have low cardinality because multiple employees can have the same ZIP code.
An index on a column with low cardinality can boost read performance because the index is significantly smaller than the column. An index for a high-cardinality column may reduce performance. If your application requires a search on a high-cardinality column, a materialized view is ideal.
CassandraDatacenter
A Kubernetes Custom Resource (CR) representing a DataStax Enterprise or an Apache Cassandra logical datacenter and rack configuration. An operator reconciles the declarative state within the CassandraDatacenter CR against nodes within the cluster. Note that a single MissionControlCluster may be related to multiple CassandraDatacenter custom resources. See MissionControlCluster.
cell
The smallest increment of stored data. Contains a value in a row-column intersection.
chunking
Chunking breaks text into chunks (subsets of tokens) that represent a piece of information. In techniques like RAG, documents undergo chunking, where embeddings are generated from these chunks, stored in a vector database, and retrieved as part of the prompting process.
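A minimal Python sketch of fixed-size chunking over a token list follows. The function name `chunk_tokens` and the optional `overlap` parameter are assumptions for illustration; real pipelines often split on sentence or semantic boundaries instead.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into chunks of at most chunk_size tokens.

    Consecutive chunks share `overlap` tokens (a hypothetical option).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks
```

For example, `chunk_tokens("a b c d e f g".split(), 3, overlap=1)` yields three chunks in which each chunk starts with the last token of the previous one.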
cluster
Two or more database instances that exchange messages using the gossip protocol.
clustering
The storage engine process that creates an index and keeps data in order based on the index.
clustering column
In the table definition, a clustering column is a column that is part of the compound primary key definition. Note that the clustering column cannot be the first column because that position is reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
coalescing strategy
Strategy to combine multiple network messages into a single packet for outbound TCP connections to nodes in the same data center (intra-DC) or to nodes in a different data center (inter-DC). A coalescing strategy is provided with a blocking queue of pending messages and an output collection for messages to send.
collection
Vector collection: A dataset that contains various forms of data, such as vector embeddings. The terms dataset and collection are often used interchangeably.
CQL collection: A data type that groups objects and represents them as a single unit. The CQL collection types are set, list, and map.
column
The smallest increment of data. Contains a name, a value, and a timestamp.
column family
A container for rows, similar to the table in a relational system. Called a table in CQL 3.
commit log
A file to which the database appends changed data for recovery in the event of a hardware failure.
compaction
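The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index. The compaction strategies described in this glossary are SizeTieredCompactionStrategy (STCS), LeveledCompactionStrategy (LCS), TimeWindowCompactionStrategy (TWCS), and the deprecated DateTieredCompactionStrategy (DTCS).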
composite partition key
A partition key consisting of multiple columns.
compound primary key
A primary key consisting of the partition key, which determines the node on which data is stored, and one or more additional columns that determine clustering.
consistency level
A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
Container
Self-contained applications and all the dependencies needed to run them. For example, a DataStax Enterprise container includes DSE, management API, and Java Runtime Environment. Additionally, each container is cryptographically verified at runtime to ensure there have been no modifications after being packaged by DataStax. All DataStax Mission Control components are packaged as containers and are scheduled as Pods. See Pod.
Control Plane
The management layer of a DataStax Mission Control installation. After administrators define MissionControlCluster custom resources through the User Interface (UI) or Command Line Interface (CLI), the resources are distributed across the appropriate Data Planes. This layer contains the Mission Control UI and API services, a number of operators that reconcile cluster-level resources, and observability components.
converge
The process of aligning the real-world state of the node, datacenter, or cluster with the desired state by successfully executing a Lifecycle Manager job.
coordinator node
The node that determines which nodes in the ring should get the request, based on the snitch configured for the cluster.
cosine similarity
A metric measuring the similarity between two non-zero vectors in a multi-dimensional space. It is the cosine of the angle between the vectors, so it depends on their relative orientation and not on their magnitudes. One (1) indicates that the vectors point in the same direction (complete similarity). Zero (0) indicates orthogonal vectors (complete dissimilarity). Negative one (-1) indicates exactly opposite orientation.
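The metric can be sketched in a few lines of Python. The function name `cosine_similarity` is chosen for this example and is not tied to any particular library.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Perpendicular vectors such as `[1, 0]` and `[0, 1]` give 0, parallel vectors give a value of 1 (up to floating-point rounding), and opposite vectors give -1.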
CQL shell
The Cassandra Query Language shell (cqlsh) utility.
cross-data center forwarding
A technique for optimizing replication across datacenters by sending data from one datacenter to a single node in another datacenter. The receiving node then forwards the data to the other nodes in its datacenter.
D
datacenter
A group of related nodes that are configured together within a cluster for replication and workload segregation purposes. Not necessarily a separate location or physical data center. Datacenter names are case sensitive and cannot be changed.
Data Plane
An infrastructure layer where DSE and Apache Cassandra workloads are deployed. These resources operate at the datacenter or region level of globally deployed DataStax Enterprise clusters. Services within the Data Plane include operators for reconciliation of local resources, observability ingestion and routing components, and DSE or Cassandra nodes.
dataset
A collection of data points or records used for analysis. Datasets and collections are often used interchangeably.
data type
A particular kind of data item, defined by the values it can take or the operations that can be performed on it.
DateTieredCompactionStrategy (DTCS)
DateTieredCompactionStrategy (DTCS) is deprecated starting in Apache Cassandra 3.8 and in DataStax Enterprise. This strategy is particularly useful for time series data and stores data written within a certain period of time in the same SSTable. For example, Apache Cassandra can store your last hour of data in one SSTable time window while the next 4 hours of data are stored in another time window, etc. The most common queries for time series workloads retrieve the last hour/day/month of data.
deep query
Graph queries that traverse a dense graph (a large number of connected vertices) or a graph with a high branching factor.
denormalization
Denormalization refers to the process of optimizing the read performance of a database by adding redundant data or by grouping data. This process is accomplished by duplicating data in multiple tables or by grouping data for queries.
directed graph
A set of vertices and a set of arcs (ordered pairs of vertices). In DSE Graph, edges are directional and the term "arcs" is not used.
E
EBNF
EBNF (Extended Backus-Naur Form) syntax expresses a context-free grammar that formally describes a language. EBNF extends its precursor BNF (Backus-Naur Form) with additional operators allowed in expansions. Syntax (railroad) diagrams graphically depict EBNF grammars.
edge
A connection between graph vertices. Edges can be unordered (no directional orientation) or ordered (directional). An edge can also be described as an object that has a vertex at its tail and head.
element
A graph element is a vertex, edge, or property.
embedding
Turning data, like words or images, into vectors to capture their meaning.
Also describes a mathematical technique in machine learning where complex, high-dimensional data is represented as points in a lower-dimensional space. The process of creating an embedding preserves the relevant properties of the original data, such as distance and similarity, enabling easier computational processing. For instance, words with similar meanings in Natural Language Processing (NLP) can be set close to each other in the reduced space, facilitating their use in machine learning models.
Euclidean distance
A non-negative distance metric from coordinate geometry: the straight-line distance between two points. It quantifies the similarity or dissimilarity between data points represented as vectors. Use it to compare generated samples to real data points.
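A short Python sketch of the calculation follows; the function name `euclidean_distance` is chosen for this example.

```python
import math


def euclidean_distance(a, b):
    """Return the straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For the points `(0, 0)` and `(3, 4)` the result is 5.0, the hypotenuse of a 3-4-5 right triangle. A distance of 0 means the two points are identical.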
eventual consistency
The database maximizes availability and partition tolerance. The database ensures eventual data consistency by updating all replicas during read operations and periodically checking and updating any replicas not directly accessed. The updating and checking ensures that any query always returns the most recent version of the result set and that all replicas of any given row eventually become completely consistent with each other.
F
faceted search
Faceted search is the dynamic clustering of items or search results into categories. Faceted search uses any value in any field to drill into search results or even skip searching entirely.
FLARE pattern
An advanced retrieval technique that combines retrieval and generation in LLMs. It enhances the accuracy of responses by iteratively predicting the upcoming sentence to anticipate future content when the model encounters a token it is uncertain about.
frozen
A CQL collection or user-defined type (UDT) that treats the value as an immutable blob. To update or delete an item, redefine the entire value of the item because individual elements of the item cannot be updated or deleted.
An example is a frozen list: ['bicycle', 'treadmill', 'unicycle', 'bicycle'].
If you want to update or append a value to this frozen list, the entire list must be rewritten:
UPDATE items SET list_of_items = ['bicycle', 'treadmill', 'unicycle', 'bicycle', 'rower']
WHERE id = 1;
G
garbage collector
A Java background process that frees heap memory when it is no longer in use by the program. The main Java algorithms to allocate and clean up memory are Concurrent Mark Sweep (CMS) and Garbage-First (G1). DataStax Enterprise 5.1 and higher use the G1 garbage collector by default.
global index
An index structure over an entire graph.
gossip
A peer-to-peer communication protocol for exchanging location and state information between nodes.
graph
A collection of vertices and edges.
graph degree
The largest vertex degree of a graph.
graph index
A data structure that allows for the fast retrieval of elements by a particular key-value pair.
graph partitioning
A process that consists of dividing a graph into components of about the same size with few connections between the components.
graph traversal
An algorithmic walk across the elements of a graph according to the referential structure explicit within the graph data structure.
H
HDD
A hard disk drive (HDD) or spinning disk is a data storage device used for storing and retrieving digital information using one or more rigid rapidly rotating disks. Compare to SSD.
HDFS
Hadoop Distributed File System (HDFS) stores data on nodes to improve performance. HDFS is a necessary component in addition to MapReduce in a Hadoop distribution.
headroom
The amount of disk space required by a process (such as compaction) in addition to the space occupied by the data being processed.
I
idempotent
An operation that can occur multiple times without changing the result, such as performing the same update multiple times without affecting the outcome.
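The distinction can be sketched in Python with two hypothetical operations on a row represented as a dictionary: setting a column to a fixed value is idempotent, while incrementing a counter is not.

```python
def set_status(row, status):
    """Idempotent: repeating the call leaves the row in the same state."""
    updated = dict(row)
    updated["status"] = status
    return updated


def increment_visits(row):
    """Not idempotent: every call changes the result."""
    updated = dict(row)
    updated["visits"] = updated.get("visits", 0) + 1
    return updated
```

Applying `set_status` twice with the same value gives the same row as applying it once, which is why such an operation can be retried safely after a timeout.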
immutable
Data on a disk that cannot be overwritten.
incident edge
An edge is incident to a vertex when that vertex is an endpoint of the edge in a graph.
index
A native capability for finding a column in the database that does not involve using the primary key.
integration
Integrations connect a third-party tool of your choice to Astra DB for data management, machine learning, analytics, and more.
J
Jaccard similarity
A measure of similarity between two sets of features or elements in generated data and real data. The mathematical calculation is the size of the intersection of two sets divided by the size of their union, and ranges from zero (0) to one (1). One (1) indicates identical sets.
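The calculation described above can be sketched in Python as follows. The function name is chosen for this example, and returning 1.0 for two empty sets is an assumed convention, since the ratio is otherwise undefined.

```python
def jaccard_similarity(a, b):
    """Return |A intersect B| / |A union B| for two collections of elements."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)
```

For `{1, 2, 3}` and `{2, 3, 4}` the intersection has 2 elements and the union has 4, so the similarity is 0.5.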
Job Tracker
Used for analytics nodes that analyze data using Hadoop. Data is analyzed using both DSE Hadoop and external Hadoop systems. Within a data center, the Job Tracker monitors the execution and status of distributed tasks that comprise a MapReduce job.
K
Kafka Struct
An Apache Kafka-structured record containing a set of named fields with values. Each field in a Kafka Struct uses an independent Schema. DataStax Apache Kafka Connector supports both generic structs and advanced struct types backed by the schema registry, such as the Avro format.
K8ssandraCluster
A Kubernetes Custom Resource representing a DataStax Enterprise or an Apache Cassandra cluster. This custom resource is automatically generated after creation of a MissionControlCluster resource.
keyspace
A namespace container that defines how data is replicated on nodes in each datacenter.
keytab
A file containing pairs of Kerberos principals and encrypted keys. Allows authentication with a Kerberos-enabled DataStax database without entering a password.
k-Nearest Neighbors (kNN)
A supervised machine learning algorithm that classifies an item based on the majority class of its 'k' most similar items in the dataset.
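A brute-force Python sketch of the classification rule follows. The function name `knn_classify`, the use of Euclidean distance as the similarity measure, and the training-data layout of `(point, label)` pairs are assumptions for this example.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `training` is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
training_points = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"),
]
```

This exact search compares the query with every training point. Approximate Nearest Neighbor (ANN) methods trade some accuracy to avoid that full scan.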
KOTS
An open-source application providing an Admin Console and Command Line Interface (CLI) to handle installing, managing, and updating software packages such as DataStax Mission Control.
Kurl
Provides a kubeadm-based Kubernetes distribution with add-ons for common cloud-native components. DataStax Mission Control utilizes Kurl as its embedded Kubernetes runtime.
krb5.conf
File that contains the Kerberos configuration used by clients for connection and ticket generation.
See MIT Kerberos krb5.conf documentation.
The default location is /etc.
If krb5.conf is in another location, override the default location by setting the environment variable KRB5_CONFIG.
To use multiple configuration files, set a colon-separated filename list in KRB5_CONFIG; all files are read.
kubectl
A Kubernetes-specific command line interface (CLI) that allows communication and control of K8s clusters through their API server.
L
Large Language Models (LLMs)
Models that can generate long passages of text.
Lifecycle Manager (LCM)
Lifecycle Manager (LCM) is a provisioning and configuration management system for easily managing DataStax Enterprise (DSE) clusters. This management system is a web interface enabling efficient installation and configuration of DataStax Enterprise nodes. The LCM completely defines the cluster configuration including datacenter and node topology and integrates deeply with the full spectrum of DataStax Enterprise settings.
LeveledCompactionStrategy (LCS)
This compaction strategy creates SSTables of a fixed, relatively small size that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2, and so on) is ten times as large as the previous level. Disk I/O is more uniform and predictable on higher levels than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables in the next level. This process improves performance for reads because the database can determine which SSTables in each level to check for the existence of row key data.
linearizable consistency
Also called serializable consistency, linearizable consistency is the restriction that one operation cannot be executed unless and until another operation has completed.
The database supports Lightweight transactions to ensure linearizable consistency in writes. The first phase of a Lightweight transaction works at SERIAL consistency and follows the Paxos protocol to ensure that the required operation succeeds. If this phase succeeds, the write is performed at the consistency level specified for the operation. Reads performed at the SERIAL consistency level execute without database built-in read repair operations.
list
A CQL collection that consists of an ordered list of values.
An example of list data is ['bicycle', 'treadmill', 'unicycle', 'bicycle'].
Because the list is ordered, a value may appear more than once in the list.
M
Machine Learning (ML)
A branch of artificial intelligence (AI) and computer science that uses and develops computer systems capable of learning and adapting without explicit instruction. ML uses algorithms and statistical models to analyze data and identify patterns, make decisions, and improve its system.
map
A CQL collection that consists of key-value pairs with unique keys, kept in order by key.
An example of map data is {'john':'bicycle', 'jane':'treadmill', 'devon':'unicycle'}.
Each item in the map can be indexed by key, value, or full entry.
MapReduce
Hadoop's parallel processing engine that quickly processes large data sets. A necessary component in addition to HDFS in a Hadoop distribution.
materialized view
A materialized view is a table with data that is automatically inserted and updated from another base table. It has a primary key that differs from the base table, allowing the implementation of different queries.
memtable
A database table-specific, in-memory data structure that resembles a write-back cache.
meta-property
A property that describes some attribute of another property in a graph.
MissionControlCluster
A Kubernetes Custom Resource representing a DataStax Enterprise or an Apache Cassandra cluster. There is a one-to-one relationship between this resource and a cluster deployed. This top-level resource includes the cluster type (dse versus cassandra), the version, the desired topology (datacenter and rack definitions), and configuration overrides.
milliCPU
An absolute, metric specification for a CPU resource, used to measure CPU usage. 1,000 milliCPU (or 1000m CPU) equals 1 CPU unit. The suffix `m` means milli. Fractional values such as 0.1 CPU (100m) are allowed, but precision finer than 1m is not. The quantity is always absolute, never relative: 0.1 CPU specifies the same amount of CPU on a single-core, dual-core, or 48-core machine.
In Kubernetes, one CPU is equivalent to:
- 1 AWS vCPU
- 1 GCP Core
- 1 Azure vCore
- 1 Hyperthread on a bare-metal Intel processor with Hyperthreading
Synonym: millicore
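The unit arithmetic above can be sketched in Python. The helper name `cpu_to_millicpu` is hypothetical; it accepts either a plain CPU count such as "0.1" or a quantity with the `m` suffix such as "500m", and rejects anything finer than 1m.

```python
from decimal import Decimal


def cpu_to_millicpu(quantity):
    """Convert a CPU quantity string ("0.5", "2", "250m") to whole milliCPU."""
    text = str(quantity).strip()
    if text.endswith("m"):
        # Already in milliCPU; int() rejects fractional values such as "1.5m".
        return int(text[:-1])
    millis = Decimal(text) * 1000
    if millis != millis.to_integral_value():
        raise ValueError("precision finer than 1m is not allowed")
    return int(millis)
```

With this helper, "0.1" and "100m" both map to 100 milliCPU, and "2" maps to 2000 milliCPU.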
mixed workload
When a single cluster can run transactional, search, and analytics nodes.
mutation
A mutation is either an insertion or a deletion.
N
namespace
A container to hold collections of documents.
Natural Language Processing (NLP)
A field of artificial intelligence that helps computers interpret and generate human language so that applications can respond usefully to the user.
node
A Java virtual machine (a platform-independent execution environment that converts Java bytecode into machine language and executes it) that runs an instance of the Licensed Software.
node repair
A process that makes all data on a replica consistent.
non-frozen
A CQL collection or user-defined type (UDT) that treats the value as mutable.
To update or delete an item, redefine individual values of the item.
An example is a non-frozen list: ['bicycle', 'treadmill', 'unicycle', 'bicycle'].
To update or append a value to a non-frozen list, append the value:
UPDATE items SET list_of_items = list_of_items + ['rower']
WHERE id = 1;
normalization
Vector normalization: The process of adjusting data values to a common scale to ensure that different features have equal importance in machine learning algorithms.
DSE normalization: A series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database’s schema. In DataStax Enterprise, this process is inefficient because joining data in multiple tables for queries requires accessing more nodes.
O
OLAP
Online Analytical Processing (OLAP) performs a multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. Compare to OLTP.
OLTP
Online transaction processing (OLTP) is characterized by a large number of short on-line transactions for data entry and retrieval. Compare to OLAP.
Operator
A software controller that manages the reconciliation of custom resources like MissionControlCluster and CassandraDatacenter against running processes on worker hosts. Its duties include the creation of new resources and logic to recover processes that have gone down.
order
The magnitude of the number of edges to the number of vertices.
P
partition
A partition is a collection of data addressable by a key. This data resides on one node in a Cassandra cluster. A partition is replicated on as many nodes as the replication factor specifies.
partition index
A list of primary keys and the start position of data.
partition key
A partition key represents a logical entity that helps a Cassandra cluster know on which node some requested data resides.
The partition key is the first column declared in the primary key definition. In a compound primary key, additional columns are declared after the partition key, and together they form the primary key.
partition range
The limits of the partition, which differ depending on the configured partitioner. The Murmur3Partitioner (default) range is -2^63 to +2^63-1, and the RandomPartitioner range is 0 to 2^127-1.
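To make the sizes of these ranges concrete, the Python sketch below assumes the Murmur3Partitioner range spans the signed 64-bit integers (-2^63 to 2^63-1) and the RandomPartitioner range spans 0 to 2^127-1. The constant names and the helper `even_initial_tokens`, which spaces one token per node evenly across a range, are hypothetical and for illustration only.

```python
# Assumed partitioner ranges (inclusive bounds).
MURMUR3_MIN = -2**63
MURMUR3_MAX = 2**63 - 1
RANDOM_MIN = 0
RANDOM_MAX = 2**127 - 1


def even_initial_tokens(node_count, low=MURMUR3_MIN, high=MURMUR3_MAX):
    """Return node_count tokens spaced evenly across [low, high]."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    span = high - low + 1  # number of distinct token values in the range
    return [low + (i * span) // node_count for i in range(node_count)]
```

For four nodes on the assumed Murmur3 range, the tokens fall at -2^63, -2^62, 0, and 2^62, so each node owns one quarter of the range.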
partition summary
A subset of the partition index. By default, 1 partition key out of every 128 is sampled.
partitioned vertex
A portion of a vertex’s data resulting from dividing the vertex into smaller components for graph database storage. A partitioned vertex is used for vertices that have a large number of edges.
Partitioner
Distributes data across a cluster. The types of partitioners are Murmur3Partitioner (default), RandomPartitioner, and OrderPreservingPartitioner.
PersistentVolume (PV)
A Kubernetes resource representing a piece of storage available for use within the cluster. It can represent persistent physical drives or ephemeral block volumes to be provisioned on demand. See StorageClass and PersistentVolumeClaim (PVC).
PersistentVolumeClaim (PVC)
A Kubernetes resource that maps storage (a PersistentVolume (PV)) to compute (Pods), with the associated configuration provided by a StorageClass. DataStax Mission Control names PVCs following a pattern (e.g., name: premium-rwo-retain), and may pre-allocate them prior to provisioning a cluster.
primary key
One or more columns that uniquely identify a row in a table. The primary key consists of the partition key and any clustering columns.
prompt engineering
Crafting the right questions to get desired answers from AI.
property
A key-value pair that describes some attribute of either a vertex or an edge. Property key is used to describe the key in the key-value pair. All properties are global in DSE Graph, meaning that a property can be used for any vertices. For example, "name" can be used for all vertices in a graph.
R
range movement
A change in the expanse of tokens assigned to a node.
RDD
A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. An RDD is an immutable distributed collection of objects. RDD actions return values, and transformations return pointers to new RDDs.
read repair
A process that updates database replicas with the most recent version of frequently-read data.
reflexion
The ability of an AI agent to iteratively inspect its own code, evaluate its performance, and correct mistakes.
replica placement strategy
A specification that determines the replicas for each row of data.
replication factor (RF)
The total number of replicas across the cluster, abbreviated as RF. A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 indicates two copies of each row and that each copy is on a different node. All replicas are equally important; there is no primary or master replica.
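The replication factor also determines how many replicas must respond at a consistency level such as QUORUM. The Python sketch below assumes a single datacenter and the usual majority rule of floor(RF / 2) + 1; the function names are hypothetical.

```python
def quorum(replication_factor):
    """Return the number of replicas that form a majority (assumed rule)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Return how many replicas can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)
```

With a replication factor of 3, a quorum is 2 replicas, so one replica can be unavailable. With a replication factor of 1, no replica can be unavailable.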
Retrieval-Augmented Generation (RAG)
A method that retrieves relevant documents and then generates a response. See this Retrieval-Augmented Generation (RAG) Explained: Understanding Key Concepts article.
role
A set of permissions assigned to users that limits their access to database resources. When using internal authentication, roles can also have passwords and represent a single user, DSE client tool, or application.
rolling restart
A procedure performed when upgrading nodes in a cluster to achieve zero downtime. Nodes are upgraded and restarted one at a time while other nodes continue to operate online.
row
1) Columns that have the same primary key.
2) A collection of cells per combination of columns in the storage engine.
row cache
A database component for improving the performance of read-intensive operations. In off-heap memory, the row cache holds the most recently read rows from the local SSTables. Each local read operation stores its result set in the row cache and sends it to the coordinator node. The next read first checks the row cache. If the required data is there, the database returns it immediately. This initial read can save further seeks in the Bloom filter, partition key cache, partition summary, partition index, and SSTables.
The database uses LRU (least-recently-used) eviction, which removes the rows that have gone unread the longest so that the row cache keeps the most recently accessed rows. The size of the row cache can be configured in the cassandra.yaml file.
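The eviction policy can be sketched in Python with an ordered dictionary. This is an illustration of LRU eviction in general, not the database's row cache; the class name `LRUCache` and its `capacity` parameter are assumptions for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache keyed by row key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._rows = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._rows:
            return None  # cache miss: the caller falls back to slower storage
        self._rows.move_to_end(key)  # mark as most recently used
        return self._rows[key]

    def put(self, key, row):
        self._rows[key] = row
        self._rows.move_to_end(key)
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # evict the least recently used row
```

When the cache is full, inserting a new row evicts whichever row has gone longest without being read or written.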
S
scan query
A graph query that traverses an entire graph or large sections of the graph.
scheme
1) Authentication: Defines a service used either for authentication or for a role assignment such as Kerberos or LDAP, or both.
2) Database: Describes all database resources.
SearchAnalytics
Nodes started as stand-alone processes or services in SearchAnalytics mode allow you to create analytics queries that use search indexes.
Secret
A Kubernetes (K8s) object that stores sensitive information such as passwords, API keys, and Secure Shell Protocol (SSH) keys and that enables Pods to use that information without the data being exposed. Sensitive data is exposed to containers either as a file in a volume mount or through environment variables. Examples within DataStax Mission Control are TLS certificates (including Java Key Stores), Java Key Store passwords, and cluster super user credentials.
seed
A seed, or seed node, is used to bootstrap the gossip process for new nodes joining a cluster. A seed node provides no other function and is not a single point of failure for a cluster.
segment
A segment is a small token range of a table used by NodeSync to validate and repair data across replicas. NodeSync uses the size of the entire table to determine how many segments (depth) to divide the table into so that segments are ~200MB.
set
A CQL collection that consists of an unordered list of unique values.
An example of set data is {'bicycle', 'treadmill', 'unicycle'}.
similarity metric/function
A function that quantifies how similar two objects or datasets are, commonly used in machine learning and data analysis.
SizeTieredCompactionStrategy (STCS)
The default compaction strategy. This strategy triggers a minor compaction when there are a number of similar sized SSTables on disk as configured by the table subproperty, min_threshold. A minor compaction does not involve all the tables in a keyspace. Also see STCS compaction subproperties in the relevant CQL documentation.
slice
A set of clustered columns in a partition that you query as a set using, for example, a conditional WHERE clause.
Snitch
The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and datacenters. The request routing mechanism is affected by which of the several types of snitches is used.
SSD
A solid-state drive (SSD) is a solid-state storage device that uses integrated circuits to persistently store data. Compare to HDD.
SSTable
A sorted string table (SSTable) is an immutable data file to which the database writes memtables periodically. SSTables are stored on disk sequentially and maintained for each database table.
static column
A special column that is shared by all rows of a partition.
StorageClass
A Kubernetes resource providing a configuration template to a backend storage provisioner. This may include rules around reclaiming disks after they are no longer bound to a particular workload or may identify when disks should be scheduled in relation to workloads.
View StorageClass details from the command line with kubectl get storageclass.
streaming
A component that handles the exchange of data (parts of SSTable files) among nodes in a cluster.
Examples include:
- When bootstrapping a new node, the new node gets data from existing nodes using streaming.
- When running nodetool repair, nodes exchange out-of-sync data using streaming.
- When bulk loading data from backup, sstableloader uses streaming to complete the task.
strong consistency
As a database reads data it performs a read repair before returning results.
superuser
Superuser is a role attribute that provides root database access.
Superusers have all permissions on all objects.
DataStax Enterprise databases include the superuser role cassandra with password cassandra by default.
This account runs queries, including logins, with a consistency level of QUORUM.
DataStax recommends creating a superuser for deployments and removing the cassandra role.
T
table
A collection of columns ordered by name and fetched by row. A row consists of columns and has a primary key; the first part of the key is a column name. Subsequent parts of a compound key are other column names that define the order of columns in the table.
Task Tracker
One Task Tracker service per node handles the Hadoop MapReduce tasks that are scheduled for a Hadoop-enabled node.
TimeWindowCompactionStrategy (TWCS)
This compaction strategy compacts SSTables based on a series of time windows. During the current time window, the SSTables are compacted into one or more SSTables. At the end of the current time window, all SSTables are compacted into a single larger SSTable. The compaction process repeats at the start of the next time window. Each TWCS time window contains data within a specified range and contains varying amounts of data.
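The idea of grouping data by time window can be sketched in Python. This only illustrates how write timestamps map to windows; it is not the compaction algorithm itself, and the function name and the window size used in the example are assumptions.

```python
from collections import defaultdict


def group_by_time_window(timestamps, window_seconds):
    """Group integer timestamps by the start of the window they fall into."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    windows = defaultdict(list)
    for ts in timestamps:
        window_start = ts - ts % window_seconds
        windows[window_start].append(ts)
    return dict(windows)
```

With a one-hour window (3600 seconds), writes at seconds 0, 10, and 3599 land in the same window, while a write at second 3600 starts the next one. Windows can hold different amounts of data, as the entry above notes.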
token
An element on the ring that depends on the partitioner. Determines the node’s position on the ring and the portion of data for which it is responsible. The range for the Murmur3Partitioner (default) is -2^63 to +2^63-1. The range for the RandomPartitioner is 0 to 2^127-1.
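As a rough sketch of how a token maps to a node, the Python below looks up the owner of a token on a small ring. It assumes that each node owns the tokens after the previous node's token up to and including its own, with wrap-around at the end of the ring. The node names and the small token values are hypothetical, and no real partitioner hash is computed.

```python
import bisect

# Token ranges of the two partitioners, written as exact integers.
MURMUR3_MIN, MURMUR3_MAX = -2**63, 2**63 - 1
RANDOM_MIN, RANDOM_MAX = 0, 2**127 - 1


def owner(token: int, ring: list) -> str:
    """Return the node that owns `token`.

    `ring` is a list of (token, node) pairs sorted by token. A token falls to
    the first node whose token is >= the given token; past the last node the
    lookup wraps around to the first node.
    """
    tokens = [node_token for node_token, _ in ring]
    index = bisect.bisect_left(tokens, token)
    return ring[index % len(ring)][1]


# Hypothetical three-node ring with small tokens for readability.
ring = [(-6000, "node-a"), (0, "node-b"), (6000, "node-c")]
```

For example, `owner(1, ring)` falls to node-c, while `owner(6001, ring)` wraps around to node-a.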
tokenizer
A tool or process that breaks down input data, such as text, into smaller units that are semantically relevant for processing in a model, often called tokens.
tombstone
A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.
transformers
A type of deep learning architecture used for processing sequences of data.
traversal source
A domain-specific language (DSL) that specifies the traversal methods used by a graph traversal.
TTL
Time-to-live (TTL) is an optional expiration period for values that are inserted into a column. Also see Expiring columns in the relevant CQL documentation.
tunable consistency
The database ensures that all replicas of any given row eventually become completely consistent. For situations requiring immediate and complete consistency, the database can be tuned to provide 100% consistency for specified operations, datacenters, or clusters. The database cannot be tuned to complete consistency for all data and operations.
tuple
A fixed-length set of one or more positional fields, each with a defined data type.
An example of tuple data is ('Italy', 'Fabio Fabio', 163), where the three values have the data types text, text, and integer, respectively.
Tuples are always frozen, meaning the whole value must be updated or deleted, and individual elements cannot be changed.
However, defining a tuple does not require the keyword frozen.
U
undirected graph
A set of vertices and a set of edges (unordered pairs of vertices).
upsert
A change in the database that updates a specified column in a row if the column exists. If the column does not exist, then that column is inserted.
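The update-or-insert behavior can be illustrated with a plain Python dictionary standing in for a table. This is only a sketch of the semantics; the `upsert` helper and the sample row are hypothetical and do not reflect how the database stores data.

```python
def upsert(table: dict, key, column: str, value) -> str:
    """Set `column` for the row at `key`, creating the row or column if missing.

    Returns "updated" if the column already existed, otherwise "inserted".
    """
    row = table.setdefault(key, {})
    existed = column in row
    row[column] = value
    return "updated" if existed else "inserted"


# Hypothetical table keyed by employee ID.
employees = {}
first = upsert(employees, 1, "name", "Ann")    # column did not exist yet
second = upsert(employees, 1, "name", "Anne")  # same call now overwrites it
```

The caller issues the same operation in both cases and never needs to check whether the column already exists.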
user-defined type (UDT)
A CQL data type that consists of a user-defined object of objects.
An example of UDT data is {name:'john doe', street:'999 High St', city:'Oxford', country:'UK'} as a person’s address.
Each UDT defines the names and data types of all the fields it includes.
V
vector
An ordered list of numbers, frequently used in AI. Embeddings are a specific type of vector that encodes semantic meaning.
Also refers to an array of floating-point values that represents a specific object or entity.
vector database
A database designed for storing vectors.
vector index
A data structure used to efficiently store and query high-dimensional vectors for similarity or distance-based retrievals.
vector search
A search that compares vectors in a database by the distance between them. The smaller the distance, the more similar the data. The greater the distance, the less similar the data.
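A minimal Python sketch of the distance idea follows. It ranks stored vectors by Euclidean distance to a query vector using an exact, brute-force comparison, which is not how a vector index or an ANN search works internally. The `nearest` helper and the two-dimensional sample vectors are made up for illustration.

```python
import math


def nearest(query: list, vectors: dict, k: int = 1) -> list:
    """Return the names of the k stored vectors closest to `query`."""
    ranked = sorted(vectors.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical two-dimensional embeddings.
vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

closest = nearest([1.0, 0.0], vectors, k=2)
```

Here "cat" and "dog" are closer to the query than "car", so they are treated as the more similar items.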
vertex
A vertex is the fundamental unit from which graphs are formed. A vertex can also be described as an object that has incoming and outgoing edges.
vertex-centric index
A local index structure built per vertex in a graph.
vertex degree
The number of edges incident to a vertex in a graph.
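For an undirected graph stored as an adjacency list, the degree of a vertex is the length of its neighbor list. The sketch below assumes a simple graph with no self-loops or parallel edges; the vertex names are hypothetical.

```python
# Hypothetical undirected graph as an adjacency list: vertex -> list of neighbors.
adjacency = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
    "d": [],
}


def degree(graph: dict, vertex: str) -> int:
    """Return the number of edges incident to `vertex` in a simple undirected graph."""
    return len(graph[vertex])


degrees = {vertex: degree(adjacency, vertex) for vertex in adjacency}
```

Because every edge is counted once at each of its two endpoints, the degrees sum to twice the number of edges.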
Vnode
Vnode is a virtual node. Normally, nodes are responsible for a single partitioning range in the full token range of a cluster. With vnodes enabled, each node is responsible for several virtual nodes, effectively spreading a partitioning range across more nodes in the cluster. Enabling vnodes can reduce the risk of hotspotting or straining one node in the cluster.
X, Y, Z
zombie
A row or cell that reappears in a database table after deletion. This can happen if a node goes down for a long period of time and is then restored without being repaired.
Deleted data is not erased from database tables; it is marked with tombstones until compaction. The tombstones created on one node must be propagated to the nodes containing the deleted data. If one of these nodes goes down before this happens, the node may not receive the most up-to-date tombstones. If the node is not repaired before it comes back online, the database finds the non-tombstoned items and propagates them to other nodes as new data.
To avoid this problem, run nodetool repair on any restored node before rejoining it to its cluster.
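The sequence above can be modeled with a toy two-replica simulation in Python. This is a simplified sketch, not the database's replication or compaction logic: replicas are plain dictionaries, writes are resolved by newest timestamp, and `repair` and `purge_tombstones` are hypothetical stand-ins for nodetool repair and for compaction removing tombstones.

```python
TOMBSTONE = object()  # marker value that stands for a deletion


def write(replica: dict, key: str, value, ts: int) -> None:
    """Apply a write only if it is newer than what the replica holds."""
    current = replica.get(key)
    if current is None or ts > current[1]:
        replica[key] = (value, ts)


def purge_tombstones(replica: dict) -> None:
    """Stand-in for compaction: drop cells that are marked as deleted."""
    for key in [k for k, (v, _) in replica.items() if v is TOMBSTONE]:
        del replica[key]


def repair(source: dict, target: dict) -> None:
    """Stand-in for repair: copy each cell from source to target, newest wins."""
    for key, (value, ts) in source.items():
        write(target, key, value, ts)


def read(replica: dict, key: str):
    """Return the live value for key, or None if it is absent or deleted."""
    cell = replica.get(key)
    return None if cell is None or cell[0] is TOMBSTONE else cell[0]


def simulate(repair_before_rejoin: bool):
    healthy, stale = {}, {}
    for replica in (healthy, stale):
        write(replica, "k", "v1", ts=1)
    # The stale node is down, so the delete reaches only the healthy node.
    write(healthy, "k", TOMBSTONE, ts=2)
    if repair_before_rejoin:
        repair(healthy, stale)  # the restored node receives the tombstone
    purge_tombstones(healthy)
    purge_tombstones(stale)
    # The restored node rejoins and its data is propagated to the other node.
    repair(stale, healthy)
    return read(healthy, "k")
```

In this model, `simulate(False)` brings the deleted value back as a zombie, while `simulate(True)` leaves the row deleted.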