Partitions and keys
Astra DB, HCD, DSE, and Cassandra are distributed NoSQL databases that store data across multiple clustered nodes. To use these Cassandra-based databases effectively, you must understand how they distribute and group data. This section explains partitions and keys, which play a crucial role in managing data in Astra DB, HCD, DSE, and Cassandra.
Partitions
All Cassandra-based databases store data in rows and columns. Rows are grouped into partitions, and partitions are distributed across nodes in the database cluster. In a multi-node database cluster with a replication factor less than the number of nodes, each node manages a subset of partitions. All rows in a single partition are stored together on the same node, and on each node that holds a replica of that partition.
Astra DB ensures fault tolerance and high availability by automatically replicating partition data across multiple cloud availability zones. HCD, DSE, and Cassandra users can configure replication manually during keyspace creation.
Keys
Cassandra-based databases use keys to manage data efficiently:
- Primary keys uniquely identify rows.
- Partition keys group rows and distribute partitions across a cluster.
- Clustering keys (or clustering columns) order rows within a partition.
Primary keys
In Cassandra-based databases, a primary key uniquely identifies a row. It consists of the partition key and optional clustering columns.
If a table has no clustering columns, the partition key is the primary key. Most tables should use a composite primary key, consisting of a partition key and clustering columns, to allow multiple rows to be stored within a single partition.
Partition keys
A partition key consists of the first column or set of columns in a primary key. Multi-column partition keys are called composite partition keys. All rows with the same partition key are stored in the same partition. Rows in the same partition are organized for efficient retrieval.
The database uses the partition key as input to a hashing function to determine which node in the cluster stores the partition. A query that specifies the partition key uses the same hashing function to locate the partition and retrieve the data, so the database does not have to scan all nodes in the cluster to find it. Cassandra-based databases also support indexing methods that allow you to query data without the partition key, but these methods are less efficient than querying with the partition key.
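The lookup described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the node names and the `node_for` helper are hypothetical, MD5 stands in for the database's real partitioner, and a simple modulo replaces the token ranges that an actual cluster assigns to nodes.

```python
import hashlib

# Hypothetical node names; a real cluster assigns token ranges to nodes.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: tuple) -> str:
    """Map a partition key to a node with a stable hash (toy model)."""
    # Serialize the key columns in order, so a composite partition key
    # is hashed as a single unit.
    encoded = "\x00".join(str(part) for part in partition_key).encode("utf-8")
    digest = hashlib.md5(encoded).digest()
    token = int.from_bytes(digest[:8], "big")
    # Assumption: modulo placement, chosen only to keep the sketch short.
    return NODES[token % len(NODES)]


# A write and a later read hash the same key, so both reach the same node
# without asking any other node where the partition lives.
write_target = node_for(("sensor-17", "2024-05-01"))
read_target = node_for(("sensor-17", "2024-05-01"))
assert write_target == read_target
```

Because the hash is deterministic, any client or coordinator that knows the partition key can compute the owner on its own. A query without the partition key has no such shortcut, which is why it is more expensive.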
Clustering columns
Clustering columns are optional columns in the primary key. In the primary key definition, clustering columns follow the partition key. To store multiple rows within a partition, you must define clustering columns. Within a partition, each row has a unique value, or combination of values, for its clustering columns. Rows in a partition are ordered by their clustering columns, which enables efficient range queries within the partition.
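To make the ordering and range-query behavior concrete, here is a minimal in-memory sketch in Python. It is not how the storage engine is implemented. The `Partition` class, the `insert` helper, and the sensor data are hypothetical names and values chosen for illustration, with the sensor ID playing the role of the partition key and the timestamp the role of a single clustering column.

```python
from bisect import bisect_left, bisect_right, insort


class Partition:
    """Toy partition: rows kept in clustering-key order."""

    def __init__(self):
        self.keys = []  # clustering keys, always sorted
        self.rows = {}  # clustering key -> row data

    def upsert(self, clustering_key, row):
        # The clustering key is unique within the partition, so a second
        # write with the same key replaces the row instead of adding one.
        if clustering_key not in self.rows:
            insort(self.keys, clustering_key)
        self.rows[clustering_key] = row

    def range(self, start, end):
        # Binary search finds both ends of the slice; no full scan needed.
        lo = bisect_left(self.keys, start)
        hi = bisect_right(self.keys, end)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]


table = {}  # partition key -> Partition


def insert(partition_key, clustering_key, row):
    table.setdefault(partition_key, Partition()).upsert(clustering_key, row)


# Rows arrive out of order but are stored sorted within their partition.
insert("sensor-17", "2024-05-01T10:00", {"temp": 21.5})
insert("sensor-17", "2024-05-01T08:00", {"temp": 19.0})
insert("sensor-17", "2024-05-01T09:00", {"temp": 20.1})
insert("sensor-42", "2024-05-01T08:30", {"temp": 18.2})

# Inclusive range query on the clustering column inside one partition.
morning = table["sensor-17"].range("2024-05-01T08:30", "2024-05-01T10:00")
```

In this sketch, `morning` holds the 09:00 and 10:00 rows for `sensor-17` in timestamp order. The row for `sensor-42` lives in a different partition and is never examined.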