A table is a database object that stores the data. It is analogous to a SQL table, although the data is stored much differently in Apache Cassandra. Tables can be created, modified, or dropped at runtime without blocking updates or queries.
Column definition comprises one part of table definition.
The columns are defined in a comma-delimited list of <column-name> <data-type> pairs.
A common column definition is:
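As a minimal sketch, the statement below defines a few columns as name and type pairs. The keyspace `demo`, the table `users`, and all column names are hypothetical and chosen only for illustration. The inline PRIMARY KEY is included because a table cannot be created without one.

```cql
-- Hypothetical keyspace for illustration; replication settings suit a single-node test only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Each column is a <column-name> <data-type> pair, separated by commas.
CREATE TABLE IF NOT EXISTS demo.users (
  username  text PRIMARY KEY,  -- string type
  age       int,               -- numeric type
  city      text,
  signup    timestamp,         -- date and time type
  emails    set<text>          -- collection type
);
```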
Most data types are straightforward in definition, and are similar to the common data types in other languages.
General data type
CQL data types
Array type of
Date and time
Collections - frozen or non-frozen
See the data type reference for more detailed information.
The other required part of a table definition is the primary key. It can be declared inline, as part of a column definition, when the primary key is a single column:
username text PRIMARY KEY,
Alternatively, the primary key can be declared after the column definitions with a separate PRIMARY KEY clause:
PRIMARY KEY (username, age)
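The two forms can be compared side by side in complete statements. This is a sketch that assumes a hypothetical keyspace `demo` already exists; the table names are invented for illustration.

```cql
-- Form 1: inline declaration, only possible for a single-column primary key.
CREATE TABLE IF NOT EXISTS demo.users_by_username (
  username text PRIMARY KEY,
  age      int,
  city     text
);

-- Form 2: separate PRIMARY KEY clause after the column definitions.
-- Here username is the partition key and age is a clustering column.
CREATE TABLE IF NOT EXISTS demo.users_by_username_age (
  username text,
  age      int,
  city     text,
  PRIMARY KEY (username, age)
);
```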
A primary key in Cassandra consists of a partition key, made up of one or more columns, and zero or more clustering columns. The definition always lists the partition key first, followed by the clustering column or columns.
The primary key is defined when the table is created and cannot be altered. If you must change the primary key, you’ll need to create a new table schema and write the existing data to the new table.
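One possible way to move data to a table with a different primary key is sketched below. It assumes a hypothetical existing table `demo.users` with columns `username`, `age`, and `city`, and it uses the cqlsh `COPY` commands, which are shell commands rather than CQL statements and are generally suited to smaller data sets. The file name `users.csv` is also an assumption.

```cql
-- New table with the desired primary key: city is now the partition key.
CREATE TABLE IF NOT EXISTS demo.users_by_city (
  city     text,
  username text,
  age      int,
  PRIMARY KEY (city, username)
);

-- Run in cqlsh: export the existing rows, then load them into the new table.
COPY demo.users (username, age, city) TO 'users.csv' WITH HEADER = TRUE;
COPY demo.users_by_city (username, age, city) FROM 'users.csv' WITH HEADER = TRUE;
```

Rows whose new primary key columns are empty (here, `city`) cannot be loaded, because primary key columns must have a value.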
The definition of a table’s primary key is critical. Carefully model how data in a table will be inserted and retrieved before choosing which columns to define in the primary key. When selecting the table’s primary key, consider:
the size of the partitions
the order of the data within partitions
the distribution of the partitions among the nodes of the cluster
the fact that a primary key cannot have a
Cassandra is a partitioned row store, so the first element of the primary key, the partition key, determines which nodes in the Cassandra cluster store a particular table row. Thus, the primary key identifies both the location of stored data and its order.
At the minimum, the primary key must consist of a partition key.
One or more columns are used to define a partition key.
If one column, such as age, is defined as the partition key, then all rows for 35-year-olds are stored in the same partition.
In contrast to a simple partition key, a composite partition key uses two or more columns to identify where data resides. If more than one column is specified, the composite partition key splits a data set so that related data is stored on separate partitions.
Data is retrieved using the partition key. Keep in mind that to retrieve data from the table, values for all columns defined in the partition key have to be supplied unless an index is created.
The database stores an entire row of data on a node by partition key. If there is too much data in a single partition and data needs to be spread among multiple nodes, use a composite partition key. Using more than one column for the partition key breaks the data into chunks, or buckets. These columns form logical sets inside a partition to facilitate retrieval. The data is still grouped, but in smaller chunks.
This method can be effective if a cluster experiences hotspotting, that is, congestion from repeatedly writing data to one node because a single partition receives heavy writes. Cassandra is often used for time series data, where hotspotting can be a real issue.
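For time series data, one common bucketing approach is to add a time component to the partition key. The sketch below assumes a hypothetical keyspace `demo` and an invented `sensor_readings` table; the choice of one partition per sensor per day is an assumption that would need to be tuned to the actual write volume.

```cql
-- Composite partition key (sensor_id, day): each sensor gets a new partition every day,
-- which bounds partition size and spreads a sensor's history across the cluster.
CREATE TABLE IF NOT EXISTS demo.sensor_readings (
  sensor_id    text,
  day          date,
  reading_time timestamp,
  reading      double,
  PRIMARY KEY ((sensor_id, day), reading_time)
);

INSERT INTO demo.sensor_readings (sensor_id, day, reading_time, reading)
VALUES ('s-001', '2024-01-15', '2024-01-15 08:30:00+0000', 21.5);

-- Reads must name the bucket as well as the sensor.
SELECT reading_time, reading
FROM demo.sensor_readings
WHERE sensor_id = 's-001' AND day = '2024-01-15';
```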
For example, if you want to group data by both age and name, specifying both columns as part of the partition key stores the rows for all 35-year-old Janes in the same partition, whether they live in Seattle, Chicago, or Boston.
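The age and name example might look like the following sketch. The keyspace `demo` and the table name are hypothetical, and using `city` as a clustering column is an assumption made so that Janes in different cities remain separate rows.

```cql
-- The inner parentheses mark the composite partition key (age, name).
CREATE TABLE IF NOT EXISTS demo.people_by_age_name (
  age  int,
  name text,
  city text,
  PRIMARY KEY ((age, name), city)
);

-- Valid: every partition key column is supplied.
SELECT city FROM demo.people_by_age_name
WHERE age = 35 AND name = 'Jane';

-- Rejected by default: only part of the partition key is supplied.
-- SELECT city FROM demo.people_by_age_name WHERE age = 35;
```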
The second part of a primary key is optional, and consists of one or more clustering columns. Clustering columns order the data so that multiple rows within a single partition are clustered in a defined order. The clustering columns do not dictate the storage location of the data, only the order of the data within a partition. However, this feature can be very useful, for example, if the data stores time sequences, or groups of network equipment by region.
If a clustering column is specified in the primary key, the primary key is a compound primary key, and the table consists of multi-row partitions.
Using a compound primary key creates a table that can be queried to return sorted results.
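A compound primary key and the sorted results it enables can be sketched as follows. The keyspace `demo`, the table `race_results`, its columns, and the race name are all hypothetical.

```cql
-- race_name is the partition key; finish_rank is a clustering column,
-- so rows within each race are stored in rank order.
CREATE TABLE IF NOT EXISTS demo.race_results (
  race_name    text,
  finish_rank  int,
  cyclist_name text,
  PRIMARY KEY (race_name, finish_rank)
) WITH CLUSTERING ORDER BY (finish_rank ASC);

-- Rows come back in clustering order without any extra sorting step.
SELECT finish_rank, cyclist_name
FROM demo.race_results
WHERE race_name = 'Example Grand Prix';

-- The stored order can also be reversed at query time.
SELECT finish_rank, cyclist_name
FROM demo.race_results
WHERE race_name = 'Example Grand Prix'
ORDER BY finish_rank DESC
LIMIT 3;
```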
If the pro cycling example were designed for a relational database, a cyclists table would be created with a foreign key to JOIN with the races table when querying.
In Cassandra, the data is denormalized because joins are not performant in a distributed system.
Grouping data in tables using a clustering column or columns is analogous to JOINs in a relational database, but clustering columns are much more performant because only one table is accessed.
On a physical node, when rows for a partition key are stored in order based on the clustering columns, retrieval of rows is very efficient.
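A denormalized table that stands in for a cyclists-to-races join could be sketched as below. The keyspace `demo`, the table `races_by_cyclist`, its columns, and the UUID value are hypothetical and are not taken from the pro cycling example's actual schema.

```cql
-- One partition per cyclist; each race the cyclist rode is a row in that partition.
CREATE TABLE IF NOT EXISTS demo.races_by_cyclist (
  cyclist_id   uuid,
  race_date    date,
  race_name    text,
  cyclist_name text static,  -- stored once per partition rather than once per row
  finish_rank  int,
  PRIMARY KEY (cyclist_id, race_date, race_name)
) WITH CLUSTERING ORDER BY (race_date DESC, race_name ASC);

-- A single-partition read returns the cyclist and all of their races, newest first,
-- where a relational design would join two tables.
SELECT cyclist_name, race_date, race_name, finish_rank
FROM demo.races_by_cyclist
WHERE cyclist_id = 123e4567-e89b-12d3-a456-426614174000;
```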