Table concepts

A table is a database object that stores data. It is analogous to a SQL table, although the data is stored very differently in Apache Cassandra®. Tables can be created, modified, or dropped at runtime without blocking updates or queries.

Tables are defined as rows of columns. Each column must have a data type, which is specified when the column is defined. Rows are identified by the column or columns chosen as the primary key of the table. Data is stored in a cell, which is the intersection of a row and column. For more information about these concepts, see Apache Cassandra® structure.

Column characteristics

Column definition comprises one part of table definition. The columns are defined in a comma-delimited list of <column-name> <data-type> pairs. A common column definition is:

username text,

Data types

Most data types are straightforward in definition, and are similar to the common data types in other languages.

| General data type | CQL data types |
|---|---|
| Characters | ascii, text, or varchar |
| Integers | int, tinyint, smallint, bigint, or varint |
| Vectors | Array type of float32 |
| Decimals | decimal, float, double |
| Date and time | date, DateRangeType, duration, time, timestamp |
| Unique identifiers | uuid, timeuuid |
| Specialized | blob, boolean, counter |
| Geospatial | PointType, LineStringType, PolygonType |
| Collections - frozen or non-frozen | list, set, map |
| Other | user-defined type (UDT), tuple |

See the data type reference for more detailed information.

Primary Key

The other part of a table definition that must be included is the primary key. The primary key is a unique identifier for each row in the table. It can be a single column or a composite of multiple columns.

If a single column is required, it can be declared as part of a table column definition:

username text PRIMARY KEY,

If multiple columns are required, place the definition after the column definitions in the table creation command, using the PRIMARY KEY clause with the columns listed in parentheses:

PRIMARY KEY (username, age)

A primary key in Cassandra consists of a partition key, made up of one or more columns, and zero or more clustering columns. The order of these components in the definition always puts the partition key first and then the clustering column or columns.
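To make the ordering concrete, here is a minimal Python sketch that renders a CREATE TABLE statement from a column list and a primary key. The helper name `render_create_table` and the `users` table are hypothetical, chosen only for illustration; the sketch does not validate data types or talk to a cluster.

```python
def render_create_table(table, columns, partition_key, clustering=()):
    """Render a CQL CREATE TABLE statement as text (hypothetical helper)."""
    if not partition_key:
        raise ValueError("the primary key needs at least one partition key column")
    undefined = [c for c in (*partition_key, *clustering) if c not in columns]
    if undefined:
        raise ValueError(f"primary key columns are not defined: {undefined}")

    # Comma-delimited <column-name> <data-type> pairs, one per line.
    lines = [f"  {name} {cql_type}," for name, cql_type in columns.items()]

    # A multi-column partition key gets its own inner parentheses.
    if len(partition_key) == 1:
        partition = partition_key[0]
    else:
        partition = "(" + ", ".join(partition_key) + ")"

    # Partition key first, then clustering columns in their declared order.
    key = ", ".join([partition, *clustering])
    return "\n".join(
        [f"CREATE TABLE {table} (", *lines, f"  PRIMARY KEY ({key})", ");"]
    )


# Assumed example table: username is the partition key, age a clustering column.
print(render_create_table(
    "users",
    {"username": "text", "age": "int", "city": "text"},
    partition_key=["username"],
    clustering=["age"],
))
```

Passing two columns as `partition_key` with no clustering columns yields a clause of the form `PRIMARY KEY ((age, name))`, where the inner parentheses mark the multi-column partition key.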

The primary key is defined when the table is created and cannot be altered. If you must change the primary key, you'll need to create a new table schema and write the existing data to the new table.

The definition of a table’s primary key is critical. Carefully model how data in a table will be inserted and retrieved before choosing which columns to define in the primary key. When selecting the table’s primary key, consider:

the size of the partitions
the order of the data within partitions
the distribution of the partitions among the nodes of the cluster
the fact that a primary key cannot have a NULL value

Partition key

Cassandra is a partitioned row store, so the first element of the primary key, the partition key, determines which nodes in the Cassandra cluster store a particular table row. Thus, the primary key identifies both the location of stored data (through the partition key) and its order (through any clustering columns).

At a minimum, the primary key must consist of a partition key. One or more columns are used to define a partition key. If one column, such as age, is defined as the partition key, then all rows for 35-year-olds are stored in the same partition.

In contrast to a simple partition key, a composite partition key uses two or more columns to identify where data resides. If more than one column is specified, the composite partition key splits a dataset so that related data is stored on separate partitions.
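The mapping from partition key to storage location can be sketched with a toy model in Python. This is an illustration under loud assumptions: `toy_token` and `toy_node_for` are hypothetical names, the MD5-based hash is a stand-in rather than Cassandra's partitioner, and real clusters assign token ranges to nodes instead of taking a simple modulo.

```python
import hashlib


def toy_token(partition_key_values):
    """Hash the partition key values to an integer (toy stand-in for a partitioner)."""
    raw = "|".join(str(v) for v in partition_key_values).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")


def toy_node_for(partition_key_values, nodes):
    """Pick a node from the token. Real clusters use token ranges, not modulo."""
    return nodes[toy_token(partition_key_values) % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # assumed three-node cluster

# Simple partition key (age): every row with age 35 maps to the same place.
print(toy_node_for((35,), nodes))

# Composite partition key (age, name): each combination has its own token,
# so rows for different names form separate partitions.
print(toy_token((35, "Jane")), toy_token((35, "John")))
```

The point of the sketch is only that the location is a pure function of the partition key values: the same values always land together, and adding a column to the key changes how rows are grouped.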

Importance of the partition key

Data is retrieved using the partition key. Keep in mind that to retrieve data from the table, values for all columns defined in the partition key have to be supplied unless an index is created.
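The requirement that every partition key column be supplied can be sketched as a small check in Python. The function name `partition_lookup` is hypothetical, and the sketch ignores secondary indexes; it only shows why a partial key is not enough to locate a partition.

```python
def partition_lookup(partition_key, where):
    """Return the partition key tuple for a query, or fail if a column is missing."""
    missing = [column for column in partition_key if column not in where]
    if missing:
        # Without every partition key column, the partition cannot be located.
        raise ValueError(f"missing partition key columns: {missing}")
    return tuple(where[column] for column in partition_key)


# Assumed table with the composite partition key (age, name).
print(partition_lookup(["age", "name"], {"age": 35, "name": "Jane"}))  # (35, 'Jane')

try:
    partition_lookup(["age", "name"], {"age": 35})
except ValueError as error:
    print(error)
```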

The database stores all rows that share a partition key together on a node. If there is too much data in a single partition and data needs to be spread among multiple nodes, use a composite partition key. Using more than one column for the partition key breaks the data into smaller chunks, or buckets, each stored as its own partition. The data is still grouped, but in smaller chunks.

This method can be effective if a cluster experiences hotspotting, or congestion from writing data to one node repeatedly, because a single partition is written to heavily. Cassandra is often used for time series data, where hotspotting can be a real issue.

For example, if you want to group data by both age and name, specifying both columns as the partition key stores all the 35-year-old Janes in the same partition, whether they live in Seattle, Chicago, or Boston.
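The effect of bucketing on partition size can be sketched in Python with a made-up time series. The column names `sensor_id`, `day`, and `seq`, and the helper `partition_sizes`, are assumptions for illustration only.

```python
from collections import Counter

# Assumed data: one sensor reporting four readings a day for three days.
readings = [
    {"sensor_id": "s1", "day": f"2024-01-0{day}", "seq": seq}
    for day in (1, 2, 3)
    for seq in range(4)
]


def partition_sizes(rows, partition_key):
    """Count how many rows fall into each partition for a given partition key."""
    return Counter(tuple(row[column] for column in partition_key) for row in rows)


# Simple partition key: every reading lands in one ever-growing partition.
print(partition_sizes(readings, ["sensor_id"]))

# Composite partition key: one smaller bucket per sensor per day.
print(partition_sizes(readings, ["sensor_id", "day"]))
```

With the simple key, all twelve rows share one partition; with the composite key, they split into three partitions of four rows each. The trade-off is that a query must then supply both the sensor and the day.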

Clustering column

The second part of a primary key is optional, and consists of one or more clustering columns.

Clustering columns order the data so that multiple rows within a single partition are clustered in a defined order. The clustering columns do not dictate the storage location of the data, only the order of the data within a partition. However, this feature can be very useful, for example, if the data stores time sequences, or groups of network equipment by region.
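The ordering role of clustering columns can be sketched with a toy partition in Python. The class `ToyPartition` and the `event_time` column are hypothetical; the sketch sorts rows when they are read, whereas Cassandra keeps rows in clustering order in storage.

```python
class ToyPartition:
    """Rows of a single partition, keyed and ordered by their clustering columns."""

    def __init__(self, clustering_columns):
        self.clustering_columns = tuple(clustering_columns)
        self.rows = {}

    def upsert(self, row):
        # Within a partition, the clustering values identify the row,
        # so writing the same values again replaces the earlier row.
        key = tuple(row[column] for column in self.clustering_columns)
        self.rows[key] = row

    def scan(self, reverse=False):
        # Return rows in clustering order, regardless of insertion order.
        return [self.rows[key] for key in sorted(self.rows, reverse=reverse)]


partition = ToyPartition(["event_time"])
for event_time in (3, 1, 2):
    partition.upsert({"event_time": event_time, "status": "ok"})

print([row["event_time"] for row in partition.scan()])  # [1, 2, 3]
```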

The clustering column cannot be the first column declared in the key definition because the partition key must be defined first.

If a clustering column is specified in the primary key, the primary key is considered a compound primary key. The table is then composed of multi-row partitions.

Importance of clustering columns

Using a compound primary key creates a table that can be queried to return sorted results. If the pro cycling example were designed for a relational database, a cyclists table would be created with a foreign key to JOIN with the races table when querying. In Cassandra, the data is denormalized because joins are not performant in a distributed system.

Grouping data in tables using a clustering column or columns is analogous to JOINs in a relational database, but clustering columns are much more performant because only one table is accessed. On a physical node, when rows for a partition key are stored in order based on the clustering columns, retrieval of rows is very efficient.