Best practices for data modeling in Cassandra-based databases

Data modeling in Cassandra-based databases is a critical part of building successful applications. A well-designed data model directly impacts performance, scalability, and maintainability. The best practices outlined here will help you design models that are efficient, reliable, and aligned with your application’s requirements.

Limit queries to a single partition

In Cassandra-based databases, query scope should be limited to a single partition whenever possible. A partition contains all rows that share the same partition key. The partition key determines which node in the cluster stores the data, and all rows in the partition are stored contiguously on that node.

Single-partition queries minimize the number of nodes that need to participate in returning results. Queries that span multiple partitions require coordination across nodes, resulting in increased overhead and slower response times.

Bucketing

Cassandra-based databases operate best when partition sizes are controlled. DataStax recommends keeping partitions small, ideally under 100 MB. Excessively large partitions can lead to performance issues, such as increased latency and higher resource consumption.

Use bucketing to prevent unbounded growth in partitions. Bucketing involves adding a column to the partition key to split large partitions into smaller, more manageable ones.

The following example shows IoT data from a sensor without bucketing. If one sensor sends data at a higher frequency than others, the partition for that sensor can grow very large. The partition key is the sensor_id, and the rows are ordered by timestamp. Each time a new temperature is received, the partition for the high-frequency sensor grows larger.

CREATE TABLE sensor_data (
    sensor_id UUID,
    timestamp TIMESTAMP,
    temperature FLOAT,
    PRIMARY KEY ((sensor_id), timestamp)
);

This is the same table, but with bucketing applied. The partition key now contains a day column, so each partition contains data for a single day only.

CREATE TABLE sensor_data (
    sensor_id UUID,
    day DATE,
    timestamp TIMESTAMP,
    temperature FLOAT,
    PRIMARY KEY ((sensor_id, day), timestamp)
);

Indexing

Use indexing in Cassandra-based databases to use non-key columns in WHERE clauses. By default, queries in Cassandra-based databases must use the entire partition key in the WHERE clause. This ensures that the query can be efficiently routed to the correct node in the cluster. Optionally, WHERE clauses may also include clustering columns.

Indexes allow you to query non-key columns without incurring the overhead of full table scans.

DataStax recommends using Storage-Attached Indexes (SAI) for all Cassandra-based databases except DSE 5.1. For DSE 5.1, use secondary indexes (2i), as SAI is not supported in that version.

If you use an unindexed non-key column in a WHERE clause, the query will fail with a message saying that you can use ALLOW FILTERING to bypass this restriction:

An error message for an invalid request that cannot be executed due to potential data filtering and unpredictable performance.

DataStax does not recommend using ALLOW FILTERING in production. This option can lead to serious performance issues, as it allows the database to scan all partitions to find matching rows. For large datasets, this can result in high latency and increased resource consumption.

Table-per-query

The table-per-query pattern is a common best practice in Cassandra-based databases. When designing a table, you choose the partition key, clustering columns, and any indexes to optimize for a specific query. Because Cassandra-based databases do not support joins, all the data required for a query must reside in a single table.

When considering a new query, create a new table for it rather than adapting an existing one. This approach allows you to optimize each table for a specific query, improving both performance and scalability. It also helps keep the data model simple and maintainable, as each table has a clear purpose and structure.

Denormalization

The table-per-query pattern in Cassandra-based databases naturally leads to denormalization, which is the practice of duplicating data across multiple tables to optimize query performance. Since all data for a query must be contained within a single table, you may need to duplicate data across multiple tables to support different queries. In some cases, you will design multiple tables with exactly the same data, but with different partition keys and clustering columns to optimize different queries.

For example, consider an application that needs to query employee data by department and also by location. The table-per-query pattern would lead to two separate tables, one for each query. In this case the tables would contain the same data, but different partition keys and clustering columns. A common naming convention for these tables is to use the query name as the table name, such as employees_by_department and employees_by_location.

The employees_by_department table uses the user-defined type CONTACT_INFO to store contact information for each employee. This table supports the following query:

SELECT * FROM employees_by_department WHERE department = 'Engineering';

CREATE TABLE employees_by_department (
    department TEXT,
    last_name TEXT,
    first_name TEXT,
    employee_id int,
    start_date DATE,
    location TEXT,
    contact_info CONTACT_INFO,
    PRIMARY KEY ((department), last_name, first_name, employee_id)
);

The employees_by_location table uses the same data, but with a different partition key and clustering columns. This table supports the following query:

SELECT * FROM employees_by_location WHERE location = 'New York';

CREATE TABLE employees_by_location (
    location TEXT,
    last_name TEXT,
    first_name TEXT,
    employee_id int,
    start_date DATE,
    department TEXT,
    contact_info CONTACT_INFO,
    PRIMARY KEY ((location), last_name, first_name, employee_id)
);

Tombstones

When you execute a DELETE statement, data is not immediately removed from the database. Instead, Cassandra-based databases insert a marker called a tombstone to indicate that the data has been deleted. Tombstones ensure that deleted data is not returned by queries, even before it is permanently removed during compaction.

Excessive tombstones can negatively impact performance, so it’s important to design your data model to minimize their creation.

While DELETE statements are the most common source of tombstones, they can also be created in the following scenarios:

Writing a NULL value using INSERT or UPDATE
Inserting or replacing an entire non-frozen collection
Allowing a record to expire after its TTL (time to live)
Updating a materialized view

BLOBs

Cassandra-based databases store binary data as Binary Large Objects (BLOBs). BLOBs are suitable for small binary payloads, but storing large BLOBs, such as images or videos, can lead to unbalanced partitions and degraded performance.

For large binary content, store the data in an object store or content delivery network (CDN), and then store the URI in your database instead.