How is data maintained?

A description of the stages DataStax Distribution of Apache Cassandra 3.11 uses to process data on the write path.

The DataStax Distribution of Apache Cassandra™ (DDAC) database write process stores data in files called SSTables. SSTables are immutable. Instead of overwriting existing rows with inserts or updates, the database writes new timestamped versions of the inserted or updated data in new SSTables. The database does not perform deletes by removing the deleted data. Instead, the database marks deleted data with tombstones.

Over time, the database may write many versions of a row in different SSTables. Each version may have a unique set of columns stored with different timestamps. As SSTables accumulate, the distribution of data can require accessing more and more SSTables to retrieve a complete row.

To keep the database healthy, the database periodically merges SSTables and discards old data. This process is called compaction.

Compaction

Compaction works on a collection of SSTables. From these SSTables, compaction collects all versions of each unique row and assembles one complete row, using the most up-to-date version (by timestamp) of each of the row's columns. The merge is efficient because rows are sorted by partition key within each SSTable, so the merge does not need random I/O. The new version of each row is written to a new SSTable. The old versions, along with any rows that are ready for deletion, remain in the old SSTables, which are deleted once any pending reads complete.
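The merge described above can be sketched in a few lines of Python. This is a simplified illustration, not the database's implementation: the in-memory cell tuples, the compact function name, and the use of None as a tombstone marker are assumptions made for the example, and the sketch drops tombstoned cells immediately, whereas the real database keeps tombstones until gc_grace_seconds has expired.

```python
import heapq
from itertools import groupby

# Hypothetical model: each SSTable is a list of cells
# (partition_key, column, timestamp, value), already sorted by
# (partition_key, column). A value of None stands in for a tombstone.
def compact(sstables):
    """Merge sorted SSTables, keeping the newest cell for each column."""
    def cell_id(cell):
        return (cell[0], cell[1])

    # Sequential merge of sorted inputs: no random access is needed.
    stream = heapq.merge(*sstables, key=cell_id)

    compacted = []
    for _, versions in groupby(stream, key=cell_id):
        newest = max(versions, key=lambda cell: cell[2])
        if newest[3] is not None:  # skip cells whose newest version is a tombstone
            compacted.append(newest)
    return compacted


older = [("k1", "age", 1, 30), ("k1", "name", 1, "Ann"), ("k2", "name", 1, "Bob")]
newer = [("k1", "age", 5, 31), ("k2", "name", 7, None)]
result = compact([older, newer])
```

In this example, result keeps the newer age for k1, keeps the only version of the k1 name, and omits k2 because its newest cell is a tombstone.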

Compaction causes a temporary spike in disk space usage and disk I/O while old and new SSTables co-exist. As it completes, compaction frees up disk space occupied by old SSTables. It improves read performance by incrementally replacing old SSTables with compacted SSTables. The database can read data directly from the new SSTable even before it finishes writing, instead of waiting for the entire compaction process to finish.

As the database processes writes and reads, it replaces the old SSTables with new SSTables in the page cache. The process of caching the new SSTable, while directing reads away from the old one, is incremental and does not cause a dramatic cache miss. This means that Cassandra provides predictable high performance even under heavy load.

Compaction strategies

The Cassandra database supports different compaction strategies. These strategies control which SSTables are chosen for compaction and how the compacted rows are sorted into new SSTables. Each strategy has its own strengths. The sections that follow explain each compaction strategy.

Although each of the following sections starts with a generalized recommendation, many factors complicate the choice of a compaction strategy. See Choosing a compaction strategy.

SizeTieredCompactionStrategy (STCS)

Recommended for write-intensive workloads.

Pros: Compacts write-intensive workloads very well.
Cons: Can hold on to stale data too long. Required memory increases over time.

The SizeTieredCompactionStrategy (STCS) initiates compaction when the database has accumulated a set number (default: 4) of similar-sized SSTables. STCS merges these SSTables into one larger SSTable. As the larger SSTables accumulate, STCS merges these into even larger SSTables. At any given time, several SSTables of varying sizes are present.

Figure 2. Size tiered compaction after many inserts
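The size-tiered trigger can be approximated with the following Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, "similar-sized" is modeled as falling within 0.5 to 1.5 times a bucket's average size (the bucket_low and bucket_high parameters here are assumptions of the sketch, not values taken from this page), and handling of very small SSTables is omitted.

```python
def stcs_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size."""
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if average * bucket_low <= size <= average * bucket_high:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size: start a new one.
            buckets.append([size])
    return buckets


def pick_compaction(sizes_mb, min_threshold=4):
    """Return the first bucket holding at least min_threshold SSTables, or None."""
    for bucket in stcs_buckets(sizes_mb):
        if len(bucket) >= min_threshold:
            return bucket
    return None


candidate = pick_compaction([100, 110, 95, 105, 900])
```

Here the four SSTables of roughly 100 MB form one bucket and become the compaction candidate, while the 900 MB SSTable waits until enough SSTables of similar size exist.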

While STCS works well to compact a write-intensive workload, it makes reads slower because the merge-by-size process does not group data by rows. This makes it more likely that versions of a particular row may be spread over many SSTables. Also, STCS does not evict deleted data predictably because its trigger for compaction is SSTable size, and SSTables might not grow quickly enough to merge and evict old data. As the largest SSTables grow in size, the amount of disk space needed for both the new and old SSTables simultaneously during STCS compaction can outstrip a typical amount of disk space on a node.

LeveledCompactionStrategy (LCS)

Recommended for read-intensive workloads.

Pros: Disk requirements are easier to predict. Read operation latency is more predictable. Stale data is evicted more frequently.
Cons: Much higher I/O utilization, which impacts operation latency.

The LeveledCompactionStrategy (LCS) alleviates some of the read operation issues with STCS. This strategy works with a series of levels. First, data in memtables is flushed to SSTables in the first level (L0). LCS compaction merges these first SSTables with larger SSTables in level L1.

Figure 3. Leveled compaction — adding SSTables

The SSTables in levels greater than L1 are merged into SSTables with a size greater than or equal to sstable_size_in_mb (default: 160 MB). If an L1 SSTable stores data of a partition that is larger than L2, LCS moves the SSTable past L2 to the next level up.

Figure 4. Leveled compaction after many inserts

In each of the levels above L0, LCS creates SSTables that are about the same size. Each level is 10 times the size of the previous level, so level L1 has 10 times as many SSTables as L0, and level L2 has 100 times as many as L0. If the result of the compaction is more than 10 SSTables in level L1, the excess SSTables are moved to level L2.

Note: Keep in mind that the maximum overhead when using LCS is the sum of N-1 levels. For example, given an SSTable size of 160 MB, overhead requirements expand drastically once past level 3, from about 1.7 terabytes at level 4 to about 17 terabytes at level 5:

Level  SSTables  Level size    Level plus lower levels (L1 and up)
L0     1         160 MB        160 MB
L1     10        1600 MB       1600 MB
L2     100       16000 MB      16000 + 1600 = 17600 MB (about 17 GB)
L3     1000      160000 MB     160000 + 16000 + 1600 = 177600 MB (about 177 GB)
L4     10000     1600000 MB    1600000 + 160000 + 16000 + 1600 = 1777600 MB (about 1.7 TB)
L5     100000    16000000 MB   16000000 + 1600000 + 160000 + 16000 + 1600 = 17777600 MB (about 17 TB)

To mitigate that situation, switch to STCS and add nodes, or reduce the SSTable size using sstablesplit.
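The arithmetic in the note above can be reproduced with a short Python sketch. The constant and function names are hypothetical. The sketch assumes the 160 MB SSTable size and the factor of 10 between levels stated in this section, and, like the figures in the note, sums levels from L1 upward.

```python
SSTABLE_SIZE_MB = 160  # sstable_size_in_mb default given in this section
LEVEL_FANOUT = 10      # each level holds 10 times as much as the previous one


def level_size_mb(level):
    """Size of a single full level, in MB."""
    return LEVEL_FANOUT ** level * SSTABLE_SIZE_MB


def cumulative_size_mb(level):
    """Size of the given level plus all lower levels down to L1, in MB."""
    if level == 0:
        return level_size_mb(0)
    return sum(level_size_mb(n) for n in range(1, level + 1))


for level in range(6):
    print(f"L{level}: {level_size_mb(level)} MB, cumulative {cumulative_size_mb(level)} MB")
```

Each additional level multiplies the total by roughly 10, which is why the requirement jumps from about 1.7 TB at L4 to about 17 TB at L5.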

The LCS compaction process guarantees that the SSTables within each level starting with L1 have non-overlapping data. For many reads, this process enables the database to retrieve all the required data from only one or two SSTables. In fact, 90% of all reads can be satisfied from one SSTable. Since LCS does not compact L0 tables, however, resource-intensive reads involving many L0 SSTables may still occur.
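Because the SSTables within a level do not overlap, at most one SSTable per level can contain a given partition. The following Python sketch illustrates that lookup with a binary search. The data layout is an assumption for the example: each SSTable is reduced to a hypothetical (first_token, last_token, name) tuple with integer tokens, which is far simpler than the on-disk structures the database actually uses.

```python
import bisect


def find_sstable(level, token):
    """Return the name of the one SSTable in a level that may hold the token.

    level is a list of (first_token, last_token, name) tuples, sorted by
    first_token, with ranges that do not overlap.
    """
    starts = [first for first, _, _ in level]
    index = bisect.bisect_right(starts, token) - 1
    if index >= 0:
        first, last, name = level[index]
        if first <= token <= last:
            return name
    return None  # no SSTable in this level covers the token


level_1 = [(0, 99, "sstable-a"), (100, 199, "sstable-b"), (300, 399, "sstable-c")]
hit = find_sstable(level_1, 150)
miss = find_sstable(level_1, 250)
```

In L0, where SSTables can overlap, no such shortcut applies and every SSTable may need to be checked, which is why reads that touch many L0 SSTables remain expensive.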

At levels beyond L0, LCS requires less disk space for compacting: generally, 10 times the fixed size of the SSTable. Obsolete data is evicted more often, so deleted data uses smaller portions of the SSTables on disk. However, LCS compaction operations take place more often and place more I/O burden on the node. For write-intensive workloads, the payoff of using this strategy is generally not worth the performance loss to I/O operations. In many cases, tests of LCS-configured tables reveal I/O saturation on writes and compactions.

Note: The database bypasses compaction operations when bootstrapping a new node using LCS into a cluster. The original data is moved directly to the correct level because there is no existing data, so no partition overlap per level is present. For more information, see the Apache Cassandra 2.2 - Bootstrapping Performance Improvements for Leveled Compaction blog.

TimeWindowCompactionStrategy (TWCS)

Recommended for time series and expiring time-to-live (TTL) workloads.

Pros: Well-suited for time series data, stored in tables that use the default TTL for all data. Simpler configuration than that of DateTieredCompactionStrategy (DTCS), which is deprecated in favor of TWCS.
Cons: Not appropriate if out-of-sequence time data is required, because SSTables will not compact well. Also not appropriate for workloads without a TTL, as storage will grow without bound. Less fine-tuned configuration is possible than with DTCS.

The TimeWindowCompactionStrategy (TWCS) is similar to DTCS with simpler settings. TWCS groups SSTables using a series of time windows. During compaction, TWCS applies STCS to uncompacted SSTables in the most recent time window. At the end of a time window, TWCS compacts all SSTables that fall into that time window into a single SSTable based on the SSTable maximum timestamp. After the major compaction for a time window is completed, no further compaction of the data occurs, although tombstone compaction can still be run on SSTables after the compaction_window threshold has passed. The process starts over with the SSTables written in the next time window.

Note: For tables where all cells have a time-to-live (TTL) applied, or tables using the default TTL, once all TTLs have passed, the gc_grace_seconds period has expired, and the droppable tombstone ratio is 100%, the SSTable can be dropped without a compaction.

Figure 5. How TimeWindowCompactionStrategy works

As the figure shows, from 10 AM to 11 AM, the memtables are flushed from memory into 100 MB SSTables. These SSTables are compacted into larger SSTables using STCS. At 11 AM, all these SSTables are compacted into a single SSTable, and never compacted again by TWCS.

At 12 PM, the new SSTables created between 11 AM and 12 PM are compacted using STCS, and at the end of the time window, the TWCS compaction repeats. Notice that each TWCS time window contains varying amounts of data.

Note: For an animated explanation, see the DataStax Academy Time Window Compaction Strategy video. A valid DataStax Academy account is required to view the video.

The TWCS configuration has two main property settings:

compaction_window_unit: time unit used to define the window size (MINUTES, HOURS, or DAYS)
compaction_window_size: how many units per window (1, 2, 3, and so on)

The configuration for the above example: compaction_window_unit = 'MINUTES', compaction_window_size = 60
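To show how the two settings define a window, the following Python sketch assigns SSTables to time windows by their maximum timestamp, using the 60-minute window from the example. It is a simplified model: the function names, the (name, max_timestamp) tuples, and the use of timestamps in seconds are assumptions made for the illustration.

```python
from collections import defaultdict

# Seconds per unit for the window units used in this sketch.
UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def window_start(timestamp_s, unit="MINUTES", size=60):
    """Return the start (in seconds) of the window containing the timestamp."""
    width = UNIT_SECONDS[unit] * size
    return timestamp_s - timestamp_s % width


def group_by_window(sstables, unit="MINUTES", size=60):
    """Group (name, max_timestamp_s) pairs by the window of their max timestamp."""
    windows = defaultdict(list)
    for name, max_timestamp_s in sstables:
        windows[window_start(max_timestamp_s, unit, size)].append(name)
    return dict(windows)


# Timestamps are seconds since midnight: 10:15:00, 10:59:59, and 11:00:00.
flushed = [("sstable-a", 36900), ("sstable-b", 39599), ("sstable-c", 39600)]
windows = group_by_window(flushed)
```

The first two SSTables share the 10 AM window and would eventually be compacted into a single SSTable, while the third opens the 11 AM window.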

DateTieredCompactionStrategy (DTCS) - Deprecated

Use TimeWindowCompactionStrategy (TWCS) instead.

The DateTieredCompactionStrategy (DTCS) is similar to STCS, but instead of compacting based on SSTable size, DTCS compacts based on SSTable age. Each column in an SSTable is marked with the timestamp at write time. DTCS defines the age of an SSTable as the oldest (minimum) timestamp of any column the SSTable contains.

More information about compaction

The following blog posts and videos provide more information from developers that have tested compaction strategies: