Schema configuration tuning

Describes frequently changed configuration parameters and how to tune them for best performance.

cassandra.yaml

The location of the cassandra.yaml file depends on the type of installation:
  • Package installations: /etc/dse/cassandra/cassandra.yaml
  • Tarball installations: installation_location/resources/cassandra/conf/cassandra.yaml

DISCLAIMER

This document gives a general recommendation for DataStax Enterprise (DSE) and Apache Cassandra configuration tuning. It requires basic knowledge of DSE/Cassandra. It doesn’t replace the official documentation.

As described in Data model and schema configuration checks, data modeling is a critical part of a project's success. Additionally, the performance of the Cassandra or DSE cluster is influenced by schema configuration. Cassandra and DataStax Enterprise allow you to specify per-table configuration parameters, such as compaction strategy, compression of data, and more. This document describes the configuration parameters that are often changed, and provides advice on their tuning when it’s applicable.

Compaction strategy

Before taking a deep dive into compaction algorithms, let's review how data is written to disk.

Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with writing data to disk:

  1. Logs data in the commit log.
  2. Writes data to the memtable.
  3. Flushes data from the memtable, which is condition-based and set by configuration parameters in the cassandra.yaml file.
  4. Stores data on disk in SSTables.

SSTables are immutable. The data contained in them are never updated or deleted in place. Instead, when a new write is performed, changes are written to a new memtable and then a new SSTable.

A compaction process runs in the background to merge SSTables together. Otherwise, the more SSTables that are touched on the read path, the higher the latencies become. Ideally, when the database is properly tuned, a single query does not touch more than two or three SSTables.

DSE provides three main compaction strategies, each for a specific use case:

  • STCS (SizeTieredCompactionStrategy): Good for mixed workloads (read and write).
  • LCS (LeveledCompactionStrategy): Good for read-heavy use cases. This strategy reduces the spread of data across multiple SSTables.
  • TWCS (TimeWindowCompactionStrategy): Good for time series or for any table with TTL (time-to-live).
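
The compaction strategy is set per table through the compaction option. The sketch below is illustrative only (ks.tbl is a hypothetical table name); only the class key is required, and the other keys described later in this document are strategy-specific subproperties:

-- ks.tbl is a placeholder table name used for illustration.
ALTER TABLE ks.tbl WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
ALTER TABLE ks.tbl WITH compaction = {'class': 'LeveledCompactionStrategy'};
ALTER TABLE ks.tbl WITH compaction = {'class': 'TimeWindowCompactionStrategy'};

Keep in mind that changing the strategy causes each node to reorganize its existing SSTables according to the new strategy, which is an IO-intensive operation.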

Before going into tuning tables, here are some pros and cons for each compaction strategy:

STCS
  Pros:
    • Provides more compaction throughput and more aggressive compaction compared to LCS. In some read-heavy use cases, if data is overly spread, it is better to have a more aggressive compaction strategy.
  Cons:
    • Requires leaving 50% free disk space to allow for compaction, or four times the largest SSTable file on disk (two copies + new file).
    • SSTable files can grow very large and can stay on disk for a long time.
    • Inconsistent performance because data are spread across many SSTable files. When data are frequently updated, single rows can spread across multiple SSTables and impact system performance.

LCS
  Pros:
    • Reduces the number of SSTable files that are touched on the read path, since there is no overlap for a single partition (single SSTable) above level 0.
    • Reduces the free space needed for compaction.
  Cons:
    • Recommended node density is 500 GB, which is fairly low.
    • Limited number (2) of potential compactors due to internal algorithms that require locking.
    • Write amplification consumes more IO.

TWCS
  Pros:
    • Provides the capability to increase node densities.
    • TTL'ed data does not stress the system with tombstones because whole SSTables get dropped in a single action.
  Cons:
    • If not used from the start, reloading existing data with a proper windowing layout is not possible.

Common compaction configuration parameters

If you see a lot of pending compactions, the system probably requires tuning. The two main parameters for compaction are:
concurrent_compactors
The number of parallel threads performing compaction.
Increase compaction_throughput_mb_per_sec first, and change this value only if you have enough IO. STCS and TWCS will benefit if this value is increased, but LCS won't use more than 2 compactors at any time.
compaction_throughput_mb_per_sec
The throughput in MB/s for the compactor process. This is a limit for all compaction threads together, not per thread!
Note: Consider that the more compactions that are performed, the more resources are consumed and the GC pressure is increased.
To retrieve pending compactions, use your preferred monitoring solution or the following commands:
nodetool sjk mx -f Value -mg -b org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
or
nodetool compactionstats
Be aware that tombstones can trigger compaction outside of the compaction strategy, based on the following configuration parameters:
tombstone_compaction_interval
A single-SSTable compaction is triggered based on the estimated tombstone ratio. This option makes the minimum interval between two single-SSTable compactions tunable (default 1 day).
unchecked_tombstone_compaction
When enabled, allows a single-SSTable tombstone compaction to run (at most once per tombstone_compaction_interval) whenever the estimated tombstone ratio is higher than tombstone_threshold, without checking how likely the tombstones are to actually be dropped. This evicts tombstones faster, but the system consumes more resources.
tombstone_threshold
The ratio of garbage-collectable tombstones to all contained columns above which a single-SSTable tombstone compaction is considered.
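
These tombstone options are compaction subproperties, set together with the strategy class. The following is an illustrative sketch only (ks.tbl is a placeholder; the first two values shown match the defaults), and note that the whole compaction map, including class, replaces any previously configured compaction options:

-- Hypothetical example; tombstone_threshold and tombstone_compaction_interval are shown at their defaults.
ALTER TABLE ks.tbl WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'tombstone_threshold': '0.2',
  'tombstone_compaction_interval': '86400',
  'unchecked_tombstone_compaction': 'true'
};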

The following sections contain the list of parameters available by compaction type. Only change these parameters if you have not been able to improve the compaction performance with the above configuration parameters.

Size Tiered Compaction Strategy (STCS)

With this compaction strategy, after the database sees enough SSTables of similar size, in the range [average_size * bucket_low, average_size * bucket_high], it merges them into a single file. As shown below, the files get bigger and bigger as time goes by:
Figure 1. Size tiered compaction after many inserts
Several parameters are available for tuning STCS compaction, but generally only two are required and only for special cases:
min_threshold
The minimum number of SSTables required to trigger a minor compaction.
To increase the probability for compaction, you can reduce this value. However, as more compactions are triggered, the more resources are consumed.
min_sstable_size
STCS groups SSTables into buckets.
The bucketing process groups SSTables that differ in size by less than 50%. The bucketing process is too fine-grained for small SSTables. If your SSTables are small, use min_sstable_size to define a size threshold (in bytes), below which all SSTables belong to one unique bucket.

The best practice is to limit the number of SSTables for a single table between 15 and 25. If you see a number higher than 25, tweak the configuration to compact more often.
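
As an illustration only (ks.tbl and the values are placeholders, not recommendations), both parameters are passed as compaction subproperties:

-- Hypothetical values: a lower min_threshold triggers minor compactions sooner,
-- and min_sstable_size (in bytes) groups all SSTables below 50 MB into one bucket.
ALTER TABLE ks.tbl WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': '3',
  'min_sstable_size': '52428800'
};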

Leveled Compaction Strategy (LCS)

The leveled compaction strategy creates SSTables of a fixed, relatively small size (default 160 MB) that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. The figure below shows a simple representation of the movement of data across the levels:
Figure 2. Leveled compaction — adding SSTables
Configurable parameters for LCS include sstable_size_in_mb, which sets the target size for SSTables.
Note: The LCS algorithm doesn't split a single partition across multiple SSTables, so for very large partitions an SSTable can exceed the target size. This preserves the guarantee of a single SSTable per partition per level.

L0 can be a chaotic space, so be careful to properly tune all memtable parameters as well. In read-heavy use cases, DataStax recommends flushing memtables less frequently than when using STCS.
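
As a hedged sketch (ks.tbl is a placeholder and 160 MB is simply the default made explicit), the target SSTable size is set as a compaction subproperty:

-- Hypothetical example: set the target SSTable size for LCS explicitly.
ALTER TABLE ks.tbl WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '160'
};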

Time Window Compaction Strategy

This compaction strategy supersedes DTCS starting with Cassandra 3.0. It is perfect for all time series use cases or scenarios where data are TTL’ed. The main principle is to bucketize the time horizon by defining a window of compaction where only the current active window compacts using the STCS algorithm. Once a window is completed, the database compacts all files allocated for the bucket into a single file. Within the active window the system behaves as STCS and all STCS parameters are applicable.
Figure 3. How TimeWindowCompactionStrategy works
In addition to the STCS configuration parameters, the TWCS has the following parameters:
compaction_window_unit
Time unit used to define the window size (MINUTES, HOURS, or DAYS).
compaction_window_size
Number of units per window (1, 2, 3, and so on).
Because the first bucket uses the STCS strategy, keep the total number of SSTable files around 20 for optimal compaction performance. In addition, limit the total number of inactive windows below 30 (maximum 50).
Note: Your use case might justify different values. To ensure performance does not degrade over time, be sure to properly test at scale.

To illustrate a time series use case: Imagine an IoT use case with a requirement to retain data for a period of 1 year (TTL = 365 days). Looking at the above recommendations, the window size should be equal to 365/30 => ~12 days, with unit DAYS. With this configuration, only the data received during the last 12 days will be compacted and everything else will stay immutable, thus reducing the compaction overhead on the system.
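
A minimal sketch of that configuration might look as follows (ks.sensor_data is a hypothetical table; the TTL is expressed in seconds):

-- Hypothetical IoT table: 12-day compaction windows and a 365-day default TTL.
ALTER TABLE ks.sensor_data WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '12'
}
AND default_time_to_live = 31536000;  -- 365 days in seconds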

Compression

To decrease disk consumption, Cassandra by default stores data on disk in compressed form. However, incorrectly configured compression may significantly harm performance, so it makes sense to use the correct settings. Existing settings can be obtained by executing the DESCRIBE TABLE command and looking for the compression parameter.

To change settings, use ALTER TABLE <name> WITH compression = {...}. If you change the compression settings, be aware that they apply only to newly created SSTables. If you want to apply them to existing data, use nodetool upgradesstables -a keyspace table. For more information about internals of compression in Cassandra, see Performance Tuning - Compression with Mixed Workloads.

Compression algorithm

By default, tables on disk are compressed using the LZ4 compression algorithm. This algorithm provides a good balance between compression efficiency and additional CPU load arising from use of compression.

Cassandra also supports other compression schemes (set via the sstable_compression field of the compression parameter):
Snappy (SnappyCompressor)
Provides a compression ratio similar to LZ4, but is much slower. It was used by default in earlier versions of Cassandra; if it's still in use, DataStax recommends switching to LZ4.
Note: Cassandra 4.0 has experimental support for the Zstandard compression algorithm.
Deflate (DeflateCompressor)
Provides better compression ratio than other compression algorithms, but generates much more CPU load and is much slower. Only use it after careful testing.
None (empty string)
Disables compression completely. It’s useful for removing overhead from decompression of data during read operations, especially for tables with small rows.

Compressed block size

By default, the chunk size of compressed data is 64 KB. This means that the database needs to read and decompress a whole 64 KB chunk even to extract a small amount of data, which puts additional load on the disk system when it isn't needed. You can tune the chunk size to lower values, like 16 KB or 32 KB. The Performance Tuning - Compression with Mixed Workloads blog shows the effects of using smaller compression chunk sizes, set via the chunk_length_kb attribute of the table.

Be sure to take into account that reducing the chunk length can increase the amount of data stored on disk as the efficiency of compression decreases.
Note: Decrease the chunk size only for tables whose data are accessed randomly.
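
As an illustrative sketch only (ks.tbl is a placeholder; Cassandra 3.0 and later use the keys class and chunk_length_in_kb, while older versions use sstable_compression and chunk_length_kb), switching a table to LZ4 with smaller chunks could look like this:

-- Hypothetical example: LZ4 compression with 16 KB chunks (option keys for Cassandra 3.0+).
ALTER TABLE ks.tbl WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '16'};
-- Existing SSTables keep the old settings until rewritten, for example with:
-- nodetool upgradesstables -a ks tbl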

Read repair chance

Cassandra has two types of the read repair:
foreground
Repair happens when Cassandra detects data inconsistency during the read operation and blocks application operations until the repair process is complete.
background
Done proactively for a percentage of read operations. (Does not apply to all versions.)
The read_repair_chance and dclocal_read_repair_chance settings control the probability that background read repair will be performed in the whole cluster or only in the local datacenter. Unfortunately, this implementation doesn't always work as designed; there are even bugs where a read repair meant to stay in the local datacenter leads to cross-datacenter operations, which can impact system performance (CASSANDRA-9753). In cases of data inconsistency, when read operations are performed with LOCAL_QUORUM or QUORUM consistency levels, foreground read repair happens anyway (inside the local datacenter, or in the whole cluster, respectively), so triggering additional background repairs is unnecessary.
To disable background read repair, set both settings to 0. (This functionality is removed in newer versions; see CASSANDRA-13910.) If you're using a Cassandra or DSE version where it still exists, you can disable background read repair for all tables with the following script on one of the nodes:
for i in `cqlsh -e 'describe schema;' |grep 'CREATE TABLE'|sed -e 's|CREATE TABLE \([^ ]*\) .*|\1|'`; do
 echo "alter table $i with read_repair_chance = 0.0 and dclocal_read_repair_chance = 0.0;"
done| tee alters.cql
cqlsh -f alters.cql

Bloom filter tuning

A bloom filter is an internal data structure that optimizes reading data from disk. It's created for every SSTable file in the system and loaded into off-heap memory. The amount of memory consumed is proportional to the number of partition keys stored in the specific SSTable, and in some cases it can be significant, up to tens of gigabytes.

To check the amount of memory allocated to bloom filters per table, execute nodetool tablestats (or nodetool cfstats on older Cassandra versions), and check the line Bloom filter off heap memory used, which displays the size of the bloom filter in bytes.

You can control the size of the bloom filter by tuning the bloom_filter_fp_chance option on a specific table. Increasing this value increases the false positive ratio, so the bloom filter becomes smaller. For example, increasing the false positive ratio to 10% could decrease memory consumption for the bloom filter by almost half (see the Bloom Filter Calculator). However, an increased false positive ratio can cause more SSTables to be read than necessary. Tune bloom filters only for tables with rarely accessed data, such as an archive where data are stored only for compliance and are read rarely or only by batch jobs.
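
A minimal sketch (ks.archive is a hypothetical table; 0.1 corresponds to the 10% false positive example above):

-- Hypothetical archive table: accept a 10% bloom filter false positive chance.
ALTER TABLE ks.archive WITH bloom_filter_fp_chance = 0.1;
-- New filters apply only to SSTables written afterwards; rewrite existing ones with
-- nodetool upgradesstables -a ks archive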

Garbage collection of tombstones

Cassandra doesn’t delete data directly, but uses a special marker called tombstone. A tombstone shadows the deleted data. For details, see About Deletes and Tombstones in Cassandra. A tombstone is stored in the files for at least the period of time defined by the configuration parameter gc_grace_seconds. The default is 10 days (864000 seconds). Keep in mind that if one or more nodes in a cluster were down during the deletion of the data, those nodes must be returned to the cluster and repair performed during this period of time; otherwise deleted data could be resurrected. If a server was down longer than this period, remove all Cassandra data from it, and perform a node replacement.
Note: On DSE 6.0 and later, Nodesync prevents removal of the tombstones on tables that have Nodesync enabled until Cassandra makes sure that tombstones are synchronized to all replicas. If Nodesync is slow and can’t synchronize data fast enough, it may lead to an increase in the number of tombstones contained in tables for a significant amount of time.

A significant number of tombstones in a table can negatively impact the performance of read operations, and in some cases you may need to decrease the value of the gc_grace_seconds parameter to remove tombstones faster. If this value is changed, you need to ensure that a full token range repair is performed within the specified period of time. Doing this can be challenging if a node was down for a significant amount of time.

Also, too low a value of gc_grace_seconds can impact hinted handoff by preventing the collection and replay of hints. If this happens, a data repair is required.
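
As an illustration only (ks.tbl is a placeholder and 5 days is not a recommendation), gc_grace_seconds is set per table:

-- Hypothetical example: reduce the tombstone retention window to 5 days (432000 seconds).
-- A full token range repair must then complete at least every 5 days.
ALTER TABLE ks.tbl WITH gc_grace_seconds = 432000;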

Caching

Caching of data, actual or auxiliary, influences the read performance of Cassandra and DSE nodes (see How is data read?). For tables, you can tune two things specified in the caching configuration parameter:

  • Caching of the partition keys (the keys argument, possible values are ALL and NONE).

    This cache keeps the mapping of the partition key value to the compression offset map that identifies the compressed block on disk containing the data. If the partition key exists in the key cache, Cassandra can skip some steps when reading the data and thereby increase read performance. The size of the key cache for all tables is specified in cassandra.yaml, and you should use nodetool info to check its efficiency. If the cache is full and the hit ratio is below 90%, it could be a signal to increase the size of the key cache.

    Note: In DSE 6 and later, a new SSTable file format was introduced that doesn’t require caching of the partition keys, so this feature is only useful for files in old formats.
  • Caching of the individual rows per partition (the rows_per_partition argument, possible values are ALL, NONE, or numeric value).

    Caching frequently used rows can speed up data access because the data is stored in memory. Generally use it for mostly-read workloads as changing the rows and partitions invalidates the corresponding entries in the cache and requires re-population of the cache (by reading from the disk).

By default, Cassandra has the following configuration (cache all partition keys and don't cache rows):
caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
If needed, you can change it with ALTER TABLE command (to allow caching of all partition keys, and 10 rows per partition):
ALTER TABLE ks.tbl WITH caching = {'keys': 'ALL', 'rows_per_partition': 10};

Speculative Retry

By default, a coordinator node sends queries to as many replicas as necessary to satisfy the consistency level: one for consistency level ONE, a quorum for QUORUM, and so on. Using the configuration parameter speculative_retry, you can specify when coordinators may query additional replicas. This is useful when the replicas to which the coordinator sent queries are slow or unresponsive. Speculative retries reduce the 95th or 99th percentile tail read latencies, but they put more load on the Cassandra nodes, which may need to handle more read requests than necessary. Keep the default value of 99PERCENTILE unless you know what you're doing.
Note: The speculative_retry setting does not affect reads with consistency level ALL because they already query all replicas.

The speculative_retry configuration parameter accepts the following values (case-insensitive):

  • ALWAYS: The coordinator node sends extra read requests to all other replicas for every read of that table.
  • Xpercentile: The read latency of each table is tracked, and the coordinator node calculates the X percentile of the typical read latency for a given table. The coordinator sends additional read requests once the wait time exceeds the calculated time. The problem with this setting is that when a node is unavailable, it increases the read latencies and, as a result, increases the calculated values.
  • Nms: The coordinator node sends extra read requests to all other replicas when the coordinator node has not received responses within N milliseconds.
  • NONE: The coordinator node does not send additional read requests.
Note: In Cassandra 4.0, speculative_retry may have additional values. See CASSANDRA-13876 and CASSANDRA-14293.
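
A minimal sketch (ks.tbl is a placeholder) showing the default percentile-based setting and a fixed-time alternative:

-- Hypothetical examples: the default 99th percentile setting and a fixed 10 ms threshold.
ALTER TABLE ks.tbl WITH speculative_retry = '99PERCENTILE';
ALTER TABLE ks.tbl WITH speculative_retry = '10ms';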

Memtable flushing

Usually, Cassandra flushes memtable data to disk in the following situations:

  • The memtable flush threshold is met.
  • On shutdown.
  • When either nodetool flush or nodetool drain is executed.
  • When commit logs get full.

It's also possible to flush data from a memtable on a regular cadence. This is regulated by the memtable_flush_period_in_ms configuration parameter. Usually, this parameter is best left at its default value (zero), so memtables are flushed as described above. Tune this parameter only in rare situations and only for specific tables; for example, tables with very rare modifications, where you don't want commit log segments sitting on disk for a long period of time. In this case, set this parameter so the memtable of a given table is flushed, for example, every day or every X hours. Note that this parameter is specified in milliseconds!
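
As an illustration only (ks.tbl is a placeholder; the value corresponds to one day), the flush period is set per table:

-- Hypothetical example: flush this table's memtable at least once every 24 hours.
ALTER TABLE ks.tbl WITH memtable_flush_period_in_ms = 86400000;  -- 24 hours in milliseconds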