Schema configuration tuning
Describes frequently changed configuration parameters and how to tune them for best performance.
cassandra.yaml
The location of the cassandra.yaml file depends on the type of installation:

| Installation type | Location |
|---|---|
| Package installations | /etc/dse/cassandra/cassandra.yaml |
| Tarball installations | installation_location/resources/cassandra/conf/cassandra.yaml |
DISCLAIMER This document gives general recommendations for DataStax Enterprise (DSE) and Apache Cassandra™ configuration tuning. It requires basic knowledge of DSE/Cassandra and does not replace the official documentation.
As described in Data model and schema configuration checks, data modeling is a critical part of a project's success. Additionally, the performance of the Cassandra or DSE cluster is influenced by schema configuration. Cassandra and DataStax Enterprise allow you to specify per-table configuration parameters, such as compaction strategy, compression of data, and more. This document describes the configuration parameters that are often changed, and provides advice on their tuning when it’s applicable.
Compaction strategy
Before taking a deep dive into compaction algorithms, let's review how data is written to disk.
Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with writing data to disk:
- Logs data in the commit log.
- Writes data to the memtable.
- Flushes data from the memtable, which is condition-based and set by configuration parameters in the cassandra.yaml file.
- Stores data on disk in SSTables.
SSTables are immutable: the data contained in them is never updated or deleted in place. Instead, when a new write is performed, changes are written to a new memtable and eventually flushed to a new SSTable.
A compaction process runs in the background to merge SSTables together; otherwise, the more SSTables touched on the read path, the higher the latencies become. Ideally, when the database is properly tuned, a single query does not touch more than two or three SSTables.
DSE provides three main compaction strategies, each for a specific use case:
- STCS (SizeTieredCompactionStrategy): Good for mixed workloads (read and write).
- LCS (LeveledCompactionStrategy): Good for read-heavy use cases. This strategy reduces the spread of data across multiple SSTables.
- TWCS (TimeWindowCompactionStrategy): Good for time series or for any table with a TTL (time-to-live).
Before going into tuning tables, here are some pros and cons for each compaction strategy:
| Compaction | Pros | Cons |
|---|---|---|
| STCS | Provides more compaction throughput and more aggressive compaction compared to LCS. In some read-heavy use cases, if data is overly spread, it is better to have a more aggressive compaction strategy. | Requires leaving 50% free disk space to allow for compaction, or four times the largest SSTable file on disk (two copies + the new file). SSTable files can grow very large and stay on disk for a long time. Performance is inconsistent because data is spread across many SSTable files. When data is frequently updated, single rows can spread across multiple SSTables and impact system performance. |
| LCS | Reduces the number of SSTable files touched on the read path, since there is no overlap for a single partition (single SSTable) above level 0. Reduces the free space needed for compaction. | Recommended node density is 500 GB, which is fairly low. Limited number (2) of potential compactors due to internal algorithms that require locking. Write amplification consumes more IO. |
| TWCS | Provides the capability to increase node densities. TTL data does not stress the system with tombstones because SSTables get dropped in a single action. | If not used from the start, reloading data with a proper windowing layout is not possible. |
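The compaction strategy is configured per table. As a minimal sketch (the ks.tbl table name is hypothetical, and the class can be any of the three strategies above), a strategy could be changed with CQL like this:
ALTER TABLE ks.tbl WITH compaction = {'class': 'LeveledCompactionStrategy'};
Be aware that changing the strategy typically causes existing SSTables to be reorganized by background compactions, which can be IO-intensive.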
Common compaction configuration parameters
- concurrent_compactors
- The number of parallel threads performing compaction.
- compaction_throughput_mb_per_sec
- The throughput in MB/s for the compactor process. This is a limit for all compaction threads together, not per thread!
To monitor pending compactions, use either of the following commands:
nodetool sjk mx -f Value -mg -b org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
nodetool compactionstats
- tombstone_compaction_interval
- A single-SSTable compaction is triggered based on the estimated tombstone ratio. This option makes the minimum interval between two single-SSTable compactions tunable (default 1 day).
- unchecked_tombstone_compaction
- Allows a single-SSTable compaction to run as often as every tombstone_compaction_interval (default 1 day), as long as the estimated tombstone ratio is higher than tombstone_threshold. To evict tombstones faster, you can reduce this interval, but the system will consume more resources.
- tombstone_threshold
- The ratio of garbage-collectable tombstones to all contained columns that is required to trigger a single-SSTable compaction.
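Note that concurrent_compactors and compaction_throughput_mb_per_sec are set in cassandra.yaml, while the tombstone-related options are compaction subproperties set per table. A minimal sketch for a hypothetical ks.tbl table using STCS (the values are illustrative, and the full compaction map, including the class, must be supplied when altering it):
ALTER TABLE ks.tbl WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'tombstone_threshold': 0.2, 'unchecked_tombstone_compaction': true, 'tombstone_compaction_interval': 43200};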
The following sections contain the list of parameters available by compaction type. Only change these parameters if you have not been able to improve the compaction performance with the above configuration parameters.
Size Tiered Compaction Strategy (STCS)
STCS groups SSTables of similar size into buckets. When enough SSTables (at least min_threshold) fall within the size range [average_size * bucket_low, average_size * bucket_high], the strategy merges them into a single file. As a result, the files get bigger and bigger as time goes by.
- min_threshold
- The minimum number of SSTables required to trigger a minor compaction.
- min_sstable_size
- STCS groups SSTables into buckets of similar size; SSTables smaller than min_sstable_size are all placed into the same bucket.
The best practice is to limit the number of SSTables for a single table to between 15 and 25. If you see more than 25 SSTables, tweak the configuration to compact more often.
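As a sketch (illustrative values, hypothetical ks.tbl table), you can make STCS compact more often by lowering min_threshold or widening the bucket range:
ALTER TABLE ks.tbl WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 3, 'bucket_high': 2.0};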
Leveled Compaction Strategy (LCS)

The main parameter for LCS is sstable_size_in_mb, which sets the target size for SSTables. Because L0 can be a chaotic space, be careful to properly tune all memtable parameters as well. In read-heavy use cases, DataStax recommends flushing memtables less frequently than when using STCS.
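A minimal sketch of switching a read-heavy table (hypothetical ks.tbl) to LCS; 160 MB is shown only as an illustrative target SSTable size:
ALTER TABLE ks.tbl WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};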
Time Window Compaction Strategy (TWCS)

- compaction_window_unit
- Time unit for defining the window size (milliseconds, seconds, hours, and so on).
- compaction_window_size
- Number of units per window (1, 2, 3, and so on).
To illustrate a time series use case: imagine an IoT workload with a requirement to retain data for 1 year (TTL = 365 days). Keeping the total number of windows to roughly 30, the window size should be 365/30 ≈ 12 days with the unit DAYS. With this configuration, only the data received during the last 12 days is compacted; everything else stays immutable, reducing the compaction overhead on the system.
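A sketch of the table settings for this scenario, assuming a hypothetical ks.sensor_data table; default_time_to_live is expressed in seconds (365 days = 31536000):
ALTER TABLE ks.sensor_data WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': 12} AND default_time_to_live = 31536000;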
Compression
To decrease disk consumption, Cassandra by default stores data on disk in compressed form. However, incorrectly configured compression can significantly harm performance, so it makes sense to use the correct settings. You can obtain the existing settings by executing the DESCRIBE TABLE command and looking at the compression parameter.
To change the settings, use ALTER TABLE <name> WITH compression = {...}. If you change the compression settings, be aware that they apply only to newly created SSTables. If you want to apply them to existing data, use nodetool upgradesstables -a keyspace table. For more information about the internals of compression in Cassandra, see Performance Tuning - Compression with Mixed Workloads.
Compression algorithm
By default, tables on disk are compressed using the LZ4 compression algorithm. This algorithm provides a good balance between compression efficiency and additional CPU load arising from use of compression.
Other compression algorithms can be specified in the sstable_compression field of the compression parameter:
- Snappy (SnappyCompressor)
- Provides a compression ratio similar to LZ4, but is much slower. It was the default in earlier versions of Cassandra; if it's still in use, DataStax recommends switching to LZ4. Note: Cassandra 4.0 has experimental support for the Zstandard compression algorithm.
- Deflate (DeflateCompressor)
- Provides better compression ratio than other compression algorithms, but generates much more CPU load and is much slower. Only use it after careful testing.
- None (empty string)
- Disables compression completely. It’s useful for removing overhead from decompression of data during read operations, especially for tables with small rows.
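For example, a table still compressed with Snappy could be switched to LZ4 as sketched below, using the sstable_compression field name referenced in this document (newer Cassandra versions also accept 'class' instead); ks.tbl is a hypothetical table:
ALTER TABLE ks.tbl WITH compression = {'sstable_compression': 'LZ4Compressor'};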
Compressed block size
By default, the chunk size of compressed data is 64 KB. This means the database must read and decompress a whole 64 KB chunk even when it needs only a small portion of the data, which puts additional load on the disk system. You can tune the chunk size to lower values, such as 16 KB or 32 KB. The Performance Tuning - Compression with Mixed Workloads blog shows the effects of using smaller compression chunk sizes by changing the value of the chunk_length_kb attribute of the table.
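A sketch of reducing the chunk size to 16 KB while keeping LZ4, again for a hypothetical ks.tbl; as noted above, the change affects only newly written SSTables, so run nodetool upgradesstables -a on the keyspace and table to rewrite existing data:
ALTER TABLE ks.tbl WITH compression = {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': 16};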
Read repair chance
Read repair synchronizes replicas that are found to be inconsistent during reads, and it can happen in two ways:
- foreground
- Repair happens when Cassandra detects data inconsistency during the read operation and blocks application operations until the repair process is complete.
- background
- Done proactively for a percentage of read operations. (Does not apply to all versions.)
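In Cassandra/DSE versions that still support them (these table options were removed in Cassandra 4.0), the background behavior is controlled by the read_repair_chance and dclocal_read_repair_chance table options. A sketch that disables cross-datacenter background read repair and keeps a 10% chance within the local datacenter, for a hypothetical ks.tbl:
ALTER TABLE ks.tbl WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.1;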
Bloom filter tuning
A Bloom filter is an internal data structure that optimizes reading data from disk. One is created for every SSTable file in the system and loaded into off-heap memory. The amount of memory consumed is proportional to the number of partition keys stored in the specific SSTable; in some cases it can be significant, up to tens of gigabytes.
To check the amount of memory allocated to bloom filters per table, execute nodetool tablestats (or nodetool cfstats on older Cassandra versions) and check the number in the line Bloom filter off heap memory used, which displays the size of the bloom filter in bytes.
You can control the size of the bloom filter by tuning the bloom_filter_fp_chance option on a specific table. Increasing this value raises the allowed false positive ratio, which makes the bloom filter smaller. For example, increasing the false positive ratio to 10% can decrease the memory consumed by the bloom filter by almost half (see the Bloom Filter Calculator). However, an increased false positive ratio can cause more SSTables to be read than necessary. Tune bloom filters only for tables with rarely accessed data, such as an archive where data is stored only for compliance and is read rarely or only by batch jobs.
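For example, for a rarely read archive table (the ks.archive name below is hypothetical), the false positive chance could be raised to 0.1. The new value applies to SSTables written after the change; existing SSTables keep their bloom filters until they are rewritten, for example by compaction or nodetool upgradesstables -a:
ALTER TABLE ks.archive WITH bloom_filter_fp_chance = 0.1;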
Garbage collection of tombstones
When data is deleted, Cassandra does not remove it from disk immediately; instead, it writes a tombstone that can be garbage-collected by compaction only after the period defined by gc_grace_seconds. The default is 10 days (864000 seconds). Keep in mind that if one or more nodes in the cluster were down when the data was deleted, those nodes must be returned to the cluster and repaired within this period of time; otherwise deleted data could be resurrected. If a server was down longer than this period, remove all Cassandra data from it and perform a node replacement.
A significant number of tombstones in a table can negatively impact the performance of read operations, and in some cases you may need to decrease the value of the gc_grace_seconds parameter to remove tombstones faster. If you change this value, you need to ensure that a full token range repair is performed within the specified period of time, which can be challenging if a node was down for a significant amount of time.
Also, setting gc_grace_seconds too low can impact hinted handoff by preventing hints from being collected and replayed; if this happens, a data repair is required.
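A sketch of lowering gc_grace_seconds to 5 days for a hypothetical ks.tbl; do this only if you can guarantee that a full token range repair completes within that window:
ALTER TABLE ks.tbl WITH gc_grace_seconds = 432000;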
Caching
Caching of data, actual or auxiliary, influences the read performance of Cassandra and DSE nodes (see How is data read?). For tables, you can tune two things specified in the caching configuration parameter:
- Caching of the partition keys (the keys argument; possible values are ALL and NONE). This cache keeps the mapping of the partition key value to the compression offset map that identifies the compressed block on disk containing the data. If the partition key exists in the key cache, Cassandra can skip some steps when reading the data and thus increase read performance. The size of the key cache for all tables is specified in cassandra.yaml, and you should use nodetool info to check its efficiency. If the cache is full and the hit ratio is below 90%, it could signal that you should increase the size of the key cache. Note: In DSE 6 and later, a new SSTable file format was introduced that doesn't require caching of the partition keys, so this feature is only useful for files in old formats.
- Caching of the individual rows per partition (the rows_per_partition argument; possible values are ALL, NONE, or a numeric value). Caching frequently used rows can speed up data access because the data is stored in memory. Generally, use it for mostly-read workloads, because changing rows and partitions invalidates the corresponding entries in the cache and requires re-populating the cache by reading from disk.
The default caching configuration is:
caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
For example, to also cache the first 10 rows of each partition:
ALTER TABLE ks.tbl WITH caching = {'keys': 'ALL', 'rows_per_partition': 10};
Speculative Retry
Speculative retry allows the coordinator node to send redundant read requests to other replicas when the replicas it initially contacted are slow to respond. The speculative_retry configuration parameter has the following values (case-insensitive):
- ALWAYS: The coordinator node sends extra read requests to all other replicas for every read of that table.
- Xpercentile: The read latency of each table is tracked, and the coordinator node calculates the X percentile of the typical read latency for a given table. The coordinator sends additional read requests once the wait time exceeds the calculated value. The problem with this setting is that when a node is unavailable, read latencies increase, and as a result so do the calculated values.
- Nms: The coordinator node sends extra read requests to all other replicas when the coordinator node has not received responses within N milliseconds.
- NONE: The coordinator node does not send additional read requests.
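For example, to use the percentile-based behavior (99percentile is also the usual default), you could set the option as sketched below for a hypothetical ks.tbl:
ALTER TABLE ks.tbl WITH speculative_retry = '99percentile';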
Memtable flushing
Usually, Cassandra flushes memtable data to disk in the following situations:
- The memtable flush threshold is met.
- On shutdown.
- When either nodetool flush or nodetool drain is executed.
- When commit logs get full.
It's also possible to explicitly flush data from a memtable on a regular cadence. This is regulated by the memtable_flush_period_in_ms configuration parameter. Usually, this parameter is best left at its default value (zero) so memtables are flushed as described above. Tune this parameter only in rare situations and only for specific tables, for example, tables with very rare modifications whose commit logs you don't want sitting on disk for a long period of time. In this case, set the parameter so that the memtable of the given table is flushed every day or every X hours. Note that this parameter is expressed in milliseconds.
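A sketch of flushing the memtable of a rarely modified, hypothetical ks.tbl once a day (24 hours = 86400000 milliseconds):
ALTER TABLE ks.tbl WITH memtable_flush_period_in_ms = 86400000;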