Tuning Bloom filters

Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row.

Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row. Bloom filters are unused for range scans, but are used for index scans. Bloom filters are probabilistic sets that allow you to trade memory for accuracy. This means that higher Bloom filter attribute settings bloom_filter_fp_chance use less memory, but will result in more disk I/O if the SSTables are highly fragmented. Bloom filter settings range from 0 to 1.0 (disabled). The default value of bloom_filter_fp_chance depends on the compaction strategy. The LeveledCompactionStrategy uses a higher default value (0.1) because it generally defragments more effectively than the SizeTieredCompactionStrategy, which has a default of 0.01. Memory savings are nonlinear; going from 0.01 to 0.1 saves about one third of the memory.

The settings you choose depend the type of workload. For example, to run an analytics application that heavily scans a particular table, you would want to inhibit the Bloom filter on the table by setting it high.

To view the observed Bloom filters false positive rate and the number of SSTables consulted per read use cfstats in the nodetool utility.

Starting in version 1.2, Bloom filters are stored off-heap so you don't need include it when determining the -Xmx settings (the maximum memory size that the heap can reach for the JVM).

To change the bloom filter property on a table, use CQL. For example:

ALTER TABLE addamsFamily WITH bloom_filter_fp_chance = 0.1;

After updating the value of bloom_filter_fp_chance on a table, Bloom filters need to be regenerated in one of these ways:

You do not have to restart Cassandra after regenerating SSTables.