About reads

How Cassandra combines results from the active memtable and potentially multiple SSTables to satisfy a read.

Unlike most databases, Cassandra performs random reads from SSDs in parallel with extremely low latency; rotational disks are not recommended. Cassandra reads, as well as writes, data by partition key, eliminating the complex queries a relational database would require.

First, Cassandra checks the Bloom filter. Each SSTable has an associated Bloom filter that reports, before any disk I/O, whether the SSTable might contain data for the requested partition key. A Bloom filter can return false positives but never false negatives, so a negative answer lets Cassandra skip the SSTable entirely.
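As a rough illustration of the technique (a minimal sketch in Java, not Cassandra's implementation; the class name and hash scheme are purely illustrative), a Bloom filter answers "definitely not present" or "possibly present" for a key using a bit array and several hash functions:

    import java.util.Arrays;
    import java.util.BitSet;

    public class SimpleBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        public SimpleBloomFilter(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        // Derive the i-th hash position from two base hashes
        // (Kirsch-Mitzenmacher style double hashing).
        private int position(byte[] key, int i) {
            int h1 = Arrays.hashCode(key);
            int h2 = (h1 >>> 16) | 1; // force an odd second hash
            return Math.floorMod(h1 + i * h2, size);
        }

        public void add(byte[] partitionKey) {
            for (int i = 0; i < hashCount; i++) {
                bits.set(position(partitionKey, i));
            }
        }

        // false => the SSTable definitely has no data for this key: skip it,
        //          no disk I/O at all.
        // true  => the SSTable may have data (possibly a false positive):
        //          continue down the read path.
        public boolean mightContain(byte[] partitionKey) {
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(position(partitionKey, i))) return false;
            }
            return true;
        }
    }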

If the Bloom filter reports that the SSTable may hold the partition, Cassandra checks the partition key cache and takes one of these courses of action (sketched in code after the list):

  • If an index entry is found in the cache:
    • Cassandra goes to the compression offset map to find the compressed block holding the data.
    • Cassandra fetches the compressed data from disk and returns the result set.
  • If an index entry is not found in the cache:
    • Cassandra searches the partition summary to determine the approximate on-disk location of the index entry.
    • Next, Cassandra hits the disk for the first time to fetch the index entry, performing a single seek followed by a sequential read of the columns (a range read) in the SSTable if the columns are contiguous.
    • Cassandra goes to the compression offset map to find the compressed block holding the data.
    • Cassandra fetches the compressed data from disk and returns the result set.
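The branching above can be summarized in code. This is a sketch only: the interfaces and method names below are hypothetical stand-ins for Cassandra's internals, not its real API.

    // Hypothetical minimal interfaces standing in for Cassandra internals.
    interface BloomFilter          { boolean mightContain(byte[] key); }
    interface IndexEntry           { long position(); }
    interface KeyCache             { IndexEntry get(byte[] key); }
    interface PartitionSummary     { long approximatePosition(byte[] key); }
    interface PartitionIndex       { IndexEntry seekAndScan(long indexPosition, byte[] key); }
    interface CompressionOffsetMap { long blockOffsetFor(long uncompressedPosition); }
    interface Row                  {}
    interface Block                { Row extract(byte[] key); }
    interface SSTable {
        BloomFilter bloomFilter();
        KeyCache keyCache();
        PartitionSummary partitionSummary();
        PartitionIndex partitionIndex();
        CompressionOffsetMap compressionOffsetMap();
        Block readCompressedBlock(long diskOffset);
    }

    class SSTableReadPath {
        static Row read(byte[] partitionKey, SSTable sstable) {
            // Bloom filter: a negative answer means the partition is definitely
            // absent, so the SSTable is skipped with no disk I/O at all.
            if (!sstable.bloomFilter().mightContain(partitionKey)) {
                return null;
            }

            // Partition key cache: maps recently read keys to index entries.
            IndexEntry entry = sstable.keyCache().get(partitionKey);

            if (entry == null) {
                // Cache miss: the in-memory partition summary narrows the search
                // to an approximate region of the on-disk partition index...
                long indexPosition =
                        sstable.partitionSummary().approximatePosition(partitionKey);
                // ...then the first disk access (a single seek plus a short
                // sequential read) retrieves the exact index entry.
                entry = sstable.partitionIndex().seekAndScan(indexPosition, partitionKey);
            }

            // Compression offset map: translates the entry's uncompressed position
            // into the on-disk offset of the compressed block holding the data.
            long diskOffset =
                    sstable.compressionOffsetMap().blockOffsetFor(entry.position());

            // Read and decompress that block, then extract the requested partition.
            return sstable.readCompressedBlock(diskOffset).extract(partitionKey);
        }
    }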

In Cassandra 1.2 and later, the Bloom filter and compression offset map are off-heap, which greatly increases the data handling capacity per node. Of the components in memory, only the partition key cache is a fixed size. Other components grow as the data set grows.
  • The Bloom filter grows to approximately 1-2 GB per billion partitions. In the extreme case, you can have one partition per row, so you can easily have billions of these entries on a single machine. The Bloom filter is tunable if you want to trade memory for performance.
  • By default, the partition summary is a sample of the partition index. You configure the sample frequency by changing the index_interval property in the cassandra.yaml file. You can probably increase index_interval to 512 without seeing degradation. Cassandra 1.2.5 reduced the size of the partition summary by using raw longs instead of boxed numbers inside the JVM.
  • The compression offset map grows to 1-3 GB per terabyte of compressed data. The more you compress data, the more compressed blocks you have and the larger the compression offset map becomes. (A back-of-envelope sizing sketch follows this list.)
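For a quick estimate, the rough figures above can be combined arithmetically. The constants below are illustrative midpoints of the quoted ranges, and the node sizes are made up for the example; actual usage depends on tuning and data shape.

    public class OffHeapEstimate {
        // Midpoints of the 1-2 GB and 1-3 GB ranges quoted above (assumptions).
        static final double BLOOM_GB_PER_BILLION_PARTITIONS = 1.5;
        static final double OFFSET_MAP_GB_PER_TB_COMPRESSED = 2.0;

        public static void main(String[] args) {
            double partitionsInBillions = 2.0; // e.g. 2 billion partitions on this node
            double compressedTerabytes  = 1.5; // e.g. 1.5 TB of compressed SSTable data

            double bloomGb  = partitionsInBillions * BLOOM_GB_PER_BILLION_PARTITIONS;
            double offsetGb = compressedTerabytes * OFFSET_MAP_GB_PER_TB_COMPRESSED;

            System.out.printf("Bloom filters:          ~%.1f GB off-heap%n", bloomGb);
            System.out.printf("Compression offset map: ~%.1f GB off-heap%n", offsetGb);
        }
    }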

Compression is enabled by default even though going through the compression offset map consumes CPU resources. Having compression enabled makes the page cache more effective, and it almost always pays off.
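To make the tradeoff concrete, here is a minimal sketch, assuming a simple chunk-indexed layout (illustrative only, not Cassandra's actual on-disk format), of how such an offset map resolves a read:

    public class OffsetMapSketch {
        private final long[] compressedBlockOffsets; // one entry per uncompressed chunk
        private final int chunkSizeBytes;            // uncompressed bytes per chunk

        public OffsetMapSketch(long[] compressedBlockOffsets, int chunkSizeBytes) {
            this.compressedBlockOffsets = compressedBlockOffsets;
            this.chunkSizeBytes = chunkSizeBytes;
        }

        // Translate an uncompressed position into the on-disk offset of the
        // compressed block that must be read. Serving any position inside a
        // chunk means decompressing the whole chunk, which is where the CPU
        // cost mentioned above comes from.
        public long blockOffsetFor(long uncompressedPosition) {
            int chunkIndex = (int) (uncompressedPosition / chunkSizeBytes);
            return compressedBlockOffsets[chunkIndex];
        }
    }

Because the map holds one offset per chunk, its size scales directly with the number of compressed blocks, consistent with the growth behavior described above.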