How is data written?

Understand how Cassandra writes and stores data.

Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with a write of data to disk:
  • Logging data in the commit log
  • Writing data to the memtable
  • Flushing data from the memtable
  • Storing data on disk in SSTables

Logging writes and memtable storage 

When a write occurs, Cassandra stores the data in an in-memory structure called a memtable and, to provide configurable durability, also appends the write to the commit log on disk. The commit log receives every write made to a Cassandra node, and these durable writes survive even if power fails on the node. The memtable is a write-back cache of data partitions that Cassandra looks up by key. The memtable stores writes until it reaches a configurable limit, and is then flushed.
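The two steps described above can be illustrated with a toy model. This is a minimal sketch, not Cassandra's implementation: the class `ToyNode`, the JSON-lines log format, and the partition-count limit are all assumptions made for illustration.

```python
import json
import os
import tempfile


class ToyNode:
    """Toy model of the write path: commit log append, then memtable update."""

    def __init__(self, commit_log_path, memtable_limit=3):
        self.commit_log_path = commit_log_path
        # Hypothetical threshold, counted in partitions for simplicity.
        self.memtable_limit = memtable_limit
        self.memtable = {}  # partition key -> {column: value}

    def write(self, key, columns):
        # 1. Append the mutation to the on-disk commit log for durability.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "columns": columns}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Apply the mutation to the in-memory memtable.
        self.memtable.setdefault(key, {}).update(columns)
        # 3. Report whether the memtable has reached its limit.
        return len(self.memtable) >= self.memtable_limit

    def replay(self):
        """Rebuild the memtable from the commit log, as after a crash."""
        self.memtable = {}
        with open(self.commit_log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable.setdefault(record["key"], {}).update(record["columns"])


log_path = os.path.join(tempfile.mkdtemp(), "commitlog.jsonl")
node = ToyNode(log_path)
node.write("alice", {"city": "Oslo"})
node.write("alice", {"age": 30})
needs_flush = node.write("bob", {"city": "Lima"})

# Simulate a power failure: the in-memory state is lost, the log is not.
recovered = ToyNode(log_path)
recovered.replay()
```

After `replay()`, the recovered memtable holds the same partitions as the original, which is the property the commit log exists to provide.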

Flushing data from the memtable 

When the memtable content exceeds the configurable threshold, the memtable is put in a queue that is flushed to disk. The queue can be configured with the memtable_heap_space_in_mb or memtable_offheap_space_in_mb setting in the cassandra.yaml file. If the data to be flushed exceeds the queue size, Cassandra blocks writes until the next flush succeeds.

To flush the data, Cassandra sorts the memtable contents by token and then writes the data to disk sequentially. It also creates a partition index on disk that maps each token to a location on disk.

You can manually flush a table using nodetool flush. To reduce commit log replay time, the recommended best practice is to flush the memtable before you restart the nodes. Commit log replay is the process of reading the commit log to recover lost writes in the event of interrupted operations.

Data in the commit log is purged after its corresponding data in the memtable is flushed to an SSTable on disk.
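The flush sequence can be sketched as follows: sort by token, write sequentially, record each partition's offset, then purge the commit log entries that are now covered by the data on disk. The `token` function here is an MD5-based stand-in and the JSON row format is invented for the example; Cassandra's partitioners and file formats differ.

```python
import hashlib
import io
import json


def token(key):
    # Stand-in token function (an assumption); Cassandra's partitioners differ.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def flush(memtable, data_file):
    """Write partitions in token order; return {token: byte offset}."""
    partition_index = {}
    for key in sorted(memtable, key=token):
        # Record where this partition starts before writing it.
        partition_index[token(key)] = data_file.tell()
        row = json.dumps({"key": key, "columns": memtable[key]}) + "\n"
        data_file.write(row.encode("utf-8"))
    return partition_index


memtable = {
    "alice": {"city": "Oslo"},
    "bob": {"city": "Lima"},
    "carol": {"city": "Pune"},
}
commit_log = ["alice", "bob", "carol"]  # toy log: one entry per write

sstable = io.BytesIO()  # stands in for the data file on disk
index = flush(memtable, sstable)

# Once the data is safely on disk, the memtable and its log entries can go.
memtable.clear()
commit_log.clear()

# Use the partition index to seek straight to one partition.
sstable.seek(index[token("bob")])
found = json.loads(sstable.readline())
```

Because the rows are written in one pass, the offsets in the index increase in the same order as the tokens.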

Storing data on disk in SSTables 

Memtables and SSTables are maintained per table. SSTables are immutable: they are not written to again after the memtable is flushed. Consequently, a partition is typically stored across multiple SSTable files.

To assist read operations, Cassandra creates these structures for each SSTable:

  • Partition index

    A list of partition keys and the start position of rows in the data file written on disk

  • Partition summary

    A sample of the partition index stored in memory

  • Bloom filter

    A structure stored in memory that checks whether row data might exist in an SSTable before Cassandra accesses that SSTable on disk
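As a rough illustration of two of these structures, the sketch below builds a tiny Bloom filter and samples a partition index into a summary. The class `BloomFilter`, its sizes, the SHA-256 hashing, the sample interval, and the index offsets are all hypothetical choices for the example, not Cassandra's actual parameters.

```python
import hashlib


class BloomFilter:
    """Answers "definitely absent" or "possibly present" for a key."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits  # assumed to be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical partition index: (partition key, start offset in the data file).
partition_index = [("alice", 0), ("bob", 41), ("carol", 80), ("dave", 121)]

# Partition summary: an in-memory sample of every SAMPLE_INTERVAL-th entry.
SAMPLE_INTERVAL = 2
partition_summary = partition_index[::SAMPLE_INTERVAL]

bloom = BloomFilter()
for key, _offset in partition_index:
    bloom.add(key)
```

A key that was added always reports as possibly present, so a negative answer is enough to skip the disk lookup entirely.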

SSTables are files stored on disk. The naming convention for SSTable files changed in Cassandra 2.2 to shorten the file path. The data files are stored in a data directory whose location varies with the installation. Within the data directory, each keyspace has its own directory, which holds a subdirectory for each table. For example, /data/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/la-1-big-Data.db represents a data file. Here ks1 is the keyspace name, which distinguishes the keyspace when streaming or bulk loading data, and cf1 is the table name. A hexadecimal string, 5be396077b811e3a3ab9dc4b9ac088d in this example, is appended to each table name to represent the unique table ID.
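The example path can be pulled apart programmatically. The sketch below is based only on the example above; the field names `version`, `generation`, and `sstable_format` for the parts of la-1-big are an assumed reading of the file name, and the helper `parse_sstable_path` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class SSTablePath(NamedTuple):
    keyspace: str
    table: str
    table_id: str
    version: str         # assumed meaning of "la"
    generation: int      # assumed meaning of "1"
    sstable_format: str  # assumed meaning of "big"
    component: str       # for example "Data.db"


def parse_sstable_path(path):
    """Split a 2.2-style SSTable path into its named parts."""
    p = PurePosixPath(path)
    # The table directory is "<table name>-<table ID>".
    table, _, table_id = p.parent.name.rpartition("-")
    # The file name is assumed to be "<version>-<generation>-<format>-<component>".
    version, generation, sstable_format, component = p.name.split("-", 3)
    return SSTablePath(
        keyspace=p.parent.parent.name,
        table=table,
        table_id=table_id,
        version=version,
        generation=int(generation),
        sstable_format=sstable_format,
        component=component,
    )


parsed = parse_sstable_path(
    "/data/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/la-1-big-Data.db"
)
```

Splitting the directory name from the right keeps table names that themselves contain a hyphen intact.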

Several files are written to store the data, partition summary, statistics, and other information.

Cassandra creates a subdirectory for each table, which allows you to symlink a table to a chosen physical drive or data volume. This lets you move very active tables to faster media, such as SSDs, for better performance, and divide tables across all attached storage devices for better I/O balance at the storage layer.
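The symlink approach might look like the sketch below, which runs against temporary directories rather than a real installation. The helper `relocate_table_dir` and the directory names are hypothetical, and the sketch assumes the node is not running while the directory is moved; check the operational guidance for your version before doing this on a live cluster.

```python
import os
import shutil
import tempfile


def relocate_table_dir(table_dir, fast_volume):
    """Move table_dir onto fast_volume and leave a symlink at the old path."""
    target = os.path.join(fast_volume, os.path.basename(table_dir))
    shutil.move(table_dir, target)
    os.symlink(target, table_dir, target_is_directory=True)
    return target


# Stand-ins for the data directory and for a mount point on faster media.
data_dir = tempfile.mkdtemp()
fast_volume = tempfile.mkdtemp()

table_dir = os.path.join(data_dir, "ks1", "cf1-0123456789abcdef")  # made-up table ID
os.makedirs(table_dir)
with open(os.path.join(table_dir, "la-1-big-Data.db"), "wb") as f:
    f.write(b"placeholder")

target = relocate_table_dir(table_dir, fast_volume)
```

The original path still resolves, so anything that reads the table directory by its usual location continues to find the files.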