How is data written?
The DataStax Enterprise database processes data at several stages on the write path, starting with the immediate logging of a write and ending with a write of data to disk:
- Logging data in the commit log
- Writing data to the memtable
- Flushing data from the memtable
- Storing data on disk in SSTables
The timestamp for all writes is in UTC (Coordinated Universal Time).
When a write occurs, the database stores the data in a memory structure called a memtable.
To provide configurable durability, the database also appends writes to the commit log on disk.
When `commitlog_sync` is set to `sync` in `cassandra.yaml`, the commit log receives every write made to a node.
These durable writes survive permanently even if power fails on a node.
The memtable is a write-back cache of data partitions that the database looks up by key.
The memtable stores write operations in sorted order until it reaches a configurable limit, and is then flushed.
To flush data from the memtable, the database writes the data to disk in the memtable-sorted order. It also creates a partition index on disk that maps the tokens to a location on disk.
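The stages described so far can be illustrated with a small in-memory model. This is only a sketch, not the database's implementation: the `ToyNode` class, its `flush_limit` parameter (counted in partitions), and the use of plain partition keys in place of tokens are all assumptions made for illustration. Clearing the whole commit log after a flush is also a simplification.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, flush_limit=3):
        self.commit_log = []            # append-only record of every write
        self.memtable = {}              # partition key -> value, held in memory
        self.sstables = []              # each flush adds one immutable SSTable
        self.flush_limit = flush_limit  # hypothetical limit, in partitions

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_limit:
            self.flush()

    def flush(self):
        # Write partitions in sorted order and build an index that maps
        # each key (standing in for a token) to its position in the data.
        rows = sorted(self.memtable.items())
        index = {key: position for position, (key, _) in enumerate(rows)}
        self.sstables.append({"data": tuple(rows), "index": index})
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed writes need no replay


node = ToyNode(flush_limit=3)
for key in ["pear", "apple", "mango"]:
    node.write(key, key.upper())

print(node.sstables[0]["data"])
print(node.sstables[0]["index"]["mango"])
```

The third write reaches the limit, so the memtable is emptied and its contents land in a single sorted, read-only structure.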
When the memtable content exceeds the configurable threshold, or the commit log space exceeds `commitlog_total_space_in_mb` in `cassandra.yaml`, the memtable is put in a queue that is flushed to disk.
If the data to be flushed exceeds `memtable_cleanup_threshold` in `cassandra.yaml` (which is calculated automatically), the database blocks writes until the next flush succeeds.
You can manually flush a table using `nodetool flush` or `nodetool drain` (which flushes memtables and stops listening for connections from other nodes).
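A manual flush can be scripted. The following is a minimal sketch that assumes `nodetool` is on the `PATH` of the node where the script runs; the helper names `nodetool_command` and `run_nodetool` are hypothetical, and the keyspace and table names are taken from the example path used later on this page.

```python
import subprocess


def nodetool_command(action, keyspace=None, tables=()):
    """Build the argument list for `nodetool flush` or `nodetool drain`."""
    if action not in ("flush", "drain"):
        raise ValueError("action must be 'flush' or 'drain'")
    command = ["nodetool", action]
    # Only flush accepts an optional keyspace followed by table names.
    if action == "flush" and keyspace:
        command.append(keyspace)
        command.extend(tables)
    return command


def run_nodetool(action, keyspace=None, tables=()):
    """Run the command on the local node; raises if nodetool exits non-zero."""
    return subprocess.run(
        nodetool_command(action, keyspace, tables),
        check=True, capture_output=True, text=True,
    )


print(nodetool_command("flush", "ks1", ["cf1"]))
print(nodetool_command("drain"))
```

Building the argument list separately from running it makes the command easy to inspect or log before it touches a live node.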
To reduce the commit log replay time, DataStax recommends flushing the memtable before you restart the nodes.
If a node stops working, replaying the commit log restores the writes to the memtable that were there before the node stopped.
The database uses the commit log to rebuild memtables.
The commit log is divided into segments.
Writes are recorded in order, and a new segment is created when the current segment reaches the size set by `commitlog_segment_size_in_mb` in `cassandra.yaml`.
The database purges commit log segments only after all the data in a segment has been flushed to disk from the memtable.
If the commit log directory reaches the maximum size (`commitlog_total_space_in_mb` in `cassandra.yaml`), the oldest segments are purged and the corresponding tables are flushed to disk.
For example, consider the following two tables:
- Table A has extremely high throughput
- Table B has very low throughput
All the commit log segments contain writes for both Table A and Table B, as well as system tables. Table A’s memtable fills up rapidly and gets flushed frequently, while Table B’s memtable fills up slowly and is rarely flushed. When the commit log reaches the maximum size, it forces Table B’s memtable to flush and then purges the segments.
As a result, Table B is flushed in large chunks instead of hundreds of tiny SSTables. If the commit log space and memtable space were equal, Table B’s memtable would flush every time Table A’s memtable is flushed, despite being much smaller. To summarize, if there is more than one table, it makes sense to have a larger space for commit log segments.
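The two-table example can be traced with a toy model of a segmented commit log. This is a sketch under simplifying assumptions, not the database's code: sizes are counted in writes and segments rather than megabytes, so `segment_size` and `max_segments` are hypothetical stand-ins for `commitlog_segment_size_in_mb` and `commitlog_total_space_in_mb`, and the `ToyCommitLog` class is invented for illustration.

```python
class ToyCommitLog:
    """Segments roll over when full and are purged once no table is dirty."""

    def __init__(self, segment_size, max_segments):
        self.segment_size = segment_size  # writes per segment
        self.max_segments = max_segments  # segments allowed before forcing flushes
        self.segments = [self._new_segment()]
        self.forced_flushes = []          # tables flushed to free log space

    @staticmethod
    def _new_segment():
        return {"writes": 0, "dirty": set()}

    def _purge_clean(self):
        # Drop closed segments with no unflushed data; keep the active one.
        closed = [s for s in self.segments[:-1] if s["dirty"]]
        self.segments = closed + [self.segments[-1]]

    def mark_flushed(self, table):
        # A memtable flush makes all of that table's logged writes obsolete.
        for segment in self.segments:
            segment["dirty"].discard(table)
        self._purge_clean()

    def append(self, table):
        if self.segments[-1]["writes"] >= self.segment_size:
            self.segments.append(self._new_segment())
            self._purge_clean()
            if len(self.segments) > self.max_segments:
                # Out of space: flush whatever keeps the oldest segment alive.
                for name in sorted(self.segments[0]["dirty"]):
                    self.forced_flushes.append(name)
                    self.mark_flushed(name)
        active = self.segments[-1]
        active["writes"] += 1
        active["dirty"].add(table)


log = ToyCommitLog(segment_size=4, max_segments=3)
for _ in range(4):
    for table in ["A", "A", "A", "B"]:
        log.append(table)
    log.mark_flushed("A")  # Table A's memtable fills up and flushes on its own

print(log.forced_flushes)  # ['B']
print(len(log.segments))   # 1
```

In this run, Table B's single write per segment keeps three segments alive until the fourth segment is needed. At that point Table B is flushed once, with three writes accumulated, and all three old segments are purged together.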
Memtables and SSTables are maintained per table. The commit log is shared among tables. SSTables are immutable: they are not written to again after the memtable is flushed. Consequently, a partition is typically stored across multiple SSTable files. A number of other SSTable structures exist to assist read operations.
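Because SSTables are never rewritten in place, later writes to the same partition end up in different files, and a read has to combine them. The sketch below illustrates that idea only. The data layout (a dict of columns holding `(timestamp, value)` pairs) and the `read_partition` helper are hypothetical, and resolving conflicts by keeping the value with the newest write timestamp is an assumption of this example.

```python
def read_partition(key, sstables):
    """Merge the fragments of one partition found in several SSTables.

    Each SSTable is modeled as: partition key -> {column: (timestamp, value)}.
    """
    merged = {}
    for sstable in sstables:
        for column, (timestamp, value) in sstable.get(key, {}).items():
            # Keep the value with the newest write timestamp (assumed rule).
            if column not in merged or timestamp > merged[column][0]:
                merged[column] = (timestamp, value)
    return {column: value for column, (_, value) in merged.items()}


older = {"user1": {"name": (1, "Ann"), "city": (1, "Oslo")}}
newer = {"user1": {"city": (2, "Rome")}}

print(read_partition("user1", [older, newer]))  # {'name': 'Ann', 'city': 'Rome'}
```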
SSTable names and versions
SSTables are files stored on disk. The data files are stored in a data directory that varies with installation. For each keyspace, a directory within the data directory stores each table.
/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/la-1-big-Data.db represents a data file.
ks1 represents the keyspace name to distinguish the keyspace for streaming or bulk loading data.
In this example, a hexadecimal string, `5be396077b811e3a3ab9dc4b9ac088d`, is appended to the table name to represent the unique table ID.
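The example path can be split into its parts programmatically. This is a minimal sketch: the `parse_sstable_path` helper is hypothetical, and reading the file name `la-1-big-Data.db` as version, generation, format, and component is an assumption about this naming scheme rather than something stated above.

```python
from pathlib import PurePosixPath


def parse_sstable_path(path):
    """Split an SSTable data file path into keyspace, table, and file parts."""
    p = PurePosixPath(path)
    # The table directory is "<table name>-<table ID>".
    table, _, table_id = p.parent.name.rpartition("-")
    # Assumed file name layout: <version>-<generation>-<format>-<component>.
    version, generation, fmt, component = p.name.split("-", 3)
    return {
        "keyspace": p.parent.parent.name,
        "table": table,
        "table_id": table_id,
        "version": version,
        "generation": int(generation),
        "format": fmt,
        "component": component,
    }


info = parse_sstable_path(
    "/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/la-1-big-Data.db"
)
print(info["keyspace"], info["table"], info["component"])  # ks1 cf1 Data.db
```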
The database creates a subdirectory for each table, which can be linked to a chosen physical drive or data volume through a symbolic link (symlink). This capability lets you move very active tables to faster media, such as SSDs, to improve performance, and divide tables across all attached storage devices for better I/O balance at the storage layer.
For each SSTable, the database creates the following structures:
- Data: The SSTable data
- Primary Index: Index of the row keys with pointers to their positions in the data file
- Bloom filter: A structure stored in memory that checks whether row data might exist in an SSTable before the database accesses it on disk
- Compression Information: A file holding information about uncompressed data length, chunk offsets, and other compression information
- Statistics: Statistical metadata about the content of the SSTable
- Digest: A file holding the adler32 checksum of the data file
- CRC: A file holding the CRC32 for chunks in an uncompressed file
- SSTable Index Summary: A sample of the partition index stored in memory
- SSTable Table of Contents: A file that stores the list of all components for the SSTable
- Secondary Index (`SI_.*.db`): Built-in secondary index. Multiple SIs may exist per SSTable
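The Bloom filter listed above is an in-memory membership check that lets a read skip disk access when a key is definitely absent. The sketch below shows the general technique only. The `ToyBloomFilter` class, its sizes, and the use of SHA-256 are assumptions for illustration and do not reflect the database's actual filter.

```python
import hashlib


class ToyBloomFilter:
    """Answers "definitely absent" or "possibly present" for a key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[position] for position in self._positions(key))


bloom = ToyBloomFilter()
bloom.add("user1")
print(bloom.might_contain("user1"))  # True
```

A key that was added always reports `True`. A key that was never added usually reports `False`, but can occasionally report `True`, in which case the read proceeds to disk and finds nothing.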