Writes

Write requests trigger a multi-stage process known as the write path:

Append the write request to the commit log.
Write the data to memtable storage.
Flush memtables.
Store the data on disk in immutable SSTables.

This process is designed for data durability and consistency. It preserves the integrity of the database when writing data to disk, and it reduces the chance of lost or incomplete writes in the event of a failure on the write path.

The write path

The timestamp for all writes is Universal Time Coordinated (UTC).

Commit log and memtable storage of writes

When a write occurs, the database immediately does the following:

Appends the write to the commit log
Writes the data to a memtable.

Memtables are table-specific, in-memory data structures that temporarily store writes in sorted order until they are flushed to disk as immutable SSTables. Flushes can occur manually or automatically based on configurable thresholds. Functionally, the memtable is a write-back cache of data partitions that the database looks up by key.

The commit log enables configurable durability for writes. When commitlog_sync is set to sync, the commit log receives every write made to a node. These durable writes survive permanently even if the node experiences an outage along the write path.

Commit logs aren’t a guarantee that data will be written to disk, and they cannot prevent all possible dropped writes. For example, if a network outage occurs before a write request reaches the cluster, the database cannot append it to the commit log because it never received the request.

Memtable flushes

To flush data from the memtable, the database writes data to disk in the memtable-sorted order. A partition index is also created on the disk that maps the tokens to a location on disk.

When the memtable content exceeds the configurable threshold, or the commit log space exceeds the commitlog_total_space_in_mb, the memtable is put in a queue that is flushed to disk.

If the data to be flushed exceeds the memtable_cleanup_threshold (which is automatically calculated by default), then the database blocks writes until the next flush succeeds.

Memtables also automatically flush on shutdown.

Manually flush memtables

You can manually flush a table using nodetool flush or nodetool drain (flushes memtables without listening for connections to other nodes).

To reduce the commit log replay time, DataStax recommends flushing the memtable before you restart the nodes. If a node stops working, replaying the commit log restores the writes to the memtable that were there before the node stopped.

Commit log segments and crash recovery

When recovering from a crash, the database can use the commit log to rebuild memtables at startup.

During normal operations, the commit log is divided into segments. Writes are recorded in order, and new segments are created when the current segment reaches the commitlog_segment_size_in_mb.

The database purges commit log segments only after all data in a segment is flushed to disk from the memtable. If the commit log directory reaches the maximum size set by commitlog_total_space_in_mb, then the oldest segments are purged and the corresponding memtables are flushed to disk.

For example, assume you have two tables: Table A has high throughput and Table B has low throughput. Commit log segments contain writes for table A, table B, and system tables. Due to high throughput, Table A’s memtable fill rapidly and flushes frequently. In contrast, Table B’s memtable fill slowly and flushes infrequently. However, if the commit log reaches the maximum size due to heavy writes to Table A, then Table B’s memtable is forced to flush so the commit log segments can be purged.

Table B is flushed into large chunks instead of hundreds of tiny SSTables. If the commit log space and memtable space were equal, then Table B’s memtable would flush every time Table A’s memtable is flushed, despite being much smaller.

If you have multiple tables, increase the configured total space for the commit log to support larger commit log segments and avoid frequent memtable flushes for low-throughput tables.

Data stored on disk in SSTables

The commit log is shared among all tables, but each table has its own memtables and SSTables.

SSTables are immutable. After the memtable is flushed, the database doesn’t write to that SSTable again. Instead, new writes are stored in new memtables that are then flushed to new SSTables. Consequently, a partition is typically stored across multiple SSTable files.

To perform read operations, the database uses SSTable data files and other generated SSTable structures. Read request latency increases relative to the number of SSTables that the database must access to read an entire row. Ideally, one query shouldn’t require more than 3 SSTables. To minimize the number of SSTables, the database periodically runs a compaction process that consolidates SSTables.

SSTable file names, formats, and versions

SSTables are files that are stored on disk in your DSE installation’s data directory. Within the data directory are subdirectories for each keyspace that contain subdirectories for each table. Within each table’s subdirectory are the table’s SSTable data files (*-Data.db) and other SSTable-related files. For example:

/data/cycling/cyclist_expenses-e4f31e122bc511e8891b23da85222d3d/aa-3gix00ryj00zd5-bti-Data.db

SSTable data file paths provide the following information:

Naming conventions for SSTable directories and files
File path segment	Example	Description
`data` directory	`/data`	The `data` directory for a DSE installation.
Keyspace subdirectory	`/cycling`	The keyspace subdirectory name is the same as the actual keyspace name.
Table subdirectory	`/cyclist_expenses-e4f31e122bc511e8891b23da85222d3d`	The table subdirectory name is a combination of the actual table name and a hexadecimal string that serves as a unique identifier for the table to prevent table name collisions within the `data` directory.
SSTable data file	`aa-3gix00ryj00zd5-bti-Data.db`	SSTable data file names are formatted as `VERSION-IDENTIFIER-FORMAT-Data.db`: `VERSION`: The SSTable version name, such as `aa`. `IDENTIFIER`: A generated identifier that uniquely identifies the SSTable file. `FORMAT`: The SSTable format name, such as `bti`. `-Data.db`: A suffix and file type indicating this is an SSTable data file. When upgrading or migrating to DSE, it is important to understand which SSTable versions and formats are supported. For more information, see Supported platforms and compatibility for DSE.

Table subdirectories can be referenced to a specific physical drive or data volume through a symbolic link (symlink). This functionality can help improve performance. For example, you can move your most active tables to faster media like SSDs, or you can divide tables across multiple attached storage devices for better I/O balance at the storage layer.

SSTable structures

For each SSTable, the database creates several structures (files).

For more information about file paths and names, see SSTable file names, formats, and versions. For more information about how the database uses these structures during a read request, see Read operations.

Compression information (CompressionInfo.db): Contains compression metadata, such as the uncompressed data length and chunk offsets.
Data (Data.db): The actual SSTable data.
Digest (Digest.crc32, Digest.adler32, or Digest.sha1): Contains the CRC32, adler32, or SHA1 checksum of the data file (Data.db).
Primary index (Index.db): Index of the row keys with pointers to their positions in the data file.
Bloom filter (Filter.db): A structure stored in memory that checks if row data exists in the memtable before accessing SSTables on disk.
Statistics (Statistics.db): Statistical metadata about the content of the SSTable.
CRC (CRC.db): An uncompressed file containing the CRC32 for the SSTable chunks.
SSTable index summary (SUMMARY.db): A sample of the partition index stored in memory.
SSTable table of contents (TOC.txt): A list of components for an SSTable.
Secondary index (SI_.*.db): A built-in secondary index (SI). Each SSTable can have multiple SIs.

Writes

Commit log and memtable storage of writes

Memtable flushes

Manually flush memtables

Commit log segments and crash recovery

Data stored on disk in SSTables

SSTable file names, formats, and versions

SSTable structures

Was this helpful?

Give Feedback