Storage engine

About the DataStax Enterprise storage structure and engine.

Unlike a typical relational database that uses a balanced tree (B-tree), the DataStax Enterprise (DSE) database uses a storage structure similar to a log-structured merge tree. Essentially, the database avoids reading before writing. Read-before-write, especially in a large distributed system, can result in high latency and other problems. For example, two clients read the same row at the same time; one overwrites the row to make update A, and the other then overwrites the row to make update B, removing update A. This race condition produces ambiguous query results, making it difficult to determine which update is correct.
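The lost update described above can be reproduced in a few lines. This is a minimal sketch, not DSE code: the `row` dictionary and the `read_row` and `overwrite_row` helpers are hypothetical stand-ins for a store that rewrites whole rows after reading them.

```python
# Hypothetical in-memory row; the names below are illustrative, not a DSE API.
row = {"name": "alice", "email": "old@example.com", "city": "Paris"}

def read_row(store):
    """Return a snapshot of the whole row (the 'read' in read-before-write)."""
    return dict(store)

def overwrite_row(store, new_row):
    """Replace the stored row with the client's modified snapshot."""
    store.clear()
    store.update(new_row)

# Both clients read the same state before either one writes.
snapshot_a = read_row(row)
snapshot_b = read_row(row)

# Client A changes the email and overwrites the row (update A).
snapshot_a["email"] = "new@example.com"
overwrite_row(row, snapshot_a)

# Client B changes the city and overwrites the row (update B).
# Its snapshot still holds the old email, so update A is silently removed.
snapshot_b["city"] = "Berlin"
overwrite_row(row, snapshot_b)

update_a_lost = row["email"] == "old@example.com"
```

After both writes, the row carries client B's city but the original email, so nothing in the stored data shows that update A ever happened.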

To avoid using read-before-write for most writes, the storage engine groups inserts and updates in memory and, at intervals, sequentially writes the data to disk in append mode. Once written to disk, the data is immutable and is never overwritten. Reading data involves combining this immutable, sequentially written data to discover the correct query results. You can use lightweight transactions (LWT) to check the state of the data before writing. However, because an LWT reintroduces a read before the write, this feature is recommended only for limited use.
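The write and read paths described above can be sketched as a toy log-structured store. This is an illustrative simplification under stated assumptions, not the DSE implementation: the class `TinyLSM`, its `flush_threshold` parameter, and the "newest file wins" rule are all hypothetical, and the flushed files are modeled as in-memory tuples rather than files on disk.

```python
class TinyLSM:
    """Toy log-structured store: buffer writes in memory, flush immutable runs."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}        # in-memory buffer of recent writes
        self.sstables = []        # flushed runs, oldest first; never modified
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # No read is performed here: the write only touches the memory buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Append the buffered data as a new sorted, immutable run.
        if not self.memtable:
            return
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Combine sources from newest to oldest; the first match wins.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


db = TinyLSM(flush_threshold=2)
db.write("k1", "v1")
db.write("k2", "v2")           # threshold reached: first run is flushed
db.write("k1", "v1-updated")   # update goes to memory, not to the old run
db.write("k3", "v3")           # threshold reached: second run is flushed
latest = db.read("k1")
```

The first flushed run still contains the old value for `k1`; the update lives in a newer run, and the read resolves the two by preferring the newer one. The sketch omits details a real engine needs, such as per-write timestamps, deletes, and merging old runs together.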

A log-structured engine that avoids overwrites and uses sequential I/O to update data is essential for writing to hard disk drives (HDD) and solid-state drives (SSD). On HDD, random writes require more seek operations than sequential writes, and each seek carries a substantial latency penalty.

For many databases, write amplification is a problem on SSDs. On these drives, memory must be erased before it can be written. Rewriting data requires a portion of the drive to be read, updated, and written to a new location; if the new location was used previously, it must also be erased before the write occurs. Therefore, much larger portions of the drive must be erased and rewritten than the new data actually requires. This phenomenon of write amplification impacts the life and speed of SSDs.
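Write amplification is commonly expressed as the ratio of bytes physically written to the drive to bytes the application asked to write. The arithmetic below is a rough sketch with assumed figures: the 256 KiB erase block and 4 KiB update sizes are hypothetical, and it models the worst case in which a small in-place update forces a whole erase block to be rewritten. Real SSD controllers vary and often do better.

```python
def write_amplification(logical_bytes, physical_bytes):
    """Ratio of bytes physically written to bytes the application requested."""
    return physical_bytes / logical_bytes

ERASE_BLOCK = 256 * 1024   # assumed erase block size, in bytes
UPDATE_SIZE = 4 * 1024     # assumed size of one small in-place update

# Worst case: a 4 KiB in-place update rewrites the entire erase block.
in_place = write_amplification(UPDATE_SIZE, ERASE_BLOCK)

# Sequential append: a full block of new data is written exactly once.
sequential = write_amplification(ERASE_BLOCK, ERASE_BLOCK)
```

Under these assumptions, the in-place update writes 64 times more data than requested, while the sequential append has a ratio of 1.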

Because the DSE database sequentially writes immutable files, thereby limiting write amplification and the drive wear it causes, the database accommodates inexpensive, consumer SSDs extremely well.