How are consistent read and write operations handled?

An introduction to how DataStax Enterprise extends eventual consistency with tunable consistency to vary the consistency of data read and written.

Consistency refers to how up-to-date and synchronized all replicas of a row of data are at any given moment. Ongoing repair operations in DataStax Enterprise (DSE) ensure that all replicas of a row will eventually be consistent. Repairs work to decrease the variability in replica data, but constant data traffic through a widely distributed system can lead to inconsistency (stale data) at any time. DSE is an AP (highly available and partition tolerant) system according to the CAP theorem. DSE has flexibility in its configuration, and can perform more like a CP (consistent and partition tolerant) system according to the CAP theorem, depending on the application requirements. Two important consistency features to understand are tunable consistency and linearizable consistency.

Tunable consistency

To ensure the database can provide the proper levels of consistency for its reads and writes, DSE extends the concept of eventual consistency by offering tunable consistency. The consistency level can be tuned for each operation, or set globally for a cluster or datacenter. You can vary the consistency for individual read or write operations so that the data returned is more or less consistent, as required by the client application. This allows DSE to act more like a CP or AP system, depending on the application requirements.

There is a tradeoff between operation latency and consistency: higher consistency incurs higher latency, and lower consistency permits lower latency. You can control latency by tuning consistency.

Note: It is not possible to tune a distributed database into a completely CA system. See You Can't Sacrifice Partition Tolerance for a more detailed discussion.

The consistency level determines the number of replicas that must acknowledge the read or write operation success to the client application. For read operations, the read consistency level specifies how many replicas must respond to a read request before returning data to the client application. If a read operation reveals inconsistency among replicas, the database initiates a read repair to update the inconsistent data.

For write operations, the write consistency level specifies how many replicas must respond to a write request before the write is considered successful. Even at low consistency levels, the database writes to all replicas of the partition key, including replicas in other datacenters. The write consistency level only specifies when the coordinator node can report to the client application that the write operation is considered complete. Write operations use hinted handoffs to ensure the writes are completed when replicas are down or otherwise not responsive to the write request.

Typically, a client specifies a consistency level that is less than the replication factor specified by the keyspace. Another common practice is to write at a consistency level of QUORUM and read at a consistency level of QUORUM. The choices made depend on the client application's needs. DSE provides maximum flexibility for application design.

Linearizable consistency

In ACID terms, linearizable consistency (or serial consistency) is a serial (immediate) isolation level for lightweight transactions. DSE does not use employ traditional mechanisms like locking or transactional dependencies when concurrently updating multiple rows or tables.

However, some operations must be performed in sequence and not interrupted by other operations. For example, duplications or overwrites in user-account creation can have serious consequences. Situations like race conditions (two clients updating the same record) can introduce inconsistency across replicas. Writing with high consistency does nothing to reduce this. You can apply linearizable consistency to a unique identifier, like a user ID or email address, although it is not required for all aspects of the user's account. Serial operations for these elements can be implemented in the database with the Paxos consensus protocol, which uses a quorum-based algorithm to ensure that at least some surviving processor retains knowledge of search results in the event of failure.

Lightweight transactions can be implemented without the need for a master database or two-phase commit process. Lightweight transaction write operations use the serial consistency level for Paxos consensus and the regular consistency level for writing to the table. For more information, see Lightweight Transactions.

Calculating consistency

Reliability of read and write operations depends on the consistency used to verify the operation. Strong consistency can be guaranteed when the following condition is true:

R + W > N

where

R is the consistency level of read operations
W is the consistency level of write operations
N is the number of replicas

For example, if the replication factor is 3, then the consistency level of the reads and writes combined must be at least 4. Read operations using 2 out of 3 replicas to verify the value, and write operations using 2 out of 3 replicas to verify the value will result in strong consistency. If fast write operations are required, but strong consistency is still desired, the write consistency level is lowered to 1, but now read operations have to verify a matched value on all 3 replicas. Writes will be fast, but reads will be slower.

Eventual consistency occurs if the following condition is true:

R + W <= N

If the replication factor is 3, then the consistency level of the reads and writes combined are 3 or less. For example, read operations using QUORUM (2 out of 3 replicas) to verify the value, and write operations using ONE (1 out of 3 replicas) to do fast writes will result in eventual consistency. All replicas will receive the data, but read operations are more vulnerable to selecting data before all replicas write the data.

Additional consistency examples:

Write at ONE and the replica crashes one second later. The other messages are not delivered. The data is lost.
Write at ONE and the operation times out. Future reads can return the old or the new value. You will not know the data is incorrect.
Write at ONE and one of the other replicas is down. The node comes back online. The application will get old data from that node until the node gets the correct data or a read repair occurs.
Write at QUORUM and then a read at QUORUM. One of the replicas dies. You will always get the correct data.