How are consistent read and write operations handled?

An introduction to how Cassandra extends eventual consistency with tunable consistency to vary the consistency of data read and written.

Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas. Through repair operations, Cassandra data eventually becomes consistent across all replicas. Repairs work to decrease the variability in replica data, but stale data can be present at any given time. Cassandra is an AP (highly available and partition tolerant) system according to the CAP theorem. Cassandra does have flexibility in its configuration, though, and can perform more like a CP (consistent and partition tolerant) system, depending on the application requirements. Two consistency features are tunable consistency and linearizable consistency.

Tunable consistency

To ensure that data is written and read correctly, Cassandra extends the concept of eventual consistency by offering tunable consistency. Tunable consistency allows individual read or write operations to be as strongly consistent as required by the client application. The consistency level of each read or write operation can be set, so that the data returned is more or less consistent, based on need. The tradeoff between operation latency and consistency level can be tuned down to the per-operation level, or set globally for a cluster or datacenter. Using tunable consistency, Cassandra can act more like a CP (consistent and partition tolerant) or AP (highly available and partition tolerant) system according to the CAP theorem, depending on the application requirements.
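As a minimal sketch of the global knob, a default consistency level can be attached to the cluster connection; per-operation overrides appear in the read and write examples below. This sketch assumes the DataStax Python driver (cassandra-driver) and a hypothetical cycling keyspace.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Cluster-wide default: every request from this session uses LOCAL_QUORUM
# unless an individual statement overrides it.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cluster = Cluster(["127.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("cycling")  # hypothetical keyspace
```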

The consistency level determines only the number of replicas that must acknowledge a read or write operation before success is reported to the client application. For read operations, the read consistency level specifies how many replicas must respond to a read request before data is returned to the client application. If stale data is discovered during a read, read repair updates the out-of-date replicas in the background.
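For example, a read at QUORUM does not return rows until a majority of replicas have responded. A sketch using the same assumed Python driver and hypothetical cyclist_name table:

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("cycling")  # hypothetical keyspace

# QUORUM: with a replication factor of 3, two replicas must respond
# before the coordinator returns rows to the client.
read_stmt = SimpleStatement(
    "SELECT firstname, lastname FROM cyclist_name WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(read_stmt, (uuid.uuid4(),))
```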

For write operations, the write consistency level specifies how many replicas must respond to a write request before the write is considered successful. Even at low consistency levels, Cassandra writes to all replicas of the partition key, including replicas in other datacenters. The write consistency level simply specifies when the coordinator can report to the client application that the write operation is complete. Write operations use hinted handoff to ensure that writes are completed when replicas are down or otherwise not responsive to the write request.
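A corresponding write sketch (same assumed driver and hypothetical schema): the coordinator forwards the INSERT to every replica of the partition key, but at ONE it acknowledges success to the client as soon as a single replica responds.

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("cycling")  # hypothetical keyspace

# ONE: the write is sent to all replicas, but only one acknowledgement
# is required before the coordinator reports success to the client.
write_stmt = SimpleStatement(
    "INSERT INTO cyclist_name (id, firstname, lastname) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write_stmt, (uuid.uuid4(), "Marianne", "Vos"))
```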

Typically, a client specifies a consistency level that is less than the replication factor specified by the keyspace. Another common practice is to write at a consistency level of QUORUM and read at a consistency level of QUORUM. The choices made depend on the client application's needs, and Cassandra provides maximum flexibility for application design.

Linearizable consistency

In ACID terms, linearizable consistency (or serial consistency) is a serial (immediate) isolation level for lightweight transactions. Cassandra does not use locking or transactional dependencies when concurrently updating multiple rows or tables. However, some operations must be performed in sequence without being interrupted by other operations. A typical use case is the creation of user accounts, where a duplicate or overwritten account would have serious consequences. Linearizable consistency is not required for every aspect of the user's account, but the unique identifier that claims the account, such as a userID or email address, is treated differently. Such a serial operation is implemented in Cassandra with the Paxos consensus protocol, which uses a quorum-based algorithm. Lightweight transactions can be implemented without the need for a master database or two-phase commit process.

Lightweight transaction write operations use the serial consistency level for Paxos consensus and the regular consistency level for the write to the table. For more information, see Lightweight Transactions.
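A hedged sketch of the account-creation example with the assumed Python driver (the users table and its columns are hypothetical): the IF NOT EXISTS clause triggers a Paxos round governed by the serial consistency level, while the regular consistency level applies to the write that commits the accepted value.

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("cycling")  # hypothetical keyspace

# IF NOT EXISTS makes this a lightweight transaction: Paxos guarantees that
# only one client can claim a given email address, even under concurrency.
create_account = SimpleStatement(
    "INSERT INTO users (email, id) VALUES (%s, %s) IF NOT EXISTS",
    consistency_level=ConsistencyLevel.QUORUM,          # commit of the write
    serial_consistency_level=ConsistencyLevel.SERIAL,   # Paxos phase
)
result = session.execute(create_account, ("rider@example.com", uuid.uuid4()))
applied = result.one()[0]  # the [applied] column is False if the email was already claimed
```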

Calculating consistency

Reliability of read and write operations depends on the consistency used to verify the operation. Strong consistency can be guaranteed when the following condition is true:
R + W > N
where
  • R is the consistency level of read operations
  • W is the consistency level of write operations
  • N is the number of replicas
If the replication factor is 3, then the consistency level of the reads and writes combined must be at least 4. For example, read operations using 2 out of 3 replicas to verify the value, and write operations using 2 out of 3 replicas to verify the value will result in strong consistency. If fast write operations are required, but strong consistency is still desired, the write consistency level is lowered to 1, but now read operations have to verify a matched value on all 3 replicas. Writes will be fast, but reads will be slower.
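
The arithmetic can be spelled out in a few lines of plain Python (an illustration of the inequality above, not a Cassandra API); QUORUM for a given replication factor is floor(RF/2) + 1.

```python
def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """R + W > N guarantees that every read overlaps the most recent write."""
    return read_replicas + write_replicas > replication_factor

rf = 3
quorum = rf // 2 + 1                                # QUORUM with RF=3 is 2 replicas
print(is_strongly_consistent(quorum, quorum, rf))   # QUORUM reads + QUORUM writes -> True
print(is_strongly_consistent(rf, 1, rf))            # ALL reads + ONE writes       -> True
```
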
Eventual consistency occurs if the following condition is true:
R + W <= N
where
  • R is the consistency level of read operations
  • W is the consistency level of write operations
  • N is the number of replicas
If the replication factor is 3, then the combined consistency level of the reads and writes is 3 or less. For example, read operations using QUORUM (2 out of 3 replicas) to verify the value, and write operations using ONE (1 out of 3 replicas) for fast writes, result in eventual consistency. All replicas eventually receive the data, but a read can reach replicas that have not yet received the write and return stale data.
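
Continuing the plain-Python illustration above (again hypothetical helper code, not a Cassandra API), the same inequality separates the eventual combinations from the strong ones for a replication factor of 3:

```python
RF = 3
LEVELS = {"ONE": 1, "QUORUM": RF // 2 + 1, "ALL": RF}

# A combination is strong only when the read and write replica counts overlap.
for write in ("ONE", "QUORUM", "ALL"):
    for read in ("ONE", "QUORUM", "ALL"):
        mode = "strong" if LEVELS[read] + LEVELS[write] > RF else "eventual"
        print(f"write={write:6} read={read:6} -> {mode}")
```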

Additional consistency examples:

  • You do a write at ONE, and the replica that acknowledged it crashes one second later, before the write reaches the other replicas. The data is lost.
  • You do a write at ONE, and the operation times out. Future reads can return either the old or the new value; you will not be able to tell which data is correct.
  • You do a write at ONE, and one of the other replicas is down. When that node comes back online, reads that reach it can return old data until the node receives the correct data or a read repair occurs.
  • You do a write at QUORUM, and then a read at QUORUM. One of the replicas dies. You will always get the correct data.