How is data deleted?

How Cassandra deletes data and why deleted data can reappear.

Cassandra's processes for deleting data are designed to improve performance, and to work with Cassandra's built-in properties for data distribution and fault-tolerance.

Cassandra treats a delete as an insert or upsert. The data that the DELETE command adds to the partition is a deletion marker called a tombstone. Tombstones go through Cassandra's write path and are written to SSTables on one or more nodes. The key distinguishing feature of a tombstone is its built-in expiration date/time. At the end of its expiration period (for details see below), the tombstone is deleted as part of Cassandra's normal compaction process.

You can also mark a Cassandra record (row or column) with a time-to-live value. After this amount of time has ended, Cassandra marks the record with a tombstone, and handles it like other tombstoned records.

Deletion in a distributed system

In a multi-node cluster, Cassandra can store replicas of the same data on two or more nodes. This helps prevent data loss, but it complicates the delete process. If a node receives a delete for data it stores locally, the node tombstones the specified record and tries to pass the tombstone to the other nodes that hold replicas of that record. But if one replica node is unresponsive at that time, it does not receive the tombstone immediately, so it still contains the pre-delete version of the record. If the tombstoned record has already been deleted from the rest of the cluster before that node recovers, Cassandra treats the record on the recovered node as new data and propagates it to the rest of the cluster. This kind of deleted but persistent record is called a zombie.
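The zombie scenario can be illustrated with a small toy simulation. This is only a sketch: the replica dictionaries, the integer timestamps, and the helper names (write, purge_tombstones, read) are hypothetical and do not reflect Cassandra's storage format or API.

```python
# Toy model: each replica is a dict mapping key -> (timestamp, value).
# A value of None stands for a tombstone. This is not Cassandra's real format.
TOMBSTONE = None

def write(replicas, up, key, timestamp, value):
    """Apply a write (or a delete, when value is TOMBSTONE) to reachable replicas."""
    for name in up:
        current = replicas[name].get(key)
        if current is None or current[0] < timestamp:
            replicas[name][key] = (timestamp, value)

def purge_tombstones(replicas, up):
    """Simulate compaction dropping expired tombstones on reachable replicas."""
    for name in up:
        replicas[name] = {
            k: cell for k, cell in replicas[name].items() if cell[1] is not TOMBSTONE
        }

def read(replicas, key):
    """Return the newest value held by any replica, or None if deleted or absent."""
    found = [r[key] for r in replicas.values() if key in r]
    if not found:
        return None
    return max(found, key=lambda cell: cell[0])[1]

replicas = {"A": {}, "B": {}, "C": {}}
write(replicas, ["A", "B", "C"], "user:1", 1, "alice")  # every replica stores the row
write(replicas, ["A", "B"], "user:1", 2, TOMBSTONE)     # C is down and misses the delete
before_purge = read(replicas, "user:1")                 # newest cell is the tombstone
purge_tombstones(replicas, ["A", "B"])                  # tombstones removed while C is still stale
after_purge = read(replicas, "user:1")                  # C's old copy is now the only data
print(before_purge, after_purge)  # None alice
```

While the tombstone still exists on A and B, the stale copy on C is outvoted. Once the tombstones are purged, nothing records that the row was ever deleted, so C's copy looks like ordinary data.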

To prevent the reappearance of zombies, the database gives each tombstone a grace period. The purpose of the grace period is to give unresponsive nodes time to recover and process tombstones normally. When a read request gathers responses from multiple replicas and those responses differ, the most recent values take precedence. For example, if one node has a tombstone but another node has a more recent change, then the final result includes the more recent change.

If a node has a tombstone and another node has only an older value for the record, then the final result is the tombstone. If a client writes a new update to the tombstoned record during the grace period, the database overwrites the tombstone.
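The precedence rules above amount to "the most recent write wins". A minimal sketch, assuming each replica response carries a write timestamp; the Cell class and reconcile function are hypothetical names used for illustration, and ties between equal timestamps are not modelled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    timestamp: int          # write time; a larger number means more recent
    value: Optional[str]    # None marks a tombstone

    @property
    def is_tombstone(self) -> bool:
        return self.value is None

def reconcile(responses):
    """Pick the most recent cell among the replica responses."""
    return max(responses, key=lambda cell: cell.timestamp)

# A tombstone and a more recent change: the change wins.
newer_change = reconcile([Cell(10, None), Cell(12, "updated")])

# A tombstone and only an older value: the tombstone wins.
older_value = reconcile([Cell(10, None), Cell(7, "stale")])

print(newer_change.value, older_value.is_tombstone)  # updated True
```

The same rule covers a client update written during the grace period: the update carries a later timestamp than the tombstone, so it takes precedence.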

When an unresponsive node recovers, Cassandra uses hinted handoff to replay the database mutations the node missed while it was down. Cassandra does not replay a mutation for a tombstoned record during its grace period. But if the node does not recover until after the grace period ends, Cassandra may miss the deletion.

After the tombstone's grace period ends, Cassandra deletes the tombstone during compaction.

The grace period for a tombstone is set by the property gc_grace_seconds. Its default value is 864000 seconds (ten days). Each table can have its own value for this property.
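As a worked example of the arithmetic, the sketch below computes when a tombstone becomes eligible for removal, assuming the expiration time is the tombstone's creation time plus gc_grace_seconds. The function names are hypothetical, and the actual removal still waits for a compaction to run.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864000  # default value: ten days

def tombstone_expiry(created_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Expiration time = creation time + the table's gc_grace_seconds."""
    return created_at + timedelta(seconds=gc_grace_seconds)

def is_purgeable(created_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once the grace period has ended and compaction may drop the tombstone."""
    return now >= tombstone_expiry(created_at, gc_grace_seconds)

created = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiry = tombstone_expiry(created)
print(expiry.isoformat())  # 2024-01-11T00:00:00+00:00
print(is_purgeable(created, datetime(2024, 1, 5, tzinfo=timezone.utc)))   # False
print(is_purgeable(created, datetime(2024, 1, 12, tzinfo=timezone.utc)))  # True
```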

More about Cassandra deletes

Details:

  • The expiration date/time for a tombstone is the date/time of its creation plus the value of the table property gc_grace_seconds.
  • Cassandra also supports batch data insertion and updates. This procedure also introduces the danger of replaying a record insertion after that record has been removed from the rest of the cluster. Cassandra does not replay a batched mutation for a tombstoned record that is still within its grace period.
  • On a single-node cluster, you can set gc_grace_seconds to 0 (zero).
  • To completely prevent the reappearance of zombie records, run nodetool repair on a node after it recovers, and on each table at least once every gc_grace_seconds.
  • If all records in a table are given a TTL at creation, and all are allowed to expire rather than being deleted manually, it is not necessary to run nodetool repair for that table on a regular basis.
  • If you use the SizeTieredCompactionStrategy or DateTieredCompactionStrategy, you can delete expired tombstones immediately by manually starting the compaction process.
    CAUTION: If you force compaction, Cassandra may create one very large SSTable from all the data. Cassandra will not trigger another compaction for a long time. The data in the SSTable created during the forced compaction can grow very stale during this long period of non-compaction.
  • Cassandra allows you to set a default_time_to_live property for an entire table. Columns and rows marked with regular TTLs are processed as described above; but when a record exceeds the table-level TTL, Cassandra deletes it immediately, without tombstoning or compaction.
  • Cassandra supports immediate deletion through the DROP KEYSPACE and DROP TABLE statements.
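
The repair guidance in the list above (repair each table at least once every gc_grace_seconds) can be turned into a simple scheduling check. This is a sketch under stated assumptions: the table names, the bookkeeping dictionary, and the repair_overdue helper are hypothetical, and times are plain epoch seconds rather than anything read from a cluster.

```python
DEFAULT_GC_GRACE_SECONDS = 864000  # default value: ten days

def repair_overdue(last_repair_epoch, now_epoch,
                   gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True when at least gc_grace_seconds have passed since the last repair."""
    return now_epoch - last_repair_epoch >= gc_grace_seconds

# Hypothetical bookkeeping: table name -> (last repair time, gc_grace_seconds)
tables = {
    "ks.events": (0, 864000),
    "ks.sessions": (500000, 864000),
}

now = 900000
overdue = sorted(
    name for name, (last_repair, grace) in tables.items()
    if repair_overdue(last_repair, now, grace)
)
print(overdue)  # ['ks.events']
```

A table flagged here has gone a full grace period without repair, which is the window in which a tombstone could be purged before every replica has seen it.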