What are tombstones?
Explains how tombstones are created and the impact on reads.
In DataStax Enterprise (DSE), a tombstone is created when data is deleted from a keyspace that has a replication factor (RF) greater than 1. The following list of examples is not exhaustive, but illustrates some operations that generate tombstones:
-
Using a CQL DELETE statement.
-
Expiring data with time-to-live (TTL).
-
Using internal operations, such as Using materialized views.
-
INSERT or UPDATE operations with a
null
value. -
UPDATE operations on a collection column.
When a tombstone is created, it can be marked on different parts of a partition. Based on the location of the marker, tombstones can be categorized into one of the following groups. Each category typically corresponds to one unique type of data deletion operation.
The tombstones go through the write path, and are written to SSTables on one or more nodes. A key differentiator of a tombstone is a built-in expiration known as the grace period, set by gc_grace_seconds. At the end of its expiration period, the tombstone will be deleted as part of the normal compaction process, if it no longer marks any deleted data in another SSTable.
Having an excessive number of tombstones in a table can negatively impact application performance. Many tombstones often indicate potential issues with either the data model or in the application.
Create the keyspace and tables
In the following examples, the cycling
keyspace is used to illustrate different tombstone categories.
Two tables are used: rank_by_year_and_cycling_name
and cyclist_career_teams
.
Because the following examples use both |
Alternatively, use one terminal for cqlsh
and issue CQL commands using DataStax Studio.
Before getting started, copy the following commands into a cqlsh
prompt to create the cycling
keyspace, create both tables, and insert data into the rank_by_year_and_cycling_name
table.
You insert data later into the cyclist_career_teams
table in Cell tombstones and TTL tombstones.
CREATE KEYSPACE cycling WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true; CREATE TABLE cycling.rank_by_year_and_name ( race_year int, race_name text, rank int, cyclist_name text, PRIMARY KEY ((race_year, race_name), rank) ) WITH CLUSTERING ORDER BY (rank ASC); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Tour of Japan - Stage 4 - Minami > Shinshu', 'Benjamin PRADES', 1); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Tour of Japan - Stage 4 - Minami > Shinshu', 'Adam PHELAN', 2); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Tour of Japan - Stage 4 - Minami > Shinshu', 'Thomas LEBAS', 3); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Giro d''Italia - Stage 11 - Forli > Imola', 'Ilnur ZAKARIN', 1); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Giro d''Italia - Stage 11 - Forli > Imola', 'Carlos BETANCUR', 2); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2014, '4th Tour of Beijing', 'Phillippe GILBERT', 1); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2014, '4th Tour of Beijing', 'Daniel MARTIN', 2); INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2014, '4th Tour of Beijing', 'Johan Esteban CHAVES', 3); CREATE TABLE cycling.cyclist_career_teams ( id UUID PRIMARY KEY, lastname text, teams set<text> );
Flushing to SSTables
After each modification to a table, run the nodetool flush
command on the cycling
keyspace to flush data from the memtable to the SSTables on disk.
This step is necessary before running sstabledump
to view the output.
nodetool flush cycling;
After flushing the cycling
keyspace, run the sstabledump
command on the SSTable, as shown in the following example.
cd /var/lib/cassandra/data/cycling/rank_by_year_and_name-bc05fba12baf11e8b4a8ad2b042f3e18
sstabledump mc-2-big-Data.db
The sstabledump utility is available in Apache Cassandra® 3.0, DSE 5.0, and later. For versions before these, use the sstable2json utility instead. |
Partition tombstones
Partition tombstones are generated when an entire partition is deleted explicitly. In the CQL DELETE statement, the WHERE clause is an equality condition against the partition key.
DELETE from cycling.rank_by_year_and_name WHERE race_year = 2014 AND race_name = '4th Tour of Beijing';
Looking at the sstabledump output for this partition, the deletion_info
tombstone marker is at the partition level, and is not associated with any rows or cells within the partition.
{
"partition" : {
"key" : [ "2014", "4th Tour of Beijing" ],
"position" : 0,
"deletion_info" : { "marked_deleted" : "2018-05-16T19:40:06.454282Z", "local_delete_time" : "2018-05-16T19:40:06Z" }
},
"rows" : [ ]
}
Row tombstones
Row tombstones are generated when a particular row within a partition is deleted explicitly. The schema has a composite primary key that includes both the partition key and the clustering key. In the CQL DELETE statement, the WHERE clause is an equality condition against both the partition key and the clustering key columns.
DELETE from cycling.rank_by_year_and_name WHERE race_year = 2015 AND race_name = 'Giro d''Italia - Stage 11 - Forli > Imola' AND rank = 2;
Looking at the sstabledump output for this partition, the deletion_info
tombstone marker is at the row level, and is identified by a clustering key under the partition.
Neither the partition nor the row cells contain the tombstone marker.
{
"partition" : {
"key" : [ "2015", "Giro d'Italia - Stage 11 - Forli > Imola" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 74,
"clustering" : [ 2 ],
"deletion_info" : { "marked_deleted" : "2018-05-18T15:29:06.227148Z", "local_delete_time" : "2018-05-18T15:29:06Z" },
"cells" : [ ]
}
]
}
Range tombstones
Range tombstones occur when several rows within a partition that can be expressed through a range search are deleted explicitly. The schema has a composite primary key that includes both a partition key and a clustering key. In the CQL DELETE statement, the WHERE clause is an equality condition against the partition key, plus an inequality condition against the clustering key.
If following this tutorial from the beginning, drop the |
DELETE from cycling.rank_by_year_and_name WHERE race_year = 2015 AND race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu' AND rank > 1;
Looking at the sstabledump output for this partition, the deletion_info
tombstone marker is at the row level.
A special boundary marker, range_tombstone_bound
, marks the range scope (identified by the clustering key values) of the deleted rows.
{
"partition" : {
"key" : [ "2015", "Tour of Japan - Stage 4 - Minami > Shinshu" ],
"position" : 252
},
"rows" : [
{
"type" : "range_tombstone_bound",
"start" : {
"type" : "inclusive",
"deletion_info" : { "marked_deleted" : "2018-05-18T16:09:21.474713Z", "local_delete_time" : "2018-05-18T16:09:21Z" }
}
},
{
"type" : "range_tombstone_bound",
"end" : {
"type" : "exclusive",
"clustering" : [ 1 ],
"deletion_info" : { "marked_deleted" : "2018-05-18T16:09:21.474713Z", "local_delete_time" : "2018-05-18T16:09:21Z" }
}
}
]
}
ComplexColumn tombstones
ComplexColumn tombstones are generated when inserting or updating a collection type column, such as set, list, and map.
Previously we created the cyclist_career_teams
table.
Run the following cqlsh
command to insert data into that table.
INSERT INTO cycling.cyclist_career_teams ( id, lastname, teams) VALUES (cb07baad-eac8-4f65-b28a-bddc06a0de23, 'ARMITSTEAD', { 'Boels-Dolmans Cycling Team','AA Drink - Leontien.nl','Team Garmin - Cervelo' } );
Looking at the sstabledump output for this partition, no explicit manual deletion occurs on the partition, but a deletion_info
marker is listed at the cell level for the collection type column teams
.
{
"partition" : {
"key" : [ "cb07baad-eac8-4f65-b28a-bddc06a0de23" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 130,
"liveness_info" : { "tstamp" : "2018-05-18T16:26:23.779724Z" },
"cells" : [
{ "name" : "lastname", "value" : "ARMITSTEAD" },
{ "name" : "teams", "deletion_info" : { "marked_deleted" : "2018-05-18T16:26:23.779723Z", "local_delete_time" : "2018-05-18T16:26:23Z" } },
{ "name" : "teams", "path" : [ "AA Drink - Leontien.nl" ], "value" : "" },
{ "name" : "teams", "path" : [ "Boels-Dolmans Cycling Team" ], "value" : "" },
{ "name" : "teams", "path" : [ "Team Garmin - Cervelo" ], "value" : "" }
]
}
]
}
Cell tombstones
Cell tombstones are generated when explicitly deleting a value from a cell, such as a column for a specific row of a partition, or when inserting or updating a cell with null
values, as shown in the following example.
INSERT INTO cycling.rank_by_year_and_name ( race_year, race_name, cyclist_name, rank) VALUES (2018, 'Giro d''Italia - Stage 11 - Osimo > Imola', null, 1);
Looking at the "sstabledump" output for this partition, deletion_info
tombstone marker is associated with a particular cell.
{
"partition" : {
"key" : [ "2018", "Giro d'Italia - Stage 11 - Osimo > Imola" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 80,
"clustering" : [ 1 ],
"liveness_info" : { "tstamp" : "2018-05-18T17:13:42.602827Z" },
"cells" : [
{ "name" : "cyclist_name", "deletion_info" : { "local_delete_time" : "2018-05-18T17:13:42Z" }
}
]
}
]
}
TTL tombstones
TTL tombstones are generated when the TTL (time-to-live) period expires. The TTL expiration marker can occur at either the row or cell level. However, DSE marks TTL data differently from tombstone data that was explicitly deleted. Even if a partition has only a single row (with no clustering key), the TTL mark is still made at the row level.
The following statement sets TTL for an entire row.
INSERT INTO cycling.cyclist_career_teams ( id, lastname, teams) VALUES (e7cd5752-bc0d-4157-a80f-7523add8dbcd, 'VAN DER BREGGEN', { 'Rabobank-Liv Woman Cycling Team','Sengers Ladies Cycling Team','Team Flexpoint' }) USING TTL 1;
The following statement sets TTL for a single cell.
UPDATE cycling.rank_by_year_and_name USING TTL 1 SET cyclist_name = 'Cloudy Archipelago' WHERE race_year = 2018 AND race_name = 'Giro d''Italia - Stage 11 - Osimo > Imola' AND rank = 1;
Looking at the sstabledump output for these partitions, the first CQL statement marks the row (partition key: e7cd5752-bc0d-4157-a80f-7523add8dbcd) with an "expired" : true
TTL expiration marker in the liveness_info
section.
{
"partition" : {
"key" : [ "e7cd5752-bc0d-4157-a80f-7523add8dbcd" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 134,
"liveness_info" : { "tstamp" : "2018-05-18T17:38:13.135226Z", "ttl" : 1, "expires_at" : "2018-05-18T17:38:14Z", "expired" : true },
"cells" : [
{ "name" : "lastname", "value" : "VAN DER BREGGEN" },
{ "name" : "teams", "deletion_info" : { "marked_deleted" : "2018-05-18T17:38:13.135225Z", "local_delete_time" : "2018-05-18T17:38:13Z" } },
{ "name" : "teams", "path" : [ "Rabobank-Liv Woman Cycling Team" ], "value" : "" },
{ "name" : "teams", "path" : [ "Sengers Ladies Cycling Team" ], "value" : "" },
{ "name" : "teams", "path" : [ "Team Flexpoint" ], "value" : "" }
]
}
]
}
The second CQL statement marks the cell (partition key: 2018, clustering key: 1, column name: cyclist_name) with an "expired" : true
TTL expiration marker for the specific cell.
{
"partition" : {
"key" : [ "2018", "Giro d'Italia - Stage 11 - Osimo > Imola" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 95,
"clustering" : [ 1 ],
"cells" : [
{ "name" : "cyclist_name", "value" : "Cloudy Archipelago", "tstamp" : "2018-05-18T18:22:52.532855Z", "ttl" : 1, "expires_at" : "2018-05-18T18:22:53Z", "expired" : true }
]
}
]
}