What are tombstones?

Explains how tombstones are created and the impact on reads.

In the DataStax Distribution of Apache Cassandra™ (DDAC), a tombstone is created when data is deleted. The following list of examples is not exhaustive, but illustrates some operations that generate tombstones:

Using a CQL DELETE statement
Expiring data with time-to-live (TTL)
Using internal operations, such as Using materialized views
INSERT or UPDATE operations with a null value
UPDATE operations with a collection column

When a tombstone is created, it can be marked on different parts of a partition. Based on the location of the marker, tombstones can be categorized into one of the following groups. Each category typically corresponds to one unique type of data deletion operation.

The tombstones go through the write path, and are written to SSTables on one or more nodes. A key differentiator of a tombstone is a built-in expiration known as the grace period, set by gc_grace_seconds. At the end of its expiration period, the tombstone is deleted as part of the normal compaction process.

Having an excessive number of tombstones in a table can negatively impact application performance. Many tombstones often indicate potential issues with either the data model or in the application.

Create the keyspace and tables

In the following examples, the cycling keyspace is used to illustrate different tombstone categories. Two tables are used: rank_by_year_and_cycling_name and cyclist_career_teams.

Tip: Because the following examples use both cqlsh and CQL commands, using two different terminals is recommended.

Alternatively, use one terminal for cqlsh and issue CQL commands using DataStax Studio.

Before getting started, copy the following commands into a cqlsh prompt to create the cycling keyspace, create both tables, and insert data into the rank_by_year_and_cycling_name table.

You insert data later into the cyclist_career_teams table in Cell tombstones and TTL tombstones.

CREATE KEYSPACE cycling WITH replication = 
{'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

CREATE TABLE cycling.rank_by_year_and_name (
    race_year int,
    race_name text,
    rank int,
    cyclist_name text,
    PRIMARY KEY ((race_year, race_name), rank)
) WITH CLUSTERING ORDER BY (rank ASC);

INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Tour of Japan - Stage 4 - Minami > Shinshu', 'Benjamin PRADES', 1);
INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Tour of Japan - Stage 4 - Minami > Shinshu', 'Adam PHELAN', 2);
INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Tour of Japan - Stage 4 - Minami > Shinshu', 'Thomas LEBAS', 3);
INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Giro d''Italia - Stage 11 - Forli > Imola', 'Ilnur ZAKARIN', 1);
INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2015, 'Giro d''Italia - Stage 11 - Forli > Imola', 'Carlos BETANCUR', 2);
INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2014, '4th Tour of Beijing', 'Phillippe GILBERT', 1);
INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2014, '4th Tour of Beijing', 'Daniel MARTIN', 2);
INSERT INTO cycling.rank_by_year_and_name (race_year, race_name, cyclist_name, rank) VALUES (2014, '4th Tour of Beijing', 'Johan Esteban CHAVES', 3);

CREATE TABLE cycling.cyclist_career_teams (
    id UUID PRIMARY KEY,
    lastname text,
    teams set<text>
);

Flushing to SSTables

After each modification to a table, run the nodetool flush command on the cycling keyspace to flush data from the memtable to the SSTables on disk. This step is necessary before running sstabledump to view the output.

nodetool flush cycling;

After flushing the cycling keyspace, run the sstabledump command on the SSTable, as shown in the following example.

cd /var/lib/cassandra/data/cycling/rank_by_year_and_name-bc05fba12baf11e8b4a8ad2b042f3e18

sstabledump mc-2-big-Data.db

Note: The sstabledump utility is available in Apache Cassandra™ 3.0, DDAC, DSE 5.0, and later. For prior versions, use the sstable2json utility instead.

Partition tombstones

Partition tombstones are generated when an entire partition is deleted explicitly. In the CQL DELETE statement, the WHERE clause is an equality condition against the partition key.

DELETE from cycling.rank_by_year_and_name WHERE 
 race_year = 2014 AND race_name = '4th Tour of Beijing';

Looking at the sstabledump output for this partition, the deletion_info tombstone marker is at the partition level, and is not associated with any rows or cells within the partition.

{
    "partition" : {
      "key" : [ "2014", "4th Tour of Beijing" ],
      "position" : 0,
      "deletion_info" : { "marked_deleted" : "2018-05-16T19:40:06.454282Z", "local_delete_time" : "2018-05-16T19:40:06Z" }
    },
    "rows" : [ ]
  }

Row tombstones

Row tombstones are generated when a particular row within a partition is deleted explicitly. The schema has a composite primary key that includes both the partition key and the clustering key. In the CQL DELETE statement, the WHERE clause is an equality condition against both the partition key and the clustering key columns.

DELETE from cycling.rank_by_year_and_name WHERE 
 race_year = 2015 AND race_name = 'Giro d''Italia - Stage 11 - Forli > Imola' AND rank = 2;

Looking at the sstabledump output for this partition, the deletion_info tombstone marker is at the row level, and is identified by a clustering key under the partition. Neither the partition nor the row cells contain the tombstone marker.

{
    "partition" : {
      "key" : [ "2015", "Giro d'Italia - Stage 11 - Forli > Imola" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 74,
        "clustering" : [ 2 ],
        "deletion_info" : { "marked_deleted" : "2018-05-18T15:29:06.227148Z", "local_delete_time" : "2018-05-18T15:29:06Z" },
        "cells" : [ ]
      }
    ]
  }

Range tombstones

Range tombstones occur when several rows within a partition that can be expressed through a range search are deleted explicitly. The schema has a composite primary key that includes both a partition key and a clustering key. In the CQL DELETE statement, the WHERE clause is an equality condition against the partition key, plus an inequality condition against the clustering key.

Tip: If following this tutorial from the beginning, drop the rank_by_year_and_name table and then recreate it to populate the table with the necessary data.

DELETE from cycling.rank_by_year_and_name WHERE 
 race_year = 2015 AND race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu' AND rank > 1;

Looking at the sstabledump output for this partition, the deletion_info tombstone marker is at the row level. A special boundary marker, range_tombstone_bound, marks the range scope (identified by the clustering key values) of the deleted rows.

{
    "partition" : {
      "key" : [ "2015", "Tour of Japan - Stage 4 - Minami > Shinshu" ],
      "position" : 252
    },
    "rows" : [
      {
        "type" : "range_tombstone_bound",
        "start" : {
          "type" : "inclusive",
          "deletion_info" : { "marked_deleted" : "2018-05-18T16:09:21.474713Z", "local_delete_time" : "2018-05-18T16:09:21Z" }
        }
      },
      {
        "type" : "range_tombstone_bound",
        "end" : {
          "type" : "exclusive",
          "clustering" : [ 1 ],
          "deletion_info" : { "marked_deleted" : "2018-05-18T16:09:21.474713Z", "local_delete_time" : "2018-05-18T16:09:21Z" }
        }
      }
    ]
  }

ComplexColumn tombstones

ComplexColumn tombstones are generated when inserting or updating a collection type column, such as set, list, and map.

Previously we created the cyclist_career_teams table. Run the following cqlsh command to insert data into that table.

INSERT INTO cycling.cyclist_career_teams (
     id,
     lastname,
     teams)
     VALUES (cb07baad-eac8-4f65-b28a-bddc06a0de23, 'ARMITSTEAD', { 
     'Boels-Dolmans Cycling Team','AA Drink - Leontien.nl','Team Garmin - Cervelo' } );

Looking at the sstabledump output for this partition, no explicit manual deletion occurs on the partition, but a deletion_info marker is listed at the cell level for the collection type column teams.

{
    "partition" : {
      "key" : [ "cb07baad-eac8-4f65-b28a-bddc06a0de23" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 130,
        "liveness_info" : { "tstamp" : "2018-05-18T16:26:23.779724Z" },
        "cells" : [
          { "name" : "lastname", "value" : "ARMITSTEAD" },
          { "name" : "teams", "deletion_info" : { "marked_deleted" : "2018-05-18T16:26:23.779723Z", "local_delete_time" : "2018-05-18T16:26:23Z" } },
          { "name" : "teams", "path" : [ "AA Drink - Leontien.nl" ], "value" : "" },
          { "name" : "teams", "path" : [ "Boels-Dolmans Cycling Team" ], "value" : "" },
          { "name" : "teams", "path" : [ "Team Garmin - Cervelo" ], "value" : "" }
        ]
      }
    ]
  }

Cell tombstones

Cell tombstones are generated when explicitly deleting a value from a cell, such as a column for a specific row of a partition, or when inserting or updating a cell with null values, as shown in the following example.

INSERT INTO cycling.rank_by_year_and_name (
     race_year,
     race_name,
     cyclist_name,
     rank)
     VALUES (2018, 'Giro d''Italia - Stage 11 - Osimo > Imola', null, 1);

Looking at the "sstabledump" output for this partition, deletion_info tombstone marker is associated with a particular cell.

{
    "partition" : {
      "key" : [ "2018", "Giro d'Italia - Stage 11 - Osimo > Imola" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 80,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2018-05-18T17:13:42.602827Z" },
        "cells" : [
          { "name" : "cyclist_name", "deletion_info" : { "local_delete_time" : "2018-05-18T17:13:42Z" }
          }
        ]
      }
    ]
  }

TTL tombstones

TTL tombstones are generated when the TTL (time-to-live) period expires. The TTL expiration marker can occur at either the row or cell level. However, Cassandra marks TTL data differently from tombstone data that was explicitly deleted. Even if a partition has only a single row (with no clustering key), the TTL mark is still made at the row level.

The following statement sets TTL for an entire row.

INSERT INTO cycling.cyclist_career_teams (
     id,
     lastname,
     teams)
     VALUES (e7cd5752-bc0d-4157-a80f-7523add8dbcd, 'VAN DER BREGGEN', { 
     'Rabobank-Liv Woman Cycling Team','Sengers Ladies Cycling Team','Team Flexpoint' }) USING TTL 1;

The following statement sets TTL for a single cell.

UPDATE cycling.rank_by_year_and_name USING TTL 1
  SET cyclist_name = 'Cloudy Archipelago' WHERE race_year = 2018 AND 
  race_name = 'Giro d''Italia - Stage 11 - Osimo > Imola' AND rank = 1;

Looking at the sstabledump output for these partitions, the first CQL statement marks the row (partition key: e7cd5752-bc0d-4157-a80f-7523add8dbcd) with an "expired" : true TTL expiration marker in the liveness_info section.

{
    "partition" : {
      "key" : [ "e7cd5752-bc0d-4157-a80f-7523add8dbcd" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 134,
        "liveness_info" : { "tstamp" : "2018-05-18T17:38:13.135226Z", "ttl" : 1, "expires_at" : "2018-05-18T17:38:14Z", "expired" : true },
        "cells" : [
          { "name" : "lastname", "value" : "VAN DER BREGGEN" },
          { "name" : "teams", "deletion_info" : { "marked_deleted" : "2018-05-18T17:38:13.135225Z", "local_delete_time" : "2018-05-18T17:38:13Z" } },
          { "name" : "teams", "path" : [ "Rabobank-Liv Woman Cycling Team" ], "value" : "" },
          { "name" : "teams", "path" : [ "Sengers Ladies Cycling Team" ], "value" : "" },
          { "name" : "teams", "path" : [ "Team Flexpoint" ], "value" : "" }
        ]
      }
    ]
  }

The second CQL statement marks the cell (partition key: 2018, clustering key: 1, column name: cyclist_name) with an "expired" : true TTL expiration marker for the specific cell.

{
    "partition" : {
      "key" : [ "2018", "Giro d'Italia - Stage 11 - Osimo > Imola" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 95,
        "clustering" : [ 1 ],
        "cells" : [
          { "name" : "cyclist_name", "value" : "Cloudy Archipelago", "tstamp" : "2018-05-18T18:22:52.532855Z", "ttl" : 1, "expires_at" : "2018-05-18T18:22:53Z", "expired" : true }
        ]
      }
    ]
  }