Manual repair: Anti-entropy repair
Describe how manual repair works.
Anti-entropy node repairs are important for every Cassandra cluster. Frequent data deletions and downed nodes are common causes of data inconsistency. Use anti-entropy repair for routine maintenance and when a cluster needs fixing by running the nodetool repair command.
How does anti-entropy repair work?
- Build a Merkle tree for each replica
- Compare the Merkle trees to discover differences
Merkle trees are binary hash trees whose leaves are hashes of the individual key values. The leaf of a Cassandra Merkle tree is the hash of a row value. Each Parent node higher in the tree is a hash of its respective children. Because higher nodes in the Merkle tree represent data further down the tree, Casandra can check each branch independently without requiring the coordinator node to download the entire data set. For anti-entropy repair Cassandra uses a compact tree version with a depth of 15 (2^15 = 32K leaf nodes). For example, a node containing a million partitions with one damaged partition, about 30 partitions are streamed, which is the number that fall into each of the leaves of the tree. Cassandra works with smaller Merkle trees because they require less storage memory and can be transferred more quickly to other nodes during the comparison process.
After the initiating node receives the Merkle trees from the participating peer nodes, the initiating node compares every tree to every other tree. If the initiating node detects a difference, it directs the differing nodes to exchange data for the conflicting range(s). The new data is written to SSTables. The comparison begins with the top node of the Merkle tree. If Cassandra detects no difference between corresponding tree nodes, the process goes on to compares the left leaves (child nodes), then the right leaves. A difference between corresponding leaves indicates inconsistencies between the data in each replica for the data range that corresponds to that leaf. Cassandra replaces all data that corresponds to the leaves below the differing leaf with the newest version of the data.
Merkle tree building is quite resource intensive, stressing disk I/O and using memory. Some of the options discussed here help lessen the impact on the cluster performance. For details, see Repair in Cassandra.
You can run the nodetool repair
command on a specified node or on all
nodes. The node that initiates the repair becomes the coordinator node for the operation.
The coordinator node finds peer nodes with matching ranges of data and performs a major, or
validation, compaction on each peer node. The validation compaction builds a Merkle tree and
returns the tree to the initiating node. The initiating mode processes the Merkle trees as
described.
Full vs Incremental repair
The process described above represents what occurs for a full repair of a node's data:
Cassandra compares all SSTables for that node and makes necessary repairs. Cassandra 2.1 and
later support incremental repair. An incremental repair persists data that has already been
repaired, and only builds Merkle trees for unrepaired SSTables. This more efficient process
depends on new metadata that marks the rows in an SSTable as repaired
or
unrepaired
.
If you run incremental repairs frequently, the repair process works with much smaller
Merkle trees. The incremental repair process works with Merkle trees as described above.
Once the process had reconciled the data and built new SSTables, the initiating node issues
an anti-compaction command. Anti-compaction is the process of segregating repaired and
unrepaired ranges into separate SSTables, unless the SSTable fits entirely within the
repaired range. If it does, the process just updates the SSTable's
repairedAt
field.
- Size-tiered compaction splits repaired and unrepaired data into separate pools for separate compactions. A major compaction generates two SSTables, one for each pool of data.
- Leveled compaction performs size-tiered compaction on unrepaired data. After repair completes, Casandra moves data from the set of unrepaired SSTables to L0.
Full repair is the default in Cassandra 2.1 and earlier.
Parallel vs Sequential
Sequential repair takes action on one node after another. Parallel repair repairs all nodes with the same replica data at the same time.
Sequential repair takes a snapshot of each replica. Snapshots are hardlinks to existing SSTables. They are immutable and require almost no disk space. The snapshots are live until the repair is completed and then Cassandra removes them. The coordinator node compares the Merkle trees for one replica after the other, and makes required repairs from the snapshots. For example, for a table in a keyspace with a Replication factor RF=3 and replicas A, B and C, the repair command takes a snapshot of each replica immediately and then repairs each replica from the snapshots sequentially (using snapshot A to repair replica B, then snapshot A to repair replica C, then snapshot B to repair replica C).
Parallel repair works on nodes A, B, and C all at once. During parallel repair, the dynamic snitch processes queries for this table using a replica in the snapshot that is not undergoing repair.
Snapshots are hardlinks to existing SSTables. Snapshots are immutable and require almost no disk space. Repair requires intensive disk I/O because validation compaction occurs during Merkle tree construction. For any given replica set, only one replica at a time performs the validation compaction.
Partitioner range ( -pr
)
nodetool repair
on one node at a time, Cassandra may repair the same
range of data several times (depending on the replication factor used in the keyspace).
Using the partitioner range option, nodetool repair
only repairs a
specified range of data once, rather than repeating the repair operation needlessly. This
decreases the strain on network resources, although nodetool repair
still
builds Merkle trees for each replica.nodetool repair -pr
on EVERY node in the cluster to repair all data.
Otherwise, some ranges of data will not be repaired.Local (-local, --in-local-dc
) vs datacenter (dc,
--in-dc
) vs Cluster-wide
Consider carefully before using nodetool repair
across datacenters,
instead of within a local datacenter. When you run repair on a node using
-local
or --in-local-dc
, the command runs only on nodes
within the same datacenter as the node that runs it. Otherwise, the command runs repair
processes on all nodes that contain replicas, even those in different datacenters. If run
over multiple datacenters, nodetool repair
increases network traffic
between datacenters tremendously, and can cause cluster issues. If the local option is too
limited, consider using the -dc
or --in-dc
options,
limiting repairs to a specific datacenter. This does not repair replicas on nodes in other
datacenters, but it can decrease network traffic while repairing more nodes than the local
options.
The nodetool repair -pr
option is good for repairs across multiple
datacenters, as the number of replicas in multiple datacenters can increase substantially.
For example, if you start nodetool repair
over two datacenters, DC1 and
DC2, each with a replication factor of 3, repair
must build Merkle tables
for 6 nodes. The number of Merkle Tree increases linearly for additional datacenters.
-local
repairs:- The
repair
tool does not support the use of-local
with the-pr
option unless the datacenter's nodes have all the data for all ranges. - Also, the tool does not support the use of
-local
with-inc
(incremental repair).
-dcpar
or --dc-parallel
to repair datacenters in
parallel.Endpoint range vs Subrange repair (-st, --start-token, -et
--end-token
)
A repair operation runs on all partition ranges on a node, or an endpoint range, unless you
use the -st
and -et
(or -start-token
and
-end-token
) options to run subrange repairs. When you specify a start
token and end token, nodetool repair
works between these tokens, repairing
only those partition ranges.
Subrange repair is not a good strategy because it requires generated token ranges. However, if you know which partition has an error, you can target that partition range precisely for repair. This approach can relieve the problem known as overstreaming, which ties up resources by sending repairs to a range over and over.
You can use subrange repair with Java to reduce overstreaming further. Send a Java
describe_splits call to ask for a split containing 32k partitions can
be iterated throughout the entire range incrementally or in parallel. Once the tokens are
generated for the split, you can pass them to nodetool repair -st <start_token>
-et <end_token>
. Add the -local
option to limit the repair to
the local datacenter. This reduces cross datacenter transfer.