Repair Service overview
The Repair Service runs as a background process that cyclically repairs DSE clusters within the specified completion time.
The Repair Service runs repair operations, which synchronize the most current data across nodes and their replicas, including repairing any corrupted data encountered at the filesystem level. By default, the Repair Service runs subrange repairs for most tables, and can be configured to run incremental repairs on certain tables. Distributed subrange repair is an alternative implementation of subrange repairs within the Repair Service, intended to better scale for large clusters.
- If data is relatively static, configure incremental repair for those tables or datacenters.
- If data is dynamic and constantly changing, use subrange repairs, excluding keyspaces and tables as appropriate for an environment.
- If repairing a very large cluster, and the opscenterd process becomes a bottleneck for timely subrange repairs, use distributed subrange repairs.
Do not run manual repairs using nodetool repair or OpsCenter while the Repair Service is running.
Subrange repairs
Subrange repairs repair a portion of the data that a node is responsible for. Subrange repairs are analogous to specifying the -st and -et options on the nodetool repair command, only the Repair Service determines and optimizes the start and end tokens of a subrange for you. The main benefit of subrange repair is more precise targeting of repairs while avoiding overstreaming.
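As a rough illustration of what choosing start and end tokens involves, the following sketch splits one token range into equal subranges and builds the matching nodetool repair -st/-et invocations. It is not how the Repair Service computes or optimizes subranges; the range (0, 1200], the keyspace name my_keyspace, and the helper names split_range and repair_command are hypothetical, and the commands are only constructed, never executed.

```python
def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` contiguous subranges."""
    if parts < 1 or end - start < parts:
        raise ValueError("need 1 <= parts <= (end - start)")
    width = end - start
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [start + (width * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_command(keyspace, start, end):
    """Build the nodetool argument list for one subrange (not executed here)."""
    return ["nodetool", "repair", "-st", str(start), "-et", str(end), keyspace]


# Hypothetical token range owned by a node, split into four subranges.
subranges = split_range(0, 1200, 4)
commands = [repair_command("my_keyspace", s, e) for s, e in subranges]

for command in commands:
    print(" ".join(command))
```

Each subrange ends where the next one begins, so the four commands together cover the original range exactly once.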
Distributed subrange repairs
Distributed subrange repairs are designed for repairing large clusters. Rather than coordinating every subrange repair itself, opscenterd instructs each agent to repair a list of one or more entire keyspaces. The agent handles the details of repairing those keyspaces on a per-subrange basis, so opscenterd coordinates a much smaller list of more coarse-grained repair tasks.
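The division of labor described above can be sketched as follows. This is a conceptual model only, not the OpsCenter implementation: the KeyspaceTask type, the function names, the node and keyspace names, and the token ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyspaceTask:
    """Coarse task the coordinator hands to one agent (hypothetical shape)."""
    node: str
    keyspaces: tuple


def plan_tasks(nodes, keyspaces):
    """Coordinator side: one coarse task per node, with no subrange detail."""
    return [KeyspaceTask(node, tuple(keyspaces)) for node in nodes]


def expand_on_agent(task, node_ranges, parts_per_range):
    """Agent side: expand a coarse task into (keyspace, start, end) units."""
    units = []
    for keyspace in task.keyspaces:
        for start, end in node_ranges:
            width = end - start
            bounds = [start + (width * i) // parts_per_range
                      for i in range(parts_per_range + 1)]
            units.extend((keyspace, lo, hi)
                         for lo, hi in zip(bounds[:-1], bounds[1:]))
    return units


# The coordinator tracks two tasks; each agent derives its own subrange units.
tasks = plan_tasks(["node1", "node2"], ["ks_a", "ks_b"])
units = expand_on_agent(tasks[0], node_ranges=[(0, 100), (500, 700)],
                        parts_per_range=2)
print(len(tasks), "coordinator tasks;", len(units), "agent-side units for node1")
```

The coordinator's list grows with the number of nodes, while the number of subrange units, which grows much faster, stays local to each agent.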
Incremental repairs
Incremental repairs repair only data that has not been previously repaired, and only on tables reserved and configured for incremental repair.
Subrange repairs operate on an exclusion (opt out) basis that can exclude certain keyspaces and tables. Ignored tables for subrange repairs consist of those reserved by OpsCenter and those configured by admins. Incremental repairs operate on an inclusion (opt in) basis. Only those keyspaces and tables designated for incremental repairs are processed during an incremental repair. Tables flagged for incremental repair include those built-in by OpsCenter and those configured by admins.
There is no crossover between subrange and incremental repairs: each keyspace or table is repaired by either a subrange repair or an incremental repair, never both. The two repair types are mutually exclusive at the table level. The Repair Service runs both repair types simultaneously. Each repair type has its own timeline, which is tracked in its own subrange or incremental progress bar in the Repair Status summary.
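The opt-out and opt-in rules can be modeled as a partition of the table list. The sketch below is illustrative only; the function name, the table names, and the choice to let an incremental flag take precedence over an ignore entry are assumptions, not documented OpsCenter behavior.

```python
def assign_repair_types(all_tables, ignored, incremental):
    """Return (subrange_tables, incremental_tables) as disjoint sorted lists."""
    all_tables = set(all_tables)
    # Opt in: only tables explicitly flagged receive incremental repair.
    incremental_set = all_tables & set(incremental)
    # Opt out: every remaining table receives subrange repair unless ignored.
    # Assumption: a table flagged as incremental is never subrange-repaired.
    subrange_set = all_tables - incremental_set - set(ignored)
    return sorted(subrange_set), sorted(incremental_set)


# Hypothetical keyspace.table names.
tables = ["app.events", "app.users", "app.audit_log", "scratch.tmp"]
subrange, incremental = assign_repair_types(
    tables,
    ignored=["scratch.tmp"],
    incremental=["app.audit_log"],
)
print("subrange:", subrange)
print("incremental:", incremental)
```

Because the incremental set is subtracted before the subrange set is built, no table can land in both lists, which mirrors the table-level exclusivity described above.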
Parallel vs. sequential validation compaction processing
The Repair Service runs validation compaction in parallel by default rather than sequentially because sequential processing takes considerably more time. The snapshot_override setting controls whether validation compactions for both subrange and incremental repairs are processed in parallel or sequentially. See Running validation compaction sequentially.
Conditions under which the Repair Service does not run
A cluster with a single node is not eligible for repairs. Repairs make node replicas consistent; therefore, there must be at least two nodes to exchange Merkle trees during the repair process.
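To show why at least two replicas are needed, here is a simplified sketch of comparing Merkle tree roots between two copies of the same data. It is not the tree that DSE builds during validation compaction; the row encoding, the use of SHA-256, and the merkle_root helper are assumptions made for illustration.

```python
import hashlib


def merkle_root(leaves):
    """Hash leaf values pairwise, level by level, down to a single root digest."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical row contents held by two replicas of the same token range.
replica_a = [b"row1:v2", b"row2:v1", b"row3:v1"]
replica_b = [b"row1:v1", b"row2:v1", b"row3:v1"]  # stale copy of row1

in_sync = merkle_root(replica_a) == merkle_root(replica_b)
print("replicas in sync:", in_sync)
```

A single node has nothing to compare its root against, so there is no inconsistency for a repair to detect or fix.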