Repair Service overview

The Repair Service runs as a background process that cyclically repairs DSE clusters within the specified completion time.

The Repair Service runs repair operations, which synchronize the most current data across nodes and their replicas, including repairing any corrupted data encountered at the filesystem level. By default, the Repair Service runs subrange repairs for most tables, and can be configured to run incremental repairs on certain tables. Distributed subrange repair is an alternative implementation of subrange repairs within the Repair Service, intended to scale better for large clusters.

To determine which type of repair to use, and when to use it, follow these guidelines:
  • If data is relatively static, configure incremental repair for those tables or datacenters.
  • If data is dynamic and constantly changing, use subrange repairs, excluding keyspaces and tables as appropriate for your environment.
  • If repairing a very large cluster, and the opscenterd process becomes a bottleneck for timely subrange repairs, use distributed subrange repairs.
Warning: Do not run a manual node repair using nodetool repair or OpsCenter while the Repair Service is running.

Subrange repairs

Subrange repairs repair a portion of the data that a node is responsible for. Subrange repairs are analogous to specifying the -st and -et options on the nodetool repair command, except that the Repair Service determines and optimizes the start and end tokens of each subrange for you. The main benefit of subrange repair is more precise targeting of repairs while avoiding overstreaming.
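To illustrate the idea, the following sketch splits a full token range into contiguous subranges, each analogous to a `-st`/`-et` pair passed to nodetool repair. The token values and splitting policy are illustrative only, not the Repair Service's actual optimization logic.

```python
def split_range(start_token, end_token, num_subranges):
    """Divide (start_token, end_token] into contiguous (st, et) subranges."""
    step = (end_token - start_token) // num_subranges
    subranges = []
    st = start_token
    for i in range(num_subranges):
        # The last subrange absorbs any remainder so the full range is covered.
        et = end_token if i == num_subranges - 1 else st + step
        subranges.append((st, et))
        st = et
    return subranges

# Each (st, et) pair could then drive a repair of just that slice of data,
# conceptually similar to:
#   nodetool repair -st <st> -et <et> my_keyspace
subranges = split_range(0, 1000, 4)
```

Repairing many small slices like this keeps each Merkle-tree comparison narrow, which is what limits overstreaming.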

Distributed subrange repairs

Distributed subrange repairs are designed for repairing large clusters. Rather than coordinating every subrange itself, opscenterd instructs each agent to repair a list of one or more entire keyspaces. The agent handles repairing those keyspaces on a per-subrange basis, so opscenterd coordinates a much smaller list of coarser-grained repair tasks whose details are handled by each agent.

Incremental repairs

Incremental repairs only repair data that has not been previously repaired on tables reserved and configured for incremental repair.

Subrange repairs operate on an exclusion (opt-out) basis: certain keyspaces and tables can be excluded. Tables ignored by subrange repairs consist of those reserved by OpsCenter and those configured by administrators. Incremental repairs operate on an inclusion (opt-in) basis: only those keyspaces and tables designated for incremental repairs are processed during an incremental repair. Tables flagged for incremental repair include those reserved by OpsCenter and those configured by administrators.
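As an illustration, the opt-out and opt-in lists can be expressed in opscenterd's repair service configuration. The section and option names below are assumptions for illustration; verify them against the configuration reference for your OpsCenter version. The keyspace and table names are placeholders.

```ini
[repair_service]
# Opt out: exclude these keyspaces/tables from subrange repairs (illustrative).
ignore_keyspaces = my_unused_keyspace
ignore_tables = my_keyspace.archive_table

# Opt in: only tables listed here are processed by incremental repairs (illustrative).
incremental_repair_tables = my_keyspace.static_lookup_table
```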


There is no crossover between subrange and incremental repairs: the two repair types are mutually exclusive at the table level, so each keyspace and table is repaired by one type or the other. The Repair Service runs both repair types simultaneously. Each repair type has its own timeline, tracked in its respective subrange or incremental progress bar in the Repair Status summary.

Parallel vs. sequential validation compaction processing

The Repair Service runs validation compactions in parallel by default rather than sequentially because sequential processing takes considerably more time. The snapshot_override setting controls whether validation compactions for both subrange and incremental repairs are processed in parallel or sequentially. See Running validation compaction sequentially.
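For example, sequential processing might be enabled with a setting along these lines in the repair service configuration. The exact section placement and the meaning of the value should be confirmed in the Running validation compaction sequentially topic for your OpsCenter version; this is a sketch, not a verified default.

```ini
[repair_service]
# Assumed usage: override the default parallel validation compactions
# and run them sequentially instead.
snapshot_override = True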

Conditions under which the Repair Service does not run

A cluster with a single node is not eligible for repairs. Repairs make replicas consistent across nodes; therefore, there must be at least two nodes to exchange Merkle trees during the repair process.