How the Repair Service works

The Repair Service runs continuously as a background process, incrementally and cyclically repairing a DSE cluster within the specified completion time. This overview describes the behavior of the Repair Service and how it responds to changes in cluster topology or schemas.

The Repair Service works by repairing small chunks of a cluster in the background. The service takes a single parameter, time_to_completion, which is the maximum amount of time allowed for repairing the entire cluster once. Typically, you set this to a value lower than your lowest gc_grace_seconds setting (the default for gc_grace_seconds is 10 days). The service might run multiple repairs in parallel, but runs only as many as needed to complete within the specified time. The service never runs more than one repair at a time within a single replica set.
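The "as few as needed" rule can be illustrated with a little arithmetic. The following is a minimal sketch, not the actual OpsCenter implementation: the function name and its three inputs (remaining subranges, average seconds per subrange, seconds left in the cycle) are hypothetical and chosen only for illustration.

```python
import math


def required_parallel_repairs(remaining_subranges, avg_seconds_per_subrange,
                              seconds_remaining):
    """Smallest number of parallel repairs that finishes the remaining work in time.

    All three inputs are illustrative assumptions, not OpsCenter settings.
    """
    if remaining_subranges <= 0:
        return 0  # nothing left to repair in this cycle
    if seconds_remaining <= 0:
        raise ValueError("no time left in the current cycle")
    # Total sequential work, in seconds, if repairs ran one at a time.
    total_work = remaining_subranges * avg_seconds_per_subrange
    # Run as few repairs in parallel as possible, but always at least one.
    return max(1, math.ceil(total_work / seconds_remaining))
```

For example, 1000 remaining subranges at 60 seconds each need 60000 seconds of sequential work, so with 30000 seconds left in the cycle this sketch asks for two parallel repairs.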

The Repair Service uses an average of the throughput of recent repairs to calculate how many parallel repairs OpsCenter can complete in the current cycle. Before issuing a new subrange repair, the Repair Service checks the number of repairs already pending on the target node. If the new repair would exceed the configured maximum pending repairs threshold, the Repair Service skips that node for the time being to avoid overwhelming an already swamped node. The repair task is moved to the back of the pending repair tasks queue and an alert is fired.
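The skip-and-requeue behavior described above can be sketched as follows. This is an illustration under stated assumptions, not OpsCenter code: the task dictionaries, the pending_by_node counter, the max_pending argument, and the alerts list are all hypothetical stand-ins.

```python
from collections import deque


def schedule_next(queue, pending_by_node, max_pending, alerts):
    """Take the next repair task; run it, or requeue it if its node is too busy.

    queue           -- deque of task dicts, each with a "node" key (hypothetical shape)
    pending_by_node -- dict mapping node name to its current pending repair count
    max_pending     -- stand-in for the configured maximum pending repairs threshold
    alerts          -- list that collects alert messages
    Returns the task to run, or None if nothing was started.
    """
    if not queue:
        return None
    task = queue.popleft()
    node = task["node"]
    # Would issuing this repair exceed the threshold for the node?
    if pending_by_node.get(node, 0) + 1 > max_pending:
        queue.append(task)  # move the task to the back of the queue
        alerts.append(f"repair skipped for node {node}: too many pending repairs")
        return None
    pending_by_node[node] = pending_by_node.get(node, 0) + 1
    return task
```

A skipped task is not lost: it stays in the queue and is retried after the tasks ahead of it have been considered.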

Restarting opscenterd

The current state of the Repair Service is persisted locally on the opscenterd server every five minutes by default. If opscenterd is restarted, the Repair Service resumes where it left off.
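The persist-and-resume pattern can be sketched in a few lines. This is a generic illustration only: the JSON format, the file path, and the function names are assumptions, and the actual on-disk format used by opscenterd is not described here.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_state(path, default=None):
    """Return the last saved state, or the default if nothing was saved yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

A service built this way would call save_state on a timer (every five minutes, in the case described above) and call load_state once at startup to resume where it left off.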

Note: If automatic failover is configured, be sure to mirror the Repair Service persist directory.

For more information on Repair Service continuity during a failure, see failover aftereffects.

Known limitations

If a cluster is datacenter aware and has keyspaces using SimpleStrategy, the Repair Service fails to start. Follow the prompts to change the keyspaces to NetworkTopologyStrategy.

Changes in cluster topology

If a change in cluster topology occurs, the Repair Service stops its current cycle and waits for the ring to stabilize before starting a new cycle. The service checks the ring every five minutes.

Topology changes:

Nodes moving
Nodes joining a cluster
Nodes leaving a cluster
Nodes being decommissioned
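The stop-and-wait decision for the topology changes listed above can be sketched as follows. This is a hedged illustration, not the OpsCenter implementation: the state names, the shape of the ring dictionary, and the returned action strings are hypothetical.

```python
# Node states that count as a topology change in this sketch (hypothetical names).
UNSTABLE_STATES = {"moving", "joining", "leaving", "decommissioning"}


def ring_is_stable(ring):
    """ring maps node name to its current state, for example {"node1": "normal"}."""
    return all(state not in UNSTABLE_STATES for state in ring.values())


def on_topology_check(ring, cycle_running):
    """Decide what to do at each periodic check of the ring."""
    if not ring_is_stable(ring):
        # Stop the current cycle (if any) and wait for the ring to settle.
        return "stop_and_wait"
    # The ring is stable: keep going, or start a fresh cycle after a stop.
    return "continue" if cycle_running else "start_new_cycle"
```

Note that after a stop, the sketch starts a new cycle rather than resuming the interrupted one, matching the behavior described in this section.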

Changes in schemas

Keyspaces added while the Repair Service is running are repaired when the next subrange repair is started.
Column families (tables) added to existing keyspaces are repaired immediately during the current cycle of the Repair Service.
Keyspaces or column families (tables) can be removed while the Repair Service is running without causing any issues.