How Repair Service works

The Repair Service works by repairing small chunks of your cluster in the background. The service takes a single parameter, time_to_completion, which is the maximum amount of time allowed to repair the entire cluster once. Typically, you set this to a value lower than your lowest gc_grace_seconds setting (the default for gc_grace_seconds is 10 days), so that every node is repaired before tombstones become eligible for garbage collection. The service may run multiple repairs in parallel, but runs only as many as it needs to finish within the specified time, and never runs more than one repair at a time in a single replica set.
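The relationship between time_to_completion and gc_grace_seconds can be checked with a few lines of Python. This is only a sketch: the function name is_safe_time_to_completion is hypothetical, and expressing both values in seconds is an assumption made for the example, not a statement about how the Repair Service stores its setting.

```python
SECONDS_PER_DAY = 86_400
# The default gc_grace_seconds is 10 days, i.e. 864000 seconds.
DEFAULT_GC_GRACE_SECONDS = 10 * SECONDS_PER_DAY


def is_safe_time_to_completion(time_to_completion_seconds, gc_grace_seconds_values):
    """Return True if a full repair cycle fits inside the lowest gc_grace_seconds.

    gc_grace_seconds_values is an iterable of the gc_grace_seconds settings
    in use on the cluster. If it is empty, the 10-day default is assumed.
    """
    lowest = min(gc_grace_seconds_values, default=DEFAULT_GC_GRACE_SECONDS)
    # The repair window must be strictly lower than the lowest setting.
    return time_to_completion_seconds < lowest


if __name__ == "__main__":
    nine_days = 9 * SECONDS_PER_DAY
    print(is_safe_time_to_completion(nine_days, [864_000]))
    print(is_safe_time_to_completion(nine_days, [864_000, 432_000]))
```

With a 9-day window, the first call passes against the default setting, while the second fails because one value has been lowered to 5 days.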

The Repair Service uses an average of the throughput of recent repairs to calculate how many parallel repairs can be completed in the current cycle.
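One way to picture this calculation is the sketch below. It is an illustration under stated assumptions, not the Repair Service's actual algorithm: the function parallel_repairs_needed, its parameters, and the use of a simple mean of recent repair durations are all hypothetical. The max_parallel argument stands in for the rule that at most one repair runs per replica set.

```python
import math


def parallel_repairs_needed(remaining_subranges, seconds_remaining,
                            recent_durations, max_parallel):
    """Estimate how many repairs must run in parallel to finish on time.

    remaining_subranges: subrange repairs left in the current cycle
    seconds_remaining:   time left before time_to_completion expires
    recent_durations:    durations (seconds) of recently completed repairs
    max_parallel:        upper bound, e.g. the number of disjoint replica sets
    """
    if remaining_subranges <= 0 or not recent_durations:
        # Nothing left to do, or no history yet: stay at the minimum.
        return 1
    if seconds_remaining <= 0:
        # Already behind schedule: use every slot that is allowed.
        return max_parallel

    average = sum(recent_durations) / len(recent_durations)
    # Subranges that a single sequential worker can finish in the time left.
    per_worker = seconds_remaining / average
    needed = math.ceil(remaining_subranges / per_worker)
    # Run as few repairs as needed, but never exceed the allowed maximum.
    return max(1, min(needed, max_parallel))


if __name__ == "__main__":
    print(parallel_repairs_needed(1000, 10_000, [10, 20, 30], max_parallel=4))
```

The estimate rises only when the recent average shows that a single worker cannot finish the remaining subranges in time, which matches the goal of running as few parallel repairs as possible.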

Restarting opscenterd

The current state of the Repair Service is persisted locally on the opscenterd server every five minutes by default. If opscenterd is restarted, the Repair Service resumes where it left off.

Known limitations

If a cluster is data center aware and has keyspaces that use SimpleStrategy, the Repair Service fails to start.

Changes in cluster topology

If a change in cluster topology occurs, the Repair Service stops its current cycle and waits for the ring to stabilize before starting a new cycle. The service checks whether the ring has stabilized every five minutes.

Topology changes:
  • Nodes moving
  • Nodes joining
  • Nodes leaving
  • Nodes going up/down
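The four kinds of change listed above can be told apart by comparing two snapshots of the ring taken at consecutive checks. The sketch below is hypothetical: the snapshot format (node address mapped to a token and an up/down flag) and the function describe_changes are assumptions chosen for illustration, not the data structures opscenterd uses.

```python
CHECK_INTERVAL_SECONDS = 5 * 60  # the ring is checked every five minutes


def describe_changes(previous, current):
    """Classify the differences between two ring snapshots.

    Each snapshot maps a node address to a (token, is_up) tuple.
    Returns a list of (node, change) pairs; an empty list means the
    ring looked the same at both checks.
    """
    changes = []
    for node in sorted(current.keys() - previous.keys()):
        changes.append((node, "joining"))
    for node in sorted(previous.keys() - current.keys()):
        changes.append((node, "leaving"))
    for node in sorted(previous.keys() & current.keys()):
        old_token, old_up = previous[node]
        new_token, new_up = current[node]
        if old_token != new_token:
            changes.append((node, "moving"))
        if old_up != new_up:
            changes.append((node, "up" if new_up else "down"))
    return changes


if __name__ == "__main__":
    before = {"10.0.0.1": (0, True), "10.0.0.2": (100, True)}
    after = {"10.0.0.1": (50, True), "10.0.0.2": (100, False),
             "10.0.0.3": (200, True)}
    for node, change in describe_changes(before, after):
        print(node, change)
```

In this model, a new cycle would start only once a check returns an empty list, meaning the ring did not change between two consecutive snapshots.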

Changes in schemas

  • Keyspaces added while the Repair Service is running are repaired when the next subrange repair starts.
  • Column families added to existing keyspaces are repaired immediately, during the current cycle of the Repair Service.
  • Keyspaces and column families can be removed while the Repair Service is running without causing any issues.