How the Repair Service works

The Repair Service runs continuously as a background process, incrementally and cyclically repairing a DataStax Enterprise cluster within the specified completion time. This overview describes the Repair Service behavior and its response to changes in cluster topology or schemas.

The Repair Service works by repairing small chunks of the cluster in the background.

The Repair Service uses the average throughput of recent repairs to calculate how many parallel repairs OpsCenter must run to complete the current cycle on time. Before issuing a new subrange repair, the Repair Service checks the number of repairs already pending on the target node. If the configured maximum pending repairs threshold would be exceeded, the Repair Service skips that node for the time being to avoid overwhelming an already swamped node, moves the repair task to the back of the pending repair tasks queue, and fires an alert.
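
The exact scheduling logic is internal to OpsCenter; the following is a minimal sketch of the idea, with made-up names such as MAX_PENDING_REPAIRS, RepairTask, and next_repair (none of these are OpsCenter APIs):

    import math
    from collections import deque, namedtuple

    # Illustrative sketch only; names and thresholds are assumptions, not OpsCenter APIs.
    MAX_PENDING_REPAIRS = 5                      # assumed per-node pending-repair threshold

    RepairTask = namedtuple("RepairTask", ["node", "subrange"])

    def parallel_repairs_needed(subranges_left, minutes_left, recent_throughputs):
        """Estimate how many repairs must run in parallel to finish the cycle on time,
        from the average throughput (subranges per minute) of recent repairs."""
        avg = sum(recent_throughputs) / len(recent_throughputs)
        required_rate = subranges_left / minutes_left
        return max(1, math.ceil(required_rate / avg))

    def next_repair(queue, pending_per_node, alerts):
        """Pop the next task; if its node is already at the pending limit, push the
        task to the back of the queue and record an alert instead of issuing it."""
        task = queue.popleft()
        if pending_per_node.get(task.node, 0) >= MAX_PENDING_REPAIRS:
            queue.append(task)
            alerts.append(f"node {task.node} over max pending repairs; skipped for now")
            return None
        return task

    # Example: roughly 3 parallel repairs needed; node n1 is swamped, so its task is requeued.
    queue = deque([RepairTask("n1", (0, 100)), RepairTask("n2", (100, 200))])
    alerts = []
    print(parallel_repairs_needed(subranges_left=600, minutes_left=240, recent_throughputs=[0.8, 1.0, 0.9]))
    print(next_repair(queue, {"n1": 5}, alerts), alerts)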

Parameters 

  • The time_to_completion parameter sets the maximum amount of time allowed to repair the entire cluster once. Typically, set this value lower than the lowest gc_grace_seconds setting in the cluster; the default for gc_grace_seconds is 10 days. The service might run multiple repairs in parallel, but runs only as many as needed to complete within the specified time, and never runs more than one repair within a single replica set at a time.
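
For example, with the 10-day default for gc_grace_seconds, a time_to_completion of 9 days still finishes a full cycle before tombstones can be collected. A small sketch of that sanity check, using made-up per-table settings:

    # Illustrative only: choose a time_to_completion below the lowest gc_grace_seconds.
    DAY = 24 * 60 * 60

    # Hypothetical per-table gc_grace_seconds values, in seconds.
    gc_grace_by_table = {
        "ks1.events": 10 * DAY,      # Cassandra default (10 days)
        "ks1.users": 7 * DAY,
        "ks2.metrics": 10 * DAY,
    }

    lowest = min(gc_grace_by_table.values())
    time_to_completion = lowest - DAY            # one day of headroom (an assumption)

    print(f"lowest gc_grace_seconds: {lowest // DAY} days")
    print(f"suggested time_to_completion: {time_to_completion // DAY} days")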

Estimating remaining repair time 

If the Repair Service anticipates that it cannot complete a repair within the allotted time to completion because throughput is lower than expected, it displays a warning message and a new estimate of the time remaining to complete the repair. The Repair Service does not adjust the configured time to completion; it reports the revised estimate without stopping the repair in progress.

When the Repair Service estimates that it will not finish a repair within the configured time_to_completion, it fires an ALERT event. The alert is recorded in opscenterd.log and appears in the Event Log in the Activities section of the OpsCenter UI. If email alerts or post-url alert notifications are configured, the alert notifications are emailed or posted.

The error_logging_window configuration property controls how often to fire the alert if the Repair Service continues to estimate that it will not finish the repair in time.
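
A minimal sketch of how such throttling could behave, assuming error_logging_window is a duration in seconds (only the property name comes from the text above; the class and logic are illustrative):

    import time

    # Illustrative throttle: fire the "will not finish in time" alert at most once
    # per error_logging_window while the estimate remains over the configured limit.
    class RepairAlertThrottle:
        def __init__(self, error_logging_window_seconds):
            self.window = error_logging_window_seconds
            self.last_fired = None

        def maybe_alert(self, estimated_seconds_remaining, seconds_left_in_cycle):
            if estimated_seconds_remaining <= seconds_left_in_cycle:
                return False                       # on track, nothing to report
            now = time.monotonic()
            if self.last_fired is None or now - self.last_fired >= self.window:
                self.last_fired = now
                print("ALERT: repair not expected to finish within time_to_completion")
                return True
            return False                           # still behind, but within the window

    # Example: the second call within the window is suppressed.
    throttle = RepairAlertThrottle(error_logging_window_seconds=3600)
    throttle.maybe_alert(estimated_seconds_remaining=90_000, seconds_left_in_cycle=60_000)
    throttle.maybe_alert(estimated_seconds_remaining=90_000, seconds_left_in_cycle=59_000)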

Known limitations 

If a cluster is datacenter-aware and has any keyspaces using SimpleStrategy, the Repair Service fails to start. Follow the prompts to change those keyspaces to use NetworkTopologyStrategy.
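
One way to find the offending keyspaces before enabling the service is to query the schema tables with the DataStax Python driver; this sketch assumes a contact point of 127.0.0.1 and the system_schema layout of Cassandra 3.0+ / DSE 5.0+:

    # Sketch: list keyspaces still using SimpleStrategy so they can be switched to
    # NetworkTopologyStrategy before the Repair Service is started.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    rows = session.execute("SELECT keyspace_name, replication FROM system_schema.keyspaces")
    for row in rows:
        if "SimpleStrategy" in row.replication.get("class", ""):
            print(f"{row.keyspace_name} uses SimpleStrategy; "
                  f"change it to NetworkTopologyStrategy before starting the Repair Service")

    cluster.shutdown()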

Cluster topology changes 

If a change in cluster topology occurs, the Repair Service stops its current cycle and waits for the ring to stabilize before starting a new cycle. The check for topology changes occurs every five minutes; a sketch of this check follows the list below. Topology changes include:
  • Nodes moving within a cluster
  • Nodes joining a cluster
  • Nodes leaving a cluster
  • Nodes being decommissioned
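
A minimal sketch of the five-minute topology check described above, using a made-up node-to-token snapshot as the topology representation (not how OpsCenter models the ring):

    # Illustrative only: pause the repair cycle on a topology change, and start a
    # new cycle once two consecutive snapshots match (the ring has stabilized).
    CHECK_INTERVAL = 5 * 60   # seconds; check() would be called on this interval

    class TopologyWatcher:
        def __init__(self, initial_topology):
            self.last = initial_topology
            self.waiting_for_stability = False

        def check(self, current_topology):
            """Called once per CHECK_INTERVAL with a fresh topology snapshot."""
            if current_topology != self.last:
                self.waiting_for_stability = True
                action = "stop current cycle"     # node moved, joined, left, or was decommissioned
            elif self.waiting_for_stability:
                self.waiting_for_stability = False
                action = "start new cycle"        # ring stable across two consecutive checks
            else:
                action = "continue"
            self.last = current_topology
            return action

    # Example with made-up snapshots (node -> owned token range):
    watcher = TopologyWatcher({"n1": "0-100", "n2": "100-200"})
    print(watcher.check({"n1": "0-100", "n2": "100-200", "n3": "200-300"}))  # stop current cycle
    print(watcher.check({"n1": "0-100", "n2": "100-200", "n3": "200-300"}))  # start new cycle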

Schema changes 

  • Keyspaces added while the Repair Service is running are repaired when the next subrange repair is started.
  • Tables added to existing keyspaces are repaired immediately during the current cycle of the Repair Service.
  • Keyspaces or tables can be removed while the Repair Service is running without causing any issues.

Down nodes 

When one or more nodes are down, the Repair Service continues to run repairs for ranges and keyspaces unaffected by the down nodes. If enough nodes are down that the rest of the cluster cannot be repaired, the Repair Service repairs as much as possible under the circumstances, then waits up to three hours before restarting a cycle.
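
A small sketch of that filtering, with a hypothetical replica lookup (OpsCenter's actual range-to-replica mapping is not exposed this way):

    # Illustrative only: skip subranges whose replica set includes a down node,
    # and keep repairing everything else while those nodes are unavailable.
    def repairable_subranges(subranges, replicas_for, down_nodes):
        """Return the subranges whose replicas are all up.

        subranges    -- iterable of (start_token, end_token) pairs
        replicas_for -- callable mapping a subrange to its replica nodes (hypothetical)
        down_nodes   -- set of nodes currently marked down
        """
        return [sr for sr in subranges if not (set(replicas_for(sr)) & down_nodes)]

    # Example with made-up data: the (100, 200) range is skipped because n3 is down.
    replica_map = {(0, 100): ["n1", "n2"], (100, 200): ["n2", "n3"]}
    print(repairable_subranges(replica_map.keys(), replica_map.get, down_nodes={"n3"}))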

Restarting opscenterd 

The current state of the Repair Service is persisted locally on the opscenterd server every five minutes by default. If opscenterd is restarted, the Repair Service resumes where it left off. For more information about Repair Service continuity during a failure, see failover aftereffects.
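
A minimal sketch of that checkpoint-and-resume pattern; the file name, JSON format, and state fields below are illustrative assumptions, and only the five-minute interval comes from the text above:

    # Illustrative checkpointing: persist repair progress locally at a fixed interval
    # so a restarted process can resume from the last saved position.
    import json
    import os

    STATE_FILE = "repair_service_state.json"      # hypothetical local state file
    PERSIST_INTERVAL = 5 * 60                     # seconds, per the default above

    def save_state(next_subrange_index, cycle_started_at):
        state = {"next_subrange_index": next_subrange_index,
                 "cycle_started_at": cycle_started_at}
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)

    def load_state():
        """Resume from the saved position, or start a fresh cycle if no state exists."""
        if not os.path.exists(STATE_FILE):
            return {"next_subrange_index": 0, "cycle_started_at": None}
        with open(STATE_FILE) as f:
            return json.load(f)

    # Example: save progress, then pretend the process restarted and resume.
    save_state(next_subrange_index=42, cycle_started_at="2024-01-01T00:00:00Z")
    print(load_state())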