How the Repair Service works
The Repair Service runs continuously as a background process. The Repair Service incrementally and cyclically repairs a DataStax Enterprise cluster within the specified completion time. This overview describes the Repair Service behavior and its response to changes in cluster topology or schemas.
The Repair Service works by repairing small chunks of a cluster in the background.
The Repair Service uses an average of the throughput of recent repairs to calculate how many parallel repairs OpsCenter can complete in the current cycle. Before issuing a new subrange repair, the Repair Service checks the number of repairs pending on the target node. If issuing the repair would exceed the configured maximum pending repairs threshold, the Repair Service skips that node for the time being to avoid overwhelming an already swamped node. The repair task moves to the back of the pending repair tasks queue and an alert is fired.
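The pending-repairs check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the OpsCenter implementation: `RepairTask`, `schedule_next`, `pending_by_node`, and `max_pending` are hypothetical names, and the throughput average is taken as a plain mean over recent repair durations.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean


@dataclass
class RepairTask:
    """A hypothetical subrange repair task bound for one node."""
    node: str
    subrange: tuple  # (start_token, end_token)


def average_throughput(recent_durations_s):
    """Repairs per second, averaged over recently completed repairs."""
    return 1.0 / mean(recent_durations_s)


def schedule_next(queue, pending_by_node, max_pending, alerts):
    """Take the next task; defer it if its node is already saturated.

    Returns the task to run, or None if the task was moved to the
    back of the queue and an alert was recorded.
    """
    task = queue.popleft()
    pending = pending_by_node.get(task.node, 0)
    if pending + 1 > max_pending:
        queue.append(task)  # back of the pending repair tasks queue
        alerts.append(f"skipped {task.node}: pending repairs threshold reached")
        return None
    pending_by_node[task.node] = pending + 1
    return task
```

A deferred task is not dropped. It is retried when it reaches the front of the queue again, by which time the node may have caught up.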
Parameters
- The time_to_completion parameter is the maximum amount of time it takes to repair the entire cluster once. Typically, you set this value lower than your lowest gc_grace_seconds setting. The default for gc_grace_seconds is 10 days. The service might run multiple repairs in parallel, but runs as few as needed to complete within the amount of time specified. The service always avoids running more than one repair within a single replica set.
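As a rough illustration of the two constraints in this parameter description, the sketch below checks a proposed completion time against the lowest gc_grace_seconds value and estimates the fewest parallel repairs needed to finish on time. The function names and the simple work-divided-by-time estimate are assumptions made for illustration. They are not taken from OpsCenter.

```python
import math

# Default gc_grace_seconds: 10 days expressed in seconds.
DEFAULT_GC_GRACE_SECONDS = 10 * 24 * 60 * 60


def completes_before_gc_grace(time_to_completion_s, gc_grace_by_table):
    """True if the full repair cycle is shorter than the lowest gc_grace_seconds.

    gc_grace_by_table maps a table name to its gc_grace_seconds; when it is
    empty, the default value is assumed.
    """
    lowest = min(gc_grace_by_table.values(), default=DEFAULT_GC_GRACE_SECONDS)
    return time_to_completion_s < lowest


def parallel_repairs_needed(remaining_subranges, avg_repair_s, time_left_s):
    """Fewest concurrent repairs that finish the remaining work in time."""
    if time_left_s <= 0:
        raise ValueError("no time left in the current cycle")
    total_work_s = remaining_subranges * avg_repair_s
    return max(1, math.ceil(total_work_s / time_left_s))
```

For example, 1000 remaining subranges at an average of 60 seconds each with 30000 seconds left need two repairs running in parallel. A real scheduler would also have to respect the rule of one repair per replica set.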
Estimating remaining repair time
If the Repair Service anticipates it cannot complete a repair within the allotted time to completion due to throughput errors, it displays a warning message and a newly estimated time remaining to complete the repair. The Repair Service does not adjust the configured time to completion; it reports the revised estimate for completion without stopping the repair in progress.
When the Repair Service estimates that it will not finish a repair within the configured time_to_completion, it triggers an ALERT in the OpsCenter Event Log. The alert is also visible in the opscenterd.log, as well as the Event Log in the Activities section of the OpsCenter UI. If email alerts or post-url alert notifications are configured, the alert notifications are emailed or posted.
The error_logging_window configuration property controls how often to fire the alert if the Repair Service continues to estimate that it will not finish the repair in time.
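One way to picture the alert throttling is a minimal sketch like the following. It assumes that error_logging_window acts as a minimum interval, in seconds, between repeated alerts. The `LateRepairAlerter` class and its method are hypothetical and do not reflect OpsCenter internals.

```python
class LateRepairAlerter:
    """Fires a 'will not finish in time' alert at most once per window."""

    def __init__(self, error_logging_window_s):
        self.window_s = error_logging_window_s
        self.last_alert_at = None  # timestamp of the most recent alert

    def maybe_alert(self, now_s, estimated_remaining_s, allotted_remaining_s):
        """Return True if an alert should fire at time now_s."""
        if estimated_remaining_s <= allotted_remaining_s:
            return False  # still on track, nothing to report
        recently_alerted = (
            self.last_alert_at is not None
            and now_s - self.last_alert_at < self.window_s
        )
        if recently_alerted:
            return False  # suppress repeats inside the window
        self.last_alert_at = now_s
        return True
```

The repair itself keeps running in every case. Only the frequency of the warning changes.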
Known limitations
If a cluster is datacenter-aware and has any keyspaces using SimpleStrategy, the Repair Service will fail to start. Follow the prompts to change the keyspaces to use NetworkTopologyStrategy.
Cluster topology changes
- Nodes moving within a cluster
- Nodes joining a cluster
- Nodes leaving a cluster
- Nodes being decommissioned
Schema changes
- Keyspaces added while the Repair Service is running are repaired when the next subrange repair is started.
- Tables added to existing keyspaces are repaired immediately during the current cycle of the Repair Service.
- Keyspaces or tables can be removed while the Repair Service is running without causing any issues.
Down nodes
When one or more nodes are down, the Repair Service continues to run repairs for ranges and keyspaces unaffected by the down nodes. If enough nodes are down that repairing the cluster is impossible during the downtime, the Repair Service repairs as much as it can under the circumstances and then waits up to three hours before restarting a cycle.
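The idea of repairing only unaffected ranges can be sketched as a filter over token ranges. The sketch assumes that a range counts as affected when any of its replicas is down. That assumption, the `repairable_ranges` name, and the input shape are illustrative only.

```python
def repairable_ranges(replicas_by_range, down_nodes):
    """Return the ranges whose replicas are all up.

    replicas_by_range maps a (start_token, end_token) tuple to the
    nodes that hold replicas of that range.
    """
    down = set(down_nodes)
    return [
        token_range
        for token_range, replicas in replicas_by_range.items()
        if down.isdisjoint(replicas)
    ]


ring = {
    (0, 100): ["n1", "n2", "n3"],
    (100, 200): ["n2", "n3", "n4"],
    (200, 300): ["n3", "n4", "n1"],
}
# With n1 down, only the middle range has every replica available.
still_repairable = repairable_ranges(ring, ["n1"])
```

When the returned list is empty, nothing can be repaired until nodes come back, which corresponds to the waiting behavior described above.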
Restarting opscenterd
The current state of the Repair Service is saved on the opscenterd server every five minutes by default. If opscenterd is restarted, the Repair Service resumes where it left off.
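Resuming after a restart implies some form of checkpointing. The sketch below shows one generic way to do this with a JSON file that is replaced atomically. The file layout, the `completed_subranges` key, and both function names are hypothetical and do not describe the actual on-disk format used by opscenterd.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # readers never see a half-written file


def load_state(path):
    """Load the last checkpoint, or start fresh if none exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"completed_subranges": []}
```

A service using this pattern would call `save_state` on a timer, for example every five minutes, and call `load_state` once at startup. At most one interval of progress is repeated after a restart.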
For more information on Repair Service continuity during a failure, see failover aftereffects.