Repair Service behavior during environment changes
The following sections provide details on how the Repair Service behaves when there are changes in the environment such as topology changes, down nodes, and OpsCenter restarts.
The Repair Service detects topology changes to a cluster almost immediately.
When a change in cluster topology occurs, the Repair Service stops its current repair cycle and waits for the ring to stabilize before restarting a new cycle.
The restart period is controlled by the
restart_period configuration option, which defaults to 300 seconds (5 minutes).
While paused, the Repair Service periodically checks the state of the cluster until it is able to reactivate.
Before resuming repairs, the Repair Service checks the cluster state every 30 seconds by default.
After the cluster has stabilized, the checks for cluster stabilization cease until the next time
opscenterd is restarted.
Configure the interval for the stable cluster check in the Repair Service configuration.
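As an illustrative sketch, a longer restart period can be set with the restart_period option. The [repair_service] section name and the opscenterd.conf location shown here are assumptions about where Repair Service options are configured in your deployment; verify against your OpsCenter configuration files before applying:

```ini
[repair_service]
# Wait 10 minutes (600 seconds) after a topology change before
# starting a new repair cycle; the default is 300 seconds.
restart_period = 600
```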
Topology changes include:
Nodes moving within a cluster
Nodes joining a cluster
Nodes leaving a cluster
When a schema change happens, the Repair Service pauses for five minutes by default, then starts back up and immediately begins repairing new keyspaces or tables. Schema changes include adding, changing, or removing keyspaces or tables.
A repair cannot run if any of the nodes in the replica set for that range are down. If an entire rack or data center goes down, it is unlikely that repair operations can successfully run on the cluster. When one or more nodes are down, the Repair Service continues to run repairs for ranges and keyspaces unaffected by the down nodes.
When there are no runnable repair operations remaining, the Repair Service waits for 10 seconds and checks again.
The Repair Service repeats this process for up to the duration set by the
max_down_node_retry option, which defaults to three hours based on the
max_hint_window_in_ms property in cassandra.yaml, and then starts a new cycle.
If max_hint_window_in_ms is exceeded for a down node, the recovery process for that node is to rebuild rather than to rely on hint replay.
Therefore the Repair Service starts a new cycle to ensure that any available ranges continue to be repaired and are not blocked by down nodes.
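The retry window for ranges blocked by down nodes can be adjusted with the max_down_node_retry option described above. The [repair_service] section name and the value below are illustrative assumptions, not recommended settings:

```ini
[repair_service]
# Keep retrying ranges blocked by down nodes for up to 4 hours
# (14400 seconds) before starting a new repair cycle.
# Defaults to three hours, based on max_hint_window_in_ms.
max_down_node_retry = 14400
```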
To mitigate the performance implications of scanning the entire list of remaining repair tasks, the scan for available ranges only examines the first portion of the remaining task list rather than the entire list.
If the Repair Service reports errors when activated, deactivate the Repair Service and then ensure all nodes are available before reactivating.
At the end of each persist period (one hour by default), the current state of the Repair Service is persisted locally on the
opscenterd server in the persist directory location.
You can configure the persist period frequency and the persist directory location with the
persist_period option and the
persist_directory option, respectively.
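A minimal sketch of both persistence options follows; the [repair_service] section name and the directory path are hypothetical examples, not defaults from this document:

```ini
[repair_service]
# Persist Repair Service state every 30 minutes (1800 seconds);
# the default persist period is one hour.
persist_period = 1800
# Directory on the opscenterd server where state is written
# (example path; substitute your own persist directory).
persist_directory = /var/lib/opscenter/repair_service
```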
When opscenterd is restarted, the Repair Service resumes where it left off based on the persisted state information.
For more information on Repair Service continuity during a failure, see failover aftereffects.