Repair Service behavior during environment changes

Environment changes can impact how the Repair Service behaves.

The following sections provide details on how the Repair Service behaves when there are changes in the environment such as topology changes, down nodes, and OpsCenter restarts.

Cluster topology changes

The Repair Service becomes aware of cluster topology changes almost immediately. When the topology changes, the Repair Service stops its current repair cycle and waits for the ring to stabilize before starting a new cycle. The length of this pause is controlled by the restart_period configuration option, which defaults to 300 seconds (5 minutes). While paused, the Repair Service rechecks the state of the cluster at this interval until it can reactivate.

Before resuming repairs, the Repair Service checks the cluster state every 30 seconds by default. After the cluster has stabilized, these stabilization checks cease until the next time opscenterd is restarted. Configure the interval for the stable cluster check with the cluster_stabilization_period option.

Topology changes include:

Nodes moving within a cluster
Nodes joining a cluster
Nodes leaving a cluster
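
The pause-and-recheck behavior described for restart_period can be pictured with a short sketch. This is an illustration only, not OpsCenter code: the function pause_until_stable and its ring_is_stable and sleep callbacks are hypothetical names, and the only value taken from the text is the 300-second default.

```python
from typing import Callable

RESTART_PERIOD = 300.0  # seconds; default of restart_period per the text


def pause_until_stable(
    ring_is_stable: Callable[[], bool],
    sleep: Callable[[float], None],
    restart_period: float = RESTART_PERIOD,
) -> float:
    """Pause, then re-check the ring once per restart_period until it is stable.

    ring_is_stable and sleep are injected (hypothetical) callbacks so the loop
    can be exercised without a real cluster or real waiting.
    Returns the total number of seconds spent paused.
    """
    waited = 0.0
    while True:
        sleep(restart_period)      # stay paused for one restart period
        waited += restart_period
        if ring_is_stable():       # ring has settled: a new cycle can start
            return waited
```

Passing time.sleep as the sleep callback gives real waiting; passing a stub, such as the append method of a list, makes the loop easy to inspect in a test.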

Schema changes

When a schema change occurs, the Repair Service pauses for five minutes by default, then restarts and immediately begins repairing any new keyspaces or tables. Schema changes include adding, changing, or removing keyspaces or tables.

Down nodes or replicas

A repair cannot run on a range if any node in the replica set for that range is down. If an entire rack or data center goes down, it is likely that no repair operations can run successfully on the cluster. When only one or a few nodes are down, the Repair Service continues to run repairs for the ranges and keyspaces unaffected by the down nodes.
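
The rule that a range is repairable only when every replica for it is up can be sketched as a simple filter. The function name, the range labels, and the node names below are made up for illustration and do not reflect how the Repair Service stores ranges internally.

```python
from typing import Dict, FrozenSet, List, Set


def runnable_ranges(
    replicas_by_range: Dict[str, FrozenSet[str]],
    down_nodes: Set[str],
) -> List[str]:
    """Return the ranges whose replica sets contain no down node."""
    return [
        token_range
        for token_range, replicas in replicas_by_range.items()
        if replicas.isdisjoint(down_nodes)  # every replica must be up
    ]


# Hypothetical six-node ring with three replicas per range.
ring = {
    "range-1": frozenset({"node1", "node2", "node3"}),
    "range-2": frozenset({"node2", "node3", "node4"}),
    "range-3": frozenset({"node4", "node5", "node6"}),
}
still_runnable = runnable_ranges(ring, down_nodes={"node1"})
```

With node1 down, only range-1 is blocked; the other two ranges remain eligible for repair.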

When there are no runnable repair operations remaining, the Repair Service waits for 10 seconds and checks again. The Repair Service repeats this check for up to the value configured for the max_down_node_retry option, which defaults to three hours based on the max_hint_window_in_ms property in cassandra.yaml, and then starts a new cycle. After max_hint_window_in_ms is exceeded for a down node, the recovery process for that node is to rebuild rather than rely on hint replay. Therefore, the Repair Service starts a new cycle to ensure that any available ranges continue to be repaired and are not blocked by down nodes.

Note: To mitigate the performance implications of scanning the entire list of remaining repair tasks, the scan for available ranges examines only the first prioritization_page_size tasks (default: 512). The order of these tasks is random, so if no available ranges are found among the first prioritization_page_size tasks, it is unlikely that any available ranges exist.
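
The retry behavior and the bounded scan can be combined into one rough sketch. Everything named here is hypothetical (find_runnable_task, wait_for_runnable_task, the is_runnable callback), and the retry limit is modeled as a plain number of seconds, max_wait_seconds, which is an assumption about how the three-hour default of max_down_node_retry is applied. The 10-second wait and the page size of 512 come from the text.

```python
from typing import Callable, Optional, Sequence


def find_runnable_task(
    tasks: Sequence[str],
    is_runnable: Callable[[str], bool],
    prioritization_page_size: int = 512,
) -> Optional[str]:
    """Scan only the first page of (randomly ordered) tasks for a runnable one."""
    for task in tasks[:prioritization_page_size]:
        if is_runnable(task):
            return task
    return None


def wait_for_runnable_task(
    tasks: Sequence[str],
    is_runnable: Callable[[str], bool],
    sleep: Callable[[float], None],
    max_wait_seconds: float = 3 * 60 * 60,  # assumed stand-in for the three-hour default
    retry_interval: float = 10.0,           # 10-second wait from the text
    prioritization_page_size: int = 512,
) -> Optional[str]:
    """Retry the page scan until a task is runnable or the wait limit is reached.

    Returns None when the limit is reached, which is the point at which the
    caller would start a new repair cycle.
    """
    waited = 0.0
    while True:
        task = find_runnable_task(tasks, is_runnable, prioritization_page_size)
        if task is not None:
            return task
        if waited >= max_wait_seconds:
            return None
        sleep(retry_interval)
        waited += retry_interval
```

Because the scan stops at the page boundary, a runnable task that sits beyond the first page is not found in that pass, which is the trade-off the note describes.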

Persisted repair state when restarting opscenterd

At the end of each persist period (one hour by default), the current state of the Repair Service is persisted locally on the opscenterd server in the persist directory. Configure the length of the persist period with the persist_period option and the location of the persist directory with the persist_directory option. When opscenterd is restarted, the Repair Service resumes where it left off based on the persisted state information.
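
The general idea of saving state periodically and reloading it on restart can be sketched as follows. The file name repair_state.json, the JSON format, and the contents of the state dictionary are all assumptions made for illustration; the text does not describe how opscenterd actually stores this state.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional

STATE_FILE = "repair_state.json"  # hypothetical file name, not from OpsCenter


def persist_state(persist_directory: str, state: Dict[str, Any]) -> str:
    """Write the state to the persist directory, replacing any earlier snapshot."""
    os.makedirs(persist_directory, exist_ok=True)
    path = os.path.join(persist_directory, STATE_FILE)
    # Write to a temporary file first so a crash never leaves a half-written snapshot.
    fd, tmp_path = tempfile.mkstemp(dir=persist_directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)
    return path


def load_state(persist_directory: str) -> Optional[Dict[str, Any]]:
    """Return the last persisted state, or None if nothing has been saved yet."""
    path = os.path.join(persist_directory, STATE_FILE)
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)
```

In this sketch, a scheduler would call persist_state once per persist period, and startup code would call load_state to decide whether to resume or begin a fresh cycle.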

Attention: If automatic failover is configured, be sure to mirror the Repair Service persist directory.

For more information on Repair Service continuity during a failure, see failover aftereffects.