Repair Service behavior during environment changes
Environment changes can affect how the Repair Service behaves.
The following sections describe how the Repair Service responds to changes in the environment such as cluster topology changes, schema changes, down nodes, and OpsCenter restarts.
Cluster topology changes
The Repair Service is almost immediately aware of any topology changes to a cluster. When a change in cluster topology occurs, the Repair Service stops its current repair cycle and waits for the ring to stabilize before starting a new cycle. The restart period is controlled by the restart_period configuration option, which defaults to 300 seconds (5 minutes). While paused, the Repair Service checks the state of the cluster at this interval until it is able to reactivate.
Before resuming repairs, the Repair Service checks the cluster state every 30 seconds by default. After the cluster has stabilized, these stabilization checks cease until the next time opscenterd is restarted. Configure the interval for the stable cluster check with the cluster_stabilization_period option.
Cluster topology changes include:
- Nodes moving within a cluster
- Nodes joining a cluster
- Nodes leaving a cluster
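Both intervals can be adjusted in the Repair Service configuration. The following sketch is illustrative only: it assumes the options live in a [repair_service] section of the OpsCenter cluster configuration file and that both values are given in seconds; verify the file location and units for your OpsCenter version.

    [repair_service]
    # Wait 300 seconds (5 minutes) for the ring to stabilize after a
    # topology change before starting a new repair cycle.
    restart_period = 300
    # While waiting to resume, check the cluster state every 30 seconds.
    cluster_stabilization_period = 30

The values shown match the defaults described above.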
Schema changes
When a schema change happens, the Repair Service pauses for five minutes by default, then starts back up and immediately begins repairing new keyspaces or tables. Schema changes include adding, changing, or removing keyspaces or tables.
Down nodes or replicas
A repair cannot run if any of the nodes in the replica set for that range are down. If an entire rack or data center goes down, it is unlikely that repair operations can successfully run on the cluster. When one or more nodes are down, the Repair Service continues to run repairs for ranges and keyspaces unaffected by the down nodes.
When there are no runnable repair operations remaining, the Repair Service waits for 10 seconds and checks again. The Repair Service repeats this process for up to the value of the max_down_node_retry option, which defaults to three hours based on the max_hint_window_in_ms property in cassandra.yaml, and then starts a new cycle. After max_hint_window_in_ms is exceeded for a down node, the recovery process for that node is to rebuild rather than rely on hint replay. Therefore, the Repair Service starts a new cycle to ensure that any available ranges continue to be repaired and are not blocked by down nodes.
The check for available ranges is limited to the next prioritization_page_size tasks (default: 512). Because the order of these tasks is random, if no available ranges are found within the first prioritization_page_size tasks, it is unlikely that any available ranges exist.
If the Repair Service reports errors when activated, deactivate the Repair Service, ensure that all nodes are available, and then reactivate it.
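As a sketch only, and again assuming a [repair_service] configuration section with max_down_node_retry expressed in seconds (both assumptions to confirm against your OpsCenter version), the options discussed in this section might be set as follows:

    [repair_service]
    # Keep retrying ranges blocked by down nodes for up to 3 hours
    # (10800 seconds) before starting a new cycle; the default follows
    # max_hint_window_in_ms in cassandra.yaml.
    max_down_node_retry = 10800
    # Examine the next 512 randomly ordered tasks when checking whether
    # any available ranges remain to repair.
    prioritization_page_size = 512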
Persisted repair state when restarting opscenterd
At the end of each persist period (one hour by default), the current state of the Repair Service is persisted locally on the opscenterd server in the persist directory location. You can configure the persist period frequency and the persist directory location with the persist_period and persist_directory options, respectively. When opscenterd is restarted, the Repair Service resumes where it left off based on the persisted state information.
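For illustration, a sketch of these settings, assuming the same [repair_service] section, with persist_period in seconds and a hypothetical persist_directory path on the opscenterd server:

    [repair_service]
    # Persist the Repair Service state every hour (3600 seconds).
    persist_period = 3600
    # Hypothetical location; the actual default depends on how OpsCenter
    # is installed.
    persist_directory = /var/lib/opscenter/repair_service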
For more information on Repair Service continuity during a failure, see failover aftereffects.