Troubleshoot Repair Service errors
To resolve errors, try adjusting the configuration options in the [repair_service]
section of opscenterd.conf or cluster_name.conf as appropriate for your environment.
The location of the cluster_name.conf file depends on the type of installation:
- Package installations: /etc/opscenter/clusters/cluster_name.conf
- Tarball installations: install_location/conf/clusters/cluster_name.conf
The location of the opscenterd.conf file depends on the type of installation:
- Package installations: /etc/opscenter/opscenterd.conf
- Tarball installations: install_location/conf/opscenterd.conf
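The options described in the error list below are set under the [repair_service] section heading of whichever file applies to your environment. A minimal sketch of the layout, assuming INI-style key = value syntax:

    [repair_service]
    # Repair Service tuning options go here, one option per line.
    # option_name = value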
Errors encountered when running the Repair Service can include:
- Error of a single repair task
  When a single repair task fails, the repair is skipped temporarily, added to the end of the repair queue, and retried later. If a single repair fails ten times (default), the Repair Service fires an alert. Adjust this setting with the single_task_err_threshold option.
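  For example, a minimal sketch that raises the failure threshold before the alert fires (the value 20 is illustrative, not a recommendation):

      [repair_service]
      # Failures allowed for a single repair task before an alert fires (default: 10).
      single_task_err_threshold = 20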
- Incremental error alert threshold exceeded
  By default, the Repair Service sends an alert after 20 failed incremental repair attempts, indicating that there could be a problem with incremental repair. Adjust this setting with the incremental_err_alert_threshold option.
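  For example, a minimal sketch that raises the alert threshold (the value 40 is illustrative):

      [repair_service]
      # Failed incremental repair attempts before an alert fires (default: 20).
      incremental_err_alert_threshold = 40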
- Offline splits
  At the beginning of each cycle, the Repair Service attempts to generate intelligent subrange splits based on the system.size_estimates table. These subrange splits cannot be generated when a node or agent is down or unavailable; if a node or agent is unavailable when the subrange determination happens, offline splits are used instead. In large or dense clusters, these offline subrange calculations can often be inefficient. The best way to detect that a Repair Service cycle has fallen back to offline splits is to monitor the Repair Service log for the message "using offline task generation". If offline splits are detected, restart the Repair Service once all nodes and agents are up and available.
- Repair history tables
  DSE stores repair event details in the system_distributed.repair_history and system_distributed.parent_repair_history tables. By default, these tables have no time to live (TTL), which can lead to significant unnecessary disk usage because of the number of repair tasks run continuously. Manually set a TTL on these tables based on your needs. In most cases, the TTL should be larger than gc_grace_seconds, but not larger than needed for debugging purposes.
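  A minimal CQL sketch for setting a default TTL on both tables; the 30-day value (2592000 seconds) is an assumption chosen only to illustrate a TTL larger than a typical gc_grace_seconds, and default_time_to_live applies only to data written after the change:

      -- Expire repair history rows 30 days after they are written (illustrative value).
      ALTER TABLE system_distributed.repair_history WITH default_time_to_live = 2592000;
      ALTER TABLE system_distributed.parent_repair_history WITH default_time_to_live = 2592000;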
- Skipping range because pending repairs exceeds the max repairs
  The Repair Service skips repairing a range if pending repairs exceed the maximum pending repairs, which is 5 by default. The Repair Service immediately moves the skipped repair task to the end of the repair queue and fires an alert. At your discretion, you might want to restart any stalled nodes. Adjust this setting with the max_pending_repairs option.
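  For example, a minimal sketch that raises the limit (the value 10 is illustrative):

      [repair_service]
      # Pending repairs allowed before a range is skipped (default: 5).
      max_pending_repairs = 10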
- Timeouts
  The Repair Service times out a single repair task after one hour by default. The timeout counts as an error for that repair task, and the task is placed at the end of the repair queue to be retried later. Adjust this setting with the single_repair_timeout option.
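  For example, a minimal sketch that doubles the timeout, assuming the option is specified in seconds:

      [repair_service]
      # Timeout for a single repair task (default: one hour).
      single_repair_timeout = 7200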
- Too many repairs in parallel
  The Repair Service errors if it has to run too many repairs in parallel. By default, this error occurs when the Repair Service estimates that it needs to run more than one repair in a single replica set to complete on time. Try increasing the Time to completion parameter. If that does not resolve the issue, try adjusting the max_parallel_repairs option (see the sketch after the caution below). See Setting the maximum for parallel subrange repairs.
  CAUTION: DataStax recommends manually adjusting max_parallel_repairs, min_repair_time, and other advanced or expert options only if the time_to_completion_percentage throttle is not in use. See Adjusting or disabling the throttle for subrange repairs.
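  Where adjusting max_parallel_repairs manually is appropriate, that is, when the time_to_completion_percentage throttle is not in use per the caution above, a minimal sketch follows; the value 2 is illustrative only:

      [repair_service]
      # Maximum number of subrange repairs to run in parallel (illustrative value;
      # see Setting the maximum for parallel subrange repairs before changing this).
      max_parallel_repairs = 2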