Troubleshoot Repair Service errors
Describes errors you might encounter when running the Repair Service and the configuration options to adjust to resolve them.
cluster_name.conf
The location of the cluster_name.conf file depends on the type of installation:
- Package installations: /etc/opscenter/clusters/cluster_name.conf
- Tarball installations: install_location/conf/clusters/cluster_name.conf
opscenterd.conf
The location of the opscenterd.conf file depends on the type of installation:
- Package installations: /etc/opscenter/opscenterd.conf
- Tarball installations: install_location/conf/opscenterd.conf
To resolve errors, try adjusting the configuration options in the [repair_service] section of opscenterd.conf or cluster_name.conf as appropriate for your environment.
Errors encountered when running the Repair Service can include:
- Error of a single repair task
- When a single repair task fails, the Repair Service temporarily skips the repair, moves it to the end of the repair queue, and retries it later. If a single repair fails ten times (default), the Repair Service fires an alert. Adjust this setting with the single_task_err_threshold option.
- Timeouts
- The Repair Service times out a single repair task after one hour by default. The timeout counts as an error for that repair task, which is moved to the end of the repair queue and retried later. Adjust this setting with the single_repair_timeout option.
- Too many repairs in parallel
- The Repair Service reports an error if it has to run too many repairs in parallel. By default, this error happens if it estimates that it needs to run more than one repair in a single replica set to complete on time. Try increasing the Time to completion parameter. If that does not resolve the issue, try adjusting the max_parallel_repairs option. See Setting the maximum for parallel subrange repairs.
CAUTION: DataStax recommends manually adjusting max_parallel_repairs, min_repair_time, and other advanced or expert options only if the time_to_completion_percentage throttle is not in use. See Adjusting or disabling the throttle for subrange repairs.
- Skipping range because pending repairs exceeds the max repairs
- The Repair Service skips repairing a range if pending repairs exceed the maximum pending repairs, which is 5 by default. The Repair Service immediately moves the skipped repair task to the end of the repair queue and fires an alert. At your discretion, you might want to restart any stalled nodes. Adjust this setting with the max_pending_repairs option.
- Incremental error alert threshold exceeded
- By default, the Repair Service sends an alert that there could be a problem with incremental repair after 20 failed incremental repair attempts. Adjust this setting with the incremental_err_alert_threshold option.
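
As a rough illustration of the options named in this topic, a [repair_service] section might look like the following sketch. The values shown are the defaults stated above. The unit for single_repair_timeout is an assumption (seconds, so 3600 corresponds to the one-hour default); confirm it against the configuration reference for your OpsCenter version before changing it. max_parallel_repairs and min_repair_time are left out on purpose because of the caution about the time_to_completion_percentage throttle.

```ini
# Place in opscenterd.conf or cluster_name.conf, as appropriate for your environment.
[repair_service]
# Failures of a single repair task before the Repair Service fires an alert (default: 10).
single_task_err_threshold = 10
# Timeout for a single repair task. ASSUMPTION: the value is in seconds (3600 = one hour).
single_repair_timeout = 3600
# Pending repairs allowed before the Repair Service skips a range (default: 5).
max_pending_repairs = 5
# Failed incremental repair attempts before an alert is sent (default: 20).
incremental_err_alert_threshold = 20
```

Raising a threshold only delays the alert; it does not address the cause of the failing repairs, so it is usually worth checking for stalled nodes first.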