Troubleshooting Repair Service errors

Describes errors you might encounter when running the Repair Service and the configuration options you can adjust to resolve them.

To resolve errors, try adjusting the configuration options in the [repair_service] section of opscenterd.conf or cluster_name.conf as appropriate for your environment. Errors encountered when running the Repair Service can include:

Error of a single range repair
When a single range repair fails, the repair is temporarily skipped, moved to the end of the repair queue, and retried later. If a single range fails ten times (default), the Repair Service shuts down and fires an alert. Configure this setting with the single_task_err_threshold option.
Too many errors in a single run
After a total of 100 errors (default) during a single run, the Repair Service shuts down and fires an alert. Configure this setting with the max_err_threshold option.
Time-outs
The Repair Service times out a single repair command after one hour by default. A time-out counts as an error for that repair command, which is moved to the end of the repair queue and retried later. Configure this setting with the single_repair_timeout option.
Too many repairs in parallel
The Repair Service reports an error and shuts down if it has to run too many repairs in parallel. By default, this happens if it estimates that it needs to run more than one repair in a single replica set to complete on time. Configure this setting with the max_parallel_repairs option.
Skipping range because pending repairs exceeds the max repairs
The Repair Service skips repairing a range if the number of pending repairs exceeds the maximum, which is 5 by default. The Repair Service immediately moves the skipped repair task to the end of the repair queue and fires an alert. At your discretion, you might want to restart any stalled nodes. Configure this setting with the max_pending_repairs option.
Incremental error alert threshold exceeded
By default, the Repair Service sends an alert that there may be a problem with incremental repair after 20 failed incremental repair attempts. Adjust this setting with the incremental_err_alert_threshold option.
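
The options described above can be collected in a single [repair_service] section. The following is a minimal sketch, not a recommended configuration. The values shown are the defaults stated in this topic. The one-hour time-out is written as 3600 on the assumption that single_repair_timeout is expressed in seconds, so verify the unit for your OpsCenter version before changing it. max_parallel_repairs is left commented out because this topic does not state its literal default value.

```ini
[repair_service]
# Failures allowed for a single range before the Repair Service shuts down
single_task_err_threshold = 10

# Total errors allowed in a single run before the Repair Service shuts down
max_err_threshold = 100

# Time-out for a single repair command (assumed to be in seconds; 3600 = one hour)
single_repair_timeout = 3600

# Pending repairs allowed before a range is skipped
max_pending_repairs = 5

# Failed incremental repair attempts before an alert is sent
incremental_err_alert_threshold = 20

# Uncomment and set only after checking the documented default for your version
# max_parallel_repairs =
```

Place the section in opscenterd.conf or cluster_name.conf, as appropriate for your environment.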

opscenterd.conf 

The location of the opscenterd.conf file depends on the type of installation:

  • Package installations: /etc/opscenter/opscenterd.conf
  • Tarball installations: install_location/conf/opscenterd.conf

cluster_name.conf 

The location of the cluster_name.conf file depends on the type of installation:

  • Package installations: /etc/opscenter/clusters/cluster_name.conf
  • Tarball installations: install_location/conf/clusters/cluster_name.conf