Troubleshooting Repair Service errors

This topic describes errors that can occur when running the Repair Service and the configuration options you can adjust to resolve them.

Configure the options described below by adding a [repair_service] section to the opscenterd.conf file, which applies to all clusters, or to the cluster_name.conf file, which applies to a single cluster. Settings in cluster_name.conf override any settings in opscenterd.conf.
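As a sketch of how the override works, the fragments below assume a hypothetical deployment that raises the error limit for one cluster. The value 200 is illustrative only; max_err_threshold is one of the options described below.

```ini
# opscenterd.conf: applies to all clusters
[repair_service]
max_err_threshold = 100
```

```ini
# cluster_name.conf: applies only to this cluster and takes precedence
[repair_service]
max_err_threshold = 200
```

With both files in place, the cluster configured by cluster_name.conf would use 200, and every other cluster would use 100.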

Errors encountered when running the Repair Service can include:

Error of a single range repair
When a single range repair fails, the repair is skipped temporarily, added to the end of the repair queue, and retried later. If a single range fails 10 times (default), the Repair Service shuts down and fires an alert. Configure this setting with the single_task_err_threshold option.
Too many errors in a single run
After a total of 100 errors (default) during a single run, the Repair Service shuts down and fires an alert. Configure this setting with the max_err_threshold option.
Time-outs
The Repair Service times out a single repair command after one hour by default. The timeout counts as an error for that repair command, which is placed at the end of the repair queue and retried later. Configure this setting with the single_repair_timeout option.
Too many repairs in parallel
The Repair Service reports an error and shuts down if it has to run too many repairs in parallel. By default, this happens if it estimates that it needs to run more than one repair in a single replica set to complete on time. Configure this setting with the max_parallel_repairs option.
Skipping range because pending repairs exceeds the max repairs
The Repair Service skips repairing a range if pending repairs exceed the maximum pending repairs, which is 5 by default. The Repair Service immediately moves the skipped repair task to the end of the repair queue and fires an alert. At your discretion, you might want to restart any stalled nodes. Configure this setting with the max_pending_repairs option.
Incremental error alert threshold exceeded
By default, the Repair Service sends an alert that there may be a problem with incremental repair after 20 failed incremental repair attempts. Adjust this setting with the incremental_err_alert_threshold option.
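The options above could be collected in a single [repair_service] section, as in the following sketch. Four of the values are the defaults stated above. The single_repair_timeout value assumes the option is expressed in seconds, which is not stated here, so confirm the unit for your OpsCenter version before using it. max_parallel_repairs is left out because no numeric default is given above.

```ini
[repair_service]
# Failures of one range before the Repair Service shuts down (default 10)
single_task_err_threshold = 10
# Total errors in a single run before the Repair Service shuts down (default 100)
max_err_threshold = 100
# Timeout for a single repair command; one hour, ASSUMING the unit is seconds
single_repair_timeout = 3600
# Pending repairs above which a range is skipped (default 5)
max_pending_repairs = 5
# Failed incremental repair attempts before an alert is sent (default 20)
incremental_err_alert_threshold = 20
```

Raising a threshold makes the Repair Service more tolerant of intermittent failures, at the cost of delaying the alert that signals a persistent problem.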

opscenterd.conf 

The location of the opscenterd.conf file depends on the type of installation:

  • Installer-Services or package installations: /etc/opscenter/opscenterd.conf
  • Installer-No Services or tarball installations: install_location/conf/opscenterd.conf
  • Windows installations: Program Files (x86)\DataStax Community\opscenter\conf\opscenterd.conf

cluster_name.conf 

The location of the cluster_name.conf file depends on the type of installation:

  • Installer-Services or package installations: /etc/opscenter/clusters/cluster_name.conf
  • Installer-No Services or tarball installations: install_location/conf/clusters/cluster_name.conf
  • Windows installations: Program Files (x86)\DataStax Community\opscenter\conf\clusters\cluster_name.conf