Troubleshooting Repair Service errors
To resolve errors encountered when running the Repair Service, adjust the configuration options in the [repair_service] section of opscenterd.conf or cluster_name.conf, as appropriate for your environment.
Errors can include:
- Error of a single range repair
  When a single range repair fails, the repair is skipped temporarily, added to the end of the queue of repairs, and retried later. If a single range fails ten times (default), the Repair Service shuts down and fires an alert. Configure this setting with the single_task_err_threshold option.
- Too many errors in a single run
  After a total of 100 errors (default) during a single run, the Repair Service shuts down and fires an alert. Configure this setting with the max_err_threshold option.
- Time-outs
  The Repair Service times out a single repair command after one hour by default. The timeout counts as an error for that repair command, which is placed at the end of the queue of repairs and retried later. Configure this setting with the single_repair_timeout option.
- Too many repairs in parallel
  The Repair Service reports an error and shuts down if it has to run too many repairs in parallel. By default, this happens if it estimates that it must run more than one repair in a single replica set to complete on time. Configure this setting with the max_parallel_repairs option.
- Skipping range because pending repairs exceeds the max repairs
  The Repair Service skips repairing a range if the number of pending repairs exceeds the maximum, which is 5 by default. The Repair Service immediately moves the skipped repair task to the end of the repair queue and fires an alert. Consider restarting any stalled nodes. Configure this setting with the max_pending_repairs option.
- Incremental error alert threshold exceeded
  By default, the Repair Service sends an alert after 20 failed incremental repair attempts, indicating that there may be a problem with incremental repair. Adjust this setting with the incremental_err_alert_threshold option.
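As a rough illustration, the options named above might be set together in the [repair_service] section as sketched here. The values are the defaults stated in the descriptions. The unit for single_repair_timeout is an assumption (seconds, so one hour is written as 3600) and should be checked against the configuration reference for your OpsCenter version. max_parallel_repairs is left out because the text above does not give a numeric default for it.

```ini
# Illustrative fragment for opscenterd.conf or cluster_name.conf.
# Values mirror the defaults described above; adjust for your environment.
[repair_service]
# Failures of a single range before the Repair Service shuts down
single_task_err_threshold = 10
# Total errors in a single run before the Repair Service shuts down
max_err_threshold = 100
# Time-out for a single repair command (ASSUMPTION: value is in seconds)
single_repair_timeout = 3600
# Pending repairs above which a range is skipped
max_pending_repairs = 5
# Failed incremental repair attempts before an alert is sent
incremental_err_alert_threshold = 20
```

Raising a threshold makes the Repair Service more tolerant of transient failures, but it also delays the alert that signals a persistent problem.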
opscenterd.conf
The location of the opscenterd.conf file depends on the type of installation:
- Package installations: /etc/opscenter/opscenterd.conf
- Tarball installations: install_location/conf/opscenterd.conf
cluster_name.conf
The location of the cluster_name.conf file depends on the type of installation:
- Package installations: /etc/opscenter/clusters/cluster_name.conf
- Tarball installations: install_location/conf/clusters/cluster_name.conf