Troubleshoot Repair Service errors

Resolve errors encountered when running the Repair Service by adjusting the Repair Service configuration options.

cluster_name.conf

The location of the cluster_name.conf file depends on the type of installation:
  • Package installations: /etc/opscenter/clusters/cluster_name.conf
  • Tarball installations: install_location/conf/clusters/cluster_name.conf

opscenterd.conf

The location of the opscenterd.conf file depends on the type of installation:
  • Package installations: /etc/opscenter/opscenterd.conf
  • Tarball installations: install_location/conf/opscenterd.conf

To resolve errors, try adjusting the configuration options in the [repair_service] section of opscenterd.conf or cluster_name.conf as appropriate for your environment. Errors encountered when running the Repair Service can include:

Error of a single repair task
When a single repair task fails, the Repair Service temporarily skips it, moves it to the end of the repair queue, and retries it later. If a single repair fails ten times (the default), the Repair Service fires an alert. Adjust this threshold with the single_task_err_threshold option.
Incremental error alert threshold exceeded
By default, the Repair Service sends an alert after 20 failed incremental repair attempts, indicating a possible problem with incremental repair. Adjust this threshold with the incremental_err_alert_threshold option.
Offline splits

At the beginning of each cycle, the Repair Service attempts to generate intelligent subrange splits based on the system.size_estimates table. It cannot generate these splits when a node or agent is down or unavailable. If a node or agent is unavailable when the subranges are determined, the Repair Service falls back to offline splits.

In large or dense clusters, these offline subrange calculations are often inefficient. The best way to detect that a Repair Service cycle has fallen back to offline splits is to monitor the Repair Service log for the message "using offline task generation". If offline splits are detected, restart the Repair Service once all nodes and agents are up and available.
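The log check described above can be sketched in Python. This is a minimal example, not an OpsCenter tool: the function name and the sample log lines are made up for illustration, and only the phrase "using offline task generation" comes from this topic. The location of the Repair Service log is not stated here, so the sketch takes the log lines as input instead of assuming a path.

```python
# Phrase this topic says to look for in the Repair Service log.
OFFLINE_MARKER = "using offline task generation"


def find_offline_split_lines(log_lines):
    """Return (line_number, text) pairs for lines that contain the marker phrase."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        # Compare case-insensitively in case the log capitalizes the message.
        if OFFLINE_MARKER in line.lower():
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical log excerpt; real log lines will look different.
sample_log = [
    "INFO: Starting new repair cycle\n",
    "WARN: Unable to reach all agents, using offline task generation\n",
    "INFO: Queued subrange repair tasks\n",
]

for number, text in find_offline_split_lines(sample_log):
    print(f"line {number}: {text}")
```

To scan a real log, open the file and pass the file object to find_offline_split_lines, since a file object iterates line by line.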

Repair history tables

DSE stores repair event details in the system_distributed.repair_history and system_distributed.parent_repair_history tables. By default, these tables have no time to live (TTL), which can lead to significant unnecessary disk usage because of the large number of repair tasks that run continuously.

Manually set a TTL on these tables based on your needs. In most cases, the TTL should be larger than gc_grace_seconds, but not more than needed for debugging purposes.
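One way to apply this guidance is to derive the TTL from gc_grace_seconds plus a debugging margin and set it as the table-level default TTL. The sketch below only builds the CQL statements as strings. The helper name, the four-day margin, and the choice of default_time_to_live are assumptions made for illustration. Confirm that your DSE version permits altering these tables before running the statements.

```python
# Tables named in this topic.
REPAIR_HISTORY_TABLES = (
    "system_distributed.repair_history",
    "system_distributed.parent_repair_history",
)

SECONDS_PER_DAY = 86400


def ttl_statements(gc_grace_seconds, extra_days):
    """Build ALTER TABLE statements whose TTL exceeds gc_grace_seconds by extra_days."""
    if extra_days <= 0:
        raise ValueError("extra_days must be positive so the TTL exceeds gc_grace_seconds")
    ttl = gc_grace_seconds + extra_days * SECONDS_PER_DAY
    return [
        f"ALTER TABLE {table} WITH default_time_to_live = {ttl};"
        for table in REPAIR_HISTORY_TABLES
    ]


# 864000 seconds (10 days) is the usual Cassandra default for gc_grace_seconds;
# the 4 extra days are an arbitrary debugging margin chosen for this example.
for statement in ttl_statements(gc_grace_seconds=864000, extra_days=4):
    print(statement)
```

A table-level default TTL applies only to rows written after the change. Rows already in the tables keep whatever TTL they were written with.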

Skipping range because pending repairs exceeds the max repairs
The Repair Service skips repairing a range if the number of pending repairs exceeds the maximum, which is 5 by default. The Repair Service immediately moves the skipped repair task to the end of the repair queue and fires an alert. At your discretion, you might want to restart any stalled nodes. Adjust this setting with the max_pending_repairs option.
Timeouts
The Repair Service times out a single repair task after one hour by default. The timeout counts as an error for that repair task, which is placed at the end of the repair queue and retried later. Adjust this setting with the single_repair_timeout option.
Too many repairs in parallel
The Repair Service errors if it has to run too many repairs in parallel. By default, this error happens if it estimates that it needs to run more than one repair in a single replica set to complete on time. Try increasing the Time to completion parameter. If that does not resolve the issue, try adjusting the max_parallel_repairs option. See Setting the maximum for parallel subrange repairs.
CAUTION: DataStax recommends manually adjusting max_parallel_repairs, min_repair_time, and other advanced or expert options only if the time_to_completion_percentage throttle is not in use. See Adjusting or disabling the throttle for subrange repairs.
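Before changing any thresholds, it can help to see which values are currently in effect. The sketch below reads a [repair_service] section and overlays it on the defaults stated in this topic. It assumes the configuration file parses as standard INI text with Python's configparser. The function name and the sample override values are illustrative only. single_repair_timeout is left out because this topic does not state the unit of its value.

```python
import configparser

# Defaults as stated in this topic.
DOC_DEFAULTS = {
    "single_task_err_threshold": 10,
    "incremental_err_alert_threshold": 20,
    "max_pending_repairs": 5,
}


def effective_repair_settings(conf_text):
    """Return the documented defaults overlaid with any [repair_service] overrides."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    settings = dict(DOC_DEFAULTS)
    if parser.has_section("repair_service"):
        for name in DOC_DEFAULTS:
            if parser.has_option("repair_service", name):
                settings[name] = parser.getint("repair_service", name)
    return settings


# Hypothetical file content with two overrides; the values are examples only.
SAMPLE_CONF = """
[repair_service]
single_task_err_threshold = 15
max_pending_repairs = 8
"""

print(effective_repair_settings(SAMPLE_CONF))
```

Options that the file does not set are reported at their documented defaults, so the output shows the effective value of every threshold in one place.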