Advanced Repair Service configuration reference

This reference describes the advanced configuration options available for the Repair Service. Set the configuration options in either opscenterd.conf or cluster_name.conf. The settings in cluster_name.conf override settings in opscenterd.conf.

cluster_name.conf

The location of the cluster_name.conf file depends on the type of installation:
  • Package installations: /etc/opscenter/clusters/cluster_name.conf
  • Tarball installations: install_location/conf/clusters/cluster_name.conf

opscenterd.conf

The location of the opscenterd.conf file depends on the type of installation:
  • Package installations: /etc/opscenter/opscenterd.conf
  • Tarball installations: install_location/conf/opscenterd.conf

To configure the following options for all clusters, add a [repair_service] section to the opscenterd.conf file. To configure them for a single cluster, add the section to that cluster's cluster_name.conf file. Settings in cluster_name.conf override any settings in opscenterd.conf. After changing the configuration, restart opscenterd.
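The override behavior described above can be illustrated with a minimal sketch. It uses Python's standard configparser module to layer two configuration texts. This is not how opscenterd itself loads its files; the file contents, the values, and the helper name effective_repair_settings are assumptions made for illustration. Only the [repair_service] section and the option names come from this reference.

```python
import configparser

# Hypothetical file contents. The option names come from this reference;
# the values are illustrative only.
OPSCENTERD_CONF = """
[repair_service]
single_repair_timeout = 3600
max_pending_repairs = 5
"""

CLUSTER_CONF = """
[repair_service]
single_repair_timeout = 7200
"""


def effective_repair_settings(global_text, cluster_text):
    """Layer the per-cluster text over the global text (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(global_text)   # stands in for opscenterd.conf
    parser.read_string(cluster_text)  # stands in for cluster_name.conf; read last, so it wins
    return dict(parser["repair_service"])


settings = effective_repair_settings(OPSCENTERD_CONF, CLUSTER_CONF)
print(settings)
```

In this sketch, single_repair_timeout takes the per-cluster value while max_pending_repairs keeps the value set for all clusters.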

If there are any issues with the Repair Service configuration, the "Repair service not configured correctly" rule in the Best Practice Service fails and provides guidance on the incorrectly configured options, unless the rule has been turned off.

Repair Service Best Practice Rules

The configuration options prefixed with incremental_ apply only to incremental repairs.

[repair_service] cluster_stabilization_period
The interval in seconds at which the Repair Service checks for cluster stability before running repairs. The check begins when the Repair Service is activated (either by a user or after an OpsCenter restart) and repeats until the cluster is stable. Default: 30.
[repair_service] error_logging_window
How often, in seconds, to log errors and trigger alerts after time_to_completion is exceeded. Default: 86400 (1 day).
[repair_service] incremental_err_alert_threshold
The number of errors to tolerate during incremental repair before alerting that incremental repair is failing more often than is acceptable. Default: 20.
[repair_service] incremental_sleep
The number of seconds to pause after completing all incremental repairs for a cluster. Default: 3600 (1 hour).
[repair_service] incremental_threshold
The minimum number of bytes of unrepaired data required to consider a table for incremental repairs. The default value of 1 byte means that if there is any unrepaired data in a table, the Repair Service runs an incremental repair. Be cautious about setting this value too high. If not enough data is written to exceed the threshold within the gc_grace_seconds period, deletes might be lost. Default: 1.
[repair_service] max_down_node_retry
The maximum number of times to retry a repair task when a node containing a replica is down. Retries occur every 10 seconds, so the default of 1080 retries spans 10800 seconds (3 hours), which corresponds to the default Cassandra hinted-handoff expiration. Example: To double the time allowed for retrying repairs on a down node or replica to 6 hours, set the number of retries to 2160. Default: 1080.
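The retry arithmetic for max_down_node_retry can be worked through in a short sketch. The 10-second retry interval comes from the description above; the function name retries_for_window is hypothetical and is not part of OpsCenter.

```python
RETRY_INTERVAL_SECONDS = 10  # retries occur every 10 seconds, per this reference


def retries_for_window(hours):
    """Return the retry count that spans the given number of hours (hypothetical helper)."""
    return int(hours * 3600) // RETRY_INTERVAL_SECONDS


print(retries_for_window(3))  # the default 3-hour window
print(retries_for_window(6))  # the doubled 6-hour window from the example
```

Under these assumptions, a 3-hour window yields 1080 retries and a 6-hour window yields 2160, matching the values given for this option.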
[repair_service] max_pending_repairs
The maximum number of pending repairs allowed on a node at one time. Default: 5.
[repair_service] min_repair_time
The minimum length of time in seconds for a repair to complete. If a repair finishes sooner, the Repair Service sleeps for the remainder of the time. Default: 5.
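As a rough illustration of the min_repair_time padding, the sketch below computes how long a repair would be padded. It assumes the padding equals the remainder of the minimum time; the function name padding_seconds is hypothetical, and the actual Repair Service logic may differ.

```python
def padding_seconds(elapsed, min_repair_time=5):
    """Seconds of sleep needed to stretch a repair to min_repair_time (assumed behavior)."""
    # A repair that already took at least min_repair_time gets no padding.
    return max(0.0, min_repair_time - elapsed)


print(padding_seconds(2))  # repair finished early, so it is padded
print(padding_seconds(8))  # repair exceeded the minimum, so no padding
```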
[repair_service] persist_directory
The directory in which to store a file containing the current Repair Service status. Default: /var/lib/opscenter/repair_service for package installations and install_location/repair_service for tarball installations.
[repair_service] persist_period
The minimum number of seconds between writes of the Repair Service persist file to disk. Default: 3600 (1 hour). This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs.
[repair_service] restart_period
The period of time in seconds the Repair Service pauses in response to certain events before verifying the cluster stability and restarting repairs. Default: 300 (5 minutes).
[repair_service] single_repair_timeout
The maximum length of time in seconds for a repair to complete. Default: 3600 (1 hour).
[repair_service] single_task_err_threshold
The maximum number of times to retry a repair task before temporarily skipping the task and moving on to the next task. The skipped task is moved to the end of the repairs queue to retry later. When the maximum number of retries is reached, an alert is fired. Default: 10.
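The skip-and-requeue behavior described for single_task_err_threshold can be sketched with a simple queue. This is an illustrative model only, not the Repair Service implementation. The names process_queue and attempt, and the max_rounds bound that keeps the sketch from looping forever, are assumptions.

```python
from collections import deque


def process_queue(tasks, attempt, err_threshold=10, max_rounds=3):
    """Model of skip-and-requeue (hypothetical; not the OpsCenter implementation).

    attempt(task) returns True on success. A task that fails err_threshold
    times in a row raises an alert and moves to the end of the queue.
    max_rounds is an assumed bound so that the sketch always terminates.
    """
    queue = deque(tasks)
    rounds = {task: 0 for task in tasks}
    completed, alerts = [], []
    while queue:
        task = queue.popleft()
        for _ in range(err_threshold):
            if attempt(task):
                completed.append(task)
                break
        else:
            # Threshold reached without success: alert, then retry later.
            alerts.append(task)
            rounds[task] += 1
            if rounds[task] < max_rounds:
                queue.append(task)
    return completed, alerts
```

In this model, a task that keeps failing does not block the tasks behind it. It is retried only after the rest of the queue has had a turn.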
[repair_service] snapshot_override
Specifies whether to override the default snapshot repair behavior. Specifying this option as True runs validation compaction sequentially rather than in parallel. Default: False.
[repair_service] time_to_completion_target_percentage
The percentage of the time to completion that the Repair Service targets, slowing down or reducing parallelism as necessary to avoid overtaxing the cluster. Default: 65. This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs.