Advanced Repair Service configuration reference

This reference describes the advanced configuration options available for the Repair Service. Set the options in either opscenterd.conf or cluster_name.conf. Settings in cluster_name.conf override settings in opscenterd.conf.

cluster_name.conf

The location of the cluster_name.conf file depends on the type of installation:

Package installations: /etc/opscenter/clusters/cluster_name.conf
Tarball installations: install_location/conf/clusters/cluster_name.conf

opscenterd.conf

The location of the opscenterd.conf file depends on the type of installation:

Package installations: /etc/opscenter/opscenterd.conf
Tarball installations: install_location/conf/opscenterd.conf
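
The file locations listed above can be captured in a small helper. This is a minimal sketch, not part of OpsCenter: the function name config_paths and the install_type values "package" and "tarball" are hypothetical, and cluster_name and install_location remain placeholders that you supply.

```python
from pathlib import PurePosixPath


def config_paths(install_type, cluster_name, install_location=None):
    """Return the (opscenterd.conf, cluster_name.conf) paths for an installation.

    Hypothetical helper: it only encodes the locations listed in this reference.
    """
    if install_type == "package":
        base = PurePosixPath("/etc/opscenter")
    elif install_type == "tarball":
        if install_location is None:
            raise ValueError("tarball installations need install_location")
        base = PurePosixPath(install_location) / "conf"
    else:
        raise ValueError(f"unknown install_type: {install_type!r}")
    # Per-cluster files live in a clusters/ directory next to opscenterd.conf.
    return base / "opscenterd.conf", base / "clusters" / f"{cluster_name}.conf"
```

For example, config_paths("package", "prod") yields /etc/opscenter/opscenterd.conf and /etc/opscenter/clusters/prod.conf, where "prod" is an illustrative cluster name.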

To apply the following options to all clusters, add a [repair_service] section to the opscenterd.conf file. To apply them to a single cluster, add the section to that cluster's cluster_name.conf file. Settings in cluster_name.conf override any settings in opscenterd.conf. After changing the configuration, restart opscenterd.
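
The override rule can be illustrated with Python's standard configparser module. This is a sketch of the documented precedence only, not of how opscenterd loads its files internally. The file contents, the chosen values, and the function name effective_repair_settings are assumptions made for illustration, as is the key = value syntax shown in the example files.

```python
import configparser

# Example file contents; the values are illustrative, not recommendations.
OPSCENTERD_CONF = """
[repair_service]
restart_period = 300
max_pending_repairs = 5
"""

CLUSTER_CONF = """
[repair_service]
restart_period = 600
"""


def effective_repair_settings(global_text, cluster_text):
    """Merge [repair_service] options so that per-cluster values win."""
    parser = configparser.ConfigParser()
    parser.read_string(global_text)   # opscenterd.conf: applies to all clusters
    parser.read_string(cluster_text)  # cluster_name.conf: overrides the above
    return dict(parser["repair_service"])


settings = effective_repair_settings(OPSCENTERD_CONF, CLUSTER_CONF)
print(settings)
```

In this sketch the cluster file sets only restart_period, so that value is taken from cluster_name.conf while max_pending_repairs falls back to the value in opscenterd.conf.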

If there are any issues with the Repair Service configuration, the "Repair service not configured correctly" rule in the Best Practice Service fails and provides guidance about the incorrectly configured options, unless the rule has been turned off.

For more information, see Repair Service Best Practice Rules.

The configuration options prefixed with incremental_ apply only to incremental repairs.

[repair_service] cluster_stabilization_period: The interval in seconds at which the Repair Service checks for cluster stability before running repairs. The check begins when the Repair Service is activated (either by a user or after an OpsCenter restart) and repeats until the cluster is stable. Default: 30.
[repair_service] error_logging_window: The interval in seconds at which to log errors and trigger alerts after time_to_completion is exceeded. Default: 86400 (1 day).
[repair_service] incremental_err_alert_threshold: The number of errors to tolerate during incremental repair before alerting that incremental repair is failing more often than is acceptable. Default: 20.
[repair_service] incremental_sleep: The number of seconds to pause after completing all incremental repairs for a cluster. Default: 3600 (1 hour).
[repair_service] incremental_threshold: The minimum number of bytes required to consider a table for incremental repairs. The default value of 1 byte means that if there is any unrepaired data in a table, the Repair Service runs an incremental repair. Be cautious about setting this value too high: if too little data is written to exceed the threshold within the gc_grace_seconds period, deletes might be lost. Default: 1.
[repair_service] max_down_node_retry: The maximum number of attempts to retry a repair task when a node containing a replica is down. Retries occur every 10 seconds, so the default of 1080 retries spans 10800 seconds (3 hours), which corresponds to the default Cassandra hinted-handoff expiration. Example: To double the time allowed to attempt repairs on a down node or replica to 6 hours, set the number of retries to 2160. Default: 1080.
[repair_service] max_pending_repairs: The maximum number of pending repairs allowed to be running on a node at one time. Default: 5.
[repair_service] min_repair_time: The minimum length of time in seconds for a repair to complete. If a repair finishes sooner, it will be padded with a sleep. Default: 5.
[repair_service] persist_directory: The location in which to store a file with the current repair service status. The default location is /var/lib/opscenter/repair_service for package installations and install_location/repair_service for tarball installations.
[repair_service] persist_period: The minimum number of seconds between writes of the persist file to disk by the Repair Service. This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs. Default: 3600 (1 hour).
[repair_service] restart_period: The period of time in seconds the Repair Service pauses in response to certain events before verifying the cluster stability and restarting repairs. Default: 300 (5 minutes).
[repair_service] single_repair_timeout: The maximum length of time in seconds for a repair to complete. Default: 3600 (1 hour).
[repair_service] single_task_err_threshold: The maximum number of times to retry a repair task before temporarily skipping the task and moving on to the next task. The skipped task is moved to the end of the repairs queue to retry later. After the maximum number of retries is reached, an alert is fired. Default: 10.
[repair_service] snapshot_override: Specifies whether to override the default snapshot repair behavior. Setting this option to True runs validation compaction sequentially rather than in parallel. Default: False.
[repair_service] time_to_completion_target_percentage: The percentage of the time to completion that the Repair Service should target, including slowing down or reducing parallelism as necessary to avoid overtaxing the cluster. This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs. Default: 65.
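
The arithmetic behind max_down_node_retry can be worked through in a few lines. The sketch below assumes only what the option description states, namely that retries occur every 10 seconds. The function names retries_for_window and repair_service_section are hypothetical helpers, not part of OpsCenter, and the key = value rendering is an assumption about the file syntax.

```python
RETRY_INTERVAL_SECONDS = 10  # retries occur every 10 seconds, per the option description


def retries_for_window(hours):
    """Return the number of retries that span the given number of hours."""
    return int(hours * 3600 // RETRY_INTERVAL_SECONDS)


def repair_service_section(options):
    """Render a dict of options as a [repair_service] section (hypothetical helper)."""
    lines = ["[repair_service]"]
    lines += [f"{name} = {value}" for name, value in options.items()]
    return "\n".join(lines)


# A 3-hour window matches the default; a 6-hour window doubles it.
print(retries_for_window(3))
print(repair_service_section({"max_down_node_retry": retries_for_window(6)}))
```

The rendered section could then be added to opscenterd.conf or cluster_name.conf, followed by a restart of opscenterd.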