Advanced Repair Service configuration reference

To configure the following options, add a [repair_service] section to the opscenterd.conf file to apply them to all clusters, or to the cluster_name.conf file to apply them to a single cluster. Settings in cluster_name.conf override any settings in opscenterd.conf.

The location of the opscenterd.conf file depends on the type of installation:

Package installations: /etc/opscenter/opscenterd.conf
Tarball installations: install_location/conf/opscenterd.conf

The location of the cluster_name.conf file depends on the type of installation:

Package installations: /etc/opscenter/clusters/cluster_name.conf
Tarball installations: install_location/conf/clusters/cluster_name.conf

After changing configuration, restart opscenterd.
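
As a minimal sketch of the layering described above, the fragments below show a global [repair_service] section and a per-cluster override. The option values are illustrative rather than recommendations, and the # comment lines assume standard INI-style comment syntax.

```ini
# opscenterd.conf: applies to all clusters
[repair_service]
restart_period = 300
single_repair_timeout = 3600
```

```ini
# cluster_name.conf: overrides the global value for this cluster only
[repair_service]
single_repair_timeout = 7200
```

With both fragments in place, this cluster would use a single_repair_timeout of 7200 seconds and take restart_period from opscenterd.conf.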

If the Repair Service configuration has any issues, the "Repair Service not configured correctly" rule in the Best Practice Service fails and provides guidance on the incorrectly configured options, unless the rule has been turned off.

Repair Service Best Practice Rules

Configuration options prefixed with incremental_ apply only to incremental repairs.

[repair_service] cluster_stabilization_period

The interval in seconds at which the Repair Service checks for cluster stability before running repairs. The check begins when the Repair Service is activated (either by a user or after an OpsCenter restart) and repeats until the cluster is stable. Default: 30.
[repair_service] error_logging_window

The interval in seconds at which to log errors and trigger alerts after time_to_completion is exceeded. Default: 86400 (1 day).
[repair_service] incremental_err_alert_threshold

The number of errors during incremental repair to ignore before alerting that incremental repair is failing more often than is acceptable. Default: 20.
[repair_service] incremental_sleep

The number of seconds to pause after completing all incremental repairs for a cluster. Default: 3600 (1 hour).
[repair_service] incremental_threshold

The minimum number of bytes required to consider a table for incremental repairs. The default value of 1 byte means that if there is any unrepaired data in a table, the Repair Service runs an incremental repair. Be cautious of setting this value too high. If not enough data is written to exceed the threshold in the gc_grace_seconds period, deletes might be lost. Default: 1.
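
For illustration, the incremental_ options could be set together as follows. The first two values are the documented defaults. The raised incremental_threshold of 1048576 bytes (1 MiB) is a hypothetical value chosen only to show the syntax, and the caution about gc_grace_seconds still applies.

```ini
[repair_service]
incremental_err_alert_threshold = 20
incremental_sleep = 3600
# Hypothetical: only consider tables with at least 1 MiB of unrepaired data
incremental_threshold = 1048576
```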
[repair_service] max_down_node_retry

The maximum number of times to retry a repair task when a node containing a replica is down. Retries occur every 10 seconds, so the default of 1080 retries spans 10800 seconds (3 hours), which corresponds to the default Cassandra hinted-handoff expiration. Example: To double the time allowed to attempt repairs on a down node or replica to 6 hours, set the number of retries to 2160. Default: 1080.
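
Because retries occur every 10 seconds, the retry count is the desired wait in seconds divided by 10. A sketch of the 6-hour example:

```ini
[repair_service]
# 6 hours = 21600 seconds; 21600 / 10 seconds per retry = 2160 retries
max_down_node_retry = 2160
```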
[repair_service] max_pending_repairs

The maximum number of pending repairs allowed to be running on a node at one time. Default: 5.
[repair_service] min_repair_time

The minimum length of time in seconds for a repair to complete. If a repair finishes sooner, it is padded with a sleep. Default: 5.
[repair_service] persist_directory

The location in which to store a file with the current Repair Service status. Default: /var/lib/opscenter/repair_service for package installations and install_location/repair_service for tarball installations.
[repair_service] persist_period

The minimum number of seconds between writes of the persist file to disk by the Repair Service. Default: 3600 (1 hour). This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs.
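
A sketch of the two persistence options set explicitly. The directory shown is the documented default for package installations. The persist_period of 1800 is a hypothetical value, not a recommendation.

```ini
[repair_service]
persist_directory = /var/lib/opscenter/repair_service
# Hypothetical: write the persist file at most once every 30 minutes
persist_period = 1800
```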
[repair_service] restart_period

The period of time in seconds the Repair Service pauses in response to certain events before verifying the cluster stability and restarting repairs. Default: 300 (5 minutes).
[repair_service] single_repair_timeout

The maximum length of time in seconds for a repair to complete. Default: 3600 (1 hour).
[repair_service] single_task_err_threshold

The maximum number of times to retry a repair task before temporarily skipping the task and moving on to the next task. The skipped task is moved to the end of the repairs queue to retry later. After the maximum number of retries is reached, an alert is fired. Default: 10.
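
The per-task timing and retry options might be combined as in the following sketch. All three values are hypothetical and chosen only to illustrate the syntax.

```ini
[repair_service]
# Pad any repair that finishes in under 10 seconds with a sleep
min_repair_time = 10
# Allow each repair up to 2 hours (7200 seconds) to complete
single_repair_timeout = 7200
# Retry a failing task up to 5 times before moving it to the end of the queue
single_task_err_threshold = 5
```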
[repair_service] snapshot_override

Specifies whether to override the default snapshot repair behavior. Specifying this option as True runs validation compaction sequentially rather than in parallel. Default: False.
[repair_service] time_to_completion_target_percentage

A percentage of the time to completion that the Repair Service should target, including slowing down or reducing parallelism as necessary to avoid overtaxing the cluster. Default: 65. This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs.
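
A sketch of setting the target explicitly. The value 50 is hypothetical. On a literal reading of the description above, with a time to completion of 10 days it would ask the Repair Service to aim for roughly 5 days.

```ini
[repair_service]
time_to_completion_target_percentage = 50
```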
