Advanced Repair Service configuration reference

To configure the following options for all clusters, add a [repair_service] section to the opscenterd.conf file. To configure them for a single cluster, add the section to that cluster's cluster_name.conf file. Settings in cluster_name.conf override any settings in opscenterd.conf.

The location of the opscenterd.conf file depends on the type of installation:

  • Package installations: /etc/opscenter/opscenterd.conf

  • Tarball installations: install_location/conf/opscenterd.conf

The location of the cluster_name.conf file depends on the type of installation:

  • Package installations: /etc/opscenter/clusters/cluster_name.conf

  • Tarball installations: install_location/conf/clusters/cluster_name.conf
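The override rule described above can be illustrated with a minimal sketch. It uses Python's standard configparser module on two in-memory strings that stand in for opscenterd.conf and cluster_name.conf. The option values are hypothetical, and the sketch only demonstrates the precedence rule; it does not claim to show how opscenterd itself reads its files.

```python
import configparser

# Hypothetical contents of opscenterd.conf (applies to all clusters).
GLOBAL_CONF = """
[repair_service]
max_down_node_retry = 1080
single_repair_timeout = 3600
"""

# Hypothetical contents of cluster_name.conf (applies to one cluster).
CLUSTER_CONF = """
[repair_service]
max_down_node_retry = 2160
"""


def effective_repair_settings(global_text, cluster_text):
    """Layer per-cluster settings over global ones; the later read wins."""
    parser = configparser.ConfigParser()
    parser.read_string(global_text)
    # Reading a second source with the same section and key replaces the value.
    parser.read_string(cluster_text)
    return dict(parser["repair_service"])


settings = effective_repair_settings(GLOBAL_CONF, CLUSTER_CONF)
print(settings)
```

In this sketch the cluster-level max_down_node_retry replaces the global value, while single_repair_timeout, which is set only globally, carries through unchanged. configparser returns all values as strings.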

After changing the configuration, restart opscenterd for the changes to take effect.

If the Repair Service configuration has any issues, the "Repair Service not configured correctly" rule in the Best Practice Service fails and identifies the incorrectly configured options, unless the rule has been turned off.

For details, see Repair Service Best Practice Rules.

Configuration options prefixed with incremental_ apply only to incremental repairs.

  • [repair_service] cluster_stabilization_period

    The interval in seconds at which the Repair Service checks for cluster stability before starting repairs. The check begins when the Repair Service is activated (either by a user or after an OpsCenter restart) and repeats until the cluster is stable. Default: 30.

  • [repair_service] error_logging_window

    The interval in seconds at which errors are logged and alerts are triggered after time_to_completion is exceeded. Default: 86400 (1 day).

  • [repair_service] incremental_err_alert_threshold

    The number of errors to ignore during incremental repair before alerting that incremental repair is failing more often than is acceptable. Default: 20.

  • [repair_service] incremental_sleep

    The number of seconds to pause after completing all incremental repairs for a cluster. Default: 3600 (1 hour).

  • [repair_service] incremental_threshold

    The minimum number of bytes of unrepaired data required to consider a table for incremental repair. The default value of 1 byte means that if a table contains any unrepaired data, the Repair Service runs an incremental repair on it. Be cautious about setting this value too high: if too little data is written to exceed the threshold within the gc_grace_seconds period, deletes might be lost. Default: 1.

  • [repair_service] max_down_node_retry

    The maximum number of times to retry a repair task when a node containing a replica is down. Retries occur every 10 seconds, so the default of 1080 retries spans 10800 seconds (3 hours), which corresponds to the default Cassandra hinted-handoff expiration. Example: To double the time allowed for repair attempts on a down node or replica to 6 hours, set the number of retries to 2160. Default: 1080.

  • [repair_service] max_pending_repairs

    The maximum number of pending repairs allowed on a node at one time. Default: 5.

  • [repair_service] min_repair_time

    The minimum length of time in seconds for a repair to complete. If a repair finishes sooner, the Repair Service sleeps for the remaining time. Default: 5.

  • [repair_service] persist_directory

    The location in which to store a file with the current Repair Service status. The default location is /var/lib/opscenter/repair_service for package installations and install_location/repair_service for tarball installations.

  • [repair_service] persist_period

    The minimum number of seconds between writes of the persist file to disk by the Repair Service. Default: 3600 (1 hour). This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs.

  • [repair_service] restart_period

    The length of time in seconds that the Repair Service pauses in response to certain events before verifying cluster stability and restarting repairs. Default: 300 (5 minutes).

  • [repair_service] single_repair_timeout

    The maximum length of time in seconds for a repair to complete. Default: 3600 (1 hour).

  • [repair_service] single_task_err_threshold

    The maximum number of times to retry a repair task before temporarily skipping the task and moving on to the next task. The skipped task is moved to the end of the repair queue to be retried later. After the maximum number of retries is reached, an alert is fired. Default: 10.

  • [repair_service] snapshot_override

    Specifies whether to override the default snapshot repair behavior. Specifying this option as True runs validation compaction sequentially rather than in parallel. Default: False.

  • [repair_service] time_to_completion_target_percentage

    The percentage of the time to completion that the Repair Service targets, slowing down or reducing parallelism as necessary to avoid overtaxing the cluster. Default: 65. This parameter applies to subrange and incremental repairs only, and is not applicable to distributed subrange repairs.
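The retry arithmetic given for max_down_node_retry in the list above can be checked with a short sketch. The function name retries_for_window is hypothetical and is not part of OpsCenter. The only input taken from the text is the 10-second interval between retries.

```python
# Retries occur every 10 seconds, per the max_down_node_retry description.
RETRY_INTERVAL_SECONDS = 10


def retries_for_window(hours):
    """Return the max_down_node_retry value that spans the given number of hours."""
    return int(hours * 3600 // RETRY_INTERVAL_SECONDS)


print(retries_for_window(3))  # prints 1080, the default 3-hour window
print(retries_for_window(6))  # prints 2160, the doubled 6-hour window
```

The resulting number is the value to set for max_down_node_retry in the [repair_service] section.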

© 2024 DataStax