Repair Service overview 

The Repair Service runs as a background process, repairing small chunks of a DataStax Enterprise cluster cyclically so that the entire cluster is repaired within the specified time to completion. Any anticipated overshoot of the targeted completion time is communicated with a revised estimate. This overview describes the Repair Service behavior and its response to changes in cluster topology or schemas.

Repair Service Summary 

The Repair Service automates the repair process for DSE clusters. There are two types of repairs handled by the service: subrange and incremental.

The term repair is a bit of a misnomer. Repairs run by the Repair Service mainly synchronize the most current data across nodes and their replicas, which includes repairing any corrupted data encountered at the filesystem level. The Repair Service can run both subrange and incremental repairs. By default, it runs subrange repairs for most tables and can be configured to run incremental repairs on certain tables.

Subrange repairs repair a portion of the data that a node is responsible for. Subrange repairs in the Repair Service are analogous to specifying the -st and -et options on the nodetool repair command, except that the Repair Service determines and optimizes the start and end tokens of each subrange for you. The main benefit of subrange repair is more precise targeting of repairs while avoiding overstreaming.

Incremental repairs only repair data that has not been previously repaired on tables reserved and configured for incremental repair.

Subrange repairs operate on an exclusion (opt-out) basis: specific keyspaces and tables can be excluded. Tables ignored during subrange repairs are those reserved by OpsCenter plus any configured by administrators. Incremental repairs operate on an inclusion (opt-in) basis: only the keyspaces and tables designated for incremental repair are processed during an incremental repair. Tables flagged for incremental repair include those built in to OpsCenter plus any configured by administrators.

If data is relatively static, configure incremental repair for those tables or datacenters. If data is dynamic and constantly changing, use subrange repairs, excluding keyspaces and tables as appropriate for your environment.
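
As a rough illustration of this split, the opt-out and opt-in lists might be expressed in a [repair_service] configuration section like the sketch below. The option names ignore_keyspaces, ignore_tables, and incremental_repair_tables, along with the example keyspace and table names, are assumptions for illustration; verify the exact option names for your OpsCenter version.

    [repair_service]
    # Subrange repairs (opt out): skip these keyspaces and tables.
    ignore_keyspaces = dse_perf
    ignore_tables = my_ks.debug_events
    # Incremental repairs (opt in): only these tables are repaired incrementally.
    incremental_repair_tables = my_ks.static_lookup, my_ks.archive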

There is no crossover between subrange and incremental repairs: the two repair types are mutually exclusive at the table level, so each keyspace or table is repaired by one or the other. The Repair Service runs both repair types simultaneously. Each repair type has its own timeline, tracked in its own subrange or incremental progress bar in the Repair Status summary.

Parameters 

The time_to_completion parameter sets the maximum amount of time allowed to repair the entire cluster once.

Note: Typically, set the Time to Completion to a value lower than the lowest grace seconds before garbage collection setting (gc_grace_seconds) on your tables. The default for gc_grace_seconds is 10 days (864000 seconds). OpsCenter provides an estimate by checking gc_grace_seconds across all tables and calculating 90% of the lowest value. Based on the typical gc_grace_seconds default, the default estimate for the time to completion is 9 days. For more information about configuring grace seconds, see gc_grace_seconds in the CQL documentation.
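
As a worked example of that estimate: 90% of the 864000-second default is 0.9 * 864000 = 777600 seconds, or 9 days. A minimal configuration sketch follows, assuming time_to_completion is set in the [repair_service] section and expressed in days; both the placement and the unit are assumptions to verify against your OpsCenter version.

    [repair_service]
    # 0.9 * 864000 s = 777600 s = 9 days
    time_to_completion = 9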

The Repair Service might run multiple subrange repairs in parallel, but runs as few as needed to complete within the specified time. The Repair Service never runs more than one repair within a single replica set; parallel repairs only run against non-overlapping replica sets.

Estimating remaining repair time 

If the Repair Service anticipates it cannot complete a repair cycle within the allotted time to completion due to throughput, it displays a warning message and a newly estimated time remaining to complete the repair cycle. The Repair Service does not adjust the configured time to completion; it reports the revised estimate for completion without stopping the repair in progress.

When the Repair Service estimates that it will not finish a repair cycle within the configured time_to_completion, it triggers an ALERT in the OpsCenter Event Log. The alert is also visible in opscenterd.log and in the Event Log in the Activities section of the OpsCenter UI. If email alerts or post-url alert notifications are configured, the alert notifications are emailed or posted.

The error_logging_window configuration property controls both how often to log the message and how often to fire the alert if the Repair Service continues to estimate that it will not finish a repair in time.
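
For example, the alert cadence might be tuned with a setting such as the following sketch. The seconds unit and the example value are assumptions; confirm the default and unit for error_logging_window in the configuration reference for your OpsCenter version.

    [repair_service]
    # Example only: log the warning and re-fire the alert at most once per hour.
    error_logging_window = 3600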

Parallel vs. sequential validation compaction processing

The Repair Service runs validation compaction in parallel by default rather than sequentially because sequential processing takes considerably more time. The snapshot_override setting controls whether validation compactions for both subrange and incremental repairs are processed in parallel or sequentially. See Running validation compaction sequentially.
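
A minimal sketch of requesting sequential processing, assuming snapshot_override is a boolean in the [repair_service] section and that True selects sequential (snapshot-based) validation compaction; see Running validation compaction sequentially for the authoritative procedure.

    [repair_service]
    # Assumption: True = sequential (snapshot-based) validation compaction;
    # the default is parallel processing.
    snapshot_override = True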

Restart frequency 

The Repair Service pauses when it detects a topology change or schema change and then restarts after a period of time. The restart period is controlled by the restart_period configuration option, which defaults to 300 seconds (5 minutes). While paused, the Repair Service checks the state of the cluster at this interval until it is able to reactivate.
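
For example, to have the Repair Service wait 10 minutes instead of 5 before checking whether it can resume, a sketch might look like the following; the [repair_service] section placement is an assumption, and the value is in seconds per the default above.

    [repair_service]
    # Wait 600 seconds (10 minutes) after a topology or schema change
    # before attempting to resume repairs.
    restart_period = 600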

Conditions under which the Repair Service does not run 

A cluster with a single node is not eligible for repairs. Repairs make node replicas consistent; therefore, there must be at least two nodes to exchange Merkle trees during the repair process.

Repair Service behavior during environment changes 

The following sections provide details on how the Repair Service behaves when there are changes in the environment such as topology changes, down nodes, and OpsCenter restarts.

Cluster topology changes 

The Repair Service becomes aware of any topology changes to a cluster almost immediately. When a change in cluster topology occurs, the Repair Service stops its current repair cycle and waits for the ring to stabilize before starting a new cycle. Before resuming repairs, the Repair Service checks the cluster state every 30 seconds by default. After the cluster has stabilized, the stabilization checks cease until the next time opscenterd is restarted. Configure the interval for the stable cluster check with the cluster_stabilization_period option (see the configuration sketch after the following list).

Topology changes include:
  • Nodes moving within a cluster
  • Nodes joining a cluster
  • Nodes leaving a cluster
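
A sketch of adjusting the stabilization check interval, assuming cluster_stabilization_period takes a value in seconds in the [repair_service] section:

    [repair_service]
    # Check whether the ring has stabilized every 60 seconds instead of
    # the default 30 seconds.
    cluster_stabilization_period = 60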

Schema changes 

When a schema change happens, the Repair Service pauses for five minutes by default, then starts back up and immediately begins repairing new keyspaces or tables. Schema changes include adding, changing, or removing keyspaces or tables.

Down nodes or replicas 

A repair cannot run if any of the nodes in the replica set for that range are down. If an entire rack or datacenter goes down, it is likely that no repair operations can run successfully on the cluster. When one or more nodes are down, the Repair Service continues to run repairs for ranges and keyspaces unaffected by the down nodes.

When there are no runnable repair operations remaining, the Repair Service waits for 10 seconds and checks again. The Repair Service repeats this check for up to the duration configured with the max_down_node_retry option, which defaults to three hours based on the max_hint_window_in_ms property in cassandra.yaml, and then starts a new cycle. After max_hint_window_in_ms is exceeded for a down node, the recovery process for that node is to rebuild rather than rely on hint replay. Therefore, the Repair Service starts a new cycle to ensure that any available ranges continue to be repaired and are not blocked by down nodes.

Note: To mitigate the performance implications of scanning the entire list of remaining repair tasks, the scan for available ranges only examines the first prioritization_page_size tasks (default: 512). The order of these tasks is random, so if no available ranges are found in the first prioritization_page_size tasks, it is unlikely that any available ranges remain.
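
These two settings might be tuned together, as in the following sketch. The seconds unit for max_down_node_retry is an assumption; 512 is the default page size stated above.

    [repair_service]
    # Assumption: seconds. The default tracks max_hint_window_in_ms
    # (about three hours).
    max_down_node_retry = 10800
    # Scan at most this many remaining repair tasks when looking for
    # available ranges (default: 512).
    prioritization_page_size = 512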

Persisted repair state when restarting opscenterd 

At the end of each persist period (one hour by default), the current state of the Repair Service is persisted locally on the opscenterd server in the persist directory. Configure the persist period frequency with the persist_period option and the persist directory location with the persist_directory option. When opscenterd is restarted, the Repair Service resumes where it left off based on the persisted state information.
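
A sketch of both persistence settings, assuming persist_period is expressed in seconds; the directory path shown is purely illustrative.

    [repair_service]
    # Persist Repair Service state every 3600 seconds (one hour, the default).
    persist_period = 3600
    # Illustrative path; choose a location on the opscenterd server.
    persist_directory = /var/lib/opscenter/repair_service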

Attention: If automatic failover is configured, be sure to mirror the repair service persist directory.

For more information on repair service continuity during a failure, see failover aftereffects.

opscenterd.log 

The location of the opscenterd.log file depends on the type of installation:

  • Package installations: /var/log/opscenter/opscenterd.log
  • Tarball installations: install_location/log/opscenterd.log