Subrange repairs overview

A subrange repair repairs a portion of the data that a node is responsible for. After an entire cluster has been repaired, including any independently running incremental repairs, the DSE OpsCenter Repair Service recalculates the list of subranges to repair and starts over. One complete repair of the entire cluster is referred to as a repair cycle.

Exclude keyspaces and tables

Specify entire keyspaces or individual tables for subrange repairs to ignore. These exclusions apply in addition to the defaults, which include the system keyspaces and certain rollup tables. See Exclude keyspaces or tables from subrange repairs.

Prioritize tasks

When running subrange repairs, the Repair Service determines which nodes have the least traffic in terms of compactions or streaming between nodes. Both streaming and compaction activities are represented in the Repair Service Status.

Only modify the prioritization_page_size if instructed by IBM Support. This is an expert repair tuning option that can have significant impact on repair performance if configured incorrectly or suboptimally.

The prioritization_page_size option limits the number of possible repair tasks for the Repair Service to review when selecting a low impact repair. Increasing the page size is more CPU-intensive for the Repair Service but could result in more optimal dispatching of repairs to the cluster.
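The trade-off described above can be illustrated with a minimal Python sketch. It is not the Repair Service implementation: the RepairTask type, the pick_low_impact_task function, and the node_load score (active compactions plus streams per node) are hypothetical names introduced only for illustration. The sketch reviews the first page_size pending tasks and picks the one on the least busy node.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class RepairTask:
    """Hypothetical task shape: a node and the token subrange to repair."""
    node: str
    subrange: Tuple[int, int]


def pick_low_impact_task(
    pending: Iterable[RepairTask],
    node_load: Dict[str, int],
    page_size: int,
) -> Optional[RepairTask]:
    """Return the task on the least busy node among the first page_size tasks."""
    page = list(islice(pending, page_size))
    if not page:
        return None
    # node_load is an assumed score: active compactions plus streams per node.
    return min(page, key=lambda task: node_load.get(task.node, 0))


tasks = [
    RepairTask("node1", (0, 100)),
    RepairTask("node2", (100, 200)),
    RepairTask("node3", (200, 300)),
]
load = {"node1": 5, "node2": 3, "node3": 0}

# A small page never sees node3, so it settles for node2.
small_page = pick_low_impact_task(tasks, load, page_size=2)
# A larger page costs more work per decision but finds the idle node3.
large_page = pick_low_impact_task(tasks, load, page_size=3)
```

In this sketch, a larger page examines more candidates per decision, which is the extra CPU cost, and in exchange can find a quieter node that a smaller page would miss.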

Offline splits

Offline splits refers to offline task generation (determining splits for subrange repairs) by the Repair Service when a node is down or unavailable.

Ideally, during planning of a subrange repair, the Repair Service in the OpsCenter daemon retrieves the token subrange splits from each OpsCenter agent in the cluster, since each agent is able to retrieve the necessary data from its node to determine the optimal set of subrange splits for each keyspace to repair. However, if either the agent or node is offline or unavailable, the Repair Service falls back to splitting the token range for that node. This is less than optimal because the OpsCenter daemon cannot access the information about counts and sizes of partitions that belong to a token range for an unavailable node.

Only modify the offline splits configuration options if instructed by IBM Support. These are expert repair tuning options that can have significant impact on repair performance if configured incorrectly or suboptimally.

The offline_splits option controls how many subranges per keyspace a node's primary range is split into. The goal is for each subrange to contain no more than approximately 32,000 partitions per keyspace. A subrange of 32,000 partitions is optimal because that is the largest number of partitions in a range that can be repaired in a single attempt without streaming more data than necessary between nodes.

The default for the offline_splits option is 256. For sparsely populated clusters, the default might suffice. For clusters having much more densely populated nodes, it could make sense to increase the default value. The system.size_estimates table is regenerated every five minutes, and gives some indication of how many partitions are contained within each node’s primary range for each keyspace and table.
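As a rough illustration of the arithmetic behind the 32,000-partition goal, the following Python sketch estimates how many splits a node's primary range would need for one keyspace. The function names are hypothetical, and the partition count is assumed to be supplied by the caller, for example from an estimate derived from system.size_estimates. This is a back-of-the-envelope aid, not the Repair Service's own calculation.

```python
import math

TARGET_PARTITIONS_PER_SUBRANGE = 32_000  # approximate goal stated in the text
DEFAULT_OFFLINE_SPLITS = 256             # documented default


def estimate_offline_splits(partitions_in_primary_range: int) -> int:
    """Splits needed so each subrange holds at most ~32,000 partitions."""
    if partitions_in_primary_range <= 0:
        return 1
    return math.ceil(partitions_in_primary_range / TARGET_PARTITIONS_PER_SUBRANGE)


def default_is_sufficient(partitions_in_primary_range: int) -> bool:
    """True when the default of 256 splits already meets the goal."""
    return estimate_offline_splits(partitions_in_primary_range) <= DEFAULT_OFFLINE_SPLITS


# 256 splits * 32,000 partitions = 8,192,000 partitions per keyspace per node.
sparse = estimate_offline_splits(8_192_000)   # 256
dense = estimate_offline_splits(20_000_000)   # 625
```

Under these assumptions, the default of 256 covers up to about 8.2 million partitions per keyspace in a node's primary range. A denser node, such as the 20 million partition example, would need a higher value.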

The Repair Service log indicates if offline splits had to be used for any node.

Throttle subrange repair time

The Repair Service automatically throttles subrange repairs when the current repair cycle is estimated to finish significantly before the deadline specified by the time to completion.

Incremental repairs are exempt from this throttle.

The time_to_completion_target_percentage configuration option controls the frequency and pace of the subrange repair process. The throttle slows repairs or reduces parallel repair processes as necessary to prevent overloading the cluster while still completing a repair cycle within the Time to completion value. The default target percentage is 65%.
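One way to read the target percentage is as a fraction of the time to completion that the cycle should aim to fill. The Python sketch below follows that reading, which is an assumption and not a description of the Repair Service internals. The function names and the rule "throttle when the estimated cycle duration is shorter than the target duration" are hypothetical simplifications.

```python
def target_duration_seconds(
    time_to_completion_seconds: float,
    target_percentage: float = 65.0,  # documented default
) -> float:
    """Duration the cycle aims for, as a share of the time to completion."""
    return time_to_completion_seconds * target_percentage / 100.0


def should_throttle(
    estimated_cycle_seconds: float,
    time_to_completion_seconds: float,
    target_percentage: float = 65.0,
) -> bool:
    """Assumed rule: slow down when the cycle would finish before the target."""
    target = target_duration_seconds(time_to_completion_seconds, target_percentage)
    return estimated_cycle_seconds < target


# With a time to completion of 1,000 seconds, the target is 650 seconds.
target = target_duration_seconds(1000)
too_fast = should_throttle(500, 1000)   # finishes well before the target
on_pace = should_throttle(800, 1000)    # already slower than the target
```

Under this reading, a cycle estimated at 500 seconds would be throttled, while one estimated at 800 seconds would not.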

Because certain repair config options are tempered by the percentage option, a judicious approach to configuring advanced repair options can optimize repair performance for various production environments and avoid issues due to misconfiguration.

For most use cases, the default settings are adequate.

If there are any issues with the Repair Service configuration, the "Repair Service not configured correctly" rule in the Best Practice Service fails and provides guidance on the incorrectly configured options, unless the rule has been turned off.

DataStax recommends modifying the following settings only if you aren’t using the time_to_completion_target_percentage throttle:

Manually setting max_parallel_repairs
min_repair_time
Other advanced repair configuration options
Expert repair configuration options

For more information, see Adjust or disable the throttle for subrange repairs.

Calculate parallel repairs

The Repair Service averages the throughput of recent repairs and uses that average to dynamically determine the number of parallel repairs required to complete the current repair cycle. The num_recent_throughputs option sets the maximum number of recent throughput measurements included in the average. The default value is 500. The calculation also depends on a minimum throughput: the min_throughput option is the throughput that a repair task must reach to be considered when determining the number of parallel repairs. The default value is 512 bytes/sec. See Set the maximum for parallel subrange repairs for more information.
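The interaction between num_recent_throughputs and min_throughput can be sketched in Python as a bounded rolling average. This is an illustration under stated assumptions, not the Repair Service code: the ThroughputTracker class, the parallel_repairs_needed function, the inclusive comparison against min_throughput, and the formula that divides remaining work by what a single repair stream can process are all hypothetical.

```python
import math
from collections import deque
from typing import Optional


class ThroughputTracker:
    """Rolling average over at most num_recent_throughputs measurements."""

    def __init__(self, num_recent_throughputs: int = 500,
                 min_throughput: float = 512.0) -> None:
        self.min_throughput = min_throughput  # bytes/sec
        self._recent = deque(maxlen=num_recent_throughputs)

    def record(self, bytes_per_sec: float) -> None:
        # Assumption: measurements below min_throughput are ignored.
        if bytes_per_sec >= self.min_throughput:
            self._recent.append(bytes_per_sec)

    def average(self) -> Optional[float]:
        if not self._recent:
            return None
        return sum(self._recent) / len(self._recent)


def parallel_repairs_needed(remaining_bytes: float, remaining_seconds: float,
                            avg_throughput: float) -> int:
    """Assumed formula: remaining work over what one repair stream can do."""
    if remaining_seconds <= 0 or avg_throughput <= 0:
        raise ValueError("remaining_seconds and avg_throughput must be positive")
    return max(1, math.ceil(remaining_bytes / (avg_throughput * remaining_seconds)))


tracker = ThroughputTracker(num_recent_throughputs=3)
for sample in (100.0, 1000.0, 2000.0, 3000.0, 4000.0):
    tracker.record(sample)  # 100.0 is below 512 and is dropped

avg = tracker.average()  # mean of the last three kept samples
needed = parallel_repairs_needed(9_000_000, 1000, avg)
```

The deque with maxlen keeps only the most recent measurements, so older samples drop out of the average automatically once the limit is reached.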

Maximum pending repairs

Before issuing a new subrange repair, the Repair Service checks the number of repairs that are either running or waiting to run on the node. If the new repair would exceed the configured maximum pending repairs threshold, the Repair Service skips that node for the time being to avoid overwhelming an already swamped node. The repair task is moved to the back of the pending repair tasks queue to be retried later, and an alert is fired.
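The skip-and-requeue behavior can be sketched in Python with a double-ended queue. The dispatch_next function, the (node, subrange) task tuples, and the alert message are hypothetical and stand in for whatever the Repair Service does internally. The sketch only shows the queue discipline described above.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

Task = Tuple[str, str]  # (node, subrange label), a hypothetical task shape


def dispatch_next(
    queue: Deque[Task],
    pending_by_node: Dict[str, int],
    max_pending_repairs: int,
) -> Tuple[Optional[Task], Optional[str]]:
    """Try the task at the front of the queue.

    Returns (task, None) when dispatched, or (None, alert) when deferred.
    """
    if not queue:
        return None, None
    task = queue.popleft()
    node = task[0]
    pending = pending_by_node.get(node, 0)
    if pending + 1 > max_pending_repairs:
        # Threshold would be exceeded: retry later from the back of the queue.
        queue.append(task)
        return None, f"max pending repairs reached on {node}; task deferred"
    pending_by_node[node] = pending + 1
    return task, None


queue = deque([("node1", "r1"), ("node2", "r2")])
pending_by_node = {"node1": 5, "node2": 0}

# node1 is already at the limit, so its task is deferred and an alert returned.
first, first_alert = dispatch_next(queue, pending_by_node, max_pending_repairs=5)
# node2 has capacity, so its task is dispatched next.
second, second_alert = dispatch_next(queue, pending_by_node, max_pending_repairs=5)
```

After these two calls, the deferred node1 task is the only entry left in the queue and will be retried on a later pass.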

Subrange repair status

View progress, statistics, and details for subrange repairs in the Repair Status tab.