Subrange repairs overview
Subrange repairs repair a portion of the data that a node is responsible for. After an entire cluster has been repaired, including any independently running incremental repairs, the Repair Service recalculates the list of subranges to repair and starts over. Repairing an entire cluster one time is referred to as a repair cycle.
Excluding keyspaces and tables
Specify entire keyspaces or certain tables for subrange repairs to ignore, in addition to the default system keyspaces, certain rollup tables, and so forth. See Excluding keyspaces or tables from subrange repairs.
Prioritizing repairs
When running subrange repairs, the Repair Service determines which nodes have the least traffic in terms of compactions or streaming between nodes, so that it can select a low-impact repair.
Both streaming and compaction activities are represented in the Repair Service Status.
The prioritization_page_size option limits the number of possible repair tasks that the Repair Service reviews when selecting a low-impact repair.
Increasing the page size is more CPU-intensive for the Repair Service but could result in more optimal dispatching of repairs to the cluster.
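The selection described above can be pictured with a minimal Python sketch. It is illustrative only and is not the Repair Service implementation: the task dictionaries, the node_load mapping (compactions plus streams per node), and the function name are hypothetical. Only the idea of reviewing at most prioritization_page_size candidates comes from the text.

```python
from typing import Optional


def pick_low_impact_task(candidates: list, node_load: dict,
                         prioritization_page_size: int) -> Optional[dict]:
    """Review at most one page of candidate tasks and return the one whose
    node currently has the least compaction and streaming activity."""
    page = candidates[:prioritization_page_size]
    if not page:
        return None
    # Hypothetical load metric: active compactions + active streams per node.
    return min(page, key=lambda task: node_load.get(task["node"], 0))


# Example data (made up for illustration).
tasks = [
    {"node": "10.0.0.1", "range": (0, 100)},
    {"node": "10.0.0.2", "range": (100, 200)},
    {"node": "10.0.0.3", "range": (200, 300)},
]
load = {"10.0.0.1": 4, "10.0.0.2": 2, "10.0.0.3": 0}

small_page = pick_low_impact_task(tasks, load, prioritization_page_size=2)
large_page = pick_low_impact_task(tasks, load, prioritization_page_size=3)
```

With a page size of 2 the idle third node is never reviewed, so the sketch picks the second node. A page size of 3 finds the idle node at the cost of examining more candidates, which mirrors the CPU trade-off noted above.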
Offline splits
Offline splits refers to offline task generation (determining splits for subrange repairs) by the Repair Service when a node is down or unavailable.
Ideally, during planning of a subrange repair, the Repair Service in the OpsCenter daemon retrieves the token subrange splits from each OpsCenter agent in the cluster, since each agent is able to retrieve the necessary data from its node to determine the optimal set of subrange splits for each keyspace to repair. However, if either the agent or node is offline or unavailable, the Repair Service falls back to splitting the token range for that node. This is less than optimal because the OpsCenter daemon cannot access the information about counts and sizes of partitions that belong to a token range for an unavailable node.
The offline_splits option controls the number of subranges per keyspace into which the primary range for a node is split.
The goal for each subrange is to have no more than approximately 32,000 partitions per keyspace.
Repairing a subrange that contains 32,000 partitions is optimal because that is the largest number of partitions in a range that can be repaired in a single attempt without streaming more data than necessary between nodes.
The default for the offline_splits option is 256.
For sparsely populated clusters, the default might suffice.
For clusters having much more densely populated nodes, it could make sense to increase the default value.
The system.size_estimates table is regenerated every five minutes and gives some indication of how many partitions each node's primary range contains for each keyspace and table.
Configuration options for offline splits and its related options are considered expert-level options that should not be adjusted without guidance from DataStax Support.
The Repair Service log indicates if offline splits had to be used for any node.
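As a rough illustration of the sizing reasoning above, the following Python sketch estimates how many splits would keep each subrange near the 32,000-partition goal. The function name and the input (an approximate partition count for a node's primary range, such as one derived from system.size_estimates) are assumptions made for this example. The result is for discussion with DataStax Support, not a value to apply directly.

```python
TARGET_PARTITIONS_PER_SUBRANGE = 32_000  # goal stated in the text above


def estimate_offline_splits(partitions_in_primary_range: int) -> int:
    """Return the number of subranges needed so that each one holds no more
    than roughly 32,000 partitions (ceiling division, minimum of 1)."""
    if partitions_in_primary_range <= 0:
        return 1
    return -(-partitions_in_primary_range // TARGET_PARTITIONS_PER_SUBRANGE)


# The default of 256 splits covers up to 256 * 32,000 = 8,192,000 partitions.
covered_by_default = estimate_offline_splits(8_192_000)   # 256
dense_node = estimate_offline_splits(20_000_000)          # 625
```

Under these assumptions, a node whose primary range holds about 20 million partitions in a keyspace would need roughly 625 splits, which is the kind of densely populated case where the default of 256 might be too low.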
Throttling subrange repair time
The Repair Service automatically throttles subrange repairs when the current repair cycle is estimated to finish significantly before the deadline specified by the time to completion.
The time_to_completion_target_percentage configuration option controls the frequency and pace of the subrange repair process.
The throttle slows repairs or reduces parallel repair processes as necessary to prevent overloading the cluster while still completing a repair cycle within the time window set by the Time to completion value.
The default value for the target percentage to complete a repair is 65%.
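The arithmetic behind the target can be sketched in Python. This sketch assumes the percentage is applied to the Time to completion value to produce a target cycle duration, and that throttling applies when the estimated cycle duration is shorter than that target. Both the interpretation and the function names are assumptions for illustration, and the 9-day window is only an example value.

```python
from datetime import timedelta


def target_cycle_duration(time_to_completion: timedelta,
                          target_percentage: int = 65) -> timedelta:
    """Target duration of a repair cycle: a percentage of Time to completion."""
    return time_to_completion * target_percentage / 100


def should_throttle(estimated_cycle_duration: timedelta,
                    time_to_completion: timedelta,
                    target_percentage: int = 65) -> bool:
    """Assumed rule: throttle when the cycle would finish ahead of the target."""
    target = target_cycle_duration(time_to_completion, target_percentage)
    return estimated_cycle_duration < target


window = timedelta(days=9)               # example Time to completion value
target = target_cycle_duration(window)   # 5 days, 20 hours, 24 minutes
```

In this example, a cycle estimated to finish in 3 days would be slowed down, while one estimated at 7 days would not.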
Because certain repair configuration options are tempered by the percentage option, a judicious approach to configuring advanced repair options can optimize repair performance for various production environments and avoid issues caused by misconfiguration. Most default settings do not require adjustment unless advised by a DataStax Support professional.
If there are any issues with the Repair Service configuration, the "Repair service not configured correctly" rule in the Best Practice Service fails and provides guidance about the incorrectly configured options, unless the rule has been turned off.
DataStax recommends only manually adjusting the
Incremental repairs are exempt from this throttle.
See Adjusting or disabling the throttle for subrange repairs for more information.
Calculating parallel repairs
The Repair Service calculates an average throughput from the throughput of recent repairs.
It uses this average to dynamically determine the number of parallel repairs required to complete the current repair cycle on time.
The num_recent_throughputs option determines the maximum number of recent throughputs used to calculate the average throughput.
The default value is 500.
The calculation also depends on a minimum throughput value.
The min_throughput option represents the throughput required for any given repair task to be considered when determining the number of parallel repairs.
The default value is 512 bytes/sec.
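The following Python sketch shows one way the two options could interact. It is a simplified model and not the Repair Service's actual algorithm: the class, the function, and the formula for the number of parallel repairs (required throughput divided by average throughput) are assumptions introduced for illustration. Only the option names and their defaults come from the text.

```python
import math
from collections import deque
from typing import Optional


class ThroughputTracker:
    """Rolling average over the most recent repair throughputs (bytes/sec)."""

    def __init__(self, num_recent_throughputs: int = 500,
                 min_throughput: float = 512.0) -> None:
        # A bounded deque drops the oldest sample once the limit is reached.
        self._recent = deque(maxlen=num_recent_throughputs)
        self.min_throughput = min_throughput

    def record(self, bytes_per_sec: float) -> None:
        # Samples below min_throughput are not considered (assumed rule).
        if bytes_per_sec >= self.min_throughput:
            self._recent.append(bytes_per_sec)

    def average(self) -> Optional[float]:
        if not self._recent:
            return None
        return sum(self._recent) / len(self._recent)


def parallel_repairs_needed(remaining_bytes: float, seconds_remaining: float,
                            avg_throughput: float) -> int:
    """Hypothetical formula: how many repairs must run at the average rate
    to move the remaining data before the cycle's time runs out."""
    required_rate = remaining_bytes / seconds_remaining
    return max(1, math.ceil(required_rate / avg_throughput))


tracker = ThroughputTracker(num_recent_throughputs=3)
for sample in (100, 1000, 2000, 3000, 4000):  # 100 is below the 512 minimum
    tracker.record(sample)

avg = tracker.average()  # mean of the last three accepted samples
needed = parallel_repairs_needed(9_000_000, 1000, avg)
```

A smaller num_recent_throughputs makes the average react faster to recent conditions, while a larger one smooths out short-lived spikes.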
See Setting the maximum for parallel subrange repairs for more information.
Maximum pending repairs
Before issuing a new subrange repair, the Repair Service checks the number of repairs that are running or waiting to run on the node. If the configured maximum pending repairs threshold would be exceeded, the Repair Service skips that node for the time being to avoid overwhelming an already swamped node. The repair task is moved to the back of the pending repair tasks queue to be retried later, and an alert is fired.
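The skip-and-requeue behavior described above can be sketched in Python as follows. The data structures, the function name, and the max_pending_repairs parameter name are hypothetical and chosen only to illustrate the flow. They do not reflect the Repair Service's internals.

```python
from collections import deque
from typing import Optional


def dispatch_next(queue: deque, pending_by_node: dict,
                  max_pending_repairs: int, alerts: list) -> Optional[dict]:
    """Take the next task; dispatch it unless its node is already at the limit."""
    task = queue.popleft()
    node = task["node"]
    pending = pending_by_node.get(node, 0)
    if pending + 1 > max_pending_repairs:
        # Threshold would be exceeded: requeue at the back and raise an alert.
        queue.append(task)
        alerts.append(f"max pending repairs reached on {node}")
        return None
    pending_by_node[node] = pending + 1
    return task


queue = deque([
    {"node": "a", "range": (0, 10)},
    {"node": "b", "range": (10, 20)},
])
pending = {"a": 5}   # node "a" already has 5 running or waiting repairs
alerts = []

first = dispatch_next(queue, pending, max_pending_repairs=5, alerts=alerts)
second = dispatch_next(queue, pending, max_pending_repairs=5, alerts=alerts)
```

Here the task for node "a" is deferred and lands behind the task for node "b", which is dispatched on the next call.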
Subrange repair status
View progress, statistics, and details for subrange repairs in the Repair Status tab.