Subrange repairs overview
Summary of subrange repair behavior and available configuration options.
Subrange repairs repair a portion of the data that a node is responsible for. After an entire cluster has been repaired, including any independently running incremental repairs, the Repair Service recalculates the list of subranges to repair and starts over. Repairing an entire cluster one time is referred to as a repair cycle.
Excluding keyspaces and tables
Specify entire keyspaces or certain tables for subrange repairs to ignore, in addition to the default system keyspaces, certain rollup tables, and so forth. See Excluding keyspaces or tables from subrange repairs.
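For illustration, a minimal sketch of such exclusions, assuming the ignore_keyspaces and ignore_tables options in the [repair_service] section of the cluster configuration file; option names and exact value syntax vary by OpsCenter version, so treat the values as hypothetical:

  [repair_service]
  # Regular expression of keyspace names for subrange repairs to skip,
  # in addition to the system keyspaces skipped by default.
  ignore_keyspaces = test_.*
  # Tables to skip, given as keyspace.table names (hypothetical example).
  ignore_tables = metrics.rollups_staging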
Prioritizing tasks
The prioritization_page_size option limits the number of possible repair tasks for the Repair Service to review when selecting a low-impact repair. Increasing the page size is more CPU-intensive for the Repair Service but could result in more optimal dispatching of repairs to the cluster. prioritization_page_size is an expert option that should not be changed without guidance from DataStax Support.
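For illustration only (this is an expert option), the setting lives with the other Repair Service options, here assumed to be the [repair_service] section of the cluster configuration file; the value shown is hypothetical, not a recommendation:

  [repair_service]
  # Maximum number of candidate repair tasks reviewed when choosing a
  # low-impact repair. Larger values cost more CPU in the Repair Service
  # but can dispatch repairs more optimally. Expert option: change only
  # with guidance from DataStax Support.
  prioritization_page_size = 512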
Offline splits
Offline splits refers to offline task generation (determining splits for subrange repairs) by the Repair Service when a node is down or unavailable.
Ideally, during planning of a subrange repair, the Repair Service in the OpsCenter daemon retrieves the token subrange splits from each OpsCenter agent in the cluster, since each agent is able to retrieve the necessary data from its node to determine the optimal set of subrange splits for each keyspace to repair. However, if either the agent or node is offline or unavailable, the Repair Service falls back to splitting the token range for that node. This is less than optimal because the OpsCenter daemon cannot access the information about counts and sizes of partitions that belong to a token range for an unavailable node.
The offline_splits option controls the number of subranges per keyspace into which the primary range for a node is split. The goal is for each subrange to contain no more than approximately 32,000 partitions per keyspace; a subrange of about 32,000 partitions is optimal because that is the largest number of partitions in a range that can be repaired in a single attempt without streaming more data than necessary between nodes.
The default for the offline_splits option is 256. For sparsely populated clusters, the default might suffice. For clusters with much more densely populated nodes, it could make sense to increase the default value. The system.size_estimates table is regenerated every five minutes, and gives some indication of how many partitions are contained within each node's primary range for each keyspace and table.
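As a rough sizing sketch using the arithmetic described above (the partition count and resulting value are illustrative, not recommendations):

  [repair_service]
  # Goal: no more than ~32,000 partitions per subrange. If
  # system.size_estimates shows roughly 16,000,000 partitions in a node's
  # primary range for a keyspace, about 16,000,000 / 32,000 = 500 subranges
  # are needed, so a value above the 256 default would be warranted.
  offline_splits = 512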
The offline_splits option and its related options are expert-level settings that should not be adjusted without guidance from DataStax Support.
The Repair Service log indicates if offline splits had to be used for any node.
Throttling subrange repair time
The Repair Service automatically throttles subrange repairs when the current repair cycle is estimated to finish significantly before the deadline specified by the time to completion.
The time_to_completion_target_percentage configuration option controls the frequency and pace of the subrange repair process. The throttle slows repairs or reduces parallel repair processes as necessary to prevent overloading the cluster while still completing a repair cycle within the time window designated by the Time to completion value. The default value for the target percentage to complete a repair is 65%.
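For reference, a minimal sketch of the default throttle setting, assuming the [repair_service] section of the cluster configuration file:

  [repair_service]
  # Target percentage of the Time to completion window within which a
  # repair cycle should finish; the throttle paces repairs to match.
  time_to_completion_target_percentage = 65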
Because several repair configuration options are constrained by the time_to_completion_target_percentage throttle, configure advanced repair options judiciously to optimize repair performance for your production environment and to avoid misconfiguration issues. Most default settings do not require adjustment unless advised by DataStax Support.
If there are any issues with the Repair Service configuration, the Repair service not configured correctly rule in the Best Practice Service fails and provides guidance as to incorrectly configured options, unless the rule has been turned off.
Change max_parallel_repairs, min_repair_time, and other advanced or expert options only if the time_to_completion_target_percentage throttle is not in use. See Adjusting or disabling the throttle for subrange repairs for more information.
Calculating parallel repairs
The Repair Service averages the throughput of recent repairs, and uses the average throughput to dynamically determine the number of parallel repairs required to complete a repair during the current cycle. The num_recent_throughputs option determines the maximum number of recent throughput samples used to calculate the average. The default value is 500. Calculating parallel repairs also depends on a minimum throughput value: the min_throughput option sets the throughput required for any given repair task to be considered when determining the number of parallel repairs. The default value is 512 bytes/sec.
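For reference, the two inputs to that calculation with the defaults stated above, assuming the [repair_service] section of the cluster configuration file:

  [repair_service]
  # Maximum number of recent throughput samples averaged together.
  num_recent_throughputs = 500
  # Minimum throughput (bytes/sec) for a repair task to count toward the
  # average used to size parallel repairs.
  min_throughput = 512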
Maximum pending repairs
Before issuing a new subrange repair, the Repair Service checks the number of repairs that are running or waiting to run. If the configured maximum pending repairs threshold would be exceeded, the Repair Service skips that node for the time being to avoid overwhelming an already swamped node. The repair task is moved to the back of the pending repair tasks queue to retry later, and an alert is fired.
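A minimal sketch, assuming the threshold is exposed as a max_pending_repairs option in the [repair_service] section (the option name and value here are assumptions for illustration):

  [repair_service]
  # Maximum repairs running or waiting to run on a node before the Repair
  # Service skips that node, requeues the task, and fires an alert.
  max_pending_repairs = 5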
Subrange repair status
View progress, statistics, and details for subrange repairs in the Repair Status tab.