Tuning Repair Service for multi-datacenter environments
When running the Repair Service on a multi-datacenter cluster, consider the number of total repair tasks and over-streaming.
Reduce the number of repair tasks
A single repair task involves at least six network requests between any two peers, so reducing the total number of repair tasks can drastically reduce network overhead and the time to complete a full Repair Service cycle. The number of repair tasks is controlled by how many partitions are targeted for each subrange: the more partitions per subrange, the larger each subrange and the fewer total subranges. The tokenranges_partitions property controls the targeted partition count.
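As a rough back-of-the-envelope sketch of how this scales, the following Python snippet estimates subrange and request counts for a few tokenranges_partitions values. The cluster-wide partition count, the replica count, and the per-replica-pair request model are hypothetical assumptions; only the six-requests-per-task minimum comes from the paragraph above, and actual counts depend on topology and the Repair Service's scheduling.

    # Rough estimate of Repair Service work for a given tokenranges_partitions.
    # The partition count and replica count below are made-up example inputs.
    TOTAL_PARTITIONS = 2_000_000_000   # assumed cluster-wide partition count
    REPLICAS_PER_RANGE = 3             # assumed replication factor
    REQUESTS_PER_TASK = 6              # minimum requests between any two peers (from the text)

    def estimate_tasks(tokenranges_partitions: int) -> None:
        # Each subrange targets roughly tokenranges_partitions partitions,
        # so a larger value yields fewer, larger subranges.
        subranges = -(-TOTAL_PARTITIONS // tokenranges_partitions)  # ceiling division
        # Assume at least six requests between every pair of replicas per task.
        peer_pairs = REPLICAS_PER_RANGE * (REPLICAS_PER_RANGE - 1) // 2
        requests = subranges * peer_pairs * REQUESTS_PER_TASK
        print(f"tokenranges_partitions={tokenranges_partitions:>9,}: "
              f"~{subranges:,} repair tasks, ~{requests:,} network requests at minimum")

    for value in (262_144, 524_288, 1_048_576):  # 1048576 (2**20) is the default
        estimate_tasks(value)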
Avoid over-streaming
Over-streaming occurs when the Repair Service repairs a subrange that contains more partitions than the maximum Merkle tree depth can represent. A tree at the maximum depth of 20 has 2^20 leaves; when a subrange holds more partitions than that, multiple partitions hash into each leaf, and a mismatch in any one of them streams every partition in that leaf. This occurs if the tokenranges_partitions is set too high.
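To make that arithmetic concrete, here is a small illustrative sketch. The subrange partition counts are made-up inputs; the depth cap of 20 comes from the system.log message quoted at the end of this section.

    import math

    MAX_MERKLE_TREE_DEPTH = 20  # per-range cap noted in the system.log message below

    def overstream_factor(partitions_in_subrange: int) -> None:
        # Depth needed for one partition per leaf, capped at the maximum depth.
        needed_depth = math.ceil(math.log2(partitions_in_subrange))
        capped_depth = min(needed_depth, MAX_MERKLE_TREE_DEPTH)
        partitions_per_leaf = math.ceil(partitions_in_subrange / 2 ** capped_depth)
        print(f"{partitions_in_subrange:,} partitions -> depth {capped_depth}, "
              f"~{partitions_per_leaf} partition(s) streamed per mismatched leaf")

    overstream_factor(1_048_576)   # fits exactly at depth 20: no over-streaming
    overstream_factor(8_388_608)   # 2**23: ~8 partitions streamed per mismatch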
Guidelines for tuning
- Never set tokenranges_partitions higher than the default 1048576 (2^20), the number of partitions a Merkle tree at the maximum depth of 20 can represent.
- Test the tuning on the cluster prior to production. Look for the total number of repair tasks, the average repair task time, and the impact on cluster performance.
- If single repair tasks take longer than 20-30 minutes and a full Repair Service cycle still completes within gc_grace_seconds, halve tokenranges_partitions and re-test.
- To check for over-streaming, ensure the following line does not exist in system.log (a scripted check is sketched after this list):

  Range X with Y partitions require a merkle tree with depth Z but the maximum allowed depth for this range is 20.

  Note: X, Y, and Z are variables.
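As one way to automate that check, here is a minimal sketch that scans a system.log file for the over-streaming message. The default log path is an assumption and varies by installation; pass the actual path as the first argument.

    import re
    import sys

    # Pattern matching the over-streaming warning quoted above; the numeric
    # fields correspond to the X, Y, and Z placeholders in the documentation.
    PATTERN = re.compile(
        r"Range .+ with \d+ partitions require a merkle tree with depth \d+ "
        r"but the maximum allowed depth for this range is \d+"
    )

    # Log path is an assumption; adjust for your installation.
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"

    with open(log_path, encoding="utf-8", errors="replace") as log:
        hits = [line.rstrip() for line in log if PATTERN.search(line)]

    if hits:
        print(f"Over-streaming detected ({len(hits)} matching line(s)); "
              f"consider lowering tokenranges_partitions:")
        for line in hits[:10]:  # show at most the first ten matches
            print(" ", line)
    else:
        print("No over-streaming messages found.")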