Tuning Repair Service for multi-datacenter environments
When running the Repair Service on a multi-datacenter cluster, consider two factors: the total number of repair tasks and over-streaming.
A single repair task is made up of at least six network requests between any two peers. Reducing the total number of repair tasks can drastically reduce network overhead and the time to complete a full Repair Service cycle. The number of repair tasks is controlled by how many partitions are targeted for each subrange. If there are more partitions in a subrange, each subrange is larger, which means fewer total subranges. The tokenranges_partitions property controls the targeted partition count.
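The relationship between the targeted partition count and the number of repair tasks can be sketched as simple arithmetic. This is a rough illustration only: it assumes partitions are spread evenly and that each subrange maps to exactly one repair task. The function names are hypothetical and are not part of the Repair Service.

```python
# Lower bound taken from the text: at least six network requests per repair task.
MIN_REQUESTS_PER_TASK = 6


def estimate_repair_tasks(total_partitions: int, tokenranges_partitions: int) -> int:
    """Estimate how many subranges (one repair task each) cover a table.

    Assumes evenly distributed partitions; real subrange sizes will vary.
    """
    if total_partitions < 0:
        raise ValueError("total_partitions must be non-negative")
    if tokenranges_partitions <= 0:
        raise ValueError("tokenranges_partitions must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (total_partitions + tokenranges_partitions - 1) // tokenranges_partitions


def estimate_min_requests(total_partitions: int, tokenranges_partitions: int) -> int:
    """Lower bound on network requests between one pair of peers."""
    tasks = estimate_repair_tasks(total_partitions, tokenranges_partitions)
    return tasks * MIN_REQUESTS_PER_TASK


if __name__ == "__main__":
    # Hypothetical table size, chosen only to compare two settings.
    partitions = 1_000_000_000
    for target in (262_144, 1_048_576):
        print(target,
              estimate_repair_tasks(partitions, target),
              estimate_min_requests(partitions, target))
```

Under these assumptions, quadrupling the targeted partition count cuts the estimated task count, and therefore the minimum request count, to roughly a quarter.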
Over-streaming occurs when a repaired subrange contains more partitions than a Merkle tree of the maximum allowed depth can cover.
This occurs if tokenranges_partitions is set too high, that is, higher than the default of 1048576, which is 2^20.
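The link between partition count and tree depth can be illustrated with a short calculation. This is a sketch under a stated assumption: that the depth required for a subrange is the smallest d with 2^d greater than or equal to the partition count. The maximum depth of 20 is taken from the log message quoted later in this section; the function names are hypothetical.

```python
# Maximum depth as cited in the log message in this section.
MAX_MERKLE_DEPTH = 20


def required_merkle_depth(partitions: int) -> int:
    """Smallest depth d such that 2**d >= partitions (assumed model)."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (partitions - 1).bit_length()


def may_over_stream(partitions: int, max_depth: int = MAX_MERKLE_DEPTH) -> bool:
    """True if the subrange would need a deeper tree than is allowed."""
    return required_merkle_depth(partitions) > max_depth


if __name__ == "__main__":
    for count in (1_048_576, 1_048_577, 4_000_000):
        print(count, required_merkle_depth(count), may_over_stream(count))
```

In this model, a subrange of exactly 1048576 partitions fits in a tree of depth 20, and any larger subrange does not.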
Test the tuning on the cluster prior to production. Look for the total number of repair tasks, average repair task time, and impact on cluster performance.
If single repair tasks take longer than 20-30 minutes and a full Repair Service cycle is within gc_grace_seconds, halve the
To check for over-streaming, ensure the following line does not exist in
Range X with Y partitions require a merkle tree with depth Z but the maximum allowed depth for this range is 20.
X, Y, and Z are variables.
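The check for this message can be automated with a small script. The following is a minimal sketch: it assumes Y and Z appear as plain integers and that X contains no line breaks. The function name and the sample line are made up for illustration, and the log file to scan must be supplied by the caller.

```python
import re
from typing import Iterable, List, Tuple

# Pattern built from the message text above; X is captured as free text,
# Y (partitions) and Z (depth) are assumed to be plain integers.
OVER_STREAM_PATTERN = re.compile(
    r"Range (?P<range>.+?) with (?P<partitions>\d+) partitions "
    r"require a merkle tree with depth (?P<depth>\d+) "
    r"but the maximum allowed depth for this range is (?P<max_depth>\d+)"
)


def find_over_streaming(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (range, partitions, required_depth) for each matching line."""
    hits = []
    for line in lines:
        match = OVER_STREAM_PATTERN.search(line)
        if match:
            hits.append((
                match.group("range"),
                int(match.group("partitions")),
                int(match.group("depth")),
            ))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python check_over_streaming.py <path-to-log-file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for token_range, partitions, depth in find_over_streaming(log):
            print(f"over-streaming risk: {token_range} "
                  f"({partitions} partitions, depth {depth})")
```

An empty result means the message was not found in the lines scanned.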