Distributed subrange overview

Summary of distributed subrange repair behavior and its differences from subrange repairs.

opscenterd.conf

The location of the opscenterd.conf file depends on the type of installation:
  • Package installations: /etc/opscenter/opscenterd.conf
  • Tarball installations: install_location/conf/opscenterd.conf

Distributed subrange repair is an alternative implementation of subrange repairs within the OpsCenter Repair Service, intended to scale for large clusters.

Note: The distributed subrange repair feature is available from OpsCenter 6.5.3.

With subrange repairs, metadata about cluster token ranges must be retrieved and processed, which requires splitting the ranges into appropriate subranges prior to starting repair operations. The amount of metadata to process is proportional to the size of the cluster. Therefore, for large clusters, the OpsCenter daemon (opscenterd) process can become a bottleneck when attempting to process a sizeable amount of metadata.
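To make the planning step concrete, the following is a minimal sketch of splitting one token range into contiguous subranges. It is an illustration only, not the algorithm OpsCenter uses: the function name split_token_range, the evenly sized split, and the lack of wrap-around handling are all assumptions made for this example.

```python
def split_token_range(start, end, parts):
    """Split the token range (start, end] into `parts` contiguous subranges.

    Hypothetical helper for illustration; not part of OpsCenter.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if end <= start:
        raise ValueError("this sketch does not handle wrap-around ranges")
    width = end - start
    if parts > width:
        raise ValueError("more parts requested than tokens in the range")
    # Compute integer boundaries; the final boundary equals `end`,
    # so the subranges cover the whole range with no gaps or overlap.
    bounds = [start + (width * i) // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(parts)]


subranges = split_token_range(0, 100, 4)
print(subranges)
```

Because every range in the cluster has to be split this way before a subrange repair starts, the planning work grows with the number of nodes and token ranges.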

With distributed subrange repairs, much of the metadata processing and repair planning is moved to the DataStax agents, relieving opscenterd and allowing the repair service to scale better for large clusters. In this method of repair, opscenterd instructs the DataStax agents to repair a list of one or more entire keyspaces, and the agents repair the keyspaces on a per-subrange basis.

You must opt into using distributed subrange repairs by setting enable_distributed_subrange_repair to True in the opscenterd.conf configuration file. See Enabling distributed subrange repairs.

Important: After changing the value for enable_distributed_subrange_repair, restart OpsCenter for the changes to take effect.
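As a quick sanity check before restarting, a short script can confirm that the option is set. The sketch below parses configuration text with the Python standard library. It assumes that opscenterd.conf is INI-style and that the option lives in a [repair_service] section; neither detail is stated on this page, so confirm the section name against Enabling distributed subrange repairs. The helper name distributed_subrange_enabled is hypothetical.

```python
import configparser

# Sample configuration text. The [repair_service] section name is an
# assumption for this sketch; verify it for your OpsCenter version.
SAMPLE_CONF = """
[repair_service]
enable_distributed_subrange_repair = True
"""


def distributed_subrange_enabled(conf_text):
    """Return True if the opt-in flag is present and set to a true value."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback=False covers both a missing section and a missing option.
    return parser.getboolean(
        "repair_service",
        "enable_distributed_subrange_repair",
        fallback=False,
    )


print(distributed_subrange_enabled(SAMPLE_CONF))
```

To check a real installation, read the contents of the opscenterd.conf file at the location listed at the top of this page and pass the text to the function.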

Differences from subrange repairs

Distributed subrange repair is designed to run as fast as possible to expedite repair of large clusters. Because distributed subrange repair is optimized for performance, it cannot be tuned in the way that subrange repairs can. For example, the time_to_completion_target_percentage parameter has no effect on distributed subrange repairs.

However, distributed subrange repair will honor the min_repair_time property to provide a limited amount of throttling. Each DataStax agent ensures that the individual JMX repair operations will not occur more frequently than the time set by this property.
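The throttling behavior can be pictured as a minimum-interval gate in front of each repair call. The following sketch is an assumption-laden illustration, not the agent's implementation: the class name MinIntervalThrottle is hypothetical, the interval is assumed to be expressed in seconds, and the spacing is measured between the start times of consecutive operations.

```python
import time


class MinIntervalThrottle:
    """Ensure consecutive operations start at least `min_interval` apart.

    Hypothetical illustration of min_repair_time-style throttling.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # assumed to be in seconds
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._last_start = None

    def wait(self):
        """Block until the next operation is allowed to start."""
        now = self._clock()
        if self._last_start is not None:
            remaining = self.min_interval - (now - self._last_start)
            if remaining > 0:
                # The previous operation started too recently; pad the gap.
                self._sleep(remaining)
                now = self._clock()
        self._last_start = now
```

A caller would invoke wait() immediately before issuing each repair operation. Operations that already take longer than the interval are not delayed at all, which is why this provides only limited throttling.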

Limitations

Because of the distributed and dynamic nature of distributed subrange repair, the OpsCenter Repair Service is unable to provide a precise estimated time to completion for a running distributed subrange repair job. This limitation exists because distributed subrange repair does not process all token subrange metadata at the beginning of the job, as subrange repair does. Instead, each DataStax agent processes its own subset of that metadata dynamically, as necessary.

The benefit of this processing method is that it reduces latency and eliminates the error handling challenges that can stem from processing all metadata at the beginning of the job. The drawback is the inability to calculate estimates for the data and time remaining in the repair process.

In OpsCenter, the distributed repair progress bar gives a rough indication of how far data synchronization has progressed, but without the fine-grained measures of progress (time and bytes remaining) available in subrange repairs.

Distributed subrange repair status, showing number of JMX repair calls completed