Estimating remaining repair time

The Repair Service updates its estimate of the time remaining to complete a repair cycle without stopping the repair in progress.

opscenterd.log

The location of the opscenterd.log file depends on the type of installation:
  • Package installations: /var/log/opscenter/opscenterd.log
  • Tarball installations: install_location/log/opscenterd.log

If the Repair Service anticipates that, at the current throughput, it cannot complete a repair cycle within the configured time to completion, it displays a warning message and a revised estimate of the time remaining to complete the repair cycle. The Repair Service does not adjust the configured time to completion; it reports the revised estimate without stopping the repair in progress.
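The following minimal sketch illustrates the kind of check described above: it projects the time remaining from the progress observed so far and compares the projected total against the configured time to completion. The function names, their parameters, and the linear projection are assumptions made for illustration only; they are not the Repair Service's actual algorithm.

```python
def estimate_remaining_seconds(done, total, elapsed_seconds):
    """Project remaining time, assuming the average pace so far continues.

    done, total: completed and total units of work (hypothetical, e.g. subranges).
    elapsed_seconds: time spent on the current repair cycle so far.
    """
    if done <= 0:
        raise ValueError("at least one unit of work must be complete to estimate")
    # Average seconds per completed unit, applied to the units still pending.
    return (total - done) * elapsed_seconds / done


def will_miss_deadline(done, total, elapsed_seconds, time_to_completion):
    """Return True if the projected total time exceeds time_to_completion."""
    remaining = estimate_remaining_seconds(done, total, elapsed_seconds)
    return elapsed_seconds + remaining > time_to_completion
```

For example, with a quarter of the work done after 3 days and a time to completion of 9 days, the projection is 9 more days of repair, so the cycle would overrun and a warning with the revised estimate would be expected.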

When the Repair Service estimates that it will not finish a repair cycle within the configured time_to_completion, it triggers an ALERT in the OpsCenter Event Log, which is available in the Activities section of the OpsCenter UI. The alert is also visible in opscenterd.log. If email alerts or post-url alert notifications are configured, the alert is also emailed or posted.

The error_logging_window configuration property controls how often the message is logged and how often the alert fires while the Repair Service continues to estimate that it will not finish a repair in time.
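The effect of such a window can be sketched as a simple throttle that lets one alert through and then suppresses repeats until the window has elapsed. The AlertThrottle class below is a hypothetical helper written for illustration, not OpsCenter code, and the 86400-second window is only an example value.

```python
class AlertThrottle:
    """Allow at most one alert per window (hypothetical helper)."""

    def __init__(self, error_logging_window):
        self.window = error_logging_window  # window length in seconds
        self.last_fired = None              # time of the most recent alert

    def should_fire(self, now):
        """Return True if an alert may fire at time `now` (in seconds)."""
        if self.last_fired is None or now - self.last_fired >= self.window:
            self.last_fired = now
            return True
        return False


throttle = AlertThrottle(error_logging_window=86400)  # example value
first = throttle.should_fire(0)       # first overrun estimate: alert fires
repeat = throttle.should_fire(3600)   # one hour later: suppressed
later = throttle.should_fire(86400)   # a full window later: alert fires again
```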

Parameters

The time_to_completion parameter is the maximum amount of time allotted to repair the entire cluster once.

Note: Typically, set the Time to Completion to a value lower than the lowest gc_grace_seconds (grace seconds before garbage collection) setting on your tables. The default for gc_grace_seconds is 10 days (864000 seconds). OpsCenter provides an estimate by checking gc_grace_seconds across all tables and calculating 90% of the lowest value. With the default gc_grace_seconds of 10 days, the estimated time to completion is 9 days. For more information about configuring grace seconds, see gc_grace_seconds in the CQL documentation.
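The arithmetic in the note can be checked with a short sketch. The function name and the table names are hypothetical; only the 90% rule and the 864000-second default come from the note above.

```python
SECONDS_PER_DAY = 86400


def suggested_time_to_completion(gc_grace_by_table, percent=90):
    """Return `percent` of the lowest gc_grace_seconds across all tables."""
    lowest = min(gc_grace_by_table.values())
    return lowest * percent // 100  # integer arithmetic avoids float rounding


# Hypothetical tables, all using the default gc_grace_seconds of 10 days.
defaults = {"ks.users": 864000, "ks.events": 864000}
estimate = suggested_time_to_completion(defaults)
estimate_days = estimate / SECONDS_PER_DAY  # 777600 seconds is 9 days

# If one table uses a lower value, the estimate follows that table.
mixed = {"ks.users": 864000, "ks.sessions": 432000}
mixed_estimate = suggested_time_to_completion(mixed)  # 388800 seconds
```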

The Repair Service might run multiple subrange repairs in parallel, but runs as few as needed to complete within the amount of time specified. The Repair Service always avoids running more than one repair within a single replica set; there is no overlap in repairs between replica sets.
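One way to picture "as few as needed" is to compute the smallest number of concurrent repairs whose combined pace fits the time allowed, capped by the number of replica sets that do not overlap. The sketch below rests on simplifying assumptions (a fixed average duration per subrange and a known count of disjoint replica sets), and every name in it is hypothetical; it does not describe how the Repair Service actually schedules work.

```python
import math


def min_parallel_repairs(subranges, avg_seconds_per_subrange,
                         time_to_completion, disjoint_replica_sets):
    """Smallest parallelism that fits the deadline, capped by replica sets."""
    total_work = subranges * avg_seconds_per_subrange  # serial repair time
    needed = max(1, math.ceil(total_work / time_to_completion))
    # Never run more than one repair per replica set at a time.
    return min(needed, disjoint_replica_sets)


# 10000 subranges at 120 s each is 1200000 s of serial work; with 777600 s
# (9 days) available, one repair at a time is too slow, but two suffice.
parallel = min_parallel_repairs(10000, 120, 777600, disjoint_replica_sets=4)
```

If the computed need exceeds the cap, the cycle cannot finish in time under these assumptions, which corresponds to the case where a revised estimate and an alert would be expected.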