Viewing repair status
To access the Repair Status details:
Turn on the Repair Service. Doing so immediately activates the service and opens the Status tab.
If you are elsewhere in the monitoring application and the Repair Service is already activated, click Details for the Repair Service. The Repair Status tab displays full details for repair processes.
The information needed to understand all aspects of repair status is available from within the Repair Status tab. Hover over areas of the Status page to view inline information. Click tooltip icons to access short descriptions of an item. Click the Read more links to access the relevant Repair Service documentation.
Monitor the progress of incremental and subrange repairs in the Status tab. After turning on the Repair Service, the Repair Service Status is either Active or Paused:
When the Repair Service is actively processing repairs, the Repair Service Status indicates Active. The progress graphics and statistics reflect real-time measurements of repairs.
The Repair Service Status appears Paused in response to cluster or schema change events.
The repair process performs validation compaction and streams data to and from other nodes in the cluster when synchronizing replicas. While those activities are in progress, they are visible in their respective panes.
The progress and statistics pane displays progress bars for subrange and incremental repairs. A pie chart represents Completed, In Progress, and Failed repair tasks thus far. Remaining tasks are not represented in the pie chart; they are represented in the progress bars. The remaining time until the incremental and subrange repairs are completed is indicated underneath each respective progress bar.
The Total Repairs value shows the number of repair tasks and their percentage of the grand total of repair tasks for the current repair cycle. The count for each category can be an aggregate of the tasks shown in the Table Repair Tasks pane. The number of repair tasks in a particular category might not equal the total number of tasks displayed in the Table Repair Tasks pane because multiple tables might be aggregated into a single repair task. In the Table Repair Tasks pane, a repair task is displayed and counted in the row of every table within the range of that task.
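The difference between the task count and the per-table rows can be illustrated with a small sketch. The task structure and table names below are hypothetical and are not the Repair Service's internal representation; the sketch only shows why counting a task once per covered table yields a larger number than the number of tasks.

```python
from collections import Counter

# Hypothetical repair tasks: each covers one token range and one or more tables.
repair_tasks = [
    {"range": (0, 100), "tables": ["ks.users", "ks.orders"]},
    {"range": (100, 200), "tables": ["ks.users"]},
    {"range": (200, 300), "tables": ["ks.users", "ks.orders", "ks.events"]},
]

# Each task is counted once toward the total for the repair cycle.
total_tasks = len(repair_tasks)

# In a per-table view, the same task is counted in the row of every table it covers.
tasks_per_table = Counter(
    table for task in repair_tasks for table in task["tables"]
)

# Summing the per-table rows therefore overcounts relative to the task total.
sum_of_rows = sum(tasks_per_table.values())
```

In this sketch there are three tasks, but the per-table rows add up to six because two of the tasks span more than one table.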
The Validation Compactions pane displays the progress of any validation compactions per node for both incremental and subrange repairs. In the absence of compaction activity, the No active validation compactions status is displayed.
If repairs are configured for Running validation compaction sequentially, compaction progress is considerably slower, impacting both subrange and incremental repairs.
A validation compaction reads and generates a hash for every row in the stored tables, adds the result to a Merkle tree, and returns the tree to the initiating node as part of the underlying Merkle tree comparison process.
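The hash-and-compare idea behind a validation compaction can be sketched as follows. This is a simplified illustration, not the database's implementation: it assumes SHA-256, represents rows as byte strings, and builds a plain binary tree over the row hashes, whereas the real Merkle trees are organized by token range. The function names are hypothetical.

```python
import hashlib


def row_hash(row: bytes) -> bytes:
    """Hash a single row (assumption: rows are already serialized to bytes)."""
    return hashlib.sha256(row).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Combine leaf hashes pairwise until a single root hash remains."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Duplicate the last hash so every node has a sibling.
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def replica_root(rows: list[bytes]) -> bytes:
    """Hash every row and fold the hashes into one tree root."""
    return merkle_root([row_hash(r) for r in rows])


# Two replicas of the same data; the second holds one stale row.
replica_a = [b"k1:v1", b"k2:v2", b"k3:v3"]
replica_b = [b"k1:v1", b"k2:STALE", b"k3:v3"]

in_sync = replica_root(replica_a) == replica_root(list(replica_a))
needs_repair = replica_root(replica_a) != replica_root(replica_b)
```

Comparing roots, rather than shipping every row, is what lets the initiating node detect that replicas differ before any data is streamed.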
The Streams pane displays an aggregate of streaming activity progress per node. A stream can comprise hundreds of files. When data is actively streaming, the nodes from which the streams originate and their target nodes are shown along with progress bars for each node receiving streamed replica data. Otherwise, the No active streams status is displayed.
The Table Repair Tasks pane provides insight into the keyspace tables that are being repaired (or not, if excluded), a status summary, the repair attempts for skipped tasks, the type of repair, and the average repair time. To discover more:
Each column is sortable. Click a column heading to sort its column contents. The Status column provides visual status indicators along with a summary of completed, running, or pending repair tasks. Any task with errors displays a red exclamation point.
The Total Attempts column indicates how many attempts (retries) the Repair Service has made before temporarily skipping the task. The skipped task is added to the end of the queue to retry later. The default maximum is 10 attempts. When that maximum is reached, an alert is fired and the Repair Service abandons any further repair attempts for that task. In the above graphic, 0/10 indicates all repair tasks completed without the need for any retry attempts. Configure the maximum attempts with the single_task_err_threshold option.
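The skip-and-retry behavior described above can be modeled with a short sketch. This is an illustration under assumptions, not the Repair Service's code: the function names, the task identifiers, and the exact point at which the attempt counter is compared to the maximum are hypothetical. Only the default of 10 and the option name single_task_err_threshold come from the text.

```python
from collections import deque

# Mirrors the documented default for single_task_err_threshold.
MAX_ATTEMPTS = 10


def run_repair_queue(tasks, try_repair, max_attempts=MAX_ATTEMPTS):
    """Process tasks in order; a failed task goes to the end of the queue.

    Returns (completed, abandoned), where completed maps each task to the
    number of failed attempts it needed before succeeding.
    """
    queue = deque((task, 0) for task in tasks)
    completed = {}
    abandoned = []
    while queue:
        task, attempts = queue.popleft()
        if try_repair(task):
            completed[task] = attempts
            continue
        attempts += 1
        if attempts >= max_attempts:
            # In the real service an alert is fired at this point.
            abandoned.append(task)
        else:
            # Skip for now and retry after the remaining tasks.
            queue.append((task, attempts))
    return completed, abandoned


# Hypothetical tasks: t1 succeeds at once, t2 fails twice, t3 always fails.
failures_left = {"ks.t1": 0, "ks.t2": 2}


def flaky_repair(task):
    if task == "ks.t3":
        return False
    if failures_left[task] > 0:
        failures_left[task] -= 1
        return False
    return True


completed, abandoned = run_repair_queue(["ks.t1", "ks.t2", "ks.t3"], flaky_repair)
```

In this model a task that completes on the first try records 0 attempts, which corresponds to a 0/10 reading in the Total Attempts column.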
Incremental repair tables are opted in for repair as mentioned in the incremental repairs overview.
There are a few OpsCenter keyspace tables that are hard-coded for incremental repairs:
The incremental tooltip flags these as special tables and provides a link to documentation to configure additional tables or datacenters to include in incremental repairs.
Any tables configured by OpsCenter admins appear in the tasks pane without the tooltip.
Incremental repairs have their own threshold setting for alerting about failed repair tasks.
The default is
Configure with the incremental_err_alert_threshold option.
Observe the progress of incremental repairs using the SSTable repaired metrics available in the dashboard graphs. See Tracking repaired SSTables for incremental repairs.
Excluding keyspaces and tables from unnecessary repairs makes repair processes more focused, efficient, and faster with less workload impact on DSE clusters.
A link is available above the Table Repair Tasks pane for viewing keyspaces and tables excluded from subrange repairs. Click the View excluded tables link. The Excluded Keyspaces and Tables dialog displays the keyspaces excluded due to RF=1, system keyspaces, reserved tables, or those specifically configured for subrange repairs to ignore.
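The exclusion reasons listed in the dialog can be thought of as a sequence of checks. The sketch below is a hypothetical model only: the Table class, the exclusion_reason function, the order of the checks, and the sample keyspace names are assumptions made for illustration and do not reflect how OpsCenter evaluates exclusions internally.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset only; the actual list of system keyspaces is not given here.
SYSTEM_KEYSPACES = frozenset({"system", "system_schema", "system_traces"})


@dataclass(frozen=True)
class Table:
    keyspace: str
    name: str
    replication_factor: int


def exclusion_reason(
    table: Table,
    reserved: frozenset = frozenset(),
    configured_ignores: frozenset = frozenset(),
) -> Optional[str]:
    """Return why a table is excluded from subrange repairs, or None."""
    qualified = f"{table.keyspace}.{table.name}"
    if table.keyspace in SYSTEM_KEYSPACES:
        return "system keyspace"
    if table.replication_factor == 1:
        # With a single replica there is nothing to synchronize.
        return "RF=1"
    if qualified in reserved:
        return "reserved table"
    if table.keyspace in configured_ignores or qualified in configured_ignores:
        return "configured to ignore"
    return None


tables = [
    Table("system", "local", 1),
    Table("scratch", "tmp", 1),
    Table("app", "users", 3),
    Table("app", "audit_log", 3),
]
reasons = {
    f"{t.keyspace}.{t.name}": exclusion_reason(
        t, configured_ignores=frozenset({"app.audit_log"})
    )
    for t in tables
}
```

A table for which the function returns None remains eligible for subrange repair in this model.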
If using authentication, be sure to change the replication strategy and replication factor for the
Click any row of the Table Repair Tasks pane to view more details about a particular task.
The Repair Tasks for keyspace.table dialog provides details for the number of Succeeded, Failed, Running, Pending, and Aborted repair tasks for each repair-eligible table in a keyspace. The Average Repair Time and number of Attempts (configurable) for the repair task are also shown.