Automatic failover overview

Automatic failover from the primary DSE OpsCenter instance to the backup OpsCenter instance provides high availability without any manual intervention or downtime.

opscenterd.conf

The location of the opscenterd.conf file depends on the type of installation:
  • Package installations: /etc/opscenter/opscenterd.conf
  • Tarball installations: install_location/conf/opscenterd.conf

address.yaml

The location of the address.yaml file depends on the type of installation:
  • Package installations: /var/lib/datastax-agent/conf/address.yaml
  • Tarball installations: install_location/conf/address.yaml
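
The two location rules above can be captured in a small helper for scripts that need to find these files. This is only an illustrative sketch: the function name config_paths and its parameters are hypothetical, and it assumes that install_location may differ between OpsCenter and the agent on a tarball installation.

```python
from pathlib import PurePosixPath


def config_paths(install_type, opscenter_location=None, agent_location=None):
    """Return the documented locations of opscenterd.conf and address.yaml.

    install_type is "package" or "tarball". For tarball installations the
    caller supplies the install_location of OpsCenter and of the agent
    (hypothetical parameters; they stand in for install_location in the text).
    """
    if install_type == "package":
        return {
            "opscenterd.conf": PurePosixPath("/etc/opscenter/opscenterd.conf"),
            "address.yaml": PurePosixPath("/var/lib/datastax-agent/conf/address.yaml"),
        }
    if install_type == "tarball":
        if opscenter_location is None or agent_location is None:
            raise ValueError("tarball installations require both install locations")
        return {
            "opscenterd.conf": PurePosixPath(opscenter_location) / "conf" / "opscenterd.conf",
            "address.yaml": PurePosixPath(agent_location) / "conf" / "address.yaml",
        }
    raise ValueError(f"unknown install type: {install_type!r}")
```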

Automatic failover provides continuous high availability of OpsCenter for managing mission-critical data on DataStax Enterprise (DSE) clusters without manual intervention or downtime.

OpsCenter currently supports one backup instance per primary instance in an active-passive configuration. The OpsCenter Failover Enabled Best Practice Rule recommends enabling failover. When no backup is configured, the rule fails and sends an alert. After you enable failover, the best practice rule passes the next time it runs, provided it detects a correctly configured backup OpsCenter instance.
Note: If a non-DataStax Enterprise cluster (such as DataStax Community or open source Cassandra) is added after enabling automatic failover, OpsCenter fires an alert that automatic failover will not work, and the backup OpsCenter instance shuts down.

Failover behavior

The primary and backup OpsCenter instances send and listen for heartbeat messages on stomp channels to communicate status with each other. The primary OpsCenter sends a heartbeat message regardless of whether a backup OpsCenter instance is configured. The primary OpsCenter instance listens for messages from the heartbeat reply stomp channel to determine if a backup instance is configured. The primary_opscenter_location configuration file you create on the backup OpsCenter instance contains the IP address of the primary OpsCenter instance that the backup OpsCenter instance monitors.

The configured backup OpsCenter instance listens for heartbeat messages from the primary OpsCenter instance to determine whether the primary OpsCenter instance is up. If the backup OpsCenter instance detects no heartbeat from the primary OpsCenter instance during the configured window (60 seconds by default), the backup OpsCenter instance initiates the failover process and assumes the responsibilities of the primary OpsCenter instance. As part of failover, the backup OpsCenter instance reconfigures the agents by changing stomp_interface in address.yaml so that they connect to the backup instance instead of the failed primary instance.
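
The timeout logic described above can be illustrated with a minimal sketch. It is not OpsCenter's implementation: the HeartbeatMonitor class and its method names are hypothetical, and the only detail taken from the text is the 60-second default window.

```python
import time


class HeartbeatMonitor:
    """Tracks the most recent heartbeat and reports when the window has elapsed."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        # 60 seconds mirrors the default window described in the text.
        self.window_seconds = window_seconds
        self._clock = clock
        self._last_seen = clock()

    def record_heartbeat(self):
        """Call whenever a heartbeat message arrives from the primary."""
        self._last_seen = self._clock()

    def should_fail_over(self):
        """True once no heartbeat has been seen for longer than the window."""
        return self._clock() - self._last_seen > self.window_seconds
```

Passing the clock as a parameter keeps the sketch testable without waiting a real minute; a backup process would call record_heartbeat from its message handler and poll should_fail_over periodically.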
Warning: Ensure that address.yaml is not being managed by third-party Configuration Management. During failover, OpsCenter automatically changes stomp_interface in address.yaml to point to the backup OpsCenter instance. If a separate Configuration Management system is managing address.yaml, that change might be undone when the Configuration Management system pushes its next update.
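
One way to notice that an external tool has reverted the change is to check which address stomp_interface currently points to. The following sketch assumes stomp_interface appears as a simple top-level "key: value" line in address.yaml; the function names and the sample addresses are hypothetical.

```python
import re

# Matches an uncommented "stomp_interface: <value>" line, with optional quotes.
_STOMP_RE = re.compile(
    r"""^[ \t]*stomp_interface[ \t]*:[ \t]*["']?([^"'\s#]+)""",
    re.MULTILINE,
)


def stomp_interface_of(address_yaml_text):
    """Return the stomp_interface value found in the text, or None if absent."""
    match = _STOMP_RE.search(address_yaml_text)
    return match.group(1) if match else None


def has_drifted(address_yaml_text, expected_address):
    """True if the file no longer points at the expected OpsCenter instance."""
    return stomp_interface_of(address_yaml_text) != expected_address
```

After a failover, expected_address would be the IP address of the backup instance that took over; a True result suggests that something rewrote the file.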

Failover recovery

After a failover, the former backup OpsCenter instance that took over as primary remains the primary OpsCenter instance. At that point, configure another backup OpsCenter instance by recreating the primary_opscenter_location file on it, so that the file contains the IP address of the new primary instance to monitor. If you are configuring the former primary OpsCenter instance as the new backup instance, ensure that the server is healthy again before restarting it.
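
Recreating the file can be scripted. The sketch below assumes that primary_opscenter_location holds nothing but the IP address of the primary on a single line; the function name is hypothetical, and the file path is left as a parameter because its location is not given here.

```python
import ipaddress
from pathlib import Path


def write_primary_location(path, primary_ip):
    """Validate primary_ip and write it to the primary_opscenter_location file.

    Raises ValueError if primary_ip is not a valid IPv4 or IPv6 address.
    Assumes the file contains only the address on one line.
    """
    address = ipaddress.ip_address(primary_ip)
    target = Path(path)
    target.write_text(f"{address}\n")
    return target
```

Validating the address before writing avoids leaving a backup instance that monitors a mistyped address and therefore never sees a heartbeat.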

Note: If a failover occurred due to a network split, you must manually shut down the former primary OpsCenter instance and configure another backup instance after network connectivity is restored. Upon startup, each OpsCenter instance generates a unique ID (UUID), which is stored in the failover_id file. In the event of a network split, the failover_id uniquely identifies each OpsCenter instance to agents and prevents both OpsCenter instances from running operations post-failover, which could corrupt data. The location of the failover_id file depends on the type of installation and is configurable.
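
The role of the failover_id can be illustrated with a simplified model. How agents actually compare IDs is not described here, so the AgentGuard class below is an assumption made purely for illustration, as are the function and method names; only the "generate a UUID at startup and store it in a file" step comes from the text.

```python
import uuid
from pathlib import Path


def new_failover_id(path):
    """Generate a fresh UUID at startup and store it in the given file."""
    failover_id = str(uuid.uuid4())
    Path(path).write_text(failover_id + "\n")
    return failover_id


class AgentGuard:
    """Hypothetical agent-side check: only the active instance may run operations."""

    def __init__(self, active_failover_id):
        self.active_failover_id = active_failover_id

    def accepts(self, sender_failover_id):
        """True only for the instance the agent currently treats as primary."""
        return sender_failover_id == self.active_failover_id

    def fail_over_to(self, new_failover_id_value):
        """Switch allegiance to the instance that took over."""
        self.active_failover_id = new_failover_id_value
```

In this model, a former primary that is still running on the far side of a network split keeps its old ID, so agents that have failed over ignore its operations.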

Failover aftereffects

After an automatic failover, minimal manual intervention (if any) is required for recovery, depending on the root cause of the failover and what processes were in progress at that time. Generally, the effects of failing over are similar to restarting OpsCenter, with a few notable exceptions:
  • Alerts - Trigger as normal. An exception is an alert firing and unfiring within the failover window, in which case the alert is never triggered.
  • Authentication - Logs out existing user sessions. User sessions do not persist. Users must log in again.
  • Backup - Skips a scheduled backup if it falls within the failover window. Backup does not occur until the next scheduled time.
  • Restore - Continues the restore operation if failover occurred mid-restore. However, the result of the restore cannot be communicated because the backup OpsCenter was unaware the restore transpired.
  • Repair Service - Resumes from the last saved state. Be sure to mirror the repair service directory. An OpsCenter instance failure does not affect repairs currently running on any nodes. New repairs do not start until an automatic failover successfully completes or the OpsCenter instance that failed is brought up again.
  • Provisioning - Provisioning jobs that are in progress when the primary Lifecycle Manager fails attempt to complete on the primary instance and might fail. Lifecycle Manager does not automatically resume jobs on the backup OpsCenter instance, but manually running the job again allows it to proceed to completion.

Troubleshooting failover

In most cases, the backup OpsCenter instance selects the correct IP address for reconfiguring agents after a failover, as described in Failover behavior. If the correct IP address is not automatically selected to update all agents, explicitly set the report_interface property in opscenterd.conf on the backup OpsCenter instance.
Note: This workaround assumes the snitch is not Ec2MultiRegionSnitch.

Failover when upgrading OpsCenter

When failover is configured, there is a recommended process to follow when upgrading OpsCenter. For more information, see Upgrading DSE OpsCenter when failover is enabled.