Automatic failover overview
Automatic failover from the primary DSE OpsCenter instance to the backup OpsCenter instance provides high availability without any manual intervention or downtime.
opscenterd.conf
The location of the opscenterd.conf file depends on the type of installation:
- Package installations: /etc/opscenter/opscenterd.conf
- Tarball installations: install_location/conf/opscenterd.conf
address.yaml
The location of the address.yaml file depends on the type of installation:
- Package installations: /var/lib/datastax-agent/conf/address.yaml
- Tarball installations: install_location/conf/address.yaml
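The two location rules above can be captured in a small lookup helper. This is a minimal sketch, not part of OpsCenter: the function name `config_path` and its parameters are hypothetical, and `install_location` stands for whatever directory a tarball installation was unpacked into.

```python
from pathlib import PurePosixPath

# Package-installation paths exactly as listed above.
_PACKAGE_PATHS = {
    "opscenterd.conf": PurePosixPath("/etc/opscenter/opscenterd.conf"),
    "address.yaml": PurePosixPath("/var/lib/datastax-agent/conf/address.yaml"),
}


def config_path(filename, install_type, install_location=None):
    """Return the expected location of opscenterd.conf or address.yaml.

    Hypothetical helper: install_type is "package" or "tarball", and
    install_location is required only for tarball installations.
    """
    if filename not in _PACKAGE_PATHS:
        raise ValueError(f"unknown configuration file: {filename}")
    if install_type == "package":
        return _PACKAGE_PATHS[filename]
    if install_type == "tarball":
        if install_location is None:
            raise ValueError("tarball installations need install_location")
        # Both files live under install_location/conf for tarball installs.
        return PurePosixPath(install_location) / "conf" / filename
    raise ValueError(f"unknown install type: {install_type}")
```

For a tarball installation, pass the directory that applies to the component in question, because OpsCenter and the agent can be unpacked in different locations.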
Automatic failover provides continuous high availability of OpsCenter for managing mission-critical data on DataStax Enterprise clusters without manual intervention or downtime.
Failover behavior
The primary_opscenter_location configuration file that you create on the backup OpsCenter instance contains the IP address of the primary OpsCenter instance that the backup OpsCenter monitors. The configured backup OpsCenter listens for heartbeat messages from the primary OpsCenter to determine whether the primary OpsCenter is up. If the backup OpsCenter detects no heartbeat from the primary OpsCenter during the configured window (60 seconds by default), the backup OpsCenter initiates the failover process and automatically assumes the responsibilities of the primary OpsCenter. The backup OpsCenter automatically reconfigures the agents by changing stomp_interface in address.yaml so that they connect to the backup instance instead of the failing primary instance.
Because failover changes stomp_interface in address.yaml to point to the backup opscenterd instance, a separate configuration management system that manages address.yaml might undo that change when it pushes its next update.
Failover recovery
After a failover, the former backup OpsCenter that took over as primary remains the primary OpsCenter. At that point, configure another backup OpsCenter by recreating the primary_opscenter_location file that points the new backup instance to the IP address of the primary instance to monitor. If you are configuring the former primary OpsCenter as the new backup instance, ensure the server is healthy again before restarting the server.
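The relationship between the location file and heartbeat monitoring can be illustrated with a short model. This is a sketch under stated assumptions, not OpsCenter's implementation: the class and function names are hypothetical, the file is assumed to hold only the primary's IP address, and the address 10.0.0.5 is an example value. Only the 60-second default window comes from the description of failover behavior.

```python
from dataclasses import dataclass

DEFAULT_WINDOW_SECONDS = 60.0  # default failover window stated in the text


def read_primary_address(file_text):
    """Return the primary's IP address from the location file's contents.

    Assumption: the file holds just the address, possibly with
    surrounding whitespace.
    """
    address = file_text.strip()
    if not address:
        raise ValueError("primary_opscenter_location is empty")
    return address


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of the backup's view of the primary."""
    primary_address: str
    last_heartbeat: float  # time of the last heartbeat, in seconds
    window: float = DEFAULT_WINDOW_SECONDS

    def record_heartbeat(self, now):
        # Each heartbeat from the primary restarts the window.
        self.last_heartbeat = now

    def should_fail_over(self, now):
        # Fail over only once a full window has passed with no heartbeat.
        return now - self.last_heartbeat >= self.window


# Example: the new backup points at the primary and tracks heartbeats.
monitor = HeartbeatMonitor(read_primary_address("10.0.0.5\n"), last_heartbeat=0.0)
monitor.record_heartbeat(30.0)
print(monitor.should_fail_over(80.0))  # 50 s of silence: prints False
print(monitor.should_fail_over(90.0))  # 60 s of silence: prints True
```

Passing the current time in as an argument keeps the model deterministic, which makes the window logic easy to reason about without waiting on a real clock.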
failover_id
In the event of a network split, the failover_id file uniquely identifies each OpsCenter to agents and prevents both OpsCenter machines from running operations post-failover, which could corrupt data. The location of the failover_id file depends on the type of install and is configurable.
Failover aftereffects
- Alerts - Trigger as normal. The exception is an alert that fires and clears within the failover window, in which case the alert is never triggered.
- Authentication - Logs out existing user sessions. User sessions do not persist. Users must log in again.
- Backup - Skips a scheduled backup if it falls within the failover window. Backup does not occur until the next scheduled time.
- Restore - Continues the restore operation if failover occurred mid-restore; however, the result of the restore cannot be communicated because the backup OpsCenter was unaware the restore transpired.
- Repair Service - Resumes from the last saved state. Be sure to mirror the repair service directory. An OpsCenter instance failure does not affect repairs currently running on any nodes. New repairs do not continue until an automatic failover successfully completes or the OpsCenter instance that failed is brought up again.
- Provisioning - Jobs that are in progress when the primary Lifecycle Manager fails attempt to complete on the primary and might fail. Lifecycle Manager does not attempt to automatically resume jobs on the backup OpsCenter, but manually running the job again allows it to proceed to completion.
Troubleshooting failover
In most cases, the backup OpsCenter instance selects the correct IP address for reconfiguring agents after a failover, as described in Failover behavior. If the correct IP address is not being selected automatically to update all agents, explicitly set the report_interface property in opscenterd.conf on the backup OpsCenter instance.
Failover when upgrading OpsCenter
When failover is configured, there is a recommended process to follow when upgrading OpsCenter. For more information, see upgrading OpsCenter when failover is enabled.