DSE OpsCenter 6.5 upgrade considerations

Changes in features, configuration files, metrics, and APIs impacting upgrades to OpsCenter 6.5.

Compact storage no longer supported

Tables in the OpsCenter schema are no longer created with compact storage because this feature for thrift-compatible tables has been removed in DSE version 6.0.

Warning: Before upgrading to DSE 6.0 and attempting to connect to a cluster with OpsCenter, execute the following CQL command for every table in the OpsCenter keyspace.
ALTER TABLE keyspace_name.table_name DROP COMPACT STORAGE;
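Because the statement has to be run once per table, it can be easier to generate the statements than to type them. The following is a minimal sketch that only builds the CQL strings and does not connect to a cluster. The keyspace name "OpsCenter" and the table names are illustrative assumptions; substitute the list your cluster actually reports.

```python
def drop_compact_storage_statements(keyspace, tables):
    """Build one ALTER TABLE ... DROP COMPACT STORAGE statement per table."""
    # Identifiers are double-quoted so that mixed-case names are preserved.
    return [
        f'ALTER TABLE "{keyspace}"."{table}" DROP COMPACT STORAGE;'
        for table in tables
    ]


# Hypothetical keyspace and table names, for illustration only.
statements = drop_compact_storage_statements("OpsCenter", ["events", "settings"])
for statement in statements:
    print(statement)
```

The printed statements can be pasted into cqlsh or saved to a file and reviewed before they are run.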

For more information, see migrating from compact storage (for DSE version 5.1.x clusters managed by OpsCenter) or migrating from compact storage (for DSE version 5.0.x clusters managed by OpsCenter). If OpsCenter was not upgraded to 6.5 before upgrading DSE to 6.0, refer to the instructions in this KB article for a workaround until the issue is fixed.

Note: This issue was fixed in OpsCenter 6.5.1.

For DSE versions earlier than 6.0, the OpsCenter Backup Service checks for tables that have compact storage and warns that they cannot be created during a restore.

New Best Practice Service rule for NodeSync

The following change was made to the Best Practice Service:

The Check that NodeSync is enabled on all nodes rule has been added. The NodeSync service is intended to run on every node in the cluster. If any nodes are not running NodeSync, the replica data segments for those nodes are not validated and synchronized, which could result in data loss. This rule ensures that NodeSync is running on every node. To check whether NodeSync is running on a node, use nodetool nodesyncservice status; to start it, use nodetool nodesyncservice enable. The NodeSync Service status is also visible from the Nodes and Services areas of OpsCenter Monitoring.

See Enabling keyspaces and tables for monitoring NodeSync in OpsCenter for additional details.

New SSL configuration option

A new SSL configuration option, opscenter_ssl_strict_subject_validation, controls whether the OpsCenter SSL agent rejects a certificate whose subject does not match the IP of the server. The default value is false, which means the SSL agent attempts subject validation first. If validation fails, the agent logs a warning and retries the connection without subject validation. If set to true, the SSL agent rejects the certificate without retrying.
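The two modes can be summarized as a small decision function. This is only a model of the behavior described above, not OpsCenter's implementation. The function name and parameters are hypothetical, and it assumes the certificate is otherwise valid.

```python
import logging

log = logging.getLogger("ssl_agent_sketch")


def accept_certificate(subject_matches_ip, strict_subject_validation=False):
    """Model the documented outcome for an otherwise valid certificate."""
    if subject_matches_ip:
        return True  # subject validation succeeded on the first attempt
    if strict_subject_validation:
        return False  # strict mode: reject without retrying
    # Default mode: log a warning, then retry without subject validation.
    log.warning("certificate subject does not match server IP; "
                "retrying without subject validation")
    return True


print(accept_certificate(False))        # default (non-strict) mode
print(accept_certificate(False, True))  # strict mode
```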

Repair Service new subrange repair configuration option for parallel tasks

The parallel_tasks_update_interval configuration option has been added to the Repair Service. The option sets how often the Repair Service recalculates the number of parallel tasks required during a subrange repair cycle. The interval is 120 seconds (2 minutes) by default. For more details, see Setting the maximum for parallel subrange repairs.

Backup Service AWS CLI S3 bulk backups promoted from labs feature to production feature

The AWS CLI feature for bulk uploading S3 backups has been promoted from an OpsCenter Labs feature to a full production feature. The use_s3_cli configuration option has moved from the [labs] section to the [backups] section in all OpsCenter configuration files. If you have the labs feature enabled, move your use_s3_cli setting from the [labs] section to the [backups] section.
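For installations with many configuration files, the move can be scripted. The sketch below uses Python's standard configparser on the file contents as a string. The function name is hypothetical, and the sample value True is an assumption about how the option is set in your file.

```python
import configparser
import io


def move_use_s3_cli(conf_text):
    """Move use_s3_cli from [labs] to [backups]; leave other options alone."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if parser.has_option("labs", "use_s3_cli"):
        value = parser.get("labs", "use_s3_cli")
        parser.remove_option("labs", "use_s3_cli")
        if not parser.has_section("backups"):
            parser.add_section("backups")
        parser.set("backups", "use_s3_cli", value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical file contents, for illustration only.
old_conf = "[labs]\nuse_s3_cli = True\n"
print(move_use_s3_cli(old_conf))
```

Note that configparser discards comments when it rewrites a file, so review the output or edit by hand if the comments in the file matter.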

Backup Service new phased staging configuration option for commit logs

The On Server commit log storage has changed. Commit logs are still initially moved into the backup_staging_dir, but after the commit logs have been sent to any other configured locations, the commit logs are moved to the directory specified by a backup_storage_dir defined in address.yaml. This approach should resolve a number of problems customers have encountered when restarting agents due to large numbers of On Server commit logs being reprocessed. See Configuring commit log backups for details.

New config option for OpsCenter failover URL

The new configuration option override_primary_redirect_url for overriding the default URL and port of the OpsCenter primary instance is available in opscenterd.conf.

Cassandra read request timeout configuration options

The cluster configuration and DataStax agent configuration files have new host read request timeout configuration options for both monitored and storage clusters. All of the values default to nil, which makes the Java driver use its default read timeout of 12 seconds.
Note: The timeouts are per node. If the node selected to do the read operation hits the timeout, the internal retry policy of the Java driver tries the request again.
The new read timeout options for cluster_name.conf:
[cassandra]
host_read_timeout_ms=
          
[storage_cassandra]
host_read_timeout_ms=
The new read timeout options for the DataStax agents in address.yaml:
monitored_dse_host_read_timeout:
storage_dse_host_read_timeout:
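To illustrate the fallback, the sketch below reads host_read_timeout_ms from cluster_name.conf-style text and returns the 12-second driver default when the option is empty or missing. The function name and the sample value of 20000 are hypothetical; the actual fallback happens inside the Java driver, not in client code.

```python
import configparser

JAVA_DRIVER_DEFAULT_READ_TIMEOUT_MS = 12_000  # 12 seconds, as stated above


def effective_read_timeout_ms(conf_text, section):
    """Return the configured host_read_timeout_ms, or the driver default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get(section, "host_read_timeout_ms", fallback="").strip()
    return int(raw) if raw else JAVA_DRIVER_DEFAULT_READ_TIMEOUT_MS


# Illustrative contents: one option left empty, one set explicitly.
sample = (
    "[cassandra]\n"
    "host_read_timeout_ms=\n"
    "\n"
    "[storage_cassandra]\n"
    "host_read_timeout_ms=20000\n"
)
print(effective_read_timeout_ms(sample, "cassandra"))
print(effective_read_timeout_ms(sample, "storage_cassandra"))
```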

Metrics

There are new metrics for NodeSync, new metrics for thread pools, and changes made for dropped messages metrics. For a comprehensive list of metrics available in OpsCenter, refer to the OpsCenter Metrics Tooltips Reference.

NodeSync metrics

Metrics are available for the new NodeSync Service. See NodeSync metrics.

Thread Pool metrics

Many metrics have been added for monitoring thread pools. See Thread Pool (TP) metrics.

Dropped messages metrics

Dropped messages metrics updates include:
  • The TP: Dropped Paged Range Reads and TP: Dropped Request Responses metrics have been removed for DSE 6.0 and later.
  • Several metrics regarding dropped messages have had their labels changed from TP: <message type> to Dropped Messages: <message type>.
  • New dropped messages metrics have been added.

Declarative password management

The password experience for running jobs and managing the password of the cassandra user has been improved in Lifecycle Manager. Rather than requiring the password to be entered every time a job is run, the password is now declared at the cluster level. Credentials need to be entered only once if an associated Config Profile has internal or password authentication enabled (which is the default behavior). The New DSE password field has been removed from the Job dialogs. Password fields have been added to the Add Cluster and Edit Cluster dialogs for changing the cassandra user password. The improved functionality allows changing the password for the cassandra user at any time, or removing the stored password.

Renamed RPC address properties to native transport in LCM UI and API

The RPC address fields in the LCM Add Node and Edit Node dialogs, and the Lifecycle Manager API have been renamed to Native Transport to correspond with the changes for DSE 6.0:
  • The RPC Address field in the LCM UI Add Node and Edit Node dialogs has been renamed to Native Transport (RPC) Address.
  • The Broadcast RPC Address field in the LCM UI Add Node and Edit Node dialogs has been renamed to Native Transport (RPC) Broadcast Address.
  • The rpc-address field in the LCM API has been renamed to native-transport-address.
  • The broadcast-rpc-address field in the LCM API has been renamed to native-transport-broadcast-address.
Note: If using the LCM API directly, update any API clients that reference the renamed fields rpc-address and broadcast-rpc-address.
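As a rough sketch of what updating a client might involve, the helper below renames the two fields in a request payload. The helper name and the sample payload (node name and addresses) are made up for illustration; only the four field names come from the list above.

```python
RENAMED_FIELDS = {
    "rpc-address": "native-transport-address",
    "broadcast-rpc-address": "native-transport-broadcast-address",
}


def rename_node_fields(payload):
    """Return a copy of payload with the renamed LCM node fields applied."""
    return {RENAMED_FIELDS.get(key, key): value for key, value in payload.items()}


# Hypothetical payload written for the old field names.
old_payload = {
    "name": "node0",
    "rpc-address": "10.0.0.1",
    "broadcast-rpc-address": "203.0.113.1",
}
new_payload = rename_node_fields(old_payload)
print(new_payload)
```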

LCM API version updates

The following changes have been made to the Lifecycle Manager API:
  • Base URL version bump from v1 to v2.
  • Stricter validation of the config_profiles API for both behavior and JSON content.
  • Job password parameters: The parameters for changing the cassandra user password have moved from the run job level to the add (or edit) cluster level. LDAP password changes at the run job level have been removed.
  • Multiple endpoints replaced msg with message in API errors.

Base URL version change

The base URL for LCM has changed from /api/v1/lcm/ to /api/v2/lcm/ to reflect the backwards-incompatible API changes in the OpsCenter 6.5 release. All API clients must be updated to use the new base URL.
Note: Unless a behavior change is described below in this section of the upgrade guide, all endpoint URLs will continue to operate at their new /api/v2/lcm/ location exactly as they did previously.
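For clients that store full endpoint URLs, the base path change amounts to a prefix substitution. A minimal sketch follows; the helper name is hypothetical, and the host and port in the example are only placeholders.

```python
OLD_BASE = "/api/v1/lcm/"
NEW_BASE = "/api/v2/lcm/"


def to_v2(url):
    """Rewrite an LCM v1 endpoint URL to its v2 location."""
    # Replace only the first occurrence so the rest of the path is untouched.
    return url.replace(OLD_BASE, NEW_BASE, 1)


print(to_v2("http://127.0.0.1:8888/api/v1/lcm/clusters/"))
```

URLs that do not contain the old base path are returned unchanged.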

More restrictive config_profiles (json validation)

/api/v2/lcm/config_profiles/

While the config_profiles API has not technically changed, it has become considerably stricter about the contents of the JSON field. Requests that previously returned a 200 success code might now fail with a validation error. Many requests that have always been invalid or had undefined results are now rejected upfront at submission time rather than failing or behaving ambiguously later.

When processing POST and PUT requests for config_profiles, the system now verifies the format of the JSON field against the definitions for the relevant DSE version as specified in /api/v2/lcm/definitions/. The following properties are verified against the definitions:
  • Every key-name must be a valid DSE configuration property.
  • Every value-type must match the type specified in the DSE configuration property.
  • Families of fields that have dependencies must be consistent. One cannot disable a parent field (such as client_encryption_options.enabled) and specify a value for a dependent child field (such as client_encryption_options.keystore).
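The third rule can be approximated with a local check before a profile is submitted. The sketch below is not the server-side validation: which fields depend on which parent is determined by the definitions, whereas this hypothetical helper simply assumes that every sibling of the enabled flag is a dependent child. The keystore path is a placeholder.

```python
def dependent_field_conflicts(options, parent="enabled"):
    """List sibling keys that carry a value while the parent flag is disabled."""
    if options.get(parent, False):
        return []  # parent enabled: child values are allowed
    return sorted(
        key for key, value in options.items()
        if key != parent and value is not None
    )


# Parent disabled but a child value supplied: expected to be rejected.
client_encryption_options = {"enabled": False, "keystore": "/path/to/keystore"}
conflicts = dependent_field_conflicts(client_encryption_options)
print(conflicts)
```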

Change password parameters

/api/v2/lcm/actions/install/

The change-default-cassandra-password and cassandra-ldap-password parameters are no longer valid. Supplying these parameters at run job time made many edge cases impossible to detect. Corresponding parameters have been added to the cluster model at the /api/v2/lcm/clusters/ endpoint, where they can be persisted across jobs and facilitate more effective error handling.

Consistent error message fields

The msg field has been removed from all API errors and replaced with the message field. Previously the two fields were used inconsistently. Any API clients that expect to process a msg field must be updated to look for the message field instead.
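A client that has to work against both older and upgraded servers during a transition might read the new field first and fall back to the old one. The sketch below assumes the error body is a JSON object with the field at the top level; that shape and the helper name are assumptions, not something specified in this guide.

```python
import json


def error_text(body):
    """Extract the error text, preferring message over the legacy msg field."""
    error = json.loads(body)
    return error.get("message", error.get("msg"))


print(error_text('{"message": "validation failed"}'))
```

Once every server is on OpsCenter 6.5 or later, the fallback to msg can be dropped.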

OpsCenter and DataStax Agent API dynamic updates for agent log level

The agent log level (debug, info, warn, error) can now be set dynamically through the OpsCenter daemon by using the update logging level API method:
PUT /cluster_id/log/level/log_level

The log level update is not persisted to the log4j.properties configuration or the logback.xml configuration file. Restarting opscenterd or the DataStax Agent returns the agent log level to its original configuration value.

Tip: Update the log level for all agents across a cluster using a cURL command:
curl -X PUT http://127.0.0.1:8888/Test_Cluster/log/level/debug

The response body contains the IP addresses of the nodes whose agent log levels were updated and of the nodes that were skipped.
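The same request can be built from a script. The sketch below uses only the Python standard library and constructs the PUT request without sending it. The helper name is hypothetical, and the address, port, and cluster name are taken from the cURL example above.

```python
import urllib.request

VALID_LEVELS = ("debug", "info", "warn", "error")


def log_level_request(base_url, cluster_id, level):
    """Build the PUT request that sets the agent log level for a cluster."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    url = f"{base_url}/{cluster_id}/log/level/{level}"
    return urllib.request.Request(url, method="PUT")


request = log_level_request("http://127.0.0.1:8888", "Test_Cluster", "debug")
print(request.get_method(), request.full_url)
# To send it against a running opscenterd: urllib.request.urlopen(request)
```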

Note: The OpsCenter API version remains at v1 (in contrast with LCM v2).