Planning your upgrade

Before attempting to upgrade Apache Cassandra® in production, it’s important that you create an upgrade plan that’s tailored to your environment.

Review the sections on this page to begin formulating your upgrade plan.

Confirm your upgrade path

The first step in creating an upgrade plan is to confirm that the version of Cassandra you’re currently running can be upgraded to your target version.

Cassandra can be upgraded by up to one major version increment in a single upgrade operation. For example, any release in the 3.x line can be upgraded to any release in the 4.x line. Upgrades to non-adjacent major versions are not supported. To upgrade by more than one major version increment, you’ll need to upgrade to an intermediate major version before you can upgrade to your desired version.

When upgrading to a new major version, you should first upgrade the cluster to the latest patch release on your current version. Fixes included in the latest patch release can often help mitigate issues with the upgrade process.
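
As a rough illustration of the one-major-version rule, the following Python sketch lists the major versions an upgrade would need to pass through. The function names are hypothetical, and the sketch only looks at the major component of the version string, so it is no substitute for checking Supported upgrade paths.

```python
def parse_major(version: str) -> int:
    """Return the major component of a version string such as '3.11.16'."""
    return int(version.split(".")[0])


def plan_major_hops(current: str, target: str) -> list[int]:
    """List the major versions to upgrade to, one major increment per hop.

    An empty list means the target is on the same major line as the
    current version. Downgrades are outside the scope of this sketch.
    """
    start, end = parse_major(current), parse_major(target)
    if end < start:
        raise ValueError(f"{target} is older than {current}")
    return list(range(start + 1, end + 1))


if __name__ == "__main__":
    print(plan_major_hops("3.11.16", "4.1.4"))  # [4]: a single, adjacent hop
    print(plan_major_hops("2.2.19", "4.1.4"))   # [3, 4]: needs an intermediate 3.x step
```

A result with more than one entry means the plan needs an intermediate upgrade, with the cluster brought to the latest patch release before each hop.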

For more information, see Supported upgrade paths.

Familiarize yourself with the new version

You should familiarize yourself with the features and changes in the new version of Cassandra you’re upgrading to. Configuration changes in the new version may require you to make certain accommodations in your upgrade plan.

Cassandra 4.x

Important configuration changes in Apache Cassandra 4.x

Review the release notes

Make sure to review the release notes (NEWS.txt file) for each version of Cassandra in your upgrade path, all the way back to your current version. You should check whether there are any reported known issues that may affect your cluster during (and after) upgrade.

Cassandra release notes:

You may also choose to review the Cassandra changelogs (CHANGES.txt) which list all of the tickets that have been incorporated into each release.

Cassandra changelogs:

Review the prerequisites for upgrade

Before upgrading Cassandra to a new version, you should first make sure your cluster satisfies all of the appropriate prerequisites. Your upgrade plan should incorporate any necessary steps that are needed to meet the required prerequisites for your cluster and environment.

Plan the upgrade order

Upgrade order matters. Nodes need to be upgraded in logical order according to the rack and datacenter configuration of your Cassandra cluster. Your upgrade plan should reflect the upgrade order that’s appropriate for your cluster.

Always start by upgrading the seed node first.

Seed nodes have some small differences in behavior from other nodes in the cluster, such as attracting more gossip traffic. For this reason (along with availability), the seed node should be upgraded first.

For clusters that utilize multiple racks, upgrade each of the nodes in a single rack before moving on to upgrading the nodes in another rack.

It’s a general best practice to have three seed nodes per datacenter, one in each rack. In this example, you would upgrade the seed node in Rack A, followed by every other node in Rack A. Then you would upgrade the seed node in Rack B, followed by every other node in Rack B, and so on for Rack C.

For clusters that utilize multiple datacenters, only one datacenter should be upgraded at a time. Once all the racks in a datacenter have been upgraded, you can move on to upgrading the racks in another datacenter.

Depending on how NetworkTopologyStrategy is configured, whole racks can be upgraded at the same time, and nodes/racks in separate datacenters may also be upgraded at the same time.
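
The ordering rules above can be sketched in Python as a simple sort. The `Node` class and its fields are hypothetical stand-ins for whatever inventory you keep of your cluster. The sketch produces a strictly sequential order (one datacenter at a time, one rack at a time, seed first within each rack) and does not attempt the parallel variations mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical inventory record for one Cassandra node."""
    address: str
    datacenter: str
    rack: str
    is_seed: bool = False


def upgrade_order(nodes: list[Node]) -> list[Node]:
    """Order nodes by datacenter, then rack, with seeds first in each rack."""
    return sorted(
        nodes,
        # False sorts before True, so "not is_seed" places seeds first.
        key=lambda n: (n.datacenter, n.rack, not n.is_seed, n.address),
    )


if __name__ == "__main__":
    cluster = [
        Node("10.0.0.5", "dc1", "rackB"),
        Node("10.0.0.2", "dc1", "rackA"),
        Node("10.0.0.4", "dc1", "rackB", is_seed=True),
        Node("10.0.0.1", "dc1", "rackA", is_seed=True),
        Node("10.0.1.1", "dc2", "rackA", is_seed=True),
    ]
    for node in upgrade_order(cluster):
        print(node.datacenter, node.rack, node.address, "seed" if node.is_seed else "")
```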

Understand limitations during upgrade

After one or more nodes in a cluster have been upgraded, but before all of the nodes in the cluster are running the new version, the cluster is considered to be in a partially upgraded state. In this state, the cluster, as a whole, continues to operate as though it were running on the earlier, pre-upgraded version.

Certain restrictions and limitations apply to a cluster that’s in a partially upgraded state, and you should take action to restrict certain operations until the cluster has been fully upgraded.

During upgrade, and while a cluster is in a partially upgraded state:

Do not enable new features.
Do not repair SSTables.

Before beginning the upgrade process, you should disable all automated/scheduled repairs. This includes disabling tools like Reaper, cron jobs, and any scripts that call nodetool repair.

Note that although you shouldn’t repair SSTables during the upgrade process, it’s recommended that you run regular repairs before upgrading. See Node repair considerations.
Do not add/bootstrap new nodes to the cluster or decommission any existing nodes.
Do not issue TRUNCATE or other DDL-related queries.
Do not alter schemas for any workloads.

Propagation of schema changes between mixed version nodes might have unexpected results. Therefore, you should take action to prevent schema changes for the duration of the upgrade process.

Nodes on different versions might show a schema disagreement during an upgrade.
Do not change credentials, permissions, or any other security settings.
Complete the cluster-wide upgrade before the expiration of gc_grace_seconds (864000 seconds, or 10 days, by default) to ensure any repairs complete successfully.
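
To make the gc_grace_seconds constraint concrete, the following Python sketch computes the time by which the cluster-wide upgrade should finish. The function names are hypothetical, and the sketch assumes you pass in the smallest gc_grace_seconds value configured on any of your tables, counted from the moment repairs were disabled.

```python
from datetime import datetime, timedelta, timezone


def upgrade_deadline(started_at: datetime, gc_grace_seconds: int) -> datetime:
    """Return the time by which the whole cluster should be upgraded."""
    return started_at + timedelta(seconds=gc_grace_seconds)


def time_remaining(started_at: datetime, gc_grace_seconds: int,
                   now: datetime) -> timedelta:
    """Return how much of the window is left (negative once it has passed)."""
    return upgrade_deadline(started_at, gc_grace_seconds) - now


if __name__ == "__main__":
    started = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # Example value only; use the setting from your own table definitions.
    grace = 864000
    print(upgrade_deadline(started, grace))
    print(time_remaining(started, grace, datetime.now(timezone.utc)))
```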

Understand risks

Performing an in-place online upgrade of a Cassandra cluster may result in a temporary reduction in performance, since it effectively simulates a series of temporary node failures. It’s important that you investigate and understand how the upgrade, both during and after the upgrade process, may affect the performance of your Cassandra environment.

Increased latency

Read and write latency may be affected during the upgrade process as nodes are stopped, started, and rejoin the cluster with a cold page cache and cold JVM.

Impact: Increased read and write latency may occur until the nodes warm up and node availability becomes stable. Also, when nodes are shut down during the process, some queries might be dropped and retried, lowering performance.
Mitigation & monitoring: This risk can be mitigated by gracefully shutting down nodes according to the procedures in Upgrade an online cluster. Adding a delay of 3 to 5 minutes between starting an upgraded node and shutting down the next node to be upgraded allows the page cache to warm up.

The risk can be monitored by tracking cluster level read and write latency at the 95th percentile (or above) for the duration of the work. If the latency rises to an unacceptably high level, you can stop upgrading additional nodes until latency returns to the previous level.
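
One way to express the "pause when latency rises" rule is sketched below in Python. It computes a nearest-rank 95th percentile over latency samples and compares it to a pre-upgrade baseline. The `max_ratio` threshold is a hypothetical value chosen for illustration, since what counts as unacceptable depends on your environment, and the sample lists stand in for whatever your monitoring system exports.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Return the nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_pause(baseline_ms: list[float], current_ms: list[float],
                 max_ratio: float = 1.5) -> bool:
    """Return True if current p95 latency exceeds max_ratio times the baseline p95."""
    return percentile(current_ms, 95) > max_ratio * percentile(baseline_ms, 95)


if __name__ == "__main__":
    baseline = [float(ms) for ms in range(1, 101)]
    during_upgrade = [ms * 2 for ms in baseline]
    print(percentile(baseline, 95))
    print(should_pause(baseline, during_upgrade))
```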

Decreased availability

When a node is DOWN during the rolling migration process, the cluster operates with less redundancy. Depending on replication factor, consistency level, and the number of racks, the loss of a second node can result in a partial or complete outage of data in the cluster.

Impact: Loss of data availability.
Mitigation & monitoring: This risk can be mitigated by ensuring that all nodes are UP before operating on the next node, minimizing the time nodes are DOWN, and monitoring the state of the cluster while upgrading a node. This can be done using the nodetool status command (which is incorporated into the phase 1 pre-upgrade checks of Upgrade an online cluster).
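
The check for "all nodes UP" can be automated by scanning the text that nodetool status prints. The Python sketch below assumes the usual layout in which each node row starts with a two-letter code (U or D for up/down, followed by N, L, J, or M for the state) and then the node address. The sample text, addresses, and host IDs are made up for illustration, and column details may differ between versions.

```python
import re

# Node rows start with a status/state code, e.g. "UN" for Up/Normal.
_NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")

# Illustrative output only; not captured from a real cluster.
SAMPLE_STATUS = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.0.1  1.2 GiB    16      100.0%            11111111-1111-1111-1111-111111111111  rackA
DN  10.0.0.2  1.1 GiB    16      100.0%            22222222-2222-2222-2222-222222222222  rackB
UJ  10.0.0.3  512.0 MiB  16      ?                 33333333-3333-3333-3333-333333333333  rackC
"""


def nodes_not_up_normal(status_output: str) -> list[tuple[str, str]]:
    """Return (address, code) for every node row whose code is not 'UN'."""
    problems = []
    for line in status_output.splitlines():
        match = _NODE_ROW.match(line.strip())
        if match and match.group(1) != "UN":
            problems.append((match.group(2), match.group(1)))
    return problems


if __name__ == "__main__":
    for address, code in nodes_not_up_normal(SAMPLE_STATUS):
        print(f"{address} is {code}; wait before upgrading the next node")
```

In practice the text would come from running nodetool status on a node, and an empty result would be the signal that it is safe to move on.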

Driver incompatibility

Unexpected issues in driver compatibility or server functionality may occur after upgrading.

Impact: Loss of function or availability.
Mitigation & monitoring: This risk can be mitigated by upgrading one node first, checking that it’s processing client traffic, and ensuring the application is free from errors.

Low disk space

Additional disk space used for repairs, taking snapshots, and upgrading SSTables may result in low free disk space.

Impact: Loss of compaction functionality due to lack of space for data files. Loss of write functionality due to lack of space for commit log or data files.
Mitigation & monitoring: The risk can be mitigated by ensuring at least 50% free disk space before starting the upgrade. Size Tiered Compaction Strategy (STCS) generally requires additional disk space for compaction, and 50% is a safe buffer.

If disk space runs low, remove existing snapshots, Java heap dumps, and non-critical data from the disk to allow the operation to finish. The SSTable upgrade operation can be monitored by tracking free disk space per node, and alerting when it’s below 50%.
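
A per-node free-space check along these lines can be sketched in Python with the standard library. The data directory path shown is an assumption (a common location for package installs) and should be replaced with the directories your nodes actually use. The 50% threshold comes from the guidance above.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem containing path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_enough_space(path: str, minimum: float = 0.5) -> bool:
    """Return True if at least `minimum` of the filesystem is free."""
    return free_fraction(path) >= minimum


if __name__ == "__main__":
    # Assumed data directory; adjust to match your installation.
    data_dir = "/var/lib/cassandra/data"
    try:
        print(f"{free_fraction(data_dir):.0%} free, ok={has_enough_space(data_dir)}")
    except FileNotFoundError:
        print(f"{data_dir} does not exist on this machine")
```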

Configuration management failure

If the Cassandra attributes in your configuration management system are out of synchronization with what’s in use in production, unexpected issues may occur.

Impact: Cassandra unable to start, the upgrade fails, or some nodes perform poorly.
Mitigation & monitoring: Make sure any configuration changes made on nodes are also changed in your configuration management system (if any). All configuration changes must be tested to reduce the risk of an unexpected issue occurring at the configuration level. Follow the outputs generated by the configuration management system when converging the attributes on the nodes.
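
A simple way to spot drift between the configuration management system and a running node is to compare the two sets of settings key by key. The Python sketch below assumes both have already been loaded into dictionaries. How you load them is left out, and the setting values shown are examples only.

```python
def config_drift(managed: dict, deployed: dict) -> dict:
    """Map each differing key to a (managed_value, deployed_value) pair.

    A key that is absent on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(managed.keys() | deployed.keys()):
        want, have = managed.get(key), deployed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    managed = {"num_tokens": 16, "concurrent_reads": 32}
    deployed = {"num_tokens": 16, "concurrent_reads": 64}
    for key, (want, have) in config_drift(managed, deployed).items():
        print(f"{key}: managed={want} deployed={have}")
```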

Configure appropriate monitoring

As outlined in the Understand risks section, the upgrade process may result in a temporary reduction in performance.

You should configure appropriate monitoring to ensure that the cluster is functioning as expected both during and after the upgrade process. If monitoring is in place for all nodes of the cluster and for the application, the metrics it reports should remain largely unchanged across the upgrade.

Use canary nodes

Whenever you make a change to a cluster (such as upgrading Cassandra or one of its dependencies) you should first apply the change to a single node — also known as a canary node — so you can check that both Cassandra and your applications are working as expected before applying the change to the rest of the nodes in the cluster.

After upgrading Cassandra on a canary node, you should monitor the cluster for errors and search the logs of all dependent applications for unusual exceptions. For production clusters, it’s recommended to wait up to six hours after upgrading the first seed node to properly collect metrics. In certain environments, an increase in latencies of just a few percent can have a serious impact on availability, so it’s important to use this time to identify any such problems before proceeding with the upgrade. Once you’re confident that the canary node is functioning as expected, you can apply the change to the rest of the nodes in the cluster.

If a cluster error or application exception is encountered as a result of upgrading the canary node, you have the option to roll back the change with little impact to the cluster. Rolling back a canary node is much simpler than rolling back a full cluster as you can rely on consistent replicas to provide up-to-date data during the rollback procedure.

Test your upgrade plan

The larger the difference between your current version and your desired version, the more risk and effort may be involved in the upgrade process. Minor version upgrades are generally lower risk, but major releases can potentially introduce breaking changes. When planning an upgrade for a production cluster, you should always test your upgrade plan on a non-production cluster first.

Schedule an upgrade window

It’s a general best practice to perform a cluster upgrade during a scheduled maintenance window that occurs when application loads are typically lower. This isn’t mandatory, as the upgrade can be applied without causing an application outage. However, since individual nodes are taken offline during the upgrade process, the service level of client applications will be in a degraded state for the duration of the upgrade. Similarly, in the event the upgrade needs to be rolled back (reverted), the application may experience data loss.

See Understand risks for more information.