Phase 1: Pre-upgrade checks
In this phase you’ll perform various checks to confirm the state of the cluster. The upgrade procedures in later phases assume that the cluster is in good health and that it satisfies the necessary prerequisites. It’s important to complete all of the pre-upgrade checks to confirm this assumption.
Step 1: Confirm prerequisites
Confirm that your Cassandra environment satisfies all of the necessary prerequisites.
During the upgrade, and while the cluster is in a partially upgraded state, these prerequisites must continue to be satisfied.
Step 2: Confirm that all nodes are Up and Normal
All nodes in the cluster need to report status UN (Up and Normal).
Run the following command to check for any nodes in the cluster that are currently reporting a status other than UN:
nodetool status | grep -v UN
If all nodes are in an Up and Normal state, then there should be no nodes listed in the output:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
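If you run this check often, it can be scripted. The following is a minimal sketch (not part of nodetool itself) that prints any node line whose state is not UN and exits non-zero; it assumes the default nodetool status output format, in which each node line begins with a two-letter status/state code:

# Print any node whose status/state is not UN; exit non-zero if one is found.
nodetool status | awk '/^[UD][NLJM]/ && $1 != "UN" { print; bad=1 } END { exit bad }'

An exit code of 0 means every node is reporting Up and Normal, which makes the check easy to embed in a pre-upgrade script.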
Step 3: Confirm no unresolved errors
Confirm that no unresolved ERROR messages have been logged in system.log on any node in the previous 72 hours.
You should also inspect WARN messages that occurred during the same period in case they reveal any patterns that indicate an unhealthy cluster state.
To retrieve ERROR and WARN messages for packaged installations:
sudo grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log
To retrieve ERROR and WARN messages for tarball installations:
sudo grep -e "WARN" -e "ERROR" <install-location>/logs/system.log
Cassandra logs live in ${CASSANDRA_HOME}/logs by default, but most Linux distributions relocate them to /var/log/cassandra.
All messages at the ERROR level in system.log must be understood, and most often resolved, before proceeding with the upgrade.
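Note that the grep commands above scan the entire log, not just the previous 72 hours. As a minimal sketch of a time-bounded scan, the following assumes the default log pattern (level first, then thread, then a YYYY-MM-DD HH:MM:SS,mmm timestamp), GNU date, and the packaged-installation log path; adjust as needed for your environment:

# Hypothetical 72-hour filter over system.log; relies on lexicographic timestamp comparison.
since=$(date -d '72 hours ago' '+%Y-%m-%d %H:%M')
sudo awk -v since="$since" '$1 ~ /^(WARN|ERROR)$/ && ($3 " " $4) >= since' /var/log/cassandra/system.log

Keep in mind that rotated log files may also hold messages from the window you care about.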
Step 4: Confirm gossip is stable
Verify that all entries in the cluster’s gossip information output have the state NORMAL.
The following command checks for any nodes that have a status other than NORMAL:
nodetool gossipinfo | grep STATUS | grep -v NORMAL
If nothing is returned in the command output, then all nodes have NORMAL status.
Common reasons that a node might have a gossip status other than NORMAL include:
- The node recently left the cluster.
- The node unsuccessfully left the cluster.
- The node is joining the cluster.
- The node is participating as an observer.
The history of all non-NORMAL nodes should be investigated, and appropriately resolved, before proceeding with the upgrade. Unknown or ghost nodes may be removed via Java Management Extensions (JMX).
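To run the gossip check from one place across every node (each node's own view of gossip is what matters), a small loop helps. A minimal sketch, assuming passwordless SSH and hypothetical hostnames; on newer Cassandra versions the field may appear as STATUS_WITH_PORT:

# Flag any host that sees a gossip STATUS other than NORMAL.
for host in node1 node2 node3; do
  echo "== ${host} =="
  ssh "${host}" "nodetool gossipinfo | grep STATUS | grep -v NORMAL" \
    && echo "WARNING: ${host} sees non-NORMAL gossip entries" >&2
done

Because grep exits non-zero when it finds nothing, the warning fires only for hosts that report a problem.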
Step 5: Confirm no dropped messages
Confirm that there have been no dropped messages logged on any node during the previous 72 hours.
To check for dropped messages, run the following command on each node in the cluster:
nodetool tpstats | grep -A 12 Dropped
If there have been no dropped messages, the output should look like the following:
Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
Dropped messages indicate that the cluster is overloaded. Making changes to an overloaded cluster, such as performing a cluster upgrade, may further degrade performance.
You should try to address the root cause of the dropped messages (such as JVM garbage-collection pauses or blocked Flush Writers) before upgrading. If the root cause cannot be identified, or you believe it may be resolved by upgrading Cassandra, then proceed with caution.
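For a scriptable pass/fail signal on each node, the Dropped column can be checked directly. A minimal sketch, assuming the table layout shown above:

# Exit non-zero and print the offending message types if any messages were dropped.
nodetool tpstats | awk '/^Message type/ { t=1; next } t && NF==2 && $2+0 > 0 { print; bad=1 } END { exit bad }'

Run it on every node, since tpstats reports only the local node's statistics.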
End of phase
At the end of this phase:
- Your Cassandra environment satisfies all of the necessary prerequisites.
- All nodes in the cluster are reporting as Up and Normal.
- No unresolved ERROR messages have been logged on any node in the previous 72 hours. WARN messages that occurred during the same period have been inspected and found not to indicate an unhealthy cluster state.
- All entries in the cluster’s gossip information output have the state NORMAL. Any non-NORMAL nodes have been investigated and appropriately resolved.
- No dropped messages have been logged on any node during the previous 72 hours. For any dropped messages that did occur, the root cause has been identified and addressed.