Phase 1: Pre-upgrade checks
In this phase you’ll perform various checks to confirm the state of the cluster. The upgrade procedures in later phases assume that the cluster is in good health and that it satisfies the necessary prerequisites. It’s important to complete all of the pre-upgrade checks to confirm this assumption.
Step 1: Confirm prerequisites
Confirm that your Cassandra environment satisfies all of the necessary prerequisites.
During the upgrade, and while the cluster is in a partially upgraded state, these prerequisites must continue to be satisfied.
Step 2: Confirm that all nodes are Up and Normal
All nodes in the cluster need to report status UN (Up and Normal).
Run the following command to check for any nodes in the cluster that are currently reporting a status other than UN:
nodetool status | grep -v UN
If all nodes are in an Up and Normal state, then there should be no nodes listed in the output:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
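If you run this check often, it can be scripted. The following is a minimal sketch (not part of nodetool itself) that prints any node line whose state is not UN and exits non-zero; it assumes the default nodetool status output format, in which each node line begins with a two-letter status/state code:

# Print any node whose status/state is not UN; exit non-zero if one is found.
nodetool status | awk '/^[UD][NLJM]/ && $1 != "UN" { print; bad=1 } END { exit bad }'

An exit code of 0 means every node is reporting Up and Normal, which makes the check easy to embed in a pre-upgrade script.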
Step 3: Confirm no unresolved errors
Confirm that no unresolved ERROR messages have been logged in system.log on any node in the previous 72 hours.
You should also inspect WARN messages that occurred during the same period in case they reveal any patterns that indicate an unhealthy cluster state.
To retrieve ERROR and WARN messages for packaged installations:
sudo grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log
To retrieve ERROR and WARN messages for tarball installations:
sudo grep -e "WARN" -e "ERROR" <install-location>/logs/system.log
Cassandra logs live in ${CASSANDRA_HOME}/logs by default, but most Linux distributions relocate them to /var/log/cassandra.
All messages at the ERROR level in system.log must be understood, and most often resolved, before proceeding with the upgrade.
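Note that the grep commands above scan the entire log, not just the previous 72 hours. As a minimal sketch of a time-bounded scan, the following assumes the default log pattern (level first, then thread, then a YYYY-MM-DD HH:MM:SS,mmm timestamp), GNU date, and the packaged-installation log path; adjust as needed for your environment:

# Hypothetical 72-hour filter over system.log; relies on lexicographic timestamp comparison.
since=$(date -d '72 hours ago' '+%Y-%m-%d %H:%M')
sudo awk -v since="$since" '$1 ~ /^(WARN|ERROR)$/ && ($3 " " $4) >= since' /var/log/cassandra/system.log

Keep in mind that rotated log files may also hold messages from the window you care about.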
Step 4: Confirm gossip is stable
Verify that all entries in the cluster’s gossip information output have the state NORMAL.
The following command checks for any nodes that have a status other than NORMAL:
nodetool gossipinfo | grep STATUS | grep -v NORMAL
If nothing is returned in the command output, then all nodes have NORMAL status.
Common reasons that a node might have a gossip status other than NORMAL include:
- The node recently left the cluster.
- The node unsuccessfully left the cluster.
- The node is joining the cluster.
- The node is participating as an observer.
The history of all non-NORMAL nodes should be investigated, and appropriately resolved, before proceeding with the upgrade. Unknown or ghost nodes may be removed via Java Management Extensions (JMX).
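To run the gossip check from one place across every node (each node's own view of gossip is what matters), a small loop helps. A minimal sketch, assuming passwordless SSH and hypothetical hostnames; on newer Cassandra versions the field may appear as STATUS_WITH_PORT:

# Flag any host that sees a gossip STATUS other than NORMAL.
for host in node1 node2 node3; do
  echo "== ${host} =="
  ssh "${host}" "nodetool gossipinfo | grep STATUS | grep -v NORMAL" \
    && echo "WARNING: ${host} sees non-NORMAL gossip entries" >&2
done

Because grep exits non-zero when it finds nothing, the warning fires only for hosts that report a problem.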
Step 5: Confirm no dropped messages
Confirm that there have been no dropped messages logged on any node during the previous 72 hours.
To check for dropped messages, run the following command on each node in the cluster:
nodetool tpstats | grep -A 12 Dropped
If there have been no dropped messages, the output should look like the following:
Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
Dropped messages indicate that the cluster is overloaded. Making changes to an overloaded cluster, such as performing a cluster upgrade, may further degrade performance.
You should try to address the root cause of the dropped messages (such as JVM garbage-collection pauses or blocked Flush Writers) before upgrading. If the root cause cannot be identified, or you believe it may be resolved by upgrading Cassandra, then proceed with caution.
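For a scriptable pass/fail signal on each node, the Dropped column can be checked directly. A minimal sketch, assuming the table layout shown above:

# Exit non-zero and print the offending message types if any messages were dropped.
nodetool tpstats | awk '/^Message type/ { t=1; next } t && NF==2 && $2+0 > 0 { print; bad=1 } END { exit bad }'

Run it on every node, since tpstats reports only the local node's statistics.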
End of phase
At the end of this phase:
- Your Cassandra environment satisfies all of the necessary prerequisites.
- All nodes in the cluster are reporting as Up and Normal.
- No unresolved ERROR messages have been logged on any node in the previous 72 hours. WARN messages that occurred during the same period have been inspected and found not to indicate an unhealthy cluster state.
- All entries in the cluster’s gossip information output have the state NORMAL. Any non-NORMAL nodes have been investigated and appropriately resolved.
- No dropped messages have been logged on any node during the previous 72 hours. For any dropped messages that did occur, the root cause has been identified and addressed.