Recovering from a single disk failure using JBOD
Steps for recovering from a single disk failure in a disk array using JBOD (just a bunch of disks).
If one disk in the array fails, operations on the node can fail in situations such as the following:
- The operation's consistency level is ALL.
- The data being requested or written is stored on the defective disk.
- The data to be compacted is on the defective disk.
It's possible that you can simply replace the disk, restart Cassandra, and run nodetool repair. However, if the disk crash corrupted the Cassandra system table, you must remove the incomplete data from the other disks in the array. The procedure for doing this depends on whether the cluster uses vnodes or single-token architecture.
Procedure
These steps are supported for Cassandra versions 3.2 and later. If a disk fails on a node in a cluster using an earlier version of Cassandra, replace the node.
-
Verify that the node has a defective disk and identify the disk by checking the logs on the affected node.
Disk failures are logged in FILE NOT FOUND entries, which identify the mount point or disk that has failed.
- If the node is still running, stop Cassandra and shut down the node.
- Replace the defective disk and restart the node.
-
If the node cannot restart:
-
Try restarting Cassandra without bootstrapping the node:
Cassandra package installations:
- Add the following option to the cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"
- Start the node.
- After the node bootstraps, remove the -Dcassandra.allow_unsafe_replace=true parameter from cassandra-env.sh.
- Restart the node.
Cassandra tarball installations:
- Start Cassandra with this option:
sudo bin/dse cassandra -Dcassandra.allow_unsafe_replace=true #Starts DataStax Enterprise
- If Cassandra restarts, run nodetool repair on the node. If not, replace the node.
- If the repair succeeds, the node is restored to production. Otherwise, go to step 7 (clusters using vnodes) or step 8 (clusters using single-token nodes).
-
For a cluster using vnodes:
-
On the affected node, clear the system directory on each functioning drive.
Example for a node with a three-disk JBOD array:
/mnt1/cassandra/data
/mnt2/cassandra/data
/mnt3/cassandra/data
If mnt1 has failed:
rm -fr /mnt2/cassandra/data/system
rm -fr /mnt3/cassandra/data/system
-
Restart Cassandra without bootstrapping as described in 4:
sudo bin/dse cassandra -Dcassandra.allow_unsafe_replace=true #Restarts DataStax Enterprise
-
Run nodetool repair on the
node.
If the repair succeeds, the node is restored to production. If not, replace the dead node.
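The step that clears the system directory on each functioning drive can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name clear_system_dirs and the DRY_RUN variable are hypothetical and not part of Cassandra, and the paths are the ones from the three-disk example. By default the sketch only prints the rm commands so they can be reviewed before anything is deleted.

```shell
#!/bin/sh
# Sketch: clear the system directory in every data directory except the
# one that lived on the failed disk.
# clear_system_dirs and DRY_RUN are hypothetical names used for illustration.

clear_system_dirs() {
    failed_dir="$1"   # data directory on the failed disk
    shift             # remaining arguments: all data directories of the node
    for data_dir in "$@"; do
        # Skip the directory that was on the failed disk.
        [ "$data_dir" = "$failed_dir" ] && continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): only show what would be removed.
            echo "rm -fr $data_dir/system"
        else
            rm -fr "$data_dir/system"
        fi
    done
}

# Example from the text: mnt1 has failed, mnt2 and mnt3 still work.
clear_system_dirs /mnt1/cassandra/data \
    /mnt1/cassandra/data /mnt2/cassandra/data /mnt3/cassandra/data
```

Setting DRY_RUN=0 before calling the function would perform the deletion instead of printing it; keeping the dry run as the default is a deliberate choice because the command is destructive.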
-
For a cluster using single-token nodes:
-
On one of the cluster's working nodes, run nodetool ring to retrieve the list of the repaired node's
tokens:
nodetool ring | grep ip_address_of_node | awk ' {print $NF ","}' | xargs
- Copy the output of the nodetool ring command into a spreadsheet (space-delimited).
- Edit the output, keeping the list of tokens and deleting the other columns.
- On the node with the new disk, open the cassandra.yaml file and add the tokens (as a comma-separated list) to the initial_token property.
-
Change any other non-default settings
in the new nodes to match the existing nodes. Use the
diff command to find and merge any differences
between the nodes.
-
On the affected node, clear the system directory on each functioning drive.
Example for a node with a three-disk JBOD array:
/mnt1/cassandra/data
/mnt2/cassandra/data
/mnt3/cassandra/data
If mnt1 has failed:
rm -fr /mnt2/cassandra/data/system
rm -fr /mnt3/cassandra/data/system
-
Restart Cassandra without bootstrapping as described in 4:
sudo bin/dse cassandra -Dcassandra.allow_unsafe_replace=true #Restarts DataStax Enterprise
-
Run nodetool repair on the
node.
If the repair succeeds, the node is restored to production. If not, replace the node.
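For the single-token steps, the spreadsheet editing of the nodetool ring output could also be done with awk. The following is a minimal sketch, assuming that each row of the output has the node's address in the first column and the token in the last column. The function name tokens_for_node and the sample rows are made up for illustration and do not come from a real cluster.

```shell
#!/bin/sh
# Sketch: turn "nodetool ring"-style rows into a comma-separated token list.
# tokens_for_node is a hypothetical helper name.

tokens_for_node() {
    ip="$1"   # address of the node whose tokens are wanted
    # Keep rows whose first column is exactly the address, then join the
    # last column (the token) with commas and no trailing comma.
    awk -v ip="$ip" '
        $1 == ip { printf "%s%s", sep, $NF; sep = "," }
        END      { print "" }
    '
}

# Illustrative input only; these rows are invented sample data.
sample_ring_output='10.0.0.1  rack1  Up  Normal  1.5 GB  50.0%  -9223372036854775808
10.0.0.2  rack1  Up  Normal  1.4 GB  50.0%  0
10.0.0.2  rack1  Up  Normal  1.4 GB  50.0%  42'

printf '%s\n' "$sample_ring_output" | tokens_for_node 10.0.0.2
# prints: 0,42
```

Against a live cluster the input would come from nodetool ring itself, for example `nodetool ring | tokens_for_node ip_address_of_node`, and the resulting list is what goes into the initial_token property in cassandra.yaml.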
The location of the cassandra.yaml file depends on the type of installation:
- DataStax Enterprise 5.0 Installer-Services and package installations: /etc/dse/cassandra/cassandra.yaml
- DataStax Enterprise 5.0 Installer-No Services and tarball installations: install_location/resources/cassandra/conf/cassandra.yaml
- Cassandra package installations: /etc/cassandra/cassandra.yaml
- Cassandra tarball installations: install_location/conf/cassandra.yaml
The location of the cassandra-rackdc.properties file depends on the type of installation:
- DataStax Enterprise 5.0 Installer-Services and package installations: /etc/dse/cassandra/cassandra-rackdc.properties
- DataStax Enterprise 5.0 Installer-No Services and tarball installations: install_location/resources/cassandra/conf/cassandra-rackdc.properties
- Cassandra package installations: /etc/cassandra/cassandra-rackdc.properties
- Cassandra tarball installations: install_location/conf/cassandra-rackdc.properties
The location of the cassandra-topology.properties file depends on the type of installation:
- DataStax Enterprise 5.0 Installer-Services and package installations: /etc/dse/cassandra/cassandra-topology.properties
- DataStax Enterprise 5.0 Installer-No Services and tarball installations: install_location/resources/cassandra/conf/cassandra-topology.properties
- Cassandra package installations: /etc/cassandra/cassandra-topology.properties
- Cassandra tarball installations: install_location/conf/cassandra-topology.properties
The location of the cassandra-env.sh file depends on the type of installation:
- DataStax Enterprise 5.0 Installer-Services and package installations: /etc/dse/cassandra/cassandra-env.sh
- DataStax Enterprise 5.0 Installer-No Services and tarball installations: install_location/resources/cassandra/conf/cassandra-env.sh
- Cassandra package installations: /etc/cassandra/cassandra-env.sh
- Cassandra tarball installations: install_location/conf/cassandra-env.sh
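Because the configuration directory differs by installation type, a small lookup helper can save some guessing. The sketch below is an assumption-laden illustration rather than a supported tool: the function name find_conf_file is hypothetical, and it simply tries the four directories named in the locations listed above, in order, returning the first one that contains the requested file.

```shell
#!/bin/sh
# Sketch: locate a Cassandra configuration file by trying the directories
# used by the different installation types.
# find_conf_file is a hypothetical helper name.

find_conf_file() {
    file_name="$1"               # e.g. cassandra.yaml or cassandra-env.sh
    install_location="${2:-}"    # root of a tarball or No Services install; may be empty
    for dir in \
        /etc/dse/cassandra \
        /etc/cassandra \
        "$install_location/resources/cassandra/conf" \
        "$install_location/conf"; do
        if [ -f "$dir/$file_name" ]; then
            # Print the first match and stop searching.
            echo "$dir/$file_name"
            return 0
        fi
    done
    echo "not found: $file_name" >&2
    return 1
}
```

A call such as `find_conf_file cassandra-env.sh install_location` prints the full path when the file exists in one of the candidate directories and returns a non-zero status otherwise, so it can be combined with an editor or with diff in a script.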