Recovering from a single disk failure using JBOD
Steps for recovering from a single disk failure in a disk array using JBOD (just a bunch of disks).
DataStax Enterprise might not fail from the loss of one disk in a JBOD array, but some reads and writes may fail when:

- The operation's consistency level is ALL.
- The data being requested or written is stored on the defective disk.
- The data to be compacted is on the defective disk.
It's possible that you can simply replace the disk, restart DSE, and run nodetool repair. However, if the disk crash corrupted the system tables, you must remove the incomplete data from the other disks in the array. The procedure for doing this depends on whether the cluster uses vnodes or single-token architecture.
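For orientation, a JBOD layout typically lists each physical disk as a separate entry under data_file_directories in cassandra.yaml. A sketch, using the same illustrative mount points as the examples later in this procedure:

```yaml
# Illustrative JBOD layout in cassandra.yaml: one data directory per physical disk.
# Mount points are examples; substitute your own.
data_file_directories:
    - /mnt1/cassandra/data
    - /mnt2/cassandra/data
    - /mnt3/cassandra/data
```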
Where is the cassandra-env.sh file?

The location of the cassandra-env.sh file depends on the type of installation:

| Installation Type | Location |
|---|---|
| Package installations + Installer-Services installations | |
| Tarball installations + Installer-No Services installations | |
Where is the cassandra.yaml file?

The location of the cassandra.yaml file depends on the type of installation:

| Installation Type | Location |
|---|---|
| Package installations + Installer-Services installations | |
| Tarball installations + Installer-No Services installations | |
Procedure
If a disk fails on a node in a cluster using DSE 5.0 or earlier, replace the node.

1. Verify that the node has a defective disk, and identify the disk, by checking the logs on the affected node. Disk failures are logged in FILE NOT FOUND entries, which identify the mount point or disk that has failed.
2. If the node is still running, stop DSE and shut down the node.
3. Replace the defective disk and restart the node.
4. If the node cannot restart, try restarting DSE without bootstrapping the node:
    - Package and Installer-Services installations:
        1. Add the following option to the cassandra-env.sh file:
           JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"
        2. After the node bootstraps, remove the -Dcassandra.allow_unsafe_replace=true parameter from cassandra-env.sh.
    - Tarball and Installer-No Services installations: start DataStax Enterprise with this option:
      sudo bin/dse cassandra -Dcassandra.allow_unsafe_replace=true
      Tarball and Installer-No Services path: <installation_location>
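The add-then-remove edit in step 4 can be scripted. The sketch below works on a scratch copy rather than a real cassandra-env.sh; the file path and the pre-existing JVM option are placeholders:

```shell
# Work on a scratch copy; point this at your real cassandra-env.sh in practice.
envfile=$(mktemp)
echo 'JVM_OPTS="$JVM_OPTS -Xss256k"' > "$envfile"   # placeholder existing option

# Before restarting the node, append the unsafe-replace flag:
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"' >> "$envfile"
added=$(grep -c 'allow_unsafe_replace' "$envfile")

# After the node bootstraps, delete the flag again:
sed -i '/allow_unsafe_replace/d' "$envfile"
removed=$(grep -c 'allow_unsafe_replace' "$envfile" || true)
echo "added=$added removed=$removed"
```

Removing the flag after bootstrap matters: leaving allow_unsafe_replace enabled defeats the safety check on every subsequent restart.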
5. If DSE restarts, run nodetool repair on the node. If not, replace the node.
6. If the repair succeeds, the node is restored to production. Otherwise, go to step 7 or 8.
7. For a cluster using vnodes:
    1. On the affected node, clear the system directory on each functioning drive.
       For example, for a node with a three-disk JBOD array whose data directories are /mnt1/cassandra/data, /mnt2/cassandra/data, and /mnt3/cassandra/data, if mnt1 has failed:
       $ rm -fr /mnt2/cassandra/data/system
       $ rm -fr /mnt3/cassandra/data/system
    2. Restart DSE without bootstrapping, as described in step 4:
       -Dcassandra.allow_unsafe_replace=true
    3. Run nodetool repair on the node. If the repair succeeds, the node is restored to production. If not, replace the dead node.
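The clearing step can be rehearsed safely in a scratch directory before touching real mount points. The layout below mirrors the three-disk example; all paths are illustrative:

```shell
# Simulate the three-disk JBOD layout under a temporary directory.
base=$(mktemp -d)
mkdir -p "$base"/mnt1/cassandra/data/system \
         "$base"/mnt2/cassandra/data/system \
         "$base"/mnt3/cassandra/data/system

# mnt1 is the failed disk; clear the system directory on the survivors only.
rm -fr "$base"/mnt2/cassandra/data/system "$base"/mnt3/cassandra/data/system

# Only mnt1's system directory remains (on the real node it is on the dead disk).
ls -d "$base"/mnt*/cassandra/data/system 2>/dev/null
```

Note that only the system subdirectory is removed; the keyspace data on the surviving disks stays in place for nodetool repair to reconcile.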
8. For a cluster using single-token nodes:
    1. On one of the cluster's working nodes, run nodetool ring to retrieve the list of the repaired node's tokens:
       $ nodetool ring | grep ip_address_of_node | awk '{print $NF ","}' | xargs
    2. Copy the output of nodetool ring into a spreadsheet (space-delimited).
    3. Edit the output, keeping the list of tokens and deleting the other columns.
    4. On the node with the new disk, open the cassandra.yaml file and add the tokens (as a comma-separated list) to the initial_token property.
    5. Change any other non-default settings in the new node to match the existing nodes. Use the diff command to find and merge any differences between the nodes.
    6. On the affected node, clear the system directory on each functioning drive.
       For example, for a node with a three-disk JBOD array whose data directories are /mnt1/cassandra/data, /mnt2/cassandra/data, and /mnt3/cassandra/data, if mnt1 has failed:
       $ rm -fr /mnt2/cassandra/data/system
       $ rm -fr /mnt3/cassandra/data/system
    7. Restart DSE without bootstrapping, as described in step 4:
       -Dcassandra.allow_unsafe_replace=true
    8. Run nodetool repair on the node. If the repair succeeds, the node is restored to production. If not, replace the node.
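The token-extraction pipeline in step 8 can be tried on canned output first. The sample below is made-up ring output (addresses and tokens are fabricated for illustration); the grep/awk/xargs stages are the same as in the procedure:

```shell
# Fake nodetool ring output (columns abbreviated); the last field is the token.
sample='10.0.0.1  rack1  Up  Normal  100 GB  33.3%  -9223372036854775808
10.0.0.2  rack1  Up  Normal  100 GB  33.3%  -3074457345618258603
10.0.0.1  rack1  Up  Normal  100 GB  33.3%  3074457345618258602'

# Same pipeline as the procedure, filtering on the repaired node's address:
tokens=$(printf '%s\n' "$sample" | grep '10.0.0.1' | awk '{print $NF ","}' | xargs)
echo "$tokens"
```

The resulting comma-separated list is what goes into the initial_token property in cassandra.yaml on the node with the new disk.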