Recovering from a single disk failure using JBOD

Steps for recovering from a single disk failure in a disk array using JBOD (just a bunch of disks).

DataStax Enterprise might not fail from the loss of one disk in a JBOD array, but some reads and writes may fail when:

The operation’s consistency level is ALL.
The data being requested or written is stored on the defective disk.
The data to be compacted is on the defective disk.

You may be able to simply replace the disk, restart DSE, and run nodetool repair. However, if the disk crash corrupted the system tables, you must remove the incomplete data from the other disks in the array. The procedure for doing this depends on whether the cluster uses vnodes or a single-token architecture.

Where is the cassandra-env.sh file located?

The location of the cassandra-env.sh file depends on the type of installation:

Package installations and installer-services installations: /etc/dse/cassandra/cassandra-env.sh
Tarball installations and installer-no services installations: INSTALLATION_LOCATION/resources/cassandra/conf/cassandra-env.sh

Where is the cassandra.yaml file?

The location of the cassandra.yaml file depends on the type of installation:

Installation Type	Location
Package installations + Installer-Services installations	`/etc/dse/cassandra/cassandra.yaml`
Tarball installations + Installer-No Services installations	`<installation_location>/resources/cassandra/conf/cassandra.yaml`
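
The two documented locations can be checked from a script. The following is a minimal sketch, not an official DSE tool: the function name find_dse_conf and its INSTALL_DIR argument (standing in for <installation_location>) are assumptions made for illustration.

```shell
#!/bin/sh
# Print the path of a DSE configuration file, checking the two documented locations.
# Usage: find_dse_conf FILE_NAME [INSTALL_DIR]
# INSTALL_DIR is a hypothetical argument standing in for <installation_location>.
find_dse_conf() {
    name="$1"
    install_dir="${2:-}"

    # Package and Installer-Services installations
    if [ -f "/etc/dse/cassandra/$name" ]; then
        printf '%s\n' "/etc/dse/cassandra/$name"
        return 0
    fi

    # Tarball and Installer-No Services installations
    if [ -n "$install_dir" ] && [ -f "$install_dir/resources/cassandra/conf/$name" ]; then
        printf '%s\n' "$install_dir/resources/cassandra/conf/$name"
        return 0
    fi

    echo "could not find $name in the documented locations" >&2
    return 1
}
```

For example, find_dse_conf cassandra.yaml /opt/dse prints whichever of the two paths exists, where /opt/dse is only a placeholder for the real installation directory.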

Procedure

If a disk fails on a node in a cluster using DSE 5.0 or earlier, replace the node.

Check the logs on the affected node to verify that it has a defective disk and to identify the disk.

Disk failures are logged in FILE NOT FOUND entries, which identify the mount point or disk that has failed.
If the node is still running, stop DSE and shut down the node.
Replace the defective disk and restart the node.
If the node cannot restart:
1. Try restarting DSE without bootstrapping the node:
  1. Package and Installer-Services installations:
    
    Add the following option to the cassandra-env.sh file:
    
    JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"
    
    Start DataStax Enterprise as a service.
    
    After the node bootstraps, remove the -Dcassandra.allow_unsafe_replace=true parameter from cassandra-env.sh.
    
    Restart DataStax Enterprise as a service.
  2. Tarball and Installer-No Services installations:
    
    Start DataStax Enterprise with this option:
    
    sudo bin/dse cassandra -Dcassandra.allow_unsafe_replace=true
    
    Tarball and Installer No-Services path:
    
    <installation_location>
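
For package installations, adding the option and removing it again can be scripted so that the line is never duplicated or left behind. This is a sketch under assumptions: the function names are hypothetical, and the path to cassandra-env.sh is passed in rather than detected.

```shell
#!/bin/sh
# The option line exactly as it should appear in cassandra-env.sh.
UNSAFE_REPLACE_LINE='JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"'

# Append the option to the given cassandra-env.sh unless it is already present.
enable_unsafe_replace() {
    env_file="$1"
    grep -qxF "$UNSAFE_REPLACE_LINE" "$env_file" ||
        printf '%s\n' "$UNSAFE_REPLACE_LINE" >> "$env_file"
}

# Remove the option again after the node has bootstrapped.
disable_unsafe_replace() {
    env_file="$1"
    tmp_file="$env_file.tmp.$$"
    # grep -v exits with 1 when it selects no lines; only a status above 1 is an error.
    grep -vxF "$UNSAFE_REPLACE_LINE" "$env_file" > "$tmp_file" ||
        [ $? -eq 1 ] || { rm -f "$tmp_file"; return 1; }
    # Write the contents back in place so the file keeps its owner and permissions.
    cat "$tmp_file" > "$env_file" && rm -f "$tmp_file"
}
```

A typical sequence would be enable_unsafe_replace /etc/dse/cassandra/cassandra-env.sh, start the service, wait for the node to come up, then disable_unsafe_replace on the same file and restart.
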
If DSE restarts, run nodetool repair on the node. If not, replace the node.
If the repair succeeds, the node is restored to production. Otherwise, continue with the procedure below that matches your cluster: vnodes or single-token nodes.
For a cluster using vnodes:
1. On the affected node, clear the system directory on each functioning drive.
  
  Example for a node with a three-disk JBOD array:
  - /mnt1/cassandra/data
  - /mnt2/cassandra/data
  - /mnt3/cassandra/data
  If mnt1 has failed:
  rm -fr /mnt2/cassandra/data/system
  rm -fr /mnt3/cassandra/data/system
2. Restart DSE without bootstrapping, as described above for a node that cannot restart:
  -Dcassandra.allow_unsafe_replace=true
3. Run nodetool repair on the node.
  
  If the repair succeeds, the node is restored to production. If not, replace the dead node.
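
Because clearing the system directories is destructive, it can help to generate the commands first and review them before running anything. The sketch below only prints the rm commands for the drives that are still working. The function name and its arguments are assumptions for illustration, and the data directories must be supplied by the operator.

```shell
#!/bin/sh
# Print the rm command for the system directory on every functioning drive.
# Usage: print_system_cleanup FAILED_MOUNT DATA_DIR...
# Nothing is deleted; review the output before running it by hand.
print_system_cleanup() {
    failed_mount="$1"
    shift
    for data_dir in "$@"; do
        case "$data_dir" in
            "$failed_mount"|"$failed_mount"/*) ;;   # skip directories on the failed disk
            *) printf 'rm -fr %s/system\n' "$data_dir" ;;
        esac
    done
}

# Example with the three-disk layout from the text, where /mnt1 has failed:
# print_system_cleanup /mnt1 /mnt1/cassandra/data /mnt2/cassandra/data /mnt3/cassandra/data
```
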
For a cluster using single-token nodes:
1. On one of the cluster’s working nodes, run nodetool ring to retrieve the list of the repaired node’s tokens:
  nodetool ring | grep ip_address_of_node | awk ' {print $NF ","}' | xargs
2. Copy the output of the nodetool ring command into a spreadsheet (space-delimited).
3. Edit the output, keeping the list of tokens and deleting the other columns.
4. On the node with the new disk, open the cassandra.yaml file and add the tokens (as a comma-separated list) to the initial_token property.
5. Change any other non-default settings on the node with the new disk to match the existing nodes. Use the diff command to find and merge any differences between the nodes.
6. On the affected node, clear the system directory on each functioning drive.
  
  Example for a node with a three-disk JBOD array:
  - /mnt1/cassandra/data
  - /mnt2/cassandra/data
  - /mnt3/cassandra/data
  If mnt1 has failed:
  rm -fr /mnt2/cassandra/data/system
  rm -fr /mnt3/cassandra/data/system
7. Restart DSE without bootstrapping, as described above for a node that cannot restart:
  -Dcassandra.allow_unsafe_replace=true
8. Run nodetool repair on the node.
  
  If the repair succeeds, the node is restored to production. If not, replace the node.
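
The spreadsheet steps for building the token list can also be done in the shell. The following sketch assumes that nodetool ring prints the node address in the first column and the token in the last column. The function name tokens_for_node is hypothetical. Unlike a plain grep, it matches the address exactly, so 10.0.0.1 does not also select 10.0.0.10, and it leaves no trailing comma.

```shell
#!/bin/sh
# Read `nodetool ring` output on stdin and print the tokens owned by one node
# as a comma-separated list suitable for the initial_token property.
# Usage: nodetool ring | tokens_for_node IP_ADDRESS
tokens_for_node() {
    # Match the address column exactly and keep only the last column (the token),
    # then join the resulting lines with commas.
    awk -v ip="$1" '$1 == ip { print $NF }' | paste -s -d, -
}
```

The printed list can then be pasted as the value of initial_token in cassandra.yaml on the node with the new disk.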
