Nodes appear unresponsive due to a Linux futex_wait() kernel bug

Nodes randomly freeze and become unresponsive for an unknown reason.

The bug exists in RHEL 6.6 and CentOS 6.6 and above.

Nodes affected by this bug have the following characteristics:
  • No garbage collection activity in the logs
  • No compactions in progress
  • Unable to run nodetool commands
  • No response on native transport, Thrift or JMX ports
  • Low or close to zero CPU utilization, or
  • High CPU utilization which eventually leads to the node becoming unresponsive

A thread dump on the node might show:

Thread 104823: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=68, line=2082 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.poll(long, java.util.concurrent.TimeUnit) @bci=62, line=467 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
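
As a rough aid when reading a saved dump like the one above, the sketch below counts the frames parked in sun.misc.Unsafe.park. The helper name count_parked_threads and the file name in the usage comment are hypothetical. Idle pool threads also park during normal operation, so the count is only a hint to read alongside the other symptoms, not a diagnosis.

```shell
#!/bin/sh
# count_parked_threads FILE
# Print how many lines in a saved thread dump mention sun.misc.Unsafe.park.
# Hypothetical helper name; assumes the dump format shown above.
count_parked_threads() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is ignored on purpose.
    grep -c 'sun\.misc\.Unsafe\.park' "$1" || true
}

# Usage (threaddump.txt is a hypothetical file holding a saved dump):
#   count_parked_threads threaddump.txt
```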

Cause 

This problem is caused by a Linux futex_wait() bug that causes user processes to deadlock and hang. A futex_wait() call, and any process making that call, can stay blocked forever. JVM synchronization methods such as lock(), park() and unpark() all make futex_wait() calls at some point and can therefore trigger the unresponsiveness caused by this bug.

Solution 

Upgrade to a Linux kernel that contains the get_futex_key_refs() fix, such as the kernels in RHEL 6.6.z and CentOS 6.6.z.

Use the following command to check whether the patch is installed on a RHEL server:

$ sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref

Sample output from this command:

- [kernel] futex: Mention key referencing differences between shared and private futexes (Larry Woodman) [1167405]
- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1167405]

If the patch is not installed, the command produces no output.
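
To run the same check across many nodes, the test can be wrapped in a small function that reads the changelog on standard input. This is a minimal sketch: the name has_futex_fix is hypothetical, and the pattern assumes the changelog entry is worded as in the sample output above.

```shell
#!/bin/sh
# has_futex_fix
# Read a kernel changelog on stdin; exit 0 if it mentions the
# get_futex_key_refs() fix, non-zero otherwise.
# Hypothetical helper name; pattern based on the sample output above.
has_futex_fix() {
    grep -q 'futex.*get_futex_key_refs'
}

# Usage on a RHEL or CentOS node:
#   rpm -q --changelog "kernel-$(uname -r)" | has_futex_fix \
#       && echo "fix present" || echo "fix missing"
```

Because the function only inspects text on standard input, it can also be run against a changelog saved from another machine.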

For further information on distributions that contain the fix, consult the relevant vendor or distributor of the operating system.