Nodes appear unresponsive due to a Linux futex_wait() kernel bug
Nodes randomly freeze and become unresponsive for an unknown reason.
The bug affects RHEL 6.6, CentOS 6.6, and later releases that lack the kernel fix. Symptoms include:
- No garbage collection activity in the logs
- No compactions in progress
- Unable to run nodetool commands
- No response on native transport, Thrift or JMX ports
- Low or close to zero CPU utilization, or
- High CPU utilization which eventually leads to the node becoming unresponsive
A thread dump on the node might show:
Thread 104823: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=68, line=2082 (Compiled frame)
- java.util.concurrent.LinkedBlockingQueue.poll(long, java.util.concurrent.TimeUnit) @bci=62, line=467 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
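As a rough aid for reading a dump like the one above, the following sketch counts threads whose top frame is Unsafe.park. It assumes the dump was saved to a file in the format shown (for example, from `jstack -F <pid>`); the function name count_parked_threads is hypothetical. Idle pool threads also park normally, so the count is only a hint and not proof of the futex bug.

```shell
#!/bin/sh
# Sketch only: count_parked_threads is a hypothetical helper name.
# Usage: count_parked_threads DUMP_FILE
# Prints the number of threads whose first (top) stack frame is Unsafe.park.
count_parked_threads() {
    awk '
        # A new thread section starts: the next frame line is its top frame.
        /^Thread [0-9]+:/ { header = 1; next }
        # First frame line after the header (frames look like " - pkg.Class.method(...)").
        header && /^[ \t]*- / {
            if ($0 ~ /Unsafe\.park\(/) n++
            header = 0
        }
        END { print n + 0 }
    ' "$1"
}
```

Comparing the count across two dumps taken a few minutes apart may show whether the same threads stay parked.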
Cause
This problem is caused by a Linux futex_wait() bug that causes user processes to deadlock and hang. A futex_wait() call (and any process making this call) can stay blocked forever. JVM synchronization method calls such as lock(), park(), and unpark() all make futex_wait() calls at some point and can trigger the unresponsiveness caused by this bug.
Solution
Upgrade to a Linux kernel that contains the get_futex_key_refs() fix, such as the kernels shipped in RHEL 6.6.z and CentOS 6.6.z.
Use the following command to check for the installed patches on a RHEL server:
$ sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
Sample output from this command:
- [kernel] futex: Mention key referencing differences between shared and private futexes (Larry Woodman) [1167405]
- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1167405]
If the patch is not installed, the rpm command prints nothing.
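The check above can be wrapped in a small helper for use in scripts. This is a minimal sketch: the function name has_futex_fix is hypothetical, and it assumes the changelog entry is worded exactly as in the sample output, which may not hold for every distribution or kernel build.

```shell
#!/bin/sh
# Sketch only: has_futex_fix is a hypothetical helper name, not part of rpm.
# It reads a kernel changelog on stdin and succeeds (exit 0) if the
# get_futex_key_refs() barrier entry is present.
has_futex_fix() {
    grep -qF 'futex: Ensure get_futex_key_refs() always implies a barrier'
}

# Example invocation on an RPM-based host (commented out so that sourcing
# this file has no side effects):
#   if rpm -q --changelog "kernel-$(uname -r)" | has_futex_fix; then
#       echo "futex fix present in running kernel package"
#   else
#       echo "futex fix NOT found; consider a kernel upgrade"
#   fi
```

Reading the changelog from stdin keeps the helper independent of rpm itself, so it can also be run against a saved changelog file.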
For further information on distributions that contain the fix, consult the relevant vendor or distributor of the operating system.