Ubuntu – IO-intensive processes hang with iowait, but no activity going on

Tags: iowait, ssd, ubuntu

I have a bunch of IO-intensive jobs, and to boost performance, I just installed two SSDs in a compute server, one as a scratch file system, one as swap. After running for some time, all my processes hang in "D" state and consume no CPU, and the system reports 67% idle and 33% wait. iostat shows no disk activity, and the system is otherwise responsive, including the relevant file systems. Attaching 'strace' to the processes produces no output.

Looking in /proc/(pid)/fd, I discovered that all the processes are using (reading) one common file. I can't see any reason why this should cause a problem, but I replaced the file, killed the processes, and let everything continue (i.e. new processes will be launched). We'll see if things get stuck on the new file, on a different file, or – ideally – not at all 🙂
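For anyone wanting to repeat the /proc check without clicking through each PID by hand, here is a rough Python sketch of it. It is not what I actually ran; the function names (parse_state, d_state_pids, open_files, common_files) are made up for illustration. It assumes a Linux /proc layout and usually needs root to read the fd directories of other users' processes.

```python
import os
from collections import Counter


def parse_state(stat_text):
    # /proc/<pid>/stat looks like "pid (comm) state ...". The comm field may
    # itself contain spaces or parentheses, so split after the LAST ')'.
    return stat_text.rsplit(")", 1)[1].split()[0]


def d_state_pids(proc="/proc"):
    """Return PIDs currently in uninterruptible sleep ("D" state)."""
    pids = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat"), errors="replace") as f:
                if parse_state(f.read()) == "D":
                    pids.append(int(name))
        except OSError:
            continue  # process exited, or we lack permission
    return pids


def open_files(pid, proc="/proc"):
    """Return the set of targets of the symlinks in /proc/<pid>/fd."""
    fd_dir = os.path.join(proc, str(pid), "fd")
    paths = set()
    try:
        for fd in os.listdir(fd_dir):
            try:
                paths.add(os.readlink(os.path.join(fd_dir, fd)))
            except OSError:
                pass  # descriptor was closed while we were looking
    except OSError:
        pass  # no such process, or not readable without root
    return paths


def common_files(pids):
    """Return (path, count) pairs for files open in more than one PID."""
    counts = Counter()
    for pid in pids:
        counts.update(open_files(pid))
    return [(path, n) for path, n in counts.most_common() if n > 1]


if __name__ == "__main__":
    stuck = d_state_pids()
    print("D-state PIDs:", stuck)
    for path, n in common_files(stuck):
        print(n, path)
```

A path that shows up with a count equal to the number of stuck PIDs is the kind of shared file I found. Sockets and pipes appear as entries like "socket:[...]" and can be ignored.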

I also found a couple of these in kern.log:

BUG: unable to handle kernel paging request at ffffeb8800096e5c

Lots of other information, but I don't know how to decipher it – except that it refers to the PID and name of one of my processes.

Any idea what is going on here, or how to fix it? This is on Ubuntu 12.04 LTS, Dell-something box with a RocketRaid disk controller and btrfs file system.

Best Answer

This looks like it could be a memory problem. Boot memtest and check your RAM.