Linux – How to tell if linux disk IO is causing excessive (> 1 second) application stalls

linux, performance, redhat, storage-area-network, veritas

I have a Java application that continuously streams a large volume (hundreds of MB) of plain-text output to about a dozen files on a SAN filesystem. Occasionally, the application pauses for several seconds at a time. I suspect that something related to vxfs (Veritas File System) functionality, and/or how it interacts with the OS, is the culprit.

What steps can I take to confirm or refute this theory? I am aware of iostat and /proc/diskstats as starting points.
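
As a rough illustration of what /proc/diskstats can show, the sketch below samples the write counters for one device once per second and prints the average time per completed write in each interval. The function names are hypothetical, and the field positions assume the usual 2.6-era layout (field 8 = writes completed, field 11 = milliseconds spent writing). Because this is an average, one long stall can hide among many fast writes; the await column of `iostat -x 1` reports a similar figure.

```shell
#!/bin/sh
# Print "writes_completed ms_writing" for one device.
# $1 = device name (e.g. sda), $2 = diskstats-format file (default /proc/diskstats).
disk_write_counters() {
    awk -v dev="$1" '$3 == dev { print $8, $11 }' "${2:-/proc/diskstats}"
}

# Average milliseconds per completed write between two samples.
# Arguments: writes_before ms_before writes_after ms_after
avg_write_ms() {
    awk -v w0="$1" -v t0="$2" -v w1="$3" -v t1="$4" 'BEGIN {
        dw = w1 - w0
        if (dw <= 0) { print "n/a"; exit }   # no writes completed in this interval
        printf "%.1f\n", (t1 - t0) / dw
    }'
}

# Sample one device every second until interrupted (Ctrl-C).
# $1 = device name as it appears in /proc/diskstats (assumption: adjust per system).
watch_device() {
    dev="$1"
    prev=$(disk_write_counters "$dev")
    while sleep 1; do
        cur=$(disk_write_counters "$dev")
        # $prev and $cur are left unquoted on purpose so each splits into two arguments.
        echo "$(date '+%H:%M:%S') $dev avg_write_ms=$(avg_write_ms $prev $cur)"
        prev="$cur"
    done
}
```

Running `watch_device sda` alongside the application and lining up the timestamps with the observed pauses would show whether the stalls coincide with slow writes on that device.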

Edit: revised the title to de-emphasize journaling and emphasize "stalls".

I have done some googling and found at least one article that seems to describe behavior like what I am observing: "Solving the ext3 latency problem".

Additional Information

  • Red Hat Enterprise Linux Server release 5.3 (Tikanga)
  • Kernel: 2.6.18-194.32.1.el5
  • Primary application disk is fiber-channel SAN: lspci | grep -i fibre >> 14:00.0 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)
  • Mount info: type vxfs (rw,tmplog,largefiles,mincache=tmpcache,ioerror=mwdisable) 0 0
  • cat /sys/block/VxVM123456/queue/scheduler >> noop anticipatory [deadline] cfq

Best Answer

My guess is that there's some other process that hogs the disk I/O capacity for a while. iotop can help you pinpoint it, if you have a recent enough kernel.
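
Where iotop is not available, a cruder check is to list tasks in uninterruptible sleep (state D), which usually means they are blocked on I/O. This is a different technique from iotop, not a replacement for it, and the helper name below is hypothetical. It is only a sketch: it shows who is waiting at that instant, not who is generating the load.

```shell
#!/bin/sh
# Filter `ps -eo state,pid,comm` output (read from stdin) down to tasks
# in state D (uninterruptible sleep), printing "pid command" for each.
blocked_tasks() {
    awk '$1 == "D" { print $2, $3 }'
}

# Example: take a snapshot once per second during a stall.
#   while sleep 1; do date '+%H:%M:%S'; ps -eo state,pid,comm | blocked_tasks; done
```

If the Java process shows up in state D during a pause, the stall is at least consistent with it waiting on the disk.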

If this is the case, it's not about the filesystem, much less about journalling. The I/O scheduler is responsible for arbitrating between competing applications. An easy test: check the current scheduler and try a different one. This can be done on the fly, without restarting. For example, on my desktop, to check the first disk (/dev/sda):

cat /sys/block/sda/queue/scheduler
=>  noop deadline [cfq]

shows that it's using CFQ, which is a good choice for desktops but not so much for servers. Better to set 'deadline' (as root):

echo 'deadline' > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler
=>  noop [deadline] cfq

and wait a few hours to see if it improves. If so, set it permanently in the startup scripts (how to do this depends on the distribution).
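
The check-and-switch steps above can be wrapped in a few small helpers. This is a minimal sketch with hypothetical function names; it assumes the sysfs layout shown above, where the active scheduler is the name in square brackets.

```shell
#!/bin/sh
# Print the active scheduler (the bracketed name) from a scheduler file.
# $1 = path such as /sys/block/sda/queue/scheduler
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# List every block device that exposes a scheduler, with its active choice.
list_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#/sys/block/}
        dev=${dev%%/*}
        echo "$dev $(active_scheduler "$f")"
    done
}

# Switch one device to the given scheduler (needs root), then print the result.
# Example: set_scheduler sda deadline
set_scheduler() {
    echo "$2" > "/sys/block/$1/queue/scheduler" &&
        active_scheduler "/sys/block/$1/queue/scheduler"
}
```

A change made through sysfs does not survive a reboot. One common way to make it the default for all devices on kernels of this era is the `elevator=deadline` kernel boot parameter; whether that or a startup script is the better fit depends on the distribution.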