How to find the source of high load on an Ubuntu server

Tags: top, ubuntu-10.04

We have an Ubuntu 10.04 VPS serving a Rails site that often shows fairly high load without correspondingly high CPU or memory usage. Other questions here on Server Fault suggest to me that this is an I/O issue (i.e. processes stuck in the I/O wait state are driving up the load). I'm trying to track down those processes, but I'm not having much luck. I'd appreciate help with (a) ways to identify the guilty processes, and/or (b) confirmation that I'm asking the right question.

Here's a snapshot of top:

top - 18:28:49 up 5 days,  3:07,  2 users,  load average: 1.79, 1.83, 1.73
Tasks:  82 total,   1 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.3%sy,  0.0%ni, 99.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   1794980k total,  1780384k used,    14596k free,    13356k buffers
Swap:   524284k total,     3116k used,   521168k free,  1012272k cached

Notice low swap, CPUs mostly idle; that's why I think we're I/O bound instead of memory or CPU bound.

Here's iostat (I've obfuscated the server name):

$ iostat -x 1 3
Linux 2.6.35.2-xenU (our.server.com)     03/25/11        _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.75    0.19    0.50    0.31    0.01   97.24

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.01    11.52    2.19    3.18   145.12   117.55    48.97     0.08   15.60   1.67   0.90
xvdap9            0.01     0.01    0.00    0.00     0.10     0.14    62.62     0.00   13.20   6.09   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdap9            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdap9            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

iotop won't run on this box:

$ iotop
Could not run iotop as some of the requirements are not met:
- Linux >= 2.6.20 with I/O accounting support (CONFIG_TASKSTATS, CONFIG_TASK_DELAY_ACCT, CONFIG_TASK_IO_ACCOUNTING): Not found
- Python >= 2.5 or Python 2.4 with the ctypes module: Found

ps seldom finds any processes in the D state:

$ sudo ps -eo pid,user,state,cmd | awk '$3 ~ /D/ { print $0 }'
  976 root     D [kjournald]
$ sudo ps -eo pid,user,state,cmd | awk '$3 ~ /D/ { print $0 }'
$ sudo ps -eo pid,user,state,cmd | awk '$3 ~ /D/ { print $0 }'
$ 
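A single ps run is only one snapshot, so short-lived R or D tasks are easy to miss. Below is a rough sketch of repeated sampling. It assumes Linux, where the load average counts runnable (R) and uninterruptible (D) tasks, and a procps-style ps; the function names filter_load_states and sample_load_states are made up for illustration.

```shell
#!/bin/sh
# Keep only rows whose state column starts with R (runnable) or
# D (uninterruptible sleep). Expects "pid user state cmd" lines on stdin.
filter_load_states() {
    awk '$3 ~ /^[RD]/ { print }'
}

# Take one ps sample per second for N seconds (default 30) and print
# every R or D task seen. Separate -o options with "=" suppress headers.
sample_load_states() {
    n=${1:-30}
    i=0
    while [ "$i" -lt "$n" ]; do
        ps -e -o pid= -o user= -o state= -o cmd= | filter_load_states
        sleep 1
        i=$((i + 1))
    done
}

# Possible usage: tally which commands show up most often over a minute.
#   sample_load_states 60 | awk '{ print $3, $4 }' | sort | uniq -c | sort -rn
```

Each ps invocation will usually catch itself in the R state, so those rows can be ignored.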

What's my next troubleshooting step?

ETA: I ran vmstat:

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0   3116 509372  22880 773232    0    0    18    15   24   14  2  0 97  0

That wa value of 0 makes me wonder if I/O is really the problem.
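One caveat about a single vmstat call: run without an interval, vmstat prints one report containing averages since boot, so a wa of 0 there says little about what is happening right now. Here is a minimal sketch for watching live samples instead. The helper name live_vmstat is hypothetical, and it assumes the procps column layout shown above, where wa is the 16th field.

```shell
#!/bin/sh
# Print r (runnable), b (blocked, uninterruptible) and wa (I/O wait %)
# for each live sample. Skips the two header lines and the first data
# row, which holds since-boot averages rather than a current reading.
live_vmstat() {
    awk 'NR > 3 { print "r=" $1, "b=" $2, "wa=" $16 }'
}

# Possible usage: ten one-second samples, with -n so the header is
# printed only once.
#   vmstat -n 1 10 | live_vmstat
```

If r and b stay near zero across many samples while the load average remains above 1, the tasks contributing to load are probably too brief to show up at one-second resolution.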

Also, yes, I know load in the 1.x range isn't really a problem – but this app has a history of ramping up load until it chokes, and if I can track the source while it still has a low fever I might spare a fatality (to torture a metaphor).

Best Answer

I would recommend searching for anything not in the S (sleeping) state. It's possible you've got zombie processes, which can get counted as running despite not really doing anything.

$ ps -eo pid,user,state,cmd | awk '$3 !~ /S/ {print $0}'

This will show any non-sleeping processes (running, waiting on I/O, zombied, etc.).

It's worth noting that your load average isn't terribly alarming. Assuming the box has two or more cores (your iostat header reports 2 CPUs), there's plenty of CPU power to go around. It's still worth looking into, though, if you don't expect one or two processes to be running at any given time.
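To put the load figure in proportion, it can be divided by the CPU count. This is only a sketch: the helper name load_per_cpu is invented here, and it assumes a Linux /proc filesystem. The CPU count is taken from /proc/cpuinfo because nproc may not be available on a release as old as 10.04.

```shell
#!/bin/sh
# Divide the 1-minute load average by the number of CPUs.
#   $1: a line in /proc/loadavg format, e.g. "1.80 1.83 1.73 1/82 12345"
#   $2: CPU count
load_per_cpu() {
    echo "$1" | awk -v n="$2" '{ printf "%.2f\n", $1 / n }'
}

# Possible usage on the live system:
#   load_per_cpu "$(cat /proc/loadavg)" "$(grep -c '^processor' /proc/cpuinfo)"
```

A result that stays well below 1.0 suggests the CPUs are not saturated; a value that climbs toward or past 1.0 is the point at which queuing becomes likely.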


--Christopher Karel