Ubuntu – debugging high load average due to slow I/O on EC2

amazon-ec2, ubuntu

I am on an Amazon EC2 Ubuntu 11.04 large instance with a 150 GB volume mounted for the database (ext4).

The CPU usage is VERY low, but the load average has been consistently at 2.0 for about a day now. I used to have the database partition on a 40 GB volume and did not have this problem.
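For context, a quick way to see whether the load is coming from runnable tasks or from tasks blocked on I/O is vmstat: the r column counts runnable processes and the b column counts processes in uninterruptible sleep. This is just a generic sketch, not output from my instance:

:~$ vmstat 1 5
# a non-zero 'b' column alongside low CPU usage points at tasks stuck waiting on I/O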

iostat tells me we are spending a lot of time waiting for I/O:

:~$ iostat 1 2
Linux 2.6.38-11-virtual (flashgroup)    04/05/2012      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.16    0.09    2.62    1.11    2.09   86.92

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvdap1            3.45         0.88        18.59    9137065  192742888
xvdb              4.47         2.84        24.17   29479675  250638760
xvdh             10.62        19.95        88.05  206811124  912892410
xvdf              0.18         0.00         1.93       1378   19971464
xvdg              0.00         0.00         0.00        656          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.22    0.00    1.92   42.58    3.02   47.25

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvdap1            0.00         0.00         0.00          0          0
xvdb             43.00         0.00       172.00          0        172
xvdh              0.00         0.00         0.00          0          0
xvdf             49.00         0.00       288.00          0        288
xvdg              0.00         0.00         0.00          0          0

The product is performing just fine and the database is not logging any slow queries…

How should I go about debugging this?
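One way to dig deeper into the I/O waits is iostat's extended statistics, which report per-device latency; a rough sketch (this is what the edit below is referring to when it talks about checking latency):

# await = average time (ms) each request spends waiting and being serviced, per device
:~$ iostat -x 1 3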

EDIT:

It turns out that none of the volumes are exhibiting high latency, and all other aspects of the system seem to be healthy. Wikipedia tells me that Linux includes processes in the uninterruptible state in the load average, and ps shows that two hung mount commands are in that state:

ps auxww | grep " D"
root     21557  0.0  0.0   9904   760 ?        D    Apr03   0:00 umount db /dev/xvdh
root     26428  0.0  0.0  16456   912 ?        D    Apr03   0:00 mount /dev/xvdh /mnt/db
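To see where a process stuck in D state is actually blocked in the kernel, ps can print the wait channel, and (with root, on kernels that expose it) /proc/&lt;pid&gt;/stack shows the kernel stack. The PIDs below are just the ones from the ps output above:

# wchan = kernel function the task is currently sleeping in
ps -o pid,stat,wchan:30,cmd -p 21557,26428
# full kernel stack of the stuck umount, if available
sudo cat /proc/21557/stack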

I am afraid to kill these (it probably would not even work if I tried), so I think this instance is sick and needs a restart. Thanks for your help!

Best Answer

It turns out that none of the volumes are exhibiting high latency, and all other aspects of the system seem to be healthy. Wikipedia tells me that Linux includes processes in the uninterruptible state in the load average, and ps shows that two hung mount commands are in that state:

ps auxww | grep " D"
root     21557  0.0  0.0   9904   760 ?        D    Apr03   0:00 umount db /dev/xvdh
root     26428  0.0  0.0  16456   912 ?        D    Apr03   0:00 mount /dev/xvdh /mnt/db

Restarting the instance got rid of these hung processes, and the load average is back to normal.