I am on an Amazon EC2 large instance running Ubuntu 11.04, with a 150GB volume (ext4) mounted for the database.
CPU usage is very low, but the load average has been consistently at 2.0 for about a day now. I used to have the database partition on a 40GB volume and did not have this problem.
iostat tells me we are spending a lot of time waiting for I/O (42.58% iowait in the one-second sample below):
    :~$ iostat 1 2
    Linux 2.6.38-11-virtual (flashgroup)  04/05/2012  _x86_64_  (2 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               7.16    0.09    2.62    1.11    2.09   86.92

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    xvdap1            3.45         0.88        18.59    9137065  192742888
    xvdb              4.47         2.84        24.17   29479675  250638760
    xvdh             10.62        19.95        88.05  206811124  912892410
    xvdf              0.18         0.00         1.93       1378   19971464
    xvdg              0.00         0.00         0.00        656          0

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               5.22    0.00    1.92   42.58    3.02   47.25

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    xvdap1            0.00         0.00         0.00          0          0
    xvdb             43.00         0.00       172.00          0        172
    xvdh              0.00         0.00         0.00          0          0
    xvdf             49.00         0.00       288.00          0        288
    xvdg              0.00         0.00         0.00          0          0
The product is performing just fine and the database is not logging any slow queries…
How should I go about debugging this?
EDIT:
It turns out that none of the volumes is exhibiting high latency, and all other aspects of the system seem to be healthy. Wikipedia tells me that Linux counts processes in the uninterruptible sleep state (D) toward the load average. ps tells me that there are two hung mount commands in that state:
    ps auxww | grep " D"
    root 21557 0.0 0.0  9904 760 ? D Apr03 0:00 umount db /dev/xvdh
    root 26428 0.0 0.0 16456 912 ? D Apr03 0:00 mount /dev/xvdh /mnt/db
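A note on the grep above: the pattern " D" matches a space followed by D anywhere on the line, so it can also catch unrelated lines, such as a command whose arguments happen to contain it. A slightly more targeted sketch, assuming procps `ps` and `awk` are available, filters on the STAT column instead. The function name `d_state_only` is only an illustrative choice, not something from the original setup:

```shell
#!/bin/sh
# Sketch: keep the header row plus every row whose STAT column (field 2)
# starts with D (uninterruptible sleep). Expects input shaped like the
# output of `ps -eo pid,stat,wchan:32,args` on stdin.
# d_state_only is a hypothetical helper name.
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

On a live system this would be fed with `ps -eo pid,stat,wchan:32,args | d_state_only`. The WCHAN column names the kernel function each process is sleeping in, which can hint at what a stuck mount or umount is waiting for.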
I am afraid to kill these (probably would not even work if I tried) so I think that this instance is sick and needs a restart. Thanks for your help!
Best Answer
It turns out that none of the volumes is exhibiting high latency, and all other aspects of the system seem to be healthy. Wikipedia tells me that Linux counts processes in the uninterruptible sleep state (D) toward the load average. ps showed two hung commands in that state (a umount and a mount of /dev/xvdh), which accounts for the steady load average of 2.0.
Restarting the instance got rid of these hung processes, and the load average is back to normal.
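For anyone who wants to sanity-check this kind of explanation before rebooting, a rough sketch is to count runnable (R) and uninterruptible (D) tasks and compare the total with the load average. This assumes procps `ps` and `awk`; the helper name `count_load_states` is made up for illustration:

```shell
#!/bin/sh
# Sketch: tally process states from a list of STAT values, one per line,
# as printed by `ps -eo stat=`. On Linux, both R (running or runnable) and
# D (uninterruptible sleep) tasks count toward the load average.
# count_load_states is a hypothetical helper name.
count_load_states() {
    awk '
        $1 ~ /^R/ { r++ }
        $1 ~ /^D/ { d++ }
        END { printf "runnable=%d uninterruptible=%d\n", r, d }
    '
}
```

A possible invocation is `ps -eo stat= | count_load_states; cat /proc/loadavg`. The `ps` process itself normally shows up as runnable, and the load average is a damped average over time rather than an instantaneous count, so the numbers will only agree approximately. A persistent uninterruptible count of 2 next to a load average stuck near 2.0 would be consistent with what happened here.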