Linux – Diagnosing high CPU waiting


I have a monitoring server running icinga/collectd/graphite for about 50 hosts. I have noticed high load and sluggish performance on the box. If you take a look at top, you'll see:

Cpu(s): 0.6%us, 0.2%sy, 0.0%ni, 7.6%id, 23.4%wa, 0.0%hi, 0.2%si, 0.0%st

Notice the HUGE %wa value, which as far as I know means a network or disk bottleneck. ifconfig shows no dropped packets and there isn't much bandwidth in use, so that leaves disk issues, right? There's not a lot of disk writing going on either: iotop reports that we're only writing a little over 1 MB per second, and the RAID tool reports that everything is A-OK and write caching is enabled.

How do I go about figuring out what is causing this and how to fix it?

UPDATE:
iostat -x output is:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.62    0.10    0.31    9.65    0.00   89.31

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.21    33.34   83.55   16.54  1599.94   399.07    19.97    43.21  416.98   3.71  37.13
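
For reference, the sda row above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes rsec/s and wsec/s count 512-byte sectors (the usual sysstat convention), and the helper names are made up for this example. It converts the sector rates to MB/s and uses Little's law (requests in flight = request rate × time per request) to estimate the queue depth, which should land near the reported avgqu-sz.

```python
# Sketch: interpret the sda row of the `iostat -x` output above.
# Assumption: rsec/s and wsec/s are 512-byte sectors (sysstat convention).
# parse_row and summarize are hypothetical helper names.

HEADER = ("rrqm/s wrqm/s r/s w/s rsec/s wsec/s "
          "avgrq-sz avgqu-sz await svctm %util").split()
ROW = "sda 0.21 33.34 83.55 16.54 1599.94 399.07 19.97 43.21 416.98 3.71 37.13"


def parse_row(row):
    """Map each iostat column name to its float value for one device row."""
    device, *values = row.split()
    return device, dict(zip(HEADER, map(float, values)))


def summarize(stats, sector_bytes=512):
    """Derive throughput and an estimated queue depth from one row."""
    iops = stats["r/s"] + stats["w/s"]
    return {
        "iops": iops,
        "read_mb_s": stats["rsec/s"] * sector_bytes / 1e6,
        "write_mb_s": stats["wsec/s"] * sector_bytes / 1e6,
        # Little's law: requests in flight = arrival rate * time in system.
        # await is in milliseconds, so divide by 1000 to get seconds.
        "est_queue": iops * stats["await"] / 1000.0,
    }


device, stats = parse_row(ROW)
summary = summarize(stats)
for key, value in summary.items():
    print(f"{device} {key}: {value:.2f}")
```

On these numbers the disk is moving roughly 1 MB/s in total at about 100 requests per second, yet each request spends around 417 ms in the system against a 3.71 ms service time. That pattern suggests most of the time is spent queued rather than being serviced, which fits the avgqu-sz of about 43.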

Best Answer

I/O wait is also generated by NFS, SMB and other remote filesystems, not only by local disks.
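
One quick way to check whether any remote filesystems are mounted is to scan /proc/mounts for network filesystem types. This is a minimal sketch: the function name, the set of type names (which is illustrative and not exhaustive), and the sample mount table are all assumptions made for the example.

```python
# Sketch: list remote filesystems from /proc/mounts-style text.
# Assumption: REMOTE_TYPES is an illustrative, incomplete set of type names.
REMOTE_TYPES = {"nfs", "nfs4", "cifs", "smbfs", "smb3"}


def remote_mounts(mounts_text):
    """Return (source, mountpoint, fstype) for each remote mount found."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: source, mountpoint, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] in REMOTE_TYPES:
            found.append((fields[0], fields[1], fields[2]))
    return found


# Hypothetical mount table, used here only so the example is self-contained.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
filer:/export/graphs /mnt/graphs nfs4 rw,vers=4.1 0 0
//fileserver/share /mnt/share cifs rw 0 0
"""

for source, mountpoint, fstype in remote_mounts(SAMPLE):
    print(f"{fstype}: {source} on {mountpoint}")
```

On a real host, pass the contents of /proc/mounts instead of SAMPLE, for example `remote_mounts(open("/proc/mounts").read())`. An empty result makes a remote filesystem an unlikely source of the wait.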

Use vmstat 2 to get a more granular view of system performance, including I/O wait (the wa column); it prints a new sample every 2 seconds.
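
Tools such as top and vmstat derive the I/O wait percentage from the CPU tick counters in /proc/stat, sampled at two points in time. The sketch below shows that calculation under stated assumptions: the function names are made up, and the two counter lines are hypothetical values chosen to reproduce a 23.4% wait share like the one in the question.

```python
# Sketch: I/O-wait share between two samples of the aggregate "cpu" line
# in /proc/stat. Field order after the label is:
#   user nice system idle iowait irq softirq steal [guest guest_nice]
# The guest fields are skipped because guest time is already counted in user.


def cpu_ticks(stat_line):
    """Return the first eight tick counters from a /proc/stat 'cpu' line."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Percentage of elapsed CPU ticks spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(cpu_ticks(before), cpu_ticks(after))]
    total = sum(deltas)
    # iowait is the fifth counter (index 4).
    return 100.0 * deltas[4] / total if total else 0.0


# Hypothetical counters taken a couple of seconds apart.
BEFORE = "cpu  1000 10 500 80000 9000 0 20 0 0 0"
AFTER = "cpu  1006 10 502 80756 9234 0 22 0 0 0"

print(f"iowait: {iowait_percent(BEFORE, AFTER):.1f}%")
```

On a live system, read the first line of /proc/stat, wait a couple of seconds, read it again, and pass both lines to `iowait_percent`. Repeating that in a loop approximates what the wa column of vmstat 2 reports.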