I have a Nagios server that was working perfectly until a few days ago. I shut it down to increase its RAM and restarted it, and since then iowait on the server has increased dramatically (to more than 20%; it was below 1% before). I have tried restoring the original amount of RAM, but the problem persists.
I have read many similar iowait questions on Server Fault, but I have not managed to find the explanation for my case:
Looking at iotop, I see very high I/O percentages for pdflush, which writes dirty page-cache pages back to disk, and for kjournald, which handles journaling for the ext3 filesystem. I don't know whether that is normal. Following advice in other Server Fault questions, I have added noatime in fstab. The ext3 filesystem is mounted in ordered data mode.
Total DISK READ: 0.00 B/s | Total DISK WRITE: 210.44 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
650 be/3 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [kjournald]
11482 be/4 root 0.00 B/s 0.00 B/s 0.00 % 98.42 % [pdflush]
12167 be/4 nagios 0.00 B/s 0.00 B/s 0.00 % 0.12 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
11 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.10 % [migration/3]
12168 be/4 nagios 0.00 B/s 0.00 B/s 0.02 % 0.08 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
12165 be/4 nagios 0.00 B/s 0.00 B/s 98.42 % 0.02 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
2600 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.02 % auditd
12164 be/4 nagios 0.00 B/s 0.00 B/s 0.00 % 0.00 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
8 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/2]
20 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/6]
26 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [events/0]
23 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/7]
3047 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % snmpd -Ln -Lf /dev/null -p /var/run/snmpd.pid -a
12169 be/4 nagios 0.00 B/s 0.00 B/s 0.12 % 0.00 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
14 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/4]
2601 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % auditd
5 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1]
17 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/5]
5228 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % bash
10 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/2]
13 rt/3 root 0.00 B/s 0.00 B/s 0.10 % 0.00 % [watchdog/3]
The following line
12165 be/4 nagios 0.00 B/s 0.00 B/s 98.42 % 0.02 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
seems quite surprising: how can SWAPIN be at 98.42% when almost no swap is in use?
free -o
total used free shared buffers cached
Mem: 4046468 3163796 882672 0 103548 2193604
Swap: 4192956 1572 4191384
top doesn't show anything specific, except high load and high iowait:
top - 10:07:56 up 12 days, 23:42, 4 users, load average: 8.60, 9.29, 9.85
Tasks: 177 total, 1 running, 176 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.0%sy, 0.0%ni, 77.2%id, 22.6%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4046468k total, 3165500k used, 880968k free, 104204k buffers
Swap: 4192956k total, 1572k used, 4191384k free, 2201500k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5246 root 15 0 14252 2632 836 R 0.3 0.1 0:03.94 top
1 root 15 0 10372 696 584 S 0.0 0.0 0:03.61 init
2 root RT -5 0 0 0 S 0.0 0.0 0:14.80 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.73 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:13.93 migration/1
6 root 34 19 0 0 0 S 0.0 0.0 0:01.75 ksoftirqd/1
7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
8 root RT -5 0 0 0 S 0.0 0.0 0:09.51 migration/2
9 root 34 19 0 0 0 S 0.0 0.0 0:01.09 ksoftirqd/2
10 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
11 root RT -5 0 0 0 S 0.0 0.0 0:08.98 migration/3
12 root 34 19 0 0 0 S 0.0 0.0 0:01.46 ksoftirqd/3
13 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
14 root RT -5 0 0 0 S 0.0 0.0 0:20.36 migration/4
15 root 34 19 0 0 0 S 0.0 0.0 0:01.15 ksoftirqd/4
16 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/4
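For reference, the %wa figure in the top header can be cross-checked against the kernel counters in /proc/stat. The following is only a minimal sketch, assuming Python 3 and the usual field order of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for illustration.

```python
def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Assumed order: user nice system idle iowait irq softirq steal.
    # Newer kernels append guest fields that are already counted in user,
    # so only the first eight counters are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_sample(path="/proc/stat"):
    """Read one sample; the first line of /proc/stat is the aggregate line."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Taking two samples with read_cpu_sample() a few seconds apart and passing them to iowait_percent() should give a number comparable to the wa column, averaged over that interval.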
Stopping the nagios process brings the system load back to normal (i.e. < 1), but I still get high iowait.
In atop, the disk (DSK) shows as 100% busy, even with no nagios process running. Could I have a hard drive problem? (It's a Western Digital Green, which is not meant to run in a server like this.) I see no unusual messages in dmesg or syslog.
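A busy percentage like the one atop reports can be approximated from /proc/diskstats, which keeps a per-device counter of milliseconds spent doing I/O. This is a rough sketch, assuming Python 3 and the standard diskstats layout; the device name passed in (for example "sda") and the function names are assumptions, not taken from this server.

```python
def io_ticks_ms(diskstats_text, device):
    """Return the 'milliseconds spent doing I/O' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the counters. The tenth counter
        # (index 12 overall) is the time spent doing I/O, in milliseconds.
        # Short lines (partitions on older kernels) are skipped.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def busy_percent(ticks_before, ticks_after, elapsed_seconds):
    """Fraction of the interval during which the device had I/O in flight."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    pct = 100.0 * (ticks_after - ticks_before) / (elapsed_seconds * 1000.0)
    # Sampling jitter can push the ratio slightly past 100.
    return min(pct, 100.0)
```

Reading /proc/diskstats twice, a known number of seconds apart, and feeding both counter values to busy_percent() gives the figure. A value pinned near 100% while throughput stays in the low hundreds of KB/s would suggest the disk itself is slow to complete requests rather than being flooded with work.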
Best Answer
Oh, I'm sorry. Are you using a WD Green disk for something other than a desktop PC?
Don't.
They're slow, unreliable (they'll go to sleep and drop out of a RAID array), and totally unsuitable for what you want to do.
If you're experiencing high IOWait, that means the disk subsystem isn't able to handle the amount of disk IO that's required.
The easy way to resolve that is to add more disks (ideally a whole bunch in a RAID6 array).
You should also check general disk health with smartctl and take a backup (you should do this regularly anyway, but with an over-used WD Green I'd be extra cautious).
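As a sketch of what the smartctl check could look like when scripted: `smartctl -H /dev/sda` prints the overall health verdict and `smartctl -A /dev/sda` prints the attribute table. The Python 3 snippet below parses that table from text you pass in. The device path, the function names and the choice of watched attributes are assumptions for illustration, and a clean result does not prove the disk is healthy.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(report):
    """Map attribute name to raw value from `smartctl -A` text output."""
    attrs = {}
    in_table = False
    for line in report.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True  # header row: the attribute rows follow
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the table
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                # RAW_VALUE is the tenth column; any trailing text is ignored.
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value is not a plain integer, skip it
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
```

Non-zero raw values for reallocated or pending sectors are a common reason to replace a disk soon, which is why those are the ones picked out here.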