My server is showing a high load average, and after investigating I found that there is a lot of I/O caused by the RAID array.
The server has an i7-3770 processor, 32 GB of RAM, and 2x 3 TB disks, running CentOS 7 with a software RAID setup.
[root@server ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[1] sdb3[0]
1073610752 blocks super 1.2 [2/2] [UU]
[===============>.....] check = 77.3% (830580032/1073610752) finish=333.7min speed=12133K/sec
bitmap: 4/8 pages [16KB], 65536KB chunk
md3 : active raid1 sda4[1] sdb4[0]
1839090112 blocks super 1.2 [2/2] [UU]
bitmap: 3/14 pages [12KB], 65536KB chunk
md0 : active raid1 sda1[1] sdb1[0]
16760832 blocks super 1.2 [2/2] [UU]
resync=DELAYED
md1 : active raid1 sda2[1] sdb2[0]
523712 blocks super 1.2 [2/2] [UU]
resync=DELAYED
unused devices: <none>
This check started automatically; when I noticed it 12 hours ago it was at 54%. I have checked the disks' health, and my server provider also tested them two days ago, because I was convinced the disks were causing the high load average on my server.
When I check which processes are in uninterruptible sleep (state D), I get the output below, and every time I run it one of the RAID processes is listed:
[root@server ~]# top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D (I/O wait probably): "count}'
top - 08:38:38 up 1 day, 16:23, 3 users, load average: 6.33, 6.32, 6.22
Tasks: 288 total, 2 running, 280 sleeping, 4 stopped, 2 zombie
%Cpu(s): 3.9 us, 0.7 sy, 0.3 ni, 76.6 id, 18.6 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32460092 total, 265352 free, 7304544 used, 24890196 buff/cache
KiB Swap: 16760828 total, 16727480 free, 33348 used. 24434784 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
387 root 20 0 0 0 0 D 0.0 0.0 0:24.48 kworker/u16:4
545 root 20 0 0 0 0 D 0.0 0.0 1:14.82 jbd2/md2-8
449624 root 25 5 0 0 0 D 0.0 0.0 5:48.69 md2_resync
Total status D (I/O wait probably): 3
Is this normal behavior? Is it a software or a hardware problem?
I suspect it is what is slowing my server down, because when I check the top processes there is no process with excessive CPU consumption, yet the load average is almost always above 6.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
899323 mysql 20 0 30.285g 4.844g 9304 S 1.7 15.6 86:07.46 mysqld
477 root 20 0 0 0 0 S 0.7 0.0 0:09.68 md0_raid1
3359 root 30 10 277464 33136 2712 S 0.7 0.1 12:37.91 python2.7
310858 mailnull 20 0 77356 7824 3856 D 0.7 0.0 0:00.03 exim
18 root 20 0 0 0 0 S 0.3 0.0 1:42.94 rcuos/0
407 root 0 -20 0 0 0 S 0.3 0.0 0:08.27 kworker/+
625 root 20 0 94284 53560 53372 S 0.3 0.2 1:32.82 systemd-+
3504 root 20 0 216748 27800 5324 S 0.3 0.1 1:10.35 httpd
309919 nobody 20 0 217164 25440 2680 S 0.3 0.1 0:00.04 httpd
Right now, after the top command above, this is the output of uptime:
[root@server ~]# uptime
17:47:19 up 2 days, 1:32, 1 user, load average: 5.87, 6.23, 6.06
UPDATE
Here is /proc/mdstat after the RAID check finished:
[root@server ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[1] sdb3[0]
1073610752 blocks super 1.2 [2/2] [UU]
bitmap: 4/8 pages [16KB], 65536KB chunk
md3 : active raid1 sda4[1] sdb4[0]
1839090112 blocks super 1.2 [2/2] [UU]
bitmap: 11/14 pages [44KB], 65536KB chunk
md0 : active raid1 sda1[1] sdb1[0]
16760832 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[1] sdb2[0]
523712 blocks super 1.2 [2/2] [UU]
unused devices: <none>
Can I do something to fix it?
Best Answer
This is down (at least on CentOS 6; I don't have a C7 box to hand, as systemd still gives me hives) to the file /etc/cron.d/raid-check. This schedules a RAID scrub once a week. It isn't supposed to conflict with real use of the HDDs, but even a perfectly submissive algorithm will still have some backoff time when the system I/O steps up massively under new load.
You are free to run that job less often, or indeed not at all, by editing that file (or by disabling it in /etc/sysconfig/raid-check). If you think you are having actual disc problems, it's probably best to disable it while you test that hypothesis (though make sure your backups are up to date and that you have tested your restores!). Once you've determined what's going on, it's probably best to re-enable it. I'd run it at least monthly.
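As a rough sketch, and assuming a stock layout (the exact contents of these files can vary between releases, so check your own copies first), rescheduling or disabling the scrub could look like this:
# /etc/cron.d/raid-check -- the default entry is typically a weekly run such as:
#   0 1 * * Sun root /usr/sbin/raid-check
# To scrub monthly instead, change the schedule, e.g. 1 a.m. on the 1st of each month:
0 1 1 * * root /usr/sbin/raid-check
# /etc/sysconfig/raid-check -- to switch the scrub off entirely while you test:
ENABLED=no
Cron picks up changes to files in /etc/cron.d on its own, so no service restart should be needed after editing.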