Linux – Software RAID is slowing the server

linux, raid, software-raid

My server is showing a high load average, and after investigating I found that there is a lot of I/O caused by the RAID.

The server has an i7-3770 processor, 32 GB of RAM, and 2x3 TB disks, running CentOS 7 with a software RAID setup.

[root@server ~]# cat /proc/mdstat

Personalities : [raid1]
md2 : active raid1 sda3[1] sdb3[0]
      1073610752 blocks super 1.2 [2/2] [UU]
      [===============>.....]  check = 77.3% (830580032/1073610752) finish=333.7min speed=12133K/sec
      bitmap: 4/8 pages [16KB], 65536KB chunk

md3 : active raid1 sda4[1] sdb4[0]
      1839090112 blocks super 1.2 [2/2] [UU]
      bitmap: 3/14 pages [12KB], 65536KB chunk

md0 : active raid1 sda1[1] sdb1[0]
      16760832 blocks super 1.2 [2/2] [UU]
        resync=DELAYED

md1 : active raid1 sda2[1] sdb2[0]
      523712 blocks super 1.2 [2/2] [UU]
        resync=DELAYED

unused devices: <none>

This check started automatically; it was at 54% when I noticed it 12 hours ago. I have checked the disks' health, and my server provider also tested them two days ago, because I was convinced the disks were causing the high load average.
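For reference, the state of a running check can also be queried directly through sysfs; this is a minimal sketch using the md2 array from the output above (the sysfs paths are standard for Linux md, not specific to this server):

[root@server ~]# cat /sys/block/md2/md/sync_action      # prints "check" while a scrub is running, "idle" otherwise
[root@server ~]# cat /sys/block/md2/md/sync_completed   # progress as "sectors done / sectors total"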

When I check which processes are delayed, I get this, and each time I run it one of the RAID processes is there:

[root@server ~]# top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D (I/O wait probably): "count}'
top - 08:38:38 up 1 day, 16:23,  3 users,  load average: 6.33, 6.32, 6.22
Tasks: 288 total,   2 running, 280 sleeping,   4 stopped,   2 zombie
%Cpu(s):  3.9 us,  0.7 sy,  0.3 ni, 76.6 id, 18.6 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32460092 total,   265352 free,  7304544 used, 24890196 buff/cache
KiB Swap: 16760828 total, 16727480 free,    33348 used. 24434784 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    387 root      20   0       0      0      0 D   0.0  0.0   0:24.48 kworker/u16:4
    545 root      20   0       0      0      0 D   0.0  0.0   1:14.82 jbd2/md2-8
 449624 root      25   5       0      0      0 D   0.0  0.0   5:48.69 md2_resync
Total status D (I/O wait probably): 3
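(As an aside, a shorter way to list tasks in uninterruptible sleep, if you would rather not parse top output, is ps; the awk filter below keeps the header plus any process whose state starts with D:)

[root@server ~]# ps -eo pid,user,stat,wchan:32,comm | awk 'NR==1 || $3 ~ /^D/'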

Is this normal behavior? Is this a software problem or a hardware problem?

I suspect it is slowing my server, because when I check the top processes, no process shows excessive CPU consumption, yet the load averages are almost always above 6:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 899323 mysql     20   0 30.285g 4.844g   9304 S   1.7 15.6  86:07.46 mysqld
    477 root      20   0       0      0      0 S   0.7  0.0   0:09.68 md0_raid1
   3359 root      30  10  277464  33136   2712 S   0.7  0.1  12:37.91 python2.7
 310858 mailnull  20   0   77356   7824   3856 D   0.7  0.0   0:00.03 exim
     18 root      20   0       0      0      0 S   0.3  0.0   1:42.94 rcuos/0
    407 root       0 -20       0      0      0 S   0.3  0.0   0:08.27 kworker/+
    625 root      20   0   94284  53560  53372 S   0.3  0.2   1:32.82 systemd-+
   3504 root      20   0  216748  27800   5324 S   0.3  0.1   1:10.35 httpd
 309919 nobody    20   0  217164  25440   2680 S   0.3  0.1   0:00.04 httpd

Right now, just after that top command, this is the result of uptime:

[root@server ~]# uptime
 17:47:19 up 2 days,  1:32,  1 user,  load average: 5.87, 6.23, 6.06

UPDATE

Here is /proc/mdstat now that the RAID check has finished:

[root@server ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[1] sdb3[0]
      1073610752 blocks super 1.2 [2/2] [UU]
      bitmap: 4/8 pages [16KB], 65536KB chunk

md3 : active raid1 sda4[1] sdb4[0]
      1839090112 blocks super 1.2 [2/2] [UU]
      bitmap: 11/14 pages [44KB], 65536KB chunk

md0 : active raid1 sda1[1] sdb1[0]
      16760832 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[1] sdb2[0]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Can I do something to fix it?

Best Answer

This is down to the file /etc/cron.d/raid-check (at least on CentOS 6; I don't have a C7 box to hand, as systemd still gives me hives). That file schedules a RAID scrub once a week. The scrub isn't supposed to conflict with real use of the HDDs, but even a perfectly submissive algorithm will still have some backoff time when the system I/O steps up massively under new load.
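On a stock CentOS install the cron file looks roughly like this; treat it as a sketch, since the exact schedule and comments vary between releases:

[root@server ~]# cat /etc/cron.d/raid-check
# Run system wide raid-check once a week on Sunday at 1am by default
0 1 * * Sun root /usr/sbin/raid-check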

You are free to run that job less often, or indeed not at all, by editing the file (or by disabling it in /etc/sysconfig/raid-check; see the sketch below). If you think you are having actual disc problems, it's probably best to disable the check while you test that hypothesis (though make sure your backups are up to date and that you have tested your restores!). Once you've determined what's going on, it's probably best to re-enable it; I'd run it at least monthly.
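A minimal sketch of those options, assuming the stock CentOS layout (the speed_limit value below is an arbitrary example, not a recommendation):

[root@server ~]# vi /etc/sysconfig/raid-check               # set ENABLED=no to disable the weekly scrub
[root@server ~]# echo idle > /sys/block/md2/md/sync_action  # abort a check already running on md2
[root@server ~]# sysctl -w dev.raid.speed_limit_max=50000   # or just cap check/resync bandwidth (KB/s per device)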
