Linux – Why are CentOS RAID-1/mirror partitions syncing multiple times

linux mirror raid synchronization

I'm setting up a CentOS 5.6 server, using Kickstart. I have four disk drives, sda-sdd. These are the relevant Kickstart lines:

clearpart --linux --drives=sda,sdb,sdc,sdd --initlabel
part raid.11 --size 102400 --asprimary --ondrive=sda
part raid.21 --size 16384 --asprimary --ondrive=sda
part raid.31 --size 1024 --asprimary --grow --ondrive=sda
part raid.12 --size 102400 --asprimary --ondrive=sdb
part raid.22 --size 16384 --asprimary --ondrive=sdb
part raid.32 --size 1024 --asprimary --grow --ondrive=sdb
part raid.41 --size 1024 --asprimary --grow --ondrive=sdc
part raid.42 --size 1024 --asprimary --grow --ondrive=sdd
raid / --fstype ext3 --device md0 --level=RAID1 raid.11 raid.12
raid swap --device md1 --level=RAID1 raid.21 raid.22
raid /data1 --fstype ext3 --device md2 --level=RAID1 raid.31 raid.32
raid /data2 --fstype ext3 --device md3 --level=RAID1 raid.41 raid.42

Basically I partitioned the first two disks for root, swap, and a data partition, and gave the 3rd and 4th disks one big data partition each. I set up the first two disks to mirror each other, and the last two disks to mirror each other. Very straightforward.

The odd behavior I'm seeing is that when I boot up the newly-installed machine, it syncs some of the raids more than once. Here are the relevant /var/log/messages lines:

Sep  5 15:09:22 dsp-hw1 kernel: md: md3: raid array is not clean -- starting background reconstruction
Sep  5 15:09:24 dsp-hw1 kernel: md: syncing RAID array md3
Sep  5 15:09:39 dsp-hw1 kernel: md: md2: raid array is not clean -- starting background reconstruction
Sep  5 15:09:39 dsp-hw1 kernel: md: syncing RAID array md2
Sep  5 15:09:53 dsp-hw1 kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep  5 15:09:53 dsp-hw1 kernel: md: delaying resync of md0 until md2 has finished resync (they share one or more physical units)
Sep  5 15:30:12 dsp-hw1 kernel: md: md2: sync done.
Sep  5 15:30:12 dsp-hw1 kernel: md: syncing RAID array md0
Sep  5 15:40:37 dsp-hw1 kernel: md: md3: sync done.
Sep  5 15:42:02 dsp-hw1 kernel: md: md0: sync done.
Sep  5 16:16:03 dsp-hw1 kernel: md: syncing RAID array md1
Sep  5 16:16:03 dsp-hw1 kernel: md: syncing RAID array md3
Sep  5 16:16:03 dsp-hw1 kernel: md: delaying resync of md2 until md1 has finished resync (they share one or more physical units)
Sep  5 16:18:10 dsp-hw1 kernel: md: md1: sync done.
Sep  5 16:18:10 dsp-hw1 kernel: md: syncing RAID array md2
Sep  5 16:43:31 dsp-hw1 kernel: md: md2: sync done.
Sep  5 16:54:57 dsp-hw1 kernel: md: md3: sync done.

So it starts syncing md2 and md3 in parallel (pretty reasonable), then syncs md0 once md2 is done (again pretty reasonable), and md1 about half an hour after md0 finishes. So far so good. Then, for no reason I can see, it starts another sync of md3 at the same moment it starts md1, and follows up with another sync of md2 once md1 is done. These are full syncs that take at least as long as the originals; longer, actually. md3's syncs run for about 31 and 39 minutes respectively, and md2's for about 21 and 25 minutes.
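To double-check the sync durations against the log excerpt, here is a minimal Python sketch that pairs each "syncing RAID array" line with the matching "sync done" line for the same array. The function name sync_durations and the placeholder year are my own assumptions (syslog timestamps carry no year); the log lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Start/done lines copied from the /var/log/messages excerpt in the question.
LOG = """\
Sep  5 15:09:24 dsp-hw1 kernel: md: syncing RAID array md3
Sep  5 15:09:39 dsp-hw1 kernel: md: syncing RAID array md2
Sep  5 15:30:12 dsp-hw1 kernel: md: md2: sync done.
Sep  5 15:30:12 dsp-hw1 kernel: md: syncing RAID array md0
Sep  5 15:40:37 dsp-hw1 kernel: md: md3: sync done.
Sep  5 15:42:02 dsp-hw1 kernel: md: md0: sync done.
Sep  5 16:16:03 dsp-hw1 kernel: md: syncing RAID array md1
Sep  5 16:16:03 dsp-hw1 kernel: md: syncing RAID array md3
Sep  5 16:18:10 dsp-hw1 kernel: md: md1: sync done.
Sep  5 16:18:10 dsp-hw1 kernel: md: syncing RAID array md2
Sep  5 16:43:31 dsp-hw1 kernel: md: md2: sync done.
Sep  5 16:54:57 dsp-hw1 kernel: md: md3: sync done.
"""

START = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) .*md: syncing RAID array (md\d+)")
DONE = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) .*md: (md\d+): sync done\.")


def parse_ts(stamp):
    # Syslog omits the year, so a placeholder year (an assumption) is
    # prepended; only differences between timestamps are used.
    normalized = " ".join(stamp.split())
    return datetime.strptime("2000 " + normalized, "%Y %b %d %H:%M:%S")


def sync_durations(lines):
    """Return (array, timedelta) pairs in the order the syncs finished."""
    started = {}
    runs = []
    for line in lines:
        m = START.match(line)
        if m:
            started[m.group(2)] = parse_ts(m.group(1))
            continue
        m = DONE.match(line)
        if m and m.group(2) in started:
            begin = started.pop(m.group(2))
            runs.append((m.group(2), parse_ts(m.group(1)) - begin))
    return runs


if __name__ == "__main__":
    for name, delta in sync_durations(LOG.splitlines()):
        print(name, delta)
```

Feeding it the full /var/log/messages instead of the excerpt should work the same way, since lines that are neither a start nor a done message are ignored.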

So the question is, why does it need to sync anything more than once? I haven't seen it start a third one (yet), but I'm not sure whether to expect it. More importantly, I don't know if this indicates some kind of a problem that I should fix before putting the system into a production environment. I don't see anything looking like an error in any of the logs, nothing indicating a problem with the first sync, nothing looking abnormal at all, actually, except for the odd extra syncs.

Can anyone shed light on this?


Update: In trying to diagnose this, I've redone the kickstart a few times, and I've noticed that the behavior isn't 100% reproducible. Once (only once) it never synced the swap partition (md1) at all. On that occasion, it synced each of the other partitions only once.

Perhaps it's some race condition exacerbated by multiple raid partitions set up simultaneously on the same physical disks?

Best Answer

My machine (CentOS 5.6, 3 RAID partitions on 2 physical disks) runs a job called raid-check every week. In /var/log/messages, that check is basically the only time I see md sync activity.

raid-check is part of the mdadm package and is enabled by /etc/sysconfig/raid-check. The script itself is in /etc/cron.weekly and is named 99-raid-check. If you look at the script, you'll see some of the /proc and /sys files that indicate whether the array is considered clean, is syncing, etc. Maybe there are some clues in there.
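If you want to look at those /sys files without reading the script, something like the following Python sketch may help. It assumes your kernel exposes the md sysfs attributes array_state, sync_action and mismatch_cnt under /sys/block/mdN/md/ (they are documented for the kernel md driver, but I have not checked which ones a stock CentOS 5.6 kernel provides); the function name md_status is made up for illustration.

```python
from pathlib import Path

# Attribute names from the kernel md driver documentation; whether each one
# exists on a given kernel is an assumption, so missing files are tolerated.
MD_ATTRS = ("array_state", "sync_action", "mismatch_cnt")


def md_status(sys_block="/sys/block"):
    """Collect a few md sysfs attributes for every mdN device found."""
    status = {}
    for md_dir in sorted(Path(sys_block).glob("md*/md")):
        attrs = {}
        for name in MD_ATTRS:
            try:
                attrs[name] = (md_dir / name).read_text().strip()
            except OSError:
                # Attribute missing or unreadable on this kernel.
                attrs[name] = None
        status[md_dir.parent.name] = attrs
    return status


if __name__ == "__main__":
    for device, attrs in md_status().items():
        print(device, attrs)
```

Running it right after boot and again during one of the unexpected syncs would show whether the arrays are considered clean and what kind of sync action the kernel thinks it is performing.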

I'm sure you've already looked, but is /proc/mdstat OK?