Software RAID – How to Boot After RAID Failure

Tags: grub, mdadm, software-raid

Previously I had a software RAID (mdadm) set up across drives sda and sdb. When sdb failed, the only way I could reboot the system was to unplug the failed drive.

Now I've added fresh sdb and sdc drives to my RAID setup. sda is the oldest (so the most likely to fail), and it is the drive we boot from (I think; how can I check?).

How can I ensure, and test (through GRUB configuration, etc.), that if sda fails I will still be able to boot the machine?

fdisk -l:

Disk /dev/sda: 250.0 GB, 250000000000 bytes
255 heads, 63 sectors/track, 30394 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000080

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1       30064   241489048+  fd  Linux raid autodetect
/dev/sda2           30065       30394     2650725    5  Extended
/dev/sda5           30065       30394     2650693+  fd  Linux raid autodetect

Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1       30064   241489048+  fd  Linux raid autodetect
Partition 1 does not start on physical sector boundary.
/dev/sdb2           30065       30394     2650725    5  Extended
/dev/sdb5           30065       30394     2650693+  fd  Linux raid autodetect
Partition 5 does not start on physical sector boundary.

Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1       30064   241489048+  fd  Linux raid autodetect
Partition 1 does not start on physical sector boundary.
/dev/sdc2           30065       30394     2650725    5  Extended
/dev/sdc5           30065       30394     2650693+  fd  Linux raid autodetect
Partition 5 does not start on physical sector boundary.

Disk /dev/md0: 247.3 GB, 247284695040 bytes
2 heads, 4 sectors/track, 60372240 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Alignment offset: 512 bytes
Disk identifier: 0x00000000

Disk /dev/md0 doesn't contain a valid partition table

Disk /dev/md1: 2714 MB, 2714238976 bytes
2 heads, 4 sectors/track, 662656 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Alignment offset: 512 bytes
Disk identifier: 0x00000000

Disk /dev/md1 doesn't contain a valid partition table

Best Answer

This is an old chestnut. The short answer is that "grub-install" is often the wrong answer for software RAID. Here is an example from a system with a 3-way RAID-1 array whose /boot partition lives on /dev/md0. The session below installs GRUB into the MBR of each disk, so that if one disk fails you can still boot from one of the others.

# grub
grub> find /grub/stage1
 (hd0,0)
 (hd1,0)
 (hd2,0)
grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> device (hd0) /dev/sdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
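
Note the trick: before each setup, (hd0) is remapped to a different physical disk, because whichever disk the BIOS ends up booting from will be presented as the first BIOS drive. As for the "which drive do we boot from" part of the question: the BIOS boot order decides which disk is read first, but you can at least confirm which MBRs contain GRUB stage1 with a rough check like this (a sketch; the string match is a heuristic, not a guarantee):

for d in /dev/sda /dev/sdb /dev/sdc; do
    printf '%s: ' "$d"
    # GRUB legacy stage1 embeds the string "GRUB" in its boot-sector messages
    dd if="$d" bs=512 count=1 2>/dev/null | strings | grep -c GRUB
done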

Later versions of GRUB (GRUB 2) handle this much more gracefully, but CentOS 6 / RHEL 6 still ship with the older GRUB legacy.
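
For reference, on a GRUB 2 distribution the same goal is just a matter of running the installer against each member disk. A sketch, assuming the same three device names (on Debian-family systems the command is grub-install rather than grub2-install):

for d in /dev/sda /dev/sdb /dev/sdc; do
    grub2-install "$d"    # grub-install on Debian/Ubuntu
done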

To test: change the "timeout=5" value in your grub.conf file (under /boot) to something like timeout=30 so you have time to watch the boot. Then swap the positions of two of the drives before powering the system back on, or change the boot order of the hard drives in the BIOS.
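
You can also exercise the failure path from software by deliberately degrading the array with mdadm (assuming you accept running degraded until the resync finishes). This verifies that the machine boots and assembles the array with a member missing; it complements, rather than replaces, the drive-swap or BIOS-order test, which is what proves the other disks' boot sectors actually work. A sketch using this system's device names:

mdadm /dev/md0 --fail /dev/sda1      # mark the member faulty
mdadm /dev/md0 --remove /dev/sda1    # pull it out of the array
# reboot and confirm the system comes up with md0 degraded
mdadm /dev/md0 --add /dev/sda1       # re-add it and let it resync
cat /proc/mdstat                     # watch the rebuild progress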

(Naturally... make sure you have good backups and know how to put it back to the correct configuration. Testing this out on a throwaway system is always a good idea.)