How to coax RHEL / CentOS / SL 7 into booting normally with degraded software RAID 1

Tags: centos7, raid1, rhel7, scientific-linux, software-raid

I set up a new server (my first with this version of Linux). I installed a pair of 160 GB blank SATA HDDs (one Seagate and one WDC, but with exactly the same number of LBA sectors) in an old machine, and set out to install Scientific Linux 7.0 (rebranded RHEL) in a RAID 1 (software mirrored) configuration.

The first hiccup was that I couldn't figure out how to get the SL / RHEL installer (Anaconda) to set up the two drives for RAID1. So I booted from a PartedMagic CD and used it to do the partitioning.

I partitioned the two drives identically. Each drive has a big partition for RAID1+ext4 to be mounted at /, a small (currently unused) partition for RAID1+ext3 to be mounted at /safe, and a 3GB Linux Swap partition. I used fdisk to change the types of the RAID partitions on each drive to FD, and mdadm to build the RAID arrays:

mdadm --create --verbose /dev/md0 --raid-devices=2 --level=1 /dev/sda1 /dev/sdb1
mdadm --create --verbose /dev/md1 --raid-devices=2 --level=1 /dev/sda2 /dev/sdb2
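
(For anyone reproducing this, the arrays can be checked right after creation with something like the following; the device names match the ones above, but the md numbers may differ on your system:)

cat /proc/mdstat                   # both arrays should show [UU] once the initial resync finishes
mdadm --detail /dev/md0 /dev/md1   # per-array status, including the member devices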

Then I shut down, booted the SL DVD, and tried the install again. This time the installer recognized the RAID1 arrays, formatted them for ext4 & ext3, respectively, and installed smoothly.

At this point, everything seemed okay. I shut it down, started it again, and it booted fine. So far so good.

So then I tested the RAID1 functionality: I shut down the computer, removed one of the drives, and tried to boot it. I was expecting it to display some error messages about the RAID array being degraded, and then come up to the normal login screen. But it didn't work. Instead I got:

Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" to try again
to boot into default mode.
Give root password for maintenance
(or type Control-D to continue):

The same thing happens regardless of which drive is missing.

That's no good! The purpose of the mirrored drives is to ensure that the server will keep on running if one of the drives fails.

Ctrl-D just gets me back to a repeat of the same "Welcome to emergency mode" screen. So does entering my root password and then "systemctl default".

So then I tried an experiment. At the boot menu I pressed e to edit the kernel boot parameters, and changed "rhgb quiet" to "bootdegraded=true" and then booted. No joy.
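
(For clarity, that edit was on the "linux16" line in the GRUB edit screen; the kernel version and root device below are placeholders, not my exact values, and only the bootdegraded=true part is the actual change:)

linux16 /vmlinuz-<version>.el7.x86_64 root=UUID=<root-uuid> ro bootdegraded=true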

That let me see more status messages flying by, but it didn't enable the machine to boot normally when a drive was missing; it still stopped at the same "Welcome to emergency mode" screen. The following is what I saw with the Seagate drive removed and the WDC drive remaining; the last few lines looked like this (where "…." marks the places I got tired of typing):

[  OK  ] Started Activation of DM RAID sets.
[  OK  ] Reached target Encrypted Volumes.
[  14.855860] md: bind<sda2>
[  OK  ] Found device WDC_WD1600BEVT-00A23T0.
         Activating swap /dev/disk/by-uuid/add41844....
[  15.190432] Adding 3144700k swap on /dev/sda3.  Priority:-1 extents:1 across:3144700k FS
[  OK  ] Activated swap /dev/disk/by-uuid/add41844....
[ TIME ] Timed out waiting for device dev-disk-by\x2duuid-a65962d\x2dbf07....
[DEPEND] Dependency failed for /safe.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.
[DEPEND] Dependency failed for Relabel all file systems, if necessary.
[  99.299068] systemd-journald[452]: Received request to flush runtime journal from PID 1
[  99.3298059] type=1305 audit(1415512815.286:4): audit_pid=588 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" to try again
to boot into default mode.
Give root password for maintenance
(or type Control-D to continue):

So it appears that installing on RAID1 mirrored drives just doubles the chance of a drive failure bringing down the server (since there are now two drives that can fail instead of one). That is not what I was hoping to achieve with mirrored drives.

Does anyone know how to make it boot & run "normally" (with a degraded RAID1 array) when a hard disk drive fails?

Two other notes:

  1. I'm new to RHEL/SL/CentOS 7, so at the "Software Selection" screen, during the SL installation, I had to do some guessing. I chose:
    "General Purpose System" +
    FTP Server,
    File and Storage Server,
    Office Suite and Productivity,
    Virtualization Hypervisor,
    Virtualization Tools, and
    Development Tools

  2. I'm seeing some apparently-innocuous errors:

    ATAx: softreset failed (device not ready)

The "x" depends on which drives are installed. I get more of those errors with two drives installed than with only one.

Best Answer

It turns out that the problem wasn't the main RAID1 partition; it was the other partitions.

In the first place, I shouldn't have used swap partitions. That was just dumb. Even if that had worked, it still probably would have made the system vulnerable to a failure if a disk drive developed a bad block within the swap partition. It's obviously better to use a swap file on the RAID1 partition; I don't know what I was thinking.
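
(For reference, a swap file on the mirrored root filesystem can be set up with something like the following; the 3 GB size just matches the old swap partitions, and the /swapfile path is only a convention:)

dd if=/dev/zero of=/swapfile bs=1M count=3072   # 3 GB swap file living on the RAID1 / filesystem
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap defaults 0 0' >> /etc/fstab   # make it persistent across reboots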

However, the "extra" ext3 md1 partition was also a problem. I don't know why.

Once I removed the references to the other partitions (the two swap partitions and the ext3 md1 partition) from /etc/fstab, the system would boot up just fine with one drive, running the RAID1 array in degraded mode, like I wanted it to.
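
(Roughly speaking, the trimmed /etc/fstab ends up with only the root array in it; the UUID below is a placeholder, not my actual value:)

# /etc/fstab after removing the swap and /safe entries (UUID is illustrative)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /  ext4  defaults  1 1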

After I shut down and reinstalled the missing drive, I started the machine again, and it was still running on just the one drive (the returned drive isn't re-added automatically). So I ran "mdadm --add" to add the missing drive back; its state went to "spare rebuilding" for a while, and then to "active."
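
(For completeness, the re-add and rebuild look roughly like this, assuming the reinstalled drive came back as /dev/sdb:)

mdadm --add /dev/md0 /dev/sdb1   # give the returned partition back to the root array
cat /proc/mdstat                 # shows the resync progress as it rebuilds
mdadm --detail /dev/md0          # member state goes from "spare rebuilding" to "active sync"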

In other words, it's working perfectly.