Linux – How to get notified of mdadm RAID problems

linux, mdadm, raid, software-raid

I am running Ubuntu 12.04 LTS. Yesterday I found a message in my mailbox saying that my server had been shut down. I rebooted the system, but it didn't come back up even after many minutes, and without a hardware KVM I couldn't see what the kernel was printing to the console. So I booted the system into a Linux rescue image and saw that the software RAID 1 array was out of sync. The rescue system then began reconstructing the RAID array.

So far there is no evidence that any of the disks have hardware errors; their SMART status looks good.

I never received an email notification from mdadm, even though email notification is turned on in /etc/mdadm/mdadm.conf.
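
(For context, the setting I mean is the MAILADDR directive in mdadm.conf; with a placeholder address it looks roughly like this:)

# /etc/mdadm/mdadm.conf (excerpt) -- the address is a placeholder
MAILADDR root@example.com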

This server was also configured to forward all syslog messages to a log host, so I checked my log host. The relevant parts are:

May 20 15:38:40 kernel: [    1.869825] md0: detected capacity change from 0 to 536858624
May 20 15:38:40 kernel: [    1.870687]  md0: unknown partition table
May 20 15:38:40 kernel: [    1.877412] md: bind
May 20 15:38:40 kernel: [    1.878337] md/raid1:md1: not clean -- starting background reconstruction
May 20 15:38:40 kernel: [    1.878376] md/raid1:md1: active with 2 out of 2 mirrors
May 20 15:38:40 kernel: [    1.878418] md1: detected capacity change from 0 to 3000052808704
May 20 15:38:40 kernel: [    1.878575] md: resync of RAID array md1
[snip]
May 20 15:52:33 kernel: Kernel logging (proc) stopped.
May 20 15:52:33 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="845" x-info="http://www.rsyslog.com"] exiting on signal 15.

As you can see, the system (the normal one, not the rescue system) already detected that something was wrong with the RAID array during a system boot. Then, shortly after, something (not me) halted the system.

So my questions are:

  1. What could cause the disks to suddenly become out of sync?
  2. Why was I not notified by email?
  3. Why was the error not properly logged to syslog before halting the system? Could it be that the system tried to log to syslog, but did so after stopping the syslog daemon? If so, what can I do to prevent that?
  4. What can I do to find out what happened? Or, if there's no way for me now to find out what happened, how can I improve logging and notifications so that next time I can do a better post-mortem?

My question is not about proper backup practice; I already know that RAID is not a backup. My question is solely about notifications and diagnosis.

Best Answer

What could cause the disks to suddenly become out of sync?

It could be any hardware or software fault in the path between the drive platters and the data in memory. That includes, but is not limited to: the drive heads, the drive's onboard controller, the connector on the cable, the cable itself (an internal wire break), the port the cable plugs into on the drive, the port on the motherboard or daughter-card, the controller chip on the motherboard or daughter-card, or even a failure somewhere in software.

True story: I once had a RAID mirror that was flaky, dropping a drive for no reason. The drives checked out fine, the platters were clean (repeat SMART passes turned up nothing), and everything worked well - until it would flake out again, and again. I replaced the $3 SATA cable and the issues instantly went away. Moral of the story: there's a LOT that can go wrong, and you can't always assume that "everything is fine" if you don't check every component in the path of the data.

Why was I not notified by email?

Email notification only occurs when (a) mdadm is actively monitoring the array, or (b) the array is interrogated.

My advice is: you need to have mdadm actively monitor the drive array as a process. This can be accomplished with something similar to (but not exactly like):

mdadm --monitor --scan --syslog

You will need to adjust the above line to your specific installation.
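
For instance, a sketch along those lines (the mail address and polling delay are placeholders to adjust): the first command daemonises the monitor, scans every array listed in mdadm.conf, and mails alerts; the second is a one-off test that generates a TestMessage alert for each array so you can confirm that mail delivery actually works:

# run as a daemon, poll every 30 minutes, mail alerts to a placeholder address
mdadm --monitor --scan --daemonise --mail=root@example.com --delay=1800

# one-off: fire a TestMessage alert for every array to verify notifications work
mdadm --monitor --scan --oneshot --test

On Debian/Ubuntu the mdadm package normally starts such a monitor from its init script (configured via /etc/default/mdadm), so it is also worth checking whether that daemon was actually running on your machine.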

Why was the error not properly logged to syslog before halting the system? Could it be that the system tried to log to syslog, but did so after stopping the syslog daemon? If so, what can I do to prevent that?

There could have been a variety of issues that caused the logging to be dropped.

First, there is the entire issue of how syslog works in general; while many years have gone into making it robust and reliable, there are certain edge cases where data may not make it to disk. This is a well-known design issue, and one that was actively addressed by supervision-styled service management (daemontools and its ilk). The solution there was to bypass syslog altogether and write the output to a logger that keeps a file descriptor open at all times, so nothing gets dropped, and the logger writes the output to disk as fast as possible. While this is not a 100% effective solution, it does significantly improve the odds of having events written to the drive before the kernel panics or the machine shuts down.
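
As a rough sketch of what that looks like (using runit purely as an example of such a supervision tool; the service name and paths here are hypothetical), the supervised service runs in the foreground and its output is piped straight to a dedicated logger process rather than going through syslog:

# /etc/sv/some-daemon/run  (hypothetical runit service directory)
#!/bin/sh
exec 2>&1
exec /usr/sbin/some-daemon --foreground   # placeholder daemon, kept in the foreground

# /etc/sv/some-daemon/log/run  -- the dedicated logger; its pipe stays open the whole time
#!/bin/sh
exec svlogd -tt /var/log/some-daemon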

Second, there is the possibility that the kernel had an outright panic, or that some other event occurred that forced the machine into a corner. Even faulty hardware can cause this - I've seen machines with underpowered PSUs cause spontaneous shutdowns in Windows 8, and replacing the PSU fixed the shutdown problem permanently. Obviously, nothing the kernel can do will guard against a machine that just decided "I've had enough of this" and toddled off to reboot-land.

What can I do to find out what happened? Or, if there's no way for me now to find out what happened, how can I improve logging and notifications so that next time I can do a better post-mortem?

There are several approaches:

  • Place logging on a separate partition. While this is no guarantee that you will get intact logs, it does help isolate filesystem issues such as a full disk preventing writes, or corruption that forces a remount to read-only; in those specific cases it certainly helps.

  • Look at remote logging of vital system information. Again, this is not a guarantee, but it helps if the last packet can "make it out the door" before a reboot happens and that packet carries critical clues about why the reboot happened. (A minimal rsyslog forwarding example is sketched after this list.)

  • For specific, critical services, look at replacing output to syslog with something else, such as the supervision-styled logging sketched above, where a dedicated logger intercepts output and writes it to disk as soon as possible. This increases the reliability of the output making it to storage, and with a little work it can be made to coexist with other service management arrangements.
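
As referenced above, a minimal rsyslog forwarding sketch for the remote-logging point (the log-host name is a placeholder, and this mirrors what you already have; it is shown only to illustrate the form): kernel and daemon facility messages are sent to a central host over UDP:

# /etc/rsyslog.d/50-remote.conf (path is an assumption) -- forward kernel and
# daemon facility messages to a remote log host over UDP port 514
kern.*;daemon.*    @loghost.example.com:514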