Can you verify whether the affected blocks and underlying bad sectors on the disk have been reallocated to the "spare sectors" area? A bad sector should be reallocated when a write operation to it fails. Verify it with smartctl:
smartctl -a /dev/sdb | grep -i reallocated
The last column contains the total number of reallocated sectors. If it is zero, try to read the bad sector:
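If you want to pick that number out in a script, the raw count is the last field of the matching smartctl line. The sample line below is hypothetical, but follows the usual SMART attribute layout:

```shell
# Hypothetical smartctl output line (usual SMART attribute layout):
line='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0'
# The raw value is the last whitespace-separated field:
count=$(echo "$line" | awk '{print $NF}')
echo "$count"    # prints 0 for this sample line
```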
hdparm --read-sector XXXXXXXX /dev/sdb
It should return an I/O error; if it does not, I would recommend skipping the next section.
The error means the sector has not been reallocated yet, so you can try to force reallocation by writing to it. Remember that any data stored in this sector will be lost after this step!
hdparm --write-sector XXXXXXXX --yes-i-know-what-i-am-doing /dev/sdb
By the way, you should be able to obtain the sector number XXXXXXXX from the kernel messages (the dmesg command, or /var/log/messages). Since you had bad blocks during resynchronisation, there should be related messages similar to:
... end_request: I/O error, dev sdb, sector 1261071601
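If you want to script this step, the sector number can be pulled out of such a line with sed; the log line below is just the example quoted above:

```shell
# Kernel log line as in the example above:
msg='end_request: I/O error, dev sdb, sector 1261071601'
# Extract the sector number (for use with hdparm --read-sector):
sector=$(echo "$msg" | sed -n 's/.*sector \([0-9]*\).*/\1/p')
echo "$sector"    # prints 1261071601
```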
Then verify it with smartctl again. Has the counter increased? If so, try to read the sector with hdparm again. This time it should read without any error, as it should now be reallocated. Done.
Finally, you can continue with mdadm and add the disk back to your degraded mirror.
Blazer, it looks like in the process of improving your question (which is now a good one, by the way), you've found your own answer. Well done, you! But there is a little more that could usefully be said.
As far as I know, that mdadm.conf will suffice for you to get automated notifications. Certainly, mine looks very little different from that, and I know from a recent failout test that I get notifications. (I'm a little curious about the second slash in /dev/md/0, but if that's what your system wrote, it's very likely right.)
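For reference, a minimal mdadm.conf along these lines might look like the fragment below; the UUID and e-mail address are placeholders, not values from your system:

```
# /etc/mdadm/mdadm.conf (illustrative fragment)
ARRAY /dev/md/0 metadata=1.2 UUID=...
MAILADDR you@gmail.com
```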
But it's axiomatic in professional sysadmin that, unless you've tested something, you can't really know that it works.
At the very least, you will want to check that you can send mail from that system, as root, to the specified gmail.com address, and receive it.
If I were you, I'd at least perform a soft failure test. You can do that with mdadm /dev/md0 -f /dev/sdb1. That will fail the second drive's partition out of the array, and should generate a formal notification to you (check your system's mail logs to see that it has gone out). Check the output of cat /proc/mdstat so you know what a half-bad array looks like. You can resync the array later with mdadm /dev/md0 -a /dev/sdb1, and check that it has synced back with another cat /proc/mdstat.
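To show what to look for, here is a sketch using a hypothetical /proc/mdstat excerpt for a degraded two-disk mirror (the exact device names and block counts are assumptions, not output from your system):

```shell
# Hypothetical /proc/mdstat excerpt after failing /dev/sdb1 out of md0:
mdstat='md0 : active raid1 sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]'
# "[2/1]" means 1 of 2 members is active; the "_" in "[U_]" marks the
# missing/failed member, so its presence indicates a degraded array:
if echo "$mdstat" | grep -q '\[U_\]'; then
  echo "array is degraded"
fi
```

After a successful resync, the field should read [2/2] [UU] again.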
If you want to go the whole hog, schedule some downtime, try pulling one of the drives, and check that the system can still boot. Where the metadevice in question is the boot partition, people sometimes forget to have a GRUB boot block on both drives, so when the second one fails, their system becomes unbootable. Replace and resync the drive later.
Whatever tests you decide to do, document them, so that when there's a real failure, you know what to expect, and you can minimise the chance of pilot error trashing the second drive.
Best Answer
I would highly recommend having a good backup strategy instead of planning for data recovery. RAID is perfect for improving the uptime of a server, but all it takes is one little software bug and all your data is gone.