The filesystem is probably mounted with the option errors=remount-ro, which, as the name suggests, means that if an error is detected, the filesystem is immediately set to read-only to avoid further damage.
There will be information in the kernel logs (/var/log/kern.log on most Linux distributions).
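A quick way to confirm this state (assuming the affected filesystem is on /dev/sdb1; adjust to your device):

mount | grep sdb1            # look for "ro" among the mount options
dmesg | grep -i remount      # the kernel logs why it flipped to read-only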
What to do next depends on the cause. Here are the most likely ones:
It could be a failing disk. Often you'll see I/O errors reported in the kernel logs. smartctl -a /dev/sdb can tell you more. Back up your data as soon as possible and replace the disk.
It could be a problem with your RAM. Run a memtest just to make sure.
It could be a kernel bug. This is hard for mere mortals to diagnose. Make sure you have the latest kernel released for your distribution.
The filesystem could have been damaged earlier, for a reason that no longer applies (e.g. a kernel bug that has since been fixed). Running fsck (sketched below) should fix the problem for good; if it doesn't, then unfortunately this case doesn't apply to you.
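A minimal fsck run, assuming the filesystem is /dev/sdb1 mounted at /data and the machine can be taken offline (never run fsck on a mounted filesystem; use a rescue or live system if it's the root filesystem):

umount /dev/sdb1
fsck -f /dev/sdb1        # -f forces a full check even if the superblock says "clean"
mount /dev/sdb1 /data    # remount once the check completes without errors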
You can force a check of (e.g.) md0 with

echo "check" > /sys/block/md0/md/sync_action

You can check the state of the test with

cat /sys/block/md0/md/sync_action

While this returns check, the check is still running; once it returns idle, you can do a

cat /sys/block/md0/md/mismatch_cnt
to see if the mismatch count is zero or not. Many distros automate this check to run (e.g.) weekly for you anyway, just as most industrial hardware RAID controllers continually run this in the background (they often call it "RAID scrubbing") while the array is otherwise idle. Note that, according to the comments in Fedora's automated check file, RAID1 writes in the kernel are unbuffered, and therefore mismatch counts can be non-zero even for a healthy array if the array is mounted.
So quiescing the arrays by doing this check while the VM is down, if at all possible, is probably a good idea.
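As a rough sketch, the whole check can be scripted like this (the array name md0 and the 60-second polling interval are assumptions; run as root):

#!/bin/bash
# Start a consistency check on the array and wait for it to finish.
dev=md0
echo check > /sys/block/$dev/md/sync_action

# sync_action reads back "check" while running and "idle" when done.
while [ "$(cat /sys/block/$dev/md/sync_action)" = "check" ]; do
    sleep 60
done

# A non-zero count means some stripes did not match during the check.
cat /sys/block/$dev/md/mismatch_cnt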
I'd add that I agree with the docs when they say that

RAID cannot and is not supposed to guard against data corruption on the media
RAID is supposed to guard against complete failure of a device; guarding against incremental random failures in elements of a storage device is a job for error-checking and block-remapping, which is probably best done in the controller itself. I'm happy that the docs warn people of the limitations of RAID, especially if it's implemented on top of flaky devices. I find that frequent smartctl health checks of my drives help me to stay on top of drives which are starting to show the sort of errors that lead to out-of-sync mirrors.
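As an illustration, a periodic check along those lines might look like this (the device name is an assumption; smartctl ships with the smartmontools package, and smartd can automate the same checks):

smartctl -H /dev/sda     # overall health self-assessment (PASSED/FAILED)
smartctl -a /dev/sda | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
                         # attributes that typically creep up on a dying disk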
Best Answer
I deduce you are using ext3 or ext4 as the filesystem. If so, you can mount it with the errors=panic option and configure watchdog to reboot your system in case a panic happens. While more complex than roelvanmeer's answer (which I upvoted), this has the added bonus of working for any panic-level kernel crash.
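A rough sketch of the watchdog side, using the Debian package and service names (other distros differ; if there is no hardware watchdog, the softdog kernel module can stand in):

apt install watchdog
# In /etc/watchdog.conf, point the daemon at the watchdog device:
#   watchdog-device = /dev/watchdog
systemctl enable --now watchdog
# After a kernel panic the daemon stops feeding the device, the timer
# expires, and the hardware (or softdog) resets the machine.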
As suggested by NikitaKipriyanov, setting the panic=5 kernel boot option can be a simpler alternative to watchdog (which has more configuration options but is slightly more complex as a result).
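A minimal sketch of the combination (the device /dev/sdb1 is an assumption; update-grub is the Debian/Ubuntu command, other distros use grub2-mkconfig):

# Make ext4 panic on errors instead of remounting read-only
# (equivalently, add errors=panic to the options in /etc/fstab)
tune2fs -e panic /dev/sdb1

# Reboot 5 seconds after any panic: add panic=5 to
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub

# Or set the same behaviour at runtime, without a reboot:
sysctl kernel.panic=5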