You always have to get these things fixed from the top.
Is the current backup strategy backed by and understood by management? If not, it's useless.
The executive management needs to know about the problems and what risks are involved (losing financial data that the company is legally required to produce in order to survive, or customer data that has taken years to collect?) and weigh that when deciding on actions, or when deciding to let someone (like you) take action.
If you can't get to management, try business controllers or other financial positions where data retrieval and its integrity are of high importance to the company's reports. They, in turn, can "start the storm" if needed...
You can force a check of (e.g.) md0 with

    echo "check" > /sys/block/md0/md/sync_action

You can check the state of the test with

    cat /sys/block/md0/md/sync_action

While that returns "check" the check is still running; once it returns "idle" you can do a

    cat /sys/block/md0/md/mismatch_cnt

to see if the mismatch count is zero or not. Many distros automate this check to run (e.g.) weekly for you anyway, just as most industrial hardware RAID controllers continually run this in the background (they often call it "RAID scrubbing") while the array is otherwise idle. Note that, according to the comments in Fedora's automated check script, RAID1 writes in the kernel are unbuffered, so mismatch counts can be non-zero even for a healthy array while it is mounted.
So quiescing the arrays by doing this check while the VM is down, if at all possible, is probably a good idea.
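Putting those pieces together, here is a minimal sketch of a scrub-and-report script, assuming the array is md0, that it runs as root, and a one-minute polling interval; it could also be dropped into a weekly cron job like the distro scripts mentioned above:

    #!/bin/bash
    # Minimal sketch: scrub one md array and report its mismatch count.
    # Assumptions: the array is md0 and this runs as root.
    dev=md0

    echo "check" > /sys/block/$dev/md/sync_action

    # Poll until the kernel reports the check has finished.
    while [ "$(cat /sys/block/$dev/md/sync_action)" = "check" ]; do
        sleep 60
    done

    echo "mismatch_cnt for $dev: $(cat /sys/block/$dev/md/mismatch_cnt)"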
I'd add that I agree with the docs when they say that "RAID cannot and is not supposed to guard against data corruption on the media".
RAID is supposed to guard against complete failure of a device; guarding against incremental random failures in elements of a storage device is a job for error-checking and block-remapping, which is probably best done in the controller itself. I'm happy that the docs warn people of the limitations of RAID, especially if it's implemented on top of flaky devices. I find that frequent smartctl health checks of my drives help me to stay on top of drives which are starting to show the sort of errors that lead to out-of-sync mirrors.
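As an illustration, a periodic health sweep might look something like this (smartctl ships with smartmontools; the device list is an assumption you would adjust to your hardware):

    #!/bin/bash
    # Sketch: summarise SMART health for each physical drive.
    # Assumption: smartmontools is installed and these device names exist.
    for drive in /dev/sda /dev/sdb; do
        echo "=== $drive ==="
        smartctl -H "$drive"        # overall health self-assessment
        smartctl -l error "$drive"  # recent device error log, if any
    done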
Best Answer
In my experience, each file type needs its own checks to determine if something is indeed corrupt. Data is just dots and dashes at its heart, and what counts as "corruption" is entirely file-dependent. You will need to determine which file types are most important, and then whether it is reasonably possible to create automation that checks each type's consistency. That will be a daunting task, as file-type specifications change over time and as you encounter proprietary formats with no easy way to programmatically determine corruption.
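To make that concrete, here is a hedged sketch of such per-type automation; it only covers a few formats that happen to ship self-test modes (the /srv/data path is a placeholder), which is exactly why proprietary formats are the hard part:

    #!/bin/bash
    # Sketch: dispatch each file to a type-specific consistency test.
    # Assumptions: /srv/data is the tree to verify; filenames contain no newlines.
    check_file() {
        case "$1" in
            *.gz)  gzip -t "$1" ;;             # gzip has a built-in integrity test
            *.zip) unzip -tqq "$1" ;;          # zip archives carry per-file CRCs
            *.tar) tar -tf "$1" >/dev/null ;;  # listing the contents is a crude sanity check
            *)     return 0 ;;                 # unknown type: nothing we can test
        esac
    }

    find /srv/data -type f | while read -r f; do
        check_file "$f" || echo "SUSPECT: $f"
    done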
Furthermore, data corruption is only part of the problem. Sometimes files can be wrong from a human perspective yet consistent from a data-structure perspective: if someone mistakenly edits a file, the data is fine from a corruption standpoint.
Ultimately you need to sit down with the leadership of the business and determine what the most important data assets are for the company. Then determine how long those need to be retained and with what level of recall. Do they want fine-grained point-in-time recovery reaching four years into the past? Maybe only for certain files but not for others?
Considering that you only have 2TB to back up, a GFS (grandfather-father-son) tape backup scheme using LTO4 cartridges can allow you to reach back many years with relatively few tapes. This is, of course, entirely dependent on data churn: if you have a lot of busy bits, you'll need more tapes. Still, 2TB is a relative speck that LTO4, or even commodity disk storage, could comfortably hold a few years of history for.
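As a sketch of how a nightly job might sort backups into GFS sets (the set names, the day-of-month/day-of-week rules, the /srv/data path, and the tar-to-/dev/st0 step are all assumptions standing in for your real backup tool and rotation policy):

    #!/bin/bash
    # Sketch: pick tonight's GFS set, then write the backup to tape.
    # Assumptions: GNU date; /dev/st0 is the tape drive; /srv/data is the payload.
    if [ "$(date +%d)" = "01" ]; then
        backup_set="monthly"   # grandfather: first of the month
    elif [ "$(date +%u)" = "7" ]; then
        backup_set="weekly"    # father: Sundays
    else
        backup_set="daily"     # son: every other night
    fi

    echo "Writing $backup_set backup to tape"
    tar -cf /dev/st0 /srv/data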
It's not an easy task to protect digital assets. Keep the Tums handy.