You always have to get these things fixed from the top.
Is the current backup strategy backed by and understood by management? If not, it's useless.
The executive management needs to know about the problems and what risks are involved (losing financial data that the company is legally required to produce in order to survive, or customer data that has taken years to collect?) and weigh that when deciding on actions, or when deciding to let someone (like you) take action.
If you can't get to management, try business controllers or other financial positions where data retrieval and its integrity are of high importance to the company's reports. They, in turn, can "start the storm" if needed...
You can force a check of (e.g.) md0 with

    echo "check" > /sys/block/md0/md/sync_action

You can check the state of the test with

    cat /sys/block/md0/md/sync_action

While that returns "check" the check is still running; once it returns "idle" you can do a

    cat /sys/block/md0/md/mismatch_cnt

to see if the mismatch count is zero or not. Many distros automate this check to run (e.g.) weekly for you anyway, just as most industrial hardware RAID controllers continually run this in the background (they often call it "RAID scrubbing") while the array is otherwise idle. Note that, according to the comments in Fedora's automated check script, RAID1 writes in the kernel are unbuffered, so mismatch counts can be non-zero even for a healthy array while it is mounted.
So quiescing the arrays by doing this check while the VM is down, if at all possible, is probably a good idea.
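Putting those pieces together, here is a minimal sketch of a scrub-and-report script, assuming the array is md0, that it runs as root, and a one-minute polling interval; it could also be dropped into a weekly cron job like the distro scripts mentioned above:

    #!/bin/bash
    # Minimal sketch: scrub one md array and report its mismatch count.
    # Assumptions: the array is md0 and this runs as root.
    dev=md0

    echo "check" > /sys/block/$dev/md/sync_action

    # Poll until the kernel reports the check has finished.
    while [ "$(cat /sys/block/$dev/md/sync_action)" = "check" ]; do
        sleep 60
    done

    echo "mismatch_cnt for $dev: $(cat /sys/block/$dev/md/mismatch_cnt)"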
I'd add that I agree with the docs when they say that "RAID cannot and is not supposed to guard against data corruption on the media".
RAID is supposed to guard against complete failure of a device; guarding against incremental random failures in elements of a storage device is a job for error-checking and block-remapping, which is probably best done in the controller itself. I'm happy that the docs warn people of the limitations of RAID, especially if it's implemented on top of flaky devices. I find that frequent smartctl health checks of my drives help me to stay on top of drives which are starting to show the sort of errors that lead to out-of-sync mirrors.
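As an illustration, a periodic health sweep might look something like this (smartctl ships with smartmontools; the device list is an assumption you would adjust to your hardware):

    #!/bin/bash
    # Sketch: summarise SMART health for each physical drive.
    # Assumption: smartmontools is installed and these device names exist.
    for drive in /dev/sda /dev/sdb; do
        echo "=== $drive ==="
        smartctl -H "$drive"        # overall health self-assessment
        smartctl -l error "$drive"  # recent device error log, if any
    done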
Best Answer
In my experience, each file type needs its own checks to determine if something is indeed corrupt. Data is just dots and dashes at its heart, and what counts as "corruption" is entirely file-dependent. You will need to determine which file types are most important, and then whether it is reasonably possible to create automation that checks each type's consistency. That will be a daunting task, as file-type specifications change over time and as you encounter proprietary formats with no easy way to programmatically determine corruption.
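To make that concrete, here is a hedged sketch of such per-type automation; it only covers a few formats that happen to ship self-test modes (the /srv/data path is a placeholder), which is exactly why proprietary formats are the hard part:

    #!/bin/bash
    # Sketch: dispatch each file to a type-specific consistency test.
    # Assumptions: /srv/data is the tree to verify; filenames contain no newlines.
    check_file() {
        case "$1" in
            *.gz)  gzip -t "$1" ;;             # gzip has a built-in integrity test
            *.zip) unzip -tqq "$1" ;;          # zip archives carry per-file CRCs
            *.tar) tar -tf "$1" >/dev/null ;;  # listing the contents is a crude sanity check
            *)     return 0 ;;                 # unknown type: nothing we can test
        esac
    }

    find /srv/data -type f | while read -r f; do
        check_file "$f" || echo "SUSPECT: $f"
    done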
Furthermore, data corruption is only part of the problem. Sometimes files can be wrong from a human perspective yet consistent from a data-structure perspective: if someone mistakenly edits a file, the data is fine from a corruption standpoint.
Ultimately you need to sit down with the leadership of the business and determine what the most important data assets are for the company. Then determine how long those need to be retained and with what level of recall. Do they want fine-grained point-in-time recovery reaching four years into the past? Maybe only for certain files but not for others?
Considering that you only have 2TB to back up, a GFS (grandfather-father-son) tape backup scheme using LTO4 cartridges can allow you to reach back many years with relatively few tapes. This is, of course, entirely dependent on data churn: if you have a lot of busy bits, you'll need more tapes. Still, 2TB is a relative speck that LTO4, or even commodity disk storage, could comfortably hold a few years of history for.
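As a sketch of how a nightly job might sort backups into GFS sets (the set names, the day-of-month/day-of-week rules, the /srv/data path, and the tar-to-/dev/st0 step are all assumptions standing in for your real backup tool and rotation policy):

    #!/bin/bash
    # Sketch: pick tonight's GFS set, then write the backup to tape.
    # Assumptions: GNU date; /dev/st0 is the tape drive; /srv/data is the payload.
    if [ "$(date +%d)" = "01" ]; then
        backup_set="monthly"   # grandfather: first of the month
    elif [ "$(date +%u)" = "7" ]; then
        backup_set="weekly"    # father: Sundays
    else
        backup_set="daily"     # son: every other night
    fi

    echo "Writing $backup_set backup to tape"
    tar -cf /dev/st0 /srv/data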
It's not an easy task to protect digital assets. Keep the Tums handy.