Detecting data corruption so we’re not backing up corrupt files

backup, corruption, data integrity

I've been thinking about data integrity. I currently back up about 2 TB of data and always keep one backup of the data from a year ago.

My concern is that if a file became corrupt on our production file server, no one would notice, because some files aren't accessed for many years. If corruption occurred, I'd be backing up a corrupted file.

I'm not sure how I should handle this problem. Is there a way to detect data corruption? Or is the only solution to store older backups in case something becomes corrupted and isn't noticed?

Best Answer

In my experience, each file type needs its own checks to determine whether something is indeed corrupt. Data is just dots and dashes at its heart, and what counts as "corruption" is entirely file-dependent. You will need to decide which file types are most important, and then determine whether it is reasonably possible to create automation that checks each file type's consistency. That will be a daunting task, as file type specifications change over time and you will encounter proprietary formats that have no easy way to programmatically determine corruption.
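As a rough illustration of what per-file-type automation could look like, here is a minimal Python sketch. It assumes only two kinds of check: ZIP-based containers (including .docx and .xlsx, which are ZIP archives) are verified against their stored CRCs, and .json files are checked by parsing them. The names `CHECKERS` and `find_suspect_files` are made up for this example, and any format not listed is simply skipped, so treat it as a starting point rather than a complete integrity scanner.

```python
import json
import zipfile
from pathlib import Path


def check_zip(path):
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False


def check_json(path):
    """Return True if the file parses as JSON."""
    try:
        with open(path, "rb") as handle:
            json.load(handle)
        return True
    except (ValueError, OSError):
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return False


# Hypothetical mapping from extension to checker; extend it per file type.
CHECKERS = {
    ".zip": check_zip,
    ".docx": check_zip,
    ".xlsx": check_zip,
    ".json": check_json,
}


def find_suspect_files(root):
    """Return a sorted list of files under root that fail their type's check."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        checker = CHECKERS.get(path.suffix.lower())
        if checker is not None and path.is_file() and not checker(path):
            suspects.append(path)
    return suspects
```

Running `find_suspect_files` against the share before each backup job would flag files that fail these structural checks. It says nothing about files whose format has no checker, which is exactly the gap described above for proprietary formats.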

Furthermore, data corruption is only part of the problem. Sometimes files can be wrong from a human perspective but consistent from a data structure perspective. If someone mistakenly edits a file, the data is still fine from a corruption standpoint.

Ultimately, you need to sit down with the leadership of the business and determine what the most important data assets are for the company. Then determine how long those need to be retained and with what level of recall. Do they want fine-grained point-in-time recovery to four years in the past? Maybe only for certain files but not for others?

Considering that you only have 2 TB to back up, a GFS (grandfather-father-son) tape backup scheme using LTO-4 cartridges can allow you to reach back many years with relatively few tapes. This is, of course, entirely dependent on data churn. If you have a lot of busy bits, then you'll have more tapes. Still, 2 TB is a relative speck that LTO-4 or even commodity disk storage would yawn at to keep a few years of data around.
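For a rough sense of scale, the tape count can be estimated with a few lines of Python. This sketch assumes the LTO-4 native capacity of 800 GB per cartridge, counts only retained full-backup sets, and ignores compression and the daily incrementals, whose size depends on churn. The retention numbers (4 weekly, 12 monthly, 4 yearly) are made up for illustration and are not a recommendation.

```python
import math

LTO4_NATIVE_GB = 800  # native (uncompressed) capacity of one LTO-4 cartridge


def tapes_per_full(data_gb, tape_gb=LTO4_NATIVE_GB):
    """Cartridges needed to hold one uncompressed full backup."""
    return math.ceil(data_gb / tape_gb)


def gfs_full_set_tapes(data_gb, weekly_sets, monthly_sets, yearly_sets,
                       tape_gb=LTO4_NATIVE_GB):
    """Cartridges needed to retain every full-backup set (incrementals excluded)."""
    retained_sets = weekly_sets + monthly_sets + yearly_sets
    return retained_sets * tapes_per_full(data_gb, tape_gb)


# Hypothetical retention: 4 weekly, 12 monthly and 4 yearly fulls of 2000 GB.
print(tapes_per_full(2000))                 # 3
print(gfs_full_set_tapes(2000, 4, 12, 4))   # 60
```

Hardware compression or shorter retention would lower these numbers, and heavy churn in the daily incrementals would raise them.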

It's not an easy task to protect digital assets. Keep the Tums handy.
