SQL Server – a good way to back up a petabyte of data and store it

backup, sql-server

I'm starting to see clients with hundreds of terabytes of data (in SQL Server installations). As the total volume of data in some enterprises approaches meaningful fractions of a petabyte, I'd like to canvass the collective knowledge base out there to see what people dealing with that magnitude of data are doing to safeguard it.

The obvious issue is that storing multiple backups of that much data is prohibitively expensive on enterprise-class storage – heck, even on plain RAID-5.

Options I see are as follows:

  1. Create a mirror copy of the data in another data-center, and continually ship differences to it (using whatever mechanism is available for your data source – e.g. log-shipping or database mirroring with SQL Server)
  2. Take regular backups using a hefty compression algorithm (probably only suitable if the data lends itself well to being heavily compressed)
  3. Take piecemeal backups of the critical/changing parts of the data.
  4. Don't back up the data and trust the corruption-gods.
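The "ship differences" idea behind option #1 can be illustrated with a toy model. This is only a conceptual sketch (not how log shipping or mirroring actually works internally): it splits two snapshots of a database into 8 KB pages – SQL Server's page size – hashes each page, and identifies the handful of pages that changed and would need to travel to the mirror, rather than re-shipping the whole multi-terabyte copy.

```python
import hashlib

PAGE_SIZE = 8192  # SQL Server data pages are 8 KB


def page_hashes(data: bytes) -> list:
    """Split a byte blob into 8 KB pages and hash each one."""
    return [hashlib.sha256(data[i:i + PAGE_SIZE]).hexdigest()
            for i in range(0, len(data), PAGE_SIZE)]


def changed_pages(old: bytes, new: bytes) -> list:
    """Return the indices of pages that differ between two snapshots."""
    old_h, new_h = page_hashes(old), page_hashes(new)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or h != old_h[i]]


# Simulate a 1 MB database in which a single byte changes between snapshots.
snapshot1 = bytes(1024 * 1024)
snapshot2 = bytearray(snapshot1)
snapshot2[40000] = 0xFF  # dirty one byte, which lands in page 4
diff = changed_pages(snapshot1, bytes(snapshot2))
print(diff)  # → [4]: only one 8 KB page needs to be shipped
```

The point of the sketch is the economics: the cost of keeping the remote copy current is proportional to the change rate, not to the total data volume.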

I'm seeing option #4 being adopted as the default, and as an HA/DR expert I find that really scary – but what do I advise as an alternative? I think #1 is the best approach, but "I don't think so" is the usual answer when any alternatives apart from #4 and possibly #3 are suggested.

Now, of course it depends on the change-rate and criticality of the data. No need to answer with that as I used to be responsible for all the HA features of SQL Server while I worked at Microsoft so I'm well-versed in the 'it depends' arguments – that's my catch-phrase 🙂

I'd be very interested to hear of any alternatives I've missed, or to hear that everyone else is in the same boat and there's no realistic alternative to spending lots of money on more storage.

Thanks in advance – due credit will be given to all well thought-out and expressed answers.

Best Answer

Off-the-wall idea – is all of the stored information actually needed, or even useful?

How much is the information actually worth? It seems obviously ridiculous to spend more in upkeep and management than the data is worth.

Is the data in the database appropriate for storage in a database? For example, does keeping compressed multi-gigabyte core files in the support organization's database really provide any actual benefit?

Is there a lot of duplicated data in the database? For example, are a thousand people keeping ten copies each of a weekly 10MB newsletter?
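The newsletter scenario above is exactly what single-instance (content-addressed) storage eliminates. Here's a minimal sketch, assuming a hypothetical `DedupStore` that keys blobs by their SHA-256 digest, so ten thousand logical copies of the same 10 MB file occupy only 10 MB of physical storage:

```python
import hashlib


class DedupStore:
    """Content-addressed store: identical blobs are physically kept once."""

    def __init__(self):
        self.blobs = {}  # sha256 digest -> content, stored once
        self.refs = []   # one digest entry per logical copy

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        self.refs.append(digest)
        return digest

    def logical_bytes(self) -> int:
        """Total size as the users see it (every copy counted)."""
        return sum(len(self.blobs[d]) for d in self.refs)

    def physical_bytes(self) -> int:
        """Actual storage consumed (each unique blob counted once)."""
        return sum(len(c) for c in self.blobs.values())


# A thousand people each keep ten copies of the same 10 MB newsletter.
store = DedupStore()
newsletter = b"x" * (10 * 1024 * 1024)
for _ in range(1000 * 10):
    store.put(newsletter)

print(store.logical_bytes() // 2**20)   # → 100000 (MB, as users see it)
print(store.physical_bytes() // 2**20)  # → 10 (MB actually stored)
```

Real products do this at the block or page level rather than whole-file level, but the ratio is the point: data that is mostly duplicates is far cheaper to back up than its logical size suggests.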

Does some of the data have an "expiration date" after which it does not provide any value? Returning to the support organization example, for various reasons there is virtually no benefit in keeping around customer core files more than a few months after a fix has been delivered.
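An expiration policy like the core-file example is simple to mechanize. The sketch below assumes a hypothetical 90-day retention window measured from when the fix shipped (the field names and rule are illustrative, not from any real product):

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # hypothetical policy: expire 90 days after the fix ships


def expired(core_files, today):
    """Return names of files whose fix shipped more than RETENTION_DAYS ago."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [f["name"] for f in core_files
            if f["fix_shipped"] is not None and f["fix_shipped"] < cutoff]


files = [
    {"name": "case1001.core", "fix_shipped": date(2024, 1, 15)},
    {"name": "case1002.core", "fix_shipped": date(2024, 5, 20)},
    {"name": "case1003.core", "fix_shipped": None},  # fix not shipped yet
]
print(expired(files, today=date(2024, 6, 1)))  # → ['case1001.core']
```

Anything the policy expires never has to be backed up again, which compounds the savings across every copy of every backup.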

Another thought – is keeping that much data opening the company up to liability? Some data one must, by law, keep. Some data, however, should be "shredded" because of the risks posed if it is accidentally, or maliciously, released to inappropriate parties.
