What system should be used to track and ensure accountability for backups & tapes?

backup

We run a small business network with a few Windows servers and Backup Exec + an LTO4 tape library to back them all up. We use a yearly, monthly, weekly rota, with tapes going off-site. I should also mention that we use LTO barcodes.

My question is really this – What paperwork/spreadsheets/databases/etc do you use around a backup rota to achieve goals such as the following:

a) Ensure there is a written record of accountability showing that engineers have checked backup logs to ensure jobs are completing successfully, the tape is in good condition, etc (aside from anything else, this seems to be a good way to encourage the process to be followed if people must sign their name to say they've done it).

b) Ability to track where all tapes are currently stored (Backup Exec helps with this, but a separate record seems sensible). Would also be good if this record was somehow stored off-site so that it is accessible in the event of a disaster such as an office fire.

c) In a disaster recovery situation, there aren't just tapes stored off-site, but also a written record explaining exactly which job each tape corresponds to, with a record showing the job completed successfully, etc.

d) Anything else that's important

In short, an audit trail. An audit trail that is designed in such a way that it is resilient to disaster situations such as office fires.

Do people tend to roll their own solution, or are there off-the-shelf solutions? Do you tend to keep it all paper based, or do you have some electronic method? Do you keep any paperwork with the off-site tapes?

I should say that we already have a basic system in place, but I'm interested to see what makes up a good audit trail system, in the hope I can improve ours.

Many thanks!

Best Answer

(a) is important, but it shouldn't be left as a process issue for humans. Checking that all these things are happening, with appropriate periodicity, should be one of the functions of your monitoring system.
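As a rough illustration of what such a monitoring check might look like, here is a minimal Python sketch. It assumes you can already get the time of each job's last successful run out of your backup software somehow; the job names, timestamps and thresholds below are made up for the example, and nothing here is a Backup Exec API.

```python
from datetime import datetime, timedelta


def stale_jobs(jobs, now):
    """Return the names of jobs with no successful run inside their allowed window.

    `jobs` maps a job name to (last_success, max_age), where last_success is a
    datetime or None if the job has never succeeded.
    """
    stale = []
    for name in sorted(jobs):
        last_success, max_age = jobs[name]
        if last_success is None or now - last_success > max_age:
            stale.append(name)
    return stale


# Hypothetical data; in practice this would come from the backup software's
# job history or from parsed job logs.
now = datetime(2024, 1, 8, 9, 0)
jobs = {
    "daily-incremental": (datetime(2024, 1, 7, 23, 30), timedelta(days=1, hours=12)),
    "weekly-full": (datetime(2023, 12, 29, 23, 30), timedelta(days=8)),
    "monthly-full": (None, timedelta(days=32)),
}

print(stale_jobs(jobs, now))  # ['monthly-full', 'weekly-full']
```

The point of the design is that silence is treated as failure: a job that never reported success is flagged just like one that is overdue, so nobody has to remember to look.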

(b) is the job of the backup software. Recall the principle "one datum, one location"; if your backup software says a tape's in one place, and your other internal process says it's in another, which will you believe? If your onsite/offsite requests are generated automatically (as they should be), it's helpful to keep (soft) copies of those; they can always be used as an emergency fallback check of the backup software's memory.
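If you do keep soft copies of the onsite/offsite requests, the fallback check could be as simple as the following Python sketch. It assumes you have reduced both sources to a mapping of tape barcode to location; the barcodes and location names are invented for the example, and how you export them from your backup software is up to you.

```python
def location_mismatches(catalog, requests):
    """Compare tape locations from the backup catalogue with archived requests.

    Both arguments map a tape barcode to a location string. Returns a dict of
    barcode -> (catalogue_location, request_location) for every tape where the
    two disagree; None means the tape is missing from that source.
    """
    mismatches = {}
    for barcode in sorted(set(catalog) | set(requests)):
        cat_loc = catalog.get(barcode)
        req_loc = requests.get(barcode)
        if cat_loc != req_loc:
            mismatches[barcode] = (cat_loc, req_loc)
    return mismatches


# Hypothetical barcodes and locations, for illustration only.
catalog = {"A00001L4": "offsite", "A00002L4": "library", "A00003L4": "offsite"}
requests = {"A00001L4": "offsite", "A00002L4": "offsite"}

for barcode, (cat_loc, req_loc) in location_mismatches(catalog, requests).items():
    print(barcode, "catalogue:", cat_loc, "requests:", req_loc)
```

Note that this only reports disagreements; the backup software stays the single authoritative record, and the request copies are there to question it, not to replace it.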

(c) is also the job of the backup software. Any good software package will have the concept of a "bare metal restore" built into it, and that should include the bare metal restore of the backup server itself. My preferred backup software, Bacula, details this in its documentation, which assumes that everything has been lost except the stack of offsite backup tapes, and that you have acquired replacement hardware. It says what tools you'd use to index the tapes, how to find the most recent catalogue backup, how to restore that into a fresh, empty Bacula instance, and how you'd go about restoring the clients from there.

Ensure that your backup software also documents this. Test that the procedure works. Keep your notes from those tests.

As for (d), I think you've already covered most of the important points. The one I'd reiterate is that you should test your restores frequently; not just once every six months, but at least once a month. Pick a random employee, ask them which file they'd hate to lose; check this can be restored to their satisfaction. Ask a random IT person which server they'd most hate to lose; restore it to another box and have them check it over for functionality. Test your DR procedures every six to twelve months, in full. Yes, this all costs; lots of time as well as offsite callback charges. But untested backups and procedures may well be worthless, and certainly can't be relied on.
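For the file-level restore tests, one way to back up the user's "looks fine to me" with something you can write in the test notes is a checksum comparison. This is only a sketch in Python, assuming the file is restored to an alternate location and that the live copy hasn't changed since the backup ran (if it has, a mismatch proves nothing either way); the function names are my own.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """True if the restored copy is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

Call restore_matches() with the path of the live file and the path of the restored copy, and record the result alongside the date, the tape and the job it came from. It doesn't replace having the person open the file and check it, but it gives you a hard yes/no to file with the rest of your test records.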
