SQL Server – Using SAN Replication/Snapshots for SQL Server disaster recovery

sql-server, sql-server-2008, storage-area-network

We have a web application that uses SQL Server 2008 on a single database server. All storage is local. For the past year we have been trying to get some form of SQL Server replication to work with our configuration, but it won't. The reason is that we have over 2,000 databases that are constantly being updated (one for each of our clients), and our tests show that all forms of replication are too resource-intensive.

Every time I ask this question, people focus on the fact that we have too many databases. This is something that cannot change (for regulatory and other reasons), so I'd like to focus on how we can replicate the data.

We've been told that one option is to move all the data to a SAN and let the SAN replicate the data (or take frequent snapshots). However, if our database server fails, is there a risk of a database being corrupted? And is it possible to leverage the SAN, replicated to another SAN, to provide a decent DR solution? (In our case, we can lose up to about 30 minutes of data, but we cannot lose an entire day's worth … i.e. we can't go back to the prior night's backup.)

Best Answer

As mentioned in other answers:

  • Old-style database mirroring and new-style AlwaysOn need worker threads, and you will definitely run out of threads with 2,000 databases. I vaguely recall that the practical limit is well below 200 databases. (There is a white paper on this somewhere, but I'm too lazy to look for it right now and this answer is already super-long.) Of course, that's 200 databases per instance. Theoretically, you could start up 20 instances and run 100 databases on each instance, but managing all of that would be a hassle, and I suspect that managing the memory between all of those instances would be a headache. (As a rough check on your own thread budget, see the query sketch after this list.)

  • SQL Server replication (replicating tables (or subsets of tables), rather than files) isn't really intended for DR. It's hard to set up and administer, even for a few databases. You may need to alter your data model to get it to work, which might mean changes to your app. You'd need an automated way to apply the same replication configuration to each of your 2,000 (presumably identical or nearly identical) databases. The stored procedures that you need to use to configure replication are messy, and administering 2,000 databases that are configured for replication via the GUI would be a nightmare. When/if you fail over, you might need to make changes to get everything working again, and failover time isn't when you want to be doing any finicky work that you can avoid; you want to get everything back up and running ASAP. It just seems like a pack of problems.

  • Replication between SAN storage units can be expensive, especially if you are talking about hardware from an outfit like EMC. Once you start with a vendor, you are pretty much married to them for upgrades, maintenance, additional space, etc.
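
As a rough sanity check on that thread budget, a query like the one below (just a sketch, nothing official) shows the instance-wide ceiling and how many workers are allocated right now; any mirroring/AlwaysOn threads would have to come out of that same pool.

    -- Configured ceiling for worker threads on this instance
    SELECT max_workers_count
    FROM sys.dm_os_sys_info;

    -- Workers currently allocated
    SELECT COUNT(*) AS current_workers
    FROM sys.dm_os_workers;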

Suggestion #1: Have you looked at something like Steeleye's DataKeeper? It's a software-based replication product that runs on your servers and leverages Windows Failover Clustering. I've never actually used it and I have no connection to the company other than sitting through a few dog-and-pony shows, but it looks ideal for your situation.

Suggestion #2: If it were me and I had absolutely no budget, I would look at a home-grown log shipping system. I doubt that the built-in log shipping will deal with 2,000 databases very well. It's not that hard to write a log shipping system yourself, and it can address all of the issues specific to your environment (for example, maybe you need to send the files via sftp to your DR site).

Basically, there are three parts to the system. Each part needs to run on a regular schedule:

  • One part takes the transaction log backups, putting the tlog backup files for each database into a different folder (for file system scaling). I would not use the maintenance wizard for this; I've seen it go wonky too many times and start skipping databases and generally misbehaving. If you want to provide a 30 minute guarantee, maybe this runs every 15 minutes. (A minimal T-SQL sketch of this loop follows the list.)

  • One part copies the backup files from the staging area to your DR site. This could be something as simple as a robocopy CMD file if you have a VPN to your DR site. You could write a package or a PowerShell script if you need something fancier (sftp or ssh/scp, or maybe zip/unzip if you don't have built-in backup compression). This can run more frequently, maybe every 5 minutes, to make sure it gets everything. Once something is copied offsite, it is "safe". (There's a copy-step sketch after the list as well.)

  • One part restores the tlog backups that it finds at the DR site onto your secondary server. You need to identify the tlogs that have been restored and move them off or delete them on some schedule, or you will eventually run out of space. This doesn't need to run as frequently, but you need to make sure that all available tlog backups have been applied before declaring the DR secondary "live" when you have a problem. (A one-file restore sketch is after the list too.)
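
For the tlog backup part, here is a minimal sketch of the kind of loop I have in mind. The D:\TLogStaging path is a placeholder, the per-database folders have to exist already (BACKUP won't create them), and it assumes each database has already had a full backup to start its log chain:

    DECLARE @db    sysname,
            @sql   nvarchar(max),
            @stamp varchar(20) = CONVERT(varchar(8), GETDATE(), 112) + '_'
                               + REPLACE(CONVERT(varchar(8), GETDATE(), 108), ':', '');

    DECLARE db_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT name
        FROM sys.databases
        WHERE recovery_model_desc = 'FULL'      -- simple recovery databases can't take log backups
          AND state_desc = 'ONLINE'
          AND name NOT IN ('master', 'model', 'msdb', 'tempdb');

    OPEN db_cur;
    FETCH NEXT FROM db_cur INTO @db;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- one folder per database; D:\TLogStaging is a placeholder for your staging area
        SET @sql = N'BACKUP LOG ' + QUOTENAME(@db)
                 + N' TO DISK = N''D:\TLogStaging\' + @db + N'\'
                 + @db + N'_' + @stamp + N'.trn'''
                 + N' WITH INIT, CHECKSUM;';    -- add COMPRESSION if your edition supports it
        EXEC sys.sp_executesql @sql;

        -- a real version would also write a row to an audit table here

        FETCH NEXT FROM db_cur INTO @db;
    END

    CLOSE db_cur;
    DEALLOCATE db_cur;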
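
For the copy part, a robocopy CMD file on a schedule really is enough; if you'd rather drive it from a SQL Agent job, something like this works too, assuming xp_cmdshell is enabled and \\DRSERVER\TLogIncoming stands in for the share at your DR site:

    -- Hypothetical copy step: push new .trn files from the local staging area to the DR share.
    -- Requires xp_cmdshell to be enabled; both paths are placeholders.
    DECLARE @cmd varchar(4000);
    SET @cmd = 'robocopy "D:\TLogStaging" "\\DRSERVER\TLogIncoming" *.trn /S /XO /R:2 /W:5';
    EXEC master.sys.xp_cmdshell @cmd;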
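
And on the DR side, the restore part is just repeated RESTORE LOG ... WITH NORECOVERY over whatever files have arrived. A sketch for a single file (the database name and path are placeholders; a real job would enumerate the incoming folder and log each restore to an audit table):

    -- Hypothetical restore of one incoming tlog on the DR server. The database was seeded
    -- earlier with RESTORE DATABASE ... WITH NORECOVERY and stays in that state until failover.
    RESTORE LOG [Client0001]
        FROM DISK = N'E:\TLogIncoming\Client0001\Client0001_20240101_120000.trn'
        WITH NORECOVERY, CHECKSUM;

    -- At failover time, once the last available tlog has been applied:
    -- RESTORE DATABASE [Client0001] WITH RECOVERY;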

You want tables that audit all three steps, some reports/scripts that show you what has happened (is a particular database running on your primary or secondary site? has any database on the secondary not seen a tlog restore in, say, two hours?) and an alerting scheme.
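
As a sketch of what I mean by auditing (the table and column names are made up for illustration), something like this on the secondary answers the "has any database not seen a restore lately?" question:

    -- Hypothetical audit table, written to by the restore job after every RESTORE LOG.
    CREATE TABLE dbo.TLogRestoreAudit
    (
        database_name  sysname       NOT NULL,
        tlog_file      nvarchar(260) NOT NULL,
        restored_at    datetime      NOT NULL DEFAULT (GETDATE())
    );

    -- Databases on the secondary with no tlog restore in the last two hours.
    -- (A database with no rows at all won't show up here, so a real version
    -- would drive this off a master list of databases instead.)
    SELECT database_name,
           MAX(restored_at) AS last_restore
    FROM dbo.TLogRestoreAudit
    GROUP BY database_name
    HAVING MAX(restored_at) < DATEADD(HOUR, -2, GETDATE());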

On top of that, I'd also want to be able to choose a particular database to fail over as well as being able to fail over everything. Being able to pick a db to fail over allows for easy testing (you fail over a test database, not a customer's database) and might give you a rudimentary load-balancing scheme if you get into scaling trouble. You will also want an automated way to "resync" between primary and secondary (take a full backup from the primary, apply it to the secondary, start the tlogs flowing, etc.). These features might be better left for a release 2.0.

(Everyone has forgotten that the earliest tlog shipping that MS supported was implemented via a few scripts that you could download and run on SQL 7.0. There was no GUI; the UI was some SQL reports and a few stored procedures.)

Other than writing a little tsql code, the challenges here are:

  • Changing to the full recovery model (it sounds like you might be running in the simple recovery model) and the increases in storage usage that are likely to follow: log backups, increased database sizes, what-have-you. (There's a short sketch of the recovery model change after this list.)

  • Making sure that your storage system can handle the load of frequent tlog backups and copying them to a DR site in a timely way. IOW, if you have 2,000 databases and want to guarantee data up until the last hour, you need to be able to take one transaction log backup of each of those 2,000 databases and get it onto networked storage (somewhere that isn't your primary server) within that hour.

  • Making sure that everything keeps up generally.
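
The recovery model change itself is trivial to script (a sketch for one database, with placeholder names; the important bit is that the log chain doesn't start until you take a full backup afterwards):

    -- Hypothetical: switch one database to full recovery (loop over sys.databases for all of them).
    ALTER DATABASE [Client0001] SET RECOVERY FULL;

    -- The log chain doesn't start until a full backup is taken after the switch:
    BACKUP DATABASE [Client0001]
        TO DISK = N'D:\FullBackups\Client0001_full.bak'   -- placeholder path
        WITH INIT, CHECKSUM;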

After I got all of that working, I'd start looking at automating failover, how to tell my web sites where the live version of a particular customer's database was running, etc. If you aren't running clustered systems, making sure that you keep all of the logins/passwords, jobs, linked servers, etc. etc. in sync is a PITA.
