For all those suggesting mirroring - it's not going to be possible to mirror 400 databases. You'll run out of worker threads way before you hit 400, whether on 32- or 64-bit. There are at least 2 worker threads per database on the principal and at least 3 per database on the mirror.
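If you want to see the ceiling you're up against on a given instance, a quick look at the configured maximum (this is just the instance-wide worker thread limit, nothing mirroring-specific) is enough to do the arithmetic against the 2-3 threads per database figure:

```sql
-- Total worker threads this instance can use
-- (compare against ~2+ per principal database and ~3+ per mirror database).
SELECT max_workers_count
FROM sys.dm_os_sys_info;
```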
It's really got nothing to do with the skills of the DBA. It depends entirely on what the HA requirements of the business or application are - these are the driving factors, not what the DBA can cope with. If a more complex solution is required, then getting a DBA that can deal with it is also required. Limiting your HA solution because of the skills of the DBA is nonsensical.
Database mirroring is also not guaranteed fast-detection and fast-failover - it depends entirely on what the failure is, what the mirroring partner timeout is, and what the SEND and REDO queues look like. It could be quite a while before a failover completes once the mirror decides to become the principal. I've helped many customers implement mirroring, both while at Microsoft (where I owned database mirroring along with the rest of the Storage Engine) and since leaving in 2007.
Forget mirroring for that number of databases.
There are some fundamental questions you need to answer before we can give you a recommendation:
- Do you want a redundant copy of each entire database, or just portions?
- Do you want to be able to use the redundant copies? For writes or just for reads?
- What's the transaction log generation rate of the databases?
- What's the network bandwidth between the two sites?
Log shipping is probably the easiest for you, in terms of setting it up, monitoring it, and being able to bring the redundant copies online quickly and within your data-loss limits. With log shipping you can choose to bring the redundant copies online without finishing restoring all the transaction log backups queued for restore, by accepting some data loss.
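To make that concrete, bringing a log-shipped secondary online ahead of its queued log restores is, at its simplest, a single command (the database name is a placeholder; any log backups not yet restored are abandoned, which is where the data loss comes from):

```sql
-- Recover the standby copy immediately, accepting the loss of any
-- transaction log backups that haven't been restored yet.
RESTORE DATABASE [CustomerDB] WITH RECOVERY;
```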
You can also effect simple failovers using a mid-tier routing technology, or even something as simple as Windows NLB with a 0/100 config, or switching to 100/0. I've also seen customers using DNS switching to effect failovers when the server name cannot change.
With replication you have to deal with unpredictable latency between the transactions being harvested from the publication db transaction log, hitting the distribution database and then being pushed/pulled to the subscription database by the distribution agent. Replication can also get itself twisted up and be a pain to figure out and reset - log shipping is dead simple.
Given that you've been using your own log shipping, I'm guessing that the tran log generation rate and network bandwidth aren't an issue - I'd stick with log shipping and shy away from replication. Don't even think about database mirroring as it won't work for your volume.
Hope this helps - this is really two days' worth of discussion when I teach HA to the DBAs inside Microsoft, boiled down to 5 minutes of answering this. Feel free to follow up with more specific questions in comments.
Yes, you can write your own replication mechanism, but it isn't necessary. You can use SQL Server Replication between the two data centers without a VPN between them (though a VPN is always preferred).
Before configuring Replication I'd urge you to look at SQL Server's Database Mirroring between the two sites. Database Mirroring will give you an exact copy of the database on the other site.
In either case you simply need to open the correct TCP ports in the firewalls at both sites (and configure NAT on both sites between a public IP and the private IP if the SQL Servers have private IPs). Then tell the Replication or database mirroring to use the public IP of the remote SQL Server as the replication or mirroring partner.
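For database mirroring specifically, the partner address can point straight at that public endpoint. A rough sketch, assuming the mirroring endpoints already exist, the mirror copy has already been restored WITH NORECOVERY, and the IPs, port, and database name are placeholders:

```sql
-- On the mirror server first (points back at the principal's public endpoint):
ALTER DATABASE [CustomerDB] SET PARTNER = 'TCP://203.0.113.10:5022';

-- Then on the principal server (points at the mirror's public endpoint):
ALTER DATABASE [CustomerDB] SET PARTNER = 'TCP://198.51.100.20:5022';
```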
As I said above, you'll want to set up a VPN between the sites so that you can route all your Active Directory replication traffic over that VPN, as well as all your administrative traffic when you are managing the remote machines.
Best Answer
As mentioned in other answers:
Old-style database mirroring and new-style AlwaysOn need threads, and you will definitely run out of threads with 2000 databases. I vaguely recall that the practical limit is well below 200 databases. (There is a white paper on this somewhere, but I'm too lazy to look for it right now and this answer is already super-long.) Of course, that's 200 databases per instance. Theoretically, you could start up 20 instances and run 100 databases on each instance. Managing all of that would be a hassle, and I suspect that managing the memory between all of those instances would be a headache.
SQL Server replication (replicating tables (or subsets of tables), rather than files) isn't really intended for DR. It's hard to set up and administer, even for a few databases. You may need to alter your data model to get it to work, and that might mean changes to your app. You'd need an automated way to apply the same replication configuration to each of your 2000 (presumably identical or nearly identical) databases. The stored procedures that you need to use to configure replication are messy. Administering 2000 databases that are configured for replication via the GUI would be a nightmare. When/if you fail over, you might need to make changes to get everything working again. Failover time isn't when you want to be doing any finicky changes or work that you can avoid - you want to get everything back up and running ASAP. It just seems like a pack of problems.
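To give a flavour of how messy that scripting gets, here is roughly what one database's publication and subscription setup looks like at its most minimal (all names are placeholders, and this skips distributor configuration, snapshot/log reader agents, security settings, and plenty more) - and you'd have to repeat it, programmatically, for every one of the 2000 databases:

```sql
USE [CustomerDB];

-- Enable the database for publishing.
EXEC sp_replicationdboption @dbname = N'CustomerDB', @optname = N'publish', @value = N'true';

-- Create a transactional publication and add one table to it.
EXEC sp_addpublication @publication = N'CustomerDB_Pub', @status = N'active';
EXEC sp_addarticle @publication = N'CustomerDB_Pub', @article = N'Orders',
                   @source_owner = N'dbo', @source_object = N'Orders';

-- Push a subscription to the DR server.
EXEC sp_addsubscription @publication = N'CustomerDB_Pub', @subscriber = N'DRSERVER',
                        @destination_db = N'CustomerDB', @subscription_type = N'push';
```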
Replication between SAN storage units can be expensive, especially if you are talking about hardware from an outfit like EMC. Once you start with a vendor, you are pretty much married to them for upgrades, maintenance, additional space, etc.
Suggestion #1: Have you looked at something like Steeleye's DataKeeper? It's a software-based replication product that runs on your servers that leverages Windows Failover Clustering. I've never actually used it and I have no connection to the company other than sitting through a few dog-and-pony shows. It looks ideal for your situation.
Suggestion #2: If it were me and I had absolutely no budget, I would look at some home-grown log shipping system. I doubt that the built-in log shipping will deal with 2000 databases very well. It's not that hard to write a log shipping system, and it can address all of the issues specific to your environment. (For example, maybe you need to send the files via sftp to your DR site.)
Basically, there are three parts to the system. Each part needs to run on a regular schedule:
One part takes the transaction log backups, putting the tlog backup files for each database into a different folder (for file system scaling) - a minimal sketch of this backup loop appears after these parts. I would not use the maintenance wizard for this; I've seen it go wonky too many times and start skipping databases and generally misbehaving. If you want to provide a 30 minute guarantee, maybe this runs every 15 minutes.
One part copies the backup files from the staging area to your DR site. This could be something as simple as a robocopy CMD file if you have a VPN to your DR site. You could write a package or a PowerShell script if you need something fancier (sftp or ssh/scp, or maybe zip/unzip if you don't have built-in backup compression). This can run more frequently, maybe every 5 minutes, to make sure it gets everything. Once something is copied offsite, it is "safe".

One part restores the copied tlog backups on the DR server, leaving each database in NORECOVERY (or STANDBY) so the next log backup can still be applied and the database can be recovered quickly when you need it. This also runs on a schedule, maybe every 15 minutes.
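Here is a minimal sketch of that first part, the backup loop, assuming the databases are in the full recovery model and that there is a pre-created folder per database under a staging path; the path and file-naming convention are made up for illustration:

```sql
-- Back up the transaction log of every online, FULL-recovery user database
-- into its own staging folder (folders assumed to exist already).
DECLARE @db sysname, @sql nvarchar(max);

DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE recovery_model_desc = N'FULL'
      AND state_desc = N'ONLINE'
      AND database_id > 4;              -- skip master/tempdb/model/msdb

OPEN dbs;
FETCH NEXT FROM dbs INTO @db;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'BACKUP LOG ' + QUOTENAME(@db)
             + N' TO DISK = N''D:\LogShipStage\' + @db + N'\' + @db + N'_'
             + REPLACE(CONVERT(char(19), GETDATE(), 126), N':', N'') + N'.trn'';';
    EXEC (@sql);    -- record success/failure in your audit table here
    FETCH NEXT FROM dbs INTO @db;
END

CLOSE dbs;
DEALLOCATE dbs;
```

In practice you'd probably split this across several jobs so 2000 log backups aren't forced to run strictly one after another.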
You want tables that audit all three steps, some reports/scripts that show you what has happened (is a particular database running on your primary or secondary site? has any database on the secondary not seen a tlog restore in, say, two hours?) and an alerting scheme.
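As one example of the kind of check those reports would run: msdb already records restore activity on the secondary, so a query along these lines can flag databases that haven't seen a log restore recently (your own audit tables would feed similar checks for the backup and copy steps):

```sql
-- On the secondary: which user databases have not had a transaction log
-- restore ('L') applied in the last two hours?
SELECT d.name,
       MAX(rh.restore_date) AS last_log_restore
FROM sys.databases AS d
LEFT JOIN msdb.dbo.restorehistory AS rh
       ON rh.destination_database_name = d.name
      AND rh.restore_type = 'L'
WHERE d.database_id > 4
GROUP BY d.name
HAVING MAX(rh.restore_date) IS NULL
    OR MAX(rh.restore_date) < DATEADD(HOUR, -2, GETDATE());
```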
On top of that, I'd also want to be able to choose a particular database to fail over as well as being able to fail over everything. Being able to pick a db to fail over allows for easy testing (you fail over a test database, not a customer's database) and might give you a rudimentary load-balancing scheme if you get into scaling trouble. You will also want an automated way to "resync" between primary and secondary (take a full backup from the primary and apply it to the secondary, start the tlogs flowing, etc.) These features might be better for a release 2.0.
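The "resync" piece doesn't need to be fancy either; at its core it's just a full backup from the primary restored WITH NORECOVERY on the secondary so the log restores can resume. A rough sketch, with a placeholder share and database name:

```sql
-- On the primary: take a full backup to a location the DR site can reach.
-- COPY_ONLY keeps this reseed backup from resetting the differential base
-- used by any other backup scheme you may have.
BACKUP DATABASE [CustomerDB]
    TO DISK = N'\\drshare\Reseed\CustomerDB_full.bak'
    WITH COPY_ONLY, INIT;

-- On the secondary: restore it without recovery so log restores can resume.
RESTORE DATABASE [CustomerDB]
    FROM DISK = N'\\drshare\Reseed\CustomerDB_full.bak'
    WITH NORECOVERY, REPLACE;
```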
(Everyone has forgotten that the earliest tlog shipping that MS supported was implemented via a few scripts that you could download and run on SQL 7.0. There was no GUI; the UI was some SQL reports and a few stored procedures.)
Other than writing a little tsql code, the challenges here are:
Changing to the full recovery model (it sounds to me like you might be running in the simple recovery model) and handling the increases in storage usage that are likely for log backups, increased database sizes, and what-have-you. (A small sketch of switching databases to FULL follows after this list.)
Making sure that your storage system can handle the load of frequent tlog backups and copying them to a DR site in a timely way. IOW, if you have 2000 databases and want to guarantee data up until the last hour, you need to be able to take one transaction log backup of each of those 2000 databases and get it onto networked storage (somewhere that isn't in your primary server) every hour.
Making sure that everything keeps up generally.
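For the recovery model change mentioned above, the "little tsql code" really is little. A sketch that flips every user database still in SIMPLE over to FULL (each one then needs a full backup before its log backups mean anything):

```sql
-- Build and run ALTER DATABASE statements for every user database in SIMPLE recovery.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'ALTER DATABASE ' + QUOTENAME(name) + N' SET RECOVERY FULL;' + NCHAR(13) + NCHAR(10)
FROM sys.databases
WHERE recovery_model_desc = N'SIMPLE'
  AND database_id > 4;

EXEC (@sql);
```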
After I got all of that working, I'd start looking at automating failover, how to tell my web sites where the live version of a particular customer's database was running, etc. If you aren't running clustered systems, making sure that you keep all of the logins/passwords, jobs, linked servers, etc. etc. in sync is a PITA.