Architecture – How to Design a High-Availability Application


We currently have a classic n-tier application: DB / web service / front-end. It has other components, but it's the basic layout.

We want to improve the application availability for 3 main reasons:

  1. Our host sometimes experiences outages (as they all do), and we want to minimize the impact on our customers. For instance, they would switch to datacenter B if datacenter A is down.
  2. When we upgrade the version, we shut down the site for maintenance, and it usually takes a few hours (migration scripts, etc). We'd like the users to have a more seamless transition, with as little downtime as possible (they use server B while server A is being upgraded).
  3. Optionally, our customers are located around the world, and we want them to have the best experience possible despite their possibly crappy connections (anyone who has worked with Indian devs should know what I mean). Ideally, we'd like to be able to plug a server in their office (or use a datacenter near their city), and it would integrate seamlessly into our architecture.

We don't remotely need 99% availability, not even 95%. It's a document management app. Nobody cares. But since migrations can take a while, and there are customers around the world, sometimes we prevent a customer from working for most of their day.

For the SQL part, even though we have no "proper" DBAs, we know about the SQL possibilities: replication, mirroring, etc. On the DB side, it's pretty easy to find resources for this. What is harder is everything else: storing sessions, the code, etc. If my webservice server goes down, how does my UI know it must switch? How are my sessions persisted across servers?

Unfortunately, none of us have experience in this area, and we don't even know where to start looking. Are there best practices for this? Design patterns? Libraries (which should be free because we don't have money)?

We're using ASP.Net and SQL Server, with a WCF webservice in the middle. We have a bunch of Windows services lying around, but they are not mission-critical, and I assume the methods to deal with the website will be applicable to the services.

I understand that most cloud platforms provide a built-in system for this, but cloud hosting is a no-go because of our sysadmins, who want to manage everything themselves and not rely on anyone.

Best Answer

You need to clarify what kind of high availability you're looking for. There are highly available applications that I run that need to be up 95% of the time. There are others that need to run at 99%. I can think of life-or-death scenarios that require 100% uptime. Just those three have drastically different approaches and costs.
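To make the gap between those targets concrete, here is a minimal sketch of the arithmetic, assuming a 365-day year and counting every kind of downtime (planned or not) against the budget. The function name `allowed_downtime_hours` is made up for illustration.

```python
def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted per period at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


# Compare the three targets mentioned above over one year.
for target in (95.0, 99.0, 100.0):
    hours = allowed_downtime_hours(target)
    print(f"{target:5.1f}% uptime allows {hours:6.1f} hours of downtime per year")
```

At 95% the yearly budget is roughly 438 hours (about 18 days), at 99% it is about 87.6 hours, and at 100% it is zero. A few multi-hour maintenance windows fit comfortably in the first budget and eat a large share of the second.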

Just guessing based on your needs and a 95-99% uptime SLA:

  • Database migrations should be able to happen in real time for most changes. Practice Evolutionary database design. For changes that do require more invasive behavior, you have a few options. One is to take the downtime. If possible, running your service in read-only mode might work. For full functionality, I've been wanting to try ScaleArc for a while. It looks like a really slick tool for scaling and resiliency in the SQL Server world.
  • Putting servers inside your customers' sites is a recipe for an unmanageable disaster unless you've got world-class deployment strategies (which, based on your description of your migrations, you don't have yet). Don't push cloud services on-prem because you have performance problems. Solve the performance problems now and then you won't have to deal with costlier ones down the road.
  • Your state server should be a database of some sort. Follow their HA guidelines. You can use SQL Server for this, since you already have it available to you.
  • Speaking of databases, replication does not enable HA. In fact, SQL Replication will cause you headaches around every turn (speaking from experience with multiple node replication scenarios). Mirroring can work, but last I remember, SQL clustering takes 1-5 minutes to fail over to the new server. I've heard good things about AlwaysOn, but I'm still suspicious given Microsoft's track record. Something like ScaleArc might be more help here.
  • Your web server should be stateless. Spin up three or four and put them behind a load balancer. That solves your uptime worries there. As Frederik mentioned earlier, you can also do rolling deployments this way.
  • Your web service should probably be stateless. If not, see if you can break it apart into stateless and stateful bits. Putting multiple instances of it behind the same load balancer again solves uptime worries and enables more interesting deployment scenarios (e.g. blue/green deployments).
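The load balancer and rolling deployment idea from the last two points can be sketched as a toy model. This is not a real load balancer: `Server`, `LoadBalancer`, and `rolling_deploy` are hypothetical names, the `healthy` flag stands in for a real health-check probe, and changing `version` stands in for the actual upgrade. It only works because the servers are assumed to hold no session state of their own.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    """A stateless web or web-service instance (hypothetical model)."""
    name: str
    version: str
    healthy: bool = True  # in practice, set by a periodic health-check probe


class LoadBalancer:
    """Round-robin routing over the servers currently marked healthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._counter = count()

    def route(self) -> Server:
        live = [s for s in self.servers if s.healthy]
        if not live:
            raise RuntimeError("no healthy servers available")
        return live[next(self._counter) % len(live)]


def rolling_deploy(lb, new_version, send_request):
    """Upgrade one server at a time while the others keep serving traffic."""
    for server in lb.servers:
        server.healthy = False        # drain: the balancer stops routing here
        send_request()                # a request arriving mid-upgrade still succeeds
        server.version = new_version  # stand-in for the real upgrade step
        server.healthy = True         # put the server back into rotation


servers = [Server("web1", "1.0"), Server("web2", "1.0"), Server("web3", "1.0")]
lb = LoadBalancer(servers)
served_by = []
rolling_deploy(lb, "1.1", lambda: served_by.append(lb.route().name))
print(served_by)                     # which server handled each mid-upgrade request
print([s.version for s in servers])  # every server ends up on the new version
```

During the rollout, old and new versions serve traffic side by side, so the database schema has to work with both. That is where the evolutionary database design point above comes in.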

Unlike Frederik, I won't call your cloud paranoia unwarranted. It depends on your uptime requirements. It is conceivable that a service would have to run in multiple data centers operated by different providers in different countries for redundancy's sake. Given your current state, however, I'd agree that AWS, Azure, or similar are probably safe bets for your company.
