Architecture – How to Design a High-Availability Application


We currently have a classic n-tier application: DB / web service / front-end. It has other components, but that's the basic layout.

We want to improve the application availability for 3 main reasons:

Our host sometimes experiences outages (as they all do), and we want to minimize the impact on our customers, so for instance, they would switch to datacenter B if datacenter A is down.
When we upgrade the version, we shut down the site for maintenance, and it usually takes a few hours (migration scripts, etc). We'd like the users to have a more seamless transition, with as little downtime as possible (they use server B while server A is being upgraded).
Optionally, our customers are located around the world, and we want them to have the best experience possible despite their possibly crappy connections (anyone who worked with Indian devs should know what I mean). Ideally, we'd like to be able to plug a server in their office (or use a datacenter near their city), and it would integrate seamlessly into our architecture.

We don't remotely need 99% availability, not even 95%. It's a documents management app. Nobody cares. But since migrations can take a while, and there are customers around the world, sometimes we prevent a customer from working for most of their day.

For the SQL part, even though we don't have "proper" DBAs, we know about the SQL possibilities: replication, mirroring, etc. On the DB side, it's pretty easy to find resources for this. What is harder is everything else: storing sessions, the code, etc. If my webservice server goes down, how does my UI know it must switch? How are my sessions persisted across servers?

Unfortunately, none of us have experience in this area, and we don't even know where to start looking. Are there best practices for this? Design patterns? Libraries (which should be free because we don't have money)?

We're using ASP.Net and SQL Server, with a WCF webservice in the middle. We have a bunch of Windows services lying around, but they are not mission-critical, and I assume the methods to deal with the website will be applicable to the services.

I understand that most cloud platforms provide a built-in system for this, but cloud hosting is a no-go because of our sysadmins, who want to manage everything themselves and not rely on anyone.

Best Answer

You need to clarify what kind of high availability you're looking for. There are highly available applications that I run that need to be up 95% of the time. There are others that need to run at 99%. I can think of life-or-death scenarios that require 100% uptime. Just those three have drastically different approaches and costs.

Just guessing based on your needs and a 95-99% uptime SLA:

Database migrations should be able to happen in real time for most changes. Practice evolutionary database design. For changes that do require more invasive behavior, you have a few options. One is to take the downtime. If possible, running your service in read-only mode might work. For full functionality, I've been wanting to try ScaleArc for a while. It looks like a really slick tool for scaling and resiliency in the SQL Server world.
Putting servers inside your customers' sites is a recipe for an unmanageable disaster unless you've got world-class deployment strategies (which, based on your description of your migrations, you don't have yet). Don't push cloud services on-prem because you have performance problems. Solve the performance problems now and then you won't have to deal with costlier ones down the road.
Your state server should be a database of some sort. Follow their HA guidelines. You can use SQL Server for this, since you already have it available to you.
Speaking of databases, replication does not enable HA. In fact, SQL Replication will cause you headaches around every turn (speaking from experience with multiple node replication scenarios). Mirroring can work, but last I remember, SQL clustering takes 1-5 minutes to fail over to the new server. I've heard good things about AlwaysOn, but I'm still suspicious given Microsoft's track record. Something like ScaleArc might be more help here.
Your web server should be stateless. Spin up three or four and put them behind a load balancer. That solves your uptime worries there. As Frederik mentioned earlier, you can also do rolling deployments this way.
Your web service should probably be stateless. If not, see if you can break it apart into stateless and stateful bits. Putting multiple instances of it behind the same load balancer again solves uptime worries and enables more interesting deployment scenarios (e.g. blue/green deployments).
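To make the "stateless servers behind a load balancer" points above concrete, here is a minimal, language-neutral sketch in Python rather than ASP.NET. Everything in it is an assumption for illustration: `SessionStore` stands in for a shared state server (for example a SQL Server table), and `WebServer`, `LoadBalancer`, and `rolling_deploy` are hypothetical names, not part of any real product. The point it tries to show is that once session data lives outside the web servers, any healthy server can answer any request, so one server can be drained and upgraded while the others keep serving.

```python
from dataclasses import dataclass


class SessionStore:
    """Stand-in for a shared state server (e.g. a database table). Hypothetical."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        # Return a copy so servers never share in-memory state by accident.
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, session):
        self._data[session_id] = dict(session)


@dataclass
class WebServer:
    name: str
    store: SessionStore
    healthy: bool = True

    def handle(self, session_id, doc):
        # Stateless handling: load the session, change it, save it back.
        session = self.store.load(session_id)
        session.setdefault("opened", []).append(doc)
        self.store.save(session_id, session)
        return f"{self.name}: {len(session['opened'])} document(s) opened"


class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self._next = 0

    def route(self, session_id, doc):
        # Round-robin over the pool, skipping servers marked unhealthy.
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server.healthy:
                return server.handle(session_id, doc)
        raise RuntimeError("no healthy servers")


def rolling_deploy(lb, upgrade):
    # Upgrade one server at a time; the rest of the pool keeps serving.
    for server in lb.servers:
        server.healthy = False  # drain: the balancer stops routing here
        upgrade(server)
        server.healthy = True   # put it back into rotation
```

With two servers sharing one store, a session started on server A continues on server B if A is marked unhealthy, because the session counter comes from the store and not from either server's memory. A real load balancer would detect health through periodic checks instead of a flag set by hand.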

Unlike Frederik, I won't call your cloud paranoia unwarranted. It depends on your uptime requirements. It is conceivable that a service would have to run in multiple data centers operated by different providers in different countries for redundancy's sake. Given your current state, however, I'd agree that AWS, Azure, or similar are probably safe bets for your company.
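As an illustration of the "evolutionary database design" advice in the list above, one common reading is the expand/contract approach: make additive schema changes first, backfill in small batches while the old code keeps running, and remove the old structure only once nothing reads it. The sketch below uses Python's built-in `sqlite3` as a stand-in for SQL Server; the `documents` table, the `title` and `display_name` columns, and all function names are assumptions made up for this example.

```python
import sqlite3


def make_demo_db():
    # Hypothetical starting schema: old code only knows the `title` column.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO documents (title) VALUES (?)",
        [(f"doc{i}",) for i in range(1, 6)],
    )
    conn.commit()
    return conn


def expand(conn):
    # Step 1 (expand): a purely additive change. Old code that reads and
    # writes `title` is unaffected, so the site can stay up.
    conn.execute("ALTER TABLE documents ADD COLUMN display_name TEXT")
    conn.commit()


def backfill(conn, batch_size=2):
    # Step 2: copy data across in small batches, committing after each one,
    # so no single statement holds the table for long.
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE documents SET display_name = title "
            "WHERE id IN (SELECT id FROM documents "
            "             WHERE display_name IS NULL AND title IS NOT NULL "
            "             LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def read_name(conn, doc_id):
    # Step 3: new code prefers the new column and falls back to the old one
    # for rows the backfill has not reached yet.
    row = conn.execute(
        "SELECT COALESCE(display_name, title) FROM documents WHERE id = ?",
        (doc_id,),
    ).fetchone()
    return row[0] if row else None

# Step 4 (contract), not shown: drop `title` in a later release, once no
# deployed code reads or writes it.
```

Each step can ship in its own release, which is what lets servers running the old and the new code coexist during a rolling deployment. Changes that cannot be split this way are the ones that still need downtime or a read-only window.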

Related Solutions

Multi-Tenant Application – Architecture Design Patterns

It is not scalable to maintain one fork of your software per client, regardless of whether you try to maintain this as multiple repositories or as multiple branches in one repository. You will be unable to apply cross-cutting changes to all your clients, except with extraordinary effort. Common cross-cutting changes are refactorings, redesigns, or security fixes.

The solution is twofold:

Recognize that having an individual deployment for a client is not the same as running an individual project for that client. You can deploy with different configurations from the same codebase.
Create client-specific variants through feature toggles, build-time or run-time configuration, plugin systems, and dependency injection.

Do not hardcode modifications. If you have more than one client, you will regret this. Instead, make the engine configurable so that client-specific plugins can be loaded. When you want to modify the behavior, refactor your core engine to support a plugin there, for example by introducing a new interface. Then provide an implementation for that interface as customer-specific code. When you see that multiple clients might need that functionality, you can move that code into the core, but possibly disable it for clients that don't need it.
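A minimal sketch of that idea, assuming a made-up extension point: the core engine defines an interface, each client supplies an implementation, and per-client configuration selects the plugin and toggles optional features. `InvoiceFormatter`, `Engine`, the `PLUGINS` registry, and the `show_tax_note` toggle are all hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod


class InvoiceFormatter(ABC):
    """Extension point owned by the core engine (hypothetical)."""

    @abstractmethod
    def format(self, amount: float) -> str:
        ...


class DefaultFormatter(InvoiceFormatter):
    # Lives in the core; used when a client configures nothing.
    def format(self, amount: float) -> str:
        return f"{amount:.2f}"


class ClientAFormatter(InvoiceFormatter):
    # Lives with client A's code; the core never imports it by name.
    def format(self, amount: float) -> str:
        return f"EUR {amount:.2f}"


# Registry mapping configuration values to plugin classes. A real system
# might discover plugins dynamically; a dict keeps the sketch short.
PLUGINS = {"default": DefaultFormatter, "client-a": ClientAFormatter}


class Engine:
    def __init__(self, config):
        # Dependency injection: the engine receives its formatter through
        # configuration rather than hardcoding a client-specific branch.
        self.formatter = PLUGINS[config.get("formatter", "default")]()
        self.features = set(config.get("features", []))

    def render_total(self, amount):
        text = self.formatter.format(amount)
        # Feature toggle: shared functionality that is off unless enabled.
        if "show_tax_note" in self.features:
            text += " (incl. tax)"
        return text
```

For example, `Engine({}).render_total(10)` uses the core default, while `Engine({"formatter": "client-a", "features": ["show_tax_note"]})` produces client A's variant from the same codebase. No `if client == ...` branch appears in the engine, which is what keeps cross-cutting changes cheap.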

The shared core of your software is crucial for making modifications with low effort. As it evolves, make sure that it stays well-designed. This is your framework for building client-specific variants.

For the repository layout, this hinges on whether you need to give access to the development repository to clients. If not, consider keeping all work in a single monorepo. For example:

crm/
  build-tools/
  core-engine/  (contains source, tests, default assets)
  client-a/     (contains config and client-specific code, tests, assets)
  client-b/
  ...

If clients need access to their repos, then it might be better to create separate repos for the core engine and the client-specific adaptions. It may be best to treat the core engine as a library that is used by client-specific apps. However, separate repositories make it more difficult to apply backwards-incompatible changes to your core, like changing a method's signature or renaming a class.
