C# .NET Scalability – Best Ways to Horizontally Scale .NET Back-End Applications

Tags: c#, load-balancing, .net, scalability

Given a workload with many long-running, CPU/IO-intensive operations (e.g. producing multi-GB text files from database reads and business rules), what is the best way to balance .NET application load across a group of Windows Server 2012 R2 VMs?

Ideally, the approach would allow for adding/removing/patching nodes as needed. It's pretty much a 24/7 operation with little opportunity for maintenance windows.

Background

We are in the process of migrating millions of lines of COBOL code from an IBM mainframe to C#. The current workload is batch-centered, with many jobs that fire up at scheduled times to perform units of work of various sizes.

Wherever possible, we'll look to make the converted processes more real-time and event-driven. However, because of dependence on existing inputs/outputs to/from many upstream and downstream partners, and a desire to get from one platform to the other in the shortest time possible with minimal disruption, we'll have constraints that prevent a completely greenfield approach.

Due to the large volumes of constantly changing data involved and the interconnectedness of that data with other legacy systems, moving to a cloud-based infrastructure is probably not feasible.

Approaches we're considering

JMS queues – We have an enterprise-class, JMS-compliant ESB. We could queue the work and have consumers on each VM pull work based on resource availability.

Task coordinator – We've considered creating a custom workload manager that would monitor servers and decide where to send work. Something along these lines.

Best Answer

Here is an excerpt from a related question, specifically about scaling.

Scaling across machines can be done in various ways, but what you likely want is partitioning. The easiest approach is to hash on some request value to decide which server receives the work. For instance, if your customer ID is an integer (or can be consistently reduced to one, e.g. with GetHashCode) and you have 3 machines processing tasks, a simple modulus decides which customer goes to which machine: machineNumber = customerId % 3. Customer IDs 1, 4, 7, etc. will always go to machine 1.

However, if machine 1 goes down, 1/3 of the customer requests will not get processed until it comes back up. Since these are long-running imports anyway, that's likely not a big deal. The load will also not be distributed evenly, since some customers are usually heavier users than others. Again, probably not a huge deal; measure to make sure.
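As a minimal sketch, the modulus scheme above could look like this (the node count of 3 matches the example; the class and method names are just illustrative):

```csharp
using System;

class Partitioner
{
    const int NodeCount = 3; // number of machines processing tasks

    // Map a customer ID to a node index in [0, NodeCount).
    public static int NodeFor(int customerId) => customerId % NodeCount;

    static void Main()
    {
        // Customers 1, 4, 7 all land on machine 1, as in the text.
        foreach (var id in new[] { 1, 4, 7, 2, 3 })
            Console.WriteLine($"Customer {id} -> machine {Partitioner.NodeFor(id)}");
    }
}
```

Note the drawback described above: the mapping is fixed, so if a node is removed the IDs that hash to it simply wait, and changing NodeCount remaps nearly every customer.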

Another way that is resilient to failure is to use a distributed directory, which keeps track of which node currently owns which customer. Project Orleans uses a mechanism like this. It allows nodes to fail and customers to be transitioned to another node. Before allocating a new customer to a node, you can also query the nodes to see which is least loaded. However, I'm not aware of a pre-built component for this purpose, and building it yourself is perilous to your time.
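To make the idea concrete, here is a deliberately simplified, single-process sketch of such a directory: least-loaded assignment plus reassignment on node failure. All names (NodeDirectory, Assign, FailNode) are hypothetical, and a real version would need the hard parts this skips entirely, such as replication, failure detection, and consensus.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NodeDirectory
{
    readonly Dictionary<string, int> load = new Dictionary<string, int>();   // node -> customers assigned
    readonly Dictionary<int, string> owner = new Dictionary<int, string>();  // customer -> owning node

    public NodeDirectory(IEnumerable<string> nodes)
    {
        foreach (var n in nodes) load[n] = 0;
    }

    // Return the owning node, allocating on the least-loaded node if new.
    public string Assign(int customerId)
    {
        if (owner.TryGetValue(customerId, out var node)) return node;
        node = load.OrderBy(kv => kv.Value).First().Key;
        load[node]++;
        owner[customerId] = node;
        return node;
    }

    // Remove a failed node and transition its customers to surviving nodes.
    public void FailNode(string node)
    {
        load.Remove(node);
        var orphaned = owner.Where(kv => kv.Value == node)
                            .Select(kv => kv.Key).ToList();
        foreach (var id in orphaned)
        {
            owner.Remove(id);
            Assign(id);
        }
    }
}
```

Unlike the modulus scheme, the customer-to-node mapping here is state rather than arithmetic, which is exactly what makes failover possible and also what makes the real distributed version hard to build.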

Since you mentioned queues, you might look at a pattern called Competing Consumers, which is available in various queue systems. In that case, you set up multiple machines or VMs (nodes) to service the queue. Whenever a new message comes through, the first node to take it gets the message and is expected to process it.
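The pattern can be demonstrated in-process with a BlockingCollection standing in for the broker; in your setup the shared queue would be the JMS-compliant ESB, with one consumer process per VM instead of one Task per worker:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class CompetingConsumers
{
    static void Main()
    {
        var queue = new BlockingCollection<string>();

        // Three competing consumers; whichever is idle takes the next message.
        var workers = new Task[3];
        for (int i = 0; i < workers.Length; i++)
        {
            int workerId = i;
            workers[i] = Task.Run(() =>
            {
                // GetConsumingEnumerable blocks until a message arrives
                // and completes once the queue is marked as done.
                foreach (var msg in queue.GetConsumingEnumerable())
                {
                    Console.WriteLine($"Worker {workerId} processing {msg}");
                    Thread.Sleep(100); // simulate work
                }
            });
        }

        for (int j = 0; j < 9; j++) queue.Add($"job-{j}");
        queue.CompleteAdding();
        Task.WaitAll(workers);
    }
}
```

A nice property for your add/remove/patch requirement: consumers can join or leave at any time without reconfiguring anything, since each message goes to exactly one of whatever consumers happen to be listening.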

The Issues and Considerations section of the competing consumers link has some excellent points regarding such a queue/workload system.