Task Distribution Algorithms – Balancing Workloads

algorithms, scalability

I'm looking for an algorithm to either use directly or as a starting point for load balancing.

Environment:
We have ~7 job types that can be scheduled at any time by our users. Some jobs are fast, others are slow (lots of data processing). We have a single instance of a "job processor" that discovers jobs that have been scheduled and then executes them. The "job processor" runs up to 5 jobs at a time, one per thread.

The problem is that one job can consume so many resources that the other 4 jobs don't get processed, and even worse, the other scheduled jobs are delayed for long periods of time.

Some jobs can be scheduled as "run immediately" which makes them next in line.

Solution:
Add more instances of the "job processor". We have a large VM server, and IT is rolling out 3 VMs, each of which will run an instance of this "job processor".

By default that's going to help, but I believe there should be more thought behind it.

My solution:
In addition to making the "job processors" scale horizontally, I think there needs to be a way to determine which jobs an instance will grab based on the instance's current load, while also allowing for a bias.

I suggest we gather statistics for each job type (average run time, etc.) and give each type a score of 1-5 (5 being long running). Each instance would determine its current load from the total score of the jobs it's currently running, and then factor in its bias. For example, I think we should be able to set one instance to be biased toward small jobs so it avoids larger jobs, while another instance is biased toward medium jobs, and so on.
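To make that concrete, here is a rough Python sketch of the scoring and bias idea. The job type names, scores, MAX_LOAD budget, and the should_pull helper are all invented placeholders for illustration, not anything that exists in our system:

```python
# Rough sketch only: job types, scores, and MAX_LOAD are invented placeholders.
JOB_TYPE_SCORE = {
    "thumbnail": 1,         # fast jobs
    "email_batch": 2,
    "daily_report": 3,
    "data_import": 4,
    "quarterly_report": 5,  # long running
}

MAX_LOAD = 10  # assumed per-instance budget, e.g. 5 threads at an average score of 2


class ProcessorInstance:
    def __init__(self, name, preferred_scores):
        self.name = name
        self.preferred_scores = preferred_scores  # bias, e.g. {1, 2} = small jobs
        self.running = []                         # scores of jobs currently executing

    @property
    def current_load(self):
        return sum(self.running)

    def should_pull(self, job_type):
        """Decide whether this instance should claim the job from the queue."""
        score = JOB_TYPE_SCORE[job_type]
        if self.current_load + score > MAX_LOAD:
            return False  # no headroom left
        # Biased instances prefer their own band, but take anything when idle
        # so the queue never stalls.
        return score in self.preferred_scores or self.current_load == 0


small_vm = ProcessorInstance("vm-1", preferred_scores={1, 2})
print(small_vm.should_pull("quarterly_report"))  # True only because it is idle
print(small_vm.should_pull("thumbnail"))         # True: matches the bias
```

The intent is that a heavily biased instance still takes anything when it is completely idle, so the queue keeps moving even if no "preferred" work is available.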

I'm looking for advice on how to go about this. Jobs can consume large amounts of time, CPU, and/or memory. My goal is to make sure each instance only pulls down work it's capable of doing, while keeping the scheduled job queue moving along as quickly as possible.

One of the other devs suggested we leave the "job processors" alone and have them just pull whatever is next in the queue, i.e. "round robin". My concern is that a single instance could end up pulling down too many large jobs and struggle to get them done while the other instances sit idle.

Best Answer

Part of what you are looking for is a "priority queue". At previous employers we did a very primitive version of this. My heuristic was to only allow some processors to handle short-running jobs (short jobs could take minutes), while others handled the longer-running ones (the quarterly report could take almost 2 days to process). This guaranteed that short jobs always had processing time available. I also used a scoreboard that listed the jobs ready to be run; the first processor able to handle a task would pick it up and run it single threaded (they were cheap computers that hadn't been depreciated and so could not be discarded). Many folks use the opposite approach: a scheduler that tells processors which work unit to do next. My advice would be to have each instance run a single task at a time - this drastically simplifies the scheduling.
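As a sketch of that scoreboard/pull model (the Scoreboard class and its post/claim methods are made-up names for illustration, not a real library), something like this captures the shape of it, including "run immediately" jobs jumping the line:

```python
import heapq
import itertools
import threading

_counter = itertools.count()  # tie-breaker so the heap never compares job dicts


class Scoreboard:
    """Jobs ready to run; immediate jobs sort to the front of the heap."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def post(self, job, immediate=False):
        """Publish a ready job; 'run immediately' jobs get the lowest priority value."""
        priority = 0 if immediate else 1
        with self._lock:
            heapq.heappush(self._heap, (priority, next(_counter), job))

    def claim(self, can_handle):
        """Pop the highest-priority job this processor can handle, if any."""
        with self._lock:
            skipped = []
            claimed = None
            while self._heap:
                entry = heapq.heappop(self._heap)
                if can_handle(entry[2]):
                    claimed = entry[2]
                    break
                skipped.append(entry)
            for entry in skipped:
                heapq.heappush(self._heap, entry)
            return claimed


# A short-job-only processor polls with a restrictive predicate,
# while a general processor accepts anything:
#   board.claim(lambda job: job["class"] == "SHORT")
#   board.claim(lambda job: True)
```

The important property is that processors pull work they are allowed to handle rather than being handed work, so a box stuck on a long job never blocks the short-job lane.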

Scheduling arbitrary jobs of arbitrary lengths is a hard problem in distributed processing. Almost every decision is going to involve simulating lots of runs, which is one of the quirks of queuing theory - and queuing theory is what this is ultimately based on.

One of the other devs suggested we leave the "job processors" alone and have them just pull whatever is next in the queue, i.e. "round robin". My concern is that a single instance could end up pulling down too many large jobs and struggle to get them done while the other instances sit idle.

This needs simulation to answer. My earlier scheme used something very similar. If you have stats on previous job runs, you can model it in Excel. I've picked up this book from another post recommending it and am looking to learn some techniques to better answer problems like the one you're describing. Actual numbers trump everything, so gather data and run simulations based on them.
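For example, a crude script-based simulation along these lines (instead of Excel) can compare "everyone pulls the next job" against "one worker reserved for short jobs". The job types, mean durations, and arrival rate below are invented placeholders; feed in your own historical stats:

```python
import random

JOB_STATS = {            # job_type: (mean_minutes, share_of_traffic) - placeholders
    "short": (5, 0.7),
    "medium": (45, 0.2),
    "long": (240, 0.1),
}


def simulate(num_workers, reserve_one_for_short=False, jobs=500, seed=1):
    rng = random.Random(seed)

    # Generate arrivals: (arrival_time, job_type, duration), one job every ~2 minutes.
    arrivals, t = [], 0.0
    types = list(JOB_STATS)
    weights = [share for _, share in JOB_STATS.values()]
    for _ in range(jobs):
        t += rng.expovariate(1 / 2.0)
        jtype = rng.choices(types, weights=weights)[0]
        duration = rng.expovariate(1 / JOB_STATS[jtype][0])
        arrivals.append((t, jtype, duration))

    # worker_free[i] is the time when worker i next becomes idle.
    worker_free = [0.0] * num_workers
    waits = {k: [] for k in JOB_STATS}
    for arrival, jtype, duration in arrivals:
        if reserve_one_for_short and jtype != "short":
            candidates = range(1, num_workers)  # worker 0 only takes short jobs
        else:
            candidates = range(num_workers)
        i = min(candidates, key=lambda w: worker_free[w])
        start = max(arrival, worker_free[i])
        worker_free[i] = start + duration
        waits[jtype].append(start - arrival)

    # Average wait (minutes) per job type.
    return {k: round(sum(v) / len(v), 1) if v else 0.0 for k, v in waits.items()}


print("any worker takes next job:", simulate(num_workers=3))
print("one worker reserved short:", simulate(num_workers=3, reserve_one_for_short=True))
```

Run it with your real numbers and the average-wait figures will tell you whether reserving capacity for short jobs is worth the idle time it costs.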