Fair distributed task scheduling with RabbitMQ

message-queue queue rabbitmq scheduling

Problem: a single client of our system can flood all available resources with a massive workload. Assume we have only one queue, and anyone can schedule any amount of work in it. Any other client that subsequently submits a small amount of work has to wait until the first client's tasks have been fully processed. Since the system is asynchronous, there is no risk of denial of service. The problem we'd like to solve is to give each client a fair share of the processing at any time, even if a few clients have submitted a lot of work. It is a distributed system with a good number of workers, and all work is chunked into small pieces so tasks flow through the system smoothly.

This seems like a very common problem to me, and I'm a bit alarmed that I can't find a simple solution. It is similar to process scheduling in an operating system: processes are given processing slots in round-robin fashion, so no single process can monopolize the CPU by scheduling a lot of work in advance.

One solution would be to use a particular queuing topology: one queue per user, all feeding a small bounded queue. Because only a small amount of work fits in the latter queue, no single user can monopolize the workers for an extended period. Like this:

(diagram: per-user queues feeding a small bounded queue, which the workers consume from; the routing step between them is shown in red)

I expected this to be easy to implement in RabbitMQ or possibly ZeroMQ, but there are several challenges. First, I need to manually create a new queue whenever a new user submits work. Second, and more importantly, it seems I'd have to implement the red part myself: listening to all user queues in a non-blocking fashion and forwarding their tasks to the bounded queue.
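A minimal sketch of that "red part" (the routing worker), using Python's `queue` module as an in-process stand-in for the broker queues: it polls each user queue non-blockingly in round-robin order and forwards at most one task per user into the small bounded queue. The user names and task labels are illustrative only.

```python
import queue

# Hypothetical in-process stand-ins for the per-user broker queues.
user_queues = {
    "alice": queue.Queue(),
    "bob": queue.Queue(),
}
# The small bounded queue the workers consume from; its maxsize
# is what creates the backpressure on the router.
bounded = queue.Queue(maxsize=2)

# Pre-load: alice floods the system, bob submits one task.
for i in range(5):
    user_queues["alice"].put(f"alice-task-{i}")
user_queues["bob"].put("bob-task-0")

def route_once():
    """One round-robin pass: take at most one task from each
    user queue and forward it to the bounded queue."""
    forwarded = []
    for user, q in user_queues.items():
        try:
            task = q.get_nowait()   # non-blocking poll
        except queue.Empty:
            continue
        bounded.put(task)           # blocks while the bounded queue is full
        forwarded.append(task)
    return forwarded

print(route_once())  # → ['alice-task-0', 'bob-task-0']
```

Note the interleaving: even though alice queued five tasks first, bob's task reaches the bounded queue on the very first pass.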

My concern is that I'm working with very low-level abstractions here, when all I want is fair task scheduling with a capacity limit: basically, create some backpressure so that scheduling happens just before the actual work, and no single user can monopolize the system.
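The backpressure falls out naturally from the bounded queue: a producer that tries to push into a full queue simply blocks (or times out), so the scheduling decision is deferred until a worker is actually about to be free. A tiny illustration with Python's `queue` module:

```python
import queue

# A bounded queue with room for only two in-flight tasks.
bounded = queue.Queue(maxsize=2)
bounded.put("task-1")
bounded.put("task-2")

# A third put cannot proceed until a worker consumes something.
try:
    bounded.put("task-3", timeout=0.1)
except queue.Full:
    print("backpressure: producer must wait")

# Once a worker takes a task, capacity frees up again.
bounded.get()
bounded.put("task-3")  # succeeds immediately now
```

With RabbitMQ itself, a queue length limit on the bounded queue plus a small consumer prefetch would play an analogous role.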

Are there better abstractions to work with?

Best Answer

I think the pattern you outline is the common one (in that you program your own "routing worker"). But you could condense it down by moving the routing logic (the red part) into the worker itself.

For example, instead of a worker listening to a single queue, I add code that is aware of the users.

I can then fire up one thread per user queue inside the same worker service and let the CPU spread its time across the threads.
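A sketch of that idea, again with Python's stdlib standing in for the broker client: the worker spins up one consumer thread per user queue, and the scheduler interleaves them, so a user with a small backlog is served alongside a user with a huge one. In a real deployment each thread would hold its own channel and consume from its user's RabbitMQ queue; the user names and counts here are illustrative only.

```python
import queue
import threading

# Hypothetical per-user queues known to the worker.
user_queues = {
    "alice": queue.Queue(),
    "bob": queue.Queue(),
}
# alice floods the system, bob submits a little work.
for i in range(100):
    user_queues["alice"].put(i)
for i in range(3):
    user_queues["bob"].put(i)

processed = {"alice": 0, "bob": 0}
lock = threading.Lock()

def consume(user, q):
    """Consumer thread for one user's queue: drain it task by task."""
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            return
        with lock:
            processed[user] += 1  # stand-in for doing the actual work

threads = [threading.Thread(target=consume, args=(u, q))
           for u, q in user_queues.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(processed)  # → {'alice': 100, 'bob': 3}
```

Because each user gets a dedicated thread, bob's three tasks complete quickly regardless of the 100 tasks alice queued first.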

It might be slightly suboptimal for large numbers of users or for CPU-bound tasks, but it would simplify your overall solution.

Along similar lines, if you can pass on the costs, a good solution is to spin up a new worker on a new machine in the cloud, along with a queue per user.
