Python – Scheduling of parallel I/O-bound tasks (Backup solution)

backupsmultithreadingpythonscheduling

I want to implement a backup solution in Python where a Backup-Server initiates backups on a number of virtual and physical servers (one server = one backup task). Disregarding the details of the actual backups tasks I am concerned with the scheduling/multiprocessing part for now.

The constraints I have are:

  1. Only backup two servers at once (e.g. have at maximum two backup threads running at once)
  2. Don't backup two servers on the same physical machine (oftentimes multiple virtual servers share a common hardware machine) at once.

Since I am not too experienced in multiprocessing in Python I am wondering what an optimal Python solution would be. The following came to my mind:

  • Have a thread for each backup-job (e.g. for each server) and use a threading.BoundedSemaphore to ensure only two are running at once. Use more semaphores/conditions to ensure that multiple threads are not backupping two servers on the same physical machine.
  • Have exactly two threads that are running all the time and retrieve their tasks from a queue. Simultaneously the queue would have to make sure no tasks on the same physical machine are handed out at once (e.g. skipping/reordering tasks at times) . I would probably do this by subclassing Queue.PriorityQueue to add the additional constraints.

I am leaning towards the second option but I am not sure whether or not a queue is the right data structure to hand out the tasks to multiple working threads. I don't need to add tasks to the queue at runtime (which a queue allows) and I need a bit of logic to hand out the tasks rather than just process them in a linear order. Is there is a better (standard) data structure for this?

I would be thankful to hear some thoughts from more experienced programmers.

Best Answer

I'd have a queue per physical machine.

Thus a backup process would take at most N=2 tasks, and make sure it does not pick a task from a queue for which another task as already running.

(N could be easily adjusted when needed.)

Related Topic