Do you have a shared database? I've done this using a database as the arbiter in the past.
Basically, each "job" is represented as a row in the database. You schedule a job by adding a row to the database with the time you want it to run then each server does:
SELECT TOP 1 *
FROM jobs
WHERE state = 'NotRun'
ORDER BY run_time ASC
That way, they'll all pick the job that is scheduled to run next. They all sleep so that they wake up when the job is actually supposed to run. Then, they all do this:
UPDATE jobs
SET state = 'Running'
WHERE job_id = :id
AND state = 'NotRun'
Where :id
is the identifier of the job you got in the step above. Because the update is atomic, only one of the servers will actually update the row, you can check the database's "number of rows updates" status code to determine whether you were the server that actually updated the row, and therefore whether you are the server that gets to run the job.
If you didn't "win" and you're not running the job, just go back to step 1 immediately. If you did "win", schedule the job to execute in another thread, then wait a couple of seconds before going back to step 1. That way, servers that didn't get the job this time are more likely to pick up a job that's scheduled to run immediately.
You can use a list of triggerables, sorted ascending by the time of the next trigger. When you need to trigger something, you can always pick off the front of the list, O(1). When you need to (re)schedule something, you can fit it in the list in O(log N) time.
It would be fairly simple & straightforward, and you can leverage any form of timer API (your scheduler needs to wake up only when the front of the list needs to be triggered).
Best Answer
Although this algorithm was designed to provide maximum throughput in all the scenarios, this is not correct for all situations. What if the processor has a lot of small processes? It will definitely lead to starvation. The turnaround time of small processes will be low, whereas that of large processes will be large as compared to other scheduling algorithms.
Wikipedia says:
If you schedule larger jobs first, lots of small processes would have to wait for the process to terminate (as this is non-preemptive). And since turnaround time is total time between submission of a process and its completion, this would definitely be low. But in SJF, very few large processes wait for some small processes, so turnaround time of only those large processes would be affected.
Check out this link for a comparison of various techniques.