Software for batch process management in Unix

open-source, scheduled-task, unix

I'm looking for a good open source solution for managing many batch jobs across a cluster
of machines. I've looked at the solutions indicated in this post, but none of them really seems to be what I'm looking for; or perhaps the projects mentioned just have really poor documentation.

We have a good set of batch operations that need to happen on various schedules.
These batch operations sometimes have dependencies: for example, logs are processed by batch job A, and then batch jobs B and C can run on the resulting data. Resource utilization (balancing jobs amongst our batch machines) is probably not as much of an issue, although it would be a nice bonus.
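Concretely, the sort of chain I mean looks like this in plain shell (a sketch only, with placeholder functions standing in for the real batch jobs):

```shell
#!/bin/sh
# Placeholder functions standing in for the real batch jobs
job_a() { echo "A: process logs"; }
job_b() { echo "B: consume A's output"; }
job_c() { echo "C: consume A's output"; }

# B and C may run only after A succeeds; they are independent
# of each other, so they can run in parallel
if job_a; then
    job_b &
    job_c &
    wait
fi
```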

Today we handle this with a combination of fcron and shell scripts, but of course it's rather difficult to keep track of which jobs are scheduled to run on which machines. It's also not always obvious when a job has hung (or is running much longer than expected), or even when one just fails outright.

This can't be a unique problem for us. In fact, we had a home-grown solution at a previous company, but it was never open sourced. Does anyone have a good solution?

Best Answer

There are numerous solutions you may want to take a look at:

Torque - This is a variation of the original PBS (Portable Batch System) code base. They call it a resource manager because technically it doesn't take care of scheduling jobs, although it does include several schedulers. It will, however, take care of managing and allocating your compute nodes' CPU, memory, file, and other consumable resources. If you have anything more than very basic scheduling needs, you'll probably want to supplement it with the Maui Cluster Scheduler. I know the most about this one because it's what we use. It can be a bit rough around the edges because it's mostly community developed, and most of the developers are sysadmins rather than software engineers. There's a commercial product that spawned from the same PBS code base called PBS Professional, which seems more mature and is available for a relatively modest fee.
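For the dependency pattern you describe (logs through job A, then B and C on the result), Torque's qsub can express this directly with job dependencies. A rough sketch, with made-up job script names:

```shell
# Submit job A; qsub prints the new job's id on stdout
JOB_A=$(qsub process_logs.sh)

# B and C are held until A finishes with exit status 0 (afterok)
qsub -W depend=afterok:$JOB_A report_b.sh
qsub -W depend=afterok:$JOB_A report_c.sh
```

If A fails, B and C are never released, so a broken chain is visible in the queue instead of silently producing garbage downstream.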

Sun Grid Engine - Similar to the PBS-based systems, but written by Sun. The resource manager and scheduler are more integrated in this system, and it offers a few different modes of operation and resource allocation. Despite being a Sun product, it apparently runs well on Linux and other operating systems, not just Solaris.
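SGE can express the same kind of chain via job names and -hold_jid; again a sketch with invented script names:

```shell
# Submit A under an explicit job name
qsub -N logs_a process_logs.sh

# B and C are held until every job named logs_a has completed
qsub -hold_jid logs_a report_b.sh
qsub -hold_jid logs_a report_c.sh
```

One caveat: as I understand it, -hold_jid releases the held jobs when the predecessor completes, regardless of its exit status, so it's a weaker guarantee than Torque's afterok.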

Platform LSF - Another popular commercial offering in the same space.

Condor - Another batch scheduling system, better suited to high-throughput workloads: tons of short jobs.

SLURM - Another open source offering. It's not quite as mature as the PBS-based products, but it has a nicer, plugin-based architecture and is easy to install if you go with the CAOS NSA Linux distribution and the Perceus cluster manager. See this Linux Magazine article for an example of how easy it is to get up and running.
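SLURM's sbatch takes similar dependency flags; a sketch (script names are invented, and --parsable may only exist in newer releases — without it, sbatch prints "Submitted batch job N" and you'd have to extract the id yourself):

```shell
# --parsable makes sbatch print just the job id
JOB_A=$(sbatch --parsable process_logs.sh)

# Release B and C only if A exits successfully
sbatch --dependency=afterok:$JOB_A report_b.sh
sbatch --dependency=afterok:$JOB_A report_c.sh
```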

Which one of these you pick is largely a matter of preference and of matching it up with your requirements. I would say that Torque and SGE have a slight bent toward multi-user clusters in a scientific computing environment. Based on what I've seen of Altair's PBS Professional, it looks far more suitable for a commercial environment and has a better suite of tools for developing product-specific workflows. The same goes for LSF.

SLURM and Condor are probably the easiest to get up and running, and if your requirements are relatively modest, they may be the best fit. However, if you need more complicated scheduling policies and have many users submitting jobs to your systems, they may be lacking in that regard unless supplemented by an external scheduler.