Software for batch process management in Unix

open-source, scheduled-task, unix

I'm looking for a good open source solution for managing many batch jobs across a cluster
of machines. I've looked at the solutions indicated in this post, but none of them really seems to be what I'm looking for; or perhaps the projects mentioned just have really poor documentation.

We have a good set of batch operations that need to happen on various schedules.
These batch operations sometimes have dependencies: for example, logs are processed by batch job A, and then batch jobs B and C can run on the resulting data. Resource utilization (balancing jobs amongst our batch machines) is probably not as much of an issue, although it would be a nice bonus.
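Concretely, the sort of chain I mean looks like this in plain shell (a sketch only, with placeholder functions standing in for the real batch jobs):

```shell
#!/bin/sh
# Placeholder functions standing in for the real batch jobs
job_a() { echo "A: process logs"; }
job_b() { echo "B: consume A's output"; }
job_c() { echo "C: consume A's output"; }

# B and C may run only after A succeeds; they are independent
# of each other, so they can run in parallel
if job_a; then
    job_b &
    job_c &
    wait
fi
```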

Today we handle this with a combination of fcron and shell scripts, but of course it's rather difficult to keep track of which jobs are scheduled to run on which machines. It's also not always obvious when a job has hung (or is running much longer than expected), or even when one just fails outright.

This can't be a unique problem for us. In fact, we had a home-grown solution at a previous company, but it was never open sourced. Does anyone have a good solution?

Best Answer

There are numerous solutions you may want to take a look at:

Torque - This is a variation of the original PBS (Portable Batch System) code base. They call it a resource manager because technically it doesn't take care of scheduling jobs, although it does include several schedulers. It will, however, take care of managing and allocating your compute nodes' CPU, memory, file, and other consumable resources. If you have anything more than very basic scheduling needs, you'll probably want to supplement it with the Maui Cluster Scheduler. I know the most about this one because it's what we use. It can be a bit rough around the edges because it's mostly community developed, and most of the developers are sysadmins rather than software engineers. There's a commercial product that spawned from the same PBS code base called PBS Professional, which seems more mature and is available for a relatively modest fee.
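For the dependency pattern you describe (logs through job A, then B and C on the result), Torque's qsub can express this directly with job dependencies. A rough sketch, with made-up job script names:

```shell
# Submit job A; qsub prints the new job's id on stdout
JOB_A=$(qsub process_logs.sh)

# B and C are held until A finishes with exit status 0 (afterok)
qsub -W depend=afterok:$JOB_A report_b.sh
qsub -W depend=afterok:$JOB_A report_c.sh
```

If A fails, B and C are never released, so a broken chain is visible in the queue instead of silently producing garbage downstream.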

Sun Grid Engine - Similar to the PBS-based systems, but written by Sun. The resource manager and scheduler are more integrated in this system, and it offers a few different modes of operation and resource allocation. Despite being a Sun product, it apparently runs well on Linux and other operating systems, not just Solaris.
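SGE can express the same kind of chain via job names and -hold_jid; again a sketch with invented script names:

```shell
# Submit A under an explicit job name
qsub -N logs_a process_logs.sh

# B and C are held until every job named logs_a has completed
qsub -hold_jid logs_a report_b.sh
qsub -hold_jid logs_a report_c.sh
```

One caveat: as I understand it, -hold_jid releases the held jobs when the predecessor completes, regardless of its exit status, so it's a weaker guarantee than Torque's afterok.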

Platform LSF - Another popular commercial offering in the same space.

Condor - Another batch scheduling system, better suited to high-throughput workloads: tons of short jobs.

SLURM - Another open source offering. It's not quite as mature as the PBS-based products, but it has a nicer, plugin-based architecture and is easy to install if you go with the CAOS NSA Linux distribution and the Perceus cluster manager. See this Linux Magazine article for an example of how easy it is to get up and running.
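SLURM's sbatch takes similar dependency flags; a sketch (script names are invented, and --parsable may only exist in newer releases — without it, sbatch prints "Submitted batch job N" and you'd have to extract the id yourself):

```shell
# --parsable makes sbatch print just the job id
JOB_A=$(sbatch --parsable process_logs.sh)

# Release B and C only if A exits successfully
sbatch --dependency=afterok:$JOB_A report_b.sh
sbatch --dependency=afterok:$JOB_A report_c.sh
```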

Which one of these you pick is largely a matter of preference and of matching it up with your requirements. I would say that Torque and SGE have a slight bent toward multi-user clusters in a scientific computing environment. Based on what I've seen of Altair's PBS Professional, it looks far more suitable for a commercial environment and has a better suite of tools for developing product-specific workflows. The same goes for LSF.

SLURM and Condor are probably the easiest to get up and running, and if your requirements are relatively modest, they may be the best fit. However, if you need more complicated scheduling policies and have many users submitting jobs to your systems, they may be lacking in that regard unless supplemented by an external scheduler.