Unix – Better Find Command with Parallel Processing


The Unix find(1) utility is very useful, allowing me to perform an action on many files that match certain specifications, e.g.

find /dump -type f -name '*.xml' -exec java -jar ProcessFile.jar {} \;

The above might run a script or tool over every XML file in a particular directory.

Let's say my script/program takes a lot of CPU time and I have 8 processors. It would be nice to process up to 8 files at a time.

GNU make allows for parallel job processing with the -j flag but find does not appear to have such functionality. Is there an alternative generic job-scheduling method of approaching this?

Best Answer

Use xargs with the -P option (maximum number of parallel processes). Say I wanted to compress all the logfiles in a directory on a 4-CPU machine:

find . -name '*.log' -mtime +3 -print0 | xargs -0 -P 4 bzip2

You can also pass -n <number> for the maximum number of arguments (work units) handed to each invocation. So say I had 2500 files and I ran:

find . -name '*.log' -mtime +3 -print0 | xargs -0 -n 500 -P 4 bzip2

This would start four bzip2 processes, each given 500 files; when the first one finished, another would be started for the last 500 files.
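The same approach maps directly onto the question's original Java example. Here is a hedged, self-contained sketch: /dump and ProcessFile.jar are the asker's hypothetical paths, so a temp directory and gzip stand in for them to make the pipeline runnable anywhere. The -print0/-0 pairing keeps filenames containing spaces or newlines intact.

```shell
# Stand-in for the asker's /dump directory of XML files.
dir=$(mktemp -d)
touch "$dir/a.xml" "$dir/b.xml" "$dir/c.xml"

# -n 1 gives each worker one file at a time; -P 8 caps concurrency at
# 8 processes, matching the 8-CPU machine in the question. Substitute
# `java -jar ProcessFile.jar` for `gzip` in the real case.
find "$dir" -type f -name '*.xml' -print0 \
  | xargs -0 -n 1 -P 8 gzip

ls "$dir"   # each .xml has been replaced by a .xml.gz
```

With -n 1, xargs keeps all eight slots busy: as soon as one gzip exits, the next pending file is dispatched, which is exactly the generic job scheduling the question asks for.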

I'm not sure why the previous answer combines xargs with make; that gives you two parallel engines!