Linux – Bash script – wait for all xargs processes to be finished

bash, linux, process, web-crawler, xargs

I have written a small bash script for crawling an XML sitemap of URLs. It retrieves 5 URLs in parallel using xargs.

Now I want an email to be sent once all URLs have been crawled, so the script has to wait until all of xargs' sub-processes have finished before it sends the mail.

I have tried with a pipe after the xargs:

#!/bin/bash

wget --quiet --no-cache -O- http://some.url/test.xml | egrep -o "http://some.url[^<]+" | xargs -P 5 -r -n 1 wget --spider | mail...

and with wait:

#!/bin/bash

wget --quiet --no-cache -O- http://some.url/test.xml | egrep -o "http://some.url[^<]+" | xargs -P 5 -r -n 1 wget --spider

wait

mail ... 

Neither works: the email is sent immediately after the script is executed.
How can I achieve this? Unfortunately I don't have the parallel program on my server (managed hosting).

Best Answer

Instead of using xargs, spawn each wget individually in the background and collect the PIDs of the background processes in a list. Also make sure that each background process writes its output to a file.

Once all background processes have been spawned, go through the PIDs on the list and wait on each one; those that have already exited will not block at wait. Once every background process has been waited on, all that is left is to concatenate their outputs into a single file and mail it to wherever the output is needed.

Something along the lines of (echos are, of course, redundant and for demonstration purposes only):

#!/bin/bash

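mail=$(mktemp)
pids=()
outputs=()

# Use a single EXIT handler: a second "trap ... EXIT" would replace the first.
# Single quotes defer expansion until exit, when outputs has been filled.
trap 'rm -f "$mail" "${outputs[@]}"' EXIT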

for url in $(wget --quiet --no-cache -O- http://some.url/test.xml |
             grep -Eo "http://some.url[^<]+") ; do
  output=$(mktemp)
  # Pass the URL to wget; --spider checks the page without saving it.
  wget --spider "$url" > "$output" 2>&1 &
  pids+=("$!")
  outputs+=("$output")
  echo "Spawned wget and got PID $!."
done

for pid in "${pids[@]}" ; do
  echo "Waiting for PID $pid."
  wait "$pid"
done

# Concatenate outputs from individual processes into a single file.
for output in "${outputs[@]}" ; do cat "$output" >> "$mail" ; done

# Mail that file.
< "$mail" mail -s "All outputs" some.user@some.domain

# end of file.
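
Note that the script above starts every wget at once, whereas the question asked for five at a time. One way to keep a cap without xargs or parallel is to wait on the collected PIDs whenever a batch is full. The following is only a minimal sketch of that idea: fake_fetch and crawl_in_batches are hypothetical names, and fake_fetch is a stand-in for "wget --spider" so that the sketch runs without network access. It also counts failed jobs, since wait with a PID returns that job's exit status.

```shell
#!/bin/bash

# Hypothetical stand-in for: wget --spider "$1"
# It fails only for arguments containing "bad", so the sketch runs offline.
fake_fetch() {
  sleep 0.1
  [[ $1 != *bad* ]]
}

# Run fake_fetch on each argument, at most max_jobs at a time, and print
# the number of failed jobs once everything has finished.
crawl_in_batches() {
  local max_jobs=$1; shift
  local pids=() failed=0 item pid
  for item in "$@"; do
    fake_fetch "$item" &
    pids+=("$!")
    # Batch is full: wait on each PID in it before starting more jobs.
    if (( ${#pids[@]} >= max_jobs )); then
      for pid in "${pids[@]}"; do
        wait "$pid" || failed=$((failed + 1))
      done
      pids=()
    fi
  done
  # Wait for the last, possibly partial, batch.
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
}

# Seven jobs with a cap of five: one batch of five, then a batch of two.
crawl_in_batches 5 http://some.url/{1..6} http://some.url/bad
```

Batching is simpler than refilling a slot as soon as one job ends, but a slow URL holds up its whole batch. The mail command would go after the call to crawl_in_batches, which only returns once every job has been waited on.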