I have written a small bash script that crawls the URLs listed in an XML sitemap. It retrieves 5 URLs in parallel using xargs.
Now I want an email to be sent once all URLs have been crawled, so the script has to wait until all sub-processes of xargs have finished and only then send the mail.
I have tried a pipe after the xargs:
#!/bin/bash
wget --quiet --no-cache -O- http://some.url/test.xml | egrep -o "http://some.url[^<]+" | xargs -P 5 -r -n 1 wget --spider | mail...
and using wait:
#!/bin/bash
wget --quiet --no-cache -O- http://some.url/test.xml | egrep -o "http://some.url[^<]+" | xargs -P 5 -r -n 1 wget --spider
wait
mail ...
Neither approach works: the email is sent immediately after the script is started.
How can I achieve this? Unfortunately, I don't have the parallel program on my server (managed hosting).
Best Answer
Instead of using xargs, spawn each wget individually in the background and collect the PIDs of the background processes in a list. Additionally, make sure that the output of each background process is written to a file.
Once all background processes have been spawned, go through the PIDs on the list and wait on each one; those that have already exited will not block at wait.
Having now, hopefully, waited for all background processes successfully, all that is left to do is to concatenate the output of each background process into a single file and mail that to wherever the output is needed.
Something along the lines of (the echos are, of course, redundant and for demonstration purposes only):
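As a hedged illustration of the steps just described (this is not the answer's original script), here is a minimal sketch. The names crawl_all and FETCH and the temporary-directory layout are assumptions made up for this example; FETCH defaults to the wget --spider call from the question and exists only so the per-URL command can be swapped out.

```bash
#!/bin/bash
# Sketch only: crawl_all, FETCH and the temp-dir layout are hypothetical names.

# Command run once per URL; defaults to the wget call used in the question.
FETCH=${FETCH:-"wget --spider"}

# Reads URLs from stdin (one per line), fetches them all in the background,
# waits for every job, then prints the collected output in spawn order.
# The ( ... ) body runs in a subshell, so its variables stay local.
crawl_all() (
    outdir=$(mktemp -d) || exit 1
    pids=
    n=0

    # Spawn one background job per URL and remember its PID.
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        $FETCH "$url" > "$outdir/$n.log" 2>&1 < /dev/null &
        pids="$pids $!"
        n=$((n + 1))
    done

    # Wait on each PID in turn; jobs that already exited return immediately.
    for pid in $pids; do
        wait "$pid"
    done

    # Concatenate the per-job logs to stdout, then clean up.
    i=0
    while [ "$i" -lt "$n" ]; do
        cat "$outdir/$i.log"
        i=$((i + 1))
    done
    rm -rf "$outdir"
)

# Possible use, replacing the xargs stage of the pipeline from the question:
#   ... | egrep -o "http://some.url[^<]+" | crawl_all | mail ...
```

Unlike xargs -P 5, this sketch starts all downloads at once rather than five at a time, so for a large sitemap some form of throttling would have to be added.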