I am trying to use monit to find surefire processes that have been running too long and kill them.
The machine runs parallel builds, so several surefire processes may be running at the same time, and there is no PID file for any of them.
My monit config looks like this:
check process surefire matching "surefire/surefirebooter"
    if uptime > 4 hours then alert
    if uptime > 4 hours then stop
The alert is sent, but the stop does not work.
I can't use killall, since the process is run by java and there are several other java processes running.
All I need is to detect the right PID of that process so I can kill the right one.
Best Answer
There is a MONIT_PROCESS_PID environment variable that monit passes to any program it runs through the exec action.
if uptime > 4 hours then stop
should be replaced by
if uptime > 4 hours then exec "/usr/bin/monit-kill-process.sh"
and /usr/bin/monit-kill-process.sh should kill the process whose PID is given in MONIT_PROCESS_PID.
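The script body is not shown here, so the following is only a minimal sketch of what it might contain, assuming monit exports MONIT_PROCESS_PID as described. The function name kill_pid_gracefully and the 10-second grace period are my own illustrative choices, not anything monit prescribes.

```shell
#!/bin/sh
# Hypothetical sketch of /usr/bin/monit-kill-process.sh.
# Assumption: monit sets MONIT_PROCESS_PID for programs started via "exec".

# kill_pid_gracefully PID
# Sends SIGTERM, waits up to 10 seconds (arbitrary choice), then sends SIGKILL.
kill_pid_gracefully() {
    pid="$1"
    # Accept only a positive integer; PID 0 would signal the whole process group.
    case "$pid" in
        ''|0*|*[!0-9]*)
            echo "refusing to kill invalid PID: '$pid'" >&2
            return 1
            ;;
    esac

    # If the signal cannot be delivered, the process is most likely already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    i=0
    while [ "$i" -lt 10 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        i=$((i + 1))
    done

    # Still alive after the grace period: force it.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null || true
    fi
    return 0
}

if [ -n "${MONIT_PROCESS_PID:-}" ]; then
    kill_pid_gracefully "$MONIT_PROCESS_PID"
else
    echo "MONIT_PROCESS_PID is not set; nothing to do" >&2
fi
```

Trying SIGTERM first gives the JVM a chance to run its shutdown hooks; if that does not matter for your builds, a plain kill -KILL would be shorter.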
The only problem is that monit is not the right tool for this job anyway: it expects to find a process matching the check pattern every time it runs the check, and otherwise it tries to start the process using the start part of the check definition, which is not what we want here.
So I found and modified a ps/grep/perl/xargs one-liner, which I run from cron. It finds processes by a substring of their command line, selects the long-running ones, and kills them.
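The one-liner itself is not reproduced here. As a rough sketch of the same idea, the following uses ps and awk instead of grep/perl/xargs. It assumes a Linux ps from procps-ng, which provides the etimes column (elapsed time in seconds); the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of a cron job that kills long-running processes matched by command line.
# Assumption: "ps -eo pid=,etimes=,args=" is available (procps-ng on Linux).

# filter_long_running PATTERN MIN_SECONDS
# Reads lines of the form "PID ELAPSED_SECONDS COMMAND..." on stdin and prints
# the PIDs whose command contains PATTERN and whose elapsed time is >= MIN_SECONDS.
filter_long_running() {
    awk -v pat="$1" -v min="$2" '
        $2 + 0 >= min + 0 {
            # Rebuild the command line from fields 3..NF, then do a plain
            # substring match (no regular expression involved).
            cmd = ""
            for (i = 3; i <= NF; i++) cmd = cmd " " $i
            if (index(cmd, pat) > 0) print $1
        }'
}

# kill_long_running PATTERN MIN_SECONDS
# Sends SIGTERM to every matching process.
kill_long_running() {
    ps -eo pid=,etimes=,args= |
        filter_long_running "$1" "$2" |
        while read -r pid; do
            kill -TERM "$pid" 2>/dev/null || true
        done
}

# Example call (4 hours = 14400 seconds):
# kill_long_running "surefire/surefirebooter" 14400
```

The elapsed-time threshold also keeps the job from killing its own awk process, whose command line contains the pattern but which has only just started.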