Using monit to kill the right process without knowing its PID

monit

I am trying to use monit to find surefire processes running too long and kill them.

The machine runs parallel builds, so several surefire processes can be running at the same time, and there is no PID file for those processes.

My monit config looks like this:

check process surefire matching "surefire/surefirebooter"
    if uptime > 4 hours then alert
    if uptime > 4 hours then stop

The alert is sent, but the stop does not work.

I can't use killall because the process is run by java and there are several other java processes running.

All I need is to detect the right PID of that process so I can kill the right one.

Best Answer

There is a MONIT_PROCESS_PID environment variable that monit passes into the environment of a program executed by the exec action.

if uptime > 4 hours then stop

should be replaced by

if uptime > 4 hours then exec "/usr/bin/monit-kill-process.sh"

and /usr/bin/monit-kill-process.sh should look like

#!/bin/bash
# script run from monit via the exec action
# kills the long-running surefire process that monit matched

kill -9 "$MONIT_PROCESS_PID"
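As a gentler variation on the script above, here is a minimal sketch that sends SIGTERM first and only falls back to SIGKILL if the process is still alive after a grace period. The function name terminate_pid and the 10-second default are my own illustrative assumptions, not something monit provides; the only thing taken from monit is the MONIT_PROCESS_PID variable.

```shell
#!/bin/bash
# Sketch only: polite termination with a SIGKILL fallback.
# terminate_pid and the 10-second grace period are hypothetical choices.

terminate_pid() {
    local pid="$1"
    local grace="${2:-10}"
    local waited=0

    # Refuse empty, zero, or non-numeric PIDs (kill 0 would hit the whole group).
    case "$pid" in
        ''|0|*[!0-9]*) return 2 ;;
    esac

    # Ask the process to exit; if the signal cannot be delivered
    # (for example the process is already gone), there is nothing left to do.
    kill -TERM "$pid" 2>/dev/null || return 0

    # Poll once per second until the process disappears or the grace period ends.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Still there after the grace period: force it.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null
    fi
    return 0
}

# Only act when monit actually supplied a PID.
if [ -n "${MONIT_PROCESS_PID:-}" ]; then
    terminate_pid "$MONIT_PROCESS_PID"
fi
```

Giving the JVM a chance to handle SIGTERM lets surefire clean up temporary files before it exits; kill -9 remains the last resort.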

The only problem is that monit is not the right tool for this job anyway: it expects a process matching the check pattern to be found every time it performs the check; otherwise it tries to start the process using the start part of the check definition (which is not exactly what we want to do).

So I found and modified this ps/grep/perl/xargs one-liner, which I run through cron. It finds processes by a substring of their command line, selects the long-running ones, and kills them.

#!/bin/bash
# script run from cron
# this will find long-running surefire processes and kill them

readonly PROCESS_STRING="surefireboot"

# note: the ps "time" column is cumulative CPU time, not wall-clock uptime
/bin/ps -e -o pid,time,command \
 | /bin/grep "$PROCESS_STRING" \
 | /usr/bin/perl -ne 'print "$1 " if /^\s*([0-9]+) ([-0-9]+:[0-9]+:[0-9]+)/ && $2 gt "04:00:00"' \
 | /usr/bin/xargs kill
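The perl filter above compares the time column as a string against "04:00:00". A possible alternative, sketched here under my own assumptions, converts the [DD-]HH:MM:SS value to seconds and compares numerically, which makes the threshold easier to change. The helper name pids_over_limit is hypothetical; it reads "PID TIME COMMAND" lines on stdin and prints the PIDs whose time exceeds the limit given in seconds.

```shell
#!/bin/bash
# Sketch only: numeric threshold instead of a string comparison.
# pids_over_limit is a hypothetical helper, not part of the original script.

pids_over_limit() {
    local limit="$1"   # threshold in seconds
    awk -v limit="$limit" '
    {
        # Split the second column on "-" and ":" to get [days,] hours, minutes, seconds.
        n = split($2, f, /[-:]/)
        if (n == 4)
            secs = f[1] * 86400 + f[2] * 3600 + f[3] * 60 + f[4]
        else if (n == 3)
            secs = f[1] * 3600 + f[2] * 60 + f[3]
        else
            next   # header line or unexpected format: skip it
        if (secs > limit)
            print $1
    }'
}

# Example use from cron (commented out so that sourcing this file kills nothing):
# /bin/ps -e -o pid,time,command \
#  | /bin/grep "surefireboot" \
#  | pids_over_limit 14400 \
#  | /usr/bin/xargs kill
```

Here 14400 is 4 hours expressed in seconds, matching the threshold used in the original one-liner.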