Awaken monit daemon every few hours for all monitored processes


I am having trouble configuring the monit daemon to wake up every few hours and resume monitoring processes that have ended up in the "Not monitored" state.

PROBLEM:
When monit unmonitors a process, its status changes to "not monitored", and the monit daemon never tries to monitor that process again, even after the PID file is updated with the correct PID. Monitoring stays off for this process indefinitely unless the daemon is manually awakened for it, as shown further below.

Can this per-process awakening be configured in the monit config to happen at a fixed interval, to avoid the pitfall of a process staying in the "not monitored" state forever?

Something like this:
if 2 restarts within 3 cycles then timeout {X hours} monitor restart

Thank you.

I have the following config for an SNMP process.

# Check for cmaeventd process
check process cmaeventd with pidfile /var/run/cmaeventd.pid
  group snmp-agents
  start program = "/opt/hp/hp-snmp-agents/storage/etc/cmaeventd start"
  stop program = "/opt/hp/hp-snmp-agents/storage/etc/cmaeventd stop"
  if 2 restarts within 3 cycles then timeout

For some reason the PID file is not populated correctly (I am working on fixing that). Monit keeps trying to restart the process using the empty PID file, logs the errors below, and finally unmonitors the process after it fails to restart within 3 cycles, as configured.

log messages:

[PST Feb  3 11:43:23] error    : monit: Error reading pid from file '/var/run/cmaeventd.pid'
[PST Feb  3 11:43:24] error    : monit: Error reading pid from file '/var/run/cmaeventd.pid'

[PST Feb  3 11:45:25] error    : 'cmaeventd' service restarted 2 times within 2 cycles(s) - unmonitor
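
The "Error reading pid" condition can also be checked by hand, outside monit. A minimal sketch, assuming a POSIX shell and root privileges; `pidfile_ok` is a hypothetical helper name, not something monit provides:

```shell
#!/bin/sh
# pidfile_ok FILE: succeed only if FILE holds a numeric PID of a live process.
# Hypothetical helper for illustration; it is not part of monit.
pidfile_ok() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    # Take the first line and strip whitespace; a file holding only a
    # newline yields an empty string.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    # kill -0 delivers no signal; it only checks that the process exists
    # (and that the caller may signal it, so run this as root).
    kill -0 "$pid" 2>/dev/null
}
```

Usage would be something like `pidfile_ok /var/run/cmaeventd.pid && echo ok`. A plain size test such as `[ -s file ]` would not be enough here, because the broken PID file shown later in this post is 1 byte long and still contains no PID.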

Monit status for that process after unmonitor:

Process 'cmaeventd'
  status                            not monitored
  monitoring status                 not monitored
  data collected                    Tue Feb  3 12:10:25 2015

Manually awakening the daemon for this process to start the monitoring again:

>monit monitor cmaeventd 

This awakens the monit daemon for this process. The daemon reads the PID file again and, if that succeeds, resumes monitoring.
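
Until there is a built-in option, this manual step could be scripted and run from cron. A minimal sketch, not a tested solution: `remonitor_if_ready` is a hypothetical helper name, the `MONIT` variable is an assumption added so the script can be dry-run, and the only monit command used is the `monit monitor` call shown above.

```shell
#!/bin/sh
# Sketch: re-enable monitoring for a service, but only when its PID file
# contains a number. remonitor_if_ready is a hypothetical helper name.
MONIT=${MONIT:-monit}   # overridable, e.g. MONIT=echo for a dry run

remonitor_if_ready() {
    service=$1
    pidfile=$2
    # Require a line made up of digits only; a missing file or a file
    # holding just a newline fails here.
    if ! grep -Eq '^[0-9]+$' "$pidfile" 2>/dev/null; then
        echo "skipping $service: no valid PID in $pidfile" >&2
        return 1
    fi
    # Same command as the manual step above.
    "$MONIT" monitor "$service"
}

# When invoked as a script with two arguments, act on them.
if [ $# -eq 2 ]; then
    remonitor_if_ready "$1" "$2"
fi
```

A crontab entry such as `0 */3 * * * /usr/local/sbin/remonitor.sh cmaeventd /var/run/cmaeventd.pid` (the script path is made up) would retry every three hours. Note that this deliberately overrides the timeout, so a service that keeps failing will be restarted and unmonitored again on each retry.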

Before awakening the monit daemon for this process:
---------------------------------------------------
logbash-3.1# ls -l /var/run/cmaeventd.pid
-rw-r--r-- 1 root root 1 Feb  3 00:00 /var/run/cmaeventd.pid
logbash-3.1# cat /var/run/cmaeventd.pid

logbash-3.1# ps -ef|grep cmaeventd |grep -v grep
root     13066     1  0 00:00 ?        00:00:00 cmaeventd -p 15 -l /var/log/hp-snmp-agents/cma.log
logbash-3.1# echo "13066" > /var/run/cmaeventd.pid
logbash-3.1# cat /var/run/cmaeventd.pid
13066

logbash-3.1# monit monitor cmaeventd

From log:

[PST Feb  3 12:20:54] info     : monitor service 'cmaeventd' on user request
[PST Feb  3 12:20:54] info     : monit daemon at 23515 awakened
[PST Feb  3 12:20:54] info     : Awakened by User defined signal 1
[PST Feb  3 12:20:54] info     : 'cmaeventd' monitor action done

Monit status:

Process 'cmaeventd'
  status                            initializing
  monitoring status                 initializing
  data collected                    Tue Feb  3 12:20:54 2015

After some time, the status changes to:

Process 'cmaeventd'
  status                            running
  monitoring status                 monitored
  pid                               13066
  parent pid                        1
  uptime                            12h 21m
  children                          0
  memory kilobytes                  2160
  memory kilobytes total            2160
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Tue Feb  3 12:21:54 2015

Best Answer

It's not necessary to monitor individual HP agents with Monit. Plus, they're all tied together with the wrapper service, hp-snmp-agents. Restarting one independently of the rest will have undesirable effects.

While it's possible to debug the HP agent logs, I think you may have an issue with your old kernel (looks like RHEL/CentOS 5.5) and possibly old HP management agents. The HP agents you should be using are in HP's Software Delivery Repository (SDR).

For the ProLiant DL3xx G7 platform, you'll need the newest version of the following packages:

hp-snmp-agents, hpssa, hp-health, hp-smh-templates, hpsmh, hpssacli, hponcfg