Why does monit log "error: status failed" while the check program returns exit code 0?

monit

Problem 1

I want to monitor a headless LibreOffice process with monit version 5.25.1.

Here is my monit config for this approach:

cat /etc/monit/conf.d/libreoffice


check program lo-check-8101 with path "/bin/bash /opt/libreoffice/chkloproc.sh TestLOPort8101 8101"
        with timeout 10 seconds
        if status != 0 then exec "/bin/bash /opt/libreoffice/loproc_is_down.sh"
        if status = 0 then exec "/bin/bash /opt/libreoffice/loproc_is_up.sh"

This LibreOffice Instance is listening on port 8101.

The check script returns 0 if everything is OK and 101 if there is an
error with that LibreOffice instance. It tests the text conversion of the
running LibreOffice process by sending HTML, requesting TEXT and checking the
response.

The action scripts (loproc_is_down.sh / loproc_is_up.sh) add / delete
an iptables rule to announce the status to a running haproxy, which port-checks that
LibreOffice instance / process … if this sounds a little bit complicated, I'm sorry, but that is not
the problem I would like to talk about here.

The problem is that I don't understand why monit logs the following entries:

monit log after restart

[CET Oct 29 16:58:18] info     : Starting Monit 5.25.1 daemon with http interface at [localhost]:2812
[CET Oct 29 16:58:18] info     : Monit start delay set to 10s
[CET Oct 29 16:58:28] info     : 'host1' Monit 5.25.1 started
[CET Oct 29 16:58:58] error    : 'lo-check-8101' status failed (0) -- no output
[CET Oct 29 16:58:58] info     : 'lo-check-8101' exec: '/bin/bash /opt/libreoffice/loproc_is_up.sh'
[CET Oct 29 16:59:28] error    : 'lo-check-8101' status failed (0) -- no output

… and the following status screen from 'monit status':

monit status
Monit 5.25.1 uptime: 0m

Program 'lo-check-8101'
  status                       Status failed
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Tue, 29 Oct 2019 16:58:58

System 'host1'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  load average                 [0.03] [0.02] [0.01]
  cpu                          0.6%us 0.6%sy 0.0%wa
  memory usage                 543.9 MB [7.8%]
  swap usage                   0 B [0.0%]
  uptime                       20d 1h 11m
  boot time                    Wed, 09 Oct 2019 16:47:51
  data collected               Tue, 29 Oct 2019 16:58:58

To me it seems that the check script returns exit value 0, but the status is reported / interpreted as "Status failed".

I don't understand why monit reports "error: … status failed (0)" in its logfile.

What does "status" mean, other than the interpretation of the last exit code of the given check program?


Problem 2

There is another reaction from monit that I can't understand; perhaps somebody can explain it to me?

When I fake a broken LibreOffice process by stopping it, monit recognizes this after one cycle, starts the configured action script 'loproc_is_down.sh' and correctly reports the last exit code as 101, but with the log line

"info: status succeeded (101)"

for the first cycle and again then with

"error: status failed (101)"

monit log with faked failure

[CET Oct 29 17:14:28] info     : 'lo-check-8101' status succeeded (101) -- Error: Existing listener not found. Unable start listener by parameters. Aborting.
[CET Oct 29 17:14:28] error    : 'lo-check-8101' status failed (101) -- Error: Existing listener not found. Unable start listener by parameters. Aborting.
[CET Oct 29 17:14:28] info     : 'lo-check-8101' exec: '/bin/bash /opt/libreoffice/loproc_is_down.sh'
[CET Oct 29 17:14:58] error    : 'lo-check-8101' status failed (101) -- Error: Existing listener not found. Unable start listener by parameters. Aborting.
[CET Oct 29 17:15:28] error    : 'lo-check-8101' status failed (101) -- Error: Existing listener not found. Unable start listener by parameters. Aborting.

The opposite happens when I start the LibreOffice process again:

monit log when service is running again

[CET Oct 29 17:15:58] error    : 'lo-check-8101' status failed (0) -- no output
[CET Oct 29 17:15:58] info     : 'lo-check-8101' exec: '/bin/bash /opt/libreoffice/loproc_is_up.sh'
[CET Oct 29 17:15:58] info     : 'lo-check-8101' status succeeded (0) -- no output
[CET Oct 29 17:16:28] error    : 'lo-check-8101' status failed (0) -- no output
[CET Oct 29 17:16:58] error    : 'lo-check-8101' status failed (0) -- no output

This looks as if monit runs the check script, which returns exit code 0, starts the action script "loproc_is_up.sh" and reports it with "status succeeded (0)"

… but then logs "error: status failed (0)" again in the following cycles.

I don't understand the meaning of "status" in the monit concept / documentation … can somebody explain it to me?

Thank you for reading this long post; I hope you can help me with an answer.

Best Answer

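Monit is there to catch problems on a monitored entity. Every "if" test in a check describes a failure condition, and a test that matches is logged as an error.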

So - line by line - your config tells Monit:

check program lo-check-8101 with path "/bin/bash /opt/libreoffice/chkloproc.sh TestLOPort8101 8101" with timeout 10 seconds

Execute the program every cycle. Store its exit code and some additional info (such as its output).

        if status != 0 then exec "/bin/bash /opt/libreoffice/loproc_is_down.sh"

A problem has occurred if the exit status is not 0. In that case, execute the down-script.

        if status = 0 then exec "/bin/bash /opt/libreoffice/loproc_is_up.sh"

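A problem has occurred if the exit status is 0. In that case, execute the up-script. This is why Monit logs "status failed (0)" and shows "Status failed" while your script exits cleanly. I don't even get what the result of this rule should be. Everything's okay here, so why execute something?


So to say: with this config there is no "success" (= everything is fine) case. One of the two tests always matches, so the service is always in a failed state. The "status succeeded" lines you see are just one test recovering at the very moment the other one starts to fail: "status succeeded (101)" is the "= 0" test no longer matching, and "status succeeded (0)" is the "!= 0" test no longer matching.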

To optimize it, you should only catch problems with Monit:

check program lo-check-8101 with path "/opt/libreoffice/chkloproc.sh TestLOPort8101 8101"
    with timeout 10 seconds
    if status != 0 then exec "/opt/libreoffice/loproc_is_down.sh"
    if 2 restarts within 3 cycles then unmonitor

This means nothing is done by Monit if status is 0.
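
If the up-script still has to run once the check recovers, the recovery action can hang off the same test instead of being a second test. This is only a minimal sketch: it assumes your Monit version accepts the optional "else if succeeded then" branch on a status test (I believe 5.25.1 does, but please verify with "monit -t" before reloading) and that the scripts are executable. The paths are the ones from the question:

```
check program lo-check-8101 with path "/opt/libreoffice/chkloproc.sh TestLOPort8101 8101"
    with timeout 10 seconds
    # Failure condition: a non-zero exit code runs the down-script.
    # Recovery branch (assumed syntax): runs the up-script once the test passes again.
    if status != 0 then exec "/opt/libreoffice/loproc_is_down.sh"
        else if succeeded then exec "/opt/libreoffice/loproc_is_up.sh"
```

With this form there is only one status test, so an exit code of 0 should show up as a healthy service rather than as "status failed (0)".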

Some more words on the config:

  1. If I get it correctly (see this question), the headless server will create a PID file. So you might also monitor it with "check process" and perhaps some send/expect magic to verify the service is running.
  2. If you make your .sh files executable (+x; i.e. chmod +x /opt/libreoffice/*.sh) and they have a correct shebang, you can omit /bin/bash in your exec statements for better readability.

My config on this (not knowing what protocol is used by :8101, assuming http) would be more like this:

check process libre-local with pidfile "/var/run/libreoffice-server.pid"
    start program = "/usr/bin/systemctl start libreoffice-server" # Unit name is an assumption!
    stop program = "/usr/bin/systemctl stop libreoffice-server" # Unit name is an assumption!

    if failed
        port 8101
        protocol http
        request "/any_valid_entrypoint"
        for 2 cycles
    then restart

    # if loadavg (5min) per core > 1 for 5 cycles then restart
    if loadavg (5min) > 4 for 5 cycles then restart
    if totalmem > 2 GB for 5 cycles then restart
    if 3 restarts within 5 cycles then unmonitor

Checking loadavg per core requires a recent Monit version, which might not be available in your distro, so I commented out that line ;)


Edit after response from OP (I hope you get notified):

(it's really a pain that we cannot comment < 50 Rep...)

If I get it right, you have to convert something to get the state of the application, and if the conversion fails, the app should be restarted. Translated to Monit:

check program lo-check-8101 with path "CONVERT_HERE"
    with timeout 10 seconds
    if status != 0 then exec "/usr/bin/systemctl restart libreoffice-server"
    if 2 restarts within 3 cycles then unmonitor

... where the CONVERT_HERE executable exits with 0 if the conversion goes well and with a non-zero code if it fails. I still feel I missed something here. ;)

Could you perhaps drop all three executables to a gist or something?