Troubleshooting Nagios alerts, a.k.a. why aren’t the alerts firing?

alerts nagios troubleshooting

I'm attempting to add email alerts to an existing Nagios install. I've been using the web interface to keep an eye on some non-critical systems for a few months and it's been running well; warnings and critical problems are detected without issue.

My next step is to enable the alerting functionality but despite hours of fiddling I've been unable to get even the simplest alert to fire. I'm flat out of ideas as to what could be going wrong. It's almost certainly something simple that I've just failed to pick up on so hopefully one of you guys will spot it with ease.

The command I'm testing with is dead simple. Initially I'm just trying to write to a file:

define command{
        command_name    alerter
        command_line    echo "Alerter command fired by Nagios" >> /usr/local/nagios/var/alerter.log
}

I've tested that the nagios user can execute this command using sudo. All seems well.
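A check along these lines can be sketched as a small shell function (not necessarily the exact test used here). It assumes a POSIX shell; the function name try_alerter_write is hypothetical, and the log path in the usage note is the one from the command definition above.

```shell
# Append a test line to the given log file and report whether the write worked.
try_alerter_write() {
    log="$1"
    # The braces let us silence the shell's own error message if the redirect fails.
    if { echo "Alerter command fired by Nagios" >> "$log"; } 2>/dev/null; then
        echo "write ok: $log"
    else
        echo "write FAILED: $log (check ownership and permissions)"
        return 1
    fi
}
```

Run from a shell started as the nagios user (for example via sudo -u nagios sh), a call would look like: try_alerter_write /usr/local/nagios/var/alerter.log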

The hosts and services all refer to the 'admins' contact group. These are the templates they use; none of them overrides any of these settings.

define host{
        name                            generic-host
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        check_period                    24x7
        check_interval                  1
        retry_interval                  1
        max_check_attempts              10
        check_command                   check-host-alive
        notification_period             24x7
        notification_interval           120
        notification_options            d,u,r,s,f
        contact_groups                  admins
        register                        0
}
define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  admins
        notification_options            w,u,c,r
        notification_interval           120
        notification_period             24x7
        register                        0
}

The contact and contact group are configured as follows:

define contact{
        name                            generic-contact
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   alerter
        host_notification_commands      alerter
        register                        0
}
define contact{
        contact_name            nagiosadmin
        use                     generic-contact
        alias                   Nagios Admin
        email                   alerts@tekretic.tk
}
define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
}

When I cause an outage Nagios picks it up and logs it like this…

[1315210448] SERVICE ALERT: ifs.aleph;Test service;CRITICAL;HARD;3;HTTP CRITICAL: HTTP/1.1 400 Bad Request - string 'Blah blah' not found on 'http://aleph.tekretic.com.au:80/' - 168 bytes in 0.369 second response time
[1315210653] SERVICE ALERT: ifs.aleph;Test service;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 416 bytes in 0.364 second response time

… but nothing is logged to my 'alerter.log' file. It's as though the alerter command is never fired.

What am I missing??

Best Answer

Make sure that you have the following in nagios.cfg:

log_notifications=1
enable_notifications=1
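A quick way to confirm what is actually set is to list the uncommented notification and debug directives in the main config file. The sketch below assumes a POSIX shell; the function name show_notify_settings is made up for illustration, and the path in the usage note is only a guess based on the /usr/local/nagios prefix in the question.

```shell
# Print the notification- and debug-related settings from a Nagios main config file.
show_notify_settings() {
    # Match only uncommented lines that assign one of the directives of interest.
    grep -E '^[[:space:]]*(enable_notifications|log_notifications|debug_level|debug_verbosity|debug_file)[[:space:]]*=' "$1"
}
```

For example: show_notify_settings /usr/local/nagios/etc/nagios.cfg. If either enable_notifications or log_notifications comes back as 0, that would explain the silence. Changes to nagios.cfg only take effect after Nagios is restarted or reloaded.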

Also try setting debug_level to 32 (the notifications category) and check the debug log to see what Nagios reports:

debug_level=32
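Once Nagios has been restarted with these settings and the outage repeated, any notification it sends should show up in the main log as a HOST NOTIFICATION or SERVICE NOTIFICATION line, next to the SERVICE ALERT lines quoted in the question. A minimal sketch for filtering those lines, assuming a POSIX shell; the function name show_notifications is hypothetical, and the log path in the usage note is an assumption based on the install prefix used in the question.

```shell
# Print only the notification entries from a Nagios main log file.
show_notifications() {
    # Nagios writes these as "HOST NOTIFICATION:" / "SERVICE NOTIFICATION:".
    grep -E '(HOST|SERVICE) NOTIFICATION:' "$1"
}
```

For example: show_notifications /usr/local/nagios/var/nagios.log. No output suggests Nagios decided not to notify at all, rather than the alerter command itself failing; the debug log (its location is set by debug_file in nagios.cfg) should then give the reason.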