Nagios – Fix ‘check_ntp_time – Offset Unknown’ Error

centos linux nagios ntp

I have a local NTP server running on the subnet so that the other nodes can stay in sync without each of them querying upstream servers. While implementing the check_ntp_time plugin for Nagios, however, I ran into a frustrating issue: Nagios keeps reporting a critical error for the local nodes that sync against the local NTP server.

Here is the NTP config on the local NTP server. Note the upstream server entries and the restrict entry; according to my research, this is what qualifies the node as an NTP server that local nodes can sync against.

driftfile /var/lib/ntp/drift

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod limited nomodify notrap nopeer noquery
restrict -6 default kod limited nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would affect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

# Makes me able to answer requests from local nodes
restrict 10.0.0.0 mask 255.255.192.0 nomodify notrap

# My source
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org

logfile /var/log/ntp/server.log

statistics loopstats
statsdir /var/log/ntp/
filegen peerstats file peers type day link enable
filegen loopstats file loops type day link enable

On the other local nodes (the ones that are not NTP servers), the config is the same except that the restrict entry is removed and the only server entry references the local NTP server: server ntp.example.com iburst.

Every local node can resolve ntp.example.com.

The problem shows up when I run the following command from the Nagios server:

/usr/lib64/nagios/plugins/check_ntp_time -H node-a-1 -v

And the output:

sending request to peer 0
response from peer 0: offset -0.002921819687
sending request to peer 0
response from peer 0: offset -0.0001939535141
sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
discarding peer 0: stratum=0
overall average offset: 0
NTP CRITICAL: Offset unknown|  

This happens for all the nodes except the local NTP server, which references upstream servers. At first I thought it was an iptables issue, but the port is pinholed on every local NTP node (to allow Nagios to check the time difference):

ACCEPT     udp  --  eth0   *       10.0.0.0/18          0.0.0.0/0           multiport dports 123 /* 777 allow ntp access */ state NEW
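
As a quick sanity check that the restrict line and the firewall rule describe the same range, the mask 255.255.192.0 from ntp.conf can be compared with the /18 in the iptables rule. This is only a sketch using Python's standard ipaddress module; the helper name covered and the sample addresses other than 10.0.0.2 are made up for illustration, since the Nagios server's address is not given here.

```python
import ipaddress

# Range from the ntp.conf line: restrict 10.0.0.0 mask 255.255.192.0
restrict_net = ipaddress.ip_network("10.0.0.0/255.255.192.0")
# Source range from the iptables rule: 10.0.0.0/18
iptables_net = ipaddress.ip_network("10.0.0.0/18")


def covered(addr):
    """Return True if addr falls inside both ranges (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    return ip in restrict_net and ip in iptables_net


print(restrict_net == iptables_net)  # both notations describe one network
print(covered("10.0.0.2"))           # the NTP server from this post
print(covered("10.0.64.1"))          # made-up address just outside the /18
```

If the Nagios server's address is not covered, its queries would be handled by the default restrict line rather than the subnet one.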

Versions:

nagios-plugins-ntp: 1.4.16
ntp: 4.2.6p5-1.el6.centos

Any help is greatly appreciated. I really can't submit the Nagios work until I get this resolved; as you know, keeping server times in sync is priority 1.

— Edit —

Per the comments, here are the results of ntpq -p on various nodes:

# Actual NTP Server (10.0.0.2)
==============================================================================
+propjet.latt.ne 241.199.164.101  2 u  105  128  337   14.578   12.954   7.138
+x2la01.hostigat 63.145.169.2     3 u   21  128  377   16.037   13.546   4.090
*pacific.latt.ne 241.199.164.101  2 u   72  128  377   15.148   24.434   7.403

# Local node 1
==============================================================================
*service-a-1.sn1 204.2.134.163    3 u    9  128  377    0.228    5.217   1.296

# Local node 2
==============================================================================
*service-a-1.sn1 204.2.134.163    3 u   91  128  377    0.200    3.608   1.167
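
For reading the output above: in ntpq -p the seventh column is the reach register, an 8-bit history of the most recent polls printed in octal. A value of 377 means all eight polls were answered, while 337 means one of them was missed. The following is a small sketch for decoding it; decode_reach is a made-up name, not part of any NTP tool.

```python
def decode_reach(reach_octal):
    """Decode ntpq's octal 'reach' column into eight booleans.

    Index 0 is the most recent poll, index 7 the oldest.
    """
    value = int(reach_octal, 8)
    if not 0 <= value <= 0o377:
        raise ValueError("reach must be an 8-bit octal value")
    return [bool((value >> i) & 1) for i in range(8)]


for reach in ("377", "337"):
    history = decode_reach(reach)
    print(reach, sum(history), "of 8 polls answered")
```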

Best Answer

The key line here is this one:

discarding peer 0: stratum=0

A server that answers a time query with stratum 0 is outside what the spec allows for a usable time source: in the packet format, stratum 0 means "unspecified or invalid" (informally, "stratum 0" is also the label for reference clocks such as atomic clocks, which never answer network queries themselves). I had this problem years ago with some BSD and Mac OS X hosts. I ended up hacking the stratum check out of the source and maintaining a separate build of the plugin for "problematic" hosts.

The offending lines are 254-257 (currently, anyway), if you want to rip them out. It's a hack, but it works for me ;-)

I found this thread in the mailing list archives about it. I think there was another one where I suggested adding a command-line option to ignore the stratum check, but I don't think it got any traction.

There's also a bug report about it, but it hasn't yielded anything useful as far as I can tell.
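
To see what a node actually puts in the stratum field, independent of the plugin, one option is to send a single client request and decode the reply header by hand. The sketch below is not part of Nagios or its plugins, and the function names are made up. It assumes the header layout from RFC 5905, where byte 1 is the stratum and bytes 12-15 are the reference ID, which carries a four-character ASCII "kiss code" when the stratum is 0.

```python
import socket

NTP_PACKET_LEN = 48  # size of an NTP header without extensions


def build_request():
    """Build a minimal client request: LI=0, version 4, mode 3 (client)."""
    return bytes([0x23]) + bytes(NTP_PACKET_LEN - 1)


def parse_header(packet):
    """Extract the leap, version, mode, and stratum fields from a reply."""
    if len(packet) < NTP_PACKET_LEN:
        raise ValueError("short NTP packet")
    flags, stratum = packet[0], packet[1]
    info = {
        "leap": flags >> 6,
        "version": (flags >> 3) & 0x7,
        "mode": flags & 0x7,
        "stratum": stratum,
        "kiss_code": None,
    }
    if stratum == 0:
        # With stratum 0 the reference ID is an ASCII code such as RATE or DENY.
        ref_id = packet[12:16]
        info["kiss_code"] = ref_id.rstrip(b"\x00").decode("ascii", "replace")
    return info


def query(host, timeout=2.0):
    """Send one request to host on UDP port 123 and parse the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (host, 123))
        data, _ = sock.recvfrom(512)
    return parse_header(data)
```

Calling query("node-a-1") from the Nagios server would show the stratum that host returns. If it comes back as 0 together with a kiss code, that would suggest the node is refusing or rate-limiting the request rather than reporting a genuinely broken clock.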