How to Verify Internal NTP Server is Sending Correct Time

alerts chrony healthcheck ntp

I have two stratum 3 NTP servers running and want to create a simple check that tells me if either server's time has drifted, and alerts when it is no longer synced properly with the public stratum 2 servers.

My first thought was to pull time from multiple stratum 2 servers and compare that time with what my NTP servers are sending, then alert if the drift exceeds some delta X.

Is there a more standard way or better method for verifying that an NTP server is sending the correct time?

Best Answer

TL;DR:

Configure your NTP server according to best current practices.
(Shameless self-promotion warning.) Use my ntpmon check if your monitoring solution uses collectd, Nagios, or telegraf.

Long version:

Configuration

The most important foundation for good NTP monitoring is good NTP configuration. To understand this properly, read the NTP Best Current Practices (BCP 223/RFC 8633). Here's a condensed summary of its configuration recommendations:

Keep your NTP software up-to-date
Use between 4 and 10 sources
Ensure you have a diversity of reference clocks represented in those sources
Don't allow unauthenticated remote control (should be the default on most distros)
Use the pool responsibly (should also be the default on most distros)
Don't mix leap-smeared and non-leap-smeared sources
Don't use unauthenticated broadcast mode
Don't use anycast or load-balancing when you're serving time

Where to measure

Once you have a good local configuration, the main thing to remember is that your check should query the local NTP server for its metrics, rather than trying to manually measure offset from remote servers. The major NTP servers (ntpd and chronyd) already collect all the metrics you need, so checks which compare the clock against remote servers are ignoring a lot of NTP's built-in goodness.
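As a rough sketch of what "query the local NTP server" can look like with chronyd, the Python below parses the human-readable output of `chronyc tracking`. The sample text is illustrative only and follows the layout shown in the chrony documentation; the function names, the returned dictionary keys, and the sign convention (negative when the system clock is slow) are my own assumptions, not part of chrony or ntpmon.

```python
import re
import subprocess

# Illustrative sample only, in the layout `chronyc tracking` prints.
SAMPLE = """\
Reference ID    : CB00710F (foo.example.net)
Stratum         : 3
Ref time (UTC)  : Fri Jan 27 09:49:17 2017
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
RMS offset      : 0.000035822 seconds
Frequency       : 3.225 ppm slow
Residual freq   : -0.000 ppm
Skew            : 0.129 ppm
Root delay      : 0.013639022 seconds
Root dispersion : 0.001100737 seconds
Update interval : 64.2 seconds
Leap status     : Normal
"""


def parse_tracking(text):
    """Extract a few metrics from `chronyc tracking` style output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    match = re.match(r"([\d.]+) seconds (slow|fast) of NTP time",
                     fields["System time"])
    if match is None:
        raise ValueError("unrecognised 'System time' line")
    offset = float(match.group(1))
    if match.group(2) == "slow":
        offset = -offset  # assumed convention: negative means clock is behind

    return {
        "stratum": int(fields["Stratum"]),
        "offset_s": offset,
        "root_dispersion_s": float(fields["Root dispersion"].split()[0]),
        "leap_status": fields["Leap status"],
    }


def read_local_metrics():
    """Ask the local chronyd for its own view (requires chronyc on PATH)."""
    result = subprocess.run(["chronyc", "tracking"],
                            capture_output=True, text=True, check=True)
    return parse_tracking(result.stdout)
```

On a host running chronyd, `read_local_metrics()` would return the daemon's own measurements; on a machine without chrony, `parse_tracking(SAMPLE)` exercises the parser. For ntpd the equivalent data comes from `ntpq`, which this sketch does not cover.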

Metric selection

So to your question, the metrics you should be most interested in are:

system offset: the calculated best guess of the local clock's offset from the one true time
root dispersion: the calculated maximum offset of the local clock from the stratum 0 sources
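A minimal sketch of alerting on these two metrics might look like the following. The threshold values are placeholders I picked for illustration (they are not ntpmon's defaults or any standard), and the names `Thresholds`, `classify`, and `check_sync` are hypothetical; the 0/1/2 return codes follow the usual Nagios plugin convention for OK, WARNING, and CRITICAL.

```python
from dataclasses import dataclass

# Nagios-style plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass(frozen=True)
class Thresholds:
    # Placeholder limits in seconds; tune these for your environment.
    offset_warn_s: float = 0.010
    offset_crit_s: float = 0.050
    rootdisp_warn_s: float = 0.050
    rootdisp_crit_s: float = 0.200


def classify(value, warn, crit):
    """Map one metric to a status using its magnitude."""
    magnitude = abs(value)
    if magnitude >= crit:
        return CRITICAL
    if magnitude >= warn:
        return WARNING
    return OK


def check_sync(offset_s, root_dispersion_s, limits=Thresholds()):
    """Return the worse of the two per-metric statuses."""
    return max(
        classify(offset_s, limits.offset_warn_s, limits.offset_crit_s),
        classify(root_dispersion_s,
                 limits.rootdisp_warn_s, limits.rootdisp_crit_s),
    )
```

Taking the worse of the two statuses means a server with a small offset but a large root dispersion still raises an alert, which is the case a simple offset comparison against a remote server would miss.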

Monitoring

There are a few monitoring solutions for NTP; depending on what monitoring you already have in place, some might suit you better than others. I wrote an overview of these on my blog; here's a summary:

Nagios:

check_ntp_peer: decent basic check; doesn’t check a wide enough variety of metrics; a little too liberal in what offsets it allows
check_ntp_time: not recommended; checks only the offset from a given remote NTP server
check_ntpd: reasonable check coverage; use it if you prefer Perl over Python
ntpmon's nagios check

collectd:

NTP plugin: some of the metrics it collects are unclear
ntpmon in collectd mode

prometheus/influxdb:

prometheus node exporter: not recommended; checks only the offset from a given remote NTP server
telegraf ntpq input plugin: a direct translation of ntpq output to telegraf metrics; this is probably too detailed if you just want to know, "Is my NTP server OK?"
ntpmon in telegraf mode

Caveats

The above is a summary of the state as at October 2016 when I did my alerting and telemetry review. Things may have improved since.
ntpmon is my project which I think overcomes the deficiencies of the checks which were available at the time. It supports both ntpd and chronyd, and the above-listed alerting and telemetry systems.
