How to Verify Internal NTP Server is Sending Correct Time

alerts chrony healthcheck ntp

I have two stratum 3 NTP servers running and want a simple check that tells me when either server's time has drifted, and alerts me that it is no longer properly synced with the public stratum 2 servers.

My first thought was to pull time from multiple stratum 2 servers, compare it with what my NTP servers are sending, and alert if the difference exceeds some threshold X.

Is there a more standard way or better method for verifying that an NTP server is sending the correct time?

Best Answer

TL;DR:

  1. Configure your NTP server according to best current practices.
  2. (Shameless self-promotion warning.) Use my ntpmon check if your monitoring solution uses collectd, Nagios, or telegraf.

Long version:

Configuration

The most important foundation for good NTP monitoring is good NTP configuration. To understand this properly, read the NTP Best Current Practices (BCP 223/RFC 8633). Here's a condensed summary of its configuration recommendations:

  1. Keep your NTP software up-to-date
  2. Use between 4 and 10 sources
  3. Ensure you have a diversity of reference clocks represented in those sources
  4. Don't allow unauthenticated remote control (should be the default on most distros)
  5. Use the pool responsibly (should also be the default on most distros)
  6. Don't mix leap-smeared and non-leap-smeared sources
  7. Don't use unauthenticated broadcast mode
  8. Don't use anycast or load-balancing when you're serving time

Where to measure

Once you have a good local configuration, the main thing to remember is that your check should query the local NTP server for its metrics, rather than trying to manually measure offset from remote servers. The major NTP servers (ntpd and chronyd) already collect all the metrics you need, so checks which compare the clock against remote servers are ignoring a lot of NTP's built-in goodness.
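
As a rough illustration of querying the local daemon rather than remote servers, here is a minimal Python sketch that parses the human-readable output of `chronyc tracking`. It is not how ntpmon or any of the checks below work internally; the function names, the dictionary keys, and the sign convention (negative means the system clock is slow) are assumptions made for this example, and the sample text only mimics the layout chronyd typically prints.

```python
import re
import subprocess

# Sample text in the layout printed by `chronyc tracking` (values are made up).
SAMPLE_TRACKING = """
Reference ID    : CB00710F (ntp1.example.net)
Stratum         : 3
Ref time (UTC)  : Fri Jan 27 09:49:17 2017
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
RMS offset      : 0.000035822 seconds
Root delay      : 0.013639022 seconds
Root dispersion : 0.001100737 seconds
Leap status     : Normal
"""


def parse_chronyc_tracking(text):
    """Extract a few health metrics from `chronyc tracking` output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in values survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    match = re.match(r"([0-9.]+) seconds (slow|fast) of NTP time",
                     fields["System time"])
    if match is None:
        raise ValueError("unrecognised 'System time' line")
    offset = float(match.group(1))
    if match.group(2) == "slow":
        # Assumed convention for this sketch: negative = local clock behind.
        offset = -offset

    return {
        "stratum": int(fields["Stratum"]),
        "offset_seconds": offset,
        "root_dispersion_seconds": float(fields["Root dispersion"].split()[0]),
        "leap_status": fields["Leap status"],
    }


def read_local_tracking():
    """Ask the local chronyd for its view; requires chronyc on this host."""
    result = subprocess.run(["chronyc", "tracking"],
                            capture_output=True, text=True, check=True)
    return parse_chronyc_tracking(result.stdout)
```

On a host running chronyd, `read_local_tracking()` would return the same dictionary for the live daemon. For ntpd the equivalent numbers come from `ntpq`, which would need its own parser.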

Metric selection

So to your question, the metrics you should be most interested in are:

  • system offset: the calculated best guess of the local clock's offset from the one true time
  • root dispersion: the calculated maximum offset of the local clock from the stratum 0 sources
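
To make the alerting side concrete, here is a small Python sketch that turns those two metrics into a Nagios-style status (0 = OK, 1 = WARNING, 2 = CRITICAL). The function names and especially the threshold values are hypothetical placeholders chosen for illustration; they are not recommendations from RFC 8633 or from any existing check, so tune them to your own accuracy requirements.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Compare the magnitude of a metric (in seconds) against two limits."""
    magnitude = abs(value)
    if magnitude >= crit:
        return CRITICAL
    if magnitude >= warn:
        return WARNING
    return OK


def check_sync(offset_seconds, root_dispersion_seconds,
               offset_limits=(0.010, 0.050),       # hypothetical warn/crit
               dispersion_limits=(0.100, 0.500)):  # hypothetical warn/crit
    """Return (overall_state, per_metric_states) for the two metrics."""
    states = {
        "offset": classify(offset_seconds, *offset_limits),
        "root_dispersion": classify(root_dispersion_seconds,
                                    *dispersion_limits),
    }
    # The overall state is the worst of the individual states.
    return max(states.values()), states
```

For example, `check_sync(-0.02, 0.0011)` reports a warning because the 20 ms offset exceeds the assumed 10 ms warning limit, while the root dispersion is well within bounds.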

Monitoring

There are a few monitoring solutions for NTP; depending on what monitoring you already have in place, some might suit you better than others. I wrote an overview of these on my blog; here's a summary:

  1. Nagios:
     • check_ntp_peer: decent basic check; doesn’t check a wide enough variety of metrics; a little too liberal in what offsets it allows
     • check_ntp_time: not recommended; checks only the offset from a given remote NTP server
     • check_ntpd: reasonable check coverage; use it if you prefer Perl over Python
     • ntpmon's Nagios check
  2. collectd:
     • NTP plugin: some of the metrics it collects are unclear
     • ntpmon in collectd mode
  3. Prometheus/InfluxDB:
     • Prometheus node exporter: not recommended; checks only the offset from a given remote NTP server
     • telegraf ntpq input plugin: a direct translation of ntpq output to telegraf metrics; this is probably too detailed if you just want to know, "Is my NTP server OK?"
     • ntpmon in telegraf mode

Caveats

  1. The above is a summary of the state as at October 2016 when I did my alerting and telemetry review. Things may have improved since.
  2. ntpmon is my project which I think overcomes the deficiencies of the checks which were available at the time. It supports both ntpd and chronyd, and the above-listed alerting and telemetry systems.