NTP rejecting upstream due to “peer_dist”

Tags: configuration, debugging, ntp, ntpd

Currently ntpd is rejecting its upstream and the clock is drifting quite badly (15 seconds of offset so far and growing). When I check the reason using ntpq, the flash code is flash=400 peer_dist.

According to the NTP documentation, the peer is marked as distant if the roundtrip takes longer than 1.5 seconds. However, using tcpdump I can see the request leave and the reply return within milliseconds:

09:06:36.304204 IP 10.127.255.230.ntp > 10.127.255.213.ntp: NTPv4, Client, length 68
09:06:36.304371 IP 10.127.255.213.ntp > 10.127.255.230.ntp: NTPv4, Server, length 68

The general architecture here is a single NTP server in this subnet (which gets its time from an upstream outside the cluster) that serves time to the nodes in the subnet. The server is in sync and serving time as normal; however, all the nodes in the subnet report as unsynchronised.

Simply restarting ntpd has no effect, as the peer is still rejected. However, after raising maxdist with tos maxdist 5000 in ntp.conf, it syncs (flash=00 ok).

Why would ntp think that the distance is greater than 1.5s when I can see (using ntpq/tcpdump) that requests complete in milliseconds? Is there some internal NTP parameter that I can tweak other than maxdist that would make sense here? Is there some more debugging that can be done to diagnose this?

This is just one example of a cluster where this is happening, but I see the same symptoms elsewhere.

For reference, here is the (snarky) ntp documentation for maxdist:

maxdist maxdistance
Specify the synchronization distance threshold used by the clock selection algorithm. The default is 1.5 s. This determines both the minimum number of packets to set the system clock and the maximum roundtrip delay. It can be decreased to improve reliability or increased to synchronize clocks on the Moon or planets.

Best Answer

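If ntpd is reporting the peer_dist code for the upstream peer, that means the peer's root synchronization distance has exceeded the maxdist threshold (1.5 seconds by default). That distance is not just the roundtrip time to the peer: it combines the root delay and root dispersion reported by the peer with the delay and dispersion measured in the peer association.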
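To make that concrete, here is a rough Python sketch of the root distance calculation as I understand it from RFC 5905. The function and parameter names are my own, the sample numbers are made up for illustration, and the real ntpd check also adds a small poll-interval term to the threshold, which I have left out.

```python
# Rough sketch of the root synchronization distance, loosely following RFC 5905.
# Names and sample values are illustrative assumptions, not ntpd internals.

PHI = 15e-6      # assumed frequency tolerance in s/s (15 ppm)
MINDISP = 0.005  # assumed minimum dispersion increment in seconds
MAXDIST = 1.5    # default distance threshold in seconds


def root_distance(root_delay, root_disp, peer_delay, peer_disp,
                  jitter, seconds_since_update):
    """Return the root distance in seconds for one association."""
    return (max(MINDISP, root_delay + peer_delay) / 2
            + root_disp                    # dispersion advertised by the server
            + peer_disp                    # dispersion measured on this association
            + PHI * seconds_since_update   # grows while no fresh sample arrives
            + jitter)


def is_too_distant(distance, maxdist=MAXDIST):
    """Simplified check: ignores the poll-interval term ntpd adds."""
    return distance > maxdist


# Hypothetical case: 0.2 ms LAN roundtrip, but the server advertises
# a root dispersion of 2.1 s, so the peer is still "too distant".
d = root_distance(root_delay=0.05, root_disp=2.1, peer_delay=0.0002,
                  peer_disp=0.001, jitter=0.0005, seconds_since_update=64)
print(is_too_distant(d))
```

The point of the sketch is that a fast local roundtrip only shrinks one term; a large root dispersion or root delay inherited from the upstream can push the sum over the threshold on its own.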

Given that your requests complete within a few milliseconds, it seems likely that the problem lies with the upstream stratum, i.e. with the root delay or root dispersion your server is advertising to the nodes. To confirm or deny this you'd need to analyse a packet capture. Are you in control of the upstream as well?
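If you want to pull those values out of a capture yourself, the following is a minimal sketch that decodes the first 12 bytes of an NTP packet, assuming the NTPv4 header layout from RFC 5905 (root delay and root dispersion as unsigned 16.16 fixed-point seconds). The function name is hypothetical, and extracting the UDP payload from the capture file is not shown.

```python
import struct


def parse_ntp_header(payload: bytes) -> dict:
    """Decode the leading fields of an NTP packet (UDP payload only)."""
    if len(payload) < 48:
        raise ValueError("NTP header needs at least 48 bytes")
    # flags, stratum, poll, precision, root delay, root dispersion
    flags, stratum, poll, precision, root_delay, root_disp = struct.unpack(
        "!BBbbII", payload[:12])
    return {
        "leap": flags >> 6,
        "version": (flags >> 3) & 0x7,
        "mode": flags & 0x7,                    # 3 = client, 4 = server
        "stratum": stratum,
        "poll": poll,
        "precision": precision,
        "root_delay": root_delay / 65536,       # 16.16 fixed point -> seconds
        "root_dispersion": root_disp / 65536,   # 16.16 fixed point -> seconds
    }


# Hypothetical server reply: version 4, mode 4, stratum 2,
# root delay 0.5 s, root dispersion 2.25 s, remaining fields zeroed.
sample = struct.pack("!BBbbII", 0x24, 2, 6, -20, 0x8000, 0x24000) + bytes(36)
fields = parse_ntp_header(sample)
print(fields["root_delay"], fields["root_dispersion"])
```

Run it against the server replies (mode 4) rather than the client requests; if root_dispersion or root_delay in those replies is already close to or above 1.5 seconds, the nodes will reject the server no matter how fast the LAN is.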

It's probably worth mentioning here that your design of having a single NTP server in the subnet associating with a single NTP server upstream means that you're nullifying the selection and clustering algorithms, which will result in less accurate time for clients. Each NTP stratum should have 4 to 10 sources for maximum accuracy.