Why is NTP considering the server inadequate?

ntp

I have an embedded Linux device connected directly to my Windows desktop via a USB/Net interface. It's based on the Freescale iMX6 boards so I believe the clock hardware is the SNVS RTC.

On the desktop 192.0.0.10, I have W32Time running as an NTP server and the embedded device 192.0.0.100 is (I think) correctly configured to use it as per the ntp.conf file:

server 192.0.0.10 iburst minpoll 5 maxpoll 7
driftfile /data/ntp.drift
restrict default nomodify nopeer noquery limited kod
restrict 127.0.0.1
restrict [::1]

Connectivity is not an issue (a), since on the embedded device I can execute:

ntpdate -uq 192.0.0.10
ntpdate -ub 192.0.0.10

and this will successfully query and update the time.

However, I find that the clock which is supposed to be kept in sync by ntpd is drifting quite a bit. I started and synced ntpd about 18 hours ago and the offset gradually rose to about 5 seconds:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.0.0.10      192.168.0.4      4 u   31   32  377    1.452  4941.57  11.927

Over the last few hours, it's actually started coming back but it's still 3.2 seconds away from what it should be. In any case, I'm not convinced that's any more than a coincidence, for the following reasons.

When I saw it rising consistently, I did some digging. The output of the ntpq associations command was (and still is):

# ntpq -c as

ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 62876  9024   yes   yes  none    reject   reachable  2

This appears to indicate that, though reachable, the server is being filtered out for some reason. Based on the status 9024 (see here), it appears to be explained by "discarded as not valid (TEST10-TEST13)".

So, then I go and look at the ntpq variables for that association:

# ntpq -c rv 62876

associd=62876 status=9024 conf, reach, sel_reject, 2 events, reachable,
srcadr=192.0.0.10, srcport=123, dstadr=192.0.0.100, dstport=123, leap=00,
stratum=4, precision=-6, rootdelay=129.150, rootdisp=2193.741,
refid=192.168.0.4,
reftime=ddd30907.eff60ee5  Thu, Dec  7 2017  0:25:43.937,
rec=ddd31287.4db24cd8  Thu, Dec  7 2017  1:06:15.303, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=5, ppoll=5, headway=21,
flash=400 peer_dist, keyid=0, offset=3186.569, delay=1.446,
dispersion=16.036, jitter=11.844, xleave=0.093,
filtdelay=     1.45    1.42    1.41    1.47    1.44    1.43    1.44    1.48,
filtoffset= 3186.57 3189.58 3192.08 3194.56 3197.13 3199.58 3202.57 3205.06,
filtdisp=     15.63   16.12   16.60   17.08   17.58   18.06   18.54   19.03

I see that the flash variable is set to 400 which, based on that same page linked to above, shows 0400/TEST11/peer_dist/peer distance exceeded.

Now I gather that's not physical distance (both client and server are on my desktop) or network distance (the two devices are directly connected). The only useful reference I've been able to find on the net is on Google Groups where one David Woolley states:

Distance exceeded means that the combination of worst case round trip
time induced error and an assumed drift of 15ppm since the last valid
time on the root server (plus a few minor components) has exceeded 1 second.

It commonly happens with w32time servers that have been synchronized
once but left to drift. It can also happen if the servers are orphan
mode, and haven't had a real time source for too long, and you are not
using the very latest orphan mode code.

Unfortunately, I have no idea how to calculate the "worst case round trip time induced error", so I'm not sure how to proceed from here. I'm pretty certain my desktop is synchronising with the corporate time server (mine and a smattering of other desktops all seem to be very close in time), though I'm also not sure how I'd check that definitively.

So, my question is, therefore, where can I go from here? I can't seem to get any more useful information out of ntpq and even running ntpd -dd in the foreground doesn't seem to clear up why the server time is being rejected.

Any help would be greatly appreciated.


(a) As further indicated by the logs on the Windows side, enabled with:

w32tm /debug /enable /file:C:\w32time.log /size:10000000 /entries:0-300

and producing:

152281 02:06:57.1968483s - ListeningThread -- DataAvailEvent set for socket 1 (0.0.0.0:123)
152281 02:06:57.1973483s - ListeningThread -- response heard from 192.0.0.100:123 <- 192.0.0.10:123
152281 02:06:57.1973483s - /-- NTP Packet:
152281 02:06:57.1973483s - | LeapIndicator: 3 - not synchronized;  VersionNumber: 4;  Mode: 3 - Client;  LiVnMode: 0xE3
152281 02:06:57.1973483s - | Stratum: 0 - unspecified or unavailable
152281 02:06:57.1973483s - | Poll Interval: 5 - 32s;  Precision: -20 - 953.674ns per tick
152281 02:06:57.1973483s - | RootDelay: 0x0000.0000s - unspecified;  RootDispersion: 0x0000.F1A0s - 0.943848s
152281 02:06:57.1973483s - | ReferenceClockIdentifier: 0x494E4954 - source name: "INIT"
152281 02:06:57.1973483s - | ReferenceTimestamp:   0x0000000000000000 - unspecified
152281 02:06:57.1973483s - | OriginateTimestamp:   0xDDD320A033087D7D - 13157085984199348300ns - 152281 02:06:24.1993483s
152281 02:06:57.1973483s - | ReceiveTimestamp:     0xDDD3209D4DB18BA5 - 13157085981303490400ns - 152281 02:06:21.3034904s
152281 02:06:57.1973483s - | TransmitTimestamp:    0xDDD320BE4D535D3F - 13157086014302053300ns - 152281 02:06:54.3020533s
152281 02:06:57.1973483s - >-- Non-packet info:
152281 02:06:57.1973483s - | DestinationTimestamp: 152281 02:06:57.1973483s - 0xDDD320C132856B0E152281 02:06:57.1973483s -  - 13157086017197348300ns152281 02:06:57.1973483s -  - 152281 02:06:57.1973483s
152281 02:06:57.1973483s - | RoundtripDelay: -562900ns (0s)
152281 02:06:57.1973483s - | LocalClockOffset: -2895576400ns - 0:02.895576400s
152281 02:06:57.1973483s - \--
152281 02:06:57.1973483s - TransmitResponse: sent 0.0.0.0:123(192.0.0.10:123)->192.0.0.100:123

Update on the comment "Over the last few hours, it's actually started coming back": it's actually started drifting out again (currently at 3.7 seconds) so my thoughts that this was a coincidence seem to be supported.

Best Answer

Your client is refusing to synchronize to the server because its "root dispersion" (the server's own estimate of its error from "true" time, and one of the variables that contributes to peer distance) is around 2.2 seconds, which is greater than the default tolerance of one second.
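
As a rough check, the distance can be estimated from the rv output in the question. The sketch below assumes the root distance formula from RFC 5905 (half the summed delays, plus the dispersions and jitter, plus a 15 ppm ageing term) and the one-second threshold mentioned above. The function name and the age_s parameter are illustrative, not ntpd identifiers, and your ntpd build may differ in the details.

```python
# Approximate root distance per RFC 5905, using values (in ms) from "ntpq -c rv".
# Names here are illustrative; they are not ntpd identifiers.

PHI = 15e-6          # assumed frequency tolerance (15 ppm), per RFC 5905
MINDISP_MS = 10.0    # minimum dispersion increment from RFC 5905, in ms
MAXDIST_MS = 1000.0  # assumed default threshold of one second


def root_distance_ms(rootdelay, rootdisp, delay, dispersion, jitter, age_s=0.0):
    """Return the estimated root distance in milliseconds.

    age_s is the time in seconds since the last peer update; it only
    feeds the small PHI ageing term and defaults to zero.
    """
    half_delay = max(MINDISP_MS, rootdelay + delay) / 2
    ageing = PHI * age_s * 1000.0
    return half_delay + rootdisp + dispersion + jitter + ageing


# Values copied from the rv output in the question.
dist = root_distance_ms(
    rootdelay=129.150,
    rootdisp=2193.741,
    delay=1.446,
    dispersion=16.036,
    jitter=11.844,
)

print(f"root distance ~ {dist:.0f} ms, threshold {MAXDIST_MS:.0f} ms")
print("rejected (peer_dist)" if dist > MAXDIST_MS else "acceptable")
```

With the numbers above this comes to roughly 2.3 seconds, almost all of it from rootdisp, which is consistent with the flash=400 peer_dist rejection.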

Although it's best to debug the server and figure out why it has such a bad estimate of its own timekeeping abilities, you can force the client to synchronize to it anyway by providing a larger value for the tos maxdist option in ntp.conf.
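
A minimal sketch of that change, added to the client's ntp.conf from the question. The value 3 (seconds) is only an illustrative choice that sits above the roughly 2.2 second root dispersion seen here; pick whatever tolerance you are actually willing to accept, and restart ntpd afterwards.

```
server 192.0.0.10 iburst minpoll 5 maxpoll 7
# Accept servers whose root distance is up to 3 seconds (illustrative value).
tos maxdist 3
```

Note that this only relaxes the client's sanity check; the server's reported error estimate stays just as large.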