Unpredictable high NTP jitter from single local GPS clock source

jitter, ntp

Question

How can I fix transient, high NTP jitter?

Background information

I have an NTP server on my private network. My servers synchronize from this clock, and usually all is well. An example set of output:

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.10.10.249    10.10.100.20     3 u  367 1024  377    0.096    0.145   0.142
ntpq> as

ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  2378  962a   yes   yes  none  sys.peer    sys_peer  2
ntpq> rv 2378
associd=2378 status=962a conf, reach, sel_sys.peer, 2 events, sys_peer,
srcadr=10.10.10.249, srcport=123, dstadr=10.10.200.1, dstport=123,
leap=00, stratum=3, precision=-18, rootdelay=1.190, rootdisp=37.155,
refid=10.10.100.20,
reftime=df134714.c026b762  Mon, Aug  6 2018 22:15:48.750,
rec=df134a04.507b5ad6  Mon, Aug  6 2018 22:28:20.314, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=10, ppoll=10, headway=0, flash=00 ok,
keyid=0, offset=0.145, delay=0.096, dispersion=15.187, jitter=0.142,
xleave=0.052,
filtdelay=     0.10    0.10    0.05    0.08    0.09    0.11    0.11    0.11,
filtoffset=    0.14    0.16    0.19    0.12    0.02   -0.02   -0.04   -0.10,
filtdisp=      0.00   15.57   31.37   47.42   63.65   79.41   95.27  110.72

However, every once in a while one of the systems jumps to a much larger jitter value. Digging in when that happens, we see a single spike in the delay and offset samples. Example:

filtdelay=     0.06    0.11  250.20    0.07    0.04    0.10    0.07    0.09,
filtoffset=    0.05   -0.01  124.95   -0.05   -0.05   -0.07   -0.05   -0.03,

Note that in this case the reported offset (usually, but not always) stays within +0.5/-0.5:

# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.10.10.249    10.10.100.20     3 u  711 1024  377    0.112   -0.006  47.230
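
For what it's worth, that 47.230 looks consistent with how ntpd derives peer jitter from its eight-sample clock filter: roughly (per RFC 5905), the RMS difference between the offset of the minimum-delay sample and the offsets of the other samples. A quick sketch in Python, using the spike values shown above (they are rounded to two decimals in the display, so the result is only approximate):

import math

# Eight-sample clock filter values from the spike above (milliseconds,
# as displayed by ntpq, i.e. rounded to two decimals).
filtdelay  = [0.06, 0.11, 250.20, 0.07, 0.04, 0.10, 0.07, 0.09]
filtoffset = [0.05, -0.01, 124.95, -0.05, -0.05, -0.07, -0.05, -0.03]

# ntpd prefers the minimum-delay sample as the "best" one...
best_idx = filtdelay.index(min(filtdelay))
best_off = filtoffset[best_idx]

# ...and peer jitter is (roughly) the RMS difference between that sample's
# offset and the offsets of the remaining samples (RFC 5905 clock filter).
others = [off for i, off in enumerate(filtoffset) if i != best_idx]
jitter = math.sqrt(sum((best_off - off) ** 2 for off in others) / len(others))

print(f"jitter ~ {jitter:.3f} ms")   # ~47.25, vs. the 47.230 reported by ntpq -pn

In other words, a single exchange that came back ~250 ms late (and therefore ~125 ms apparently off) is enough to hold the reported jitter around 47 for as long as that sample stays in the eight-stage filter; at a 1024 s poll interval that can be a couple of hours.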

Sometimes the high jitter value persists, mostly unchanged, for a few hours; the magnitude varies from about 1 to over 100. Eventually it drops back down below 1.

Addendum
We are seeing a correlation between system load and NTP jitter. As a first guess, NTP packets might be colliding with NFS traffic.
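
One way to test that (a rough sketch, assuming Linux and the ntpq -pn column layout shown above; the 60 s interval is arbitrary) is to log the sys.peer jitter next to the load average and look for the correlation directly:

import subprocess
import time

# Log the 1-minute load average next to the sys.peer jitter once a minute.
# Parsing assumes the "ntpq -pn" layout shown earlier: '*' marks the current
# sys.peer and jitter is the last column (ms).
while True:
    out = subprocess.run(["ntpq", "-pn"], capture_output=True, text=True).stdout
    jitter = None
    for line in out.splitlines():
        if line.startswith("*"):
            jitter = line.split()[-1]
    with open("/proc/loadavg") as f:        # Linux-specific
        load1 = f.read().split()[0]
    print(f"{time.strftime('%F %T')}  load={load1}  jitter={jitter}", flush=True)
    time.sleep(60)

If the jitter spikes line up with load or NFS bursts, the occasional ~250 ms responses are probably being queued on the host or somewhere on the path, rather than anything being wrong with the clock source itself.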

EDIT
It's not a GPS clock source.

EDIT
It's definitely a problem. The jitter we see roughly correlates with high offset values.

Best Answer

Based on my experience on the Mars 2003 project at JPL, where I was responsible for the software phase-locked loop that kept the ground-based simulation in sync with the clock signal downlinked from the spacecraft, aliasing is the only phenomenon I can think of that might cause transient jitter like this. Aliasing happens when the association is lost between what the time-signal client thinks a given "tick" represents and what it really is. If your clients ("my servers" in your question) use an anti-aliasing algorithm to get back in sync after a loss of connectivity, it can take them a while to re-sync.

The Mars '03 clock signal was 8 Hz, meaning there were eight signals per second. If the client fell behind in its sampling by more than 1/8 of a second, it would miss one of the signals and get confused. To combat this, I made the phase-locked loop as robust and elastic as possible, so that under normal circumstances it was practically impossible for it to lose sync with the incoming signal. If it did lose sync (which I never saw happen unless I forced it with an oscilloscope), it had to start over by waiting for the well-known sync pattern to come in, at which point it could reset the phase-locked loop just as it does at startup.
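
As a toy illustration of that kind of aliasing (nothing from the Mars '03 code; the numbers are made up to match the 8 Hz example):

# Ticks arrive every 1/8 s and the client labels each tick by counting them
# as they arrive.  If the client stalls for more than one tick period and
# misses a tick, its count and the sender's tick number diverge -- the
# aliasing described above.
TICK = 1.0 / 8                                         # 8 Hz signal
sent = [i * TICK for i in range(32)]                   # tick i goes out at i/8 s
received = [t for i, t in enumerate(sent) if i != 20]  # tick 20 lost during a stall

for count, t in enumerate(received):
    true_index = round(t / TICK)                       # what the tick really was
    if count != true_index:
        print(f"client thinks tick {count}, sender meant tick {true_index} (t={t:.3f} s)")
        break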

My guess, based on that experience, is that your transient jitter results from transient losses of connectivity on the time-sync network, possibly compounded by packet storms if your time protocol guarantees delivery the way TCP does. If a guaranteed-delivery protocol falls behind the clock signal, aliasing results. The clients must then do whatever they do to re-sync, and trying to guarantee delivery under those circumstances can kick up a packet storm that makes things worse before they get better. If the anti-aliasing logic is sound enough, then you might want to check whether your time protocol is running over TCP (which guarantees delivery) or UDP (which doesn't, but is much leaner), and use UDP to eliminate the packet storms.
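
For reference, ntpd's protocol (RFC 5905) runs over UDP on port 123, as the srcport=123/dstport=123 values in the question's rv output indicate. A minimal SNTP-style query sketch, just to show the single-datagram exchange involved (the server address is the one from the question, used purely as an example):

import socket
import struct

NTP_EPOCH_OFFSET = 2208988800        # seconds between the 1900 and 1970 epochs
SERVER = "10.10.10.249"              # the server from the question, as an example

# 48-byte request; first byte 0x23 = LI 0, version 4, mode 3 (client).
request = b"\x23" + 47 * b"\x00"

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.settimeout(2.0)
    sock.sendto(request, (SERVER, 123))
    reply, _ = sock.recvfrom(1024)

# Transmit timestamp: 32-bit seconds and 32-bit fraction at byte offset 40.
secs, frac = struct.unpack("!II", reply[40:48])
print("server time (unix):", secs - NTP_EPOCH_OFFSET + frac / 2**32)

A delayed or dropped exchange therefore just leaves one bad (or missing) sample in the client's clock filter, as in the 250 ms example above, rather than triggering transport-level retransmissions.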
