NTP synchronization stays with no reach

ntp ntpd

Usually, when a machine loses its connection completely, ntpd misses a couple of polls and marks all sources as having failed the sanity checks, which seems quite logical. But I've encountered a situation where a server stays marked as the current time source while its reach has turned 0.

The server is deployed in the same subnet as the target machine, providing very low delay, offset and jitter. The situation was modelled by breaking the connection physically: just unplugging the cable from the client machine. I tried to recreate this, but since then the same machine always loses its synchronization status cleanly after 5-6 unsuccessful polls.

The real question is: what exactly determines the synchronization status when the connection is lost?

Best Answer

There is a definite explanation of the reach register in RFC-1305:

The reachability register is shifted one position to the left, with zero replacing the vacated bit. If all bits of this register are zero, the clear procedure is called to purge the clock filter and reselect the synchronization source, if necessary. If the association was not configured by the initialization procedure, the association is demobilized.

However, RFC-1305 is obsoleted by RFC-5905, which is not as explicit:

Next, the 8-bit p.reach shift register in the poll process described in Section 13 is used to determine whether the server is reachable and the data are fresh. The register is shifted left by one bit when a packet is sent and the rightmost bit is set to zero. As valid packets arrive, the rightmost bit is set to one. If the register contains any nonzero bits, the server is considered reachable; otherwise, it is unreachable.
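The register behaviour described in that passage can be modelled in a few lines of Python. This is only an illustrative sketch, not ntpd's actual code; the function names `on_poll_sent`, `on_valid_packet` and `is_reachable` are made up for the example.

```python
REACH_MASK = 0xFF  # p.reach is an 8-bit shift register


def on_poll_sent(reach: int) -> int:
    """Shift left by one bit when a packet is sent; the rightmost bit becomes 0."""
    return (reach << 1) & REACH_MASK


def on_valid_packet(reach: int) -> int:
    """Set the rightmost bit to 1 when a valid packet arrives."""
    return reach | 1


def is_reachable(reach: int) -> bool:
    """The server counts as reachable while any bit is nonzero."""
    return reach != 0


# Seven polls that are all answered (hypothetical scenario).
reach = 0
for _ in range(7):
    reach = on_valid_packet(on_poll_sent(reach))
filled = reach
print(oct(filled))  # 0o177

# The link goes down: count unanswered polls until the register is empty.
missed = 0
while is_reachable(reach):
    reach = on_poll_sent(reach)
    missed += 1
print(missed)  # 8
```

In this model a register filled by 7 answered polls needs 8 unanswered polls before every bit is shifted out. What ntpd does with its selected source once the register is empty is a separate matter that the model does not cover.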

No clear procedure is mentioned in Section 13. But it still looks like an unreachable peer should lose its synchronization status at some point.

I've managed to recreate the "synchronized with reach 0" situation, to confirm that it is rare and not at all permanent. It took disabling "burst" in the server configuration and breaking the connection right after synchronization.

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 91.198.10.4     194.190.168.1    2 u   20   64  177   51.137   -2.192  11.049
 192.168.1.1     193.67.79.202    2 u   65   64   77    0.459   -1.818   0.922
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*91.198.10.4     194.190.168.1    2 u   21   64  177   51.137   -2.192  11.049
+192.168.1.1     193.67.79.202    2 u    -   64  177    0.449   -3.192   1.828

The reach was 177, which ntpq displays in octal; in binary it is 01111111. So it took 7 polls to establish the synchronization.

The synchronization was then lost at this point:
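Since the reach column in ntpq output is octal, decoding it by hand is easy to get wrong. A small sketch for converting the displayed value into its 8-bit pattern follows; the helper name `decode_reach` is hypothetical.

```python
def decode_reach(reach_column: str) -> str:
    """Convert the octal reach value shown by ntpq into an 8-bit binary string."""
    value = int(reach_column, 8)
    return format(value, "08b")


for shown in ("77", "177", "377", "0"):
    bits = decode_reach(shown)
    # Each 1 bit stands for a poll that received a valid reply.
    print(shown, bits, bits.count("1"))
```

The rightmost bit is the most recent poll, so 177 means the last seven polls were answered and 0 means none of the last eight were.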

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+91.198.10.4     194.190.168.1    2 u  574   64    0   63.846   -9.652   0.756
*192.168.1.1     193.67.79.202    2 u  553   64    0    0.449   -3.192   0.505
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 91.198.10.4     194.190.168.1    2 u  575   64    0   69.871  -10.409   0.002
 192.168.1.1     193.67.79.202    2 u  554   64    0    0.449   -3.192   0.505

The "when" numbers are a little strange, as 64*9 = 576, not 575, but I guess the displayed value might be off by 1 second. Considering this, it took 9 unsuccessful polls to break the synchronization status.

So, considering both theory and practice, it looks like a state in which a source with reach 0 is still considered the current time source is possible, but also rare and temporary.

Related Solutions

NTP fudge network source stratum

After some more research it seems "fudging" the stratum level of a network source is not possible. So I moved on and tried dtoubeli's answer. To my surprise, simply making my local time server a stratum level 2 (equal to the 3rd party device) did not always cause it to be the preferred time source. My local ntpd would still rule them both as "false ticks". For what reason, I'm not sure, but I'm guessing because they were the only two time sources, and their times were so far off.

The biggest problem here is the fact that my 3rd party device doesn't seem to hold a very consistent time, in fact it fluctuates a lot. The solution to my problem was adding several other accurate time sources (pool.ntp.org) to my /etc/ntp.conf. Now my local server is always chosen as the preferred time source, often times despite having a higher stratum level than some of the servers in the pool.

NTP – Synchronizing Time with Only One NTP Server

As Pawel has said, remove the local clock line in your ntp.conf. In fact, remove everything, pretty much. If you have a working, sync'ed NTP source on your local network that's willing to act as a server, then clients really only need one line in their ntp.conf, which should read

server ntp.intranet.example.com

or, for fastest syncing,

server ntp.intranet.example.com burst

(the latter puts more load on the server at service start time, but since it's your server, you can say "I permit that" if you want faster syncing at ntpd start time).

Don't forget to put ntp.intranet.example.com in /etc/ntp/step-tickers, or wherever your distro keeps that file, so the clocks of clients can be hard-synced at startup time.
