Likely causes of NTPD dying unexpectedly and solutions

Tags: amazon-s3, ntp, ntpd, service, virtual-machines

On a web application that uses S3 for document storage, ntpd keeps dying, roughly once or twice a day. Very little information is logged when this happens; when I check the status, the PID file exists but the service is dead.

Can anyone suggest likely causes of ntpd dying? I assume clock drift may be causing it, but I am not sure what would cause the drift either. There is more than enough memory and free disk space.

The last time the service died, this was the output:

Sep  6 06:15:25 vm02 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="988" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep  6 06:17:06 vm02 ntpd[10803]: 0.0.0.0 0618 08 no_sys_peer
Sep  6 08:01:10 vm02 ntpd[10803]: 0.0.0.0 0617 07 panic_stop -28101 s; set clock manually within 1000 s.

Best Answer

I would say there is no one-minute way to find the exact cause.

We had similar issues in our ESXi environment. In short, the ESXi host's clock had drifted considerably, and the guest VMs were syncing time from both the ESXi host and an upstream NTP server. The conflicting sources confused ntpd on the VMs, so it died quite often.

In some rare cases we also found that random packet loss caused ntpd to quit, because the round-trip time between your server and the upstream NTP server is used in calculating the clock offset.
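For context, the standard NTP on-wire calculation derives both the offset and the round-trip delay from four timestamps, which is why a delayed or asymmetric exchange can distort the estimate. Below is a minimal Python sketch with made-up timestamps; the function name and all values are illustrative and not taken from ntpd itself.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard NTP on-wire formulas.

    t1: client send time     (client clock)
    t2: server receive time  (server clock)
    t3: server send time     (server clock)
    t4: client receive time  (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Hypothetical scenario: the client clock is really 50 ms behind the server.

# Symmetric path, 10 ms each way: the estimate matches the true offset.
symmetric = ntp_offset_and_delay(0.000, 0.060, 0.061, 0.021)

# Same exchange, but the reply is held up for an extra 200 ms on the way back.
# The estimate is now wrong by half of the asymmetry (100 ms).
asymmetric = ntp_offset_and_delay(0.000, 0.060, 0.061, 0.221)

print("symmetric  offset=%.3f s delay=%.3f s" % symmetric)
print("asymmetric offset=%.3f s delay=%.3f s" % asymmetric)
```

The error from path asymmetry is bounded by half the round-trip delay, so ordinary network trouble on its own should produce errors far smaller than the 1000 s panic threshold; a jump of that size points more toward the clock itself being moved.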

In both cases, if ntpd sees a massive time offset (by default, more than 1000 s), it exits. That matches the panic_stop -28101 s line in your log. The -g option will help a bit:

   -g      Normally,  ntpd  exits  with  a  message to the system log if the offset exceeds the panic threshold,
           which is 1000 s by default. This option allows the time to be set to any value  without  restriction;
           however,  this  can  happen only once. If the threshold is exceeded after that, ntpd will exit with a
           message to the system log. This option can be used with the -q and -x options. See the tinker command
           for other options.
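The behaviour described in that excerpt can be modelled roughly as follows. This is a simplified sketch of the decision as the man page words it, not ntpd's actual code; the function name, the action labels, and the `already_stepped` flag are hypothetical.

```python
PANIC_THRESHOLD_S = 1000.0  # default panic threshold quoted in the man page


def handle_offset(offset_s, g_option, already_stepped):
    """Return (action, already_stepped) for a measured offset in seconds.

    action is one of:
      "adjust" - offset is within the panic threshold, normal operation
      "step"   - -g allowed a one-time correction of a huge offset
      "exit"   - ntpd logs a message and quits
    """
    if abs(offset_s) <= PANIC_THRESHOLD_S:
        return "adjust", already_stepped
    if g_option and not already_stepped:
        return "step", True
    return "exit", already_stepped


# Offset taken from the panic_stop line in the question's log.
observed = -28101.0

without_g = handle_offset(observed, g_option=False, already_stepped=False)
first_with_g = handle_offset(observed, g_option=True, already_stepped=False)
second_with_g = handle_offset(observed, g_option=True, already_stepped=first_with_g[1])

print(without_g[0], first_with_g[0], second_with_g[0])
```

In this model -g only gets ntpd past the first large offset. If the clock keeps jumping by more than the threshold, as it would when the hypervisor repeatedly resets it, ntpd still exits the second time, so the underlying conflict between time sources has to be fixed as well.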

Have a look at the system log, which should contain messages that give you a hint. You could also monitor the output of "ntpq -p" to get a rough idea of how the offset develops.
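One way to track the offset over time is to parse the "ntpq -p" peer table from a cron job and log the result. The sketch below assumes the usual ten-column layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter, with delay, offset and jitter in milliseconds); the sample text, host names, and function names are invented for illustration.

```python
import subprocess


def parse_ntpq_peers(text):
    """Parse `ntpq -p` output into a list of dicts, one per peer row."""
    peers = []
    for line in text.splitlines():
        if not line:
            continue
        tally = line[0]            # first column is the tally code, e.g. '*' or '+'
        fields = line[1:].split()
        # Skip the header row and the '====' separator.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        try:
            offset_ms = float(fields[8])
        except ValueError:
            continue
        peers.append({"tally": tally, "remote": fields[0], "offset_ms": offset_ms})
    return peers


def worst_offset(peers):
    """Return the peer with the largest absolute offset, or None if empty."""
    return max(peers, key=lambda p: abs(p["offset_ms"]), default=None)


def read_live():
    """Run ntpq on this machine (requires ntpq to be installed)."""
    return subprocess.run(
        ["ntpq", "-p"], capture_output=True, text=True, check=True
    ).stdout


# Invented sample output, used so the sketch runs without ntpq installed.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1.example.net 192.0.2.10       2 u   35   64  377    1.234    0.512   0.101
+ntp2.example.net 192.0.2.11       2 u   40   64  377    2.345 -120.250   0.300
"""

peers = parse_ntpq_peers(SAMPLE)
print(worst_offset(peers))
```

On the affected VM, `parse_ntpq_peers(read_live())` could be run every few minutes and the offsets appended to a log with a timestamp, so a sudden jump can be correlated with events on the host.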