ntpd drift on the wapps.

15apr05

 Links to plots:

Plots of the 4 wapps (.ps)  (.pdf) :
Plots of wapp 4 response (.ps)  (.pdf):

    The ntp (network time protocol) daemon is used to keep the clocks on the 4 wapps synchronized. The wapps are normally started on a hardware 1 second tick. Which tick they start on is determined by the time on the wapp computers.

    To start an observation, the observer2 computer waits until a few hundred milliseconds before the second and then tells the wapps to start on the next second. Each wapp looks at its local time, waits until a predefined number of milliseconds before the second, and then sends the command to start on the next hardware tick. The value of the tick recorded in the header is taken from the wapp time. If the wapp time has drifted far enough, different wapps may start on different hardware 1 second ticks. When this happens the time recorded in the header may or may not be correct (depending on how far the clock has drifted).
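The start sequence above can be sketched as a small model. This is only an illustration under stated assumptions, not the actual wapp code: the function name start_tick, the 100 ms lead time, and the rule that a wapp arms immediately if its arming point has already passed are all hypothetical. Times are integer milliseconds, and offset_ms is (wapp clock minus true time).

```python
def start_tick(cmd_true_ms, offset_ms, lead_ms=100):
    """Model one wapp's start (hypothetical sketch, not the real wapp code).

    cmd_true_ms : true time (ms) at which the start request arrives
    offset_ms   : wapp clock error in ms (local time minus true time)
    lead_ms     : assumed arming point, in ms before the local second

    Returns (hardware_tick_s, header_tick_s): the true second on which the
    hardware actually starts, and the second the wapp writes in the header.
    """
    local_ms = cmd_true_ms + offset_ms
    # The wapp aims for the next whole second of its own (local) clock.
    header_tick_s = local_ms // 1000 + 1
    # It arms lead_ms before that local second ...
    arm_local_ms = header_tick_s * 1000 - lead_ms
    # ... which corresponds to this true time (assumed: arm at once if late).
    arm_true_ms = max(cmd_true_ms, arm_local_ms - offset_ms)
    # The hardware starts on the first 1 second tick after arming.
    hardware_tick_s = arm_true_ms // 1000 + 1
    return hardware_tick_s, header_tick_s


if __name__ == "__main__":
    # Request arrives 300 ms before true second 100.
    for offset in (0, 200, 500, -150, -500):
        hw, hdr = start_tick(99_700, offset)
        status = "ok" if hw == hdr else "header wrong"
        print(f"offset {offset:+5d} ms: starts on tick {hw}, header says {hdr} ({status})")
```

In this model a wapp whose clock is 500 ms fast starts one tick later than a wapp with a good clock but records a consistent header, while a wapp whose clock is slow starts on the later tick and records the earlier one. That matches the "may or may not be correct" behavior described above.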

    Project P2030 is the alfa pulsar search. It uses 64 usec dumps, 256 lags, with polarization adding. When the program is running the wapp cpu is up to 75% busy. There have been problems with the ntp clock drifting when this program has been run (this can be seen in the 2nd, 3rd, and 4th compression plots).

    On 15apr05 p2030 started taking data at about 9 hours and ran until about 12 hours (utc). Plots were made showing what the time was doing:

Plots of the 4 wapps (.ps)  (.pdf) :

The first page shows 8 through 13 hours (utc), while the second page is a blowup around 8.5 to 10 hours (when the first jumps occurred). The wapps are color coded (black: wapp1, red: wapp2, green: wapp3, blue: wapp4). The vertical dashed lines mark when p2030 ran (it started and stopped data taking many times between these limits). The dotted lines (color coded) mark when the different wapps reset their ntp loop parameters (these may be off by a minute or so). The plots show:
setups:     When data taking started (my start time might be a little off), wapps 1, 2, and 3 started to slew their frequency offsets. A few minutes later the measured offsets had reached about 0.5 seconds. The vertical jumps of the measured offsets back to 0 are when the ntpd's reset their loop variables because things had drifted too far. The maximum allowed frequency offset is 500 ppm. The maximum loop time offset is about 128 ms (absolute value). It is interesting that wapp 1 responded to the measured time offset jump at 9.1 hrs by ramping the frequency offset, while wapp 4 (which jumped at the same time) did not try to change the loop frequency for 20 minutes.
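As a rough check on the numbers above, the arithmetic below estimates how long a clock would need to remove a given time offset if it could only slew at the maximum frequency offset. It is a sketch, not ntpd code: the constant and function names are made up for illustration, and the only inputs taken from the text are the 500 ppm frequency limit, the 128 ms loop limit, and the 0.5 second measured offset.

```python
MAX_FREQ_PPM = 500.0       # maximum allowed frequency offset (from the text)
LOOP_LIMIT_S = 0.128       # maximum loop time offset (from the text)


def min_slew_time_s(offset_s, max_freq_ppm=MAX_FREQ_PPM):
    """Shortest time (seconds) to slew out offset_s at the maximum rate."""
    return abs(offset_s) / (max_freq_ppm * 1e-6)


def exceeds_loop_limit(offset_s, limit_s=LOOP_LIMIT_S):
    """True if the offset is too large for the loop to handle by slewing."""
    return abs(offset_s) > limit_s


if __name__ == "__main__":
    for offset in (0.128, 0.5):
        minutes = min_slew_time_s(offset) / 60.0
        print(f"{offset:.3f} s offset: at least {minutes:.1f} minutes at "
              f"{MAX_FREQ_PPM:.0f} ppm, exceeds loop limit: {exceeds_loop_limit(offset)}")
```

At 500 ppm a 0.5 second offset would take about 1000 seconds (roughly 17 minutes) to slew out, and it is well beyond the 128 ms limit, which is consistent with the loops resetting rather than tracking the offset.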

Plots of wapp 4 response (.ps)  (.pdf):

    Wapp 4 was sampling the ntp primary server info every 16 seconds. In these plots each ntp primary server is color coded.
  • Top plot: The time offset used in the wapp4 pll loop. Notice that there are no changes from 9.06 until 9.31, when the loop parameters were reset.
  • 2nd plot: The frequency offset used in the wapp4 pll loop.
  • 3rd plot: The time offset of wapp4 time relative to the 3 primary ntp servers. Notice that all of the servers tracked one another.
  • 4th plot: the round trip communications delay for wapp4 to talk to the ntp servers. These values all remained small so there is no communications problem.
  • 5th plot: The peer status code, which shows which of the ntp servers wapp4 tried to sync to. The coding is described in RFC 1305, Appendix B (peer status words). The codes are:
      • 0 - peer rejected.
      • 1 - passed sanity checks 1 through 8.
      • 2 - passed correctness tests.
      • 3 - passed candidate checks.
      • 4 - passed outlier checks.
      • 5 - current synchronization source; max distance exceeded (if limit check implemented).
      • 6 - current synchronization source; max distance ok.
  • 6th plot: the rms jitter for each primary server.
  • The change in the local clock occurred at 9.07 hours. The pll loop did not respond, so ntp was not changing the local clock during the jump. All 3 of the ntp servers tracked one another during the jump, so the common element is the wapp4 clock: it was the one that jumped, and all 3 servers measured the same jump. The jump was not caused by ntp, since ntp was not changing the loop parameters during this period. So the problem must be in the wapp4 clock itself.
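The reasoning in the last item can be written as a simple test on the sampled data: a step that appears in the offsets to every server at the same sample, while the loop frequency stays unchanged, points at the local clock rather than at a server or at ntp. The sketch below is illustrative only. The function names, thresholds, and sample values are all hypothetical and are not taken from 15apr05_wapp.pro or from the measured data.

```python
def find_common_jumps(offsets_by_server, min_jump_s=0.05):
    """Return sample indices where every server's offset steps by at least
    min_jump_s in the same direction between consecutive samples."""
    series = list(offsets_by_server.values())
    if not series:
        return []
    n = min(len(s) for s in series)
    jumps = []
    for i in range(1, n):
        steps = [s[i] - s[i - 1] for s in series]
        same_sign = all(d > 0 for d in steps) or all(d < 0 for d in steps)
        if same_sign and all(abs(d) >= min_jump_s for d in steps):
            jumps.append(i)
    return jumps


def local_clock_suspects(offsets_by_server, loop_freq_ppm,
                         min_jump_s=0.05, freq_tol_ppm=1e-3):
    """Keep only the common jumps during which the loop frequency did not
    change, i.e. jumps that ntp itself was not causing."""
    suspects = []
    for i in find_common_jumps(offsets_by_server, min_jump_s):
        if i < len(loop_freq_ppm) and \
                abs(loop_freq_ppm[i] - loop_freq_ppm[i - 1]) <= freq_tol_ppm:
            suspects.append(i)
    return suspects


if __name__ == "__main__":
    # Synthetic example: three servers, one sample every 16 seconds.
    offsets = {
        "server_a": [0.001, 0.002, 0.102, 0.103],
        "server_b": [0.000, 0.001, 0.101, 0.101],
        "server_c": [0.002, 0.002, 0.103, 0.104],
    }
    loop_freq = [10.0, 10.0, 10.0, 10.0]   # ppm, unchanged across the jump
    print("samples where the local clock is suspected:",
          local_clock_suspects(offsets, loop_freq))
```

A jump seen by only one server, or one that coincides with a change in the loop frequency, is not flagged, since in those cases a server, the network, or ntp itself could be responsible.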

    processing: x101/ntp/15apr05_wapp.pro, 15apr05_event.pro
