Heat it Up
In my previous post, I introduced ntpviz, a way to quantify and visualize what ntpd is doing. The graphs immediately made obvious that some externality was degrading ntpd accuracy. Experimentation showed the wild card was temperature. This post will show how to measure and graph temperature and illuminate its effects on NTP performance.
Measure it
Before we can understand temperature effects we must gather temperature data. Keane Wolter recently added a program to the contrib/ directory that greatly simplifies this task. ntplogtemp automatically gathers data from four different types of sources and logs it to a file that can be used by ntpviz.
The ntplogtemp man page is available and documents the potential temperature choices. There is also a blog post on ntplogtemp. For now, you just need to know how to run it. It is easy, and root is not required: the user just needs to be able to write to your ntpd logs directory. Usually this is the user ntpuser. Just run this:
ntplogtemp -l /var/log/ntpstats/temps -w 300 &
That will log available temperatures to a file that ntpviz will automatically use with no further configuration. Congratulate yourself, then check your ntpviz plots tomorrow.
Unexplained squiggles
Now might be a time to re-read the description in my last post of the temperature here: So Much Data
To summarize: CPU load spikes and room temperature seemed to correlate with aberrations in the system clock. But without temperature data it might have been just wishful thinking. Thus the need to collect temperature data.
This is a New Year, so here is a new plot of time and frequency offsets, taken from a plain Raspberry Pi 3 with a GPS HAT.
And a matching plot of Local Frequency Offset and ZONE0 temperature.
ZONE0 is the temperature sensor inside most CPUs. The drop in both graphs from 08:00Z to 13:00Z is night time, when the room heat was off. The spikes at 23 minutes after the hour, each hour, are largish cron jobs.
The summary data for those plots:
| | Min | 1% | 5% | 50% | 95% | 99% | Max |
|---|---|---|---|---|---|---|---|
| Local Clock Time Offset | -9.463 | -1.181 | -0.709 | -0.050 | 0.493 | 4.324 | 8.128 |
| Local Clock Frequency Offset | -4.366 | -4.337 | -4.279 | -3.606 | -2.764 | -2.659 | -2.482 |
| Temp ZONE0 | 42.932 | 42.932 | 43.470 | 46.160 | 49.388 | 49.388 | 54.230 |

| | 90% | 95% | StdDev | Mean | Units |
|---|---|---|---|---|---|
| Local Clock Time Offset | 1.202 | 5.505 | 0.763 | -0.002 | µs |
| Local Clock Frequency Offset | 1.515 | 1.678 | 0.513 | -3.581 | ppm |
| Temp ZONE0 | 5.918 | 6.456 | 2.004 | 45.992 | °C |
Basically this Raspberry Pi 3 can hold its Local Time Offset to a Standard Deviation of about 0.7 µs and its Frequency to about 0.5 ppm. The frequency offset seems to move about 256 ppb/°C.
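The 256 ppb/°C figure is consistent with simply dividing the frequency standard deviation by the temperature standard deviation from the summary data above. Whether the number was derived exactly this way is an assumption. A more careful estimate would fit frequency against temperature sample by sample.

```python
# StdDev values copied from the summary data above.
freq_stddev_ppm = 0.513   # Local Clock Frequency Offset
temp_stddev_c = 2.004     # Temp ZONE0

# Rough sensitivity: ratio of the two spreads (ASSUMED method), ppm -> ppb.
sensitivity_ppb_per_c = freq_stddev_ppm / temp_stddev_c * 1000.0
print(f"{sensitivity_ppb_per_c:.0f} ppb/°C")  # prints "256 ppb/°C"
```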
zero TC
The math of how temperature changes frequency in a crystal is complex. It depends on many variables. Some important ones are how the crystal was cut, how it is excited, and even how long the crystal has aged. So there is no easy way to correct for it in software.
As you can see from the above plots, the crystal speeds up when the temperature increases. Achim Gratz recently used a Raspberry Pi to show that as you heat the crystal up toward a temperature T0, the temperature affects the frequency less and less. As you continue to heat past T0, the temperature dependence returns.
This magic temperature T0 is called the 'zero Temperature Coefficient' point, abbreviated zero TC. In Achim’s case zero TC is when his CPU temperature is 60°C. The crystal is on the flip side of the PCB, so the actual crystal temperature at zero TC is not known.
The math and physics of crystals get complicated. If you want to dive into the theory, this is a good paper on Static Frequency versus Temperature Stability.
This has a simple application to NTP time keeping. If we can heat up our NTP server to the zero TC point, it becomes less influenced by temperature fluctuations and so performs better.
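To get a feel for why the zero TC point helps, here is a toy model. Near a turnover point the frequency curve is flat to first order, so locally it can be approximated by a parabola centred on T0. The curvature `K_PPB_PER_C2` and the choice of T0 = 60°C (the CPU temperature Achim reported, not the crystal temperature) are illustrative assumptions, not measurements of any real crystal.

```python
T0_C = 60.0            # ASSUMED turnover temperature, °C
K_PPB_PER_C2 = -5.0    # ASSUMED curvature, ppb/°C^2; negative so frequency rises toward T0

def freq_offset_ppb(temp_c):
    """Toy parabolic frequency offset relative to the value at T0."""
    return K_PPB_PER_C2 * (temp_c - T0_C) ** 2

def swing_ppb(temp_c, ripple_c=0.5):
    """Frequency change caused by a +/- ripple_c wobble around temp_c."""
    samples = [freq_offset_ppb(temp_c + d) for d in (-ripple_c, 0.0, ripple_c)]
    return max(samples) - min(samples)

# The same half-degree wobble matters less and less as we approach T0.
for temp in (45.0, 55.0, 60.0, 65.0):
    print(f"{temp:4.1f} °C: swing {swing_ppb(temp):7.2f} ppb")
```

In this toy model the same ±0.5°C wobble produces a far smaller frequency swing at T0 than it does 15°C below it, which is the whole point of heating the board.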
So What
Very interesting you may be thinking, but so what? You do not have a laboratory oven and your significant other will likely complain if you heat up your server room to 60°C.
But you only need to heat up the Raspberry Pi, not your room. Most people put heat sinks on their CPUs; Achim just put his in a bubble wrap mailing envelope. The results are immediately obvious and good.
Before 22:00Z the Raspberry Pi was open to room air. At 22:00Z the board was placed in the envelope. At about 05:00Z the CPU stabilized at 60°C.
As the temperature increased, the local time and frequency offsets stabilized. The envelope reduced time spikes by 30%. Elevating to 60°C made even more of an improvement.
If you do nothing else today, put your Raspberry Pi in a padded envelope.
ntpheat
Achim’s second good idea was to run a simple program on the Raspberry Pi to hold the CPU to 60°C. I have written a new program ntpheat and placed it in the contrib directory to replicate his program. You do not even need to be root to run it:
./ntpheat -t 60 &
The program will run in the background and try to hold your CPU at 60°C.
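The idea behind such a program can be sketched in a few lines: read the CPU temperature, burn CPU cycles while it is below the set point, and idle while it is above. This is only an illustration of the approach, not the ntpheat source. It assumes a Linux system that exposes the CPU sensor at /sys/class/thermal/thermal_zone0/temp in millidegrees Celsius. The function names and timing constants are made up.

```python
import time

ZONE0_PATH = "/sys/class/thermal/thermal_zone0/temp"  # ASSUMED sensor path

def read_zone0_c(path=ZONE0_PATH):
    """Return the CPU temperature in °C (the file holds millidegrees)."""
    with open(path) as handle:
        return int(handle.read().strip()) / 1000.0

def should_heat(temp_c, target_c):
    """Bang-bang decision: heat only while below the set point."""
    return temp_c < target_c

def burn(seconds):
    """Busy-loop for roughly `seconds` to turn CPU cycles into heat."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass

def hold_temperature(target_c, read_temp=read_zone0_c, step=0.1, max_steps=None):
    """Alternate between burning and sleeping to hover near target_c.

    Runs forever when max_steps is None. Returns the number of heating steps.
    """
    steps = 0
    heated = 0
    while max_steps is None or steps < max_steps:
        if should_heat(read_temp(), target_c):
            burn(step)
            heated += 1
        else:
            time.sleep(step)
        steps += 1
    return heated

# Demo with a fake sensor so the sketch runs anywhere:
# two readings below 60 °C, then two above.
fake = iter([58.0, 59.5, 60.5, 61.0])
heated_steps = hold_temperature(60.0, read_temp=lambda: next(fake),
                                step=0.01, max_steps=4)
print(heated_steps)  # prints 2
```

On a real Raspberry Pi you would call `hold_temperature(60.0)` with the default reader and let it run until interrupted. Each such loop can keep at most one core busy.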
Results
After some iteration, I added a second bubble pack envelope over the first one, and raised the ntpheat set point to 67°C.
The results of this experiment are clearly rewarding. The plot of time and frequency offsets:
The matching plot of Local Frequency Offset and ZONE0 temperature.
| | Min | 1% | 5% | 50% | 95% | 99% | Max |
|---|---|---|---|---|---|---|---|
| Local Clock Time Offset | -6.262 | -0.606 | -0.303 | 0.002 | 0.323 | 0.742 | 4.448 |
| Local Clock Frequency Offset | -262.000 | -246.000 | -239.000 | -188.000 | -172.000 | -164.000 | -149.000 |
| Temp ZONE0 | 66.604 | 66.604 | 66.604 | 67.142 | 67.679 | 67.679 | 73.060 |

| | 90% | 95% | StdDev | Mean | Units |
|---|---|---|---|---|---|
| Local Clock Time Offset | 0.626 | 1.348 | 0.306 | -0.000 | µs |
| Local Clock Frequency Offset | 67.000 | 82.000 | 20.602 | -195.016 | ppb |
| Temp ZONE0 | 1.075 | 1.075 | 0.512 | 67.086 | °C |
The CPU temperature is now held to a Standard Deviation of 0.512°C, the local time offset to 306 ns and the local frequency offset to 20 ppb. The frequency now only moves by 39 ppb/°C, and that is over the reduced temperature range.
That is an improvement of 2.5x in time and 25x in frequency! Amazing what a little bubble wrap can do.
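Those ratios follow from the StdDev values of the two sets of summary data, assuming the improvement is measured as a ratio of standard deviations. A quick sanity check, remembering that the first frequency figure is in ppm and the second in ppb:

```python
# StdDev values copied from the two sets of summary data.
time_before_us, time_after_us = 0.763, 0.306
freq_before_ppb = 0.513 * 1000.0   # 0.513 ppm expressed in ppb
freq_after_ppb = 20.602

time_gain = time_before_us / time_after_us
freq_gain = freq_before_ppb / freq_after_ppb
print(f"time: {time_gain:.1f}x, frequency: {freq_gain:.1f}x")
# prints "time: 2.5x, frequency: 24.9x"
```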
Limits
These results are pushing hard against the ultimate limits of the RasPi 3. The Local Clock Time Offsets are quantized at about 100 ns. This is a bit surprising as sys_fuzz is 937 ns. ntpd fuzzes the time stamps at 1/2 of sys_fuzz. KPPS is doing a good job.
The quantization is clear in the histogram below. You can also see that the distribution is not a 'normal distribution'.
Update
I have done more experimentation with ntpheat. Sometimes ntpheat cannot create enough heat to get the desired temperature rise over ambient. The obvious solution is to add more insulation around the RasPi, but that creates the opposite problem: the temperature will go too high when the host is under load.
The better solution is to run multiple copies of ntpheat. Each copy can use up to 100% of a single CPU core. Multiple copies seem to play well together and will be able to adjust to high load factors.
To make this easier there is now a -c option that specifies the number of copies to run. So this will run three copies:
./ntpheat -t 68 -c 3 &
Update 2
To fit in better with the NTPsec family of programs, the program name makeheat has been changed to ntpheat. It is still in the contrib/ directory and not installed by default.
Update 3
temp-log is now called ntplogtemp and is installed by default.
Next?
All this just raises more questions.
What is the easiest and fastest way to find the best temperature? Could ntpheat read ntpd logs and seek automatically?
The GPS temperature has also been raised, and the GPS also has a crystal. How much of this improvement is due to GPS heating?
The clock distribution is better than a 'normal distribution', but not symmetric like one. Is there a way for ntpd to throw away the worst 5% of measurements?
The frequency plot still shows the effects of room temperature. Is there an easy way to fix that? Time offset is no longer affected, so is it worth fixing?
Why have you not bubble wrapped your RasPi yet?