Poor man’s temperature monitoring for data closets

Arseni Mourzenko

Founder and lead developer

July 23, 2021

Tags: hardware

A few months ago, I explained how I replaced the fans in a server power supply unit in order to reduce the noise made by the server. Similarly, in another article, I described how to silence a UPS.

Changes which affect the cooling of any hardware present a risk of overheating. Most manufacturers ensure the hardware works correctly under specific temperature conditions with the original fans. Swapping those fans for different ones invalidates the tests performed by the manufacturer, as the replacements may have a lower air flow or a lower static pressure, and introduces a range of risks, from malfunctions to fire hazards.

In order to mitigate the risks, one needs at a minimum to keep an eye on the modified hardware. This article describes how I do it in the context of a home-grown data closet, and how I expect to do it in the future. It may be useful for people who host servers at home and who have also tampered with the cooling.

First things first: the two usual ways to monitor the environment of a piece of hardware are (1) data center environmental monitoring systems and (2) the metrics provided by the hardware itself (lm_sensors in Linux is one example; the SNMP metrics provided by a switch or a UPS are another).

Professional monitoring systems for data centers are possibly an excellent choice for actual data centers, but they have a drawback: their cost. When I was researching years ago whether I could afford a system like that, I was rather surprised by the high cost of every component. A simple DS18B20 one-wire temperature sensor that you can buy for about $1 on Chinese websites would cost you $45 if it comes as part of a professional monitoring system. It uses the very same DS18B20 sensor, but they make you pay an extra $44, because they can.

The metrics provided by lm_sensors or through SNMP have a huge benefit: they come at no additional cost. Or, should I say, you paid for them when you were buying the hardware. The drawback is that sometimes, they don't work well. I don't remember seeing correct results for AUXTIN (the metric which should indicate the temperature of the power supply unit), and some other metrics seem to be just broken for me. For instance, every SSD I tested was reporting a temperature exceeding 100 °C. If the metric were true, more people would use their SSDs to fry eggs.

Another concern I have when it comes to trusting a device to monitor its own temperature is that the hardware may not be in a state where it can effectively do that. For instance, a malfunctioning server can restart and, before Linux has a chance to launch lm_sensors, start for some reason to use 100% of the CPU, essentially converting all electrical energy into heat, just for fun. Without an external observer, such a server can do quite a lot of damage in a data center.

Therefore, I believe that in addition to lm_sensors, there should be an external, autonomous entity which monitors the environment and is capable of taking measures in unusual situations: something like the professional monitoring systems used in data centers, but much more affordable.

My approach for the past seven years has been to use a bunch of USB temperature sensors from PCsensor. A terrible choice, as the price of their sensors ranges from $16 to $21, for the very same DS18B20 in an ugly package. Also, since the devices lack even a basic unique identifier, there is no way to differentiate two devices of the same type on the same machine. This means that you're limited to two sensors per machine if you can get two different types of those sensors (the old one and the new one), or to only one otherwise. A much smarter choice is to get a bunch of one-wire temperature sensors at $1 each, and use them with an ATmega328P-based board (about $4 on Chinese websites), or connect them directly to a Raspberry Pi-like device (NanoPi devices are available for less than $20).
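As a rough sketch of the Raspberry Pi-like option: assuming a Linux board with the kernel one-wire drivers loaded, each DS18B20 usually shows up as a w1_slave file under /sys/bus/w1/devices/. The function names below are hypothetical, and the exact path may differ between boards and kernels.

```python
from pathlib import Path
from typing import Optional


def parse_w1_slave(text: str) -> Optional[float]:
    """Return the temperature in °C, or None if the data is unusable.

    The first line of a w1_slave file ends with YES when the CRC check
    passed; the second line ends with t=<temperature in 1/1000 °C>.
    """
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].strip().endswith("YES"):
        return None
    _, separator, raw = lines[1].rpartition("t=")
    if not separator:
        return None
    try:
        return int(raw) / 1000.0
    except ValueError:
        return None


def read_sensor(device_dir: Path) -> Optional[float]:
    """Read one sensor, e.g. /sys/bus/w1/devices/28-... (assumed path)."""
    return parse_w1_slave((device_dir / "w1_slave").read_text())
```

Returning None instead of raising keeps a single bad read from stopping the whole monitoring loop; the caller can simply skip that sample.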

One-wire temperature sensors can then be attached directly to the grid behind the fan, in order to measure the temperature of the air exiting the server. When writing this article, I noticed that I had been doing it all wrong for years. This is what it looks like:

Figure 1 The sensor is positioned and fixed incorrectly.

There are two problems here.

First, it blocks too much air. As the sensor is inside the metallic tip, there is absolutely no reason for the black part and the wire to be in front of the grid. Instead, only the tip should be there.

Second, I had to reattach the sensor before taking the photo: it appears that over time, the wire loosened, and the sensor drifted down to the middle of the fan, where it doesn't get much air. This makes the measurements less relevant. Another important issue is that a loose wire can find its way inside the fan, and either block it (which would trigger an alert), or slow it down, progressively damaging it. A correct way to fasten the sensor would be to use a 2.5 mm nylon zip tie.

While this approach works for large fans (80×80 mm or larger), it doesn't work very well for 40×40 mm fans. The sensor is simply too large for such small fans, and would block too much air even if it were fastened correctly. In this case, the sensor should be positioned at a distance from the fan. Here's mine, with the two power supply units at a distance of around 5 cm from the sensor.

Figure 2 The sensor positioned so that it does not block the air exiting the small 40×40 fans.

Note that as the distance from the grid to the sensor increases, the relevance of the measurements decreases. The sensor is affected less by the temperature of the air exiting the power supply units, and more by the ambient air, including the neighboring hardware.

Sensors other than the DS18B20 can also be used to monitor the environment. Recently, I became a huge fan of the AHT21 sensor. Equally inexpensive ($1 each), these sensors are slightly more accurate than the DS18B20 (±0.3 °C vs. ±0.5 °C) and measure humidity as well. One drawback is that they are I2C sensors, and the data sheet says nothing about the possibility of changing their address. This means that a multiplexer is needed to use multiple sensors with the same Arduino device. With the DS18B20, there is no such issue: since each DS18B20 has a unique 64-bit serial code, multiple devices can be connected in parallel on the same bus.
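To illustrate how the serial code helps when several DS18B20 sensors share a bus, here is a minimal sketch for a Linux board, assuming the kernel one-wire driver exposes each sensor as a directory whose name starts with the DS18B20 family code 28. The sensor IDs and the location labels are made up for the example.

```python
from pathlib import Path
from typing import Dict

DS18B20_FAMILY = "28"  # family code prefix of DS18B20 entries in the w1 sysfs tree


def list_ds18b20(base_dir: Path) -> Dict[str, Path]:
    """Map each DS18B20 ID found under base_dir to its directory."""
    return {
        entry.name: entry
        for entry in sorted(base_dir.glob(DS18B20_FAMILY + "-*"))
        if entry.is_dir()
    }


# Hypothetical mapping from sensor IDs to the place where each one is attached.
LOCATIONS = {
    "28-0000075a1b2c": "server exhaust",
    "28-0000075a1b2d": "PSU exhaust",
}


def label_for(sensor_id: str) -> str:
    """Return a human-readable label, falling back to the raw ID."""
    return LOCATIONS.get(sensor_id, "unlabeled " + sensor_id)
```

On a real board, base_dir would typically be Path("/sys/bus/w1/devices"). Since the ID is burned into the sensor, the mapping survives reboots and rewiring, which is exactly what the USB sensors mentioned earlier cannot offer.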

Once the measurements are performed, one needs to decide what to do with them. The system I have now is too primitive: it just logs them, with the possibility to see them through a web page. This alone is not completely useless: for instance, I noticed once that the server temperature had increased for no clear reason. It turned out that the cause was a large box I had put in front of the server and forgotten to remove. As the box was blocking the air flow, the temperature increased.

However, a more useful system should also be able to react to unexpected events. It may alert me by sending an SMS message (easy to do with Amazon SNS) or by beeping, but it can also be proactive and work in an unattended way, for instance by asking the server to reduce the number of active cores by doing something like echo 0 > /sys/devices/system/cpu/cpu3/online if the temperature gets too high. Or, if the temperature gets even higher, it can instruct the server to shut down, and ultimately order the PDU to power it off.
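Such an escalation could be sketched as follows. The thresholds are purely illustrative assumptions, not recommendations, and the names are hypothetical. The last function is the Python counterpart of the echo command above; it needs root privileges and has to run on the server itself, so an external monitor would have to trigger it remotely.

```python
from enum import IntEnum
from pathlib import Path


class Action(IntEnum):
    NONE = 0
    ALERT = 1          # SMS or beep
    REDUCE_CORES = 2   # take some CPU cores offline
    SHUT_DOWN = 3      # ask the server to shut down
    POWER_OFF = 4      # order the PDU to cut the power


# Hypothetical thresholds in °C, checked from the most to the least severe.
THRESHOLDS = [
    (60.0, Action.POWER_OFF),
    (55.0, Action.SHUT_DOWN),
    (50.0, Action.REDUCE_CORES),
    (45.0, Action.ALERT),
]


def decide(temperature: float) -> Action:
    """Return the most severe action whose threshold is reached."""
    for limit, action in THRESHOLDS:
        if temperature >= limit:
            return action
    return Action.NONE


def set_core_online(
    core: int,
    online: bool,
    sysfs_root: Path = Path("/sys/devices/system/cpu"),
) -> None:
    """Write 1 or 0 to cpu<core>/online, like the echo command does."""
    (sysfs_root / f"cpu{core}" / "online").write_text("1" if online else "0")
```

Keeping the decision in a pure function makes it easy to test the policy without touching any hardware.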

Naturally, such measures should be defined carefully. A common scenario, for instance, is when the server starts to heat up because it has to process a lot of HTTP requests, whether because of a DDoS attack or legitimate usage. If the server is shut down, chances are that the next requests will be redirected to other servers, which will in turn overheat, be turned off, and leave the remaining servers to deal with even more work. Automated systems are risky by nature, and at a larger scale, the cascading effect only increases.

One common concern is that the monitoring system itself can malfunction, and more than once I've heard colleagues mention that they are afraid that the sensors could misbehave. In other words, if a sensor starts reporting that the temperature of the PSU jumped in a matter of a second from 34 °C to 217 °C, this may not be exactly the case where you would like it to shut down your server. In seven years, I have never seen a situation like that. There are sensors which are known for drifting over time. Pollution sensors, for instance, are notorious for that, for the obvious reason of dust accumulating inside. I don't know if the DS18B20 drifts, but mine seem to remain pretty stable, although I don't have any means to notice a change of one degree over the years.
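For those who do worry about a single absurd reading, one possible safeguard (an assumption on my side, not something the current setup does) is to act only when several consecutive readings exceed the threshold. The class name and the default of three readings are hypothetical.

```python
class ConsecutiveTrigger:
    """Fires only after `required` consecutive readings at or above `threshold`."""

    def __init__(self, threshold: float, required: int = 3) -> None:
        self.threshold = threshold
        self.required = required
        self.count = 0

    def update(self, reading: float) -> bool:
        """Record a reading and tell whether the action should be taken now."""
        # Any reading below the threshold resets the streak.
        self.count = self.count + 1 if reading >= self.threshold else 0
        return self.count >= self.required


trigger = ConsecutiveTrigger(threshold=55.0, required=3)
for reading in (34.0, 217.0, 34.0, 35.0):
    # The isolated 217 °C spike never produces three hits in a row.
    assert trigger.update(reading) is False
```

The price is a slower reaction: with one reading per second and three required readings, a genuine overheating is acted upon about three seconds after it crosses the threshold.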

What I did encounter more than once is a situation where the whole monitoring system would just stop doing its job, either because of an exception in the code I wrote, or because I inadvertently unplugged it. It happens occasionally, and it needs to be handled properly. One way to do this is to have an additional system watching the monitoring one. Another possibility is to use redundancy, so instead of one sensor, there would be two of them. I don't believe the second alternative is cost-effective, nor is it particularly useful. Moreover, one-wire temperature sensors are quite difficult to put in place, and having to do it twice would be even harder. The first solution, that is, another system keeping an eye on the monitoring system, looks much more interesting to me, given that it can take multiple forms. If I'm at home, a monitoring screen can trigger an alert if it doesn't get any metrics from the monitoring system. If I'm not at home, a cron job on one of the servers could do the same, and once it misses a few events, it can send me an SMS.
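The check such a cron job would perform could look like the sketch below. It assumes the monitoring system records the time of its last metric somewhere the job can read it; the function names, the one-minute interval and the tolerance of three missed intervals are all hypothetical.

```python
from datetime import datetime, timedelta


def elapsed_intervals(last_seen: datetime, now: datetime, interval: timedelta) -> int:
    """Number of whole reporting intervals elapsed since the last metric."""
    if now <= last_seen:
        return 0
    return int((now - last_seen) / interval)


def should_alert(
    last_seen: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=1),
    allowed_missed: int = 3,
) -> bool:
    """True once more than `allowed_missed` intervals passed without a metric."""
    return elapsed_intervals(last_seen, now, interval) > allowed_missed
```

Passing the current time as a parameter rather than calling datetime.now() inside the function keeps the check deterministic and easy to test. Sending the SMS itself is left out on purpose.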