Poor man’s temperature monitoring for data closets

Arseni Mourzenko
Founder and lead developer
July 23, 2021
Tags: hardware

A few months ago, I explained how I replaced the fans in a server power supply unit in order to reduce the noise made by the server. Similarly, in another article, I described how to silence a UPS.

Changes which affect the cooling of any hardware present a risk of overheating. Most manufacturers ensure the hardware works correctly under specific temperature conditions with the original fans. Swapping those fans for different ones invalidates the tests performed by the manufacturer, as the replacements may have a lower air flow or a lower static pressure. This presents a range of risks, from malfunctions to fire hazards.

In order to mitigate the risks, one needs at a minimum to keep an eye on the modified hardware. This article describes how I do it in the context of a home-grown data closet, and how I expect to do it in the future. It may be useful for people who host servers at home and who have also tampered with the cooling.

First things first: the two usual ways to monitor the environment of a piece of hardware are (1) data center environmental monitoring systems and (2) the metrics provided by the hardware itself (lm_sensors in Linux is one example; SNMP metrics provided by a switch or a UPS are another).

Professional monitoring systems for data centers are possibly an excellent choice for actual data centers, but have a drawback: their cost. When I was researching years ago whether I could afford a system like that, I was rather surprised by the high cost of every component. A simple DS18B20 one-wire temperature sensor that you buy for about $1 on Chinese websites would cost you $45 if it comes as part of a professional monitoring system. It uses the very same DS18B20 sensor, but they make you pay an extra $44, because they can.

The metrics provided by lm_sensors or through SNMP have a huge benefit: they come at no additional cost. Or, should I say, you paid for them when you were buying the hardware. The drawback is that sometimes they don't work well. I don't remember seeing correct results for AUXTIN, the metric which should indicate the temperature of the power supply unit, and some other metrics seem to be just broken for me. For instance, every SSD I tested was reporting a temperature exceeding 100 °C. If the metric were true, more people would use their SSDs to fry eggs.
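One way to cope with broken metrics is to filter out readings that fall outside a plausible range before trusting them. The sketch below assumes the JSON layout printed by `sensors -j` (available in recent lm_sensors releases); the chip name, the sample values and the 0 to 100 °C bounds are made up for illustration.

```python
import json

# Bounds in °C considered plausible. These are assumptions for illustration,
# not limits taken from any vendor documentation.
PLAUSIBLE_MIN_C = 0.0
PLAUSIBLE_MAX_C = 100.0


def extract_temperatures(sensors_json: str) -> dict:
    """Return {"chip/feature": value} for every temp*_input field."""
    readings = {}
    for chip, features in json.loads(sensors_json).items():
        for feature, fields in features.items():
            if not isinstance(fields, dict):
                continue  # e.g. the "Adapter" entry is a plain string
            for key, value in fields.items():
                if key.startswith("temp") and key.endswith("_input"):
                    readings[f"{chip}/{feature}"] = float(value)
    return readings


def split_plausible(readings: dict):
    """Separate readings into (plausible, suspicious) dictionaries."""
    ok, suspicious = {}, {}
    for name, value in readings.items():
        in_range = PLAUSIBLE_MIN_C <= value <= PLAUSIBLE_MAX_C
        (ok if in_range else suspicious)[name] = value
    return ok, suspicious


# Sample shaped like `sensors -j` output; chip name and values are invented.
SAMPLE = """
{
  "nct6776-isa-0290": {
    "Adapter": "ISA adapter",
    "SYSTIN": {"temp1_input": 34.0, "temp1_max": 80.0},
    "AUXTIN": {"temp3_input": -128.0}
  }
}
"""

ok, suspicious = split_plausible(extract_temperatures(SAMPLE))
print("plausible:", ok)
print("suspicious:", suspicious)
```

On a real machine, the sample string could be replaced by the output of `subprocess.run(["sensors", "-j"], capture_output=True, text=True).stdout`. Suspicious readings are kept rather than discarded, so they can be logged and investigated.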

Another concern I have when it comes to trusting a device to monitor its own temperature is that the piece of hardware may not be in a state where it can effectively do that. For instance, a malfunctioning server can restart, and before Linux has a chance to launch lm_sensors, it may for some reason start to use 100% of the CPU power, essentially converting all electrical energy into heat, just for fun. Without an external observer, such a server can do quite a lot of damage in a data center.

Therefore, I believe that in addition to lm_sensors, there should be an external, autonomous entity which monitors the environment and is capable of taking measures in unusual situations: something like the professional monitoring systems used in data centers, but much more affordable.

My approach for the past seven years has been to use a bunch of USB temperature sensors from PCsensor. A terrible choice, as the price of their sensors ranges from $16 to $21 for the very same DS18B20 in an ugly package. Also, since the devices lack even a basic unique identifier, there is no way to differentiate two devices of the same type on the same machine. This means that you're limited to two sensors per machine if you can get two different types of those sensors (the old one and the new one), or to only one otherwise. A much smarter choice is to get a bunch of one-wire temperature sensors at $1 each, and use them with an ATmega328P-based board (about $4 on Chinese websites), or connect them directly to a Raspberry Pi-like device (NanoPi devices are available for less than $20).

One-wire temperature sensors can then be attached directly to the grid behind the fan, in order to measure the temperature of the air exiting the server. When writing this article, I noticed that I had been doing it all wrong for years. This is what it looks like:

Fig­ure 1 The sen­sor is po­si­tioned and fixed in­cor­rect­ly.

There are two prob­lems here.

First, it blocks too much air. As the sen­sor is in­side the metal­lic tip, there is ab­solute­ly no rea­son for the black part and the wire to be in front of the grid. In­stead, only the tip should be there.

Second, I had to reattach the sensor before taking the photo: it appears that over time, the wire loosened, and the sensor drifted away, down to the middle of the fan, where it doesn't get much air. This makes the measurements less relevant. Another important issue is that a loose wire can find its way inside the fan, and either block it (which would trigger an alert), or slow it down, progressively damaging it. A correct way to fasten the sensor would be to use a 2.5 mm nylon zip tie.

While this approach works for large fans (80×80 mm or larger), it doesn't work very well for 40×40 mm fans. The sensor is simply too large for them and would block too much air, even if it is fastened correctly. In this case, the sensor should be positioned at a distance from the fan. Here's mine, with the two power supply units at a distance of around 5 cm from the sensor.

Figure 2 The sensor positioned so that it doesn't block the air exiting the small 40×40 mm fans.

Note that as the distance from the grid to the sensor increases, the relevance of the measurements decreases. The sensor is affected less by the temperature of the air exiting the power supply units, and more by the ambient air, including neighboring hardware.

Sensors other than DS18B20 can also be used to monitor the environment. Recently, I became a huge fan of the AHT21 sensor. Equally inexpensive ($1 each), these sensors are slightly more accurate than DS18B20 (±0.3 °C vs. ±0.5 °C) and measure humidity as well. One drawback is that they are I2C sensors, and the data sheet says nothing about the possibility of changing their address. This means that a multiplexer is needed to use multiple sensors with the same Arduino device. With DS18B20, there is no such issue: since each DS18B20 has a unique 64-bit serial code, multiple devices can be connected in parallel on the same bus.
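To illustrate how the serial code tells sensors apart, here is a minimal sketch for the case where the DS18B20 sensors are wired directly to a Raspberry Pi-like board. It assumes a Linux system with the one-wire kernel drivers (w1-gpio and w1-therm) loaded, which expose each sensor as a directory named after its ID under /sys/bus/w1/devices. That setup is an assumption, and the sample text only mimics the format of the w1_slave file.

```python
from pathlib import Path

# Location used by the Linux one-wire driver (assumed setup, see above).
W1_DEVICES = Path("/sys/bus/w1/devices")


def parse_w1_slave(text: str):
    """Return the temperature in °C, or None if the CRC failed or the data is malformed."""
    lines = text.strip().splitlines()
    # The first line ends with YES when the CRC check succeeded.
    if len(lines) < 2 or not lines[0].rstrip().endswith("YES"):
        return None
    # The second line ends with t=<temperature in thousandths of a degree>.
    _, separator, raw = lines[1].rpartition("t=")
    if not separator:
        return None
    try:
        return int(raw) / 1000.0
    except ValueError:
        return None


def read_all_sensors(base: Path = W1_DEVICES) -> dict:
    """Map each DS18B20 ID (family code 28) to its current reading."""
    readings = {}
    for device in sorted(base.glob("28-*")):
        readings[device.name] = parse_w1_slave((device / "w1_slave").read_text())
    return readings


# Sample in the format of a w1_slave file.
SAMPLE = (
    "72 01 4b 46 7f ff 0e 10 57 : crc=57 YES\n"
    "72 01 4b 46 7f ff 0e 10 57 t=23125\n"
)
print(parse_w1_slave(SAMPLE))  # 23.125
```

Because the directory name contains the sensor's serial code, a small lookup table from ID to location (for instance "PSU exhaust") is enough to keep track of which sensor is attached where.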

Once the measurements are taken, one needs to decide what to do with them. The system I have now is too primitive: it just logs them, with the possibility to see them through a web page. This alone is not completely useless: for instance, I noticed once that the server temperature increased for no clear reason. It turned out that the cause was a large box I had put in front of the server and forgot to remove. As the box was blocking the air flow, the temperature increased.

However, a more useful system should also be able to react to unexpected events. It may alert me by sending an SMS message (easy to do with Amazon SNS) or beeping, but it can also be proactive and work in an unattended way, for instance by asking the server to reduce the number of active cores by doing something like echo 0 > /sys/devices/system/cpu/cpu3/online if the temperature gets too high. If the temperature gets even higher, it can instruct the server to shut down, and finally order the PDU to power it off.
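The escalation described above could be sketched as a simple mapping from temperature to action. The thresholds and the action names are hypothetical and would need to be tuned for the actual hardware; only the sysfs path comes from the text.

```python
from enum import IntEnum
from pathlib import Path


class Action(IntEnum):
    NONE = 0
    ALERT = 1          # SMS or beep
    REDUCE_CORES = 2   # take some CPU cores offline
    SHUT_DOWN = 3      # ask the server to shut down
    POWER_OFF = 4      # order the PDU to cut the power


# Hypothetical thresholds in °C, checked from the most to the least severe.
THRESHOLDS = [
    (60.0, Action.POWER_OFF),
    (55.0, Action.SHUT_DOWN),
    (50.0, Action.REDUCE_CORES),
    (45.0, Action.ALERT),
]


def decide_action(temperature_c: float) -> Action:
    """Return the most severe action whose threshold is reached."""
    for limit, action in THRESHOLDS:
        if temperature_c >= limit:
            return action
    return Action.NONE


def set_core_online(core: int, online: bool, sysfs_root: Path = Path("/sys")) -> None:
    """Equivalent of `echo 0 > /sys/devices/system/cpu/cpuN/online`; requires root."""
    path = sysfs_root / "devices/system/cpu" / f"cpu{core}" / "online"
    path.write_text("1" if online else "0")


for temperature in (30.0, 47.0, 52.0, 58.0, 75.0):
    print(temperature, decide_action(temperature).name)
```

Keeping the decision separate from the code that executes it makes the policy easy to test without touching a real server. Note that on many systems cpu0 cannot be taken offline, so a real implementation would start with the higher-numbered cores.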

Naturally, such measures should be defined carefully. A common scenario, for instance, is when the server starts to heat up because it has to process a lot of HTTP requests, whether from a DDoS attack or legitimate usage. If the server is shut down, chances are that the next requests will be redirected to other servers, which will in turn overheat, be turned off, and leave the remaining servers to deal with even more work. Automated systems are risky by nature, and at larger scale, the cascading effect only increases.

One common concern is that the monitoring system itself can malfunction, and I've heard colleagues mention more than once that they are afraid the sensors could misbehave. In other words, if the sensor starts reporting that the temperature of the PSU jumped in a matter of a second from 34 °C to 217 °C, this may not be exactly the case where you would like it to shut down your server. In seven years, I have never seen a situation like that. There are sensors which are known for drifting over time. Pollution sensors, for instance, are notorious for that, for the obvious reason of dust accumulating inside. I don't know if DS18B20 drifts, but mine seem to remain pretty stable, although I don't have any means to notice a change of one degree over the years.
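For those who do worry about a misbehaving sensor, one possible safeguard is to distrust readings that change faster than air temperature plausibly could, unless several consecutive readings confirm the change. The sketch below is one way to do it; the maximum rate and the number of confirmations are arbitrary assumptions.

```python
# Hypothetical limit: exhaust air is assumed not to change faster than this.
MAX_RATE_C_PER_S = 2.0


class PlausibilityFilter:
    """Rejects sudden jumps unless they persist for several consecutive readings."""

    def __init__(self, max_rate=MAX_RATE_C_PER_S, confirmations=3):
        self.max_rate = max_rate
        self.confirmations = confirmations
        self.last_value = None   # last trusted reading
        self.last_time = None    # timestamp of the last trusted reading
        self.rejected = 0        # consecutive rejected readings

    def accept(self, value: float, timestamp: float) -> bool:
        """Return True if the reading should be trusted."""
        if self.last_value is not None:
            elapsed = max(timestamp - self.last_time, 1e-9)
            rate = abs(value - self.last_value) / elapsed
            if rate > self.max_rate and self.rejected + 1 < self.confirmations:
                self.rejected += 1
                return False
        # Either the change is plausible, or it has been confirmed often enough.
        self.last_value, self.last_time, self.rejected = value, timestamp, 0
        return True


sensor_filter = PlausibilityFilter()
for second, reading in enumerate([34.0, 217.0, 34.2, 34.1]):
    print(second, reading, sensor_filter.accept(reading, float(second)))
```

With these settings, a single 217 °C glitch is ignored, while a reading that stays high for three consecutive measurements is eventually accepted, so a real and sudden overheating is delayed rather than hidden.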

What I did encounter more than once is a situation where the whole monitoring system would just stop doing its job, either because of an exception in the code I wrote, or because I inadvertently unplugged it. It happens occasionally, and it needs to be handled properly. One way to do this is to have an additional system watching the monitoring one. Another possibility is to use redundancy, so instead of one sensor, there would be two of them. I don't believe the second alternative is cost-effective, nor is it particularly useful. Moreover, one-wire temperature sensors are quite difficult to put in place, and having to do it twice would be even harder. The first solution, that is, another system keeping an eye on the monitoring system, looks much more interesting to me, given that it can take multiple forms. If I'm at home, a monitoring screen can trigger an alert if it doesn't get any metrics from the monitoring system. If I'm not at home, a cron job on one of the servers could do the same and, once it misses a few events, send me an SMS.
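The cron job variant could look like the following sketch. It assumes that the monitoring system appends to a log file at a regular interval, so that the file's modification time tells when the last metric arrived. The interval, the number of tolerated misses and the log path in the usage note are all hypothetical.

```python
import time
from pathlib import Path


def last_metric_time(log_path: Path) -> float:
    """Use the log file's modification time as the time of the last metric."""
    return log_path.stat().st_mtime


def missed_reports(last_time: float, now: float, interval_s: float = 60.0) -> int:
    """Number of expected reports that have not arrived since the last one."""
    return max(0, int((now - last_time) // interval_s))


def should_alert(last_time: float, now: float,
                 interval_s: float = 60.0, allowed_misses: int = 3) -> bool:
    """True once at least `allowed_misses` consecutive reports are missing."""
    return missed_reports(last_time, now, interval_s) >= allowed_misses


# Example with fixed timestamps: the last metric arrived at t = 1000 s.
for now in (1059.0, 1179.0, 1185.0):
    print(now, missed_reports(1000.0, now), should_alert(1000.0, now))
```

A script run from cron would call `should_alert(last_metric_time(Path("/var/log/temperature.log")), time.time())` and send the SMS when the result is true. Since the check is stateless, it doesn't matter that each cron run starts a fresh process.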