Servers monitoring UI proposal

Arseni Mourzenko
Founder and lead developer
June 28, 2015
Tags: user-experience, interaction-design, productivity

Software development is a young industry, and most of its domains are filled with applications built by developers who gave no thought to end users. Bulletin boards, for instance, were an excellent example of terrible, unacceptable user experience until Discourse was released.

Server monitoring is another domain where the lack of attention to user experience, and the lack of creativity from the people who develop monitoring software, is rather impressive. This is not their fault: most of these products were developed under time pressure and on extremely low budgets; I can imagine that many of them were drafted by system administrators during the few free hours per week they could spare from their regular work.

Products such as Nagios have been completely unable to evolve their user experience over time. They still present themselves as simple lists with a bit of color, the only slightly graphical elements being charts. This makes them difficult to work with, and makes it practically impossible to visualize information properly.

This constrains the usage of those products to active, periodic checking. In other words, in most companies, the people in charge of monitoring check the status of the servers once or twice per day. The rest of the time, they do something else and rely exclusively on alerts, in the form of emails sent to them by the monitoring software.

Thus, passive constant checking doesn't exist. What I mean is that the people in charge of monitoring don't have dedicated displays which show in near real time what is happening. Without proper visualization capability, we lose the opportunity to easily spot a problem before it starts affecting the infrastructure.

Bub­bles con­cept

Welcome to the concept of bubbles. Instead of being represented as a line in a list, each machine corresponds to several bubbles placed close to each other.

The fol­low­ing draft shows the con­cept ap­plied to a sam­ple ma­chine:

If we assume that CPU and memory usage are the most important metrics for this particular machine, those two entries will appear twice as large as the others. The concept can be pushed even further with dynamically sized bubbles: when the system thinks that a reported metric may indicate that something is not right, the bubble may grow on its own to attract the attention of the personnel.
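The sizing rule described in the previous paragraph can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the proposal: the `Bubble` fields, the `warn_at` level, the 40-pixel base and the 1.5x growth cap are all hypothetical, and the factor of two is interpreted here as twice the diameter.

```python
from dataclasses import dataclass


@dataclass
class Bubble:
    metric: str     # e.g. "cpu" or "memory"
    value: float    # current reading, normalized to the range 0.0-1.0
    weight: float   # static importance: 2.0 renders twice as large as 1.0
    warn_at: float  # hypothetical level at which the bubble starts growing


def bubble_diameter(bubble: Bubble, base: float = 40.0, max_growth: float = 1.5) -> float:
    """Return the diameter in pixels for a bubble.

    The static weight sets the resting size. Once the value passes warn_at,
    the bubble grows linearly, reaching max_growth times its resting size
    when the value hits 1.0.
    """
    resting = base * bubble.weight
    value = min(max(bubble.value, 0.0), 1.0)  # clamp out-of-range readings
    if value <= bubble.warn_at:
        return resting
    # Fraction of the way from the warning level to full scale.
    excess = (value - bubble.warn_at) / (1.0 - bubble.warn_at)
    return resting * (1.0 + (max_growth - 1.0) * excess)


if __name__ == "__main__":
    for b in (
        Bubble("cpu", 0.30, weight=2.0, warn_at=0.80),   # important, calm
        Bubble("disk", 0.30, weight=1.0, warn_at=0.80),  # ordinary, calm
        Bubble("cpu", 0.95, weight=2.0, warn_at=0.80),   # important, growing
    ):
        print(b.metric, b.value, round(bubble_diameter(b), 1))
```

Linear growth is chosen only for simplicity; a real implementation might prefer easing or animation so that the change in size is noticeable without being jarring.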

Every bub­ble con­tains:

Absolute values can be obtained by hovering the cursor over a bubble. Additional information may (and probably will) be displayed as well.

The bubble-based layout ensures that the overall picture of the infrastructure can be taken in visually, given enough monitors. At a glance, anyone can determine whether there is a problem and, if so, locate it with ease.

Ag­gre­ga­tion

Simply displaying individual information for every machine is not enough. This approach works well when managing a few hundred servers, but how would anyone visualize information arriving from thousands of servers?

Existing products do nothing to solve this issue, using only the most elementary indicators, such as the number of servers which report errors. However, solving it is not particularly difficult.

When it comes to visualizing a single metric, such as swap usage, a simple visualization could look like this:

Every machine corresponds to a dot; the size of the dot is proportional to the metric, and the red color indicates that a threshold (for this specific metric on this specific machine) was reached. This makes it possible to view thousands of elements at once and to spot how many are problematic.
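As a rough sketch of that mapping, the following Python turns one reading per machine into a dot. The names `Dot`, `to_dot` and `count_problematic`, the machine names, the gray color for healthy machines and the 6-pixel maximum radius are all assumptions made for the example; the proposal itself only states that size follows the metric and that red marks a reached threshold.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Dot:
    machine: str
    radius: float  # pixels, proportional to the metric
    color: str     # "red" once the per-machine threshold is reached


def to_dot(machine: str, value: float, threshold: float, max_radius: float = 6.0) -> Dot:
    """Map a normalized reading (0.0-1.0) to a dot for a single machine."""
    ratio = min(max(value, 0.0), 1.0)  # clamp so the dot never exceeds max_radius
    color = "red" if value >= threshold else "gray"
    return Dot(machine, max_radius * ratio, color)


def count_problematic(readings: Iterable[Tuple[str, float, float]]) -> int:
    """Count machines whose reading reached their own threshold."""
    return sum(1 for name, value, threshold in readings
               if to_dot(name, value, threshold).color == "red")


if __name__ == "__main__":
    # Hypothetical swap usage readings: (machine, value, threshold).
    swap = [("web-01", 0.10, 0.50), ("web-02", 0.65, 0.50), ("db-01", 0.40, 0.80)]
    for name, value, threshold in swap:
        print(to_dot(name, value, threshold))
    print("problematic:", count_problematic(swap))
```

Note that the threshold is passed per reading rather than fixed globally, since the text specifies a threshold for each metric on each machine.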

But the primary goal, when managing thousands of servers, is not to view a single type of metric for thousands of servers, but rather to view all the metrics. This is where aggregation comes in. Two or more machines are grouped together based on a shared attribute, such as their type, to appear as a monolithic entity which uses the same space as a single machine did before aggregation. The bubbles can then be customized to allow different types of aggregation: a maximum, a minimum, an average value, or separate values shown within the same bubble. The following image illustrates such an aggregation of five machines, with the CPU bubble showing separate measures.
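The grouping step might be sketched as follows. The dictionary-based machine records, the `aggregate` function and the mode names are hypothetical and chosen for the example; the four modes mirror the maximum, minimum, average and separate values mentioned in the paragraph.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Callable, Dict, List

# One reducer per aggregation mode; "separate" keeps every value so that a
# single bubble can render them side by side.
AGGREGATORS: Dict[str, Callable[[List[float]], Any]] = {
    "max": max,
    "min": min,
    "average": mean,
    "separate": list,
}


def aggregate(machines: List[dict], group_key: str, metric: str, mode: str) -> Dict[str, Any]:
    """Group machines by a shared attribute and reduce one metric per group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for machine in machines:
        groups[machine[group_key]].append(machine[metric])
    reducer = AGGREGATORS[mode]
    return {group: reducer(values) for group, values in groups.items()}


if __name__ == "__main__":
    # Five hypothetical machines of the same type, as in the illustration.
    machines = [
        {"name": f"db-failover-{i}", "type": "db-failover", "cpu": cpu}
        for i, cpu in enumerate([0.2, 0.4, 0.9, 0.1, 0.4], start=1)
    ]
    for mode in AGGREGATORS:
        print(mode, aggregate(machines, "type", "cpu", mode))
```

Which mode fits depends on the metric: a maximum surfaces the single worst machine, whereas an average can hide one overloaded server among many idle ones.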

Aggregation is a powerful tool which makes it possible to get real-time data from thousands of servers. For instance, machines which serve as database failover for applications of a similar type can all be aggregated, since their usage is similar. In the same way, servers which process data in a MapReduce scenario can be aggregated as well.

I explained how the concept of bubbles helps in visualizing metrics gathered from dozens of servers as well as from thousands. This user interface proposal can be embedded into existing products, making it possible for system administrators to respond faster thanks to better tools which give them the right amount of data in a very natural way.

For those of you who work closely with server monitoring tools, feel free to talk to the companies that develop those tools to see how they can benefit from the concepts of bubbles and aggregation.