Measuring quality like a pro

Arseni Mourzenko
Founder and lead developer
April 1, 2021
Tags: quality

My previous article left some readers wondering what a correct approach to measuring things looks like. I'll use this as an opportunity to show how I worked as a consultant. This article retraces a consulting engagement from 2016 with a small French company (about thirty people) developing a software product used in a very narrow field with very few customers. The company shared this market with a competitor of roughly the same level of expertise. The competitor had entered the market three years earlier, and over time “stole” a significant number of customers.

The founder of this company came to me with a major concern: there were too many bugs and regressions in the software product. While everyone understands, intuitively, that bugs and regressions are bad, the mere presence of bugs and regressions is not a useful metric. One company may have lots of them and remain successful; another may have only a few, yet those few severely impact the business. The goal of the first meeting was to clarify what the boss actually meant by “too many bugs and regressions.” The meeting revealed four issues caused by the bugs and regressions in production, each with a direct financial impact:

The fact that it took one day to deploy to production surprised me, so we went to see the team manager for a session of five whys:

Not the most constructive five whys, but it seems that the number of bugs in production was not the only problem in this company. Spending one day doing an error-prone procedure by hand is not particularly enjoyable. There is a very concrete risk of making errors, and errors were made regularly. When I asked how many deployments the team did per year, the manager said that it should be about sixty. At one day per deployment, that's about sixty working days a year, or roughly three man-months: enough to justify hiring an expert for several months to automate the thing.
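The three man-months figure is simple arithmetic, and it is worth making it explicit when arguing for automation. Here is a minimal sketch, assuming about twenty working days per month (my assumption; the only inputs from the story are sixty deployments per year at one day each). The function name is invented for the illustration.

```python
def yearly_deployment_cost(deployments_per_year: int,
                           days_per_deployment: float,
                           working_days_per_month: float = 20.0) -> float:
    """Return the yearly cost of manual deployments, in man-months.

    working_days_per_month is an assumed conversion factor, not a measured one.
    """
    man_days = deployments_per_year * days_per_deployment
    return man_days / working_days_per_month


cost = yearly_deployment_cost(60, 1.0)
print(f"{cost:.1f} man-months per year")
```

Changing the conversion factor or the deployment count shows how quickly the budget available for automation grows with the release frequency.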

Soon, we start­ed work­ing on a se­ries of met­rics:

Next, it was time to talk with the developers. It was clear that something was wrong with the number of bugs reaching production, but we still had to find out why.

Es­sen­tial­ly, cus­tomers were leav­ing for the com­peti­tor be­cause of the bugs, and the bugs ex­ist­ed in the first place be­cause the cus­tomers were leav­ing for the com­peti­tor. This cir­cu­lar re­la­tion had to be bro­ken at some point.

One immediate consequence was that the Gantt chart, especially the one comparing the company with the competitor, had to be hidden from the developers, as it would only make things worse. The only effect of such a visualization would be to increase stress, and with it the number of bugs, while slowing down development: the opposite of what we wanted.

Next, stakeholders were asked to stop talking with the developers about the competitor: such talk has absolutely no value. I took time to explain to the stakeholders that features, and the time it takes to deliver them, are not the only things that matter. For them, these were crucial. But comparing their priorities with the concerns of the boss exposed the discrepancy, which forced them to adjust their priorities accordingly and put more focus on stability.

Two months later, I came back to the company to see how things were going. Over the two months, the number of bug tickets had decreased slightly, but nothing impressive. Two sprints per month meant only four data points, which was too few to be representative. Talking with the team members, I had the impression that nothing had actually changed. So we held another five whys session. We took a ticket which had been solved the day before. The bug was a wrong conversion applied to a given value, which affected most of the customers quite critically. The unit tests were unable to catch the issue, because the developer had introduced the very same conversion error in the tests as well. The product owner hadn't noticed the problem either, and it was only after a call from an angry customer that the team found out the conversion was wrong.

Here we were, back to the original problem we had two months earlier. The stakeholders were still emphasizing the importance of releasing early, despite the fact that the boss had clearly stated what was important for the business. They were still talking to the developers about the competitor, instead of focusing on the number of bugs reported by support. This time, to encourage a change in this behavior, I installed a monitor in the space where the developers worked. It showed the number of bugs in the current sprint, a comparison with the average for the last ten sprints, and a chart of the evolution over a year. This was visual enough to change the focus.
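The numbers behind such a monitor are easy to compute. The sketch below is only an illustration, not the code that ran on that monitor: the function name is mine, the sample counts are made up, and I assume that "the last ten sprints" means the ten sprints preceding the current one.

```python
from statistics import mean


def sprint_summary(bug_counts: list[int], window: int = 10) -> dict:
    """Compare the current sprint with the average of the preceding sprints.

    bug_counts is ordered oldest to newest; the last item is the current sprint.
    """
    if not bug_counts:
        raise ValueError("at least one sprint is required")
    current = bug_counts[-1]
    # Up to `window` sprints immediately before the current one.
    previous = bug_counts[-(window + 1):-1]
    baseline = mean(previous) if previous else None
    delta = None if baseline is None else current - baseline
    return {"current": current, "baseline": baseline, "delta": delta}


# Made-up bug counts, for illustration only.
history = [14, 12, 15, 13, 16, 12, 14, 13, 15, 16, 9]
print(sprint_summary(history))
```

A negative `delta` means the current sprint is doing better than the recent baseline, which is the single number worth displaying in large type.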

I also advised the team to start doing code reviews, and created the corresponding metric. After a code review, the original author can mark that the reviewer provided valuable feedback, and can also mark that the reviewer found a bug. The metric measures the number of those marks per person. We also discussed pair programming, but the team was against it, with the exception of the manager. I advised them to forget about pair programming: the manager's opinion is irrelevant here. If nobody on the team wants to do it, they won't; if they are forced to do it by management, they'll find a way to make it ineffective.
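The code review metric boils down to counting two kinds of marks per reviewer. Here is a minimal sketch, assuming each review is recorded as one entry with two flags; the `ReviewMark` type, its field names, and the reviewer names are hypothetical, invented for the illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewMark:
    reviewer: str
    valuable_feedback: bool  # set by the original author after the review
    found_bug: bool          # set by the original author after the review


def marks_per_reviewer(marks: list[ReviewMark]) -> dict[str, dict[str, int]]:
    """Count, per reviewer, the reviews marked as valuable and as bug-finding."""
    valuable: Counter = Counter()
    bugs: Counter = Counter()
    for mark in marks:
        if mark.valuable_feedback:
            valuable[mark.reviewer] += 1
        if mark.found_bug:
            bugs[mark.reviewer] += 1
    reviewers = sorted({mark.reviewer for mark in marks})
    return {name: {"valuable": valuable[name], "bugs": bugs[name]}
            for name in reviewers}


# Hypothetical marks, for illustration only.
marks = [
    ReviewMark("alice", valuable_feedback=True, found_bug=True),
    ReviewMark("alice", valuable_feedback=True, found_bug=False),
    ReviewMark("bob", valuable_feedback=False, found_bug=False),
]
print(marks_per_reviewer(marks))
```

Reviewers with no marks still appear with zero counts, so nobody silently disappears from the board.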

Regarding the exhaustion factor, I advised the manager to stop tracking how much time people are present in the office. I had him agree in front of the team that team members could leave early if they felt tired, with no negative consequences. It was essential to emphasize that they would not be evaluated on the number of hours spent in the office. This wasn't an easy task: the custom in the French IT industry is to make the hours spent at work a major, and sometimes the only, measurement. I also devised a metric to show whether the exhaustion factor drops over time. Developers were asked, once per week, anonymously, whether they would consider staying in this company for the rest of their life. A few months later, I would use a very similar measurement for another team, the Friday's motivation metric: “I want so much this week to end!”
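One possible way to turn the weekly question into a trend is to track the share of "yes" answers per week. This is a sketch under my own assumptions: the aggregation, the function name, and the week labels are not taken from the actual setup, and the answers shown are invented.

```python
def weekly_yes_share(answers_by_week: dict[str, list[bool]]) -> dict[str, float]:
    """Return, for each week, the fraction of anonymous 'yes' answers.

    Weeks without any answer are skipped rather than reported as zero.
    """
    return {
        week: sum(answers) / len(answers)
        for week, answers in sorted(answers_by_week.items())
        if answers
    }


# Invented answers, for illustration only.
survey = {
    "week-01": [True, False, False, False],
    "week-02": [True, True, False, False],
}
print(weekly_yes_share(survey))
```

Only the aggregate is stored, which keeps the answers anonymous; a rising share suggests that exhaustion is dropping.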

Two months later, it was time to check the results. Those results were rather impressive:

To show the team that they were going in the right direction, and to encourage them to continue, I suggested that management give a financial reward to all the members of the team, and define a series of further rewards for when the number of bug-related tickets hit specific thresholds for the first time over several consecutive sprints. I also asked the boss to congratulate the developer who had found the most bugs during code reviews (three out of six), as well as the whole team, who had prevented six bugs from affecting the customers.

While talking with the team, I noticed a few concerns about the quality of the tests. Indeed, there were a bunch of complex methods which weren't tested well enough. I advised setting up a monitor showing the branch coverage for all methods with complexity higher than x. That x would be adjusted by the team over time. The original value that I suggested, for instance, was completely wrong, and was adjusted two days later.
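The filter behind that monitor can be sketched in a few lines. I assume here that per-method complexity (for instance cyclomatic complexity) and branch counts are already exported by whatever coverage tooling the team uses; the `MethodStats` type, its fields, and the sample method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodStats:
    name: str
    complexity: int        # assumed to be cyclomatic complexity
    branches_total: int
    branches_covered: int


def coverage_of_complex_methods(methods: list[MethodStats],
                                threshold: int) -> list[tuple[str, float]]:
    """Return (name, branch coverage) for methods above the complexity threshold.

    Coverage is a ratio between 0.0 and 1.0; the least covered methods come first.
    """
    rows = []
    for method in methods:
        if method.complexity > threshold and method.branches_total > 0:
            rows.append((method.name,
                         method.branches_covered / method.branches_total))
    return sorted(rows, key=lambda row: row[1])


# Hypothetical methods, for illustration only.
methods = [
    MethodStats("convert_units", complexity=12, branches_total=8, branches_covered=2),
    MethodStats("parse_order", complexity=15, branches_total=10, branches_covered=9),
    MethodStats("get_name", complexity=1, branches_total=2, branches_covered=2),
]
print(coverage_of_complex_methods(methods, threshold=10))
```

Keeping `threshold` as a plain parameter matches the idea that the team, not the consultant, tunes x over time.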

Three years later, I contacted the boss again to see how things were going. The company is doing great, unlike the competitor. There are some important new customers. A number of other customers left, but not because of the bugs; the boss was unwilling to disclose the real reason. A new metric appeared: the number of deployments. There are now about one hundred deployments per year, with up to four releases on some days. The high frequency means that there is no longer any need for hotfixes: if there is a bug, it is fixed in the next ordinary release. Developers have fewer health problems, which is absolutely great. The boss didn't know the state of the code coverage, but from the positive feedback he gave about the developers, I'm sure they got their tests right. The Gantt chart I was so proud of was thrown away: instead, time-to-market (TTM) is now measured. Management learned its lesson: the development team has no knowledge of the existence of the TTM metric, and they no longer work under pressure.

Compared with the story in the previous article, this one looks messy. In the American company, Andy had set simple, straightforward metrics, which were considered permanent. There were no revisions, no new metrics over time. The metrics from the French company, on the other hand, were rather complex, and changed all the time. Old ones were removed. New ones were created. The following illustration shows the volatility of the metrics. Notice that none survived unchanged for the whole three years and four months.

Figure 3: The volatility of the metrics over time.

The fact is, the volatility is what makes the system effective. As the system being measured evolves, so should the metrics. The metrics influence the system, and the system influences the metrics. If the metrics are not adjusted to the evolving system, there is no certainty that they will bring any benefit, and more often than not, a metric that doesn't evolve becomes actively harmful very fast. In Andy's case, the metrics were wrong from the beginning, but that was not his major mistake. In the story above, I made spectacular mistakes too, by suggesting metrics which turned out to be harmful from the start. However, I never set those metrics in stone. We drafted them all together, and went to check what would happen. As soon as we noticed that the metrics were impacting the system negatively, we either modified them, or got rid of them if they were beyond saving. When it comes to measuring the metrics themselves, more volatility is better.