Measuring quality and undesired behavior

Arseni Mourzenko
Founder and lead developer
March 31, 2021
Tags: quality

When I talk about measuring quality, I can't emphasize enough how much measuring a given thing influences the behavior of the people being measured. When done properly, this is an excellent lever which encourages people to work better. In other cases, however, it can have terrible effects on a given project.

Meet Andy, a technical leader who joined a small American company a few years ago. One of the developers working in this company told me the story while we were discussing quality and productivity, and I believe it is an excellent illustration of how measuring incorrectly can negatively impact the business.

The primary business of the company is not software development, but the IT department was, among other things, in charge of developing several pieces of software: one used internally by the company, and one marketed to customers. The software grew over time, and outgrew the skills of the original team, who could successfully build something simple, but had no skills to maintain a larger product in a context where different customers ask for different features, with no consideration for the overall architecture. So the boss hired Andy, who had an impressive resume and promised that he could solve any problem.

During his first week, Andy skimmed through the project and made several recommendations which were communicated to the boss, who congratulated Andy in front of the team. Here they are:

  1. The code is mostly untested. The team should write more tests.
  2. There is no documentation. Proper technical documentation should be created.
  3. Programmers don't commit often. They should.
  4. There is a strong emphasis on new features over bugs. Programmers should “fix bugs before writing new code”.
  5. The build is mostly unusable. It should be fixed.

The recommendations don't look exceptional. I'm sure most of you have made similar recommendations when inheriting a project or joining a team, and most of you would expect proper testing and documentation, regular commits, prioritized bugs, and a usable build.

Andy knew that recommendations are useless unless they are enforced, and that they are enforced only if their enforcement is measured. So for the following month, he worked on a system which would collect the measurements he needed. When he finished, this is what he actually measured:

  1. Code coverage, determined by the build Andy fixed in the meantime. A high number of LOCs covered by unit tests is good; a low number is bad.
  2. The number of articles in the Wiki Andy created for the project.
  3. The number of commits per week. Higher is better.
  4. The ratio of bug tickets closed per sprint to all tickets closed per sprint.
  5. The number of build failures per month, where the failure is due to something other than an issue in the code itself: for instance, a build which failed because an agent crashed qualifies, whereas one which failed because of a failing test doesn't.

Unfortunately for Andy, all five measurements are plain wrong, and all five had negative effects on the project itself.

Code coverage

Much has been written on code coverage, and many heated debates have taken place as to whether it is a useful metric or not. The way Andy was presenting his measurement had two problems.

First of all, he was using LOCs. While there are valid uses of LOCs as a metric, there are many cases where better metrics exist. In the case of code coverage, a usual alternative is the number of branches. The same code with the same complexity can be written differently: in one case, a block of code would span twenty lines; in another, it would use only five.

And this is exactly what some of the programmers did. They compacted code. Thus, a block of code which had a code coverage of 25% could magically obtain a code coverage of 100% by simply putting every statement on the same line. It made a lot of code completely unreadable, but readability was never measured, so it didn't matter.
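To make the gaming concrete, here is a minimal sketch (the validation function is hypothetical, not code from the actual project) of how the same logic, written in two forms, yields very different line-coverage numbers when only a single happy-path test runs:

```python
# Hypothetical example: the same check written twice with identical logic.

def validate_expanded(order):
    # Four lines in the block: the condition, both returns, and the else.
    if order["quantity"] > 0:
        return True
    else:
        return False

def validate_compact(order):
    # One line: the whole block compacted into a conditional expression.
    return True if order["quantity"] > 0 else False

def line_coverage(lines_hit, lines_total):
    """Line coverage as most tools report it: executed lines / countable lines."""
    return lines_hit / lines_total

# A single happy-path test executes the condition and the first return...
expanded = line_coverage(lines_hit=2, lines_total=4)   # 0.5
# ...but it "covers" the entire compacted version.
compact = line_coverage(lines_hit=1, lines_total=1)    # 1.0

print(f"expanded: {expanded:.0%}, compact: {compact:.0%}")
```

Branch coverage doesn't fall for this trick: both versions have the same two branches, and the single test covers only one of them either way.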

Even worse, code coverage doesn't take into account the usefulness of a given test. There is code which should be tested aggressively, and there is code which needs no tests whatsoever. Like most business applications, this one had a small set of functions doing something algorithmically interesting, and some functions doing something worth testing, but 80% of the code was just pipeline stuff: plain getters and setters, models passed from one level to another, and so on. In order to increase the code coverage, programmers focused specifically on those 80%, as they were the easiest to test. I can't blame them: when some moron decided at one of my jobs that we needed to reach a given code coverage threshold, I did the exact same thing myself.

The negative impact of those tests is that they make it more difficult to refactor code, while also demoralizing the team. You make a little change, and suddenly dozens of unit tests start to fail. As a result, you avoid making little changes.

Essentially, the measurement Andy introduced proved to be not just ineffective, but actively harmful. It led to the spread of long, unreadable one-liners, and it made refactoring harder, with no benefit whatsoever in terms of the quality of the product.

Number of articles

Andy didn't know how to measure whether the project was documented or not, so he picked the metric he could easily collect from the Wiki: the number of pages. This, however, is also the stupidest possible metric. It is like claiming that a given book is better than another because it contains four hundred pages versus three hundred.

The effect of this measurement is easy to predict. Programmers started to create lots of small pages. A year later, the Wiki contained several hundred pages of technical documentation. Nobody read them, not only because nobody cared about them, but also because (1) nobody could figure out where to find a given piece of information, (2) the same piece of information was usually duplicated across several pages, (3) the quality of the content made it perfectly unusable, and (4) most pages were outdated anyway.

But the chart showing an ever-increasing curve looked nice.

Frequency of commits

There are definitely patterns which could raise an alarm when looking at how often a given person commits their changes. Moreover, my impression is that programmers who commit less than once per day are usually not very skillful, although I don't have hard data to prove it. In any case, committing infrequently causes problems, from complex merges to commits that are difficult to read. Looking at my own work on my personal projects, I can assert without hesitation that whenever I didn't commit work for more than a day, I usually either screwed up, introduced a bug, or didn't understand what I had done later on when reviewing the commit.

This doesn't mean, however, that more is better. Consider someone who commits once per minute. Does that seem right?

This is a bit like what happened over time in Andy's team. Programmers who committed the most were rewarded, and so there was an incentive to commit more, no matter what a commit contained. Renamed a member? Commit. Changed a label? Commit. There is nothing wrong with committing tiny changes like that, when they make sense. A sign that those programmers just wanted to increase their score was, however, the style of their commit messages. You would imagine that they wrote useless things such as “Renamed X” or “Changed label Y,” but it was worse than that. At some point, the developer I was talking with took all the commits for a period of one year and measured how many matched a case-insensitive “WIP.” It was 52.5%. For the remaining half, things weren't great either. There were messages such as “rename,” “del cmt,” and even a short and sweet “chg.” That's right, a commit... actually... changes something! Who could have imagined that!

One member of the team once proudly announced that he had made his fiftieth commit of the day. Good for him. Those should have been five to ten real commits instead, but thanks to Andy, there wasn't much choice.

Closed bugs vs. closed tickets

A math problem. You have three new features that you're likely to finish during the sprint, and the ratio of closed bugs to all closed tickets should be 0.85 in order to perform better than during the last sprint. How many bugs should you close during the sprint? Since features make up the remaining 0.15 of closed tickets, 3 / (1 − 0.85) = 20: there should be twenty tickets closed in total, of which seventeen are bug tickets.

The number of new features is fixed. The target ratio is fixed too. Which means that you need to vary the number of bugs. So let's create bugs.
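The arithmetic behind this incentive fits in a few lines. A sketch (the function name is mine, not anything Andy actually shipped):

```python
def bugs_needed(features, target_ratio):
    """Bug tickets to close so that bugs / (bugs + features) hits the target ratio.

    Closed feature tickets are the (1 - ratio) share of all closed tickets,
    so total = features / (1 - ratio), and the rest must be bugs.
    """
    total = features / (1 - target_ratio)
    return round(total - features)

print(bugs_needed(3, 0.85))   # 17 bugs, for 20 closed tickets in total
print(bugs_needed(3, 0.90))   # 27: a "better" sprint needs even more bugs
```

Note that raising the target ratio while the feature count stays fixed only ever increases the number of bugs the team must produce and close.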

This metric possibly led to the most radical change of behavior in the team. In the past, programmers at least tried to think about the possible regressions of a given change, or the edge cases which needed to be handled. A few months after the measurements were put in place, programmers started to be sloppy. And when bugs were found, they created tickets which were then solved in a matter of minutes, because they knew perfectly well what had caused the bugs in the first place.

On a larger scale, the sloppiness caused the quality of the software product to drop. But the metrics were great: the ratio curve was steadily growing, approaching the value of one.

Number of build issues

At first, it took me some time to figure out why this last metric is wrong as well. After all, it's bad when the build doesn't work as expected, and it's good when it does.

There was, however, one particular behavior that this metric encouraged. Imagine you're a system administrator, and you have an SLA for a system. The SLA doesn't say that the system should be up 99.9% of the time. Instead, it says that the system should have no more than ten failures per month. I would be more than happy to administer such a system. On the first failure, I'd leave the system down, and keep it down until the last day of the month. SLA met. Everything's great.
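A back-of-the-envelope sketch of why a failure-count SLA rewards exactly this behavior (the numbers are illustrative, not taken from the story):

```python
def sla_report(outages, month_hours=720):
    """Summarize a month of outages, each given as (start_hour, end_hour)."""
    failures = len(outages)
    downtime = sum(end - start for start, end in outages)
    availability = 1 - downtime / month_hours
    return failures, availability

# Ten quick reboots: users barely notice, but the metric counts ten failures.
print(sla_report([(h, h + 0.1) for h in range(10)]))
# One failure left unfixed all month: terrible for users, "excellent" for the metric.
print(sla_report([(0, 720)]))
```

The first month has ten failures but roughly 99.9% availability; the second has a single failure and 0% availability, yet looks ten times better by the failure-count metric.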

This is a bit like what happened with the build. A constant problem was that things were so broken that different elements failed on a regular basis. You could fix the build early in the morning, and by the afternoon it was down once again, sometimes for the same reason, sometimes because of something completely unrelated. The only way to look good on Andy's last metric was to stop trying to solve problems. Where programmers previously rebooted the build server several times per day, the new metric pushed them to keep away from it. At one point, the build server failed on the first day of the month, and nobody wanted to take the responsibility of bringing it back and risking another failure. And so the server stayed down until the last day of the month. Andy's metrics showed that this was an excellent month: the build server failed only once.

The root problems

Andy had great ideas. Andy knew that he had to measure things in order to change behavior. But he didn't know what to measure, or how. He didn't know how the measuring should be implemented.

The first error was to write the criteria in stone. Remember, I mentioned that he went to his boss for approval, and an approval he got, in front of the whole team. With all this formalism, once the big boss says that the metrics are great, it becomes nearly impossible to say, a few weeks later: “actually, our metrics are all wrong, let's pick some other ones.” So Andy missed one of the most important characteristics of measuring quality: metrics are volatile by definition. You can't just draft them once and for all. Instead, you review them on a regular basis, and adapt them to the evolving context. Sometimes, the metrics are great for a few weeks or months, but then become obsolete. Other times, you notice that metrics which looked nice on paper are completely wrong as soon as you start implementing them. For instance, I could have imagined that the last of Andy's metrics would be a great thing, and would have noticed its terrible impact only days later.

The second error was to decide all alone what should be measured and how. When I worked as a consultant, I never did that. Instead, I worked with the team, and together we created metrics that the team was happy with. While I had an important role (I had to check that the team's metrics aligned with the business metrics, and I had to protect the team from choosing metrics known to be useless or harmful), the choice was not mine alone, but the team's. Had Andy been less formal in his approach, he could have gathered valuable feedback from his team.

The third error was to consider only the positive effects. Measurements which have only positive (or only negative) effects are rare; most have both. Therefore, one always has to think about what undesired behavior a given metric could bring. This is a difficult task: intuitively, an undesired behavior is assumed to be something that a bad person would do. This is false. When you measure something, you show that this is what matters. If you measure the number of commits per day, or the code coverage, you show that somehow, the frequency of commits or the code coverage is important for the company. Programmers may not understand why those things are important, nor are they even expected to think about it. A natural human behavior, when faced with a metric, is to score better. And in order to score better, one can optimize one's behavior. I can't blame the programmers who created lots of bug tickets, or who made fifty commits per day: this is expected behavior from a person trying to optimize their score. Those are not bad programmers; those are bad metrics. Designing good metrics can be damn hard, and take many, many iterations.