Measuring quality and undesired behavior
When I talk about measuring quality, I can't emphasize enough how much measuring a given thing influences the behavior of the people being measured. Done properly, measurement is an excellent lever that encourages people to work better. Done poorly, however, it can have terrible effects on a project.
Meet Andy, a technical leader who joined a small American company a few years ago. One of the developers working there told me this story while we were discussing quality and productivity, and I believe it is an excellent illustration of how measuring the wrong things can harm a business.
The company's primary business is not software development, but the IT department was, among other things, in charge of developing several pieces of software: one used internally by the company, and one marketed to customers. The software grew over time and outgrew the skills of the original team, who could successfully build something simple, but had no experience maintaining a larger product in a context where different customers ask for different features with no consideration for the overall architecture. So the boss hired Andy, who had an impressive resume and promised he could solve any problem.
During his first week, Andy skimmed through the project and made several recommendations, which were communicated to the boss, who congratulated Andy in front of the team. Here they are:
- The code is mostly untested. The team should write more tests.
- There is no documentation. Proper technical documentation should be created.
- Programmers don't commit often. They should.
- There is a strong emphasis on new features over bug fixes. Programmers should “fix bugs before writing new code”.
- The build is mostly unusable. It should be fixed.
The recommendations are nothing exceptional. I'm sure most of you have made similar ones when inheriting a project or joining a team, and most of you would expect proper testing and documentation, regular commits, prioritized bugs, and a usable build.
Andy knew that recommendations are useless unless they are enforced, and that they are enforced only if their enforcement is measured. So for the following month, he worked on a system that would collect the measurements he needed. When he finished, this is what he actually measured:
- Code coverage, determined by the build Andy fixed in the meantime. A high number of LOCs covered by unit tests is good; a low number is bad.
- Number of articles in the Wiki Andy created for the project.
- Number of commits per week. Higher is better.
- Ratio of bug tickets closed per sprint to all tickets closed per sprint.
- Number of build failures per month, counting only failures caused by something other than an issue in the code itself: a build that failed because an agent crashed qualifies, whereas a build that failed because of a failing test doesn't.
Unfortunately for Andy, all five measurements were plain wrong, and all five had negative effects on the project itself.
Code coverage
Much has been written about code coverage, and many heated debates have taken place over whether it is a useful metric or not. The way Andy set up his measurement had two problems.
First of all, he was using LOCs. While there are valid uses of LOC as a metric, in many cases better metrics exist. For code coverage, the usual alternative is the number of branches. The same code with the same complexity can be written differently: in one case a block of code spans twenty lines, in another it takes only five.
And this is exactly what some of the programmers did. They compacted code. A block of code with 25% coverage could magically reach 100% coverage simply by putting every statement on the same line, as the sketch below illustrates. It made a lot of code completely unreadable, but readability was never measured, so it didn't matter.
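To make the trick concrete, here is a minimal sketch in Python (the story doesn't say which language the project used, and the function names are hypothetical). A test that only exercises the non-member path leaves a line uncovered in the multi-line version, while the one-liner shows up as fully covered under line-based coverage; branch coverage would flag both.

```python
# Two equivalent functions; only the formatting differs.

def discount_multiline(price: float, is_member: bool) -> float:
    # Spread over several lines: if no test ever passes is_member=True,
    # the body of the "if" stays uncovered and line coverage drops.
    if is_member:
        price = price * 0.9
    return price

def discount_oneliner(price: float, is_member: bool) -> float:
    # Same logic on a single line: that line runs on every call, so
    # line-based coverage reports 100% even though the member branch
    # is never exercised. Branch coverage would still catch the gap.
    return price * 0.9 if is_member else price

def test_non_member_price_is_unchanged():
    assert discount_multiline(100.0, False) == 100.0
    assert discount_oneliner(100.0, False) == 100.0
```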
Even worse, code coverage doesn't take into account the usefulness of a given test. There is code that should be tested aggressively, and there is code that needs no tests whatsoever. Like most business applications, this one had a small set of functions doing something algorithmically interesting, some functions doing something worth testing, and 80% of code that was just plumbing: plain getters and setters, models being passed from one layer to another, and so on. In order to increase code coverage, programmers focused specifically on those 80%, as they were the easiest to test. I can't blame them: when some moron decided at one of my jobs that we needed to reach a given code coverage threshold, I did exactly the same thing myself.
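This is roughly what the “easy” tests looked like: a hypothetical example of a test that covers a plain data holder without verifying anything that could realistically break.

```python
from dataclasses import dataclass

@dataclass
class CustomerDto:
    # A plain data holder: no logic, nothing that can realistically break.
    name: str
    email: str

def test_customer_dto_holds_its_fields():
    # Raises the coverage numbers but catches no real defect: it merely
    # restates what the dataclass declaration already guarantees.
    dto = CustomerDto(name="Alice", email="alice@example.com")
    assert dto.name == "Alice"
    assert dto.email == "alice@example.com"
```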
The negative impact of those tests is that they make it more difficult to refactor code, while also demoralizing the team. You make a small change, and suddenly dozens of unit tests start to fail. As a result, you avoid making small changes.
Essentially, Andy's metric proved to be not just ineffective, but actively harmful. It led to a proliferation of long, unreadable one-liners, and it made refactoring harder, with no benefit whatsoever in terms of product quality.
Number of articles
Andy didn't know how to measure whether the project was documented or not, so he picked the metric he could most easily collect from the Wiki: the number of pages. This, however, is also a stupid metric. It is like claiming that one book is better than another because it has four hundred pages instead of three hundred.
The effect of this measurement is easy to predict: programmers started to create lots of small pages. A year later, the Wiki contained several hundred pages of technical documentation. Nobody read them, not only because nobody cared about them, but also because (1) nobody could figure out where to find a given piece of information, (2) the same piece of information was usually duplicated across several pages, (3) the quality of the content made it unusable, and (4) most pages were outdated anyway.
But the chart, with its ever-increasing curve, looked nice.
Frequency of commits
There are definitely patterns that can raise an alarm when you look at how often a given person commits changes. My impression is that programmers who commit less than once per day are usually not very skillful, although I don't have hard data to prove it. In any case, committing infrequently causes problems, from complex merges to commits that are difficult to read. Looking at my own work on personal projects, I can say without hesitation that whenever I went more than a day without committing, I usually either screwed up, introduced a bug, or couldn't understand what I had done when reviewing the commit later.
This doesn't mean, however, that more is better. Consider someone who commits once per minute. Does that seem right?
This is more or less what happened over time in Andy's team. Programmers who committed the most were rewarded, so there was an incentive to commit more, no matter what a commit contained. Renamed a member? Commit. Changed a label? Commit. There is nothing wrong with committing tiny changes like that when they make sense. A sign that those programmers just wanted to increase their score was, however, the style of their commit messages. You would imagine they wrote useless things such as “Renamed X” or “Changed label Y,” but it was worse than that. At some point, the developer I was talking with took all the commits from a one-year period and measured how many matched a case-insensitive “WIP” (a check similar to the sketch below). It was 52.5%. For the remaining half, things weren't great either. There were messages such as “rename,” “del cmt,” and even a short and sweet “chg.” That's right, a commit... actually... changes something! Who could have imagined that!
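A minimal sketch of such a check, assuming a local Git repository; the function name and the default one-year window are mine, not the developer's.

```python
import subprocess

def wip_ratio(repo_path: str, since: str = "1 year ago") -> float:
    """Fraction of commit subjects containing 'WIP', case-insensitively."""
    subjects = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    if not subjects:
        return 0.0
    wip = sum(1 for subject in subjects if "wip" in subject.lower())
    return wip / len(subjects)

if __name__ == "__main__":
    print(f"{wip_ratio('.'):.1%} of commit subjects mention WIP")
```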
One member of the team once proudly announced that he had made his fiftieth commit of the day. Good for him. Those should have been five to ten real commits instead, but thanks to Andy, there wasn't much choice.
Closed bugs vs. closed tickets
A math problem. You have three new features that you're likely to finish during the sprint, and the ratio of closed bugs to all closed tickets should be 0.85 in order to perform better than during the last sprint. How many bugs should you close during the sprint? Since the three features represent the remaining 15% of closed tickets, 3 / (1 - 0.85) = 20: twenty tickets should be closed in total, seventeen of them bug tickets.
The number of new features is fixed. The target ratio is fixed too. That means the only thing you can vary is the number of bugs. So let's create bugs.
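The same arithmetic as a small sketch (the function name is hypothetical): with the number of features fixed, it computes how many bug tickets must be closed to reach a target ratio.

```python
def bugs_needed(features_closed: int, target_ratio: float) -> int:
    # Bugs must make up target_ratio of all closed tickets, so the features
    # are the remaining share: total = features / (1 - ratio).
    total_tickets = features_closed / (1 - target_ratio)
    return round(total_tickets) - features_closed

print(bugs_needed(3, 0.85))  # 17 bug tickets, i.e. 20 closed tickets in total
```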
This metric possibly led to the most radical change of behavior in the team. In the past, programmers at least tried to think about the possible regressions of a given change, or the edge cases that needed to be handled. A few months after the measurements were in place, programmers started to be sloppy. And when bugs were found, they created tickets that were then solved in a matter of minutes, because they knew perfectly well what had caused the bugs in the first place.
On a larger scale, the sloppiness caused the quality of the software product to drop. But the metrics were great: the ratio curve grew steadily, approaching one.
Number of build issues
It took me some time to figure out why this last metric is wrong as well. After all, it's bad when the build doesn't work as expected, and good when it does.
There was, however, one particular behavior that this metric encouraged. Imagine you're a system administrator, and you have an SLA for a system. The SLA doesn't say that the system should be up 99.9% of the time. Instead, it says that the system should have no more than ten failures per month. I would be more than happy to administer such a system. On the first failure, I'd simply leave the system down until the last day of the month. SLA met. Everything's great.
This is more or less what happened with the build. A constant problem was that things were so broken that different elements failed on a regular basis. You could fix the build early in the morning, and by the afternoon it was down once again, sometimes for the same reason, sometimes because of something completely unrelated. The only way to look good on Andy's last metric was to stop trying to solve problems. Where programmers previously rebooted the build server several times per day, the new metric pushed them to stay away from it. At one point the build server failed on the first day of the month, and nobody wanted to take responsibility for bringing it back and risking another failure. So the server stayed down until the last day of the month. Andy's metrics showed that this was an excellent month: the build server failed only once.
The root problems
Andy had great ideas. He knew that he had to measure things in order to change behavior. But he didn't know what to measure, nor how the measurement should be implemented.
The first error was to set the criteria in stone. Remember, I mentioned that he went to his boss for approval, and got that approval in front of the whole team. With all this formality, once the big boss has said that the metrics are great, it becomes nearly impossible to say a few weeks later: “actually, our metrics are all wrong, let's pick some other ones.” So Andy missed one of the most important characteristics of measuring quality: metrics are volatile by definition. You can't just draft them once and for all. Instead, you review them on a regular basis and adapt them to the evolving context. Sometimes the metrics are great for a few weeks or months, then become obsolete. Other times, you notice that metrics which looked nice on paper are completely wrong as soon as you start implementing them. For instance, I could easily have imagined that the last of Andy's metrics was a good one, and only noticed its terrible impact days later.
The second error was to decide all alone what should be measured and how. When I worked as a consultant, I never did that. Instead, I worked with the team, and together we created metrics the team was happy with. I still had an important role: I had to check that the team's metrics aligned with the business metrics, and I had to steer the team away from metrics known to be useless or harmful. But the choice was not mine alone; it belonged to the team. If Andy had been less formal in his approach, he could have gathered valuable feedback from his team.
The third error was to consider only the positive effects. Measurements that have only positive (or only negative) effects are rare; most have both. Therefore, one always has to think about what undesired behavior a given metric could encourage. This is a difficult task: intuitively, an undesired behavior is assumed to be something that a bad person would do. This is false. When you measure something, you show that this is what matters. If you measure the number of commits per day or the code coverage, you show that, somehow, commit frequency or code coverage is important to the company. Programmers may not understand why those things are important, nor are they even expected to think about it. A natural human behavior, when faced with a metric, is to try to score better. And in order to score better, people optimize their behavior. I can't blame the programmers who created lots of bug tickets, or who made fifty commits per day: this is the expected behavior of people optimizing their score. Those are not bad programmers; those are bad metrics. Designing good metrics can be damn hard, and can take many, many iterations.