Measuring quality like a pro
My previous article left some readers wondering what a correct approach to measuring things actually looks like. I'll use this as an opportunity to show how I myself worked as a consultant. This article retraces my consulting experience from 2016 for a small French company (about thirty people) developing a software product used in a very narrow field with very few customers. The company shared this market with a competitor of roughly the same level of expertise. The competitor had appeared on this market three years earlier and, over time, “stole” a significant number of customers.
The founder of this company came to me with a major concern: there were too many bugs and regressions in the software product. While everyone understands, intuitively, that bugs and regressions are bad, their mere presence is not a useful metric. One company may have lots of them while remaining successful; another may have only a few, yet those few may severely impact the business. The goal of the first meeting was to qualify what the boss actually meant by “too many bugs and regressions.” The meeting revealed four issues caused by the bugs and regressions in production, each with a direct financial impact:
- When a customer finds a bug, he calls the support. The support wastes time trying to find what's wrong, then discovers that the problem is likely related to something the developers did. The subject is escalated to the developers. The developers waste their time fixing the bug, and then a hotfix has to be deployed to production. The deployment step alone takes one day of work. Even a tiny bug takes at least two man-days to be solved.
- The image of the company suffers. For instance, a month before our meeting, an important customer moved to the competitor because of the constant issues with the product. Moreover, since there are about three hundred potential customers in France in this sector, they do talk to each other occasionally, and some may prefer the competitor's product if they hear someone else describing all the bugs he found in my client's product.
- Because of high stress, developers are often sick and occasionally burn out.
- The time the developers spend fixing bugs is not spent on something actually useful, namely new features and enhancements. One concrete problem: on several occasions, the competitor released the same features months earlier, attracting customers who needed those specific features.
The fact that it takes one day to deploy to production surprised me, so we went to see the team manager for a five whys session:
- Why does it take one day to do a deployment? Because the deployment procedure is manual.
- Why is the procedure manual? Because it is too complex to automate.
- Why is it too complex? Because there were too many changes over time, and nobody has the skills to refactor the process.
- Why does nobody have the skills? Because the original guy who created the thing left the company, and no one understands the idiosyncratic approaches he used.
- Why are the approaches considered idiosyncratic? Because the guy didn't know any standards back then, and used an in-house approach, reinventing the wheel.
Not the most constructive five whys, but it seems that the number of bugs in production is not the only problem in this company. Spending a day doing an error-prone procedure by hand is not particularly enjoyable. There is a very concrete risk of making errors, and errors were made regularly. When I asked how many deployments the team was doing per year, the manager said about sixty. At one day each, that's about sixty working days a year, or roughly three man-months: more than enough to justify hiring an expert for several months to automate the thing.
Soon, we started working on a series of metrics:
We can track over time how much time the support spends working on bugs and regressions. People from support, unlike the developers, had a habit of working with tickets and tracking whatever they do, so such a metric was easy to collect. The idea was that it would give us a very concrete measure of the time wasted, and we could watch how it evolves. It was really about the duration, not the number of bugs: some bugs are relatively easy to spot, and it takes the support a few seconds to know they need to escalate to the developers, while others are cryptic enough to waste a few hours.
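To make this first metric concrete, here is a minimal sketch, in Python, of the aggregation involved. The `TimeEntry` shape, the ticket types, and the sample values are assumptions for illustration; any ticketing system that can export who logged how much time on which kind of ticket would do.

```python
# Sketch: weekly time spent by support on bug-related tickets.
# Data shapes are hypothetical.
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class TimeEntry:
    ticket_type: str   # e.g. "bug", "regression", "feature"
    day: date          # day the support agent logged the time
    minutes: int       # time logged on the ticket

def weekly_bug_time(entries: list[TimeEntry]) -> dict[tuple[int, int], int]:
    """Sum minutes spent on bug/regression tickets, grouped by ISO week."""
    totals: dict[tuple[int, int], int] = defaultdict(int)
    for entry in entries:
        if entry.ticket_type in {"bug", "regression"}:
            year, week, _ = entry.day.isocalendar()
            totals[(year, week)] += entry.minutes
    return dict(totals)

# Example: two bug-related entries in the same week add up to 105 minutes.
sample = [
    TimeEntry("bug", date(2016, 3, 7), 45),
    TimeEntry("regression", date(2016, 3, 9), 60),
    TimeEntry("feature", date(2016, 3, 9), 120),  # ignored: not a bug
]
print(weekly_bug_time(sample))  # {(2016, 10): 105}
```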
The metric can be gamed, for instance by developers adding better logging. I'm perfectly happy with that, since it still achieves the goal of less time being spent on bugs.
In retrospect, this metric was a bad idea, and it wasn't communicated correctly. The folks from support misunderstood its purpose. At the beginning, they thought that management would look at how much time each team member was spending, in order to know who works more. Management explained that this was not the purpose, and that the metric would cover the whole team, not individual members. This created another issue: the support guys were afraid that management would still start asking why the whole team spends so little time doing the work they are paid for, and so they started to artificially inflate the time. The difference is clearly visible in Fig. 1, which shows the time spent by the support on the tickets related to bugs and regressions in production. Aside from the moment (1) where everybody goes on vacation, and the moment (2) that the support guys will remember for the rest of their lives as “the Black Week,” one can see something happening when (3) the measurement was announced to the support team. Checking the raw data, one can indeed spot a number of unusual practices. For instance, some of the guys weren't pausing the counter during lunch, as they did before.
Figure 1 Time spent by the support every week on the tickets related to bugs and regressions in production.
We can also track the number of tickets opened by the support and assigned to the developers, counting only the ones closed as solved during the current or the next sprint. The filtering is important here: some tickets are not really bugs affecting customers directly, and they end up being closed months or years later, sometimes because nobody cares about them. Such tickets would pollute the data set, lowering its representativeness.
The metric cannot be gamed easily. The interested party is the development team, while the tickets are created by the support. Developers have no leverage to prevent the support from opening too many tickets. The only thing they can do to escape the metric is to wait for two sprints and only then fix a given problem. It was unlikely that anyone would do such a thing, given that nearly every one of those tickets has a high priority.
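As a rough sketch of the filtering, assuming each ticket records the sprint it was opened and closed in (the field names are hypothetical), the metric boils down to something like this:

```python
# Sketch: count, per sprint, the bug tickets opened by support and closed
# as solved in the same sprint or the next one. Fields are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ticket:
    opened_in_sprint: int
    closed_in_sprint: Optional[int]   # None if still open
    resolution: Optional[str]         # e.g. "solved", "won't fix", None

def counts_quickly_solved(tickets: list[Ticket]) -> dict[int, int]:
    """Count per sprint the tickets solved in that sprint or the next one."""
    counts: dict[int, int] = {}
    for t in tickets:
        if (
            t.resolution == "solved"
            and t.closed_in_sprint is not None
            and t.closed_in_sprint - t.opened_in_sprint <= 1
        ):
            counts[t.opened_in_sprint] = counts.get(t.opened_in_sprint, 0) + 1
    return counts

tickets = [
    Ticket(41, 41, "solved"),      # counted: solved in the same sprint
    Ticket(41, 42, "solved"),      # counted: solved in the next sprint
    Ticket(41, 47, "solved"),      # ignored: lingered for too long
    Ticket(42, None, None),        # ignored: still open
]
print(counts_quickly_solved(tickets))  # {41: 2}
```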
We are also able to measure the number of hotfixes; fewer is better. I hadn't thought enough about the repercussions, so I missed the fact that such a metric would encourage the developers to merge several hotfixes into one, in order to do fewer of them. And that's exactly what they did. Very soon, the metric was considered harmful and abandoned.
While the business didn't know exactly why customers were leaving, there was already a metric measuring how many new customers arrived per year, and how many left. Given the narrow domain and the small number of potential customers, it is not surprising that the metric was measured once per year, and not more frequently. A new customer, or a customer who left, was a big event for the company.
Another metric available to us came from human resources. They knew exactly how many developers were sick (including burnouts), and for how long. We took care, however, to put this value in context. People often get sick in October and November; such external factors should be accounted for, and the metric adjusted accordingly.
Finally, a Gantt chart with the different features released or being developed was created, enriched with extra data showing when the competitor released a similar feature. Figure 2 shows a simplified version of such a Gantt chart. Green (1) means we're great: we released the feature first. Red means the competitor was faster, sometimes much faster than us (2). An additional visualization compared the differences in release time over the past year, in order to see the trend.
Figure 2 The Gantt chart, with the embedded release dates of the feature by the competitor.
Next, it was time to talk with the developers. While it was clear that something was wrong around the number of bugs reaching production, it was still necessary to find out why this was happening.
- Why are there bugs and regressions? Because we work under pressure.
- Why do you work under pressure? Because business needs features fast.
- Why do you think the business needs features fast? Because there is a competitor who may release those features sooner.
- Why do you mention the competitor in the first place? Because the stakeholders focus on the competitor when talking about the priorities.
- Why do the stakeholders do that? Because the competitor's features make us lose customers.
Essentially, customers were leaving for the competitor because of the bugs, and the bugs existed in the first place because the customers were leaving for the competitor. This circular relation had to be broken at some point.
One immediate consequence was that the Gantt chart, especially the comparison with the competitor, had to be hidden from the developers, as it would only make things worse. The only effect of such a visualization would be to increase the stress, and so increase the number of bugs and slow down development: the opposite of what we wanted.
Next, the stakeholders were asked to stop talking with the developers about the competitor. It has absolutely no value. I took time to explain to the stakeholders that features, and the time it takes to deliver them, are not the only things that matter. For them, they were crucial. But comparing their priorities with the concerns of the boss exposed the discrepancy, which forced them to adjust their priorities accordingly, putting more focus on stability.
Two months later, I came back to the company to see how things were going. Over those two months, the number of bug tickets had slightly decreased, but there was nothing impressive. Two sprints per month meant four data points, which is not very representative. Talking with the team members, I had the impression that nothing had actually changed. So we held another five whys session. We took a ticket which had been solved the day before. The bug consisted of a wrong conversion being applied to a given value, which impacted most of the customers in a quite critical way. The unit tests were unable to catch the issue, because the developer had introduced the very same conversion error in the tests as well. The product owner hadn't noticed the problem either, and it was only after a call from an angry customer that the team found out the conversion was wrong.
- Why did the bug happen in the first place? Because I was exhausted and didn't notice that I called the wrong method.
- Why did nobody else catch the mistake? Because there is no pair programming or code reviews.
- Why isn't there pair programming and code reviews? Because we don't have time for that.
- Why don't you have time for that? Because we need to ship quickly.
- Why do you need to ship quickly? Because there is a competitor who may release the same feature faster than us.
Here we are, back to the original problem we had two months earlier. The stakeholders were still emphasizing the importance of releasing early, despite the fact that the boss had clearly stated what is important for the business. They were still talking about the competitor to the developers, instead of focusing on the number of bugs reported by the support. This time, in order to encourage a change in this behavior, I installed a monitor in the space where the developers were working, showing the number of bugs in the current sprint, a comparison with the average over the last ten sprints, and a chart of the evolution over the past year. This was visual enough to change the focus.
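The numbers on that monitor were simple to compute. Here is a minimal sketch, with made-up per-sprint counts; the per-sprint figures would come from the ticket metric above, and the actual monitor also displayed the year-long chart, which is not reproduced here.

```python
# Sketch: current sprint's bug count versus the average of the previous
# ten sprints, as shown on the wall monitor. Values are made up.
def monitor_summary(bugs_per_sprint: list[int], window: int = 10) -> str:
    """Format the current sprint's bug count against the recent average."""
    if len(bugs_per_sprint) < 2:
        return "not enough data yet"
    current = bugs_per_sprint[-1]
    history = bugs_per_sprint[-(window + 1):-1]   # up to `window` previous sprints
    average = sum(history) / len(history)
    trend = "below" if current < average else "above"
    return (f"current sprint: {current} bugs "
            f"({trend} the {len(history)}-sprint average of {average:.1f})")

# Hypothetical per-sprint counts for the past year (about two sprints a month).
history = [14, 12, 15, 11, 13, 10, 9, 11, 8, 9, 7, 6]
print(monitor_summary(history))
# current sprint: 6 bugs (below the 10-sprint average of 10.5)
```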
I also advised the team to start doing code reviews, and created the corresponding metric. After a code review, the original author can mark that the reviewer provided valuable feedback, and can also mark that the reviewer found a bug. The metric measures the number of those marks per person. We also discussed pair programming, but the team was against it, with the exception of the manager. I advised forgetting about pair programming: the manager's opinion is irrelevant here. If none of the team members want to do it, they won't; if they are forced to do it by the management, they'll find a way to make it ineffective.
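A minimal sketch of how those marks could be aggregated per reviewer, assuming each review record carries the two flags set by the author of the reviewed change (the record shape and names are hypothetical):

```python
# Sketch: per-reviewer counts of "valuable feedback" and "found a bug" marks.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Review:
    reviewer: str
    valuable_feedback: bool   # marked by the author of the reviewed change
    found_bug: bool           # marked by the author of the reviewed change

def review_marks(reviews: list[Review]) -> dict[str, dict[str, int]]:
    """Count, per reviewer, reviews marked as valuable or bug-finding."""
    feedback, bugs = Counter(), Counter()
    for r in reviews:
        if r.valuable_feedback:
            feedback[r.reviewer] += 1
        if r.found_bug:
            bugs[r.reviewer] += 1
    reviewers = sorted(set(feedback) | set(bugs))
    return {name: {"valuable": feedback[name], "bugs": bugs[name]} for name in reviewers}

reviews = [
    Review("alice", valuable_feedback=True, found_bug=True),
    Review("alice", valuable_feedback=True, found_bug=False),
    Review("bob", valuable_feedback=False, found_bug=True),
]
print(review_marks(reviews))
# {'alice': {'valuable': 2, 'bugs': 1}, 'bob': {'valuable': 0, 'bugs': 1}}
```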
Regarding the exhaustion factor, I advised the manager to stop tracking how much time people are present in the office. I made him agree in front of the team that team members can leave earlier if they feel tired, with no negative consequences. It was essential to emphasize that they would not be evaluated based on the number of hours spent in the office. This wasn't an easy task: the custom in the French IT industry is to make the hours spent at work a major, and sometimes the only, measurement. I also devised a metric which would show whether the exhaustion factor drops over time or not. Developers were asked, once per week, anonymously, whether they could see themselves staying in this company for the rest of their life. A few months later, I would use a very similar measurement for another team: the Friday motivation metric, “I want this week to end so badly!”
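The aggregation behind that weekly question is trivial; the hard part was keeping it anonymous. A sketch, with made-up answers:

```python
# Sketch: weekly share of anonymous "yes" answers to the retention question.
# The collection mechanism (paper slips, anonymous form, ...) is out of scope.
def weekly_retention_share(answers: list[bool]) -> float:
    """Fraction of anonymous 'yes' answers for one week."""
    return sum(answers) / len(answers) if answers else 0.0

week_12 = [True, False, False, True, False, True, False]   # made-up answers
week_13 = [True, True, False, True, True, True, False]
print(f"week 12: {weekly_retention_share(week_12):.0%}")    # week 12: 43%
print(f"week 13: {weekly_retention_share(week_13):.0%}")    # week 13: 71%
```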
Two months later, it was time to check the results. Those results were rather impressive:
- The deployment process is fully automated, and takes about one minute. Nice.
- Code reviews work great. They helped find two bugs the first month and four bugs the second one; that's six bugs which never reached production.
- The number of bug-related tickets created by the support dropped significantly. I had to spend some time checking the raw data to find out whether the metric was being gamed or not. It seemed that it wasn't.
- When I asked the manager whether some people were staying less time in the office, he gave me the exact answer I was hoping to hear: “No idea!” This shows that he actually hadn't tried to check who left the office when, and that's great.
- Many more developers started considering that they could stay in this company for the rest of their life. This could be due to the fact that the deployment was automated. Or the fact that there was less stress now. Or the fact that the quality of the software product increased. Or possibly all those factors combined.
- The Gantt chart was never shown to the developers again; only the boss could see it. There was nothing significant there yet, however. Nor did the boss have any significant data yet about new customers and the customers who left.
In order to show the team that they were going in the right direction, and to encourage them to continue, I suggested to the management to give a financial reward to all the members of the team, and to define a series of further rewards for when the number of bug-related tickets hits specific thresholds for the first time over several consecutive sprints. I also asked the boss to congratulate the developer who found the most bugs during code reviews (three out of six), as well as the whole team, which prevented six bugs from affecting the customers.
While talking with the team, I noticed a few concerns over the quality of the tests. Indeed, there was a bunch of complex methods which weren't tested well enough. I advised setting up a monitor showing the branch coverage of all methods with a complexity higher than x, where x would be adjusted by the team over time. The original value I suggested, for instance, was completely wrong, and was adjusted two days later.
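A minimal sketch of what such a monitor could compute, assuming the complexity and coverage tools already in use can export per-method figures (the report format below is hypothetical):

```python
# Sketch: list methods whose complexity exceeds a threshold x and whose
# branch coverage is below a target. Input format is hypothetical.
from dataclasses import dataclass

@dataclass
class MethodReport:
    name: str
    complexity: int         # e.g. as reported by a static analyzer
    branch_coverage: float  # 0.0 .. 1.0, as reported by the coverage tool

def undertested_methods(
    reports: list[MethodReport],
    complexity_threshold: int,     # the "x" adjusted by the team over time
    coverage_target: float = 0.8,
) -> list[MethodReport]:
    """Keep only complex methods whose branch coverage is below the target."""
    return sorted(
        (r for r in reports
         if r.complexity > complexity_threshold and r.branch_coverage < coverage_target),
        key=lambda r: r.branch_coverage,
    )

reports = [
    MethodReport("Invoice.compute_total", complexity=14, branch_coverage=0.45),
    MethodReport("Invoice.format_line", complexity=3, branch_coverage=0.20),   # simple: ignored
    MethodReport("Converter.to_metric", complexity=9, branch_coverage=0.95),   # well tested: ignored
]
for r in undertested_methods(reports, complexity_threshold=8):
    print(f"{r.name}: complexity {r.complexity}, branch coverage {r.branch_coverage:.0%}")
# Invoice.compute_total: complexity 14, branch coverage 45%
```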
Three years later, I contacted the boss again to see how things were going. The company is doing great, unlike the competitor. There are some important new customers. A number of other customers left, but not because of bugs; the boss was unwilling to disclose the real reason. A new metric appeared: the number of deployments. There are now about one hundred deployments per year, with up to four releases on some days. The high frequency means that there is no need for hotfixes any longer: if there is a bug, it will be solved in the next ordinary release. Developers have fewer health problems, which is absolutely great. The boss didn't know the state of the code coverage, but from the positive feedback he gave about the developers, I'm sure they got their tests right. The Gantt chart I was so proud of was thrown away: instead, time-to-market (TTM) is now measured. The management learned their lesson: the development team doesn't even know the TTM metric exists, so they no longer work under pressure.
Comparing the story in this article with the previous one, this one looks messy. In the American company, Andy had set up simple, straightforward metrics which were considered permanent. There were no revisions, no new metrics over time. The metrics in the French company, on the other hand, were rather complex and changed all the time. Old ones were removed. New ones were created. The following illustration shows the volatility of the metrics. Notice that not one survived unchanged over the three years and four months.
Figure 3 The volatility of the metrics over time.
The fact is, this volatility is what makes the system effective. As the system being measured evolves, so should the metrics. The metrics influence the system, and the system influences the metrics. If the metrics are not adjusted to the evolving system, there is no certainty that they will bring any benefit, and more often than not, a metric that doesn't evolve starts to be actively harmful very fast. In Andy's case, the metrics were wrong from the beginning, but that is not his major mistake. In the story above, I made spectacular mistakes too, by suggesting metrics which turned out to be harmful from the beginning. However, I never set those metrics in stone. We drafted them all together and went to see what would happen. As soon as we noticed that a metric was impacting the system negatively, we either modified it, or got rid of it if it was beyond salvation. When it comes to the metrics themselves, more volatility is better.