A failure as an opportunity

Arseni Mourzenko
Founder and lead developer
November 10, 2014
Tags: management, communication

Failures happen, especially in the software development industry, where many factors can cause them, from unrealistic customer expectations to incompetent staff to severe communication issues.

In general, studies show that around 50% of IT projects fail. Taken alone, the number is of course meaningless: what was the study about? What counts as a failed project? How was the study done? Still, it is a good hint that approximately half of all projects will either fail completely, or at least end up over budget, late, or both.

Personally, observing the software development industry in France, I would expect more than 99% of projects to end up as failures, or at least over budget or late. Even projects done for multinational corporations based in France have too many flaws: legacy systems, a code base too ugly to attract competent people, management issues, communication issues, a lack of standardization, etc. Add to this a multitude of tiny companies which don’t have money for their IT projects, and you understand that a 99% failure rate is not an overestimate.

Failures are inevitable, but they are also a good opportunity to improve. Instead of being fatalistic and throwing away a failed project only to quickly start failing at the next one, it is much better to benefit from failures by studying and understanding them.

Study your fail­ures

When a project fails, there is sometimes one reason and often several. The reasons are too numerous to be listed here, but I can quote a few examples:

Those are the first-class reasons, which may themselves be caused by second-class reasons such as the budget being too low, the company scoring low on the Joel Test, or the staff being incompetent.

When studying failures, it is mostly useful to identify the first-class reasons. Nobody really cares about the other ones, except the marketing folks or the person who will need to explain the reason for the failure to their boss. In order to improve, you need to know only the first-class reasons.

First-class rea­sons show what you need to im­prove. Sec­ond-class rea­sons show what will help you to im­prove it.

For example, the fact that the project failed because of the lack of tools such as a bug tracking system or version control (first-class) is relevant technically. It means that unless you set up a bug tracking system and version control, you will fail stupidly again and again. The fact that you don’t have those elementary tools because you don’t have the money to deploy and maintain them (second-class) is irrelevant technically, but it helps when asking for more money for the IT department for this sort of task.

Another example: the fact that the project failed because of the lack of communication between the QA department and the developers (first-class) is relevant technically: unless you improve the communication between the two departments, there will be more and more failures. The fact that the communication is so bad because the leader of the QA department is a moron and nobody can bear him (second-class) is a reason to fire him and hire a better person instead.

Technically speaking, first-class reasons are crucial in order to improve the quality of your software. Large, fatal failures are no more than hints that something is going wrong. Not totally wrong, but wrong anyway. It’s not totally wrong to never document the deployment of your product to the server, but if it costs you a $50,000 project and you’re a small company, well, you’d better create a deployment plan for your next projects.

Pre­pare for fail­ure

You don’t have to wait for a fatal failure in order to fail. As I said previously, fatal failures are a hint that something is going wrong, but they are hints among others. Because of their consequences and their impact on the business, we perceive them as something very special and very important, but as indicators, they are similar to any other hint.

Consider the following dialog over the phone between two developers, Lucy and Robert:

Sorry, have you modified the `AutomationWorkflow.cs` file?
I have. Is there something wrong?
I believe that you’ve erased the work I did yesterday.
Oh… em… I’m sorry. Do you have a backup copy?
No, I don’t. Do you?
I don’t. You know, I never back up the source code.
OK. I think I’ll have to redo my last four hours of work from yesterday. Bye.
Good luck. Bye.

The project hasn’t failed just because Lucy lost four hours of her work. She can still do the same work again, and Lucy and Robert can deliver the project on time. But this is a good hint which clearly shows that something is going terribly wrong in their company, and it is not really different from a total failure of the entire project caused by the lack of source control and a backup plan.

This means that instead of waiting until the project fails, you can study failure right now, before the project fails for real. Document the lost productivity, document the loss of data or any other issues, and study them. And if one day the whole project fails, study not only the failure itself, but also how you could have predicted it.
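Documenting small failures does not need special tooling. As a minimal sketch, assuming a hand-maintained list of incidents, the following Python snippet sums the lost hours per first-class reason. The `Incident` structure, the `hours_lost_by_reason` function, and every entry except the phone call above are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One small failure, written down when it happens (hypothetical structure)."""
    description: str
    first_class_reason: str  # what has to improve
    hours_lost: float


def hours_lost_by_reason(incidents):
    """Sum the lost hours per first-class reason, largest loss first."""
    totals = defaultdict(float)
    for incident in incidents:
        totals[incident.first_class_reason] += incident.hours_lost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative entries only; the first one is the phone call above,
# the other two are made up.
incidents = [
    Incident("Work on AutomationWorkflow.cs overwritten", "no version control", 4.0),
    Incident("Bug report forgotten after a hallway chat", "no bug tracking system", 1.5),
    Incident("Second overwritten file", "no version control", 2.0),
]

for reason, hours in hours_lost_by_reason(incidents):
    print(f"{reason}: {hours:.1f} h lost")
```

Sorting by the hours lost puts the most expensive first-class reason at the top, which is usually the one worth fixing first.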

Remember: the earlier you fail, the less money is wasted. Failing early enough also gives you the opportunity to start over on a better foundation.

It still fails

Continuous failure assessment is a good way to reduce the failure rate, since it shows what you are doing wrong. It’s like a profiler for a program: it shows where your program is slow.

Just as knowing the results of profiling doesn’t magically improve your code, continuous failure assessment doesn’t by itself reduce your failure rate in every case. There are a few reasons for this: