The cost of always-on, never fail

Arseni Mourzenko
Founder and lead developer
November 6, 2014
Tags: reliability, business-continuity, short

It is not unusual for companies of any size to order a website which should “never fail”. Business-wise, this requirement is understandable: if your corporate website is down, you're not giving a positive image of your company. The only issue is that few organizations really need it, and even fewer can truly bear the cost inherent in such a requirement.

Do I need it?

There are perfectly valid cases where you should have a never-fail system. Those cases include systems whose failure is likely to cause injury or death, and systems whose failure will have direct, major economic consequences.

The first case is easy to illustrate. A system which controls the cooling of a reactor in a nuclear plant should never fail. A system which allows the command center to communicate with military units in the field should never fail. By contrast, a system which provides electricity for a city doesn't fall into this category, since a failure of, say, ten minutes will not usually be the direct cause of anyone's death. The power in a hospital, however, must always be on, which is why hospitals have UPS units and backup generators for the case where the main power grid is down.

The second case is more subtle. The fact that a small company loses one customer because the website was down when that customer wanted to visit it is not reason enough to migrate to a system which never fails. In terms of money, a nearly 100% reliable system makes sense only for NASA-scale projects, where it's more appropriate to spend millions of dollars ensuring that the software is reliable enough than to fail a mission.

Unless you're a military organization, an organization capable of spending billions of dollars on a single project, or an entity building a software product on which human lives will depend, always-on/never-fail systems are not for you.

I still want it, because I want to be like all those large companies.

Over the last two months, I have seen the website of the French railway company, SNCF, go down for at least half an hour on at least three occasions.

Microsoft Azure cloud was once down for eight hours.

Fog Creek was recently forced to shut down its web applications, such as FogBugz, because part of a data center it uses was flooded after a hurricane.

Most companies, including the largest ones, have had major downtime at least once.

Being a large corporation is not incompatible with outages and other failures. After all, it's not the failure which is important, it's how you handle it.

Still not convinced. So can I? How much would it cost?

First, you need two database administrators who will be ready to wake up at 3 A.M. if it appears that the data center is down. Why two? Because such working conditions are hard to bear, and while it is acceptable to require someone to be within reach at any moment for a month, it's also perfectly understandable that this person would need the following month to spend time with her family and friends.

If I were hired for a job like this, I would request $10 000 per month of work, working six months per year. Averaged over the whole year, that is $5 000/month, which seems reasonable given that the working conditions are really terrible and that, as an ordinary system administrator, I would expect to be paid at least $3 000/month.
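The staffing figures above can be turned into a yearly budget. Here is a minimal sketch in Python, assuming two administrators who alternate months and are each paid only for the six months they are on call; the function name and its parameters are hypothetical, and the amounts are the ones quoted in the text.

```python
# Rough yearly cost of the on-call staffing described above.
# Assumptions: 2 administrators, each on call 6 months per year,
# each paid $10 000 per month actually worked (figures from the text).
def on_call_staffing_cost(admins=2, rate_per_worked_month=10_000,
                          worked_months_per_year=6):
    per_admin_per_year = rate_per_worked_month * worked_months_per_year
    return {
        # What one administrator earns over a full year.
        "per_admin_per_year": per_admin_per_year,
        # The same amount spread over all twelve months.
        "per_admin_monthly_average": per_admin_per_year / 12,
        # What the company pays for round-the-clock coverage.
        "team_per_year": admins * per_admin_per_year,
    }

costs = on_call_staffing_cost()
for name, amount in costs.items():
    print(f"{name}: ${amount:,.0f}")
```

With these assumptions, each administrator averages $5 000/month over the year, and the company pays $120 000 per year for people alone, before any hardware or software cost.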

In order to have an always-on system, one should have solid hardware, and by solid, I mean expensive and redundant enough. For example, a single data center is not enough, since it may be flooded, burned, powered down for an unknown duration or destroyed by a UFO. A single data center is already very expensive; with two of them, the cost doubles. I'll not give any indication of the cost of a data center since it depends on too many factors, but believe me, they are hugely expensive. Note that building the data centers is not enough. They have to be maintained, and they consume energy.

Finally, the cost of an application which has a high level of reliability bears no comparison to the cost of ordinary applications. While one can have a custom e-commerce website built from scratch for only $5 000, the most basic websites requiring high reliability will start at $100 000. This is the cost of a disaster recovery plan, rigorous testing and code reviews, and dozens of other things which are rarely done for ordinary projects.