The cost of always-on, never fail

Arseni Mourzenko
Founder and lead developer, specializing in developer productivity and code quality
November 1, 2012

It is not unusual for companies of any size to order a website which should “never fail”. Business-wise, this requirement is understandable: if your corporate website is down, you're not giving a positive image of your company. The only issue is that few organizations really need it, and even fewer are truly capable of bearing the cost inherent in such a requirement.

Do I need it?

There are perfectly valid cases where you should have a never-fail system. Those cases include systems whose failure is likely to cause injury or death, and systems whose failure will have direct, major economic consequences.

The first case is easy to illustrate. A system which controls the cooling of a reactor in a nuclear plant should never fail. A system which allows the command center to communicate with military units in the field should never fail. By contrast, a system which provides electricity for a city doesn't fall into this category, since a failure of, say, ten minutes will not usually be the direct cause of anyone's death. The power in a hospital, however, must always be on, which is why every hospital has UPS units and backup generators for the case where the main power grid is down.

The second case is more subtle. The fact that a small company loses one customer because the website was down when that customer wanted to visit it is not enough to justify migrating to a system which never fails. In terms of money, a nearly 100% reliable system makes sense only for NASA-scale projects, where it's more appropriate to spend millions of dollars ensuring that the software is reliable enough than to fail a mission.

Unless you're a military organization, an organization capable of spending billions of dollars on a single project, or an entity building a software product on which human lives will depend, always-on/never-fail systems are not for you.

I still want it, because I want to be like all those large companies.

In the last two months, I have seen the website of the French railway company, SNCF, go down for at least half an hour on at least three occasions.

Microsoft's Azure cloud was once down for eight hours.

Fog Creek was recently forced to shut down its web applications, such as FogBugz, because part of a data center it uses was flooded after a hurricane.

Most companies, including the largest ones, have had major downtime at least once.

Being a large corporation is not incompatible with outages and other failures. After all, it's not the failure which is important, it's how you handle it.

Still not convinced. So can I? How much would it cost?

First, you need two database administrators who will be ready to wake up at 3 A.M. if it appears that the data center is down. Why two? Because such working conditions are hard to endure, and while it is acceptable to require someone to be on call at any moment for a month, it's also perfectly understandable that this person would need the following month to spend time with family and friends.

If I were hired for a job like this, I would request $10 000 per month of work, working six months per year. Averaged over the whole year, that comes to $5 000/month, which seems reasonable given that the working conditions are really terrible and that, as an ordinary system administrator, I would expect to be paid at least $3 000/month.
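
To make the arithmetic explicit, here is a minimal sketch in Python. The figures are the ones quoted above; the constant and function names are purely illustrative, and the calculation assumes two administrators who alternate month by month and are paid only for the months they are on call.

```python
# Figures from the text above; all names here are illustrative assumptions.
RATE_PER_ON_CALL_MONTH = 10_000   # dollars paid for each month spent on call
ON_CALL_MONTHS_PER_YEAR = 6       # one month on call, one month off
ADMINISTRATORS = 2                # two people alternating


def yearly_pay_per_admin(rate=RATE_PER_ON_CALL_MONTH,
                         months=ON_CALL_MONTHS_PER_YEAR):
    """Total yearly pay for one administrator."""
    return rate * months


def average_monthly_pay(rate=RATE_PER_ON_CALL_MONTH,
                        months=ON_CALL_MONTHS_PER_YEAR):
    """Yearly pay spread evenly over all twelve months."""
    return yearly_pay_per_admin(rate, months) / 12


def yearly_staffing_cost(admins=ADMINISTRATORS):
    """Yearly pay for the whole on-call rotation."""
    return admins * yearly_pay_per_admin()


if __name__ == "__main__":
    print(yearly_pay_per_admin())   # 60000
    print(average_monthly_pay())    # 5000.0
    print(yearly_staffing_cost())   # 120000
```

Under these assumptions, the on-call rotation alone costs $120 000 per year in salaries, before any hardware is bought.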

In order to have an always-on system, one should have solid hardware, and by solid, I mean expensive and redundant enough. For example, a single data center is not enough, since it may be flooded, burned, powered down for an unknown duration or destroyed by a UFO. A single data center is already very expensive; in order to have two of them, the cost will be multiplied by two. I'll not give any indication of the cost of a data center since it depends on too many factors, but believe me, they are hugely expensive. Note that building the data centers is not enough. They have to be maintained, and they consume energy.

Finally, the cost of an application which has a high level of reliability has nothing to do with the cost of ordinary applications. While one can have a custom e-commerce website built from scratch for only $5 000, the most basic websites requiring high reliability will start from $100 000. This is the cost of a disaster recovery plan, rigorous testing and code reviews, and dozens of other things which are rarely done for ordinary projects.