Topography of tests

Arseni Mourzenko
Founder and lead developer
September 3, 2015
Tags: testing, productivity, quality, featured

Important note: this article is written for developers who don't practice TDD, that is, more than 99% of the developers I know. TDD is a very different world, and a few assertions I make in this article don't apply there. For instance, my skepticism towards unit testing and the suggestion to start with other forms of testing make little sense with TDD, where unit tests are necessarily available before production code, and necessarily cover all or most of the business cases.

“It's nearly finished. I just have to run a few tests to check that everything works as expected. I'll spend no more than two hours covering all the cases,” a programmer assured his project manager, who was asking how soon the feature would be ready. The project manager, satisfied by the answer, thanked the programmer and left the room.

The project manager should have had a very different reaction. Something which looks more like “WTF is wrong with you?!”

Testing has two parts: exploration, which consists of determining what should be tested and how, and verification, which consists of running the tests themselves. The first one is (mostly) manual. The second one has to be fully automatic.

Unfortunately, many inexperienced programmers still run their tests by hand. This is problematic for several reasons: manual testing takes time, it doesn't scale as the project grows, and humans are unreliable at repeating the same checks over and over.

The previous points may give you the impression that automated testing is something which should be implemented consistently in every project, and that manual tests shouldn't exist. In practice, it doesn't work this way.

First, not all code is well suited for tests. Imagine your application in two parts. The first part is the core of your app: its business logic, the essence of it, the reason it exists. It's in this core that you find the most interesting stuff. And then there is a second part, the one at the edges of your application, the one which makes it possible for this application to interact with the world: show stuff on the screen, listen to the user's voice, store data in files, get something from a database.

The first part should be tested in depth, automatically. Mocks and stubs of its parts and of the second part help you do that. The second part, on the other hand, doesn't lend itself well to testing. You can hardly create mocks and stubs for it, so usually, you end up testing it once, and trying not to change it too much.

The mistake of many beginner programmers is to rely too much on the second part, and not to separate the two parts clearly enough. A File.WriteAllText buried deep in your C# business logic can ruin your testing, and I'm not even talking about applications which assume that they necessarily run in the context of ASP.NET MVC or Django. Do yourself a service and make sure the classes of the second part don't mix with the classes of the first part. The first part shouldn't care whether it stores data in a database or a flat file, or whether it runs as a desktop application, a REST service or a website.
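To make the gap concrete, here is a minimal sketch (in JavaScript; the class and method names are mine, not from any real project) of a core class which receives its store from the outside instead of reaching for the file system:

```javascript
// Core business logic (first part): knows nothing about files or
// databases. It receives a store from the outside; any object with
// a save(id, data) method will do.
function OrderProcessor(store) {
    this.store = store;
}

OrderProcessor.prototype.process = function (order) {
    var total = order.items.reduce(function (sum, item) {
        return sum + item.price * item.quantity;
    }, 0);
    this.store.save(order.id, { total: total });
    return total;
};

// Edge of the application (second part): in production this would
// wrap the file system or a database; in tests, an in-memory store
// keeps the core fully testable.
function InMemoryStore() {
    this.records = {};
}

InMemoryStore.prototype.save = function (id, data) {
    this.records[id] = data;
};
```

Because `OrderProcessor` never touches the file system itself, swapping the real store for `InMemoryStore` in tests requires no change to the core.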

While nearly every test case can be automated, it may take time, and sometimes a lot of it. A test which can be executed manually within minutes may easily require months of work to automate. This makes writing such automated tests prohibitive in time and cost. Unless you work on life-critical systems, you can't afford to spend ten months writing tests for a feature which took one hour to implement.

The mistake of many teams is to tell themselves that a pragmatic approach consists of just falling back to manual testing in those situations. What is important to understand is that this is not an automated versus manual testing debate, but a testing versus lack of testing debate: as I explained, there is no way you can reasonably scale with manual tests.

In other words, you end up having a suite of automated tests on one hand, and the lack of tests on the other hand, with manual testing as a bandage.

One should note that exploratory testing is usually all manual (although there are some tools which let you automate parts of the process, and their marketing departments will do everything to convince you that those tools are magic and will change your life, most of the work will still be done by hand). Exploratory testing, intended to discover the test cases you have to automate, consists of wandering around the different features of your app, trying to break things by entering invalid values, removing files the application needs, changing permissions, and cutting the internet connection while the app is actively using it.

Exploratory testing deals with both the valid cases, such as “When the app says it stores a record in the database, does it actually store it?”, and the invalid ones, that is, edge cases which could be implemented incorrectly, such as “What if I put letters in a field which requires a number?” or “What if I submit a form containing a value which was deemed invalid by the JavaScript validation?”

The workflow which appears to be quite successful and is generally used consists of starting with exploratory testing and, when a bug is found, writing an automated test which reproduces the bug. A particular form of this is done by the support team which handles reports from customers. In a way, customers are just performing exploratory testing without even noticing it.

Now that I have explained the most important aspect of testing, automation and its limits, let me focus on the tests themselves and their topography. I'll start by talking about tests in general to explain what they are, and then describe the most common types of tests.

What is a test?

A test is anything which can be automated, has a binary result (pass or fail), and ensures that a given part of the application is working as expected.

You may have noticed that I haven't included anything about the form of the test. It can be expressed through code, but this is not a necessity. Tests can also take the form of a script, or of data used by Selenium. The form doesn't matter.

Testing frameworks are not necessary either, but they are often very useful. For the simplest projects, tests can take the form of a simple Bash or PowerShell script and/or a piece of code. For larger projects, a testing framework enables a more formal approach and makes it much easier both to write the tests (by providing convenient methods for comparing sequences or asserting expected exceptions, for example) and to execute them (by providing a common interface and reporting capability).
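As an illustration of how little machinery is needed, here is a sketch of a frameworkless test (the `slugify` function is invented purely as something to test): a script which prints its results and exits with a binary status.

```javascript
// A test without any framework: run a few checks, print the results,
// and exit with a binary status.
function slugify(title) {
    return title.toLowerCase().trim().replace(/[^a-z0-9]+/g, "-");
}

function runTest(name, actual, expected) {
    var passed = actual === expected;
    console.log((passed ? "PASS" : "FAIL") + ": " + name);
    return passed;
}

var allPassed =
    runTest("lowercases the title", slugify("Hello"), "hello") &&
    runTest("replaces spaces", slugify("a b c"), "a-b-c");

// The binary result a test requires: the script succeeds or fails.
process.exitCode = allPassed ? 0 : 1;
```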

Why do we test?

Junior programmers sometimes believe that tests are a form of guarantee that the code is free of bugs. This is absolutely not the case. The number of tests or the branch coverage is irrelevant: you may have a lot of tests and still a lot of bugs. What makes it possible to tell that code has no bugs is formal proof, which is a very different technique.

The primary goal of testing is to find regressions. When changing code, regressions are unavoidable, even with practices such as pair programming or code reviews. Tests, given enough coverage, help find some of the regressions which were missed by programmers and reviewers.

This makes the lives of developers much easier. When working on a project which doesn't have tests, any change carries a high risk of breaking something. This usually discourages programmers from changing anything, which, in turn, means that no refactoring is done. In turn, this means that technical debt increases constantly, and the project blows up sooner or later.

The lack of tests also usually leads to bad working conditions. Not only can't programmers do their jobs correctly by refactoring on a daily basis, but they are also constantly stressed, afraid of breaking things. Such conditions are harmful and should be avoided in any company which has the slightest respect for its employees.

The second goal of tests is to make it easier to locate bugs. Unit tests, for instance, make it possible to pinpoint the location of a bug very precisely compared to integration and other tests. Usually, they pinpoint the method at the origin of a bug, which is nice if methods are short enough. Tests other than unit tests help locate bugs as well: while they don't have the precision of unit tests, they still give you some hints about the possible location.

Tests help locate bugs not only in space, but also in time. If they run after every commit and commits are made on a regular basis, the reports can show very well that a regression appeared in a given revision. This leads us to the question of the periodicity of test runs.

When to run tests?

One could think that a natural place for the tests would be the pre-commit phase. This would not only prevent code with regressions from reaching version control, but also ensure developers are informed soon enough about the regressions.

Unfortunately, this is impossible for all but tiny projects. As I already explained, running tests in a pre-commit hook is problematic for two reasons: it executes custom code, which is very problematic in terms of security (remember, pre-commit hooks are executed by the version control server), and it takes time.

The speed problem means that developers would have to wait on every commit, which will discourage them from committing their code in the first place; besides, the tests of most projects take from a few minutes to a few days to run anyway. Even small delays of a few seconds are very problematic, which is also the reason linters often have no place in pre-commit hooks.

Instead, tests should run within the continuous integration flow, some tests running during the build, others being handled by the CI server itself.

What types of tests are there?

Smoke tests

Ask any junior programmer which tests they should implement first. “Unit tests” will be the answer, and it's the wrong answer.

It's practically like asking a junior programmer to list the design patterns they know. Singleton will be the first answer, and usually the only answer, while it is probably one of the most useless patterns, and also the most misused one.

Projects handled by inexperienced teams usually follow a predictable pattern. They start with a bunch of unit tests. Everyone is motivated and writes a lot of unit tests, a few of them useful, many more simply redundant.

Later, the team writes fewer and fewer unit tests. At some point, the old ones are not updated any longer, and a few months later, running them will show that a few dozen fail.

Then the team eventually makes a few attempts to either write new tests or update (or simply remove) the broken ones, but the branch coverage continues to decrease, and nothing can stop the decay.

From the moment only a part of the code base has unit tests, the value of unit testing drops substantially.

For this reason, the very first tests which should be implemented are smoke tests. Smoke tests consist of verifying a given flow within the system. It could be, for example, a set of operations an average user will commonly perform. Usually, the operations are complex and involve several subsystems at once. For instance, a smoke test of an e-commerce website could consist of the following steps:

  1. Search for a product.
  2. View the product.
  3. Add the product to the cart.
  4. Go to the cart.
  5. Enter a rebate code.
  6. Change the quantity of the product.
  7. Attempt to purchase it. This leads to a registration page.
  8. Register.
  9. Start purchasing the product, entering wrong credit card info.
  10. Enter correct credit card info and finish the payment process.
  11. Go to the list of purchases and ensure the operation is there.
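
In a real project these steps would drive a browser through a tool such as Selenium. As a sketch of the shape of such a test, here is a condensed version running against an in-memory stand-in for the shop (the `Shop` object and its methods are invented for the illustration):

```javascript
// A tiny in-memory stand-in for the e-commerce site, invented
// so the smoke test below is self-contained.
function Shop() {
    this.catalog = { "p1": { name: "Blue mug", price: 10 } };
    this.cart = [];
    this.purchases = [];
}

Shop.prototype.search = function (term) {
    var results = [];
    for (var id in this.catalog) {
        if (this.catalog[id].name.toLowerCase().indexOf(term.toLowerCase()) >= 0) {
            results.push(id);
        }
    }
    return results;
};

Shop.prototype.addToCart = function (id, quantity) {
    this.cart.push({ id: id, quantity: quantity });
};

Shop.prototype.checkout = function (cardValid) {
    if (!cardValid) return false;                      // wrong credit card info
    this.purchases = this.purchases.concat(this.cart);
    this.cart = [];
    return true;
};

// The smoke test itself: one long, realistic flow with a binary result.
function smokeTest() {
    var shop = new Shop();
    var found = shop.search("mug");                    // step 1: search
    if (found.length !== 1) return false;
    shop.addToCart(found[0], 2);                       // steps 3 and 6: add, quantity
    if (shop.checkout(false) !== false) return false;  // step 9: wrong card rejected
    if (shop.checkout(true) !== true) return false;    // step 10: correct card accepted
    return shop.purchases.length === 1;                // step 11: purchase recorded
}
```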

When such a smoke test passes, chances are good that the application is running mostly correctly and accomplishes its primary goal. Customers are probably able to view and purchase products.

When such a smoke test fails, it's a sign that something went completely wrong. There is no way to deliver the code to production in its current state, because it can, and probably will, lead to major downtime for the business.

The benefit of smoke tests over unit tests is their number. A team unable to follow TDD or keep code coverage close to 100% may still be inclined to make three to five tests work for every release. Psychologically, it's easier to keep a few tests updated than a few thousand.

Another benefit is that unit tests won't necessarily reveal issues which appear at a higher scale, that is, at the integration or system level. A smoke test, on the other hand, acts at the highest abstraction level: it doesn't even know about the subsystems, and interacts directly with the whole system.

The obvious drawback of smoke tests is that if they are the only tests you have, many situations will remain untested. To prevent this from happening, unit, integration and system tests should be used.

System, unit and integration tests

Having smoke tests is good, but not enough for most products. While smoke tests watch your mainstream flow, most situations remain untested, and problems which occur there will be discovered the hard way: through angry calls from customers.

This means that the coverage, that is, the area under test, should be increased. This is done through system testing. System tests are no different from smoke tests in that they perceive the system as a whole, without entering into the details. The difference is that (1) they are usually more dissociated from the actual flow of the users, that is, the actual use cases, and (2) they are often more granular, in other words they perform fewer actions.

If we take the previous example of an e-commerce website, a system test can create an environment where a purchase is done, ask, through the website, for a cancellation, and test whether the cancellation is actually done.

When you start writing smoke and system tests, you may notice a very annoying thing: whenever you break something, a bunch of system tests stop working, but none gives you a hint about the location of the regression. Since smoke and system tests involve dozens or hundreds of classes and methods, a regression in any of those classes or methods affects them. Thus, you need a more granular approach.

This is where unit testing comes in. Each unit test has a very small working surface: in general, it is limited to a method or a handful of methods within a class. This makes unit tests very precise at locating regressions. For the same reason, a regression will generally cause only one or a few unit tests to fail, meaning that you can focus on those tests and the small amount of code they cover.
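As a sketch (the `applyRebate` function and its rebate code are invented), a unit test exercises one small method and nothing else:

```javascript
// Unit under test: one small function, exercised in complete isolation.
// Hypothetical rebate rule: the code "SAVE10" takes 10 off the total,
// never going below zero.
function applyRebate(total, code) {
    if (code === "SAVE10") return Math.max(total - 10, 0);
    return total;
}

// Three unit tests; if one fails, the culprit can only be applyRebate.
console.assert(applyRebate(100, "SAVE10") === 90, "rebate applied");
console.assert(applyRebate(100, "WRONG") === 100, "unknown code ignored");
console.assert(applyRebate(5, "SAVE10") === 0, "total never negative");
```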

The fact that unit tests are limited to a small part of the code, executed in isolation from other code, is a benefit, but also a problem. You may have pretty good branch coverage and practically no bugs at the level of a single class, but when you start linking one class to another, things get ugly. On one hand, you have your unit tests which don't help; on the other hand, you have smoke and system tests which don't show you the location of the issue. This is where you can use integration tests. Those tests have a scope larger than unit tests, but not as large as smoke and system tests, meaning that they have the benefits of both worlds.
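A minimal illustration of that middle ground (both functions are invented): two units which each pass their own unit tests can still disagree with each other, and only a test that links them catches it.

```javascript
// Two units, individually correct: one parses a price into cents,
// the other formats cents back into text.
function parsePrice(text) {
    return Math.round(parseFloat(text) * 100);
}

function formatPrice(cents) {
    return (cents / 100).toFixed(2);
}

// The integration test checks that the units actually fit together:
// a value must survive a round trip through both.
console.assert(formatPrice(parsePrice("12.50")) === "12.50", "round trip");
console.assert(formatPrice(parsePrice("0.10")) === "0.10", "small amounts");
```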

Functional and acceptance tests

You now have your smoke, system, unit and integration tests up and running. You're confident that everything will be fine, and then you deliver your product, and the stakeholders tell you that, well, it might be working and all, but it's not what they need, and actually, if only you could have read the spec carefully... Well, you know the story.

If your project has functional requirements, you should be writing functional tests too. Those tests verify that the system which was actually built corresponds to the business requirements. Imagine you're working on a word processor. The spec says that the user should be able to change the font size and set the text to bold and italic. The actual product you built makes it possible to change the font size and has a “Bold” button, but no “Italic”. Would you catch this error using smoke, system, unit or integration tests? Probably not: those tests would rather find that your “Bold” button is not doing anything, or that the application crashes when the font size is set to 0, but no test would highlight the lack of an “Italic” button.

Functional tests are often confused with acceptance tests. Acceptance testing consists of determining whether the customer really needs what we built. For instance, if you implemented the font size, “Bold”, “Italic” and “Underline”, acceptance testing may show that the customers don't need “Underline”, but that what they actually need is the ability to change the font. Functional testing is a verification activity; acceptance testing is a validation activity.

Stress and load tests

OK, at this point you know that you built the right thing, which conforms to the spec and works pretty well. But what about its performance?

Stress tests are to the non-functional requirements of performance what functional tests are to functional requirements. They run the entire product or a part of it (sometimes as small as a single method) on specific hardware under a specific load, measure the time it takes to run a given action, and compare it to a threshold.

Remember, one of the characteristics of tests is their binary result: a test passes, or it fails. A basic variant of a stress test which measures one execution of the code and compares it to a value can be too unreliable: if yesterday the code ran in 499.7 ms, and today it took 500.1 ms with a threshold at 500 ms, does it really mean that we have a regression today? Probably not. In order to prevent randomness from affecting the results (and to find a regression as soon as it is created), a stress test can run the same action several times and measure the average.
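A sketch of such an averaging stress test (the helper names are mine; a real harness would also discard warm-up runs and outliers):

```javascript
// Run the same action several times and compare the *average* duration
// to the threshold, so a single noisy run cannot fail the build.
function measureAverageMs(action, runs) {
    var totalMs = 0;
    for (var i = 0; i < runs; i++) {
        var start = Date.now();
        action();
        totalMs += Date.now() - start;
    }
    return totalMs / runs;
}

// The binary result: the average is under the threshold, or it isn't.
function stressTestPasses(action, runs, thresholdMs) {
    return measureAverageMs(action, runs) <= thresholdMs;
}
```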

Load tests are a different beast: instead of testing the performance of a given part of the system, they test its scalability. In other words, they are not measuring how fast your system is, but rather how much load it takes to bring it down. For example, would an e-commerce website work well with fifty customers purchasing something at the same time? Yes? Great. What about two thousand? Maybe sixty thousand?

pdiff: the magic of finding regressions uncaught by other tests

Your product has a few smoke tests, thousands of unit tests, hundreds of integration, system and functional tests, and a bunch of stress and load tests. You feel safe. You know nothing wrong can happen. So on Friday evening, you make a small change, commit your code, check that all tests are still green and, with a feeling of accomplishment, leave, planning to spend a great weekend with your wife.

And then your phone rings. It's your boss. The website is completely screwed. The home page won't even show. Product pages are... well, let's not even talk about product pages. Support is overwhelmed by calls from customers wondering whether the website was hacked.

You rush back to the workplace. You open your favorite browser and, indeed, what you see makes you want to kill yourself, right now. WTF happened?

What happened is that you modified a CSS file. You couldn't see the change because the staging server was serving the old cached minified bundle. The new one, the one you just wrote, crushed the layout of nearly every page on the site.

Thousands of tests were unable to catch this simple mistake in your CSS code. And if you think about it, what unit, functional or system test could you write to prevent such a regression? There is not much you can do there.

Well, it appears that there is. pdiff stands for perceptual diff. It is an algorithm which compares two images and determines whether they look different. Using perceptual diff for testing makes it possible to catch regressions which influence how your website or software product looks, even if nothing changed functionally speaking. You inadvertently changed the padding of an element? pdiff will see it. The text size increased? You'll be notified.
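As a toy illustration of the principle (not of any actual pdiff implementation), here is a comparison of two images given as flat arrays of grayscale pixel values, with a per-pixel tolerance and a threshold on the share of differing pixels:

```javascript
// Toy perceptual diff: both images are flat arrays of grayscale pixel
// values (0-255). Small per-pixel differences are tolerated; the test
// fails only when enough pixels differ noticeably.
function looksDifferent(before, after, pixelTolerance, maxDifferingShare) {
    if (before.length !== after.length) return true; // dimensions changed
    var differing = 0;
    for (var i = 0; i < before.length; i++) {
        if (Math.abs(before[i] - after[i]) > pixelTolerance) differing++;
    }
    return differing / before.length > maxDifferingShare;
}
```

A one-pixel antialiasing artifact stays below the tolerance; a crushed layout moves a large share of the pixels and fails the test.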

Tests which are not tests: A/B and usability testing

Some tests are not actual tests as defined previously in this article. While they have “testing” in their name, they are completely different from the tests we've seen: they are not automated, their results are not binary, and they don't verify that the product is working as expected.

Two testing techniques are especially important:

  1. A/B testing, which consists of showing two variants of the product to different groups of users and measuring which variant performs better.
  2. Usability testing, which consists of observing real users while they interact with the product, in order to discover what confuses or blocks them.

What if I can't test my app?

Every time I audit a project which lacks testing, I hear programmers saying that there are parts which are “too non-deterministic” to be tested. In most cases, they refer to methods based on the current time, and methods which use pseudo-random number generators.

Actually, there is nothing in those two cases which prevents testing. In order to test those methods, one can use stubs and mocks.

Stubs are small pieces of code (usually classes, rarely individual methods) which replace a given functionality with something that does nothing and simply produces consistent, predictable results. For example, if a method relies on the current date and time from now(), a stub can feed this method a constant date and time. A stub for a pseudo-random number generator can be as simple as:

var nextRandomStub = function (seed) {
    // The seed is accepted for signature compatibility, but ignored:
    // the stub always returns the same predictable value.
    return 0;
};

Mocks are similar to stubs, but they are customized within the tests. For instance, the mock of a pseudo-random number generator may require the test to set the actual value which will be returned to the caller. This allows, for example, seeing how the caller reacts to negative values, or to values greater than one.
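Continuing the pseudo-random example, a hand-rolled mock might look like this (a sketch; libraries such as Sinon.js provide ready-made equivalents):

```javascript
// A hand-rolled mock of a pseudo-random number generator: the test
// decides which value the caller will see.
function makeRandomMock(valueToReturn) {
    return function () {
        return valueToReturn;
    };
}

// Code under test: picks an item using whatever random function it is
// given, Math.random in production and the mock in tests.
function pickItem(items, random) {
    return items[Math.floor(random() * items.length)];
}
```

Forcing the extremes is now trivial: `makeRandomMock(0)` yields the first item, `makeRandomMock(0.99)` the last one, and `makeRandomMock(1)`, a value a real generator never returns, exposes how the caller misbehaves on out-of-range input.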

In practically every case, stubs and mocks are all you need. For the sake of completeness, I'll explain two other similar elements, even though I'm not particularly convinced of their usefulness.

Fakes are lightweight implementations of a part of a system. For instance, a component which interacts with a given web service can be replaced by a fake during tests. I have seen no case where a fake cannot be substituted by one or several stubs and mocks.

Fixtures consist of an emulation of an environment. It could be a database with testing data, or an HTML page stored as a static file and used to test JavaScript. While fixtures might make testing look easier at first, tests using them are more difficult to maintain. They may also quickly lead to tests which are slow.

Exploration: what to test?

The general path is obviously something you should test, but be particularly careful with the edge cases as well. Those edge cases are usually difficult to find, and they make the difference between a good and an average tester.

Edge cases can be found by inspecting the algorithm. If a method has two paths, one for positive integers and another for negative ones, you may be interested in testing what happens with zero. Often, edge cases require a deep knowledge of the language and the framework being used. For example, if a method relies on a string, would it work with Unicode? What about a string containing billions of characters? Or zero characters? And what about a null passed instead? White space, maybe?

If you are testing a piece of code, imagine that it was written by the coworker you hate. Imagine you're a hacker. How much damage can you do to this code? How many flaws can you find? If the method expects an integer, give it a float. If the method explicitly asks for a positive integer, give it a negative one. If the method begs for a non-empty sequence, feed it a null.
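As a sketch of that mindset (the `averagePositive` function is invented), here are hostile tests feeding a method exactly what it asks not to receive:

```javascript
// A function which begs for a non-empty array of positive integers,
// written defensively because the tests below attack it.
function averagePositive(values) {
    if (!Array.isArray(values) || values.length === 0) {
        throw new Error("a non-empty array is required");
    }
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        var v = values[i];
        if (typeof v !== "number" || !isFinite(v) || v <= 0 || v % 1 !== 0) {
            throw new Error("positive integers are required");
        }
        sum += v;
    }
    return sum / values.length;
}

// Helper for the hostile tests: did the call blow up as it should?
function throwsOn(fn) {
    try { fn(); return false; } catch (e) { return true; }
}

// The attacks: a null, an empty array, a float, a negative number.
console.assert(throwsOn(function () { averagePositive(null); }));
console.assert(throwsOn(function () { averagePositive([]); }));
console.assert(throwsOn(function () { averagePositive([1.5]); }));
console.assert(throwsOn(function () { averagePositive([-1]); }));
```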

Exploration: third-party tools

This article would be incomplete without mentioning third-party tools which make the exploration step easier. Why would you spend your time finding cases to test, when there are tools which can study your code and find those cases for you?

Microsoft Research's Pex project is one of those tools which can speed up exploration by generating unit tests for you. This has two crucial benefits: it saves the time you would have spent writing the most mechanical tests by hand, and it discovers edge cases you might never have thought of.

Different static analysis tools may also be very useful when moving the code towards formal proof, making the code more reliable while requiring fewer tests.

While various tools can be very helpful in generating tests or suggesting paths to be tested, be very suspicious of any company claiming that their product will write your tests for you. The cake is a lie, and nothing makes it possible to completely skip the exploration step. Tools are too stupid to know what to test and how; they may give hints, but they can't do the work for you. Similarly to a beginner programmer, they will end up creating many tests which are not particularly useful, while missing the important cases. Use them to help you, not to replace you.