
Topography of tests

Arseni Mourzenko
Founder and lead developer, specializing in developer productivity and code quality
July 9, 2015

Important note: this article is written for developers who don't practice TDD, that is more than 99% of developers I know. TDD is a very different world, and a few assertions I make in this article don't apply there. For instance, my skepticism towards unit testing and the suggestion to start with other forms of testing make little sense with TDD, where unit tests are necessarily available before production code, and necessarily cover all or most of the business cases.

“It's nearly finished. I just have to run a few tests to check if everything works as expected. I'll spend no more than two hours to cover all the cases,” a programmer assured his project manager, who was asking how soon the feature would be ready. The project manager, satisfied by the answer, thanked the programmer and left the room.

He should have had a very different reaction. Something which looks more like “WTF is wrong with you?!”

Testing has two parts: exploration, which consists of determining what should be tested and how, and verification, which consists of running the tests themselves. The first one is (mostly) manual. The second one has to be fully automatic.

Unfortunately, many inexperienced programmers still run their tests by hand. This is problematic for several reasons:

  • This approach doesn't scale. It works quite well for tiny projects, that is projects which can easily be made by one programmer in two or three days. Beyond that, manual testing becomes difficult, and it is practically impossible for any project which involves weeks or months of work for several programmers.

    With project size, the number of tests grows quickly. It is not unusual to have thousands of tests for even a small project, and programmers cannot possibly run by hand thousands of tests after every commit.

    This also makes it impossible to have any decent continuous integration even for small projects. Given the general rate of fifty commits per day for an ordinary Agile team, how could they possibly run all the required tests by hand every time?

    Even the simplest tasks will take seconds. How would we be able to compete with machines which will often do the same tasks in less than a millisecond?

  • This approach is error prone. Humans are really bad at repetitive tasks. They don't follow procedures well. They are tempted to take shortcuts. They may not focus well. A simple test such as “enter this in that field; click on this button; click on that; close the message box; verify that this string was actually appended to that file” is not so easy for a human to perform consistently. Sometimes it will work well; sometimes, the tester will click somewhere else, or miss the trailing line break in the file (you do have the “show special characters” option checked in all the editors you use, don't you?), or enter the wrong thing.

    The machine won't have this problem. It will perform the task consistently and precisely, and ensure that the result is identical to what was expected. It will immediately spot a missing new line or a missing dot.

  • Integration of manual tests in testing frameworks is difficult, if not impossible. How do you report by hand the results of a test? Can developers be informed about regressions in a comfortable way through testing reports, panels displayed on a dedicated monitor, etc.? I don't think so.

    Even more importantly, how do you ensure that developers are informed quickly enough? The more you wait, the harder it is to fix the bugs. A bug discovered through background compiling as soon as the developer types it can be fixed within seconds. A bug discovered months later could require weeks of debugging.

  • Manual testing is immoral. Robert C. Martin mentions this aspect in a few of his talks, including The Land that Scrum Forgot and Craftsmanship and Ethics, and I can't agree more with him. Repetitive tasks are for computers; asking people to do a boring, repetitive task, again and again, is not the nicest thing a company can do to its employees.

The previous points may give you the impression that automated testing is something which should be implemented consistently in every project, and manual tests shouldn't exist. In practice, it doesn't work that way.

First, not all code is well suited for tests. Imagine your application as two parts. The first part is the core of your app, its business logic, the essence of it, the reason it is here. It's in this core that you find the most interesting stuff. And then there is a second part, the one at the edges of your application, the one which makes it possible for this application to interact with the world: show stuff on the screen, listen to the user's voice, store data in files, get something from a database.

The first part should be tested in depth, automatically. Mocks and stubs of its parts and of the second part help you do that. The second part, on the other hand, doesn't lend itself well to testing. You can hardly create mocks and stubs for it, so usually, you end up testing it once, and trying not to change it too much.

The mistake of many beginner programmers is to rely too much on the second part, and to not separate the two parts clearly enough. A File.WriteAllText buried deep in your C# business logic can ruin your testing, and I'm not even talking about applications which assume that they necessarily run in the context of ASP.NET MVC or Django. Do yourself a service and make sure the classes of the second part don't mix with the classes of the first part. The first part shouldn't care whether it stores data in a database or a flat file, or whether it runs as a desktop application, a REST service or a website.
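As an illustration, here is a minimal sketch of that separation (all names are invented for the example, this is not production code): the core receives its storage as a dependency, so it never knows whether data ends up in a file or a database.

```javascript
// The core: pure business logic, free of any I/O concern.
function createOrderService(store) {
    return {
        placeOrder: function (order) {
            if (order.quantity <= 0) {
                throw new Error("Quantity must be positive.");
            }
            store.save(order);
            return true;
        }
    };
}

// In production, `store` would wrap a database or a File.WriteAllText-like
// call; in tests, a trivial in-memory stub is enough.
var savedOrders = [];
var inMemoryStore = {
    save: function (order) { savedOrders.push(order); }
};

var service = createOrderService(inMemoryStore);
service.placeOrder({ product: "book", quantity: 2 });
```

With this shape, the core can be tested in depth without ever touching a disk or a network.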

While nearly every test case can be automated, it may take time, and sometimes a lot of it. A test which can be executed manually within minutes may easily require months of work to automate. This makes such automated tests time- and cost-prohibitive to write. Unless you work on life-critical systems, you can't afford spending ten months writing tests for a feature which took one hour to implement.

The mistake of many teams is to tell themselves that a pragmatic approach consists of just falling back to manual testing in those situations. What is important to understand is that this is not an automated versus manual testing debate, but testing versus the lack of testing: as I explained, there is no way you can reasonably scale with manual tests.

In other words, you end up having a suite of automated tests on one hand, and the lack of tests on the other hand, with a bandage in a form of manual testing.

One should note that exploratory testing is usually all manual (although there are some tools which let you automate parts of the process, and their marketing departments will do everything to convince you that those tools are magic and will change your life, most work will still be done by hand). Exploratory testing, intended to discover the test cases you have to automate, consists of wandering around the different features of your app, trying to break things by entering invalid values, removing files the application needs, changing permissions and cutting the internet connection while the app is actively using it.

Exploratory testing deals with both the valid cases, such as “When the app says it stores a record in the database, does it actually store it?”, as well as the invalid ones, that is edge cases which could be implemented incorrectly, such as “What if I put letters in a field which requires a number?” or “What if I submit a form containing an invalid value which was deemed invalid through JavaScript validation?”

The workflow which appears to be quite successful and generally used consists of starting by exploratory testing, and when a bug is found, generating an automated test which reproduces the bug. A particular form of that is done by the support team which handles the reports by the customers. In a way, customers are just performing exploratory testing without even noticing it.

Now that I have explained the most important aspect of testing—automation and its limits, let me focus on tests themselves and their topology. I'll start by talking about tests in general to explain what they are, and then describe the most common types of tests.

What is a test?

A test is anything which can be automated, has a binary result (yes or no) and ensures that a given part of the application is working as expected.

  • The automation part means that the test can run multiple times and is expected to produce the same result if the inputs are the same. This implies that the test case can be defined in a form which is not subject to interpretation and doesn't change from person to person. For example, when dealing with performance, an incorrect way to write a non-functional requirement would be to say that a feature should be “fast enough”. This cannot possibly be tested, because one person can consider the feature fast, while another will assert that the same feature is slow as hell, because their expectations of its speed were different.

    The usual problem arises from the non-deterministic behavior which usually appears when executing code in parallel. Some sequential code which is too complicated can reveal this behavior as well: a test may pass dozens of times, and then fail, while the code was unchanged. Then, if tests are rerun for the same revision of the code, they may turn green once again. In programmers' hell, programmers have to deal exclusively with such tests.

  • The binary result further prevents any interpretation. The test either passes, or it fails. There is no in-between. Manual tests, for example, are prone to non-binary results. “The test of this feature nearly passed, but we still need to work on this part.” No, if it's not green, it's red, and this means failure.

    The non-binary approach would be problematic, because it makes it impossible to track progress. What does “We are nearly done” mean? Is Jeff's “nearly” the same as Scott's “nearly”?

    If partial statuses appear to make sense in some situations, and especially if you can objectively determine that 35% of the test passes, this is a good sign that you are doing your tests wrong. In fact, what is happening is that your big test runs many little tests at once. Instead, you should have a separate test for every step, allowing a much better integration within the tool you use to track the number of tests passed or failed.

  • Finally, the test ensures that a part of the application works as expected. It doesn't check for the presence of a bug, and it doesn't check whether other systems work as expected. For example, if an app relies on a library, the tests of the app shouldn't cover the library—that's the job of the developers (and testers) of the library itself.

    This may seem obvious, but I've seen many cases where programmers are tempted to test more than they need to: other libraries, underlying framework, the OS itself. It's difficult enough to test the code of the project; there is no need to further complicate the matter by testing the outside world as well.

You may have noticed that I haven't included anything about the form of the test. In fact, it can be expressed through code, but this is not a necessity. Tests can also take the form of a script, or of data used by Selenium. The form doesn't matter.

Testing frameworks are not necessary either, but they are often very useful. For simplest projects, tests can take a form of a simple Bash or PowerShell script and/or a piece of code. For larger projects, a testing framework can enable a more formal approach and make it much easier to both write the tests (by providing convenient methods for comparison of sequences or for expected exceptions, for example) and execute them (by providing a common interface and reporting capability).
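For instance, a framework-free test can be as small as a script which fails loudly when an expectation is not met (the function and values here are invented for the example):

```javascript
// A minimal, framework-free test: fully automated, with a binary result.
function add(a, b) {
    return a + b;
}

var result = add(2, 3);
if (result !== 5) {
    // A non-zero exit (thrown error) is the "red" outcome.
    throw new Error("Test failed: expected 5, got " + result);
}
console.log("Test passed.");
```

A testing framework replaces the `if`/`throw` boilerplate with assertion helpers and aggregates the results, but the principle stays exactly this.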

Why do we test?

Junior programmers sometimes believe that tests are a form of guarantee that shows that the code is free of bugs. This is absolutely not the case. The number of tests or the branch coverage are irrelevant: you may have a lot of tests and still a lot of bugs. What makes it possible to tell that the code has no bugs is formal proof, which is a very different technique.

The primary goal of testing is to find regressions. When changing code, regressions are unavoidable, even with practices such as pair programming or code reviews. Tests, given enough coverage, help find some of the regressions which were missed by programmers and reviewers.

This makes the lives of developers much easier. When working on a project which doesn't have tests, any change is done at a high risk of breaking something. This usually discourages programmers from changing anything, which, in turn, means that no refactoring is done. In turn, this means that technical debt increases constantly, and the project blows up sooner or later.

The lack of tests also usually leads to bad working conditions. Not only can't programmers do their jobs correctly by refactoring on a daily basis, but they are also constantly stressed and afraid of breaking stuff. Such conditions are harmful and should be avoided in any company which has the slightest respect for its employees.

The second goal of tests is to make it easier to locate bugs. Unit tests, for instance, make it possible to pinpoint the location of a bug very precisely compared to integration and other tests. Usually, they pinpoint the method at the origin of a bug, which is nice if methods are short enough. Tests other than unit tests help locate bugs as well: while they don't have the precision of unit tests, they still give you some hints about the possible location.

Tests help locate bugs not only in space, but also in time. If they run after every commit and commits are done on a regular basis, the reports can show very well that a regression appeared in a given revision. This leads us to the question of the periodicity of test runs.

When to run tests?

One could think that a natural place for the tests would be the pre-commit phase. This would not only prevent code with regressions from reaching the version control, but also ensure developers are informed soon enough about the regressions.

Unfortunately, this is impossible to do for any but tiny projects. As I already explained before, running tests is problematic for two reasons: it executes custom code, which is very problematic in terms of security (remember, pre-commit hooks are executed by the version control server), and it takes time.

The speed problem means that developers would have to wait on every commit, which would discourage them from committing their code in the first place; also, the tests of most projects take from a few minutes to a few days to run anyway. Even small delays of a few seconds are very problematic, which is also the reason linters often have no place in pre-commit hooks.

Instead, tests should run inside the continuous integration flow, with some tests running during the build and others handled by CI itself.

What types of tests are there?

Smoke tests

Ask any junior programmer which tests he should implement first. “Unit tests” would be the answer, and it's the wrong answer.

It's practically like asking a junior programmer to list the design patterns he knows. Singleton will be the first answer, and usually the only answer, while this is probably one of the most useless patterns, and also the most misused one.

Projects handled by inexperienced teams usually follow a predictable pattern. They start with a bunch of unit tests. Everyone is motivated and writing a lot of unit tests, a few of them useful, many more just redundant.

Later, the team writes fewer and fewer unit tests. At some point, the old ones are not updated any longer, and a few months later, running them will show that a few dozen fail.

Then, the team eventually makes a few attempts to either write new tests or update (or simply remove) the broken ones, but the branch coverage continues to decrease, and nothing can stop the decay.

From the moment where only a part of the code base has unit tests, the value of unit testing drops substantially.

For this reason, the very first tests which should be implemented are smoke tests. Smoke tests consist of verifying a given flow within the system. It could be, for example, a set of operations an average user will commonly perform. Usually, the operations are complex and involve several subsystems at once. For instance, a smoke test of an e-commerce website could consist of the following steps:

  1. Search for a product.
  2. Visualize the product.
  3. Add the product to cart.
  4. Go to the cart.
  5. Enter a code which makes it possible to have a rebate.
  6. Change the quantity of the product.
  7. Attempt to purchase it. This leads to a registration page.
  8. Register.
  9. Start purchasing the product, entering wrong credit card info.
  10. Enter correct credit card info and finish the payment process.
  11. Go to the list of purchases and ensure the operation is there.
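To give an idea of the shape of such a test, here is a toy sketch where the whole “website” is replaced by an in-memory model (every name here is invented; a real smoke test would drive the actual application, for instance through Selenium):

```javascript
// Toy in-memory shop standing in for the real system under test.
var shop = {
    cart: [],
    purchases: [],
    addToCart: function (product, quantity) {
        this.cart.push({ product: product, quantity: quantity });
    },
    checkout: function (creditCardValid) {
        if (!creditCardValid) { return false; }
        this.purchases = this.purchases.concat(this.cart);
        this.cart = [];
        return true;
    }
};

// The smoke test walks through the primary flow, step by step.
shop.addToCart("book", 1);
shop.cart[0].quantity = 2;                  // change the quantity
var firstAttempt = shop.checkout(false);    // wrong credit card info
var secondAttempt = shop.checkout(true);    // correct info

var smokeTestPassed = !firstAttempt &&
    secondAttempt &&
    shop.purchases.length === 1;            // the purchase is recorded
```

The point is not the individual assertions, but the fact that one test crosses every subsystem the primary flow touches.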

When such a smoke test passes, chances are good that the application is running mostly correctly and accomplishes its primary goal. Customers are probably able to view and purchase products.

When such a smoke test fails, that's a sign that something went completely wrong. There is no way to deliver the code to production in its current state, because it can and probably will lead to major downtime for the business.

The benefit of such smoke tests over unit tests is their number. If a team is unable to follow TDD or ensure code coverage close to 100%, the same team may be more inclined to keep three to five tests working for every release. Psychologically, it's easier to keep a few tests updated than a few thousand.

Another benefit is that unit tests won't necessarily reveal issues which arise at a higher scale, that is, at the integration or system level. A smoke test, on the other hand, acts at the highest abstraction level: it doesn't even know about the subsystems, and interacts directly with the whole system.

The obvious drawback of smoke tests is that if those are the only tests you have, many situations will remain untested. To prevent this from happening, unit, integration and system tests should be used.

System, unit and integration tests

Having smoke tests is good, but not enough for most products. While smoke tests cover your main flow, most situations remain untested, and problems which occur there will be discovered the hard way—through angry calls from customers.

This means that the coverage, that is the area under test, should be increased. This is done through system testing. System tests are no different from smoke tests in that they perceive the system as a whole, without entering into the details. The difference is that (1) they are usually more dissociated from the actual flow of the users, that is the actual use cases, and (2) they are often more granular, in other words they perform fewer actions.

If we take the previous example of an e-commerce website, a system test can create an environment where a purchase is done, and ask, through the website, for cancellation, testing whether the cancellation is actually done.

When you start writing smoke and system tests, you may notice a very annoying thing: whenever you break something, a bunch of system tests stop working, but none gives you a hint about the location of the regression. Since smoke and system tests involve dozens or hundreds of classes and methods, a regression in any of those classes or methods affects them. Thus, you need a more granular approach.

This is where unit testing comes in. Each unit test has a very small working surface: in general, it is limited to a method or a bunch of methods within a class. This makes unit tests very precise at locating regressions. For the same reason, a regression will generally cause only one or a few unit tests to fail, meaning that you can focus on those tests and the small amount of code they cover.

The fact that unit tests are limited to a small part of the code which is executed in isolation from other code is a benefit, but also a problem. You may have pretty good branch coverage, and practically no bugs at the level of a single class, but when you start linking one class to another, things start to get ugly. On one hand, you have your unit tests which don't help; on the other hand, you have smoke and system tests which don't show you the location of the issue. This is where you can use integration tests. Those tests have a scope larger than unit tests, but not as large as smoke and system tests, meaning that they have the best of both worlds.
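A sketch of the difference in granularity (the functions and values are invented for the example): a unit test exercises one function in isolation, while an integration test checks that two units cooperate correctly.

```javascript
// Two small units of an imaginary e-commerce codebase.
function applyRebate(price, rebatePercent) {
    return price - price * rebatePercent / 100;
}
function cartTotal(items) {
    return items.reduce(function (sum, item) {
        return sum + item.price;
    }, 0);
}

// Unit test: one function, in isolation.
var unitTestPassed = applyRebate(200, 10) === 180;

// Integration test: the two units combined, as the application uses them.
var total = applyRebate(cartTotal([{ price: 100 }, { price: 100 }]), 10);
var integrationTestPassed = total === 180;
```

If the integration test fails while both unit tests pass, the regression lies in the glue between the two units, which is exactly the hint you were missing.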

Functional and acceptance tests

You now have your smoke, system, unit and integration tests up and running. You're confident that everything will be fine, and then you deliver your product, and the stakeholders tell you that, well, it might be working and all, but it's not what they need, and actually, if only you could have read the spec carefully... Well, you know the story.

If your project has functional requirements, you could be writing functional tests too. Those tests verify that the system which is actually built corresponds to the business requirements. Imagine you're working on a word processor. The spec says that the user should be able to change the font size and set the text to bold and italic. The actual product you built makes it possible to change the font size and has a “Bold” button, but no “Italic” one. Would you catch this error using smoke, system, unit or integration tests? Probably not: those tests would rather find that your “Bold” button is not doing anything, or that the application crashes when the font size is set to 0, but there would be no test highlighting the lack of an “Italic” button.

Functional tests are often confused with acceptance tests. Acceptance testing consists of determining whether the customer really needs what we built. For instance, if you implemented the font size, “Bold”, “Italic” and “Underline”, acceptance testing may show that the customers don't need “Underline”, but what they actually need is the ability to change the font. Functional testing is a verification activity; acceptance testing is a validation activity.

Stress and load tests

OK, at this point, you know that you built the right thing which conforms to the spec and works pretty well. But what about its performance?

Stress tests are to non-functional performance requirements what functional tests are to functional requirements. They test the entire product or a part of it (sometimes as small as a single method) on specific hardware under specific load, and measure the time it takes to run a given action. Then, they compare it to a threshold.

Remember, one of the characteristics of tests is their binary result: a test passes, or it fails. A basic variant of a stress test which measures the execution of the code and compares it to a value can be too unreliable: if yesterday the code ran in 499.7 ms, and today it took 500.1 ms with a threshold at 500 ms, does it really mean that we have a regression today? Probably not. In order to prevent randomness from affecting the results (and to find a regression as soon as it is created), a stress test can run the same action several times and measure the average.
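Such an averaged measurement can be sketched like this (the workload, number of runs and threshold are invented for the example):

```javascript
// Run the action several times and compare the average duration to the
// threshold, instead of trusting a single, noisy measurement.
function measureAverageMs(action, runs) {
    var start = Date.now();
    for (var i = 0; i < runs; i++) {
        action();
    }
    return (Date.now() - start) / runs;
}

// Placeholder for the code under stress test.
var actionUnderTest = function () {
    var sum = 0;
    for (var i = 0; i < 10000; i++) { sum += i; }
    return sum;
};

var thresholdMs = 500;
var averageMs = measureAverageMs(actionUnderTest, 100);
var stressTestPassed = averageMs < thresholdMs;   // binary result
```

A single 500.1 ms outlier no longer flips the result; only a sustained slowdown does.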

Load tests are a different beast: instead of testing the performance of a given part of the system, they test its scalability. In other words, they are not measuring how fast your system is, but rather how much it would take to bring it down. For example, would an e-commerce website work well if fifty customers are purchasing something at the same time? Yes? Great. What about two thousand? Maybe sixty thousand?

pdiff: the magic of finding regressions uncaught by other tests

Your product has a few smoke tests, thousands of unit tests, hundreds of integration, system and functional tests, and a bunch of stress and load tests. You feel safe. You know nothing wrong can happen. So on Friday evening, you make a small change, commit your code, check that all tests are still green and, with a feeling of accomplishment, you leave, planning to spend a great weekend with your wife.

And then your phone rings. It's your boss. The website is completely screwed. The home page won't even show. Product pages are... well, let's not even talk about product pages. Support is overwhelmed by the calls from customers who are wondering if the website was hacked.

You rush back to the workplace. You open your favorite browser and, indeed, what you see makes you want to kill yourself, right now. WTF happened?

What happened is that you modified a CSS file. You couldn't see the change because the staging server was serving the old cached minified bundle. The new one, the one you wrote, crushed the layout of nearly every page on the site.

Thousands of tests were unable to catch this simple mistake in your CSS code. And if you think about it, what unit, functional or system tests could you write to prevent such a regression? There is not much you can do there.

Well, it appears that you can. pdiff stands for perceptual diff. It consists of an algorithm which compares two images and determines whether they look different. Using perceptual diff for testing makes it possible to catch regressions which influence how your website or software product looks, even if nothing changed functionally speaking. You inadvertently changed the padding of an element? pdiff will see it. Text increased? You'll be notified.
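The idea can be sketched as follows: compare two screenshots pixel by pixel and fail only when the fraction of differing pixels exceeds a tolerance. This toy version only counts raw differences; a real perceptual diff models human vision, so treat the names and values here as invented for the illustration.

```javascript
// Toy perceptual diff: images are flat arrays of pixel values.
function fractionDifferent(imageA, imageB) {
    var different = 0;
    for (var i = 0; i < imageA.length; i++) {
        if (imageA[i] !== imageB[i]) { different++; }
    }
    return different / imageA.length;
}

var before = [0, 0, 0, 255, 255, 255, 128, 128];
var slightNoise = [0, 0, 1, 255, 255, 255, 128, 128];  // 1 pixel changed
var brokenLayout = [255, 255, 255, 0, 0, 0, 0, 0];     // most pixels changed

var tolerance = 0.2;  // accept up to 20% of differing pixels
var noiseAccepted = fractionDifferent(before, slightNoise) <= tolerance;
var regressionCaught = fractionDifferent(before, brokenLayout) > tolerance;
```

The tolerance is what keeps the test binary and stable: anti-aliasing noise passes, a crushed layout fails.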

Tests which are not tests: A/B usability testing

Some tests are not the actual tests as defined previously in this article. While they have “testing” in their name, they are completely different from the tests we've seen. They are not automated, their results are not binary and they don't verify that the product is working as expected.

Two testing techniques are especially important:

  • A/B testing. It's all about statistical analysis, not regressions. In A/B testing, you create two variants of the same object (whether it is an advertisement, a web page, or anything else which involves an interaction with the user), and measure which variant is more successful. For instance, if a download button is shown green to half of your users, and blue to the other half, you may statistically show that, in this particular case, the blue one gets 34% more clicks than the green one. Based on that, you'll make the button appear blue for everyone.

    This testing is used a lot by marketing, but can be used more globally to make choices in situations where such choices are not clear. If there are two possible ways to show a feature, implement both and see which one is used more.

  • Usability testing. If A/B testing consists of determining which of two alternatives is more successful, usability testing is much more specific. It consists of asking a person to interact with an element, and actually watching how the person does it. This way, you can notice the flaws in the UX and the overall design of your product, and act accordingly.

    For instance, asking the user to perform an action such as purchasing a product on an e-commerce website may show you that the “Buy now” button is too small and difficult to reach for some users, or that users don't necessarily know how to change the quantity in the cart.

    Usability testing is essential for any system which interacts with the end users and should be used consistently to ensure that there are no user experience mistakes.

What if I can't test my app?

Every time I audit a project which lacks testing, I hear programmers saying that there are parts which are too “non-deterministic” to be tested. In most cases, they refer to methods which are based on the current time, and methods which use pseudo-random number generators.

Actually, there is nothing in those two cases which prevents testing. In order to test those methods, one can use stubs and mocks.

Stubs are small parts of code (usually classes, rarely individual methods) which replace a given functionality by something which does nothing but produce consistent and predictable results. For example, if a method relies on the current date and time (now()), a stub can feed this method a constant date and time. A stub for a pseudo-random number generator can be as simple as:

var nextRandomStub = function (seed) {
    // Always return the same "random" value, making the caller deterministic.
    return 0;
};

Mocks are similar to stubs, but they are customized within the tests. For instance, the mock of a pseudo-random number generator may require the test to set the actual value which will be returned to the caller. This allows you, for example, to see how the caller would react to negative values, or to values greater than one.
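A hand-written mock of this kind could look as follows (all names are invented for the example): the test decides what the “random” generator returns, then observes how the caller reacts.

```javascript
// Hand-rolled mock: each test configures the value to return.
function createRandomMock() {
    var value = 0;
    return {
        setNext: function (v) { value = v; },
        next: function () { return value; }
    };
}

// Code under test: rejects out-of-range "random" values.
function pickIndex(random, length) {
    var r = random.next();
    if (r < 0 || r >= 1) {
        throw new Error("Random value out of [0, 1) range.");
    }
    return Math.floor(r * length);
}

var random = createRandomMock();

random.setNext(0.5);
var index = pickIndex(random, 10);     // the normal case

random.setNext(-0.1);                  // how does the caller react?
var rejected = false;
try { pickIndex(random, 10); } catch (e) { rejected = true; }
```

Mocking frameworks generate such objects for you, but as the sketch shows, nothing stops you from writing them by hand.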

In practically every case, stubs and mocks are all you need. For the sake of completeness, I'll explain two other similar elements, given that I'm not particularly convinced about their usefulness.

Fakes are lightweight implementations of a part of a system. For instance, a component which interacts with a given web service can be replaced by a fake during tests. I have seen no cases where fakes cannot be substituted by one or several stubs and mocks.

Fixtures consist of an emulation of an environment. It could be a database with testing data, or an HTML page stored as a static file and used to test JavaScript. While it might make testing look easier at first, tests using fixtures are more difficult to maintain. They may also quickly lead to tests which are slow.

Exploration: what to test?

The general path is obviously something you should test, but be particularly careful with edge cases as well. Those edge cases are usually difficult to find, and make the difference between a good and an average tester.

Edge cases can be found by inspecting the algorithm. If the method has two paths: one for positive integers, another one for negative ones, you may be interested in testing what happens with a zero. Often, edge cases require a deep knowledge of the language and the framework being used. For example, if the method relies on a string, would it work with Unicode? What about a string containing billions of characters? Or maybe zero characters? And what about a null being passed instead? White space, maybe?

If you are testing a piece of code, imagine that this code was written by the coworker you hate. Imagine you're a hacker. How much damage can you do to this code? How many flaws can you find? If the method expects an integer, give it a float. If the method explicitly asks for a positive integer, give it a negative one. If the method begs for a non-empty sequence, feed it a null.
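As a sketch, edge-case exploration for a trivial word-counting function could look like this (the function and its cases are invented for the example):

```javascript
// Function under attack: counts words in a string.
function countWords(text) {
    if (text === null || text === undefined) { return 0; }
    var trimmed = text.trim();
    if (trimmed === "") { return 0; }
    return trimmed.split(/\s+/).length;
}

// Hostile inputs: the cases a careless implementation usually misses.
var edgeCases = [
    { input: "two words", expected: 2 },     // the general path
    { input: "", expected: 0 },              // zero characters
    { input: "   ", expected: 0 },           // white space only
    { input: null, expected: 0 },            // null instead of a string
    { input: "un café", expected: 2 }        // non-ASCII text
];

var allEdgeCasesPass = edgeCases.every(function (c) {
    return countWords(c.input) === c.expected;
});
```

Each hostile input becomes one small, binary test; together they document exactly which abuse the function is expected to survive.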

Exploration: third-party tools

This article would be incomplete without mentioning third-party tools which make the exploration step easier. Why would you spend your time finding cases to test, when there are tools which can study your code and find those cases for you?

Microsoft Research's Pex project is one of those tools which can speed up the exploration by generating unit tests for you. This has two crucial benefits:

  • The exploration of some methods can be completely automated, with Pex producing the complete suite of unit tests with full branch coverage.

  • For other methods where exploration was done by hand, it may be interesting to see the results from Pex and compare it with manual tests: Pex may discover cases which were missed by testers.

Different static analysis tools may also be very useful when moving the code towards formal proof, making code more reliable while requiring less tests.

While different tools can be very helpful in generating tests or suggesting paths to be tested, be very suspicious of any company claiming that their product will write tests for you. The cake is a lie, and nothing will make it possible to completely skip the exploration step. Tools are too stupid to know what to test and how; they may give hints, but they can't do the work for you. Similarly to a beginner programmer, they will end up creating many tests which are not particularly useful, while missing the important cases. Use them to help you, not to replace you.