Make your build agents short-lived
While discussing with a friend of mine what I described in my previous article, it appeared that my approach to launching tests from a build pipeline is not necessarily well known. It seems that I can share some insight on that subject.
A mandatory disclaimer: although I have worked enough on the theory, the system I'm using myself now is not the one I will be talking about in this article. I hope the one I describe below will emerge sometime in 2019, but unfortunately, time constraints don't let me be certain even about that.
So. You have a product. The product has tests. Unit tests. System tests. Load tests. You have a continuous deployment workflow which runs the tests. Where do those tests run?
For those of you who have used Jenkins or TeamCity, the notion of nodes or agents may be familiar. The idea is that there is a cluster which handles the pipeline itself, but the individual steps of this pipeline, such as running the actual tests, are delegated to another cluster containing a bunch of virtual machines which obey the orders of the first cluster. An agent, therefore, receives requests to build something, to run some tests or to perform some other task, and once it is done, it can be requested to do something else. Different steps are performed by different agents, depending on their availability and compatibility with the task.
This is a viable approach. But it can be better.
An important issue is the way agents are deployed. The usual way is to create a bunch of agents, configure them, and then rebuild the ones which show signs of being broken, for instance by failing a build which, unchanged, passes on other agents. Such longevity among the agents is not only problematic by itself, but also unnecessary, restrictive and scale-prohibitive.
Problematic. The problem, as already stated, is that an agent can suddenly start failing the build, while the same build passes on its peers. This leads not only to unpredictable builds and a loss of trust, but also to time wasted checking for false positives and to developers losing interest in failed builds: “It failed, again? Well, re-run it to see if it passes.” at best, “It failed, again? I don't care, it must be the problematic agents, once again.” at worst.
Unnecessary. Since agents don't have to keep state (this is the job of artifact repositories), there is absolutely no reason to preserve them after they finish their task.
Restrictive. If agents are intended to live for weeks or months, it means that they will use the resources of the host, whether they are doing something or not. In turn, this means that in order to keep the hardware usage low, one would be inclined to have few types of agents, or few agents of every type.
In the first case, it would lead to fat agents: they are able to perform diverse steps, but this comes at a cost of clarity. If I have a virtual machine which should run unit tests of a Python script, I don't need this machine to be able to run C++ unit tests or to launch Ruby, or to be able to connect to the internet, or to be able to access NFS. Narrowly dedicated machines make things much clearer.
In the second case, it leads to slow builds. Having twenty machines handling fifteen tasks is better than having ten machines handling ten tasks with five tasks waiting for an agent to become available.
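The arithmetic behind this can be sketched in a few lines. This is only an illustration under simplifying assumptions: every task takes the same time, and each agent handles one task at a time. The function name `total_time` and the four-minute duration are made up for the example.

```python
import math


def total_time(tasks: int, agents: int, minutes_per_task: float) -> float:
    """Wall-clock time to finish all tasks when each agent runs one task at a time."""
    if agents <= 0:
        raise ValueError("at least one agent is required")
    # Tasks beyond the number of agents have to wait for the next wave.
    waves = math.ceil(tasks / agents)
    return waves * minutes_per_task


# Fifteen tasks of (hypothetically) four minutes each.
print(total_time(15, 20, 4.0))  # 4.0: every task starts immediately
print(total_time(15, 10, 4.0))  # 8.0: five tasks wait for an agent to become available
```

With too few agents, the developers whose tasks landed in the second wave wait twice as long for the same feedback.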
Scale-prohibitive. The previous point directly translates into a problem of scale. If, at a given moment, there are lots of changes in Ruby projects, I want to have tens or hundreds of Ruby agents hungry for some testing. At the same time, if there is no activity on Python projects, there is no need to have more than two or three Python testing agents waiting for some work.
By the way, the reason to avoid fat agents is the time needed to create one. If I had no activity on Python projects, and suddenly things start changing fast, the two or three idle agents will quickly be overwhelmed; it is, therefore, essential to be able to deploy new agents within seconds.
The solution
Here it comes: a dynamically allocated, flexible pool of one-time machines which are automatically recycled once they are done with their task.
The idea is simple. Any step of the build pipeline delegated to an agent is executed by a machine which never executed any build step before. Once the machine reports the result back to the orchestrator, it is automatically destroyed.
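A minimal sketch of this lifecycle, assuming an in-memory stand-in for the virtual machine: the `Agent` class and the `run_step` function are hypothetical names, and a real system would create and destroy actual machines instead of flipping flags.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Tuple

_ids = itertools.count(1)


@dataclass
class Agent:
    """Stand-in for a freshly created virtual machine (hypothetical)."""
    kind: str
    agent_id: int = field(default_factory=lambda: next(_ids))
    used: bool = False
    destroyed: bool = False

    def run(self, step: Callable[[], bool]) -> bool:
        # An agent executes exactly one step during its whole life.
        if self.used or self.destroyed:
            raise RuntimeError("an agent runs exactly one step")
        self.used = True
        return step()

    def destroy(self) -> None:
        self.destroyed = True


def run_step(kind: str, step: Callable[[], bool]) -> Tuple[bool, Agent]:
    """Run one pipeline step on a brand-new agent, then destroy the agent."""
    agent = Agent(kind)
    try:
        passed = agent.run(step)
    except Exception:
        passed = False  # a crashing step is reported as a failed step
    finally:
        agent.destroy()  # destroyed whether the step passed, failed or crashed
    return passed, agent


passed, agent = run_step("python-unit-tests", lambda: 1 + 1 == 2)
print(passed, agent.destroyed)  # True True
```

Since the agent is destroyed in every case, a failed step can never leave behind a machine that poisons the next build.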
The caveat is that even if it takes only around twenty seconds to create a virtual machine, it is not acceptable to wait for twenty seconds before starting a step. The build should be fast, and some of its steps, such as linter checks or unit tests, should report their results within seconds after a commit in order to provide decently fast feedback to the developer. Therefore, those virtual machines should already be there before the commit.
This problem can be solved by a pool of agents and a bit of magic. The pool of agents is the simple part: if I write a Java app, I want the infrastructure to have a freshly deployed, never used Java test agent available for me at the moment I commit my changes. Since I don't care who else may be using the agents, there should be an agent for my commit even if ten other people committed Java code in the past three minutes. This is the magic part: a service should track the current usage of the agents, and make sure to deploy new ones as soon as the number of available agents starts to drop.
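The tracking service could look roughly like the sketch below. It is a simplification under stated assumptions: `AgentPool`, its methods and the target sizes are all hypothetical, agents are represented by plain integer identifiers, and replenishing happens instantly, whereas a real service would create virtual machines in the background and would have to size the pool so that it survives the time a new machine needs to come up.

```python
import itertools
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Iterator, Set


@dataclass
class AgentPool:
    """Keeps a target number of never-used agents idle for each agent type."""
    targets: Dict[str, int]  # idle agents wanted per type (illustrative values)
    idle: Dict[str, Deque[int]] = field(default_factory=dict)
    in_use: Set[int] = field(default_factory=set)
    _ids: Iterator[int] = field(default_factory=lambda: itertools.count(1))

    def __post_init__(self) -> None:
        if any(target < 1 for target in self.targets.values()):
            raise ValueError("each agent type needs at least one idle agent")
        self.replenish()

    def replenish(self) -> int:
        """Deploy agents until every type is back at its target; return the number created."""
        created = 0
        for kind, target in self.targets.items():
            queue = self.idle.setdefault(kind, deque())
            while len(queue) < target:
                queue.append(next(self._ids))  # stand-in for creating a virtual machine
                created += 1
        return created

    def acquire(self, kind: str) -> int:
        """Hand out a fresh agent and replace it in the pool right away."""
        agent_id = self.idle[kind].popleft()
        self.in_use.add(agent_id)
        self.replenish()
        return agent_id

    def release(self, agent_id: int) -> None:
        """The agent finished its step: it is destroyed, never returned to the pool."""
        self.in_use.remove(agent_id)


pool = AgentPool({"java": 3, "python": 2})
first = pool.acquire("java")
second = pool.acquire("java")
print(first, second, len(pool.idle["java"]))  # 1 2 3
```

Two commits took two Java agents, and the pool still holds three unused ones for whoever commits next.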
This is very similar to cloud computing, where a sudden increase in the number of requests to a specific web application could lead to the deployment of the application to new nodes. The difference is that in the cloud, it is important to be able to scale both up (to respond to a growth in resource usage) and down (to reduce the cost). In my case, however, there is no need to scale down: just keep the agents running until they finish what they are doing, and they will be deallocated automatically.