Make your build agents short-lived

Arseni Mourzenko
Founder and lead developer
March 14, 2018
Tags: performance, continuous-deployment

While discussing with a friend what I described in my previous article, it became apparent that my approach to launching tests from a build pipeline is not necessarily well known. It seems I can share some insight on the subject.

A mandatory disclaimer: although I have worked enough on the theory, the system I'm using myself right now is not the one I will be talking about in this article. I hope the one I describe below will emerge somewhere in 2019, but unfortunately, time constraints don't allow me to be certain even about that.

So. You have a product. The product has tests. Unit tests. System tests. Load tests. You have a continuous deployment workflow which runs the tests. Where do those tests run?

Those of you who have used Jenkins or TeamCity may be familiar with the notion of nodes or agents. The idea is that there is a cluster which handles the pipeline itself, but the individual steps of this pipeline, such as running the actual tests, are delegated to another cluster containing a bunch of virtual machines which obey the orders of the first one. An agent, therefore, receives requests to build something, run some tests, or perform some other task, and once it is done, it can be requested to do something else. Different steps are performed by different agents, depending on their availability and compatibility with the task.
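To make the delegation model concrete, here is a minimal Python sketch of an orchestrator picking an agent for a step. The `Agent` class, the labels, and the `pick_agent` helper are illustrative assumptions, not the actual Jenkins or TeamCity API.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    # A build agent as seen by the orchestrator (illustrative model).
    name: str
    labels: set = field(default_factory=set)  # capabilities, e.g. {"java", "linux"}
    busy: bool = False

def pick_agent(agents, required_labels):
    """Return any idle agent compatible with the task, or None."""
    for agent in agents:
        if not agent.busy and required_labels <= agent.labels:
            return agent
    return None

# Different steps go to different agents, depending on availability
# and compatibility with the task.
agents = [Agent("agent-1", {"java", "linux"}),
          Agent("agent-2", {"dotnet", "windows"})]
agent = pick_agent(agents, required_labels={"java"})
if agent is not None:
    agent.busy = True
    print(f"Delegating the step to {agent.name}")
```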

This is a viable approach. But it can be better.

An important issue is the way agents are deployed. The usual way is to create a bunch of agents, configure them, and then rebuild the ones which indicate that they may be broken, for instance by failing a build which, unchanged, passes on other agents. Such longevity among the agents is not only problematic by itself, but also unnecessary, restrictive, and an obstacle to scaling.

The solution

Here it comes: a dynamically allocated, flexible pool of one-time machines which are automatically recycled once they are done with their task.

The idea is simple. Any step of the build pipeline delegated to an agent is executed by a machine which has never executed any build step before. Once the machine reports the result back to the orchestrator, it is automatically destroyed.
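A minimal sketch of that lifecycle, assuming a hypothetical cloud API exposing `create_vm` and `destroy_vm` and an orchestrator with a `report` method (none of these names come from a real library):

```python
def run_step_on_fresh_agent(cloud, orchestrator, step):
    """Run one pipeline step on a never-used machine, then destroy it."""
    vm = cloud.create_vm(image=step.image)  # a fresh machine, for this step only
    try:
        result = vm.run(step.command)       # the only work this machine will ever do
        orchestrator.report(step, result)   # report back before being recycled
        return result
    finally:
        cloud.destroy_vm(vm)                # destroyed even if the step failed
```

The `finally` clause is the point: the machine is recycled whether the step succeeds, fails, or crashes, so no agent ever survives its first task.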

The caveat is that even if it takes only around twenty seconds to create a virtual machine, it is not acceptable to wait for twenty seconds before starting a step. The build should be fast, and some of its steps, such as linter checks or unit tests, should report their results within seconds after a commit in order to give the developer decently fast feedback. Therefore, those virtual machines should already be there before the commit.

This problem can be solved by a pool of agents and a bit of magic. The pool of agents is the simple part: if I write a Java app, I want the infrastructure to have a freshly deployed, never-used Java test agent available for me at the moment I commit my changes. Since I don't care who else may be using the agents, there should be an agent for my commit even if ten other people committed Java code in the past three minutes. This is the magic part: a service should track the current usage of the agents and make sure to deploy new ones as soon as the number of available agents starts to drop.
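The magic part could then be a small replenishment loop. Here, `count_idle_agents` and `provision_agent` are assumed helpers of a hypothetical pool service, and the thresholds are arbitrary:

```python
import time

MIN_IDLE = 5         # fresh agents that must always be ready, per agent type
POLL_INTERVAL = 2.0  # seconds between two checks of the pool

def keep_pool_full(pool, agent_type="java-test"):
    """Provision fresh agents as soon as the number of idle ones drops."""
    while True:
        idle = pool.count_idle_agents(agent_type)
        for _ in range(MIN_IDLE - idle):
            pool.provision_agent(agent_type)  # e.g. a never-used Java test agent
        time.sleep(POLL_INTERVAL)
```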

This is very similar to cloud computing, where a sudden increase in the number of requests to a specific web application can lead to the application being deployed to new nodes. The difference is that in the cloud, it is important to be able to scale both up (to respond to a growth in resource usage) and down (to reduce the cost). In my case, however, there is no need to scale down: just keep the agents running until they finish what they are doing, and they will be deallocated automatically.