Workflows, ETLs, and pure magic

Arseni Mourzenko
Founder and lead developer
December 11, 2020
Tags: quality, productivity, python

Five years ago, I wrote a rather opinionated and very critical article about Nintex Workflows, a perfectly useless product which makes your life miserable while you're paying for it. Nintex is sold as a solution that lets non-technical persons change business logic in a given context; however, it is so poorly designed, and its way of showing the logic and letting you modify it is so convoluted, that it prevents not only non-technical persons from dealing with business logic, but IT professionals as well.

A bit of con­text. Back then, I found my­self at the head of a team who had to deal with the fol­low­ing sit­u­a­tion: a cus­tomer had a work­flow so com­plex that no­body could un­der­stand it. Not that the log­ic be­hind it was com­plex—it was rather sim­ple and straight­for­ward—but Nin­tex Work­flows made it com­plete­ly im­pos­si­ble to un­der­stand. The work­flow oc­ca­sion­al­ly and ran­dom­ly crashed (with­out re­port­ing any er­ror), and my team was asked to solve that. Ob­vi­ous­ly, no­body had a freak­ing idea how to ap­proach this beast.

In the ar­ti­cle back then, I gave a dar­ing sug­ges­tion: use Python in­stead. More specif­i­cal­ly, train non-tech­ni­cal per­sons to write code in Python, cre­ate an API around it, con­nect it to an MQS, and en­joy.

Since then, two things hap­pened.

First and foremost, I met a manager who did exactly that. He has a team with a few programmers and a few business analysts who do not consider themselves programmers, many of them with absolutely no prior experience in anything related to programming or scripting, if such a distinction could be considered relevant here. Those business analysts work on data processing, and instead of using an ETL, they actually code the thing themselves in Python, while relying on the API created by the developers. I interacted a lot with the manager and his team, and could gather enough information to provide some more details about the approach, the benefits and the drawbacks.

Second, about half a dozen persons reacted to my earlier article, some wondering if it's really possible to train a non-technical person to use Python, others asking how to deal with the mess that a non-technical person who doesn't necessarily know the rules of clean code will create over time. My recent experience can help me answer those two questions.

Note that while the context of the team I'm talking about is a bit different (they extract, transform, and load data from and to specific data sources, rather than handling business workflows), the idea is exactly the same. Just as there are existing workflow applications for business rules, there is a market for ETLs. In both cases, there are situations where an existing tool can be valuable (in terms of workflows, I'm obviously talking about things such as Windows Workflow Foundation, not Nintex Workflows, which is perfectly and completely useless), and there are cases where Python can provide much more value.

For instance, imagine an application which handles the administrative part of hiring an employee. Depending on the circumstances, the company may decide to change the workflow, and if the software is not flexible enough, this may not be possible. However, if the software product includes user-manageable workflows, it gives users tremendous flexibility, and so adds value to the product itself.

Sim­i­lar­ly, a team may be ag­gre­gat­ing a spe­cif­ic type of data from dif­fer­ent data sets to put it into the EDW, while keep­ing an eye on the reg­u­lar changes in the struc­ture of the in­com­ing data. When a source changes its struc­ture, or an­oth­er source should be added, it should be rel­a­tive­ly easy with an ETL to just remap the fields, or add new map­pings.

Those are the cas­es where ETLs or work­flow tools (ex­cept Nin­tex!) can re­al­ly pro­vide val­ue: a nar­row field with strict bound­aries.

Now, imag­ine you need to do some com­plex data pro­cess­ing over the data sets com­ing from dif­fer­ent data sources, REST APIs, and SOAP ser­vices, while caching the re­quests and also sav­ing a bunch of CSV files for his­tor­i­cal use. Here, an ETL can quick­ly be­come a lim­i­ta­tion and a bur­den, rather than a tool which em­pow­ers the team.

So comes Python. In 2015, I wrote:

It's ex­treme­ly sim­ple. Take a mes­sage queue ser­vice (MQS), an API and a Python script. Yes, that's all you need. The Python script in­ter­acts with both the API and the MQS.

In retrospect, I made a small mistake: what is needed is not an API, but a framework. This was exactly my approach when developing Grape: the framework handles all the complexity of communicating through an MQS and all the idiosyncrasies of specific hardware devices, hides that complexity, and leaves the user able to define very easily (at least, without bothering about such exciting things as the subtleties of the binary protocol used to communicate with Arduino devices) the business rules of the orchestrator: “Ye shall trigger an alarm when an intruder comes into my flat,” or “Make the red LED flash until the build is fixed,” or “Start the fan if the room temperature reaches a threshold, defined as a complex algorithm based on the outside weather and the interior humidity.”
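To make the idea concrete, here is a minimal sketch of what such a framework could look like from the user's side. It is not Grape's actual API: the `on` decorator, the `run` function, the topic names, and the 26 °C threshold are all hypothetical, and an in-memory `queue.Queue` stands in for a real MQS.

```python
import queue
from collections import defaultdict

# Hypothetical mini-framework (not Grape's real API): rules are plain
# functions registered per message topic.
_rules = defaultdict(list)


def on(topic):
    """Register the decorated function as a rule for messages of `topic`."""
    def register(rule):
        _rules[topic].append(rule)
        return rule
    return register


def run(mq):
    """Drain the queue, dispatch each message to its rules, collect actions."""
    actions = []
    while True:
        try:
            topic, payload = mq.get_nowait()
        except queue.Empty:
            return actions
        for rule in _rules[topic]:
            action = rule(payload)
            if action is not None:
                actions.append(action)


# --- Everything below is what the non-programmer writes. ---

@on("build")
def flash_red_led(build):
    if build["status"] == "failed":
        return "red LED: flash"


@on("temperature")
def start_fan(reading):
    if reading["celsius"] >= 26:  # assumed threshold, for illustration only
        return "fan: start"


# An in-memory queue stands in for the real MQS.
mq = queue.Queue()
mq.put(("build", {"status": "failed"}))
mq.put(("temperature", {"celsius": 21}))
mq.put(("temperature", {"celsius": 28}))
print(run(mq))  # ['red LED: flash', 'fan: start']
```

The point of the split is that the rules at the bottom never mention queues, protocols, or devices; they only describe what should happen.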

In fact, writing a Python script is difficult. It is difficult for a programmer, and therefore for a non-programmer as well. One needs to know what stdout is, and how much memory is used, and what happens if a file was removed at the exact moment the script started to read it, or if an infinite loop suddenly terminates (every time I make this joke, I lose half of my readers; come on, it's a funny one!). When, on the other hand, a script is surrounded by a framework which handles logging and streaming and memory management and faults and hundreds of bad things which can happen out there in the wild, it becomes much easier to focus on the actual business need and write a simple piece of code which corresponds to it. This, essentially, solves a major part of the difficulties that non-technical users encounter when they start using Python. The experience here is comparable to when IT specialists start to learn a new tech. You really want to learn interesting stuff, but instead, you're stuck with this stupid error saying that one of the configuration files cannot be parsed, and that sucks, because you don't want to spend the next few hours dealing with this low-level stuff.
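As a rough illustration of that division of labor, the sketch below has a framework function take care of logging and faults while the analyst writes one small function. The names `run_script` and `net_price` and the sample rows are invented for the example; a real framework would obviously do much more.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("framework")


def run_script(script, rows):
    """Framework side: apply `script` to each row, log faults, keep going."""
    results, failures = [], []
    for index, row in enumerate(rows):
        try:
            results.append(script(row))
        except Exception as error:  # a bad row must not kill the whole run
            log.error("row %d skipped: %r", index, error)
            failures.append((index, error))
    log.info("%d rows processed, %d failed", len(results), len(failures))
    return results, failures


# Analyst side: only the business rule, nothing about logging or faults.
def net_price(row):
    return round(row["price"] * (1 - row["discount"]), 2)


rows = [
    {"price": 100.0, "discount": 0.2},
    {"price": 50.0},  # "discount" is missing: reported, not fatal
    {"price": 19.99, "discount": 0.0},
]
results, failures = run_script(net_price, rows)
print(results)  # [80.0, 19.99]
```

The second row fails with a `KeyError`, which the framework logs with the row index instead of leaving the analyst with a stack trace and a half-finished run.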

This concept of a framework is, in a way, similar to AWS Lambda. You don't want to deal with infrastructure and versions of Linux and dependencies and firewalls and disk space. All you want is to receive input, be able to process it the way you want, and create the output. A Python framework for non-programmers should look exactly the same (heck, it could be based on AWS Lambda!), with the script taking parameters in, flushing output to stdout, and not being concerned about its environment. Once those environment considerations are out of the way, non-technical persons can do a great job.
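A sketch of that contract, under the assumption that parameters arrive as JSON on stdin and the result leaves as JSON on stdout. The `handler` and `invoke` names and the order payload are made up for the illustration; this is not the AWS Lambda API, only something shaped like it.

```python
import io
import json
import sys


def handler(params):
    """Analyst side: parameters in, result out, no knowledge of the environment."""
    total = sum(line["quantity"] * line["unit_price"] for line in params["lines"])
    return {"customer": params["customer"], "total": total}


def invoke(handler, stdin=None, stdout=None):
    """Framework side: read JSON parameters, call the handler, write JSON out."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    params = json.load(stdin)
    json.dump(handler(params), stdout)
    stdout.write("\n")


# Simulated invocation; in production the framework would pass the real streams.
request = io.StringIO(json.dumps({
    "customer": "ACME",
    "lines": [
        {"quantity": 3, "unit_price": 20},
        {"quantity": 1, "unit_price": 5},
    ],
}))
response = io.StringIO()
invoke(handler, stdin=request, stdout=response)
print(response.getvalue(), end="")  # {"customer": "ACME", "total": 65}
```

Because the handler is a plain function, the same code can be called from a test, a command line, or a hosted runtime without the analyst changing a line.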

Now, ob­vi­ous­ly, there is no frame­work which could force some­one to write clean code. There are how­ev­er two ways to avoid the mess.

First, the framework itself can provide useful information about the script it runs. For instance, it can highlight parts of the script which take a while. I don't expect a non-technical person to be able to use a profiler, at least not one presented in the way developers are used to, but something visual which shows that, well, a 15-second script spent 14.5 seconds on line 67 may be quite enough to point the person in the right direction. And I don't expect a non-technical person to be able to explain the difference between a smoke test and a system test, but I'm pretty sure anyone can, when invited by a framework, map some expected outputs to a series of inputs and see if the script matches the expectations.
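The second idea, mapping expected outputs to inputs, needs very little machinery. Here is a minimal sketch; `check_expectations`, the `shipping_cost` script, and its price tiers are hypothetical and only there to show the shape of the thing.

```python
def check_expectations(script, expectations):
    """Run `script` on each input and return the cases that do not match."""
    mismatches = []
    for given, expected in expectations:
        try:
            actual = script(given)
        except Exception as error:  # a crash is reported like any other mismatch
            actual = f"error: {error!r}"
        if actual != expected:
            mismatches.append(
                {"input": given, "expected": expected, "actual": actual}
            )
    return mismatches


# Analyst side: the script under test (invented price tiers).
def shipping_cost(weight_kg):
    if weight_kg <= 1:
        return 5
    if weight_kg <= 10:
        return 12
    return 30


# Analyst side: a plain table of "for this input, I expect that output".
expectations = [
    (0.5, 5),
    (1, 5),
    (10, 12),
    (25, 40),  # the analyst expected 40; the script says 30
]

for miss in check_expectations(shipping_cost, expectations):
    print(miss)  # {'input': 25, 'expected': 40, 'actual': 30}
```

The analyst never hears the word "test"; the framework just asks for a table and reports which rows disagree with the script.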

Second, the non-technical persons should be able to ask a programmer for help; this is actually the first thing I suggested to the manager when I learned that non-technical persons write code in his team. This help may take different forms: the programmer may review pull requests, run training sessions, do occasional pair programming, or just give a hand with a particularly daunting task. The fact is, it actually saves money for the company. Recently, for instance, one programmer on the team helped a business analyst with a script which took too long to execute. As a result, a script which used to take several hours now runs in just a few minutes, thanks to a few changes in the structure of the code: a game changer for everyone.

Languages such as Python are designed in a way that makes it very difficult to write bad code: badly formatted, badly written, misleading. Coupled with some help from a programmer, and a solid framework which provides useful insight into how the script runs and what it really does, those languages can be a very valuable tool which empowers the users when it comes to solving a problem which would be too complex for an ETL or a workflow tool. Use them, they are great.