Scripting languages, automation and productivity

Arseni Mourzenko
Founder and lead developer
176
articles
April 1, 2015
Tags: productivity 36

I re­mem­ber Prag­mat­ic Pro­gram­mer book en­cour­ag­ing to use plain text, and what could be more pow­er­ful than Lin­ux shell and a script­ing lan­guage to work with text in or­der to trans­form it?

That's why when a col­league asked to­day how he can trans­form an Ex­cel table to a for­mat sim­i­lar to:

+-------------------------------------+---------+
| Product                             | Price   |
+-------------------------------------+---------+
| Dell PowerEdge R320                 | 1429.00 |
| Dell PowerEdge R730xd               | 2509.00 |
| PowerVault MD3 Serial Attached SCSI | 6282.70 |
+-------------------------------------+---------+

I im­me­di­ate­ly thought about Python, and also about a li­brary which I use in Solange.

So I gave the link to my col­league, but then I re­tract­ed it: it ap­peared that the task was to ex­port a small table, and dive into a third-par­ty li­brary to do the task which is fast to do by hand seems an overkill.

I thought it was still in­ter­est­ing to know how much is this an overkill. So back home, I did the ac­tu­al test. The re­sults were both im­pres­sive and ex­cit­ing.

Pre­lim­i­nary work

Un­sur­pris­ing­ly, the hard­est part was to pre­pare the sam­ple data. Ob­vi­ous­ly, it made no sense to load the Ex­cel file di­rect­ly in Python, so the first step was to cre­ate a CSV file. Strange­ly, num­bers were not real num­bers and were ex­port­ed with strange lead­ing and trail­ing white­space and even stranger quotes, and since I know noth­ing about Ex­cel, it took me sev­er­al min­utes to find how to spec­i­fy that num­bers are ac­tu­al­ly num­bers.

Once the file was cre­at­ed and moved to a di­rec­to­ry, the sec­ond step was to cre­ate the Python script it­self and to down­load the li­brary:

touch sample.py
chmod +x sample.py
svn co http://source.pelicandd.com/infrastructure/createvm/app/table/
vi sample.py

Great. Few sec­onds af­ter be­gin­ning, we are ready to write the script.

Writ­ing the script

The script couldn't be sim­pler. We ba­si­cal­ly need to im­port CSV data and feed the Table class with it:

#!/usr/bin/env python

import csv
import table

with open("SampleData.csv") as f:
    reader = csv.reader(f)
    print(table.Table(next(reader), reader).display())

The fact that the first's Ex­cel row con­tains the names of columns is very handy. Giv­en that Table re­quests two ar­rays: the names of columns and the con­tents, we sim­ply feed the names with next(reader), which also moves the cur­sor to the sec­ond row.

Un­for­tu­nate­ly, the li­brary is not ar­chi­tect­ed very well. The bad thing is that it has an ugly de­pen­den­cy on colorize. My first re­ac­tion was to re­move the de­pen­den­cy by al­ter­ing the code, but I lost time to do that. In ret­ro­spec­tive, I think I could do bet­ter by down­load­ing the file as well.

This is what I get as a re­sult:

The screenshot showing a terminal which displays a tabular data with a nice border

Close enough, but we can do bet­ter.

One of the ways to do bet­ter is to use sed. With a sim­ple:

python3 ./sample-data.py \
    | sed -r 's/[┃│┏┯┓]/|/g' \
    | sed 's/━/-/g' \
    | sed -r 's/[┗┷┛]/+/g'

we end up with:

| OrderDate | Region -| Rep -----| Item ---| Units | Unit Cost | Total --|
| 1/6/14    | East    | Jones    | Pencil  | 95    | 1.99      | 189.05  |
| 1/23/14   | Central | Kivell   | Binder  | 50    | 19.99     | 999.50  |
| 2/9/14    | Central | Jardine  | Pencil  | 36    | 4.99      | 179.64  |
     ⋮         ⋮        ⋮         ⋮      ⋮       ⋮           ⋮
| 11/17/15  | Central | Jardine  | Binder  | 11    | 4.99      | 54.89   |
| 12/4/15   | Central | Jardine  | Binder  | 94    | 19.99     | 1879.06 |
| 12/21/15  | Central | Andrews  | Binder  | 28    | 4.99      | 139.72  |
+-----------+---------+----------+---------+-------+-----------+---------+

(re­sult short­ened to im­prove read­abil­i­ty)

Not bad at all. We can ac­tu­al­ly work with that. But we can do bet­ter. The first line is ex­pect­ed to be the same as the last line, which can be shown with python3 ./sample-data.py | tail -n2 | head -n1 (ex­clud­ing sed). Also, the cur­rent first line—the one which con­tains the names of columns, should con­tain spaces in­stead of dash­es. This means that the com­mand should be mod­i­fied like this:

python3 ./sample-data.py | tail -n2 | head -n1 \
    | sed 's/━/-/g' \
    | sed -r 's/[┗┷┛]/+/g'

python3 ./sample-data.py | head -n1 \
    | sed 's/━/ /g' \
    | sed -r 's/[┏┯┓]/|/g'

python3 ./sample-data.py | tail -n2 | head -n1 \
    | sed 's/━/-/g' \
    | sed -r 's/[┗┷┛]/+/g'

python3 ./sample-data.py | tail -n+2 \
    | sed -r 's/[┃│┏┯┓]/|/g' \
    | sed 's/━/-/g' \
    | sed -r 's/[┗┷┛]/+/g'

That's a bunch of spooky scary seds we have here! At least we have the ex­pect­ed re­sult:

+-----------+---------+----------+---------+-------+-----------+---------+
| OrderDate | Region  | Rep      | Item    | Units | Unit Cost | Total   |
+-----------+---------+----------+---------+-------+-----------+---------+
| 1/6/14    | East    | Jones    | Pencil  | 95    | 1.99      | 189.05  |
| 1/23/14   | Central | Kivell   | Binder  | 50    | 19.99     | 999.50  |
| 2/9/14    | Central | Jardine  | Pencil  | 36    | 4.99      | 179.64  |
     ⋮         ⋮        ⋮         ⋮      ⋮       ⋮           ⋮
| 11/17/15  | Central | Jardine  | Binder  | 11    | 4.99      | 54.89   |
| 12/4/15   | Central | Jardine  | Binder  | 94    | 19.99     | 1879.06 |
| 12/21/15  | Central | Andrews  | Binder  | 28    | 4.99      | 139.72  |
+-----------+---------+----------+---------+-------+-----------+---------+

A much bet­ter al­ter­na­tive is to tell to Python what we need in the first place, which is very easy with the table li­brary. The li­brary is based on for­mat­ting files which de­fine ex­act­ly how the table will be dis­played. There is a for­mat­ting class for near­ly bor­der­less table, and com­plex for­mat­ting class­es which rely on col­or, but not what we need—a sim­ple text table with “+,” “-,” and “|.”

Let's cre­ate one. The de­fault for­mat­ter looks like this:

The screenshot of the source code

By copy­ing it as text_formatting.py and mod­i­fy­ing it like this:

The screenshot of the source code

and the orig­i­nal source code like this:

#!/usr/bin/env python

import csv
import table

with open("SampleData.csv") as f:
    reader = csv.reader(f)
    layout = table.Table(next(reader), reader)
    text = layout.display(formatting=table.TextFormatting())
    print(text)

we ob­tain:

| OrderDate | Region  | Rep      | Item    | Units | Unit Cost | Total   |
| 1/6/14    | East    | Jones    | Pencil  | 95    | 1.99      | 189.05  |
| 1/23/14   | Central | Kivell   | Binder  | 50    | 19.99     | 999.50  |
| 2/9/14    | Central | Jardine  | Pencil  | 36    | 4.99      | 179.64  |
     ⋮         ⋮        ⋮         ⋮      ⋮       ⋮           ⋮
| 11/17/15  | Central | Jardine  | Binder  | 11    | 4.99      | 54.89   |
| 12/4/15   | Central | Jardine  | Binder  | 94    | 19.99     | 1879.06 |
| 12/21/15  | Central | Andrews  | Binder  | 28    | 4.99      | 139.72  |
+-----------+---------+----------+---------+-------+-----------+---------+

No need for sed any longer:

python3 ./sample-data.py | tail -n2 | head -n1
python3 ./sample-data.py | head -n1
python3 ./sample-data.py | tail -n2 | head -n1
python3 ./sample-data.py | tail -n+2

Fast, faster, fastest

Now the re­sults:

Task Du­ra­tion (min:sec)
Time to find an Ex­cel spread­sheet and ex­port it as CSV 4:20
Cre­ate the orig­i­nal script and mod­i­fy the li­brary 1:50
Write orig­i­nal bash with un­read­able seds 3:10
Cre­ate a for­mat­ter for the li­brary 2:40
Op­ti­mal path from a CSV file to the fi­nal so­lu­tion 4:30

Per­son­al­ly, I'm im­pressed by the re­sults. Less than five min­utes cre­ate a script which process­es the CSV file and out­puts it ac­cord­ing to the re­quired for­mat is a much short­er time lapse than I would ex­pect.

If the Ex­cel file con­tains three rows and three columns, it's ob­vi­ous­ly faster to do the task by hand. But for any av­er­age file, script­ing ap­pears an at­trac­tive al­ter­na­tive. Do I need to men­tion the fact that the script can be reused?

I was cheat­ing

While the re­sults are in­ter­est­ing, I should men­tion that I cheat­ed. They are true: I ac­tu­al­ly mea­sured the time it took me to do the task. But the con­text in which they were mea­sured have noth­ing to do with the con­text of my col­league.

When mea­sur­ing the time it takes to per­form those tasks:

  1. I wasn't un­der pres­sure. While I was try­ing to do those tasks as fast as pos­si­ble, there was no en­vi­ron­men­tal pres­sure. I could have com­plete­ly failed the task, and it wouldn't both­er me much. My col­league, on the oth­er hand, was prob­a­bly do­ing a job for a cus­tomer who was ex­pect­ing re­sults, and ex­pect­ing them soon.

  2. I was in a calm en­vi­ron­ment. Not in an of­fice. Slight­est noise has a huge im­pact on pro­duc­tiv­i­ty, and I wouldn't be sur­prised that it takes twice as much time to per­form the same task in a noisy en­vi­ron­ment.

  3. I was in an en­vi­ron­ment de­signed for pro­duc­tiv­i­ty. Two large mon­i­tors, ex­cel­lent key­board, com­fort­able chair. I don't think this has the same ef­fect as a qui­et en­vi­ron­ment, but still, per­form­ing the same task in a dif­fer­ent en­vi­ron­ment is not that easy.

  4. I used Lin­ux. This means that I had Python, and I had bash. And that alone is a huge ben­e­fit. My col­league had Win­dows. In­stalling Python would take min­utes. Fig­ur­ing out how to for­mat the out­put sim­i­lar­ly to sed would take min­utes too. I'm sure Pow­er­Shell has sim­i­lar ca­pa­bil­i­ty, but not know­ing Pow­er­Shell well enough, I would have failed the task mis­er­ably.

  5. And fi­nal­ly the most im­por­tant point: I knew the table li­brary. I de­signed it. Hav­ing to dis­cov­er the li­brary for the first time will take min­utes, and try­ing to un­der­stand how to mod­i­fy it will take time as well. On the oth­er hand, I knew ex­act­ly how to use and ex­tend it for the pur­pose of the test.