Review and refactor

Arseni Mourzenko
Founder and lead developer
July 16, 2018
Tags: code-review, unit-testing, reliability

SoftwareEngineering.SE is a good place for interesting questions. One of the recent questions about unit tests was particularly interesting: not for its unit testing part, but for giving a small piece of code which may seem particularly straightforward and linear, but which could still have been refactored in order to mitigate the risk of errors.

Let's try to answer the question and take this opportunity to do a deep review and correct the piece of code concerned. The original code is:

public static string SHA256(string plainText)
{
    StringBuilder sb = new StringBuilder();

    SHA256CryptoServiceProvider provider = new SHA256CryptoServiceProvider();
    var hashedBytes = provider.ComputeHash(Encoding.UTF8.GetBytes(plainText));

    for (int i = 0; i < hashedBytes.Length; i++)
    {
        sb.Append(hashedBytes[i].ToString("x2").ToLower());
    }

    return sb.ToString();
}

Throughout this article, we will progressively modify this piece of code in order to test it, as well as simplify it and move the burden of doing the actual work to the .NET Framework itself.

This article refers to SVN revisions. You can get the repository dump (15 KB) in order to track the changes yourself.

If you don't know how to restore SVN dumps, you may visit a wiki about it.

  1. What unit tests do we need?

--

Personally, I'll write only three tests, and reduce the need for other ones by modifying the code itself. The tests I'll write ensure that the method behaves as expected.

A more formal approach would be to follow the NIST standard. In my opinion, for this project, it is overkill. It would be different if we were reimplementing SHA256 generation without using the one available in the .NET Framework.

Note that the fact that I don't use TDD for this project affects the choice of tests. For example, if I were writing the tests first, without knowing what the actual implementation would be, I would have tested different locales, Unicode strings, etc. Currently, given the implementation, I know that I don't need those tests.

■ See revision 2 containing both the method and the three unit tests.
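
As an illustration, here is a minimal sketch of what those tests could look like with MSTest, assuming the method lives in a hypothetical static Hash class; the actual tests in revision 2 may be written differently. The expected values are the well-known SHA-256 vectors for "abc" and the empty string.

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class HashTests
{
    [TestMethod]
    public void ComputeSHA256OfKnownString()
    {
        // Well-known SHA-256 vector for "abc".
        Assert.AreEqual(
            "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
            Hash.SHA256("abc"));
    }

    [TestMethod]
    public void ComputeSHA256OfEmptyString()
    {
        // Well-known SHA-256 vector for the empty string.
        Assert.AreEqual(
            "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
            Hash.SHA256(""));
    }

    [TestMethod]
    [ExpectedException(typeof(ArgumentNullException))]
    public void ComputeSHA256OfNullString()
    {
        Hash.SHA256(null);
    }
}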

  2. What about input checking?

--

The unit test ComputeSHA256OfNullString is not testing our method. The method never checks its input and passes it straight to the .NET Framework, which means that the logic covered by this test is missing from the method. It simply happens that GetBytes throws the exception we want when the input is null. In other words, the behavior we are testing is accidental: it belongs to GetBytes, not to our code.

Let's cover the null case ourselves. By checking the input for null in our own code, we become independent of other methods, and make the code easier to understand and safer for future changes.
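
A simple guard clause at the top of the method is enough; here is a minimal sketch of what revision 3 could add (the exact shape of the check in the revision may differ):

if (plainText == null)
{
    // Fail fast with our own exception instead of relying on GetBytes to throw it.
    throw new ArgumentNullException("plainText");
}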

■ See revision 3 for input checking.

  3. Are variable names explicit enough?

--

The original code is well formatted and follows Microsoft's style conventions. On the other hand, there are improvements to be made when it comes to naming.

Short names are good, but understandable names are even better. sb or SHA256 are not explicit enough. builder or ComputeSHA256Hash are slightly better (even if they are still not the best ones we can imagine).
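
For example, after renaming, the method could look like this; a sketch only, since revision 4 may use different names (the guard clause from revision 3 is kept):

public static string ComputeSHA256Hash(string plainText)
{
    if (plainText == null)
    {
        throw new ArgumentNullException("plainText");
    }

    StringBuilder builder = new StringBuilder();
    SHA256CryptoServiceProvider provider = new SHA256CryptoServiceProvider();
    byte[] hashedBytes = provider.ComputeHash(Encoding.UTF8.GetBytes(plainText));

    for (int i = 0; i < hashedBytes.Length; i++)
    {
        builder.Append(hashedBytes[i].ToString("x2").ToLower());
    }

    return builder.ToString();
}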

■ See revision 4 where variable names were renamed.

■ See also: Code Complete, Second Edition · Chapter 11: The Power of Variable Names

  4. Is the code correct?

--

Now that the cosmetic changes are done, it's finally time to study the code itself.

It's pretty straightforward. There is a crypto service provider which computes the hash of the UTF-8 bytes of the input, a loop over the hashed bytes, and a StringBuilder which accumulates their hexadecimal representation.

The first thing to do is to check the code with Code analysis. Surprise: the static analysis of the code reveals two important problems, both related to culture. Neither the ToString("x2") call nor the ToLower call specifies which culture to use.

One way to solve this issue is to specify the invariant culture. Another way is to continue thinking about the original intention of the developer and what the code actually does.

someByte
    .ToString("x2", CultureInfo.InvariantCulture)
    .ToLower(CultureInfo.InvariantCulture)

is not doing what the author expected it to do.

Indeed, the text is converted to lowercase, while it is already lowercase, given the "x2" format specifier.
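
Concretely, the redundant call can simply be dropped; a sketch of the appended value, assuming the loop variable is named hashedByte (the culture argument to ToString is kept to satisfy the static analysis):

// "x2" already produces lowercase hexadecimal digits, so ToLower is redundant.
builder.Append(hashedByte.ToString("x2", CultureInfo.InvariantCulture));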

Conclusion: code which does something different from what was intended, while still producing the same result as a correct implementation, is one of the ugliest cases which may happen. Since compilation, testing and static analysis are often unable to catch the difference, the code is assumed correct, until the error reveals itself at the moment when nobody expects it.

To avoid such errors, don't rely purely on unit tests. As you can see, static checking and an informal code review (such as the one done in this article) helped find bugs which were difficult or impossible to detect with unit tests.

Unit testing is just one of the ways to find defects, and none of the techniques can achieve defect-detection rates above 75%.

■ See also: Code Complete, Second Edition, page 470 (Chapter 20: The Software-Quality Landscape)

The method ends with a for. As I previously explained on SoftwareEngineering.SE, for and foreach are two completely different beasts: foreach is closer to a while than to a for.

The usage of for in the current case is problematic for two reasons: the code needs every element of the sequence but never the index itself, and the explicit indexing is just additional noise where an off-by-one mistake can slip in.

Use for when appropriate (that is, at most 2% of the cases in business-oriented applications). Use foreach when iterating through every element of a sequence.
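
Applied to our method, the loop could then read as follows; a sketch of the shape the code takes after the culture and redundancy fixes:

foreach (byte hashedByte in hashedBytes)
{
    builder.Append(hashedByte.ToString("x2", CultureInfo.InvariantCulture));
}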

■ See revisions 5, 6 and 8 for the changes related to culture, redundant code and the migration from a for to a foreach.

  5. Do we really need three unit tests?

--

The less code we have to test, the better. It appears that with .NET Framework 4, there is a way to change the method in order to make one of the three unit tests useless.

By using Code contracts instead of a manual check of the inputs, we move the parameter validation from our code to .NET (not counting the benefit in terms of static analysis).
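
As a sketch, and assuming the method still lives in the static Hash class used above, the precondition could replace the guard clause like this; the contracts actually added in revisions 7 and 9 may be more complete:

using System;
using System.Diagnostics.Contracts;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class Hash
{
    public static string ComputeSHA256Hash(string plainText)
    {
        // The precondition replaces the manual guard clause; the Code Contracts
        // rewriter turns it into a runtime check, and the static checker can
        // verify callers against it.
        Contract.Requires<ArgumentNullException>(plainText != null);
        Contract.Ensures(Contract.Result<string>() != null);

        StringBuilder builder = new StringBuilder();
        SHA256CryptoServiceProvider provider = new SHA256CryptoServiceProvider();

        foreach (byte hashedByte in provider.ComputeHash(Encoding.UTF8.GetBytes(plainText)))
        {
            builder.Append(hashedByte.ToString("x2", CultureInfo.InvariantCulture));
        }

        return builder.ToString();
    }
}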

■ See revisions 7 and 9 where the guard clause was replaced by a Code contract, allowing us to remove one of the unit tests while reducing complexity, and where some other Code contracts were added for additional reliability.

■ See also: C# in Depth by Jon Skeet, Second Edition · Chapter 15: Letting your code speak more clearly with Code Contracts

  6. for, foreach, nothing.

--

We previously replaced a for by a foreach. An additional step is to remove the foreach. C# has LINQ, which is a powerful way to work with sequences while using paradigms inspired by functional languages.

Given that the hash is nothing more than a sequence of bytes, and the result (a string) nothing more than a sequence of characters, LINQ is a good choice when it comes to transforming one sequence into another.
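
For example, the whole conversion can be written as a single expression; this is only a sketch, and revision 12 may phrase it differently (the profiling section below mentions a new string(...) call):

// Requires using directives for System.Linq and System.Globalization.
string hash = string.Concat(
    hashedBytes.Select(b => b.ToString("x2", CultureInfo.InvariantCulture)));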

■ See revision 12 where LINQ reduces the complexity and removes the foreach.

  7. Wait, aren't we slow now?

--

In revision 12, we moved from a loop to a chain of LINQ calls, which also led us to stop using StringBuilder, which is known for being a much faster alternative to concatenation when appending lots of strings.

Performance is an important aspect of programming, but in practice, it is often overrated and neglected at the same time. Programmers may spend hours arguing about which implementation is faster, when the difference between them is at most a few microseconds, and those same programmers will write code which could have been optimized very easily.

I won't explain in detail how the expected performance should be determined, how the actual performance should be measured, and how optimization should be done. Let's simply note that assumptions about performance are often wrong until they are verified with a profiler.

The actual piece of code is a perfect example. Knowing that StringBuilder has a reputation for being fast, one could assume that revision 12 deteriorated the performance of the code, and that StringBuilder should be reintroduced to obtain decent speed.

Running a profiler, it appears that, as usual, I was wrong. The bottleneck is neither the new string(...) nor the LINQ expression, but the constructor of the SHA256 crypto service provider (see Fig. 1).

Fig 1. The original bottleneck
ZIP file with VSP file (1.1 MB)

Indeed, creating a crypto service provider may not be an easy task, while creating a string composed of 64 characters can be done pretty fast. The primary optimization is to create the crypto service provider only once.

At this point, we are hit by the fact that until now, the utility class as well as the method itself were static. Static methods are good for certain tasks, but this is not one of them, especially since the generation of SHA256 hashes can so easily be modeled in an object-oriented way.

Moving from static objects to instances makes it possible to initialize the crypto service provider only once, and then call the computation method multiple times. This simple change divides the time required to compute a hash by 2.7 (see Fig. 2).
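
Here is a minimal sketch of the instance-based shape, with illustrative class and member names (revision 14 may differ):

using System;
using System.Diagnostics.Contracts;
using System.Globalization;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed class SHA256HashGenerator : IDisposable
{
    // The crypto service provider is now created once per instance
    // instead of once per hash computation.
    private readonly SHA256CryptoServiceProvider provider =
        new SHA256CryptoServiceProvider();

    public string ComputeHash(string plainText)
    {
        Contract.Requires<ArgumentNullException>(plainText != null);

        byte[] hashedBytes = this.provider.ComputeHash(Encoding.UTF8.GetBytes(plainText));
        return string.Concat(
            hashedBytes.Select(b => b.ToString("x2", CultureInfo.InvariantCulture)));
    }

    public void Dispose()
    {
        this.provider.Dispose();
    }
}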

Fig 2. The optimization by a factor of 2.7.
ZIP file with VSP file (6.8 MB)

■ See revision 14 which abandons static classes and methods, allowing us to optimize the method and multiply its performance by 2.7.

We can do better. With a tiny modification, which consists of moving from values computed on the fly to a lookup table, another optimization by a factor of 2 becomes possible (see Fig. 3).
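
The idea is to precompute the hexadecimal representation of all 256 possible byte values once, then simply look each byte up. A sketch of what this could look like inside the class above; the actual revision may differ in details:

// Built once for the whole application: the lowercase hexadecimal
// representation of every possible byte value.
private static readonly string[] HexTable =
    Enumerable.Range(0, 256)
        .Select(value => value.ToString("x2", CultureInfo.InvariantCulture))
        .ToArray();

public string ComputeHash(string plainText)
{
    Contract.Requires<ArgumentNullException>(plainText != null);

    byte[] hashedBytes = this.provider.ComputeHash(Encoding.UTF8.GetBytes(plainText));

    // Look the strings up instead of formatting every byte on the fly.
    return string.Concat(hashedBytes.Select(b => HexTable[b]));
}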

■ See also: Code Complete, Second Edition · Chapter 18: Table-Driven Methods

Fig 3. An additional optimization by a factor of 2.
ZIP file with VSP file (10.0 MB)

There are probably many other optimizations we could make. Some would come at the cost of readability, others would waste more memory, others would have nearly no negative impact. The fact is that the method looks pretty fast to me now, since on my machine, it performs above my expectations. When doing profiling and optimization, know when to stop. Spending days in order to gain a few milliseconds of execution time is not worth the effort (see Fig. 4).

Fig 4. The actual performance matches the requirements.
ZIP file with VSP file (2.8 MB)

  8. What do we get?

--

It's time to study the result of the refactoring. What do we get? Is it actually better than the original version?

Here is a comparison between the original and the latest versions, i.e. revision 2 and revision 18:

Metric                                         Better is   Revision 2   Revision 18
Code coverage of the library                   higher      100%         100%
Code coverage of unit tests                    higher      87.5%        100%
Number of Code contracts assertions            higher      14           789
Number of unknown Code contracts assertions    lower       1            0
Maintainability index of the library           higher      65           68
Maintainability index of unit tests            higher      83           77
Cyclomatic complexity of the library           lower       2            8
Cyclomatic complexity of unit tests            lower       4            5
Depth of inheritance of the library            lower       1            1
Depth of inheritance of unit tests             lower       1            1
Class coupling of the library                  lower       4            8
Class coupling of unit tests                   lower       5            4
ILLOC of the library                           lower       4            8
ILLOC of unit tests                            lower       8            11
Code analysis violations                       lower       5            2

Based on those metrics, it appears that the modification is not really worth it. It improved some metrics, but deteriorated others. What those metrics don't show is that the new code checks its inputs explicitly, is easier to read, and computes hashes several times faster.

Each of those three points makes it totally worth the effort.