Refactoring

Arseni Mourzenko
Founder and lead developer
161
articles
November 10, 2014
Tags: refactoring 11

Ab­stract

It is not un­usu­al to find a piece of code which has se­vere is­sues and can­not be main­tained as is. It can ac­tu­al­ly come from a less ex­pe­ri­enced col­league, a low qual­i­ty in­ter­net re­source, or even… your­self. In all those cas­es, the code re­quires refac­tor­ing. This ar­ti­cle will give you a quick pre­view of the refac­tor­ing process and how to ap­proach this task which may seem com­plex in an or­ga­nized way.

Why to refac­tor? It is a valid ques­tion to ask your­self, es­pe­cial­ly when you have to jus­ti­fy to man­age­ment “wast­ing” days or weeks or months do­ing a task which in de­fin­i­tive will not make any­thing new, and not im­ple­ment any cool fea­ture for the users. Still, refac­tor­ing pur­sues sev­er­al goals:

As an ex­am­ple for this ar­ti­cle, I take my own ap­pli­ca­tion I’ve writ­ten a few years ago. Your own lega­cy code is al­ways fun to refac­tor, and also self-sat­is­fac­to­ry: you dis­cov­er that a short lapse of time ago, you were so much in­ex­pe­ri­enced com­pared to now, that you can hope, that in a few years, you’ll be ways ex­pe­ri­enced com­pared to to­day too. So, let’s see what we can do with my lega­cy code which is old and ugly and can­not be rea­son­ably used as is.

To fol­low what I’m do­ing, get the source code of HtmlChar­ac­ters with SVN.

Re­vi­sion 1 cor­re­sponds to the state when I al­ready refac­tored the code to match the style sug­gest­ed by Style­Cop. This refac­tor­ing step is not in­ter­est­ing to dis­cuss: you sim­ply walk through the ugly code and en­force ba­sic rules re­lat­ed to style. For ex­am­ple, in the past when I wrote this piece of code, I used to cap­i­tal­ize fields (see SA1306) or use .NET types in­stead of their C# analogs (see SA1121), etc. In gen­er­al I make those changes when I’m too tired to do real work: you must do it, but it’s bor­ing and mo­not­o­nous and re­quires an IQ of a mon­key.

Step 1
Get­ting the overview of the code

The first thing is to have a gen­er­al idea of the code base. Dif­fer­ent peo­ple use dif­fer­ent means for that. Per­son­al­ly, I us class di­a­grams a lot. In the cur­rent case, it’s still not very help­ful: the only thing I can in­fer is that the ap­pli­ca­tion is mono­lith­ic and pro­ce­dur­al; every­thing seems to be in FormMain.

The class diagram of the original code base

A quick view of the ap­pli­ca­tion it­self and its main win­dow helps un­der­stand­ing what it does. It has two forms, a main form and a form which shows “About us” info. The main form en­ables the user to do three things: search for a char­ac­ter, con­vert char­ac­ters and high­light them.

Step 2
Ar­chi­tec­tur­ing it the clean way (re­vi­sion 2)

The goal is to make the ap­pli­ca­tion more main­tain­able. As I said above, the ap­pli­ca­tion is mono­lith­ic, every­thing be­ing in a sin­gle class. It’s not cool.

A quick draft can be made to de­sign this ap­pli­ca­tion prop­er­ly. One class will load char­ac­ters from a data source, an­oth­er one will search for oc­cur­rences, and fi­nal­ly an­oth­er one, in­her­it­ed from the pre­vi­ous one, will re­place the oc­cur­rences.

Let’s do it.

The class diagram of three new classes

The loader

The loader ab­stracts the ac­cess to the data. We don’t care if the source is an SQL table, an XML file or re­source, or what­so­ev­er. We just need the list of char­ac­ters. This means that, hav­ing seen the con­tents of chars.xml, I don’t re­al­ly care about the struc­ture when I de­fine the Char­ac­ter class, which rep­re­sents the char­ac­ters re­turned from the data source. For ex­am­ple, chars.xml has a list of groups, ref­er­enced then by ID by the el­e­ments cor­re­spond­ing to the char­ac­ters, but the caller of the loader doesn’t have to know it.

In fi­nal, a lin­ear struc­ture seems more ap­pro­pri­ate to ex­pect from the loader. An­oth­er way would be to group char­ac­ters by groups, and re­turn the list of groups, but I be­lieve it would be more in­tu­itive to use a flat list of char­ac­ters. Ul­ti­mate­ly, it’s not so im­por­tant: I can al­ways do a GroupBy or a SelectMany on the re­sult.

If we were build­ing a sys­tem in­tend­ed for reuse, I would have cre­at­ed an ILoader in­ter­face. This would en­able in fu­ture to cre­ate oth­er load­ers, with­out chang­ing the rest of the code. In the cur­rent sce­nario where the data has strong chances to re­main in a sin­gle form, I con­sid­er this not use­ful, and don’t want to over­ar­chi­tec­ture the ap­pli­ca­tion.

Searcher and re­plac­er

The searcher search­es for char­ac­ters and re­turns the info in­di­cat­ing, for every char­ac­ter found, its po­si­tion, form, etc. The re­plac­er just does the re­place­ment, and re­turns the new text. Mul­ti­ple ar­chi­tec­tur­al ap­proach­es can be used here: a re­plac­er can call a searcher for ex­am­ple, or there is only a sin­gle class which does both search and re­place. For me, in­her­it­ing the re­plac­er from the searcher is the most in­tu­itive, but it’s ex­clu­sive­ly my per­son­al opin­ion, and can be crit­i­cized with valid ar­gu­ments.

Both are filled with char­ac­ters to search/re­place, and then the main method is called. This means that the same search/re­place ob­ject can be reused for mul­ti­ple strings.

The search method re­turns the set of oc­cur­rences. The oc­cur­rence is de­signed to con­tain all the in­for­ma­tion we need: the char­ac­ter, the form, etc., which al­lows pro­cess­ing the in­for­ma­tion lat­er much eas­i­er com­pared to less ver­bose struc­tures.

The class diagram of three new classes and their members

Step 3
Test­ing (re­vi­sion 3)

Chang­ing the ex­is­tent code­base which lacks unit tests is like walk­ing through a room full of nee­dles with your eyes closed. You might reach the des­ti­na­tion safe­ly, but you take a great risk which you can eas­i­ly avoid.

When the code­base al­ready has tests, it’s a good start. Oth­er­wise, it is a good idea to cre­ate tests be­fore even start­ing the refac­tor­ing. If those tests pass on lega­cy code, it’s easy to find what refac­tor­ing breaks them.

In the case of the pro­ject I il­lus­trate, there are no tests, and it’s not easy to cre­ate ones to test lega­cy code. In­stead, I will build new tests for new code only, which will cor­re­spond to the re­quire­ments that the lega­cy code prob­a­bly matched.

A note apart: I’m us­ing a very strange con­ven­tion to name the unit test meth­ods. In­stead of us­ing a verb, as ex­pect­ed for a name of a method, some­times I use a name. In my opin­ion, this con­ven­tion is vi­able for unit tests, since unit tests do one thing: en­sure that some­thing works. So I sim­ply omit the “En­sure” pre­fix; for ex­am­ple, in­stead of writ­ing “En­sure­Can­Cre­ate­Cat­e­go­ry” for a method which en­sures that a CRUD busi­ness ap­pli­ca­tion can cre­ate a cat­e­go­ry, I will sim­ply write “Can­Cre­ate­Cat­e­go­ry”.

To be­gin, I write 16 unit tests which cor­re­spond ex­clu­sive­ly to func­tion­al re­quire­ments. Since the ap­pli­ca­tion is a small desk­top ap­pli­ca­tion, which is not busi­ness crit­i­cal and with no spe­cif­ic qual­i­ty re­quire­ments, this is large­ly enough. For a more se­ri­ous ap­pli­ca­tion, more unit tests would be re­quired, both to check func­tion­al and non func­tion­al re­quire­ments.

All 16 unit tests fail. This is im­por­tant: if they suc­ceed even be­fore I start im­ple­ment­ing code, there is some­thing wrong with my tests.

Step 4
Mi­grat­ing code

Loader (re­vi­sion 4)

Let’s ex­plore the code which loads the data from the XML file.

The data is loaded in LoadData(), re­vi­sion 3, FormMain.cs, line 37. This method is pret­ty ac­cept­able, but con­tains er­rors. For ex­am­ple, StringReader, which im­ple­ments IDisposable, is not dis­posed af­ter it’s used. Those er­rors must be cor­rect­ed dur­ing refac­tor­ing. Some oth­er changes can be made. For ex­am­ple, XDocument can be used in­stead of XmlDocument to sim­pli­fy the source code.

Af­ter load­ing the XML, this method calls re­cur­sive­ly AddTreeNode, which is frankly to­tal­ly in­com­pre­hen­si­ble with its 45 IL LOC, cy­clo­mat­ic com­plex­i­ty of 18 and main­tain­abil­i­ty in­dex of 39. Why re­cur­sive­ly? Since the items are added to a list, and not a tree, this feels very strange. With­out in­ves­ti­gat­ing fur­ther, I rewrite this method from scratch.

Re­sult: a new LoadData which con­tains only two ma­jor lines: the first one is a LINQ query which con­verts the char­ac­ters into groups and items to add to the list, and the sec­ond one which adds those groups and items. I also add sev­er­al meth­ods in oth­er class­es, all ex­treme­ly small and easy to un­der­stand. Last but not least, MainForm los­es two pri­vate fields which were not use­ful at class scope.

Is LINQ a read­abil­i­ty im­prove­ment here? I be­lieve so. LINQ queries fit well for the sce­nar­ios where you ma­nip­u­late data by trans­form­ing it step by step. This is the case here, where a list of char­ac­ters is pro­gres­sive­ly trans­formed into an enu­mer­a­tion of ListViewGroup’s and an enu­mer­a­tion of ListViewItem’s, items be­ing as­so­ci­at­ed to groups. The same code could be writ­ten with two loops, and ac­tu­al­ly re­sults in a bet­ter main­tain­abil­i­ty in­dex, but I’m still con­vinced that for a de­vel­op­er fa­mil­iar with LINQ, my so­lu­tion would be more read­able.

Searcher and re­plac­er (re­vi­sion 5)

If you ex­plore the ac­tu­al code, you see that the search is based on reg­u­lar ex­pres­sions. Those reg­u­lar ex­pres­sions are gen­er­at­ed by GetReplacingPatterns method. I start to move the method to Searcher class, and change it to use char­ac­ters in­stead of the list view.

private void GetReplacingPatterns(out string plainCharacters,
    out string codes, out string ids)
{
    plainCharacters = string.Empty;
    codes = "&(";
    ids = "&#(";
    for (int a = 0; a < this.characters.Count; a++)
    {
        Character c = this.characters[a];
        plainCharacters += c.Value + "|";
        codes += c.Code + "|";
        ids += c.ID + "|";
    }

    plainCharacters = plainCharacters
        .Remove(plainCharacters.Length - 1);
    codes = codes.Remove(codes.Length - 1);
    codes += ");";
    ids = ids.Remove(ids.Length - 1);
    ids += ");";
}

The lega­cy code uses a for loop. C# pro­vides a foreach loop which is more ap­pro­pri­ate in this case. Then, when foreach is used, some­times re­plac­ing it by a LINQ query can make the code eas­i­er to read. This is the case here: we are just build­ing three strings through a loop, an ac­tion which is re­peat­ed on every el­e­ment of the loop, and a fi­nal con­cate­na­tion. Why not do­ing the same thing in a con­cise method which will be called from oth­ers?

private string BuildPattern(string format,
    Func<Character, object> action)
{
    return string.Format(format,
        string.Join("|", this.characters
            .Select(action)
            .Select(c => c.ToString())
            .ToArray()));
}

What this does is ex­act­ly the same as the thir­teen-line mon­ster, but, well, bet­ter. In­stead of cre­at­ing three strings and re­turn­ing them through out pa­ra­me­ters, we are sim­ply cre­at­ing one of them and re­turn it to the caller. The caller has an easy to read code too. Among three lines where the method is called, the most dif­fi­cult one is:

string codes = this.BuildPattern(
    "&({0});",
    c => c.Code.Substring(1, c.Code.Length - 2));

Oth­er two lines are:

string symbols = this.BuildPattern("&#({0});", c => c.ID);

and:

string values = this.BuildPattern("{0}", c => c.Value);

Now that we trans­formed that, let’s refac­tor the re­main­ing method which does the con­ver­sion:

private void ButtonConvert_Click(object sender, EventArgs e)
{
    string convertText = tboxConvertFrom.Text;

    int indexFrom = this.GetTransformTypeSide(
        comboConvertFrom.Text);
    int indexTo = this.GetTransformTypeSide(
        comboConvertTo.Text);

    // Build replacing patterns for Regex
    string plainCharacters, codes, ids;
    this.GetReplacingPatterns(out plainCharacters, out codes,
        out ids);

    try
    {
        convertText = Regex.Replace(
            convertText,
            (indexFrom == 0) ?
                plainCharacters :
                ((indexFrom == 1) ? ids : codes),
            match =>
            {
                for (
                    int x = 0;
                    x < this.listView.Items.Count;
                    x++)
                {
                    ListViewItem lvi = this.listView.Items[x];
                    if (lvi.SubItems[indexFrom].Text ==
                        ((indexFrom == 0) ? match.Value :
                            match.Groups[1].Value))
                    {
                        return (indexTo == 0) ?
                            lvi.SubItems[indexTo].Text :
                            (((indexTo == 2) ? "&" : "&#")
                                + lvi.SubItems[indexTo].Text + ";");
                    }
                }

                return match.Value;
            });
    }
    catch (ArgumentException ex)
    {
        // Syntax error in the regular expression.
        convertText = "Transform failed: " + ex.Message;
    }

    this.tboxConvertTo.Text = convertText;
}

This method has sev­er­al is­sues:

To refac­tor this, we move it to Searcher and Replacer class, and split into sev­er­al meth­ods. The main method, FindOccurrences, would stay pret­ty short and straight­for­ward:

public IEnumerable<Occurrence> FindOccurrences(string text)
{
    var codesAndSymbols = this.FindCodesAndSymbols(text)
        .ToList();

    var starts = new HashSet<int>(codesAndSymbols
        .Select(c => c.At));
    string values = this.BuildPattern("{0}", c => c.Value);
    var valueOccurrences = this.FindCharactersOfForm(
        text, values, CharacterForm.Value,
        (match, c) => c.Value == match.Value.First())
        .Where(c => c.Text != "&" || !starts.Contains(c.At));

    return codesAndSymbols.Union(valueOccurrences)
        .OrderBy(c => c.At);
}

First, it finds both codes and sym­bols in text. Since, when high­light­ing en­ti­ties to the user, the char­ac­ter “&” must not col­lide with the en­cod­ed en­ti­ties (like “é”), we must know where codes and sym­bols are be­fore search­ing for val­ues.

Af­ter search­ing for all three types, we fi­nal­ly re­turn the union of three sets.

The Replacer is even eas­i­er:

public string ReplaceOccurrences(string text,
    CharacterForm from, CharacterForm to)
{
    foreach (var occurrence in this.FindOccurrences(text)
        .Where(c => c.Form == from)
        .OrderByDescending(c => c.At))
    {
        text = text.Remove(occurrence.At)
            + conversions[to](occurrence.Character)
            + text.Substring(occurrence.At + occurrence.Length);
    }

    return text;
}

But dur­ing this refac­tor­ing, we in­tro­duce a new bug which was nonex­is­tent in the lega­cy code base. The new bug is well il­lus­trat­ed by the unit test Re­plac­erTests/Re­placeAm­per­sands. It must re­place val­ues to codes, which means that “á” must be re­placed into “á”. Sad­ly, the new searcher looks for every form of char­ac­ter, and has a pref­er­ence for codes and sym­bols over char­ac­ters.

As un­pro­fes­sion­al as it can be, I will keep this bug for the mo­ment, and solve it lat­er. But don’t do it in a com­mer­cial pro­ject.

Step 5
Data bind­ing (re­vi­sion 6, 7, 8 and 15)

Data bind­ing is a great way to im­prove code read­abil­i­ty and ease of main­te­nance. In­stead of car­ry­ing about up­dat­ing the prop­er­ties of con­trols and get­ting the prop­er­ties back when they change, you just cre­ate a data bind­ing and for­get about it.

This forces me to make slight change to the code out­side the main win­dow. In Char­ac­ter class, the con­ver­sion of a char­ac­ter to a list view item puts the char­ac­ter it­self in the Tag prop­er­ty. Now, the char­ac­ter prop­er­ties are eas­i­er to re­trieve from the se­lect­ed el­e­ment.

I also cre­ate For­m­Main­Da­ta class which is in­tend­ed to con­tain all the prop­er­ties to bind the prop­er­ties of con­trols too.

The sixth re­vi­sion is lim­it­ed to the se­lect­ed char­ac­ter in the list view. The sev­enth re­vi­sion adds data bind­ing for fil­ters and is more help­ful in un­der­stand­ing why data bind­ing is ben­e­fi­cial to code qual­i­ty. The pre­vi­ous re­vi­sion just made the code longer and more dif­fi­cult to read.

The eighth re­vi­sion is a good ex­am­ple of a refac­tor­ing of a sin­gle method which is lim­it­ed to rewrit­ing the code by chang­ing its struc­ture. In Ini­tial­ize­Bind­ings, I had three repet­i­tive lines fol­lowed by four repet­i­tive lines, which is frankly a copy-paste code and vi­o­lates the DRY prin­ci­ple. Even if I could keep this code as is, since it’s short enough, I pre­fer to refac­tor it. The first three lines can be short­ened with a helper method that I cre­ate for this pur­pose: Bind­S­e­lec­tion­La­bel. The next four lines are even eas­i­er: I can just ex­tract the dif­fer­ences to an ar­ray of anony­mous types and walk through it with a foreach loop.

Step 6
Re­moval of un­need­ed fea­tures (re­vi­sion 9)

Refac­tor­ing is not only about mod­i­fy­ing the source code with­out af­fect­ing the fea­tures. If a fea­ture is not need­ed, I would pre­fer to re­move it as quick­ly as pos­si­ble. While chang­ing the li­cense to MS-PL, I also re­move the About di­a­log. The only thing it does is show­ing the name and the logo of the com­pa­ny, and oth­er stuff the user nev­er needs. A mes­sage box is large­ly enough for this info. A logo can be re­moved too, re­duc­ing the size of the ex­e­cutable.

Step 7
Style­Cop and Code Analy­sis
(re­vi­sions 10 and 11)

This is an im­por­tant part of the refac­tor­ing. With­out uni­form style, the code is still more dif­fi­cult to read. As for Code Analy­sis, it shows some pre­cious info, in­clud­ing warn­ings re­lat­ed to in­ter­na­tion­al­iza­tion, etc.

Do I need to en­force those rules be­fore or af­ter the refac­tor­ing?

In gen­er­al, I en­force some style rules be­fore and every­thing else dur­ing and af­ter the refac­tor­ing. Do­ing most of the work af­ter refac­tor­ing has sev­er­al ad­van­tages:

What’s next?

For larg­er pro­jects, refac­tor­ing would re­quire to sort the pro­ject files into the di­rec­to­ries for bet­ter or­ga­ni­za­tion. The cur­rent pro­ject is too small and does not re­quire that. In­stead, the class­es re­lat­ed to search and re­place can be moved to a sep­a­rate pro­ject, since I in­tend reusing them lat­er in oth­er pro­jects where I used un­re­lat­ed code.

The cur­rent code is fi­nal­ly read­able enough to be mod­i­fied in or­der to add fea­tures or re­move bugs.

Class di­a­gram of source code re­main­ing in the ap­pli­ca­tion

The class diagram of the remaining code

Class di­a­gram of what was ex­tract­ed to the sep­a­rate li­brary

The class diagram of the extracted code