Code duplication

Arseni Mourzenko
Founder and lead developer
176
articles
November 9, 2014
Tags: refactoring 12

Be­gin­ners' code is al­ways an amaz­ing op­por­tu­ni­ty to re­move code du­pli­ca­tion, as well as ex­plain what code du­pli­ca­tion is to the orig­i­nal au­thor.

I was re­view­ing a few thou­sand LOC, so I thought this is an op­por­tu­ni­ty to post an ex­am­ple and to show how should it be writ­ten in­stead. This is also an ex­cel­lent op­por­tu­ni­ty to show a bit more what an in­for­mal code re­view is about, for those of you who are un­lucky to find your­self in com­pa­nies where there are no code re­views.

This is the orig­i­nal piece of code; the vari­ables were mod­i­fied to ful­fill the damn NDA, but the log­ic re­mains the same.

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    List<IReportRow> result = new List<IReportRow>();
    if (this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0)
    {
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        result.Add(
            new HeaderFactory(this.DataContext)
                .CreateFromSQL(new ReportMeta(reportId).FindType()));
    }
    else
    {
        // Generate basic header with just the names of the columns.
        result.Add(
            new HeaderFactory(null).CreateFromRow(
                new ReportData(this.DataContext).Load(reportId).Rows[0]))
    }

    List<DataRow> rows =
        new ReportData(this.DataContext).Load(reportId).Rows.ToList();

    for (int i = 0; i < rows.Length; i++)
    {
        if (this.DisplayFormat == OutputFormat.RichText &&
            rows[i].RichTextEnabled == true &&
            this.LoadCurrentDisplaySupport()
                .Capabilities
                .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
                .Count() > 0)
        {
            // Display advanced controls, such as filters.
            result.Add(new ReportRowBuilder().AddData(rows[i]).Build());
        }
        else
        {
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            result.Add(new ReportDataRow(rows[i]));
        }
    }

    return result;
}

Aside wrong types, there are also a bunch of copy-past­ed lines which re­al­ly shouldn't be. For each one, I'll il­lus­trate them with a sim­ple pat­tern to make it as vi­su­al as pos­si­ble.

T a = new T()

C# 3.0 in­tro­duced im­plic­it­ly typed lo­cal vari­ables. Near­ly sev­en years lat­er, it's still un­clear to some C# pro­gram­mers what this key­word is about, where to use it and where not to use it.

What is var

var mit­i­gates to the com­pil­er the bur­den of find­ing the type of a lo­cal vari­able. In­stead of deal­ing with the of­ten com­plex types your­self, let the com­pil­er do the job.

Dictionary<Guid, HashSet<Tuple<string, List<decimal>>>> wtf = this.LoadWTF();

may sim­ply be writ­ten:

var wtf = this.LoadWTF();

giv­en that if you re­al­ly have such types in your code base, you maybe might have some more im­por­tant is­sues to ad­dress first.

Also, var is par­tic­u­lar­ly handy when it comes to anony­mous types; spec­i­fy­ing those ones by hand would be slight­ly prob­lem­at­ic.

Some pro­gram­mers also tend to be­lieve that var in­di­cates that the type is dy­nam­ic. This has noth­ing to do with dy­nam­ic typ­ing. var keeps the types as sta­t­ic as they are when hard-cod­ed. The ac­tu­al type be­hind var is in­ferred at com­pile-time, while dynamic types are eval­u­at­ed at run­time.

Where to use or not to use var

There are three cas­es of use for var:

  1. Di­rect ini­tial­iza­tion with re­dun­dant type, i.e. where the de­clared vari­able is im­me­di­ate­ly ini­tial­ized, and the ini­tial­iza­tion con­tains the type any­way.

    List<string> words = new List<string>();
    

    vs.

    var words = new List<string>();
    

    I pre­fer the sec­ond vari­ant and would refac­tor any at­tempt to use the first one. The first one is just code du­pli­ca­tion. There is no rea­son what­so­ev­er to use this vari­ant. It doesn't bring any ben­e­fit, and has a draw­back of mak­ing code use­less­ly longer and mak­ing main­tainance more dif­fi­cult: if the type changes, a main­tain­er should change it twice.

  2. Ex­plic­it cast.

    List<string> words = (List<string>)this.LoadWords();
    

    vs.

    var words = (List<string>)this.LoadWords();
    

    Iden­ti­cal­ly to the pre­vi­ous point, there is no ben­e­fit of writ­ing the same type twice, so just use the var and let the com­pil­er do its job.

  3. In­di­rect ini­tial­iza­tion, where the type is not im­me­di­ate­ly vis­i­ble in the ini­tial­iza­tion.

    private List<string> LoadWords() { ... }
    ...
    List<string> words = this.LoadWords();
    

    vs.

    private List<string> LoadWords() { ... }
    ...
    var words = this.LoadWords();
    

    If I work on busi­ness-crit­i­cal code, I would pick the first ap­proach: spec­i­fy­ing the type once more might help find­ing type er­rors. This, of course, is de­bat­able, and I don't have met­rics to sup­port that. I imag­ine that it would help to avoid less than 0.5% of bugs, and those bugs would in all cas­es be re­vealed by sta­t­ic check­ing and for­mal code re­views, which makes it not re­al­ly im­por­tant.

    In or­di­nary code, I would avoid spec­i­fy­ing the type. Vi­su­al Stu­dio does a great job of show­ing the types of lo­cal vari­ables through the tooltips, so the ben­e­fit of spec­i­fy­ing the type is ex­treme­ly low. For those who think that hav­ing ex­plic­it type here helps read­abil­i­ty with­out hav­ing to ask help to Vi­su­al Stu­dio, you don't get the point. Oth­er­wise, we would have to rewrite:

    var quantity = this.Cart.IsEmpty ? 0 : this.Cart.CountProducts();
    if (this.HasCoupon)
    {
        quantity++;
    }
    
    return quantity;
    

    this way:

    int quantity = (int)0;
    ICart cart = (ICart)this.Cart;
    bool isCartEmpty = (bool)((ICart)cart).IsEmpty;
    if (!(bool)isCartEmpty)
    {
        int quantityInCart = (int)((ICart)cart).CountProducts();
        /* quantity is an integer */ quantity += (int)quantityInCart;
    }
    
    if ((bool)this.HasCoupon)
    {
        /* quantity is an integer */ quantity++;
    }
    
    return (int)quantity;
    

    be­cause know­ing the type when us­ing the vari­able is as im­por­tant as know­ing it when de­clar­ing it.

  4. Box­ing.

    private List<string> LoadWords() { ... }
    ...
    ICollection<string> words = this.LoadWords();
    

    vs.

    private List<string> LoadWords() { ... }
    ...
    var words = this.LoadWords();
    

    This one is ob­vi­ous: the code is not the same, so if you need box­ing, you can't use var.

  5. Im­plic­it con­ver­sion

    private int MeasureSpeed() { ... }
    ...
    double speed = this.MeasureSpeed();
    

    vs.

    private int MeasureSpeed() { ... }
    ...
    var speed = this.MeasureSpeed();
    

    This one is ob­vi­ous: the code is not the same, so if you need im­plic­it con­ver­sion, you can't use var.

  6. Anony­mous types

    var product = new { Title = "A good refactoring book", Price = 59.5 };
    

    Ob­vi­ous­ly, you have to use var.

  7. Edge cas­es

    Fi­nal­ly, there are some edge cas­es where you some­how need an ex­plic­it type, but not right now, or some­thing like that. For ex­am­ple, in:

    private IEnumerable<Product> FindProducts() { ... }
    ...
    IEnumerable<Product> products = this.FindProducts();
    

    you want to be sure that if the sig­na­ture of FindProducts changes lat­er to private ReadOnlyCollection<Product> FindProducts(), you still keep IEnumerable<Product> as a type of products. I would say that such cas­es are ex­treme­ly rare, and if you en­counter them of­ten enough in the code base, this can be a code smell.

Ap­ply­ing var

Let's now ap­ply var key­word to the orig­i­nal source.

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    var result = new List<IReportRow>();
    if (this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0)
    {
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        result.Add(
            new HeaderFactory(this.DataContext)
                .CreateFromSQL(new ReportMeta(reportId).FindType()));
    }
    else
    {
        // Generate basic header with just the names of the columns.
        result
            .Add(new HeaderFactory(null)
            .CreateFromRow(
                new ReportData(this.DataContext).Load(reportId).Rows[0]))
    }

    var rows = new ReportData(this.DataContext).Load(reportId).Rows.ToList();
    for (var i = 0; i < rows.Length; i++)
    {
        if (this.DisplayFormat == OutputFormat.RichText &&
            rows[i].RichTextEnabled == true &&
            this.LoadCurrentDisplaySupport()
                .Capabilities
                .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
                .Count() > 0)
        {
            // Display advanced controls, such as filters.
            result.Add(new ReportRowBuilder().AddData(rows[i]).Build());
        }
        else
        {
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            result.Add(new ReportDataRow(rows[i]));
        }
    }

    return result;
}

if (a && b) ... if (a && c)

This is a ba­sic case where orig­i­nal code du­pli­ca­tion be­comes dif­fi­cult to get rid of af­ter one code branch was changed, but oth­er didn't. The code pre­sent­ed at the be­gin­ning of the ar­ti­cle con­tains two ifs, which are near­ly iden­ti­cal, but still dif­fer­ent. It's not ob­vi­ous to see that the con­di­tion is near­ly the same in two cas­es, which makes it par­tic­u­lar­ly harm­ful to the qual­i­ty of the code base. When those state­ments are for­mat­ted dif­fer­ent­ly, it be­comes eas­i­er to see the du­pli­ca­tion:

this.DisplayFormat == OutputFormat.RichText &&
this.LoadCurrentDisplaySupport()
    .Capabilities
    .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
    .Count() > 0

this.DisplayFormat == OutputFormat.RichText &&
rows[i].RichTextEnabled == true &&
this.LoadCurrentDisplaySupport()
    .Capabilities
    .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
    .Count() > 0

Let's change this part of the code.

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    var result = new List<IReportRow>();
    var useRichText =
        this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0;
    if (useRichText)
    {
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        result.Add(
            new HeaderFactory(this.DataContext).CreateFromSQL(
                new ReportMeta(reportId).FindType()));
    }
    else
    {
        // Generate basic header with just the names of the columns.
        result.Add(
            new HeaderFactory(null).CreateFromRow(
                new ReportData(this.DataContext).Load(reportId).Rows[0]))
    }

    var rows = new ReportData(this.DataContext).Load(reportId).Rows.ToList();
    for (var i = 0; i < rows.Length; i++)
    {
        if (useRichText && rows[i].RichTextEnabled == true)
        {
            // Display advanced controls, such as filters.
            result.Add(new ReportRowBuilder().AddData(rows[i]).Build());
        }
        else
        {
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            result.Add(new ReportDataRow(rows[i]));
        }
    }

    return result;
}

Few read­ers might no­tice that the ac­tu­al refac­tor­ing changed the ex­e­cu­tion of the code by in­vert­ing the parts of the sec­ond if block (A && B && C be­comes A && C && B). This means that we should re­run unit tests af­ter this change to en­sure that there are no re­gres­sions.

if (a b) else (a c)

I see code du­pli­ca­tion with­in if/else blocks in near­ly every piece of code I have to re­view. A code such as:

if (condition)
{
    statement(a);
}
else
{
    statement(b);
}

should im­me­di­ate­ly be refac­tored into:

statement(condition ? a : b);

The change is not as mi­nor as it looks like. The orig­i­nal code looks like:

if something:
    do a thing
otherwise:
    do a different thing

which doesn't ex­press cor­rect­ly the log­ic. What the code is ac­tu­al­ly do­ing is:

do a thing on a if something or on b otherwise

More­over, the orig­i­nal, non-refac­tored form suf­fers from sev­er­al is­sues in­her­ent to code du­pli­ca­tion. The first one is that a per­son may mod­i­fy code in one place, and for­get to do it in an­oth­er lo­ca­tion. The sec­ond one is that one branch may grow, re­sult­ing in a mess and dif­fi­cul­ty to re­move du­pli­cat­ed code lat­er.

The code now be­comes:

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    var result = new List<IReportRow>();
    var useRichText =
        this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0;
    var factory = new HeaderFactory(useRichText ? this.DataContext : null);
    var header = useRichText ?
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        factory.CreateFromSQL(new ReportMeta(reportId).FindType()) :
        // Generate basic header with just the names of the columns.
        factory.CreateFromRow(
            new ReportData(this.DataContext).Load(reportId).Rows[0]);

    result.Add(header);

    var rows = new ReportData(this.DataContext).Load(reportId).Rows.ToList();
    for (var i = 0; i < rows.Length; i++)
    {
        var resultRow = useRichText && rows[i].RichTextEnabled == true ?
            // Display advanced controls, such as filters.
            new ReportRowBuilder().AddData(rows[i]).Build() :
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            new ReportDataRow(rows[i]);

        result.Add(resultRow);
    }

    return result;
}

That's a wall of con­densed code! It doesn't mat­ter, when we fin­ish refac­tor­ing, it will be much more read­able.

(bool)b == true

Re­dun­dan­cy such as:

if (row.IsVisible == true)

can of­ten be seen in code writ­ten by be­gin­ner pro­gram­mers. The == true part doesn't bring any­thing to the read­abil­i­ty in this con­text and should be re­moved. The code:

var resultRow = useRichText && row.RichTextEnabled == true

should be:

var resultRow = useRichText && row.RichTextEnabled

a[i] ... a[i]

Re­triev­ing the same el­e­ment from an ar­ray may quick­ly be­come prob­lem­at­ic, since not only it has no ben­e­fit (aside from not de­clar­ing an ad­di­tion­al vari­able), but it can eas­i­ly be a source of a mis­take dur­ing main­te­nance. Let's fo­cus on this part of the code:

for (var i = 0; i < rows.Length; i++)
{
    var resultRow = useRichText && rows[i].RichTextEnabled ?
        // Display advanced controls, such as filters.
        new ReportRowBuilder().AddData(rows[i]).Build() :
        // Display text only if either the row itself doesn't contain advanced
        // controls or if the display format doesn't support them.
        new ReportDataRow(rows[i]);

    result.Add(resultRow);
}

One can re­duce du­pli­ca­tion by de­clar­ing an ad­di­tion­al vari­able:

for (var i = 0; i < rows.Length; i++)
{
    var row = rows[i];
    var resultRow = useRichText && row.RichTextEnabled ?
        // Display advanced controls, such as filters.
        new ReportRowBuilder().AddData(row).Build() :
        // Display text only if either the row itself doesn't contain advanced
        // controls or if the display format doesn't support them.
        new ReportDataRow(row);

    result.Add(resultRow);
}

but there is an eas­i­er way to get rid of those in­dex­es:

foreach (var row in rows)
{
    var resultRow = useRichText && row.RichTextEnabled ?
        // Display advanced controls, such as filters.
        new ReportRowBuilder().AddData(row).Build() :
        // Display text only if either the row itself doesn't contain advanced
        // controls or if the display format doesn't support them.
        new ReportDataRow(row);

    result.Add(resultRow);
}

This trans­form has ad­di­tion­al ben­e­fits too.

new List; list.Add; list.Add; re­turn list

One prob­lem with this method is that it cre­ates a list, adds el­e­ments to it and then re­turns it to the caller. In .NET, a much pow­er­ful con­cept of enu­mer­a­tors. If we change the method so it re­turns an enu­mer­a­tion, the caller can still con­vert it to a list, but it can also start the enu­mer­a­ton and aban­don it in the mid­dle, for ex­am­ple.

public IEnumerable<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<IEnumerable<IReportRow>>() != null);

    var useRichText =
        this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0;
    var factory = new HeaderFactory(useRichText ? this.DataContext : null);
    yield return useRichText ?
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        factory.CreateFromSQL(new ReportMeta(reportId).FindType()) :
        // Generate basic header with just the names of the columns.
        factory.CreateFromRow(
            new ReportData(this.DataContext).Load(reportId).Rows[0]);

    foreach (var row in new ReportData(this.DataContext).Load(reportId).Rows)
    {
        yield return useRichText && row.RichTextEnabled ?
            // Display advanced controls, such as filters.
            new ReportRowBuilder().AddData(row).Build() :
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            new ReportDataRow(row);
    }
}

col­lec­tion.Where(pred­i­cate).Count() > 0

Ex­am­ple:

this.LoadCurrentDisplaySupport()
    .Capabilities
    .Where(cap => cap.DisplayFormat == OutputFormat.RichText).Count() > 0

This one is fre­quent too and can con­tain two pos­si­ble im­prove­ments. The first one is to use the over­ride of Count to get rid of Where:

this.LoadCurrentDisplaySupport()
    .Capabilities
    .Count(cap => cap.DisplayFormat == OutputFormat.RichText) > 0

While it helps a lit­tle, the sec­ond one is bet­ter. By re­plac­ing Count by Any, one en­sures that the count­ing ac­tu­al­ly stops once a match is found. This means that if you're load­ing one bil­lion rows from the data­base with Count, with Any, the load­ing stops at the first oc­cur­rence (which may be in the top ten rows), free­ing the data­base from the bur­den of re­triev­ing all the rows.

this.LoadCurrentDisplaySupport()
    .Capabilities
    .Any(cap => cap.DisplayFormat == OutputFormat.RichText)

The fi­nal touch

This fin­ish­es the refac­tor­ing re­lat­ed to code du­pli­ca­tion. The fi­nal touch would be to move meth­ods to the class­es they be­longs to, es­pe­cial­ly to re­duce the risk of cross-class code du­pli­ca­tion, as well as find more mean­ing­full names to the class­es and meth­ods. Here's the fi­nal ver­sion of the code. As you can see, it's very easy to un­der­stand each method sep­a­rate­ly. The orig­i­nal 30-LOC code was not the worst piece of code I've ever seen, but those small meth­ods with an av­er­age of 7 LOC are still much eas­i­er.

private bool? useRichTextBackingField;

private bool UseRichText
{
    get
    {
        if (!this.useRichTextBackingField.HasValue)
        {
            this.useRichTextBackingField =
                this.DisplayFormat == OutputFormat.RichText &&
                this.LoadCurrentDisplaySupport().CanUse(OutputFormat.RichText);
        }
        
        return this.useRichTextBackingField.Value;
    }
}

public IEnumerable<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<IEnumerable<IReportRow>>() != null);

    yield return this.GenerateTableHeader(reportId);

    foreach (var row in new ReportData(this.DataContext).Load(reportId).Rows)
    {
        yield return this.GenerateTableRow(row);
    }
}

private IReportRow GenerateTableHeader(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<IReportRow>() != null);

    var factory = new HeaderFactory(this.DataContext).ForReport(reportId);
    return this.UseRichText ? factory.CreateAdvanced() : factory.CreateBasic();
}

private IReportRow GenerateTableRow(DataRow row)
{
    Contract.Requires(row != null)
    Contract.Ensures(Contract.Result<IReportRow>() != null);

    if (this.UseRichText && row.RichTextEnabled)
    {
        // Display advanced controls, such as filters.
        return new ReportRowBuilder().AddData(row).Build();
    }

    // Display text only if either the row itself doesn't contain advanced
    // controls or if the display format doesn't support them.
    return new ReportDataRow(row);
}