Home Home

Code duplication

Arseni Mourzenko
Founder and lead developer, specializing in developer productivity and code quality
130
articles
July 8, 2014

Beginners' code is always an amazing opportunity to remove code duplication, as well as explain what code duplication is to the original author.

I was reviewing a few thousand LOC, so I thought this is an opportunity to post an example and to show how should it be written instead. This is also an excellent opportunity to show a bit more what an informal code review is about, for those of you who are unlucky to find yourself in companies where there are no code reviews.

This is the original piece of code; the variables were modified to fulfill the damn NDA, but the logic remains the same.

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    List<IReportRow> result = new List<IReportRow>();
    if (this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0)
    {
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        result.Add(
            new HeaderFactory(this.DataContext)
                .CreateFromSQL(new ReportMeta(reportId).FindType()));
    }
    else
    {
        // Generate basic header with just the names of the columns.
        result.Add(
            new HeaderFactory(null).CreateFromRow(
                new ReportData(this.DataContext).Load(reportId).Rows[0]))
    }

    List<DataRow> rows =
        new ReportData(this.DataContext).Load(reportId).Rows.ToList();

    for (int i = 0; i < rows.Length; i++)
    {
        if (this.DisplayFormat == OutputFormat.RichText &&
            rows[i].RichTextEnabled == true &&
            this.LoadCurrentDisplaySupport()
                .Capabilities
                .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
                .Count() > 0)
        {
            // Display advanced controls, such as filters.
            result.Add(new ReportRowBuilder().AddData(rows[i]).Build());
        }
        else
        {
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            result.Add(new ReportDataRow(rows[i]));
        }
    }

    return result;
}

Aside wrong types, there are also a bunch of copy-pasted lines which really shouldn't be. For each one, I'll illustrate them with a simple pattern to make it as visual as possible.

T a = new T()

C# 3.0 introduced implicitly typed local variables. Nearly seven years later, it's still unclear to some C# programmers what this keyword is about, where to use it and where not to use it.

What is var

var mitigates to the compiler the burden of finding the type of a local variable. Instead of dealing with the often complex types yourself, let the compiler do the job.

Dictionary<Guid, HashSet<Tuple<string, List<decimal>>>> wtf = this.LoadWTF();

may simply be written:

var wtf = this.LoadWTF();

given that if you really have such types in your code base, you maybe might have some more important issues to address first.

Also, var is particularly handy when it comes to anonymous types; specifying those ones by hand would be slightly problematic.

Some programmers also tend to believe that var indicates that the type is dynamic. This has nothing to do with dynamic typing. var keeps the types as static as they are when hard-coded. The actual type behind var is inferred at compile-time, while dynamic types are evaluated at runtime.

Where to use or not to use var

There are three cases of use for var:

  1. Direct initialization with redundant type, i.e. where the declared variable is immediately initialized, and the initialization contains the type anyway.

    List<string> words = new List<string>();
    

    vs.

    var words = new List<string>();
    

    I prefer the second variant and would refactor any attempt to use the first one. The first one is just code duplication. There is no reason whatsoever to use this variant. It doesn't bring any benefit, and has a drawback of making code uselessly longer and making maintainance more difficult: if the type changes, a maintainer should change it twice.

  2. Explicit cast.

    List<string> words = (List<string>)this.LoadWords();
    

    vs.

    var words = (List<string>)this.LoadWords();
    

    Identically to the previous point, there is no benefit of writing the same type twice, so just use the var and let the compiler do its job.

  3. Indirect initialization, where the type is not immediately visible in the initialization.

    private List<string> LoadWords() { ... }
    ...
    List<string> words = this.LoadWords();
    

    vs.

    private List<string> LoadWords() { ... }
    ...
    var words = this.LoadWords();
    

    If I work on business-critical code, I would pick the first approach: specifying the type once more might help finding type errors. This, of course, is debatable, and I don't have metrics to support that. I imagine that it would help to avoid less than 0.5% of bugs, and those bugs would in all cases be revealed by static checking and formal code reviews, which makes it not really important.

    In ordinary code, I would avoid specifying the type. Visual Studio does a great job of showing the types of local variables through the tooltips, so the benefit of specifying the type is extremely low. For those who think that having explicit type here helps readability without having to ask help to Visual Studio, you don't get the point. Otherwise, we would have to rewrite:

    var quantity = this.Cart.IsEmpty ? 0 : this.Cart.CountProducts();
    if (this.HasCoupon)
    {
        quantity++;
    }
    
    return quantity;
    

    this way:

    int quantity = (int)0;
    ICart cart = (ICart)this.Cart;
    bool isCartEmpty = (bool)((ICart)cart).IsEmpty;
    if (!(bool)isCartEmpty)
    {
        int quantityInCart = (int)((ICart)cart).CountProducts();
        /* quantity is an integer */ quantity += (int)quantityInCart;
    }
    
    if ((bool)this.HasCoupon)
    {
        /* quantity is an integer */ quantity++;
    }
    
    return (int)quantity;
    

    because knowing the type when using the variable is as important as knowing it when declaring it.

  4. Boxing.

    private List<string> LoadWords() { ... }
    ...
    ICollection<string> words = this.LoadWords();
    

    vs.

    private List<string> LoadWords() { ... }
    ...
    var words = this.LoadWords();
    

    This one is obvious: the code is not the same, so if you need boxing, you can't use var.

  5. Implicit conversion

    private int MeasureSpeed() { ... }
    ...
    double speed = this.MeasureSpeed();
    

    vs.

    private int MeasureSpeed() { ... }
    ...
    var speed = this.MeasureSpeed();
    

    This one is obvious: the code is not the same, so if you need implicit conversion, you can't use var.

  6. Anonymous types

    var product = new { Title = "A good refactoring book", Price = 59.5 };
    

    Obviously, you have to use var.

  7. Edge cases

    Finally, there are some edge cases where you somehow need an explicit type, but not right now, or something like that. For example, in:

    private IEnumerable<Product> FindProducts() { ... }
    ...
    IEnumerable<Product> products = this.FindProducts();
    

    you want to be sure that if the signature of FindProducts changes later to private ReadOnlyCollection<Product> FindProducts(), you still keep IEnumerable<Product> as a type of products. I would say that such cases are extremely rare, and if you encounter them often enough in the code base, this can be a code smell.

Applying var

Let's now apply var keyword to the original source.

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    var result = new List<IReportRow>();
    if (this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0)
    {
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        result.Add(
            new HeaderFactory(this.DataContext)
                .CreateFromSQL(new ReportMeta(reportId).FindType()));
    }
    else
    {
        // Generate basic header with just the names of the columns.
        result
            .Add(new HeaderFactory(null)
            .CreateFromRow(
                new ReportData(this.DataContext).Load(reportId).Rows[0]))
    }

    var rows = new ReportData(this.DataContext).Load(reportId).Rows.ToList();
    for (var i = 0; i < rows.Length; i++)
    {
        if (this.DisplayFormat == OutputFormat.RichText &&
            rows[i].RichTextEnabled == true &&
            this.LoadCurrentDisplaySupport()
                .Capabilities
                .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
                .Count() > 0)
        {
            // Display advanced controls, such as filters.
            result.Add(new ReportRowBuilder().AddData(rows[i]).Build());
        }
        else
        {
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            result.Add(new ReportDataRow(rows[i]));
        }
    }

    return result;
}

if (a && b) ... if (a && c)

This is a basic case where original code duplication becomes difficult to get rid of after one code branch was changed, but other didn't. The code presented at the beginning of the article contains two ifs, which are nearly identical, but still different. It's not obvious to see that the condition is nearly the same in two cases, which makes it particularly harmful to the quality of the code base. When those statements are formatted differently, it becomes easier to see the duplication:

this.DisplayFormat == OutputFormat.RichText &&
this.LoadCurrentDisplaySupport()
    .Capabilities
    .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
    .Count() > 0

this.DisplayFormat == OutputFormat.RichText &&
rows[i].RichTextEnabled == true &&
this.LoadCurrentDisplaySupport()
    .Capabilities
    .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
    .Count() > 0

Let's change this part of the code.

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    var result = new List<IReportRow>();
    var useRichText =
        this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0;
    if (useRichText)
    {
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        result.Add(
            new HeaderFactory(this.DataContext).CreateFromSQL(
                new ReportMeta(reportId).FindType()));
    }
    else
    {
        // Generate basic header with just the names of the columns.
        result.Add(
            new HeaderFactory(null).CreateFromRow(
                new ReportData(this.DataContext).Load(reportId).Rows[0]))
    }

    var rows = new ReportData(this.DataContext).Load(reportId).Rows.ToList();
    for (var i = 0; i < rows.Length; i++)
    {
        if (useRichText && rows[i].RichTextEnabled == true)
        {
            // Display advanced controls, such as filters.
            result.Add(new ReportRowBuilder().AddData(rows[i]).Build());
        }
        else
        {
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            result.Add(new ReportDataRow(rows[i]));
        }
    }

    return result;
}

Few readers might notice that the actual refactoring changed the execution of the code by inverting the parts of the second if block (A && B && C becomes A && C && B). This means that we should rerun unit tests after this change to ensure that there are no regressions.

if (a b) else (a c)

I see code duplication within if/else blocks in nearly every piece of code I have to review. A code such as:

if (condition)
{
    statement(a);
}
else
{
    statement(b);
}

should immediately be refactored into:

statement(condition ? a : b);

The change is not as minor as it looks like. The original code looks like:

if something:
    do a thing
otherwise:
    do a different thing

which doesn't express correctly the logic. What the code is actually doing is:

do a thing on a if something or on b otherwise

Moreover, the original, non-refactored form suffers from several issues inherent to code duplication. The first one is that a person may modify code in one place, and forget to do it in another location. The second one is that one branch may grow, resulting in a mess and difficulty to remove duplicated code later.

The code now becomes:

public List<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<List<IReportRow>>() != null);

    var result = new List<IReportRow>();
    var useRichText =
        this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0;
    var factory = new HeaderFactory(useRichText ? this.DataContext : null);
    var header = useRichText ?
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        factory.CreateFromSQL(new ReportMeta(reportId).FindType()) :
        // Generate basic header with just the names of the columns.
        factory.CreateFromRow(
            new ReportData(this.DataContext).Load(reportId).Rows[0]);

    result.Add(header);

    var rows = new ReportData(this.DataContext).Load(reportId).Rows.ToList();
    for (var i = 0; i < rows.Length; i++)
    {
        var resultRow = useRichText && rows[i].RichTextEnabled == true ?
            // Display advanced controls, such as filters.
            new ReportRowBuilder().AddData(rows[i]).Build() :
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            new ReportDataRow(rows[i]);

        result.Add(resultRow);
    }

    return result;
}

That's a wall of condensed code! It doesn't matter, when we finish refactoring, it will be much more readable.

(bool)b == true

Redundancy such as:

if (row.IsVisible == true)

can often be seen in code written by beginner programmers. The == true part doesn't bring anything to the readability in this context and should be removed. The code:

var resultRow = useRichText && row.RichTextEnabled == true

should be:

var resultRow = useRichText && row.RichTextEnabled

a[i] ... a[i]

Retrieving the same element from an array may quickly become problematic, since not only it has no benefit (aside from not declaring an additional variable), but it can easily be a source of a mistake during maintenance. Let's focus on this part of the code:

for (var i = 0; i < rows.Length; i++)
{
    var resultRow = useRichText && rows[i].RichTextEnabled ?
        // Display advanced controls, such as filters.
        new ReportRowBuilder().AddData(rows[i]).Build() :
        // Display text only if either the row itself doesn't contain advanced
        // controls or if the display format doesn't support them.
        new ReportDataRow(rows[i]);

    result.Add(resultRow);
}

One can reduce duplication by declaring an additional variable:

for (var i = 0; i < rows.Length; i++)
{
    var row = rows[i];
    var resultRow = useRichText && row.RichTextEnabled ?
        // Display advanced controls, such as filters.
        new ReportRowBuilder().AddData(row).Build() :
        // Display text only if either the row itself doesn't contain advanced
        // controls or if the display format doesn't support them.
        new ReportDataRow(row);

    result.Add(resultRow);
}

but there is an easier way to get rid of those indexes:

foreach (var row in rows)
{
    var resultRow = useRichText && row.RichTextEnabled ?
        // Display advanced controls, such as filters.
        new ReportRowBuilder().AddData(row).Build() :
        // Display text only if either the row itself doesn't contain advanced
        // controls or if the display format doesn't support them.
        new ReportDataRow(row);

    result.Add(resultRow);
}

This transform has additional benefits too.

new List; list.Add; list.Add; return list

One problem with this method is that it creates a list, adds elements to it and then returns it to the caller. In .NET, a much powerful concept of enumerators. If we change the method so it returns an enumeration, the caller can still convert it to a list, but it can also start the enumeraton and abandon it in the middle, for example.

public IEnumerable<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<IEnumerable<IReportRow>>() != null);

    var useRichText =
        this.DisplayFormat == OutputFormat.RichText &&
        this.LoadCurrentDisplaySupport()
            .Capabilities
            .Where(cap => cap.DisplayFormat == OutputFormat.RichText)
            .Count() > 0;
    var factory = new HeaderFactory(useRichText ? this.DataContext : null);
    yield return useRichText ?
        // Generate header containing advanced stuff, like sorting, filtering
        // and counting.
        factory.CreateFromSQL(new ReportMeta(reportId).FindType()) :
        // Generate basic header with just the names of the columns.
        factory.CreateFromRow(
            new ReportData(this.DataContext).Load(reportId).Rows[0]);

    foreach (var row in new ReportData(this.DataContext).Load(reportId).Rows)
    {
        yield return useRichText && row.RichTextEnabled ?
            // Display advanced controls, such as filters.
            new ReportRowBuilder().AddData(row).Build() :
            // Display text only if either the row itself doesn't contain
            // advanced controls or if the display format doesn't support them.
            new ReportDataRow(row);
    }
}

collection.Where(predicate).Count() > 0

Example:

this.LoadCurrentDisplaySupport()
    .Capabilities
    .Where(cap => cap.DisplayFormat == OutputFormat.RichText).Count() > 0

This one is frequent too and can contain two possible improvements. The first one is to use the override of Count to get rid of Where:

this.LoadCurrentDisplaySupport()
    .Capabilities
    .Count(cap => cap.DisplayFormat == OutputFormat.RichText) > 0

While it helps a little, the second one is better. By replacing Count by Any, one ensures that the counting actually stops once a match is found. This means that if you're loading one billion rows from the database with Count, with Any, the loading stops at the first occurrence (which may be in the top ten rows), freeing the database from the burden of retrieving all the rows.

this.LoadCurrentDisplaySupport()
    .Capabilities
    .Any(cap => cap.DisplayFormat == OutputFormat.RichText)

The final touch

This finishes the refactoring related to code duplication. The final touch would be to move methods to the classes they belongs to, especially to reduce the risk of cross-class code duplication, as well as find more meaningfull names to the classes and methods. Here's the final version of the code. As you can see, it's very easy to understand each method separately. The original 30-LOC code was not the worst piece of code I've ever seen, but those small methods with an average of 7 LOC are still much easier.

private bool? useRichTextBackingField;

private bool UseRichText
{
    get
    {
        if (!this.useRichTextBackingField.HasValue)
        {
            this.useRichTextBackingField =
                this.DisplayFormat == OutputFormat.RichText &&
                this.LoadCurrentDisplaySupport().CanUse(OutputFormat.RichText);
        }

        return this.useRichTextBackingField.Value;
    }
}

public IEnumerable<IReportRow> GenerateTable(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<IEnumerable<IReportRow>>() != null);

    yield return this.GenerateTableHeader(reportId);

    foreach (var row in new ReportData(this.DataContext).Load(reportId).Rows)
    {
        yield return this.GenerateTableRow(row);
    }
}

private IReportRow GenerateTableHeader(int reportId)
{
    Contract.Requires(reportId > 0)
    Contract.Ensures(Contract.Result<IReportRow>() != null);

    var factory = new HeaderFactory(this.DataContext).ForReport(reportId);
    return this.UseRichText ? factory.CreateAdvanced() : factory.CreateBasic();
}

private IReportRow GenerateTableRow(DataRow row)
{
    Contract.Requires(row != null)
    Contract.Ensures(Contract.Result<IReportRow>() != null);

    if (this.UseRichText && row.RichTextEnabled)
    {
        // Display advanced controls, such as filters.
        return new ReportRowBuilder().AddData(row).Build();
    }

    // Display text only if either the row itself doesn't contain advanced
    // controls or if the display format doesn't support them.
    return new ReportDataRow(row);
}