
Photo Erik Škof, because the example deals with soap films.

Don’t Repeat Yourself?

The “Don’t Repeat Yourself” (DRY) principle is a widely shared best practice in programming. It warns us against code duplication. The point here is to show how extreme duplication avoidance can produce more complexity. More generally, removing one code smell can produce another.

How does duplication happen?

Let’s not give in to the too-easy explanation “because people are lazy and stupid”. A more realistic scenario is:

  1. Find a code matching my problem, but a variation is needed
  2. Copy and paste it
  3. Add the variation my problem needs

This is fully in line with the Keep It Simple, Stupid (KISS) principle: by duplicating, the developer avoids any change to the original code.

Why is duplication bad?

Duplication is one of the six major anti-patterns gathered under the STUPID acronym (Singleton, Tight coupling, Untestable, Premature optimization, Indescriptive naming, Duplication).

Let’s state the obvious: why is duplication harmful? When the same code is repeated several times, it comes with two human costs:

  1. As long as repetitions are synchronized, developers must repeat their edits on all occurrences. For N lines repeated D times, the overwork is around N×(D-1) lines of code (a worked example follows this list). This is a form of tight-coupling design: more code than ideally intended is linked by each change, which creates inertia.

  2. Once repetitions are no longer synchronized, developers must also keep track of the local variations in the code. Each reading brings the temptation to remove the variations, which may be present for a good reason.
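
For instance, a 20-line routine duplicated 5 times means roughly 20×(5-1) = 80 extra lines to keep in sync at every change, on top of remembering where the other four copies live.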

How could avoiding duplication be bad, then?

Let’s use an optimistic situation: a code is duplicated 5 times, with 5 justified variations. The refactoring merges the copies into one larger code, which now features the 5 variations within the same body.

Let’s take an example: a code solving the evolution of a soap film (in pseudo-code):

Algorithm SimulateSoapFilm():
    Initialize soap_film as a 2D grid of values representing film thickness
    Initialize time_step
    Initialize simulation_duration

    for time = 0 to simulation_duration do
        Initialize next_film as a copy of the current soap_film

        for each cell in soap_film do
            Calculate Laplacian of film thickness in the neighborhood of the cell
            Calculate change in film thickness based on Laplacian and other factors
            Update next_film[cell] = soap_film[cell] + change

            // Apply boundary conditions (e.g., fixed thickness at edges)
            if cell is at the boundary then
                next_film[cell] = boundary_thickness
            end if
        end for

        Update soap_film to be next_film 
    end for
End Algorithm
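
For readers who prefer something runnable, here is a minimal Python/NumPy sketch of the 2D reference above. The diffusion-like update, the constants and the function name are illustrative assumptions, not the exact physics of a soap film:

import numpy as np

def simulate_soap_film_2d(n=64, steps=100, dt=0.1, diffusivity=0.2,
                          boundary_thickness=1.0):
    # Illustrative model: film thickness relaxed by a discrete Laplacian
    film = np.ones((n, n))
    for _ in range(steps):
        next_film = film.copy()
        # 5-point Laplacian of the thickness on the interior cells
        lap = (film[:-2, 1:-1] + film[2:, 1:-1] +
               film[1:-1, :-2] + film[1:-1, 2:] - 4.0 * film[1:-1, 1:-1])
        next_film[1:-1, 1:-1] = film[1:-1, 1:-1] + dt * diffusivity * lap
        # Boundary condition: fixed thickness at the edges
        next_film[0, :] = next_film[-1, :] = boundary_thickness
        next_film[:, 0] = next_film[:, -1] = boundary_thickness
        film = next_film
    return film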

The 4 variations from the reference are:

  • A 3-D version, to solve a bubble without boundaries
  • A 3-D version for large bubbles, stabilized by glycerine, with large deformations
  • A 2-D version with moving boundaries
  • A 2-D version with a source term.

Inside the merged code, depending on how the if statements are nested, the number of possible paths to read the code ranges from 5 (one per supported case) to 2**4 (16, if the four flags combine freely)! This is a large increase in cyclomatic complexity. Depending on how the variations are triggered, the number of arguments of the function rises from 0 to 4, so the input complexity of the code also increases. For the soap film example, we get the following code:

Algorithm SimulateSoapFilm(is-3d, large-deformations, source-terms, moving-boundaries):

    if is-3d:
        Initialize soap_film as a 3D grid of values representing film thickness
    else:
        Initialize soap_film as a 2D grid of values representing film thickness
    end if

    Initialize time_step
    Initialize simulation_duration

    for time = 0 to simulation_duration do
        Initialize next_film as a copy of the current soap_film

        for each cell in soap_film do
            if is-3d:
                if large-deformations:
                    Calculate 3DLaplacian in large deformations
                else:
                    Calculate 3DLaplacian
                end if
            else:
                Calculate 2DLaplacian
            end if

            if source-terms:
                Apply source terms
            end if
            Calculate change in film thickness based on Laplacian and other factors
            Update next_film[cell] = soap_film[cell] + change

            if not is-3d:
                // Apply boundary conditions (e.g., fixed thickness at edges)
                if moving-boundaries:
                    if cell is at the boundary then
                        next_film[cell] = moving_boundary_thickness
                    end if
                else:
                    if cell is at the boundary then
                        next_film[cell] = boundary_thickness
                    end if
                end if
            end if
        end for

        Update soap_film to be next_film 
    end for
End Algorithm

Note that the removal of duplication also comes with a concerning change of perimeter. Before merging, the perimeter is:

  1. 2D film - simple
  2. 2D film with moving boundaries
  3. 2D film with source terms
  4. 3D bubble - simple
  5. 3D bubble with large deformations

After merging, new combinations become implicitly possible (a small sketch after this list enumerates them), for example:

  1. 3D bubble with large deformations with source terms
  2. 2D film with source terms and large deformations
  3. 3D bubble with moving boundaries - which probably makes no sense.
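
To make this perimeter growth concrete, here is a small Python sketch counting the combinations the merged signature implicitly accepts. The flag names mirror the merged pseudo-code; the enumeration itself is an illustration, not part of the original code:

from itertools import product

FLAGS = ("is_3d", "large_deformations", "source_terms", "moving_boundaries")

# The 5 cases the duplicated codes actually supported, in FLAGS order
INTENDED = {
    (False, False, False, False),  # 2D film - simple
    (False, False, False, True),   # 2D film with moving boundaries
    (False, False, True, False),   # 2D film with source terms
    (True, False, False, False),   # 3D bubble - simple
    (True, True, False, False),    # 3D bubble with large deformations
}

all_combinations = set(product((False, True), repeat=len(FLAGS)))
implicit = all_combinations - INTENDED
print(len(all_combinations), "combinations accepted,",
      len(implicit), "of them never intended")
# prints: 16 combinations accepted, 11 of them never intended

Eleven cases, such as the 3D bubble with moving boundaries, are now reachable through the signature even though nobody ever wrote, tested or needed them.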

Finally, outside the merged code, the calls become longer (extra arguments). The merged function may also live in a more generic, less application-specific context. And developers, A.I. or human, rely a lot on context to know what to do.

One could hide this new complexity behind smarter structures like function overloading or templates: something (the compiler? the runtime?) would figure out, from the hardware, the available memory or the type of the data, which variation of the code to run. The cyclomatic or calling complexity is then exchanged for structural complexity. This kind of hasty abstraction is a code bloater of its own, warned against by the AHA principle (“Avoid Hasty Abstractions”), itself influenced by Sandi Metz’s “prefer duplication over the wrong abstraction”.
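
As an illustration (not the author’s code), here is a minimal Python sketch where the 2D/3D variation of the Laplacian is hidden behind a dispatch table keyed on the array dimension. The if statement disappears from the call site, but the same branching now lives in the structure of the program:

import numpy as np

def laplacian_2d(film):
    # 5-point stencil on the interior cells of a 2D grid
    return (film[:-2, 1:-1] + film[2:, 1:-1] +
            film[1:-1, :-2] + film[1:-1, 2:] - 4.0 * film[1:-1, 1:-1])

def laplacian_3d(film):
    # 7-point stencil on the interior cells of a 3D grid
    return (film[:-2, 1:-1, 1:-1] + film[2:, 1:-1, 1:-1] +
            film[1:-1, :-2, 1:-1] + film[1:-1, 2:, 1:-1] +
            film[1:-1, 1:-1, :-2] + film[1:-1, 1:-1, 2:] -
            6.0 * film[1:-1, 1:-1, 1:-1])

# The table replaces the if/else on is-3d: the branching has not disappeared,
# it has moved into the structure of the program.
LAPLACIANS = {2: laplacian_2d, 3: laplacian_3d}

def laplacian(film):
    return LAPLACIANS[film.ndim](film)

The call site is shorter, but a reader now has to open the table to know which stencil actually runs: the complexity did not vanish, it changed form.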

We did remove the duplication in this case, but we ended up with a more complex code and more situations to support.

Takeaway: Balancing simplicity and duplication

The present text tried to illustrate how the removal of duplication can produce bad code too. The damage comes from the number and the nature of the variations between duplicates. Of course, in the ideal case of no variations, a merge would add zero complexity.

In the end, the best compromise should be permanently discussed within the developer community of the codebase. We use programming principles to find this local optimum. However, we saw here the collision of two of these principles, “Keep It Simple, Stupid” and “Don’t Repeat Yourself”.

Next time you come across one of these principles, do not stop at its reassuring wise-man vibe: question how far it is reasonable to stick to it in your situation.

Let’s finish with a quote from the Zen of Python:

Special cases aren’t special enough to break the rules. Although practicality beats purity.



Antoine Dauptain is a research scientist focused on computer science and engineering topics for HPC.
