warning

Disclaimer: This post is not finished and has NOT been fully reviewed. Feedback is welcome, but no arguments (yet) please!

Introduction

Code development is an “art” involving developers with different backgrounds and coding styles. A code base can grow quickly, leaving developers working on only some parts of it with little interaction with others, and possibly even producing duplicated code. Add a high turnover of developers to the picture and knowledge about parts of the code can be lost. How can we avoid such situations? Can we add some methodology to keep an overview of how a code base evolves over time? Are there pieces of the code that should be refactored? What other types of information can we use to improve coding practices?

A possible answer is to rely on the “footprint” of the code, i.e. the version control history (Git, Mercurial or other). An open-source tool to do so is readily available: codemetrics.

Codemetrics

The concept behind codemetrics is inspired by Adam Tornhill’s use of forensic techniques applied to code. Much information can be obtained from the version control history, analysed and used to improve coding practices, teamwork and project structure. Without going into much detail, we’ll cover some of the information that we can extract with codemetrics. A deeper dive into our actual use of codemetrics will be provided in a separate post.

We will now apply codemetrics to the Eilmer open-source CFD solver, which is part of the larger Gas Dynamic Toolkit. The information presented here was mined in June 2021.

Overview of the project

The first type of useful information that can be extracted is an overview of the project in terms of the programming languages used and the number of lines, as shown below. The Gas Dynamic Toolkit repository consists mainly of D and Lua. There are also some C and Python files. The C files are mainly related to the Lua and zlib external packages and, to a smaller extent, to the chemical equilibrium library. The Eilmer part is dominated by D and Lua. Around 17 % of the non-blank D lines are comments, against only 6 % for the Lua parts. This information does not tell us whether the comments are clustered, but keep in mind that an over-use of comments can reduce code readability for the developers.

Contributions in the past year

Given that codemetrics relies on the version control (git in the present case), we have access to the time history of the code. The temporal footprint can be extremely useful to identify hurdles that had to be overcome, problems to be fixed, intense periods of development, parts of the code that have been refactored over and over again, etc. This could especially be key for project planning.

If we look at the past year (with respect to June 2021), the development activity appears to have increased considerably in the second half of the year compared with the first half. The D-related activity has also increased in the second half of the year. The Australian summer holidays (December to February) translate into a decreased level of development activity.
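The monthly activity discussed above can be approximated directly from the Git history. The following is a minimal sketch and not the codemetrics implementation: the function name `commits_per_month` is hypothetical, and the input is assumed to be the text produced by `git log --since="1 year ago" --date=format:%Y-%m --pretty=format:%ad`, i.e. one “YYYY-MM” entry per commit.

```python
from collections import Counter
from typing import Dict


def commits_per_month(log_text: str) -> Dict[str, int]:
    """Count commits per 'YYYY-MM' bucket.

    `log_text` is assumed to hold one 'YYYY-MM' date per line
    (hypothetical input format, see the git command in the text).
    """
    months = [line.strip() for line in log_text.splitlines() if line.strip()]
    # Sort by month so the result reads chronologically.
    return dict(sorted(Counter(months).items()))


# Made-up sample, not data from the Eilmer repository.
sample = "2021-05\n2021-05\n2021-04\n2021-05\n"
print(commits_per_month(sample))
```

Plotting the resulting counts month by month gives a rough picture of quiet and intense development periods.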

Distribution of code lines and activity

The time history and project overview can be combined in more striking representations of the project, such as the interactive nested-circle view shown below. Each circle represents a file containing code. The circles are sized by the number of lines and colored according to the time since the last modification: lighter for more recent. This representation is mostly informative, but it can already reveal useful information such as areas of the code that were modified recently at around the same time, or simply files that are too large (and potentially more difficult to maintain).

The graph presented below limits itself to the /src part of the Gas Dynamic Toolkit. The D-YAML part (/src/extern/D-YAML) shows the least activity as a complete block. The most recent activity is related to the Eilmer (/src/eilmer), gas (/src/gas) and kinetics (/src/kinetics) blocks. The more recent “gas” developments appear to be mostly related to the two-temperature gas modeling, with associated additions to the properties in the species database. The two-temperature modeling also drives the changes in the kinetics block. Various developments in the Eilmer part are visible without a specific cluster, except perhaps the boundary conditions.

In terms of lines of code (LOC), the code is quite evenly distributed, without extreme differences between files. The largest file (shape_sensitivity_core.d) contains 2882 LOC and controls the adjoint capability of the project.

Hotspots

Continuing with interactive representations, identifying hotspots is an important task. Codemetrics computes a complexity score for each file. The complexity is a number presently obtained with the Lizard Cyclomatic Complexity Analyzer. It measures the number of linearly independent paths through the code of each file. An “if” statement, for instance, is a decision that results in a new path. The higher the complexity score, the more difficult it will be to modify, maintain or test a piece of code. Note that the length of a file, namely the LOC, can obviously impact the complexity score.
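To make the idea of counting paths concrete, here is a minimal sketch that estimates the cyclomatic complexity of a piece of Python source as one plus the number of decision points. It is an illustration only and not Lizard’s implementation: the function name and the list of node types counted as decisions are assumptions, and Lizard applies its own language-specific rules (including for D).

```python
import ast

# Node types treated as decision points in this sketch (an assumption;
# Lizard's own rules are language-specific and more complete).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 + the number of decision points found in Python source."""
    tree = ast.parse(source)
    decisions = 0
    for node in ast.walk(tree):
        if isinstance(node, DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' holds three values, i.e. two extra branches.
            decisions += len(node.values) - 1
    return 1 + decisions


snippet = """
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0
"""
print(cyclomatic_complexity(snippet))  # 3: one base path plus two 'if' decisions
```

A file full of nested options and user choices accumulates such decision points quickly, which is why a long time-integration routine can score far higher than a larger but more linear file.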

The representation below presents the same files as in “Distribution of code lines and activity” but the circles are now sized according to complexity. This time the coloring reflects the number of changes each file has undergone over the past year: darker for more changes. If a file has been changed repeatedly, it could be good to understand the underlying reasons. Frequent change does not automatically imply a source of problems and could simply be a required evolution over time. Nonetheless, you can now target parts of the code for further investigation.
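The number of changes per file can also be estimated from the Git history alone. The sketch below is an assumption-laden illustration rather than what codemetrics does internally: `most_changed` is a hypothetical helper, and its input is assumed to be the text produced by `git log --since="1 year ago" --name-only --pretty=format:`, which lists the files touched by each commit, separated by blank lines.

```python
from collections import Counter
from typing import List, Tuple


def most_changed(log_text: str, top: int = 3) -> List[Tuple[str, int]]:
    """Return the `top` most frequently changed paths with their change counts.

    Each non-empty line of `log_text` is assumed to be one file path
    touched by one commit.
    """
    paths = [line.strip() for line in log_text.splitlines() if line.strip()]
    return Counter(paths).most_common(top)


# Made-up sample with three commits; not data from the Eilmer repository.
sample_log = "a.d\nb.d\n\na.d\n\na.d\nc.d\nb.d\n"
print(most_changed(sample_log, top=2))  # [('a.d', 3), ('b.d', 2)]
```

Combining such counts with a complexity score per file is what turns a list of frequently edited files into candidate hotspots.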

The highest complexity (apart from the D-YAML related block) is attributed to the “simcore_gasdynamic_step.d” file of Eilmer, with a value of 109.8. It performs the temporal integration of the CFD solver and handles many possible combinations of options, including user choices. In comparison, the file with the highest LOC (shape_sensitivity_core.d) has a complexity of only 9.

The most changed files are “simcore.d”, “fluidblock.d” and “globalconfig.d”, but they all have a complexity below 10. The “simcore.d” file contains functions that coordinate different parts of Eilmer, including the initialisation, the finalisation and the time integration loop (through different function calls). The “fluidblock.d” file contains the FluidBlock class, which is a main pillar of Eilmer. The “globalconfig.d” file contains many of the options and is logically modified when new features or options are introduced.

Code coupling

The previous representations provide information by looking at each file in isolation. It is easy to say, based on a hotspot, that a given file needs refactoring. But how will this affect the rest of the code? If the file of interest is tightly coupled to the rest of the code, refactoring it could become a nightmare or simply not be possible. It is, therefore, also beneficial to look at the code coupling.

The level of coupling is a value between 0 and 1, derived solely from the git history. Specifically, the score is based on the number of occurrences in which a pair of files is modified simultaneously (in the same commit). For each file pair, the score is given as: (number of times the files are modified together) / (number of times the files are modified overall). A measure such as coupling could also be interesting when comparing different codes.
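The coupling score can be sketched in a few lines. This is a minimal illustration under a stated assumption, not the codemetrics implementation: the denominator is read here as the number of commits touching the first file of an ordered pair, so the score of (a, b) may differ from that of (b, a). The function name and the sample commits are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def coupling(commits: List[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Coupling of the ordered pair (a, b):
    commits touching both a and b / commits touching a (assumed reading).
    """
    changes: Counter = Counter()   # commits touching each file
    together: Counter = Counter()  # commits touching both files of a pair
    for files in commits:
        changes.update(files)
        for a, b in combinations(sorted(files), 2):
            together[(a, b)] += 1
            together[(b, a)] += 1
    return {pair: n / changes[pair[0]] for pair, n in together.items()}


# Made-up commits (sets of modified files); not the real Eilmer history.
commits = [
    {"prep-grid.lua", "prep-flow.lua"},
    {"prep-grid.lua", "prep-flow.lua"},
    {"prep-grid.lua"},
    {"simcore.d"},
]
scores = coupling(commits)
print(scores[("prep-flow.lua", "prep-grid.lua")])  # 1.0: every change to prep-flow.lua also touched prep-grid.lua
```

With this reading, a score of 1 means that every commit touching the first file also touched the second, and pairs never committed together are simply absent from the result.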

In the case of Eilmer, the “prep-grid.lua” and “prep-flow.lua” files are tightly coupled (graph_0_2), which is expected as they are both key in the preparation step of an Eilmer simulation. Another example (graph_3_3) is related to the energy exchange in the two-temperature modeling inside the kinetics block (/src/kinetics).

Final comments

We saw some ways in which information mined with codemetrics can be represented. The mined information is so vast, however, that endless graphs can be generated and many types of analysis can be devised. Nevertheless, the few types of graphs considered in this post can already provide important insights into possible issues within a project and may suffice as a first step.

Jimmy-John Hoste is a postdoctoral researcher in computer science engineering with a focus on CFD related topics.