
Photo by Fer Troulik on Unsplash

The Boring Reality of Explaining Code

“Explain this code.” It is a phrase I found myself typing several times into a large language model, to make sense of a module I had inherited or a routine I had not touched for years. It is effective: copy a file, paste it into the model, and receive a neat summary. Repeat…

But very quickly the process hit its limits. Copy-paste became mechanical and error-prone. Slight variations in my phrasing led to inconsistent answers. If the code was too long, I had to split it into chunks, but where to split? How to merge? Once the codebase exceeded two or three modules, the method simply became unmanageable.

Also, when you are concerned about resource limits, as I am, using big large language models for this task felt just wrong. And above all, repeating this operation dozens of times consumed more human attention than the problem justified, another waste of resources.

The situation raised a broader question: can we automate this task in a way that saves time, reduces energy, and remains consistent across a project?

Walking the Repository

This was the motivation for developing WalkingPrompt, a tool designed to systematize the process. Its principle is simple: rather than asking the human to navigate and prompt the model repeatedly, the tool “walks” through the repository and does the job automatically. The workflow is organized in several stages, listed below and sketched in code right after.

  1. First, the repository is traversed to identify the relevant source files.
  2. Each file is then loaded and, if small enough, passed directly to a language model together with a natural-language prompt. If the file is too large, it is split into manageable chunks, each summarized separately, and the partial results are aggregated.
  3. Finally, the collected explanations are reformatted into a more usable form, whether as a structured PDF or as an interactive visualization, such as circular packing of modules.
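
To make this concrete, here is a minimal sketch of such a walking loop in Python, assuming a local Ollama server; the helper names, the prompt wording and the size threshold are illustrative choices, not the actual WalkingPrompt implementation.

# Minimal sketch of the walking loop (illustrative, not the actual WalkingPrompt code).
# Assumes a local Ollama server exposing its standard HTTP API on port 11434.
from pathlib import Path
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3:latest"
PROMPT = "Summarize this source code in natural language:\n\n{code}"
MAX_CHARS = 8000  # rough stand-in for the model context limit

def ask_model(prompt: str) -> str:
    """Send one prompt to the local model and return its answer."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    return requests.post(OLLAMA_URL, json=payload, timeout=600).json()["response"]

def split_into_chunks(text: str, size: int = MAX_CHARS) -> list[str]:
    """Naive mechanical split by size; a semantic split would cut at routine boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def walk_repository(root: Path, suffixes=(".f90", ".F90")) -> dict[str, str]:
    """Traverse the repository and build one natural-language summary per source file."""
    summaries: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.suffix not in suffixes:
            continue  # 1. keep only the relevant source files
        text = path.read_text(errors="ignore")
        if len(text) <= MAX_CHARS:
            summaries[str(path)] = ask_model(PROMPT.format(code=text))  # 2. small file: one call
        else:
            parts = [ask_model(PROMPT.format(code=c)) for c in split_into_chunks(text)]
            summaries[str(path)] = ask_model(
                "Merge these partial summaries into one:\n\n" + "\n\n".join(parts))
        # 3. the summaries dict is later reformatted into a PDF or a visualization
    return summaries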

Importantly, the prompt itself is flexible. The current use case is “summarize this source code in natural language”, but it could just as well be “highlight potential performance bottlenecks” or “explain parallelization patterns”. This generality is the reason for the name WalkingPrompt.
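
For example, in the sketch above the instruction could simply be picked from a small library of prompts; the wording below is hypothetical and only illustrates that flexibility.

# Hypothetical prompt library: the walking loop above stays identical, only the instruction changes.
PROMPTS = {
    "summary": "Summarize this source code in natural language:\n\n{code}",
    "bottlenecks": "Highlight potential performance bottlenecks in this code:\n\n{code}",
    "parallelism": "Explain the parallelization patterns used in this code:\n\n{code}",
}
PROMPT = PROMPTS["summary"]  # swap the key to change what the walk produces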

What It Looks Like in Practice

In practice, the difference between manual prompting and automated walking is significant. When handled by hand, the summaries I obtained were heterogeneous in tone and scope. Let’s demonstrate:

Here is a Fortran snippet, the subroutine TimeIntegrate:

SUBROUTINE TimeIntegrate(rp, grid)
    USE mod_fields
    IMPLICIT NONE
    TYPE(runParams_t), INTENT(INOUT) :: rp
    TYPE(grid_t), INTENT(IN) :: grid
    CHARACTER(len=strl) :: routine
    INTEGER :: kstage

    routine = trim(modName)//"->TimeIntegrate"

    CALL Compute_timestep(rp, grid)

    DO WHILE (rp%time < rp%timeMax)
      rp%time   = rp%time + rp%timeStep
      DO kstage   = 1, rp%nStage
        rp%kstage = kstage
        CALL Compute_Residual(rp, grid)
        CALL Advance_Conserved_Variables(rp, grid)
      END DO
    END DO

    BLOCK
      USE pyplot_module
      USE mod_fields
      type(pyplot) :: plt
      INTEGER :: i, dLen

      dLen = INT(SQRT(REAL(grid%nnode, kind=pr)))
      CALL plt%initialize(grid=.true.,xlabel='x',title='u(x)',legend=.true.)
      i = (dLen/2)+1
      CALL plt%add_plot(grid%x_n(1,(i-1)*dLen+1:i*dLen),w_n(1,(i-1)*dLen+1:i*dLen),label='u(x)',linestyle='-r', ylim=[-0.1_pr, 1.1_pr], linewidth=2)

      CALL plt%savefig('temp.png', pyfile='temp.py')

      ERROR STOP "Simulation completed. Check the output plot"
    END BLOCK
  END SUBROUTINE TimeIntegrate

Here is a first attempt, using Qwen3-Coder, the code-oriented version of Qwen3, the large language model series developed by the Qwen team at Alibaba Cloud.

Here is a Fortran code:
<code>
Explain the code

This Fortran subroutine implements a time integration scheme for solving partial differential equations (likely fluid dynamics or similar physics problems). Here’s a breakdown:

Purpose: The subroutine advances a numerical simulation forward in time using a multi-stage time integration method (likely a Runge-Kutta scheme). (…)

Interestingly, the mention of fluid dynamics seems to be pure luck: while the present source code is taken from a mini-application for fluid dynamics, there are no direct clues to that in the snippet itself (IMHO, this should be considered a flaw).

Here is a second attempt:

This is a source I am working on:
<code>
Make a brief summary from this code. 
Mention routine names and numerical concepts.

This is a time integration subroutine that implements a multi-stage time-stepping scheme for solving partial differential equations on a computational grid. Key Routines:

  • TimeIntegrate - Main time integration driver

  • Compute_timestep - Calculates time step size

  • Compute_Residual - Computes residual terms

  • Advance_Conserved_Variables - Updates solution variables

(…)

This is one of my usual errors: by trying to improve the answer (asking for routine names), I made the output worse: the routine being defined, TimeIntegrate, is listed at the same level as the routines it calls, such as Compute_timestep.

Now, if the process is automated, such error-prone variation can be spotted and removed, and the results become more systematic. For the same routine, WalkingPrompt gave the following, using the model llama3:latest (8 billion parameters):

The TimeIntegrate subroutine is part of a numerical simulation program that integrates conserved variables over time. It takes two main inputs: rp, an instance of the runParams_t class, which represents run parameters for the simulation; and grid, an instance of the grid_t class, which represents a grid data structure.

The subroutine first computes the time step size using the Compute_timestep function. Then, it enters a loop that continues until the current time reaches the maximum allowed time (rp%timeMax). Within each iteration of the loop, it updates the current time by adding the time step size and advances conserved variables on the grid using the Advance_Conserved_Variables subroutine.

After completing the simulation, the subroutine generates a plot of the conserved variable u(x) using the pyplot module. The plot is saved as an image file named “temp.png” and a Python script named “temp.py”.

Once all these outputs are aggregated, the top-most code explanation, assigned to the root of the repository, reads as follows:

The DAL_MicroKernal program sets up a simulation by creating instances of modules/classes, including runParams, grid, fields, scalarTransport, and numerics. It then:

  • Creates runParams using create().

  • Creates grid using create(runParams).

  • Sets up fields on the grid with create_fields(grid, runParams).

  • Creates a specific field (Create_field(runParams, grid, w_n)).

  • Integrates the simulation over time with TimeIntegrate(runParams, grid).

(…)

Of course, I never took the time to do this aggregation manually.
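
The aggregation itself can be done bottom-up: each directory’s description is obtained by asking the model to condense the summaries of its children, up to the repository root. A minimal sketch, reusing the hypothetical ask_model helper from the earlier snippet:

# Bottom-up aggregation sketch: each directory is summarized from the summaries of its children.
# Reuses the hypothetical ask_model() helper from the walking-loop sketch above.
from pathlib import Path

def aggregate(summaries: dict[str, str], root: Path) -> dict[str, str]:
    """Return one summary per directory, from the deepest folders up to the repository root."""
    by_dir: dict[Path, list[str]] = {}
    for file_path, text in summaries.items():
        folder = Path(file_path).parent
        by_dir.setdefault(folder, []).append(text)
        while folder not in (root, folder.parent):  # register every ancestor up to the root
            folder = folder.parent
            by_dir.setdefault(folder, [])

    dir_summaries: dict[str, str] = {}
    # Deepest directories first, so each parent sees the summaries of its children.
    for folder in sorted(by_dir, key=lambda p: len(p.parts), reverse=True):
        prompt = ("Summarize the purpose of this folder from the summaries below:\n\n"
                  + "\n\n".join(by_dir[folder]))
        dir_summaries[str(folder)] = ask_model(prompt)
        if folder != root:
            by_dir.setdefault(folder.parent, []).append(dir_summaries[str(folder)])
    return dir_summaries

Applied to the example above, the root-level description would be produced last, from the directory summaries one level below it.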

At the same time, the tool is not without errors. For example, it produced outputs mixing the role assigned to the LLM with the answer itself, with some hallucinations at the end:

The scatter_target subroutine is a developer-assistant translation. It takes two integer inputs, ncomp_ur and nsource_local_ur,(…)

Such mistakes are not rare and remind us that automation is not a substitute for human verification. Yet, even imperfect, the automated summaries form a coherent first layer of documentation that is more consistent than what could be assembled manually.

Temperature_2D: an interactive tool to browse the multiple summaries obtained at all the levels of a software project.

What Can Be Improved

Several factors currently determine the quality and efficiency of WalkingPrompt.

  • At the level of prompts, one can tune the instructions to emphasize brevity, completeness, or pedagogical clarity.
  • At the level of chunking, different strategies are possible: splitting files mechanically by size or attempting to identify semantic units such as functions and classes (a sketch follows this list).
  • Hardware also matters, since inference time and energy use depend on the resources available.
  • Finally, the choice of model is crucial: large models provide broader context and more fluent summaries, but smaller models may suffice for many modules while consuming far less energy. Note that a fine-tuned smaller model could provide comparable results at a fraction of that energy cost.
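
For example, a semantic splitter for Fortran could cut at the end of program units instead of at an arbitrary character count; the sketch below is a naive illustration of that idea, not WalkingPrompt’s actual chunker.

# Naive semantic chunking sketch for Fortran: close a chunk at the end of a program unit
# (SUBROUTINE/FUNCTION) once a size budget has been reached, instead of cutting mid-routine.
import re

END_UNIT = re.compile(r"^\s*END\s+(SUBROUTINE|FUNCTION)\b", re.IGNORECASE)

def semantic_chunks(source: str, max_chars: int = 8000) -> list[str]:
    """Accumulate whole program units; flush a chunk at the first unit end past the budget."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in source.splitlines(keepends=True):
        current.append(line)
        size += len(line)
        if END_UNIT.match(line) and size >= max_chars:
            chunks.append("".join(current))
            current, size = [], 0
    if current:
        chunks.append("".join(current))
    return chunks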

These levers influence outcomes along three main axes: time to solution, energy consumption, and user trust. While computation time is less critical (summaries can be generated overnight), energy cost and reliability remain important for sustainable and usable deployment.

Next Steps: Beyond Raw Module-level Summaries

One promising extension is to complement textual summaries with structural information extracted from the code. A static callgraph, for example, reveals which functions invoke which others. By integrating this graph into the summarization process, each file or chunk can be enriched with the context of its dependencies. Instead of isolated descriptions, the output could highlight not only what a routine does, but also how it fits into the broader execution flow. This additional layer would mitigate one of the current weaknesses of LLM-based explanations: their tendency to overlook relationships across files. In this way, WalkingPrompt could move from producing descriptive documentation to generating a genuine mental map of the program.
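
As a first step in that direction, a rough static call graph for Fortran can be extracted with nothing more than a scan for CALL statements; the sketch below does only that, would miss function calls, interfaces and generics, and is therefore a starting point rather than a real analyzer.

# Rough static call-graph sketch for Fortran: map each subroutine to the routines it CALLs.
# Deliberately simplistic: ignores functions, interfaces, generics and preprocessor branches.
import re
from collections import defaultdict

SUB_DEF = re.compile(r"^\s*SUBROUTINE\s+(\w+)", re.IGNORECASE)
SUB_END = re.compile(r"^\s*END\s+SUBROUTINE\b", re.IGNORECASE)
CALL_STMT = re.compile(r"\bCALL\s+(\w+)", re.IGNORECASE)

def call_graph(source: str) -> dict[str, set[str]]:
    """Return {caller: {callees}} for the subroutines defined in one source file."""
    graph: dict[str, set[str]] = defaultdict(set)
    current = None
    for line in source.splitlines():
        if SUB_END.match(line):
            current = None
        elif (definition := SUB_DEF.match(line)):
            current = definition.group(1)
        elif current:
            graph[current].update(CALL_STMT.findall(line))
    return dict(graph)

# The resulting callee lists can then be appended to each caller's prompt, e.g.
# "TimeIntegrate calls Compute_timestep, Compute_Residual and Advance_Conserved_Variables."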

Trust, Privacy, and Scientific Computing

In scientific computing, code explanations are not merely conveniences; they support reproducibility and knowledge transfer. For this reason, faithfulness matters. A misleading summary can cause more confusion than no summary at all. Validation mechanisms are therefore necessary, whether by human review or by cross-checking outputs against static analyses.
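
A cheap cross-check of that kind is sketched below: verify that the routine-like names a summary mentions actually appear in the source it claims to describe, and flag anything else for human review. The regex is a crude assumption about what counts as a routine name.

# Sketch of a cheap faithfulness check: identifiers quoted in a summary should exist
# in the file it describes; anything else is flagged for human review.
import re

def suspicious_identifiers(summary: str, source: str) -> list[str]:
    """Return routine-like names mentioned in the summary that never appear in the source."""
    mentioned = set(re.findall(r"\b[A-Za-z]\w*_\w+\b", summary))  # crude: names containing '_'
    return sorted(name for name in mentioned if name not in source)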

Another issue is privacy: many research codes cannot be sent to cloud services for legal or ethical reasons. Local deployment with smaller models, though less glamorous, may be the only acceptable option. Finally, integration with existing workflows is essential. Our present implementation runs locally on an Ollama server.

A tool that generates summaries in isolation is less useful than one that can plug into continuous integration pipelines or documentation systems such as Sphinx or Doxygen, as we already do for our projects. These considerations determine whether the tool remains an experiment or becomes a durable part of the scientific software ecosystem. In the present case, a monthly pipeline refreshing the documentation could be implemented. The technical challenge would be to give the pipeline access to a local or online inference server. However, while the generation is feasible, validating the output automatically would be much harder.
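
As an illustration, the collected summaries could be written out as reStructuredText pages that a scheduled Sphinx build, for instance a monthly CI job, then publishes. The glue below is a guess at what that could look like, not an existing feature of WalkingPrompt.

# Illustrative glue: write one reStructuredText page per summarized file so that a
# scheduled Sphinx build (e.g. a monthly CI job) can publish them as documentation.
from pathlib import Path

def write_rst_pages(summaries: dict[str, str], out_dir: Path) -> None:
    """Emit one .rst page per summary plus an index.rst with a toctree."""
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    for source_path, text in sorted(summaries.items()):
        name = source_path.replace("/", "_").replace(".", "_")
        page = f"{source_path}\n{'=' * len(source_path)}\n\n{text}\n"
        (out_dir / f"{name}.rst").write_text(page)
        entries.append(name)
    index = "Code summaries\n==============\n\n.. toctree::\n   :maxdepth: 1\n\n"
    index += "".join(f"   {entry}\n" for entry in entries)
    (out_dir / "index.rst").write_text(index)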

Why Bother?

Large scientific codes are rarely written from scratch. They evolve over decades, accumulate modules, and involve numerous contributors. Documentation often lags behind, leaving new team members with a steep learning curve. The concept behind WalkingPrompt offers a way to mitigate this problem: it automates a repetitive task, reduces wasted human effort, and does so in a more energy-efficient manner than repeated ad hoc prompting. While the results are not flawless, they are reproducible, consistent, and adaptable.

For the scientific computing community, the promise is not only to simplify code comprehension but also to embed documentation generation from the codebase itself into routine practice. The broader vision is that explaining code should not be an afterthought—it should be a service, running quietly in the background, ensuring that complex software remains accessible to those who need it.



Antoine Dauptain is a research scientist in computer science and engineering for HPC. He is the assistant team leader of COOP.
