Reading time: 12 min


Photo by Tim Mossholder on Unsplash

Disambiguation

The term “Code Transformation” can mean different things depending on the context. For the COOP Team at CERFACS, it refers to the process of modifying source code to improve performance, portability and maintainability in high-performance computing (HPC).

It is important to distinguish code transformation in the current context from transformations performed by the compiler itself. Modern compilers apply numerous transformations and optimizations to target a given HPC architecture, and typically need to support scalar, vector and parallel execution models. Given the complexity of the latter, they are often guided by developers through the use of pragma-based approaches such as OpenACC or OpenMP. However, compiler optimizations occur at the final stage of the software development process. The type of code transformation we focus on happens earlier, re-shaping and reorganising a common code base so that the compiler can then produce even faster and more portable executables.

Another distinction must be made with domain-specific languages (DSLs), which often integrate automatic code generation to optimize execution. Example DSLs in scientific computing include JAX (for machine learning), FEniCS and Firedrake (for finite element computations), waLBerla (for lattice Boltzmann simulations), and OPS/OP2 (for general structured and unstructured mesh-based applications). While these frameworks can transform abstract descriptions into efficient implementations, they rely on relatively specific paradigms which help to reduce the scope of these transformations.
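To make this concrete, here is a minimal sketch of the DSL idea using JAX: the stencil below is written as plain, hardware-agnostic array mathematics, and jax.jit traces and compiles it (via XLA) into an optimized kernel for whichever CPU or GPU backend is available. The function and data are purely illustrative.

```python
# Minimal illustration of the DSL idea with JAX: the model is written as
# abstract array mathematics, and jax.jit compiles it for the available backend.
import jax
import jax.numpy as jnp

@jax.jit
def laplacian_1d(u):
    # Hardware-agnostic description of a 1D finite-difference stencil
    return u[:-2] - 2.0 * u[1:-1] + u[2:]

u = jnp.linspace(0.0, 1.0, 1024) ** 2
print(laplacian_1d(u)[:4])  # first call triggers compilation; later calls reuse it
```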

In the most general use-case, code transformation may be applied to large, legacy codebases in mainstream compiled languages like C++ and Fortran, ensuring they can evolve toward new hardware architectures without re-writing (then re-testing and debugging) large portions of the code. However, code transformation can also be used as the basis of new HPC applications, where we ‘separate the concerns’ of experienced HPC developers from the scientists and numerical analysts who develop the physical models and algorithms on which the application itself depends. In this approach, the aim is to leverage portable and highly-optimised execution libraries at a low level, while maintaining much simpler software constructs at a higher level. The technique has the potential to deliver not only portability and efficiency, but sustainable software that is easy to read, develop and maintain.

Achieving performance, portability and maintainability in HPC

HPC applications constantly strive to balance performance, portability, and maintainability. In 2025, developers can choose between several strategies. One approach is to write direct interfaces to hardware-specific languages, as demonstrated by projects like Neko. Others prefer middle layers such as NekRS with OCCA, or C++-based solutions like Kokkos that provide a common abstraction across different architectures.

Another path is to rely on advanced metaprogramming techniques, as employed by Kalpataru or CODA, where C++ templates enable specialization at compile time. More conventional solutions use pragma-based optimizations like OpenACC and OpenMP to assist compilers in generating architecture-specific code. Finally, newer trends explore just-in-time (JIT) compilation, though this remains mostly relevant for Python-based environments like Numba, which do not directly address the needs of large Fortran or C++ projects.
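As a small illustration of the JIT route mentioned above, the following Numba sketch (the saxpy kernel is purely illustrative) compiles a plain Python loop to native machine code the first time it is called with a given argument type:

```python
# Minimal Numba sketch: the Python loop nest below is compiled just-in-time
# to native code on first call, instead of being interpreted.
import numpy as np
from numba import njit

@njit
def saxpy(a, x, y):
    out = np.empty_like(x)
    for i in range(x.shape[0]):  # compiled to a native loop by Numba
        out[i] = a * x[i] + y[i]
    return out

x = np.ones(1_000_000)
y = np.arange(1_000_000, dtype=np.float64)
print(saxpy(2.0, x, y)[:3])  # first call compiles; subsequent calls run at native speed
```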

A recurring challenge in HPC is the rapid pace of hardware evolution. Many engineers have seen their efforts wasted when porting code to architectures that became obsolete before full adoption. The case of Intel’s Xeon Phi processors is a striking example: the second-generation MIC architecture, Knights Landing, was introduced in 2016, only for the product line to be abandoned soon afterwards, forcing developers to reconsider their long-term strategies. The key challenge is ensuring that scientific software remains viable despite these shifts, avoiding costly rewrites every time a new accelerator or processor architecture gains traction.

One potential solution often discussed is leveraging Julia’s meta-programming capabilities. Julia offers powerful tools for programmatically modifying and generating code, which could simplify certain aspects of code transformation. However, this approach displaces the problem rather than solving it: transforming an existing C++ or Fortran codebase into Julia requires a robust parsing and conversion phase, which is in itself a complex and error-prone task that we have not yet surveyed.

At CERFACS, our goal is to explore practical solutions that enhance the productivity of HPC experts. The challenge is not just about optimizing for today’s architectures but ensuring that code remains adaptable for future platforms. Whether through improved automation, abstraction layers, or better compiler guidance, our focus remains on keeping high-performance computing accessible, maintainable, and future-proof.

Code transformation possibilities

A direct source-to-source transformation with Loki

One of the most direct ways to transform HPC code is through automatic source-to-source translation, ensuring that legacy applications remain performant on modern architectures. Loki, developed at the European Centre for Medium-Range Weather Forecasts (ECMWF), provides a framework to refactor and modernize large Fortran-based applications. By providing high-level abstractions to represent and manipulate the components of the code (an abstract syntax tree and visitors), Loki eases the development of static code analysis and transformation scripts. This enables Fortran-to-Fortran optimization, facilitating the transition from legacy patterns to GPU-ready OpenMP or OpenACC code. This approach is particularly useful when maintaining a single, readable codebase while progressively introducing performance enhancements.
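To give a flavour of what such scripts look like, here is a minimal sketch of a Loki analysis pass (the file name is hypothetical, and exact import paths and node attributes may differ between Loki versions): it parses a Fortran source into Loki's internal representation, then visits its loop nodes.

```python
# Sketch of a Loki analysis script: parse a Fortran file into Loki's IR,
# then visit loop nodes to report candidates for transformation.
from loki import Sourcefile, FindNodes, Loop, fgen

source = Sourcefile.from_file('kernel_mod.F90')   # hypothetical input file
for routine in source.subroutines:
    for loop in FindNodes(Loop).visit(routine.body):
        print(f'{routine.name}: loop over {loop.variable} '
              f'with bounds {loop.bounds}')
    # fgen() regenerates Fortran source from the (possibly transformed) IR
    print(fgen(routine))
```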

One such experience we contributed to is the GPU porting of the Arpège-IFS weather forecasting simulation from Météo-France, which consists of several million lines of legacy Fortran code. The choice had been made to preserve the legacy code in its entirety and use OpenACC directives to implement the GPU version. However, several aspects of the code are incompatible with, or inefficient under, a direct OpenACC implementation: the use of large numbers of local automatic arrays in subroutines, computations spread over many small loops, and the use of global variables with a deep hierarchy of derived types. Thanks to Loki’s expressiveness, we were able to experiment with various strategies to automatically transform the code to circumvent these issues and insert the relevant OpenACC directives. Since applying complex transformations requires the code to conform to a specific set of rules, Loki also allowed us to design scripts that identify and correct unfit patterns in the code base.
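As an illustration of this kind of directive-insertion pass, here is a hedged sketch (file name hypothetical; Loki's API details may vary between versions) that prepends an OpenACC directive to each loop of a routine. The real transformations applied to Arpège-IFS are of course far more involved, but they are built from passes of exactly this shape.

```python
# Hedged sketch of an automatic OpenACC annotation pass with Loki:
# wrap each loop of a routine with an !$acc directive.
from loki import Sourcefile, FindNodes, Transformer, Loop
from loki.ir import Pragma

source = Sourcefile.from_file('kernel_mod.F90')   # hypothetical input file
routine = source.subroutines[0]

loop_map = {}
for loop in FindNodes(Loop).visit(routine.body):
    acc = Pragma(keyword='acc', content='parallel loop gang vector')
    loop_map[loop] = (acc, loop)                  # prepend the directive to the loop

routine.body = Transformer(loop_map).visit(routine.body)
source.write(path='kernel_mod.acc.F90')           # emit the transformed Fortran
```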

We plan to explore these capabilities to optimize the GPU port of the CERFACS AVBP solver. Being a smaller legacy Fortran code with very few incompatibilities, it had been ported through manual insertion of OpenACC directives. However, while this implementation reaches decent performance on GPUs, experiments have shown that large performance gains remain possible if we refactor the code with a GPU implementation in mind. We plan to use Loki to design transformations that would automatically unlock this performance while keeping the current code base.

Leveraging the recent LLM advances in the IDE (Integrated Development Environment)

Large Language Models (LLMs) are revolutionizing code development, enabling AI-assisted programming and code transformation. CURSOR is an AI-driven code editor that integrates LLM-powered suggestions, automated refactoring, and context-aware completions. While traditional refactoring tools rely on static analysis, CURSOR leverages machine learning to suggest transformations and interacts directly with the local codebase. This opens the door to semi-automated porting, refactoring, and even assisted migration from Fortran to C++ or Python, reducing the manual effort typically required for large-scale transformations.

While LLM-based tooling is promising, open-source solutions that do not contaminate intellectual property (IP) are particularly attractive for HPC teams. Tools built on open models can evolve faster, integrate the latest advancements seamlessly, and avoid the IP entanglements that arise with proprietary AI solutions. In contrast, limited-diffusion or closed-source models often require local AI solutions that must be explicitly trained and validated within an organization’s IP constraints. This slower adoption curve can hinder productivity, whereas open models provide a more agile, adaptable environment for HPC development.

Using static code analysis and/or RAGs to anticipate future challenges with pragma-based approaches

Pragma-based parallelism, such as OpenMP and OpenACC, has long been a key strategy for enabling portable HPC performance. However, manually inserting pragmas into a complex codebase can be error-prone and suboptimal. Static code analysis tools, such as Clang-based analyzers or LLVM-based tools, can automatically detect hotspots and suggest optimal parallelization strategies. More recently, Retrieval-Augmented Generation (RAG) has emerged as a promising AI-driven approach, combining LLM-generated insights with real-world HPC knowledge bases. This could help developers predict performance bottlenecks, identify poorly optimized pragmas, and even propose automated pragma insertions for maximum efficiency.
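As a toy illustration of the retrieval step of such a RAG pipeline (the knowledge base, query, and use of TF-IDF are purely illustrative; a real system would use code-aware embeddings and an LLM for the generation step):

```python
# Illustrative retrieval step of a RAG pipeline for pragma advice:
# find the knowledge-base entry most relevant to a code-analysis finding,
# which would then be injected into an LLM prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Collapse tightly nested loops with '!$acc loop collapse(n)' to expose parallelism.",
    "Avoid automatic arrays in device routines; hoist temporaries to the caller.",
    "Use 'acc data' regions to keep arrays resident on the GPU between kernels.",
]
query = "small subroutine allocates local automatic arrays inside an OpenACC kernel"

vec = TfidfVectorizer().fit(knowledge_base + [query])
scores = cosine_similarity(vec.transform([query]), vec.transform(knowledge_base))[0]
print("Retrieved guidance for the LLM prompt:", knowledge_base[scores.argmax()])
```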

Finding the correct abstraction, and moving a 500K LOC CFD Fortran code to this abstraction: the next generation of OP2

For many legacy HPC codes, the challenge is not just performance, but maintainability and portability. The OP2 framework, initially developed at the University of Oxford, offers a high-level abstraction for unstructured mesh applications, enabling back-end independence across CPUs, GPUs, and FPGAs. Moving a 500,000+ lines-of-code (LOC) Fortran CFD application to OP2 requires identifying the core computational patterns, restructuring the code to fit OP2’s DSL-like abstraction, and ensuring performance parity with the original hand-optimized implementation. The next generation of OP2 aims to extend these capabilities, making it easier to transition legacy codes into a hardware-agnostic, performance-portable future.
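To give a flavour of the abstraction involved (this is a toy Python analogue, not OP2's actual API), the sketch below separates a per-edge kernel from the parallel loop that owns iteration, gather and scatter over an unstructured mesh:

```python
# Toy analogue of a par_loop abstraction over an unstructured mesh
# (illustrative only; OP2's real API is C/C++/Fortran-based).
import numpy as np

edges_to_nodes = np.array([[0, 1], [1, 2], [2, 3]])   # hypothetical mesh mapping
x = np.array([0.0, 1.0, 3.0, 6.0])                    # data on nodes
flux = np.zeros_like(x)                               # accumulated on nodes

def edge_kernel(x, flux, i, j):
    # Trivial per-edge arithmetic, scattered back to both nodes
    d = x[j] - x[i]
    flux[i] += d
    flux[j] -= d

def par_loop(kernel, mapping, *args):
    # The backend owns this loop: serial here, but it could be colored,
    # blocked or offloaded to a GPU without touching the user kernel.
    for i, j in mapping:
        kernel(*args, i, j)

par_loop(edge_kernel, edges_to_nodes, x, flux)
print(flux)   # [ 1.  1.  1. -3.]
```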

Looking forward, the evolution of OP3 could further enhance this approach by using Kokkos as a backend, ensuring seamless performance portability across architectures. Additionally, OP3 could introduce a dedicated OP3-Mesh library, designed to handle multiple topological views of the same mesh—including edge-based, cell-based, and node-based representations—thereby enabling more efficient and flexible computational strategies for next-generation HPC applications.

Above we described two possible ways to address legacy code: apply code transformation to the legacy code itself, or develop a new and highly modular code that uses code transformation at its core. A third option might involve a mixture of both, but this is only viable if the data structures at the heart of the legacy code and the new code are similar enough.
The current AVBP Fortran code was originally designed as a scalable MPI application for scalar and vector architectures. As with most hybrid-mesh CFD solvers that use Finite Volume or Finite Element methods, the numerical algorithm is organised into loops over mesh-based entities such as nodes and elements, with optional gather and scatter operations at the beginning or end of these loops, and messages sent between MPI processes as necessary. Having been designed for legacy vector architectures as well as RISC and x86 architectures, the code was organised to achieve relatively good performance on both. This involved keeping the structure of the code as simple as possible using a large number of compact subroutines, then dividing the (large) loops over mesh entities into much smaller ‘groups’. The size of these groups was then chosen to optimise the use of vector registers (on vector machines) or to minimise cache and TLB misses (on RISC/x86 machines).

Although AVBP has since been ported to modern Nvidia and AMD GPUs (using OpenACC and OpenMP), achieving moderately good performance, the current code structure and its use of large numbers of subroutines does not lend itself optimally to these modern architectures. The strategy currently being investigated for AVBP-NG (Next Generation) involves two core components, codenamed AVBP-MeshLib and AVBP-µK (micro-Kernel).
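A minimal sketch of the grouping strategy described above (the group size, arrays and kernel are hypothetical): a large loop over mesh entities is processed in fixed-size blocks so that each block’s working set fits the vector registers or cache of the target machine.

```python
# Sketch of loop 'grouping': process a large mesh loop in blocks whose
# size is tuned per architecture (vector registers vs. cache/TLB).
import numpy as np

GROUP_SIZE = 512                 # tunable per target architecture
n_cells = 100_000
volume = np.random.rand(n_cells)
rho = np.random.rand(n_cells)
mass = np.empty(n_cells)

for start in range(0, n_cells, GROUP_SIZE):
    end = min(start + GROUP_SIZE, n_cells)
    # Each compact 'subroutine' works on one group at a time
    mass[start:end] = rho[start:end] * volume[start:end]
```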

The idea of AVBP-MeshLib is to provide multiple topological views of the same mesh (including edge-based, cell-based, and node-based representations) within the same code. This will enable more efficient and flexible computational strategies when considering high-order numerical methods, while generalising the existing MPI capabilities that are focused on element- and node-based data structures. With AVBP-µK, we are investigating a potentially unusual approach: breaking down the underlying algorithm into even more discrete functions, or micro-kernels. At first sight this would seem at odds with the statement made above about performance optimisation on modern GPU architectures, but this is where code transformation comes in.
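Before turning to the micro-kernel side, here is a toy illustration (not the actual AVBP-MeshLib design) of deriving edge-based and node-based views from a cell-based description of the same mesh:

```python
# Toy sketch of multiple topological views of one mesh: from a cell-to-node
# table, derive an edge-based view and a node-to-cell view.
import numpy as np

cells_to_nodes = np.array([[0, 1, 2], [1, 3, 2]])     # two triangles

# Edge-based view: unique node pairs along all cell boundaries
edges = set()
for cell in cells_to_nodes:
    for a, b in zip(cell, np.roll(cell, -1)):
        edges.add(tuple(sorted((int(a), int(b)))))
edges_to_nodes = np.array(sorted(edges))

# Node-based view: which cells touch each node (for gather/scatter loops)
nodes_to_cells = {}
for c, cell in enumerate(cells_to_nodes):
    for n in cell:
        nodes_to_cells.setdefault(int(n), []).append(c)

print(edges_to_nodes)    # [[0 1] [0 2] [1 2] [1 3] [2 3]]
print(nodes_to_cells)    # {0: [0], 1: [0, 1], 2: [0, 1], 3: [1]}
```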

The idea behind code transformation for micro-kernels is that we analyse the interdependence between the functions (or kernels) called within a given parallel loop, and then make decisions on how to transform the code to achieve optimal performance. By using simple, well-defined kernels, most of which are limited to performing relatively trivial arithmetic computations, we simplify the task of inter-dependency analysis. We may then make informed decisions on how to perform multiple types of optimisation, e.g. merging micro-kernels, re-ordering them, executing them in parallel or in pipelines, or using JIT compilation strategies. By combining this approach with AVBP-MeshLib, we can potentially re-order, re-colour and re-partition data ‘behind the scenes’, opening the door to further run-time optimizations. Finally, we note that the use of micro-kernels can also result in code that is easier to understand and maintain, and that by their very nature micro-kernels are less reliant on specific constructs of any given programming language.

While the concepts are simple to express, their implementation is less trivial and will take several man-years of investigation; we also expect it to involve close partnerships with other HPC laboratories. In the short term, we are embarking on a prototype implementation of code transformation using both pragma-based programming and portable performance libraries such as Kokkos. We will test these concepts on mini-apps that mimic the main features of AVBP while significantly reducing the complexity. Once proven, these mini-apps will pave the way towards AVBP-NG.
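A hedged sketch of the kind of inter-dependency analysis involved (the kernel names and the read/write-set model are hypothetical simplifications): each micro-kernel declares what it reads and writes, and a transformation pass fuses consecutive kernels into one loop whenever no data hazard exists between them.

```python
# Simplified micro-kernel fusion decision based on declared read/write sets.
def can_fuse(k1, k2):
    # In this simple model, fusion is legal only if k2 never reads or
    # rewrites what k1 writes (RAW and WAW hazards); a full analysis would
    # also consider WAR hazards and the loops' index spaces.
    return (k1['writes'].isdisjoint(k2['reads'])
            and k1['writes'].isdisjoint(k2['writes']))

kernels = [
    {'name': 'grad_x',  'reads': {'u'},              'writes': {'du_dx'}},
    {'name': 'grad_y',  'reads': {'u'},              'writes': {'du_dy'}},
    {'name': 'laplace', 'reads': {'du_dx', 'du_dy'}, 'writes': {'lap'}},
]

groups = [[kernels[0]]]
for k in kernels[1:]:
    if all(can_fuse(prev, k) for prev in groups[-1]):
        groups[-1].append(k)          # merge into the current fused loop
    else:
        groups.append([k])            # start a new loop (pipeline stage)

for g in groups:
    print('fused loop:', [k['name'] for k in g])
# fused loop: ['grad_x', 'grad_y']
# fused loop: ['laplace']
```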

Conclusion

These approaches represent our current global vision at COOP for tackling the challenges of code transformation in HPC. However, it is important to remember that these solutions are primarily focused on legacy codebases, where maintaining continuity and compatibility is crucial. When greater flexibility is available, restarting from scratch with a new paradigm—whether through modern C++ abstractions, domain-specific languages, or even entirely new computational frameworks—can sometimes be the better long-term approach.

Regardless of the chosen strategy, one fundamental truth remains: code transformation must never be an excuse to neglect training our HPC community on the latest techniques. Whether working within existing constraints or pioneering new approaches, equipping developers and researchers with the best tools and knowledge is essential to ensuring HPC codes remain efficient, portable, and future-proof in an ever-evolving technological landscape.



Joeffrey Légaux is a research engineer specialized in the optimization and porting of legacy HPC codes.
Mohamed Ghenai is a senior researcher in HPC.
Antoine Dauptain is a research scientist in computer science and engineering for HPC. He is the assistant team leader of COOP.
Michael Rudgyard is a research scientist in computer science and engineering for HPC. He is the team leader of COOP.
