
The problem with legacy codes

Legacy codes are not a rare occurrence in scientific computing. These codes tend to occupy a niche, and when they succeed at developing a community they can quickly turn into a de facto standard. It is both a blessing and a curse: as the community grows, the software evolves to cover more and more use cases, which in turn helps the community grow further. Having an established community also brings stability, since users generally want the results of the software to stay consistent over time.

However, this can also lead to an ever-growing codebase and long-established user habits that can impair deep evolutions of the code. This had not been a major obstacle in HPC for a long time, since the established architecture had been a grid of distributed CPUs for three decades, and CPUs and their compilers have made great progress in providing ever-increasing performance on the same codebase over the years.

Nowadays the advent of GPUs presents a technological breakthrough, since they require the use of completely new programming models, but rewriting a several-million-line code in a new paradigm is hardly feasible and might cause great friction within the community.

Programming models such as OpenACC and OpenMP aim at bridging the gap by providing compiler directives that instruct the compiler to generate GPU code while keeping the original codebase intact. They also represent a much more manageable solution, since they only require the addition of a relatively small number of directives.

While these assertions generally hold true in actual implementation scenarios, they still fall short of the general goal. The original code base remains, but the developers are still burdened with the added directives and have to learn their programming model. More importantly, legacy codebases are generally deeply rooted in efficient CPU implementations. Applying GPU offloading directives to a code base designed for CPUs will probably lead to poor (if not abysmal in the worst cases) performance on GPUs, which defeats the purpose of GPUs, whose sole reason for existing in HPC is to provide better performance than CPUs.

Another approach that has been gaining traction in recent years is the use of automatic translation tools to generate efficient GPU code from the original code base. This brings the advantage of theoretically leaving the original codebase untouched, while making it possible to generate completely different implementations tailored for GPUs. The diversity and rapid evolution of hardware, programming languages and runtime libraries also mean that a fixed optimization strategy is not the best option. Being able to quickly evaluate and deploy alternative optimization strategies is another key advantage of automatic translation.

Case study: the ARPEGE weather forecasting simulation

ARPEGE is a prime example of such a legacy code: made of several million lines of Fortran, its sheer size renders manual porting unmanageable, and decades-old habits among researchers brought a strong will to keep the legacy codebase in its current form.

Transformations of the code are applied through Loki, a Python library primarily aimed at providing an abstract internal representation of Fortran code, together with high-level operators that make it easy to build very expressive transformation processes.

Here is a short example to illustrate the expressiveness of Loki: the content of the routine is explored by a visitor, iterating only over AST nodes of type Loop. If the loop counter is a variable named 'k', we print the content of the loop.

from loki import FindNodes, Loop

# 'routine' is a loki Subroutine obtained from a parsed source file
for loop in FindNodes(Loop).visit(routine.body):
    if loop.variable.name == 'k':
        print("Loop found:", loop.body)

The structure of ARPEGE

Most of the physics models operate on a grid representing the surface of the Earth or a part of it, each grid point representing a column of atmosphere divided into vertical levels. The structure of the code is tailored for both CPU efficiency and relative simplicity: the grid is divided into small blocks for cache efficiency, top-level routines manage the global loop across blocks, and the subsequent (large) call tree of subroutines implements computations on a single block of grid points.

Data is a mix of global arrays representing physics fields over the whole grid, and local arrays declared in the subroutines, whose size corresponds to a single block.

Here are two code snippets that illustrate those properties:

!Global routine
KGPBLKS = GRID_SIZE / KLON
ALLOCATE(GLOBAL_DATA_1(KLON, 0:KLEV, KGPBLKS))
...

DO IBL=1, YDCPG_OPTS%KGPBLKS
    CALL PHYS_MODEL(GLOBAL_DATA_1(:,:,IBL), GLOBAL_DATA_2(:,:,IBL), ...)
END DO

!Block-level routine
SUBROUTINE PHYS_MODEL(GLOBAL_DATA_1, GLOBAL_DATA_2)

REAL :: LOCAL_DATA_1(KLON, 0:KLEV)
REAL :: LOCAL_DATA_2(KLON)


DO JLON=KIDIA, KFDIA
    LOCAL_DATA_2(JLON) = 0.0
    DO JLEV=0,KLEV
        LOCAL_DATA_1(JLON, JLEV) = GLOBAL_DATA_1(JLON, JLEV) * X
        LOCAL_DATA_2(JLON) = LOCAL_DATA_2(JLON) + LOCAL_DATA_1(JLON, JLEV) / Y
    END DO
END DO

....

DO JLON=...
  ...
END DO

CALL SUBMODEL(LOCAL_DATA_2, GLOBAL_DATA_2)
...

END SUBROUTINE

This layout is adequate for CPUs, since having small blocks makes efficient use of the cache, and it simplifies the development of the code, since at any time most routines only have to care about the arguments they receive and the local data of the current block.

However, there are several characteristics that would make it inefficient or even impossible to generate GPU kernels from the code in its current form.

  • The first issue is that subroutines make heavy use of local automatic arrays; these are generally grid-block sized and there can be dozens of them. These allocations happen on the stack; on a CPU the stack can be pretty much unlimited in size, but on GPUs it is limited to only a few dozen kilobytes on each of the compute units, and all those arrays simply cannot fit in such a small space.

  • Next comes the layout of the computations, which consists of many small loops over the grid points of the current block. Any subroutine contains many such subsequent loops; this would lead to many GPU kernels with small amounts of computation, which is inefficient.

  • A third issue comes from the limited visibility of data in subroutines. Only the top-level routines see the whole fields; however, parallelizing at this level would also probably be inefficient, as it would generate very large GPU kernels that would require more hardware resources (memory, registers…) than a GPU can provide. Ideally we want to parallelize at lower levels, but those levels only expose the content of a single block, while we would like to parallelize across all blocks.

  • A final issue is that we need to be able to identify precisely where in the current code we want to generate those GPU kernels.

Solving these issues through automatic transformation

To get a working GPU implementation, we first need to devise solutions to these issues, then find a transformation strategy, and finally find a way to implement it using Loki's capabilities. We will not elaborate much on the implementation here, since it is essentially a lot of tedious details.

The transformation strategies have to be able to automatically understand the structure of the code in order to modify it. Fixed elements such as the semantics of the language or compile-time constants can be directly interpreted through Loki's parsing; however, user-defined identifiers such as variable, routine or data type names are inherently meaningless.

Human knowledge has to be manually introduced into the system so the scripts can identify key elements in the code. Thankfully, ARPEGE already has a history of enforced coding rules, which confers a high level of uniformity on the codebase. For example, a major point is identifying loops that iterate over the grid points of a block. The reference variable for the length of the blocks is conventionally called NPROMA, so we define lists of names such as nproma_aliases, nproma_bounds or nproma_loop_indices that contain the corresponding variable names encountered in the code, representing either the length, the boundaries or the loop counter when treating a grid-point block.

We can then write scripts that transform the code based on those identifier names, as long as the code to transform complies with this convention.
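As a minimal sketch of the idea (the list contents here are hypothetical, not the actual ones used for ARPEGE), a Loki script can flag horizontal loops by matching the loop index against such a list:

from loki import FindNodes, Loop

# Hypothetical list of conventional grid-point loop counters
nproma_loop_indices = ('JLON', 'JROF')

def find_horizontal_loops(routine):
    """Return the loops of 'routine' that iterate over grid points."""
    return [loop for loop in FindNodes(Loop).visit(routine.body)
            if loop.variable.name.upper() in nproma_loop_indices]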

Local automatic arrays

Automatic variables are a standard element of the Fortran language: these are variables declared in the specification section of a routine that are not arguments. These variables are implicitly allocated on the stack when entering the routine and deallocated on exit. In typical CPU usage this does not pose a problem: the default stack is large (8 MB per process is a common value) and can easily be extended at will on Linux systems with the command ulimit -s unlimited.

However, the stack of a GPU thread is limited by the local memory of the compute unit. For example, the compute units inside an NVIDIA A100 card only possess 192 KB of L1 cache memory. This memory is shared by warps of 32 threads, and a common practice is to have 4 such warps residing on the unit. This barely leaves 1.5 KB of cache memory available for each thread, definitely not enough to store dozens of arrays that may each contain several hundred FP64 values.

There are multiple solutions to such a problem. One could be to turn these local arrays into global arrays, either by declaring them in a higher-level caller routine (in which case they become arguments of the subroutine) or by allocating them in a module (in which case they become a module import). This is manageable for a small number of arrays, but with a deeply nested call stack that uses dozens of arrays this can quickly bloat the code. A further issue is that these variables will have a much longer lifetime than their original scope, which was strictly limited to the subroutine that used them.

We explored a second solution: allocating a very large buffer in memory, and turning local arrays into pointers to regions of this buffer. We use the Cray pointers extension, which is a non-standard but well-established part of Fortran (notably supported by the nvhpc compiler). Except for the buffer object, which has to be created at the top level and passed down to subroutines, all required transformations are purely local and do not impact the array's usage in the subroutine.

For a given identified array:

REAL(KIND=JPRB) :: ZDET(KLON, KLEV)

We just have to use the non-standard POINTER statement to associate the array with a named Cray pointer; then we associate this pointer with the current offset in the buffer (an object of type STACK), and increment the current offset in the buffer by the size of the array.

REAL(KIND=JPRB) :: ZDET(KLON, KLEV)
POINTER(IP_ZDET_, ZDET)
TYPE(STACK), INTENT(IN) :: YDSTACK
TYPE(STACK) :: YLSTACK
...
YLSTACK = YDSTACK
IP_ZDET_ = YLSTACK%L
YLSTACK%L = YLSTACK%L + JPRB*SIZE(ZDET)
...
ZDET(JLON, JLEV) = ...

A drawback of this approach is that its portability is not guaranteed, and we need to reserve a buffer large enough for the whole application. This wastes some memory space, and we have the additional burden of finding an adequate size for that buffer.
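On the Loki side, the first step of such a transformation is to identify the local automatic arrays of a routine. A minimal sketch, assuming only that declared variables expose their shape through their type (the name-based filtering of the real scripts is omitted):

def automatic_arrays(routine):
    """Local (non-argument) arrays that are implicitly stack-allocated."""
    arg_names = {a.name.lower() for a in routine.arguments}
    return [v for v in routine.variables
            if v.name.lower() not in arg_names
            and getattr(v.type, 'shape', None)
            and not v.type.allocatable and not v.type.pointer]

Each array identified this way then receives the POINTER declaration and the offset arithmetic shown above.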

Small compute loops

The layout of computations in ARPEGE mainly consists of subsequent small loops over the grid points of a block (which we will call “horizontal loops”). This is efficient on a CPU because it optimizes cache usage, and a CPU is very efficient at treating quickly varying workloads.

However, on a GPU, starting a new loop kernel entails a noticeable latency cost, and having kernels that perform a substantial amount of computation is mandatory to hide these latencies. Multiple small loops with few computations are clearly inefficient in that regard.

Here we can benefit from the structure of ARPEGE: subsequent horizontal loops usually compute independent values for each of their grid points and can therefore be fused into a single loop. Instead of having multiple loops each doing some computation on the whole block, we will have a single loop where a single iteration does all the required computation for a single grid point. Since a grid point represents a column of atmosphere in the model, this transformation is called the Single Column Coalesced, or SCC, transformation. Coalescence means that neighbouring iterations of a loop access neighbouring data in memory, which is implicitly the case here and is the optimal scheme for memory accesses on a GPU.

!Original routine

DO JLON=1, NPROMA
  ZARRAY(JLON) = 0.0
END DO

DO JLON=1, NPROMA
  DO JLEV=1, KLEV
    ...loop2...
  END DO
END DO

DO JLON=1, NPROMA
  DO JLEV=2, KLEV-1
    ...loop3...
  END DO
END DO

!SCC transformed routine

DO JLON=1, NPROMA
  ZARRAY(JLON) = 0.0

  DO JLEV=1, KLEV
    ...loop2...
  END DO

  DO JLEV=2, KLEV-1
    ...loop3...
  END DO

END DO
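
A heavily simplified sketch of how such a fusion can be written with Loki is shown below; it assumes the horizontal loops are adjacent, share the same bounds and are independent, properties the real transformation must verify before merging (names are illustrative):

from loki import FindNodes, Transformer, Loop

def fuse_horizontal_loops(routine, index='JLON'):
    loops = [l for l in FindNodes(Loop).visit(routine.body)
             if l.variable.name.upper() == index]
    if len(loops) < 2:
        return
    # Splice every loop body into the first loop...
    fused = loops[0].clone(body=tuple(n for l in loops for n in l.body))
    # ...and delete the now-redundant loops from the routine body
    mapper = {loops[0]: fused}
    mapper.update({l: None for l in loops[1:]})
    routine.body = Transformer(mapper).visit(routine.body)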

Implicit operations on whole arrays or array sections cannot be directly treated by the SCC transformation; therefore, a preliminary step identifies such operations on block arrays and replaces them with explicit loops.

The SCC transformation also enables another optimization: local arrays that only contain nproma values see their values used independently in each iteration of the horizontal loop. These arrays can then be safely demoted to a single scalar value, allowing them to be placed on the stack of the thread instead of the global “stack” buffer.
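As a hedged sketch of that demotion (the expression substitution is the core idea; the matching rewrite of the declaration is omitted, and the alias list is again hypothetical):

from loki import FindVariables, SubstituteExpressions

def demote_nproma_arrays(routine, nproma_aliases=('NPROMA', 'KLON')):
    arg_names = {a.name.lower() for a in routine.arguments}
    # Local arrays whose only dimension is the block length
    targets = {v.name for v in routine.variables
               if v.name.lower() not in arg_names
               and getattr(v.type, 'shape', None)
               and len(v.type.shape) == 1
               and str(v.type.shape[0]).upper() in nproma_aliases}
    # Replace every indexed access ZARR(JLON) by a plain reference
    vmap = {v: v.clone(dimensions=None)
            for v in FindVariables().visit(routine.body)
            if v.name in targets}
    routine.body = SubstituteExpressions(vmap).visit(routine.body)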

Data visibility in subroutines

Currently, only a handful of top-level routines actually manipulate complete data fields, looping across the blocks of a field and passing only a single block to their subroutines. This allows for easy multi-threaded parallelism by assigning a CPU thread to the iteration over a block. However, blindly generating a GPU kernel at that level is not the best idea: there is often a deep hierarchy of subroutines called inside those loops, and GPUs have very limited local resources for data and program instructions in their compute units.

Having very long GPU kernels is detrimental: when the compute unit runs out of resources, it compensates by “spilling” into the global GPU memory. This can easily create a major bottleneck, since global memory accesses are slower and there are more than 100 compute units competing for that access on current GPUs. Ideally, we should therefore aim at generating kernels that perform as much computation as possible without triggering memory spilling.

It is very hard to predict which combination of routines might come close to this optimum, and the optimum will vary across hardware generations. Therefore, we need to be able to generate GPU kernels anywhere in the subroutine hierarchy in order to experiment and find efficient combinations across the whole codebase. This flexibility is also crucial for debugging: since a GPU kernel acts as a black box, having a large amount of operations in a single kernel renders it nearly impossible to debug, and we want to be able to split it into smaller individual kernels.

A major issue here is that subroutines currently only have visibility on one block, while we naturally want to apply coarse-grained parallelism across those blocks; therefore, we need to turn local blocks into global data fields. To that effect, a family of abstract classes named FieldAPI has been developed. A FieldAPI object is essentially a wrapper around an array, containing a pointer to a CPU-allocated array, a pointer to a GPU-allocated array, and methods to manage those arrays (create, copy, update, …).

The transformation we implement here consists in turning relevant arrays into FieldAPI objects with an additional dimension (the number of blocks) in subroutines that generate GPU kernels, and replacing the arrays with the relevant pointers in computations. Plain arrays are simply identified by the fact that their leading dimension is one of the nproma aliases. For arrays used in derived types, FieldAPI objects are introduced in the definition of the derived type alongside the corresponding arrays, and a list of those arrays is extracted. Arrays from derived types are compared against this list to determine whether they should be replaced by their FieldAPI counterpart in the routine.

!Original code
REAL(KIND=JPRB) :: ZMIXNL(NPROMA, NFLEVG)
...
ZMIXNL(:, :) = 1.0_JPRB

!Transformed code
TYPE (FIELD_3RB_ARRAY) :: YL_ZMIXNL
...
YL_ZMIXNL%F_P%PTR(:, :, JBLK) = 1.0_JPRB
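The detection side of this transformation can again be sketched in a few lines; the alias list is illustrative, and the real scripts also handle derived-type members:

nproma_aliases = ('NPROMA', 'KLON')   # illustrative aliases

def fieldapi_candidates(routine):
    """Arrays whose leading dimension is a block length (nproma)."""
    return [v for v in routine.variables
            if getattr(v.type, 'shape', None)
            and str(v.type.shape[0]).upper() in nproma_aliases]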

Data exchange

One major issue remains untreated: the data exchange between CPU and GPU. To that effect we created another transformation, which calls the relevant update methods on the FieldAPI objects used in the computations.

We want those updates to happen outside of the parallel kernel; therefore, the transformation creates a copy of the relevant part of the code. Inside this copy, computations using FieldAPI arrays are replaced with calls to their SYNC methods, which trigger an update between the CPU and GPU data when relevant. A simplification pass is then applied: since arrays tend to be reused many times and updates are triggered for whole arrays, horizontal and vertical loops are eliminated from the code.
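The following sketch shows the gist of it, under the assumption that FieldAPI arrays have already been identified and that a single read/write synchronisation is enough (the real transformation distinguishes read-only from read-write updates):

from loki import FindNodes, FindVariables, Assignment

def sync_calls(region_body, fieldapi_names):
    """Collect the FieldAPI arrays touched by a region and emit one
    synchronisation call per array, not per statement."""
    touched = set()
    for stmt in FindNodes(Assignment).visit(region_body):
        touched |= {v.name for v in FindVariables().visit(stmt)
                    if v.name in fieldapi_names}
    return [f'CALL {name}%SYNC_DEVICE_RDWR()' for name in sorted(touched)]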

Application to the whole code

We now have several building blocks to adapt the code into a form suitable for efficient GPU kernels; the last step is to devise a flexible way to orchestrate their application in the code. To that effect, a set of so-called “ACDC” directives has been defined, which allows the developer to tag certain regions of the code and bootstrap the transformation process. These allow choosing between three parallelisation modes: OpenMP (classical multi-threading on blocks), OpenMPSingleColumn (same as OpenMP but with the SCC transformation) and OpenACCSingleColumn, which applies the SCC transformation and generates GPU kernels through OpenACC directives.

Our main transformation therefore scouts the files for these custom directives and then applies the necessary transformations:

  • Creating a conditional block for each required parallelisation mode (so that we may choose any implementation at runtime)

  • Creating synchronization calls for each mode

  • Creating the main loop across blocks and, where relevant, an inner horizontal loop

  • Transforming the body of the loop with the array-to-FieldAPI transformation, combined with the SCC transformation when relevant

  • Introducing the adequate OpenMP or OpenACC directives

!Tagged original code
!$ACDC PARALLEL, TARGET=OpenACCSingleColumn {

ZMIXNL(:, :) = 1.0_JPRB

!$ACDC }

!Generated code (OpenACCSingleColumn mode)
CALL YL_ZMIXNL%F_P%SYNC_DEVICE_RDWR()

!$ACC PARALLEL LOOP GANG DEFAULT(PRESENT)
DO JBLK=JBLKMIN, JBLKMAX
!$ACC LOOP VECTOR
  DO JLON=1, KLON
    YL_ZMIXNL%F_P%PTR(JLON, :, JBLK) = 1.0_JPRB
  END DO
END DO

Et voilà! We finally generated parallel kernels from our original code. Of course, I have only covered the general principles here; as always, the devil is in the details, and the actual scripts required a lot of work to manage edge cases and specific constructs. For example, pointers that point to FieldAPI arrays should be transformed identically to their array counterparts, which requires a first investigation pass on the code to identify the pointers that are actually set to FieldAPI arrays.

Some refactoring of the code was also necessary to completely uniformize it and to manually solve some rare and tricky edge cases that would not be easily managed by the transformation scripts.
