Multi-architecture parallelism using Kokkos/C++ library

  From Monday 9 December 2024 to Wednesday 11 December 2024


Duration: 2.5 days / 17 hours

Face-to-face training session

About 15 years ago, Nvidia introduced the CUDA architecture, i.e. the use of graphics processing units (GPUs) for general-purpose, high-performance computing tasks. Today, writing new applications, or simply refactoring existing ones, with both portability and performance in mind is still very challenging. The goal is to use the available HPC hardware resources efficiently, whether they are multicore CPUs (x86, ARM64, …) or GPUs (Nvidia, AMD, Intel).

This training provides a dedicated introduction to the open-source Kokkos C++ library. Kokkos has been developed mainly in the US (Sandia and Oak Ridge national laboratories) for about a decade, with funding from the Department of Energy under the Exascale Computing Project (ECP). We will present a theoretical and practical introduction to the Kokkos programming model and illustrate its advantages over alternative shared-memory programming models such as OpenMP and OpenACC.

By the end of the training, participants are expected to be able to integrate the Kokkos library into their own HPC projects.

Objective of the training

Provide a theoretical and practical introduction to the Kokkos programming model. The course starts from its abstract concepts, namely parallel programming patterns (parallel for, reduce and scan loops) and data containers, and ends with an overview of its ecosystem (profiling tools, linear algebra, Python bindings, …).

Learning outcomes

On completion of this course, you will be able to:

  • build, install and integrate Kokkos into an existing application, or write a Kokkos application from scratch
  • write Kokkos computing kernels and manage data containers on heterogeneous platforms (CPU/GPU)
  • assess performance using profiling tools
  • refactor existing code for performance portability

Teaching methods

The training alternates between theoretical presentations and practical work. The final evaluation is a multiple-choice questionnaire. The training room is equipped with computers, and the practical work can be done in pairs.

Referent teacher: Pierre KESTENER

Target participants

This course is for anyone who wishes to learn how to write HPC parallel code that runs on a variety of hardware architectures, and to achieve performance portability with a modern C++ library (Kokkos).

Prerequisites

  • Be an employee of a European company; a certificate from the employer is required
  • Have at least 5 years of higher education, or be a Master 2 trainee
  • Knowledge of the C++ language
  • Some knowledge of parallel programming: multithreading and/or OpenMP basics
  • Interest in HPC application development
  • The training can be given in French or English depending on the audience; CEFR level B2 is required.

December, Monday 9, from 2pm to 5pm

  • Refresher on hardware architecture basics (CPU/GPU) and on performance metrics (memory bandwidth, FLOP/s): everything needed to understand why writing portable, high-performance code is difficult. Practical exercise.
  • General introduction to the Kokkos C++ library: its origins, and an overview of its concepts and software abstractions, namely the abstract machine model (host/device) and the Kokkos parallel programming model
  • Examples of production codes using C++/Kokkos
  • Practical exercise: how to build and install the Kokkos C++ library with the CPU OpenMP backend and the GPU CUDA backend; how to choose a compiler; how to integrate Kokkos into a CMake-based project; how to write a modulefile to simplify use of the library.

December, Tuesday 10, from 2pm to 5pm

  • Writing C++/Kokkos computing kernels with parallel programming patterns: for loops, reduce loops and scan loops
  • Overview of the execution space and memory space concepts: why we need them and how to use them. The concept of execution policy
  • Using hardware-aware data containers: multidimensional arrays (Kokkos::View) and hash maps (Kokkos::UnorderedMap)
  • Profiling tools (Kokkos::Tools)

December, Wednesday 11, from 2pm to 5pm

  • Advanced use of Kokkos/C++: hierarchical parallelism (using teams of threads), with examples
  • Coupling MPI and Kokkos for distributed applications. Introduction to Kokkos/RemoteSpace (an experimental feature for accessing the memory of remote nodes)
  • Overview of the Kokkos ecosystem: linear algebra (KokkosKernels), pykokkos (Python bindings)
  • Kokkos for Fortran users: Kokkos/FLCL (Fortran Language Compatibility Layer)

Final examination

A final exam will be conducted during the training.






