Context
Inno4scale is a European initiative launched to support the development of innovative algorithms for exascale supercomputers, so that their computing power can be fully exploited. Existing high-performance computing codes will not be able to run efficiently on the upcoming exascale systems.
Therefore, the project identifies and supports the development of applications that have the potential to fully exploit the upcoming EuroHPC exascale systems. The most successful applications are expected to be taken up by science and industry after the project ends.
The objective of Inno4scale is to support the EuroHPC Joint Undertaking, whose goal is to deploy exascale supercomputers in Europe. As part of the project, novel algorithms are developed so that applications can efficiently exploit the upcoming European exascale supercomputers. Used in public administration or industry, these supercomputers will be able to tackle computational challenges that were previously unaffordable. Industry, science, and public administration will then be able to reduce their time-to-solution for computational simulations and approach larger problems with novel solutions.
The Flowgen project is one of the studies funded by Inno4scale.
Introduction
CFD simulations play a pivotal role in modern scientific developments; however, they are computationally expensive for complex flow scenarios. Machine-learning-based CFD approaches, including surrogate models, promise a faster alternative.
The amount of data produced by CFD simulations to train a surrogate model can easily become massive, leading to challenges in data storage and transfer.
In this project, we introduce an on-the-fly training framework, where a fully differentiable solver is coupled with an ML-based surrogate model, thereby removing the requirement for data storage.
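Conceptually, the coupling amounts to a training loop in which each solver step immediately produces one training pair for the surrogate, which is consumed and then discarded. The sketch below illustrates this idea; it assumes PyTorch tensors, and the names solver_step, surrogate and initial_state are hypothetical placeholders rather than the project's actual API.

```python
import torch

# Hypothetical components: `solver_step` advances the CFD state by one time
# step, `surrogate` is the neural network being trained online.
def train_online(solver_step, surrogate, initial_state, n_steps, lr=1e-4):
    """Consume solver snapshots as they are produced; nothing is written to disk."""
    optimizer = torch.optim.Adam(surrogate.parameters(), lr=lr)
    state = initial_state
    for _ in range(n_steps):
        next_state = solver_step(state)      # one step of the CFD solver
        prediction = surrogate(state)        # surrogate predicts the same step
        loss = torch.nn.functional.mse_loss(prediction, next_state)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state                   # the (state, next_state) pair is discarded
    return surrogate
```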
Framework design
The framework is built around JAX-Fluids, a fully differentiable CFD solver for 3D, compressible, two-phase flows, developed to facilitate research at the intersection of ML and CFD.
The neural network is built using PyTorch and is coupled with JAX-Fluids via the ADIOS2 library, a unified high-performance I/O framework designed for data exchange in extreme-scale parallel environments.
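The data exchange can be pictured with ADIOS2's high-level Python API, as in the sketch below. The stream name "flowgen.bp", the variable "velocity" and the grid size are illustrative, the exact calls depend on the ADIOS2 version (newer releases expose adios2.Stream instead of adios2.open), and a production coupling would typically use a streaming engine such as SST rather than the default file-based engine.

```python
import numpy as np
import adios2

# Solver side: publish each flow snapshot, one ADIOS2 step per solver step.
with adios2.open("flowgen.bp", "w") as writer:
    for step in range(100):
        velocity = np.random.rand(64, 64, 64, 3)   # placeholder for a solver snapshot
        writer.write("velocity", velocity,
                     velocity.shape, [0, 0, 0, 0], velocity.shape,
                     end_step=True)

# Training side: consume snapshots step by step as they arrive.
with adios2.open("flowgen.bp", "r") as reader:
    for fstep in reader:
        velocity = fstep.read("velocity")
        # hand `velocity` over to the PyTorch surrogate here
```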
Datasets including Homogeneous Isotropic Turbulence (HIT) cases were generated to train surrogate models. Two deep learning architectures, the Fourier Neural Operator (FNO) and the U-Net, were trained first to predict a single time step and subsequently multiple time steps via auto-regressive training.
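Auto-regressive training means the model is rolled forward on its own predictions and penalized against the reference trajectory at every step. A minimal PyTorch sketch of such a rollout loss is shown below; the function name and the assumption that the trajectory fits in a single tensor are illustrative, not the project's exact implementation.

```python
import torch

def autoregressive_loss(model, trajectory, rollout_steps):
    """Roll the model forward on its own outputs and accumulate the loss.

    `trajectory` holds consecutive reference snapshots along its first
    dimension; `rollout_steps` sets the prediction horizon.
    """
    state = trajectory[0]
    loss = 0.0
    for t in range(1, rollout_steps + 1):
        state = model(state)                  # feed back the model's own prediction
        loss = loss + torch.nn.functional.mse_loss(state, trajectory[t])
    return loss / rollout_steps
```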
Challenges
In the case of online learning, changing data statistics—commonly referred to as concept drift—can bias the model toward recently observed features, as each training sample is only processed once. This dynamic nature of the data distribution impairs the model’s ability to generalize, often resulting in degraded performance compared to offline training.
Standard optimization techniques like stochastic gradient descent (SGD) assume stationary loss landscapes, making them less effective in such evolving scenarios and leading to erratic weight updates. Additional challenges arise due to the necessity of estimating normalization statistics on-the-fly, often resulting in unstable training dynamics. Deep models also suffer from vanishing gradients and feature diminishment when learning sequentially, as backpropagated signals weaken over time without reinforcement. The lack of validation phases further complicates model selection and architecture tuning, making adaptive depth estimation a necessity.
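Estimating normalization statistics on-the-fly is usually done with a running estimator updated sample by sample. The sketch below uses Welford's algorithm as one such estimator; it is an illustrative choice and the class name is hypothetical.

```python
import numpy as np

class RunningNormalizer:
    """Welford-style running mean/variance for normalizing streaming samples."""

    def __init__(self, shape):
        self.count = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)

    def update(self, x):
        # Incremental update of mean and sum of squared deviations.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + eps)
```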
Both the U-Net and FNO architectures exhibit limitations in capturing long-term flow dynamics, regardless of the training strategy. Under auto-regressive training schemes in particular, where models are trained on their own predictions, error accumulation across time steps often leads to divergence from physically accurate flow fields.
Collectively, these challenges underscore the complexity of deploying surrogate models for CFD tasks in dynamic, real-time environments.
Solutions
Adaptive strategies and algorithmic innovations have been proposed to address the multifaceted challenges of our study.
Elastic Weight Consolidation (EWC) preserves knowledge from previously seen data by penalizing updates to critical parameters, thus alleviating catastrophic forgetting.
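The EWC penalty is a quadratic term added to the task loss, weighted by an estimate of each parameter's importance (typically the diagonal of the Fisher information). The sketch below assumes PyTorch and that `fisher` and `old_params` are dictionaries keyed by parameter name, computed after the previous data regime; these names are illustrative.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1.0):
    """Quadratic penalty discouraging changes to parameters that were
    important (high Fisher information) for previously seen data."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage: total_loss = task_loss + ewc_penalty(model, fisher, old_params, lam=100.0)
```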
Replay buffers and reservoir sampling maintain a representative subset of past data, improving stability in non-stationary environments.
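Reservoir sampling keeps a fixed-size, uniformly random subset of the stream without knowing its length in advance. A minimal replay buffer built on the classic Algorithm R is sketched below; the class name is hypothetical.

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer holding a uniform random subset of a data stream."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample   # kept with probability capacity / seen

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))
```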
Hedge Backpropagation (HBP) is a promising direction to address the challenges of traditional gradient algorithms by adaptively adjusting the contributions of different network depths during training. HBP mitigates the issue of unknown optimal depth by allowing shallow layers to make early predictions while progressively incorporating deeper representations, thereby enhancing robustness against concept drift and reducing reliance on static architecture choices.
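In HBP, each hidden block gets its own output head, the final prediction is a convex combination of the per-depth predictions, and the combination weights are updated multiplicatively from each depth's observed loss. The PyTorch sketch below shows only this core mechanism (it omits, for instance, the per-depth weighting of backpropagated gradients), and the class and method names are illustrative.

```python
import torch

class HedgeNet(torch.nn.Module):
    """Per-depth output heads combined with hedge weights updated online."""

    def __init__(self, blocks, heads, beta=0.99):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        self.heads = torch.nn.ModuleList(heads)
        # One hedge weight per depth, initialized uniformly.
        self.register_buffer("alpha", torch.full((len(heads),), 1.0 / len(heads)))
        self.beta = beta

    def forward(self, x):
        outputs = []
        h = x
        for block, head in zip(self.blocks, self.heads):
            h = block(h)
            outputs.append(head(h))
        combined = sum(a * o for a, o in zip(self.alpha, outputs))
        return combined, outputs

    @torch.no_grad()
    def update_hedge(self, per_depth_losses):
        # Multiplicative update: depths with lower loss gain influence.
        losses = torch.stack(per_depth_losses)
        self.alpha *= self.beta ** losses
        self.alpha /= self.alpha.sum()
```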
Acknowledgements
The Inno4scale project has received funding from the European High Performance Computing Joint Undertaking (JU) under grant agreement No 101118139. The JU receives support from the European Union’s Horizon Europe Programme.
This work has been done at CERFACS in collaboration with researchers from FAU Erlangen:
• Harald Koestler, Professor, Software engineering and HPC expert
• Shubham Kavane, PhD student, AI expert