Disclaimer : This tutorial is done using the software AVBP, a Cerfacs software available for academic research by requesting a Non Disclosure Agreement . See https://avbp-wiki.cerfacs.fr/ for more information on the AVBP software.

In this tutorial, we will see how to use PhyDLL to perform an online learning of an AVBP simulation. With this approach, the Deep Learning (DL) model will be trained at the same time as the AVBP simulation goes on. Compared to the offline learning where the model is trained with all the data available, the online learning is useful for applications where it is impossible to store the solutions to train the model or when you want your model to adapt to recent trends (e.g. stock market price).

In order to be able to perform this tutorial, please do the steps 0. and 1. of this previous tutorial. For the step 0, you will need to use the PhyDLL branch online_learning_v0.1:

cd phydll
git checkout online_learning_v0.1

If you are too lazy to do the step 1. for AVBP (which I would completely understand !) or you do not want to be bothered too much with Fortran (which once again I would completely understand !), you can start back from the tuto_phydll branch:

cd avbp
git checkout tuto_phydll

1. Test case

As in the aforementioned tutorial, we will take as a test case the decaying Homogeneous Isotropic Turbulence (HIT) with the Smagorinsky LES model. But this time, we will try to fit a model on the total Turbulent Kinetic Energy (TKE) given by:

$$ k = \frac{1}{2}(\overline{u^{\prime 2}} + \overline{v^{\prime 2}} + \overline{w^{\prime 2}}) $$

whre \(k\) is the TKE, \(u^\prime\), \(v^\prime\) and \(w^\prime\) respectively represent the fluctuations of the velocity in the \(x\), \(y\) and \(z\) directions:

$$ u^\prime = u - \overline{u} $$
$$ v^\prime = v - \overline{v} $$
$$ w^\prime = w - \overline{w} $$

The \(\bar{.}\) represents the spatial mean of the quantity.

You can find the test case here. Download it, and set in RUN/run.params the simulation_end_iteration to 1000.

Run the case. If everything went well, you will have 1000 solutions in your SOLUT folder. We can use it to build the TKE with Python. Run in a jupyter notebook or in a Python program where you can have a graphical output:

import numpy as np
from matplotlib import pyplot as plt
import h5py
from pathlib import Path

# We build the turnover time.
L = 1.3900e-04 * 2
init_solut = "INIT/init_tetra.h5"

with h5py.File(init_solut, "r") as solut:
    rho = solut["GaseousPhase"]["rho"][:]
    u = solut["GaseousPhase"]["rhou"][:] / rho
    v = solut["GaseousPhase"]["rhov"][:] / rho
    w = solut["GaseousPhase"]["rhow"][:] / rho

    u_prime = u - np.mean(u)
    v_prime = v - np.mean(v)
    w_prime = w - np.mean(w)

    u_rms = np.sqrt(np.mean(u_prime**2 + v_prime**2 + w_prime**2) / 3)

t_turn = L / u_rms

# We build the TKE
folder = Path("RUN/SOLUT/")
h5_files = sorted(folder.glob("*.h5"))

for h5_file in h5_files[:]:
    # We exclude last_solution.h5 from the solutions found in SOLUT
    if "last_solution.h5" in h5_file.name:
        h5_files.remove(h5_file)

t_list = []
k_rms_list = []

for h5file in h5_files:
    with h5py.File(h5file, "r") as solut:
        t = solut["Parameters"]["dtsum"][:]
        rho = solut["GaseousPhase"]["rho"][:]
        u = solut["GaseousPhase"]["rhou"][:] / rho
        v = solut["GaseousPhase"]["rhov"][:] / rho
        w = solut["GaseousPhase"]["rhow"][:] / rho 

        test_u = solut["HITgas"]["var_u"]

        u_prime = u - np.mean(u)
        v_prime = v - np.mean(v)
        w_prime = w - np.mean(w)

        k = 0.5 * np.mean(u_prime**2 + v_prime**2 + w_prime**2)
        t_list.append(t/t_turn)
        k_rms_list.append(k)

# Plot
plt.figure() 
plt.plot(t_list, k_rms_list)
plt.ylabel("k")
plt.xlabel(r"t/${\tau}$")
plt.show()

Adapt the init_solut and the folder accordingly.

Normally, you should have something like this:

tke

\(t\) is the time and \(\tau\) represents the turnover time given by:

$$ \tau = \frac{L}{U} $$

where \(L\) is a characteristic length that we take here as the dimension of the domain and \(U\) a characteristic velocity that we define here as the root mean square of the velocity fluctuations:

$$ U = \sqrt{\frac{1}{3}(\overline{u^{\prime 2}} + \overline{v^{\prime 2}} + \overline{w^{\prime 2}})} $$

2. Offline learning

In this tutorial, we will compare the online learning approach with an overfitted DL model built through offline learning. Thus, we will first build a DL model with PyTorch.

Of course the output of our model will be the TKE ! The inputs will be \(u\), \(v\), \(w\) in the whole domain.

First, we need to build a custom dataset with PyTorch with the corresponding inputs/outputs pairs. You can do that with the following script:

import h5py
import numpy as np
import torch

from torch.utils.data import Dataset
from pathlib import Path

class CustomDatasetMLP(Dataset):
    def __init__(self, folder):

        # We find the .h5 in the folder and sort them.
        folder = Path(folder)
        solutions = sorted(folder.glob("*.h5"))

        # We delete the "last_solution.h5" and find the number of points.
        for solution in solutions[:]:
            if "last_solution.h5" in solution.name:
                with h5py.File(solution, "r") as solut:
                    N = len(solut["GaseousPhase"]["rho"][:])
                solutions.remove(solution)

        self.inputs = np.zeros((len(solutions), 3, N))
        self.outputs = np.zeros((len(solutions), 1))

        for i, solution in enumerate(solutions):
            with h5py.File(solution, "r") as solut:
                rho = solut["GaseousPhase"]["rho"][:]
                u = solut["GaseousPhase"]["rhou"][:] / rho
                v = solut["GaseousPhase"]["rhov"][:] / rho
                w = solut["GaseousPhase"]["rhow"][:] / rho

                u_prime = u - np.mean(u)
                v_prime = v - np.mean(v)
                w_prime = w - np.mean(w)

                k = 0.5 * np.mean(u_prime**2 + v_prime**2 + w_prime**2)

                self.inputs[i,0,:] = u / 100
                self.inputs[i,1,:] = v / 100
                self.inputs[i,2,:] = w / 100

                self.outputs[i,:] = k

        self.inputs = torch.from_numpy(self.inputs).float()
        self.outputs = torch.from_numpy(self.outputs).float()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        u = self.inputs[idx, 0, :]
        v = self.inputs[idx, 1, :]
        w = self.inputs[idx, 2, :]

        velocity = torch.cat([u, v, w], dim=0)
        return velocity, self.outputs[idx]

Note that the inputs have been divided by 100 to stabilize the DL training.

The CustomDatasetMLP object is initialized with your SOLUT folder path:

folder = "RUN/SOLUT"
dataset = CustomDatasetMLP(folder)

Like in the previous PhyDLL tutorial, we will use a classical Multi-Layers Perceptron (MLP):

from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        n_inputs = 107811
        n_outputs = 1
        n_neurones = 256
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(n_inputs, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_outputs),
        )

    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits

We can finally perform the training of this model:

import copy
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=256)

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

model = NeuralNetwork().to(device)
print(model)

learning_rate = 1e-3
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    running_loss = 0.0

    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

    epoch_loss = running_loss / size  # Moyenne de la perte
    return epoch_loss

def evaluate(dataloader, model, loss_fn):
    model.eval() 
    running_loss = 0.0
    size = len(dataloader.dataset)

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            loss = loss_fn(pred, y)
            running_loss += loss.item() * X.size(0)

    epoch_loss = running_loss / size
    return epoch_loss


epochs = 100
timely_loss_fcn = []
best_loss = float('inf')

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(dataloader, model, loss_fn, optimizer)
    epoch_loss = evaluate(dataloader, model, loss_fn)
    timely_loss_fcn.append(epoch_loss)
    print(f"Epoch {t+1} average loss: {epoch_loss:.6f}")

    if epoch_loss < best_loss:
        best_loss = epoch_loss
        best_state_dict = copy.deepcopy(model.state_dict())
print("Done!")

We can then plot the timely loss function:

from matplotlib import pyplot as plt

plt.figure()
plt.plot(np.arange(epochs), timely_loss_fcn, "b")
plt.xlabel("Epochs")
plt.ylabel("Loss function")
plt.yscale("log")
plt.show()

which should give you something like this:

loss_function

We can also compare with the real AVBP solutions. For that, you will need the t_turn, the t_list and k_rms_list given in Sec. 1.

Once you have it, you can run the following program to plot the TKE for both AVBP and your offline model:

y_offline = []

model.eval()

for i in range(0, len(dataset.inputs)):
    X, Y = dataset.__getitem__(i)
    X = X.to(device)
    with torch.no_grad():
        y = model(X)
    y = y.cpu().detach().numpy()
    y_offline.append(y)

plt.figure()
plt.plot(t_list, k_rms_list, label="AVBP")
plt.plot(t_list, y_offline, label="Offline learning")
plt.ylabel("k")
plt.xlabel(r"t/${\tau}$")
plt.legend()
plt.show()

which personnaly gave me:

Comparison between AVBP and the offline learning

Since it will be useful later, we can also plot the relative error between your model and the real AVBP simulations:

plt.figure()
plt.plot(t_list, np.abs(np.array(k_rms_list) - np.array(y_offline).ravel()) / np.array(k_rms_list), label="Offline")
plt.ylabel(r"$\epsilon$")
plt.xlabel(r"t/${\tau}$")
plt.legend()
plt.show()

which personnally gave me:

Relative error of the offline learning

Well, that is not so bad, as we have an error inferior to \(2.5\%\) everywhere ! Feel free to explore the batch_size, the learning rate and the number of epochs to check how reacts the model ! But for now, we will save the weights that we have:

torch.save(model.state_dict(), 'model_offline.pth')

3. Online learning

We can finally move on to the online learning part.

3.1 Implementation of the fields to send in AVBP

This step will be basically the same as the step 2.1 of the AVBP/PhyDLL tutorial but simpler ! Hallelujah ! (tough word to write, I had to google that to check the orthography)

If it is not already done, you will have to create a file named neuralnet_fields_to_send.f90 in SOURCES/MAIN and copy/paste the following code:

!--------------------------------------------------------------------------------------------------
!     Copyright (c) CERFACS (all rights reserved)
!--------------------------------------------------------------------------------------------------


!--------------------------------------------------------------------------------------------------
!     MODULE mod_neuralnet_fields_to_send
!>    @brief   Set the fields to send from AVBP to your Deep Learning model through PhyDLL
!!    @details Set the fields to send from AVBP to your Deep Learning model through PhyDLL
!!    @authors A. Larroque
!!    @date    2025-02-25
!--------------------------------------------------------------------------------------------------
MODULE mod_neuralnet_fields_to_send

  USE mod_param_defs

  CONTAINS

    SUBROUTINE neuralnet_set_fields_to_send ( grid )

    USE mod_grid,                    ONLY: grid_t
    USE mod_timers,                  ONLY: timer_start, timer_stop
    USE mod_input_param_defs,        ONLY: itimers
    USE mod_misc_defs,               ONLY: avbpdl_timer_send_pre
#ifdef PHYDLL
    USE phydll,                      ONLY: phydll_set_phy_field, phydll_send_phy_fields
    USE mod_parallel_defs,           ONLY: enable_phydll
#endif

    IMPLICIT NONE

!   IN/OUT
    TYPE(grid_t) :: grid

    CALL timer_start ( avbpdl_timer_send_pre, itimers )

#ifdef PHYDLL
    IF (enable_phydll == YES_) THEN
      CALL phydll_set_phy_field(field=grid%data_n%w(1,:), label="rho", index=1)
      CALL phydll_set_phy_field(field=grid%data_n%w(2,:), label="rhou", index=2)
      CALL phydll_set_phy_field(field=grid%data_n%w(3,:), label="rhov", index=3)
      CALL phydll_set_phy_field(field=grid%data_n%w(4,:), label="rhow", index=4)
      CALL phydll_set_phy_field(field=grid%data_n%x(1,:), label="x", index=5)
      CALL phydll_set_phy_field(field=grid%data_n%x(2,:), label="y", index=6)
      CALL phydll_set_phy_field(field=grid%data_n%x(3,:), label="z", index=7)
      CALL phydll_send_phy_fields(loop=.true., ol=.true.)
    END IF
#endif

    CALL timer_stop ( avbpdl_timer_send_pre, itimers )

  END SUBROUTINE neuralnet_set_fields_to_send

END MODULE mod_neuralnet_fields_to_send

Note that the only difference with the other tutorial (except that this function is simpler since we do not compute the gradients) is the ol=.true. in phydll_send_phy_fields. This is to specify that we are in a online learning framework so that AVBP will not need to wait later to receive fields from the Python program.

Note that the first fields sent are the density \(\rho\) and the conservative variables of Navier Stokes equations: \(\rho u\), \(\rho v\) and \(\rho w\). We will then compute again the velocity fields with these four variables. The last fields are the coordinates of the points \(x\), \(y\), and \(z\). This will be needed later as the fields sent through PhyDLL are not stored in the same order as it is in the mesh mesh.h5. So to compare our model, with the data issued from the offline learning, we will need the coordinates to reorder the inputs of the model.

Once again, if it is not already done, open SOURCES/MAIN/compute.f90, replace at the beginning of the subroutine compute:

USE mod_efcy_neuralnet,                        ONLY: efcy_neuralnet_set_fields_to_send

for

USE mod_neuralnet_fields_to_send,              ONLY: neuralnet_set_fields_to_send

and change at the beginning the paragraph:

! ----------------------------------------------------------------------
! Sends species-based progress variable to Python (AVBP-DL)
! ----------------------------------------------------------------------
    IF ( kstage==1 .AND. enable_phydll==YES_ ) THEN
      IF ( using_acc ) THEN
        message = "kstage==1 .AND. enable_phydll==YES_ not supported by with openACC"
        CALL print_error_and_quit ( err_routine=routine, err_message=message )
      END IF
      CALL efcy_neuralnet_set_fields_to_send ( grid )
      CALL send_coupling ( coupling(avbpdl) )
    END IF

for

! ----------------------------------------------------------------------
! Sends species-based progress variable to Python (PhyDLL)
! ----------------------------------------------------------------------
    IF ( kstage==1 .AND. enable_phydll==YES_ ) THEN
      IF ( using_acc ) THEN
        message = "kstage==1 .AND. enable_phydll==YES_ not supported by with openACC"
        CALL print_error_and_quit ( err_routine=routine, err_message=message )
      END IF
      CALL neuralnet_set_fields_to_send ( grid )
    END IF

That’s it, your fields are ready to be sent from AVBP to your DL model for your online learning.

3.2 The Python program

Copy your RUN folder to RUN_DL and enter in the new folder created. Open the run.params and set the save_solution to no.

Now, open a file named phydll.yml and copy/paste the following:

Output:
    debug_level : 3

Coupling:
  coupling_frequency : 1
  save_fields_frequency : 1        # (optional, default=0) Frequency to save exchanged fields in ./cwipi

Finally, we can create the python program in charge to train the model on-the-fly. Open a file named main.py and copy/paste the following:

import numpy as np
from phydll.phydll import PhyDLL
import h5py
import numpy as np
import torch

from torch.utils.data import Dataset, DataLoader
from torch import nn
import copy
from pathlib import Path

class CustomDatasetMLP(Dataset):
    def __init__(self, u_list, v_list, w_list):
        u = np.array(u_list)
        v = np.array(v_list)
        w = np.array(w_list)

        u_prime = u - np.mean(u, axis=1).reshape(-1,1)
        v_prime = v - np.mean(v, axis=1).reshape(-1,1)
        w_prime = w - np.mean(w, axis=1).reshape(-1,1)

        k = 0.5 * np.mean(u_prime**2 + v_prime**2 + w_prime**2, axis=1)

        self.inputs = np.zeros((u.shape[0], 3, u.shape[1]))
        self.outputs = np.zeros((u.shape[0], 1))

        self.inputs[:,0,:] = u / 100
        self.inputs[:,1,:] = v / 100
        self.inputs[:,2,:] = w / 100
        self.outputs[:,:] = k.reshape(-1,1)

        self.inputs = torch.from_numpy(self.inputs).float()
        self.outputs = torch.from_numpy(self.outputs).float()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        u = self.inputs[idx, 0, :]
        v = self.inputs[idx, 1, :]
        w = self.inputs[idx, 2, :]

        velocity = torch.cat([u, v, w], dim=0)
        return velocity, self.outputs[idx]


class NeuralNetwork(nn.Module):
    def __init__(self):
        n_inputs = 107811
        n_outputs = 1
        n_neurones = 256
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(n_inputs, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_neurones),
            nn.ReLU(),
            nn.Linear(n_neurones, n_outputs),
        )

    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

print(f"Using {device} device")

model = NeuralNetwork().to(device)
print(model)

learning_rate = 1e-3
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    running_loss = 0.0

    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def evaluate(dataloader, model, loss_fn):
    model.eval()
    running_loss = 0.0
    size = len(dataloader.dataset)

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            loss = loss_fn(pred, y)
            running_loss += loss.item() * X.size(0)

    epoch_loss = running_loss / size
    return epoch_loss


def find_argmin_dataset(dataset, model):
    model.eval()
    u = dataset.inputs[:, 0, :]
    v = dataset.inputs[:, 1, :]
    w = dataset.inputs[:, 2, :]

    X = torch.cat([u,v,w], dim=1)
    X = X.to(device)

    with torch.no_grad():
        Y = model(X)

        error = torch.abs(Y - dataset.outputs.to(device))

    idx_min = torch.argmin(error)

    return idx_min


def main():
    """
    Main coupling routine
    """
    phydll = PhyDLL(coupling_scheme="DS",
                    mesh_type="phymesh",
                    phy_nfields=7,
                    dl_nfields=0)
    epochs = 1
    best_loss = float('inf')
    c = 0
    timely_loss_fcn = []
    u_list = []
    v_list = []
    w_list = []
    n_samples = 1
    batch_size = n_samples

    # Temporal
    phydll.mpienv.glcomm.Barrier()
    use_temporal_loop = False
    if use_temporal_loop:
        phydll.set_predict_func(your_predict_fct)
        phydll.temporal()
    else:
        while phydll.fsignal:
            phy_fields = phydll.receive_phy_fields()
            rho = phy_fields[0,:]
            u_list.append(phy_fields[1,:] / rho)
            v_list.append(phy_fields[2,:] / rho)
            w_list.append(phy_fields[3,:] / rho)

            if c == 0:
                x = phy_fields[4,:]
                y = phy_fields[5,:]
                z = phy_fields[6,:]
                coord = np.zeros((len(x), 3))
                coord[:,0] = x
                coord[:,1] = y
                coord[:,2] = z
                np.save("coordinates", coord)

            dataset = CustomDatasetMLP(u_list, v_list, w_list)

            if len(u_list) > n_samples:
                idx_min = find_argmin_dataset(dataset, model)
                u_list.pop(idx_min)
                v_list.pop(idx_min)
                w_list.pop(idx_min)
                dataset.inputs = torch.cat([dataset.inputs[:idx_min,:,:], dataset.inputs[idx_min+1:,:,:]], dim=0)
                dataset.outputs = torch.cat([dataset.outputs[:idx_min,:], dataset.outputs[idx_min+1:,:]], dim=0)

            dataloader = DataLoader(dataset, batch_size=batch_size)

            for t in range(epochs):
                print(f"Epoch {t+1}\n-------------------------------")
                train(dataloader, model, loss_fn, optimizer)
                epoch_loss = evaluate(dataloader, model, loss_fn)
                timely_loss_fcn.append(epoch_loss)

                if epoch_loss < best_loss:
                    best_loss = epoch_loss
                    torch.save(model.state_dict(), f"model_online_{n_samples}.pth")
                    print(f"MIN FOUND: {best_loss} at iteration : {c}")

            c += 1

    np.save(f"timely_loss_online_{n_samples}", np.array(timely_loss_fcn))

    # Finalize
    phydll.finalize()

if __name__ == "__main__":
    main()

Okay, that is a lot to take ! Take a deep breath to release the anxiety, and let me break it for you !

The first thing that you may notice is the CustomDatasetMLP is slightly different than the one that we used for the offline learning. The reason is that we here, we do not use the .h5 files but rather the fields directly sent by AVBP in an array.

As in the offline learning, we find the train function to train our model, the evaluate function to evaluate the loss and later keep the weights that minimize this loss, and finally the find_argmin_dataset function that you may wonder what it is for.

Let me explain it for you ! For this tutorial, we will use what we call a reservoir: we will store some solutions in the time in order to train our model. The size of this reservoir is determined by n_samples in the program. Thus, this find_argmin_dataset function will be in charge to determine which sample we must delete in the dataset when the size of the reservoir is exceeded with the new fields received. In order to determine which sample, we delete, we will use the Maximum Absolute Error (MAE):

$$ MAE = \left| \hat{Y} - Y \right| $$

where \(\hat{Y}\) is the prediction on the outputs of the model and \(Y\) the true outputs. The sample with the smallest MAE will be deleted from the dataset.

The rest of the program is classic as in other PhyDLL applications, except that instead of sending DL fields, we train our model in the main loop. For this reason, we set dl_fields = 0 in the PhyDLL object as we do not send DL fields to AVBP and there are no phydll.send_dl_fields().

Also, note that in the main loop, we keep the coordinates in coord and save them in a numpy format in order to rebuild later the model with the h5 files. Also, even if it will not be shown in this tutorial, the evolution of the loss function is also saved in a numpy format under the name timely_loss_function_{n_samples}.npy. Feel free to later plot them !

Okay, we hare ready to submit the job !

3.3 Run your job

In your RUN_DL folder, open a file named batch_job and copy/paste the following:

#!/bin/bash
#SBATCH --job-name=first_run
#SBATCH --partition=t4
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=00:10:00
#SBATCH --output=output_test.o
#SBATCH --error=error_test.e
#SBATCH --exclusive

# LOAD MODULES ##########
module list
#########################
exec="$AVBP_HOME/HOST/KRAKEN/BIN/AVBP_V7_dev.KRAKEN"

# NUMBER OF TASKS #######
export PHY_TASKS_PER_NODE=31
export DL_TASKS_PER_NODE=1
export TASKS_PER_NODE=$(($PHY_TASKS_PER_NODE + $DL_TASKS_PER_NODE))
export NP_PHY=$(($SLURM_NNODES * $PHY_TASKS_PER_NODE))
export NP_DL=$(($SLURM_NNODES * $DL_TASKS_PER_NODE))
#########################

# EXTRA COMMANDS ########
source /scratch/coop/larroque/python_venvs/phydll_venv/bin/activate
#########################

# ENABLE PHYDLL #########
export ENABLE_PHYDLL=TRUE
#########################

# PLACEMENT FILE ########
python /scratch/coop/larroque/softwares/GITLAB/PHYDLL/scripts/placement4mpmd.py --Run impi --NpPHY $NP_PHY --NpDL $NP_DL
#########################

# MPMD EXECUTION ########
mpirun -l --machinefile machinefile_$SLURM_NNODES-$NP_PHY-$NP_DL -np $NP_PHY $exec : -np $NP_DL python main.py
#########################

This script is an example to run on Kraken. Adapt it to your machine. Note that you can also use the script generator given by PhyDLL. Also, do not forget to change the paths mentioned in the script accordingly !

4. Comparison offline/online

In this tutorial, I ran the simulation for n_samples = 1, 5, 10, 20, 50, 100.

We can now exploit the results, the first step is to create a mapping between the coordinates of the .h5 files and the coordinates sent by PhyDLL:

import h5py
import numpy as np

# We have to shuffle the inputs so the coordinates sent by PhyDLL and in the .h5 are the sames
coord_phydll = np.load("RUN_DL_tuto/coordinates.npy")

with h5py.File("MESH/mesh_tetra.mesh.h5", "r") as mesh:
    x = mesh["Coordinates"]["x"][:]
    y = mesh["Coordinates"]["y"][:]
    z = mesh["Coordinates"]["z"][:]

coord_h5 = np.zeros((len(x), 3))
coord_h5[:,0] = x
coord_h5[:,1] = y
coord_h5[:,2] = z

x1_view = [tuple(row) for row in coord_phydll]
x2_view = [tuple(row) for row in coord_h5]


x2_dict = {val: i for i, val in enumerate(x2_view)}

idx = np.array([x2_dict[row] for row in x1_view])

Using the dataset and model created in Sec. 2, we can then reorder the coordinates in the .h5 files in order to create predictions with our different models:

n_samples_list = [1, 5, 10, 20, 50, 100]
y_online_list = []

for n_samples in n_samples_list:
    y_online = []

    model = NeuralNetwork()
    model.load_state_dict(torch.load(f"RUN_DL_tuto/model_online_{n_samples}.pth", weights_only=True, map_location=device))
    model.to(device)
    model.eval()

    for i in range(0, len(dataset)):
        inputs, Y = dataset.inputs[i], dataset.outputs[i]
        u = inputs[0, :]
        v = inputs[1, :]
        w = inputs[2, :]

        u = u[idx]
        v = v[idx]
        w = w[idx]

        X = torch.cat([u, v, w], dim=0)
        X = X.to(device)
        with torch.no_grad():
            y = model(X)
        y = y.cpu().detach().numpy()
        y_online.append(y)

    y_online_list.append(y_online)

We can then compare with the online learning. If you do not have it loaded, you can use the following program:

y_offline = []

model = NeuralNetwork()
model.load_state_dict(torch.load('model_offline.pth', weights_only=True, map_location=device))
model.to(device)
model.eval()

for i in range(0, len(dataset.inputs)):
    X, Y = dataset.__getitem__(i)
    X = X.to(device)
    with torch.no_grad():
        y = model(X)
    y = y.cpu().detach().numpy()
    y_offline.append(y)

We are now ready to plot the relative error for each model:

plt.figure()
plt.plot(t_list, np.abs(np.array(k_rms_list) - np.array(y_offline).ravel()) / np.array(k_rms_list), label="Offline")

for i, n_samples in enumerate(n_samples_list):
    plt.plot(t_list, np.abs(np.array(k_rms_list) - np.array(y_online_list[i]).ravel()) / np.array(k_rms_list), label=f"Online learning ({n_samples})")
plt.ylabel(r"$\epsilon$")
plt.xlabel(r"t/${\tau}$")
plt.legend()
plt.show()

which personnally gave me:

Comparison relative error of the offline/online learning

As we can see, with 1, 5, or 10 samples in our reservoir, large errors can be found at one extremity (or even both) of the curve. After investigation with the timely loss functions that we kept, it seems to be linked to how we saved the weights of the model. Indeed, we kept the weights that minimized the loss function during the whole simulation. So for 1 and 10 samples, the weights were kept at approximately half the simulation whereas for 5 samples, they were kept around the end of the simulation.

Anyways, we see that the accuracy of the model globally increases as we keep more samples in our reservoir and find pretty good results with 100 samples in our reservoir ! We can even compare the difference between the error found with the offline model and the one got through online learning with 100 samples:

plt.figure()
e_offline = np.abs(np.array(k_rms_list) - np.array(y_offline).ravel()) / np.array(k_rms_list)
e_online = np.abs(np.array(k_rms_list) - np.array(y_online_list[-1]).ravel()) / np.array(k_rms_list)
plt.plot(t_list, e_offline - e_online , "r")
plt.axhline(y=0, color='black', linestyle='--', linewidth=1)
plt.ylabel(r"$\epsilon_{offline} - \epsilon_{online}$")
plt.xlabel(r"t/${\tau}$")
plt.show()

which personally gave me:

Comparison error offline and online learning with 100 samples

The online learning gave generally better results than the offline learning ! One hypothesis worth investigating to explain that would be that the online learning had in the end more “total epochs” since it was optimized each time it received fields. So it was more efficient to perform more optimization iterations than using a larger dataset for the model, especially due to the simple expression linking the velocities and the TKE.

But I leave that to you to check the veracity of this hypothesis and to investigate how reacts the online learning when you change the way you store the weights, the samples in the reservoir and parameters such as epochs, learning_rate and batch_size ;)

5. Final notes

With the online_learning_0.1 branch of PhyDLL, AVBP is supposed to run asynchronously of the training of your DL model. The fields will be sent in a buffer and the DL model will use them when it will be ready.

However, it may happen that your simulation runs much faster than your DL model and the buffer will be full. In that case, AVBP will wait for storage in the buffer to be available before going on the next iteration, hence slowing down the AVBP simulation.

If that’s a problem for you, you can increase the coupling_frequency of your phydll.yml to send less regularly fields to your DL model.

You can also design more elaborated stragies on the python side such as waiting to receive a certain number of samples before training your model.

A final possibility is to use the phydll_check_ready_signal_dl and phydll_recv_ready_signal_dl of the Fortran interface and the phydll_send_ready_signal on the Python side. phydll_recv_ready_signal_dl will wait in a non-blocking way the signal that the DL model is ready to receive fields, phydll_check_ready_signal_dl will check for this signal, and phydll_send_ready_signal will send the signal from the Python side that the DL model is now ready to receive fields. You can check a minimal example of this approach here. Note however that this approach may be conflictual with the coupling_iteration of the phydll.yml.

That’s it, you can go back to a normal activity !

Like this post? Share on: TwitterFacebookEmail


Anthony Larroque is a postdoctoral researcher in computer science engineering with a focus on CFD related topics.

Keep Reading


Published

Category

Tutorials

Tags

Stay in Touch