Disclaimer : under construction

Tags :

  • open_source: pypi, gitlab.com, readthedocs
  • tool: Helper to setup a secured chain of jobs on a Job Scheduler
  • CFD: Useful for all CFD runs that need to be submitted repeatedly until a criterion is met, followed by a post-processing step
  • Python package: available on PyPI
  • Excellerat: Related to task WP7.2 of the initial project: making chained jobs easier for production runs by engineers

Lemmings: a tool to chain runs

Chained run with Lemmings for a 3D case

For real industrial 3D computations with turbulent flames, meshes can reach several million cells. With a time limit of around 12 hours per run on most machines, chaining runs automatically can be a great time-saver.

In a chained run, the last physical solution reached by one run is used as the initialization for the next run.

The only constraint is to have a rough idea of the final physical time you want to reach. This time has to be chosen wisely: we don't want to waste Central Processing Unit (CPU) hours, but we want to reach a point where we get, for example, a closed term for heat release.
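
The chaining idea can be illustrated with a short Python sketch. It is not Lemmings code: the function name plan_chained_runs and the assumption that every run advances the solution by the same physical time are hypothetical, chosen only to show how each run restarts from where the previous one stopped until the target time is reached.

```python
def plan_chained_runs(end_time, time_per_run, max_runs=100):
    """Return the (start, stop) physical-time window of each chained run.

    Hypothetical helper: assumes every run advances the solution by the
    same physical time, which a real solver does not guarantee.
    """
    if time_per_run <= 0:
        raise ValueError("time_per_run must be positive")
    windows = []
    current = 0.0
    while current < end_time and len(windows) < max_runs:
        stop = min(current + time_per_run, end_time)
        windows.append((current, stop))
        # The last solution of this run initializes the next one.
        current = stop
    return windows


# Reaching 11 ms when each run covers about 3 ms of physical time.
for index, (start, stop) in enumerate(plan_chained_runs(11.0e-3, 3.0e-3), 1):
    print(f"run {index}: {start * 1e3:.1f} ms -> {stop * 1e3:.1f} ms")
```

The max_runs guard plays the role of a safety net: a chain should always have an upper bound so that a badly chosen target cannot consume CPU hours indefinitely.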

Lemmings installation

Cloning the remote Git repository

  • First, make sure you have been granted access to Lemmings on Nitrox, then clone the repository:

git clone git@nitrox.cerfacs.fr:open-source/lemmings.git

A Git repository named lemmings is created at the location where you cloned it.

  • Create a virtual environment with Python 3 (check https://docs.python.org/3/tutorial/venv.html for help).

  • Go into the lemmings Git repository and, with your virtual environment activated, install lemmings by typing: python setup.py develop

Depending on the machine you run on, you need a YAML file specific to that machine. Check that a file corresponding to your machine exists in the folder 'lemmings/src/lemmings/chain/machine_template', then define the LEMMINGS_MACHINE variable by adding the following line to your ~/.bashrc_priv:

export LEMMINGS_MACHINE='path/machine.yml'
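
Before launching anything, it can help to check that the variable is really set and points to an existing file. The sketch below is a stand-alone sanity check, not part of Lemmings; the function name resolve_machine_file is an assumption made for illustration.

```python
import os
from pathlib import Path


def resolve_machine_file(environ=None):
    """Return the path stored in LEMMINGS_MACHINE, or raise if it is unusable."""
    environ = os.environ if environ is None else environ
    value = environ.get("LEMMINGS_MACHINE")
    if not value:
        raise RuntimeError("LEMMINGS_MACHINE is not set in this shell")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"LEMMINGS_MACHINE points to a missing file: {path}")
    return path


try:
    print("machine file:", resolve_machine_file())
except (RuntimeError, FileNotFoundError) as err:
    print("configuration problem:", err)
```

Remember that a variable added to a shell startup file is only visible in shells opened after the change.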

The main things specified in this machine.yml file are the queues that your chained runs are going to be launched on.

You also need to know which command your machine's job scheduler uses to submit runs, usually one of:

qsub 
sbatch

For more info on how to submit a job on the machine you’re using, check this link: http://intranet.cerfacs.fr/intranet_equipes/csg/.

A typical machine.yml file with only two defined queues will look like this:

commands:
  submit: sbatch
  get_cpu_time: sacct -j -LEMMING-JOBID- --format=Elapsed -n
  dependency: "--dependency=afterany:"
queues:
  prod:
    wall_time: '00:15:00'
    core_nb: 540
    header: |
            #!/bin/bash
            #SBATCH --partition prod
            #SBATCH --nodes=15
            #SBATCH --ntasks-per-node=36
            #SBATCH --job-name -LEMMING-JOB_NAME-
            #SBATCH --time=-LEMMING-WALL-TIME-

            -EXEC-
  prod_pj:
    wall_time: '00:02:00'
    core_nb: 2
    header: |
            #!/bin/bash
            #SBATCH --partition prod
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=2
            #SBATCH --job-name -LEMMING-POSTJOB_NAME-
            #SBATCH --time=-LEMMING-WALL-TIME-
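
The -LEMMING-...- and -EXEC- tokens in the headers are placeholders that get filled in for each job. How Lemmings does this internally is not shown here; the sketch below simply assumes plain string replacement, and the job name and command used to fill the template are hypothetical.

```python
HEADER_TEMPLATE = """#!/bin/bash
#SBATCH --partition prod
#SBATCH --nodes=15
#SBATCH --ntasks-per-node=36
#SBATCH --job-name -LEMMING-JOB_NAME-
#SBATCH --time=-LEMMING-WALL-TIME-

-EXEC-
"""


def fill_header(template, values):
    """Replace each placeholder key of `values` by its value in `template`."""
    filled = template
    for placeholder, value in values.items():
        filled = filled.replace(placeholder, value)
    return filled


script = fill_header(
    HEADER_TEMPLATE,
    {
        "-LEMMING-JOB_NAME-": "my_run",          # hypothetical job name
        "-LEMMING-WALL-TIME-": "00:15:00",       # wall_time of the prod queue
        "-EXEC-": "echo 'solver command here'",  # hypothetical stand-in command
    },
)
print(script)
```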

Traps to avoid here:

  • core_nb has to be consistent with the header. In the example above, core_nb is the product of --nodes and --ntasks-per-node (15 x 36 = 540 for the prod queue).
  • Only two queues are described here. Be aware that you might not be allowed to run on all queues of a machine. The difficulty I encountered was that I was using the prod queue for the long run and the debug queue for the short run that just prepares the next one; to respect priorities, you should run both on the same queue.
  • The other classic trap is the wall_time parameter. YAML interprets an unquoted time such as 12:00:00 as an integer rather than a string, so you have to wrap your time in quotes to avoid this issue. For example:

wall_time: '12:00:00'

Otherwise, you will get this kind of error:

if ":" in self.machine.job_template.wall_time:
TypeError: argument of type 'int' is not iterable
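
The error comes from testing whether ":" is contained in a value that is no longer a string. Assuming the file is read by a YAML 1.1 parser such as PyYAML, an unquoted 12:00:00 is read as the base-60 integer 43200. The helper below is a hypothetical illustration, not Lemmings code: it shows a conversion that rejects such a value with a clearer message.

```python
def wall_time_to_seconds(wall_time):
    """Convert an 'HH:MM:SS' string to seconds, rejecting non-string input."""
    if not isinstance(wall_time, str):
        raise TypeError(
            f"wall_time must be a quoted string such as '12:00:00', got {wall_time!r}"
        )
    hours, minutes, seconds = (int(part) for part in wall_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds


print(wall_time_to_seconds("12:00:00"))

try:
    # 43200 is what a YAML 1.1 parser yields for an unquoted 12:00:00.
    wall_time_to_seconds(43200)
except TypeError as err:
    print("rejected:", err)
```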

Preparing the run

You have been given a standard AVBP folder with a run.params file, as described here (https://www.cerfacs.fr/avbp7x/HELP/avbphelp.php).

Addressing the problem of time

The only thing you have to change in your run.params file is the time in the first block: specify the physical time needed to reach the state you are interested in. For example, in the MASRI computation, we need to reach about 11 ms. This time limit has to be specified in the RUN CONTROL block of the run.params file:

simulation_end_time = 11.0D-03

AVBP can work with either simulation_end_iteration or simulation_end_time, but lemmings only understands simulation_end_time, so put simulation_end_time in your run.params even if you are used to simulation_end_iteration. Check here (https://www.cerfacs.fr/avbp7x/HELP/keywords_list.php) for more details.

Be sure to use the AVBP number format, for example 3.0D-03 if you want 3 ms. This time is the physical time of the whole computation, not of each chained run.

If you leave simulation_end_iteration, you will get this error message:

UnboundLocalError: local variable 'condition_tgt' referenced before assignment
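
A quick pre-flight check of run.params can catch this before any job is submitted. The sketch below is a hypothetical helper, not part of Lemmings or AVBP; it assumes the keyword appears on its own line as "simulation_end_time = value", as in the example above, and converts the D exponent notation to a Python float.

```python
import re


def parse_avbp_time(text):
    """Convert a Fortran-style double such as '11.0D-03' to a Python float."""
    return float(text.strip().upper().replace("D", "E"))


def read_end_time(run_params_text):
    """Return simulation_end_time in seconds, or raise if it is missing."""
    match = re.search(
        r"^\s*simulation_end_time\s*=\s*(\S+)", run_params_text, re.MULTILINE
    )
    if match is None:
        if "simulation_end_iteration" in run_params_text:
            raise ValueError(
                "replace simulation_end_iteration by simulation_end_time"
            )
        raise ValueError("simulation_end_time not found")
    return parse_avbp_time(match.group(1))


example = "simulation_end_time = 11.0D-03\n"
print(read_end_time(example))
```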

The batch file for lemmings

Then prepare the avbp_recursif file. This is basically the usual batch file you use when submitting a classic single run with the command sbatch or qsub.

How do I define the CPU limit?

Remember the physical time discussed above, about 11 ms in our case. Reaching it may take several chained runs, and each run consumes CPU hours. In the machine.yml file, we defined the number of nodes and the number of cores per node: 15 nodes and 36 cores per node in our case.

CPU time = wall-clock time x number of nodes x number of cores per node

With these values, one run of 12 wall-clock hours corresponds to 12 x 15 x 36 = 6 480 CPU hours. In our case, we asked for a CPU limit of 52 000 hours, which covers about eight such runs.
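
The formula is easy to check with a few lines of Python. The sketch below applies it to the prod queue of the machine.yml shown earlier (15 nodes, 36 tasks per node) for a 12-hour run; the function name cpu_hours is an assumption made for illustration.

```python
def cpu_hours(wall_clock_hours, nodes, cores_per_node):
    """CPU time = wall-clock time x number of nodes x cores per node."""
    return wall_clock_hours * nodes * cores_per_node


per_run = cpu_hours(12, 15, 36)
print(f"one 12-hour run: {per_run} CPU hours")

# Number of full 12-hour runs that fit in a given CPU limit.
cpu_limit = 52_000
print(f"runs covered by {cpu_limit} CPU hours: {cpu_limit // per_run}")
```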

Launching the run

Launch the first run from within your virtual environment with the command:

lemmings run avbp_recursif

Be sure not to launch

lemmings run avbp_recursif.yml

If you use auto-completion (Tab in your terminal window), this trap is easy to fall into.

A confirmation prompt for the number of CPU hours then appears on the screen.

If the value is what you intended, confirm by entering the CPU time in hours.

Then the runs are launched. Notice that two runs are submitted, the second one with a dependency on the first.
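
The submit and dependency entries of the machine.yml suggest how the two submissions fit together. The sketch below only builds the command lines, without executing them; the script names batch_job and batch_pj and the job ID 1234 are hypothetical, and the way Lemmings assembles its own commands may differ.

```python
def build_submit_command(submit, script, dependency=None, after_job_id=None):
    """Build a scheduler command, optionally chained to a previous job ID."""
    command = [submit]
    if dependency is not None and after_job_id is not None:
        # e.g. "--dependency=afterany:" + "1234"
        command.append(f"{dependency}{after_job_id}")
    command.append(script)
    return command


first = build_submit_command("sbatch", "batch_job")
second = build_submit_command(
    "sbatch", "batch_pj", dependency="--dependency=afterany:", after_job_id="1234"
)
print(" ".join(first))
print(" ".join(second))
```

With Slurm's afterany dependency, the second job starts once the first one has terminated, whatever its exit status.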

When the next run is launched, we can see in the run.params file that the initialization file has automatically been replaced by the last solution reached by the previous run.

You will find a folder named after the run that has been launched, containing the run.params that was used for each run.

Then, in the TIME folder, note that each run gets a different folder, clearly named with a reference to the iteration number at which it started, which helps keep track of the history.

This work has been supported by the EXCELLERAT project which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 823691.


Elsa Gullaud: after a PhD in acoustics, she is now doing a postdoc in data science at CERFACS.
