Disclaimer: under construction
- open_source: pypi, gitlab.com, readthedocs
- tool: Helper to set up a secured chain of jobs on a Job Scheduler
- CFD: Useful for all CFD runs that need to be submitted repeatedly until a criterion is met, followed by a post-processing step
- Python package: available on PyPI
- Excellerat: Related to task WP7.2 of the initial project: making chained jobs easier for production runs by engineers
For real industrial 3D computations with turbulent flames, meshes can reach several million cells. With a time limit of around 12 hours per run on most machines, chaining runs automatically can be a great time-saver.
In a chained run, the last physical solution reached by one run is used as the initialization for the next run.
The only constraint is to have a rough idea of the final physical time you want to reach. This time has to be chosen wisely: we don't want to waste Central Processing Unit (CPU) hours, but we want to reach a point where, for example, we get a closed term for heat release.
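The chaining idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how lemmings is implemented: the function name and the physical time covered by each run (4 ms) are hypothetical, and in practice each run stops when its wall time is exhausted rather than after a fixed physical time.

```python
def chain_runs(target_time, time_per_run, start_time=0.0):
    """Return the (start, end) physical-time window of each chained run.

    target_time  -- final physical time to reach, in seconds
    time_per_run -- physical time one run is assumed to cover (hypothetical)
    """
    if time_per_run <= 0:
        raise ValueError("time_per_run must be positive")
    windows = []
    current = start_time
    while current < target_time:
        end = min(current + time_per_run, target_time)
        windows.append((current, end))
        # The last solution of this run initializes the next one.
        current = end
    return windows


# Reaching 11 ms when each run is assumed to cover about 4 ms.
for start, end in chain_runs(11.0e-3, 4.0e-3):
    print(f"run from {start:.3f} s to {end:.3f} s")
```

With these assumed values, three runs are needed, and the last one stops exactly at the target time.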
- First, make sure you have been granted access to Lemmings on Nitrox, then clone the repository:
git clone firstname.lastname@example.org:open-source/lemmings.git
A Git repository named lemmings is created at the location where you cloned it.
Create a virtual environment with Python 3 (check https://docs.python.org/3/tutorial/venv.html for help).
Go into the lemmings Git repository and, with your virtual environment activated, install lemmings by typing:
python setup.py develop
Depending on the machine you run on, you need a YAML file specific to that machine. Check that a file corresponding to your machine is in the folder 'lemmings/src/lemmings/chain/machine_template', then define the LEMMINGS_MACHINE variable by adding to your ~/.bashrc_priv:
The main things specified in this machine.yml file are the queues that your chained runs are going to be launched on.
You should also know which batch command is used to submit runs on your machine, usually sbatch or qsub.
For more info on how to submit a job on the machine you’re using, check this link: http://intranet.cerfacs.fr/intranet_equipes/csg/.
A typical machine.yml file with only two defined queues will look like this:
commands:
  submit: sbatch
  get_cpu_time: sacct -j -LEMMING-JOBID- --format=Elapsed -n
  dependency: "--dependency=afterany:"
queues:
  prod:
    wall_time: '00:15:00'
    core_nb: 540
    header: |
      #!/bin/bash
      #SBATCH --partition prod
      #SBATCH --nodes=15
      #SBATCH --ntasks-per-node=36
      #SBATCH --job-name -LEMMING-JOB_NAME-
      #SBATCH --time=-LEMMING-WALL-TIME-
      -EXEC-
  prod_pj:
    wall_time: 00:02:00
    core_nb: 2
    header: |
      #!/bin/bash
      #SBATCH --partition prod
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node=2
      #SBATCH --job-name -LEMMING-POSTJOB_NAME-
      #SBATCH --time=-LEMMING-WALL-TIME-
Traps to avoid here:
- core_nb has to match the #SBATCH header (--nodes multiplied by --ntasks-per-node; in the example, 15 x 36 = 540).
- Only two queues are described here. Be aware that you might not be allowed to run on all queues of a machine. The difficulty I encountered was that I used the prod queue for the long run and the debug queue for the short run that just prepares the next one; to respect priorities, both should run on the same queue.
- The other classic trap is the wall_time parameter: the YAML parser can read an unquoted time starting with 1 (for example 12:00:00) as an integer rather than a string. Wrap your time in single quotes to avoid this issue, as in wall_time: '00:15:00' for the prod queue.
If not, you will get this kind of error:
if ":" in self.machine.job_template.wall_time:
TypeError: argument of type 'int' is not iterable
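Both traps can be caught before submitting anything. The sketch below is not part of lemmings: check_queue is a hypothetical helper that takes one queue definition as a plain dictionary (as it would look once the YAML file is loaded) and assumes that core_nb is meant to equal nodes x tasks per node, as in the example. The value 43200 is what a YAML 1.1 loader typically returns for an unquoted 12:00:00, read as a base-60 integer.

```python
import re


def check_queue(queue):
    """Return a list of problems found in one queue definition (a dict)."""
    problems = []

    # Trap 1: an unquoted wall_time may have been loaded as an integer.
    wall_time = queue.get("wall_time")
    if not isinstance(wall_time, str):
        problems.append(
            f"wall_time is {type(wall_time).__name__}, "
            "expected a quoted 'HH:MM:SS' string"
        )
    elif not re.fullmatch(r"\d{2}:[0-5]\d:[0-5]\d", wall_time):
        problems.append(f"wall_time {wall_time!r} is not of the form HH:MM:SS")

    # Trap 2: core_nb must agree with the #SBATCH lines of the header
    # (assumption: core_nb = nodes x tasks per node).
    header = queue.get("header", "")
    nodes = re.search(r"^#SBATCH\s+--nodes=(\d+)", header, re.MULTILINE)
    tasks = re.search(r"^#SBATCH\s+--ntasks-per-node=(\d+)", header, re.MULTILINE)
    if nodes and tasks:
        expected = int(nodes.group(1)) * int(tasks.group(1))
        if queue.get("core_nb") != expected:
            problems.append(
                f"core_nb is {queue.get('core_nb')}, header implies {expected}"
            )
    return problems


prod = {
    "wall_time": "00:15:00",
    "core_nb": 540,
    "header": "#!/bin/bash\n#SBATCH --nodes=15\n#SBATCH --ntasks-per-node=36\n",
}
print(check_queue(prod))  # no problem reported for the example queue

# Same queue with an unquoted 12:00:00 (loaded as 43200) and a wrong core_nb.
broken = dict(prod, wall_time=43200, core_nb=15)
for problem in check_queue(broken):
    print(problem)
```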
You have been given a standard AVBP folder with a run.params, as described here (https://www.cerfacs.fr/avbp7x/HELP/avbphelp.php).
The only thing that you have to change in your run.params file is the time in the first block: specify the physical time needed to reach the state you are interested in.
For example, in the MASRI computation, we need to reach about 11 ms. This time limit has to be specified in the RUN CONTROL block of the run.params file:
simulation_end_time = 11.0D-03
AVBP can work with either simulation_end_iteration or simulation_end_time. Put simulation_end_time in your run.params even if you are used to simulation_end_iteration, because lemmings only understands simulation_end_time. Check here (https://www.cerfacs.fr/avbp7x/HELP/keywords_list.php) for more details on this. Be sure to use the AVBP format, for example 3.0D-03 if you want 3 ms. This time is the physical time of the whole computation (not of each chained run). If you leave simulation_end_iteration, you will get this error message:
UnboundLocalError: local variable 'condition_tgt' referenced before assignment
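A quick check of the run.params before launching avoids this error. The helper below is a hypothetical sketch, not a lemmings function: it only assumes that the file contains "keyword = value" lines, as in the example above, and it converts the Fortran-style D exponent to a Python float.

```python
def read_end_time(run_params_text):
    """Return simulation_end_time (in seconds) from the text of a run.params."""
    values = {}
    for line in run_params_text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()

    if "simulation_end_time" not in values:
        message = "simulation_end_time is missing from run.params"
        if "simulation_end_iteration" in values:
            message += " (simulation_end_iteration is not understood by lemmings)"
        raise ValueError(message)

    # 11.0D-03 (Fortran double precision) becomes 11.0E-03 for Python.
    return float(values["simulation_end_time"].upper().replace("D", "E"))


print(read_end_time("simulation_end_time = 11.0D-03"))

try:
    read_end_time("simulation_end_iteration = 100000")
except ValueError as error:
    print(error)
```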
Then prepare the avbp_recursif file. This file is basically the usual batch file that you use when you submit a classic single run with the command sbatch or qsub.
Remember the physical time discussed above: in our case, around 12 ms. In the machine.yml file, we defined the number of nodes and the number of cores per node. In our case, we have 15 nodes and 36 cores per node.
CPU time = real time x number of nodes x number of cores per node
In our case, one 12-hour (real time) run corresponds to about 22 000 CPU hours, so asking for a CPU limit of 52 000 hours makes sense.
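The formula above can be sketched in Python. The values used below are those of the prod queue in the example machine.yml (wall_time '00:15:00', 15 nodes, 36 tasks per node); the function names and the 500-hour budget are hypothetical and only serve the illustration.

```python
def wall_time_to_hours(wall_time):
    """Convert a 'HH:MM:SS' string to a number of hours."""
    hours, minutes, seconds = (int(part) for part in wall_time.split(":"))
    return hours + minutes / 60 + seconds / 3600


def cpu_hours(wall_time, nodes, cores_per_node):
    """CPU time = real time x number of nodes x number of cores per node."""
    return wall_time_to_hours(wall_time) * nodes * cores_per_node


def runs_within_budget(budget_cpu_hours, cpu_hours_per_run):
    """Number of full runs that fit in a CPU-hour budget."""
    return int(budget_cpu_hours // cpu_hours_per_run)


per_run = cpu_hours("00:15:00", nodes=15, cores_per_node=36)
print(per_run)                           # 135.0 CPU hours per 15-minute run
print(runs_within_budget(500, per_run))  # 3 full runs in a 500-hour budget
```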
Launch the first run from within your virtual environment with the command:
lemmings run avbp_recursif
Be sure not to launch
lemmings run avbp_recursif.yml
If you use auto-completion (Tab in your terminal window), this trap is easy to fall into.
Then a confirmation for the number of CPU hours will appear on the screen:
If the value is correct, confirm by entering the CPU time in hours.
Then the runs are launched. Notice that two runs are submitted, the second one with a dependency on the first.
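The dependency mechanism relies on the submit and dependency entries of the machine.yml file. As a rough sketch of the idea (not the actual lemmings code), the two submissions could be assembled as follows; the function name, the batch file names and the job id 12345 are hypothetical.

```python
def submit_command(batch_file, after_job_id=None,
                   submit="sbatch", dependency="--dependency=afterany:"):
    """Build the submission command, optionally depending on a previous job.

    The defaults are the 'submit' and 'dependency' values of the example
    machine.yml.
    """
    command = [submit]
    if after_job_id is not None:
        # The scheduler starts this job only once the given job has ended.
        command.append(f"{dependency}{after_job_id}")
    command.append(batch_file)
    return command


# First job: the run itself. Second job: waits for the first one to end.
print(" ".join(submit_command("batch_job")))
print(" ".join(submit_command("batch_pj", after_job_id=12345)))
```

With afterany, the second job starts whether the first one ended normally or was stopped at its wall time, which is what a chain of time-limited runs needs.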
When the next run is launched, we can see that in the run.params file, the initialization file has automatically been replaced by the last solution reached by the previous run:
You will find a folder named after the run that has been launched, containing the run.params that was used for each run.
Then, in the folder TIME, note that each run gets a different folder, clearly named with a reference to the iteration number it started from, which helps keep track of the history.
This work has been supported by the EXCELLERAT project which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 823691.