Lemmings is a 1991 video game in which the player tries to herd small animals, the “lemmings”, out of a 2D puzzle. Lemmings are clueless about their surroundings, walk blindly, and will eventually fall, burn, be crushed, … well, die, unless the player personally takes care of them. The “Lemmings jobs” introduced here are the same: by nature, these unsupervised job submissions often end in dramatic failure. Human oversight during execution is compulsory when you are dealing with chained runs.

An Introduction to Lemmings

This post gives a general introduction to Lemmings, the ideas behind it and what it can offer, without overloading the reader with hands-on information. Dedicated tutorial pages (tuto1, tuto2) cover some practical aspects of its use. (Estimated read time: 10 min)


Foreword: do remember that chained jobs make CPU consumption incredibly fast and easy. Use a chained job only when you are absolutely confident in the workflow. Then, consider using Lemmings if this workflow is to be shipped to a customer for recurrent use.


What is Lemmings?

Lemmings

  • is best described as an ensemble of functionalities that facilitates the interaction with a Job Scheduler
  • is designed for large scale jobs on High Performance Computers (HPCs)
  • interacts with either the Slurm or the PBS Job Scheduler
  • has been developed within the context of CFD solvers but is not limited to them
  • is written in Python

The basis of its development is threefold:

Allow advanced job chaining (1) while providing independence from the machine's Job Scheduler (2) and accounting for the computational cost of the desired workflow (3).

  1. Chain runs of CFD simulations on HPCs are common, can vary in complexity and require knowledge of shell scripting.

    • Lemmings offers a Python framework to create such a chain, in which the user does not need to know about the underlying complexity, including the shell scripting.
  2. Whilst the ideas behind chain runs do not depend on the machine's Job Scheduler, the commands used to implement them do.

    • Lemmings removes this hassle from the user: the resulting workflow will be portable between machines.
    • The choice of Python gives the user a vast range of possibilities thanks to a wide developer community.
  3. It is easy to lose track of the computational cost (CPU Hours) of a set of simulations, especially in chain runs.

    • Lemmings requires the user to specify a maximum number of CPU Hours for the whole workflow. This constraint forces the end user to think about the cost of a set of simulations beforehand.
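To make point 2 concrete, here is a minimal sketch of how the same idea ("submit this script once that job has finished successfully") is spelled differently on Slurm and on PBS. The helper `dependent_submit_command` is hypothetical and is not part of the Lemmings API; only the `sbatch --dependency=afterok:` and `qsub -W depend=afterok:` options come from the schedulers themselves.

```python
# Hypothetical helper for illustration only; this is NOT the Lemmings API.
def dependent_submit_command(scheduler, script, parent_job_id):
    """Build the command that submits `script` after `parent_job_id` succeeded."""
    if scheduler == "slurm":
        return ["sbatch", f"--dependency=afterok:{parent_job_id}", script]
    if scheduler == "pbs":
        return ["qsub", "-W", f"depend=afterok:{parent_job_id}", script]
    raise ValueError(f"unsupported scheduler: {scheduler}")


# Same intent, two different command lines (job id 1234 is made up).
slurm_cmd = dependent_submit_command("slurm", "job.sh", "1234")
pbs_cmd = dependent_submit_command("pbs", "job.sh", "1234")
print(" ".join(slurm_cmd))
print(" ".join(pbs_cmd))
```

A portable workflow hides this kind of branching from the user, which is the hassle Lemmings is meant to remove.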

How does Lemmings differ from a normal batch process / job?

Or in other words: why should I bother using Lemmings and not generate my own shell scripts? The answers to this question are best understood by looking at different scenarios.

Scenario 1: Classic chain run of a single CFD simulation

Target

We would like to run a CFD simulation for 72 h. Let’s assume this corresponds to 24 ms of CFD simulation time, which is the more likely way to state the requirement.

Boundary Conditions

In such a situation the end user is limited by the maximum run time (wall clock time) for a single job on the selected machine. Assume that this is set to 12 h.

  • wall clock time: 12:00:00

Workflow setup

Standard way

The standard way to do this would be to create a looped job chain script which submits a set of jobs (or job scripts). In this chain script, each subsequent job is made dependent upon completion of another one (in the present scenario: the previous one). All jobs are submitted to the scheduler at the same time; the only difference is that, in the queue, the dependent jobs receive a “Held” job status instead of the classical “Queued / Pending” status. For our specific example, a total of 6 runs will be submitted, 5 of which depend on completion of a previous run.
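The job count above follows directly from the numbers of the scenario. The sketch below only plans the chain (it does not talk to any scheduler); the function name `plan_chain` and the dictionary layout are assumptions made for this illustration.

```python
import math


def plan_chain(total_hours, max_wall_hours):
    """Plan a linear job chain: each job waits for the previous one."""
    n_jobs = math.ceil(total_hours / max_wall_hours)
    plan = []
    for index in range(n_jobs):
        # The first job has no parent; every other job depends on its predecessor.
        depends_on = index - 1 if index > 0 else None
        plan.append({"job": index, "depends_on": depends_on})
    return plan


chain = plan_chain(total_hours=72, max_wall_hours=12)
n_dependent = sum(1 for step in chain if step["depends_on"] is not None)
print(f"{len(chain)} jobs in the queue, {n_dependent} of them held")
```

In a real chain script, each planned step would be submitted with the scheduler's dependency option, using the job id returned by the previous submission.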

Lemmings way

In a working Lemmings environment we would specify:

  • the CFD executable to be run
  • the desired CFD simulation time: 24 ms in present case
  • a max amount of CPU Hours to be used: an estimate can be computed based on the 72 h in present case
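As a rough guide, the CPU Hours estimate mentioned in the last item can be taken as the number of cores multiplied by the wall clock hours. The sketch below assumes a job running on 480 cores; that core count is purely hypothetical and does not come from the scenario.

```python
def estimate_cpu_hours(n_cores, wall_clock_hours):
    """CPU Hours consumed when n_cores are busy for wall_clock_hours."""
    return n_cores * wall_clock_hours


# 480 cores is an assumed value for illustration; 72 h is the scenario target.
cpu_hours_limit = estimate_cpu_hours(n_cores=480, wall_clock_hours=72)
print(cpu_hours_limit)
```

Some machines account for whole nodes rather than cores, so the number to plug in depends on how your computing centre bills its hours.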

Lemmings will submit two scripts:

  • a job script: runs the simulations
  • a post_job script: in the present workflow, it resubmits a job and post_job script to the Job Scheduler upon completion of the job script

A summary of the process is given in Table 1.

Table 1: CFD chain run scenario 1

| A normal procedure | The Lemmings procedure |
|---|---|
| Create a looped job chain script my_job.sh. | Specify CFD executable, desired CFD simulation time and max CPUH. |
| Submit script to Job Scheduler which will generate the different jobs. | Submits a job.sh and post_job.sh to the Job Scheduler. |
| All jobs are in the queue, 5 conditional on the others. | Only 2 jobs in queue, post_job.sh conditional on the job.sh finalizing. |

Result

Both setups are equivalent and will produce the same result, provided nothing goes wrong during the simulation. An obvious difference is that the Standard way will decrease the priority of the user on the Job Scheduler, something which does not occur in the Lemmings way. This disadvantage becomes even more pronounced when an interdependent sequence of, say, 50 jobs has to be run. Note also that the queueing system may limit the number of jobs a user is allowed to have in the queue, which could be detrimental to your workflow. The message is: don’t get into a situation where everyone knows you as “the person that saturates the queueing system”.

The advantages of the Lemmings way become more visible when something does not go as planned in the CFD simulation.

What if something goes wrong?

End and restart of simulation

The Standard way requires the user to ensure that a CFD solution, from which the next job can restart, is written before the job gets killed. If you don’t want to fill your disk space by saving solutions at very short intervals, this requires an a priori estimate of when to write a solution so that it happens shortly before the wall clock time is reached. Such an estimate does not account for sudden changes in simulation behavior, which could result in the job being killed before an output is written. In the best case, an earlier output has been written from which a restart can be performed. Nevertheless, you have lost precious computational resources.

Lemmings allows the user to define the time needed to finalize the simulation, e.g. 5 min or 00:05:00. This time is subtracted from the wall clock time (12:00:00) and the result is used as the new wall clock time (12:00:00 - 00:05:00 = 11:55:00) for the simulation run. When this time is reached, a CFD output is written which is used to restart the following job. This way, as few computational resources as possible are lost. Unlike in the Standard way, the simulation does not depend on an a priori fine tuning of the output saving interval.
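The subtraction above is easy to reproduce with Python's standard library. This is a minimal sketch of the arithmetic only; the helper names `parse_hms` and `format_hms` are made up here and say nothing about how Lemmings implements it internally.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta):
    """Convert a timedelta back into an 'HH:MM:SS' string."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Machine limit minus the time reserved to write the final solution.
effective_wall_time = parse_hms("12:00:00") - parse_hms("00:05:00")
print(format_hms(effective_wall_time))  # 11:55:00
```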

Total desired simulation time

In our example our target is set to 24 ms. As the simulation progresses it is not uncommon that flow changes occur which would lead to changes in time step for instance.

In the Standard way the above situation could result in a total simulated time (summed over the sequence of jobs) lower than the targeted one. A subsequent job would then need to be submitted by the user.

Lemmings is not limited in the same way: as long as the simulation time has not reached 24 ms, Lemmings will continue submitting jobs. The hard limit of 6 jobs (set by the desired 72 h of wall clock time) does not exist. What does limit a Lemmings chain run is the maximum allowed number of CPU Hours, which is user defined. If it is set to represent exactly a 72 h run, the simulation will stop at the same moment as in the Standard way. If some buffer is included in the allowed CPU Hours, the simulation can continue running past the 72 h point if required. This adds a level of flexibility to our runs.
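The stopping logic described above can be mimicked with a small simulation. This is a sketch under stated assumptions, not Lemmings code: `run_chain` is a made-up function, the CPU Hours budget is assumed to be checked before each submission, each job is assumed to cost 5760 CPU Hours (480 hypothetical cores for 12 h), and the simulated time gained per job drops from 4 ms to 3 ms to imitate a shrinking time step.

```python
def run_chain(target_ms, cpu_hours_budget, cpu_hours_per_job, ms_per_job):
    """Simulate a chain; ms_per_job lists the simulated time gained by each job."""
    simulated_ms = 0.0
    cpu_hours_used = 0.0
    jobs_run = 0
    for gained_ms in ms_per_job:
        if simulated_ms >= target_ms:
            break  # target simulation time reached: stop chaining
        if cpu_hours_used + cpu_hours_per_job > cpu_hours_budget:
            break  # next job would exceed the user-defined CPU Hours limit
        simulated_ms = min(target_ms, simulated_ms + gained_ms)
        cpu_hours_used += cpu_hours_per_job
        jobs_run += 1
    return jobs_run, simulated_ms, cpu_hours_used


progress = [4, 4, 3, 3, 3, 3, 3, 3]  # ms gained per job (assumed values)

# Budget matching exactly 72 h: the chain stops after 6 jobs, short of 24 ms.
exact = run_chain(24, 34560, 5760, progress)
# Budget with a buffer: the chain keeps going until 24 ms is reached.
buffered = run_chain(24, 46080, 5760, progress)
print(exact)
print(buffered)
```

With the exact budget the chain behaves like the Standard way; with the buffer it needs two more jobs to reach the target.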

Crash in the chain

As the simulation progresses, issues could be encountered which cause a simulation to crash, hence affecting the chain of jobs. For example, a crash in the third job would prevent the subsequent jobs from running.

In the Standard way the above situation requires the user to look at what happened and recreate a workflow in which it will hopefully no longer occur.

With Lemmings it is possible to pre-define actions that try to handle such a situation without requiring the user to intervene. An action could consist of restarting the previous simulation with a lower CFL number. This versatility is made possible by the sequencing choice of submitting a job and a post_job script.
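As an illustration of such a pre-defined action, the sketch below decides what to do after each job. Everything in it is hypothetical: the function `next_action`, the status strings, the halving of the CFL number and the lower bound of 0.1 are assumptions chosen for the example, not Lemmings settings.

```python
def next_action(job_status, cfl, min_cfl=0.1, reduction=0.5):
    """Return (action, cfl_for_next_job) based on how the last job ended."""
    if job_status == "ok":
        return "continue", cfl
    reduced_cfl = cfl * reduction
    if reduced_cfl < min_cfl:
        # Lowering the CFL further is not considered useful: hand over to the user.
        return "stop", cfl
    return "restart_previous", reduced_cfl


print(next_action("ok", 0.8))
print(next_action("crashed", 0.8))
print(next_action("crashed", 0.15))
```

The point is that this decision runs between two jobs, which is exactly the slot the post_job script occupies.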

Scenario 2: Chain run of a single CFD simulation depending on the runtime result

Let’s add some spice to our previous scenario.

Target

Similarly to Scenario 1, we would like to run a CFD simulation for 72 h, which corresponds to 24 ms of CFD simulation time. On top of this, we want the settings of our simulation to change once a given condition is met. For the sake of simplicity, let’s assume that we wish to increase the number of solution outputs once the velocity profile along a line at a given location inside the computational domain has reached a desired shape (a reference), so as to collect some statistics. This operation needs to be performed only once, and only if the target profile has been reached.

Boundary Conditions

The same boundary conditions as in Scenario 1 apply:

  • wall clock time: 12:00:00

Workflow setup

The workflow in this scenario would be similar to Scenario 1, with the addition of an extra check at the end of each CFD job before the following one starts. This extra check consists of running a script that extracts the given velocity profile from the resulting flow field, compares it to a reference and adapts the simulation settings if necessary. We do not know at which point during the run the required condition will be reached, and it is even possible that it will never be reached.
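The comparison step of this check could look like the sketch below. The criterion (largest absolute deviation from the reference below a tolerance) and all numbers are assumptions for illustration; the post does not prescribe how the profiles should be compared.

```python
def profile_matches(profile, reference, tolerance):
    """True when every point of the profile is within `tolerance` of the reference."""
    if len(profile) != len(reference):
        raise ValueError("profile and reference must have the same number of points")
    max_deviation = max(abs(u - u_ref) for u, u_ref in zip(profile, reference))
    return max_deviation <= tolerance


# Made-up velocity samples (m/s) along the line, for illustration only.
reference = [0.0, 5.0, 10.0, 5.0, 0.0]
early_profile = [0.0, 2.0, 6.0, 2.0, 0.0]
late_profile = [0.0, 4.8, 9.9, 5.1, 0.0]

print(profile_matches(early_profile, reference, tolerance=0.5))
print(profile_matches(late_profile, reference, tolerance=0.5))
```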

Standard way

In the Standard way this would be achieved by calling a postprocessing job in between each pair of CFD jobs. This would result in 5 extra jobs being submitted in between the CFD runs. Each of these jobs is made conditional upon finalization of a CFD job.

Lemmings way

The same specifications as in Scenario 1 are required. On top of that, we need to tell Lemmings to run a postprocessing job as well. The latter will be run within the post_job script.

Lemmings will submit two scripts:

  • a job script: runs the simulations
  • a post_job script: in the present workflow it
    • runs the postprocessing script
    • resubmits a job and post_job script to the Job Scheduler upon completion of the postprocessing job.

Table 2: CFD chain run scenario 2

| A normal procedure | The Lemmings procedure |
|---|---|
| Create a looped job chain script my_job.sh. | Specify CFD executable, desired CFD simulation time and max CPUH. Tell Lemmings to run a postprocessing script. |
| Submit script to Job Scheduler which will generate the different jobs (CFD and post job). | Submits a job.sh and post_job.sh to the Job Scheduler. |
| All jobs are in the queue, 10 conditional on the others: 5 CFD jobs, 5 postprocessing jobs. | Only 2 jobs in queue, post_job.sh conditional on the job.sh finalizing. |

Result

Both setups are equivalent and will produce the same result, provided nothing goes wrong during the simulation. The obvious disadvantage of the Standard way is that 5 extra jobs are added to the queue. It is even possible that our target postprocessing operation is already performed after the 1st CFD job, in which case there is no need for the 4 subsequent postprocessing checks. Lemmings keeps its condensed format of only submitting two jobs at a time. A simple check in the Lemmings workflow structure prevents the post_job.sh script from running the postprocessing job once it is no longer needed. Obviously, such a “sanity check” could be put in place in the Standard way as well, but it would then have to be done within the postprocessing script and would therefore still require the postprocessing jobs to be run.
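The “simple check” mentioned above amounts to remembering that the operation has already been done. Here is a minimal sketch, assuming a small state dictionary that survives from one post_job to the next; the function `post_job_step`, the key `output_increased` and the action names are hypothetical.

```python
def post_job_step(state, run_check):
    """Decide what post_job does; `run_check` is the (costly) postprocessing check."""
    if state.get("output_increased"):
        # Target already reached in an earlier job: do not run the check again.
        return "skip_postprocessing"
    if run_check():
        state["output_increased"] = True
        return "increase_output"
    return "keep_settings"


# Pretend the target profile is reached after the second CFD job.
check_results = iter([False, True])
state = {}
actions = [post_job_step(state, lambda: next(check_results)) for _ in range(4)]
print(actions)
```

After the second job the check is never called again, so no postprocessing time is spent on the remaining jobs.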

Final comments

The two scenarios illustrated the conceptual differences between the way Lemmings operates and the same workflow set up without Lemmings, dubbed the Standard way. Some aspects pointed to obvious advantages of Lemmings. Lemmings does offer additional flexibility in how a workflow can be set up, but it remains the end user’s task to use this in a clever manner.

What Lemmings does can obviously be done without Lemmings, but in the end you will be creating a tool which does what Lemmings already does. So why not use Lemmings and focus on a clever workflow definition instead?

Acknowledgement

Lemmings is a service created in the EXCELLERAT Center Of Excellence, funded by the European community.


Jimmy-John Hoste is a postdoctoral researcher in computer science engineering with a focus on CFD related topics.
