Lemmings

A flexible Python framework for HPC workflow definition.

May 10, 2021

Introduction

Lemmings (lemmings-hpc) is an open-source Python code designed to simplify job scheduling on HPC clusters. It achieves this by offering the user a set of functionalities that require no prior knowledge of how to interact with a job scheduler, so that the emphasis can be placed on workflow management. Lemmings also ensures that these workflows are portable between different machines and machine environments.

This post focuses on what the workflow entails, how it has been conceptualised and implemented, and what makes it so flexible to use. It will hopefully convince the reader of the vast possibilities it offers for any task involving interaction with an HPC machine.
A sketch of the lemmings workflow with its associated methods.

The lemmings workflow

The figure above depicts the structure of the workflow.

It is essentially a conditional loop with 7 methods (indicated in blue) over which the user has full control.

The methods

These methods are Python functions in which the user has complete freedom to perform any action of interest. They are called at different positions within the workflow's loop. Two of these methods must return a boolean (see sketch), which determines the path taken at the loop's decision points. Not all methods have to be used.
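To make the idea of optional, user-controlled methods concrete, here is a minimal sketch assuming a class-based layout. The class names `WorkflowHooks` and `MyWorkflow`, the `state` dictionary and its keys are hypothetical and are not the lemmings API; only the method names prior_to_job(), prepare_run() and check_on_end() come from this post.

```python
class WorkflowHooks:
    """Hypothetical base class: every hook has a harmless default,
    so a user only overrides the methods they actually need."""

    def prior_to_job(self, state):
        # Default: do nothing before the first spawn job.
        pass

    def prepare_run(self, state):
        # Default: no run preparation.
        pass

    def check_on_end(self, state):
        # Default: never request a stop (a CPU-time limit would end the loop).
        return False


class MyWorkflow(WorkflowHooks):
    """User workflow overriding a single hook."""

    def check_on_end(self, state):
        # Stop once the (hypothetical) simulated time reaches the target.
        return state["sim_time"] >= state["target_time"]


print(MyWorkflow().check_on_end({"sim_time": 1.0, "target_time": 0.5}))  # → True
```

Giving every hook a harmless default is one way to let users skip the methods they do not need while still overriding any of them.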

The actual definition of the workflow and its methods will be detailed in another post.

A spawn job (or child process) is set up after the prior_to_job() method and is the starting point of the “inner loop” within the workflow.

The workflow interacts with the HPC job scheduler between the prepare_run() and check_on_end() methods. This interaction consists of submitting two batch files, a batch_job and a batch_pjob, both of which are generated automatically by lemmings.
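The order of operations described so far (prior_to_job(), then a spawn job, prepare_run(), job submission, and check_on_end()) can be illustrated with a toy driver loop. This is a hypothetical simulation, not the lemmings implementation: no scheduler is contacted, `fake_submit` merely stands in for submitting batch_job and batch_pjob, and `max_spawns` is an assumed safety guard standing in for the CPU-time limit.

```python
def fake_submit(state):
    """Stand-in for submitting batch_job and batch_pjob.

    Pretends each job advances the simulation by one time unit."""
    state["sim_time"] = state.get("sim_time", 0.0) + 1.0


def run_workflow(prior_to_job, prepare_run, check_on_end, max_spawns=100):
    """Toy driver mimicking the order in which the workflow methods are called."""
    state = {"spawn_count": 0}
    prior_to_job(state)                  # in this sketch, runs once before the inner loop
    while state["spawn_count"] < max_spawns:
        state["spawn_count"] += 1        # a new spawn job starts the inner loop
        prepare_run(state)
        fake_submit(state)               # simulated scheduler interaction
        if check_on_end(state):          # True means the stop condition is met
            break
    return state


final = run_workflow(
    prior_to_job=lambda s: s.update(sim_time=0.0),
    prepare_run=lambda s: None,
    check_on_end=lambda s: s["sim_time"] >= 3.0,
)
print(final["spawn_count"])  # → 3
```

Each False from check_on_end() triggers another spawn job, which is why three simulated jobs are needed to reach the target time here.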

Note: it is possible for “expert users” to take over the specification of these batch files.

That’s all well and good, but how do we break the loop?

By default, if the user takes no explicit control in the check_on_end() method, the maximum allowed CPU time, a required input of lemmings, is the deciding parameter.

The intent of lemmings, however, is for the user to take control by specifying one or more conditions that must be met to stop the simulation. The maximum allowed CPU time still remains a deciding parameter on top of the user-specified condition(s).

If the user-defined condition(s) in check_on_end() have not been met, the method should return False, in which case a new spawn job is initiated.
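The stopping logic can be sketched as follows, assuming the user's check_on_end() requires all of its conditions to hold. The state keys (`sim_time`, `target_time`, `residual`), the tolerance, the helper `loop_should_end`, and the use of CPU hours as the unit are illustrative assumptions. The post indicates that lemmings enforces the CPU-time limit itself; the helper below only mimics how the two criteria combine.

```python
def user_check_on_end(state):
    """Hypothetical user hook: request a stop only when every condition holds."""
    conditions = [
        state["sim_time"] >= state["target_time"],  # target physical time reached
        state["residual"] < 1e-6,                    # solution considered converged
    ]
    return all(conditions)


def loop_should_end(state, cpu_used_hours, max_cpu_hours):
    """Mimic the combined criterion: the CPU budget always applies."""
    if cpu_used_hours >= max_cpu_hours:
        return True  # budget exhausted, stop regardless of user conditions
    return user_check_on_end(state)  # False means a new spawn job is needed


state = {"sim_time": 0.5, "target_time": 1.0, "residual": 1e-8}
print(loop_should_end(state, cpu_used_hours=10, max_cpu_hours=100))   # → False
print(loop_should_end(state, cpu_used_hours=100, max_cpu_hours=100))  # → True
```

Requiring all conditions is one design choice; a user could equally stop as soon as any single condition holds by using `any()` instead.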

All methods can share information with each other in the form of a database.yml file. It is the preferred way for the user to rely on this for interacting