A step-by-step guide to Kraken GPU nodes

By Luciano Drozda, Guillaume Bogopolsky, Victor Xing - 2020/06/17


Reading Deep Learning at Cerfacs by Corentin Lapeyre is strongly recommended as a prerequisite to this guide.

Basics

  1. Connect to the CERFACS intranet via the lab's wired network or through the SSL VPN client (in the latter case, a keychain token must be requested from csg);

  2. Type ssh kraken in a terminal on your local machine. This command connects you to Kraken's so-called head node (or "la frontale", in French), which serves as a connection and work node for the whole cluster and has no GPU access. Therefore, you should NEVER run jobs on the head node;

  3. Once on the head node, type salloc -p gpu. This command allocates resources on a compute node of the gpu partition, whose nodes are all outfitted with GPUs. A message like the following will show up on your screen, meaning that access to a node has been granted:

    salloc: Granted job allocation 333261
    salloc: Waiting for resource configuration
    salloc: Nodes krakengpu1 are ready for job
  4. Finally, type ssh krakengpu1 (or ssh krakengpu2, depending on which node was allocated to you) and you will be connected to a GPU node.

If you want to work on a specific GPU node, say krakengpu1, add the -w option to the salloc call in step 3 above:

salloc -p gpu -w krakengpu1
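
Putting steps 2 to 4 together, a typical connection sequence (starting from your local machine, and assuming Slurm granted krakengpu1) looks like this:

ssh kraken        # step 2: connect to the head node
salloc -p gpu     # step 3: request an allocation on the gpu partition
ssh krakengpu1    # step 4: connect to the node that was granted (adapt the name)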

To see which jobs are currently running on your working node and how much GPU memory they are consuming, use the command nvidia-smi.

To list your own running jobs, type squeue -u <USERNAME>. Since you will use this command frequently, you might want to define an alias for it. Add the following to your ~/.bashrc to make sq an equivalent command:

alias sq='squeue -u <USERNAME>'
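
As an optional variant, the USER environment variable can be used so that the same alias line works for anyone:

alias sq='squeue -u $USER'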

To check your user disk quota, type mmlsquota (while on a GPU node) or rquota (while on the head node). Note that rquota provides a more user-friendly format.

In order to disconnect from a node, use Ctrl-D (or type exit).

Useful tweaks

We recommend adding the following aliases to your ~/.bashrc file to reduce typing during the connection procedure described above:

alias gpu='salloc -p gpu'
alias biggpu='salloc -p biggpu'
alias gpu1='ssh krakengpu1'
alias gpu2='ssh krakengpu2'
alias bgpu='ssh krakenbgpu'

You should use /scratch/<WORKGROUP>/<USERNAME> as your working directory by default, since you are granted far more disk space there than in your $HOME directory. To change to it automatically at the start of each session, add this to your ~/.bashrc:

# Change to /scratch directory
cd /scratch/<WORKGROUP>/<USERNAME>/
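
If you prefer to keep non-interactive sessions (e.g. scp) unaffected by this automatic cd, a possible variant, sketched here assuming bash, changes directory in interactive shells only:

# Change to /scratch directory in interactive shells only
if [[ $- == *i* ]]; then
    cd /scratch/<WORKGROUP>/<USERNAME>/
fi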

Furthermore, still in your ~/.bashrc, add the following lines to load a Python 3.6 module containing commonly used packages at the start of each session:

# Load python 3.6
module load python/anaconda3.6

# User specific aliases and functions
source /softs/anaconda3/etc/profile.d/conda.sh

This module is useful because the Python packages it already provides do not count against your disk quota.

Once loaded, the module provides several conda environments, which can be listed with the command conda env list. At the time of writing, the output is as follows:

# conda environments:
#
base                  *  /softs/anaconda3
ert_env                  /softs/anaconda3/envs/ert_env
fenics                   /softs/anaconda3/envs/fenics
fenics-py37              /softs/anaconda3/envs/fenics-py37
pangeo                   /softs/anaconda3/envs/pangeo
tf1.10.1-cuda9-py36      /softs/anaconda3/envs/tf1.10.1-cuda9-py36
tf1.12-cuda10-py36       /softs/anaconda3/envs/tf1.12-cuda10-py36
tf1.12-cuda10-py36-jupyter     /softs/anaconda3/envs/tf1.12-cuda10-py36-jupyter
tf1.13.1-cuda10-py36     /softs/anaconda3/envs/tf1.13.1-cuda10-py36
tf1.8-cuda9-py36         /softs/anaconda3/envs/tf1.8-cuda9-py36
tf2.0-cuda10             /softs/anaconda3/envs/tf2.0-cuda10

To activate one of these, type conda activate <ENV_NAME>. You should then see an indication at the left of your shell prompt, as follows:

(tf2.0-cuda10) [<USERNAME>@kraken1 <USERNAME>]$
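
For instance, to enter one of the environments listed above and leave it when you are done:

conda activate tf2.0-cuda10    # enter the environment
# ... work with the packages it provides ...
conda deactivate               # return to the base environment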

Finally, if your project requires Python packages that are not included in the conda environment matching the TensorFlow version you are working with, you may opt to create a lightweight Python virtual environment with the command

python -m venv --system-site-packages /<VENV_PATH>/

The --system-site-packages option links to the packages of the original conda environment, so that your user quota is used only for newly installed packages. To activate the newly created virtual environment, type source /<VENV_PATH>/bin/activate. You should then see an indication at the left of your shell prompt:

(<VENV_NAME>) [<USERNAME>@kraken1 <USERNAME>]$
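
As an illustration (the environment name, path and package below are placeholders, not prescribed values), a typical workflow on /scratch could be:

conda activate tf2.0-cuda10                                          # start from the conda environment you need
python -m venv --system-site-packages /scratch/<WORKGROUP>/<USERNAME>/venvs/myproject
source /scratch/<WORKGROUP>/<USERNAME>/venvs/myproject/bin/activate
pip install <SOME_PACKAGE>       # only packages missing from the conda environment are installed
deactivate                       # leave the virtual environment when done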

Creating virtual environments with conda is also possible and all details about it can be found in the Using conda at Cerfacs article by Victor Xing.

Advanced use

It is possible to launch jobs that keep running even after you disconnect from the GPU node and the head node. It is also possible to queue multiple jobs so that each runs once the previous one has ended (an example of such job chaining is given below). This is done using the Slurm workload manager:

  1. Prepare a bash script following the Slurm documentation. A simple example is given below for the gpu partition:
    #!/bin/bash
    #SBATCH --partition gpu
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --job-name TRAIN_NEURAL_NET
    #SBATCH --time=12:00:00
    srun python train.py > ${SLURM_JOBID}.out
  2. Run the batch script you just created (named, e.g., train.sh) with the sbatch command:
    sbatch train.sh

You can check that your job is actually running via sq (you may need to wait a few seconds until Slurm launches it).
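
To chain jobs so that one starts only after the previous one has completed successfully, Slurm dependencies can be used; here is a minimal sketch reusing the train.sh script from above:

JOBID=$(sbatch --parsable train.sh)              # submit a first job and keep its ID
sbatch --dependency=afterok:${JOBID} train.sh    # this one stays queued until the first job succeeds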

When running on the biggpu partition, you might want to add the following option to your script header:

#SBATCH --gres=gpu:1

It will allocate one of the node's GPUs to your job exclusively, by setting the CUDA_VISIBLE_DEVICES environment variable (more info here). That way, you ensure you are alone on the GPU; if no GPU is available, your job is queued.
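
As a minimal sketch, the batch script from above adapted to biggpu with an exclusive GPU request could look like this (job name and time limit are illustrative):

#!/bin/bash
#SBATCH --partition biggpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --job-name TRAIN_NEURAL_NET
#SBATCH --time=12:00:00

# CUDA_VISIBLE_DEVICES is set by Slurm to the GPU reserved for this job
echo "Allocated GPU(s): ${CUDA_VISIBLE_DEVICES}"
srun python train.py > ${SLURM_JOBID}.out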