Deep Learning at Cerfacs
By Corentin Lapeyre - 2019/10/08
Hardware
Cerfacs has a number of machines equipped to train deep neural-networks efficiently on GPU:
- The Kraken cluster has several queues with GPU capability:
  - gpu: 2 nodes, each with one V100 card with 16 GB RAM
  - biggpu: 4 V100 cards, interconnected via NVLink, with 32 GB RAM each
  - t4: 1 T4 card with 16 GB RAM
- Ibo, which has a Quadro P4000 with 8 GB RAM
- Pelican, which has an RTX 2080 Ti with 11GB RAM
The Kraken gpu and t4 queues are meant for shared usage by several concurrent users. All other systems are mostly meant for single usage, but this is not software-enforced, so if you experience e.g. memory problems, check whether someone else is using the same resource as you.
JupyterHub access
The simplest way to get started with deep learning at Cerfacs (e.g. to follow a tutorial) is to use the JupyterHub server that runs on Kraken2: http://138.63.200.58:8000. You may choose the node on which to launch your run: krakengpu1, krakengpu2 or krakent4.
To connect to the notebook, type in the following address in your browser:
http://138.63.200.58:8000
To connect to the lab, type in the following address in your browser:
http://138.63.200.58:8000/user/USERNAME/lab
where you must replace USERNAME with your Cerfacs login. This will automatically spawn a single-CPU session.
WARNING: All notebooks are spawned on the same node, and share the same GPU card. The RAM of such a card runs out fast, so be careful of how big your model and data are. If you run into memory issues, check how many other people are using the same node by logging into kraken and typing:
squeue | grep spawner
If you are new to this, it is a good idea to check the memory usage section below.
Memory usage
GPUs have a limited amount of memory, and some deep learning frameworks such as Tensorflow default to allocating all the memory available, regardless of the size of your model and data.
You should limit the memory usage of your tasks in order to let other users work on the same node at the same time. This is typically done by adding some lines just after importing Tensorflow (TF) in each of your Python scripts.
In TF 1.X:
import tensorflow as tf

memory_fraction = 0.2

# GPU setup: cap this process at a fraction of the card's memory
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=memory_fraction)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
or, alternatively,
import tensorflow as tf

# GPU setup: let Tensorflow grow its memory usage only as needed
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
The former sets a specific value for the memory amount to be used whereas the latter instructs Tensorflow to limit its usage to what is needed.
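To get a feel for what a given memory_fraction means on the cards listed in the Hardware section, here is a quick back-of-the-envelope sketch (the helper function is purely illustrative, not part of Tensorflow):

```python
# Illustrative helper (not a Tensorflow API): GPU RAM in GB that a given
# per_process_gpu_memory_fraction reserves on a card of a given size.
def reserved_gb(card_ram_gb, memory_fraction):
    """Amount of GPU RAM (in GB) Tensorflow would pre-allocate."""
    return card_ram_gb * memory_fraction

print(reserved_gb(16, 0.2))  # V100 on the gpu queue -> 3.2 GB
print(reserved_gb(32, 0.2))  # V100 on biggpu        -> 6.4 GB
```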
In TF 2.0:
import tensorflow as tf

memory_limit = 1024  # in MB

# GPU setup: cap the first GPU at a fixed memory budget
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=memory_limit)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
or
import tensorflow as tf

# GPU setup: enable memory growth on every visible GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
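When several people share one card, a sensible memory_limit is roughly the card's RAM divided by the number of concurrent users, minus some headroom for the CUDA context. A hypothetical helper to sketch that reasoning (the 10% headroom figure is an assumption, not a Cerfacs rule):

```python
# Hypothetical helper: suggest a per-user memory_limit (in MB) so that
# n_users can share one card, keeping 10% headroom for the CUDA context.
def shared_memory_limit_mb(card_ram_gb, n_users, headroom=0.10):
    usable_mb = card_ram_gb * 1024 * (1.0 - headroom)
    return int(usable_mb / n_users)

print(shared_memory_limit_mb(16, 4))  # 4 users on a 16 GB V100 -> 3686 MB each
```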
Shell Access
Once you have access to the Cerfacs network (either on site or via VPN), log in to one of the machines above (N.B.: access to Kraken requires a specific account). Please prefer local storage for all your work:
- on /scratch for Kraken nodes
- in /store otherwise; ask csg for access on the target machine
Recent preinstalled environments are available on all the machines listed above via Singularity. To access one, load the singularity module and call Python through the appropriate image. On Kraken:
module load tools/singularity/2.6.1
singularity run --nv -B /scratch /usr/local/singularity/images/tf19.03.simg python my_script.py
and elsewhere:
module load tools/singularity/2.6.1
singularity run --nv -B /store /usr/local/singularity/images/tf19.04.simg python my_script.py
Python modules
The singularity image comes loaded with a healthy list of Python modules dedicated to data-science tasks. The current modules and versions in the images are:
| Package | Version |
|---|---|
| horovod | 0.16.0 |
| jupyterlab | 0.35.4 |
| Keras | 2.2.4 |
| librosa | 0.6.1 |
| matplotlib | 3.0.3 |
| mpi4py | 3.0.1 |
| nltk | 3.2.5 |
| nvidia-dali-tf-plugin | 0.7.0 |
| pandas | 0.23.0 |
| pip | 19.0.3 |
| portpicker | 1.3.1 |
| python-speech-features | 0.6 |
| sacrebleu | 1.2.20 |
| sentencepiece | 0.1.6 |
| tensorflow-gpu | 1.13.1+nv |
| virtualenv | 16.5.0 |
However, it sometimes happens that you need a module that is not readily available. In that case, there are two recommended options:
- Suggest an update to the reference images. Just remember that this environment is used by everyone, so if you just want to do a quick test, this might not be the best method. However, in many cases it can benefit everyone to strengthen the base images, so please direct your requests for additional packages to the workgroup.
- Create a virtual environment using venv. Since Python 3.3, the venv module enables users to easily create their own environments and choose their location. Create the environment you want and install everything you need inside. When working with a TF version <= 1.14, remember to install tensorflow-gpu and not simply tensorflow, to take advantage of the huge speed-up that GPUs provide for deep learning.
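As a sketch of the second option (the environment path below is only an example; prefer local storage such as /scratch on Kraken):

```shell
# Create a virtual environment (the venv module ships with Python >= 3.3);
# the path below is only an example.
python3 -m venv "$HOME/venvs/dl-env"
source "$HOME/venvs/dl-env/bin/activate"

# For TF <= 1.14, install the GPU build, not plain tensorflow:
#   pip install tensorflow-gpu
```

Once the environment is activated, python and pip resolve to the environment's own copies, so installs do not touch the shared images.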