Deep Learning at Cerfacs
By Corentin Lapeyre - 2019/10/08
Hardware
Cerfacs has a number of machines equipped to train deep neural-networks efficiently on GPU:
- The Kraken cluster has several queues with GPU capability:
  - gpu: 2 nodes, each with one V100 card with 16 GB RAM
  - biggpu: 4 V100 cards, interconnected via NVLink, with 32 GB RAM each
  - t4: 1 T4 card with 16 GB RAM
- Ibo, which has a Quadro P4000 with 8 GB RAM
- Pelican, which has an RTX 2080 Ti with 11GB RAM
The Kraken gpu and t4 queues are meant for shared usage by several concurrent users. All other systems are mostly meant for single usage, but this is not software-enforced, so if you experience e.g. memory problems, check whether someone else is using the same resource as you.
JupyterHub access
The simplest way to get started with deep learning at Cerfacs (e.g. to follow a tutorial) is to use the JupyterHub server that runs on Kraken2: http://138.63.200.58:8000. You may choose the node on which to launch your run: krakengpu1, krakengpu2 or krakent4.
To connect to the notebook, type in the following address in your browser:
http://138.63.200.58:8000
To connect to the lab, type in the following address in your browser:
http://138.63.200.58:8000/user/USERNAME/lab
where you must replace USERNAME with your Cerfacs login. This will automatically spawn a single-CPU session.
WARNING: All notebooks are spawned on the same node, and share the same GPU card. The RAM of such a card runs out fast, so be careful of how big your model and data are. If you run into memory issues, check how many other people are using the same node by logging into kraken and typing:
squeue | grep spawner
If you are new to this, it is a good idea to check the memory usage section below.
Memory usage
GPUs have a limited amount of memory, and some deep learning frameworks such as Tensorflow default to allocating all the memory available, regardless of the size of your model and data.
You should limit the memory usage of your tasks in order to let other users work on the same node at the same time. This is typically done by adding some lines just after importing Tensorflow (TF) in each of your Python scripts.
In TF 1.X:
import tensorflow as tf

memory_fraction = 0.2

# GPU setup: cap this process at a fraction of the card's memory
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=memory_fraction)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
or, alternatively,
import tensorflow as tf

# GPU setup: let Tensorflow grow its memory usage only as needed
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
The former sets a specific value for the memory amount to be used whereas the latter instructs Tensorflow to limit its usage to what is needed.
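To get a feel for what a given memory_fraction means on the cards listed in the Hardware section, here is a quick back-of-the-envelope sketch (the helper function is purely illustrative, not part of Tensorflow):

```python
# Illustrative helper (not a Tensorflow API): GPU RAM in GB that a given
# per_process_gpu_memory_fraction reserves on a card of a given size.
def reserved_gb(card_ram_gb, memory_fraction):
    """Amount of GPU RAM (in GB) Tensorflow would pre-allocate."""
    return card_ram_gb * memory_fraction

print(reserved_gb(16, 0.2))  # V100 on the gpu queue -> 3.2 GB
print(reserved_gb(32, 0.2))  # V100 on biggpu        -> 6.4 GB
```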
In TF 2.0:
import tensorflow as tf

memory_limit = 1024  # in MB

# GPU setup: cap the first GPU at a fixed memory budget
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=memory_limit)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
or
import tensorflow as tf

# GPU setup: enable memory growth on every visible GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
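When several people share one card, a sensible memory_limit is roughly the card's RAM divided by the number of concurrent users, minus some headroom for the CUDA context. A hypothetical helper to sketch that reasoning (the 10% headroom figure is an assumption, not a Cerfacs rule):

```python
# Hypothetical helper: suggest a per-user memory_limit (in MB) so that
# n_users can share one card, keeping 10% headroom for the CUDA context.
def shared_memory_limit_mb(card_ram_gb, n_users, headroom=0.10):
    usable_mb = card_ram_gb * 1024 * (1.0 - headroom)
    return int(usable_mb / n_users)

print(shared_memory_limit_mb(16, 4))  # 4 users on a 16 GB V100 -> 3686 MB each
```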
Shell Access
Once you have access to the Cerfacs network (either on site or via VPN), log in to one of the machines above (N.B.: access to Kraken requires a specific account). Please prefer local storage for all your work:
- on /scratch for Kraken nodes
- in /store otherwise; ask csg for access on the target machine
Recent preinstalled environments are available on all the machines listed above via Singularity. To access one, load the singularity module and call Python through the appropriate image. On Kraken:
module load tools/singularity/2.6.1
singularity run --nv -B /scratch /usr/local/singularity/images/tf19.03.simg python my_script.py
and elsewhere:
module load tools/singularity/2.6.1
singularity run --nv -B /store /usr/local/singularity/images/tf19.04.simg python my_script.py
Python modules
The singularity image comes loaded with a healthy list of Python modules dedicated to data-science tasks. The current modules and versions in the images are:
| Package | Version |
|---|---|
| horovod | 0.16.0 |
| jupyterlab | 0.35.4 |
| Keras | 2.2.4 |
| librosa | 0.6.1 |
| matplotlib | 3.0.3 |
| mpi4py | 3.0.1 |
| nltk | 3.2.5 |
| nvidia-dali-tf-plugin | 0.7.0 |
| pandas | 0.23.0 |
| pip | 19.0.3 |
| portpicker | 1.3.1 |
| python-speech-features | 0.6 |
| sacrebleu | 1.2.20 |
| sentencepiece | 0.1.6 |
| tensorflow-gpu | 1.13.1+nv |
| virtualenv | 16.5.0 |
However, it sometimes happens that you need a module that is not readily available. In that case, there are two recommended options:
- Suggest an update to the reference images. Just remember that this environment is used by everyone, so if you just want to do a quick test, this might not be the best method. However, in many cases it can benefit everyone to strengthen the base images, so please direct your requests for additional packages to the workgroup.
- Create a virtual environment using venv. Since Python 3.3, the venv module enables users to easily create their own environments and choose their location. Create the environment you want and install everything you need inside. When working with a TF version <= 1.14, remember to install tensorflow-gpu and not simply tensorflow, to take advantage of the huge speed-up that GPUs provide for deep learning.
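As a sketch of the second option (the environment path below is only an example; prefer local storage such as /scratch on Kraken):

```shell
# Create a virtual environment (the venv module ships with Python >= 3.3);
# the path below is only an example.
python3 -m venv "$HOME/venvs/dl-env"
source "$HOME/venvs/dl-env/bin/activate"

# For TF <= 1.14, install the GPU build, not plain tensorflow:
#   pip install tensorflow-gpu
```

Once the environment is activated, python and pip resolve to the environment's own copies, so installs do not touch the shared images.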