This article can be viewed as a presentation here.


We will use two V100s on two different kraken nodes (krakengpu1 and krakengpu2) for this example.

You can install the latest version of Horovod with pip (`pip install horovod`). Then, you must perform a few modifications to your code to distribute it on multiple GPUs (a complete example is sketched after the list below), and you’ll be ready to go.

Code modifications

  1. Import Horovod and initialize it:

    ```python
    import horovod.keras as hvd

    hvd.init()
    ```

  2. Pin one GPU per process:

    ```python
    import tensorflow as tf
    from keras import backend as K

    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    K.set_session(tf.Session(config=config))
    ```

  3. Wrap your optimizer with Horovod:

    ```python
    opt = hvd.DistributedOptimizer(opt)
    ```

  4. Broadcast the initial variable states from rank 0 to all other processes:

    ```python
    callbacks = [
        # ... every other callback ...
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]
    ```

  5. If you use ModelCheckpoint, add it as a callback on worker 0 only:

    ```python
    if hvd.rank() == 0:
        callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
    ```
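
Putting the five modifications together, a complete training script might look like the sketch below. The toy model, random data and hyperparameter values are illustrative placeholders, not part of the original script:

```python
# training.py -- sketch of a Keras script with the five Horovod modifications applied.
import numpy as np
import tensorflow as tf
import keras
from keras import backend as K
import horovod.keras as hvd

# 1. Initialize Horovod.
hvd.init()

# 2. Pin one GPU per process.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Toy data and model, just to make the sketch runnable.
x_train = np.random.rand(1024, 10)
y_train = np.random.rand(1024, 1)
model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(10,))])

# 3. Wrap the optimizer with Horovod.
opt = keras.optimizers.Adam(lr=0.001)
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss='mse')

# 4. Broadcast the initial variable states from rank 0.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# 5. Only worker 0 writes checkpoints.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

model.fit(x_train, y_train, batch_size=32, epochs=5, callbacks=callbacks)
```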

Ready to go!

These are the mandatory modifications to your script to parallelize it on multiple GPUs. Note that you will have to use Slurm and mpirun to execute your script. Indeed, we want to run one process per GPU node. This can be done with the following job script, once connected to the cluster:

```bash
#!/bin/bash
#SBATCH --partition gpu
#SBATCH --nodes=2

mpirun -np 2 python training.py
```
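
Before launching a full training run, it can be useful to check that each process sees its own rank and GPU. A minimal sanity-check sketch (the file name check_hvd.py is just an example), launched the same way as the training script:

```python
# check_hvd.py -- quick sanity check of the Horovod/MPI setup.
# Launch it like the training script, e.g.: mpirun -np 2 python check_hvd.py
import horovod.keras as hvd

hvd.init()

# Each process reports its global rank, the total number of processes,
# and its local rank (the GPU index it will be pinned to on its node).
print("rank %d of %d (local_rank %d)" % (hvd.rank(), hvd.size(), hvd.local_rank()))
```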

Maintaining accuracy

Unfortunately, “getting it to work” is not enough. To ensure that you maintain the same accuracy as when using a single GPU, there is a bag of tricks that you should definitely consider using (a combined sketch follows the list):

  1. Scaling your number of epochs and your LR according to the number of GPUs:

    ```python
    LR = LR * hvd.size()
    epochs = epochs // hvd.size()
    ```

  2. Using a Learning Rate warm-up: use a small learning rate at the beginning, then gradually ramp it back up to the configured learning rate once the training process is stable.

    ```python
    callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5))
    ```

  3. Using a Learning Rate scheduler: reduce the Learning Rate whenever the loss plateaus.

    ```python
    callbacks.append(keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1))
    ```

  4. Using a LARS optimizer when training with large batches: training with a large batch size (bigger than 8K) often results in lower model accuracy, and even with learning rate scaling and warm-up, the training might diverge. Using a LARS optimizer can prevent this. An implementation of this optimizer can be found here. Once this code has been added to your script, you can use it as usual:

    ```python
    opt = LARS(lr=0.001)
    opt = hvd.DistributedOptimizer(opt)

    model.compile(optimizer=opt, loss='mse')
    ```
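
Putting tricks 1–3 together with the earlier modifications, the optimizer and callback setup of the training.py sketch above would become something like the following; the base learning rate, warmup_epochs and patience values are illustrative, not prescriptions from the article:

```python
# Continuation of the training.py sketch above (model, x_train, y_train already defined).
# Trick 1: scale the learning rate up and the number of epochs down by the worker count.
base_lr = 0.001
epochs = 100 // hvd.size()

opt = keras.optimizers.SGD(lr=base_lr * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss='mse')

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Trick 2: warm the learning rate up over the first epochs.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5),
    # Trick 3: reduce the learning rate when the loss plateaus.
    keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1),
]
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

model.fit(x_train, y_train, batch_size=32, epochs=epochs, callbacks=callbacks)
```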

Nicolas Cazard was a research engineer working on Deep Learning - 2018-2020.
