How to use Horovod for training with GPUs on multiple nodes:

We will use 2 V100s on 2 different Kraken nodes (krakengpu1 and krakengpu2) for this example.

You can install the latest version of Horovod with pip.
Then, you will just have to make a few modifications to your code to distribute it on multiple GPUs:

1) Import Horovod and initialize it:

In [3]:
import horovod.keras as hvd
hvd.init()

2) Pin one GPU per process:

In [4]:
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
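
If you are running on TensorFlow 2, where ConfigProto and Session no longer exist, the same pinning is typically done through tf.config instead; a minimal sketch (making only the GPU matching the local rank visible):

In [ ]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Expose only the GPU matching this process's local rank
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')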

3) Wrap your optimizer with Horovod:

In [ ]:
opt = hvd.DistributedOptimizer(opt)

4) Broadcast the initial variable states from rank 0 to all other processes:

In [ ]:
callbacks = [
    # ... all your other callbacks,
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

5) Add ModelCheckpoint as a callback only on worker 0, if you use it:

In [ ]:
# Save checkpoints only on worker 0 to avoid conflicting writes
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

These are the mandatory modifications to your script to parallelize it on multiple GPUs.
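
Putting the five steps together, a minimal end-to-end sketch could look like the following (the model, data, and hyperparameters are placeholders for illustration only):

In [ ]:
import numpy as np
import keras
import tensorflow as tf
import horovod.keras as hvd
from keras import backend as K

# 1) Initialize Horovod
hvd.init()

# 2) Pin one GPU per process
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Placeholder model and data: replace with your own
model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(10,))])
x_train = np.random.rand(256, 10)
y_train = np.random.rand(256, 1)

# 3) Wrap the optimizer (learning rate scaled by the number of workers)
opt = hvd.DistributedOptimizer(keras.optimizers.Adam(lr=0.001 * hvd.size()))
model.compile(optimizer=opt, loss='mse')

# 4) Broadcast initial variable states from rank 0
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# 5) Checkpoint only on worker 0
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

model.fit(x_train, y_train, epochs=20 // hvd.size(), callbacks=callbacks)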

Note that you will have to use Slurm and mpirun to execute your script.

Indeed, we want to run one process per GPU node on Kraken.
This can be done with the following batch script, once connected to Kraken:

In [ ]:
#!/bin/bash
#SBATCH --partition gpu
#SBATCH --nodes=2

# One MPI process per node, each driving its own GPU
mpirun -np 2 python training.py
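
Alternatively, recent Horovod releases provide the horovodrun launcher, which wraps mpirun. Assuming the two node names above with one GPU slot each, an equivalent launch from a shell that can reach both nodes would look like this:

In [ ]:
horovodrun -np 2 -H krakengpu1:1,krakengpu2:1 python training.py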

However, to ensure that you maintain the same accuracy as when using a single GPU, there is a bag of tricks that you should definitely consider using:

1) Scaling your number of epochs and your LR according to the number of GPUs:

Each of the hvd.size() workers processes its own batch at every step, so the effective batch size grows linearly with the number of workers; scaling the LR up and the number of epochs down compensates for this.

In [5]:
LR = LR * hvd.size()
epochs = epochs // hvd.size()

2) Using a Learning Rate warm-up:

We use a small learning rate at the beginning, then ramp it up to the target (scaled) learning rate once the training process is stable.

In [ ]:
callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5))

3) Using a Learning Rate scheduler:

We reduce the learning rate whenever the loss plateaus.

In [8]:
callbacks.append(keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1))

4) Using a LARS optimizer when training with large batches:

Training with a large batch size (bigger than 8K) often results in lower model accuracy, and even with learning rate scaling and warm-up, the training might diverge.
Using a LARS optimizer can prevent this.
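
For intuition, LARS (Layer-wise Adaptive Rate Scaling) rescales each layer's update by a trust ratio derived from the ratio between the weight norm and the gradient norm. A minimal NumPy sketch of the update rule for a single layer (without momentum, with illustrative default values) could look like this:

In [ ]:
import numpy as np

def lars_step(w, grad, lr=0.001, eta=0.001, weight_decay=1e-4):
    # Illustrative LARS update for one layer's weights (no momentum)
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    # Layer-wise trust ratio: the step shrinks when gradients are large
    # relative to the weights, and grows when they are small
    trust_ratio = eta * w_norm / (g_norm + weight_decay * w_norm + 1e-9)
    return w - lr * trust_ratio * (grad + weight_decay * w)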

An implementation of this optimizer can be found here.
Once this code has been added to your script, you can use it as usual:

In [ ]:
opt = LARS(lr=0.001)
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss='mse')