We will use two V100s on two different kraken nodes (krakengpu1 and krakengpu2) for this example.
You can install the latest version of Horovod with pip.
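For example (this is only the generic form; depending on the TensorFlow installation on the node, extra build flags may be needed):

pip install horovod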
Then, you will just have to add a few modifications to your code to distribute it on multiple GPUs:

1) Import Horovod and initialize it:

import horovod.keras as hvd

hvd.init()

2) Pin one GPU per process:

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

3) Wrap your optimizer with Horovod:

opt = hvd.DistributedOptimizer(opt)

4) Broadcast the initial variable states from rank 0 to all other processes:

callbacks = [
    # ... every other callback you already use,
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

5) Add ModelCheckpoint as a callback only on worker 0 if you use it:

if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
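Putting the five modifications together, a minimal sketch of a distributed training script could look like the following (the toy model, random data and hyperparameters are placeholders added to make the sketch self-contained, not part of the original example):

import numpy as np
import tensorflow as tf
import keras
from keras import backend as K
import horovod.keras as hvd

# 1) Initialize Horovod
hvd.init()

# 2) Pin one GPU per process, selected by the local rank on the node
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Placeholder model and data, only here to make the sketch runnable
model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(10,))])
x_train = np.random.rand(256, 10).astype('float32')
y_train = np.random.rand(256, 1).astype('float32')

# 3) Wrap the optimizer so gradients are averaged across workers
opt = keras.optimizers.Adam(lr=0.001)
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss='mse')

# 4) Broadcast the initial variable states from rank 0 to all other processes
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# 5) Only worker 0 writes checkpoints
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

model.fit(x_train, y_train, batch_size=32, epochs=10, callbacks=callbacks)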
Note that you will have to use Slurm and mpirun to execute your script, since we want to run one process per GPU node on kraken.
This can be done with the following job script, once connected to kraken:
#!/bin/bash
#SBATCH --partition gpu
#SBATCH --nodes=2

# one MPI process per node; each process pins its own GPU via hvd.local_rank()
mpirun -np 2 python training.py
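Assuming the script above is saved as, say, training_job.sh (the file name is only an illustration), it can be submitted from a kraken login node with sbatch training_job.sh; Slurm reserves the two GPU nodes and mpirun starts the two Python processes.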
Since the effective batch size grows with the number of workers, the learning rate is scaled by hvd.size() and the number of epochs is divided by the same factor:

LR = LR * hvd.size()
epochs = epochs // hvd.size()
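As a quick sanity check on that arithmetic (the base values below are purely illustrative):

# with the 2 GPUs of this example, hvd.size() == 2
LR = 0.001 * hvd.size()     # scaled learning rate: 0.002
epochs = 100 // hvd.size()  # each worker now runs 50 epochs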
We use a small learning rate at the beginning and gradually ramp it up to the scaled learning rate once the training process is stable:
callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5))
We also reduce the learning rate whenever the loss plateaus:
callbacks.append(keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1))
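For reference, here is one way the callbacks from the previous steps could be assembled; the broadcast callback is typically listed first so that every worker starts from the same weights (a sketch reusing the snippets above, nothing new is required):

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5),
    keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1),
]
if hvd.rank() == 0:
    # only worker 0 writes checkpoints, to avoid concurrent writes to the same file
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))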
An implementation of this optimizer can be found here.
Once this code has been added to your script, you can use it as usual:
opt = LARS(lr=0.001)                      # layer-wise adaptive rate scaling
opt = hvd.DistributedOptimizer(opt)       # gradients are still averaged across workers
model.compile(optimizer=opt, loss='mse')