Using Tensorboard from a remote server

By Victor Xing - 2019/04/01


Tensorboard is an interactive interface for Tensorflow that provides visualizations of your model's learning process. It is a powerful, highly customizable tool that is easy to integrate into a Keras or Tensorflow model. Check this repo out for a version that is compatible with PyTorch.

At CERFACS, your machine learning model will probably run on a remote machine such as one of the compute clusters or a workstation, which makes visualization with Tensorboard a bit tricky to set up. This post presents how to integrate Tensorboard into an existing Keras model and use the interface from your local machine.

Setting up Keras for Tensorboard

The Tensorboard package comes bundled with a Tensorflow installation via conda or pip. I recommend installing and updating it this way to keep it in sync with your Tensorflow version.

Tensorboard reads data written to log files by Tensorflow's FileWriter objects. In Keras, the TensorBoard callback handles this backend work for you. By default, the training and validation losses of your model are logged and plotted on separate graphs. Since I prefer having them on the same graph, I use the following class from this stackoverflow thread to subclass the default callback:

import os

import tensorflow as tf
from tensorflow import keras


class TrainValTensorBoard(keras.callbacks.TensorBoard):
    '''
    Plot training and validation losses on the same Tensorboard graph.
    Subclasses the default TensorBoard callback.
    '''
    def __init__(self, log_dir=my_log_dir, **kwargs):
        # my_log_dir is a string you define elsewhere in your code (see below)
        # Make the original `TensorBoard` log to a subdirectory 'training'
        training_log_dir = os.path.join(log_dir, 'training')
        super(TrainValTensorBoard, self).__init__(training_log_dir, **kwargs)

        # Log the validation metrics to a separate subdirectory
        self.val_log_dir = os.path.join(log_dir, 'validation')

    def set_model(self, model):
        # Setup writer for validation metrics
        self.val_writer = tf.summary.FileWriter(self.val_log_dir)
        super(TrainValTensorBoard, self).set_model(model)

    def on_epoch_end(self, epoch, logs=None):
        # Pop the validation logs and handle them separately with
        # `self.val_writer`. Also rename the keys so that they can
        # be plotted on the same figure with the training metrics
        logs = logs or {}
        val_logs = {k.replace('val_', ''): v for k, v in logs.items() if k.startswith('val_')}
        for name, value in val_logs.items():
            summary = tf.Summary()
            summary_value = summary.value.add()
            summary_value.simple_value = value.item()
            summary_value.tag = name
            self.val_writer.add_summary(summary, epoch)
        self.val_writer.flush()

        # Pass the remaining logs to `TensorBoard.on_epoch_end`
        logs = {k: v for k, v in logs.items() if not k.startswith('val_')}
        super(TrainValTensorBoard, self).on_epoch_end(epoch, logs)

    def on_train_end(self, logs=None):
        super(TrainValTensorBoard, self).on_train_end(logs)
        self.val_writer.close()

In this class, my_log_dir must be a string pointing to a directory that contains training/ and validation/ subdirectories, which respectively hold the training and validation loss logs. You may adapt this to your own organization, as long as the logs live directly in the directory given by the log_dir class argument or in one of its subdirectories. I strongly recommend setting up an organization where training and validation logs are saved with a simple referencing system like this one.
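As an illustration, here is one possible way to build such a layout. The helper name and the timestamped run-name scheme are my own suggestion, not part of the original setup:

```python
import datetime
from pathlib import Path


def make_log_dir(root='logs'):
    """Create a timestamped run directory containing the 'training' and
    'validation' subdirectories expected by TrainValTensorBoard."""
    run_name = datetime.datetime.now().strftime('run_%Y%m%d-%H%M%S')
    log_dir = Path(root) / run_name
    (log_dir / 'training').mkdir(parents=True, exist_ok=True)
    (log_dir / 'validation').mkdir(parents=True, exist_ok=True)
    return str(log_dir)


my_log_dir = make_log_dir()  # e.g. 'logs/run_20190401-101500'
```

With a scheme like this, each training run gets its own subdirectory under logs/, and Tensorboard pointed at logs/ will let you compare runs.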

After including the above class in your code, simply add TrainValTensorBoard() to your list of callbacks and start training your model.

Running Tensorboard from a remote machine

You can run Tensorboard while your model is training to monitor the live evolution of your training and validation losses. If you were working on your local machine, the Tensorboard interface would simply open in a web browser tab. Since we are working on a remote server, however, we need to forward a port of the remote machine to one of our local ports over ssh.

The default port used by Tensorboard is 6006. It is a good idea for each user of the remote machine to choose their own port number and stick with it, to avoid accidentally listening to someone else's Tensorboard instance.

Here is the list of the port numbers already reserved on kraken:

Please contact Victor to reserve your favorite number :).
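Before settling on a number, you can also check that nothing on the remote machine is already bound to it. A minimal sketch, using only the standard library (the helper name is mine, not part of the original post):

```python
import socket


def port_is_free(port, host='localhost'):
    """Return True if we can bind `port` on `host`, i.e. no other
    process (such as another Tensorboard instance) is using it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False
```

Run this on the remote machine; if port_is_free(6000) returns False, pick another number.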

Let us choose port number 6000. Now we can launch Tensorboard: tensorboard --logdir=path/to/log/dir --port=6000

On your local machine, set up ssh port forwarding to one of your unused local ports, for instance port 8898: ssh -NfL localhost:8898:localhost:6000 user@remote. The -L option maps local port 8898 to port 6000 on the remote machine, while -N and -f keep the tunnel running in the background without opening a remote shell.
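To check that the tunnel is alive before opening your browser, you can try connecting to its local end. A small sketch (the helper name is an assumption):

```python
import socket


def tunnel_is_up(port, host='localhost', timeout=1.0):
    """Return True if something accepts connections on host:port,
    e.g. the local end of the ssh tunnel forwarding to Tensorboard."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Here tunnel_is_up(8898) on your local machine should return True once the ssh command above has succeeded.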

Be aware of port conflicts with other applications like Jupyter Lab (default port 8888). You can see which processes are using port XXXX with the command lsof -ti:XXXX. If you need to free this specific port, locate and kill these processes with lsof -ti:XXXX | xargs kill -9.

Finally, go to localhost:8898 in your local web browser. The Tensorboard interface should pop up, starting on the Scalars tab that displays your training and validation curves on the same plot.

[Figure: your Tensorboard window]

I have only covered the basic use case of monitoring your training and validation losses, but Tensorboard can do much more, such as visualizing the graph of your model or displaying predictions on your validation set at each epoch. Have a look at this page for a few Tensorboard tutorials.