When building a machine learning model, you will usually experiment with several versions of it to try out modifications to hyperparameter values or to the architecture. Juggling different versions without organizing your training runs can be quite a headache and significantly slow down your workflow, especially if you want to iterate quickly over simple modifications.

Ideally, you would want a system that allows you to easily keep track of and refer to all your model versions. This blog post will present the system I use to organize my training runs and keep my aspirin consumption low.

Argparse is your best friend

When I first started training neural networks as a Python newbie, I kept all my variable parameter definitions at the top of my code. This way, I thought, I could find them quickly whenever I needed to tweak a specific value. This turned out to be a pretty bad idea for two reasons:

  • Since my model was trained on a remote machine, I needed to send every updated version of my code over ssh before running it. This led to unfortunate mistakes like forgetting to send the last version before restarting a training run, and wondering why the changes I thought I had made to the parameters had no effect.
  • git commits were littered with unimportant parameter changes, since each new set of parameters meant a new version of the training script. I wanted to keep my commits clean and reserved for meaningful architecture changes.

Since then, I have switched to argparse, a module from the Python standard library that handles command-line arguments and options. Parameter values are passed on the command line when running the script and collected by an ArgumentParser object, which returns them as attributes of the args namespace. Here is an example of how I use argparse in one of my training scripts to control the parameters I change most often:

import argparse
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument(
    'run_number', type=int,
    help='Run number')
parser.add_argument(
    '--main_axis', choices=['x', 'y', 'z'], default='x',
    help='Main axis')
parser.add_argument(
    '--samples', type=int, default=4,
    help='Number of training images sampled at each iteration')
parser.add_argument(
    '--crops_per_block', type=int, default=10,
    help='Number of random crops per training image')
parser.add_argument(
    '--epochs', type=int, default=100,
    help='Number of epochs')
parser.add_argument(
    '--steps_train', type=int, default=100,
    help='Steps per epoch during training')
parser.add_argument(
    '--steps_valid', type=int, default=50,
    help='Steps per epoch during validation')
parser.add_argument(
    '--initial_LR', type=float, default=0.01,
    help='Initial learning rate')
parser.add_argument(
    '--decay', type=float, default=0.8,
    help='LR decay every 10 epochs')

args = parser.parse_args()

...

history = model.fit_generator(train_generator,
                              steps_per_epoch=args.steps_train,
                              epochs=args.epochs,
                              validation_data=val_generator,
                              validation_steps=args.steps_valid,
                              callbacks=callbacks_list)
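
As a quick sanity check, parse_args can also be fed an explicit argument list, which shows how a given command line is interpreted by the parser defined above (the values here are purely illustrative):

# Equivalent to running: python training.py 3 --epochs 50 --initial_LR 0.005
args = parser.parse_args(['3', '--epochs', '50', '--initial_LR', '0.005'])
print(args.run_number)  # 3
print(args.epochs)      # 50
print(args.initial_LR)  # 0.005
print(args.main_axis)   # 'x' (default value)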

Note the run_number argument, which brings me to my next point: numbering your training runs.

Keep it numbered

A reliable referencing system for your training runs is crucial to avoid making a mess of your training history. A simple yet effective approach is to number them, giving every training run a new number by running python training.py [run_number] with the help of argparse. I keep a spreadsheet where I record the details of each run, such as the hyperparameter values and remarks on the results, so I can remember which number corresponds to which model version.
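
If you would rather keep that record next to the training script, here is a minimal sketch, not taken from the original post, that appends one line per run to a CSV file from the parsed arguments (log_run and runs.csv are hypothetical names):

import csv
import os

# Hypothetical helper: append one row per run to a CSV file that plays the
# role of the spreadsheet. Assumes the set of arguments stays the same
# between runs so the columns line up.
def log_run(params, remarks='', path='runs.csv'):
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(list(params.keys()) + ['remarks'])
        writer.writerow(list(params.values()) + [remarks])

# e.g. log_run(vars(args), remarks='baseline, decay lowered to 0.8')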

This numbering system helps me organize:

  1. Tensorboard logs: I use Tensorboard to monitor my training and validation losses (a separate write-up covers how to set it up on a remote machine such as the compute clusters at CERFACS),
  2. Model weights: they are saved at the end of the training by the ModelCheckpoint callback.

This is what my training directory looks like:

  • training.py
  • prediction.py
  • log_dir/
    • run1/
      • training/
      • validation/
    • run2/
      • training/
      • validation/
  • save_model/
    • run1/
    • run2/

For every training run, Tensorboard log and model weight directories are automatically created by this piece of code:

import os
import shutil

# Tensorboard log directories for this run
my_log_dir = './log_dir/run'+str(args.run_number)+'/'
train_log_dir = os.path.join(my_log_dir, 'training/')
val_log_dir = os.path.join(my_log_dir, 'validation/')

# If the run log directory already exists, empty it; otherwise create it
if os.path.exists(os.path.dirname(my_log_dir)):
    for file in os.listdir(my_log_dir):
        file_path = os.path.join(my_log_dir, file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print(e)
else:
    os.makedirs(os.path.dirname(my_log_dir))
os.makedirs(os.path.dirname(train_log_dir))
os.makedirs(os.path.dirname(val_log_dir))

# Directory where the model weights for this run are saved
save_dir = './save_model/run'+str(args.run_number)+'/'
if os.path.exists(os.path.dirname(save_dir)):
    for file in os.listdir(save_dir):
        file_path = os.path.join(save_dir, file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print(e)
else:
    os.makedirs(os.path.dirname(save_dir))

Be aware that it overwrites existing run subfolders, so you need to be on top of your numbering!
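
The post does not show how callbacks_list is built, so here is only a minimal sketch of how these per-run directories can be hooked up to the standard Keras TensorBoard and ModelCheckpoint callbacks. The checkpoint filename pattern and monitored metric are my assumptions, and the separate training/ and validation/ log directories created above hint that the actual script wires its loggers slightly differently:

from keras.callbacks import ModelCheckpoint, TensorBoard

# Sketch only: write Tensorboard events under the run-numbered log directory
# and save improving weights under the run-numbered save directory.
callbacks_list = [
    TensorBoard(log_dir=my_log_dir),
    ModelCheckpoint(
        os.path.join(save_dir, 'weights.{epoch:03d}-{val_loss:.4f}.hdf5'),
        monitor='val_loss', save_best_only=True),
]

With save_best_only and a zero-padded epoch number like this, an alphabetical sort of the saved files, which the prediction script below relies on, puts the best checkpoint last.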

This directory organization leads to a tidy Runs section in Tensorboard when it is pointed at log_dir:

[Image: Tensorboard Runs section]

The run number can also be integrated into your prediction script to quickly load a saved model by its number with python prediction.py [run_number]:

import argparse
import glob
import os

# load_model comes from Keras (adjust the import if you use tf.keras)
from keras.models import load_model

parser = argparse.ArgumentParser(
    formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument(
    'run_number', type=int,
    help='Run number')
args = parser.parse_args()

# Load the last checkpoint saved for this run
model_path = os.path.join('save_model', 'run'+str(args.run_number))
name_models = sorted(glob.glob(model_path+'/*'))
best_model = name_models[-1]
model = load_model(best_model)
# Use model for predictions...

Finally, Nicolas wrote a neat function that generates a text file containing tons of details on the model being trained:

def write_report():
    infopath = "save_model/run" + str(run_number) + "/info.txt"
    with open(infopath, 'w') as fh:
        fh.write("Training parameters : \n")
        fh.write("Image dimensions : " + str(img_dimension[0]) + ", " + str(img_dimension[1]) + "\n")
        fh.write("Epochs - LR - Decay : " + str(nb_epochs) + " - " + str(LR) + " - " + str(Decay) + "\n")
        fh.write("Batch_size - Steps_train - Steps_valid : " + str(batch_size) + " - " + str(steps_train) + " - " + str(steps_valid) +"\n")
        fh.write("Final loss - val_loss : " + str(min(history.history['loss'])) + " - " + str(min(history.history['val_loss'])) + "\n")
        fh.write("Network architecture : \n")
        # Pass the file handle in as a lambda function to make it callable
        model.summary(print_fn=lambda x: fh.write(x + '\n'))
