Keras: Starting, stopping, and resuming training

In this tutorial, you will learn how to use Keras to train a neural network, stop training, update your learning rate, and then resume training from where you left off using the new learning rate. Using this method you can increase your accuracy while decreasing model loss.

Today’s tutorial is inspired by a question I received from PyImageSearch reader, Zhang Min.

Zhang Min writes:

Hi Adrian, thanks for the PyImageSearch blog. I have two questions:

First, I am working on my graduation project and my university is allowing me to share time on their GPU machines. The problem is that I can only access a GPU machine in two hour increments — after my two hours is up I’m automatically booted off the GPU. How can I save my training progress, safely stop training, and then resume training from where I left off?

Secondly, my initial experiments aren’t going very well. My model quickly jumps to 80%+ accuracy but then stays there for another 50 epochs. What else can I be doing to improve my model accuracy? My advisor said I should look into adjusting the learning rate but I’m not really sure how to do that.

Thanks Adrian!

Learning how to start, stop, and resume training a deep learning model is a super important skill to master — at some point in your deep learning practitioner career you’ll run into a situation similar to Zhang Min’s where:

  • You have limited time on a GPU instance (which can happen on Google Colab or when using Amazon EC2’s cheaper spot instances).
  • Your SSH connection is broken and you forgot to use a terminal multiplexer to save your session (such as screen or tmux).
  • Your deep learning rig locks up and is forced to shut down.

Just imagine spending an entire week to train a state-of-the-art deep neural network…only to have your model lost due to a power failure!

Luckily, there’s a solution — but when those situations happen you need to know how to:

  1. Take a snapshotted model that was saved/serialized to disk during training.
  2. Load the model into memory.
  3. Resume training from where you left off.

Secondly, starting, stopping, and resuming training is standard practice when manually adjusting the learning rate:

  1. Start training your model until loss/accuracy plateau
  2. Snapshot your model every N epochs (typically N={1, 5, 10})
  3. Stop training, normally by force exiting via ctrl + c
  4. Open your code editor and adjust your learning rate (typically lowering it by an order
    of magnitude)
  5. Go back to your terminal and restart the training script, picking up from the last
    snapshot of model weights

Using this ctrl + c method of training you can boost your model accuracy while simultaneously driving down loss, leading to a more accurate model.

The ability to adjust the learning rate is a critical skill for any deep learning practitioner to master, so take the time now to study and practice it!

To learn how to start, stop, and resume training with Keras, just keep reading!


Keras: Starting, stopping, and resuming training

In the first part of this blog post, we’ll discuss why we would want to start, stop, and resume training of a deep learning model.

We’ll also discuss how stopping training to lower your learning rate can improve your model accuracy (and why a learning rate schedule/decay may not be sufficient).

From there we’ll implement a Python script to handle starting, stopping, and resuming training with Keras.

I’ll then walk you through the entire training process, including:

  1. Starting the initial training script
  2. Monitoring loss/accuracy
  3. Noticing when loss/accuracy is plateauing
  4. Stopping training
  5. Lowering your learning rate
  6. Resuming training from where you left off with the new, lowered learning rate

Using this method of training you’ll often be able to improve your model accuracy.

Let’s go ahead and get started!

Why do we need to start, stop, and resume training?

There are a number of reasons you may need to start, stop, and resume training of your deep learning model, but the two primary ones are:

  1. Your training session being terminated and training stopping (due to a power outage, GPU session timing out, etc.).
  2. Needing to adjust your learning rate to improve model accuracy (typically by lowering the learning rate by an order of magnitude).

The second point is especially important — if you go back and read the seminal AlexNet, SqueezeNet, ResNet, etc. papers you’ll find that the authors all say something along the lines of:

We started training our model with the SGD optimizer and an initial learning rate of 1e-1. We reduced our learning rate by an order of magnitude on epochs 30 and 50, respectively.
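The schedule described in that quote can be written down as a small function. This is a minimal sketch, not code taken from any of those papers: the name step_lr and its parameters are hypothetical, and the defaults (an initial rate of 1e-1 with drops at epochs 30 and 50) are simply the values from the quote above.

```python
def step_lr(epoch, initial_lr=1e-1, milestones=(30, 50), factor=0.1):
    """Return the learning rate for a 0-indexed epoch.

    The rate is multiplied by `factor` once for every milestone
    epoch that has already been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr * (factor ** drops)


# Print the rate just before and just after each milestone.
for epoch in (0, 29, 30, 49, 50):
    print(epoch, step_lr(epoch))
```

With ctrl + c training you apply the same drops by hand, choosing the milestones while you watch the plot rather than fixing them up front.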

Why is the drop in learning rate so important? And how can it lead to a more accurate model?

To explore that question, take a look at the following plot of ResNet-18 trained on the CIFAR-10 dataset:

Figure 1: Training ResNet-18 on the CIFAR-10 dataset. The characteristic drops in loss and increases in accuracy are evidence of learning rate changes. Here, (1) training was stopped on epochs 30 and 50, (2) the learning rate was lowered, and (3) training was resumed. (image source)

Notice for epochs 1-29 there is a fairly “standard” curve that you come across when training a network:

  1. Loss starts off very high but then quickly drops
  2. Accuracy starts off very low but then quickly rises
  3. Eventually loss and accuracy plateau out

But what is going on around epoch 30?

Why does the loss drop so dramatically? And why does the accuracy rise so considerably?

The reason for this behavior is that:

  1. Training was stopped
  2. The learning rate was lowered by an order of magnitude
  3. And then training was resumed

The same goes for epoch 50 as well — again, training was stopped, the learning rate lowered, and then training resumed.

Each time we encounter a characteristic drop in loss and then a small increase in accuracy.
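One intuition for that drop can be reproduced with a toy problem. The sketch below is only an illustration under stated assumptions, not the ResNet experiment from Figure 1: it runs SGD on the one-dimensional loss f(w) = 0.5 * w**2 with artificial Gaussian noise added to the gradient (standing in for mini-batch noise). With a large learning rate the weight keeps bouncing around the minimum, so the loss plateaus at a "noise floor"; lowering the learning rate shrinks the bounce and the plateau drops. The names run_phase and mean_tail are hypothetical.

```python
import random


def run_phase(w, lr, steps, rng, noise=1.0):
    """Run noisy SGD on f(w) = 0.5 * w**2 and return (final w, list of losses)."""
    losses = []
    for _ in range(steps):
        # The true gradient is w; the Gaussian term mimics mini-batch noise.
        grad = w + rng.gauss(0.0, noise)
        w -= lr * grad
        losses.append(0.5 * w * w)
    return w, losses


def mean_tail(values, n=1000):
    """Average of the last n values (the plateau level)."""
    tail = values[-n:]
    return sum(tail) / len(tail)


rng = random.Random(0)

# Phase 1: high learning rate until the loss plateaus.
w, high_lr_losses = run_phase(w=5.0, lr=1e-1, steps=2000, rng=rng)

# Phase 2: "stop", lower the rate by an order of magnitude, and resume
# from the same weight.
w, low_lr_losses = run_phase(w=w, lr=1e-2, steps=2000, rng=rng)

print("plateau loss at lr=1e-1:", mean_tail(high_lr_losses))
print("plateau loss at lr=1e-2:", mean_tail(low_lr_losses))
```

The second plateau should come out well below the first, which mirrors the step-like drops in the plot.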

As the learning rate becomes smaller, each further reduction has less and less impact.

Eventually, we run into two issues:

  1. The learning rate becomes very small which in turn makes the weight updates very small and thus the model cannot make any meaningful progress.
  2. We start to overfit due to the small learning rate. The model descends into areas of lower loss in the loss landscape, overfitting to the training data and not generalizing to the validation data.

The overfitting behavior is evident past epoch 50 in Figure 1 above.

Notice how validation loss has plateaued and has even started to rise a bit. At the same time, training loss is continuing to drop, a clear sign of overfitting.
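If you log per-epoch losses, the same check can be done programmatically. The helper below is a rough heuristic sketch (the name is_overfitting, the window size, and the sample numbers are all made up for illustration); it flags the case where training loss fell while validation loss rose over the last few epochs.

```python
def is_overfitting(train_loss, val_loss, window=5):
    """Heuristic: over the last `window` epochs, did training loss fall
    while validation loss rose?"""
    if len(train_loss) < window or len(val_loss) < window:
        return False  # not enough history to judge
    train_delta = train_loss[-1] - train_loss[-window]
    val_delta = val_loss[-1] - val_loss[-window]
    return train_delta < 0 and val_delta > 0


# Made-up numbers: training loss keeps dropping, validation loss turns upward.
train = [0.50, 0.45, 0.41, 0.38, 0.35, 0.33]
val = [0.60, 0.58, 0.58, 0.59, 0.61, 0.63]
print(is_overfitting(train, val))
```

A single noisy epoch can fool a check like this, so treat it as a prompt to look at the plot rather than as a stopping rule.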

Dropping your learning rate is a great way to boost the accuracy of your model during training, just realize there is (1) a point of diminishing returns, and (2) a chance of overfitting if training is not properly monitored.

Why not use learning rate schedulers or decay?

Figure 2: Learning rate schedulers are great for some training applications; however, starting/stopping Keras training typically leads to more control over your deep learning model.

You might be wondering “Why not use a learning rate scheduler?”

There are a number of learning rate schedulers available to us, including standard learning rate schedules/decay and Cyclical Learning Rates (CLRs).

If the goal is to improve model accuracy by dropping the learning rate, then why not just rely on those respective schedules and classes?

Great question.

The problem is that you may not have a good idea of:

  • The approximate number of epochs to train for
  • What a proper initial learning rate is
  • What learning rate range to use for CLRs

Additionally, one of the benefits of using what I call ctrl + c training is that it gives you more fine-grained control over your model.

Being able to manually stop your training at a specific epoch, adjust your learning rate, and then resume training from where you left off (and with the new learning rate) is something most learning rate schedulers will not allow you to do.

Once you’ve run a few experiments with ctrl + c training you’ll have a good idea of what your hyperparameters should be. When that happens, you can start incorporating hardcoded learning rate schedules to boost your accuracy even further.

Finally, keep in mind that nearly all seminal CNN papers that were trained on ImageNet used a method to start/stop/resume training.

Just because other methods exist doesn’t make them inherently better — as a deep learning practitioner, you need to learn how to use ctrl + c training along with learning rate scheduling (don’t rely strictly on the latter).

If you’re interested in learning more about ctrl + c training, along with my tips, suggestions, and best practices when training your own models, be sure to refer to my book, Deep Learning for Computer Vision with Python.

Project structure

Let’s review our project structure:

Today we will review train.py, our training script. This script trains ResNet on Fashion MNIST.

The key to this training script is that it uses two “callbacks”, implemented in epochcheckpoint.py and trainingmonitor.py. I review these callbacks in detail inside Deep Learning for Computer Vision with Python. They aren’t covered today, but I encourage you to review the code.

These two callbacks allow us to (1) save our model at the end of every N-th epoch so we can resume training on demand, and (2) output our training plot at the conclusion of each epoch, ensuring we can easily monitor our model for signs of overfitting.

The model checkpoints (i.e., saved models) are written to the output/checkpoints/ directory, and the training history is written to the accompanying JSON file. The training plot is overwritten upon each epoch end as resnet_fashion_mnist.png. We’ll be paying close attention to the training plot to determine when to stop training.

Implementing the training script

Let’s get started implementing our Python script that will be used for starting, stopping, and resuming training with Keras.

This guide is written for intermediate practitioners, even though it teaches an essential skill. If you are new to Keras or deep learning, or maybe you just need to brush up on the basics, definitely check out my Keras Tutorial first.

Open up a new file, name it train.py, and insert the following code:

Lines 2-19 import our required packages, namely our EpochCheckpoint and TrainingMonitor callbacks. We also import our fashion_mnist dataset and ResNet CNN. Importing keras.backend as K will allow us to retrieve and set our learning rate.

Now let’s go ahead and parse command line arguments:

Our command line arguments include:

  • --checkpoints : The path to our output checkpoints directory.
  • --model : The optional path to a specific model checkpoint to load when resuming training.
  • --start-epoch : The optional start epoch, which can be provided if you are resuming training. By default, training starts at epoch 0.
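A minimal argparse sketch matching those three flags might look like the following. The flag names and the default of 0 come from the list above; the help strings, the build_parser name, and the choice to make --checkpoints required are my assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three flags described above."""
    ap = argparse.ArgumentParser(description="Start or resume training")
    ap.add_argument("--checkpoints", required=True,
                    help="path to output checkpoints directory")
    ap.add_argument("--model", type=str, default=None,
                    help="optional path to a specific model checkpoint to load")
    ap.add_argument("--start-epoch", type=int, default=0,
                    help="epoch to restart training at")
    return ap


# An explicit argument list is parsed here so the sketch runs without a
# real command line; in a script you would call parse_args() with no arguments.
args = vars(build_parser().parse_args(["--checkpoints", "output/checkpoints"]))
print(args)
```

Note that argparse turns --start-epoch into the key start_epoch, so the value is read back as args["start_epoch"].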

Let’s go ahead and load our dataset:

Line 34 loads Fashion MNIST.

Lines 38-48 then preprocess the data including (1) resizing to 32×32, (2) scaling pixel intensities to the range [0, 1], and (3) adding a channel dimension.
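To make those three steps concrete, here is a NumPy-only sketch. It is not the tutorial's listing: random uint8 arrays stand in for the 28×28 Fashion MNIST images, a simple nearest-neighbour index lookup stands in for a proper image resize routine, and the function name preprocess is hypothetical.

```python
import numpy as np


def preprocess(images, size=32):
    """Resize (N, H, W) uint8 images to (N, size, size, 1) floats in [0, 1]."""
    _, h, w = images.shape
    # (1) Nearest-neighbour resize: pick a source row/column for each target pixel.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = images[:, rows[:, None], cols[None, :]]
    # (2) Scale pixel intensities from [0, 255] to [0, 1].
    scaled = resized.astype("float32") / 255.0
    # (3) Add a trailing channel dimension for grayscale input.
    return scaled[..., np.newaxis]


# Random data standing in for a batch of four 28x28 grayscale images.
fake_batch = np.random.default_rng(0).integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
out = preprocess(fake_batch)
print(out.shape, out.dtype)
```

The trailing channel dimension is what lets a CNN that expects height × width × channels input accept grayscale images.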

From here we’ll (1) binarize our labels, and (2) initialize our data augmentation object:
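Label binarization for a 10-class problem like Fashion MNIST amounts to one-hot encoding. The NumPy sketch below shows the idea; the helper name one_hot is hypothetical and this is not necessarily how train.py does it. The data augmentation object is Keras-specific and is not sketched here.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (N,) into an (N, num_classes) one-hot matrix."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # Set a single 1.0 per row, in the column given by that row's label.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


encoded = one_hot([0, 3, 9])
print(encoded)
```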

And now to the code for loading model checkpoints:

If no model checkpoint is supplied then we need to initialize the model (Lines 62-68). Notice that we specify our initial learning rate as 1e-1 on Line 64.

Otherwise, Lines 71-81 load the model checkpoint (i.e. a model that was previously stopped via ctrl + c) and update the learning rate. Line 79 will be the line you edit whenever you want to update the learning rate.
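The load-and-update step can be sketched roughly as follows. This is not the listing from train.py: it assumes TensorFlow 2.x's bundled Keras, and it assumes the optimizer stores its learning rate as a plain variable (not a schedule object), so that .numpy() and .assign() are available. The function names are hypothetical.

```python
def set_learning_rate(model, new_lr):
    """Overwrite the optimizer's learning rate in place; return (old, new)."""
    # Assumption: learning_rate is a variable exposing numpy() and assign().
    lr_var = model.optimizer.learning_rate
    old_lr = float(lr_var.numpy())
    lr_var.assign(new_lr)
    return old_lr, float(lr_var.numpy())


def load_checkpoint(path, new_lr):
    """Load a serialized model from disk and give it a new learning rate."""
    # Imported here so set_learning_rate can be used without TensorFlow loaded.
    from tensorflow.keras.models import load_model

    model = load_model(path)
    old_lr, lr = set_learning_rate(model, new_lr)
    print("[INFO] old learning rate: {}".format(old_lr))
    print("[INFO] new learning rate: {}".format(lr))
    return model
```

A call such as load_checkpoint(args["model"], 1e-2) would then return a model ready to resume training at the lowered rate. Because a full-model save normally includes the optimizer state, the old rate comes back with the checkpoint, which is why it has to be overwritten after loading.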

Next, we’ll construct our callbacks:

Lines 84 and 85 specify our plot and JSON paths.

Lines 88-93 construct two callbacks, putting them directly into a list:

  • EpochCheckpoint : This callback is responsible for saving our model as it currently stands at regular epoch intervals (every 5 epochs in this script). That way, if we stop training via ctrl + c (or an unforeseeable power failure), we lose at most a few epochs of our machine’s work. When training complex models on huge datasets, this could quite literally save you days of time.
  • TrainingMonitor : A callback that saves our training accuracy/loss information as a PNG image plot and JSON dictionary. We’ll be able to open our training plot at any time to see our training progress — valuable information to you as the practitioner, especially for multi-day training processes.
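One detail worth thinking about when resuming: if you restart from an earlier snapshot, any logged history past that epoch no longer matches the model you loaded. The sketch below shows one way such a log could be trimmed. It is not the TrainingMonitor implementation; the JSON layout (metric name mapped to a list of per-epoch values), the function name load_history, and the sample numbers are all assumptions.

```python
import json
import os
import tempfile


def load_history(json_path, start_at=0):
    """Load a {"metric": [value per epoch]} log, dropping epochs >= start_at."""
    if not os.path.exists(json_path):
        return {}  # fresh run, nothing logged yet
    with open(json_path) as f:
        history = json.load(f)
    if start_at > 0:
        # Keep only the entries recorded before the epoch we resume from.
        history = {name: values[:start_at] for name, values in history.items()}
    return history


# Write a made-up five-epoch log, then resume from epoch 3.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "history.json")
    with open(path, "w") as f:
        json.dump({"loss": [0.9, 0.6, 0.5, 0.45, 0.44]}, f)
    trimmed = load_history(path, start_at=3)

print(trimmed)
```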

Again, please review epochcheckpoint.py and trainingmonitor.py on your own time for the details and/or if you need to add functionality. I cover these callbacks in detail inside Deep Learning for Computer Vision with Python.

Finally, we have everything we need to start, stop, and resume training. This last block actually starts or resumes training:

Our call to .fit_generator fits/trains our model using our data augmentation object and our callbacks (Lines 97-103). Be sure to review my tutorial on Keras’ fit_generator for more details on how the .fit_generator function is used to train our model.

I’d like to call your attention to the epochs parameter (Line 101) — when you adjust your learning rate you’ll typically want to update the epochs as well. Typically you should over-estimate the number of epochs as you’ll see in the next three sections.
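A related option, if you want Keras' own epoch counter to continue from where you stopped, is the initial_epoch argument of fit. When it is set, epochs means the epoch index at which training stops, not the number of additional epochs. The sketch below is an alternative under that assumption and not necessarily what train.py does; the wrapper name resume_fit and its parameters are hypothetical. In recent TensorFlow releases fit accepts generators directly and fit_generator is deprecated, so fit is used here.

```python
def resume_fit(model, train_data, val_data, start_epoch, extra_epochs, callbacks=None):
    """Continue training a compiled Keras model for `extra_epochs` more epochs.

    `train_data` is assumed to be something `fit` can iterate over on its own,
    such as a finite-length batch iterator or dataset.
    """
    return model.fit(
        train_data,
        validation_data=val_data,
        initial_epoch=start_epoch,          # index of the first epoch to run
        epochs=start_epoch + extra_epochs,  # index at which to stop, not a count
        callbacks=callbacks or [],
        verbose=1,
    )
```

For example, resuming at epoch 40 for 10 more epochs passes initial_epoch=40 and epochs=50. Passing epochs=40 together with initial_epoch=40 would train for zero epochs.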

For a more detailed explanation of starting, stopping, and resuming training (along with the implementations of my EpochCheckpoint and TrainingMonitor classes), be sure to refer to Deep Learning for Computer Vision with Python.

Phase #1: 40 epochs at 1e-1

Make sure you’ve used the “Downloads” section of this blog post to download the source code to this tutorial.

From there, open up a terminal and execute the following command:

Figure 3: Phase 1 of training ResNet on the Fashion MNIST dataset with a learning rate of 1e-1 for 40 epochs before we stop via ctrl + c, adjust the learning rate, and resume Keras training.

Here I’ve started training ResNet on the Fashion MNIST dataset using the SGD optimizer and an initial learning rate of 1e-1.

After every epoch my loss/accuracy plot in Figure 3 updates, enabling me to monitor training in real-time.

Past epoch 20 we can see training and validation loss starting to diverge, and by epoch 40 I decided to ctrl + c out of the train.py script.

Phase #2: 10 epochs at 1e-2

The next step is to update both:

  1. My learning rate
  2. The number of epochs to train for

For the learning rate, the standard practice is to lower it by an order of magnitude.

Going back to Line 64 of train.py we can see that my initial learning rate is 1e-1:

I’m now going to update my learning rate to be 1e-2 on Line 79:

So, why am I updating Line 79 and not Line 64?

The reason is due to the if/else statement.

The else statement handles when we need to load a specific checkpoint from disk — once we have the checkpoint we’ll resume training, thus the learning rate needs to be updated in the else block.

Secondly, I also update my epochs on Line 101. Initially, the epochs value was 80:

I have decided to lower the number of epochs to train for to 40:

Typically you’ll set the epochs value to be much larger than what you think it should actually be.

The reason is that we’re using the EpochCheckpoint class to save model snapshots every 5 epochs. If at any point we decide we’re unhappy with the training progress, we can just ctrl + c out of the script and go back to a previous snapshot.

Thus, there is no harm in training for longer since we can always resume training from a previous model weight file.
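Picking the snapshot to go back to can be scripted. The sketch below assumes a hypothetical file naming scheme of epoch_<N>.hdf5 inside the checkpoints directory (adjust the pattern to whatever your checkpoint callback actually writes); the function name latest_snapshot is also an assumption.

```python
import os
import re
import tempfile

# Hypothetical naming scheme for snapshot files.
SNAPSHOT_RE = re.compile(r"^epoch_(\d+)\.hdf5$")


def latest_snapshot(checkpoint_dir, at_or_before=None):
    """Return (epoch, path) of the newest snapshot, optionally capped at an epoch.

    Returns None if no matching snapshot exists.
    """
    best = None
    for name in os.listdir(checkpoint_dir):
        match = SNAPSHOT_RE.match(name)
        if not match:
            continue  # ignore unrelated files
        epoch = int(match.group(1))
        if at_or_before is not None and epoch > at_or_before:
            continue
        if best is None or epoch > best[0]:
            best = (epoch, os.path.join(checkpoint_dir, name))
    return best


# Simulate snapshots saved every 5 epochs up to epoch 40.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in range(5, 45, 5):
        open(os.path.join(tmp, "epoch_{}.hdf5".format(epoch)), "w").close()
    newest = latest_snapshot(tmp)
    rollback = latest_snapshot(tmp, at_or_before=32)

print(newest[0], rollback[0])
```

The at_or_before argument covers the "unhappy with the training progress" case: if things went wrong around epoch 32, the nearest usable snapshot is the one from epoch 30.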

After both my learning rate and the number of epochs to train for were updated, I then executed the following command:

Figure 4: Phase 2 of Keras start/stop/resume training. The learning rate is dropped from 1e-1 to 1e-2 as is evident in the plot at epoch 40. I continued training for 10 more epochs until I noticed validation metrics plateauing at which point I stopped training via ctrl + c again.

Notice how we’ve updated our learning rate from 1e-1  to 1e-2  and then resumed training.

We immediately see a drop in both training/validation loss as well as an increase in training/validation accuracy.

The problem here is that our validation metrics have plateaued, so there may not be many more gains left without risking overfitting. Because of this, I only allowed training to continue for another 10 epochs before once again ctrl + c-ing out of the script.

Phase #3: 5 epochs at 1e-3

For the final phase of training I decided to:

  1. Lower my learning rate from 1e-2 to 1e-3.
  2. Allow training to continue (but knowing I would likely only be training for a few epochs given the risk of overfitting).

After updating my learning rate, I executed the following command:

Figure 5: Upon resuming Keras training for phase 3, I only let the network train for 5 epochs because no significant learning progress is being made. Using a start/stop/resume training approach with Keras, we have achieved 94.15% validation accuracy.

At this point the learning rate has become so small that the corresponding weight updates are also very small, implying that the model cannot learn much more.

I only allowed training to continue for 5 epochs before killing the script. However, looking at my final metrics you can see that we are obtaining 97.25% training accuracy along with 94.15% validation accuracy.

We were able to achieve this result by using our start, stop, and resume training method.

At this point, we could either continue to tune our learning rate, utilize a learning rate scheduler, apply Cyclical Learning Rates, or try a new model architecture altogether.

Where can I learn more deep learning tips, suggestions, and best practices?

Figure 6: My deep learning book is the go-to resource for deep learning students, developers, researchers, and hobbyists, alike. Use the book to build your skillset from the bottom up, or read it to gain a deeper understanding. Don’t be left in the dust as the fast-paced AI revolution continues to accelerate.

Today’s tutorial introduced you to starting, stopping, and resuming training with Keras.

If you’re looking for more of my tips, suggestions, and best practices when training deep neural networks, be sure to refer to my book, Deep Learning for Computer Vision with Python.

Inside the book I cover:

  1. Deep learning fundamentals and theory without unnecessary mathematical fluff. I present the basic equations and back them up with code walkthroughs that you can implement and easily understand.
  2. How to spot underfitting and overfitting while you’re using the TrainingMonitor callback.
  3. Recommendations and best practices for selecting learning rates.
  4. My tips/tricks, suggestions, and best practices for training CNNs.

Besides content on learning rates, you’ll also find:

  • Super practical walkthroughs that present solutions to actual, real-world image classification, object detection, and segmentation problems.
  • Hands-on tutorials (with lots of code) that not only show you the algorithms behind deep learning for computer vision but their implementations as well.
  • A no-nonsense teaching style that is guaranteed to help you master deep learning for image understanding and visual recognition.

To learn more about the book, and grab the table of contents + free sample chapters, just click here!

Summary

In this tutorial you learned how to start, stop, and resume training using Keras and Deep Learning.

Learning how to resume from where your training left off is a super valuable skill for two reasons:

  1. It ensures that if your training script crashes, you can pick up again from the most recent model checkpoint.
  2. It enables you to adjust your learning rate and improve your model accuracy.

When training your own custom neural networks you’ll want to monitor your loss and accuracy. Once you start to see validation loss/accuracy plateau, try killing the training script, lowering your learning rate by an order of magnitude, and then resuming training.
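If you would rather not judge the plateau purely by eye, a small helper over the logged validation loss can act as a second opinion. This is a heuristic sketch only: the name has_plateaued, the patience and min_delta parameters, and the sample numbers are assumptions for illustration.

```python
def has_plateaued(values, patience=5, min_delta=1e-3):
    """True if the best of the last `patience` values improves on the best
    earlier value by less than `min_delta` (lower is better, e.g. loss)."""
    if len(values) <= patience:
        return False  # not enough history to compare
    best_before = min(values[:-patience])
    best_recent = min(values[-patience:])
    return best_before - best_recent < min_delta


# Made-up validation losses: a fast early drop, then a flat stretch.
flat = [1.0, 0.6, 0.4, 0.35, 0.349, 0.35, 0.351, 0.349, 0.35]
still_improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

print(has_plateaued(flat, patience=5, min_delta=0.01))
print(has_plateaued(still_improving, patience=3, min_delta=0.01))
```

When the helper reports a plateau, that is the cue to consider stopping, lowering the learning rate, and resuming.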

You’ll often find that this method of training can lead to higher accuracy models.

However, you should be wary of overfitting!

Lowering your learning rate enables your model to descend into lower areas of the loss landscape; however, there is no guarantee that these lower loss areas will still generalize!

You likely will only be able to drop the learning rate 1-3 times before either:

  1. The learning rate becomes too small, making the corresponding weight updates too small, and preventing the model from learning further.
  2. Validation loss stagnates or explodes while training loss continues to drop (implying that the model is overfitting).

If those cases occur and your model is still not satisfactory you should consider adjusting other hyperparameters of your model, including regularization strength, dropout, etc. You may want to explore other model architectures as well.

For more of my tips, suggestions, and best practices when training your own neural networks on your custom datasets, be sure to refer to Deep Learning for Computer Vision with Python, where I cover my best practices in-depth.

To download the source code to this tutorial (and be notified when future tutorials are published on the PyImageSearch blog), just enter your email address in the form below!


6 Responses to Keras: Starting, stopping, and resuming training

  1. Kenechukwu September 23, 2019 at 9:03 pm #

    Hi Adrian Rosebrock.

    Thanks for this practical and important tutorial. It will be helpful to know if you are training on GPU or CPU.

    Thanks.

    • Adrian Rosebrock September 25, 2019 at 8:44 am #

      You are welcome Kenechukwu!

  2. Abdullah September 29, 2019 at 2:49 am #

    Hi Adrian, thank you so much for this amazing and important topic. May I ask, is this method only used on a huge dataset or it can be used on a small one as well? Secondly, does it matter whether this is implemented on a GPU or a CPU?

    Thanks again
    Abdullah

    • Adrian Rosebrock September 30, 2019 at 2:48 pm #

      1. It can be used on any size dataset.
      2. It doesn’t matter if you’re using a CPU or GPU — it will work for either.

  3. david October 8, 2019 at 2:47 am #

    Thank you very much. I have been following bloggers for half a year and benefited a lot. I would like to ask if it is appropriate to give a series of lectures on OCR text recognition without tesseract?

    • Adrian Rosebrock October 10, 2019 at 10:18 am #

      Hey David — I will consider it in the future but cannot guarantee I will cover it.
