Why is my validation loss lower than my training loss?

In this tutorial, you will learn the three primary reasons your validation loss may be lower than your training loss when training your own custom deep neural networks.

I first became interested in studying machine learning and neural networks in late high school. Back then there weren’t many accessible machine learning libraries — and there certainly was no scikit-learn.

Every school day at 2:35 PM I would leave high school, hop on the bus home, and within 15 minutes I would be in front of my laptop, studying machine learning, and attempting to implement various algorithms by hand.

I rarely stopped for a break, more than occasionally skipping dinner just so I could keep working and studying late into the night.

During these late-night sessions I would hand-implement models and optimization algorithms (and in Java of all languages; I was learning Java at the time as well).

And since they were hand-implemented ML algorithms by a budding high school programmer with only a single calculus course under his belt, my implementations were undoubtedly prone to bugs.

I remember one night in particular.

The time was 1:30 AM. I was tired. I was hungry (since I skipped dinner). And I was anxious about my Spanish test the next day which I most certainly did not study for.

I was attempting to train a simple feedforward neural network to classify image contents based on basic color channel statistics (i.e., mean and standard deviation).

My network was training…but I was running into a very strange phenomenon:

My validation loss was lower than training loss!

How could that possibly be?

  • Did I accidentally switch the plot labels for training and validation loss? Potentially. I didn’t have a plotting library like matplotlib so my loss logs were being piped to a CSV file and then plotted in Excel. Definitely prone to human error.
  • Was there a bug in my code? Almost certainly. I was teaching myself Java and machine learning at the same time — there were definitely bugs of some sort in that code.
  • Was I just so tired that my brain couldn’t comprehend it? Also very likely. I wasn’t sleeping much during that time of my life and could have very easily missed something obvious.

But as it turns out, it was none of the above: my validation loss was legitimately lower than my training loss.

It took me until my junior year of college, when I took my first formal machine learning course, to finally understand why validation loss can be lower than training loss.

And a few months ago, the brilliant author Aurélien Géron posted a tweet thread that concisely explains why you may encounter validation loss being lower than training loss.

I was inspired by Aurélien’s excellent explanation and wanted to share it here with my own commentary and code, ensuring that no students (like me many years ago) have to scratch their heads and wonder “Why is my validation loss lower than my training loss?!”.

To learn the three primary reasons your validation loss may be lower than your training loss, just keep reading!


Why is my validation loss lower than my training loss?

In the first part of this tutorial, we’ll discuss the concept of “loss” in a neural network, including what loss represents and why we measure it.

From there we’ll implement a basic CNN and training script, followed by running a few experiments using our freshly implemented CNN (which will result in our validation loss being lower than our training loss).

Given our results, I’ll then explain the three primary reasons your validation loss may be lower than your training loss.

What is “loss” when training a neural network?

Figure 1: What is the “loss” in the context of machine/deep learning? And why is my validation loss lower than my training loss? (image source)

At the most basic level, a loss function quantifies how “good” or “bad” a given predictor is at classifying the input data points in a dataset.

The smaller the loss, the better a job the classifier does at modeling the relationship between the input data and the output targets.

That said, there is a point where we can overfit our model — by modeling the training data too closely, our model loses the ability to generalize.

We, therefore, seek to:

  1. Drive our loss down, thereby improving our model accuracy.
  2. Do so as fast as possible and with as few hyperparameter updates/experiments as possible.
  3. All without overfitting our network and modeling the training data too closely.

It’s a balancing act and our choice of loss function and model optimizer can dramatically impact the quality, accuracy, and generalizability of our final model.

Typical loss functions (also called “objective functions” or “scoring functions”) include:

  • Binary cross-entropy
  • Categorical cross-entropy
  • Sparse categorical cross-entropy
  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)
  • Standard Hinge
  • Squared Hinge

A full review of loss functions is outside the scope of this post, but for the time being, just understand that for most tasks:

  • Loss measures the “goodness” of your model
  • The smaller the loss, the better
  • But you need to be careful not to overfit
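
To make the "smaller is better" idea concrete, here is a minimal sketch of categorical cross-entropy for a single sample, written in plain Python. The function name and the example probabilities are illustrative assumptions and are not taken from the code accompanying this post.

```python
import math

def categorical_cross_entropy(probs, true_index, eps=1e-12):
    """Cross-entropy for one sample: -log of the probability given to the true class."""
    # clip the probability so that log(0) can never occur
    p = min(max(probs[true_index], eps), 1.0)
    return -math.log(p)

# a confident, correct prediction (true class is index 1) yields a small loss
good = categorical_cross_entropy([0.05, 0.90, 0.05], true_index=1)

# a confident, wrong prediction for the same sample yields a large loss
bad = categorical_cross_entropy([0.90, 0.05, 0.05], true_index=1)

print(f"good prediction loss: {good:.4f}")
print(f"bad prediction loss:  {bad:.4f}")
```

Averaging this per-sample value over a batch or a full dataset gives the kind of loss curve discussed in the rest of this post.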

If you would like a complete, step-by-step guide on the role of loss functions in machine learning/neural networks, make sure you read Deep Learning for Computer Vision with Python, where I explain parameterized learning and loss methods in detail (including code and experiments).

Project structure

Go ahead and use the “Downloads” section of this post to download the source code. From there, inspect the project/directory structure via the tree command:

Today we’ll be using a smaller version of VGGNet called MiniVGGNet. The pyimagesearch module includes this CNN.

Our fashion_mnist.py script trains MiniVGGNet on the Fashion MNIST dataset. I’ve written about training MiniVGGNet on Fashion MNIST in a previous blog post, so today we won’t go into too much detail.

Today’s training script generates a training.pickle file of the training accuracy/loss history. Inside the Reason #2 section below, we’ll use plot_shift.py to shift the training loss plot half an epoch to demonstrate that the time at which loss is measured plays a role when validation loss is lower than training loss.

Let’s dive into the three reasons now to answer the question, “Why is my validation loss lower than my training loss?”.

Reason #1: Regularization applied during training, but not during validation/testing

Figure 2: Aurélien answers the question: “Ever wonder why validation loss < training loss?” on his Twitter feed (image source). The first reason is that regularization is applied during training but not during validation/testing.

When training a deep neural network we often apply regularization to help our model:

  1. Obtain higher validation/testing accuracy
  2. And ideally, to generalize better to the data outside the validation and testing sets

Regularization methods often sacrifice training accuracy to improve validation/testing accuracy — in some cases that can lead to your validation loss being lower than your training loss.

Secondly, keep in mind that regularization methods such as dropout are not applied at validation/testing time.

As Aurélien shows in Figure 2, factoring in regularization to validation loss (e.g., applying dropout during validation/testing time) can make your training/validation loss curves look more similar.
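
The effect of dropout on the reported loss can be seen in a toy setting. The sketch below is an assumption-laden illustration, not MiniVGGNet: it uses a hypothetical linear model whose weights fit the synthetic targets exactly, and it applies inverted dropout to the inputs. The data and the weights are identical in both measurements; the only difference is whether dropout is active.

```python
import random

def predict(x, weights, drop_rate=0.0, training=False, rng=None):
    """Toy linear model with inverted dropout applied to its inputs in training mode only."""
    total = 0.0
    for xi, wi in zip(x, weights):
        if training and drop_rate > 0.0:
            # drop the unit with probability drop_rate and rescale the survivors
            keep = rng.random() >= drop_rate
            xi = xi / (1.0 - drop_rate) if keep else 0.0
        total += xi * wi
    return total

def mse(data, weights, **kwargs):
    """Mean squared error of the model over (input, target) pairs."""
    errors = [(predict(x, weights, **kwargs) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)

rng = random.Random(42)
weights = [0.5, -1.0, 2.0, 1.5]  # hypothetical, already "trained" weights

# synthetic inputs; targets are exactly what the model predicts without dropout
inputs = [[rng.uniform(-1, 1) for _ in weights] for _ in range(500)]
data = [(x, predict(x, weights)) for x in inputs]

# same data, same weights: only the dropout switch differs
train_mode_loss = mse(data, weights, drop_rate=0.5, training=True, rng=rng)
eval_mode_loss = mse(data, weights)  # dropout disabled, as at validation time

print(f"loss with dropout active:   {train_mode_loss:.4f}")
print(f"loss with dropout disabled: {eval_mode_loss:.4f}")
```

In this sketch the loss measured with dropout active is higher even though nothing about the model or the data changed, which is the mechanism behind Reason #1.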

Reason #2: Training loss is measured during each epoch while validation loss is measured after each epoch

Figure 3: Reason #2 for validation loss sometimes being less than training loss has to do with when the measurement is taken (image source).

The second reason you may see validation loss lower than training loss is due to how the loss values are measured and reported:

  1. Training loss is measured during each epoch
  2. While validation loss is measured after each epoch

Your training loss is continually reported over the course of an entire epoch; however, validation metrics are computed over the validation set only once the current training epoch is completed.

This implies that, on average, training losses are measured half an epoch earlier.

If you shift the training losses half an epoch to the left you’ll see that the gaps between the training and validation loss values are much smaller.
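
The timing argument can be checked with a small simulation. This is a sketch under stated assumptions: `true_loss` is a hypothetical smooth curve standing in for the model's loss as training progresses, and training and validation data are assumed to be equally hard, so any gap comes purely from when the measurement is taken.

```python
import math

def true_loss(t):
    """Hypothetical loss of the model after t epochs of training (t may be fractional)."""
    return 0.2 + math.exp(-0.5 * t)

def simulate(num_epochs=5, batches_per_epoch=100):
    rows = []
    for epoch in range(1, num_epochs + 1):
        # training loss: mean of the per-batch losses seen *during* this epoch
        batch_losses = [
            true_loss(epoch - 1 + (b + 0.5) / batches_per_epoch)
            for b in range(batches_per_epoch)
        ]
        train_loss = sum(batch_losses) / batches_per_epoch

        # validation loss: measured once, with the weights at the *end* of the epoch
        val_loss = true_loss(epoch)

        # loss the model had half an epoch before the end of this epoch
        half_epoch_earlier = true_loss(epoch - 0.5)
        rows.append((epoch, train_loss, val_loss, half_epoch_earlier))
    return rows

for epoch, train_loss, val_loss, earlier in simulate():
    print(f"epoch {epoch}: train={train_loss:.4f}  val={val_loss:.4f}  "
          f"loss half an epoch earlier={earlier:.4f}")
```

In every simulated epoch the reported training loss sits above the validation loss and lands very close to the loss the model had half an epoch earlier, which is why shifting the training curve to the left closes most of the gap.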

For an example of this behavior in action, read the following section.

Implementing our training script

We’ll be implementing a simple Python script to train a small VGG-like network (called MiniVGGNet) on the Fashion MNIST dataset. During training, we’ll save our training and validation losses to disk. We’ll then create a separate Python script to compare both the unshifted and shifted loss plots.

Let’s get started by implementing the training script:

Lines 2-8 import our required packages, modules, classes, and functions. Namely, we import MiniVGGNet (our CNN), fashion_mnist (our dataset), and pickle (ensuring that we can serialize our training history for a separate script to handle plotting).

The command line argument, --history, points to the separate .pickle file which will soon contain our training history (Lines 11-14).
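
As an illustration, a --history argument like the one described above could be parsed with argparse as follows. This is a minimal sketch rather than the original listing: only the --history flag name comes from the text, while the `build_parser` helper, the description, and the help string are assumptions.

```python
import argparse

def build_parser():
    """Build a parser with a single required --history argument."""
    parser = argparse.ArgumentParser(
        description="Train a model and save its training history")
    parser.add_argument("--history", required=True,
                        help="path to the output .pickle training history file")
    return parser

# An explicit argument list is passed here so the example runs anywhere;
# a real script would call parse_args() with no arguments to read sys.argv.
args = vars(build_parser().parse_args(["--history", "training.pickle"]))
print(args["history"])
```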

We then initialize a few hyperparameters, namely our number of epochs to train for, initial learning rate, and batch size:

We then proceed to load and preprocess our Fashion MNIST data:

Lines 25-34 load and preprocess the training/validation data.

Lines 37 and 38 binarize our class labels, while Lines 41 and 42 list out the human-readable class label names for classification report purposes later.
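
The preprocessing steps can be sketched in plain Python. The details here are assumptions made for illustration: scaling pixel intensities to the range [0, 1] is a common choice but is not confirmed by the text, the helper names are hypothetical, and a real script would operate on NumPy arrays rather than nested lists. The class names are the ten Fashion MNIST categories in label order.

```python
# the ten Fashion MNIST classes, indexed by their integer label
LABEL_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def scale_pixels(image):
    """Scale 8-bit pixel intensities from [0, 255] to [0.0, 1.0]."""
    return [[pixel / 255.0 for pixel in row] for row in image]

def one_hot(labels, num_classes):
    """Binarize integer class labels into one-hot vectors."""
    vectors = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        vectors.append(row)
    return vectors

tiny_image = [[0, 128], [255, 64]]  # stand-in for a 28x28 grayscale image
scaled = scale_pixels(tiny_image)
encoded = one_hot([9, 0, 3], num_classes=len(LABEL_NAMES))

print(scaled)
for vector in encoded:
    # recover the human-readable name from the one-hot vector
    print(vector, "->", LABEL_NAMES[vector.index(1)])
```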

From here we have everything we need to compile and train our MiniVGGNet model on the Fashion MNIST data:

Lines 46-49 initialize and compile the MiniVGGNet model.

Lines 53-55 then fit/train the model.

From here we will evaluate our model and serialize our training history:

Lines 59-62 make predictions on the test set and print a classification report to the terminal.

Lines 66-68 serialize our training accuracy/loss history to a .pickle file. We’ll use the training history in a separate Python script to plot the loss curves, including one plot showing a one-half epoch shift.
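
The serialization step amounts to a pickle round trip. In the sketch below, the `history` dictionary is a made-up stand-in for the dictionary of per-epoch metrics that a Keras-style training call returns; the key names and the values are assumptions, and a temporary directory is used so the example does not overwrite a real training.pickle file.

```python
import os
import pickle
import tempfile

# stand-in for the per-epoch training history; values are invented for illustration
history = {
    "loss": [0.62, 0.41, 0.35],
    "val_loss": [0.45, 0.36, 0.33],
}

path = os.path.join(tempfile.mkdtemp(), "training.pickle")

# serialize the history to disk at the end of training
with open(path, "wb") as f:
    pickle.dump(history, f)

# later, in a separate plotting script, load it back
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["val_loss"])
```

Keeping training and plotting in separate scripts means the plot can be regenerated or restyled without retraining the network.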

Go ahead and use the “Downloads” section of this tutorial to download the source code.

From there, open up a terminal and execute the following command:

Checking the contents of your working directory you should have a file named training.pickle — this file contains our training history logs.

In the next section we’ll learn how to plot these values and shift our training information a half epoch to the left, thereby making our training/validation loss curves look more similar.

Shifting our training loss values

Our plot_shift.py script is used to plot the training history output from fashion_mnist.py. Using this script we can investigate how shifting our training loss a half epoch to the left can make our training/validation plots look more similar.

Open up the plot_shift.py file and insert the following code:

Lines 2-5 import matplotlib (for plotting), NumPy (for a simple array creation operation), argparse (command line arguments), and pickle (to load our serialized training history).

Lines 8-11 parse the --input command line argument which points to our .pickle training history file on disk.

Let’s go ahead and load our data and initialize our plot figure:

Line 14 loads our serialized training history .pickle file using the --input command line argument.

Line 18 makes space for our x-axis which spans from zero to the number of epochs in the training history.

Lines 19 and 20 set up our plot figure to be two stacked plots in the same image:

  • The top plot will contain loss curves as-is.
  • The bottom plot, on the other hand, will include a shift for the training loss (but not for the validation loss). The training loss will be shifted half an epoch to the left just as in Aurélien’s tweet. We’ll then be able to observe if the plots line up more closely.

Let’s generate our top plot:

And then draw our bottom plot:

Notice on Line 32 that the training loss is shifted 0.5 epochs to the left, which is the heart of this example.
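
The shift itself is a one-line change to the x-coordinates of the training curve. The following is a hedged sketch, not the original plot_shift.py: the helper names, the output filename, and the "loss"/"val_loss" keys are assumptions about how the history is stored.

```python
def epoch_axes(num_epochs, shift=0.5):
    """Return (unshifted, shifted) x-coordinates for num_epochs loss values."""
    epochs = list(range(num_epochs))
    shifted = [e - shift for e in epochs]
    return epochs, shifted

def plot_shifted(history, out_path="shifted_loss.png"):
    """Draw two stacked plots: losses as-is (top) and with shifted training loss (bottom)."""
    # matplotlib is imported here so that epoch_axes() stays usable without it
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    epochs, shifted = epoch_axes(len(history["loss"]))
    fig, axes = plt.subplots(2, 1, figsize=(8, 8))

    # top: both curves plotted against the same epoch axis
    axes[0].set_title("Unshifted loss plot")
    axes[0].plot(epochs, history["loss"], label="train_loss")
    axes[0].plot(epochs, history["val_loss"], label="val_loss")
    axes[0].legend()

    # bottom: only the training loss is moved half an epoch to the left
    axes[1].set_title("Training loss shifted half an epoch to the left")
    axes[1].plot(shifted, history["loss"], label="train_loss")
    axes[1].plot(epochs, history["val_loss"], label="val_loss")
    axes[1].legend()

    fig.savefig(out_path)
    plt.close(fig)

epochs, shifted = epoch_axes(4)
print(epochs)   # [0, 1, 2, 3]
print(shifted)  # [-0.5, 0.5, 1.5, 2.5]
```

Calling `plot_shifted` with a history dictionary loaded from the .pickle file would write the two stacked plots to disk.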

Let’s now analyze our training/validation plots.

Open up a terminal and execute the following command:

Figure 4: Shifting the training loss plot 1/2 epoch to the left yields more similar plots. Clearly the time of measurement answers the question, “Why is my validation loss lower than training loss?”.

As you can observe, shifting the training loss values a half epoch to the left (bottom) makes the training/validation curves much more similar versus the unshifted (top) plot.

Reason #3: The validation set may be easier than the training set (or there may be leaks)

Figure 5: Consider how your validation set was acquired/generated. Common mistakes could lead to validation loss being less than training loss. (image source)

The third common reason for validation loss being lower than your training loss is the data distribution itself.

Consider how your validation set was acquired:

  • Can you guarantee that the validation set was sampled from the same distribution as the training set?
  • Are you certain that the validation examples are just as challenging as your training images?
  • Can you ensure there was no “data leakage” (i.e., training samples getting accidentally mixed in with validation/testing samples)?
  • Are you confident your code created the training, validation, and testing splits properly?
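
One of these checks, looking for training samples that leaked into the validation set, can be partly automated. The sketch below is one possible approach under simple assumptions: samples are sequences of 8-bit integers, and only exact duplicates count as leaks. The function names are hypothetical, and near-duplicates (for example, resized or re-encoded copies of an image) would not be caught.

```python
import hashlib

def fingerprint(sample):
    """Hash the raw bytes of a sample so identical samples get identical digests."""
    return hashlib.sha256(bytes(sample)).hexdigest()

def find_leaks(train_samples, val_samples):
    """Return indices of validation samples that also appear in the training set."""
    train_hashes = {fingerprint(s) for s in train_samples}
    return [i for i, s in enumerate(val_samples)
            if fingerprint(s) in train_hashes]

# tiny stand-ins for flattened images with pixel values in [0, 255]
train = [[0, 10, 255], [7, 7, 7], [1, 2, 3]]
val = [[9, 9, 9], [7, 7, 7]]  # the second sample is duplicated from train

print(find_leaks(train, val))  # [1]
```

An empty result does not prove the split is sound, but a non-empty one is a clear sign that the splitting code needs another look.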

Every single deep learning practitioner has made the above mistakes at least once in their career.

Yes, it is embarrassing when it happens — but that’s the point — it does happen, so take the time now to investigate your code.

BONUS: Are you training hard enough?

Figure 6: If you are wondering why your validation loss is lower than your training loss, perhaps you aren’t “training hard enough”.

One aspect that Aurélien didn’t touch on in his tweets is the concept of “training hard enough”.

When training a deep neural network, our biggest concern is nearly always overfitting — and in order to combat overfitting, we introduce regularization techniques (discussed in Reason #1 above). We apply regularization in the form of:

  • Dropout
  • L2 weight decay
  • Reducing model capacity (i.e., a more shallow model)

We also tend to be a bit more conservative with our learning rate to ensure our model doesn’t overshoot areas of lower loss in the loss landscape.

That’s all fine and good, but sometimes we end up over-regularizing our models.

If you have ruled out all three reasons for validation loss being lower than training loss detailed above, you may have over-regularized your model. Start to relax your regularization constraints by:

  • Lowering your L2 weight decay strength.
  • Reducing the amount of dropout you’re applying.
  • Increasing your model capacity (i.e., make it deeper).

You should also try training with a larger learning rate as you may have become too conservative with it.

Do you have unanswered deep learning questions?

You can train your first neural network in minutes…with just a few lines of Python.

But if you are just getting started, you may have questions such as today’s: “Why is my validation loss lower than training loss?”

Similar questions can stump you for weeks — maybe months. You might find yourself searching online for answers only to be disappointed in the explanations. Or maybe you posted your burning question to Stack Overflow or Quora and you are still hearing crickets.

It doesn’t have to be like that.

What you need is a comprehensive book to jumpstart your education. Discover and study deep learning the right way in my book: Deep Learning for Computer Vision with Python.

Inside the book, you’ll find self-study tutorials and end-to-end projects on topics like:

  • Convolutional Neural Networks
  • Object Detection via Faster R-CNNs and SSDs
  • Generative Adversarial Networks (GANs)
  • Emotion/Facial Expression Recognition
  • Best practices, tips, and rules of thumb
  • …and much more!

Using the knowledge gained by reading this book you’ll finally be able to bring deep learning to your own projects.

What’s more is that you’ll learn the “art” of training neural networks, answering questions such as:

  1. Which deep learning CNN architecture is right for my task at hand?
  2. How can I spot underfitting and overfitting either after or during training?
  3. What is the most effective way to set my initial learning rate and to use a learning rate decay scheduler to improve accuracy?
  4. Which deep learning optimizer is the best one for the job and how do I evaluate new, state-of-the-art optimizers as they are published?
  5. How do I apply regularization techniques effectively ensuring that I am not over-regularizing my model?

You’ll find the answers to all of these questions inside my deep learning book.

Customers of mine attest that this is the best deep learning education you’ll find online — inside the book you’ll find:

  • Super practical walkthroughs that present solutions to actual, real-world image classification, object detection, and segmentation problems.
  • Hands-on tutorials (with lots of code) that not only show you the algorithms behind deep learning for computer vision but their implementations as well.
  • A no-nonsense teaching style that is guaranteed to help you master deep learning for image understanding and visual recognition.

So why wait?

Fumbling around the internet looking for answers to your questions, only to find sub-par resources, is costing you time that you can’t get back. The value you receive by reading my book is far, far greater than the price you pay and will ensure you receive a positive return on your investment of time and finances. I guarantee that.

To learn more about the book, and grab the table of contents + free sample chapters, just click here!

Summary

Today’s tutorial was heavily inspired by the following tweet thread from author Aurélien Géron.

Inside the thread, Aurélien expertly and concisely explained the three reasons your validation loss may be lower than your training loss when training a deep neural network:

  1. Reason #1: Regularization is applied during training, but not during validation/testing. If you add in the regularization loss during validation/testing, your loss values and curves will look more similar.
  2. Reason #2: Training loss is measured during each epoch while validation loss is measured after each epoch. On average, the training loss is measured 1/2 an epoch earlier. If you shift your training loss curve a half epoch to the left, your losses will align a bit better.
  3. Reason #3: Your validation set may be easier than your training set or there is a leak in your data/bug in your code. Make sure your validation set is reasonably large and is sampled from the same distribution (and difficulty) as your training set.
  4. BONUS: You may be over-regularizing your model. Try reducing your regularization constraints, including increasing your model capacity (i.e., making it deeper with more parameters), reducing dropout, reducing L2 weight decay strength, etc.

Hopefully, this helps clear up any confusion on why your validation loss may be lower than your training loss!

It was certainly a head-scratcher for me when I first started studying machine learning and neural networks and it took me until mid-college to understand exactly why that happens — and none of the explanations back then were as clear and concise as Aurélien’s.

I hope you enjoyed today’s tutorial!

To download the source code (and be notified when future tutorials are published here on PyImageSearch), just enter your email address in the form below!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!


20 Responses to Why is my validation loss lower than my training loss?

  1. David Bonn October 14, 2019 at 1:05 pm #

    Great Post!

    I’ve noticed this effect when training my own models. When I ran some more experiments it became clear that validation loss was more likely to be less than training loss when the slope of the loss curve was steeper. In addition, if I applied more aggressive dropout or applied more aggressive data augmentation this effect was also more likely to happen.

    So thanks and this makes me feel much better that I understand what is going on.

    • Adrian Rosebrock October 16, 2019 at 7:00 am #

      Absolutely — regularization techniques will sacrifice training accuracy in the hopes of improving validation accuracy and the overall generalizability of the model.

  2. Amit October 14, 2019 at 5:28 pm #

    Thanks for the blog – really useful. So we know why this could happen. Is this really a problem though? I mean is this necessarily a bad thing?

    • Adrian Rosebrock October 16, 2019 at 6:59 am #

      It’s not necessarily a “bad” thing by itself but when you look at the common reasons for it you should double-check what’s going on. You may not be training hard enough and leaving accuracy on the table. Or, you may have a data leakage issue which absolutely IS a problem. Finally, you still need to check the generalizability of your model.

  3. Chua October 15, 2019 at 12:16 am #

    Thank you for your nice article on this useful topic. This is different from typical topic of model overfitting. I am also reading Andrew Ng’s Machine Learning Yearning book, another possible reason could be mislabeled validation samples according to Chapter 16.

    • Adrian Rosebrock October 16, 2019 at 6:58 am #

      Mislabeled samples could be a problem — you may also be applying label smoothing as well.

  4. Chua October 15, 2019 at 12:28 am #

    Based on the loss graphs above, it seems that validation loss is typically higher than training loss when the model is not trained long enough. Therefore could I say that another possible reason is that the model is not trained long enough/early stopping criterion is too strict?

    • Adrian Rosebrock October 16, 2019 at 6:57 am #

      It’s not so much “trained long enough” as it may be “not training hard enough”. See the “Bonus Reason #4” that I included in the post as well.

  5. robotechnics October 15, 2019 at 1:52 am #

    Hi Adrian,
    thank you very much for this post.
    I do not understand why the calculations are different for training and validation datasets.
    According to DLCV, for each individual image, the loss is calculated and at the end of each epoch, the total sum of all loss is accounted and then the optimizer (SGD etc) is in charge of finding the absolute minimum of the function.
    Could you please explain with more detail, why is it different?

    • Adrian Rosebrock October 16, 2019 at 6:56 am #

      First, I think you’re confusing how training and validation loss are computed.

      1. Training loss is computed continually throughout an epoch (i.e., after each batch update).
      2. At the end of each epoch the validation loss is computed.

      Secondly, I think you have a misunderstanding of what DL4CV is explaining. For each individual batch of input images, the batch is forward propagated, then the loss is computed, and the error is backpropagated. Thus, training loss is computed continually with each batch update.

  6. Abdullah October 15, 2019 at 2:28 am #

    Thank you for the great explanation. So if the val_loss was lower than train_loss because of the second reason ” Training loss is measured during each epoch while validation loss is measured after each epoch”, then that is fine correct?

    • Adrian Rosebrock October 16, 2019 at 6:54 am #

      Yes, that is fine — but I still recommend creating a second test set to validate that your model can generalize.

  7. lgenzelis October 15, 2019 at 9:51 am #

    Great post Adrian!! I have also scratched my heads many times around this issue. Thank you for un-scratching it!

    • Adrian Rosebrock October 16, 2019 at 6:53 am #

      Thanks lgenzelis, I’m glad you enjoyed the guide and it helped you 🙂

  8. Walid October 15, 2019 at 2:53 pm #

    Great post.

    I have seen this phenomenon before more than once 🙂

    If we use full batch in training, “Reason 2” will disappear. am I right?

    Walid

    • Adrian Rosebrock October 16, 2019 at 6:53 am #

      Hey Walid — what do you mean by “full batch in training”?

  9. Shivam October 18, 2019 at 9:41 am #

    Hey Adrian,
    Will these reasons be valid in case the validation accuracy is higher than the training accuracy?
    I saw a post somewhere where a guy asked this question as he was facing this problem while training on his data

    • Adrian Rosebrock October 25, 2019 at 10:06 am #

      Yes, same reasons apply.

  10. Sachin October 19, 2019 at 9:43 am #

    Hi Adrian,

    Thanks for making such deep blog on deep learning.
    I am working on GANs for my college project. I was hoping if you could make some blogs on it.
    Right now there are really less resources on GANs.

    Thank you
    Sachin
