Is Rectified Adam actually *better* than Adam?

Is the Rectified Adam (RAdam) optimizer actually better than the standard Adam optimizer? According to my 24 experiments, the answer is no, typically not (but there are cases where you do want to use it instead of Adam).

In Liu et al.’s 2019 paper, On the Variance of the Adaptive Learning Rate and Beyond, the authors claim that Rectified Adam can obtain:

  • Better accuracy (or at least identical accuracy when compared to Adam)
  • And in fewer epochs than standard Adam

The authors tested their hypothesis on three different datasets, including one NLP dataset and two computer vision datasets (ImageNet and CIFAR-10).

In each case Rectified Adam outperformed standard Adam…but failed to outperform standard Stochastic Gradient Descent (SGD)!

The Rectified Adam optimizer has some strong theoretical justifications — but as a deep learning practitioner, you need more than just theory — you need to see empirical results applied to a variety of datasets.

And perhaps more importantly, you need to obtain mastery-level experience operating/driving the optimizer (or a small subset of optimizers) as well.

Today is part two in our two-part series on the Rectified Adam optimizer:

  1. Rectified Adam (RAdam) optimizer with Keras (last week’s post)
  2. Is Rectified Adam actually *better* than Adam (today’s tutorial)

If you haven’t yet, go ahead and read part one to ensure you have a good understanding of how the Rectified Adam optimizer works.

From there, read today’s post to help you understand how to design, code, and run experiments used to compare deep learning optimizers.

To learn how to compare Rectified Adam to standard Adam, just keep reading!

Is Rectified Adam actually *better* than Adam?

In the first part of this tutorial, we’ll briefly discuss the Rectified Adam optimizer, including how it works and why it’s interesting to us as deep learning practitioners.

From there, I’ll guide you in designing and planning our set of experiments to compare Rectified Adam to Adam — you can use this section to learn how you design your own deep learning experiments as well.

We’ll then review the project structure for this post, including implementing our training and evaluation scripts by hand.

Finally, we’ll run our experiments, collect results, and ultimately decide whether Rectified Adam is actually better than Adam.

What is the Rectified Adam optimizer?

Figure 1: The Rectified Adam (RAdam) deep learning optimizer. Is it better than the standard Adam optimizer? (image source: Figure 6 from Liu et al.)

The Rectified Adam optimizer was proposed by Liu et al. in their 2019 paper, On the Variance of the Adaptive Learning Rate and Beyond. In their paper they discussed how their update to the Adam optimizer, called Rectified Adam, can:

  1. Obtain a higher accuracy/more generalizable deep neural network.
  2. Complete training in fewer epochs.

Their work had some strong theoretical justifications as well. They found that adaptive learning rate optimizers (such as Adam) both:

  • Struggle to generalize during the first few batch updates
  • Have very high variance

Liu et al. studied the problem in detail and found that the issue could be rectified (hence the name, Rectified Adam) by:

  1. Applying warm up with a low initial learning rate.
  2. Simply turning off the momentum term for the first few sets of input training batches.
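
To make the warmup behaviour a bit more concrete, here is a minimal sketch of the variance rectification term as I read it from Algorithm 2 of Liu et al. The function name radam_rectification and the steps printed are my own choices, and beta2=0.999 is simply Adam's usual default. While the approximated degrees of freedom rho_t is 4 or less, the paper's algorithm skips the adaptive scaling and takes a momentum-only step; afterwards, the adaptive step is multiplied by r_t, which grows towards 1 as training proceeds.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Return (rho_t, r_t) for a 1-indexed training step.

    r_t is None while rho_t <= 4, i.e. while the variance of the
    adaptive learning rate is treated as intractable.
    """
    # maximum length of the approximated simple moving average
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)

    if rho_t <= 4.0:
        return rho_t, None

    # variance rectification term applied to the adaptive step
    r_t = math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )
    return rho_t, r_t


for step in (1, 5, 100, 1000, 10000):
    rho_t, r_t = radam_rectification(step)
    label = "momentum-only step" if r_t is None else "r_t = {:.4f}".format(r_t)
    print("step {:>5}: rho_t = {:8.2f}, {}".format(step, rho_t, label))
```

Running this shows the rectification term starting out very small and approaching 1 over time, which is the built-in warmup you will see in the training plots later in this post.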

The authors evaluated their experiments on one NLP dataset and two image classification datasets and found that their Rectified Adam implementation outperformed standard Adam (but neither optimizer outperformed standard SGD).

We’ll be continuing Liu et al.’s experiments today and comparing Rectified Adam to standard Adam in 24 separate experiments.

For more details on how the Rectified Adam optimizer works, be sure to review my previous blog post.

Planning our experiments

Figure 2: We will plan our set of experiments to evaluate the performance of the Rectified Adam (RAdam) optimizer using Keras.

To compare Adam to Rectified Adam, we’ll be training three Convolutional Neural Networks (CNNs), including:

  1. ResNet
  2. GoogLeNet
  3. MiniVGGNet

The implementations of these CNNs came directly from my book, Deep Learning for Computer Vision with Python.

These networks will be trained on four datasets:

  1. MNIST
  2. Fashion MNIST
  3. CIFAR-10
  4. CIFAR-100

For each combination of dataset and CNN architecture, we’ll apply two optimizers:

  1. Adam
  2. Rectified Adam

Taking all possible combinations, we end up with 3 x 4 x 2 = 24 separate training experiments.
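
As a quick sanity check on that count, the combinations can be enumerated with itertools.product. The lowercase labels below are identifiers I picked for illustration, not necessarily the strings used in the downloadable code.

```python
from itertools import product

# illustrative labels (assumed, not taken from the downloads)
models = ["resnet", "googlenet", "minivggnet"]
datasets = ["mnist", "fashion_mnist", "cifar10", "cifar100"]
optimizers = ["adam", "radam"]

# every (model, dataset, optimizer) triple is one training experiment
experiments = list(product(models, datasets, optimizers))
print(len(experiments))  # 24
```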

We’ll run each of these experiments individually, collect the results, and then interpret them to determine which optimizer is indeed better.

Whenever you plan your own experiments, make sure you take the time to write out the list of model architectures, optimizers, and datasets you intend on applying them to. Additionally, you may want to list the hyperparameters you believe are important and are worth tuning (e.g., learning rate, L2 weight decay strength, etc.).

Considering the 24 experiments we plan to conduct, it makes the most sense to automate the data collection phase. From there, we will be able to work on other tasks while the computation is underway (often requiring days of compute time). Upon completion of the data collection for our 24 experiments, we will then be able to sit down and analyze the plots and classification reports in order to compare RAdam against Adam across our CNNs and datasets.

How to design your own deep learning experiments

Figure 3: Designing your own deep learning experiments requires thought and planning. Consider your typical deep learning workflow and design your initial set of experiments such that a thorough preliminary investigation can be conducted using automation. Planning for automated evaluation now will save you time (and money) down the line.

Typically, my experiment design workflow goes something like this:

  1. Select 2-3 model architectures that I believe would work well on a particular dataset (e.g., ResNet, VGGNet, etc.).
  2. Decide if I want to train from scratch or perform transfer learning.
  3. Use my learning rate finder to find an acceptable initial learning rate for the SGD optimizer.
  4. Train the model on my dataset using SGD and Keras’ standard decay schedule.
  5. Look at my results from training, select the architecture that performed best, and start tuning my hyperparameters, including model capacity, regularization strength, revisiting the initial learning rate, applying Cyclical Learning Rates, and potentially exploring other optimizers.
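
For step 4, it can help to see what a "standard decay schedule" does to the learning rate. The sketch below assumes the time-based decay that the legacy Keras optimizers apply when their decay argument is non-zero, where the rate shrinks with the number of batch updates. The helper name time_based_decay, the starting rate, the epoch count, and the batches-per-epoch figure are all made up for illustration, and setting decay to initial_lr / epochs is just one common choice.

```python
def time_based_decay(initial_lr, decay, iteration):
    """Learning rate after `iteration` batch updates under time-based decay."""
    return initial_lr / (1.0 + decay * iteration)


# illustrative values only (assumptions, not settings from this post)
initial_lr = 1e-2
epochs = 60
batches_per_epoch = 500
decay = initial_lr / epochs

for epoch in (0, 1, 10, 30, 60):
    iteration = epoch * batches_per_epoch
    lr = time_based_decay(initial_lr, decay, iteration)
    print("epoch {:>2}: lr = {:.6f}".format(epoch, lr))
```

The rate falls quickly at first and then flattens out, which is worth keeping in mind when you read SGD training curves produced with this schedule.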

You’ll notice that I tend to use SGD in my initial experiments instead of Adam, RMSprop, etc.

Why is that?

To answer that question you’ll need to read the “You need to obtain mastery level experience operating these three optimizers” section below.

Note: For more of my suggestions, tips, and best practices when designing and running your own experiments, be sure to refer to my book, Deep Learning for Computer Vision with Python.

However, in the context of this tutorial, we’re attempting to compare our results to the work of Liu et al.

We, therefore, need to fix the model architectures, training from scratch, learning rate, and optimizers — our experiment design now becomes:

  1. Train each of ResNet, GoogLeNet, and MiniVGGNet on each of MNIST, Fashion MNIST, CIFAR-10, and CIFAR-100.
  2. Train all networks from scratch.
  3. Use the initial, default learning rates for Adam/Rectified Adam (1e-3).
  4. Utilize the Adam and Rectified Adam optimizers for training.
  5. Since these are one-off experiments we’ll not be performing an exhaustive dive on tuning hyperparameters (you can refer to Deep Learning for Computer Vision with Python if you would like details on how to tune your hyperparameters).

At this point we’ve motivated and planned our set of experiments — now let’s learn how to implement our training and evaluation scripts.

Project structure

Go ahead and grab the “Downloads” and then inspect the project directory with the tree  command:

Our project consists of two output directories:

  • output/ : Holds our classification report .txt  files organized by experiment. Additionally, there is one  .pickle  file per experiment containing the serialized training history data (for plotting purposes).
  • plots/ : For each CNN/dataset combination, a stacked accuracy/loss curve plot is output so that we can conveniently compare the Adam and RAdam optimizers.

The pyimagesearch module contains three Convolutional Neural Network (CNN) architectures constructed with Keras. These CNN implementations come directly from Deep Learning for Computer Vision with Python.

We will review three Python scripts in today’s tutorial:

  • Training script: Accepts a CNN architecture, dataset, and optimizer via command line arguments and begins fitting a model accordingly. This script will be invoked automatically for each of our 24 experiments via the generated bash script. Our training script produces two types of output files:
    • .txt : A classification report printout in scikit-learn’s standard format.
    • .pickle : Serialized training history so that it can later be recalled for plotting purposes.
  • Combination script: Computes all the experiment combinations for which we will train models and collect data. The result of executing this script is a bash/shell script containing one training command per experiment.
  • Plotting script: Plots accuracy/loss curves for Adam/RAdam using matplotlib directly from the output/*.pickle files.

Implementing the training script

Our training script will be responsible for accepting:

  1. A given model architecture
  2. A dataset
  3. An optimizer

And from there, the script will handle training the specified model, on the supplied dataset, using the specified optimizer.

We’ll use this script to run each of our 24 experiments.

Let’s go ahead and implement the script now:

Imports include our three CNN architectures, four datasets, and two optimizers ( Adam  and RAdam ).

Let’s parse command line arguments:

Our command line arguments include:

  • --history : The path to the output training history .pickle  file.
  • --report : The path to the output classification report .txt  file.
  • --dataset : The dataset to train our model on can be any of the choices  listed on Line 26.
  • --model : The deep learning model architecture must be one of the choices  on Line 29.
  • --optimizer : Our adam  or radam  deep learning optimization method.

Upon providing the command line arguments via the terminal, our training script dynamically sets up and launches the experiment. Output files are named according to the parameters of the experiment.
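
Here is a rough sketch of how those five switches could be declared with argparse. The flag names come from the list above, but the choices strings, the defaults, the build_parser helper, and the example paths are my own assumptions for illustration.

```python
import argparse


def build_parser():
    """Build an argument parser with the five switches described above."""
    ap = argparse.ArgumentParser()
    ap.add_argument("--history", required=True,
                    help="path to output training history .pickle file")
    ap.add_argument("--report", required=True,
                    help="path to output classification report .txt file")
    # the label strings below are assumed, not taken from the downloads
    ap.add_argument("--dataset", type=str, default="mnist",
                    choices=["mnist", "fashion_mnist", "cifar10", "cifar100"],
                    help="dataset to train on")
    ap.add_argument("--model", type=str, default="resnet",
                    choices=["resnet", "googlenet", "minivggnet"],
                    help="CNN architecture to train")
    ap.add_argument("--optimizer", type=str, default="adam",
                    choices=["adam", "radam"],
                    help="optimization method")
    return ap


# parse an explicit argument list so the sketch runs without a terminal
args = vars(build_parser().parse_args([
    "--history", "output/example.pickle",
    "--report", "output/example.txt",
    "--dataset", "cifar10",
    "--model", "minivggnet",
    "--optimizer", "radam",
]))
print(args)
```

Restricting each switch with choices means a typo in a generated command fails immediately rather than partway through a long run.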

From here we’ll set two constants and initialize the default number of channels for the dataset:

If our --dataset  is MNIST or Fashion MNIST, we’ll load the dataset in the following manner:

Keep in mind that MNIST images are 28×28 but we need 32×32 images for our architectures. Thus, Lines 66 and 67 resize all images in the dataset. Lines 71 and 72 then add the channel dimension, since these grayscale images have a single channel.

Otherwise, we have a CIFAR variant --dataset  to load:

CIFAR datasets contain 3-channel color images (Line 77). These datasets already consist of 32×32 images (no resizing is necessary).

From here, we’ll scale our data and determine the total number of classes:

Followed by initializing this experiment’s deep learning optimizer:

Either Adam  or RAdam  is initialized according to the --optimizer  command line argument switch.

Our model  is then built depending upon the --model  command line argument:

Once either ResNet, GoogLeNet, or MiniVGGNet is built, we’ll binarize our labels and construct our data augmentation object:

Followed by compiling our model and training the network:

We then evaluate the trained model and dump training history to disk:

Each experiment will contain a classification report .txt  file along with a serialized training history .pickle  file.
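
If you want to see the two output files in isolation, the sketch below writes a report string to a .txt file and pickles a history dictionary, then reads the history back. In Keras, the object returned by model.fit exposes a plain dictionary of per-epoch lists through its history attribute, which is the kind of object assumed here. The helper names, file names, and numbers are placeholders I made up, not results from the experiments.

```python
import os
import pickle
import tempfile


def save_outputs(history, report_text, history_path, report_path):
    """Write the classification report and serialize the training history."""
    with open(report_path, "w") as f:
        f.write(report_text)
    with open(history_path, "wb") as f:
        pickle.dump(history, f)


def load_history(history_path):
    """Load a previously pickled training history dictionary."""
    with open(history_path, "rb") as f:
        return pickle.load(f)


# placeholder values for illustration only (not real training results)
history = {
    "loss": [1.0, 0.5],
    "val_loss": [1.1, 0.6],
    "accuracy": [0.5, 0.8],
    "val_accuracy": [0.45, 0.75],
}

with tempfile.TemporaryDirectory() as tmp:
    history_path = os.path.join(tmp, "example.pickle")
    report_path = os.path.join(tmp, "example.txt")
    save_outputs(history, "placeholder report text\n", history_path, report_path)
    restored = load_history(history_path)

print(restored == history)  # True
```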

The classification reports will be inspected manually, whereas the training history files will later be opened by our plotting script, parsed, and finally plotted.

As you’ve learned, creating a training script that dynamically sets up an experiment is quite straightforward.

Creating our experiment combinations

At this point, we have our training script which can accept a (1) model architecture, (2) dataset, and (3) optimizer, followed by fitting a model using the respective combination.

That being said, are we going to manually run each and every individual command?

No, not only is that a tedious task, it’s also prone to human error.

Instead, let’s create a Python script to generate a shell script containing the command for each experiment we want to run.

Open up the file and insert the following code:

Our script requires two command line arguments:

  • --output : The path to the output directory where the training files will be stored.
  • --script : The path to the output shell script which will contain all of our training script commands with command line argument combinations.

Let’s go ahead and open a new file for writing:

Line 14 opens a shell script file for writing. Subsequently, Line 15 writes the “shebang”, which tells the operating system which interpreter should run the script when it is executed directly.

Lines 18-20 then list our datasets , models , and optimizers .

We will form all possible combinations of experiments from these lists in a nested loop:

Inside the loop, we:

  • Construct our history file path (Lines 27-29).
  • Assemble our report file path (Lines 32-34).
  • Concatenate each command per the current loop iteration’s combination and write it to the shell file (Lines 38-43).

Finally, we close the shell script file.
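
The steps above can be sketched roughly as follows. This is not the script from the downloads: the training script name train_script.py, the dataset/model labels, and the file naming pattern are all hypothetical stand-ins, and the chmod call at the end is my own addition so the generated file can be run directly.

```python
import os
import stat
import tempfile

# hypothetical file name for the training script; substitute your own
TRAIN_SCRIPT = "train_script.py"

# assumed labels, chosen for illustration
DATASETS = ["mnist", "fashion_mnist", "cifar10", "cifar100"]
MODELS = ["resnet", "googlenet", "minivggnet"]
OPTIMIZERS = ["adam", "radam"]


def write_experiment_script(script_path, output_dir):
    """Write one training command per dataset/model/optimizer combination."""
    with open(script_path, "w") as f:
        # the shebang names the interpreter used when the file is run directly
        f.write("#!/bin/sh\n\n")
        for dataset in DATASETS:
            for model in MODELS:
                for opt in OPTIMIZERS:
                    stem = "{}_{}_{}".format(dataset, model, opt)
                    history = os.path.join(output_dir, stem + ".pickle")
                    report = os.path.join(output_dir, stem + ".txt")
                    f.write(
                        "python {} --history {} --report {} "
                        "--dataset {} --model {} --optimizer {}\n".format(
                            TRAIN_SCRIPT, history, report, dataset, model, opt))
    # set the owner's execute bit (equivalent to chmod u+x)
    os.chmod(script_path, os.stat(script_path).st_mode | stat.S_IXUSR)
    return script_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_experiment_script(os.path.join(tmp, "example.sh"), "output")
    with open(path) as f:
        lines = f.read().splitlines()

commands = [line for line in lines if line.startswith("python ")]
print(len(commands))  # 24
```

Generating the commands from lists keeps the naming of the history and report files consistent, which matters later when the plotting step has to pair up the Adam and RAdam runs.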

Note: I am making the assumption that you are using a Unix machine to run these experiments. If you’re using Windows you should either (1) update this script to generate a batch file instead, or (2) manually execute the command for each respective experiment. Note that I do not support Windows on the PyImageSearch blog so you will be on your own to implement it based on this script.

Generating the experiment shell script

Go ahead and use the “Downloads” section of this tutorial to download the source code to the guide.

From there, open up a terminal and execute the script:

After the script has executed, you should have a new shell script in your working directory. This file contains the 24 separate experiments we’ll be running to compare Adam to Rectified Adam.

Go ahead and investigate now:

Note: Be sure to use the horizontal scroll bar to inspect the entire contents of the  script. I intentionally did not break up lines or automatically wrap them for better display. You can also refer to Figure 4 below — I suggest clicking the image to enlarge + inspect it.

Figure 4: The output of our file is a shell script listing the training script commands to run in succession. Click image to enlarge.

Notice how there is a call for each of the 24 possible combinations of model architecture, dataset, and optimizer. Furthermore, the “shebang” on Line 1 tells the system which interpreter to use when the shell script is executed directly.

Running our experiments

The next step is to actually perform each of these experiments.

I executed the shell script on an Amazon EC2 instance with an NVIDIA K80 GPU. It took approximately 48 hours to run all the experiments.

To launch the experiments for yourself, just run the following command:

After the script has finished running, your output/ directory should be filled with .pickle and .txt files:

The .txt files contain the output of scikit-learn’s classification_report, a human-readable output that tells us how well our model performed.

The .pickle files contain the training history for the model. We’ll use this .pickle file to plot both Adam and Rectified Adam’s performance in the next section.

Implementing our Adam vs. Rectified Adam plotting script

Our final Python script will be used to plot the performance of Adam vs. Rectified Adam, giving us a nice, clear visualization of a given model architecture trained on a specific dataset.

The plot file opens each Adam/RAdam  .pickle file pair and generates a corresponding plot.

Open up the plotting script and insert the following code:

Lines 2-6 handle imports, namely the matplotlib.pyplot  module.

The plot_history  function is responsible for generating two stacked plots via the subplots feature:

  • Training/validation accuracy curves (Lines 16-23).
  • Training/validation loss curves (Lines 26-33).

Both Adam and Rectified Adam training history curves are generated from adamHist  and rAdamHist  data passed as parameters to the function.

Note: If you are using TensorFlow 2.0 (i.e., tf.keras) to run this code, you’ll need to change all occurrences of acc and val_acc to accuracy and val_accuracy, respectively, as TensorFlow 2.0 has made a breaking change to the accuracy name.
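
One way to avoid editing the plotting code by hand is to look up whichever key names a given history dictionary actually contains. The accuracy_keys helper and the two tiny dictionaries below are hypothetical, with made-up numbers, and only illustrate the idea.

```python
def accuracy_keys(history):
    """Return the (train, validation) accuracy key names present in history."""
    if "accuracy" in history:
        return "accuracy", "val_accuracy"
    return "acc", "val_acc"


# made-up histories showing both naming conventions
old_style = {"loss": [0.4], "val_loss": [0.5], "acc": [0.90], "val_acc": [0.88]}
new_style = {"loss": [0.4], "val_loss": [0.5],
             "accuracy": [0.90], "val_accuracy": [0.88]}

print(accuracy_keys(old_style))  # ('acc', 'val_acc')
print(accuracy_keys(new_style))  # ('accuracy', 'val_accuracy')
```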

Let’s handle parsing command line arguments:

Our command line arguments consist of:

  • --input : The path to the input directory of training history files to be parsed for plot generation.
  • --plots : Our output path where the plots will be stored.

Lines 47 and 48 list our datasets  and models . We’ll loop over the combinations of datasets and models to generate our plots:

Inside our nested datasets / models  loop, we:

  • Construct Adam and Rectified Adam’s file paths (Lines 54-62).
  • Load serialized training history (Lines 66 and 67).
  • Generate the plots using our plot_history  function (Lines 71-75).
  • Export the figures to disk (Lines 78-83).

Plotting Adam vs. Rectified Adam

We are now ready to run the script.

Again, make sure you have used the “Downloads” section of this tutorial to download the source code.

From there, execute the following command:

You can then check the plots/ directory and ensure it has been populated with the training history figures:

In the next section, we’ll review the results of our experiments.

Adam vs. Rectified Adam Experiments with MNIST

Figure 5: Montage of samples from the MNIST digit dataset.

Our first set of experiments will compare Adam vs. Rectified Adam on the MNIST dataset, a standard benchmark image classification dataset for handwritten digit recognition.

MNIST – MiniVGGNet


Figure 6: Which is better — Adam or RAdam optimizer using MiniVGGNet on the MNIST dataset?

Our first experiment compares Adam to Rectified Adam when training MiniVGGNet on the MNIST dataset.

Below is the output classification report for the Adam optimizer:

As well as the classification report for the Rectified Adam optimizer:

As you can see, we’re obtaining 99% accuracy for both experiments.

Looking at Figure 6 you can observe the warmup period associated with Rectified Adam:

  • Loss starts off very high and accuracy very low
  • After warmup is complete the Rectified Adam optimizer catches up with Adam

What’s interesting to note though is that Adam obtains lower loss compared to Rectified Adam — we’ll actually see that trend continue in the rest of the experiments we run (and I’ll explain why this happens as well).

MNIST – GoogLeNet

Figure 7: Which deep learning optimizer is actually better — Rectified Adam or Adam? This plot is from my experiment notebook while testing RAdam and Adam using GoogLeNet on the MNIST dataset.

This next experiment compares Adam to Rectified Adam for GoogLeNet trained on the MNIST dataset.

Below follows the output of the Adam optimizer:

As well as the output for the Rectified Adam optimizer:

Again, 99% accuracy is obtained for both optimizers.

This time both the training/validation plots are near identical for both accuracy and loss.

MNIST – ResNet

Figure 8: Training accuracy/loss plot for ResNet on the MNIST dataset using both the RAdam (Rectified Adam) and Adam deep learning optimizers with Keras.

Our final MNIST experiment compares training ResNet using both Adam and Rectified Adam.

Given that MNIST is not a very challenging dataset we obtain 99% accuracy for the Adam optimizer:

As well as the Rectified Adam optimizer:

But take a look at Figure 8 — note how Adam obtains much lower loss than Rectified Adam.

That’s not necessarily a bad thing as it may imply that Rectified Adam is obtaining a more generalizable model; however, performance on the testing set is identical so we would need to test on images outside MNIST (which is outside the scope of this blog post).
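
If it seems odd that two models can have identical accuracy but different loss, the toy example below may help. Cross-entropy loss depends on how confident each prediction is, while accuracy only checks whether the top-scoring class is correct. The probabilities are synthetic numbers I made up, and the sketch says nothing about which optimizer generalizes better; it only shows that the two metrics can disagree.

```python
import math


def accuracy_and_loss(probs, labels):
    """Return (accuracy, mean cross-entropy) for per-sample class probabilities."""
    correct = 0
    total_loss = 0.0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)
        correct += int(predicted == y)
        total_loss += -math.log(p[y])
    n = len(labels)
    return correct / n, total_loss / n


# synthetic two-class predictions (illustrative only)
labels = [0, 1, 1]
confident = [[0.99, 0.01], [0.02, 0.98], [0.01, 0.99]]
cautious = [[0.70, 0.30], [0.35, 0.65], [0.40, 0.60]]

acc_a, loss_a = accuracy_and_loss(confident, labels)
acc_b, loss_b = accuracy_and_loss(cautious, labels)
print("confident: acc={:.2f} loss={:.4f}".format(acc_a, loss_a))
print("cautious:  acc={:.2f} loss={:.4f}".format(acc_b, loss_b))
```

Both sets of predictions classify every sample correctly, yet the more confident set has the lower loss.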

Adam vs. Rectified Adam Experiments with Fashion MNIST

Figure 9: The Fashion MNIST dataset was created by e-commerce company, Zalando, as a drop-in replacement for MNIST Digits. It is a great dataset to practice/experiment with when using Keras for deep learning. (image source)

Our next set of experiments evaluates Adam vs. Rectified Adam on the Fashion MNIST dataset, a drop-in replacement for the standard MNIST dataset.

You can read more about Fashion MNIST here.

Fashion MNIST – MiniVGGNet

Figure 10: Testing optimizers with deep learning, including new ones such as RAdam, requires multiple experiments. Shown in this figure is the MiniVGGNet CNN trained on the Fashion MNIST dataset with both Adam and RAdam optimizers.

Our first experiment evaluates the MiniVGGNet architecture trained on the Fashion MNIST dataset.

Below you can find the output of training with the Adam optimizer:

As well as the Rectified Adam optimizer:

Note that the Adam optimizer outperforms Rectified Adam, obtaining 92% accuracy compared to the 90% accuracy of Rectified Adam.

Furthermore, take a look at the training plot in Figure 10 — training is very stable with validation loss falling below training loss.

With more aggressive training with Adam, we can likely improve our accuracy further.

Fashion MNIST – GoogLeNet

Figure 11: Is either RAdam or Adam a better deep learning optimizer using GoogLeNet? Using the Fashion MNIST dataset with Adam shows signs of overfitting past epoch 30. RAdam appears more stable in this experiment.

We now evaluate GoogLeNet trained on Fashion MNIST using Adam and Rectified Adam.

Below is the classification report from the Adam optimizer:

As well as the output from the Rectified Adam optimizer:

This time both optimizers obtain 93% accuracy, but what’s more interesting is to take a look at the training history plot in Figure 11.

Here we can see that training and validation loss start to diverge past epoch 30 for the Adam optimizer, and this divergence grows wider and wider as we continue training. At this point, we should start to be concerned about overfitting using Adam.

On the other hand, Rectified Adam’s performance is stable with no signs of overfitting.

In this particular experiment, it’s clear that Rectified Adam is generalizing better, and had we wished to deploy this model to production, the Rectified Adam optimizer version would be the one to go with.

Fashion MNIST – ResNet

Figure 12: Which deep learning optimizer is better — Adam or Rectified Adam (RAdam) — using the ResNet CNN on the Fashion MNIST dataset?

Our final Fashion MNIST experiment compares the Adam and Rectified Adam optimizers when training ResNet on the Fashion MNIST dataset.

Below is the output of the Adam optimizer:

Here is the output of the Rectified Adam optimizer:

Both models obtain 92% accuracy, but take a look at the training history plot in Figure 12.

You can observe that the Adam optimizer results in lower loss and that the validation loss follows the training curve.

The Rectified Adam loss is arguably more stable with fewer fluctuations (as compared to standard Adam).

Exactly which one is “better” in this experiment would be dependent on how well the model generalizes to images outside the training, validation, and testing set.

Further experiments would be required to mark the winner here, but my gut tells me that it’s Rectified Adam as (1) accuracy on the testing set is identical, and (2) lower loss doesn’t necessarily mean better generalization (in some cases it means that the model may fail to generalize well) — but at the same time, training/validation loss are near identical for Adam. Without further experiments it’s hard to make the call.

Adam vs. Rectified Adam Experiments with CIFAR-10

Figure 13: The CIFAR-10 benchmarking dataset has 10 classes. We will use it for Rectified Adam experimentation to evaluate if RAdam or Adam is the better choice (image source).

In these experiments, we’ll be comparing Adam vs. Rectified Adam performance using MiniVGGNet, GoogLeNet, and ResNet, all trained on the CIFAR-10 dataset.

CIFAR-10 – MiniVGGNet

Figure 14: Is the RAdam or Adam deep learning optimizer better using the MiniVGGNet CNN on the CIFAR-10 dataset?

Our next experiment compares Adam to Rectified Adam by training MiniVGGNet on the CIFAR-10 dataset.

Below is the output of training using the Adam optimizer:

And here is the output from Rectified Adam:

Here the Adam optimizer (84% accuracy) beats out Rectified Adam (74% accuracy).

Furthermore, validation loss is lower than training loss for the majority of training, implying that we can “train harder” by reducing our regularization strength and potentially increasing model capacity.

CIFAR-10 – GoogLeNet

Figure 15: Which is a better deep learning optimizer with the GoogLeNet CNN? The training accuracy/loss plot shows results from using Adam and RAdam as part of automated deep learning experiment data collection.

Next, let’s check out GoogLeNet trained on CIFAR-10 using Adam and Rectified Adam.

Here is the output of Adam:

And here is the output of Rectified Adam:

The Adam optimizer obtains 90% accuracy, slightly beating out the 87% accuracy of Rectified Adam.

However, Figure 15 tells an interesting story — past epoch 20 there is quite the divergence between Adam’s training and validation loss.

While the Adam optimized model obtained higher accuracy, there are signs of overfitting as validation loss is essentially stagnant past epoch 30.

Additional experiments would be required to mark a true winner but I imagine it would be Rectified Adam after some additional hyperparameter tuning.

CIFAR-10 – ResNet

Figure 16: This Keras deep learning tutorial helps to answer the question: Is Rectified Adam or Adam the better deep learning optimizer? One of the 24 experiments uses the ResNet CNN and CIFAR-10 dataset.

Next, let’s check out ResNet trained using Adam and Rectified Adam on CIFAR-10.

Below you can find the output of the standard Adam optimizer:

As well as the output from Rectified Adam:

Adam is the winner here, obtaining 88% accuracy versus Rectified Adam’s 84%.

Adam vs. Rectified Adam Experiments with CIFAR-100

Figure 17: The CIFAR-100 classification dataset is the brother of CIFAR-10 and includes more classes of images. (image source)

The CIFAR-100 dataset is the bigger brother of the CIFAR-10 dataset. As the name suggests, CIFAR-100 includes 100 class labels versus the 10 class labels of CIFAR-10.

While there are more class labels in CIFAR-100, there are actually fewer images per class (CIFAR-10 has 6,000 images per class while CIFAR-100 only has 600 images per class).

CIFAR-100 is, therefore, a more challenging dataset than CIFAR-10.

In this section, we’ll investigate Adam vs. Rectified Adam’s performance on the CIFAR-100 dataset.

CIFAR-100 – MiniVGGNet

Figure 18: Will RAdam stand up to Adam as a preferable deep learning optimizer? How does Rectified Adam stack up to SGD? In this experiment (one of 24), we train MiniVGGNet on the CIFAR-100 dataset and analyze the results.

Let’s apply Adam and Rectified Adam to the MiniVGGNet architecture trained on CIFAR-100.

Below is the output from the Adam optimizer:

And here is the output from Rectified Adam:

The Adam optimizer is the clear winner (58% accuracy) over Rectified Adam (46% accuracy).

And just like in our CIFAR-10 experiments, we can likely improve our model performance further by relaxing regularization and increasing model capacity.

CIFAR-100 – GoogLeNet

Figure 19: Adam vs. RAdam optimizer on the CIFAR-100 dataset using GoogLeNet.

Let’s now perform the same experiment, only this time use GoogLeNet.

Here’s the output from the Adam optimizer:

And here is the output from Rectified Adam:

The Adam optimizer obtains 66% accuracy, better than Rectified Adam’s 59%.

However, looking at Figure 19 we can see that the validation loss from Adam is quite unstable — towards the end of training validation loss even starts to increase, a sign of overfitting.

CIFAR-100 – ResNet

Figure 20: Training a ResNet model on the CIFAR-100 dataset using both RAdam and Adam for comparison. Which deep learning optimizer is actually better for this experiment?

Below we can find the output of training ResNet using Adam on the CIFAR-100 dataset:

And here is the output of Rectified Adam:

The Adam optimizer (68% accuracy) crushes Rectified Adam (51% accuracy) here, but we need to be careful of overfitting. As Figure 20 shows there is quite the divergence between training and validation loss when using the Adam optimizer.

But on the other hand, Rectified Adam really stagnates past epoch 20.

I would be inclined to go with the Adam optimized model here as it obtains significantly higher accuracy; however, I would suggest running some generalization tests using both the Adam and Rectified Adam versions of the model.

What can we take away from these experiments?

One of the first takeaways comes from looking at the training plots of the experiments — using the Rectified Adam optimizer can lead to more stable training.

When training with Rectified Adam we see there are significantly fewer fluctuations, spikes, and drops in validation loss (as compared to standard Adam).

Furthermore, the Rectified Adam validation loss is much more likely to follow training loss, in some cases near exactly.

Keep in mind that raw accuracy isn’t everything when it comes to training your own custom neural networks. Stability matters as well, as it goes hand-in-hand with generalization.

Whenever I’m training a custom CNN I’m not only looking for high accuracy models, I’m also looking for stability. Stability typically implies that a model is converging nicely and will ideally generalize well.

In this regard, Rectified Adam delivers on its promises from the Liu et al. paper.

Secondly, you should note that Adam obtains lower (or at least comparable) loss than Rectified Adam in every experiment; in the MNIST GoogLeNet run the curves were near identical.

This behavior is not necessarily a bad thing — it could imply that Rectified Adam is generalizing better, but it’s hard to say without running further experiments using images outside the respective training and testing sets.

Again, keep in mind that lower loss is not necessarily a better model! When you encounter very low loss (especially loss near zero) your model may be overfitting to your training set.
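
A simple way to put a number on the "divergence" we kept spotting in the plots is to track the gap between validation loss and training loss per epoch. The sketch below is one possible heuristic, not a standard metric: the function names, the 0.1 threshold, the patience of 3 epochs, and the toy loss values are all arbitrary choices of mine.

```python
def generalization_gap(history):
    """Per-epoch difference between validation loss and training loss."""
    return [v - t for t, v in zip(history["loss"], history["val_loss"])]


def first_sustained_gap(history, threshold=0.1, patience=3):
    """Return the 1-indexed epoch where the gap first stays above
    `threshold` for `patience` consecutive epochs, or None."""
    run = 0
    for epoch, gap in enumerate(generalization_gap(history), start=1):
        run = run + 1 if gap > threshold else 0
        if run == patience:
            return epoch - patience + 1
    return None


# toy loss curves (made up): training loss keeps falling, validation stalls
toy = {
    "loss":     [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12],
    "val_loss": [1.05, 0.78, 0.58, 0.50, 0.48, 0.49, 0.51],
}

print(first_sustained_gap(toy))  # 4
```

You could run a check like this over each pickled training history to flag the experiments that deserve a closer look for overfitting.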

You need to obtain mastery level experience operating these three optimizers

Figure 21: Mastering deep learning optimizers is like driving a car. You know your car and you drive it well no matter the road condition. On the other hand, if you get in an unfamiliar car, something doesn’t feel right until you have a few hours cumulatively behind the wheel. Optimizers are no different. I suggest that SGD be your daily driver until you are comfortable trying alternatives. Then you can mix in RMSprop and Adam. Learn how to use them before jumping into the latest deep learning optimizer.

Becoming familiar with a given optimization algorithm is similar to mastering how to drive a car — you drive your own car better than other people’s cars because you’ve spent so much time driving it; you understand your car and its intricacies.

Oftentimes, a given optimizer is chosen to train a network on a dataset not because the optimizer itself is better, but because the driver (i.e., you, the deep learning practitioner) is more familiar with the optimizer and understands the “art” behind tuning its respective parameters.

As a deep learning practitioner you should gain experience operating a wide variety of optimizers, but in my opinion, you should focus your efforts on learning how to train networks using the three following optimizers:

  1. SGD
  2. RMSprop
  3. Adam

You might be surprised to see SGD is included in this list — isn’t SGD an older, less efficient optimizer than the newer adaptive methods, including Adam, Adagrad, Adadelta, etc.?

Yes, it absolutely is.

But here’s the thing — nearly every state-of-the-art computer vision model is trained using SGD.

Consider the ImageNet classification challenge for example:

  • AlexNet (there’s no mention in the paper but both the official implementation and CaffeNet used SGD)
  • VGGNet (Section 3.1, Training)
  • ResNet (Section 3.4, Implementation)
  • SqueezeNet (it’s not mentioned in the paper, but SGD was used in their solver.prototxt)

Every single one of those classification networks was trained using SGD.

Now let’s consider the object detection networks trained on the COCO dataset:

You guessed it — SGD was used to train all of them.

Yes, SGD may be the “old, unsexy” optimizer compared to its younger counterparts, but here’s the thing: standard SGD just works.

That’s not to say that you shouldn’t learn how to use other optimizers — you absolutely should!

But before you go down that rabbit hole, obtain a mastery level of SGD first. From there, start exploring other optimizers — I typically recommend RMSprop and Adam.

And if you find Adam is working well, consider replacing Adam with Rectified Adam to see if you can get an additional boost in accuracy (sort of like how replacing ReLUs with ELUs can usually give you a small boost).

Once you understand how to use those optimizers on a variety of datasets, continue your studies and explore other optimizers as well.

All that said, if you’re new to deep learning, don’t immediately try jumping into the more “advanced” optimizers — you’ll only run into trouble later in your deep learning career.

What’s next?

Figure 4: My deep learning book, Deep Learning for Computer Vision with Python, is trusted by employees and students of top institutions.

If you’re interested in diving head-first into the world of computer vision/deep learning and discovering how to:

  • Understand, practice, and proficiently operate each of the “big three” optimizers
  • Select the best optimizer for the job to achieve state-of-the-art results
  • Train custom Convolutional Neural Networks on your own custom datasets
  • Learn my best practices, tips, and suggestions (leading you to becoming a deep learning expert)

…then be sure to take a look at my book, Deep Learning for Computer Vision with Python!

My complete, self-study deep learning book is trusted by members of top machine learning schools, companies, and organizations, including Microsoft, Google, Stanford, MIT, CMU, and more!

Readers of my book have gone on to win Kaggle competitions, secure academic grants, and start careers in CV and DL using the knowledge they gained through study and practice.

My book not only teaches the fundamentals, but also teaches advanced techniques, best practices, and tools to ensure that you are armed with practical knowledge and proven coding recipes to tackle nearly any computer vision and deep learning problem presented to you in school, research, or the modern workforce.

Be sure to take a look, and while you’re at it, don’t forget to grab your (free) table of contents + sample chapters.



In this tutorial, we investigated the claims from Liu et al. that the Rectified Adam optimizer outperforms the standard Adam optimizer in terms of:

  1. Better accuracy (or at least identical accuracy when compared to Adam)
  2. And in fewer epochs than standard Adam

To evaluate those claims we trained three CNN models:

  1. ResNet
  2. GoogLeNet
  3. MiniVGGNet

These models were trained on four datasets:

  1. MNIST
  2. Fashion MNIST
  3. CIFAR-10
  4. CIFAR-100

Each combination of model architecture and dataset was trained using two optimizers:

  • Adam
  • Rectified Adam

In total, we ran 3 x 4 x 2 = 24 different experiments used to compare standard Adam to Rectified Adam.
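
The experiment count is simply the Cartesian product of the three lists above. As a quick sketch (the variable names are mine and are not taken from the training script), you could enumerate the grid like this:

```python
from itertools import product

models = ["ResNet", "GoogLeNet", "MiniVGGNet"]
datasets = ["MNIST", "Fashion MNIST", "CIFAR-10", "CIFAR-100"]
optimizers = ["Adam", "Rectified Adam"]

# Every (model, dataset, optimizer) combination is one experiment
experiments = list(product(models, datasets, optimizers))

print(len(experiments))
for model, dataset, optimizer in experiments[:3]:
    print(model, dataset, optimizer)
```

Looping over a grid like this is also a convenient way to make sure the only thing that changes between two paired runs is the optimizer.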

The result?

In each and every experiment Rectified Adam either performed worse or obtained identical accuracy compared to standard Adam.

That said, training with Rectified Adam was more stable than standard Adam, likely implying that Rectified Adam could generalize better (but additional experiments would be required to validate that claim).

Liu et al.’s study of warmup can be utilized in adaptive learning rate optimizers and will likely help future researchers build on their work and create even better optimizers.
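
If you want to see how Rectified Adam builds warmup into the update itself, here is a minimal sketch of the rectification term as I understand it from the Liu et al. paper. The function name `rectification_term` and the default `beta2=0.999` are my own illustrative choices, and this is not the Keras implementation used in the experiments above.

```python
import math


def rectification_term(t, beta2=0.999):
    """Return the RAdam rectification factor r_t for step t (t >= 1).

    Returns None while the variance of the adaptive learning rate is
    considered intractable (rho_t <= 4), i.e. during the earliest steps.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Too few steps so far: the paper falls back to an un-adapted momentum update
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


for step in (1, 10, 100, 10000):
    print(step, rectification_term(step))
```

The factor starts small and climbs toward 1 as training proceeds, which is why it behaves like a warmup schedule that needs no extra hyperparameters.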

For the time being, my personal opinion is that you’re better off sticking with standard Adam for your initial experiments. If you find that Adam is working well for your experiments, substitute in Rectified Adam to see if you can improve your accuracy.

You should especially try to use the Rectified Adam optimizer if you notice that Adam is working well, but you need better generalization.

The second takeaway from this guide is that you should obtain mastery level experience operating these three optimizers:

  1. SGD
  2. RMSprop
  3. Adam

You should especially learn how to operate SGD.

Yes, SGD is “less sexy” compared to the newer adaptive learning rate methods, but nearly every computer vision state-of-the-art architecture has been trained using it.

Learn how to operate these three optimizers first.

Once you have a good understanding of how they work and how to tune their respective hyperparameters, then move on to other optimizers.

If you need help learning how to use these optimizers and tune their hyperparameters, be sure to refer to Deep Learning for Computer Vision with Python where I cover my tips, suggestions, and best practices in detail.

To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), just enter your email address in the form below!



17 Responses to Is Rectified Adam actually *better* than Adam?

  1. Sovit Ranjan Rath October 7, 2019 at 11:16 am #

    Wow! Really great tutorial. Your posts are really some of the best. One can easily read your post and with his own experimentation and dataset create a big project out of it. Thanks for sharing such knowledge.

    If you don’t mind, can I ask you how many hours a week you spend on deep learning computer vision tasks? I am asking because I am into deep learning and computer vision myself.

    • Adrian Rosebrock October 8, 2019 at 12:28 pm #

      Thanks Sovit, I’m glad you enjoyed the post 🙂

      As for how much time I spend each week on CV/DL tasks, it’s quite a bit. Studying CV and DL is my passion.

  2. Varun Anand October 7, 2019 at 12:57 pm #

    Hey Adrian! Top-level stuff as usual!

    One clarification though, on line 121 of the training script, is there any particular reason as to why you are using np.unique to calculate the number of classes? Wouldn’t the same value be available as the length of the label names dictionary?

    • Adrian Rosebrock October 8, 2019 at 12:29 pm #

      You mean from the labelNames list? You could do that or you could take the unique count of label integers — either will work.

  3. Abkul October 7, 2019 at 2:02 pm #

    Excellent tutorial.

    This is one gray area which was confusing, but now well explained.

    Kindly do optimization approaches for inter/or intra species transfer learning.

  4. Xu Zhang October 7, 2019 at 7:24 pm #

    Thank you so much for your great post.

    Training a deep neural network is a stochastic process, so if we run it several times we will get different results. I didn’t find where you set your random seeds. Even if we set a random seed, we still cannot get reproducible results. Are your results from a single run or an average of multiple runs?

    • Adrian Rosebrock October 8, 2019 at 12:29 pm #

      These are single runs. If you wanted to I would recommend running each experiment 5-10 times and then computing averages and standard deviations. I’ll leave that as an exercise to you, the reader.

      • Xu Zhang October 9, 2019 at 2:47 pm #

        Thank you for your reply. If we run them 5-10 times and get the averages, when we compare the results, maybe we will get different conclusions to yours.

  5. Pawel October 8, 2019 at 3:23 pm #

    I am always having this feeling that RMSprop and Adam are having some kind of learning rate decay implemented as default. I don’t say they are bad! They are great for me especially when I am trying to choose the best model for my task just for fast recognition – I don’t have to focus on hyper parameters so deeply. And when I choose right model for my task I always use SGD. It’s very power full especially when tuning hyperparameters. And of course the more I use the better I feel SGD! Thanks!

    • Adrian Rosebrock October 10, 2019 at 10:15 am #

      Thanks Pawel.

  6. Frank October 9, 2019 at 12:59 pm #

    After reading the previous post, I find this one… Sorry for the duplicated comments.

    Thanks for your post and comparison and all!!!

    However, the setting for the RAdam optimizer does not seem to be right, and the comparison does not seem to be fair.

    In the current version, RAdam uses warmup and linear learning rate decay (in their paper, they didn’t use warmup, and use the same learning rate scheduler as SGDM).

    Why this setting will be a problem?

    1. since ‘total steps’ is set to 5000, the learning rate of RAdam will become 1e-5 (min_lr) after the first 5000 updates, which is too small. At the same time, Adam will have constant learning rate 1e-3. It explains why in Figure 3, RAdam cannot further improve the performance (the learning rate is too small).

    2. better not to use warmup (in the official pytorch implementation, they don’t have warmup). Using warmup requires additional hyper-parameter tuning, also, as mentioned before, a wrongly configured setting has catastrophic effects.

    For your comparison, I would recommend to use the original setting for RAdam (without warmup or linear learning rate decay), and use the same learning rate with Adam.

    PS: for image tasks, it is common to use learning rate scheduler like MultiStepLR, especially for SGDM.

    PPS: for beginners, I would recommend to try RAdam first, then turn to SGDM with carefully tuned hyper-parameters. SGDM usually can lead to a better performance, but really requires some hyper-parameter tuning.

    • Adrian Rosebrock October 10, 2019 at 10:13 am #

      Hey Frank, you are partially right in the sense that this is not a fair comparison for the reasons you detailed. However, keep in mind that the point of this post is to compare the default optimizer parameters like they did in the original paper. You also pointed out that Rectified Adam has additional hyperparameters to tune (in regards to warmup); that’s also my point in the post (we wanted to test the default values). Rectified Adam can generalize better, but I genuinely recommend that if you find standard Adam working well, then, and only then, start substituting in RAdam and start tuning its hyperparameters.

      • Frank October 17, 2019 at 11:42 am #

        Hey Adrian,

        Appreciate your reply. I understand your concern, however, the default optimizer (by the paper / official repo) is different from your setting. The original RAdam does not employ additional warmup (nor require additional hyper-parameters). These additional hyper-parameters are introduced by the re-implementation, not the original RAdam… I really like these comparisons, and think such studies are important; really hope you can fix these hyper-parameters (the devil is in the details, especially for deep learning).

        • Hong December 2, 2019 at 6:12 pm #

          Thanks very much for the study!

          Also, I want to point out the keras_radam implementation uses a different epsilon value from the official repository version (while the official one is the same as Adam’s), the epsilon here is actually epsilon * sqrt(1-beta2^t).

          As RAdam’s author mentioned in their paper, the epsilon value is very relevant to the learning performance.
          Given this, I am not very convinced by the experiment results here.

  7. Shivam October 13, 2019 at 10:56 am #

    Hey Adrian,
    Thank you for all these amazing tutorials. I came across a doubt while going through your posts. In your code you sometimes use LabelEncoder to encode the labels and sometimes LabelBinarizer. I thought that when we use LabelBinarizer our last layer should be a softmax layer, and if we go with LabelEncoder we’ll be using the sigmoid activation. But then I came across code where you used LabelEncoder with softmax.
    So can you tell me when to use which one?
    Thanks 🙂

    • Adrian Rosebrock October 17, 2019 at 7:04 am #

      A LabelEncoder will encode class labels as integers. A LabelBinarizer performs one-hot encoding. You typically use LabelBinarizer when training CNNs for classification tasks. You can learn more about datasets, label encoding, and how to properly train your networks inside Deep Learning for Computer Vision with Python.
