ImageNet: VGGNet, ResNet, Inception, and Xception with Keras

A few months ago I wrote a tutorial on how to classify images using Convolutional Neural Networks (specifically, VGG16) pre-trained on the ImageNet dataset with Python and the Keras deep learning library.

The pre-trained networks inside of Keras are capable of recognizing, with high accuracy, 1,000 different object categories similar to the objects we encounter in our day-to-day lives.

Back then, the pre-trained ImageNet models were separate from the core Keras library, requiring us to clone a free-standing GitHub repo and then manually copy the code into our projects.

This solution worked well enough; however, since my original blog post was published, the pre-trained networks (VGG16, VGG19, ResNet50, Inception V3, and Xception) have been fully integrated into the Keras core (no need to clone down a separate repo anymore) — these implementations can be found inside the applications sub-module.

Because of this, I’ve decided to create a new, updated tutorial that demonstrates how to utilize these state-of-the-art networks in your own classification projects.

Specifically, we’ll create a special Python script that can load any of these networks using either a TensorFlow or Theano backend, and then classify your own custom input images.

To learn more about classifying images with VGGNet, ResNet, Inception, and Xception, just keep reading.


VGGNet, ResNet, Inception, and Xception with Keras

In the first half of this blog post I’ll briefly discuss the VGG, ResNet, Inception, and Xception network architectures included in the Keras library.

We’ll then create a custom Python script using Keras that can load these pre-trained network architectures from disk and classify your own input images.

Finally, we’ll review the results of these classifications on a few sample images.

State-of-the-art deep learning image classifiers in Keras

Keras ships out-of-the-box with five Convolutional Neural Networks that have been pre-trained on the ImageNet dataset:

  1. VGG16
  2. VGG19
  3. ResNet50
  4. Inception V3
  5. Xception

Let’s start with an overview of the ImageNet dataset and then move into a brief discussion of each network architecture.

What is ImageNet?

ImageNet is formally a project aimed at (manually) labeling and categorizing images into almost 22,000 separate object categories for the purpose of computer vision research.

However, when we hear the term “ImageNet” in the context of deep learning and Convolutional Neural Networks, we are likely referring to the ImageNet Large Scale Visual Recognition Challenge, or ILSVRC for short.

The goal of this image classification challenge is to train a model that can correctly classify an input image into 1,000 separate object categories.

Models are trained on ~1.2 million training images with another 50,000 images for validation and 100,000 images for testing.

These 1,000 image categories represent object classes that we encounter in our day-to-day lives, such as species of dogs, cats, various household objects, vehicle types, and much more. You can find the full list of object categories in the ILSVRC challenge here.

When it comes to image classification, the ImageNet challenge is the de facto benchmark for computer vision classification algorithms — and the leaderboard for this challenge has been dominated by Convolutional Neural Networks and deep learning techniques since 2012.

The state-of-the-art pre-trained networks included in the Keras core library represent some of the highest performing Convolutional Neural Networks on the ImageNet challenge over the past few years. These networks also demonstrate a strong ability to generalize to images outside the ImageNet dataset via transfer learning, such as feature extraction and fine-tuning.

VGG16 and VGG19

Figure 1: A visualization of the VGG architecture (source).

The VGG network architecture was introduced by Simonyan and Zisserman in their 2014 paper, Very Deep Convolutional Networks for Large Scale Image Recognition.

This network is characterized by its simplicity, using only 3×3 convolutional layers stacked on top of each other in increasing depth. Reducing volume size is handled by max pooling. Two fully-connected layers, each with 4,096 nodes, are then followed by a softmax classifier (above).

The “16” and “19” stand for the number of weight layers in the network (columns D and E in Figure 2 below):

Figure 2: Table 1 of Very Deep Convolutional Networks for Large Scale Image Recognition, Simonyan and Zisserman (2014).

In 2014, 16 and 19 layer networks were considered very deep (although we now have the ResNet architecture which can be successfully trained at depths of 50-200 for ImageNet and over 1,000 for CIFAR-10).

Simonyan and Zisserman found training VGG16 and VGG19 challenging (specifically regarding convergence on the deeper networks), so in order to make training easier, they first trained smaller versions of VGG with fewer weight layers (columns A and C).

The smaller networks converged and were then used as initializations for the larger, deeper networks — this process is called pre-training.

While making logical sense, pre-training is a very time consuming, tedious task, requiring an entire network to be trained before it can serve as an initialization for a deeper network.

We no longer use pre-training (in most cases) and instead prefer Xavier/Glorot initialization or MSRA initialization (sometimes called He et al. initialization from the paper, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification). You can read more about the importance of weight initialization and the convergence of deep neural networks inside All you need is a good init, Mishkin and Matas (2015).

Unfortunately, there are two major drawbacks with VGGNet:

  1. It is painfully slow to train.
  2. The network architecture weights themselves are quite large (in terms of disk/bandwidth).

Due to its depth and number of fully-connected nodes, VGG is over 533MB for VGG16 and 574MB for VGG19. This makes deploying VGG a tiresome task.
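To see where that size comes from, here is a rough back-of-the-envelope sketch that counts the weights in VGG16. It assumes configuration D from the paper (thirteen 3×3 convolutional layers plus three fully-connected layers), a 224×224×3 input, biases on every layer, and 32-bit float weights. The helper names are made up for illustration and are not part of Keras.

```python
# Parameter count for VGG16 (configuration D). Assumptions: 3x3 convolutions
# with biases, a 224x224x3 input, and 4 bytes (float32) per weight.

# (input_channels, output_channels) for each of the 13 convolutional layers
CONV_LAYERS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

# (inputs, outputs) for the 3 fully-connected layers. Five 2x2 max pools
# shrink 224x224 down to 7x7, so the first FC layer sees 7 * 7 * 512 values.
FC_LAYERS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, k=3):
    """Weights plus biases for one k x k convolutional layer."""
    return k * k * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases for one fully-connected layer."""
    return n_in * n_out + n_out


conv_total = sum(conv_params(i, o) for i, o in CONV_LAYERS)
fc_total = sum(fc_params(i, o) for i, o in FC_LAYERS)
total = conv_total + fc_total
size_mib = total * 4 / 2**20  # float32 weights, expressed in MiB

print(f"conv layers: {conv_total:,} parameters")
print(f"FC layers:   {fc_total:,} parameters ({fc_total / total:.0%} of total)")
print(f"total:       {total:,} parameters, about {size_mib:.0f} MiB")
```

Under these assumptions the three fully-connected layers hold roughly 89% of the weights, which is why the serialized model is so large even though the convolutional stack does most of the computation.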

We still use VGG in many deep learning image classification problems; however, smaller network architectures are often more desirable (such as SqueezeNet, GoogLeNet, etc.).


ResNet

Unlike traditional sequential network architectures such as AlexNet, OverFeat, and VGG, ResNet is instead a form of “exotic architecture” that relies on micro-architecture modules (also called “network-in-network architectures”).

The term micro-architecture refers to the set of “building blocks” used to construct the network. A collection of micro-architecture building blocks (along with your standard CONV, POOL, etc. layers) leads to the macro-architecture (i.e., the end network itself).

First introduced by He et al. in their 2015 paper, Deep Residual Learning for Image Recognition, the ResNet architecture has become a seminal work, demonstrating that extremely deep networks can be trained using standard SGD (and a reasonable initialization function) through the use of residual modules:

Figure 3: The residual module in ResNet as originally proposed by He et al. in 2015.
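The core idea of the residual module can be sketched in a few lines of NumPy. This is a deliberately simplified stand-in: it uses two small dense (matrix multiply) layers instead of the convolution and batch normalization layers in the actual module, and the function names are hypothetical. What it does preserve is the shortcut connection, where the input is added back to the output of the learned branch.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Simplified residual module: relu(x + F(x)).

    F is two dense layers here (an assumption made for brevity);
    the real ResNet module uses convolutions and batch normalization.
    """
    fx = w2 @ relu(w1 @ x)  # the residual branch F(x)
    return relu(x + fx)     # the shortcut adds the input back in


x = np.array([1.0, -2.0, 3.0])

# With all-zero weights the branch contributes nothing, so the block
# simply passes relu(x) through. This is the "easy to learn identity" case.
zeros = np.zeros((3, 3))
out_identity = residual_block(x, zeros, zeros)

# With identity weights the branch adds relu(x) on top of x.
out = residual_block(x, np.eye(3), np.eye(3))

print(out_identity)
print(out)
```

Because the block only has to learn a correction on top of its input, a layer that has nothing useful to add can fall back to passing the signal through, which is one intuition for why very deep stacks of these modules remain trainable.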

Further accuracy can be obtained by updating the residual module to use identity mappings, as demonstrated in their 2016 followup publication, Identity Mappings in Deep Residual Networks:

Figure 4: (Left) The original residual module. (Right) The updated residual module using pre-activation.

That said, keep in mind that the ResNet50 (as in 50 weight layers) implementation in the Keras core is based on the former 2015 paper.

Even though ResNet is much deeper than VGG16 and VGG19, the model size is actually substantially smaller due to the usage of global average pooling rather than fully-connected layers — this reduces the model size down to 102MB for ResNet50.
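A small sketch makes the effect of global average pooling concrete. It assumes the final convolutional feature map in ResNet50 is 7×7×2048 for a 224×224 input and compares the size of a 1,000-way classifier placed after pooling against one placed on the flattened volume. The random feature map is a stand-in for real network activations.

```python
import numpy as np

# Assumed shape of the last convolutional feature map in ResNet50
# for a 224x224 input: 7 x 7 spatial positions, 2048 channels.
feature_map = np.random.rand(7, 7, 2048)

# Global average pooling keeps one number per channel: the mean over
# all spatial positions.
pooled = feature_map.mean(axis=(0, 1))

# A 1,000-way classifier on the pooled vector (weights + biases)
gap_head_params = 2048 * 1000 + 1000

# The same classifier on the flattened 7 x 7 x 2048 volume
flatten_head_params = 7 * 7 * 2048 * 1000 + 1000

# For comparison, the first fully-connected layer of VGG16
vgg_first_fc_params = 7 * 7 * 512 * 4096 + 4096

print(pooled.shape)
print(f"{gap_head_params:,} vs {flatten_head_params:,} vs {vgg_first_fc_params:,}")
```

Pooling first cuts the classifier to about two million weights, compared with over one hundred million for the first fully-connected layer in VGG16 alone.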

Inception V3

The “Inception” micro-architecture was first introduced by Szegedy et al. in their 2014 paper, Going Deeper with Convolutions:

Figure 5: The original Inception module used in GoogLeNet.

The goal of the inception module is to act as a “multi-level feature extractor” by computing 1×1, 3×3, and 5×5 convolutions within the same module of the network. The outputs of these filters are then stacked along the channel dimension before being fed into the next layer in the network.
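The stacking step is just a concatenation along the channel axis. The sketch below uses zero-filled arrays as stand-ins for the outputs of each branch. The spatial size and filter counts are illustrative assumptions, and the pooling branch is included because it appears in the original module in Figure 5.

```python
import numpy as np

# Stand-ins for the outputs of each branch of one Inception module.
# All branches must share the same spatial size (here 28 x 28);
# only the number of channels differs. Filter counts are illustrative.
h, w = 28, 28
branch_1x1 = np.zeros((h, w, 64))
branch_3x3 = np.zeros((h, w, 128))
branch_5x5 = np.zeros((h, w, 32))
branch_pool = np.zeros((h, w, 32))

# Stack the branch outputs along the channel (last) dimension.
module_output = np.concatenate(
    [branch_1x1, branch_3x3, branch_5x5, branch_pool], axis=-1
)

print(module_output.shape)
```

The next layer therefore sees features computed at several receptive field sizes side by side, with 64 + 128 + 32 + 32 = 256 channels in this example.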

The original incarnation of this architecture was called GoogLeNet, but subsequent manifestations have simply been called Inception vN where N refers to the version number put out by Google.

The Inception V3 architecture included in the Keras core comes from the later publication by Szegedy et al., Rethinking the Inception Architecture for Computer Vision (2015) which proposes updates to the inception module to further boost ImageNet classification accuracy.

The weights for Inception V3 are smaller than both VGG and ResNet, coming in at 96MB.


Xception

Figure 6: The Xception architecture.

Xception was proposed by none other than François Chollet himself, the creator and chief maintainer of the Keras library.

Xception is an extension of the Inception architecture which replaces the standard Inception modules with depthwise separable convolutions.
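To get a feel for why depthwise separable convolutions are attractive, the following sketch compares weight counts for a single layer. The layer sizes (3×3 kernel, 128 input channels, 256 output channels) are arbitrary values chosen for illustration, biases are ignored, and the function names are hypothetical.

```python
def standard_conv_weights(k, c_in, c_out):
    """A standard convolution learns one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Depthwise separable convolution, split into two steps:

    1. depthwise: one k x k filter per input channel
    2. pointwise: a 1 x 1 convolution that mixes channels
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer sizes (assumptions, not taken from the Xception paper)
k, c_in, c_out = 3, 128, 256

standard = standard_conv_weights(k, c_in, c_out)
separable = separable_conv_weights(k, c_in, c_out)

print(f"standard:  {standard:,} weights")
print(f"separable: {separable:,} weights")
print(f"reduction: {standard / separable:.1f}x")
```

For this layer the separable version needs 33,920 weights instead of 294,912, a reduction of a little under 9x.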

The original publication, Xception: Deep Learning with Depthwise Separable Convolutions can be found here.

Xception sports the smallest weight serialization at only 91MB.

What about SqueezeNet?

Figure 7: The “fire” module in SqueezeNet, consisting of a “squeeze” and an “expand”. (Iandola et al., 2016).

For what it’s worth, the SqueezeNet architecture can obtain AlexNet-level accuracy (~57% rank-1 and ~80% rank-5) at only 4.9MB through the usage of “fire” modules that “squeeze” and “expand”.

While leaving a small footprint, SqueezeNet can also be very tricky to train.

That said, I demonstrate how to train SqueezeNet from scratch on the ImageNet dataset inside my upcoming book, Deep Learning for Computer Vision with Python.

Classifying images with VGGNet, ResNet, Inception, and Xception with Python and Keras

Let’s learn how to classify images with pre-trained Convolutional Neural Networks using the Keras library.

Open up a new file, name it , and insert the following code:

Lines 2-13 import our required Python packages. As you can see, most of the packages are part of the Keras library.

Specifically, Lines 2-6 handle importing the Keras implementations of ResNet50, Inception V3, Xception, VGG16, and VGG19, respectively.

Please note that the Xception network is compatible only with the TensorFlow backend (the class will throw an error if you try to instantiate it with a Theano backend).

Line 7 gives us access to the imagenet_utils  sub-module, a handy set of convenience functions that will make pre-processing our input images and decoding output classifications easier.

The remainder of the imports are other helper functions, followed by NumPy for numerical processing and cv2  for our OpenCV bindings.

Next, let’s parse our command line arguments:

We’ll require only a single command line argument, --image , which is the path to our input image that we wish to classify.

We’ll also accept an optional command line argument, --model , a string that specifies which pre-trained Convolutional Neural Network we would like to use — this value defaults to vgg16  for the VGG16 network architecture.
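A minimal sketch of this argument parsing with the standard library `argparse` module might look like the following. The help strings and the `parse_args` wrapper are assumptions added for illustration, and `example.jpg` is a placeholder path.

```python
import argparse


def parse_args(argv=None):
    """Parse the two command line arguments described above.

    Passing argv=None makes argparse read from sys.argv, as a script would.
    """
    ap = argparse.ArgumentParser()
    ap.add_argument("--image", required=True,
                    help="path to the input image")
    ap.add_argument("--model", type=str, default="vgg16",
                    help="name of the pre-trained network to use")
    return vars(ap.parse_args(argv))


# Example: only --image is supplied, so --model falls back to its default.
args = parse_args(["--image", "example.jpg"])
print(args)
```

Converting the parsed namespace to a dictionary with `vars` lets the rest of the script look values up as `args["image"]` and `args["model"]`.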

Given that we accept the name of our pre-trained network via a command line argument, we need to define a Python dictionary that maps the model names (strings) to their actual Keras classes:

Lines 25-31 define our MODELS  dictionary which maps a model name string to the corresponding class.

If the --model  name is not found inside MODELS , we’ll raise an AssertionError  (Lines 34-36).
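The lookup and validation can be sketched as below. Note two deliberate differences from the description above, both assumptions made for this sketch: the dictionary stores class names as strings rather than the classes themselves, so that Keras is only imported when a model is actually requested, and every key other than `vgg16` is a guess at a reasonable name. The class names (`VGG16`, `VGG19`, `InceptionV3`, `Xception`, `ResNet50`) are the ones exposed by `keras.applications`.

```python
# Maps a --model name to the class name inside keras.applications.
# Keys other than "vgg16" are assumed names chosen for this sketch.
MODEL_CLASS_NAMES = {
    "vgg16": "VGG16",
    "vgg19": "VGG19",
    "inception": "InceptionV3",
    "xception": "Xception",  # TensorFlow backend only
    "resnet": "ResNet50",
}


def get_network_class(name):
    """Return the Keras class for a model name, or raise AssertionError."""
    if name not in MODEL_CLASS_NAMES:
        raise AssertionError(
            "The --model argument should be one of: "
            + ", ".join(sorted(MODEL_CLASS_NAMES))
        )
    # Deferred import so the validation above works without Keras installed.
    import keras.applications as applications
    return getattr(applications, MODEL_CLASS_NAMES[name])


def load_network(name):
    """Instantiate the network with its pre-trained ImageNet weights.

    The first call for a given network downloads and caches the weights.
    """
    Network = get_network_class(name)
    return Network(weights="imagenet")
```

With Keras installed, calling `load_network("vgg16")` would return a ready-to-use model. An unknown name fails fast before any weights are downloaded.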

A Convolutional Neural Network takes an image as an input and then returns a set of probabilities corresponding to the class labels as output.

Typical input image sizes to a Convolutional Neural Network trained on ImageNet are 224×224, 227×227, 256×256, and 299×299; however, you may see other dimensions as well.

VGG16, VGG19, and ResNet all accept 224×224 input images while Inception V3 and Xception require 299×299 pixel inputs, as demonstrated by the following code block:

Here we initialize our inputShape  to be 224×224 pixels. We also initialize our preprocess  function to be the standard preprocess_input  from Keras (which performs mean subtraction).

However, if we are using Inception or Xception, we need to set the inputShape  to 299×299 pixels, followed by updating preprocess  to use a separate pre-processing function that performs a different type of scaling.
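The two kinds of pre-processing can be written out by hand in NumPy to show what they do. This is a reimplementation for illustration only, not the Keras code itself. It assumes the standard helper converts RGB to BGR and subtracts the per-channel ImageNet means, that the Inception-style helper scales pixels to the range [-1, 1], and that the model names match the hypothetical `--model` values `inception` and `xception`.

```python
import numpy as np


def input_shape_for(model_name):
    """224x224 by default, 299x299 for Inception V3 and Xception."""
    if model_name in ("inception", "xception"):
        return (299, 299)
    return (224, 224)


def mean_subtraction_preprocess(rgb):
    """Assumed behavior of the standard helper: RGB -> BGR, then subtract
    the ImageNet channel means. No scaling is applied."""
    bgr = rgb[..., ::-1].astype("float32")  # reverse the channel order
    bgr -= np.array([103.939, 116.779, 123.68], dtype="float32")
    return bgr


def scaling_preprocess(rgb):
    """Assumed behavior of the Inception-style helper: map [0, 255] to [-1, 1]."""
    return rgb.astype("float32") / 127.5 - 1.0


# A single RGB pixel equal to the channel means becomes all zeros.
mean_pixel = np.array([[[123.68, 116.779, 103.939]]])
print(mean_subtraction_preprocess(mean_pixel))

# Black and white map to the two ends of the [-1, 1] range.
print(scaling_preprocess(np.array([0.0, 255.0])))
```

Feeding a network images prepared with the wrong function will not raise an error, but the predictions will be poor, so it is worth double-checking which one each model expects.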

The next step is to load our pre-trained network architecture weights from disk and instantiate our model:

Line 58 uses the MODELS  dictionary along with the --model  command line argument to grab the correct Network  class.

The Convolutional Neural Network is then instantiated on Line 59 using the pre-trained ImageNet weights.

Note: Weights for VGG16 and VGG19 are > 500MB. ResNet weights are ~100MB, while Inception and Xception weights are between 90-100MB. If this is the first time you are running this script for a given network, these weights will be (automatically) downloaded and cached to your local disk. Depending on your internet speed, this may take a while. However, once the weights are downloaded, they will not need to be downloaded again, allowing subsequent runs to be much faster.

Our network is now loaded and ready to classify an image — we just need to prepare this image for classification:

Line 65 loads our input image from disk using the supplied inputShape  to resize the width and height of the image.

Line 66 converts the image from a PIL/Pillow instance to a NumPy array.

Our input image is now represented as a NumPy array with the shape (inputShape[0], inputShape[1], 3) .

However, we typically train/classify images in batches with Convolutional Neural Networks, so we need to add an extra dimension to the array via np.expand_dims  on Line 72.

After calling np.expand_dims  the image  has the shape (1, inputShape[0], inputShape[1], 3) . Forgetting to add this extra dimension will result in an error when you call .predict  of the model .

Lastly, Line 76 calls the appropriate pre-processing function to perform mean subtraction/scaling.
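The shape change introduced by `np.expand_dims` is easy to check in isolation. In this sketch a zero-filled array stands in for the image that would normally come from loading a file and converting it to a NumPy array.

```python
import numpy as np

input_shape = (224, 224)

# Stand-in for a loaded image: height x width x 3 color channels.
image = np.zeros((input_shape[0], input_shape[1], 3), dtype="float32")

# Add a leading "batch" dimension so the array describes a batch
# containing exactly one image.
batch = np.expand_dims(image, axis=0)

print(image.shape)
print(batch.shape)
```

The network always expects a batch, so a single image is passed as a batch of size one.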

We are now ready to pass our image through the network and obtain the output classifications:

A call to .predict  on Line 80 returns the predictions from the Convolutional Neural Network.

Given these predictions, we pass them into the ImageNet utility function .decode_predictions  to give us a list of ImageNet class label IDs, “human-readable” labels, and the probability associated with the labels.

The top-5 predictions (i.e., the labels with the largest probabilities) are then printed to our terminal on Lines 85 and 86.
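Conceptually, decoding the top-5 predictions means sorting the output probabilities and pairing the largest ones with their labels. The sketch below does this by hand on a made-up six-class output. The labels and probabilities are invented for illustration, and `top_k` is a hypothetical helper rather than the Keras utility.

```python
import numpy as np


def top_k(probabilities, labels, k=5):
    """Return the k (label, probability) pairs with the largest probabilities."""
    order = np.argsort(probabilities)[::-1][:k]  # indexes, largest first
    return [(labels[i], float(probabilities[i])) for i in order]


# Made-up example output: six classes instead of ImageNet's 1,000.
labels = ["rugby_ball", "soccer_ball", "volleyball",
          "golf_ball", "football_helmet", "tennis_ball"]
probs = np.array([0.10, 0.70, 0.08, 0.05, 0.04, 0.03])

for rank, (label, prob) in enumerate(top_k(probs, labels), start=1):
    print(f"{rank}. {label}: {prob * 100:.2f}%")
```

The real utility additionally maps each index to its ImageNet class ID and human-readable name, but the ranking logic is the same idea.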

The last thing we’ll do here before we close out our example is load our input image from disk via OpenCV, draw the #1 prediction on the image, and finally display the image to our screen:

To see our pre-trained ImageNet networks in action, take a look at the next section.

VGGNet, ResNet, Inception, and Xception classification results

All examples in this blog post were gathered using Keras >= 2.0 and a TensorFlow backend. If you are using TensorFlow, make sure you are using version >= 1.0, otherwise you will run into errors. I’ve also tested this script with the Theano backend and confirmed that the implementation will work with Theano as well.

Once you have TensorFlow/Theano and Keras installed, make sure you download the source code + example images to this blog post using the “Downloads” section at the bottom of the tutorial.

From there, let’s try classifying an image with VGG16:

Figure 8: Classifying a soccer ball using VGG16 pre-trained on the ImageNet database using Keras (source).

Taking a look at the output, we can see VGG16 correctly classified the image as “soccer ball” with 93.43% probability.

To use VGG19, we simply need to change the --model  command line argument:

Figure 9: Classifying a vehicle as “convertible” using VGG19 and Keras (source).

VGG19 is able to correctly classify the input image as “convertible” with a probability of 91.76%. However, take a look at the other top-5 predictions: sports car with 4.98% probability (which the car is), limousine at 1.06% (incorrect, but still reasonable), and “car wheel” at 0.75% (also technically correct since there are car wheels in the image).

We can see similar levels of top-5 accuracy in the following example where we use the pre-trained ResNet architecture:

Figure 10: Using ResNet pre-trained on ImageNet with Keras + Python (source).

ResNet correctly classifies this image of Clint Eastwood holding a gun as “revolver” with 69.79% probability. It’s also interesting to see “rifle” at 7.74% and “assault rifle” at 5.63% included in the top-5 predictions. Given the viewing angle of the revolver and the substantial length of the barrel (for a handgun), it’s easy to see how a Convolutional Neural Network would also return higher probabilities for a rifle.

This next example attempts to classify the species of dog using ResNet:

Figure 11: Classifying dog species using ResNet, Keras, and Python.

The species of dog is correctly identified as “beagle” with 94.48% confidence.

I then tried classifying the following image of Johnny Depp from the Pirates of the Caribbean franchise:

Figure 12: Classifying a ship wreck with ResNet pre-trained on ImageNet with Keras (source).

While there is indeed a “boat” class in ImageNet, it’s interesting to see that the Inception network was able to correctly identify the scene as a “(ship) wreck” with 96.29% probability. The other predicted labels, including “seashore”, “canoe”, “paddle”, and “breakwater”, are all relevant, and in some cases absolutely correct as well.

For another example of the Inception network in action, I took a photo of the couch sitting in my office:

Figure 13: Recognizing various objects in an image with Inception V3, Python, and Keras.

Inception correctly predicts there is a “table lamp” in the image with 69.68% confidence. The other top-5 predictions are also dead-on, including a “studio couch”, “window shade” (far right of the image, barely even noticeable), “lampshade”, and “pillow”.

In the context above, Inception wasn’t even used as an object detector, but it was still able to classify all parts of the image within its top-5 predictions. It’s no wonder that Convolutional Neural Networks make for excellent object detectors!

Moving on to Xception:

Figure 14: Using the Xception network architecture to classify an image (source).

Here we have an image of scotch barrels, specifically my favorite scotch, Lagavulin. Xception correctly classifies this image as “barrels”.

This last example was classified using VGG16:

Figure 15: VGG16 pre-trained on ImageNet with Keras.

The image itself was captured a few months ago as I was finishing up The Witcher III: The Wild Hunt (easily in my top-3 favorite games of all time). The first prediction by VGG16 is “home theatre” — a reasonable prediction given that there is a “television/monitor” in the top-5 predictions as well.

As you can see from the examples in this blog post, networks pre-trained on the ImageNet dataset are capable of recognizing a variety of common day-to-day objects. I hope that you can use this code in your own projects!

What now?


You can now recognize 1,000 separate object categories from the ImageNet dataset using pre-trained state-of-the-art Convolutional Neural Networks.

…but what if you wanted to train your own custom deep learning networks from scratch?

How would you go about it?

Do you know where to start?

Let me help:

Whether this is the first time you’ve worked with machine learning and neural networks or you’re already a seasoned deep learning practitioner, my new book is engineered from the ground up to help you reach deep learning expert status.


Summary

In today’s blog post we reviewed the five Convolutional Neural Networks pre-trained on the ImageNet dataset inside the Keras library:

  1. VGG16
  2. VGG19
  3. ResNet50
  4. Inception V3
  5. Xception

I then demonstrated how to use each of these architectures to classify your own input images using the Keras library and the Python programming language.

If you are interested in learning more about deep learning and Convolutional Neural Networks (and how to train your own networks from scratch), be sure to take a look at my upcoming book, Deep Learning for Computer Vision with Python, available for pre-order now.




96 Responses to ImageNet: VGGNet, ResNet, Inception, and Xception with Keras

  1. Aramis March 20, 2017 at 12:24 pm #

    Dear Adrian;

    Thank you for you nice tutorial. I always learn many new points from your tutorials which organized and explained very-well. I have implemented this code and I could figure out how to use these models with keras. I thought now I can use transfer learning with these pre-trained models and train on my own data.

    However, the main problem with my data is that they are medical images and gray-scale. I could follow the tutorial which proposed by FCohelt but I couldn’t figure out how to change the structure of the models to accept 1 channel data.

    I would be glad if you could give some hint for transfer learning with pre-trained models for not RGB but gray-scale images.


    • Adrian Rosebrock March 21, 2017 at 7:20 am #

      These pre-trained networks assume you are using 3 channel images — you won’t be able to modify them to use 1 channel images unless you train them from scratch. Instead, the solution is to turn your 1 channel image into a 3 channel image:

      image = np.dstack([gray, gray, gray])  # gray: your 1 channel image as a 2D NumPy array

      From there you can pass the image through the network since it’s a 3 channel image (but appears gray).

      I’ll also be discussing transfer learning in great detail in my upcoming book, Deep Learning for Computer Vision with Python.

    • JDk January 15, 2018 at 2:26 am #

      Great Job,

  2. Parth March 20, 2017 at 12:42 pm #

    Hey Adrian,
    Thanks for the blog.
    I was hoping to do Pedestrian/human detection using Convolutional Neural Networks. I have tried using HoG but it didn’t turn out to be super accurate. The problem I am facing with using CNN with ImageNet trained classifiers is that there is no class/label as ‘person’ or ‘human’ or anything of that sort. What do you suggest I do? Could I try training it with INRIA person dataset or something similar? If yes, how?

    • Adrian Rosebrock March 21, 2017 at 7:18 am #

      I would fine-tune one of the networks on a dataset that is representative of the people you want to detect in images. If that’s INRIA, use it for fine-tuning.

  3. nicho March 20, 2017 at 12:59 pm #


  4. Ashti March 20, 2017 at 5:11 pm #


    Not related to this post.
    But i have a query wrt to keyframe extraction from videos.
    Using python and opencv i have to extract keyframes.
    I tried getting frames for each frame and then subtracting from each other and storing unique one which resulted in huge amount of frames.
    I need to calculate pixel difference of frames and compare it with a threshold value. if PD > threshold store it as keyframe. Can you please give me an example on how can i calculate threshold of images which would be fetch me good amount of keyframes. same would be applied for other videos too…

    • Adrian Rosebrock March 21, 2017 at 7:13 am #

      Hey Ashti — I would kindly ask that comments on a particular blog post be related to the subject matter of the post (otherwise it comes off as a bit rude/presumptive). If you want to learn more about comparing images, try this post. Best of luck with the project.

  5. MiaoDX March 21, 2017 at 3:31 am #

    Aha, not so easy for me to point out a typo since there are so many readers and you’re so careful.

    However, in “VGG16 and VGG19” section, “Due to its depth and number of fully-connected nodes, VGG is over 533MB for VGG16 and 574MB for VGG16. This makes deploying VGG a tiresome task.”

    The latter should be VGG19, I think. ^_^

    • Adrian Rosebrock March 21, 2017 at 7:06 am #

      You are correct, thank you for pointing out the typo! It is fixed now.

  6. Ruben March 22, 2017 at 4:32 am #

    When I import from keras.applications import ResNet50, I have the next error:

    ImportError: cannot import name ‘GlobalAveragePooling2D’

    • Adrian Rosebrock March 22, 2017 at 8:34 am #

      Which version of Keras are you running? And which version of TensorFlow/Theano?

      • Ruben March 22, 2017 at 8:59 am #

        Thanks, all is ok with keras-2.0.2 theano -0.9.0

        • Adrian Rosebrock March 22, 2017 at 9:08 am #

          Congrats on resolving the issue!

  7. Abraham George March 23, 2017 at 11:57 am #

    I need to take live images and label it how do i do it?
    I cannot pre process the obtained frame ,what should i do?

    • Adrian Rosebrock March 25, 2017 at 9:34 am #

      For each frame in your video stream you would pass it through the network and obtain the output class labels.

  8. Abraham George March 23, 2017 at 11:59 am #

    what is the difference between parsing an image and reading it using imread?

    • Adrian Rosebrock March 25, 2017 at 9:34 am #

      I’m not sure what you mean Abraham, can you please elaborate on your comment?

  9. Sunggu kim March 24, 2017 at 11:01 am #

    Thank you for great tutorial.

    i’m always wondering about can i append more class to pre-trained network with my data or should i re-train all things?

    If possible we can save huge time and resources.

    Will it be possible?


    • Adrian Rosebrock March 25, 2017 at 9:22 am #

      The process of changing the output classes of a pre-trained network without having to re-train it from scratch is called fine-tuning. I’ll be covering fine-tuning in detail inside Deep Learning for Computer Vision with Python.

      • Sunggu kim March 26, 2017 at 4:31 am #

        Oh you always have a great answer.

        I already bought the course from kickstarter.

        I hope it to be released as soon as possible.

        Thank you.

        • Adrian Rosebrock March 28, 2017 at 1:09 pm #

          Thank you Sunggu Kim! I am working on the book and will ensure it will be released as soon as possible.

  10. MJB March 28, 2017 at 7:19 pm #

    Hi Adrian,

    Great post as always. I was wondering, how one can test the top 1 and top 5 error of this pre-trained model across a standardized data set say Imagenet to compare these in a more scientific way. Any tips?

    • Adrian Rosebrock March 31, 2017 at 2:07 pm #

      Can you elaborate more on what you mean by comparing the top-1 and top-5 accuracies? Normally for benchmark datasets like ImageNet your rank-1 and rank-5 accuracy on the test set is the standardized method to compare algorithms.

  11. ap March 31, 2017 at 8:23 am #

    Thank, excellent !
    As more models emerge having a clean framework to review results with is very helpful, thank you and KERAS. Tested with Keras2/TF1.01 on Windows.

  12. Aurora Guerra April 11, 2017 at 12:33 pm #

    Hi Adrian.
    How could you train a neural network for the recognition of leaf species?
    I would have to create my own network or use an existing network
    Thanks for all post, these are great

    • Adrian Rosebrock April 12, 2017 at 1:05 pm #

      Hey Aurora — I don’t have any blog posts specifically related to leaf species classification, but I’ll keep that in mind for a future blog post. Do you have a link to a leaf dataset you are currently working with?

      In the meantime, be sure to take a look at Deep Learning for Computer Vision with Python where I’ll be discussing training your own deep learning neural networks in detail. A book like this would surely help with your project.

  13. revan April 12, 2017 at 10:08 pm #

    hello sir,
    I’m presently working on image processing project I want to know(step 1) how to differentiate human from animals.(step2)If captured image is human I want to confirm whether the human in the captured image has performed any crime by comparing currently captured image with an image that has been already stored in the database or cloud.
    so, it will b great if u provide code for step 1 n step2 asap…..

    • revan April 12, 2017 at 10:12 pm #

      by the way I’m using raspberry pi3, OpenCV, python language please help me and Guide me…

    • Adrian Rosebrock April 16, 2017 at 9:07 am #

      Differentiating between humans and animals can easily be accomplished via a bit of machine learning or deep learning. Exactly which method you should use is highly dependent on your input images/video streams.

      As for crime detection, that sounds more like “activity recognition” which is not something I cover on PyImageSearch.

  14. Ravi Kishan April 15, 2017 at 4:58 pm #

    Hey Adrian,
    Your tutorial’s are really good. I had an issue which you could help me out with :). I want to store the value of the Tensor at the “Global Pool Layer” in Resnet50 but am unable to do so.
    Would be really nice if you could help me out

    • Adrian Rosebrock April 16, 2017 at 8:52 am #

      So if I understand correctly, you want to pass an image through the network and then take the raw values from the global pool layer prior to the softmax classifier being applied?

  15. kranthi April 16, 2017 at 12:31 am #

    using tensorflow for inception case got attribute error on concat_v2.

    • kranthi April 16, 2017 at 1:19 am #

      hello sir,

      Using tensorflow 1.01 version keras >2 version working for inception.

      Tried with theano 0.90 and keras >2 but not working when i tried with inception. you said only xception has to be run on tensorflow backend.

      error was TypeError: int() argument must be a string, a bytes-like object or a number, not ‘list’.

      everything else worked as given.

      Thanks for great and up to date technology based tutorials.

      • Adrian Rosebrock April 16, 2017 at 8:51 am #

        As I mentioned in the blog post, Xception only works for the TensorFlow backend. As for Inception, this should work without a problem on Theano. Can you try upgrading your Theano version as well?

  16. Jeff April 21, 2017 at 7:08 pm #

    Hi Adrian, this is AWESOME. Thank you very much. One question, is it posible to train my own model and merge it with an existing one? Thank you.

    • Adrian Rosebrock April 24, 2017 at 9:52 am #

      Hey Jeff — you can’t really “merge” the models together, but what you can do is:

      1. Train your own model(s) and create an ensemble from your other models (and pre-trained ones) as well. This makes the assumption that all networks are trying to predict the same class labels.
      2. If you want to predict different class labels from the labels in ImageNet, you should try fine-tuning a pre-trained network.

      I’m covering both techniques inside Deep Learning for Computer Vision with Python.

  17. shiva April 27, 2017 at 10:23 pm #

    Hi Adrian,

    My name is Shiva, doing postdoctoral research in computer vision at ASU.
    I first found you because of an online search for deep learning tutorials.
    I am greatly interested in using deep learning models to perform medical image classification, segmentation and CBIR.
    My question is:
    I have data with training and validation splits for three classes. How do you modify the above codes to accept the training and validation splits and print the validation accuracy?
    The reason I am asking is because I could find tutorials on using pre-trained models to predict a single image but not in-depth analysis on using these very deep models for data classification with train and validation splits.
    My experience with deep learning is intermediate level.
    Thanks and looking forward.


    • Adrian Rosebrock April 28, 2017 at 9:26 am #

      Hi Shiva — I think you might have some confusion regarding pre-trained neural networks. Once the networks are trained on a given number of classes (in this case, 1,000 ImageNet classes) you cannot use them to train on new classes (in your case, three classes) unless you apply feature extraction or fine-tuning.

      Fine-tuning will be covered in detail inside Deep Learning for Computer Vision with Python. Otherwise, I would suggest you work through the PyImageSearch Gurus course so you can get some more experience working with machine learning and training models.

  18. Ramesh June 18, 2017 at 1:51 am #

    Hi Adrian,
    Wonderful tutorial. I want to limit the output to a particular set of labels only. That is to say, I don’t want all the ImageNet labels. Am I right in stating that in the previous few comments, you were referring to the solution of exact task I want to do when you said fine tuning of the pre-trained model is required?

    • Adrian Rosebrock June 20, 2017 at 11:09 am #

      There are two ways to do this.

      The first is a bit “hackish”. Simply use the pre-trained network as is, then ignore the indexes of the labels you are not interested in. Then, take the label with the largest probability (from the set of labels you care about) and use that as your final classification. Again, this is a hack and only recommended in very specific situations.

      Otherwise, I would suggest fine-tuning.
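
The "hackish" option could look roughly like the sketch below. It assumes the predictions have already been decoded into `(imagenet_id, label, probability)` tuples, the layout Keras' `decode_predictions` returns for each image. The IDs are placeholders, the labels and probabilities are borrowed from an output posted further down this thread, and `restrict_predictions` is a hypothetical helper name.

```python
def restrict_predictions(decoded, allowed_labels):
    """Drop every prediction whose label is not in allowed_labels.

    The surviving predictions are returned sorted by probability, best first.
    Probabilities are left untouched (not re-normalized).
    """
    kept = [p for p in decoded if p[1] in allowed_labels]
    return sorted(kept, key=lambda p: p[2], reverse=True)


# Illustrative decoded predictions for one image; IDs are placeholders.
decoded = [
    ("id_0", "wreck", 0.9629),
    ("id_1", "seashore", 0.0042),
    ("id_2", "canoe", 0.0022),
    ("id_3", "paddle", 0.0014),
    ("id_4", "breakwater", 0.0010),
]

allowed = {"seashore", "canoe"}
restricted = restrict_predictions(decoded, allowed)
best = restricted[0] if restricted else None  # None if nothing matched
```

In practice you would decode all 1,000 classes rather than only the top five, otherwise the labels you care about may not be in the list at all.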

  19. Hesam Moshiri July 12, 2017 at 5:00 am #


    is it possible to fine-tune these existing models for a custom dataset?

  20. Lucas August 7, 2017 at 6:49 pm #

    Hi Adrian, I’m interested in implementing the Xception and Inception models to my own image classification problem. However, my dataset consists of small images of 25 by 25 pixels, which are black and white, so an input_shape of ( 1, 25, 25). Do you know if it’s possible, or the network fundamentally requires color?

    • Adrian Rosebrock August 10, 2017 at 8:58 am #

      I wouldn’t recommend trying to use Xception on your images if they are (1) grayscale and (2) substantially smaller than the images Xception was trained on. You can resize your images and then convert them to 3 channels by using:

      image = np.dstack([image] * 3)

      However, I wouldn’t expect very good accuracy.
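
As a quick check of what the `np.dstack` call does to the array shape, here is a small sketch. The 25 x 25 array is synthetic data standing in for a real grayscale image, and the channels-first `(1, 25, 25)` layout is taken from the question above. Resizing to the network's expected input size is a separate step that is not shown here.

```python
import numpy as np

# Synthetic 25x25 grayscale "image" (stand-in for real data).
gray = (np.arange(25 * 25) % 256).astype("uint8").reshape(25, 25)

# If the data is stored channels-first as (1, 25, 25), drop the channel axis first.
gray_chw = gray[np.newaxis, ...]
gray_2d = gray_chw[0]

# Copy the single channel three times along a new last axis.
rgb_like = np.dstack([gray_2d] * 3)  # shape: (25, 25, 3)
```

All three channels hold identical values, so no color information is added; the array simply has the shape the network expects.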

  21. *_cyrus_rex_* September 1, 2017 at 5:12 pm #

    Hi Adrian, great tutorial! I would be interested in classifying just a few of the labels (seashore, lakeshore and alp). How could I go about this, maybe by modifying the Inception V3 model? Thanks in advance!

  22. Somnath Banerjee September 17, 2017 at 5:54 pm #

    Dear Adrian,
    Very nice tutorials. I used your sample code to do some simple recognition. Primarily using the Resnet50 model. Below is a quick summary of my findings. Any suggestions on how to take this to more accuracy?

    python --image images/burger.jpg --model resnet ==> Cheeseburger (That was good)

    python --image images/milk.jpg --model resnet ==> EggNog (Understandable)

    python --image images/fruits.jpg --model resnet ==> BellPepper (Close)

    python --image images/chicken_biriyani.jpg --model resnet ==> Plate (Needs help with Indian / Subcontinent food)

    python --image images/pasta.jpg --model resnet ==> Corn (Close)

    • Adrian Rosebrock September 18, 2017 at 2:02 pm #

      It really depends on your input images but if you are intending to detect a small subset of images consider applying transfer learning, specifically fine-tuning or feature extraction. I cover both of these techniques in-depth inside Deep Learning for Computer Vision with Python.

  23. Rafael September 27, 2017 at 2:08 pm #

    hi Adrian,

    always a nice article!

    there is a typo in the phrase “To use VGG19, we simply need to change the --network command line argument:”. The command line argument is “--model”, not “--network”.

    I have a doubt of how to use transfer learning with different image inputs. Example: I have 2 different grayscale + depth image and I’d like to use existing trained model.
    Do you think it is possible?
    or do I have to train a new model?

    • Adrian Rosebrock September 28, 2017 at 9:10 am #

      Thank you for pointing out the typo, Rafael.

      As for your question, keep in mind that the ImageNet classifiers provided by Keras are pre-trained on RGB (3 channel) images in the ImageNet dataset. You can explicitly construct a 3 channel image from a single channel image via:

      gray = np.dstack([gray] * 3)

      And fine-tune from there; however, keep in mind that the filters learned by the neural network assume multi-channel. Your accuracy likely won’t be as good, but I would give it a shot just to obtain a baseline.

  24. Kendall Edwards September 29, 2017 at 2:49 pm #

    Adrian. Great tutorial. Is there any way to make this work in a Jupyter Notebook?

    • Adrian Rosebrock October 2, 2017 at 10:03 am #

      Provided you have Jupyter Notebooks and OpenCV installed on your system, yes. Make sure you replace the command line arguments with hard coded paths, though.
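
A sketch of the "hard coded paths" idea, assuming the script reads its options from a dictionary named `args` with `image` and `model` keys (matching the `--image` and `--model` switches used elsewhere in this thread). The image path is only a placeholder.

```python
# In a notebook cell, replace the argparse block with a plain dictionary
# so the rest of the script can keep using args["image"] and args["model"].
args = {
    "image": "images/burger.jpg",  # placeholder path; point this at your own file
    "model": "vgg16",              # name of the network to load
}
```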

  25. Ballu October 14, 2017 at 5:29 am #

    Line 59 always exits with killed 🙁

    • Adrian Rosebrock October 14, 2017 at 10:31 am #

      What system are you executing the script on? Normally the vague “killed” message happens due to an incorrect compile or the system running out of memory.

  26. Abid November 8, 2017 at 12:27 pm #

    Hi, Adrian Rosebrock, I need to know how I could find and draw the bounding boxes around the detected objects.

    • Adrian Rosebrock November 9, 2017 at 6:22 am #

      Please take a look at this blog post on object detection with deep learning.

  27. samad January 4, 2018 at 3:33 am #

    Dear DR.Adrian, thank you, I’d like to design a system that capture abnormal activity from camera, please guide me.

  28. xingtao wei January 5, 2018 at 10:54 am #

    Nice tutorial. I used keras with tensorflow backend to fine-tuning an inceptionV3 model, and I saw the model size tripled after fine-tuning. That is, the original inceptionV3 model was about 98MB, and the size grew to 288MB after fine-tuning. Any ideas on the reason?

    • Adrian Rosebrock January 5, 2018 at 1:23 pm #

      I would suggest checking to see if the optimizer state was serialized to the model as well. You can delete it via:

  29. Niladri February 26, 2018 at 3:28 am #

    Hi Adrian,

    I am using a live video stream to detect the objects which are labelled. The only issue is that I want to print the detected objects in a stream and not the output of the detection after stopping the script. Could you please provide any help or a snippet?


    • Adrian Rosebrock February 26, 2018 at 1:45 pm #

      Hey Niladri — can you elaborate more on what you mean by “in a stream and not the output of detection after stopping the script”. I’m not sure what you mean.

  30. Anusha Prakash March 2, 2018 at 12:05 am #

    What was the conclusion? Which of the models work best? And which layer features are best to pass to a classifier?

    • Adrian Rosebrock March 2, 2018 at 10:27 am #

      There is no “best model” as it is highly dependent on your image classification project. The same goes for the best feature extraction layer in a network. If you’re interested in learning more about these best practices and which models/layers to choose for a given project, I would suggest working through Deep Learning for Computer Vision with Python.

  31. Zubair March 18, 2018 at 1:34 am #

    hi Adrian

    I am Zubair Nawaz and i want to run this program on video not on images. how I can give videos path or anything else that will work not images.


    waiting for your quick reply.

    • Adrian Rosebrock March 19, 2018 at 5:17 pm #

      You should use the “cv2.VideoCapture” function to access your video file. This blog post will help you get started.

  32. Adesh March 21, 2018 at 8:13 am #

    Hey Adrian, how can i draw a rectangle around the detected objects ?

    • Adrian Rosebrock March 22, 2018 at 9:59 am #

      You need a network trained for object detection, not image classification. I would suggest starting with this post.

  33. Tom March 22, 2018 at 7:15 pm #

    Which model will be good for painting cross verification, if the painting is original or not?

  34. Gagandeep Singh April 10, 2018 at 3:04 am #

    Hi Adrian,
    Is it possible to draw bounding box (basically object detection) while using inception or alexnet? Do we have to apply selective segmentation or something similar before feeding the image for evaluation or is there any other neural network that can identify ROI first (specially in tensorflow)?


    • Gagandeep Singh April 10, 2018 at 3:07 am #

      P.S. I dont want to use SSD or fast rcnn models!

      • Adrian Rosebrock April 10, 2018 at 11:57 am #

        No, unfortunately you cannot use a network trained for image classification directly for object detection. I’ll be covering this in more detail in a blog post publishing later this month/early next. There is a hack you can do, however. You can treat an image classifier as an object detector by:

        1. Applying a sliding window + image pyramid
        2. Extracting the ROI at each step along the way
        3. Classifying the ROI with the network

        I actually demonstrate exactly how to do this inside the Practitioner Bundle of Deep Learning for Computer Vision with Python.
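
The three steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the implementation from the book: all function names are hypothetical, `resize_nearest` is a crude stand-in for `cv2.resize`, and `brightness_classifier` is a toy replacement for a real network that returns a label and a probability for each ROI.

```python
import numpy as np


def pyramid_sizes(width, height, scale=1.5, min_size=(64, 64)):
    """Yield (width, height) for each pyramid level, largest first."""
    while width >= min_size[0] and height >= min_size[1]:
        yield (width, height)
        width, height = int(width / scale), int(height / scale)


def resize_nearest(image, new_w, new_h):
    """Nearest-neighbour resize (stand-in for cv2.resize)."""
    rows = (np.arange(new_h) * image.shape[0] / new_h).astype(int)
    cols = (np.arange(new_w) * image.shape[1] / new_w).astype(int)
    return image[rows][:, cols]


def sliding_window(image, step, window_size):
    """Yield (x, y, roi) for every full window that fits inside the image."""
    win_w, win_h = window_size
    for y in range(0, image.shape[0] - win_h + 1, step):
        for x in range(0, image.shape[1] - win_w + 1, step):
            yield x, y, image[y:y + win_h, x:x + win_w]


def detect(image, classify_roi, window_size=(64, 64), step=32,
           scale=1.5, min_prob=0.9):
    """Classify every window at every pyramid level; keep confident hits."""
    win_w, win_h = window_size
    full_h, full_w = image.shape[:2]
    detections = []
    for (w, h) in pyramid_sizes(full_w, full_h, scale, window_size):
        level = resize_nearest(image, w, h)
        # Ratios that map this level's coordinates back to the original image.
        rx, ry = full_w / float(w), full_h / float(h)
        for (x, y, roi) in sliding_window(level, step, window_size):
            label, prob = classify_roi(roi)
            if prob >= min_prob:
                box = (int(x * rx), int(y * ry),
                       int((x + win_w) * rx), int((y + win_h) * ry))
                detections.append((label, prob, box))
    return detections


def brightness_classifier(roi):
    """Toy classifier: 'probability' is just the mean brightness of the ROI."""
    return "bright", float(roi.mean()) / 255.0


# Synthetic test image: black, with a white 64x64 square in the top-left corner.
image = np.zeros((128, 128), dtype="uint8")
image[0:64, 0:64] = 255
detections = detect(image, brightness_classifier)
```

With a real classifier, neighbouring windows tend to fire on the same object, so the overlapping boxes usually need to be merged afterwards, for example with non-maximum suppression.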

  35. OLUCHI May 6, 2018 at 7:50 pm #

    Hello, I am working on license plate detection using deep learning, and I am planning to use a pre-trained VGG16 model for the final verification of the license plate bounding box. I have successfully extracted my license plate regions and want to use a CNN to pick the true license plate region from among the candidate regions. I don't really know how to go about it.

    • Adrian Rosebrock May 9, 2018 at 10:12 am #

      I would suggest two approaches using transfer learning:

      1. Fine-tuning
      2. Feature extraction and training a model on top of the features

      I cover both inside Deep Learning for Computer Vision with Python.

      I hope that helps point you in the right direction or at the very least gives you some more terms to go on.

  36. Ashish Gupta June 18, 2018 at 9:30 am #

    Which is the most accurate architecture on Imagenet among alexnet, resnet, Inception, Vgg?

    • Adrian Rosebrock June 19, 2018 at 8:43 am #

      On ImageNet specifically? ResNet is typically the most accurate.

  37. PJ September 13, 2018 at 6:48 am #

    I am getting the output like this :

    [INFO] loading inception…
    [INFO] loading and pre-processing image…
    [INFO] classifying image with ‘inception’…
    1. wreck: 96.29%
    2. seashore: 0.42%
    3. canoe: 0.22%
    4. paddle: 0.14%
    5. breakwater: 0.10%

    it matches with your output but then
    is showing error…..?
    I have written the command cv2.imwrite(‘Classified_image44.png’,orig) ,
    it is creating .png file with empty contents.

    I tried the model vgg16,vgg19 also

    • Adrian Rosebrock September 14, 2018 at 9:38 am #

      What specifically is the error you are receiving?

  38. Moses Waiming Wong October 12, 2018 at 2:24 pm #

    Xception has one key feature worth mentioning: residual connections, which enable deeper networks without vanishing gradients. This is the essence of the Residual Network, and FChollet’s idea is to combine the advantage of Inception over a plain CNN with this residual feature, producing really good results.

  39. abbas October 25, 2018 at 11:33 pm #

    Hi Adrian! Which CNN architecture is best for feature extraction from images? Inception or Xception?

    • Adrian Rosebrock October 29, 2018 at 1:47 pm #

      No single network is “best” at feature extraction. Some models generalize better than others but it’s really dependent on your actual project. I would suggest you try multiple models, run experiments, and let your empirical results guide your decisions.

  40. abbas October 29, 2018 at 12:47 am #

    Great tutorial, Adrian! Do you have any video tutorials?

    • Adrian Rosebrock October 29, 2018 at 1:16 pm #

      I don’t do video very often. I’m a writer and I prefer writing rather than creating video.

      • abbas November 17, 2018 at 12:34 am #

        Do you have any tutorial regarding image captioning using inception model??

        • Adrian Rosebrock November 19, 2018 at 12:46 pm #

          Sorry, I do not have any tutorials for image captioning as of yet.

  41. Phyu Phyu December 24, 2018 at 9:48 pm #

    I want to know about inception score and do you have any tutorials for it? It is related to make a decision that the generated image is true or false/better or worse. Although we can see with human eyes, we need to show in words. Could you give me an idea if you don’t mind?

    • Adrian Rosebrock December 27, 2018 at 10:31 am #

      Sorry, I’m not sure what you mean by the “inception score” and the generated image — are you referring to GANs…?

  42. Johnny January 3, 2019 at 5:24 pm #

    Hello Adrian! Great tutorial.

    I wonder if i can use this this code on a raspberry pi with movidius usb stick?

  43. Pearline February 27, 2019 at 1:56 am #

    That was a great post to learn about cnn models quickly. !!! I have a doubt regarding depthwise separable convolution of xception and mobilenet. The depthwise separable convolution used by both models are same or different, sir

    • Adrian Rosebrock February 27, 2019 at 5:28 am #

      Only the Xception network utilizes depthwise separable convolution; the others listed in this post use standard convolution.

  44. Raju Kulkarni September 10, 2019 at 5:05 am #

    Hi Adrian,

    Great article, Thanks for this. I have one question suppose for Xception I used input shape (512,512) instead of (299,299) Then what happen exactly because still I am able to train model and having include_top = False.

  45. Chris W January 17, 2020 at 3:11 pm #

    Adrian – my application uses monochrome images and my inference processing time suffers from using 3 channels when I could be using just 1 channel. My model is built from VGG16 and I’m using Keras/TF2. Any thoughts on how it would be possible to speed up the inference time by reducing the VGG16 model to accept 1-channel inputs?
