A gentle guide to deep learning object detection

Today’s blog post is inspired by PyImageSearch reader Ezekiel, who emailed me last week and asked:

Hey Adrian,

I went through your previous blog post on deep learning object detection along
with the followup tutorial for real-time deep learning object detection. Thanks for those.

I’ve been using your source code in my example projects but I’m having two issues:

  1. How do I filter/ignore classes that I am uninterested in?
  2. How can I add new classes to my object detector? Is that even possible?

I would really appreciate it if you could cover this in a blog post.

Thanks.

Ezekiel isn’t the only reader with those questions. In fact, if you go through the comments section of my two most recent posts on deep learning object detection (linked above), you’ll find that one of the most common questions is typically (paraphrased):

How do I modify your source code to include my own object classes?

Since this appears to be such a common question, and one that ultimately stems from a misunderstanding of how neural networks/deep learning object detectors actually work, I decided to revisit the topic of deep learning object detection in today’s blog post.

Specifically, in this post you will learn:

  • The differences between image classification and object detection
  • The components of a deep learning object detector including the differences between an object detection framework and the base model itself
  • How to perform deep learning object detection with a pre-trained model
  • How you can filter and ignore predicted classes from a deep learning model
  • Common misconceptions and misunderstandings when adding or removing classes from a deep neural network

To learn more about deep learning object detection, and perhaps even debunk a few misconceptions or misunderstandings you may have with deep learning-based object detection, just keep reading.



Today’s blog post is meant to be a gentle introduction to deep learning-based object detection.

I’ve done my best to provide a review of the components of deep learning object detectors, including OpenCV + Python source code to perform deep learning using a pre-trained object detector.

Use this guide to help you get started with deep learning object detection, but also realize that object detection is highly nuanced and detailed — I could not possibly include every detail of deep learning object detection in a single blog post.

That said, we’ll start today’s blog post by discussing the fundamental differences between image classification and object detection, including whether a network trained for image classification can be used for object detection (and under what circumstances).

Once we understand what object detection is, we’ll review the core components of a deep learning object detector, including the object detection framework along with the base model, two key components that readers new to object detection tend to misunderstand.

From there, we’ll implement real-time deep learning object detection using OpenCV.

I’ll also demonstrate how you can ignore and filter object classes you are not interested in without having to modify the network architecture or retrain the model.

Finally, we’ll wrap up today’s blog post by discussing how you can add or remove classes from a deep learning object detector, including my recommended resources to help you get started.

Let’s go ahead and dive into deep learning object detection!

The difference between image classification and object detection

Figure 1: The difference between classification (left) and object detection (right) is intuitive and straightforward. For image classification, the entire image is classified with a single label. In the case of object detection, our neural network localizes (potentially multiple) objects within the image.

When performing standard image classification, given an input image, we present it to our neural network, and we obtain a single class label and perhaps a probability associated with the class label as well.

This class label is meant to characterize the contents of the entire image, or at least the most dominant, visible contents of the image.

For example, given the input image in Figure 1 above (left) our CNN has labeled the image as “beagle”.

We can thus think of image classification as:

  • One image in
  • And one class label out

Object detection, regardless of whether performed via deep learning or other computer vision techniques, builds on image classification and seeks to localize exactly where in the image each object appears.

When performing object detection, given an input image, we wish to obtain:

  • A list of bounding boxes, or the (x, y)-coordinates for each object in an image
  • The class label associated with each bounding box
  • The probability/confidence score associated with each bounding box and class label

Figure 1 (right) demonstrates an example of performing deep learning object detection. Notice how both the person and the dog are localized with their bounding boxes and class labels predicted.

Therefore, object detection allows us to:

  • Present one image to the network
  • And obtain multiple bounding boxes and class labels out

Can a deep learning image classifier be used for object detection?

Figure 2: A non-end-to-end deep learning object detector uses a sliding window (left) + image pyramid (right) approach combined with classification.

Okay, so at this point you understand the fundamental difference between image classification and object detection:

  • When performing image classification, we present one input image to the network and obtain one class label out.
  • But when performing object detection, we can present one input image and obtain multiple bounding boxes and class labels out.

That motivates the question:

Can we take a network already trained for classification and use it for object detection instead?

The answer is a bit tricky: technically it’s “yes,” but for reasons that aren’t so obvious.

The solutions involve:

  1. Applying standard, computer-vision based object detection methods (i.e., non-deep learning methods) such as sliding windows and image pyramids — this method is typically used in your HOG + Linear SVM-based object detectors.
  2. Taking the pre-trained network and using it as a base network in a deep learning object detection framework (i.e., Faster R-CNN, SSD, YOLO).

Method #1: The traditional object detection pipeline

The first method is not a pure end-to-end deep learning object detector.

We instead utilize:

  1. Fixed size sliding windows, which slide from left-to-right and top-to-bottom to localize objects at different locations
  2. An image pyramid to detect objects at varying scales
  3. Classification via a pre-trained (classification) Convolutional Neural Network

At each stop of the sliding window + image pyramid, we extract the ROI, feed it into a CNN, and obtain the output classification for the ROI.

If the classification probability of label L is higher than some threshold T, we record the bounding box of the ROI along with the label L. Repeating this process for every stop of the sliding window and image pyramid, we obtain the output object detections. Finally, we apply non-maxima suppression to the bounding boxes yielding our final output detections:

Figure 3: Applying non-maxima suppression will suppress overlapping, less confident bounding boxes.
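To make the pipeline concrete, here is a minimal sketch. The classify function is a hypothetical stand-in for any CNN image classifier that returns a (label, probability) tuple, and imutils supplies the resize helper:

```python
import imutils

def pyramid_detect(image, classify, win=(64, 64), step=16,
    pyr_scale=1.5, thresh=0.9):
    # "classify" is a hypothetical helper: any function that maps an
    # ROI to a (label, probability) tuple, e.g., a CNN trained for
    # image classification
    boxes = []
    scale = 1.0
    layer = image

    # loop over the layers of the image pyramid
    while layer.shape[0] >= win[1] and layer.shape[1] >= win[0]:
        # slide a fixed-size window from left-to-right and
        # top-to-bottom across the current layer
        for y in range(0, layer.shape[0] - win[1] + 1, step):
            for x in range(0, layer.shape[1] - win[0] + 1, step):
                roi = layer[y:y + win[1], x:x + win[0]]
                (label, prob) = classify(roi)

                # mark the ROI if the classification probability of
                # the label is above the threshold, mapping the
                # coordinates back to the original image scale
                if prob > thresh:
                    boxes.append((int(x * scale), int(y * scale),
                        int((x + win[0]) * scale),
                        int((y + win[1]) * scale), label, prob))

        # move to the next (smaller) pyramid layer
        layer = imutils.resize(layer,
            width=int(layer.shape[1] / pyr_scale))
        scale *= pyr_scale

    # the boxes still need non-maxima suppression applied to them
    # (e.g., imutils.object_detection.non_max_suppression)
    return boxes
```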

This method can work in some specific use cases, but in general it’s slow, tedious, and a bit error-prone.

However, it’s worth learning how to apply this method as it can turn an arbitrary image classification network into an object detector, avoiding the need to explicitly train an end-to-end deep learning object detector. This method could save you a ton of time and effort depending on your use case.

If you’re interested in this object detection method and want to learn more about the sliding window + image pyramid + image classification approach to object detection, please refer to my book, Deep Learning for Computer Vision with Python.

Method #2: Base network of an object detection framework

The second method to deep learning object detection allows you to treat your pre-trained classification network as a base network in a deep learning object detection framework (such as Faster R-CNN, SSD, or YOLO).

The benefit here is that you can create a complete end-to-end deep learning-based object detector.

The downside is that it requires a bit of intimate knowledge on how deep learning object detectors work — we’ll discuss this more in the following section.

The components of a deep learning object detector

Figure 4: The VGG16 base network is a component of the SSD deep learning object detection framework.

There are many components, sub-components, and sub-sub-components of a deep learning object detector, but the two we are going to focus on today are the two that most readers new to deep learning object detection often confuse:

  1. The object detection framework (ex. Faster R-CNN, SSD, YOLO).
  2. The base network which fits into the object detection framework.

You are likely already familiar with the base network (you just haven’t heard it referenced as a “base network” before).

Base networks are your common (classification) CNN architectures, including:

  • VGGNet
  • ResNet
  • MobileNet
  • DenseNet

Typically these networks are pre-trained to perform classification on a large image dataset, such as ImageNet, to learn a rich set of discriminating filters.

Object detection frameworks consist of many components and sub-components.

For example, the Faster R-CNN framework includes:

  • The Region Proposal Network (RPN)
  • A set of anchors
  • The Region of Interest (ROI) pooling module
  • The final Region-based Convolutional Neural Network

When using Single Shot Detectors (SSDs) you have components and sub-components such as:

  • MultiBox
  • Priors
  • Fixed priors

Keep in mind that the base network is just one of the many components that fit into the overall deep learning object detection framework — Figure 4 at the top of this section depicts the VGG16 base network inside the SSD framework.

Typically, “network surgery” is performed on the base network. This modification:

  • Makes it fully-convolutional (i.e., able to accept arbitrary input dimensions).
  • Eliminates CONV/POOL layers deeper in the base network architecture and replaces them with a series of new layers (SSD), new modules (Faster R-CNN), or some combination of the two.

The term “network surgery” is a colloquial way of saying we remove some of the original layers of the base network architecture and supplant them with new layers.

You’ve likely seen low budget horror movies where the killer, likely carrying an ax or large knife, attacks their victim and unceremoniously hacks at them.

Network surgery is more precise and exacting than the typical B horror film killer.

Network surgery is also very tactical — we remove the parts of the network we do not need and replace them with a new set of components.

Then, when we go to train our framework to perform object detection, both the weights of the (1) new layers/modules and (2) base network are modified.

Again, a complete review of how various deep learning object detection frameworks work (including the role the base network plays) is outside the scope of this blog post.

If you’re interested in a complete review of deep learning object detection, including theory and implementation, please refer to my book, Deep Learning for Computer Vision with Python.

How do I measure the accuracy of a deep learning object detector?

When evaluating object detector performance we use an evaluation metric called mean Average Precision (mAP) which is based on the Intersection over Union (IoU) across all classes in our dataset.

Intersection over Union (IoU)

Figure 5: In this visual example of Intersection over Union (IoU), the ground-truth bounding box (green) can be compared to the predicted bounding box (red). IoU is used with mean Average Precision (mAP) to evaluate the accuracy of a deep learning object detector. The simple equation to calculate IoU is shown on the right.

You’ll typically find IoU and mAP used to evaluate the performance of HOG + Linear SVM detectors, Haar cascades, and deep learning-based methods; however, keep in mind that the actual algorithm used to generate the predicted bounding boxes does not matter.

Any algorithm that provides predicted bounding boxes (and optionally class labels) as output can be evaluated using IoU. More formally, in order to apply IoU to evaluate an arbitrary object detector, we need:

  1. The ground-truth bounding boxes (i.e., the hand-labeled bounding boxes from our testing set that specify where in an image our object is).
  2. The predicted bounding boxes from our model.
  3. If you want to compute recall along with precision, you’ll also need the ground-truth class labels and predicted class labels.

In Figure 5 (left) I have included a visual example of a ground-truth bounding box (green) versus a predicted bounding box (red). IoU can be computed via the equation illustrated in Figure 5 (right).

Examining this equation you can see that IoU is simply a ratio.

In the numerator, we compute the area of overlap between the predicted bounding box and the ground-truth bounding box.

The denominator is the area of the union, or more simply, the area encompassed by both the predicted bounding box and the ground-truth bounding box.

Dividing the area of overlap by the area of union yields a final score — the Intersection over Union.
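If you prefer to see the ratio in code, here is a minimal sketch of an IoU function, assuming each box is stored as a (startX, startY, endX, endY) tuple:

```python
def intersection_over_union(boxA, boxB):
    # determine the (x, y)-coordinates of the intersection rectangle
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])

    # compute the area of the intersection (zero if the boxes do not
    # overlap at all)
    interArea = max(0, xB - xA) * max(0, yB - yA)

    # compute the area of each bounding box
    boxAArea = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    boxBArea = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])

    # the union is the total area covered by both boxes, counting
    # the overlapping region only once
    return interArea / float(boxAArea + boxBArea - interArea)
```

An IoU of 1.0 indicates a perfect overlap, while 0.0 indicates the boxes do not overlap at all.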

mean Average Precision (mAP)

Note: I decided to edit this section from its original form. I wanted to keep the discussion of mAP higher level and avoid some of the more confusing recall calculations but as a couple commenters pointed out this section wasn’t technically correct. Because of that I decided to update the post.

Since this is a gentle introduction to deep learning-based object detection I’m going to keep the explanation of mAP on the simplified side just so you understand the fundamentals.

Readers and practitioners new to object detection can be confused by the mAP calculation. This is partially due to the fact that mAP is a more complicated evaluation metric. The definition and calculation of mAP can even vary from one object detection challenge to another (when I say “object detection challenge” I’m referring to competitions such as COCO, PASCAL VOC, etc.).

Computing the Average Precision (AP) for a particular object detection pipeline is essentially a three step process:

  1. Compute the precision, which is the proportion of your predictions that are true positives.
  2. Compute the recall which is the proportion of true positives out of all possible positives.
  3. Average together the maximum precision value across all recall levels in steps of size s.

To compute the precision, we first apply our object detection algorithm to an input image. The predicted bounding boxes are then sorted in descending order by their confidence scores.

Since it’s a validation/testing example, we know a priori the total number of objects in the image; suppose there are 4 objects. We seek to determine how many “correct” detections our network made. A “correct” prediction here is one with a minimum IoU of 0.5 (this value is tunable depending on the challenge, but 0.5 is a standard value).

Here is where the calculation starts to become a bit more complicated. We need to compute the precision at different recall values (also called “recall levels” or “recall steps”).

For example, let’s pretend we are computing the precision and recall values for the top-3 predictions. Out of the top-3 predictions from our deep learning object detector, we made 2 correct. Our precision is then the proportion of true positives: 2/3 = 0.667. Our recall is the proportion of true positives out of all the possible positives in the image: 2/4 = 0.5. We repeat this process for (typically) the top-1 to top-10 predictions. This process yields a list of precision and recall values.

The next step is to compute the Average Precision (AP). We loop over all recall levels r, find the maximum precision that we can obtain at recall ≥ r, and then average those maximum precision values. We now have our average precision for a single evaluation image.

Once we have computed the average precision for all images in our testing/validation set we perform two more calculations:

  1. Compute the mean of the APs for each class, giving us a mAP for each individual class (for many datasets/challenges you’ll want to examine the mAP class-wise so you can spot if your deep learning object detector is struggling with a specific class)
  2. Take the mAPs for each individual class and then average them together, yielding the final mAP for the dataset

Again, mAP is more complicated than traditional accuracy so don’t be frustrated if you don’t understand it on the first pass. This is an evaluation metric you’ll want to study multiple times before you fully understand it. The good news is that deep learning object detection implementations handle computing mAP for you.
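To make the calculation a bit more concrete, here is a minimal sketch of the per-image AP computation described above. It uses 11 evenly spaced recall levels (as in the classic PASCAL VOC evaluation); real challenge evaluation code handles many more edge cases:

```python
import numpy as np

def average_precision(precisions, recalls, levels=11):
    # precisions/recalls are parallel lists computed from the ranked
    # detections (e.g., the top-1 through top-N predictions)
    ap = 0.0

    # loop over the evenly spaced recall levels
    for r in np.linspace(0.0, 1.0, levels):
        # find the maximum precision obtainable at recall >= r
        # (zero if no prediction reaches this recall level)
        candidates = [p for (p, rec) in zip(precisions, recalls)
            if rec >= r]
        ap += max(candidates) if candidates else 0.0

    # average the maximum precision values across all recall levels
    return ap / levels
```

Plugging in the worked example above, the top-3 predictions would contribute the (precision, recall) pair (0.667, 0.5) to the lists passed into this function.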

Deep learning-based object detection with OpenCV

We’ve discussed deep learning and object detection on this blog in previous posts; however, let’s review actual source code in this post as a matter of completeness.

Our example includes the Single Shot Detector (framework) with a MobileNet base model. The model was trained by GitHub user chuanqi305; it was pre-trained on the Common Objects in Context (COCO) dataset and then fine-tuned on PASCAL VOC, which is where the 21 class labels we’ll see below come from.

For additional detail, check out my previous post where I introduced chuanqi305’s model with pertinent background information.

Let’s loop back to Ezekiel’s first question from the top of this post:

  1. How do I filter/ignore classes that I am uninterested in?

I’m going to answer that very question in the following example script.

But first you need to prepare your system:

  • You need a minimum of OpenCV 3.3 installed in your Python virtual environment (provided you are using Python virtual environments). OpenCV 3.3+ includes the DNN module required to run the following code. Be sure to use one of the OpenCV installation tutorials on the following page while paying extra attention to which version of OpenCV you download + install.
  • You should also install my imutils package. To install/update imutils in your Python virtual environment, simply use pip: pip install --upgrade imutils.

When you’re ready, go ahead and create a new file named filter_object_detection.py and let’s begin:
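Note that the line numbers referenced throughout this walkthrough correspond to the complete script included in the “Downloads” section; the sketches below reconstruct each block, so their numbering may differ slightly. The import block looks like this:

```python
# import the necessary packages
from imutils.video import VideoStream
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import time
import cv2
```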

On Lines 2-8 we import our required packages and modules, notably imutils and OpenCV. We will be using my VideoStream class to handle capturing frames from a webcam.

We’re armed with the necessary tools, so let’s continue by parsing command line arguments:
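A sketch of the argument parsing block:

```python
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--prototxt", required=True,
    help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
    help="path to Caffe pre-trained model")
ap.add_argument("-c", "--confidence", type=float, default=0.2,
    help="minimum probability to filter weak detections")
args = vars(ap.parse_args())
```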

Our script requires two command line arguments at runtime:

  • --prototxt : The path to the Caffe prototxt file which defines the model definition.
  • --model : Our CNN model weights file path.

Optionally you may specify --confidence, a threshold to filter weak detections.

Our model can predict 21 object classes:
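A sketch of the label list (the 21 PASCAL VOC labels, including the special “background” class):

```python
# initialize the list of class labels the MobileNet SSD was trained
# to detect
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
    "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
    "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
    "sofa", "train", "tvmonitor"]
```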

The CLASSES list contains all 21 class labels the network was trained on (the PASCAL VOC labels).

A common misconception of the CLASSES list is that you can:

  1. Add a new class label to the list
  2. Or remove a class label from the list

…and have the network automatically “know” what you are trying to accomplish.

That is not the case.

You cannot simply modify a list of text labels and have the network automatically modify itself to learn, add, or remove patterns on data it was never trained on. That is not how neural networks work.

That said, there is a quick hack you can use to filter and ignore predictions you are uninterested in.

The solution is to:

  1. Define a set of IGNORE labels (i.e., the list of class labels the network was trained on that you want to filter and ignore).
  2. Make a prediction on an input image/video frame.
  3. Ignore any predictions where the class label exists in the IGNORE set.

Implemented in Python, the IGNORE set looks like this:
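A minimal sketch:

```python
# initialize the set of class labels we want to ignore
IGNORE = set(["person"])
```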

Here we’ll be ignoring all predicted objects with class label "person" (the if statement used for filtering will be covered later in this code review).

You can easily add additional class labels from the CLASSES list to the IGNORE set.

Next, we’ll generate random label/box colors, load our model, and start the video stream:
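A sketch of that initialization code, assuming your webcam is video source 0:

```python
# generate a random bounding box color for each class label
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

# initialize the video stream, allow the camera sensor to warm up,
# and initialize the FPS counter
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()
time.sleep(2.0)
fps = FPS().start()
```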

On Line 27 a random array of COLORS is generated to correspond to each of the 21 CLASSES. We’ll use these colors later for display purposes.

Our Caffe model is loaded on Line 31 using the cv2.dnn.readNetFromCaffe function with both of our required command line arguments passed as parameters.

Then we instantiate the VideoStream object as vs and start our fps counter (Lines 36-38). The 2-second sleep allows our camera plenty of time to warm up.

At this point we’re ready to loop over the incoming frames from the camera and send them through our CNN object detector:
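A sketch of the frame-processing loop (the 300×300 dimensions and scaling values are what this MobileNet SSD expects):

```python
# loop over the frames from the video stream
while True:
    # grab the frame from the threaded video stream and resize it
    # to have a maximum width of 400 pixels
    frame = vs.read()
    frame = imutils.resize(frame, width=400)

    # grab the frame dimensions and convert the frame to a blob
    (h, w) = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
        0.007843, (300, 300), 127.5)

    # pass the blob through the network and obtain the detections
    net.setInput(blob)
    detections = net.forward()
```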

On Line 44 we grab a frame and then resize it while preserving the aspect ratio for display (Line 45).

From there, we extract the height and width as we’ll need these values later (Line 48).

Lines 48 and 49 generate a blob from our frame. To learn more about a blob and how it’s constructed using the cv2.dnn.blobFromImage function, refer to this previous post for all the details.

Next, we send that blob through our neural net to detect objects (Lines 54 and 55).

Let’s loop over the detections:
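A sketch of the detection loop (still inside the while loop), including the IGNORE check discussed below:

```python
    # loop over the detections
    for i in np.arange(0, detections.shape[2]):
        # extract the confidence (i.e., probability) associated
        # with the prediction
        confidence = detections[0, 0, i, 2]

        # filter out weak detections by requiring a minimum
        # confidence
        if confidence > args["confidence"]:
            # extract the index of the class label from the
            # detections
            idx = int(detections[0, 0, i, 1])

            # if the predicted class label is in the set of classes
            # we want to ignore, skip the detection
            if CLASSES[idx] in IGNORE:
                continue
```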

On Line 58 we begin our detections loop.

For each detection, we extract the confidence (Line 61) and compare it to our confidence threshold (Line 65).

In the case that our confidence surpasses the minimum (the default of 0.2 can be changed via the optional command line argument), we’ll consider the detection a positive, valid detection and continue processing it.

First, we extract the index of the class label from detections (Line 68).

Then, going back to Ezekiel’s first question, we can ignore classes in the IGNORE set on Lines 72 and 73. If the class is to be ignored, we simply continue back to the top of the detections loop (and we don’t display labels or boxes for this class). This fulfills our “quick hack” solution.

Otherwise, we’ve detected an object that isn’t in the IGNORE set, so we need to display the class label and rectangle on the frame:
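A sketch of the drawing code (continuing inside the confidence check):

```python
            # compute the (x, y)-coordinates of the bounding box
            # for the object
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")

            # draw the class label and bounding box on the frame
            label = "{}: {:.2f}%".format(CLASSES[idx],
                confidence * 100)
            cv2.rectangle(frame, (startX, startY), (endX, endY),
                COLORS[idx], 2)
            y = startY - 15 if startY - 15 > 15 else startY + 15
            cv2.putText(frame, label, (startX, y),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)
```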

In this code block, we are extracting bounding box coordinates (Lines 77 and 78) followed by drawing a label and rectangle on the frame (Lines 81-87).

The color of the label + rectangle will be the same for each unique class; objects of the same class will have the same color (i.e., all "boats" in the video would have the same color label and box).

Finally, still in our while loop, we’ll display our hard work on our screen:
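A sketch of the display and keypress handling (back at the while loop’s indentation level):

```python
    # show the output frame and capture any keypresses
    cv2.imshow("Frame", frame)
    key = cv2.waitKey(1) & 0xFF

    # if the 'q' key was pressed, break from the loop
    if key == ord("q"):
        break

    # update the FPS counter
    fps.update()
```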

We display the frame and capture keypresses on Lines 90 and 91.

If the "q" key is pressed, we quit by breaking out of the loop (Lines 94 and 95).

Otherwise, we proceed to update our fps counter (Line 98) and continue grabbing and processing frames.

On the remaining lines, when the loop breaks, we display time + frames per second metrics and perform cleanup.
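A sketch of those closing lines:

```python
# stop the timer and display the FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

# do a bit of cleanup
cv2.destroyAllWindows()
vs.stop()
```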

Running your deep learning object detector

In order to run today’s script, you’ll need to grab the files by scrolling to the “Downloads” section below.

Once you’ve extracted the files, open a terminal and navigate to downloaded code + model. From there, execute the following command:
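The invocation looks like this (the filenames match the prototxt and Caffe model included in the download):

```
$ python filter_object_detection.py --prototxt MobileNetSSD_deploy.prototxt.txt \
    --model MobileNetSSD_deploy.caffemodel
```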

Figure 6: A real-time deep learning object detection demonstration using the same model — in the right video I’ve programmatically ignored certain object classes.

In the GIF above you can see on the left that the “person” class is detected — this is because the IGNORE set is empty. On the right you can see that I am not detected — this is because I added the “person” class to the IGNORE set.

While our deep learning object detector is still technically detecting the “person” class, our post-processing code is able to filter it out.

Perhaps you encountered an error running the deep learning object detector?

Troubleshooting step one would be to verify that you have a webcam hooked up. If that’s not the problem, maybe you saw the following error message in your terminal:
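With Python 3’s argparse, the error looks something like this:

```
usage: filter_object_detection.py [-h] -p PROTOTXT -m MODEL [-c CONFIDENCE]
filter_object_detection.py: error: the following arguments are required: -p/--prototxt, -m/--model
```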

If you see this message, then you didn’t pass “command line arguments” to the program. This is a common problem PyImageSearch readers have if they aren’t familiar with Python, argparse, and command line arguments. Check out the link if you are having trouble.

Here is the full version of the video with commentary:

How can I add or remove classes to my deep learning object detector?

Figure 7: Fine-tuning and transfer learning for deep learning object detectors.

As I mentioned earlier in this guide, you cannot simply add or remove class labels from the CLASSES list — the underlying network itself has not changed.

All you have done, at best, is modify a text file that lists out the class labels.

Instead, if you want to explicitly add or remove classes from a neural network, you will need to either:

  1. Train from scratch
  2. Perform fine-tuning

Training from scratch tends to be a time-consuming, expensive operation, so we try to avoid it when we can — but in some cases it is completely unavoidable.

The other option is to perform fine-tuning.

Fine-tuning is a form of transfer learning and is the process of:

  1. Removing the fully-connected layer responsible for classification/labeling
  2. Replacing it with a brand new, freshly and randomly initialized fully-connected layer

We may optionally modify other layers in the network as well (including freezing the weights of some layers and unfreezing them during the training process).
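To illustrate the head-swap idea, here is a minimal Keras sketch on a classification network (fine-tuning a full object detection framework involves more moving pieces, but the principle is the same). NUM_NEW_CLASSES is a hypothetical placeholder for the number of classes in your new dataset:

```python
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

NUM_NEW_CLASSES = 5  # hypothetical: the number of classes in your dataset

# load the base network, chopping off the fully-connected head
base = MobileNet(weights="imagenet", include_top=False,
    input_shape=(224, 224, 3))

# freeze the base network weights so only the new head is trained
# (these layers can be selectively unfrozen later during training)
for layer in base.layers:
    layer.trainable = False

# attach a brand new, randomly initialized fully-connected head
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
output = Dense(NUM_NEW_CLASSES, activation="softmax")(x)
model = Model(inputs=base.input, outputs=output)

model.compile(optimizer="adam", loss="categorical_crossentropy",
    metrics=["accuracy"])
```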

Exactly how to train your own custom deep learning object detector (including both fine-tuning and training from scratch) is an advanced topic outside the scope of this blog post, but see the section below to help you get started.

Where can I learn more about deep learning object detection?

Figure 8: Real-time deep learning object detection for front and rear views of vehicles.

As we’ve discussed in this blog post, object detection is not as simple and straightforward as image classification; many of its details and intricacies are outside the scope of this (already lengthy) blog post.

This tutorial will certainly not be my last guide to deep learning object detection (I will unquestionably be writing more about object detection in the future), but if you’re interested in learning how to:

  1. Prepare your own image datasets for object detection
  2. Fine-tune and train your own custom object detectors, including Faster R-CNNs and SSDs on your own datasets
  3. Uncover my best practices, techniques, and procedures to utilize when training your own deep learning object detectors

…then you’ll want to be sure to take a look at my new deep learning book. Inside Deep Learning for Computer Vision with Python, I will guide you, step-by-step, on building your own deep learning object detectors.

Be sure to take a look — and don’t forget to grab your (free) sample chapters + table of contents PDF while you’re there!

Summary

In today’s blog post you were gently introduced to some of the intricacies involved in deep learning object detection. We started by reviewing the fundamental differences between image classification and object detection, including how we can use a network trained for image classification for object detection.

We then reviewed the core components of a deep learning object detector:

  1. The framework
  2. The base model

The base model is typically a pre-trained (classification) network, normally trained on a large image dataset such as ImageNet to learn a robust set of discriminating filters.

We can also train the base network from scratch but this usually takes a significantly longer amount of time for the object detector to reach reasonable accuracy.

You should, in most situations, start with a pre-trained base model instead of trying to train from scratch.

Once we acquired a solid understanding of deep learning object detectors, we implemented an object detector capable of running in real-time in OpenCV.

I also demonstrated how you can filter and ignore class labels that you are uninterested in.

Finally, we learned that actually adding or removing a class from a deep learning object detector is not as simple as adding/removing a label from the hardcoded class labels list.

The neural network itself doesn’t care if you modify a list of class labels — instead, you would need to either:

  1. Modify the network architecture itself by removing the fully-connected class prediction layer and fine-tuning
  2. Or train the object detection framework from scratch

For most deep learning object detection projects, you will start with a deep learning object detector pre-trained on an object detection task, such as COCO. You then perform fine-tuning on the model to obtain your own detector.

Training an end-to-end custom deep learning object detector is outside the scope of this blog post, so if you’re interested in discovering how to train your own deep learning object detectors, please refer to my book, Deep Learning for Computer Vision with Python.

Inside the book, I have included a number of deep learning object detection examples, including training your own object detectors to:

  1. Detect traffic signs, such as stop signs, pedestrian crossing signs, etc.
  2. Detect the front and rear views of vehicles

To learn more about my deep learning book, just click here!

If you enjoyed today’s blog post, be sure to enter your email address in the form below to be notified when future tutorials are published here on PyImageSearch!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!


48 Responses to A gentle guide to deep learning object detection

  1. Anirban May 14, 2018 at 11:26 am #

    Really good blog post, and with the YouTube video it is even better. I am really happy that I purchased your Deep Learning for CV with Python — in a few months I have learnt so much about DL for CV that I now feel confident that I can apply for a DL in CV post.
    Disclaimer: I am a banker by profession, have not coded in the last ten years, and this is my honest review.

    • Adrian Rosebrock May 14, 2018 at 11:42 am #

      Thanks so much for the kind words, Anirban! 😀 I’m so incredibly happy for you and your transition from banker to CV practitioner. Keep up the great work!

  2. Raym May 14, 2018 at 12:26 pm #

    Thanks for the clarification!!!

    • Adrian Rosebrock May 14, 2018 at 12:27 pm #

      Thanks Raym, I’m glad it helped 🙂

  3. MImranKhan May 14, 2018 at 12:47 pm #

    But how can we use our own model that we train ourselves rather than picking a pre-trained model?

    • Adrian Rosebrock May 14, 2018 at 2:17 pm #

      You would typically take a network pre-trained on ImageNet and then fine-tune it to your own dataset. You could train your own base network first and then fine-tune but whether or not that works better really depends on your dataset and project. I would suggest running experiments for both.

  4. Vijin May 14, 2018 at 2:05 pm #

    I think mAP computation mentioned in this blog is wrong.

    • Adrian Rosebrock May 14, 2018 at 2:16 pm #

      Hey Vijin — what specifically regarding the mAP computation do you think is incorrect?

      UPDATE: I went back and updated the mAP computation. I was trying to keep it simplistic but after reading (1) Ye Hu’s comment and (2) reviewing the post itself a few times I decided to go back and include the full calculation.

  5. Tiri May 14, 2018 at 2:49 pm #

    very interesting article! hope to see soon new posts on object detection 🙂
    in which bundle of your books do you do the object detection topic and examples like traffic signs?

    • Adrian Rosebrock May 14, 2018 at 2:59 pm #

      Hi Tiri, there will certainly be more posts on object detection. The Practitioner Bundle of Deep Learning for Computer Vision with Python discusses the traditional sliding window + image pyramid method for object detection, including how to use a CNN trained for classification as an object detector. The ImageNet Bundle includes all examples on training Faster R-CNNs and SSDs for traffic sign and front/rear view vehicle detection.

  6. camp May 14, 2018 at 9:13 pm #

    nice. thank you

  7. Nikhil May 14, 2018 at 11:03 pm #

    Hi Adrian, Why am I getting this error?
    $ python3 filter_object_detection.py --prototxt MobileNetSSD_deploy.prototxt.txt --model MobileNetSSD_deploy.caffemodel

    AttributeError: module 'cv2' has no attribute 'dnn'

    • Adrian Rosebrock May 15, 2018 at 6:01 am #

      Make sure you have at least OpenCV 3.3 installed (see the blog post for more details as I discuss why and how you can install OpenCV 3.3+).

  8. Ye Hu May 15, 2018 at 2:10 am #

    So do I. The mAP involves the precision-recall curve.

    • Adrian Rosebrock May 15, 2018 at 6:08 am #

      In the context of object detection the precision would be the proportion of our true positives (TP) for each image. The recall would be the proportion of the TP out of all the possible positives for each image. The average precision is then the average of maximum precision values at varying recall steps. I didn’t include the step value for the precision/recall calculation as this is meant to be an introductory blog post to object detection. It’s also not an exhaustive example of how to compute mAP for object detection either (although that could make for a good tutorial).

      If anyone finds the mAP explanation too simplified (or even too complicated) let me know and I will consider rewriting it.

      UPDATE: I decided to go back and update the blog post to describe the full calculation. Trying to explain the entire mAP calculation is too much for this already lengthy blog post. I’ll cover a detailed computation of mAP in a future tutorial.

  9. Chandramouleeswar May 16, 2018 at 7:48 am #

    Hello Adrian,

    Can you give me a suggestion for image recognition in videos? I am looking forward to implementing Mask-R CNN using Resnet as a base network for recognising persons, vehicles, traffic signals on roads from a video Dataset. What is the better Dataset for my choice?

    • Adrian Rosebrock May 16, 2018 at 5:05 pm #

      Just to clarify, are you looking to perform segmentation on each frame in the dataset which is essentially treating it like working with a set of images? Or are you trying to do activity recognition within the dataset as well where sequences of frames are important?

  10. Elain May 17, 2018 at 2:24 am #

    Can i get a link to the wallpaper?

    • Adrian Rosebrock May 17, 2018 at 6:43 am #

      Which wallpaper are you referring to?

  11. Gilad May 17, 2018 at 8:15 am #

    I would like to understand how we can get 7 FPS.
    When I trained a CNN for face detection and used a Haar cascade to detect the face itself, on the same computer I got ~7 FPS.
    If I understand correctly, under the hood the algorithm is running thousands of inferences on each box and calculating what it found. How can we reach 7 FPS?
    Thanks for a very, very interesting post.
    G

    • Adrian Rosebrock May 17, 2018 at 8:51 am #

      The deep learning face detector in this post will already get you over 7 FPS on the CPU. Haar cascades will run many times faster (but likely less accurate depending on your project). Are you using your own CNN trained for face detection? If so consider pushing the computation to the GPU for faster inference.

      • Gilad May 17, 2018 at 3:16 pm #

        I would like to understand what is under the hood of the network in your post. Is it indeed doing inference thousands of times for each picture, as your post suggests?

        • Adrian Rosebrock May 22, 2018 at 6:48 am #

          Be careful with the term “inference” here. Typically we use the term inference to refer to a prediction from the model as it’s inferring from the data. In the context of neural networks, an inference is a single forward pass which returns the prediction.

          Perhaps you mean to say the network is performing thousands of computations for each input image? If so, that statement is correct.

  12. Siladittya Manna May 17, 2018 at 12:38 pm #

    This post cleared a lot of confusion I had regarding implementation of object detection and image classification. Thanks a lot!!

    • Adrian Rosebrock May 21, 2018 at 10:39 am #

      Thanks Siladittya, I’m happy to hear you found it helpful 🙂

  13. Gilad May 18, 2018 at 4:25 am #

    Thx Adrian again

    https://youtu.be/ULE40CgDrwo

    • Adrian Rosebrock May 22, 2018 at 6:49 am #

      Thanks so much for sharing your demo Gilad, great job! 🙂

  14. Zubair Ahmed May 27, 2018 at 11:43 am #

    Nice blog post and of course I learned this and more from your book. To all the readers: if you like this post, make sure you get Adrian’s book.

    • Adrian Rosebrock May 27, 2018 at 11:57 am #

      Thanks Zubair! 😀

      • Zubair Ahmed May 27, 2018 at 2:40 pm #

        Well to top it off another tutorial to do Object Counting would be an awesome addition to this series 🙂

  15. Suresh Kumar June 19, 2018 at 7:03 am #

    #1

    You have ignored humans in this object detection. How do I include humans?

    #2

    I would like to add another object, like a watch or mobile phone, to be detected. How do I add it to the Caffe model file?

    • Adrian Rosebrock June 19, 2018 at 8:22 am #

      1. You could set the IGNORE set to be empty or you could modify the code to use a KEEP class that includes only the specified set of classes.

      2. Please read the blog post as I discuss the answer to your question. You’ll want to apply fine-tuning/transfer learning.

  16. Dave A June 19, 2018 at 8:36 pm #

    Excellent post again. I’m really enjoying these. In a matter of weeks I’ve modified your code to communicate to some Node-Red flows I have sending me snapshots of motion, faces or certain classes of objects when detected on a Raspberry Pi 3b. (And not be ‘that guy’, but you may want to look over your figure numbering and the references within the text.)
    You make it almost too easy. Thank you!

    • Adrian Rosebrock June 21, 2018 at 5:50 am #

      Congrats on the progress Dave, that’s fantastic!

  17. Suresh Kumar June 20, 2018 at 12:42 am #

    Yes, I have added the person class by removing the IGNORE lines. Thank you, sir.

    #3

    I need a log file to be created after stopping the program: how many objects were detected and the prediction percentage for each object.

    How can I do that, sir?

    • Adrian Rosebrock June 21, 2018 at 5:46 am #

      You should read up on basic file I/O operations using the Python programming language. I’m happy to help but please take the time to do your proper research and read online. There are many Python tutorials available that teach you the fundamentals of the language.

  18. Carlos July 19, 2018 at 10:54 am #

    Hello Adrian,

    Do you think SSD is better than YOLO for object detection? I noticed you implement SSD in the ImageNet Bundle, and not YOLO. Why is that?

    Another question: for detecting targets like airplanes and military targets from satellite images, which one would you recommend?

    Loving your 2nd book from DL4CV. When I finish this, I will surely buy the 3rd!

    Thanks

    • Adrian Rosebrock July 20, 2018 at 6:35 am #

      While YOLO is fast it’s not as accurate as SSDs or Faster R-CNNs. A general rule of thumb is that if you want pure speed and can sacrifice accuracy, use YOLO. If you need to detect tiny objects use Faster R-CNN. If you need a balance, use SSD.

      As far as your second question goes, I assume those objects would appear to be pretty tiny. In that case, Faster R-CNN.

      • Carlos July 20, 2018 at 7:56 pm #

        Thanks for the answer!

        I will try to study more about them, as I want to work in this area in the future.

        Have a nice weekend!

  19. Carlos July 23, 2018 at 10:06 am #

    Dear Adrian,

    In the ImageNet Bundle (Faster R-CNNs and Single Shot Detectors (SSDs)), do you show how to train these architectures for object detection on my own dataset?

    I am trying to identify cars, people and airplanes from aerial images (satellite, drones, UAV).

    I finished the Convolutional Neural Networks course from Coursera (Andrew Ng) and we implemented YOLO using the YAD2K package, but I have no idea (yet) how to train deep learning architectures to detect my own targets.

    In which book (and chapter) I will find these answers?

    Thanks for the attention.

    • Adrian Rosebrock July 25, 2018 at 8:12 am #

      Hey Carlos — you are correct, the ImageNet Bundle of Deep Learning for Computer Vision with Python will show you how to train Faster R-CNNs and SSDs on your own custom datasets. You will find all chapters on how to perform object detection in the ImageNet Bundle of the book.

  20. Lluis August 6, 2018 at 3:42 pm #

    Hi Adrian,

    thanks for your detailed tutorials, they are a big help for getting started with deep learning. What I want to accomplish is to train a network to detect objects (not only classify them). The images are in FITS format, which is used for astronomy images. I was able to train a model to classify the object (I followed one of your tutorials, Santa/Not Santa), but object detection is not so easy. All the examples and tutorials start with a pretrained network, but I need to start from scratch. Do you have any advice or source that I could follow to accomplish my goal?

    Thanks in advance!

    • Adrian Rosebrock August 7, 2018 at 6:37 am #

      Hi Lluis — I have a number of chapters inside Deep Learning for Computer Vision with Python that demonstrate how to train an object detector model from scratch. That would be my recommended starting point for you to achieve your goal.

      • Lluis August 7, 2018 at 7:16 am #

        Hi Adrian,

        thanks, I will take a look, and let you know with the result.

        Thanks and regards.

  21. Márcio August 12, 2018 at 4:43 pm #

    Hello Adrian, do you have a Raspberry Pi SD card image with that project?

    • Adrian Rosebrock August 15, 2018 at 8:55 am #

      I do. My Raspbian .img file with OpenCV pre-configured and pre-installed is included in the Quickstart Bundle and Hardcopy Bundle of Practical Python and OpenCV.

  22. Benya Jamiu September 3, 2018 at 6:20 pm #

    Dear Dr.
    In fact, I’m yet to buy the book or enroll in any of your courses, but you have made most of my days, and I’m just looking for a place to practice it right. I have applied for an MSc in AI here in Paris to specialize in Computer Vision. Very soon I will buy both of your books, but right now I’m practicing all your examples online.
    You are great: without leaving my room I’m moving closer to becoming a GURU specialist in Computer Vision, even with much stress, still practicing, sleeping 12-02:00 am sometimes.

    • Adrian Rosebrock September 5, 2018 at 8:51 am #

      Thank you for the kind words, Benya. I’m so happy to hear you are enjoying the blog and will one day pick up a copy of my books. Keep practicing, you’re doing great! 🙂

