Mask R-CNN with OpenCV


In this tutorial, you will learn how to use Mask R-CNN with OpenCV.

Using Mask R-CNN you can automatically segment and construct pixel-wise masks for every object in an image. We’ll be applying Mask R-CNNs to both images and video streams.

In last week’s blog post you learned how to use the YOLO object detector to detect the presence of objects in images. Object detectors, such as YOLO, Faster R-CNNs, and Single Shot Detectors (SSDs), generate bounding box (x, y)-coordinates which represent the location of an object in an image.

Obtaining the bounding boxes of an object is a good start but the bounding box itself doesn’t tell us anything about (1) which pixels belong to the foreground object and (2) which pixels belong to the background.

That begs the question:

Is it possible to generate a mask for each object in our image, thereby allowing us to segment the foreground object from the background?

Is such a method even possible?

The answer is yes — we just need to perform instance segmentation using the Mask R-CNN architecture.

To learn how to apply Mask R-CNN with OpenCV to both images and video streams, just keep reading!


Mask R-CNN with OpenCV

In the first part of this tutorial, we’ll discuss the differences between image classification, object detection, instance segmentation, and semantic segmentation.

From there we’ll briefly review the Mask R-CNN architecture and its connections to Faster R-CNN.

I’ll then show you how to apply Mask R-CNN with OpenCV to both images and video streams.

Let’s get started!

Instance segmentation vs. Semantic segmentation

Figure 1: Image classification (top-left), object detection (top-right), semantic segmentation (bottom-left), and instance segmentation (bottom-right). We’ll be performing instance segmentation with Mask R-CNN in this tutorial. (source)

Explaining the differences between traditional image classification, object detection, semantic segmentation, and instance segmentation is best done visually.

When performing traditional image classification our goal is to predict a set of labels to characterize the contents of an input image (top-left).

Object detection builds on image classification, but this time allows us to localize each object in an image. The image is now characterized by:

  1. Bounding box (x, y)-coordinates for each object
  2. An associated class label for each bounding box

An example of semantic segmentation can be seen in the bottom-left. Semantic segmentation algorithms require us to associate every pixel in an input image with a class label (including a class label for the background).

Pay close attention to our semantic segmentation visualization — notice how each object is indeed segmented but each “cube” object has the same color.

While semantic segmentation algorithms are capable of labeling every object in an image they cannot differentiate between two objects of the same class.

This behavior is especially problematic when two objects of the same class partially occlude each other: as with the two purple cubes, we cannot tell where one cube ends and the next begins.

Instance segmentation algorithms, on the other hand, compute a pixel-wise mask for every object in the image, even if the objects are of the same class label (bottom-right). Here you can see that each of the cubes has its own unique color, implying that our instance segmentation algorithm not only localized each individual cube but predicted its boundaries as well.

The Mask R-CNN architecture we’ll be discussing in this tutorial is an example of an instance segmentation algorithm.

What is Mask R-CNN?

The Mask R-CNN algorithm was introduced by He et al. in their 2017 paper, Mask R-CNN.

Mask R-CNN builds on the previous object detection work of R-CNN (2013) and Fast R-CNN (2015) by Girshick et al., and Faster R-CNN (2015) by Ren et al.

In order to understand Mask R-CNN let’s briefly review the R-CNN variants, starting with the original R-CNN:

Figure 2: The original R-CNN architecture (source: Girshick et al., 2013)

The original R-CNN algorithm is a four-step process:

  • Step #1: Input an image to the network.
  • Step #2: Extract region proposals (i.e., regions of an image that potentially contain objects) using an algorithm such as Selective Search.
  • Step #3: Use transfer learning, specifically feature extraction, to compute features for each proposal (which is effectively an ROI) using the pre-trained CNN.
  • Step #4: Classify each proposal using the extracted features with a Support Vector Machine (SVM).

The reason this method works is due to the robust, discriminative features learned by the CNN.

However, the problem with the R-CNN method is it’s incredibly slow. And furthermore, we’re not actually learning to localize via a deep neural network, we’re effectively just building a more advanced HOG + Linear SVM detector.

To improve upon the original R-CNN, Girshick et al. published the Fast R-CNN algorithm:

Figure 3: The Fast R-CNN architecture (source: Girshick et al., 2015).

Similar to the original R-CNN, Fast R-CNN still utilizes Selective Search to obtain region proposals; however, the novel contribution of the paper was the Region of Interest (ROI) Pooling module.

ROI Pooling works by extracting a fixed-size window from the feature map and using these features to obtain the final class label and bounding box. The primary benefit here is that the network is now, effectively, end-to-end trainable:

  1. We input an image and associated ground-truth bounding boxes
  2. Extract the feature map
  3. Apply ROI pooling and obtain the ROI feature vector
  4. And finally, use the two sets of fully-connected layers to obtain (1) the class label predictions and (2) the bounding box locations for each proposal.

While the network is now end-to-end trainable, performance at inference (i.e., prediction) suffered dramatically due to the dependence on Selective Search.

To make the R-CNN architecture even faster we need to incorporate the region proposal directly into the R-CNN:

Figure 4: The Faster R-CNN architecture (source: Ren et al., 2015)

The Faster R-CNN paper by Ren et al. introduced the Region Proposal Network (RPN), which bakes region proposal directly into the architecture, alleviating the need for the Selective Search algorithm.

As a whole, the Faster R-CNN architecture is capable of running at approximately 7-10 FPS, a huge step towards making real-time object detection with deep learning a reality.

The Mask R-CNN algorithm builds on the Faster R-CNN architecture with two major contributions:

  1. Replacing the ROI Pooling module with a more accurate ROI Align module
  2. Inserting an additional branch out of the ROI Align module

This additional branch accepts the output of the ROI Align and then feeds it into two CONV layers.

The output of the CONV layers is the mask itself.

We can visualize the Mask R-CNN architecture in the following figure:

Figure 5: The Mask R-CNN work by He et al. replaces the ROI Pooling module with a more accurate ROI Align module. The output of the ROI module is then fed into two CONV layers. The output of the CONV layers is the mask itself.

Notice the branch of two CONV layers coming out of the ROI Align module — this is where our mask is actually generated.

As we know, the Faster R-CNN/Mask R-CNN architectures leverage a Region Proposal Network (RPN) to generate regions of an image that potentially contain an object.

Each of these regions is ranked based on their “objectness score” (i.e., how likely it is that a given region could potentially contain an object) and then the top N most confident objectness regions are kept.

In the original Faster R-CNN publication Ren et al. set N=2,000, but in practice we can get away with a much smaller N, such as N={10, 100, 200, 300}, and still obtain good results.

He et al. set N=300 in their publication which is the value we’ll use here as well.

Each of the 300 selected ROIs goes through three parallel branches of the network:

  1. Label prediction
  2. Bounding box prediction
  3. Mask prediction

Figure 5 above visualizes these branches.

During prediction, each of the 300 ROIs goes through non-maxima suppression and the top 100 detection boxes are kept, resulting in a 4D tensor of 100 x L x 15 x 15, where L is the number of class labels in the dataset and 15 x 15 is the size of each of the L masks.

The Mask R-CNN we’re using here today was trained on the COCO dataset, which has L=90 classes, thus the resulting volume size from the mask module of the Mask R-CNN is 100 x 90 x 15 x 15.

To visualize the Mask R-CNN process take a look at the figure below:

Figure 6: A visualization of Mask R-CNN producing a 15 x 15 mask, the mask resized to the original dimensions of the image, and then finally overlaying the mask on the original image. (source: Deep Learning for Computer Vision with Python, ImageNet Bundle)

Here you can see that we start with our input image and feed it through our Mask R-CNN network to obtain our mask prediction.

The predicted mask is only 15 x 15 pixels so we resize the mask back to the original input image dimensions.

Finally, the resized mask can be overlaid on the original input image. For a more thorough discussion on how Mask R-CNN works be sure to refer to:

  1. The original Mask R-CNN publication by He et al.
  2. My book, Deep Learning for Computer Vision with Python, where I discuss Mask R-CNNs in more detail, including how to train your own Mask R-CNNs from scratch on your own data.

Project structure

Our project today consists of two scripts, but there are several other files that are important.

I’ve organized the project in the following manner (as shown by the tree command output directly in a terminal):

Our project consists of four directories:

  • mask-rcnn-coco/ : The Mask R-CNN model files. There are four files:
    • frozen_inference_graph.pb : The Mask R-CNN model weights. The weights are pre-trained on the COCO dataset.
    • mask_rcnn_inception_v2_coco_2018_01_28.pbtxt : The Mask R-CNN model configuration. If you’d like to build + train your own model on your own annotated data, refer to Deep Learning for Computer Vision with Python.
    • object_detection_classes_coco.txt : All 90 classes are listed in this text file, one per line. Open it in a text editor to see what objects our model can recognize.
    • colors.txt : This text file contains six colors to randomly assign to objects found in the image.
  • images/ : I’ve provided three test images in the “Downloads”. Feel free to add your own images to test with.
  • videos/ : This is an empty directory. I actually tested with large videos that I scraped from YouTube (credits are below, just above the “Summary” section). Rather than providing a really big zip, my suggestion is that you find a few videos on YouTube to download and test with. Or maybe take some videos with your cell phone and come back to your computer and use them!
  • output/ : Another empty directory that will hold the processed videos (assuming you set the command line argument flag to output to this directory).

We’ll be reviewing two scripts today:

  • mask_rcnn.py : This script will perform instance segmentation and apply a mask to the image so you can see where, down to the pixel, the Mask R-CNN thinks an object is.
  • mask_rcnn_video.py : This video processing script uses the same Mask R-CNN and applies the model to every frame of a video file. The script then writes the output frame back to a video file on disk.

OpenCV and Mask R-CNN in images

Now that we’ve reviewed how Mask R-CNNs work, let’s get our hands dirty with some Python code.

Before we begin, ensure that your Python environment has OpenCV 3.4.2/3.4.3 or higher installed. You can follow one of my OpenCV installation tutorials to upgrade/install OpenCV. If you want to be up and running in 5 minutes or less, you can consider installing OpenCV with pip. If you have some other requirements, you might want to compile OpenCV from source.

Make sure you’ve used the “Downloads” section of this blog post to download the source code, trained Mask R-CNN, and example images.

From there, open up the mask_rcnn.py  file and insert the following code:

First we’ll import our required packages on Lines 2-7. Notably, we’re importing NumPy and OpenCV. Everything else comes with most Python installations.

From there, we’ll parse our command line arguments:
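A sketch of what the argument parser likely looks like, given the flags described below (the sample image path passed to parse_args here is an assumption so the sketch runs standalone; the real script simply calls ap.parse_args() with no arguments):

```python
import argparse

# construct the argument parser -- flag names follow the bullet list
# below; defaults are the values discussed in the post
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-m", "--mask-rcnn", required=True,
	help="base path to mask-rcnn directory")
ap.add_argument("-v", "--visualize", type=int, default=0,
	help="whether or not we are going to visualize each instance")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
	help="minimum threshold for pixel-wise mask segmentation")

# explicit argv list so the sketch runs standalone
args = vars(ap.parse_args([
	"--image", "images/example_01.jpg",
	"--mask-rcnn", "mask-rcnn-coco"]))
```

Note that argparse converts the --mask-rcnn flag into the args["mask_rcnn"] dictionary key.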

Our script requires that command line argument flags and parameters be passed at runtime in our terminal. Our arguments are parsed on Lines 10-21, where the first two of the following are required and the rest are optional:

  • --image : The path to our input image.
  • --mask-rcnn : The base path to the Mask R-CNN files.
  • --visualize  (optional): A positive value indicates that we want to visualize how we extracted the masked region on our screen. Either way, we’ll display the final output on the screen.
  • --confidence  (optional): You can override the probability value of 0.5  which serves to filter weak detections.
  • --threshold  (optional): We’ll be creating a binary mask for each object in the image and this threshold value will help us filter out weak mask predictions. I found that a default value of 0.3  works pretty well.

Now that our command line arguments are stored in the args  dictionary, let’s load our labels and colors:
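The parsing logic can be sketched as follows. In the real script the two strings come from object_detection_classes_coco.txt and colors.txt on disk (via open(path).read()); inline samples are used here so the sketch is self-contained:

```python
import numpy as np

# in the script: LABELS = open(labelsPath).read().strip().split("\n")
labels_text = "person\nbicycle\ncar"
LABELS = labels_text.strip().split("\n")

# each line of colors.txt is a comma-separated B,G,R triplet
colors_text = "0,255,0\n0,0,255\n255,0,0"
COLORS = [np.array(c.split(",")).astype("int")
	for c in colors_text.strip().split("\n")]
COLORS = np.array(COLORS, dtype="uint8")
```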

Lines 24-26 load the COCO object class  LABELS . Today’s Mask R-CNN is capable of recognizing 90 classes including people, vehicles, signs, animals, everyday items, sports gear, kitchen items, food, and more! I encourage you to look at object_detection_classes_coco.txt  to see the available classes.

From there we load the COLORS  from the path, performing a couple array conversion operations (Lines 30-33).

Let’s load our model:

First, we build our weight and configuration paths (Lines 36-39), followed by loading the model via these paths (Line 44).

In the next block, we’ll load and pass an image through the Mask R-CNN neural net:

Here we:

  • Load the input image  and extract dimensions for scaling purposes later (Lines 47 and 48).
  • Construct a blob  via cv2.dnn.blobFromImage  (Line 54). You can learn why and how to use this function in my previous tutorial.
  • Perform a forward pass of the blob  through the net  while collecting timestamps (Lines 55-58). The results are contained in two important variables: boxes  and masks .

Now that we’ve performed a forward pass of the Mask R-CNN on the image, we’ll want to filter + visualize our results. That’s exactly what this next for loop accomplishes. It is quite long, so I’ve broken it into five code blocks beginning here:
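A sketch of how the loop likely begins, with a synthetic boxes tensor in place of real network output (OpenCV’s dnn detections come back in a 1 x 1 x N x 7 layout: batch ID, class ID, confidence, then the box corners in relative coordinates):

```python
import numpy as np

# synthetic stand-ins: one fake detection in the boxes tensor
H, W = 300, 400
image = np.zeros((H, W, 3), dtype="uint8")
boxes = np.array([[[[0, 5, 0.9, 0.1, 0.2, 0.5, 0.8]]]],
	dtype="float32")
conf_threshold = 0.5  # args["confidence"] in the script

for i in range(0, boxes.shape[2]):
	# extract the class ID and confidence of the detection
	classID = int(boxes[0, 0, i, 1])
	confidence = boxes[0, 0, i, 2]

	# filter out weak predictions
	if confidence > conf_threshold:
		# clone the image so we can draw on it later
		clone = image.copy()

		# scale the bounding box from relative [0, 1]
		# coordinates back to the image dimensions
		box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
		(startX, startY, endX, endY) = box.astype("int")
		boxW = endX - startX
		boxH = endY - startY
```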

In this block, we begin our filter/visualization loop (Line 66).

We proceed to extract the classID  and confidence  of a particular detected object (Lines 69 and 70).

From there we filter out weak predictions by comparing the confidence  to the command line argument confidence  value, ensuring we exceed it (Line 74).

Assuming that’s the case, we’ll go ahead and make a clone  of the image (Line 76). We’ll need this image later.

Then we scale our object’s bounding box as well as calculate the box dimensions (Lines 81-84).

Image segmentation requires that we find all pixels where an object is present. Thus, we’re going to place a transparent overlay on top of the object to see how well our algorithm is performing. In order to do so, we’ll calculate a mask:

On Lines 89-91, we extract the pixel-wise segmentation for the object as well as resize it to the original image dimensions. Finally we threshold the mask  so that it is a binary array/image (Line 92).

We also extract the region of interest where the object resides (Line 95).

Both the mask  and roi  can be seen visually in Figure 8 later in the post.

For convenience, this next block accomplishes visualizing the mask , roi , and segmented instance  if the --visualize  flag is set via command line arguments:

In this block we:

  • Check to see if we should visualize the ROI, mask, and segmented instance (Line 99).
  • Convert our mask from boolean to integer where a value of “0” indicates background and “255” foreground (Line 102).
  • Perform bitwise masking to visualize just the instance itself (Line 103).
  • Show all three images (Lines 107-109).

Again, these visualization images will only be shown if the --visualize  flag is set via the optional command line argument (by default these images won’t be shown).

Now let’s continue on with visualization:
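The blending step can be sketched as follows, with synthetic stand-ins for the clone, ROI, mask, and color table (the 0.4/0.6 alpha split is the transparent-overlay weighting):

```python
import random
import numpy as np

# synthetic stand-ins
clone = np.zeros((300, 400, 3), dtype="uint8")
(startX, startY, endX, endY) = (40, 60, 200, 240)
roi = clone[startY:endY, startX:endX]
mask = np.ones((180, 160), dtype=bool)
COLORS = np.array([[0, 255, 0], [255, 0, 0]], dtype="uint8")

# keep only the masked pixels of the ROI
roi = roi[mask]

# blend a randomly chosen color with the masked pixels,
# creating the transparent overlay effect
color = random.choice(COLORS)
blended = ((0.4 * color) + (0.6 * roi)).astype("uint8")

# store the blended pixels back in the clone image
clone[startY:endY, startX:endX][mask] = blended
```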

Line 113 extracts only the masked region of the ROI by passing the boolean mask array as our slice condition.

Then we’ll randomly select one of our six COLORS  to apply our transparent overlay on the object (Line 118).

Subsequently, we’ll blend our masked region with the roi  (Line 119) followed by placing this blended  region into the clone  image (Line 122).

Finally, we’ll draw the rectangle and textual class label + confidence value on the image as well as display the result!

To close out, we:

  • Draw a colored bounding box around the object (Lines 125 and 126).
  • Build our class label + confidence text  as well as draw the text  above the bounding box (Lines 130-132).
  • Display the image until any key is pressed (Lines 135 and 136).

Let’s give our Mask R-CNN code a try!

Make sure you’ve used the “Downloads” section of the tutorial to download the source code, trained Mask R-CNN, and example images. From there, open up your terminal and execute the following command:
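The command likely looks like the following (the image filename is an assumption; substitute any image from the downloaded images/ directory, and add --visualize 1 to also see the intermediate ROI, mask, and instance windows):

```shell
python mask_rcnn.py --image images/example_01.jpg \
	--mask-rcnn mask-rcnn-coco
```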

Figure 7: A Mask R-CNN applied to a scene of cars. Python and OpenCV were used to generate the masks.

In the above image, you can see that our Mask R-CNN has not only localized each of the cars in the image but has also constructed a pixel-wise mask as well, allowing us to segment each car from the image.

If we were to run the same command, this time supplying the --visualize  flag, we can visualize the ROI, mask, and instance as well:

Figure 8: Using the --visualize flag, we can view the ROI, mask, and segmentation intermediate steps for our Mask R-CNN pipeline built with Python and OpenCV.

Let’s try another example image:

Figure 9: Using Python and OpenCV, we can perform instance segmentation using a Mask R-CNN.

Our Mask R-CNN has correctly detected and segmented both people, a dog, a horse, and a truck from the image.

Here’s one final example before we move on to using Mask R-CNNs in videos:

Figure 10: Here you can see me feeding a treat to the family beagle, Jemma. The pixel-wise map of each object identified is masked and transparently overlaid on the objects. This image was generated with OpenCV and Python using a pre-trained Mask R-CNN model.

In this image, you can see a photo of myself and Jemma, the family beagle.

Our Mask R-CNN is capable of detecting and localizing me, Jemma, and the chair with high confidence.

OpenCV and Mask R-CNN in video streams

Now that we’ve looked at how to apply Mask R-CNNs to images, let’s explore how they can be applied to videos as well.

Open up the mask_rcnn_video.py  file and insert the following code:

First we import our necessary packages and parse our command line arguments.

There are two new command line arguments (which replace --image  from the previous script):

  • --input : The path to our input video.
  • --output : The path to our output video (since we’ll be writing our results to disk in a video file).
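The video script’s parser can be sketched the same way as before (the sample paths passed to parse_args are assumptions so the sketch runs standalone; the real script calls ap.parse_args() with no arguments):

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
	help="path to input video file")
ap.add_argument("-o", "--output", required=True,
	help="path to output video file")
ap.add_argument("-m", "--mask-rcnn", required=True,
	help="base path to mask-rcnn directory")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
	help="minimum threshold for pixel-wise mask segmentation")

# explicit argv list so the sketch runs standalone
args = vars(ap.parse_args([
	"--input", "videos/example.mp4",
	"--output", "output/example_output.avi",
	"--mask-rcnn", "mask-rcnn-coco"]))
```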

Now let’s load our class LABELS , COLORS , and Mask R-CNN neural net :

Our LABELS  and COLORS  are loaded on Lines 24-31.

From there we define our weightsPath  and configPath  before loading our Mask R-CNN neural net  (Lines 34-42).

Now let’s initialize our video stream and video writer:

Our video stream ( vs ) and video writer  are initialized on Lines 45 and 46.

We attempt to determine the number of frames in the video file and display the total  (Lines 49-53). If we’re unsuccessful, we’ll capture the exception and print a status message as well as set total  to -1  (Lines 57-59). We’ll use this value to approximate how long it will take to process an entire video file.

Let’s begin our frame processing loop:

We begin looping over frames by defining an infinite while  loop and capturing the first frame  (Lines 62-64). The loop will process the video until completion which is handled by the exit condition on Lines 68 and 69.

We then construct a blob  from the frame and pass it through the neural net  while grabbing the elapsed time so we can calculate estimated time to completion later (Lines 75-80). The result is included in both boxes  and masks .

Now let’s begin looping over detected objects:

First we filter out weak detections with a low confidence value. Then we determine the bounding box coordinates and obtain the mask  and roi .

Now let’s draw the object’s transparent overlay, bounding rectangle, and label + confidence:

Here we’ve blended  our roi  with color and stored it in the original frame , effectively creating a colored transparent overlay (Lines 118-122).

We then draw a rectangle  around the object and display the class label + confidence  just above (Lines 125-133).

Finally, let’s write to the video file and clean up:

On the first iteration of the loop, our video writer  is initialized.

An estimate of the amount of time that the processing will take is printed to the terminal on Lines 143-147.

The final operation of our loop is to write  the frame to disk via our writer  object (Line 150).

You’ll notice that I’m not displaying each frame to the screen. The display operation is time-consuming and you’ll be able to view the output video with any media player when the script is finished processing anyways.

Note: OpenCV’s dnn  module does not currently support NVIDIA GPUs. Right now only a limited number of GPUs are supported, mainly Intel GPUs. NVIDIA GPU support is coming soon, but for the time being we cannot easily use a GPU with OpenCV’s dnn  module.

Lastly, we release video input and output file pointers (Lines 154 and 155).

Now that we’ve coded up our Mask R-CNN + OpenCV script for video streams, let’s give it a try!

Make sure you use the “Downloads” section of this tutorial to download the source code and Mask R-CNN model.

You’ll then need to collect your own videos with your smartphone or another recording device. Alternatively, you can download videos from YouTube as I have done.

Note: I am intentionally not including the videos in today’s download because they are rather large (400MB+). If you choose to use the same videos as me, the credits and links are at the bottom of this section.

From there, open up a terminal and execute the following command:
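The command likely looks like the following (the video filename is an assumption; point --input at whatever video you downloaded or recorded):

```shell
python mask_rcnn_video.py --input videos/cats_and_dogs.mp4 \
	--output output/cats_and_dogs_output.avi \
	--mask-rcnn mask-rcnn-coco
```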

Figure 11: Mask R-CNN applied to video with Python and OpenCV.

In the above video, you can find funny video clips of dogs and cats with a Mask R-CNN applied to them!

Here is a second example, this one applying OpenCV and a Mask R-CNN to video clips of cars “slipping and sliding” in wintry conditions:

Figure 12: Mask R-CNN object detection is applied to a video scene of cars using Python and OpenCV.

You can imagine a Mask R-CNN being applied to highly trafficked roads, checking for congestion, car accidents, or travelers in need of immediate help and attention.

Credits for the videos and audio include:

  • Cats and Dogs
    • “Try Not To Laugh Challenge – Funny Cat & Dog Vines compilation 2017” on YouTube
    • “Happy rock” on BenSound
  • Slip and Slide
    • “Compilation of Ridiculous Car Crash and Slip & Slide Winter Weather – Part 1” on YouTube
    • “Epic” on BenSound

How do I train my own Mask R-CNN models?

Figure 13: Inside my book, Deep Learning for Computer Vision with Python, you will learn how to annotate your own training data, train your custom Mask R-CNN, and apply it to your own images. I also provide two case studies on (1) skin lesion/cancer segmentation and (2) prescription pill segmentation, a first step in pill identification.

The Mask R-CNN model we used in this tutorial was pre-trained on the COCO dataset…

…but what if you wanted to train a Mask R-CNN on your own custom dataset?

Inside my book, Deep Learning for Computer Vision with Python, I:

  1. Teach you how to train a Mask R-CNN to automatically detect and segment cancerous skin lesions — a first step in building an automatic cancer risk factor classification system.
  2. Provide you with my favorite image annotation tools, enabling you to create masks for your input images.
  3. Show you how to train a Mask R-CNN on your custom dataset.
  4. Provide you with my best practices, tips, and suggestions when training your own Mask R-CNN.

All of the Mask R-CNN chapters include a detailed explanation of both the algorithm and code, ensuring you will be able to successfully train your own Mask R-CNNs.

To learn more about my book (and grab your free set of sample chapters and table of contents), just click here.

Summary

In this tutorial, you learned how to apply the Mask R-CNN architecture with OpenCV and Python to segment objects from images and video streams.

Object detectors such as YOLO, SSDs, and Faster R-CNNs are only capable of producing bounding box coordinates of an object in an image — they tell us nothing about the actual shape of the object itself.

Using Mask R-CNN we can generate pixel-wise masks for each object in an image, thereby allowing us to segment the foreground object from the background.

Furthermore, Mask R-CNNs enable us to segment complex objects and shapes from images which traditional computer vision algorithms would not enable us to do.

I hope you enjoyed today’s tutorial on OpenCV and Mask R-CNN!

To download the source code to this post, and be notified when future tutorials are published here on PyImageSearch, just enter your email address in the form below!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!


78 Responses to Mask R-CNN with OpenCV

  1. Faizan Amin November 19, 2018 at 10:39 am #

    Hi. How can we train our own Mask RCNN model. Can we use Tensorflow Models API for this purpose?

  2. Steph November 19, 2018 at 10:49 am #

    Hi Adrian,

    thanks a lot for another great tutorial.
    I already knew Mask-RCNN for trying it on my problem, but apparently that is not the way to go.
    What I want to do is to detect movie posters in videos and then track them over time. The first time they appear I also manually define a mask to simplify the process. Unfortunately any detection/tracking method I tried failed miserably… the detection step is hard, because the poster is not an object available in the models, and it can vary a lot depending on the movie it represents; tracking also fails, since I need a pixel perfect tracking and any deep learning method I tried does not return a shape with straight borders but always rounded objects.

    Do you have any algorithms to recommend for this specific task? Or shall I resort to traditional, not DL-based methids?

    Thanks in advance!

    • Adrian Rosebrock November 19, 2018 at 12:21 pm #

      How many example images per movie poster do you have?

      • Steph November 20, 2018 at 3:49 am #

        I have 10 videos, each of them showing a movie poster for about 150 frames. The camera is always panning or zooming, so the shape and size of the poster is constantly changing.
        Thanks in advance for any help 🙂

        • Adrian Rosebrock November 20, 2018 at 9:04 am #

          I assume each of the 150 frames has the same movie poster? Are these 150 frames your training data? If so, have you labeled them and annotated them so you can train an object detector?

          • Steph November 20, 2018 at 9:27 am #

            Yes, I have 1500 images as training data. For each movie poster, i created a binary mask showing where is the poster. The shape is usually a quadrilateral, unless in case the poster is partially occluded.
            I’d like to train a system which, given an annotated frame of a video, could then detect the movie poster with pixel precision during camera movement and occlusions, but so far I didn’t have luck. Even system especially trained for that (as they do in the Davis challenge https://davischallenge.org/) seem to fail after just a few frames.
            If you are going to work / publish a post on the issue, let me know!

          • Adrian Rosebrock November 21, 2018 at 9:39 am #

            Thanks for the clarification. In that case I would highly suggest using a Mask R-CNN. The Mask R-CNN will give you a pixel-wise segmentation of the movie poster. Once you have the location of the poster you can either:

            1. Continue to process subsequent frames using the Mask R-CNN
            2. Or you can apply a dedicated object tracker

  3. Mansoor November 19, 2018 at 11:38 am #

    Adrian, you are constantly bombarding us with such valuable information every single week, which otherwise would take us months to even understand.

    Thank you for sharing this incredible piece of code with us.

    • Adrian Rosebrock November 19, 2018 at 12:20 pm #

      Thanks Mansoor — it is my pleasure 🙂

  4. YEVHENII RVACHOV November 19, 2018 at 12:04 pm #

    Hello, Adrian.

    Thanks so much for your article and explanation of principles R-CNN

    • Adrian Rosebrock November 19, 2018 at 12:20 pm #

      You are welcome, I’m happy you found the post useful! I hope you can apply it to your own projects.

  5. Atul November 19, 2018 at 12:31 pm #

    Thanks , very informative and useful 🙂

    • Adrian Rosebrock November 19, 2018 at 12:57 pm #

      Thanks Atul!

  6. Faraz November 19, 2018 at 12:42 pm #

    Hi Adrain.

    Thank you again for the great effort.My question is that mask rcnn is according to authors of paper Mask rcnn : https://arxiv.org/pdf/1703.06870.pdf ,fps is around 5fps.Isnt it a bit slow for using it in real time application and how do you compare YOLO or SSD with it.Thanks.

    • Adrian Rosebrock November 19, 2018 at 12:56 pm #

      Yes, Faster R-CNN and Mask R-CNN are slower than YOLO and SSD. I would request you read “Instance segmentation vs. Semantic segmentation” section of this tutorial — the section will explain to you how YOLO, SSD, and Faster R-CNN (object detectors) are different than Mask R-CNN (instance segmentation).

      • Faraz November 19, 2018 at 1:28 pm #

        Thanks Adrian ,so what i understand is that mask rcnn may not be suitable for real time applications.Great tutorial by the way.Thumbs up

  7. Cenk November 19, 2018 at 2:16 pm #

    Hi Adrian,

    Thank you very much for your sharing the code along with the blog, as it will be very helpful for us to play around and understand better.

    • Adrian Rosebrock November 19, 2018 at 2:20 pm #

      Thanks Cenk!

  8. Walid November 19, 2018 at 3:36 pm #

    Thanks a lot.
    I worked when I updated openCV 🙂

    • Adrian Rosebrock November 19, 2018 at 4:11 pm #

      Awesome, glad to hear it!

  9. atom November 19, 2018 at 7:49 pm #

    Great post, Adrian. Actually, a large number of papers are published everyday on machine learning, so can you share us the way you keep track almost of them. Thanks so muchs, Adrian

    • atom November 21, 2018 at 4:09 am #

      Adrian, please give me some comment about this. Thanks

  10. Paul November 19, 2018 at 8:17 pm #

    Hi Adrian
    This is awesome. I loved your book. (still trying to learn most of it)
    I used matterport’s Mask RCNN in our software to segment label-free cells in microscopy images and track them.
    I wonder if you can comment on two things
    1.
    would you comment on how to improve the accuracy of the mask?
    Do you think it’s the interpolation error or we can improve the accuracy by increasing the depth of the CNNs?

    2. I’ve seen this “flickering” in segmentation (as in the video).
    When doing image segmentation, one set of trained weights can recognize a target while another may not, a kind of false negative.
    Would you know where it comes from?

    • Adrian Rosebrock November 20, 2018 at 9:18 am #

      1. Are you already applying data augmentation? If not, make sure you are. I’m also not sure how much data you have for training but you may need more.

      2. False-negatives and false-positives will happen, especially if you’re trying to run the model on video. Ways to improve your model include using training data that is similar to your testing data, applying data augmentation, regularization, and anything that will increase the ability of your model to generalize.

  11. Jaan November 19, 2018 at 8:24 pm #

    This looks really cool. Is this the same thing as pose estimation?

    • Adrian Rosebrock November 20, 2018 at 9:16 am #

      No, pose estimation actually finds keypoints/landmarks for specific joints/body parts. I’ll try to cover pose estimation in the future.

  12. Sumit November 20, 2018 at 12:01 am #

    Thank you so much for all the wonderful tutorials. I am a great follower of your work. I had a doubt here:

    To perform localization and classification at the same time, we add 2 fully connected layers at the end of our network architecture. One classifies and the other provides the bounding box information. But how will we know which fully connected layer produces coordinates and which one is for classification?

    What I read in some blogs is that we receive a matrix at the end which contains: [confidence score, bx, by, bw, bh, class1, class2, class3].

    • Adrian Rosebrock November 20, 2018 at 9:11 am #

      We know due to our implementation. One FC branch is (N + 1)-d, where N is the number of class labels and the extra dimension accounts for the background. The other FC branch is 4×N-d, where each group of four values represents the deltas for a class’s final predicted bounding box.
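      As a toy NumPy illustration of that shape arithmetic (the class count of 90 and the feature size of 1024 are made-up values for the sketch, not taken from the post):

```python
import numpy as np

N = 90                            # hypothetical number of object classes
feat = np.random.randn(1, 1024)   # pooled ROI feature vector (made-up size)

# Classification branch: N classes + 1 background = (N + 1) outputs
W_cls = np.random.randn(1024, N + 1)
scores = feat @ W_cls             # shape (1, N + 1)

# Box-regression branch: 4 deltas (dx, dy, dw, dh) per class = 4 * N outputs
W_box = np.random.randn(1024, 4 * N)
deltas = feat @ W_box             # shape (1, 4 * N)

print(scores.shape)               # (1, 91)
print(deltas.shape)               # (1, 360)
```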

  13. Dona Paula November 20, 2018 at 12:54 am #

    Thanks for your invaluable tutorials. I ran your code as-is; however, I am getting only one object instance segmented, i.e., if I have two cars in the image (e.g., example1), only one car is detected and instance-segmented. I have tried other images. Same story.

    My openCV version is 3.4.3. Please suggest resolution.

    • Dona Paula November 20, 2018 at 1:32 am #

      Please ignore my previous comment. I thought it would be an animated gif.

      • Adrian Rosebrock November 20, 2018 at 9:08 am #

        Click on the window opened by OpenCV and press any key on your keyboard. It will advance the execution of the script to highlight the next car.

  14. Digant November 20, 2018 at 1:15 am #

    Hi Adrian,
    Can you suggest an architecture for semantic segmentation which performs segmentation without resizing the image? Any blog/code related to it would be great.

  15. Kark November 20, 2018 at 7:04 am #

    Hi Adrian,

    Thanks for this awesome post.

    I am working on a similar project where I have to identify and localize each object in the picture. Can you please advise how to make this script identify all the objects in the picture, like a carton box, wooden block, etc.? I will not know what could be in the picture in advance.

    • Adrian Rosebrock November 20, 2018 at 9:02 am #

      You would need to first train a Mask R-CNN to identify each of the objects you would like to recognize. Mask R-CNNs, and in general, all machine learning models, are not magic boxes that intuitively understand the contents of an image. Instead, we need to explicitly train them to do so. If you’re interested in training your own custom Mask R-CNN networks be sure to refer to Deep Learning for Computer Vision with Python where I discuss how to train your own models in detail (including code).

  16. abkul November 20, 2018 at 7:52 am #

    Great tutorial.

    I am interested in extracting and classifying/labeling plant disease(s) and insects from an image sent by a farmer using a deep learning paradigm. Please advise on the relevant approaches/techniques to be employed.

    Are you planning to diversify your blog with examples in the field of plant pests or disease diagnosis in future?

    • Adrian Rosebrock November 20, 2018 at 8:59 am #

      I haven’t covered plant diseases specifically before, but I have covered human diseases, such as skin lesion/cancer segmentation using a Mask R-CNN. Be sure to take a look at Deep Learning for Computer Vision with Python for more details. I’m more than confident that the book would help you complete your plant disease classification project.

  17. Abhiraj Biswas November 20, 2018 at 8:21 am #

    As you mentioned, the output is stored to disk. I wanted to know: how can we show the output on the screen frame by frame?

    • Adrian Rosebrock November 20, 2018 at 8:57 am #

      You can insert a call to cv2.imshow, but keep in mind that a Mask R-CNN running on a CPU may, at best, only be able to do 1 FPS. The results wouldn’t look as good.

  18. Dave P November 20, 2018 at 1:02 pm #

    Hi Adrian, Another great tutorial – Your program examples just work first time (unlike many other object detection tutorials on the web…)
    I am trying to reduce the number of false positives from my CCTV alarm system which monitors for visitors against a very ‘noisy’ background (trees blowing in the wind etc) and using an RCNN looks most promising. The Mask RCNN gives very accurate results but I don’t really need the pixel-level masks and the extra CPU time to generate them.
    Is there a (simple) way to just generate the bounding boxes?
    I have tried to use Faster RCNN rather than Mask RCNN but the accuracy I am getting (from the aforementioned web tutorials and Github downloads) is much poorer.

  19. Paul Z November 20, 2018 at 5:51 pm #

    Never even heard of R-CNN until now, but this is a great follow-up to the YOLO post. Question: sometimes the algorithm seems to identify the same person twice with very similar confidence levels, and at times the same person twice, once at ~90% and once at ~50%.

    Any ideas?

    • Adrian Rosebrock November 21, 2018 at 9:27 am #

      The same person in the same frame? Or the same person in subsequent frames?

  20. sophia November 21, 2018 at 10:28 am #

    another great article! would it be possible to use instance segmentation or object detection to detect whether an object is on the floor? I want to be able to scan a room and trigger an alert if an object is on the floor. I haven’t seen any deep learning algorithm applied to detecting the floor. thanks, look forward to your reply.

    • Adrian Rosebrock November 25, 2018 at 9:44 am #

      That would actually be a great application of semantic segmentation. Semantic segmentation algorithms can be used to classify all pixels of an image/frame. Try looking into semantic segmentation algorithms for room understanding.

      • Sophia November 25, 2018 at 1:35 pm #

        thanks Adrian, I’ll look into using semantic segmentation for this, look forward to more articles from you!

  21. Bharath November 21, 2018 at 10:56 pm #

    Hi Adrian, I found you have lots of blog posts on installing OpenCV on the Raspberry Pi, where you build and compile (minimum 2 hours). I found pip install opencv-python works fine on the Raspberry Pi. Did you try it?

    • Adrian Rosebrock November 25, 2018 at 9:35 am #

      I actually have an entire tutorial dedicated to installing OpenCV with pip. I would refer to it to ensure your install is working properly.

  22. abkul November 22, 2018 at 5:18 am #

    Like always great tutorial.

    No algorithm is perfect. What are the shortcomings of the Mask R-CNN approach/algorithm?

    • Adrian Rosebrock November 25, 2018 at 9:29 am #

      Mask R-CNNs are extremely slow. Even on a GPU they only operate at 5-7 FPS.

  23. Mandar Patil November 22, 2018 at 6:57 am #

    Hey Adrian,
    I made the entire tree structure on Google Colab and ran the mask_rcnn.py file.

    !python mask_rcnn.py --mask-rcnn mask-rcnn-coco --image images/example_01.jpg

    It gave the following result:
    [INFO] loading Mask R-CNN from disk…
    [INFO] Mask R-CNN took 5.486852 seconds
    [INFO] boxes shape: (1, 1, 3, 7)
    [INFO] masks shape: (100, 90, 15, 15)
    : cannot connect to X server

    Could you please tell me why this happened?

    • Adrian Rosebrock November 25, 2018 at 9:28 am #

      I don’t believe Google Colab has X11 forwarding which is required to display images via cv2.imshow. Don’t worry though, you can still use matplotlib to display images.

  24. xuli November 22, 2018 at 11:56 am #

    Cool! Leading the way for us to the most recent technology.

  25. Micha November 24, 2018 at 3:26 pm #

    Thinking of using Mask R-CNN for background removal: is there any way to make the mask more accurate than in the video examples?

    • Adrian Rosebrock November 25, 2018 at 8:56 am #

      You would want to ensure your Mask R-CNN is trained on objects that are similar to the ones in your video streams. A deep learning model is only as good as the training data you give it.

      • Micha Amir Cohen November 26, 2018 at 7:51 am #

        I’m talking about person recognition; it can be any person, so I’m not sure I understand your comment about “objects that are similar”.

        Look at the picture below: the mask cuts off part of the person’s head (the one near the dog), for example.
        However, if I look at this paper, the masks cover the people better:
        https://arxiv.org/pdf/1703.06870.pdf

        Any idea how the mask can cover the body better than in these examples?

        • Micha Amir Cohen November 28, 2018 at 2:35 am #

          First, thanks for all the information you share with us!!!!

          Just to verify: as I understand it, your opinion is that better training can improve how well the mask fits the object, and that this is not a limitation of Mask R-CNN’s abilities that would require me to search for another AI model.

  26. Gagandeep November 26, 2018 at 3:00 am #

    Thanks a lot for a great blog!

    On the internet, lots of articles are available on custom object detection using the TensorFlow API, but they are not well explained.

    In the future, can we expect a blog post on “Custom object detection using the TensorFlow API”?

    Thanks a lot; your blogs are really very helpful for us.

    Best regards
    Gagandeep

    • Adrian Rosebrock November 26, 2018 at 2:29 pm #

      Hi Gagandeep — if you like how I explain computer vision and deep learning here on the PyImageSearch blog I would recommend taking a look at my book, Deep Learning for Computer Vision with Python which includes six chapters on training your own custom object detectors, including using the TensorFlow Object Detection API.

  27. Sunny December 1, 2018 at 11:20 pm #

    Hi Adrian,

    Thanks for such a great tutorial! I have some questions after reading the tutorial:

    1. Which one is faster, Faster R-CNN or Mask R-CNN? What about the accuracy?
    2. Under what conditions should I consider using Mask R-CNN? Under what conditions should I consider using Faster R-CNN? (Just for Mask R-CNN and Faster R-CNN)
    3. What are the limitations of Mask R-CNN?

    Sincerely,
    Sunny

    • Adrian Rosebrock December 4, 2018 at 10:12 am #

      1. Mask R-CNN builds on Faster R-CNN and includes extra computation. Faster R-CNN is slightly faster.
      2 and 3. Go back and read the “Instance segmentation vs. Semantic segmentation” section of this post. Faster R-CNN is an object detector while Mask R-CNN is used for instance segmentation.

  28. sophia December 3, 2018 at 1:26 pm #

    the mask output that I’m getting for the images you provided is not as smooth as the output shown in this article: there are significant jagged edges on the outline of the mask. is there any way to get a smoother mask like the one you got? I’m running the script on a MacBook Pro.

    looking forward to your reply, thanks.

    • Sophia December 11, 2018 at 3:10 pm #

      Hi Adrian,

      don’t mean to annoy you, but it’d help me considerably if you could give me some ideas about why I’m getting masks with jagged edges (like steps all over the outline) as opposed to smooth mask outputs, and how I can possibly fix this problem. Thanks,

      • Adrian Rosebrock December 13, 2018 at 9:14 am #

        See my reply to Robert in this same comment thread. What interpolation are you using? Try using a different interpolation method when resizing. Instead of “cv2.INTER_NEAREST” you may want to try linear or cubic interpolation.

    • Robert December 12, 2018 at 5:10 pm #

      I’m running into the same issue. Do you have any recommendation Adrian? Are you smoothing the pixels in some way?

      • Adrian Rosebrock December 13, 2018 at 8:56 am #

        What interpolation method are you using when resizing the mask?

  29. Abhiraj Biswas December 4, 2018 at 10:18 pm #

    box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
    (startX, startY, endX, endY) = box.astype("int")
    boxW = endX - startX
    boxH = endY - startY

    What is happening in the first step?
    Why is it 3:7?
    Looking forward to your reply.

    • Adrian Rosebrock December 6, 2018 at 9:50 am #

      That is the NumPy array slice. The 7 values correspond to:

      [batchId, classId, confidence, left, top, right, bottom]
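      A toy example of that slice (the detection values and frame size below are fabricated for illustration):

```python
import numpy as np

H, W = 300, 400  # hypothetical frame size

# Fake `boxes` blob shaped like OpenCV dnn's output: (1, 1, numDetections, 7)
# Each row: [batchId, classId, confidence, left, top, right, bottom]
boxes = np.array([[[[0, 1, 0.98, 0.25, 0.25, 0.75, 0.5]]]], dtype=np.float32)

i = 0
# Indices 3:7 are the normalized box corners; scale them to pixel coordinates
box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
(startX, startY, endX, endY) = box.astype("int")
boxW = endX - startX
boxH = endY - startY
print(startX, startY, endX, endY, boxW, boxH)  # 100 75 300 150 200 75
```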

  30. Bhagesh December 5, 2018 at 5:06 am #

    All the procedures are described in a very simple yet detailed way. Easy to understand.
    Can you please tell me how to get or generate these files?

    colors.txt
    frozen_inference_graph.pb
    mask_rcnn_inception_v2_coco_2018_01_28.pbtxt
    object_detection_classes_coco.txt

    I want to go through your example.

    • Adrian Rosebrock December 6, 2018 at 9:42 am #

      These models were generated by training the Mask R-CNN network. You need to train the actual network which will require you to understand machine learning and deep learning. Do you have any prior experience in those areas?

    • Manuel December 7, 2018 at 10:49 am #

      It looks like those files are generated by TensorFlow; look for tutorials on how to use the TensorFlow Object Detection API.

  31. Bob Estes December 5, 2018 at 12:43 pm #

    Any thoughts on this error:

    … cv2.error: OpenCV(3.4.2) /home/estes/git/cv-modules/opencv/modules/dnn/src/tensorflow/tf_graph_simplifier.cpp:659: error: (-215:Assertion failed) !field.empty() in function ‘getTensorContent’

    Note that I’m using opencv 3.4.2, as suggested, and am running an unmodified version of your code.
    Thanks!

    • Bob Estes December 5, 2018 at 2:12 pm #

      Found a link suggesting I needed 3.4.3. I updated to 3.4 and all is well.

      • Bob Estes December 5, 2018 at 2:13 pm #

        Typo: can’t edit post. I upgraded to 4.0.0 and it worked.

        • Adrian Rosebrock December 6, 2018 at 9:34 am #

          Thanks for letting us know, Bob!

  32. Pablo December 12, 2018 at 10:09 am #

    Hello Adrian,

    Thanks for you post, it’s a really good tutorial!

    But I am wondering whether there is any way to limit the categories of the COCO dataset if I just want it to detect the “person” class. Forgive my stupidity, but I really couldn’t find the model file or any other file containing the relevant code.

    Looking forward to your reply;)

    • Adrian Rosebrock December 13, 2018 at 9:02 am #

      I show you exactly how to do that in this post.
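      For reference, a hedged sketch of the idea with fabricated detections: check each detection’s class ID and keep only “person” (the first entry, index 0, in the COCO labels file the post uses):

```python
import numpy as np

# Hypothetical labels list loaded from object_detection_classes_coco.txt
# (truncated here for the sketch; "person" is the first entry in the real file)
LABELS = ["person", "bicycle", "car"]
PERSON_ID = LABELS.index("person")

# Fake detections shaped like the script's `boxes` blob:
# each row is [batchId, classId, confidence, left, top, right, bottom]
boxes = np.array([[[[0, 0, 0.95, 0.1, 0.1, 0.4, 0.8],     # a "person"
                    [0, 2, 0.90, 0.5, 0.5, 0.9, 0.9]]]],  # a "car"
                 dtype=np.float32)

kept = []
for i in range(boxes.shape[2]):
    classID = int(boxes[0, 0, i, 1])
    if classID != PERSON_ID:
        continue  # skip every class except "person"
    kept.append(i)

print(kept)  # [0]
```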

Leave a Reply