Mask R-CNN with OpenCV

In this tutorial, you will learn how to use Mask R-CNN with OpenCV.

Using Mask R-CNN you can automatically segment and construct pixel-wise masks for every object in an image. We’ll be applying Mask R-CNNs to both images and video streams.

In last week’s blog post you learned how to use the YOLO object detector to detect the presence of objects in images. Object detectors, such as YOLO, Faster R-CNNs, and Single Shot Detectors (SSDs), generate four sets of (x, y)-coordinates which represent the bounding box of an object in an image.

Obtaining the bounding boxes of an object is a good start but the bounding box itself doesn’t tell us anything about (1) which pixels belong to the foreground object and (2) which pixels belong to the background.

That begs the question:

Is it possible to generate a mask for each object in our image, thereby allowing us to segment the foreground object from the background?

Is such a method even possible?

The answer is yes — we just need to perform instance segmentation using the Mask R-CNN architecture.

To learn how to apply Mask R-CNN with OpenCV to both images and video streams, just keep reading!

Mask R-CNN with OpenCV

In the first part of this tutorial, we’ll discuss the difference between image classification, object detection, instance segmentation, and semantic segmentation.

From there we’ll briefly review the Mask R-CNN architecture and its connections to Faster R-CNN.

I’ll then show you how to apply Mask R-CNN with OpenCV to both images and video streams.

Let’s get started!

Instance segmentation vs. Semantic segmentation

Figure 1: Image classification (top-left), object detection (top-right), semantic segmentation (bottom-left), and instance segmentation (bottom-right). We’ll be performing instance segmentation with Mask R-CNN in this tutorial. (source)

Explaining the differences between traditional image classification, object detection, semantic segmentation, and instance segmentation is best done visually.

When performing traditional image classification our goal is to predict a set of labels to characterize the contents of an input image (top-left).

Object detection builds on image classification, but this time allows us to localize each object in an image. The image is now characterized by:

  1. Bounding box (x, y)-coordinates for each object
  2. An associated class label for each bounding box

An example of semantic segmentation can be seen in bottom-left. Semantic segmentation algorithms require us to associate every pixel in an input image with a class label (including a class label for the background).

Pay close attention to our semantic segmentation visualization — notice how each object is indeed segmented but each “cube” object has the same color.

While semantic segmentation algorithms are capable of labeling every object in an image they cannot differentiate between two objects of the same class.

This behavior is especially problematic if two objects of the same class are partially occluding each other: we have no idea where the boundary of one object ends and the next one begins. The two purple cubes demonstrate this, since we cannot tell where one cube starts and the other ends.

Instance segmentation algorithms, on the other hand, compute a pixel-wise mask for every object in the image, even if the objects are of the same class label (bottom-right). Here you can see that each of the cubes has its own unique color, implying that our instance segmentation algorithm not only localized each individual cube but predicted its boundaries as well.
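To make the distinction concrete, here is a minimal NumPy sketch. The tiny 4 x 6 scene and the class ID 1 for "cube" are made up for illustration and are not taken from Figure 1; the point is only that a semantic label map stores one class ID per pixel, while instance segmentation keeps one mask per object.

```python
import numpy as np

# Hypothetical 4x6 scene: 0 = background, 1 = "cube" (class IDs are made up).
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation keeps one boolean mask per object, so the two
# cubes stay separate even though they share a class label.
instance_masks = np.zeros((2, 4, 6), dtype=bool)
instance_masks[0, 0:2, 1:3] = True   # first cube
instance_masks[1, 0:2, 4:6] = True   # second cube

# The semantic map only knows which pixels are "cube"; it cannot count cubes.
cube_pixels = int((semantic == 1).sum())
num_instances = instance_masks.shape[0]

# Collapsing the instance masks reproduces the semantic map for this class.
collapsed = instance_masks.any(axis=0).astype(int)
```

Going from instance masks to a semantic map is a simple collapse, but the reverse is not possible without extra information, which is why instance segmentation is the harder task.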

The Mask R-CNN architecture we’ll be discussing in this tutorial is an example of an instance segmentation algorithm.

What is Mask R-CNN?

The Mask R-CNN algorithm was introduced by He et al. in their 2017 paper, Mask R-CNN.

Mask R-CNN builds on the previous object detection work of R-CNN (2013) and Fast R-CNN (2015) by Girshick et al., and Faster R-CNN (2015) by Ren et al.

In order to understand Mask R-CNN let’s briefly review the R-CNN variants, starting with the original R-CNN:

Figure 2: The original R-CNN architecture (source: Girshick et al., 2013)

The original R-CNN algorithm is a four-step process:

  • Step #1: Input an image to the network.
  • Step #2: Extract region proposals (i.e., regions of an image that potentially contain objects) using an algorithm such as Selective Search.
  • Step #3: Use transfer learning, specifically feature extraction, to compute features for each proposal (which is effectively an ROI) using the pre-trained CNN.
  • Step #4: Classify each proposal using the extracted features with a Support Vector Machine (SVM).

The reason this method works is due to the robust, discriminative features learned by the CNN.

However, the problem with the R-CNN method is that it’s incredibly slow. Furthermore, we’re not actually learning to localize via a deep neural network; we’re effectively just building a more advanced HOG + Linear SVM detector.

To improve upon the original R-CNN, Girshick et al. published the Fast R-CNN algorithm:

Figure 3: The Fast R-CNN architecture (source: Girshick et al., 2015).

Similar to the original R-CNN, Fast R-CNN still utilizes Selective Search to obtain region proposals; however, the novel contribution from the paper was the Region of Interest (ROI) Pooling module.

ROI Pooling works by extracting a fixed-size window from the feature map and using these features to obtain the final class label and bounding box. The primary benefit here is that the network is now, effectively, end-to-end trainable:

  1. We input an image and associated ground-truth bounding boxes
  2. Extract the feature map
  3. Apply ROI pooling and obtain the ROI feature vector
  4. And finally, use the two sets of fully-connected layers to obtain (1) the class label predictions and (2) the bounding box locations for each proposal.
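The ROI pooling step can be sketched in a few lines of NumPy. This is an illustrative simplification, not the Fast R-CNN implementation: it works on a single-channel feature map, the function name roi_max_pool and the 2 x 2 output size are my own choices, and it assumes the ROI spans at least out_size cells in each direction.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one ROI of a 2D feature map down to out_size x out_size.

    roi is (start_x, start_y, end_x, end_y) in feature-map cells, with the
    end coordinates exclusive.
    """
    start_x, start_y, end_x, end_y = roi
    region = feature_map[start_y:end_y, start_x:end_x]

    # Split the rows and columns of the ROI into out_size roughly equal bins.
    row_bins = np.array_split(np.arange(region.shape[0]), out_size)
    col_bins = np.array_split(np.arange(region.shape[1]), out_size)

    pooled = np.zeros((out_size, out_size), dtype=feature_map.dtype)
    for r, rows in enumerate(row_bins):
        for c, cols in enumerate(col_bins):
            # Keep only the strongest activation inside each bin.
            pooled[r, c] = region[np.ix_(rows, cols)].max()
    return pooled

feature_map = np.arange(36).reshape(6, 6)
pooled = roi_max_pool(feature_map, roi=(1, 1, 5, 4), out_size=2)
```

Whatever the size of the incoming ROI (here 3 x 4 cells), the output always has the same fixed shape, which is what lets the fully-connected layers that follow accept it.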

While the network is now end-to-end trainable, performance suffered dramatically at inference (i.e., prediction) by being dependent on Selective Search.

To make the R-CNN architecture even faster we need to incorporate the region proposal directly into the R-CNN:

Figure 4: The Faster R-CNN architecture (source: Ren et al., 2015)

The Faster R-CNN paper by Ren et al. introduced the Region Proposal Network (RPN), which bakes region proposal directly into the architecture, removing the need for the Selective Search algorithm.

As a whole, the Faster R-CNN architecture is capable of running at approximately 7-10 FPS, a huge step towards making real-time object detection with deep learning a reality.

The Mask R-CNN algorithm builds on the Faster R-CNN architecture with two major contributions:

  1. Replacing the ROI Pooling module with a more accurate ROI Align module
  2. Inserting an additional branch out of the ROI Align module

This additional branch accepts the output of the ROI Align and then feeds it into two CONV layers.

The output of the CONV layers is the mask itself.

We can visualize the Mask R-CNN architecture in the following figure:

Figure 5: The Mask R-CNN work by He et al. replaces the ROI Pooling module with a more accurate ROI Align module. The output of the ROI Align module is then fed into two CONV layers. The output of the CONV layers is the mask itself.

Notice the branch of two CONV layers coming out of the ROI Align module — this is where our mask is actually generated.
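Returning to the first contribution: this post does not spell out why ROI Align is more accurate, but in the Mask R-CNN paper the difference is that ROI Align avoids rounding ROI coordinates to whole feature-map cells and instead samples the feature map at fractional positions using bilinear interpolation. The sketch below shows only that sampling step on a toy 2 x 2 feature map; the function name bilinear_sample is hypothetical.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Sample a 2D feature map at a fractional (x, y) position."""
    h, w = feature_map.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0

    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom

feature_map = np.array([[0.0, 10.0],
                        [20.0, 30.0]])

# ROI Align style: use the fractional position as-is.
aligned = bilinear_sample(feature_map, x=0.5, y=0.5)

# ROI Pooling style: truncate the position to a whole cell first.
quantized = feature_map[int(0.5), int(0.5)]
```

The truncated lookup returns the value of a single cell, while the interpolated one blends all four neighbors. For a 15 x 15 mask that is later stretched to full image size, small misalignments like this get magnified.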

As we know, the Faster R-CNN/Mask R-CNN architectures leverage a Region Proposal Network (RPN) to generate regions of an image that potentially contain an object.

Each of these regions is ranked based on its “objectness score” (i.e., how likely it is that a given region could potentially contain an object) and then the top N most confident objectness regions are kept.

In the original Faster R-CNN publication, Ren et al. set N=2,000, but in practice, we can get away with a much smaller N, such as N={10, 100, 200, 300} and still obtain good results.

He et al. set N=300 in their publication which is the value we’ll use here as well.
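Keeping the top N regions by objectness is just a sort followed by a slice. Here is a minimal sketch with made-up scores and N=3 instead of 300 so the result is easy to read; the helper name keep_top_n is hypothetical.

```python
import numpy as np

def keep_top_n(scores, n):
    """Return the indices of the n highest objectness scores, best first."""
    order = np.argsort(scores)[::-1]
    return order[:n]

# Made-up objectness scores for five candidate regions.
scores = np.array([0.10, 0.95, 0.40, 0.80, 0.05])
kept = keep_top_n(scores, n=3)
```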

Each of the 300 selected ROIs goes through three parallel branches of the network:

  1. Label prediction
  2. Bounding box prediction
  3. Mask prediction

Figure 5 above visualizes these branches.

During prediction, each of the 300 ROIs goes through non-maxima suppression and the top 100 detection boxes are kept, resulting in a 4D tensor of 100 x L x 15 x 15, where L is the number of class labels in the dataset and 15 x 15 is the size of each of the L masks.

The Mask R-CNN we’re using here today was trained on the COCO dataset, which has L=90 classes, thus the resulting volume size from the mask module of the Mask R-CNN is 100 x 90 x 15 x 15.
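For readers who have not seen non-maxima suppression before, below is a sketch of the common greedy variant in NumPy, followed by a look at how the 100 x 90 x 15 x 15 volume is indexed. The 0.5 IoU threshold, the three toy boxes, and the class index 17 are assumptions for illustration; the exact suppression settings used inside the pre-trained model are not given in this post.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes is (N, 4) as (start_x, start_y, end_x, end_y)."""
    order = np.argsort(scores)[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]

        # Intersection of the best box with every remaining box.
        xx1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)

        # Drop boxes that overlap the best box too much; keep the others.
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 11, 11],
                  [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = non_max_suppression(boxes, scores)

# The mask volume described above: detections x classes x 15 x 15.
masks = np.zeros((100, 90, 15, 15), dtype="float32")
one_mask = masks[0, 17]   # mask for detection 0, (arbitrary) class index 17
```

The second box overlaps the first heavily and is suppressed, while the distant third box survives. Indexing the mask volume with a detection index and a class index yields a single 15 x 15 mask.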

To visualize the Mask R-CNN process take a look at the figure below:

Figure 6: A visualization of Mask R-CNN producing a 15 x 15 mask, the mask resized to the original dimensions of the image, and then finally overlaying the mask on the original image. (source: Deep Learning for Computer Vision with Python, ImageNet Bundle)

Here you can see that we start with our input image and feed it through our Mask R-CNN network to obtain our mask prediction.

The predicted mask is only 15 x 15 pixels so we resize the mask back to the original input image dimensions.
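In an OpenCV script the resize would most likely be done with cv2.resize, but the idea can be shown without any dependency beyond NumPy. The sketch below uses nearest-neighbor sampling as a stand-in (the interpolation method is an assumption on my part), and the 150 x 300 target size is arbitrary.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D array to (out_h, out_w)."""
    in_h, in_w = mask.shape
    # For each output row/column, pick the source row/column it falls in.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]

# A toy 15 x 15 mask with a 5 x 5 "object" in the middle.
small = np.zeros((15, 15), dtype="float32")
small[5:10, 5:10] = 1.0

big = resize_nearest(small, out_h=150, out_w=300)
```

Note that the mask is stretched independently in each direction, so a square 15 x 15 prediction can cover a wide or tall region of the image.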

Finally, the resized mask can be overlaid on the original input image. For a more thorough discussion on how Mask R-CNN works be sure to refer to:

  1. The original Mask R-CNN publication by He et al.
  2. My book, Deep Learning for Computer Vision with Python, where I discuss Mask R-CNNs in more detail, including how to train your own Mask R-CNNs from scratch on your own data.

Project structure

Our project today consists of two scripts, but there are several other files that are important.

I’ve organized the project in the following manner (as is shown by the tree  command output directly in a terminal):

Our project consists of four directories:

  • mask-rcnn-coco/ : The Mask R-CNN model files. There are four files:
    • frozen_inference_graph.pb : The Mask R-CNN model weights. The weights are pre-trained on the COCO dataset.
    • mask_rcnn_inception_v2_coco_2018_01_28.pbtxt : The Mask R-CNN model configuration. If you’d like to build + train your own model on your own annotated data, refer to Deep Learning for Computer Vision with Python.
    • object_detection_classes_coco.txt : All 90 classes are listed in this text file, one per line. Open it in a text editor to see what objects our model can recognize.
    • colors.txt : This text file contains six colors to randomly assign to objects found in the image.
  • images/ : I’ve provided three test images in the “Downloads”. Feel free to add your own images to test with.
  • videos/ : This is an empty directory. I actually tested with large videos that I scraped from YouTube (credits are below, just above the “Summary” section). Rather than providing a really big zip, my suggestion is that you find a few videos on YouTube to download and test with. Or maybe take some videos with your cell phone and come back to your computer and use them!
  • output/ : Another empty directory that will hold the processed videos (assuming you set the command line argument flag to output to this directory).

We’ll be reviewing two scripts today:

  • The image script performs instance segmentation and applies a mask to the image so you can see where, down to the pixel, the Mask R-CNN thinks an object is.
  • The video script uses the same Mask R-CNN and applies the model to every frame of a video file, then writes each output frame back to a video file on disk.

OpenCV and Mask R-CNN in images

Now that we’ve reviewed how Mask R-CNNs work, let’s get our hands dirty with some Python code.

Before we begin, ensure that your Python environment has OpenCV 3.4.2/3.4.3 or higher installed. You can follow one of my OpenCV installation tutorials to upgrade/install OpenCV. If you want to be up and running in 5 minutes or less, you can consider installing OpenCV with pip. If you have some other requirements, you might want to compile OpenCV from source.

Make sure you’ve used the “Downloads” section of this blog post to download the source code, trained Mask R-CNN, and example images.

From there, open up the image script and insert the following code:

First we’ll import our required packages on Lines 2-7. Notably, we’re importing NumPy and OpenCV. Everything else comes with most Python installations.

From there, we’ll parse our command line arguments:

Our script requires that command line argument flags and parameters be passed at runtime in our terminal. Our arguments are parsed on Lines 10-21, where the first two of the following are required and the rest are optional:

  • --image : The path to our input image.
  • --mask-rcnn : The base path to the Mask R-CNN files.
  • --visualize  (optional): A positive value indicates that we want to visualize how we extracted the masked region on our screen. Either way, we’ll display the final output on the screen.
  • --confidence  (optional): You can override the probability value of 0.5  which serves to filter weak detections.
  • --threshold  (optional): We’ll be creating a binary mask for each object in the image and this threshold value will help us filter out weak mask predictions. I found that a default value of 0.3  works pretty well.
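A parser matching the arguments described above might look like the following sketch. The long flag names and the 0.5 and 0.3 defaults come from the list above (with the base path flag spelled --mask-rcnn to match the mask-rcnn-coco/ directory); the short flags, help strings, and the example image path are my own assumptions.

```python
import argparse

def build_parser():
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True,
                    help="path to input image")
    ap.add_argument("-m", "--mask-rcnn", required=True,
                    help="base path to the Mask R-CNN files")
    ap.add_argument("-v", "--visualize", type=int, default=0,
                    help="positive value shows the ROI, mask, and instance")
    ap.add_argument("-c", "--confidence", type=float, default=0.5,
                    help="minimum probability to filter weak detections")
    ap.add_argument("-t", "--threshold", type=float, default=0.3,
                    help="minimum threshold for the pixel-wise mask")
    return ap

# Parse a hypothetical command line; vars() turns the result into a dict.
args = vars(build_parser().parse_args(
    ["--image", "images/example.jpg", "--mask-rcnn", "mask-rcnn-coco"]))
```

Note that argparse converts the hyphen in --mask-rcnn to an underscore, so the value is read back as args["mask_rcnn"].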

Now that our command line arguments are stored in the args  dictionary, let’s load our labels and colors:

Lines 24-26 load the COCO object class  LABELS . Today’s Mask R-CNN is capable of recognizing 90 classes including people, vehicles, signs, animals, everyday items, sports gear, kitchen items, food, and more! I encourage you to look at object_detection_classes_coco.txt  to see the available classes.

From there we load the COLORS  from the path, performing a couple of array conversion operations (Lines 30-33).
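Here is a sketch of what that parsing could look like. It assumes the labels file holds one class name per line (as described in the project structure) and that colors.txt holds one comma-separated color triple per line, which is a guess about the file format. To stay self-contained, the example parses in-memory strings with made-up contents rather than reading the real files.

```python
import numpy as np

def parse_labels(text):
    """One class label per line."""
    return text.strip().split("\n")

def parse_colors(text):
    """One 'a,b,c' color triple per line (assumed format)."""
    rows = [[int(v) for v in line.split(",")]
            for line in text.strip().split("\n")]
    return np.array(rows, dtype="uint8")

# Stand-ins for the contents of the two text files.
labels_text = "person\nbicycle\ncar\n"
colors_text = "0,255,0\n0,0,255\n255,0,0\n"

LABELS = parse_labels(labels_text)
COLORS = parse_colors(colors_text)
```

Storing the colors as a uint8 array keeps them in the same dtype as the image pixels they will later be blended with.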

Let’s load our model:

First, we build our weight and configuration paths (Lines 36-39), followed by loading the model via these paths (Line 44).
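Building the two paths is plain string work, sketched below with the file names from the project structure. The helper name build_model_paths is hypothetical. The actual loading step is left as a comment because it needs OpenCV and the model files on disk; cv2.dnn.readNetFromTensorflow is the OpenCV function that accepts a TensorFlow weights file and a text graph definition.

```python
import os

def build_model_paths(base_dir):
    """Return (weights_path, config_path) for the Mask R-CNN files."""
    weights_path = os.path.join(base_dir, "frozen_inference_graph.pb")
    config_path = os.path.join(
        base_dir, "mask_rcnn_inception_v2_coco_2018_01_28.pbtxt")
    return weights_path, config_path

weights_path, config_path = build_model_paths("mask-rcnn-coco")

# With OpenCV installed and the files present, the model could be loaded as:
#   net = cv2.dnn.readNetFromTensorflow(weights_path, config_path)
```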

In the next block, we’ll load and pass an image through the Mask R-CNN neural net:

Here we:

  • Load the input image  and extract dimensions for scaling purposes later (Lines 47 and 48).
  • Construct a blob  via cv2.dnn.blobFromImage  (Line 54). You can learn why and how to use this function in my previous tutorial.
  • Perform a forward pass of the blob  through the net  while collecting timestamps (Lines 55-58). The results are contained in two important variables: boxes  and masks .
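If you are wondering what a blob actually is, the NumPy sketch below approximates what cv2.dnn.blobFromImage does when called with swapRB=True and no scaling, mean subtraction, or resizing: it reorders the channels from BGR to RGB and rearranges the array from height x width x channels into a batch of channels x height x width. The helper name to_blob is my own, and this is an approximation for intuition rather than a replacement for the OpenCV function. In a real script the blob would then be handed to net.setInput(blob) before calling net.forward.

```python
import numpy as np

def to_blob(image_bgr):
    """Approximate blobFromImage(image, swapRB=True, crop=False)."""
    rgb = image_bgr[:, :, ::-1]               # BGR -> RGB
    chw = np.transpose(rgb, (2, 0, 1))        # HWC -> CHW
    return chw[np.newaxis].astype("float32")  # add batch axis -> NCHW

# A 4 x 6 image that is pure blue in OpenCV's BGR channel order.
image = np.zeros((4, 6, 3), dtype="uint8")
image[:, :, 0] = 255

blob = to_blob(image)
```

After the swap, the blue values sit in the last channel of the blob (index 2) instead of the first.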

Now that we’ve performed a forward pass of the Mask R-CNN on the image, we’ll want to filter + visualize our results. That’s exactly what this next for loop accomplishes. It is quite long, so I’ve broken it into five code blocks beginning here:

In this block, we begin our filter/visualization loop (Line 66).

We proceed to extract the classID  and confidence  of a particular detected object (Lines 69 and 70).

From there we filter out weak predictions by comparing the confidence  to the command line argument confidence  value, ensuring we exceed it (Line 74).

Assuming that’s the case, we’ll go ahead and make a clone  of the image (Line 76). We’ll need this image later.

Then we scale our object’s bounding box as well as calculate the box dimensions (Lines 81-84).
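The filtering and scaling steps can be sketched as follows. I am assuming the detection output has shape 1 x 1 x N x 7, with each row laid out as [unused, class ID, confidence, startX, startY, endX, endY] and the four coordinates normalized to the range [0, 1]; check this layout against your own model's output before relying on it. The function name and the toy detections are hypothetical.

```python
import numpy as np

def filter_detections(boxes, width, height, min_confidence=0.5):
    """Keep confident detections and scale their boxes to pixel units."""
    kept = []
    for i in range(boxes.shape[2]):
        class_id = int(boxes[0, 0, i, 1])
        confidence = float(boxes[0, 0, i, 2])

        # Skip weak predictions.
        if confidence <= min_confidence:
            continue

        # Scale normalized coordinates back to the image dimensions.
        scaled = boxes[0, 0, i, 3:7] * np.array([width, height, width, height])
        start_x, start_y, end_x, end_y = (int(v) for v in scaled)
        kept.append({
            "class_id": class_id,
            "confidence": confidence,
            "box": (start_x, start_y, end_x, end_y),
            "size": (end_x - start_x, end_y - start_y),  # (boxW, boxH)
        })
    return kept

# Two made-up detections: one confident, one weak.
boxes = np.array([[[
    [0, 2, 0.9, 0.25, 0.5, 0.75, 1.0],
    [0, 0, 0.3, 0.0, 0.0, 1.0, 1.0],
]]])
detections = filter_detections(boxes, width=200, height=100)
```

Only the first detection survives the 0.5 cutoff, and its box comes back in pixel coordinates together with its width and height, which are needed when resizing the mask.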

Image segmentation requires that we find all pixels where an object is present. Thus, we’re going to place a transparent overlay on top of the object to see how well our algorithm is performing. In order to do so, we’ll calculate a mask:

On Lines 89-91, we extract the pixel-wise segmentation for the object as well as resize it to the original image dimensions. Finally we threshold the mask  so that it is a binary array/image (Line 92).

We also extract the region of interest where the object resides (Line 95).

Both the mask  and roi  can be seen visually in Figure 8 later in the post.
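A small sketch of the thresholding and ROI extraction is shown below. It assumes the soft mask has already been resized to the bounding box dimensions, and it uses the 0.3 default threshold mentioned earlier; the image, box, and mask values are made up, and the helper name binarize_and_crop is hypothetical.

```python
import numpy as np

def binarize_and_crop(image, soft_mask, box, threshold=0.3):
    """Threshold a soft mask and slice the matching ROI out of the image."""
    start_x, start_y, end_x, end_y = box
    mask = soft_mask > threshold                 # boolean, box-sized
    roi = image[start_y:end_y, start_x:end_x]    # NumPy slices rows (y) first
    return mask, roi

image = np.full((10, 12, 3), 100, dtype="uint8")
box = (2, 3, 8, 7)   # 6 pixels wide, 4 pixels tall

# Soft mask already resized to the box: 4 rows x 6 columns of probabilities.
soft_mask = np.array([
    [0.1, 0.2, 0.4, 0.4, 0.2, 0.1],
    [0.2, 0.6, 0.9, 0.9, 0.6, 0.2],
    [0.2, 0.6, 0.9, 0.9, 0.6, 0.2],
    [0.1, 0.2, 0.4, 0.4, 0.2, 0.1],
])

mask, roi = binarize_and_crop(image, soft_mask, box)
```

The mask and the ROI end up with the same height and width, which is what allows the mask to be used directly as an index into the ROI.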

For convenience, this next block accomplishes visualizing the mask , roi , and segmented instance  if the --visualize  flag is set via command line arguments:

In this block we:

  • Check to see if we should visualize the ROI, mask, and segmented instance (Line 99).
  • Convert our mask from boolean to integer where a value of “0” indicates background and “255” foreground (Line 102).
  • Perform bitwise masking to visualize just the instance itself (Line 103).
  • Show all three images (Lines 107-109).

Again, these visualization images will only be shown if the --visualize  flag is set via the optional command line argument (by default these images won’t be shown).
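The boolean-to-integer conversion and the masking step look roughly like this in NumPy. In an OpenCV script the masking would typically be done with cv2.bitwise_and(roi, roi, mask=visMask); the multiplication below is a dependency-free stand-in that gives the same result for a 0/255 mask. The toy ROI and mask are made up.

```python
import numpy as np

def isolate_instance(roi, mask):
    """Return a displayable 0/255 mask and the ROI with background zeroed."""
    vis_mask = (mask * 255).astype("uint8")
    # Broadcast the 2D mask over the color channels; background becomes 0.
    instance = roi * mask[:, :, np.newaxis].astype(roi.dtype)
    return vis_mask, instance

roi = np.full((2, 2, 3), 50, dtype="uint8")
mask = np.array([[True, False],
                 [False, True]])

vis_mask, instance = isolate_instance(roi, mask)
```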

Now let’s continue on with visualization:

Line 113 extracts only the masked region of the ROI by passing the boolean mask array as our slice condition.

Then we’ll randomly select one of our six COLORS  to apply our transparent overlay on the object (Line 118).

Subsequently, we’ll blend our masked region with the roi  (Line 119) followed by placing this blended  region into the clone  image (Line 122).
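The blending step is a weighted average between the chosen color and the masked pixels. The sketch below uses an even 50/50 mix and a made-up color purely so the arithmetic is easy to follow; the weights are an assumption, and the helper name overlay_mask is hypothetical.

```python
import numpy as np

def overlay_mask(clone, box, mask, color, alpha=0.5):
    """Blend color into the masked pixels of clone, in place."""
    start_x, start_y, end_x, end_y = box
    region = clone[start_y:end_y, start_x:end_x]   # a view into clone
    masked_pixels = region[mask]                   # (K, 3) masked pixels only
    blended = (alpha * np.asarray(color)
               + (1 - alpha) * masked_pixels).astype("uint8")
    region[mask] = blended                         # writes through to clone
    return clone

clone = np.full((4, 4, 3), 100, dtype="uint8")
box = (1, 1, 3, 3)
mask = np.array([[True, False],
                 [False, True]])

overlay_mask(clone, box, mask, color=(0, 200, 0))
```

Only the pixels where the mask is True change; everything else in the box, and everything outside it, keeps its original value. That is what gives the overlay its object-shaped outline rather than a tinted rectangle.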

Finally, we’ll draw the rectangle and textual class label + confidence value on the image as well as display the result!

To close out, we:

  • Draw a colored bounding box around the object (Lines 125 and 126).
  • Build our class label + confidence text  as well as draw the text  above the bounding box (Lines 130-132).
  • Display the image until any key is pressed (Lines 135 and 136).

Let’s give our Mask R-CNN code a try!

Make sure you’ve used the “Downloads” section of the tutorial to download the source code, trained Mask R-CNN, and example images. From there, open up your terminal and execute the following command:

Figure 7: A Mask R-CNN applied to a scene of cars. Python and OpenCV were used to generate the masks.

In the above image, you can see that our Mask R-CNN has not only localized each of the cars in the image but has also constructed a pixel-wise mask as well, allowing us to segment each car from the image.

If we were to run the same command, this time supplying the --visualize  flag, we can visualize the ROI, mask, and instance as well:

Figure 8: Using the --visualize flag, we can view the ROI, mask, and segmentation intermediate steps for our Mask R-CNN pipeline built with Python and OpenCV.

Let’s try another example image:

Figure 9: Using Python and OpenCV, we can perform instance segmentation using a Mask R-CNN.

Our Mask R-CNN has correctly detected and segmented both people, a dog, a horse, and a truck from the image.

Here’s one final example before we move on to using Mask R-CNNs in videos:

Figure 10: Here you can see me feeding a treat to the family beagle, Jemma. The pixel-wise map of each object identified is masked and transparently overlaid on the objects. This image was generated with OpenCV and Python using a pre-trained Mask R-CNN model.

In this image, you can see a photo of myself and Jemma, the family beagle.

Our Mask R-CNN is capable of detecting and localizing me, Jemma, and the chair with high confidence.

OpenCV and Mask R-CNN in video streams

Now that we’ve looked at how to apply Mask R-CNNs to images, let’s explore how they can be applied to videos as well.

Open up the video script and insert the following code:

First we import our necessary packages and parse our command line arguments.

There are two new command line arguments (which replace --image  from the previous script):

  • --input : The path to our input video.
  • --output : The path to our output video (since we’ll be writing our results to disk in a video file).

Now let’s load our class LABELS , COLORS , and Mask R-CNN neural net :

Our LABELS  and COLORS  are loaded on Lines 24-31.

From there we define our weightsPath  and configPath  before loading our Mask R-CNN neural net  (Lines 34-42).

Now let’s initialize our video stream and video writer:

Our video stream ( vs ) and video writer  are initialized on Lines 45 and 46.

We attempt to determine the number of frames in the video file and display the total  (Lines 49-53). If we’re unsuccessful, we’ll capture the exception and print a status message as well as set total  to -1  (Lines 57-59). We’ll use this value to approximate how long it will take to process an entire video file.

Let’s begin our frame processing loop:

We begin looping over frames by defining an infinite while  loop and capturing the first frame  (Lines 62-64). The loop will process the video until completion which is handled by the exit condition on Lines 68 and 69.

We then construct a blob  from the frame and pass it through the neural net  while grabbing the elapsed time so we can calculate estimated time to completion later (Lines 75-80). The result is included in both boxes  and masks .

Now let’s begin looping over detected objects:

First we filter out weak detections with a low confidence value. Then we determine the bounding box coordinates and obtain the mask  and roi .

Now let’s draw the object’s transparent overlay, bounding rectangle, and label + confidence:

Here we’ve blended  our roi  with color and stored  it in the original frame , effectively creating a colored transparent overlay (Lines 118-122).

We then draw a rectangle  around the object and display the class label + confidence  just above (Lines 125-133).

Finally, let’s write to the video file and clean up:

On the first iteration of the loop, our video writer  is initialized.

An estimate of the amount of time that the processing will take is printed to the terminal on Lines 143-147.
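The estimate itself is simple arithmetic: the time taken by one forward pass multiplied by the total number of frames. Here is a minimal sketch, assuming (as described earlier) that total is -1 when the frame count could not be determined; the function name and the timing values are made up.

```python
def estimate_total_seconds(elapsed_per_frame, total_frames):
    """Estimated processing time, or None when the frame count is unknown."""
    if total_frames <= 0:
        return None
    return elapsed_per_frame * total_frames

# Hypothetical numbers: 0.5 seconds per frame, 240 frames in the video.
estimate = estimate_total_seconds(0.5, 240)
unknown = estimate_total_seconds(0.5, -1)
```

Since the first frame's timing is used for the whole video, treat the number as a rough guide rather than a guarantee.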

The final operation of our loop is to write  the frame to disk via our writer  object (Line 150).

You’ll notice that I’m not displaying each frame to the screen. The display operation is time-consuming and you’ll be able to view the output video with any media player when the script is finished processing anyway.

Note: At the time of writing, OpenCV does not support NVIDIA GPUs for its dnn  module. Right now only a limited number of GPUs are supported, mainly Intel GPUs. NVIDIA GPU support is coming soon but for the time being we cannot easily use a GPU with OpenCV’s dnn  module.

Lastly, we release video input and output file pointers (Lines 154 and 155).

Now that we’ve coded up our Mask R-CNN + OpenCV script for video streams, let’s give it a try!

Make sure you use the “Downloads” section of this tutorial to download the source code and Mask R-CNN model.

You’ll then need to collect your own videos with your smartphone or another recording device. Alternatively, you can download videos from YouTube as I have done.

Note: I am intentionally not including the videos in today’s download because they are rather large (400MB+). If you choose to use the same videos as me, the credits and links are at the bottom of this section.

From there, open up a terminal and execute the following command:

Figure 11: Mask R-CNN applied to video with Python and OpenCV.

In the above video, you can find funny video clips of dogs and cats with a Mask R-CNN applied to them!

Here is a second example, this one of applying OpenCV and a Mask R-CNN to video clips of cars “slipping and sliding” in wintry conditions:

Figure 12: Mask R-CNN object detection is applied to a video scene of cars using Python and OpenCV.

You can imagine a Mask R-CNN being applied to highly trafficked roads, checking for congestion, car accidents, or travelers in need of immediate help and attention.

Credits for the videos and audio include:

  • Cats and Dogs
    • “Try Not To Laugh Challenge – Funny Cat & Dog Vines compilation 2017” on YouTube
    • “Happy rock” on BenSound
  • Slip and Slide
    • “Compilation of Ridiculous Car Crash and Slip & Slide Winter Weather – Part 1” on YouTube
    • “Epic” on BenSound

How do I train my own Mask R-CNN models?

Figure 13: Inside my book, Deep Learning for Computer Vision with Python, you will learn how to annotate your own training data, train your custom Mask R-CNN, and apply it to your own images. I also provide two case studies on (1) skin lesion/cancer segmentation and (2) prescription pill segmentation, a first step in pill identification.

The Mask R-CNN model we used in this tutorial was pre-trained on the COCO dataset…

…but what if you wanted to train a Mask R-CNN on your own custom dataset?

Inside my book, Deep Learning for Computer Vision with Python, I:

  1. Teach you how to train a Mask R-CNN to automatically detect and segment cancerous skin lesions — a first step in building an automatic cancer risk factor classification system.
  2. Provide you with my favorite image annotation tools, enabling you to create masks for your input images.
  3. Show you how to train a Mask R-CNN on your custom dataset.
  4. Provide you with my best practices, tips, and suggestions when training your own Mask R-CNN.

All of the Mask R-CNN chapters include a detailed explanation of both the algorithm and code, ensuring you will be able to successfully train your own Mask R-CNNs.

To learn more about my book (and grab your free set of sample chapters and table of contents), just click here.


Summary

In this tutorial, you learned how to apply the Mask R-CNN architecture with OpenCV and Python to segment objects from images and video streams.

Object detectors such as YOLO, SSDs, and Faster R-CNNs are only capable of producing bounding box coordinates of an object in an image — they tell us nothing about the actual shape of the object itself.

Using Mask R-CNN we can generate pixel-wise masks for each object in an image, thereby allowing us to segment the foreground object from the background.

Furthermore, Mask R-CNNs enable us to segment complex objects and shapes from images which traditional computer vision algorithms would not enable us to do.

I hope you enjoyed today’s tutorial on OpenCV and Mask R-CNN!

To download the source code to this post, and be notified when future tutorials are published here on PyImageSearch, just enter your email address in the form below!


If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!

172 Responses to Mask R-CNN with OpenCV

  1. Faizan Amin November 19, 2018 at 10:39 am #

    Hi. How can we train our own Mask RCNN model. Can we use Tensorflow Models API for this purpose?

    • Adrian Rosebrock November 19, 2018 at 10:46 am #

      Hey Faizan — I cover how to train your own custom Mask R-CNN models inside Deep Learning for Computer Vision with Python.

      • sree December 18, 2018 at 11:07 pm #

        Thank you Adrian for the article. I am a beginner in Python CV. When I was testing the code with the example_01 image, it was detecting only one car instead of two cars. Any explanation?

        • Adrian Rosebrock December 19, 2018 at 1:53 pm #

          Click on the window opened by OpenCV and click any key on your keyboard to advance the execution of the script.

  2. Steph November 19, 2018 at 10:49 am #

    Hi Adrian,

    thanks a lot for another great tutorial.
    I already knew Mask-RCNN for trying it on my problem, but apparently that is not the way to go.
    What I want to do is to detect movie posters in videos and then track them over time. The first time they appear I also manually define a mask to simplify the process. Unfortunately any detection/tracking method I tried failed miserably… the detection step is hard, because the poster is not an object available in the models, and it can vary a lot depending on the movie it represents; tracking also fails, since I need a pixel perfect tracking and any deep learning method I tried does not return a shape with straight borders but always rounded objects.

    Do you have any algorithms to recommend for this specific task? Or shall I resort to traditional, not DL-based methods?

    Thanks in advance!

    • Adrian Rosebrock November 19, 2018 at 12:21 pm #

      How many example images per movie poster do you have?

      • Steph November 20, 2018 at 3:49 am #

        I have 10 videos, each of them showing a movie poster for about 150 frames. The camera is always panning or zooming, so the shape and size of the poster is constantly changing.
        Thanks in advance for any help 🙂

        • Adrian Rosebrock November 20, 2018 at 9:04 am #

          I assume each of the 150 frames has the same movie poster? Are these 150 frames your training data? If so, have you labeled them and annotated them so you can train an object detector?

          • Steph November 20, 2018 at 9:27 am #

            Yes, I have 1500 images as training data. For each movie poster, i created a binary mask showing where is the poster. The shape is usually a quadrilateral, unless in case the poster is partially occluded.
            I’d like to train a system which, given an annotated frame of a video, could then detect the movie poster with pixel precision during camera movement and occlusions, but so far I didn’t have luck. Even systems especially trained for that (as they do in the Davis challenge) seem to fail after just a few frames.
            If you are going to work / publish a post on the issue, let me know!

          • Adrian Rosebrock November 21, 2018 at 9:39 am #

            Thanks for the clarification. In that case I would highly suggest using a Mask R-CNN. The Mask R-CNN will give you a pixel-wise segmentation of the movie poster. Once you have the location of the poster you can either:

            1. Continue to process subsequent frames using the Mask R-CNN
            2. Or you can apply a dedicated object tracker

  3. Mansoor November 19, 2018 at 11:38 am #

    Adrian, you are constantly bombarding us with such valuable information every single week, which otherwise would take us months to even understand.

    Thank you for sharing this incredible piece of code with us.

    • Adrian Rosebrock November 19, 2018 at 12:20 pm #

      Thanks Mansoor — it is my pleasure 🙂

  4. YEVHENII RVACHOV November 19, 2018 at 12:04 pm #

    Hello, Adrian.

    Thanks so much for your article and explanation of principles R-CNN

    • Adrian Rosebrock November 19, 2018 at 12:20 pm #

      You are welcome, I’m happy you found the post useful! I hope you can apply it to your own projects.

  5. Atul November 19, 2018 at 12:31 pm #

    Thanks , very informative and useful 🙂

    • Adrian Rosebrock November 19, 2018 at 12:57 pm #

      Thanks Atul!

  6. Faraz November 19, 2018 at 12:42 pm #

    Hi Adrian.

    Thank you again for the great effort. My question is this: according to the authors of the Mask R-CNN paper, it runs at around 5 FPS. Isn’t that a bit slow for real-time applications, and how do you compare YOLO or SSD with it? Thanks.

    • Adrian Rosebrock November 19, 2018 at 12:56 pm #

      Yes, Faster R-CNN and Mask R-CNN are slower than YOLO and SSD. I would suggest you read the “Instance segmentation vs. Semantic segmentation” section of this tutorial. That section explains how YOLO, SSD, and Faster R-CNN (object detectors) are different than Mask R-CNN (instance segmentation).

      • Faraz November 19, 2018 at 1:28 pm #

        Thanks Adrian, so what I understand is that Mask R-CNN may not be suitable for real-time applications. Great tutorial by the way. Thumbs up.

  7. Cenk November 19, 2018 at 2:16 pm #

    Hi Adrian,

    Thank you very much for your sharing the code along with the blog, as it will be very helpful for us to play around and understand better.

    • Adrian Rosebrock November 19, 2018 at 2:20 pm #

      Thanks Cenk!

  8. Walid November 19, 2018 at 3:36 pm #

    Thanks a lot.
    It worked when I updated OpenCV 🙂

    • Adrian Rosebrock November 19, 2018 at 4:11 pm #

      Awesome, glad to hear it!

  9. atom November 19, 2018 at 7:49 pm #

    Great post, Adrian. Actually, a large number of papers are published every day on machine learning, so can you share with us how you keep track of most of them? Thanks so much, Adrian.

    • atom November 21, 2018 at 4:09 am #

      Adrian, please give me some comment about this. Thanks

  10. Paul November 19, 2018 at 8:17 pm #

    Hi Adrian
    This is awesome. I loved your book. (still trying to learn most of it)
    I used matterport’s Mask RCNN in our software to segment label-free cells in microscopy images and track them.
    I wonder if you can comment on two things:
    1. Would you comment on how to improve the accuracy of the mask?
    Do you think it’s the interpolation error, or can we improve the accuracy by increasing the depth of the CNNs?

    2. I’ve seen this “flicking” thing in segmentation (as in the video).
    In image segmentation, one set of trained weights can recognize a target while another may not, which is a kind of false negative.
    Would you know where it comes from?

    • Adrian Rosebrock November 20, 2018 at 9:18 am #

      1. Are you already applying data augmentation? If not, make sure you are. I’m also not sure how much data you have for training but you may need more.

      2. False-negatives and false-positives will happen, especially if you’re trying to run the model on video. Ways to improve your model include using training data that is similar to your testing data, applying data augmentation, regularization, and anything that will increase the ability of your model to generalize.

  11. Jaan November 19, 2018 at 8:24 pm #

    This looks really cool. Is this the same thing as pose estimation?

    • Adrian Rosebrock November 20, 2018 at 9:16 am #

      No, pose estimation actually finds keypoints/landmarks for specific joints/body parts. I’ll try to cover pose estimation in the future.

  12. Sumit November 20, 2018 at 12:01 am #

    Thank you so much for all the wonderful tutorials. I am a great follower of your work. I had a doubt here:

    To perform localization and classification at the same time, we add 2 fully connected layers at the end of our network architecture. One classifies and the other provides the bounding box information. But how do we know which fully connected layer produces the coordinates and which one is for classification?

    What I read in some blogs is that we receive a matrix at the end which contains: [confidence score, bx, by, bw, bh, class1, class2, class3].

    • Adrian Rosebrock November 20, 2018 at 9:11 am #

      We know due to our implementation. One FC branch is (N + 1)-d, where N is the number of class labels and the additional one is for the background. The other FC branch is (4 x N)-d, where each group of four values represents the deltas for the final predicted bounding box of one class.
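
The two branch shapes described in this reply can be sketched with NumPy. This is only an illustration under stated assumptions: the sizes are made up, the scores and deltas are random stand-ins for real network output, and the layout (background in column 0, deltas grouped per class) is assumed rather than taken from the tutorial's model. `select_deltas` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
NUM_CLASSES = 3   # N foreground class labels
NUM_ROIS = 2      # number of region proposals being scored

rng = np.random.default_rng(0)
# Classification branch: one score per class plus one for the background.
cls_scores = rng.random((NUM_ROIS, NUM_CLASSES + 1))
# Box branch: four deltas for each of the N classes.
box_deltas = rng.random((NUM_ROIS, 4 * NUM_CLASSES))


def select_deltas(cls_scores, box_deltas):
    """Return the best foreground class per ROI and its four box deltas."""
    # Assumption: column 0 of cls_scores is the background class.
    best = np.argmax(cls_scores[:, 1:], axis=1)
    # View the flat 4*N vector as N groups of four deltas.
    per_class = box_deltas.reshape(box_deltas.shape[0], -1, 4)
    return best, per_class[np.arange(best.shape[0]), best]


best, deltas = select_deltas(cls_scores, box_deltas)
print(cls_scores.shape, box_deltas.shape, deltas.shape)
```

Because the two branches have different output sizes and are wired up separately in the implementation, there is no ambiguity about which one holds the class scores and which one holds the box deltas.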

  13. Dona Paula November 20, 2018 at 12:54 am #

    Thanks for your invaluable tutorials. I ran your code as is; however, I am getting only one object instance segmented, i.e., if I have two cars in the image (e.g. example1), only one car is detected and instance segmented. I have tried with other images. Same story.

    My OpenCV version is 3.4.3. Please suggest a resolution.

    • Dona Paula November 20, 2018 at 1:32 am #

      Please ignore my previous comment. I thought it would be an animated gif.

      • Adrian Rosebrock November 20, 2018 at 9:08 am #

        Click on the window opened by OpenCV and press any key on your keyboard. It will advance the execution of the script to highlight the next car.

  14. Digant November 20, 2018 at 1:15 am #

    Hi Adrian,
    Can you suggest an architecture for semantic segmentation that performs segmentation without resizing the image? Any blog/code related to it would be great.

  15. Kark November 20, 2018 at 7:04 am #

    Hi Adrian,

    Thanks for this awesome post.

    I am working on a similar project where I have to identify and localize each object in the picture. Can you please advise how to make this script identify all the objects in the picture like a carton box, wooden block etc. I will not know what could be in the picture in advance.

    • Adrian Rosebrock November 20, 2018 at 9:02 am #

      You would need to first train a Mask R-CNN to identify each of the objects you would like to recognize. Mask R-CNNs, and in general, all machine learning models, are not magic boxes that intuitively understand the contents of an image. Instead, we need to explicitly train them to do so. If you’re interested in training your own custom Mask R-CNN networks be sure to refer to Deep Learning for Computer Vision with Python where I discuss how to train your own models in detail (including code).

  16. abkul November 20, 2018 at 7:52 am #

    Great tutorial.

    I am interested in extracting and classifying/labeling plant disease(s) and insects from an image sent by a farmer using a deep learning paradigm. Please advise on the relevant approaches/techniques to be employed.

    Are you planning to diversify your blog with examples in the field of plant pests or disease diagnosis in future?

    • Adrian Rosebrock November 20, 2018 at 8:59 am #

      I haven’t covered plant diseases specifically before but I have covered human diseases such as skin lesion/cancer segmentation using a Mask R-CNN. Be sure to take a look at Deep Learning for Computer Vision with Python for more details. I’m more than confident that the book would help you complete your plant disease classification project.

  17. Abhiraj Biswas November 20, 2018 at 8:21 am #

    As you mentioned, the output is stored to a file. I wanted to know how we can show the output on the screen frame by frame.

    • Adrian Rosebrock November 20, 2018 at 8:57 am #

      You can insert a call to cv2.imshow but keep in mind that the Mask R-CNN running on a CPU, at best, may only be able to do 1 FPS. The results wouldn’t look as good.

  18. Dave P November 20, 2018 at 1:02 pm #

    Hi Adrian, Another great tutorial – Your program examples just work first time (unlike many other object detection tutorials on the web…)
    I am trying to reduce the number of false positives from my CCTV alarm system which monitors for visitors against a very ‘noisy’ background (trees blowing in the wind etc) and using an RCNN looks most promising. The Mask RCNN gives very accurate results but I don’t really need the pixel-level masks and the extra CPU time to generate them.
    Is there a (simple) way to just generate the bounding boxes?
    I have tried to use Faster RCNN rather than Mask RCNN but the accuracy I am getting (from the aforementioned web tutorials and Github downloads) is much poorer.

  19. Paul Z November 20, 2018 at 5:51 pm #

    Never even heard of R-CNN until now .. but great follow up to the YOLO post. Question … sometimes the algo seems to identify the same person twice, very very similar confidence levels and at times, the same person twice, once at ~90% and once at ~50%.

    Any ideas?

    • Adrian Rosebrock November 21, 2018 at 9:27 am #

      The same person in the same frame? Or the same person in subsequent frames?

  20. sophia November 21, 2018 at 10:28 am #

    another great article! would it be possible to use instance segmentation or object detection to detect whether an object is on the floor? i wanna be able to scan a room and trigger an alert if an object is on the floor. I haven’t seen any deep learning algorithm applied to detect the floor. thanks, look forward to your reply.

    • Adrian Rosebrock November 25, 2018 at 9:44 am #

      That would actually be a great application of semantic segmentation. Semantic segmentation algorithms can be used to classify all pixels of an image/frame. Try looking into semantic segmentation algorithms for room understanding.

      • Sophia November 25, 2018 at 1:35 pm #

        thanks Adrian, I’ll look into using semantic segmentation for this, look forward to more articles from you!

  21. Bharath November 21, 2018 at 10:56 pm #

    Hi Adrian, I found you have lots of blog posts on installing OpenCV on the Raspberry Pi where you build and compile it (min. 2 hours)… I found that pip install opencv-python works fine on the Raspberry Pi. Did you try it?

    • Adrian Rosebrock November 25, 2018 at 9:35 am #

      I actually have an entire tutorial dedicated to installing OpenCV with pip. I would refer to it to ensure your install is working properly.

  22. abkul November 22, 2018 at 5:18 am #

    Like always great tutorial.

    No algorithm is perfect. What are the shortcomings of the Mask R-CNN approach/algorithm?

    • Adrian Rosebrock November 25, 2018 at 9:29 am #

      Mask R-CNNs are extremely slow. Even on a GPU they only operate at 5-7 FPS.

  23. Mandar Patil November 22, 2018 at 6:57 am #

    Hey Adrian,
    I made the entire tree structure on Google Colab and ran the file.

    !python --mask-rcnn mask-rcnn-coco --image images/example_01.jpg

    It gave the following result:
    [INFO] loading Mask R-CNN from disk…
    [INFO] Mask R-CNN took 5.486852 seconds
    [INFO] boxes shape: (1, 1, 3, 7)
    [INFO] masks shape: (100, 90, 15, 15)
    : cannot connect to X server

    Could you please tell me why this happened?

    • Adrian Rosebrock November 25, 2018 at 9:28 am #

      I don’t believe Google Colab has X11 forwarding which is required to display images via cv2.imshow. Don’t worry though, you can still use matplotlib to display images.
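
A minimal sketch of the matplotlib route mentioned in this reply, assuming the annotated image is a NumPy array in OpenCV's BGR channel order. The helper names `bgr_to_rgb` and `show_inline` are hypothetical, and the solid blue frame is just a stand-in for the script's output image.

```python
import numpy as np


def bgr_to_rgb(image):
    """Reverse the channel order: OpenCV stores BGR, matplotlib expects RGB."""
    return image[:, :, ::-1]


def show_inline(image, title=""):
    """Display a BGR image with matplotlib instead of cv2.imshow."""
    # Imported lazily so the conversion helper works without matplotlib.
    import matplotlib.pyplot as plt

    plt.imshow(bgr_to_rgb(image))
    plt.title(title)
    plt.axis("off")
    plt.show()


# Stand-in for the annotated output image: a solid blue frame in BGR order.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[:, :, 0] = 255
rgb = bgr_to_rgb(frame)
```

In a notebook, calling `show_inline(frame)` in place of the `cv2.imshow` and `cv2.waitKey` calls should render the image in the cell output without needing an X server.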

  24. xuli November 22, 2018 at 11:56 am #

    cool..leading the way for us to the most recent technology

  25. Micha November 24, 2018 at 3:26 pm #

    I’m thinking of using Mask R-CNN for background removal. Is there any way to make the mask more accurate than in the examples in the video?

    • Adrian Rosebrock November 25, 2018 at 8:56 am #

      You would want to ensure your Mask R-CNN is trained on objects that are similar to the ones in your video streams. A deep learning model is only as good as the training data you give it.

      • Micha Amir Cohen November 26, 2018 at 7:51 am #

        I’m talking about person recognition. It can be any person… so I’m understanding your comment “objects that are similar”.

        Look at the picture below: the mask cuts off part of the person’s head (the one near the dog), for example.
        However, if I look at this document, the masks cover the people better.

        Any idea how the mask can cover the body better than in the examples?

        • Micha Amir Cohen November 28, 2018 at 2:35 am #

          First, thanks for all the information you share with us!

          Just to verify: as I understand it, your opinion is that better training can improve the mask’s fit to the required object, and that it is not a limitation of Mask R-CNN’s ability that would mean I need to search for another AI model for my needs?

  26. Gagandeep November 26, 2018 at 3:00 am #

    Thanks a lot for a great blog!

    There are lots of articles on the internet about custom object detection using the TensorFlow API, but they are not well explained.

    In the future, can we expect a blog post on “Custom object detection using the TensorFlow API”?

    Thanks a lot, your blog posts are really very helpful for us.

    Best regards

    • Adrian Rosebrock November 26, 2018 at 2:29 pm #

      Hi Gagandeep — if you like how I explain computer vision and deep learning here on the PyImageSearch blog I would recommend taking a look at my book, Deep Learning for Computer Vision with Python which includes six chapters on training your own custom object detectors, including using the TensorFlow Object Detection API.

  27. Sunny December 1, 2018 at 11:20 pm #

    Hi Adrian,

    Thanks for such a great tutorial! I have some questions after reading the tutorial:

    1. Which one is faster between Faster R-CNN and Mask R-CNN? What about the accuracy?
    2. Under what conditions should I consider using Mask R-CNN? Under what conditions should I consider using Faster R-CNN? (Just for Mask R-CNN and Faster R-CNN)
    3. What is the limitation of Mask R-CNN?


    • Adrian Rosebrock December 4, 2018 at 10:12 am #

      1. Mask R-CNN builds on Faster R-CNN and includes extra computation. Faster R-CNN is slightly faster.
      2 and 3. Go back and read the “Instance segmentation vs. Semantic segmentation” section of this post. Faster R-CNN is an object detector while Mask R-CNN is used for instance segmentation.

  28. sophia December 3, 2018 at 1:26 pm #

    the mask output that I’m getting for the images that you provided is not as smooth as the output that you have shown in this article – there are significant jagged edges on the outline of the mask. is there any way to get a smoother mask as you have got ? I’m running the script on a Macbook Pro.

    looking forward to your reply, thanks.

    • Sophia December 11, 2018 at 3:10 pm #

      Hi Adrian,

      I don’t mean to annoy you, but it’d help me considerably if you could give me some ideas for why I’m getting masks with jagged edges (like steps all over the outline) as opposed to the smooth mask outputs, and how I can possibly fix this problem. Thanks,

      • Adrian Rosebrock December 13, 2018 at 9:14 am #

        See my reply to Robert in this same comment thread. What interpolation are you using? Try using a different interpolation method when resizing. Instead of “cv2.INTER_NEAREST” you may want to try linear or cubic interpolation.

        • Sophia December 13, 2018 at 2:16 pm #

          using cubic interpolation gives the same results as you show in this post. thank you so much!!

          • Adrian Rosebrock December 18, 2018 at 9:32 am #

            Awesome, glad to hear it!

    • Robert December 12, 2018 at 5:10 pm #

      I’m running into the same issue. Do you have any recommendation Adrian? Are you smoothing the pixels in some way?

      • Adrian Rosebrock December 13, 2018 at 8:56 am #

        What interpolation method are you using when resizing the mask?

  29. Abhiraj Biswas December 4, 2018 at 10:18 pm #

    box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
    (startX, startY, endX, endY) = box.astype("int")
    boxW = endX - startX
    boxH = endY - startY

    What is happening in the first step?
    Why is it 3:7?
    Looking forward to your reply.

    • Adrian Rosebrock December 6, 2018 at 9:50 am #

      That is a NumPy array slice. Each detection holds 7 values, and 3:7 selects the last four (the box coordinates). The 7 values correspond to:

      [batchId, classId, confidence, left, top, right, bottom]
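
A small worked example of the slice discussed above, assuming the coordinates are normalized to the range 0 to 1 (which is why they are multiplied by the image width and height). The `boxes` array here is a hand-built stand-in with a single detection, and the image size is hypothetical.

```python
import numpy as np

W, H = 400, 300  # hypothetical image width and height

# Stand-in for the detection output: shape (1, 1, num_detections, 7).
boxes = np.zeros((1, 1, 1, 7), dtype="float32")
# [batchId, classId, confidence, left, top, right, bottom]
boxes[0, 0, 0] = [0, 2, 0.9, 0.25, 0.5, 0.75, 1.0]

i = 0
# Indices 3..6 are the normalized corners; scale x by W and y by H.
box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
(startX, startY, endX, endY) = box.astype("int")
boxW = endX - startX
boxH = endY - startY
```

Indices 1 and 2 of the same row give the class ID and the confidence, which is how the script can filter weak detections before touching the box.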

  30. Bhagesh December 5, 2018 at 5:06 am #

    In a very simple yet detailed way all the procedures are described. Easy to understand.
    Can you please tell me how to get or generate these files ?


    I want to go through your example.

    • Adrian Rosebrock December 6, 2018 at 9:42 am #

      These models were generated by training the Mask R-CNN network. You need to train the actual network which will require you to understand machine learning and deep learning. Do you have any prior experience in those areas?

    • Manuel December 7, 2018 at 10:49 am #

      It looks like those files are generated by TensorFlow. Look for tutorials on how to use the TensorFlow Object Detection API.

  31. Bob Estes December 5, 2018 at 12:43 pm #

    Any thoughts on this error:

    … cv2.error: OpenCV(3.4.2) /home/estes/git/cv-modules/opencv/modules/dnn/src/tensorflow/tf_graph_simplifier.cpp:659: error: (-215:Assertion failed) !field.empty() in function ‘getTensorContent’

    Note that I’m using opencv 3.4.2, as suggested, and am running an unmodified version of your code.

    • Bob Estes December 5, 2018 at 2:12 pm #

      Found a link suggesting I needed 3.4.3. I updated to 3.4 and all is well.

      • Bob Estes December 5, 2018 at 2:13 pm #

        Typo: can’t edit post. I upgraded to 4.0.0 and it worked.

        • Adrian Rosebrock December 6, 2018 at 9:34 am #

          Thanks for letting us know, Bob!

  32. Pablo December 12, 2018 at 10:09 am #

    Hello Adrian,

    Thanks for your post, it’s a really good tutorial!

    But I am wondering whether there is any way to limit the categories of the COCO dataset if I just want it to detect the ‘person’ class. Forgive my stupidity, I really couldn’t find the model file or any other file that contains the code related to it.

    Looking forward to your reply;)

    • Adrian Rosebrock December 13, 2018 at 9:02 am #

      I show you exactly how to do that in this post.

  33. Sophia December 13, 2018 at 2:30 pm #

    this is probably my favorite of all of your posts! i have a question about extending the Mask R-CNN model. Currently, if i run the code on a video that has more than 1 person, i get a mask output labeled ‘person’ for each person in the video. Is there any way to identify and track each person in the video, so the output would be ‘person 1’, ‘person 2’ and so on… Thanks,

  34. Michael December 19, 2018 at 11:27 pm #

    Hi Adrian,

    Amazing book. I’ve been reading through it. Love the materials. I was going through your custom Mask R-CNN pills example and the annotation is done using a circle. If I am training on something custom I’m using polygons. The code finds the center of the circle from the annotation and draws a mask. Any suggestions on how to update this to get it to work with polygon annotations in VIA? Thanks!

    • Adrian Rosebrock December 20, 2018 at 5:15 am #

      Thanks Michael, I’m glad you’re enjoying Deep Learning for Computer Vision with Python!

      As for your question, yes, there is a way to draw polygons. Using the scikit-image library it’s actually quite easy. You’ll need the skimage.draw.polygon function.

  35. Michael December 20, 2018 at 5:11 pm #

    Hi Adrian,

    Thanks for that. I was able to train now but I realized it was only on CPU and it was sooo slow. When I convert to GPU I get a Segmentation Fault (Core Dumped) could be related to a version issue? How can I repay your time???


    • Adrian Rosebrock December 27, 2018 at 11:06 am #

      Hey Michael, be sure to see my quote from the tutorial:

      “Furthermore, OpenCV does not support NVIDIA GPUs for its dnn module. Right now only a limited number of GPUs are supported, mainly Intel GPUs. NVIDIA GPU support is coming soon but for the time being we cannot easily use a GPU with OpenCV’s dnn module.”

  36. Parupudi Pramod December 24, 2018 at 7:59 am #

    Can I use this on a gray scale image like Dental x-ray?

    • Adrian Rosebrock December 27, 2018 at 10:38 am #

      Yes, Mask R-CNNs can be used on grayscale, single channel images. I demonstrate how to train your own custom Mask R-CNNs, including Mask R-CNNs for medical applications, inside my book, Deep Learning for Computer Vision with Python.

  37. Christian December 29, 2018 at 5:22 pm #


    I really appreciate all of your detailed tutorials. I’m just getting familiar with openCV, and after walking through a few of them I have been able to start some cool projects.

    I was curious if you could think of a method to add a contrail to tracked objects using the code provided? Right now, I am “ignoring” all objects except for the sports ball class, so I am just looking to add the movement path to the ball (similar to your past Ball Tracking with OpenCV tutorial).


    • Adrian Rosebrock January 2, 2019 at 9:28 am #

      Thanks Christian, I’m glad you’re enjoying the tutorials.

      You could certainly adapt the ball tracking contrails to this tutorial as well. Just maintain a “deque” class for each detected object like we do in the ball tracking tutorial (I would recommend computing the center x,y-coordinates of the bounding box).
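
The per-object deque idea from this reply can be sketched with the standard library alone. The trail length, the `update_trail` helper, and the integer object ID are all assumptions for illustration; in practice the ID would have to come from some tracking step, which this tutorial's code does not provide on its own.

```python
from collections import defaultdict, deque

TRAIL_LENGTH = 32  # hypothetical number of past positions to keep

# One bounded deque of center points per object ID.
trails = defaultdict(lambda: deque(maxlen=TRAIL_LENGTH))


def update_trail(object_id, box):
    """Append the center of a (startX, startY, endX, endY) box to its trail."""
    (startX, startY, endX, endY) = box
    center = ((startX + endX) // 2, (startY + endY) // 2)
    # Newest point goes first; the oldest falls off once the deque is full.
    trails[object_id].appendleft(center)
    return trails[object_id]


# Two consecutive frames for a single tracked ball (ID 0).
update_trail(0, (100, 100, 140, 140))
trail = update_trail(0, (110, 104, 150, 144))
```

To draw the contrail, one option is to loop over consecutive pairs of points in `trail` and connect them with `cv2.line` on the output frame.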

  38. setti January 9, 2019 at 4:50 pm #

    When I run it I see this error. Can you please tell me how to fix it? error: the following arguments are required: -i/--image, -m/--mask-rcnn

  39. Zhijia Chen January 10, 2019 at 1:15 pm #

    Hi Adrian,

    Currently, I am doing a project which is about capturing the trajectory of some scalpels when a surgeon is doing operations, so that I can input this data to a robot arm and hope it can help surgeons with operations.

    The first task of my project is to track the scalpels first, then the second task is to know their 2D movement from the videos provided and even 3D motions.

    I think CNN can help me with the first task easily, right?
    My question is: is it possible to help me with the second task?

    Looking forward to your reply, thanks.

    • Adrian Rosebrock January 11, 2019 at 9:34 am #

      Yes, Mask R-CNNs and object detectors will help you detect an object. You can then track them using object tracking algorithms.

  40. Carmelo January 11, 2019 at 3:30 am #


    Congrats on the tutorial. Really well done!
    I have a question:
    I used your code, but the masks are not as smooth as the ones I see in your article; they are quite roughly squared.
    Is there a reason for this?
    Thank you!

    • Adrian Rosebrock January 11, 2019 at 9:27 am #

      See my reply to Sophia.

  41. 葉又銘 January 12, 2019 at 4:24 am #

    How did you arrive at line 97 of “mask_rcnn_video.py”: “box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])”? I went through your other articles and I am trying to use YOLO + OpenCV with the centroid tracker, but there is always a problem with the coordinates. I think it is a problem with the box. I don’t understand YOLO’s box = [0:4]. What is the difference between the two? The articles where you used the centroid tracker all use box = boxes[0, 0, i, 3:7]. Please help me answer this.
    I am trying to get YOLO + the centroid tracker working. Thank you.

    • Adrian Rosebrock January 16, 2019 at 10:17 am #

      The returned coordinates for the bounding boxes are:

      [batchID, classID, confidence, left, top, right, bottom]

      • yoming January 17, 2019 at 9:39 am #

        Yes, but in YOLO it is “box = detection[0:4] * np.array([W, H, W, H])”, and I don’t know how to use it.

        • Adrian Rosebrock January 22, 2019 at 9:49 am #

          YOLO’s return signature is slightly different. It’s actually:

          [center_x, center_y, width, height]
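
A minimal sketch of converting the YOLO layout above into the corner format that the Mask R-CNN code (and a centroid tracker fed with corner boxes) uses. The `detection` vector and image size are made-up stand-ins; only its first four values matter here.

```python
import numpy as np

W, H = 400, 300  # hypothetical image width and height

# Stand-in for one YOLO detection; only the first four values matter here.
detection = np.array([0.5, 0.5, 0.25, 0.5, 0.9])

# Scale the normalized center and size to pixel units.
box = detection[0:4] * np.array([W, H, W, H])
(centerX, centerY, width, height) = box.astype("int")

# Convert center/size to the corner format used with Mask R-CNN boxes.
startX = int(centerX - width / 2)
startY = int(centerY - height / 2)
endX = startX + int(width)
endY = startY + int(height)
```

Passing `(startX, startY, endX, endY)` onward keeps the rest of a corner-based pipeline unchanged, regardless of which detector produced the box.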

  42. Ben January 17, 2019 at 6:06 am #

    Hi Adrian, really helpful post. Would it be possible to extract a 128-D object embedding vector (or a larger vector like 256-D or 512-D) that quantifies a specific instance of that object class, similar to the way a 128-D face embedding vector is extracted for a face?

    For example, if you have two different (different color, different model) Toyota cars in an image, then two object embedding vectors would be generated in such a way that both cars could be re-identified in a later image, even if those cars would appear in different angles – similar to the way a person’s face can be re-identified by the 128-D face embedding vector.

    • Adrian Rosebrock January 22, 2019 at 9:51 am #

      Yes, but you would need to train a model to do exactly that. I would suggest looking into siamese networks and triplet loss functions.

  43. yoming January 17, 2019 at 8:57 am #

    How do I show two bounding boxes in one image without pressing ESC?

    • Adrian Rosebrock January 22, 2019 at 9:50 am #

      You would remove the “cv2.imshow” statement inside the “for” loop and place it after the loop.

  44. Walid January 20, 2019 at 6:34 pm #

    I think it is better in Figure 5 to change notation N to L for consistency

  45. Miguel Bordalo March 1, 2019 at 1:32 pm #

    Would it be possible to run Mask R-CNN on the Raspberry Pi?

    • Adrian Rosebrock March 5, 2019 at 9:04 am #

      Realistically, no. The Raspberry Pi is far too underpowered. The best you could do is attempt to run the model on a Movidius NCS connected to the Pi.

  46. Adama March 12, 2019 at 8:01 am #

    I ordered the largest bundle, the ImageNet Bundle. It’s worth it!

    I hope for more material using TensorFlow 2.0, TF Lite, TPU, and Colab for more coherent and easy development.

    I have a question: can we add background sample images, without masks, alongside the masked objects so the model is better trained to tell similar objects apart, such as detecting windows but not doors?

    • Adrian Rosebrock March 13, 2019 at 3:17 pm #

      Thanks for picking up a copy of the ImageNet Bundle, Adama! I’m glad you are enjoying it.

      As far as your question goes, yes, you can insert “negative” samples in your dataset. As long as none of the regions are annotated they will be used as negative samples.

  47. Hocine March 13, 2019 at 7:45 am #

    Hello dear
    I want to know if it’s possible to run Mask R-CNN with a webcam so that it detects in real time?

    • Adrian Rosebrock March 13, 2019 at 3:08 pm #

      You would need a GPU to run the Mask R-CNN network in real-time. It is not fast enough to run in real-time on the CPU.

      • Hocine March 14, 2019 at 8:19 am #

        It works, but it’s so heavy. Is there no way to make it a little faster?

      • Alok November 22, 2019 at 10:49 am #

        Hello Adrian, will it work on a Lenovo laptop with an 8th generation i5 and a 4GB graphics card?

        • Adrian Rosebrock November 22, 2019 at 12:25 pm #

          Yes, but keep in mind that only your CPU will be used, not your GPU as OpenCV’s “dnn” module does not support most GPUs.

  48. Asher March 27, 2019 at 3:41 pm #

    Hello, fantastic articles that are just a wealth of information. Is the download link for the source code still functioning?

    • Adrian Rosebrock April 2, 2019 at 6:39 am #

      Yes, you can use the “Downloads” section of the post to download the source code and pre-trained model.

  49. Gabriella April 4, 2019 at 1:15 pm #

    Hi Adrian, How did you get the fc layers as 4096 in Figure 5? According to the Mask R-CNN paper the fc layers are 1024 from Figure 4 (in their paper).

  50. Dawid April 7, 2019 at 5:00 am #

    Dear Adrian,

    Great post, as always. Based on your posts I have learned a lot about CV, NNs, and Python. I still have a question: I have my own Keras CNN saved as model.h5. I would like to use it to detect features in the pictures, hopefully also with masking. I have converted the Keras model to TensorFlow and also generated the pbtxt file; however, my model does not work because of the error: ‘cv::dnn::experimental_dnn_34_v11::`anonymous-namespace’::addConstNodes’. Is there any other way to use my own CNN to detect features in the images? I have tried dividing the image into blocks that were fed into the CNN, but this approach is rather slow and I would also need to include some more sophisticated algorithms to specify the exact location. I would be very grateful for your answer!

    • Adrian Rosebrock April 12, 2019 at 12:50 pm #

      Could you elaborate a bit more about what you mean by “detect features”? What is the end goal of what you are trying to achieve?

  51. maomao April 8, 2019 at 5:39 am #

    Do you have the code for training? I want to test it on my datasets. Thank you.

  52. Pallawi April 16, 2019 at 3:19 am #

    Hi Adrian,

    I am so much thankful to you for writing, encouraging and motivating so many young talents in the field of Computer Vision and AI.

    Thank you so much, once again.
    Keep writing.
    We love you so much.
    God bless you.

    • Adrian Rosebrock April 18, 2019 at 6:58 am #

      Thank you for the kind words, Pallawi 🙂

  53. Izack April 25, 2019 at 9:56 pm #

    Adrian thank you so much for yet another amazing post!

    • Adrian Rosebrock May 1, 2019 at 12:05 pm #

      Thanks Izack, I’m glad you enjoyed it!

  54. may ashraf April 28, 2019 at 4:14 pm #

    How do I draw contours for the output of the Mask R-CNN?

    • Adrian Rosebrock May 1, 2019 at 11:50 am #

      Take a look at Line 92 where the mask is calculated. You can take that mask and find contours in it.

  55. Ina May 6, 2019 at 6:53 am #

    Hello Adrian,

    thank you for the tutorial. It really is great.

    Can you tell me whether I can also use this program on the Raspberry Pi?

    Thank you 🙂

    • Adrian Rosebrock May 8, 2019 at 1:05 pm #

      No, the RPi is too underpowered to run Mask R-CNN. You would need to combine the Pi with a Movidius NCS or Google Coral USB Accelerator.

  56. Oli May 12, 2019 at 2:53 pm #

    Hi Adrian,

    Thanks for another great tutorial!

    I was wondering how I would go about getting the code to also output coordinates for the four corners of each bounding box? Is that possible?


    • Adrian Rosebrock May 15, 2019 at 2:58 pm #

      What do you mean by “output” the bounding box coordinates?

      • Oli May 17, 2019 at 9:03 am #

        Hi, thanks for your response.

        I am looking to collect data on where each object is located in an image. So, ideally, as well as producing the output image/video, the code will also produce an array containing the pixel coordinates for each bounding box.

        • Adrian Rosebrock May 23, 2019 at 10:12 am #

          Line 82 gives you the (x, y)-coordinates of the box.

  57. Pj May 16, 2019 at 3:38 pm #


    Thanks for this great tutorial.
    I am trying to run this on an Intel Movidius NCS 2 but am getting the following error:

    [INFO] loading Mask R-CNN from disk…
    terminate called after throwing an instance of ‘std::bad_cast’
    what(): std::bad_cast
    Aborted (core dumped)

    It works perfectly with OpenCV but gives an error with OpenVINO’s OpenCV.

    • Adrian Rosebrock May 23, 2019 at 10:19 am #

      OpenVINO’s OpenCV has their own custom implementations. Unfortunately it’s hard to say what the exact issue is there. Have you tried posting the issue on their GitHub?

  58. Akhilesh May 18, 2019 at 4:26 am #

    Hi Adrian

    This is very informative. Actually, I am trying to detect wires of different colors in images. My dataset has images of wires in it, and I want to detect where the wires are and what colors they are. I was trying to use Mask R-CNN; it was able to detect the wires, but it classifies all the wires as the same color.

    Do you know how I can improve my code?

    • Adrian Rosebrock May 23, 2019 at 10:03 am #

      Have you taken a look at Raspberry Pi for Computer Vision? That book will teach you how to train your own Mask R-CNNs. I also provide my best practices, tips, and suggestions.

  59. Med Chrigui May 28, 2019 at 9:45 am #

    Hi Adrian,
    Thank you for this excellent tutorial. I ran the code and it works, but it gives me rectangular shapes, not like the results in the tutorial. The second problem is that when I test with a 5MB image it gives me an error (cv::OutOfMemoryError). All my images contain only one object, which is the body of a person. I would like to use Mask R-CNN to detect the shape of the skin. Can I obtain such a result starting from your tutorial code?
    Thank you in advance.

    • Adrian Rosebrock May 30, 2019 at 9:12 am #

      To avoid the memory error first resize your input image to the network — your machine is running out of memory trying to process the large image.
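
A short sketch of the resize step suggested in this reply, assuming the goal is to cap the longer side of the image while keeping its aspect ratio. `fit_within` is a hypothetical helper and the 1000-pixel limit is an arbitrary choice, not a value from the tutorial.

```python
def fit_within(width, height, max_dim):
    """Return (new_width, new_height) with the longer side capped at max_dim."""
    # Never upscale: the scale factor is at most 1.0.
    scale = min(1.0, max_dim / float(max(width, height)))
    return (int(round(width * scale)), int(round(height * scale)))


# A large photo is shrunk; a small one is left unchanged.
large = fit_within(4000, 3000, 1000)
small = fit_within(640, 480, 1000)

# With OpenCV the resize itself would then be:
#   image = cv2.resize(image, large, interpolation=cv2.INTER_AREA)
```

Resizing before the image is passed to the network keeps memory use bounded regardless of how large the original file is.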

  60. Flávio May 29, 2019 at 4:24 pm #

    I want to plot the image with Matplotlib but I don’t know exactly where in the code to put that.

    • Adrian Rosebrock May 30, 2019 at 9:02 am #

      You mean you want to use matplotlib’s “plt.imshow” function to display the image?

  61. jeff June 9, 2019 at 10:33 am #

    Hi Adrian
    I really appreciate all of your detailed tutorials.

    For reference, I am not very familiar with DNNs.
    On line 113 (source code for images):

    roi = roi [ mask ]

    Q1: Does ‘roi’ have all the pixels that are masked?
    Q2: I want to know the center coordinates of the masked area using an OpenCV function. Is it possible?

    • Adrian Rosebrock June 12, 2019 at 1:45 pm #

      1. The ROI contains the “Region of Interest”. The “mask” variable contains the masked pixels. We use NumPy array indexing to grab only the masked pixels.
      2. Compute the centroid of the mask.
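
A minimal sketch of computing the centroid of a mask, assuming `mask` is a boolean array covering only the bounding box region, so the box's top-left corner is added to get image coordinates. The mask, the corner values, and the `mask_centroid` helper are made up for illustration.

```python
import numpy as np

# Stand-in for the boolean mask inside one bounding box ROI.
mask = np.zeros((6, 6), dtype=bool)
mask[2:5, 1:4] = True
startX, startY = 10, 20  # hypothetical top-left corner of the box


def mask_centroid(mask, offset_x=0, offset_y=0):
    """Mean (x, y) position of the True pixels, shifted into image coordinates."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # an empty mask has no centroid
    return (float(xs.mean()) + offset_x, float(ys.mean()) + offset_y)


cX, cY = mask_centroid(mask, startX, startY)
```

An OpenCV alternative is `cv2.moments` on the 8-bit version of the mask, where the centroid is `m10 / m00` and `m01 / m00`.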

  62. Reed Kelso June 13, 2019 at 10:19 am #

    Hi Adrian,
    Great work! I bought the practitioner package to try and learn more about the process. I can’t find anything about image annotation tools for training my own dataset in the book. I found VGG from Oxford but I’m not sure if that will work with the tools you’ve put together.
    Thanks again for all these great tutorials!

  63. Asal June 17, 2019 at 7:08 pm #

    Hi Adrian,

    In which bundle do you teach how to train a Mask R-CNN on a custom dataset? I have the Starter Bundle of your book and it’s not there.


    • Adrian Rosebrock June 19, 2019 at 2:00 pm #

      The ImageNet Bundle of Deep Learning for Computer Vision with Python contains the Mask R-CNN chapters.

      If you would like to upgrade to the ImageNet Bundle from the Starter Bundle just send me an email and I can get you upgraded!

  64. Sandeep Pokhrel June 24, 2019 at 9:45 am #

    Hi Adrian,

    Can we do object detection in video while retaining the sound of the video?

    • Adrian Rosebrock June 26, 2019 at 1:18 pm #

      I’m not sure what you mean by “retaining the sound”? What do you hope to do with the audio from the video?

  65. Programmer June 24, 2019 at 9:31 pm #

    Thank you, it works great. I had some issues getting started because of the project interpreter, but once I sorted that out it worked exactly as stated. I learnt a lot from this tutorial, thanks again.

  66. Bob July 10, 2019 at 12:02 pm #

    Hi Adrian!

    I am curious if I can combine mask r-cnn with webcam input in real time? Could you please give me any ideas how to achieve this?

    • Adrian Rosebrock July 25, 2019 at 10:20 am #

      A Mask R-CNN, even with a GPU, is not going to run in real-time (you’ll be in the 5-7 FPS range).

  67. WhoAmI July 11, 2019 at 9:13 am #

    Hi Adrian,

    I am a novice in the field of image recognition. I started exploring your blog and ran my first sample today.

    I have two points to mention

    1) Why is the Mask R-CNN not accurate in real-time images? If I have around 5 images of cars, it detects only 3. (The other 2 cars might not be perfectly clear, but they are still clearly visible (60%) to human eyes in the image, and this algorithm is not detecting them.)

    2) Instead of viewing different output files of an image, can’t I view the image segmentation in a single image? (For example, if it detects 2 cars, it pops up a window showing a single car, and after I close it, it reopens and shows me the second car. Is there any chance of viewing them in a single window, preferably on a single image?)

  68. Shamika K October 12, 2019 at 5:06 pm #

    Hi Adrian,

    Just went through this masking tutorial. You really made it easy to understand every step.

    I have one question though: is there any way to extract the black and white resized mask that is present in Figure 6? I am not interested in the actual masking but need the shape of the object for my next steps.

  69. usha November 4, 2019 at 5:01 pm #

    It’s a great post, thanks for explaining each concept clearly. I have a query: I ran the code with the image but I’m not getting the required output. I’m getting only 1 car labelled. This happens with any image I feed in; it is able to detect only one object in the image. I have not made any changes to the code. Thank you.

    • Adrian Rosebrock November 7, 2019 at 10:20 am #

      Click on the window opened by OpenCV and press a key to advance execution of the script. Each detected object is shown one at a time.

  70. Ankit November 10, 2019 at 3:54 am #

    Hello sir,
    This is the most amazing tutorial I have ever seen.
    I want to save the cropped images of the objects detected after segmentation.
    I am done with the rectangular cropping, but I want only that particular object to be saved.


    • Adrian Rosebrock November 14, 2019 at 9:29 am #

      Images can only be rectangular. You cannot save non-rectangular images. Perhaps you instead want to save the image and its alpha mask?
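
      One possible way to do this is sketched below, assuming `roi` is a uint8 BGR crop and `mask` is a boolean array of the same height and width. The helper names `add_alpha_channel` and `fill_background` are hypothetical and do not appear in the tutorial code.

```python
import numpy as np


def add_alpha_channel(roi, mask):
    """Stack an alpha channel onto a BGR crop: opaque inside the mask,
    fully transparent everywhere else."""
    alpha = np.where(mask, 255, 0).astype(np.uint8)
    return np.dstack([roi, alpha])


def fill_background(roi, mask, value=255):
    """Keep masked pixels and paint the rest a flat value
    (255 for white, 0 for black)."""
    return np.where(mask[..., None], roi, value).astype(np.uint8)


# Toy 4x6 gray crop with a 2x3 masked region.
roi = np.full((4, 6, 3), 128, dtype=np.uint8)
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:5] = True

bgra = add_alpha_channel(roi, mask)
white = fill_background(roi, mask, value=255)
print(bgra.shape, white.shape)

# With OpenCV installed, a 4-channel array can be written as a PNG,
# which keeps the transparency:
#   cv2.imwrite("object.png", bgra)
```

      The output file is still rectangular; the alpha channel simply hides every pixel outside the mask. A format without transparency support, such as JPEG, would need the flat-background variant instead.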

      • Ankit Pitroda November 15, 2019 at 4:18 am #

        Yes, sir, I am okay with an image with an alpha mask.

      • Ankit November 16, 2019 at 7:40 am #

        Thanks a lot for the reply, sir.
        I want to save the masked region into a square/rectangular image with a white/black/transparent background.

        Can I have some suggestions from you?

  71. Ankit November 24, 2019 at 2:57 pm #

    Hello Sir,
    I want to detect the floor of the room.
    Is there any technique to do this thing?

    Thank you

  72. Enes December 5, 2019 at 2:49 am #

    Hi Adrian, thank you very much for this tutorial. Your tutorials are very helpful for my DL journey.
    I have a question about Mask R-CNN. I am trying to detect shop signs in street images. Most of the shop signs are rectangular and some of them are rotated. I want to get the coordinates of the corners of the shop signs from the mask matrix. (The ‘roi’ information is not accurate when the shop sign is rotated.) The mask matrix is a boolean matrix whose pixel value is ‘True’ if the pixel is in the mask region. I cannot work out how to find the coordinates of the mask’s corners from this matrix. Can you suggest a solution?

  73. Ankit December 12, 2019 at 2:22 am #

    Hello sir,
    Again, an awesome tutorial.
    My question is:
    Can I set the sequence of the object detection?
    E.g., first it will detect all the chairs, then all the dining tables, then all the wine glasses, and so on?


    • Adrian Rosebrock December 12, 2019 at 10:04 am #

      No, you would do that in your post-processing code. First you obtain all detections from the network. You can then sort them as you see fit.
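
      A small sketch of that post-processing step, assuming each detection has already been collected into a dictionary with a class label and a confidence. The dictionary layout and the function name `sort_detections` are assumptions made for illustration, not the tutorial's data structures.

```python
def sort_detections(detections, class_order):
    """Order detections by a preferred class sequence, most confident first
    within each class. Classes missing from class_order go last."""
    rank = {name: i for i, name in enumerate(class_order)}
    return sorted(
        detections,
        key=lambda d: (rank.get(d["label"], len(rank)), -d["confidence"]),
    )


# Hypothetical detections gathered after the forward pass.
detections = [
    {"label": "wine glass", "confidence": 0.91},
    {"label": "chair", "confidence": 0.80},
    {"label": "dining table", "confidence": 0.95},
    {"label": "chair", "confidence": 0.99},
    {"label": "person", "confidence": 0.70},
]

ordered = sort_detections(detections, ["chair", "dining table", "wine glass"])
for d in ordered:
    print(d["label"], d["confidence"])
```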

      • Ankit PItroda December 17, 2019 at 3:59 pm #

        Thanks man 🙂

  74. Asjad Murtaza January 4, 2020 at 4:18 pm #

    Hi, I have a question that is a little off topic, please guide me.

    Is it possible to do semantic segmentation with Matterport’s implementation of Mask R-CNN?

    • Adrian Rosebrock January 16, 2020 at 10:57 am #

      No, not out of the box. You would need to train the network specifically for semantic segmentation. The pre-trained network only does instance segmentation.
