Instance segmentation with OpenCV

In this tutorial, you will learn how to perform instance segmentation with OpenCV, Python, and Deep Learning.

Back in September, I saw Microsoft release a really neat feature to their Office 365 platform — the ability to be on a video conference call, blur the background, and have your colleagues only see you (and not whatever is behind you).

The GIF at the top of this post demonstrates a similar feature that I have implemented for the purposes of today’s tutorial.

Whether you’re taking the call from a hotel room, working from a downright ugly office building, or simply don’t want to clean up around the home office, the conference call blurring feature can keep the meeting attendees focused on you (and not the mess in the background).

Such a feature would be especially helpful for people working from home and wanting to preserve the privacy of their family members.

Imagine your workstation being in clear view of your kitchen — you wouldn’t want your colleagues watching your kids eating dinner or doing their homework! Instead, just pop on the blurring feature and you’re all set.

In order to build such a feature, Microsoft leveraged computer vision, deep learning, and most notably, instance segmentation.

We covered Mask R-CNNs for instance segmentation in last week’s blog post — today we are going to take our Mask R-CNN implementation and use it to build a Microsoft Office 365-like video blurring feature.

To learn how to perform instance segmentation with OpenCV, just keep reading!


Instance segmentation with OpenCV

Today’s tutorial is inspired by both (1) Microsoft’s Office 365 video call blurring feature and (2) PyImageSearch reader Zubair Ahmed. Zubair implemented a similar blurring feature using Google’s DeepLab (you can find his implementation on his blog).

Since we covered instance segmentation in last week’s blog post, I thought it was the perfect time to demonstrate how we can mimic the call blurring feature using OpenCV.

In the first part of this tutorial, we’ll briefly cover instance segmentation. From there we’ll use instance segmentation and OpenCV to:

  1. Detect and segment the user from the video stream
  2. Blur the background
  3. And then add the user back to the stream itself.

From there we’ll look at the results of our OpenCV instance segmentation algorithm, including some of the limitations and drawbacks.

What is instance segmentation?

Figure 1: The difference between object detection and instance segmentation. For object detection (left), a box is drawn around the individual objects. In the case of instance segmentation (right), an attempt is made to determine which pixels belong to each object. (source)

Explaining instance segmentation is best done with a visual example — refer to Figure 1 above where we have an example of object detection on the left and instance segmentation on the right.

Looking at these two examples we can clearly see a difference between the two.

When performing object detection we are:

  1. Computing the bounding box (x, y)-coordinates for each object
  2. And then associating a class label with each bounding box as well.

The problem is that object detection tells us nothing regarding the shape of the object itself — all we have is a set of bounding box coordinates. Instance segmentation, on the other hand, computes a pixel-wise mask for each object in the image.

Even if the objects are of the same class label, such as the two dogs in the above image, our instance segmentation algorithm still reports a total of three unique objects: two dogs and one cat.

Using instance segmentation we now have a more granular understanding of the object in the image — we know specifically which (x, y)-coordinates the object exists in.

Furthermore, by using instance segmentation we can easily segment our foreground objects from the background.
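To make the difference between a bounding box and a pixel-wise mask concrete, here is a small toy sketch. It is my own illustration on a made-up 6 x 8 array, not output from the Mask R-CNN: the box can always be derived from the mask, but the box alone cannot tell you which pixels inside it belong to the object.

```python
import numpy as np

# A toy 6x8 "image" containing one L-shaped object, stored as a boolean mask.
# True marks the pixels that belong to the object.
mask = np.zeros((6, 8), dtype=bool)
mask[1:5, 2] = True   # vertical stroke of the L
mask[4, 2:6] = True   # horizontal stroke of the L

# The bounding box can always be recovered from the mask...
ys, xs = np.nonzero(mask)
box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)

# ...but the box covers every pixel in the rectangle, while the mask
# covers only the pixels of the object itself.
box_area = (box[2] - box[0]) * (box[3] - box[1])
mask_area = int(mask.sum())

print("box (startX, startY, endX, endY):", box)
print("pixels inside the box:", box_area)
print("pixels inside the mask:", mask_area)
```

The gap between the two pixel counts is exactly the background that a box-only detector cannot separate from the object.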

We’ll be using a Mask R-CNN for instance segmentation in this post.

For a more detailed review of instance segmentation, including comparing and contrasting image classification, object detection, semantic segmentation, and instance segmentation, please refer to last week’s blog post.

Project structure

You can grab the source code and trained Mask R-CNN model from the “Downloads” section of today’s post.

Once you’ve extracted the archive and navigated into it, use the tree command to view the directory structure in your terminal:

Our project includes one directory (consisting of three files) and one Python script:

  • mask-rcnn-coco/ : The Mask R-CNN model directory contains three files:
    • frozen_inference_graph.pb : The Mask R-CNN model weights. The weights are pre-trained on the COCO dataset.
    • mask_rcnn_inception_v2_coco_2018_01_28.pbtxt : The Mask R-CNN model configuration. If you’d like to build + train your own model on your own annotated data, refer to Deep Learning for Computer Vision with Python.
    • object_detection_classes_coco.txt : All 90 classes are listed in this text file, one per line. Open it in a text editor to see what objects our model can recognize.
  • The Python script : We’ll be reviewing this background blur script today. Then we’ll put it to use and evaluate the results.

Implementing instance segmentation with OpenCV

Let’s get started implementing instance segmentation with OpenCV.

Open up the script file and insert the following code:

We’ll start off the script by importing our necessary packages. You need the following installed in your environment (virtual environments are highly recommended):

  • OpenCV 3.4.2+ : If you don’t have OpenCV installed, head over to my installation tutorials page. The fastest method for installing on most systems is via pip, which will install OpenCV 3.4.3 at the time of this writing.
  • imutils : This is my personal package of computer vision convenience functions. You may install imutils via: pip install --upgrade imutils

Again, I highly recommend that you place this software in an isolated virtual environment, as your other projects may need different versions.

Let’s parse our command line arguments:

Descriptions of each command line argument can be found below:

  • --mask-rcnn : The base path to the Mask R-CNN directory. We reviewed the three files in this directory in the “Project structure” section above.
  • --confidence : The minimum probability used to filter out weak detections. The default is 0.5, but you can easily pass a different value via the command line.
  • --threshold : Our minimum threshold for the pixel-wise mask segmentation. The default is 0.3.
  • --kernel : The size of the Gaussian blur kernel. I found that a 41 x 41 kernel looks pretty good, so the default is 41.

For a review on how command line arguments work, be sure to read this guide.
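As a rough sketch, the four arguments described above could be parsed with argparse along these lines. The long flag names and defaults come from the descriptions above; the short flags (-m, -c, -t, -k), the required=True setting, and the parse_arguments helper name are my assumptions for illustration.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the four command line arguments described in the text."""
    ap = argparse.ArgumentParser(
        description="Blur the background behind the detected person")
    # Assumption: the short flags and required=True are illustrative only.
    ap.add_argument("-m", "--mask-rcnn", required=True,
                    help="base path to the Mask R-CNN directory")
    ap.add_argument("-c", "--confidence", type=float, default=0.5,
                    help="minimum probability to filter out weak detections")
    ap.add_argument("-t", "--threshold", type=float, default=0.3,
                    help="minimum threshold for the pixel-wise mask")
    ap.add_argument("-k", "--kernel", type=int, default=41,
                    help="size of the Gaussian blur kernel")
    # argparse turns "--mask-rcnn" into the dictionary key "mask_rcnn".
    return vars(ap.parse_args(argv))


# Passing an explicit list stands in for what the shell would supply.
args = parse_arguments(["--mask-rcnn", "mask-rcnn-coco"])
print(args)
```

Calling parse_arguments() with no list makes argparse read sys.argv instead, which is what you would want in the actual script.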

Let’s load our labels and our OpenCV instance segmentation model:

Our labels file needs to be located in the mask-rcnn-coco/ directory, the directory specified via command line argument. Lines 23 and 24 build the labelsPath and then Line 25 reads the LABELS into a list.

The same goes for our weightsPath and configPath, which are built on Lines 28-31.

Using these two paths, we take advantage of the dnn module to initialize the neural net (Line 36). This call loads the Mask R-CNN into memory before we start processing frames (we only need to load it once).
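Here is a minimal sketch of that loading step. The three file names are the ones listed in the “Project structure” section; the helper function names are hypothetical, and the line numbers cited above refer to the original listing rather than to this sketch. The network itself is loaded with cv2.dnn.readNetFromTensorflow, which takes the weights file followed by the configuration file.

```python
import os


def build_model_paths(base_dir):
    """Return (labels_path, weights_path, config_path) inside base_dir."""
    labels_path = os.path.join(base_dir, "object_detection_classes_coco.txt")
    weights_path = os.path.join(base_dir, "frozen_inference_graph.pb")
    config_path = os.path.join(
        base_dir, "mask_rcnn_inception_v2_coco_2018_01_28.pbtxt")
    return labels_path, weights_path, config_path


def load_labels(labels_path):
    """Read the class labels, one per line, into a list."""
    with open(labels_path) as f:
        return f.read().strip().split("\n")


def load_network(weights_path, config_path):
    """Load the Mask R-CNN from disk (only needs to happen once)."""
    # Imported here so the path helpers above work without OpenCV installed.
    import cv2
    return cv2.dnn.readNetFromTensorflow(weights_path, config_path)


labels_path, weights_path, config_path = build_model_paths("mask-rcnn-coco")
print(labels_path)
print(weights_path)
print(config_path)
```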

Let’s construct our blur kernel and start our webcam video stream:

The blur kernel tuple is defined on Line 40.
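A small sketch of how that tuple could be built from the --kernel value. The make_blur_kernel helper and its validation are my additions, not part of the original script; the check reflects that OpenCV’s GaussianBlur expects positive, odd kernel dimensions when an explicit size is given.

```python
def make_blur_kernel(size):
    """Return a (size, size) tuple usable as a Gaussian blur kernel size."""
    # An explicit Gaussian kernel size must be positive and odd.
    if size <= 0 or size % 2 == 0:
        raise ValueError("kernel size must be a positive odd integer")
    return (size, size)


K = make_blur_kernel(41)
print(K)
```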

Our project has two modes: “normal mode” and “privacy mode”. Thus, a privacy boolean is used for the mode logic. It is initialized to False on Line 41.

Our webcam video stream is started on Line 45 where we pause for two seconds to allow the sensor to warm up (Line 46).

Now that all of our variables and objects are initialized, let’s start processing frames from the webcam:

Our frame processing loop begins on Line 49.

At each iteration, we’ll grab a frame (Line 51) and resize it to a known width, maintaining aspect ratio (Line 56).

For scaling purposes later, we go ahead and extract the dimensions of the frame (Line 57).

Then, we construct a blob and complete a forward pass through the network (Lines 63-66). You can read more about how this process works in this previous blog post.

The result is both boxes and masks. We’ll be taking advantage of the masks, but we also need to use the data contained in boxes.
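The forward pass can be sketched roughly as follows. I am assuming the blob is built beforehand with cv2.dnn.blobFromImage(frame, swapRB=True, crop=False), and that the two output layers are named detection_out_final and detection_masks, as they are for the TensorFlow Mask R-CNN models commonly used with OpenCV’s dnn module. The run_mask_rcnn name is hypothetical, and the array shapes in the comments are my understanding rather than something stated above.

```python
def run_mask_rcnn(net, blob):
    """Run one forward pass and return (boxes, masks).

    Assumed output layout:
      boxes: shape (1, 1, N, 7); each row holds
             [image id, class ID, confidence, startX, startY, endX, endY],
             with the coordinates normalized to the range [0, 1].
      masks: shape (N, number of classes, mask height, mask width); one
             low-resolution mask per detection and per class.
    """
    net.setInput(blob)
    boxes, masks = net.forward(["detection_out_final", "detection_masks"])
    return boxes, masks
```

Here net is the object returned by cv2.dnn.readNetFromTensorflow, so this function needs the loaded model to do anything useful.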

Let’s sort the indexes and initialize variables:

Line 70 sorts the indexes of the bounding boxes by their corresponding prediction probability. We’ll be making the assumption that the person with the largest corresponding detection probability is our user.

We then initialize the mask, roi, and bounding box coords (Lines 74-76).
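A quick sketch of the sort on synthetic data. I am assuming the detection confidence is stored at index 2 of each row of boxes; the three confidences below are made up purely for illustration.

```python
import numpy as np

# Three fake detections; only the confidence column (index 2) is filled in.
boxes = np.zeros((1, 1, 3, 7), dtype="float32")
boxes[0, 0, :, 2] = [0.40, 0.95, 0.70]

# argsort is ascending, so reverse it to get the most confident first.
idxs = np.argsort(boxes[0, 0, :, 2])[::-1]

# Nothing has been found yet, so these start out empty.
mask = None
roi = None
coords = None

print(idxs)
```

Detection 1 (confidence 0.95) comes first, followed by detection 2 and then detection 0.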

Let’s loop over the indexes and filter the results:

We begin looping over the idxs on Line 79.

We then extract the classID and confidence using boxes and the current index (Lines 83 and 84).

Subsequently, we’ll perform our first filter: we only care about the "person" class. If any other object class is encountered, we’ll continue to the next index (Lines 87 and 88).

Our next filter ensures the confidence of the prediction exceeds the minimum set via the --confidence command line argument (Line 92).

If we pass that test, then we’ll scale the bounding box coordinates back to the relative dimensions of the image (Line 96). We then extract the coords and object width/height (Lines 97-100).
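The two filters and the box scaling can be sketched together like this. The find_user helper, the three-entry label list, and the fake detections are all hypothetical; I am also assuming each row of boxes is laid out as [image id, class ID, confidence, startX, startY, endX, endY] with normalized coordinates.

```python
import numpy as np


def find_user(boxes, labels, min_confidence, W, H):
    """Return (index, class_id, (startX, startY, endX, endY)) for the most
    confident "person" detection, or None if no detection qualifies."""
    idxs = np.argsort(boxes[0, 0, :, 2])[::-1]
    for i in idxs:
        class_id = int(boxes[0, 0, i, 1])
        confidence = float(boxes[0, 0, i, 2])
        # First filter: ignore everything that is not a person.
        if labels[class_id] != "person":
            continue
        # Second filter: ignore weak detections.
        if confidence <= min_confidence:
            continue
        # Scale the normalized box back to pixel coordinates of the frame.
        box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
        start_x, start_y, end_x, end_y = box.astype("int")
        return int(i), class_id, (int(start_x), int(start_y),
                                  int(end_x), int(end_y))
    return None


labels = ["person", "bicycle", "dog"]  # hypothetical, shortened label list
boxes = np.zeros((1, 1, 3, 7), dtype="float32")
boxes[0, 0, 0] = [0, 2, 0.99, 0.0, 0.0, 0.5, 0.5]    # a dog, very confident
boxes[0, 0, 1] = [0, 0, 0.80, 0.25, 0.5, 0.75, 1.0]  # a person
boxes[0, 0, 2] = [0, 0, 0.30, 0.0, 0.0, 1.0, 1.0]    # a weak person

result = find_user(boxes, labels, min_confidence=0.5, W=400, H=200)
print(result)
```

The width and height of the box follow directly from the returned coordinates (endX - startX and endY - startY).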

Let’s compute our mask and extract the ROI:

Lines 106-109 extract the mask, resize it, and apply the threshold to create the binary mask itself. An example mask is shown in Figure 2:

Figure 2: The binary mask of me in front of my webcam, computed via instance segmentation with OpenCV. Computing the mask is part of the privacy filter pipeline.

In Figure 2 above all white pixels are assumed to be a person (i.e., the foreground) while all black pixels are the background.

With the mask, we’ll also compute the roi (Line 115) via NumPy array slicing.

We then break from the loop on Line 116 (since we have found the "person" with the largest probability).
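Here is a minimal sketch of the thresholding and ROI extraction on a tiny synthetic frame. It assumes the low-resolution mask has already been resized to the width and height of the bounding box (for example with cv2.resize), so only the NumPy part is shown; the binarize_and_extract name is hypothetical.

```python
import numpy as np


def binarize_and_extract(frame, mask, coords, threshold):
    """Threshold a box-sized probability mask and pull out the person pixels.

    mask must already have the same height and width as the bounding box.
    """
    start_x, start_y, end_x, end_y = coords
    # True where the network believes the pixel belongs to the person.
    binary = mask > threshold
    # Slice out the box, then keep only the pixels selected by the mask.
    roi = frame[start_y:end_y, start_x:end_x][binary]
    return binary, roi


frame = np.arange(4 * 4 * 3, dtype="uint8").reshape(4, 4, 3)
coords = (1, 1, 3, 3)  # a 2x2 box in the middle of the frame
mask = np.array([[0.9, 0.1],
                 [0.2, 0.6]], dtype="float32")

binary, roi = binarize_and_extract(frame, mask, coords, threshold=0.3)
print(binary)
print(roi.shape)
```

Note that roi is a flat list of pixels (one row per True entry in the mask), not a rectangular image, which is why the same mask is needed again later to put those pixels back.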

Let’s initialize our output frame and compute our blur if we are in “privacy mode”:

Our output frame is simply a copy of the original frame (Line 119).

If we both:

  1. Have a mask that is not empty
  2. And we are in “privacy mode”…

…then we’ll blur the frame (using our kernel) and use the mask to copy the unblurred person pixels back onto the output frame (Lines 123-129).
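The compositing step can be sketched as follows. To keep the example runnable without OpenCV, the blurred frame is passed in as an argument; in the real pipeline it would come from something like cv2.GaussianBlur(frame, K, 0). The apply_privacy_filter name and the toy arrays are my own.

```python
import numpy as np


def apply_privacy_filter(blurred, mask, roi, coords):
    """Paste the sharp person pixels (roi) onto a blurred copy of the frame."""
    start_x, start_y, end_x, end_y = coords
    output = blurred.copy()
    # The slice is a view into output, so assigning through the boolean
    # mask writes the sharp pixels directly into the output frame.
    output[start_y:end_y, start_x:end_x][mask] = roi
    return output


frame = np.full((4, 4, 3), 200, dtype="uint8")  # stand-in for the sharp frame
blurred = np.zeros((4, 4, 3), dtype="uint8")    # stand-in for the blurred frame
coords = (1, 1, 3, 3)
mask = np.array([[True, False],
                 [False, True]])
roi = frame[1:3, 1:3][mask]

output = apply_privacy_filter(blurred, mask, roi, coords)
print(output[:, :, 0])
```

Only the two masked pixels keep their original value; everything else comes from the blurred frame.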

Now let’s display the output image and handle keypresses:

Our output frame is displayed via Line 132.

Keypresses are captured (Line 133). Two keys cause different behaviors (Lines 136-141):

  • "p" : When this key is pressed, “privacy mode” is toggled either on or off.
  • "q" : If this key is pressed, we’ll break out of the loop and “quit” the script.

Whenever we do quit, Lines 144 and 145 close the open window and stop the video stream.
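The key handling logic boils down to something like the sketch below. The handle_key function is hypothetical; I am assuming the key code is obtained in the usual OpenCV way, via cv2.waitKey(1) & 0xFF.

```python
def handle_key(key, privacy):
    """Return (privacy, should_quit) after processing one key code."""
    if key == ord("p"):
        # Toggle privacy mode on or off.
        return (not privacy, False)
    if key == ord("q"):
        # Ask the caller to break out of the frame loop.
        return (privacy, True)
    # Any other key (or no key at all) changes nothing.
    return (privacy, False)


privacy = False
privacy, should_quit = handle_key(ord("p"), privacy)
print(privacy, should_quit)
```

Inside the frame loop you would call this once per iteration and break when should_quit is True.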

Instance segmentation results

Now that we’ve implemented our OpenCV instance segmentation algorithm, let’s see it in action!

Be sure to use the “Downloads” section of this blog post to download the code and Mask R-CNN model.

From there, open up a terminal and execute the following command:

Figure 3: My demonstration of a “privacy filter” for web chatting. I’ve used OpenCV and Python to perform instance segmentation to find the prominent person (me), and then applied blurring to the background.

Here you can see a short GIF of me demoing our instance segmentation pipeline.

In this image, I am meant to be the “conference call attendee”. Trisha, my wife, is working in the background.

By enabling “privacy mode” I can:

  1. Use OpenCV instance segmentation to find the person detection with the largest corresponding probability (most likely that will be the person closest to the camera).
  2. Blur the background of the video stream.
  3. Overlay the segmented, non-blurry person back onto the video stream.

I have included a video demo, including my commentary, below:

You’ll immediately notice that we are not obtaining true real-time performance though — we’re only processing a few frames per second. Why is this?

How come our OpenCV instance segmentation pipeline isn’t faster?

To answer those questions, be sure to refer to the section below.

Limitations, drawbacks, and potential improvements

The first limitation is the most obvious one — our OpenCV instance segmentation implementation is too slow to run in real-time.

On my Intel Xeon W we’re only processing a few frames per second.

In order to obtain true real-time instance segmentation performance, we would need to leverage our GPU.

But therein lies the problem:

OpenCV’s GPU support for its dnn  module is fairly limited.

Currently, it mainly supports Intel GPUs.

NVIDIA CUDA GPU support is in development, but is currently not available.

Once OpenCV officially supports NVIDIA GPUs for the dnn  module we’ll be more easily able to build real-time (and even super real-time) deep learning applications.

But for now, this OpenCV instance segmentation tutorial serves as an educational demo of:

  1. What’s currently possible
  2. And what will be possible in a few months

Another improvement we can make is related to the overlaying of the segmented person back on the blurred background.

When you compare our implementation to Microsoft’s Office 365 video blurring feature, you’ll see that Microsoft’s is much more “smooth”.

We can mimic this feature by utilizing a bit of alpha blending.

A simple yet effective update to our instance segmentation pipeline would be to potentially:

  1. Use morphological operations to increase the size of our mask
  2. Apply a small amount of Gaussian blurring to the mask itself, helping smooth the mask
  3. Scale the mask values to the range [0, 1]
  4. Create an alpha layer using the scaled mask
  5. Overlay the smoothed mask + person ROI on the blurred background

Alternatively, you could compute the contours of the mask itself and then apply contour approximation to help create a “more smoothed” mask.

Please note that I have not tried this algorithm — it’s just something I thought of off the top of my head that I thought could give visually pleasing results.

If you wish to implement this instance segmentation update I would suggest reading this post where I discuss alpha blending in more detail.
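For what it’s worth, the final blending step of that idea might look something like the sketch below. Like the algorithm itself, this is untested on real video. It assumes you already have a soft, single-channel uint8 mask (for example, the binary mask after cv2.dilate and cv2.GaussianBlur) along with the sharp and blurred frames at the same size; the alpha_blend name and the toy values are made up for illustration.

```python
import numpy as np


def alpha_blend(foreground, background, soft_mask):
    """Blend foreground over background using a soft mask in [0, 255]."""
    # Scale the mask to [0, 1] and add a channel axis so it broadcasts
    # across the three color channels.
    alpha = soft_mask.astype("float32") / 255.0
    alpha = alpha[..., np.newaxis]
    blended = (alpha * foreground.astype("float32")
               + (1.0 - alpha) * background.astype("float32"))
    # Round before converting back to 8-bit to avoid truncation error.
    return np.clip(np.rint(blended), 0, 255).astype("uint8")


foreground = np.full((2, 2, 3), 200, dtype="uint8")  # sharp person pixels
background = np.full((2, 2, 3), 100, dtype="uint8")  # blurred background
soft_mask = np.array([[255, 0],
                      [51, 255]], dtype="uint8")

blended = alpha_blend(foreground, background, soft_mask)
print(blended[:, :, 0])
```

Pixels where the mask is fully on take the foreground value, pixels where it is off take the background value, and everything in between is a weighted mix, which is what softens the edge around the person.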


In today’s blog post you learned how to perform instance segmentation using OpenCV, Deep Learning, and Python.

Instance segmentation is the process of:

  1. Detecting each object in an image
  2. Computing a pixel-wise mask for each object

Even if objects are of the same class, an instance segmentation should return a unique mask for each object.

In order to apply instance segmentation with OpenCV, we used our Mask R-CNN implementation from last week.

We then used our Mask R-CNN model to build a “video conference call blurring feature”, similar to the feature Microsoft released for Office 365 back in the summer.

Our instance segmentation results were similar to Microsoft’s feature; however, we could not obtain true real-time performance since OpenCV’s GPU support for the dnn  module is currently quite limited.

Therefore, today’s tutorial serves as a demo, highlighting what is currently possible and what will be possible when OpenCV’s GPU support increases.

I hope you enjoyed today’s tutorial!

To download the source code to this post, and be notified when future tutorials are published here on PyImageSearch, just enter your email address in the form below!



22 Responses to Instance segmentation with OpenCV

  1. Zubair Ahmed November 26, 2018 at 10:18 am #

    Wonderful post as always and thanks for the mention 🙂

  2. Arthur Zhang November 27, 2018 at 2:37 am #

    Really practical course!

    • Adrian Rosebrock November 30, 2018 at 9:34 am #

      Thanks so much, Arthur!

  3. Wilf December 6, 2018 at 9:22 am #

    This was terrific!!

    Question: what is the frame processing speed on your computer.
    My laptop does not have a GPU so my processing times are V-E-R-Y slow.
    (I read in a video clip and stored the “private” blurred frames to disk to better enjoy the background blurring)

    • Adrian Rosebrock December 6, 2018 at 9:25 am #

      My CPU is only processing a few frames per second. For true real-time performance using this method you would need a GPU, and OpenCV’s GPU support for the dnn module is currently a bit limited.

  4. Angelo December 8, 2018 at 5:22 pm #

    Too slow to run into raspberry pi, thanks for the info

  5. Muhammad Bilal January 11, 2019 at 6:51 am #

    hello, Adrian !
    an amazingly useful write, like always.
    Can you please guide me, I want to run image segmentation on Raspberry Pi 3B+
    1. If i train a custom Caffe for different terrains (i.e: grass, Roads, Rocky, water/wet, and different shades of sky)
    2. Question: if i reduce the Classes (to just 2 or 3) would i be able to achieve at-least 2-3 Fps on my Raspberry ?

    Thanks in advance <3
    -big fan to you.

    • Adrian Rosebrock January 11, 2019 at 9:24 am #

      The Raspberry Pi will be far, far too slow to run a Mask R-CNN network. You will not be able to get 2-3 FPS for instance segmentation on a Pi, it’s just too slow.

  6. Sourabh January 29, 2019 at 6:12 am #

    Amazing work ! Thank you

    • Adrian Rosebrock January 29, 2019 at 6:26 am #

      Thanks Sourabh!

  7. PRASHANT BANSOD March 19, 2019 at 6:11 am #

    Hi Adrian, thanks for the great tutorial. I would like to know whether I can use this for extracting human silhouette extraction or there is a better approach to tackle it. Thanks

    • Adrian Rosebrock March 19, 2019 at 9:51 am #

      Yes, instance segmentation is the suggested technique to obtain a pixel-wise mask of a person.

  8. santanu July 5, 2019 at 3:17 pm #

    Which one is best for instance segmentation(mask rcnn, segnet and deeplab) ??

    • Adrian Rosebrock July 10, 2019 at 9:57 am #

      There isn’t one “best” network for instance segmentation. It’s dependent on your dataset, your project requirements, and any computational limitations on the machine you’re either training or deploying to. You need to balance all of these when selecting an architecture.

  9. avantika September 4, 2019 at 3:36 am #

    why did you use Mask R-CNN for this video blurring effect over YOLO or SSD ? They also use deep learning

    • Adrian Rosebrock September 5, 2019 at 10:24 am #

      Mask R-CNN is an instance segmentation algorithm. It gives you a pixel-wise mask. YOLO and SSDs are object detectors. They only produce bounding boxes.

  10. Tony November 9, 2019 at 9:33 pm #

    How would you replace the video feed from your camera with a prerecorded video?

    • Adrian Rosebrock November 14, 2019 at 9:30 am #

      You would use the cv2.VideoCapture function and pass it in the path to the video file. If you’ve never done that before you can refer to Practical Python and OpenCV which will teach you how to do exactly that.
