Sliding Windows for Object Detection with Python and OpenCV


So in last week’s blog post we discovered how to construct an image pyramid.

And in today’s article we are going to extend that example and introduce the concept of a sliding window. Sliding windows play an integral role in object classification, as they allow us to localize exactly “where” in an image an object resides.

Utilizing both a sliding window and an image pyramid we are able to detect objects in images at various scales and locations.

In fact, sliding windows and image pyramids are both used in my 6-step HOG + Linear SVM object classification framework!

To learn more about the role sliding windows play in object classification and image classification, read on. By the time you are done reading this blog post, you’ll have an excellent understanding of how image pyramids and sliding windows are used for classification.


OpenCV and Python versions:
This example will run on Python 2.7/Python 3.4+ and OpenCV 2.4.X/OpenCV 3.0+.

What is a sliding window?

In the context of computer vision (and as the name suggests), a sliding window is a rectangular region of fixed width and height that “slides” across an image, such as in the following figure:

Figure 1: Example of the sliding window approach, where we slide a window from left-to-right and top-to-bottom.

For each of these windows, we would normally take the window region and apply an image classifier to determine if the window has an object that interests us — in this case, a face.

Combined with image pyramids we can create image classifiers that can recognize objects at varying scales and locations in the image.

These techniques, while simple, play an absolutely critical role in object detection and image classification.

Sliding Windows for Object Detection with Python and OpenCV

Let’s go ahead and build on your image pyramid example from last week.

Remember the  file? Open it back up and insert the sliding_window  function:

The sliding_window function requires three arguments. The first is the image that we are going to loop over. The second argument is the stepSize.

The stepSize indicates how many pixels we are going to “skip” in both the x and y directions. Normally, we would not want to loop over each and every pixel of the image (i.e. stepSize=1), as this would be computationally prohibitive if we were applying an image classifier at each window.

Instead, the stepSize is determined on a per-dataset basis and is tuned to give optimal performance based on your dataset of images. In practice, it’s common to use a stepSize of 4 to 8 pixels. Remember, the smaller your step size is, the more windows you’ll need to examine.
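
To make that trade-off concrete, here is a small back-of-the-envelope calculation. It is only an illustration: the 400×600 image size and the count_windows helper are assumptions, and the count simply reflects two loops that start at 0 and advance by the step size.

```python
def count_windows(height, width, step_size):
    # one window per (x, y) start position; each axis has
    # ceil(size / step_size) stops, written here with integer arithmetic
    stops_y = (height + step_size - 1) // step_size
    stops_x = (width + step_size - 1) // step_size
    return stops_y * stops_x

# hypothetical image with 400 rows and 600 columns
for step in (1, 4, 8, 32):
    print(step, count_windows(400, 600, step))
```

Going from a step of 8 pixels down to 1 multiplies the number of windows (and classifier calls) by 64 for this image.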

The last argument, windowSize, defines the width and height (in pixels) of the window we are going to extract from our image.

Lines 24-27 are fairly straightforward and handle the actual “sliding” of the window.

Lines 24-26 define two for loops that loop over the (x, y) coordinates of the image, incrementing their respective x and y counters by the provided step size.

Then, Line 27 returns a tuple containing the x and y coordinates of the sliding window, along with the window itself.
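
A minimal sketch of what such a function could look like, pieced together from the description above. It is written as a generator (yield rather than return), which is an assumption, and its line numbers will not match the ones referenced in the text. The blank NumPy array stands in for a real image.

```python
import numpy as np

def sliding_window(image, stepSize, windowSize):
    # windowSize is (width, height); the image is indexed as [row, column]
    for y in range(0, image.shape[0], stepSize):
        for x in range(0, image.shape[1], stepSize):
            # hand back the top-left corner and the cropped region
            yield (x, y, image[y:y + windowSize[1], x:x + windowSize[0]])

# quick check on a blank stand-in image with 100 rows and 200 columns
image = np.zeros((100, 200, 3), dtype="uint8")
windows = list(sliding_window(image, stepSize=32, windowSize=(64, 64)))
print(len(windows))         # number of window positions
print(windows[0][2].shape)  # shape of the first crop
```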

To see the sliding window in action, we’ll have to write a driver script for it. Create a new file, name it , and we’ll finish up this example:

On Lines 2-6 we import our necessary packages. We’ll use our pyramid function from last week to construct our image pyramid. We’ll also use the sliding_window function we just defined. Finally we import argparse for parsing command line arguments and cv2 for our OpenCV bindings.

Lines 9-12 handle parsing our command line arguments. We only need a single switch here, --image, the image that we want to process.

From there, Line 14 loads our image off disk and Line 15 defines our window width and height to be 128 pixels each.

Now, let’s go ahead and combine our image pyramid and sliding window:

We start by looping over each layer of the image pyramid on Line 18.

For each layer of the image pyramid, we’ll also loop over each window yielded by sliding_window on Line 20. We also make a check on Lines 22-23 to ensure that our sliding window has met the minimum size requirements.

If we were applying an image classifier to detect objects, we would do this on Lines 25-27 by extracting features from the window and passing them on to our classifier (which is done in our 6-step HOG + Linear SVM object detection framework).

But since we do not have an image classifier, we’ll just visualize the sliding window results instead by drawing a rectangle on the image indicating where the sliding window is on Lines 30-34.
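
The nested loop and the size check described above can be sketched as follows. This is not the original driver script: the pyramid function here is a crude stand-in that keeps every other pixel (last week's pyramid function would take its place), the blank array replaces the image loaded from disk, the step size of 32 is arbitrary, and the drawing code is left out so the sketch only needs NumPy.

```python
import numpy as np

def sliding_window(image, stepSize, windowSize):
    for y in range(0, image.shape[0], stepSize):
        for x in range(0, image.shape[1], stepSize):
            yield (x, y, image[y:y + windowSize[1], x:x + windowSize[0]])

def pyramid(image, minSize=(128, 128)):
    # stand-in only: halve the image by dropping every other pixel
    # until it becomes smaller than minSize (width, height)
    yield image
    while True:
        image = image[::2, ::2]
        if image.shape[0] < minSize[1] or image.shape[1] < minSize[0]:
            break
        yield image

(winW, winH) = (128, 128)
image = np.zeros((512, 768, 3), dtype="uint8")  # placeholder image

kept = []
for layer, resized in enumerate(pyramid(image, minSize=(winW, winH))):
    for (x, y, window) in sliding_window(resized, stepSize=32,
                                         windowSize=(winW, winH)):
        # windows that run off the right or bottom edge come back
        # smaller than requested, so skip them
        if window.shape[0] != winH or window.shape[1] != winW:
            continue
        # a classifier would be applied to `window` at this point
        kept.append((layer, x, y))

print(len(kept))
```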


To see our image pyramid and sliding window in action, open up a terminal and execute the following command:

If all goes well you should see the following results:

Figure 2: An example of applying a sliding window to each layer of the image pyramid.


Here you can see that for each of the layers in the pyramid a window is “slid” across it. And again, if we had an image classifier ready to go, we could take each of these windows and classify the contents of the window. An example could be “does this window contain a face or not?”

Here’s another example with a different image:

Figure 3: A second example of applying a sliding window to each layer of the image pyramid.


Once again, we can see that the sliding window is slid across the image at each level of the pyramid. Higher levels of the pyramid (and thus smaller layers) have fewer windows that need to be examined.


In this blog post we learned all about sliding windows and their application to object detection and image classification.

By combining a sliding window with an image pyramid we are able to localize and detect objects in images at multiple scales and locations.

While both sliding windows and image pyramids are very simple techniques, they are absolutely critical in object detection.

You can learn more about the more global role they play in this blog post, where I detail my framework on how to use the Histogram of Oriented Gradients image descriptor and a Linear SVM classifier to build a custom object detector.




53 Responses to Sliding Windows for Object Detection with Python and OpenCV

  1. joe May 11, 2015 at 3:16 pm #

    hey Adrian, wonderful article. Just wondering about when you say “Remember, the larger your step size is, the more windows you’ll need to examine.” . Shouldn’t this be “the smaller the stepsize, the more windows”?
    Maybe i misunderstood something, but it looks to me as if each sliding window would move of pixels, so – as you say a few lines above that comment – having a stepSize=1 makes it prohibitive.
    Thanks for the article

    • Adrian Rosebrock May 11, 2015 at 4:53 pm #

      Hey Joe, you’re absolutely right. Thanks for catching that typo. I have updated it now. Thanks again!

  2. Rish May 14, 2015 at 12:38 am #

    Hi Adrian,

    I have had some discussions with you in other topic threads. Your tutorials has helped me create a object detector though in C++ with ease.

    I am new to this object recognition field. I was wondering other than sliding window for object search in the image space, what other methods are there. One of the biggest issue for me in Sliding Window is that incrementing the sliding window by small pixel margin gives the best results (say about 50 – 75% overlap to the previous window). In a normal image frame this is quite exhaustive search.

    I am just curious if there are other better or faster method for object search?

    • Adrian Rosebrock May 14, 2015 at 6:41 am #

      There are indeed alternatives to sliding windows, but the sliding window is pretty much the “default”. Take a look at the comments of this post to see a discussion of some faster variants of the standard sliding window.

      However, I will say that the exhaustive image search is actually a good thing. If our classifier is working correctly, then it will provide positive classifications for regions surrounding our object. We can then apply non-maxima suppression to select only the most probable bounding box.

  3. abbas June 4, 2015 at 2:06 am #

    hi Adrian
    I am working on the HOG descriptor. I trained an SVM on 64×128 positive and negative images and the output is good, but I have a problem with human detection in large images. Can you help me? I am just starting research in computer vision.

    • Adrian Rosebrock June 4, 2015 at 6:24 am #

      If the human you are trying to detect is substantially larger than your 64×128 window, then you should apply an image pyramid. This way the image becomes smaller at each layer of the pyramid, while your 64×128 window remains fixed, allowing you to detect larger objects (in this case, humans).

  4. Hoon October 26, 2015 at 12:29 am #

    Thanks for the wonderful article!
    I am wondering that I should change each of the step size when the resolution of the image changes because of image pyramid.
    Thanks in advance.

    • Adrian Rosebrock October 26, 2015 at 6:13 am #

      No, the step size of the sliding window normally stays constant across levels of the image pyramid.

      • Hoon October 26, 2015 at 11:11 pm #

        Thank you!
        Can I ask one more?
        Should I calculate the entire hog features for each image of different resolution?
        I am assuming the following steps.

        1) calculate HOG features of the original image
        2) collect regions that have high similarities (ROI) into a list or something
        3) resize the original image (down-size)
        4) calculate HOG features again

        n) Draw rectangles by referring to the list.

        And plus, how do I extract original location of ROI in down-sized images?

        Thank you very much!

        • Adrian Rosebrock October 27, 2015 at 4:48 am #

          I think reading this post on using HOG and Linear SVM for object detection should really help you out and answer all your questions 🙂

  5. bob January 6, 2016 at 4:43 pm #

    Wow, what great examples. Thanks. I have a question. Let’s say you have a classifier with K classes and you call the classifier for each of the N sliding windows on the current image. You essentially have a matrix with N rows and K columns. How do you process that matrix in some sensible way to report which windows have a meaningful object in them?

    • Adrian Rosebrock January 6, 2016 at 6:35 pm #

      You would simply maintain a list of bounding boxes for each of the unique classes reported by the SVM. From there, you would apply non-maxima suppression for each set of bounding boxes.

  6. Bob Zigon March 9, 2016 at 3:51 pm #

    Adrian, I have a question about your NMS logic. I applied a classifier to each of N sliding windows. I then extracted the subset of windows associated with class = 1 and passed them through the NMS. There was only 2 instances of object 1 in the FOV. Their dimension is approximately 280×200. The sliding window was 140×100. This is also the size of patches that I trained with. I was expecting the NMS to “merge” the 140×100 windows into a bounding box that more closely approximated the 280×200 of the actual objects. The NMS reported 5 objects and not 2.

    Am I using the NMS wrong? I can’t train on images that are 280×200 because I want to be able to identify the object when it is sliding out of the FOV. That is why I extracted a bunch of random 140×100 patches from the 280×200 object and trained that way.

    • Adrian Rosebrock March 9, 2016 at 4:36 pm #

      NMS is meant to merge overlapping bounding boxes, either based on their spatial dimensions, or the probability returned by your SVM (where higher probabilities are preferred over the lower ones). If your bounding boxes are not overlapping, then NMS will not suppress them. From your comment, it’s not clear if bounding boxes were overlapping?

      • Bob Zigon March 14, 2016 at 12:38 am #

        Yes, the boxes were overlapping. (I wish there was a way to embed a graphic in these comments, it would be easier to describe the situation.)

        Let me ask the question a different way. If you train your classifier with images that are 140×100 (these are random subsets of the 280×200 target image), how do you get a bounding box around the target image with the NMS?

        • Adrian Rosebrock March 14, 2016 at 3:23 pm #

          If you would like to include an image, I would suggest uploading the image to Imgur and then posting the link in the comment.

          As for the bounding boxes, please see my previous comment. You would take the entire set of bounding boxes and apply NMS based on either (1) the bounding box coordinates (such as the bottom-right corner) or (2) the probability associated with the bounding box.

          Again, NMS isn’t used to actually generate the bounding box surrounding an object, it’s used to suppress bounding boxes that have heavy overlap.

  7. Bob Zigon March 14, 2016 at 10:43 pm #

    Hmmm .. ok. The distinction seems subtle. Is it fair to say that the bounding box (with a target size of 280×200) is just the union of the 140×100 boxes in physical proximity to each other that overlap some small amount?

    • Adrian Rosebrock March 15, 2016 at 4:36 pm #

      I’m not sure I understand your question. If you can provide visual examples, I can try to answer further.

  8. Vinit March 16, 2016 at 2:34 pm #

    Hey Adrian,

    I have been reading your blogs recently and they are very helpful for my work. However I am still not able to figure out, how I am going to train the SVM for the classification.

    I got to detect humans in image so I am using INRIA dataset for training but i can’t figure out one issue that in one image I can see many persons. Right now I am just taking the hog features of the whole image once its resized to certain dimensions and then send it to train svm. But the data contains multiple human images not only single one. So can you please help me out here. Also it would be great if you can make a small post on training svm too for this object detection part.

    Thanks in advance

    • Adrian Rosebrock March 17, 2016 at 10:42 am #

      You mentioned resizing your image to a fixed size, extracting HOG features, and then passing it to your SVM — this is partly correct, but you’re missing a few critical steps. To start, I would suggest reading through a description of the entire HOG + Linear SVM pipeline.

      Instead, you need to utilize a sliding window (detailed in this post). This window is a fixed size that “slides” across your input image. At each stop of the window, you extract HOG features, and then pass them to your SVM for classification. In this way, you can detect not only a single person but multiple people at various locations in the image. Combined with an image pyramid, you can recognize objects at both multiple scales AND multiple locations.

      As for a source code implementation of such an object detector, please see the PyImageSearch Gurus course, where I detail how to code an object detector in detail.

  9. Mohamed Ben Arbia April 5, 2016 at 5:59 am #

    Hey Adrian,

    Excellent post. This is really helpful and straightforward. Thanks!

    • Adrian Rosebrock April 6, 2016 at 9:14 am #

      I’m glad you found it helpful Mohamed! 🙂

      • Mohamed Ben Arbia June 20, 2016 at 6:41 am #

        Hi Adran,

        I have encountered one issue during my project concerning the object detection. What if there are rotated versions of the object we would like to detect ?

        What would be the best approach ta tackle this ? Would you use rotated versions of the sliding windows ? Or would you define rotated versions of the image containing the object (And probably rotated version of the object) as the image pyramids for scaling ?

        Thanks !

        • Adrian Rosebrock June 20, 2016 at 5:23 pm #

          Rotated objects can be a real pain in the ass to detect, depending on your problem. I would suggest training a detector for each rotated version of your image. Or better yet, try to utilize algorithms that are more invariant to changes in rotation. Keypoint detection and local invariant descriptors tend to work well here as well.

          • Mohamed June 21, 2016 at 4:09 am #

            Thanks for your response Adrian 🙂
            Yes, I think using algorithms that are invariant to changes in rotation is a good approach.
            Concerning my problem, here is a link to a screen shot to the image where I have my rotated objects: The goal is to detect the footprints in the image.


          • Adrian Rosebrock June 23, 2016 at 1:31 pm #

            Why not just apply a dilation or closing morphological operation to close the gaps in between the footprints? From there, thresholding and contour detection will give you the footprint regions.

  10. farah May 6, 2016 at 4:38 am #

    When we run our classifier on sliding windows then it will fetch many bounding boxes.I want to show these bounding boxes on the original image. How to change the coordinates of the bounding boxes from the different sized windows to the original scale to be shown on the original window.

    • Adrian Rosebrock May 6, 2016 at 4:32 pm #

      Hey Farah, I assume you’re also talking about using image pyramids? As the image pyramid code demonstrates, you can keep track of the current scale of the pyramid and use that to give you the location in the original image.

  11. Farah May 10, 2016 at 1:28 am #

    Sir to get to the original scale should I multiply the coordinates by the respective scaling factor used in resizing the window i.e if I am downscaling by 1.5 in both x and y direction then I just multiply the bounding boxes coordinates at this layer by 1.5.

    • Adrian Rosebrock May 10, 2016 at 8:06 am #

      Hey Farah, please see my previous comment. If you’re using sliding windows in conjunction with image pyramids, you need to keep track of the ratio of the original image height to the current pyramid height. You can use this scale to multiply the bounding box coordinates and obtain them for the original image size. I cover this in more detail inside the PyImageSearch Gurus course.

      In this case, if you resize your image to be 1.5x smaller than the original, then yes, you would multiply your bounding boxes (obtained by the new, resized image) by this 1.5 factor to obtain the coordinates relative to the original image.
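
      A minimal sketch of that rescaling. The to_original_coords helper and the example box are hypothetical; the only thing taken from the discussion above is multiplying by the layer's scale factor.

```python
def to_original_coords(box, scale):
    # box is (startX, startY, endX, endY) measured on the resized layer;
    # scale is original_width / resized_width for that layer
    return tuple(int(round(v * scale)) for v in box)

# a 128x128 window found at (40, 20) on a layer 1.5x smaller than the original
print(to_original_coords((40, 20, 168, 148), 1.5))
```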

  12. Farah May 10, 2016 at 11:57 pm #

    Thanks Adrian for resolving my query

  13. Aka July 23, 2016 at 5:28 am #

    Hi Adrian,

    Nice post !

    I was wondering if the sliding window could be parallelised ? With a classifier which has a really low false positive rate and if the search need to be exhaustive, I feel sliding window is the best option. But say for a very large image it will be very slow. So if the sliding can be parallelised so that a list will have all the detections ( the order in which they get appended does not matter for NMS) , won’t it help speed up the detection process ?

    What do you think ? Do you know of such an implementation ?

    • Adrian Rosebrock July 27, 2016 at 2:47 pm #

      Yes, you can absolutely make the sliding window run in parallel. However, I instead recommend making the image pyramid run in parallel such that you have one process running for each of the layers of the pyramid. If you are only processing a small set of pyramid layers (or just one layer), then yes, absolutely make the sliding window run in parallel.

      I don’t have any implementations of this, but I do review how to build your own custom object detector inside the PyImageSearch Gurus course.

  14. Walid Ahmed September 28, 2016 at 4:05 pm #


    The code executed without errors for 2 images
    but nothing was shown
    any advice?

    • Adrian Rosebrock September 30, 2016 at 6:48 am #

      Can you elaborate on what you mean by “executed without error but nothing was shown”? I’m not sure I understand what you mean.

  15. Wei October 5, 2016 at 8:49 am #

    Hi, Adrian.

    I am wondering why the sliding window function does not give an “out of bound” error when “(x + winW) > image.shape[1]”?

    Thanks for the sharing, your website is very inspiring and helpful.

    • Adrian Rosebrock October 6, 2016 at 6:54 am #

      NumPy automatically prevents the out of bound error by treating the index as an array slice. If you try to slice an array past the actual bounds of the array, it simply returns the elements that exist up to the end of that dimension, so windows along the right and bottom edges come back smaller than the requested size.
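
      A short illustration of that slicing behavior, using a made-up image size and window position:

```python
import numpy as np

image = np.zeros((100, 150), dtype="uint8")  # 100 rows, 150 columns

# a 128x128 window requested at (x=64, y=32) runs past both edges
window = image[32:32 + 128, 64:64 + 128]
print(window.shape)  # (68, 86): clipped to what exists, no IndexError
```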

  16. Daryl November 2, 2016 at 4:59 pm #

    Hi Adrian,
    One doubt when i have an image pyramid i get the same image in different scales. Now from each of these images i get using the sliding window classifier say 3 images. Now how to choose between these images that i get in different levels of the pyramid.
    Example: Pyramid i have 400X400(original size);200X200;100X100
    From each i run a sliding window of 40X40
    I get 40X40 from first one
    80X80 from the second one(after scaling back to original size)
    160X160 from the third one

    • Adrian Rosebrock November 3, 2016 at 9:39 am #

      Your sliding window should always be the same fixed size — the sliding window size does not change. It’s the image pyramid itself that allows you to detect objects at different scales of the image. The sliding window simply allows you to detect objects at different locations.

      • Daryl November 3, 2016 at 2:11 pm #

        But how do i select between images of different scales was my question. If my sliding window gives 1 image in every level of the pyramid. How do i choose between these images?

        • Adrian Rosebrock November 4, 2016 at 9:56 am #

          I’m not sure what you mean by “select”. At each pyramid scale, and at each position of the sliding window, you would extract your features and pass them on to your model for classification. You then apply non-maxima suppression across all levels to obtain your final detections. I detail the entire HOG + Linear SVM pipeline here. I then review the code in detail inside the PyImageSearch Gurus course.

          • Daryl November 4, 2016 at 10:13 am #

            Dont you apply non maximal suppression on each level separately?
            Because if you apply across all levels then you are comparing between bounding boxes of different sizes.

          • Adrian Rosebrock November 4, 2016 at 10:42 am #

            No, NMS is only applied after all bounding boxes have been collected across all layers of the image pyramid. You resize each of your detected bounding boxes based on the ratio of the original image size to the current image size. This ensures that all bounding boxes are recorded at the same scale even though you are working with multiple scales of the image.
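
            A rough sketch of that order of operations: rescale every detection to original-image coordinates first, then run NMS once over the combined list. The NMS shown is one common variant (greedy, score-ordered, intersection-over-union) and is not necessarily the implementation used on this blog; the detections, scores, and the 0.3 threshold are made up.

```python
import numpy as np

def non_max_suppression(boxes, scores, overlap_thresh=0.3):
    # boxes: (N, 4) array of (x1, y1, x2, y2) in ORIGINAL image coordinates
    boxes = np.asarray(boxes, dtype="float")
    if len(boxes) == 0:
        return []
    scores = np.asarray(scores, dtype="float")
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the kept box with every remaining box
        w = np.maximum(0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        h = np.maximum(0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        # drop boxes that overlap the kept box too much
        order = rest[iou <= overlap_thresh]
    return keep

# detections from two pyramid layers: (box on that layer, layer scale, score)
detections = [
    ((10, 10, 138, 138), 1.0, 0.90),
    ((14, 12, 142, 140), 1.0, 0.80),
    ((100, 100, 228, 228), 2.0, 0.70),
]
# bring every box back to original-image coordinates before suppression
boxes = [[v * s for v in box] for (box, s, _) in detections]
scores = [score for (_, _, score) in detections]
print(non_max_suppression(boxes, scores))  # indices of the boxes that survive
```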

  17. Sumedha Agarwal November 8, 2016 at 2:30 am #

    Have been working on object detection, I was wondering why can’t we vary the window size instead of varying the image size(image pyramid).
    Any drawbacks with that?
    Thanks in advance!!!

    • Adrian Rosebrock November 10, 2016 at 7:05 am #

      Consider the HOG image descriptor, which is commonly used with sliding windows and image pyramids. The dimensionality of the feature vector it produces is determined by the size of the image/ROI passed into it. If you change the sliding window size, you change the output dimensionality of the descriptor. If all descriptors do not have the same dimensionality then you can’t apply a machine learning model to them.

      Because of this, the sliding window tends to be a fixed parameter in the model.
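
      To see how the window size drives the descriptor length, here is a small calculation. It assumes the commonly used HOG layout (8×8 pixel cells, 2×2 cell blocks advancing one cell at a time, 9 orientation bins); the hog_length helper is hypothetical and other parameter choices give different numbers.

```python
def hog_length(win_w, win_h, cell=8, block=2, bins=9):
    # assumes square cells of `cell` pixels, blocks of `block` x `block`
    # cells, and blocks that advance one cell at a time
    cells_x, cells_y = win_w // cell, win_h // cell
    blocks_x, blocks_y = cells_x - block + 1, cells_y - block + 1
    return blocks_x * blocks_y * block * block * bins

print(hog_length(64, 128))   # 3780
print(hog_length(128, 128))  # 8100
```

      A model trained on 3780-dimensional vectors cannot be applied to 8100-dimensional ones, which is why the window stays fixed and the image is resized instead.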

  18. Saloni Mittal November 30, 2016 at 3:42 am #

    My object detector (based on hog+svm) takes around 40-50 seconds to give the final result for a 1360×800 input image, with a 40×40 window size and step size 3×3-this is when i’ve done the computation for different scales parallely,creating threads. Is there any other way to speed up the process? Can we run this code on a GPU instead of using the CPU?

    • Adrian Rosebrock December 1, 2016 at 7:36 am #

      You can push the computation to the GPU, but you would need to recode using C++. The Python + OpenCV bindings do not have access to the GPU.

  19. Levy Anselmo December 1, 2016 at 1:54 pm #

    Nice article! But, can you tell me how to parse my camera? i want to try it with my camera frame by frame. Thanks!

    • Adrian Rosebrock December 5, 2016 at 1:46 pm #

      Hey Levy — can you elaborate more on what you mean by “parse” your camera?

  20. Kinley February 23, 2017 at 1:21 am #

    Hey, Can you suggest me some packages to implement the same using R.

  21. Ioannis March 6, 2017 at 3:29 pm #

    Hello Adrian, This article was very useful to me, good job.
    I applied texture analysis (GLCM) on satellite image using a sliding window (wnize=32) with a step size (step=32). I set a window size in order to make my scrip bit faster. The dimensions of the image is 250 x 200. After running the script, The final image was 8 x 6 (250/32, 200/32) due to the step size. I do not want to have a subset of the image though,
    By applying a step size, Is it possible to get the initial image back instead of a subset of it?

    I would appreciate any help
    thank you

    • Adrian Rosebrock March 6, 2017 at 3:33 pm #

      Hi Ioannis — thanks for the comment, although I’m not sure I understand your question. Can you elaborate on what you mean by the “initial image back”?

  22. Kumar Vishal March 8, 2017 at 5:44 am #

    HI Adrian, I am trying to build HOG based detector small confusion I have regarding scale factor say if I have scale factor = 1.03 that means at every step i have to reduce it by 3% percent . So If I have 648 * 460 image and min size I am putting 32 * 32 so I have to reduce the image by 3 percent every time until width (480) reduces to 32 or less. but it is creating a pyramid of approx 25 images. or more and each image from the bottom has more than 40000 patches is stride = 2 ; and over all time is coming to extract all the features is approx 1 second.

