Fast, optimized ‘for’ pixel loops with OpenCV and Python

Have you ever had to loop over an image pixel-by-pixel using Python and OpenCV?

If so, you know that it’s a painfully slow operation even though images are internally represented by NumPy arrays.

So why is this? Why are individual pixel accesses in NumPy so slow?

You see, NumPy operations are implemented in C. This allows us to avoid the expensive overhead of Python loops. When using NumPy, it’s not uncommon to see performance gains by multiple orders of magnitude (as compared to standard Python lists). In general, if you can frame your problem as a vector operation using NumPy arrays, you’ll be able to benefit from the speed boosts.

The problem here is that accessing individual pixels is not a vector operation. Therefore, even though NumPy is arguably the best numerical processing library available for nearly any programming language, combining it with Python’s for loops and individual element accesses loses most of those performance gains.
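To make the distinction concrete, here is a minimal sketch (not taken from the post's notebook) that thresholds a small synthetic array twice: once element by element with Python for loops, and once as a single vectorized NumPy expression. The array contents, the threshold T = 128, and the helper names threshold_loop and threshold_vectorized are hypothetical, chosen only for illustration.

```python
import numpy as np


def threshold_loop(T, image):
    # Visit every pixel individually; each access goes through the Python interpreter.
    out = image.copy()
    h, w = out.shape[:2]
    for y in range(h):
        for x in range(w):
            out[y, x] = 255 if out[y, x] >= T else 0
    return out


def threshold_vectorized(T, image):
    # One vector operation: the comparison and selection run in NumPy's compiled code.
    return np.where(image >= T, 255, 0).astype(np.uint8)


# Hypothetical 4x4 grayscale "image" holding the values 0, 17, 34, ..., 255.
image = (np.arange(16, dtype=np.uint8) * 17).reshape(4, 4)
T = 128

looped = threshold_loop(T, image)
vectorized = threshold_vectorized(T, image)
print(np.array_equal(looped, vectorized))
```

Both functions produce the same binary image. The difference is only in where the per-pixel work happens, and that is what the rest of this post is about.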

Along your computer vision journey, you may need to implement algorithms that require these manual for loops. Whether you need to implement Local Binary Patterns from scratch, create a custom convolution algorithm, or simply cannot rely on vectorized operations, you’ll need to understand how to optimize for loops using OpenCV and Python.

In the remainder of this blog post I’ll discuss how we can create super fast for pixel loops using Python and OpenCV — to learn more, just keep reading.

Looking for the source code to this post?
Jump right to the downloads section.

Super fast ‘for’ pixel loops with OpenCV and Python

A few weeks ago I was reading Satya Mallick’s excellent LearnOpenCV blog. His latest article discussed a special function named forEach. The forEach function allows you to utilize all cores on your machine when applying a function to every pixel in an image.

Distributing the computation across multiple cores resulted in a ~5x speedup.

But what about Python?

Is there a forEach function exposed in OpenCV’s Python bindings?

Unfortunately, no, there isn’t — instead, we need to create our own forEach-like method. Luckily this isn’t as hard as it sounds.

I’ve been using this exact method to speed up for  pixel loops using OpenCV and Python for years — and today I’m happy to share the implementation with you.

In the first part of this blog post, we’ll discuss Cython and how it can be used to speed up operations inside Python.

From there, I’ll provide a Jupyter Notebook detailing how to implement our faster pixel loops with OpenCV and Python.

What is Cython? And how will it speed up our pixel loops?

We all know that Python, being a high-level language, provides a lot of abstraction and convenience, and that’s the main reason why it is so great for image processing. The price is typically slower execution than a language that sits closer to the hardware, such as C.

You can think of Cython as a combination of Python with traces of C which provides C-like performance.

Cython differs from Python in that the code is first translated to C by the Cython compiler and then compiled into an extension module that the regular CPython interpreter imports. This allows the script to be written mostly in Python along with some decorators and type declarations.

So when should you take advantage of Cython in image processing?

Probably the best time to use Cython would be when you find yourself looping pixel-by-pixel in an image. You see, OpenCV and scikit-image are already optimized — a call to a function such as template-matching, like we did when we OCR’d bank checks and credit cards, has been optimized in underlying C. There is a tiny amount of overhead in the function call, but that’s it. You would never write your own template-matching algorithm in Python — it just wouldn’t be fast enough.

If you find yourself writing any custom image processing functions in Python which analyze or modify images pixel-by-pixel (perhaps with a kernel) it is extremely likely that your function won’t run as fast as possible.

In fact, it will run very slowly.

However, if you take advantage of Cython, whose generated C code compiles with major C/C++ compilers, you can achieve significant performance gains as we will demonstrate today.

Implementing faster pixel loops with OpenCV and Python

A few years ago I was struggling to find a method to improve the speed of accessing individual pixels in a NumPy array using Python and OpenCV.

Nothing I tried worked, so I resorted to framing my problem as complicated, hard-to-follow vector operations on NumPy arrays to achieve my desired speed increase. But there were still times where looping over each individual pixel in an image was simply unavoidable.

It wasn’t until I found Matthew Perry’s excellent blog post on parallelizing NumPy array loops with Cython that I was able to find a solution and adapt it to working with images.

In this section we’ll review a Jupyter Notebook I put together to help you learn how to implement faster pixel-by-pixel loops with OpenCV and Python.

But before we get started, ensure you have NumPy, Cython, matplotlib, and Jupyter installed, for example by running pip install numpy cython matplotlib jupyter in your shell.

Note: I recommend that you install these into your virtual environment for computer vision development with Python. If you have followed an install tutorial on this site, you may have a virtual environment called cv. Before issuing the install commands, simply enter workon cv in your shell (PyImageSearch Gurus members may install into their gurus environment if they choose to do so). If you don’t already have a virtual environment, create one and then symbolic-link your OpenCV bindings into it following the instructions available here.

From there you can launch a Jupyter Notebook in your environment (typically by running jupyter notebook in your shell) and begin entering the code from this post.

Alternatively, use the “Downloads” section of this blog post to follow along with the Jupyter Notebook I have created for you (highly recommended). If you’re using the notebook from the Downloads section, be sure to change your working directory to where the notebook lives on your disk.

Regardless of whether you have chosen to use the pre-baked notebook or follow along from scratch, the remainder of this section will discuss how to boost pixel-by-pixel loops with OpenCV and Python by over two orders of magnitude.

In this example, we’ll be implementing a simple threshold function. For each pixel in the image, we’ll check to see if the input pixel is greater than or equal to some threshold value T.

If the pixel passes the threshold test, we’ll set the output value to 255. Otherwise, the output pixel will be set to 0.

Using this function we’ll be able to binarize our input image, very similar to how OpenCV and scikit-image’s built-in thresholding methods work.

We use a simple threshold function as our example so that we can focus less on the image processing code itself and more on how to obtain speed boosts when manually looping over every pixel in an image.

To compare “naïve” pixel loops with our faster Cython loops, take a look at the notebook below:

Note: When your notebook is launched, I suggest you click “View” > “Toggle Line Numbers” from the menubar — in Jupyter, each In [ ]  and Out [ ]  block restarts numbering from 1, so you’ll see those same numbers reflected in the code blocks here. If you are using the notebook from the Downloads section of this post, feel free to execute all blocks by clicking “Cell” > “Run All”.

Inside In [1]  above, on Lines 2-3 we import our necessary packages. Line 5 simply specifies that we want our matplotlib plots to show up in-line within the notebook.

Next, we’ll load and preprocess an example image:

On Line 3 of In [2], we load example.png and then convert it to grayscale on Line 4.

Then we show the graphic using matplotlib (Line 5).

In-line output of the command is shown below:

Figure 1: Our input image (400×400 pixels) that we will be thresholding.

Next, we will load Cython:

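Within In [3] above, we load Cython; in a notebook this is normally done with the %load_ext Cython magic.

Now that we have Cython in memory, we will instruct Cython to show which lines can be optimized in our custom thresholding function:

Line 1 in In [4] above (typically the %%cython -a cell magic, where -a requests annotated output) tells the interpreter that we want Cython to determine which lines can be optimized.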

Then, we define our function, threshold_slow . Our function requires two arguments:

  • T : the threshold
  • image : the input image

On Lines 5 and 6 we extract the height and width from the image’s .shape attribute. We will need w and h so that we can loop over the image pixel-by-pixel.

Lines 9 and 10 begin a nested for loop where we’re looping top-to-bottom and left-to-right up until our height and width. Later, we will see that there is room for optimization in this loop.

On Line 12, we perform our in-place binary threshold of each pixel using the ternary operator — if the pixel is >= T  we set the pixel to white (255) and otherwise, we set the pixel to black (0).

Finally, we return the resulting image.
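Putting that description together, the function would look roughly like the sketch below. It is reconstructed from the walkthrough above and from the version readers quote in the comments, so treat it as an approximation of the notebook cell, not a verbatim copy. In the notebook, the cell would additionally begin with the Cython cell magic so that the code is annotated. The small test array at the bottom is hypothetical.

```python
import numpy as np


def threshold_slow(T, image):
    # grab the image dimensions
    h = image.shape[0]
    w = image.shape[1]

    # loop over the image, pixel by pixel
    for y in range(0, h):
        for x in range(0, w):
            # threshold the pixel in place
            image[y, x] = 255 if image[y, x] >= T else 0

    # return the thresholded image (the same array that was passed in)
    return image


# Hypothetical 2x3 grayscale input, used only to exercise the function.
image = np.array([[0, 4, 5], [6, 200, 3]], dtype=np.uint8)
result = threshold_slow(5, image)
print(result)
```

Note that the function returns the very array it was given, because the thresholding happens in place.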

In Jupyter (assuming you execute the above In [ ]  blocks), you’ll see the following output:

The yellow-highlighted lines in Out [4] mark the places where the generated C code still interacts with the Python interpreter, and therefore where Cython can be used for optimization; we’ll see later how to perform that optimization. Notice that the pixel-by-pixel loop itself is highlighted.

Tip: You may click the ‘+’ at the beginning of a line to see the underlying C code — something that I find very interesting.

Next, let’s time the operation of the function:

Using the %timeit magic we can execute and time the function. We specify a threshold value of 5 and the image we’ve already loaded. The resulting output is shown below:

The output shows that 244 ms was the fastest that the function ran on my system. This serves as our baseline time — we will reduce this number drastically later in this post.
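Outside of a notebook, a comparable measurement can be taken with Python's standard timeit module. The following is a minimal sketch, assuming a pure-Python threshold function like the one described above and a small, hypothetical 50×50 random image (the post's example image is not used here). Taking the minimum over several repeats mirrors the "fastest run" figure reported above.

```python
import timeit

import numpy as np


def threshold_slow(T, image):
    # Pure-Python pixel loop, thresholding in place.
    h, w = image.shape[:2]
    for y in range(h):
        for x in range(w):
            image[y, x] = 255 if image[y, x] >= T else 0
    return image


# Hypothetical stand-in for the example image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(50, 50), dtype=np.uint8)

# Call the function 5 times per measurement, repeat the measurement 3 times,
# and keep the best (smallest) per-call time.
calls_per_run = 5
runs = timeit.repeat(lambda: threshold_slow(5, image), number=calls_per_run, repeat=3)
best_per_call = min(runs) / calls_per_run
print(f"best of 3: {best_per_call * 1e3:.3f} ms per call")
```

The absolute number will differ from machine to machine and with image size, so it is only meaningful as a baseline for comparison on the same system.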

Let’s see the result of the thresholding operation to visually validate that our function is working properly:

The two lines shown in In [6]  run the function and show the output in-line on the notebook. The resulting thresholded image is shown:

Figure 2: Thresholding our input image using the threshold_slow method.

Now we get to the fun part. Let’s leverage Cython to create a highly-optimized pixel-by-pixel loop:

Line 1 of In [7]  again specifies that we want Cython to highlight lines that can be optimized.

Then, we import Cython on Line 2.

The beauty of Cython is that very few changes are necessary for our Python code. You will, however, see some traces of C syntax. Line 4 is a Cython decorator stating that we won’t check array index bounds, offering a slight speedup.

The following paragraphs highlight some Cython syntax, so pay particular attention.

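We then define the function (Line 5) using the cpdef keyword rather than Python’s def. This generates both a fast C-level function (as cdef would) and a Python-callable wrapper (as def would), so the function can be called efficiently from other Cython code and still be used from the notebook (source).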

The threshold_fast function will return an unsigned char [:,:], a typed memoryview over the pixel data of our output image. We use unsigned char since OpenCV represents images as unsigned 8-bit integers and an unsigned char (effectively) gives us the same data type. The [:, :] implies that we are working with a 2D array.

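From there, we provide the actual data types to our function, including int T (the threshold value), and another unsigned char [:, :] array, our input image. Taken together, the declaration described here reads along the lines of cpdef unsigned char[:, :] threshold_fast(int T, unsigned char [:, :] image).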
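One practical consequence of that return type, raised by a reader in the comments: what comes back from a function declared this way is a memoryview, not a numpy.ndarray, so functions such as cv2.imshow may reject it until it is converted with np.asarray. Since compiling Cython is out of scope for a quick snippet, the sketch below uses Python's built-in memoryview as a stand-in. That substitution is an assumption made for illustration; Cython's typed memoryviews expose the same buffer protocol. The array values are hypothetical.

```python
import numpy as np

# Hypothetical 2x2 grayscale image.
image = np.array([[0, 255], [255, 0]], dtype=np.uint8)

# Stand-in for the value returned by a function typed as unsigned char[:, :].
view = memoryview(image)

# np.asarray wraps the underlying buffer in an ndarray without copying the pixels.
restored = np.asarray(view)

print(type(view).__name__, "->", type(restored).__name__)
```

Because no pixel data is copied, the conversion costs next to nothing even for large images.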

On Line 7, using cdef we can declare our variables as C variables instead of Python objects, which tells Cython exactly which data types it is working with.

Everything else in In [7] is identical to that of threshold_slow, which demonstrates the convenience of Cython.

Our output is shown below:

This time notice in Out [7]  that fewer lines are highlighted by Cython. In fact, only the Cython import and the function declaration are highlighted — this is typical.

Next, we will reload and re-pre-process our original image (effectively resetting it):

We reload the image because our first threshold_slow operation modified the image in-place. We need to re-initialize it to a known state.
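The in-place behaviour is easy to verify. In this minimal sketch, a pristine copy is kept before thresholding and used to reset the working image afterwards. The tiny array and the original variable are hypothetical, and copying is shown as an alternative to reloading from disk, not as what the notebook does.

```python
import numpy as np


def threshold_slow(T, image):
    # Pure-Python pixel loop that overwrites the input array.
    h, w = image.shape[:2]
    for y in range(h):
        for x in range(w):
            image[y, x] = 255 if image[y, x] >= T else 0
    return image


# Hypothetical grayscale image and an untouched copy of it.
image = np.array([[1, 9], [5, 3]], dtype=np.uint8)
original = image.copy()

result = threshold_slow(5, image)
mutated = not np.array_equal(image, original)  # True: the input was overwritten
print("input modified in place:", mutated)

# Reset the working image to its known starting state.
image = original.copy()
```

Without such a reset, the second benchmark would run on an image that is already binary, which is not the same input the first benchmark saw.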

Let’s go ahead and benchmark our threshold_fast function against the original threshold_slow function in Python:

The result:

This time we are achieving 41.2 microseconds per call, a massive improvement over the 244 milliseconds using pure Python. This implies that by using Cython we can increase the speed of our pixel-by-pixel loop by over 2 orders of magnitude!
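As a quick check of that claim, the ratio of the two timings quoted above can be computed directly. This is plain arithmetic on the reported numbers (244 ms and 41.2 μs), not a new measurement.

```python
import math

slow_seconds = 244e-3    # pure-Python loop, as reported above
fast_seconds = 41.2e-6   # Cython loop, as reported above

speedup = slow_seconds / fast_seconds
orders_of_magnitude = math.log10(speedup)

print(f"speedup: {speedup:,.0f}x ({orders_of_magnitude:.2f} orders of magnitude)")
```

The ratio comes out a little under 6,000x, so "over 2 orders of magnitude" is, if anything, a conservative way to put it.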

What about OpenMP?

After reading through this tutorial you might be wondering if there are more performance gains we can achieve. While we have achieved massive performance gains by using Cython over Python, we’re actually still only using one core of our CPU.

But what if we wanted to distribute computation across multiple CPUs/cores? Is that possible?

It absolutely is — we just need to use OpenMP (Open Multi-processing).

In a follow-up blog post, I’ll demonstrate how to use OpenMP to further boost for pixel loops using OpenCV and Python.

Summary


Inspired by Satya Mallick’s original blog post to speed up for  pixel loops using C++, I decided to write a tutorial that attempts to accomplish the same thing — only in Python.

Unfortunately, Python has only a fraction of the function calls available as bindings (as compared to C++). Because of this, we need to “roll our own” faster ‘for’ loop method using Cython.

The results were quite dramatic: by using Cython we were able to boost our thresholding function from 244 ms per function call (pure Python) to roughly 41 μs (Cython).

What’s interesting is that there are still optimizations to be made.

Our simple method thus far is only using one core of our CPU. By enabling OpenMP support, we can actually distribute the for  loop computation across multiple CPUs/cores — doing this will only further increase the speed of our function.

I will be covering how to use OpenMP to boost our for  pixel loops with OpenCV and Python in a future blog post.

For the time being, be sure to enter your email address in the form below to be notified when new blog posts are published!

Downloads


If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!


39 Responses to Fast, optimized ‘for’ pixel loops with OpenCV and Python

  1. Nick Shaver August 28, 2017 at 10:40 am #

    Thanks so much for all your OpenCV research. I’ve worked hard to get my for loops multiprocessed on Raspberry Pi 3’s quad cores, mostly via your other tutorials. I’m really looking forward to giving cython a try, and will definitely be looking forward to your OpenMP post.

    • Adrian Rosebrock August 28, 2017 at 4:20 pm #

      This code will help speed up the for loops dramatically using a single core. Using OpenMP will allow you to distribute the process across multiple cores 🙂

  2. Arighi August 28, 2017 at 11:35 am #

    Impressive. Will drowsiness detection on a Raspberry Pi 3 be fast enough for detecting eye blinks with this?

    • Adrian Rosebrock August 28, 2017 at 4:18 pm #

      No, that requires a different set of optimizations, including Haar cascades and skipping frames.

  3. Javier de la Rosa August 28, 2017 at 12:15 pm #

    Before starting with OpenMP, maybe joblib can do the job. Its super simple API makes using parallel execution very easy.

  4. Vitali August 28, 2017 at 2:13 pm #

    Hello Adrian,

    Thank you very much for a great post. Another option to speed up a “for loop” is to use numba, which is preinstalled in Anaconda. I am not sure if it is faster than cython, but the pure python code requires minimal changes, just a decorator:

    from numba import njit

    @njit
    def threshold_slow(T, image):
        # grab the image dimensions
        h = image.shape[0]
        w = image.shape[1]

        # loop over the image, pixel by pixel
        for y in range(0, h):
            for x in range(0, w):
                # threshold the pixel
                image[y, x] = 255 if image[y, x] >= T else 0

        # return the thresholded image
        return image


    • Adrian Rosebrock August 28, 2017 at 4:15 pm #

      Thanks for sharing, Vitali! I have not tried numba before, I’ll have to take a look.

      • Florian Scholz August 29, 2017 at 2:47 am #

        Hello Adrian, like Vitali I am also a numba fan. From what I have seen, it gets into the range of speed of C, as long as you write the code without lists and dicts (they call it nopython).
        What’s really cool is that you can get faster by switching the platform to: cuda, multi-cpu, amd-hsa. In contrast to pypy it gives and requires more interaction from the programmer.

        • Peter September 6, 2017 at 5:10 pm #

          Hello Adrian,
          Hello Florian

          I can recommend numba version 0.34 with prange and parallel, its a lot faster for larger images.

          from numba import jit,prange


          # loop over the image, pixel by pixel
          for y in prange(0, h):
              for x in prange(0, w):

          • Dian June 23, 2018 at 4:50 am #

            Can i run it on a raspi3?

    • Guru September 29, 2018 at 10:52 am #

      Hi Vitali,

      I am Guru, Working in Deep Learning/Computer Vision.

      Can you please share me few details on using Numba with the existing Python code(OpenCV DL), so that I can use the same in a GPU environment.


  5. Brian Norman August 28, 2017 at 2:25 pm #

    Now this is very timely. I have just spent two weeks writing an RGB LED matrix simulator, on my windows desktop machine, which creates a simulated LED matrix using TkInter to test and debug animations before FTPing to a Pi3 to drive a real matrix (one of those cheap HUB75 jobs from China). My preferred development environment is PyCharm and I have found this which shows how to use Cython with PyCharm and may be of interest to others (I hope). Currently, my simulator slows considerably the more animations I add to it (as expected) so optimising for loops is a must for me – even though I’m only using the simulator to debug my animation code but the optimization may well also be needed on the Pi too.

    My code is still in development but I’m looking forward to speeding up the for loops.

    Thanks for this – you must have been reading my mind LOL

    • Adrian Rosebrock August 28, 2017 at 4:15 pm #

      Thanks for sharing your project, Brian! The link to the PyCharm + Cython integration is also much appreciated.

    • Dave August 30, 2017 at 2:55 pm #

      Thanks for sharing, Brian. I’ve been using PyCharm more and more. In general, I like Jetbrains products (2 for 2 since I like IntelliJ as well). I wonder how the debugger will work with Cython — something not covered in that link or in some others I found while searching.

  6. Ben August 28, 2017 at 3:27 pm #

    Calling it now, Adrian, this will be one of your most popular blog posts.


    • Adrian Rosebrock August 28, 2017 at 4:14 pm #

      Thanks Ben! I hope you’re right 🙂

  7. Gaël Écorchard August 29, 2017 at 5:15 am #

    Thanks for this great post! I was also interested to compare to the pure Python implementation in this case. It’s actually comparable with threshold_slow. I got

    Python: 219 ms
    threshold_slow: 214 ms
    threshold_fast: 180 µs
    numpy: 85.4 µs

    In this case, a numpy implementation is possible (and the best, as shown above):

    def threshold_numpy(T, image):
        image[image > T] = 255
        return image

    As a side note, you can also use scipy.signal.convolve2d for (I guess optimized) kernel manipulations on numpy arrays.

    • Adrian Rosebrock August 31, 2017 at 8:41 am #

      Hi Gaël — the point of this blog post was not to find the fastest method to threshold an image. As I mentioned in the introduction, if you frame the problem as a vector operation NumPy will almost always win.

      Instead, the point of this blog post is to demonstrate how you can optimize your “for” loops for non-vector operations. Thresholding was an example and meant to be extended to other algorithms.

  8. Islam August 31, 2017 at 7:59 am #

    Great effort Adrian and great topic.
    Many thanks

  9. mar September 5, 2017 at 5:04 pm #

    Hi Adrian

    I tried pool.apply_async and numba separately, it can boost performance individually, but when I combine them two. It gets very slow………might be a good topic for next blog, combining compiled code and multiprocessing?

  10. Nico September 6, 2017 at 4:34 pm #

    Not that you’re claiming this method does, but just to confirm: this method (or OpenCV forEach) doesn’t actually get you vectorization, does it? So, you’re getting thread parallelism (through OpenMP or forEach) but not data parallelism.

    This is something where numba or numpy could potentially help.

    • Adrian Rosebrock September 7, 2017 at 6:58 am #

      Hi Nico, as I mentioned in the introduction of the blog post, this method is meant to demonstrate how you can speed up for loops. I used thresholding as an example as it’s simple for everyone to understand. Yes, thresholding can be vectorized; however, there are algorithms that cannot be vectorized. In this case you would need to speed up your for loops, and that is the intended purpose of this tutorial.

  11. Gerard September 17, 2017 at 2:10 pm #

    Adrian you savage! Can’t wait for the post regarding openMP, I’m really interested in the power of parallel execution

  12. Marco February 23, 2018 at 5:08 pm #

    When I return the variable and try to show it with cv2.imshow(“Image”, image) I got: “TypeError: mat is not a numpy array, neither a scalar”

    How can I display my result?

    • Adrian Rosebrock February 26, 2018 at 2:04 pm #

      Hey Marco — what function is returning a Mat object instead of a NumPy array? Could you elaborate?

      • Marco February 26, 2018 at 2:27 pm #

        I use your function: threshold_fast, I want to show the returning image using cv2.imshow but then the mentioned error appears. It seems that I have to use the function “numpy.asarray”, I’m confused with the types.

        Another question, Is it recommendable use built-in functions from OpenCV or Python in the .pyx file?


        • Adrian Rosebrock February 26, 2018 at 2:32 pm #

          Thank you for sharing, Marco. I haven’t encountered that error before but it’s good for other readers to know about. You can use OpenCV functions inside your Cython code provided your data types are correct.

  13. StephenL April 26, 2018 at 11:24 pm #

    This is like a face-off for image processing

    %timeit threshold_slow(5, image)
    176 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit threshold_fast(5, image)
    185 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit threshold_njit(5, image)
    33.7 µs ± 365 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    %timeit threshold_prange(5, image)
    36 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    %timeit threshold_numpy(5, image)
    74.7 µs ± 519 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    That’s what I got on my machine. Njit is very tidy and quick and a nose in front after loading the function, prange was close on this 400 X 400 image but not the fastest today, numpy seems to lag njit and njit prange a bit on this test. I note in all cases that loading the function and then calling it caused quite variable timings, but repeatedly calling the function once the function is parsed is fairly stable. Interesting!

  14. jinesh john May 28, 2018 at 1:47 pm #

    i didnt understand one thing how will get T ? what is the algorithm will use for that ?

    • Adrian Rosebrock May 31, 2018 at 5:34 am #

      “T” is your threshold value. It is manually supplied.

  15. Jeff July 2, 2018 at 4:56 pm #

    I really appreciate this post. I am trying to augment this for my application, converting an RGB value to a DBZ (radar) output using a conversion table. I am very new to cython and am getting a syntax error at my *function definition*. Can you possibly identify my problem? Here is the function:

    cpdef float rgb2dbz_fast(float table_dbz, unsigned char [:,:] table_rgb, unsigned char [:, :] image):

    table_dbz has values from -32, -31.5, …, 95 (as will the output) so they must be floats. Can you see why this is giving me an error?

    • Adrian Rosebrock July 3, 2018 at 7:19 am #

      Hey Jeff — I’m not sure what may be causing this error off the top of my head. I hope another PyImageSearch reader can help you!

  16. Neha jain April 10, 2019 at 5:09 am #

    Hi Adrian,
    I am working on comparing one to many images(kept in the directory).
    I have implemented a for loop to iterate over the images in the directory, but the processing time is too much. Can you suggest something to reduce the execution time?
    I will be really grateful to you.

    • Adrian Rosebrock April 12, 2019 at 11:34 am #

      What algorithm are you using to compare the two images?

  17. Artem April 29, 2019 at 3:30 am #

    Hi Adrian!
    Do you know if anyone has done speeding up of your EAST implementation via Cython?
    I’m working on it now but getting a lot of problems on each step.
    The main one – how to declare a variables of “decode_predictions” function – scores, geometry, rects, confidences? If I leave them as python list = [], computation time doesn’t change – it’s ok. But how to declare them to make Cython use C language when handling this lists? I think this is the main step for speeding up this algorithm, isn’t it?
    Thank you in advance!

  18. Sachin Rajan August 28, 2019 at 6:59 am #

    Hey Adrian, thanks for the great tutorial. But I would like to know how you cythonize the code when there are OpenCV functions all over the code?

  19. Nick Kim October 30, 2019 at 11:05 pm #

    Hi Adrian,

    The links don’t work anymore for Matthew’s blog nor your Jupyter notebook. Can you update them? Thanks!


  20. Alistair January 17, 2020 at 3:26 pm #

    Hey Adrian,

    Fantastic blog post. I’m currently working on a script that requires me to step through an image that is not in gray scale (it’s in LAB). Is there any reason that cython wouldn’t work for such a case?

    • Adrian Rosebrock January 23, 2020 at 9:30 am #

      I don’t see any reason why not.
