OpenCV and Python K-Means Color Clustering

Jurassic Park Movie Poster

Take a second to look at the Jurassic Park movie poster above.

What are the dominant colors? (i.e. the colors that are represented most in the image)

Well, we see that the background is largely black. There is some red around the T-Rex. And there is some yellow surrounding the actual logo.

It’s pretty simple for the human mind to pick out these colors.

But what if we wanted to create an algorithm to automatically pull out these colors?


You might think that a color histogram is your best bet…

But there’s actually a more interesting algorithm we can apply — k-means clustering.

In this blog post I’ll show you how to use OpenCV, Python, and the k-means clustering algorithm to find the most dominant colors in an image.

OpenCV and Python versions:
This example will run on Python 2.7/Python 3.4+ and OpenCV 2.4.X/OpenCV 3.0+.

K-Means Clustering

So what exactly is k-means?

K-means is a clustering algorithm.

The goal is to partition n data points into k clusters. Each of the n data points will be assigned to a cluster with the nearest mean. The mean of each cluster is called its “centroid” or “center”.

Overall, applying k-means yields k separate clusters of the original n data points. Data points inside a particular cluster are considered to be “more similar” to each other than data points that belong to other clusters.

In our case, we will be clustering the pixel intensities of an RGB image. Given an MxN image, we thus have MxN pixels, each consisting of three components: Red, Green, and Blue.

We will treat these MxN pixels as our data points and cluster them using k-means.

Pixels that belong to a given cluster will be more similar in color than pixels belonging to a separate cluster.

One caveat of k-means is that we need to specify the number of clusters we want to generate ahead of time. There are algorithms that automatically select the optimal value of k, but these algorithms are outside the scope of this post.

OpenCV and Python K-Means Color Clustering

Alright, let’s get our hands dirty and cluster pixel intensities using OpenCV, Python, and k-means:
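(The listing below is a minimal sketch of this opening block of color_kmeans.py; the short -i/-c flag spellings and the exact comments are assumptions, while the imports and the --image and --clusters arguments match the description that follows.)

# import the necessary packages
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import argparse
import utils
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required = True, help = "Path to the image")
ap.add_argument("-c", "--clusters", required = True, type = int,
    help = "# of clusters")
args = vars(ap.parse_args())

# load the image and convert it from BGR to RGB so that
# we can display it with matplotlib
image = cv2.imread(args["image"])
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# show our image
plt.figure()
plt.axis("off")
plt.imshow(image)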

Lines 2-6 handle importing the packages we need. We’ll use the scikit-learn implementation of k-means to make our lives easier — no need to re-implement the wheel, so to speak. We’ll also be using matplotlib to display our images and most dominant colors. To parse command line arguments we will use argparse. The utils package contains two helper functions which I will discuss later. And finally the cv2 package contains our Python bindings to the OpenCV library.

Lines 9-13 parse our command line arguments. We only require two arguments: --image, which is the path to where our image resides on disk, and --clusters, the number of clusters that we wish to generate.

On Lines 17-18 we load our image off of disk and then convert it from the BGR to the RGB colorspace. Remember, OpenCV represents images as multi-dimensional NumPy arrays. However, these images are stored in BGR order rather than RGB. To remedy this, we simply use the cv2.cvtColor function.

Finally, we display our image to our screen using matplotlib on Lines 21-23.

As I mentioned earlier in this post, our goal is to generate k clusters from n data points. We will be treating the MxN pixels of our image as our data points.

In order to do this, we need to re-shape our image to be a list of pixels, rather than an MxN matrix of pixels:
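(A minimal sketch of that reshape, continuing from the code above:)

# reshape the image to be a list of pixels
image = image.reshape((image.shape[0] * image.shape[1], 3))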

This code should be pretty self-explanatory. We are simply re-shaping our NumPy array to be a list of RGB pixels.

Now that our data points are prepared, we can write two lines of code using k-means to find the most dominant colors in an image:
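(A sketch of those two lines, assuming the parsed arguments dictionary is named args as in the sketch above:)

# cluster the pixel intensities
clt = KMeans(n_clusters = args["clusters"])
clt.fit(image)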

We are using the scikit-learn implementation of k-means to avoid re-implementing the algorithm. There is also a k-means built into OpenCV, but if you have ever done any type of machine learning in Python before (or if you ever intend to), I suggest using the scikit-learn package.

We instantiate KMeans on Line 29, supplying the number of clusters we wish to generate. A call to the fit() method on Line 30 clusters our list of pixels.

That’s all there is to clustering our RGB pixels using Python and k-means.


Scikit-learn takes care of everything for us.

However, in order to display the most dominant colors in the image, we need to define two helper functions.

Let’s open up a new file, utils.py, and define the centroid_histogram function:
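(A sketch of what utils.py and centroid_histogram likely look like; the comments and exact variable names are approximate.)

# import the necessary packages
import numpy as np
import cv2

def centroid_histogram(clt):
    # grab the number of different clusters and create a histogram
    # based on the number of pixels assigned to each cluster
    numLabels = np.arange(0, len(np.unique(clt.labels_)) + 1)
    (hist, _) = np.histogram(clt.labels_, bins = numLabels)

    # normalize the histogram, such that it sums to one
    hist = hist.astype("float")
    hist /= hist.sum()

    # return the histogram
    return hist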

As you can see, this method takes a single parameter, clt. This is our k-means clustering object that we created in color_kmeans.py.

The k-means algorithm assigns each pixel in our image to the closest cluster. We grab the number of clusters on Line 8 and then create a histogram of the number of pixels assigned to each cluster on Line 9.

Finally, we normalize the histogram such that it sums to one and return it to the caller on Lines 12-16.

In essence, all this function is doing is counting the number of pixels that belong to each cluster.

Now for our second helper function, plot_colors:
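(Again a sketch, following the same assumptions as the utils.py listing above:)

def plot_colors(hist, centroids):
    # initialize the bar chart representing the relative frequency
    # of each of the colors
    bar = np.zeros((50, 300, 3), dtype = "uint8")
    startX = 0

    # loop over the percentage of each cluster and the color of
    # each cluster
    for (percent, color) in zip(hist, centroids):
        # plot the relative percentage of each cluster
        endX = startX + (percent * 300)
        cv2.rectangle(bar, (int(startX), 0), (int(endX), 50),
            color.astype("uint8").tolist(), -1)
        startX = endX

    # return the bar chart
    return bar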

The plot_colors function requires two parameters: hist, which is the histogram generated from the centroid_histogram function, and centroids, which is the list of centroids (cluster centers) generated by the k-means algorithm.

On Line 21 we define a 300×50 pixel rectangle to hold the most dominant colors in the image.

We start looping over the color and percentage contribution on Line 26, and then on Line 29 we draw a rectangle proportional to the percentage the current color contributes to the image. We then return our color percentage bar to the caller on Line 34.

Again, this function performs a very simple task: it generates a bar visualizing the relative number of pixels assigned to each cluster, based on the output of the centroid_histogram function.

Now that we have our two helper functions defined, we can glue everything together:
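(A sketch of that glue code, picking up right after the call to fit() in color_kmeans.py:)

# build a histogram of clusters and then create a figure
# representing the number of pixels labeled to each color
hist = utils.centroid_histogram(clt)
bar = utils.plot_colors(hist, clt.cluster_centers_)

# show our color bar
plt.figure()
plt.axis("off")
plt.imshow(bar)
plt.show()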

On Line 34 we count the number of pixels that are assigned to each cluster. And then on Line 35 we generate the figure that visualizes the number of pixels assigned to each cluster.

Lines 38-41 then display our figure.

To execute our script, issue the following command:
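(The images/jp.png path below is illustrative; point --image at wherever you saved the poster.)

$ python color_kmeans.py --image images/jp.png --clusters 3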

If all goes well, you should see something similar to below:

Figure 1: Using Python, OpenCV, and k-means to find the most dominant colors in our image.

Here you can see that our script generated three clusters (since we specified three clusters in the command line argument). The most dominant clusters are black, yellow, and red, which are all heavily represented in the Jurassic Park movie poster.

Let’s apply this to a screenshot of The Matrix:

Figure 2: Finding the four most dominant colors using k-means in our The Matrix image.

This time we told k-means to generate four clusters. As you can see, black and various shades of green are the most dominant colors in the image.

Finally, let’s generate five color clusters for this Batman image:

Figure 3: Applying OpenCV and k-means clustering to find the five most dominant colors in an RGB image.

So there you have it.

Using OpenCV, Python, and k-means to cluster RGB pixel intensities to find the most dominant colors in the image is actually quite simple. Scikit-learn takes care of all the heavy lifting for us. Most of the code in this post was used to glue all the pieces together.

Summary

In this blog post I showed you how to use OpenCV, Python, and k-means to find the most dominant colors in an image.

K-means is a clustering algorithm that generates k clusters based on n data points. The number of clusters k must be specified ahead of time. Although algorithms exist that can find an optimal value of k, they are outside the scope of this blog post.

In order to find the most dominant colors in our image, we treated our pixels as the data points and then applied k-means to cluster them.

We used the scikit-learn implementation of k-means to avoid having to re-implement it.

I encourage you to apply k-means clustering to your own images. In general, you’ll find that a smaller number of clusters (k <= 5) gives the best results.

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!


61 Responses to OpenCV and Python K-Means Color Clustering

  1. Xeth Waxman May 26, 2014 at 10:54 pm #

    Really cool little script. Thanks for putting it together!

    • Adrian Rosebrock May 27, 2014 at 7:54 am #

      I’m glad you liked it!

  2. Charles May 29, 2014 at 11:30 am #

    I wrote an article on this subject a while back using PIL and running the k-means calculation in pure python, in case you’re interested: http://charlesleifer.com/blog/using-python-and-k-means-to-find-the-dominant-colors-in-images/

    • Adrian Rosebrock May 29, 2014 at 12:43 pm #

      Hi Charles, thanks for posting. I really enjoyed looking at your pure Python implementation.

      • Charles May 29, 2014 at 2:32 pm #

        Thanks, and I yours! Looking forward to reading more of your posts in the future.

  3. Smitha Milli June 8, 2014 at 7:31 pm #

    Great article!

    I think that instead of using bin = numLabels for the histogram though that you want to use bin = np.arange(numLabels + 1). When you just use bin = numLabels (suppose numLabels = 5 for this example) the histogram gets sorted using the bin edges [0., 0.8, 1.6, 2.4, 3.2, 4. ] whereas with np.arange(numlabels + 1) it’s sorted based on the edges [0, 1, 2, 3, 4, 5]

    • Adrian Rosebrock June 9, 2014 at 7:13 am #

      Hi Smitha, thanks for the reply! 🙂

      And awesome catch on the bin edges! I have updated the code.

      Thanks again!

      • Deven Patel October 26, 2014 at 8:39 pm #

        i think the “+1” should be in the outer bracket
        numLabels = np.arange(0, len(np.unique(labels) )+1)

        since we want the bins to be one more than the labels. Coz “np.unique(clt.labels_) + 1” just adds one to each label and we end up with the same number of unique labels.

        • Adrian Rosebrock October 28, 2014 at 6:14 am #

          Thanks Deven! I’ll be sure to update the code.

          • Arnaud P November 18, 2014 at 4:01 pm #

            While we’re at it, why don’t you use clt.cluster_centers_ directly instead of making numpy look for unique values across all the labels ?

            I know nothing about scikit, but you use that exact semantic as an argument when calling utils.plot_colors()

            Anyhow, thx for the demo, interesting.

          • Adrian Rosebrock November 18, 2014 at 6:55 pm #

            I suppose I could have, thanks.

  4. Sreekrishna June 26, 2014 at 10:53 am #

    This is a great article !

    I have a doubt. How do we segment the colours without knowing the value of K. Here K is an input that the user provides. Let us assume that the user doesn’t know what value has to be provided, then in that case is there any algorithm with which I can accomplish Image segmentation using Clustering?

    • Adrian Rosebrock June 26, 2014 at 11:15 am #

      Right, so this is one of the problems many people find with k-means — based only on the standard implementation, there is no way to “automatically” know the value of k.

      However, there are extensions to the k-means algorithm, specifically X-means that utilizes Bayesian Information Criterion (BIC) to find the optimal value of k.

      If you’re interested in color based segmentation, definitely take a look at the segmentation sub-package of scikit-image.

  5. Sreekrishna June 26, 2014 at 1:25 pm #

    Thanks a lot Adrian !

  6. sereen yaser August 27, 2014 at 7:13 am #

    My question is if i want to reduce the dithering the code .. i mean if i want to show more colors what shall i change in the code?

    • Adrian Rosebrock August 27, 2014 at 8:34 am #

      If you want to show more colors, then you would want to increase the size of k, which is your number of clusters. If you want to show less colors, then you want to decrease k

  7. Wajih Ullah Baig August 29, 2014 at 7:19 am #

    Loved it!

  8. Mike November 23, 2014 at 4:42 pm #

    So lets say you are trying to find similar batman images, so you take the kmeans of a group of images, and find their most dominant colors too. How would you then find the most similar in color? Would you just take the distance between the most dominant colors of the two images, then the 2nd most dominant colors of the two images, all the way until the last? What if, in the batman example above, another batman image had the first two colors switched, so its most dominant was dark blue. Then wouldn’t the two images appear pretty different?

    • Adrian Rosebrock November 24, 2014 at 7:27 am #

      Hi Mike, great question. Basically, if you wanted to build a (color based) image search engine using k-means you would have to:

      1. Apply k-means to all the images in your dataset. You would loop over the dataset, load the images into memory, and then apply k-means to all of them. This would give you clusters of colors for the entire dataset.
      2. Loop over your dataset again. Then, for each image and each pixel in each image, determine which cluster the pixel belongs to. A good choice is to compute the Euclidean distance and find the minimum distance between the pixel and the centroid
      3. Then, based on Step 2, you can create a histogram of centroid counts. Simply tabulate the number of times a pixel is assigned to a given cluster
      4. To compare images, compute the distance between their histograms using your preferred metric. Chi-squared is a good choice. But intersection or correlation could work well too.

      I would also suggest using the L*a*b* color space over RGB for this problem since the Euclidean distance in the L*a*b* color space has perceptual meaning. This is definitely a lengthy topic and I should definitely write a blog post about it in the future.

      Thanks again for the great question!

  9. talha January 26, 2015 at 7:11 am #

    hi, thanks for the post. Can you show how we get the rgb (or hsv) value of the most dominant colors? (the colors that are plotted)

    • Adrian Rosebrock January 26, 2015 at 8:27 am #

      Hi Talha. The dominant colors (i.e. “centroids” or “cluster centers”) are in the clt.cluster_centers_ variable, which is a list of the dominant colors found by the k-means algorithm.

      • talha January 26, 2015 at 1:06 pm #

        thanks a lot for quick (and correct 🙂 ) reply Adrian:)

  10. talha February 1, 2015 at 5:33 pm #

    Hello again Adrian, can you also expand your code to include applying color quantization to the image? I mean if our k = 2, then the quantizatied image will only have these two colors. Thanks in advance

    • Adrian Rosebrock February 1, 2015 at 6:37 pm #

      Hi Talha. If you’re interested in color quantization, check out this post.

  11. AKIRA March 3, 2015 at 7:19 am #

    Hello adrian..i dont want the background color.so i removed the background and used the background removed image as input to your code.But when it reads the image,background is generated again and it is given as one of the dominant colors.how do i resolve this?

    • Adrian Rosebrock March 3, 2015 at 7:57 am #

      Hi Akira, great question, thanks for asking. If you do not want to include the background in the dominant color calculation, then you’ll need to create a “mask”. A mask is an image that is the same size as your input image that indicates which pixels should be included in the calculation and which ones should not. Take a look at masked arrays in NumPy to aide you in doing this. It’s a little tricky if you’re using masked arrays for the first time. I’ve done it before, but unfortunately I don’t have any code ready to go to handle this particular situation, but I’ll definitely consider writing another article about it in the future!

  12. AKIRA March 4, 2015 at 1:00 am #

    thanks adrian!! will try resolving this

  13. AKIRA March 4, 2015 at 1:33 am #

    hi once again, i have removed the background already.but when i read in the image why is it showing the background again? an additional background is getting added.anyway to resolve this.i dont understand why a background removed image is behaving this way

    • Adrian Rosebrock March 4, 2015 at 6:35 am #

      Removing the background from the image normally means either (1) generating a mask to distinguish between background and foreground or (2) removing the background color and replacing it with a different color. For example, if you had a red background and performed background subtraction, your background would (likely) be black.

      Even though you have already removed the background the k-means algorithm does not understand that you have removed the background — all it sees is an array of pixels. It has no idea that those black pixels are background. You need to use the NumPy masked arrays functionality to indicate which pixels are background and which pixels are foreground.

  14. AKIRA March 10, 2015 at 7:47 am #

    hi adrain,i used alpha masking to remove the background.so when i get make histogram for background removed image.it returns large counts of black pixels values though black is not present in the image.any idea as to why black value appears in the background removed image

    • Adrian Rosebrock March 10, 2015 at 8:17 am #

      Hi Akira, like I mentioned in previous comments “removing the background” does not mean that the background pixels are somehow removed from the image. By “removing the background” you are simply setting the background pixels to black. But when you go to cluster pixel intensities of an image they are still black pixels. You need to accumulate a list of pixels that do not include these background pixels. A simple (but slow) method to do this is loop over the image and append any non-black pixels to a list of pixels to be clustered. A faster, more efficient way to do this is use masked arrays.

  15. AKIRA March 11, 2015 at 6:14 am #

    is there a way to remove background pixels completely? anything u know of. thanks

    • Adrian Rosebrock March 11, 2015 at 6:32 am #

      No, you cannot “remove” pixels from an image. An image will always be a rectangular grid of pixels. Instead, your algorithms must “mark” pixels as being part of a background. Normally, after performing background subtraction, the background pixels will be black — but they are still part of the image. You still need to insert logic into your code to remove these pixels prior to being clustered. Otherwise, they will affect the clusters generated.

  16. AKIRA March 12, 2015 at 2:10 am #

    thanks alott !! adrian

  17. raghav October 23, 2015 at 3:57 am #

    getting error: error:

    any help please

    • Adrian Rosebrock October 23, 2015 at 6:19 am #

      Make sure that the path to your input image is correct. It’s likely that the path to your input image is not valid.

  18. Hacklavya December 9, 2015 at 8:42 am #

    I am successfully using virtualenv with python, thanks for good tutorial.

    Now I need to install sklearn also, so how can I install inside virtualEnv?

    where do I give this command “pip install -U scikit-learn”

    hacklavya@shalinux:~$ here
    or
    (cv)hacklavya@shalinux:~$ here

    • Adrian Rosebrock December 9, 2015 at 9:27 am #

      You can install scikit-learn using:

      • Hacklavya December 9, 2015 at 10:28 am #

        Thanks a lot.
        I already tried the same and worked.

  19. Tuvi April 19, 2016 at 8:47 am #

    thank you so much….it is a great post

  20. nadjia May 5, 2016 at 9:27 am #

    how can we evaluate the result of images clustering?

  21. Vishwas June 8, 2016 at 7:01 am #

    How can I output the RGB or HSV value of the most dominant color?

    • Adrian Rosebrock June 9, 2016 at 5:28 pm #

      Take a look at the code to this blog post. Examine the clusters generated. Then find the cluster that has the largest percentage. You can accomplish this by looking at the hist and centroids lists.

  22. Niki June 11, 2016 at 8:50 pm #

    Hi Adrian,

    Nice tutorial! I have two questions: 1. Can I use histograms of images as the input to k-means clustering and use chi-squared instead of distance for clustering? 2.Can my images be from different sizes or they should all have the same size?

    Your help is greatly appreciated!

    • Adrian Rosebrock June 12, 2016 at 9:32 am #

      If you use color histograms, then your images can be varying sizes since your output feature vector will always be the number of bins in the histogram. And yes, you can certainly pass in color histograms into a clustering algorithm instead of raw pixel intensities (this is normally what is done in the first place). However, since the k-means algorithm assumes a Euclidean space, you won’t be able to use the chi-squared distance directly.

      • Niki June 13, 2016 at 10:01 am #

        Thank you for your response!

        If you know of examples in which chi-squared metric has been used in k-means clustering, could you please post some of those links or papers? Thanks!

        • Adrian Rosebrock June 15, 2016 at 12:47 pm #

          Hi Niki — you might want to re-read my previous comment. Since the chi-squared distance doesn’t “make sense” in a Euclidean space, you can’t use it for k-means clustering. Instead, what you can try to do is apply a chi-squared kernel transform to your inputs, and then apply the Euclidean distance to the kernel transform during clustering.

  23. Torben B. Jensen September 18, 2016 at 4:27 am #

    Hello Adrian,

    How can I extract the exact HSV-values of the clusters output from Kmeans? I want to use the HSV-values of the biggest cluster to subsequently do real time tracking of a ball with that color, using inrange and circle detection.


    Sorry, I just found the answer earlier in the other comments!

    • Adrian Rosebrock September 19, 2016 at 1:07 pm #

      Congrats on resolving the question Torben!

  24. Elbruceo September 25, 2016 at 8:04 pm #

    Hi Adrain,
    Thanks for the info on Python/OpenCV. I’m trying to run and test your code. One of your code lines is “from sklearn.cluster import KMeans” (line 2 of your example). All the other import statements work fine (lines 3-6) but I can’t get this one to work.

    Any thoughts on what I’m missing?
    Thanks

    • Adrian Rosebrock September 27, 2016 at 8:46 am #

      It sounds like you don’t have the scikit-learn package installed. Be sure to install scikit-learn before proceeding.

  25. Ilga Yulian Putra D November 27, 2016 at 3:48 am #

    hi adrian, I have a problem: I can’t install scikit-learn because I don’t have scipy on the raspberry pi, and I could not find a way to install scipy on the raspberry pi.

    • Adrian Rosebrock November 28, 2016 at 10:26 am #

      Just make sure you install SciPy before installing scikit-learn:

      $ pip install scipy
      $ pip install scikit-learn

      That will resolve the issue.

  26. Xenofon April 7, 2017 at 7:50 am #

    Hi Adrian! Big fan of your work! Could this project be implemented with a video feed from a webcam or rasp pi cam or even a video file? If so what would I need to change in the code?

    Thanks.

    • Adrian Rosebrock April 8, 2017 at 12:49 pm #

      Yes, absolutely. Basically you would need to access your video stream and then apply the k-means clustering phase to each frame.

  27. Ian M April 21, 2017 at 3:56 am #

    Hi Adrian, I’m trying to sort the colors in the histogram (most frequent color to least frequent color) but I’m confused by how to do this. Sorting the hist list changes the width values, but not the colors, and the clt.cluster_centers_ variable is made up of three values per center, so I’m not sure how to sort them correctly. Any help would be hugely appreciated.

    • Adrian Rosebrock April 21, 2017 at 10:46 am #

      It sounds like you are correctly sorting the histogram, but you’re not sorting the associated values in .cluster_centers_. Sort both of these lists at the same time and you’ll resolve the issue.
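      A minimal sketch of that idea, reusing hist and clt from the post (the centroids name here is just for illustration):

      # order both arrays by descending pixel percentage so the most
      # dominant color comes first
      import numpy as np
      idx = np.argsort(hist)[::-1]
      hist = hist[idx]
      centroids = clt.cluster_centers_[idx]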

  28. jatin pal singh September 2, 2017 at 12:43 pm #

    i want to know how the same method could be applied to a small dataset of images .can you share the code and how to check confidence of model built..

    • Adrian Rosebrock September 5, 2017 at 9:33 am #

      Can you elaborate on what you are trying to accomplish? How small is a “small dataset”? Is your goal to cluster images into similar groups based on their appearance?

Trackbacks/Pingbacks

  1. Accessing the Raspberry Pi Camera with OpenCV and Python - PyImageSearch - March 30, 2015

    […] the past year the PyImageSearch blog has had a lot of popular blog posts. Using k-means clustering to find the dominant colors in an image was (and still is) hugely popular. One of my personal favorites, building a kick-ass […]
