Building an Image Search Engine: Indexing Your Dataset (Step 2 of 4)

Last Wednesday’s blog post reviewed the first step of building an image search engine: Defining Your Image Descriptor.

We then examined the three aspects of an image that can be easily described:

  • Color: Image descriptors that characterize the color of an image seek to model the distribution of the pixel intensities in each channel of the image. These methods include basic color statistics such as mean, standard deviation, and skewness, along with color histograms, both “flat” and multi-dimensional.
  • Texture: Texture descriptors seek to model the feel, appearance, and overall tactile quality of an object in an image. Some, but not all, texture descriptors convert the image to grayscale and then compute a Gray-Level Co-occurrence Matrix (GLCM) and compute statistics over this matrix, including contrast, correlation, and entropy, to name a few. More advanced texture descriptors such as Fourier and Wavelet transforms also exist, but still utilize the grayscale image.
  • Shape: Many shape descriptor methods rely on extracting the contour of an object in an image (i.e. the outline). Once we have the outline, we can then compute simple statistics to characterize it, which is exactly what OpenCV’s Hu Moments does. These statistics can be used to represent the shape (outline) of an object in an image.

Note: If you haven’t seen my fully working image search engine yet, head on over to my How-To guide on building a simple image search engine using Lord of the Rings screenshots.

When selecting a descriptor to extract features from our dataset, we have to ask ourselves: what aspects of the image are we interested in describing? Is the color of an image important? What about the shape? Is the tactile quality (texture) important to returning relevant results?

Let’s take a look at a sample of the Flowers 17 dataset, a dataset of 17 flower species, for example purposes:

Figure 1 – A sample of the Flowers 17 Dataset. As we can see, some flowers might be indistinguishable using color or shape alone (e.g. Tulip and Cowslip have similar color distributions). Better results can be obtained by extracting both color and shape features.

If we wanted to describe these images with the intention of building an image search engine, the first descriptor I would use is color. By characterizing the color of the petals of the flower, our search engine will be able to return flowers of similar color tones.

However, just because our image search engine returns flowers of similar color does not mean all the results will be relevant. Many flowers share the same color but belong to entirely different species.

In order to ensure more similar species of flowers are returned from our flower search engine, I would then explore describing the shape of the petals of the flower.

Now we have two descriptors — color to characterize the different color tones of the petals, and shape to describe the outline of the petals themselves.

Using these two descriptors in conjunction with one another, we would be able to build a simple image search engine for our flowers dataset.

Of course, we need to know how to index our dataset.

Right now we simply know what descriptors we will use to describe our images.

But how are we going to apply these descriptors to our entire dataset?

In order to answer that question, today we are going to explore the second step of building an image search engine: Indexing Your Dataset.

Indexing Your Dataset

Definition: Indexing is the process of quantifying your dataset by applying an image descriptor to extract features from each and every image in your dataset. Normally, these features are stored on disk for later use.

Using our flowers database example above, our goal is to simply loop over each image in our dataset, extract some features, and store these features on disk.

It’s quite a simple concept in principle, but in reality, it can become very complex, depending on the size and scale of your dataset. For comparison purposes, we would say that the Flowers 17 dataset is small. It has a total of only 1,360 images (17 categories x 80 images per category). By comparison, image search engines such as TinEye have image datasets that number in the billions.

Let’s start with the first step: instantiating your descriptor.

1. Instantiate Your Descriptor

In my How-To guide to building an image search engine, I mentioned that I liked to abstract my image descriptors as classes rather than functions.

Furthermore, I like to put relevant parameters (such as the number of bins in a histogram) in the constructor of the class.

Why do I bother doing this?

I use a class (with descriptor parameters in the constructor) rather than a function because it helps ensure that the exact same descriptor with the exact same parameters is applied to each and every image in my dataset.

This is especially useful if I ever need to write my descriptor to disk using cPickle (the pickle module in Python 3) and load it back up again further down the line, such as when a user is performing a query.

In order to compare two images, you need to represent them in the same manner using your image descriptor. It wouldn’t make sense to extract a histogram with 32 bins from one image and then a histogram with 128 bins from another image if your intent is to compare the two for similarity.

For example, let’s take a look at the skeleton code of a generic image descriptor in Python:
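A minimal sketch of such a skeleton follows. The class name `GenericDescriptor` and the parameter names `param_a` and `param_b` are placeholders chosen for illustration, not names from any library:

```python
class GenericDescriptor:
    """Skeleton of an image descriptor; all names here are placeholders."""

    def __init__(self, param_a, param_b):
        # Store the descriptor parameters once, so that every call to
        # describe() uses exactly the same settings.
        self.param_a = param_a
        self.param_b = param_b

    def describe(self, image):
        # A real descriptor would compute and return a feature vector
        # for `image` here, using self.param_a and self.param_b.
        raise NotImplementedError("implement feature extraction here")
```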

The first thing you notice is the __init__ method. Here I provide my relevant parameters for the descriptor.

Next, you see the describe method. This method takes a single parameter: the image we wish to describe.

Whenever I call the describe method, I know that the parameters stored during the constructor will be used for each and every image in my dataset. This ensures my images are described consistently with identical descriptor parameters.
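To make the pattern concrete, here is a small sketch of a descriptor whose only parameter is the number of histogram bins. It is an illustration under simplifying assumptions: the class name `GrayHistogramDescriptor` is hypothetical, and the "image" is a plain list of rows of 0-255 grayscale values rather than an array loaded with OpenCV.

```python
class GrayHistogramDescriptor:
    """Toy grayscale histogram descriptor (hypothetical, for illustration)."""

    def __init__(self, bins):
        # The bin count is fixed at construction time, so every image
        # described by this object yields a vector of the same length.
        if bins <= 0 or 256 % bins != 0:
            raise ValueError("bins must be a positive divisor of 256")
        self.bins = bins

    def describe(self, image):
        # `image` is assumed to be a list of rows of ints in 0..255.
        width = 256 // self.bins
        hist = [0] * self.bins
        count = 0
        for row in image:
            for value in row:
                hist[value // width] += 1
                count += 1
        if count == 0:
            return [0.0] * self.bins
        # Normalize so that images of different sizes are comparable.
        return [h / count for h in hist]


descriptor = GrayHistogramDescriptor(bins=4)
image_a = [[0, 64], [128, 255]]
image_b = [[10, 20], [30, 40]]
features_a = descriptor.describe(image_a)  # [0.25, 0.25, 0.25, 0.25]
features_b = descriptor.describe(image_b)  # [1.0, 0.0, 0.0, 0.0]
```

Because both vectors come from the same descriptor object, they have the same length and the same bin boundaries, which is what makes a later comparison meaningful.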

While the class vs. function argument doesn’t seem like it’s a big deal right now, when you start building larger, more complex image search engines that have a large codebase, using classes helps ensure that your descriptors are consistent.

2. Serial or Parallel?

A better title for this step might be “Single-core or Multi-core?”

Inherently, extracting features from images in a dataset is a task that can be made parallel.

Depending on the size and scale of your dataset, it might make sense to utilize multi-core processing techniques to split-up the extraction of feature vectors from each image between multiple cores/processors.
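As a rough sketch of what that split could look like, the example below uses Python's standard `multiprocessing.Pool` to spread per-image work across processes. The function names are hypothetical, and `describe_file` is only a stand-in: a real version would load the image at `path` and run your descriptor on it.

```python
from multiprocessing import Pool


def describe_file(path):
    # Hypothetical stand-in for "load the image at `path` and apply the
    # descriptor". The fake feature vector is just the path length.
    return path, [float(len(path))]


def index_serial(paths):
    # Single-core version: describe one image after another.
    return dict(map(describe_file, paths))


def index_parallel(paths, workers=2):
    # Multi-core version: the pool hands paths to worker processes.
    # The worker must be a top-level function so it can be pickled.
    with Pool(processes=workers) as pool:
        return dict(pool.map(describe_file, paths))
```

Both functions return the same dictionary for the same input. On platforms that start workers by spawning a fresh interpreter, `index_parallel` should be called from under an `if __name__ == "__main__":` guard.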

However, for small datasets using computationally simple image descriptors, such as color histograms, using multi-core processing is not only overkill, it adds extra complexity to your code.

This is especially troublesome if you are just getting started working with computer vision and image search engines.

Why bother adding extra complexity? Debugging programs with multiple threads/processes is substantially harder than debugging programs with only a single thread of execution.

Unless your dataset is quite large and could greatly benefit from multi-core processing, I would stay away from splitting the indexing task up into multiple processes for the time being. It’s not worth the headache just yet. In the future, though, I will certainly have a blog post discussing best practices for making your indexing task parallel.

3. Writing to Disk

This step might seem a bit obvious. But if you’re going to go through all the effort to extract features from your dataset, it’s best to write your index to disk for later use.

For small datasets, using a simple Python dictionary will likely suffice. The key can be the image filename (assuming that you have unique filenames across your dataset) and the value the features extracted from that image using your image descriptor. Finally, you can dump the index to file using cPickle.
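The dictionary approach can be sketched as follows. This is a minimal illustration, not the code from my How-To guide: the helper names (`build_index`, `save_index`, `load_index`) and `fake_describe` are made up for the example, and it uses Python 3's `pickle` module, which replaces Python 2's cPickle.

```python
import os
import pickle
import tempfile


def build_index(image_paths, describe):
    # Map each (unique) image filename to its feature vector.
    index = {}
    for path in image_paths:
        filename = os.path.basename(path)
        if filename in index:
            raise ValueError("duplicate filename: " + filename)
        index[filename] = describe(path)
    return index


def save_index(index, index_path):
    # Dump the whole dictionary to disk in one go.
    with open(index_path, "wb") as f:
        pickle.dump(index, f)


def load_index(index_path):
    # Only unpickle files you created yourself or otherwise trust.
    with open(index_path, "rb") as f:
        return pickle.load(f)


def fake_describe(path):
    # Hypothetical stand-in for a real descriptor's describe() call.
    return [0.5, 0.25, 0.25]


paths = ["flowers/image_0001.jpg", "flowers/image_0002.jpg"]
index = build_index(paths, fake_describe)
index_path = os.path.join(tempfile.mkdtemp(), "index.pkl")
save_index(index, index_path)
restored = load_index(index_path)
```

At query time, `load_index` gives back the same filename-to-features dictionary without re-describing any image.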

If your dataset is larger or you plan to manipulate your features further (e.g. scaling, normalization, dimensionality reduction), you might be better off using h5py to write your features to disk.

Is one method better than the other?

It honestly depends.

If you’re just starting off in computer vision and image search engines and you have a small dataset, I would use Python’s built-in dictionary type and cPickle for the time being.

If you have experience in the field and have experience with NumPy, then I would suggest giving h5py a try and then comparing it to the dictionary approach mentioned above.

For the time being, I will be using cPickle in my code examples; however, within the next few months, I’ll also start introducing h5py into my examples as well.

Summary

Today we explored how to index an image dataset. Indexing is the process of extracting features from a dataset of images and then writing the features to persistent storage, such as your hard drive.

The first step to indexing a dataset is to determine which image descriptor you are going to use. You need to ask yourself, what aspect of the images are you trying to characterize? The color distribution? The texture and tactile quality? The shape of the objects in the image?

After you have determined which descriptor you are going to use, you need to loop over your dataset and apply your descriptor to each and every image in the dataset, extracting feature vectors. This can be done either serially or in parallel by utilizing multi-processing techniques.

Finally, after you have extracted features from your dataset, you need to write your index of features to file. Simple methods include using Python’s built-in dictionary type and cPickle. More advanced options include using h5py.

Next week we’ll move on to the third step in building an image search engine: determining how to compare feature vectors for similarity.

19 Responses to Building an Image Search Engine: Indexing Your Dataset (Step 2 of 4)

  1. Dinesh Vadhia February 10, 2014 at 12:36 pm #

    The performance of feature vector creation for content-based data (eg. images and audio) takes longer than text. Do you have a rough idea what the performance is like when using the various Python image extraction libraries?

    • Adrian Rosebrock February 10, 2014 at 1:55 pm #

      Hi Dinesh,

      Thanks for commenting. You hit the nail on the head — constructing feature vectors from images takes significantly more time than when extracting from text. However, we have (arguably) many more descriptors to choose from when dealing with images.

      Simple methods such as color statistics and color histograms are extremely fast to compute. Keep in mind, though, that OpenCV’s calcHist is about 40x faster than NumPy’s built-in histogram method.

      A very popular texture/shape descriptor used to detect objects in images, and that is very good for people detection/tracking, is the Histogram of Oriented Gradients, or simply HoG. I prefer to use the implementation of HoG from scikit-image. And while HoG is awesome, it also tends to be quite slow by comparison to other methods.

      If we wanted to use the “big guns”, we could break out keypoint detection and local invariant descriptors, such as SIFT, SURF, ORB, FREAK, etc. All of these are implemented in OpenCV and are quite fast. The downside to using these methods is that we need to construct a bag-of-visual-words, similar to the bag-of-words model for text, but we now have to add in a step of clustering and vector quantization, which can become very time consuming. So even though OpenCV implements these methods in raw C/C++ with Python bindings, it still takes a long time to get these descriptors in “usable” form since we have to now perform clustering.

      In general, I tend to look at the source code of the descriptor itself, provided that what I am using is an open source library. Is the code pure Python? Or is it a C/C++ library with Python bindings? You’ll normally find that the latter is faster than the former; however, you need to consider your descriptor as well. Color histograms are going to be faster to compute than HoG. The world of image search engines is not as clear-cut as that of text search engines, simply because we have so many choices of descriptors.

      • Dinesh Vadhia February 10, 2014 at 2:21 pm #

        I guess companies like Amazon with their new Flow app (http://www.theverge.com/2014/2/7/5389888/amazon-ios-app-uses-iphone-camera-to-make-shopping-list) must be using fast C-based image feature extractors either on the mobile or most likely at the server.

        • Adrian Rosebrock February 10, 2014 at 5:29 pm #

          I haven’t tested Flow out myself, but here is the general algorithm I would use to build something like it:

              1. Extract features locally. OpenCV has an iOS implementation that is great.
              2. Detect keypoints using Difference of Gaussian. Very fast keypoint detector.
              3. Extract a SIFT descriptor from each keypoint.
              4. Compress and send your SIFT descriptors and DoG keypoints to the cloud, then have the servers perform spatial verification. You could also narrow your search space at this point as well.

          Anyway. That’s just a quick, high level overview. Thanks for the comments. I could see a blog post coming from something like this 🙂

          • Dinesh Vadhia February 11, 2014 at 11:54 am #

            I’d have thought sending the image to the server for feature extraction would be faster because the mobile device processors are also busy with other non-image processing stuff. Plus, a beefy multi-core server could ingest and spit out pretty quickly. Only way to know would be to benchmark both mobile and server methods to see which is fastest.

            A blog post on this area would be of value because it is most likely that content-based image search engines would be accessed from on-the-go mobile devices.

          • Adrian Rosebrock February 11, 2014 at 3:16 pm #

            You are exactly right by saying that content-based image search engines would be accessed from mobile devices. That trend will continue with the growth of mobile.

            However, consider that you need to transmit the image from your mobile device to your server. A typical image is around 1-2 MB, depending on your device. And while our mobile networks are getting faster, that’s still a fair amount of data to transmit. Once the server finally has the image, extracting the features would be extremely fast, but a potential issue is getting the image there in the first place. If you are able to resize the image to make it smaller prior to transmission, then this approach is a lot more feasible.

  2. Tomasz Malisiewicz February 10, 2014 at 9:02 pm #

    Hi Adrian,

    What you’ve described in the last comment reminds me of the Oxford visual search engine.

    http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/

    When using HOG for comparing full images, in my research I found it useful to not pre-compute the HOG descriptor for a single image frame, but instead perform a sliding window search around the target image. This, of course, makes things slower. But you can then match images which are not aligned, and get the optimal alignment in the process. This is what we did for our cross-domain visual search engine:

    http://graphics.cs.cmu.edu/projects/crossDomainMatching/

    Keep the blog posts coming.

    • Adrian Rosebrock February 11, 2014 at 7:16 am #

      Hi Tomasz,

      Thanks for the reply!

      Yep, I was describing something along the lines of the work by Philbin et al., although I believe they used a Harris keypoint detector. It certainly isn’t the only way to construct a BoVW model, but I normally find it to be a decent start.

      As for the cross-domain matching, was there a tolerance for rotation? You mentioned finding the “optimal alignment”, but does that include an object that had been rotated substantially?

  3. David February 16, 2014 at 9:51 pm #

    Has anyone managed to download the code and image search engine book? All I get at the link I was sent is:

    This XML file does not appear to have any style information associated with it. The document tree is shown below.
    Code: AccessDenied
    Message: Access Denied
    RequestId: F8B4F3FA2224D29C
    HostId: Oalz+h4XFN24w7KPndNY1ZetgERemCAZRmO35BuH2tcMybCrx6lLe08jcdszLbFR

    • Adrian Rosebrock February 17, 2014 at 2:00 pm #

      Hi David, please check the email I just sent you. I was having some problems with getting my email system setup to automatically send out the code and PDF. It looks like everything is resolved now, but if it’s not, please let me know and I’ll be glad to continue to look into it.

  4. Paula May 29, 2015 at 8:04 pm #

    Hi Adrian,

    I was wondering if I can use these steps of Image Search Engine in video streams?…applying it for each frame.
    Thanks

    • Adrian Rosebrock May 30, 2015 at 6:54 am #

      Hi Paula, you certainly could, but the problem is determining exactly which frame of the video you want to apply the search to. Most videos run at 32+ frames per second, so you would be running about 32 searches every single second. That’s probably a bit computationally wasteful.

  5. Rocky March 15, 2018 at 8:38 am #

    I want to build a custom Caffe model. What should I do? I want a car to detect a path through a Pi camera. The corners of the path are highlighted in a fixed colour. Using a neural network trained on a dataset, it should decide how to drive the car’s motors. Please help.

  6. Shreeyash June 19, 2018 at 6:02 am #

    can you please explain more about Texture descriptors?

    • Adrian Rosebrock June 19, 2018 at 8:23 am #

      Sure, I have 30+ lessons on feature extraction and various texture descriptors inside the PyImageSearch Gurus course.

  7. Luciano March 19, 2019 at 5:00 pm #

    Hello Adriano,

    Could you give any hints about using convolutional neural networks (CNNs) for feature extraction instead of traditional methods? Do you have any experience with this, or any tips? Is it possible to achieve similar or even better performance and accuracy with CNNs than with traditional methods?

Trackbacks/Pingbacks

  1. Building an Image Search Engine: Searching and Ranking (Step 4 of 4) - PyImageSearch - February 24, 2014

    […] Step 2: Indexing Your Dataset. Now that we have selected a descriptor, we can apply the descriptor to extract features from each and every image in our dataset. The process of extracting features from an image dataset is called “indexing”. These features are then written to disk for later use. Indexing is also a task that is easily made parallel by utilizing multiple cores/processors on our machine. […]

  2. Building an Image Search Engine: Defining Your Similarity Metric (Step 3 of 4) - PyImageSearch - April 28, 2014

    […] there, we moved on to Step 2: Indexing Your Dataset. Indexing is the process of quantifying our dataset by applying an image descriptor to extract […]
