First, let’s have a quick review.
Two weeks ago we explored the first step of building an image search engine: Defining Your Image Descriptor. We explored three aspects of an image that can easily be described: color, texture, and shape.
From there, we moved on to Step 2: Indexing Your Dataset. Indexing is the process of quantifying our dataset by applying an image descriptor to extract features from every image in our dataset.
Indexing is also a task that can easily be made parallel — if our dataset is large, we can easily speedup the indexing process by using multiple cores/processors on our machine.
Finally, regardless if we are using serial or parallel processing, we need to write our resulting feature vectors to disk for later use.
Now, it’s time to move onto the third step of building an image search engine: Defining Your Similarity Metric
Defining Your Similarity Metric
Today we are going to perform a CURSORY review of different kinds of distance and similarity metrics that we can use to compare two feature vectors.
Note: Depending on which distance function you use, there are many, many “gotchas” you need to look out for along the way. I will be reviewing each and every one of these distance functions and providing examples on how to properly use them later on in this blog, but don’t blindly apply a distance function to feature vectors without first understanding how the feature vectors should be scaled, normalized, etc., otherwise you might get strange results.
So what’s the difference between a distance metric and a similarity metric?
In order to answer this question we first need to define some variables.
d be our distance function, and
z be real-valued feature vectors, then the following conditions must hold:
d(x, y) >= 0. This simply means that our distance must be non-negative.
- Coincidence Axiom:
d(x, y) = 0if and only if
x = y. A distance of zero (meaning the vectors are identical) is only possible if the two vectors have the same value.
d(x, y) = d(y, x). In order for our distance function to be considered a distance metric, the order of the parameters in the distance should not matter. Specifying
d(x, y)instead of
d(y, x)should not matter to our distance metric and both function calls should return the same value.
- Triangle Inequality:
d(x, z) <= d(x, y) + d(y, z). Do you remember back to your high school trig class? All this condition states is that the sum of the lengths of any two sides must be greater than the remaining side.
If all four conditions hold, then our distance function can be considered a distance metric.
So does this mean that we should only use distance metrics and ignore other types of similarity metrics?
Of course not!
But it’s important to understand the terminology, especially when you go off building image search engines on your own.
Let’s take a look at five of the more popular distance metrics and similarity functions. I’ve included links to the function’s corresponding SciPy documentation just in case you want to play with these functions yourself.
- Euclidean: Arguably the most well known and must used distance metric. The euclidean distance is normally described as the distance between two points “as the crow flies”.
- Manhattan: Also called “Cityblock” distance. Imagine yourself in a taxicab taking turns along the city blocks until you reach your destination.
- Chebyshev: The maximum distance between points in any single dimension.
- Cosine: We won’t be using this similarity function as much until we get into the vector space model, tf-idf weighting, and high dimensional positive spaces, but the Cosine similarity function is extremely important. It is worth noting that the Cosine similarity function is not a proper distance metric — it violates both the triangle inequality and the coincidence axiom.
- Hamming: Given two (normally binary) vectors, the Hamming distance measures the number of “disagreements” between the two vectors. Two identical vectors would have zero disagreements, and thus perfect similarity.
This list is by no means exhaustive. There are an incredible amount of distance functions and similarity measures. But before we start diving into details and exploring when and where to use each distance function, I simply wanted to provide a 10,000ft overview of what goes into building an image search engine.
Now, let’s compute some distances:
>>> from scipy.spatial import distance as dist
>>> import numpy as np
>>> x = np.random.rand(4)
>>> y = np.random.rand(4)
array([ 0.37454012, 0.95071431, 0.73199394, 0.59865848])
array([ 0.15601864, 0.15599452, 0.05808361, 0.86617615])
The first thing we are going to do is import the packages that we need: the
distance module of SciPy and NumPy itself. From there, we need to seed the pseudo-random number generator with an explicit value. By providing an explicit value (in this case, 42), it ensures that if you were to execute this code yourself, you would have the same results as me.
Finally, we generate our “feature vectors”. These are real-valued lists of length 4, with values in the range [0, 1].
Time to compute some actual distances:
>>> dist.euclidean(x, y)
>>> dist.cityblock(x, y)
>>> dist.chebyshev(x, y)
So what does this tell us?
Well, the Euclidean distance is smaller than the Manhattan distance. Intuitively, this makes sense. The Euclidean distance is “as the crow flies” (meaning that it can take the shortest path between two points, like an airplane flying from one airport to the other). Conversely, the Manhattan distance more closely resembles driving through the blocks of a city — we are making sharp angled turns, like driving on a piece of grid paper, thus the Manhattan distance is larger since it takes us longer to travel the distance between the two points.
Finally, the Chebyshev distance is the maximum distance between any two components in the vector. In this case, a maximum distance of 0.794 is found from |0.95071431 – 0.15599452|.
Now, let’s take a look at the Hamming distance:
>>> x = np.random.random_integers(0, high = 1, size =(4,))
>>> y = np.random.random_integers(0, high = 1, size =(4,))
array([1, 1, 1, 0])
array([1, 0, 1, 1])
>>> dist.hamming(x, y)
In the previous example we had real-valued feature vectors. Now we have binary feature vectors. The Hamming distance compares the number of mis-matches between the
y feature vectors. In this case it finds two mis-matches — the first being at
y and the second being at
Given that we have two mis-matches and the vectors are of length four, the ratio of mis-matches to the length of the vector is 2 / 4 = 0.5, thus our Hamming distance.
The first step in building an image search engine is to decide on an image descriptor. From there, the image descriptor can be applied to each image in the dataset and a set of features extracted. This process is called “indexing a dataset”. In order to compare two feature vectors and determine how “similar” they are, a similarity measure is required.
In this blog post we performed an cursory exploration of distance and similarity functions that can be used to measure how “similar” two feature vectors are.
Popular distance functions and similarity measures include (but are certainly not limited to): Euclidean distance, Manhattan (city block), Chebyshev, Cosine distance, and Hamming.
Not only will we explore these distance functions in further detail later on in this blog, but I will also introduce many more, including methods geared specifically to comparing histograms, such as the correlation, intersection, chi-squared, and Earth Movers Distance methods.
At this point you should have a basic idea of what it takes to build an image search engine. Start extracting features from images using simple color histograms. Then compare them using the distance functions discussed above. Note your findings.
Finally, sign-up for my Newsletter below to get updated when I post new image search engine articles. In return, I’ll send you an awesome 11-page resource guide on computer vision and image search engines.