Applying deep learning and a RBM to MNIST using Python

Restricted Boltzmann Machine Filters

In my last post, I mentioned that tiny, one pixel shifts in images can kill the performance of your Restricted Boltzmann Machine + Classifier pipeline when utilizing raw pixels as feature vectors.

Today I am going to continue that discussion.

And more importantly, I’m going to provide some Python and scikit-learn code that you can use to apply Restricted Boltzmann Machines to your own image classification problems.

OpenCV and Python versions:
This example will run on Python 2.7 and OpenCV 2.4.X/OpenCV 3.0+.

But before we jump into the code, let’s take a minute to talk about the MNIST dataset.

The MNIST Dataset

Figure 1: MNIST digit recognition sample

The MNIST dataset is one of the most well studied datasets in the computer vision and machine learning literature. In many cases, it’s a benchmark, a standard against which machine learning algorithms are ranked.

The goal of this dataset is to correctly classify the handwritten digits 0-9. We are not going to utilize the entire dataset (which consists of 60,000 training images and 10,000 testing images). Instead, we are going to utilize a small sample (3,000 for training, 2,000 for testing). The data points are approximately uniformly distributed per digit, so no substantial class label imbalance exists.

Each feature vector is 784-dim, corresponding to the 28 x 28 grayscale pixel intensities of the image. These grayscale pixel intensities are unsigned integers, falling into the range [0, 255].

All digits are placed on a black background, with the foreground being white and shades of gray.

Given these raw pixel intensities, we are going to first train a Restricted Boltzmann Machine on our training data to learn an unsupervised feature representation of the digits.

Then, we are going to take these “learned” features and train a Logistic Regression classifier on top of them.

To evaluate our pipeline, we’ll take the testing data and run it through our classifier and report the accuracy.

However, I mentioned in my previous post that simple one pixel translations of the testing set images can lead to accuracy dropping, even though these translations are so small they are barely (if at all) noticeable to the human eye.

To test this claim, we’ll generate a testing set four times larger than the original by translating each image one pixel up, down, left, and right.

Finally, we’ll pass this “nudged” dataset through our pipeline and report our results.

Sound good?

Let’s examine some code.

Applying a RBM to the MNIST Dataset Using Python

The first thing we’ll do is create a file, rbm.py, and start importing the packages we need:

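We’ll start by importing the train_test_split function from the cross_validation sub-package of scikit-learn (in scikit-learn 0.18 and later this function lives in the model_selection sub-package, and cross_validation was removed in 0.20). The train_test_split function will make it dead simple for us to create our training and testing splits of the MNIST dataset.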

Next, we’ll import the classification_report function from the metrics sub-package, which we’ll use to produce a nicely formatted accuracy report on (1) the overall system and (2) the accuracy of each individual class label.

On Line 4 we’ll import the classifier we’ll be using throughout this example — a LogisticRegression classifier.

I mentioned that we’ll be using a Restricted Boltzmann Machine to learn an unsupervised representation of our raw pixel values. This will be handled by the BernoulliRBM class in the neural_network sub-package of scikit-learn.

The BernoulliRBM implementation (as the name suggests) consists of binary visible units and binary hidden nodes. The algorithm itself is O(d^2), where d is the number of components to be learned (the scikit-learn documentation states this under the assumption d ~ n_features ~ n_components).

In order to find optimal values of the coefficient C for Logistic Regression, along with the optimal learning rate, number of iterations, and number of components for our RBM, we’ll need to perform a cross-validated grid search over the hyper-parameter space. The GridSearchCV class (which we import on Line 6) will take care of this search for us.

Next, we’ll need the Pipeline class, imported on Line 7. This class allows us to define a series of steps using the fit and transform methods of scikit-learn estimators.

Our classification pipeline will consist of first training a BernoulliRBM to learn an unsupervised representation of the feature space, followed by training a LogisticRegression classifier on the learned features.

Finally, we import NumPy for numerical processing, argparse to parse command line arguments, time to track the amount of time it takes for a given model to train, and cv2 for our OpenCV bindings.

But before we get too far, we first need to set up some functions to load and manipulate our MNIST dataset:

The load_digits function, as the name suggests, loads our MNIST digit dataset off disk. The function takes a single parameter, datasetPath, which is the path to where the dataset CSV file resides.

We load the CSV file off disk using the np.genfromtxt function, grab the class labels (which are the first column of the CSV file) on Line 17, followed by the actual raw pixel feature vectors on Line 18. These feature vectors are 784-dim, corresponding to the 28 x 28 flattened representation of the grayscale digit image.

Finally, we return a tuple of our feature vector matrix and class labels on Line 21.
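
As a rough illustration of the loader described above, here is a minimal sketch. It assumes the CSV has no header row and that every row holds a label followed by 784 pixel values; the parameter name datasetPath follows the text, everything else is my own guess at a reasonable layout rather than the exact listing.

```python
import numpy as np

def load_digits(datasetPath):
    # Read the whole CSV as unsigned 8-bit integers (pixel values are 0-255).
    # Assumption: no header row, comma-separated values.
    data = np.genfromtxt(datasetPath, delimiter=",", dtype="uint8")

    # The first column holds the class labels, the rest are the pixels
    y = data[:, 0]
    X = data[:, 1:]

    # Return the feature matrix and the labels as a tuple
    return (X, y)
```

Note that np.genfromtxt returns a one-dimensional array for a single-row file, so this sketch expects at least two rows.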

Next up, we need a function to apply some pre-processing to our data.

The BernoulliRBM assumes that the columns of our feature vectors fall within the range [0, 1]. However, the MNIST dataset is represented as unsigned 8-bit integers, falling within the range [0, 255].

To scale the columns into the range [0, 1], all we need to do is define a scale function:

The scale function takes two parameters, our data matrix X and an epsilon value used to prevent division by zero errors.

This function is fairly self-explanatory. For each of the 784 columns in the matrix, we subtract the minimum of the column from each value and divide by the maximum of the column (plus epsilon). By doing this, we have ensured that the values of each column fall into the range [0, 1].
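
A minimal sketch of such a scaling function might look like the following. The default eps value of 0.001 is an assumption chosen for illustration; the rest follows the description above.

```python
import numpy as np

def scale(X, eps=0.001):
    # Subtract each column's minimum from its values, then divide by the
    # column's maximum. eps keeps the division safe for columns that are
    # zero everywhere (e.g. border pixels that are always black).
    return (X - np.min(X, axis=0)) / (np.max(X, axis=0) + eps)
```

Because of the eps term the largest value in a column lands just below 1 rather than exactly on it, which is fine for our purposes.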

Now we need one last function: a method to generate a “nudged” dataset four times larger than the original, translating each image one pixel up, down, left, and right.

To handle this nudging of the dataset, we’ll create the nudge function:

The nudge function takes two parameters: our data matrix X and our class labels y.

We start by initializing a list of our (x, y) translations, followed by our new data matrix and target labels on Lines 32-34.

Then, we start looping over each of the images and class labels on Line 37.

As I mentioned, each image is represented as a 784-dim feature vector, corresponding to the 28 x 28 digit image.

However, to utilize the cv2.warpAffine function to translate our image, we first need to reshape the 784 feature vector into a two dimensional array of shape (28, 28) — this is handled on Line 40.

Next up, we start looping over our translations on Line 43.

We construct our actual translation matrix M on Line 45 and then apply the translation by calling the cv2.warpAffine function on Line 46.

We are then able to update our new, nudged data matrix on Line 48 by flattening the 28 x 28 image back into a 784-dim feature vector.

Our class label target list is then updated on Line 50.

Finally, we return a tuple of the new data matrix and class labels on Line 53.
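
To make the walkthrough concrete, here is a sketch of the same idea. Please note the swap: instead of building a translation matrix and calling cv2.warpAffine, this version shifts the pixels with plain NumPy slicing and fills the exposed edge with zeros, so it runs without OpenCV. The shift helper is hypothetical and only exists for this illustration.

```python
import numpy as np

def shift(image, tx, ty):
    # Translate a 2-D image by (tx, ty) pixels; exposed edges become 0 (black)
    h, w = image.shape
    out = np.zeros_like(image)
    out[max(0, ty):min(h, h + ty), max(0, tx):min(w, w + tx)] = \
        image[max(0, -ty):min(h, h - ty), max(0, -tx):min(w, w - tx)]
    return out

def nudge(X, y):
    # One-pixel (tx, ty) offsets: up, down, left, right
    translations = [(0, -1), (0, 1), (-1, 0), (1, 0)]
    data = []
    target = []

    for (image, label) in zip(X, y):
        # Reshape the 784-dim vector back into a 28 x 28 image
        image = image.reshape(28, 28)

        for (tx, ty) in translations:
            # Translate, flatten back to 784-dim, and keep the same label
            data.append(shift(image, tx, ty).flatten())
            target.append(label)

    return (np.array(data), np.array(target))
```

Each input image produces four rows in the output, in the order up, down, left, right.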

These three helper functions, while quite simple in nature, are critical to setting up our experiment.

Now we can finally start putting the pieces together:

Lines 56-63 handle parsing our command line arguments. Our rbm.py script requires three arguments: --dataset, which is the path to where our MNIST .csv file resides on disk, --test, the percentage of data to use for our testing split (the rest used for training), and --search, an integer used to determine if a grid search should be performed to tune hyper-parameters.

A value of 1 for --search indicates that a grid search should be performed; a value of 0 indicates that the grid search has already been performed and the model parameters for both the BernoulliRBM and LogisticRegression models have already been manually set.
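
Here is one way the argument parsing could be sketched with argparse. The short flags (-d, -t, -s), the help strings, and the example path data/digits.csv are all assumptions made for illustration, and I pass an explicit argument list so the snippet runs on its own.

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
                help="path to the MNIST .csv file")
ap.add_argument("-t", "--test", required=True, type=float,
                help="fraction of the data to use for the testing split")
ap.add_argument("-s", "--search", type=int, default=0,
                help="1 to run the grid search, 0 to use preset parameters")

# In the real script this would be ap.parse_args() with no arguments;
# the list below (including the hypothetical path) is only for illustration.
args = vars(ap.parse_args(["--dataset", "data/digits.csv",
                           "--test", "0.4", "--search", "1"]))
```

A test value of 0.4 matches the 3,000 / 2,000 split mentioned earlier.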

Now that our command line arguments have been parsed, we can load our dataset off disk on Line 69. We then convert it to the floating point data type on Line 70 and scale the feature vector columns to fall into the range [0, 1] using our scale function on Line 71.

In order to evaluate our system we need two sets of data: a training set and a testing set. Our pipeline will be trained using the training data, and then evaluated using the testing set to ensure our accuracy reports are not biased.

To generate our training and testing splits, we’ll call the train_test_split function on Line 74. This function automatically generates our splits for us.
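
As a quick sketch of the split, assuming a recent scikit-learn where train_test_split lives in model_selection, and using random stand-in data rather than the real digits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 50 fake "images" with 784 features and labels 0-9
rng = np.random.RandomState(42)
X = rng.rand(50, 784)
y = rng.randint(0, 10, size=50)

# Hold out 40% of the rows for testing; random_state makes the split repeatable
(trainX, testX, trainY, testY) = train_test_split(
    X, y, test_size=0.4, random_state=42)
```

The test_size and random_state values here are illustrative choices, not values taken from the original script.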

A check is made on Line 78 to see if a grid search should be performed to tune the hyper-parameters of our pipeline.

If a grid search is to be performed, we first search over the coefficient C of the Logistic Regression classifier on Lines 81-85. We’ll be evaluating our approach using just a Logistic Regression classifier on the raw pixel data AND a Restricted Boltzmann Machine + Logistic Regression classifier, hence we need to independently search the coefficient C space.

Lines 89-97 then print out the optimal parameter values for our standard Logistic Regression classifier.
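
A search over C alone could be sketched like this. The candidate values, the 3-fold setting, max_iter, and the random stand-in data are all assumptions for illustration; the import path assumes a recent scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in training data: 60 samples, 20 features, three balanced classes
rng = np.random.RandomState(0)
trainX = rng.rand(60, 20)
trainY = np.arange(60) % 3

# Candidate values for the regularization coefficient C
params = {"C": [1.0, 10.0, 100.0]}
gs = GridSearchCV(LogisticRegression(max_iter=200), params, cv=3)
gs.fit(trainX, trainY)

# Report the best cross-validated score and the winning parameters
print("best score: %0.3f" % gs.best_score_)
bestParams = gs.best_estimator_.get_params()
for name in sorted(params.keys()):
    print("\t %s: %r" % (name, bestParams[name]))
```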

Now we can move on to our pipeline: a BernoulliRBM and a LogisticRegression classifier used together.

We define our pipeline on Lines 100-102, consisting of our Restricted Boltzmann Machine and a Logistic Regression classifier.

However, now we have more parameters to search over than just the coefficient C of the Logistic Regression classifier. Now we also have to search over the number of iterations, number of components (i.e. the size of the resulting feature space), and the learning rate of the RBM. We define this search space on Lines 108-112.

We start the grid search on Lines 115-117.

The optimal parameters for the pipeline are then displayed on Lines 121-129.
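
The pipeline search described above could be sketched roughly as follows. The step names "rbm" and "logistic" are my own labels, and they determine the prefixes used in the parameter grid (step name, two underscores, parameter name). The grid values and the random stand-in data are assumptions kept deliberately small so the example runs quickly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Stand-in training data already scaled into [0, 1]
rng = np.random.RandomState(0)
trainX = rng.rand(60, 20)
trainY = np.arange(60) % 3

# The RBM learns features; Logistic Regression classifies them
rbm = BernoulliRBM(random_state=0)
logistic = LogisticRegression(max_iter=200)
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])

# Search space: RBM learning rate, iterations, and components, plus C
params = {
    "rbm__learning_rate": [0.1, 0.01],
    "rbm__n_iter": [5, 10],
    "rbm__n_components": [10, 20],
    "logistic__C": [1.0, 10.0],
}

gs = GridSearchCV(classifier, params, cv=3)
gs.fit(trainX, trainY)

# Display the best score and the selected hyper-parameters
print("best score: %0.3f" % gs.best_score_)
for name in sorted(params.keys()):
    print("\t %s: %r" % (name, gs.best_params_[name]))
```

Even this tiny grid trains 16 parameter combinations across 3 folds, which gives a sense of why the full search takes so long.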

To determine the optimal values for our pipeline, execute the following command:

You might want to make a cup of coffee or go for a nice long walk while the grid space is searched. For each of our parameter selections, a model has to be trained and cross-validated. It’s definitely not a fast operation. But it’s the price you pay for optimal parameters, which are crucial when utilizing a Restricted Boltzmann Machine.

After a long walk, you should see that the following optimal values have been selected:

Awesome. Our hyper-parameters have been tuned.

Let’s set these parameters and evaluate our classification pipeline:

To obtain a baseline accuracy, we’ll train a standard Logistic Regression classifier on the raw pixel feature vectors (no unsupervised learning) on Lines 140 and 141. The accuracy of the baseline is then printed out on Line 143 using the classification_report function.

We then construct our BernoulliRBM + LogisticRegression classifier pipeline and evaluate it on our testing data on Lines 147-155.

But what happens when we nudge our testing set by translating each image one pixel up, down, left, and right?

To find out, we nudge our dataset on Line 162 and then re-evaluate it on Line 163.
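
Here is a compact sketch of the three evaluations: baseline, RBM pipeline, and RBM pipeline on shifted test images. To keep it self-contained it uses scikit-learn’s bundled 8 x 8 digits as a stand-in for the MNIST CSV, shifts the test images one pixel to the right only (rather than in all four directions), and uses hyper-parameter values that are placeholders rather than tuned results. Expect the numbers to differ from the ones reported in this post.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Stand-in data: 8 x 8 digits with pixel values 0-16, scaled into [0, 1]
digits = datasets.load_digits()
X = digits.data / 16.0
y = digits.target
(trainX, testX, trainY, testY) = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Baseline: Logistic Regression trained directly on the raw pixels
baseline = LogisticRegression(C=1.0, max_iter=1000)
baseline.fit(trainX, trainY)
print("LOGISTIC REGRESSION ON ORIGINAL DATASET")
print(classification_report(testY, baseline.predict(testX)))

# RBM features feeding a Logistic Regression classifier
rbm = BernoulliRBM(n_components=100, n_iter=20,
                   learning_rate=0.05, random_state=0)
logistic = LogisticRegression(C=100.0, max_iter=1000)
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])
classifier.fit(trainX, trainY)
print("RBM + LOGISTIC REGRESSION ON ORIGINAL DATASET")
print(classification_report(testY, classifier.predict(testX)))

# Shift every test image one pixel to the right, then re-evaluate
side = 8
shifted = np.zeros_like(testX).reshape(-1, side, side)
shifted[:, :, 1:] = testX.reshape(-1, side, side)[:, :, :-1]
nudgedX = shifted.reshape(-1, side * side)
print("RBM + LOGISTIC REGRESSION ON NUDGED DATASET")
print(classification_report(testY, classifier.predict(nudgedX)))
```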

To evaluate our system, issue the following command:

After a few minutes, we should have some results to look at.

Results

The first set of results is our Logistic Regression classifier trained strictly on the raw pixel feature vectors:

Using this approach, we were able to achieve 89% accuracy. Not bad for using just the pixel intensities as our feature vectors.

But look what happens when we train our Restricted Boltzmann Machine + Logistic Regression pipeline:

Our accuracy is able to increase from 89% to 93%! That’s definitely a significant jump!

But now the problem starts…

What happens when we nudge the dataset, translating each image one pixel up, down, left, and right?

I mean, these shifts are so small they would be barely (if at all) recognizable to the human eye.

Surely that can’t be a problem, can it?

Well, it turns out, it is:

After nudging our dataset, the RBM + Logistic Regression pipeline drops down to 88% accuracy: 5 percentage points below the original testing set and 1 point below the baseline Logistic Regression classifier.

So now you can see the issue of using raw pixel intensities as feature vectors. Even tiny shifts in the image can cause accuracy to drop.

But don’t worry, there are ways to fix this issue.

How Do We Fix the Translation Problem?

There are two ways that researchers in neural networks, deep nets, and convolutional networks address the shifting and translation problem.

The first way is to generate extra data at training time.

In this post, we nudged our dataset after training to see its effect on classification accuracy. However, we can also nudge our dataset before training in an attempt to make our model more robust.
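
As a sketch of what nudging before training could look like, the hypothetical augment helper below returns the original training images plus four one-pixel shifted copies of each. It pads every image with a one-pixel black border and then slices 28 x 28 windows at different offsets, which is my own shortcut and not something taken from the original script.

```python
import numpy as np

def augment(X, y, side=28):
    # Pad every image with a one-pixel border of zeros (black)
    images = X.reshape(-1, side, side)
    padded = np.pad(images, ((0, 0), (1, 1), (1, 1)), mode="constant")

    # (row, col) window starts: original, shifted down, up, right, left
    offsets = [(1, 1), (0, 1), (2, 1), (1, 0), (1, 2)]
    stacks = [padded[:, r:r + side, c:c + side] for (r, c) in offsets]

    # Stack the five versions and repeat the labels in the same order
    X_aug = np.concatenate(stacks).reshape(-1, side * side)
    y_aug = np.tile(y, len(offsets))
    return (X_aug, y_aug)
```

You would call this on the training split only, for example (trainX, trainY) = augment(trainX, trainY), and leave the testing split untouched.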

The second method is to randomly select regions from our training images, rather than using them in their entirety.

For example, instead of using the entire 28 x 28 image, we could randomly sample a 24 x 24 region from the image. Done enough times, over enough training images, we can mitigate the translation issue.
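
A random crop like the one described above could be sketched as follows. The function name random_crop and its rng parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np

def random_crop(image, size=24, rng=None):
    # Use the caller's random generator if given, otherwise create one
    rng = np.random.RandomState() if rng is None else rng
    (h, w) = image.shape

    # Pick a top-left corner so that the crop stays inside the image
    top = rng.randint(0, h - size + 1)
    left = rng.randint(0, w - size + 1)

    return image[top:top + size, left:left + size]
```

For a 28 x 28 image and a 24 x 24 crop there are 5 x 5 = 25 possible positions, so repeated sampling exposes the model to many slightly shifted views of each digit. Keep in mind the classifier would then be trained on 576-dim vectors instead of 784-dim ones.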

Summary

In this blog post we’ve demonstrated that even small, one pixel translations in images that are nearly indistinguishable to the human eye are able to hurt the performance of our classification pipeline.

The reason we see this drop in accuracy is that we are utilizing raw pixel intensities as feature vectors.

Furthermore, translations are not the only deformation that can cause loss in accuracy when utilizing raw pixel intensities as features. Rotations, transformations, and even noise when capturing the image can have a negative impact on model performance.

In order to handle these situations we can (1) generate additional training data in an attempt to make our model more robust and/or (2) sample randomly from the image rather than using it in its entirety.

In practice, neural nets and deep learning approaches are substantially more robust than this experiment. By stacking layers and learning a set of convolutional kernels, deep nets are able to handle many of these issues.

Still, this experiment is important if you are just starting out and using raw pixel intensities as your feature vectors.

Either be prepared to spend a lot of time pre-processing your data, or be sure you know how to utilize classification models to handle situations where your data may not be as “clean” and nicely pre-processed as you want it.

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!

21 Responses to Applying deep learning and a RBM to MNIST using Python

  1. Joel June 24, 2014 at 9:06 pm #

    Super interesting. Though if you’re arguing that it’s the nature of the RBM that’s causing this loss in accuracy (rather than testing on data that was not trained upon), it would be great to see how plain logistic performed against the nudged dataset.

    • Adrian Rosebrock June 25, 2014 at 6:25 am #

      It’s definitely not the nature of the RBM, anytime you take a supervised learning model and perturb the testing set, you will see a loss in accuracy — this is especially true when working with raw pixel intensities.

      The bigger point that I am trying to make is that extremely special care needs to be taken when using raw pixel intensities as feature vectors. Even tiny one pixel translations or rotations can really hurt performance, hence the incredible care that needs to be placed in pre-processing.

      This is why Convolutional Neural Nets are the best performing classification models for image datasets (for the time being, at least). They are able to abstract the raw pixel intensities and learn a set of convolution kernels that can (somewhat) account for the translations and rotations.

  2. vin December 12, 2015 at 8:08 pm #

    hi adrian!

    why did you use RBM for dimensionality reduction, rather than another technique like PCA? thanks!

    • Adrian Rosebrock December 13, 2015 at 7:36 am #

      Simply put: reconstruction and probability. An RBM is a generative, stochastic network that learns a probability distribution over its set of inputs. PCA uses an eigenvalue decomposition to find the most informative components. These components can be used to “reconstruct” an input. RBMs learn a distribution over these inputs which can be used to reconstruct the inputs; however, the “intermediate” components that RBMs learn are more suitable as “features” and inputs to classifiers.

      As for when or where you should use RBMs vs. PCA, that’s quite problem specific and you should “spot check” your algorithms to see which gives better performance.

  3. Ying Yi Wu March 13, 2016 at 8:40 am #

    After the images in train.txt have been trained and the images in val.txt have been tested,
    how do I get the “precision” and “recall” from the val.txt file (which has been tested by a Caffe model)?

    • Adrian Rosebrock March 13, 2016 at 10:13 am #

      This blog post doesn’t cover how to use Caffe, but if you already have a model trained using Caffe and want to test it against other data points, I recommend using the classification_report function inside of scikit-learn.

  4. Vimal April 13, 2016 at 9:38 pm #

    Adrian,

    I am having trouble understanding the data type that is provided by mnist. there are enough examples on the web on how to use mnist dataset and python.

    but i want to train my own dataset and mix my data with mnist. I also want to train to recognize characters.

    usually i go about creating a 28×28 image for training. but i don’t quite seem to understand the data types part. is there a tool to convert the mnist to gifs or jpgs?

    • Adrian Rosebrock April 14, 2016 at 4:49 pm #

      Are you referring to the binary MNIST dataset? Or the MNIST dataset provided by scikit-learn? I would recommend using the scikit-learn representation. It’s simply a NumPy array where each row is a flattened 28 x 28 image, thus each row has 784 entries. If you would like to train your own classifier, then you need to “flatten” your images in the same manner, where each row is a single image. I cover how to do this in more detail inside Practical Python and OpenCV.

  5. Zheo long er November 28, 2016 at 9:31 pm #

    Hi Adrian!
    What’s the real meaning/relationship between hidden and visible units, when using RBM model?

    • Adrian Rosebrock November 29, 2016 at 7:58 am #

      I’m not sure what you are saying by “real meaning”? Can you please elaborate?

      • Zheo long er November 30, 2016 at 2:23 am #

        Hi Adrian,
        First, thanks for your reply. When we use PCA to reduce dimensionality, we get k new features, and those features are linear combinations of the original features. So my question is: when we use an RBM to reduce dimensionality we also get some new features, but I don’t know what the relationship between the hidden units and the visible units is. That is to say, how can we explain the new features in terms of the original features when we use an RBM to reduce dimensionality? Thanks.

        • Adrian Rosebrock December 1, 2016 at 7:40 am #

          Think of the purpose of an RBM as reconstructing the original data using fewer inputs. Just as PCA reduces dimensionality and its principal components can be used to reconstruct the original inputs, the hidden units of an RBM can be used to reconstruct the visible units. This tutorial does an excellent job explaining the relationship between input and hidden units.

  6. Soham Jani December 1, 2016 at 1:53 pm #

    Thank you for such an informative tutorial.
    Is there a way to have an RBM network with 2 or more hidden layers using sklearn?
    I’m using a leaf data set, which does not have the pixel information, but instead has features obtained from the images, like texture shape, etc. I’m surprised that the combination provides almost a 96% accuracy. Does this seem strange ?

    • Adrian Rosebrock December 5, 2016 at 1:48 pm #

      An RBM is only intended to have visible and hidden nodes. You can stack multiple RBMs on top of each other to obtain a Deep Belief Network (DBN). Is that what you mean? I would suggest reading more about DBNs and classification before continuing.

  7. Gavin Hartnett January 27, 2017 at 1:58 pm #

    Thanks for the great article! Very helpful. I have one confusion however.

    You say:
    “The BernoulliRBM implementation (as the name suggests), consists of binary visible units and binary hidden nodes.”

    But then you later say:
    “The BernoulliRBM assumes that the columns of our feature vectors fall within the range [0, 1]. However, the MNIST dataset is represented as unsigned 8-bit integers, falling within the range [0, 255]. To scale the columns into the range [0, 1], all we need to do is define a scale …”

    Why are you allowed to take the MNIST visible units to be real valued in [0,1] when the RBM model assumes binary values? Thanks!

    • Adrian Rosebrock January 28, 2017 at 6:51 am #

      Hey Gavin, you are correct. BernoulliRBMs are intended for binary units. However, keep in mind that the MNIST images are (essentially) binary images. The foreground is represented as “white” (255) while the background is black (0). Dividing by 255 yields values of 0 and 1. Thus, they can be fed into the RBM.

  8. Gavin Hartnett January 28, 2017 at 8:59 am #

    Hi Adrian,

    Thanks for the reply. Sorry, but I am still a bit confused because your scaling doesn’t give strictly binary values (right?).

    Starting with the 0-255 valued discrete MNIST data, one option would be to process the data to be binary, perhaps by keeping the 0’s as is, and letting any pixel not 0 be 1 (for on). Another option would be to make up a rule, like any pixel with intensity > 100 is set to 1 and any <= 100 is set to 0. Then the Bernoulli RBM would be appropriate for the processed data because the values would be strictly binary.

    Obviously doing any of these options is less than ideal because you lose some information, so in some sense letting the data be real valued in the interval [0,1] might be preferable. But then, strictly speaking, the Bernoulli RBM model isn't appropriate as it assumes binary values.

    So would I be correct in interpreting this implementation of the Bernoulli RBM as not being strictly correct (because the underlying mathematical formulae of the model rely on the broken assumption of strictly binary values), but nonetheless the implementation performs well and is probably a better option than one of the scenarios I outlined above where information is lost? I understand that machine learning is a mix of mathematics and engineering, and at the end of the day a given algorithm is being used to perform some task, and the important thing is whether it does that task well, not whether it is being 100% consistently implemented. Is this the crux of the matter here? By the way, I don't intend this to be a criticism, but I've seen a few people do this rescaling and I'm trying to make sure I understand it!

    • Adrian Rosebrock January 29, 2017 at 2:51 pm #

      I think you might have been confused by my original comment. If we assume the MNIST digits are already thresholded, then we have two pixel values: 255 (white, the foreground) and 0 (black, the background). If you divide all pixel values by 255, then they are all in the range [0, 1]. In fact, if the MNIST images are thresholded the only possible values are 0 and 1.

      You are also correct in saying that machine learning is a mix of mathematics and engineering. We often relax some of the strict theoretical concepts if they work well in practice.

      A better option for MNIST classification would be to use a Convolutional Neural Network (CNN). I cover LeNet for digit recognition here. More information on deep learning can be found inside the PyImageSearch Gurus course and in my upcoming deep learning book.

  9. Gavin Hartnett January 29, 2017 at 5:49 pm #

    Ah, I see. Good, then everything makes sense now, thanks very much!

  10. Naz March 26, 2017 at 4:59 am #

    Hi. Where can I actually download the appropriately formatted MNIST .csv file?

    • Adrian Rosebrock March 28, 2017 at 1:08 pm #

      The MNIST .csv file is included in the “Downloads” section of this tutorial.
