Stochastic Gradient Descent (SGD) with Python

Figure 2: The loss function associated with Stochastic Gradient Descent. Loss continues to decrease as we allow more epochs to pass.

In last week’s blog post, we discussed gradient descent, a first-order optimization algorithm that can be used to learn a set of classifier coefficients for parameterized learning.

However, the “vanilla” implementation of gradient descent can be prohibitively slow to run on large datasets — in fact, it can even be considered computationally wasteful.

Instead, we should apply Stochastic Gradient Descent (SGD), a simple modification to the standard gradient descent algorithm that computes the gradient and updates our weight matrix W on small batches of training data, rather than the entire training set itself.

While this leads to “noisier” weight updates, it also allows us to take more steps along the gradient (1 step for each batch versus 1 step per epoch), which typically leads to faster convergence with no negative effects on loss or classification accuracy.

To learn more about Stochastic Gradient Descent, keep reading.


Stochastic Gradient Descent (SGD) with Python

Taking a look at last week’s blog post, it should be (at least somewhat) obvious that the gradient descent algorithm will run very slowly on large datasets. The reason for this “slowness” is that each iteration of gradient descent requires that we compute a prediction for each training point in our training data.

For image datasets such as ImageNet where we have over 1.2 million training images, this computation can take a long time.

It also turns out that computing predictions for every training data point before taking a step and updating our weight matrix W is computationally wasteful (and doesn’t help us in the long run).

Instead, what we should do is batch our updates.

Updating our gradient descent optimization algorithm

Before I discuss Stochastic Gradient Descent in more detail, let’s first look at the original gradient descent pseudocode and then the updated, SGD pseudocode, both inspired by the CS231n course slides.

Below follows the pseudocode for vanilla gradient descent:
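A minimal runnable sketch of that loop might look like the following. It assumes a toy linear least-squares problem, and the names evaluate_gradient, data, alpha, and num_steps are illustrative placeholders rather than the exact code from the course slides:

```python
import numpy as np

def evaluate_gradient(data, W):
    # Gradient of the mean squared error of a linear model (an assumed toy loss).
    X, y = data
    error = X.dot(W) - y
    return 2 * X.T.dot(error) / X.shape[0]

def gradient_descent(data, alpha=0.1, num_steps=200):
    X, y = data
    W = np.zeros(X.shape[1])
    for step in range(num_steps):
        # One weight update per pass over the ENTIRE training set.
        Wgradient = evaluate_gradient(data, W)
        W += -alpha * Wgradient
    return W

# Toy data generated from y = 3 * x, so the learned weight should approach 3.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
W = gradient_descent((X, y))
```

Every update here touches all of the training data, which is exactly the cost that becomes a problem on large datasets.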

And here we can see the pseudocode for Stochastic Gradient Descent:
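As a rough runnable sketch, assuming the same kind of toy linear least-squares problem: the only structural change is an inner loop that pulls batches from a next_training_batch helper. The helper body, the batch size of 2, and the learning rate are assumptions chosen for illustration:

```python
import numpy as np

def next_training_batch(data, batch_size):
    # Yield consecutive slices of the (ideally pre-shuffled) training data.
    X, y = data
    for i in range(0, X.shape[0], batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

def evaluate_gradient(batch, W):
    # Gradient of the mean squared error of a linear model on one batch.
    X, y = batch
    error = X.dot(W) - y
    return 2 * X.T.dot(error) / X.shape[0]

def sgd(data, alpha=0.05, num_epochs=100, batch_size=2):
    X, y = data
    W = np.zeros(X.shape[1])
    for epoch in range(num_epochs):
        for batch in next_training_batch(data, batch_size):
            # One weight update per BATCH rather than per full pass.
            Wgradient = evaluate_gradient(batch, W)
            W += -alpha * Wgradient
    return W

# Toy data generated from y = 3 * x, so the learned weight should approach 3.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X[:, 0]
W = sgd((X, y))
```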

As you can see, the implementations are quite similar.

The only difference between vanilla gradient descent and Stochastic Gradient Descent is the addition of the next_training_batch function. Rather than computing our gradient over the entire data set, we sample our data, yielding a batch.

We then evaluate the gradient on this batch and update our weight matrix W.

Note: From an implementation perspective, we also randomize our training samples before applying SGD.
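One way to do this randomization, sketched here with NumPy, is to draw a single random permutation of row indices and apply it to both the features and the labels so that each pair stays aligned. The helper name shuffle_together and the seed value are hypothetical:

```python
import numpy as np

def shuffle_together(X, y, seed=None):
    # Draw one permutation and index both arrays with it,
    # so every feature row keeps its original label.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    return X[idx], y[idx]

# Tiny example where row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 2, 3, 4])
Xs, ys = shuffle_together(X, y, seed=42)
```

Shuffling the two arrays independently would silently pair features with the wrong labels, which is why a shared index is used.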

Batching gradient descent for machine learning

After looking at the pseudocode for SGD, you’ll immediately notice the introduction of a new parameter: the batch size.

In a “purist” implementation of SGD, your mini-batch size would be set to 1. However, we often use mini-batches that are larger than 1. Typical values include 32, 64, 128, and 256.

So, why are these common mini-batch size values?

To start, using batches > 1 helps reduce variance in the parameter update, ultimately leading to a more stable convergence.
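To make the variance claim concrete, here is a small simulation. It assumes synthetic per-example gradients for a single weight (drawn from a normal distribution, which is purely an illustrative choice) and compares how much the averaged gradient fluctuates from batch to batch at two batch sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-example gradients for one weight: mean 1.0, fairly noisy.
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=32 * 1000)

def batch_gradient_variance(grads, batch_size):
    # Average the gradients inside each batch, then measure how much
    # those batch averages vary from one batch to the next.
    usable = (len(grads) // batch_size) * batch_size
    batch_means = grads[:usable].reshape(-1, batch_size).mean(axis=1)
    return batch_means.var()

var_1 = batch_gradient_variance(per_example_grads, 1)
var_32 = batch_gradient_variance(per_example_grads, 32)
```

For independent samples, the variance of a batch average shrinks roughly in proportion to 1 / batch size, so var_32 should come out far smaller than var_1.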

Secondly, optimized matrix operation libraries are often more efficient when the input matrix size is a power of 2.

In general, the mini-batch size is not a hyperparameter that you should worry much about. You basically determine how many training examples will fit on your GPU/main memory and then use the nearest power of 2 as the batch size.
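As a small sketch of that rule of thumb, the helper below takes the number of examples that fit in memory and returns a power-of-2 batch size. It interprets “nearest” as the largest power of 2 that still fits, which is an assumption on my part, and the function name is hypothetical:

```python
def power_of_two_batch_size(max_examples):
    # Largest power of 2 that does not exceed the number of
    # training examples that fit in GPU/main memory.
    if max_examples < 1:
        raise ValueError("max_examples must be at least 1")
    return 1 << (max_examples.bit_length() - 1)

# If roughly 300 examples fit in memory, use a batch size of 256.
batch_size = power_of_two_batch_size(300)
```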

Implementing Stochastic Gradient Descent (SGD) with Python

We are now ready to update our code from last week’s blog post on vanilla gradient descent. Since I have already reviewed this code in detail earlier, I’ll defer an exhaustive, thorough review of each line of code to last week’s post.

That said, I will still be pointing out the salient, important lines of code in this example.

To get started, open up a new file and insert the following code:

Lines 2-5 start by importing our required Python packages. Then, Line 7 defines our sigmoid_activation function used during the training process.

In order to apply Stochastic Gradient Descent, we need a function that yields mini-batches of training data, and that is exactly what the next_batch function on Lines 12-16 does.

The next_batch function requires three parameters:

  • X: Our training dataset of feature vectors.
  • y: The class labels associated with each of the training data points.
  • batchSize: The size of each mini-batch that will be returned.

Lines 14-16 then loop over our training examples, yielding subsets of both X and y as mini-batches.
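Based only on the description above, these two helpers might look roughly like the following sketch. The function and parameter names come from the text, but the bodies are my assumption of a straightforward implementation:

```python
import numpy as np

def sigmoid_activation(x):
    # Squash raw scores into the range (0, 1).
    return 1.0 / (1 + np.exp(-x))

def next_batch(X, y, batchSize):
    # Step through the dataset in strides of batchSize, yielding
    # a (features, labels) tuple for each mini-batch.
    for i in np.arange(0, X.shape[0], batchSize):
        yield (X[i:i + batchSize], y[i:i + batchSize])

# 10 training points with a batch size of 4 gives batches of 4, 4, and 2.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches = list(next_batch(X, y, 4))
```

Note that the final batch is simply smaller when the dataset size is not a multiple of batchSize.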

Next, let’s parse our command line arguments:

Lines 19-26 parse our (optional) command line arguments.

The --epochs switch controls the number of epochs, or rather, the number of times the training process “sees” each individual training example.

The --alpha value controls our learning rate in the gradient descent algorithm.

And finally, the --batch-size indicates the size of each of our mini-batches. We’ll default this value to be 32.
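A sketch of how these three switches could be declared with argparse is shown below. The default of 32 for --batch-size comes from the text; the defaults of 100 epochs and a 0.01 learning rate are assumptions, as is the build_parser helper name:

```python
import argparse

def build_parser():
    ap = argparse.ArgumentParser()
    # Default values for --epochs and --alpha are assumed for illustration.
    ap.add_argument("--epochs", type=int, default=100,
                    help="number of epochs")
    ap.add_argument("--alpha", type=float, default=0.01,
                    help="learning rate")
    ap.add_argument("--batch-size", type=int, default=32,
                    help="size of SGD mini-batches")
    return ap

# Passing an empty list uses the defaults; in a script you would call
# parse_args() with no arguments to read the real command line.
args = vars(build_parser().parse_args([]))
custom = vars(build_parser().parse_args(["--epochs", "5", "--batch-size", "64"]))
```

argparse converts the dash in --batch-size to an underscore, so the value is read back as args["batch_size"].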

In order to apply Stochastic Gradient Descent, we need a dataset. Below we generate some data to work with:

Above we generate a 2-class classification problem. We have a total of 400 data points, each of which is 2D. 200 data points belong to class 0 and the remaining 200 to class 1.

Our goal is to correctly classify each of these 400 data points into their respective classes.
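If you want to reproduce a dataset of this shape using NumPy alone, one possible sketch is to draw two Gaussian clusters of 200 points each. The cluster centers, the spread, the seed, and the make_two_blobs name are all arbitrary illustrative choices, not values taken from this post:

```python
import numpy as np

def make_two_blobs(n_per_class=200, seed=0):
    rng = np.random.default_rng(seed)
    # Two 2D Gaussian clusters; centers and spread are illustrative only.
    class0 = rng.normal(loc=[-3.0, -3.0], scale=1.0, size=(n_per_class, 2))
    class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n_per_class, 2))
    X = np.vstack([class0, class1])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

X, y = make_two_blobs()
```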

Now let’s perform some initializations:

For a more thorough review of this section, please see last week’s tutorial.
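As a rough sketch of what such initializations typically involve, the snippet below prepends a bias column of ones to the features, draws small random starting weights, and creates an empty list for tracking loss. The placeholder data and the uniform initialization are assumptions; the three-entry weight vector with the bias at W[0] matches the decision boundary formula discussed in the comments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder 2D features standing in for the generated dataset.
X = rng.normal(size=(400, 2))

# Prepend a column of ones so the bias is learned as W[0],
# alongside the two feature weights W[1] and W[2].
X = np.c_[np.ones(X.shape[0]), X]

# Random starting weights, one per column of X (bias included).
W = rng.uniform(size=(X.shape[1],))

# Will hold the average loss of each epoch during training.
lossHistory = []
```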

Below follows our actual Stochastic Gradient Descent (SGD) implementation:

On Line 48 we start looping over the desired number of epochs.

We then initialize an epochLoss list to store the loss value for each of the mini-batch gradient updates. As we’ll see later in this code block, the epochLoss list will be used to compute the average loss over all mini-batch updates for an entire epoch.

Line 53 is the “core” of the Stochastic Gradient Descent algorithm and is what separates it from the vanilla gradient descent algorithm — we loop over our training samples in mini-batches.

For each of these mini-batches, we take the data, compute the dot product between it and the weight matrix, and then pass the results through the sigmoid activation function to obtain our predictions.

Line 62 computes the error between these predictions and the true class labels, allowing us to minimize the least squares loss on Line 66.

Line 72 evaluates the gradient for the current batch. Once we have the gradient, we can update the weight matrix W on Line 76 by taking a step, scaled by our learning rate --alpha.

Again, for a more thorough, detailed review of the gradient descent algorithm, please refer to last week’s tutorial.
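Putting the steps above together, a self-contained sketch of the training loop could look like this. It follows the described flow (predictions, error, squared-error loss, gradient, update, per-epoch average loss), but the exact gradient expression is my assumption: it is the chain-rule derivative of the summed squared error through the sigmoid, averaged over the batch. The synthetic data and the train_sgd name are also illustrative:

```python
import numpy as np

def sigmoid_activation(x):
    return 1.0 / (1 + np.exp(-x))

def next_batch(X, y, batchSize):
    for i in range(0, X.shape[0], batchSize):
        yield X[i:i + batchSize], y[i:i + batchSize]

def train_sgd(X, y, epochs=100, alpha=0.01, batchSize=32, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(X.shape[1],))
    lossHistory = []
    for epoch in range(epochs):
        epochLoss = []
        # Loop over the training data in mini-batches.
        for batchX, batchY in next_batch(X, y, batchSize):
            # Forward pass: scores, then sigmoid to get predictions.
            preds = sigmoid_activation(batchX.dot(W))
            error = preds - batchY
            # Least squares loss for this mini-batch.
            epochLoss.append(np.sum(error ** 2))
            # Assumed gradient: derivative of the squared error through
            # the sigmoid, averaged over the examples in the batch.
            gradient = 2 * batchX.T.dot(error * preds * (1 - preds)) / batchX.shape[0]
            # Take a step against the gradient, scaled by the learning rate.
            W += -alpha * gradient
        # Average loss across all mini-batch updates in this epoch.
        lossHistory.append(np.average(epochLoss))
    return W, lossHistory

# Synthetic stand-in data: two 2D clusters, a bias column, then a shuffle.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)),
               rng.normal(3.0, 1.0, size=(200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])
X = np.c_[np.ones(X.shape[0]), X]
idx = rng.permutation(X.shape[0])
X, y = X[idx], y[idx]

W, lossHistory = train_sgd(X, y)
```

With 400 points and a batch size of 32, each epoch performs 13 weight updates instead of the single update that vanilla gradient descent would make.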

Our last code block handles visualizing our data points along with the decision boundary learned by the Stochastic Gradient Descent algorithm:
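The boundary itself can be computed without any plotting library. The sketch below solves 0 = W[0] + W[1] * x + W[2] * y for y, which is the point where the sigmoid outputs exactly 0.5. The weight values are hypothetical, not learned ones, and the decision_boundary helper name is my own:

```python
import numpy as np

def decision_boundary(W, x_values):
    # Points on the boundary satisfy W[0] + W[1] * x + W[2] * y = 0,
    # so solve that equation for y at each x.
    return (-W[0] - (W[1] * x_values)) / W[2]

W = np.array([1.0, 2.0, -4.0])      # hypothetical learned weights
xs = np.array([-10.0, 0.0, 10.0])   # x coordinates to evaluate
ys = decision_boundary(W, xs)
```

The resulting (xs, ys) pairs can then be handed to a line-plotting call such as matplotlib's plt.plot and drawn on top of a scatter plot of the data points.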

Visualizing Stochastic Gradient Descent (SGD)

To execute the code associated with this blog post, be sure to download the code using the “Downloads” section at the bottom of this tutorial.

From there, you can execute the following command:

You should then see the following plot displayed to your screen:

Figure 1: Learning the classification decision boundary using Stochastic Gradient Descent.


As the plot demonstrates, we are able to learn a weight matrix W that correctly classifies each of the data points.
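To check this numerically rather than visually, one option is to threshold the sigmoid output at 0.5 and compare against the labels. This is a sketch with a handful of made-up points and hypothetical weights; the predict and accuracy helpers are not part of the original code:

```python
import numpy as np

def sigmoid_activation(x):
    return 1.0 / (1 + np.exp(-x))

def predict(X, W):
    # Class 1 when the sigmoid output is at least 0.5, otherwise class 0.
    return (sigmoid_activation(X.dot(W)) >= 0.5).astype(int)

def accuracy(X, y, W):
    # Fraction of data points whose predicted class matches the label.
    return float(np.mean(predict(X, W) == y))

# Made-up points (first column is the bias term) and hypothetical weights.
X = np.array([[1.0, -2.0, -1.0],
              [1.0,  2.0,  1.0],
              [1.0, -1.0, -3.0],
              [1.0,  3.0,  2.0]])
y = np.array([0, 1, 0, 1])
W = np.array([0.0, 1.0, 1.0])
acc = accuracy(X, y, W)
```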

I have also included a plot that visualizes loss decreasing in further iterations of the Stochastic Gradient Descent algorithm:

Figure 2: The loss function associated with Stochastic Gradient Descent. Loss continues to decrease as we allow more epochs to pass.



In today’s blog post, we learned about Stochastic Gradient Descent (SGD), an extremely common extension to the vanilla gradient descent algorithm. In fact, in nearly all situations, you’ll see SGD used instead of the original gradient descent version.

SGD is also very common when training your own neural networks and deep learning classifiers. If you recall, a couple of weeks ago we used SGD to train a simple neural network. We also used SGD when training LeNet, a common Convolutional Neural Network.

Over the next couple of weeks I’ll be discussing some computer vision topics, but then we’ll pick up a thorough discussion of backpropagation along with the various types of layers in Convolutional Neural Networks come early November.



If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!


8 Responses to Stochastic Gradient Descent (SGD) with Python

  1. Wajih October 17, 2016 at 1:15 pm #

    Another excellent explanation of a very important topic in deep learning… This blog of yours is a must read!!

    • Adrian Rosebrock October 17, 2016 at 1:59 pm #

      Thanks Wajih! I’m glad you are enjoying the blog posts and tutorials. I’ll be sure to continue to do more deep learning tutorials in the future.

      • wajih October 18, 2016 at 5:08 am #

        That is really cool. Looking forward to more important stuff on deep learning!

        • Adrian Rosebrock October 20, 2016 at 8:57 am #

          There will certainly be plenty more posts on deep learning 🙂

  2. i008 October 19, 2016 at 4:31 am #

    Line 84:
    Y = (-W[0] - (W[1] * X)) / W[2]

    Is not exactly clear to me, could someone please elaborate on that?

    • Hashim July 12, 2017 at 7:15 am #

      you can think of it like this:

      In order to draw the decision boundary, you need to draw only the points (x,y) which lie right over the boundary.

      According to the sigmoid function, the boundary is the value 0.5. So, in order to obtain a 0.5, you need to provide a zero value as input to the sigmoid (That is, a zero value as output from the scoring function).

      Thus, if the scoring function equals zero:

      0 = w0 + w1*x + w2*y ==> y = (-w0 - w1*x)/w2

      You can use any x’s coordinates you want, and you’ll get the proper y’s coordinates to draw the boundary

  3. foobar April 13, 2017 at 12:21 pm #

    Great stuff, thanks!

    method name is not next_method btw

    • Adrian Rosebrock April 16, 2017 at 9:02 am #

      Thanks! I have updated the blog post to fix this typo.
