# Stochastic Gradient Descent (SGD) with Python

In last week’s blog post, we discussed gradient descent, a first-order optimization algorithm that can be used to learn a set of classifier coefficients for parameterized learning.

However, the “vanilla” implementation of gradient descent can be prohibitively slow to run on large datasets — in fact, it can even be considered computationally wasteful.

Instead, we should apply Stochastic Gradient Descent (SGD), a simple modification to the standard gradient descent algorithm that computes the gradient and updates our weight matrix W on small batches of training data, rather than the entire training set itself.

While this leads to “noisier” weight updates, it also allows us to take more steps along the gradient (1 step for each batch versus 1 step per epoch), ultimately leading to faster convergence with no negative effects on loss and classification accuracy.


## Stochastic Gradient Descent (SGD) with Python

Taking a look at last week’s blog post, it should be (at least somewhat) obvious that the gradient descent algorithm will run very slowly on large datasets. The reason for this “slowness” is that each iteration of gradient descent requires us to compute a prediction for each training point in our training data.

For image datasets such as ImageNet where we have over 1.2 million training images, this computation can take a long time.

It also turns out that computing predictions for every training data point before taking a step and updating our weight matrix W is computationally wasteful (and doesn’t help us in the long run).

### Updating our gradient descent optimization algorithm

Before I discuss Stochastic Gradient Descent in more detail, let’s first look at the original gradient descent pseudocode and then the updated, SGD pseudocode, both inspired by the CS231n course slides.

Below follows the pseudocode for vanilla gradient descent:
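A minimal Python sketch of that loop, written in the spirit of the CS231n pseudocode. The `evaluate_gradient` helper, the toy least-squares problem, and the values of `alpha` and the iteration count are all assumptions chosen for illustration; this is not the original listing.

```python
import numpy as np

def evaluate_gradient(X, y, W):
    # gradient of the mean squared error of a linear model
    # (an assumed toy loss, used only to make the loop runnable)
    error = X.dot(W) - y
    return X.T.dot(error) / X.shape[0]

# toy data: y = 2 * x exactly, so the best weight is 2.0
X = np.arange(1.0, 6.0).reshape(-1, 1)
y = 2.0 * X[:, 0]

W = np.zeros(1)
alpha = 0.05  # assumed learning rate

for _ in range(200):
    # vanilla gradient descent: every single update
    # is computed from the *entire* training set
    W_gradient = evaluate_gradient(X, y, W)
    W += -alpha * W_gradient
```

Each pass through the loop touches all of `X` before `W` changes once, which is the cost the rest of this post tries to avoid.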

And here we can see the pseudocode for Stochastic Gradient Descent:
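A matching sketch for SGD, under the same assumptions as before (hypothetical `evaluate_gradient` helper, toy least-squares data, illustrative `alpha`). The `next_training_batch` generator here is one plausible way to write the function the text refers to, not the original code.

```python
import numpy as np

def evaluate_gradient(X, y, W):
    # gradient of the mean squared error of a linear model (assumed toy loss)
    error = X.dot(W) - y
    return X.T.dot(error) / X.shape[0]

def next_training_batch(X, y, batch_size):
    # yield consecutive slices of the data as mini-batches
    for i in range(0, X.shape[0], batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

# toy data: 8 points with y = 2 * x
X = np.arange(1.0, 9.0).reshape(-1, 1)
y = 2.0 * X[:, 0]

W = np.zeros(1)
alpha = 0.01  # assumed learning rate
num_updates = 0

for epoch in range(200):
    for batch_X, batch_y in next_training_batch(X, y, batch_size=2):
        # the gradient is evaluated on the mini-batch only,
        # and W is updated once per batch rather than once per epoch
        W_gradient = evaluate_gradient(batch_X, batch_y, W)
        W += -alpha * W_gradient
        num_updates += 1
```

With 8 points and a batch size of 2, every epoch performs 4 weight updates instead of 1.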

As you can see, the implementations are quite similar.

The only difference between vanilla gradient descent and Stochastic Gradient Descent is the addition of the next_training_batch function. Instead of computing our gradient over the entire data set, we instead sample our data, yielding a batch.

We then evaluate the gradient on this batch and update our weight matrix W.

Note: From an implementation perspective, we also randomize our training samples before applying SGD.
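One common way to randomize the samples, sketched here on the assumption that the features and labels are NumPy arrays of equal length, is to apply a single random permutation to both so that every feature vector stays paired with its label. The array contents and the seed are placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed fixed only to make the sketch repeatable

X = np.arange(10).reshape(5, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(5)                 # label i belongs to row i

# draw one permutation of the row indices and index both arrays with it
perm = rng.permutation(X.shape[0])
X_shuffled, y_shuffled = X[perm], y[perm]
```

Shuffling `X` and `y` independently would break the pairing, which is why the same index array is reused.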

### Batching gradient descent for machine learning

After looking at the pseudocode for SGD, you’ll immediately notice the introduction of a new parameter: the batch size.

In a “purist” implementation of SGD, your mini-batch size would be set to 1. However, we often use mini-batches that are > 1. Typical values include 32, 64, 128, and 256.

So, why are these common mini-batch size values?

To start, using batches > 1 helps reduce variance in the parameter update, ultimately leading to a more stable convergence.
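The variance claim can be checked numerically. The sketch below uses synthetic stand-ins for per-example gradients (a true gradient of 1.0 plus unit-variance noise, both assumptions) and compares how much the averaged update fluctuates for two batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for per-example gradients: true value 1.0 plus unit-variance noise
per_example_grads = 1.0 + rng.standard_normal(100_000)

def update_variance(grads, batch_size):
    # average the gradients inside each mini-batch, then measure
    # how much those batch averages vary from one batch to the next
    usable = (grads.shape[0] // batch_size) * batch_size
    batch_means = grads[:usable].reshape(-1, batch_size).mean(axis=1)
    return batch_means.var()

var_1 = update_variance(per_example_grads, 1)    # "purist" SGD
var_32 = update_variance(per_example_grads, 32)  # mini-batches of 32
```

For independent noise, averaging over a batch of size B divides the variance by roughly B, so `var_32` should come out far smaller than `var_1`.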

Secondly, optimized matrix operation libraries are often more efficient when the input matrix size is a power of 2.

In general, the mini-batch size is not a hyperparameter that you should worry much about. You basically determine how many training examples will fit on your GPU/main memory and then use the nearest power of 2 as the batch size.
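As a small sketch of that rule of thumb, the helper below picks the largest power of 2 that does not exceed the number of examples that fit in memory. Reading “nearest” as “largest that still fits” is an assumption, and both the function name and the memory figure are hypothetical.

```python
def largest_power_of_two_at_most(n):
    # for n >= 1, the highest set bit of n is the largest power of 2 <= n
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n.bit_length() - 1)

max_examples_in_memory = 200  # hypothetical measurement for your hardware
batch_size = largest_power_of_two_at_most(max_examples_in_memory)
```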

### Implementing Stochastic Gradient Descent (SGD) with Python

We are now ready to update our code from last week’s blog post on vanilla gradient descent. Since I have already reviewed this code in detail, I’ll defer an exhaustive, line-by-line review to last week’s post.

That said, I will still point out the salient lines of code in this example.

To get started, open up a new file, name it sgd.py, and insert the following code:

Lines 2-5 start by importing our required Python packages. Then, Line 7 defines our sigmoid_activation function used during the training process.
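A minimal sketch of what such a function might look like. Only NumPy is imported here, so the line numbers cited in the text will not match this snippet, and the body is the standard logistic function rather than a copy of the original listing.

```python
import numpy as np

def sigmoid_activation(x):
    # squash any real-valued score into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))
```

A score of 0 maps to exactly 0.5, which is why 0.5 is the natural threshold between the two classes.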

In order to apply Stochastic Gradient Descent, we need a function that yields mini-batches of training data, and that is exactly what the next_batch function on Lines 12-16 does.

The next_batch function requires three parameters:

• X: Our training dataset of feature vectors.
• y: The class labels associated with each of the training data points.
• batchSize: The size of each mini-batch that will be returned.

Lines 14-16 then loop over our training examples, yielding subsets of both X and y as mini-batches.
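The description above can be sketched as a short generator. The parameter names follow the text; the body and the small demonstration arrays are assumptions.

```python
import numpy as np

def next_batch(X, y, batchSize):
    # step through the dataset in strides of `batchSize`
    for i in np.arange(0, X.shape[0], batchSize):
        # yield the matching slices of features and labels;
        # the final batch may be smaller than `batchSize`
        yield (X[i:i + batchSize], y[i:i + batchSize])

# demonstration: 10 samples split into batches of 4
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batch_sizes = [batchX.shape[0] for (batchX, batchY) in next_batch(X, y, 4)]
```

Because 10 is not a multiple of 4, the last batch holds only the 2 leftover samples.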

Next, let’s parse our command line arguments:

Lines 19-26 parse our (optional) command line arguments.

The --epochs switch controls the number of epochs, or rather, the number of times the training process “sees” each individual training example.

The --alpha value controls our learning rate in the gradient descent algorithm.

And finally, the --batch-size switch indicates the size of each of our mini-batches. We’ll default this value to be 32.
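A sketch of how these three switches could be declared with `argparse`. The batch size default of 32 comes from the text; the short flags, the help strings, and the defaults for `--epochs` and `--alpha` are assumptions, as is the `build_parser` wrapper.

```python
import argparse

def build_parser():
    ap = argparse.ArgumentParser(description="SGD demo (illustrative)")
    ap.add_argument("-e", "--epochs", type=int, default=100,
                    help="number of epochs (default value is an assumption)")
    ap.add_argument("-a", "--alpha", type=float, default=0.01,
                    help="learning rate (default value is an assumption)")
    ap.add_argument("-b", "--batch-size", type=int, default=32,
                    help="size of each SGD mini-batch")
    return ap

# An explicit argument list keeps this sketch runnable anywhere;
# inside sgd.py you would call parse_args() with no arguments
# so that the values are read from the command line.
args = vars(build_parser().parse_args(["--epochs", "50"]))
```

Note that `argparse` converts `--batch-size` into the key `batch_size` in the resulting dictionary.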

In order to apply Stochastic Gradient Descent, we need a dataset. Below we generate some data to work with:

Above we generate a 2-class classification problem. We have a total of 400 data points, each of which is 2D. 200 data points belong to class 0 and the remaining 200 to class 1.

Our goal is to correctly classify each of these 400 data points into their respective classes.
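One way to produce data with this shape using only NumPy is sketched below. The blob centers, the spread, and the seed are assumptions; a helper such as scikit-learn’s `make_blobs` could serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(95)  # arbitrary seed for repeatability

# two Gaussian blobs of 200 points each (centers and spread are assumed values)
class0 = rng.normal(loc=(-4.0, -4.0), scale=1.5, size=(200, 2))
class1 = rng.normal(loc=(4.0, 4.0), scale=1.5, size=(200, 2))

# stack into a single (400, 2) feature matrix with labels 0 and 1
X = np.vstack([class0, class1])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])
```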

Now let’s perform some initializations:

For a more thorough review of this section, please see last week’s tutorial.
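As a rough sketch of what such initializations typically involve: a column of ones is added so the bias can live inside the weight matrix, W is given random starting values, and a list is prepared for the per-epoch loss. The placeholder data, the uniform initialization, and the name `lossHistory` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))  # placeholder for the 400 2D feature vectors

# prepend a column of 1's so the bias is learned as W[0]
X = np.c_[np.ones(X.shape[0]), X]

# one weight per column of X: bias plus two feature weights
W = rng.uniform(size=(X.shape[1],))

# average loss of each epoch is appended here during training
lossHistory = []
```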

Below follows our actual Stochastic Gradient Descent (SGD) implementation:

On Line 48 we start looping over the desired number of epochs.

We then initialize an epochLoss list to store the loss value for each of the mini-batch gradient updates. As we’ll see later in this code block, the epochLoss list will be used to compute the average loss over all mini-batch updates for an entire epoch.

Line 53 is the “core” of the Stochastic Gradient Descent algorithm and is what separates it from the vanilla gradient descent algorithm — we loop over our training samples in mini-batches.

For each of these mini-batches, we take the data, compute the dot product between it and the weight matrix, and then pass the results through the sigmoid activation function to obtain our predictions.

Line 62 computes the error between these predictions and the true class labels, allowing us to minimize the least squares loss on Line 66.

Line 72 evaluates the gradient for the current batch. Once we have the gradient, we can update the weight matrix W on Line 76 by taking a step, scaled by our learning rate --alpha.

Again, for a more thorough, detailed review of the gradient descent algorithm, please refer to last week’s tutorial.
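Putting the steps described above together, a self-contained sketch of the training loop might look like this. The toy data, the shuffle, the hyperparameter values, the name `lossHistory`, and the division of the gradient by the batch size are all assumptions, and the line numbers cited in the text refer to the original listing rather than to this snippet.

```python
import numpy as np

def sigmoid_activation(x):
    return 1.0 / (1.0 + np.exp(-x))

def next_batch(X, y, batchSize):
    for i in range(0, X.shape[0], batchSize):
        yield X[i:i + batchSize], y[i:i + batchSize]

rng = np.random.default_rng(1)

# toy 2-class data: 200 points per class (centers are assumed values)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)),
               rng.normal(3.0, 1.0, size=(200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

# shuffle so each mini-batch mixes both classes, then add the bias column
perm = rng.permutation(X.shape[0])
X, y = X[perm], y[perm]
X = np.c_[np.ones(X.shape[0]), X]

W = rng.uniform(size=(X.shape[1],))
epochs, alpha, batch_size = 100, 0.01, 32  # assumed hyperparameters
lossHistory = []

for epoch in range(epochs):
    epochLoss = []  # loss of every mini-batch update in this epoch

    for batchX, batchY in next_batch(X, y, batch_size):
        # predictions: sigmoid of the dot product with the weights
        preds = sigmoid_activation(batchX.dot(W))

        # error between predictions and true labels, and its squared sum
        error = preds - batchY
        epochLoss.append(np.sum(error ** 2))

        # gradient on this batch only, then one step scaled by alpha
        gradient = batchX.T.dot(error) / batchX.shape[0]
        W += -alpha * gradient

    # average loss over all mini-batch updates of the epoch
    lossHistory.append(np.average(epochLoss))

# fraction of training points on the correct side of the 0.5 threshold
accuracy = np.mean((sigmoid_activation(X.dot(W)) >= 0.5) == (y == 1))
```

With 400 samples and a batch size of 32, each epoch performs 13 weight updates, the last one on a smaller batch of 16.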

Our last code block handles visualizing our data points along with the decision boundary learned by the Stochastic Gradient Descent algorithm:
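The decision boundary itself is the set of points where the score is zero, so for weights [w0, w1, w2] it is the line w0 + w1*x1 + w2*x2 = 0. The sketch below solves that equation for x2 over a range of x1 values; the weight values are hypothetical stand-ins for a trained W, and the plotting step is left out so the snippet needs only NumPy.

```python
import numpy as np

# hypothetical learned weights: [bias, weight for x1, weight for x2]
W = np.array([0.5, 1.0, 2.0])

# x1 coordinates at which to evaluate the boundary
x1 = np.linspace(-5.0, 5.0, 11)

# solve w0 + w1*x1 + w2*x2 = 0 for x2
x2 = (-W[0] - W[1] * x1) / W[2]

# sanity check: the score is zero everywhere on the boundary
scores = W[0] + W[1] * x1 + W[2] * x2
```

The resulting (x1, x2) pairs can be handed to a plotting library and drawn on top of a scatter plot of the data.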

### Visualizing Stochastic Gradient Descent (SGD)

To execute the code associated with this blog post, be sure to download the code using the “Downloads” section at the bottom of this tutorial.

From there, you can run sgd.py with Python from your terminal (for example: python sgd.py).

You should then see the following plot displayed to your screen:

Figure 1: Learning the classification decision boundary using Stochastic Gradient Descent.

As the plot demonstrates, we are able to learn a weight matrix W that correctly classifies each of the data points.

I have also included a plot that visualizes loss decreasing in further iterations of the Stochastic Gradient Descent algorithm:

Figure 2: The loss function associated with Stochastic Gradient Descent. Loss continues to decrease as we allow more epochs to pass.

## Summary

In today’s blog post, we learned about Stochastic Gradient Descent (SGD), an extremely common extension to the vanilla gradient descent algorithm. In fact, in nearly all situations, you’ll see SGD used instead of the original gradient descent version.

SGD is also very common when training your own neural networks and deep learning classifiers. If you recall, a couple of weeks ago we used SGD to train a simple neural network. We also used SGD when training LeNet, a common Convolutional Neural Network.

Over the next couple of weeks I’ll be discussing some computer vision topics, but then we’ll pick up a thorough discussion of backpropagation along with the various types of layers in Convolutional Neural Networks come early November.


### 19 Responses to Stochastic Gradient Descent (SGD) with Python

1. Wajih October 17, 2016 at 1:15 pm #

Another excellent explaination of a very important topic in deep learning… This blog of yours is a must read!!

• Adrian Rosebrock October 17, 2016 at 1:59 pm #

Thanks Wajih! I’m glad you are enjoying the blog posts and tutorials. I’ll be sure to continue to do more deep learning tutorials in the future.

• wajih October 18, 2016 at 5:08 am #

Thats is really cool. Looking forward for more important stuff on deep learning!

• Adrian Rosebrock October 20, 2016 at 8:57 am #

2. i008 October 19, 2016 at 4:31 am #

Line 84:
Y = (-W[0] - (W[1] * X)) / W[2]

Is not exactly clear to me, could someone please elaborate on that?

• Hashim July 12, 2017 at 7:15 am #

you can think of it like this:

In order to draw the decision boundary, you need to draw only the points (x,y) which lie right over the boundary.

According to the sigmoid function, the boundary is the value 0.5. So, in order to obtain a 0.5, you need to provide a zero value as input to the sigmoid (That is, a zero value as output from the scoring function).

Thus, if the scoring function equals zero:

0 = w0 + w1*x + w2*y ==> y = (-w0 - w1*x)/w2

You can use any x’s coordinates you want, and you’ll get the proper y’s coordinates to draw the boundary

3. foobar April 13, 2017 at 12:21 pm #

Great stuff, thanks!

method name is not next_method btw

• Adrian Rosebrock April 16, 2017 at 9:02 am #

Thanks! I have updated the blog post to fix this typo.

4. i262666 January 4, 2018 at 12:33 am #

Great blog and very clear explanation! However, the overall logic is quite strange. We loop over all epochs and inside each epoch we process all data points in each mini-batch and adjusting the weights for every mini-batch. And because we use the gradient descent, we move for each mini-batch toward the minimal of the loss function. And tracking the loss function for each epoch shows on the displayed plot that the error values are decreasing.

So far, all is logical. But how you get the final weights for the trained network which best satisfies the targets for every data inputs. After training the network you star testing it against the test data-set. What weights we would use for testing the trained network?

• Adrian Rosebrock January 5, 2018 at 1:40 pm #

The weights you use to test the network would be the weights you obtained from training it. After the network has been trained the weights “freeze” and do not change. If you’re interested in learning more about gradient descent and how it works in the grand scheme of neural networks, take a look at my book, Deep Learning for Computer Vision with Python.

5. Misha Singla April 13, 2018 at 7:33 am #

Really helpful..but what are the changes required if hinge loss is considered instead of least squared error?

6. arron.niu August 3, 2018 at 5:12 am #

very thanks. I have pratised the vanilla gradient descent and the sgd according to the two posts. but I training 4000 and 40000 samples. but the execute time of sgd is slowly than the vanilla gradient descent. why?

7. georgem November 9, 2018 at 11:40 am #

Great post! and Thank you for your work on making this article very well structured and informative.

One question: I understand you split data in mini-batches and you computed gradients on each mini-batch. Why is it called Stochastic Gradient Decent? Because you don’t have any stochasticity, you’re just summing all the gradients in a mini-batch.

Maybe my understanding is wrong.

Thank you!

• Adrian Rosebrock November 10, 2018 at 9:57 am #

True SGD is called “stochastic” because it randomly samples a single data point and then updates the weights. In practice that’s highly inefficient so we use mini-batch gradient descent which has taken the name of SGD.

• georgem November 13, 2018 at 2:23 pm #

“In practice that’s highly inefficient” -> inefficient you mean that time of convergence might be high? True SGD is faster than regular true GD right? True GD computes the gradient for each example in the data so it takes longer.

If we only do mini-batch, with no randomness is that mini-batch GD?
And if we randomly shuffle the data at each epoch to have different data in our mini-batches is called mini-batch SDG?

Sorry, I hope I’m not being annoying. I’ve been looking everywhere online to clarify this. The deeper I dig, the more confused I get lol.

Thank you!

• Adrian Rosebrock November 13, 2018 at 4:10 pm #

1. It’s inefficient in time to convergence.
2. But it also tends to be highly inefficient for large datasets. You’ll be performing too many non-consecutive I/O operations which slows the whole operation down.

If you find yourself getting confused I would suggest working through Deep Learning for Computer Vision with Python where I discuss these concepts in far more detail.

• georgem November 15, 2018 at 9:05 am #

8. Disha November 4, 2019 at 3:24 pm #

Apologies if the question seems silly, I’m still an ultra-beginner to ML.
What do we do if i have class 1 and 2 instead of class 0 and class1?

• PRAKHAR GANDHI November 10, 2019 at 3:11 am #

Convert those classes to 0 and 1