Stochastic Gradient Descent (SGD) with Python

In the previous section, we discussed gradient descent, a first-order optimization algorithm that can be used to learn a set of classifier weights for parameterized learning. However, this “vanilla” implementation of gradient descent can be prohibitively slow to run on large datasets — in fact, it can even be considered computationally wasteful.

Instead, we should apply Stochastic Gradient Descent (SGD), a simple modification to the standard gradient descent algorithm that computes the gradient and updates the weight matrix W on small batches of training data, rather than the entire training set. While this modification leads to “more noisy” updates, it also allows us to take more steps along the gradient (one step per each batch versus one step per epoch), ultimately leading to faster convergence and no negative effects to loss and classification accuracy.

SGD is arguably the most important algorithm when it comes to training deep neural networks. Even though the original incarnation of SGD was introduced over 57 years ago (Stanford Electronics Laboratories et al., 1960), it is still the engine that enables us to train large networks to learn patterns from data points. Above all other algorithms covered in this book, take the time to understand SGD.

Looking for the source code to this post?

Mini-batch SGD

Reviewing the vanilla gradient descent algorithm, it should be (somewhat) obvious that the method will run very slowly on large datasets. The reason for this slowness is because each iteration of gradient descent requires us to compute a prediction for each training point in our training data before we are allowed to update our weight matrix. For image datasets such as ImageNet where we have over 1.2 million training images, this computation can take a long time.

It also turns out that computing predictions for every training point before taking a step along our weight matrix is computationally wasteful and does little to help our model coverage.

Instead, what we should do is batch our updates. We can update the pseudocode to transform vanilla gradient descent to become SGD by adding an extra function call:

while True:
	batch = next_training_batch(data, 256)
	Wgradient = evaluate_gradient(loss, batch, W)
	W += -alpha * Wgradient

The only difference between vanilla gradient descent and SGD is the addition of the next_training_batch function. Instead of computing our gradient over the entire data set, we instead sample our data, yielding a batch. We evaluate the gradient on the batch, and update our weight matrix W. From an implementation perspective, we also try to randomize our training samples before applying SGD since the algorithm is sensitive to batches.

After looking at the pseudocode for SGD, you’ll immediately notice an introduction of a new parameter: the batch size. In a “purist” implementation of SGD, your mini-batch size would be 1, implying that we would randomly sample one data point from the training set, compute the gradient, and update our parameters. However, we often use mini-batches that are > 1. Typical batch sizes include 32, 64, 128, and 256.

So, why bother using batch sizes > 1? To start, batch sizes > 1 help reduce variance in the parameter update (http://pyimg.co/pd5w0), leading to a more stable convergence. Secondly, powers of two are often desirable for batch sizes as they allow internal linear algebra optimization libraries to be more efficient.

In general, the mini-batch size is not a hyperparameter you should worry too much about (http://cs231n.stanford.edu). If you’re using a GPU to train your neural network, you determine how many training examples will fit into your GPU and then use the nearest power of two as the batch size such that the batch will fit on the GPU. For CPU training, you typically use one of the batch sizes listed above to ensure you reap the benefits of linear algebra optimization libraries.

Implementing Mini-batch SGD

Let’s go ahead and implement SGD and see how it differs from standard vanilla gradient descent. Open a new file, name it sgd.py, and insert the following code:

# import the necessary packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
import argparse

def sigmoid_activation(x):
	# compute the sigmoid activation value for a given input
	return 1.0 / (1 + np.exp(-x))

def sigmoid_deriv(x):
	# compute the derivative of the sigmoid function ASSUMING
	# that the input "x" has already been passed through the sigmoid
	# activation function
	return x * (1 - x)

Lines 2-7 import our required Python packages, exactly the same as the gradient_descent.py example earlier in this chapter. Lines 9-17 define our sigmoid_activation and sigmoid_deriv functions, both of which are identical to the previous version of gradient descent.

In fact, the predict method doesn’t change either:

def predict(X, W):
	# take the dot product between our features and weight matrix
	preds = sigmoid_activation(X.dot(W))

	# apply a step function to threshold the outputs to binary
	# class labels
	preds[preds <= 0.5] = 0
	preds[preds > 0] = 1

	# return the predictions
	return preds

However, what does change is the addition of the next_batch function:

def next_batch(X, y, batchSize):
	# loop over our dataset "X" in mini-batches, yielding a tuple of
	# the current batched data and labels
	for i in np.arange(0, X.shape[0], batchSize):
		yield (X[i:i + batchSize], y[i:i + batchSize])

The next_batch method requires three parameters:

X: Our training dataset of feature vectors/raw image pixel intensities.
y: The class labels associated with each of the training data points.
batchSize: The size of each mini-batch that will be returned.

Lines 34 and 35 then loop over the training examples, yielding subsets of both X and y as mini-batches.

Next, we can parse our command line arguments:

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-e", "--epochs", type=float, default=100,
	help="# of epochs")
ap.add_argument("-a", "--alpha", type=float, default=0.01,
	help="learning rate")
ap.add_argument("-b", "--batch-size", type=int, default=32,
	help="size of SGD mini-batches")
args = vars(ap.parse_args())

We have already reviewed both the --epochs (number of epochs) and --alpha (learning rate) switch from the vanilla gradient descent example — but also notice we are introducing a third switch: --batch-size, which as the name indicates is the size of each of our mini-batches. We’ll default this value to be 32 data points per mini-batch.

Our next code block handles generating our 2-class classification problem with 1,000 data points, adding the bias column, and then performing the training and testing split:

# generate a 2-class classification problem with 1,000 data points,
# where each data point is a 2D feature vector
(X, y) = make_blobs(n_samples=1000, n_features=2, centers=2,
	cluster_std=1.5, random_state=1)
y = y.reshape((y.shape[0], 1))

# insert a column of 1's as the last entry in the feature
# matrix -- this little trick allows us to treat the bias
# as a trainable parameter within the weight matrix
X = np.c_[X, np.ones((X.shape[0]))]

# partition the data into training and testing splits using 50% of
# the data for training and the remaining 50% for testing
(trainX, testX, trainY, testY) = train_test_split(X, y,
	test_size=0.5, random_state=42)

We’ll then initialize our weight matrix and losses just like in the previous example:

# initialize our weight matrix and list of losses
print("[INFO] training...")
W = np.random.randn(X.shape[1], 1)
losses = []

The real change comes next where we loop over the desired number of epochs, sampling mini-batches along the way:

# loop over the desired number of epochs
for epoch in np.arange(0, args["epochs"]):
	# initialize the total loss for the epoch
	epochLoss = []

	# loop over our data in batches
	for (batchX, batchY) in next_batch(trainX, trainY, args["batch_size"]):
		# take the dot product between our current batch of features
		# and the weight matrix, then pass this value through our
		# activation function
		preds = sigmoid_activation(batchX.dot(W))

		# now that we have our predictions, we need to determine the
		# "error", which is the difference between our predictions
		# and the true values
		error = preds - batchY
		epochLoss.append(np.sum(error ** 2))

On Line 69, we start looping over the supplied number of --epochs. We then loop over our training data in batches on Line 74. For each batch, we compute the dot product between the batch and W, then pass the result through the sigmoid activation function to obtain our predictions. We compute the error for the batch on Line 83 and use this value to update our least squares epochLoss on Line 84.

Now that we have the error, we can compute the gradient descent update, identical to computing the gradient from vanilla gradient descent, only this time we are performing the update on batches rather than the entire training set:

		# the gradient descent update is the dot product between our
		# (1) current batch and (2) the error of the sigmoid
		# derivative of our predictions
		d = error * sigmoid_deriv(preds)
		gradient = batchX.T.dot(d)

		# in the update stage, all we need to do is "nudge" the
		# weight matrix in the negative direction of the gradient
		# (hence the term "gradient descent" by taking a small step
		# towards a set of "more optimal" parameters
		W += -args["alpha"] * gradient

Line 96 handles updating our weight matrix based on the gradient, scaled by our learning rate --alpha. Notice how the weight update stage takes place inside the batch loop — this implies there are multiple weight updates per epoch.

We can then update our loss history by taking the average across all batches in the epoch and then displaying an update to our terminal if necessary:

	# update our loss history by taking the average loss across all
	# batches
	loss = np.average(epochLoss)
	losses.append(loss)

	# check to see if an update should be displayed
	if epoch == 0 or (epoch + 1) % 5 == 0:
		print("[INFO] epoch={}, loss={:.7f}".format(int(epoch + 1),
			loss))

Evaluating our classifier is done in the same way as in vanilla gradient descent — simply call predict on the testX data using our learned W weight matrix:

# evaluate our model
print("[INFO] evaluating...")
preds = predict(testX, W)
print(classification_report(testY, preds))

We’ll end our script by plotting the testing classification data along with the loss per epoch:

# plot the (testing) classification data
plt.style.use("ggplot")
plt.figure()
plt.title("Data")
plt.scatter(testX[:, 0], testX[:, 1], marker="o", c=testY[:, 0], s=30)

# construct a figure that plots the loss over time
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, args["epochs"]), losses)
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.show()

SGD Results

To visualize the results from our implementation, just execute the following command:

$ python sgd.py
[INFO] training...
[INFO] epoch=1, loss=0.1317637
[INFO] epoch=5, loss=0.0162487
[INFO] epoch=10, loss=0.0112798
[INFO] epoch=15, loss=0.0100234
[INFO] epoch=20, loss=0.0094581
[INFO] epoch=25, loss=0.0091053
[INFO] epoch=30, loss=0.0088366
[INFO] epoch=35, loss=0.0086082
[INFO] epoch=40, loss=0.0084031
[INFO] epoch=45, loss=0.0082138
[INFO] epoch=50, loss=0.0080364
[INFO] epoch=55, loss=0.0078690
[INFO] epoch=60, loss=0.0077102
[INFO] epoch=65, loss=0.0075593
[INFO] epoch=70, loss=0.0074153
[INFO] epoch=75, loss=0.0072779
[INFO] epoch=80, loss=0.0071465
[INFO] epoch=85, loss=0.0070207
[INFO] epoch=90, loss=0.0069001
[INFO] epoch=95, loss=0.0067843
[INFO] epoch=100, loss=0.0066731
[INFO] evaluating...
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       250
          1       1.00      1.00      1.00       250

avg / total       1.00      1.00      1.00       500

The SGD example uses a learning rate of (0.1) and the same number of epochs (100) as vanilla gradient descent. The results of which can be seen in Figure 1.

**Figure 1:** Applying Stochastic Gradient Descent to our dataset of red and blue data points we are able to obtain nearly two orders of magnitude lower loss by the end of the 100th epoch (as compared to standard, vanilla gradient descent) due to the multiple weight updates per batch.

Investigating the actual loss values at the end of the 100th epoch, you’ll notice that loss obtained by SGD is nearly two orders of magnitude lower than vanilla gradient descent (0.006 vs 0.447, respectively). This difference is due to the multiple weight updates per epoch, giving our model more chances to learn from the updates made to the weight matrix. This effect is even more pronounced on large datasets, such as ImageNet, where we have millions of training examples and small, incremental updates in our parameters can lead to a low loss (but not necessarily optimal) solution.

What's next? We recommend PyImageSearch University.

Course information:
86+ total classes • 115+ hours hours of on-demand code walkthrough videos • Last updated: July 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled

I strongly believe that if you had the right teacher you could master computer vision and deep learning.

Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?

That’s not the case.

All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.

If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.

Inside PyImageSearch University you'll find:

✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
✓ 86 Certificates of Completion
✓ 115+ hours hours of on-demand video
✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
✓ Pre-configured Jupyter Notebooks in Google Colab
✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
✓ Access on mobile, laptop, desktop, etc.

Click here to join PyImageSearch University

Download the Source Code and FREE 17-page Resource Guide

Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!

About the Author

Hi there, I’m Adrian Rosebrock, PhD. All too often I see developers, students, and researchers wasting their time, studying the wrong things, and generally struggling to get started with Computer Vision, Deep Learning, and OpenCV. I created this website to show you what I believe is the best possible way to get your start.

19 responses to: Stochastic Gradient Descent (SGD) with Python

Wajih

October 17, 2016 at 1:15 pm

Another excellent explaination of a very important topic in deep learning… This blog of yours is a must read!!
- Adrian Rosebrock
  
  October 17, 2016 at 1:59 pm
  
  Thanks Wajih! I’m glad you are enjoying the blog posts and tutorials. I’ll be sure to continue to do more deep learning tutorials in the future.
  - wajih
    
    October 18, 2016 at 5:08 am
    
    Thats is really cool. Looking forward for more important stuff on deep learning!
    - Adrian Rosebrock
      
      October 20, 2016 at 8:57 am
      
      There will certainly be plenty more posts on deep learning 🙂
i008

October 19, 2016 at 4:31 am

Line 84:
Y = (-W[0] – (W[1] * X)) / W[2]

Is not exactly clear to me, could someone please elaborate on that?
- Hashim
  
  July 12, 2017 at 7:15 am
  
  you can think of it like this:
  
  In order to draw the decision boundary, you need to draw only the points (x,y) which lie right over the boundary.
  
  According to the sigmoid function, the boundary is the value 0.5. So, in order to obtain a 0.5, you need to provide a zero value as input to the sigmoid (That is, a zero value as output from the scoring function).
  
  Thus, if the scoring function equals zero:
  
  0 = w0 + w1*x + w2*y ==> y = (-w0 – w1*x)/w2
  
  You can use any x’s coordinates you want, and you’ll get the proper y’s coordinates to draw the boundary
foobar

April 13, 2017 at 12:21 pm

Great stuff, thanks!

method name is not next_method btw
- Adrian Rosebrock
  
  April 16, 2017 at 9:02 am
  
  Thanks! I have updated the blog post to fix this typo.
i262666

January 4, 2018 at 12:33 am

Great blog and very clear explanation! However, the overall logic is quite strange. We loop over all epochs and inside each epoch we process all data points in each mini-batch and adjusting the weights for every mini-batch. And because we use the gradient descent, we move for each mini-batch toward the minimal of the loss function. And tracking the loss function for each epoch shows on the displayed plot that the error values are decreasing.

So far, all is logical. But how you get the final weights for the trained network which best satisfies the targets for every data inputs. After training the network you star testing it against the test data-set. What weights we would use for testing the trained network?
- Adrian Rosebrock
  
  January 5, 2018 at 1:40 pm
  
  The weights you use to test the network would be the weights you obtained from training it. After the network has been trained the weights “freeze” and do not change. If you’re interested in learning more about gradient descent and how it works in the grand scheme of neural networks, take a look at my book, Deep Learning for Computer Vision with Python.
Misha Singla

April 13, 2018 at 7:33 am

Really helpful..but what are the changes required if hinge loss is considered instead of least squared error?
arron.niu

August 3, 2018 at 5:12 am

very thanks. I have pratised the vanilla gradient descent and the sgd according to the two posts. but I training 4000 and 40000 samples. but the execute time of sgd is slowly than the vanilla gradient descent. why?
georgem

November 9, 2018 at 11:40 am

Great post! and Thank you for your work on making this article very well structured and informative.

One question: I understand you split data in mini-batches and you computed gradients on each mini-batch. Why is it called Stochastic Gradient Decent? Because you don’t have any stochasticity, you’re just summing all the gradients in a mini-batch.

Maybe my understanding is wrong.

Thank you!
- Adrian Rosebrock
  
  November 10, 2018 at 9:57 am
  
  True SGD is called “stochastic” because it randomly samples a single data point and then updates the weights. In practice that’s highly inefficient so we use mini-batch gradient descent which has taken the name of SGD.
  - georgem
    
    November 13, 2018 at 2:23 pm
    
    Thanks for your prompt answer Adrian!
    
    “In practice that’s highly inefficient” -> inefficient you mean that time of convergence might be high? True SGD is faster than regular true GD right? True GD computes the gradient for each example in the data so it takes longer.
    
    If we only do mini-batch, with no randomness is that mini-batch GD?
    And if we randomly shuffle the data at each epoch to have different data in our mini-batches is called mini-batch SDG?
    
    Sorry, I hope I’m not being annoying. I’ve been looking everywhere online to clarify this. The deeper I dig, the more confused I get lol.
    
    Thank you!
    - Adrian Rosebrock
      
      November 13, 2018 at 4:10 pm
      
      1. It’s inefficient in time to convergence.
      2. But it also tends to be highly inefficient for large datasets. You’ll be performing too many non-consecutive I/O operations which slows the whole operation down.
      
      If you find yourself getting confused I would suggest working through Deep Learning for Computer Vision with Python where I discuss these concepts in far more detail.
      - georgem
        
        November 15, 2018 at 9:05 am
        
        ok, thank you Adrian!
Disha

November 4, 2019 at 3:24 pm

Hi Adrian,

Apologies if the question seems silly, I’m still an ultra-beginner to ML.
What do we do if i have class 1 and 2 instead of class 0 and class1?

Thanks in advance!
- PRAKHAR GANDHI
  
  November 10, 2019 at 3:11 am
  
  Convert those classes to 0 and 1

Comment section

Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.

At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.

Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.

If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.

Click here to browse my full catalog.

Looking for the source code to this post?

Mini-batch SGD

Implementing Mini-batch SGD

SGD Results

What's next? We recommend PyImageSearch University.

Download the Source Code and FREE 17-page Resource Guide

About the Author

19 responses to: Stochastic Gradient Descent (SGD) with Python

Comment section

PyImageSearch University

Face Recognition with Local Binary Patterns (LBPs) and OpenCV

OpenCV Age Detection with Deep Learning

Pixel Shuffle Super Resolution with TensorFlow, Keras, and Deep Learning

Topics

Books & Courses

PyImageSearch

Looking for the source code to this post?

Mini-batch SGD

Implementing Mini-batch SGD

SGD Results

What's next? We recommend PyImageSearch University.

Download the Source Code and FREE 17-page Resource Guide

About the Author

Reader Interactions

Gradient Descent with Python

Ubuntu 16.04: How to install OpenCV

19 responses to: Stochastic Gradient Descent (SGD) with Python

Comment section

Similar articles

You can learn Computer Vision, Deep Learning, and OpenCV.

Footer

Topics

Books & Courses

PyImageSearch

Access the code to this tutorial and all other 500+ tutorials on PyImageSearch

What's included in PyImageSearch University?