Every relationship has its building blocks. Love. Trust. Mutual respect.
Yesterday, I asked my girlfriend of 7.5 years to marry me. She said yes.
It was quite literally the happiest day of my life. I feel like the luckiest guy in the world, not only because I have her, but also because this incredible PyImageSearch community has been so supportive over the past 3 years. Thank you for being on this journey with me.
And just like love and marriage have a set of building blocks, so do machine learning and neural network classifiers.
Over the past few weeks we opened our discussion of machine learning and neural networks with an introduction to linear classification that discussed the concept of parameterized learning, and how this type of learning enables us to define a scoring function that maps our input data to output class labels.
This scoring function is defined in terms of parameters; specifically, our weight matrix W and our bias vector b. Our scoring function accepts these parameters as inputs and returns a predicted class label for each input data point .
From there, we discussed two common loss functions: Multi-class SVM loss and cross-entropy loss (commonly referred to in the same breath as “Softmax classifiers”). Loss functions, at the most basic level, are used to quantify how “good” or “bad” a given predictor (i.e., a set of parameters) are at classifying the input data points in our dataset.
Given these building blocks, we can now move on to arguably the most important aspect of machine learning, neural networks, and deep learning — optimization.
Throughout this discussion we’ve learned that high classification accuracy is dependent on finding a set of weights W such that our data points are correctly classified. Given W, can compute our output class labels via our scoring function. And finally, we can determine how good/poor our classifications are given some W via our loss function.
But how do we go about finding and obtaining a weight matrix W that obtains high classification accuracy?
Do we randomly initialize W, evaluate, and repeat over and over again, hoping that at some point we land on a W that obtains reasonable classification accuracy?
Well we could — and it some cases that might work just fine.
But in most situations, we instead need to define an optimization algorithm that allows us to iteratively improve our weight matrix W.
In today’s blog post, we’ll be looking at arguably the most common algorithm used to find optimal values of W — gradient descent.
Looking for the source code to this post?
Jump right to the downloads section.
Gradient descent with Python
The gradient descent algorithm comes in two flavors:
- The standard “vanilla” implementation.
- The optimized “stochastic” version that is more commonly used.
Today well be reviewing the basic vanilla implementation to form a baseline for our understanding. Then next week I’ll be discussing the stochastic version of gradient descent.
Gradient descent is an optimization algorithm
The gradient descent method is an iterative optimization algorithm that operates over a loss landscape.
We can visualize our loss landscape as a bowl, similar to the one you may eat cereal or soup out of:
The surface of our bowl is called our loss landscape, which is essentially a plot of our loss function.
The difference between our loss landscape and your cereal bowl is that your cereal bowl only exists in three dimensions, while your loss landscape exists in many dimensions, perhaps tens, hundreds, or even thousands of dimensions.
Each position along the surface of the bowl corresponds to a particular loss value given our set of parameters, W (weight matrix) and b (bias vector).
Our goal is to try different values of W and b, evaluate their loss, and then take a step towards more optimal values that will (ideally) have lower loss.
Iteratively repeating this process will allow us to navigate our loss landscape, following the gradient of the loss function (the bowl), and find a set of parameters that have minimum loss and high classification accuracy.
The “gradient” in gradient descent
To make our explanation of gradient descent a little more intuitive, let’s pretend that we have a robot — let’s name him Chad:
We place Chad on a random position in our bowl (i.e., the loss landscape):
It’s now Chad’s job to navigate to the bottom of the basin (where there is minimum loss).
Seems easy enough, right? All Chad has to do is orient himself such that he’s facing “downhill” and then ride the slope until he reaches the bottom of the basin.
But we have a problem: Chad isn’t a very smart robot.
Chad only has one sensor — this sensor allows him to take his weight matrix W and compute a loss function L.
Therefore, Chad is able to compute his (relative) position on the loss landscape, but he has absolutely no idea in which direction he should take a step to move himself closer to the bottom of the basin.
What is Chad to do?
The answer is to apply gradient descent.
All we need to do is follow the slope of the gradient W. We can compute the gradient of W across all dimensions using the following equation:
In > 1 dimensions, our gradient becomes a vector of partial derivatives.
The problem with this equation is that:
- It’s an approximation to the gradient.
- It’s very slow.
In practice, we use the analytic gradient instead. This method is exact, fast, but extremely challenging to implement due to partial derivatives and multivariable calculus. You can read more about the numeric and analytic gradients here.
For the sake of this discussion, simply try to internalize what gradient descent is doing: attempting to optimize our parameters for low loss and high classification accuracy.
Pseudocode for gradient descent
Below I have included some Python-like pseudocode of the standard, vanilla gradient descent algorithm, inspired by the CS231n slides:
1 2 3 |
while True: Wgradient = evaluate_gradient(loss, data, W) W += -alpha * Wgradient |
This pseudocode is essentially what all variations of gradient descent are built off of.
We start off on Line 1 by looping until some condition is met. Normally this condition is either:
- A specified number of epochs has passed (meaning our learning algorithm has “seen” each of the training data points N times).
- Our loss has become sufficiently low or training accuracy satisfactorily high.
- Loss has not improved in M subsequent epochs.
Line 2 then calls a function named evaluate_gradient . This function requires three parameters:
- loss : A function used to compute the loss over our current parameters W and input data .
- data : Our training data where each training sample is represented by a feature vector.
- W : This is actually our weight matrix that we are optimizing over. Our goal is to apply gradient descent to find a W that yields minimal loss.
The evaluate_gradient function returns a vector that is K-dimensional, where K is the number of dimensions in our feature vector. The Wgradient variable is actually our gradient, where we have a gradient entry for each dimension.
We then apply the actual gradient descent on Line 3.
We multiply our Wgradient by alpha , which is our learning rate. The learning rate controls the size of our step.
In practice, you’ll spend a lot of time finding an optimal learning rate alpha — it is by far the most important parameter in your model.
If alpha is too large, we’ll end up spending all our time bouncing around our loss landscape and never actually “descending” to the bottom of our basin (unless our random bouncing takes us there by pure luck).
Conversely, if alpha is too small, then it will take many (perhaps prohibitively many) iterations to reach the bottom of the basin.
Because of this, alpha will cause you many headaches — and you’ll spend a considerable amount of your time trying to find an optimal value for your classifier and dataset.
Implementing gradient descent with Python
Now that we know the basics of gradient descent, let’s implement gradient descent in Python and use it to classify some data.
Open up a new file, name it gradient_descent.py , and insert the following code:
1 2 3 4 5 6 7 8 9 10 |
# import the necessary packages import matplotlib.pyplot as plt from sklearn.datasets.samples_generator import make_blobs import numpy as np import argparse def sigmoid_activation(x): # compute and return the sigmoid activation value for a # given input value return 1.0 / (1 + np.exp(-x)) |
Lines 2-5 import our required Python packages.
We then define the sigmoid_activation function on Line 7. When plotted, this function will resemble an “S”-shaped curve:
We call this an activation function because the function will “activate” and fire “ON” (output value >= 0.5) or “OFF” (output vale < 0.5) based on the inputs x .
While there are other (better) alternatives to the sigmoid activation function, it makes for an excellent “starting point” in our discussion of machine learning, neural networks, and deep learning.
I’ll also be discussing activation functions in more detail in a future blog post, so for the time being, simply keep in mind that this is a non-linear activation function that we can use to “threshold” our predictions.
Next, let’s parse our command line arguments:
12 13 14 15 16 17 18 |
# construct the argument parse and parse the arguments ap = argparse.ArgumentParser() ap.add_argument("-e", "--epochs", type=float, default=100, help="# of epochs") ap.add_argument("-a", "--alpha", type=float, default=0.01, help="learning rate") args = vars(ap.parse_args()) |
We can provide two (optional) command line arguments to our script:
- --epochs : The number of epochs that we’ll use when training our classifier using gradient descent.
- --alpha : The learning rate for gradient descent. We typically see 0.1, 0.01, and 0.001 as initial learning rate values, but again, you’ll want to tune this hyperparameter for your own classification problems.
Now that our command line arguments are parsed, let’s generate some data to classify:
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
# generate a 2-class classification problem with 250 data points, # where each data point is a 2D feature vector (X, y) = make_blobs(n_samples=250, n_features=2, centers=2, cluster_std=1.05, random_state=20) # insert a column of 1's as the first entry in the feature # vector -- this is a little trick that allows us to treat # the bias as a trainable parameter *within* the weight matrix # rather than an entirely separate variable X = np.c_[np.ones((X.shape[0])), X] # initialize our weight matrix such it has the same number of # columns as our input features print("[INFO] starting training...") W = np.random.uniform(size=(X.shape[1],)) # initialize a list to store the loss value for each epoch lossHistory = [] |
On Line 22 we make a call to make_blobs which generates 250 data points. These data points are 2D, implying that the “feature vectors” are of length 2.
Furthermore, 125 of these data points belong to class 0 and the other 125 to class 1. Our goal is to train a classifier that correctly predicts each data point as being class 0 or class 1.
Line 29 applies a neat little trick that allows us to skip explicitly keeping track of our bias vector b. To accomplish this, we insert a brand new column of 1’s as the first entry in our feature vector. This addition of a column containing a constant value across all feature vectors allows us to treat our bias as a trainable parameter that is within the weight matrix W rather than an entirely separate variable. You can learn more about this trick here and here.
Line 34 (randomly) initializes our weight matrix such that it has the same number of dimensions as our input features.
It’s also common to see both zero and one weight initialization, but I tend to prefer random initialization better. Weight initialization methods will be discussed in further detail inside future neural network and deep learning blog posts.
Finally, Line 37 initializes a list to keep track of our loss after each epoch. At the end of our Python script, we’ll plot the loss which should ideally decrease over time.
All of our variables are now initialized, so we can move on to the actual training and gradient descent procedure:
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 |
# loop over the desired number of epochs for epoch in np.arange(0, args["epochs"]): # take the dot product between our features `X` and the # weight matrix `W`, then pass this value through the # sigmoid activation function, thereby giving us our # predictions on the dataset preds = sigmoid_activation(X.dot(W)) # now that we have our predictions, we need to determine # our `error`, which is the difference between our predictions # and the true values error = preds - y # given our `error`, we can compute the total loss value as # the sum of squared loss -- ideally, our loss should # decrease as we continue training loss = np.sum(error ** 2) lossHistory.append(loss) print("[INFO] epoch #{}, loss={:.7f}".format(epoch + 1, loss)) |
On Line 40 we start looping over the supplied number of --epochs . By default, we’ll allow our training procedure to “see” each of the training points a total of 100 times (thus, 100 epochs).
Line 45 takes the dot product between our entire training data X and our weight matrix W . We take the output of this dot product and feed the values through the sigmoid activation function, giving us our predictions.
Given our predictions, the next step is to determine the “error” of the predictions, or more simply, the difference between our predictions and the true values (Line 50).
Line 55 computes the least squares error over our predictions (our loss value). The goal of this training procedure is thus to minimize the least squares error.
Now that we have our error , we can compute the gradient and then use it to update our weight matrix W :
59 60 61 62 63 64 65 66 67 68 |
# the gradient update is therefore the dot product between # the transpose of `X` and our error, scaled by the total # number of data points in `X` gradient = X.T.dot(error) / X.shape[0] # in the update stage, all we need to do is nudge our weight # matrix in the negative direction of the gradient (hence the # term "gradient descent" by taking a small step towards a # set of "more optimal" parameters W += -args["alpha"] * gradient |
Line 62 handles computing the actual gradient, which is the dot product between our data points X and the error .
Line 68 is the most critical step in our algorithm and where the actual gradient descent takes place. Here we update our weight matrix W by taking a --step in the negative direction of the gradient, thereby allowing us to move towards the bottom of the basin of the loss landscape (hence the term, gradient descent).
After updating our weight matrix, we keep looping until the desired number of epochs has been met — gradient descent is thus an iterative algorithm.
To actually demonstrate how we can use our weight matrix W as a classifier, take a look at the following code block:
70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 |
# to demonstrate how to use our weight matrix as a classifier, # let's look over our a sample of training examples for i in np.random.choice(250, 10): # compute the prediction by taking the dot product of the # current feature vector with the weight matrix W, then # passing it through the sigmoid activation function activation = sigmoid_activation(X[i].dot(W)) # the sigmoid function is defined over the range y=[0, 1], # so we can use 0.5 as our threshold -- if `activation` is # below 0.5, it's class `0`; otherwise it's class `1` label = 0 if activation < 0.5 else 1 # show our output classification print("activation={:.4f}; predicted_label={}, true_label={}".format( activation, label, y[i])) |
We start on on Line 72 by looping over a sample of our training data.
For each training point X[i] we compute the dot product between X[i] and the weight matrix W , then feed the value through our activation function.
On Line 81, we compute the actual output class label. If the activation is < 0.5, then the output is class 0; otherwise, the output is class 1.
Our last code block is used to plot our training data along with the decision boundary that is used to determine if a given data point is class 0 or class 1:
87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 |
# compute the line of best fit by setting the sigmoid function # to 0 and solving for X2 in terms of X1 Y = (-W[0] - (W[1] * X)) / W[2] # plot the original data along with our line of best fit plt.figure() plt.scatter(X[:, 1], X[:, 2], marker="o", c=y) plt.plot(X, Y, "r-") # construct a figure that plots the loss over time fig = plt.figure() plt.plot(np.arange(0, args["epochs"]), lossHistory) fig.suptitle("Training Loss") plt.xlabel("Epoch #") plt.ylabel("Loss") plt.show() |
Visualizing gradient descent
To test our gradient descent classifier, be sure to download the source code using the “Downloads” section at the bottom of this tutorial.
From there, execute the following command:
1 |
$ python gradient_descent.py |
Examining the output, you’ll notice that our classifier runs for a total of 100 epochs with the loss decreasing and classification accuracy increasing after each epoch:
To visualize this better, take a look at the plot below which demonstrates how our loss over time has decreased dramatically:
We can then see a plot of our training data points along with the decision boundary learned by our gradient descent classifier:
Notice how the decision boundary learned by our gradient descent classifier neatly divides data points of the two classes.
We then then manually investigate the classifications made by our gradient descent model. In each case, we are able to correctly predict the class:
To visualize and demonstrate gradient descent in action, I have created the following animation which shows the decision boundary being “learned” after each epoch:
As you can see, our decision boundary starts off widely inaccurate due to the random initialization. But as time passes, we are able to apply gradient descent, update our weight matrix W, and eventually learn an accurate model.
Want to learn more about gradient descent?
In next week’s blog post, I’ll be discussing a slight modification to gradient descent called Stochastic Gradient Descent (SGD).
In the meantime, if you want to learn more about gradient descent, you should absolutely refer to Andrew Ng’s gradient descent lesson in the Coursera Machine Learning course.
I would also recommend Andrej Karpathy’s excellent slides from the CS231n course.
Summary
In this blog post we learned about gradient descent, a first-order optimization algorithm that can be used to learn a set of parameters that will (ideally) obtain low loss and high classification accuracy on a given problem.
I then demonstrated how to implement a basic gradient descent algorithm using Python. Using this implementation, we were able to actually visualize how gradient descent can be used to learn and optimize our weight matrix W.
In next week’s blog post, I’ll be discussing a modification to the vanilla gradient descent implementation called Stochastic Gradient Descent (SGD). The SGD flavor of gradient descent is more commonly used than the one we introduced today, but I’ll save a more thorough discussion for next week.
See you then!
Before you go, be sure to use the form below to sign up for the PyImageSearch Newsletter — you’ll then be notified when future blog posts are published.
Congratulations Adrian. And yes, the article is also very interesting.
Thank you very much Joe!
Didn’t have time to read this but CONGRATULATIONS! I hope your experience of marriage is as fulfilling and wonderful as mine has been.
Thank you Jason! 🙂
Nice article. Just wanted to point out a typo. In your formula for d(f(x))/dx it should be limit of h tends to 0 not infinity.
Thank you for pointing this out Sujit. I have updated the post.
Thanks for the informative post and the link to the slides, and I can totally recommend the machine learning course by andrew ng at coursera, too. It’s a real good introduction to different machine learning techniques. The hands on approach with the homework is worth every minute spent on it.
Fascinating! I love gradient descent studies like this one. I love Adrian’s teaching style. And mostly, love that you shared the news with us! Congratulations you two, many blessings and prayers coming at you!
Thank you Johnny! 🙂
Great tutorial
Kidnly put some Backpropagation tutorials too.
I will certainly be doing backpropagation tutorials, likely 2-3 of them. Right now it looks like I should be able to publish them in November/December following my current schedule.
This explanation is so beautiful, simple and elegant…
Thank you Wajih!
Thanks for the crystal clear explanation of the topic.
Congratulations for quiting bachelors club.
Thank you Abkul!
Hey Adrian- Big Congrats. I wish you the best!
Thanks Moeen!
Congratulations Adrian.
Thank you Arasch!
Congratulations Adrian.
Thank you!
nice article.
Thanks lot
Thank you Srinivasan!
kindly do tutorials on “feature selection” in relation to deep learning
I’ll certainly be doing more deep learning tutorials in the future. Be sure to keep an eye on the PyImageSearch blog.
Nice article! It’s my fault but I don’t understand how you calculate the decision boundary:
Y = (-W[0] – (W[1] * X)) / W[2]
Could you elaborate a little more or give me some reference?
Thanks in advance
Hi zauron,
you can think of it like this:
In order to draw the decision boundary, you need to draw only the points (x,y) which lie right over the boundary.
According to the sigmoid function, the boundary is the value 0.5. So, in order to obtain a 0.5, you need to provide a zero value as input to the sigmoid (That is, a zero value as output from the scoring function).
Thus, if the scoring function equals zero:
0 = w0 + w1*x + w2*y ==> y = (-w0 – w1*x)/w2
You can use any x’s coordinates you want, and you’ll get the proper y’s coordinates to draw the boundary
hello, it is a bit confusiong to me how the gradient was computed :
gradient = X.T.dot(error) / X.shape[0] ,
shouldn’t this computation be true if the loss function was derived using maximum likelihood estimation not the squared error ?
Great stuff, thanks!
“In the meantime, if you want to learn more about gradient descent, you should absolutely refer to Andrew Ng’s gradient descent lesson in the Coursera Machine Learning course.
I would also recommend Andrej Karpathy’s excellent slides from the CS231n course.”
Both links are dead.
Thank you for letting me know! I have updated both links so they are now working.
I think the gradient is for logistic loss, not the squared loss you’re using.
Thank you, an interesting tutorial! I’m a little bit confused though. When calculating the gradient, we try to minimize the loss function, which means we need to take the derivative of the loss function. The loss function is the sum of the square of the errors, with error being defined as the actual label minus the predicted label. Here the predicted labels are calculated using Sigmoid function. This means the gradient will include the gradient of the Sigmoid function, but here I see the gradient of a linear predictor function.
Could you elaborate more on what has been done. Thank you!
My bad! I got error for W in the gradient formula.
Congrats on figuring it out Niki, nice job 🙂
I think you are right. There is a mistake. Shouldn’t it be like that:
gradient = X.T.dot(error) * preds*(1-preds)
where preds*(1-preds) is the derivative of the sigmoid function?
Thanks!
Congratulations on your pending nuptial.
Thanks Ben!
Instead of sigmoid who can I use RelU activation function here?
Yes, you can.