Every relationship has its building blocks. Love. Trust. Mutual respect.

**Yesterday, I asked my girlfriend of 7.5 years to marry me. She said yes**

*.*

It was quite literally the happiest day of my life. I feel like the luckiest guy in the world, not only because I have her, but also because this incredible PyImageSearch community has been so supportive over the past 3 years. **Thank you for being on this journey with me.**

And just like love and marriage have a set of building blocks, so do machine learning and neural network classifiers.

Over the past few weeks we opened our discussion of machine learning and neural networks with an introduction to linear classification that discussed the concept of *parameterized learning*, and how this type of learning enables us to define a *scoring function* that maps our input data to output class labels.

This scoring function is defined in terms of *parameters*; specifically, our weight matrix *W* and our bias vector *b*. Our scoring function accepts these parameters as inputs and returns a *predicted* class label for each input data point .

From there, we discussed two common loss functions: Multi-class SVM loss and cross-entropy loss (commonly referred to in the same breath as “Softmax classifiers”). Loss functions, at the most basic level, are used to quantify how “good” or “bad” a given predictor (i.e., a set of parameters) are at classifying the input data points in our dataset.

Given these building blocks, we can now move on to arguably the most important aspect of machine learning, neural networks, and deep learning — **optimization.**

Throughout this discussion we’ve learned that high classification accuracy is *dependent* on finding a set of weights *W* such that our data points are correctly classified. Given *W*, can compute our output class labels via our *scoring function*. And finally, we can determine how good/poor our classifications are given some *W* via our *loss function*.

**But how do we go about finding and obtaining a weight matrix W that obtains high classification accuracy?**

Do we randomly initialize *W*, evaluate, and repeat over and over again, * hoping *that at some point we land on a

*W*that obtains reasonable classification accuracy?

Well we could — and it some cases that might work just fine.

But in most situations, we instead need to define an *optimization algorithm* that allows us to *iteratively improve* our weight matrix *W*.

In today’s blog post, we’ll be looking at arguably the most common algorithm used to *find* optimal values of *W* — **gradient descent.**

#### Looking for the source code to this post?

Jump Right To The Downloads Section## Gradient descent with Python

The gradient descent algorithm comes in two flavors:

- The standard “vanilla” implementation.
- The optimized “stochastic” version that is more commonly used.

Today well be reviewing the basic vanilla implementation to form a baseline for our understanding. Then next week I’ll be discussing the stochastic version of gradient descent.

### Gradient descent is an optimization algorithm

The gradient descent method is an *iterative optimization algorithm* that operates over a *loss landscape.*

We can visualize our loss landscape as a bowl, similar to the one you may eat cereal or soup out of:

The surface of our bowl is called our *loss landscape*, which is essentially a *plot* of our loss function.

The difference between our loss landscape and your cereal bowl is that your cereal bowl only exists in three dimensions, while your loss landscape exists in *many dimensions*, perhaps tens, hundreds, or even thousands of dimensions.

Each position along the surface of the bowl corresponds to a particular *loss value* given our set of parameters, *W* (weight matrix) and *b* (bias vector).

Our goal is to try different values of *W* and *b*, evaluate their loss, and then take a step towards more optimal values that will (ideally) have lower loss.

Iteratively repeating this process will allow us to navigate our loss landscape, following the gradient of the loss function (the bowl), and find a set of parameters that have minimum loss and high classification accuracy.

### The “gradient” in gradient descent

To make our explanation of gradient descent a little more intuitive, let’s pretend that we have a robot — let’s name him Chad:

We place Chad on a random position in our bowl (i.e., the loss landscape):

It’s now Chad’s job to navigate to the bottom of the basin (where there is minimum loss).

Seems easy enough, right? All Chad has to do is orient himself such that he’s facing “downhill” and then ride the slope until he reaches the bottom of the basin.

But we have a problem: Chad isn’t a very smart robot.

Chad only has one sensor — this sensor allows him to take his weight matrix *W* and compute a loss function *L*.

Therefore, Chad is able to compute his (relative) position on the loss landscape, but he has *absolutely no idea* in which direction he should take a step to move himself closer to the bottom of the basin.

What is Chad to do?

**The answer is to apply gradient descent.**

All we need to do is follow the slope of the gradient *W*. We can compute the gradient of *W* across all dimensions using the following equation:

In *> 1* dimensions, our gradient becomes a *vector* of *partial derivatives.*

The problem with this equation is that:

- It’s an
*approximation*to the gradient. - It’s very slow.

In practice, we use the *analytic gradient* instead. This method is exact, fast, but extremely challenging to implement due to partial derivatives and multivariable calculus. You can read more about the numeric and analytic gradients here.

For the sake of this discussion, simply try to internalize what gradient descent is doing: attempting to optimize our parameters for low loss and high classification accuracy.

### Pseudocode for gradient descent

Below I have included some Python-like pseudocode of the standard, vanilla gradient descent algorithm, inspired by the CS231n slides:

while True: Wgradient = evaluate_gradient(loss, data, W) W += -alpha * Wgradient

This pseudocode is essentially what *all* variations of gradient descent are built off of.

We start off on **Line 1** by looping until some condition is met. Normally this condition is either:

- A specified number of epochs has passed (meaning our learning algorithm has “seen” each of the training data points
*N*times). - Our loss has become
*sufficiently low*or training accuracy*satisfactorily high*. - Loss has not improved in
*M*subsequent epochs.

**Line 2** then calls a function named `evaluate_gradient`

. This function requires three parameters:

`loss`

: A function used to compute the*loss*over our current parameters*W*and input`data`

.`data`

: Our training data where each training sample is represented by a feature vector.`W`

: This is actually our weight matrix that we are optimizing over. Our goal is to apply gradient descent to find a*W*that yields minimal loss.

The `evaluate_gradient`

function returns a vector that is *K-*dimensional, where *K* is the number of dimensions in our feature vector. The `Wgradient`

variable is actually our *gradient*, where we have a gradient entry for each dimension.

We then apply the actual *gradient descent* on **Line 3**.

We multiply our `Wgradient`

by `alpha`

, which is our learning rate. **The learning rate controls the size of our step.**

In practice, you’ll spend a lot of time finding an optimal learning rate `alpha`

— it is *by far* the most important parameter in your model.

If `alpha`

is too large, we’ll end up spending all our time bouncing around our loss landscape and never actually “descending” to the bottom of our basin (unless our random bouncing takes us there by pure luck).

Conversely, if `alpha`

is too small, then it will take *many* (perhaps prohibitively many) iterations to reach the bottom of the basin.

Because of this, `alpha`

will cause you many headaches — and you’ll spend a considerable amount of your time trying to find an optimal value for your classifier and dataset.

### Implementing gradient descent with Python

Now that we know the basics of gradient descent, let’s implement gradient descent in Python and use it to classify some data.

Open up a new file, name it `gradient_descent.py`

, and insert the following code:

# import the necessary packages import matplotlib.pyplot as plt from sklearn.datasets.samples_generator import make_blobs import numpy as np import argparse def sigmoid_activation(x): # compute and return the sigmoid activation value for a # given input value return 1.0 / (1 + np.exp(-x))

**Lines 2-5** import our required Python packages.

We then define the `sigmoid_activation`

function on **Line 7**. When plotted, this function will resemble an “S”-shaped curve:

We call this an activation function because the function will “activate” and fire “ON” (*output value >= 0.5)* or “OFF” (*output vale < 0.5)* based on the inputs `x`

.

While there are other (better) alternatives to the sigmoid activation function, it makes for an excellent “starting point” in our discussion of machine learning, neural networks, and deep learning.

I’ll also be discussing activation functions in more detail in a future blog post, so for the time being, simply keep in mind that this is a non-linear activation function that we can use to “threshold” our predictions.

Next, let’s parse our command line arguments:

# construct the argument parse and parse the arguments ap = argparse.ArgumentParser() ap.add_argument("-e", "--epochs", type=float, default=100, help="# of epochs") ap.add_argument("-a", "--alpha", type=float, default=0.01, help="learning rate") args = vars(ap.parse_args())

We can provide two (optional) command line arguments to our script:

`--epochs`

: The number of epochs that we’ll use when training our classifier using gradient descent.`--alpha`

: The*learning rate*for gradient descent. We typically see*0.1*,*0.01*, and*0.001*as initial learning rate values, but again, you’ll want to tune this hyperparameter for your own classification problems.

Now that our command line arguments are parsed, let’s generate some data to classify:

# generate a 2-class classification problem with 250 data points, # where each data point is a 2D feature vector (X, y) = make_blobs(n_samples=250, n_features=2, centers=2, cluster_std=1.05, random_state=20) # insert a column of 1's as the first entry in the feature # vector -- this is a little trick that allows us to treat # the bias as a trainable parameter *within* the weight matrix # rather than an entirely separate variable X = np.c_[np.ones((X.shape[0])), X] # initialize our weight matrix such it has the same number of # columns as our input features print("[INFO] starting training...") W = np.random.uniform(size=(X.shape[1],)) # initialize a list to store the loss value for each epoch lossHistory = []

On **Line 22** we make a call to `make_blobs`

which generates 250 data points. These data points are 2D, implying that the “feature vectors” are of length *2*.

Furthermore, 125 of these data points belong to *class 0* and the other 125 to *class 1*. Our goal is to train a classifier that correctly predicts each data point as being *class 0* or *class 1*.

**Line 29** applies a neat little trick that allows us to skip *explicitly* keeping track of our bias vector *b*. To accomplish this, we insert a brand new column of *1’s* as the first entry in our feature vector. This addition of a column containing a constant value across *all* feature vectors allows us to treat our bias as a *trainable parameter* that is ** within** the weight matrix

*W*rather than an entirely separate variable. You can learn more about this trick here and here.

**Line 34** (randomly) initializes our weight matrix such that it has the same number of dimensions as our input features.

It’s also common to see both *zero* and *one* weight initialization, but I tend to prefer random initialization better. Weight initialization methods will be discussed in further detail inside future neural network and deep learning blog posts.

Finally, **Line 37** initializes a list to keep track of our loss after each epoch. At the end of our Python script, we’ll plot the loss which should ideally decrease over time.

All of our variables are now initialized, so we can move on to the actual training and gradient descent procedure:

# loop over the desired number of epochs for epoch in np.arange(0, args["epochs"]): # take the dot product between our features `X` and the # weight matrix `W`, then pass this value through the # sigmoid activation function, thereby giving us our # predictions on the dataset preds = sigmoid_activation(X.dot(W)) # now that we have our predictions, we need to determine # our `error`, which is the difference between our predictions # and the true values error = preds - y # given our `error`, we can compute the total loss value as # the sum of squared loss -- ideally, our loss should # decrease as we continue training loss = np.sum(error ** 2) lossHistory.append(loss) print("[INFO] epoch #{}, loss={:.7f}".format(epoch + 1, loss))

On **Line 40** we start looping over the supplied number of `--epochs`

. By default, we’ll allow our training procedure to “see” each of the training points a total of 100 times (thus, 100 epochs).

**Line 45** takes the dot product between our *entire* training data `X`

and our weight matrix `W`

. We take the output of this dot product and feed the values through the sigmoid activation function, giving us our predictions.

Given our predictions, the next step is to determine the “error” of the predictions, or more simply, the difference between our *predictions* and the *true values* (**Line 50**).

**Line 55** computes the least squares error over our predictions (our loss value). The goal of this training procedure is thus to minimize the least squares error.

Now that we have our `error`

, we can compute the `gradient`

and then use it to update our weight matrix `W`

:

# the gradient update is therefore the dot product between # the transpose of `X` and our error, scaled by the total # number of data points in `X` gradient = X.T.dot(error) / X.shape[0] # in the update stage, all we need to do is nudge our weight # matrix in the negative direction of the gradient (hence the # term "gradient descent" by taking a small step towards a # set of "more optimal" parameters W += -args["alpha"] * gradient

**Line 62** handles computing the actual gradient, which is the dot product between our data points `X`

and the `error`

.

**Line 68** is the most critical step in our algorithm and where the actual gradient descent takes place. Here we update our weight matrix `W`

by taking a `--step`

in the negative direction of the gradient, thereby allowing us to move towards the *bottom* of the basin of the loss landscape (hence the term, *gradient descent*).

After updating our weight matrix, we keep looping until the desired number of epochs has been met — gradient descent is thus an *iterative algorithm.*

To actually demonstrate how we can use our weight matrix *W* as a classifier, take a look at the following code block:

# to demonstrate how to use our weight matrix as a classifier, # let's look over our a sample of training examples for i in np.random.choice(250, 10): # compute the prediction by taking the dot product of the # current feature vector with the weight matrix W, then # passing it through the sigmoid activation function activation = sigmoid_activation(X[i].dot(W)) # the sigmoid function is defined over the range y=[0, 1], # so we can use 0.5 as our threshold -- if `activation` is # below 0.5, it's class `0`; otherwise it's class `1` label = 0 if activation < 0.5 else 1 # show our output classification print("activation={:.4f}; predicted_label={}, true_label={}".format( activation, label, y[i]))

We start on on **Line 72** by looping over a sample of our training data.

For each training point `X[i]`

we compute the dot product between `X[i]`

and the weight matrix `W`

, then feed the value through our activation function.

On **Line 81**, we compute the actual output class label. If the `activation`

is *< 0.5*, then the output is *class 0*; otherwise, the output is *class 1*.

Our last code block is used to plot our training data along with the *decision boundary* that is used to determine if a given data point is *class 0* or *class 1*:

# compute the line of best fit by setting the sigmoid function # to 0 and solving for X2 in terms of X1 Y = (-W[0] - (W[1] * X)) / W[2] # plot the original data along with our line of best fit plt.figure() plt.scatter(X[:, 1], X[:, 2], marker="o", c=y) plt.plot(X, Y, "r-") # construct a figure that plots the loss over time fig = plt.figure() plt.plot(np.arange(0, args["epochs"]), lossHistory) fig.suptitle("Training Loss") plt.xlabel("Epoch #") plt.ylabel("Loss") plt.show()

### Visualizing gradient descent

To test our gradient descent classifier, be sure to download the source code using the * “Downloads”* section at the bottom of this tutorial.

From there, execute the following command:

$ python gradient_descent.py

Examining the output, you’ll notice that our classifier runs for a total of 100 epochs with the loss *decreasing* and classification accuracy *increasing* after each epoch:

To visualize this better, take a look at the plot below which demonstrates how our loss over time has decreased dramatically:

We can then see a plot of our training data points along with the decision boundary learned by our gradient descent classifier:

Notice how the decision boundary learned by our gradient descent classifier neatly divides data points of the two classes.

We then then manually investigate the classifications made by our gradient descent model. In each case, we are able to correctly predict the class:

To visualize and demonstrate gradient descent in action, I have created the following animation which shows the decision boundary being “learned” after each epoch:

As you can see, our decision boundary starts off widely inaccurate due to the random initialization. But as time passes, we are able to apply gradient descent, update our weight matrix *W*, and eventually learn an accurate model.

### Want to learn more about gradient descent?

In next week’s blog post, I’ll be discussing a slight modification to gradient descent called *Stochastic Gradient Descent* (SGD).

In the meantime, if you want to learn more about gradient descent, you should absolutely refer to Andrew Ng’s gradient descent lesson in the Coursera Machine Learning course.

I would also recommend Andrej Karpathy’s excellent slides from the CS231n course.

## Summary

In this blog post we learned about *gradient descent*, a first-order optimization algorithm that can be used to learn a set of parameters that will (ideally) obtain low loss and high classification accuracy on a given problem.

I then demonstrated how to implement a basic gradient descent algorithm using Python. Using this implementation, we were able to actually *visualize* how gradient descent can be used to learn and optimize our weight matrix *W*.

In next week’s blog post, I’ll be discussing a modification to the vanilla gradient descent implementation called *Stochastic Gradient Descent *(SGD). The SGD flavor of gradient descent is more commonly used than the one we introduced today, but I’ll save a more thorough discussion for next week.

See you then!

**Before you go, be sure to use the form below to sign up for the PyImageSearch Newsletter — you’ll then be notified when future blog posts are published.**

#### Download the Source Code and FREE 17-page Resource Guide

Enter your email address below to get a .zip of the code and a **FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.** Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!

joe minichino

Congratulations Adrian. And yes, the article is also very interesting.

Adrian Rosebrock

Thank you very much Joe!

Jason

Didn’t have time to read this but CONGRATULATIONS! I hope your experience of marriage is as fulfilling and wonderful as mine has been.

Adrian Rosebrock

Thank you Jason! 🙂

Sujit Pal

Nice article. Just wanted to point out a typo. In your formula for d(f(x))/dx it should be limit of h tends to 0 not infinity.

Adrian Rosebrock

Thank you for pointing this out Sujit. I have updated the post.

Max Kostka

Thanks for the informative post and the link to the slides, and I can totally recommend the machine learning course by andrew ng at coursera, too. It’s a real good introduction to different machine learning techniques. The hands on approach with the homework is worth every minute spent on it.

Johnny

Fascinating! I love gradient descent studies like this one. I love Adrian’s teaching style. And mostly, love that you shared the news with us! Congratulations you two, many blessings and prayers coming at you!

Adrian Rosebrock

Thank you Johnny! 🙂

Ranjeet Singh

Great tutorial

Kidnly put some Backpropagation tutorials too.

Adrian Rosebrock

I will certainly be doing backpropagation tutorials, likely 2-3 of them. Right now it looks like I should be able to publish them in November/December following my current schedule.

Wajih

This explanation is so beautiful, simple and elegant…

Adrian Rosebrock

Thank you Wajih!

abkul

Thanks for the crystal clear explanation of the topic.

Congratulations for quiting bachelors club.

Adrian Rosebrock

Thank you Abkul!

Moeen

Hey Adrian- Big Congrats. I wish you the best!

Adrian Rosebrock

Thanks Moeen!

Arasch U Lagies

Congratulations Adrian.

Adrian Rosebrock

Thank you Arasch!

vantruong57

Congratulations Adrian.

Adrian Rosebrock

Thank you!

Srinivasan Babu

nice article.

Thanks lot

Adrian Rosebrock

Thank you Srinivasan!

Abkul

kindly do tutorials on “feature selection” in relation to deep learning

Adrian Rosebrock

I’ll certainly be doing more deep learning tutorials in the future. Be sure to keep an eye on the PyImageSearch blog.

zauron

Nice article! It’s my fault but I don’t understand how you calculate the decision boundary:

Y = (-W[0] – (W[1] * X)) / W[2]

Could you elaborate a little more or give me some reference?

Thanks in advance

dpereira

Hi zauron,

you can think of it like this:

In order to draw the decision boundary, you need to draw only the points (x,y) which lie right over the boundary.

According to the sigmoid function, the boundary is the value 0.5. So, in order to obtain a 0.5, you need to provide a zero value as input to the sigmoid (That is, a zero value as output from the scoring function).

Thus, if the scoring function equals zero:

0 = w0 + w1*x + w2*y ==> y = (-w0 – w1*x)/w2

You can use any x’s coordinates you want, and you’ll get the proper y’s coordinates to draw the boundary

Chahrazad

hello, it is a bit confusiong to me how the gradient was computed :

gradient = X.T.dot(error) / X.shape[0] ,

shouldn’t this computation be true if the loss function was derived using maximum likelihood estimation not the squared error ?

hrushi

Maybe this post can help in getting the result of gradient

https://ai.stackexchange.com/questions/19895/different-methods-of-calculating-gradients-of-cost-functionloss-function

foobar

Great stuff, thanks!

“In the meantime, if you want to learn more about gradient descent, you should absolutely refer to Andrew Ng’s gradient descent lesson in the Coursera Machine Learning course.

I would also recommend Andrej Karpathy’s excellent slides from the CS231n course.”

Both links are dead.

Adrian Rosebrock

Thank you for letting me know! I have updated both links so they are now working.

Tri Dao

I think the gradient is for logistic loss, not the squared loss you’re using.

Niki

Thank you, an interesting tutorial! I’m a little bit confused though. When calculating the gradient, we try to minimize the loss function, which means we need to take the derivative of the loss function. The loss function is the sum of the square of the errors, with error being defined as the actual label minus the predicted label. Here the predicted labels are calculated using Sigmoid function. This means the gradient will include the gradient of the Sigmoid function, but here I see the gradient of a linear predictor function.

Could you elaborate more on what has been done. Thank you!

Niki

My bad! I got error for W in the gradient formula.

Adrian Rosebrock

Congrats on figuring it out Niki, nice job 🙂

Kiro

I think you are right. There is a mistake. Shouldn’t it be like that:

gradient = X.T.dot(error) * preds*(1-preds)

where preds*(1-preds) is the derivative of the sigmoid function?

Thanks!

Ben

Congratulations on your pending nuptial.

Adrian Rosebrock

Thanks Ben!

Dipankar

Instead of sigmoid who can I use RelU activation function here?

Adrian Rosebrock

Yes, you can.