In Short: We need these different ways of implementing gradient descent to address several issues we will most certainly encounter when training Neural Networks which are local minima and saddle points of the loss function and noisy gradients.
More on that will be explained in the following article — nice ;)
Table of Content
- 1. Introduction: Let's recap Gradient Descent
- 2. Common Problems when Training Neural Networks (local minima, saddle points, noisy gradients)
- 3. Batch Gradient Descent
- 4. Stochastic Gradient Descent
- 5. Mini-Batch Gradient Descent
- 6. Take-Home-Message
1. Introduction: Let's recap Gradient Descent
Before we address the different approaches to implement gradient descent, I think it would be a good idea to refresh your memory on what gradient descent actually is.
When we train a neural network, we want it to learn to perform a specific task. This task can be as simple as predicting the expected demand for a product in a particular market or performing classification of skin cancer.
Regardless of this task, our only goal during training is to minimize the objective/ loss function. For predictions of the expected demand, which is a regression task, this loss function would be the Mean Squared Error (MSE) loss function:
For classification tasks, we want to minimize the Cross-Entropy loss function:
Before we can minimize a loss function however, the neural network must compute an output. This is accomplished during the forward propagation step when the network receives an input feature vector x, performs several dot-products and non-linear operations in the hidden layers and outputs a prediction y. (For a more detailed explanation of the forward propagation step please refer to this article). This step is represented in the following image:
The equations for the computation of the hidden values as well as the final prediction vector y looks as follows:
The prediction y and the ground truth label (the value we actually want to predict) are both included in the loss function to compute a quantitative metric that indicates how accurate the network prediction is.
A lower value of the loss function results in a lower error between the prediction and the label, and vise versa.
To get the lowest possible value of the loss function, we must adjust the parameters of the neural network, which are the weights and biases.
And this is where gradient descent comes into play.
We must compute the negative derivative (gradient) of the loss function with respect to these weights and biases. In the next step, these parameters are updated towards the directions of this gradient. The following equation represents the update step for an arbitrary weight matrix.
Each time an update is performed, the weights and biases move closer and closer to the optimal set of values for which the loss function will have a global minimum. And this the predictions will be as close as possible to the ground truth label we want to predict.
Gradient descent is the backbone of neural networks training and the entire field of deep learning. This method enables us to teach neural networks to perform arbitrary tasks without explicitly program them for it.
As long as the neural network can minimize a loss function the network will eventually be able to do what we want it to do.
Gradient Descent will encounter Problems during Training
Of course, as usual, it is easier said than done. While descending along the negative gradient of the loss function to the optimal weights, we will most certainly face multiple problems such as local minima, saddle points and noisy gradients which can make the training process more problematic for us.
Different approaches to regular gradient descent, which are Stochastic, Batch, and Mini Batch Gradient Descent can properly handle these problems — although not every method solves every problem. It is up to you to decide which methods work best for your current problem.
Because of this, we will discuss in the following different approaches to implementing the gradient descent algorithm in more detail as well as their distinct advantages and disadvantages.
But first I would like to briefly address the problem of local minima, saddle points, and noisy gradients to give you a better understanding of why we need these different kinds of gradient descent in the first place.
2. Common Problems when Training Neural Networks
Local Minima and Saddle Points
Unfortunately, the loss functions do not always have a nice shape. In fact, they can take on very complex shapes, such as:
As a result, the loss functions usually have local minima and saddle points.
These 2D graphics show schematically what a local minimum and a saddle point would look like in a two-dimensional space:
The x-axis is an arbitrary weight value, while the y-axis is the value of the loss function. Suppose we obtain weights during the training of a neural network that results in a value of the loss function at which either a saddle point or a local minimum is located.
In this case, you can clearly see that the slope, or in other words the gradient of the loss function, becomes infinitely small since the loss function is flat at this point. The direct consequence of this is that the gradient gets stuck in a local minimum or saddle point and learning does not progress further because the weights would remain the same.
In practice, saddle points are a much bigger problem than the local minima, especially when dealing with hundreds of thousands of weight parameters.
Why is that?
The corresponding dimensional space where the loss function lives has the same number of dimensions as the number of weight parameters. A saddle point means that at my current point in the high dimensional space the loss goes down in one direction, while goes up in another direction. If you are in a hundred thousand dimensional space, as you can imagine this would happen very often.
On the contrary, a local minimum means that the loss function increases in all directions at my current point. As you can imagine, this is unlikely in high-dimensional spaces. Regardless of whether you have a saddle point or a local minimum, the regular implementation of the gradient descent would encounter difficulties during training.
Another problem and limiting factor when training neural networks is the fact that the gradients that we calculate can become very noisy. When doing gradient descent we (usually) compute the loss for each single feature-label instance pair in the training set and get the gradients by deriving the loss function with respect to the weight parameter.
The result of this is that each computed gradient on a single data sample is only a rough estimate of the true gradient that points towards the highest rate of increase of the loss function.
This causes the computed gradients to have slightly different directions and values for each features-label instance pair in the dataset. We say that these gradients are noisy or have a high variance.
As a result during the training, we don’t go directly towards the global minimum of the loss function, rather the gradient is doing some zig-zag movements. The noisiness of the gradients can result in longer training time of the network.
In order to better understand this, let’s take a look at a visual example of noisy gradients. The following video shows the process of gradient descent with noisy gradients.
The gradients are moving towards the global minimum of the loss function which lives in a 3-dimensional space. You can notice that the negative gradients which are computed on each single data instance in the training set do not point directly towards the global minimum of the loss function.
Rather the gradients differ a little bit in terms of their directions and values. Because of this we are doing these zig-zag movements and do not move directly towards the global minimum.
In the next section, we will look at different variations of how we can use gradient descent to update the weight of a neural network. Although we are still going to use the gradient descent that we have learned about previously, there are several ways how we can use the calculated gradient to update network weights.
In the following, I’ll introduce you to three techniques known as Stochastic, Batch, and Mini Batch Gradient Descent. Each weight update technique has its advantages and disadvantages.
Depending on the problem, you may prefer one method over another. Let’s start with batch gradient descent.
3. Batch Gradient Descent
Please consider a dataset where we have N=6 labeled data instances. Each instance has 4 features (age, job, education, martial) and a label y.
During the training process, the neural network computes for each instance a prediction that is compared to the ground truth label. Given the prediction and the label, we can put both into the loss function and calculate the gradient of the loss function for that given sample.
So far, so good. At this point we could use the calculated gradient to update our network weights towards the optimal weights, thereby minimizing the loss function.
However, during batch gradient descent we don’t do it right away.
Instead, the weights are updated only once after all data instances of the dataset have been processed. Specifically, during the batch gradient descent, the gradients for each instance in the dataset are calculated and summed.
In the end, the accumulated gradient is divided by the number of data instances, which is 6. In this way, we get an averaged gradient across all data instances in the dataset. The weights are now updated in the negative direction of this averaged gradient.
For our dataset, we must calculate and sum the gradients for each of the six samples in that data set. Then we divide the sum of the gradients by 6 and perform a single gradient descent with this averaged gradient to update the weights of the neural network.
In this equation, θ represents an arbitrary weight value.
Advantages of Batch Gradient Descent
- Computationally Efficient: As you may have guessed, this technique is less computationally demanding, as no updates are required after each sample.
- Stable Convergence: Another advantage is the fact that the convergence of the weights to the optimal weights is very stable. By calculating and averaging all individual gradients over each sample in the dataset, we get a very good estimate of the true gradient, indicating the highest increase of the loss function.
Disadvantages of Barch Gradient Descent
- Slower Learning: The downside of batch gradient descent is a much slower learning process because we perform only one update after N samples have been processed.
- Local Minima and Saddle Points: Another disadvantage is the fact that during the learning process we can get stuck in a local minimum of the loss function and never reach the global optimum at which the neural network achieves the best results. This is because the gradients we calculate are more or less the same. What we actually need are in fact some noisy gradients. Such small deviations in the directional values would allow the gradient to jump out of a local minimum of the loss function and continue the updates towards the global minimum. Clean gradients, on the other hand, are much more prone to getting stuck in a local minimum.
4. Stochastic Gradient Descent
In contrast to batch gradient descent, we can perform the stochastic gradient descent. This method can be considered as the opposite of the batch gradient.
Please consider the dataset I introduced earlier. For each instance, in the data, we again make a prediction, compare the prediction with the label and calculate the gradient of the loss function.
In this case, however, we update the weights after each data instance has been processed by the neural network. This method is also often called as online learning.
For our small data set, we calculate the gradients for each data instance and update the weights of the neural network six times:
In other words: This equation is performed six times — one for each data instance.
Advantages of Stochastic Gradient Descent
- Immediate Performance Insights: The stochastic gradient descent immediately gives us an insight into the performance of the neural network, since in this case, we do not have to wait until the end of the data set.
- Faster Learning: Accordingly, stochastic gradient descent may result in much faster learning because an update is performed after each data instance is processed.
Disadvantages of Stochastic Gradient Descent
- Noisy Gradients: In contrary to the batch gradient descent where we average the gradient to get one final gradient, in stochastic gradient descent we use every single gradient to update the weights. These gradients can be very noisy and have a lot of variance with respect to their directions and values. Meaning the gradients that we compute on each sample are only rough estimates of the true gradient that points towards the increase of the loss function. In other words, in this case, we have a lot of noise. However, this fact can avoid the local minima during training because the high variance can cause the gradient to jump out of a local minimum.
- Computationally Intensive: The stochastic gradient descent is much more computationally intensive than the batch gradient descent since in this case, we perform the weight updates more often.
- Inability to settle on a global Minimum: Another disadvantage may be the inability of the gradient descent to settle on a global minimum of the loss function. Due to the noisiness, it would be more difficult to find and stay at a global minimum.
5. Mini-Batch Gradient Descent
The third and final weight update technique is called mini-batch gradient descent. Imagine that this method combines the best of the other two methods. For the mini-batch gradient descent, we must divide our training set into batches of size n
For example, if our dataset contains 10,000 samples, a suitable size of n would be 8,16,32, 64, 128.
Analogous to the batch gradient descent we compute and average the gradients across the data instance in a mini-batch. The gradient descent step is performed after each mini-batch of samples has been processed.
Please consider once more our small dataset. In the case of mini-batch gradient descent, we may divide these six data instances into batches of size n=2t, leaving us with three mini-batches in total:
We calculate two gradients for the two data instances in each mini-batch, sum them, and divide them by two to get the gradient average over that mini-batch:
We would use this average gradient to perform a gradient descent step. In this case, we would do the gradient step a total of three times.
The mini-batch approach is the default method to implement the gradient descent algorithm in Deep Learning
Advantages of Mini-Batch Gradient Descent
- Computational Efficiency: In terms of computational efficiency, this technique lies between the two previously introduced techniques.
- Stable Convergence: Another advantage is the more stable converge towards the global minimum since we calculate an average gradient over n samples that results in less noise.
- Faster Learning: As we perform weight updates more often than with stochastic gradient descent, in this case, we achieve a much faster learning process.
Disadvantages of Mini-Batch Gradient Descent
- New Hyperparameter: A disadvantage of this technique is the fact that in mini-batch gradient descent, a new hyperparameter n, called as the mini-batch size, is introduced. It has been shown that the mini-batch size after the learning rate is the second most important hyperparameter for the overall performance of the neural network. For this reason, it is necessary to take time and try many different batch sizes until a final batch size is found that works best with other parameters such as the learning rate.
- Local minima, saddle points, and noisy gradients are common issues when training neural networks
- Batch Gradient descent can prevent the noisiness of the gradient, but we can get stuck in local minima and saddle points
- With stochastic gradient descent we have difficulties to settle on a global minimum, but usually, don’t get stuck in local minima
- The mini-batch approach is the default method to implement the gradient descent algorithm in Deep Learning. It combines all the advantages of other methods, while not having their disadvantages