In this article, we will look at why these phenomena occur and how they can be prevented.
Generalization in Deep Learning
When training a neural network, we optimize the weights and biases so that the network performs a mathematical mapping of input values to output values, depending on the given objective.
Besides the given objective, our second aim is to build a model that is capable of generalizing well from the training data to any data from the problem domain. In other words, after training is complete we want the neural network to perform on yet-unseen data as well as it does on the training data. Unfortunately, in reality, this is easier said than done. Often, once a model has been trained, its performance on yet-unseen data is much worse, even though its performance on the training data was good. This phenomenon is called overfitting.
Overfitting means that the neural network models the training data too well. Overfitting suggests that the neural network has good performance, but in fact the model fails when it faces new, yet-unseen data from the problem domain. Overfitting happens for two main reasons:
- The data samples in the training data have noise and fluctuations.
- The model has very high complexity.
Before we go any further let’s discuss what the term complexity means.
Neural Network Complexity
We can consider a neural network as a function that performs a mathematical mapping from an input x to an output y. A mathematical function can take the form of a polynomial of a certain degree n.
A polynomial function with a higher degree is considered to have more complexity than a polynomial function with a lower degree. That means that a neural network that can be approximated by a polynomial of a higher degree is considered more complex.
What is the consequence of having more or less complexity? Let's take a look at a visual example.
In the image below, you can see two fits of a distribution of data samples, performed by two polynomial models with different complexities. The function that fits the data points on the left side is a polynomial function of degree 4.
As you can clearly see, this function models the distribution pretty well. The fit of the model is almost identical to the true function that describes the distribution of the data points. If a new arbitrary sample were generated from the true function that describes the distribution, this additional data sample would not be far away from the fit curve of the model.
On the right side however you can see a fit that is done by a polynomial function with a degree of 15. What can you notice?
At first glance, the function on the right side seems to fit the data pattern even better. The absolute distances between the individual data points and the fit are much lower than in the graph on the left side. However, as you can see, the fitted function tries to match the data points one by one instead of recognizing the true function that describes the data distribution.
If you generated a new arbitrary data sample from the true function, this additional sample would have a very high absolute distance to the model's fit. We call this phenomenon overfitting, or we often say that the model has high variance.
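As a rough sketch of this effect, the following example fits polynomials of degree 4 and degree 15 to noisy samples drawn from a hypothetical "true function" (a sine wave is an assumption here, chosen only for illustration), then compares their errors on the training points and on fresh data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: noisy samples from an assumed "true" function
def true_fn(x):
    return np.sin(2 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_fn(x_train) + rng.normal(0, 0.2, 20)

# Fit a low-degree and a high-degree polynomial to the same samples
fit_low = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)
fit_high = np.polynomial.Polynomial.fit(x_train, y_train, deg=15)

def mse(fit, x, y):
    return np.mean((fit(x) - y) ** 2)

# The degree-15 fit hugs the training points more closely...
print("train MSE:", mse(fit_low, x_train, y_train), mse(fit_high, x_train, y_train))

# ...but on fresh samples from the same true function it is typically worse
x_new = rng.uniform(0, 1, 200)
y_new = true_fn(x_new) + rng.normal(0, 0.2, 200)
print("new-data MSE:", mse(fit_low, x_new, y_new), mse(fit_high, x_new, y_new))
```

The high-degree model's lower training error and higher error on new samples is exactly the high-variance behavior described above.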
As already mentioned, the training data contains noise and random fluctuations. A model with high complexity is capable of picking up this random noise and these fluctuations and learning them as underlying concepts and patterns of the data. But this noise and these fluctuations are unique to the training set.
As soon as the model sees new data, the falsely learned patterns and concepts no longer apply, and the performance drops significantly. This applies to neural networks and, in fact, to any other machine learning model. That means that overfitting destroys a neural network's ability to generalize.
The opposite phenomenon is called underfitting. As you might guess, underfitting refers to a model that can model neither the training data nor new data from the problem domain. The reason for this again lies in the complexity of the model.
This time, however, the complexity is too low for the neural network to learn the underlying mathematical mapping from input features to output labels. Analogous to overfitting, we can illustrate this with a graphical example.
In this image, you can see once more two different fits of a distribution of data samples. On the right-hand side, you can recognize the polynomial function of degree 4 that fits the given data samples very well. On the left side, we use a polynomial function of degree one to fit the data points.
As you can see, the distance between the model's fit and the data points is much higher in this case. The reason is that the complexity of the data distribution we are trying to fit is much higher than the complexity of the model. The same phenomenon can be observed in a neural network that does not have enough complexity: the network is just too simple to fit the given data.
For that reason, the performance of such a model will be very bad on the training data as well as on new data. We say that the model underfits the data or has high bias.
Variance Bias Tradeoff
In practice, you will deal a lot with a phenomenon called the bias-variance tradeoff. This tradeoff means that rising model complexity causes a lower bias error on the one hand but a higher variance error on the other.
As a result, the overall error of a neural network has a minimum at a certain complexity. It is up to you to find a neural network model and the appropriate parameters that result in the best possible bias-variance tradeoff.
So, ideally, you must find a model at the sweet spot between overfitting and underfitting. In other words, the model with a complexity where the curves of variance and bias intersect (just like in the image above).
It sounds like an easy goal, but it is quite difficult in practice.
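One way to see the tradeoff concretely is to sweep the complexity of a simple model and watch the training and validation errors. This sketch reuses the hypothetical polynomial setup from earlier (the sine "true function" and the noise level are assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    return np.sin(2 * np.pi * x)

# Separate training and validation samples from the same distribution
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = true_fn(x_train) + rng.normal(0, 0.2, 30)
x_val = np.sort(rng.uniform(0, 1, 100))
y_val = true_fn(x_val) + rng.normal(0, 0.2, 100)

train_err, val_err = [], []
for deg in range(1, 13):  # complexity = polynomial degree
    fit = np.polynomial.Polynomial.fit(x_train, y_train, deg)
    train_err.append(np.mean((fit(x_train) - y_train) ** 2))
    val_err.append(np.mean((fit(x_val) - y_val) ** 2))

# Training error only goes down with complexity; validation error
# typically falls (bias shrinks), then rises again (variance grows)
best_deg = 1 + int(np.argmin(val_err))
print("degree with lowest validation error:", best_deg)
```

The degree that minimizes the validation error is the sweet spot the article describes: low degrees underfit, high degrees overfit.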
Identifying Overfitting and Underfitting during Training
To achieve this goal, we must observe the performance of a model during training. The easiest way to measure the performance is to compute the error or the value of the loss function we get during training. The error/loss is calculated periodically for the training set and a validation set during the entire training period.
As the name implies, the training data is used to train the model. The validation data, on the other hand, is used only to validate the performance of the model.
Over time, as the algorithm learns, the error/loss of the model on the training data and validation data will go down. However, if the model is trained too long this may not be the case anymore.
Although the error calculated on the training may continue to decrease, the error calculated on the validation dataset could eventually increase again.
And this is exactly what overfitting is. Because the model has had enough time to learn the noise and irrelevant patterns in the training data, it loses the ability to generalize to the validation data, which is not used to train the model.
The sweet spot we are looking for is the point just before the loss/error for the validation set starts to go up. This is the point at which the algorithm best generalizes for both datasets.
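Finding that point can be sketched as a simple early-stopping rule: track the validation loss per epoch and stop once it has not improved for a few epochs. The loss values and the `patience` parameter below are hypothetical, chosen only to illustrate the idea:

```python
def best_epoch(val_losses, patience=3):
    """Return the epoch with the lowest validation loss, stopping
    once the loss has not improved for `patience` epochs in a row."""
    best, best_ep, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_ep, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break  # validation loss keeps rising: overfitting begins
    return best_ep

# Simulated validation curve: falls, bottoms out, then rises again
losses = [1.0, 0.7, 0.5, 0.45, 0.44, 0.46, 0.50, 0.55, 0.60]
print(best_epoch(losses))  # prints 4, the epoch with the lowest loss
```

Deep learning frameworks offer this behavior out of the box (for example as an early-stopping callback), but the underlying logic is no more than this.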
Let's take a look at more concrete examples of overfitting and underfitting.
Consider an imaginary model that classifies cat and dog images. A human sitting beside it could classify those images as well. Since some cats can look quite like dogs and vice versa, this human would make some mistakes during classification. Let's compare the model's classification errors on the training and validation datasets against this human error rate.
Let's assume that the human has an error rate of 1%. At the same time, the model has a classification error rate of 3% on the training set, which is only slightly worse than the human's performance. In that case, we could be very happy about our results.
A look at the error on the validation set, however, should make us suspicious. The performance of the model on the validation set is much worse than on the training set, a clear indicator that the model overfits on the training set.
What about underfitting and identifying a high bias?
Suppose we train another model and examine its classification error rates.
Considering the previous example of image classification, in the case of underfitting we would see very bad performance on both datasets. Of course, we can only speak of bad performance if we know what performance we can theoretically expect.
Assuming we know that a model can achieve a performance only marginally worse than human performance, error rates of 12% and 10% are a clear indication of underfitting. In other words, our model has a strong bias.
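The two diagnoses above can be condensed into a small heuristic: compare the training error against an achievable target (here, the human error rate) and the validation error against the training error. All thresholds and the example numbers are illustrative assumptions, not a formal rule:

```python
def diagnose(train_err, val_err, target_err, gap_tol=0.02, slack=0.05):
    """Rough heuristic for spotting over- and underfitting from error rates.
    target_err: the error we believe is achievable (e.g. human performance).
    slack / gap_tol: hypothetical tolerances for 'close enough'."""
    if train_err > target_err + slack:
        # Bad even on training data: the model is too simple
        return "underfitting (high bias)"
    if val_err - train_err > gap_tol:
        # Good on training data, much worse on validation data
        return "overfitting (high variance)"
    return "reasonable fit"

# The two scenarios from the text (human error assumed at 1%):
print(diagnose(0.03, 0.11, 0.01))  # 3% train vs. a much worse validation error
print(diagnose(0.12, 0.10, 0.01))  # bad on both datasets
```

The first call flags high variance, the second high bias, matching the discussion above.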
How to avoid Overfitting?
Neither overfitting nor underfitting is a desirable phenomenon. However, by far the most common problem in deep learning and machine learning is overfitting.
Overfitting is the much bigger problem because the evaluation of deep learning/machine learning models on training data differs from the evaluation that actually matters most to us, namely the evaluation of the model on unseen data (the validation set).
In the case of deep neural networks, there are two promising techniques you can use to avoid overfitting:
- Reduce the complexity of the model by using fewer neurons or layers
- Use regularization techniques such as L1, L2 and dropout
By using fewer neurons or layers in a neural network, we automatically reduce the number of weights and biases. Since a neural network can be considered a mathematical function that performs a mapping from X → Y, the weights and biases are the parameters of this function that determine its complexity. In general, a neural network with more weights and biases may be considered more complex and therefore more susceptible to overfitting. So, reducing the number of weights and biases decreases the complexity.
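To make the relationship between layer sizes and parameter count explicit, here is a small helper that counts the weights and biases of a fully connected network; the layer widths in the example are arbitrary illustrations:

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network
    with the given layer widths (input, hidden..., output)."""
    # Each pair of adjacent layers contributes n_in * n_out weights
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # Every non-input neuron has one bias
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_parameters([10, 64, 64, 1]))  # prints 4929
print(count_parameters([10, 16, 16, 1]))  # prints 465
```

Shrinking the hidden layers from 64 to 16 neurons cuts the parameter count by an order of magnitude, which is exactly the complexity reduction described above.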
Although reducing the complexity by using fewer neurons and/or layers might work, in this case it is not the smartest solution there is. The main reason is simply that we have no idea how many hidden layers or neurons we would need to remove to reduce the overfitting.
In the worst case, removing layers and neurons would eliminate the overfitting on the one hand but cause underfitting on the other. For that reason, the better way to reduce overfitting is to use a technique called regularization. The main idea behind regularization is to let the neural network learn to reduce its own complexity if needed. The most promising regularization techniques, called L1, L2, and dropout, were addressed in this article.
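To give a flavor of two of these techniques, here is a minimal NumPy sketch (the penalty strength and dropout rate are hypothetical values): an L2 penalty adds the squared weight magnitudes to the loss, and dropout randomly zeroes activations during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-3):
    """L2 regularization term: added to the loss, it pushes gradient
    descent toward smaller weights, i.e. lower effective complexity."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(activations, rate=0.5):
    """Inverted dropout: zero a random fraction `rate` of activations
    and scale the survivors so the expected value stays unchanged."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

# Dropped entries become 0, surviving ones are scaled up (here to 2.0)
a = np.ones((4, 8))
print(dropout(a, rate=0.5))
print(l2_penalty([np.ones((2, 2))], lam=0.5))  # prints 2.0
```

Real frameworks apply dropout only during training and disable it at inference time; this sketch shows only the training-time behavior.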
How to avoid Underfitting?
Reducing underfitting is quite a straightforward and simple process. Since underfitting is a result of the model's low complexity, all we need to do is increase that complexity.
In general, if we want to increase the complexity of a function, we need to add more parameters to it. In the case of neural networks, our parameters are the weights and biases. To add more weights and biases, we just need to increase the number of layers and the number of neurons in the neural network.
- Overfitting means that the model performs very well on the training data but fails when it sees new data from the same problem domain.
- Underfitting means that the model fails on both types of data: the training data as well as new data.
- To reduce overfitting, we can use fewer layers or neurons in the neural network, or, even better, use regularization techniques such as L1, L2, and dropout.
- To reduce bias, we must increase the complexity by adding more layers or neurons.