Hey everyone! In this article I will present the theory and practical implementation of a very useful and effective technique called Batch Normalization, which can significantly accelerate the training of a neural network model.

Table of Contents

  1. Introduction
  2. The Problem of Internal Covariate Shift
  3. Recap: Normalization of Data
  4. Batch Normalization in Deep Learning
  5. Advantages of Batch Normalization
  6. Batch Normalization in Practice
  7. The Take-Home Message

1. Introduction

Neural networks learn to solve a problem using the backpropagation algorithm. During backpropagation, the network adjusts its weights and biases in order to make a better prediction next time. In other words, the output of the network gets closer to the actual value (label) each time the weights and biases are updated.

Backpropagation involves computing the gradient of the loss function with respect to the weights of each layer in the network, starting at the output and propagating the gradient backward layer by layer.
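As a minimal sketch of this idea, the gradients can be computed layer by layer with the chain rule in NumPy. The two-layer network, the tanh activation, the mean squared error loss, and the toy data below are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 8 samples, 3 input features, 1 target value each.
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# Hypothetical two-layer network: 3 inputs -> 4 hidden neurons -> 1 output.
W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)) * 0.5, np.zeros(1)


def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1      # pre-activations of the hidden layer
    a1 = np.tanh(z1)      # activations of the hidden layer
    y_hat = a1 @ W2 + b2  # network output
    return z1, a1, y_hat


def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)


# Forward pass and loss before the update.
z1, a1, y_hat = forward(X, W1, b1, W2, b2)
loss_before = mse(y_hat, y)

# Backward pass: start at the loss and propagate the gradient backward.
d_y_hat = 2.0 * (y_hat - y) / y.size   # dL/dy_hat
dW2 = a1.T @ d_y_hat                   # dL/dW2
db2 = d_y_hat.sum(axis=0)              # dL/db2
d_a1 = d_y_hat @ W2.T                  # gradient flowing into the hidden layer
d_z1 = d_a1 * (1.0 - a1 ** 2)          # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ d_z1                       # dL/dW1
db1 = d_z1.sum(axis=0)                 # dL/db1

# One gradient-descent step with a small learning rate.
lr = 0.01
W1, b1 = W1 - lr * dW1, b1 - lr * db1
W2, b2 = W2 - lr * dW2, b2 - lr * db2

loss_after = mse(forward(X, W1, b1, W2, b2)[2], y)
print(loss_before, loss_after)
```

With a sufficiently small learning rate, the loss after the update should be slightly lower than before it.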


However, during training with backpropagation we may face an undesirable phenomenon called Internal Covariate Shift, which can cause problems during the training of the network.

2. The Problem of Internal Covariate Shift

During training, each layer in the neural network is trying to correct itself for the error that was made during the forward propagation. But in doing so, each layer corrects itself separately.


Please consider a neural network in which the 2nd layer adjusts its weights and biases so that the entire neural network can make more accurate predictions in the future.

In doing so, the output of this 2nd layer, which is the input of the 3rd layer, also changes. This means that after the weights and biases of the 2nd layer are improved, the 3rd layer has to re-adapt in order to produce correct predictions for the same data.

So, by improving one layer the following layer faces a completely new difficulty.

More specifically: Due to changes in weights and biases of the current layer, the following layer is forced to learn from new input distributions.

This is what we call Internal Covariate Shift. This phenomenon usually increases the training time of a neural network. In the following, I will show you how this issue can be addressed with batch normalization.
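The effect can be made visible with a small NumPy sketch. The layer sizes, the ReLU activation, and the simulated weight update below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a fixed batch of data passes through layer 2,
# whose output is the input of layer 3.
X = rng.normal(size=(256, 10))
W2 = rng.normal(size=(10, 5))
b2 = np.zeros(5)


def layer2(X, W, b):
    return np.maximum(0.0, X @ W + b)  # ReLU activation


out_before = layer2(X, W2, b2)

# Simulate a training update of layer 2 (a made-up change, not a real gradient step).
W2_new = W2 * 1.5
b2_new = b2 + 1.0
out_after = layer2(X, W2_new, b2_new)

# The data X is unchanged, yet layer 3 now sees a different input distribution.
print("mean before/after:", out_before.mean(), out_after.mean())
print("std  before/after:", out_before.std(), out_after.std())
```

The same input data produces outputs with a different mean and spread after the update, which is exactly the moving target that the following layer has to cope with.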

But first, we must do a quick recap and discuss the term normalization.

3. Recap: Normalization of Data

In machine learning and deep learning, it is often required to scale or to normalize your input data before training a neural network on it. For instance, if we have a dataset with features x that have very different ranges (e.g. some from 0 to 1 and some from 1 to 1000) we should normalize them to speed up learning.

This can be accomplished by the following equation, applied to every input feature x in the dataset:

$$x' = \frac{x - \mu}{\sigma}$$

Here μ represents the mean of a feature x and σ its standard deviation (the square root of the variance). This forces the values of each input feature in the dataset to have a mean of zero and a variance of one.
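A minimal NumPy sketch of this feature normalization, subtracting the mean of each feature and dividing by its standard deviation. The small dataset is made up for illustration:

```python
import numpy as np

# Hypothetical dataset: 5 samples, 2 features with very different ranges.
X = np.array([
    [0.1, 100.0],
    [0.4, 250.0],
    [0.5, 500.0],
    [0.7, 750.0],
    [0.9, 1000.0],
])

mu = X.mean(axis=0)     # mean of each feature (column)
sigma = X.std(axis=0)   # standard deviation of each feature

X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # approximately 0 for every feature
print(X_norm.std(axis=0))   # approximately 1 for every feature
```

Note that the statistics are computed over the whole dataset here, one pair of values per feature.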

By doing so we can make the training of a deep learning or a classical machine learning model more stable and faster. Normalization is a very important part of the data preprocessing, which we must do before we can train a neural network on the given data.

4. Batch Normalization in Deep Learning

The principle behind feature normalization in the dataset can also be applied to the neuron values in a neural network, which improves the training of this neural network significantly.

As in the case of normalization of input features, we want the neuron values to have a mean of zero and a variance of one. The method behind it is what we call Batch Normalization.

With batch normalization we reduce the internal covariate shift, or in other words, the amount by which the values of the hidden neurons shift around.

One difference from feature normalization is that in the case of batch normalization we normalize the values across a mini-batch of training samples and not across all samples in the dataset.

Please consider a mini-batch of size m and dimension D. While m refers to the number of training samples in the mini-batch, D can be viewed as the number of neurons in a given hidden layer.


Batch normalization is computed across the mini-batch separately for each dimension, or in other words, for each hidden neuron. In practice batch normalization is usually applied before the activation function. This means that the input of the activation function is zero-centered with a variance of one, which in turn keeps the activations themselves on a similar scale.
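The following sketch illustrates both points: the statistics are taken over the batch axis, one pair per hidden neuron, and the normalization sits between the linear step and the activation. The layer sizes, the ReLU activation, and the value of eps are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

m, D = 32, 4                   # mini-batch size and number of hidden neurons
X = rng.normal(size=(m, 6))    # hypothetical inputs with 6 features
W = rng.normal(size=(6, D))
b = rng.normal(size=D)

z = X @ W + b                  # pre-activations, shape (m, D)

# Statistics are taken over the batch axis: one mean and one variance per neuron.
mu = z.mean(axis=0)            # shape (D,)
var = z.var(axis=0)            # shape (D,)
eps = 1e-5                     # small constant for numerical stability (assumed value)

z_norm = (z - mu) / np.sqrt(var + eps)

# The activation function is applied after the normalization.
a = np.maximum(0.0, z_norm)    # ReLU

print(mu.shape, var.shape)     # (4,) (4,)
print(z_norm.mean(axis=0))     # close to 0 for every neuron
```

Only the input of the activation function is zero-centered here. The ReLU outputs are non-negative, so their mean is above zero, but they stay on a comparable scale across neurons.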

Let’s take a look at some advantages that come from using batch normalization.

5. Advantages of Batch Normalization

More Stable Gradients

In general, using batch normalization leads to a similar scale of activations throughout a neural network. This results in more stable gradients with fewer oscillations, which yields much better convergence towards a minimum of the loss function.

Faster Training

Stable gradients allow us to use higher learning rates. Gradient descent usually requires small learning rates for the network to converge. And as networks get deeper, the gradients tend to get smaller during backpropagation, so the training time increases. Using batch normalization allows us to use much higher learning rates, which further increases the speed at which neural networks train.

Proper Initialization of Weights and Biases becomes less of a Problem

In most cases, the training performance depends a lot on proper initialization of the weights and biases. Weight initialization can be difficult, and it's even more difficult when creating deeper networks. Using batch normalization reduces that dependence. Batch normalization seems to allow us to be much less careful about choosing our initial weights.

In short: batch normalization is a very powerful technique that can enhance the training of a neural network without much effort. Speaking of effort, let's see how batch normalization is applied in practice.

6. Batch Normalization in Practice

Please consider again a mini-batch of size m and dimension D, with x_i as the input values of the activation functions within a neural network. In the first step we compute the mean of the values x_i across the mini-batch, separately for each dimension:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$

Having the mean value μ_B we can use it to compute the variance:

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

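Both the mean μ_B and the variance σ_B² enable us to normalize the values x_i across each dimension of the mini-batch. After this step, the normalized values are zero-centered with a variance of one:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Here ε is a small constant that is added for numerical stability.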
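The three steps so far (mean, variance, normalization) can be traced on a tiny mini-batch. The numbers and the value of eps below are made up for illustration:

```python
import numpy as np

# Hypothetical mini-batch: m = 4 samples, D = 2 hidden neurons.
x = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])
eps = 1e-5  # small constant that avoids division by zero (assumed value)

# Step 1: mean of each dimension across the mini-batch.
mu_B = x.mean(axis=0)                    # [2.5, 25.0]

# Step 2: variance of each dimension across the mini-batch.
var_B = ((x - mu_B) ** 2).mean(axis=0)   # [1.25, 125.0]

# Step 3: normalize every value with the statistics of its dimension.
x_hat = (x - mu_B) / np.sqrt(var_B + eps)

print(x_hat.mean(axis=0))  # approximately [0, 0]
print(x_hat.var(axis=0))   # approximately [1, 1]
```

Although the two dimensions start on very different scales, both end up with the same normalized values.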

Up to this point, batch normalization does not differ from the normalization of input features in a dataset. However, batch normalization requires one more step. In this step, we scale the normalized value x_hat by the parameter γ and shift it by the parameter β:

$$y_i = \gamma \hat{x}_i + \beta$$

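Putting all steps together, a batch normalization forward pass might be sketched as follows. The function name `batch_norm_forward`, the value of eps, and the test data are assumptions for illustration. The sketch covers only the training-time computation and leaves out the running statistics that deep learning frameworks keep for inference:

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch x with shape (m, D), training mode only."""
    mu_B = x.mean(axis=0)                      # mean per dimension
    var_B = x.var(axis=0)                      # variance per dimension
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)  # zero mean, unit variance
    y = gamma * x_hat + beta                   # scale and shift
    return y, x_hat


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 5))  # hypothetical pre-activations

# gamma = 1 and beta = 0 are a common starting point: y equals x_hat.
gamma = np.ones(5)
beta = np.zeros(5)
y, x_hat = batch_norm_forward(x, gamma, beta)

# Other values move the output to a different mean and standard deviation.
y2, _ = batch_norm_forward(x, gamma=np.full(5, 2.0), beta=np.full(5, 0.5))
print(y2.mean(axis=0))  # approximately 0.5 for every dimension
print(y2.std(axis=0))   # approximately 2.0 for every dimension
```

In a real network γ and β would not be set by hand but updated by gradient descent together with the weights.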

Both γ and β are trainable variables and are learned by the neural network during training. The logical question at this point is why we are doing this. Why is the basic normalization not sufficient?

The reason for this last step is the following: by making the activations zero-centered with unit variance, we also impose a restriction on the neural network, namely that the activation values of all hidden neurons follow this particular distribution.

But there is no guarantee that this distribution is always the best one. In some cases the normalization may improve the training, in other cases however the neural network may require a slightly or a completely different distribution of activations.

By introducing the possibility to scale and shift the normalized values we soften the imposed restriction. Since the scale and shift parameters are learnable parameters the neural network can learn and decide during training whether the normalized activation should be transformed any further. In the extreme case, the neural network may recognize that the normalization does not contribute to the training at all.

In this case, γ and β can take values that transform the normalized values back into the previous unnormalized form: γ equal to the standard deviation of the mini-batch and β equal to its mean μ_B. That means that by introducing the possibility to scale and shift the normalized values we give the neural network the freedom to decide whether normalization should be performed and to what extent.
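This extreme case can be checked numerically. In the sketch below γ and β are set by hand to the standard deviation and mean of one made-up mini-batch; in a real network they would be learned, and they would only approximately undo the normalization for other mini-batches:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-5  # assumed value of the stability constant
x = rng.normal(loc=-1.0, scale=3.0, size=(16, 3))  # hypothetical pre-activations

mu_B = x.mean(axis=0)
var_B = x.var(axis=0)
x_hat = (x - mu_B) / np.sqrt(var_B + eps)

# If gamma = sqrt(var_B + eps) and beta = mu_B,
# the scale-and-shift step undoes the normalization for this mini-batch.
gamma = np.sqrt(var_B + eps)
beta = mu_B
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True
```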

However, in practice, I would strongly recommend using batch normalization when dealing with deeper neural networks. It is a battle-tested method that brings many advantages and can significantly improve the training performance of a neural network. In the worst-case scenario, your neural network can learn that no batch normalization is required and cancel it out by learning the appropriate scale and shift parameters.

7. The Take-Home Message

  • Batch Normalization normalizes the values of the neurons in a neural network to a mean of zero and a variance of one across each mini-batch, then scales and shifts them with the learnable parameters γ and β
  • Batch Normalization is usually applied before the activation functions
  • It brings many advantages, such as faster and more stable training
  • It is strongly recommended when dealing with deeper networks