
In this detailed guide, I will explain everything there is to know about activation functions in deep learning: in particular, what activation functions are and why we must use them when implementing neural networks.

Short answer: We must use activation functions such as ReLU, sigmoid, and tanh in order to add a non-linear property to the neural network. This way, the network can model more complex relationships and patterns in the data.

But let us discuss this in more detail in the following.


Table of Contents

  1. Recap: Forward Propagation
  2. A Neural Network is a Function
  3. Why do we need Activation Functions
  4. Different Kinds of Activation Functions (sigmoid, tanh, ReLU, leaky ReLU, softmax)
  5. What Activation Functions should we use?
  6. Take-Home-Message

1. Recap: Forward Propagation

In order to understand the importance of activation functions, we must first recap how a neural network computes a prediction/output. This is generally referred to as Forward Propagation. During forward propagation, the neural network receives an input vector x and outputs a prediction vector y.

How does that go?

Please consider the following neural network with one input layer, one output layer, and three hidden layers:

Neural Network

Each layer of the network is connected to the next layer via a so-called weight matrix. In total, we have 4 weight matrices W1, W2, W3, and W4.

Given an input vector x, we compute a dot-product with the first weight matrix W1 and apply the activation function to the result of this dot-product. The result is a new vector h1, which represents the values of the neurons in the first hidden layer. This vector h1 is used as the new input vector for the next layer, where the same operations are performed again. This is repeated until we get the final output vector y, which is considered the prediction of the neural network.

The entire set of operations can be represented by the following equations, where σ represents an arbitrary activation function:

Forward Propagation
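
To make this concrete, here is a minimal NumPy sketch of forward propagation. The layer sizes, the random weights, and the choice of sigmoid as σ are illustrative assumptions only:

import numpy as np

def sigmoid(z):
    # example choice for the activation function sigma; any non-linearity could be used
    return 1.0 / (1.0 + np.exp(-z))

# assumed layer sizes: 3 inputs, three hidden layers, 2 outputs
sizes = [3, 5, 4, 4, 2]
rng = np.random.default_rng(0)
# one weight matrix per pair of consecutive layers: W1, W2, W3, W4
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, weights):
    h = x
    for W in weights:
        h = sigmoid(W @ h)  # dot-product with the weight matrix, then the activation
    return h                # the final vector y, the prediction of the network

x = rng.standard_normal(3)  # input vector x
y = forward(x, weights)
print(y)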

2. A Neural Network is a Function

At this point, I would like to discuss with you another interpretation that can be used to describe a neural network. Rather than considering a neural network as a collection of nodes and edges, we can simply call it a function.

Just like any regular mathematical function, a neural network performs a mapping from an input x to an output y.

The concept of calculating an output y for an input x should already be familiar to you: it is simply what an ordinary mathematical function does. In mathematics, we can define a function f(x) as follows:

Function

This function takes three inputs x1, x2, and x3. The values a, b, and c are the function's parameters. Given the inputs x1, x2, and x3, the function calculates an output y.

On a basic level, this is exactly how a neural network works. We take a feature vector x, put it into the neural network, and the network calculates an output y.

That is, instead of considering a neural network as a simple collection of nodes and connections, we can think of a neural network as a function. This function incorporates all calculations that we have viewed separately before as one single, chained calculation:

Neural Network is a Function

In the example above, the simple mathematical function we considered had the parameters a, b, and c, which determine the output value y for a given input x.

In the case of a neural network, the parameters of the corresponding function are the weights. This means that our goal during the training of a neural network is to find a particular set of weights or parameters so that, given the feature vector x, we can calculate a prediction y that corresponds to the actual target value y_hat.

Or in other words, we are trying to build a function that can model our training data.

One question you might ask yourself is, can we always model the data? Can we always find weights that define a function that can compute a particular prediction y for given features x? The answer is no. We can only model the data if there is a mathematical dependency between the features x and the label y.

This mathematical dependency can vary in complexity, and in most cases we as humans cannot spot the relationship just by looking at the data. However, if there is some sort of mathematical dependency between the features and labels, then during training the network can recognize this dependency and adjust its weights so it can model it in the training data. In other words, it can realize a mathematical mapping from the input features x to the output y.


3. Why do we need Activation Functions?

The purpose of an activation function is to add some kind of non-linear property to the function that the neural network represents. Without activation functions, the neural network could only perform linear mappings from the inputs x to the outputs y. Why is this so?

Without activation functions, the only mathematical operations during forward propagation would be dot-products between an input vector and a weight matrix. Since a single dot-product is a linear operation, successive dot-products would be nothing more than multiple linear operations repeated one after the other, and successive linear operations can always be collapsed into one single linear operation, as the short sketch below demonstrates.
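
Here is a quick numerical sketch (with arbitrary toy matrices) of this collapse: two stacked layers without an activation function are exactly equivalent to a single layer whose weight matrix is the product of the two.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

# two "layers" without any activation function...
y_two_layers = W2 @ (W1 @ x)

# ...are the same as one single linear operation with W = W2 @ W1
W_combined = W2 @ W1
y_one_layer = W_combined @ x

print(np.allclose(y_two_layers, y_one_layer))  # True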

In order to compute really interesting stuff, neural networks must be able to approximate non-linear relations from input features to output labels. Usually, the more complex the data we are trying to learn from, the more non-linear the mapping from the features to the ground-truth labels.

A neural network without any activation function would not be able to realize such complex mappings mathematically and could not solve the tasks we want it to solve.


4. Different Kinds of Activation Functions

At this point, we should discuss the different activation functions that are used in deep learning, as well as their advantages and disadvantages.

4.1 Sigmoid

A few years ago, the most common activation function you would have encountered was the sigmoid function. The sigmoid function maps its incoming inputs to a range between 0 and 1:

Sigmoid Plot

The sigmoid activation function is defined as follows:

Sigmoid Equation

In practice, the sigmoid nonlinearity has fallen out of favor and is rarely used nowadays. It has two major drawbacks:

Sigmoid kills Gradients

The first one is that sigmoids saturate and kill gradients. A very undesirable property of the sigmoid is that the activations of the neurons saturate at either tail, close to 0 or 1 (blue areas):

Sigmoid Saturation

The derivative of the sigmoid function gets very small in these blue areas (that is, for large negative or positive input values). In this case, the near-zero derivative makes the gradient of the loss function very small, which effectively prevents the weights from updating and thus stalls the entire learning process.
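
The following small sketch makes this visible; the input values are chosen arbitrarily, and the derivative is computed with the identity σ'(x) = σ(x)(1 - σ(x)):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:6.1f}   sigmoid = {sigmoid(x):.5f}   derivative = {sigmoid_derivative(x):.5f}")

# For x = -10 and x = 10 the derivative is about 0.00005, so the gradient is practically "killed".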

Sigmoid is not zero-centered

Another undesirable property of the sigmoid activation is the fact that the outputs of the function are not zero-centered. Usually, this makes the training of the neural network more difficult and unstable.

Please consider a sigmoid neuron y with inputs x1 and x2 weighted by w1 and w2:

Equation

x1 and x2 are outputs of a previous hidden layer with sigmoid activation, so x1 and x2 are always positive, since the sigmoid never outputs negative values. During backpropagation, the gradient of the loss with respect to w1 is the upstream gradient of the expression y = x1*w1 + x2*w2 multiplied by x1, and likewise for w2. Because x1 and x2 are both positive, the gradients with respect to w1 and w2 always share the same sign: either both are positive or both are negative.

Often, the optimal gradient descent step would require increasing w1 while decreasing w2. But since x1 and x2 are always positive, we cannot increase one weight and decrease the other in the same update; we can only increase all weights or decrease all weights at the same time. So, in the end, the optimization will require more steps.
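
To make the sign argument concrete, here is a tiny sketch with made-up numbers; the inputs and the upstream gradient dL_dy are purely illustrative assumptions:

# positive sigmoid outputs from the previous layer (assumed values)
x1, x2 = 0.7, 0.2

# upstream gradient dL/dy flowing into the neuron (assumed value)
dL_dy = -1.5

# gradients of the loss with respect to the weights: dL/dw_i = dL/dy * x_i
dL_dw1 = dL_dy * x1  # -1.05
dL_dw2 = dL_dy * x2  # -0.30

# both gradients are negative here; flipping the sign of dL_dy would make both positive,
# so we can never increase w1 and decrease w2 in the same update
print(dL_dw1, dL_dw2)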

4.2 Tanh Activation Function

Another very common activation function used in deep learning is the tanh function. The hyperbolic tangent (tanh) nonlinearity is shown in the following image:

Tanh Plot

The function maps a real-valued number to the range [-1, 1] according to the following equation:

Tanh Equation
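
A minimal sketch using NumPy's built-in tanh shows the zero-centered output range (the input values are arbitrary examples):

import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.tanh(x))  # approximately [-1.0, -0.762, 0.0, 0.762, 1.0]

# The outputs lie in the range [-1, 1] and are centered around zero, unlike the sigmoid.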

As with the sigmoid function, the neurons saturate for large negative and positive values, and the derivative of the function goes to zero (blue areas). But unlike the sigmoid, its outputs are zero-centered.

Therefore, in practice, the tanh nonlinearity is generally preferred to the sigmoid nonlinearity.

4.3 Rectified Linear Unit — ReLU

The Rectified Linear Unit, or simply ReLU, has become very popular in the last few years. The activation is simply thresholded at zero: R(x) = max(0, x), i.e. if x < 0, then R(x) = 0, and if x >= 0, then R(x) = x.

For inputs larger than zero, we get a linear mapping:

ReLU Plot
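A minimal implementation is a one-liner; the sketch below simply thresholds a vector of example values at zero:

import numpy as np

def relu(x):
    # R(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 2.0, 7.0])
print(relu(x))  # [0. 0. 0. 2. 7.]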

There are several pros and cons of using ReLUs:

  • (+) In practice, it has been shown that ReLU accelerates the convergence of gradient descent in comparison to other activation functions. This is due to its linear, non-saturating form.
  • (+) While other activation functions (tanh and sigmoid) involve very computationally expensive operations such as exponentials etc., ReLU, on the other hand, can be easily implemented by simply thresholding a vector of values at zero.
  • (-) Unfortunately, there is also a problem with the ReLU activation function. Because the output of this function is zero for all inputs below zero, the neurons of the network can be very fragile during training and can even "die". What does this mean? It can happen (it does not have to, but it can) that during the weight updates the weights are adjusted in such a way that, for certain neurons, the input is always below zero. The outputs of these neurons are then always zero and no longer contribute to the training process, and the gradient flowing through these ReLU neurons is also zero from that point on. We say that the neurons are "dead". For example, it is quite common to observe that as much as 20–50% of the neurons in a network that uses ReLU activations are "dead", i.e. these neurons never activate on any example in the entire training dataset.

4.4 Leaky ReLU

Leaky ReLU is nothing more than an improved version of the ReLU activation function. As mentioned in the previous section, by using ReLU we may "kill" some neurons in our neural network, and these neurons will never activate on any data again.

Leaky ReLU was defined to address this problem. In contrast to "vanilla" ReLU, where all outputs are zero for input values below zero, leaky ReLU adds a small linear component to the function:

f(x) = αx   if x < 0
f(x) = x    if x >= 0

The leaky ReLU activation looks as follows:

Leaky ReLU

Basically, we have replaced the horizontal line for values below zero with a non-horizontal line. The slope of this line can be adjusted via the parameter α, which is multiplied with the input x.

The advantage of using leaky ReLU and replacing the horizontal line is that we avoid zero gradients: there are no longer "dead" neurons whose output is always zero and which therefore stop the gradient from flowing.
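
Here is a short sketch of leaky ReLU; the slope α = 0.01 is just a commonly used illustrative value, not a fixed constant:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha * x for negative inputs, x for everything else
    return np.where(x < 0, alpha * x, x)

x = np.array([-3.0, -0.5, 0.0, 2.0, 7.0])
print(leaky_relu(x))  # approximately [-0.03, -0.005, 0.0, 2.0, 7.0]

# Negative inputs now produce small non-zero outputs, so the gradient no longer becomes exactly zero.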

4.5 Softmax Activation Function

Last but not least, I would like to introduce the softmax activation function. This activation function is quite unique.

Softmax is applied only in the last layer and only when we want the neural network to predict probability scores during classification tasks.

Simply speaking, the softmax activation function forces the values of output neurons to take values between zero and one, so they can represent probability scores.

Another thing we must consider is that when we classify input features into different classes, these classes are mutually exclusive. This means that each feature vector x belongs to exactly one class. A feature vector that is an image of a dog cannot represent the dog class with a probability of 50% and the cat class with a probability of 50%; it must represent the dog class with a probability of 100%.

Besides, in the case of mutually exclusive classes, the probability scores across all output neurons must sum up to one. Only this way does the neural network represent a proper probability distribution. A counterexample would be a neural network that classifies a dog image into the class dog with a probability of 80% and into the class cat with a probability of 60%.

Fortunately, the softmax function not only forces the outputs into the range between zero and one, but also makes sure that the outputs across all possible classes sum up to one. Let's now see how the softmax function works.

Softmax Plot

Imagine that the neurons in the output layer receive an input vector z, which is the result of a dot-product between the weight matrix of the current layer and the output of the previous layer. A neuron in the output layer with a softmax activation receives a single value z_1, which is an entry of the vector z, and outputs the value y_1.

When we use softmax activation every single output of a neuron in the output layer is computed according to the following equation:

Softmax Equation

As you can see each value y of a particular neuron does not only depend on the value z which the neuron receives but on all values in the vector z. This makes each value y of an output neuron a probability value between 0 and 1. And the probability predictions across all output neurons sum up to one.

This way the output neurons now represent a probability distribution over the mutually exclusive class labels.
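
Here is a minimal softmax sketch with an arbitrary example vector z; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result:

import numpy as np

def softmax(z):
    # shift by the maximum value for numerical stability; mathematically the result is unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])  # example values received by the output neurons
y = softmax(z)
print(y)            # approximately [0.659, 0.242, 0.099]
print(np.sum(y))    # 1.0 - a proper probability distribution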


5. What Activation Functions should we use?

I will answer this question with the best answer there is: It depends. ¯\_(ツ)_/¯

Specifically, it depends on the problem you are trying to solve and the value range of the output you are expecting.

For example, if you want your neural network to predict values that are larger than 1, then tanh or sigmoid are not suitable to be used in the output layer, and we must use ReLU instead.

On the other hand, if we expect the output values to be in the range [0,1] or [-1, 1] then ReLU is not a good choice for the output layer and we must use sigmoid or tanh.

If you perform a classification task and want the neural network to predict a probability distribution over the mutually exclusive class labels, then the softmax activation function should be used in the last layer.

However, regarding the hidden layers, as a rule of thumb, I would strongly suggest always using ReLU as the activation function for these layers.


Take-Home-Message

  • Activation functions add a non-linear property to the neural network. This way the network can model more complex data
  • ReLU should generally be used as an activation function in the hidden layers
  • Regarding the output layer, we must always consider the expected value range of the predictions
  • For classification tasks, I recommend using the softmax activation exclusively in the output layer