Before we start with the activation function, let us quickly learn what a model is. This article is a little long because it uses examples to make things simple to understand.
What is a model?
A model consists of 3 parts – input, actions on the input, and desired output.
We have input; we perform actions on it to get the desired output.
What is the activation function?
An activation function is an action we perform on the input to get output. Let us understand it more clearly.
We all know that deep learning (DL), a part of Artificial Intelligence (AI), is loosely modelled on the network of neurons in the human brain. For example, if you burn yourself a little, you may or may not scream; if you burn yourself badly, you shout so loudly that the entire building knows.
Similarly, an activation function decides whether a neuron must be activated or not. It is a function used in DL which outputs a small value for small inputs and a large value if the input exceeds the threshold.
If the inputs are large enough (like a severe burn), the neuron activates; otherwise it does nothing. In other words, an activation function is like a gate that checks whether an incoming value is greater than a critical number (the threshold).
Like the neuron-based model in our brain, the activation function decides what must be forwarded to the next neuron. The activation function takes the output from the previous cell (neuron) and converts it to input for the next cell.
Suppose you see a senior citizen distributing free chocolates. Your brain senses a tempting offer and passes the signal on to the next neurons (your legs), telling you to start running towards him (the output from the preceding neuron).
Once you reach him, you extend your hand to get the chocolate. So the output of every neuron is the input for your next action.
Why is the activation function important?
The activation function in a neural network decides whether or not a neuron is activated and whether its output is passed on to the next layer. It determines whether the neuron's input is relevant for prediction, detection, and more.
It also adds non-linearity to neural networks, which helps them learn complex patterns.
If we remove the activation function from a feedforward neural network, the whole network reduces to a single linear operation (a matrix transformation) on its input, no matter how many layers it has; it would no longer be capable of performing complex tasks such as image recognition.
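A quick way to see this for yourself is the minimal Python sketch below. It is only an illustration: the weights W1 and W2 and the input x are made-up numbers, and the helper names matvec and matmul are ours. Two layers without an activation give exactly the same answer as one merged layer, while putting ReLU (covered later in this article) between them breaks that equivalence.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(values):
    """Apply max(0, v) to every element."""
    return [max(0.0, v) for v in values]

# Hypothetical weights for two layers and a hypothetical input.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 1.0], [2.0, 0.5]]
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))       # layer 1, then layer 2, no activation
one_layer = matvec(matmul(W2, W1), x)        # the two layers merged into one matrix
with_relu = matvec(W2, relu(matvec(W1, x)))  # ReLU between the layers

print(two_layers)  # [-7.0, -5.0]
print(one_layer)   # [-7.0, -5.0]
print(with_relu)   # [2.0, 1.0]
```

Because the first two results match, the second layer added nothing that a single layer could not already do.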
Now let us discuss some commonly used activation functions.
1. Sigmoid Activation Function
The sigmoid function is mainly used to solve non-linear problems. A non-linear problem is one where the output is not proportional to the change in the input. We can use the sigmoid activation function to solve binary classification problems.
Consider an example.
Students appear for an examination, and the faculty designs an AI model to declare the results. They set the criterion that students scoring 50 percent or more pass and those scoring below 50 percent fail. So the inputs are the percentages, and the binary classification takes place using the sigmoid activation function.
If the percentage is 50 percent or above, it will give the output 1 (pass).
Otherwise, it will give the output 0 (fail).
Output value – 0 to 1
If value >= 0.5, Output = 1
If value < 0.5, Output = 0
Derivative of sigmoid – 0 to 0.25.
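The pass/fail rule above can be sketched in a few lines of Python. This is a simplified illustration, not a trained model: we assume the percentage is shifted by the pass mark before it goes into the sigmoid, so that a score of exactly 50 lands on 0.5. The function names and the pass_mark parameter are ours.

```python
import math

def sigmoid(x):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Slope of the sigmoid; it peaks at 0.25 when x is 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def classify(percentage, pass_mark=50.0):
    # Assumption: centre the input on the pass mark so 50 percent maps to 0.5.
    value = sigmoid(percentage - pass_mark)
    return 1 if value >= 0.5 else 0

for score in (35, 49, 50, 55, 80):
    print(score, classify(score))

print(sigmoid_derivative(0.0))  # 0.25, the largest value the derivative can take
```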
What happens in the neural network?
A weight is assigned to each input in the neural network. Different inputs have different weights. Each weight is multiplied with its input, and at the next layer, all the products (w*x) are added.
∑ wi*xi = w1*x1 + w2*x2 + … + wn*xn
Based on these weights and the activation, we get an output. Naturally, the system might make some mistakes while learning (it might consider 55% a fail). In this case, to teach the system better, we take the derivative of the function and send it back to change the weights for correction, like a feedback mechanism.
Glance at the formula for your understanding, and skip it if it confuses you: sigmoid(x) = 1 / (1 + e^(-x)), and its derivative is sigmoid(x) * (1 - sigmoid(x)).
The derivative of the function is crucial for the feedback mechanism and corrections. Its range is only 0 to 0.25, which limits how large each correction can be. This feedback and correction process is called backpropagation: the error in the output is passed backwards through the network to adjust the weights and improve the accuracy.
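Here is a minimal sketch of that feedback loop for a single neuron. Everything specific in it is an assumption made for illustration: the starting weights, the two inputs, the learning rate of 0.5, and the choice of squared error as the measure of the mistake. The neuron starts out predicting "fail" for a student who should pass, and the derivative of the sigmoid is used to nudge the weights until the prediction flips.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_sum(weights, inputs):
    """Compute w1*x1 + w2*x2 + ... + wn*xn."""
    return sum(w * x for w, x in zip(weights, inputs))

def update_weights(weights, inputs, target, learning_rate=0.5):
    """One gradient step on the squared error 0.5 * (output - target)**2."""
    output = sigmoid(weighted_sum(weights, inputs))
    error = output - target
    slope = output * (1.0 - output)  # derivative of the sigmoid at the weighted sum
    return [w - learning_rate * error * slope * x for w, x in zip(weights, inputs)]

weights = [0.1, -0.2]  # hypothetical starting weights
inputs = [1.0, 0.55]   # hypothetical inputs: a constant 1 and a score of 55% scaled to 0..1
target = 1.0           # desired output: pass

before = sigmoid(weighted_sum(weights, inputs))
for _ in range(100):
    weights = update_weights(weights, inputs, target)
after = sigmoid(weighted_sum(weights, inputs))

print(before < 0.5)  # True: the untrained neuron says "fail"
print(after > 0.5)   # True: after the corrections it says "pass"
```

Notice that every correction is multiplied by the slope, which is never larger than 0.25 for the sigmoid.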
- Gives you a smooth gradient during training, preventing jumps in output values.
- The output is normalised: it always lies between 0 and 1.
- Gives a clear prediction (classification) of 1 or 0 once the output is thresholded at 0.5, like pass/fail in the example above.
- Prone to the vanishing gradient problem. The derivative is at most 0.25, so in a deep neural network the gradient is multiplied by a small number at every layer. After some layers you get very small values, and the weights stop updating. The more hidden layers your neural network has (the deeper it is), the more easily this problem occurs.
- Not a zero-centred function: the output is always positive, so it is not centred around 0.
- Computationally expensive function (it uses an exponential).
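To get a feel for how quickly the gradient can vanish, the small sketch below multiplies the largest possible sigmoid derivative (0.25) once per layer. It is a best-case estimate under a simplifying assumption: it ignores the weights entirely and only looks at the activation's contribution. The function name is ours.

```python
MAX_SIGMOID_SLOPE = 0.25  # the largest value the sigmoid derivative can take

def best_case_gradient_factor(num_layers):
    """Upper bound on the product of sigmoid derivatives across num_layers layers."""
    return MAX_SIGMOID_SLOPE ** num_layers

for layers in (1, 2, 5, 10, 20):
    print(layers, best_case_gradient_factor(layers))
```

Even in this best case, ten sigmoid layers shrink the signal to less than one millionth of its original size.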
2. Tanh Activation Function
Tanh is the hyperbolic tangent function. Its output is generally fed into a binary probabilistic function such as sigmoid, so we can use tanh when solving binary classification problems. In the tanh activation function, the values range from -1 to 1, and the derivative of tanh ranges from 0 to 1.
Note – To solve a binary classification problem, we can use tanh for the hidden layers (to reduce the vanishing gradient problem) and sigmoid for the output layer. However, the chances of a vanishing gradient remain.
• Gives a smooth gradient during training.
• Zero-centric function, unlike Sigmoid.
• Derivatives of tanh function range between 0-1. It is better than the sigmoid activation function but does not solve the vanishing gradient problem in backpropagation for deep neural networks.
• Computationally expensive function (exponential in nature).
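The ranges mentioned above are easy to check with Python's built-in math.tanh. This is just a small sketch; the helper name tanh_derivative is ours, and it uses the standard identity that the derivative of tanh(x) is 1 - tanh(x)^2.

```python
import math

def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)**2, which peaks at 1 when x is 0."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, math.tanh(x), tanh_derivative(x))
```

The output is centred on zero (tanh(0) is 0), and the derivative reaches 1 at x = 0 but still falls towards 0 for large positive or negative inputs, which is why the vanishing gradient problem does not go away completely.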
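3. ReLU Activation Function
ReLU stands for Rectified Linear Unit; it is currently one of the most popular activation functions. It is linear for positive inputs and zero for negative inputs, which makes it non-linear overall. Range of values of ReLU: 0 to infinity.
ReLU(x) = max(0 , x)
Derivative of ReLU: 0 for negative inputs, 1 for positive inputs.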
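• Deals with the vanishing gradient problem, because the derivative is 1 for every positive input.
• Computationally inexpensive function (a simple comparison, with no exponential).
• Calculation speed is much faster.
• If the input to a neuron is negative, its derivative is 0 and its weights receive no update; if this happens for every training example, that neuron is completely dead during backpropagation.
• Not a zero-centred function.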
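ReLU is simple enough to write in one line. The sketch below shows the function and its derivative; the choice of slope 0 at exactly x = 0 is a convention we assume here, since the derivative is not defined at that single point.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def relu_derivative(x):
    # Assumed convention: use slope 0 at x == 0, where the true derivative is undefined.
    return 1.0 if x > 0 else 0.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, relu(x), relu_derivative(x))
```

Every negative input gets both an output of 0 and a slope of 0, which is exactly the situation that produces dead neurons.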
4. Leaky ReLU Activation Function
Use Leaky ReLU to solve the dying ReLU problem. In Leaky ReLU, negative inputs are not set to zero, so the derivative keeps a small non-zero value when a negative number is entered.
Leaky ReLU = max(0.01x , x)
With the ReLU activation function, the gradient is 0 for all inputs less than zero, which deactivates the neurons in that region and may cause the dying ReLU problem.
Leaky ReLU is defined to address this problem. Instead of defining the activation function as 0 for negative values of the input (x), we define it as an extremely small linear component of x. Here is the formula for the Leaky ReLU activation function:
f(x)=max(0.01*x , x)
This function returns x if it receives any positive input, but for any negative value of x, it returns a small value that is 0.01 times x. Thus it gives an output for negative values as well. The gradient on the left side of the graph is a non-zero value. We no longer encounter dead neurons in that region.
- Solves the dead neuron problem.
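The formula above translates directly into Python. In this sketch the 0.01 factor is exposed as a parameter called slope (our name) so you can see what changes when it is adjusted.

```python
def leaky_relu(x, slope=0.01):
    """Return x for positive inputs and slope * x for negative inputs."""
    return max(slope * x, x)

def leaky_relu_derivative(x, slope=0.01):
    """Gradient is 1 on the right side and the small slope on the left side."""
    return 1.0 if x > 0 else slope

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

The derivative for negative inputs is 0.01 rather than 0, so the weights behind that neuron can still receive small updates.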
5. ELU Activation Function
ELU stands for Exponential Linear Unit.
Whenever the x value is greater than 0, we use the x value itself; otherwise we apply the exponential function below.
y = x ; if x > 0
y = α(e^x – 1) ; if x <= 0
Here α is a positive constant, commonly set to 1.
• Gives a smooth, non-zero output for negative values, which helps convergence.
• Slightly more computationally expensive because it uses an exponential.
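As a quick sketch, the two cases of ELU can be written with Python's math.exp. We assume α = 1.0 as the default here, which is a common choice but still just an assumption for this example.

```python
import math

def elu(x, alpha=1.0):
    """Return x for positive inputs and alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, elu(x))
```

For large negative inputs the output levels off near -α instead of growing without limit, and the curve stays smooth as it passes through zero.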
6. PReLU Activation Function
PReLU stands for Parametric ReLU. For negative inputs the output is y = a*x, where the slope a is a learnable parameter: it is tuned during training together with the weights (unlike the fixed slope of zero in ReLU and 0.01 in Leaky ReLU).
If a = 0, y will be ReLU
If a is a small fixed value such as 0.01, y will be Leaky ReLU
If a is a learnable parameter, y will be PReLU
- The slope for negative inputs is learned from the data instead of being fixed in advance (zero in the case of ReLU and 0.01 in the case of Leaky ReLU).
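The sketch below shows what "learnable" means in practice. It is a toy illustration with made-up numbers: a single negative input x = -2, a hypothetical desired output of -0.5, squared error as the measure of the mistake, and a learning rate of 0.1. The slope starts at the Leaky ReLU value of 0.01 and is adjusted by gradient descent like any other weight.

```python
def prelu(x, a):
    """Return x for positive inputs and a * x otherwise."""
    return x if x > 0 else a * x

def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to the slope a."""
    return 0.0 if x > 0 else x

x, target = -2.0, -0.5  # hypothetical training example
a = 0.01                # start from the Leaky ReLU slope
learning_rate = 0.1

for _ in range(50):
    error = prelu(x, a) - target
    a -= learning_rate * error * prelu_grad_a(x)  # gradient step on the slope

print(round(a, 4))        # 0.25, the slope that maps -2 to -0.5
print(prelu(-3.0, 0.0))   # with a = 0 the function behaves like ReLU
print(prelu(-3.0, 0.01))  # with a = 0.01 it behaves like Leaky ReLU
```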
7. Swish Activation Function
Swish is a smooth continuous function, unlike ReLU, which is a piecewise linear function. It is defined as f(x) = x * sigmoid(βx). Swish allows small negative values to propagate, while ReLU thresholds all negative values to zero, and this can matter in deep neural networks. The optional trainable parameter β tunes the activation function and helps optimize the neural network. It is a self-gating function: the input is multiplied by its own sigmoid, which acts as a gate, a concept first introduced in Long Short-Term Memory (LSTM) networks.
• Deals with the vanishing gradient problem.
• The output is a blend of ReLU and sigmoid: close to ReLU for large positive inputs, but smooth around zero.
• Differentiable everywhere, including at zero (unlike ReLU, whose derivative is undefined at zero).
• Computationally expensive function (it has to evaluate a sigmoid).
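A minimal sketch of Swish in Python follows, written as x * sigmoid(beta * x). We fix beta at 1.0 by default; treating it as a plain function argument instead of a trained parameter is a simplification made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    """The input multiplied by its own sigmoid gate."""
    return x * sigmoid(beta * x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0, 10.0):
    print(x, swish(x))
```

Small negative inputs give small negative outputs instead of a hard zero, while large positive inputs pass through almost unchanged, much like ReLU.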
8. Softmax Activation Function
Softmax is used for solving multiclass classification problems. It converts the raw scores for the different classes into probabilities that add up to 1.
It is used in the output layer of neural networks that classify the input into multiple categories.
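Here is a small softmax sketch in plain Python. The three input scores are made-up numbers for an imaginary three-class problem. Subtracting the largest score before taking the exponential is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs)
print(sum(probs))
print(probs.index(max(probs)))    # 0: the class with the highest score wins
```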
Tips for beginners
Q – Which activation function solves the binary classification problem?
A – For the hidden layers, use ReLU/PReLU/Leaky ReLU, and for the output layer, use the sigmoid activation function.
Q – Which activation function solves the multiclass classification problem?
A – For the hidden layers, use ReLU/PReLU/Leaky ReLU, and for the output layer, use the softmax activation function.
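To tie the two tips together, here is a rough sketch of a tiny network's forward pass. All the weights, biases, and the input are made-up, untrained numbers, and the helper name dense is ours; the point is only to show where each activation function sits: ReLU in the hidden layer, then sigmoid for a binary output or softmax for a multiclass output.

```python
import math

def dense(weights, biases, inputs):
    """One fully connected layer: a weighted sum of the inputs plus a bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, -1.2]  # hypothetical input with two features

# Hidden layer with ReLU (hypothetical weights and biases).
hidden = relu(dense([[0.4, -0.6], [0.3, 0.8], [-0.5, 0.1]], [0.0, 0.1, -0.2], x))

# Binary classification: one output neuron with sigmoid.
binary_output = sigmoid(dense([[0.7, -0.3, 0.5]], [0.05], hidden)[0])

# Multiclass classification: three output neurons with softmax.
multi_output = softmax(dense([[0.2, 0.1, -0.4], [-0.3, 0.6, 0.2], [0.5, -0.2, 0.1]],
                             [0.0, 0.0, 0.0], hidden))

print(binary_output)  # a single probability between 0 and 1
print(multi_output)   # three probabilities that add up to 1
```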
Well done! You have made it to the end.