Why Do We Need Activation Functions?
In general, a standard neuron learns a set of weights and a bias, which it uses to linearly transform its input into its output.
But a neural net made up entirely of these neurons collapses into a single linear function, no matter how many layers we stack, so the extra depth buys us nothing.
An activation function is responsible for deciding whether to “activate” each node, turning it on or off.
With an adequate net architecture, this introduction of nonlinearity will allow our net to go from a simple linear function to a universal function approximator!
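As a quick illustration (the shapes and values below are arbitrary, not from any particular net), two stacked linear layers with no activation collapse into a single linear layer:

import numpy as np

# Two "layers" with no activation collapse into one linear map:
# W2 @ (W1 @ x + b1) + b2 == W @ x + b.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2        # "deep" net, no activation
W, b = W2 @ W1, W2 @ b1 + b2                # equivalent single layer
print(np.allclose(two_layers, W @ x + b))   # True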
What Does an Activation Function Look Like?
Simplest activation: a step function that outputs 1 if the neuron's weighted sum passes a certain learned threshold, and 0 otherwise.
Problem: makes it difficult to learn weights.
Why? Backpropagation relies heavily on gradients to learn weights, and a step function's gradient is zero everywhere (and undefined at the threshold), so no learning signal flows back.
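A tiny NumPy sketch of the problem (the threshold and inputs are made up): the step activation is flat on both sides of the threshold, so a finite-difference estimate of its gradient is zero everywhere.

import numpy as np

def step(x, threshold=0.0):
    # Hard threshold: 1 if the input passes the (here fixed) threshold, else 0.
    return (x > threshold).astype(float)

x = np.array([-2.0, -0.1, 0.1, 2.0])
eps = 1e-3
grad = (step(x + eps) - step(x - eps)) / (2 * eps)  # flat on both sides
print(step(x))   # [0. 0. 1. 1.]
print(grad)      # [0. 0. 0. 0.] -> nothing for backpropagation to use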
So we need activation functions that provide useful gradients...
Sigmoids are popular because they are both nonlinear and differentiable. They squash the input into a value between 0 and 1, acting like a smooth version of the step function.
However, the output is always positive, which may not be desirable.
Tanh activation is similar to sigmoid, but is centered about 0 (range is [-1, 1]), solving the problem of positive-only outputs.
But both activations suffer from the vanishing gradient problem: for inputs far from 0 they saturate and become flat, so the gradient shrinks towards zero and the weights stop being updated.
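A minimal NumPy sketch (the input values are chosen just for illustration) of both squashing activations and how their gradients shrink for large inputs:

import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1); always positive.
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    # tanh'(x) = 1 - tanh(x)^2; tanh itself is zero-centered, in (-1, 1).
    return 1.0 - np.tanh(x) ** 2

x = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid(x), np.tanh(x))   # outputs flatten out as |x| grows
print(d_sigmoid(x))             # ~[0.25, 0.105, 0.0066, 0.000045]
print(d_tanh(x))                # ~[1.0, 0.071, 0.00018, 0.0000000082]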
ReLU and Leaky ReLU
The Rectified Linear Unit (ReLU) is nonlinear and differentiable everywhere except at 0, where the gradient is simply set by convention.
The ReLU function is just max{0, x}: it passes the input through if it is positive, and outputs 0 otherwise.
This ensures that a useful, constant gradient of 1 exists for inputs larger than 0.
However, for inputs below 0 the gradient is zero, which can again result in a dead neuron that no longer learns.
Leaky ReLU fixes this by adding a small gradient below 0, giving the neuron a chance to revive over time if backpropagation wills it to do so.
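A quick NumPy sketch of ReLU and Leaky ReLU and their gradients (alpha = 0.01 is just a common choice of slope, not a fixed rule):

import numpy as np

def relu(x):
    # max(0, x): passes positive inputs, zeroes out negative ones.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but keeps a small slope alpha below 0.
    return np.where(x > 0, x, alpha * x)

def d_relu(x):
    # Gradient is 1 for x > 0 and 0 for x < 0 (the dead region).
    return (x > 0).astype(float)

def d_leaky_relu(x, alpha=0.01):
    # Small but nonzero gradient below 0, so the neuron can keep learning.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x), d_relu(x))
print(leaky_relu(x), d_leaky_relu(x))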
Softmax
Takes a vector of scores and returns a vector of values that sums to 1.
Useful for outputting probabilities over multiple classes, e.g. classifying an image as one type of animal.
However, some models, such as YOLOv3, stick to independent logistic classifiers (sigmoids) that give a separate probability for each class, which allows labels that are not mutually exclusive.
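A minimal NumPy sketch of softmax (the scores are made up), using the usual max-subtraction trick for numerical stability:

import numpy as np

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # e.g. raw scores for cat / dog / bird
probs = softmax(scores)
print(probs, probs.sum())            # positive values that sum to 1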