Introduction to artificial neural networks and activation functions

Anuroobika K
Jul 1, 2022 · 5 min read

Artificial neural networks (ANNs) are widely used in natural language processing (NLP) and in image, speech, and pattern recognition. The artificial neurons that make up a neural network are inspired by biological neurons.

The mammalian brain contains vast networks of neurons. The human brain, for example, has roughly 100 billion neurons with about 100 trillion connections between them. A neuron receives signals and, through a chemical process, decides whether to pass a signal on to other neurons through its axon. All of our brain activity is carried out by these 100 billion neurons and the 100 trillion connections between them. It is astounding to think how such simple units, when combined in the billions, can accomplish amazing tasks.

[Figure: a biological neuron]

The neural network that we use in machine learning is a web of artificial neurons. Each neuron collects signals from other neurons; x1, x2, x3 could be the outputs of other neurons. Each signal is scaled by a weight, which encodes its strength; the weighted signals are summed and fed to a non-linear function, which decides the output. The other term to note here is the bias term 'b', a constant added to the weighted sum. Most artificial neural networks include a bias term.
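As a minimal sketch in Python with NumPy (the numbers below are made up for illustration), a single artificial neuron could be written as:

```python
import numpy as np

def neuron(x, w, b, g):
    """A single artificial neuron: weighted sum of inputs plus bias,
    passed through a non-linear activation g."""
    return g(np.dot(w, x) + b)

# Illustrative values only:
x = np.array([0.5, -1.2, 3.0])   # signals x1, x2, x3 from other neurons
w = np.array([0.8, 0.1, -0.4])   # strengths (weights) of each signal
b = 0.2                          # bias term

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron(x, w, b, sigmoid))  # output of the neuron
```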

Next, let's look at the non-linear function 'g'. Before 'g' is applied, what we have is a weighted sum of signals, which is a linear function of the inputs. If we removed the non-linear function g from the network, then as signals passed from one neuron to the next they would always remain linear combinations of the initial signals, no matter how many neurons sit between input and output. Thus, the non-linear function 'g' is what lets the network model complex input-output relationships. It is also called the activation function.
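A quick NumPy check makes this concrete: two layers with no activation between them collapse into a single linear layer (the matrices here are random, purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Two "layers" with no activation in between...
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse into one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: still just a linear map
```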

Types of activation functions (a short NumPy sketch of each follows this list):
Step function: It gives a binary output of 1 or 0. It was the most basic activation function and was used for classification and splitting a region in two, but it is rarely used anymore.
Sigmoid function: The sigmoid goes toward 0 for large negative values and toward 1 for large positive values. It approximates the step function without the sharp transition. Closely related is the tanh (hyperbolic tangent) function.
In the hyperbolic tangent the two extremes are -1 and +1 instead of 0 and 1: for large negative values it goes toward -1 and for large positive values toward +1. Qualitatively, though, it behaves like the sigmoid function.
Rectified linear unit (ReLU): Whenever the weighted sum of inputs is less than 0, the output is 0; whenever the weighted sum is greater than 0, the output equals the input. It suppresses all negative values and lets positive values pass through unchanged.
Softmax: While the step or sigmoid function is used for binary classification, softmax can be used for n-class classification.
Linear: In the linear function, the output is the same as the input. Unlike the other functions, this allows a wide range of output values and is therefore used mainly where the output can take a wide range, such as in regression problems.
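For reference, here is one way the functions listed above could be written in NumPy; this is only a sketch of the definitions, not a production implementation.

```python
import numpy as np

def step(z):                      # binary 0/1 output
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):                   # smooth output between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                      # smooth output between -1 and +1
    return np.tanh(z)

def relu(z):                      # 0 for negatives, identity for positives
    return np.maximum(0.0, z)

def softmax(z):                   # probabilities over n classes, summing to 1
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

def linear(z):                    # identity: output equals input
    return z

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z), sigmoid(z), tanh(z), relu(z), linear(z), softmax(z), sep="\n")
```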

More details on activation functions:
As seen above, the step function divides the input space into two halves, 0 and 1. If we try to learn something about the input by changing the weights in a trial-and-error fashion, the step function tells us nothing, because its slope is 0. With the sigmoid function, on the other hand, we can observe whether the output increases or decreases in response to small changes in the weights, because the slope is not 0.
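A small numeric illustration (the values of x, w and b below are made up): nudging a weight changes the sigmoid output, so we can tell which direction helps, while the step output does not move at all.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
step = lambda z: np.where(z > 0, 1.0, 0.0)

x, b = 1.5, 0.0                   # a fixed input and bias (illustrative values)
w, dw = 0.4, 0.01                 # a weight and a small nudge to it

# The sigmoid output shifts slightly, giving a learning signal;
# the step output stays exactly the same (zero slope).
print(sigmoid(w * x + b), sigmoid((w + dw) * x + b))
print(step(w * x + b), step((w + dw) * x + b))
```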

The sigmoid function is a smoother approximation of the step function. When we are learning, we change the weights of the inputs or the bias term, but not the inputs themselves. For example, if 'x' is an input and w1x + b is the output of the function, we can change w1 or b but not x.
Changing the bias term 'b' shifts the mid-point of the curve left or right: when b is 0 the mid-point is at the center, when b is negative the mid-point moves to the right, and when b is positive it moves to the left (the mid-point sits where w1·x + b = 0, i.e. at x = -b/w1). Changing the weight 'w1' controls the steepness: for large values of w1 the transition is sharp, and for small values of w1 it is gradual.
Thus, when we are tuning a neural network, if one of the neurons uses a sigmoid activation, then by changing its weights or bias we can produce the changes described above in its output. This tells us how to change the weights during backpropagation (the learning algorithm) so that we can drive the output in the desired direction.
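A short sketch of both effects, with illustrative values only:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# The mid-point (where the output crosses 0.5) sits where w1*x + b = 0,
# i.e. at x = -b / w1, so the sign of b shifts the curve left or right.
for w1, b in [(1.0, 0.0), (1.0, -2.0), (1.0, 2.0)]:
    print(f"w1={w1}, b={b}: mid-point at x = {-b / w1}")

# Larger w1 makes the transition around the mid-point sharper:
# the slope there is w1 * s * (1 - s) with s = 0.5, i.e. w1 / 4.
for w1 in (0.5, 1.0, 5.0):
    x = np.linspace(-0.1, 0.1, 3)            # tiny window around the mid-point
    print(f"w1={w1}:", np.round(sigmoid(w1 * x), 3))
```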

A relatively newer activation function that has gained popularity is the rectified linear unit (ReLU). The problem with the sigmoid is that for large input values (both positive and negative), its output barely changes as the input changes. ReLU, in contrast, is flat only on the negative side and never flat on the positive side. Thus, for one half of the inputs ReLU gives some direction on whether a weight should be increased or decreased, while for the other half it gives no direction. In a well-initialized neural network, weights are initialized randomly, so for roughly half of the inputs we get some direction to train the network. However, ReLU is never used at the output node. The output of a neural network is either a class probability or a continuous regression value. A ReLU output can go above 1, whereas for a probability we want values between 0 and 1, so we use sigmoid or its generalization, softmax. For a regression problem we do not want negative values suppressed, so we use a linear unit rather than ReLU. Thus, ReLU is used only for hidden neurons.
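To make the "direction for learning" point concrete, here is a rough comparison of the gradients of sigmoid and ReLU at small and large inputs, using the standard derivative formulas (a sketch only):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)              # nearly 0 for large |z|: little learning signal

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)  # 1 on the positive side, 0 on the negative

z = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(z))  # ~[4.5e-05, 0.197, 0.197, 4.5e-05]
print(relu_grad(z))     # [0, 0, 1, 1]
```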

Conclusion:
A quick summary for selecting the right activation function:
* Sigmoid for binary classification output
* Tanh for output range in (-1, +1)
* Softmax for n-class classification
* Linear for regression
* ReLU for internal (non-output) nodes

We will discuss the basic structure of a neural network in the next blog.

