Activation function

The activation function of a node in an artificial neural network is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear.
Modern activation functions include the logistic (sigmoid) function, the rectified linear unit (ReLU), and the GELU, a smooth variant of the ReLU.
Comparison of activation functions
Aside from their empirical performance, activation functions also have different mathematical properties:
- Nonlinear
- When the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator.[6] This is known as the Universal Approximation Theorem. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
- Range
- When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.[citation needed]
- Continuously differentiable
- This property is desirable for enabling gradient-based optimization methods (ReLU is not continuously differentiable and has some issues with gradient-based optimization, but optimization is still possible). The binary step activation function is not differentiable at 0, and its derivative is 0 for all other values, so gradient-based methods can make no progress with it.[7]
These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in variational autoencoders.
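As a minimal illustration of the softplus remark above, the following NumPy sketch (the function names are illustrative, not from any particular library) maps an unconstrained network output through softplus so that the result is strictly positive and can be used as a predicted variance:

```python
import numpy as np

def softplus(x):
    # ln(1 + e^x), computed in a numerically stable way
    return np.logaddexp(0.0, x)

# An unconstrained network output (any real number)...
raw_output = np.array([-5.0, 0.0, 3.2])

# ...is mapped to a strictly positive value, usable as a variance.
variance = softplus(raw_output)
print(variance)  # every entry is > 0
```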
Mathematical details
The most common activation functions can be divided into three categories: ridge functions, radial functions, and fold functions.
An activation function $f$ is saturating if $\lim_{|v|\to\infty} |\nabla f(v)| = 0$. It is nonsaturating if it is not saturating. Non-saturating activation functions, such as ReLU, may be better than saturating ones because they are less likely to suffer from the vanishing gradient problem.
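The distinction can be checked numerically; in this small sketch (assuming NumPy; the helper names are illustrative), the logistic sigmoid's derivative collapses toward zero for large inputs while the ReLU derivative stays at 1 for positive inputs:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # tends to 0 as |x| grows (saturating)

def relu_grad(x):
    return (x > 0).astype(float)  # stays 1 for x > 0 (non-saturating)

x = np.array([1.0, 10.0, 100.0])
print(sigmoid_grad(x))  # rapidly vanishing gradients
print(relu_grad(x))     # all ones
```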
Ridge activation functions
Ridge functions are multivariate functions acting on a linear combination of the input variables. Often used examples include:[clarification needed]
- Linear activation: $\varphi(\mathbf{v}) = a + \mathbf{v}'\mathbf{b}$,
- ReLU activation: $\varphi(\mathbf{v}) = \max(0, a + \mathbf{v}'\mathbf{b})$,
- Heaviside activation: $\varphi(\mathbf{v}) = 1_{a + \mathbf{v}'\mathbf{b} > 0}$,
- Logistic activation: $\varphi(\mathbf{v}) = \left(1 + \exp(-a - \mathbf{v}'\mathbf{b})\right)^{-1}$.
In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell. In its simplest form, this function is binary: the neuron either fires or it does not. The function looks like $\varphi(\mathbf{v}) = U(a + \mathbf{v}'\mathbf{b})$, where $U$ is the Heaviside step function.
If a line has a positive slope, on the other hand, it may reflect the increase in firing rate that occurs as input current increases. Such a function would be of the form $\varphi(\mathbf{v}) = a + \mathbf{v}'\mathbf{b}$.
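A short sketch of the ridge pattern, assuming NumPy and illustrative names: each activation applies a scalar nonlinearity to the linear combination $a + \mathbf{v}'\mathbf{b}$ of the input vector.

```python
import numpy as np

def ridge_activation(v, a, b, g):
    """Apply a scalar nonlinearity g to the linear combination a + v.b."""
    z = a + v @ b
    return g(z)

v = np.array([0.5, -1.0, 2.0])          # input vector
a, b = 0.1, np.array([1.0, 0.3, -0.7])  # offset and weight vector

linear    = ridge_activation(v, a, b, lambda z: z)
relu      = ridge_activation(v, a, b, lambda z: max(0.0, z))
heaviside = ridge_activation(v, a, b, lambda z: float(z > 0))
logistic  = ridge_activation(v, a, b, lambda z: 1.0 / (1.0 + np.exp(-z)))
```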

Radial activation functions
A special class of activation functions known as radial basis functions (RBFs) are used in RBF networks. These activation functions can take many forms, but they are usually found as one of the following functions:
- Gaussian: $\varphi(\mathbf{v}) = \exp\left(-\dfrac{\|\mathbf{v} - \mathbf{c}\|^2}{2\sigma^2}\right)$
- Multiquadratics: $\varphi(\mathbf{v}) = \sqrt{\|\mathbf{v} - \mathbf{c}\|^2 + a^2}$
- Inverse multiquadratics: $\varphi(\mathbf{v}) = \left(\|\mathbf{v} - \mathbf{c}\|^2 + a^2\right)^{-1/2}$
- Polyharmonic splines
where $\mathbf{c}$ is the vector representing the function center and $a$ and $\sigma$ are parameters affecting the spread of the radius.
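A sketch of the Gaussian and multiquadratic radial activations, assuming NumPy; here c is the center and sigma and a control the spread, matching the definitions above.

```python
import numpy as np

def gaussian_rbf(v, c, sigma):
    r2 = np.sum((v - c) ** 2)          # squared distance to the center
    return np.exp(-r2 / (2.0 * sigma ** 2))

def multiquadratic(v, c, a):
    r2 = np.sum((v - c) ** 2)
    return np.sqrt(r2 + a ** 2)

v = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])               # function center
print(gaussian_rbf(v, c, sigma=1.0))   # 1 at the center, decays away from it
print(multiquadratic(v, c, a=1.0))
```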
Other examples
Periodic functions can serve as activation functions. Usually the sinusoid is used, as any periodic function is decomposable into sinusoids by the Fourier transform.[10]
Quadratic activation maps $x \mapsto x^2$.[11][12]
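As a minimal sketch (NumPy, illustrative names), both choices are simple element-wise maps:

```python
import numpy as np

def sinusoid_activation(x):
    return np.sin(x)   # periodic activation

def quadratic_activation(x):
    return x ** 2      # maps x to x^2

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(sinusoid_activation(z))
print(quadratic_activation(z))
```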
Folding activation functions
Folding activation functions are extensively used in the pooling layers of convolutional neural networks, and in the output layers of multiclass classification networks. These activations perform aggregation over the inputs, such as taking the mean, minimum or maximum. In multiclass classification the softmax activation is often used.
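A sketch (assuming NumPy; the helper names are illustrative) of two common folding operations, max pooling over non-overlapping windows and the softmax used in multiclass output layers:

```python
import numpy as np

def max_pool_1d(x, window):
    # Fold each non-overlapping window of inputs into its maximum.
    return x.reshape(-1, window).max(axis=1)

def softmax(x):
    # Shift by the maximum for numerical stability; the result sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.2, 1.5, -0.3, 0.9, 2.0, 0.1])
print(max_pool_1d(x, window=2))            # [1.5, 0.9, 2.0]
print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities summing to 1
```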
Table of activation functions
The following table compares the properties of several activation functions that are functions of one fold x from the previous layer or layers:
Name | Function, $f(x)$ | Derivative of $f$, $f'(x)$ | Range | Order of continuity |
---|---|---|---|---|
Identity | $x$ | $1$ | $(-\infty,\infty)$ | $C^\infty$ |
Binary step | $0$ if $x<0$; $1$ if $x\ge 0$ | $0$ for $x\neq 0$ (undefined at $x=0$) | $\{0,1\}$ | $C^{-1}$ |
Logistic, sigmoid, or soft step | $\sigma(x)=\dfrac{1}{1+e^{-x}}$ | $f(x)\left(1-f(x)\right)$ | $(0,1)$ | $C^\infty$ |
Hyperbolic tangent (tanh) | $\tanh(x)=\dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$ | $1-f(x)^{2}$ | $(-1,1)$ | $C^\infty$ |
Soboleva modified hyperbolic tangent (smht) | $\dfrac{e^{ax}-e^{-bx}}{e^{cx}+e^{-dx}}$ | | | $C^\infty$ |
Rectified linear unit (ReLU)[13] | $\max(0,x)$ | $0$ if $x<0$; $1$ if $x>0$ | $[0,\infty)$ | $C^{0}$ |
Gaussian Error Linear Unit (GELU)[5] | $x\,\Phi(x)=\dfrac{x}{2}\left(1+\operatorname{erf}\left(\dfrac{x}{\sqrt{2}}\right)\right)$, where $\operatorname{erf}$ is the Gaussian error function | $\Phi(x)+x\,\phi(x)$, where $\phi$ is the probability density function of the standard Gaussian distribution | $(-0.17\ldots,\infty)$ | $C^\infty$ |
Softplus[14] | $\ln\left(1+e^{x}\right)$ | $\dfrac{1}{1+e^{-x}}$ | $(0,\infty)$ | $C^\infty$ |
Exponential linear unit (ELU)[15] | $\alpha\left(e^{x}-1\right)$ if $x\le 0$; $x$ if $x>0$ | $\alpha e^{x}$ if $x<0$; $1$ if $x>0$ | $(-\alpha,\infty)$ | $C^{1}$ if $\alpha=1$, otherwise $C^{0}$ |
Scaled exponential linear unit (SELU)[16] | $\lambda\alpha\left(e^{x}-1\right)$ if $x<0$; $\lambda x$ if $x\ge 0$, with $\lambda\approx 1.0507$ and $\alpha\approx 1.67326$ | $\lambda\alpha e^{x}$ if $x<0$; $\lambda$ if $x>0$ | $(-\lambda\alpha,\infty)$ | $C^{0}$ |
Leaky rectified linear unit (Leaky ReLU)[17] | $0.01x$ if $x<0$; $x$ if $x\ge 0$ | $0.01$ if $x<0$; $1$ if $x>0$ | $(-\infty,\infty)$ | $C^{0}$ |
Parametric rectified linear unit (PReLU)[18] | $\alpha x$ if $x<0$; $x$ if $x\ge 0$ | $\alpha$ if $x<0$; $1$ if $x>0$ | $(-\infty,\infty)$ | $C^{0}$ |
Rectified Parametric Sigmoid Units (flexible, 5 parameters) | see the parameterization in [19] | | | |
Sigmoid linear unit (SiLU,[5] Sigmoid shrinkage,[20] SiL,[21] or Swish-1[22]) | $\dfrac{x}{1+e^{-x}}=x\,\sigma(x)$ | $\sigma(x)\left(1+x\left(1-\sigma(x)\right)\right)$ | $[-0.278\ldots,\infty)$ | $C^\infty$ |
Exponential Linear Sigmoid SquasHing (ELiSH)[23] | $\dfrac{e^{x}-1}{1+e^{-x}}$ if $x<0$; $\dfrac{x}{1+e^{-x}}$ if $x\ge 0$ | | | $C^{1}$ |
Gaussian | $e^{-x^{2}}$ | $-2x\,e^{-x^{2}}$ | $(0,1]$ | $C^\infty$ |
Sinusoid | $\sin(x)$ | $\cos(x)$ | $[-1,1]$ | $C^\infty$ |
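The function/derivative pairs in the table can be checked numerically. As one example, this sketch (NumPy, illustrative names) compares the closed-form SiLU derivative $\sigma(x)\left(1+x\left(1-\sigma(x)\right)\right)$ with a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_grad(x):
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))   # closed-form derivative

x = np.linspace(-4.0, 4.0, 9)
h = 1e-5
numeric = (silu(x + h) - silu(x - h)) / (2.0 * h)   # central difference
print(np.max(np.abs(numeric - silu_grad(x))))       # tiny, the forms agree
```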
The following table lists activation functions that are not functions of a single fold x from the previous layer or layers:
Name | Equation, $f_i\left(\vec{x}\right)$ | Derivatives, $\dfrac{\partial f_i\left(\vec{x}\right)}{\partial x_j}$ | Range | Order of continuity |
---|---|---|---|---|
Softmax | $\dfrac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}$ for $i = 1, \ldots, J$ | $f_i\left(\vec{x}\right)\left(\delta_{ij}-f_j\left(\vec{x}\right)\right)$[1][2] | $(0,1)$ | $C^\infty$ |
Maxout[24] | $\max_{i} x_i$ | $1$ if $j=\operatorname{argmax}_i x_i$; $0$ otherwise | $(-\infty,\infty)$ | $C^{0}$ |
- ^ Here, $\delta_{ij}$ is the Kronecker delta.
- ^ For instance, $j$ could be iterating through the number of kernels of the previous neural network layer while $i$ iterates through the number of kernels of the current layer.
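The softmax derivative above can likewise be verified numerically. This sketch (NumPy, illustrative names) builds the Jacobian $\partial f_i/\partial x_j = f_i\left(\delta_{ij}-f_j\right)$ and compares one column with a finite-difference estimate:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    f = softmax(x)
    # J[i, j] = f_i * (delta_ij - f_j), with delta_ij the Kronecker delta
    return np.diag(f) - np.outer(f, f)

x = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(x)

# Finite-difference check of column j
h, j = 1e-6, 1
e_j = np.eye(len(x))[j]
col = (softmax(x + h * e_j) - softmax(x - h * e_j)) / (2 * h)
print(np.max(np.abs(col - J[:, j])))   # small, confirming the formula
```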
Quantum activation functions
In quantum neural networks programmed on gate-model quantum computers, based on quantum perceptrons instead of variational quantum circuits, the non-linearity of the activation function can be implemented without the need to measure the output of each perceptron at each layer, which preserves quantum properties such as superposition.
See also
References
- ^ Hinkelmann, Knut. "Neural Networks, p. 7" (PDF). University of Applied Sciences Northwestern Switzerland. Archived from the original (PDF) on 2018-10-06. Retrieved 2018-10-06.
- S2CID 206485943.
- ISSN 0001-0782.
- .
- ^ arXiv:1606.08415 [cs.LG].
- S2CID 3958369.
- ISBN 978-0-387-24348-1.
- S2CID 195908774.
- PMID 12991237.
- arXiv:2006.09661.
- ISBN 978-3-540-49430-0, retrieved 2024-10-05
- arXiv:1803.01206.
- ISBN 9781605589077
- ^ Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011). "Deep sparse rectifier neural networks" (PDF). International Conference on Artificial Intelligence and Statistics.
- arXiv:1511.07289 [cs.LG].
- arXiv:1706.02515.
- S2CID 16489696.
- arXiv:1502.01852 [cs.CV].
- ^ Atto, Abdourrahmane M.; Galichet, Sylvie; Pastor, Dominique; Méger, Nicolas (2023), "On joint parameterizations of linear and nonlinear functionals in neural networks", Elsevier Pattern Recognition, vol. 160, pp. 12–21, PMID 36592526
- ^ Atto, Abdourrahmane M.; Pastor, Dominique; Mercier, Grégoire (2008), "Smooth sigmoid wavelet shrinkage for non-parametric estimation" (PDF), S2CID 9959057
- S2CID 6940861.
- arXiv:1710.05941 [cs.NE].
- arXiv:1808.00783
- arXiv:1302.4389.
- ISSN 1570-0755.
Further reading
- Kunc, Vladimír; Kléma, Jiří (2024-02-14), Three Decades of Activations: A Comprehensive Survey of 400 Activation Functions for Neural Networks, arXiv:2402.09092
- Nwankpa, Chigozie; Ijomah, Winifred; Gachagan, Anthony; Marshall, Stephen (2018-11-08). "Activation Functions: Comparison of trends in Practice and Research for Deep Learning". arXiv:1811.03378 [cs.LG].
- Dubey, Shiv Ram; Singh, Satish Kumar; Chaudhuri, Bidyut Baran (2022). "Activation functions in deep learning: A comprehensive survey and benchmark". Neurocomputing. 503. Elsevier BV: 92–108. ISSN 0925-2312.