Activation function

The activation function of a node in an artificial neural network is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear.^[1]
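As a minimal sketch of this definition, the following Python computes a node's output by applying an activation to the weighted sum of its inputs. The names `node_output`, `weights`, and `bias` are illustrative assumptions rather than anything from the source, and tanh stands in for an arbitrary nonlinear activation.

```python
import math

def node_output(inputs, weights, bias, activation):
    """Apply `activation` to the weighted sum of `inputs` plus `bias`."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)

def identity(z):
    """Linear (identity) activation, used here only for comparison."""
    return z

x = [0.5, -1.0]   # hypothetical inputs
w = [2.0, 1.0]    # hypothetical weights
b = 0.25          # hypothetical bias (offset)

linear_out = node_output(x, w, b, identity)       # the weighted sum itself
nonlinear_out = node_output(x, w, b, math.tanh)   # tanh of the weighted sum
```

With the identity activation the node stays an affine function of its inputs, so stacking such nodes yields another affine function. This is one way to see why the nonlinearity matters.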

Modern activation functions include the logistic (sigmoid) function and the GELU, which was used in the 2018 BERT model.^[5]

Comparison of activation functions

Aside from their empirical performance, activation functions also have different mathematical properties:

Nonlinear

These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in variational autoencoders.
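The softplus remark can be illustrated with a short sketch. It assumes a network that emits unconstrained real numbers (`raw_outputs`, a hypothetical name) which must be mapped to positive variances. The rearranged formula is a common numerically stable form of $\ln(1+e^{x})$, not something prescribed by the source.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    # ln(1 + e^x) = max(x, 0) + ln(1 + e^{-|x|}), which avoids overflow for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

raw_outputs = [-5.0, 0.0, 5.0]                    # hypothetical unconstrained outputs
variances = [softplus(r) for r in raw_outputs]    # each value is positive
```

In exact arithmetic the result is strictly positive for every real input. In floating point, extremely negative inputs can still round to zero, so a small floor is sometimes added in practice.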

Mathematical details

The most common activation functions can be divided into three categories: ridge functions, radial functions, and fold functions.

An activation function $f$ is saturating if $\lim _{|v|\to \infty }|\nabla f(v)|=0$ . It is nonsaturating if $\lim _{|v|\to \infty }|\nabla f(v)|\neq 0$ . Nonsaturating activation functions, such as ReLU, may be better than saturating activation functions, because they are less likely to suffer from the vanishing gradient problem.^[8]
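The difference can be seen numerically. The sketch below compares the gradient of the logistic function, a saturating activation, with that of ReLU at increasingly large inputs. The function names and sample points are illustrative assumptions.

```python
import math

def sigmoid_grad(v):
    """Derivative of the logistic function, s(v) * (1 - s(v))."""
    s = 1.0 / (1.0 + math.exp(-v))
    return s * (1.0 - s)

def relu_grad(v):
    """Derivative of ReLU away from 0: 1 for positive inputs, 0 for negative ones."""
    return 1.0 if v > 0 else 0.0

for v in (1.0, 10.0, 30.0):
    # The logistic gradient shrinks toward 0; the ReLU gradient stays at 1.
    print(v, sigmoid_grad(v), relu_grad(v))
```

The ReLU gradient is still 0 for negative inputs, so the advantage applies to the positive side of its domain.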

Ridge activation functions

Ridge functions are multivariate functions acting on a linear combination of the input variables. Commonly used examples include:

Linear activation: $\phi (\mathbf {v} )=a+\mathbf {v} '\mathbf {b}$ ,
ReLU activation: $\phi (\mathbf {v} )=\max(0,a+\mathbf {v} '\mathbf {b} )$ ,
Heaviside activation: $\phi (\mathbf {v} )=1_{a+\mathbf {v} '\mathbf {b} >0}$ ,
Logistic activation: $\phi (\mathbf {v} )=(1+\exp(-a-\mathbf {v} '\mathbf {b} ))^{-1}$ .
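The four examples share one structure: a scalar function applied to $a+\mathbf {v} '\mathbf {b}$. A minimal sketch under that reading follows. The helper name `ridge` and the sample values of `a`, `b`, and `v` are assumptions made for illustration.

```python
import math

def ridge(activation, a, b):
    """Return phi(v) = activation(a + v'b) for offset a and weight vector b."""
    def phi(v):
        return activation(a + sum(vi * bi for vi, bi in zip(v, b)))
    return phi

a, b = 0.5, [1.0, -2.0]   # hypothetical offset and weights

linear = ridge(lambda z: z, a, b)
relu = ridge(lambda z: max(0.0, z), a, b)
heaviside = ridge(lambda z: 1.0 if z > 0 else 0.0, a, b)
logistic = ridge(lambda z: 1.0 / (1.0 + math.exp(-z)), a, b)

v = [1.0, 1.0]   # a + v'b = 0.5 + 1.0 - 2.0 = -0.5
```

All four functions see the input only through the single number $a+\mathbf {v} '\mathbf {b}$, which is what makes them ridge functions.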

In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell.^[9] In its simplest form, this function is binary: either the neuron is firing or not. Neurons also cannot fire faster than a certain rate, motivating sigmoid activation functions whose range is a finite interval.

In the binary case, the function takes the form $\phi (\mathbf {v} )=U(a+\mathbf {v} '\mathbf {b} )$ , where $U$ is the Heaviside step function.

A linear function with a positive slope, on the other hand, may reflect the increase in firing rate that occurs as input current increases. Such a function would be of the form $\phi (\mathbf {v} )=a+\mathbf {v} '\mathbf {b}$ .

Radial activation functions

A special class of activation functions known as radial basis functions (RBFs) is used in RBF networks. These activation functions can take many forms, but they are usually one of the following:

Gaussian: $\,\phi (\mathbf {v} )=\exp \left(-{\frac {\|\mathbf {v} -\mathbf {c} \|^{2}}{2\sigma ^{2}}}\right)$
Multiquadratics: $\,\phi (\mathbf {v} )={\sqrt {\|\mathbf {v} -\mathbf {c} \|^{2}+a^{2}}}$
Inverse multiquadratics: $\,\phi (\mathbf {v} )=\left(\|\mathbf {v} -\mathbf {c} \|^{2}+a^{2}\right)^{-{\frac {1}{2}}}$
Polyharmonic splines

where $\mathbf {c}$ is the vector representing the function center and $a$ and $\sigma$ are parameters affecting the spread of the radius.
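The closed forms above translate directly into code. The sketch below implements the Gaussian, multiquadric, and inverse multiquadric cases. The function names and the sample center and input are assumptions chosen for illustration.

```python
import math

def sq_dist(v, c):
    """Squared Euclidean distance ||v - c||^2."""
    return sum((vi - ci) ** 2 for vi, ci in zip(v, c))

def gaussian_rbf(v, c, sigma):
    return math.exp(-sq_dist(v, c) / (2.0 * sigma ** 2))

def multiquadric(v, c, a):
    return math.sqrt(sq_dist(v, c) + a ** 2)

def inverse_multiquadric(v, c, a):
    return 1.0 / math.sqrt(sq_dist(v, c) + a ** 2)

c = [0.0, 0.0]   # hypothetical center
v = [3.0, 4.0]   # squared distance to c is 25
```

Each function depends on the input only through its distance to the center, so all inputs at the same distance from $\mathbf {c}$ give the same output.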

Other examples

Periodic functions can serve as activation functions. Usually the sinusoid is used, as any periodic function is decomposable into sinusoids by the Fourier transform.^[10]

Quadratic activation maps $x\mapsto x^{2}$ .^[11]^[12]

Folding activation functions

Folding activation functions are extensively used in the

maximum. In multiclass classification the softmax activation is often used.
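Both cases mentioned here act on a whole group of inputs instead of a single value. A minimal sketch, with `scores` as a hypothetical input vector: the maximum folds the group into one number, while softmax maps it to a vector of positive values that sum to one. Subtracting the largest score before exponentiating is a common stability trick that leaves the result unchanged.

```python
import math

def softmax(xs):
    """Softmax of a list of real numbers."""
    m = max(xs)                       # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # hypothetical inputs from the previous layer
pooled = max(scores)       # fold by taking the maximum
probs = softmax(scores)    # positive values that sum to 1
```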

Table of activation functions

The following table compares the properties of several activation functions that are functions of one fold $x$ from the previous layer or layers:

Name	Plot	Function, $g(x)$	Derivative of $g$ , $g'(x)$	Range	Order of continuity
Identity		$x$	$1$	$(-\infty ,\infty )$	$C^{\infty }$
Binary step		${\begin{cases}0&{\text{if }}x<0\\1&{\text{if }}x\geq 0\end{cases}}$	$0$	$\{0,1\}$	$C^{-1}$
Logistic, sigmoid, or soft step		$\sigma (x)\doteq {\frac {1}{1+e^{-x}}}$	$g(x)(1-g(x))$	$(0,1)$	$C^{\infty }$
Hyperbolic tangent ( tanh )		$\tanh(x)\doteq {\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}}$	$1-g(x)^{2}$	$(-1,1)$	$C^{\infty }$
Soboleva modified hyperbolic tangent (smht)		$\operatorname {smht} (x)\doteq {\frac {e^{ax}-e^{-bx}}{e^{cx}+e^{-dx}}}$		$(-1,1)$	$C^{\infty }$
Rectified linear unit (ReLU)^[13]		${\begin{aligned}(x)^{+}\doteq {}&{\begin{cases}0&{\text{if }}x\leq 0\\x&{\text{if }}x>0\end{cases}}\\={}&\max(0,x)=x{\textbf {1}}_{x>0}\end{aligned}}$	${\begin{cases}0&{\text{if }}x<0\\1&{\text{if }}x>0\end{cases}}$	$[0,\infty )$	$C^{0}$
Gaussian Error Linear Unit (GELU)^[5]		${\begin{aligned}&{\frac {1}{2}}x\left(1+{\text{erf}}\left({\frac {x}{\sqrt {2}}}\right)\right)\\{}={}&x\Phi (x)\end{aligned}}$ where $\mathrm {erf}$ is the Gaussian error function.	$\Phi (x)+x\phi (x)$ where $\phi (x)={\frac {1}{\sqrt {2\pi }}}e^{-{\frac {1}{2}}x^{2}}$ is the probability density function of the standard Gaussian distribution.	$(-0.17\ldots ,\infty )$	$C^{\infty }$
Softplus^[14]		$\ln \left(1+e^{x}\right)$	${\frac {1}{1+e^{-x}}}$	$(0,\infty )$	$C^{\infty }$
Exponential linear unit (ELU)^[15]		${\begin{cases}\alpha \left(e^{x}-1\right)&{\text{if }}x\leq 0\\x&{\text{if }}x>0\end{cases}}$ with parameter $\alpha$	${\begin{cases}\alpha e^{x}&{\text{if }}x<0\\1&{\text{if }}x>0\end{cases}}$	$(-\alpha ,\infty )$	${\begin{cases}C^{1}&{\text{if }}\alpha =1\\C^{0}&{\text{otherwise}}\end{cases}}$
Scaled exponential linear unit (SELU)^[16]		$\lambda {\begin{cases}\alpha (e^{x}-1)&{\text{if }}x<0\\x&{\text{if }}x\geq 0\end{cases}}$ with parameters $\lambda =1.0507$ and $\alpha =1.67326$	$\lambda {\begin{cases}\alpha e^{x}&{\text{if }}x<0\\1&{\text{if }}x\geq 0\end{cases}}$	$(-\lambda \alpha ,\infty )$	$C^{0}$
Leaky rectified linear unit (Leaky ReLU)^[17]		${\begin{cases}0.01x&{\text{if }}x\leq 0\\x&{\text{if }}x>0\end{cases}}$	${\begin{cases}0.01&{\text{if }}x<0\\1&{\text{if }}x>0\end{cases}}$	$(-\infty ,\infty )$	$C^{0}$
Parametric rectified linear unit (PReLU)^[18]		${\begin{cases}\alpha x&{\text{if }}x<0\\x&{\text{if }}x\geq 0\end{cases}}$ with parameter $\alpha$	${\begin{cases}\alpha &{\text{if }}x<0\\1&{\text{if }}x\geq 0\end{cases}}$	$(-\infty ,\infty )$	$C^{0}$
Rectified Parametric Sigmoid Units (flexible, 5 parameters)		$\alpha (2x{1}_{\{x\geqslant \lambda \}}-g_{\lambda ,\sigma ,\mu ,\beta }(x))+(1-\alpha )g_{\lambda ,\sigma ,\mu ,\beta }(x)$ where $g_{\lambda ,\sigma ,\mu ,\beta }(x)={\frac {(x-\lambda ){1}_{\{x\geqslant \lambda \}}}{1+e^{-\operatorname {sgn}(x-\mu )\left({\frac {\vert x-\mu \vert }{\sigma }}\right)^{\beta }}}}$ ^[19]	$-$	$(-\infty ,+\infty )$	$C^{0}$
Sigmoid linear unit (SiLU,^[5] Sigmoid shrinkage,^[20] SiL,^[21] or Swish-1^[22])		${\frac {x}{1+e^{-x}}}$	${\frac {1+e^{-x}+xe^{-x}}{\left(1+e^{-x}\right)^{2}}}$	$[-0.278\ldots ,\infty )$	$C^{\infty }$
Exponential Linear Sigmoid SquasHing (ELiSH)^[23]	An image of the ELiSH activation function plotted over the range [-3, 3] with a minimum value of ~-0.172 at x ~= -0.881.	${\begin{cases}{\frac {e^{x}-1}{1+e^{-x}}}&{\text{if }}x<0\\{\frac {x}{1+e^{-x}}}&{\text{if }}x\geq 0\end{cases}}$	${\begin{cases}{\frac {2e^{2x}+e^{3x}-e^{x}}{e^{2x}+2e^{x}+1}}&{\text{if }}x<0\\{\frac {xe^{x}+e^{2x}+e^{x}}{e^{2x}+2e^{x}+1}}&{\text{if }}x\geq 0\end{cases}}$	$[-0.172\ldots ,\infty )$	$C^{1}$
Gaussian		$e^{-x^{2}}$	$-2xe^{-x^{2}}$	$(0,1]$	$C^{\infty }$
Sinusoid		$\sin x$	$\cos x$	$[-1,1]$	$C^{\infty }$
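Closed-form derivatives and range bounds for functions like these can be sanity-checked numerically. The sketch below does so for GELU, SiLU, and ELiSH, using a central finite difference and a simple grid search. The helper names, step size, and grid are assumptions chosen for illustration. The GELU derivative used is $\Phi (x)+x\phi (x)$, obtained from the product rule applied to $x\Phi (x)$.

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * std_normal_cdf(x)

def gelu_grad(x):
    # Product rule on x * Phi(x): Phi(x) + x * phi(x)
    return std_normal_cdf(x) + x * std_normal_pdf(x)

def silu(x):
    return x / (1.0 + math.exp(-x))

def elish(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return (math.exp(x) - 1.0) * s if x < 0 else x * s

def numeric_grad(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def grid_min(f, lo=-6.0, hi=0.0, steps=60001):
    """Smallest value of f on an evenly spaced grid over [lo, hi]."""
    return min(f(lo + (hi - lo) * i / (steps - 1)) for i in range(steps))
```

On this grid the minima come out near -0.170 for GELU, -0.278 for SiLU, and -0.172 for ELiSH, the last one attained around x = -0.881.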

The following table lists activation functions that are not functions of a single fold $x$ from the previous layer or layers:

Name	Equation, $g_{i}\left({\vec {x}}\right)$	Derivatives, ${\frac {\partial g_{i}\left({\vec {x}}\right)}{\partial x_{j}}}$	Range	Order of continuity
Softmax	${\frac {e^{x_{i}}}{\sum _{j=1}^{J}e^{x_{j}}}}$ for $i$ = 1, …, $J$	$g_{i}\left({\vec {x}}\right)\left(\delta _{ij}-g_{j}\left({\vec {x}}\right)\right)$ ^[1]^[2]	$(0,1)$	$C^{\infty }$
Maxout^[24]	$\max _{i}x_{i}$	${\begin{cases}1&{\text{if }}j={\underset {i}{\operatorname {argmax} }}\,x_{i}\\0&{\text{if }}j\neq {\underset {i}{\operatorname {argmax} }}\,x_{i}\end{cases}}$	$(-\infty ,\infty )$	$C^{0}$

^ Here, $\delta _{ij}$ is the Kronecker delta.

^ For instance, $j$ could be iterating through the number of kernels of the previous neural network layer while $i$ iterates through the number of kernels of the current layer.
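The softmax derivative $g_{i}(\delta _{ij}-g_{j})$ can be written out as a Jacobian matrix and compared with finite differences. This is only a sketch: the helper names, the test point `x`, and the step size are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)                       # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """J[i][j] = g_i * (delta_ij - g_j), with delta_ij the Kronecker delta."""
    g = softmax(xs)
    n = len(g)
    return [[g[i] * ((1.0 if i == j else 0.0) - g[j]) for j in range(n)]
            for i in range(n)]

def numeric_jacobian(f, xs, h=1e-6):
    """Central finite-difference estimate of d f_i / d x_j."""
    n = len(xs)
    cols = []
    for j in range(n):
        up = list(xs)
        up[j] += h
        dn = list(xs)
        dn[j] -= h
        fu, fd = f(up), f(dn)
        cols.append([(fu[i] - fd[i]) / (2.0 * h) for i in range(n)])
    # cols[j][i] holds d f_i / d x_j; transpose into J[i][j]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

x = [1.0, 2.0, 0.5]                   # hypothetical input vector
J = softmax_jacobian(x)
J_num = numeric_jacobian(softmax, x)
```

Each row of `J` sums to zero, reflecting the fact that the softmax outputs always sum to one.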

Quantum activation functions

In quantum computers, based on quantum perceptrons instead of variational quantum circuits, the non-linearity of the activation function can be implemented without measuring the output of each perceptron at each layer. The quantum properties loaded within the circuit, such as superposition, can be preserved by creating the Taylor series of the argument computed by the perceptron itself, with suitable quantum circuits computing the powers up to a desired degree of approximation. Because of the flexibility of such quantum circuits, they can be designed to approximate any arbitrary classical activation function.^[25]
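The role of the truncation degree can be illustrated classically. The sketch below is not a quantum circuit and does not reproduce the method of the cited work. It only shows, as an assumption-laden analogy, how a truncated Taylor (Maclaurin) series in powers of the argument approximates tanh more closely as the degree grows.

```python
import math

# Maclaurin coefficients of tanh(z) up to degree 7:
# z - z^3/3 + 2z^5/15 - 17z^7/315
TANH_COEFFS = {1: 1.0, 3: -1.0 / 3.0, 5: 2.0 / 15.0, 7: -17.0 / 315.0}

def tanh_taylor(z, degree):
    """Truncated Maclaurin series of tanh, keeping powers up to `degree`."""
    return sum(c * z ** k for k, c in TANH_COEFFS.items() if k <= degree)

z = 0.3   # hypothetical pre-activation value, small enough for fast convergence
errors = {d: abs(tanh_taylor(z, d) - math.tanh(z)) for d in (1, 3, 7)}
```

The series converges only for $|z|<\pi /2$, so a polynomial of this kind is useful only when the argument stays in a bounded range.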

References

[1] Hinkelmann, Knut. "Neural Networks, p. 7" (PDF). University of Applied Sciences Northwestern Switzerland. Archived from the original (PDF) on 2018-10-06. Retrieved 2018-10-06.
[2] S2CID 206485943.
[3] ISSN 0001-0782.
[4] doi:10.22266/ijies2019.0630.19.
[5] arXiv:1606.08415 [cs.LG].
[6] S2CID 3958369.
[7] ISBN 978-0-387-24348-1.
[8] S2CID 195908774.
[9] PMID 12991237.
[10] arXiv:2006.09661.
[11] ISBN 978-3-540-49430-0, retrieved 2024-10-05.
[12] arXiv:1803.01206.
[13] ISBN 9781605589077.
[14] Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011). "Deep sparse rectifier neural networks" (PDF). International Conference on Artificial Intelligence and Statistics.
[15] arXiv:1511.07289 [cs.LG].
[16] arXiv:1706.02515.
[17] S2CID 16489696.
[18] arXiv:1502.01852 [cs.CV].
[19] Atto, Abdourrahmane M.; Galichet, Sylvie; Pastor, Dominique; Méger, Nicolas (2023), "On joint parameterizations of linear and nonlinear functionals in neural networks", Elsevier Pattern Recognition, vol. 160, pp. 12–21, PMID 36592526.
[20] Atto, Abdourrahmane M.; Pastor, Dominique; Mercier, Grégoire (2008), "Smooth sigmoid wavelet shrinkage for non-parametric estimation" (PDF), S2CID 9959057.
[21] S2CID 6940861.
[22] arXiv:1710.05941 [cs.NE].
[23] arXiv:1808.00783.
[24] arXiv:1302.4389.
[25] ISSN 1570-0755.

Further reading

Kunc, Vladimír; Kléma, Jiří (2024-02-14), Three Decades of Activations: A Comprehensive Survey of 400 Activation Functions for Neural Networks, arXiv, doi:10.48550/arXiv.2402.09092, arXiv:2402.09092.

Nwankpa, Chigozie; Ijomah, Winifred; Gachagan, Anthony; Marshall, Stephen (2018-11-08). "Activation Functions: Comparison of trends in Practice and Research for Deep Learning". arXiv:1811.03378 [cs.LG].

Dubey, Shiv Ram; Singh, Satish Kumar; Chaudhuri, Bidyut Baran (2022). "Activation functions in deep learning: A comprehensive survey and benchmark". Neurocomputing. 503. Elsevier BV: 92–108. ISSN 0925-2312.


Retrieved from "https://en.wikipedia.org/w/index.php?title=Activation_function&oldid=1276830609"
