Softmax function
The softmax function, also known as softargmax or the normalized exponential function, converts a vector of K real numbers into a probability distribution over K possible outcomes. It is a generalization of the logistic function to multiple dimensions.
Definition
The softmax function takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.
Formally, the standard (unit) softmax function σ : ℝ^K → (0, 1)^K, where K ≥ 1, is defined by

σ(z)_i = exp(z_i) / Σ_{j=1}^K exp(z_j)    for i = 1, …, K and z = (z_1, …, z_K) ∈ ℝ^K.
In words, the softmax applies the standard exponential function to each element z_i of the input vector z (consisting of K real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector σ(z) is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of (1, 2, 8) is approximately (0.001, 0.002, 0.997), which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).
In general, instead of e a different base b > 0 can be used. If 0 < b < 1, smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Conversely, as above, if b > 1 larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Writing b = e^β or b = e^{−β} (for real β)[b] yields the expressions:[c]

σ(z)_i = exp(β z_i) / Σ_{j=1}^K exp(β z_j)    or    σ(z)_i = exp(−β z_i) / Σ_{j=1}^K exp(−β z_j).
The reciprocal of β is sometimes referred to as the temperature, T = 1/β. A higher temperature results in a more uniform output distribution (i.e. with higher entropy, "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
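The effect of the temperature can be illustrated numerically; the following is a minimal NumPy sketch (the helper name `softmax_temp` is illustrative, not a standard API):

```python
import numpy as np

def softmax_temp(z, temperature=1.0):
    """Softmax with temperature T = 1/beta (illustrative helper)."""
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = [1.0, 2.0, 3.0]
print(softmax_temp(z, temperature=10.0))  # high T: close to uniform
print(softmax_temp(z, temperature=0.1))   # low T: largest input dominates
```

At temperature 10 the three probabilities are all close to 1/3; at temperature 0.1 the last component exceeds 0.999.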
In some fields, the base is fixed, corresponding to a fixed scale,[d] while in others the parameter β is varied.
Interpretations
Smooth arg max
The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a vector's largest element. The name "softmax" may therefore be misleading; it is not a smooth maximum (a smooth approximation to the maximum function) but a smooth approximation to the arg max. This section uses the term "softargmax" for clarity.
Formally, instead of considering the arg max as a function with categorical output (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg):

arg max(z_1, …, z_n) = (y_1, …, y_n) = (0, …, 0, 1, 0, …, 0),

where the output coordinate y_i = 1 if and only if i is the arg max of (z_1, …, z_n), meaning z_i is the unique maximum value of (z_1, …, z_n). For example, in this encoding arg max(1, 5, 10) = (0, 0, 1), since the third argument is the maximum.
This can be generalized to multiple arg max values (multiple equal z_i being the maximum) by dividing the 1 between all max args; formally, each such coordinate is 1/k, where k is the number of arguments attaining the maximum. For example, arg max(1, 5, 5) = (0, 1/2, 1/2), since the second and third arguments are both the maximum. In case all arguments are equal, this is simply (1/n, …, 1/n). Points z with multiple arg max values are singular points: these are the points where arg max is discontinuous.
With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as β → ∞, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input z, softargmax_β(z) → arg max(z) as β → ∞. However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, softargmax_β(1, 1.0001) → (0, 1), but softargmax_β(1, 0.9999) → (1, 0), and softargmax_β(1, 1) = (1/2, 1/2) for all β: the closer the points are to the singular set (x, x), the slower they converge. However, softargmax does converge compactly on the non-singular set.
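The pointwise convergence to arg max can be observed numerically; a minimal sketch, assuming NumPy (the function name `softargmax` follows this section's terminology):

```python
import numpy as np

def softargmax(z, beta=1.0):
    """Softargmax with inverse temperature beta; as beta grows, the output
    approaches the one-hot encoding of the arg max."""
    z = beta * np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # shift by the max so exponents stay bounded
    return e / e.sum()

z = [1.0, 1.2, 0.9]
for beta in (1, 10, 100):
    print(beta, softargmax(z, beta))
# The output drifts toward (0, 1, 0); the closer the top two inputs are,
# the larger beta must be for the approximation to become sharp.
```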
Conversely, as β → −∞, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring).
It is also the case that, for any fixed β, if one input z_i is much larger than the others relative to the temperature T = 1/β, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

softargmax(0, 10) = (1/(1 + e^{10}), e^{10}/(1 + e^{10})) ≈ (0.00005, 0.99995).
Probability theory
In probability theory, the output of the softargmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.
Statistical mechanics
In statistical mechanics, the softargmax function is known as the Boltzmann distribution (or Gibbs distribution): the index set {1, …, K} are the microstates of the system; the inputs z_i are the energies of that state; the denominator is known as the partition function, often denoted by Z; and the factor β is called the coldness (or thermodynamic beta, or inverse temperature).
Applications
The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression), multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks. Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and weight vectors w_1, …, w_K is:

P(y = j | x) = exp(xᵀw_j) / Σ_{k=1}^K exp(xᵀw_k).
This can be seen as the composition of K linear functions x ↦ xᵀw_1, …, x ↦ xᵀw_K and the softmax function (where xᵀw denotes the inner product of x and w). The operation is equivalent to applying a linear operator defined by w to vectors x, thus transforming the original, possibly high-dimensional, input to vectors in a K-dimensional space ℝ^K.
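As a sketch of this composition (the weights and input below are random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)       # a 5-dimensional sample vector
W = rng.normal(size=(3, 5))  # one weight vector per class, K = 3

scores = W @ x                     # K distinct linear functions of x
e = np.exp(scores - scores.max())  # softmax of the scores (stabilized)
probs = e / e.sum()
print(probs)  # predicted probabilities P(y = j | x); they sum to 1
```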
Neural networks
The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

∂ softmax(q, i) / ∂ q_k = softmax(q, i) (δ_{ik} − softmax(q, k)).
This expression is symmetrical in the indexes i, k and thus may also be expressed as

∂ softmax(q, i) / ∂ q_k = softmax(q, k) (δ_{ik} − softmax(q, i)).
Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
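The derivative formula can be checked against a finite-difference approximation; a minimal NumPy sketch (helper names are illustrative):

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = softmax(q)[i] * (delta_{ik} - softmax(q)[k])."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(q)

# Central finite differences in each coordinate direction.
eps = 1e-6
J_num = np.empty((3, 3))
for k in range(3):
    dq = np.zeros(3)
    dq[k] = eps
    J_num[:, k] = (softmax(q + dq) - softmax(q - dq)) / (2 * eps)

print(np.max(np.abs(J - J_num)))  # agreement up to finite-difference error
```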
To ensure stable numerical computations, it is common to subtract the maximum value from the input vector. While this does not alter the output or the derivative theoretically, it enhances stability by directly controlling the largest exponent computed.
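A minimal sketch of the max-subtraction trick, assuming NumPy (helper names are illustrative):

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # overflows once any z_i exceeds ~709 in float64
    return e / e.sum()

def softmax_stable(z):
    # Subtracting the maximum leaves the output unchanged (translation
    # invariance) but keeps every exponent <= 0, so exp cannot overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over='ignore', invalid='ignore'):
    print(softmax_naive(z))  # [nan nan nan]: inf / inf
print(softmax_stable(z))     # well-defined probabilities
```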
If the function is scaled with the parameter β, then these expressions must be multiplied by β.
See multinomial logit for a probability model which uses the softmax activation function.
Reinforcement learning
In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[8]

P_t(a) = exp(q_t(a)/τ) / Σ_{i=1}^n exp(q_t(i)/τ),
where the action value q_t(a) corresponds to the expected reward of following action a and τ is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (τ → ∞), all actions have nearly the same probability, and the lower the temperature, the more the expected rewards affect the probability. For a low temperature (τ → 0⁺), the probability of the action with the highest expected reward tends to 1.
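A minimal sketch of softmax action selection, assuming NumPy (the action values below are illustrative):

```python
import numpy as np

def softmax_action_probs(q_values, tau=1.0):
    """P(a) = exp(q(a)/tau) / sum_i exp(q(i)/tau), with temperature tau."""
    z = np.asarray(q_values, dtype=float) / tau
    e = np.exp(z - np.max(z))  # stabilized
    return e / e.sum()

q = [1.0, 2.0, 1.5]  # estimated expected rewards for three actions
print(softmax_action_probs(q, tau=100.0))  # high tau: nearly uniform
print(softmax_action_probs(q, tau=0.01))   # low tau: nearly greedy

# Sample an action from the resulting distribution.
rng = np.random.default_rng(0)
action = rng.choice(len(q), p=softmax_action_probs(q, tau=1.0))
```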
Computational complexity and remedies
In neural network applications, the number K of possible outcomes is often large, e.g. in case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words.[9] This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the inputs z_i, followed by the application of the softmax function itself) computationally expensive.[9][10] Moreover, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.[9][10]
Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax.
A second kind of remedy is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.[9] These include methods that restrict the normalization sum to a sample of outcomes (e.g. importance sampling, target sampling).[9][10]
Mathematical properties
Geometrically the softmax function maps the vector space ℝ^K to the interior of the standard (K−1)-simplex, cutting the dimension by one (the range is a (K−1)-dimensional simplex in K-dimensional space), due to the linear constraint that all outputs sum to 1, meaning the image lies on a hyperplane.
Along the main diagonal (x, x, …, x), softmax is just the uniform distribution on outputs, (1/n, …, 1/n): equal scores yield equal probabilities.
More generally, softmax is invariant under translation by the same value in each coordinate: adding c = (c, …, c) to the inputs z yields softmax(z + c) = softmax(z), because adding c multiplies each exponent by the same factor e^c (since e^{z_i + c} = e^{z_i} · e^c), so the ratios do not change:

softmax(z + c)_j = exp(z_j + c) / Σ_{k=1}^K exp(z_k + c) = (exp(z_j) · e^c) / (Σ_{k=1}^K exp(z_k) · e^c) = softmax(z)_j.
Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: z → z − mean(z), where mean(z) = (1/K) Σ_k z_k), and then the softmax takes the hyperplane of points that sum to zero, {z : Σ_i z_i = 0}, to the open simplex of positive values that sum to 1, analogously to how the exponential function takes 0 to 1 (e^0 = 1) and is positive.
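Both the translation invariance and the mean-centering normalization are easy to verify numerically; a minimal NumPy sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, 1.5, -1.0])
print(np.allclose(softmax(z), softmax(z + 7.0)))       # True: adding a constant changes nothing
print(np.allclose(softmax(z), softmax(z - z.mean())))  # True: centering is a valid normalization
print(softmax(z - z.mean()).sum())                     # components still sum to 1
```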
By contrast, softmax is not invariant under scaling. For instance, softmax(0, 1) = (1/(1+e), e/(1+e)) ≈ (0.269, 0.731), but softmax(0, 2) = (1/(1+e²), e²/(1+e²)) ≈ (0.119, 0.881).
The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space: fixing one variable at 0 (say z_2 = 0, so e^{z_2} = 1) and letting the other vary (z_1 = x) gives softmax(x, 0)_1 = e^x / (e^x + 1), the standard logistic function, and softmax(x, 0)_2 = 1 / (e^x + 1), its complement.
The softmax function is also the gradient of the LogSumExp function, a smooth maximum:

∂ LSE(z) / ∂ z_i = exp(z_i) / Σ_{j=1}^K exp(z_j) = softmax(z)_i,    for i = 1, …, K,

where the LogSumExp function is defined as LSE(z_1, …, z_K) = log(exp(z_1) + ⋯ + exp(z_K)).
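The gradient identity can be verified with finite differences; a minimal NumPy sketch (helper names are illustrative):

```python
import numpy as np

def logsumexp(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))  # stable LSE(z) = log(sum_i exp(z_i))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.3, -1.2, 2.0, 0.0])
eps = 1e-6
grad = np.array([
    (logsumexp(z + eps * e_i) - logsumexp(z - eps * e_i)) / (2 * eps)
    for e_i in np.eye(len(z))
])
print(np.max(np.abs(grad - softmax(z))))  # near zero: grad LSE = softmax
```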
History
The softmax function was used in statistical mechanics as the Boltzmann distribution in the foundational paper Boltzmann (1868),[12] formalized and popularized in the influential textbook Gibbs (1902).[13]
The use of the softmax in decision theory is credited to R. Duncan Luce,[14]: 1 who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences.[citation needed]
In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, Bridle (1990a):[14]: 1 and Bridle (1990b):[3]
We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (e.g. weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.[15]: 227
For any input, the outputs must all be positive and they must sum to unity. ...
Given a set of unconstrained values, , we can ensure both conditions by using a Normalised Exponential transformation:
This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as softmax.[16]: 213
Example
If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: softmax is not scale invariant, so if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6) the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1 softmax, in fact, de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6=0.25).
Computation of this example using Python code:
>>> import numpy as np
>>> a = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
>>> np.exp(a) / np.sum(np.exp(a))
array([0.02364054, 0.06426166, 0.1746813, 0.474833, 0.02364054,
0.06426166, 0.1746813])
Here is an example of Julia code:
julia> A = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]; # semicolon to suppress interactive output
julia> exp.(A) ./ sum(exp, A)
7-element Array{Float64,1}:
0.0236405
0.0642617
0.174681
0.474833
0.0236405
0.0642617
0.174681
Here is an example of R code:
> z <- c(1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0)
> softmax <- exp(z)/sum(exp(z))
> softmax
[1] 0.02364054 0.06426166 0.17468130 0.47483300 0.02364054 0.06426166 0.17468130
Here is an example of Elixir code:[17]
iex> t = Nx.tensor([[1, 2], [3, 4]])
iex> Nx.divide(Nx.exp(t), Nx.sum(Nx.exp(t)))
#Nx.Tensor<
f64[2][2]
[
[0.03205860328008499, 0.08714431874203257],
[0.23688281808991013, 0.6439142598879722]
]
>
Here is an example of Raku code:
> my @z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0];
> say @z.map: {exp($_)/sum(@z.map: {exp($_)})}
(0.023640543021591385 0.06426165851049616 0.17468129859572226 0.4748329997443803 0.023640543021591385 0.06426165851049616 0.17468129859572226)
Alternatives
The softmax function generates probability predictions densely distributed over its support. Other functions like sparsemax or α-entmax can be used when sparse probability predictions are desired.[18]
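As an illustration of a sparse alternative, sparsemax (Martins & Astudillo, 2016) projects the input onto the probability simplex and can assign exact zeros; a minimal NumPy sketch of the closed-form sort-and-threshold algorithm (function name illustrative):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum    # entries that remain positive
    k_z = k[support][-1]                   # support size
    tau = (cumsum[support][-1] - 1) / k_z  # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 2.0, 3.0, 4.0]))  # smaller entries are exactly zero
print(sparsemax([0.5, 1.0, 1.2]))       # still a distribution, but sparse
```

Unlike softmax, which always assigns strictly positive probability to every outcome, the thresholding step here can drive low-scoring entries exactly to zero.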
See also
- Softplus
- Multinomial logistic regression
- Dirichlet distribution – an alternative way to sample categorical distributions
- Partition function
- Exponential tilting – a generalization of Softmax to more general probability distributions
Notes
- ^ The notation β is for the thermodynamic beta, which is inverse temperature: β = 1/T, so T = 1/β.
- ^ For β = 0 (coldness zero, infinite temperature), b = e^0 = 1, and softmax becomes the constant function (1/K, …, 1/K), corresponding to the discrete uniform distribution.
- ^ In statistical mechanics, fixing β = 1 is interpreted as having coldness (and temperature) of 1.
References
- Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. MIT Press. ISBN 978-0-262-03561-3.
- ^ Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
- ^ a b Sako, Yusaku (2018-06-02). "Is the term "softmax" driving you nuts?". Medium.
- ^ Goodfellow, Bengio & Courville 2016, pp. 183–184: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is . It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.
- ISBN 978-0-262-02617-8.
- ^ "Unsupervised Feature Learning and Deep Learning Tutorial". ufldl.stanford.edu. Retrieved 2024-03-25.
- ^ ai-faq What is a softmax activation function?
- ^ Sutton, R. S. and Barto A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. Softmax Action Selection
- ^ S2CID 21684923.
- ^ S2CID 6035643.
- ^ a b c Morin, Frederic; Bengio, Yoshua (2005-01-06). "Hierarchical Probabilistic Neural Network Language Model" (PDF). International Workshop on Artificial Intelligence and Statistics. PMLR: 246–252.
- ^ Boltzmann, Ludwig (1868). "Studien über das Gleichgewicht der lebendigen Kraft zwischen bewegten materiellen Punkten" [Studies on the balance of living force between moving material points]. Wiener Berichte. 58: 517–560.
- ^ Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics.
- ^ arXiv:1704.00805 [math.OC].
- Advances in Neural Information Processing Systems 2 (1989). Morgan-Kaufmann.
- ^ "Nx/Nx at main · elixir-nx/Nx". GitHub.
- ^ Tezekbayev, Maxat; Nikoulina, Vassilina; Gallé, Matthias; Assylbekov, Zhenisbek. "Speeding Up Entmax". arXiv:2111.06832.