Stochastic gradient descent

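Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.^[1]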

The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning.^[2]

Background

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

Q(w)={\frac {1}{n}}\sum _{i=1}^{n}Q_{i}(w),

where the parameter $w$ that minimizes $Q(w)$ is to be estimated. Each summand function $Q_{i}$ is typically associated with the $i$-th observation in the data set (used for training).

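In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation.^[3] Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).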

The sum-minimization problem also arises for empirical risk minimization. There, $Q_{i}(w)$ is the value of the loss function at the $i$-th example, and $Q(w)$ is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

w:=w-\eta \,\nabla Q(w)=w-{\frac {\eta }{n}}\sum _{i=1}^{n}\nabla Q_{i}(w).

The step size is denoted by $\eta$ (sometimes called the learning rate in machine learning) and here "$:=$" denotes the update of a variable in the algorithm.

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.

However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, each iteration becomes very expensive, because it requires the gradients of all $n$ summand functions. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.^[4]

Iterative method

Figure: Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.

In stochastic (or "on-line") gradient descent, the true gradient of $Q(w)$ is approximated by a gradient at a single sample:

w:=w-\eta \,\nabla Q_{i}(w).

As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.^[5]

In pseudocode, stochastic gradient descent can be presented as:

Choose an initial vector of parameters $w$ and learning rate $\eta$.
Repeat until an approximate minimum is obtained:
- Randomly shuffle samples in the training set.
- For $i=1,2,...,n$, do:
  - $w:=w-\eta \,\nabla Q_{i}(w).$
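The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `sgd`, the callback `grad_i(i, w)` returning $\nabla Q_{i}(w)$, the scalar parameter, and the toy summands $Q_{i}(w)={\tfrac {1}{2}}(w-a_{i})^{2}$ are all assumptions made for the example.

```python
import random

def sgd(grad_i, n, w, eta, epochs, seed=0):
    """Plain SGD: one update per sample, reshuffling the samples every pass."""
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)               # new random order for each pass
        for i in order:
            w = w - eta * grad_i(i, w)   # w := w - eta * grad Q_i(w)
    return w

# Hypothetical summands Q_i(w) = 0.5 * (w - a_i)**2, so grad Q_i(w) = w - a_i.
# Their average is minimized at the sample mean of the a_i, which is 3.0 here.
data = [1.0, 2.0, 3.0, 6.0]
w_hat = sgd(lambda i, w: w - data[i], n=len(data), w=0.0, eta=0.05, epochs=200)
```

With a constant step size the iterate keeps fluctuating in a small neighbourhood of the minimizer instead of settling exactly on it, which is one motivation for decreasing or adaptive learning rates.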

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than the "true" stochastic gradient descent described above, because the code can make use of vectorization libraries rather than computing each step separately. This was first shown in ^[6], where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples.
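A mini-batch variant might look like the sketch below. The name `minibatch_sgd`, the `batch_size` parameter and the toy summands $Q_{i}(w)={\tfrac {1}{2}}(w-a_{i})^{2}$ are assumptions for illustration. For self-containment the per-sample gradients are averaged in a plain Python loop; a practical implementation would typically compute the batch gradient with one vectorized call.

```python
import random

def minibatch_sgd(grad_i, n, w, eta, batch_size, epochs, seed=0):
    """SGD where each step averages the gradients of `batch_size` samples."""
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            g = sum(grad_i(i, w) for i in batch) / len(batch)  # averaged gradient
            w = w - eta * g
    return w

# Hypothetical summands Q_i(w) = 0.5 * (w - a_i)**2 with sample mean 3.0.
data = [1.0, 2.0, 3.0, 6.0, 4.0, 2.0]
w_hat = minibatch_sgd(lambda i, w: w - data[i], n=len(data), w=0.0,
                      eta=0.1, batch_size=3, epochs=200)
```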

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\eta$ decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.^[2]^[7] This is in fact a consequence of the Robbins–Siegmund theorem.^[8]

Example

Suppose we want to fit a straight line ${\hat {y}}=w_{1}+w_{2}x$ to a training set with inputs $(x_{1},x_{2},\ldots ,x_{n})$, observed responses $(y_{1},y_{2},\ldots ,y_{n})$ and corresponding estimated responses $({\hat {y}}_{1},{\hat {y}}_{2},\ldots ,{\hat {y}}_{n})$ using least squares. The objective function to be minimized is

Q(w)=\sum _{i=1}^{n}Q_{i}(w)=\sum _{i=1}^{n}\left({\hat {y}}_{i}-y_{i}\right)^{2}=\sum _{i=1}^{n}\left(w_{1}+w_{2}x_{i}-y_{i}\right)^{2}.

The last line in the above pseudocode for this specific problem will become:

{\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}:={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}{\frac {\partial }{\partial w_{1}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\\{\frac {\partial }{\partial w_{2}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\end{bmatrix}}={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}2(w_{1}+w_{2}x_{i}-y_{i})\\2x_{i}(w_{1}+w_{2}x_{i}-y_{i})\end{bmatrix}}.

Note that in each iteration or update step, the gradient is only evaluated at a single $x_{i}$. This is the key difference between stochastic gradient descent and batch gradient descent.
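The straight-line example can be tried out with a short Python sketch. The function name `fit_line_sgd`, the step size, the number of passes and the noiseless toy data are assumptions chosen for illustration. Because the data lie exactly on a line, every summand is minimized at the same point, so a small constant step size is enough here.

```python
import random

def fit_line_sgd(xs, ys, eta=0.01, epochs=2000, seed=0):
    """Fit y_hat = w1 + w2 * x by SGD on the squared residuals."""
    rng = random.Random(seed)
    w1, w2 = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            r = w1 + w2 * xs[i] - ys[i]   # residual at the single sample i
            # Gradient of r**2 with respect to (w1, w2) is (2r, 2 * x_i * r).
            w1, w2 = w1 - eta * 2.0 * r, w2 - eta * 2.0 * xs[i] * r
    return w1, w2

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]          # hypothetical noiseless data, y = 1 + 2x
w1, w2 = fit_line_sgd(xs, ys)
```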

History

In 1951, Herbert Robbins and Sutton Monro introduced the earliest stochastic approximation methods, preceding stochastic gradient descent.^[9] Building on this work one year later, Jack Kiefer and Jacob Wolfowitz published an optimization algorithm very close to stochastic gradient descent, using central differences as an approximation of the gradient.^[10] Later in the 1950s, Frank Rosenblatt used SGD to optimize his perceptron model, demonstrating the first applicability of stochastic gradient descent to neural networks.^[11]

With the popularization of backpropagation in 1986, stochastic gradient descent came to be used to optimize parameters across neural networks with multiple hidden layers. Soon after, another improvement was developed: mini-batch gradient descent, where small batches of data are substituted for single samples. In 1997, the practical performance benefits from vectorization achievable with such small batches were first explored,^[12] paving the way for efficient optimization in machine learning. As of 2023, this mini-batch approach remains the norm for training neural networks, balancing the benefits of stochastic gradient descent with those of gradient descent.^[13]

By the 1980s, momentum had already been introduced, and was added to SGD optimization techniques in 1986.^[14] However, these optimization techniques assumed constant hyperparameters, i.e. a fixed learning rate and momentum parameter. In the 2010s, adaptive approaches to applying SGD with a per-parameter learning rate were introduced with AdaGrad (for "Adaptive Gradient") in 2011^[15] and RMSprop (for "Root Mean Square Propagation") in 2012.^[16] In 2014, Adam (for "Adaptive Moment Estimation") was published, applying the adaptive approaches of RMSprop to momentum; many improvements and branches of Adam were then developed, such as AdamW and Adamax, joining earlier adaptive methods such as Adadelta and Adagrad.^[17]^[18]

Within machine learning, approaches to optimization in 2023 are dominated by Adam-derived optimizers. TensorFlow and PyTorch, by far the most popular machine learning libraries,^[19] as of 2023 largely only include Adam-derived optimizers, as well as predecessors to Adam such as RMSprop and classic SGD. PyTorch also partially supports Limited-memory BFGS, a line-search method, but only for single-device setups without parameter groups.^[18]^[20]

Notable applications

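Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression and graphical models.^[21] When combined with the backpropagation algorithm, it is the de facto standard algorithm for training artificial neural networks.^[22] Its use has also been reported in the geophysics community, specifically in applications of Full Waveform Inversion (FWI).^[23]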

Stochastic gradient descent competes with the L-BFGS algorithm,^{[citation needed]} which is also widely used. Stochastic gradient descent has been used since at least 1960 for training linear regression models, originally under the name ADALINE.^[24]

Another stochastic gradient descent algorithm is the least mean squares (LMS) adaptive filter.

Extensions and variants

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge.^[25] A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function $\eta _{t}$ of the iteration number $t$, giving a learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on $k$-means clustering.^[26] Practical guidance on choosing the step size in several variants of SGD is given by Spall.^[27]
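As a sketch of a learning rate schedule, the Python example below uses $\eta _{t}=\eta _{0}/(1+\gamma t)$. This particular form, the constants, and the names `eta_schedule` and `sgd_decaying_rate` are assumptions chosen for illustration, not a schedule prescribed by the sources cited above. The toy summands are $Q_{i}(w)={\tfrac {1}{2}}(w-a_{i})^{2}$, sampled uniformly with replacement.

```python
import random

def eta_schedule(t, eta0=0.5, decay=0.1):
    """Hypothetical schedule eta_t = eta0 / (1 + decay * t), decreasing in t."""
    return eta0 / (1.0 + decay * t)

def sgd_decaying_rate(data, w=0.0, steps=20000, seed=0):
    """SGD on Q_i(w) = 0.5 * (w - a_i)**2 with a decreasing step size."""
    rng = random.Random(seed)
    for t in range(steps):
        a = rng.choice(data)                # sample one summand uniformly
        w = w - eta_schedule(t) * (w - a)   # early steps are large, later ones small
    return w

data = [1.0, 2.0, 3.0, 6.0]                 # sample mean 3.0 minimizes the average
w_hat = sgd_decaying_rate(data)
```

Early iterations move the parameter quickly towards the minimizer, while the shrinking step size progressively damps the sampling noise in later iterations.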

Implicit updates (ISGD)

As mentioned earlier, classical stochastic gradient descent is generally sensitive to the learning rate $\eta$. Fast convergence requires large learning rates but this may induce numerical instability. The problem can be largely solved^[28] by considering implicit updates whereby the stochastic gradient is evaluated at the next iterate rather than the current one:

w^{\text{new}}:=w^{\text{old}}-\eta \,\nabla Q_{i}(w^{\rm {new}}).

This equation is implicit since $w^{\rm {new}}$ appears on both sides of the equation. It is a stochastic form of the proximal gradient method since the update can also be written as:

w^{\text{new}}:=\arg \min _{w}\left\{Q_{i}(w)+{\frac {1}{2\eta }}\left\|w-w^{\text{old}}\right\|^{2}\right\}.

As an example, consider least squares with features $x_{1},\ldots ,x_{n}\in \mathbb {R} ^{p}$ and observations $y_{1},\ldots ,y_{n}\in \mathbb {R}$ . We wish to solve:

\min _{w}\sum _{j=1}^{n}\left(y_{j}-x_{j}'w\right)^{2},

where $x_{j}'w=x_{j,1}w_{1}+x_{j,2}w_{2}+\cdots +x_{j,p}w_{p}$ indicates the inner product. Note that $x$ could have "1" as the first element to include an intercept. Classical stochastic gradient descent proceeds as follows:

w^{\text{new}}=w^{\text{old}}+\eta \left(y_{i}-x_{i}'w^{\text{old}}\right)x_{i}

where $i$ is uniformly sampled between 1 and $n$ . Although theoretical convergence of this procedure happens under relatively mild assumptions, in practice the procedure can be quite unstable. In particular, when $\eta$ is misspecified so that $I-\eta x_{i}x_{i}'$ has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, implicit stochastic gradient descent (shortened as ISGD) can be solved in closed-form as:

w^{\text{new}}=w^{\text{old}}+{\frac {\eta }{1+\eta \left\|x_{i}\right\|^{2}}}\left(y_{i}-x_{i}'w^{\text{old}}\right)x_{i}.

This procedure will remain numerically stable for virtually all $\eta$, as the learning rate is now normalized. This comparison between classical and implicit stochastic gradient descent in the least squares problem is very similar to the comparison between least mean squares (LMS) and the normalized least mean squares filter (NLMS).
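The difference between the two least-squares updates can be illustrated with a small Python sketch. The helper names and the deliberately oversized learning rate are assumptions for the example. Applying each update repeatedly to one sample, the explicit residual is multiplied by $1-\eta \|x\|^{2}$ per step, while the implicit residual is multiplied by $1/(1+\eta \|x\|^{2})$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_step(w, x, y, eta):
    """Explicit update: w + eta * (y - x'w) * x."""
    r = y - dot(x, w)
    return [wj + eta * r * xj for wj, xj in zip(w, x)]

def isgd_step(w, x, y, eta):
    """Implicit update in closed form: step size eta / (1 + eta * ||x||^2)."""
    r = y - dot(x, w)
    scale = eta / (1.0 + eta * dot(x, x))
    return [wj + scale * r * xj for wj, xj in zip(w, x)]

# Hypothetical sample; eta is deliberately too large: 1 - eta * ||x||^2 = -9.
x, y, eta = [1.0, 3.0], 2.0, 1.0
w_sgd, w_isgd = [0.0, 0.0], [0.0, 0.0]
for _ in range(5):
    w_sgd = sgd_step(w_sgd, x, y, eta)
    w_isgd = isgd_step(w_isgd, x, y, eta)

res_sgd = abs(y - dot(x, w_sgd))     # grows by a factor of 9 per step
res_isgd = abs(y - dot(x, w_isgd))   # shrinks by a factor of 11 per step
```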

Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that $Q_{i}(w)$ depends on $w$ only through a linear combination with features $x_{i}$ , so that we can write $\nabla _{w}Q_{i}(w)=-q(x_{i}'w)x_{i}$ , where $q()\in \mathbb {R}$ may depend on $x_{i},y_{i}$ as well but not on $w$ except through $x_{i}'w$ . Least squares obeys this rule, and so does logistic regression, and most generalized linear models. For instance, in least squares, $q(x_{i}'w)=y_{i}-x_{i}'w$ , and in logistic regression $q(x_{i}'w)=y_{i}-S(x_{i}'w)$ , where $S(u)=e^{u}/(1+e^{u})$ is the logistic function. In Poisson regression, $q(x_{i}'w)=y_{i}-e^{x_{i}'w}$ , and so on.

In such settings, ISGD is simply implemented as follows. Let $f(\xi )=\eta q(x_{i}'w^{\text{old}}+\xi \|x_{i}\|^{2})$, where $\xi$ is scalar. Then, ISGD is equivalent to:

w^{\text{new}}=w^{\text{old}}+\xi ^{\ast }x_{i},~{\text{where}}~\xi ^{\ast }=f(\xi ^{\ast }).

The scaling factor $\xi ^{\ast }\in \mathbb {R}$ can be found through the bisection method since in most regular models, such as the aforementioned generalized linear models, function $q()$ is decreasing, and thus the search bounds for $\xi ^{\ast }$ are $[\min(0,f(0)),\max(0,f(0))]$ .
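A sketch of this bisection for logistic regression, where $q(u)=y_{i}-S(u)$, is given below. The function name `isgd_logistic_step`, the fixed number of bisection iterations and the sample values are assumptions for illustration. The search relies on $\xi -f(\xi )$ being increasing in $\xi$, which follows from $q$ being decreasing.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def isgd_logistic_step(w, x, y, eta, iters=100):
    """One implicit SGD step for logistic regression via bisection on xi."""
    u = sum(xj * wj for xj, wj in zip(x, w))   # x'w_old
    norm2 = sum(xj * xj for xj in x)           # ||x||^2

    def f(xi):
        return eta * (y - sigmoid(u + xi * norm2))

    # Search bounds [min(0, f(0)), max(0, f(0))] as stated in the text.
    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - f(mid) > 0.0:   # past the fixed point: shrink from above
            hi = mid
        else:
            lo = mid
    xi_star = 0.5 * (lo + hi)
    return [wj + xi_star * xj for wj, xj in zip(w, x)], xi_star

# Hypothetical sample with a large learning rate.
w_new, xi_star = isgd_logistic_step([0.0, 0.0], x=[1.0, 2.0], y=1.0, eta=10.0)
```

The returned `xi_star` satisfies the fixed-point equation $\xi ^{\ast }=f(\xi ^{\ast })$ up to floating-point precision, so `w_new` satisfies the implicit update for this sample.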

Momentum

Further proposals include the momentum method or the heavy ball method, which in the machine learning context appeared in Rumelhart, Hinton and Williams' paper on backpropagation learning^[29] and borrowed the idea from Soviet mathematician Boris Polyak's 1964 article on solving functional equations.^[30] Stochastic gradient descent with momentum remembers the update $\Delta w$ at each iteration, and determines the next update as a linear combination of the gradient and the previous update:^[31]^[32]

\Delta w:=\alpha \Delta w-\eta \,\nabla Q_{i}(w)

w:=w+\Delta w

that leads to:

w:=w-\eta \,\nabla Q_{i}(w)+\alpha \Delta w

where the parameter $w$ which minimizes $Q(w)$ is to be estimated, $\eta$ is a step size (sometimes called the learning rate in machine learning) and $\alpha$ is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.
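The two-line momentum update can be sketched in Python as follows. The function name `sgd_momentum`, the hyperparameter values and the toy summands $Q_{i}(w)={\tfrac {1}{2}}(w-a_{i})^{2}$ are assumptions for illustration.

```python
import random

def sgd_momentum(grad_i, n, w, eta, alpha, epochs, seed=0):
    """SGD with momentum: delta := alpha * delta - eta * grad; w := w + delta."""
    rng = random.Random(seed)
    order = list(range(n))
    delta = 0.0                      # previous update, initially zero
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            delta = alpha * delta - eta * grad_i(i, w)
            w = w + delta
    return w

# Hypothetical summands Q_i(w) = 0.5 * (w - a_i)**2 with sample mean 3.0.
data = [1.0, 2.0, 3.0, 6.0]
w_hat = sgd_momentum(lambda i, w: w - data[i], n=len(data), w=0.0,
                     eta=0.002, alpha=0.9, epochs=500)
```

For a slowly varying gradient the accumulated update approaches $-\eta \nabla Q/(1-\alpha )$, so with $\alpha =0.9$ the effective step is roughly ten times the nominal learning rate.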

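The name momentum stems from an analogy to momentum in physics: the weight vector $w$, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades.^[33]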

The momentum method is closely related to underdamped Langevin dynamics, and may be combined with simulated annealing.^[34]

In the mid-1980s the method was modified by Yurii Nesterov to use the gradient predicted at the next point, and the resulting so-called Nesterov Accelerated Gradient was sometimes used in machine learning in the 2010s.^[35]

Averaging

Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of^[36]

{\bar {w}}={\frac {1}{t}}\sum _{i=0}^{t-1}w_{i}.

When optimization is done, this averaged parameter vector takes the place of $w$.
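A minimal Python sketch of averaged stochastic gradient descent follows. The function name `averaged_sgd`, the running-mean bookkeeping and the toy summands $Q_{i}(w)={\tfrac {1}{2}}(w-a_{i})^{2}$ are assumptions for illustration. The average is maintained incrementally, so the iterates themselves need not be stored.

```python
import random

def averaged_sgd(grad_i, n, w, eta, epochs, seed=0):
    """Ordinary SGD that also tracks w_bar = (1/t) * sum_{i=0}^{t-1} w_i."""
    rng = random.Random(seed)
    order = list(range(n))
    w_bar, t = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            w_bar += (w - w_bar) / t       # running mean of w_0, ..., w_{t-1}
            w = w - eta * grad_i(i, w)     # unchanged SGD update
    return w, w_bar

# Hypothetical summands Q_i(w) = 0.5 * (w - a_i)**2 with sample mean 3.0.
data = [1.0, 2.0, 3.0, 6.0]
w_last, w_bar = averaged_sgd(lambda i, w: w - data[i], n=len(data),
                             w=0.0, eta=0.05, epochs=2000)
```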

AdaGrad

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter learning rate, first published in 2011.^[37] Informally, this increases the learning rate for sparser parameters^{[clarification needed]} and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.^[37]

It still has a base learning rate $\eta$, but this is multiplied with the elements of a vector $\{G_{j,j}\}$ which is the diagonal of the outer product matrix

G=\sum _{\tau =1}^{t}g_{\tau }g_{\tau }^{\mathsf {T}}

where $g_{\tau }=\nabla Q_{i}(w)$, the gradient, at iteration $\tau$. The diagonal is given by

G_{j,j}=\sum _{\tau =1}^{t}g_{\tau ,j}^{2}.

This vector essentially stores a historical sum of gradient squares by dimension and is updated after every iteration. The formula for an update is now^[a]

w:=w-\eta \,\mathrm {diag} (G)^{-{\frac {1}{2}}}\odot g

or, written as per-parameter updates,

w_{j}:=w_{j}-{\frac {\eta }{\sqrt {G_{j,j}}}}g_{j}.

Each $G_{i,i}$ gives rise to a scaling factor for the learning rate that applies to a single parameter $w_{i}$. Since the denominator in this factor, ${\textstyle {\sqrt {G_{i,i}}}={\sqrt {\sum _{\tau =1}^{t}g_{\tau ,i}^{2}}}}$, is the ℓ₂ norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.^[33]
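The per-parameter AdaGrad update can be sketched in Python as below. The function name `adagrad_step` and the example gradients are assumptions for illustration. The small constant `eps` in the denominator is also an assumption that does not appear in the formulas above; it only guards against division by zero for a coordinate whose gradient has always been zero.

```python
import math

def adagrad_step(w, G_diag, g, eta, eps=1e-12):
    """One AdaGrad step; G_diag holds the running sums of squared gradients."""
    G_diag = [Gj + gj * gj for Gj, gj in zip(G_diag, g)]
    w = [wj - eta * gj / (math.sqrt(Gj) + eps)   # eps: assumed safeguard only
         for wj, gj, Gj in zip(w, g, G_diag)]
    return w, G_diag

w, G = [1.0, 1.0], [0.0, 0.0]
w, G = adagrad_step(w, G, g=[3.0, 4.0], eta=0.1)   # both coordinates move by eta
w_first = list(w)
w, G = adagrad_step(w, G, g=[4.0, 0.0], eta=0.1)   # first moves by 0.1 * 4 / 5
```

On the first step every coordinate with a nonzero gradient moves by exactly $\eta$ regardless of the gradient's magnitude, which shows the normalization at work.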

While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.^[38]

RMSProp

RMSProp (for Root Mean Square Propagation) is a method invented in 2012 by James Martens and Ilya Sutskever, at the time both PhD students in Geoffrey Hinton's group, in which the learning rate is, like in Adagrad, adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.^[39] Unusually, it was not published in an article but merely described in a Coursera lecture.^{[citation needed]}

First, the running average is calculated in terms of the mean square,

v(w,t):=\gamma v(w,t-1)+\left(1-\gamma \right)\left(\nabla Q_{i}(w)\right)^{2}

where $\gamma$ is the forgetting factor. The concept of storing the historical gradient as a sum of squares is borrowed from Adagrad, but "forgetting" is introduced to solve Adagrad's diminishing learning rates in non-convex problems by gradually decreasing the influence of old data.^[40]

The parameters are then updated as

w:=w-{\frac {\eta }{\sqrt {v(w,t)}}}\nabla Q_{i}(w)

RMSProp has shown good adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop and is able to work with mini-batches as well, as opposed to only full batches.^[39]
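The two RMSProp formulas can be sketched in Python as follows. The function name `rmsprop_step` and the example values are assumptions for illustration, and the small constant `eps` is an added safeguard against division by zero that is not part of the formulas above.

```python
import math

def rmsprop_step(w, v, g, eta, gamma=0.9, eps=1e-8):
    """One RMSProp step; v is the running average of squared gradients."""
    v = [gamma * vj + (1.0 - gamma) * gj * gj for vj, gj in zip(v, g)]
    w = [wj - eta * gj / (math.sqrt(vj) + eps)   # eps: assumed safeguard only
         for wj, gj, vj in zip(w, g, v)]
    return w, v

# One step from w = 1 with gradient 2 and an empty history (v = 0).
w, v = rmsprop_step([1.0], [0.0], g=[2.0], eta=0.01)
```

If the gradient stays constant, `v` converges to its square and each step approaches size $\eta$, independent of the gradient's scale.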

Adam

Adam^[41] (short for Adaptive Moment Estimation) is a 2014 update to the RMSProp optimizer combining it with the main feature of the momentum method.^[42] In this optimization algorithm, running averages with exponential forgetting of both the gradients and the second moments of the gradients are used. Given parameters $w^{(t)}$ and a loss function $L^{(t)}$, where $t$ indexes the current training iteration (indexed at $1$, so that the bias-correction denominators $1-\beta _{1}^{t}$ and $1-\beta _{2}^{t}$ below are nonzero), Adam's parameter update is given by:

m_{w}^{(t+1)}\leftarrow \beta _{1}m_{w}^{(t)}+\left(1-\beta _{1}\right)\nabla _{w}L^{(t)}

v_{w}^{(t+1)}\leftarrow \beta _{2}v_{w}^{(t)}+\left(1-\beta _{2}\right)\left(\nabla _{w}L^{(t)}\right)^{2}

{\hat {m}}_{w}={\frac {m_{w}^{(t+1)}}{1-\beta _{1}^{t}}}

{\hat {v}}_{w}={\frac {v_{w}^{(t+1)}}{1-\beta _{2}^{t}}}

w^{(t+1)}\leftarrow w^{(t)}-\eta {\frac {{\hat {m}}_{w}}{{\sqrt {{\hat {v}}_{w}}}+\epsilon }}

where $\epsilon$ is a small scalar (e.g. $10^{-8}$) used to prevent division by 0, and $\beta _{1}$ (e.g. 0.9) and $\beta _{2}$ (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting is done element-wise.
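The Adam update can be sketched in Python as below. The function name `adam_step`, the default learning rate and the toy loss $(w-3)^{2}$ are assumptions for illustration. The step counter `t` starts at 1, so the bias-correction denominators are never zero.

```python
import math

def adam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for step counter t = 1, 2, ... (element-wise on lists)."""
    m = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, g)]
    v = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, g)]
    m_hat = [mj / (1.0 - beta1 ** t) for mj in m]   # bias-corrected first moment
    v_hat = [vj / (1.0 - beta2 ** t) for vj in v]   # bias-corrected second moment
    w = [wj - eta * mh / (math.sqrt(vh) + eps)
         for wj, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# A few steps on the hypothetical loss L(w) = (w - 3)**2, starting from w = 0.
w, m, v = [0.0], [0.0], [0.0]
for t in range(1, 4):
    g = [2.0 * (w[0] - 3.0)]
    w, m, v = adam_step(w, m, v, g, t)
```

Thanks to the bias correction, the very first step has magnitude close to $\eta$ whatever the scale of the gradient.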

The initial proof establishing the convergence of Adam was incomplete, and subsequent analysis has revealed that Adam does not converge for all convex objectives. Among its variants, AdamW modifies the weight decay algorithm in Adam.

Sign-based stochastic gradient descent

Even though sign-based optimization goes back to the aforementioned Rprop, in 2018 researchers tried to simplify Adam by removing the magnitude of the stochastic gradient from being taken into account and only considering its sign.^[53]^[54]
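A sign-based step keeps only the sign of each gradient component, as in the Python sketch below. The function names and the example values are assumptions for illustration, and a component with zero gradient is left unchanged.

```python
def sign(z):
    """Return -1, 0 or 1 according to the sign of z."""
    return (z > 0) - (z < 0)

def sign_sgd_step(w, g, eta):
    """Move every coordinate by exactly eta against the sign of its gradient."""
    return [wj - eta * sign(gj) for wj, gj in zip(w, g)]

# Hypothetical parameters and stochastic gradient.
w_new = sign_sgd_step([1.0, -2.0, 0.5], g=[0.3, -5.0, 0.0], eta=0.1)
```

The gradient magnitudes 0.3 and 5.0 lead to steps of the same size, which is the defining property of this family of methods.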

Backtracking line search

Backtracking line search is another variant of gradient descent. It is based on a condition known as the Armijo–Goldstein condition. Both backtracking line search and adaptive SGD allow learning rates to change at each iteration; however, the manner of the change is different. Backtracking line search uses function evaluations to check Armijo's condition, and in principle the loop in the algorithm for determining the learning rates can be long and unknown in advance. Adaptive SGD does not need a loop in determining learning rates. On the other hand, adaptive SGD does not guarantee the "descent property" that backtracking line search enjoys, namely that $f(x_{n+1})\leq f(x_{n})$ for all $n$. If the gradient of the cost function is globally Lipschitz continuous, with Lipschitz constant $L$, and the learning rate is chosen of the order $1/L$, then the standard version of SGD is a special case of backtracking line search.
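The inner loop and the descent property can be seen in a small deterministic Python sketch. The function name `backtracking_step`, the constants `shrink` and `c`, the cap on the number of halvings and the quadratic test function are all assumptions for illustration. The step size is halved until Armijo's sufficient-decrease condition holds.

```python
def backtracking_step(f, grad, w, eta0=1.0, shrink=0.5, c=1e-4, max_halvings=60):
    """One gradient step whose learning rate is found by backtracking."""
    g = grad(w)
    g2 = sum(gj * gj for gj in g)
    fw = f(w)
    eta = eta0
    for _ in range(max_halvings):
        trial = [wj - eta * gj for wj, gj in zip(w, g)]
        if f(trial) <= fw - c * eta * g2:   # Armijo sufficient-decrease condition
            return trial, eta
        eta *= shrink                       # step too long: shrink and retry
    return w, 0.0                           # no acceptable step: keep w unchanged

# Hypothetical cost function with minimizer (3, -1).
f = lambda w: (w[0] - 3.0) ** 2 + 4.0 * (w[1] + 1.0) ** 2
grad = lambda w: [2.0 * (w[0] - 3.0), 8.0 * (w[1] + 1.0)]

w, values = [0.0, 0.0], []
for _ in range(50):
    values.append(f(w))
    w, eta = backtracking_step(f, grad, w)
```

The recorded `values` never increase, because a step is accepted only when it lowers the cost. The number of halvings per step is not known in advance, which is the loop cost that adaptive SGD avoids.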

Second-order methods

A stochastic analogue of the standard (deterministic) Newton–Raphson algorithm (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation^{[citation needed]}. A method that uses direct measurements of the Hessian matrices of the summands in the empirical risk function was developed by Byrd, Hansen, Nocedal, and Singer.^[55] However, directly determining the required Hessian matrices for optimization may not be possible in practice. Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others.^[56]^[57]^[58] (A less efficient method based on finite differences, instead of simultaneous perturbations, is given by Ruppert.^[59]) Another approach to approximating the Hessian matrix is to replace it with the Fisher information matrix, which transforms the usual gradient into the natural gradient.^[60] These methods not requiring direct Hessian information are based on either values of the summands in the above empirical risk function or values of the gradients of the summands (i.e., the SGD inputs). In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function.

Approximations in continuous time

For small learning rate ${\textstyle \eta }$, stochastic gradient descent ${\textstyle (w_{n})_{n\in \mathbb {N} _{0}}}$ can be viewed as a discretization of the gradient flow ODE

{\frac {d}{dt}}W_{t}=-\nabla Q(W_{t})

subject to additional stochastic noise. This approximation is only valid on a finite time-horizon in the following sense: assume that all the coefficients ${\textstyle Q_{i}}$ are sufficiently smooth. Let ${\textstyle T>0}$ and ${\textstyle g:\mathbb {R} ^{d}\to \mathbb {R} }$ be a sufficiently smooth test function. Then, there exists a constant ${\textstyle C>0}$ such that for all ${\textstyle \eta >0}$

\max _{k=0,\dots ,\lfloor T/\eta \rfloor }\left|\mathbb {E} [g(w_{k})]-g(W_{k\eta })\right|\leq C\eta ,

where ${\textstyle \mathbb {E} }$ denotes taking the expectation with respect to the random choice of indices in the stochastic gradient descent scheme.

Since this approximation does not capture the random fluctuations around the mean behavior of stochastic gradient descent, solutions to stochastic differential equations (SDEs) have been proposed as limiting objects.^[61]

More precisely, the solution to the SDE

dW_{t}=-\nabla \left(Q(W_{t})+{\tfrac {1}{4}}\eta |\nabla Q(W_{t})|^{2}\right)dt+{\sqrt {\eta }}\Sigma (W_{t})^{1/2}dB_{t},

for

\Sigma (w)={\frac {1}{n^{2}}}\left(\sum _{i=1}^{n}Q_{i}(w)-Q(w)\right)\left(\sum _{i=1}^{n}Q_{i}(w)-Q(w)\right)^{T}

where ${\textstyle dB_{t}}$ denotes the Itô integral with respect to a Brownian motion, is a more precise approximation in the sense that there exists a constant ${\textstyle C>0}$ such that

\max _{k=0,\dots ,\lfloor T/\eta \rfloor }\left|\mathbb {E} [g(w_{k})]-\mathbb {E} [g(W_{k\eta })]\right|\leq C\eta ^{2}.

However this SDE only approximates the one-point motion of stochastic gradient descent. For an approximation of the stochastic flow one has to consider SDEs with infinite-dimensional noise.^[62]

Notes

^[a] $\odot$ denotes the element-wise product.

References

[1] ISBN 978-0-262-01646-9.

[2] ISBN 978-0-521-65263-6.

[3] JSTOR 2287314.

[4] Advances in Neural Information Processing Systems. Vol. 20. pp. 161–168.

[5] Murphy, Kevin (2021). Probabilistic Machine Learning: An Introduction. MIT Press. Retrieved 10 April 2021.

[6] doi:10.1109/ICASSP.1997.604861.

[7] S2CID 10043417.

[8] ISBN 0-12-604550-X.

[9] doi:10.1214/aoms/1177729586.

[10] doi:10.1214/aoms/1177729392.

[11] S2CID 12781225.

[12] doi:10.1109/ICASSP.1997.604861.

[13] S2CID 73728964. Retrieved 2023-10-02.

[14] S2CID 205001834.

[15] Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.

[16] Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.

[17] arXiv:1412.6980 [cs.LG].

[18] "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.

[19] S2CID 254236976.

[20] "Module: tf.keras.optimizers | TensorFlow v2.14.0". TensorFlow. Retrieved 2023-10-02.

[21] Jenny Rose Finkel, Alex Kleeman, Christopher D. Manning (2008). Efficient, Feature-based, Conditional Random Field Parsing. Proc. Annual Meeting of the ACL.

[22] LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer Berlin Heidelberg, 2012. 9-48.

[23] Jerome R. Krebs, John E. Anderson, David Hinkley, Ramesh Neelamani, Sunwoong Lee, Anatoly Baumstein, and Martin-Daniel Lacasse (2009), "Fast full-wavefield seismic inversion using encoded sources," GEOPHYSICS 74: WCC177-WCC188.

[24] Avi Pfeffer. "CS181 Lecture 5 — Perceptrons" (PDF). Harvard University.

[25] ISBN 978-0262035613.

[26] doi:10.1109/IJCNN.1990.137720.

[27] ISBN 0-471-33052-3.

[28] S2CID 10279395.

[29] S2CID 205001834.

[30] "Gradient Descent and Momentum: The Heavy Ball Method". 13 July 2020.

[31] Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey E. (June 2013). Sanjoy Dasgupta and David Mcallester (ed.). On the importance of initialization and momentum in deep learning (PDF). In Proceedings of the 30th international conference on machine learning (ICML-13). Vol. 28. Atlanta, GA. pp. 1139–1147. Retrieved 14 January 2016.

[32] Sutskever, Ilya (2013). Training recurrent neural networks (PDF) (Ph.D.). University of Toronto. p. 74.

[33] arXiv:1212.5701 [cs.LG].

[34] PMID 34021212.

[35] "Papers with Code - Nesterov Accelerated Gradient Explained".

[36] S2CID 3548228. Archived from the original (PDF) on 2016-01-12. Retrieved 2018-02-14.

[37] Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.

[38] Gupta, Maya R.; Bengio, Samy; Weston, Jason (2014). "Training highly multiclass classifiers" (PDF). JMLR. 15 (1): 1461–1492.

[39] Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.

[40] "Understanding RMSprop — faster neural network learning". 2 September 2018.

[41] arXiv:1412.6980 [cs.LG].

[42] "4. Beyond Gradient Descent - Fundamentals of Deep Learning [Book]".

[43] arXiv:1904.09237.

[44] Rubio, David Martínez (2017). Convergence Analysis of an Adaptive Method of Gradient Descent (PDF) (Master thesis). University of Oxford. Retrieved 5 January 2024.

[45] arXiv:2208.09632.

[46] S2CID 70293087.

[47] doi:10.36227/techrxiv.20427852.v1. Retrieved 2022-11-19.

[48] OCLC 1333722169.

[49] arXiv:1912.09926.

[50] arXiv:1904.09237.

[51] "An overview of gradient descent optimization algorithms". 19 January 2016.

[52] arXiv:1711.05101.

[53] Balles, Lukas; Hennig, Philipp (15 February 2018). "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients".

[54] "SignSGD: Compressed Optimisation for Non-Convex Problems". 3 July 2018. pp. 560–569.

[55] S2CID 12396034.

[56] doi:10.1109/TAC.2000.880982.

[57] S2CID 3564529.

[58] ISBN 978-1-4471-4284-3.

[59] doi:10.1214/aos/1176346589.

[60] S2CID 207585383.

[61] ISSN 1533-7928.

[62] Gess, Benjamin; Kassing, Sebastian; Konarovskyi, Vitalii (14 February 2023). "Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent". arXiv:2302.07125 [math.PR].

Further reading

ISBN 978-3-540-23122-6

Buduma, Nikhil; Locascio, Nicholas (2017), "Beyond Gradient Descent", Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms, O'Reilly, ISBN 9781491925584

ISBN 978-3-642-35288-1

Spall, James C. (2003), Introduction to Stochastic Search and Optimization, ISBN 978-0-471-33052-3

External links

Using stochastic gradient descent in C++, Boost, Ublas for linear regression

Machine Learning Algorithms

"Gradient Descent, How Neural Networks Learn". 3Blue1Brown. October 16, 2017. Archived from the original on 2021-12-22 – via YouTube.

Goh (April 4, 2017). "Why Momentum Really Works". doi:10.23915/distill.00006. Interactive paper explaining momentum.

v
t
e
Differentiable computing
General

Differentiable programming

Information geometry

Statistical manifold

Automatic differentiation

Neuromorphic engineering

Pattern recognition

Tensor calculus

Computational learning theory

Inductive bias

Concepts

Gradient descent
SGD

Clustering

Regression
Overfitting

Hallucination

Adversary

Attention

Convolution

Loss functions

Backpropagation

Batchnorm

Activation
Softmax

Sigmoid

Rectifier

Regularization

Datasets

Augmentation

Diffusion

Autoregression

Applications

Machine learning
In-context learning

Artificial neural network

Deep learning

Scientific computing

Artificial Intelligence

Language model
Large language model

Hardware

IPU

TPU

VPU

Memristor

SpiNNaker

Software libraries

TensorFlow

PyTorch

Keras

Theano

JAX

Flux.jl

MindSpore

Implementations
Audio–visual

AlexNet

WaveNet

Human image synthesis

HWR

OCR

Speech synthesis

Speech recognition

Facial recognition

AlphaFold

Text-to-image models
DALL-E

Midjourney

Stable Diffusion

Text-to-video models
Sora

VideoPoet

Whisper

Verbal

Word2vec

Seq2seq

BERT

Gemini

LaMDA
Bard

NMT

Project Debater

IBM Watson

IBM Watsonx

Granite

GPT-1

GPT-2

GPT-3

GPT-4

ChatGPT

GPT-J

Chinchilla AI

PaLM

BLOOM

LLaMA

PanGu-Σ

Decisional

AlphaGo

AlphaZero

Q-learning

SARSA

OpenAI Five

Self-driving car

MuZero

Action selection
Auto-GPT

Robot control

People

Yoshua Bengio

Alex Graves

Ian Goodfellow

Stephen Grossberg

Demis Hassabis

Geoffrey Hinton

Yann LeCun

Fei-Fei Li

Andrew Ng

Jürgen Schmidhuber

David Silver

Ilya Sutskever

Organizations

Anthropic

EleutherAI

Google DeepMind

Hugging Face

OpenAI

Meta AI

Mila

MIT CSAIL

Huawei

Architectures

Neural Turing machine

Differentiable neural computer

Transformer

Recurrent neural network (RNN)

Long short-term memory (LSTM)

Gated recurrent unit (GRU)

Echo state network

Multilayer perceptron (MLP)

Convolutional neural network

Residual neural network

Mamba

Autoencoder

Variational autoencoder (VAE)

Generative adversarial network (GAN)

Graph neural network

Portals
Computer programming

Technology

Categories
Artificial neural networks

Machine learning

Retrieved from "https://en.wikipedia.org/w/index.php?title=Stochastic_gradient_descent&oldid=1219173891"

[38] $\odot$ denotes the element-wise product.

[1] ISBN 978-0-262-01646-9
.

[Bottou_1998-2] 
ISBN 978-0-521-65263-6
.

[3] JSTOR 2287314
.

[4] Advances in Neural Information Processing Systems
. Vol. 20. pp. 161–168.

[5] Murphy, Kevin (2021). Probabilistic Machine Learning: An Introduction. MIT Press. Retrieved 10 April 2021.

[6] doi:10.1109/ICASSP.1997.604861.

[7] S2CID 10043417.

[8] ISBN 0-12-604550-X.

[9] doi:10.1214/aoms/1177729586.

[10] doi:10.1214/aoms/1177729392.

[11] S2CID 12781225.

[12] doi:10.1109/ICASSP.1997.604861.

[13] S2CID 73728964. Retrieved 2023-10-02.

[14] S2CID 205001834.

[15] Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.

[16] Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.

[17] arXiv:1412.6980 [cs.LG].

[18] "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.

[19] S2CID 254236976.

[20] "Module: tf.keras.optimizers | TensorFlow v2.14.0". TensorFlow. Retrieved 2023-10-02.

[21] Jenny Rose Finkel, Alex Kleeman, Christopher D. Manning (2008). Efficient, Feature-based, Conditional Random Field Parsing. Proc. Annual Meeting of the ACL.

[22] LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer Berlin Heidelberg, 2012. 9-48

[23] Jerome R. Krebs, John E. Anderson, David Hinkley, Ramesh Neelamani, Sunwoong Lee, Anatoly Baumstein, and Martin-Daniel Lacasse, (2009), "Fast full-wavefield seismic inversion using encoded sources," GEOPHYSICS 74: WCC177-WCC188.

[24] Avi Pfeffer. "CS181 Lecture 5 — Perceptrons" (PDF). Harvard University.

[25] ISBN 978-0262035613.

[26] doi:10.1109/IJCNN.1990.137720.

[27] ISBN 0-471-33052-3.

[28] S2CID 10279395.

[29] S2CID 205001834.

[30] "Gradient Descent and Momentum: The Heavy Ball Method". 13 July 2020.

[31] Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey E. (June 2013). Sanjoy Dasgupta and David Mcallester (ed.). On the importance of initialization and momentum in deep learning (PDF). In Proceedings of the 30th international conference on machine learning (ICML-13). Vol. 28. Atlanta, GA. pp. 1139–1147. Retrieved 14 January 2016.

[32] Sutskever, Ilya (2013). Training recurrent neural networks (PDF) (Ph.D.). University of Toronto. p. 74.

[33] arXiv:1212.5701 [cs.LG].

[34] PMID 34021212.

[35] "Papers with Code - Nesterov Accelerated Gradient Explained".

[36] S2CID 3548228. Archived from the original (PDF) on 2016-01-12. Retrieved 2018-02-14.

[37] Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.

[39] Gupta, Maya R.; Bengio, Samy; Weston, Jason (2014). "Training highly multiclass classifiers" (PDF). JMLR. 15 (1): 1461–1492.

[40] Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.

[41] "Understanding RMSprop — faster neural network learning". 2 September 2018.

[42] arXiv:1412.6980 [cs.LG].

[43] "4. Beyond Gradient Descent - Fundamentals of Deep Learning [Book]".

[44] arXiv:1904.09237.

[45] Rubio, David Martínez (2017). Convergence Analysis of an Adaptive Method of Gradient Descent (PDF) (Master thesis). University of Oxford. Retrieved 5 January 2024.

[46] arXiv:2208.09632.

[47] S2CID 70293087.

[48] doi:10.36227/techrxiv.20427852.v1. Retrieved 2022-11-19.

[49] OCLC 1333722169.

[50] arXiv:1912.09926.

[51] arXiv:1904.09237.

[52] "An overview of gradient descent optimization algorithms". 19 January 2016.

[53] arXiv:1711.05101.

[54] Balles, Lukas; Hennig, Philipp (15 February 2018). "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients".

[55] "SignSGD: Compressed Optimisation for Non-Convex Problems". 3 July 2018. pp. 560–569.

[56] S2CID 12396034.

[57] doi:10.1109/TAC.2000.880982.

[58] S2CID 3564529.

[59] ISBN 978-1-4471-4284-3.

[60] doi:10.1214/aos/1176346589.

[61] S2CID 207585383.

[62] ISSN 1533-7928.

[63] Gess, Benjamin; Kassing, Sebastian; Konarovskyi, Vitalii (14 February 2023). "Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent". arXiv:2302.07125 [math.PR].
