History of artificial neural networks
Later, advances in hardware and the development of the backpropagation algorithm, as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs.
Linear neural network
The simplest kind of feedforward neural network (FNN) is a linear network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated at each node, and the mean squared error between these outputs and given target values is minimized by adjusting the weights, a technique known for over two centuries as the method of least squares or linear regression.
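In matrix form, such a network computes $\hat{y} = Wx$, and least-squares training chooses the weights as

$$W^{*} = \arg\min_{W} \sum_{i} \lVert y_i - W x_i \rVert^2 .$$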
Perceptrons and other early neural networks
In the early 1940s, Warren McCulloch and Walter Pitts proposed a mathematical model of the neuron and showed that simple networks of such threshold units can compute logical functions.
Some say that research stagnated following Minsky and Papert's book Perceptrons (1969), which emphasized that basic perceptrons were incapable of computing the exclusive-or function and that computers of the era lacked sufficient power to process useful neural networks.
First deep learning
The first deep learning multilayer perceptron (MLP) was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa in 1965, as the Group Method of Data Handling.
The first deep learning MLP trained by stochastic gradient descent[22] was published in 1967 by Shun'ichi Amari.[23][9] In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned useful internal representations to classify non-linearly separable pattern classes.
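In its simplest form, stochastic gradient descent updates the weights after each training example by stepping against the gradient of the loss,

$$w \leftarrow w - \eta \, \nabla_{w} L(w; x_i, y_i),$$

where $\eta$ is the learning rate.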
Backpropagation
The backpropagation algorithm is an efficient application of the Leibniz chain rule (1673)[24] to networks of differentiable nodes.[9] It is also known as the reverse mode of automatic differentiation, first published by Seppo Linnainmaa in 1970.
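Viewing the network as a composition of differentiable functions $h_i = f_i(h_{i-1})$ with $h_0 = x$, the chain rule expresses the loss gradient as a product of layer Jacobians,

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial h_n} \frac{\partial h_n}{\partial h_{n-1}} \cdots \frac{\partial h_1}{\partial x},$$

and reverse mode evaluates this product from the loss backwards, which is efficient when the loss is a scalar.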
Recurrent network architectures
Wilhelm Lenz and Ernst Ising created and analyzed the Ising model (1925)[33] which is essentially a non-learning artificial recurrent neural network (RNN) consisting of neuron-like threshold elements.[9] In 1972, Shun'ichi Amari made this architecture adaptive.[34][9] His learning RNN was popularised by John Hopfield in 1982.[35]
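In Hopfield's 1982 formulation, binary states $s_i \in \{-1, +1\}$ coupled by symmetric weights $w_{ij}$ are updated asynchronously so as to descend an Ising-style energy function:

$$E = -\tfrac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j, \qquad s_i \leftarrow \operatorname{sign}\Big( \sum_{j} w_{ij} s_j \Big).$$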
Self-organizing maps
SOMs create internal representations reminiscent of the cortical homunculus,[39] a distorted representation of the human body based on a neurological "map" of the areas and proportions of the brain dedicated to processing sensory input from different parts of the body.
Convolutional neural networks (CNNs)
The origin of the CNN architecture is the "neocognitron"[40] introduced by Kunihiko Fukushima in 1980.[41][42] It was inspired by the work of Hubel and Wiesel in the 1950s and 1960s, which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.
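A minimal Python/NumPy sketch of these two layer types; the function names, shapes, and filter values here are illustrative only, not taken from the neocognitron papers:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Convolutional unit: slide one shared weight vector ("filter")
    over the image, so every output unit applies the same weights to
    its own patch of the previous layer."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def average_pool(feature_map, size=2):
    """Downsampling unit: each output is the average activation of one
    non-overlapping size x size patch, giving tolerance to small shifts."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # trim so patches tile exactly
    patches = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return patches.mean(axis=(1, 3))

image = np.random.rand(8, 8)     # toy input
kernel = np.random.rand(3, 3)    # shared adaptive weights ("filter")
pooled = average_pool(convolve2d_valid(image, kernel))   # 6x6 map -> 3x3 map
```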
In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function.[43][9] The rectifier has become the most popular activation function for CNNs and deep neural networks in general.[44]
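In symbols, the rectifier is simply

$$\operatorname{ReLU}(x) = \max(0, x),$$

passing positive inputs through unchanged and clamping negative inputs to zero.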
The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance.[45] It did so by utilizing weight sharing in combination with backpropagation training.[46] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.[45]
In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.[47][48]
In 1989, Yann LeCun et al. trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.[49] Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang et al. modified their model by removing the last fully connected layer and applied it for medical image object segmentation in 1991[50] and breast cancer detection in mammograms in 1994.[51]
In 1990, Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling in order to realize a speaker-independent isolated word recognition system.[52] In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.[53][54][55][56] Max-pooling is often used in modern CNNs.[57]
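A max-pooling sketch in the same illustrative style as the example above, replacing the patch average with the patch maximum:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Fixed filtering operation: propagate the maximum activation of
    each non-overlapping size x size patch of the previous layer."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # trim so patches tile exactly
    patches = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

pooled = max_pool(np.random.rand(8, 8))        # -> 4x4 map of patch maxima
```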
LeNet-5, a 7-level CNN by Yann LeCun et al. (1998) that classifies digits, was applied by several banks to recognize hand-written numbers on checks digitized in 32x32 pixel images.
In 2010, backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.[59] Behnke (2003) relied only on the sign of the gradient (Rprop) on problems such as image reconstruction and face localization.
In 2011, a deep GPU-based CNN called "DanNet" by Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest.
ANNs were able to deal with both small and large natural objects in large cluttered scenes only when their invariance extended beyond shift to all ANN-learned concepts, such as location, type (object class label), scale, and lighting. This was realized in Developmental Networks (DNs)[65] whose embodiments are Where-What Networks, WWN-1 (2008)[66] through WWN-7 (2013).[67]
Artificial curiosity and generative adversarial networks
In 1991, Jürgen Schmidhuber published "artificial curiosity": two neural networks contest with each other in a zero-sum game, where one network's gain is the other network's loss. The first network is a generative model that produces outputs; the second network predicts the environment's reactions to those outputs and tries to minimize its prediction error, while the generative network tries to maximize that error.
In 2014, this adversarial principle was used in a generative adversarial network (GAN) by Ian Goodfellow et al.[71] Here the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set. This can be used to create realistic deepfakes.[72]
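Goodfellow et al. cast this as a minimax game between a generator $G$ and a discriminator $D$, where $D(x)$ estimates the probability that $x$ is real rather than generated:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].$$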
In 1992, Schmidhuber also published another type of gradient-based adversarial neural networks where the goal of the zero-sum game is to create disentangled representations of input patterns. This was called predictability minimization.[73][74]
Nvidia's StyleGAN (2018)[75] is based on the Progressive GAN by Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.[76] Here the GAN generator is grown from small to large scale in a pyramidal fashion. StyleGANs improve consistency between fine and coarse details in the generator network.
Transformers and their variants
Many modern large language models such as ChatGPT, GPT-4, and BERT are based on a non-recurrent feedforward architecture called the Transformer, introduced by Ashish Vaswani et al. in the 2017 paper "Attention Is All You Need".
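The central operation of the Transformer is scaled dot-product attention over query, key, and value matrices:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V,$$

where $d_k$ is the dimensionality of the keys.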
Basic ideas for this go back a long way: in 1992, Jürgen Schmidhuber published the fast weight controller, in which one neural network learns by gradient descent to program the rapidly changing weights of another network. It was later shown to be equivalent to the unnormalised linear Transformer.
Transformers are also increasingly being used in computer vision.[84]
Deep learning with unsupervised or self-supervised pre-training
In the 1980s, backpropagation did not work well for deep FNNs and RNNs. Here the word "deep" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth.[85] The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For an FNN, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For RNNs, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.
To overcome this problem, Jürgen Schmidhuber (1992) proposed a hierarchy of RNNs pre-trained one level at a time by self-supervised learning, in which each RNN tries to predict its own next input and only unexpected inputs are passed up to the next level. This "neural history compressor" learns internal representations at multiple self-organizing time scales.
The vanishing gradient problem and its solutions
In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used the LSTM principle of gated information flow to create the Highway network, a feedforward neural network with hundreds of layers, much deeper than previous networks.
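Each Highway layer combines a candidate transformation $H$ with a learned transform gate $T$ (both functions of the layer input), so that information can also flow through many layers unchanged:

$$y = H(x, W_H) \odot T(x, W_T) + x \odot (1 - T(x, W_T)),$$

where $\odot$ denotes elementwise multiplication and $T$ is typically a sigmoid-gated affine map.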
In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU[43] of Kunihiko Fukushima also helps to overcome the vanishing gradient problem,[99] performing better than the activation functions that were widely used before 2011.
Hardware-based designs
The development of metal-oxide-semiconductor (MOS) very-large-scale integration (VLSI), in the form of complementary MOS (CMOS) technology, provided the computing power needed to implement practical artificial neural networks in the 1980s.
Computational devices were created in CMOS for both biophysical simulation and neuromorphic computing.
Contests
Between 2009 and 2012, recurrent neural networks and deep feedforward neural networks developed in Schmidhuber's research group won eight international competitions in pattern recognition and machine learning.[104][105] For example, the bi-directional and multi-dimensional long short-term memory (LSTM)[106][107][108][109] of Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three languages to be learned.[108][107]
Ciresan and colleagues won pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[110] the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge[111] and others. Their neural networks were the first pattern recognizers to achieve human-competitive/superhuman performance[62] on benchmarks such as traffic sign recognition (IJCNN 2012), or the MNIST handwritten digits problem.
In 2010, researchers demonstrated that deep neural networks interfaced to a hidden Markov model, with context-dependent states defining the neural network's output layer, can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search.[citation needed]
GPU-based implementations
Deep, highly nonlinear neural architectures similar to the neocognitron[113] and the "standard architecture of vision",[114] inspired by simple and complex cells, were pre-trained with unsupervised methods by Hinton.[90][89] A team from his lab won a 2012 contest sponsored by Merck to design software to help find molecules that might identify new drugs.[115]
Notes
- ^ Neurons generate an action potential (the release of neurotransmitters that serve as chemical inputs to other neurons) based on the sum of their incoming chemical inputs.
References
- ^ S2CID 12781225.
- ISBN 0-465-02997-3.
- S2CID 195908774.
- ^ Gershgorn, Dave (26 July 2017). "The data that transformed AI research—and possibly the world". Quartz.
- ^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
- ^ Mansfield Merriman, "A List of Writings Relating to the Method of Least Squares"
- ^ Bretscher, Otto (1995). Linear Algebra With Applications (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
- ^ arXiv:2212.11279 [cs.NE].
- ^ ISBN 0-674-40340-1.
- ^ Kleene, S.C. (1956). "Representation of Events in Nerve Nets and Finite Automata". Annals of Mathematics Studies. No. 34. Princeton University Press. pp. 3–41. Retrieved 17 June 2017.
- ISBN 978-1-135-63190-1.
- ISBN 978-0-19-517618-6.
- ISBN 978-0-262-63022-1.
- S2CID 11715509.
- ^ Ivakhnenko, A. G. (1973). Cybernetic Predicting Devices. CCM Information Corporation.
- ^ Ivakhnenko, A. G.; Grigorʹevich Lapa, Valentin (1967). Cybernetics and forecasting techniques. American Elsevier Pub. Co.
- ^ Amari, Shun'ichi (1967). "A theory of adaptive pattern classifiers". IEEE Transactions on Electronic Computers. EC-16: 279–307.
- ISBN 9780598818461.
- ^ Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (Masters) (in Finnish). University of Helsinki. pp. 6–7.
- S2CID 122357351.
- S2CID 15568746.
- ISBN 978-0-89871-776-1.
- ^ Rosenblatt, Frank (1962). Principles of Neurodynamics. Spartan, New York.
- doi:10.2514/8.5282.
- ^ Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System modeling and optimization. Springer. pp. 762–770. Archived (PDF) from the original on 14 April 2016. Retrieved 2 July 2017.
- ^ Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". David E. Rumelhart, James L. McClelland, and the PDP research group. (editors), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundation. MIT Press, 1986.
- ^ Amari, Shun-Ichi (1972). "Learning patterns and pattern sequences by self-organizing nets of threshold elements". IEEE Transactions on Computers. C-21: 1197–1206.
- PMID 6953413.
- S2CID 206775459.
- S2CID 3351573.
- ^ "Homunculus | Meaning & Definition in UK English | Lexico.com". Lexico Dictionaries | English. Archived from the original on May 18, 2021. Retrieved 6 February 2022.
- S2CID 206775608. Retrieved 16 November 2013.
- S2CID 3074096.
- arXiv:1710.05941 [cs.NE].
- ^ a b Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.
- ^ Alexander Waibel et al., Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328–339, March 1989.
- ^ Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.
- PMID 20577468.
- ^ LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, 1, pp. 541–551, 1989.
- PMID 20706526.
- PMID 8058017.
- ^ Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.
- ^ J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively," Proc. International Joint Conference on Neural Networks, Baltimore, Maryland, vol I, pp. 576–581, June, 1992.
- ^ J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images," Proc. 4th International Conf. Computer Vision, Berlin, Germany, pp. 121–128, May, 1993.
- ^ J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron," International Journal of Computer Vision, vol. 25, no. 2, pp. 105–139, Nov. 1997.
- S2CID 2309950.
- S2CID 14542261. Retrieved October 7, 2016.
- ^ Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.
- ^ Martin Riedmiller and Heinrich Braun: Rprop – A Fast Adaptive Learning Algorithm. Proceedings of the International Symposium on Computer and Information Science VII, 1992.
- ^ S2CID 2161592.
- ^ a b Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffry (2012). "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada.
- S2CID 206594692.
- ^ J. Weng, "Why Have We Passed 'Neural Networks Do not Abstract Well'?," Natural Intelligence: the INNS Magazine, vol. 1, no.1, pp. 13–22, 2011.
- ^ Z. Ji, J. Weng, and D. Prokhorov, "Where-What Network 1: Where and What Assist Each Other Through Top-down Connections," Proc. 7th International Conference on Development and Learning (ICDL'08), Monterey, CA, Aug. 9–12, pp. 1–6, 2008.
- ^ X. Wu, G. Guo, and J. Weng, "Skull-closed Autonomous Development: WWN-7 Dealing with Scales," Proc. International Conference on Brain-Mind, July 27–28, East Lansing, Michigan, pp. 1–9, 2013.
- Schmidhuber, Jürgen (1991). "A possibility for implementing curiosity and boredom in model-building neural controllers". Proc. SAB'1991. MIT Press/Bradford Books. pp. 222–227.
- S2CID 234198.
- ^ S2CID 216056336.
- ^ Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks (PDF). Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014). pp. 2672–2680. Archived (PDF) from the original on 22 November 2019. Retrieved 20 August 2019.
- ^ "Prepare, Don't Panic: Synthetic Media and Deepfakes". witness.org. Archived from the original on 2 December 2020. Retrieved 25 November 2020.
- S2CID 42023620.
- S2CID 16154391.
- ^ "GAN 2.0: NVIDIA's Hyperrealistic Face Generator". SyncedReview.com. December 14, 2018. Retrieved October 3, 2019.
- ^ arXiv:1706.03762 [cs.CL].
- S2CID 208117506.
- S2CID 1915014.
- ^ S2CID 16683347.
- arXiv:2009.14794 [cs.CL].
- ^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
- ^ Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
- ^ He, Cheng (31 December 2021). "Transformer in CV". Towards Data Science.
- ^ S2CID 18271205.
- ^ Schmidhuber, Jürgen (1993). Habilitation Thesis (PDF).
- ISBN 9780262680530.
- ^ ISSN 1941-6016.
- ^ a b S. Hochreiter., "Untersuchungen zu dynamischen neuronalen Netzen Archived 2015-03-06 at the Wayback Machine," Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber, 1991.
- ISBN 978-0-7803-5369-5.
- ISBN 0-85296-721-7.
- arXiv:1505.00387 [cs.LG].
- ^ Srivastava, Rupesh K; Greff, Klaus; Schmidhuber, Juergen (2015). "Training Very Deep Networks". Advances in Neural Information Processing Systems. 28. Curran Associates, Inc.: 2377–2385.
- ISBN 978-1-4673-8851-1.
- ^ Xavier Glorot; Antoine Bordes; Yoshua Bengio (2011). Deep sparse rectifier neural networks (PDF). AISTATS.
- ISBN 978-1-4613-1639-8.
- PMID 18654568.
- S2CID 4367148.
- S2CID 1918673.
- ^ 2012 Kurzweil AI Interview Archived 2018-08-31 at the Wayback Machine with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team 2009–2012
- ^ "How bio-inspired deep learning keeps winning competitions | KurzweilAI". www.kurzweilai.net. Archived from the original on 2018-08-31. Retrieved 2017-06-16.
- ^ S2CID 14635907.
- ^ a b Graves, Alex; Schmidhuber, Jürgen (2009). Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris; Culotta, Aron (eds.). "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". Neural Information Processing Systems (NIPS) Foundation. 21. Curran Associates, Inc: 545–552.
- ^ PMID 22386783.
- ^ a b Ciresan, Dan; Giusti, Alessandro; Gambardella, Luca M.; Schmidhuber, Juergen (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q. (eds.). Advances in Neural Information Processing Systems 25 (PDF). Curran Associates, Inc. pp. 2843–2851.
- S2CID 8920227.
- ^ Markoff, John (November 23, 2012). "Scientists See Promise in Deep-Learning Programs". New York Times.
External links
- "Lecun 2019-7-11 ACM Tech Talk". Google Docs. Retrieved 2020-02-13.