Convolutional neural network

A convolutional neural network (CNN) is a feed-forward neural network that learns features by itself via filter (or kernel) optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections.^[1]^[2] For example, each neuron in a fully connected layer would require 10,000 weights to process an image of 100 × 100 pixels. However, by applying cascaded convolution (or cross-correlation) kernels,^[3]^[4] only 25 weights are required to process 5 × 5 tiles.^[5]^[6]

Higher-layer features are extracted from wider context windows, compared to lower-layer features.

They have applications in:

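CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input.^[14]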

Feed-forward neural networks are usually fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). Robust datasets also increase the probability that CNNs will learn the generalized principles that characterize a given dataset rather than the biases of a poorly populated set.^[15]

Convolutional networks were inspired by biological processes: individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms.

Architecture

A convolutional neural network consists of an input layer, hidden layers and an output layer. In a convolutional neural network, the hidden layers include one or more layers that perform convolutions. Typically this includes a layer that performs a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the input matrix for the layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers. In this respect a convolutional neural network is closely related to a matched filter.^[20]

Convolutional layers

In a CNN, the input is a tensor with shape:

(number of inputs) × (input height) × (input width) × (input channels)

After passing through a convolutional layer, the image becomes abstracted to a feature map, also called an activation map, with shape:

(number of inputs) × (feature map height) × (feature map width) × (feature map channels).

Convolutional layers convolve the input and pass the result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus.^[21] Each convolutional neuron processes data only for its receptive field.

Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs (e.g., high-resolution images), which would require massive numbers of weights because each pixel is a relevant input feature. A fully connected layer for an image of size 100 × 100 has 10,000 weights for each neuron in the second layer. Convolution reduces the number of free parameters, allowing the network to be deeper.^[5] For example, a 5 × 5 tiling region in which every tile uses the same shared weights requires only 25 weights. Using regularized weights over fewer parameters avoids the vanishing gradients and exploding gradients problems seen during backpropagation in earlier neural networks.^[1]^[2]
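The weight counts quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical and bias terms are ignored.

```python
def fully_connected_weights(height, width, channels=1):
    """Weights needed by ONE fully connected neuron that sees the whole image."""
    return height * width * channels


def shared_kernel_weights(kernel_size, channels=1):
    """Weights in ONE square convolution kernel, reused at every image position."""
    return kernel_size * kernel_size * channels


print(fully_connected_weights(100, 100))  # 10000
print(shared_kernel_weights(5))           # 25
```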

To speed processing, standard convolutional layers can be replaced by depthwise separable convolutional layers,^[22] which are based on a depthwise convolution followed by a pointwise convolution. The depthwise convolution is a spatial convolution applied independently over each channel of the input tensor, while the pointwise convolution is a standard convolution restricted to the use of $1\times 1$ kernels.
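As a rough illustration of why this factorization is cheaper, the following sketch compares weight counts for the two layer types. The function names and the example sizes (3 × 3 kernels, 64 input channels, 128 output channels) are assumptions chosen for illustration, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution: one k x k x c_in kernel per output channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1 x 1 kernels that mix the channels
    return depthwise + pointwise


print(standard_conv_params(3, 64, 128))        # 73728
print(depthwise_separable_params(3, 64, 128))  # 8768
```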

Pooling layers

Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters; tiling sizes such as 2 × 2 are commonly used. Global pooling acts on all the neurons of the feature map.^[23]^[24] There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map,^[25]^[26] while average pooling takes the average value.

Fully connected layers

Fully connected layers connect every neuron in one layer to every neuron in another layer. This is the same arrangement as in a traditional multilayer perceptron (MLP) neural network. The flattened feature maps pass through a fully connected layer to classify the images.

Receptive field

In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field. Typically the area is a square (e.g. 5 by 5 neurons). In a fully connected layer, by contrast, the receptive field is the entire previous layer. As convolutional layers are stacked, each neuron in a deeper layer takes input from a larger area of the original input than neurons in earlier layers do. This is due to applying the convolution over and over, which takes into account the value of a pixel as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.

To manipulate the receptive field size as desired, there are some alternatives to the standard convolutional layer. For example, atrous or dilated convolution^[27]^[28] expands the receptive field size without increasing the number of parameters by interleaving visible and blind regions. Moreover, a single dilated convolutional layer can comprise filters with multiple dilation ratios,^[29] thus having a variable receptive field size.
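The growth of the receptive field through a stack of layers can be sketched with the usual recurrence: each layer widens the field by (effective kernel size − 1) times the spacing, measured in input pixels, between adjacent units of the previous layer. The function name and the `(kernel_size, stride, dilation)` tuple layout are assumptions made for illustration.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a unit in the last layer.

    layers: list of (kernel_size, stride, dilation) tuples, ordered input to output.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between adjacent units of the current layer
    for kernel_size, stride, dilation in layers:
        effective_kernel = dilation * (kernel_size - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


# Three stacked 3x3 convolutions with stride 1 and no dilation.
print(receptive_field([(3, 1, 1)] * 3))          # 7
# Same parameter count in the second layer, but dilation 2 widens the field.
print(receptive_field([(3, 1, 1), (3, 1, 2)]))   # 7
```

In this sketch, two layers with dilation reach the same 7-pixel field as three undilated layers, which is the sense in which dilation enlarges the receptive field without adding parameters.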

Weights

Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.

The vectors of weights and biases are called filters and represent particular features of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and weight vector.^[30]

History

CNNs are often compared to the way the brain achieves vision processing in living organisms.^[31]

Receptive fields in the visual cortex

Work by Hubel and Wiesel in the 1950s and 1960s showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field.^[32] Neighboring cells have similar and overlapping receptive fields. Receptive field size and location vary systematically across the cortex to form a complete map of visual space.^{[citation needed]} The cortex in each hemisphere represents the contralateral visual field.^{[citation needed]}

Their 1968 paper identified two basic visual cell types in the brain:^[17]

simple cells, whose output is maximized by straight edges having particular orientations within their receptive field
complex cells, which have larger receptive fields, whose output is insensitive to the exact position of the edges in the field.

Hubel and Wiesel also proposed a cascading model of these two types of cells for use in pattern recognition tasks.^[33]^[32]

Neocognitron, origin of the CNN architecture

The "neocognitron"^[16] was introduced by Kunihiko Fukushima in 1980.^[18]^[26]^[34] It was inspired by the above-mentioned work of Hubel and Wiesel. The neocognitron introduced the two basic types of layers in CNNs:

A convolutional layer which contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters.
Downsampling layers which contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.

In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function.^[35]^[36] The rectifier has become the most popular activation function for CNNs and deep neural networks in general.^[37]

In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. in 1993 introduced a method called max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.^[38] Max-pooling is often used in modern CNNs.^[39]

Several supervised and unsupervised learning algorithms have been proposed over the decades to train the weights of a neocognitron.^[16] Today, however, the CNN architecture is usually trained through backpropagation.

The neocognitron is the first CNN which requires units located at multiple network positions to have shared weights.

Convolutional neural networks were presented at the Neural Information Processing Workshop in 1987, automatically analyzing time-varying signals by replacing learned multiplication with convolution in time, and demonstrated for speech recognition.^[40]

Time delay neural networks

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel et al. for phoneme recognition and was one of the first convolutional networks, as it achieved shift-invariance.^[41] A TDNN is a 1-D convolutional neural net where the convolution is performed along the time axis of the data. It is the first CNN utilizing weight sharing in combination with a training by gradient descent, using backpropagation.^[42] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.^[41]

TDNNs are convolutional networks that share weights along the temporal dimension.

translation invariance in image processing with CNNs.^[42] The tiling of neuron outputs can cover timed stages.^[45]

TDNNs now ^[when?] achieve the best performance in far-distance speech recognition.^[46]

Max pooling

In 1990 Yamaguchi et al. introduced the concept of max pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They did so by combining TDNNs with max pooling to realize a speaker-independent isolated word recognition system.^[25] In their system they used several TDNNs per word, one for each syllable. The results of each TDNN over the input signal were combined using max pooling and the outputs of the pooling layers were then passed on to networks performing the actual word classification.

Image recognition with CNNs trained by gradient descent

Denker et al. (1989) designed a 2-D CNN system to recognize hand-written ZIP Code numbers.^[47] However, the lack of an efficient training method to determine the kernel coefficients of the involved convolutions meant that all the coefficients had to be laboriously hand-designed.^[48]

Following the advances in the training of 1-D CNNs by Waibel et al. (1987), Yann LeCun et al. (1989)^[48] used back-propagation to learn the convolution kernel coefficients directly from images of hand-written numbers. Learning was thus fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Wei Zhang et al. (1988)^[12]^[13] used back-propagation to train the convolution kernels of a CNN for alphabet recognition. The model was called Shift-Invariant Artificial Neural Network (SIANN) before the name CNN was coined in the early 1990s. Wei Zhang et al. also applied the same CNN without the last fully connected layer for medical image object segmentation (1991)^[49] and breast cancer detection in mammograms (1994).^[50]

This approach became a foundation of modern computer vision.

LeNet-5

LeNet-5, a pioneering 7-level convolutional network by LeCun et al., classifies hand-written numbers on checks (British English: cheques) digitized in 32x32 pixel images. The ability to process higher-resolution images requires larger and more layers of convolutional neural networks, so this technique is constrained by the availability of computing resources.

Shift-invariant neural network

A shift-invariant neural network was proposed by Wei Zhang et al. for image character recognition in 1988.^[12]^[13] It modifies the neocognitron by keeping only the convolutional interconnections between the image feature layers and the last fully connected layer. The model was trained with back-propagation. The training algorithm was further refined in 1991^[52] to improve its generalization ability. The model architecture was modified by removing the last fully connected layer and applied to medical image segmentation (1991)^[49] and automatic detection of breast cancer in mammograms (1994).^[50]

A different convolution-based design was proposed in 1988^[53] for application to decomposition of one-dimensional electromyography convolved signals via de-convolution. This design was modified in 1989 to other de-convolution-based designs.^[54]^[55]

Neural abstraction pyramid

The feed-forward architecture of convolutional neural networks was extended in the neural abstraction pyramid^[56] by lateral and feedback connections. The resulting recurrent convolutional network allows for the flexible incorporation of contextual information to iteratively resolve local ambiguities. In contrast to previous models, image-like outputs at the highest resolution were generated, e.g., for semantic segmentation, image reconstruction, and object localization tasks.

GPU implementations

Although CNNs were invented in the 1980s, their breakthrough in the 2000s required fast implementations on graphics processing units (GPUs).

In 2004, it was shown by K. S. Oh and K. Jung that standard neural networks can be greatly accelerated on GPUs. Their implementation was 20 times faster than an equivalent implementation on CPU. Subsequent work also emphasised the value of GPGPU for machine learning.^[58]

The first GPU implementation of a CNN was described in 2006 by K. Chellapilla et al. Their implementation was 4 times faster than an equivalent implementation on CPU.^[59] Subsequent work also used GPUs, initially for other types of neural networks (different from CNNs), especially unsupervised neural networks.^[60]^[61]^[62]^[63]

In 2010, Dan Ciresan et al. at

RGB images).^[26]

Subsequently, a similar GPU-based CNN by Alex Krizhevsky et al. won the ImageNet Large Scale Visual Recognition Challenge 2012.^[67] A very deep CNN with over 100 layers by Microsoft won the ImageNet 2015 contest.^[68]

Intel Xeon Phi implementations

Compared to the training of CNNs using GPUs, not much attention was given to the Intel Xeon Phi coprocessor.^[69]

A notable development is a parallelization method for training convolutional neural networks on the Intel Xeon Phi, named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS).^[70] CHAOS exploits both the thread- and SIMD-level parallelism that is available on the Intel Xeon Phi.

Distinguishing features

In the past, traditional multilayer perceptron (MLP) models were used for image recognition.^{[example needed]} However, the full connectivity between nodes caused the curse of dimensionality, and was computationally intractable with higher-resolution images. A 1000×1000-pixel image with RGB color channels has 3 million weights per fully connected neuron, which is too many to process efficiently at scale.

For example, in CIFAR-10, images are only of size 32×32×3 (32 wide, 32 high, 3 color channels), so a single fully connected neuron in the first hidden layer of a regular neural network would have 32*32*3 = 3,072 weights. A 200×200 image, however, would lead to neurons that have 200*200*3 = 120,000 weights.

Also, such a network architecture does not take into account the spatial structure of data, treating input pixels which are far apart in the same way as pixels that are close together. This ignores spatially local input patterns.

Convolutional neural networks are variants of multilayer perceptrons, designed to emulate the behavior of a visual cortex. These models mitigate the challenges posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images. As opposed to MLPs, CNNs have the following distinguishing features:

3D volumes of neurons: the layers of a CNN have neurons arranged in 3 dimensions: width, height and depth.^[71] Each neuron inside a convolutional layer is connected to only a small region of the layer before it, called a receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN architecture.
Local connectivity: following the concept of receptive fields, CNNs exploit spatial locality by enforcing a local connectivity pattern between neurons of adjacent layers. The architecture thus ensures that the learned "filters" produce the strongest response to a spatially local input pattern. Stacking many such layers leads to nonlinear filters that become increasingly global (i.e. responsive to a larger region of pixel space) so that the network first creates representations of small parts of the input, then from them assembles representations of larger areas.
Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given convolutional layer respond to the same feature within their specific response field. Replicating units in this way allows for the resulting activation map to be equivariant under shifts of the locations of input features in the visual field, i.e. they grant translational equivariance - given that the layer has a stride of one.^[72]
Pooling: In a CNN's pooling layers, feature maps are divided into rectangular sub-regions, and the features in each rectangle are independently down-sampled to a single value, commonly by taking their average or maximum value. In addition to reducing the sizes of feature maps, the pooling operation grants a degree of local translational invariance to the features contained therein, allowing the CNN to be more robust to variations in their positions.^[14]

Together, these properties allow CNNs to achieve better generalization on vision problems. Weight sharing dramatically reduces the number of free parameters learned, thus lowering the memory requirements for running the network and allowing the training of larger, more powerful networks.

Building blocks

A CNN architecture is formed by a stack of distinct layers that transform the input volume into an output volume (e.g. holding the class scores) through a differentiable function. A few distinct types of layers are commonly used. These are further discussed below.

Convolutional layer

The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.^[73]^{[nb 1]}

Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input. Each entry in an activation map uses the same set of parameters that define the filter.
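A minimal sketch of this forward pass in plain Python, using nested lists. The function name `conv_layer_forward`, the `[channels][height][width]` layout, a stride of 1 and the absence of padding are all assumptions made for illustration; as in most deep learning code, the operation computed is cross-correlation (the kernel is not flipped).

```python
def conv_layer_forward(volume, filters, biases):
    """Naive convolutional layer: stride 1, no padding.

    volume  : [channels][height][width]
    filters : [num_filters][channels][k][k]
    biases  : [num_filters]
    returns : [num_filters][out_height][out_width]
    """
    channels, height, width = len(volume), len(volume[0]), len(volume[0][0])
    k = len(filters[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    output = []
    for filt, bias in zip(filters, biases):
        activation_map = [[0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                # Dot product between the filter and the input patch,
                # taken over the full depth of the input volume.
                total = bias
                for c in range(channels):
                    for dy in range(k):
                        for dx in range(k):
                            total += filt[c][dy][dx] * volume[c][y + dy][x + dx]
                activation_map[y][x] = total
        output.append(activation_map)  # stack maps along the depth dimension
    return output


volume = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3 x 3
filters = [[[[1, 0], [0, 1]]],                 # filter 0: sums the diagonal
           [[[1, 1], [1, 1]]]]                 # filter 1: sums the whole patch
biases = [0, 1]
print(conv_layer_forward(volume, filters, biases))
# [[[6, 8], [12, 14]], [[13, 17], [25, 29]]]
```

Every entry of one activation map is produced by the same `filt` and `bias`, which is the parameter reuse described above.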

Self-supervised learning has been adapted for use in convolutional layers by using sparse patches with a high-mask ratio and a global response normalization layer.^{[citation needed]}

Local connectivity

When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account. Convolutional networks exploit spatially local correlation by enforcing a sparse local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume.

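The extent of this connectivity is a hyperparameter called the receptive field of the neuron. The connections are local in space (along width and height), but always extend along the entire depth of the input volume. Such an architecture ensures that the learned (British English: learnt) filters produce the strongest response to a spatially local input pattern.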

Spatial arrangement

Three hyperparameters control the size of the output volume of the convolutional layer: the depth, stride, and padding size:

The depth of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features in the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of various oriented edges, or blobs of color.
Stride controls how depth columns around the width and height are allocated. If the stride is 1, then we move the filters one pixel at a time. This leads to heavily overlapping receptive fields between the columns, and to large output volumes. For any integer ${\textstyle S>0,}$ a stride S means that the filter is translated S units at a time per output. In practice, ${\textstyle S\geq 3}$ is rare. A greater stride means smaller overlap of receptive fields and smaller spatial dimensions of the output volume.^[74]
Sometimes, it is convenient to pad the input with zeros (or other values, such as the average of the region) on the border of the input volume. The size of this padding is a third hyperparameter. Padding provides control of the output volume's spatial size. In particular, it is sometimes desirable to preserve the spatial size of the input volume exactly; this is commonly referred to as "same" padding.

The spatial size of the output volume is a function of the input volume size $W$ , the kernel field size $K$ of the convolutional layer neurons, the stride $S$ , and the amount of zero padding $P$ on the border. The number of neurons that "fit" in a given volume is then:

$\frac{W-K+2P}{S}+1.$

If this number is not an integer, then the strides are incorrect and the neurons cannot be tiled to fit across the input volume in a symmetric way. In general, setting zero padding to be ${\textstyle P=(K-1)/2}$ when the stride is $S=1$ ensures that the input volume and output volume will have the same size spatially. However, it is not always completely necessary to use all of the neurons of the previous layer. For example, a neural network designer may decide to use just a portion of padding.
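The output-size formula can be sketched as a small helper. The function name is hypothetical, and raising an error when the neurons do not tile the input evenly is a choice made here to mirror the text; many libraries instead round the division down.

```python
def conv_output_size(w, k, p, s):
    """Number of neurons that fit along one spatial axis: (W - K + 2P) / S + 1."""
    span = w - k + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("kernel, padding and stride do not tile the input evenly")
    return span // s + 1


# "Same" padding: P = (K - 1) / 2 with stride 1 preserves the spatial size.
print(conv_output_size(w=32, k=5, p=2, s=1))  # 32
# A stride of 2 without padding shrinks the output.
print(conv_output_size(w=7, k=3, p=0, s=2))   # 3
```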

Parameter sharing

A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on the assumption that if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Denoting a single 2-dimensional slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias.

Since all neurons in a single depth slice share the same parameters, the forward pass in each depth slice of the convolutional layer can be computed as a convolution of the neuron's weights with the input volume.^{[nb 2]} Therefore, it is common to refer to the sets of weights as a filter (or a kernel), which is convolved with the input. The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume. Parameter sharing contributes to the translation invariance of the CNN architecture.^[14]

Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centered structure, for which we expect completely different features to be learned at different spatial locations. One practical example is when the inputs are faces that have been centered in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a "locally connected layer".
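To give a sense of what parameter sharing saves, the sketch below counts free parameters for a convolutional layer and for a locally connected layer of the same geometry. The function names and the example sizes (a 32 × 32 × 3 input, 5 × 5 kernels, 8 filters, stride 1, no padding) are assumptions chosen for illustration.

```python
def shared_params(k, c_in, num_filters):
    """Convolutional layer: one (k*k*c_in weights + 1 bias) filter per depth slice."""
    return (k * k * c_in + 1) * num_filters


def locally_connected_params(k, c_in, num_filters, out_h, out_w):
    """Locally connected layer: every output position owns its own filter."""
    return shared_params(k, c_in, num_filters) * out_h * out_w


out_size = 32 - 5 + 1  # 28 positions per axis for stride 1 and no padding
print(shared_params(5, 3, 8))                                  # 608
print(locally_connected_params(5, 3, 8, out_size, out_size))   # 476672
```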

Pooling layer

Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling, where max pooling is the most common. It partitions the input image into a set of rectangles and, for each such sub-region, outputs the maximum.

Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture.^[73]^{: 460–461} While pooling layers contribute to local translation invariance, they do not provide global translation invariance in a CNN, unless a form of global pooling is used.^[14]^[72] The pooling layer commonly operates independently on every depth, or slice, of the input and resizes it spatially. A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations:

f_{X,Y}(S)=\max _{a,b=0}^{1}S_{2X+a,2Y+b}.

In this case, every max operation is over 4 numbers. The depth dimension remains unchanged (this is true for other forms of pooling as well).
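The 2×2, stride-2 max pooling described above can be sketched for a single depth slice as follows. The function name and the nested-list representation are assumptions; rows or columns left over when a dimension is odd are simply dropped in this sketch.

```python
def max_pool_2x2(depth_slice):
    """2x2 max pooling with stride 2 over one depth slice (a list of rows)."""
    out_h = len(depth_slice) // 2
    out_w = len(depth_slice[0]) // 2
    return [
        [
            max(depth_slice[2 * y + a][2 * x + b] for a in (0, 1) for b in (0, 1))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


depth_slice = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [2, 6, 7, 3],
]
print(max_pool_2x2(depth_slice))  # [[4, 5], [6, 9]]
```

Sixteen activations go in and four come out, which corresponds to discarding 75% of the activations.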

In addition to max pooling, pooling units can use other functions, such as average pooling or ℓ₂-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which generally performs better in practice.^[75]

Due to the effects of fast spatial reduction of the size of the representation,^{[which?]} there is a recent trend towards using smaller filters^[76] or discarding pooling layers altogether.^[77]

"Region of Interest" pooling (also known as RoI pooling) is a variant of max pooling, in which output size is fixed and input rectangle is a parameter.^{[citation needed]}

Pooling is a downsampling method and an important component of convolutional neural networks for object detection based on the Fast R-CNN^[78] architecture.

Channel Max Pooling

A channel max pooling (CMP) layer applies the max pooling (MP) operation along the channel axis, among the corresponding positions of consecutive feature maps, in order to eliminate redundant information. CMP gathers the significant features into fewer channels, which is important for fine-grained image classification that needs more discriminating features. Another advantage of CMP is that it reduces the number of channels of the feature maps before they are connected to the first fully connected (FC) layer. As with the MP operation, the input and output feature maps of a CMP layer are denoted $F\in \mathbb{R}^{C\times M\times N}$ and $C\in \mathbb{R}^{c\times M\times N}$, respectively, where $C$ and $c$ are the channel numbers of the input and output feature maps, and $M$ and $N$ are the width and height of the feature maps. Note that the CMP operation changes only the channel number of the feature maps; their width and height are unchanged, which is different from the MP operation.^[79]
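One possible reading of this operation is sketched below: consecutive channels are split into non-overlapping groups, and each group is reduced to a single channel by an element-wise maximum. The function name, the `group` parameter and the non-overlapping grouping are assumptions, since the text does not specify the window or stride used along the channel axis.

```python
def channel_max_pool(feature_maps, group):
    """Element-wise max over non-overlapping groups of consecutive channels.

    feature_maps : [C][M][N]
    returns      : [C // group][M][N]  (spatial size is unchanged)
    """
    num_channels = len(feature_maps)
    if num_channels % group != 0:
        raise ValueError("channel count must be divisible by the group size")
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    pooled = []
    for start in range(0, num_channels, group):
        block = feature_maps[start:start + group]
        pooled.append(
            [[max(ch[y][x] for ch in block) for x in range(cols)] for y in range(rows)]
        )
    return pooled


maps = [[[1, 8]], [[5, 2]], [[0, 0]], [[3, -1]]]  # 4 channels, each 1 x 2
print(channel_max_pool(maps, group=2))            # [[[5, 8]], [[3, 0]]]
```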

ReLU layer

ReLU is the abbreviation of rectified linear unit introduced by Kunihiko Fukushima in 1969.^[35]^[36] ReLU applies the non-saturating activation function ${\textstyle f(x)=\max(0,x)}$ .^[67] It effectively removes negative values from an activation map by setting them to zero.^[80] It introduces nonlinearity to the decision function and in the overall network without affecting the receptive fields of the convolution layers. In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that ReLU enables better training of deeper networks,^[81] compared to widely used activation functions prior to 2011.

Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent $f(x)=\tanh(x)$, $f(x)=|\tanh(x)|$, and the sigmoid function ${\textstyle \sigma (x)=(1+e^{-x})^{-1}}$. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.^[82]
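For reference, the activation functions named in this section can be written as scalar Python functions. This is only a sketch using the standard `math` module; the function names are chosen here for illustration.

```python
import math


def relu(x):
    """Rectified linear unit: non-saturating for positive inputs."""
    return max(0.0, x)


def abs_tanh(x):
    """Absolute value of the hyperbolic tangent: saturates at 1."""
    return abs(math.tanh(x))


def sigmoid(x):
    """Logistic sigmoid: saturates at 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), math.tanh(x), abs_tanh(x), sigmoid(x))
```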

Fully connected layer

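After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation, with matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).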

Loss layer

The "loss layer", or "loss function", specifies how training penalizes the deviation between the predicted output of the network, and the true data labels (during supervised learning). Various loss functions can be used, depending on the specific task.

The cross-entropy loss is used for predicting K independent probability values in $[0,1]$. Euclidean loss is used for regressing to real-valued labels in $(-\infty ,\infty )$.
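The two kinds of loss mentioned in this section can be sketched as follows, assuming a sigmoid cross-entropy for the K independent probabilities and a squared-error (Euclidean) loss for real-valued regression. The function names, the numerically stable formulation and the factor of 1/2 are conventions assumed here, not taken from the text.

```python
import math


def sigmoid_cross_entropy(logits, targets):
    """Sum of K independent binary cross-entropies, computed from raw scores.

    Uses the stable form max(z, 0) - z*t + log(1 + exp(-|z|)),
    which equals -[t*log(p) + (1 - t)*log(1 - p)] with p = sigmoid(z).
    """
    return sum(
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    )


def euclidean_loss(predictions, labels):
    """Half the squared Euclidean distance between predictions and labels."""
    return 0.5 * sum((p - y) ** 2 for p, y in zip(predictions, labels))


print(sigmoid_cross_entropy([0.0], [1.0]))     # log(2), about 0.693
print(euclidean_loss([1.0, 2.0], [1.0, 4.0]))  # 2.0
```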

Hyperparameters

Hyperparameters are various settings that are used to control the learning process. CNNs use more hyperparameters than a standard multilayer perceptron (MLP).

Kernel size

The kernel size is the number of pixels processed together. It is typically expressed as the kernel's dimensions, e.g., 2x2, or 3x3.

Padding

Padding is the addition of (typically) 0-valued pixels on the borders of an image. This is done so that the border pixels are not undervalued (lost) from the output because they would ordinarily participate in only a single receptive field instance. The padding applied is typically one less than the corresponding kernel dimension. For example, a convolutional layer using 3x3 kernels would receive a 2-pixel pad, that is 1 pixel on each side of the image.^{[citation needed]}

Stride

The stride is the number of pixels that the analysis window moves on each iteration. A stride of 2 means that each kernel is offset by 2 pixels from its predecessor.

Number of filters

Since feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of feature values v_a with pixel position is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

The number of feature maps directly controls the capacity and depends on the number of available examples and task complexity.

Filter size

Common filter sizes found in the literature vary greatly, and are usually chosen based on the data set.

The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.

Pooling type and size

Max pooling is typically used, often with a 2x2 dimension. This implies that the input is drastically downsampled, reducing processing cost.

Greater pooling reduces the dimension of the signal, and may result in unacceptable information loss. Often, non-overlapping pooling windows perform best.^[75]
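A minimal sketch of non-overlapping 2x2 max pooling on a small feature map follows. The function name and the sample values are illustrative assumptions.

```python
def max_pool_2d(image, size=2, stride=None):
    """Max pooling over size x size windows; the stride defaults to the window size."""
    stride = size if stride is None else stride
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(max(window))
        pooled.append(out_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool_2d(feature_map)  # [[6, 5], [7, 9]]
```

The 4x4 input shrinks to 2x2, so three quarters of the activations are discarded.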

Dilation

Dilation involves ignoring pixels within a kernel. This potentially reduces processing and memory requirements without significant signal loss. A dilation of 2 on a 3x3 kernel expands the kernel to 5x5, while still processing 9 (evenly spaced) pixels. Accordingly, a dilation of 3 expands the kernel to 7x7, and a dilation of 4 to 9x9.
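The footprint of a dilated kernel can be computed directly. The sketch below assumes the common convention in which a dilation of 1 means no gaps, so the effective size along one dimension is d(k - 1) + 1; the function names are assumptions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent covered by a dilated kernel along one dimension."""
    return dilation * (kernel_size - 1) + 1


def dilated_tap_offsets(kernel_size, dilation):
    """Positions, relative to the window start, that the kernel actually reads."""
    return [i * dilation for i in range(kernel_size)]


size = effective_kernel_size(3, 2)  # 5
taps = dilated_tap_offsets(3, 2)    # [0, 2, 4]
```

A 3x3 kernel with dilation 2 therefore reads rows and columns 0, 2 and 4 of a 5x5 window: still 9 pixels, spread over a larger area.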

Translation equivariance and aliasing

It is commonly assumed that CNNs are invariant to shifts of the input. Convolution or pooling layers within a CNN that do not have a stride greater than one are indeed equivariant to translations of the input.^[72] However, layers with a stride greater than one ignore the Nyquist-Shannon sampling theorem and might lead to aliasing of the input signal.^[72] While, in principle, CNNs are capable of implementing anti-aliasing filters, it has been observed that this does not happen in practice,^[83] yielding models that are not equivariant to translations. Furthermore, if a CNN makes use of fully connected layers, translation equivariance does not imply translation invariance, as the fully connected layers are not invariant to shifts of the input.^[84]^[14] One solution for complete translation invariance is avoiding any down-sampling throughout the network and applying global average pooling at the last layer.^[72] Additionally, several other partial solutions have been proposed, such as anti-aliasing before downsampling operations,^[85] spatial transformer networks,^[86] data augmentation, subsampling combined with pooling,^[14] and capsule neural networks.^[87]
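The effect of stride on equivariance can be seen in a small one-dimensional sketch. The signal, kernel and function names are illustrative assumptions; the shift is zero-filled rather than circular.

```python
def correlate_1d(signal, kernel, stride=1):
    """Valid (unpadded) 1-D cross-correlation with a given stride."""
    k = len(kernel)
    return [
        sum(signal[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def shift_right(signal, steps=1):
    """Shift the signal right by `steps`, filling with zeros."""
    return [0] * steps + signal[:len(signal) - steps]


signal = [0, 0, 1, 5, 2, 0, 0, 0]
kernel = [1, -1]

# Stride 1: shifting the input by one sample shifts the output by one sample.
dense = correlate_1d(signal, kernel)
dense_shifted = correlate_1d(shift_right(signal), kernel)
equivariant_stride1 = dense_shifted[1:] == dense[:-1]  # True

# Stride 2: the shifted input produces different values, not a shifted copy.
strided = correlate_1d(signal, kernel, stride=2)                       # [0, -4, 2, 0]
strided_shifted = correlate_1d(shift_right(signal), kernel, stride=2)  # [0, -1, 3, 0]
```

With a stride of 2, a one-sample shift changes which positions are sampled, so the two outputs contain different values altogether.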

Evaluation

The accuracy of the final model is typically assessed on a subset of the dataset set aside at the start, often called a test set. Alternatively, methods such as k-fold cross-validation are applied. Other strategies include using conformal prediction.^[88]^[89]
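As a sketch of k-fold cross-validation, the helper below splits sample indices into k contiguous folds, each serving once as the held-out set. The function name is an assumption, and shuffling the data beforehand is omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(5, 2)
# [([3, 4], [0, 1, 2]), ([0, 1, 2], [3, 4])]
```

The model is trained k times, once per fold, and the k held-out scores are averaged.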

Regularization methods

Regularization is a process of introducing additional information to solve an ill-posed problem or to prevent overfitting. CNNs use various types of regularization.

Empirical

Dropout

Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout, introduced in 2014.^[90]

At each training stage, individual nodes are either "dropped out" of the net (ignored) with probability $1-p$ or kept with probability $p$, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights.

In the training stages, $p$ is usually 0.5; for input nodes, it is typically much higher because information is directly lost when input nodes are ignored.

At testing time after training has finished, we would ideally like to find a sample average of all possible $2^{n}$ dropped-out networks; unfortunately this is infeasible for large values of $n$. However, we can find an approximation by using the full network with each node's output weighted by a factor of $p$, so the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates $2^{n}$ neural nets, and as such allows for model combination, at test time only a single network needs to be tested.

By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes the model combination practical, even for deep neural networks. The technique seems to reduce node interactions, leading them to learn more robust features that better generalize to new data.
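The train-time masking and test-time scaling described above can be sketched as follows. The function names, the fixed seed and the sample activations are assumptions for illustration; `p_keep` plays the role of $p$.

```python
import random


def dropout_train(activations, p_keep, rng):
    """Keep each activation with probability p_keep; otherwise output zero."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]


def dropout_test(activations, p_keep):
    """Use every node, with its output weighted by p_keep."""
    return [a * p_keep for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
p_keep = 0.5

# Averaging over many sampled masks approaches the scaled test-time output.
n_trials = 20000
sums = [0.0] * len(activations)
for _ in range(n_trials):
    for i, value in enumerate(dropout_train(activations, p_keep, rng)):
        sums[i] += value
mean_train = [s / n_trials for s in sums]
expected = dropout_test(activations, p_keep)  # [0.5, 1.0, 1.5, 2.0]
```

The averaged training outputs approach the scaled test-time outputs, which is why a single scaled network can stand in for the ensemble of thinned networks.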

DropConnect

DropConnect is the generalization of dropout in which each connection, rather than each output unit, can be dropped with probability $1-p$. Each unit thus receives input from a random subset of units in the previous layer.^[91]

DropConnect is similar to dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage.
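A training-time forward pass with DropConnect might look like the sketch below, where the random mask is applied to individual weights rather than to unit outputs. The function name and the values are illustrative assumptions.

```python
import random


def dropconnect_forward(inputs, weights, p_keep, rng):
    """Fully connected pass in which each weight is kept with probability p_keep."""
    outputs = []
    for row in weights:
        # A fresh mask is drawn for every connection.
        masked = [w if rng.random() < p_keep else 0.0 for w in row]
        outputs.append(sum(w * x for w, x in zip(masked, inputs)))
    return outputs


rng = random.Random(0)
inputs = [1.0, 2.0, 3.0]
weights = [[1.0, 1.0, 1.0],
           [0.5, 0.0, -0.5]]

dense = dropconnect_forward(inputs, weights, 1.0, rng)   # nothing dropped: [6.0, -1.0]
sparse = dropconnect_forward(inputs, weights, 0.5, rng)  # a random subset of connections
```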

Stochastic pooling

A major drawback to Dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected.

In 2013, even before dropout, a technique called stochastic pooling was introduced,^[92] in which the conventional deterministic pooling operations are replaced with a stochastic procedure: the activation within each pooling region is picked randomly according to a multinomial distribution given by the activities within the pooling region. This approach is free of hyperparameters and can be combined with other regularization approaches, such as dropout and data augmentation.

An alternate view of stochastic pooling is that it is equivalent to standard max pooling but with many copies of an input image, each having small local deformations. This is similar to explicit elastic deformations of the input images,^[93] which delivers excellent performance on the MNIST data set.^[93]

Using stochastic pooling in a multilayer model gives an exponential number of deformations since the selections in higher layers are independent of those below.
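The sampling step for a single pooling region can be sketched as below. It assumes non-negative activations (for example, after a ReLU) so that they can be normalized into probabilities; the function name is an assumption.

```python
import random


def stochastic_pool_region(activations, rng):
    """Pick one activation with probability proportional to its value."""
    if any(a < 0 for a in activations):
        raise ValueError("activations must be non-negative")
    if sum(activations) == 0:
        return 0.0  # nothing is active, so there is nothing to sample
    return rng.choices(activations, weights=activations, k=1)[0]


rng = random.Random(0)
region = [1.0, 2.0, 5.0, 0.0]  # normalized probabilities: 0.125, 0.25, 0.625, 0
sample = stochastic_pool_region(region, rng)
```

Larger activations are chosen more often, but smaller non-zero ones still have a chance, unlike in max pooling.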

Artificial data

Because the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting. Because there is often not enough available data to train, especially considering that some part should be spared for later testing, two approaches are to either generate new data from scratch (if possible) or perturb existing data to create new examples. The latter approach has been used since the mid-1990s.^[51] For example, input images can be cropped, rotated, or rescaled to create new examples with the same labels as the original training set.^[94]
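Perturbing existing data can be sketched with two simple transformations, a random crop and a 90-degree rotation, each reusing the original label. The function names, the tiny image and the label are hypothetical.

```python
import random


def rotate_90(image):
    """Rotate a 2-D list of pixels clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, label, rng, crop_h, crop_w):
    """Return new (image, label) pairs that keep the original label."""
    return [
        (random_crop(image, crop_h, crop_w, rng), label),
        (rotate_90(image), label),
    ]


rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
examples = augment(image, "seven", rng, crop_h=2, crop_w=2)
```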

Explicit

Early stopping

One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur. It comes with the disadvantage that the learning process is halted.
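One common way to decide when to stop, assumed here rather than stated above, is to monitor the loss on held-out validation data and halt once it has not improved for a set number of epochs. The `patience` parameter and the loss values are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch whose weights would be kept under a patience rule."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Validation loss falls, then rises; training halts before the last epoch is seen.
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
kept = early_stopping_epoch(losses, patience=2)  # 2
```

The example also shows the drawback noted above: training halts at epoch 4, so the lower loss at epoch 5 is never reached.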

Number of parameters

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth. For convolutional networks, the filter size also affects the number of parameters. Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting. This is equivalent to a "zero norm".

Weight decay

A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of absolute weights (L1 norm) or the squared magnitude (L2 norm) of the weight vector, to the error at each node. The level of acceptable model complexity can be reduced by increasing the proportionality constant ('alpha' hyperparameter), thus increasing the penalty for large weight vectors.

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.

L1 regularization is also common. It makes the weight vectors sparse during optimization. In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs. L1 with L2 regularization can be combined; this is called elastic net regularization.
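The penalties discussed above can be compared on two weight vectors with the same L1 norm, one "peaky" and one "diffuse". The function names and the coefficients `alpha_l1` and `alpha_l2` are illustrative assumptions.

```python
def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def elastic_net_penalty(weights, alpha_l1, alpha_l2):
    """Weighted combination of the L1 and L2 penalties."""
    return alpha_l1 * l1_penalty(weights) + alpha_l2 * l2_penalty(weights)


peaky = [1.0, 0.0, 0.0, 0.0]
diffuse = [0.25, 0.25, 0.25, 0.25]

l1_peaky, l1_diffuse = l1_penalty(peaky), l1_penalty(diffuse)  # 1.0 and 1.0
l2_peaky, l2_diffuse = l2_penalty(peaky), l2_penalty(diffuse)  # 1.0 and 0.25
combined = elastic_net_penalty(diffuse, alpha_l1=0.5, alpha_l2=2.0)  # 1.0
```

The L1 penalty cannot tell the two vectors apart, while the L2 penalty charges the peaky one four times as much, which is the sense in which L2 prefers diffuse weights.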

Max norm constraints

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector ${\vec {w}}$ of every neuron to satisfy $\|{\vec {w}}\|_{2}<c$. Typical values of $c$ are on the order of 3–4. Some papers report improvements^[95] when using this form of regularization.
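The clamping step can be sketched as a rescaling applied after each ordinary parameter update. The function name is an assumption; note that this projection places an oversized vector exactly on the boundary, where its norm equals $c$.

```python
import math


def clamp_max_norm(weights, c):
    """Rescale a neuron's weight vector so that its L2 norm does not exceed c."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= c or norm == 0.0:
        return list(weights)  # already within the bound
    scale = c / norm
    return [w * scale for w in weights]


within = clamp_max_norm([3.0, 4.0], 10.0)  # unchanged: [3.0, 4.0]
clamped = clamp_max_norm([6.0, 8.0], 5.0)  # norm 10 rescaled to 5: [3.0, 4.0]
```

Only the length of the vector changes; its direction is preserved.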

Hierarchical coordinate frames

Pooling loses the precise spatial relationships between high-level parts (such as nose and mouth in a face image). These relationships are needed for identity recognition. Overlapping the pools so that each feature occurs in multiple pools, helps retain the information. Translation alone cannot extrapolate the understanding of geometric relationships to a radically new viewpoint, such as a different orientation or scale. On the other hand, people are very good at extrapolating; after seeing a new shape once they can recognize it from a different viewpoint.^[96]

An earlier common way to deal with this problem is to train the network on transformed data in different orientations, scales, lighting, etc. so that the network can cope with these variations. This is computationally intensive for large data-sets. The alternative is to use a hierarchy of coordinate frames and use a group of neurons to represent a conjunction of the shape of the feature and its pose relative to the retina. The pose relative to the retina is the relationship between the coordinate frame of the retina and the intrinsic features' coordinate frame.^[97]

Thus, one way to represent something is to embed the coordinate frame within it. This allows large features to be recognized by using the consistency of the poses of their parts (e.g. nose and mouth poses make a consistent prediction of the pose of the whole face). This approach ensures that the higher-level entity (e.g. face) is present when the lower-level entities (e.g. nose and mouth) agree on their prediction of its pose. The vectors of neuronal activity that represent pose ("pose vectors") allow spatial transformations modeled as linear operations that make it easier for the network to learn the hierarchy of visual entities and generalize across viewpoints. This is similar to the way the human visual system imposes coordinate frames in order to represent shapes.^[98]

Applications

Image recognition

CNNs are often used in image recognition systems. In 2012, an error rate of 0.23% on the MNIST database was reported.^[26] Another paper on using CNN for image classification reported that the learning process was "surprisingly fast"; in the same paper, the best published results as of 2011 were achieved in the MNIST database and the NORB database.^[23]

Subsequently, a similar CNN called AlexNet won the ImageNet Large Scale Visual Recognition Challenge 2012.

When applied to

root mean square error.^[45]

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object classification and detection, with millions of images and hundreds of object classes. In the ILSVRC 2014,^[101] almost every highly ranked team used a CNN as their basic framework. The winner GoogLeNet^[102] (the foundation of DeepDream) increased the mean average precision of object detection to 0.439329, and reduced classification error to 0.06656, the best result at the time. Its network applied more than 30 layers. That performance of convolutional neural networks on the ImageNet tests was close to that of humans.^[103] The best algorithms still struggle with objects that are small or thin, such as a small ant on a stem of a flower or a person holding a quill in their hand. They also have trouble with images that have been distorted with filters, an increasingly common phenomenon with modern digital cameras. By contrast, those kinds of images rarely trouble humans. Humans, however, tend to have trouble with other issues. For example, they are not good at classifying objects into fine-grained categories such as the particular breed of dog or species of bird, whereas convolutional neural networks handle this.

In 2015, a many-layered CNN demonstrated the ability to spot faces from a wide range of angles, including upside down, even when partially occluded, with competitive performance. The network was trained on a database of 200,000 images that included faces at various angles and orientations and a further 20 million images without faces. Training used batches of 128 images over 50,000 iterations.^[104]

Video analysis

Compared to image data domains, there is relatively little work on applying CNNs to video classification. Video is more complex than images since it has another (temporal) dimension. However, some extensions of CNNs into the video domain have been explored. One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.

Text-to-Video model.

Natural language processing

CNNs have also been explored for natural language processing. CNN models are effective for various NLP problems and achieved excellent results in semantic parsing,^[114] search query retrieval,^[115] sentence modeling,^[116] classification,^[117] prediction^[118] and other traditional NLP tasks.^[119] Compared to traditional language processing methods such as recurrent neural networks, CNNs can represent different contextual realities of language that do not rely on a series-sequence assumption, while RNNs are better suited when classical time series modeling is required.^[120]^[121]^[122]^[123]

Anomaly detection

A CNN with 1-D convolutions was used on time series in the frequency domain (spectral residual) by an unsupervised model to detect anomalies in the time domain.^[124]

Drug discovery

CNNs have been used in

Ebola virus^[127] and multiple sclerosis.^[128]

Checkers game

CNNs have been used in the game of

Chinook at its "expert" level of play.^[131]

Go

CNNs have been used in computer Go. In December 2014, Clark and Storkey published a paper showing that a CNN trained by supervised learning from a database of human professional games could outperform GNU Go and win some games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took Fuego to play.^[132] Later it was announced that a large 12-layer convolutional neural network had correctly predicted the professional move in 55% of positions, equalling the accuracy of a 6 dan human player. When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GNU Go in 97% of games, and matched the performance of the Monte Carlo tree search program Fuego simulating ten thousand playouts (about a million positions) per move.^[133]

A couple of CNNs for choosing moves to try ("policy network") and evaluating positions ("value network") driving MCTS were used by AlphaGo, the first to beat the best human player at the time.^[134]

Time series forecasting

Recurrent neural networks are generally considered the best neural network architectures for time series forecasting (and sequence modeling in general), but recent studies show that convolutional networks can perform comparably or even better.^[135]^[11] Dilated convolutions^[136] might enable one-dimensional convolutional neural networks to effectively learn time series dependences.^[137] Convolutions can be implemented more efficiently than RNN-based solutions, and they do not suffer from vanishing (or exploding) gradients.^[138] Convolutional networks can provide an improved forecasting performance when there are multiple similar time series to learn from.^[139] CNNs can also be applied to further tasks in time series analysis (e.g., time series classification^[140] or quantile forecasting^[141]).

Cultural heritage and 3D-datasets

As archaeological findings like clay tablets with cuneiform writing are increasingly acquired using 3D scanners, first benchmark datasets are becoming available, like HeiCuBeDa,^[142] providing almost 2,000 normalized 2D- and 3D-datasets prepared with the GigaMesh Software Framework.^[143] Curvature-based measures are used in conjunction with Geometric Neural Networks (GNNs), e.g. for period classification of those clay tablets, which are among the oldest documents of human history.^[144]^[145]

Fine-tuning

For many applications, little training data is available. Convolutional neural networks usually require a large amount of training data in order to avoid overfitting. A common technique is to train the network on a larger data set from a related domain. Once the network parameters have converged, an additional training step is performed using the in-domain data to fine-tune the network weights; this is known as transfer learning. Furthermore, this technique allows convolutional network architectures to be applied successfully to problems with tiny training sets.^[146]

Human interpretable explanations

End-to-end training and prediction are common practice in computer vision. However, human interpretable explanations are required for critical systems such as self-driving cars.^[147] With recent advances in visual salience, spatial attention, and temporal attention, the most critical spatial regions/temporal instants could be visualized to justify the CNN predictions.^[148]^[149]

Related architectures

Deep Q-networks

A deep Q-network (DQN) is a type of deep learning model that combines a deep neural network with Q-learning, a form of reinforcement learning. Unlike earlier reinforcement learning agents, DQNs that utilize CNNs can learn directly from high-dimensional sensory inputs via reinforcement learning.^[150]

Preliminary results were presented in 2014, with an accompanying paper in February 2015.^[151] The research described an application to Atari 2600 gaming. Other deep reinforcement learning models preceded it.^[152]

Deep belief networks

Convolutional deep belief networks (CDBN) have structure very similar to convolutional neural networks and are trained similarly to deep belief networks. Therefore, they exploit the 2D structure of images, like CNNs do, and make use of pre-training like deep belief networks. They provide a generic structure that can be used in many image and signal processing tasks. Benchmark results on standard image datasets like CIFAR^[153] have been obtained using CDBNs.^[154]

Notable libraries

Caffe: A library for convolutional neural networks. Created by the Berkeley Vision and Learning Center (BVLC). It supports both CPU and GPU. Developed in C++, and has Python and MATLAB wrappers.
Deeplearning4j: Deep learning in Java and Scala on multi-GPU-enabled Spark. A general-purpose deep learning library for the JVM production stack running on a C++ scientific computing engine. Allows the creation of custom layers. Integrates with Hadoop and Kafka.
Dlib: A toolkit for making real world machine learning and data analysis applications in C++.
Microsoft Cognitive Toolkit: A deep learning toolkit written by Microsoft with several unique features enhancing scalability over multiple nodes. It supports full-fledged interfaces for training in C++ and Python and with additional support for model inference in C# and Java.
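TensorFlow: Apache 2.0-licensed Theano-like library with support for CPU, GPU, Google's proprietary tensor processing unit (TPU),^[155] and mobile devices.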
Theano: The reference deep-learning library for Python with an API largely compatible with the popular NumPy library. Allows user to write symbolic mathematical expressions, then automatically generates their derivatives, saving the user from having to code gradients or backpropagation. These symbolic expressions are automatically compiled to CUDA code for a fast, on-the-GPU implementation.
Torch: A scientific computing framework with wide support for machine learning algorithms, written in C and Lua.

Notes

^ When applied to other types of data than image data, such as sound data, "spatial position" may variously correspond to different points in the time domain, frequency domain, or other mathematical spaces.
^ hence the name "convolutional layer"
categorical data
.

References

^
ISBN 978-1-351-65032-8. Archived
from the original on 2023-10-16. Retrieved 2020-12-13.

^
ISBN 978-3-030-32644-9. Archived
from the original on 2023-10-16. Retrieved 2020-12-13.

S2CID 213010088. Archived
from the original on 2023-07-31. Retrieved 2023-08-12.

S2CID 219470398. Archived
from the original on 2023-06-29. Retrieved 2023-08-12. Convolutional neural networks represent deep learning architectures that are currently used in a wide range of applications, including computer vision, speech recognition, malware dedection, time series analysis in finance, and many others.

^ OCLC 987790957.

^ Atlas, Homma, and Marks. "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification" (PDF). Neural Information Processing Systems (NIPS 1987). 1. Archived (PDF) from the original on 2021-04-14.

S2CID 218955622
. Convolutional neural networks are a promising tool for solving the problem of pattern recognition.

^ van den Oord, Aaron; Dieleman, Sander; Schrauwen, Benjamin (2013-01-01). Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, K. Q. (eds.). Deep content-based music recommendation (PDF). Curran Associates, Inc. pp. 2643–2651. Archived (PDF) from the original on 2022-03-07. Retrieved 2022-03-31.

S2CID 2617020
.

S2CID 221386616. Archived
(PDF) from the original on 2022-05-19. Retrieved 2023-07-21.

^
S2CID 4950757
.

^ ^a ^b ^c Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics. Archived from the original on 2020-06-23. Retrieved 2020-06-22.

^
PMID 20577468. Archived
from the original on 2017-02-06. Retrieved 2016-09-22.

^
S2CID 232269854. Archived
from the original on 2021-06-27. Retrieved 2021-03-26.

PMID 31430292
.

^
doi:10.4249/scholarpedia.1717
.

^
PMID 4966457
.

^
S2CID 206775608. Archived
(PDF) from the original on 3 June 2014. Retrieved 16 November 2013.

^
PMID 12850007. Archived
(PDF) from the original on 13 December 2013. Retrieved 17 November 2013.

^ Convolutional Neural Networks Demystified: A Matched Filtering Perspective Based Tutorial https://arxiv.org/abs/2108.11663v3

^ "Convolutional Neural Networks (LeNet) – DeepLearning 0.1 documentation". DeepLearning 0.1. LISA Lab. Archived from the original on 28 December 2017. Retrieved 31 August 2013.

arXiv:1610.02357 [cs.CV
].

^ ^a ^b ^c Ciresan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gambardella; Jurgen Schmidhuber (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence-Volume Volume Two. 2: 1237–1242. Archived (PDF) from the original on 5 April 2022. Retrieved 17 November 2013.

^ Krizhevsky, Alex. "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). Archived (PDF) from the original on 25 April 2021. Retrieved 17 November 2013.

^ ^a ^b Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.

^
S2CID 2161592
.

arXiv:1511.07122 [cs.CV
].

arXiv:1706.05587 [cs.CV
].

arXiv:2108.07387 [cs.CV
].

^ LeCun, Yann. "LeNet-5, convolutional neural networks". Archived from the original on 24 February 2021. Retrieved 16 November 2013.

PMID 34690686
.

^
PMID 14403679
.

ISBN 978-0-19-517618-6. Archived
from the original on 2023-10-16. Retrieved 2019-01-18.

S2CID 3074096
.

^
doi:10.1109/TSSC.1969.300225
.

^
arXiv:2212.11279 [cs.NE
].

arXiv:1710.05941 [cs.NE
].

S2CID 8619176
.

^
S2CID 2309950. Archived
from the original on 2016-04-19. Retrieved 2019-01-20.

^ Homma, Toshiteru; Les Atlas; Robert Marks II (1988). "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification" (PDF). Advances in Neural Information Processing Systems. 1: 31–40. Archived (PDF) from the original on 2022-03-31. Retrieved 2022-03-31.

^ ^a ^b Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.

^ ^a ^b Alexander Waibel et al., Phoneme Recognition Using Time-Delay Neural Networks Archived 2021-02-25 at the Wayback Machine IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328. - 339 March 1989.

^ LeCun, Yann; Bengio, Yoshua (1995). "Convolutional networks for images, speech, and time series". In Arbib, Michael A. (ed.). The handbook of brain theory and neural networks (Second ed.). The MIT press. pp. 276–278. Archived from the original on 2020-07-28. Retrieved 2019-12-03.

^ John B. Hampshire and Alexander Waibel, Connectionist Architectures for Multi-Speaker Phoneme Recognition Archived 2022-03-31 at the Wayback Machine, Advances in Neural Information Processing Systems, 1990, Morgan Kaufmann.

^
S2CID 221185563. Archived
(PDF) from the original on 24 February 2021. Retrieved 17 November 2013.

^ Ko, Tom; Peddinti, Vijayaditya; Povey, Daniel; Seltzer, Michael L.; Khudanpur, Sanjeev (March 2018). A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition (PDF). The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). New Orleans, LA, US. Archived (PDF) from the original on 2018-07-08. Retrieved 2019-09-04.

^ Denker, J S, Gardner, W R, Graf, H. P, Henderson, D, Howard, R E, Hubbard, W, Jackel, L D, BaIrd, H S, and Guyon (1989) Neural network recognizer for hand-written zip code digits Archived 2018-08-04 at the Wayback Machine, AT&T Bell Laboratories

^ ^a ^b Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation Applied to Handwritten Zip Code Recognition Archived 2020-01-10 at the Wayback Machine; AT&T Bell Laboratories

^
PMID 20706526. Archived
from the original on 2017-02-06. Retrieved 2016-09-22.

^
PMID 8058017. Archived
from the original on 2017-02-06. Retrieved 2016-09-22.

^
ISBN 978-981-02-2324-3. Archived
(PDF) from the original on 2 May 2023.

^ Zhang, Wei (1991). "Error Back Propagation with Minimum-Entropy Weights: A Technique for Better Generalization of 2-D Shift-Invariant NNs". Proceedings of the International Joint Conference on Neural Networks. Archived from the original on 2017-02-06. Retrieved 2016-09-22.

^ Daniel Graupe, Ruey Wen Liu, George S Moschytz."Applications of neural networks to medical signal processing Archived 2020-07-28 at the Wayback Machine". In Proc. 27th IEEE Decision and Control Conf., pp. 343–347, 1988.

^ Daniel Graupe, Boris Vern, G. Gruener, Aaron Field, and Qiu Huang. "Decomposition of surface EMG signals into single fiber action potentials by means of neural network Archived 2019-09-04 at the Wayback Machine". Proc. IEEE International Symp. on Circuits and Systems, pp. 1008–1011, 1989.

^ Qiu Huang, Daniel Graupe, Yi Fang Huang, Ruey Wen Liu."Identification of firing patterns of neuronal signals^{[dead link]}." In Proc. 28th IEEE Decision and Control Conf., pp. 266–271, 1989. https://ieeexplore.ieee.org/document/70115 Archived 2022-03-31 at the Wayback Machine

S2CID 1304548. Archived
(PDF) from the original on 2017-08-10. Retrieved 2016-12-28.

doi:10.1016/j.patcog.2004.01.013
.

doi:10.1109/ICDAR.2005.251. Archived
from the original on 2022-03-31. Retrieved 2022-03-31.

^ Kumar Chellapilla; Sid Puri; Patrice Simard (2006). "High Performance Convolutional Neural Networks for Document Processing". In Lorette, Guy (ed.). Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft. Archived from the original on 2020-05-18. Retrieved 2016-03-14.

S2CID 2309950
.

^ Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2007). "Greedy Layer-Wise Training of Deep Networks" (PDF). Advances in Neural Information Processing Systems: 153–160. Archived (PDF) from the original on 2022-06-02. Retrieved 2022-03-31.

^ Ranzato, MarcAurelio; Poultney, Christopher; Chopra, Sumit; LeCun, Yann (2007). "Efficient Learning of Sparse Representations with an Energy-Based Model" (PDF). Advances in Neural Information Processing Systems. Archived (PDF) from the original on 2016-03-22. Retrieved 2014-06-26.

S2CID 392458. Archived
(PDF) from the original on 8 December 2020. Retrieved 22 December 2023.

S2CID 1918673
.

^ "IJCNN 2011 Competition result table". OFFICIAL IJCNN2011 COMPETITION. 2010. Archived from the original on 2021-01-17. Retrieved 2019-01-14.

^ Schmidhuber, Jürgen (17 March 2017). "History of computer vision contests won by deep CNNs on GPU". Archived from the original on 19 December 2018. Retrieved 14 January 2019.

^
S2CID 195908774. Archived
(PDF) from the original on 2017-05-16. Retrieved 2018-12-04.

S2CID 206594692. Archived
(PDF) from the original on 2022-04-05. Retrieved 2022-03-31.

S2CID 15411954. Archived
from the original on 2023-03-06. Retrieved 2022-03-31.

^ Viebke, Andre; Memeti, Suejb; Pllana, Sabri; Abraham, Ajith (2019). "CHAOS: a parallelization scheme for training convolutional neural networks on Intel Xeon Phi". The Journal of Supercomputing. 75 (1): 197–227.
S2CID 14135321
.

^ Hinton, Geoffrey (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. 1: 1097–1105. Archived from the original on 2019-12-20. Retrieved 2021-03-26 – via ACM.

^
ISSN 1533-7928. Archived
from the original on 2022-03-31. Retrieved 2022-03-31.

^
ISBN 978-1-492-03264-9
., pp. 448

^ "CS231n Convolutional Neural Networks for Visual Recognition". cs231n.github.io. Archived from the original on 2019-10-23. Retrieved 2017-04-25.

^ ^a ^b Scherer, Dominik; Müller, Andreas C.; Behnke, Sven (2010). "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition" (PDF). Artificial Neural Networks (ICANN), 20th International Conference on. Thessaloniki, Greece: Springer. pp. 92–101. Archived (PDF) from the original on 2018-04-03. Retrieved 2016-12-28.

arXiv:1412.6071 [cs.CV
].

arXiv:1412.6806 [cs.LG
].

arXiv:1504.08083 [cs.CV
].

S2CID 86674074
.

doi:10.20535/1810-0546.2017.1.88156
.

^ Xavier Glorot; Antoine Bordes; Yoshua Bengio (2011). Deep sparse rectifier neural networks (PDF). AISTATS. Archived from the original (PDF) on 2016-12-13. Retrieved 2023-04-10. Rectifier and softplus activation functions. The second one is a smooth version of the first.

^ Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012). "Imagenet classification with deep convolutional neural networks" (PDF). Advances in Neural Information Processing Systems. 1: 1097–1105. Archived (PDF) from the original on 2022-03-31. Retrieved 2022-03-31.

S2CID 231925012
.

S2CID 233219976. Archived
from the original on 2022-01-22. Retrieved 2021-03-26.

OCLC 1106340711
.

^ Jaderberg, Max; Simonyan, Karen; Zisserman, Andrew; Kavukcuoglu, Koray (2015). "Spatial Transformer Networks" (PDF). Advances in Neural Information Processing Systems. 28. Archived (PDF) from the original on 2021-07-25. Retrieved 2021-03-26 – via NIPS.

OCLC 1106278545.

S2CID 127253432. Archived
from the original on 2021-09-29. Retrieved 2021-09-29.

S2CID 219885788
.

^ Srivastava, Nitish; C. Geoffrey Hinton; Alex Krizhevsky; Ilya Sutskever; Ruslan Salakhutdinov (2014). "Dropout: A Simple Way to Prevent Neural Networks from overfitting" (PDF). Journal of Machine Learning Research. 15 (1): 1929–1958. Archived (PDF) from the original on 2016-01-19. Retrieved 2015-01-03.

^ "Regularization of Neural Networks using DropConnect | ICML 2013 | JMLR W&CP". jmlr.org: 1058–1066. 2013-02-13. Archived from the original on 2017-08-12. Retrieved 2015-12-17.

arXiv:1301.3557 [cs.LG
].

^ ^a ^b Platt, John; Steinkraus, Dave; Simard, Patrice Y. (August 2003). "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis – Microsoft Research". Microsoft Research. Archived from the original on 2017-11-07. Retrieved 2015-12-17.

arXiv:1207.0580 [cs.NE
].

^ "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". jmlr.org. Archived from the original on 2016-03-05. Retrieved 2015-12-17.

doi:10.1016/s0364-0213(79)80008-7
.

^ Rock, Irvin. "The frame of reference." The legacy of Solomon Asch: Essays in cognition and social psychology (1990): 243–268.

^ J. Hinton, Coursera lectures on Neural Networks, 2012, Url: https://www.coursera.org/learn/neural-networks Archived 2016-12-31 at the Wayback Machine

Quartz. Archived
from the original on 12 December 2019. Retrieved 5 October 2018.

S2CID 2883848
.

^ "ImageNet Large Scale Visual Recognition Competition 2014 (ILSVRC2014)". Archived from the original on 5 February 2016. Retrieved 30 January 2016.

ISBN 978-1-4673-6964-0
.

arXiv:1409.0575 [cs.CV
].

^ "The Face Detection Algorithm Set To Revolutionize Image Search". Technology Review. February 16, 2015. Archived from the original on 20 September 2020. Retrieved 27 October 2017.

ISBN 978-3-642-25445-1
.

S2CID 1923924
.

arXiv:1801.10111 [cs.CV
].

^ Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks Archived 2019-08-06 at the Wayback Machine." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014.

arXiv:1406.2199 [cs.CV
]. (2014).

PMID 29789447. Archived
(PDF) from the original on 2021-03-01. Retrieved 2018-09-14.

ISBN 978-1-4799-7061-2
.

ISBN 978-3-642-15566-6. Archived
from the original on 2022-03-31. Retrieved 2022-03-31.

S2CID 6006618
.

arXiv:1404.7296 [cs.CL
].

^ Mesnil, Gregoire; Deng, Li; Gao, Jianfeng; He, Xiaodong; Shen, Yelong (April 2014). "Learning Semantic Representations Using Convolutional Neural Networks for Web Search – Microsoft Research". Microsoft Research. Archived from the original on 2017-09-15. Retrieved 2015-12-17.

arXiv:1404.2188 [cs.CL
].

arXiv:1408.5882 [cs.CL
].

^ Collobert, Ronan, and Jason Weston. "A unified architecture for natural language processing: Deep neural networks with multitask learning Archived 2019-09-04 at the Wayback Machine."Proceedings of the 25th international conference on Machine learning. ACM, 2008.

arXiv:1103.0398 [cs.LG
].

arXiv:1702.01923 [cs.LG
].

arXiv:1803.01271 [cs.LG
].

S2CID 236307579
.

arXiv:2107.09355
.

S2CID 182952311
.

arXiv:1510.02855 [cs.LG
].

arXiv:1506.06579 [cs.CV
].

^ "Toronto startup has a faster way to discover effective medicines". The Globe and Mail. Archived from the original on 2015-10-20. Retrieved 2015-11-09.

^ "Startup Harnesses Supercomputers to Seek Cures". KQED Future of You. 2015-05-27. Archived from the original on 2018-12-06. Retrieved 2015-11-09.

PMID 18252639
.

doi:10.1109/4235.942536
.

ISBN 978-1558607835
.

arXiv:1412.3409 [cs.AI
].

arXiv:1412.6564 [cs.LG
].

^ "AlphaGo – Google DeepMind". Archived from the original on 30 January 2016. Retrieved 30 January 2016.

arXiv:1803.01271 [cs.LG
].

arXiv:1511.07122 [cs.CV
].

arXiv:1703.04691 [stat.ML
].

arXiv:1508.00317 [stat.ML
].

arXiv:1906.04397 [stat.ML
].

doi:10.21629/JSEE.2017.01.18
.

arXiv:1908.07978 [cs.LG
].

^
doi:10.11588/data/IE8CCN

^ Hubert Mara and Bartosz Bogacz (2019), "Breaking the Code on Broken Tablets: The Learning Challenge for Annotated Cuneiform Script in Normalized 2D and 3D Datasets", Proceedings of the 15th International Conference on Document Analysis and Recognition (ICDAR) (in German), Sydney, Australien, pp. 148–153,
S2CID 211026941

^ Bogacz, Bartosz; Mara, Hubert (2020), "Period Classification of 3D Cuneiform Tablets with Geometric Neural Networks", Proceedings of the 17th International Conference on Frontiers of Handwriting Recognition (ICFHR), Dortmund, Germany

YouTube

^ Durjoy Sen Maitra; Ujjwal Bhattacharya; S.K. Parui, "CNN based common approach to handwritten character recognition of multiple scripts" Archived 2023-10-16 at the Wayback Machine, in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, vol., no., pp.1021–1025, 23–26 Aug. 2015

^ "NIPS 2017". Interpretable ML Symposium. 2017-10-20. Archived from the original on 2019-09-07. Retrieved 2018-09-12.

S2CID 4058889
.

PMID 29933555. Archived
(PDF) from the original on 2018-09-13. Retrieved 2018-09-14.

arXiv:1508.04186v2 [cs.LG
].

S2CID 205242740
.

PMID 18252373
.

^ "Convolutional Deep Belief Networks on CIFAR-10" (PDF). Archived (PDF) from the original on 2017-08-30. Retrieved 2017-08-18.

S2CID 12008458
.

^ Cade Metz (May 18, 2016). "Google Built Its Very Own Chips to Power Its AI Bots". Wired. Archived from the original on January 13, 2018. Retrieved March 6, 2017.

External links

CS231n: Convolutional Neural Networks for Visual Recognition — Andrej Karpathy's Stanford computer science course on CNNs in computer vision

Categories
Artificial neural networks

Machine learning

[74] When applied to other types of data than image data, such as sound data, "spatial position" may variously correspond to different points in the time domain, frequency domain, or other mathematical spaces.

[76] the name "convolutional layer"

[85] tegorical data
.

[auto3-1] 
ISBN 978-1-351-65032-8. Archived
from the original on 2023-10-16. Retrieved 2020-12-13.

[auto2-2] 
ISBN 978-3-030-32644-9. Archived
from the original on 2023-10-16. Retrieved 2020-12-13.

[3] S2CID 213010088. Archived
from the original on 2023-07-31. Retrieved 2023-08-12.

[4] S2CID 219470398. Archived
from the original on 2023-06-29. Retrieved 2023-08-12. Convolutional neural networks represent deep learning architectures that are currently used in a wide range of applications, including computer vision, speech recognition, malware dedection, time series analysis in finance, and many others.

[auto1-5] OCLC 987790957.

[6] Atlas, Homma, and Marks. "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification" (PDF). Neural Information Processing Systems (NIPS 1987). 1. Archived (PDF) from the original on 2021-04-14.

[Valueva_Nagornov_Lyakhov_Valuev_2020_pp._232–243-7] S2CID 218955622
. Convolutional neural networks are a promising tool for solving the problem of pattern recognition.

[8] van den Oord, Aaron; Dieleman, Sander; Schrauwen, Benjamin (2013-01-01). Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, K. Q. (eds.). Deep content-based music recommendation (PDF). Curran Associates, Inc. pp. 2643–2651. Archived (PDF) from the original on 2022-03-07. Retrieved 2022-03-31.

[9] S2CID 2617020
.

[10] S2CID 221386616. Archived
(PDF) from the original on 2022-05-19. Retrieved 2023-07-21.

[Tsantekidis_7–12-11] 
S2CID 4950757
.

[:0-12] Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics. Archived from the original on 2020-06-23. Retrieved 2020-06-22.

[:1-13] 
PMID 20577468. Archived
from the original on 2017-02-06. Retrieved 2016-09-22.

[:6-14] 
S2CID 232269854. Archived
from the original on 2021-06-27. Retrieved 2021-03-26.

[15] PMID 31430292
.

[fukuneoscholar-16] 
doi:10.4249/scholarpedia.1717
.

[hubelwiesel1968-17] 
PMID 4966457
.

[intro-18] 
S2CID 206775608. Archived
(PDF) from the original on 3 June 2014. Retrieved 16 November 2013.

[robust_face_detection-19] 
PMID 12850007. Archived
(PDF) from the original on 13 December 2013. Retrieved 17 November 2013.

[20] Convolutional Neural Networks Demystified: A Matched Filtering Perspective Based Tutorial https://arxiv.org/abs/2108.11663v3

[deeplearning-21] "Convolutional Neural Networks (LeNet) – DeepLearning 0.1 documentation". DeepLearning 0.1. LISA Lab. Archived from the original on 28 December 2017. Retrieved 31 August 2013.

[22] arXiv:1610.02357 [cs.CV].

[flexible-23] Ciresan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gambardella; Jurgen Schmidhuber (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Two. 2: 1237–1242. Archived (PDF) from the original on 5 April 2022. Retrieved 17 November 2013.

[24] Krizhevsky, Alex. "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). Archived (PDF) from the original on 25 April 2021. Retrieved 17 November 2013.

[Yamaguchi111990-25] Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.

[mcdns-26] 
S2CID 2161592
.

[27] arXiv:1511.07122 [cs.CV].

[28] arXiv:1706.05587 [cs.CV].

[29] arXiv:2108.07387 [cs.CV].

[LeCun-30] LeCun, Yann. "LeNet-5, convolutional neural networks". Archived from the original on 24 February 2021. Retrieved 16 November 2013.

[31] PMID 34690686
.

[:4-32] 
PMID 14403679
.

[33] ISBN 978-0-19-517618-6. Archived
from the original on 2023-10-16. Retrieved 2019-01-18.

[34] S2CID 3074096
.

[Fukushima1969-35] 
doi:10.1109/TSSC.1969.300225
.

[DLhistory-36] 
arXiv:2212.11279 [cs.NE
].

[37] arXiv:1710.05941 [cs.NE].

[weng1993-38] S2CID 8619176
.

[schdeepscholar-39] 
S2CID 2309950. Archived
from the original on 2016-04-19. Retrieved 2019-01-20.

[40] Homma, Toshiteru; Les Atlas; Robert Marks II (1988). "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification" (PDF). Advances in Neural Information Processing Systems. 1: 31–40. Archived (PDF) from the original on 2022-03-31. Retrieved 2022-03-31.

[Waibel1987-41] Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.

[speechsignal-42] Alexander Waibel et al., Phoneme Recognition Using Time-Delay Neural Networks Archived 2021-02-25 at the Wayback Machine, IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328–339, March 1989.

[43] LeCun, Yann; Bengio, Yoshua (1995). "Convolutional networks for images, speech, and time series". In Arbib, Michael A. (ed.). The handbook of brain theory and neural networks (Second ed.). The MIT press. pp. 276–278. Archived from the original on 2020-07-28. Retrieved 2019-12-03.

[Hampshire1990-44] John B. Hampshire and Alexander Waibel, Connectionist Architectures for Multi-Speaker Phoneme Recognition Archived 2022-03-31 at the Wayback Machine, Advances in Neural Information Processing Systems, 1990, Morgan Kaufmann.

[video_quality-45] 
S2CID 221185563. Archived
(PDF) from the original on 24 February 2021. Retrieved 17 November 2013.

[Ko2017-46] Ko, Tom; Peddinti, Vijayaditya; Povey, Daniel; Seltzer, Michael L.; Khudanpur, Sanjeev (March 2018). A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition (PDF). The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). New Orleans, LA, US. Archived (PDF) from the original on 2018-07-08. Retrieved 2019-09-04.

[47] Denker, J S, Gardner, W R, Graf, H. P, Henderson, D, Howard, R E, Hubbard, W, Jackel, L D, Baird, H S, and Guyon (1989) Neural network recognizer for hand-written zip code digits Archived 2018-08-04 at the Wayback Machine, AT&T Bell Laboratories

[:2-48] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation Applied to Handwritten Zip Code Recognition Archived 2020-01-10 at the Wayback Machine; AT&T Bell Laboratories

[:wz1991-49] 
PMID 20706526. Archived
from the original on 2017-02-06. Retrieved 2016-09-22.

[:wz1994-50] 
PMID 8058017. Archived
from the original on 2017-02-06. Retrieved 2016-09-22.

[lecun95-51] 
ISBN 978-981-02-2324-3. Archived
(PDF) from the original on 2 May 2023.

[52] Zhang, Wei (1991). "Error Back Propagation with Minimum-Entropy Weights: A Technique for Better Generalization of 2-D Shift-Invariant NNs". Proceedings of the International Joint Conference on Neural Networks. Archived from the original on 2017-02-06. Retrieved 2016-09-22.

[53] Daniel Graupe, Ruey Wen Liu, George S Moschytz."Applications of neural networks to medical signal processing Archived 2020-07-28 at the Wayback Machine". In Proc. 27th IEEE Decision and Control Conf., pp. 343–347, 1988.

[54] Daniel Graupe, Boris Vern, G. Gruener, Aaron Field, and Qiu Huang. "Decomposition of surface EMG signals into single fiber action potentials by means of neural network Archived 2019-09-04 at the Wayback Machine". Proc. IEEE International Symp. on Circuits and Systems, pp. 1008–1011, 1989.

[55] Qiu Huang, Daniel Graupe, Yi Fang Huang, Ruey Wen Liu. "Identification of firing patterns of neuronal signals." In Proc. 28th IEEE Decision and Control Conf., pp. 266–271, 1989. https://ieeexplore.ieee.org/document/70115 Archived 2022-03-31 at the Wayback Machine

[56] S2CID 1304548. Archived
(PDF) from the original on 2017-08-10. Retrieved 2016-12-28.

[57] doi:10.1016/j.patcog.2004.01.013.

[58] doi:10.1109/ICDAR.2005.251. Archived from the original on 2022-03-31. Retrieved 2022-03-31.

[59] Kumar Chellapilla; Sid Puri; Patrice Simard (2006). "High Performance Convolutional Neural Networks for Document Processing". In Lorette, Guy (ed.). Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft. Archived from the original on 2020-05-18. Retrieved 2016-03-14.

[60] S2CID 2309950
.

[61] Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2007). "Greedy Layer-Wise Training of Deep Networks" (PDF). Advances in Neural Information Processing Systems: 153–160. Archived (PDF) from the original on 2022-06-02. Retrieved 2022-03-31.

[62] Ranzato, MarcAurelio; Poultney, Christopher; Chopra, Sumit; LeCun, Yann (2007). "Efficient Learning of Sparse Representations with an Energy-Based Model" (PDF). Advances in Neural Information Processing Systems. Archived (PDF) from the original on 2016-03-22. Retrieved 2014-06-26.

[LSD_1-63] S2CID 392458. Archived
(PDF) from the original on 8 December 2020. Retrieved 22 December 2023.

[64] S2CID 1918673
.

[65] "IJCNN 2011 Competition result table". OFFICIAL IJCNN2011 COMPETITION. 2010. Archived from the original on 2021-01-17. Retrieved 2019-01-14.

[66] Schmidhuber, Jürgen (17 March 2017). "History of computer vision contests won by deep CNNs on GPU". Archived from the original on 19 December 2018. Retrieved 14 January 2019.

[:02-67] 
S2CID 195908774. Archived
(PDF) from the original on 2017-05-16. Retrieved 2018-12-04.

[68] S2CID 206594692. Archived
(PDF) from the original on 2022-04-05. Retrieved 2022-03-31.

[69] S2CID 15411954. Archived
from the original on 2023-03-06. Retrieved 2022-03-31.

[70] Viebke, Andre; Memeti, Suejb; Pllana, Sabri; Abraham, Ajith (2019). "CHAOS: a parallelization scheme for training convolutional neural networks on Intel Xeon Phi". The Journal of Supercomputing. 75 (1): 197–227.
S2CID 14135321
.

[71] Hinton, Geoffrey (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. 1: 1097–1105. Archived from the original on 2019-12-20. Retrieved 2021-03-26 – via ACM.

[:5-72] 
ISSN 1533-7928. Archived
from the original on 2022-03-31. Retrieved 2022-03-31.

[Géron_Hands-on_ML_2019-73] 
ISBN 978-1-492-03264-9
., pp. 448

[75] "CS231n Convolutional Neural Networks for Visual Recognition". cs231n.github.io. Archived from the original on 2019-10-23. Retrieved 2017-04-25.

[Scherer-ICANN-2010-77] Scherer, Dominik; Müller, Andreas C.; Behnke, Sven (2010). "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition" (PDF). Artificial Neural Networks (ICANN), 20th International Conference on. Thessaloniki, Greece: Springer. pp. 92–101. Archived (PDF) from the original on 2018-04-03. Retrieved 2016-12-28.

[78] arXiv:1412.6071 [cs.CV].

[79] arXiv:1412.6806 [cs.LG].

[rcnn-80] arXiv:1504.08083 [cs.CV].

[Ma_Chang_Xie_Ding_2019_pp._3224–3233-81] S2CID 86674074
.

[Romanuke4-82] doi:10.20535/1810-0546.2017.1.88156.

[glorot2011-83] Xavier Glorot; Antoine Bordes; Yoshua Bengio (2011). Deep sparse rectifier neural networks (PDF). AISTATS. Archived from the original (PDF) on 2016-12-13. Retrieved 2023-04-10. Rectifier and softplus activation functions. The second one is a smooth version of the first.

[84] Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012). "Imagenet classification with deep convolutional neural networks" (PDF). Advances in Neural Information Processing Systems. 1: 1097–1105. Archived (PDF) from the original on 2022-03-31. Retrieved 2022-03-31.

[86] S2CID 231925012
.

[87] S2CID 233219976. Archived
from the original on 2022-01-22. Retrieved 2021-03-26.

[88] OCLC 1106340711
.

[89] Jaderberg, Max; Simonyan, Karen; Zisserman, Andrew; Kavukcuoglu, Koray (2015). "Spatial Transformer Networks" (PDF). Advances in Neural Information Processing Systems. 28. Archived (PDF) from the original on 2021-07-25. Retrieved 2021-03-26 – via NIPS.

[90] OCLC 1106278545.

[91] S2CID 127253432. Archived
from the original on 2021-09-29. Retrieved 2021-09-29.

[92] S2CID 219885788
.

[93] Srivastava, Nitish; C. Geoffrey Hinton; Alex Krizhevsky; Ilya Sutskever; Ruslan Salakhutdinov (2014). "Dropout: A Simple Way to Prevent Neural Networks from overfitting" (PDF). Journal of Machine Learning Research. 15 (1): 1929–1958. Archived (PDF) from the original on 2016-01-19. Retrieved 2015-01-03.

[94] "Regularization of Neural Networks using DropConnect | ICML 2013 | JMLR W&CP". jmlr.org: 1058–1066. 2013-02-13. Archived from the original on 2017-08-12. Retrieved 2015-12-17.

[95] arXiv:1301.3557 [cs.LG].

[:3-96] Platt, John; Steinkraus, Dave; Simard, Patrice Y. (August 2003). "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis – Microsoft Research". Microsoft Research. Archived from the original on 2017-11-07. Retrieved 2015-12-17.

[97] arXiv:1207.0580 [cs.NE].

[98] "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". jmlr.org. Archived from the original on 2016-03-05. Retrieved 2015-12-17.

[99] doi:10.1016/s0364-0213(79)80008-7.

[100] Rock, Irvin. "The frame of reference." The legacy of Solomon Asch: Essays in cognition and social psychology (1990): 243–268.

[101] J. Hinton, Coursera lectures on Neural Networks, 2012, Url: https://www.coursera.org/learn/neural-networks Archived 2016-12-31 at the Wayback Machine

[quartz-102] Quartz. Archived
from the original on 12 December 2019. Retrieved 5 October 2018.

[103] S2CID 2883848
.

[ILSVRC2014-104] "ImageNet Large Scale Visual Recognition Competition 2014 (ILSVRC2014)". Archived from the original on 5 February 2016. Retrieved 30 January 2016.

[googlenet-105] ISBN 978-1-4673-6964-0
.

[106] arXiv:1409.0575 [cs.CV].

[107] "The Face Detection Algorithm Set To Revolutionize Image Search". Technology Review. February 16, 2015. Archived from the original on 20 September 2020. Retrieved 27 October 2017.

[108] ISBN 978-3-642-25445-1
.

[109] S2CID 1923924
.

[110] arXiv:1801.10111 [cs.CV].

[111] Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks Archived 2019-08-06 at the Wayback Machine." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014.

[112] arXiv:1406.2199 [cs.CV]. (2014).

[Wang_Duan_Zhang_Niu_p=1657-113] PMID 29789447. Archived
(PDF) from the original on 2021-03-01. Retrieved 2018-09-14.

[Duan_Wang_Zhai_Zheng_2018_p.-114] ISBN 978-1-4799-7061-2
.

[115] ISBN 978-3-642-15566-6. Archived
from the original on 2022-03-31. Retrieved 2022-03-31.

[116] S2CID 6006618
.

[117] arXiv:1404.7296 [cs.CL].

[118] Mesnil, Gregoire; Deng, Li; Gao, Jianfeng; He, Xiaodong; Shen, Yelong (April 2014). "Learning Semantic Representations Using Convolutional Neural Networks for Web Search – Microsoft Research". Microsoft Research. Archived from the original on 2017-09-15. Retrieved 2015-12-17.

[119] arXiv:1404.2188 [cs.CL].

[120] arXiv:1408.5882 [cs.CL].

[121] Collobert, Ronan, and Jason Weston. "A unified architecture for natural language processing: Deep neural networks with multitask learning Archived 2019-09-04 at the Wayback Machine."Proceedings of the 25th international conference on Machine learning. ACM, 2008.

[122] arXiv:1103.0398 [cs.LG].

[123] arXiv:1702.01923 [cs.LG].

[124] arXiv:1803.01271 [cs.LG].

[125] S2CID 236307579
.

[126] arXiv:2107.09355.

[127] S2CID 182952311
.

[128] arXiv:1510.02855 [cs.LG].

[129] arXiv:1506.06579 [cs.CV].

[130] "Toronto startup has a faster way to discover effective medicines". The Globe and Mail. Archived from the original on 2015-10-20. Retrieved 2015-11-09.

[131] "Startup Harnesses Supercomputers to Seek Cures". KQED Future of You. 2015-05-27. Archived from the original on 2018-12-06. Retrieved 2015-11-09.

[132] PMID 18252639
.

[133] doi:10.1109/4235.942536.

[134] ISBN 978-1558607835
.

[135] arXiv:1412.3409 [cs.AI].

[136] arXiv:1412.6564 [cs.LG].

[137] "AlphaGo – Google DeepMind". Archived from the original on 30 January 2016. Retrieved 30 January 2016.

[138] arXiv:1803.01271 [cs.LG].

[139] arXiv:1511.07122 [cs.CV].

[140] arXiv:1703.04691 [stat.ML].

[141] arXiv:1508.00317 [stat.ML].

[142] arXiv:1906.04397 [stat.ML].

[143] doi:10.21629/JSEE.2017.01.18.

[144] arXiv:1908.07978 [cs.LG].

[HeiCuBeDa_Hilprecht-145] 
doi:10.11588/data/IE8CCN

[ICDAR19-146] Hubert Mara and Bartosz Bogacz (2019), "Breaking the Code on Broken Tablets: The Learning Challenge for Annotated Cuneiform Script in Normalized 2D and 3D Datasets", Proceedings of the 15th International Conference on Document Analysis and Recognition (ICDAR) (in German), Sydney, Australien, pp. 148–153,
S2CID 211026941

[ICFHR20-147] Bogacz, Bartosz; Mara, Hubert (2020), "Period Classification of 3D Cuneiform Tablets with Geometric Neural Networks", Proceedings of the 17th International Conference on Frontiers of Handwriting Recognition (ICFHR), Dortmund, Germany

[ICFHR20_Presentation-148] YouTube

[149] Durjoy Sen Maitra; Ujjwal Bhattacharya; S.K. Parui, "CNN based common approach to handwritten character recognition of multiple scripts" Archived 2023-10-16 at the Wayback Machine, in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, vol., no., pp.1021–1025, 23–26 Aug. 2015

[Interpretable_ML_Symposium_2017-150] "NIPS 2017". Interpretable ML Symposium. 2017-10-20. Archived from the original on 2019-09-07. Retrieved 2018-09-12.

[Zang_Wang_Liu_Zhang_2018_pp._97–108-151] S2CID 4058889
.

[Wang_Zang_Zhang_Niu_p=1979-152] PMID 29933555. Archived
(PDF) from the original on 2018-09-13. Retrieved 2018-09-14.

[Ong_Chavez_Hong_2015-153] arXiv:1508.04186v2 [cs.LG].

[DQN-154] S2CID 205242740
.

[155] PMID 18252373
.

[CDBN-CIFAR-156] "Convolutional Deep Belief Networks on CIFAR-10" (PDF). Archived (PDF) from the original on 2017-08-30. Retrieved 2017-08-18.

[CDBN-157] S2CID 12008458
.

[158] Cade Metz (May 18, 2016). "Google Built Its Very Own Chips to Power Its AI Bots". Wired. Archived from the original on January 13, 2018. Retrieved March 6, 2017.
