Differentiable neural computer
In artificial intelligence, a differentiable neural computer (DNC) is a memory-augmented neural network architecture that is typically, but not by definition, recurrent in its implementation. The model was published in 2016 by Alex Graves et al. of DeepMind.[1][2]
Applications
The DNC indirectly takes inspiration from the von Neumann architecture, making it likely to outperform conventional architectures on tasks that are fundamentally algorithmic and cannot be learned by finding a decision boundary.
So far, DNCs have been demonstrated to handle only relatively simple tasks, which could also be solved using conventional programming. But DNCs don't need to be programmed for each problem; they can instead be trained. This attention mechanism allows the user to feed complex data structures, such as graphs, into memory sequentially, and recall them for later use.
A DNC can be trained to navigate one rapid transit system and then apply what it learned to a different system. A neural network without memory would typically have to learn each transit system from scratch. On graph traversal and sequence-processing tasks with supervised learning, DNCs performed better than alternatives such as long short-term memory or the neural Turing machine.[5] With a reinforcement learning approach to a block puzzle problem inspired by SHRDLU, the DNC was trained via curriculum learning and learned to make a plan, performing better than a traditional recurrent neural network.[5]
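The following sketch illustrates one way a transit map could be serialized for such a network. The triple encoding, the `encode` helper, and the station and line names are illustrative assumptions, not the encoding used in DeepMind's experiments:

```python
import numpy as np

# A minimal sketch: each edge of a (hypothetical) transit graph becomes a
# fixed-size vector, fed to the sequence model one step at a time.
edges = [
    ("OxfordCircus", "VictoriaLine", "GreenPark"),
    ("GreenPark", "JubileeLine", "Westminster"),
    ("OxfordCircus", "CentralLine", "BondStreet"),
]

# Shared vocabulary over all tokens appearing in any triple.
vocab = sorted({tok for edge in edges for tok in edge})
index = {tok: i for i, tok in enumerate(vocab)}

def encode(edge):
    """One-hot encode each slot of a (source, line, destination) triple."""
    vec = np.zeros(3 * len(vocab))
    for slot, tok in enumerate(edge):
        vec[slot * len(vocab) + index[tok]] = 1.0
    return vec

sequence = [encode(e) for e in edges]  # presented to the network step by step
```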
Architecture
DNC networks were introduced as an extension of the neural Turing machine (NTM), with the addition of memory attention mechanisms that control where the memory is stored, and temporal attention that records the order of events.
The memory is simply a matrix that can be allocated dynamically and accessed indefinitely, and every subcomponent of the model is differentiable, so the whole system can be trained end to end with gradient descent. The DNC model is similar to the von Neumann architecture, and because the memory size is resizable, it is Turing complete.
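As a rough structural sketch of one DNC time step (assuming externally supplied `controller`, `write_fn`, and `read_fn` callables; none of these names come from the paper), the controller and the memory are coupled like this:

```python
import numpy as np

# Illustrative only; shapes and names are assumptions, not DeepMind's code.
N, W, R = 16, 8, 2   # memory slots, slot width, number of read heads

class DNCSketch:
    def __init__(self):
        self.memory = np.zeros((N, W))   # the memory matrix M
        self.reads = np.zeros((R, W))    # read vectors from the last step

    def step(self, x, controller, write_fn, read_fn):
        # 1. The controller sees the input together with the previous reads.
        ctrl_in = np.concatenate([x, self.reads.ravel()])
        ctrl_out, interface = controller(ctrl_in)
        # 2. The interface vector parameterizes a differentiable write and
        #    then differentiable reads, so gradients flow end to end.
        self.memory = write_fn(self.memory, interface)
        self.reads = read_fn(self.memory, interface)
        # 3. The output combines the controller output with the new reads
        #    (a full model applies a learned projection here).
        return np.concatenate([ctrl_out, self.reads.ravel()])
```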
Traditional DNC
DNC, as originally published[1]
Independent variables
Input vector | $\mathbf{x}_t \in \mathbb{R}^X$
Target vector | $\mathbf{z}_t \in \mathbb{R}^Y$

Controller
Controller input matrix | $\chi_t = [\mathbf{x}_t; \mathbf{r}_{t-1}^1; \ldots; \mathbf{r}_{t-1}^R]$

Deep (layered) LSTM (layer $l$, time $t$)
Input gate vector | $\mathbf{i}_t^l = \sigma(W_i^l [\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_i^l)$
Output gate vector | $\mathbf{o}_t^l = \sigma(W_o^l [\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_o^l)$
Forget gate vector | $\mathbf{f}_t^l = \sigma(W_f^l [\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_f^l)$
State gate vector | $\mathbf{s}_t^l = \mathbf{f}_t^l \circ \mathbf{s}_{t-1}^l + \mathbf{i}_t^l \circ \tanh(W_s^l [\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_s^l)$, $\mathbf{s}_0^l = \mathbf{0}$
Hidden gate vector | $\mathbf{h}_t^l = \mathbf{o}_t^l \circ \tanh(\mathbf{s}_t^l)$, $\mathbf{h}_0^l = \mathbf{0}$, $\mathbf{h}_t^0 = \mathbf{0}$
DNC output vector | $\mathbf{y}_t = W_y [\mathbf{h}_t^1; \ldots; \mathbf{h}_t^L] + W_r [\mathbf{r}_t^1; \ldots; \mathbf{r}_t^R]$

Read & write heads
Interface parameters | $\xi_t = W_\xi [\mathbf{h}_t^1; \ldots; \mathbf{h}_t^L] = [\mathbf{k}_t^{r,1}; \ldots; \mathbf{k}_t^{r,R}; \hat{\beta}_t^{r,1}; \ldots; \hat{\beta}_t^{r,R}; \mathbf{k}_t^w; \hat{\beta}_t^w; \hat{\mathbf{e}}_t; \mathbf{v}_t; \hat{f}_t^1; \ldots; \hat{f}_t^R; \hat{g}_t^a; \hat{g}_t^w; \hat{\boldsymbol{\pi}}_t^1; \ldots; \hat{\boldsymbol{\pi}}_t^R]$

Read heads ($i = 1, \ldots, R$)
Read keys | $\mathbf{k}_t^{r,i} \in \mathbb{R}^W$
Read strengths | $\beta_t^{r,i} = \text{oneplus}(\hat{\beta}_t^{r,i}) \in [1, \infty)$
Free gates | $f_t^i = \sigma(\hat{f}_t^i) \in [0, 1]$
Read modes | $\boldsymbol{\pi}_t^i = \text{softmax}(\hat{\boldsymbol{\pi}}_t^i) \in \mathcal{S}_3$

Write head
Write key | $\mathbf{k}_t^w \in \mathbb{R}^W$
Write strength | $\beta_t^w = \text{oneplus}(\hat{\beta}_t^w) \in [1, \infty)$
Erase vector | $\mathbf{e}_t = \sigma(\hat{\mathbf{e}}_t) \in [0, 1]^W$
Write vector | $\mathbf{v}_t \in \mathbb{R}^W$
Allocation gate | $g_t^a = \sigma(\hat{g}_t^a) \in [0, 1]$
Write gate | $g_t^w = \sigma(\hat{g}_t^w) \in [0, 1]$

Memory
Memory matrix ($E$ is the $N \times W$ matrix of ones) | $M_t = M_{t-1} \circ (E - \mathbf{w}_t^w \mathbf{e}_t^\top) + \mathbf{w}_t^w \mathbf{v}_t^\top$
Usage vector | $\mathbf{u}_t = (\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^w - \mathbf{u}_{t-1} \circ \mathbf{w}_{t-1}^w) \circ \boldsymbol{\psi}_t$, $\mathbf{u}_0 = \mathbf{0}$
Precedence weighting | $\mathbf{p}_t = \left(1 - \sum_i \mathbf{w}_t^w[i]\right) \mathbf{p}_{t-1} + \mathbf{w}_t^w$, $\mathbf{p}_0 = \mathbf{0}$
Temporal link matrix | $L_t[i,j] = (1 - \mathbf{w}_t^w[i] - \mathbf{w}_t^w[j]) L_{t-1}[i,j] + \mathbf{w}_t^w[i] \mathbf{p}_{t-1}[j]$, $L_0 = \mathbf{0}$, $L_t[i,i] = 0$
Write weighting | $\mathbf{w}_t^w = g_t^w [g_t^a \mathbf{a}_t + (1 - g_t^a) \mathbf{c}_t^w]$
Read weighting | $\mathbf{w}_t^{r,i} = \boldsymbol{\pi}_t^i[1] \mathbf{b}_t^i + \boldsymbol{\pi}_t^i[2] \mathbf{c}_t^{r,i} + \boldsymbol{\pi}_t^i[3] \mathbf{f}_t^i$
Read vectors | $\mathbf{r}_t^i = M_t^\top \mathbf{w}_t^{r,i}$
Content-based addressing (lookup key $\mathbf{k}$, key strength $\beta$) | $\mathcal{C}(M, \mathbf{k}, \beta)[i] = \dfrac{\exp\{\mathcal{D}(\mathbf{k}, M[i,\cdot]) \beta\}}{\sum_j \exp\{\mathcal{D}(\mathbf{k}, M[j,\cdot]) \beta\}}$
Indices of $\mathbf{u}_t$, sorted in ascending order of usage | $\boldsymbol{\phi}_t$
Allocation weighting | $\mathbf{a}_t[\boldsymbol{\phi}_t[j]] = (1 - \mathbf{u}_t[\boldsymbol{\phi}_t[j]]) \prod_{i=1}^{j-1} \mathbf{u}_t[\boldsymbol{\phi}_t[i]]$
Write content weighting | $\mathbf{c}_t^w = \mathcal{C}(M_{t-1}, \mathbf{k}_t^w, \beta_t^w)$
Read content weighting | $\mathbf{c}_t^{r,i} = \mathcal{C}(M_t, \mathbf{k}_t^{r,i}, \beta_t^{r,i})$
Forward weighting | $\mathbf{f}_t^i = L_t \mathbf{w}_{t-1}^{r,i}$
Backward weighting | $\mathbf{b}_t^i = L_t^\top \mathbf{w}_{t-1}^{r,i}$
Memory retention vector | $\boldsymbol{\psi}_t = \prod_{i=1}^R (1 - f_t^i \mathbf{w}_{t-1}^{r,i})$

Definitions
Weight matrix, bias vector | $W$, $\mathbf{b}$
Zeros matrix, ones matrix, identity matrix | $\mathbf{0}$, $E$, $I$
Element-wise multiplication | $\circ$
Cosine similarity | $\mathcal{D}(\mathbf{u}, \mathbf{v}) = \dfrac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$
Sigmoid function | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Oneplus function | $\text{oneplus}(x) = 1 + \log(1 + e^x)$
Softmax function | $\text{softmax}(\mathbf{x})_j = \dfrac{e^{x_j}}{\sum_{k=1}^K e^{x_k}}$ for $j = 1, \ldots, K$
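To make the definitions concrete, here is a NumPy sketch of the scalar functions and of content-based addressing $\mathcal{C}(M, \mathbf{k}, \beta)$ from the table. The function names are ours, and the small epsilon guarding division by zero is an implementation assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())              # subtract max for numerical stability
    return e / e.sum()

def oneplus(x):
    return 1.0 + np.log1p(np.exp(x))     # maps the reals onto [1, inf)

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def content_addressing(M, key, beta):
    # C(M, k, beta)[i] = softmax over i of beta * D(k, M[i,:])
    sims = np.array([cosine_similarity(key, row) for row in M])
    return softmax(beta * sims)
```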
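Likewise, the dynamic allocation and write equations can be sketched directly from the table. Argument shapes follow the notation above (usage $\mathbf{u}_t \in [0,1]^N$, $R$ free gates, previous read weightings of shape $R \times N$), and the loop-based allocation is a readability choice rather than the vectorized form a real implementation would use:

```python
import numpy as np

def memory_retention(free_gates, prev_read_w):
    # psi_t = prod_i (1 - f^i * w^{r,i}_{t-1})
    return np.prod(1.0 - free_gates[:, None] * prev_read_w, axis=0)

def allocation_weighting(usage):
    # a_t[phi[j]] = (1 - u[phi[j]]) * prod_{i<j} u[phi[i]],
    # where phi sorts the slots by ascending usage.
    phi = np.argsort(usage)
    alloc = np.zeros_like(usage)
    running = 1.0
    for slot in phi:
        alloc[slot] = (1.0 - usage[slot]) * running
        running *= usage[slot]
    return alloc

def write_memory(M, write_w, erase, add):
    # M_t = M_{t-1} o (E - w^w e^T) + w^w v^T
    return M * (1.0 - np.outer(write_w, erase)) + np.outer(write_w, add)
```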
Extensions
Refinements include sparse memory addressing, which reduces time and space complexity by thousands of times. This can be achieved by using an approximate nearest neighbor algorithm, such as locality-sensitive hashing or a randomized k-d tree like the Fast Library for Approximate Nearest Neighbors (FLANN) from the University of British Columbia.[9] Adding adaptive computation time (ACT) decouples computation time from data time, exploiting the fact that problem length and problem difficulty are not always the same.[10] Training with synthetic gradients performs considerably better than backpropagation through time (BPTT).[11] Robustness can be improved with layer normalization and bypass dropout as regularization.[12]
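As a toy illustration of the sparse-addressing idea (random-hyperplane hashing standing in for a tuned library like FLANN; all names and sizes here are assumptions):

```python
import numpy as np

# Hash memory rows with random hyperplanes (a simple locality-sensitive
# hash) and score only the rows in the query's bucket, not all N rows.
rng = np.random.default_rng(0)
N, W, BITS = 1024, 32, 8

memory = rng.standard_normal((N, W))
planes = rng.standard_normal((BITS, W))        # random hyperplanes

def bucket(v):
    return tuple((planes @ v) > 0)             # sign pattern as hash code

table = {}
for i, row in enumerate(memory):
    table.setdefault(bucket(row), []).append(i)

def sparse_read(key):
    candidates = table.get(bucket(key), range(N))  # dense fallback if empty
    scores = {i: memory[i] @ key for i in candidates}
    best = max(scores, key=scores.get)
    return memory[best]                        # read only the best match

print(sparse_read(memory[3]).shape)            # (32,)
```

Because only one hash bucket is scored, lookup cost no longer grows with the full memory size N.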
References
- ^ Graves, Alex; Wayne, Greg; Reynolds, Malcolm; et al. (2016). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. doi:10.1038/nature20101. S2CID 205251479.
- ^ "Differentiable neural computers | DeepMind". DeepMind. Retrieved 2016-10-19.
- ^ a b Burgess, Matt. "DeepMind's AI learned to ride the London Underground using human-like reason and memory". WIRED UK. Retrieved 2016-10-19.
- ^ Jaeger, Herbert (2016). "Artificial intelligence: Deep neural reasoning". Nature. 538 (7626): 467–468. PMID 27732576.
- ^ a b James, Mike. "DeepMind's Differentiable Neural Network Thinks Deeply". www.i-programmer.info. Retrieved 2016-10-20.
- ^ "DeepMind AI 'Learns' to Navigate London Tube". PCMAG. Retrieved 2016-10-19.
- ^ Mannes, John. "DeepMind's differentiable neural computer helps you navigate the subway with its memory". TechCrunch. Retrieved 2016-10-19.
- ^ "RNN Symposium 2016: Alex Graves - Differentiable Neural Computer". YouTube.
- ^ Rae, Jack W.; Hunt, Jonathan J.; Harley, Tim; Danihelka, Ivo; Senior, Andrew; Wayne, Greg; Graves, Alex; Lillicrap, Timothy P. (2016). "Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes". arXiv:1610.09027 [cs.LG].
- ^ Graves, Alex (2016). "Adaptive Computation Time for Recurrent Neural Networks". arXiv:1603.08983 [cs.NE].
- ^ Jaderberg, Max; Czarnecki, Wojciech Marian; Osindero, Simon; Vinyals, Oriol; Graves, Alex; Silver, David; Kavukcuoglu, Koray (2016). "Decoupled Neural Interfaces using Synthetic Gradients". arXiv:1608.05343 [cs.LG].
- ^ Franke, Jörg; Niehues, Jan; Waibel, Alex (2018). "Robust and Scalable Differentiable Neural Computer for Question Answering". arXiv:1807.02658 [cs.CL].