t-distributed stochastic neighbor embedding

[Figure: t-SNE visualisation of word embeddings generated using 19th century literature]
[Figure: t-SNE embedding of the MNIST dataset]

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by Geoffrey Hinton and Sam Roweis,[1] and Laurens van der Maaten proposed the t-distributed variant.[2] It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability and dissimilar objects a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map and minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the basis of its similarity metric, this can be changed as appropriate. A related nonlinear dimensionality reduction technique is UMAP.

t-SNE has been used for visualization in a wide range of applications, including computer security research,[3] music analysis,[4] cancer research,[5] bioinformatics,[6] geological domain interpretation,[7][8][9] and biomedical signal processing.[10]

While t-SNE plots often seem to display clusters, the visual clusters can be influenced strongly by the chosen parameterization and therefore a good understanding of the parameters for t-SNE is necessary. Such "clusters" can be shown to even appear in non-clustered data,[11] and thus may be false findings. Interactive exploration may thus be necessary to choose parameters and validate results.[12][13] It has been demonstrated that t-SNE is often able to recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering.[14]

For a data set with $n$ elements, t-SNE runs in $O(n^2)$ time and requires $O(n^2)$ space.[15]

Details

Given a set of $N$ high-dimensional objects $\mathbf{x}_1, \dots, \mathbf{x}_N$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf{x}_i$ and $\mathbf{x}_j$, as follows.

For $i \neq j$, define

$$ p_{j\mid i} = \frac{\exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)} $$

and set $p_{i\mid i} = 0$. Note that the above denominator ensures $\sum_j p_{j\mid i} = 1$ for all $i$.

As van der Maaten and Hinton explained: "The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j\mid i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$."[2]
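A minimal NumPy sketch of this step is shown below; it is an illustration rather than a reference implementation, and the bandwidths $\sigma_i$ are taken as given here (their selection is described further down).

    import numpy as np

    def conditional_probabilities(X, sigmas):
        """Conditional probabilities p_{j|i} from data X (N x D) and bandwidths sigmas (N,)."""
        # Squared Euclidean distances ||x_i - x_j||^2 between all pairs of points
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        # Gaussian affinities: row i uses its own bandwidth sigma_i
        affinities = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
        np.fill_diagonal(affinities, 0.0)  # p_{i|i} = 0
        # Normalize each row so that sum_j p_{j|i} = 1
        return affinities / affinities.sum(axis=1, keepdims=True)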

Now define

$$ p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N} $$

This symmetrization is motivated as follows: the marginal probabilities $p_i$ and $p_j$ over the $N$ samples are estimated as $1/N$, so the conditional probabilities can be written as $p_{i\mid j} = N p_{ij}$ and $p_{j\mid i} = N p_{ji}$; since $p_{ij} = p_{ji}$, averaging the two conditionals and dividing by $N$ recovers the formula above.

Also note that $p_{ii} = 0$ and $\sum_{i,j} p_{ij} = 1$.
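Continuing the sketch above, the symmetrized joint probabilities follow in one line (again an illustrative helper, not library code):

    def joint_probabilities(P_conditional):
        """Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
        N = P_conditional.shape[0]
        # The diagonal stays zero and all entries sum to one.
        return (P_conditional + P_conditional.T) / (2.0 * N)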

The bandwidth of the Gaussian kernels $\sigma_i$ is set in such a way that the entropy of the conditional distribution equals a predefined entropy using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of $\sigma_i$ are used in denser parts of the data space.
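In practice the target entropy is usually specified through the perplexity, the entropy being $\log_2$ of the perplexity. A sketch of the bisection search for a single point's bandwidth follows; the function name, bracketing interval, and tolerance are illustrative choices, not taken from any particular implementation.

    import numpy as np

    def bandwidth_for_point(sq_dists_i, target_entropy, tol=1e-5, max_iter=64):
        """Bisection search for sigma_i so that the entropy (in bits) of p_{.|i}
        matches target_entropy. sq_dists_i holds the squared distances from
        point i to every other point (point i itself excluded)."""
        lo, hi = 1e-12, 1e12   # generous bracketing interval for sigma
        sigma = 1.0
        for _ in range(max_iter):
            p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
            p /= p.sum()
            entropy = -np.sum(p * np.log2(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:
                hi = sigma          # distribution too flat: decrease the bandwidth
            else:
                lo = sigma          # distribution too peaked: increase the bandwidth
            sigma = (lo + hi) / 2.0
        return sigma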

Since the Gaussian kernel uses the Euclidean distance $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert$, it is affected by the curse of dimensionality, and in high-dimensional data, when distances lose the ability to discriminate, the $p_{ij}$ become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.[16]

t-SNE aims to learn a $d$-dimensional map $\mathbf{y}_1, \dots, \mathbf{y}_N$ (with $\mathbf{y}_i \in \mathbb{R}^d$ and $d$ typically chosen as 2 or 3) that reflects the similarities $p_{ij}$ as well as possible. To this end, it measures similarities $q_{ij}$ between two points in the map, $\mathbf{y}_i$ and $\mathbf{y}_j$, using a very similar approach. Specifically, for $i \neq j$, define $q_{ij}$ as

$$ q_{ij} = \frac{\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}}{\sum_{k} \sum_{l \neq k} \left(1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2\right)^{-1}} $$

and set $q_{ii} = 0$. Herein a heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.
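As an illustration, the matrix of $q_{ij}$ values can be computed from the current map points in a few NumPy lines (a sketch, with $q_{ii}$ set to zero as above):

    import numpy as np

    def low_dim_affinities(Y):
        """Student-t similarities q_ij for map points Y (N x d), one degree of freedom."""
        sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        inv = 1.0 / (1.0 + sq_dists)   # (1 + ||y_i - y_j||^2)^(-1)
        np.fill_diagonal(inv, 0.0)     # q_ii = 0
        return inv / inv.sum()         # normalize over all ordered pairs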

The locations of the points $\mathbf{y}_i$ in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution $P$ from the distribution $Q$, that is:

$$ \mathrm{KL}\left(P \parallel Q\right) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

The minimization of the Kullback–Leibler divergence with respect to the points $\mathbf{y}_i$ is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
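A bare-bones sketch of this optimization is given below, using plain gradient descent and the closed-form gradient $\frac{\partial \mathrm{KL}}{\partial \mathbf{y}_i} = 4 \sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}$. Practical implementations add momentum, early exaggeration, and Barnes-Hut or similar approximations, all of which are omitted here; the function and parameter names are illustrative.

    import numpy as np

    def tsne_map(P, n_components=2, n_iter=500, learning_rate=100.0, seed=0):
        """Minimize KL(P || Q) over map points Y by gradient descent.
        P is the symmetrized joint probability matrix from the previous steps."""
        rng = np.random.default_rng(seed)
        N = P.shape[0]
        Y = rng.normal(scale=1e-4, size=(N, n_components))  # small random start
        for _ in range(n_iter):
            sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
            inv = 1.0 / (1.0 + sq_dists)
            np.fill_diagonal(inv, 0.0)
            Q = inv / inv.sum()
            # Gradient: 4 * sum_j (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^(-1) * (y_i - y_j)
            PQ = (P - Q) * inv
            grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
            Y -= learning_rate * grad
        return Y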

Software

  • The R package Rtsne implements t-SNE in R.
  • ELKI contains tSNE, also with Barnes-Hut approximation.
  • scikit-learn, a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation (see the example after this list).
  • TensorBoard, the visualization kit associated with TensorFlow, also implements t-SNE (online version).
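For instance, a typical scikit-learn call looks like the following; the data and parameter values are placeholders for illustration:

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(500, 50)            # 500 points in 50 dimensions
    Y = TSNE(n_components=2, perplexity=30.0, init="pca",
             random_state=0).fit_transform(X)
    print(Y.shape)                         # (500, 2)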

References

  1. Hinton, Geoffrey E.; Roweis, Sam T. (2002). "Stochastic Neighbor Embedding". Advances in Neural Information Processing Systems. 15.
  2. van der Maaten, L.J.P.; Hinton, G.E. (Nov 2008). "Visualizing Data Using t-SNE" (PDF). Journal of Machine Learning Research. 9: 2579–2605.
  3. Gashi, I.; Stankovic, V.; Leita, C.; Thonnard, O. (2009). "An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines". Proceedings of the IEEE International Symposium on Network Computing and Applications: 4–11.
  4. Hamel, P.; Eck, D. (2010). "Learning Features from Music Audio with Deep Belief Networks". Proceedings of the International Society for Music Information Retrieval Conference: 339–344.
  5. PMID 20175497.
  6. .
  7. .
  8. .
  9. .
  10. .
  11. ^ "K-means clustering on the output of t-SNE". Cross Validated. Retrieved 2018-04-16.
  12. S2CID 353336
    .
  13. . Retrieved 4 December 2017.
  14. Linderman, George C.; Steinerberger, Stefan (2019). "Clustering with t-SNE, provably". SIAM Journal on Mathematics of Data Science.
  15. Pezzotti, Nicola. "Approximated and User Steerable tSNE for Progressive Visual Analytics" (PDF). Retrieved 31 August 2023.
  16. .

External links