t-distributed stochastic neighbor embedding

[Figure: t-SNE visualisation of word embeddings generated using 19th century literature]
[Figure: t-SNE embedding of the MNIST dataset]

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by Geoffrey Hinton and Sam Roweis,[1] and Laurens van der Maaten proposed the t-distributed variant.[2] It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability and dissimilar objects a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map and minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the basis of its similarity metric, this can be changed as appropriate. A related nonlinear dimensionality reduction technique is UMAP.

t-SNE has been used for visualization in a wide range of applications, including computer security research,[3] music analysis,[4] cancer research,[5] bioinformatics,[6] geological domain interpretation,[7][8][9] and biomedical signal processing.[10]

While t-SNE plots often seem to display clusters, the visual clusters can be influenced strongly by the chosen parameterization and therefore a good understanding of the parameters for t-SNE is necessary. Such "clusters" can be shown to even appear in non-clustered data,[11] and thus may be false findings. Interactive exploration may thus be necessary to choose parameters and validate results.[12][13] It has been demonstrated that t-SNE is often able to recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering.[14]

For a data set with $n$ elements, t-SNE runs in $O(n^2)$ time and requires $O(n^2)$ space.[15]

Details

Given a set of $N$ high-dimensional objects $\mathbf{x}_1, \dots, \mathbf{x}_N$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf{x}_i$ and $\mathbf{x}_j$, as follows.

For $i \neq j$, define

$$ p_{j\mid i} = \frac{\exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)} $$

and set $p_{i\mid i} = 0$. Note that the above denominator ensures $\sum_j p_{j\mid i} = 1$ for all $i$.

As van der Maaten and Hinton explained: "The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j\mid i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$."[2]
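A minimal NumPy sketch of this step is shown below; it is an illustration rather than a reference implementation, and the bandwidths $\sigma_i$ are taken as given here (their selection is described further down).

    import numpy as np

    def conditional_probabilities(X, sigmas):
        """Conditional probabilities p_{j|i} from data X (N x D) and bandwidths sigmas (N,)."""
        # Squared Euclidean distances ||x_i - x_j||^2 between all pairs of points
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        # Gaussian affinities: row i uses its own bandwidth sigma_i
        affinities = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
        np.fill_diagonal(affinities, 0.0)  # p_{i|i} = 0
        # Normalize each row so that sum_j p_{j|i} = 1
        return affinities / affinities.sum(axis=1, keepdims=True)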

Now define

$$ p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N} $$

This symmetrization is motivated as follows: the marginal probabilities $p_i$ and $p_j$ over the $N$ samples are estimated as $1/N$, so the conditional probabilities can be written as $p_{i\mid j} = N p_{ij}$ and $p_{j\mid i} = N p_{ji}$; since $p_{ij} = p_{ji}$, averaging the two conditionals and dividing by $N$ recovers the formula above.

Also note that $p_{ii} = 0$ and $\sum_{i,j} p_{ij} = 1$.
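Continuing the sketch above, the symmetrized joint probabilities follow in one line (again an illustrative helper, not library code):

    def joint_probabilities(P_conditional):
        """Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
        N = P_conditional.shape[0]
        # The diagonal stays zero and all entries sum to one.
        return (P_conditional + P_conditional.T) / (2.0 * N)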

The bandwidth of the Gaussian kernels $\sigma_i$ is set in such a way that the entropy of the conditional distribution equals a predefined entropy using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of $\sigma_i$ are used in denser parts of the data space.
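In practice the target entropy is usually specified through the perplexity, the entropy being $\log_2$ of the perplexity. A sketch of the bisection search for a single point's bandwidth follows; the function name, bracketing interval, and tolerance are illustrative choices, not taken from any particular implementation.

    import numpy as np

    def bandwidth_for_point(sq_dists_i, target_entropy, tol=1e-5, max_iter=64):
        """Bisection search for sigma_i so that the entropy (in bits) of p_{.|i}
        matches target_entropy. sq_dists_i holds the squared distances from
        point i to every other point (point i itself excluded)."""
        lo, hi = 1e-12, 1e12   # generous bracketing interval for sigma
        sigma = 1.0
        for _ in range(max_iter):
            p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
            p /= p.sum()
            entropy = -np.sum(p * np.log2(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:
                hi = sigma          # distribution too flat: decrease the bandwidth
            else:
                lo = sigma          # distribution too peaked: increase the bandwidth
            sigma = (lo + hi) / 2.0
        return sigma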

Since the Gaussian kernel uses the Euclidean distance $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert$, it is affected by the curse of dimensionality, and in high-dimensional data, when distances lose the ability to discriminate, the $p_{ij}$ become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.[16]

t-SNE aims to learn a $d$-dimensional map $\mathbf{y}_1, \dots, \mathbf{y}_N$ (with $\mathbf{y}_i \in \mathbb{R}^d$ and $d$ typically chosen as 2 or 3) that reflects the similarities $p_{ij}$ as well as possible. To this end, it measures similarities $q_{ij}$ between two points in the map, $\mathbf{y}_i$ and $\mathbf{y}_j$, using a very similar approach. Specifically, for $i \neq j$, define $q_{ij}$ as

$$ q_{ij} = \frac{\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}}{\sum_{k} \sum_{l \neq k} \left(1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2\right)^{-1}} $$

and set $q_{ii} = 0$. Herein a heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.
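As an illustration, the matrix of $q_{ij}$ values can be computed from the current map points in a few NumPy lines (a sketch, with $q_{ii}$ set to zero as above):

    import numpy as np

    def low_dim_affinities(Y):
        """Student-t similarities q_ij for map points Y (N x d), one degree of freedom."""
        sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        inv = 1.0 / (1.0 + sq_dists)   # (1 + ||y_i - y_j||^2)^(-1)
        np.fill_diagonal(inv, 0.0)     # q_ii = 0
        return inv / inv.sum()         # normalize over all ordered pairs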

The locations of the points $\mathbf{y}_i$ in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution $P$ from the distribution $Q$, that is:

$$ \mathrm{KL}\left(P \parallel Q\right) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

The minimization of the Kullback–Leibler divergence with respect to the points $\mathbf{y}_i$ is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
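A bare-bones sketch of this optimization is given below, using plain gradient descent and the closed-form gradient $\frac{\partial \mathrm{KL}}{\partial \mathbf{y}_i} = 4 \sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}$. Practical implementations add momentum, early exaggeration, and Barnes-Hut or similar approximations, all of which are omitted here; the function and parameter names are illustrative.

    import numpy as np

    def tsne_map(P, n_components=2, n_iter=500, learning_rate=100.0, seed=0):
        """Minimize KL(P || Q) over map points Y by gradient descent.
        P is the symmetrized joint probability matrix from the previous steps."""
        rng = np.random.default_rng(seed)
        N = P.shape[0]
        Y = rng.normal(scale=1e-4, size=(N, n_components))  # small random start
        for _ in range(n_iter):
            sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
            inv = 1.0 / (1.0 + sq_dists)
            np.fill_diagonal(inv, 0.0)
            Q = inv / inv.sum()
            # Gradient: 4 * sum_j (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^(-1) * (y_i - y_j)
            PQ = (P - Q) * inv
            grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
            Y -= learning_rate * grad
        return Y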

Software

  • The R package Rtsne implements t-SNE in R.
  • ELKI contains tSNE, also with Barnes-Hut approximation.
  • scikit-learn, a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation (see the example after this list).
  • TensorBoard, the visualization kit associated with TensorFlow, also implements t-SNE (online version).
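For instance, a typical scikit-learn call looks like the following; the data and parameter values are placeholders for illustration:

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(500, 50)            # 500 points in 50 dimensions
    Y = TSNE(n_components=2, perplexity=30.0, init="pca",
             random_state=0).fit_transform(X)
    print(Y.shape)                         # (500, 2)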

References

  1. Hinton, Geoffrey E.; Roweis, Sam T. (2002). "Stochastic Neighbor Embedding". Advances in Neural Information Processing Systems. 15.
  2. van der Maaten, L.J.P.; Hinton, G.E. (Nov 2008). "Visualizing Data Using t-SNE" (PDF). Journal of Machine Learning Research. 9: 2579–2605.
  3. Gashi, I.; Stankovic, V.; Leita, C.; Thonnard, O. (2009). "An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines". Proceedings of the IEEE International Symposium on Network Computing and Applications: 4–11.
  4. Hamel, P.; Eck, D. (2010). "Learning Features from Music Audio with Deep Belief Networks". Proceedings of the International Society for Music Information Retrieval Conference: 339–344.
  5. PMID 20175497.
  6. .
  7. .
  8. .
  9. .
  10. .
  11. ^ "K-means clustering on the output of t-SNE". Cross Validated. Retrieved 2018-04-16.
  12. S2CID 353336
    .
  13. . Retrieved 4 December 2017.
  14. Linderman, George C.; Steinerberger, Stefan (2019). "Clustering with t-SNE, provably". SIAM Journal on Mathematics of Data Science.
  15. Pezzotti, Nicola. "Approximated and User Steerable tSNE for Progressive Visual Analytics" (PDF). Retrieved 31 August 2023.
  16. .

External links