Visualizing Data using t-SNE: An Intuitive Introduction
Simon Carbonnelle, Université Catholique de Louvain, ICTEAM
12th of May, 2016
Outline:
- Visualization and Dimensionality Reduction
- Intuition behind t-SNE
- Visualizing representations
Visualization is key to understanding data easily.

[Scatter plot: house areas in m² against prices in thousands]

Question: Is the relation linear?
Dimensionality Reduction is a helpful tool for visualization.

- Dimensionality reduction algorithms:
  - Map high-dimensional data to a lower dimension
  - While preserving structure
- They are used for:
  - Visualization
  - Performance
  - The curse of dimensionality
- A ton of algorithms exist
- t-SNE is specialised for visualization, and has gained a lot of popularity
Intuition behind t-SNE
Dimensionality Reduction techniques solve optimization problems.

$X = \{x_1, x_2, \ldots, x_n \in \mathbb{R}^h\} \rightarrow Y = \{y_1, y_2, \ldots, y_n \in \mathbb{R}^l\}$, via $\min_Y C(X, Y)$

Three approaches for Dimensionality Reduction:
- Distance preservation
- Topology preservation
- Information preservation

t-SNE is distance-based, but tends to preserve topology.
SNE computes pair-wise similarities.

SNE converts Euclidean distances to similarities that can be interpreted as probabilities:

$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$

$q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$

$p_{i|i} = 0, \quad q_{i|i} = 0$

Hence the name Stochastic Neighbor Embedding.
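These conditional similarities can be sketched in a few lines of NumPy (the function name and the vectorized distance computation are my own; this is an illustrative sketch, not a reference implementation):

```python
import numpy as np

def conditional_p(X, sigmas):
    """Conditional similarities p_{j|i} from pairwise Euclidean distances.

    X: (n, h) high-dimensional data; sigmas: (n,) per-point bandwidths.
    """
    # Squared Euclidean distance matrix ||x_i - x_j||^2
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # p_{i|i} = 0 by definition
    P /= P.sum(axis=1, keepdims=True)  # each row becomes a probability distribution
    return P
```

Each row of the returned matrix is the distribution over neighbors of one datapoint, which is exactly what the KL-based cost in the next step compares.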
Pair-wise similarities should stay the same:

$p_{j|i} \approx q_{j|i}$
Kullback-Leibler divergence measures the faithfulness with which $q_{j|i}$ models $p_{j|i}$.

- $P_i = \{p_{1|i}, p_{2|i}, \ldots, p_{n|i}\}$ and $Q_i = \{q_{1|i}, q_{2|i}, \ldots, q_{n|i}\}$ are the distributions on the neighbors of datapoint $i$.
- The Kullback-Leibler (KL) divergence compares two distributions:

$C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$

- KL divergence is asymmetric.
- KL divergence is always positive.
- We have our minimization problem: $\min_Y C(X, Y)$
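A direct transcription of this cost into NumPy might look as follows (the `eps` guard and the function name are my additions):

```python
import numpy as np

def sne_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) = sum_i sum_j p_{j|i} log(p_{j|i} / q_{j|i}).

    P and Q are (n, n) row-stochastic similarity matrices with zero diagonals;
    eps keeps the logarithm finite where entries are zero.
    """
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```

The asymmetry matters: a large $p_{j|i}$ modeled by a small $q_{j|i}$ is penalized heavily, while the reverse is cheap, so minimizing this cost pushes the embedding to preserve local structure first.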
Some remaining questions:

1. Why a radial basis function (exponential)?
   Focus on local geometry. This is why t-SNE can be interpreted as topology-based.
2. Why probabilities?
   Small distance does not mean proximity on the manifold. Probabilities are appropriate to model this uncertainty.
3. How do you choose $\sigma_i$?
The entropy of $P_i$ increases with $\sigma_i$.

Entropy: $H(P) = -\sum_i p_i \log_2 p_i$
Perplexity, a smooth measure of the number of neighbors.

Perplexity: $Perp(P) = 2^{H(P)}$

Examples: an entropy of 1.055 gives a perplexity of 2.078; an entropy of 3.800 gives a perplexity of 13.929.
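In practice each $\sigma_i$ is found by binary search so that the perplexity of $P_i$ matches a user-chosen value. A minimal sketch (the search bounds, tolerance, and function name are my own choices):

```python
import numpy as np

def find_sigma(sq_dists_i, target_perp, tol=1e-5, max_iter=100):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} matches target_perp.

    sq_dists_i: squared distances from point i to every other point.
    Perplexity grows monotonically with sigma, so bisection works.
    """
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = np.exp(-sq_dists_i / (2.0 * sigma**2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))  # entropy in bits
        perp = 2.0 ** H
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:  # too many effective neighbors -> shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma
```

This is why the user-facing knob of t-SNE is a perplexity value rather than a bandwidth: one search per datapoint adapts $\sigma_i$ to the local density.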
From SNE to t-SNE.

SNE:
- Modelisation: $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$, $\quad q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$
- Cost function: $C = \sum_i KL(P_i \| Q_i)$
- Derivatives: $\frac{dC}{dy_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$

Symmetric SNE:
- Modelisation: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, $\quad q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)}$
- Cost function: $C = KL(P \| Q)$
- Derivatives: $\frac{dC}{dy_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)$
- Faster computation

t-SNE:
- Modelisation: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, $\quad q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$
- Cost function: $C = KL(P \| Q)$
- Derivatives: $\frac{dC}{dy_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$
- Even faster computation
- Better behaviour
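Under the t-SNE definitions, the low-dimensional similarities and the gradient can be written compactly. A sketch (the matrix formulation and names are mine; real implementations add the optimization tricks discussed later):

```python
import numpy as np

def tsne_q_and_grad(Y, P):
    """t-SNE low-dimensional similarities q_ij and gradient dC/dy_i.

    Y: (n, l) embedding; P: (n, n) symmetric joint probabilities p_ij.
    """
    sq = np.sum(Y**2, axis=1)
    # Student-t kernel (1 + ||y_i - y_j||^2)^{-1}
    num = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()  # normalize over all pairs k != l
    # dC/dy_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    PQd = (P - Q) * num
    grad = 4.0 * (np.diag(PQd.sum(axis=1)) - PQd) @ Y
    return Q, grad
```

Note that, unlike SNE's $q_{j|i}$, the matrix $Q$ is normalized by a single sum over all pairs, which is part of what makes the symmetric formulations cheaper per iteration.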
The "Crowding Problem"

There is much more space in high dimensions.
Mismatched tails can compensate for mismatched dimensionalities.

The Student-t distribution has heavier tails.
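The heavier tail is easy to verify numerically: the ratio between the Student-t kernel with one degree of freedom (used by t-SNE in the low-dimensional space) and the Gaussian kernel (used by SNE) grows without bound as the pairwise distance increases:

```python
import numpy as np

d = np.array([1.0, 2.0, 3.0, 5.0])   # pairwise distances
gauss = np.exp(-d**2)                # Gaussian kernel (SNE)
student = 1.0 / (1.0 + d**2)         # Student-t kernel, 1 d.o.f. (t-SNE)
ratio = student / gauss              # grows monotonically with distance
```

Moderately distant points therefore receive proportionally more probability mass in the map than under a Gaussian, which lets them sit further apart and counteracts crowding.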
Last but not least: Optimization.

$\min_Y C(X, Y)$, with $C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$

- Non-convex
- Gradient descent with momentum and an adaptive learning rate:

$Y^{(t)} = Y^{(t-1)} - \eta(t) \frac{\delta C}{\delta Y} + \alpha(t)(Y^{(t-1)} - Y^{(t-2)})$

- Two tricks:
  - Early compression
  - Early exaggeration

Illustration: Colah's blog
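The update rule alone can be illustrated on a toy quadratic cost (everything here except the update rule is a stand-in: real t-SNE plugs in the KL gradient from earlier and anneals $\eta$ and $\alpha$, whereas this sketch keeps them constant):

```python
import numpy as np

def descend(grad, Y, eta=0.1, alpha=0.5, iters=200):
    """Gradient descent with momentum:
    Y(t) = Y(t-1) - eta * dC/dY + alpha * (Y(t-1) - Y(t-2))."""
    Y_prev = Y.copy()
    for _ in range(iters):
        Y_new = Y - eta * grad(Y) + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_new
    return Y

Y0 = np.array([5.0, -3.0])
Y_star = descend(lambda Y: 2.0 * Y, Y0)  # toy cost ||Y||^2, minimum at 0
```

Early exaggeration multiplies all $p_{ij}$ by a constant (e.g. 4) during the first iterations, forcing tight, well-separated clusters that the later iterations refine.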
Visualizing representations
Mapping raw data to distributed representations.

- Feature engineering is often laborious.
- A newer trend is to automatically learn adequate features or representations.
- Ultimate goal: enable AI to extract useful features from raw sensory data.

t-SNE can be used to make sense of the learned representations!
Using t-SNE to explore a word embedding.

- The system outputs 1 if the central word is in the right context, 0 otherwise.
- The algorithm learns the representation and the classification simultaneously.

From Machine Learning to Machine Reasoning, L. Bottou (2011)

Goal: the representation captures syntactic and semantic similarity.
Using t-SNE to explore a word embedding. (Visualization: http://colah.github.io/)
Exploring a Wikipedia article embedding. (Visualization: http://colah.github.io/)
Exploring game state representations: Google DeepMind plays Atari games.

Playing Atari with Deep Reinforcement Learning, V. Mnih et al.

Goal: learning to play Space Invaders from score feedback and raw pixel values.
Exploring game state representations: Google DeepMind plays Atari games.

- A representation is learned with a convolutional neural network.
- From 84×84×4 = 28,224 pixel values to 512 neurons.
- Predicts the expected score if a certain action is taken.

Human-level control through deep reinforcement learning, V. Mnih et al. (Nature, 2015)
Using t-SNE to explore image representations: classifying dogs and cats.

- Each data point is an image of a dog or a cat.
- Red: cats; blue: dogs.
Using t-SNE to explore image representations: classifying dogs and cats.

Representation: a convolutional net trained for image classification (1000 classes).
Conclusion

- The t-SNE algorithm reduces dimensionality while preserving local similarity.
- The t-SNE algorithm was built heuristically.
- t-SNE is commonly used to visualize representations.
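For reference, this is roughly how t-SNE is invoked off the shelf with scikit-learn (assuming scikit-learn is installed; the dataset and parameter choices here are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:300]  # 300 images of 8x8 = 64 raw pixel features
Y = TSNE(n_components=2, perplexity=30.0,
         init="pca", random_state=0).fit_transform(X)
# Y is a (300, 2) embedding, ready for a scatter plot colored by digit label
```

The `perplexity` parameter is exactly the smooth neighbor count discussed above, and is the main knob worth tuning in practice.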