On the Variational Posterior of Dirichlet Process Deep Latent Gaussian Mixture Models


Amine Echraibi (1, 2), Joachim Flocon-Cholet (1), Stéphane Gosselin (1), Sandrine Vaton (2)

(1) Orange Labs, Lannion, France. (2) Institut Mines-Télécom Atlantique, Brest, France.
Correspondence to: Amine Echraibi <amine.echraibi@orange.com>, Sandrine Vaton <sandrine.vaton@imt-atlantique.fr>, Joachim Flocon-Cholet <joachim.floconcholet@orange.com>, Stéphane Gosselin <stephane.gosselin@orange.com>.

Second workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2020), Virtual Conference.

Abstract

Thanks to the reparameterization trick, deep latent Gaussian models have recently shown tremendous success in learning latent representations. The ability to couple them with nonparametric priors such as the Dirichlet Process (DP), however, has not seen similar success, due to the non-parameterizable nature of the DP. In this paper, we present an alternative treatment of the variational posterior of the Dirichlet Process Deep Latent Gaussian Mixture Model (DP-DLGMM), in which we show that the prior cluster parameters and the variational posteriors of the beta distributions and cluster hidden variables can be updated in closed form. This leads to a standard reparameterization trick on the Gaussian latent variables knowing the cluster assignments. We demonstrate our approach on standard benchmark datasets: we show that our model is capable of generating realistic samples for each cluster obtained, and that it shows competitive performance in a semi-supervised setting.

1. Introduction

Nonparametric Bayesian priors, such as the Dirichlet Process (DP), have been widely adopted in the probabilistic graphical modeling community. Their ability to generate an infinite number of probability distributions using a discrete latent variable makes them ideally suited for automatic model selection. The most famous applications of the DP have, however, been limited to classical probabilistic graphical models such as Dirichlet Process Mixture Models and Hierarchical Dirichlet Process Hidden Markov Models (Blei et al., 2006; Fox et al., 2008; Zhang et al., 2016).

Recently, deep generative models such as Deep Latent Gaussian Models (DLGMs) and Variational AutoEncoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) have shown huge success in modeling and generating complex data structures such as images. Various proposals to generalize these models to the mixture and nonparametric mixture cases have been made (Nalisnick et al., 2016; Nalisnick & Smyth, 2016; Dilokthanakul et al., 2016; Jiang et al., 2016). Introducing such priors on top of the deep generative model can improve its generative capabilities, preserve class structure in the latent representation space, and offer a nonparametric way of performing model selection with respect to the size of the generative model.

The main challenge posed by such models lies in the inference process. Deep generative models with continuous latent variables owe their success mainly to the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014). This approach provides an efficient and scalable method for obtaining low-variance estimates of the gradient of the variational lower bound with respect to the variational posterior parameters. Applying this approach directly to the variational posterior of the DP is not straightforward, due to the fact that a reparameterization trick for the beta distributions is hard to obtain (Ruiz et al., 2016).
One approach to bypass this issue has been proposed by (Nalisnick & Smyth, 2016), where the authors use the Kumaraswamy distribution (Kumaraswamy, 1980) as a higher-entropy alternative to the beta distribution in the variational posterior. However, by deriving the nature of the variational posterior directly from the variational lower bound, we can show that the appropriate distribution is in fact the beta distribution.

In this paper we provide an alternative treatment of the variational posterior of the DP-DLGMM, where we combine classical variational inference, used to derive the variational posteriors of the beta distributions and the cluster hidden variables, with neural variational inference for the hidden variables of the latent Gaussian model. This leads to gradient ascent updates over the parameters present in nonlinear transformations, where the reparameterization trick can be applied knowing the cluster assignment. As for the remaining parameters, closed-form solutions can be obtained by maximization of the evidence lower bound.

2. Dirichlet Process Deep Latent Gaussian Mixture Models

Generalizing deep latent Gaussian models to the Dirichlet process mixture case can be obtained by adding a Dirichlet process prior on the hidden cluster assignments. We denote these cluster assignments by z. Following the assignment of a cluster hidden variable, a deep latent Gaussian model is defined for the assigned cluster, similar to (Rezende et al., 2014). We adopt the stick-breaking construction of the Dirichlet Process (Sethuraman, 1994). The generative process of the model (Figure 1) is given, for n ∈ {1, ..., N}, by:

\beta_k \sim \mathrm{Beta}(\cdot\,; 1, \eta), \qquad \pi_k(\beta) = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad z_n \sim \mathrm{Cat}(\cdot \mid \pi(\beta)),

h_n^{(L)} \sim \mathcal{N}\big(\cdot\,;\, m_{z_n}^{(L)},\, \mathrm{diag}[(s_{z_n}^{(L)})^2]\big),

h_n^{(l)} = f_{W_{z_n}^{(l)}}\big(h_n^{(l+1)}\big) + s_{z_n}^{(l)} \odot \epsilon_n^{(l)}, \qquad \epsilon_n^{(l)} \sim \mathcal{N}(\cdot\,; 0, I), \qquad l \in \{1, \dots, L-1\},

x_n \sim p_X\big(\cdot \mid f_{W_{z_n}^{(0)}}(h_n^{(1)})\big).

Figure 1. The graphical representation of the generative process of the model, with the convention x = h^{(0)}.

Here h_n^{(l)} ∈ R^{p_l} is the l-th layer hidden representation, constructed using a nonlinear transformation f_{W_{z_n}^{(l)}} represented by a neural network for the cluster assignment z_n. For simplicity, we consider diagonal covariance matrices for each layer, where the diagonal elements are (s_{z_n,j}^{(l)})^2 for 1 ≤ j ≤ p_l; ⊙ hence represents the element-wise product. The generalization to full covariance matrices is straightforward using the Cholesky decomposition.

We denote by η the concentration parameter of the Dirichlet process, which is a hyperparameter to be tuned manually. The term p_X represents the emission distribution of the observable x_n, usually chosen to be a normal distribution for continuous variables or the Bernoulli distribution for binary variables. We denote the parameters of the generative model by:

\Theta = \big\{ m_{1:\infty}^{(L)},\; s_{1:\infty}^{(L)},\; W_{1:\infty}^{(0:L-1)},\; s_{1:\infty}^{(1:L-1)} \big\}.

The model thus has an infinite number of parameters due to the Dirichlet process prior. Furthermore, the posterior distribution of the hidden variables cannot be computed in closed form. In order to perform inference on the model we need to use approximate methods such as Markov Chain Monte Carlo (MCMC) or variational inference. MCMC methods are not suitable for high-dimensional models such as the DP-DLGMM, where convergence of the Markov chain to the true posterior can prove to be slow and hard to diagnose (Blei et al., 2017).

In the next section, we develop a structured variational inference algorithm for the DP-DLGMM. We show that by choosing a suitable structure for the variational posterior, closed-form solutions can be obtained for the updates of the truncated variational posteriors of the beta distributions, the variational posteriors of the cluster hidden variables, and the optimal prior parameters {m^{(L)}, s^{(L)}} maximizing the evidence lower bound.
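To make the construction concrete, here is a minimal NumPy sketch of ancestral sampling from this generative process, truncated at a finite number of clusters. The truncation level T, the layer sizes, the tanh nonlinearity, the single intermediate stochastic layer, and the Bernoulli emission are illustrative assumptions, not choices fixed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(T, eta):
    """pi_k(beta) = beta_k * prod_{l<k} (1 - beta_l), truncated at level T."""
    beta = rng.beta(1.0, eta, size=T)
    beta[-1] = 1.0  # truncation: the last stick takes all remaining mass
    return beta * np.concatenate(([1.0], np.cumprod(1.0 - beta[:-1])))

# Hypothetical sizes: x in {0,1}^20, one intermediate layer h^(1) in R^10, top layer h^(L) in R^5.
T, eta, p_x, p1, pL = 5, 1.0, 20, 10, 5
params = {
    "m_L": rng.normal(size=(T, pL)),            # per-cluster prior mean of h^(L)
    "s_L": np.ones((T, pL)),                    # per-cluster prior std of h^(L)
    "W1":  0.1 * rng.normal(size=(T, p1, pL)),  # f_{W_z^(1)}: maps h^(L) to the mean of h^(1)
    "s1":  0.1 * np.ones((T, p1)),              # per-cluster noise scale of layer 1
    "W0":  0.1 * rng.normal(size=(T, p_x, p1)), # f_{W_z^(0)}: maps h^(1) to the emission logits
}

def sample_one():
    """Ancestral sampling of (x_n, z_n) from the truncated DP-DLGMM prior."""
    pi = stick_breaking_weights(T, eta)
    z = rng.choice(T, p=pi)                                       # z_n ~ Cat(pi(beta))
    hL = params["m_L"][z] + params["s_L"][z] * rng.standard_normal(pL)
    h1 = np.tanh(params["W1"][z] @ hL) + params["s1"][z] * rng.standard_normal(p1)
    logits = params["W0"][z] @ h1
    x = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))            # Bernoulli emission p_X
    return x, z

x_sample, z_sample = sample_one()
```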
3. Structured Variational Inference

For a brief review of variational methods, we denote by x_{1:N} the N samples present in the dataset, supposed to be independent and identically distributed. The log-likelihood of the model is intractable due to the required marginalization of all the hidden variables. In order to bypass this marginalization, we introduce an approximate distribution q_Φ and use Jensen's inequality to obtain a lower bound (Jordan et al., 1999):

\ell(\Theta) = \ln p_\Theta(x_{1:N}) = \ln \Big[ \sum_{z_{1:N}} \int p_\Theta\big(x_{1:N}, z_{1:N}, h_{1:N}^{(1:L)}, \beta\big)\, dh_{1:N}^{(1:L)}\, d\beta \Big]
\geq \mathbb{E}_{z_{1:N}, h_{1:N}^{(1:L)}, \beta \sim q_\Phi}\Big[ \ln \frac{p_\Theta\big(x_{1:N}, z_{1:N}, h_{1:N}^{(1:L)}, \beta\big)}{q_\Phi\big(z_{1:N}, h_{1:N}^{(1:L)}, \beta \mid x_{1:N}\big)} \Big] \triangleq \mathcal{L}(\Theta, \Phi).   (1)

We can show that if the distribution q_Φ is a good approximation of the true posterior, maximizing the evidence lower bound (ELBO) with respect to the model parameters Θ is equivalent to maximizing the log-likelihood.
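As a brief reminder of why this holds (a standard identity in variational inference, not specific to the DP-DLGMM), the gap between the log-likelihood and the ELBO is exactly the Kullback-Leibler divergence from the variational posterior to the true posterior:

\ln p_\Theta(x_{1:N}) = \mathcal{L}(\Theta, \Phi) + D_{\mathrm{KL}}\big[ q_\Phi\big(z_{1:N}, h_{1:N}^{(1:L)}, \beta \mid x_{1:N}\big) \,\big\|\, p_\Theta\big(z_{1:N}, h_{1:N}^{(1:L)}, \beta \mid x_{1:N}\big) \big],

so whenever the KL term is small, ascending L(Θ, Φ) in Θ also ascends the log-likelihood.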

For deep generative models, most state-of-the-art methods use inference networks to construct the posterior distribution (Rezende et al., 2014; Nalisnick & Smyth, 2016). For deep mixture models with discrete latent variables, this approach leads to a mixture density variational posterior, where the reparameterization trick requires additional investigation (Graves, 2016). Our approach combines standard variational Bayes and neural variational inference. We approximate the true posterior using the following structured variational posterior:

q_\Phi\big(z_{1:N}, h_{1:N}^{(1:L)}, \beta \mid x_{1:N}\big) = \prod_{n=1}^{N} \Big[ q_{\phi_n}(z_n \mid x_n) \prod_{l=1}^{L} q_{\psi_{z_n}^{(l)}}\big(h_n^{(l)} \mid x_n, z_n\big) \Big] \prod_{t=1}^{T} q_{\gamma_t}(\beta_t \mid x_{1:N}),   (2)

where T is a truncation level for the variational posterior of the beta distributions, obtained by supposing that q(β_T = 1) = 1 (Blei et al., 2006). We assume a factorized posterior over the hidden layers h_n^{(1:L)}, where the intra-layer dependencies are conserved.

Deriving the nature of the posterior distributions of the hidden layers h_n^{(1:L)} using the variational approach is intractable due to the nonlinearities present in the model. Thus, we take a similar approach to (Rezende et al., 2014), and we assume that the variational posterior is specified by an inference network, where the parameters of the distribution are the outputs of deep neural networks µ_{ψ_t^{(l)}} and Σ_{ψ_t^{(l)}} of parameters ψ_t^{(l)} for the l-th layer and the t-th cluster:

q_{\psi_t^{(l)}}\big(h_n^{(l)} \mid x_n, z_n = t\big) = \mathcal{N}\big(h_n^{(l)};\, \mu_{\psi_t^{(l)}}(x_n),\, \Sigma_{\psi_t^{(l)}}(x_n)\big).

3.1. Deriving the variational posteriors q_{φ_n} and q_{γ_t}

In contrast to the hidden layers, we can use the proposed variational posterior of equation (2) to derive closed-form solutions for q_{φ_n} and q_{γ_t}. Let us consider the Kullback-Leibler definition of the ELBO L:

\mathcal{L}(\Theta, \Phi) = -D_{\mathrm{KL}}\big[ q_\Phi(\cdot \mid x_{1:N}) \,\big\|\, p_\Theta(\cdot, x_{1:N}) \big].

By plugging in the variational posterior and isolating the β_t terms and the z_n terms, we can analytically derive the optimal distributions q_{γ_t} and q_{φ_n} maximizing L:

q_{\gamma_t}(\beta_t \mid x_{1:N}) = \mathrm{Beta}(\beta_t;\, \gamma_{1,t}, \gamma_{2,t}), \qquad q_{\phi_n}(z_n \mid x_n) = \mathrm{Cat}(z_n;\, \phi_n),

where the fixed point equations for the variational parameters γ_t and φ_n are:

\gamma_{1,t} = 1 + \sum_{n=1}^{N} \phi_{n,t},   (3)

\gamma_{2,t} = \eta + \sum_{n=1}^{N} \sum_{r=t+1}^{T} \phi_{n,r},   (4)

\ln \phi_{n,t} = \mathrm{const} + \mathbb{E}_{\beta \sim q}[\ln \pi_t(\beta)] + \mathbb{E}_{h_n^{(1:L)} \sim q_{\psi_t^{(1:L)}}}\big[\ln p_X\big(x_n, h_n^{(1:L)} \mid z_n = t\big)\big] + \sum_{l} H\big[q_{\psi_t^{(l)}}(\cdot \mid z_n = t, x_n)\big], \quad \text{s.t.} \sum_{t=1}^{T} \phi_{n,t} = 1.   (5)

The fixed point equation (5) for φ_{n,t} requires the evaluation of the expectation over the hidden layers; this can be performed by sampling from the variational posterior of each hidden layer and then forwarding the sample through the generative model:

\mathbb{E}_{h_n^{(1:L)} \sim q_{\psi_t^{(1:L)}}}\big[\ln p_X\big(x_n, h_n^{(1:L)} \mid z_n = t\big)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \ln p_X\big(x_n, h_{n,t}^{(1:L)(s)} \mid z_n = t\big), \quad \text{where } h_{n,t}^{(l)(s)} \sim q_{\psi_t^{(l)}}\big(h_n^{(l)} \mid x_n, z_n = t\big).   (6)

A key insight here is the following: if a cluster t is incapable of reconstructing a sample x_n from the variational posterior, this will reinforce the belief that x_n should not be assigned to that cluster. Furthermore, the estimation of the expectation can be performed using the same reparameterization trick that we will develop in section 3.3.

3.2. Closed-form updates for m_{1:T}^{(L)} and s_{1:T}^{(L)}

In addition to the variational posteriors of the beta distributions and the cluster assignments, closed-form solutions can be obtained for the updates of m_{1:T}^{(L)} and s_{1:T}^{(L)}. Let us reconsider the evidence lower bound of equation (1), where we isolate only the terms dependent on the prior parameters. We have:

\mathcal{L}\big(m_{1:T}^{(L)}, s_{1:T}^{(L)}\big) = \mathrm{const} - \sum_{n,t} \phi_{n,t}\, D_{\mathrm{KL}}\big[ \mathcal{N}\big(\mu_{\psi_t^{(L)}}(x_n), \Sigma_{\psi_t^{(L)}}(x_n)\big) \,\big\|\, \mathcal{N}\big(m_t^{(L)}, V_t^{(L)}\big) \big],

where V_t^{(L)} = \mathrm{diag}\big[(s_{t,j}^{(L)})^2\big]_{1 \leq j \leq p_L} represents the covariance matrix of the L-th layer. By setting the derivative of L with respect to the parameters to zero, we obtain:

m_t^{(L)} = \frac{1}{N_t} \sum_{n=1}^{N} \phi_{n,t}\, \mu_{\psi_t^{(L)}}(x_n), \qquad N_t = \sum_{n=1}^{N} \phi_{n,t},   (7)

V_t^{(L)} = \frac{1}{N_t} \sum_{n=1}^{N} \phi_{n,t}\, I \odot \Big\{ \Sigma_{\psi_t^{(L)}}(x_n) + \big(\mu_{\psi_t^{(L)}}(x_n) - m_t^{(L)}\big)\big(\mu_{\psi_t^{(L)}}(x_n) - m_t^{(L)}\big)^{T} \Big\},   (8)

where, to extract the diagonal elements, we perform an element-wise multiplication by the identity matrix I. The update rules obtained are similar to the M-step of a classical Gaussian Mixture Model, except that in this case the updates are performed on the last hidden layer of the generative model, and the E-step of equation (5) takes into account all the hidden layers. Detailed derivations of the previous equations are presented in the supplementary material.
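For concreteness, here is a minimal NumPy sketch of the fixed-point updates (3)-(5) and the closed-form updates (7)-(8). It assumes the expectation and entropy terms of (5) have already been estimated, for instance with the Monte Carlo estimator (6), and that all covariances are diagonal; the array conventions (`phi` of shape N x T, `mu_L`/`var_L` of shape N x T x p_L) are illustrative. The digamma expressions for E_q[ln π_t(β)] follow from the Beta posteriors q_{γ_t}.

```python
import numpy as np
from scipy.special import digamma

def update_gamma(phi, eta):
    """Eqs. (3)-(4): Beta posterior parameters from the responsibilities phi of shape (N, T)."""
    s = phi.sum(axis=0)                                  # sum_n phi_{n,t}
    gamma1 = 1.0 + s                                     # eq. (3)
    gamma2 = eta + (np.cumsum(s[::-1])[::-1] - s)        # eq. (4): eta + sum_n sum_{r>t} phi_{n,r}
    return gamma1, gamma2

def expected_log_pi(gamma1, gamma2):
    """E_q[ln pi_t(beta)] = E[ln beta_t] + sum_{r<t} E[ln(1 - beta_r)] under the Beta posteriors."""
    e_log_beta = digamma(gamma1) - digamma(gamma1 + gamma2)
    e_log_1m = digamma(gamma2) - digamma(gamma1 + gamma2)
    return e_log_beta + np.concatenate(([0.0], np.cumsum(e_log_1m[:-1])))

def update_phi(e_loglik, entropy, gamma1, gamma2):
    """Eq. (5): responsibilities, normalized over the T clusters in log space.
    e_loglik and entropy are (N, T) arrays holding the expectation and entropy terms."""
    log_phi = expected_log_pi(gamma1, gamma2) + e_loglik + entropy
    log_phi -= log_phi.max(axis=1, keepdims=True)
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)

def update_cluster_prior(phi, mu_L, var_L):
    """Eqs. (7)-(8): closed-form updates of the top-layer prior, with diagonal covariances.
    mu_L, var_L are (N, T, p_L) posterior means and diagonal variances from the inference networks."""
    Nt = phi.sum(axis=0)                                             # effective cluster sizes
    m = np.einsum("nt,ntj->tj", phi, mu_L) / Nt[:, None]             # eq. (7)
    V = np.einsum("nt,ntj->tj", phi, var_L + (mu_L - m) ** 2) / Nt[:, None]  # eq. (8), diagonal
    return m, V
```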

3.3. Stochastic Backpropagation

We next show how to perform stochastic backpropagation in order to maximize L with respect to the parameters ψ and Λ = {W_{1:T}^{(0:L-1)}, s_{1:T}^{(1:L-1)}}. Similarly to the previous section, we isolate the terms in the evidence lower bound dependent on ψ and Λ. We have:

\mathcal{L}(\psi, \Lambda) = \mathrm{const} + \sum_{n,t} \phi_{n,t} \Big\{ \mathbb{E}_{h_n^{(1:L)} \sim q_{\psi_t^{(1:L)}}}\big[\ln p_X\big(x_n, h_n^{(1:L)} \mid z_n = t\big)\big] + \sum_{l} H\big[q_{\psi_t^{(l)}}(\cdot \mid z_n = t, x_n)\big] \Big\}.   (9)

By taking the expectation over the hidden cluster variables z_n, we obtain conditional expectations over the hidden layers h_n^{(1:L)} knowing the cluster assignment. In order to backpropagate gradients with respect to Λ and ψ, it suffices to perform a reparameterization trick for each cluster assignment at each hidden layer (proof in Appendix A). We can achieve this by sampling

\epsilon_{n,t}^{(l)} \sim \mathcal{N}(0, I);

a sample from the posterior of the l-th hidden layer can then be obtained by the following transformation:

h_{n,t}^{(l)} = \mu_{\psi_t^{(l)}}(x_n) + \epsilon_{n,t}^{(l)} \odot \sqrt{\Sigma_{\psi_t^{(l)}}(x_n)},

where Σ_{ψ_t^{(l)}}(x_n) is supposed to be a diagonal matrix for simplicity. Following the previous analysis, we can derive an algorithm to perform inference on the proposed model, where between iterations of the fixed-point update steps, E epochs of gradient ascent are performed to obtain a local maximum of the ELBO with respect to Λ and ψ. Algorithm 1 summarizes the process.

Algorithm 1  Variational Inference for the DP-DLGMM
  Input: x_{1:N}, T, η, α
  Initialize φ, Λ, ψ
  while not converged do
    update: γ_t, ∀t  {(3), (4)}
    update: m_t^{(L)}, ∀t  {(7)}
    update: V_t^{(L)}, ∀t  {(8)}
    for each epoch do
      Λ ← Λ + α ∇_Λ L
      ψ ← ψ + α ∇_ψ L
    end for
    update: φ_{n,t}, ∀n, t  {(5)}
  end while

4. Semi-Supervised Learning (SSL)

4.1. SSL using the DP-DLGMM

In this section, similarly to (Kingma et al., 2014), we consider a partially labeled dataset x_{1:N} = D_l ∪ D_u, where D_l = {x_n, y_n}_n is the labeled part, y_n represents the label of the sample x_n, and D_u represents the unlabeled part. The log-likelihood can be divided over the labeled and unlabeled parts as:

\ell(\Theta) = \sum_{x_n \in D_l} \ln p_\Theta(x_n, z_n = y_n) + \sum_{x_n \in D_u} \ln p_\Theta(x_n),

so that the approach of section 3 carries over: the fixed-point updates and the gradient ascent steps remain unchanged if we set φ_{n,y_n} = 1 for a labeled sample x_n.
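To illustrate how the gradient steps in Algorithm 1 can be implemented, the sketch below estimates the responsibility-weighted reconstruction part of equation (9) with the per-cluster reparameterization trick, in PyTorch. It is a sketch under simplifying assumptions: a single stochastic layer, Bernoulli emissions, and hypothetical `encoders`/`decoders` lists holding the per-cluster inference and generative networks; the top-layer prior term and the entropy terms of (9) are omitted. For a labeled sample in the semi-supervised setting of section 4.1, `phi[n]` is simply a one-hot vector at y_n.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """h^(l)_{n,t} = mu_{psi_t}(x_n) + eps * sigma_{psi_t}(x_n), eps ~ N(0, I) (diagonal covariance)."""
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def weighted_recon_term(x, phi, encoders, decoders):
    """Monte Carlo estimate of sum_{n,t} phi_{n,t} E_q[ln p_X(x_n | h, z_n = t)], i.e. the
    reconstruction part of eq. (9), for a single stochastic layer and Bernoulli emissions."""
    total = x.new_zeros(())
    for t, (enc_t, dec_t) in enumerate(zip(encoders, decoders)):
        mu, logvar = enc_t(x)                      # inference network mu_{psi_t}, Sigma_{psi_t}
        h = reparameterize(mu, logvar)             # reparameterization knowing z_n = t
        logits = dec_t(h)                          # generative network for cluster t
        log_px = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
        total = total + (phi[:, t] * log_px).sum() # responsibility-weighted sum over samples
    return total                                   # ascend its gradient w.r.t. psi and Lambda
```

In Algorithm 1, E epochs of gradient ascent on this term (together with the omitted prior and entropy terms) are interleaved with the fixed-point updates of γ, m^{(L)}, V^{(L)}, and φ.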
4.2. The predictive distribution

In order to make predictions for a new sample x_{N+1}, we need to evaluate the predictive distribution of its cluster assignment, p(z_{N+1} = k | x_{1:N+1}), which involves an intractable marginalization over the remaining hidden variables. However, similarly to (Blei et al., 2006), we can use the variational posterior to approximate the true posterior, which in turn leads to simpler expectation terms:

p(z_{N+1} = k \mid x_{1:N+1}) \propto p(z_{N+1} = k, x_{N+1} \mid x_{1:N}) \approx \mathbb{E}_{\beta \sim q}\big[\pi_k(\beta)\big]\; \mathbb{E}_{h_{N+1}^{(1:L)} \sim q_{\psi_k^{(1:L)}}}\Big[ p_X\big(x_{N+1} \mid f_\Lambda(h_{N+1}^{(1:L)}), z_{N+1} = k\big) \Big],   (10)

where f_Λ(·) represents the forward pass over the generative model. The expectation with respect to the beta terms can be computed in closed form as a product of expectations over the beta posteriors. The second expectation can be evaluated using the Monte Carlo estimator of equation (6).
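A minimal NumPy sketch of this prediction rule is given below, under illustrative assumptions: `sample_h(x, k)` and `log_px(x, h, k)` are hypothetical placeholders standing for the trained per-cluster inference network and the generative forward pass f_Λ, and the Monte Carlo average is taken in log space for numerical stability.

```python
import numpy as np

def expected_pi(gamma1, gamma2):
    """Closed-form E_q[pi_k(beta)] = E[beta_k] * prod_{r<k} E[1 - beta_r] (independent Beta posteriors)."""
    e_beta = gamma1 / (gamma1 + gamma2)
    e_one_minus = gamma2 / (gamma1 + gamma2)
    return e_beta * np.concatenate(([1.0], np.cumprod(e_one_minus[:-1])))

def predict_cluster(x_new, gamma1, gamma2, sample_h, log_px, S=50):
    """Eq. (10): combine E_q[pi_k] with a Monte Carlo estimate of E_q[p_X(x_new | f_Lambda(h), k)].
    sample_h(x, k) draws h from q_{psi_k}(. | x); log_px(x, h, k) evaluates ln p_X."""
    log_scores = np.log(expected_pi(gamma1, gamma2))
    for k in range(len(gamma1)):
        samples = np.array([log_px(x_new, sample_h(x_new, k), k) for _ in range(S)])
        # log of the Monte Carlo average of p_X, computed stably in log space
        log_scores[k] += np.logaddexp.reduce(samples) - np.log(S)
    return np.exp(log_scores - np.logaddexp.reduce(log_scores))   # normalized over the T clusters
```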

5. Experiments

5.1. Evaluation of the semi-supervised classification

We evaluate the semi-supervised classification capabilities of the model. We train our DP-DLGMM model on the MNIST dataset (LeCun & Cortes, 2010) with train-valid-test splits equal to {45000, 5000, 10000}, similarly to (Nalisnick & Smyth, 2016), with 10% of the labels provided, drawn at random. We run the process for 5 iterations, and we evaluate our model on the test set. We report the mean and standard deviation of the classification error, in percentages, in Table 1. Our method produces a score competitive with existing state-of-the-art methods: Deep Generative Models (DGM) (Kingma et al., 2014) and Stick-Breaking Deep Generative Models (SB-DGM) (Nalisnick & Smyth, 2016). Unlike the previous approaches, the loss was not up-weighted for the labeled samples. Figure 2 shows the t-SNE projections (Maaten & Hinton, 2008) obtained with 10% of the labels provided. We notice that by introducing a small fraction of labels, the class structure is highly preserved in the latent space.

Figure 2. t-SNE plot of the second stochastic hidden layer on the MNIST test set for the semi-supervised (10% labels) version of the DP-DLGMM.

Figure 3. t-SNE plot of the second stochastic hidden layer on the MNIST test set for the unsupervised version of the DP-DLGMM.

Table 1. Semi-supervised classification error (%) on the MNIST test set with 10% labelisation. Comparison with (Nalisnick & Smyth, 2016).

  kNN (k=5):  6.13 ± .13
  DGM:        4.86 ± .14
  SB-DGM:     3.95 ± .15
  DP-DLGMM:   2.90 ± .17

5.2. Data generation and visualization

To further test our model, we generate samples for each cluster from the models trained on both the MNIST and SVHN (Netzer et al., 2011) datasets. The MNIST model is trained in an unsupervised manner, and the SVHN model is trained with semi-supervision, where we provide 1000 randomly drawn labels. The samples obtained are shown in Figure 4. For the unsupervised model, we notice that the clusters are representative of the shape of each digit. We plot the t-SNE projections of the MNIST test set of the unsupervised model in Figure 3. We notice that digits belonging to the same true class tend to group with each other. However, two groups of the same class can be very separated in the embedding space. The interpretation we can draw from this effect is that the DP-DLGMM tends to separate the latent space in order to distinguish between the variations of hidden representations of the same class. The clusters obtained are not always representative of the true classes, which is a common effect with infinite mixture models. In a fully unsupervised setting, data can be explained by multiple correct clusterings. This effect can simply be countered by adding a small amount of supervision (Figure 2).

Figure 4. Generated samples from the DP-DLGMM model for the unsupervised version on the MNIST dataset (left) and the semi-supervised version on the SVHN dataset (right).

6. Conclusion

In this paper, we have presented a variational inference method for Dirichlet Process Deep Latent Gaussian Mixture Models. Our approach combines classical variational inference and neural variational inference. The algorithm derived is thus a standard variational inference algorithm, with fixed-point updates over a subset of the parameters presenting linear dependencies. The parameters present in nonlinear transformations are updated using standard gradient ascent, where the reparameterization trick can be applied for the variational posterior of the stochastic hidden layers knowing the cluster assignments. Our approach shows promising results both for the unsupervised and semi-supervised cases. In future work, stochastic variational inference can be explored to speed up the training procedure. Our approach can also be generalized to other types of deep probabilistic graphical models.

A. Proof of the reparameterization trick knowing the cluster assignment

The evidence lower bound of our model can be written as a sum of expectations conditioned on the cluster assignments (equation (9)); using the density transformation lemma,

\mathcal{N}\big(h;\, \mu_{\psi_t}(x),\, \sigma_{\psi_t}(x)^2 I\big)\, dh = \mathcal{N}(\epsilon;\, 0, I)\, d\epsilon \quad \text{with} \quad h = \mu_{\psi_t}(x) + \sigma_{\psi_t}(x) \odot \epsilon,

we have:
