Stochastic Variational Inference for Dynamic Correlated Topic Models


Federico Tomasi (federicot@spotify.com), Spotify
Praveen Ravichandran (praveenr@spotify.com), Spotify
Mounia Lalmas (mounial@spotify.com), Spotify
Gal Levy-Fix (gal.levy-fix@columbia.edu), Columbia University
Zhenwen Dai (zhenwend@spotify.com), Spotify

(The work was done as part of an internship at Spotify.)

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

Abstract

Correlated topic models (CTM) are useful tools for the statistical analysis of documents. They explicitly capture the correlation between topics associated with each document. We propose an extension to CTM that models the evolution of both topic correlation and word co-occurrence over time. This allows us to identify changes in topic correlations over time; for example, in the machine learning literature, the correlation between the topics "stochastic gradient descent" and "variational inference" increased in the last few years due to advances in stochastic variational inference methods. Our temporal dynamic priors are based on Gaussian processes (GPs), allowing us to capture diverse temporal behaviours such as smooth, with long-term memory, temporally concentrated, and periodic. The evolution of topic correlations is modelled through generalised Wishart processes (GWPs). We develop a stochastic variational inference method, which enables us to handle large sets of continuous temporal data. Our experiments on real-world data demonstrate that our model can be used to effectively discover temporal patterns of topic distributions, words associated with topics, and topic relationships.

1 INTRODUCTION

Topic models (Blei et al., 2003) are a popular class of tools to automatically analyse large sets of categorical data, including text documents or other data that can be represented as bags of words, such as images. Topic models have been widely used in various domains, e.g., information retrieval (Blei et al., 2007; Mehrotra et al., 2013; Balikas et al., 2016), computational biology (Zhao et al., 2014; Gopalan et al., 2016; Kho et al., 2017), recommender systems (Liang et al., 2017) and computer vision (Fei-Fei & Perona, 2005; Kivinen et al., 2007; Chong et al., 2009). In the original topic model by Blei et al. (2003), also known as Latent Dirichlet Allocation (LDA), the words in a document come from a mixture of topics, where each topic is defined as a distribution over a vocabulary. The variations in the mixtures of topics across documents are captured by a Dirichlet distribution. However, a limitation is that it does not model the correlation in the co-occurrence of topics. To overcome this limitation, Blei & Lafferty (2006a) proposed the correlated topic model (CTM), which extends LDA with a correlated prior distribution for mixtures of topics.

An important piece of information associated with a textual document is when the document was written. For human writings, the meanings of topics, their popularity and the correlations among topics all evolve over time. Modelling such evolution is very important for understanding the topics in a collection of documents spanning a period of time. For example, consider the topic machine learning. The distribution of the words associated with it has been gradually changing over the years, first revolving around neural networks, then shifting towards support vector machines and kernel methods, and finally returning to neural networks and deep learning. In addition, due to this evolution of meaning, the topic machine learning probably correlates increasingly with high performance computing and GPU following the emergence of deep learning.

In this paper, we propose the dynamic correlated topic model (DCTM), which allows us to learn the temporal dynamics of all the relevant components in CTM. To model the evolution of the meanings of topics, we construct a temporal prior distribution for topic representations, which is derived from a set of Gaussian processes (GPs). This enables us to handle documents in continuous time and to interpolate and extrapolate the topic representations at unseen time points.

In CTM, the prior distribution for mixtures of topics is derived from a multivariate normal distribution, in which the mean encodes the popularity of individual topics while the covariance matrix encodes the co-occurrence of topics. We extend the prior for mixtures of topics into a dynamic distribution by providing a set of GPs as the prior for the mean, and a generalised Wishart process (GWP) as the prior for the covariance matrices. With DCTM, apart from assuming that the individual documents at a given time point are independently sampled, we can jointly model the evolution of the representations of topics, the popularity of topics and their correlations.

A major challenge in applying topic models to real-world applications is the scalability of the inference methods. A large group of topic models come with inference methods based on Markov chain Monte Carlo (often Gibbs sampling in particular), which are hard to apply to corpora of millions of documents. To allow the model to deal with large datasets, we develop a stochastic variational inference method for DCTM. To enable mini-batch training, we use a deep neural network to encode the variational posterior of the mixtures of topics for individual documents. For the GPs and the generalised Wishart process, we augment the model with auxiliary variables as in stochastic variational GP (Hensman et al., 2013) to derive a scalable variational lower bound. As the final lower bound is intractable, we marginalise the discrete latent variables and apply a Monte Carlo sampling approximation with the reparameterisation trick, which allows us to obtain a low-variance estimate for the gradients.

The main contributions of this paper are as follows:
- We propose a fully dynamic version of CTM, which allows us to model the evolution of the representations of topics, topic popularity and their correlations.
- We derive a stochastic variational inference method for DCTM, which enables mini-batch training and is scalable to millions of documents.

Outline. Section 2 discusses related work. Section 3 presents our novel contribution and the generalised dynamic correlated topic model. Section 4 describes an efficient variational inference procedure for our model, built on top of sparse Gaussian processes. Section 5 presents our experiments and validation of the model on real data. Section 6 concludes with a discussion and future research directions.

2 RELATED WORK

Static Topic Models. LDA was proposed by Blei et al. (2003) as a technique to infer a mixture of topics starting from a collection of documents. Each topic is a probability distribution over a vocabulary, and each topic is assumed to be independent of the others. However, such an independence assumption usually does not hold in real-world scenarios, in particular when the number of topics is large. The CTM (Blei & Lafferty, 2006a) relaxes this assumption, allowing us to infer correlated topics through the use of a logistic normal distribution. Similar models have been proposed with modifications to the prior distribution of the topics, in particular using a Gaussian process to model topic proportions while keeping topics static (Agovic & Banerjee, 2010; Hennig et al., 2012). However, the static nature of such models makes them unsuitable for modelling topics in a set of documents ordered by an evolving index, such as time.

Dynamic Topic Models. Topic models have been extended to allow topics and words to change over time (Blei & Lafferty, 2006b; Wang et al., 2008b), making use of the inherent structure between documents appearing at different indices. These models consider latent Wiener processes and use a forward-backward learning algorithm, which requires a full pass through the data at every iteration if the number of time stamps is comparable with the total number of documents. A similar approach was proposed by Wang & McCallum (2006), with time being an observed variable. Such an approach allows for scalability, while losing the smoothness of the inferred topics. Another scalable approach was proposed by Bhadury et al. (2016) to model large topic dynamics by relying on stochastic gradient MCMC sampling. However, this approach is still restricted to Wiener processes. Finally, Jähnichen et al. (2018) recently proposed a model that allows for scalability under a general framework for modelling time dependency, overcoming the limitation of Wiener processes. An attempt to model a latent correlation between topics at discrete time stamps was presented in (Song et al., 2008), where topic correlation is computed using principal component analysis based on their closeness in the latent space. However, to the best of our knowledge, no general procedure has been proposed to explicitly model dynamic topic models with evolving correlations over continuous time.

Stochastic Variational Inference. We develop a scalable inference method for our model based on stochastic variational inference (SVI) (Hoffman et al., 2013), which combines variational inference with stochastic gradient estimation. Two key ingredients of our inference method are amortised inference and the reparameterisation trick (Kingma & Welling, 2014).

Amortised inference has been widely used for enabling mini-batch training in models with local latent variables, such as variational autoencoders (Kingma & Welling, 2014) and deep Gaussian processes (Dai et al., 2015). The reparameterisation trick allows us to obtain low-variance gradient estimates with Monte Carlo sampling for intractable variational lower bounds. Note that SVI is usually applied to models where the data points are i.i.d. given the global parameters, such as Bayesian neural networks, which does not apply to GPs and GWPs. Although the log marginal likelihood of GPs and GWPs cannot be easily approximated with data sub-sampling, we use the stochastic variational sparse GP formulation (Hensman et al., 2013), in which an unbiased estimate of the variational lower bound can be derived from data sub-sampling, which is essential for mini-batch training. Recently, Jähnichen et al. (2018) developed a stochastic variational inference method for DTM, which is a dynamic version of LDA. This is different from our approach, which is a dynamic version of CTM, where the correlations in the mixture of topics are modelled dynamically.

3 DYNAMIC CORRELATED TOPIC MODEL

DCTM is a correlated topic model in which the temporal dynamics are governed by GPs and GWPs. Consider a corpus W of documents associated with an index (for example a time stamp). We denote the index of a document as d and its time stamp as t_d. While taking into account the dynamics underlying the documents, our goal is two-fold: (i) infer the vocabulary distributions for the topics, and (ii) infer the distribution of the mixture of topics. We use continuous processes, namely Gaussian processes, to model the dynamics of words and topics. These incorporate temporal dynamics into the model and capture diverse evolution patterns, such as smooth, long-term memory or periodic behaviours.

Following the notation of the CTM (Blei & Lafferty, 2006a), we denote the probability of word w being assigned to topic k as β_{wk}, and the probability of topic k for document d as η_{dk}. DCTM assumes that an N_d-word document d at time t_d is generated according to the following generative process:

1. Draw a mixture of topics η_d ∼ N(µ_{t_d}, Σ_{t_d});
2. For each word n = 1, ..., N_d:
   (a) Draw a topic assignment z_n | η_d from a multinomial distribution with parameter σ(η_d);
   (b) Draw a word w_n | z_n, β from a multinomial distribution with parameter σ(β_{z_n}),

where σ represents the softmax function, i.e., σ(z)_i = e^{z_i} / ∑_{j=1}^{K} e^{z_j}. Note that the softmax transformation is required for both η_d and β_{z_n}, as they are assumed to be defined in an unconstrained space. The softmax transformation converts the parameters to probabilities, encoding the proportion of topics for document d and the distribution of the words for topic β_{z_n}, respectively.

Under this generative process, the marginal likelihood for the corpus W becomes:

p(W | µ, Σ, β) = ∏_{d=1}^{D} ∫ [ ∑_{z_n=1}^{K} p(W_d | z_n, β_{t_d}) p(z_n | η_d) ] p(η_d | µ_{t_d}, Σ_{t_d}) dη_d.    (1)

The individual documents are assumed to be i.i.d. given the document-topic proportion and topic-word distribution.

The key idea of CTM is to relax the parameterisation of η by allowing topics to be correlated with each other, i.e., by allowing a non-diagonal Σ_{t_d}. We follow the same intuition as in (Blei & Lafferty, 2006a), using a logistic normal distribution to model η. This allows the probabilities of the topics to be correlated with each other. However, especially over a long period of time, we argue that it is unlikely that the correlations among topics remain constant. Intuitively, the degree of correlation among topics changes over time, as correlations simply reflect the co-occurrence of the concepts appearing in documents. Consider the correlation between the topics "stochastic gradient descent" and "variational inference", which increased in recent years due to advances in stochastic variational inference methods. We propose to model the dynamics of the covariance matrix of the topics, as well as the document-topic distribution and the topic-word distribution.
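To make the generative process above concrete, here is a minimal NumPy sketch (not the authors' implementation) of sampling a single bag-of-words document at one time stamp. It assumes µ_t, Σ_t and β_t have already been drawn from their dynamic priors; the toy dimensions and variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sample_document(mu_t, Sigma_t, beta_t, n_words, rng):
    """Sample one bag-of-words document at a single time stamp.

    mu_t:    (K,)   prior mean of the topic mixture at time t
    Sigma_t: (K, K) prior covariance of the topic mixture at time t
    beta_t:  (K, V) unconstrained topic-word parameters at time t
    """
    # 1. Draw the (unconstrained) mixture of topics eta_d ~ N(mu_t, Sigma_t).
    eta_d = rng.multivariate_normal(mu_t, Sigma_t)
    theta_d = softmax(eta_d)              # topic proportions for document d
    word_probs = softmax(beta_t, axis=1)  # per-topic word distributions

    counts = np.zeros(beta_t.shape[1], dtype=int)
    for _ in range(n_words):
        # 2(a). Draw a topic assignment z_n | eta_d.
        z_n = rng.choice(len(theta_d), p=theta_d)
        # 2(b). Draw a word w_n | z_n, beta.
        w_n = rng.choice(beta_t.shape[1], p=word_probs[z_n])
        counts[w_n] += 1
    return counts

# Toy example: K = 3 topics, V = 10 vocabulary words (illustrative values only).
K, V = 3, 10
mu_t = rng.normal(size=K)
A = rng.normal(size=(K, K))
Sigma_t = A @ A.T + np.eye(K)  # any positive-definite covariance
beta_t = rng.normal(size=(K, V))
print(sample_document(mu_t, Sigma_t, beta_t, n_words=50, rng=rng))
```

Integrating η_d out of this sampling procedure, and summing over the topic assignments, gives the marginal likelihood in Equation (1).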

Dynamics of µ, β and Σ. First, we model the topic probability (µ_{t_d})_{d=1}^{D} and the distribution of words for topics (β_{t_d})_{d=1}^{D} as zero-mean Gaussian processes, i.e., p(µ) = GP(0, κ_µ) and p(β) = GP(0, κ_β). We model the series of covariance matrices (Σ_{t_d})_{d=1}^{D} using generalised Wishart processes, a generalisation of Gaussian processes to positive semi-definite matrices (Wilson & Ghahramani, 2011; Heaukulani & van der Wilk, 2019). Wishart processes are constructed from i.i.d. collections of Gaussian processes as follows. Let f be D × ν i.i.d. Gaussian processes with zero mean function, so that

f_{di} ∼ GP(0, κ_θ),  d ≤ D,  i ≤ ν,    (2)

and (shared) kernel function κ_θ, where θ denotes any parameters of the kernel function. For example, in the case of κ_θ(x, y) = θ_1^2 exp(−‖x − y‖^2 / (2 θ_2^2)), θ = (θ_1, θ_2) corresponds to the amplitude and length scale of the kernel (assumed to be independent from one another).

The positive integer-valued ν ≥ D is denoted the degrees-of-freedom parameter. Let F_{ndk} := f_{dk}(x_n), and let F_n := (F_{ndk}, d ≤ D, k ≤ ν) denote the D × ν matrix of collected function values, for every n ≥ 1. Then, consider

Σ_n = L F_n F_nᵀ Lᵀ,  n ≥ 1,    (3)

where L ∈ R^{D×D} satisfies the condition that the symmetric matrix L Lᵀ is positive definite. With this construction, Σ_n is (marginally) Wishart distributed, and Σ is correspondingly called a Wishart process with degrees of freedom ν and scale matrix V = L Lᵀ. We write Σ_n ∼ GWP(V, ν, κ_θ) to indicate that Σ_n is drawn from a Wishart process. The dynamics of the process of covariance matrices Σ are inherited from the Gaussian processes, controlled by the kernel function κ_θ. With this formulation, the dependency between the D Gaussian processes is static over time, and regulated by the matrix V.

We consider L to be a triangular Cholesky factor of the positive definite matrix V, with M = D(D + 1)/2 free elements. We vectorise all the free elements into a vector ℓ = (ℓ_1, ..., ℓ_M) and assign a spherical normal distribution p(ℓ_m) = N(0, 1) to each of them. Note that the diagonal elements of L need to be positive. To ensure this, we apply a change of variable to the prior distribution of the diagonal elements through a softplus transformation, ℓ_i = log(1 + exp(ℓ̂_i)) with ℓ̂_i ∼ N(0, 1). Hence, p(L) is a set of independent normal distributions with the diagonal entries constrained to be positive by a change-of-variable transformation.

Figure 1 shows the graphical model of DCTM.

[Figure 1: The graphical model for DCTM.]
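The GWP construction in Equations (2) and (3) is easy to simulate directly. The following sketch, a toy illustration rather than the paper's code, draws D × ν i.i.d. GP paths on a grid of time stamps with the squared-exponential kernel above, builds a lower-triangular L whose diagonal is made positive by the softplus transformation, and assembles Σ_n = L F_n F_nᵀ Lᵀ; all sizes and kernel parameters are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

def sq_exp_kernel(t, amplitude=1.0, lengthscale=1.0):
    """Squared-exponential kernel κ_θ evaluated on a 1-D time grid."""
    diff = t[:, None] - t[None, :]
    return amplitude**2 * np.exp(-diff**2 / (2.0 * lengthscale**2))

def sample_gwp(t, D, nu, L, amplitude=1.0, lengthscale=1.0, jitter=1e-6):
    """Draw Sigma_1, ..., Sigma_N from GWP(V, nu, kappa) with V = L @ L.T."""
    N = len(t)
    K = sq_exp_kernel(t, amplitude, lengthscale) + jitter * np.eye(N)
    chol = np.linalg.cholesky(K)
    # F[n] is the D x nu matrix of GP function values at time t_n,
    # built from D * nu i.i.d. GP paths sharing the same kernel.
    F = np.einsum('nm,mdi->ndi', chol, rng.normal(size=(N, D, nu)))
    return np.stack([L @ F[n] @ F[n].T @ L.T for n in range(N)])

# Toy sizes: D = 4 topics, nu = 4 degrees of freedom, 25 time stamps (assumed values).
D, nu = 4, 4
t = np.linspace(0.0, 10.0, 25)

# Lower-triangular L with a softplus-positive diagonal, mirroring the prior on L.
L_raw = rng.normal(size=(D, D))
L = np.tril(L_raw, k=-1) + np.diag(np.log1p(np.exp(np.diag(L_raw))))

Sigmas = sample_gwp(t, D, nu, L, lengthscale=2.0)
print(Sigmas.shape)                               # (25, 4, 4)
print(np.all(np.linalg.eigvalsh(Sigmas[0]) > 0))  # each Sigma_n is positive definite
```

Because ν ≥ D and the GP draws are almost surely full rank, each Σ_n is positive definite, while the shared kernel κ_θ makes the sequence of covariance matrices evolve smoothly over time.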
Collapsing z's. Stochastic gradient estimation with discrete latent variables is difficult and often results in significantly higher variance in the gradient estimates, even with state-of-the-art variance reduction techniques. Fortunately, the discrete latent variables z in DCTM can be marginalised out in closed form. The resulting marginalised distribution p(W_d | η_d, β_{t_d}) becomes a multinomial distribution over the word counts in each document,

p(W_d | η_d, β_{t_d}) = ∏_{n=1}^{N_d} Multinomial(1, σ(β_{t_d} η_d)).    (4)

This trick has also been used by Srivastava & Sutton (2017) to derive a variational lower bound for LDA.

4 VARIATIONAL INFERENCE

Given a collection of documents covering a period of time, we are interested in analysing the evolution not only of the word distributions of individual topics but also of the popularity of individual topics in the corpora and the correlations among topics. With the aim of handling millions of documents, we develop a stochastic variational inference method to perform mini-batch training with stochastic gradient descent methods.

4.1 AMORTISED INFERENCE FOR DOCUMENT-TOPIC PROPORTION

An essential component of the SVI method for DCTM is to enable mini-batch training over documents. After defining a variational posterior q(η_d) for each document, a variational lower bound of the log probability over the documents can be derived as follows,

log p(W | µ, Σ, β) ≥ ∑_{d=1}^{D} ∫ q(η_d) log [ p(W_d | η_d, β_{t_d}) p(η_d | µ_{t_d}, Σ_{t_d}) / q(η_d) ] dη_d
                   = ∑_{d=1}^{D} E_{q(η_d)}[ log p(W_d | η_d, β_{t_d}) ] − KL( q(η_d) ‖ p(η_d | µ_{t_d}, Σ_{t_d}) ).    (5)

Denote the above lower bound as L_W. As the lower bound is a summation over individual documents, it is straightforward to derive a stochastic approximation of the summation by sub-sampling the documents,

L_W ≈ (D / B) ∑_{d ∈ D_B} ( E_{q(η_d)}[ log p(W_d | η_d, β_{t_d}) ] − KL( q(η_d) ‖ p(η_d | µ_{t_d}, Σ_{t_d}) ) ),    (6)

where D_B is a random sub-sample of the document indices with size B. This data sub-sampling allows us to perform mini-batch training, where the gradients of the variational parameters are stochastically approximated from a mini-batch. An issue with this data sub-sampling is that only the variational parameters associated with the mini-batch get updated, which causes synchronisation issues when running stochastic gradient descent. To avoid this, we assume the variational posteriors q(η_d) for individual documents are generated by parametric functions,

q(η_d) = N( φ_m(W_d), φ_S(W_d) ),    (7)

where φ_m and φ_S are the parametric functions that generate the mean and variance of q(η_d), respectively. This is known as amortised inference. With this parameterisation of the variational posteriors, a common set of parameters is always updated no matter which documents are sampled into the mini-batch, thus overcoming the synchronisation issue.
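As an illustration of the amortised posterior in Equation (7), the following minimal NumPy sketch uses an untrained one-hidden-layer encoder standing in for the deep network and draws a reparameterised sample η_d = φ_m(W_d) + sqrt(φ_S(W_d)) · ε; the diagonal covariance, layer sizes and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sizes (assumed for illustration): vocabulary V, K topics, hidden width H.
V, K, H = 10, 3, 16

# Untrained weights standing in for the deep encoder network.
W1, b1 = rng.normal(scale=0.1, size=(H, V)), np.zeros(H)
Wm, bm = rng.normal(scale=0.1, size=(K, H)), np.zeros(K)
Ws, bs = rng.normal(scale=0.1, size=(K, H)), np.zeros(K)

def encode(word_counts):
    """Amortised posterior q(eta_d) = N(phi_m(W_d), phi_S(W_d)) with diagonal covariance."""
    h = np.tanh(W1 @ word_counts + b1)
    mean = Wm @ h + bm
    var = np.log1p(np.exp(Ws @ h + bs))  # softplus keeps the variances positive
    return mean, var

def reparameterised_sample(mean, var, rng):
    """eta_d = mean + sqrt(var) * eps with eps ~ N(0, I): pathwise, low-variance gradients."""
    return mean + np.sqrt(var) * rng.normal(size=mean.shape)

# One document's bag-of-words counts (toy data); the same encoder serves every document.
W_d = rng.integers(0, 5, size=V).astype(float)
mean, var = encode(W_d)
eta_d = reparameterised_sample(mean, var, rng)
print(eta_d)
```

In training, such a sample would be plugged into the mini-batch estimate of Equation (6), rescaled by D/B, so that gradients flow back to the shared encoder parameters regardless of which documents are in the batch.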

The lower bound L_W cannot be computed analytically. Instead, we compute an unbiased estimate of L_W via Monte Carlo sampling. As the q(η_d) are normal distributions, we can easily obtain a low-variance estimate of the gradients of the variational parameters via the reparameterisation strategy (Kingma & Welling, 2014).

4.2 VARIATIONAL INFERENCE FOR GAUSSIAN PROCESSES

In DCTM, both the word distributions of topics β and the mean of the prior distribution of the document-topic proportion µ follow Gaussian processes that take the time stamps of individual documents as inputs, i.e., p(β | t) and p(µ | t). The inference of these Gaussian processes is challenging due to the cubic computational complexity with respect to the number of documents. To scale the inference to real-world problems, we take a stochastic variational Gaussian process (SVGP; Hensman et al., 2013) approach to construct the variational lower bound of our model. We first augment each Gaussian process with a set of auxiliary variables at a set of corresponding time stamps, i.e.,

p(β | t) = ∫ p(β | U_β, t, z_β) p(U_β | z_β) dU_β,    (8)
p(µ | t) = ∫ p(µ | U_µ, t, z_µ) p(U_µ | z_µ) dU_µ,    (9)

where U_β and U_µ are the auxiliary variables for β and µ respectively, and z_β and z_µ are the corresponding time stamps. Both p(β | U_β, t, z_β) and p(U_β | z_β) follow the same Gaussian process as the one for p(β | t), i.e., these Gaussian processes have the same mean and kernel functions. The same also applies to p(µ | U_µ, t, z_µ) and p(U_µ | z_µ). Note that, as shown in Equations (8) and (9), the above augmentation does not change the prior distributions.
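Equations (8) and (9) follow the standard sparse-GP augmentation: auxiliary (inducing) variables are placed at a small set of time stamps, and the process at the document time stamps is conditioned on them (Hensman et al., 2013). The sketch below illustrates this for a single GP; the inducing locations, kernel parameters and variable names are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

def sq_exp_kernel(a, b, amplitude=1.0, lengthscale=2.0):
    diff = a[:, None] - b[None, :]
    return amplitude**2 * np.exp(-diff**2 / (2.0 * lengthscale**2))

# Document time stamps t and a much smaller set of inducing time stamps z.
t = np.sort(rng.uniform(0.0, 10.0, size=200))
z = np.linspace(0.0, 10.0, 10)

K_zz = sq_exp_kernel(z, z) + 1e-6 * np.eye(len(z))
K_tz = sq_exp_kernel(t, z)
K_tt = sq_exp_kernel(t, t)

# Auxiliary variables U ~ p(U | z) = N(0, K_zz) at the inducing time stamps.
U = np.linalg.cholesky(K_zz) @ rng.normal(size=len(z))

# Conditional p(f | U, t, z): the same GP (same mean and kernel), conditioned on U.
A = K_tz @ np.linalg.inv(K_zz)   # explicit inverse for clarity; a Cholesky solve is preferable
cond_mean = A @ U
cond_cov = K_tt - A @ K_tz.T

# A draw from the conditional; averaging over U recovers the original prior N(0, K_tt),
# so the augmentation leaves p(f | t) unchanged, as stated after Equations (8) and (9).
f = cond_mean + np.linalg.cholesky(cond_cov + 1e-5 * np.eye(len(t))) @ rng.normal(size=len(t))
print(f.shape)  # (200,)
```

Since the conditional combined with the prior over the auxiliary variables integrates back to the original GP prior, the augmentation changes nothing about the model; it only introduces the variables on which a scalable variational posterior can be placed.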
