Variational Inference for LDA - Courses.cs.washington.edu


STAT 548, University of Washington
Emily Fox, June 2nd, 2015

Variational Methods: Goal
– Recall task: characterize the posterior.
– Turn posterior inference into an optimization task:
  – Introduce a family of distributions over the parameters and latent variables.
  – The family is indexed by a set of "free parameters".
  – Find the member of the family closest to the posterior.

Variational Methods: Cartoon
– Cartoon of goal: (figure)
– Questions:
  – How do we measure "closeness"?
  – How do we optimize closeness to something we do not have to begin with?

Interpretations of Minimizing Reverse KL
– Evidence lower bound (ELBO):
  \log p(x) = \mathrm{KL}\big(q(z, \beta) \,\|\, p(z, \beta \mid x)\big) + \mathcal{L}(q)
– Therefore,
  – the ELBO \mathcal{L}(q) provides a lower bound on the marginal likelihood, and
  – maximizing the ELBO is equivalent to minimizing the KL divergence.
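For completeness, a standard derivation of this decomposition (not spelled out on the slide, using only the definitions above) expands the KL term:

  \mathrm{KL}\big(q(z,\beta)\,\|\,p(z,\beta \mid x)\big)
    = \mathbb{E}_q[\log q(z,\beta) - \log p(z,\beta \mid x)]
    = \mathbb{E}_q[\log q(z,\beta) - \log p(z,\beta,x)] + \log p(x)
    = -\mathcal{L}(q) + \log p(x),

so \mathcal{L}(q) = \mathbb{E}_q[\log p(z,\beta,x)] - \mathbb{E}_q[\log q(z,\beta)] \le \log p(x), since the KL divergence is nonnegative.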

Mean Field
– The ELBO is
  \mathcal{L}(q) = \mathbb{E}_q[\log p(z, \beta, x)] - \mathbb{E}_q[\log q(z, \beta)].
– How do we choose a family Q such that this is tractable?
– Simplest case: the mean field approximation
  – Assume each parameter and latent variable is conditionally independent given the set of free parameters.
– (Figure: original graph vs. naive mean field vs. structured mean field.)
– Any subgraph for which inference is tractable leads to a mean-field-style approximation for which the update equations are tractable.

Mean Field for LDA
– In LDA, our parameters and latent variables are the topics {\beta_k}, the topic proportions {\theta_d}, and the topic assignments {z_{id}}.
– The variational distribution factorizes as shown on the next slide.
– The joint distribution factorizes as
  p(\beta, \theta, z, w) = \prod_{k=1}^K p(\beta_k) \prod_{d=1}^D p(\theta_d) \prod_{i=1}^{N_d} p(z_{id} \mid \theta_d)\, p(w_{id} \mid z_{id}, \beta).
– (Plate diagram: \theta_d \to z_{id} \to w_{id}, with topics \beta_k; plates over N_d words, D documents, K topics.)

Mean Field for LDA
– The variational distribution factorizes as
  q(\beta, \theta, z) = \prod_{k=1}^K q(\beta_k \mid \lambda_k) \prod_{d=1}^D q(\theta_d \mid \gamma_d) \prod_{i=1}^{N_d} q(z_{id} \mid \phi_{di}),
  mirroring the joint
  p(\beta, \theta, z, w) = \prod_{k=1}^K p(\beta_k) \prod_{d=1}^D p(\theta_d) \prod_{i=1}^{N_d} p(z_{id} \mid \theta_d)\, p(w_{id} \mid z_{id}, \beta).
– Examine the ELBO:
  \mathcal{L}(q) = \sum_{k=1}^K \mathbb{E}_q[\log p(\beta_k)] + \sum_{d=1}^D \mathbb{E}_q[\log p(\theta_d)]
    + \sum_{d=1}^D \sum_{i=1}^{N_d} \big( \mathbb{E}_q[\log p(z_{id} \mid \theta_d)] + \mathbb{E}_q[\log p(w_{id} \mid z_{id}, \beta)] \big)
    - \sum_{k=1}^K \mathbb{E}_q[\log q(\beta_k \mid \lambda_k)] - \sum_{d=1}^D \mathbb{E}_q[\log q(\theta_d \mid \gamma_d)]
    - \sum_{d=1}^D \sum_{i=1}^{N_d} \mathbb{E}_q[\log q(z_{id} \mid \phi_{di})].

Optimize via Coordinate Ascent
– Algorithm: iteratively update each variational factor while holding the others fixed (a code sketch follows below).
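A minimal sketch of what these coordinate-ascent updates might look like in code. It assumes Dirichlet priors with symmetric hyperparameters alpha (on \theta_d) and eta (on \beta_k); the function and variable names are illustrative, not from the slides.

import numpy as np
from scipy.special import digamma

def batch_cavi_lda(docs, K, V, alpha=0.1, eta=0.01, iters=50):
    """Batch coordinate-ascent mean-field VI for LDA (illustrative sketch).

    docs: list of word-id arrays, one per document
    K, V: number of topics and vocabulary size
    """
    lam = np.random.gamma(100., 1. / 100., size=(K, V))      # q(beta_k) = Dirichlet(lambda_k)
    gammas = [np.ones(K) for _ in docs]                       # q(theta_d) = Dirichlet(gamma_d)
    phis = [np.full((len(doc), K), 1.0 / K) for doc in docs]  # q(z_di) = Categorical(phi_di)

    for _ in range(iters):
        Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        # Local step: update phi_di and gamma_d for every document.
        for d, doc in enumerate(docs):
            Elog_theta = digamma(gammas[d]) - digamma(gammas[d].sum())
            log_phi = Elog_theta[None, :] + Elog_beta[:, doc].T   # shape (N_d, K)
            log_phi -= log_phi.max(axis=1, keepdims=True)
            phis[d] = np.exp(log_phi)
            phis[d] /= phis[d].sum(axis=1, keepdims=True)
            gammas[d] = alpha + phis[d].sum(axis=0)
        # Global step: update lambda from the expected topic-word counts.
        lam = np.full((K, V), eta)
        for d, doc in enumerate(docs):
            np.add.at(lam.T, doc, phis[d])   # lam[k, w] += sum_i phi_dik * 1[w_di = w]
        # (In practice one would also monitor the ELBO for convergence.)
    return lam, gammas, phis

Each update holds every other factor fixed, so each pass can only increase the ELBO.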

Generalizing
– Many Bayesian models have this form:
  p(\beta, z_{1:N}, x_{1:N}) = p(\beta) \prod_{i=1}^N p(z_i \mid \beta)\, p(x_i \mid z_i, \beta).
– Goal is to compute the posterior.
– The complete conditionals are in the exponential family:
  p(z_i \mid \beta, x_i) = h(z_i) \exp\{\eta(\beta, x_i)^T z_i - a(\eta(\beta, x_i))\},
  p(\beta \mid z, x) = h(\beta) \exp\{\eta_g(z, x)^T \beta - a(\eta_g(z, x))\}.
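As a concrete instance (not spelled out on this slide, but consistent with the LDA updates used later), the complete conditional of a topic assignment in LDA is categorical,

  p(z_{id} = k \mid \theta_d, \beta, w_{id}) \propto \exp\{\log \theta_{dk} + \log \beta_{k, w_{id}}\},

and the complete conditional of each topic \beta_k given the assignments is a Dirichlet whose parameter adds per-word counts to the prior; this conjugate structure is what makes the coordinate updates closed-form.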

Generalizing
– Mean field variational approximation:
  q(z, \beta) = q(\beta \mid \lambda) \prod_{i=1}^N q(z_i \mid \phi_i).
– Choose q(\beta \mid \lambda) from the same exponential family as the complete conditional
  p(\beta \mid z, x) = h(\beta) \exp\{\eta_g(z, x)^T \beta - a(\eta_g(z, x))\};
  same for the local variational terms, too.

Generalizing
– The gradient of the ELBO with respect to the global variational parameter is
  \nabla_\lambda \mathcal{L} = a''(\lambda)\big(\mathbb{E}_\phi[\eta_g(z, x)] - \lambda\big)   (Beal, 2001).
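A one-line consequence, added here to bridge to the next slide: setting this gradient to zero (with a''(\lambda) positive definite) recovers the coordinate-ascent update for the global parameter,

  a''(\lambda)\big(\mathbb{E}_\phi[\eta_g(z, x)] - \lambda\big) = 0 \;\Rightarrow\; \lambda^\star = \mathbb{E}_\phi[\eta_g(z, x)],

which is exactly the global update in the batch algorithm below.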

General Coordinate Ascent Algorithm
– Mean-field variational inference:
  – Initialize \lambda randomly.
  – Repeat until the ELBO converges:
    1. For each data point, update the local variational parameters:
       \phi_i^{(t)} = \mathbb{E}_{\lambda^{(t-1)}}[\eta(\beta, x_i)] \quad \text{for } i \in \{1, \dots, N\}.
    2. Update the global variational parameters:
       \lambda^{(t)} = \mathbb{E}_{\phi^{(t)}}[\eta_g(Z_{1:N}, x_{1:N})].
– Inefficient: we analyze the whole data set before completing one iteration.
  E.g., in iteration #1 we analyze all documents with random topics.
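A hedged sketch of this generic two-step loop in code, parameterized by user-supplied update functions; the names local_update and global_update are illustrative stand-ins for the model-specific expectations above.

def batch_mean_field(xs, init_lambda, local_update, global_update,
                     elbo=None, tol=1e-4, max_iters=100):
    """Generic batch coordinate-ascent mean-field VI (illustrative sketch).

    local_update(lam, x_i)  -> phi_i   # E_{lam}[eta(beta, x_i)]
    global_update(phis, xs) -> lam     # E_{phi}[eta_g(Z_{1:N}, x_{1:N})]
    elbo(lam, phis, xs)     -> float   # optional convergence check
    """
    lam = init_lambda
    prev = -float("inf")
    for _ in range(max_iters):
        # Step 1: local parameters for every data point (the expensive full pass).
        phis = [local_update(lam, x_i) for x_i in xs]
        # Step 2: one global update from all local parameters.
        lam = global_update(phis, xs)
        if elbo is not None:
            cur = elbo(lam, phis, xs)
            if abs(cur - prev) < tol:
                break
            prev = cur
    return lam, phis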

Limitations of Batch Variational Methods
– Example: LDA
  – Start from randomly initialized \lambda_k (topics).
  – Analyze the whole corpus before updating \lambda_k again.
– Streaming data: can't complete even one iteration!
– More generally:
  – Do some local computation for each data point.
  – Aggregate these computations to re-estimate the global structure.
  – Repeat.
– Inefficient, and cannot handle massive data sets.

Stochastic Variational Inference
– Stochastic variational inference harnesses:
  – Idea #1: stochastic optimization.
  – Idea #2: natural gradients (Amari, 1998).
  (Hoffman et al., 2013)

Alternative Optimization Schemes
– We didn't have to do coordinate ascent; we could have used gradient ascent. Here,
  \mathcal{L}(q) = \mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)]
    + \sum_{i=1}^N \big( \mathbb{E}_q[\log p(z_i, x_i \mid \beta)] - \mathbb{E}_q[\log q(z_i)] \big).
– Use a stochastic gradient step instead:
  – Consider one data point x_t sampled uniformly at random and define
    \mathcal{L}_t(q) = \mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)]
      + N \big( \mathbb{E}_q[\log p(z_t, x_t \mid \beta)] - \mathbb{E}_q[\log q(z_t)] \big).
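A short check, standard but worth stating: with t drawn uniformly from \{1, \dots, N\}, the single-point objective is an unbiased estimate of the full ELBO,

  \mathbb{E}_t[\mathcal{L}_t(q)] = \mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)]
    + N \cdot \tfrac{1}{N} \sum_{i=1}^N \big( \mathbb{E}_q[\log p(z_i, x_i \mid \beta)] - \mathbb{E}_q[\log q(z_i)] \big) = \mathcal{L}(q),

so its gradient is an unbiased estimate of \nabla_\lambda \mathcal{L}.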

Alternative Optimization Schemes
– Recall the gradient of the ELBO for the global parameter:
  \nabla_\lambda \mathcal{L} = a''(\lambda)\big(\mathbb{E}_\phi[\eta_g(z, x)] - \lambda\big).
– Even using just one data point, there is an issue for scalability: the gradient involves the matrix a''(\lambda).

Natural Gradient of the ELBO
– (Figure from Honkela et al., 2010: gradient vs. Riemannian gradient directions for the mean of q; the Riemannian gradient strengthens the updates in directions where the uncertainty is large.)
– The natural gradient of the ELBO accounts for the geometry of the parameter space.
– Natural gradient of the ELBO:
  \hat{\nabla}_\lambda \mathcal{L} = \mathbb{E}_\phi[\eta_g(Z, x)] - \lambda.
– We can compute the natural gradient by computing the coordinate updates in parallel and subtracting the current variational parameters. (Sato, 2001)
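The connection between the two bullets is a standard fact, stated here to fill in the step: for an exponential-family q(\beta \mid \lambda), the Fisher information is a''(\lambda), so premultiplying the ordinary gradient by the inverse Fisher information gives

  \hat{\nabla}_\lambda \mathcal{L} = a''(\lambda)^{-1} \nabla_\lambda \mathcal{L}
    = a''(\lambda)^{-1} a''(\lambda)\big(\mathbb{E}_\phi[\eta_g(z, x)] - \lambda\big)
    = \mathbb{E}_\phi[\eta_g(z, x)] - \lambda,

which is both cheaper to compute and better scaled than the ordinary gradient.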

Noisy Natural Gradients
– Let \hat{\lambda}_t = \mathbb{E}_{\phi_t}[\eta_g(Z_t^{(N)}, x_t^{(N)})], the global update computed as if the sampled point x_t were the whole data set (replicated N times).
– With this, the noisy natural gradient of the ELBO is
  \hat{\nabla}_\lambda \mathcal{L}_t = \hat{\lambda}_t - \lambda.
– Notes:
  – It only requires sampling (and locally optimizing) a single data point.
  – It only depends on the optimized local variational parameters.
  – Thanks to conjugacy it has a simple form.

SVI Algorithm Overview
– Stochastic variational inference:
  – Initialize global parameters \lambda randomly.
  – Set the step-size schedule \rho_t appropriately.
  – Repeat forever:
    1. Sample a data point uniformly: x_t \sim \text{Uniform}(x_1, \dots, x_N).
    2. Compute its local variational parameter: \phi = \mathbb{E}_{\lambda^{(t-1)}}[\eta(\beta, x_t)].
    3. Pretend it is the only data point in the data set: \hat{\lambda} = \mathbb{E}_\phi[\eta_g(Z_t^{(N)}, x_t^{(N)})].
    4. Update the current global variational parameter: \lambda^{(t)} = (1 - \rho_t)\lambda^{(t-1)} + \rho_t \hat{\lambda}.
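A hedged code sketch of this loop, again with illustrative callables (local_update, global_update_replicated) standing in for the model-specific expectations; the step-size schedule \rho_t = (t + \tau)^{-\kappa} is one common Robbins-Monro choice, not mandated by the slide.

import random

def svi(xs, init_lambda, local_update, global_update_replicated,
        tau=1.0, kappa=0.7, max_steps=10000):
    """Generic stochastic variational inference loop (illustrative sketch).

    local_update(lam, x_t)               -> phi_t   # E_{lam}[eta(beta, x_t)]
    global_update_replicated(phi_t, x_t) -> lam_hat # E_{phi}[eta_g(...)], as if x_t
                                                    # were the whole data set (N copies)
    """
    lam = init_lambda
    for t in range(1, max_steps + 1):
        x_t = random.choice(xs)                          # 1. sample a data point uniformly
        phi_t = local_update(lam, x_t)                   # 2. local variational parameter
        lam_hat = global_update_replicated(phi_t, x_t)   # 3. intermediate global parameter
        rho_t = (t + tau) ** (-kappa)                    # decreasing step size
        lam = (1.0 - rho_t) * lam + rho_t * lam_hat      # 4. blend with current estimate
    return lam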

SVI for LDA
– In LDA, the full ELBO is given by
  \mathcal{L} = \mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)]
    + \sum_{d=1}^D \big( \mathbb{E}_q[\log p(\theta_d)] - \mathbb{E}_q[\log q(\theta_d)] \big)
    + \sum_{d=1}^D \big( \mathbb{E}_q[\log p(z_d, x_d \mid \theta_d, \beta)] - \mathbb{E}_q[\log q(z_d)] \big).
– Assuming D documents, consider one sampled at random:
  \mathcal{L}_t = \mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)]
    + D \big( \mathbb{E}_q[\log p(\theta_t)] - \mathbb{E}_q[\log q(\theta_t)] \big)
    + D \big( \mathbb{E}_q[\log p(z_t, x_t \mid \theta_t, \beta)] - \mathbb{E}_q[\log q(z_t)] \big).

SVI for LDA
– Initialize \lambda^{(0)} randomly.
– Repeat (indefinitely):
  – Sample a document d uniformly from the data set.
  – For all k, initialize \gamma_d^k = 1.
  – Repeat until converged:
    – For i = 1, \dots, N_d: set \phi_{di}^k \propto \exp\{\mathbb{E}[\log \theta_{dk}] + \mathbb{E}[\log \beta_{k, w_{id}}]\}.
    – Set \gamma_d = \alpha + \sum_{i=1}^{N_d} \phi_{di}.
  – Take a stochastic gradient step: \lambda^{(t)} = \lambda^{(t-1)} + \rho_t \hat{\nabla}_\lambda \mathcal{L}_d.
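A hedged end-to-end sketch of this procedure in code; the hyperparameters alpha and eta and the step-size schedule are illustrative assumptions, not specified on the slide.

import numpy as np
from scipy.special import digamma

def svi_lda(docs, K, V, alpha=0.1, eta=0.01, tau=1.0, kappa=0.7,
            n_steps=100000, inner_iters=50):
    """Stochastic variational inference for LDA (illustrative sketch).

    docs: list of word-id arrays; K topics; V vocabulary size.
    """
    D = len(docs)
    lam = np.random.gamma(100., 1. / 100., size=(K, V))  # q(beta_k) = Dirichlet(lambda_k)

    for t in range(1, n_steps + 1):
        d = np.random.randint(D)                          # sample a document uniformly
        doc = docs[d]
        Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        gamma_d = np.ones(K)                              # for all k, initialize gamma_dk = 1
        # Local step: iterate phi / gamma updates for this document until (approx.) converged.
        for _ in range(inner_iters):
            Elog_theta = digamma(gamma_d) - digamma(gamma_d.sum())
            log_phi = Elog_theta[None, :] + Elog_beta[:, doc].T   # shape (N_d, K)
            log_phi -= log_phi.max(axis=1, keepdims=True)
            phi = np.exp(log_phi)
            phi /= phi.sum(axis=1, keepdims=True)
            gamma_d = alpha + phi.sum(axis=0)
        # Intermediate global parameter: pretend this document is the whole corpus.
        lam_hat = np.full((K, V), eta)
        np.add.at(lam_hat.T, doc, D * phi)
        # Stochastic natural-gradient step.
        rho_t = (t + tau) ** (-kappa)
        lam = (1.0 - rho_t) * lam + rho_t * lam_hat
    return lam

Because the noisy natural gradient is \hat{\lambda} - \lambda^{(t-1)}, the blended update in the last line is the same step as \lambda^{(t)} = \lambda^{(t-1)} + \rho_t \hat{\nabla}_\lambda \mathcal{L}_d on the slide.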

SVI for LDA in Practice
– (Figure from Hoffman et al., 2010: held-out perplexity, roughly 600-900, vs. documents seen on a log scale, comparing online LDA on 98K and 3.3M documents against batch LDA on 98K documents, with a table of the top eight words of an example topic as more documents are analyzed.)

What you need to know
– Variational methods
  – Overall goal
  – Interpretation in terms of minimizing (reverse) KL
  – Mean field approximation
– Mean field variational inference for LDA
– Stochastic variational inference
  – General idea of using natural gradients + stochastic optimization
  – Resulting generic algorithm
  – SVI for LDA

Reading
– Inference in LDA:
  – Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR 3 (2003): 993-1022.
  – Hoffman, Blei, Wang, and Paisley. "Stochastic Variational Inference." arXiv:1206.7051 (2012).

Acknowledgements
– … relating to SVI

University of Washington
Emily Fox, June 2nd, 2015

What you need to know
– …

What you need to know
– Case Study 2: Document Retrieval and Clustering
  – Approach 1: k-NN for nearest neighbor search
  – Algorithm: Fast k-NN using KD-trees (exact)
  – Algorithm: Approximate k-NN using KD-trees and locality sensitive hashing
  – Approach 2: k-means for clustering
  – Data parallel problems
  – Algorithm: MapReduce framework and parallel k-means using MapReduce
  – Algorithm: EM

What you need to know
– Case Study 3: fMRI Prediction
  – Regularized linear models: Ridge regression and LASSO
  – Sparsistency
  – LASSO solvers:
    – LARS
    – Shotgun (stochastic coordinate descent)
    – Hogwild (stochastic gradient descent)
    – Averaging methods
    – ADMM
  – LASSO variants:
    – Fused LASSO
    – Graphical LASSO
  – Coping with large covariances using latent factor models

What you need to know
– Case Study 4: Collaborative Filtering
  – Approach: Matrix factorization
  – Algorithm: Alternating least squares (ALS)
  – Algorithm: Stochastic gradient descent (SGD)
  – Cold-start problem and feature-based collaborative filtering
  – Model variants:
    – Non-negative matrix factorization
    – Probabilistic matrix factorization
  – Algorithm: Gibbs sampling
  – Probabilistic latent space models and network data
  – Graph parallel problems
  – Gibbs for matrix factorization

What you need to know
– Case Study 5: Document Mixed Membership Modeling
  – Approach 1: Bayesian document clustering model
  – Approach 2: Latent Dirichlet allocation (LDA)
  – Algorithm: Collapsed Gibbs sampling
  – Algorithm: Variational inference

THANK YOU!!!
– You have been a great, interactive class! (especially for a 9:30am lecture :)
– We're looking forward to the poster session.
– Thanks to Marco and Alden, too!

