A Tutorial On Variational Bayes


A Tutorial on Variational Bayes
Junhao Hua (华俊豪)
Laboratory of Machine and Biological Intelligence,
Department of Information Science & Electronic Engineering, Zhejiang University
2014/3/27
Email: huajh7@gmail.com
For further information, see: http://www.huajh7.com

Outline
- Motivation
- The Variational Bayesian Framework
  - Variational Free Energy
  - Optimization Tech.: Mean Field Approximation, Exponential Family, Bayesian Networks
- Example: VB for Mixture model
- Discussion
- Application
- Reference

A Problem: How to Learn From Data?
- Typically, we use a complex statistical model, but how do we learn its parameters and latent variables?
- Data: X
- Model: P(X | \theta, Z)

Challenge
- Maximum likelihood: overfits the data; model complexity; computational tractability
- Bayesian framework: intractable integrals arise (the partition function, the posterior of the unobserved variables)
- Approximate inference:
  - Monte Carlo sampling: e.g. MCMC, particle filters
  - Variational Bayes

Outline
- Motivation
- The Variational Bayesian Framework
  - Variational Free Energy
  - Optimization Tech.: Mean Field Approximation, Exponential Family, Bayesian Networks
- Example: VB for Mixture model
- Discussion
- Application
- Reference

Variational Free Energy
- Basic idea: "conditional independence is enforced as a functional constraint in the approximating distribution, and the best such approximation is found by minimization of a Kullback-Leibler divergence (KLD)."
- Use a simpler variational distribution, Q(Z), to approximate the true posterior P(Z | X).
- Two alternative (and equivalent) views:
  - Minimize the (reverse) Kullback-Leibler divergence
    D_{KL}(Q \| P) = \sum_Z Q(Z) \log \frac{Q(Z)}{P(Z \mid D)} = \sum_Z Q(Z) \log \frac{Q(Z)}{P(Z, D)} + \log P(D)
  - Maximize the variational free energy (a lower bound on \log P(D))
    L(Q) = \sum_Z Q(Z) \log P(Z, D) - \sum_Z Q(Z) \log Q(Z) = E_Q[\log P(Z, D)] + H(Q)
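To make the two views concrete, here is a minimal Python sketch checking the identity \log P(D) = L(Q) + D_{KL}(Q \| P(Z \mid D)) on a toy three-state latent variable; all numbers below are purely illustrative.

```python
import numpy as np

# Toy joint P(Z, D=d) over a discrete latent Z with 3 states (illustrative numbers only).
p_joint = np.array([0.10, 0.25, 0.15])          # P(Z=z, D=d) for the observed d
p_evidence = p_joint.sum()                       # P(D=d)
p_post = p_joint / p_evidence                    # P(Z | D=d)

q = np.array([0.2, 0.5, 0.3])                    # a variational distribution Q(Z)

kl = np.sum(q * np.log(q / p_post))              # D_KL(Q || P(Z|D))
elbo = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))   # E_Q[log P(Z,D)] + H(Q)

# Identity from the slide: log P(D) = L(Q) + D_KL(Q || P(Z|D))
assert np.isclose(np.log(p_evidence), elbo + kl)
```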

Optimization Techniques: Mean Field Approximation
- Originated in the statistical physics literature
- Conditional independence assumption
- Decoupling: an intractable distribution becomes a product of tractable marginal distributions (a tractable subgraph)
- Factorization:
  Q(Z) = \prod_{i=1}^{M} q(Z_i \mid D)

Optimization Techniques: Variational Methods
- Optimization problem: maximize the lower bound
  L(Q(Z)) = E_{Q(Z)}[\ln P(Z, D)] + H(Q(Z)), \qquad \text{where } Q(Z) = \prod_i Q_i(Z_i)
- Subject to the normalization constraints: \forall i, \ \int Q_i(Z_i)\, dZ_i = 1
- Seek the extremum of a functional: Euler-Lagrange equation

Derivation
- Consider the partition Z = \{Z_i, Z_{-i}\}, where Z_{-i} = Z \setminus Z_i.
- Consider the energy term,
  E_{Q(Z)}[\ln P(Z, D)]
    = \int \Big(\prod_i Q_i(Z_i)\Big) \ln P(Z, D)\, dZ
    = \int Q_i(Z_i) \Big[ \int Q_{-i}(Z_{-i}) \ln P(Z, D)\, dZ_{-i} \Big] dZ_i
    = \int Q_i(Z_i)\, \langle \ln P(Z, D) \rangle_{Q_{-i}(Z_{-i})}\, dZ_i
    = \int Q_i(Z_i) \ln \exp \langle \ln P(Z, D) \rangle_{Q_{-i}(Z_{-i})}\, dZ_i
    = \int Q_i(Z_i) \ln Q_i^*(Z_i)\, dZ_i + \ln C
- We define Q_i^*(Z_i) = \frac{1}{C} \exp \langle \ln P(Z, D) \rangle_{Q_{-i}(Z_{-i})}, where C is a normalization constant.

Derivation (cont.)
- Consider the entropy term,
  H(Q(Z)) = -\int \Big(\prod_k Q_k(Z_k)\Big) \sum_i \ln Q_i(Z_i)\, dZ
          = -\sum_i \int Q_i(Z_i) \ln Q_i(Z_i)\, dZ_i
          = -\int Q_i(Z_i) \ln Q_i(Z_i)\, dZ_i - \sum_{k \ne i} \int Q_k(Z_k) \ln Q_k(Z_k)\, dZ_k
- Then we get the functional,
  L(Q(Z)) = \int Q_i(Z_i) \ln Q_i^*(Z_i)\, dZ_i - \int Q_i(Z_i) \ln Q_i(Z_i)\, dZ_i - \sum_{k \ne i} \int Q_k(Z_k) \ln Q_k(Z_k)\, dZ_k + \ln C
          = \int Q_i(Z_i) \ln \frac{Q_i^*(Z_i)}{Q_i(Z_i)}\, dZ_i - \sum_{k \ne i} \int Q_k(Z_k) \ln Q_k(Z_k)\, dZ_k + \ln C
          = -D_{KL}\big(Q_i(Z_i) \,\|\, Q_i^*(Z_i)\big) + H[Q_{-i}(Z_{-i})] + \ln C

Derivation (cont.)
- Maximizing the functional L w.r.t. each Q_i can be carried out with Lagrange multipliers and functional differentiation:
  \forall i, \quad \frac{\partial}{\partial Q_i(Z_i)} \Big\{ -D_{KL}\big[Q_i(Z_i) \,\|\, Q_i^*(Z_i)\big] + \lambda_i \Big( \int Q_i(Z_i)\, dZ_i - 1 \Big) \Big\} = 0
- A long algebraic derivation would eventually lead to a Gibbs distribution; fortunately, L is maximized when the KL divergence is zero,
  Q_i(Z_i) = Q_i^*(Z_i) = \frac{1}{C} \exp \langle \ln P(Z_i, Z_{-i}, D) \rangle_{Q_{-i}(Z_{-i})}
  where C is a normalization constant.
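This update suggests a simple coordinate-ascent scheme: cycle through the factors and reset each Q_i to Q_i^*. Here is a minimal Python sketch on a hypothetical two-variable discrete joint; the table entries are made up purely for illustration.

```python
import numpy as np

# Mean-field coordinate ascent on a toy joint P(Z1, Z2, D) stored as a table
# (hypothetical numbers; rows index Z1, columns index Z2).
log_p = np.log(np.array([[0.30, 0.10],
                         [0.05, 0.55]]))

q1 = np.full(2, 0.5)      # Q1(Z1), initialised uniformly
q2 = np.full(2, 0.5)      # Q2(Z2)

for _ in range(50):
    # Q1*(Z1) proportional to exp( E_{Q2}[ log P(Z1, Z2, D) ] )
    q1 = np.exp(log_p @ q2);  q1 /= q1.sum()
    # Q2*(Z2) proportional to exp( E_{Q1}[ log P(Z1, Z2, D) ] )
    q2 = np.exp(log_p.T @ q1); q2 /= q2.sum()

print(q1, q2)   # factorised approximation to P(Z1, Z2 | D)
```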

Challenge
  Q_i^*(Z_i) = \frac{1}{C} \exp \langle \ln P(Z_i, Z_{-i}, D) \rangle_{Q_{-i}(Z_{-i})}
- The expectation can be intractable.
- We need to pick a family of distributions Q that allows for exact inference.
- Then find Q' \in Q that maximizes the free-energy functional.

Challenge
- The expectation can be intractable.
- We need to pick a family of distributions Q that allows for exact inference.
- Then find Q' \in Q that maximizes the free-energy functional.
- Answer: the Exponential Family.

Why the Exponential Family?
- Principle of maximum entropy
- Density function (canonical form):
  p(x \mid \theta) = h(x) \exp\big( \theta^T \phi(x) - A(\theta) \big)
- Log partition function:
  A(\theta) = \log \int h(x) \exp\big( \theta^T \phi(x) \big)\, dx
- \theta: canonical parameters; \mu = E[\phi(x)]: mean parameters

Properties of the Exponential Family
- Mean parameters: \mu = E_\theta[\phi(x)]
  "various statistical computations, among them marginalization and maximum likelihood estimation, can be understood as transforming from one parameterization to the other."
- The set of all realizable mean parameters is always a convex subset of R^d.
- Forward mapping: from the canonical parameters \theta to the mean parameters \mu.
- Backward mapping: from the mean parameters \mu to the canonical parameters \theta.
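As a concrete instance, the Bernoulli distribution in exponential-family form illustrates the forward mapping (differentiate the log partition function) and, for this family, a closed-form backward mapping (the logit). The sketch below is illustrative only; the function names are my own.

```python
import numpy as np

# Bernoulli in exponential-family form (a standard example):
#   p(x | theta) = exp( theta * x - A(theta) ),  x in {0, 1},
#   A(theta) = log(1 + exp(theta)),  phi(x) = x.
def log_partition(theta):
    return np.log1p(np.exp(theta))

def mean_param(theta, eps=1e-6):
    # Forward mapping: mu = dA/dtheta = E[phi(x)], here via central finite differences.
    return (log_partition(theta + eps) - log_partition(theta - eps)) / (2 * eps)

theta = 0.8
mu = mean_param(theta)
print(mu, 1 / (1 + np.exp(-theta)))        # forward mapping agrees with sigmoid(theta)
print(np.log(mu / (1 - mu)))               # backward mapping (logit) recovers theta = 0.8
```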

Properties of the Partition Function A
- \nabla A(\theta) = E_\theta[\phi(x)] (the forward mapping)
- \nabla^2 A(\theta) = \mathrm{Cov}_\theta[\phi(x)], so A is a convex function of \theta

Conjugate Duality: Maximum Likelihood and Maximum Entropy
- The variational representation of the log partition function:
  A(\theta) = \sup_{\mu \in \mathcal{M}} \big\{ \theta^T \mu - A^*(\mu) \big\}
- The conjugate dual function A^* to A is the negative entropy of the exponential-family distribution with mean parameters \mu.

Nonconvexity of Naïve Mean Field
- Mean field optimization is always nonconvex for any exponential family in which the state space is finite.
- The mean-field feasible set is a strict subset of M(G), yet it contains all of the extreme points of the polytope.

Outline
- Motivation
- The Variational Bayesian Framework
  - Variational Free Energy
  - Optimization Tech.: Mean Field Approximation, Exponential Family, Bayesian Networks
- Example: VB for Mixture model
- Discussion
- Application
- Reference

Inference in Bayesian Networks
- Variational Message Passing (Winn & Bishop, 2005)
- Message from a parent Y to a child X: m_{Y \to X} = \langle u_Y \rangle
- Message from a child X to a parent Y: m_{X \to Y} = \tilde{\phi}_{XY}\big( \langle u_X \rangle, \{ m_{i \to X} \}_{i \in \mathrm{cp}} \big)
- Update of the natural parameter vector:
  \phi_Y^* = \phi_Y\big( \{ m_{i \to Y} \}_{i \in \mathrm{pa}_Y} \big) + \sum_{j \in \mathrm{ch}_Y} m_{j \to Y}
- Conjugate-exponential form: P(X \mid \phi) = \exp\big[ \phi^T u(X) + f(X) + g(\phi) \big]
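VMP keeps every node in conjugate-exponential form, so each update amounts to "prior natural parameters plus the sum of incoming child messages". The toy sketch below shows that flavor for a single Gaussian mean with known precision; it is an illustrative simplification (no coparents, one node), not the general Winn & Bishop algorithm, and all names and values are hypothetical.

```python
import numpy as np

# Conjugate update for the mean mu of a univariate Gaussian with known precision lam,
# with Gaussian prior mu ~ N(m0, 1/beta0). Natural parameters use u(mu) = (mu, mu^2).
np.random.seed(0)
lam = 2.0                                   # known observation precision
m0, beta0 = 0.0, 1.0                        # prior on mu
x = np.random.normal(1.5, 1 / np.sqrt(lam), size=100)

prior_nat = np.array([beta0 * m0, -beta0 / 2])
# Each observed child x_n sends the message (lam * x_n, -lam / 2) to its parent mu.
child_msgs = np.stack([np.array([lam * xn, -lam / 2]) for xn in x])

post_nat = prior_nat + child_msgs.sum(axis=0)        # phi* = phi(prior) + sum of messages
post_prec = -2 * post_nat[1]                          # beta0 + N * lam
post_mean = post_nat[0] / post_prec                   # (beta0*m0 + lam*sum(x)) / post_prec
print(post_mean, post_prec)                 # matches the standard conjugate Gaussian update
```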

Summary of VB
- The mean-field assumption plus variational methods yield the update
  Q(Z_i) = \frac{1}{C} \exp \langle \ln P(Z_i, Z_{-i}, D) \rangle
- The conjugate-exponential family supplies the forward and backward mappings needed to evaluate it.
- The result is the updated factor Q(Z_i), computed from Q over the Markov blanket, Q(\mathrm{mb}(Z_i)).

Outline
- Motivation
- The Variational Bayesian Framework
  - Variational Free Energy
  - Optimization Tech.: Mean Field Approximation, Exponential Family, Bayesian Networks
- Example: VB for Mixture model
- Discussion
- Application
- Reference

Mixture of Gaussians (MoG)
  p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}
  p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)
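For reference, sampling from this generative model is straightforward; the Python sketch below uses made-up parameter values purely for illustration.

```python
import numpy as np

# Sampling from the MoG generative model p(X | Z, mu, Lambda) p(Z | pi)
# with illustrative (hypothetical) parameter values.
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                            # mixing proportions
mu = np.array([[-2.0, 0.0], [0.0, 3.0], [4.0, 1.0]])      # component means (D = 2)
cov = np.stack([np.eye(2) * s for s in (0.3, 0.5, 0.8)])  # covariances Lambda_k^{-1}

N = 500
z = rng.choice(len(pi), size=N, p=pi)                     # z_n ~ Mult(1, pi), 1-of-K
X = np.array([rng.multivariate_normal(mu[k], cov[k]) for k in z])
print(X.shape)                                            # (500, 2)
```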

Infinite Student's t-Mixture
- Stick-breaking prior (Dirichlet process):
  \pi_j(V) = V_j \prod_{i=1}^{j-1} (1 - V_i), \qquad V_j \sim \mathrm{Beta}(1, \alpha), \qquad p(\alpha) = \mathrm{Gam}(\alpha \mid \eta_1, \eta_2)
  G = \sum_{j=1}^{\infty} \pi_j(V)\, \delta_{\Theta_j} \sim \mathrm{DP}(\alpha, G_0)
- Dirichlet process mixture:
  p(x_n) = \sum_{j=1}^{\infty} \pi_j(V)\, \mathrm{St}(x_n \mid \mu_j, \Lambda_j, v_j)
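A truncated stick-breaking construction makes the weights \pi_j(V) easy to simulate; the sketch below uses a finite truncation of the infinite sum, for illustration only.

```python
import numpy as np

# Truncated stick-breaking construction of the DP weights pi_j(V) used above.
rng = np.random.default_rng(1)
alpha, T = 2.0, 20                       # concentration parameter and truncation level
V = rng.beta(1.0, alpha, size=T)         # V_j ~ Beta(1, alpha)
pi = V * np.concatenate(([1.0], np.cumprod(1 - V)[:-1]))   # pi_j = V_j * prod_{i<j}(1 - V_i)
print(pi.sum())                          # close to 1 for a large enough truncation
```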

Latent Dirichlet Allocation (LDA)

Outline
- Motivation
- The Variational Bayesian Framework
  - Variational Free Energy
  - Optimization Tech.: Mean Field Approximation, Exponential Family, Bayesian Networks
- Example: VB for Mixture model
- Discussion
- Application
- Reference

Step 1: Choose the prior distributions
- Choose conjugate priors (guided, e.g., by the maximum-entropy principle), so that the posterior h(\theta \mid x) stays in the same family as the prior:
  \pi \sim \mathrm{SymDir}(K, \alpha_0)
  \Lambda_{i=1,\dots,K} \sim \mathcal{W}(w_0, \nu_0)
  \mu_{i=1,\dots,K} \sim \mathcal{N}(m_0, (\beta_0 \Lambda_i)^{-1})
  z_{i=1,\dots,N} \sim \mathrm{Mult}(1, \pi)
  x_{i=1,\dots,N} \sim \mathcal{N}(\mu_{z_i}, \Lambda_{z_i}^{-1})
- Notation: K is the number of Gaussian components and N the number of samples. SymDir() is the K-dimensional symmetric Dirichlet distribution, the conjugate prior of the categorical/multinomial distribution. W() is the Wishart distribution, the conjugate prior of the precision of a Gaussian. Mult() is the multinomial distribution; here each z_n = \{z_{n1}, \dots, z_{nK}\} is a 1-of-K vector with exactly one component equal to 1. N() is the multivariate Gaussian distribution.
- X = \{x_1, \dots, x_N\} are the N observations; Z = \{z_1, \dots, z_N\} are latent variables, where z_n indicates which mixture component the sample x_n belongs to; \pi = \{\pi_1, \dots, \pi_K\} are the mixing proportions of the Gaussian components; \mu and \Lambda are the component means and precisions; K, \alpha_0, \beta_0, w_0, \nu_0, m_0 are fixed hyperparameters.
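To see the prior in action, the short Python sketch below draws one set of model parameters from these priors; the hyperparameter values are my own illustrative choices, not prescribed by the tutorial.

```python
import numpy as np
from scipy.stats import wishart, dirichlet

# Drawing one set of GMM parameters from the Step 1 priors (illustrative hyperparameters).
rng = np.random.default_rng(0)
K, D = 3, 2
alpha0, beta0, nu0 = 1.0, 1.0, D + 2.0
m0, w0 = np.zeros(D), np.eye(D)

pi = dirichlet.rvs(np.full(K, alpha0), random_state=0)[0]      # pi ~ SymDir(K, alpha0)
Lam = wishart.rvs(df=nu0, scale=w0, size=K, random_state=0)    # Lambda_k ~ W(w0, nu0)
mu = np.array([rng.multivariate_normal(m0, np.linalg.inv(beta0 * Lam[k]))   # mu_k | Lambda_k
               for k in range(K)])
print(pi, mu.shape)       # mixing weights and the K component means
```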

The model in plate notation:
- Small squares denote the fixed hyperparameters \beta_0, \nu_0, \alpha_0, m_0, W_0; circles denote the random variables \pi, z_i, x_i, \mu_k, \Lambda_k.
- The indicator z_i selects which of the incoming component variables (\mu_k, \Lambda_k) generates x_i.

Step 2: Write down the joint probability density
- The joint density factorizes as
  p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)
- The individual factors are:
  p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}
  p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}
  p(\pi) = \frac{\Gamma(K \alpha_0)}{\Gamma(\alpha_0)^K} \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}
  p(\mu \mid \Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1})
  p(\Lambda) = \prod_{k=1}^{K} \mathcal{W}(\Lambda_k \mid w_0, \nu_0)
- where
  \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big)
  \mathcal{W}(\Lambda \mid w, \nu) = B(w, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\Big( -\tfrac{1}{2} \mathrm{Tr}(w^{-1} \Lambda) \Big)
  B(w, \nu) = |w|^{-\nu/2} \Big( 2^{\nu D/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\big( \tfrac{\nu + 1 - i}{2} \big) \Big)^{-1}

Step 3: Compute the variational marginal densities
(1) Compute q(Z). With the mean-field assumption q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda), we have
  \ln q^*(Z) = E_{\pi, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}
             = E_{\pi, \mu, \Lambda}[\ln p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)] + \text{const}
             = E_{\pi}[\ln p(Z \mid \pi)] + E_{\mu, \Lambda}[\ln p(X \mid Z, \mu, \Lambda)] + \text{const}
             = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}
  where \ln \rho_{nk} = E[\ln \pi_k] + \tfrac{1}{2} E[\ln |\Lambda_k|] - \tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} E_{\mu_k, \Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)].
  Exponentiating both sides gives q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}; normalizing,
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}, \qquad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
  So q^*(Z) is a product of categorical (multinomial) distributions, and therefore E[z_{nk}] = r_{nk}.

(2) Compute the density of \pi. With the further factorization q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k),
  \ln q^*(\pi) = E_{Z, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}
              = \ln p(\pi) + E_Z[\ln p(Z \mid \pi)] + \text{const}
              = \sum_{k=1}^{K} (\alpha_0 - 1) \ln \pi_k + \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln \pi_k + \text{const}
  Exponentiating, we see that q^*(\pi) is a Dirichlet distribution,
  q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha), \qquad \alpha_k = \alpha_0 + N_k, \qquad N_k = \sum_{n=1}^{N} r_{nk}.

(3) Finally, consider \mu and \Lambda jointly. For each Gaussian component,
  \ln q^*(\mu_k, \Lambda_k) = E_{Z, \pi, \mu_{i \ne k}, \Lambda_{i \ne k}}[\ln p(X \mid Z, \mu_k, \Lambda_k)\, p(\mu_k, \Lambda_k)] + \text{const}
                           = \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N} E[z_{nk}] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{const}
  which is a Gaussian-Wishart distribution,
  q^*(\mu_k, \Lambda_k) = \mathcal{N}(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1})\, \mathcal{W}(\Lambda_k \mid w_k, \nu_k)
  where
  \beta_k = \beta_0 + N_k, \qquad m_k = \frac{1}{\beta_k} (\beta_0 m_0 + N_k \bar{x}_k), \qquad \nu_k = \nu_0 + N_k,
  w_k^{-1} = w_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T,
  \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n, \qquad S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T.

Step 4: Iterate until convergence
- Evaluating r_{nk} requires \rho_{nk}, which in turn depends on the three expectations E[\ln \pi_k], E[\ln |\Lambda_k|], and E_{\mu_k, \Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)]. Their general expressions are:
  \ln \tilde{\pi}_k \equiv E[\ln \pi_k] = \psi(\alpha_k) - \psi\Big( \sum_{i=1}^{K} \alpha_i \Big)
  \ln \tilde{\Lambda}_k \equiv E[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\Big( \frac{\nu_k + 1 - i}{2} \Big) + D \ln 2 + \ln |W_k|
  E_{\mu_k, \Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)] = \frac{D}{\beta_k} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)
- These results lead to
  r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\Big( -\frac{D}{2 \beta_k} - \frac{\nu_k}{2} (x_n - m_k)^T W_k (x_n - m_k) \Big), \qquad \text{with } \sum_{k=1}^{K} r_{nk} = 1.

Summary: Variational Inference for GMM

VBM step (update q(\pi) and q(\mu_k, \Lambda_k) from the soft counts / expected sufficient statistics):
  q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha), \qquad \alpha_k = \alpha_0 + N_k, \qquad N_k = \sum_{n=1}^{N} r_{nk}
  q^*(\mu_k, \Lambda_k) = \mathcal{N}(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1})\, \mathcal{W}(\Lambda_k \mid w_k, \nu_k)
  \beta_k = \beta_0 + N_k, \qquad m_k = \frac{1}{\beta_k}(\beta_0 m_0 + N_k \bar{x}_k), \qquad \nu_k = \nu_0 + N_k, \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n
  w_k^{-1} = w_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T, \qquad S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T

VBE step (update the latent variables):
  q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}, \qquad r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\Big( -\frac{D}{2 \beta_k} - \frac{\nu_k}{2} (x_n - m_k)^T W_k (x_n - m_k) \Big)
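Putting steps 1-4 together, the Python sketch below implements the VBE/VBM loop above for a Gaussian mixture. It follows the update equations on this slide; the default hyperparameters (alpha0, beta0, nu0, W0) and the random initialization are my own illustrative choices, not prescribed by the tutorial.

```python
import numpy as np
from scipy.special import digamma

def vb_gmm(X, K=3, n_iter=100, alpha0=1.0, beta0=1.0, seed=0):
    """Variational Bayes for a Gaussian mixture: the VBE/VBM updates above.

    Variable names (alpha, beta, m, W, nu, r) mirror the slide notation.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    m0 = X.mean(axis=0)
    W0_inv = np.cov(X.T) + 1e-6 * np.eye(D)          # a broad, data-scaled prior scale
    nu0 = D + 2.0

    # Initialise responsibilities randomly, then alternate VBM / VBE steps.
    r = rng.dirichlet(np.ones(K), size=N)
    for _ in range(n_iter):
        # ---- VBM step: update q(pi) and q(mu_k, Lambda_k) from soft counts ----
        Nk = r.sum(axis=0) + 1e-10                    # N_k
        xbar = (r.T @ X) / Nk[:, None]                # xbar_k
        alpha = alpha0 + Nk
        beta = beta0 + Nk
        nu = nu0 + Nk
        m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
        W = np.empty((K, D, D))
        for k in range(K):
            diff = X - xbar[k]
            Sk = (r[:, k, None] * diff).T @ diff / Nk[k]
            dm = (xbar[k] - m0)[:, None]
            W_inv = W0_inv + Nk[k] * Sk + (beta0 * Nk[k] / (beta0 + Nk[k])) * (dm @ dm.T)
            W[k] = np.linalg.inv(W_inv)

        # ---- VBE step: recompute the responsibilities r_nk ----
        ln_pi = digamma(alpha) - digamma(alpha.sum())
        log_rho = np.empty((N, K))
        for k in range(K):
            ln_lam = (digamma((nu[k] + 1 - np.arange(1, D + 1)) / 2).sum()
                      + D * np.log(2) + np.linalg.slogdet(W[k])[1])
            diff = X - m[k]
            quad = D / beta[k] + nu[k] * np.einsum('nd,de,ne->n', diff, W[k], diff)
            log_rho[:, k] = ln_pi[k] + 0.5 * ln_lam - 0.5 * D * np.log(2 * np.pi) - 0.5 * quad
        log_rho -= log_rho.max(axis=1, keepdims=True)  # subtract the max for numerical stability
        r = np.exp(log_rho)
        r /= r.sum(axis=1, keepdims=True)
    return r, alpha, m, beta, W, nu
```

With data X of shape (N, D), e.g. the samples drawn in the MoG sketch earlier, `r, alpha, m, beta, W, nu = vb_gmm(X, K=3)` returns the responsibilities and the posterior parameters, and `np.argmax(r, axis=1)` gives a hard clustering of the points.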

Outline
- Motivation
- The Variational Bayesian Framework
  - Variational Free Energy
  - Optimization Tech.: Mean Field Approximation, Exponential Family, Bayesian Networks
- Example: VB for Mixture model
- Discussion
- Application
- Reference

EM vs. VB

The Accuracy-vs-Complexity Trade-off

Application
- Matrix factorization: probabilistic PCA, mixtures of PPCA, Independent Factor Analysis (IFA), nonlinear ICA/IFA/SSM, mixtures of Bayesian ICA, Bayesian mixtures of factor analyzers, etc.
- Time series: Bayesian HMMs, variational Kalman filtering, switching state-space models, etc.
- Topic models: Latent Dirichlet Allocation (LDA), (hierarchical) Dirichlet process (mixture) models, Bayesian nonparametric models, etc.
- Variational Gaussian process classifiers
- Sparse Bayesian learning
- Variational Bayesian filtering, etc.

Reference
- Neal, Radford M., and Geoffrey E. Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants." Learning in Graphical Models. Springer Netherlands, 1998. 355-368.
- Attias, Hagai. "Inferring parameters and structure of latent variable models by variational Bayes." Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999.
- Winn, John, and Christopher M. Bishop. "Variational Message Passing." Journal of Machine Learning Research 6 (2005).
- Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and variational inference." Foundations and Trends in Machine Learning 1.1-2 (2008): 1-305.
- Šmídl, Václav, and Anthony Quinn. The Variational Bayes Method in Signal Processing. Springer, 2006.
- Wikipedia, Variational Bayesian methods, http://en.wikipedia.org/wiki/Variational_Bayes

Any Questions?
