Optimization with EM and Expectation-Conjugate-Gradient


Ruslan Salakhutdinov (RSALAKHU@CS.TORONTO.EDU)
Sam Roweis (ROWEIS@CS.TORONTO.EDU)
Department of Computer Science, University of Toronto, 6 King's College Rd, M5S 3G4, Canada

Zoubin Ghahramani (ZOUBIN@GATSBY.UCL.AC.UK)
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, UK

In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

We show a close relationship between the Expectation-Maximization (EM) algorithm and direct optimization algorithms such as gradient-based methods for parameter learning. We identify analytic conditions under which EM exhibits Newton-like behavior, and conditions under which it possesses poor, first-order convergence. Based on this analysis, we propose two novel algorithms for maximum likelihood estimation of latent variable models, and report empirical results showing that, as predicted by theory, the proposed new algorithms can substantially outperform standard EM in terms of speed of convergence in certain cases.

1. Introduction

The problem of Maximum Likelihood (ML) parameter estimation for latent variable models is an important problem in the area of machine learning and pattern recognition. ML learning with unobserved quantities arises in many probabilistic models such as density estimation, dimensionality reduction, or classification, and generally reduces to a relatively hard optimization problem in terms of the model parameters after the hidden quantities have been integrated out.

A common technique for ML estimation of model parameters in the presence of latent variables is the Expectation-Maximization (EM) algorithm [3]. The EM algorithm alternates between estimating the unobserved variables given the current model and refitting the model given the estimated, complete data. As such it takes discrete steps in parameter space, similar to a first order method operating on the gradient of a locally reshaped likelihood function.

In spite of the tremendous success of the EM algorithm in practice, due to its simplicity and fast initial progress, some authors [11] have argued that the speed of EM convergence can be extremely slow, and that more complicated second-order methods should generally be favored over EM. Many methods have been proposed to enhance the convergence speed of the EM algorithm, mostly based on conventional optimization theory [7, 8]. Several authors [11, 1] have also proposed hybrid approaches for ML learning, advocating switching to an approximate Newton method after performing several EM iterations. All of these approaches, although sometimes successful in terms of convergence, are much more complex than EM and difficult to analyze; thus they have not been popular in practice.

Our goal in this paper is to contrast the EM algorithm with a direct gradient-based optimization approach. As a concrete alternative, we present an Expectation-Conjugate-Gradient (ECG) algorithm for maximum likelihood estimation in latent variable models, and show that it can outperform EM in terms of convergence in certain cases. However, in other cases the performance of EM is superior. To understand these behaviors, we study the convergence properties of the EM algorithm and identify analytic conditions under which the EM algorithm exhibits Newton-like convergence behavior, and conditions under which it possesses extremely poor, first-order convergence. Based on this analysis, we introduce a simple hybrid EM-ECG algorithm that switches between EM and ECG based on estimated quantities suggested by our analysis.
We report empirical results on synthetic as well as real-world data sets, showing that, as predicted by theory, this simple algorithm almost never performs worse than standard EM and can substantially outperform EM in terms of speed of convergence.

2. Linear and Newton Convergence of Expectation Maximization

We first focus on the analysis of the convergence properties of the Expectation-Maximization (EM) algorithm. Consider a probabilistic model of observed data x which uses latent variables z. The log-likelihood (objective function) can be lower bounded by the difference between the expected complete log-likelihood and a negative entropy term:

$$
L(\Theta) = \ln p(x \mid \Theta) = \ln \int p(x, z \mid \Theta)\, dz
\;\ge\; \int p(z \mid x, \Psi) \ln \frac{p(x, z \mid \Theta)}{p(z \mid x, \Psi)}\, dz
$$
$$
= \int p(z \mid x, \Psi) \ln p(x, z \mid \Theta)\, dz \;-\; \int p(z \mid x, \Psi) \ln p(z \mid x, \Psi)\, dz
\;=\; Q(\Theta, \Psi) - H(\Psi, \Psi) \;=\; F(\Theta, \Psi)
$$

The EM algorithm is nothing more than coordinate ascent in the functional F(Θ, Ψ), alternating between maximizing F with respect to Ψ for fixed Θ (E-step) and with respect to Θ for fixed Ψ (M-step) [10].

The algorithm implicitly defines a mapping M: Θ → Θ' from parameter space to itself, such that Θt+1 = M(Θt). If the iterates Θt converge to Θ* (and M(Θ) is continuous), then Θ* = M(Θ*), and in the neighborhood of Θ*, by a Taylor series expansion:

$$\Theta^{t+1} - \Theta^* \;\approx\; M'(\Theta^*)\,(\Theta^t - \Theta^*) \qquad (1)$$

where $M'(\Theta^*) = \partial M / \partial \Theta \big|_{\Theta = \Theta^*}$. Therefore EM is essentially a linear iteration algorithm with a convergence rate matrix M'(Θ*) (which is typically nonzero) [9].

For most objective functions, the EM step Θ(t+1) − Θ(t) in parameter space and the true gradient can be related by a transformation matrix P(Θt) that changes at each iteration:

$$\Theta^{(t+1)} - \Theta^{(t)} = P(\Theta^t)\, \nabla L(\Theta^t) \qquad (2)$$

(We define $\nabla L(\Theta^t) = \partial L(\Theta) / \partial \Theta \big|_{\Theta = \Theta^t}$.) Under certain conditions, the transformation matrix P(Θt) is guaranteed to be positive definite with respect to the gradient.[Footnote 1] In particular, if

C1: the expected complete log-likelihood Q(Θ, Θt) is well defined and differentiable everywhere in Θ, and

C2: for any fixed Θt ≠ Θ(t+1), along any direction that passes through Θt+1, Q(Θ, Θt) has only a single critical point, located at the maximum Θ = Θt+1;

then

$$\nabla L(\Theta^t)^\top P(\Theta^t)\, \nabla L(\Theta^t) > 0 \quad \forall\, \Theta^t \qquad (3)$$

The second condition C2 may seem very strong. However, for the EM algorithm C2 is satisfied whenever the M-step has a single unique solution.[Footnote 2]

Footnote 1: Note that $\nabla L(\Theta^t)^\top P(\Theta^t)\, \nabla L(\Theta^t) = \nabla Q(\Theta^t)^\top (\Theta^{(t+1)} - \Theta^t)$, where $\nabla Q(\Theta^t) = \partial Q(\Theta, \Theta^t) / \partial \Theta \big|_{\Theta = \Theta^t}$; the quantity $\nabla Q(\Theta^t)^\top (\Theta^{(t+1)} - \Theta^t)$ is the directional derivative of the function Q(Θ, Θt) in the direction of Θ(t+1) − Θt. C1 and C2 together imply that this quantity is positive: otherwise, by the Mean Value Theorem (C1), Q(Θ, Θt) would have a critical point along some direction, located at a point other than Θt+1 (C2). Using the identity $\nabla L(\Theta^t) = \partial Q(\Theta, \Theta^t) / \partial \Theta \big|_{\Theta = \Theta^t}$, we have $\nabla L(\Theta^t)^\top P(\Theta^t)\, \nabla L(\Theta^t) = \nabla Q(\Theta^t)^\top (\Theta^{(t+1)} - \Theta^t) > 0$.

Footnote 2: In particular, C2 holds for any exponential family model, due to the well-known convexity of Q(Θ, Θt) in this case.

The important consequence of the above analysis is that (when the expected complete log-likelihood function has a unique optimum) EM has the appealing quality of always taking a step Θ(t+1) − Θt having positive projection onto the true gradient of the objective function L(Θt). This makes EM similar to a first order method operating on the gradient of a locally reshaped likelihood function.

For maximum likelihood learning of a mixture of Gaussians model using the EM algorithm, this positive definite transformation matrix P(Θt) was first described by Xu and Jordan [18]. We extended their results by deriving the explicit form of the transformation matrix for several other latent variable models such as Factor Analysis (FA), Probabilistic Principal Component Analysis (PPCA), mixtures of PPCAs, mixtures of FAs, and Hidden Markov Models [13].[Footnote 3]

Footnote 3: We also derived a general form of the transformation matrix for exponential family models in terms of their natural parameters.
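Property (3) is easy to check numerically. The following is a minimal sketch, not from the paper: it assumes a toy one-dimensional mixture of two Gaussians with unit variances and equal, fixed mixing weights (so Θ consists only of the two means), runs a few EM updates, and verifies that each EM step has positive inner product with a finite-difference estimate of ∇L(Θt).

```python
# Minimal sketch (not from the paper): numerically verify property (3) on a toy
# one-dimensional mixture of two Gaussians with unit variances and equal, fixed
# mixing weights, so Theta consists only of the two means. Each EM step should have
# a positive inner product with a finite-difference estimate of grad L(Theta^t).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 400), rng.normal(1.0, 1.0, 400)])

def log_likelihood(mu):
    # L(Theta) = sum_n ln[ 0.5 N(x_n; mu_1, 1) + 0.5 N(x_n; mu_2, 1) ]
    comp = -0.5 * (x[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2 * np.pi) + np.log(0.5)
    return np.logaddexp(comp[:, 0], comp[:, 1]).sum()

def em_step(mu):
    # E-step: responsibilities r_ni; M-step: responsibility-weighted means
    logits = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logits - logits.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    return (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

def num_grad(f, mu, eps=1e-6):
    # central finite differences, used as an independent estimate of grad L
    return np.array([(f(mu + eps * e) - f(mu - eps * e)) / (2 * eps) for e in np.eye(len(mu))])

mu = np.array([-0.3, 0.2])
for t in range(5):
    step = em_step(mu) - mu              # Theta^(t+1) - Theta^(t) = P(Theta^t) grad L(Theta^t)
    grad = num_grad(log_likelihood, mu)  # grad L(Theta^t)
    print(f"t={t}  <step, grad> = {step @ grad:.4f}")  # positive, as claimed by (3)
    mu = mu + step
```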
We can further study the structure of the transformation matrix P(Θt) and relate it to the convergence rate matrix M'. Taking negative derivatives of both sides of (2) with respect to Θt, we have:

$$I - M'(\Theta^t) = -P'(\Theta^t)\, \nabla L(\Theta^t) - P(\Theta^t)\, S(\Theta^t) \qquad (4)$$

where $S(\Theta^t) = \partial^2 L(\Theta) / \partial \Theta^2 \big|_{\Theta = \Theta^t}$ is the Hessian of the objective function, $M'_{ij}(\Theta^t) = \partial \Theta^{t+1}_i / \partial \Theta^t_j$ is the input-output derivative matrix for the EM mapping, and $P'(\Theta^t) = \partial P(\Theta) / \partial \Theta \big|_{\Theta = \Theta^t}$ is the tensor derivative of P(Θt) with respect to Θt.

In "flat" regions of L(Θ), where ∇L(Θ) approaches zero (and P'(Θt) does not become infinite), the first term on the RHS of equation (4) becomes much smaller than the second term, and the matrix P(Θt) becomes a rescaled version of the negative inverse Hessian:

$$P(\Theta^t) \;\approx\; -\bigl[I - M'(\Theta^t)\bigr]\, S(\Theta^t)^{-1} \qquad (5)$$

In particular, if the EM algorithm iterates converge to a local optimum at Θ*, then near this point (i.e. for sufficiently large t) EM may exhibit superlinear convergence behavior. This is also true in "plateau" regions where the gradient is very small even if they are not near a local optimum. The nature of the convergence behavior is controlled by the eigenvalues of the matrix M'(Θt). If all eigenvalues tend to zero, then EM becomes a true Newton method,[Footnote 4] rescaling the gradient by exactly the negative inverse Hessian:

$$\Theta^{t+1} = \Theta^t - S(\Theta^t)^{-1}\, \nabla L(\Theta^t) \qquad (6)$$

As the eigenvalues tend to unity, EM takes smaller and smaller step sizes, giving poor, first-order convergence.

Footnote 4: For details on Newton-type algorithms refer to [11].

Dempster, Laird, and Rubin [3] showed that if the EM iterates converge to Θ*, then

$$M'(\Theta)\Big|_{\Theta = \Theta^*} = \left[\frac{\partial^2 Q(\Theta, \Theta^*)}{\partial \Theta^2}\right]^{-1} \left[\frac{\partial^2 H(\Theta, \Theta^*)}{\partial \Theta^2}\right]\Bigg|_{\Theta = \Theta^*}$$

which can be interpreted as the ratio of missing information to the complete information near the local optimum. Thus, in the neighborhood of a solution (for sufficiently large t), we can make the following novel link between the transformation matrix P, the convergence rate matrix M', and the exact inverse Hessian S:

$$P(\Theta^t) \;\approx\; \left[\, I - \left(\frac{\partial^2 Q}{\partial \Theta^2}\right)^{-1}\left(\frac{\partial^2 H}{\partial \Theta^2}\right)\right]\Bigg|_{\Theta = \Theta^t} \bigl[-S(\Theta^t)\bigr]^{-1} \qquad (7)$$

This analysis of EM has a very interesting interpretation which is applicable to any latent variable model: when the missing information is small compared to the complete information, EM exhibits approximate Newton behavior and enjoys fast, typically superlinear convergence in the neighborhood of Θ*. If the fraction of missing information approaches unity, the eigenvalues of the first term in (7) approach zero and EM will exhibit extremely slow convergence.
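As an illustration of the role of M'(Θt), the sketch below (again not from the paper, and reusing the assumed toy two-mean mixture model) estimates the convergence rate matrix of equation (1) by finite differences of the EM map at its fixed point, once for well-separated data and once for strongly overlapping data. One typically finds eigenvalues close to zero in the first case (Newton-like behavior) and much closer to one in the second (slow, first-order behavior), mirroring the missing-information interpretation above.

```python
# Minimal sketch (not from the paper): estimate the convergence rate matrix M'(Theta*)
# of equation (1) by finite differences of the EM map at its fixed point, for the same
# assumed toy two-mean mixture, on well-separated versus strongly overlapping data.
import numpy as np

def make_data(sep, n=400, seed=0):
    rng = np.random.default_rng(seed)
    return np.concatenate([rng.normal(-sep, 1.0, n), rng.normal(sep, 1.0, n)])

def em_map(mu, x):
    # One full EM update of the two means (unit variances, equal fixed weights)
    logits = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logits - logits.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    return (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

def rate_matrix_eigs(x, iters=2000, eps=1e-5):
    mu = np.array([-0.5, 0.5])
    for _ in range(iters):               # iterate Theta^(t+1) = M(Theta^t) to the fixed point
        mu = em_map(mu, x)
    # finite-difference Jacobian of the EM map at Theta*: column j is dM/dTheta_j
    J = np.column_stack([(em_map(mu + eps * e, x) - em_map(mu - eps * e, x)) / (2 * eps)
                         for e in np.eye(2)])
    return np.sort(np.abs(np.linalg.eigvals(J)))

print("well-separated data: |eig(M')| =", rate_matrix_eigs(make_data(3.0)))
print("overlapping data:    |eig(M')| =", rate_matrix_eigs(make_data(0.25)))
```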

Figure 1. Contour plots of the likelihood function L(Θ) for MoG examples using well-separated (upper panels) and not-well-separated (lower panels) one-dimensional data sets. Axes correspond to the two means. The dash-dot line shows the direction of the true gradient ∇L(Θ), the solid line shows the direction of P(Θ)∇L(Θ), and the dashed line shows the direction of (−S)⁻¹∇L(Θ). Right panels are blow-ups of the dashed regions on the left. The numbers indicate the log of the l2 norm of the gradient. Note that for the "well-separated" case, in the vicinity of the maximum, the vectors P(Θ)∇L(Θ) and (−S)⁻¹∇L(Θ) become identical.

Figure 1 illustrates the above results in the simple case of fitting a mixture of Gaussians model to well-clustered data, for which EM exhibits Newton-like convergence, and to not-well-clustered data, for which EM is slow. As we will see from the empirical results of the later sections, many other models also show this same effect. For example, when Hidden Markov Models or Aggregate Markov Models [14] are trained on very structured sequences, EM exhibits superlinear convergence, in particular when the state transition matrix is sparse and the output distributions are almost deterministic at each state.

The above analysis motivates the use of alternative optimization techniques in the regime where missing information is high and EM is likely to perform poorly. In the following section, we analyze exactly such an alternative, the Expectation-Conjugate-Gradient (ECG) algorithm, a simple direct optimization method for learning the parameters of latent variable models.

3. Expectation-Conjugate-Gradient (ECG) Algorithm

The key idea of the ECG algorithm is to note that if we can easily compute the derivative ∂/∂Θ ln p(x, z|Θ) of the complete log-likelihood, then knowing the posterior p(z|x, Θ) we can compute the exact gradient ∇L(Θ):

$$
\nabla L(\Theta) = \frac{\partial}{\partial \Theta} \ln p(x \mid \Theta)
= \frac{1}{p(x \mid \Theta)}\, \frac{\partial}{\partial \Theta} \int p(x, z \mid \Theta)\, dz
$$
$$
= \int \frac{p(x, z \mid \Theta)}{p(x \mid \Theta)}\, \frac{\partial}{\partial \Theta} \ln p(x, z \mid \Theta)\, dz
= \int p(z \mid x, \Theta)\, \frac{\partial}{\partial \Theta} \ln p(x, z \mid \Theta)\, dz \qquad (8)
$$
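As a concrete check of identity (8), the following minimal sketch (not from the paper) computes the posterior-weighted complete-data gradient for the assumed toy mixture with two unknown means and compares it against central finite differences of ln p(x|Θ); this posterior-weighted gradient is exactly what an "EG" step supplies to the optimizer.

```python
# Minimal sketch (not from the paper): check identity (8) on the assumed toy mixture
# with two unknown means. The posterior-weighted complete-data gradient (the "EG"
# gradient) should match central finite differences of ln p(x | Theta).
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])

def log_likelihood(mu):
    comp = -0.5 * (x[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2 * np.pi) + np.log(0.5)
    return np.logaddexp(comp[:, 0], comp[:, 1]).sum()

def eg_gradient(mu):
    # E-step: responsibilities p(z = i | x_n, Theta)
    logits = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logits - logits.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # G-step: E_{p(z|x,Theta)}[ d/d mu_i ln p(x, z | Theta) ] = sum_n r_ni (x_n - mu_i)
    return (r * (x[:, None] - mu[None, :])).sum(axis=0)

mu = np.array([0.5, 1.5])
eps = 1e-6
fd = np.array([(log_likelihood(mu + eps * e) - log_likelihood(mu - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print("EG gradient        :", eg_gradient(mu))
print("finite differences :", fd)   # should agree closely
```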

This exact gradient can then be utilized in any standard manner, for example to perform gradient ascent or to control a line search technique. (Note that if one can derive exact EM for a latent variable model, then one can always derive ECG by computing the above integral over hidden variables.) As an example, we describe a conjugate gradient algorithm:

Expectation-Conjugate-Gradient algorithm:
Apply a conjugate gradient optimizer to L(Θ), performing an "EG" step whenever the value or gradient of L(Θ) is requested (e.g. during a line search). The gradient computation is given by
- E-Step: Compute the posterior p(z | x, Θt) and the log-likelihood L(Θ) as usual.
- G-Step: $\nabla L(\Theta^t) = \int p(z \mid x, \Theta^t)\, \frac{\partial}{\partial \Theta} \log p(x, z \mid \Theta)\, dz$

When certain parameters must obey positivity or linear constraints, we can either modify our optimizer to respect the constraints, or we can reparameterize to allow unconstrained optimization. In our experiments, we use simple reparameterizations of model parameters that allow our optimizers to work with arbitrary values. (Of course, an extra term must be included in the gradient calculation due to the application of the chain rule through such reparameterizations.) For example, in the MoG model we use a "softmax" parameterization of the mixing coefficients, π_i = exp(γ_i) / Σ_{j=1}^{M} exp(γ_j); for covariance matrices to be symmetric positive definite, we use the Cholesky decomposition (or log variances for diagonal covariance matrices). In HMMs, we reparameterize probabilities via softmax functions as well.

Figure 2. Left panel illustrates why ECG may converge to a better local optimum. Right panel displays the learning curves of the EM and ECG algorithms when training a fully connected 7-state HMM to model human DNA sequences. Both algorithms started from the same initial parameter values, but converged to different local optima.

Note that ECG is different from Jamshidian and Jennrich's proposed acceleration of EM [6]. Their method always moves in the same direction as EM but uses a generalized conjugate gradient algorithm to find the best step size; ECG computes the true gradient and moves in a search direction determined by the conjugate gradient algorithm, which may be different from the EM step direction.

Of course, the choice of initial conditions is very important for the EM algorithm or for ECG. Since EM is based on optimizing a convex lower bound on the likelihood, once EM is trapped in a poor basin of attraction, it can never find a better local optimum. Algorithms such as split and merge EM [16] were developed to escape from such configurations. It turns out that direct optimization methods such as ECG may also avoid this problem because of the nonlocal nature of the line search. In many of our experiments, ECG actually converges to a better local optimum than EM; figure 2 illustrates exactly such a case.
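The sketch below (not from the paper) puts the pieces of this section together for a one-dimensional mixture of Gaussians: SciPy's conjugate gradient routine plays the role of the outer optimizer, each function/gradient evaluation performs an "EG" step, and the unconstrained parameterization follows the text (means, log standard deviations, and softmax logits γ for the mixing coefficients, with the chain-rule terms included in the gradient). The data, component count, and variable names are illustrative assumptions.

```python
# Minimal sketch (not from the paper): an ECG-style optimizer for a 1-D mixture of
# Gaussians using an off-the-shelf conjugate gradient routine (SciPy). Parameters are
# made unconstrained as in the text: means, log standard deviations, and softmax
# logits gamma for the mixing coefficients. Every function/gradient evaluation is an
# "EG" step: an E-step (responsibilities) followed by the exact gradient of Eq. (8),
# with the chain rule applied through the reparameterizations.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(0.5, 1.0, 300), rng.normal(3, 0.5, 300)])
K = 3

def unpack(theta):
    mu, log_sd, gamma = np.split(theta, 3)
    return mu, np.exp(log_sd), softmax(gamma)

def neg_ll_and_grad(theta):
    mu, sd, pi = unpack(theta)
    # E-step: log responsibilities and log-likelihood
    log_comp = (np.log(pi) - np.log(sd) - 0.5 * np.log(2 * np.pi)
                - 0.5 * ((x[:, None] - mu[None, :]) / sd[None, :]) ** 2)
    log_px = logsumexp(log_comp, axis=1)
    r = np.exp(log_comp - log_px[:, None])
    # G-step: exact gradient via Eq. (8), chain-ruled through the reparameterization
    diff = (x[:, None] - mu[None, :]) / sd[None, :]
    d_mu = (r * diff / sd[None, :]).sum(axis=0)
    d_log_sd = (r * (diff ** 2 - 1.0)).sum(axis=0)
    d_gamma = (r - pi[None, :]).sum(axis=0)
    return -log_px.sum(), -np.concatenate([d_mu, d_log_sd, d_gamma])

theta0 = np.concatenate([np.array([-1.0, 0.0, 1.0]), np.zeros(K), np.zeros(K)])
res = minimize(neg_ll_and_grad, theta0, jac=True, method="CG")
mu, sd, pi = unpack(res.x)
print("log-likelihood:", -res.fun)
print("means:", np.round(mu, 2), "sds:", np.round(sd, 2), "weights:", np.round(pi, 2))
```

Writing the objective to return both the value and the gradient means each line-search query costs exactly one E-step, which matches the cost accounting (number of E-steps) used in Section 5.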
4. Hybrid EM-ECG Algorithm

As we have seen, the relative performance of EM versus direct optimization depends on the missing information ratio for the given model and data set. The key to practical speedups is the ability to design a hybrid algorithm that can estimate the local missing information ratio M'(Θt) to detect whether to use EM or a direct approach such as ECG. Some authors have attacked this problem by finding the top eigenvector of ∂M(Θ)/∂Θ, evaluated at Θ = Θt as Θt approaches Θ*, using conventional numerical methods such as finite-difference approximations or power methods [4]. These approaches are computationally intensive and difficult to implement, and thus they have not been popular in practice.

Here, we propose using the entropy of the posterior over hidden variables, which can be computed after performing an E-step, as a crude estimate of the local missing information ratio. This entropy has a natural interpretation as the uncertainty about missing information, and thus can serve as a guide for switching between the EM and ECG regimes. For many models with discrete hidden variables this quantity is quite easy to compute. In particular, we define the Normalized Entropy term:

$$\bar{H}_t = -\frac{1}{N \ln M} \sum_{n=1}^{N} \sum_{i=1}^{M} p(z^i \mid x_n, \Theta^t)\, \ln p(z^i \mid x_n, \Theta^t)$$

with z being a discrete hidden variable taking on M values, and N observed data vectors x_n. We now simply switch between EM and ECG based on thresholding this quantity:

Hybrid EM-ECG algorithm:
- Perform EM iterations, evaluating H̄t after each E-step
- If H̄t > τ, then switch to ECG [Footnote a]
- Perform ECG, evaluating H̄t at the end of each line search
- If H̄t < τ, then switch back to EM
- Exit at either phase if:
  1. (L(Θt) − L(Θt−1)) / abs(L(Θt)) ≤ tol, OR
  2. t > Tmax

Footnote a: We are near the optimum or in a plateau region with high entropy.

As we will see from the experimental results, this simple hybrid EM-ECG algorithm never performs substantially worse than, and often performs far better than, either EM or ECG.
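A minimal sketch (not from the paper) of the switching machinery follows: the normalized entropy H̄t computed from an N × M responsibility matrix, and a simplified driver loop that chooses the EM or ECG phase by thresholding it. The e_step, em_step, and ecg_steps callables, as well as the exact placement of the convergence test, are illustrative placeholders rather than the authors' implementation; τ = 0.5 and tol = 10⁻⁸ follow the settings reported in Section 5.

```python
# Minimal sketch (not from the paper) of the switching criterion: the normalized
# posterior entropy H_bar for a model with a discrete hidden variable taking M values,
# computed from the N x M responsibility matrix produced by an E-step, and a simplified
# driver that thresholds it to choose between EM and ECG phases. The e_step, em_step,
# and ecg_steps callables are placeholders for model-specific routines.
import numpy as np

def normalized_entropy(resp, eps=1e-12):
    """H_bar = -(1 / (N ln M)) * sum_{n,i} r_ni ln r_ni for responsibilities r (N x M)."""
    n, m = resp.shape
    return -np.sum(resp * np.log(resp + eps)) / (n * np.log(m))

def hybrid_em_ecg(theta, e_step, em_step, ecg_steps, tau=0.5, tol=1e-8, t_max=1000):
    """Alternate EM and ECG phases based on thresholding the normalized entropy."""
    prev_ll = -np.inf
    for t in range(t_max):
        resp, ll = e_step(theta)                  # posterior responsibilities + log-likelihood
        if abs(ll - prev_ll) / abs(ll) <= tol:    # relative-change exit test
            break
        prev_ll = ll
        if normalized_entropy(resp) > tau:
            theta = ecg_steps(theta)              # high entropy: direct (ECG) phase
        else:
            theta = em_step(theta, resp)          # low entropy: EM phase
    return theta

# The entropy itself: nearly certain vs. maximally uncertain posteriors.
print(normalized_entropy(np.array([[0.99, 0.01], [0.98, 0.02]])))  # close to 0
print(normalized_entropy(np.full((2, 2), 0.5)))                    # close to 1
```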

Figure 3. Learning curves for the ECG, EM-ECG, and EM algorithms, showing superior (upper panels) and inferior (lower panels) performance of ECG under different conditions for three models: MoG (left), HMM (middle), and Aggregate Markov Models (right). All algorithms were started with the same initial parameter vector. The number of E-steps taken by each algorithm is shown on the horizontal axis, and log likelihood is shown on the vertical axis. For ECG and EM-ECG, diamonds indicate the maximum of each line search, and the zero-level likelihood corresponds to the converging point of the EM algorithm. The bottom panels use "well-separated" or "structured" data, for which EM possesses Newton-like convergence behavior; all models in this case converge in 10-15 iterations with stopping criterion [L(Θt+1) − L(Θt)] / abs(L(Θt+1)) ≤ 10⁻¹⁵. The upper panels use "overlapping", "aliased", or "unstructured" data, for which the proposed algorithms perform much better.

5. Experimental Results

We now present empirical results comparing the performance of EM, ECG, and hybrid EM-ECG for learning the parameters of three latent variable models: mixtures of Gaussians, Hidden Markov Models, and Aggregate Markov Models. In many latent variable models, performing inference (the E-step) is significantly more expensive than either the parameter updates (the M-step) or the extra overhead beyond the E-steps in the CG step of ECG. To compare the performance of the algorithms, we therefore simply compare the number of E-steps each algorithm executes until its convergence. We first show results on synthetic data sets, whose properties we can control, to verify certain aspects of our theoretical analysis. We also report empirical results on several real-world data sets, showing that our algorithms do work well in practice. Though we show examples of single runs, we have confirmed that the qualitative behavior of the convergence results presented in all experiments is the same for different initial parameter conditions. For all of the reported experiments, we used tol = 10⁻⁸ and τ = 0.5. We have not investigated carefully methods for setting the parameter τ, which controls the propensity of the algorithm to switch to ECG in favor of EM.

5.1. Synthetic Data Sets

First, consider a mixture of Gaussians (MoG) model. We considered two types of data sets: one in which the data is "well-separated" into five distinct clusters, with each cluster containing 400 points, and another "not-well-separated" case in which the five mixture components, each containing 400 points, overlap in one contiguous region. Figure 3 shows that
