An Explanation of the Expectation Maximization Algorithm


Technical report from Automatic Control at Linköpings universitet

An Explanation of the Expectation Maximization Algorithm

Thomas B. Schön
Division of Automatic Control
E-mail: schon@isy.liu.se
21st August 2009
Report no.: LiTH-ISY-R-2915

Address:
Department of Electrical Engineering
Linköpings universitetet
SE-581 83 Linköping, Sweden
WWW: http://www.control.isy.liu.se

Technical reports from the Automatic Control group in Linköping are available from http://www.control.isy.liu.se/publications.

Abstract

The expectation maximization (EM) algorithm computes maximum likelihood estimates of unknown parameters in probabilistic models involving latent variables. More pragmatically speaking, the EM algorithm is an iterative method that alternates between computing a conditional expectation and solving a maximization problem, hence the name expectation maximization. We will in this work derive the EM algorithm and show that it provides a maximum likelihood estimate. The aim of the work is to show how the EM algorithm can be used in the context of dynamic systems and we will provide a worked example showing how the EM algorithm can be used to solve a simple system identification problem.

Keywords: Expectation Maximization, system identification, Maximum likelihood, latent variables, probabilistic models.

1 Introduction

The expectation maximization (EM) algorithm computes maximum likelihood (ML) estimates of unknown parameters θ in probabilistic models involving latent variables Z.¹ An instructive way of thinking about EM is to think of it as a systematic way of separating one hard problem into two new closely linked problems, each of which is hopefully more tractable than the original problem. This problem separation forms the very heart of the EM algorithm.

More pragmatically speaking, the EM algorithm is an iterative method that alternates between computing a conditional expectation and solving a maximization problem, hence the name expectation maximization. To thoroughly appreciate the EM algorithm, it is important to understand why the above-mentioned problem separation indeed results in an ML estimate. This will be explained in detail below.

The motivation for this work is to provide a basic introduction to the EM algorithm within the setting of dynamic systems. More specifically, the main focus is to explain how the EM algorithm can be used for estimating models of dynamic systems, i.e., system identification. That is, besides the general introduction and derivation of the EM algorithm given in Section 2, we will see how it can be used to identify parameters in a simple, but still very instructive, case in Section 3. In Section 4 we provide some brief insights into the, by now rather large, literature surrounding the EM algorithm and its applications, and in Section 5 we provide the conclusions. Finally, in the Appendix we give some of the details skipped in the main text and the Matlab code for the example discussed in Section 3.

¹ The term latent variable is adopted from statistics and refers to a variable that is not directly observed. Hence, a latent variable has to be inferred (through a mathematical model) from other variables that are directly observed, i.e., measured. Latent variables are sometimes also referred to as hidden variables or unobserved variables, and within the EM literature they are sometimes called the missing data or the incomplete data.

2 A Formal Derivation of the EM Algorithm

In order to derive the EM algorithm in Section 2.2 we must first clearly define the ML problem, which is the topic of Section 2.1.

2.1 Maximum Likelihood Estimation

The maximum likelihood method, which was introduced by Fisher (1912, 1922), is based on the rather natural idea that the unknown parameters should be chosen in such a way that the observed measurements become as likely as possible. More specifically, the ML estimate is computed according to

\hat{\theta}_{\mathrm{ML}} \triangleq \arg\max_{\theta} \, p_{\theta}(y_1, \ldots, y_N),    (1)

where y_t denotes the measurement at time t. Furthermore, the subindex θ indicates that the corresponding probability density function p_θ(y_1, ..., y_N) is parameterised by the (unknown) parameter θ. The joint density of the observations p_θ(y_1, ..., y_N) can, using the definition of conditional probabilities, be written as

p_{\theta}(y_1, \ldots, y_N) = p_{\theta}(y_1) \prod_{t=2}^{N} p_{\theta}(y_t \mid Y_{t-1}),    (2)

where Y_{t-1} ≜ {y_1, ..., y_{t-1}}. It is often convenient to consider the so called log-likelihood function

L_{\theta}(Y) \triangleq \log p_{\theta}(y_1, \ldots, y_N) = \sum_{t=2}^{N} \log p_{\theta}(y_t \mid Y_{t-1}) + \log p_{\theta}(y_1),    (3)

rather than the likelihood function. In the interest of a more compact notation we have here introduced the notation Y ≜ {y_1, ..., y_N}. The logarithm is a strictly increasing function, implying that the following problem is equivalent to (1)

\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \, \sum_{t=2}^{N} \log p_{\theta}(y_t \mid Y_{t-1}) + \log p_{\theta}(y_1).    (4)

This problem can of course be solved using standard methods such as Newton's method or one of its related variants, see e.g., Dennis and Schnabel (1983); Nocedal and Wright (2006) for details on these methods. However, the ML problem can also be solved using the expectation maximization algorithm, an approach that has steadily gained in popularity since its formal birth in 1977 (Dempster et al., 1977).
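To make (1)-(4) concrete, here is a small toy illustration (not taken from the report): the ML estimate of the mean of i.i.d. Gaussian measurements, obtained by maximizing the log-likelihood (3) numerically. For this model the answer is known in closed form (the sample mean), which makes the numerical solution easy to check.

```matlab
% Toy illustration of (1)-(4), not from the report: ML estimation of the
% mean theta of i.i.d. measurements y_t ~ N(theta, 1).
rng(0);
N = 500; thetaTrue = 2;
y = thetaTrue + randn(N, 1);

% Log-likelihood (3); for i.i.d. data the conditional densities in (2)
% reduce to the marginal density of each measurement.
L = @(theta) sum(-0.5*log(2*pi) - 0.5*(y - theta).^2);

thetaML = fminsearch(@(theta) -L(theta), 0);   % numerical solution of (4)
% For this model the ML estimate coincides with the sample mean, mean(y).
```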

2.2 Expectation Maximization

The strategy underlying the EM algorithm is to separate the original ML problem (4) into two linked problems, each of which is hopefully easier to solve than the original problem. Abstractly speaking, this separation is accomplished by exploiting the structure inherent in the probabilistic model. This will hopefully be made clear below.

The key idea is to consider the joint log-likelihood function of both the observed variables Y and the latent variables Z,

L_{\theta}(Z, Y) = \log p_{\theta}(Z, Y),    (5)

and then assume that the latent variables Z were available to us. In order to understand why this makes sense, let us start with an example.

Example 2.1 (Identifying linear state-space models)
Consider the following linear state-space model

\begin{bmatrix} x_{t+1} \\ y_t \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} v_t \\ e_t \end{bmatrix},    (6)

where the noise processes v_t and e_t are assumed to be i.i.d.,

\begin{bmatrix} v_t \\ e_t \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q & S \\ S^{T} & R \end{bmatrix} \right).    (7)

The latent variables are for this problem given by the states, i.e., Z = X ≜ {x_1, ..., x_{N+1}}. The problem is now to identify the parameters in the fully parameterized A, B, C and D matrices, based on the measured input u_t and output y_t signals. If we, according to the key idea mentioned above, consider the joint log-likelihood log p_θ(X, Y) and assume that the latent variables X are known, the problem breaks down to a linear regression problem,

\hat{\theta} = \arg\max_{\theta} \; -\sum_{t=1}^{N} \left\| \begin{bmatrix} x_{t+1} \\ y_t \end{bmatrix} - \begin{bmatrix} A(\theta) & B(\theta) \\ C(\theta) & D(\theta) \end{bmatrix} \begin{bmatrix} x_t \\ u_t \end{bmatrix} \right\|_{\Gamma^{-1}}^{2}, \qquad \Gamma = \begin{bmatrix} Q & S \\ S^{T} & R \end{bmatrix},    (8)

which straightforwardly allows for a closed form solution. The problem is of course that the latent variables X are not known. An idea, allowing us to still use the above approach, would then be to use the available observations in order to find the best possible estimate of the latent variables. This estimate can then be used in solving (8) and hopefully this results in something that is meaningful. As we will soon see, the intuition used above is in fact in close agreement with theory.
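To connect Example 2.1 to something executable, the closed-form solve of the regression problem (8) for known states can be sketched as follows. This is a minimal sketch with illustrative variable names and stand-in data; it is not the code from the report's appendix.

```matlab
% Closed-form solution of the regression problem (8) when the states are
% assumed known. Illustrative sketch, not the report's appendix code.
n = 2; m = 1; p = 1; N = 100;                        % toy dimensions
X = randn(n, N+1); U = randn(m, N); Y = randn(p, N); % stand-in state/input/output data

Zlhs = [X(:, 2:N+1); Y];                 % stacked left-hand sides [x_{t+1}; y_t]
Phi  = [X(:, 1:N); U];                   % stacked regressors      [x_t; u_t]
Theta = (Zlhs*Phi') / (Phi*Phi');        % least-squares estimate of [A B; C D]

A = Theta(1:n, 1:n);     B = Theta(1:n, n+1:end);
C = Theta(n+1:end, 1:n); D = Theta(n+1:end, n+1:end);
```

For a fully parameterised [A B; C D] the weighting Γ^{-1} in (8) does not change the optimum, which is why a plain least-squares solve is sufficient here.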

Let θ_k denote the estimate of the parameter θ from the k-th iteration of the algorithm. The problem separation mentioned above is now obtained by integrating (10) w.r.t. p_{θ_k}(Z | Y), resulting in

\log p_{\theta}(Y) = \int \log p_{\theta}(Z, Y)\, p_{\theta_k}(Z \mid Y)\, dZ - \int \log p_{\theta}(Z \mid Y)\, p_{\theta_k}(Z \mid Y)\, dZ = \underbrace{\mathbf{E}_{\theta_k}\{\log p_{\theta}(Z, Y) \mid Y\}}_{\triangleq\, \mathcal{Q}(\theta, \theta_k)} - \underbrace{\mathbf{E}_{\theta_k}\{\log p_{\theta}(Z \mid Y) \mid Y\}}_{\triangleq\, \mathcal{V}(\theta, \theta_k)}.    (11)

In the above equation we have used the fact that

\int \log p_{\theta}(Y)\, p_{\theta_k}(Z \mid Y)\, dZ = \log p_{\theta}(Y).    (12)

It is worth noticing that the latent variables are here assumed to be continuous. However, there is nothing that prevents us from deriving the EM algorithm for discrete latent variables; the only difference is that the integrals in (11) will be replaced by summations.

Let us now study the difference between the log-likelihood function L_θ(Y) evaluated at two different values θ and θ_k,

L_{\theta}(Y) - L_{\theta_k}(Y) = \big( \mathcal{Q}(\theta, \theta_k) - \mathcal{Q}(\theta_k, \theta_k) \big) + \big( \mathcal{V}(\theta_k, \theta_k) - \mathcal{V}(\theta, \theta_k) \big),    (13)

where we have made use of the definitions in (11). It is now interesting to consider V(θ_k, θ_k) − V(θ, θ_k) in more detail. Straightforward application of the definition of V(θ, θ_k) provided in (11) results in

\mathcal{V}(\theta_k, \theta_k) - \mathcal{V}(\theta, \theta_k) = \int \log \frac{p_{\theta_k}(Z \mid Y)}{p_{\theta}(Z \mid Y)}\, p_{\theta_k}(Z \mid Y)\, dZ = -\mathbf{E}_{\theta_k}\!\left\{ \log \frac{p_{\theta}(Z \mid Y)}{p_{\theta_k}(Z \mid Y)} \,\Big|\, Y \right\}.    (14)

Note that this means that V(θ_k, θ_k) − V(θ, θ_k) is the Kullback-Leibler information distance (Kullback and Leibler, 1951) between p_{θ_k}(Z | Y) and p_θ(Z | Y). Furthermore, the negative logarithm is a convex function, which implies that Jensen's inequality² can be used to establish

\mathbf{E}_{\theta_k}\!\left\{ \log \frac{p_{\theta}(Z \mid Y)}{p_{\theta_k}(Z \mid Y)} \,\Big|\, Y \right\} \le \log \mathbf{E}_{\theta_k}\!\left\{ \frac{p_{\theta}(Z \mid Y)}{p_{\theta_k}(Z \mid Y)} \,\Big|\, Y \right\} = \log \int p_{\theta}(Z \mid Y)\, dZ = 0,    (16)

which effectively proves that

\mathcal{V}(\theta_k, \theta_k) - \mathcal{V}(\theta, \theta_k) \ge 0.    (17)

Hence, if we make use of this fact in (13) and choose a new parameter θ such that Q(θ, θ_k) ≥ Q(θ_k, θ_k), we have in fact also increased the likelihood, or left it unchanged,

\mathcal{Q}(\theta, \theta_k) \ge \mathcal{Q}(\theta_k, \theta_k) \;\Rightarrow\; L_{\theta}(Y) \ge L_{\theta_k}(Y).    (18)

² Jensen's inequality states that if f is a convex function then

\mathbf{E}\{f(x)\} \ge f(\mathbf{E}\{x\}),    (15)

provided that both expectations exist.
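As a small numerical aside (not part of the report), the non-negativity established in (14)-(17) is just the non-negativity of a Kullback-Leibler divergence, which can be checked directly for two stand-in Gaussian densities:

```matlab
% Numerical sanity check (not from the report) of (14)-(17): the difference
% V(theta_k, theta_k) - V(theta, theta_k) is a Kullback-Leibler divergence
% and hence non-negative. The two posteriors are replaced by stand-in Gaussians.
rng(0);
gaussPdf = @(z, m, s) exp(-(z - m).^2 ./ (2*s^2)) ./ (sqrt(2*pi)*s);
m1 = 0;   s1 = 1.0;   % stand-in for p_{theta_k}(Z | Y)
m2 = 0.5; s2 = 1.3;   % stand-in for p_theta(Z | Y)

z = m1 + s1*randn(1e6, 1);                                        % samples from p_{theta_k}
KLmc    = mean(log(gaussPdf(z, m1, s1) ./ gaussPdf(z, m2, s2)));  % Monte Carlo estimate
KLexact = log(s2/s1) + (s1^2 + (m1 - m2)^2)/(2*s2^2) - 0.5;       % closed form for Gaussians
[KLmc, KLexact]   % both non-negative, in agreement with (17)
```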

The EM algorithm now suggests itself: if we start by computing Q(θ, θ_k) according to its definition in (11), this function can then be maximized with respect to θ in order to obtain a new estimate θ_{k+1}. According to the above analysis, this new estimate will indeed produce a higher, or at least the same, likelihood as the previous estimate θ_k. This procedure is then repeated until convergence, which is summarised in the algorithm below. It is important to note that convergence is only guaranteed to a local maximum (a stationary point) of the likelihood.

Algorithm 2.1 (Expectation Maximization)

1. Set k = 0 and initialize θ_0 such that L_{θ_0}(Y) is finite.

2. Expectation (E) step: Compute

\mathcal{Q}(\theta, \theta_k) = \mathbf{E}_{\theta_k}\{\log p_{\theta}(Z, Y) \mid Y\} = \int \log p_{\theta}(Z, Y)\, p_{\theta_k}(Z \mid Y)\, dZ.    (19)

3. Maximization (M) step: Compute

\theta_{k+1} = \arg\max_{\theta} \mathcal{Q}(\theta, \theta_k).    (20)

4. If not converged, update k := k + 1 and return to step 2.

There are several ways in which the convergence check in step 4 of the above algorithm can be performed. One common way is to simply monitor the value of the log-likelihood and say that the algorithm has converged whenever the increase falls below a certain threshold ε_L > 0 (a typical default value is ε_L = 10^{-6}), i.e.,

L_{\theta_{k+1}}(Y) - L_{\theta_k}(Y) < \varepsilon_L.    (21)

Another way to check for convergence is to monitor the change in the parameter value between two consecutive iterations and state that the algorithm has converged when

\| \theta_{k+1} - \theta_k \|_2 < \varepsilon_P,    (22)

where ε_P > 0 is some suitably chosen threshold.
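Before specialising to dynamic systems, the following minimal runnable example (not from the report) shows Algorithm 2.1, including the convergence test (21), on a toy latent-variable problem: estimating the mean of Gaussian data when a subset of the samples is unobserved, so that the missing samples play the role of Z.

```matlab
% Minimal runnable instance of Algorithm 2.1 (illustrative, not from the report):
% ML estimation of the mean theta of y_t ~ N(theta, 1) when some samples are
% unobserved; the unobserved samples act as the latent variables Z.
rng(1);
N = 200; thetaTrue = 3;
y    = thetaTrue + randn(N, 1);
obs  = rand(N, 1) < 0.7;                 % roughly 70 % of the samples observed
yObs = y(obs); nMiss = N - numel(yObs);

theta = 0; Lold = -inf; epsL = 1e-6;     % step 1: initialisation
for k = 1:1000
    % E step (19): E_{theta_k}{z | Y} = theta_k for each unobserved sample, so
    % Q(theta, theta_k) = -0.5*sum((yObs - theta).^2)
    %                     - 0.5*nMiss*((theta_k - theta)^2 + 1) + const.
    % M step (20): maximizing this quadratic Q in closed form gives
    thetaNew = (sum(yObs) + nMiss*theta)/N;
    % Convergence test (21) on the observed-data log-likelihood.
    L = sum(-0.5*log(2*pi) - 0.5*(yObs - thetaNew).^2);
    if L - Lold < epsL
        theta = thetaNew;
        break
    end
    theta = thetaNew;
    Lold  = L;
end
theta   % converges to mean(yObs), the ML estimate for this model
```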

3 An Illustrative Example

Rather than providing a general solution to the problem of identifying linear state-space models, we will solve the simplest problem we can think of. The motivation for this is simply that it facilitates understanding. The general case is then a generalisation of this; see (Gibson and Ninness, 2005) for details concerning the fully parameterised case.

Let us consider the following scalar linear state-space model,

x_{t+1} = \theta x_t + v_t,    (23a)
y_t = \tfrac{1}{2} x_t + e_t,    (23b)

where the noise processes are Gaussian according to

\begin{bmatrix} v_t \\ e_t \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix} \right).    (23c)

For simplicity we assume that the initial state is fully known, according to x_1 = 0. Finally, the true θ-parameter is given by

\theta^{\star} = 0.9.    (23d)

The identification problem is now to determine the parameter θ on the basis of the observations Y = {y_1, ..., y_N}, using the EM algorithm introduced in the previous section. The first thing to do is to identify the latent variables Z. Inspired by Example 2.1 we realise that the states X ≜ {x_1, ..., x_{N+1}} play the role of latent variables in this problem, due to the fact that if the states were known we could find the parameter simply by solving a linear regression problem.

In the rest of this section we will provide all the details necessary to devise a working EM algorithm to identify the parameter θ in (23). More specifically, the E and the M steps are presented in Sections 3.1 and 3.2, respectively. Together, this results in the EM algorithm detailed in Section 3.3. Finally, numerical experiments are provided in Section 3.4.

3.1 The Expectation (E) Step

The expectation (E) step of the algorithm consists in computing the following quantity

\mathcal{Q}(\theta, \theta_k) \triangleq \mathbf{E}_{\theta_k}\{\log p_{\theta}(X, Y) \mid Y\} = \int \log p_{\theta}(X, Y)\, p_{\theta_k}(X \mid Y)\, dX.    (24)

Let us start by finding an expression for log p_θ(X, Y) when the data is generated by the probabilistic model given in (23). We have

p_{\theta}(X, Y) = p_{\theta}(x_{N+1}, X_N, y_N, Y_{N-1}) = p_{\theta}(x_{N+1}, y_N \mid X_N, Y_{N-1})\, p_{\theta}(X_N, Y_{N-1}),    (25)

where we have used the definition of conditional probabilities in the second equality. According to the Markov property³ inherent in (23) we have

p_{\theta}(x_{N+1}, y_N \mid X_N, Y_{N-1}) = p_{\theta}(x_{N+1}, y_N \mid x_N),    (26)

³ A discrete-time stochastic process {x_t} is said to possess the Markov property if

p(x_{t+1} \mid x_1, \ldots, x_t) = p(x_{t+1} \mid x_t).    (27)

implying that (25) can be written as

p_{\theta}(X, Y) = p_{\theta}(x_{N+1}, y_N \mid x_N)\, p_{\theta}(X_N, Y_{N-1}).    (28)

Repeated use of the above ideas straightforwardly yields

p_{\theta}(X, Y) = p_{\theta}(x_1) \prod_{t=1}^{N} p_{\theta}(x_{t+1}, y_t \mid x_t).    (29)

From (23) we have

p_{\theta}\!\left( \begin{bmatrix} x_{t+1} \\ y_t \end{bmatrix} \,\Big|\, x_t \right) = \mathcal{N}\!\left( \begin{bmatrix} x_{t+1} \\ y_t \end{bmatrix}; \begin{bmatrix} \theta \\ 1/2 \end{bmatrix} x_t, \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix} \right),    (30)

which inserted in (29) results in the following expression for log p_θ(X, Y),

\log p_{\theta}(X, Y) = \log p_{\theta}(x_1) + \sum_{t=1}^{N} \log \mathcal{N}\!\left( \begin{bmatrix} x_{t+1} \\ y_t \end{bmatrix}; \begin{bmatrix} \theta \\ 1/2 \end{bmatrix} x_t, \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix} \right).    (31)

Inserting the expression for the normal density function, and using the fact that in the current example we, for simplicity, assumed that the initial state was known, we obtain

\log p_{\theta}(X, Y) = \sum_{t=1}^{N} \log \frac{1}{2\pi\sqrt{0.01}} - \frac{1}{2} \sum_{t=1}^{N} w_t,    (32a)

where the exponent w_t is given by

w_t = \begin{bmatrix} x_{t+1} - \theta x_t \\ y_t - \tfrac{1}{2} x_t \end{bmatrix}^{T} \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix}^{-1} \begin{bmatrix} x_{t+1} - \theta x_t \\ y_t - \tfrac{1}{2} x_t \end{bmatrix}.    (32b)

The expression (32) can be simplified, resulting in

\log p_{\theta}(X, Y) = -\sum_{t=1}^{N} x_t^2 \, \theta^2 + 2 \sum_{t=1}^{N} x_t x_{t+1} \, \theta,    (33)

where terms independent of θ, as well as a positive scale factor, have been neglected, since they, similar to x_1, will not affect the resulting optimization problem. Recall that we are interested in the Q-function, which was defined in (24). Now, inserting (33) into (24) results in

\mathcal{Q}(\theta, \theta_k) = -\mathbf{E}_{\theta_k}\!\left\{ \sum_{t=1}^{N} x_t^2 \,\Big|\, Y \right\} \theta^2 + 2\, \mathbf{E}_{\theta_k}\!\left\{ \sum_{t=1}^{N} x_t x_{t+1} \,\Big|\, Y \right\} \theta = -\varphi \theta^2 + 2 \psi \theta,    (34)

where we have defined

\varphi \triangleq \sum_{t=1}^{N} \mathbf{E}_{\theta_k}\{x_t^2 \mid Y\}, \qquad \psi \triangleq \sum_{t=1}^{N} \mathbf{E}_{\theta_k}\{x_t x_{t+1} \mid Y\}.    (35)

The expected values used to compute ϕ in (35) are explicitly provided by the Rauch-Tung-Striebel (RTS) smoother (Rauch et al., 1965). Furthermore, the expected values used to compute ψ in (35) are provided by an extension to the RTS formulas. All the details are provided in Theorem 1 below.

Theorem 1 Consider the following linear state-space model

x_{t+1} = A x_t + B u_t + v_t,    (36a)
y_t = C x_t + D u_t + e_t,    (36b)

where the noise processes are Gaussian according to

\begin{bmatrix} v_t \\ e_t \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q & S \\ S^{T} & R \end{bmatrix} \right)    (36c)

and the initial state is distributed according to x_1 ~ N(µ, P_1). Let the parameter vector θ be composed of A, B, C, D, Q, R, S, P_1 and µ. Then

\mathbf{E}_{\theta_k}\{x_t x_t^{T} \mid Y\} = \hat{x}_{t|N} \hat{x}_{t|N}^{T} + P_{t|N},    (37a)
\mathbf{E}_{\theta_k}\{x_{t+1} x_t^{T} \mid Y\} = \hat{x}_{t+1|N} \hat{x}_{t|N}^{T} + M_{t+1|N},    (37b)
\mathbf{E}_{\theta_k}\{y_t x_t^{T} \mid Y\} = y_t \hat{x}_{t|N}^{T},    (37c)

where

\hat{x}_{t|N} = \hat{x}_{t|t} + J_t \left( \hat{x}_{t+1|N} - \bar{A} \hat{x}_{t|t} - \bar{B} u_t - S R^{-1} y_t \right),    (38a)
P_{t|N} = P_{t|t} + J_t \left( P_{t+1|N} - P_{t+1|t} \right) J_t^{T},    (38b)
J_t = P_{t|t} \bar{A}^{T} P_{t+1|t}^{-1},    (38c)

for t = N, ..., 1, and

M_{t|N} = P_{t|t} J_{t-1}^{T} + J_t \left( M_{t+1|N} - \bar{A} P_{t|t} \right) J_{t-1}^{T},    (39a)

for t = N - 1, ..., 1, where M_{N|N} is initialized according to

M_{N|N} = (I - K_N C) \bar{A} P_{N-1|N-1}.    (39b)

Finally, the quantities \hat{x}_{t|t}, P_{t|t} and P_{t+1|t} are provided by the Kalman filter for the system given by

\bar{A} = A - S R^{-1} C,    (40a)
\bar{B} = B - S R^{-1} D,    (40b)
\bar{Q} = Q - S R^{-1} S^{T}.    (40c)

Proof 1 We refrain from giving the proof here; instead we refer to Shumway and Stoffer (2006).
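For the scalar model (23) there is no cross term between the noise sources (S = 0) and no input, so Ā = θ, B̄ = 0 and Q̄ = Q, and the Kalman filter runs directly on (23). Spelling out how Theorem 1 feeds into (35), a step left implicit in the main text, (37a) and (37b) give

\varphi = \sum_{t=1}^{N} \left( \hat{x}_{t|N}^2 + P_{t|N} \right), \qquad \psi = \sum_{t=1}^{N} \left( \hat{x}_{t|N}\, \hat{x}_{t+1|N} + M_{t+1|N} \right),

which is precisely what the M step below needs.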

3.2 The Maximization (M) Step

According to (20), the maximization (M) step amounts to solving the following problem

\theta_{k+1} = \arg\max_{\theta} \mathcal{Q}(\theta, \theta_k).    (41)

In the previous section we derived the following expression for Q(θ, θ_k),

\mathcal{Q}(\theta, \theta_k) = -\varphi \theta^2 + 2 \psi \theta,    (42)

where ϕ and ψ are defined in (35). Hence, the M step simply amounts to solving the following quadratic problem,

\theta_{k+1} = \arg\max_{\theta} \, -\varphi \theta^2 + 2 \psi \theta,    (43)

which results in

\theta_{k+1} = \frac{\psi}{\varphi}.    (44)

3.3 Explicit EM Algorithm

The final algorithm is now obtained simply by inserting the results derived in Section 3.1 and Section 3.2 into the general EM algorithm provided in Algorithm 2.1. This results in the algorithm provided below.

Algorithm 3.1 (Expectation Maximization for (23))

1. Set k = 0 and initialize θ_0.

2. Expectation (E) step: Compute

\mathcal{Q}(\theta, \theta_k) = -\mathbf{E}_{\theta_k}\!\left\{ \sum_{t=1}^{N} x_t^2 \,\Big|\, Y \right\} \theta^2 + 2\, \mathbf{E}_{\theta_k}\!\left\{ \sum_{t=1}^{N} x_t x_{t+1} \,\Big|\, Y \right\} \theta,    (45)

where the involved expected values are computed according to Theorem 1.

3. Maximization (M) step: Find the next iterate according to

\theta_{k+1} = \frac{\mathbf{E}_{\theta_k}\!\left\{ \sum_{t=1}^{N} x_t x_{t+1} \mid Y \right\}}{\mathbf{E}_{\theta_k}\!\left\{ \sum_{t=1}^{N} x_t^2 \mid Y \right\}}.    (46)

4. If L_{\theta_{k+1}}(Y) - L_{\theta_k}(Y) > 10^{-6}, update k := k + 1 and return to step 2, otherwise terminate.

All the details of the above algorithm are now accounted for, save for how to compute the log-likelihood function L_θ(Y), which is needed for the convergence check. In the interest of a self-contained presentation we provide a derivation of an explicit expression for L_θ(Y) in Appendix A, and in Appendix B the Matlab code is available.
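A minimal, self-contained Matlab sketch of Algorithm 3.1, written directly from Sections 3.1-3.2, is given below; it is not the author's Appendix B code. The smoothed cross-moment E_{θ_k}{x_t x_{t+1} | Y} is obtained via the identity Cov(x_t, x_{t+1} | Y) = J_t P_{t+1|N}, an equivalent alternative to the M_{t|N} recursion of Theorem 1, and the log-likelihood needed for the convergence test (21) is accumulated from the Kalman filter innovations, as in (3).

```matlab
% Minimal sketch of Algorithm 3.1 for the model (23); illustrative only,
% not the report's Appendix B implementation.
rng(0);
N = 10000; thetaTrue = 0.9; C = 1/2; Q = 0.1; R = 0.1;

% Simulate data from (23), with x_1 = 0 known.
x = zeros(N+1,1); y = zeros(N,1);
for t = 1:N
    y(t)   = C*x(t) + sqrt(R)*randn;
    x(t+1) = thetaTrue*x(t) + sqrt(Q)*randn;
end

theta = 0.1;                                 % initial guess theta_0
Lold  = -inf;
for k = 1:1000
    % E step: Kalman filter for the current theta, accumulating the
    % log-likelihood (3) from the innovations.
    xp = zeros(N+1,1); Pp = zeros(N+1,1);    % x_{t|t-1}, P_{t|t-1} (x_1 known)
    xf = zeros(N,1);   Pf = zeros(N,1);      % x_{t|t},   P_{t|t}
    L = 0;
    for t = 1:N
        Sy    = C*Pp(t)*C + R;               % innovation variance
        K     = Pp(t)*C/Sy;                  % Kalman gain
        xf(t) = xp(t) + K*(y(t) - C*xp(t));
        Pf(t) = (1 - K*C)*Pp(t);
        L     = L - 0.5*(log(2*pi*Sy) + (y(t) - C*xp(t))^2/Sy);
        xp(t+1) = theta*xf(t);               % time update
        Pp(t+1) = theta^2*Pf(t) + Q;
    end
    % RTS smoother (38), run backwards from t = N.
    xs = zeros(N+1,1); Ps = zeros(N+1,1); J = zeros(N,1);
    xs(N+1) = xp(N+1); Ps(N+1) = Pp(N+1);
    for t = N:-1:1
        J(t)  = Pf(t)*theta/Pp(t+1);
        xs(t) = xf(t) + J(t)*(xs(t+1) - xp(t+1));
        Ps(t) = Pf(t) + J(t)^2*(Ps(t+1) - Pp(t+1));
    end
    phi = sum(xs(1:N).^2 + Ps(1:N));                 % phi in (35)
    psi = sum(xs(1:N).*xs(2:N+1) + J.*Ps(2:N+1));    % psi in (35), via J_t*P_{t+1|N}
    % M step (46) and convergence test (21).
    theta = psi/phi;
    if L - Lold < 1e-6
        break
    end
    Lold = L;
end
theta   % should be close to thetaTrue = 0.9 for large N
```

For large N the resulting estimate lands close to θ⋆ = 0.9, consistent with the experiments reported in the next section.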

3.4 Numerical Experiments

In the previous sections we derived an EM algorithm for estimating the unknown parameter θ in (23a). We know that the maximum likelihood method is asymptotically consistent, and since (23) is a linear model we expect the estimate to converge to the true value as the number of samples N tends to infinity. Before we show that this is indeed the case for Algorithm 3.1, let us explain the experimental conditions. The experiment is made by conducting seven Monte Carlo studies, each using 1000 realisations of data Y. The only difference between the studies is that the number of samples N used is different for each study. All information concerning the model is provided in (23), and the initial guess used for θ is chosen as θ_0 = 0.1 for all realisations.

Table 1: The table shows the EM estimate provided by Algorithm 3.1 for seven different N. More specifically, θ̂ is the result of Monte Carlo studies, each using 1000 realisations of data. Recall that the true parameter value is θ⋆ = 0.9.

    N        θ̂
    ...      ...
    ...      0.8988
    5000     0.8996
    10000    0.8998

The result is provided in Table 1, where the EM estimate θ̂ is shown for the different numbers of samples N used. As expected, the EM estimate approaches the true value θ⋆ as the number of samples N increases. In order to further illustrate how the EM algorithm works we have provided plots of the log-likelihood L_θ(Y) and the Q-function in Figure 1. In the expectation (E) step, that is step 2, an expression for the Q-function is computed according to the details provided in Section 3.1 and Appendix A.2. The result of this computation is provided by the black curve in Figure 1. In the subsequent step, the maximization (M) step of the EM algorithm, the task is to find the θ that maximizes the Q-function. The result of this is illustrated by a vertical line in the figure. The grey curve in Figure 1 is the corresponding log-likelihood, computed according to Appendix A.1.

4 Bibliography

To the best of the author's knowledge, the earliest account of an EM type algorithm is Newcomb (1886), who used it for estimating a mixture model. Since then there have been
