Appendix A Expectation-Maximization (EM) Method

Expectation-Maximization (EM) is a method for deriving algorithms to maximize likelihood functions—using it with different likelihood functions results in different ML (or MAP, as the case may be) estimators. It is, on first encounter, a mysterious method with strong appeal to those who love to see elaborate mathematical constructions telescope into simple forms.

EM is reviewed in this appendix. The distinctions between the named random variables—incomplete, missing, and complete—are discussed, together with the auxiliary function called $Q$, and the E- and M-steps. It is shown that increasing the auxiliary function increases the likelihood function. This means that all EM based algorithms monotonically increase the data likelihood function $p_Z(z;\theta)$. Iterative majorization is a strictly numerical analysis method that interprets the auxiliary function $Q$ as a two parameter family of curves whose envelope is the data likelihood function.

EM works amazingly well in some applications and is seemingly useless in others. It is often a hand-in-glove fit for likelihood functions that involve sums (integrals), for in these cases the natural missing data are the summation indices (variables of integration) and they simplify the likelihood function to a product of the summands (integrands). Such products are often much simpler to manipulate. However, other missing data are possible, and different choices of missing data can lead to different EM convergence rates.

A.1 Formulation

The observed data $z$ are called the incomplete data. As is seen shortly, the name makes sense in the context of the EM method. The measurements $z$ are a realization of the random variable $Z$, and $Z$ takes values in the space $\mathcal{Z}$. The pdf of the data $z$ is specified parametrically by $p_Z(z;\theta)$, where $\theta \in \Theta$ is an unknown parameter vector. Here, $\Theta$ is the set of all valid parameter values.

The likelihood function of $\theta$ is the pdf $p_Z(z;\theta)$ thought of as a function of $\theta$ for a given $z$. It is assumed that the likelihood of the data for any $\theta \in \Theta$ is finite. It is also assumed that the likelihood function of $\theta$ is uniformly bounded above, that is, $p_Z(z;\theta) \le B$ for all $\theta \in \Theta$. The latter assumption is important.

The maximum likelihood estimate of $\theta$ is

$$
\hat\theta_{ML} = \arg\max_{\theta \in \Theta} p_Z(z;\theta) .
$$

For pdfs differentiable with respect to $\theta$, the natural way to compute $\hat\theta_{ML}$ is to solve the so-called necessary conditions

$$
\nabla_\theta \, p_Z(z;\theta) = 0
$$

using, say, a safeguarded Newton-Raphson algorithm. This approach can encounter difficulties in some nonlinear problems, especially when the dimension of $\theta$ is large.

The EM method is an alternative way to compute $\hat\theta_{ML}$. Let $K$ be a random variable whose realizations $k$ occur in the set $\mathcal{K}$, that is, $k \in \mathcal{K}$. For ease of exposition, the discussion in this appendix assumes that $K$ is a continuous variable. If $K$ is discrete, the discussion below is modified merely by replacing integrals over $\mathcal{K}$ by sums over $\mathcal{K}$.

Now, $K$ is called missing data in the sense of EM if

$$
p_Z(z;\theta) = \int_{\mathcal{K}} p_{ZK}(z,k;\theta)\,dk , \tag{A.1}
$$

where $p_{ZK}(z,k;\theta)$ is the joint pdf of $Z$ and $K$. The pair of variables $(Z,K)$ are called the complete data. In words, (A.1) says that the pdf of the data $Z$ is the marginal of the complete data pdf over the missing data $K$. The user of the EM method is at liberty to choose $K$ and the joint pdf $p_{ZK}(\,\cdot\,)$ as fancy or need dictates, as long as (A.1) holds. This is crucial. In many applications, however, the choice of missing data is very natural.

The pdf of $K$ conditioned on $Z$ is, by the definition of conditioning,

$$
p_{K|Z}(k \mid z;\theta) = \frac{p_{ZK}(z,k;\theta)}{p_Z(z;\theta)}
= \frac{p_{ZK}(z,k;\theta)}{\int_{\mathcal{K}} p_{ZK}(z,k;\theta)\,dk} . \tag{A.2}
$$

Let $n \ge 0$ denote the EM iteration index. Let $\theta^{(0)} \in \Theta$ be an initial (valid) value of the parameter.

A.1.1 E-step

For $n \ge 1$, the EM auxiliary function is defined by the conditional expectation:

$$
Q\big(\theta;\theta^{(n-1)}\big)
= E_{K|Z;\theta^{(n-1)}} \big[ \log p_{ZK}(z,k;\theta) \big]
= \int_{\mathcal{K}} \big( \log p_{ZK}(z,k;\theta) \big)\, p_{K|Z}\big(k \mid z;\theta^{(n-1)}\big)\,dk . \tag{A.3}
$$
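As a concrete illustration of the missing-data construction (not drawn from the text), consider a finite Gaussian mixture in which $K$ is a discrete label, $k \in \{1,\ldots,m\}$, and the complete-data pdf is $p_{ZK}(z,k;\theta) = w_k\, N(z;\mu_k,\sigma_k^2)$. Then (A.1) and (A.3) become

$$
p_Z(z;\theta) = \sum_{k=1}^{m} w_k\, N(z;\mu_k,\sigma_k^2),
\qquad
Q\big(\theta;\theta^{(n-1)}\big) = \sum_{k=1}^{m} p_{K|Z}\big(k \mid z;\theta^{(n-1)}\big)\,
\log\!\big[\, w_k\, N(z;\mu_k,\sigma_k^2) \,\big],
$$

so the logarithm acts on individual summands rather than on the sum, which is exactly the simplification mentioned in the introduction.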

The sole purpose of the E-step is to evaluate $Q\big(\theta;\theta^{(n-1)}\big)$. In practice, this is very often a purely symbolic step, that is, the expectation (A.3) is manipulated so that it takes a simple analytic form. When manipulation does not yield an analytic form, it is necessary to replace the expectation with a suitably chosen discrete sum (defined, e.g., via Monte Carlo integration).

A.1.2 M-step

The EM updated parameter is

$$
\theta^{(n)} = \arg\max_{\theta \in \Theta} Q\big(\theta;\theta^{(n-1)}\big) . \tag{A.4}
$$

The maximum in (A.4) can be computed by any available method. It is sometimes thought that it is necessary to solve the maximization step explicitly, but this is emphatically not so—a strictly numerical method is quite acceptable.

It is unnecessary to solve for the maximum in (A.4). Instead it is only necessary to find a value $\theta^{(n)}$ such that

$$
Q\big(\theta^{(n)};\theta^{(n-1)}\big) \;\ge\; Q\big(\theta^{(n-1)};\theta^{(n-1)}\big) . \tag{A.5}
$$

If the update $\theta^{(n)}$ is chosen in this way, the method is called the generalized EM (GEM) method. The GEM method is very useful in many problems. It is the starting point of other closely related EM-based methods. A prominent example is the SAGE (Space Alternating Generalized EM) algorithm [35].

A.1.3 Convergence

Under mild conditions, convergence is guaranteed to a critical point of the likelihood function, that is, $\theta^{(n)} \to \theta^{*}$ such that

$$
\nabla_\theta \, p_Z(z;\theta) \big|_{\theta = \theta^{*}} = 0 .
$$

Experience shows that in practice, $\theta^{*}$ is almost certainly a local maximum and not a saddle point, so that $\theta^{*} = \hat\theta_{ML}$.

One of the mild conditions that cannot be overlooked is the assumption that the likelihood function of $\theta$ is uniformly bounded above. If it is not, then the EM iteration will converge only if the initial value of $\theta$ is by chance in the domain of attraction of a point of local maximum likelihood. Otherwise, it will diverge, meaning that the likelihood function of the iterates will grow unbounded. Unboundedness would not be an issue were it not for the fact that the only valid values of $\theta$ have finite likelihoods, so the iterates—if they converge to a point—converge to an invalid parameter value, that is, to a point not in $\Theta$.

Estimation of heteroscedastic Gaussian sums is notorious for this behavior (due to covariance matrix collapse). See Section 3.4 for further discussion.

The details of the EM convergence proof are given in too many places to repeat all the details here. The book [80] gives a full and careful treatment of convergence, as well as examples where convergence fails in various ways.

For present purposes, it is enough to establish two facts. One is that if the update $\theta^{(n)}$ satisfies (A.5), then it also increases the likelihood function. The other is that a critical point of the auxiliary function is also a critical point of the likelihood function, and conversely.

To see that each EM step monotonically increases the likelihood function, recall that $\log x \le x - 1$ for all $x > 0$ with equality if and only if $x = 1$. Then, using only the above definitions,

$$
\begin{aligned}
0 &\le Q\big(\theta;\theta^{(n-1)}\big) - Q\big(\theta^{(n-1)};\theta^{(n-1)}\big) \\
&= \int_{\mathcal{K}} \big(\log p_{ZK}(z,k;\theta)\big)\, p_{K|Z}\big(k \mid z;\theta^{(n-1)}\big)\,dk
 - \int_{\mathcal{K}} \big(\log p_{ZK}\big(z,k;\theta^{(n-1)}\big)\big)\, p_{K|Z}\big(k \mid z;\theta^{(n-1)}\big)\,dk \\
&= \int_{\mathcal{K}} \left[ \log \frac{p_{ZK}(z,k;\theta)}{p_{ZK}\big(z,k;\theta^{(n-1)}\big)} \right] p_{K|Z}\big(k \mid z;\theta^{(n-1)}\big)\,dk \\
&\le \int_{\mathcal{K}} \left[ \frac{p_{ZK}(z,k;\theta)}{p_{ZK}\big(z,k;\theta^{(n-1)}\big)} - 1 \right]
 \frac{p_{ZK}\big(z,k;\theta^{(n-1)}\big)}{p_Z\big(z;\theta^{(n-1)}\big)}\,dk \\
&= \frac{1}{p_Z\big(z;\theta^{(n-1)}\big)} \int_{\mathcal{K}} \big[\, p_{ZK}(z,k;\theta) - p_{ZK}\big(z,k;\theta^{(n-1)}\big) \big]\,dk \\
&= \frac{1}{p_Z\big(z;\theta^{(n-1)}\big)} \big[\, p_Z(z;\theta) - p_Z\big(z;\theta^{(n-1)}\big) \big] .
\end{aligned}
$$

Clearly, any increase in $Q$ will increase the likelihood function.

To see that critical points of the likelihood function are critical points of the auxiliary function, and conversely, simply note that

$$
\begin{aligned}
0 = \nabla_{\theta_0}\, p_Z(z;\theta_0)
&= \nabla_{\theta_0} \int_{\mathcal{K}} p_{ZK}(z,k;\theta_0)\,dk \\
&= \int_{\mathcal{K}} p_{ZK}(z,k;\theta_0)\, \nabla_{\theta_0} \log p_{ZK}(z,k;\theta_0)\,dk \\
&= p_Z(z;\theta_0) \int_{\mathcal{K}} \frac{p_{ZK}(z,k;\theta_0)}{p_Z(z;\theta_0)}\, \nabla_{\theta_0} \log p_{ZK}(z,k;\theta_0)\,dk \\
&= p_Z(z;\theta_0) \int_{\mathcal{K}} p_{K|Z}(k \mid z;\theta_0)\, \big[ \nabla_\theta \log p_{ZK}(z,k;\theta) \big]_{\theta=\theta_0}\,dk \\
&= p_Z(z;\theta_0)\, \big[ \nabla_\theta\, Q(\theta;\theta_0) \big]_{\theta=\theta_0} .
\end{aligned}
$$

For further discussion and references to the literature, see [80].
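As a concrete numerical illustration of the E-step (A.3), the M-step (A.4), and the monotone likelihood increase just established, the following sketch runs EM on a two-component univariate Gaussian mixture with the mixture labels as the missing data. It is illustrative only; the synthetic data, initialization, and tolerances are arbitrary choices, not prescriptions from the text.

```python
# Minimal EM sketch for a two-component univariate Gaussian mixture.
# The mixture labels play the role of the missing data K: the E-step
# evaluates the responsibilities p_{K|Z} and the M-step maximizes Q in
# closed form.  The observed-data log-likelihood is tracked to confirm
# the monotone increase established above.
import numpy as np

rng = np.random.default_rng(0)

def gauss(z, mu, var):
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Synthetic incomplete data z drawn from a two-component mixture.
z = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial parameter value theta^(0) = (weights, means, variances).
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_loglik = -np.inf
for n in range(200):
    # E-step: responsibilities p_{K|Z}(k | z_i; theta^(n-1)), Eq. (A.2).
    joint = w * gauss(z[:, None], mu, var)          # p_{ZK}(z_i, k; theta)
    p_z = joint.sum(axis=1, keepdims=True)          # p_Z(z_i; theta), Eq. (A.1)
    resp = joint / p_z

    # M-step: closed-form maximizer of Q(theta; theta^(n-1)), Eq. (A.4).
    nk = resp.sum(axis=0)
    w = nk / len(z)
    mu = (resp * z[:, None]).sum(axis=0) / nk
    var = (resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk

    # Monotonicity check: the data log-likelihood never decreases.
    loglik = np.log(p_z).sum()
    assert loglik >= prev_loglik - 1e-9
    if loglik - prev_loglik < 1e-8:
        break
    prev_loglik = loglik

print("weights", w, "means", mu, "variances", var)
```

Because the complete-data log-likelihood separates over the labels, the M-step has the familiar closed form above; a GEM variant would merely require each update to satisfy (A.5) rather than to maximize exactly.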

A.2 Iterative Majorization

The widely referenced paper [23] is the synthesis of several earlier discoveries of EM in a statistical setting. However, EM is not essentially statistical in character, but is rather only one member of a larger class of strictly numerical methods called iterative majorization [19–21]. This connection is not often mentioned in the literature, but the insight was noticed almost immediately after the publication of [23].

The observed data likelihood function $p_Z(z;\theta)$ is the envelope of a two parameter family of functions. This family is defined in the EM method by the auxiliary function, $Q(\theta;\phi)$. As seen from the results in [23], the data likelihood function majorizes every function in this family. For each specified parameter $\phi$, the function $Q(\theta;\phi)$ is tangent to $p_Z(z;\theta)$ at the point $\phi$. This situation is depicted in Fig. A.1 for the sequence $\phi = \theta_0, \theta_1, \theta_2, \ldots$. It is now intuitively clear that EM based algorithms monotonically increase the data likelihood function, and that they converge with high probability to a local maximum of the likelihood function $p_Z(z;\theta)$ that depends on the starting point.

[Fig. A.1 Iterative majorization interpretation of the observed (incomplete) data likelihood function as the envelope of the EM auxiliary function, $Q(\theta,\phi)$]
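The tangency depicted in Fig. A.1 can be written out explicitly. The identity below is a standard restatement of the construction in Section A.1, not a formula quoted from the text: taking the expectation of $\log p_{ZK}(z,k;\theta) = \log p_Z(z;\theta) + \log p_{K|Z}(k \mid z;\theta)$ with respect to $p_{K|Z}(\,\cdot \mid z;\phi)$ and applying Gibbs' inequality gives

$$
\log p_Z(z;\theta)
= Q(\theta;\phi) - \int_{\mathcal{K}} \log p_{K|Z}(k \mid z;\theta)\, p_{K|Z}(k \mid z;\phi)\,dk
\;\ge\; Q(\theta;\phi) - \int_{\mathcal{K}} \log p_{K|Z}(k \mid z;\phi)\, p_{K|Z}(k \mid z;\phi)\,dk ,
$$

with equality when $\theta = \phi$. Up to an additive constant that does not depend on $\theta$, the curve $Q(\,\cdot\,;\phi)$ therefore lies below the log-likelihood and touches it at $\phi$, which is the majorization picture shown in the figure.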

It is very often observed in practice that EM based algorithms make large strides toward the solution in the early iterations, but that progress toward the solution slows significantly as the iteration progresses. The iterative majorization interpretation of EM provides an intuitive insight into both phenomena. See [80] for further discussion and, in particular, for a proof that the rate of convergence is ultimately only linear. This convergence behavior explains the many efforts proposed in the literature to speed up EM convergence.

A.3 Observed Information

The observed information matrix (OIM) corresponding to the ML estimate $\hat\theta_{ML}$ is defined by

$$
\mathrm{OIM}\big(\hat\theta_{ML}\big) = -\,\nabla_\theta \big( \nabla_\theta \log p_Z(z;\theta) \big)^{T} \Big|_{\theta = \hat\theta_{ML}} . \tag{A.6}
$$

The matrix $\mathrm{OIM}(\theta)$ is an OIM only if $\theta$ is an ML estimate. Evaluating the OIM is often a relatively straightforward procedure. Moreover, EM based ML estimation algorithms are easily adapted to compute the OIM as a by-product of the EM iteration at very little additional computational expense. See [69] for details.

The OIM and the Fisher information matrix (FIM) differ in important ways:

- The FIM computes the expectation of the negative Hessian of the loglikelihood function with respect to the data, while the OIM does not.
- The OIM is evaluated at the ML estimate $\hat\theta_{ML}$, while the FIM is evaluated at the true value of the parameter.

The expected value of the OIM is sometimes said to be the FIM, but this is not precisely correct.

The OIM and FIM are alike in that they both evaluate the negative Hessian of the loglikelihood function. Because of this and other similarities between them, the OIM is often used as a surrogate for the FIM when the FIM is unknown or otherwise unavailable. See [26] for further discussion of the OIM, its origins, and potential utility.
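When the log-likelihood is available as a callable function of $\theta$, definition (A.6) can also be approximated directly by finite differences. The sketch below is illustrative only: `loglik`, `theta_hat`, the step size, and the example model are placeholder choices, and it is not the EM-based OIM computation of [69].

```python
# Numerical evaluation of the observed information matrix (A.6): the
# negative Hessian of the log-likelihood at the ML estimate, approximated
# by central differences.  `loglik` and `theta_hat` are supplied by the
# user; the step size is an illustrative choice.
import numpy as np

def observed_information(loglik, theta_hat, h=1e-4):
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = theta_hat.size
    oim = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            tpp = theta_hat.copy(); tpp[i] += h; tpp[j] += h
            tpm = theta_hat.copy(); tpm[i] += h; tpm[j] -= h
            tmp = theta_hat.copy(); tmp[i] -= h; tmp[j] += h
            tmm = theta_hat.copy(); tmm[i] -= h; tmm[j] -= h
            hess_ij = (loglik(tpp) - loglik(tpm) - loglik(tmp) + loglik(tmm)) / (4 * h * h)
            oim[i, j] = -hess_ij           # OIM is the negative Hessian
    return 0.5 * (oim + oim.T)             # symmetrize round-off error

# Example: scalar Gaussian model with theta = (mu, log sigma^2).
z = np.array([0.3, -0.1, 0.7, 1.2, 0.4])
def loglik(theta):
    mu, logvar = theta
    var = np.exp(logvar)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

theta_hat = np.array([z.mean(), np.log(z.var())])   # ML estimate for this toy model
print(observed_information(loglik, theta_hat))
```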

Appendix B Solving Conditional Mean Equations

A conditional mean equation used in intensity estimation is discussed in this appendix. The equation is seen to be monotone, and monotonicity implies that the equation is uniquely solvable.

The conditional mean equation of interest is the vector equation (3.8), which is repeated here:

$$
\frac{\int_R s\, N(s;\mu,\Sigma)\,ds}{\int_R N(s;\mu,\Sigma)\,ds} = \bar x , \tag{B.1}
$$

where

$$
\bar x = \frac{1}{m} \sum_{j=1}^{m} x_j \in \mathbb{R}^{n_x} .
$$

The $j$-th component of $\bar x$ is denoted by $\bar x_j$, which should not be confused with the data point $x_j \in \mathbb{R}^{n_x}$. The solution to (B.1) is unique and straightforward to compute numerically for rectangular multidimensional regions $R = [a_1,b_1] \times \cdots \times [a_{n_x},b_{n_x}]$ and diagonal covariance matrices $\Sigma = \mathrm{Diag}\big(\sigma_1^2, \ldots, \sigma_{n_x}^2\big)$.

The conditional mean of a univariate Gaussian distributed random variable conditioned on realizations in the interval $[-1,1]$ is defined by

$$
M\big[\mu,\sigma^2\big] = \frac{\int_{-1}^{1} s\, N(s;\mu,\sigma^2)\,ds}{\int_{-1}^{1} N(s;\mu,\sigma^2)\,ds} , \tag{B.2}
$$

where $\mu \in \mathbb{R}$ and $\sigma > 0$.

Thus,

$$
\lim_{\mu \to -\infty} M\big[\mu,\sigma^2\big] = -1
\qquad \text{and} \qquad
\lim_{\mu \to +\infty} M\big[\mu,\sigma^2\big] = 1 .
$$

The most important fact about the function $M$ is that it is strictly monotone increasing as a function of $\mu$. Consequently, for any number $c$ such that $-1 < c < 1$, and any variance $\sigma^2$, the solution $\mu$ of the general equation

$$
M\big[\mu,\sigma^2\big] = c \tag{B.3}
$$

exists and is unique. To see this, it is only necessary to verify that the derivative $\frac{\partial}{\partial\mu} M\big[\mu,\sigma^2\big] > 0$ for all $\mu$. The inequality is intuitively obvious from its definition as a conditional mean. The function $M\big[\mu,\sigma^2\big]$ is plotted for several values of $\sigma$ in Fig. B.1. Evaluating the inverse function $M^{-1}\big[\,\cdot\,,\sigma^2\big]$ efficiently is left as an exercise.

[Fig. B.1 Plots of $M[\mu,\sigma^2]$ from $\sigma = 0.1$ to $\sigma = 1$ in steps of 0.1. The monotonicity of $M[\mu,\sigma^2]$ is self evident. The steepness of the transition from $-1$ to $1$ increases steadily with decreasing $\sigma$]

The conditional mean of the $j$-th component of (B.1) is

$$
\frac{\int_{a_1}^{b_1} \cdots \int_{a_{n_x}}^{b_{n_x}} s_j \prod_{i=1}^{n_x} N\big(s_i;\mu_i,\sigma_i^2\big)\,ds_1 \cdots ds_{n_x}}
     {\int_{a_1}^{b_1} \cdots \int_{a_{n_x}}^{b_{n_x}} \prod_{i=1}^{n_x} N\big(s_i;\mu_i,\sigma_i^2\big)\,ds_1 \cdots ds_{n_x}}
= \bar x_j . \tag{B.4}
$$

The integrals over all variables except $s_j$ cancel, so (B.4) simplifies to

$$
\frac{\int_{a_j}^{b_j} s_j\, N\big(s_j;\mu_j,\sigma_j^2\big)\,ds_j}
     {\int_{a_j}^{b_j} N\big(s_j;\mu_j,\sigma_j^2\big)\,ds_j}
= \bar x_j . \tag{B.5}
$$

Substituting

$$
s_j = \frac{b_j - a_j}{2}\, x + \frac{a_j + b_j}{2}
$$

into (B.5) and using the function (B.2) gives

$$
M\!\left[ \frac{2}{b_j - a_j}\left( \mu_j - \frac{a_j + b_j}{2} \right),
          \left( \frac{2\sigma_j}{b_j - a_j} \right)^{2} \right]
= \frac{2}{b_j - a_j}\left( \bar x_j - \frac{a_j + b_j}{2} \right) . \tag{B.6}
$$

Solving

$$
M\!\left[ \tilde\mu_j,\; \left( \frac{2\sigma_j}{b_j - a_j} \right)^{2} \right]
= \frac{2}{b_j - a_j}\left( \bar x_j - \frac{a_j + b_j}{2} \right) \tag{B.7}
$$

for $\tilde\mu_j$ gives the solution of (B.6) as

$$
\mu_j = \frac{b_j - a_j}{2}\, \tilde\mu_j + \frac{a_j + b_j}{2} .
$$

The expression (B.4) holds with obvious modifications for more general regions $R$. However, the multidimensional integral over all the variables except $s_j$ is a function of $s_j$ in general and does not cancel from the ratio as it does in (B.5).
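The monotonicity of $M$ makes (B.7) straightforward to solve numerically, for instance by bracketing the root and bisecting. The sketch below is one way to do this, using the standard closed form for the mean of a truncated Gaussian; the interval, $\sigma_j$, and sample mean in the example are illustrative, and $\bar x_j$ is assumed to lie strictly inside $(a_j, b_j)$ so that the right side of (B.7) lies in $(-1, 1)$.

```python
# Sketch of the component-wise solution (B.5)-(B.7): M[mu, sigma^2] is the
# mean of an N(mu, sigma^2) variable truncated to [-1, 1]; by monotonicity
# in mu the scalar equation M[mu_tilde, sigma_tilde^2] = c has a unique
# root, found here by bracketing followed by Brent's method.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def M(mu, sigma):
    """Conditional mean (B.2) of N(mu, sigma^2) restricted to [-1, 1]."""
    alpha, beta = (-1.0 - mu) / sigma, (1.0 - mu) / sigma
    zden = norm.cdf(beta) - norm.cdf(alpha)
    return mu + sigma * (norm.pdf(alpha) - norm.pdf(beta)) / zden

def solve_component(a_j, b_j, sigma_j, xbar_j):
    """Solve (B.6) for mu_j; requires a_j < xbar_j < b_j."""
    sigma_t = 2.0 * sigma_j / (b_j - a_j)                    # rescaled sigma in (B.6)
    c = 2.0 / (b_j - a_j) * (xbar_j - 0.5 * (a_j + b_j))     # right side of (B.7)
    # Bracket the root: M is strictly increasing with limits -1 and +1.
    lo, hi = -1.0, 1.0
    while M(lo, sigma_t) > c:
        lo *= 2.0
    while M(hi, sigma_t) < c:
        hi *= 2.0
    mu_tilde = brentq(lambda m: M(m, sigma_t) - c, lo, hi)
    return 0.5 * (b_j - a_j) * mu_tilde + 0.5 * (a_j + b_j)  # back-substitution

# Illustrative use: region [0, 4], sigma_j = 0.8, sample mean 2.9.
print(solve_component(0.0, 4.0, 0.8, 2.9))
```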

Appendix C Bayesian Filtering

A brief review of general Bayesian filtering is given in this appendix. The discussion sets the conceptual and notational foundation for Bayesian filtering on PPP event spaces, all without mentioning PPPs until the very end. Gentler presentations that readers may find helpful are widely available (e.g., [4, 54, 104, 122]).

The notation used in this appendix is used in Section 6.1 and also in the alternative derivation of Appendix D.

C.1 General Recursion

Two sequences of random variables are involved: the sequence $\Xi_0, \Xi_1, \ldots, \Xi_k$ models target motion and $\Upsilon_1, \ldots, \Upsilon_k$ models data. These sequences correspond to measurement (or scan) times $t_0, t_1, \ldots, t_k$, where $t_{j-1} < t_j$ for $j = 1, \ldots, k$. The only conditional dependencies between these variables are the traditional ones: the sequence $\{\Xi_j\}$ is Markov, and the conditional variables $\{\Upsilon_j \mid \Xi_j\}$ are independent.

The governing equation for Bayesian tracking is the following joint pdf of the track $(x_0, x_1, \ldots, x_k)$ and the measurements $(z_1, \ldots, z_k)$:

$$
p_{\Xi_0,\Xi_1,\ldots,\Xi_k,\Upsilon_1,\ldots,\Upsilon_k}(x_0, x_1, \ldots, x_k, z_1, \ldots, z_k)
= p_{\Xi_0}(x_0) \prod_{j=1}^{k} p_{\Xi_j \mid \Xi_{j-1}}(x_j \mid x_{j-1})\, p_{\Upsilon_j \mid \Xi_j}(z_j \mid x_j) . \tag{C.1}
$$

The product form in (C.1) is due to the conditioning.

The Bayesian filter is the posterior pdf of $\Xi_k$ conditioned on all data up to and including time $t_k$. Conditioning and marginalizing gives

$$
\begin{aligned}
p_{\Xi_0,\ldots,\Xi_k \mid \Upsilon_1,\ldots,\Upsilon_k}(x_0, \ldots, x_k \mid z_1, \ldots, z_k)
&= \frac{p_{\Xi_0,\ldots,\Xi_k,\Upsilon_1,\ldots,\Upsilon_k}(x_0, \ldots, x_k, z_1, \ldots, z_k)}
        {p_{\Upsilon_1,\ldots,\Upsilon_k}(z_1, \ldots, z_k)} \\
&= \frac{p_{\Xi_0}(x_0) \prod_{j=1}^{k} p_{\Xi_j \mid \Xi_{j-1}}(x_j \mid x_{j-1})\, p_{\Upsilon_j \mid \Xi_j}(z_j \mid x_j)}
        {\int_S \cdots \int_S p_{\Xi_0,\ldots,\Xi_k,\Upsilon_1,\ldots,\Upsilon_k}(x_0, \ldots, x_k, z_1, \ldots, z_k)\, dx_0 \cdots dx_k} .
\end{aligned}
$$

The posterior pdf is the integral over all states except $x_k$:

$$
p_{\Xi_k \mid \Upsilon_1,\ldots,\Upsilon_k}(x_k \mid z_1, \ldots, z_k)
= \int_S \cdots \int_S p_{\Xi_0,\ldots,\Xi_k \mid \Upsilon_1,\ldots,\Upsilon_k}(x_0, \ldots, x_k \mid z_1, \ldots, z_k)\, dx_0 \cdots dx_{k-1} .
$$

The multidimensional integral is evaluated recursively.

Let $\Xi_{k-1 \mid k-1}$, $\Xi_{k \mid k-1}$, and $\Upsilon_{k \mid k-1}$ denote the random variables $\Xi_{k-1}$, $\Xi_k$, and $\Upsilon_k$ conditioned on $\Upsilon_1, \ldots, \Upsilon_{k-1}$, and let $p_{k-1 \mid k-1}(x_{k-1})$, $p_{k \mid k-1}(x_k)$, and $\pi_{k \mid k-1}(z_k)$ be their pdfs. Initialize the recursion by setting $p_{0 \mid 0}(x_0) = p_{\Xi_0}(x_0)$. The pdf of $\Xi_{k \mid k}$ is, by Bayes' Theorem,

$$
p_{k \mid k}(x_k) = p_{\Xi_k \mid \Upsilon_1,\ldots,\Upsilon_k}(x_k \mid z_1, \ldots, z_k)
= p_{\Upsilon_k \mid \Xi_k}(z_k \mid x_k)\,
  \frac{p_{\Xi_k \mid \Upsilon_1,\ldots,\Upsilon_{k-1}}(x_k \mid z_1, \ldots, z_{k-1})}
       {p_{\Upsilon_k \mid \Upsilon_1,\ldots,\Upsilon_{k-1}}(z_k \mid z_1, \ldots, z_{k-1})} . \tag{C.2}
$$

The numerator of (C.2) is the predicted target pdf:

$$
\begin{aligned}
p_{k \mid k-1}(x_k) &= p_{\Xi_k \mid \Upsilon_1,\ldots,\Upsilon_{k-1}}(x_k \mid z_1, \ldots, z_{k-1}) \\
&= \int_S p_{\Xi_k \mid \Xi_{k-1}}(x_k \mid x_{k-1})\, p_{\Xi_{k-1} \mid \Upsilon_1,\ldots,\Upsilon_{k-1}}(x_{k-1} \mid z_1, \ldots, z_{k-1})\, dx_{k-1} \\
&= \int_S p_{\Xi_k \mid \Xi_{k-1}}(x_k \mid x_{k-1})\, p_{k-1 \mid k-1}(x_{k-1})\, dx_{k-1} .
\end{aligned}
\tag{C.3}
$$

The denominator is the pdf of the measurement $z_k$ given that it is generated by the target with pdf $p_{k \mid k-1}(x_k)$:

$$
\begin{aligned}
\pi_{k \mid k-1}(z_k) &= p_{\Upsilon_k \mid \Upsilon_1,\ldots,\Upsilon_{k-1}}(z_k \mid z_1, \ldots, z_{k-1}) \\
&= \int_S p_{\Upsilon_k \mid \Xi_k}(z_k \mid x_k)\, p_{\Xi_k \mid \Upsilon_1,\ldots,\Upsilon_{k-1}}(x_k \mid z_1, \ldots, z_{k-1})\, dx_k \\
&= \int_S p_{\Upsilon_k \mid \Xi_k}(z_k \mid x_k)\, p_{k \mid k-1}(x_k)\, dx_k .
\end{aligned}
\tag{C.4}
$$

Substituting (C.3) and (C.4) into (C.2) gives

$$
p_{k \mid k}(x_k) = \frac{p_{\Upsilon_k \mid \Xi_k}(z_k \mid x_k)\, p_{k \mid k-1}(x_k)}{\pi_{k \mid k-1}(z_k)} . \tag{C.5}
$$

The forward recursion is defined by (C.3)–(C.5).

The pdf $\pi_{k \mid k-1}(z_k)$ is called the partition function in physics and in the machine learning community. In many tracking applications, it is often simply dismissed as a scale factor; however, its form is important in PPP applications to tracking. Since its form is known before the measurement $z_k$ is available, it is called the predicted measurement pdf here.
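On a discretized scalar state space, the recursion (C.3)–(C.5) can be run directly by replacing the integrals over $S$ with sums over grid cells. The sketch below does exactly that; the random-walk transition density, Gaussian likelihood, prior, and measurement sequence are placeholder choices for illustration, not models taken from the text.

```python
# Grid-based realization of the Bayes recursion (C.3)-(C.5) for a scalar
# state.  The integrals over S become sums over grid cells weighted by ds.
import numpy as np

ds = 0.1
grid = np.arange(-10.0, 10.0, ds)                     # discretized state space S

def gauss(u, var):
    return np.exp(-0.5 * u ** 2 / var) / np.sqrt(2 * np.pi * var)

trans = gauss(grid[:, None] - grid[None, :], 0.5)     # p_{Xi_j | Xi_{j-1}}(x | x')
meas_var = 1.0                                        # p_{Y_j | Xi_j}(z | x) = N(z; x, 1)

p = gauss(grid - 0.0, 4.0)                            # p_{0|0}: prior on Xi_0
p /= p.sum() * ds

for z_k in [1.2, 1.9, 2.4, 3.1]:                      # measurement sequence
    # Prediction step (C.3): Chapman-Kolmogorov integral as a matrix product.
    p_pred = (trans * p[None, :]).sum(axis=1) * ds
    # Predicted measurement pdf (C.4): the normalizing "partition function".
    like = gauss(z_k - grid, meas_var)                # p_{Y_k | Xi_k}(z_k | x_k)
    pi_pred = (like * p_pred).sum() * ds
    # Correction step (C.5): Bayes' theorem.
    p = like * p_pred / pi_pred

print("posterior mean:", (grid * p).sum() * ds)
```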

C.2 Special Case: Kalman Filtering

The classic example of Bayesian filters is the Kalman filter, that is, the single target Bayes-Markov filter with additive Gaussian noise and no clutter. In this case $\Xi_j$ represents the state of a single target, the state space is $S = \mathbb{R}^{n_x}$, $n_x \ge 1$, and $\Upsilon_j$ represents a data point in the event space $T = \mathbb{R}^{n_z}$, $n_z \ge 1$. The model is usually written in an algebraic form as

$$
x_j = F_{j-1}(x_{j-1}) + v_{j-1} \tag{C.6}
$$

and

$$
z_j = H_j(x_j) + w_j , \tag{C.7}
$$

for $j = 1, \ldots, k$, where $x_j$ is a realization of the Markovian state variable $\Xi_j$, and $\{z_j\}$ is a realization of the conditional variables $\Upsilon_j \mid \Xi_j = x_j$. The process noises $v_{j-1}$ and measurement noises $w_j$ in (C.6) and (C.7) are zero mean, Gaussian and independent with covariance matrices $Q_{j-1}$ and $R_j$, respectively. The functions $F_{j-1}(\cdot)$ and $H_j(\cdot)$ are known. The pdf of $\Xi_0$ is $N(x_0;\bar x_0, P_0)$, where $\bar x_0$ and $P_0$ are given. The equivalent system of pdfs is

$$
\begin{aligned}
p_{\Xi_0}(x_0) &= N(x_0;\bar x_0, P_0) && \text{(C.8a)} \\
p_{\Xi_j \mid \Xi_{j-1}}(x_j \mid x_{j-1}) &= N\big(x_j; F_{j-1}(x_{j-1}), Q_{j-1}\big) && \text{(C.8b)} \\
p_{\Upsilon_j \mid \Xi_j}(z_j \mid x_j) &= N\big(z_j; H_j(x_j), R_j\big) . && \text{(C.8c)}
\end{aligned}
$$

The joint pdf is found by substituting (C.8a)–(C.8c) into (C.1). The recursion (C.3)–(C.5) gives the posterior pdf on $S = \mathbb{R}^{n_x}$.

The linear Gaussian Kalman filter assumes that

$$
\begin{aligned}
F_{j-1}(x_{j-1}) &= F_{j-1}\, x_{j-1} && \text{(C.9a)} \\
H_j(x_j) &= H_j\, x_j , && \text{(C.9b)}
\end{aligned}
$$

where $F_{j-1}$ is an $n_x \times n_x$ matrix called the system matrix, and $H_j$ is $n_z \times n_x$ and is called the measurement matrix. The Kalman filter is, in this instance, a recursion that evaluates the parameters of the posterior pdf

$$
p_{k \mid k}(x_k) = N\big(x_k; \hat x_{k \mid k}, P_{k \mid k}\big) , \tag{C.10}
$$

where $\hat x_{k \mid k}$ is the point estimate of the target at time $t_k$ and $P_{k \mid k}$ is the associated error covariance matrix. Explicitly, for $j = 0, \ldots, k-1$,

$$
\begin{aligned}
P_{j+1 \mid j} &= F_j P_{j \mid j} F_j^{T} + Q_j && \text{(C.11a)} \\
W_{j+1} &= P_{j+1 \mid j} H_{j+1}^{T} \big[ H_{j+1} P_{j+1 \mid j} H_{j+1}^{T} + R_{j+1} \big]^{-1} && \text{(C.11b)} \\
P_{j+1 \mid j+1} &= P_{j+1 \mid j} - W_{j+1} H_{j+1} P_{j+1 \mid j} && \text{(C.11c)} \\
\hat x_{j+1 \mid j+1} &= F_j \hat x_{j \mid j} + W_{j+1} \big( z_{j+1} - H_{j+1} F_j \hat x_{j \mid j} \big) . && \text{(C.11d)}
\end{aligned}
$$

These equations are not necessarily good for numerical purposes, especially when observability is an issue. In practice, the information form of the Kalman filter is always preferable.

The Kalman filter is often written in terms of measurement innovations when the measurement and target motion models are linear. The predicted target state at time $t_j$ is

$$
\hat x_{j \mid j-1} = F_{j-1}\, \hat x_{j-1 \mid j-1} , \qquad j = 1, \ldots, k , \tag{C.12}
$$

and the predicted measurement is

$$
\hat z_{j \mid j-1} = H_j\, \hat x_{j \mid j-1} , \qquad j = 1, \ldots, k . \tag{C.13}
$$
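The recursion (C.11a)–(C.11d) translates directly into a few lines of linear algebra. The sketch below is illustrative only: the constant-velocity system matrix, noise covariances, initial condition, and measurement sequence are placeholder choices, and no attempt is made at the numerically preferred information or square-root forms mentioned above.

```python
# Direct implementation of the linear Kalman recursion (C.11a)-(C.11d).
import numpy as np

def kalman_step(x_hat, P, z_next, F, H, Q, R):
    """One step j -> j+1 of (C.11a)-(C.11d)."""
    P_pred = F @ P @ F.T + Q                                  # (C.11a)
    S = H @ P_pred @ H.T + R
    W = P_pred @ H.T @ np.linalg.inv(S)                       # (C.11b) gain
    P_new = P_pred - W @ H @ P_pred                           # (C.11c)
    x_pred = F @ x_hat                                        # predicted state, cf. (C.12)
    x_new = x_pred + W @ (z_next - H @ x_pred)                # (C.11d)
    return x_new, P_new

# Illustrative scalar-position / constant-velocity setup.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])       # system matrix F_j
H = np.array([[1.0, 0.0]])                  # measurement matrix H_j
Q = 0.01 * np.eye(2)                        # process noise covariance Q_j
R = np.array([[0.25]])                      # measurement noise covariance R_{j+1}

x_hat = np.array([0.0, 0.0])                # x_hat_{0|0}
P = np.eye(2)                               # P_{0|0}
for z in [0.9, 2.1, 2.8, 4.2]:              # measurements z_1, z_2, ...
    x_hat, P = kalman_step(x_hat, P, np.array([z]), F, H, Q, R)

print("state estimate:", x_hat)
```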
