8. Mixture Models And Expectation-Maximization

Computer Vision Group, Prof. Daniel Cremers
8. Mixture Models and Expectation-Maximization
(Machine Learning for Computer Vision, Dr. Rudolph Triebel, Computer Vision Group)

Motivation
- Often the introduction of latent (unobserved) random variables into a model helps to express complex (marginal) distributions.
- A very common example are mixture models, in particular Gaussian mixture models (GMMs).
- Mixture models can be used for clustering (unsupervised learning) and to express more complex probability distributions.
- As we will see, the parameters of mixture models can be estimated with maximum-likelihood techniques such as expectation-maximization.

K-means Clustering
- Given: a data set \{x_1, \dots, x_N\} and the number of clusters K.
- Goal: find cluster centers \{\mu_1, \dots, \mu_K\} so that
  J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2
  is minimal, where r_{nk} = 1 if x_n is assigned to \mu_k (and 0 otherwise).
- Idea: compute r_{nk} and \mu_k iteratively:
  - Start with some values for the cluster centers
  - Find optimal assignments r_{nk}
  - Update the cluster centers using these assignments
  - Repeat until assignments or centers don't change

K-means Clustering
Initialize the cluster means \{\mu_1, \dots, \mu_K\}.

K-means Clustering
Find optimal assignments:
r_{nk} = 1 if k = \arg\min_j \|x_n - \mu_j\|^2, and 0 otherwise.

K-means Clustering
Find new optimal means:
\frac{\partial J}{\partial \mu_k} = 2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) \overset{!}{=} 0
\quad\Rightarrow\quad
\mu_k = \frac{\sum_{n=1}^{N} r_{nk} \, x_n}{\sum_{n=1}^{N} r_{nk}}

K-means Clustering
Find new optimal assignments:
r_{nk} = 1 if k = \arg\min_j \|x_n - \mu_j\|^2, and 0 otherwise.

K-means Clustering
Iterate these steps until the means and assignments do not change any more (see the code sketch below).
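The full loop can be written down compactly. The following is a minimal NumPy sketch of the procedure described above (random initialization from data points, hard assignments, mean updates); function and variable names are illustrative, not from the slides.

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """K-means sketch. X: (N, D) data matrix; returns means (K, D) and assignments (N,)."""
    rng = np.random.default_rng(rng)
    # Initialize the cluster means with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = None
    for _ in range(n_iters):
        # Assignment step: r_nk = 1 for the closest center, 0 otherwise
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments did not change -> converged
        assign = new_assign
        # Update step: each mean becomes the average of its assigned points
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:  # guard against empty clusters
                mu[k] = members.mean(axis=0)
    return mu, assign
```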

2D Example
- Real data set
- Random initialization
- The magenta line is the "decision boundary"

The Cost Function
- Every step reduces the cost function J (each step minimizes J with respect to either the assignments or the means).
- Blue steps: update assignments
- Red steps: update means
- Convergence after 4 rounds

K-means for Segmentation
(figure)

K-Means: Additional Remarks
- K-means always converges, but the minimum it reaches is not guaranteed to be a global one.
- There is an online version of K-means: after each new point x_n, the nearest center \mu_k is updated as
  \mu_k^{\text{new}} = \mu_k^{\text{old}} + \eta_n (x_n - \mu_k^{\text{old}}),
  where \eta_n is a learning rate (see the code sketch below).
- The K-medoid variant replaces the squared Euclidean distance by a general dissimilarity measure \mathcal{V}:
  J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \mathcal{V}(x_n, \mu_k)
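As a small illustration of the online update above, here is a sketch of a single-point update; the learning rate eta and the function name are assumptions made for the example.

```python
import numpy as np

def online_kmeans_update(mu, x_n, eta=0.05):
    """mu: (K, D) current centers; x_n: (D,) newly arrived point; eta: learning rate."""
    k = np.argmin(np.linalg.norm(mu - x_n, axis=1))  # index of the nearest center
    mu[k] += eta * (x_n - mu[k])                     # mu_k_new = mu_k_old + eta * (x_n - mu_k_old)
    return mu
```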

Mixtures of Gaussians
- Assume that the data consists of K clusters and that the data within each cluster is Gaussian.
- For each data point x we introduce a K-dimensional binary random variable z so that
  p(x) = \sum_{k=1}^{K} \underbrace{p(z_k = 1)}_{=: \pi_k} \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
  where z_k \in \{0, 1\} and \sum_{k=1}^{K} z_k = 1 (see the evaluation sketch below).
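The mixture density above can be evaluated directly by summing the weighted component densities. A minimal sketch using SciPy (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k); pis: (K,), mus: (K, D), Sigmas: (K, D, D)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))
```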

A Simple Example
- Mixture of three Gaussians with mixing coefficients
- Left: all three Gaussians as a contour plot
- Right: samples from the mixture model; the red component has the most samples

Parameter Estimation
- From a given set of training data \{x_1, \dots, x_N\} we want to find parameters (\pi_{1,\dots,K}, \mu_{1,\dots,K}, \Sigma_{1,\dots,K}) so that the likelihood is maximized (MLE):
  p(x_1, \dots, x_N \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
  or, applying the logarithm:
  \log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
  (a numerically stable computation is sketched below).
- However: this is not as easy as maximum likelihood for single Gaussians!
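The log-likelihood above involves the log of a sum, which is conveniently computed with a log-sum-exp for numerical stability. A sketch (not code from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Returns sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k) for X of shape (N, D)."""
    # log_probs[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_probs = np.stack(
        [np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
         for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
    return logsumexp(log_probs, axis=1).sum()
```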

Problems with MLE for Gaussian Mixtures
- Assume that for one k the mean \mu_k lies exactly at a data point x_n.
- For simplicity, assume that \Sigma_k = \sigma_k^2 I. Then:
  \mathcal{N}(x_n \mid x_n, \sigma_k^2 I) = \frac{1}{(2\pi)^{D/2} \, \sigma_k^{D}}
- This means that the overall log-likelihood can be made arbitrarily large by letting \sigma_k \to 0 (overfitting; illustrated numerically below).
- Another problem is identifiability: the order of the Gaussians is not fixed, so there are K! equivalent solutions to the MLE problem.
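The singularity is easy to see numerically: the density of a Gaussian evaluated at its own mean grows without bound as the standard deviation shrinks. A tiny illustration (the values are just examples):

```python
from scipy.stats import multivariate_normal

# density of N(x | x, sigma^2 I) in 2D, i.e. 1 / (2 pi sigma^2), for shrinking sigma
for sigma in [1.0, 0.1, 0.01, 0.001]:
    print(sigma, multivariate_normal.pdf([0.0, 0.0], mean=[0.0, 0.0], cov=sigma ** 2))
```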

Overfitting with MLE for Gaussian Mixtures
- One Gaussian fits exactly to one data point.
- It has a very small variance, i.e. it contributes strongly to the overall likelihood.
- In standard MLE, there is no way to avoid this!

Expectation-Maximization
- EM is an elegant and powerful method for MLE problems with latent variables.
- Main idea: model parameters and latent variables are estimated iteratively, where we average over the latent variables (expectation).
- A typical example application of EM is the Gaussian mixture model (GMM), but EM has many other applications.
- First, we consider EM for GMMs.

Expectation-Maximization for GMM
- First, we define the responsibilities:
  \gamma(z_{nk}) := p(z_{nk} = 1 \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)},
  where z_{nk} \in \{0, 1\} and \sum_k z_{nk} = 1.
- Next, we take the derivative of the log-likelihood with respect to \mu_k and set it to zero:
  \frac{\partial}{\partial \mu_k} \log p(X \mid \pi, \mu, \Sigma) \overset{!}{=} 0
  and we obtain:
  \mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}
  (see the code sketch below).
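In code, the responsibilities and the resulting mean update look as follows; a sketch computed in log-space for stability (the helper names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    log_probs = np.stack(
        [np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
         for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
    return np.exp(log_probs - logsumexp(log_probs, axis=1, keepdims=True))

def update_means(X, gamma):
    """mu_k = sum_n gamma[n, k] x_n / sum_n gamma[n, k]."""
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```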

Expectation-Maximization for GMM
- We can do the same for the covariances:
  \frac{\partial}{\partial \Sigma_k} \log p(X \mid \pi, \mu, \Sigma) \overset{!}{=} 0
  and we obtain:
  \Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}
- Finally, we take the derivative with respect to the mixing coefficients \pi_k, subject to \sum_{k=1}^{K} \pi_k = 1:
  \frac{\partial}{\partial \pi_k} \log p(X \mid \pi, \mu, \Sigma) \overset{!}{=} 0
  and the result is:
  \pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})

Algorithm Summary
1. Initialize the means \mu_k, covariance matrices \Sigma_k, and mixing coefficients \pi_k.
2. Compute the initial log-likelihood \log p(X \mid \pi, \mu, \Sigma).
3. E-step: compute the responsibilities
   \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
4. M-step: update the parameters
   \mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}, \quad
   \Sigma_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T}{\sum_{n=1}^{N} \gamma(z_{nk})}, \quad
   \pi_k^{\text{new}} = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})
5. Compute the log-likelihood; if not converged, go to step 3.
(A compact code sketch of this loop follows below.)
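Putting the steps together, the following is a compact sketch of the whole loop (plain NumPy/SciPy, with no safeguards against the covariance singularities discussed earlier; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, K, n_iters=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    N, D = X.shape
    # 1. Initialization: random means, identity covariances, uniform mixing coefficients
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.stack([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 3. E-step: responsibilities gamma[n, k]
        log_probs = np.stack(
            [np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
             for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
        log_norm = logsumexp(log_probs, axis=1, keepdims=True)
        gamma = np.exp(log_probs - log_norm)
        # 4. M-step: update means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)                      # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
        # 2./5. Log-likelihood and convergence check
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, gamma
```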

The Same Example Again
(figure)

Observations
- Compared to K-means, points can now belong to several clusters (soft assignment).
- In addition to the cluster centers, a covariance is estimated by EM.
- Initialization is the same as used for K-means.
- The number of iterations needed for EM is much higher, and each cycle requires much more computation.
- Therefore: start with K-means and run EM on the result of K-means (covariances can be initialized to the sample covariances of the K-means clusters), as sketched below.
- EM only finds a local maximum of the likelihood!
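A sketch of this initialization strategy, assuming the kmeans helper from the earlier sketch; the function name and the requirement of at least two points per cluster are assumptions.

```python
import numpy as np

def init_from_kmeans(X, K, rng=None):
    """Seed EM with a K-means result: means, per-cluster sample covariances, cluster fractions."""
    mus, assign = kmeans(X, K, rng=rng)  # kmeans(): hypothetical helper from the earlier sketch
    pis = np.array([(assign == k).mean() for k in range(K)])
    Sigmas = np.stack([np.cov(X[assign == k], rowvar=False)  # needs >= 2 points per cluster
                       for k in range(K)])
    return pis, mus, Sigmas
```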

A More General View of EM
- Assume for a moment that we observe X and the binary latent variables Z. The likelihood is then:
  p(X, Z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} p(z_n) \, p(x_n \mid z_n, \mu, \Sigma)
  where
  p(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}
  and
  p(x_n \mid z_n, \mu, \Sigma) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}
  (remember: z_{nk} \in \{0, 1\} and \sum_{k=1}^{K} z_{nk} = 1),
  which leads to the log-formulation:
  \log p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)

The Complete-Data Log-Likelihood
  \log p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)
- This is called the complete-data log-likelihood.
- Advantage: solving for the parameters (\pi_k, \mu_k, \Sigma_k) is much simpler, as the log is inside the sum!
- We could swap the sums and then, for every mixture component k, only look at the points that are associated with that component. This leads to simple closed-form solutions for the parameters (see the sketch below).
- However: the latent variables Z are not observed!
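With Z observed (as a one-hot matrix), the complete-data log-likelihood decouples over components, which is exactly what makes the closed-form solutions simple. A small sketch (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(X, Z, pis, mus, Sigmas):
    """Z[n, k] = 1 if x_n was generated by component k, else 0 (each row of Z sums to 1)."""
    total = 0.0
    for k, (pi, mu, Sigma) in enumerate(zip(pis, mus, Sigmas)):
        members = Z[:, k] == 1
        if members.any():
            total += members.sum() * np.log(pi)
            total += multivariate_normal.logpdf(X[members], mean=mu, cov=Sigma).sum()
    return total
```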

The Main Idea of EM
- Instead of maximizing the joint log-likelihood, we maximize its expectation under the latent-variable distribution:
  \mathbb{E}_Z[\log p(X, Z \mid \pi, \mu, \Sigma)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)
  where the latent-variable distribution per point is:
  p(z_n \mid x_n, \theta) = \frac{p(x_n \mid z_n, \theta) \, p(z_n)}{p(x_n \mid \theta)} = \frac{\prod_{l=1}^{K} \left( \pi_l \, \mathcal{N}(x_n \mid \mu_l, \Sigma_l) \right)^{z_{nl}}}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

The Main Idea of EM
- The expected value of the latent variables is \mathbb{E}[z_{nk}] = \gamma(z_{nk}). Plugging in, we obtain:
  \mathbb{E}_Z[\log p(X, Z \mid \pi, \mu, \Sigma)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)
- We compute this iteratively:
  1. Initialize i = 0 and (\pi_k^i, \mu_k^i, \Sigma_k^i)
  2. Compute \mathbb{E}[z_{nk}] = \gamma(z_{nk})
  3. Find parameters (\pi_k^{i+1}, \mu_k^{i+1}, \Sigma_k^{i+1}) that maximize this expectation
  4. Increase i; if not converged, go to 2.

The Theory Behind EM
- We have seen that EM maximizes the expected complete-data log-likelihood, but actually we need to maximize the log-marginal
  \log p(X \mid \theta) = \log \sum_{Z} p(X, Z \mid \theta)
- It turns out that the log-marginal is maximized implicitly! For any distribution q(Z) we have the decomposition (verified numerically in the sketch below):
  \log p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)
  where
  \mathcal{L}(q, \theta) = \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}
  \quad\text{and}\quad
  \mathrm{KL}(q \,\|\, p) = - \sum_{Z} q(Z) \log \frac{p(Z \mid X, \theta)}{q(Z)}
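This decomposition can be checked numerically on a toy GMM with a q(Z) that factorizes over the data points; the example below is a sketch with arbitrary, illustrative parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                           # tiny toy data set
pis = np.array([0.3, 0.7])
mus = np.array([[-1.0, 0.0], [1.0, 0.0]])
Sigmas = np.stack([np.eye(2), np.eye(2)])

# log(pi_k N(x_n | mu_k, Sigma_k)) for all n, k
log_joint = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
                      for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
log_px = logsumexp(log_joint, axis=1)                 # log p(x_n | theta)
log_post = log_joint - log_px[:, None]                # log p(z_n = k | x_n, theta)

q = rng.dirichlet(np.ones(2), size=5)                 # an arbitrary q(z_n), one row per point
L = (q * (log_joint - np.log(q))).sum()               # lower bound L(q, theta)
KL = (q * (np.log(q) - log_post)).sum()               # KL(q || p(Z | X, theta)) >= 0
print(np.allclose(L + KL, log_px.sum()))              # True: log p(X) = L + KL
```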

Visualization
- The KL-divergence is positive or 0.
- Thus, the log-likelihood is at least as large as \mathcal{L}; in other words, \mathcal{L} is a lower bound of the log-likelihood.

What Happens in the E-Step?
- The log-likelihood is independent of q.
- Thus, \mathcal{L} is maximized iff the KL-divergence is minimal.
- This is the case iff q(Z) = p(Z \mid X, \theta).

What Happens in the M-Step?
- In the M-step we keep q fixed and find new parameters \theta:
  \mathcal{L}(q, \theta) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \log p(X, Z \mid \theta) - \sum_{Z} q(Z) \log q(Z)
- We maximize the first term; the second is independent of \theta.
- This implicitly makes the KL-divergence non-zero, so the log-likelihood increases even further!

Visualization in Parameter Space
- In the E-step we compute the concave lower bound for the given old parameters \theta^{\text{old}} (blue curve).
- In the M-step we maximize this lower bound and obtain the new parameters \theta^{\text{new}}.
- This is repeated (green curve) until convergence.

Variants of EM
- Instead of maximizing the log-likelihood, we can use EM to maximize a posterior when a prior is given (MAP instead of MLE); this reduces overfitting.
- In generalized EM, the M-step only increases the lower bound instead of maximizing it (useful if the standard M-step is intractable).
- Similarly, the E-step can be generalized in that the optimization with respect to q is not carried out completely.
- Furthermore, there are incremental versions of EM, where data points arrive sequentially and the parameters are updated after each data point.

Example 1: Learn a Sensor Model
- A radar range finder pointed at a metallic target returns three types of measurement:
  - the distance to the target,
  - the distance to the wall behind the target,
  - a completely random value.

Example 1: Learn a Sensor Model
- Which measurement corresponds to which model component? What are the different model parameters?
- Solution: expectation-maximization.

Example 2: Environment Classification
- From each image, the robot extracts features: points in an n-dimensional space.
- K-means only finds the cluster centers, not their extent and shape.
- The centers and covariances can be obtained with EM.

Example 3: Plane Fitting in 3D
- Has been done in this paper: given a set of 3D points, fit planes to the data.
- Idea: the model parameters are the normal vectors and the distances to the origin of a set of planes.
- Gaussian noise model: p(z \mid \theta) = \mathcal{N}(d(z, \theta) \mid 0, \sigma), with point-to-plane distance d(z, \theta) and noise variance \sigma.
- Introduce latent correspondence variables C_{ij} and maximize the expected log-likelihood \mathbb{E}[\log p(Z, C \mid \theta)].
- The maximization can be done in closed form (a rough sketch of such an EM loop follows below).
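The paper's exact formulation is not reproduced here, but a rough sketch of EM-based plane fitting under the stated noise model might look as follows; the parameterization (unit normal n, offset d), the initialization, the fixed number of planes, and the noise update are all assumptions made for this example.

```python
import numpy as np

def em_planes(X, n_planes, n_iters=50, sigma=0.1, rng=None):
    """X: (N, 3) points. Planes are (normal, offset) with signed distance n.x - d."""
    rng = np.random.default_rng(rng)
    N = len(X)
    # Crude initialization: random unit normals, offsets through randomly chosen points
    normals = rng.normal(size=(n_planes, 3))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    offsets = np.einsum('kd,kd->k', normals, X[rng.choice(N, n_planes)])
    for _ in range(n_iters):
        # E-step: soft correspondences from the Gaussian noise model on the distances
        dist = X @ normals.T - offsets                 # point-to-plane distances, (N, K)
        logw = -0.5 * (dist / sigma) ** 2
        logw -= logw.max(axis=1, keepdims=True)        # stabilize before exponentiating
        R = np.exp(logw)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted least-squares plane per component
        for k in range(n_planes):
            w = R[:, k]
            c = (w[:, None] * X).sum(0) / w.sum()      # weighted centroid
            S = (w[:, None] * (X - c)).T @ (X - c)     # weighted scatter matrix
            normals[k] = np.linalg.eigh(S)[1][:, 0]    # eigenvector of smallest eigenvalue
            offsets[k] = normals[k] @ c
        # Update the noise scale (using the distances from the E-step)
        sigma = np.sqrt((R * dist ** 2).sum() / N)
    return normals, offsets, R
```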

Summary
- K-means is an iterative method for clustering.
- Mixture models can be formalized using latent (unobserved) variables.
- A very common example are Gaussian mixture models (GMMs).
- To estimate the parameters of a GMM we can use expectation-maximization (EM).
- In general, EM can be interpreted as maximizing a lower bound on the complete-data log-likelihood.
- EM is guaranteed to converge, but it may run into local maxima.
