Expectation-Maximization Algorithm for Clustering


Expectation-Maximization Algorithm for Clustering Multidimensional Numerical Data

Avinash Kak
Purdue University

January 28, 2017, 7:57am

An RVL Tutorial Presentation
First Presented: Summer 2012
(Updated with minor corrections: January 2017)

© 2017 Avinash Kak, Purdue University

CONTENTS

1. What Makes EM Magical?
2. EM: The Core Notions
3. An Example of EM Estimation in Which the Unobserved Data is Just the Missing Data
4. EM for Clustering Data That Can be Modeled as a Gaussian Mixture
5. Algorithm::ExpectationMaximization — a Perl Module
6. Convenience Scripts in Algorithm::ExpectationMaximization
7. Some Clustering Results Obtained …
8. Acknowledgments

1. What Makes EM Magical?

• Despite the fact that EM can occasionally get stuck in a local maximum as you estimate the parameters by maximizing the log-likelihood of the observed data, in my mind there are three things that make it magical:

– the ability to simultaneously optimize a large number of variables;

– the ability to find good estimates for any missing information in your data at the same time;

– and, in the context of clustering multidimensional data that lends itself to modeling by a Gaussian mixture, the ability to create both the traditional “hard” clusters and not-so-traditional “soft” clusters.

• With regard to the ability of EM to simultaneously optimize a large number of variables, consider the case of clustering three-dimensional data:

– Each Gaussian cluster in 3D space is characterized by the following 10 variables: the 6 unique elements of the 3×3 covariance matrix (which must be symmetric and positive-definite), the 3 unique elements of the mean, and the prior associated with the Gaussian.

– Now let’s say you expect to see six Gaussians in your data. What that means is that you would want the values for 59 variables (remember the unit-summation constraint on the class priors, which reduces the overall number of variables by one) to be estimated by the algorithm that seeks to discover the clusters in your data. This parameter count is reproduced in the sketch below.
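To make the arithmetic above concrete, here is a minimal Python sketch, written for this transcription rather than taken from the tutorial, that reproduces the parameter count for K Gaussians in d dimensions:

```python
# A minimal sketch that reproduces the parameter count quoted above
# for K Gaussian clusters in d dimensions.
def num_em_parameters(K: int, d: int) -> int:
    per_gaussian = d * (d + 1) // 2   # unique elements of a symmetric d x d covariance matrix
    per_gaussian += d                 # elements of the mean vector
    per_gaussian += 1                 # the class prior
    return K * per_gaussian - 1       # unit-summation constraint on the priors removes one variable

print(num_em_parameters(6, 3))        # -> 59, matching the count in the text
```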

– What’s amazing is that, despite the large number of variables that need to be optimized simultaneously, the chances are that the EM algorithm will give you a very good approximation to the correct answer.

• About EM returning both hard and soft clusters: by hard clusters I mean a disjoint partition of the data. This is normally what classifiers do. By soft clusters I mean allowing for a data point to belong to two or more clusters at the same time, the “level of membership” in a cluster being expressed by the posterior probabilities of the classes at the data point. (We will use the words “cluster” and “class” synonymously in this tutorial.) The sketch below illustrates the distinction.
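The following short Python sketch is my own illustration rather than code from the tutorial: the rows of a hypothetical posterior-probability matrix are the soft memberships, and taking the most probable class per point yields the hard partition.

```python
import numpy as np

# Hypothetical posteriors for 3 data points and 3 clusters: each row holds the
# posterior probabilities of the classes at one data point ("soft" membership).
posteriors = np.array([
    [0.85, 0.10, 0.05],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
])

soft_clusters = posteriors                 # each point belongs partially to every cluster
hard_clusters = posteriors.argmax(axis=1)  # disjoint partition: keep only the most probable class
print(hard_clusters)                       # -> [0 0 2]
```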

2. EM: The Core Notions

• EM is based on the following core ideas:

– That there exists an analytic model for the data and that we know the functional form of the model. However, we do NOT know the values for the parameters that characterize this functional form.

– We have a set of recorded data points in some multidimensional space, but some elements of the data are missing. (If you are mystified by this statement, do not worry. It will become clear shortly.)

– We refer to the missing elements of the data as unobserved data.

– While in some cases of estimation it is easy to put your finger on what could be referred to as unobserved data, in others it can take some imagination — some other way of looking at your recorded data — for you to be able to conceptualize the existence of unobserved data.

– Regardless of how you bring into play the unobserved data — whether due to the fact that you actually failed to record some of the data or whether your new way of looking at the data generation process brought into existence certain unobservables — the notion of unobserved data is central to a strict implementation of the EM algorithm.

– Some folks refer to the unobserved data through the notion of hidden variables.

– However, the problem with the terminology “hidden variables” is that it fails to capture the fact that some portions of the data may be missing because, say, your equipment failed to record them at the moment they became available. It’s too much of a stretch of imagination to refer to such “failure to record” in terms of “hidden variables”.

– The notion of unobserved data is central to EM because that is what makes it possible to construct an iterative procedure for the maximization of the log-likelihood of the observed data.

– Obviously, we wish for EM to find the maximum-likelihood (ML) estimates for the parameters of the data model. The model parameters estimated by EM should be ML in the sense that they maximize the likelihood of all of the observed data.

– We also wish for EM to give us the best possible values (again in the maximum-likelihood sense vis-à-vis all the observed data) for the unobserved data.

• Since folks new to EM have difficulty with the notion of unobserved data, the rest of this section presents two examples, one in which the unobserved data is literally so — that is, a part of the data that needed to be recorded was not recorded — and the other in which the unobserved data is a product of our imagination. The first example is by Duda, Hart, and Stork, and the second is based on a tutorial presentation of EM by Jeff Bilmes, “A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixtures and Hidden Markov Models,” Tech. Report, U.C. Berkeley.

Example 1 of Unobserved Data:

– Consider the case when the observed data consists of N points in a 2D plane.

– Let’s say that we know a priori that a single bivariate Gaussian is a good model for the data. We only know the functional form of the model — we do NOT know the values for the parameters of this model.

– That is, if x represents one element of the observed data, we can write

$$ p(\vec{x}) \;=\; \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \; e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})} \qquad (1) $$

where d = 2 and |Σ| is the determinant of the 2×2 covariance matrix Σ. We think of x as a 2-dimensional column vector. (The formula shown is for the general case of a d-dimensional x.) A small numerical sketch of evaluating this density is given below.
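As a quick illustration of Eq. (1), here is a short Python sketch of my own (not code from the tutorial) that evaluates the d-dimensional Gaussian density for an assumed mean and covariance:

```python
import numpy as np

# Evaluate the Gaussian density of Eq. (1) at a d-dimensional point x.
def gaussian_density(x, mu, sigma):
    d = len(mu)
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(sigma))
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)   # -(1/2)(x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(exponent) / norm

# Hypothetical 2D mean and covariance, purely for illustration:
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_density([0.5, -1.0], mu, sigma))
```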

– The yet-unknown mean of the observed data is represented by the 2-dimensional column vector µ.

– The yet-unknown covariance of the observed data is represented by a positive-definite and symmetric 2×2 matrix Σ.

– We are therefore talking about 5 unknowns in the Gaussian model, of which three are for the symmetric 2×2 covariance matrix Σ and two for the mean vector µ.

– Given the data model as described above, let’s say we are in possession of N observations, of which the last one is only partial. We consider an observation to be partial if only one of the two coordinates is known.

– Let’s denote the N−1 complete observations by x_1, x_2, ..., x_{N−1}, and the last, partial observation by x_N.

– The question here is: Can EM be used to estimate the parameters of the underlying Gaussian model, while, at the same time, providing us with an estimate for the missing portion of the observation x_N?

Example 2 of Unobserved Data:

– Consider the following case: Our observed data can be modeled by a mixture of K Gaussians in which each Gaussian is given by

$$ p(\vec{x}) \;=\; \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \; e^{-\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T \Sigma_i^{-1} (\vec{x}-\vec{\mu}_i)} \qquad (2) $$

– In the above model, |Σ_i| is the determinant of the d×d covariance matrix Σ_i for the i-th Gaussian, and µ_i is the mean of the same. We also associate a prior probability a_i with the i-th Gaussian with regard to its contribution to the mixture.

– Our goal is automatic clustering of the observations into disjoint clusters, with each cluster corresponding to a single Gaussian.

– The question here is whether EM can be used to estimate the class labels for the data elements, while, at the same time, estimating the means and the covariances of the individual Gaussians in the mixture. (A sketch of evaluating such a mixture density appears below.)
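The sketch below, again my own illustration rather than the tutorial's code, evaluates the mixture density implied by Eq. (2) as the prior-weighted sum of the component Gaussians; it reuses the gaussian_density() helper sketched after Eq. (1):

```python
import numpy as np

# p(x) = sum_i a_i * N(x; mu_i, Sigma_i), with the priors a_i summing to one.
def mixture_density(x, priors, means, covariances):
    return sum(a * gaussian_density(x, mu, sigma)
               for a, mu, sigma in zip(priors, means, covariances))

# Hypothetical two-component mixture in 2D:
priors = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covariances = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_density([1.0, 1.0], priors, means, covariances))
```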

– We obviously need to conceptualize the existence of unobserved data in this case. On the face of it, it is not clear as to what would constitute the unobserved data after we have recorded the N data points.

– As it turns out, we can conceptualize the needed unobserved data by thinking of the data generation process in a manner that allows a random variable to be associated with the selection of the Gaussian for each data point, as we next describe.

– We imagine the N data observations x_1, x_2, ..., x_N as having been generated sequentially through N different data generation events.

– Next, we bring into existence a sequence of N scalar random variables Y = {y_1, y_2, ..., y_N} that correspond to the N observations X = {x_1, x_2, ..., x_N} on an index-by-index basis. The variable y_i will take on a random value from the set {1, 2, ..., K}, the value corresponding to the Gaussian that was chosen for the production of x_i. (A sketch of this generation process follows this list.)

– As shown in Section 4 of this tutorial, treating Y = {y_1, y_2, ..., y_N} as unobserved data allows us to use the EM algorithm for an iterative maximization of the log-likelihood for the data actually observed.

– In this case, it makes sense to refer to the unobserved data as the hidden variables in the estimation process.
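The following Python sketch, a hypothetical illustration of my own, mimics the generation process just described: for each observation a hidden label y_i first selects a Gaussian, and the point x_i is then drawn from that Gaussian. Only the x_i are observed; the y_i are what EM treats as unobserved data.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_mixture_data(N, priors, means, covariances):
    K = len(priors)
    # Hidden labels y_i: the code uses {0, ..., K-1} where the text uses {1, ..., K}.
    y = rng.choice(K, size=N, p=priors)
    x = np.array([rng.multivariate_normal(means[k], covariances[k]) for k in y])
    return x, y   # in practice only x is recorded

x, y = generate_mixture_data(5,
                             priors=[0.6, 0.4],
                             means=[np.zeros(2), np.array([3.0, 3.0])],
                             covariances=[np.eye(2), 2.0 * np.eye(2)])
print(x.shape, y)
```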

• As mentioned earlier, the next section will present an example in which the unobserved data is literally so. Subsequently, in Section 4, we will talk about using EM for clustering Gaussian mixture data.

3. An Example of EM Estimation in Which the Unobserved Data is Just the Missing Data

• This example is by Duda, Hart, and Stork (DHS) from their book “Pattern Classification,” pages 126-128.

• My goal in using the DHS example is both to illustrate that the unobserved data can indeed be just the missing data, and to develop the notion of how the unobserved data facilitates the development of an iterative method for the maximization of the log-likelihood of the data actually observed.

• The observed data in this example will consist of four randomly produced points in a plane, with only the second coordinate available for the last point.

• The coordinate values for the four observed points are:

$$ \vec{x}_1 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \quad \vec{x}_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \vec{x}_3 = \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \quad \vec{x}_4 = \begin{bmatrix} * \\ 4 \end{bmatrix} $$

Since the first coordinate of the last observation, x_4, is unknown, we use the symbol ‘*’ for its value. We will denote the last observation by

$$ \vec{x}_4 = \begin{bmatrix} x_{4,1} \\ 4 \end{bmatrix} $$

where the variable x_{4,1} stands for the missing information in the data.

• So the problem is to estimate a value for x_{4,1} that would be “consistent” — consistent in the maximum-likelihood sense — with the observed values for x_1, x_2, x_3, and for the x_{4,2} coordinate of x_4.

• To keep the example simple, we assume the observed data can be modeled by a Gaussian with uncorrelated x and y coordinates. (One way of representing this partially observed data set is sketched below.)
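Purely as an illustration of my own, and not as the tutorial's code, the partially observed data set can be held in an array with a NaN standing in for the missing coordinate x_{4,1}:

```python
import numpy as np

data = np.array([
    [0.0,    2.0],   # x1
    [1.0,    0.0],   # x2
    [2.0,    2.0],   # x3
    [np.nan, 4.0],   # x4: first coordinate unobserved, second coordinate equals 4
])

observed_mask = ~np.isnan(data)   # marks which entries were actually recorded
print(observed_mask)
```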

• The Gaussian distribution for the data is given by

$$ p(\vec{x}) \;=\; \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \; e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})} \qquad (3) $$

where d = 2, and with the covariance of this Gaussian given by

$$ \Sigma \;=\; \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix} \qquad (4) $$

• We will express the mean of the Gaussian in terms of its coordinates through:

$$ \vec{\mu} \;=\; \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \qquad (5) $$

• As the reader can see, there are four parameters, yet unknown, in the data model: σ_1^2, σ_2^2, µ_1, and µ_2. We will next talk about how these parameters can be estimated with EM.

• The EM algorithm requires us to iterate through the following two steps (a schematic of this loop is sketched after the two steps):

1. The Expectation Step: Using the current best guess for the parameters of the data model, we construct an expression for the log-likelihood for all data, observed and unobserved, and, then, marginalize the expression with respect to the unobserved data. This expression will be shown to depend on both the current best guess for the model parameters and the model parameters treated as variables. [This sentence, undoubtedly confusing at its first reading, will become clear on Slide 29.]

2. The Maximization Step: Given the expression resulting from the previous step, for the next guess we choose those values for the model parameters that maximize the expectation expression. These constitute our best new guess for the model parameters.
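The skeleton below is only a schematic of my own for the two-step iteration; expectation_of_loglik() and maximize() are hypothetical placeholders for the problem-specific E-Step and M-Step worked out in the rest of this section.

```python
def run_em(theta_guess, observed_data, tol=1e-6, max_iter=100):
    prev_loglik = float("-inf")
    for _ in range(max_iter):
        # E-Step: build the expected log-likelihood as a function of the model
        # parameters, averaging over the unobserved data with the current guess.
        expected_loglik = expectation_of_loglik(theta_guess, observed_data)

        # M-Step: choose the parameter values that maximize that expectation.
        theta_guess, loglik = maximize(expected_loglik)

        # Stop when any further change in the log-likelihood of the observed
        # data falls below a small threshold (as described on the next slide).
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    return theta_guess
```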

• The output of the Expectation Step codifies our expectation with regard to what model parameters are most consistent with the data actually observed and with the current guess for the parameters — provided we maximize the expression yielded by this step.

• We stop iterating through the two steps when any further change in the log-likelihood of the observed data falls below some small threshold.

• This brings us to the very important subject of the “marginalization” of the log-likelihood of all the data, observed and unobserved. By marginalization in the Expectation Step we mean integration of the log-likelihood for all data over all possibilities for the unobserved data.

• In order to give substance to the remark made at the bottom of the previous slide, let’s first write down an expression for the log-likelihood for all data.

• Assuming the observations to have been made independently, ordinarily the expression for the log-likelihood for the four data points would be

$$ LL \;=\; \sum_{i=1}^{4} \ln p(\vec{x}_i \,|\, \vec{\theta}) \qquad (6) $$

where by the vector θ we mean the model parameters:

$$ \vec{\theta} \;=\; \begin{bmatrix} \mu_1 \\ \mu_2 \\ \sigma_1^2 \\ \sigma_2^2 \end{bmatrix} \qquad (7) $$

• However, since we know that the last point, x_4, was observed only partially — that is, only the second coordinate of the last point was actually observed — we need to tease it out of the summation in Eq. (6) for special treatment later.

• So let’s write the log-likelihood expression in the following form:

$$ LL \;=\; \sum_{i=1}^{3} \ln p(\vec{x}_i \,|\, \vec{\theta}) \;+\; \ln p(\vec{x}_4 \,|\, \vec{\theta}) \qquad (8) $$

• As you will see later, in the Maximization Step of each iteration of EM, we would want to choose a value for the parameter vector θ that maximizes the log-likelihood shown above. Although, in and of itself, that sounds straightforward, the reality of what needs to be maximized is a little bit more complex.

• We want the maximization of the log-likelihood to NOT be absolute in any sense, but to be with respect to the current guess for the model parameter vector θ. (If we could solve the absolute maximization of the log-likelihood problem, we would not need the EM algorithm.)

• To address the issue raised in the previous bullet, let’s denote the current guess for the model parameters by θ^g. So the question then becomes as to how to “link up” the log-likelihood expression shown in Equation (8) with θ^g.

• The value for the log-likelihood shown in Eq. (8) obviously depends on the data coordinate x_{4,1}, whose value we do not know. The best way to deal with this lack of knowledge about x_{4,1} is to average out the log-likelihood with respect to x_{4,1}.

• In a multi-variable scenario, averaging out an entity with respect to any single variable means carrying out a marginal integration of the entity with respect to the probability density function for the variable in question.

• The question then arises as to what density function to use for the variable x_{4,1}. This is where the current best guess about the data model comes in. Recall, we represent our current best guess for the data model by the parameter vector θ^g.

• Since the parameter vector θ^g is for a model that includes two coordinates, in and of itself, this parameter vector does not apply directly to the scalar variable x_{4,1}.

• So the best we can do for the needed density function for x_{4,1} at the moment is to express it generally as p(x_{4,1} | θ^g, x_{4,2} = 4).

• Now we are ready to write the log-likelihood for all of the data observations, while taking into account the missing data coordinate x_{4,1}.

• As stated earlier, the new log-likelihood will be a marginalization of the original log-likelihood over the unobserved data element:

$$ LL' \;=\; \int \left\{ \sum_{i=1}^{3} \ln p(\vec{x}_i \,|\, \vec{\theta}) \;+\; \ln p(\vec{x}_4 \,|\, \vec{\theta}) \right\} p(x_{4,1} \,|\, \vec{\theta}^g, x_{4,2} = 4) \; dx_{4,1} \qquad (9) $$

As you can see, this marginalization of the log-likelihood over x_{4,1} is with respect to the current best guess θ^g for the model parameters. (A numerical sketch of this conditional density is given below.)
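As a numerical aside of my own, not part of the tutorial: because the covariance in Eq. (4) is diagonal, normalizing the joint Gaussian along the missing coordinate gives a conditional density p(x_{4,1} | θ^g, x_{4,2} = 4) that is simply a one-dimensional Gaussian in x_{4,1}. The sketch below checks this for an assumed guess θ^g.

```python
import numpy as np
from scipy import integrate, stats

mu_g = np.array([0.5, 1.0])       # hypothetical current guess for the mean
sigma_g = np.array([1.5, 2.0])    # hypothetical current guesses for sigma_1 and sigma_2

def joint_at(x41):
    # p([x41, 4] | theta^g) under the diagonal-covariance model of Eqs. (3)-(5)
    return stats.norm.pdf(x41, mu_g[0], sigma_g[0]) * stats.norm.pdf(4.0, mu_g[1], sigma_g[1])

normalizer, _ = integrate.quad(joint_at, -np.inf, np.inf)

x = 1.3
print(joint_at(x) / normalizer)               # conditional density at x
print(stats.norm.pdf(x, mu_g[0], sigma_g[0])) # identical: a 1D Gaussian with mu_1^g, sigma_1^g
```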

• Since the observations of the four data points in the 2D plane are independent of one another, the marginalization shown above with respect to the variable x_{4,1} does not affect the contribution to the log-likelihood by x_1, x_2, and x_3.

• The x_{4,1}-marginalized log-likelihood shown in Eq. (9) can therefore be simplified to:

$$ LL' \;=\; \sum_{i=1}^{3} \ln p(\vec{x}_i \,|\, \vec{\theta}) \;+\; \int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; p(x_{4,1} \,|\, \vec{\theta}^g, x_{4,2} = 4) \; dx_{4,1} \qquad (10) $$

• We will now use Bayes’ Rule to simplify the integral on the right in Eq. (10), as shown on the next slide.

• Applying Bayes’ Rule to the second term of the integrand in Eq. (10):

$$
\begin{aligned}
\int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; p(x_{4,1} \,|\, \vec{\theta}^g, x_{4,2}=4) \, dx_{4,1}
&= \int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; \frac{p(x_{4,1}, \vec{\theta}^g, x_{4,2}=4)}{p(\vec{\theta}^g, x_{4,2}=4)} \, dx_{4,1} \\[4pt]
&= \int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; \frac{p\!\left(\left[\begin{smallmatrix} x_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right) \cdot p(\vec{\theta}^g)}{p(x_{4,2}=4 \,|\, \vec{\theta}^g) \cdot p(\vec{\theta}^g)} \, dx_{4,1} \\[4pt]
&= \int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; \frac{p\!\left(\left[\begin{smallmatrix} x_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right)}{p(x_{4,2}=4 \,|\, \vec{\theta}^g)} \, dx_{4,1} \\[4pt]
&= \int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; \frac{p\!\left(\left[\begin{smallmatrix} x_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right)}{\displaystyle\int p\!\left(\left[\begin{smallmatrix} x'_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right) dx'_{4,1}} \, dx_{4,1}
\end{aligned}
\qquad (11)
$$

• The final expression shown above is easy to handle since all the probabilities are now described by the Gaussian model in Eq. (3).

• The x_{4,1}-marginalized log-likelihood shown in Eq. (10) can therefore be expressed as:

$$ LL' \;=\; \sum_{i=1}^{3} \ln p(\vec{x}_i \,|\, \vec{\theta}) \;+\; \int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; \frac{p\!\left(\left[\begin{smallmatrix} x_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right)}{\displaystyle\int p\!\left(\left[\begin{smallmatrix} x'_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right) dx'_{4,1}} \; dx_{4,1} \qquad (12) $$

Notice that this is a function of both the current guess θ^g for the model parameters, which basically is a set of constant values, and the variables for the new guess in the vector θ.

• The result in Equation (12) is our expectation for the log-likelihood of the observed data as a function of the variables in θ.

• Now that we have a “general” expression for the log-likelihood expectation for the model parameters, it’s time to get to the business of applying the EM algorithm to the problem at hand — for both the purpose of estimating the model parameters and the missing x_{4,1}.

• In the discussion that follows, we will refer to the Expectation Step as the E-Step and the Maximization Step as the M-Step.

• For the very first iteration through the two steps, we must make a reasonable random guess for the model parameters. We will choose

$$ \vec{\theta}^g \;=\; \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} \qquad (13) $$

• In other words, we are choosing zero mean and unit variance as the initial guess for the model parameters.

• For the invocation of the E-Step in the first iteration of the EM algorithm, we plug the guess in Eq. (13) into Eq. (12) and we get

$$ LL'_1 \;=\; \sum_{i=1}^{3} \ln p(\vec{x}_i \,|\, \vec{\theta}) \;+\; \frac{1}{D} \int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; p\!\left(\left[\begin{smallmatrix} x_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right) dx_{4,1} \qquad (14) $$

where the constant D stands for the denominator integral on the right hand side in Eq. (12). It is given by

$$ D \;=\; \int p\!\left(\left[\begin{smallmatrix} x'_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \left[\begin{smallmatrix} 0 \\ 0 \\ 1 \\ 1 \end{smallmatrix}\right]\right) dx'_{4,1} \;=\; \int \frac{1}{2\pi} \, e^{-\frac{1}{2}\left( x'^{\,2}_{4,1} + 4^2 \right)} dx'_{4,1} \;=\; \frac{e^{-8}}{\sqrt{2\pi}} \qquad (15) $$

(A quick numerical check of this value is sketched below.)
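The short Python sketch below is my own numerical confirmation, not part of the tutorial, that the denominator of Eq. (15) indeed evaluates to e^{-8}/sqrt(2*pi) under the initial guess of Eq. (13):

```python
import numpy as np
from scipy import integrate

def joint_under_initial_guess(x41):
    # p([x41, 4] | mu = (0, 0), variances = (1, 1)) for the diagonal Gaussian of Eqs. (3)-(5)
    return (1.0 / (2.0 * np.pi)) * np.exp(-0.5 * (x41**2 + 4.0**2))

D_numeric, _ = integrate.quad(joint_under_initial_guess, -np.inf, np.inf)
D_closed_form = np.exp(-8.0) / np.sqrt(2.0 * np.pi)
print(D_numeric, D_closed_form)   # the two values agree to numerical precision
```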

• We will now simplify the integral in Eq. (14) as follows:

$$
\begin{aligned}
\int \ln p(\vec{x}_4 \,|\, \vec{\theta}) \; p\!\left(\left[\begin{smallmatrix} x_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \vec{\theta}^g\right) dx_{4,1}
&= \int \ln p\!\left(\left[\begin{smallmatrix} x_{4,1} \\ 4 \end{smallmatrix}\right] \,\middle|\, \left[\begin{smallmatrix} \mu_1 \\ \mu_2 \\ \sigma_1^2 \\ \sigma_2^2 \end{smallmatrix}\right]\right) \cdot \frac{1}{2\pi}\, e^{-\frac{1}{2}\left( x_{4,1}^2 + 4^2 \right)} \, dx_{4,1} \\[4pt]
&= \int \ln \left\{ \frac{1}{2\pi \sigma_1 \sigma_2}\, e^{-\frac{1}{2}\left( (x_{4,1}-\mu_1)^2/\sigma_1^2 \,+\, (4-\mu_2)^2/\sigma_2^2 \right)} \right\} \cdot \frac{1}{2\pi}\, e^{-\frac{1}{2}\left( x_{4,1}^2 + 4^2 \right)} \, dx_{4,1} \\[4pt]
&= \int \left\{ \ln \frac{1}{2\pi \sigma_1 \sigma_2} \;-\; \frac{1}{2}\left( \frac{(x_{4,1}-\mu_1)^2}{\sigma_1^2} + \frac{(4-\mu_2)^2}{\sigma_2^2} \right) \right\} \cdot \frac{1}{2\pi}\, e^{-\frac{1}{2}\left( x_{4,1}^2 + 4^2 \right)} \, dx_{4,1} \\[4pt]
&= \ln \frac{1}{2\pi \sigma_1 \sigma_2} \cdot \frac{e^{-8}}{\sqrt{2\pi}} \;-\; \int \frac{1}{2}\left( \frac{(x_{4,1}-\mu_1)^2}{\sigma_1^2} + \frac{(4-\mu_2)^2}{\sigma_2^2} \right) \cdot \frac{1}{2\pi}\, e^{-\frac{1}{2}\left( x_{4,1}^2 + 4^2 \right)} \, dx_{4,1} \\[4pt]
&= \left\{ \ln \frac{1}{2\pi \sigma_1 \sigma_2} \;-\; \frac{1}{2}\left( \frac{1+\mu_1^2}{\sigma_1^2} + \frac{16 - 8\mu_2 + \mu_2^2}{\sigma_2^2} \right) \right\} \cdot \frac{e^{-8}}{\sqrt{2\pi}}
\end{aligned}
\qquad (16)
$$

(A numerical spot-check of this closed form is sketched below.)
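The following sketch is my own spot-check, not part of the tutorial, that the closed form on the last line of Eq. (16) agrees with direct numerical integration for an arbitrary, hypothetical choice of the parameters µ_1, µ_2, σ_1, σ_2:

```python
import numpy as np
from scipy import integrate

mu1, mu2, s1, s2 = 0.7, 2.0, 1.2, 1.5   # hypothetical values; s1, s2 are standard deviations

def integrand(x41):
    log_p_x4 = (np.log(1.0 / (2.0 * np.pi * s1 * s2))
                - 0.5 * ((x41 - mu1)**2 / s1**2 + (4.0 - mu2)**2 / s2**2))
    weight = (1.0 / (2.0 * np.pi)) * np.exp(-0.5 * (x41**2 + 16.0))
    return log_p_x4 * weight

numeric, _ = integrate.quad(integrand, -np.inf, np.inf)
closed = ((np.log(1.0 / (2.0 * np.pi * s1 * s2))
           - 0.5 * ((1.0 + mu1**2) / s1**2 + (16.0 - 8.0 * mu2 + mu2**2) / s2**2))
          * np.exp(-8.0) / np.sqrt(2.0 * np.pi))
print(numeric, closed)   # the two values agree to numerical precision
```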

