CS 7140: Advanced Machine Learning


Lecture 8: Expectation Maximization (Wed 7 Feb 2018)
Instructor: Jan-Willem van de Meent (j.vandemeent@northeastern.edu)
Scribes: Sabbir Ahmad (ahmad.sub@husky.neu.edu), Bahar Azari (azari@ece.neu.edu)

1 Expectation Maximization

1.1 Gaussian Mixtures: Gibbs Sampling

In the previous lectures, we have seen sampling methods for generative models to estimate the parameters for data clustering, where the data are assumed to be generated from a Gaussian mixture. For example, the generative model and the Gibbs sampler updates for a mixture of K Gaussians are specified as follows.

Generative Model

    \mu_k, \Sigma_k \sim p(\mu, \Sigma)
    z_n \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K)
    y_n \mid z_n = k \sim \mathcal{N}(\mu_k, \Sigma_k)

Gibbs Sampler Updates

    1. z_n \sim p(z_n \mid y_n, \mu, \Sigma)
    2. \mu_k, \Sigma_k \sim p(\mu_k, \Sigma_k \mid y, z)

1.2 Gaussian Mixtures: Maximum Likelihood / MAP

We can do Maximum Likelihood (ML) or Maximum a Posteriori (MAP) estimation of the parameters θ of a Gaussian mixture. Note that here θ := {π, µ_{1:K}, Σ_{1:K}} for a mixture of K Gaussians. The ML/MAP estimation problem in this setting amounts to estimating K multivariate Gaussians with mixing weights π_1, ..., π_K.

For Maximum Likelihood estimation we can write

    \theta^* = \operatorname{argmax}_\theta \; p(y; \theta) = \operatorname{argmax}_\theta \int p(y, z; \theta) \, dz.    (1)

Here, \int p(y, z; \theta) \, dz is the marginal probability of y over z, where z represents the cluster assignment for every point in the dataset. As we have seen before, computing this integral is often intractable. For Maximum a Posteriori (MAP) estimation, we can write

    \theta^* = \operatorname{argmax}_\theta \; p(y, \theta) = \operatorname{argmax}_\theta \int p(y, z, \theta) \, dz.    (2)

1.3 Generalized hard K-means

ML and MAP estimation for Gaussian mixtures is complicated by the fact that we need to marginalize over z. Performing optimization conditioned on z is relatively straightforward. In this case we can solve for the derivative

    \nabla_\theta \log p(y, z; \theta) = \sum_{n=1}^N \nabla_\theta \log p(y_n, z_n; \theta) = 0.    (3)

As we will see below, this type of problem is analogous to the one that we already solved in the context of linear regression. It turns out that we can solve this problem in closed form for any likelihood p(y | z, θ_y) that is in the exponential family. When we marginalize over z, we obtain a different problem,

    \nabla_\theta \log p(y; \theta) = \sum_{n=1}^N \nabla_\theta \log p(y_n; \theta) = \sum_{n=1}^N \frac{\nabla_\theta \, p(y_n; \theta)}{p(y_n; \theta)},    (4)

        = \sum_{n=1}^N \frac{\nabla_\theta \sum_{k=1}^K p(y_n, z_n = k; \theta)}{p(y_n; \theta)},    (5)

        = \sum_{n=1}^N \sum_{k=1}^K \frac{\nabla_\theta \, p(y_n, z_n = k; \theta)}{p(y_n; \theta)} = 0.    (6)

It so happens that this problem is generally not tractable in closed form (we will see further on why). For this reason, we will begin by solving a simpler problem, inspired by Gibbs sampling. In this problem we alternate between two maximization steps:

    Step 1:  z_n^* = \operatorname{argmax}_{z_n} \; p(y_n, z_n; \theta)
    Step 2:  \mu_k^*, \Sigma_k^* = \operatorname{argmax}_{\mu_k, \Sigma_k} \; p(y_n, z_n, \theta)    [MAP]
             \mu_k^*, \Sigma_k^* = \operatorname{argmax}_{\mu_k, \Sigma_k} \; p(y_n, z_n; \theta)    [ML]    (7)

In Step 2, the mean and covariance are optimized by MAP or ML. Note the difference in notation: in MAP estimation, p(y_n, z_n, θ) = p(y_n | z_n, θ) p(z_n | θ) p(θ), whereas ML estimation does not place a prior distribution on the parameters, so p(y_n, z_n; θ) = p(y_n | z_n; θ) p(z_n; θ).

1.3.1 Updates for the parameters

Step 1 is trivial to implement (we can simply enumerate over all choices z_n = k). In order to implement Step 2, we will need to write down the joint density p(y_n, z_n; θ) = p(y_n | z_n; θ) p(z_n; θ), which is

    p(y, z; \theta) = \prod_{n=1}^N p(y_n, z_n; \theta) = \prod_{n=1}^N \prod_{k=1}^K p(y_n, z_n = k; \theta)^{\mathbb{I}[z_n = k]}.    (8)

Here, \mathbb{I}[z_n = k] is the indicator function, which is defined as

    \mathbb{I}[z_n = k] =
    \begin{cases}
      1, & \text{if } z_n = k, \\
      0, & \text{if } z_n \neq k.
    \end{cases}    (9)

We can express the joint p(y_n, z_n = k; θ) in terms of the likelihood p(y_n | z_n = k, µ, Σ) and the cluster assignment prior p(z_n = k; π),

    p(y_n \mid z_n = k, \mu, \Sigma) = \mathcal{N}(y_n; \mu_k, \Sigma_k),    (10)
    p(z_n = k; \pi) = \pi_k.    (11)
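As a small illustration of equations (8)-(11), the per-point joint density of a Gaussian mixture can be evaluated in code. The sketch below is our own (the function name log_joint and its arguments are assumptions, not part of the notes) and assumes NumPy and SciPy are available; it is reused in the sketches that follow.

```python
# Minimal sketch (our own, not from the notes): evaluate log p(y_n, z_n = k; theta)
# for each component k of a Gaussian mixture, as in equations (8)-(11).
import numpy as np
from scipy.stats import multivariate_normal

def log_joint(y_n, pi, mus, Sigmas):
    """Length-K array with log p(y_n, z_n = k; theta) = log N(y_n; mu_k, Sigma_k) + log pi_k."""
    K = len(pi)
    return np.array([
        multivariate_normal.logpdf(y_n, mean=mus[k], cov=Sigmas[k]) + np.log(pi[k])
        for k in range(K)
    ])
```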

To compute the ML update for the mean, we now solve for the derivative with respect to µ_l, which must be 0 at the optimum:

    \frac{\partial}{\partial \mu_l} \log p(y, z; \theta) = \frac{\partial}{\partial \mu_l} \sum_{n=1}^N \sum_{k=1}^K \mathbb{I}[z_n = k] \log p(y_n, z_n = k; \theta),    (12)

        = \sum_{n=1}^N \sum_{k=1}^K \mathbb{I}[z_n = k] \frac{\partial}{\partial \mu_l} \big[ \log \mathcal{N}(y_n; \mu_k, \Sigma_k) + \log \pi_k \big],    (13)

        = \sum_{n=1}^N \mathbb{I}[z_n = l] \frac{\partial}{\partial \mu_l} \log \mathcal{N}(y_n; \mu_l, \Sigma_l),    (14)

        = \sum_{n=1}^N \mathbb{I}[z_n = l] \, (y_n - \mu_l)^\top \Sigma_l^{-1} = 0.    (15)

If we multiply on the right by Σ_l, then we see that the optimum satisfies the condition

    0 = \sum_{n=1}^N \mathbb{I}[z_n = l] \, (y_n - \mu_l).    (16)

Solving for µ_l now gives us the update

    \mu_l = \frac{1}{N_l} \sum_{n=1}^N \mathbb{I}[z_n = l] \, y_n,
    \qquad
    N_l = \sum_{n=1}^N \mathbb{I}[z_n = l].    (17)

In other words, the optimal value µ_l is simply the mean of the points y_n that are assigned to cluster l. We can derive updates for Σ_l and π_l in an analogous manner, which yields

    \Sigma_l = \frac{1}{N_l} \sum_{n=1}^N \mathbb{I}[z_n = l] \, y_n y_n^\top - \mu_l \mu_l^\top,
    \qquad
    \pi_l = \frac{N_l}{N}.    (18)

1.3.2 Algorithm

Now that we have derived closed-form solutions for updating the parameters at each iteration, we can formally present the algorithm. We call this the generalized version of hard K-means, since the assignment of each point to a cluster is absolute: at any point during the algorithm we are certain that each data point belongs to exactly one cluster. This is called a hard assignment. A minimal implementation sketch follows the algorithm below.

Algorithm: Hard K-means (generalized version)

    Initialize: π, µ, Σ
    Repeat until the assignments z_n are unchanged:
      1. For n in 1, ..., N:
           z_n = \operatorname{argmax}_k \; \log \mathcal{N}(y_n; \mu_k, \Sigma_k) + \log \pi_k
               = \operatorname{argmax}_k \; p(z_n = k \mid y_n; \theta)
      2. For k in 1, ..., K:
           N_k = \sum_{n=1}^N \mathbb{I}[z_n = k]
           \mu_k = \frac{1}{N_k} \sum_{n=1}^N \mathbb{I}[z_n = k] \, y_n
           \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \mathbb{I}[z_n = k] \, y_n y_n^\top - \mu_k \mu_k^\top
           \pi_k = \frac{N_k}{N}
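The hard K-means updates above can be written out as the following sketch. This is a minimal illustration under our own naming conventions, not the course's reference implementation; it assumes a data matrix y of shape (N, D), parameter arrays pi (K,), mus (K, D), Sigmas (K, D, D), and reuses the hypothetical log_joint helper from the previous sketch.

```python
# Minimal sketch of generalized hard K-means (equations 16-18); assumes the
# hypothetical log_joint helper defined earlier and data y of shape (N, D).
import numpy as np

def hard_kmeans(y, pi, mus, Sigmas, max_iters=100):
    N, _ = y.shape
    K = len(pi)
    z = np.full(N, -1)
    for _ in range(max_iters):
        # Step 1: hard assignment z_n = argmax_k [ log N(y_n; mu_k, Sigma_k) + log pi_k ]
        z_new = np.array([np.argmax(log_joint(y[n], pi, mus, Sigmas)) for n in range(N)])
        if np.array_equal(z_new, z):
            break                      # assignments unchanged: stop
        z = z_new
        # Step 2: ML updates for mu_k, Sigma_k, pi_k (equations 17-18)
        for k in range(K):
            members = y[z == k]
            N_k = len(members)
            if N_k == 0:
                continue               # empty cluster: leave its parameters unchanged
            mus[k] = members.mean(axis=0)
            Sigmas[k] = (members.T @ members) / N_k - np.outer(mus[k], mus[k])
            pi[k] = N_k / N
    return z, pi, mus, Sigmas
```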

1.4 Soft K-means (Expectation Maximization)

Instead of making a hard assignment of each point to a cluster, if we are uncertain about which cluster a data point belongs to, we can assign to each point a probability of belonging to each cluster. This type of algorithm is sometimes called soft K-means, since it performs a "soft" assignment. A minimal implementation sketch is given at the end of this section.

Algorithm: Soft K-means (Expectation Maximization)

    Initialize: π, µ, Σ
    Repeat until convergence:
      1. For n in 1, ..., N:
           \gamma_{nk} := \mathbb{E}_{z_n \sim p(z_n \mid y_n; \theta)}\big[\mathbb{I}[z_n = k]\big]
           N_k = \sum_{n=1}^N \gamma_{nk}
      2. For k in 1, ..., K:
           \mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n
           \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n y_n^\top - \mu_k \mu_k^\top
           \pi_k = \frac{N_k}{N}

This expectation maximization (EM) algorithm iterates between two steps:

1. Expectation step: Given the current parameters θ, we compute the responsibility γ_nk of each cluster for each point.

2. Maximization step: Given the current responsibilities γ_nk, we compute new values for the parameters θ.

Operationally, this algorithm differs from hard K-means in that we replace \mathbb{I}[z_n = k] with the responsibilities γ_nk in the maximization step, and compute γ_nk instead of the maximum posterior value for z_n in the expectation step.

1.4.1 Example

Figure 1 illustrates the EM algorithm on data from a Gaussian mixture with three clusters. The degree of redness indicates the degree to which a point belongs to the red cluster, and similarly for blue and green. As the iterations proceed, each point is assigned to a cluster with increasing probability of belonging to that cluster. The final iteration is shown in Figure 2(a).

1.4.2 Issues

One caveat of this algorithm is that it can converge to a local optimum. Random restarts can mitigate this problem, but they do not guarantee reaching the global optimum.

Another implementation issue is that a cluster can become empty. Figure 2(b) shows that the algorithm can reach a local optimum in which one cluster contains all the points while the other two clusters are empty.
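As referenced above, the soft K-means updates of Section 1.4 can be sketched as follows. This is again our own minimal illustration (names and array shapes are assumptions, not part of the notes), reusing the hypothetical log_joint helper. Following Section 1.4.2, in practice one would run it from several random initializations.

```python
# Minimal sketch of soft K-means / EM for a Gaussian mixture (Section 1.4);
# assumes the hypothetical log_joint helper and arrays pi (K,), mus (K, D),
# Sigmas (K, D, D), y (N, D).
import numpy as np

def soft_kmeans(y, pi, mus, Sigmas, max_iters=100, tol=1e-6):
    N, _ = y.shape
    K = len(pi)
    gamma_prev = np.zeros((N, K))
    for _ in range(max_iters):
        # Expectation step: responsibilities gamma_nk = p(z_n = k | y_n; theta)
        log_p = np.array([log_joint(y[n], pi, mus, Sigmas) for n in range(N)])  # (N, K)
        gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Maximization step: weighted versions of the hard K-means updates
        N_k = np.maximum(gamma.sum(axis=0), 1e-12)   # small floor: clusters can become empty (Section 1.4.2)
        pi = N_k / N
        mus = (gamma.T @ y) / N_k[:, None]           # (K, D)
        for k in range(K):
            second_moment = (gamma[:, k, None] * y).T @ y / N_k[k]
            Sigmas[k] = second_moment - np.outer(mus[k], mus[k])
        # Stop when the responsibilities no longer change appreciably
        if np.max(np.abs(gamma - gamma_prev)) < tol:
            break
        gamma_prev = gamma
    return gamma, pi, mus, Sigmas
```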

Figure 1: Iterations of Expectation Maximization.

Figure 2: (a) Last iteration of the EM algorithm. (b) EM reaching a local optimum with two empty clusters.

2 Generalized Expectation Maximization

A more formal derivation of expectation maximization algorithms can be obtained by defining a lower bound on the log likelihood. To do so, we need to make use of Jensen's inequality.

2.1 Jensen's Inequality

Jensen's inequality relates the expectation of a convex function to the convex function of an expectation. A convex function is a function for which the area above the curve is a convex set: the line segment that connects any two points on the function lies above the function. A concave function is the opposite of a convex function, so the opposite property holds, i.e., the line segment that connects any two points on the function lies below the function.

For a convex or a concave function we can write down Jensen's inequality. For a concave function f and any two points x_1 and x_2 on the x axis, if we choose any point x = t x_1 + (1 - t) x_2 between them, where t ∈ [0, 1], the following expression holds:

    f(t x_1 + (1 - t) x_2) \geq t f(x_1) + (1 - t) f(x_2),    (19)

and the opposite holds for a convex function:

    f(t x_1 + (1 - t) x_2) \leq t f(x_1) + (1 - t) f(x_2).    (20)

An example of a convex and a concave function is shown in Figure 3.

Figure 3: A convex function (left) and a concave function (right).

Corollary: Jensen's inequality generalizes, and in particular it generalizes to random variables. For random variables we can look at the difference between a function of an expectation and the expectation of a function. Taking an expectation is like taking a (weighted) sum over different terms, so the relationship between a function of an expectation and the expectation of a function is as follows:

    \varphi(\mathbb{E}[x]) \leq \mathbb{E}[\varphi(x)], \quad \text{if } \varphi(x) \text{ is convex},
    \varphi(\mathbb{E}[x]) \geq \mathbb{E}[\varphi(x)], \quad \text{if } \varphi(x) \text{ is concave}.    (21)

We now note that the logarithm is a concave function, which implies

    \log(\mathbb{E}[x]) \geq \mathbb{E}[\log(x)].    (22)

2.2 Lower Bounds on the Log Normalizing Constant

When we are doing probabilistic modeling, we can define the notion of a normalizing constant. Suppose we have a random variable x ∼ π(x), where π(x) = γ(x)/Z is a density for which we are only able to evaluate the corresponding unnormalized density γ(x). We can use the importance sampling trick to rewrite the integral for the normalizing constant as

    Z := \int dx \, \gamma(x) = \int dx \, q(x) \frac{\gamma(x)}{q(x)} = \mathbb{E}_{x \sim q(x)}\!\left[\frac{\gamma(x)}{q(x)}\right].    (23)

If we add a log inside the expectation, then we can use Jensen's inequality. We are taking the expected value of log(γ(x)/q(x)) with respect to an arbitrary distribution q(x); by Jensen's inequality, this is less than or equal to log Z:

    \mathcal{L} := \mathbb{E}_{x \sim q(x)}\!\left[\log \frac{\gamma(x)}{q(x)}\right] \leq \log \mathbb{E}_{x \sim q(x)}\!\left[\frac{\gamma(x)}{q(x)}\right] = \log Z.    (24)

In the context of expectation maximization, we have an unnormalized density γ(z; θ) = p(y, z; θ) with a normalizing constant Z(θ) = p(y; θ). We can now introduce a distribution q(z; γ) to define the lower bound

    \mathcal{L}(\theta, \gamma) := \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log \frac{p(y, z; \theta)}{q(z; \gamma)}\right] \leq \log p(y; \theta).    (25)

If we now iterate between optimizing this bound with respect to γ and optimizing it with respect to θ, then we obtain the algorithm

Algorithm: Generalized EM

    Initialize: θ
    Repeat until the change in L(θ, γ) is below a threshold:
      1. Expectation Step:
           \gamma = \operatorname{argmax}_\gamma \; \mathcal{L}(\theta, \gamma)    (26)
      2. Maximization Step:
           \theta = \operatorname{argmax}_\theta \; \mathcal{L}(\theta, \gamma)    (27)

In this algorithm, the name "expectation" step is in some sense a misnomer; we are simply finding the parameters that maximize our objective function. That said, we will see that in practice the expectation step generally computes some set of expected sufficient statistics that can then be used to perform the maximization step.

2.3 Intermezzo: Kullback-Leibler Divergence

The KL divergence between two densities q(x) and π(x) is defined as

    \mathrm{KL}\big(q(x) \,\|\, \pi(x)\big) := \int dx \, q(x) \log \frac{q(x)}{\pi(x)}.    (28)
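To make equations (23) and (24) concrete, the following toy check (our own example, not part of the notes) estimates both E_q[γ(x)/q(x)] and the lower bound E_q[log γ(x)/q(x)] by Monte Carlo for an unnormalized Gaussian γ(x) whose normalizing constant is known.

```python
# Toy check of equations (23)-(24): Monte Carlo estimate of Z and of the
# Jensen lower bound L <= log Z for an unnormalized Gaussian gamma(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

log_gamma = lambda x: -0.5 * x**2          # gamma(x) = exp(-x^2/2), so Z = sqrt(2*pi)
log_Z_true = 0.5 * np.log(2 * np.pi)

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from the proposal q(x)
log_q = norm.logpdf(x, loc=0.0, scale=2.0)
log_w = log_gamma(x) - log_q                       # log( gamma(x) / q(x) )

log_Z_estimate = np.log(np.mean(np.exp(log_w)))    # eq. (23): E_q[gamma/q] = Z
lower_bound = np.mean(log_w)                       # eq. (24): L = E_q[log(gamma/q)] <= log Z

print(log_Z_true, log_Z_estimate, lower_bound)     # the lower bound falls below log Z
```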

2.3.1 Properties

There are two properties associated with the KL divergence. To prove the first property, we start with the negative of the KL divergence, move the negative sign into the integral (which inverts the ratio inside the log), and then apply Jensen's inequality.

1. KL(q(x) ‖ π(x)) ≥ 0 (non-negativity)

    Proof:
    -\mathrm{KL}\big(q(x) \,\|\, \pi(x)\big) = \int dx \, q(x) \log \frac{\pi(x)}{q(x)}
        = \mathbb{E}_{x \sim q(x)}\!\left[\log \frac{\pi(x)}{q(x)}\right]
        \leq \log \mathbb{E}_{x \sim q(x)}\!\left[\frac{\pi(x)}{q(x)}\right]
        = \log(1) = 0,    (29)

    because \int dx \, q(x) \frac{\pi(x)}{q(x)} = \int dx \, \pi(x) = 1.

2. KL(q(x) ‖ π(x)) = 0 ⇔ q(x) = π(x)

    Proof (for q(x) = π(x)):
    \int dx \, q(x) \log \frac{\pi(x)}{q(x)} = \int dx \, \pi(x) \log \frac{\pi(x)}{\pi(x)} = \int dx \, \pi(x) \log 1 = 0.    (30)

2.3.2 Relationship to the Lower Bound

We can relate the KL divergence to our lower bound. As shown in Equation 31, we write out the definition of the lower bound and expand it by applying Bayes' rule. The term log p(y; θ) does not depend on z and comes out of the expectation. The other part is simply the negative KL divergence, and because the KL divergence is greater than or equal to zero, the lower bound is less than or equal to log p(y; θ). An alternative (but equivalent) view is to interpret the expectation step as minimization of the Kullback-Leibler divergence between q(z; γ) and the posterior p(z | y; θ).

    \mathcal{L}(\theta, \gamma) := \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log \frac{p(y, z; \theta)}{q(z; \gamma)}\right]
        = \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log p(y; \theta) + \log \frac{p(z \mid y; \theta)}{q(z; \gamma)}\right]
        = \log p(y; \theta) - \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log \frac{q(z; \gamma)}{p(z \mid y; \theta)}\right]
        = \log p(y; \theta) - \mathrm{KL}\big(q(z; \gamma) \,\|\, p(z \mid y; \theta)\big)
        \leq \log p(y; \theta).    (31)

2.4 Role of KL minimization in Generalized Expectation Maximization

The key implication of Equation 31 is that maximizing L(θ, γ) with respect to γ is equivalent to minimizing KL(q(z; γ) ‖ p(z | y; θ)) with respect to γ. Since the KL divergence is 0 when q(z; γ) = p(z | y; θ), this choice of distribution also maximizes the lower bound.

Under this interpretation we see that the expectation step can alternately be interpreted as defining

    q(z) := p(z \mid y; \theta).    (32)
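The non-negativity of the KL divergence and the decomposition in Equation 31 can be checked numerically on a small example. The sketch below is our own illustration (not from the notes); it treats z as a discrete variable with K = 3 values so that the expectations become finite sums.

```python
# Toy check (discrete z) of KL >= 0 and of the decomposition in eq. (31):
# L(theta, gamma) = log p(y; theta) - KL( q(z) || p(z | y; theta) ).
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions given as probability vectors."""
    return np.sum(q * np.log(q / p))

# Hypothetical joint values p(y, z = k; theta) for a fixed y and K = 3
p_joint = np.array([0.10, 0.25, 0.05])
p_y = p_joint.sum()                    # p(y; theta) = sum_k p(y, z = k; theta)
p_post = p_joint / p_y                 # posterior p(z | y; theta)

q = np.array([0.2, 0.5, 0.3])          # an arbitrary distribution q(z)
L = np.sum(q * np.log(p_joint / q))    # lower bound, eq. (25), for discrete z

assert kl(q, p_post) >= 0.0                              # property 1
assert np.isclose(L, np.log(p_y) - kl(q, p_post))        # eq. (31)
assert np.isclose(np.sum(p_post * np.log(p_joint / p_post)), np.log(p_y))  # KL = 0 when q = posterior
```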

An alternate interpretation is that the expectation step computes expected values relative to p(z | y; θ), which are then used to update the parameters. To see which expectations we need to compute in a mixture model, we write out the derivative of the lower bound:

    \frac{\partial}{\partial \mu_l} \mathcal{L}(\theta, \gamma)
        = \frac{\partial}{\partial \mu_l} \sum_{n=1}^N \mathbb{E}_{q(z_n)}\!\left[\log \frac{p(y_n, z_n; \theta)}{q(z_n)}\right]
        = \sum_{n=1}^N \frac{\partial}{\partial \mu_l} \mathbb{E}_{q(z_n)}\big[\log p(y_n, z_n; \theta)\big]
        = \sum_{n=1}^N \mathbb{E}_{q(z_n)}\big[\mathbb{I}[z_n = l]\big] \, (y_n - \mu_l)^\top \Sigma_l^{-1}.    (33)

In other words, we see that the update equations are analogous to the ones we derived for hard K-means, with the exception that we now replace \mathbb{I}[z_n = k] with the expected value

    \gamma_{nk} = \mathbb{E}_{q(z_n)}\big[\mathbb{I}[z_n = k]\big]    (34)
               = \mathbb{E}_{p(z_n \mid y_n; \theta)}\big[\mathbb{I}[z_n = k]\big]    (35)
               = p(z_n = k \mid y_n; \theta)    (36)
               = \frac{p(y_n, z_n = k; \theta)}{\sum_{l=1}^K p(y_n, z_n = l; \theta)}.    (37)

When we put all of this together, we obtain the algorithm below. A sketch of a numerically stable implementation of the expectation step follows the algorithm.

Algorithm: Generalized EM for Gaussian Mixtures

    Initialize: θ
    Repeat until the change in L(θ, γ) is below a threshold:
      1. Expectation Step:
           \gamma_{nk} = \mathbb{E}_{p(z_n \mid y_n; \theta)}\big[\mathbb{I}[z_n = k]\big] = p(z_n = k \mid y_n; \theta) = \frac{p(y_n, z_n = k; \theta)}{\sum_l p(y_n, z_n = l; \theta)},    (38)
           N_k = \sum_{n=1}^N \gamma_{nk}.    (39)
      2. Maximization Step:
           \mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n,    (40)
           \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n y_n^\top - \mu_k \mu_k^\top,
           \pi_k = \frac{N_k}{N}.
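As mentioned above, the joint densities p(y_n, z_n = k; θ) in Equation 38 can underflow for high-dimensional data, so an implementation typically computes the responsibilities and log p(y; θ) in log space. The sketch below is our own (the names are assumptions, not from the notes); it uses the log-sum-exp trick and reuses the hypothetical log_joint helper from the earlier sketches.

```python
# Minimal sketch: log-space expectation step for the Generalized EM algorithm
# above. Computes gamma_nk (eq. 38) and log p(y; theta), which equals the lower
# bound L(theta, gamma) when q(z) is set to the posterior (Equation 31).
import numpy as np
from scipy.special import logsumexp

def e_step(y, pi, mus, Sigmas):
    """Return responsibilities gamma of shape (N, K) and log p(y; theta)."""
    N = y.shape[0]
    # log p(y_n, z_n = k; theta) for all n, k via the hypothetical log_joint helper
    log_p = np.array([log_joint(y[n], pi, mus, Sigmas) for n in range(N)])  # (N, K)
    log_p_y = logsumexp(log_p, axis=1)            # log p(y_n; theta) = log sum_k p(y_n, z_n = k; theta)
    gamma = np.exp(log_p - log_p_y[:, None])      # eq. (38), computed stably
    return gamma, log_p_y.sum()                   # sum_n log p(y_n; theta)
```

The returned value of log p(y; θ) can be monitored across iterations as the convergence check in the algorithm above, since the KL term in Equation 31 vanishes when q(z; γ) is the posterior.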
