CS 7140: Advanced Machine Learning


Lecture 8: Expectation Maximization (Wed 7 Feb 2018)
Instructor: Jan-Willem van de Meent (j.vandemeent@northeastern.edu)
Scribes: Sabbir Ahmad (ahmad.sub@husky.neu.edu), Bahar Azari (azari@ece.neu.edu)

1 Expectation Maximization

1.1 Gaussian Mixtures: Gibbs Sampling

In the previous lectures, we have seen sampling methods for generative models to estimate the parameters for data clustering, where the data are assumed to be generated from a Gaussian mixture. For example, the generative model and the Gibbs sampler updates for a mixture of K Gaussians are specified as follows.

Generative Model

    \mu_k, \Sigma_k \sim p(\mu, \Sigma)
    z_n \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K)
    y_n \mid z_n = k \sim \mathcal{N}(\mu_k, \Sigma_k)

Gibbs Sampler Updates

    1. z_n \sim p(z_n \mid y_n, \mu, \Sigma)
    2. \mu_k, \Sigma_k \sim p(\mu_k, \Sigma_k \mid y, z)

1.2 Gaussian Mixtures: Maximum Likelihood / MAP

We can do Maximum Likelihood (ML) or Maximum a Posteriori (MAP) estimation of the parameters θ of a Gaussian mixture. Note that here θ := {π, µ_{1:K}, Σ_{1:K}} for a mixture of K Gaussians. The ML/MAP estimation problem in this setting amounts to estimating K multivariate Gaussians with mixing weights π_1, ..., π_K.

For Maximum Likelihood estimation we can write

    \theta^* = \operatorname{argmax}_\theta \; p(y; \theta) = \operatorname{argmax}_\theta \int p(y, z; \theta) \, dz.    (1)

Here, \int p(y, z; \theta) \, dz is the marginal probability of y over z, where z represents the cluster assignment for every point in the dataset. As we have seen before, computing this integral is often intractable. For Maximum a Posteriori (MAP) estimation, we can write

    \theta^* = \operatorname{argmax}_\theta \; p(y, \theta) = \operatorname{argmax}_\theta \int p(y, z, \theta) \, dz.    (2)

1.3 Generalized hard K-means

ML and MAP estimation for Gaussian mixtures is complicated by the fact that we need to marginalize over z. Performing optimization conditioned on z is relatively straightforward. In this case we can solve for the derivative

    \nabla_\theta \log p(y, z; \theta) = \sum_{n=1}^N \nabla_\theta \log p(y_n, z_n; \theta) = 0.    (3)

As we will see below, this type of problem is analogous to the one that we already solved in the context of linear regression. It turns out that we can solve this problem in closed form for any likelihood p(y | z, θ_y) that is in the exponential family. When we marginalize over z, we obtain a different problem,

    \nabla_\theta \log p(y; \theta) = \sum_{n=1}^N \nabla_\theta \log p(y_n; \theta) = \sum_{n=1}^N \frac{\nabla_\theta \, p(y_n; \theta)}{p(y_n; \theta)},    (4)

        = \sum_{n=1}^N \frac{\nabla_\theta \sum_{k=1}^K p(y_n, z_n = k; \theta)}{p(y_n; \theta)},    (5)

        = \sum_{n=1}^N \sum_{k=1}^K \frac{\nabla_\theta \, p(y_n, z_n = k; \theta)}{p(y_n; \theta)} = 0.    (6)

It so happens that this problem is generally not tractable in closed form (we will see further on why). For this reason, we will begin by solving a simpler problem, inspired by Gibbs sampling. In this problem we alternate between two maximization steps:

    Step 1:  z_n^* = \operatorname{argmax}_{z_n} \; p(y_n, z_n; \theta)
    Step 2:  \mu_k^*, \Sigma_k^* = \operatorname{argmax}_{\mu_k, \Sigma_k} \; p(y_n, z_n, \theta)    [MAP]
             \mu_k^*, \Sigma_k^* = \operatorname{argmax}_{\mu_k, \Sigma_k} \; p(y_n, z_n; \theta)    [ML]    (7)

In Step 2, the mean and covariance are optimized by MAP or ML. Note the difference in notation: in MAP estimation, p(y_n, z_n, θ) = p(y_n | z_n, θ) p(z_n | θ) p(θ), whereas ML estimation does not place a prior distribution on the parameters, so p(y_n, z_n; θ) = p(y_n | z_n; θ) p(z_n; θ).

1.3.1 Updates for the parameters

Step 1 is trivial to implement (we can simply enumerate over all choices z_n = k). In order to implement Step 2, we will need to write down the joint density p(y_n, z_n; θ) = p(y_n | z_n; θ) p(z_n; θ), which is

    p(y, z; \theta) = \prod_{n=1}^N p(y_n, z_n; \theta) = \prod_{n=1}^N \prod_{k=1}^K p(y_n, z_n = k; \theta)^{\mathbb{I}[z_n = k]}.    (8)

Here, \mathbb{I}[z_n = k] is the indicator function, which is defined as

    \mathbb{I}[z_n = k] =
    \begin{cases}
      1, & \text{if } z_n = k, \\
      0, & \text{if } z_n \neq k.
    \end{cases}    (9)

We can express the joint p(y_n, z_n = k; θ) in terms of the likelihood p(y_n | z_n = k, µ, Σ) and the cluster assignment prior p(z_n = k; π),

    p(y_n \mid z_n = k, \mu, \Sigma) = \mathcal{N}(y_n; \mu_k, \Sigma_k),    (10)
    p(z_n = k; \pi) = \pi_k.    (11)
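As a small illustration of equations (8)-(11), the per-point joint density of a Gaussian mixture can be evaluated in code. The sketch below is our own (the function name log_joint and its arguments are assumptions, not part of the notes) and assumes NumPy and SciPy are available; it is reused in the sketches that follow.

```python
# Minimal sketch (our own, not from the notes): evaluate log p(y_n, z_n = k; theta)
# for each component k of a Gaussian mixture, as in equations (8)-(11).
import numpy as np
from scipy.stats import multivariate_normal

def log_joint(y_n, pi, mus, Sigmas):
    """Length-K array with log p(y_n, z_n = k; theta) = log N(y_n; mu_k, Sigma_k) + log pi_k."""
    K = len(pi)
    return np.array([
        multivariate_normal.logpdf(y_n, mean=mus[k], cov=Sigmas[k]) + np.log(pi[k])
        for k in range(K)
    ])
```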

To compute the ML update for the mean, we now solve for the derivative with respect to µ_l, which must be 0 at the optimum:

    \frac{\partial}{\partial \mu_l} \log p(y, z; \theta) = \frac{\partial}{\partial \mu_l} \sum_{n=1}^N \sum_{k=1}^K \mathbb{I}[z_n = k] \log p(y_n, z_n = k; \theta),    (12)

        = \sum_{n=1}^N \sum_{k=1}^K \mathbb{I}[z_n = k] \frac{\partial}{\partial \mu_l} \big[ \log \mathcal{N}(y_n; \mu_k, \Sigma_k) + \log \pi_k \big],    (13)

        = \sum_{n=1}^N \mathbb{I}[z_n = l] \frac{\partial}{\partial \mu_l} \log \mathcal{N}(y_n; \mu_l, \Sigma_l),    (14)

        = \sum_{n=1}^N \mathbb{I}[z_n = l] \, (y_n - \mu_l)^\top \Sigma_l^{-1} = 0.    (15)

If we multiply on the right by Σ_l, then we see that the optimum satisfies the condition

    0 = \sum_{n=1}^N \mathbb{I}[z_n = l] \, (y_n - \mu_l).    (16)

Solving for µ_l now gives us the update

    \mu_l = \frac{1}{N_l} \sum_{n=1}^N \mathbb{I}[z_n = l] \, y_n,
    \qquad
    N_l = \sum_{n=1}^N \mathbb{I}[z_n = l].    (17)

In other words, the optimal value µ_l is simply the mean of the points y_n that are assigned to cluster l. We can derive updates for Σ_l and π_l in an analogous manner, which yields

    \Sigma_l = \frac{1}{N_l} \sum_{n=1}^N \mathbb{I}[z_n = l] \, y_n y_n^\top - \mu_l \mu_l^\top,
    \qquad
    \pi_l = \frac{N_l}{N}.    (18)

1.3.2 Algorithm

Now that we have derived closed-form solutions for updating the parameters at each iteration, we can formally present the algorithm. We call this the generalized version of hard K-means, since the assignment of each point to a cluster is absolute: at any point during the algorithm we are certain that each data point belongs to exactly one cluster. This is called a hard assignment. A minimal implementation sketch follows the algorithm below.

Algorithm: Hard K-means (generalized version)

    Initialize: π, µ, Σ
    Repeat until the assignments z_n are unchanged:
      1. For n in 1, ..., N:
           z_n = \operatorname{argmax}_k \; \log \mathcal{N}(y_n; \mu_k, \Sigma_k) + \log \pi_k
               = \operatorname{argmax}_k \; p(z_n = k \mid y_n; \theta)
      2. For k in 1, ..., K:
           N_k = \sum_{n=1}^N \mathbb{I}[z_n = k]
           \mu_k = \frac{1}{N_k} \sum_{n=1}^N \mathbb{I}[z_n = k] \, y_n
           \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \mathbb{I}[z_n = k] \, y_n y_n^\top - \mu_k \mu_k^\top
           \pi_k = \frac{N_k}{N}
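The hard K-means updates above can be written out as the following sketch. This is a minimal illustration under our own naming conventions, not the course's reference implementation; it assumes a data matrix y of shape (N, D), parameter arrays pi (K,), mus (K, D), Sigmas (K, D, D), and reuses the hypothetical log_joint helper from the previous sketch.

```python
# Minimal sketch of generalized hard K-means (equations 16-18); assumes the
# hypothetical log_joint helper defined earlier and data y of shape (N, D).
import numpy as np

def hard_kmeans(y, pi, mus, Sigmas, max_iters=100):
    N, _ = y.shape
    K = len(pi)
    z = np.full(N, -1)
    for _ in range(max_iters):
        # Step 1: hard assignment z_n = argmax_k [ log N(y_n; mu_k, Sigma_k) + log pi_k ]
        z_new = np.array([np.argmax(log_joint(y[n], pi, mus, Sigmas)) for n in range(N)])
        if np.array_equal(z_new, z):
            break                      # assignments unchanged: stop
        z = z_new
        # Step 2: ML updates for mu_k, Sigma_k, pi_k (equations 17-18)
        for k in range(K):
            members = y[z == k]
            N_k = len(members)
            if N_k == 0:
                continue               # empty cluster: leave its parameters unchanged
            mus[k] = members.mean(axis=0)
            Sigmas[k] = (members.T @ members) / N_k - np.outer(mus[k], mus[k])
            pi[k] = N_k / N
    return z, pi, mus, Sigmas
```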

1.4 Soft K-means (Expectation Maximization)

Instead of making a hard assignment of each point to a cluster, if we are uncertain about which cluster a data point belongs to, we can assign to each point a probability of belonging to each cluster. This type of algorithm is sometimes called soft K-means, since it performs a "soft" assignment. A minimal implementation sketch is given at the end of this section.

Algorithm: Soft K-means (Expectation Maximization)

    Initialize: π, µ, Σ
    Repeat until convergence:
      1. For n in 1, ..., N:
           \gamma_{nk} := \mathbb{E}_{z_n \sim p(z_n \mid y_n; \theta)}\big[\mathbb{I}[z_n = k]\big]
           N_k = \sum_{n=1}^N \gamma_{nk}
      2. For k in 1, ..., K:
           \mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n
           \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n y_n^\top - \mu_k \mu_k^\top
           \pi_k = \frac{N_k}{N}

This expectation maximization (EM) algorithm iterates between two steps:

1. Expectation step: Given the current parameters θ, we compute the responsibility γ_nk of each cluster for each point.

2. Maximization step: Given the current responsibilities γ_nk, we compute new values for the parameters θ.

Operationally, this algorithm differs from hard K-means in that we replace \mathbb{I}[z_n = k] with the responsibilities γ_nk in the maximization step, and compute γ_nk instead of the maximum posterior value for z_n in the expectation step.

1.4.1 Example

Figure 1 illustrates the EM algorithm on data from a Gaussian mixture with three clusters. The degree of redness indicates the degree to which a point belongs to the red cluster, and similarly for blue and green. As the iterations proceed, each point is assigned to a cluster with increasing probability of belonging to that cluster. The final iteration is shown in Figure 2(a).

1.4.2 Issues

One caveat of this algorithm is that it can converge to a local optimum. Random restarts can mitigate this problem, but they do not guarantee reaching the global optimum.

Another implementation issue is that a cluster can become empty. Figure 2(b) shows that the algorithm can reach a local optimum in which one cluster contains all the points while the other two clusters are empty.
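As referenced above, the soft K-means updates of Section 1.4 can be sketched as follows. This is again our own minimal illustration (names and array shapes are assumptions, not part of the notes), reusing the hypothetical log_joint helper. Following Section 1.4.2, in practice one would run it from several random initializations.

```python
# Minimal sketch of soft K-means / EM for a Gaussian mixture (Section 1.4);
# assumes the hypothetical log_joint helper and arrays pi (K,), mus (K, D),
# Sigmas (K, D, D), y (N, D).
import numpy as np

def soft_kmeans(y, pi, mus, Sigmas, max_iters=100, tol=1e-6):
    N, _ = y.shape
    K = len(pi)
    gamma_prev = np.zeros((N, K))
    for _ in range(max_iters):
        # Expectation step: responsibilities gamma_nk = p(z_n = k | y_n; theta)
        log_p = np.array([log_joint(y[n], pi, mus, Sigmas) for n in range(N)])  # (N, K)
        gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Maximization step: weighted versions of the hard K-means updates
        N_k = np.maximum(gamma.sum(axis=0), 1e-12)   # small floor: clusters can become empty (Section 1.4.2)
        pi = N_k / N
        mus = (gamma.T @ y) / N_k[:, None]           # (K, D)
        for k in range(K):
            second_moment = (gamma[:, k, None] * y).T @ y / N_k[k]
            Sigmas[k] = second_moment - np.outer(mus[k], mus[k])
        # Stop when the responsibilities no longer change appreciably
        if np.max(np.abs(gamma - gamma_prev)) < tol:
            break
        gamma_prev = gamma
    return gamma, pi, mus, Sigmas
```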

Figure 1: Iterations of Expectation Maximization.

Figure 2: (a) Last iteration of the EM algorithm. (b) EM reaching a local optimum with two empty clusters.

2 Generalized Expectation Maximization

A more formal derivation of expectation maximization algorithms can be obtained by defining a lower bound on the log likelihood. To do so, we need to make use of Jensen's inequality.

2.1 Jensen's Inequality

Jensen's inequality relates the expectation of a convex function to the convex function of an expectation. A convex function is a function for which the area above the curve is a convex set: the line segment that connects any two points on the function lies above the function. A concave function is the opposite of a convex function, so the opposite property holds, i.e., the line segment that connects any two points on the function lies below the function.

For a convex or a concave function we can write down Jensen's inequality. For a concave function f and any two points x_1 and x_2 on the x axis, if we choose any point x = t x_1 + (1 - t) x_2 between them, where t ∈ [0, 1], the following expression holds:

    f(t x_1 + (1 - t) x_2) \geq t f(x_1) + (1 - t) f(x_2),    (19)

and the opposite holds for a convex function:

    f(t x_1 + (1 - t) x_2) \leq t f(x_1) + (1 - t) f(x_2).    (20)

An example of a convex and a concave function is shown in Figure 3.

Figure 3: A convex function (left) and a concave function (right).

Corollary: Jensen's inequality generalizes, and in particular it generalizes to random variables. For random variables we can look at the difference between a function of an expectation and the expectation of a function. Taking an expectation is like taking a (weighted) sum over different terms, so the relationship between a function of an expectation and the expectation of a function is as follows:

    \varphi(\mathbb{E}[x]) \leq \mathbb{E}[\varphi(x)], \quad \text{if } \varphi(x) \text{ is convex},
    \varphi(\mathbb{E}[x]) \geq \mathbb{E}[\varphi(x)], \quad \text{if } \varphi(x) \text{ is concave}.    (21)

We now note that the logarithm is a concave function, which implies

    \log(\mathbb{E}[x]) \geq \mathbb{E}[\log(x)].    (22)

2.2 Lower Bounds on the Log Normalizing Constant

When we are doing probabilistic modeling, we can define the notion of a normalizing constant. Suppose we have a random variable x ∼ π(x), where π(x) = γ(x)/Z is a density for which we are only able to evaluate the corresponding unnormalized density γ(x). We can use the importance sampling trick to rewrite the integral for the normalizing constant as

    Z := \int dx \, \gamma(x) = \int dx \, q(x) \frac{\gamma(x)}{q(x)} = \mathbb{E}_{x \sim q(x)}\!\left[\frac{\gamma(x)}{q(x)}\right].    (23)

If we add a log inside the expectation, then we can use Jensen's inequality. We are taking the expected value of log(γ(x)/q(x)) with respect to an arbitrary distribution q(x); by Jensen's inequality, this is less than or equal to log Z:

    \mathcal{L} := \mathbb{E}_{x \sim q(x)}\!\left[\log \frac{\gamma(x)}{q(x)}\right] \leq \log \mathbb{E}_{x \sim q(x)}\!\left[\frac{\gamma(x)}{q(x)}\right] = \log Z.    (24)

In the context of expectation maximization, we have an unnormalized density γ(z; θ) = p(y, z; θ) with a normalizing constant Z(θ) = p(y; θ). We can now introduce a distribution q(z; γ) to define the lower bound

    \mathcal{L}(\theta, \gamma) := \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log \frac{p(y, z; \theta)}{q(z; \gamma)}\right] \leq \log p(y; \theta).    (25)

If we now iterate between optimizing this bound with respect to γ and optimizing it with respect to θ, then we obtain the algorithm

Algorithm: Generalized EM

    Initialize: θ
    Repeat until the change in L(θ, γ) is below a threshold:
      1. Expectation Step:
           \gamma = \operatorname{argmax}_\gamma \; \mathcal{L}(\theta, \gamma)    (26)
      2. Maximization Step:
           \theta = \operatorname{argmax}_\theta \; \mathcal{L}(\theta, \gamma)    (27)

In this algorithm, the name "expectation" step is in some sense a misnomer; we are simply finding the parameters that maximize our objective function. That said, we will see that in practice the expectation step generally computes some set of expected sufficient statistics that can then be used to perform the maximization step.

2.3 Intermezzo: Kullback-Leibler Divergence

The KL divergence between two densities q(x) and π(x) is defined as

    \mathrm{KL}\big(q(x) \,\|\, \pi(x)\big) := \int dx \, q(x) \log \frac{q(x)}{\pi(x)}.    (28)
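To make equations (23) and (24) concrete, the following toy check (our own example, not part of the notes) estimates both E_q[γ(x)/q(x)] and the lower bound E_q[log γ(x)/q(x)] by Monte Carlo for an unnormalized Gaussian γ(x) whose normalizing constant is known.

```python
# Toy check of equations (23)-(24): Monte Carlo estimate of Z and of the
# Jensen lower bound L <= log Z for an unnormalized Gaussian gamma(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

log_gamma = lambda x: -0.5 * x**2          # gamma(x) = exp(-x^2/2), so Z = sqrt(2*pi)
log_Z_true = 0.5 * np.log(2 * np.pi)

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from the proposal q(x)
log_q = norm.logpdf(x, loc=0.0, scale=2.0)
log_w = log_gamma(x) - log_q                       # log( gamma(x) / q(x) )

log_Z_estimate = np.log(np.mean(np.exp(log_w)))    # eq. (23): E_q[gamma/q] = Z
lower_bound = np.mean(log_w)                       # eq. (24): L = E_q[log(gamma/q)] <= log Z

print(log_Z_true, log_Z_estimate, lower_bound)     # the lower bound falls below log Z
```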

2.3.1 Properties

There are two properties associated with the KL divergence. To prove the first property, we start with the negative of the KL divergence, move the negative sign into the integral (which inverts the ratio inside the log), and then apply Jensen's inequality.

1. KL(q(x) ‖ π(x)) ≥ 0 (non-negativity)

    Proof:
    -\mathrm{KL}\big(q(x) \,\|\, \pi(x)\big) = \int dx \, q(x) \log \frac{\pi(x)}{q(x)}
        = \mathbb{E}_{x \sim q(x)}\!\left[\log \frac{\pi(x)}{q(x)}\right]
        \leq \log \mathbb{E}_{x \sim q(x)}\!\left[\frac{\pi(x)}{q(x)}\right]
        = \log(1) = 0,    (29)

    because \int dx \, q(x) \frac{\pi(x)}{q(x)} = \int dx \, \pi(x) = 1.

2. KL(q(x) ‖ π(x)) = 0 ⇔ q(x) = π(x)

    Proof (for q(x) = π(x)):
    \int dx \, q(x) \log \frac{\pi(x)}{q(x)} = \int dx \, \pi(x) \log \frac{\pi(x)}{\pi(x)} = \int dx \, \pi(x) \log 1 = 0.    (30)

2.3.2 Relationship to the Lower Bound

We can relate the KL divergence to our lower bound. As shown in Equation 31, we write out the definition of the lower bound and expand it by applying Bayes' rule. The term log p(y; θ) does not depend on z and comes out of the expectation. The other part is simply the negative KL divergence, and because the KL divergence is greater than or equal to zero, the lower bound is less than or equal to log p(y; θ). An alternative (but equivalent) view is to interpret the expectation step as minimization of the Kullback-Leibler divergence between q(z; γ) and the posterior p(z | y; θ).

    \mathcal{L}(\theta, \gamma) := \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log \frac{p(y, z; \theta)}{q(z; \gamma)}\right]
        = \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log p(y; \theta) + \log \frac{p(z \mid y; \theta)}{q(z; \gamma)}\right]
        = \log p(y; \theta) - \mathbb{E}_{z \sim q(z; \gamma)}\!\left[\log \frac{q(z; \gamma)}{p(z \mid y; \theta)}\right]
        = \log p(y; \theta) - \mathrm{KL}\big(q(z; \gamma) \,\|\, p(z \mid y; \theta)\big)
        \leq \log p(y; \theta).    (31)

2.4 Role of KL minimization in Generalized Expectation Maximization

The key implication of Equation 31 is that maximizing L(θ, γ) with respect to γ is equivalent to minimizing KL(q(z; γ) ‖ p(z | y; θ)) with respect to γ. Since the KL divergence is 0 when q(z; γ) = p(z | y; θ), this choice of distribution also maximizes the lower bound.

Under this interpretation we see that the expectation step can alternately be interpreted as defining

    q(z) := p(z \mid y; \theta).    (32)
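The non-negativity of the KL divergence and the decomposition in Equation 31 can be checked numerically on a small example. The sketch below is our own illustration (not from the notes); it treats z as a discrete variable with K = 3 values so that the expectations become finite sums.

```python
# Toy check (discrete z) of KL >= 0 and of the decomposition in eq. (31):
# L(theta, gamma) = log p(y; theta) - KL( q(z) || p(z | y; theta) ).
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions given as probability vectors."""
    return np.sum(q * np.log(q / p))

# Hypothetical joint values p(y, z = k; theta) for a fixed y and K = 3
p_joint = np.array([0.10, 0.25, 0.05])
p_y = p_joint.sum()                    # p(y; theta) = sum_k p(y, z = k; theta)
p_post = p_joint / p_y                 # posterior p(z | y; theta)

q = np.array([0.2, 0.5, 0.3])          # an arbitrary distribution q(z)
L = np.sum(q * np.log(p_joint / q))    # lower bound, eq. (25), for discrete z

assert kl(q, p_post) >= 0.0                              # property 1
assert np.isclose(L, np.log(p_y) - kl(q, p_post))        # eq. (31)
assert np.isclose(np.sum(p_post * np.log(p_joint / p_post)), np.log(p_y))  # KL = 0 when q = posterior
```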

An alternate interpretation is that the expectation step computes expected values relative to p(z | y; θ), which are then used to update the parameters. To see which expectations we need to compute in a mixture model, we write out the derivative of the lower bound:

    \frac{\partial}{\partial \mu_l} \mathcal{L}(\theta, \gamma)
        = \frac{\partial}{\partial \mu_l} \sum_{n=1}^N \mathbb{E}_{q(z_n)}\!\left[\log \frac{p(y_n, z_n; \theta)}{q(z_n)}\right]
        = \sum_{n=1}^N \frac{\partial}{\partial \mu_l} \mathbb{E}_{q(z_n)}\big[\log p(y_n, z_n; \theta)\big]
        = \sum_{n=1}^N \mathbb{E}_{q(z_n)}\big[\mathbb{I}[z_n = l]\big] \, (y_n - \mu_l)^\top \Sigma_l^{-1}.    (33)

In other words, we see that the update equations are analogous to the ones we derived for hard K-means, with the exception that we now replace \mathbb{I}[z_n = k] with the expected value

    \gamma_{nk} = \mathbb{E}_{q(z_n)}\big[\mathbb{I}[z_n = k]\big]    (34)
               = \mathbb{E}_{p(z_n \mid y_n; \theta)}\big[\mathbb{I}[z_n = k]\big]    (35)
               = p(z_n = k \mid y_n; \theta)    (36)
               = \frac{p(y_n, z_n = k; \theta)}{\sum_{l=1}^K p(y_n, z_n = l; \theta)}.    (37)

When we put all of this together, we obtain the algorithm below. A sketch of a numerically stable implementation of the expectation step follows the algorithm.

Algorithm: Generalized EM for Gaussian Mixtures

    Initialize: θ
    Repeat until the change in L(θ, γ) is below a threshold:
      1. Expectation Step:
           \gamma_{nk} = \mathbb{E}_{p(z_n \mid y_n; \theta)}\big[\mathbb{I}[z_n = k]\big] = p(z_n = k \mid y_n; \theta) = \frac{p(y_n, z_n = k; \theta)}{\sum_l p(y_n, z_n = l; \theta)},    (38)
           N_k = \sum_{n=1}^N \gamma_{nk}.    (39)
      2. Maximization Step:
           \mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n,    (40)
           \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \, y_n y_n^\top - \mu_k \mu_k^\top,
           \pi_k = \frac{N_k}{N}.
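As mentioned above, the joint densities p(y_n, z_n = k; θ) in Equation 38 can underflow for high-dimensional data, so an implementation typically computes the responsibilities and log p(y; θ) in log space. The sketch below is our own (the names are assumptions, not from the notes); it uses the log-sum-exp trick and reuses the hypothetical log_joint helper from the earlier sketches.

```python
# Minimal sketch: log-space expectation step for the Generalized EM algorithm
# above. Computes gamma_nk (eq. 38) and log p(y; theta), which equals the lower
# bound L(theta, gamma) when q(z) is set to the posterior (Equation 31).
import numpy as np
from scipy.special import logsumexp

def e_step(y, pi, mus, Sigmas):
    """Return responsibilities gamma of shape (N, K) and log p(y; theta)."""
    N = y.shape[0]
    # log p(y_n, z_n = k; theta) for all n, k via the hypothetical log_joint helper
    log_p = np.array([log_joint(y[n], pi, mus, Sigmas) for n in range(N)])  # (N, K)
    log_p_y = logsumexp(log_p, axis=1)            # log p(y_n; theta) = log sum_k p(y_n, z_n = k; theta)
    gamma = np.exp(log_p - log_p_y[:, None])      # eq. (38), computed stably
    return gamma, log_p_y.sum()                   # sum_n log p(y_n; theta)
```

The returned value of log p(y; θ) can be monitored across iterations as the convergence check in the algorithm above, since the KL term in Equation 31 vanishes when q(z; γ) is the posterior.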
