Optimization For Machine Learning


Optimization for Machine Learning (Problems; Algorithms - C)
Suvrit Sra, Massachusetts Institute of Technology
PKU Summer School on Data Science (July 2017)

Course materials: http://suvrit.de/teaching.html

Some references:
– Introductory Lectures on Convex Optimization – Nesterov
– Convex Optimization – Boyd & Vandenberghe
– Nonlinear Programming – Bertsekas
– Convex Analysis – Rockafellar
– Fundamentals of Convex Analysis – Urruty, Lemaréchal
– Lectures on Modern Convex Optimization – Nemirovski
– Optimization for Machine Learning – Sra, Nowozin, Wright
– Theory of Convex Optimization for Machine Learning – Bubeck
– NIPS 2016 Optimization Tutorial – Bach, Sra

Some related courses:
– EE227A, Spring 2013 (Sra, UC Berkeley)
– 10-801, Spring 2014 (Sra, CMU)
– EE364a,b (Boyd, Stanford)
– EE236b,c (Vandenberghe, UCLA)

Venues: NIPS, ICML, UAI, AISTATS, SIOPT, Math. Prog.

Lecture Plan
– Introduction (3 lectures)
– Problems and algorithms (5 lectures)
– Non-convex optimization, perspectives (2 lectures)

Nonsmooth convergence rates

Theorem (Nesterov). Let $B = \{x : \|x - x^0\|_2 \le D\}$. Assume $x^* \in B$. There exists a convex function $f \in C_L^0(B)$ (with $L > 0$), such that for $0 \le k \le n-1$, the lower bound
$$ f(x^k) - f(x^*) \ge \frac{LD}{2(1 + \sqrt{k+1})} $$
holds for any algorithm that generates $x^k$ by linearly combining the previous iterates and subgradients.

Exercise: So design problems where we can do better!

Composite problems

Composite objectives

Frequently ML problems take the regularized form
$$ \text{minimize} \quad f(x) := \ell(x) + r(x), $$
where $\ell$ is a (smooth) loss and $r$ a regularizer.

Example: $\ell(x) = \tfrac{1}{2}\|Ax - b\|^2$ and $r(x) = \lambda\|x\|_1$
(Lasso, L1-LS, compressed sensing)

Example: $\ell(x) :=$ logistic loss, and $r(x) = \lambda\|x\|_1$
(L1-logistic regression, sparse LR)

Composite objective minimization

$$ \text{minimize} \quad f(x) := \ell(x) + r(x) $$

subgradient method: $x_{k+1} = x_k - \alpha_k g_k$, with $g_k \in \partial f(x_k)$

– subgradient method: converges slowly, at rate $O(1/\sqrt{k})$
– but: $f$ is smooth plus nonsmooth
– we should exploit the smoothness of $\ell$ for a better method!
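To make the $O(1/\sqrt{k})$ baseline concrete, here is a minimal sketch (not from the slides) of the subgradient method applied to the Lasso instance from the previous slide, $f(x) = \tfrac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$. The data, the step-size rule $\alpha_k = 1/\sqrt{k+1}$, and all names are illustrative assumptions.

```python
import numpy as np

def subgradient_lasso(A, b, lam, num_iters=1000):
    """Subgradient method for f(x) = 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    best_x, best_f = x.copy(), np.inf
    for k in range(num_iters):
        # A subgradient of f at x: gradient of the smooth part plus lam*sign(x).
        g = A.T @ (A @ x - b) + lam * np.sign(x)
        alpha = 1.0 / np.sqrt(k + 1)            # diminishing step size
        x = x - alpha * g
        f = 0.5 * np.linalg.norm(A @ x - b)**2 + lam * np.abs(x).sum()
        if f < best_f:                          # not a descent method: track best iterate
            best_x, best_f = x.copy(), f
    return best_x, best_f

# Example usage with random data (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
b = rng.standard_normal(50)
x_hat, f_hat = subgradient_lasso(A, b, lam=0.1)
```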

Proximal Gradient Method

$\min\ f(x)$ s.t. $x \in \mathcal{X}$: Projected (sub)gradient $\quad x \leftarrow P_{\mathcal{X}}(x - \alpha \nabla f(x))$

$\min\ f(x) + h(x)$: Proximal gradient $\quad x \leftarrow \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))$

$\operatorname{prox}_{\alpha h}$ denotes the proximity operator for $h$.

Why? If we can compute $\operatorname{prox}_h(x)$ easily, prox-grad converges as fast as gradient methods for smooth problems!

Proximity operator

Projection:
$$ P_{\mathcal{X}}(y) := \operatorname{argmin}_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - y\|_2^2 + \mathbb{1}_{\mathcal{X}}(x) $$

Proximity: replace $\mathbb{1}_{\mathcal{X}}$ by a closed convex function $r$:
$$ \operatorname{prox}_r(y) := \operatorname{argmin}_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - y\|_2^2 + r(x) $$

Proximity operator — [figure: $\ell_1$-norm ball of radius $\rho(\lambda)$]

Proximity operators

Exercise: Let $r(x) = \|x\|_1$. Solve $\operatorname{prox}_{\lambda r}(y)$:
$$ \min_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_1. $$
Hint 1: The above problem decomposes into $n$ independent subproblems of the form
$$ \min_{x \in \mathbb{R}}\ \tfrac{1}{2}(x - y)^2 + \lambda|x|. $$
Hint 2: Consider the two cases: either $x = 0$ or $x \neq 0$.

Exercise: Moreau decomposition $y = \operatorname{prox}_h(y) + \operatorname{prox}_{h^*}(y)$
(notice the analogy to $V = S + S^\perp$ in linear algebra)
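A sketch of the answer to the first exercise (not worked out on the slide): the prox of $\lambda\|\cdot\|_1$ is element-wise soft thresholding. The Moreau decomposition can also be checked numerically, using the fact that the conjugate of $h = \lambda\|\cdot\|_1$ is the indicator of the $\ell_\infty$ ball of radius $\lambda$, so $\operatorname{prox}_{h^*}$ is a clip; the function names below are my own.

```python
import numpy as np

def soft_threshold(y, lam):
    """prox_{lam*||.||_1}(y): solves min_x 0.5*||x - y||^2 + lam*||x||_1 element-wise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_conjugate(y, lam):
    """prox of h*, where h = lam*||.||_1, so h* is the indicator of {||x||_inf <= lam}."""
    return np.clip(y, -lam, lam)

y = np.array([3.0, -0.2, 0.5, -4.0])
lam = 1.0
# Moreau decomposition: y = prox_h(y) + prox_{h*}(y)
assert np.allclose(soft_threshold(y, lam) + prox_conjugate(y, lam), y)
```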

How to cook up prox-grad?

Lemma. $x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$ for any $\alpha > 0$.

$0 \in \nabla f(x^*) + \partial h(x^*)$
$\iff 0 \in \alpha \nabla f(x^*) + \alpha \partial h(x^*)$
$\iff x^* - \alpha \nabla f(x^*) \in x^* + \alpha \partial h(x^*) = (I + \alpha \partial h)(x^*)$
$\iff x^* = (I + \alpha \partial h)^{-1}(x^* - \alpha \nabla f(x^*))$
$\iff x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$

The above fixed-point equation suggests the iteration
$$ x_{k+1} = \operatorname{prox}_{\alpha_k h}(x_k - \alpha_k \nabla f(x_k)). $$
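A minimal sketch of this fixed-point iteration, specialized to the earlier Lasso example ($f(x) = \tfrac{1}{2}\|Ax-b\|^2$, $h = \lambda\|\cdot\|_1$), i.e. the classical ISTA scheme; the constant step $\alpha = 1/L$ with $L = \|A^\top A\|_2$ and the function names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(y, lam):
    # prox of lam*||.||_1 (same helper as in the earlier sketch)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_grad_lasso(A, b, lam, num_iters=500):
    """Proximal gradient: x_{k+1} = prox_{alpha*h}(x_k - alpha*grad f(x_k)),
    with f(x) = 0.5*||Ax - b||^2 and h(x) = lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of grad f
    alpha = 1.0 / L                     # constant step size alpha = 1/L
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)        # gradient of the smooth part
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x
```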

Convergence

Proximal-gradient works, why?

$$ x_{k+1} = \operatorname{prox}_{\alpha_k h}(x_k - \alpha_k \nabla f(x_k)) $$
$$ x_{k+1} = x_k - \alpha_k G_{\alpha_k}(x_k). $$

Gradient mapping: the "gradient-like object"
$$ G_\alpha(x) = \frac{1}{\alpha}\big(x - \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))\big) $$

– Our lemma shows: $G_\alpha(x) = 0$ if and only if $x$ is optimal
– So $G_\alpha$ is analogous to $\nabla f$
– If $x$ is locally optimal, then $G_\alpha(x) = 0$ (nonconvex $f$)
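A small sketch of the gradient mapping $G_\alpha$ as defined above, written for generic callables `grad_f` and `prox_h` (the names and the `prox_h(v, t)` signature are my own conventions); at a minimizer of $f + h$ the returned vector should be numerically zero.

```python
import numpy as np

def gradient_mapping(x, alpha, grad_f, prox_h):
    """G_alpha(x) = (1/alpha) * (x - prox_{alpha*h}(x - alpha*grad_f(x))).

    grad_f(x)    -> gradient of the smooth part f at x
    prox_h(v, t) -> prox_{t*h}(v)
    """
    return (x - prox_h(x - alpha * grad_f(x), alpha)) / alpha

# Example: f(x) = 0.5*||x - c||^2, h = indicator of the nonnegative orthant.
c = np.array([1.0, -2.0])
grad_f = lambda x: x - c
prox_h = lambda v, t: np.maximum(v, 0.0)    # projection onto x >= 0
x_star = np.maximum(c, 0.0)                 # minimizer of f + h
print(gradient_mapping(x_star, alpha=0.5, grad_f=grad_f, prox_h=prox_h))  # ~[0., 0.]
```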

Convergence analysis

Assumption: Lipschitz continuous gradient, denoted $f \in C_L^1$:
$$ \|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 $$

– Gradient vectors of nearby points are close to each other
– Objective function has "bounded curvature"
– The speed at which the gradient varies is bounded

Lemma (Descent). Let $f \in C_L^1$. Then,
$$ f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|_2^2. $$

For convex $f$, compare with
$$ f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle. $$

Descent lemma

Proof. Since $f \in C_L^1$, by Taylor's theorem, for the vector $z_t = x + t(y - x)$ we have
$$ f(y) - f(x) = \int_0^1 \langle \nabla f(z_t), y - x\rangle\, dt. $$
Add and subtract $\langle \nabla f(x), y - x\rangle$ on the rhs; we have
$$ f(y) - f(x) - \langle \nabla f(x), y - x\rangle = \int_0^1 \langle \nabla f(z_t) - \nabla f(x), y - x\rangle\, dt $$
$$ \big| f(y) - f(x) - \langle \nabla f(x), y - x\rangle \big| \le \int_0^1 \big|\langle \nabla f(z_t) - \nabla f(x), y - x\rangle\big|\, dt $$
$$ \le \int_0^1 \|\nabla f(z_t) - \nabla f(x)\|_2 \cdot \|y - x\|_2\, dt $$
$$ \le L \int_0^1 t\|x - y\|_2^2\, dt = \tfrac{L}{2}\|x - y\|_2^2. $$

Bounds $f(y)$ around $x$ with quadratic functions.
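A quick numerical illustration of the descent lemma (my own example, not from the slides) for a convex quadratic $f(x) = \tfrac{1}{2}x^\top Q x$, whose gradient is $L$-Lipschitz with $L = \|Q\|_2$; the quadratic model at $x$ upper-bounds $f$ at randomly drawn $y$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M.T @ M                                  # symmetric PSD, so f is convex
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = np.linalg.norm(Q, 2)                     # Lipschitz constant of grad f

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x)**2
    assert f(y) <= upper + 1e-10             # descent lemma: quadratic upper bound
```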

Descent lemma – corollary

$$ f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|_2^2 $$
Let $y = x - \alpha G_\alpha(x)$; then
$$ f(y) \le f(x) - \alpha\langle \nabla f(x), G_\alpha(x)\rangle + \tfrac{\alpha^2 L}{2}\|G_\alpha(x)\|_2^2. $$

Corollary. So if $0 < \alpha \le 1/L$, we have
$$ f(y) \le f(x) - \alpha\langle \nabla f(x), G_\alpha(x)\rangle + \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2. $$

Lemma. Let $y = x - \alpha G_\alpha(x)$. Then, for any $z$ we have
$$ f(y) + h(y) \le f(z) + h(z) + \langle G_\alpha(x), x - z\rangle - \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2. $$
Exercise: Prove! (hint: $f, h$ are convex, and $G_\alpha(x) - \nabla f(x) \in \partial h(y)$)

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Suvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Use z x in corollary to obtain progress in terms of iterates:φ(x0 ) φ hGα (x), x x i α2 kGα (x)k22Suvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Use z x in corollary to obtain progress in terms of iterates:φ(x0 ) φ hGα (x), x x i α2 kGα (x)k22 1 2hαGα (x), x x i kαGα (x)k222α 1 kx x k22 kx x αGα (x)k222αSuvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Use z x in corollary to obtain progress in terms of iterates:φ(x0 ) φ hGα (x), x x i α2 kGα (x)k22 1 2hαGα (x), x x i kαGα (x)k222α 1 kx x k22 kx x αGα (x)k222α 1 kx x k22 kx0 x k22 .2αSuvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence rate

Set $x \leftarrow x_k$, $x' \leftarrow x_{k+1}$, and $\alpha = 1/L$. Then add up:
$$ \sum_{i=1}^{k+1} \big(\phi(x_i) - \phi^*\big) \le \tfrac{L}{2} \sum_{i=1}^{k+1} \big( \|x_{i-1} - x^*\|_2^2 - \|x_i - x^*\|_2^2 \big) $$
$$ = \tfrac{L}{2}\big( \|x_0 - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2 \big) $$
$$ \le \tfrac{L}{2}\|x_0 - x^*\|_2^2. $$

Since $(\phi(x_k))$ is a decreasing sequence, it follows that
$$ \phi(x_{k+1}) - \phi^* \le \frac{1}{k+1}\sum_{i=1}^{k+1} \big(\phi(x_i) - \phi^*\big) \le \frac{L}{2(k+1)}\|x_0 - x^*\|_2^2. $$
This is the well-known $O(1/k)$ rate.

– But for $C_L^1$ convex functions, the optimal rate is $O(1/k^2)$!

Accelerated Proximal Gradient

$$ \min\ \phi(x) = f(x) + h(x) $$

Let $x^0 = y^0 \in \operatorname{dom} h$. For $k \ge 1$:
$$ x^k = \operatorname{prox}_{\alpha_k h}\big(y^{k-1} - \alpha_k \nabla f(y^{k-1})\big) $$
$$ y^k = x^k + \frac{k-1}{k+2}\,(x^k - x^{k-1}). $$

Framework due to: Nesterov (1983, 2004); also Beck, Teboulle (2009). Simplified analysis: Tseng (2008).

– Uses extra "memory" for interpolation
– Same computational cost as ordinary prox-grad
– Convergence rate theoretically optimal:
$$ \phi(x^k) - \phi^* \le \frac{2L\|x^0 - x^*\|_2^2}{(k+1)^2}. $$
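A minimal sketch of the accelerated scheme above, again on the Lasso instance used earlier with $\alpha_k = 1/L$ (these choices, and the names, are illustrative); it follows the $y$-update with momentum weight $(k-1)/(k+2)$ exactly as stated.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def accelerated_prox_grad_lasso(A, b, lam, num_iters=500):
    """Accelerated proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)
    alpha = 1.0 / L
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()                                   # x^0 = y^0
    for k in range(1, num_iters + 1):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - alpha * grad, alpha * lam)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)    # interpolation/momentum step
        x_prev = x
    return x_prev
```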

Proximal splitting methods

$$ \phi(x) = f(x) + h(x) $$

– Direct use of prox-grad is not easy
– Requires computation of $\operatorname{prox}_{\lambda(f+h)}$ (i.e., $(I + \lambda(\partial f + \partial h))^{-1}$)

Example:
$$ \min\ \underbrace{\tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_2}_{f(x)} + \underbrace{\mu \sum_{i=1}^{n-1} |x_{i+1} - x_i|}_{h(x)}. $$

– But a good feature: $\operatorname{prox}_f$ and $\operatorname{prox}_h$ are separately easier
– Can we exploit that?

Proximal splitting – operator notation

– If $(I + \partial f + \partial h)^{-1}$ is hard, but $(I + \partial f)^{-1}$ and $(I + \partial h)^{-1}$ are "easy"
– Let us derive a fixed-point equation that "splits" the operators

Assume we are solving
$$ \min\ f(x) + h(x), $$
where both $f$ and $h$ are convex but potentially nondifferentiable.

Notice: We implicitly assumed $\partial(f + h) = \partial f + \partial h$.

Proximal splitting

$$ 0 \in \partial f(x) + \partial h(x) $$
$$ 2x \in (I + \partial f)(x) + (I + \partial h)(x) $$

Key idea of splitting: a new variable!
$$ z \in (I + \partial h)(x) \iff x = \operatorname{prox}_h(z) $$
$$ 2x - z \in (I + \partial f)(x) \iff x = (I + \partial f)^{-1}(2x - z) $$

– Not a fixed-point equation yet
– We need one more idea

Douglas-Rachford splitting

Reflection operator:
$$ R_h(z) := 2\operatorname{prox}_h(z) - z $$

Douglas-Rachford method:
$$ z \in (I + \partial h)(x), \quad x = \operatorname{prox}_h(z) \ \Longrightarrow\ R_h(z) = 2x - z $$

$$ 0 \in \partial f(x) + \partial h(x) $$
$$ 2x \in (I + \partial f)(x) + (I + \partial h)(x) $$
$$ 2x - z \in (I + \partial f)(x) $$
$$ x = \operatorname{prox}_f(R_h(z)) $$
but $R_h(z) = 2x - z \iff z = 2x - R_h(z)$, so
$$ z = 2\operatorname{prox}_f(R_h(z)) - R_h(z) = R_f(R_h(z)). $$
Finally, $z$ is on both sides of the equation.

Douglas-Rachford method

$$ 0 \in \partial f(x) + \partial h(x) \iff \begin{cases} x = \operatorname{prox}_h(z) \\ z = R_f(R_h(z)) \end{cases} $$

DR method: given $z^0$, iterate for $k \ge 0$:
$$ x^k = \operatorname{prox}_h(z^k) $$
$$ v^k = \operatorname{prox}_f(2x^k - z^k) $$
$$ z^{k+1} = z^k + \gamma_k (v^k - x^k) $$

Theorem. If $f + h$ admits minimizers, and $(\gamma_k)$ satisfies
$$ \gamma_k \in [0, 2], \qquad \sum\nolimits_k \gamma_k (2 - \gamma_k) = \infty, $$
then the DR iterates $v^k$ and $x^k$ converge to a minimizer.
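A sketch of the DR iteration above with $\gamma_k = 1$, on a toy instance where both proxes are available in closed form: $f(x) = \tfrac{1}{2}\|x - b\|^2$ (so $\operatorname{prox}_f(v) = (v + b)/2$ with unit prox parameter) and $h = \lambda\|\cdot\|_1$ (soft thresholding). The instance and names are illustrative, not from the slides.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def douglas_rachford(prox_f, prox_h, z0, num_iters=200, gamma=1.0):
    """DR iteration: x = prox_h(z), v = prox_f(2x - z), z <- z + gamma*(v - x)."""
    z = z0.copy()
    for _ in range(num_iters):
        x = prox_h(z)
        v = prox_f(2 * x - z)
        z = z + gamma * (v - x)
    return prox_h(z)

# Toy instance: minimize 0.5*||x - b||^2 + lam*||x||_1.
b, lam = np.array([2.0, -0.3, 1.5]), 0.5
prox_f = lambda v: (v + b) / 2.0             # prox of 0.5*||x - b||^2 (unit parameter)
prox_h = lambda v: soft_threshold(v, lam)    # prox of lam*||x||_1
x = douglas_rachford(prox_f, prox_h, np.zeros_like(b))
print(np.allclose(x, soft_threshold(b, lam), atol=1e-6))   # closed-form minimizer
```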

Douglas-Rachford method

For $\gamma_k = 1$, we have
$$ z^{k+1} = z^k + v^k - x^k $$
$$ z^{k+1} = z^k + \operatorname{prox}_f\big(2\operatorname{prox}_h(z^k) - z^k\big) - \operatorname{prox}_h(z^k) $$

Dropping superscripts and writing $P \equiv \operatorname{prox}$, we have $z \leftarrow Tz$ with
$$ T = I + P_f(2P_h - I) - P_h. $$

Lemma. DR can be written as $z \leftarrow \tfrac{1}{2}(R_f R_h + I)z$, where $R_f$ denotes the reflection operator $2P_f - I$ (similarly $R_h$).
Exercise: Prove this claim.
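The lemma can also be sanity-checked numerically (a sketch, using the same toy proxes as in the previous snippet): for a random $z$, one explicit DR step ($\gamma = 1$) coincides with $\tfrac{1}{2}(R_f R_h + I)z$.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

b, lam = np.array([2.0, -0.3, 1.5]), 0.5
prox_f = lambda v: (v + b) / 2.0
prox_h = lambda v: soft_threshold(v, lam)
refl_f = lambda v: 2 * prox_f(v) - v         # R_f = 2 P_f - I
refl_h = lambda v: 2 * prox_h(v) - v         # R_h = 2 P_h - I

rng = np.random.default_rng(2)
z = rng.standard_normal(3)
dr_step = z + prox_f(2 * prox_h(z) - z) - prox_h(z)     # T = I + P_f(2P_h - I) - P_h
averaged = 0.5 * (refl_f(refl_h(z)) + z)                # (1/2)(R_f R_h + I) z
print(np.allclose(dr_step, averaged))                   # True
```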

Proximal methods – cornucopia

– Douglas-Rachford splitting
– ADMM (special case of DR on the dual)
– Proximal-Dykstra
– Proximal methods for $f_1 + f_2 + \cdots + f_n$
– Peaceman-Rachford
– Proximal quasi-Newton, Newton
– Many other variations

Best approximation problem

$$ \min\ \delta_A(x) + \delta_B(x) \quad \text{where } A \cap B = \emptyset. $$
Can we use DR?

Using a clever analysis of Bauschke & Combettes (2004), DR can still be applied! However, it generates diverging iterates, which can be "projected back" to obtain a solution to
$$ \min\ \|a - b\|_2, \quad a \in A,\ b \in B. $$
See Jegelka, Bach, Sra (NIPS 2013) for an example.

ADMM

Let us consider a separable objective with constraints:
$$ \min\ f(x) + g(z) \quad \text{s.t.}\quad Ax + Bz = c. $$

– Objective function separated into $x$ and $z$ variables
– The constraint prevents a trivial decoupling
– Introduce the augmented Lagrangian (AL)
$$ L_\rho(x, z, y) := f(x) + g(z) + y^T(Ax + Bz - c) + \tfrac{\rho}{2}\|Ax + Bz - c\|_2^2 $$
– Now, a Gauss-Seidel style update to the AL (see the sketch below):
$$ x^{k+1} = \operatorname{argmin}_x\ L_\rho(x, z^k, y^k) $$
$$ z^{k+1} = \operatorname{argmin}_z\ L_\rho(x^{k+1}, z, y^k) $$
$$ y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c) $$
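A minimal sketch of these updates for the Lasso in consensus form: $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, $g(z) = \lambda\|z\|_1$, with constraint $x - z = 0$ (so the constraint matrices are $I$ and $-I$, $c = 0$). It uses the scaled dual variable $u = y/\rho$, which is equivalent to the $y$-update above; the problem data and names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam, rho=1.0, num_iters=200):
    """ADMM for min 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0.
    Uses the scaled dual variable u = y / rho."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    Q = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(num_iters):
        # x-update solves (A^T A + rho*I) x = A^T b + rho*(z - u)
        x = np.linalg.solve(Q, Atb + rho * (z - u))
        # z-update is a prox of lam*||.||_1 with parameter lam/rho
        z = soft_threshold(x + u, lam / rho)
        # scaled dual update: u <- u + (x - z)
        u = u + x - z
    return z

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 80))
b = rng.standard_normal(40)
x_hat = admm_lasso(A, b, lam=0.1)
```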

