Optimization For Machine Learning


Optimization for Machine Learning (Problems; Algorithms - C)
Suvrit Sra, Massachusetts Institute of Technology
PKU Summer School on Data Science (July 2017)

Course materials: http://suvrit.de/teaching.html

Some references:
– Introductory Lectures on Convex Optimization – Nesterov
– Convex Optimization – Boyd & Vandenberghe
– Nonlinear Programming – Bertsekas
– Convex Analysis – Rockafellar
– Fundamentals of Convex Analysis – Urruty, Lemaréchal
– Lectures on Modern Convex Optimization – Nemirovski
– Optimization for Machine Learning – Sra, Nowozin, Wright
– Theory of Convex Optimization for Machine Learning – Bubeck
– NIPS 2016 Optimization Tutorial – Bach, Sra

Some related courses:
– EE227A, Spring 2013 (Sra, UC Berkeley)
– 10-801, Spring 2014 (Sra, CMU)
– EE364a,b (Boyd, Stanford)
– EE236b,c (Vandenberghe, UCLA)

Venues: NIPS, ICML, UAI, AISTATS, SIOPT, Math. Prog.

Lecture Plan
– Introduction (3 lectures)
– Problems and algorithms (5 lectures)
– Non-convex optimization, perspectives (2 lectures)

Nonsmooth convergence rates

Theorem (Nesterov). Let $B = \{x : \|x - x^0\|_2 \le D\}$. Assume $x^* \in B$. There exists a convex function $f \in C_L^0(B)$ (with $L > 0$), such that for $0 \le k \le n-1$, the lower bound
$$ f(x^k) - f(x^*) \ge \frac{LD}{2(1 + \sqrt{k+1})} $$
holds for any algorithm that generates $x^k$ by linearly combining the previous iterates and subgradients.

Exercise: So design problems where we can do better!

Composite problems

Composite objectives

Frequently ML problems take the regularized form
$$ \text{minimize} \quad f(x) := \ell(x) + r(x), $$
where $\ell$ is a (smooth) loss and $r$ a regularizer.

Example: $\ell(x) = \tfrac{1}{2}\|Ax - b\|^2$ and $r(x) = \lambda\|x\|_1$
(Lasso, L1-LS, compressed sensing)

Example: $\ell(x) :=$ logistic loss, and $r(x) = \lambda\|x\|_1$
(L1-logistic regression, sparse LR)

Composite objective minimization

$$ \text{minimize} \quad f(x) := \ell(x) + r(x) $$

subgradient method: $x_{k+1} = x_k - \alpha_k g_k$, with $g_k \in \partial f(x_k)$

– subgradient method: converges slowly, at rate $O(1/\sqrt{k})$
– but: $f$ is smooth plus nonsmooth
– we should exploit the smoothness of $\ell$ for a better method!
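To make the $O(1/\sqrt{k})$ baseline concrete, here is a minimal sketch (not from the slides) of the subgradient method applied to the Lasso instance from the previous slide, $f(x) = \tfrac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$. The data, the step-size rule $\alpha_k = 1/\sqrt{k+1}$, and all names are illustrative assumptions.

```python
import numpy as np

def subgradient_lasso(A, b, lam, num_iters=1000):
    """Subgradient method for f(x) = 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    best_x, best_f = x.copy(), np.inf
    for k in range(num_iters):
        # A subgradient of f at x: gradient of the smooth part plus lam*sign(x).
        g = A.T @ (A @ x - b) + lam * np.sign(x)
        alpha = 1.0 / np.sqrt(k + 1)            # diminishing step size
        x = x - alpha * g
        f = 0.5 * np.linalg.norm(A @ x - b)**2 + lam * np.abs(x).sum()
        if f < best_f:                          # not a descent method: track best iterate
            best_x, best_f = x.copy(), f
    return best_x, best_f

# Example usage with random data (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
b = rng.standard_normal(50)
x_hat, f_hat = subgradient_lasso(A, b, lam=0.1)
```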

Proximal Gradient Method

$\min\ f(x)$ s.t. $x \in \mathcal{X}$: Projected (sub)gradient $\quad x \leftarrow P_{\mathcal{X}}(x - \alpha \nabla f(x))$

$\min\ f(x) + h(x)$: Proximal gradient $\quad x \leftarrow \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))$

$\operatorname{prox}_{\alpha h}$ denotes the proximity operator for $h$.

Why? If we can compute $\operatorname{prox}_h(x)$ easily, prox-grad converges as fast as gradient methods for smooth problems!

Proximity operator

Projection:
$$ P_{\mathcal{X}}(y) := \operatorname{argmin}_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - y\|_2^2 + \mathbb{1}_{\mathcal{X}}(x) $$

Proximity: replace $\mathbb{1}_{\mathcal{X}}$ by a closed convex function $r$:
$$ \operatorname{prox}_r(y) := \operatorname{argmin}_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - y\|_2^2 + r(x) $$

Proximity operator — [figure: $\ell_1$-norm ball of radius $\rho(\lambda)$]

Proximity operators

Exercise: Let $r(x) = \|x\|_1$. Solve $\operatorname{prox}_{\lambda r}(y)$:
$$ \min_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_1. $$
Hint 1: The above problem decomposes into $n$ independent subproblems of the form
$$ \min_{x \in \mathbb{R}}\ \tfrac{1}{2}(x - y)^2 + \lambda|x|. $$
Hint 2: Consider the two cases: either $x = 0$ or $x \neq 0$.

Exercise: Moreau decomposition $y = \operatorname{prox}_h(y) + \operatorname{prox}_{h^*}(y)$
(notice the analogy to $V = S + S^\perp$ in linear algebra)
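A sketch of the answer to the first exercise (not worked out on the slide): the prox of $\lambda\|\cdot\|_1$ is element-wise soft thresholding. The Moreau decomposition can also be checked numerically, using the fact that the conjugate of $h = \lambda\|\cdot\|_1$ is the indicator of the $\ell_\infty$ ball of radius $\lambda$, so $\operatorname{prox}_{h^*}$ is a clip; the function names below are my own.

```python
import numpy as np

def soft_threshold(y, lam):
    """prox_{lam*||.||_1}(y): solves min_x 0.5*||x - y||^2 + lam*||x||_1 element-wise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_conjugate(y, lam):
    """prox of h*, where h = lam*||.||_1, so h* is the indicator of {||x||_inf <= lam}."""
    return np.clip(y, -lam, lam)

y = np.array([3.0, -0.2, 0.5, -4.0])
lam = 1.0
# Moreau decomposition: y = prox_h(y) + prox_{h*}(y)
assert np.allclose(soft_threshold(y, lam) + prox_conjugate(y, lam), y)
```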

How to cook up prox-grad?

Lemma. $x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$ for any $\alpha > 0$.

$0 \in \nabla f(x^*) + \partial h(x^*)$
$\iff 0 \in \alpha \nabla f(x^*) + \alpha \partial h(x^*)$
$\iff x^* - \alpha \nabla f(x^*) \in x^* + \alpha \partial h(x^*) = (I + \alpha \partial h)(x^*)$
$\iff x^* = (I + \alpha \partial h)^{-1}(x^* - \alpha \nabla f(x^*))$
$\iff x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$

The above fixed-point equation suggests the iteration
$$ x_{k+1} = \operatorname{prox}_{\alpha_k h}(x_k - \alpha_k \nabla f(x_k)). $$
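A minimal sketch of this fixed-point iteration, specialized to the earlier Lasso example ($f(x) = \tfrac{1}{2}\|Ax-b\|^2$, $h = \lambda\|\cdot\|_1$), i.e. the classical ISTA scheme; the constant step $\alpha = 1/L$ with $L = \|A^\top A\|_2$ and the function names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(y, lam):
    # prox of lam*||.||_1 (same helper as in the earlier sketch)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_grad_lasso(A, b, lam, num_iters=500):
    """Proximal gradient: x_{k+1} = prox_{alpha*h}(x_k - alpha*grad f(x_k)),
    with f(x) = 0.5*||Ax - b||^2 and h(x) = lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of grad f
    alpha = 1.0 / L                     # constant step size alpha = 1/L
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)        # gradient of the smooth part
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x
```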

Convergence

Proximal-gradient works, why?

$$ x_{k+1} = \operatorname{prox}_{\alpha_k h}(x_k - \alpha_k \nabla f(x_k)) $$
$$ x_{k+1} = x_k - \alpha_k G_{\alpha_k}(x_k). $$

Gradient mapping: the "gradient-like object"
$$ G_\alpha(x) = \frac{1}{\alpha}\big(x - \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))\big) $$

– Our lemma shows: $G_\alpha(x) = 0$ if and only if $x$ is optimal
– So $G_\alpha$ is analogous to $\nabla f$
– If $x$ is locally optimal, then $G_\alpha(x) = 0$ (nonconvex $f$)
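A small sketch of the gradient mapping $G_\alpha$ as defined above, written for generic callables `grad_f` and `prox_h` (the names and the `prox_h(v, t)` signature are my own conventions); at a minimizer of $f + h$ the returned vector should be numerically zero.

```python
import numpy as np

def gradient_mapping(x, alpha, grad_f, prox_h):
    """G_alpha(x) = (1/alpha) * (x - prox_{alpha*h}(x - alpha*grad_f(x))).

    grad_f(x)    -> gradient of the smooth part f at x
    prox_h(v, t) -> prox_{t*h}(v)
    """
    return (x - prox_h(x - alpha * grad_f(x), alpha)) / alpha

# Example: f(x) = 0.5*||x - c||^2, h = indicator of the nonnegative orthant.
c = np.array([1.0, -2.0])
grad_f = lambda x: x - c
prox_h = lambda v, t: np.maximum(v, 0.0)    # projection onto x >= 0
x_star = np.maximum(c, 0.0)                 # minimizer of f + h
print(gradient_mapping(x_star, alpha=0.5, grad_f=grad_f, prox_h=prox_h))  # ~[0., 0.]
```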

Convergence analysis

Assumption: Lipschitz continuous gradient, denoted $f \in C_L^1$:
$$ \|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 $$

– Gradient vectors of nearby points are close to each other
– Objective function has "bounded curvature"
– The speed at which the gradient varies is bounded

Lemma (Descent). Let $f \in C_L^1$. Then,
$$ f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|_2^2. $$

For convex $f$, compare with
$$ f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle. $$

Descent lemma

Proof. Since $f \in C_L^1$, by Taylor's theorem, for the vector $z_t = x + t(y - x)$ we have
$$ f(y) - f(x) = \int_0^1 \langle \nabla f(z_t), y - x\rangle\, dt. $$
Add and subtract $\langle \nabla f(x), y - x\rangle$ on the rhs; we have
$$ f(y) - f(x) - \langle \nabla f(x), y - x\rangle = \int_0^1 \langle \nabla f(z_t) - \nabla f(x), y - x\rangle\, dt $$
$$ \big| f(y) - f(x) - \langle \nabla f(x), y - x\rangle \big| \le \int_0^1 \big|\langle \nabla f(z_t) - \nabla f(x), y - x\rangle\big|\, dt $$
$$ \le \int_0^1 \|\nabla f(z_t) - \nabla f(x)\|_2 \cdot \|y - x\|_2\, dt $$
$$ \le L \int_0^1 t\|x - y\|_2^2\, dt = \tfrac{L}{2}\|x - y\|_2^2. $$

Bounds $f(y)$ around $x$ with quadratic functions.
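A quick numerical illustration of the descent lemma (my own example, not from the slides) for a convex quadratic $f(x) = \tfrac{1}{2}x^\top Q x$, whose gradient is $L$-Lipschitz with $L = \|Q\|_2$; the quadratic model at $x$ upper-bounds $f$ at randomly drawn $y$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M.T @ M                                  # symmetric PSD, so f is convex
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = np.linalg.norm(Q, 2)                     # Lipschitz constant of grad f

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x)**2
    assert f(y) <= upper + 1e-10             # descent lemma: quadratic upper bound
```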

Descent lemma – corollary

$$ f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|_2^2 $$
Let $y = x - \alpha G_\alpha(x)$; then
$$ f(y) \le f(x) - \alpha\langle \nabla f(x), G_\alpha(x)\rangle + \tfrac{\alpha^2 L}{2}\|G_\alpha(x)\|_2^2. $$

Corollary. So if $0 < \alpha \le 1/L$, we have
$$ f(y) \le f(x) - \alpha\langle \nabla f(x), G_\alpha(x)\rangle + \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2. $$

Lemma. Let $y = x - \alpha G_\alpha(x)$. Then, for any $z$ we have
$$ f(y) + h(y) \le f(z) + h(z) + \langle G_\alpha(x), x - z\rangle - \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2. $$
Exercise: Prove! (hint: $f, h$ are convex, and $G_\alpha(x) - \nabla f(x) \in \partial h(y)$)

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Suvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Use z x in corollary to obtain progress in terms of iterates:φ(x0 ) φ hGα (x), x x i α2 kGα (x)k22Suvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Use z x in corollary to obtain progress in terms of iterates:φ(x0 ) φ hGα (x), x x i α2 kGα (x)k22 1 2hαGα (x), x x i kαGα (x)k222α 1 kx x k22 kx x αGα (x)k222αSuvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence analysisWe’ve actually shown x0 x αGα (x) is a descent method.Write φ f h; plug in z x to obtainφ(x0 ) φ(x) α2 kGα (x)k22 .Exercise: Why this inequality suffices to show convergence.Use z x in corollary to obtain progress in terms of iterates:φ(x0 ) φ hGα (x), x x i α2 kGα (x)k22 1 2hαGα (x), x x i kαGα (x)k222α 1 kx x k22 kx x αGα (x)k222α 1 kx x k22 kx0 x k22 .2αSuvrit Sra (suvrit@mit.edu)Optimization for Machine Learning18 / 29

Convergence rate

Set $x \leftarrow x_k$, $x' \leftarrow x_{k+1}$, and $\alpha = 1/L$. Then add up:
$$ \sum_{i=1}^{k+1} \big(\phi(x_i) - \phi^*\big) \le \tfrac{L}{2} \sum_{i=1}^{k+1} \big( \|x_{i-1} - x^*\|_2^2 - \|x_i - x^*\|_2^2 \big) $$
$$ = \tfrac{L}{2}\big( \|x_0 - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2 \big) $$
$$ \le \tfrac{L}{2}\|x_0 - x^*\|_2^2. $$

Since $(\phi(x_k))$ is a decreasing sequence, it follows that
$$ \phi(x_{k+1}) - \phi^* \le \frac{1}{k+1}\sum_{i=1}^{k+1} \big(\phi(x_i) - \phi^*\big) \le \frac{L}{2(k+1)}\|x_0 - x^*\|_2^2. $$
This is the well-known $O(1/k)$ rate.

– But for $C_L^1$ convex functions, the optimal rate is $O(1/k^2)$!

Accelerated Proximal Gradient

$$ \min\ \phi(x) = f(x) + h(x) $$

Let $x^0 = y^0 \in \operatorname{dom} h$. For $k \ge 1$:
$$ x^k = \operatorname{prox}_{\alpha_k h}\big(y^{k-1} - \alpha_k \nabla f(y^{k-1})\big) $$
$$ y^k = x^k + \frac{k-1}{k+2}\,(x^k - x^{k-1}). $$

Framework due to: Nesterov (1983, 2004); also Beck, Teboulle (2009). Simplified analysis: Tseng (2008).

– Uses extra "memory" for interpolation
– Same computational cost as ordinary prox-grad
– Convergence rate theoretically optimal:
$$ \phi(x^k) - \phi^* \le \frac{2L\|x^0 - x^*\|_2^2}{(k+1)^2}. $$
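A minimal sketch of the accelerated scheme above, again on the Lasso instance used earlier with $\alpha_k = 1/L$ (these choices, and the names, are illustrative); it follows the $y$-update with momentum weight $(k-1)/(k+2)$ exactly as stated.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def accelerated_prox_grad_lasso(A, b, lam, num_iters=500):
    """Accelerated proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)
    alpha = 1.0 / L
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()                                   # x^0 = y^0
    for k in range(1, num_iters + 1):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - alpha * grad, alpha * lam)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)    # interpolation/momentum step
        x_prev = x
    return x_prev
```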

Proximal splitting methods

$$ \phi(x) = f(x) + h(x) $$

– Direct use of prox-grad is not easy
– Requires computation of $\operatorname{prox}_{\lambda(f+h)}$ (i.e., $(I + \lambda(\partial f + \partial h))^{-1}$)

Example:
$$ \min\ \underbrace{\tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_2}_{f(x)} + \underbrace{\mu \sum_{i=1}^{n-1} |x_{i+1} - x_i|}_{h(x)}. $$

– But a good feature: $\operatorname{prox}_f$ and $\operatorname{prox}_h$ are separately easier
– Can we exploit that?

Proximal splitting – operator notation

– If $(I + \partial f + \partial h)^{-1}$ is hard, but $(I + \partial f)^{-1}$ and $(I + \partial h)^{-1}$ are "easy"
– Let us derive a fixed-point equation that "splits" the operators

Assume we are solving
$$ \min\ f(x) + h(x), $$
where both $f$ and $h$ are convex but potentially nondifferentiable.

Notice: We implicitly assumed $\partial(f + h) = \partial f + \partial h$.

Proximal splitting

$$ 0 \in \partial f(x) + \partial h(x) $$
$$ 2x \in (I + \partial f)(x) + (I + \partial h)(x) $$

Key idea of splitting: a new variable!
$$ z \in (I + \partial h)(x) \iff x = \operatorname{prox}_h(z) $$
$$ 2x - z \in (I + \partial f)(x) \iff x = (I + \partial f)^{-1}(2x - z) $$

– Not a fixed-point equation yet
– We need one more idea

Douglas-Rachford splitting

Reflection operator:
$$ R_h(z) := 2\operatorname{prox}_h(z) - z $$

Douglas-Rachford method:
$$ z \in (I + \partial h)(x), \quad x = \operatorname{prox}_h(z) \ \Longrightarrow\ R_h(z) = 2x - z $$

$$ 0 \in \partial f(x) + \partial h(x) $$
$$ 2x \in (I + \partial f)(x) + (I + \partial h)(x) $$
$$ 2x - z \in (I + \partial f)(x) $$
$$ x = \operatorname{prox}_f(R_h(z)) $$
but $R_h(z) = 2x - z \iff z = 2x - R_h(z)$, so
$$ z = 2\operatorname{prox}_f(R_h(z)) - R_h(z) = R_f(R_h(z)). $$
Finally, $z$ is on both sides of the equation.

Douglas-Rachford method

$$ 0 \in \partial f(x) + \partial h(x) \iff \begin{cases} x = \operatorname{prox}_h(z) \\ z = R_f(R_h(z)) \end{cases} $$

DR method: given $z^0$, iterate for $k \ge 0$:
$$ x^k = \operatorname{prox}_h(z^k) $$
$$ v^k = \operatorname{prox}_f(2x^k - z^k) $$
$$ z^{k+1} = z^k + \gamma_k (v^k - x^k) $$

Theorem. If $f + h$ admits minimizers, and $(\gamma_k)$ satisfies
$$ \gamma_k \in [0, 2], \qquad \sum\nolimits_k \gamma_k (2 - \gamma_k) = \infty, $$
then the DR iterates $v^k$ and $x^k$ converge to a minimizer.
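A sketch of the DR iteration above with $\gamma_k = 1$, on a toy instance where both proxes are available in closed form: $f(x) = \tfrac{1}{2}\|x - b\|^2$ (so $\operatorname{prox}_f(v) = (v + b)/2$ with unit prox parameter) and $h = \lambda\|\cdot\|_1$ (soft thresholding). The instance and names are illustrative, not from the slides.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def douglas_rachford(prox_f, prox_h, z0, num_iters=200, gamma=1.0):
    """DR iteration: x = prox_h(z), v = prox_f(2x - z), z <- z + gamma*(v - x)."""
    z = z0.copy()
    for _ in range(num_iters):
        x = prox_h(z)
        v = prox_f(2 * x - z)
        z = z + gamma * (v - x)
    return prox_h(z)

# Toy instance: minimize 0.5*||x - b||^2 + lam*||x||_1.
b, lam = np.array([2.0, -0.3, 1.5]), 0.5
prox_f = lambda v: (v + b) / 2.0             # prox of 0.5*||x - b||^2 (unit parameter)
prox_h = lambda v: soft_threshold(v, lam)    # prox of lam*||x||_1
x = douglas_rachford(prox_f, prox_h, np.zeros_like(b))
print(np.allclose(x, soft_threshold(b, lam), atol=1e-6))   # closed-form minimizer
```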

Douglas-Rachford method

For $\gamma_k = 1$, we have
$$ z^{k+1} = z^k + v^k - x^k $$
$$ z^{k+1} = z^k + \operatorname{prox}_f\big(2\operatorname{prox}_h(z^k) - z^k\big) - \operatorname{prox}_h(z^k) $$

Dropping superscripts and writing $P \equiv \operatorname{prox}$, we have $z \leftarrow Tz$ with
$$ T = I + P_f(2P_h - I) - P_h. $$

Lemma. DR can be written as $z \leftarrow \tfrac{1}{2}(R_f R_h + I)z$, where $R_f$ denotes the reflection operator $2P_f - I$ (similarly $R_h$).
Exercise: Prove this claim.
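The lemma can also be sanity-checked numerically (a sketch, using the same toy proxes as in the previous snippet): for a random $z$, one explicit DR step ($\gamma = 1$) coincides with $\tfrac{1}{2}(R_f R_h + I)z$.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

b, lam = np.array([2.0, -0.3, 1.5]), 0.5
prox_f = lambda v: (v + b) / 2.0
prox_h = lambda v: soft_threshold(v, lam)
refl_f = lambda v: 2 * prox_f(v) - v         # R_f = 2 P_f - I
refl_h = lambda v: 2 * prox_h(v) - v         # R_h = 2 P_h - I

rng = np.random.default_rng(2)
z = rng.standard_normal(3)
dr_step = z + prox_f(2 * prox_h(z) - z) - prox_h(z)     # T = I + P_f(2P_h - I) - P_h
averaged = 0.5 * (refl_f(refl_h(z)) + z)                # (1/2)(R_f R_h + I) z
print(np.allclose(dr_step, averaged))                   # True
```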

Proximal methods – cornucopia

– Douglas-Rachford splitting
– ADMM (special case of DR on the dual)
– Proximal-Dykstra
– Proximal methods for $f_1 + f_2 + \cdots + f_n$
– Peaceman-Rachford
– Proximal quasi-Newton, Newton
– Many other variations

Best approximation problem

$$ \min\ \delta_A(x) + \delta_B(x) \quad \text{where } A \cap B = \emptyset. $$
Can we use DR?

Using a clever analysis of Bauschke & Combettes (2004), DR can still be applied! However, it generates diverging iterates, which can be "projected back" to obtain a solution to
$$ \min\ \|a - b\|_2, \quad a \in A,\ b \in B. $$
See Jegelka, Bach, Sra (NIPS 2013) for an example.

ADMM

Let us consider a separable objective with constraints:
$$ \min\ f(x) + g(z) \quad \text{s.t.}\quad Ax + Bz = c. $$

– Objective function separated into $x$ and $z$ variables
– The constraint prevents a trivial decoupling
– Introduce the augmented Lagrangian (AL)
$$ L_\rho(x, z, y) := f(x) + g(z) + y^T(Ax + Bz - c) + \tfrac{\rho}{2}\|Ax + Bz - c\|_2^2 $$
– Now, a Gauss-Seidel style update to the AL (see the sketch below):
$$ x^{k+1} = \operatorname{argmin}_x\ L_\rho(x, z^k, y^k) $$
$$ z^{k+1} = \operatorname{argmin}_z\ L_\rho(x^{k+1}, z, y^k) $$
$$ y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c) $$
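A minimal sketch of these updates for the Lasso in consensus form: $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, $g(z) = \lambda\|z\|_1$, with constraint $x - z = 0$ (so the constraint matrices are $I$ and $-I$, $c = 0$). It uses the scaled dual variable $u = y/\rho$, which is equivalent to the $y$-update above; the problem data and names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam, rho=1.0, num_iters=200):
    """ADMM for min 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0.
    Uses the scaled dual variable u = y / rho."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    Q = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(num_iters):
        # x-update solves (A^T A + rho*I) x = A^T b + rho*(z - u)
        x = np.linalg.solve(Q, Atb + rho * (z - u))
        # z-update is a prox of lam*||.||_1 with parameter lam/rho
        z = soft_threshold(x + u, lam / rho)
        # scaled dual update: u <- u + (x - z)
        u = u + x - z
    return z

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 80))
b = rng.standard_normal(40)
x_hat = admm_lasso(A, b, lam=0.1)
```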

