16 The Gradient Descent Framework


Consider the problem of finding the minimum-energy s-t electrical unit flow: we wanted to minimize the total energy burn

    E(f) = Σ_e f_e² r_e

over flow values f that represent a unit flow from s to t (these form a polytope). We alluded to algorithms that solve this problem, but one can also observe that E(f) is a convex function, and we want to find a minimizer within some polytope K. Equivalently, we wanted to solve the linear system

    Lφ = (e_s − e_t),

which can be cast as finding a minimizer of the convex function ‖Lφ − (e_s − e_t)‖². How can we minimize these functions efficiently? In this lecture, we will study the gradient descent framework for the general problem of minimizing functions, and give concrete performance guarantees for the case of convex optimization.

16.1 Convex Sets and Functions

First, recall the following definitions:

Definition 16.1 (Convex Set). A set K ⊆ ℝⁿ is called convex if for all x, y ∈ K and all λ ∈ [0, 1],

    λx + (1 − λ)y ∈ K.    (16.1)

Geometrically, this means that for any two points in K, the line segment connecting them is contained in K.

Definition 16.2 (Convex Function). A function f : K → ℝ defined on a convex set K is called convex if for all x, y ∈ K and all λ ∈ [0, 1],

    f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).    (16.2)

[Figure: for a convex function f, the chord value λ f(x) + (1 − λ) f(y) lies above the function value f(λx + (1 − λ)y).]

There are two kinds of problems that we will study. The most basic question is that of unconstrained convex minimization (UCM): given a convex function f, we want to find

    min_{x ∈ ℝⁿ} f(x).

In some cases we will be concerned with the constrained convex minimization (CCM) problem: given a convex function f and a convex set K, we want to find

    min_{x ∈ K} f(x).

Note that setting K = ℝⁿ gives us the unconstrained case.

16.1.1 Gradient

For most of the following discussion, we assume that the function f is differentiable. In that case, we can give an equivalent characterization of convexity, based on the notion of the gradient ∇f : ℝⁿ → ℝⁿ.

Fact 16.3 (First-order condition). A function f : K → ℝ is convex if and only if

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩    (16.3)

for all x, y ∈ K.

(Aside: the directional derivative of f at x in the direction y is defined as f′(x; y) := lim_{ε→0} (f(x + εy) − f(x))/ε. If there exists a vector g such that ⟨g, y⟩ = f′(x; y) for all y, then f is called differentiable at x, and g is called the gradient. It follows that the gradient must be of the form ∇f(x) = (∂f/∂x₁(x), ∂f/∂x₂(x), …, ∂f/∂xₙ(x)).)

Geometrically, Fact 16.3 states that the function always lies above its tangent plane, at all points in K.

[Figure 16.1: The blue line denotes the function and the red line is the tangent line at x. (Figure from Nisheeth Vishnoi.)]

If the function f is twice-differentiable, and if H_f(x) is its Hessian matrix, i.e., its matrix of second derivatives at x ∈ K:

    (H_f)_{i,j}(x) := ∂²f/(∂x_i ∂x_j)(x),    (16.4)

then we get yet another characterization of convex functions.

Fact 16.4 (Second-order condition). A twice-differentiable function f is convex if and only if H_f(x) is positive semidefinite for all x ∈ K.
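To see Facts 16.3 and 16.4 in action, here is a small numerical sanity check — a sketch of my own, not from the notes — for the quadratic f(x) = xᵀAx, whose gradient is 2Ax and whose Hessian is the constant matrix 2A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # a symmetric matrix

f = lambda x: x @ A @ x           # f(x) = x^T A x
grad = lambda x: 2.0 * A @ x      # gradient of f

# Second-order condition (Fact 16.4): the Hessian 2A should be PSD.
print(np.linalg.eigvalsh(2.0 * A))          # [2. 6.], both nonnegative

# First-order condition (Fact 16.3): f(y) >= f(x) + <grad f(x), y - x>.
for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9   # up to float error
```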

16.1.2 Lipschitz Functions

We will need one more notion of the smoothness of the function:

Definition 16.5 (Lipschitz). For a convex set K ⊆ ℝⁿ, a function f : K → ℝ is called G-Lipschitz with respect to the norm ‖·‖ if

    |f(x) − f(y)| ≤ G ‖x − y‖

for all x, y ∈ K.

In this chapter we focus on the Euclidean or ℓ₂-norm, denoted by ‖·‖₂. General norms arise in the next chapter, when we talk about mirror descent. Again, assuming that the function is differentiable allows us to give an alternative characterization of Lipschitzness.

Fact 16.6. A differentiable function f : K → ℝ is G-Lipschitz with respect to ‖·‖₂ if and only if

    ‖∇f(x)‖₂ ≤ G    (16.5)

for all x ∈ K.

16.2 Unconstrained Convex Minimization

If the function f is convex, any stationary point (i.e., a point x where ∇f(x) = 0) is also a global minimum: just use Fact 16.3 to infer that f(y) ≥ f(x) for all y. So given a convex function, we can just solve the equation

    ∇f(x) = 0

to compute the global minima exactly. This is often easier said than done: for instance, the function f we want to minimize may not be given explicitly. Instead we may only have a gradient oracle that, given x, returns ∇f(x).

Even when f is explicit, it may be expensive to solve the equation ∇f(x) = 0, and gradient descent may be a faster way. One example arises when solving linear systems: given the quadratic function f(x) = ½ xᵀAx − bᵀx for a symmetric matrix A (say having full rank), a simple calculation shows that

    ∇f(x) = 0  ⟺  Ax = b  ⟺  x = A⁻¹b.

This can be solved in O(n^ω) (i.e., matrix-multiplication) time using Gaussian elimination—but for "nice" matrices A we are often able to approximate a solution much faster using the gradient-based methods we will soon see.

16.2.1 The Basic Gradient Descent Method

Gradient descent is an iterative algorithm to approximate the optimal solution x*. The main idea is simple: since the gradient tells us the direction of steepest increase, we would like to move opposite to the direction of the gradient to decrease the function value the fastest. So by selecting an initial position x₁ and a step size η_t at each time t, we can repeatedly perform the update:

    x_{t+1} = x_t − η_t · ∇f(x_t).    (16.6)
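To make the update (16.6) concrete, here is a minimal sketch (my illustration; the matrix, step size, and iteration count are arbitrary choices) that minimizes the quadratic f(x) = ½ xᵀAx − bᵀx discussed above, and hence approximately solves Ax = b:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # symmetric positive definite, so f is convex
b = np.array([1.0, 1.0])

x = np.zeros(2)                  # starting point
eta = 0.1                        # fixed step size, small enough for this A
for _ in range(500):
    x = x - eta * (A @ x - b)    # update (16.6): grad f(x) = Ax - b

print(x)                         # close to ...
print(np.linalg.solve(A, b))     # ... the true solution A^{-1} b
```

For this A the iteration converges because η is below 2/λ_max(A); in general, the right fixed step size depends on the spectrum of A, foreshadowing the smoothness discussion in §16.5.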

There are many choices to be made: where should we start? What are the step sizes? When do we stop? While each of these decisions depends on the properties of the particular instance at hand, we can show fairly general results for general convex functions.

16.2.2 An Algorithm for General Convex Functions

The algorithm fixes a step size for all times t, performs the update (16.6) for some number of steps T, and then returns the average of all the points seen during the process.

Algorithm 13: Gradient Descent
  13.1  x₁ ← starting point
  13.2  for t = 1 to T do
  13.3      x_{t+1} ← x_t − η · ∇f(x_t)
  13.4  return x̂ := (1/T) Σ_{t=1}^T x_t

This is easy to visualize in two dimensions: draw the level sets of the function f; the gradient at a point is a scaled version of the normal to the level set through that point. The algorithm's path is often a zig-zagging walk towards the optimum (see Figure 16.2).

[Figure 16.2: The yellow lines denote the level sets of the function f and the red walk denotes the steps of gradient descent. (Figure from Wikipedia.)]

Interestingly, we can give rigorous bounds on the convergence of this algorithm to the optimum, based on the distance of the starting point from the optimum, and bounds on the Lipschitzness of the function. If both of these are assumed to be constant, then our error is smaller than ε after only O(1/ε²) steps.

Proposition 16.7. Let f : ℝⁿ → ℝ be convex, differentiable and G-Lipschitz. Let x* be any point in ℝⁿ. If we define

    T := G² ‖x₁ − x*‖² / ε²    and    η := ‖x₁ − x*‖ / (G√T),

then the solution x̂ returned by gradient descent satisfies

    f(x̂) − f(x*) ≤ ε.    (16.7)

In particular, this holds when x* is a minimizer of f.

The core of this proposition lies in the following theorem.

Theorem 16.8. Let f : ℝⁿ → ℝ be convex, differentiable and G-Lipschitz. Then the gradient descent algorithm ensures that

    Σ_{t=1}^T f(x_t) ≤ Σ_{t=1}^T f(x*) + ½ ηTG² + (1/(2η)) ‖x₁ − x*‖².    (16.8)

We will prove Theorem 16.8 in the next section, but let's first use it to prove Proposition 16.7.

Proof of Proposition 16.7. By definition of x̂ and the convexity of f,

    f(x̂) = f((1/T) Σ_{t=1}^T x_t) ≤ (1/T) Σ_{t=1}^T f(x_t).

By Theorem 16.8,

    (1/T) Σ_{t=1}^T f(x_t) ≤ f(x*) + ½ ηG² + (1/(2ηT)) ‖x₁ − x*‖²,

where the last two summands are the error terms. The error terms balance when η = ‖x₁ − x*‖/(G√T), giving

    f(x̂) − f(x*) ≤ ‖x₁ − x*‖ G / √T.

Finally, we set T = G² ‖x₁ − x*‖² / ε² to obtain f(x̂) − f(x*) ≤ ε.  ∎

Observe: we do not (and cannot) show that the point x̂ is close in distance to x*; we just show that the function value f(x̂) is close to f(x*). Indeed, if the function is very flat close to x* and we start off at some remote point, we make tiny steps as we get close to x*, and we cannot hope to get close to it.

The 1/ε² dependence of the number of oracle calls was shown to be tight for gradient-based methods by Yurii Nesterov, if we allow f to be any G-Lipschitz function. However, if we assume that the function is "well-behaved", we can indeed improve on the 1/ε² dependence. Moreover, if the function is strongly convex, we can show that x* and x̂ are close to each other as well: see §16.5 for such results.

The convergence guarantee in Proposition 16.7 is for the time-averaged point x̂. Indeed, using a fixed step size means that our iterates may get stuck in a situation where x_{t+2} = x_t after some point, and hence we never improve, even though x̂ is at the minimizer. One can also show that f(x_T) − f(x*) ≤ ε if we use a time-varying step size η_t = O(1/√t), and increase the time horizon slightly to O((1/ε²) log(1/ε)). We refer to the work of Shamir and Zhang. [Link to notes.]

16.2.3 Proof of Theorem 16.8

As in the proof of the multiplicative weights algorithm, we will use a potential function. Define

    Φ_t := ‖x_t − x*‖² / (2η).    (16.9)

We start the proof of Theorem 16.8 by understanding the one-step change in potential:

Lemma 16.9. f(x_t) + (Φ_{t+1} − Φ_t) ≤ f(x*) + ½ ηG².

Proof. Using the identity

    ‖a + b‖² = ‖a‖² + 2⟨a, b⟩ + ‖b‖²

with a + b = x_{t+1} − x* and a = x_t − x*, we get

    Φ_{t+1} − Φ_t = (1/(2η)) (‖x_{t+1} − x*‖² − ‖x_t − x*‖²)    (16.10)
                  = (1/(2η)) (2⟨x_{t+1} − x_t, x_t − x*⟩ + ‖x_{t+1} − x_t‖²);

now using x_{t+1} − x_t = −η∇f(x_t) from gradient descent, this equals

    (1/(2η)) (2⟨−η∇f(x_t), x_t − x*⟩ + ‖η∇f(x_t)‖²).

Since f is G-Lipschitz, ‖∇f(x)‖ ≤ G for all x. Thus,

    f(x_t) + (Φ_{t+1} − Φ_t) ≤ f(x_t) + ⟨∇f(x_t), x* − x_t⟩ + ½ ηG².

Since f is convex, we know that f(x_t) + ⟨∇f(x_t), x* − x_t⟩ ≤ f(x*). Thus, we conclude that

    f(x_t) + (Φ_{t+1} − Φ_t) ≤ f(x*) + ½ ηG².  ∎

Now that we understand how our potential changes over time, proving the theorem is straightforward.

Proof of Theorem 16.8. We start with the inequality we proved above:

    f(x_t) + (Φ_{t+1} − Φ_t) ≤ f(x*) + ½ ηG².

Summing over t = 1, …, T:

    Σ_{t=1}^T f(x_t) + Σ_{t=1}^T (Φ_{t+1} − Φ_t) ≤ Σ_{t=1}^T f(x*) + ½ ηG²T.

The sum of potentials on the left telescopes to give

    Σ_{t=1}^T f(x_t) + Φ_{T+1} − Φ_1 ≤ Σ_{t=1}^T f(x*) + ½ ηG²T.

Since the potentials are nonnegative, we can drop the Φ_{T+1} term:

    Σ_{t=1}^T f(x_t) − Φ_1 ≤ Σ_{t=1}^T f(x*) + ½ ηG²T.

Substituting in the definition of Φ_1 and moving it over to the right-hand side completes the proof.  ∎
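Assembling the pieces, here is a sketch in code of Algorithm 13 with the step size and horizon from Proposition 16.7. The names (gradient_descent, grad_f, G, R, eps) are my own; G bounds the gradient norm and R bounds ‖x₁ − x*‖, exactly as the proposition assumes. The example minimizes f(x) = ‖x‖₂, which is 1-Lipschitz with minimizer 0 (it is not differentiable at 0, but returning a subgradient there is fine: the proof above only used the first-order inequality, which subgradients also satisfy).

```python
import numpy as np

def gradient_descent(grad_f, x1, G, R, eps):
    """Averaged gradient descent (Algorithm 13), with eta and T set as in
    Proposition 16.7: T = G^2 R^2 / eps^2 and eta = R / (G sqrt(T))."""
    T = int(np.ceil((G * R / eps) ** 2))
    eta = R / (G * np.sqrt(T))
    x, total = x1.astype(float), np.zeros(len(x1))
    for _ in range(T):
        total += x                 # accumulate x_1, ..., x_T
        x = x - eta * grad_f(x)    # the update (16.6)
    return total / T               # the averaged point x-hat

# f(x) = ||x||_2: its (sub)gradient is x/||x||, guarded near x = 0.
grad_f = lambda x: x / max(np.linalg.norm(x), 1e-12)
x_hat = gradient_descent(grad_f, np.array([3.0, 4.0]), G=1.0, R=5.0, eps=0.6)
print(np.linalg.norm(x_hat))       # about 0.26, within the eps = 0.6 bound
```

With these numbers, T = ⌈(1 · 5/0.6)²⌉ = 70 and η ≈ 0.60; the later iterates oscillate around the minimizer — the x_{t+2} = x_t behavior noted in the remarks above — yet their average still satisfies the guarantee.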

16.2.4 Some Remarks on the Algorithm

We assume a gradient oracle for the function: given a point x, it returns the gradient ∇f(x) at that point. If the function f is not given explicitly, we may have to estimate the gradient using, e.g., random sampling. One particularly sample-efficient solution is to pick a uniformly random point u ∈ Sⁿ⁻¹ from the unit sphere in ℝⁿ, and return

    [f(x + δu)/δ] · n · u

for some tiny δ > 0. (Aside: as δ → 0, the expectation of this expression tends to ∇f(x), using Stokes' theorem. [Details?]) It is slightly mysterious, so perhaps it is useful to consider its expectation in the case of a univariate function:

    E_{u ∈ {−1,+1}} [ (f(x + δu)/δ) · u ] = (f(x + δ) − f(x − δ)) / (2δ) ≈ f′(x).

In general, randomized strategies form the basis of stochastic gradient descent, where we use an unbiased estimator of the gradient instead of computing the gradient itself (because the gradient is slow to compute, or because enough information is not available). The challenge is then to control the variance of this estimator.
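As a quick numerical check of this estimator (a sketch of mine: the test function f(x) = ‖x‖², the dimension, δ, and sample count are all arbitrary choices), averaging many samples of [f(x + δu)/δ] · n · u should approximate ∇f(x) = 2x:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, m = 3, 0.25, 400_000
x = np.array([1.0, -2.0, 0.5])

U = rng.standard_normal((m, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # m uniform points on S^{n-1}

X = x + delta * U                               # the perturbed query points
fvals = np.einsum('ij,ij->i', X, X)             # f(x + delta*u) for each u
est = (fvals[:, None] / delta) * n * U          # samples of [f(x+du)/d]*n*u

print(est.mean(axis=0))                         # approx grad f(x) = 2x
print(2 * x)                                    # [ 2. -4.  1.]
```

Note the tension: shrinking δ reduces the bias for general f, but blows up the variance through the n/δ scaling — precisely the variance-control challenge just mentioned.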

Another concern is that the step size η and the number of steps T both require knowledge of the distance ‖x₁ − x*‖, as well as the bound G on the gradient. [More here.] As an exercise, show that using the time-varying step size η_t := ‖x₁ − x*‖/(G√t) also gives a very similar convergence rate.

Finally, the guarantee is for f(x̂), where x̂ is the time-average of the iterates. What about returning the final iterate? It turns out this has comparable guarantees, but the proof is slightly more involved. [Put notes on webpage.]

16.3 Constrained Convex Minimization

Unlike in the unconstrained case, the gradient at the minimizer may not be zero in the constrained case—the minimizer may lie on the boundary of K. In this case, the condition for a convex function f : K → ℝ to be minimized at x* ∈ K is

    ⟨∇f(x*), y − x*⟩ ≥ 0    for all y ∈ K.    (16.11)

In other words, all vectors y − x* pointing into K are "positively correlated" with the gradient. (Aside: this is the analog of the minimizer of a single-variable function being achieved either at a point where the derivative is zero, or at the boundary. When x* is in the interior of K, condition (16.11) is equivalent to ∇f(x*) = 0.)

16.3.1 Projected Gradient Descent

The gradient descent step itself still makes sense: moving in the direction opposite to the gradient still moves us towards lower function values. But we must change the algorithm to ensure that the new point x_{t+1} lies within K. To do so, we simply project each new iterate back onto K. Let proj_K : ℝⁿ → K be defined as

    proj_K(y) = arg min_{x ∈ K} ‖x − y‖².

The modified algorithm is given below as Algorithm 14; the new step is the projection in line 14.4.

Algorithm 14: Projected Gradient Descent for CCM
  14.1  x₁ ← starting point
  14.2  for t = 1 to T do
  14.3      x′_{t+1} ← x_t − η · ∇f(x_t)
  14.4      x_{t+1} ← proj_K(x′_{t+1})
  14.5  return x̂ := (1/T) Σ_{t=1}^T x_t

We will show below that a result almost identical to that of Theorem 16.8, and hence that of Proposition 16.10, holds.

Proposition 16.10. Let K be a closed convex set, and let f : K → ℝ be convex, differentiable and G-Lipschitz. Let x* ∈ K, and define

    T := G² ‖x₁ − x*‖² / ε²    and    η := ‖x₁ − x*‖ / (G√T).

Then the solution x̂ returned by projected gradient descent satisfies

    f(x̂) − f(x*) ≤ ε.    (16.12)

In particular, this holds when x* is a minimizer of f over K.

Proof. We can reduce to the analysis of Theorem 16.8. Let us start the proof as before:

    Φ_{t+1} − Φ_t = (1/(2η)) (‖x_{t+1} − x*‖² − ‖x_t − x*‖²)    (16.13)
                  ≤ (1/(2η)) (‖x′_{t+1} − x*‖² − ‖x_t − x*‖²).    (16.14)

The inequality needs justification: x_{t+1} is the projection of x′_{t+1} onto K, which is difficult to reason about directly. Also, we know that −η∇f(x_t) equals x′_{t+1} − x_t, not x_{t+1} − x_t, so we would like to work with the point x′_{t+1}. Indeed, we claim that ‖x′_{t+1} − x*‖ ≥ ‖x_{t+1} − x*‖, which gives exactly (16.14). Now the rest of the proof of Theorem 16.8 goes through unchanged. ∎

Why is the claim ‖x′_{t+1} − x*‖ ≥ ‖x_{t+1} − x*‖ true? Since K is convex, projecting onto it gets us closer to every point of K, in particular to x* ∈ K. To formally prove this fact about projections, consider the angle ∠x* x_{t+1} x′_{t+1}. This is a non-acute angle, since by orthogonality of the projection, K lies entirely to one side of the hyperplane through x_{t+1} with normal vector x′_{t+1} − x_{t+1}, as in Figure 16.3.

[Figure 16.3: Projection onto a convex body.]
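As an illustration (my own sketch, not from the notes), here is Algorithm 14 with K taken to be the Euclidean unit ball, for which the projection has the simple closed form proj_K(y) = y / max(1, ‖y‖):

```python
import numpy as np

def project_unit_ball(y):
    """proj_K for K = {x : ||x||_2 <= 1}: rescale y if it lies outside."""
    return y / max(1.0, np.linalg.norm(y))

def projected_gd(grad_f, x1, eta, T):
    x, total = x1, np.zeros_like(x1)
    for _ in range(T):
        total += x
        x = project_unit_ball(x - eta * grad_f(x))   # lines 14.3-14.4
    return total / T                                 # averaged iterate x-hat

# Minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball;
# the constrained minimizer is c/||c||, on the boundary.
c = np.array([2.0, 2.0])
x_hat = projected_gd(lambda x: 2.0 * (x - c), np.zeros(2), eta=0.05, T=2000)
print(x_hat)                      # close to ...
print(c / np.linalg.norm(c))      # ... [0.707, 0.707]
```

Note that the gradient at the constrained minimizer is nonzero; instead, condition (16.11) holds there, since the descent direction −∇f points out of the ball.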

Note that restricting the play to K can be helpful in two ways: we can upper-bound the distance ‖x* − x₁‖ by the diameter of K, and moreover we need only consider the Lipschitzness of f at points within K. [Give examples.]

16.4 Online Gradient Descent, and Relationship with MW

We considered gradient descent for the offline convex minimization problem, but one can use it even when the function changes over time. Indeed, consider the online convex optimization (OCO) problem: at each time step t, the algorithm proposes a point x_t ∈ K, and an adversary responds with a function f_t : K → ℝ with ‖∇f_t‖ ≤ G. The cost at time step t is f_t(x_t), and the objective is to minimize

    regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*).

For instance, if K = Δ_n and f_t(x) := ⟨ℓ_t, x⟩ for some loss vector ℓ_t ∈ [−1, 1]ⁿ, then we are back in the experts setting of the previous chapters. Of course, the OCO problem is far more general, allowing arbitrary convex functions.

Surprisingly, we can use almost the same algorithm to solve the OCO problem, with one natural modification: the update rule is now taken with respect to the gradient of the current function f_t:

    x_{t+1} = x_t − η · ∇f_t(x_t).

(Aside: this was first observed by Martin Zinkevich in 2002, when he was a Ph.D. student here at CMU.)

Looking back at the proof in §16.2, the proof of Lemma 16.9 immediately extends to give us

    f_t(x_t) + Φ_{t+1} − Φ_t ≤ f_t(x*) + ½ ηG².

Now summing this over all times t gives

    Σ_{t=1}^T (f_t(x_t) − f_t(x*)) ≤ Σ_{t=1}^T (Φ_t − Φ_{t+1}) + ½ ηTG² ≤ Φ_1 + ½ ηTG²,

since Φ_{T+1} ≥ 0. The rest of the proof is unchanged: setting

    T = G² ‖x₁ − x*‖² / ε²    and    η = ‖x₁ − x*‖ / (G√T),

and doing some elementary algebra as above,

    (1/T) Σ_{t=1}^T (f_t(x_t) − f_t(x*)) ≤ ‖x₁ − x*‖ G / √T ≤ ε.
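Here is a sketch of online gradient descent in code (mine, not from the notes: the adversary plays drifting linear losses, K is the unit ball, and the step size is the 1/√T choice suggested by the analysis above); it also measures the regret against the best fixed point in hindsight:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 400
eta = 1.0 / np.sqrt(T)                     # step size of order 1/sqrt(T)

# Adversary: linear losses f_t(x) = <l_t, x> with drifting loss vectors.
ells = [np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(2) for _ in range(T)]

x, plays = np.zeros(2), []
for ell in ells:                           # l_t is revealed after we play x_t
    plays.append(x)
    x = x - eta * ell                      # gradient of <l_t, x> is l_t
    x = x / max(1.0, np.linalg.norm(x))    # project back onto the unit ball K

L = sum(ells)
best = -L / np.linalg.norm(L)              # best fixed point in hindsight
regret = sum(l @ x for l, x in zip(ells, plays)) - L @ best
print(regret / T)                          # small average regret, as promised
```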

16.4.1 Comparison to the MW/Hedge Algorithms

One advantage of the gradient descent approach (and its analysis) over the multiplicative-weights-based ones is that the guarantees here hold for all convex bodies K and all convex functions, as opposed to just the unit simplex Δ_n and linear losses f_t(x) = ⟨ℓ_t, x⟩, say for ℓ_t ∈ [−1, 1]ⁿ. However, in order to make a fair comparison, suppose we restrict ourselves to Δ_n and linear losses, and consider the number of rounds T before we get an average regret of ε.

• If we treat ‖x₁ − x*‖ (which, in the worst case, is the diameter of K) and G (an upper bound on ‖∇f_t(x)‖ over points in K) as constants, then the T = Θ(1/ε²) dependence is the same for both.

• For a more quantitative comparison, note that ‖x₁ − x*‖ ≤ √2 for x₁, x* ∈ Δ_n, and ‖∇f_t(x)‖ = ‖ℓ_t‖ ≤ √n for ℓ_t ∈ [−1, 1]ⁿ. Hence Proposition 16.10 gives us T = Θ(n/ε²), as opposed to T = Θ((log n)/ε²) for multiplicative weights.

The problem, at a high level, is that we are "choosing the wrong norm": when dealing with probabilities, the "right" norm is the ℓ₁ norm and not the Euclidean ℓ₂ norm. In the next lecture we will formalize what this means, and see how this dependence on n can be improved via the Mirror Descent framework.

16.5 Stronger Assumptions

If the function f is "well-behaved", we can improve the guarantees for gradient descent in two ways: we can reduce the dependence on ε, and we can weaken (or remove) the dependence on the parameters G and ‖x₁ − x*‖. There are two standard assumptions to make on the convex function: that it is "not too flat" (captured by the idea of strong convexity), and that it is "not too curved" (i.e., it is smooth). We now use these assumptions to improve the guarantees.

16.5.1 Strongly-Convex Functions

Definition 16.11 (Strong Convexity). A function f : K → ℝ is α-strongly convex if any of the following holds for all x, y ∈ K:

1. (Zeroth order) f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y) − (α/2) λ(1−λ) ‖x − y‖² for all λ ∈ [0, 1].

2. (First order) If f is differentiable, then

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2) ‖x − y‖².    (16.15)

3. (Second order) If f is twice-differentiable, then all eigenvalues of H_f(x) are at least α at every point x ∈ K.

We will work with the first-order definition, and show that gradient descent with (time-varying) step size η_t = O(1/(αt)) converges to a value at most f(x*) + ε in time T = Θ(G²/(αε)). Note that there is no longer any dependence on the diameter of the polytope. Before we give this proof, let us give the other relevant definitions.

16.5.2 Smooth Functions

Definition 16.12 (Lipschitz Smoothness). A function f : K → ℝ is β-(Lipschitz)-smooth if any of the following holds for all x, y ∈ K:

1. (Zeroth order) f(λx + (1−λ)y) ≥ λ f(x) + (1−λ) f(y) − (β/2) λ(1−λ) ‖x − y‖² for all λ ∈ [0, 1].

2. (First order) If f is differentiable, then

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (β/2) ‖x − y‖².    (16.16)

3. (Second order) If f is twice-differentiable, then all eigenvalues of H_f(x) are at most β at every point x ∈ K.

In this case, the gradient descent algorithm with fixed step size η_t = η = O(1/β) yields a point x̂ which satisfies f(x̂) − f(x*) ≤ ε when T = Θ(β ‖x₁ − x*‖² / ε). Note that now we have no dependence on the Lipschitz constant G any more; we depend only on the diameter of the polytope. Again, we defe…
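To make the two definitions concrete: for a quadratic f(x) = ½ xᵀAx with symmetric A, the Hessian is A everywhere, so by the second-order conditions f is α-strongly convex and β-smooth with α = λ_min(A) and β = λ_max(A). A small sanity check of my own (the matrix is an arbitrary choice):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # symmetric; Hessian of f is A

eigs = np.linalg.eigvalsh(A)               # eigenvalues in ascending order
alpha, beta = eigs[0], eigs[-1]
print(alpha, beta)                         # approx 1.38 and 3.62

# Spot-check the first-order sandwich (16.15) and (16.16) at random points.
rng = np.random.default_rng(0)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    d = y - x
    linear = f(x) + grad(x) @ d            # the tangent-plane value at y
    assert linear + 0.5 * alpha * d @ d - 1e-9 <= f(y)
    assert f(y) <= linear + 0.5 * beta * d @ d + 1e-9
```

The 1/β scale of the fixed step size above matches the stable step sizes we saw for the quadratic example in §16.2: the curvature β limits how aggressively we can step.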

