12 Gradient Descent Methods - BYU ACME


Lab 12: Gradient Descent Methods

Lab Objective: Iterative optimization methods choose a search direction and a step size at each iteration. One simple choice for the search direction is the negative gradient, resulting in the method of steepest descent. While theoretically foundational, in practice this method is often slow to converge. An alternative method, the conjugate gradient algorithm, uses a similar idea that results in much faster convergence in some situations. In this lab we implement a method of steepest descent and two conjugate gradient methods, then apply them to regression problems.

The Method of Steepest Descent

Let f: R^n → R with first derivative Df: R^n → R^n. The following iterative technique is a common template for methods that aim to compute a local minimizer x* of f.

    x_{k+1} = x_k + α_k p_k                                     (12.1)

Here x_k is the kth approximation to x*, α_k is the step size, and p_k is the search direction. Newton's method and its relatives follow this pattern, but they require the calculation (or approximation) of the inverse Hessian matrix D^2 f(x_k)^{-1} at each step. The following idea is a simpler and less computationally intensive approach than Newton and quasi-Newton methods.

The derivative Df(x)^T (often called the gradient of f at x, sometimes notated ∇f(x)) is a vector that points in the direction of greatest increase of f at x. It follows that the negative derivative −Df(x)^T points in the direction of steepest decrease at x. The method of steepest descent chooses the search direction p_k = −Df(x_k)^T at each step of (12.1), resulting in the following rule.

    x_{k+1} = x_k − α_k Df(x_k)^T                               (12.2)

Setting α_k = 1 for each k is often sufficient for Newton and quasi-Newton methods. However, a constant choice for the step size in (12.2) can result in oscillating approximations or even cause the sequence (x_k) to travel away from the minimizer x*. To avoid this problem, the step size α_k can be chosen in a few ways.
• Start with α_k = 1, then set α_k = α_k/2 repeatedly until f(x_k − α_k Df(x_k)^T) < f(x_k), terminating the iteration if α_k gets too small. This guarantees that the method actually descends at each step and that α_k satisfies the Armijo rule, without endangering convergence.

• At each step, solve the following one-dimensional optimization problem.

    α_k = argmin_α f(x_k − α Df(x_k)^T)

Using this choice is called exact steepest descent. This option is more expensive per iteration than the above strategy, but it results in fewer iterations before convergence.

Problem 1. Write a function that accepts an objective function f: R^n → R, its derivative Df: R^n → R^n, an initial guess x0 ∈ R^n, a convergence tolerance tol defaulting to 1e-5, and a maximum number of iterations maxiter defaulting to 100. Implement the exact method of steepest descent, using a one-dimensional optimization method to choose the step size (use opt.minimize_scalar() or your own 1-D minimizer). Iterate until ‖Df(x_k)‖ < tol or k > maxiter. Return the approximate minimizer x*, whether or not the algorithm converged (True or False), and the number of iterations computed.

Test your function on f(x, y, z) = x^4 + y^4 + z^4 (easy) and the Rosenbrock function (hard). It should take many iterations to minimize the Rosenbrock function, but it should converge eventually with a large enough choice of maxiter.

The Conjugate Gradient Method

Unfortunately, the method of steepest descent can be very inefficient for certain problems. Depending on the nature of the objective function, the sequence of points can zig-zag back and forth or get stuck on flat areas without making significant progress toward the true minimizer.

Figure 12.1: On this surface, gradient descent takes an extreme number of iterations (28,903 here) to converge to the minimum because it gets stuck in the flat basins of the surface.

Unlike the method of steepest descent, the conjugate gradient algorithm chooses a search direction that is guaranteed to be a descent direction, though not the direction of greatest descent. These directions are chosen using a generalized form of orthogonality called conjugacy.
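A minimal sketch of Problem 1's exact steepest descent, using opt.minimize_scalar() for the line search (the function name steepest_descent and the exact return convention here are our own choices):

```python
import numpy as np
from scipy import optimize as opt

def steepest_descent(f, Df, x0, tol=1e-5, maxiter=100):
    """Exact steepest descent (12.2): at each step, perform an exact
    line search along the negative gradient direction -Df(x_k)^T.
    Returns (approximate minimizer, converged flag, iterations used)."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxiter):
        g = Df(x)                                   # gradient Df(x_k)^T
        if np.linalg.norm(g, np.inf) < tol:
            return x, True, k
        # Exact line search: alpha_k = argmin_a f(x_k - a * Df(x_k)^T).
        alpha = opt.minimize_scalar(lambda a: f(x - a * g)).x
        x = x - alpha * g
    return x, np.linalg.norm(Df(x), np.inf) < tol, maxiter

# On the strongly convex quadratic f(x) = ||x||^2, the exact line search
# lands on the minimizer x* = 0 in a single step.
f = lambda x: x @ x
Df = lambda x: 2 * x
minimizer, converged, iterations = steepest_descent(f, Df, np.array([1., 2., 3.]))
```

The same routine should then handle f(x, y, z) = x^4 + y^4 + z^4 and, with a generous maxiter, the Rosenbrock function.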

Let Q be a square, positive definite matrix. A set of vectors {x_0, x_1, ..., x_m} is called Q-conjugate if each distinct pair of vectors x_i, x_j satisfies x_i^T Q x_j = 0. A Q-conjugate set of vectors is linearly independent and can form a basis that diagonalizes the matrix Q. This guarantees that an iterative method to solve Qx = b requires only as many steps as there are basis vectors.

Solving a positive definite system Qx = b is valuable in and of itself for certain problems, but it is also equivalent to minimizing certain functions. Specifically, consider the quadratic function

    f(x) = (1/2) x^T Q x − b^T x + c.

Because Df(x)^T = Qx − b, minimizing f is the same as solving the equation

    0 = Df(x)^T = Qx − b,

that is, the original linear system Qx = b. Note that the constant c does not affect the minimizer, since if x* minimizes f(x) it also minimizes f(x) + c.

Using the conjugate directions guarantees that an iterative method converges to the minimizer, because each iteration minimizes the objective function over a subspace of dimension equal to the iteration number. Thus, after n steps, where n is the number of conjugate basis vectors, the algorithm has found a minimizer over the entire space. In certain situations, this has a great advantage over gradient descent, which can bounce back and forth. This comparison is illustrated in Figure 12.2. Additionally, because the method utilizes a basis of conjugate vectors, the previous search direction can be used to find a conjugate projection onto the next subspace, saving computational time.

Figure 12.2: Paths traced by Gradient Descent (orange, 90 iterations) and Conjugate Gradient (red, 2 iterations) on a quadratic surface. Notice the zig-zagging nature of the Gradient Descent path, as opposed to the Conjugate Gradient path, which finds the minimizer in 2 steps.
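This equivalence is easy to check numerically. The sketch below uses a small positive definite Q (example values of our choosing), minimizes the quadratic directly, and compares against la.solve:

```python
import numpy as np
from scipy import linalg as la
from scipy import optimize as opt

# A small positive definite matrix and right-hand side (arbitrary example).
Q = np.array([[2., 0.],
              [0., 4.]])
b = np.array([1., 8.])

# The quadratic f(x) = (1/2) x^T Q x - b^T x, dropping the constant c,
# which does not affect the minimizer.
f = lambda x: 0.5 * x @ Q @ x - b @ x

# Minimizing f numerically recovers the solution of Qx = b.
x_min = opt.minimize(f, np.zeros(2)).x
x_sol = la.solve(Q, b)          # exact solution [0.5, 2.]
```

Both approaches agree to several digits, illustrating that minimizing the quadratic and solving the linear system are the same problem.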

Algorithm 12.1
1:  procedure ConjugateGradient(x_0, Q, b, tol)
2:      r_0 ← Q x_0 − b
3:      d_0 ← −r_0
4:      k ← 0
5:      while ‖r_k‖ ≥ tol and k < n do
6:          α_k ← r_k^T r_k / (d_k^T Q d_k)
7:          x_{k+1} ← x_k + α_k d_k
8:          r_{k+1} ← r_k + α_k Q d_k
9:          β_{k+1} ← r_{k+1}^T r_{k+1} / (r_k^T r_k)
10:         d_{k+1} ← −r_{k+1} + β_{k+1} d_k
11:         k ← k + 1
12:     return x_{k+1}

The points x_k are the successive approximations to the minimizer, the vectors d_k are the conjugate descent directions, and the vectors r_k (which correspond, up to sign, to the steepest descent directions) are used in determining the conjugate directions. The constants α_k and β_k are used, respectively, in the line search and in ensuring the Q-conjugacy of the descent directions.

Problem 2. Write a function that accepts an n × n positive definite matrix Q, a vector b ∈ R^n, an initial guess x_0 ∈ R^n, and a stopping tolerance. Use Algorithm 12.1 to solve the system Qx = b. Continue the algorithm until ‖r_k‖ is less than the tolerance, iterating no more than n times. Return the solution x, whether or not the algorithm converged in n iterations or less, and the number of iterations computed.

Test your function on the simple system

    Q = [ 2  0 ]        b = [ 1 ]
        [ 0  4 ],           [ 8 ],

which has solution x* = [1/2, 2]^T. This is equivalent to minimizing the quadratic function f(x, y) = x^2 + 2y^2 − x − 8y; check that your function from Problem 1 gets the same solution.

More generally, you can generate a random positive definite matrix Q for testing by setting Q = A^T A for any A of full rank. Note that for larger values of n (n ≥ 5 or so) this method is not stable enough to always converge in exactly n iterations. Try using the code given below to test your function for small values of n.

>>> import numpy as np
>>> from scipy import linalg as la

# Generate Q, b, and the initial guess x0.
>>> n = 4
>>> A = np.random.random((n, n))
>>> Q = A.T @ A
>>> b, x0 = np.random.random((2, n))

>>> x = la.solve(Q, b)
>>> np.allclose(Q @ x, b)
True

# Use your function here.
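Algorithm 12.1 translates almost line for line into NumPy. Here is a sketch (the name conjugate_gradient and the return convention are our own), checked on the small system from Problem 2 above:

```python
import numpy as np

def conjugate_gradient(Q, b, x0, tol=1e-8):
    """Algorithm 12.1: solve Qx = b for an n x n positive definite Q.
    Returns (solution, converged flag, iterations used)."""
    n = len(b)
    x = np.asarray(x0, dtype=float)
    r = Q @ x - b                      # r_0 = Q x_0 - b
    d = -r                             # d_0 = -r_0
    k = 0
    while np.linalg.norm(r) >= tol and k < n:
        Qd = Q @ d
        alpha = (r @ r) / (d @ Qd)     # alpha_k = r_k^T r_k / d_k^T Q d_k
        x = x + alpha * d
        r_new = r + alpha * Qd         # r_{k+1} = r_k + alpha_k Q d_k
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d          # d_{k+1} = -r_{k+1} + beta_{k+1} d_k
        r = r_new
        k += 1
    return x, np.linalg.norm(r) < tol, k

# The simple 2 x 2 system from Problem 2: the solution is [1/2, 2].
Q = np.array([[2., 0.], [0., 4.]])
b = np.array([1., 8.])
x, converged, iterations = conjugate_gradient(Q, b, np.zeros(2))
```

As expected for a 2 x 2 system, the routine reaches the exact solution in at most two iterations.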

Non-linear Conjugate Gradient

The algorithm presented above is only valid for certain linear systems and quadratic functions, but the basic strategy may be adapted to minimize more general convex or non-linear functions. Though the non-linear version does not have guaranteed convergence as the linear formulation does, it can still converge in fewer iterations than the method of steepest descent. Modifying the algorithm for more general functions requires new formulas for α_k, r_k, and β_k.

• The scalar α_k is simply the result of performing a line search in the given direction d_k and is thus defined α_k = argmin_α f(x_k + α d_k).

• The vector r_k in the original algorithm was really just the negative gradient of the objective function, so now define r_k = −Df(x_k)^T.

• The constants β_k can be defined in various ways, and the most correct choice depends on the nature of the objective function. A well-known formula, attributed to Fletcher and Reeves, is β_k = Df(x_k) Df(x_k)^T / (Df(x_{k−1}) Df(x_{k−1})^T).

Algorithm 12.2
1:  procedure NonlinearConjugateGradient(f, Df, x_0, tol, maxiter)
2:      r_0 ← −Df(x_0)^T
3:      d_0 ← r_0
4:      α_0 ← argmin_α f(x_0 + α d_0)
5:      x_1 ← x_0 + α_0 d_0
6:      k ← 1
7:      while ‖r_k‖ ≥ tol and k < maxiter do
8:          r_k ← −Df(x_k)^T
9:          β_k ← r_k^T r_k / (r_{k−1}^T r_{k−1})
10:         d_k ← r_k + β_k d_{k−1}
11:         α_k ← argmin_α f(x_k + α d_k)
12:         x_{k+1} ← x_k + α_k d_k
13:         k ← k + 1

Problem 3. Write a function that accepts a convex objective function f, its derivative Df, an initial guess x_0, a convergence tolerance defaulting to 1e-5, and a maximum number of iterations defaulting to 100. Use Algorithm 12.2 to compute the minimizer x* of f. Return the approximate minimizer, whether or not the algorithm converged, and the number of iterations computed.

Compare your function to SciPy's opt.fmin_cg().

>>> opt.fmin_cg(opt.rosen, np.array([10, 10]), fprime=opt.rosen_der)
Optimization terminated successfully.

         Current function value: 0.000000
         Iterations: 44
         Function evaluations: 102        # Much faster than steepest descent!
         Gradient evaluations: 102
array([ 1.00000007,  1.00000015])

Regression Problems

A major use of the conjugate gradient method is solving linear least squares problems. Recall that a least squares problem can be formulated as an optimization problem:

    x* = argmin_x ‖Ax − b‖_2,

where A is an m × n matrix with full column rank, x ∈ R^n, and b ∈ R^m. The solution can be calculated analytically, and is given by

    x* = (A^T A)^{-1} A^T b.

In other words, the minimizer solves the linear system

    A^T A x = A^T b.                                            (12.3)

Since A has full column rank, Az ≠ 0 for any non-zero vector z. Therefore z^T A^T A z = ‖Az‖^2 > 0, so A^T A is positive definite. As A^T A is positive definite, conjugate gradient can be used to solve Equation 12.3.

Linear least squares is the mathematical underpinning of linear regression. Linear regression involves a set of real-valued data points {y_1, ..., y_m}, where each y_i is paired with a corresponding set of predictor variables {x_{i,1}, x_{i,2}, ..., x_{i,n}} with n < m. The linear regression model posits that

    y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ··· + β_n x_{i,n} + ε_i

for i = 1, 2, ..., m. The real numbers β_0, ..., β_n are known as the parameters of the model, and the ε_i are independent, normally-distributed error terms. The goal of linear regression is to calculate the parameters that best fit the data. This can be accomplished by posing the problem in terms of linear least squares. Define

    b = [y_1, y_2, ..., y_m]^T,

    A = [ 1  x_{1,1}  x_{1,2}  ···  x_{1,n} ]
        [ 1  x_{2,1}  x_{2,2}  ···  x_{2,n} ]
        [ ⋮     ⋮        ⋮              ⋮   ]
        [ 1  x_{m,1}  x_{m,2}  ···  x_{m,n} ],

    x = [β_0, β_1, ..., β_n]^T.

The solution x* = [β_0, β_1, ..., β_n]^T to the system A^T A x = A^T b gives the parameters that best fit the data. These values can be understood as defining the hyperplane that best fits the data.
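Looking back at Algorithm 12.2, the Problem 3 routine can be sketched with Fletcher-Reeves updates. The name nonlinear_cg and the use of opt.minimize_scalar() for the line searches are our own choices; this is a sketch, not a hardened implementation:

```python
import numpy as np
from scipy import optimize as opt

def nonlinear_cg(f, Df, x0, tol=1e-5, maxiter=100):
    """Algorithm 12.2: non-linear conjugate gradient with Fletcher-Reeves
    updates. Returns (approximate minimizer, converged flag, iterations)."""
    x = np.asarray(x0, dtype=float)
    r = -Df(x)                                     # r_0 = -Df(x_0)^T
    d = r.copy()                                   # d_0 = r_0
    alpha = opt.minimize_scalar(lambda a: f(x + a * d)).x
    x = x + alpha * d                              # x_1
    k = 1
    while np.linalg.norm(r) >= tol and k < maxiter:
        r_new = -Df(x)
        beta = (r_new @ r_new) / (r @ r)           # Fletcher-Reeves beta_k
        d = r_new + beta * d
        alpha = opt.minimize_scalar(lambda a: f(x + a * d)).x
        x = x + alpha * d
        r = r_new
        k += 1
    return x, np.linalg.norm(r) < tol, k

# Fletcher-Reeves on the Rosenbrock function; from a distant starting point
# it may need a larger maxiter than the default to converge fully.
x, converged, k = nonlinear_cg(opt.rosen, opt.rosen_der, np.array([10., 10.]))
```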

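Since A^T A is positive definite, any linear conjugate gradient solver can be applied to the normal equations (12.3). As a sketch with made-up data, using scipy.sparse.linalg.cg as a stand-in for the Problem 2 routine:

```python
import numpy as np
from scipy.sparse.linalg import cg

# Made-up regression data: m observations, n predictor variables.
rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.random((m, n))
b = X @ np.array([2., -1., 0.5]) + 3. + 0.01 * rng.standard_normal(m)

# A has a leading column of ones so that beta_0 is the intercept.
A = np.column_stack([np.ones(m), X])

# Solve the normal equations A^T A x = A^T b by conjugate gradient.
beta, info = cg(A.T @ A, A.T @ b)       # info == 0 indicates success

# The result matches the direct least squares solution.
beta_direct = np.linalg.lstsq(A, b, rcond=None)[0]
```

The recovered parameters agree with np.linalg.lstsq and lie close to the coefficients used to generate the data.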
Figure 12.3: Solving the linear regression problem results in a best-fit hyperplane.

Problem 4. Using your function from Problem 2, solve the linear regression problem specified by the data contained in the file linregression.txt. This is a whitespace-delimited text file formatted so that the i-th row consists of y_i, x_{i,1}, ..., x_{i,n}. Use np.loadtxt() to load in the data and return the solution to the normal equations. (Source: Statistical Reference Datasets website, trd/lls/data/LINKS/)

Logistic Regression

Logistic regression is another important technique in statistical analysis and machine learning that builds off of the concepts of linear regression. As in linear regression, there is a set of predictor variables {x_{i,1}, x_{i,2}, ..., x_{i,n}}_{i=1}^m with corresponding outcome variables {y_i}_{i=1}^m. In logistic regression, the outcome variables y_i are binary and can be modeled by a sigmoidal relationship. The value of the predicted y_i can be thought of as the probability that y_i = 1. In mathematical terms,

    P(y_i = 1 | x_{i,1}, ..., x_{i,n}) = p_i,

where

    p_i = 1 / (1 + exp(−(β_0 + β_1 x_{i,1} + ··· + β_n x_{i,n}))).

The parameters of the model are the real numbers β_0, β_1, ..., β_n. Note that p_i ∈ (0, 1) regardless of the values of the predictor variables and parameters.

The probability of observing the outcome variables y_i under this model, assuming they are independent, is given by the likelihood function L: R^{n+1} → R,

    L(β_0, ..., β_n) = ∏_{i=1}^m p_i^{y_i} (1 − p_i)^{1 − y_i}.
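For instance, with one predictor and three observations (the numbers below are arbitrary, chosen only for illustration), the model probabilities and the likelihood can be computed directly:

```python
import numpy as np

# Arbitrary parameters (beta_0, beta_1) and a tiny made-up data set.
beta = np.array([0.5, -1.0])
x = np.array([0., 1., 2.])
y = np.array([1, 0, 0])

# p_i = 1 / (1 + exp(-(beta_0 + beta_1 * x_i))) always lies in (0, 1).
p = 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))

# Likelihood of observing these labels, assuming independence.
L = np.prod(p**y * (1 - p)**(1 - y))    # about 0.317 for these numbers
```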

The goal of logistic regression is to find the parameters β_0, ..., β_n that maximize this likelihood function. Thus, the problem can be written as

    max_{(β_0,...,β_n)} L(β_0, ..., β_n).

Maximizing this function is often a numerically unstable calculation. Thus, to make the objective function more suitable, the logarithm of the objective function may be maximized instead, because the logarithm is strictly monotone increasing. Taking the log and turning the problem into a minimization, the final problem is formulated as

    min_{(β_0,...,β_n)} −log L(β_0, ..., β_n).

A few lines of calculation reveal that this objective function can also be rewritten as

    −log L(β_0, ..., β_n) = Σ_{i=1}^m log(1 + exp(−(β_0 + β_1 x_{i,1} + ··· + β_n x_{i,n})))
                            + Σ_{i=1}^m (1 − y_i)(β_0 + β_1 x_{i,1} + ··· + β_n x_{i,n}).

The values for the parameters {β_i}_{i=0}^n that we obtain are known as the maximum likelihood estimate (MLE). To find the MLE, conjugate gradient can be used to minimize the objective function.

For a one-dimensional binary logistic regression problem, we have predictor data {x_i}_{i=1}^m with labels {y_i}_{i=1}^m where each y_i ∈ {0, 1}. The negative log likelihood then becomes the following.

    −log L(β_0, β_1) = Σ_{i=1}^m [ log(1 + e^{−(β_0 + β_1 x_i)}) + (1 − y_i)(β_0 + β_1 x_i) ]      (12.4)

Problem 5. Write a class for doing binary logistic regression in one dimension that implements the following methods.

1. fit(): accept an array x ∈ R^n of data, an array y ∈ R^n of labels (0s and 1s), and an initial guess β_0 ∈ R^2. Define the negative log likelihood function as given in (12.4), then minimize it (with respect to β) with your function from Problem 3 or opt.fmin_cg(). Store the resulting parameters β_0 and β_1 as attributes.

2. predict(): accept a float x ∈ R and calculate

    σ(x) = 1 / (1 + exp(−(β_0 + β_1 x))),

where β_0 and β_1 are the optimal values calculated in fit(). The value σ(x) is the probability that the observation x should be assigned the label y = 1.

This class does not need an explicit constructor. You may assume that predict() will be called after fit().
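A condensed sketch of the Problem 5 class, tried on a few made-up points. The class name LogisticRegression1D is our own, and opt.fmin_cg() stands in for the Problem 3 routine:

```python
import numpy as np
from scipy import optimize as opt

class LogisticRegression1D:
    """Binary logistic regression with a single predictor variable."""

    def fit(self, x, y, guess):
        # Negative log likelihood (12.4); logaddexp(0, -z) computes
        # log(1 + e^{-z}) without overflow.
        def neg_log_likelihood(beta):
            z = beta[0] + beta[1] * x
            return np.sum(np.logaddexp(0, -z) + (1 - y) * z)
        self.b0, self.b1 = opt.fmin_cg(neg_log_likelihood, guess, disp=False)

    def predict(self, x):
        # sigma(x): probability that the observation x has label y = 1.
        return 1 / (1 + np.exp(-(self.b0 + self.b1 * x)))

# Made-up data where the label 1 grows less likely as x increases.
x = np.array([55., 60., 65., 70., 75., 80.])
y = np.array([1., 1., 0., 1., 0., 0.])
model = LogisticRegression1D()
model.fit(x, y, guess=np.array([1., -0.05]))
```

Once fit, model.predict(60.) should exceed model.predict(80.), reflecting the downward trend in the data.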

Problem 6. On January 28, 1986, less than two minutes into the Challenger space shuttle's 10th mission, there was a large explosion that originated from the spacecraft, killing all seven crew members and destroying the shuttle. The investigation that followed concluded that the malfunction was caused by damage to O-rings that are used as seals for parts of the rocket engines. There were 24 space shuttle missions before this disaster, some of which had noted some O-ring damage. Given the data, could this disaster have been predicted?

The file challenger.npy contains data for 23 missions (during one of the 24 missions, the engine was lost at sea). The first column (x) contains the ambient temperature, in Fahrenheit, of the shuttle launch. The second column (y) contains a binary indicator of the presence of O-ring damage (1 if O-ring damage was present, 0 otherwise).

Instantiate your class from Problem 5 and fit it to the data, using an initial guess of β_0 = [20, −1]^T. Plot the resulting curve σ(x) for x ∈ [30, 100], along with the raw data. Return the predicted probability (according to this model) of O-ring damage on the day the shuttle was launched, given that it was 31°F.

[Figure: "Probability of O-Ring Damage": the fitted curve σ(x) plotted against temperature (30 to 90°F), with the previous-damage data points and P(Damage) at the launch temperature marked.]
