10-725: Optimization Fall 2012 Lecture 5: Gradient Descent Revisited


10-725: Optimization, Fall 2012
Lecture 5: Gradient Descent Revisited
Lecturer: Geoff Gordon/Ryan Tibshirani
Scribes: Cong Lu/Yu Zhao

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

5.1 Choose step size

Recall that we have $f : \mathbb{R}^n \to \mathbb{R}$, convex and differentiable. We want to solve

$$\min_{x \in \mathbb{R}^n} f(x),$$

i.e., to find $x^\star$ such that $f(x^\star) = \min_x f(x)$.

Gradient descent: choose an initial $x^{(0)} \in \mathbb{R}^n$ and repeat

$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

Stop at some point (when to stop depends heavily on the problem at hand).

Figure 5.1 shows an example in which we cannot always continue, and the outcome depends on where we start: if we start at a spot somewhere between the purple and orange, the iterates stay there and go nowhere.

[Figure 5.1]

At each iteration, consider the expansion

$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|^2.$$

We can use a quadratic approximation, replacing the usual $\nabla^2 f(x)$ by $\frac{1}{t} I$. Then we have

$$f(x) + \nabla f(x)^T (y - x),$$

which is a linear approximation to $f$, and

$$\frac{1}{2t} \|y - x\|^2,$$

which is a proximity term to $x$, with weight $\frac{1}{2t}$.

Then, choose the next point $y = x^+$ to minimize the quadratic approximation:

$$x^+ = x - t \nabla f(x),$$

as shown in Figure 5.2.

[Figure 5.2: the blue point is $x$, the red point is $x^+$.]

5.1.1 Fixed step size

Simply take $t_k = t$ for all $k = 1, 2, 3, \ldots$. This can diverge if $t$ is too big. Consider $f(x) = (10 x_1^2 + x_2^2)/2$; Figure 5.3 shows gradient descent after 8 steps. It can also be slow if $t$ is too small: for the same example, Figure 5.4 shows gradient descent after 100 steps, and Figure 5.5 shows gradient descent after 40 appropriately sized steps. Convergence analysis will give us a better idea of which step size is just right.

5.1.2 Backtracking line search

Adaptively choose the step size. First, fix a parameter $0 < \beta < 1$. Then at each iteration, start with $t = 1$, and while

$$f(x - t \nabla f(x)) > f(x) - \frac{t}{2} \|\nabla f(x)\|^2,$$

update $t = \beta t$, as shown in Figure 5.6 (from B & V page 465); for us $\Delta x = -\nabla f(x)$ and $\alpha = 1/2$.

Backtracking line search is simple and works pretty well in practice. Figure 5.7 shows that backtracking picks up roughly the right step size (13 steps) for the same example, with $\beta = 0.8$ (B & V recommend $\beta \in (0.1, 0.8)$).
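To make the two step-size rules concrete, here is a minimal sketch in plain Python on the example $f(x) = (10 x_1^2 + x_2^2)/2$. The starting point and the specific fixed step sizes (0.25 and 0.001) are assumptions chosen for illustration; they are not the values behind the figures in these notes.

```python
def f(x):
    """Example objective from the notes: f(x) = (10*x1^2 + x2^2) / 2."""
    return (10.0 * x[0] ** 2 + x[1] ** 2) / 2.0


def grad_f(x):
    """Gradient of f: (10*x1, x2)."""
    return [10.0 * x[0], x[1]]


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(vi * vi for vi in v)


def gd_fixed(x0, t, steps):
    """Gradient descent with a fixed step size t."""
    x = list(x0)
    for _ in range(steps):
        g = grad_f(x)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x


def gd_backtracking(x0, beta, steps):
    """Gradient descent with backtracking line search (alpha = 1/2)."""
    x = list(x0)
    for _ in range(steps):
        g = grad_f(x)
        t = 1.0
        # Shrink t while the sufficient-decrease condition fails.
        while f([xi - t * gi for xi, gi in zip(x, g)]) > f(x) - (t / 2.0) * sq_norm(g):
            t *= beta
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x


x0 = [10.0, 10.0]  # hypothetical starting point, not taken from the notes
x_big = gd_fixed(x0, t=0.25, steps=8)        # assumed "too big" step: iterates grow
x_small = gd_fixed(x0, t=0.001, steps=100)   # assumed "too small" step: slow progress
x_bt = gd_backtracking(x0, beta=0.8, steps=13)
print(f(x0), f(x_big), f(x_small), f(x_bt))
```

For this quadratic the gradient is Lipschitz with $L = 10$, so a fixed step above $2/L = 0.2$ makes the first coordinate grow in magnitude at every iteration, while backtracking settles on a step size near $1/L$ without knowing $L$.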

[Figure 5.3: fixed step size, $t$ too big. Gradient descent on $f(x) = (10 x_1^2 + x_2^2)/2$ after 8 steps.]

[Figure 5.4: fixed step size, $t$ too small. Same example, gradient descent after 100 steps.]

5.1.3 Exact line search

At each iteration, do the best we can along the direction of the gradient:

$$t = \operatorname*{argmin}_{s \ge 0} f(x - s \nabla f(x)).$$

Usually, it is not possible to do this minimization exactly. Approximations to exact line search are often not much more efficient than backtracking, so they are usually not worth the effort.
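The quadratic example is one of the special cases where the minimization over $s$ can be done exactly. Writing $f(x) = \frac{1}{2} x^T Q x$ with $Q = \mathrm{diag}(10, 1)$ and $g = \nabla f(x)$, setting the derivative of $s \mapsto f(x - s g)$ to zero gives $s = g^T g / (g^T Q g)$. The sketch below uses that closed form; the starting point is a hypothetical choice, and the formula is specific to this quadratic rather than a general recipe.

```python
def f(x):
    """Example objective from the notes: f(x) = (10*x1^2 + x2^2) / 2."""
    return (10.0 * x[0] ** 2 + x[1] ** 2) / 2.0


def grad_f(x):
    """Gradient of f: (10*x1, x2)."""
    return [10.0 * x[0], x[1]]


def exact_step(x):
    """Minimizer over s >= 0 of f(x - s * grad_f(x)), valid for this quadratic only."""
    g = grad_f(x)
    gg = g[0] ** 2 + g[1] ** 2            # g^T g
    gQg = 10.0 * g[0] ** 2 + g[1] ** 2    # g^T Q g with Q = diag(10, 1)
    return gg / gQg if gQg > 0.0 else 0.0


def gd_exact(x0, steps):
    """Gradient descent where every step size comes from exact line search."""
    x = list(x0)
    for _ in range(steps):
        g = grad_f(x)
        t = exact_step(x)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x


x0 = [10.0, 10.0]  # hypothetical starting point
t0 = exact_step(x0)
x1 = gd_exact(x0, 1)
print(t0, f(x0), f(x1))
```

The step size always lies between $1/10$ and $1$ here, that is, between the reciprocals of the largest and smallest eigenvalues of $Q$.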

[Figure 5.5: same example, gradient descent after 40 appropriately sized steps.]

Interpretation: "This porridge is too hot! Too cold! Juuussst right." Convergence analysis later will give us a better idea.

[Figure 5.6: backtracking line search (from B & V page 465). For us $\Delta x = -\nabla f(x)$, $\alpha = 1/2$.]

5.2 Convergence analysis

5.2.1 Convergence analysis for fixed step size

Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, and additionally

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for any } x, y,$$

i.e., $\nabla f$ is Lipschitz continuous with constant $L > 0$.

Theorem 5.1 Gradient descent with fixed step size $t \le 1/L$ satisfies

$$f(x^{(k)}) - f(x^\star) \le \frac{\|x^{(0)} - x^\star\|^2}{2tk}.$$

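[Figure 5.7: backtracking picks up roughly the right step size (13 steps) on the same example. Here $\beta = 0.8$ (B & V recommend $\beta \in (0.1, 0.8)$).]

That is, gradient descent has convergence rate $O(1/k)$: to get $f(x^{(k)}) - f(x^\star) \le \epsilon$, we need $O(1/\epsilon)$ iterations.

Proof: Since $\nabla f$ is Lipschitz with constant $L$, which means $\nabla^2 f \preceq LI$, we have for all $x, y, z$

$$(x - y)^T (\nabla^2 f(z) - LI)(x - y) \le 0,$$

which means

$$L \|x - y\|^2 \ge (x - y)^T \nabla^2 f(z) (x - y).$$

By Taylor's Remainder Theorem, for all $x, y$ there exists $z \in [x, y]$ such that

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (x - y)^T \nabla^2 f(z) (x - y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2. \tag{5.1}$$

Plugging in $y = x^+ = x - t \nabla f(x)$,

$$f(x^+) \le f(x) + \nabla f(x)^T (x - t \nabla f(x) - x) + \frac{L}{2} \|x - t \nabla f(x) - x\|^2 = f(x) - \left(1 - \frac{Lt}{2}\right) t \|\nabla f(x)\|^2.$$

Taking $0 < t \le 1/L$, so that $1 - Lt/2 \ge 1/2$, we have

$$f(x^+) \le f(x) - \frac{t}{2} \|\nabla f(x)\|^2. \tag{5.2}$$

Since $f$ is convex, $f(x) \le f(x^\star) + \nabla f(x)^T (x - x^\star)$, so we have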

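$$\begin{aligned}
f(x^+) &\le f(x) - \frac{t}{2} \|\nabla f(x)\|^2 \\
&\le f(x^\star) + \nabla f(x)^T (x - x^\star) - \frac{t}{2} \|\nabla f(x)\|^2 \\
&= f(x^\star) + \frac{1}{2t} \left( \|x - x^\star\|^2 - \|x - x^\star - t \nabla f(x)\|^2 \right) \\
&= f(x^\star) + \frac{1}{2t} \left( \|x - x^\star\|^2 - \|x^+ - x^\star\|^2 \right).
\end{aligned} \tag{5.3}$$

Summing over iterations, the right-hand side telescopes and we have

$$\sum_{i=1}^{k} \left( f(x^{(i)}) - f(x^\star) \right) \le \frac{1}{2t} \left( \|x^{(0)} - x^\star\|^2 - \|x^{(k)} - x^\star\|^2 \right) \le \frac{1}{2t} \|x^{(0)} - x^\star\|^2. \tag{5.4}$$

From (5.2), we can see that $f(x^{(k)})$ is nonincreasing. Then we have

$$f(x^{(k)}) - f(x^\star) \le \frac{1}{k} \sum_{i=1}^{k} \left( f(x^{(i)}) - f(x^\star) \right) \le \frac{\|x^{(0)} - x^\star\|^2}{2tk}.$$

5.2.2 Convergence analysis for backtracking

For backtracking, the assumptions are the same: $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, and $\nabla f$ is Lipschitz continuous with constant $L > 0$.

But we don't have to choose a step size that is smaller than or equal to $1/L$ to begin with. We get the same rate just by assuming that the gradient is Lipschitz.

Theorem 5.2 Gradient descent with backtracking line search satisfies

$$f(x^{(k)}) - f(x^\star) \le \frac{\|x^{(0)} - x^\star\|^2}{2 t_{\min} k},$$

where $t_{\min} = \min\{1, \beta/L\}$.

So gradient descent with backtracking also has convergence rate $O(1/k)$. The constants are the same as before, except that, since $t$ is adapted at each iteration, we replace $t$ by $t_{\min}$.

If $\beta$ is not very tiny, then we don't lose much compared to a fixed step size ($\beta/L$ vs. $1/L$).

The proof is very similar to the proof of the fixed step size theorem.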
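As a sanity check on the fixed step size bound of Theorem 5.1, the following sketch runs gradient descent with $t = 1/L$ on the earlier example $f(x) = (10 x_1^2 + x_2^2)/2$ and compares the gap $f(x^{(k)}) - f(x^\star)$ with $\|x^{(0)} - x^\star\|^2 / (2tk)$. It assumes $L = 10$, $x^\star = 0$ and $f(x^\star) = 0$ (all true for this particular quadratic) and a hypothetical starting point; a numerical check on one problem illustrates the bound, it does not prove it.

```python
def f(x):
    """Example objective from the notes: f(x) = (10*x1^2 + x2^2) / 2."""
    return (10.0 * x[0] ** 2 + x[1] ** 2) / 2.0


def grad_f(x):
    """Gradient of f: (10*x1, x2)."""
    return [10.0 * x[0], x[1]]


L = 10.0                 # Lipschitz constant of the gradient for this f
t = 1.0 / L              # fixed step size allowed by Theorem 5.1
x0 = [10.0, 10.0]        # hypothetical starting point
dist0_sq = x0[0] ** 2 + x0[1] ** 2  # ||x0 - x*||^2 with x* = 0

x = list(x0)
gaps = []    # f(x_k) - f(x*), with f(x*) = 0
bounds = []  # right-hand side of Theorem 5.1 at iteration k
for k in range(1, 51):
    g = grad_f(x)
    x = [xi - t * gi for xi, gi in zip(x, g)]
    gaps.append(f(x))
    bounds.append(dist0_sq / (2.0 * t * k))

for k in (1, 10, 50):
    print(k, gaps[k - 1], bounds[k - 1])
```

On this well-behaved problem the gap sits far below the $O(1/k)$ bound, which is consistent with the faster rate available under strong convexity.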

5.2.3 Convergence analysis for strong convexity

There is also a convergence statement under strong convexity. Strong convexity is the condition that the smallest eigenvalue of the Hessian matrix of $f$ is uniformly bounded away from zero for all $x$, which means that for some $d > 0$,

$$\nabla^2 f(x) \succeq dI, \quad \forall x.$$

Then the function has a better lower bound than the one from ordinary convexity:

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{d}{2} \|y - x\|^2, \quad \forall x, y.$$

Strong convexity adds a quadratic term to the usual linear lower bound. If a function satisfies both the strong convexity and the Lipschitz assumptions, it is bounded both below and above by quadratics. We can then say stronger things about it, since the function is well behaved.

Theorem 5.3 Gradient descent with fixed step size $t \le 2/(d + L)$ or with backtracking line search satisfies

$$f(x^{(k)}) - f(x^\star) \le c^k \frac{L}{2} \|x^{(0)} - x^\star\|^2,$$

where $0 < c < 1$.

The proof is in the textbook.

Under the strong convexity and Lipschitz assumptions, the theorem gives a rate better than $1/k$: the rate is $O(c^k)$, which is exponentially fast. It is called linear convergence because, if we plot iterations on the x-axis and the difference in function values on the y-axis on a log scale, the plot looks like a straight line. If we want $f(x^{(k)}) - f(x^\star) \le \epsilon$, we need $O(\log(1/\epsilon))$ iterations.

The constant $c$ depends adversely on the condition number $L/d$. If the condition number is very high, the rate is slower.

5.2.4 How realistic are these conditions?

How realistic is Lipschitz continuity of $\nabla f$? This means $\nabla^2 f(x) \preceq LI$.

For example, consider linear regression

$$f(x) = \frac{1}{2} \|y - Ax\|^2.$$

Then

$$\nabla f(x) = -A^T (y - Ax), \qquad \nabla^2 f(x) = A^T A.$$

Take $L = \sigma_{\max}^2(A) = \|A\|^2$; then $\nabla^2 f(x) = A^T A \preceq LI$, so $\nabla f$ is Lipschitz with constant $L$. We can then choose a fixed step size no larger than $1/L$, or use backtracking line search, to get a convergence rate of $O(1/k)$.

How realistic is strong convexity of $f$? Recall this is $\nabla^2 f(x) \succeq dI$.

This is much harder to satisfy in practice. Again consider

$$f(x) = \frac{1}{2} \|y - Ax\|^2.$$

Now we need $d = \sigma_{\min}^2(A)$.

If $A$ is wide, then $\sigma_{\min}(A) = 0$, and $f$ can't be strongly convex.

Even if $\sigma_{\min}(A) > 0$, we can still have a very large condition number $L/d = \sigma_{\max}^2(A) / \sigma_{\min}^2(A)$.

5.3 Practicalities

5.3.1 Stopping rule

We can basically stop when the gradient norm $\|\nabla f(x)\|$ is small. This is reasonable because $\nabla f(x^\star) = 0$: if $\|\nabla f(x)\|$ is small, we expect $f(x)$ to be close to the minimum $f(x^\star)$.

If $f$ is strongly convex with parameter $d$, then

$$\|\nabla f(x)\| \le \sqrt{2 d \epsilon} \implies f(x) - f(x^\star) \le \epsilon.$$

5.3.2 Pros and cons

Pros: It is a simple idea, and each iteration is cheap. It is very fast for well-conditioned, strongly convex problems.

Cons: It is often slow, because interesting problems aren't strongly convex or well-conditioned. It can't handle nondifferentiable functions.

5.4 Forward stagewise regression

5.4.1 Forward stagewise regression

Let's go back to the linear regression function

$$f(x) = \frac{1}{2} \|y - Ax\|^2,$$

where $A$ is $n \times p$ and its columns $A_1, \ldots, A_p$ are predictor variables.

Forward stagewise regression is the algorithm below.

Start with $x^{(0)} = 0$, repeat:

Find the variable $i$ such that $|A_i^T r|$ is largest, for $r = y - A x^{(k-1)}$ (largest absolute correlation with the residual).

Update $x_i^{(k)} = x_i^{(k-1)} + \gamma \cdot \mathrm{sign}(A_i^T r)$.

Here $\gamma > 0$ is small and fixed, and is called the learning rate.

In each iteration, forward stagewise regression updates just one of the variables in $x$, by a small amount $\gamma$.

5.4.2 Steepest descent

Steepest descent is a close cousin of gradient descent; it just changes the choice of norm.

Suppose $q, r$ are complementary: $1/q + 1/r = 1$.

Steepest descent updates $x^+ = x + t \cdot \Delta x$, where

$$\Delta x = \|\nabla f(x)\|_r \cdot u, \qquad u = \operatorname*{argmin}_{\|v\|_q \le 1} \nabla f(x)^T v.$$

If $q = 2$, then $\Delta x = -\nabla f(x)$, which is exactly gradient descent.

If $q = 1$, then $\Delta x = -\frac{\partial f(x)}{\partial x_i} \cdot e_i$, where

$$\left| \frac{\partial f}{\partial x_i}(x) \right| = \max_{j = 1, \ldots, n} \left| \frac{\partial f}{\partial x_j}(x) \right| = \|\nabla f(x)\|_\infty.$$

Normalized steepest descent just takes $\Delta x = u$ (unit $q$-norm).

5.4.3 Equivalence

Normalized steepest descent with the 1-norm: the updates are

$$x_i^+ = x_i - t \cdot \mathrm{sign}\left( \frac{\partial f}{\partial x_i}(x) \right),$$

where $i$ is the largest component of $\nabla f(x)$ in absolute value.

Compare forward stagewise: the updates are

$$x_i^+ = x_i + \gamma \cdot \mathrm{sign}(A_i^T r), \quad r = y - Ax.$$

Recall that here $\nabla f(x) = -A^T (y - Ax)$, so

$$\frac{\partial f}{\partial x_i}(x) = -A_i^T (y - Ax).$$

Forward stagewise regression is therefore exactly normalized steepest descent under the 1-norm, with $t = \gamma$.

5.4.4 Early stopping and regularization

Forward stagewise is like a slower version of forward stepwise. If we stop early, i.e., don't continue all the way to the least squares solution, then we get a sparse approximation. Can this be used as a form of regularization?

Recall the lasso problem:

$$\min_{\|x\|_1 \le t} \frac{1}{2} \|y - Ax\|^2.$$

The solution $x^\star(t)$, as a function of $t$, also exhibits varying amounts of regularization.

For some problems (some $y$, $A$), with a small enough step size, the forward stagewise iterates trace out the lasso solution path.
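The forward stagewise algorithm of Section 5.4.1 and the effect of early stopping can be sketched in a few lines of plain Python. The matrix `A`, the response `y`, the learning rate and the step counts below are hypothetical toy values chosen so the behavior is easy to follow; the function name `forward_stagewise` is also ours, not from the lecture.

```python
def forward_stagewise(A, y, gamma, steps):
    """Forward stagewise regression; A is a list of n rows with p entries each."""
    n, p = len(A), len(A[0])
    x = [0.0] * p
    for _ in range(steps):
        # Residual r = y - A x.
        r = [y[j] - sum(A[j][i] * x[i] for i in range(p)) for j in range(n)]
        # Correlation A_i^T r of every column with the residual.
        corr = [sum(A[j][i] * r[j] for j in range(n)) for i in range(p)]
        # Coordinate with the largest absolute correlation.
        i = max(range(p), key=lambda c: abs(corr[c]))
        if corr[i] == 0.0:
            break  # zero gradient: already at a least squares solution
        x[i] += gamma if corr[i] > 0 else -gamma
    return x


# Toy data (assumed): orthonormal columns, least squares solution (2.0, 0.5).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
y = [2.0, 0.5, 0.0]
gamma = 0.1

x_early = forward_stagewise(A, y, gamma, steps=10)   # stopped early
x_late = forward_stagewise(A, y, gamma, steps=200)   # run for a long time
print(x_early, x_late)
```

Stopped after 10 steps, only the first coefficient has moved and the second is still exactly zero, which is the sparse approximation mentioned in Section 5.4.4. Run for longer, the iterates end up within about $\gamma$ of the least squares solution and then oscillate around it, since each update has fixed size $\gamma$.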
