A SERIES OF LECTURES GIVEN AT TSINGHUA UNIVERSITY


APPROXIMATE DYNAMIC PROGRAMMING

A SERIES OF LECTURES GIVEN AT TSINGHUA UNIVERSITY, JUNE 2014

DIMITRI P. BERTSEKAS

Based on the books:
(1) “Neuro-Dynamic Programming,” by DPB and J. N. Tsitsiklis, Athena Scientific, 1996
(2) “Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming,” by DPB, Athena Scientific, 2012
(3) “Abstract Dynamic Programming,” by DPB, Athena Scientific, 2013

http://www.athenasc.com

For a fuller set of slides, see http://web.mit.edu/dimitrib/www/publ.html

APPROXIMATE DYNAMIC PROGRAMMING
BRIEF OUTLINE I

- Our subject:
  - Large-scale DP based on approximations and in part on simulation. This has been a research area of great interest for the last 25 years, known under various names (e.g., reinforcement learning, neuro-dynamic programming)
  - Emerged through an enormously fruitful cross-fertilization of ideas from artificial intelligence and optimization/control theory
  - Deals with control of dynamic systems under uncertainty, but applies more broadly (e.g., discrete deterministic optimization)
  - A vast range of applications in control theory, operations research, artificial intelligence, and beyond ...
  - The subject is broad, with a rich variety of theory/math, algorithms, and applications. Our focus will be mostly on algorithms ... less on theory and modeling

APPROXIMATE DYNAMIC PROGRAMMING
BRIEF OUTLINE II

- Our aim:
  - A state-of-the-art account of some of the major topics at a graduate level
  - Show how to use approximation and simulation to address the dual curses of DP: dimensionality and modeling
- Our 6-lecture plan:
  - Two lectures on exact DP with emphasis on infinite horizon problems and issues of large-scale computational methods
  - One lecture on general issues of approximation and simulation for large-scale problems
  - One lecture on approximate policy iteration based on temporal differences (TD)/projected equations/Galerkin approximation
  - One lecture on aggregation methods
  - One lecture on Q-learning, and other methods, such as approximation in policy space

APPROXIMATE DYNAMIC PROGRAMMING
LECTURE 1

LECTURE OUTLINE

- Introduction to DP and approximate DP
- Finite horizon problems
- The DP algorithm for finite horizon problems
- Infinite horizon problems
- Basic theory of discounted infinite horizon problems

DP AS AN OPTIMIZATION METHODOLOGY

- Generic optimization problem:
    min_{u ∈ U} g(u)
  where u is the optimization/decision variable, g(u) is the cost function, and U is the constraint set
- Categories of problems:
  - Discrete (U is finite) or continuous
  - Linear (g is linear and U is polyhedral) or nonlinear
  - Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter w, which is averaged, i.e., it has the form
      g(u) = E_w{ G(u, w) },
    where w is a random parameter.
- DP deals with multistage stochastic problems
  - Information about w is revealed in stages
  - Decisions are also made in stages and make use of the available information
  - Its methodology is “different”

BASIC STRUCTURE OF STOCHASTIC DP

- Discrete-time system
    x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, ..., N-1
  - k: Discrete time
  - x_k: State; summarizes past information that is relevant for future optimization
  - u_k: Control; decision to be selected at time k from a given set
  - w_k: Random parameter (also called “disturbance” or “noise” depending on the context)
  - N: Horizon or number of times control is applied
- Cost function that is additive over time
    E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
- Alternative system description: P(x_{k+1} | x_k, u_k)
    x_{k+1} = w_k   with   P(w_k | x_k, u_k) = P(x_{k+1} | x_k, u_k)

INVENTORY CONTROL EXAMPLE

[Figure: inventory system with stock x_k at period k, order u_k placed at period k, and demand w_k at period k; stock at period k+1 is x_{k+1} = x_k + u_k - w_k, and the cost of period k is c u_k + r(x_k + u_k - w_k)]

- Discrete-time system
    x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k
- Cost function that is additive over time (a simulation sketch follows this slide)
    E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) } = E{ Σ_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }
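As an illustration of the closed-loop viewpoint, here is a minimal Python sketch that simulates the inventory system under a fixed feedback (order-up-to) policy and estimates its expected additive cost by Monte Carlo. The ordering cost c, the penalty function r(·), the demand distribution, and the order-up-to level are assumptions chosen for the example, not data from the slides.

import random

c = 1.0                                  # unit ordering cost (assumed)
def r(y): return 2.0 * abs(y)            # assumed penalty on end-of-period stock
N = 10                                   # horizon (assumed)

def f(x, u, w):
    """System dynamics: x_{k+1} = x_k + u_k - w_k."""
    return x + u - w

def stage_cost(x, u, w):
    """Cost of period k: c*u_k + r(x_k + u_k - w_k)."""
    return c * u + r(x + u - w)

def expected_cost(policy, x0, num_runs=1000):
    """Monte Carlo estimate of E{ sum_k g_k(x_k, mu_k(x_k), w_k) } for a feedback policy."""
    total = 0.0
    for _ in range(num_runs):
        x = x0
        for k in range(N):
            u = policy(x)
            w = random.randint(0, 3)     # assumed demand distribution
            total += stage_cost(x, u, w)
            x = f(x, u, w)
    return total / num_runs

# Example feedback policy mu(x): order up to a target stock level of 2
print(expected_cost(lambda x: max(0, 2 - x), x0=0))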

ADDITIONAL ASSUMPTIONS

- The probability distribution of w_k does not depend on past values w_{k-1}, ..., w_0, but may depend on x_k and u_k
  - Otherwise past values of w, x, or u would be useful for future optimization
- The constraint set from which u_k is chosen at time k depends at most on x_k, not on prior x or u
- Optimization over policies (also called feedback control laws): These are rules/functions
    u_k = µ_k(x_k),   k = 0, ..., N-1
  that map state/inventory to control/order (closed-loop optimization, use of feedback)
- MAJOR DISTINCTION: We minimize over sequences of functions (mapping inventory to order)
    {µ_0, µ_1, ..., µ_{N-1}}
  NOT over sequences of controls/orders
    {u_0, u_1, ..., u_{N-1}}

GENERIC FINITE-HORIZON PROBLEM

- System x_{k+1} = f_k(x_k, u_k, w_k), k = 0, ..., N-1
- Control constraints u_k ∈ U_k(x_k)
- Probability distribution P_k(· | x_k, u_k) of w_k
- Policies π = {µ_0, ..., µ_{N-1}}, where µ_k maps states x_k into controls u_k = µ_k(x_k) and is such that µ_k(x_k) ∈ U_k(x_k) for all x_k
- Expected cost of π starting at x_0 is
    J_π(x_0) = E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, µ_k(x_k), w_k) }
- Optimal cost function
    J*(x_0) = min_π J_π(x_0)
- Optimal policy π* satisfies
    J_{π*}(x_0) = J*(x_0)
  When produced by DP, π* is independent of x_0.

PRINCIPLE OF OPTIMALITY

- Let π* = {µ_0*, µ_1*, ..., µ_{N-1}*} be an optimal policy
- Consider the “tail subproblem” whereby we are at x_k at time k and wish to minimize the “cost-to-go” from time k to time N,
    E{ g_N(x_N) + Σ_{ℓ=k}^{N-1} g_ℓ(x_ℓ, µ_ℓ(x_ℓ), w_ℓ) },
  and the “tail policy” {µ_k*, µ_{k+1}*, ..., µ_{N-1}*}

[Figure: timeline from 0 to N, with the tail subproblem starting at state x_k at time k]

- Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)
- DP solves ALL the tail subproblems
- At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length

DP ALGORITHM

- Computes for all k and states x_k:
    J_k(x_k): optimal cost of the tail problem starting at x_k
- Initial condition:
    J_N(x_N) = g_N(x_N)
  Go backwards, k = N-1, ..., 0, using
    J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}{ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) }
- To solve the tail subproblem at time k, minimize the sum of the kth-stage cost and the optimal cost of the next tail problem, starting from the next state at time k+1 (see the code sketch after this slide)
- Then J_0(x_0), generated at the last step, is equal to the optimal cost J*(x_0). Also, the policy
    π* = {µ_0*, ..., µ_{N-1}*},
  where µ_k*(x_k) minimizes the right side above for each x_k and k, is optimal
- Proof by induction
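A minimal Python sketch of the backward DP recursion above, applied to a finite-state version of the inventory example. The state/control/demand ranges, the cost parameters, and the zero terminal cost are assumptions chosen for illustration.

STATES = range(-5, 11)                     # assumed bounded stock levels
CONTROLS = range(0, 6)                     # assumed order sizes
DEMANDS = {0: 0.25, 1: 0.5, 2: 0.25}       # assumed demand distribution P(w)
N, c = 10, 1.0
def r(y): return 2.0 * abs(y)              # assumed penalty on end-of-period stock
def f(x, u, w): return max(min(x + u - w, 10), -5)   # dynamics, clipped to STATES
def g_terminal(x): return 0.0              # assumed terminal cost g_N

J = {x: g_terminal(x) for x in STATES}     # initial condition: J_N = g_N
policy = []                                # will hold mu_0, ..., mu_{N-1}
for k in reversed(range(N)):               # go backwards, k = N-1, ..., 0
    J_new, mu = {}, {}
    for x in STATES:
        best_u, best_val = None, float("inf")
        for u in CONTROLS:
            # E_w{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }
            val = sum(p * (c * u + r(x + u - w) + J[f(x, u, w)])
                      for w, p in DEMANDS.items())
            if val < best_val:
                best_u, best_val = u, val
        J_new[x], mu[x] = best_val, best_u
    J, policy = J_new, [mu] + policy

print(J[0])            # optimal N-stage cost J_0(0) starting from stock 0
print(policy[0][0])    # optimal first-period order mu_0(0)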

PRACTICAL DIFFICULTIES OF DP

- The curse of dimensionality
  - Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
  - Quick explosion of the number of states in combinatorial problems
- The curse of modeling
  - Sometimes a simulator of the system is easier to construct than a model
- There may be real-time solution constraints
  - A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
  - The problem data may change as the system is controlled - need for on-line replanning
- All of the above are motivations for approximation and simulation

A MAJOR IDEA: COST APPROXIMATION

- Use a policy computed from the DP equation where the optimal cost-to-go function J_{k+1} is replaced by an approximation J~_{k+1}.
- Apply µ~_k(x_k), which attains the minimum in
    min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J~_{k+1}( f_k(x_k, u_k, w_k) ) }
  (a code sketch of this one-step lookahead follows this slide)
- Some approaches:
  (a) Problem Approximation: Use as J~_k the cost derived from a related but simpler problem
  (b) Parametric Cost-to-Go Approximation: Use as J~_k a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (we will mostly focus on this)
      - This is a major portion of Reinforcement Learning/Neuro-Dynamic Programming
  (c) Rollout Approach: Use as J~_k the cost of some suboptimal policy, which is calculated either analytically or by simulation
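A minimal Python sketch of this one-step lookahead with an approximate cost-to-go. The quadratic form of J~ and the control/demand sets are assumptions chosen for illustration; in practice J~ would come from one of approaches (a)-(c).

CONTROLS = range(0, 6)
DEMANDS = {0: 0.25, 1: 0.5, 2: 0.25}       # assumed demand distribution
c = 1.0
def r(y): return 2.0 * abs(y)              # assumed stage-cost penalty
def f(x, u, w): return x + u - w           # inventory dynamics

def J_tilde(x):
    """Assumed cost-to-go approximation J~_{k+1}(x), here a simple quadratic."""
    return 0.5 * x * x

def lookahead_control(x):
    """mu~_k(x): attains min_u E{ g_k(x,u,w) + J~_{k+1}(f_k(x,u,w)) }."""
    return min(CONTROLS,
               key=lambda u: sum(p * (c * u + r(x + u - w) + J_tilde(f(x, u, w)))
                                 for w, p in DEMANDS.items()))

print(lookahead_control(0))   # control applied at stock level 0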

ROLLOUT ALGORITHMS

- At each k and state x_k, use the control µ_k(x_k) that attains the minimum in
    min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J~_{k+1}( f_k(x_k, u_k, w_k) ) },
  where J~_{k+1} is the cost-to-go of some heuristic policy (called the base policy). A simulation-based sketch follows this slide.
- Cost improvement property: The rollout algorithm achieves no worse (and usually much better) cost than the base policy, starting from the same state.
- Main difficulty: Calculating J~_{k+1}(x) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
  - May involve Monte Carlo simulation if the problem is stochastic.
  - Things improve in the deterministic case (an important application is discrete optimization).
  - Connection w/ Model Predictive Control (MPC).
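A minimal Python sketch of one rollout step, in which the base policy's cost-to-go is estimated by Monte Carlo simulation and used as J~_{k+1} in the lookahead. The base policy, demand distribution, costs, and simulation lengths are assumptions chosen for illustration.

import random

CONTROLS = range(0, 6)
c = 1.0
def r(y): return 2.0 * abs(y)                              # assumed penalty
def f(x, u, w): return x + u - w                           # inventory dynamics
def sample_demand(): return random.choice([0, 1, 1, 2])    # assumed demand
def base_policy(x): return max(0, 1 - x)                   # assumed heuristic

def base_cost_to_go(x, steps, num_runs=200):
    """Monte Carlo estimate of the base policy's cost-to-go from state x."""
    total = 0.0
    for _ in range(num_runs):
        xs, cost = x, 0.0
        for _ in range(steps):
            u, w = base_policy(xs), sample_demand()
            cost += c * u + r(xs + u - w)
            xs = f(xs, u, w)
        total += cost
    return total / num_runs

def rollout_control(x, steps_to_go, num_w_samples=20):
    """Rollout control: one-step lookahead with the base policy's cost-to-go."""
    def q(u):
        samples = [sample_demand() for _ in range(num_w_samples)]
        return sum(c * u + r(x + u - w) + base_cost_to_go(f(x, u, w), steps_to_go - 1)
                   for w in samples) / num_w_samples
    return min(CONTROLS, key=q)

print(rollout_control(0, steps_to_go=5))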

INFINITE HORIZON PROBLEMS

- Same as the basic problem, but:
  - The number of stages is infinite.
  - The system is stationary.
- Total cost problems: Minimize
    J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...}{ Σ_{k=0}^{N-1} α^k g(x_k, µ_k(x_k), w_k) }
  - Discounted problems (α < 1, bounded g)
  - Stochastic shortest path problems (α = 1, finite-state system with a termination state) - we will discuss sparingly
  - Discounted and undiscounted problems with unbounded cost per stage - we will not cover
  - Average cost problems - we will not cover
- Infinite horizon characteristics:
  - Challenging analysis, elegance of solutions and algorithms
  - Stationary policies π = {µ, µ, ...} and stationary forms of DP play a special role

DISCOUNTED PROBLEMS / BOUNDED COST

- Stationary system
    x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, ...
- Cost of a policy π = {µ_0, µ_1, ...}
    J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...}{ Σ_{k=0}^{N-1} α^k g(x_k, µ_k(x_k), w_k) }
  with α < 1, and g bounded [for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)]
- Optimal cost function: J*(x) = min_π J_π(x)
- Boundedness of g guarantees that all costs are well-defined and bounded: |J_π(x)| ≤ M/(1-α)
- All spaces are arbitrary - only boundedness of g is important (there are math fine points, e.g., measurability, but they don't matter in practice)
- Important special case: All underlying spaces finite; a (finite spaces) Markovian Decision Problem or MDP
- All algorithms ultimately work with a finite-spaces MDP approximating the original problem

SHORTHAND NOTATION FOR DP MAPPINGS

- For any function J of x, denote
    (T J)(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + α J( f(x, u, w) ) },   ∀ x
- T J is the optimal cost function for the one-stage problem with stage cost g and terminal cost function α J.
- T operates on bounded functions of x to produce other bounded functions of x
- For any stationary policy µ, denote
    (T_µ J)(x) = E_w{ g(x, µ(x), w) + α J( f(x, µ(x), w) ) },   ∀ x
  (a finite-MDP code sketch of T and T_µ follows this slide)
- The critical structure of the problem is captured in T and T_µ
- The entire theory of discounted problems can be developed in shorthand using T and T_µ
- True for many other DP problems. T and T_µ provide a powerful unifying framework for DP. This is the essence of the book “Abstract Dynamic Programming”
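A minimal Python sketch of the mappings T and T_µ for a small finite-spaces MDP, with transitions given by P(y | x, u) and expected stage costs g(x, u). The numerical data and the discount factor are assumptions chosen for illustration.

import numpy as np

alpha = 0.9                                # assumed discount factor
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # assumed P[u=0][x][y]
              [[0.5, 0.5], [0.1, 0.9]]])   # assumed P[u=1][x][y]
g = np.array([[1.0, 2.0],                  # assumed g[x=0][u]
              [0.5, 3.0]])                 # assumed g[x=1][u]
n_states, n_controls = g.shape

def T(J):
    """(T J)(x) = min_u [ g(x,u) + alpha * sum_y P(y | x,u) J(y) ]."""
    Q = g + alpha * np.array([[P[u, x] @ J for u in range(n_controls)]
                              for x in range(n_states)])
    return Q.min(axis=1)

def T_mu(J, mu):
    """(T_mu J)(x) = g(x,mu(x)) + alpha * sum_y P(y | x,mu(x)) J(y)."""
    return np.array([g[x, mu[x]] + alpha * P[mu[x], x] @ J
                     for x in range(n_states)])

J = np.zeros(n_states)
print(T(J))                  # one application of T to J = 0
print(T_mu(J, mu=[0, 1]))    # one application of T_mu for a fixed policy mu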

FINITE-HORIZON COST EXPRESSIONS

- Consider an N-stage policy π_0^N = {µ_0, µ_1, ..., µ_{N-1}} with a terminal cost J:
    J_{π_0^N}(x_0) = E{ α^N J(x_N) + Σ_{ℓ=0}^{N-1} α^ℓ g(x_ℓ, µ_ℓ(x_ℓ), w_ℓ) }
                   = E{ g(x_0, µ_0(x_0), w_0) + α J_{π_1^N}(x_1) }
                   = (T_{µ_0} J_{π_1^N})(x_0)
  where π_1^N = {µ_1, µ_2, ..., µ_{N-1}}
- By induction we have
    J_{π_0^N}(x) = (T_{µ_0} T_{µ_1} ··· T_{µ_{N-1}} J)(x),   ∀ x
- For a stationary policy µ the N-stage cost function (with terminal cost J) is
    J_{π_0^N} = T_µ^N J,
  where T_µ^N is the N-fold composition of T_µ
- Similarly the optimal N-stage cost function (with terminal cost J) is T^N J
- T^N J = T(T^{N-1} J) is just the DP algorithm

“SHORTHAND” THEORY – A SUMMARY

- Infinite horizon cost function expressions [with J_0(x) ≡ 0]:
    J_π(x) = lim_{N→∞} (T_{µ_0} T_{µ_1} ··· T_{µ_{N-1}} J_0)(x),   J_µ(x) = lim_{N→∞} (T_µ^N J_0)(x)
- Bellman's equation: J* = T J*,   J_µ = T_µ J_µ
- Optimality condition:
    µ: optimal   <==>   T_µ J* = T J*
- Value iteration: For any (bounded) J,
    J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
- Policy iteration: Given µ^k (a code sketch of both methods follows this slide),
  - Policy evaluation: Find J_{µ^k} by solving
      J_{µ^k} = T_{µ^k} J_{µ^k}
  - Policy improvement: Find µ^{k+1} such that
      T_{µ^{k+1}} J_{µ^k} = T J_{µ^k}
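A minimal Python sketch of value iteration and policy iteration for a small finite-spaces MDP, in the shorthand above: value iteration repeats J ← T J, and policy iteration alternates policy evaluation J_µ = T_µ J_µ with policy improvement T_{µ'} J_µ = T J_µ. The MDP data below are assumptions chosen for illustration; for a finite MDP, policy evaluation reduces to solving the linear system (I - α P_µ) J = g_µ.

import numpy as np

alpha = 0.9                                # assumed discount factor
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # assumed P[u=0][x][y]
              [[0.5, 0.5], [0.1, 0.9]]])   # assumed P[u=1][x][y]
g = np.array([[1.0, 2.0],                  # assumed g[x=0][u]
              [0.5, 3.0]])                 # assumed g[x=1][u]
n_states, n_controls = g.shape

def bellman_Q(J):
    """Q(x,u) = g(x,u) + alpha * sum_y P(y | x,u) J(y)."""
    return g + alpha * np.array([[P[u, x] @ J for u in range(n_controls)]
                                 for x in range(n_states)])

def value_iteration(tol=1e-8):
    """Iterate J <- T J until the sup-norm change is below tol."""
    J = np.zeros(n_states)
    while True:
        J_new = bellman_Q(J).min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new

def policy_iteration():
    """Alternate exact policy evaluation (solve J = T_mu J) and policy improvement."""
    mu = np.zeros(n_states, dtype=int)
    while True:
        P_mu = np.array([P[mu[x], x] for x in range(n_states)])
        g_mu = np.array([g[x, mu[x]] for x in range(n_states)])
        J = np.linalg.solve(np.eye(n_states) - alpha * P_mu, g_mu)   # policy evaluation
        mu_new = bellman_Q(J).argmin(axis=1)                         # greedy: T_{mu_new} J = T J
        if np.array_equal(mu_new, mu):
            return J, mu
        mu = mu_new

print(value_iteration())
print(policy_iteration())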

TWO KEY PROPERTIES

- Monotonicity property: For any J and J′ such that J(x) ≤ J′(x) for all x, and any µ,
    (T J)(x) ≤ (T J′)(x),   ∀ x,
    (T_µ J)(x) ≤ (T_µ J′)(x),   ∀ x.
- Constant Shift property: For any J, any scalar r, and any µ,
    ( T(J + r e) )(x) = (T J)(x) + α r,   ∀ x,
    ( T_µ(J + r e) )(x) = (T_µ J)(x) + α r,   ∀ x,
  where e is the unit function [e(x) ≡ 1].
- Monotonicity is present in all DP models (undiscounted, etc.)
- Constant shift is special to discounted models
- Discounted problems have another property of major importance: T and T_µ are contraction mappings (we will show this later)

CONVERGENCE OF VALUE ITERATION

- For all bounded J,
    J*(x) = lim_{k→∞} (T^k J)(x),   for all x
- Proof: For simplicity we give the proof for J ≡ 0. For any initial state x_0 and policy π = {µ_0, µ_1, ...},
    J_π(x_0) = E{ Σ_{ℓ=0}^{∞} α^ℓ g(x_ℓ, µ_ℓ(x_ℓ), w_ℓ) }
             = E{ Σ_{ℓ=0}^{k-1} α^ℓ g(x_ℓ, µ_ℓ(x_ℓ), w_ℓ) } + E{ Σ_{ℓ=k}^{∞} α^ℓ g(x_ℓ, µ_ℓ(x_ℓ), w_ℓ) }
  The tail portion satisfies
    | E{ Σ_{ℓ=k}^{∞} α^ℓ g(x_ℓ, µ_ℓ(x_ℓ), w_ℓ) } | ≤ α^k M / (1-α),
  where M ≥ |g(x, u, w)|. Take the min over π of both sides, then the limit as k → ∞. Q.E.D.

BELLMAN’S EQUATION

- The optimal cost function J* is a solution of Bellman's equation, J* = T J*, i.e., for all x,
    J*(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + α J*( f(x, u, w) ) }
- Proof: For all x and k,
    J*(x) - α^k M/(1-α) ≤ (T^k J_0)(x) ≤ J*(x) + α^k M/(1-α),
  where J_0(x) ≡ 0 and M ≥ |g(x, u, w)|. Applying T to this relation, and using Monotonicity and Constant Shift,
    (T J*)(x) - α^{k+1} M/(1-α) ≤ (T^{k+1} J_0)(x) ≤ (T J*)(x) + α^{k+1} M/(1-α)
  Taking the limit as k → ∞ and using the fact that
    lim_{k→∞} (T^{k+1} J_0)(x) = J*(x),
  we obtain J* = T J*. Q.E.D.

THE CONTRACTION PROPERTY

- Contraction property: For any bounded functions J and J′, and any µ,
    max_x | (T J)(x) - (T J′)(x) | ≤ α max_x | J(x) - J′(x) |,
    max_x | (T_µ J)(x) - (T_µ J′)(x) | ≤ α max_x | J(x) - J′(x) |.
- Proof: Denote c = max_{x ∈ S} | J(x) - J′(x) |. Then
    J(x) - c ≤ J′(x) ≤ J(x) + c,   ∀ x
  Apply T to both sides, and use the Monotonicity and Constant Shift properties:
    (T J)(x) - α c ≤ (T J′)(x) ≤ (T J)(x) + α c,   ∀ x
  Hence
    | (T J)(x) - (T J′)(x) | ≤ α c,   ∀ x.   Q.E.D.
- Note: This implies that J* is the unique solution of J = T J, and J_µ is the unique solution of J = T_µ J
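- Added note (a standard consequence, not spelled out on this slide): uniqueness follows because if J = T J and J′ = T J′, the contraction property gives max_x | J(x) - J′(x) | ≤ α max_x | J(x) - J′(x) |, which forces J = J′ since α < 1; applying the contraction k times likewise gives the geometric error bound max_x | (T^k J)(x) - J*(x) | ≤ α^k max_x | J(x) - J*(x) | for value iteration.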

NEC. AND SUFFICIENT OPT. CONDITION

- A stationary policy µ is optimal if and only if µ(x) attains the minimum in Bellman's equation for each x; i.e.,
    T J* = T_µ J*,
  or, equivalently, for all x,
    µ(x) ∈ arg min_{u ∈ U(x)} E_w{ g(x, u, w) + α J*( f(x, u, w) ) }
- Proof: If T J* = T_µ J*, then using Bellman's equation (J* = T J*), we have
    J* = T_µ J*,
  so by uniqueness of the fixed point of T_µ, we obtain J* = J_µ; i.e., µ is optimal.
- Conversely, if the stationary policy µ is optimal, we have J* = J_µ, so
    J* = T_µ J*.
  Combining this with Bellman's equation (J* = T J*), we obtain T J* = T_µ J*. Q.E.D.

MIT OpenCourseWare
http://ocw.mit.edu

6.231 Dynamic Programming and Stochastic Control
Fall 2011

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
