Prediction and Control by Dynamic Programming - CS60077: Reinforcement Learning


Prediction and Control by Dynamic Programming
CS60077: Reinforcement Learning
Abir Das, IIT Kharagpur
Aug 8, 9, 29, 30, Sep 05, 2019

Agenda
§ Understand how to evaluate policies using dynamic programming based methods
§ Understand policy iteration and value iteration algorithms for control of MDPs
§ Existence and convergence of solutions obtained by the above methods

Resources
§ Reinforcement Learning by David Silver [Link]
§ Reinforcement Learning by Balaraman Ravindran [Link]
§ SB: Chapter 4

Dynamic Programming
“Life can only be understood going backwards, but it must be lived going forwards.” - S. Kierkegaard, Danish philosopher. This is the first line of the famous book by Dimitri P. Bertsekas. Image taken from: amazon.com

Dynamic Programming
§ Dynamic Programming (DP), in this course, refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
§ DP has limited utility due to the ‘perfect model’ assumption and due to computational expense.
§ But DP methods are still important as they provide an essential foundation for many of the subsequent methods.
§ Many of those methods can be viewed as attempts to achieve much the same effect as DP with less computation and without the perfect-model assumption on the environment.
§ The key idea in DP is to use value functions and Bellman equations to organize and structure the search for good policies.


Dynamic Programming
§ Dynamic Programming addresses a bigger problem by breaking it down into subproblems and then
  ▸ Solving the subproblems
  ▸ Combining solutions to the subproblems
§ Dynamic Programming is based on the principle of optimality.

[Figure: timeline from $0$ to $N$ showing an optimal action sequence $a_0^*, \dots, a_{N-1}^*$, with the tail subproblem starting at state $s_k^*$ at time $k$.]

Principle of Optimality
Let $\{a_0^*, a_1^*, \dots, a_{N-1}^*\}$ be an optimal action sequence with a corresponding state sequence $\{s_1^*, s_2^*, \dots, s_N^*\}$. Consider the tail subproblem that starts at $s_k^*$ at time $k$ and maximizes the ‘reward to go’ from $k$ to $N$ over $\{a_k, \dots, a_{N-1}\}$. Then the tail optimal action sequence $\{a_k^*, \dots, a_{N-1}^*\}$ is optimal for the tail subproblem.

Requirements for Dynamic Programming
§ Optimal substructure, i.e., the principle of optimality applies.
§ Overlapping subproblems, i.e., subproblems recur many times and solutions to these subproblems can be cached and reused.
§ MDPs satisfy both through Bellman equations and value functions.
§ Dynamic programming is used to solve many other problems, e.g., scheduling algorithms, graph algorithms (e.g., shortest path algorithms), bioinformatics, etc.


Planning by Dynamic Programming
§ Planning by dynamic programming assumes full knowledge of the MDP
§ For prediction/evaluation
  ▸ Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$
  ▸ Output: value function $v_\pi$
§ For control
  ▸ Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
  ▸ Output: optimal value function $v_*$ and optimal policy $\pi_*$

Iterative Policy Evaluation
§ Problem: Policy evaluation: compute the state-value function $v_\pi$ for an arbitrary policy $\pi$.
§ Solution strategy: iterative application of the Bellman expectation equation.
§ Recall the Bellman expectation equation:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\} \qquad (1)$$
§ Consider a sequence of approximate value functions $v^{(0)}, v^{(1)}, v^{(2)}, \dots$, each mapping $\mathcal{S}$ to $\mathbb{R}$. Each successive approximation is obtained by using eqn. (1) as an update rule:
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$

Iterative Policy Evaluation
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$
§ In code, this can be implemented using two arrays - one for the old values $v^{(k)}(s)$ and the other for the new values $v^{(k+1)}(s)$. Here, the new values $v^{(k+1)}(s)$ are computed one by one from the old values $v^{(k)}(s)$ without changing the old values.
§ Another way is to use one array and update the values ‘in place’, i.e., each new value immediately overwrites the old one.
§ Both of these converge to the true value $v_\pi$, and the ‘in place’ algorithm usually converges faster.

Iterative Policy Evaluation
Iterative policy evaluation, for estimating $V \approx v_\pi$
Input: $\pi$, the policy to be evaluated
Algorithm parameter: a small threshold $\theta > 0$ determining the accuracy of estimation
Initialize $V(s)$, for all $s \in \mathcal{S}$, arbitrarily, except that $V(\text{terminal}) = 0$
Loop:
  $\Delta \leftarrow 0$
  Loop for each $s \in \mathcal{S}$:
    $v \leftarrow V(s)$
    $V(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid s) \big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, V(s') \big\}$
    $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$
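The boxed procedure translates almost directly into code. Below is a minimal in-place sketch in Python, assuming the model is supplied as nested dictionaries; the names states, actions, pi, r and p are illustrative choices, not from the slides.

```python
def iterative_policy_evaluation(states, actions, pi, r, p, gamma=0.9, theta=1e-6):
    """In-place iterative policy evaluation (sketch).

    pi[s][a]  : probability of taking action a in state s
    r[s][a]   : expected one-step reward for (s, a)
    p[s][a][t]: transition probability of landing in state t from (s, a)
    Terminal states should be absorbing with zero reward so their value stays 0.
    """
    V = {s: 0.0 for s in states}            # arbitrary initialization
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Bellman expectation backup for state s (in place)
            V[s] = sum(pi[s][a] * (r[s][a] +
                       gamma * sum(p[s][a][t] * V[t] for t in states))
                       for a in actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                   # stopping criterion from the box above
            return V
```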

Evaluating a Random Policy in the Small Gridworld
Figure credit: [SB] chapter 4
§ Undiscounted episodic MDP ($\gamma = 1$)
§ Non-terminal states are $\mathcal{S} = \{1, 2, \dots, 14\}$
§ Two terminal states (shown as shaded squares)
§ 4 possible actions in each state, $\mathcal{A} = \{\text{up}, \text{down}, \text{right}, \text{left}\}$
§ Deterministic state transitions
§ Actions leading out of the grid leave the state unchanged
§ Reward is -1 until the terminal state is reached
§ The agent follows the uniform random policy $\pi(\text{n} \mid \cdot) = \pi(\text{s} \mid \cdot) = \pi(\text{e} \mid \cdot) = \pi(\text{w} \mid \cdot) = 0.25$
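For concreteness, here is a small self-contained sketch that evaluates the uniform random policy on this gridworld. It assumes states are numbered 0-15 row by row with 0 and 15 terminal, which matches [SB] Figure 4.1 but is otherwise an implementation choice.

```python
import numpy as np

def gridworld_random_policy_eval(theta=1e-4):
    """Evaluate the uniform random policy on the 4x4 gridworld (gamma = 1, reward -1 per step)."""
    V = np.zeros(16)                               # states 0..15; 0 and 15 are terminal
    moves = {'up': -4, 'down': +4, 'right': +1, 'left': -1}

    def step(s, a):
        s2 = s + moves[a]
        # moves that would leave the grid keep the state unchanged
        if a == 'up' and s < 4: s2 = s
        if a == 'down' and s > 11: s2 = s
        if a == 'left' and s % 4 == 0: s2 = s
        if a == 'right' and s % 4 == 3: s2 = s
        return s2

    while True:
        delta = 0.0
        for s in range(16):
            if s in (0, 15):                       # terminal states keep value 0
                continue
            v_old = V[s]
            V[s] = sum(0.25 * (-1 + V[step(s, a)]) for a in moves)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V.reshape(4, 4)
```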

Evaluating a Random Policy in the Small Gridworld
[Figure: the sequence of value functions produced by iterative policy evaluation on the small gridworld, together with the greedy policy corresponding to each value-function estimate. Figure credit: [SB] chapter 4]


Improving a Policy: Policy Iteration
§ Given a policy $\pi$
  ▸ Evaluate the policy (iterate until $v^{(k)} \to v_\pi$):
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$
  ▸ Improve the policy by acting greedily with respect to $v_\pi$: $\pi' = \text{greedy}(v_\pi)$. Being greedy means choosing the action that will land the agent in the best state, i.e.,
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s,a) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}$$
§ In the Small Gridworld the improved policy was optimal, $\pi' = \pi_*$
§ In general, more iterations of improvement/evaluation are needed
§ But this process of policy iteration always converges to $\pi_*$
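A minimal sketch of the greedy improvement step, reusing the hypothetical nested-dictionary model from the policy-evaluation sketch above:

```python
def greedy_policy(states, actions, r, p, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    pi_new = {}
    for s in states:
        # one-step lookahead: q(s, a) = r(s, a) + gamma * sum_t p(t|s,a) V(t)
        q = {a: r[s][a] + gamma * sum(p[s][a][t] * V[t] for t in states)
             for a in actions}
        pi_new[s] = max(q, key=q.get)       # arg max over actions
    return pi_new
```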

Improving a Policy: Policy Iteration
Given a policy $\pi$
§ Evaluate the policy:
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}
= \underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s,a)}_{r_\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} \underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s)\, p(s' \mid s,a)}_{p_\pi(s' \mid s)} v^{(k)}(s')
= r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s' \mid s)\, v^{(k)}(s')$$
  ▸ $r_\pi(s)$ = one-step expected reward for following policy $\pi$ at state $s$.
  ▸ $p_\pi(s' \mid s)$ = one-step transition probability under policy $\pi$.
§ Improve the policy by acting greedily with respect to $v_\pi$:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}$$
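For the vectorized notation $r_\pi$ and $p_\pi(\cdot \mid s)$ introduced here, a small numpy helper can assemble them from the model; the array layouts below are assumptions, not part of the slides.

```python
import numpy as np

def policy_reward_and_transition(pi, r, p):
    """Build r_pi with shape (|S|,) and P_pi with shape (|S|, |S|) from
    pi: (|S|, |A|), r: (|S|, |A|), p: (|S|, |A|, |S|)."""
    r_pi = np.einsum('sa,sa->s', pi, r)      # r_pi(s)   = sum_a pi(a|s) r(s,a)
    P_pi = np.einsum('sa,sat->st', pi, p)    # p_pi(t|s) = sum_a pi(a|s) p(t|s,a)
    return r_pi, P_pi
```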

Policy Iteration
Figure credit: [David Silver: DeepMind]
§ Policy Evaluation: estimate $v_\pi$ by iterative policy evaluation.
§ Policy Improvement: generate $\pi' \geq \pi$ by greedy policy improvement.

Policy Iteration
Algorithm 1: Policy iteration
1  initialization: Select $\pi^0$, $n \leftarrow 0$;
2  do
3    (Policy Evaluation) solve $v^{(\pi^n)} = r_{\pi^n} + \gamma P_{\pi^n} v^{(\pi^n)}$ ;  // componentwise
4    (Policy Improvement) $\pi^{n+1}(s) \in \arg\max_{a \in \mathcal{A}} \big[ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(\pi^n)}(s') \big] \quad \forall s \in \mathcal{S}$;
5    $n \leftarrow n + 1$;
6  while $\pi^{n} \neq \pi^{n-1}$;
7  Declare $\pi^* = \pi^{n}$
§ Why, in step (4), is $\in$ used?
§ Note the terminating condition.
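A compact sketch of this loop with numpy, solving the evaluation step in line 3 exactly as a linear system; the array shapes follow the earlier helper and are assumptions, not part of the slides.

```python
import numpy as np

def policy_iteration(r, p, gamma=0.9):
    """Policy iteration sketch. r: (|S|, |A|), p: (|S|, |A|, |S|)."""
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly
        r_pi = r[np.arange(n_states), pi]
        P_pi = p[np.arange(n_states), pi]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy one-step lookahead
        q = r + gamma * np.einsum('sat,t->sa', p, v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):             # terminating condition: policy unchanged
            return pi, v
        pi = pi_new
```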

Policy Iteration
§ At each step of policy iteration, the policy improves, i.e., the value function of the policy at a later iteration is greater than or equal to the value function of the policy at an earlier step.
§ This comes from the policy improvement theorem, which (informally) is: let $\pi^n$ be some stationary policy and let $\pi^{n+1}$ be greedy w.r.t. $v^{(\pi^n)}$; then $v^{(\pi^{n+1})} \geq v^{(\pi^n)}$, i.e., $\pi^{n+1}$ is an improvement upon $\pi^n$.
$$r_{\pi^{n+1}} + \gamma P_{\pi^{n+1}} v^{(\pi^n)} \geq r_{\pi^n} + \gamma P_{\pi^n} v^{(\pi^n)} = v^{(\pi^n)} \quad \text{[Bellman eqn.]}$$
$$\Rightarrow r_{\pi^{n+1}} \geq (I - \gamma P_{\pi^{n+1}})\, v^{(\pi^n)}
\Rightarrow (I - \gamma P_{\pi^{n+1}})^{-1} r_{\pi^{n+1}} \geq v^{(\pi^n)}
\Rightarrow v^{(\pi^{n+1})} \geq v^{(\pi^n)} \qquad (2)$$
(The second implication preserves the inequality because $(I - \gamma P_{\pi^{n+1}})^{-1} = \sum_{k \geq 0} (\gamma P_{\pi^{n+1}})^k$ has only non-negative entries.)
§ The first step: $\pi^{n+1}$ is obtained by maximizing $r_\pi + \gamma P_\pi v^{(\pi^n)}$ over all $\pi$'s. So $r_{\pi^{n+1}} + \gamma P_{\pi^{n+1}} v^{(\pi^n)}$ is at least as large as $r_\pi + \gamma P_\pi v^{(\pi^n)}$ for any other $\pi$. That ‘any other $\pi$’ happens to include $\pi^n$.

Policy Iteration: Example ([SB])
§ Jack manages two locations of a car rental company. At any location, if a car is available, he rents it out and gets $10. To ensure that cars are available, Jack can move cars between the two locations overnight, at a cost of $2 per car.
§ Cars are returned and requested randomly according to a Poisson distribution. The probability that $n$ cars are requested or returned is $\frac{\lambda^n}{n!} e^{-\lambda}$.
  ▸ 1st location - $\lambda$: average requests = 3, average returns = 3
  ▸ 2nd location - $\lambda$: average requests = 4, average returns = 2
§ There can be no more than 20 cars at each location, and a maximum of 5 cars can be moved from one location to the other.
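A tiny sketch of the request/return probability from the formula above (truncation at 20 cars is part of the MDP dynamics and is not handled here):

```python
from math import exp, factorial

def poisson(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return (lam ** n) / factorial(n) * exp(-lam)

# e.g., probability that exactly 3 cars are requested at location 1 (lambda = 3)
p_three_requests = poisson(3, 3.0)
```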

Policy Iteration: Example - MDP Formulation
§ State: number of cars at each location at the end of the day (between 0 and 20).
§ Actions: number of cars moved overnight from one location to the other (max 5).
§ Reward: $10 per car rented (if available) and -$2 per car moved.
§ Transition probability: the Poisson distribution defined on the last slide.
§ Discount factor: $\gamma$ is assumed to be 0.9.

Policy Iteration: Example
Figure: The sequence of policies found by policy iteration on Jack’s car rental problem, and the final state-value function. Figure credit: [SB - Chapter 4]

Policy Iteration: Disadvantages
§ Policy iteration involves the policy evaluation step first, and this step itself requires iterations that reach the exact value of $v_\pi$ only in the limit.
§ The question is - must we wait for exact convergence to $v_\pi$? Or can we stop short of that?
§ The small gridworld example showed that there is no change in the greedy policy after the first three iterations.
§ So the question is - is there a number of iterations beyond which the greedy policy does not change?


Value Iteration
§ A related question is - what about the extreme case of one iteration of policy evaluation followed by greedy policy improvement? If we repeat this cycle, does it find the optimal policy, at least in the limit?
§ The good news is that - yes, the guarantee is there, and we will soon prove it. However, first let us modify the policy iteration algorithm to this extreme case. This is known as the ‘value iteration’ strategy.


Value Iteration
§ What policy iteration does: iterate over
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\} \;\to\; v_\pi(s)$$
§ And then
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}$$
§ What value iteration does: evaluate, for each $a \in \mathcal{A}$,
$$r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s')$$
§ And then take the max over it:
$$v^{(k+1)}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$
Where have we seen it? This is the Bellman optimality equation, used as an update rule.

Value Iteration
Algorithm 2: Value iteration
initialization: $v \leftarrow v^0 \in V$, pick an $\epsilon > 0$, $n \leftarrow 0$;
while $\lVert v^{n+1} - v^n \rVert \geq \frac{\epsilon(1-\gamma)}{2\gamma}$ do
  foreach $s \in \mathcal{S}$ do
    $v^{n+1}(s) \leftarrow \max_{a \in \mathcal{A}} \big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^n(s') \big\}$
  end
  $n \leftarrow n + 1$;
end
foreach $s \in \mathcal{S}$ do
  $\pi(s) \leftarrow \arg\max_{a \in \mathcal{A}} \big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^n(s') \big\}$ ;  /* Note the use of $\pi(s)$: it means a deterministic policy */  // $n$ has already been incremented by 1
end
§ Take note of the stopping criterion.
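A sketch of value iteration with this stopping criterion, using the same assumed array shapes as the policy-iteration sketch:

```python
import numpy as np

def value_iteration(r, p, gamma=0.9, eps=1e-6):
    """Value iteration sketch. r: (|S|, |A|), p: (|S|, |A|, |S|)."""
    v = np.zeros(r.shape[0])
    while True:
        # Bellman optimality backup for every state
        v_new = (r + gamma * np.einsum('sat,t->sa', p, v)).max(axis=1)
        # stopping criterion from the algorithm box: ||v_{n+1} - v_n|| < eps(1-gamma)/(2*gamma)
        done = np.max(np.abs(v_new - v)) < eps * (1 - gamma) / (2 * gamma)
        v = v_new
        if done:
            break
    # the final foreach loop: extract a deterministic greedy policy from v
    pi = (r + gamma * np.einsum('sat,t->sa', p, v)).argmax(axis=1)
    return v, pi
```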

Summary of Exact DP Algorithms for Planning
Problem    | Bellman Equation                                         | Algorithm
Prediction | Bellman Expectation Equation                             | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration
Control    | Bellman Optimality Equation                              | Value Iteration

Norms
Definition
Given a vector space $V \subseteq \mathbb{R}^d$, a function $f: V \to \mathbb{R}$ is a norm (denoted as $\lVert \cdot \rVert$) if and only if
§ $\lVert v \rVert \geq 0 \;\; \forall v \in V$
§ $\lVert v \rVert = 0$ if and only if $v = 0$
§ $\lVert \alpha v \rVert = |\alpha| \, \lVert v \rVert \;\; \forall \alpha \in \mathbb{R}$ and $v \in V$
§ Triangle inequality: $\lVert u + v \rVert \leq \lVert u \rVert + \lVert v \rVert \;\; \forall u, v \in V$

Different Types of Norms
§ $L^p$ norm: $\lVert v \rVert_p = \Big( \sum_{i=1}^{d} |v_i|^p \Big)^{1/p}$
§ $L^0$ norm: $\lVert v \rVert_0 = $ number of non-zero elements in $v$
§ $L^\infty$ norm: $\lVert v \rVert_\infty = \max_{1 \leq i \leq d} |v_i|$
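These norms are readily computed with numpy; a quick illustration:

```python
import numpy as np

v = np.array([3.0, 0.0, -4.0])
l2   = np.linalg.norm(v, ord=2)       # (sum |v_i|^2)^(1/2)         = 5.0
l1   = np.linalg.norm(v, ord=1)       # sum |v_i|                    = 7.0
l0   = np.count_nonzero(v)            # number of non-zero elements  = 2
linf = np.linalg.norm(v, ord=np.inf)  # max |v_i|                    = 4.0
```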


Cauchy Sequence, Completeness
Definition
A sequence of vectors $v_1, v_2, v_3, \dots \in V$ (with subscripts $n \in \mathbb{N}$) is called a Cauchy sequence if for any positive real $\epsilon > 0$ there exists $N \in \mathbb{Z}$ such that for all $m, n \geq N$, $\lVert v_m - v_n \rVert \leq \epsilon$.
§ Basically, for any real positive $\epsilon$, an element can be found in the sequence beyond which any two elements of the sequence will be within $\epsilon$ of each other.
§ In other words, the elements of the sequence come closer and closer to each other - i.e., the sequence converges.
Definition
A vector space $V$ equipped with a norm $\lVert \cdot \rVert$ is complete if every Cauchy sequence converges in that norm to a point in the space. To pay tribute to Stefan Banach, the great Polish mathematician, such a space is also called a Banach space.


Contraction Mapping, Fixed Point
Definition
An operator $T: V \to V$ is L-Lipschitz if for any $u, v \in V$
$$\lVert T u - T v \rVert \leq L\, \lVert u - v \rVert$$
§ If $L \leq 1$, then $T$ is called a non-expansion, while if $0 \leq L < 1$, then $T$ is called a contraction.
Definition
Let $v$ be a vector in the vector space $V$ and let $T$ be an operator $T: V \to V$. Then $v$ is called a fixed point of the operator $T$ if $T v = v$.

Banach Fixed Point Theorem
Theorem
Suppose $V$ is a Banach space and $T: V \to V$ is a contraction mapping. Then
- there exists a unique $v^*$ in $V$ s.t. $T v^* = v^*$, and
- for an arbitrary $v^0$ in $V$, the sequence $\{v^n\}$ defined by $v^{n+1} = T v^n = T^{n+1} v^0$ converges to $v^*$.
The above theorem tells us that
§ $T$ has a fixed point, and it is a unique fixed point.
§ For an arbitrary starting point, if we keep repeatedly applying $T$ to that starting point, then we will converge to $v^*$.
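A quick numerical illustration of the theorem with the Bellman expectation operator $L_\pi v = r_\pi + \gamma P_\pi v$ (shown to be a $\gamma$-contraction later in this section); the particular $r_\pi$ and $P_\pi$ below are made up for the demonstration:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([1.0, 0.0, 2.0])                # made-up one-step rewards under pi
P_pi = np.array([[0.5, 0.5, 0.0],               # made-up row-stochastic transition matrix
                 [0.1, 0.8, 0.1],
                 [0.0, 0.3, 0.7]])

T = lambda v: r_pi + gamma * P_pi @ v           # the contraction operator

v = np.zeros(3)                                 # arbitrary starting point v^0
for _ in range(200):
    v = T(v)                                    # v^{n+1} = T v^n

v_star = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # exact fixed point
assert np.allclose(v, v_star)                   # the iterates converge to the unique fixed point
```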

Banach Fixed Point Theorem - Proof (1)
§ Let $v^n$ and $v^{m+n}$ be two values of $v$ obtained after the $n$-th and the $(m+n)$-th iteration.
$$\begin{aligned}
\lVert v^{m+n} - v^{n} \rVert
&\leq \sum_{k=0}^{m-1} \lVert v^{n+k+1} - v^{n+k} \rVert \quad \text{[Triangle inequality]} \\
&= \sum_{k=0}^{m-1} \lVert T^{n+k} v^{1} - T^{n+k} v^{0} \rVert
\leq \sum_{k=0}^{m-1} \lambda\, \lVert T^{n+k-1} v^{1} - T^{n+k-1} v^{0} \rVert \\
&\leq \sum_{k=0}^{m-1} \lambda^{n+k}\, \lVert v^{1} - v^{0} \rVert \quad \text{[Repeated use of contraction]} \\
&= \lVert v^{1} - v^{0} \rVert \sum_{k=0}^{m-1} \lambda^{n+k}
= \frac{\lambda^{n}(1 - \lambda^{m})}{1 - \lambda}\, \lVert v^{1} - v^{0} \rVert
\end{aligned} \qquad (3)$$


Banach Fixed Point Theorem - Proof (2)
§ As $m \to \infty$ and $n \to \infty$, and since $\lambda < 1$, the norm of the difference between $v^{m+n}$ and $v^{n}$ becomes smaller and smaller.
§ That means the sequence $\{v^n\}$ is Cauchy.
§ And since $V$ is a Banach space, and since every Cauchy sequence converges to a point in that Banach space, the Cauchy sequence $\{v^n\}$ also converges to a point in $V$.
§ What we have proved till now is that the sequence $\{v^n\}$ will reach a convergence point in the same space.
§ Let's say that the convergence point is $v^*$.
§ What we will try to prove next is that $v^*$ is a fixed point, and then we will try to prove that $v^*$ is a unique fixed point.



Banach Fixed Point Theorem - Proof (3)
§ Let us try to see what we get as the norm of the difference between $v^*$ and $T v^*$.
§ In the first line below we apply the triangle inequality, where $v^n$ is the value of $v$ at the $n$-th iteration.
$$\lVert T v^* - v^* \rVert \leq \lVert T v^* - v^n \rVert + \lVert v^n - v^* \rVert
= \lVert T v^* - T v^{n-1} \rVert + \lVert v^n - v^* \rVert
\leq \lambda\, \lVert v^* - v^{n-1} \rVert + \lVert v^n - v^* \rVert \quad \text{[Contraction property]} \qquad (4)$$
§ Since $\{v^n\}$ is Cauchy and $v^*$ is the convergence point, both the terms in the above equation tend to 0 as $n \to \infty$.
§ So, as $n \to \infty$, $\lVert T v^* - v^* \rVert \to 0$. That means in the limit $T v^* = v^*$. So it is proved that $v^*$ is a fixed point.



Banach Fixed Point Theorem - Proof (4)
§ Now we will show the uniqueness, i.e., that $v^*$ is unique.
§ Let $u^*$ and $v^*$ be two fixed points of the space. From the contraction property, we can write $\lVert T u^* - T v^* \rVert \leq \lambda\, \lVert u^* - v^* \rVert$.
§ But, since $u^*$ and $v^*$ are fixed points, $T u^* = u^*$ and $T v^* = v^*$.
§ That means $\lVert u^* - v^* \rVert \leq \lambda\, \lVert u^* - v^* \rVert$, which cannot be true for $\lambda < 1$ unless $v^* = u^*$.
§ So it is proved that $v^*$ is a unique fixed point.

Existence and Uniqueness of Bellman Equations
§ Now we will start talking about the existence and uniqueness of the solution to the Bellman expectation equations and the Bellman optimality equations.
§ In the case of a finite MDP, the value function $v$ can be thought of as a vector in an $|\mathcal{S}|$-dimensional vector space $V$.
§ Whenever we use a norm $\lVert \cdot \rVert$ in this space, we will mean the max norm, unless otherwise specified.

Existence and Uniqueness of Bellman Equations
§ Previously, we have seen
  ▸ $r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s,a)$, the one-step expected reward for following policy $\pi$ at state $s$.
  ▸ $p_\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, p(s' \mid s,a)$, the one-step transition probability under policy $\pi$.
§ Using these notations, the Bellman expectation equation becomes
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}
= r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s' \mid s)\, v_\pi(s')$$


Existence and Uniqueness of Bellman Equations
§ $v_\pi(s) = r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s' \mid s)\, v_\pi(s')$
§ Refresher from earlier lectures:
$$\begin{bmatrix} v(s_1) \\ v(s_2) \\ \vdots \\ v(s_n) \end{bmatrix}
= \begin{bmatrix} r(s_1) \\ r(s_2) \\ \vdots \\ r(s_n) \end{bmatrix}
+ \gamma \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}
\begin{bmatrix} v(s_1) \\ v(s_2) \\ \vdots \\ v(s_n) \end{bmatrix}$$
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ $r_\pi$ is an $|\mathcal{S}|$-dimensional vector, while $P_\pi$ is an $|\mathcal{S}| \times |\mathcal{S}|$-dimensional matrix.
§ For all $s'$, $p_\pi(s' \mid s)$ is one row (the $s$-th row) of the $P_\pi$ matrix. Similarly, the $v_\pi(s')$'s are the value functions for all states, i.e., in vectorized notation this is the vector $v_\pi$.
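Because this vectorized Bellman expectation equation is linear in $v_\pi$, it can also be solved directly as $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$; a one-function sketch (inputs as produced by the earlier helper):

```python
import numpy as np

def exact_policy_value(r_pi, P_pi, gamma=0.9):
    """Solve v_pi = r_pi + gamma * P_pi v_pi, i.e. (I - gamma * P_pi) v_pi = r_pi."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
```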

Existence and Uniqueness of Bellman Equations
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ We are now going to define a linear operator $L_\pi : V \to V$ such that
$$L_\pi v = r_\pi + \gamma P_\pi v \quad \forall v \in V \quad \text{[$V$ as defined in slide (37)]} \qquad (5)$$
§ So, using this operator notation, we can write the Bellman expectation equation as the following:
$$L_\pi v_\pi = v_\pi \qquad (6)$$
§ So far we have proved the Banach Fixed Point Theorem. Now we will try to show that $L_\pi$ is a contraction.
§ We will hold the proof of $V$ being a Banach space for later.

Existence and Uniqueness of Bellman Equations
§ Let $u$ and $v$ be in $V$. So,
$$L_\pi u(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, u(s'), \qquad
L_\pi v(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v(s') \qquad (7)$$
§ One important note: $L_\pi u(s)$ or $L_\pi v(s)$ does not mean $L_\pi$ applied on $u(s)$ or $v(s)$. It means the $s$-th component of the vector $L_\pi u$ or $L_\pi v$.

Existence and Uniqueness of Bellman Equations
§ Let us consider the case $L_\pi v(s) \geq L_\pi u(s)$. Then,
$$0 \leq L_\pi v(s) - L_\pi u(s)
= \gamma \sum_{s'} p_\pi(s' \mid s)\, \{v(s') - u(s')\}
\leq \gamma\, \lVert v - u \rVert \sum_{s'} p_\pi(s' \mid s) \quad \text{[Why is this?]}
= \gamma\, \lVert v - u \rVert \quad \Big[\text{Since } \sum_{s'} p_\pi(s' \mid s) = 1\Big] \qquad (8)$$
§ Similarly, when $L_\pi u(s) \geq L_\pi v(s)$, we can show that
$$0 \leq L_\pi u(s) - L_\pi v(s) \leq \gamma\, \lVert u - v \rVert = \gamma\, \lVert v - u \rVert \quad [\text{Since } \lVert u - v \rVert = \lVert v - u \rVert] \qquad (9)$$

Existence and Uniqueness of Bellman Equations
§ Putting the two equations (8) and (9) together, we get
$$|L_\pi v(s) - L_\pi u(s)| \leq \gamma\, \lVert v - u \rVert \quad \forall s \in \mathcal{S} \qquad (10)$$
§ Pointwise, or componentwise, the difference is being drawn closer by a factor of $\gamma$, so the maximum of the difference will also have come down:
$$\lVert L_\pi v - L_\pi u \rVert \leq \gamma\, \lVert v - u \rVert \qquad (11)$$
§ So, that means that $L_\pi$ is a contraction.
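A quick randomized check of inequality (11); the random $r_\pi$, $P_\pi$, $u$ and $v$ below are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 5
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic
r_pi = rng.random(n)
L = lambda x: r_pi + gamma * P_pi @ x          # the operator L_pi

u, v = rng.random(n), rng.random(n)
lhs = np.max(np.abs(L(v) - L(u)))              # ||L_pi v - L_pi u|| (max norm)
rhs = gamma * np.max(np.abs(v - u))            # gamma * ||v - u||
assert lhs <= rhs + 1e-12                      # the contraction inequality (11) holds
```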


