Prediction and Control by Dynamic Programming - CS60077: Reinforcement Learning


Prediction and Control by Dynamic Programming
CS60077: Reinforcement Learning
Abir Das, IIT Kharagpur
Aug 8, 9, 29, 30, Sep 05, 2019

Agenda
§ Understand how to evaluate policies using dynamic programming based methods
§ Understand policy iteration and value iteration algorithms for control of MDPs
§ Existence and convergence of solutions obtained by the above methods

Resources
§ Reinforcement Learning by David Silver [Link]
§ Reinforcement Learning by Balaraman Ravindran [Link]
§ SB: Chapter 4

Dynamic Programming
“Life can only be understood going backwards, but it must be lived going forwards.” - S. Kierkegaard, Danish philosopher. This is the first line of the famous book by Dimitri P. Bertsekas. Image taken from: amazon.com

Dynamic Programming
§ Dynamic Programming (DP), in this course, refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
§ DP has limited utility due to the ‘perfect model’ assumption and due to computational expense.
§ But DP methods are still important as they provide an essential foundation for many of the subsequent methods.
§ Many of those methods can be viewed as attempts to achieve much the same effect as DP with less computation and without the perfect-model assumption on the environment.
§ The key idea in DP is to use value functions and Bellman equations to organize and structure the search for good policies.


Dynamic Programming
§ Dynamic Programming addresses a bigger problem by breaking it down into subproblems and then
  ▸ Solving the subproblems
  ▸ Combining solutions to the subproblems
§ Dynamic Programming is based on the principle of optimality.

[Figure: timeline from $0$ to $N$ showing an optimal action sequence $a_0^*, \dots, a_{N-1}^*$, with the tail subproblem starting at state $s_k^*$ at time $k$.]

Principle of Optimality
Let $\{a_0^*, a_1^*, \dots, a_{N-1}^*\}$ be an optimal action sequence with a corresponding state sequence $\{s_1^*, s_2^*, \dots, s_N^*\}$. Consider the tail subproblem that starts at $s_k^*$ at time $k$ and maximizes the ‘reward to go’ from $k$ to $N$ over $\{a_k, \dots, a_{N-1}\}$. Then the tail optimal action sequence $\{a_k^*, \dots, a_{N-1}^*\}$ is optimal for the tail subproblem.

Requirements for Dynamic Programming
§ Optimal substructure, i.e., the principle of optimality applies.
§ Overlapping subproblems, i.e., subproblems recur many times and solutions to these subproblems can be cached and reused.
§ MDPs satisfy both through Bellman equations and value functions.
§ Dynamic programming is used to solve many other problems, e.g., scheduling algorithms, graph algorithms (e.g., shortest path algorithms), bioinformatics, etc.


Planning by Dynamic Programming
§ Planning by dynamic programming assumes full knowledge of the MDP
§ For prediction/evaluation
  ▸ Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$
  ▸ Output: value function $v_\pi$
§ For control
  ▸ Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
  ▸ Output: optimal value function $v_*$ and optimal policy $\pi_*$

Iterative Policy Evaluation
§ Problem: Policy evaluation: compute the state-value function $v_\pi$ for an arbitrary policy $\pi$.
§ Solution strategy: iterative application of the Bellman expectation equation.
§ Recall the Bellman expectation equation:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\} \qquad (1)$$
§ Consider a sequence of approximate value functions $v^{(0)}, v^{(1)}, v^{(2)}, \dots$, each mapping $\mathcal{S}$ to $\mathbb{R}$. Each successive approximation is obtained by using eqn. (1) as an update rule:
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$

Iterative Policy Evaluation
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$
§ In code, this can be implemented using two arrays - one for the old values $v^{(k)}(s)$ and the other for the new values $v^{(k+1)}(s)$. Here, the new values $v^{(k+1)}(s)$ are computed one by one from the old values $v^{(k)}(s)$ without changing the old values.
§ Another way is to use one array and update the values ‘in place’, i.e., each new value immediately overwrites the old one.
§ Both of these converge to the true value $v_\pi$, and the ‘in place’ algorithm usually converges faster.

Iterative Policy Evaluation
Iterative policy evaluation, for estimating $V \approx v_\pi$
Input: $\pi$, the policy to be evaluated
Algorithm parameter: a small threshold $\theta > 0$ determining the accuracy of estimation
Initialize $V(s)$, for all $s \in \mathcal{S}$, arbitrarily, except that $V(\text{terminal}) = 0$
Loop:
  $\Delta \leftarrow 0$
  Loop for each $s \in \mathcal{S}$:
    $v \leftarrow V(s)$
    $V(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid s) \big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, V(s') \big\}$
    $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$
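The boxed procedure translates almost directly into code. Below is a minimal in-place sketch in Python, assuming the model is supplied as nested dictionaries; the names states, actions, pi, r and p are illustrative choices, not from the slides.

```python
def iterative_policy_evaluation(states, actions, pi, r, p, gamma=0.9, theta=1e-6):
    """In-place iterative policy evaluation (sketch).

    pi[s][a]  : probability of taking action a in state s
    r[s][a]   : expected one-step reward for (s, a)
    p[s][a][t]: transition probability of landing in state t from (s, a)
    Terminal states should be absorbing with zero reward so their value stays 0.
    """
    V = {s: 0.0 for s in states}            # arbitrary initialization
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Bellman expectation backup for state s (in place)
            V[s] = sum(pi[s][a] * (r[s][a] +
                       gamma * sum(p[s][a][t] * V[t] for t in states))
                       for a in actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                   # stopping criterion from the box above
            return V
```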

Evaluating a Random Policy in the Small Gridworld
Figure credit: [SB] chapter 4
§ Undiscounted episodic MDP ($\gamma = 1$)
§ Non-terminal states are $\mathcal{S} = \{1, 2, \dots, 14\}$
§ Two terminal states (shown as shaded squares)
§ 4 possible actions in each state, $\mathcal{A} = \{\text{up}, \text{down}, \text{right}, \text{left}\}$
§ Deterministic state transitions
§ Actions leading out of the grid leave the state unchanged
§ Reward is -1 until the terminal state is reached
§ The agent follows the uniform random policy $\pi(\text{n} \mid \cdot) = \pi(\text{s} \mid \cdot) = \pi(\text{e} \mid \cdot) = \pi(\text{w} \mid \cdot) = 0.25$
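For concreteness, here is a small self-contained sketch that evaluates the uniform random policy on this gridworld. It assumes states are numbered 0-15 row by row with 0 and 15 terminal, which matches [SB] Figure 4.1 but is otherwise an implementation choice.

```python
import numpy as np

def gridworld_random_policy_eval(theta=1e-4):
    """Evaluate the uniform random policy on the 4x4 gridworld (gamma = 1, reward -1 per step)."""
    V = np.zeros(16)                               # states 0..15; 0 and 15 are terminal
    moves = {'up': -4, 'down': +4, 'right': +1, 'left': -1}

    def step(s, a):
        s2 = s + moves[a]
        # moves that would leave the grid keep the state unchanged
        if a == 'up' and s < 4: s2 = s
        if a == 'down' and s > 11: s2 = s
        if a == 'left' and s % 4 == 0: s2 = s
        if a == 'right' and s % 4 == 3: s2 = s
        return s2

    while True:
        delta = 0.0
        for s in range(16):
            if s in (0, 15):                       # terminal states keep value 0
                continue
            v_old = V[s]
            V[s] = sum(0.25 * (-1 + V[step(s, a)]) for a in moves)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V.reshape(4, 4)
```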

Evaluating a Random Policy in the Small Gridworld
[Figure: the sequence of value functions produced by iterative policy evaluation on the small gridworld, together with the greedy policy corresponding to each value-function estimate. Figure credit: [SB] chapter 4]


Improving a Policy: Policy Iteration
§ Given a policy $\pi$
  ▸ Evaluate the policy (iterate until $v^{(k)} \to v_\pi$):
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$
  ▸ Improve the policy by acting greedily with respect to $v_\pi$: $\pi' = \text{greedy}(v_\pi)$. Being greedy means choosing the action that will land the agent in the best state, i.e.,
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s,a) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}$$
§ In the Small Gridworld the improved policy was optimal, $\pi' = \pi_*$
§ In general, more iterations of improvement/evaluation are needed
§ But this process of policy iteration always converges to $\pi_*$
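A minimal sketch of the greedy improvement step, reusing the hypothetical nested-dictionary model from the policy-evaluation sketch above:

```python
def greedy_policy(states, actions, r, p, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    pi_new = {}
    for s in states:
        # one-step lookahead: q(s, a) = r(s, a) + gamma * sum_t p(t|s,a) V(t)
        q = {a: r[s][a] + gamma * sum(p[s][a][t] * V[t] for t in states)
             for a in actions}
        pi_new[s] = max(q, key=q.get)       # arg max over actions
    return pi_new
```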

Improving a Policy: Policy Iteration
Given a policy $\pi$
§ Evaluate the policy:
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}
= \underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s,a)}_{r_\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} \underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s)\, p(s' \mid s,a)}_{p_\pi(s' \mid s)} v^{(k)}(s')
= r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s' \mid s)\, v^{(k)}(s')$$
  ▸ $r_\pi(s)$ = one-step expected reward for following policy $\pi$ at state $s$.
  ▸ $p_\pi(s' \mid s)$ = one-step transition probability under policy $\pi$.
§ Improve the policy by acting greedily with respect to $v_\pi$:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}$$
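For the vectorized notation $r_\pi$ and $p_\pi(\cdot \mid s)$ introduced here, a small numpy helper can assemble them from the model; the array layouts below are assumptions, not part of the slides.

```python
import numpy as np

def policy_reward_and_transition(pi, r, p):
    """Build r_pi with shape (|S|,) and P_pi with shape (|S|, |S|) from
    pi: (|S|, |A|), r: (|S|, |A|), p: (|S|, |A|, |S|)."""
    r_pi = np.einsum('sa,sa->s', pi, r)      # r_pi(s)   = sum_a pi(a|s) r(s,a)
    P_pi = np.einsum('sa,sat->st', pi, p)    # p_pi(t|s) = sum_a pi(a|s) p(t|s,a)
    return r_pi, P_pi
```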

Policy Iteration
Figure credit: [David Silver: DeepMind]
§ Policy Evaluation: estimate $v_\pi$ by iterative policy evaluation.
§ Policy Improvement: generate $\pi' \geq \pi$ by greedy policy improvement.

Policy Iteration
Algorithm 1: Policy iteration
1  initialization: Select $\pi^0$, $n \leftarrow 0$;
2  do
3    (Policy Evaluation) solve $v^{(\pi^n)} = r_{\pi^n} + \gamma P_{\pi^n} v^{(\pi^n)}$ ;  // componentwise
4    (Policy Improvement) $\pi^{n+1}(s) \in \arg\max_{a \in \mathcal{A}} \big[ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(\pi^n)}(s') \big] \quad \forall s \in \mathcal{S}$;
5    $n \leftarrow n + 1$;
6  while $\pi^{n} \neq \pi^{n-1}$;
7  Declare $\pi^* = \pi^{n}$
§ Why, in step (4), is $\in$ used?
§ Note the terminating condition.
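A compact sketch of this loop with numpy, solving the evaluation step in line 3 exactly as a linear system; the array shapes follow the earlier helper and are assumptions, not part of the slides.

```python
import numpy as np

def policy_iteration(r, p, gamma=0.9):
    """Policy iteration sketch. r: (|S|, |A|), p: (|S|, |A|, |S|)."""
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly
        r_pi = r[np.arange(n_states), pi]
        P_pi = p[np.arange(n_states), pi]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy one-step lookahead
        q = r + gamma * np.einsum('sat,t->sa', p, v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):             # terminating condition: policy unchanged
            return pi, v
        pi = pi_new
```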

Policy Iteration
§ At each step of policy iteration, the policy improves, i.e., the value function of the policy at a later iteration is greater than or equal to the value function of the policy at an earlier step.
§ This comes from the policy improvement theorem, which (informally) is: let $\pi^n$ be some stationary policy and let $\pi^{n+1}$ be greedy w.r.t. $v^{(\pi^n)}$; then $v^{(\pi^{n+1})} \geq v^{(\pi^n)}$, i.e., $\pi^{n+1}$ is an improvement upon $\pi^n$.
$$r_{\pi^{n+1}} + \gamma P_{\pi^{n+1}} v^{(\pi^n)} \geq r_{\pi^n} + \gamma P_{\pi^n} v^{(\pi^n)} = v^{(\pi^n)} \quad \text{[Bellman eqn.]}$$
$$\Rightarrow r_{\pi^{n+1}} \geq (I - \gamma P_{\pi^{n+1}})\, v^{(\pi^n)}
\Rightarrow (I - \gamma P_{\pi^{n+1}})^{-1} r_{\pi^{n+1}} \geq v^{(\pi^n)}
\Rightarrow v^{(\pi^{n+1})} \geq v^{(\pi^n)} \qquad (2)$$
(The second implication preserves the inequality because $(I - \gamma P_{\pi^{n+1}})^{-1} = \sum_{k \geq 0} (\gamma P_{\pi^{n+1}})^k$ has only non-negative entries.)
§ The first step: $\pi^{n+1}$ is obtained by maximizing $r_\pi + \gamma P_\pi v^{(\pi^n)}$ over all $\pi$'s. So $r_{\pi^{n+1}} + \gamma P_{\pi^{n+1}} v^{(\pi^n)}$ is at least as large as $r_\pi + \gamma P_\pi v^{(\pi^n)}$ for any other $\pi$. That ‘any other $\pi$’ happens to include $\pi^n$.

Policy Iteration: Example ([SB])
§ Jack manages two locations of a car rental company. At any location, if a car is available, he rents it out and gets $10. To ensure that cars are available, Jack can move cars between the two locations overnight, at a cost of $2 per car.
§ Cars are returned and requested randomly according to a Poisson distribution. The probability that $n$ cars are requested or returned is $\frac{\lambda^n}{n!} e^{-\lambda}$.
  ▸ 1st location - $\lambda$: average requests = 3, average returns = 3
  ▸ 2nd location - $\lambda$: average requests = 4, average returns = 2
§ There can be no more than 20 cars at each location, and a maximum of 5 cars can be moved from one location to the other.
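A tiny sketch of the request/return probability from the formula above (truncation at 20 cars is part of the MDP dynamics and is not handled here):

```python
from math import exp, factorial

def poisson(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return (lam ** n) / factorial(n) * exp(-lam)

# e.g., probability that exactly 3 cars are requested at location 1 (lambda = 3)
p_three_requests = poisson(3, 3.0)
```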

Policy Iteration: Example - MDP Formulation
§ State: number of cars at each location at the end of the day (between 0 and 20).
§ Actions: number of cars moved overnight from one location to the other (max 5).
§ Reward: $10 per car rented (if available) and -$2 per car moved.
§ Transition probability: the Poisson distribution defined on the last slide.
§ Discount factor: $\gamma$ is assumed to be 0.9.

Policy Iteration: Example
Figure: The sequence of policies found by policy iteration on Jack’s car rental problem, and the final state-value function. Figure credit: [SB - Chapter 4]

Policy Iteration: Disadvantages
§ Policy iteration involves the policy evaluation step first, and this step itself requires iterations that reach the exact value of $v_\pi$ only in the limit.
§ The question is - must we wait for exact convergence to $v_\pi$? Or can we stop short of that?
§ The small gridworld example showed that there is no change in the greedy policy after the first three iterations.
§ So the question is - is there a number of iterations beyond which the greedy policy does not change?


Value Iteration
§ A related question is - what about the extreme case of one iteration of policy evaluation followed by greedy policy improvement? If we repeat this cycle, does it find the optimal policy, at least in the limit?
§ The good news is that - yes, the guarantee is there, and we will soon prove it. However, first let us modify the policy iteration algorithm to this extreme case. This is known as the ‘value iteration’ strategy.


Value Iteration
§ What policy iteration does: iterate over
$$v^{(k+1)}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\} \;\to\; v_\pi(s)$$
§ And then
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}$$
§ What value iteration does: evaluate, for each $a \in \mathcal{A}$,
$$r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s')$$
§ And then take the max over it:
$$v^{(k+1)}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^{(k)}(s') \Big\}$$
Where have we seen it? This is the Bellman optimality equation, used as an update rule.

Value Iteration
Algorithm 2: Value iteration
initialization: $v \leftarrow v^0 \in V$, pick an $\epsilon > 0$, $n \leftarrow 0$;
while $\lVert v^{n+1} - v^n \rVert \geq \frac{\epsilon(1-\gamma)}{2\gamma}$ do
  foreach $s \in \mathcal{S}$ do
    $v^{n+1}(s) \leftarrow \max_{a \in \mathcal{A}} \big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^n(s') \big\}$
  end
  $n \leftarrow n + 1$;
end
foreach $s \in \mathcal{S}$ do
  $\pi(s) \leftarrow \arg\max_{a \in \mathcal{A}} \big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v^n(s') \big\}$ ;  /* Note the use of $\pi(s)$: it means a deterministic policy */  // $n$ has already been incremented by 1
end
§ Take note of the stopping criterion.
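A sketch of value iteration with this stopping criterion, using the same assumed array shapes as the policy-iteration sketch:

```python
import numpy as np

def value_iteration(r, p, gamma=0.9, eps=1e-6):
    """Value iteration sketch. r: (|S|, |A|), p: (|S|, |A|, |S|)."""
    v = np.zeros(r.shape[0])
    while True:
        # Bellman optimality backup for every state
        v_new = (r + gamma * np.einsum('sat,t->sa', p, v)).max(axis=1)
        # stopping criterion from the algorithm box: ||v_{n+1} - v_n|| < eps(1-gamma)/(2*gamma)
        done = np.max(np.abs(v_new - v)) < eps * (1 - gamma) / (2 * gamma)
        v = v_new
        if done:
            break
    # the final foreach loop: extract a deterministic greedy policy from v
    pi = (r + gamma * np.einsum('sat,t->sa', p, v)).argmax(axis=1)
    return v, pi
```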

Summary of Exact DP Algorithms for Planning
Problem    | Bellman Equation                                         | Algorithm
Prediction | Bellman Expectation Equation                             | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration
Control    | Bellman Optimality Equation                              | Value Iteration

Norms
Definition
Given a vector space $V \subseteq \mathbb{R}^d$, a function $f: V \to \mathbb{R}$ is a norm (denoted as $\lVert \cdot \rVert$) if and only if
§ $\lVert v \rVert \geq 0 \;\; \forall v \in V$
§ $\lVert v \rVert = 0$ if and only if $v = 0$
§ $\lVert \alpha v \rVert = |\alpha| \, \lVert v \rVert \;\; \forall \alpha \in \mathbb{R}$ and $v \in V$
§ Triangle inequality: $\lVert u + v \rVert \leq \lVert u \rVert + \lVert v \rVert \;\; \forall u, v \in V$

Different Types of Norms
§ $L^p$ norm: $\lVert v \rVert_p = \Big( \sum_{i=1}^{d} |v_i|^p \Big)^{1/p}$
§ $L^0$ norm: $\lVert v \rVert_0 = $ number of non-zero elements in $v$
§ $L^\infty$ norm: $\lVert v \rVert_\infty = \max_{1 \leq i \leq d} |v_i|$
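These norms are readily computed with numpy; a quick illustration:

```python
import numpy as np

v = np.array([3.0, 0.0, -4.0])
l2   = np.linalg.norm(v, ord=2)       # (sum |v_i|^2)^(1/2)         = 5.0
l1   = np.linalg.norm(v, ord=1)       # sum |v_i|                    = 7.0
l0   = np.count_nonzero(v)            # number of non-zero elements  = 2
linf = np.linalg.norm(v, ord=np.inf)  # max |v_i|                    = 4.0
```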


Cauchy Sequence, Completeness
Definition
A sequence of vectors $v_1, v_2, v_3, \dots \in V$ (with subscripts $n \in \mathbb{N}$) is called a Cauchy sequence if for any positive real $\epsilon > 0$ there exists $N \in \mathbb{Z}$ such that for all $m, n \geq N$, $\lVert v_m - v_n \rVert \leq \epsilon$.
§ Basically, for any real positive $\epsilon$, an element can be found in the sequence beyond which any two elements of the sequence will be within $\epsilon$ of each other.
§ In other words, the elements of the sequence come closer and closer to each other - i.e., the sequence converges.
Definition
A vector space $V$ equipped with a norm $\lVert \cdot \rVert$ is complete if every Cauchy sequence converges in that norm to a point in the space. To pay tribute to Stefan Banach, the great Polish mathematician, such a space is also called a Banach space.


Contraction Mapping, Fixed Point
Definition
An operator $T: V \to V$ is L-Lipschitz if for any $u, v \in V$
$$\lVert T u - T v \rVert \leq L\, \lVert u - v \rVert$$
§ If $L \leq 1$, then $T$ is called a non-expansion, while if $0 \leq L < 1$, then $T$ is called a contraction.
Definition
Let $v$ be a vector in the vector space $V$ and let $T$ be an operator $T: V \to V$. Then $v$ is called a fixed point of the operator $T$ if $T v = v$.

Banach Fixed Point Theorem
Theorem
Suppose $V$ is a Banach space and $T: V \to V$ is a contraction mapping. Then
- there exists a unique $v^*$ in $V$ s.t. $T v^* = v^*$, and
- for an arbitrary $v^0$ in $V$, the sequence $\{v^n\}$ defined by $v^{n+1} = T v^n = T^{n+1} v^0$ converges to $v^*$.
The above theorem tells us that
§ $T$ has a fixed point, and it is a unique fixed point.
§ For an arbitrary starting point, if we keep repeatedly applying $T$ to that starting point, then we will converge to $v^*$.
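A quick numerical illustration of the theorem with the Bellman expectation operator $L_\pi v = r_\pi + \gamma P_\pi v$ (shown to be a $\gamma$-contraction later in this section); the particular $r_\pi$ and $P_\pi$ below are made up for the demonstration:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([1.0, 0.0, 2.0])                # made-up one-step rewards under pi
P_pi = np.array([[0.5, 0.5, 0.0],               # made-up row-stochastic transition matrix
                 [0.1, 0.8, 0.1],
                 [0.0, 0.3, 0.7]])

T = lambda v: r_pi + gamma * P_pi @ v           # the contraction operator

v = np.zeros(3)                                 # arbitrary starting point v^0
for _ in range(200):
    v = T(v)                                    # v^{n+1} = T v^n

v_star = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # exact fixed point
assert np.allclose(v, v_star)                   # the iterates converge to the unique fixed point
```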

Banach Fixed Point Theorem - Proof (1)
§ Let $v^n$ and $v^{m+n}$ be two values of $v$ obtained after the $n$-th and the $(m+n)$-th iteration.
$$\begin{aligned}
\lVert v^{m+n} - v^{n} \rVert
&\leq \sum_{k=0}^{m-1} \lVert v^{n+k+1} - v^{n+k} \rVert \quad \text{[Triangle inequality]} \\
&= \sum_{k=0}^{m-1} \lVert T^{n+k} v^{1} - T^{n+k} v^{0} \rVert
\leq \sum_{k=0}^{m-1} \lambda\, \lVert T^{n+k-1} v^{1} - T^{n+k-1} v^{0} \rVert \\
&\leq \sum_{k=0}^{m-1} \lambda^{n+k}\, \lVert v^{1} - v^{0} \rVert \quad \text{[Repeated use of contraction]} \\
&= \lVert v^{1} - v^{0} \rVert \sum_{k=0}^{m-1} \lambda^{n+k}
= \frac{\lambda^{n}(1 - \lambda^{m})}{1 - \lambda}\, \lVert v^{1} - v^{0} \rVert
\end{aligned} \qquad (3)$$


Banach Fixed Point Theorem - Proof (2)
§ As $m \to \infty$ and $n \to \infty$, and since $\lambda < 1$, the norm of the difference between $v^{m+n}$ and $v^{n}$ becomes smaller and smaller.
§ That means the sequence $\{v^n\}$ is Cauchy.
§ And since $V$ is a Banach space, and since every Cauchy sequence converges to a point in that Banach space, the Cauchy sequence $\{v^n\}$ also converges to a point in $V$.
§ What we have proved till now is that the sequence $\{v^n\}$ will reach a convergence point in the same space.
§ Let's say that the convergence point is $v^*$.
§ What we will try to prove next is that $v^*$ is a fixed point, and then we will try to prove that $v^*$ is a unique fixed point.



Banach Fixed Point Theorem - Proof (3)
§ Let us try to see what we get as the norm of the difference between $v^*$ and $T v^*$.
§ In the first line below we apply the triangle inequality, where $v^n$ is the value of $v$ at the $n$-th iteration.
$$\lVert T v^* - v^* \rVert \leq \lVert T v^* - v^n \rVert + \lVert v^n - v^* \rVert
= \lVert T v^* - T v^{n-1} \rVert + \lVert v^n - v^* \rVert
\leq \lambda\, \lVert v^* - v^{n-1} \rVert + \lVert v^n - v^* \rVert \quad \text{[Contraction property]} \qquad (4)$$
§ Since $\{v^n\}$ is Cauchy and $v^*$ is the convergence point, both the terms in the above equation tend to 0 as $n \to \infty$.
§ So, as $n \to \infty$, $\lVert T v^* - v^* \rVert \to 0$. That means in the limit $T v^* = v^*$. So it is proved that $v^*$ is a fixed point.



Banach Fixed Point Theorem - Proof (4)
§ Now we will show the uniqueness, i.e., that $v^*$ is unique.
§ Let $u^*$ and $v^*$ be two fixed points of the space. From the contraction property, we can write $\lVert T u^* - T v^* \rVert \leq \lambda\, \lVert u^* - v^* \rVert$.
§ But, since $u^*$ and $v^*$ are fixed points, $T u^* = u^*$ and $T v^* = v^*$.
§ That means $\lVert u^* - v^* \rVert \leq \lambda\, \lVert u^* - v^* \rVert$, which cannot be true for $\lambda < 1$ unless $v^* = u^*$.
§ So it is proved that $v^*$ is a unique fixed point.

Existence and Uniqueness of Bellman Equations
§ Now we will start talking about the existence and uniqueness of the solution to the Bellman expectation equations and the Bellman optimality equations.
§ In the case of a finite MDP, the value function $v$ can be thought of as a vector in an $|\mathcal{S}|$-dimensional vector space $V$.
§ Whenever we use a norm $\lVert \cdot \rVert$ in this space, we will mean the max norm, unless otherwise specified.

Existence and Uniqueness of Bellman Equations
§ Previously, we have seen
  ▸ $r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s,a)$, the one-step expected reward for following policy $\pi$ at state $s$.
  ▸ $p_\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, p(s' \mid s,a)$, the one-step transition probability under policy $\pi$.
§ Using these notations, the Bellman expectation equation becomes
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, v_\pi(s') \Big\}
= r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s' \mid s)\, v_\pi(s')$$


Existence and Uniqueness of Bellman Equations
§ $v_\pi(s) = r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s' \mid s)\, v_\pi(s')$
§ Refresher from earlier lectures:
$$\begin{bmatrix} v(s_1) \\ v(s_2) \\ \vdots \\ v(s_n) \end{bmatrix}
= \begin{bmatrix} r(s_1) \\ r(s_2) \\ \vdots \\ r(s_n) \end{bmatrix}
+ \gamma \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}
\begin{bmatrix} v(s_1) \\ v(s_2) \\ \vdots \\ v(s_n) \end{bmatrix}$$
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ $r_\pi$ is an $|\mathcal{S}|$-dimensional vector, while $P_\pi$ is an $|\mathcal{S}| \times |\mathcal{S}|$-dimensional matrix.
§ For all $s'$, $p_\pi(s' \mid s)$ is one row (the $s$-th row) of the $P_\pi$ matrix. Similarly, the $v_\pi(s')$'s are the value functions for all states, i.e., in vectorized notation this is the vector $v_\pi$.
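Because this vectorized Bellman expectation equation is linear in $v_\pi$, it can also be solved directly as $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$; a one-function sketch (inputs as produced by the earlier helper):

```python
import numpy as np

def exact_policy_value(r_pi, P_pi, gamma=0.9):
    """Solve v_pi = r_pi + gamma * P_pi v_pi, i.e. (I - gamma * P_pi) v_pi = r_pi."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
```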

Existence and Uniqueness of Bellman Equations
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ We are now going to define a linear operator $L_\pi : V \to V$ such that
$$L_\pi v = r_\pi + \gamma P_\pi v \quad \forall v \in V \quad \text{[$V$ as defined in slide (37)]} \qquad (5)$$
§ So, using this operator notation, we can write the Bellman expectation equation as the following:
$$L_\pi v_\pi = v_\pi \qquad (6)$$
§ So far we have proved the Banach Fixed Point Theorem. Now we will try to show that $L_\pi$ is a contraction.
§ We will hold the proof of $V$ being a Banach space for later.

Existence and Uniqueness of Bellman Equations
§ Let $u$ and $v$ be in $V$. So,
$$L_\pi u(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, u(s'), \qquad
L_\pi v(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v(s') \qquad (7)$$
§ One important note: $L_\pi u(s)$ or $L_\pi v(s)$ does not mean $L_\pi$ applied on $u(s)$ or $v(s)$. It means the $s$-th component of the vector $L_\pi u$ or $L_\pi v$.

Existence and Uniqueness of Bellman Equations
§ Let us consider the case $L_\pi v(s) \geq L_\pi u(s)$. Then,
$$0 \leq L_\pi v(s) - L_\pi u(s)
= \gamma \sum_{s'} p_\pi(s' \mid s)\, \{v(s') - u(s')\}
\leq \gamma\, \lVert v - u \rVert \sum_{s'} p_\pi(s' \mid s) \quad \text{[Why is this?]}
= \gamma\, \lVert v - u \rVert \quad \Big[\text{Since } \sum_{s'} p_\pi(s' \mid s) = 1\Big] \qquad (8)$$
§ Similarly, when $L_\pi u(s) \geq L_\pi v(s)$, we can show that
$$0 \leq L_\pi u(s) - L_\pi v(s) \leq \gamma\, \lVert u - v \rVert = \gamma\, \lVert v - u \rVert \quad [\text{Since } \lVert u - v \rVert = \lVert v - u \rVert] \qquad (9)$$

Existence and Uniqueness of Bellman Equations
§ Putting the two equations (8) and (9) together, we get
$$|L_\pi v(s) - L_\pi u(s)| \leq \gamma\, \lVert v - u \rVert \quad \forall s \in \mathcal{S} \qquad (10)$$
§ Pointwise, or componentwise, the difference is being drawn closer by a factor of $\gamma$, so the maximum of the difference will also have come down:
$$\lVert L_\pi v - L_\pi u \rVert \leq \gamma\, \lVert v - u \rVert \qquad (11)$$
§ So, that means that $L_\pi$ is a contraction.
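A quick randomized check of inequality (11); the random $r_\pi$, $P_\pi$, $u$ and $v$ below are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 5
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic
r_pi = rng.random(n)
L = lambda x: r_pi + gamma * P_pi @ x          # the operator L_pi

u, v = rng.random(n), rng.random(n)
lhs = np.max(np.abs(L(v) - L(u)))              # ||L_pi v - L_pi u|| (max norm)
rhs = gamma * np.max(np.abs(v - u))            # gamma * ||v - u||
assert lhs <= rhs + 1e-12                      # the contraction inequality (11) holds
```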


