Lecture 2: Markov Decision Processes - Stanford University


Lecture 2: Markov Decision Processes
David Silver

Outline
1. Markov Processes
2. Markov Reward Processes
3. Markov Decision Processes
4. Extensions to MDPs

Markov Processes / Introduction
Introduction to MDPs
Markov decision processes formally describe an environment for reinforcement learning, where the environment is fully observable, i.e. the current state completely characterises the process.
Almost all RL problems can be formalised as MDPs, e.g.
- Optimal control primarily deals with continuous MDPs
- Partially observable problems can be converted into MDPs
- Bandits are MDPs with one state

Markov Processes / Markov Property
"The future is independent of the past given the present"
Definition: A state S_t is Markov if and only if
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
- The state captures all relevant information from the history
- Once the state is known, the history may be thrown away
- i.e. the state is a sufficient statistic of the future

Markov Processes / Markov Property
State Transition Matrix
For a Markov state s and successor state s', the state transition probability is defined by
P_{ss'} = P[S_{t+1} = s' | S_t = s]
The state transition matrix P defines transition probabilities from all states s (rows) to all successor states s' (columns),

P = [ P_11 ... P_1n ]
    [  ...          ]
    [ P_n1 ... P_nn ]

where each row of the matrix sums to 1.
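A minimal sketch in Python/NumPy of the row-stochastic check described above, assuming states are simply indexed 0..n-1; the two-state matrix is a made-up example, not one from the lecture.

```python
import numpy as np

# Hypothetical 2-state transition matrix: row i holds P[S_{t+1} = j | S_t = i].
P = np.array([
    [0.7, 0.3],   # from state 0
    [0.4, 0.6],   # from state 1
])

# Each row must be a probability distribution over successor states.
assert np.all(P >= 0)
assert np.allclose(P.sum(axis=1), 1.0)

# One-step evolution of a distribution over states: d_{t+1} = d_t P.
d0 = np.array([1.0, 0.0])   # start in state 0
print(d0 @ P)               # distribution after one step
```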

Markov Processes / Markov Chains
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.
Definition: A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
- S is a (finite) set of states
- P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s]

Markov Processes / Markov Chains
Example: Student Markov Chain
[Diagram: state transition graph with states Class 1, Class 2, Class 3, Pass, Pub, Facebook and Sleep, annotated with the transition probabilities given in the matrix two slides below.]

Markov Processes / Markov Chains
Example: Student Markov Chain Episodes
Sample episodes for the Student Markov Chain starting from S_1 = C1:
S_1, S_2, ..., S_T
- C1 C2 C3 Pass Sleep
- C1 FB FB C1 C2 Sleep
- C1 C2 C3 Pub C2 C3 Pass Sleep
- C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
[Diagram: the Student Markov Chain from the previous slide.]

Markov Processes / Markov Chains
Example: Student Markov Chain Transition Matrix
[Diagram: the Student Markov Chain, together with its transition matrix P. Rows are the current state s, columns the successor state s'; blank entries are zero.]

        C1    C2    C3    Pass  Pub   FB    Sleep
C1            0.5                     0.5
C2                  0.8                     0.2
C3                        0.6   0.4
Pass                                        1.0
Pub     0.2   0.4   0.4
FB      0.1                           0.9
Sleep                                       1.0
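A sketch of sampling episodes like those on the previous slide, using the transition matrix as reconstructed above (the probabilities are transcribed from the slide, so treat them as a transcription rather than ground truth); the state ordering and the sampling helper are assumptions of this sketch.

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    # C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (absorbing terminal state)
])

def sample_episode(rng, start="C1", terminal="Sleep"):
    """Sample one episode S1, S2, ..., ST from the chain."""
    s = states.index(start)
    episode = [states[s]]
    while states[s] != terminal:
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

rng = np.random.default_rng(0)
for _ in range(3):
    print(" ".join(sample_episode(rng)))
```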

Markov Reward Processes / MRP
Markov Reward Process
A Markov reward process is a Markov chain with values.
Definition: A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
- S is a finite set of states
- P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s]
- R is a reward function, R_s = E[R_{t+1} | S_t = s]
- γ is a discount factor, γ ∈ [0, 1]

Markov Reward Processes / MRP
Example: Student MRP
[Diagram: the Student Markov Chain with rewards attached to each state: R = -2 for Class 1, Class 2 and Class 3; R = -1 for Facebook; R = +1 for Pub; R = +10 for Pass; R = 0 for Sleep.]

Markov Reward Processes / Return
Return
Definition: The return G_t is the total discounted reward from time-step t,
G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
- The discount γ ∈ [0, 1] is the present value of future rewards
- The value of receiving reward R after k+1 time-steps is γ^k R
- This values immediate reward above delayed reward:
  - γ close to 0 leads to "myopic" evaluation
  - γ close to 1 leads to "far-sighted" evaluation
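A minimal sketch of computing G_t for a finite reward sequence; the reward list below corresponds to the first sample episode on the next slide (C1 C2 C3 Pass Sleep, using the Student MRP rewards) and reproduces the value -2.25 shown there.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards along the episode C1 C2 C3 Pass (then Sleep), with gamma = 1/2.
print(discounted_return([-2, -2, -2, 10], gamma=0.5))  # -2.25
```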

Markov Reward Processes / Return
Why discount?
Most Markov reward and decision processes are discounted. Why?
- Mathematically convenient to discount rewards
- Avoids infinite returns in cyclic Markov processes
- Uncertainty about the future may not be fully represented
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate reward
- It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.

Markov Reward Processes / Value Function
Value Function
The value function v(s) gives the long-term value of state s.
Definition: The state value function v(s) of an MRP is the expected return starting from state s,
v(s) = E[G_t | S_t = s]

Markov Reward Processes / Value Function
Example: Student MRP Returns
Sample returns for the Student MRP, starting from S_1 = C1 with γ = 1/2:
G_1 = R_2 + γ R_3 + ... + γ^{T-2} R_T
- C1 C2 C3 Pass Sleep:
  v_1 = -2 - 2·(1/2) - 2·(1/4) + 10·(1/8) = -2.25
- C1 FB FB C1 C2 Sleep:
  v_1 = -2 - 1·(1/2) - 1·(1/4) - 2·(1/8) - 2·(1/16) = -3.125
- C1 C2 C3 Pub C2 C3 Pass Sleep:
  v_1 = -2 - 2·(1/2) - 2·(1/4) + 1·(1/8) - 2·(1/16) - ... = -3.41
- C1 FB FB C1 C2 C3 Pub C1 ...:
  v_1 = -2 - 1·(1/2) - 1·(1/4) - 2·(1/8) - 2·(1/16) - ... = -3.20

Markov Reward Processes / Value Function
Example: State-Value Function for Student MRP (1)
[Diagram: v(s) for γ = 0 equals the immediate reward at each state: Facebook -1, Class 1 -2, Class 2 -2, Class 3 -2, Pub +1, Pass +10, Sleep 0.]

Markov Reward Processes / Value Function
Example: State-Value Function for Student MRP (2)
[Diagram: v(s) for γ = 0.9: Facebook -7.6, Class 1 -5.0, Class 2 0.9, Class 3 4.1, Pub 1.9, Pass 10, Sleep 0.]

Markov Reward Processes / Value Function
Example: State-Value Function for Student MRP (3)
[Diagram: v(s) for γ = 1: Facebook -23, Class 1 -13, Class 2 1.5, Class 3 4.3, Pub 0.8, Pass 10, Sleep 0.]

Markov Reward Processes / Bellman Equation
Bellman Equation for MRPs
The value function can be decomposed into two parts:
- immediate reward R_{t+1}
- discounted value of successor state γ v(S_{t+1})

v(s) = E[G_t | S_t = s]
     = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
     = E[R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...) | S_t = s]
     = E[R_{t+1} + γ G_{t+1} | S_t = s]
     = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]

Markov Reward Processes / Bellman Equation
Bellman Equation for MRPs (2)
v(s) = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
[Backup diagram: root state s with value v(s), reward r, successor states s' with values v(s').]
v(s) = R_s + γ Σ_{s'∈S} P_{ss'} v(s')

Markov Reward Processes / Bellman Equation
Example: Bellman Equation for Student MRP
4.3 = -2 + 0.6·10 + 0.4·0.8
[Diagram: v(s) for γ = 1 as on the previous slide; the value 4.3 of Class 3 equals its reward -2 plus the probability-weighted values of Pass (10, prob 0.6) and Pub (0.8, prob 0.4).]

Markov Reward Processes / Bellman Equation
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
v = R + γ P v
where v is a column vector with one entry per state:

[ v(1) ]   [ R_1 ]       [ P_11 ... P_1n ] [ v(1) ]
[  ... ] = [ ... ]  + γ  [  ...          ] [  ... ]
[ v(n) ]   [ R_n ]       [ P_n1 ... P_nn ] [ v(n) ]

Markov Reward Processes / Bellman Equation
Solving the Bellman Equation
- The Bellman equation is a linear equation
- It can be solved directly:
  v = R + γ P v
  (I - γP) v = R
  v = (I - γP)^{-1} R
- Computational complexity is O(n³) for n states
- Direct solution only possible for small MRPs
- There are many iterative methods for large MRPs, e.g.
  - Dynamic programming
  - Monte-Carlo evaluation
  - Temporal-Difference learning
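A sketch of the direct solve v = (I - γP)^{-1} R for the Student MRP as reconstructed on the earlier slides (the rewards and transition probabilities are transcribed from the diagrams, so treat them as assumptions); with γ = 0.9 it should approximately reproduce the values on the "State-Value Function for Student MRP (2)" slide.

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([  # same transition matrix as reconstructed earlier
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
gamma = 0.9

# Solve (I - gamma*P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
for s, val in zip(states, v):
    print(f"{s}: {val:.1f}")   # roughly C1 -5.0, C2 0.9, C3 4.1, FB -7.6, ...
```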

Markov Decision Processes / MDP
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Definition: A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
- γ is a discount factor, γ ∈ [0, 1]

Markov Decision Processes / MDP
Example: Student MDP
[Diagram: the Student MDP, with actions and rewards Facebook (R = -1), Quit (R = 0), Sleep (R = 0), Study (R = -2, R = -2, then R = +10) and Pub (R = +1); the Pub action leads back to the three study states with probabilities 0.2, 0.4 and 0.4.]

Markov Decision Processes / Policies
Policies (1)
Definition: A policy π is a distribution over actions given states,
π(a|s) = P[A_t = a | S_t = s]
- A policy fully defines the behaviour of an agent
- MDP policies depend on the current state (not the history)
- i.e. policies are stationary (time-independent), A_t ∼ π(·|S_t), ∀t > 0

Markov Decision Processes / Policies
Policies (2)
Given an MDP M = ⟨S, A, P, R, γ⟩ and a policy π:
- The state sequence S_1, S_2, ... is a Markov process ⟨S, P^π⟩
- The state and reward sequence S_1, R_2, S_2, ... is a Markov reward process ⟨S, P^π, R^π, γ⟩
where
P^π_{ss'} = Σ_{a∈A} π(a|s) P^a_{ss'}
R^π_s = Σ_{a∈A} π(a|s) R^a_s
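A sketch of forming the induced MRP ⟨S, P^π, R^π⟩ from tabular MDP arrays; the array shapes and names (P as (A, S, S), R as (A, S), pi as (S, A)) are conventions of this sketch, not the lecture's.

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Average the MDP dynamics over a policy.

    P:  array of shape (A, S, S), P[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
    R:  array of shape (A, S),    R[a, s]     = E[R_{t+1} | S_t=s, A_t=a]
    pi: array of shape (S, A),    pi[s, a]    = pi(a | s)

    Returns (P_pi, R_pi) with shapes (S, S) and (S,).
    """
    P_pi = np.einsum("sa,ast->st", pi, P)   # sum_a pi(a|s) P^a_{ss'}
    R_pi = np.einsum("sa,as->s", pi, R)     # sum_a pi(a|s) R^a_s
    return P_pi, R_pi
```

Feeding P_pi and R_pi into the direct linear solve shown earlier then yields v_π, which matches the matrix form v_π = (I - γP^π)^{-1} R^π given a few slides below.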

Markov Decision Processes / Value Functions
Value Function
Definition: The state-value function v_π(s) of an MDP is the expected return starting from state s, and then following policy π,
v_π(s) = E_π[G_t | S_t = s]
Definition: The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π,
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

Markov Decision Processes / Value Functions
Example: State-Value Function for Student MDP
[Diagram: the Student MDP with v_π(s) for the uniform random policy π(a|s) = 0.5, γ = 1: Facebook state -2.3, the three study states -1.3, 2.7 and 7.4, terminal state 0.]

Markov Decision Processes / Bellman Expectation Equation
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of successor state,
v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
The action-value function can similarly be decomposed,
q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

Markov Decision Processes / Bellman Expectation Equation
Bellman Expectation Equation for V^π
[Backup diagram: root state s with value v_π(s), branching over actions a with action-values q_π(s, a).]
v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)

Markov Decision Processes / Bellman Expectation Equation
Bellman Expectation Equation for Q^π
[Backup diagram: root (s, a) with value q_π(s, a), reward r, successor states s' with values v_π(s').]
q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s')

Markov Decision Processes / Bellman Expectation Equation
Bellman Expectation Equation for v_π (2)
[Two-step backup diagram: from state s over actions a to successor states s'.]
v_π(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

Markov Decision Processes / Bellman Expectation Equation
Bellman Expectation Equation for q_π (2)
[Two-step backup diagram: from (s, a) over successor states s' to successor actions a'.]
q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} π(a'|s') q_π(s', a')
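The two-step equations suggest a simple dynamic-programming evaluation of v_π by repeatedly applying the backup; the sketch below uses the same tabular array conventions as the earlier snippet (this is iterative policy evaluation, not code from the lecture).

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation via the Bellman expectation equation.

    Repeatedly applies v(s) <- sum_a pi(a|s) [R^a_s + gamma * sum_s' P^a_{ss'} v(s')].
    Assumes gamma < 1, or a proper (terminating) MDP if gamma = 1.
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v                  # q[a, s] = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
        v_new = np.einsum("sa,as->s", pi, q)   # average over the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```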

Markov Decision Processes / Bellman Expectation Equation
Example: Bellman Expectation Equation in Student MDP
7.4 = 0.5 · (1 + 0.2·(-1.3) + 0.4·2.7 + 0.4·7.4) + 0.5 · 10
[Diagram: the Student MDP with v_π for the uniform random policy, γ = 1; the value 7.4 of the final study state is verified by averaging the Pub and Study actions.]

Markov Decision Processes / Bellman Expectation Equation
Bellman Expectation Equation (Matrix Form)
The Bellman expectation equation can be expressed concisely using the induced MRP,
v_π = R^π + γ P^π v_π
with direct solution
v_π = (I - γ P^π)^{-1} R^π

Markov Decision Processes / Optimal Value Functions
Optimal Value Function
Definition: The optimal state-value function v*(s) is the maximum value function over all policies,
v*(s) = max_π v_π(s)
The optimal action-value function q*(s, a) is the maximum action-value function over all policies,
q*(s, a) = max_π q_π(s, a)
- The optimal value function specifies the best possible performance in the MDP
- An MDP is "solved" when we know the optimal value function

Markov Decision Processes / Optimal Value Functions
Example: Optimal Value Function for Student MDP
[Diagram: the Student MDP with v*(s) for γ = 1: Facebook state 6, the three study states 6, 8 and 10, terminal state 0.]

Markov Decision Processes / Optimal Value Functions
Example: Optimal Action-Value Function for Student MDP
[Diagram: the Student MDP with q*(s, a) for γ = 1: Facebook q* = 5, Quit q* = 6, Sleep q* = 0, Study q* = 6, 8 and 10 from the three study states, Pub q* = 8.4; the state values v* = 6, 6, 8, 10 are as on the previous slide.]

Markov Decision Processes / Optimal Value Functions
Optimal Policy
Define a partial ordering over policies:
π ≥ π' if v_π(s) ≥ v_π'(s), ∀s
Theorem: For any Markov Decision Process
- There exists an optimal policy π* that is better than or equal to all other policies, π* ≥ π, ∀π
- All optimal policies achieve the optimal value function, v_π*(s) = v*(s)
- All optimal policies achieve the optimal action-value function, q_π*(s, a) = q*(s, a)

Markov Decision Processes / Optimal Value Functions
Finding an Optimal Policy
An optimal policy can be found by maximising over q*(s, a),
π*(a|s) = 1 if a = argmax_{a∈A} q*(s, a), 0 otherwise
- There is always a deterministic optimal policy for any MDP
- If we know q*(s, a), we immediately have the optimal policy
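A sketch of this greedy extraction for a tabular q*, stored here as an (S, A) array (an assumption of the sketch); ties in the argmax are broken arbitrarily, which is fine since any maximising action is optimal.

```python
import numpy as np

def greedy_policy(q_star):
    """q_star: array of shape (S, A). Returns pi_star of shape (S, A), one-hot per state."""
    n_states, _ = q_star.shape
    pi_star = np.zeros_like(q_star)
    pi_star[np.arange(n_states), np.argmax(q_star, axis=1)] = 1.0
    return pi_star
```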

Markov Decision Processes / Optimal Value Functions
Example: Optimal Policy for Student MDP
[Diagram: the Student MDP with the optimal policy π*(a|s) for γ = 1 highlighted: Study in each study state and Quit from the Facebook state, shown together with the q* values from the previous slide.]

Markov Decision Processes / Bellman Optimality Equation
Bellman Optimality Equation for v*
The optimal value functions are recursively related by the Bellman optimality equations:
[Backup diagram: root state s with value v*(s), branching over actions a with values q*(s, a).]
v*(s) = max_a q*(s, a)

Markov Decision Processes / Bellman Optimality Equation
Bellman Optimality Equation for Q*
[Backup diagram: root (s, a) with value q*(s, a), reward r, successor states s' with values v*(s').]
q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v*(s')

Markov Decision Processes / Bellman Optimality Equation
Bellman Optimality Equation for V* (2)
[Two-step backup diagram: from state s over actions a to successor states s'.]
v*(s) = max_a ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v*(s') )

Markov Decision Processes / Bellman Optimality Equation
Bellman Optimality Equation for Q* (2)
[Two-step backup diagram: from (s, a) over successor states s' to successor actions a'.]
q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} max_{a'} q*(s', a')

Markov Decision Processes / Bellman Optimality Equation
Example: Bellman Optimality Equation in Student MDP
6 = max{-2 + 8, -1 + 6}
[Diagram: the Student MDP with v* for γ = 1; the value 6 of the first study state is the maximum over the Study action (-2 + 8) and the Facebook action (-1 + 6).]

Markov Decision Processes / Bellman Optimality Equation
Solving the Bellman Optimality Equation
- The Bellman optimality equation is non-linear
- No closed form solution (in general)
- Many iterative solution methods:
  - Value Iteration
  - Policy Iteration
  - Q-learning
  - Sarsa
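As one example of the iterative methods listed above, here is a tabular value-iteration sketch that repeatedly applies the Bellman optimality backup; it uses the same array conventions as the earlier sketches and is not code from the lecture.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Tabular value iteration: v(s) <- max_a [R^a_s + gamma * sum_s' P^a_{ss'} v(s')].

    P has shape (A, S, S), R has shape (A, S). Assumes gamma < 1 (or a terminating MDP).
    Returns the optimal state values and a greedy (deterministic) policy.
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v          # q[a, s]
        v_new = q.max(axis=0)          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)   # optimal values, greedy action per state
        v = v_new
```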

Extensions to MDPs (no exam)
- Infinite and continuous MDPs
- Partially observable MDPs
- Undiscounted, average reward MDPs

Extensions to MDPs / Infinite MDPs (no exam)
The following extensions are all possible:
- Countably infinite state and/or action spaces
  - Straightforward
- Continuous state and/or action spaces
  - Closed form for the linear quadratic model (LQR)
- Continuous time
  - Requires partial differential equations
  - Hamilton-Jacobi-Bellman (HJB) equation
  - Limiting case of the Bellman equation as the time-step → 0

Extensions to MDPs / Partially Observable MDPs
POMDPs (no exam)
A Partially Observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions.
Definition: A POMDP is a tuple ⟨S, A, O, P, R, Z, γ⟩
- S is a finite set of states
- A is a finite set of actions
- O is a finite set of observations
- P is a state transition probability matrix, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
- Z is an observation function, Z^a_{s'o} = P[O_{t+1} = o | S_{t+1} = s', A_t = a]
- γ is a discount factor, γ ∈ [0, 1]

Extensions to MDPs / Partially Observable MDPs
Belief States (no exam)
Definition: A history H_t is a sequence of actions, observations and rewards,
H_t = A_0, O_1, R_1, ..., A_{t-1}, O_t, R_t
Definition: A belief state b(h) is a probability distribution over states, conditioned on the history h,
b(h) = (P[S_t = s¹ | H_t = h], ..., P[S_t = sⁿ | H_t = h])

Extensions to MDPs / Partially Observable MDPs
Reductions of POMDPs (no exam)
- The history H_t satisfies the Markov property
- The belief state b(H_t) satisfies the Markov property
[Diagram: a history tree rooted at the empty history, branching on actions a1, a2 and observations o1, o2 (nodes a1, a1o1, a1o1a1, ...), alongside the corresponding belief state tree (P(s), P(s|a1), P(s|a1o1), ...).]
- A POMDP can be reduced to an (infinite) history tree
- A POMDP can be reduced to an (infinite) belief state tree

Extensions to MDPs / Average Reward MDPs
Ergodic Markov Process (no exam)
An ergodic Markov process is:
- Recurrent: each state is visited an infinite number of times
- Aperiodic: each state is visited without any systematic period
Theorem: An ergodic Markov process has a limiting stationary distribution d^π(s) with the property
d^π(s) = Σ_{s'∈S} d^π(s') P_{s's}
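The stationary distribution in the theorem can be approximated by iterating d ← dP until it stops changing; a minimal power-iteration sketch, assuming P is an ergodic row-stochastic matrix (this is an illustration, not part of the slides).

```python
import numpy as np

def stationary_distribution(P, tol=1e-12):
    """Power iteration for d = d P, with P an ergodic row-stochastic matrix."""
    n = P.shape[0]
    d = np.full(n, 1.0 / n)      # start from the uniform distribution
    while True:
        d_new = d @ P
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
```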

Extensions to MDPs / Average Reward MDPs
Ergodic MDP (no exam)
Definition: An MDP is ergodic if the Markov chain induced by any policy is ergodic.
For any policy π, an ergodic MDP has an average reward per time-step ρ^π that is independent of the start state,
ρ^π = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} R_t ]

Extensions to MDPs / Average Reward MDPs
Average Reward Value Function (no exam)
The value function of an undiscounted, ergodic MDP can be expressed in terms of average reward.
ṽ_π(s) is the extra reward due to starting from state s,
ṽ_π(s) = E_π[ Σ_{k=1}^∞ (R_{t+k} - ρ^π) | S_t = s ]
There is a corresponding average reward Bellman equation,
ṽ_π(s) = E_π[ (R_{t+1} - ρ^π) + Σ_{k=1}^∞ (R_{t+k+1} - ρ^π) | S_t = s ]
       = E_π[ (R_{t+1} - ρ^π) + ṽ_π(S_{t+1}) | S_t = s ]

Questions?
"The only stupid question is the one you were afraid to ask but never did."
- Rich Sutton

