Markov Decision Processes (MDPs)


Machine Learning – CSE546
Carlos Guestrin, University of Washington
December 2, 2013

Markov Decision Process (MDP) Representation
- State space: joint state x of the entire system
- Action space: joint action a = {a1, ..., an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x' | x, a)

Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is:
(reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + ... (an infinite sum)

The Academic Example (discount factor γ = 0.9)
[Figure: a small Markov chain over academic-life states A, B, T, S, D, with transition probabilities such as 0.7 and 0.3.]
- V_A = expected discounted future rewards starting in state A
- V_B = expected discounted future rewards starting in state B
- V_T, V_S, V_D = expected discounted future rewards starting in states T, S, D
How do we compute V_A, V_B, V_T, V_S, V_D?
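As a small worked example (not from the slides), a minimal sketch showing that with a constant reward the discounted sum converges to reward/(1 − γ):

    def discounted_sum(rewards, gamma):
        """Sum of gamma**t * r_t over a finite reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # With a constant reward of 1 and gamma = 0.9, the truncated sum approaches
    # the closed-form limit 1 / (1 - 0.9) = 10 as the horizon grows.
    for horizon in (10, 100, 1000):
        print(horizon, discounted_sum([1.0] * horizon, 0.9))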

Policy
At state x, the policy specifies an action a for all agents: π(x) = a. Example:
- π(x0): both peasants get wood
- π(x1): one peasant builds a barrack, the other gets gold
- π(x2): peasants get gold, footmen attack

Value of Policy
Value Vπ(x): expected long-term reward starting from x and following π.
Vπ(x0) = Eπ[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
[Figure: starting from x0, a tree of possible trajectories x0 → x1, x1', x1'' → ... → x4, with future rewards discounted by γ at each step.]

Computing the Value of a Policy
Vπ(x0) = Eπ[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
- The discounted value of a state is the value of starting from x0 and continuing with policy π from then on.
- This gives a recursion: Vπ(x) = R(x, π(x)) + γ Σ_{x'} P(x' | x, π(x)) Vπ(x')

Simple Approach for Computing the Value of a Policy: Iteratively
- We can solve this with a simple convergent iterative approach (a.k.a. dynamic programming); a code sketch follows below.
- Start with some guess V_0.
- Iteratively set V_{t+1}(x) = R(x, π(x)) + γ Σ_{x'} P(x' | x, π(x)) V_t(x').
- Stop when ||V_{t+1} − V_t|| ≤ ε; this means ||Vπ − V_{t+1}|| ≤ ε/(1−γ).
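A minimal sketch of this iterative policy evaluation, assuming a hypothetical dictionary-based MDP representation (R maps (state, action) to a reward, P maps (state, action) to a dict of next-state probabilities); this is illustrative code, not code from the lecture:

    def evaluate_policy(states, policy, R, P, gamma, eps=1e-6):
        """Iterative policy evaluation (dynamic programming): repeatedly apply
        V(x) <- R(x, pi(x)) + gamma * sum_x' P(x'|x, pi(x)) * V(x')."""
        V = {x: 0.0 for x in states}          # start with the guess V0 = 0
        while True:
            V_new = {}
            for x in states:
                a = policy[x]
                V_new[x] = R[(x, a)] + gamma * sum(
                    p * V[x2] for x2, p in P[(x, a)].items())
            # stop when max_x |V_{t+1}(x) - V_t(x)| <= eps
            if max(abs(V_new[x] - V[x]) for x in states) <= eps:
                return V_new
            V = V_new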

But We Want to Learn a Policy
- So far, we have only seen how good a given policy is. But how can we choose the best policy?
- Suppose there were only one time step: the world is about to end, so select the action that maximizes the immediate reward!
(The slide repeats the earlier policy figure: π(x0) both peasants get wood; π(x1) one peasant builds a barrack, the other gets gold; π(x2) peasants get gold, footmen attack.)

Unrolling the Recursion
- Choose actions that lead to the best value in the long run.
- The optimal policy achieves the optimal value V*.

Bellman Equation
- Evaluating a policy π: Vπ(x) = R(x, π(x)) + γ Σ_{x'} P(x' | x, π(x)) Vπ(x')
- Computing the optimal value V*, the Bellman equation:
  V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]

Optimal Long-term Plan
From the optimal value function V*(x) we read off the optimal policy π*(x):
π*(x) = argmax_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]

Interesting Fact – Unique Value
V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]
- Slightly surprising fact: there is only one V* that solves the Bellman equation! (There may be many optimal policies that achieve V*.)
- Surprising fact: optimal policies are good everywhere!!!

Solving an MDP
Solve the Bellman equation → optimal value V*(x) → optimal policy π*(x)
V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]
The Bellman equation is non-linear!!! Many algorithms solve the Bellman equations:
- Policy iteration [Howard '60, Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60] (one standard LP form is sketched below)
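For reference, the linear-programming route mentioned above is not spelled out on the slide; a standard textbook formulation (an assumption here, not necessarily the exact form intended) whose optimum is V* is:

    \min_{V} \; \sum_x V(x)
    \qquad \text{s.t.} \qquad
    V(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x')
    \quad \text{for all } x, a.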

Value Iteration (a.k.a. Dynamic Programming) – the Simplest of All
- Start with some guess V_0.
- Iteratively set V_{t+1}(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V_t(x') ] (a code sketch follows below).
- Stop when ||V_{t+1} − V_t|| ≤ ε; this means ||V* − V_{t+1}|| ≤ ε/(1−γ).

A Simple Example (γ = 0.9)
You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).
[Figure: a four-state MDP with states Poor & Unknown (reward 0), Poor & Famous (reward 0), Rich & Unknown (reward 10), Rich & Famous (reward 10); edges labeled with the actions S and A and transition probabilities of 1/2 and 1.]
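A minimal value-iteration sketch with greedy policy extraction. The exact transition probabilities of the four-state example are not fully recoverable from the transcription, so the MDP at the bottom is a hypothetical two-state stand-in with the same interface, not the figure's numbers:

    def value_iteration(states, actions, R, P, gamma, eps=1e-6):
        """Value iteration: V_{t+1}(x) = max_a [R(x,a) + gamma * sum_x' P(x'|x,a) V_t(x')]."""
        V = {x: 0.0 for x in states}
        while True:
            V_new = {}
            for x in states:
                V_new[x] = max(R[(x, a)] + gamma * sum(p * V[x2] for x2, p in P[(x, a)].items())
                               for a in actions)
            converged = max(abs(V_new[x] - V[x]) for x in states) <= eps
            V = V_new
            if converged:
                break
        # read off a greedy policy, i.e. the argmax in the Bellman equation
        policy = {}
        for x in states:
            policy[x] = max(actions, key=lambda a: R[(x, a)] + gamma *
                            sum(p * V[x2] for x2, p in P[(x, a)].items()))
        return V, policy

    # Hypothetical 2-state MDP just to exercise the code; these numbers are
    # NOT the transition probabilities from the slide's 4-state figure.
    states, actions = ["poor", "rich"], ["save", "advertise"]
    R = {("poor", "save"): 0.0, ("poor", "advertise"): 0.0,
         ("rich", "save"): 10.0, ("rich", "advertise"): 10.0}
    P = {("poor", "save"): {"poor": 1.0},
         ("poor", "advertise"): {"poor": 0.5, "rich": 0.5},
         ("rich", "save"): {"rich": 0.5, "poor": 0.5},
         ("rich", "advertise"): {"poor": 1.0}}
    V_star, pi_star = value_iteration(states, actions, R, P, gamma=0.9)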

Let's Compute V_t(x) for Our Example (γ = 0.9)
V_{t+1}(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V_t(x') ]
[Figure: the same four-state MDP as above, next to a table of V_t(PU), V_t(PF), V_t(RU), V_t(RF) for t = 1, ..., 6. The table is filled in during lecture; the transcription preserves only a few of the t = 6 entries (19, 34.78, 54.02).]

What You Need to Know
- What a Markov decision process is: states, actions, transitions, rewards; a policy; the value function for a policy
- Computing Vπ
- The optimal value function and optimal policy: the Bellman equation
- Solving the Bellman equation with value iteration, policy iteration, and linear programming

Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials

Announcements
- Poster session, 3-5pm, CSE Atrium: arrive 15 minutes early; everyone must attend; write your project number on the poster; prepare a 2-3 minute overview of what you did; at least 2 instructors will see your project.
- Final project report due Monday the 9th at 9am: see the website for details (maximum 8 pages); be clear about what you did; make it read like a paper.

Reinforcement Learning
Machine Learning – CSE546
Carlos Guestrin, University of Washington
December 3, 2013

The Reinforcement Learning Task
World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.

Formalizing the (Online) Reinforcement Learning Problem
- Given a set of states X and actions A; in some versions of the problem, the sizes of X and A are unknown.
- Interact with the world at each time step t: the world gives state x_t and reward r_t; you give the next action a_t.
- Goal: (quickly) learn a policy that (approximately) maximizes the long-term expected discounted reward.
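The interaction protocol above can be sketched as a simple loop. The `env` and `agent` objects and their `reset`/`step`/`act`/`observe` methods are hypothetical names for illustration, not an API from the lecture:

    def run_episode(env, agent, num_steps, gamma):
        """Online RL loop: at each step the world gives (state, reward),
        and the agent returns the next action."""
        total, discount = 0.0, 1.0
        x = env.reset()                      # initial state from the world
        for t in range(num_steps):
            a = agent.act(x)                 # "I'll take action a"
            x_next, r = env.step(a)          # world returns next state and reward
            agent.observe(x, a, r, x_next)   # agent can learn from this transition
            total += discount * r
            discount *= gamma
            x = x_next
        return total                         # discounted return of this run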

The "Credit Assignment" Problem
I'm in state 43, reward 0, action 2
... state 39, reward 0, action 4
... state 22, reward 0, action 1
... state 21, reward 0, action 1
... state 21, reward 0, action 1
... state 13, reward 0, action 2
... state 54, reward 0, action 2
... state 26, reward 100!
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there? This is the Credit Assignment problem.

Exploration-Exploitation Tradeoff
- You have visited part of the state space and found a reward of 100. Is this the best I can hope for?
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge, at the risk of missing out on some large reward somewhere?
- Exploration: should I look for a region with more reward, at the risk of wasting my time or collecting a lot of negative reward?

Two Main Reinforcement Learning Approaches
- Model-based approaches: explore the environment, then learn a model (P(x'|x,a) and R(x,a)) (almost) everywhere; use the model to plan a policy, MDP-style. This approach leads to the strongest theoretical results and works quite well in practice when the state space is manageable.
- Model-free approaches: don't learn a model; learn the value function or policy directly. This leads to weaker theoretical results, but often works well when the state space is large.

Rmax – A Model-based Approach

Given a Dataset – Learn a Model
Given data, learn the (MDP) representation (a code sketch follows below):
- Dataset: observed experience, i.e. transitions of the form (state, action, reward, next state).
- Learn the reward function R(x,a).
- Learn the transition model P(x'|x,a).

Planning with Insufficient Information
- Model-based approach: estimate R(x,a) and P(x'|x,a); obtain a policy by value or policy iteration, or by linear programming.
- No credit assignment problem! By learning a model, the planning algorithm takes care of "assigning" credit.
- What do you plug in when you don't have enough information about a state?
  - If you don't know the reward at a particular state: plug in 0? plug in the smallest reward (Rmin)? plug in the largest reward (Rmax)?
  - And what if you don't know a particular transition probability?
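A minimal sketch of the maximum-likelihood estimates, assuming the dataset is a list of (x, a, r, x') tuples (a hypothetical format, not specified on the slide):

    from collections import defaultdict

    def learn_model(dataset):
        """Estimate R(x,a) as the average observed reward and
        P(x'|x,a) as empirical transition frequencies (MLE)."""
        reward_sum = defaultdict(float)
        counts = defaultdict(int)
        next_counts = defaultdict(lambda: defaultdict(int))
        for x, a, r, x_next in dataset:
            reward_sum[(x, a)] += r
            counts[(x, a)] += 1
            next_counts[(x, a)][x_next] += 1
        R = {sa: reward_sum[sa] / counts[sa] for sa in counts}
        P = {sa: {x2: c / counts[sa] for x2, c in nxt.items()}
             for sa, nxt in next_counts.items()}
        return R, P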

Some Challenges in Model-based RL 2: Exploration-Exploitation Tradeoff
- A state may be very hard to reach: we waste a lot of time trying to learn the rewards and transitions for this state, and after much effort the state may turn out to be useless.
- A strong advantage of a model-based approach: you know for which states the estimates of rewards and transitions are bad, so you can (try to) plan to reach these states, and you have a good estimate of how long it takes to get there.

A Surprisingly Simple Approach for Model-based RL – The Rmax Algorithm [Brafman & Tennenholtz]
- Optimism in the face of uncertainty!!! This heuristic was shown to be useful long before the theory was done (e.g., Kaelbling '90).
- If you don't know the reward for a particular state-action pair, set it to Rmax!!!
- If you don't know the transition probabilities P(x'|x,a) from some state-action pair (x,a), assume you go to a magic, fairytale new state x0, with R(x0,a) = Rmax and P(x0|x0,a) = 1.

Understanding Rmax
- With Rmax you either:
  - explore: visit a state-action pair you don't know much about, because it seems to have lots of potential; or
  - exploit: spend all your time on known states; even if the unknown states were amazingly good, it's not worth it.
- Note: you never know if you are exploring or exploiting!!!

Implicit Exploration-Exploitation Lemma
- Lemma: every T time steps, the agent either
  - exploits: achieves near-optimal reward for these T steps, or
  - explores: with high probability, visits an unknown state-action pair and learns a little about an unknown state.
- T is related to the mixing time of the Markov chain defined by the MDP: the time it takes to (approximately) forget where you started.

The Rmax Algorithm
- Initialization:
  - Add state x0 to the MDP
  - R(x,a) = Rmax for all x, a
  - P(x0 | x,a) = 1 for all x, a
  - All states (except for x0) are unknown
- Repeat (a code sketch follows below):
  - Obtain a policy for the current MDP and execute the policy
  - For any visited state-action pair, set the reward function to the appropriate (observed) value
  - If some state-action pair (x,a) has been visited enough times to estimate P(x'|x,a):
    - update the transition probabilities P(x'|x,a) for (x,a) using the MLE
    - recompute the policy

Visit Enough Times to Estimate P(x'|x,a)?
- How many times are enough? Use the Chernoff bound!
- Chernoff bound: if X1, ..., Xn are i.i.d. Bernoulli trials with probability θ, then
  P( (1/n) Σ_i X_i − θ ≥ ε ) ≤ exp{−2nε²}
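A highly simplified sketch of one Rmax (re)planning step, assuming the learn_model and value_iteration helpers sketched earlier, that every state appearing in the data is listed in `states`, and that the visit threshold m has already been chosen (in practice from the Chernoff bound above); this is an illustration of the idea, not the paper's exact algorithm:

    from collections import defaultdict

    def rmax_update(transitions, states, actions, R_max, gamma, m):
        """One (re)planning step of Rmax on the experience gathered so far.
        Unknown (x,a) pairs optimistically keep reward R_max and lead to the
        fairytale state x0; pairs visited at least m times use their MLE estimates."""
        X0 = "x0"                                        # the added optimistic state
        all_states = list(states) + [X0]
        R = {(x, a): R_max for x in all_states for a in actions}
        P = {(x, a): {X0: 1.0} for x in all_states for a in actions}
        R_hat, P_hat = learn_model(transitions)          # MLE sketch from earlier
        counts = defaultdict(int)
        for x, a, _, _ in transitions:
            counts[(x, a)] += 1
        for (x, a), n in counts.items():
            R[(x, a)] = R_hat[(x, a)]                    # observed reward is known
            if n >= m:                                   # enough visits: trust the MLE
                P[(x, a)] = P_hat[(x, a)]
        # Plan in this optimistic model; acting greedily w.r.t. the returned policy
        # either exploits (near-optimal reward) or implicitly explores unknown pairs.
        return value_iteration(all_states, actions, R, P, gamma)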

Putting It All Together
- Theorem: with probability at least 1−δ, Rmax will reach an ε-optimal policy in time polynomial in: the number of states, the number of actions, T, 1/ε, and 1/δ.
- Every T steps the agent either achieves near-optimal reward (great!) or visits an unknown state-action pair; the number of states and actions is finite, so it can't take too long before all states are known.

What You Need to Know About RL
- Reinforcement learning is neither supervised nor unsupervised learning: we try to learn to act in the world as we travel through states and collect rewards.
- Model-based and model-free approaches.
- Rmax, a model-based approach: learn a model of rewards and transitions; address the exploration-exploitation tradeoff; a simple algorithm that is great in practice.

Closing

What You Have Learned This Quarter
- Learning is function approximation
- Point estimation
- Regression
- LASSO
- Subgradient
- Stochastic gradient descent
- Coordinate descent
- Discriminative v. Generative learning
- Naïve Bayes
- Logistic regression
- Bias-Variance tradeoff
- Decision trees
- Cross validation
- Boosting
- Instance-based learning
- Perceptron
- SVMs
- Kernel trick
- PAC learning
- Bayes nets: representation, parameter and structure learning
- K-means
- EM
- Mixtures of Gaussians
- Dimensionality reduction, PCA
- MDPs
- Reinforcement learning

BIG PICTURE
Improving the performance at some task through experience! Before you start any learning task, remember the fundamental questions:
- What is the learning problem?
- From what experience?
- What model?
- What loss function are you optimizing?
- With what optimization algorithm?
- Which learning algorithm?
- With what guarantees?
- How will you evaluate it?

You Have Done a Lot!!!
And (hopefully) learned a lot!!!
- Implemented LASSO, LR, Perceptron, Clustering
- Answered hard questions and proved many interesting results
- Completed (I am sure) an amazing ML project
- And did excellently on the final!
Now you are ready for one of the most sought-after careers in industry today!

Thank You for the Hard Work!!!
