Markov Decision Process I - Cse.msu.edu


Markov Decision Process I
Hui Liu (liuhui7@msu.edu)
Time & Place: Tu/Th 10:20-11:40 & Zoom
CSE-440 Spring 2022
Ack: Berkeley AI course

Outline
- Non-Deterministic Search
- Markov Decision Process

Example: Grid World
- A maze-like problem
  - The agent lives in a grid
  - Walls block the agent's path
- Noisy movement: actions do not always go as planned
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- The agent receives rewards
  - Small "living" reward each step (can be negative)
  - Big rewards come at the end (good or bad)
- Goal: maximize sum of rewards
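As a concrete illustration of the noisy movement described above, here is a minimal sampling sketch. The grid representation (coordinates plus a set of wall cells) and the helper name sample_north are assumptions for illustration, not part of the slides.

```python
import random

def sample_north(pos, walls):
    """Sample where a noisy 'North' action actually takes the agent."""
    x, y = pos
    r = random.random()
    if r < 0.8:
        target = (x, y + 1)   # 80%: goes North as intended
    elif r < 0.9:
        target = (x - 1, y)   # 10%: slips West
    else:
        target = (x + 1, y)   # 10%: slips East
    # If a wall blocks the resulting cell, the agent stays put.
    return pos if target in walls else target
```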

Grid World Actions
[Figure: Deterministic Grid World vs. Stochastic Grid World]

Outline
- Non-Deterministic Search
- Markov Decision Process

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - Probability that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model or the dynamics
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state
  - Maybe a terminal state
- MDPs are non-deterministic search problems
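The definition above maps naturally onto a small container object. This is a minimal sketch under assumed names (MDP, transitions, rewards), not a specific library API.

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: set        # S
    actions: dict      # A(s): actions available in each state
    transitions: dict  # (s, a) -> list of (s', probability)
    rewards: dict      # (s, a, s') -> reward
    start: object      # start state
    terminals: set = field(default_factory=set)  # optional terminal states

    def T(self, s, a):
        """All (next_state, probability) pairs for taking a in s."""
        return self.transitions.get((s, a), [])

    def R(self, s, a, s_next):
        """Reward for the transition (s, a, s')."""
        return self.rewards.get((s, a, s_next), 0.0)
```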

What is Markov about MDPs?
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means action outcomes depend only on the current state
- This is just like search, where the successor function could only depend on the current state (not the history)
(Andrey Markov, 1856-1922)
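Spelled out as an equation (a standard formalization, not shown on the slide): P(S_{t+1} = s' | S_t = s_t, A_t = a_t, S_{t-1}, A_{t-1}, ..., S_0) = P(S_{t+1} = s' | S_t = s_t, A_t = a_t), i.e., conditioning on the full history adds nothing beyond the current state and action.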

Policies
- In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
- For MDPs, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy is one that maximizes expected utility if followed
  - An explicit policy defines a reflex agent
(Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminals s)
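Concretely, an explicit policy is just a lookup table from states to actions; the grid coordinates and action names below are illustrative assumptions.

```python
# A policy as an explicit state -> action table (a reflex agent).
policy = {
    (0, 0): "North",
    (0, 1): "North",
    (0, 2): "East",
    (1, 2): "East",
    (2, 2): "East",
}

def act(state):
    """Reflex agent: acts on the current state only, via the policy table."""
    return policy[state]
```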

Optimal Policies
[Figure: optimal grid-world policies for different living rewards: R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0]

Example: Racing
- A robot car wants to travel far, quickly
- Three states: Cool, Warm, Overheated
- Two actions: Slow, Fast
- Going faster gets double reward
- Transitions (from the state diagram):
  - Cool, Slow: stay Cool with probability 1.0 (reward +1)
  - Cool, Fast: stay Cool with probability 0.5 or move to Warm with probability 0.5 (reward +2)
  - Warm, Slow: cool down to Cool with probability 0.5 or stay Warm with probability 0.5 (reward +1)
  - Warm, Fast: Overheated with probability 1.0 (reward -10)
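The same diagram, written as a Python transition table (the dictionary layout and names are illustrative):

```python
# (state, action) -> list of (probability, next_state, reward)
racing_transitions = {
    ("Cool", "Slow"): [(1.0, "Cool", 1)],
    ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
    ("Warm", "Slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
    ("Warm", "Fast"): [(1.0, "Overheated", -10)],
}
```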

Racing Search Tree
[Figure: expectimax-style search tree for the racing MDP, with 0.5/0.5 chance branches]

MDP Search Trees
- Each MDP state projects an expectimax-like search tree
  - s is a state
  - (s, a) is a q-state
  - (s, a, s') is called a transition
  - T(s, a, s') = P(s' | s, a)
  - R(s, a, s') is the reward on that transition
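The tree structure suggests a one-step backup: a q-state averages over its transitions, and a state maximizes over its actions. This is a hedged sketch built on the MDP container assumed earlier; V is an assumed value estimate for successor states, and gamma is the discount introduced on the following slides.

```python
def q_value(mdp, s, a, V, gamma):
    """Expected value of the q-state (s, a): average over transitions."""
    return sum(p * (mdp.R(s, a, s_next) + gamma * V[s_next])
               for s_next, p in mdp.T(s, a))

def state_value(mdp, s, V, gamma):
    """Value of state s: maximize over the available actions."""
    return max(q_value(mdp, s, a, V, gamma) for a in mdp.actions[s])
```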

Utilities of Sequences
- What preferences should an agent have over reward sequences?
- More or less? [1, 2, 2] or [2, 3, 4]
- Now or later? [0, 0, 1] or [1, 0, 0]

Discounting
- It's reasonable to maximize the sum of rewards
- It's also reasonable to prefer rewards now to rewards later
- One solution: values of rewards decay exponentially
  (Figure: a reward is worth 1 now, γ one step from now, γ² two steps from now)

Discounting
- How to discount?
  - Each time we descend a level, we multiply in the discount once
- Why discount?
  - Reward now is better than later
  - Also helps our algorithms converge
- Example: discount of 0.5
  - U([1, 2, 3]) = 1*1 + 0.5*2 + 0.25*3
  - U([1, 2, 3]) < U([3, 2, 1])
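A quick worked check of the discount-0.5 example (the helper name is illustrative):

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25
# So U([1, 2, 3]) < U([3, 2, 1]) under a discount of 0.5.
```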

Stationary Preferences
- Theorem: if we assume stationary preferences (a preference between two reward sequences is unchanged when the same reward is prepended to both), then there are only two ways to define utilities:
  - Additive utility: U([r0, r1, r2, ...]) = r0 + r1 + r2 + ...
  - Discounted utility: U([r0, r1, r2, ...]) = r0 + γ*r1 + γ²*r2 + ...

Infinite Utilities?!
- Problem: what if the game lasts forever? Do we get infinite rewards?
- Solutions:
  - Finite horizon (similar to depth-limited search)
    - Terminate episodes after a fixed T steps (e.g. life)
    - Gives nonstationary policies (π depends on the time left)
  - Discounting: use 0 < γ < 1
    - Smaller γ means a smaller "horizon", i.e., a shorter-term focus
  - Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)
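Why discounting keeps utilities finite (a standard bound, not spelled out on the slide): if every reward satisfies |r_t| ≤ R_max and 0 < γ < 1, then |sum over t of γ^t * r_t| ≤ R_max * (1 + γ + γ² + ...) = R_max / (1 - γ), a finite geometric series.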

Example: Discounting
- Given: a row of states a, b, c, d, e (from the figure; exiting at a appears to pay 10 and exiting at e to pay 1)
- Actions: East, West, and Exit (only available in exit states a, e)
- Transitions: deterministic
- Q1: For γ = 1, what is the optimal policy?
- Q2: For γ = 0.1, what is the optimal policy?
- Q3: For which γ are West and East equally good when in state d? (The slide's answer: 1γ = 10γ³.)
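Under the layout assumed above (from state d, the 10-reward exit is three steps West and the 1-reward exit is one step East), Q3 reduces to 10γ³ = 1γ. A quick numeric check:

```python
gamma = (1 / 10) ** 0.5            # from 10*g**3 = g  =>  g**2 = 1/10
print(gamma)                       # ~0.316
print(10 * gamma ** 3, 1 * gamma)  # both ~0.316: West and East tie at state d
```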

Recap: Defining MDPs
- Markov decision processes:
  - Set of states S
  - Start state s0
  - Set of actions A
  - Transitions P(s' | s, a) (or T(s, a, s'))
  - Rewards R(s, a, s') (and discount γ)
- MDP quantities so far:
  - Policy = choice of action for each state
  - Utility = sum of (discounted) rewards

Important This Week
- Homework 2 is released on Mimir, due next Monday.
- Questions?

