Markov Decision Processes


Markov Decision Processes
Robert Platt, Northeastern University
Some images and slides are used from:
1. CS188 UC Berkeley
2. AIMA
3. Chris Amato
4. Stacy Marsella

Stochastic domains
So far, we have studied search. We can use search to solve simple planning problems, e.g. robot planning using A* – but only in deterministic domains. A* doesn't work so well in stochastic environments.
We are going to introduce a new framework for encoding problems w/ stochastic dynamics: the Markov Decision Process (MDP).

SEQUENTIAL DECISION MAKING

MAKING DECISIONS UNDER UNCERTAINTY
Rational decision making requires reasoning about one's uncertainty and objectives. The previous section focused on uncertainty. This section discusses how to make rational decisions based on a probabilistic model and a utility function. Last class, we focused on single-step decisions; now we will consider sequential decision problems.

REVIEW: EXPECTIMAX
What if we don't know the outcome of actions? Actions can fail: when a robot moves, its wheels might slip. Opponents may be uncertain.
Expectimax search: maximize average score.
– MAX nodes choose the action that maximizes outcome.
– Chance nodes model an outcome (a value) that is uncertain.
– Use expected utilities: the weighted average (expectation) of children.
(Slide shows an expectimax tree with MAX and chance nodes.)

REVIEW: PROBABILITY AND EXPECTED UTILITY
EU = Σ probability(outcome) × value(outcome)
Expected utility is the probability-weighted average of all possible values, i.e., each possible value is multiplied by its probability of occurring and the resulting products are summed.
What is the expected value of rolling a six-sided die if you threw the die MANY times?
(1/6 × 1) + (1/6 × 2) + (1/6 × 3) + (1/6 × 4) + (1/6 × 5) + (1/6 × 6) = 3.5

DIFFERENT APPROACH IN SEQUENTIAL DECISION MAKING
In deterministic planning, our agents generated entire plans: an entire sequence of actions from start to goals, under the assumption that the environment was deterministic and actions were reliable.
In Expectimax, chance nodes model nondeterminism, but the agent only determined the best next action within a bounded horizon.
Now we consider agents who use a "Policy": a strategy that determines what action to take in any state, assuming unreliable action outcomes & infinite horizons.

Markov Decision Process (MDP): grid world example
Rewards: +1 and −1 in the two terminal cells
– agent gets these rewards in these cells
– goal of agent is to maximize reward
Actions: left, right, up, down
– take one action per time step
– actions are stochastic: only go in intended direction 80% of the time
States:
– each cell is a state

Markov Decision Process (MDP)
Deterministic
– same action always has same outcome (probability 1.0)
Stochastic
– same action could have different outcomes (e.g., 0.8 in the intended direction, 0.1 to either side)

Markov Decision Process (MDP)
Same action could have different outcomes. Transition function at s1 for action a:
s'          | s2  | s3  | s4
T(s1, a, s')| 0.1 | 0.8 | 0.1

Markov Decision Process (MDP)
Technically, an MDP is a 4-tuple (S, A, T, R). An MDP defines a stochastic control problem:
– State set: S
– Action set: A
– Transition function: T(s, a, s') – the probability of going from s to s' when executing action a
– Reward function: R
But, what is the objective?
Objective: calculate a strategy for acting so as to maximize the future rewards.
– we will calculate a policy that will tell us how to act
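As a concrete illustration, here is a minimal Python sketch of how such a 4-tuple might be written down for a tiny grid world. The state names, probabilities, and reward values below are illustrative assumptions, not taken from the slides.

```python
# A minimal sketch (illustrative, not from the slides) of the 4-tuple (S, A, T, R)
# for a tiny grid world.

STATES  = ["s1", "s2", "s3", "s4"]          # S: one entry per grid cell
ACTIONS = ["left", "right", "up", "down"]   # A

# T[(s, a)] is a dict mapping successor state s' -> probability T(s, a, s').
# Here "up" from s1 goes to s2 with probability 0.8 and slips sideways
# to s3 or s4 with probability 0.1 each, as in the slide's example.
T = {
    ("s1", "up"): {"s2": 0.8, "s3": 0.1, "s4": 0.1},
    # ... one entry per (state, action) pair
}

# R[s] is the reward collected in state s (e.g., +1 / -1 in the terminal cells).
R = {"s1": 0.0, "s2": 1.0, "s3": 0.0, "s4": -1.0}
```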

What is a policy?
We want an optimal policy.
– A policy gives an action for each state.
– An optimal policy is one that maximizes expected utility if followed.
For deterministic single-agent search problems, we derived an optimal plan, or sequence of actions, from start to a goal.
Expectimax didn't compute entire policies: it computed the action for a single state only, over a limited horizon, with final rewards only.
(Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminal s – a cost of living.)

What is a policy?
A policy tells the agent what action to execute as a function of state.
Deterministic policy:
– agent always executes the same action from a given state
Stochastic policy:
– agent selects an action to execute by drawing from a probability distribution encoded by the policy
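A small Python sketch of the two representations; the state and action names reuse the illustrative grid world above and are assumptions, not from the slides.

```python
# Deterministic policy: pi(s) = a, exactly one action per state.
pi_det = {"s1": "up", "s2": "right", "s3": "right", "s4": "left"}

# Stochastic policy: pi(a | s), a probability distribution over actions per state.
pi_stoch = {
    "s1": {"up": 0.9, "right": 0.1},
    "s2": {"right": 1.0},
}
```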

Examples of optimal policies
(Grid world policies shown for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0.)

Markov?
"Markovian Property": given the present state, the future and the past are independent.
For Markov decision processes, "Markov" means action outcomes depend only on the current state.
This is just like search, where the successor function could only depend on the current state (not the history).
(Andrey Markov, 1856-1922)

Another example of an MDP
A robot car wants to travel far, quickly.
Three states: Cool, Warm, Overheated. Two actions: Slow, Fast. Going faster gets double reward.
– From Cool: Slow → Cool (prob 1.0, reward +1); Fast → Cool (0.5, +2) or Warm (0.5, +2)
– From Warm: Slow → Cool (0.5, +1) or Warm (0.5, +1); Fast → Overheated (1.0, -10)

Objective: maximize expected future reward
A first idea: maximize the expected future reward starting at time t (the sum of all future rewards). What's wrong w/ this? Over an infinite horizon the sum can be unbounded.
Two viable alternatives:
1. maximize expected future reward over the next T timesteps (finite horizon)
2. maximize expected discounted future rewards, using a discount factor γ (usually around 0.9)
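A standard way to write these two objectives (my rendering, since the slide's formulas did not survive transcription; r_{t+k} denotes the reward received k steps after time t and γ is the discount factor):

```latex
% Finite-horizon objective: expected reward over the next T time steps
\mathbb{E}\left[\sum_{k=0}^{T} r_{t+k}\right]

% Discounted objective: expected discounted future reward, 0 \le \gamma < 1
\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\right]
```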

Discounting

STATIONARY PREFERENCES
Theorem: if we assume stationary preferences, then there are only two ways to define utilities:
– Additive utility: U([r0, r1, r2, ...]) = r0 + r1 + r2 + ...
– Discounted utility: U([r0, r1, r2, ...]) = r0 + γ r1 + γ² r2 + ...

QUIZ: DISCOUNTING
Given: actions East, West, and Exit (Exit only available in the exit states a, e); transitions are deterministic.
Quiz 1: For γ = 1, what is the optimal policy?
Quiz 2: For γ = 0.1, what is the optimal policy?
Quiz 3: For which γ are West and East equally good when in state d?

UTILITIES OVER TIME: FINITE OR INFINITE HORIZON?
If there is a fixed time, N, after which nothing can happen, what should an agent do?
– E.g., if N = 3, the bot must head directly for the +1 state; if N = 100, it can take the safe route.
So with a finite horizon, the optimal action changes over time: the optimal policy is nonstationary (it depends on the time left).

Choosing a reward function
A few possibilities:
– all reward on goal
– negative reward everywhere except terminal states
– gradually increasing reward as you approach the goal
In general:
– reward can be whatever you want
(Figure: grid world with +1 and -1 terminal cells.)

Value functions
Value function: expected discounted reward if the agent acts optimally starting in state s.
Action value function: expected discounted reward if the agent acts optimally after taking action a from state s.
Game plan:
1. calculate the optimal value function
2. calculate the optimal policy from the optimal value function
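Written out (standard definitions, rendered here because the slide's equations did not survive transcription):

```latex
% Optimal value function: expected discounted reward acting optimally from s
V^{*}(s) = \max_{\pi}\; \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ \pi \right]

% Optimal action-value function: take a from s, then act optimally
Q^{*}(s,a) = \max_{\pi}\; \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi \right]
```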

Grid world optimal value function
Noise = 0.2, Discount = 0.9, Living reward = 0

Grid world optimal action-value function
Noise = 0.2, Discount = 0.9, Living reward = 0

Time-limited values
Key idea: time-limited values. Define Vk(s) to be the optimal value of s if the game ends in k more time steps. Equivalently, it's what a depth-k expectimax would give from s.

Value iteration
Value iteration calculates the time-limited value function, V_i:
Value Iteration
Input: MDP (S, A, T, r)
Output: value function, V
1. let V_0(s) = 0 for all s
2. for i = 0 to infinity
3.   for all s in S
4.     V_{i+1}(s) = max_a Σ_{s'} T(s, a, s') [ r(s, a, s') + γ V_i(s') ]
5.   if V converged, then break
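A minimal Python sketch of this loop, assuming the MDP is stored in the illustrative STATES/ACTIONS/T/R dictionaries from the earlier sketch and that the reward depends only on the successor state (the slides' r(s, a, s') convention works the same way):

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                    # 1. let V_0(s) = 0
    while True:                                     # 2. for i = 0 to infinity
        V_new = {}
        for s in states:                            # 3. for all s
            # 4. V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s') + gamma V_i(s') ]
            V_new[s] = max(
                sum(p * (R[s2] + gamma * V[s2])
                    for s2, p in T.get((s, a), {}).items())
                for a in actions
            )
        # 5. stop once the largest change is below the tolerance
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```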

Value iteration example
(Sequence of slides showing successive iterations of the value function on the grid world; Noise = 0.2, Discount = 0.9, Living reward = 0.)

Value iteration
(Same pseudocode as above.) Let's look at the update equation in step 4 more closely.

Value iteration
Value of getting to s' by taking a from s: r(s, a, s') + γ V(s')
– r(s, a, s'): reward obtained on this time step
– γ V(s'): discounted value of being at s'

Value iteration
– Value of getting to s' by taking a from s: r(s, a, s') + γ V(s')
– Expected value of taking action a: Σ_{s'} T(s, a, s') [ r(s, a, s') + γ V(s') ]
Why do we maximize? Because the agent gets to choose the action with the highest expected value.

Value iteration
Value of s at k timesteps to go: V_k(s).
Value iteration:
1. initialize V_0(s) = 0 for all s
2. compute V_1 from V_0
3. compute V_2 from V_1
4. ...
k. compute V_k from V_{k-1}

Value iteration example (racing car; assume no discount!)
V_0 = (Cool: 0, Warm: 0, Overheated: 0)
V_1(Cool): Slow gives 1.0 [1 + V_0(c)] = 1; Fast gives 0.5 [2 + V_0(c)] + 0.5 [2 + V_0(w)] = 2, so V_1(Cool) = 2
V_2(Cool): Slow gives 1.0 [1 + V_1(c)] = 3; Fast gives 0.5 [2 + V_1(c)] + 0.5 [2 + V_1(w)] = 3.5, so V_2(Cool) = 3.5
(Full table: V_1 = (2, 1, 0), V_2 = (3.5, 2.5, 0).)

Value iteration
(Same pseudocode as above.)
How do we know that this converges?
How do we know that this converges to the optimal value function?

Convergence
How do we know the Vk vectors are going to converge?
Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values.
Case 2: If the discount is less than 1.
Sketch: For any state, Vk and Vk+1 can be viewed as depth k+1 expectimax results in nearly identical search trees. The last layer is at most all RMAX and at least RMIN, but everything at that depth is discounted by γ^k. So Vk and Vk+1 are at most γ^k (RMAX − RMIN) different.
– So as k increases, the values converge.
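One way to write out the bound this sketch refers to (my rendering):

```latex
% Trees for V_k and V_{k+1} agree except in the last layer, which is discounted by gamma^k
\left| V_{k+1}(s) - V_{k}(s) \right| \;\le\; \gamma^{k}\,\bigl( R_{\max} - R_{\min} \bigr),
\quad\text{so the } V_k \text{ converge as } k \to \infty \text{ when } \gamma < 1.
```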

Optimality
At convergence, this property must hold (why?). What does this equation tell us about the optimality of value iteration?
– we denote the optimal value function as V*
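The property in question is the Bellman optimality equation (standard form; the slide's rendering did not survive transcription):

```latex
V^{*}(s) \;=\; \max_{a} \sum_{s'} T(s,a,s')\,\bigl[\, r(s,a,s') + \gamma\, V^{*}(s') \,\bigr]
```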

Bellman Equation
With this equation, Bellman introduced dynamic programming in 1953. It will be the focus of the next few lectures.
(Richard Bellman, 1920-1984)

Gauss-Seidel Value Iteration
Regular value iteration maintains two V arrays: the old V and the new V.
Gauss-Seidel value iteration maintains only one V array:
– each update is immediately applied
– can lead to faster convergence
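A sketch of the same update done in place, reusing the illustrative dictionary representation from the earlier sketches:

```python
def gauss_seidel_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(p * (R[s2] + gamma * V[s2])
                    for s2, p in T.get((s, a), {}).items())
                for a in actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v            # immediate, in-place update: later states in
                                    # this sweep already see the new value
        if delta < tol:
            return V
```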

Computing a policy from the value function
Notice the little arrows in the grid world figures: the arrows denote a policy.
– how do we calculate it?

Computing a policy from the value function
Given values calculated using value iteration, do one step of expectimax:
π*(s) = argmax_a Σ_{s'} T(s, a, s') [ r(s, a, s') + γ V*(s') ]
The optimal policy is implied by the optimal value function.
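A sketch of this one-step-of-expectimax policy extraction, again over the illustrative dictionary MDP:

```python
def extract_policy(states, actions, T, R, V, gamma=0.9):
    """Return the greedy (deterministic) policy implied by the values V."""
    pi = {}
    for s in states:
        # pick the action with the highest expected one-step lookahead value
        pi[s] = max(
            actions,
            key=lambda a: sum(p * (R[s2] + gamma * V[s2])
                              for s2, p in T.get((s, a), {}).items())
        )
    return pi
```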

Stochastic policies vs deterministic policies
In general, a policy is a distribution over actions: π(a | s).
Here, we restrict consideration to deterministic policies: π(s) = a.

Problems with value iteration
Problem 1: It's slow – O(S²A) per iteration.
Problem 2: The "max" at each state rarely changes.
Problem 3: The policy often converges long before the values.

Policy iteration
An alternative approach for computing optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
Step 2: Policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) utilities as future values.
Repeat the steps until the policy converges.
This is policy iteration.
– It's still optimal!
– Can converge (much) faster under some conditions.
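A sketch of the two alternating steps; it reuses extract_policy from above, and the policy_evaluation helper it calls is sketched after the policy-evaluation slide below:

```python
def policy_iteration(states, actions, T, R, gamma=0.9):
    pi = {s: actions[0] for s in states}                 # arbitrary initial policy
    while True:
        V = policy_evaluation(states, T, R, pi, gamma)   # step 1: evaluate pi
        new_pi = extract_policy(states, actions, T, R, V, gamma)  # step 2: improve
        if new_pi == pi:                                 # policy converged
            return pi, V
        pi = new_pi
```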

Policy evaluation
What if you want to calculate the value function for a given sub-optimal policy? Answer: Policy Evaluation!
Policy Evaluation
Input: MDP (S, A, T, r), policy π
Output: value function, V
1. let V_0(s) = 0 for all s
2. for i = 0 to infinity
3.   for all s in S
4.     V_{i+1}(s) = Σ_{s'} T(s, π(s), s') [ r(s, π(s), s') + γ V_i(s') ]
5.   if V converged, then break
Notice: the update uses the action π(s) prescribed by the policy instead of a max over actions.
OR: can solve for the value function as the sol'n to a system of linear equations
– can't do this for value iteration because of the maxes
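Sketches of both options for the illustrative dictionary MDP: the iterative update with π(s) in place of the max, and the direct linear-system solve (reward again assumed to depend on the successor state):

```python
import numpy as np

def policy_evaluation(states, T, R, pi, gamma=0.9, tol=1e-6):
    """Iterative evaluation: same sweep as value iteration but using pi(s)."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: sum(p * (R[s2] + gamma * V[s2])
                        for s2, p in T.get((s, pi[s]), {}).items())
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def policy_evaluation_linear(states, T, R, pi, gamma=0.9):
    """Direct evaluation: solve (I - gamma * T_pi) V = R_pi."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in states:
        for s2, p in T.get((s, pi[s]), {}).items():
            T_pi[idx[s], idx[s2]] += p
            R_pi[idx[s]] += p * R[s2]        # expected immediate reward under pi
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: V[idx[s]] for s in states}
```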

Policy iteration: example
(Two initial policies shown: Always Go Right and Always Go Forward.)

Modified policy iteration
Policy iteration often converges in few iterations, but each is expensive.
Idea: use a few steps of value iteration (but with π fixed), starting from the value function produced the last time, to produce an approximate value-determination step.
– Often converges much faster than pure VI or PI.
– Leads to much more general algorithms where Bellman value updates and Howard policy updates can be performed locally in any order.
– Reinforcement learning algorithms operate by performing such updates based on the observed transitions made in an initially unknown environment.

Online methods
Solving for a full policy offline is expensive! What can we do?

Online methods
Online methods compute the optimal action from the current state.
– Expand a tree up to some horizon.
– The set of states reachable from the current state is typically small compared to the full state space.
– Heuristics and branch-and-bound techniques allow the search space to be pruned.
– Monte Carlo methods provide approximate solutions.

Forward search
Provides the optimal action from the current state s up to depth d.
Recall: V(s) = max_{a ∈ A(s)} [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ]
Time complexity is O((|S| × |A|)^d).

Monte Carlo evaluation
Estimate the value of a policy by sampling from a simulator.

Sparse sampling
Requires a generative model: (s', r) ~ G(s, a)
Complexity? Guarantees?

Sparse sampling
Requires a generative model: (s', r) ~ G(s, a)
Complexity: O((n|A|)^d). Guarantees: probabilistic.

Monte Carlo tree search
UCT (Upper Confidence bounds applied to Trees)

UCT continued
Search (within the tree, T):
– Execute the action that maximizes the UCB score (see below); c is an exploration constant.
– Update the value Q(s, a) and the counts N(s) and N(s, a).
Expansion (outside of the tree, T):
– Create a new node for the state.
– Initialize Q(s, a) and N(s, a) (usually to 0) for each action.
Rollout (outside of the tree, T):
– Only expand once, then use a rollout policy to select actions (e.g., a random policy).
– Add the rewards gained during the rollout to those in the tree.
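The selection rule referred to above is usually the UCB1 score applied at each tree node (my rendering; c is the exploration constant and N(s), N(s, a) are the visit counts mentioned on the slide):

```latex
a \;=\; \arg\max_{a} \left[\, Q(s,a) + c \sqrt{\frac{\ln N(s)}{N(s,a)}} \,\right]
```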

UCT continued
Continue UCT until some termination condition (usually a fixed number of samples).
Complexity? Guarantees?

AlphaGo
Uses UCT with a neural net to approximate opponent choices and state values.

Branch and bound search
Requires a lower bound Ṳ(s) and an upper bound Ū(s).
Worst-case complexity?

