Markov Decision Processes - University Of Washington


Markov Decision Processes. Mausam, CSE 515.

Markov Decision Processes: an Artificial Intelligence model of the sequential decision making of a rational agent.

A Statistician's View of MDPs
- Markov Chain: a sequential process; models state transitions; autonomous (no choice).
- One-step Decision Theory: a one-step process; models choice; maximizes utility.
- Markov Decision Process: a sequential process; models state transitions; models choice; maximizes utility. It combines the two: sequentiality from the Markov chain, choice from decision theory.

A Planning View. Environments vary along several dimensions: Static vs. Dynamic, Predictable vs. Unpredictable, Deterministic vs. Stochastic. The agent receives Percepts, asks "What action next?", and issues Actions.

Classical Planning: Static environment, Instantaneous actions, Perfect percepts. What action next?

Deterministic, fully observable

Stochastic Planning: Static environment, Stochastic actions, Instantaneous, Perfect percepts. What action next?

Stochastic, Fully Observable

Markov Decision Process (MDP):
- S: a set of states (possibly factored)
- A: a set of actions
- Pr(s'|s,a): transition model
- C(s,a,s'): cost model
- G: set of goals (absorbing or non-absorbing)
- s0: start state
- γ: discount factor
- R(s,a,s'): reward model
A Factored MDP represents these components compactly over state variables.
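As a rough sketch (not from the slides), one way to hold these components in a small tabular MDP is a plain Python dictionary; the layout and all values here are illustrative.

# A rough sketch of one tabular representation of the components above
# (dictionary layout and values are illustrative, not from the slides).
mdp = {
    "S": ["s0", "s1", "s2"],                     # states
    "A": ["a1", "a2"],                           # actions
    "Pr": {                                      # transition model Pr(s' | s, a)
        ("s0", "a1"): {"s1": 0.9, "s2": 0.1},
        ("s0", "a2"): {"s2": 1.0},
        ("s1", "a1"): {"s2": 1.0},
        ("s1", "a2"): {"s1": 1.0},
        ("s2", "a1"): {"s2": 1.0},
        ("s2", "a2"): {"s2": 1.0},
    },
    "R": {                                       # reward model R(s, a, s'); unlisted triples default to 0
        ("s0", "a1", "s1"): 5.0,
        ("s1", "a1", "s2"): 1.0,
    },
    "gamma": 0.9,                                # discount factor
    "s0": "s0",                                  # start state
    "G": {"s2"},                                 # (absorbing) goal states
}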

Objective of an MDP: find a policy π: S → A which optimizes one of: minimize the discounted expected cost to reach a goal; maximize the discounted expected reward; or maximize the undiscounted expected (reward - cost); given a horizon that is finite, infinite, or indefinite; assuming full observability.

Role of Discount Factor (γ): keeps the total reward / total cost finite; useful for infinite horizon problems. Intuition (economics): money today is worth more than money tomorrow. Total reward: r1 + γ r2 + γ² r3 + ...; total cost: c1 + γ c2 + γ² c3 + ...

Examples of MDPs
- Goal-directed, Indefinite Horizon, Cost Minimization MDP: ⟨S, A, Pr, C, G, s0⟩. Most often studied in the planning and graph theory communities.
- Infinite Horizon, Discounted Reward Maximization MDP: ⟨S, A, Pr, R, γ⟩. The most popular; most often studied in the machine learning, economics, and operations research communities.
- Goal-directed, Finite Horizon, Probability Maximization MDP: ⟨S, A, Pr, G, s0, T⟩. Also studied in the planning community.
- Oversubscription Planning (non-absorbing goals, reward maximization): ⟨S, A, Pr, G, R, s0⟩. A relatively recent model.

Bellman Equations for MDP1, ⟨S, A, Pr, C, G, s0⟩. Define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from this state. J* should satisfy the following equation:
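In its standard stochastic shortest path form:

J^*(s) = 0 \quad \text{if } s \in G
J^*(s) = \min_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a) \, [\, C(s,a,s') + J^*(s') \,] \quad \text{otherwise}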

Bellman Equations for MDP2, ⟨S, A, Pr, R, γ⟩. Define V*(s) (the optimal value) as the maximum expected discounted reward from this state. V* should satisfy the following equation:
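In its standard form:

V^*(s) = \max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a) \, [\, R(s,a,s') + \gamma V^*(s') \,]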

Bellman Equations for MDP3, ⟨S, A, Pr, G, s0, T⟩. Define P*(s,t) (the optimal probability) as the maximum probability of reaching a goal from this state, starting at the t-th timestep. P* should satisfy the following equation:
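One standard way to write it (the boundary convention is assumed, since the slide's equation is not shown):

P^*(s,t) = 1 \quad \text{if } s \in G
P^*(s,t) = 0 \quad \text{if } t = T \text{ and } s \notin G
P^*(s,t) = \max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a) \, P^*(s', t+1) \quad \text{otherwise}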

Bellman Backup (MDP2). Given an estimate of the V* function (say Vn), back up the Vn function at state s to calculate a new estimate Vn+1:
V_{n+1}(s) = \max_{a \in Ap(s)} Q_{n+1}(s,a)
Q_{n+1}(s,a) = \sum_{s'} \Pr(s' \mid s,a) \, [\, R(s,a,s') + \gamma V_n(s') \,]
Q_{n+1}(s,a) is the value of the strategy: execute action a in s, then execute π_n subsequently, where π_n(s) = \arg\max_{a \in Ap(s)} Q_n(s,a) is the greedy policy.

Bellman Backup (worked example, with γ = 1): from state s0 there are three actions a1, a2, a3, and the successor states s1, s2, s3 have current values V0 = 0, 1, 2 respectively. The backup gives Q1(s0,a1) = 2 + 0 = 2, Q1(s0,a2) = 5 + 0.9·1 + 0.1·2 = 6.1, and Q1(s0,a3) = 4.5 + 2 = 6.5, so V1(s0) = 6.5 and a_greedy = a3.

Value Iteration [Bellman '57]: assign an arbitrary assignment V0 to each state; repeat: for all states s, compute Vn+1(s) by a Bellman backup at s; until max_s |Vn+1(s) - Vn(s)| < ε (ε-convergence), where Residual(s) = |Vn+1(s) - Vn(s)| is the change at s in iteration n+1.
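A minimal sketch of synchronous value iteration for the reward-maximization MDP, assuming the illustrative dictionary representation sketched earlier (function and variable names are ours, not the slides'):

def value_iteration(mdp, epsilon=1e-4):
    # Synchronous value iteration: repeat Bellman backups at every state
    # until the largest residual max_s |V_{n+1}(s) - V_n(s)| drops below epsilon.
    V = {s: 0.0 for s in mdp["S"]}          # arbitrary initial estimate V0
    gamma, R, Pr = mdp["gamma"], mdp["R"], mdp["Pr"]
    while True:
        V_next = {}
        for s in mdp["S"]:
            # Bellman backup at s
            V_next[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in Pr[(s, a)].items())
                for a in mdp["A"])
        residual = max(abs(V_next[s] - V[s]) for s in mdp["S"])
        V = V_next
        if residual < epsilon:              # epsilon-convergence
            return V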

Comments: a decision-theoretic algorithm; dynamic programming; a fixed point computation; a probabilistic version of the Bellman-Ford algorithm for shortest path computation (MDP1 is the Stochastic Shortest Path problem). Time complexity: one iteration is O(|S|² |A|); the number of iterations is poly(|S|, |A|, 1/(1-γ)). Space complexity: O(|S|). Factored MDPs: exponential space, exponential time.

Convergence Properties: Vn → V* in the limit as n → ∞. ε-convergence: the Vn function is within ε of V*. Optimality: the current greedy policy is within 2εγ/(1-γ) of optimal. Monotonicity: if V0 ≤p V*, then Vn ≤p V* (Vn is monotonic from below); if V0 ≥p V*, then Vn ≥p V* (Vn is monotonic from above); otherwise Vn is non-monotonic.

Policy Computation. The optimal policy is stationary and time-independent for infinite/indefinite horizon problems:
\pi^*(s) = \arg\max_{a \in A} \sum_{s'} \Pr(s' \mid s,a) \, [\, R(s,a,s') + \gamma V^*(s') \,]
Policy Evaluation: the value of a fixed policy π satisfies
V^\pi(s) = \sum_{s'} \Pr(s' \mid s, \pi(s)) \, [\, R(s,\pi(s),s') + \gamma V^\pi(s') \,]
a system of linear equations in |S| variables.
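Policy evaluation can therefore be done exactly by solving that linear system; a sketch with NumPy, again assuming the illustrative tabular representation from above:

import numpy as np

def policy_evaluation(mdp, policy):
    # Exact evaluation of a fixed policy: solve (I - gamma * P_pi) V = r_pi,
    # a linear system with one equation per state.
    S, gamma = mdp["S"], mdp["gamma"]
    idx = {s: i for i, s in enumerate(S)}
    P = np.zeros((len(S), len(S)))          # P[s, s'] = Pr(s' | s, policy(s))
    r = np.zeros(len(S))                    # expected immediate reward under the policy
    for s in S:
        a = policy[s]
        for s2, p in mdp["Pr"][(s, a)].items():
            P[idx[s], idx[s2]] = p
            r[idx[s]] += p * mdp["R"].get((s, a, s2), 0.0)
    V = np.linalg.solve(np.eye(len(S)) - gamma * P, r)
    return {s: V[idx[s]] for s in S}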

Changing the Search Space Value Iteration Search in value space Compute the resulting policy Policy Iteration Search in policy space Compute the resulting value

Policy Iteration [Howard '60]: assign an arbitrary assignment π0 to each state; repeat: Policy Evaluation: compute Vn+1, the evaluation of πn (costly: O(n³)); Policy Improvement: for all states s, compute πn+1(s) = argmax_{a ∈ Ap(s)} Qn+1(s,a); until πn+1 = πn. Advantage: searching in a finite (policy) space as opposed to an uncountably infinite (value) space, so convergence is faster; all other properties follow. (Modified Policy Iteration approximates the evaluation step by value iteration under the fixed policy.)
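A sketch of policy iteration built on the policy_evaluation routine sketched above (exact evaluation plus greedy improvement; names are illustrative):

def policy_iteration(mdp):
    # Alternate exact policy evaluation (the costly step) and greedy policy
    # improvement until the policy stops changing.
    policy = {s: mdp["A"][0] for s in mdp["S"]}   # arbitrary initial policy pi_0
    while True:
        V = policy_evaluation(mdp, policy)        # evaluate pi_n
        improved = {}
        for s in mdp["S"]:
            # pi_{n+1}(s) = argmax_a Q_{n+1}(s, a) under the current value estimate
            improved[s] = max(
                mdp["A"],
                key=lambda a: sum(p * (mdp["R"].get((s, a, s2), 0.0)
                                       + mdp["gamma"] * V[s2])
                                  for s2, p in mdp["Pr"][(s, a)].items()))
        if improved == policy:                    # pi_{n+1} == pi_n: converged
            return policy, V
        policy = improved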

Modified Policy Iteration: assign an arbitrary assignment π0 to each state; repeat: Policy Evaluation: compute Vn+1, the approximate evaluation of πn (a few value-iteration-style backups under the fixed policy); Policy Improvement: for all states s, compute πn+1(s) = argmax_{a ∈ Ap(s)} Qn+1(s,a); until πn+1 = πn. Advantage: probably the most competitive synchronous dynamic programming algorithm.

Asynchronous Value Iteration: states may be backed up in any order, instead of iteration by iteration. As long as all states are backed up infinitely often, asynchronous value iteration converges to the optimal value function.

Asynch VI: Prioritized Sweeping. Why back up a state if the values of its successors are unchanged? Prefer backing up a state whose successors had the most change. Maintain a priority queue of (state, expected change in value); back up states in order of priority; after backing up a state, update the priority queue for all of its predecessors.
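A sketch of the priority-queue bookkeeping, assuming a precomputed predecessor map and the same illustrative representation and backup as above:

import heapq

def prioritized_sweeping(mdp, predecessors, n_backups=1000, theta=1e-5):
    # Back up states in order of how much their value is expected to change.
    # `predecessors` maps each state to the states that can transition into it.
    V = {s: 0.0 for s in mdp["S"]}

    def backup(s):                                # one Bellman backup at s
        return max(
            sum(p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
                for s2, p in mdp["Pr"][(s, a)].items())
            for a in mdp["A"])

    # Seed the queue with every state's current residual; heapq is a min-heap,
    # so priorities are stored negated.
    queue = [(-abs(backup(s) - V[s]), s) for s in mdp["S"]]
    heapq.heapify(queue)
    for _ in range(n_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        V[s] = backup(s)
        # The value of s changed, so its predecessors may need re-backing up.
        for pred in predecessors.get(s, ()):
            change = abs(backup(pred) - V[pred])
            if change > theta:
                heapq.heappush(queue, (-change, pred))
    return V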

Asynch VI: Real Time Dynamic Programming [Barto, Bradtke, Singh '95]. Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on the visited states. RTDP: repeat trials until the value function converges.
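A sketch of a single RTDP trial for the goal-directed, cost-minimization setting, assuming a cost model C(s,a,s') stored like R in the earlier sketch (the default cost of 1.0 for unlisted triples is an illustrative choice):

import random

def rtdp_trial(mdp, J, max_steps=100):
    # One RTDP trial: follow the greedy (minimum expected cost) action from the
    # start state, performing a Bellman backup on every visited state, until a
    # goal state is reached or the step limit runs out.
    s = mdp["s0"]
    for _ in range(max_steps):
        if s in mdp["G"]:
            break
        def q(a):                                 # Q(s,a) = sum_s' Pr(s'|s,a) [C + J(s')]
            return sum(p * (mdp["C"].get((s, a, s2), 1.0) + J[s2])
                       for s2, p in mdp["Pr"][(s, a)].items())
        a_greedy = min(mdp["A"], key=q)
        J[s] = q(a_greedy)                        # Bellman backup on the visited state
        successors, probs = zip(*mdp["Pr"][(s, a_greedy)].items())
        s = random.choices(successors, weights=probs)[0]   # simulate the transition
    return J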

RTDP Trial (figure): from s0, compute Qn+1(s0,a) for each applicable action; take a_greedy = argmin_a Qn+1(s0,a) (here a2), set Vn+1(s0) to that minimum, sample a successor under a_greedy, and continue toward the goal.

Comments. Properties: if all states are visited infinitely often, then Vn → V*. Advantages: anytime behavior; more probable states are explored quickly. Disadvantages: complete convergence can be slow!

Reinforcement Learning

Reinforcement Learning: we still have an MDP, and we are still looking for a policy π. New twist: we don't know Pr and/or R, i.e., we don't know which states are good and what the actions do. We must actually try out actions to learn.

Model based methods: visit different states, perform different actions, and estimate Pr and R. Once the model is built, do planning using value iteration or other methods. Con: requires huge amounts of data.
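A sketch of the estimation step for a model-based learner, counting transitions and averaging rewards from experience tuples (s, a, r, s'); names are illustrative:

from collections import defaultdict

def estimate_model(experience):
    # Estimate Pr(s'|s,a) and R(s,a,s') from observed (s, a, r, s') tuples by
    # counting transitions and averaging rewards.
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s,a)][s'] = N(s,a,s')
    reward_sum = defaultdict(float)                  # running reward sum per (s,a,s')
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a, s2)] += r
    Pr, R = {}, {}
    for (s, a), succ in counts.items():
        total = sum(succ.values())
        Pr[(s, a)] = {s2: n / total for s2, n in succ.items()}   # empirical probabilities
        for s2, n in succ.items():
            R[(s, a, s2)] = reward_sum[(s, a, s2)] / n           # empirical mean reward
    return Pr, R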

Model free methods: directly learn Q*(s,a) values. sample = R(s,a,s') + γ max_{a'} Qn(s',a'). Nudge the old estimate towards the new sample: Qn+1(s,a) ← (1-α) Qn(s,a) + α [sample].
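The tabular Q-learning update above, written out as a small function (illustrative names; α is the learning rate):

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # One model-free update: nudge Q(s,a) toward the sampled target
    # r + gamma * max_a' Q(s', a').
    sample = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q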

Properties: converges to optimal if you explore enough and you make the learning rate (α) small enough, but do not decrease it too quickly: Σ_i α(s,a,i) = ∞ and Σ_i α²(s,a,i) < ∞, where i is the number of visits to (s,a).

Model based vs. Model Free RL. Model based: estimates O(|S|² |A|) parameters; requires relatively larger amounts of data for learning; can make use of background knowledge easily. Model free: estimates O(|S| |A|) parameters; requires relatively less data for learning.

Exploration vs. Exploitation. Exploration: choose actions that visit new states in order to obtain more data for better learning. Exploitation: choose actions that maximize the reward given the currently learnt model. ε-greedy: at each time step flip a coin; with probability ε, take a random action; with probability 1-ε, take the current greedy action. Lower ε over time to increase exploitation as more learning has happened.
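An ε-greedy action selector matching the description above (illustrative names):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon explore (random action); otherwise exploit
    # (the current greedy action under Q). `actions` is a list of actions.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))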

Q-learning Problems: too many states to visit during learning; Q(s,a) is still a BIG table; we want to generalize from a small set of training examples. Techniques: value function approximators, policy approximators, hierarchical reinforcement learning.

Task Hierarchy: MAXQ Decomposition [Dietterich '00] (hierarchy diagram): a Root task decomposes into subtasks such as Fetch, Take, Move (e.g., Move-e, Move-w, Move-s), and Extend-arm; the children of a task are its subtasks.

Partially Observable Markov Decision Processes

Partially Observable Planning: Static environment, Stochastic actions, Partially Observable, Instantaneous, Noisy percepts. What action next?

Stochastic, Fully Observable

Stochastic, Partially Observable

POMDPs. In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the current state. POMDPs compute a value function over belief space:
V(b) = \max_{a} \, [\, r(b,a) + \gamma \sum_{b'} \Pr(b' \mid b,a) \, V(b') \,]
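Although the slides do not show it, the belief state mentioned above is maintained by a Bayes filter; a sketch of that update, assuming a transition model Pr(s'|s,a) stored as before and an observation model keyed by (state, observation) (all names illustrative):

def belief_update(b, a, o, states, trans, obs):
    # Bayes filter over states: b'(s') is proportional to
    # Pr(o | s') * sum_s Pr(s' | s, a) * b(s).
    b_new = {}
    for s2 in states:
        b_new[s2] = obs[(s2, o)] * sum(trans[(s, a)].get(s2, 0.0) * b[s] for s in states)
    total = sum(b_new.values())
    return {s2: v / total for s2, v in b_new.items()} if total > 0 else b_new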

POMDPs. Each belief is a probability distribution, so the value function is a function of an entire probability distribution. This is problematic, since probability distributions are continuous, and we also have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and observation spaces and finite horizons, however, we can represent the value functions by piecewise linear functions.

Applications
- Robotic control: helicopter maneuvering, autonomous vehicles
- Mars rover: path planning, oversubscription planning
- Elevator planning
- Game playing: backgammon, tetris, checkers
- Neuroscience
- Computational finance, sequential auctions
- Assisting the elderly in simple tasks
- Spoken dialog management
- Communication networks: switching, routing, flow control
- War planning, evacuation planning
