Markov Decision Processes - University Of Washington


Markov Decision Processes. Mausam, CSE 515.

Markov Decision Processes: an Artificial Intelligence model of the sequential decision making of a rational agent.

A Statistician's View of MDPs
- Markov Chain: a sequential process; models state transitions; autonomous (no choice).
- One-step Decision Theory: a one-step process; models choice; maximizes utility.
- Markov Decision Process: a sequential process; models state transitions; models choice; maximizes utility. It combines the two: sequentiality from the Markov chain, choice from decision theory.

A Planning View. Environments vary along several dimensions: Static vs. Dynamic, Predictable vs. Unpredictable, Deterministic vs. Stochastic. The agent receives Percepts, asks "What action next?", and issues Actions.

Classical Planning: Static environment, Instantaneous actions, Perfect percepts. What action next?

Deterministic, fully observable

Stochastic Planning: Static environment, Stochastic actions, Instantaneous, Perfect percepts. What action next?

Stochastic, Fully Observable

Markov Decision Process (MDP):
- S: a set of states (possibly factored)
- A: a set of actions
- Pr(s'|s,a): transition model
- C(s,a,s'): cost model
- G: set of goals (absorbing or non-absorbing)
- s0: start state
- γ: discount factor
- R(s,a,s'): reward model
A Factored MDP represents these components compactly over state variables.
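As a rough sketch (not from the slides), one way to hold these components in a small tabular MDP is a plain Python dictionary; the layout and all values here are illustrative.

# A rough sketch of one tabular representation of the components above
# (dictionary layout and values are illustrative, not from the slides).
mdp = {
    "S": ["s0", "s1", "s2"],                     # states
    "A": ["a1", "a2"],                           # actions
    "Pr": {                                      # transition model Pr(s' | s, a)
        ("s0", "a1"): {"s1": 0.9, "s2": 0.1},
        ("s0", "a2"): {"s2": 1.0},
        ("s1", "a1"): {"s2": 1.0},
        ("s1", "a2"): {"s1": 1.0},
        ("s2", "a1"): {"s2": 1.0},
        ("s2", "a2"): {"s2": 1.0},
    },
    "R": {                                       # reward model R(s, a, s'); unlisted triples default to 0
        ("s0", "a1", "s1"): 5.0,
        ("s1", "a1", "s2"): 1.0,
    },
    "gamma": 0.9,                                # discount factor
    "s0": "s0",                                  # start state
    "G": {"s2"},                                 # (absorbing) goal states
}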

Objective of an MDP: find a policy π: S → A which optimizes one of: minimize the discounted expected cost to reach a goal; maximize the discounted expected reward; or maximize the undiscounted expected (reward - cost); given a horizon that is finite, infinite, or indefinite; assuming full observability.

Role of Discount Factor (γ): keeps the total reward / total cost finite; useful for infinite horizon problems. Intuition (economics): money today is worth more than money tomorrow. Total reward: r1 + γ r2 + γ² r3 + ...; total cost: c1 + γ c2 + γ² c3 + ...

Examples of MDPs
- Goal-directed, Indefinite Horizon, Cost Minimization MDP: ⟨S, A, Pr, C, G, s0⟩. Most often studied in the planning and graph theory communities.
- Infinite Horizon, Discounted Reward Maximization MDP: ⟨S, A, Pr, R, γ⟩. The most popular; most often studied in the machine learning, economics, and operations research communities.
- Goal-directed, Finite Horizon, Probability Maximization MDP: ⟨S, A, Pr, G, s0, T⟩. Also studied in the planning community.
- Oversubscription Planning (non-absorbing goals, reward maximization): ⟨S, A, Pr, G, R, s0⟩. A relatively recent model.

Bellman Equations for MDP1, ⟨S, A, Pr, C, G, s0⟩. Define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from this state. J* should satisfy the following equation:
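In its standard stochastic shortest path form:

J^*(s) = 0 \quad \text{if } s \in G
J^*(s) = \min_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a) \, [\, C(s,a,s') + J^*(s') \,] \quad \text{otherwise}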

Bellman Equations for MDP2, ⟨S, A, Pr, R, γ⟩. Define V*(s) (the optimal value) as the maximum expected discounted reward from this state. V* should satisfy the following equation:
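In its standard form:

V^*(s) = \max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a) \, [\, R(s,a,s') + \gamma V^*(s') \,]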

Bellman Equations for MDP3, ⟨S, A, Pr, G, s0, T⟩. Define P*(s,t) (the optimal probability) as the maximum probability of reaching a goal from this state, starting at the t-th timestep. P* should satisfy the following equation:
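One standard way to write it (the boundary convention is assumed, since the slide's equation is not shown):

P^*(s,t) = 1 \quad \text{if } s \in G
P^*(s,t) = 0 \quad \text{if } t = T \text{ and } s \notin G
P^*(s,t) = \max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a) \, P^*(s', t+1) \quad \text{otherwise}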

Bellman Backup (MDP2). Given an estimate of the V* function (say Vn), back up the Vn function at state s to calculate a new estimate Vn+1:
V_{n+1}(s) = \max_{a \in Ap(s)} Q_{n+1}(s,a)
Q_{n+1}(s,a) = \sum_{s'} \Pr(s' \mid s,a) \, [\, R(s,a,s') + \gamma V_n(s') \,]
Q_{n+1}(s,a) is the value of the strategy: execute action a in s, then execute π_n subsequently, where π_n(s) = \arg\max_{a \in Ap(s)} Q_n(s,a) is the greedy policy.

Bellman Backup (worked example, with γ = 1): from state s0 there are three actions a1, a2, a3, and the successor states s1, s2, s3 have current values V0 = 0, 1, 2 respectively. The backup gives Q1(s0,a1) = 2 + 0 = 2, Q1(s0,a2) = 5 + 0.9·1 + 0.1·2 = 6.1, and Q1(s0,a3) = 4.5 + 2 = 6.5, so V1(s0) = 6.5 and a_greedy = a3.

Value Iteration [Bellman '57]: assign an arbitrary assignment V0 to each state; repeat: for all states s, compute Vn+1(s) by a Bellman backup at s; until max_s |Vn+1(s) - Vn(s)| < ε (ε-convergence), where Residual(s) = |Vn+1(s) - Vn(s)| is the change at s in iteration n+1.
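A minimal sketch of synchronous value iteration for the reward-maximization MDP, assuming the illustrative dictionary representation sketched earlier (function and variable names are ours, not the slides'):

def value_iteration(mdp, epsilon=1e-4):
    # Synchronous value iteration: repeat Bellman backups at every state
    # until the largest residual max_s |V_{n+1}(s) - V_n(s)| drops below epsilon.
    V = {s: 0.0 for s in mdp["S"]}          # arbitrary initial estimate V0
    gamma, R, Pr = mdp["gamma"], mdp["R"], mdp["Pr"]
    while True:
        V_next = {}
        for s in mdp["S"]:
            # Bellman backup at s
            V_next[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in Pr[(s, a)].items())
                for a in mdp["A"])
        residual = max(abs(V_next[s] - V[s]) for s in mdp["S"])
        V = V_next
        if residual < epsilon:              # epsilon-convergence
            return V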

Comments: a decision-theoretic algorithm; dynamic programming; a fixed point computation; a probabilistic version of the Bellman-Ford algorithm for shortest path computation (MDP1 is the Stochastic Shortest Path problem). Time complexity: one iteration is O(|S|² |A|); the number of iterations is poly(|S|, |A|, 1/(1-γ)). Space complexity: O(|S|). Factored MDPs: exponential space, exponential time.

Convergence Properties: Vn → V* in the limit as n → ∞. ε-convergence: the Vn function is within ε of V*. Optimality: the current greedy policy is within 2εγ/(1-γ) of optimal. Monotonicity: if V0 ≤p V*, then Vn ≤p V* (Vn is monotonic from below); if V0 ≥p V*, then Vn ≥p V* (Vn is monotonic from above); otherwise Vn is non-monotonic.

Policy Computation. The optimal policy is stationary and time-independent for infinite/indefinite horizon problems:
\pi^*(s) = \arg\max_{a \in A} \sum_{s'} \Pr(s' \mid s,a) \, [\, R(s,a,s') + \gamma V^*(s') \,]
Policy Evaluation: the value of a fixed policy π satisfies
V^\pi(s) = \sum_{s'} \Pr(s' \mid s, \pi(s)) \, [\, R(s,\pi(s),s') + \gamma V^\pi(s') \,]
a system of linear equations in |S| variables.
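Policy evaluation can therefore be done exactly by solving that linear system; a sketch with NumPy, again assuming the illustrative tabular representation from above:

import numpy as np

def policy_evaluation(mdp, policy):
    # Exact evaluation of a fixed policy: solve (I - gamma * P_pi) V = r_pi,
    # a linear system with one equation per state.
    S, gamma = mdp["S"], mdp["gamma"]
    idx = {s: i for i, s in enumerate(S)}
    P = np.zeros((len(S), len(S)))          # P[s, s'] = Pr(s' | s, policy(s))
    r = np.zeros(len(S))                    # expected immediate reward under the policy
    for s in S:
        a = policy[s]
        for s2, p in mdp["Pr"][(s, a)].items():
            P[idx[s], idx[s2]] = p
            r[idx[s]] += p * mdp["R"].get((s, a, s2), 0.0)
    V = np.linalg.solve(np.eye(len(S)) - gamma * P, r)
    return {s: V[idx[s]] for s in S}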

Changing the Search Space Value Iteration Search in value space Compute the resulting policy Policy Iteration Search in policy space Compute the resulting value

Policy Iteration [Howard '60]: assign an arbitrary assignment π0 to each state; repeat: Policy Evaluation: compute Vn+1, the evaluation of πn (costly: O(n³)); Policy Improvement: for all states s, compute πn+1(s) = argmax_{a ∈ Ap(s)} Qn+1(s,a); until πn+1 = πn. Advantage: searching in a finite (policy) space as opposed to an uncountably infinite (value) space, so convergence is faster; all other properties follow. (Modified Policy Iteration approximates the evaluation step by value iteration under the fixed policy.)
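A sketch of policy iteration built on the policy_evaluation routine sketched above (exact evaluation plus greedy improvement; names are illustrative):

def policy_iteration(mdp):
    # Alternate exact policy evaluation (the costly step) and greedy policy
    # improvement until the policy stops changing.
    policy = {s: mdp["A"][0] for s in mdp["S"]}   # arbitrary initial policy pi_0
    while True:
        V = policy_evaluation(mdp, policy)        # evaluate pi_n
        improved = {}
        for s in mdp["S"]:
            # pi_{n+1}(s) = argmax_a Q_{n+1}(s, a) under the current value estimate
            improved[s] = max(
                mdp["A"],
                key=lambda a: sum(p * (mdp["R"].get((s, a, s2), 0.0)
                                       + mdp["gamma"] * V[s2])
                                  for s2, p in mdp["Pr"][(s, a)].items()))
        if improved == policy:                    # pi_{n+1} == pi_n: converged
            return policy, V
        policy = improved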

Modified Policy Iteration: assign an arbitrary assignment π0 to each state; repeat: Policy Evaluation: compute Vn+1, the approximate evaluation of πn (a few value-iteration-style backups under the fixed policy); Policy Improvement: for all states s, compute πn+1(s) = argmax_{a ∈ Ap(s)} Qn+1(s,a); until πn+1 = πn. Advantage: probably the most competitive synchronous dynamic programming algorithm.

Asynchronous Value Iteration: states may be backed up in any order, instead of iteration by iteration. As long as all states are backed up infinitely often, asynchronous value iteration converges to the optimal value function.

Asynch VI: Prioritized Sweeping. Why back up a state if the values of its successors are unchanged? Prefer backing up a state whose successors had the most change. Maintain a priority queue of (state, expected change in value); back up states in order of priority; after backing up a state, update the priority queue for all of its predecessors.
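A sketch of the priority-queue bookkeeping, assuming a precomputed predecessor map and the same illustrative representation and backup as above:

import heapq

def prioritized_sweeping(mdp, predecessors, n_backups=1000, theta=1e-5):
    # Back up states in order of how much their value is expected to change.
    # `predecessors` maps each state to the states that can transition into it.
    V = {s: 0.0 for s in mdp["S"]}

    def backup(s):                                # one Bellman backup at s
        return max(
            sum(p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
                for s2, p in mdp["Pr"][(s, a)].items())
            for a in mdp["A"])

    # Seed the queue with every state's current residual; heapq is a min-heap,
    # so priorities are stored negated.
    queue = [(-abs(backup(s) - V[s]), s) for s in mdp["S"]]
    heapq.heapify(queue)
    for _ in range(n_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        V[s] = backup(s)
        # The value of s changed, so its predecessors may need re-backing up.
        for pred in predecessors.get(s, ()):
            change = abs(backup(pred) - V[pred])
            if change > theta:
                heapq.heappush(queue, (-change, pred))
    return V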

Asynch VI: Real Time Dynamic Programming [Barto, Bradtke, Singh '95]. Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on the visited states. RTDP: repeat trials until the value function converges.
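A sketch of a single RTDP trial for the goal-directed, cost-minimization setting, assuming a cost model C(s,a,s') stored like R in the earlier sketch (the default cost of 1.0 for unlisted triples is an illustrative choice):

import random

def rtdp_trial(mdp, J, max_steps=100):
    # One RTDP trial: follow the greedy (minimum expected cost) action from the
    # start state, performing a Bellman backup on every visited state, until a
    # goal state is reached or the step limit runs out.
    s = mdp["s0"]
    for _ in range(max_steps):
        if s in mdp["G"]:
            break
        def q(a):                                 # Q(s,a) = sum_s' Pr(s'|s,a) [C + J(s')]
            return sum(p * (mdp["C"].get((s, a, s2), 1.0) + J[s2])
                       for s2, p in mdp["Pr"][(s, a)].items())
        a_greedy = min(mdp["A"], key=q)
        J[s] = q(a_greedy)                        # Bellman backup on the visited state
        successors, probs = zip(*mdp["Pr"][(s, a_greedy)].items())
        s = random.choices(successors, weights=probs)[0]   # simulate the transition
    return J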

RTDP Trial (figure): from s0, compute Qn+1(s0,a) for each applicable action; take a_greedy = argmin_a Qn+1(s0,a) (here a2), set Vn+1(s0) to that minimum, sample a successor under a_greedy, and continue toward the goal.

Comments. Properties: if all states are visited infinitely often, then Vn → V*. Advantages: anytime behavior; more probable states are explored quickly. Disadvantages: complete convergence can be slow!

Reinforcement Learning

Reinforcement Learning: we still have an MDP, and we are still looking for a policy π. New twist: we don't know Pr and/or R, i.e., we don't know which states are good and what the actions do. We must actually try out actions to learn.

Model based methods: visit different states, perform different actions, and estimate Pr and R. Once the model is built, do planning using value iteration or other methods. Con: requires huge amounts of data.
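A sketch of the estimation step for a model-based learner, counting transitions and averaging rewards from experience tuples (s, a, r, s'); names are illustrative:

from collections import defaultdict

def estimate_model(experience):
    # Estimate Pr(s'|s,a) and R(s,a,s') from observed (s, a, r, s') tuples by
    # counting transitions and averaging rewards.
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s,a)][s'] = N(s,a,s')
    reward_sum = defaultdict(float)                  # running reward sum per (s,a,s')
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a, s2)] += r
    Pr, R = {}, {}
    for (s, a), succ in counts.items():
        total = sum(succ.values())
        Pr[(s, a)] = {s2: n / total for s2, n in succ.items()}   # empirical probabilities
        for s2, n in succ.items():
            R[(s, a, s2)] = reward_sum[(s, a, s2)] / n           # empirical mean reward
    return Pr, R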

Model free methods: directly learn Q*(s,a) values. sample = R(s,a,s') + γ max_{a'} Qn(s',a'). Nudge the old estimate towards the new sample: Qn+1(s,a) ← (1-α) Qn(s,a) + α [sample].
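The tabular Q-learning update above, written out as a small function (illustrative names; α is the learning rate):

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # One model-free update: nudge Q(s,a) toward the sampled target
    # r + gamma * max_a' Q(s', a').
    sample = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q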

Properties: converges to optimal if you explore enough and you make the learning rate (α) small enough, but do not decrease it too quickly: Σ_i α(s,a,i) = ∞ and Σ_i α²(s,a,i) < ∞, where i is the number of visits to (s,a).

Model based vs. Model Free RL. Model based: estimates O(|S|² |A|) parameters; requires relatively larger amounts of data for learning; can make use of background knowledge easily. Model free: estimates O(|S| |A|) parameters; requires relatively less data for learning.

Exploration vs. Exploitation. Exploration: choose actions that visit new states in order to obtain more data for better learning. Exploitation: choose actions that maximize the reward given the currently learnt model. ε-greedy: at each time step flip a coin; with probability ε, take a random action; with probability 1-ε, take the current greedy action. Lower ε over time to increase exploitation as more learning has happened.
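An ε-greedy action selector matching the description above (illustrative names):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon explore (random action); otherwise exploit
    # (the current greedy action under Q). `actions` is a list of actions.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))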

Q-learning Problems: too many states to visit during learning; Q(s,a) is still a BIG table; we want to generalize from a small set of training examples. Techniques: value function approximators, policy approximators, hierarchical reinforcement learning.

Task Hierarchy: MAXQ Decomposition [Dietterich '00] (hierarchy diagram): a Root task decomposes into subtasks such as Fetch, Take, Move (e.g., Move-e, Move-w, Move-s), and Extend-arm; the children of a task are its subtasks.

Partially Observable Markov Decision Processes

Partially Observable Planning: Static environment, Stochastic actions, Partially Observable, Instantaneous, Noisy percepts. What action next?

Stochastic, Fully Observable

Stochastic, Partially Observable

POMDPs. In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the current state. POMDPs compute a value function over belief space:
V(b) = \max_{a} \, [\, r(b,a) + \gamma \sum_{b'} \Pr(b' \mid b,a) \, V(b') \,]
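Although the slides do not show it, the belief state mentioned above is maintained by a Bayes filter; a sketch of that update, assuming a transition model Pr(s'|s,a) stored as before and an observation model keyed by (state, observation) (all names illustrative):

def belief_update(b, a, o, states, trans, obs):
    # Bayes filter over states: b'(s') is proportional to
    # Pr(o | s') * sum_s Pr(s' | s, a) * b(s).
    b_new = {}
    for s2 in states:
        b_new[s2] = obs[(s2, o)] * sum(trans[(s, a)].get(s2, 0.0) * b[s] for s in states)
    total = sum(b_new.values())
    return {s2: v / total for s2, v in b_new.items()} if total > 0 else b_new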

POMDPs. Each belief is a probability distribution, so the value function is a function of an entire probability distribution. This is problematic, since probability distributions are continuous, and we also have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and observation spaces and finite horizons, however, we can represent the value functions by piecewise linear functions.

Applications
- Robotic control: helicopter maneuvering, autonomous vehicles
- Mars rover: path planning, oversubscription planning
- Elevator planning
- Game playing: backgammon, tetris, checkers
- Neuroscience
- Computational finance, sequential auctions
- Assisting the elderly in simple tasks
- Spoken dialog management
- Communication networks: switching, routing, flow control
- War planning, evacuation planning
