Markov Decision Processes (MDPs)


Machine Learning – CSE546
Carlos Guestrin, University of Washington
December 2, 2013

Markov Decision Process (MDP) Representation
- State space: joint state x of the entire system
- Action space: joint action a = {a1, ..., an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x' | x, a)

Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is:
(reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + ... (an infinite sum)

The Academic Example (discount factor γ = 0.9)
[Figure: a small Markov chain over academic-life states A, B, T, S, D, with transition probabilities such as 0.7 and 0.3.]
- V_A = expected discounted future rewards starting in state A
- V_B = expected discounted future rewards starting in state B
- V_T, V_S, V_D = expected discounted future rewards starting in states T, S, D
How do we compute V_A, V_B, V_T, V_S, V_D?
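As a small worked example (not from the slides), a minimal sketch showing that with a constant reward the discounted sum converges to reward/(1 − γ):

    def discounted_sum(rewards, gamma):
        """Sum of gamma**t * r_t over a finite reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # With a constant reward of 1 and gamma = 0.9, the truncated sum approaches
    # the closed-form limit 1 / (1 - 0.9) = 10 as the horizon grows.
    for horizon in (10, 100, 1000):
        print(horizon, discounted_sum([1.0] * horizon, 0.9))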

Policy
At state x, the policy specifies an action a for all agents: π(x) = a. Example:
- π(x0): both peasants get wood
- π(x1): one peasant builds a barrack, the other gets gold
- π(x2): peasants get gold, footmen attack

Value of Policy
Value Vπ(x): expected long-term reward starting from x and following π.
Vπ(x0) = Eπ[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
[Figure: starting from x0, a tree of possible trajectories x0 → x1, x1', x1'' → ... → x4, with future rewards discounted by γ at each step.]

Computing the Value of a Policy
Vπ(x0) = Eπ[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
- The discounted value of a state is the value of starting from x0 and continuing with policy π from then on.
- This gives a recursion: Vπ(x) = R(x, π(x)) + γ Σ_{x'} P(x' | x, π(x)) Vπ(x')

Simple Approach for Computing the Value of a Policy: Iteratively
- We can solve this with a simple convergent iterative approach (a.k.a. dynamic programming); a code sketch follows below.
- Start with some guess V_0.
- Iteratively set V_{t+1}(x) = R(x, π(x)) + γ Σ_{x'} P(x' | x, π(x)) V_t(x').
- Stop when ||V_{t+1} − V_t|| ≤ ε; this means ||Vπ − V_{t+1}|| ≤ ε/(1−γ).
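A minimal sketch of this iterative policy evaluation, assuming a hypothetical dictionary-based MDP representation (R maps (state, action) to a reward, P maps (state, action) to a dict of next-state probabilities); this is illustrative code, not code from the lecture:

    def evaluate_policy(states, policy, R, P, gamma, eps=1e-6):
        """Iterative policy evaluation (dynamic programming): repeatedly apply
        V(x) <- R(x, pi(x)) + gamma * sum_x' P(x'|x, pi(x)) * V(x')."""
        V = {x: 0.0 for x in states}          # start with the guess V0 = 0
        while True:
            V_new = {}
            for x in states:
                a = policy[x]
                V_new[x] = R[(x, a)] + gamma * sum(
                    p * V[x2] for x2, p in P[(x, a)].items())
            # stop when max_x |V_{t+1}(x) - V_t(x)| <= eps
            if max(abs(V_new[x] - V[x]) for x in states) <= eps:
                return V_new
            V = V_new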

But We Want to Learn a Policy
- So far, we have only seen how good a given policy is. But how can we choose the best policy?
- Suppose there were only one time step: the world is about to end, so select the action that maximizes the immediate reward!
(The slide repeats the earlier policy figure: π(x0) both peasants get wood; π(x1) one peasant builds a barrack, the other gets gold; π(x2) peasants get gold, footmen attack.)

Unrolling the Recursion
- Choose actions that lead to the best value in the long run.
- The optimal policy achieves the optimal value V*.

Bellman Equation
- Evaluating a policy π: Vπ(x) = R(x, π(x)) + γ Σ_{x'} P(x' | x, π(x)) Vπ(x')
- Computing the optimal value V*, the Bellman equation:
  V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]

Optimal Long-term Plan
From the optimal value function V*(x) we read off the optimal policy π*(x):
π*(x) = argmax_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]

Interesting Fact – Unique Value
V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]
- Slightly surprising fact: there is only one V* that solves the Bellman equation! (There may be many optimal policies that achieve V*.)
- Surprising fact: optimal policies are good everywhere!!!

Solving an MDP
Solve the Bellman equation → optimal value V*(x) → optimal policy π*(x)
V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V*(x') ]
The Bellman equation is non-linear!!! Many algorithms solve the Bellman equations:
- Policy iteration [Howard '60, Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60] (one standard LP form is sketched below)
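For reference, the linear-programming route mentioned above is not spelled out on the slide; a standard textbook formulation (an assumption here, not necessarily the exact form intended) whose optimum is V* is:

    \min_{V} \; \sum_x V(x)
    \qquad \text{s.t.} \qquad
    V(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x')
    \quad \text{for all } x, a.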

Value Iteration (a.k.a. Dynamic Programming) – the Simplest of All
- Start with some guess V_0.
- Iteratively set V_{t+1}(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V_t(x') ] (a code sketch follows below).
- Stop when ||V_{t+1} − V_t|| ≤ ε; this means ||V* − V_{t+1}|| ≤ ε/(1−γ).

A Simple Example (γ = 0.9)
You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).
[Figure: a four-state MDP with states Poor & Unknown (reward 0), Poor & Famous (reward 0), Rich & Unknown (reward 10), Rich & Famous (reward 10); edges labeled with the actions S and A and transition probabilities of 1/2 and 1.]
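A minimal value-iteration sketch with greedy policy extraction. The exact transition probabilities of the four-state example are not fully recoverable from the transcription, so the MDP at the bottom is a hypothetical two-state stand-in with the same interface, not the figure's numbers:

    def value_iteration(states, actions, R, P, gamma, eps=1e-6):
        """Value iteration: V_{t+1}(x) = max_a [R(x,a) + gamma * sum_x' P(x'|x,a) V_t(x')]."""
        V = {x: 0.0 for x in states}
        while True:
            V_new = {}
            for x in states:
                V_new[x] = max(R[(x, a)] + gamma * sum(p * V[x2] for x2, p in P[(x, a)].items())
                               for a in actions)
            converged = max(abs(V_new[x] - V[x]) for x in states) <= eps
            V = V_new
            if converged:
                break
        # read off a greedy policy, i.e. the argmax in the Bellman equation
        policy = {}
        for x in states:
            policy[x] = max(actions, key=lambda a: R[(x, a)] + gamma *
                            sum(p * V[x2] for x2, p in P[(x, a)].items()))
        return V, policy

    # Hypothetical 2-state MDP just to exercise the code; these numbers are
    # NOT the transition probabilities from the slide's 4-state figure.
    states, actions = ["poor", "rich"], ["save", "advertise"]
    R = {("poor", "save"): 0.0, ("poor", "advertise"): 0.0,
         ("rich", "save"): 10.0, ("rich", "advertise"): 10.0}
    P = {("poor", "save"): {"poor": 1.0},
         ("poor", "advertise"): {"poor": 0.5, "rich": 0.5},
         ("rich", "save"): {"rich": 0.5, "poor": 0.5},
         ("rich", "advertise"): {"poor": 1.0}}
    V_star, pi_star = value_iteration(states, actions, R, P, gamma=0.9)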

Let's Compute V_t(x) for Our Example (γ = 0.9)
V_{t+1}(x) = max_a [ R(x,a) + γ Σ_{x'} P(x' | x, a) V_t(x') ]
[Figure: the same four-state MDP as above, next to a table of V_t(PU), V_t(PF), V_t(RU), V_t(RF) for t = 1, ..., 6. The table is filled in during lecture; the transcription preserves only a few of the t = 6 entries (19, 34.78, 54.02).]

What You Need to Know
- What a Markov decision process is: states, actions, transitions, rewards; a policy; the value function for a policy
- Computing Vπ
- The optimal value function and optimal policy: the Bellman equation
- Solving the Bellman equation with value iteration, policy iteration, and linear programming

Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials

Announcements
- Poster session, 3-5pm, CSE Atrium: arrive 15 minutes early; everyone must attend; write your project number on the poster; prepare a 2-3 minute overview of what you did; at least 2 instructors will see your project.
- Final project report due Monday the 9th at 9am: see the website for details (maximum 8 pages); be clear about what you did; make it read like a paper.

Reinforcement Learning
Machine Learning – CSE546
Carlos Guestrin, University of Washington
December 3, 2013

The Reinforcement Learning Task
World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.

Formalizing the (Online) Reinforcement Learning Problem
- Given a set of states X and actions A; in some versions of the problem, the sizes of X and A are unknown.
- Interact with the world at each time step t: the world gives state x_t and reward r_t; you give the next action a_t.
- Goal: (quickly) learn a policy that (approximately) maximizes the long-term expected discounted reward.
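The interaction protocol above can be sketched as a simple loop. The `env` and `agent` objects and their `reset`/`step`/`act`/`observe` methods are hypothetical names for illustration, not an API from the lecture:

    def run_episode(env, agent, num_steps, gamma):
        """Online RL loop: at each step the world gives (state, reward),
        and the agent returns the next action."""
        total, discount = 0.0, 1.0
        x = env.reset()                      # initial state from the world
        for t in range(num_steps):
            a = agent.act(x)                 # "I'll take action a"
            x_next, r = env.step(a)          # world returns next state and reward
            agent.observe(x, a, r, x_next)   # agent can learn from this transition
            total += discount * r
            discount *= gamma
            x = x_next
        return total                         # discounted return of this run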

The "Credit Assignment" Problem
I'm in state 43, reward 0, action 2
... state 39, reward 0, action 4
... state 22, reward 0, action 1
... state 21, reward 0, action 1
... state 21, reward 0, action 1
... state 13, reward 0, action 2
... state 54, reward 0, action 2
... state 26, reward 100!
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there? This is the Credit Assignment problem.

Exploration-Exploitation Tradeoff
- You have visited part of the state space and found a reward of 100. Is this the best I can hope for?
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge, at the risk of missing out on some large reward somewhere?
- Exploration: should I look for a region with more reward, at the risk of wasting my time or collecting a lot of negative reward?

Two Main Reinforcement Learning Approaches
- Model-based approaches: explore the environment, then learn a model (P(x'|x,a) and R(x,a)) (almost) everywhere; use the model to plan a policy, MDP-style. This approach leads to the strongest theoretical results and works quite well in practice when the state space is manageable.
- Model-free approaches: don't learn a model; learn the value function or policy directly. This leads to weaker theoretical results, but often works well when the state space is large.

Rmax – A Model-based Approach

Given a Dataset – Learn a Model
Given data, learn the (MDP) representation (a code sketch follows below):
- Dataset: observed experience, i.e. transitions of the form (state, action, reward, next state).
- Learn the reward function R(x,a).
- Learn the transition model P(x'|x,a).

Planning with Insufficient Information
- Model-based approach: estimate R(x,a) and P(x'|x,a); obtain a policy by value or policy iteration, or by linear programming.
- No credit assignment problem! By learning a model, the planning algorithm takes care of "assigning" credit.
- What do you plug in when you don't have enough information about a state?
  - If you don't know the reward at a particular state: plug in 0? plug in the smallest reward (Rmin)? plug in the largest reward (Rmax)?
  - And what if you don't know a particular transition probability?
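A minimal sketch of the maximum-likelihood estimates, assuming the dataset is a list of (x, a, r, x') tuples (a hypothetical format, not specified on the slide):

    from collections import defaultdict

    def learn_model(dataset):
        """Estimate R(x,a) as the average observed reward and
        P(x'|x,a) as empirical transition frequencies (MLE)."""
        reward_sum = defaultdict(float)
        counts = defaultdict(int)
        next_counts = defaultdict(lambda: defaultdict(int))
        for x, a, r, x_next in dataset:
            reward_sum[(x, a)] += r
            counts[(x, a)] += 1
            next_counts[(x, a)][x_next] += 1
        R = {sa: reward_sum[sa] / counts[sa] for sa in counts}
        P = {sa: {x2: c / counts[sa] for x2, c in nxt.items()}
             for sa, nxt in next_counts.items()}
        return R, P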

Some Challenges in Model-based RL 2: Exploration-Exploitation Tradeoff
- A state may be very hard to reach: we waste a lot of time trying to learn the rewards and transitions for this state, and after much effort the state may turn out to be useless.
- A strong advantage of a model-based approach: you know for which states the estimates of rewards and transitions are bad, so you can (try to) plan to reach these states, and you have a good estimate of how long it takes to get there.

A Surprisingly Simple Approach for Model-based RL – The Rmax Algorithm [Brafman & Tennenholtz]
- Optimism in the face of uncertainty!!! This heuristic was shown to be useful long before the theory was done (e.g., Kaelbling '90).
- If you don't know the reward for a particular state-action pair, set it to Rmax!!!
- If you don't know the transition probabilities P(x'|x,a) from some state-action pair (x,a), assume you go to a magic, fairytale new state x0, with R(x0,a) = Rmax and P(x0|x0,a) = 1.

Understanding Rmax
- With Rmax you either:
  - explore: visit a state-action pair you don't know much about, because it seems to have lots of potential; or
  - exploit: spend all your time on known states; even if the unknown states were amazingly good, it's not worth it.
- Note: you never know if you are exploring or exploiting!!!

Implicit Exploration-Exploitation Lemma
- Lemma: every T time steps, the agent either
  - exploits: achieves near-optimal reward for these T steps, or
  - explores: with high probability, visits an unknown state-action pair and learns a little about an unknown state.
- T is related to the mixing time of the Markov chain defined by the MDP: the time it takes to (approximately) forget where you started.

The Rmax Algorithm
- Initialization:
  - Add state x0 to the MDP
  - R(x,a) = Rmax for all x, a
  - P(x0 | x,a) = 1 for all x, a
  - All states (except for x0) are unknown
- Repeat (a code sketch follows below):
  - Obtain a policy for the current MDP and execute the policy
  - For any visited state-action pair, set the reward function to the appropriate (observed) value
  - If some state-action pair (x,a) has been visited enough times to estimate P(x'|x,a):
    - update the transition probabilities P(x'|x,a) for (x,a) using the MLE
    - recompute the policy

Visit Enough Times to Estimate P(x'|x,a)?
- How many times are enough? Use the Chernoff bound!
- Chernoff bound: if X1, ..., Xn are i.i.d. Bernoulli trials with probability θ, then
  P( (1/n) Σ_i X_i − θ ≥ ε ) ≤ exp{−2nε²}
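A highly simplified sketch of one Rmax (re)planning step, assuming the learn_model and value_iteration helpers sketched earlier, that every state appearing in the data is listed in `states`, and that the visit threshold m has already been chosen (in practice from the Chernoff bound above); this is an illustration of the idea, not the paper's exact algorithm:

    from collections import defaultdict

    def rmax_update(transitions, states, actions, R_max, gamma, m):
        """One (re)planning step of Rmax on the experience gathered so far.
        Unknown (x,a) pairs optimistically keep reward R_max and lead to the
        fairytale state x0; pairs visited at least m times use their MLE estimates."""
        X0 = "x0"                                        # the added optimistic state
        all_states = list(states) + [X0]
        R = {(x, a): R_max for x in all_states for a in actions}
        P = {(x, a): {X0: 1.0} for x in all_states for a in actions}
        R_hat, P_hat = learn_model(transitions)          # MLE sketch from earlier
        counts = defaultdict(int)
        for x, a, _, _ in transitions:
            counts[(x, a)] += 1
        for (x, a), n in counts.items():
            R[(x, a)] = R_hat[(x, a)]                    # observed reward is known
            if n >= m:                                   # enough visits: trust the MLE
                P[(x, a)] = P_hat[(x, a)]
        # Plan in this optimistic model; acting greedily w.r.t. the returned policy
        # either exploits (near-optimal reward) or implicitly explores unknown pairs.
        return value_iteration(all_states, actions, R, P, gamma)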

Putting It All Together
- Theorem: with probability at least 1−δ, Rmax will reach an ε-optimal policy in time polynomial in: the number of states, the number of actions, T, 1/ε, and 1/δ.
- Every T steps the agent either achieves near-optimal reward (great!) or visits an unknown state-action pair; the number of states and actions is finite, so it can't take too long before all states are known.

What You Need to Know About RL
- Reinforcement learning is neither supervised nor unsupervised learning: we try to learn to act in the world as we travel through states and collect rewards.
- Model-based and model-free approaches.
- Rmax, a model-based approach: learn a model of rewards and transitions; address the exploration-exploitation tradeoff; a simple algorithm that is great in practice.

Closing

What You Have Learned This Quarter
- Learning is function approximation
- Point estimation
- Regression
- LASSO
- Subgradient
- Stochastic gradient descent
- Coordinate descent
- Discriminative v. Generative learning
- Naïve Bayes
- Logistic regression
- Bias-Variance tradeoff
- Decision trees
- Cross validation
- Boosting
- Instance-based learning
- Perceptron
- SVMs
- Kernel trick
- PAC learning
- Bayes nets: representation, parameter and structure learning
- K-means
- EM
- Mixtures of Gaussians
- Dimensionality reduction, PCA
- MDPs
- Reinforcement learning

BIG PICTURE
Improving the performance at some task through experience! Before you start any learning task, remember the fundamental questions:
- What is the learning problem?
- From what experience?
- What model?
- What loss function are you optimizing?
- With what optimization algorithm?
- Which learning algorithm?
- With what guarantees?
- How will you evaluate it?

You Have Done a Lot!!!
And (hopefully) learned a lot!!!
- Implemented LASSO, LR, Perceptron, Clustering
- Answered hard questions and proved many interesting results
- Completed (I am sure) an amazing ML project
- And did excellently on the final!
Now you are ready for one of the most sought-after careers in industry today!

Thank You for the Hard Work!!!
