10703 Deep Reinforcement Learning - Carnegie Mellon University


10703 Deep Reinforcement Learning!
Solving known MDPs
Tom Mitchell, September 10, 2018
Many slides borrowed from Katerina Fragkiadaki and Russ Salakhutdinov

Markov Decision Process (MDP)!
- A Markov Decision Process is a tuple (S, A, P, R, γ):
  - S is a finite set of states
  - A is a finite set of actions
  - P is a state transition probability function, P(s' | s, a)
  - R is a reward function, r(s, a)
  - γ ∈ [0, 1] is a discount factor

Outline!
- Previous lecture: Policy evaluation
- This lecture: Policy iteration, Value iteration, Asynchronous DP

Policy Evaluation!
- Policy evaluation: for a given policy π, compute the state value function V^π(s).
- V^π is implicitly given by the Bellman equation, a system of |S| simultaneous equations:
  V^π(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s') ]   for all s ∈ S

Iterative Policy Evaluation!
(Synchronous) Iterative Policy Evaluation for a given policy π:
- Initialize V_0(s) to anything.
- Do until max_s |V_{k+1}(s) − V_k(s)| is below the desired threshold:
  - for every state s, update: V_{k+1}(s) ← Σ_a π(a|s) [ r(s, a) + γ Σ_{s'} P(s' | s, a) V_k(s') ]
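
As a concrete illustration of this loop (not code from the lecture), here is a minimal NumPy sketch. It assumes the known MDP is given as arrays P[s, a, s'] for transition probabilities, r[s, a] for expected rewards, and pi[s, a] for a stochastic policy; these array names and the default tolerance are this sketch's own conventions.

```python
import numpy as np

def iterative_policy_evaluation(P, r, pi, gamma=1.0, tol=1e-6):
    """Synchronous iterative policy evaluation.

    P  : (S, A, S) array, P[s, a, s'] = transition probability
    r  : (S, A) array, expected immediate reward r(s, a)
    pi : (S, A) array, pi[s, a] = probability of action a in state s
    """
    S = P.shape[0]
    V = np.zeros(S)                      # initialize V_0 arbitrarily (zeros here)
    while True:
        q = r + gamma * P @ V            # q[s, a] = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = np.sum(pi * q, axis=1)   # V_{k+1}(s) = sum_a pi(a|s) q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```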

Iterative Policy Evaluation for the Small Gridworld!
- Policy π: choose an equiprobable random action.
- An undiscounted episodic task (γ = 1).
- Nonterminal states: 1, 2, ..., 14.
- Terminal states: two, shown in shaded squares.
- Actions that would take the agent off the grid leave the state unchanged.
- Reward is -1 until the terminal state is reached.
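
As a usage example, the small gridworld above can be encoded in the same array format and evaluated with the sketch from the previous slide. The state numbering and action encoding below are assumptions made for illustration (states 0 and 15 stand in for the shaded terminal squares, modeled as zero-reward self-loops).

```python
import numpy as np

def build_small_gridworld():
    """4x4 gridworld: states 0..15, where 0 and 15 are terminal.
    Actions: 0=up, 1=down, 2=left, 3=right. Reward -1 on every transition
    out of a nonterminal state; off-grid moves leave the state unchanged."""
    S, A = 16, 4
    P = np.zeros((S, A, S))
    r = np.full((S, A), -1.0)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    for s in range(S):
        if s in (0, 15):                          # terminal: self-loop, zero reward
            P[s, :, s] = 1.0
            r[s, :] = 0.0
            continue
        row, col = divmod(s, 4)
        for a, (dr, dc) in enumerate(moves):
            nr, nc = row + dr, col + dc
            if 0 <= nr < 4 and 0 <= nc < 4:
                P[s, a, nr * 4 + nc] = 1.0
            else:                                 # off-grid: state unchanged
                P[s, a, s] = 1.0
    return P, r

P, r = build_small_gridworld()
pi_random = np.full((16, 4), 0.25)                # equiprobable random policy
V = iterative_policy_evaluation(P, r, pi_random, gamma=1.0)
print(V.reshape(4, 4))   # values range from 0 (terminal) down to about -22 (Sutton & Barto, Fig. 4.1)
```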

Is Iterative Policy Evaluation Guaranteed to Converge?

Contraction Mapping Theorem!
- Definition: An operator F on a normed vector space X is a γ-contraction for γ < 1, provided for all x, y ∈ X:
  ||F(x) − F(y)|| ≤ γ ||x − y||
- Theorem (Contraction Mapping): For a γ-contraction F in a complete normed vector space X:
  - iterative application of F converges to a unique fixed point in X, independent of the starting point,
  - at a linear convergence rate determined by γ.

Value Function Space!
- Consider the vector space over value functions: there are |S| dimensions.
- Each point in this space fully specifies a value function V(s).
- The Bellman backup is a contraction operator that brings value functions closer in this space (we will prove this).
- And therefore the backup must converge to a unique solution.

Value Function ∞-Norm!
- We will measure the distance between state-value functions U and V by the ∞-norm, i.e. the largest difference between state values:
  ||U − V||_∞ = max_s |U(s) − V(s)|

Bellman Expectation Backup is a Contraction!
- Define the Bellman expectation backup operator F^π(V) = r^π + γ T^π V.
- This operator is a γ-contraction, i.e. it makes value functions closer by at least γ:
  ||F^π(U) − F^π(V)||_∞ ≤ γ ||U − V||_∞

Matrix Form!
- The Bellman expectation equation can be written concisely using the induced matrix form:
  v^π = r^π + γ T^π v^π
  with direct solution v^π = (I − γ T^π)^{-1} r^π of complexity O(S^3).
- Here T^π is an S x S matrix whose (j, k) entry gives P(s_k | s_j, a = π(s_j)),
  r^π is an S-dim vector whose jth entry gives E[r | s_j, a = π(s_j)],
  v^π is an S-dim vector whose jth entry gives V^π(s_j),
  where S is the number of distinct states.
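
A short sketch of the direct solution, assuming a deterministic policy given as an integer vector pi_det and the same P and r arrays as before; the helper name is hypothetical.

```python
import numpy as np

def policy_evaluation_direct(P, r, pi_det, gamma=0.9):
    """Solve v_pi = r_pi + gamma * T_pi v_pi exactly via a linear solve (O(S^3)).

    pi_det : (S,) integer array, pi_det[s] = action chosen in state s.
    """
    S = P.shape[0]
    T_pi = P[np.arange(S), pi_det]      # (S, S): row s is P(. | s, pi(s))
    r_pi = r[np.arange(S), pi_det]      # (S,):   r(s, pi(s))
    # v_pi = (I - gamma * T_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(S) - gamma * T_pi, r_pi)
```

Note that with γ = 1 and absorbing terminal states, I − γT^π can be singular, so the direct solve is usually stated for γ < 1; the iterative method above does not have this restriction for proper (episode-terminating) policies.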

Convergence of Iterative Policy Evaluation!
- The Bellman expectation operator F^π has a unique fixed point.
- V^π is a fixed point of F^π (by the Bellman expectation equation).
- By the contraction mapping theorem: iterative policy evaluation converges on V^π.

Given that we know how to evaluate a policy, how can we discover the optimal policy?

Policy Iteration!
- Alternate two steps: policy evaluation, and policy improvement ("greedification").

Policy Improvement!
- Suppose we have computed V^π for a deterministic policy π.
- For a given state s, would it be better to do an action a ≠ π(s)?
- It is better to switch to action a for state s if and only if q^π(s, a) > V^π(s).
- And we can compute q^π(s, a) from V^π by:
  q^π(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s')

Policy Improvement Cont.!
- Do this for all states to get a new policy π' that is greedy with respect to V^π:
  π'(s) = argmax_a q^π(s, a)
- What if the policy is unchanged by this? Then the policy must be optimal.
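
A minimal sketch of this greedification step under the same array conventions: compute q^π(s, a) from V^π in one shot, then take the argmax per state.

```python
import numpy as np

def greedy_improvement(P, r, V, gamma=1.0):
    """Greedify: pi'(s) = argmax_a q_pi(s, a), with q computed from V."""
    q = r + gamma * P @ V        # q[s, a] = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    return np.argmax(q, axis=1)  # deterministic improved policy, one action per state
```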

Policy Iteration!
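
Putting the two steps together, here is a sketch of the full policy iteration loop, reusing the hypothetical helpers iterative_policy_evaluation and greedy_improvement from the earlier sketches.

```python
import numpy as np

def policy_iteration(P, r, gamma=1.0):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    S, A = r.shape
    pi_det = np.zeros(S, dtype=int)                     # start from an arbitrary policy
    while True:
        pi_stoch = np.eye(A)[pi_det]                    # one-hot (S, A) form for the evaluator
        V = iterative_policy_evaluation(P, r, pi_stoch, gamma)
        pi_new = greedy_improvement(P, r, V, gamma)
        if np.array_equal(pi_new, pi_det):              # policy unchanged => optimal
            return pi_det, V
        pi_det = pi_new
```

With γ = 1 this relies on every evaluated policy eventually reaching a terminal state; in practice one caps the number of evaluation sweeps or uses γ < 1, which is exactly the question Generalized Policy Iteration raises below.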

Iterative Policy Eval for the Small Gridworld!
- Policy π: an equiprobable random action; γ = 1.
- An undiscounted episodic task.
- Nonterminal states: 1, 2, ..., 14.
- Terminal state: one, shown in shaded squares.
- Actions that take the agent off the grid leave the state unchanged.
- Reward is -1 until the terminal state is reached.

Iterative Policy Eval for the Small Gridworld!
- Initial policy: equiprobable random action; γ = 1.
- An undiscounted episodic task.
- Nonterminal states: 1, 2, ..., 14.
- Terminal states: two, shown in shaded squares.
- Actions that take the agent off the grid leave the state unchanged.
- Reward is -1 until the terminal state is reached.

Generalized Policy Iteration!
- Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of their granularity.
- (Figure: a geometric metaphor for the convergence of GPI.)

Generalized Policy Iteration!
- Does policy evaluation need to converge to V^π?
- Or should we introduce a stopping condition, e.g. ε-convergence of the value function?
- Or simply stop after k iterations of iterative policy evaluation?
- For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy.
- Why not update the policy every iteration, i.e. stop after k = 1?
- This is equivalent to value iteration (next section).

Principle of Optimality!
- Any optimal policy can be subdivided into two components:
  - an optimal first action,
  - followed by an optimal policy from the successor state.
- Theorem (Principle of Optimality): A policy π achieves the optimal value from state s, V^π(s) = V*(s), if and only if for any state s' reachable from s, π achieves the optimal value from state s': V^π(s') = V*(s').

Value Iteration Example: Shortest Path!
- r(s, a) = -1 except for actions entering the terminal state.
- (Figure: successive value-iteration estimates V_4 through V_7 on a small gridworld shortest-path problem.)

Bellman Optimality Backup is a Contraction!
- Define the Bellman optimality backup operator:
  F*(V)(s) = max_a [ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]
- This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof):
  ||F*(U) − F*(V)||_∞ ≤ γ ||U − V||_∞

Value Iteration Converges to V*!
- The Bellman optimality operator F* has a unique fixed point.
- V* is a fixed point of F* (by the Bellman optimality equation).
- By the contraction mapping theorem, value iteration converges on V*.
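
A sketch of value iteration under the same array conventions: repeatedly apply the Bellman optimality backup F*, then read off a greedy policy. The γ and tolerance defaults are arbitrary choices for illustration.

```python
import numpy as np

def value_iteration(P, r, gamma=0.99, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup until V stops changing."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        q = r + gamma * P @ V          # q[s, a] = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = np.max(q, axis=1)      # Bellman optimality backup F*(V)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi_star = np.argmax(r + gamma * P @ V, axis=1)   # greedy policy read off from V*
    return V, pi_star
```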

Synchronous Dynamic Programming Algorithms!
- "Synchronous" here means we sweep through every state s in S for each update; don't update V or π until the full sweep is completed.

Problem    | Bellman Equation                                          | Algorithm
Prediction | Bellman expectation equation                              | Iterative policy evaluation
Control    | Bellman expectation equation + greedy policy improvement  | Policy iteration
Control    | Bellman optimality equation                               | Value iteration

- Algorithms are based on the state-value function V^π(s) or V*(s).
- Complexity O(m n^2) per iteration, for m actions and n states.
- Could also apply to the action-value function q^π(s, a) or q*(s, a), with complexity O(m^2 n^2) per iteration.

Asynchronous DP!
- Synchronous DP methods described so far require:
  - exhaustive sweeps of the entire state set,
  - updates to V or Q only after a full sweep.
- Asynchronous DP does not use sweeps. Instead it works like this:
  - Repeat until the convergence criterion is met: pick a state at random and apply the appropriate backup.
- Still needs lots of computation, but does not get locked into hopelessly long sweeps.
- Guaranteed to converge if all states continue to be selected.
- Can you select states to back up intelligently? YES: an agent's experience can act as a guide.

Asynchronous Dynamic Programming!
- Three simple ideas for asynchronous dynamic programming:
  - In-place dynamic programming
  - Prioritized sweeping
  - Real-time dynamic programming

In-Place Dynamic Programming!
- Multi-copy synchronous value iteration stores two copies of the value function:
  for all s in S: V_new(s) ← max_a [ r(s, a) + γ Σ_{s'} P(s' | s, a) V_old(s') ], then V_old ← V_new
- In-place value iteration only stores one copy of the value function:
  for all s in S: V(s) ← max_a [ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]
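
A sketch of the in-place variant, assuming the same P and r arrays: the inner loop overwrites V(s) immediately, so later states in the same sweep already see the updated values.

```python
import numpy as np

def in_place_value_iteration(P, r, gamma=0.99, tol=1e-6):
    """One copy of V: each update immediately uses the freshest neighbour values."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        delta = 0.0
        for s in range(S):                         # sweep states, updating V in place
            v_new = np.max(r[s] + gamma * P[s] @ V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                           # overwrite immediately (no second copy)
        if delta < tol:
            return V
```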

Prioritized Sweeping!
- Use the magnitude of the Bellman error to guide state selection, e.g.
  | max_a [ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ] − V(s) |
- Back up the state with the largest remaining Bellman error.
- Requires knowledge of the reverse dynamics (predecessor states).
- Can be implemented efficiently by maintaining a priority queue.
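
One possible implementation sketch using Python's heapq as the priority queue. The predecessor sets are computed from P (the "reverse dynamics"), and stale heap entries are simply tolerated and re-pushed, a common lazy-priority-queue trick rather than anything prescribed by the slide.

```python
import heapq
import numpy as np

def prioritized_sweeping_vi(P, r, gamma=0.99, tol=1e-6):
    """Back up the state with the largest Bellman error first, via a max-priority queue."""
    S = P.shape[0]
    V = np.zeros(S)
    # predecessors[s'] = states s that can transition into s' under some action
    predecessors = [set(np.nonzero(P[:, :, s2].sum(axis=1))[0]) for s2 in range(S)]

    def bellman_error(s):
        return abs(np.max(r[s] + gamma * P[s] @ V) - V[s])

    # heapq is a min-heap, so push negative errors to pop the largest error first
    heap = [(-bellman_error(s), s) for s in range(S)]
    heapq.heapify(heap)
    while heap:
        neg_err, s = heapq.heappop(heap)
        if -neg_err < tol:                               # all remaining errors are small
            break
        V[s] = np.max(r[s] + gamma * P[s] @ V)           # back up the selected state
        for p in predecessors[s]:                        # its predecessors' errors changed
            heapq.heappush(heap, (-bellman_error(p), p))
    return V
```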

Real-time Dynamic Programming!
- Idea: update only states that the agent experiences in the real world.
- After each time step (s_t, a_t, r_{t+1}), back up the state s_t.
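
A sketch of a single real-time DP step, assuming the agent supplies the state s_t it just visited; only that state's value is backed up, and a greedy action with respect to the updated values is returned.

```python
import numpy as np

def real_time_dp_step(V, P, r, s_t, gamma=0.99):
    """One real-time DP backup: apply the Bellman optimality backup to the visited state only."""
    q = r[s_t] + gamma * P[s_t] @ V     # q[a] for the visited state s_t
    V[s_t] = np.max(q)                  # in-place backup of s_t
    return int(np.argmax(q))            # greedy action w.r.t. the updated values
```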

Sample Backups!
- In subsequent lectures we will consider sample backups: using sample rewards and sample transitions instead of the known reward function and transition dynamics.
- Advantages:
  - Model-free: no advance knowledge of T or r(s, a) required.
  - Breaks the curse of dimensionality through sampling.
  - Cost of a backup is constant, independent of the number of states.

Approximate Dynamic Programming!
- Approximate the value function with a parametric function V_θ(s), using function approximation (e.g., a neural net).
- Apply dynamic programming to V_θ.
- E.g. Fitted Value Iteration repeats at each iteration k:
  - Sample states S̃ ⊆ S.
  - For each state s ∈ S̃, estimate the target value using the Bellman optimality equation:
    ṽ(s) = max_a [ r(s, a) + γ Σ_{s'} P(s' | s, a) V_{θ_k}(s') ]
  - Train the next value function V_{θ_{k+1}} using the targets {(s, ṽ(s))}.
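
A compact fitted value iteration sketch. To stay self-contained it uses a linear value function V(s) = features[s]·θ fit by least squares in place of the neural network mentioned on the slide; the feature matrix, batch size, and iteration count are all illustrative assumptions.

```python
import numpy as np

def fitted_value_iteration(P, r, features, gamma=0.99, n_iters=50, n_samples=200, seed=0):
    """Fitted VI sketch with a linear value function V(s) = features[s] @ theta.

    features : (S, d) array of state features; a neural net could replace the
    linear model, as the slide suggests -- this keeps the sketch short.
    """
    rng = np.random.default_rng(seed)
    S, d = features.shape
    theta = np.zeros(d)
    for k in range(n_iters):
        states = rng.integers(0, S, size=n_samples)           # sample a batch of states
        V_hat = features @ theta                               # current value estimates
        # Bellman optimality targets for the sampled states
        targets = np.max(r[states] + gamma * P[states] @ V_hat, axis=1)
        # fit theta_{k+1} to the targets by least squares (the "training" step)
        theta, *_ = np.linalg.lstsq(features[states], targets, rcond=None)
    return features @ theta
```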
