Introduction To Deep Reinforcement Learning


2019 CS420, Machine Learning, Lecture 13: Introduction to Deep Reinforcement Learning. Weinan Zhang, Shanghai Jiao Tong University. ching/cs420/index.html

Value and Policy Approximation
- State-value and action-value approximation: $V_\theta(s)$, $Q_\theta(s,a)$
- Stochastic policy approximation: $\pi_\theta(a|s)$
- Deterministic policy approximation: $a = \pi_\theta(s)$
What if we directly build these approximation functions with deep neural networks?

End-to-End Reinforcement Learning
[Figure: standard reinforcement learning pipeline vs. deep reinforcement learning pipeline.]
Deep Reinforcement Learning is what allows RL algorithms to solve complex problems in an end-to-end manner.
Slide from Sergey Levine. slides/lec-1.pdf

Deep Reinforcement Learning
Deep Reinforcement Learning leverages deep neural networks for value function and policy approximation, so as to allow RL algorithms to solve complex problems in an end-to-end manner.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Playing Atari with Deep Reinforcement Learning. NIPS 2013 workshop.

Deep Reinforcement Learning Trends
[Figure: Google search trends of the term 'deep reinforcement learning' (worldwide), rising from the NIPS 2013 DQN workshop paper to a spike when AlphaGo wins against Lee Sedol.]

Key Changes Brought by DRL
What will happen when combining DL and RL? Value functions and policies are now deep neural nets:
- Very high-dimensional parameter space
- Hard to train stably
- Easy to overfit
- Need a large amount of data
- Need high-performance computing
- Balance between CPUs (for collecting experience data) and GPUs (for training neural networks)
These new problems motivate novel algorithms for DRL.

Deep Reinforcement Learning Categories
- Value-based methods: Deep Q-network and its extensions
- Stochastic policy-based methods: policy gradients with NNs, natural policy gradient, trust region policy optimization, proximal policy optimization, A3C
- Deterministic policy-based methods: deterministic policy gradient, DDPG

REVIEW: Q-Learning
- For off-policy learning of the action-value Q(s,a)
- The next action is chosen using the behavior policy: $a_{t+1} \sim \mu(\cdot|s_t)$
- But we consider an alternative successor action $a' \sim \pi(\cdot|s_t)$
- And update $Q(s_t, a_t)$ towards the value of the alternative action:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma Q(s_{t+1}, a') - Q(s_t, a_t)\right)$$
where the successor action $a'$ comes from π, not μ.

REVIEW: Off-Policy Control with Q-Learning
- Allow both the behavior and target policies to improve
- The target policy π is greedy w.r.t. Q(s,a): $\pi(s_{t+1}) = \arg\max_{a'} Q(s_{t+1}, a')$
- Q-learning update:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right)$$
- At state s, take action a; observe reward r; transit to the next state s'; at state s', take action $\arg\max_{a'} Q(s', a')$
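To make the update rule concrete, here is a minimal tabular Q-learning sketch in Python/NumPy. The environment interface (`reset`, `step`, `sample_action`), the state/action counts, and the hyperparameters are illustrative assumptions, not part of the slides.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: behave epsilon-greedily (behavior policy mu),
    but update towards the greedy target policy pi."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False          # assumed: reset() returns an int state
        while not done:
            # behavior policy: epsilon-greedy w.r.t. the current Q
            a = env.sample_action() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)     # assumed: step() returns (state, reward, done)
            # target policy is greedy, hence the max over a'
            td_target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```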

Deep Q-Network (DQN)
DQN (NIPS 2013) is the beginning of the entire deep reinforcement learning subarea.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Playing Atari with Deep Reinforcement Learning. NIPS 2013 workshop.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.

Deep Q-Network (DQN)
- Implement the Q function with a deep neural network
- Input a state, output Q values for all actions
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.

Deep Q-Network (DQN)
The loss function of the Q-learning update at iteration i:
$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim U(D)}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta_i^-)}_{\text{target Q value}} - \underbrace{Q(s, a; \theta_i)}_{\text{estimated Q value}}\big)^2\Big]$$
- θ_i are the network parameters to be updated at iteration i, with standard back-propagation
- θ_i^- are the target network parameters, only synchronized with θ_i every C steps
- (s,a,r,s') ~ U(D): the samples are drawn uniformly from the experience pool D, thus avoiding overfitting to the most recent experiences
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.

Deep Q-Network (DQN)
The loss function of the Q-learning update at iteration i:
$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim U(D)}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\big)^2\Big]$$
For each experience (s,a,r,s') ~ U(D), the gradient update (computed by back-propagation) is
$$\theta_{i+1} \leftarrow \theta_i + \alpha\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)\nabla_\theta Q(s, a; \theta_i)$$
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.
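The loss and its gradient translate directly into an automatic-differentiation framework. Below is a minimal PyTorch sketch of one DQN update on a sampled minibatch; the network objects, the optimizer, and the way the replay batch is packed are assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One Q-learning update: regress Q(s,a;theta_i) towards
    r + gamma * max_a' Q(s',a';theta_i^-), with theta_i^- held fixed."""
    s, a, r, s_next, done = batch  # tensors sampled uniformly from the replay pool D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s,a;theta_i)
    with torch.no_grad():                                       # target network gives no gradient
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                 # squared TD error, L_i(theta_i)
    optimizer.zero_grad()
    loss.backward()                                             # back-propagation of the gradient above
    optimizer.step()
    return loss.item()

# every C steps, synchronize the target parameters with the online parameters:
# target_net.load_state_dict(q_net.state_dict())
```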

DRL with Double Q-Learning
The DQN gradient is
$$\theta_{i+1} \leftarrow \theta_i + \alpha\left(y_i - Q(s, a; \theta_i)\right)\nabla_\theta Q(s, a; \theta_i), \qquad y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^-) \;\;\text{(target Q value)}$$
The target Q value can be rewritten as
$$y_i = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta_i^-); \theta_i^-\big)$$
which uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates.
Hasselt et al. Deep Reinforcement Learning with Double Q-learning. AAAI 2016.

DRL with Double Q-Learning
The DQN gradient is
$$\theta_{i+1} \leftarrow \theta_i + \alpha\left(y_i - Q(s, a; \theta_i)\right)\nabla_\theta Q(s, a; \theta_i), \qquad y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$$
The target Q value can be rewritten as
$$y_i = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta_i^-); \theta_i^-\big)$$
which uses the same values both to select and to evaluate an action. Double Q-learning generalizes this by using different parameter sets for selection and evaluation:
$$y_i = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta_i); \theta_i'\big)$$
(in Double DQN, selection uses the online network θ_i and evaluation uses the target network θ_i^-).
Hasselt et al. Deep Reinforcement Learning with Double Q-learning. AAAI 2016.
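A minimal sketch of how the two targets differ, assuming the same hypothetical PyTorch setup as above (q_net holds the online parameters θ_i, target_net holds θ_i^-):

```python
import torch

@torch.no_grad()
def dqn_target(r, s_next, done, target_net, gamma=0.99):
    # selection AND evaluation both use theta_i^- : prone to overestimation
    return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

@torch.no_grad()
def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    # select the greedy action with the online network (theta_i) ...
    a_star = q_net(s_next).argmax(dim=1, keepdim=True)
    # ... but evaluate that action with the other parameter set (theta_i^-)
    return r + gamma * (1 - done) * target_net(s_next).gather(1, a_star).squeeze(1)
```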

Experiments of DQN vs. Double DQN
Hasselt et al. Deep Reinforcement Learning with Double Q-learning. AAAI 2016.

Deep Reinforcement Learning Categories
- Value-based methods: Deep Q-network and its extensions
- Stochastic policy-based methods: policy gradients with NNs, natural policy gradient, trust region policy optimization, proximal policy optimization, A3C
- Deterministic policy-based methods: deterministic policy gradient, DDPG

REVIEW: Policy Gradient Theorem
- The policy gradient theorem generalizes the likelihood-ratio approach to multi-step MDPs
- Replaces the instantaneous reward r_{s,a} with the long-term value $Q^{\pi_\theta}(s,a)$
- The policy gradient theorem applies to the start-state objective J_1, the average-reward objective J_avR, and the average-value objective J_avV
- Theorem: for any differentiable policy $\pi_\theta(a|s)$ and any of the policy objective functions J ∈ {J_1, J_avR, J_avV}, the policy gradient is
$$\frac{\partial J(\theta)}{\partial \theta} = \mathbb{E}_{\pi_\theta}\left[\frac{\partial \log \pi_\theta(a|s)}{\partial \theta}\, Q^{\pi_\theta}(s,a)\right]$$

Policy Network Gradients
For a stochastic policy, the action probability is typically defined as a softmax
$$\pi_\theta(a|s) = \frac{e^{f_\theta(s,a)}}{\sum_{a'} e^{f_\theta(s,a')}}$$
where f_θ(s,a) is the score function of a state-action pair parameterized by θ, which can be implemented with a neural net. The gradient of its log-form is
$$\frac{\partial \log \pi_\theta(a|s)}{\partial \theta} = \frac{\partial f_\theta(s,a)}{\partial \theta} - \frac{1}{\sum_{a''} e^{f_\theta(s,a'')}}\sum_{a'} e^{f_\theta(s,a')}\frac{\partial f_\theta(s,a')}{\partial \theta} = \frac{\partial f_\theta(s,a)}{\partial \theta} - \mathbb{E}_{a'\sim\pi_\theta(a'|s)}\left[\frac{\partial f_\theta(s,a')}{\partial \theta}\right]$$

Policy Network Gradients
With the gradient form
$$\frac{\partial \log \pi_\theta(a|s)}{\partial \theta} = \frac{\partial f_\theta(s,a)}{\partial \theta} - \mathbb{E}_{a'\sim\pi_\theta(a'|s)}\left[\frac{\partial f_\theta(s,a')}{\partial \theta}\right]$$
the policy network gradient is
$$\frac{\partial J(\theta)}{\partial \theta} = \mathbb{E}_{\pi_\theta}\left[\frac{\partial \log \pi_\theta(a|s)}{\partial \theta}\, Q^{\pi_\theta}(s,a)\right] = \mathbb{E}_{\pi_\theta}\left[\left(\frac{\partial f_\theta(s,a)}{\partial \theta} - \mathbb{E}_{a'\sim\pi_\theta(a'|s)}\left[\frac{\partial f_\theta(s,a')}{\partial \theta}\right]\right) Q^{\pi_\theta}(s,a)\right]$$
where both score-function derivatives are computed by back-propagation.
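In practice this gradient is obtained by back-propagating through a surrogate objective rather than by writing out the softmax derivative. A minimal PyTorch sketch follows; the score network `f_theta` (states to per-action scores), the sampled batch, and the use of a return estimate in place of $Q^{\pi_\theta}(s,a)$ are illustrative assumptions.

```python
import torch

def policy_gradient_loss(f_theta, states, actions, q_estimates):
    """REINFORCE-style surrogate: minimizing -E[log pi_theta(a|s) * Q] makes autograd
    produce E[d log pi_theta(a|s)/d theta * Q], i.e. the policy gradient above.
    The softmax/log-softmax internally realizes
    d f(s,a)/d theta - E_{a'~pi}[d f(s,a')/d theta]."""
    logits = f_theta(states)                         # f_theta(s,.) for all actions
    log_pi = torch.log_softmax(logits, dim=1)        # log pi_theta(a|s)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(log_pi_a * q_estimates).mean()          # call .backward() on this
```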

Looking into Policy Gradient (UC Berkeley DRL course: .pdf)
Let R(π) denote the expected return of π:
$$R(\pi) = \mathbb{E}_{s_0\sim\rho_0,\, a_t\sim\pi(\cdot|s_t)}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$$
We collect experience data with another policy π_old and want to optimize some objective to get a new, better policy π. Note the useful identity
$$R(\pi) = R(\pi_{\text{old}}) + \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_{\text{old}}}(s_t, a_t)\right]$$
where trajectories τ are sampled from π and the advantage function is
$$A^{\pi_{\text{old}}}(s,a) = \mathbb{E}_{s'\sim\rho(s'|s,a)}\left[r(s) + \gamma V^{\pi_{\text{old}}}(s') - V^{\pi_{\text{old}}}(s)\right]$$
S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. ICML 2002.

Looking into Policy Gradient (UC Berkeley DRL course: .pdf)
With the advantage function
$$A^{\pi_{\text{old}}}(s,a) = \mathbb{E}_{s'\sim\rho(s'|s,a)}\left[r(s) + \gamma V^{\pi_{\text{old}}}(s') - V^{\pi_{\text{old}}}(s)\right]$$
we prove the useful identity
$$R(\pi) = R(\pi_{\text{old}}) + \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_{\text{old}}}(s_t, a_t)\right]$$
Proof (the value terms telescope):
$$\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_{\text{old}}}(s_t,a_t)\right] = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t \big(r(s_t) + \gamma V^{\pi_{\text{old}}}(s_{t+1}) - V^{\pi_{\text{old}}}(s_t)\big)\right] = \mathbb{E}_{\tau\sim\pi}\left[-V^{\pi_{\text{old}}}(s_0) + \sum_{t=0}^{\infty}\gamma^t r(s_t)\right]$$
$$= -\mathbb{E}_{s_0}\left[V^{\pi_{\text{old}}}(s_0)\right] + \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t)\right] = -R(\pi_{\text{old}}) + R(\pi)$$
S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. ICML 2002.

More for the Policy Expected Return
Given the advantage function
$$A^{\pi_{\text{old}}}(s,a) = \mathbb{E}_{s'\sim\rho(s'|s,a)}\left[r(s) + \gamma V^{\pi_{\text{old}}}(s') - V^{\pi_{\text{old}}}(s)\right]$$
we want to manipulate R(π) into an objective that can be estimated from data:
$$R(\pi) = R(\pi_{\text{old}}) + \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_{\text{old}}}(s_t,a_t)\right] = R(\pi_{\text{old}}) + \sum_s \sum_{t=0}^{\infty}\gamma^t P(s_t=s\,|\,\pi)\sum_a \pi(a|s)\, A^{\pi_{\text{old}}}(s,a)$$
$$= R(\pi_{\text{old}}) + \sum_s \rho_\pi(s) \sum_a \pi(a|s)\, A^{\pi_{\text{old}}}(s,a)$$
where $\rho_\pi(s) = \sum_{t=0}^{\infty}\gamma^t P(s_t=s\,|\,\pi)$ is the discounted state visitation frequency under π.

Surrogate Loss Function
With importance sampling,
$$R(\pi) = R(\pi_{\text{old}}) + \sum_s \rho_\pi(s)\sum_a \pi(a|s)\, A^{\pi_{\text{old}}}(s,a) = R(\pi_{\text{old}}) + \mathbb{E}_{s\sim\pi,\,a\sim\pi}\left[A^{\pi_{\text{old}}}(s,a)\right] = R(\pi_{\text{old}}) + \mathbb{E}_{s\sim\pi,\,a\sim\pi_{\text{old}}}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}\, A^{\pi_{\text{old}}}(s,a)\right]$$
Define a surrogate loss function based on sampled data that ignores the change in state distribution:
$$L(\pi) = \mathbb{E}_{s\sim\pi_{\text{old}},\,a\sim\pi_{\text{old}}}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}\, A^{\pi_{\text{old}}}(s,a)\right]$$
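A minimal sketch of estimating this surrogate loss from a batch collected with π_old. The policy network (states to action logits), the stored old log-probabilities of the taken actions, and the advantage estimates are assumptions about how the batch was produced, not part of the original slides.

```python
import torch

def surrogate_loss(policy, states, actions, log_pi_old, advantages):
    """L(pi) = E_{s,a ~ pi_old}[ pi(a|s)/pi_old(a|s) * A^{pi_old}(s,a) ],
    estimated on samples from pi_old; the state-distribution shift is ignored."""
    log_pi = torch.log_softmax(policy(states), dim=1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = torch.exp(log_pi_a - log_pi_old)   # importance weight pi(a|s) / pi_old(a|s)
    return (ratio * advantages).mean()
```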

Surrogate Loss Function
Target function: $R(\pi) = R(\pi_{\text{old}}) + \mathbb{E}_{s\sim\pi,\,a\sim\pi}\left[A^{\pi_{\text{old}}}(s,a)\right]$
Surrogate loss: $L(\pi) = \mathbb{E}_{s\sim\pi_{\text{old}},\,a\sim\pi_{\text{old}}}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}\, A^{\pi_{\text{old}}}(s,a)\right]$
The surrogate loss matches the target to first order for a parameterized policy:
$$\nabla_\theta L(\pi_\theta)\Big|_{\theta_{\text{old}}} = \mathbb{E}_{s\sim\pi_{\text{old}},\,a\sim\pi_{\text{old}}}\left[\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}\, A^{\pi_{\text{old}}}(s,a)\right]\Big|_{\theta_{\text{old}}} = \mathbb{E}_{s\sim\pi_{\text{old}},\,a\sim\pi_{\text{old}}}\left[\frac{\pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}\, A^{\pi_{\text{old}}}(s,a)\right]\Big|_{\theta_{\text{old}}}$$
$$= \mathbb{E}_{s\sim\pi_{\text{old}},\,a\sim\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, A^{\pi_{\text{old}}}(s,a)\right]\Big|_{\theta_{\text{old}}} = \nabla_\theta R(\pi_\theta)\Big|_{\theta_{\text{old}}}$$

Trust-Region Policy Optimization
[Figure: when M is a lower bound of R(π(θ)), a better value of M must be better for R(π(θ)); when M is not a lower bound, improving M can yield a lower R(π(θ)).]
Idea: by optimizing a lower bound function M that approximates R(π) locally, we guarantee policy improvement at every step and eventually reach the optimal policy. How do we choose a proper lower bound M?

Trust-Region Policy Optimization
$$R(\pi) = \mathbb{E}_{s_0\sim\rho_0,\,a_t\sim\pi(\cdot|s_t)}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] = R(\pi_{\text{old}}) + \sum_s \rho_\pi(s)\sum_a \pi(a|s)\, A^{\pi_{\text{old}}}(s,a)$$
$$L_{\pi_{\text{old}}}(\pi) = \mathbb{E}_{s\sim\pi_{\text{old}},\,a\sim\pi_{\text{old}}}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}\, A^{\pi_{\text{old}}}(s,a)\right]$$
Appendix A of the TRPO paper provides a 2-page proof that establishes the following bound:
$$\left|R(\pi) - \big(R(\pi_{\text{old}}) + L_{\pi_{\text{old}}}(\pi)\big)\right| \le C\sqrt{\mathbb{E}_{s\sim\rho_\pi}\left[D_{KL}\big(\pi_{\text{old}}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\right]}$$
Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

Trust-Region Policy Optimization
Maximizing R(π) is equivalent to maximizing the improvement R(π) − R(π_old), since R(π_old) is a constant. Combining this with the bound
$$\left|R(\pi) - \big(R(\pi_{\text{old}}) + L_{\pi_{\text{old}}}(\pi)\big)\right| \le C\sqrt{\mathbb{E}_{s\sim\rho_\pi}\left[D_{KL}\big(\pi_{\text{old}}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\right]}$$
and rewriting gives our final lower bound M:
$$R(\pi) - R(\pi_{\text{old}}) \ge L_{\pi_{\text{old}}}(\pi) - C\sqrt{\mathbb{E}_{s\sim\rho_\pi}\left[D_{KL}\big(\pi_{\text{old}}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\right]}$$
A better value of M must be better for R(π(θ)), so we solve
$$\max_\pi\; L_{\pi_{\text{old}}}(\pi) - C\sqrt{\mathbb{E}_{s\sim\rho_\pi}\left[D_{KL}\big(\pi_{\text{old}}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\right]}$$
Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

Trust-Region Policy Optimization
In fact, by Lagrangian methods, the penalized objective
$$\max_\pi\; L_{\pi_{\text{old}}}(\pi) - C\sqrt{\mathbb{E}_{s\sim\rho_\pi}\left[D_{KL}\big(\pi_{\text{old}}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\right]}$$
is mathematically equivalent to the following problem with a trust-region constraint:
$$\max_\pi\; L_{\pi_{\text{old}}}(\pi) \quad \text{s.t.} \quad \mathbb{E}_{s\sim\rho_\pi}\left[D_{KL}\big(\pi_{\text{old}}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\right] \le \delta \quad \text{(trust region)}$$
The bound
$$\left|R(\pi) - \big(R(\pi_{\text{old}}) + L_{\pi_{\text{old}}}(\pi)\big)\right| \le C\sqrt{\mathbb{E}_{s\sim\rho_\pi}\left[D_{KL}\big(\pi_{\text{old}}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\right]}$$
is what makes M a valid lower bound within the trust region, so improving M guarantees improving R(π(θ)).
Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.
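A minimal sketch of the two quantities that define the trust-region problem, the surrogate objective and the mean KL divergence between π_old and π, assuming a discrete-action policy network and a batch sampled from π_old with stored old action probabilities. The actual TRPO update (conjugate-gradient step plus line search on the constrained problem) is omitted; the penalized form is shown only to connect to the lower bound M, with C and δ as illustrative constants.

```python
import torch

def surrogate_and_kl(policy, states, actions, log_pi_old_all, advantages):
    """Return L_{pi_old}(pi) and E_s[ KL(pi_old(.|s) || pi(.|s)) ] on a batch from pi_old."""
    log_pi_all = torch.log_softmax(policy(states), dim=1)
    log_pi_a = log_pi_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    log_pi_old_a = log_pi_old_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = torch.exp(log_pi_a - log_pi_old_a)              # pi(a|s) / pi_old(a|s)
    surrogate = (ratio * advantages).mean()                  # L_{pi_old}(pi)
    kl = (log_pi_old_all.exp() * (log_pi_old_all - log_pi_all)).sum(dim=1).mean()
    return surrogate, kl

# penalized form of the lower bound M (C is an assumed constant):
#   objective = surrogate - C * torch.sqrt(kl)
# constrained form: maximize `surrogate` subject to kl <= delta (the trust region)
```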

Trust-Region Policy Optimization
[Figure: line search (as in gradient ascent) vs. optimization within a trust region.]
https://medium.com/@jonathan ined-a6ee04eeeee9

A3C: Actor-Critic Methods
A3C stands for Asynchronous Advantage Actor-Critic:
- Asynchronous: the algorithm executes a set of environments in parallel
- Advantage: the policy gradient updates are computed using the advantage function
- Actor-Critic: it is an actor-critic method, in which a policy is updated with the help of learned state-value functions
The per-step gradient contribution is
$$\nabla_{\theta'}\log\pi(a_t|s_t;\theta')\, A(s_t, a_t; \theta, \theta_v), \qquad A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1}\gamma^i r_{t+i} + \gamma^k V(s_{t+k};\theta_v) - V(s_t;\theta_v)$$
https://medium.com/@jonathan 77f014ec3f12
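A minimal sketch of the k-step advantage estimate used in this update, computed backwards over one worker's rollout. The argument names (per-step rewards, per-step value estimates V(s_t; θ_v), and the bootstrap value of the final state) are illustrative assumptions about how the rollout is stored.

```python
import numpy as np

def k_step_advantage(rewards, values, v_last, gamma=0.99):
    """A(s_t,a_t) = sum_{i=0}^{k-1} gamma^i r_{t+i} + gamma^k V(s_{t+k}) - V(s_t),
    computed for every t of a rollout of length k (v_last = V(s_{t+k}) bootstrap)."""
    advantages = np.zeros(len(rewards))
    ret = v_last                           # bootstrap from the last state's value
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret     # discounted k-step return starting at t
        advantages[t] = ret - values[t]    # subtract the baseline V(s_t; theta_v)
    return advantages
```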

Deep Reinforcement Learning Categories
- Value-based methods: Deep Q-network and its extensions
- Stochastic policy-based methods: policy gradients with NNs, natural policy gradient, trust region policy optimization, proximal policy optimization, A3C
- Deterministic policy-based methods: deterministic policy gradient, DDPG

Stochastic vs. Deterministic Policies
Stochastic policy:
- for discrete actions: $\pi(a|s;\theta) = \frac{\exp\{Q_\theta(s,a)\}}{\sum_{a'}\exp\{Q_\theta(s,a')\}}$
- for continuous actions: $\pi(a|s;\theta) \propto \exp\{-(a - \mu_\theta(s))^2\}$
Deterministic policy:
- for discrete actions: $\pi(s;\theta) = \arg\max_a Q_\theta(s,a)$ (non-differentiable)
- for continuous actions: $a = \pi_\theta(s)$ (can be differentiable)
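A small sketch contrasting the two continuous-action parameterizations. The network shape, the fixed standard deviation, and the use of a single linear layer for $\mu_\theta(s)$ are illustrative assumptions only.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2                     # assumed dimensions
mu_net = nn.Linear(state_dim, action_dim)        # mu_theta(s): Gaussian mean / deterministic action

def stochastic_action(s, std=0.5):
    # pi(a|s) proportional to exp{-(a - mu_theta(s))^2}: sample around the mean
    return torch.distributions.Normal(mu_net(s), std).sample()

def deterministic_action(s):
    # a = pi_theta(s): the network output itself is the action (differentiable in theta)
    return mu_net(s)
```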

Deterministic Policy Gradient
A critic module estimates the state-action value:
$$Q_w(s,a) \simeq Q^\pi(s,a), \qquad L(w) = \mathbb{E}_{s\sim\rho^\pi,\,a\sim\pi_\theta}\left[\big(Q_w(s,a) - Q^\pi(s,a)\big)^2\right]$$
With a differentiable critic, the deterministic continuous-action actor can be updated via the deterministic policy gradient theorem (on-policy form):
$$J(\pi_\theta) = \mathbb{E}_{s\sim\rho^\pi}\left[Q^\pi(s,a)\right], \qquad \nabla_\theta J(\pi_\theta) = \mathbb{E}_{s\sim\rho^\pi}\left[\nabla_\theta \pi_\theta(s)\,\nabla_a Q^\pi(s,a)\big|_{a=\pi_\theta(s)}\right]$$
(the gradient form follows from the chain rule).
D. Silver et al. Deterministic Policy Gradient Algorithms. ICML 2014.
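A minimal PyTorch sketch of the actor update implied by this theorem; the actor and critic networks, the optimizer, and the batch of states are assumptions for illustration. Autograd applies the chain rule $\nabla_\theta \pi_\theta(s)\,\nabla_a Q(s,a)|_{a=\pi_\theta(s)}$ automatically when we back-propagate through the critic into the actor.

```python
import torch

def dpg_actor_update(actor, critic, actor_optimizer, states):
    """Ascend E_s[ Q(s, pi_theta(s)) ] by gradient descent on its negative."""
    actions = actor(states)                        # a = pi_theta(s), differentiable in theta
    actor_loss = -critic(states, actions).mean()   # minimize -Q  <=>  maximize Q
    actor_optimizer.zero_grad()
    actor_loss.backward()                          # chain rule through the critic into the actor
    actor_optimizer.step()
    return actor_loss.item()
```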

DDPG: Deep Deterministic Policy Gradient
For the deterministic policy gradient
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s\sim\rho^\pi}\left[\nabla_\theta \pi_\theta(s)\,\nabla_a Q^\pi(s,a)\big|_{a=\pi_\theta(s)}\right]$$
a naive application of this actor-critic method with neural function approximators is, in practice, unstable for challenging problems. DDPG's solutions over DPG:
- Experience replay (off-policy)
- Target networks
- Batch normalization on the Q network prior to the action input
- Noise added to the continuous action for exploration
Lillicrap et al. Continuous control with deep reinforcement learning. NIPS 2015.
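A minimal sketch of two of these ingredients: exploration noise on the continuous action and the slow target-network update. The soft-update form with coefficient τ follows the DDPG paper, while the noise here is plain Gaussian rather than the Ornstein–Uhlenbeck process used there; the action bounds and constants are illustrative assumptions.

```python
import torch

def noisy_action(actor, state, noise_std=0.1, low=-1.0, high=1.0):
    """Exploration: add noise to the deterministic action a = pi_theta(s)."""
    with torch.no_grad():
        a = actor(state)
        a = a + noise_std * torch.randn_like(a)   # assumed Gaussian noise (not OU)
    return a.clamp(low, high)                     # keep the action within assumed bounds

def soft_update(target_net, online_net, tau=0.001):
    """Target parameters slowly track the online parameters: theta' <- tau*theta + (1-tau)*theta'."""
    for p_target, p in zip(target_net.parameters(), online_net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```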

[DDPG diagram: noise on action; off-policy experience replay; critic net updated using a target critic network and a target actor network; actor net updated through the critic.]

DDPG Experiments
Performance curves for a selection of domains using variants of DPG:
- Light grey: original DPG algorithm with batch normalization
- Dark grey: with target network
- Green: with target networks and batch normalization
- Blue: with target networks from pixel-only inputs
Target networks are crucial.
Lillicrap et al. Continuous control with deep reinforcement learning. NIPS 2015.

Deep Reinforcement Learning Categories
DRL = RL + DL: one of the most challenging problems in machine learning, with very fast development during the recent 5 years.
- Value-based methods: Deep Q-network and its extensions
- Stochastic policy-based methods: policy gradients with NNs, natural policy gradient, trust region policy optimization, proximal policy optimization, A3C
- Deterministic policy-based methods: deterministic policy gradient, DDPG
