The Mathematical Foundations of Policy Gradient Methods
Sham M. Kakade
University of Washington & Microsoft Research
Reinforcement (interactive) learning (RL):
Markov Decision Processes: a framework for RL [Sutton & Barto '18]
- $S$ states; we start with state $s_0$. $A$ actions.
- Dynamics model $P(s' \mid s, a)$, reward function $r(s)$, discount factor $\gamma$.
- A policy $\pi$: States $\to$ Actions; a stochastic policy samples $a_t \sim \pi(\cdot \mid s_t)$.
- We execute $\pi$ to obtain a trajectory: $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$
- Total $\gamma$-discounted reward: $V^\pi(s_0) = \mathbb{E}\left[r(s_0) + \gamma\, r(s_1) + \gamma^2 r(s_2) + \cdots\right]$, where the distribution of $s_t, a_t$ is induced by $\pi$.
- Goal: find a policy which maximizes our value, $V^\pi(s_0)$.
Dexterous Robotic Hand Manipulation (OpenAI, Oct 15, 2019)
Challenges in RL:
1. Exploration (the environment may be unknown)
2. The credit assignment problem (due to delayed rewards)
3. Large state/action spaces:
   - hand state: joint angles/velocities
   - cube state: configuration
   - actions: forces applied to actuators
Values, State-Action Values, and Advantages
- Value: $V^\pi(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\middle|\, s_0\right]$
- State-action value: $Q^\pi(s_0, a_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\middle|\, s_0, a_0\right]$
- Advantage: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
- Expectations are with respect to trajectories sampled under $\pi$.
- We have $S$ states and $A$ actions. The effective "horizon" is $1/(1-\gamma)$ time steps.
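To make these definitions concrete, here is a minimal Python sketch (not from the slides) of Monte Carlo value estimation; the gym-style `reset()`/`step()` interface and the `policy` callable are assumptions for illustration:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t for one sampled trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_value(env, policy, gamma, n_rollouts=1000, horizon=None):
    """Monte Carlo estimate of V^pi(s_0).

    Assumes a gym-style interface: env.reset() -> s, and
    env.step(a) -> (s', r, done). `policy(s)` samples a ~ pi(.|s).
    Truncating at horizon ~ 1/(1 - gamma) adds only a small bias.
    """
    horizon = horizon or int(1.0 / (1.0 - gamma))
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()
        rewards = []
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env.step(a)
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```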
The "Tabular" Dynamic Programming approach
- State $s$: (joint angles, cube config, ...). Action $a$: (forces at joints).
- $Q^\pi(s, a)$: the state-action value, a "one-step look-ahead value" using $\pi$.
- The table (e.g. state $(31°, 12°, \ldots)$, action $(1.2\ \text{Newton}, 0.1\ \text{Newton}, \ldots)$, entry $8$ units of reward) is the "bookkeeping" for dynamic programming (with known rewards/dynamics):
  1. Estimate the state-action value $Q^\pi(s, a)$ for every entry in the table.
  2. Update the policy $\pi$ and go to step 1.
- Generalization: how can we deal with this infinite table, using sampling/supervised learning?
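As an illustration of the tabular approach, a hedged sketch of exact policy iteration when rewards and dynamics are known; the array shapes are assumptions for this example:

```python
import numpy as np

def policy_iteration(P, R, gamma, n_iters=100):
    """Tabular policy iteration sketch (illustrative, not from the slides).

    P: transition tensor of shape [S, A, S]; R: reward matrix [S, A].
    Alternates exact policy evaluation with a greedy policy update.
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)  # deterministic policy: state -> action
    for _ in range(n_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(S), pi]          # [S, S]
        R_pi = R[np.arange(S), pi]          # [S]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # One-step look-ahead: Q(s, a) = R(s, a) + gamma * E[V(s')].
        Q = R + gamma * P @ V               # [S, A]
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):      # reached a fixed point
            break
        pi = new_pi
    return pi, V
```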
This Tutorial: Mathematical Foundations of Policy Gradient Methods
§ Part I: Basics
  A. Derivation and Estimation
  B. Preconditioning and the Natural Policy Gradient
§ Part II: Convergence and Approximation
  A. Convergence: this is a non-convex problem!
  B. Approximation: how to think about the role of deep learning?
Part-1: Basics
State-Action Visitation Measures
- This helps to clean up notation!
- "Occupancy frequency" of being in state $s$ and action $a$ after following $\pi$ starting in $s_0$:
  $d^\pi_{s_0}(s, a) = (1-\gamma)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{1}\{s_t = s,\, a_t = a\} \,\middle|\, s_0, \pi\right]$
- $d^\pi_{s_0}$ is a probability distribution.
- With this notation:
  $V^\pi(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^\pi_{s_0}}\left[r(s, a)\right]$
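One practical consequence worth noting: sampling from $d^\pi_{s_0}$ only requires stopping a rollout at a geometrically distributed time. A hedged sketch, reusing the assumed gym-style interface from the earlier example:

```python
import numpy as np

def sample_from_visitation(env, policy, gamma):
    """Draw one (s, a) sample from the visitation measure d^pi_{s_0}.

    Uses the fact that d^pi is the state-action distribution at a
    geometrically distributed stopping time: at each step, stop with
    probability (1 - gamma).
    """
    s = env.reset()
    while True:
        a = policy(s)
        if np.random.rand() < 1.0 - gamma:  # stop with prob (1 - gamma)
            return s, a
        s, _, _ = env.step(a)
```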
Direct Policy Optimization over Stochastic Policies
- $\pi_\theta(a \mid s)$ is the probability of action $a$ given $s$, parameterized by $\theta$, with $\pi_\theta(a \mid s) \propto \exp(f_\theta(s, a))$.
- Softmax policy class: $f_\theta(s, a) = \theta_{s,a}$
- Linear policy class: $f_\theta(s, a) = \theta \cdot \phi(s, a)$, where $\phi(s, a) \in \mathbb{R}^d$
- Neural policy class: $f_\theta(s, a)$ is a neural network
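A minimal sketch of the linear softmax class in Python (illustrative; the feature map `phi`, returning an `[A, d]` matrix per state, is an assumed helper, not something from the slides):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

class LinearSoftmaxPolicy:
    """pi_theta(a|s) proportional to exp(theta . phi(s, a))."""

    def __init__(self, theta, phi):
        self.theta, self.phi = theta, phi

    def probs(self, s):
        return softmax(self.phi(s) @ self.theta)     # shape [A]

    def sample(self, s):
        p = self.probs(s)
        return np.random.choice(len(p), p=p)

    def grad_log_prob(self, s, a):
        # grad_theta log pi(a|s) = phi(s,a) - E_{a'~pi}[phi(s,a')]
        feats = self.phi(s)                          # [A, d]
        return feats[a] - self.probs(s) @ feats
```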
In practice, policy gradient methods rule
- They are the most effective method for obtaining state-of-the-art results.
- The basic update: $\theta \leftarrow \theta + \eta\, \nabla_\theta V^{\pi_\theta}(s_0)$
Why do we like them?
- They easily deal with large state/action spaces (through the neural net parameterization).
- We can estimate the gradient using only simulation of our current policy $\pi_\theta$ (the expectation is over the states and actions visited under $\pi_\theta$).
- They directly optimize the cost function of interest!
Two (equal) expressions for the policy gradient
$\nabla_\theta V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{\pi_\theta}}\left[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$
$\nabla_\theta V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{\pi_\theta}}\left[A^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$
(some shorthand notation above)
- Where do these expressions come from?
- How do we compute them?
Example: an important special case!
- Remember the softmax policy class (a "tabular" parameterization): $\pi_\theta(a \mid s) \propto \exp(\theta_{s,a})$
- Complete class with $SA$ params: one parameter per state-action pair, so it contains the optimal policy.
- Expression for the softmax class:
  $\frac{\partial V^{\pi_\theta}(s_0)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)$
- Intuition: increase $\theta_{s,a}$ if the "weighted" advantage is large.
Part-1A: Derivations and Estimation
General Derivation
$\nabla V^\pi(s_0) = \nabla \sum_{a_0} \pi(a_0 \mid s_0)\, Q^\pi(s_0, a_0)$
$= \sum_{a_0} \nabla \pi(a_0 \mid s_0)\, Q^\pi(s_0, a_0) + \sum_{a_0} \pi(a_0 \mid s_0)\, \nabla Q^\pi(s_0, a_0)$
$= \sum_{a_0} \pi(a_0 \mid s_0)\, \nabla \log \pi(a_0 \mid s_0)\, Q^\pi(s_0, a_0) + \sum_{a_0} \pi(a_0 \mid s_0)\, \nabla \Big( r(s_0, a_0) + \gamma \sum_{s_1} P(s_1 \mid s_0, a_0)\, V^\pi(s_1) \Big)$
$= \mathbb{E}\left[Q^\pi(s_0, a_0)\, \nabla \log \pi(a_0 \mid s_0)\right] + \gamma\, \mathbb{E}\left[\nabla V^\pi(s_1)\right]$
Recursing on the last term gives the policy gradient expression.
SL vs RL: How do we obtain gradients?
- In supervised learning, how do we compute the gradient of our loss, $\frac{\partial}{\partial \theta} L(\theta)$? Hint: can we compute our loss?
- In reinforcement learning, how do we compute the policy gradient $\nabla_\theta V^{\pi_\theta}(s_0)$?
  $\nabla_\theta V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{\pi_\theta}}\left[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$
Monte Carlo Estimation
- Sample a trajectory: execute $\pi_\theta$ to obtain $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$
- Form the estimates:
  $\widehat{Q}(s_t, a_t) = \sum_{t'=0}^{\infty} \gamma^{t'}\, r(s_{t'+t}, a_{t'+t})$
  $\widehat{\nabla V} = \sum_{t=0}^{\infty} \gamma^t\, \widehat{Q}(s_t, a_t)\, \nabla \log \pi_\theta(a_t \mid s_t)$
- Lemma [Glynn '90, Williams '92]: this gives an unbiased estimate of the gradient:
  $\mathbb{E}\big[\widehat{\nabla V}\big] = \nabla_\theta V^{\pi_\theta}(s_0)$
- This is the "likelihood ratio" method.
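A sketch of this likelihood-ratio estimator in Python; `grad_log_prob` is assumed to come from the policy class (e.g. the `LinearSoftmaxPolicy` sketch above), and the rollout is truncated at a finite horizon:

```python
import numpy as np

def reinforce_gradient(trajectory, grad_log_prob, gamma):
    """Likelihood-ratio (REINFORCE) gradient estimate from one rollout.

    `trajectory` is a list of (s, a, r) tuples sampled under pi_theta;
    `grad_log_prob(s, a)` returns grad_theta log pi_theta(a|s).
    Illustrative sketch of the estimator on the slide.
    """
    rewards = np.array([r for (_, _, r) in trajectory])
    T = len(rewards)
    discounts = gamma ** np.arange(T)
    # Q_hat(s_t, a_t) = sum_{t' >= t} gamma^(t'-t) r_{t'} (reward-to-go)
    q_hat = np.array([
        (discounts[: T - t] * rewards[t:]).sum() for t in range(T)
    ])
    grad = 0.0
    for t, (s, a, _) in enumerate(trajectory):
        grad = grad + discounts[t] * q_hat[t] * grad_log_prob(s, a)
    return grad  # unbiased, up to the horizon truncation
```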
Back to the softmax policy class
- $\pi_\theta(a \mid s) \propto \exp(\theta_{s,a})$
- Expression for the softmax class:
  $\frac{\partial V^{\pi_\theta}(s_0)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)$
- What might be making gradient estimation difficult here? (Hint: when does gradient descent "effectively" stop?)
Part-1B: Preconditioning and the Natural Policy Gradient
A closer look at Natural Policy Gradient (NPG)
- Practice: (almost) all methods are gradient based, usually variants of the Natural Policy Gradient [K. '01]; TRPO [Schulman et al. '15]; PPO [Schulman et al. '17].
- NPG warps the distance metric to stretch the corners out (using the Fisher information metric), moving "more" near the boundaries. The update is:
  $F(\theta) = \mathbb{E}_{(s,a) \sim d^{\pi_\theta}}\left[\nabla \log \pi_\theta(a \mid s)\, \big(\nabla \log \pi_\theta(a \mid s)\big)^\top\right]$
  $\theta \leftarrow \theta + \eta\, F(\theta)^{-1}\, \nabla V^{\pi_\theta}(s_0)$
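A sketch of one NPG step with a sampled Fisher matrix; the damping term and the precomputed advantage estimates are practical assumptions for illustration, not part of the slide's update:

```python
import numpy as np

def npg_update(theta, samples, advantages, grad_log_prob, lr=0.1, damping=1e-3):
    """One natural policy gradient step (illustrative sketch).

    `samples` are (s, a) pairs drawn from the visitation measure of
    pi_theta, with matching advantage estimates `advantages`; both are
    assumed to be produced elsewhere (e.g. by Monte Carlo rollouts).
    """
    advantages = np.asarray(advantages)
    G = np.stack([grad_log_prob(s, a) for (s, a) in samples])  # [N, d]
    # Sampled Fisher matrix: E[grad_logpi grad_logpi^T] + damping * I
    F = G.T @ G / len(samples) + damping * np.eye(G.shape[1])
    # Sampled policy gradient: E[A(s, a) grad_logpi(a|s)] (up to 1/(1-gamma))
    g = (advantages[:, None] * G).mean(axis=0)
    return theta + lr * np.linalg.solve(F, g)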
TRPO (Trust Region Policy Optimization)
- TRPO [Schulman et al. '15] (related: PPO [Schulman et al. '17]): move while staying "close" in KL to the previous policy:
  $\theta^{(t+1)} = \arg\max_\theta\, V^{\pi_\theta}(s_0) \quad \text{s.t.} \quad \mathbb{E}_{s \sim d^{(t)}}\left[\mathrm{KL}\big(\pi_\theta(\cdot \mid s)\, \|\, \pi_{\theta^{(t)}}(\cdot \mid s)\big)\right] \le \delta$
- NPG $\approx$ TRPO: they are first-order equivalent (and have the same practical behavior).
NPG intuition. But first...
- NPG as preconditioning:
  $\theta \leftarrow \theta + \eta\, F(\theta)^{-1}\, \nabla V^{\pi_\theta}(s_0)$
  or, equivalently,
  $\theta \leftarrow \theta + \frac{\eta}{1-\gamma}\, \mathbb{E}\left[\nabla \log \pi_\theta(a \mid s)\, \big(\nabla \log \pi_\theta(a \mid s)\big)^\top\right]^{-1} \mathbb{E}\left[\nabla \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right]$
- What does the following problem remind you of? $\min_w \mathbb{E}\left[(x^\top w - y)^2\right]$, whose minimizer is $\mathbb{E}[x x^\top]^{-1}\, \mathbb{E}[x y]$.
- What is NPG trying to approximate? A least-squares fit: regress the advantage onto the features $\nabla \log \pi_\theta(a \mid s)$.
Equivalent Update Rule (for the softmax)
- Take the best linear fit of $\nabla V^{\pi_\theta}$ in "policy space" features: for the tabular softmax class this gives $w_{s,a} = A^{\pi_\theta}(s, a)$.
- Using the NPG update rule:
  $\theta_{s,a} \leftarrow \theta_{s,a} + \frac{\eta}{1-\gamma}\, A^{\pi_\theta}(s, a)$
- And so an equivalent update rule to NPG is:
  $\pi_{\theta + \eta \Delta\theta}(a \mid s) \propto \pi_\theta(a \mid s)\, \exp\left(\eta\, A^{\pi_\theta}(s, a)/(1-\gamma)\right)$
- What algorithm does this remind you of?
- Questions: convergence? The general case, with approximation?
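This is a multiplicative-weights (exponentiated) update, which is what the mirror descent analysis in Part 2 exploits. A tabular sketch of the equivalent update, assuming the advantage table is available (e.g. computed exactly from known dynamics):

```python
import numpy as np

def npg_softmax_step(pi, A, eta, gamma):
    """Soft policy iteration form of NPG for the tabular softmax class.

    pi: current policy, shape [S, A]; A: advantage table A^pi, [S, A].
    Returns pi'(a|s) proportional to pi(a|s) * exp(eta * A(s,a) / (1-gamma)).
    """
    logits = np.log(pi) + eta * A / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```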
But does gradient descent even work in RL?
[Figure: optimization landscapes, Supervised Learning vs. Reinforcement Learning]
What about approximation? Stay tuned!!
Part-2: Convergence and Approximation
The Optimization Landscape
Supervised Learning:
- Gradient descent tends to "just work" in practice and is not sensitive to initialization.
- Saddle points are not a problem.
Reinforcement Learning:
- Local search depends on initialization in many real problems, due to "very" flat regions.
- Gradients can be exponentially small in the "horizon".
RL and the vanishing gradient problem
- Random initialization has "very" flat regions in real problems (lack of "exploration"); this is a landscape/optimization issue (also a statistical issue if we used random init).
- Lemma [Agarwal, Lee, K., Mahajan '19]: with random init, all $k$-th order gradients are exponentially small in $H$ in magnitude, for up to $k \le H / \ln H$ orders, where $H = 1/(1-\gamma)$.
Prior work: the explore/exploit tradeoff
- [Thrun '92]: random search does not find the reward quickly.
- (theory) Balancing the explore/exploit tradeoff: [Kearns & Singh '02], the $E^3$ near-optimal algorithm.
- Sample complexity: [K. '03, Azar '17]
- Model free: [Strehl et al. '06; Dann and Brunskill '15; Szita & Szepesvari '10; Lattimore et al. '14; Jin et al. '18]
Part 2: Understanding the convergence properties of the (NPG) policy gradient methods!
§ A: Convergence: let's look at the tabular/"softmax" case
§ B: Approximation: "linear" policies and neural nets
NPG: back to the "soft" policy iteration interpretation
- Remember the softmax policy class: $\pi_\theta(a \mid s) \propto \exp(\theta_{s,a})$, which has $SA$ params.
- At iteration $t$, the NPG update rule
  $\theta^{(t+1)} = \theta^{(t)} + \eta\, F(\theta^{(t)})^{-1}\, \nabla V^{(t)}(s_0)$
  is equivalent to a "soft" (exact) policy iteration update rule:
  $\pi^{(t+1)}(a \mid s) \propto \pi^{(t)}(a \mid s)\, \exp\left(\eta\, A^{(t)}(s, a)/(1-\gamma)\right)$
- What happens for this non-convex update rule?
Part-2A: Global Convergence
Provable Global Convergence of NPG
Theorem [Agarwal, Lee, K., Mahajan 2019]: For the softmax policy class, with $\eta = (1-\gamma)^2 \log A$, we have after $T$ iterations:
$V^{(T)}(s_0) \ge V^\star(s_0) - \frac{2}{(1-\gamma)^2\, T}$
- Dimension-free iteration complexity! (No dependence on $S$, $A$.)
- Also a "FAST RATE"!
- Even though the problem is non-convex, a mirror descent analysis applies. (Analysis idea from [Even-Dar, K., Mansour 2009].)
- What about approximate/sampled gradients and large state spaces?
Notes: Potentials and Progress?
But first, the "Performance Difference Lemma"
Lemma [K. '02]: a characterization of the performance gap between any two policies $\pi$ and $\pi'$:
$V^\pi(s_0) - V^{\pi'}(s_0) = \mathbb{E}_{s_t, a_t \sim \pi \mid s_0}\left[\sum_{t=0}^{\infty} \gamma^t\, A^{\pi'}(s_t, a_t)\right] = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^\pi_{s_0}}\left[A^{\pi'}(s, a)\right]$
Mirror Descent Gives a Proof! (even though it is non-convex)
$\mathbb{E}_{s \sim d^\star}\left[\mathrm{KL}(\pi^\star_s \,\|\, \pi^{(t)}_s) - \mathrm{KL}(\pi^\star_s \,\|\, \pi^{(t+1)}_s)\right] = \mathbb{E}_{s \sim d^\star} \sum_a \pi^\star(a \mid s) \log \frac{\pi^{(t+1)}(a \mid s)}{\pi^{(t)}(a \mid s)}$
$= \frac{\eta}{1-\gamma}\, \mathbb{E}_{s \sim d^\star} \sum_a \pi^\star(a \mid s)\, A^{(t)}(s, a) - \mathbb{E}_{s \sim d^\star} \log Z_t(s)$
$= \eta\, \big(V^\star(s_0) - V^{(t)}(s_0)\big) - \mathbb{E}_{s \sim d^\star} \log Z_t(s)$
(using the performance difference lemma in the last step)
Notes: are we making progress?
Re-arranging:
$V^\star(s_0) - V^{(t)}(s_0) = \frac{1}{\eta}\, \mathbb{E}_{s \sim d^\star}\left[\mathrm{KL}(\pi^\star_s \,\|\, \pi^{(t)}_s) - \mathrm{KL}(\pi^\star_s \,\|\, \pi^{(t+1)}_s)\right] + \frac{1}{\eta}\, \mathbb{E}_{s \sim d^\star}\left[\log Z_t(s)\right]$
Understanding progress:
$V^\star(s_0) - V^{(T-1)}(s_0) \le \frac{1}{T} \sum_{t=0}^{T-1} \big(V^\star(s_0) - V^{(t)}(s_0)\big)$
$= \frac{1}{\eta T}\, \mathbb{E}_{s \sim d^\star}\left[\mathrm{KL}(\pi^\star_s \,\|\, \pi^{(0)}_s) - \mathrm{KL}(\pi^\star_s \,\|\, \pi^{(T)}_s)\right] + \frac{1}{\eta T} \sum_{t=0}^{T-1} \mathbb{E}_{s \sim d^\star}\left[\log Z_t(s)\right]$
$\le \frac{\log A}{\eta T} + \frac{1}{\eta T} \sum_{t=0}^{T-1} \mathbb{E}_{s \sim d^\star}\left[\log Z_t(s)\right]$
A slow rate proof sketch
The key lemma for the fast rate:
$\mathbb{E}_{s \sim \mu}\left[\log Z_t(s)\right] \le \frac{\eta}{1-\gamma}\, \mathbb{E}_{s \sim \mu}\left[V^{(t+1)}(s) - V^{(t)}(s)\right]$
The fast rate proof!
$V^\star(s_0) - V^{(T-1)}(s_0) \le \frac{\log A}{\eta T} + \frac{1}{\eta T} \sum_{t=0}^{T-1} \mathbb{E}_{s \sim d^\star}\left[\log Z_t(s)\right]$
$\le \frac{\log A}{\eta T} + \frac{1}{(1-\gamma)\, T} \sum_{t=0}^{T-1} \big(V^{(t+1)}(d^\star) - V^{(t)}(d^\star)\big)$
$= \frac{\log A}{\eta T} + \frac{V^{(T)}(d^\star) - V^{(0)}(d^\star)}{(1-\gamma)\, T}$
$\le \frac{\log A}{\eta T} + \frac{1}{(1-\gamma)^2\, T}$
Part-2B: Approximation (and statistics)
Remember our policy classes:
- $\pi_\theta(a \mid s)$ is the probability of action $a$ given $s$, parameterized by $\theta$, with $\pi_\theta(a \mid s) \propto \exp(f_\theta(s, a))$.
- Softmax policy class: $f_\theta(s, a) = \theta_{s,a}$
- Linear policy class: $f_\theta(s, a) = \theta \cdot \phi(s, a)$, where $\phi(s, a) \in \mathbb{R}^d$
- Neural policy class: $f_\theta(s, a)$ is a neural network
OpenAI: dexterous hand manipulation, not far off?
- Trained with "domain randomization".
- Basically: the state visitation measure $d(s)$ was diverse.
Policy search algorithms: exploration and start state-measures
- Optimize from a diverse start-state measure $\mu$: $\max_\theta\, \mathbb{E}_{s_0 \sim \mu}\left[V^{\pi_\theta}(s_0)\right]$
- Idea: reweighting by a diverse distribution $\mu$ handles the "vanishing gradient" problem.
- There is a sense in which this reweighting is related to a "condition number".
- Related theory: [K. & Langford '02], [K. '03]. Conservative Policy Iteration (CPI) has the strongest provable guarantees, in terms of $\mu$ along with the error of a "supervised learning" black box.
- Other "reductions to SL": [Bagnell et al. '04], [Scherer & Geist '14], [Geist et al. '19], etc.
- Also helpful for imitation learning: [Ross et al. '11]; [Ross & Bagnell '14]; [Sun et al. '17]
NPG for the linear policy class
- Now: $\pi_\theta(a \mid s) \propto \exp(\theta \cdot \phi_{s,a})$
- Take the best linear fit in "policy space" features:
  $w^{(t)} = \arg\min_w\, \mathbb{E}_{s_0 \sim \mu}\, \mathbb{E}_{(s,a) \sim d^{(t)}_{s_0}}\left[\big(w \cdot \phi_{s,a} - A^{(t)}(s, a)\big)^2\right]$
- $\mu$ is our start-state distribution, hopefully with "coverage".
- Define $\widehat{A}^{(t)}(s, a) = w^{(t)} \cdot \phi_{s,a}$; the NPG update is equivalent to:
  $\pi^{(t+1)}(a \mid s) \propto \pi^{(t)}(a \mid s)\, \exp\left(\eta\, \widehat{A}^{(t)}(s, a)/(1-\gamma)\right)$
- This is like a soft "approximate" policy iteration step.
Sample Based NPG, linear case
- Sample trajectories: at iteration $t$, draw a start state $s_0 \sim \mu$, then follow $\pi^{(t)}$.
- Now do regression on this sampled data:
  $\widehat{w} = \arg\min_w\, \widehat{\mathbb{E}}_{(s,a)}\left[\big(w \cdot \phi_{s,a} - \widehat{A}^{(t)}(s, a)\big)^2\right]$
- Define $\widehat{A}(s, a) = \widehat{w} \cdot \phi_{s,a}$.
- And so an equivalent update rule to NPG is:
  $\pi^{(t+1)}(a \mid s) \propto \pi^{(t)}(a \mid s)\, \exp\left(\eta\, \widehat{A}(s, a)/(1-\gamma)\right)$
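Putting the pieces together, a hedged sketch of one sample-based NPG iteration for the linear class; the Monte Carlo advantage estimates and the feature map `phi` are assumed to be produced elsewhere (e.g. via the rollout sketches earlier):

```python
import numpy as np

def sample_based_npg_step(theta, rollouts, phi, eta, gamma):
    """One sample-based NPG step for the linear policy class (sketch).

    `rollouts` is a list of (s, a, A_hat) triples: state-action pairs
    visited under pi^{(t)}, with Monte Carlo advantage estimates A_hat.
    """
    X = np.stack([phi(s, a) for (s, a, _) in rollouts])   # [N, d]
    y = np.array([a_hat for (_, _, a_hat) in rollouts])   # [N]
    # Regression step: w_hat = argmin_w sum_i (w . phi_i - A_hat_i)^2
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Parameter update theta <- theta + (eta / (1-gamma)) * w_hat, which
    # matches the exponentiated update pi' ~ pi * exp(eta * A_hat / (1-gamma)).
    return theta + eta / (1.0 - gamma) * w_hat
```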
Guarantees: NPG for linear policy classes (realizability)
- Suppose that $A^{(t)}(s, a)$ is a linear function in $\phi_{s,a}$.
- Supervised learning error: suppose we have bounded regression error, say due to sampling:
  $\mathbb{E}\left[\big(\widehat{A}(s, a) - A^{(t)}(s, a)\big)^2\right] \le \epsilon$
- Relative condition number (to the opt state-action measure $d^\star$, starting from $\mu$):
  $\kappa = \max_x\, \frac{\mathbb{E}_{(s,a) \sim d^\star}\left[(\phi_{s,a} \cdot x)^2\right]}{\mathbb{E}_{(s,a) \sim \mu}\left[(\phi_{s,a} \cdot x)^2\right]}$
Theorem [Agarwal, Lee, K., Mahajan 2019]: $A$: # actions, $H$: horizon $= 1/(1-\gamma)$. After $T$ iterations, for all $s_0$, the NPG algorithm satisfies:
$V^\star(s_0) - V^{(T)}(s_0) \le H\, \sqrt{\frac{2 \log A}{T}} + H^2 \sqrt{\kappa\, A\, \epsilon}$
Sample Based NPG, neural case
- Now: $\pi_\theta(a \mid s) \propto \exp(f_\theta(s, a))$
- Sampling: at iteration $t$, sample $s_0 \sim \mu$ and follow $\pi^{(t)}$.
- Supervised learning/regression:
  $\widehat{w} = \arg\min_w\, \widehat{\mathbb{E}}_{(s,a)}\left[\big(w \cdot \nabla_\theta f_\theta(s, a) - \widehat{A}^{(t)}(s, a)\big)^2\right]$
- Define $\widehat{A}(s, a) = \widehat{w} \cdot \nabla_\theta f_\theta(s, a)$.
- The NPG update is:
  $\pi^{(t+1)}(a \mid s) \propto \pi^{(t)}(a \mid s)\, \exp\left(\eta\, \widehat{A}(s, a)/(1-\gamma)\right)$
Guarantees: NPG for neural policy classes (realizability)
- Suppose that $A^{(t)}(s, a)$ is a linear function in $\nabla_\theta f_\theta(s, a)$.
- Supervised learning error: suppose we have bounded regression error, say due to sampling:
  $\mathbb{E}\left[\big(\widehat{A}(s, a) - A^{(t)}(s, a)\big)^2\right] \le \epsilon$
- Relative condition number (to the opt state-action measure $d^\star$, starting from $\mu$):
  $\kappa = \max_x\, \frac{\mathbb{E}_{(s,a) \sim d^\star}\left[(\nabla_\theta f_\theta(s, a) \cdot x)^2\right]}{\mathbb{E}_{(s,a) \sim \mu}\left[(\nabla_\theta f_\theta(s, a) \cdot x)^2\right]}$
Theorem [Agarwal, Lee, K., Mahajan 2019]: $A$: # actions, $H$: horizon $= 1/(1-\gamma)$. After $T$ iterations, for all $s_0$, the NPG algorithm satisfies:
$V^\star(s_0) - V^{(T)}(s_0) \le H\, \sqrt{\frac{2 \log A}{T}} + H^2 \sqrt{\kappa\, A\, \epsilon}$
- Related: NTK TRPO analysis [Liu et al. '19]
Thank you!
- Today: mathematical foundations of policy gradient methods.
- With "coverage", policy gradients have the strongest theoretical guarantees and are practically effective!
- New directions/not discussed: design of good exploratory distributions $\mu$; relations to transfer learning and "distribution shift".
RL is a very relevant area, both now and in the future! With some basics, please participate!
Some details for the fast rate!
$V^{(t+1)}(\mu) - V^{(t)}(\mu) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{(t+1)}_\mu} \sum_a \pi^{(t+1)}(a \mid s)\, A^{(t)}(s, a)$
$= \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}_\mu} \sum_a \pi^{(t+1)}(a \mid s) \log \frac{\pi^{(t+1)}(a \mid s)\, Z_t(s)}{\pi^{(t)}(a \mid s)}$
$= \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}_\mu} \mathrm{KL}\big(\pi^{(t+1)}_s \,\|\, \pi^{(t)}_s\big) + \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}_\mu} \log Z_t(s)$
$\ge \frac{1}{\eta}\, \mathbb{E}_{s \sim d^{(t+1)}_\mu} \log Z_t(s) \ge \frac{1-\gamma}{\eta}\, \mathbb{E}_{s \sim \mu} \log Z_t(s)$