Lecture 11: Fast Reinforcement Learning (with many slides from or derived from David Silver)

Lecture 11: Fast Reinforcement Learning
Emma Brunskill
CS234 Reinforcement Learning, Winter 2018
With many slides from or derived from David Silver

Class Structure
- Last time: Midterm
- This time: Fast Learning
- Next time: Batch RL

Recall: We Want RL Algorithms that Perform
- Optimization
- Delayed consequences
- Exploration
- Generalization
- And do it all statistically and computationally efficiently

Consider Montezuma's Revenge
- Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation"
- Enormously better than standard DQN with an $\epsilon$-greedy approach
- Uses the principle of optimism under uncertainty, which we will see today

Other Areas: Health, Education, ...
- Asymptotic convergence to a good/optimal policy is not enough

Table of Contents
1. Metrics for evaluating RL algorithms
2. Exploration and Exploitation
3. Principles for RL Exploration
4. Multi-Armed Bandits
5. MDPs
6. Principles for RL Exploration

Performance Criteria of RL Algorithms
- Empirical performance
- Convergence (to something ...)
- Asymptotic convergence to the optimal policy
- Finite sample guarantees: probably approximately correct
- Regret (with respect to optimal decisions)
- Optimal decisions given the information available
- PAC uniform

Strategic Exploration
- To get stronger guarantees on performance, we need strategic exploration

Exploration vs. Exploitation Dilemma
- Online decision-making involves a fundamental choice:
  - Exploitation: make the best decision given current information
  - Exploration: gather more information
- The best long-term strategy may involve short-term sacrifices
- Gather enough information to make the best overall decisions

Examples
- Restaurant selection: go off-campus, or eat at Treehouse (again)
- Online advertisements: show the most successful ad, or show a different ad
- Oil drilling: drill at the best known location, or drill at a new location
- Game playing: play the move you believe is best, or play an experimental move

Principles
- Naive Exploration
- Optimistic Initialization
- Optimism in the Face of Uncertainty
- Probability Matching
- Information State Search

MABs
- We will introduce the various principles for multi-armed bandits (MABs) first, instead of for generic reinforcement learning
- MABs are a subclass of reinforcement learning
- Simpler (as we will see shortly)

Multi-Armed Bandits
- A multi-armed bandit is a tuple $(\mathcal{A}, \mathcal{R})$
- $\mathcal{A}$: known set of $m$ actions (arms)
- $\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards
- At each step $t$ the agent selects an action $a_t \in \mathcal{A}$
- The environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
- Goal: maximize cumulative reward $\sum_{\tau=1}^{t} r_\tau$
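
The following is a minimal sketch (not from the lecture) of a multi-armed bandit simulator; the Bernoulli reward distributions and the particular arm means are illustrative assumptions, used by the later sketches in this section.

```python
import numpy as np

class BernoulliBandit:
    """Minimal multi-armed bandit: m arms, each with an unknown
    Bernoulli reward distribution R^a(r) = P[r | a]."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)   # true (hidden) mean reward per arm
        self.rng = np.random.default_rng(seed)

    @property
    def num_arms(self):
        return len(self.means)

    def pull(self, a):
        """Selecting action a_t returns a reward r_t ~ R^{a_t}."""
        return float(self.rng.random() < self.means[a])

# Example: 3 arms with illustrative means; arm 2 is optimal.
bandit = BernoulliBandit(means=[0.3, 0.5, 0.7])
```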

Greedy Algorithm
- We consider algorithms that estimate $\hat{Q}_t(a) \approx Q(a)$
- Estimate the value of each action by Monte-Carlo evaluation:
  $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \mathbf{1}(a_\tau = a)$
- The greedy algorithm selects the action with the highest value:
  $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$
- Greedy can lock onto a suboptimal action, forever

$\epsilon$-Greedy Algorithm
- The $\epsilon$-greedy algorithm proceeds as follows:
  - With probability $1 - \epsilon$ select $a = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$
  - With probability $\epsilon$ select a random action
- Will always be making a sub-optimal decision an $\epsilon$ fraction of the time
- Already used this in prior homeworks
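
A minimal sketch of $\epsilon$-greedy with sample-mean (Monte-Carlo) value estimates, using the BernoulliBandit simulator above; the values of epsilon, T, and the seed are illustrative choices, not from the lecture.

```python
import numpy as np

def epsilon_greedy(bandit, epsilon=0.1, T=1000, seed=0):
    """Epsilon-greedy with incremental Monte-Carlo value estimates."""
    rng = np.random.default_rng(seed)
    m = bandit.num_arms
    N = np.zeros(m)        # N_t(a): pull counts
    Q = np.zeros(m)        # Q_hat_t(a): sample-mean reward estimates
    total_reward = 0.0
    for t in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(m))       # explore: random arm
        else:
            a = int(np.argmax(Q))          # exploit: greedy arm
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental Monte-Carlo update
        total_reward += r
    return Q, N, total_reward
```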

Optimistic Initialization
- Simple and practical idea: initialize $\hat{Q}(a)$ to a high value
- Update the action value by incremental Monte-Carlo evaluation, starting with $N(a) > 0$:
  $\hat{Q}_t(a_t) = \hat{Q}_{t-1}(a_t) + \frac{1}{N_t(a_t)}\left(r_t - \hat{Q}_{t-1}(a_t)\right)$
- Encourages systematic exploration early on
- But can still lock onto a suboptimal action
- Depends on how high we initialize $\hat{Q}$
- Check your understanding: what is the downside to initializing $\hat{Q}$ too high?
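
A small variation of the sketch above showing optimistic initialization with purely greedy selection; the initial value q_init and the choice of starting counts are illustrative assumptions (q_init should be at least the largest possible reward).

```python
import numpy as np

def greedy_optimistic_init(bandit, q_init=5.0, T=1000):
    """Greedy selection with optimistically initialized value estimates."""
    m = bandit.num_arms
    N = np.ones(m)               # start with N(a) > 0 so the update is defined
    Q = np.full(m, q_init)       # optimistic initial values
    for t in range(T):
        a = int(np.argmax(Q))                    # purely greedy selection
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                # estimates decay toward true means
    return Q, N
```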

Decaying $\epsilon_t$-Greedy Algorithm
- Pick a decay schedule for $\epsilon_1, \epsilon_2, \ldots$
- Consider the following schedule:
  $c > 0$
  $d = \min_{a \,:\, \Delta_a > 0} \Delta_a$
  $\epsilon_t = \min\left\{1, \frac{c\,|\mathcal{A}|}{d^2 t}\right\}$

How to Compare These Methods?
- Empirical performance
- Convergence (to something ...)
- Asymptotic convergence to the optimal policy
- Finite sample guarantees: probably approximately correct
- Regret (with respect to optimal decisions)
  - A very common criterion for bandit algorithms
  - Also frequently considered for reinforcement learning methods
- Optimal decisions given the information available
- PAC uniform

Regret
- The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$
- The optimal value is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$
- The regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$
- The total regret is the total opportunity loss:
  $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} V^* - Q(a_\tau)\right]$
- Maximizing cumulative reward is equivalent to minimizing total regret

Evaluating Regret
- The count $N_t(a)$ is the expected number of selections for action $a$
- The gap $\Delta_a$ is the difference in value between action $a$ and the optimal action $a^*$: $\Delta_a = V^* - Q(a)$
- Regret is a function of gaps and counts:
  $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} V^* - Q(a_\tau)\right]$
  $\;\; = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,(V^* - Q(a))$
  $\;\; = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,\Delta_a$
- A good algorithm ensures small counts for large gaps
- But: gaps are not known
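
A tiny sketch of the decomposition above, evaluated on the simulator from earlier (where the true means, and hence the gaps, happen to be known because we built the bandit ourselves); the function name and usage are illustrative.

```python
import numpy as np

def total_regret(bandit, counts):
    """Regret decomposition L_t = sum_a N_t(a) * Delta_a, evaluated with
    the true means (known here only because we built the simulator)."""
    v_star = bandit.means.max()
    gaps = v_star - bandit.means          # Delta_a = V* - Q(a)
    return float(np.dot(counts, gaps))

# e.g. using the counts N returned by epsilon_greedy(...) above:
# print(total_regret(bandit, N))
```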

Types of Regret Bounds
- Problem independent: bound how regret grows as a function of $T$, the total number of time steps the algorithm operates for
- Problem dependent: bound regret as a function of the number of times each arm is pulled and the gap between the reward of the pulled arm and the true optimal arm

"Good": Sublinear or Below Regret
- Explore forever: linear total regret
- Explore never: linear total regret
- Is it possible to achieve sublinear regret?

Greedy Bandit Algorithms and Optimistic Initialization
- Greedy: linear total regret
- Constant $\epsilon$-greedy: linear total regret
- Decaying $\epsilon$-greedy: sublinear regret, but the schedule for decaying $\epsilon$ requires knowledge of the gaps, which are unknown
- Optimistic initialization: sublinear regret if values are initialized sufficiently optimistically, else linear regret
- Check your understanding: why does fixed $\epsilon$-greedy have linear regret? (Do a proof sketch)

Lower Bound
- Use a lower bound to determine how hard this problem is
- The performance of any algorithm is determined by the similarity between the optimal arm and the other arms
- Hard problems have similar-looking arms with different means
- This is described formally by the gap $\Delta_a$ and the similarity in distributions $KL(\mathcal{R}^a \| \mathcal{R}^{a^*})$
- Theorem (Lai and Robbins): asymptotic total regret is at least logarithmic in the number of steps
  $\lim_{t \to \infty} L_t \geq \log t \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{KL(\mathcal{R}^a \| \mathcal{R}^{a^*})}$

Optimism in the Face of Uncertainty
- Which action should we pick?
- Choose arms that could be good
- Intuitively, choosing an arm with a potentially high mean reward will either lead to:
  - Getting high reward: if the arm really has a high mean reward
  - Learning something: if the arm really has a lower mean reward, pulling it will (in expectation) reduce its estimated average reward and the uncertainty over its value

Upper Confidence Bounds
- Estimate an upper confidence $\hat{U}_t(a)$ for each action value, such that $Q(a) \leq \hat{Q}_t(a) + \hat{U}_t(a)$ with high probability
- This depends on the number of times $N_t(a)$ that $a$ has been selected:
  - Small $N_t(a)$: large $\hat{U}_t(a)$ (estimated value is uncertain)
  - Large $N_t(a)$: small $\hat{U}_t(a)$ (estimated value is accurate)
- Select the action maximizing the upper confidence bound (UCB):
  $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a) + \hat{U}_t(a)$

Hoeffding's Inequality
- Theorem (Hoeffding's Inequality): let $X_1, \ldots, X_t$ be i.i.d. random variables in $[0, 1]$, and let $\bar{X}_t = \frac{1}{t}\sum_{\tau=1}^{t} X_\tau$ be the sample mean. Then
  $\mathbb{P}\left[\mathbb{E}[X] > \bar{X}_t + u\right] \leq \exp(-2 t u^2)$
- Applying Hoeffding's Inequality to the rewards of the bandit:
  $\mathbb{P}\left[Q(a) > \hat{Q}_t(a) + U_t(a)\right] \leq \exp(-2 N_t(a) U_t(a)^2)$

Calculating UCB
- Pick a probability $p$ that the true value exceeds the UCB
- Now solve for $U_t(a)$:
  $\exp(-2 N_t(a) U_t(a)^2) = p$
  $U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$
- Reduce $p$ as we observe more rewards, e.g. $p = t^{-4}$
- This ensures we select the optimal action as $t \to \infty$:
  $U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}$

UCB1
- This leads to the UCB1 algorithm:
  $a_t = \arg\max_{a \in \mathcal{A}} \left[ Q(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right]$
- Theorem: the UCB algorithm achieves logarithmic asymptotic total regret
  $\lim_{t \to \infty} L_t \leq 8 \log t \sum_{a \,:\, \Delta_a > 0} \Delta_a$
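
A sketch of UCB1 on the simulator above; the horizon T and the initialization (pull each arm once so every count is positive) are standard but illustrative choices.

```python
import numpy as np

def ucb1(bandit, T=1000):
    """UCB1: pull each arm once, then pick argmax_a Q(a) + sqrt(2 log t / N(a))."""
    m = bandit.num_arms
    N = np.zeros(m)
    Q = np.zeros(m)
    for a in range(m):                           # initialize: pull each arm once
        Q[a] = bandit.pull(a)
        N[a] = 1
    for t in range(m + 1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t) / N)     # exploration bonus U_t(a)
        a = int(np.argmax(Q + bonus))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
    return Q, N
```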

Check Your Understanding
- An alternative would be to always select the arm with the highest lower bound
- Why can this yield linear regret?
- Consider a two-arm case for simplicity

Bayesian Bandits
- So far we have made no assumptions about the reward distribution $\mathcal{R}$, except bounds on the rewards
- Bayesian bandits exploit prior knowledge of rewards, $p[\mathcal{R}]$
- They compute a posterior distribution over rewards $p[\mathcal{R} \mid h_t]$, where $h_t = (a_1, r_1, \ldots, a_{t-1}, r_{t-1})$
- Use the posterior to guide exploration:
  - Upper confidence bounds (Bayesian UCB)
  - Probability matching (Thompson sampling)
- Better performance if the prior knowledge is accurate

Bayesian UCB Example: Independent Gaussians
- Assume the reward distribution is Gaussian, $\mathcal{R}^a(r) = \mathcal{N}(r; \mu_a, \sigma_a^2)$
- Compute a Gaussian posterior over $\mu_a$ and $\sigma_a^2$ (by Bayes rule):
  $p[\mu_a, \sigma_a^2 \mid h_t] \propto p[\mu_a, \sigma_a^2] \prod_{t \,:\, a_t = a} \mathcal{N}(r_t; \mu_a, \sigma_a^2)$
- Pick the action that maximizes the posterior mean plus $c$ standard deviations of $Q(a)$:
  $a_t = \arg\max_{a \in \mathcal{A}} \mu_a + \frac{c\,\sigma_a}{\sqrt{N(a)}}$
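
A sketch of Bayesian UCB for independent Gaussian arms. It makes a simplifying assumption not in the lecture: the observation noise sigma is known, so each mean has a conjugate Normal posterior; mu0, tau0, and the bonus multiplier c are illustrative prior/parameter choices, and `pull(a)` is any callable that samples a reward for arm a.

```python
import numpy as np

def bayesian_ucb_gaussian(pull, num_arms, T=1000, c=2.0,
                          mu0=0.0, tau0=1.0, sigma=1.0):
    """Bayesian UCB with a Normal posterior over each arm's mean reward."""
    sums = np.zeros(num_arms)      # sum of observed rewards per arm
    N = np.zeros(num_arms)         # pull counts per arm

    def posterior():
        prec = 1.0 / tau0**2 + N / sigma**2            # posterior precision
        mean = (mu0 / tau0**2 + sums / sigma**2) / prec
        return mean, np.sqrt(1.0 / prec)               # posterior mean, std

    for t in range(T):
        mean, std = posterior()
        a = int(np.argmax(mean + c * std))             # mu_a + c * sigma_a bonus
        sums[a] += pull(a)
        N[a] += 1
    return posterior(), N
```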

Probability Matching
- Again assume we have a parametric distribution over rewards for each arm
- Probability matching selects action $a$ according to the probability that $a$ is the optimal action:
  $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \neq a \mid h_t]$
- Probability matching is optimistic in the face of uncertainty: uncertain actions have a higher probability of being the max
- Can be difficult to compute analytically from the posterior

Thompson Sampling Implements Probability Matching
- Thompson sampling:
  $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \neq a \mid h_t] = \mathbb{E}_{\mathcal{R} \mid h_t}\left[\mathbf{1}\left(a = \arg\max_{a \in \mathcal{A}} Q(a)\right)\right]$
- Use Bayes rule to compute the posterior distribution $p[\mathcal{R} \mid h_t]$
- Sample a reward distribution $\mathcal{R}$ from the posterior
- Compute the action-value function $Q(a) = \mathbb{E}[\mathcal{R}^a]$
- Select the action maximizing the value on the sample, $a_t = \arg\max_{a \in \mathcal{A}} Q(a)$
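
A sketch of Thompson sampling for the Bernoulli bandit above, with Beta(1,1) (uniform) priors as an illustrative assumption. Note it uses the standard convention where alpha tracks 1-rewards and beta tracks 0-rewards, which is the reverse of the convention on the later Bernoulli-bandit slide.

```python
import numpy as np

def thompson_sampling_bernoulli(bandit, T=1000, seed=0):
    """Thompson sampling with independent Beta posteriors per arm."""
    rng = np.random.default_rng(seed)
    m = bandit.num_arms
    alpha = np.ones(m)   # 1 + number of observed 1-rewards
    beta = np.ones(m)    # 1 + number of observed 0-rewards
    for t in range(T):
        theta = rng.beta(alpha, beta)          # sample a mean for each arm
        a = int(np.argmax(theta))              # act greedily w.r.t. the sample
        r = bandit.pull(a)
        alpha[a] += r                          # posterior update
        beta[a] += 1 - r
    return alpha, beta
```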

Thompson Sampling Implements Probability Matching
- Thompson sampling achieves the Lai and Robbins lower bound
- Last checked: bounds for optimism are tighter than for Thompson sampling
- But empirically Thompson sampling can be extremely effective

Relevant Background: Value of Information
- Exploration is useful because it gains information
- Can we quantify the value of information (VOI)?
  - How much reward a decision-maker would be prepared to pay in order to have that information, prior to making a decision
  - Long-term reward after getting the information minus immediate reward

Relevant Background: Value of Information Example
- Consider a bandit where we only get to make a single decision
- An oil company is considering buying the rights to drill in 1 of 5 locations
- 1 of the locations contains 10 million worth of oil, the others 0
- The cost of buying the rights to drill is 2 million
- A seismologist says that, for a fee, they will survey one of the 5 locations and report back definitively whether that location does or does not contain oil
- What is the most we should consider paying the seismologist?

Relevant Background: Value of Information Example
- 1 of the locations contains 10 million worth of oil, the others 0
- The cost of buying the rights to drill is 2 million
- The seismologist will, for a fee, survey one of the 5 locations and report back definitively whether that location does or does not contain oil
- Value of information: expected profit if we ask the seismologist minus expected profit if we don't ask
- Expected profit if we don't ask (guess at random):
  $\frac{1}{5}(10 - 2) + \frac{4}{5}(0 - 2) = 0$
- Expected profit if we ask:
  - If the surveyed location has oil, expected profit is $10 - 2 = 8$
  - If the surveyed location doesn't have oil, expected profit (guessing at random from the other locations) is $\frac{1}{4}(10 - 2) + \frac{3}{4}(-2) = 0.5$
  - Weighting by the probability that we survey the location with oil: $\frac{1}{5} \cdot 8 + \frac{4}{5} \cdot 0.5 = 2$
- VOI: $2 - 0 = 2$
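
A quick check of the arithmetic above (the variable names are only for illustration):

```python
# Value-of-information arithmetic for the oil-drilling example.
n, oil_value, drill_cost = 5, 10.0, 2.0

# Without the survey: guess a location at random.
profit_no_survey = (1/n) * (oil_value - drill_cost) + ((n-1)/n) * (0 - drill_cost)

# With the survey of one location:
profit_if_surveyed_has_oil = oil_value - drill_cost
profit_if_surveyed_dry = (1/(n-1)) * (oil_value - drill_cost) + ((n-2)/(n-1)) * (0 - drill_cost)
profit_survey = (1/n) * profit_if_surveyed_has_oil + ((n-1)/n) * profit_if_surveyed_dry

voi = profit_survey - profit_no_survey
print(profit_no_survey, profit_survey, voi)   # 0.0 2.0 2.0
```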

Relevant Background: Value of Information
- Back to making a sequence of decisions under uncertainty
- Information gain is higher in uncertain situations
- But we need to consider the value of that information
  - Would it change our decisions?
  - Expected utility benefit

Information State Space
- So far we have viewed bandits as a simple, fully observable Markov decision process (where actions don't impact the next state)
- Beautiful idea: frame bandits as a partially observable Markov decision process where the hidden state is the mean reward of each arm
  - The (hidden) state is static
  - Actions are the same as before: pulling an arm
  - Observations: a sample from the reward model given the hidden state
- POMDP planning = optimal bandit learning

Information State Space
- The POMDP belief state / information state $\tilde{s}$ is the posterior over the hidden parameters (e.g. the mean reward of each arm)
- $\tilde{s}$ is a statistic of the history, $\tilde{s} = f(h_t)$
- Each action $a$ causes a transition to a new information state $\tilde{s}'$ (by adding information), with probability $\tilde{\mathcal{P}}^a_{\tilde{s}, \tilde{s}'}$
- This is equivalent to a POMDP, or an MDP $\tilde{\mathcal{M}} = (\tilde{\mathcal{S}}, \mathcal{A}, \tilde{\mathcal{P}}, \mathcal{R}, \gamma)$ in augmented information state space

Bernoulli Bandits
- Consider a Bernoulli bandit, such that $\mathcal{R}^a = \mathcal{B}(\mu_a)$
- e.g. win or lose a game with probability $\mu_a$
- Want to find which arm has the highest $\mu_a$
- The information state is $\tilde{s} = (\alpha, \beta)$
  - $\alpha_a$ counts the pulls of arm $a$ where the reward was 0
  - $\beta_a$ counts the pulls of arm $a$ where the reward was 1

Solving Information State Space Bandits
- We now have an infinite MDP over information states
- This MDP can be solved by reinforcement learning
  - Model-free reinforcement learning (e.g. Q-learning)
  - Bayesian model-based RL (e.g. Gittins indices)
- This approach is known as Bayes-adaptive RL: it finds the Bayes-optimal exploration/exploitation trade-off with respect to the prior distribution
- In other words, it selects actions that maximize expected reward given the information we have so far
- Check your understanding: can an algorithm that optimally solves an information state bandit have non-zero regret? Why or why not?

Bayes-Adaptive Bernoulli Bandits
- Start with a $\text{Beta}(\alpha_a, \beta_a)$ prior over the reward function $\mathcal{R}^a$
- Each time $a$ is selected, update the posterior for $\mathcal{R}^a$:
  - $\text{Beta}(\alpha_a + 1, \beta_a)$ if $r = 0$
  - $\text{Beta}(\alpha_a, \beta_a + 1)$ if $r = 1$
- This defines the transition function $\tilde{\mathcal{P}}$ for the Bayes-adaptive MDP
- The information state $(\alpha, \beta)$ corresponds to the reward model $\text{Beta}(\alpha, \beta)$
- Each state transition corresponds to a Bayesian model update
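
A small sketch of the information-state transition, following the slide's convention (alpha counts reward-0 pulls, beta counts reward-1 pulls); the data structure (a list of pairs) and function names are illustrative.

```python
def bayes_adaptive_step(info_state, a, r):
    """One transition of the Bayes-adaptive Bernoulli bandit.
    info_state is a list of (alpha_a, beta_a) pairs, one per arm."""
    alpha_a, beta_a = info_state[a]
    if r == 0:
        info_state[a] = (alpha_a + 1, beta_a)
    else:
        info_state[a] = (alpha_a, beta_a + 1)
    return info_state

def predictive_prob_reward_one(info_state, a):
    """Under Beta(alpha_a, beta_a) with this convention, the predictive
    probability that arm a pays reward 1 (i.e. the transition probability
    to the beta_a + 1 information state) is beta_a / (alpha_a + beta_a)."""
    alpha_a, beta_a = info_state[a]
    return beta_a / (alpha_a + beta_a)
```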

Gittins Indices for Bernoulli Bandits
- The Bayes-adaptive MDP can be solved by dynamic programming
- The solution is known as the Gittins index
- An exact solution to the Bayes-adaptive MDP is typically intractable; the information state space is too large
- Recent idea: apply simulation-based search (Guez et al. 2012, 2013)
  - Forward search in information state space
  - Using simulations from the current information state

Principles for Strategic Exploration
- The same principles for exploration/exploitation apply to MDPs:
  - Naive Exploration
  - Optimistic Initialization
  - Optimism in the Face of Uncertainty
  - Probability Matching
  - Information State Search

Optimistic Initialization: Model-Free RL
- Initialize the action-value function $Q(s, a)$ to $\frac{r_{\max}}{1 - \gamma}$
- Run your favorite model-free RL algorithm:
  - Monte-Carlo control
  - Sarsa
  - Q-learning
  - etc.
- Encourages systematic exploration of states and actions
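
A sketch of Q-learning with optimistic initialization: every $Q(s, a)$ starts at $r_{\max}/(1-\gamma)$, an upper bound on any discounted return. The env interface (reset/step returning a 3-tuple) and the hyperparameters are assumptions, not from the lecture.

```python
import numpy as np

def optimistic_q_learning(env, num_states, num_actions, r_max,
                          gamma=0.95, alpha=0.1, num_episodes=500):
    """Tabular Q-learning with optimistically initialized Q-values."""
    Q = np.full((num_states, num_actions), r_max / (1.0 - gamma))
    for _ in range(num_episodes):
        s = env.reset()                              # assumed: returns a state index
        done = False
        while not done:
            a = int(np.argmax(Q[s]))                 # greedy is enough: untried
            s_next, r, done = env.step(a)            # actions still look best
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```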

Optimistic Initialization: Model-Based RL
- Construct an optimistic model of the MDP
- Initialize transitions to go to a terminal state with $r_{\max}$ reward
- Solve the optimistic MDP with your favorite planning algorithm
- Encourages systematic exploration of states and actions
- e.g. the RMax algorithm (Brafman and Tennenholtz)

UCB: Model-Free RL
- Maximize the UCB on the action-value function $Q^\pi(s, a)$:
  $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a) + U(s_t, a)$
  - Estimates uncertainty in policy evaluation (easy)
  - Ignores uncertainty from policy improvement
- Maximize the UCB on the optimal action-value function $Q^*(s, a)$:
  $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a) + U_1(s_t, a) + U_2(s_t, a)$
  - Estimates uncertainty in policy evaluation (easy)
  - Plus uncertainty from policy improvement (hard)

Bayesian Model-Based RL
- Maintain a posterior distribution over MDP models
- Estimate both transitions and rewards, $p[\mathcal{P}, \mathcal{R} \mid h_t]$, where $h_t = (s_1, a_1, r_1, \ldots, s_t)$ is the history
- Use the posterior to guide exploration:
  - Upper confidence bounds (Bayesian UCB)
  - Probability matching (Thompson sampling)

Thompson Sampling: Model-Based RL
- Thompson sampling implements probability matching:
  $\pi(s, a \mid h_t) = \mathbb{P}[Q(s, a) > Q(s, a'),\ \forall a' \neq a \mid h_t] = \mathbb{E}_{\mathcal{P}, \mathcal{R} \mid h_t}\left[\mathbf{1}\left(a = \arg\max_{a \in \mathcal{A}} Q(s, a)\right)\right]$
- Use Bayes rule to compute the posterior distribution $p[\mathcal{P}, \mathcal{R} \mid h_t]$
- Sample an MDP $(\mathcal{P}, \mathcal{R})$ from the posterior
- Solve the sampled MDP using your favorite planning algorithm to get $Q^*(s, a)$
- Select the optimal action for the sampled MDP, $a_t = \arg\max_{a \in \mathcal{A}} Q^*(s_t, a)$
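
A compact sketch of one round of model-based Thompson sampling for a tabular MDP, under simplifying assumptions not in the lecture: a Dirichlet posterior over transitions, rewards treated as known, and value iteration as the planner.

```python
import numpy as np

def sample_mdp_and_plan(trans_counts, R, gamma=0.95, n_iters=200, seed=0):
    """One round of model-based Thompson sampling (a sketch).
    trans_counts[s, a, s'] holds observed transition counts; R[s, a] is the
    (assumed known) reward table."""
    rng = np.random.default_rng(seed)
    S, A, _ = trans_counts.shape
    # Sample a transition model P from the Dirichlet(1 + counts) posterior.
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(1.0 + trans_counts[s, a])
    # Solve the sampled MDP by value iteration to get Q*.
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)
        Q = R + gamma * P @ V          # Q(s,a) = R(s,a) + gamma * sum_s' P V(s')
    return Q                            # act greedily: a_t = argmax_a Q[s_t, a]
```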

Information State Search in MDPs
- MDPs can be augmented to include the information state
- Now the augmented state is $(s, \tilde{s})$, where $s$ is the original state within the MDP and $\tilde{s}$ is a statistic of the history (the accumulated information)
- Each action $a$ causes a transition to a new state $s'$ with probability $\mathcal{P}^a_{s, s'}$, and to a new information state $\tilde{s}'$
- This defines an MDP $\tilde{\mathcal{M}}$ in augmented information state space

Bayes-Adaptive MDPs
- The posterior distribution over the MDP model is an information state: $\tilde{s}_t = \mathbb{P}[\mathcal{P}, \mathcal{R} \mid h_t]$
- The augmented MDP over $(s, \tilde{s})$ is called the Bayes-adaptive MDP
- Solve this MDP to find the optimal exploration/exploitation trade-off (with respect to the prior)
- However, the Bayes-adaptive MDP is typically enormous
- Simulation-based search has proven effective (Guez et al., 2012, 2013)

Principles
- Naive Exploration: add noise to the greedy policy (e.g. $\epsilon$-greedy)
- Optimistic Initialization: assume the best until proven otherwise
- Optimism in the Face of Uncertainty: prefer actions with uncertain values
- Probability Matching: select actions according to the probability they are best
- Information State Search: lookahead search incorporating the value of information

Other Evaluation Criteria
- Probably approximately correct: on all but $N$ steps, the algorithm will select an action whose value is near optimal, $Q(s_t, a_t) \geq V^*(s_t) - \epsilon$, with probability at least $1 - \delta$
- $N$ is a polynomial function of the MDP parameters $\left(|S|, |A|, \frac{1}{1-\gamma}, \frac{1}{\delta}, \frac{1}{\epsilon}\right)$
- Bounded "mistakes"
- Many PAC RL algorithms use ideas of optimism under uncertainty

Generalization and Strategic Exploration
- Important topic: combining generalization & strategic exploration
- Many approaches are grounded in the principles outlined here
- Some examples:
  - Optimism under uncertainty: Bellemare et al., NIPS 2016; Ostrovski et al., ICML 2017; Tang et al., NIPS 2017
  - Probability matching: Osband et al., NIPS 2016; Mandel et al., IJCAI 2016

Midterm Results
- The midterm was rigorous
- We will curve the class
- If you got under 30, there are some concepts you haven't demonstrated you understand yet; please reach out to us to brainstorm ways we can help you learn

Class Structure
- Last time: Midterm!
- This time: Exploration and Exploitation
- Next time: Batch RL
