Lecture 11: Fast Reinforcement Learning (with many slides from or derived from David Silver)

Lecture 11: Fast Reinforcement Learning
Emma Brunskill
CS234 Reinforcement Learning, Winter 2018
With many slides from or derived from David Silver

Class Structure
- Last time: Midterm
- This time: Fast Learning
- Next time: Batch RL

Recall: We Want RL Algorithms that Perform
- Optimization
- Delayed consequences
- Exploration
- Generalization
- And do it all statistically and computationally efficiently

Consider Montezuma's Revenge
- Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation"
- Enormously better than standard DQN with an $\epsilon$-greedy approach
- Uses the principle of optimism under uncertainty, which we will see today

Other Areas: Health, Education, ...
- Asymptotic convergence to a good/optimal policy is not enough

Table of Contents
1. Metrics for evaluating RL algorithms
2. Exploration and Exploitation
3. Principles for RL Exploration
4. Multi-Armed Bandits
5. MDPs
6. Principles for RL Exploration

Performance Criteria of RL Algorithms
- Empirical performance
- Convergence (to something ...)
- Asymptotic convergence to the optimal policy
- Finite sample guarantees: probably approximately correct
- Regret (with respect to optimal decisions)
- Optimal decisions given the information available
- PAC uniform

Strategic Exploration
- To get stronger guarantees on performance, we need strategic exploration

Exploration vs. Exploitation Dilemma
- Online decision-making involves a fundamental choice:
  - Exploitation: make the best decision given current information
  - Exploration: gather more information
- The best long-term strategy may involve short-term sacrifices
- Gather enough information to make the best overall decisions

Examples
- Restaurant selection: go off-campus, or eat at Treehouse (again)
- Online advertisements: show the most successful ad, or show a different ad
- Oil drilling: drill at the best known location, or drill at a new location
- Game playing: play the move you believe is best, or play an experimental move

Principles
- Naive Exploration
- Optimistic Initialization
- Optimism in the Face of Uncertainty
- Probability Matching
- Information State Search

MABs
- We will introduce the various principles for multi-armed bandits (MABs) first, instead of for generic reinforcement learning
- MABs are a subclass of reinforcement learning
- Simpler (as we will see shortly)

Multi-Armed Bandits
- A multi-armed bandit is a tuple $(\mathcal{A}, \mathcal{R})$
- $\mathcal{A}$: known set of $m$ actions (arms)
- $\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards
- At each step $t$ the agent selects an action $a_t \in \mathcal{A}$
- The environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
- Goal: maximize cumulative reward $\sum_{\tau=1}^{t} r_\tau$
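
The following is a minimal sketch (not from the lecture) of a multi-armed bandit simulator; the Bernoulli reward distributions and the particular arm means are illustrative assumptions, used by the later sketches in this section.

```python
import numpy as np

class BernoulliBandit:
    """Minimal multi-armed bandit: m arms, each with an unknown
    Bernoulli reward distribution R^a(r) = P[r | a]."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)   # true (hidden) mean reward per arm
        self.rng = np.random.default_rng(seed)

    @property
    def num_arms(self):
        return len(self.means)

    def pull(self, a):
        """Selecting action a_t returns a reward r_t ~ R^{a_t}."""
        return float(self.rng.random() < self.means[a])

# Example: 3 arms with illustrative means; arm 2 is optimal.
bandit = BernoulliBandit(means=[0.3, 0.5, 0.7])
```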

Greedy Algorithm
- We consider algorithms that estimate $\hat{Q}_t(a) \approx Q(a)$
- Estimate the value of each action by Monte-Carlo evaluation:
  $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \mathbf{1}(a_\tau = a)$
- The greedy algorithm selects the action with the highest value:
  $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$
- Greedy can lock onto a suboptimal action, forever

$\epsilon$-Greedy Algorithm
- The $\epsilon$-greedy algorithm proceeds as follows:
  - With probability $1 - \epsilon$ select $a = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$
  - With probability $\epsilon$ select a random action
- Will always be making a sub-optimal decision an $\epsilon$ fraction of the time
- Already used this in prior homeworks
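
A minimal sketch of $\epsilon$-greedy with sample-mean (Monte-Carlo) value estimates, using the BernoulliBandit simulator above; the values of epsilon, T, and the seed are illustrative choices, not from the lecture.

```python
import numpy as np

def epsilon_greedy(bandit, epsilon=0.1, T=1000, seed=0):
    """Epsilon-greedy with incremental Monte-Carlo value estimates."""
    rng = np.random.default_rng(seed)
    m = bandit.num_arms
    N = np.zeros(m)        # N_t(a): pull counts
    Q = np.zeros(m)        # Q_hat_t(a): sample-mean reward estimates
    total_reward = 0.0
    for t in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(m))       # explore: random arm
        else:
            a = int(np.argmax(Q))          # exploit: greedy arm
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental Monte-Carlo update
        total_reward += r
    return Q, N, total_reward
```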

Optimistic Initialization
- Simple and practical idea: initialize $\hat{Q}(a)$ to a high value
- Update the action value by incremental Monte-Carlo evaluation, starting with $N(a) > 0$:
  $\hat{Q}_t(a_t) = \hat{Q}_{t-1}(a_t) + \frac{1}{N_t(a_t)}\left(r_t - \hat{Q}_{t-1}(a_t)\right)$
- Encourages systematic exploration early on
- But can still lock onto a suboptimal action
- Depends on how high we initialize $\hat{Q}$
- Check your understanding: what is the downside to initializing $\hat{Q}$ too high?
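
A small variation of the sketch above showing optimistic initialization with purely greedy selection; the initial value q_init and the choice of starting counts are illustrative assumptions (q_init should be at least the largest possible reward).

```python
import numpy as np

def greedy_optimistic_init(bandit, q_init=5.0, T=1000):
    """Greedy selection with optimistically initialized value estimates."""
    m = bandit.num_arms
    N = np.ones(m)               # start with N(a) > 0 so the update is defined
    Q = np.full(m, q_init)       # optimistic initial values
    for t in range(T):
        a = int(np.argmax(Q))                    # purely greedy selection
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                # estimates decay toward true means
    return Q, N
```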

Decaying $\epsilon_t$-Greedy Algorithm
- Pick a decay schedule for $\epsilon_1, \epsilon_2, \ldots$
- Consider the following schedule:
  $c > 0$
  $d = \min_{a \,:\, \Delta_a > 0} \Delta_a$
  $\epsilon_t = \min\left\{1, \frac{c\,|\mathcal{A}|}{d^2 t}\right\}$

How to Compare These Methods?
- Empirical performance
- Convergence (to something ...)
- Asymptotic convergence to the optimal policy
- Finite sample guarantees: probably approximately correct
- Regret (with respect to optimal decisions)
  - A very common criterion for bandit algorithms
  - Also frequently considered for reinforcement learning methods
- Optimal decisions given the information available
- PAC uniform

Regret
- The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$
- The optimal value is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$
- The regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$
- The total regret is the total opportunity loss:
  $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} V^* - Q(a_\tau)\right]$
- Maximizing cumulative reward is equivalent to minimizing total regret

Evaluating Regret
- The count $N_t(a)$ is the expected number of selections for action $a$
- The gap $\Delta_a$ is the difference in value between action $a$ and the optimal action $a^*$: $\Delta_a = V^* - Q(a)$
- Regret is a function of gaps and counts:
  $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} V^* - Q(a_\tau)\right]$
  $\;\; = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,(V^* - Q(a))$
  $\;\; = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,\Delta_a$
- A good algorithm ensures small counts for large gaps
- But: gaps are not known
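
A tiny sketch of the decomposition above, evaluated on the simulator from earlier (where the true means, and hence the gaps, happen to be known because we built the bandit ourselves); the function name and usage are illustrative.

```python
import numpy as np

def total_regret(bandit, counts):
    """Regret decomposition L_t = sum_a N_t(a) * Delta_a, evaluated with
    the true means (known here only because we built the simulator)."""
    v_star = bandit.means.max()
    gaps = v_star - bandit.means          # Delta_a = V* - Q(a)
    return float(np.dot(counts, gaps))

# e.g. using the counts N returned by epsilon_greedy(...) above:
# print(total_regret(bandit, N))
```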

Types of Regret Bounds
- Problem independent: bound how regret grows as a function of $T$, the total number of time steps the algorithm operates for
- Problem dependent: bound regret as a function of the number of times each arm is pulled and the gap between the reward of the pulled arm and the true optimal arm

"Good": Sublinear or Below Regret
- Explore forever: linear total regret
- Explore never: linear total regret
- Is it possible to achieve sublinear regret?

Greedy Bandit Algorithms and Optimistic Initialization
- Greedy: linear total regret
- Constant $\epsilon$-greedy: linear total regret
- Decaying $\epsilon$-greedy: sublinear regret, but the schedule for decaying $\epsilon$ requires knowledge of the gaps, which are unknown
- Optimistic initialization: sublinear regret if values are initialized sufficiently optimistically, else linear regret
- Check your understanding: why does fixed $\epsilon$-greedy have linear regret? (Do a proof sketch)

Lower Bound
- Use a lower bound to determine how hard this problem is
- The performance of any algorithm is determined by the similarity between the optimal arm and the other arms
- Hard problems have similar-looking arms with different means
- This is described formally by the gap $\Delta_a$ and the similarity in distributions $KL(\mathcal{R}^a \| \mathcal{R}^{a^*})$
- Theorem (Lai and Robbins): asymptotic total regret is at least logarithmic in the number of steps
  $\lim_{t \to \infty} L_t \geq \log t \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{KL(\mathcal{R}^a \| \mathcal{R}^{a^*})}$

Optimism in the Face of Uncertainty
- Which action should we pick?
- Choose arms that could be good
- Intuitively, choosing an arm with a potentially high mean reward will either lead to:
  - Getting high reward: if the arm really has a high mean reward
  - Learning something: if the arm really has a lower mean reward, pulling it will (in expectation) reduce its estimated average reward and the uncertainty over its value

Upper Confidence Bounds
- Estimate an upper confidence $\hat{U}_t(a)$ for each action value, such that $Q(a) \leq \hat{Q}_t(a) + \hat{U}_t(a)$ with high probability
- This depends on the number of times $N_t(a)$ that $a$ has been selected:
  - Small $N_t(a)$: large $\hat{U}_t(a)$ (estimated value is uncertain)
  - Large $N_t(a)$: small $\hat{U}_t(a)$ (estimated value is accurate)
- Select the action maximizing the upper confidence bound (UCB):
  $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a) + \hat{U}_t(a)$

Hoeffding's Inequality
- Theorem (Hoeffding's Inequality): let $X_1, \ldots, X_t$ be i.i.d. random variables in $[0, 1]$, and let $\bar{X}_t = \frac{1}{t}\sum_{\tau=1}^{t} X_\tau$ be the sample mean. Then
  $\mathbb{P}\left[\mathbb{E}[X] > \bar{X}_t + u\right] \leq \exp(-2 t u^2)$
- Applying Hoeffding's Inequality to the rewards of the bandit:
  $\mathbb{P}\left[Q(a) > \hat{Q}_t(a) + U_t(a)\right] \leq \exp(-2 N_t(a) U_t(a)^2)$

Calculating UCB
- Pick a probability $p$ that the true value exceeds the UCB
- Now solve for $U_t(a)$:
  $\exp(-2 N_t(a) U_t(a)^2) = p$
  $U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$
- Reduce $p$ as we observe more rewards, e.g. $p = t^{-4}$
- This ensures we select the optimal action as $t \to \infty$:
  $U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}$

UCB1
- This leads to the UCB1 algorithm:
  $a_t = \arg\max_{a \in \mathcal{A}} \left[ Q(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right]$
- Theorem: the UCB algorithm achieves logarithmic asymptotic total regret
  $\lim_{t \to \infty} L_t \leq 8 \log t \sum_{a \,:\, \Delta_a > 0} \Delta_a$
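
A sketch of UCB1 on the simulator above; the horizon T and the initialization (pull each arm once so every count is positive) are standard but illustrative choices.

```python
import numpy as np

def ucb1(bandit, T=1000):
    """UCB1: pull each arm once, then pick argmax_a Q(a) + sqrt(2 log t / N(a))."""
    m = bandit.num_arms
    N = np.zeros(m)
    Q = np.zeros(m)
    for a in range(m):                           # initialize: pull each arm once
        Q[a] = bandit.pull(a)
        N[a] = 1
    for t in range(m + 1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t) / N)     # exploration bonus U_t(a)
        a = int(np.argmax(Q + bonus))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
    return Q, N
```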

Check Your Understanding
- An alternative would be to always select the arm with the highest lower bound
- Why can this yield linear regret?
- Consider a two-arm case for simplicity

Bayesian Bandits
- So far we have made no assumptions about the reward distribution $\mathcal{R}$, except bounds on the rewards
- Bayesian bandits exploit prior knowledge of rewards, $p[\mathcal{R}]$
- They compute a posterior distribution over rewards $p[\mathcal{R} \mid h_t]$, where $h_t = (a_1, r_1, \ldots, a_{t-1}, r_{t-1})$
- Use the posterior to guide exploration:
  - Upper confidence bounds (Bayesian UCB)
  - Probability matching (Thompson sampling)
- Better performance if the prior knowledge is accurate

Bayesian UCB Example: Independent Gaussians
- Assume the reward distribution is Gaussian, $\mathcal{R}^a(r) = \mathcal{N}(r; \mu_a, \sigma_a^2)$
- Compute a Gaussian posterior over $\mu_a$ and $\sigma_a^2$ (by Bayes rule):
  $p[\mu_a, \sigma_a^2 \mid h_t] \propto p[\mu_a, \sigma_a^2] \prod_{t \,:\, a_t = a} \mathcal{N}(r_t; \mu_a, \sigma_a^2)$
- Pick the action that maximizes the posterior mean plus $c$ standard deviations of $Q(a)$:
  $a_t = \arg\max_{a \in \mathcal{A}} \mu_a + \frac{c\,\sigma_a}{\sqrt{N(a)}}$
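
A sketch of Bayesian UCB for independent Gaussian arms. It makes a simplifying assumption not in the lecture: the observation noise sigma is known, so each mean has a conjugate Normal posterior; mu0, tau0, and the bonus multiplier c are illustrative prior/parameter choices, and `pull(a)` is any callable that samples a reward for arm a.

```python
import numpy as np

def bayesian_ucb_gaussian(pull, num_arms, T=1000, c=2.0,
                          mu0=0.0, tau0=1.0, sigma=1.0):
    """Bayesian UCB with a Normal posterior over each arm's mean reward."""
    sums = np.zeros(num_arms)      # sum of observed rewards per arm
    N = np.zeros(num_arms)         # pull counts per arm

    def posterior():
        prec = 1.0 / tau0**2 + N / sigma**2            # posterior precision
        mean = (mu0 / tau0**2 + sums / sigma**2) / prec
        return mean, np.sqrt(1.0 / prec)               # posterior mean, std

    for t in range(T):
        mean, std = posterior()
        a = int(np.argmax(mean + c * std))             # mu_a + c * sigma_a bonus
        sums[a] += pull(a)
        N[a] += 1
    return posterior(), N
```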

Probability Matching
- Again assume we have a parametric distribution over rewards for each arm
- Probability matching selects action $a$ according to the probability that $a$ is the optimal action:
  $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \neq a \mid h_t]$
- Probability matching is optimistic in the face of uncertainty: uncertain actions have a higher probability of being the max
- Can be difficult to compute analytically from the posterior

Thompson Sampling Implements Probability Matching
- Thompson sampling:
  $\pi(a \mid h_t) = \mathbb{P}[Q(a) > Q(a'),\ \forall a' \neq a \mid h_t] = \mathbb{E}_{\mathcal{R} \mid h_t}\left[\mathbf{1}\left(a = \arg\max_{a \in \mathcal{A}} Q(a)\right)\right]$
- Use Bayes rule to compute the posterior distribution $p[\mathcal{R} \mid h_t]$
- Sample a reward distribution $\mathcal{R}$ from the posterior
- Compute the action-value function $Q(a) = \mathbb{E}[\mathcal{R}^a]$
- Select the action maximizing the value on the sample, $a_t = \arg\max_{a \in \mathcal{A}} Q(a)$
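
A sketch of Thompson sampling for the Bernoulli bandit above, with Beta(1,1) (uniform) priors as an illustrative assumption. Note it uses the standard convention where alpha tracks 1-rewards and beta tracks 0-rewards, which is the reverse of the convention on the later Bernoulli-bandit slide.

```python
import numpy as np

def thompson_sampling_bernoulli(bandit, T=1000, seed=0):
    """Thompson sampling with independent Beta posteriors per arm."""
    rng = np.random.default_rng(seed)
    m = bandit.num_arms
    alpha = np.ones(m)   # 1 + number of observed 1-rewards
    beta = np.ones(m)    # 1 + number of observed 0-rewards
    for t in range(T):
        theta = rng.beta(alpha, beta)          # sample a mean for each arm
        a = int(np.argmax(theta))              # act greedily w.r.t. the sample
        r = bandit.pull(a)
        alpha[a] += r                          # posterior update
        beta[a] += 1 - r
    return alpha, beta
```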

Thompson Sampling Implements Probability Matching
- Thompson sampling achieves the Lai and Robbins lower bound
- Last checked: bounds for optimism are tighter than for Thompson sampling
- But empirically Thompson sampling can be extremely effective

Relevant Background: Value of Information
- Exploration is useful because it gains information
- Can we quantify the value of information (VOI)?
  - How much reward a decision-maker would be prepared to pay in order to have that information, prior to making a decision
  - Long-term reward after getting the information minus immediate reward

Relevant Background: Value of Information Example
- Consider a bandit where we only get to make a single decision
- An oil company is considering buying the rights to drill in 1 of 5 locations
- 1 of the locations contains 10 million worth of oil, the others 0
- The cost of buying the rights to drill is 2 million
- A seismologist says that, for a fee, they will survey one of the 5 locations and report back definitively whether that location does or does not contain oil
- What is the most we should consider paying the seismologist?

Relevant Background: Value of Information Example
- 1 of the locations contains 10 million worth of oil, the others 0
- The cost of buying the rights to drill is 2 million
- The seismologist will, for a fee, survey one of the 5 locations and report back definitively whether that location does or does not contain oil
- Value of information: expected profit if we ask the seismologist minus expected profit if we don't ask
- Expected profit if we don't ask (guess at random):
  $\frac{1}{5}(10 - 2) + \frac{4}{5}(0 - 2) = 0$
- Expected profit if we ask:
  - If the surveyed location has oil, expected profit is $10 - 2 = 8$
  - If the surveyed location doesn't have oil, expected profit (guessing at random from the other locations) is $\frac{1}{4}(10 - 2) + \frac{3}{4}(-2) = 0.5$
  - Weighting by the probability that we survey the location with oil: $\frac{1}{5} \cdot 8 + \frac{4}{5} \cdot 0.5 = 2$
- VOI: $2 - 0 = 2$
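
A quick check of the arithmetic above (the variable names are only for illustration):

```python
# Value-of-information arithmetic for the oil-drilling example.
n, oil_value, drill_cost = 5, 10.0, 2.0

# Without the survey: guess a location at random.
profit_no_survey = (1/n) * (oil_value - drill_cost) + ((n-1)/n) * (0 - drill_cost)

# With the survey of one location:
profit_if_surveyed_has_oil = oil_value - drill_cost
profit_if_surveyed_dry = (1/(n-1)) * (oil_value - drill_cost) + ((n-2)/(n-1)) * (0 - drill_cost)
profit_survey = (1/n) * profit_if_surveyed_has_oil + ((n-1)/n) * profit_if_surveyed_dry

voi = profit_survey - profit_no_survey
print(profit_no_survey, profit_survey, voi)   # 0.0 2.0 2.0
```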

Relevant Background: Value of Information
- Back to making a sequence of decisions under uncertainty
- Information gain is higher in uncertain situations
- But we need to consider the value of that information
  - Would it change our decisions?
  - Expected utility benefit

Information State Space
- So far we have viewed bandits as a simple, fully observable Markov decision process (where actions don't impact the next state)
- Beautiful idea: frame bandits as a partially observable Markov decision process where the hidden state is the mean reward of each arm
  - The (hidden) state is static
  - Actions are the same as before: pulling an arm
  - Observations: a sample from the reward model given the hidden state
- POMDP planning = optimal bandit learning

Information State Space
- The POMDP belief state / information state $\tilde{s}$ is the posterior over the hidden parameters (e.g. the mean reward of each arm)
- $\tilde{s}$ is a statistic of the history, $\tilde{s} = f(h_t)$
- Each action $a$ causes a transition to a new information state $\tilde{s}'$ (by adding information), with probability $\tilde{\mathcal{P}}^a_{\tilde{s}, \tilde{s}'}$
- This is equivalent to a POMDP, or an MDP $\tilde{\mathcal{M}} = (\tilde{\mathcal{S}}, \mathcal{A}, \tilde{\mathcal{P}}, \mathcal{R}, \gamma)$ in augmented information state space

Bernoulli Bandits
- Consider a Bernoulli bandit, such that $\mathcal{R}^a = \mathcal{B}(\mu_a)$
- e.g. win or lose a game with probability $\mu_a$
- Want to find which arm has the highest $\mu_a$
- The information state is $\tilde{s} = (\alpha, \beta)$
  - $\alpha_a$ counts the pulls of arm $a$ where the reward was 0
  - $\beta_a$ counts the pulls of arm $a$ where the reward was 1

Solving Information State Space Bandits
- We now have an infinite MDP over information states
- This MDP can be solved by reinforcement learning
  - Model-free reinforcement learning (e.g. Q-learning)
  - Bayesian model-based RL (e.g. Gittins indices)
- This approach is known as Bayes-adaptive RL: it finds the Bayes-optimal exploration/exploitation trade-off with respect to the prior distribution
- In other words, it selects actions that maximize expected reward given the information we have so far
- Check your understanding: can an algorithm that optimally solves an information state bandit have non-zero regret? Why or why not?

Bayes-Adaptive Bernoulli Bandits
- Start with a $\text{Beta}(\alpha_a, \beta_a)$ prior over the reward function $\mathcal{R}^a$
- Each time $a$ is selected, update the posterior for $\mathcal{R}^a$:
  - $\text{Beta}(\alpha_a + 1, \beta_a)$ if $r = 0$
  - $\text{Beta}(\alpha_a, \beta_a + 1)$ if $r = 1$
- This defines the transition function $\tilde{\mathcal{P}}$ for the Bayes-adaptive MDP
- The information state $(\alpha, \beta)$ corresponds to the reward model $\text{Beta}(\alpha, \beta)$
- Each state transition corresponds to a Bayesian model update
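
A small sketch of the information-state transition, following the slide's convention (alpha counts reward-0 pulls, beta counts reward-1 pulls); the data structure (a list of pairs) and function names are illustrative.

```python
def bayes_adaptive_step(info_state, a, r):
    """One transition of the Bayes-adaptive Bernoulli bandit.
    info_state is a list of (alpha_a, beta_a) pairs, one per arm."""
    alpha_a, beta_a = info_state[a]
    if r == 0:
        info_state[a] = (alpha_a + 1, beta_a)
    else:
        info_state[a] = (alpha_a, beta_a + 1)
    return info_state

def predictive_prob_reward_one(info_state, a):
    """Under Beta(alpha_a, beta_a) with this convention, the predictive
    probability that arm a pays reward 1 (i.e. the transition probability
    to the beta_a + 1 information state) is beta_a / (alpha_a + beta_a)."""
    alpha_a, beta_a = info_state[a]
    return beta_a / (alpha_a + beta_a)
```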

Gittins Indices for Bernoulli Bandits
- The Bayes-adaptive MDP can be solved by dynamic programming
- The solution is known as the Gittins index
- An exact solution to the Bayes-adaptive MDP is typically intractable; the information state space is too large
- Recent idea: apply simulation-based search (Guez et al. 2012, 2013)
  - Forward search in information state space
  - Using simulations from the current information state

Principles for Strategic Exploration
- The same principles for exploration/exploitation apply to MDPs:
  - Naive Exploration
  - Optimistic Initialization
  - Optimism in the Face of Uncertainty
  - Probability Matching
  - Information State Search

Optimistic Initialization: Model-Free RL
- Initialize the action-value function $Q(s, a)$ to $\frac{r_{\max}}{1 - \gamma}$
- Run your favorite model-free RL algorithm:
  - Monte-Carlo control
  - Sarsa
  - Q-learning
  - etc.
- Encourages systematic exploration of states and actions
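
A sketch of Q-learning with optimistic initialization: every $Q(s, a)$ starts at $r_{\max}/(1-\gamma)$, an upper bound on any discounted return. The env interface (reset/step returning a 3-tuple) and the hyperparameters are assumptions, not from the lecture.

```python
import numpy as np

def optimistic_q_learning(env, num_states, num_actions, r_max,
                          gamma=0.95, alpha=0.1, num_episodes=500):
    """Tabular Q-learning with optimistically initialized Q-values."""
    Q = np.full((num_states, num_actions), r_max / (1.0 - gamma))
    for _ in range(num_episodes):
        s = env.reset()                              # assumed: returns a state index
        done = False
        while not done:
            a = int(np.argmax(Q[s]))                 # greedy is enough: untried
            s_next, r, done = env.step(a)            # actions still look best
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```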

Optimistic Initialization: Model-Based RL
- Construct an optimistic model of the MDP
- Initialize transitions to go to a terminal state with $r_{\max}$ reward
- Solve the optimistic MDP with your favorite planning algorithm
- Encourages systematic exploration of states and actions
- e.g. the RMax algorithm (Brafman and Tennenholtz)

UCB: Model-Free RL
- Maximize the UCB on the action-value function $Q^\pi(s, a)$:
  $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a) + U(s_t, a)$
  - Estimates uncertainty in policy evaluation (easy)
  - Ignores uncertainty from policy improvement
- Maximize the UCB on the optimal action-value function $Q^*(s, a)$:
  $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a) + U_1(s_t, a) + U_2(s_t, a)$
  - Estimates uncertainty in policy evaluation (easy)
  - Plus uncertainty from policy improvement (hard)

Bayesian Model-Based RL
- Maintain a posterior distribution over MDP models
- Estimate both transitions and rewards, $p[\mathcal{P}, \mathcal{R} \mid h_t]$, where $h_t = (s_1, a_1, r_1, \ldots, s_t)$ is the history
- Use the posterior to guide exploration:
  - Upper confidence bounds (Bayesian UCB)
  - Probability matching (Thompson sampling)

Thompson Sampling: Model-Based RL
- Thompson sampling implements probability matching:
  $\pi(s, a \mid h_t) = \mathbb{P}[Q(s, a) > Q(s, a'),\ \forall a' \neq a \mid h_t] = \mathbb{E}_{\mathcal{P}, \mathcal{R} \mid h_t}\left[\mathbf{1}\left(a = \arg\max_{a \in \mathcal{A}} Q(s, a)\right)\right]$
- Use Bayes rule to compute the posterior distribution $p[\mathcal{P}, \mathcal{R} \mid h_t]$
- Sample an MDP $(\mathcal{P}, \mathcal{R})$ from the posterior
- Solve the sampled MDP using your favorite planning algorithm to get $Q^*(s, a)$
- Select the optimal action for the sampled MDP, $a_t = \arg\max_{a \in \mathcal{A}} Q^*(s_t, a)$
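
A compact sketch of one round of model-based Thompson sampling for a tabular MDP, under simplifying assumptions not in the lecture: a Dirichlet posterior over transitions, rewards treated as known, and value iteration as the planner.

```python
import numpy as np

def sample_mdp_and_plan(trans_counts, R, gamma=0.95, n_iters=200, seed=0):
    """One round of model-based Thompson sampling (a sketch).
    trans_counts[s, a, s'] holds observed transition counts; R[s, a] is the
    (assumed known) reward table."""
    rng = np.random.default_rng(seed)
    S, A, _ = trans_counts.shape
    # Sample a transition model P from the Dirichlet(1 + counts) posterior.
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(1.0 + trans_counts[s, a])
    # Solve the sampled MDP by value iteration to get Q*.
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)
        Q = R + gamma * P @ V          # Q(s,a) = R(s,a) + gamma * sum_s' P V(s')
    return Q                            # act greedily: a_t = argmax_a Q[s_t, a]
```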

Information State Search in MDPs
- MDPs can be augmented to include the information state
- Now the augmented state is $(s, \tilde{s})$, where $s$ is the original state within the MDP and $\tilde{s}$ is a statistic of the history (the accumulated information)
- Each action $a$ causes a transition to a new state $s'$ with probability $\mathcal{P}^a_{s, s'}$, and to a new information state $\tilde{s}'$
- This defines an MDP $\tilde{\mathcal{M}}$ in augmented information state space

Bayes-Adaptive MDPs
- The posterior distribution over the MDP model is an information state: $\tilde{s}_t = \mathbb{P}[\mathcal{P}, \mathcal{R} \mid h_t]$
- The augmented MDP over $(s, \tilde{s})$ is called the Bayes-adaptive MDP
- Solve this MDP to find the optimal exploration/exploitation trade-off (with respect to the prior)
- However, the Bayes-adaptive MDP is typically enormous
- Simulation-based search has proven effective (Guez et al., 2012, 2013)

Principles
- Naive Exploration: add noise to the greedy policy (e.g. $\epsilon$-greedy)
- Optimistic Initialization: assume the best until proven otherwise
- Optimism in the Face of Uncertainty: prefer actions with uncertain values
- Probability Matching: select actions according to the probability they are best
- Information State Search: lookahead search incorporating the value of information

Other Evaluation Criteria
- Probably approximately correct: on all but $N$ steps, the algorithm will select an action whose value is near optimal, $Q(s_t, a_t) \geq V^*(s_t) - \epsilon$, with probability at least $1 - \delta$
- $N$ is a polynomial function of the MDP parameters $\left(|S|, |A|, \frac{1}{1-\gamma}, \frac{1}{\delta}, \frac{1}{\epsilon}\right)$
- Bounded "mistakes"
- Many PAC RL algorithms use ideas of optimism under uncertainty

Generalization and Strategic Exploration
- Important topic: combining generalization & strategic exploration
- Many approaches are grounded in the principles outlined here
- Some examples:
  - Optimism under uncertainty: Bellemare et al., NIPS 2016; Ostrovski et al., ICML 2017; Tang et al., NIPS 2017
  - Probability matching: Osband et al., NIPS 2016; Mandel et al., IJCAI 2016

Midterm Results
- The midterm was rigorous
- We will curve the class
- If you got under 30, there are some concepts you haven't demonstrated you understand yet; please reach out to us to brainstorm ways we can help you learn

Class Structure
- Last time: Midterm!
- This time: Exploration and Exploitation
- Next time: Batch RL
