Lecture Slides "Reinforcement Learning" - Uni-Paderborn.de


Lecture 07: Planning and Learning with Tabular Methods
Oliver Wallscheid, RL Lecture 07

Fig. 7.1: Main categories of reinforcement learning algorithms (source: D. Silver, Reinforcement learning, 2016, CC BY-NC 4.0)

- Up to now: independent usage of model-free and model-based RL
- Today: integrating both strategies (on finite state & action spaces)

Table of Contents
1. Repetition: Model-based and Model-free RL
2. Dyna: Integrated Planning, Acting and Learning
3. Prioritized Sweeping
4. Update Variants
5. Planning at Decision Time

Model-based RL
- Plan/predict value functions and/or policy from a model.
- Requires an a priori model or to learn a model from experience.
- Solves control problems by planning algorithms such as policy iteration or value iteration.

Fig. 7.2: A model for discrete state and action space problems is generally an MDP (source: www.wikipedia.org, by Waldoalvarez, CC BY-SA 4.0)

What is a Model?
- A model M is an MDP tuple ⟨X, U, P, R, γ⟩.
- In particular, we require the state-transition probability

  P = Pr[X_{k+1} = x_{k+1} | X_k = x_k, U_k = u_k]   (7.1)

  and the reward probability

  R = Pr[R_{k+1} = r_{k+1} | X_k = x_k, U_k = u_k].   (7.2)

- State space X and action space U are assumed to be known.
- Discount factor γ might be given by the environment or the engineer's choice.

What kind of model is available?
- If M is perfectly known a priori: true MDP.
- If M̂ ≈ M needs to be learned: approximated MDP.

Model Learning / Identification
- Ideally, the model M is exactly known a priori (e.g., gridworld case).
- On the contrary, a model might be too complex to derive or not exactly available (e.g., real physical systems).
- Objective: estimate model M̂ from experience {X_0, U_0, R_1, ..., X_T}.
- This is a supervised learning / system identification task:
  {X_0, U_0} → {X_1, R_1}, ..., {X_{T-1}, U_{T-1}} → {X_T, R_T}
- Simple tabular / look-up table approach (with n(x, u) visit count):

  p̂^u_{xx'} = (1 / n(x, u)) Σ_{k=0}^{T-1} 1(X_{k+1} = x' | X_k = x, U_k = u),
  R̂^u_x = (1 / n(x, u)) Σ_{k=0}^{T-1} 1(X_k = x, U_k = u) r_{k+1}.   (7.3)
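The look-up table estimator (7.3) can be sketched as plain visit counting over an experience buffer. This is a minimal sketch; the function name and data layout (a list of `(x, u, r, x_next)` tuples) are illustrative assumptions, not from the lecture.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate a tabular MDP model from experience tuples (x, u, r, x_next).

    Returns transition probabilities p_hat[(x, u)][x_next] and mean
    rewards r_hat[(x, u)], following the visit-count estimator (7.3).
    """
    counts = defaultdict(lambda: defaultdict(int))  # (x, u) -> x_next -> count
    reward_sums = defaultdict(float)                # (x, u) -> summed rewards
    visits = defaultdict(int)                       # (x, u) -> n(x, u)
    for x, u, r, x_next in transitions:
        counts[(x, u)][x_next] += 1
        reward_sums[(x, u)] += r
        visits[(x, u)] += 1
    p_hat = {xu: {x2: c / visits[xu] for x2, c in nxt.items()}
             for xu, nxt in counts.items()}
    r_hat = {xu: reward_sums[xu] / visits[xu] for xu in visits}
    return p_hat, r_hat
```

For example, three visits of pair (0, 'a') landing twice in state 1 and once in state 2 yield p̂ = {1: 2/3, 2: 1/3}.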

Distribution vs. Sample Models
- A model based on P and R is called a distribution model.
  - Contains descriptions of all possibilities by random distributions.
  - Has full explanatory power, but is still rather complex to obtain.
- Alternatively, use sample models to receive realization series.
- Remember the blackjack examples: easy to sample by simulation, but hard to model as a full distributional MDP.

Fig. 7.3: Depending on the application, distribution models are easily available or not (source: Josh Appel on Unsplash)

Model-free RL
- Learn value functions and/or policy directly from experience.
- Requires no model at all (the policy can be considered an implicit model).
- Solves control problems by learning algorithms such as Monte Carlo, Sarsa or Q-learning.

Fig. 7.4: If a perfect a priori model is not available, RL can be realized directly or indirectly (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

Advantages & Drawbacks: Model-free vs. Model-based RL
Pro model-based / indirect RL:
- Efficiently uses a limited amount of experience (e.g., by replay).
- Allows integration of available a priori knowledge.
- Learned models might be re-used for other tasks (e.g., monitoring).

Pro model-free / direct RL:
- Is simpler to implement (only one task, not two consecutive ones).
- Not affected by model bias / error during model learning.

2. Dyna: Integrated Planning, Acting and Learning

The General Dyna Architecture (1)
- Proposed by R. Sutton in the 1990's.
- General framework with many different implementation variants.
- Real experience plays two roles: it improves the model (model learning) and it directly improves the value function and policy (direct RL).
- Conceptually, planning, acting, model learning, and direct RL occur simultaneously.

Fig. 7.6: Dyna framework (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

The General Dyna Architecture (2)
- Direct RL update: any model-free algorithm (Q-learning, Sarsa, ...).
- Model learning:
  - In the tabular case: simple distribution estimation as in (7.3).
  - Simple experience buffer to re-apply the model-free algorithm.
  - For large or continuous state/action spaces: function approximation by supervised learning / system identification (next lecture).
- Search control: strategies for selecting starting states and actions to generate simulated experience.
- Planning applies reinforcement learning methods to the simulated experience as if it had really happened; learning and planning share almost all the same machinery, differing only in the source of their experience.

Algorithmic Implementation: Dyna-Q

parameter: α ∈ {R | 0 < α < 1}, n ∈ {N | n ≥ 1} (planning steps per real step)
init: q̂(x, u) arbitrary (except terminal) and M̂(x, u) ∀ {x ∈ X, u ∈ U}
for j = 1, 2, ... episodes do
    initialize x_0; k = 0;
    repeat
        choose u_k from x_k using a soft policy derived from q̂(x, u);
        take action u_k, observe r_{k+1} and x_{k+1};
        q̂(x_k, u_k) ← q̂(x_k, u_k) + α [r_{k+1} + γ max_u q̂(x_{k+1}, u) − q̂(x_k, u_k)];
        M̂(x_k, u_k) ← {r_{k+1}, x_{k+1}} (assuming deterministic env.);
        for i = 1, 2, ..., n do
            x̃_i ← random previously visited state;
            ũ_i ← random previously taken action in x̃_i;
            {r̃_{i+1}, x̃_{i+1}} ← M̂(x̃_i, ũ_i);
            q̂(x̃_i, ũ_i) ← q̂(x̃_i, ũ_i) + α [r̃_{i+1} + γ max_u q̂(x̃_{i+1}, u) − q̂(x̃_i, ũ_i)];
        k ← k + 1;
    until x_k is terminal;

Algo. 7.1: Dyna with Q-learning (Dyna-Q)
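Algo. 7.1 can be sketched in Python as follows. The corridor test environment, function names and hyperparameter defaults below are illustrative assumptions, not part of the lecture; the environment is assumed deterministic, as in the algorithm.

```python
import random
from collections import defaultdict

def dyna_q(env_step, actions, start_state, episodes=30, n_planning=5,
           alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Dyna-Q sketch (cf. Algo. 7.1) for a deterministic environment.

    env_step(x, u) must return (reward, next_state, done).
    """
    rng = random.Random(seed)
    q = defaultdict(float)   # action values q[(x, u)], default 0
    model = {}               # deterministic model: (x, u) -> (r, x_next)

    for _ in range(episodes):
        x, done = start_state, False
        while not done:
            # soft (eps-greedy) policy derived from q
            if rng.random() < eps:
                u = rng.choice(actions)
            else:
                u = max(actions, key=lambda a: q[(x, a)])
            r, x2, done = env_step(x, u)
            # direct RL update (one-step Q-learning)
            q[(x, u)] += alpha * (r + gamma * max(q[(x2, a)] for a in actions) - q[(x, u)])
            # model learning: memorize the observed transition
            model[(x, u)] = (r, x2)
            # planning: n simulated one-step updates drawn from the model buffer
            for _ in range(n_planning):
                xs, us = rng.choice(list(model))
                rs, xs2 = model[(xs, us)]
                q[(xs, us)] += alpha * (rs + gamma * max(q[(xs2, a)] for a in actions) - q[(xs, us)])
            x = x2
    return q
```

A toy usage: a corridor of states 0..4 with actions +1/-1 and reward +1 at the terminal state 4 is enough to see the planning loop propagate values backward.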

Remarks on Dyna-Q Implementation
The specific Dyna-Q characteristics are:
- Direct RL update: Q-learning,
- Model: simple memory buffer of previous real experience,
- Search strategy: random choices from the model buffer.

Moreover:
- The number of Dyna planning steps n is to be distinguished from n-step bootstrapping (same symbol, two interpretations).
- Without the model M̂ one would receive one-step Q-learning.
- The model-based learning is done n times per real environment interaction: previous real experience is re-applied to Q-learning.
- Can be considered a background task: choose the maximum n subject to hardware limitations (prevent turnaround errors).
- For stochastic environments: use a distributional model as in (7.3). The update rule may then be modified from a sample to an expected update.

Maze Example (1)
- Maze with obstacles (gray blocks)
- Start at S and reach G
- r_T = +1 at G
- Episodic task with γ = 0.95
- Step size α = 0.1
- Exploration ε = 0.1
- Averaged learning curves

Fig. 7.7: Applying Dyna-Q with different planning steps n to a simple maze; the learning curves compare n = 0 (direct RL only), n = 5, and n = 50 planning steps per real step, and the task is to travel from S to G as quickly as possible (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

Maze Example (2)
- Blocks without an arrow depict a neutral action policy (equal action values).
- Black squares indicate the agent's position halfway through the second episode.
- Without planning (n = 0), each episode adds only one new item to the policy: only one step (the last) has been learned so far.
- With planning (n = 50), the available experience is efficiently utilized: by the end of the second episode an extensive policy has been developed that reaches almost back to the start state. After the third episode, the planning agent found the optimal policy.

Fig. 7.8: Policies (greedy action) found by planning and nonplanning Dyna-Q agents halfway through the second episode (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

What if the Model is Wrong?
Possible model error sources:
- A provided a priori model may be inaccurate (expert knowledge).
- Environment behavior changes over time (non-stationary).
- Early-stage model is biased due to the learning process.
- If function approximators are used: generalization error (cf. lecture 08 and following).

Consequences:
- Model errors are likely to lead to a suboptimal policy.
- If lucky: errors are quickly discovered and directly corrected by default, random exploration.
- Nevertheless, more intelligent exploration / correction strategies might be useful (compared to random actions as in ε-greedy strategies).

The Blocking Maze Example
- Maze with a changing obstacle line: the left environment applies for the first 1000 steps, the right one afterwards
- Start at S and reach G
- r_T = +1 at G
- Dyna-Q+ encourages intelligent exploration (upcoming slides)
- Dyna-Q requires more steps in order to overcome the blockade
- Averaged learning curves

Fig. 7.9: Maze with a changing layout after 1000 steps illustrates a model error; Dyna-Q+ is Dyna-Q with an exploration-encouraging learning bonus (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

The Shortcut Maze Example
- Maze opens a shortcut after 3000 steps: the left environment applies for the first 3000 steps, the right one afterwards
- Start at S and reach G
- r_T = +1 at G
- Dyna-Q with random exploration is likely not finding the shortcut
- The Dyna-Q+ exploration strategy is able to correct the internal model
- Averaged learning curves

Fig. 7.10: Maze with an additional shortcut after 3000 steps (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

Dyna-Q+ Extensions
Compared to default Dyna-Q in Algo. 7.1, Dyna-Q+ contains the following extensions:
- Search heuristic: add κ√τ to the regular reward.
  - τ is the number of time steps a state-action transition has not been tried.
  - κ is a small scaling factor κ ∈ {R | 0 < κ}.
  - The agent is encouraged to keep testing all accessible transitions.
- Actions for given states that had never been tried before are allowed for simulation-based planning.
  - Initial model for that: actions lead back to the same state without reward.
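The search heuristic above can be sketched as a small bookkeeping helper; the class name and interface are illustrative assumptions, not from the lecture.

```python
import math

class ExplorationBonus:
    """Dyna-Q+ style search heuristic: adds kappa * sqrt(tau) to the
    simulated reward, where tau is the number of time steps since the
    state-action pair was last tried in the real environment."""

    def __init__(self, kappa=1e-3):
        self.kappa = kappa
        self.k = 0      # global time step counter
        self.last = {}  # (x, u) -> time step of last real visit

    def tick(self, x, u):
        """Register a real environment interaction with (x, u)."""
        self.k += 1
        self.last[(x, u)] = self.k

    def bonus(self, x, u):
        """Bonus to add to the modeled reward during planning updates."""
        tau = self.k - self.last.get((x, u), 0)
        return self.kappa * math.sqrt(tau)
```

During planning, the simulated reward r̃ would be replaced by r̃ + bonus(x̃, ũ), so long-untried transitions look increasingly attractive.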

3. Prioritized Sweeping

Background and Idea
- Dyna-Q randomly samples from the memory buffer.
- Many planning updates may be pointless, e.g., zero-valued state updates during early training.
- In large state-action spaces: inefficient search, since transitions are chosen far away from optimal policies.
- Better: focus on important updates.
  - In episodic tasks: backward focusing starting from the goal state.
  - In continuing tasks: prioritize according to the impact on value updates.
- The solution method is called prioritized sweeping.
  - Build up a queue of every state-action pair whose value would change significantly.
  - Prioritize updates by the size of change.
  - Neglect state-action pairs with only minor impact.

Algorithmic Implementation: Prioritized Sweeping

parameter: α ∈ {R | 0 < α < 1}, n ∈ {N | n ≥ 1}, θ ∈ {R | θ ≥ 0}
init: q̂(x, u) arbitrary and M̂(x, u) ∀ {x ∈ X, u ∈ U}, empty queue Q
for j = 1, 2, ... episodes do
    initialize x_0 and k = 0;
    repeat
        choose u_k from x_k using a soft policy derived from q̂(x, u);
        take action u_k, observe r_{k+1} and x_{k+1};
        M̂(x_k, u_k) ← {r_{k+1}, x_{k+1}} (assuming deterministic env.);
        P ← |r_{k+1} + γ max_u q̂(x_{k+1}, u) − q̂(x_k, u_k)|;
        if P > θ then insert {x_k, u_k} in Q with priority P;
        for i = 1, 2, ..., n while queue Q is not empty do
            {x̃_i, ũ_i} ← arg max_P (Q);
            {r̃_{i+1}, x̃_{i+1}} ← M̂(x̃_i, ũ_i);
            q̂(x̃_i, ũ_i) ← q̂(x̃_i, ũ_i) + α [r̃_{i+1} + γ max_u q̂(x̃_{i+1}, u) − q̂(x̃_i, ũ_i)];
            for all {x, u} predicted to lead to x̃_i do
                r ← predicted reward for {x, u, x̃_i};
                P ← |r + γ max_u q̂(x̃_i, u) − q̂(x, u)|;
                if P > θ then insert {x, u} in Q with priority P;
        k ← k + 1;
    until x_k is terminal;
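The queue mechanics of the algorithm above can be sketched with a binary heap; the corridor test environment, names and hyperparameter defaults are illustrative assumptions, and the environment is again assumed deterministic.

```python
import heapq
import random
from collections import defaultdict

def prioritized_sweeping(env_step, actions, start_state, episodes=30,
                         n_planning=5, alpha=0.5, gamma=0.95, eps=0.1,
                         theta=1e-4, seed=0):
    """Prioritized sweeping sketch for a deterministic environment.

    env_step(x, u) must return (reward, next_state, done).
    """
    rng = random.Random(seed)
    q = defaultdict(float)
    model = {}                       # (x, u) -> (r, x_next)
    predecessors = defaultdict(set)  # x_next -> {(x, u)} observed to lead there
    pqueue = []                      # max-heap emulated via negated priorities

    def td_error(x, u, r, x2):
        return r + gamma * max(q[(x2, a)] for a in actions) - q[(x, u)]

    for _ in range(episodes):
        x, done = start_state, False
        while not done:
            if rng.random() < eps:
                u = rng.choice(actions)
            else:
                u = max(actions, key=lambda a: q[(x, a)])
            r, x2, done = env_step(x, u)
            model[(x, u)] = (r, x2)
            predecessors[x2].add((x, u))
            p = abs(td_error(x, u, r, x2))
            if p > theta:
                heapq.heappush(pqueue, (-p, (x, u)))
            # planning: work off the highest-priority updates first
            for _ in range(n_planning):
                if not pqueue:
                    break
                _, (xs, us) = heapq.heappop(pqueue)
                rs, xs2 = model[(xs, us)]
                q[(xs, us)] += alpha * td_error(xs, us, rs, xs2)
                # backward search: re-check all known predecessors of xs
                for (xp, up) in predecessors[xs]:
                    rp, _next = model[(xp, up)]
                    pp = abs(td_error(xp, up, rp, xs))
                    if pp > theta:
                        heapq.heappush(pqueue, (-pp, (xp, up)))
            x = x2
    return q
```

Note how the backward search over predecessors implements the goal-directed focusing: once the goal reward is found, value changes propagate backward through the model buffer.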

Remarks on Prioritized Sweeping Implementation
The specific prioritized sweeping characteristics are:
- Direct RL update: Q-learning,
- Model: simple memory buffer of previous real experience,
- Search strategy: prioritized updates based on the predicted value change.

Moreover:
- θ is a hyperparameter denoting the update significance threshold.
- The prediction step regarding x̃_i is a backward search in the model buffer.
- For stochastic environments: use a distributional model as in (7.3). The update rule may then be modified from a sample to an expected update.

Comparing Against Dyna-Q on Simple Maze Example
- Environment framework as in Fig. 7.7
- But: changing maze sizes (number of states)
- Both methods can utilize up to n = 5 planning steps
- Prioritized sweeping finds the optimal solution 5-10 times quicker

Fig. 7.11: Comparison of prioritized sweeping and Dyna-Q on a simple maze; updates until the optimal solution are plotted over the gridworld size, from 47 up to 6016 states (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

4. Update Variants

Update Rule Alternatives
- Dyna updates (search strategy) are not bound to Q-learning during planning and can be exchanged in many ways (see Fig. 7.12).
- Even evaluating a fixed policy π in terms of v_π(x) and q_π(x, u) is possible.

Expected updates (DP): policy evaluation, value iteration, q-policy evaluation, q-value iteration
Sample updates (one-step TD): TD(0), Sarsa, Q-learning

Fig. 7.12: Possible one-step updates: alternatives for Dyna

Advantages & Drawbacks: Expected vs. Sampled Updates
Pro expected updates:
- Delivers more accurate value estimates (no sampling error).

Pro sample updates:
- Is computationally cheaper (e.g., a distributional model is not required).

Leads to a trade-off: estimation accuracy vs. computational burden.
- Evaluate the decision on the given problem, i.e., how many state-action pairs have to be evaluated for a new expected update?
- Utilize the branching factor b metric: corresponds to the number of possible next states x' with p(x' | x, u) > 0.
- If an expected update is available, it will be roughly as accurate as b samplings.
- If only an incomplete expected update is available, prefer the sampling solution (often applies to large state-action spaces).
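The two update flavors can be contrasted in a few lines. This is a minimal sketch with illustrative function names; the expected update needs the full distribution model over all b successors, while the sample update only needs one drawn transition.

```python
def expected_update(q, p, r, gamma, actions):
    """Expected one-step update target: weights each of the b successor
    states x2 by its probability p[x2] (requires a distribution model).
    r[x2] is the reward for reaching x2."""
    return sum(prob * (r[x2] + gamma * max(q[(x2, a)] for a in actions))
               for x2, prob in p.items())

def sample_update(q, x, u, x2, reward, gamma, alpha, actions):
    """Sample one-step update (Q-learning style): a single successor x2
    drawn from a sample model, moved toward with step size alpha."""
    target = reward + gamma * max(q[(x2, a)] for a in actions)
    return q[(x, u)] + alpha * (target - q[(x, u)])
```

The expected update touches all b successors per call (accurate but b times the work); the sample update touches one and relies on the step size to average out the sampling error.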

Example for Sampled vs. Expected Updates
- Artificial prediction task where all b successor states are equally likely
- Method initialization such that the RMS error is always one
- Sample updates perform particularly well for large branching factors b (takeaway message: if facing large stochastic problems, use sampling)

Fig. 7.13: Comparison of expected vs. sampled updates for branching factors b = 2, 10, 100, 1000, 10000; the RMS error in the value estimate is plotted over the number of computations (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

Alternatives on Distributing the Updates
Recap:
- Dynamic programming: sweep through the entire state(-action) space
- Dyna-Q: random uniform sampling
- Mutual problem: irrelevant updates decrease computational efficiency

Alternative: update according to the on-policy distribution
- Based on sampling along the encountered state(-action) pairs (trajectory sweeping)
- Based on the explicit on-policy distribution
- In both cases: ignore vast, uninteresting parts of the problem space at the risk of updating the same old parts all over again

Exemplary Update Distribution Comparison
Example:
- Two actions per state; both actions lead to b next states
- On all transitions: 10% probability of ending in the terminal state (undiscounted episodic tasks)
- Expected reward per transition: drawn from N(μ = 0, σ² = 1)
- Task: estimate the start state value under the greedy policy
- 200 randomly generated tasks, averaged
- In all cases, sampling according to the on-policy distribution helps planning initially, but can be outperformed by uniform updates in the long run

Fig. 7.14: Update distribution comparison: relative efficiency of updates distributed uniformly across the state space versus focused on simulated on-policy trajectories, for tasks with 1,000 and 10,000 states and branching factors b = 1, 3, 10; the value of the start state under the greedy policy is plotted over computation time in full expected updates (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

5. Planning at Decision Time

Background Planning vs. Planning at Decision Time
Background planning (discussed so far):
- Gradually improves a policy or value function if time is available.
- Backward view: re-apply gathered experience.
- Feasible for fast execution: policy or value estimate are available with low latency (important, e.g., for real-time control).

Planning at decision time¹ (not yet discussed alternative):
- Select the single next future action through planning.
- Forward view: predict future trajectories starting from the current state.
- Typically discards previous planning outcomes (start from scratch after each state transition).
- If multiple trajectories are independent: easy parallel implementation.
- Most useful if fast responses are not required (e.g., turn-based games).

¹ Can be interpreted as model predictive control in an engineering context.

Heuristic Search
- Develop tree-like continuations from each state encountered.
- Approximate the value function at leaf nodes (using a model) and back up towards the current state.
- Choose the action according to the predicted trajectory with the highest value.
- Predictions are normally discarded (new search tree in each state).
- The performance improvement of deeper searches is not due to multistep updates as such, but due to the focus and concentration of updates on states and actions immediately downstream from the current state.

Fig. 7.15: Heuristic search tree with exemplary order of back-up operations; the updates can be implemented as a sequence of one-step updates backing up values from the leaf nodes toward the root (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

Rollout Algorithms
- Similar to heuristic search, but: simulate trajectories following a rollout policy.
- Use Monte Carlo estimates of the action values only for the current state to decide on the best action.
- Gradually improves the rollout policy, but the optimal policy might not be found if the rollout sequences are too short.
- Predictions are normally discarded (new rollout in each state).

Fig. 7.16: Simplified processing diagram of rollout algorithms
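The Monte Carlo action-value estimate at the current state can be sketched as follows; the function names, the signature of `env_step` (taking an RNG for stochastic environments) and the truncation at `max_steps` are illustrative assumptions.

```python
import random

def rollout_value(env_step, rollout_policy, x0, u0, gamma=1.0,
                  n_rollouts=100, max_steps=50, seed=0):
    """Monte Carlo estimate of q(x0, u0): take u0 once, then follow the
    rollout policy until the episode ends, and average the returns.

    env_step(x, u, rng) -> (reward, next_state, done);
    rollout_policy(x, rng) -> action.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        ret, discount = 0.0, 1.0
        r, x, done = env_step(x0, u0, rng)  # take the candidate action once
        ret += discount * r
        for _ in range(max_steps):          # then follow the rollout policy
            if done:
                break
            discount *= gamma
            u = rollout_policy(x, rng)
            r, x, done = env_step(x, u, rng)
            ret += discount * r
        total += ret
    return total / n_rollouts
```

To act, one would call this for every candidate action u0 in the current state and pick the maximizer, discarding all rollouts after the transition.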

Monte Carlo Tree Search (MCTS)
- Rollout algorithm, but:
  - makes use of an informed tree policy (e.g., ε-greedy),
  - accumulates value estimates from former MC simulations.

Fig. 7.17: Basic building blocks of MCTS algorithms: repeat selection, expansion, simulation and backup while time is available (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0, adapted from Chaslot, Bakkes, Szita, and Spronck, 2008)

Basic MCTS Procedure
Repeat the following steps while prediction time is available:
1. Selection: Starting at the root node, use a tree policy (e.g., ε-greedy) to travel through the tree until arriving at a leaf node.
   - The tree policy exploits auspicious tree regions while maintaining some exploration.
   - It is improved and (possibly) extended in every simulation run.
2. Expansion: Add child node(s) to the leaf node by evaluating unexplored actions (optional step).
3. Simulation: Simulate the remaining full episode using the rollout policy starting from the leaf or child node (if available).
   - The rollout policy could be random, pre-trained or based on model-free methods using real experience (if available).
4. Backup: Update the values along the traveled trajectory, but save only those within the tree.
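The four steps above can be sketched in a compact form. This is a minimal illustrative implementation, not the lecture's reference code: it assumes a deterministic single-agent task, uses UCB1 as the tree policy (instead of the ε-greedy variant named on the slide) and a random rollout policy, and truncates rollouts after 20 steps.

```python
import math
import random

class Node:
    """Search tree node holding MCTS visit statistics."""
    def __init__(self, state, actions):
        self.state = state
        self.untried = list(actions)  # actions not yet expanded here
        self.children = {}            # action -> child Node
        self.n = 0                    # visit count
        self.w = 0.0                  # accumulated return

def mcts(root_state, actions, step, is_terminal, n_iter=100, c=1.4, seed=0):
    """Minimal MCTS sketch. step(x, u) -> (reward, next_state);
    is_terminal(x) -> bool. Returns the most-visited root action."""
    rng = random.Random(seed)
    root = Node(root_state, actions)

    def ucb(parent, child):
        # UCB1 tree policy: exploit mean return, explore rarely tried actions
        return child.w / child.n + c * math.sqrt(math.log(parent.n) / child.n)

    for _ in range(n_iter):
        node, path, ret = root, [root], 0.0
        # 1) Selection: descend with the tree policy until a leaf is reached
        while not node.untried and node.children and not is_terminal(node.state):
            u = max(node.children, key=lambda a: ucb(node, node.children[a]))
            r, _next = step(node.state, u)
            ret += r
            node = node.children[u]
            path.append(node)
        # 2) Expansion: add one child node for an untried action
        if node.untried and not is_terminal(node.state):
            u = node.untried.pop(rng.randrange(len(node.untried)))
            r, x2 = step(node.state, u)
            ret += r
            node.children[u] = Node(x2, actions)
            node = node.children[u]
            path.append(node)
        # 3) Simulation: random rollout policy from the reached node
        x = node.state
        for _ in range(20):
            if is_terminal(x):
                break
            r, x = step(x, rng.choice(actions))
            ret += r
        # 4) Backup: store the return along the traveled tree path only
        for visited in path:
            visited.n += 1
            visited.w += ret
    return max(root.children, key=lambda a: root.children[a].n)
```

The final action choice here follows the "most visited" rule from the slide; picking the child with the largest mean value is the stated alternative.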

Further MCTS Remarks
What happens after the feasible simulation runs are finished?
- After time is up, MCTS picks an appropriate action regarding the root node, e.g.:
  - the action visited the most times during all simulation runs, or
  - the action having the largest action value.
- After transitioning to a new state, the MCTS procedure re-starts from the new root state.
