Lecture Slides 'Reinforcement Learning - Uni-paderborn.de

1m ago
1.96 MB
40 Pages
Last View : 1m ago
Last Download : n/a
Upload by : Maxton Kershaw

Lecture 07: Planning and Learningwith Tabular MethodsOliver WallscheidOliver WallscheidRL Lecture 071

e del-BasedModelFig. 7.1: Main categories of reinforcement learning algorithms(source: D. Silver, Reinforcement learning, 2016. CC BY-NC 4.0) Up to now: independent usage of model-free and model-based RL Today: integrating both strategies (on finite state & action spaces)Oliver WallscheidRL Lecture 072

Table of Contents1Repetition: Model-based and Model-free RL2Dyna: Integrated Planning, Acting and Learning3Prioritized Sweeping4Update Variants5Planning at Decision TimeOliver WallscheidRL Lecture 073

Model-based RL Plan/predict value functions and/or policy from a model. Requires an a priori model or to learn a model from experience. Solves control problems by planning algorithms such as Policy iteration or Value iteration.S0a1a0S1a1a0a0S2a1Fig. 7.2: A model for discrete state and action space problems is generally anMDP (source: www.wikipedia.org, by Waldoalvarez CC BY-SA 4.0)Oliver WallscheidRL Lecture 074

What is a Model? A model M is an MDP tuple ⟨X , U, P, R, γ⟩. In particular, we require the state-transition probabilityP P [Xk 1 xk 1 Xk xk , Uk uk ](7.1) and the reward probabilityR P [Rk 1 rk 1 Xk xk , Uk uk ] .(7.2) State space X and action space U is assumed to be known. Discount factor γ might be given by environment or engineer’s choice. What kind of model is available? If M is perfectly known a priori: true MDP. If M̂ M needs to be learned: approximated MDP.Oliver WallscheidRL Lecture 075

Model Learning / Identification Ideally, the model M is exactly known a priori (e.g., gridworld case). On the contrary, a model might be too complex to derive or notexactly available (e.g., real physical systems). Objective: estimate model M̂ from experience {X0 , U0 , R1 , . . . , XT }. This is a supervised learning / system identification task:{X0 , U0 } {X1 , R1 }.{XT 1 , UT 1 } {XT , RT } Simple tabular / look-up table approach (with n(x, u) visit count):Tp̂uxx′X1 1(Xk 1 x′ Xk x, Uk u),n(x, u)1R̂ux n(x, u)Oliver Wallscheidk 0TX(7.3)1(Xk x Uk u)rk 1 .k 0RL Lecture 076

Distribution vs. Sample Models A model based on P and R is called a distribution model. Contains descriptions of all possibilities by random distributions. Has full explanatory power, but is still rather complex to obtain. Alternatively, use sample models to receive realization series. Remember black jack examples: easy to sample by simulation but hardto model a full distributional MDP.Fig. 7.3: Depending on the application distribution models are easily available ornot (source: Josh Appel on Unsplash)Oliver WallscheidRL Lecture 077

Model-free RLe. If decision making and model learning are both computation-intensivehe available computational resources may need to be divided betweenexploring these issues, in this section we present Dyna-Q, a simple theLearnfunctionsfrom experience.gratingmajor valuefunctionsneeded in and/oran online policyplanningdirectlyagent. Eachin Dyna-Qin a simple,trivial,form.In subsequentwe Requiresnoalmostmodelat all(policycan besectionsconsideredan implicit model).the alternateways controlof achievingeach functionthe trade-o sbetween such as Solvesproblemsby andlearningalgorithmswe seek merely to illustrate the ideas and stimulate your intuition. Monte-Carlo,ning agent, there are at least two roles for real experience: it can be makeSarsait orthe model (tomore accurately match the real environment)ed to directly improvethe value function and policy using the kinds ofQ-learning.rning methods we have discussedvalue/policyters. The former we call modellatter we call direct reinforcementactingRL). The possible relationshipsplanningdirectce, model, values, and policy areRLhe diagram to the right. Each aronship of influence and presumedote how experience can improveexperiencemodelnd policies either directly or inmodel. It is the latter, which ismodelindirect reinforcement learning,learningn planning.d irect methodsFig. 7.4:If ahaveperfecta prioriis not available,RL can be realized directly oruse of a limitedamountof experienceand thusa betterpolicyindirectly(source:R. SuttonandachieveG. Barto,Reinforcementlearning: anonmental interactions. On the other hand, direct methods are muchintroduction,2018,CCBY-NC-ND2.0)not a ected by biases in the design of the model. Some have arguedhods are alwayssuperior to direct ones, while othershaveOliver WallscheidRL Lecture07 argued that8

Advantages & Drawbacks: Model-free vs. Model-based RLPro model-based / indirect RL:Efficiently uses limited amount of experience (e.g., by replay).Allows integration of available a priori knowledge.Learned models might be re-used for other tasks (e.g., monitoring).Pro model-free / direct RL: Is simpler to implement (only one task, not two consequent ones). Not affected by model bias / error during model learning.Oliver WallscheidRL Lecture 079

Table of Contents1Repetition: Model-based and Model-free RL2Dyna: Integrated Planning, Acting and Learning3Prioritized Sweeping4Update Variants5Planning at Decision TimeOliver WallscheidRL Lecture 0710

applying reinforcement learning methods to the simulated experiences just as if they hadhappened.DynaTypically,as in Dyna-Q, the (1)same reinforcement learning method isThereallyGeneralArchitectureused both for learning from real experience and for planning from simulated experience.The reinforcement learning method is thus the “final common path” for both learning andProposedR. Suttonin 1990’splanning. byLearningand planningare deeply integrated in the sense that they shareall thesame machinery,in the sourceof their experience. almostGeneralframeworkwithdi eringmany onlydifferentimplementationvariantsPolicy/value functionsplanning updatedirect ningsearchcontrolModelEnvironmentFigureThe generalDyna Architecture.experience,back andforth betweenFig.7.6:8.1:Dynaframework(source: R.RealSuttonand passingG. Barto,Reinforcementthe environment and the policy, a ects policy and value functions in much the same way as doeslearning:an introduction,CC BY-NC-ND 2.0)simulated experiencegeneratedby the model of2018,the environment.Conceptually,OliverWallscheidplanning, acting, model-learning,RL Lecture 07 and direct RL occur simultaneously11

8.2. Dyna: Integrated Planning, Acting, and Learning163The General Dyna Architecture (2)During planning, the Q-planning algorithm randomly samples only from state–actionpairs that have previously been experienced (in Step 1), so the model is never queriedwith a pair about which it has no information.The overall architecture of Dyna agents, of which the Dyna-Q algorithm is one example,is shown in Figure 8.1. The central column represents the basic interaction betweenagent and environment, giving rise to a trajectory of real experience. The arrow on theleft of the figure represents direct reinforcement learning operating on real experience toimprove the value function and the policy. On the right are model-based processes. Themodel is learned from real experience and gives rise to simulated experience. We use theterm search control to refer to the process that selects the starting states and actionsfor the simulated experiences generated by the model. Finally, planning is achieved byapplying reinforcement learning methods to the simulated experiences just as if they hadreally happened. Typically, as in Dyna-Q, the same reinforcement learning method isused both for learning from real experience and for planning from simulated experience.The reinforcement learning method is thus the “final common path” for both learningand planning. Learning and planning are deeply integrated in the sense that they sharealmost all the same machinery, di ering only in the source of their experience. Direct RL update: any model-free algorithm: Q-learning, Sarsa, . Model learning: In tabular case: simple distribution estimation as in (7.3) Simple experience buffer to re-apply model-free algorithm For large or continuous state/action spaces: function approximation bysupervised learning / system identification (next lecture) Search control: strategies for selecting starting states and action togenerate simulated experiencePolicy/value functionsplanning updatedirect ningsearchcontrolModelEnvironmentFigure 8.1: The general Dyna Architecture. Real experience, passing back and forth betweenthe environment and the policy, a ects policyand valueOliver WallscheidRL Lecture07functions in much the same way as does12

Algorithmic Implementation: Dyna-Qparameter: α {R 0 α 1} , n {N n 1} (planning steps per real step)init: q̂(x, u) arbitrary (except terminal) and M̂(x, u) {x X , u U}for j 1, 2, . . . episodes doInitialize x0 ;k 0;repeatChoose uk from xk using a soft policy derived from q̂(x, u);Take action uk , observe rk 1 and xk 1 ;q̂(xk , uk ) q̂(xk , uk ) α [rk 1 γ maxu q̂(xk 1 , u) q̂(xk , uk )];M̂(xk , uk ) {rk 1 , xk 1 } (assuming deterministic env.);for i 1, 2, . . . n dox̃i random previously visited state;ũi random previously taken action in x̃i ;{r̃i 1 , x̃i 1 } M̂(x̃i , ũi );q̂(x̃i , ũi ) q̂(x̃i , ũi ) α [r̃i 1 γ maxu q̂(x̃i 1 , u) q̂(x̃i , ũi )];k k 1;until xk is terminal;Algo. 7.1: Dyna with Q-learning (Dyna-Q)Oliver WallscheidRL Lecture 0713

Remarks on Dyna-Q ImplementationThe specific Dyna-Q characteristics are:Direct RL update: Q-learning,Model: simple memory buffer of previous real experience,Search strategy: random choices from model buffer.Moreover: Number of Dyna planning steps n is to be delimited from n-stepbootstrapping (same symbol, two interpretations). Without the model M̂ one would receive one-step Q-learning. The model-based learning is done n times per real environmentinteraction: Previous real experience is re-applied to Q-learning. Can be considered a background task: choose max n s.t. hardwarelimitations (prevent turnaround errors). For stochastic environments: use a distributional model as in (7.3). Update rule then may be modified from sample to expected update.Oliver WallscheidRL Lecture 0714

Maze Example (1)Dyna: Integrated Planning, Acting, and Learning165G800Sactions600Stepsperepisode0 planning steps(direct RL only)4005 planning steps50 planning steps2001421020304050EpisodesFig. 7.7: Applying Dyna-Q with differentplanning steps n to simple maze (source: R.Sutton and G. Barto, Reinforcement learning: anintroduction, 2018, CC BY-NC-ND 2.0) Maze with obstacles(gray blocks) Start at S and reach G rT 1 at G Episodic task withγ 0.95 Step size α 0.1 Exploration ε 0.1 Averaged learningcurvese 8.2: A simple maze (inset) and the average learning curves for Dyna-Q agents varyingr number of planning steps (n) per real step. The task is to travel from S to G as quicklysible.ure 8.3 shows why the planning agents found the solution so much faster thanonplanning agent. Shown are the policies found by the n 0 and n 50 agentsay through the second episode. Without planning (n 0), each episode adds onlydditional stepOliverto thepolicy, and so only one step (the last)has beenWallscheidRL Lecture07 learned so far.15

Maze Example (2)Figure 8.3 shows why the planning agents found the solution so much faster thann 0 andn 50agentstheagent.anShownthe policiesfoundpolicyby the (equal nonplanningBlocks withoutarrowaredepicta neutralactionvalues).halfway through the second episode. Without planning (n 0), each episode adds only additionalBlack squarespositionduringsecondepisode.onestep toindicatethe policy,agent’sand so onlyone step(the last)has beenlearned so far.Withplanning,planningagain only(none steplearnedduring theepisode,but here Without0), iseachepisodesonlyfirstaddsone newitemduringtothe second episode an extensive policy has been developed that by the end of the episodethe policy.will reach almost back to the start state. This policy is built by the planning process planning 50),nearthe theavailableexperienceis ofefficientlyutilized.awhileWiththe agentis still (nwanderingstart state.By the endthe third rfectperformanceattained. After the third episode, the planning agent found the optimal policy.WITHOUT PLANNING ( n 0)GWITH PLANNING ( n 50)GSSFigure8.3: PoliciesPolicies foundby planningDyna-Qagentsthroughhalfway throughFig. 7.8:(greedyaction) andfor nonplanningDyna-Q agenthalfwaysecond thesecondepisodeepisode. (source:The arrowsgreedyaction ineach state; if nolearning:arrow is shownR.indicateSuttontheandG. Barto,Reinforcementan for astate, then all of its action values were equal. The black square indicates the location of theintroduction,2018,CCBY-NC-ND2.0)agent.Oliver WallscheidRL Lecture 0716

What if the Model is Wrong?Possible model error sources: A provided a priori model may be inaccurate (expert knowledge). Environment behavior changes over time (non-stationary). Early-stage model is biased due to learning process. If function approximators are used: generalization error (cf. lecture 08and following).Consequences: Model errors are likely to lead to a suboptimal policy. If lucky: errors are quickly discovered and directly corrected bydefault, random exploration. Nevertheless, more intelligent exploration / correction strategies mightbe useful (compared to random actions as in ε-greedy strategies).Oliver WallscheidRL Lecture 0717

The Blocking Maze ExampleModel Is WrongGGS167S150 Maze with a changingobstacle line Start at S and reach G rT 1 at G Dyna-Q encouragesCumulativerewardintelligent explorationDyna-Q(upcoming slides) Dyna-Q requires more00100020003000steps in order toTime stepsovercome the blockadeAverageof Dynaagents on layouta blockingtask.The left environmentFig. performance7.9: Maze witha changingafter1000 Averagedhe first 1000 steps, the right environment for the rest. Dyna-Q is Dyna-Qwith learningsteps illustrates a model error (source: R. Suttonbonus that encourages exploration.curvesDyna-Q and G. Barto, Reinforcement learning: anintroduction, 2018, CC BY-NC-ND 2.0)3: ShortcutMazeOliver WallscheidRL LectureG 07G18

The Shortcut Maze ExampleeGGd oftedFigSSh isbar400eps,upDyna-Q urbDyna-Qht). Cumulativerewardlarthezedhat0e it030006000tepTime stepsvenery Fig.FigureAverageof Dynaagentsafteron7.10:8.5:Mazewithperformancean additionalshortcutso a shortcut task. The left environment was used for the3000 steps (source: R. Sutton and G. Barto,dis- first 3000 steps, the right environment for the rest.Reinforcement learning: an introduction, 2018,CC BY-NC-ND 2.0)nother version of the conflict between exploration andOliver WallscheidRL Lecture 07ext, explorationmeans trying actions that improvethe Maze opens a shortcutafter 3000 steps Start at S and reach G rT 1 at G Dyna-Q with randomexploration is likely notfinding the shortcut Dyna-Q explorationstrategy is able tocorrect internal model Averaged learningcurves19

Dyna-Q ExtensionsCompared to default Dyna-Q in Algo. 7.1, Dyna-Q contains thefollowing extensions: Search heuristic: add κ τ to regular reward. τ : is the number of time steps a state-action transition has not beentried. κ: is a small scaling factor κ {R 0 κ}. Agent is encouraged to keep testing all accessible transitions. Actions for given states that had never been tried before are allowedfor simulation-based planning. Initial model for that: actions lead back to same state without reward.Oliver WallscheidRL Lecture 0720

Table of Contents1Repetition: Model-based and Model-free RL2Dyna: Integrated Planning, Acting and Learning3Prioritized Sweeping4Update Variants5Planning at Decision TimeOliver WallscheidRL Lecture 0721

Background and Idea Dyna-Q randomly samples from the memory buffer. Many planning updates maybe pointless, e.g., zero-valued stateupdates during early training. In large state-action spaces: inefficient search since transitions arechosen far away from optimal policies. Better: focus on important updates. In episodic tasks: backward focusing starting from the goal state. In continuing tasks: prioritize according to impact on value updates. Solution method is called prioritized sweeping. Build up a queue of every state-action pair whose value would changesignificantly. Prioritize updates by the size of change. Neglect state-action pairs with only minor impact.Oliver WallscheidRL Lecture 0722

Algorithmic Implementation: Prioritized Sweepingparameter: α {R 0 α 1} , n {N n 1} , θ {R θ 0}init: q̂(x, u) arbitrary and M̂(x, u) {x X , u U}, empty queue Qfor j 1, 2, . . . episodes doInitialize x0 and k 0;repeatChoose uk from xk using a soft policy derived from q̂(x, u);Take action uk , observe rk 1 and xk 1 ;M̂(xk , uk ) {rk 1 , xk 1 } (assuming deterministic env.);P rk 1 γ maxu q̂(xk 1 , u) q̂(xk , uk ) ;if P θ then insert {xk , uk } in Q with priority P ;for i 1, 2, . . . n while queue Q is not empty do{x̃i , ũi } arg maxP (Q);{r̃i 1 , x̃i 1 } M̂(x̃i , ũi );q̂(x̃i , ũi ) q̂(x̃i , ũi ) α [r̃i 1 γ maxu q̂(x̃i 1 , u) q̂(x̃i , ũi )];for {x, u} predicted to lead to x̃i dor predicted reward for {x, u, x̃i };P r γ maxu q̂(x̃i , u) q̂(x, u) ;if P θ then insert {x, u} in Q with priority P ;k k 1;until xk is terminal;Oliver WallscheidRL Lecture 0723

Remarks on Prioritized Sweeping ImplementationThe specific prioritized sweeping characteristics are: Direct RL update: Q-learning, Model: simple memory buffer of previous real experience, Search strategy: prioritized updates based on predicted value change.Moreover: θ is a hyperparameter denoting the update significance threshold. Prediction step regarding x̃i is a backward search in the model buffer. For stochastic environments: use a distributional model as in (7.3). Update rule then may be modified from sample to expected update.Oliver WallscheidRL Lecture 0724

Comparing Against Dyna-Q on Simple Maze Examplegndn0.t. BackupsUpdatesUpdatesuntiluntileoptimale optimalsolutiony 41031021004794186 376 752 1504 3008 6016Gridworld size (#states)Fig. 7.11: Comparison of prioritized sweepingand Dyna-Q on simple maze (source: R. Suttonstochastic environments are straightforward. Theand G. Barto, Reinforcement learning: anof the number of times each state–action pair hasintroduction, 2018, CC BY-NC-ND 2.0)tates were. It is natural then to update each pairbeen using so far, but with an expected update,Oliver WallscheidRL Lecture 07ates and theirprobabilities of occurring. Environment frameworkas in Fig. 7.7 But: changing mazesizes (number of states) Both methods canutilize up to n 5planning steps Prioritized sweepingfinds optimal solution5-10 times quicker25

Table of Contents1Repetition: Model-based and Model-free RL2Dyna: Integrated Planning, Acting and Learning3Prioritized Sweeping4Update Variants5Planning at Decision TimeOliver WallscheidRL Lecture 0726

Update Rule Alternatives Dyna updates (search strategy) are not bound to Q-learning duringplanning and can be exchanged in many ways (see Fig. 7.12). Even evaluating a fixed policy π in terms of vπ (x) and qπ (x, u) ispossible.Expected updates(DP)policy evaluationvalue evaluationQ-policy evaluationQ-value iterationSample updates(one-step TD)TD(0)SarsaQ-learningFig. 7.12: Possible one-step updates: alternatives for DynaOliver WallscheidRL Lecture 0727

Advantages & Drawbacks: Expected vs. Sampled UpdatesPro expected updates: Delivers more accurate value estimates (no sampling error).Pro sample updates: Is computational cheaper (e.g., distributional model not required).Leads to trade-off: Estimation accuracy vs. computational burden. Evaluate decision on given problem, i.e., how many state-action pairshave to be evaluated for a new expected update? Utilize branching factor b metric: corresponds to number of possiblenext states x′ with p(x′ x, u) 0. If expected update is available, this will be roughly as accurate as bsamplings. If only incomplete expected update is available, prefer samplingsolution (often applies to large state-action spaces).Oliver WallscheidRL Lecture 0728

Example for Sampled vs. Expected Updates Artificial prediction task where all b successor states are equally likely Method initialization such that RMS error is always one 174Sample updates perform particularlywell andforLearninglarge branchingfactors bChapter 8: Planningwith Tabular Methods(takeaway message: if facing large stochastic problems, use sampling)1expectedupdatessampleupdatesb 2 (branching factor)RMS errorin valueestimateb 10b 100b 1000b 10,000001b2bQ(s0 , a0 ) computationsNumber of max0aFigure 8.7: ofComparisonof efficiencyof expectedand sampleupdates.Comparisonexpectedvs. sampledupdates(source:R.Fig. 7.13:Sutton andG. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)b successor states are equally likely and in which the error in the initial estimate isRL Lecture correct,071.OliverTheWallscheidvalues at the next states are assumedso the expected update reduces29

Alternatives on Distributing the UpdatesRecap: Dynamic programming: sweep through the entire state(-action) space Dyna-Q: random uniform sampling Mutual problem: irrelevant updates decrease computational efficiencyAlternative: update according to on-policy distribution Based on sampling along the encountered state(-action) pairs(trajectory sweeping) Based on explicit on-policy distribution In both cases: ignore vast, uninteresting parts of the problem space atthe risk of updating same old parts all over againOliver WallscheidRL Lecture 0730

state–action pairs. In addition, on all transitions there was a 0.1tion to the terminal state, ending the episode. The expected rewardas selected from a Gaussian distribution with mean 0 and variance 1.e planning processhaustively computeon-policye of the start stateb 13cy, , given the cur1,000 STATESction Q, as an indiuniformExemplary Update Distribution ComparisonExample: Two actions per stateb 3policy Both actions led to bb 10of the figure toults averaged overnext statesth 1000 states and1, 3, and 10. TheComputation time, in fullbackupsexpectedupdates 10 % probability offound is plotted asmber of expected uptransition to terminalall cases, samplingb 1policy distributionstatenning initially andthe long run. The10,000 STATES reward per transition:and the initial pe- Value ofstateing was longer, at startunderN (µ 0, σ 2 1)ctors. In other ex- greedypolicythat these e ects Task: estimate startr as the number ofexample, the lowerhows results for astate value1 for tasks withexpectedupdatesComputation time, in fullbackupsis case the advan 200 randomly generatedcusing is large and Figure 8.8: Relative efficiency of updates disFig. 7.14: tributedUpdatedistribution comparison (source:uniformly across the state space versusundiscounted episodicsimulatedon-policytrajectories, eachmake sense.In the focusedR. SuttonandonG.Barto,Reinforcementlearning:g according to the starting in the same state. Results are for randomlyrunsgenerated tasks of 2018,two sizes andvariousbranchingan introduction,CCBY-NC-ND2.0)e agent would do on Value ofhich it acted greed- start stateunderuming the model iform2100n helps by focusingear descendants of factors, b.Oliver Wallscheid50,000RL Lecture 0731

Table of Contents1Repetition: Model-based and Model-free RL2Dyna: Integrated Planning, Acting and Learning3Prioritized Sweeping4Update Variants5Planning at Decision TimeOliver WallscheidRL Lecture 0732

Background Planning vs. Planning at Decision TimeBackground Planning (discussed so far): Gradually improves policy or value function if time is available. Backward view: re-apply gathered experience. Feasible for fast execution: policy or value estimate are available withlow latency (important, e.g., for real-time control).Planning at decision time1 (not yet discussed alternative): Select single next future action through planning. Forward view: predict future trajectories starting from current state. Typically discards previous planning outcomes (start from scratchafter state transition). If multiple trajectories are independent: easy parallel implementation. Most useful if fast responses are not required (e.g., turn-basedgames).1Can be interpreted as model predictive control in an engineering context.Oliver WallscheidRL Lecture 0733

e ective.The distribution of updates can be altered in similar ways to focus on the currentstate and its likely successors. As a limiting case we might use exactly the methods ofheuristic search to construct a search tree, and then perform the individual, one-stepupdates from bottom up, as suggested by Figure 8.9. If the updates are ordered in thisway and a tabular representation is used, then exactly the same overall update wouldbe achieved as in depth-first heuristic search. Any state-space search can be viewed inthis way as the piecing together of a large number of individual one-step updates. Thus,the performance improvement observed with deeper searches is not due to the use ofmultistep updates as such. Instead, it is due to the focus and concentration of updateson states and actions immediately downstream from the current state. By devoting alarge amount of computation specifically relevant to the candidate actions, decision-timeplanning can produce better decisions than can be produced by relying on unfocusedupdates.Heuristic Search Develop tree-like continuations from each state encountered. Approximate value function at leaf nodes (using a model) and backup towards the current state. Choose action according to predicted trajectory with highest value. Predictions are normally discarded (new search tree in each state).311086245978.9: Heuristicsearchtreecan be withimplementedas a sequenceof one-stepupdates (shownFig. 7.15:FigureHeuristicsearchexemplaryorderof back-upoperationshere outlined in blue) backing up values from the leaf nodes toward the root. The orderinghere isandfor a selectivedepth-firstsearch.(source: R. shownSuttonG. Barto,Reinforcementlearning: an introduction, 2018,CC BY-NC-ND 2.0)Oliver Wallscheid8.10RL Lecture 07Rollout Algorithms34

Rollout AlgorithmsRollout policy Similar to heuristic search, but: simulate trajectories following arollout policy. Use Monte Carlo estimates of action value only for current state toevaluate on best action. Gradually improves rollout policy but optimal policy might not befound if rollout sequences are too short. Predictions are normally discarded (new rollout in each state).Fig. 7.16: Simplified processing diagram of rollout algorithmsOliver WallscheidRL Lecture 0735

of state–action pairs that are most likely to be reached in a few steps, which form a treerooted at the current state, as illustrated in Figure 8.10. MCTS incrementally extendsthe tree by adding nodes representing states that look promising based on the results ofthe simulated trajectories. Any simulated trajectory will pass through the tree and thenexit it at some leaf node. Outside the tree and at the leaf nodes the rollout policy is usedfor action selections, but at the states inside the tree something better is possible. Forthese states we have value estimates for of at least some of the actions, so we can pickaccumulates values estimates from former MC simulations,among them using an informed policy, called the tree policy, that balances explorationMonte Carlo Tree Search (MCTS) Rollout algorithm, but: makes use of an informed tree policy (e.g., ε-greedy).Repeat while time pRolloutPolicyMonte Carlo Tree Search. When the environment changes to a new state, MCTSFig. 7.17:FigureBasic8.10:blocksof MCTSalgorithmsR. Sutton and G.executesasbuildingmany iterationsas possiblebefore an actionneeds to be(source:selected, incrementallya tree whose rootnode representscurrent state. Each 2018,iteration consistsof the fourBarto, buildingReinforcementlearning:antheintroduction,CC BY-NC-ND2.0)operations Selection, Expansion (though possibly skipped on some iterations), Simulation,and Backup, as explained in the text and illustrated by the bold arrows in the trees. AdaptedChaslot, Bakkes, Szita, and SpronckRL(2008).Oliver fromWallscheidLecture 0736

Basic MCTS ProcedureRepeat the following steps while prediction time is available:1 Selection: Starting at root node, use a tree policy (e.g., ε-greedy) totravel through the tree until arriving at a leaf node. The tree policy exploits auspicious tree regions while maintaining someexploration. It is improved and (possibly) extended in every simulation run.23Expansion: Add child node(s) to the leaf node by evaluatingunexplored actions (optional step).Simulation: Simulate the remaining full episode using the rolloutpolicy starting from the leaf or child node (if available). The rollout policy could be random, pre-trained or based on model-freemethods using real experience (if available).4Backup: Update the values along the traveled trajectory but onlysaves those within the tree policy.Oliver WallscheidRL Lecture 0737

Further MCTS RemarksWhat is happening after reaching the feasible simulation runs? After time is up, MCTS picks an appropriate action regarding theroot node, e.g.: The action visited the most times during all simulation runs or The action having the largest action value. After transitioning to a new state, the MCTS procedure re-starts

applying reinforcement learning methods to the simulated experiences just as if they had really happened. Typically, as in Dyna-Q, the same reinforcement learning method is used both for learning from real experience and for planning from simulated experience. The reinforcement learning method is thus the ÒÞnal common pathÓ for both learning