DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction


DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction

Aviral Kumar, Abhishek Gupta, Sergey Levine
Electrical Engineering and Computer Sciences, UC Berkeley
aviralk@berkeley.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. In this paper, we study how RL methods based on bootstrapping-based Q-learning can suffer from a pathological interaction between function approximation and the data distribution used to train the Q-function: with standard supervised learning, online data collection should induce corrective feedback, where new data corrects mistakes in old predictions. With dynamic programming methods like Q-learning, such feedback may be absent. This can lead to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse or delayed rewards. Based on these observations, we propose a new algorithm, DisCor, which explicitly optimizes for data distributions that can correct for accumulated errors in the value function. DisCor computes a tractable approximation to the distribution that optimally induces corrective feedback, which we show results in reweighting samples based on the estimated accuracy of their target values. Using this distribution for training, DisCor results in substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals.

1  Introduction

Reinforcement learning (RL) algorithms, when combined with high-capacity deep neural net function approximators, have shown promise in domains ranging from robotic manipulation [22] to recommender systems [44]. However, current deep RL methods can be difficult to use: they require delicate hyperparameter tuning, and exhibit inconsistent performance. While a number of hypotheses have been proposed to understand these issues [15, 52, 11, 10], and gradual improvements have led to more stable algorithms in recent years [14, 18], an effective solution has proven elusive. We hypothesize that one source of instability in reinforcement learning with function approximation and value function estimation, such as Q-learning [53, 38, 33] and actor-critic algorithms [13, 23], is a pathological interaction between the data distribution induced by the latest policy, and the errors induced in the learned approximate value function as a consequence of training on this distribution.

While a number of prior works [1, 10, 29] have provided theoretical examinations of various approximate dynamic programming (ADP) methods, which include Q-learning and actor-critic algorithms, prior work has not extensively studied the relationship between the data distribution induced by the latest value function and the errors in the future value functions obtained by training on this data. When using supervised learning style procedures to train contextual bandits or dynamics models, online data collection results in a kind of "hard negative" mining: the model collects transitions that lead to good outcomes according to the model (potentially erroneously). This results in collecting precisely the data needed to correct errors and improve.

On the contrary, ADP algorithms that use bootstrapped targets rather than ground-truth target values may not enjoy such corrective feedback with online data collection in the presence of function approximation.

Since function approximation couples Q-values at different states, the data distribution under which ADP updates are performed directly affects the learned solution. As we will argue in Section 3, online data collection may give rise to distributions that fail to correct errors in Q-values at states that are used as bootstrapping targets, due to this coupling effect. If the bootstrapping targets in ADP updates are themselves erroneous, then any form of Bellman error minimization using these targets may not result in the correction of errors in the Q-function, leading to poor performance. In this work, we show that we can explicitly address this by modifying the ADP training routine to re-weight the data buffer to a distribution that explicitly optimizes for corrective feedback, giving rise to our proposed method, DisCor. With DisCor, transitions sampled from the data buffer are reweighted with weights that are inversely proportional to the estimated errors in their target values. Thus, transitions with erroneous targets are down-weighted. We will show how this simple modification to ADP improves corrective feedback, and increases the efficiency and stability of ADP algorithms.

The main contribution of our work is to propose a simple modification to ADP algorithms to provide corrective feedback during the learning process, which we call DisCor. We show that DisCor can be derived from a principled objective that results in a simple algorithm that reweights the training distribution based on estimated target value error, so as to mitigate error accumulation. DisCor is general and can be used in conjunction with modern deep RL algorithms, such as DQN [33] and SAC [14]. Our experiments show that DisCor substantially improves the performance of standard RL methods, especially in challenging multi-task RL settings. We evaluate our approach on both continuous control tasks and discrete-action, image-based Atari games. On the multi-task MT10 benchmark [56] and several robotic manipulation tasks, our method learns policies with a final success rate that is 50% higher than that of SAC.

2  Preliminaries

The goal in reinforcement learning is to learn a policy that maximizes the expected cumulative discounted reward in a Markov decision process (MDP), defined by a tuple (S, A, P, R, γ). S and A represent the state and action spaces, P(s' | s, a) and r(s, a) represent the dynamics and reward function, and γ ∈ (0, 1) represents the discount factor. ρ_0(s) is the initial state distribution. The infinite-horizon, discounted marginal state distribution of the policy π(a | s) is denoted d^π(s), and the corresponding state-action marginal is d^π(s, a) = d^π(s) π(a | s). We define P^π, the state-action transition matrix under a policy π, as (P^π Q)(s, a) := E_{s' ~ P(·|s,a), a' ~ π(·|s')}[Q(s', a')].

Approximate dynamic programming (ADP) algorithms, such as Q-learning and actor-critic methods, aim to acquire the optimal policy by modeling the optimal state (V*(s)) and state-action (Q*(s, a)) value functions, obtained by recursively iterating the Bellman optimality operator B*, defined as (B*Q)(s, a) = r(s, a) + γ E_{s' ~ P}[max_{a'} Q(s', a')].
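To make these operators concrete, the following is a minimal NumPy sketch of the Bellman optimality backup B* and the state-action transition operator P^π for a small tabular MDP; the toy MDP itself (sizes, random dynamics and rewards) is an illustrative assumption, not part of the paper:

import numpy as np

# Toy tabular MDP, assumed purely for illustration.
S, A = 4, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = transition probability
r = rng.uniform(size=(S, A))                 # r[s, a]     = reward
gamma = 0.9

def bellman_optimality(Q):
    """(B*Q)(s, a) = r(s, a) + gamma * E_{s'~P(.|s,a)}[ max_a' Q(s', a') ]."""
    return r + gamma * P @ Q.max(axis=1)

def apply_P_pi(Q, pi):
    """(P^pi Q)(s, a) = E_{s'~P(.|s,a), a'~pi(.|s')}[ Q(s', a') ], with pi[s, a] = pi(a|s)."""
    return P @ (pi * Q).sum(axis=1)

# In the tabular case, repeatedly applying B* (value iteration) converges to Q*.
Q = np.zeros((S, A))
for _ in range(1000):
    Q = bellman_optimality(Q)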
With function approximation, these algorithms project the values of the Bellman optimality operator B* onto a family of Q-function approximators Q (e.g., deep neural nets) under a sampling or data distribution μ, such that Q^{k+1} = Π_μ(B*Q^k), where

    \Pi_\mu(Q) \overset{\mathrm{def}}{=} \arg\min_{Q' \in \mathcal{Q}} \; \mathbb{E}_{s,a \sim \mu}\big[(Q'(s,a) - Q(s,a))^2\big].    (1)

Q-function fitting is usually interleaved with additional data collection, which typically uses a policy derived from the latest value function, augmented with either ε-greedy [54, 33] or Boltzmann-style [14, 45] exploration. For commonly used ADP methods, μ simply corresponds to the on-policy state-action marginal, μ_k = d^{π_k} (at iteration k), or else a "replay buffer" [14, 33, 27, 28] formed as a mixture distribution over all past policies, such that μ_k = (1/k) Σ_{i=1}^{k} d^{π_i}. However, as we will show in this paper, the choice of the sampling distribution μ is of crucial importance for the stability and efficiency of ADP algorithms. We analyze this issue in Section 3, and then discuss a potential solution to this problem in Section 5.
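Equation 1 can be made concrete with a linear Q-function class: the projection Π_μ is then just a weighted least-squares fit of the backed-up values under the sampling distribution μ. The feature map and target values below are illustrative placeholders (not the paper's setup); the point of the sketch is only that different choices of μ generally produce different projected Q-functions once shared features couple state-action pairs:

import numpy as np

# Illustrative sketch of Eq. (1): Q_{k+1} = Pi_mu(B* Q_k) with a linear Q-function class.
S, A, d = 4, 2, 3
rng = np.random.default_rng(0)
phi = rng.normal(size=(S, A, d))            # assumed feature map phi(s, a) in R^d
target = rng.uniform(size=(S, A))           # stands in for the backup (B* Q_k)(s, a)

def project(target_Q, mu):
    """Pi_mu(target_Q) = argmin_w sum_{s,a} mu(s,a) * (phi(s,a)^T w - target_Q(s,a))^2."""
    X = phi.reshape(S * A, d)
    y = target_Q.reshape(S * A)
    sqrt_mu = np.sqrt(mu.reshape(S * A))
    w = np.linalg.lstsq(X * sqrt_mu[:, None], y * sqrt_mu, rcond=None)[0]
    return (X @ w).reshape(S, A)

# Two different sampling distributions give two different projected Q-functions,
# because the shared weights w couple Q-values across state-action pairs.
uniform_mu = np.full((S, A), 1.0 / (S * A))
skewed_mu = rng.dirichlet(np.ones(S * A)).reshape(S, A)
Q_uniform = project(target, uniform_mu)
Q_skewed = project(target, skewed_mu)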

3  Corrective Feedback in Q-Learning

When learning with supervised regression (i.e., non-bootstrapped objectives) onto the true value function (e.g., in a bandit setting), active data collection methods will visit precisely those state-action tuples that have erroneously optimistic values, observe their true values, and correct the errors by fitting these true values. However, ADP methods that use bootstrapped target values may not be able to correct errors this way, and online data collection may not reduce the error between the current Q-function and Q*, especially when function approximation is employed to represent the Q-function. This is because function approximation error can result in erroneous bootstrap target values at some state-action tuples. Visiting these tuples more often will simply cause the function approximator to more accurately fit these incorrect target values, rather than correcting the target values themselves. As we will show, the states that are the cause of incorrect target values at other states can be extremely infrequent in the data obtained by running the policy. Therefore, their values will not be corrected, leading to more error propagation.

Figure 1: Left: Depiction of a possible run of Q-learning iterations on a tree-structured MDP with on-policy sampling (the panels show successive iterations; shading indicates intermediate values of error, from high to low, and marks the states being updated). The trajectory sampled at each iteration is shown with dotted boundaries. Function approximation results in aliasing (coupling) of the box-shaped and circle-shaped nodes (i.e., instances of each shape have similar feature values). Updating the values at one circle node affects all other circles, and likewise for boxes. Regressing to erroneous targets at one circle node may induce errors at another circle node, even if the other node has a correct target, simply because the other node is visited less often. Right: If a distribution is chosen that puts higher probability on nodes with correct target values, i.e., one that moves from the leaves to nodes higher up, then the effects of function approximation aliasing are reduced, and correct Q-values can be obtained.

Didactic example. To build intuition for the phenomenon, consider the tree-structured MDP example in Figure 1. We illustrate a potential run of Q-learning (Alg. 2) with on-policy data collection. Q-values at different states are updated to match their (potentially incorrect) bootstrap target values under a distribution μ(s, a), which, in this case, is dictated by the visitation frequency under the current policy (Equation 1). The choice of μ(s, a) does not affect the resulting Q-function when function approximation is not used, as long as μ is full-support, i.e., μ(s, a) > 0 for all s, a.

However, with function approximation, updates across state-action pairs affect each other. Erroneous updates higher up in the tree, trying to match incorrect target values, may prevent error correction at leaf nodes if the states have similar representations under function approximation (i.e., if they are partially aliased). States closer to the root have higher frequencies (because there are fewer of them) than the leaves, exacerbating this problem. This issue can compound: the resulting erroneous leaf values are again used as targets for other nodes, which may have higher frequencies, further preventing the leaves from learning correct values.

If we can instead train with a μ(s, a) that puts higher probability on nodes with correct target values, we can alleviate this issue. We would expect that such a method would first fit the most accurate target values (at the leaves), and only then update the nodes higher up, as shown in Figure 1 (right). Our proposed algorithm, DisCor, shows how to construct such a distribution in Section 5.

Value error in ADP. To more formally quantify, and devise solutions to, this issue, we first define our notion of error correction in ADP in terms of value error:

Definition 3.1. The value error is defined as the error of the current Q-function, Q_k, to the optimal Q*, averaged under the on-policy (π_k) marginal d^{π_k}(s, a): E_k = E_{d^{π_k}}[|Q_k − Q*|].

A smooth decrease in the value error E_k indicates effective error correction in the Q-function. If E_k fluctuates or increases, the algorithm is making poor learning progress. When the value error E_k is roughly stagnant at a non-zero value, this indicates premature convergence. The didactic example (Fig. 1) suggests that the value error E_k for ADP may not smoothly decrease to 0, and can even increase with function approximation.

To analyze this phenomenon computationally, we use the gridworld MDPs from Fu et al. [10] and visualize the correlations between policy visitations d^{π_k}(s, a) and the value of the Bellman error after the ADP update, i.e., |Q_{k+1} − B*Q_k|(s, a), as well as the correlation between visitations and the difference in value errors after and before the update, E_{k+1}(s, a) − E_k(s, a). We eliminate finite sampling error by training on all transitions, simply weighting them by the true on-policy or replay buffer distribution. Details are provided in Appendix G.1.
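In a tabular diagnostic setting where Q* and the on-policy marginal are available (as in the gridworld analysis above), the value error of Definition 3.1 is a one-line computation; the arrays below are placeholders for such oracle quantities:

import numpy as np

def value_error(Q_k, Q_star, d_pik):
    """E_k = E_{d^{pi_k}}[ |Q_k - Q*| ]  (Definition 3.1).

    Q_k, Q_star: arrays of shape (S, A) with the current and optimal Q-values.
    d_pik: array of shape (S, A), the on-policy state-action marginal (sums to 1).
    """
    return float(np.sum(d_pik * np.abs(Q_k - Q_star)))

Tracking this quantity over iterations is what Figures 2 and 3 report; in practice Q* is unknown, which is why Section 4 develops a surrogate for it.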

In Figure 2, we show that, as expected, Bellman error correlates negatively with visitation frequency (dashed line), suggesting that visiting a state more often decreases its Bellman error. However, the change in value error, E_{k+1} − E_k, in general does not correlate negatively with visitation. Value error often increases in states that are visited more frequently, suggesting that a corrective feedback mechanism is often lacking.

Figure 2: Correlation (y-axis) between d^{π_k}(s, a) and the Bellman error, |Q_{k+1} − B*Q_k| (dashed), and correlation between d^{π_k}(s, a) and the change in value error, E_{k+1} − E_k (solid), during training with on-policy data. d^{π_k}(s, a) negatively correlates with Bellman error, but often correlates positively with an increase in value error.

The Q-function value error at state-action pairs that will be used as bootstrapping targets for other state-action tuples (Q(s_0, a_1) is used as a target for all states with action a_1) is high, and the state-action pair with the correct target value, (s_3, a_0), appears infrequently in the on-policy distribution, since the policy chooses the other action a_1 with high probability. Since the function approximator couples together updates across states and actions, the low update frequency at (s_3, a_0) and the high frequency of state-action tuples with incorrect targets will cause the Q-function updates to increase value error. Thus, minimizing Bellman error under the on-policy distribution can lead to an increase in the error against Q* (also shown in Figure 2 on a gridworld). A more concrete computational example illustrating this phenomenon is described in detail in Section 4. We can further generalize this discussion over multiple iterations of learning.

Which distributions lead to higher value errors? In Figure 3, we plot the value error E_k over the course of Q-learning with on-policy and replay buffer distributions. The plots show prolonged periods where E_k is increasing or fluctuating. When this happens, the policy has poor performance, with returns that are unstable or stagnating (Fig. 3). To study the effects of function approximation and distributions on this issue, we can control for both of these factors. When a uniform distribution Unif(s, a) is used instead of the on-policy distribution, as shown in Fig. 3 (red), or when using a tabular representation without function approximation, but still with the on-policy distribution, as shown in Fig. 3 (brown), we see that E_k decreases smoothly, suggesting that the combination of function approximation and naïve distributions can result in challenges in value error reduction.

Figure 3: Value error (E_k, solid) and policy performance (normalized return, dashed) for Left: sub-optimal convergence with on-policy distributions, Right: instabilities in learning progress with replay buffers (panel titles: "Suboptimal Convergence" and "Sparse Reward"; curves compare Uniform, On-policy or Replay Buffer, and Tabular). Note that an oracle re-weighting to a uniform data distribution, or complete removal of function approximation, gives rise to a decreasing E_k curve and better policy performance.

In fact, we can construct a family of MDPs generalizing our didactic tree example, where training with on-policy or replay buffer distributions theoretically requires at least exponentially many iterations to converge to Q*, if convergence to Q* happens at all.

Theorem 3.1 (Exponential lower bound for on-policy and replay buffer distributions). There exists a family of MDPs parameterized by H > 0, with |S| = 2H, |A| = 2 and state features Φ, such that on-policy or replay-buffer Q-learning requires Ω(γ^{-H}) exact Bellman projection steps for convergence to Q*, if convergence happens at all. This happens even with features Φ that can represent the optimal Q-function near-perfectly, i.e., ‖Q* − Φw*‖ ≤ ε.

The proof is in Appendix D. This suggests that on-policy or replay buffer distributions can induce very slow learning in certain MDPs. We show in Appendix D.3 that our method, DisCor, which we derive in the next section, can avoid many of these challenges in this MDP family.

4  Optimal Distributions for Value Error Reduction

We discussed how, with function approximation and on-policy or replay-buffer training distributions, the value error E_k may not decrease over the course of training. What if we instead directly optimize the data distribution at each iteration so as to minimize value error? To do so, we derive a functional form for this "optimal" distribution by formulating an optimization problem that directly optimizes the training distribution p_k(s, a) at each iteration k, greedily minimizing the error E_k at the end of iteration k.

Note that p_k(s, a) is now distinct from the on-policy or buffer data distribution denoted by μ(s, a). We will then show how to approximately solve for p_k(s, a), yielding a simple practical algorithm in Section 5. All proofs are in Appendix A. We can write the optimal p_k(s, a) as the solution to the following optimization problem:

    \min_{p_k} \; \mathbb{E}_{d^{\pi_k}}\big[|Q_k - Q^*|\big] \quad \text{s.t.} \quad Q_k = \arg\min_{Q} \; \mathbb{E}_{p_k}\big[(Q - \mathcal{B}^*Q_{k-1})^2\big], \quad \sum_{s,a} p_k(s,a) = 1.    (2)

Theorem 4.1. The solution p_k(s, a) to a relaxation of the optimization in Equation 2 satisfies

    p_k(s,a) \;\propto\; \exp\big(-|Q_k - Q^*|(s,a)\big)\, \frac{|Q_k - \mathcal{B}^*Q_{k-1}|(s,a)}{\lambda^*},    (3)

where λ* ∈ R is the magnitude of the Lagrange multiplier for the constraint Σ_{s,a} p_k(s, a) = 1 in Problem 2.

Proof sketch. Our proof of Theorem 4.1 utilizes the Fenchel-Young inequality [39] to first upper bound E_{d^{π_k}}[|Q_k − Q*|] via more tractable terms, giving us the relaxation, and then optimizing the Lagrangian. We use the implicit function theorem (IFT) [24] to compute implicit gradients of Q_k with respect to p_k.

Intuitively, the optimal p_k in Equation 3 assigns higher probability to state-action tuples with high Bellman error |Q_k − B*Q_{k−1}|, but only when the resulting Q-value Q_k is close to Q*. However, this expression contains terms that depend on Q* and Q_k, namely |Q_k − Q*| and |Q_k − B*Q_{k−1}|, which are observed only after p_k is chosen. As we will show next, we need to estimate these quantities using surrogates that depend only upon the past Q-function iterates, in order to use p_k in a practical algorithm. Intuitively, these surrogates exploit the rich structure in Bellman iterations: the Bellman error at each iteration contributes to the error against Q* in a structured manner, as we will discuss below, allowing us to approximate the value error using a special sum of Bellman errors. We present these approximations below, and then combine them to derive our proposed algorithm, DisCor.

Surrogate for |Q_k − Q*|. For approximating the error against Q*, we show that the cumulative sum of discounted and propagated Bellman errors over the past iterations of training, denoted Δ_k and shown in Equation 5, is equivalent to an upper bound on |Q_k − Q*|. Specifically, Theorem 4.2 will show that, up to a constant, Δ_k forms a tractable upper bound on |Q_k − Q*| constructed only from prior Q-function iterates, Q_0, ..., Q_{k−1}. We define Δ_k as:

    \Delta_k = \sum_{i=1}^{k} \gamma^{k-i} \Big(\prod_{j=i}^{k-1} P^{\pi_j}\Big)\, \big|Q_i - \mathcal{B}^*Q_{i-1}\big| \qquad \text{(vector-matrix form of } \Delta_k\text{)}    (4)

    \Delta_k(s, a) = \big|Q_k(s, a) - (\mathcal{B}^*Q_{k-1})(s, a)\big| + \gamma\,(P^{\pi_{k-1}} \Delta_{k-1})(s, a).    (5)

Here P^{π_j} is the state-action transition matrix under policy π_j, as described in Section 2. We can then use Δ_k to define an upper bound on the value error |Q_k − Q*|, as follows:

Theorem 4.2. There exists a k_0 ∈ N such that, for all k ≥ k_0, Δ_k from Equation 5 satisfies the following inequality pointwise, for each s, a, as well as Δ_k ≥ |Q_k − Q*| as π_k → π*:

    \Delta_k(s,a) + \sum_{i=1}^{k} \gamma^{k-i}\,\alpha_i \;\ge\; |Q_k - Q^*|(s,a), \qquad \alpha_i = \frac{2 R_{\max}}{1-\gamma}\, D_{\mathrm{TV}}\big(\pi_i(\cdot \mid s),\, \pi^*(\cdot \mid s)\big).

A proof and intermediate steps of simplification can be found in Appendix B. The key insight in this argument is to use a recursive inequality, presented in Lemma B.1, App. B, to decompose |Q_k − Q*|, which allows us to show that Δ_k + Σ_i γ^{k−i} α_i is a solution to the corresponding recursive equality, and hence an upper bound on |Q_k − Q*|. Using an upper bound of this form in Equation 3 may downweight more transitions, but will never upweight a transition that should not be upweighted.
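The recursion in Equation 5 is straightforward to evaluate when the previous error estimate and the transition operator under π_{k−1} are available; the following is an illustrative tabular sketch (the arrays passed in are assumed placeholders):

import numpy as np

def delta_update(Q_curr, bellman_target, P_pi_prev, Delta_prev, gamma):
    """Eq. (5): Delta_k = |Q_k - B*Q_{k-1}| + gamma * P^{pi_{k-1}} Delta_{k-1}.

    Q_curr, bellman_target, Delta_prev: arrays of shape (S, A), where
    bellman_target holds (B*Q_{k-1})(s, a).
    P_pi_prev: array of shape (S*A, S*A), the state-action transition matrix
    under pi_{k-1} (the operator P^pi from Section 2).
    """
    S, A = Q_curr.shape
    bellman_err = np.abs(Q_curr - bellman_target)
    propagated = (P_pi_prev @ Delta_prev.reshape(S * A)).reshape(S, A)
    return bellman_err + gamma * propagated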
Estimating |Q_k − B*Q_{k−1}|. The Bellman error multiplier term |Q_k − B*Q_{k−1}| in Equation 3 is also not known in advance. Since no information is known about the Q-function Q_k, a viable approximation is to bound |Q_k − B*Q_{k−1}| between the minimum and maximum Bellman errors obtained at the previous iteration, c_1 = min_{s,a} |Q_{k−1} − B*Q_{k−2}| and c_2 = max_{s,a} |Q_{k−1} − B*Q_{k−2}|. We restrict the support of state-action pairs (s, a) used to compute c_1 and c_2 to be the set of transitions in the replay buffer used for the Q-function update, to ensure that both c_1 and c_2 are finite. This approximation can then be applied to the solution obtained in Equation 3 to replace the Bellman error multiplier |Q_k − B*Q_{k−1}|, effectively giving us a lower bound on p_k(s, a) in terms of c_1 and c_2.
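A small sketch of these bounds, assuming the previous iterate's Bellman errors have already been evaluated on the replay-buffer transitions (the input array is a placeholder):

import numpy as np

def bellman_error_bounds(prev_bellman_errors):
    """c1 and c2 bound the unknown |Q_k - B*Q_{k-1}| using the previous iteration's
    Bellman errors, computed only over (s, a) pairs present in the replay buffer so
    that both bounds are finite."""
    c1 = float(np.min(prev_bellman_errors))
    c2 = float(np.max(prev_bellman_errors))
    return c1, c2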

Re-weighting the replay buffer μ. Since it is challenging to directly obtain samples from p_k via online interaction, a practically viable alternative is to use the samples from a standard replay buffer distribution, denoted μ, but reweight these samples using importance weights w_k = p_k(s, a)/μ(s, a). However, naïve importance sampling often suffers from high variance, leading to unstable learning. Instead of directly re-weighting to p_k, we re-weight samples from μ to a projection of p_k, denoted as q_k, that is still close to μ under the KL-divergence metric, such that q_k = argmin_q −E_{q(s,a)}[log p_k(s, a)] + τ D_KL(q(s, a) ‖ μ(s, a)), where τ > 0 is a scalar. The weights w_k are thus given by (derivation in Appendix B):

    w_k(s,a) \;\propto\; \exp\!\left(\frac{-|Q_k - Q^*|(s,a)}{\tau}\right) \frac{|Q_k - \mathcal{B}^*Q_{k-1}|(s,a)}{\lambda^*}.    (6)

Putting it all together. We have noted all practical approximations to the expression for the optimal p_k (Equation 3), including estimating surrogates for Q_k and Q*, and the usage of importance weights to simply re-weight transitions in the replay buffer, rather than altering the exploration strategy. We now put these together to obtain a tractable expression for the weights in our method. Due to space limitations, we only provide a sketch of the proof here; a detailed derivation is in Appendix C. We first upper-bound the quantity |Q_k − Q*| by Δ_k. However, estimating Δ_k requires |Q_k − B*Q_{k−1}|, which is not known in advance. We utilize the upper bound c_2, |Q_k − B*Q_{k−1}|(s, a) ≤ c_2, and hence use γ(P^{π_{k−1}} Δ_{k−1})(s, a) + c_2 as an estimator for |Q_k − Q*| in Equation 6. For the final Bellman error term outside the exponent, we can lower bound it with c_1, where |Q_k − B*Q_{k−1}| ≥ c_1. Simplifying the constants c_1, c_2 and λ*, the final expression for this tractable approximation to w_k is:

    w_k(s,a) \;\propto\; \exp\!\left(\frac{-\gamma\,[P^{\pi_{k-1}} \Delta_{k-1}](s,a)}{\tau}\right).    (7)

This expression gives rise to our practical algorithm, DisCor, described in the next section.
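Equation 7 reduces to a one-line weight computation once an estimate of [P^{π_{k−1}} Δ_{k−1}](s, a) is available for each sampled transition, e.g. the error model evaluated at the next state and the target-maximizing action, as in Algorithm 1 below; the names here are illustrative:

import numpy as np

def discor_weights(delta_next, gamma, tau):
    """Eq. (7): w_k(s, a) proportional to exp( -gamma * [P^{pi_{k-1}} Delta_{k-1}](s, a) / tau ).

    delta_next: per-transition estimates Delta_{k-1}(s'_i, a_hat_i), standing in
    for the expectation under P^{pi_{k-1}}.
    """
    w = np.exp(-gamma * np.asarray(delta_next) / tau)
    return w / w.mean()   # Eq. (7) only fixes w_k up to a constant; averaging to 1 is one convenient choice

Since the weights are only defined up to proportionality, how they are normalized within a minibatch is a free implementation choice.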
A concrete demonstration. To illustrate the effectiveness of DisCor and the challenges with naively chosen distributions in RL, we present a simple computational example in Figure 4, which illustrates that, even in a simple MDP, error can increase with standard Q-learning but decreases with our distribution correction approach, DisCor, which is based on the idea of first attempting to minimize value error at state-action tuples that will serve as target values for other states. Our example is a 5-state MDP, with the starting state s_0 and the terminal state s_T (marked in gray). Each state has two available actions, a_0 and a_1, and each action deterministically transitions the agent to a state marked by arrows in Figure 4. A reward of 0.001 is received only when action a_0 is chosen at state s_3 (otherwise the reward is 0). The Q-function is a linear function over pre-defined features φ(s, a), i.e., Q(s, a) = [w_1, w_2]^T φ(s, a), where φ(·, a_0) = [1, 1] and φ(·, a_1) = [1, 1.001] (hence features are aliased across states). Computationally, we see that when minimizing Bellman error starting from a Q-function with weights [w_1, w_2] = [0, 1e-4], under the on-policy distribution of the Boltzmann policy, π(a_0 | ·) = 0.001, π(a_1 | ·) = 0.999, in the absence of sampling error (using all transitions, but weighted), the error against Q* still increases from 7.177e-3 to 7.179e-3 in one iteration, whereas with DisCor the error decreases to 5.061e-4. With a uniform distribution the error also decreases, but is larger: 4.776e-3.

Figure 4: A simple MDP showing the effect of the on-policy distribution and function approximation on the learning dynamics of ADP algorithms (edges are labeled with actions and rewards; γ = 0.999).

5  Distribution Correction (DisCor) Algorithm

In this section, we present our full, practical algorithm, which uses the weights w_k from Equation 7 to re-weight the Bellman backup in order to better correct value errors. Pseudocode for our approach, called DisCor (Distribution Correction), is presented in Algorithm 1, with the main differences from standard ADP methods highlighted in red. In addition to a standard Q-function, DisCor trains another parametric model, Δ_φ, to estimate Δ_k(s, a) at each state-action pair. The recursion in Equation 5 is used to obtain a simple approximate dynamic programming update rule for the parameters φ (Line 8).

We need to explicitly estimate this error term Δ_φ because it is required to compute the weights described in Equation 7. The second change is the introduction of a weighted Q-function backup with weights w_k(s, a), as shown in Equation 7, on Line 7. Since DisCor simply introduces a change to the training distribution, this change can be applied to popular ADP algorithms such as DQN [33] or SAC [14], as shown in Algorithm 3, Appendix F.

Algorithm 1  DisCor (Distribution Correction)
1: Initialize Q-values Q_θ(s, a), an initial distribution p_0(s, a), a replay buffer μ, and an error model Δ_φ(s, a).
2: for step k in {1, ..., N} do
3:   Collect M samples using π_k, add them to the replay buffer μ, sample {(s_i, a_i)}_{i=1}^{N} ~ μ.
4:   Evaluate Q_θ(s, a) and Δ_φ(s, a) on the samples (s_i, a_i).
5:   Compute target values for Q and Δ on the samples:
       y_i = r_i + γ max_{a'} Q_{k−1}(s'_i, a')
       â_i = argmax_a Q_{k−1}(s'_i, a)
       Δ̂_i = |Q_θ(s_i, a_i) − y_i| + γ Δ_{k−1}(s'_i, â_i)
6:   Compute w_k using Equation 7.
7:   Minimize the Bellman error for Q_θ, weighted by w_k:
       θ_{k+1} = argmin_θ (1/N) Σ_i w_k(s_i, a_i) (Q_θ(s_i, a_i) − y_i)^2
8:   Minimize the ADP error for training Δ_φ:
       φ_{k+1} = argmin_φ (1/N) Σ_i (Δ_φ(s_i, a_i) − Δ̂_i)^2
9: end for

Using the weights w_k in Equation 7 for weighting Bellman backups possesses a very clear and intuitive explanation. Note that (P^{π_{k−1}} Δ_{k−1})(s, a) corresponds to the estimated upper bound on the error of the target values for the current transition, due to the backup operator P^{π_{k−1}}, as described in Equation 7. Intuitively, this implies that the weights w_k downweight those transitions for which the bootstrapped target Q-value estimate has a high estimated error to Q*, effectively focusing the learning on samples where the supervision (target value) is estimated to be accurate, which are precisely the samples that we expect to maximally improve the accuracy of the Q-function.
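For concreteness, the following is a condensed PyTorch-style sketch of one DisCor update in the spirit of Algorithm 1; the network objects, replay-buffer batch format, use of target networks, terminal-state masking, and hyperparameter values are all assumptions for illustration rather than the authors' released implementation:

import torch
import torch.nn.functional as F

def discor_step(q_net, q_target, delta_net, delta_target, batch,
                q_opt, delta_opt, gamma=0.99, tau=10.0):
    """One DisCor update (sketch of Algorithm 1): a weighted Bellman backup for Q_theta
    and a bootstrapped regression for the error model Delta_phi."""
    # batch: s, s_next are float tensors; a is a LongTensor of action indices;
    # r and done are float tensors (done is 0/1).
    s, a, r, s_next, done = batch

    with torch.no_grad():
        q_next = q_target(s_next)                                   # (B, |A|)
        a_hat = q_next.argmax(dim=1)                                # argmax_a Q_{k-1}(s', a)
        y = r + gamma * (1.0 - done) * q_next.max(dim=1).values     # Line 5: targets y_i
        # Delta_{k-1}(s', a_hat): estimated error of the bootstrap target.
        delta_next = delta_target(s_next).gather(1, a_hat.unsqueeze(1)).squeeze(1)
        # Line 6 / Eq. (7): downweight transitions whose targets are estimated to be inaccurate.
        w = torch.exp(-gamma * delta_next / tau)
        w = w / w.mean()                                            # normalization is a free choice

    # Line 7: weighted Bellman error for Q_theta.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_loss = (w * (q_sa - y) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Line 8: fit Delta_phi to |Q_theta - y| + gamma * Delta_{k-1}(s', a_hat)  (Eq. 5).
    with torch.no_grad():
        delta_hat = (q_sa - y).abs() + gamma * (1.0 - done) * delta_next
    delta_sa = delta_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    delta_loss = F.mse_loss(delta_sa, delta_hat)
    delta_opt.zero_grad()
    delta_loss.backward()
    delta_opt.step()
    return q_loss.item(), delta_loss.item()

In a full agent, this step would sit inside the data-collection loop of Algorithm 1 (or be combined with SAC or DQN as in Algorithm 3, Appendix F of the paper), with periodic target-network updates.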
6  Related Work

Prior work has pointed out a number of issues arising when dynamic programming is used with function approximation. [35, 36, 8, 43, 26, 42] focused on analysing the error induced in Bellman projections, under the assumption of an abstract error model. Convergent backups [47, 46, 32] were developed. However, divergence is rarely observed to be an issue with deep Q-learning methods [10, 52]. In contrast to these works, which mostly focus on convergence of the Bellman backup, we focus on the interaction between the ADP update and the data distribution μ. Prior work on Q-learning and stochastic approximation analyzes time-varying μ, but either without function approximation [53, 49, 5] or when fully online [50], unlike our setting, which uses replay buffer data. While the generalization effects of deep neural nets with ADP updates have been studied [1, 10, 30, 25], often under standard NTK [21] assumptions [1], the high-level idea in these prior works has been to suppress any coupling effects of the function approximator, effectively obtaining tabular behavior. In contrast, DisCor solves an optimization problem for the distribution p_k that maximally reduces value error, and does not explicitly suppress coupling effects, as these can be important for generalization in high dimensions. [41] studies the effect of the data distribution on multi-objective policy gradient methods and reports a pathological interaction between the data distribution and optimization. [9] shows the existence of suboptimal fixed points with on-policy TD learning, as we observed empirically in Figure 3 (left). DisCor re-weights the transitions in the buffer based on an estimate of their error to the true optimal value function. This scheme resembles learning with noisy labels via "abstention" from training on labels that are likely to be inaccurate [48]. Prioritized sampling has been used previou

