Multi-Objective Reinforcement Learning Using Sets Of Pareto Dominating .

1y ago

11 Views

2 Downloads

933.82 KB

30 Pages

Last View : 12d ago

Last Download : 3m ago

Upload by : Abby Duckworth

Report this link

Download PDF

Transcription

Journal of Machine Learning Research 15 (2014) 3663-3692Submitted 12/13; Revised 7/14; Published 11/14Multi-Objective Reinforcement Learning using Sets ofPareto Dominating PoliciesKristof Van MoffaertAnn Nowékvmoffae@vub.ac.beanowe@vub.ac.beDepartment of Computer ScienceVrije Universiteit BrusselPleinlaan 2, Brussels, BelgiumEditors: Peter Auer, Marcus Hutter, Laurent OrseauAbstractMany real-world problems involve the optimization of multiple, possibly conflicting objectives. Multi-objective reinforcement learning (MORL) is a generalization of standardreinforcement learning where the scalar reward signal is extended to multiple feedbacksignals, in essence, one for each objective. MORL is the process of learning policies thatoptimize multiple criteria simultaneously. In this paper, we present a novel temporal difference learning algorithm that integrates the Pareto dominance relation into a reinforcementlearning approach. This algorithm is a multi-policy algorithm that learns a set of Paretodominating policies in a single run. We name this algorithm Pareto Q-learning and it isapplicable in episodic environments with deterministic as well as stochastic transition functions. A crucial aspect of Pareto Q-learning is the updating mechanism that bootstrapssets of Q-vectors. One of our main contributions in this paper is a mechanism that separates the expected immediate reward vector from the set of expected future discountedreward vectors. This decomposition allows us to update the sets and to exploit the learnedpolicies consistently throughout the state space. To balance exploration and exploitationduring learning, we also propose three set evaluation mechanisms. These three mechanismsevaluate the sets of vectors to accommodate for standard action selection strategies, such as -greedy. More precisely, these mechanisms use multi-objective evaluation principles suchas the hypervolume measure, the cardinality indicator and the Pareto dominance relationto select the most promising actions. We experimentally validate the algorithm on multipleenvironments with two and three objectives and we demonstrate that Pareto Q-learningoutperforms current state-of-the-art MORL algorithms with respect to the hypervolume ofthe obtained policies. We note that (1) Pareto Q-learning is able to learn the entire Paretofront under the usual assumption that each state-action pair is sufficiently sampled, while(2) not being biased by the shape of the Pareto front. Furthermore, (3) the set evaluation mechanisms provide indicative measures for local action selection and (4) the learnedpolicies can be retrieved throughout the state and action space.Keywords: multiple criteria analysis, multi-objective, reinforcement learning, Paretosets, hypervolume1. IntroductionMany real-life problems involve dealing with multiple objectives (Coello et al., 2006; Tesauroet al., 2008; Hernandez-del Olmo et al., 2012). For example, in a wireless sensor network thec 2014 Kristof Van Moffaert and Ann Nowé.

Van Moffaert and Nowécriteria consist of energy consumption and latency, which are conflicting objectives (Gorceet al., 2010). When the system engineer wants to optimize more than one objective, it isnot always clear a priori which objectives might be correlated and how they influence eachother. As the objectives are conflicting, there usually exists no single optimal solution. Inthose cases, we are interested in a set of trade-off solutions that balance the objectives.More precisely, we want to obtain the set of best trade-off solutions, i.e., the set of solutionsthat Pareto dominate all the other solutions but are mutually incomparable.There are two main approaches when dealing with multi-objective problems. The simplest way is to use a scalarization function (Miettinen and Mäkelä, 2002) which transformsthe multi-objective problem into a standard single-objective problem. However, this transformation may not be valid when the scalarization function is non-linear. This approach iscalled a single-policy algorithm, as each run converges to a single solution. In order to finda variety of trade-off solutions, several parameterized scalarization functions are employedand their results are combined. However, the mapping from weight space to objective spaceis not guaranteed to be isomorphic (Das and Dennis, 1997). This means that it is notobvious how to define the weights in order to get a good coverage of the Pareto front ofpolicies.Another class of algorithms are multi-policy algorithms. In contrast to focusing only ona single solution at a time, a multi-policy algorithm searches for a set of optimal solutionsin a single run. Well-known examples of this class are evolutionary multi-objective algorithms, such as SPEA2 (Zitzler et al., 2002) and NSGA-II (Deb et al., 2002), which evolve apopulation of multi-objective solutions. These evolutionary multi-objective algorithms areamongst the most powerful techniques for solving multi-objective optimization problems.In our work, we focus on reinforcement learning for multi-objective problems. Reinforcement learning (Sutton and Barto, 1998) is a machine learning technique that involvesan agent operating in an environment and receiving a scalar feedback signal for its behavior.By sampling actions and observing the feedback signal, the agent adjusts its estimate of thequality of its actions. So far, multi-objective reinforcement learning (MORL) has particularly been focusing on single-policy algorithms (Gabor et al., 1998; Mannor and Shimkin,2004; Van Moffaert et al., 2013b), while only a restricted number of multi-policy MORLalgorithms have been proposed so far. For instance, Barrett and Narayanan (2008) proposethe Convex Hull Value Iteration (CHVI) algorithm. From batch data, CHVI extracts andcomputes every linear combination of the objectives in order to obtain all deterministicoptimal policies. As the algorithm relies on linear combinations, only policies on the convexhull, a subset of the Pareto front, are learned. The most computationally expensive operatoris the procedure to compute and combine the convex hulls in the convex-hull version of theBellman equation. Lizotte et al. (2010) reduce the asymptotic space and time complexity ofthe bootstrapping rule by learning several value functions corresponding to different weightvectors using a piecewise linear spline representation. Wang and Sebag (2013) propose amulti-objective Monte Carlo Tree Search (MO-MCTS) method to learn a set of solutions.The algorithm performs tree traversals by selecting the most promising actions. The upperconfidence bounds of these actions are scalarized by applying the hypervolume indicator onthe combination of their estimates and the set of Pareto optimal policies computed so far.Hence, a scalarized multi-objective value function is constructed that eases the process ofselecting an action with vectorial estimates.3664

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating PoliciesIn this paper, we propose a novel MORL algorithm, named Pareto Q-learning (PQL).To the best of our knowledge, this is the first temporal difference-based multi-policy MORLalgorithm that does not use the linear scalarization function. Thus, Pareto Q-learning isnot limited to the convex hull, but it can learn the entire Pareto front of deterministic nonstationary policies, if enough exploration is provided. In contrast to single-policy approachesthat only add a scalarization layer on top of single-objective algorithms, we extend thecore principles of the learning algorithm to learn a set of non-dominated policies. OurPQL algorithm is particularly suited for on-line use, in other words, when the samplingcost of selecting appropriate actions is important and the performance should graduallyincrease over time. We also propose three evaluation mechanisms for the sets that providea basis for on-line action selection strategies. These evaluation mechanisms use multiobjective indicators such as the hypervolume metric, the cardinality indicator and the Paretodominance relation in order to select the best possible actions throughout the learningprocess based on the contents of the sets. The Pareto Q-learning algorithm is evaluated onmultiple environments with two and three objectives and its performance is compared w.r.t.several single-policy MORL algorithms that use either the linear or Chebyshev scalarizationfunction or the hypervolume indicator.In Section 2, we introduce notations and concepts of reinforcement learning and currentadvances in multi-objective reinforcement learning. In Section 3, we present our novel ParetoQ-learning algorithm and discuss its design specifications. Subsequently, in Section 4, weconduct an empirical comparison of our algorithm to other state-of-the-art MORL algorithms. Finally, in Section 5, we draw our conclusions.2. BackgroundIn this section, we present related work and background concepts such as reinforcementlearning and multi-objective reinforcement learning.2.1 Reinforcement LearningA reinforcement learning (Sutton and Barto, 1998) environment is typically formalized bymeans of a Markov decision process (MDP). An MDP can be described as follows. LetS {s1 , . . . , sN } be the state space and A {a1 , . . . , ar } the action set available to thelearning agent. Each combination of current state s, action choice a A and next state s0has an associated transition probability T (s0 s, a) and expected immediate reward R(s, a).The goal is to learn a deterministic stationary policy π, which maps each state to an action,such that the value function of a state s, i.e., its expected return received from time step tand onwards, is maximized. The state-dependent value function of a policy π in a state sis then X πkV (s) Eπγ rt k 1 st s ,(1)k 0where γ [0, 1] is the discount factor. The value of taking an action in a state under policyπ is represented by a Qπ (s, a)-value which stores the expected return starting from state s,3665

Van Moffaert and Nowétaking action a, and thereafter following π again. The optimal Q -values are defined asQ (s, a) R(s, a) γXs0T (s0 s, a) maxQ (s0 , a0 ).0a(2)Watkins introduced an algorithm to iteratively approximate Q . In the Q-learning algorithm(Watkins, 1989), a Q-table consisting of state-action pairs is stored. Each entry contains avalue for Q̂(s, a) which is the learner’s current estimate about the actual value of Q (s, a).The Q̂-values are updated according to the update ruleQ̂(s, a) (1 αt )Q̂(s, a) αt (r γ maxQ̂(s0 , a0 )),0a(3)where αt is the learning rate at time step t and r is the reward received for performingaction a in state s. Provided that all state-action pairs are visited infinitely often and asuitable evolution for the learning rate is chosen, the estimates, Q̂, will converge to theoptimal values, Q (Tsitsiklis, 1994).The Q-learning algorithm is listed below in Algorithm 1. In each episode, actions areselected based on a particular action selection strategy, for example -greedy where a randomaction is selected with a probability of , while the greedy action is selected with a probabilityof (1 ). Upon applying the action, the environment transitions to a new state s0 and theagent receives the corresponding reward r (line 6). At line 7, the Q̂-value of the previousstate-action pair (s, a) is updated towards the reward r and the maximum Q̂-value of thenext state s0 . This process is repeated until the Q̂-values converge or after a predefinednumber of episodes.Algorithm 1 Single-objective Q-learning algorithm1:2:3:4:5:6:7:Initialize Q̂(s, a) arbitrarilyfor each episode t doInitialize srepeatChoose a from s using a policy derived from the Q̂-values, e.g., -greedyTake action a and observe s0 S, r RQ̂(s, a) Q̂(s, a) αt (r γ maxQ̂(s0 , a0 ) Q̂(s, a))0s s09:until s is terminal10: end fora8:2.2 Multi-Objective Reinforcement LearningIn multi-objective optimization, the objective space consists of two or more dimensions (Roijers et al., 2013). Therefore, regular MDPs are generalized to multi-objective MDPs orMOMDPs. MOMDPs are MDPs that provide a vector of rewards instead of a scalar reward, i.e.,R(s, a) (R1 (s, a), . . . Rm (s, a)),(4)3666

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policieswhere m represents the number of objectives. In the case of MORL, the state-dependentvalue function of a state s is vectorial:πV (s) Eπ X kγ rt k 1k 0 st s .(5)Since the environment now consists of multiple objectives, different policies can be optimalw.r.t. different objectives. In MORL different optimality criteria are used. For instance,Gabor et al. (1998) employ a lexicographical ordering of the objectives, while Barrett andNarayanan (2008) define linear preferences on the different objectives. Although, in general, the Pareto dominance relation is used as an optimality criterion in multi-objectiveoptimization.Definition 1 A policy π1 is said to strictly dominate another solution π2 , that is π2 π1 ,if each objective in Vπ1 is not strictly less than the corresponding objective of Vπ2 and atleast one objective is strictly greater. In the case where Vπ1 strictly improves Vπ2 on atleast one objective and Vπ2 also strictly improves Vπ1 on at least one, the two solutionsare said to be incomparable. A policy π is Pareto optimal if Vπ either strictly dominatesor is incomparable with the value functions of the other policies. The set of Pareto optimalpolicies is referred to as the Pareto front.2.2.1 Single-Policy MORLMost approaches of reinforcement learning on multi-objective tasks rely on single-policyalgorithms (Gabor et al., 1998; Mannor and Shimkin, 2004) in order to learn Pareto optimalsolutions. Single-policy MORL algorithms employ scalarization functions (Vamplew et al.,2008) to define a utility over a vector-valued policy and thereby reducing the dimensionalityof the underlying multi-objective environment to a single, scalar dimension:Definition 2 A scalarization function f is a function that projects a vector v to a scalar:vw f (v, w),(6)where w is a weight vector parameterizing f .Recently, a general framework for scalarized single-policy MORL algorithms is proposed (Van Moffaert et al., 2013b). In the framework, scalar Q̂-values are extended toQ̂-vectors that store a Q̂-value for each objective, i.e.,Q̂(s, a) (Q̂1 (s, a), . . . , Q̂m (s, a)).(7)When selecting an action in a certain state of the environment, a scalarization function fd a) estimateis applied to the Q̂-vector of each action in order to obtain a single, scalar SQ(s,(Algorithm 2, line 4). In the following subsection, we will discuss possible instantiations ofd a) estimates in a list in order tothe scalarization function f. At line 5, we store the SQ(s,apply traditional action selection strategies, such as, for example, the -greedy strategy.3667

Van Moffaert and NowéAlgorithm 2 Scalarized -greedy strategy, scal- -greedy()1: SQList {}2: for each action a A do3:v {Q̂1 (s, a), . . . , Q̂m (s, a)}d a) f (v, w)4:SQ(s,d a) to SQList5:Append SQ(s,6: end for7: return -greedy(SQList). Scalarize Q̂-vectorsThe scalarized multi-objective Q-learning algorithm is presented in Algorithm 3. At line1, the Q̂-values for each triplet of states, actions and objectives are initialized. The agentstarts each episode in state s (line 3) and chooses an action based on the multi-objectiveaction selection strategy at line 5, e.g, scal- -greedy. Upon taking action a, the environmenttransitions the agent into the new state s0 and provides the vector of sampled rewards r.As the Q-table has been extended to incorporate a separate value for each objective, thesevalues are updated for each objective individually and the single-objective Q-learning updaterule is extended for a multi-objective environment at line 9. More precisely, the Q̂-valuesfor each triplet of state s, action a and objective o are updated using the correspondingreward for each objective, r, into the direction of the best scalarized action of the next states0 . It is important to note that this framework only adds a scalarization layer on top of theaction selection mechanisms of standard reinforcement learning algorithms.Algorithm 3 Scalarized multi-objective Q-learning algorithm1:2:3:4:5:6:7:8:9:10:Initialize Q̂o (s, a) arbitrarilyfor each episode t doInitialize state srepeatdChoose action a from s using the policy derived from SQ-values,e.g., scal- greedyTake action a and observe state s0 S and reward vector r Rma0 greedy(s0 ). Call scal. greedy action selectionfor each objective o doQ̂o (s, a) Q̂o (s, a) αt (ro γ Q̂o (s0 , a0 ) Q̂o (s, a))end for11:s s013:until s is terminal14: end for. Proceed to next state12:A scalarization function can come in many forms and flavors, but the most commonfunction is the linear scalarization function. As depicted in Eq. 8, the linear scalarizationfunction calculates a weighted-sum of the Q̂-vector and a non-negative weight vectordlinear (s, a) SQmXo 13668wo · Q̂o (s, a).(8)

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating PoliciesThe weight vector itself should satisfy the equationmXwo 1.(9)o 1dGiven these SQ-values,the standard action selection strategies can decide on the appropriate action to select. For example, in the greedy case in Eq. 10, the action with the largestdSQ-valueis selected:dlinear (s, a0 ).greedylinear (s) arg max SQ(10)a0Because the linear scalarization function computes a convex combination, it has the fundamental limitation that it can only find policies that lie in convex regions of the Paretofront (Vamplew et al., 2008).An alternative scalarization function is based on the Lp metrics (Dunford et al., 1988).These metrics measure the distance between a point x in the multi-objective space and autopian point z . This point z serves as a point of attraction to steer the search processto high-quality solutions. The utopian point z is a parameter that is being constantlyadjusted during the learning process by recording the best value so far for each objectiveo, plus a small negative or positive constant τ , depending whether the problem is to beminimized or maximized, respectively. In our setting, we measure the distance betweeneach objective of x to z with 1 p :Lp (x) mXo 1wo xo z o p 1/p.(11)In the case of p , the metric results in the weighted L or the Chebyshev metric andis of the formL (x) max wo xo z o .(12)o 1.mdL -value is obtained by substituting x for theIn the case of single-policy MORL, a SQ Q̂-vector of a state-action pair (s, a):dL (s, a) max wo · Q̂o (s, a) z .SQo o 1.m(13)dL -value is consideredLp metrics entail that the action corresponding to the minimal SQpthe greedy action in state s. Hence, for the Chebyshev metric that is greedyL (s):dL (s, a0 ).greedyL (s) arg min SQ (14)a0Although the Chebyshev metric is a common and established function in evolutionaryalgorithms, its application in reinforcement learning lacks theoretical guarantees. Moreprecisely, there exist examples which indicate that the Chebyshev metric, being a nonlinear function, does not guarantee the scalarized returns to be additive. As a result, theBellman equation no longer holds and the learning algorithm is not proven to converge to3669

Van Moffaert and Nowéthe optimal policy (Perny and Weng, 2010; Roijers et al., 2013). Nevertheless, even withoutthis theoretical guarantee, a Chebyshev scalarized MORL algorithm can obtain high-qualitysolutions (Van Moffaert et al., 2013b).Quality indicators are functions that assign a real value to a set of vectors and areusually employed to evaluate the results of multi-objective algorithms. Yet, particularmulti-objective algorithms also use indicators in their internal workings to steer the searchprocess (Beume et al., 2007; Igel et al., 2007). This class of algorithms are called indicatorbased algorithms. Many quality indicators exist, but the one that is the most interesting forour context is the hypervolume (Zitzler et al., 2003) indicator. The hypervolume measureis a quality indicator that evaluates a particular set of vectorial solutions by calculatingthe volume with respect to its elements and a reference point (Figure 1). As the goal is tomaximize the hypervolume, this reference point is usually defined by determining the lowerlimit of each objective in the environment.objective 2S1S2S3refobjective 1Figure 1: Illustration of the hypervolume calculator. It calculates the area of a set ofnon-dominated policies, i.e., S1 , S2 and S3 , in the objective space with a givenreference point ref.The hypervolume indicator is of particular interest in multi-objective optimization as itis the only quality measure known to be strictly increasing with regard to Pareto dominance.Recently, the hypervolume-based MORL algorithm (HB-MORL) is proposed (Van Moffaertet al., 2013a). HB-MORL is a specific multi-objective algorithm that uses an archive ofQ̂-vectors of previously visited states and actions. The innovative part of HB-MORL liesin the action selection mechanism, i.e., the action that maximizes its contribution to thearchive in terms of the hypervolume measure is selected.2.2.2 Multi-Policy MORLIn contrast to single-policy MORL, multi-policy algorithms do not reduce the dimensionalityof the objective space but aim to learn a set of optimal solutions at once. White (1982)proposed a dynamic programming (DP) algorithm that computes a set of Pareto dominatingpolicies. Dynamic programming differs from reinforcement learning in the fact that itassumes a model of the environment while reinforcement learning does not need any a3670

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policiespriori knowledge about the environment but is able to work model-free. The DP function isQ̂set (s, a) R(s, a) γXs0 ST (s0 s, a) V N D (s0 ),(15)where R(s, a) is the expected reward vector observed after taking action a in state s andT (s0 s, a) is the corresponding transition probability of reaching state s0 from (s, a). Werefer to V N D (s0 ) as the set of non-dominated vectors of the Q̂set ’s of each action in s0 ,as denoted in Eq. 16. The ND operator is a function that removes all Pareto dominatedelements of the input set and returns the set of non-dominated elements:V N D (s0 ) N D( a0 Q̂set (s0 , a0 )).(16)The operator performs a vector-sum between a vector v and a set of vectors V . Summingtwo vectors can be performed simply by adding the corresponding components of the vectors:v V [(v v0 ).(17)v0 VThe idea is that, after the discounted Pareto dominating rewards are propagated andthe Q̂set ’s converge to a set of Pareto dominating policies, the user can traverse the treeof Q̂set ’s by applying a preference function. As highlighted in Section 2.1, a deterministicstationary policy suffices for single-objective reinforcement learning. In the case of MORL,White (1982) showed that deterministic non-stationary policies, i.e., policies that do notonly condition on the current state but usually also on the time step t, can Pareto dominatethe best deterministic stationary policies. As a result, in infinite horizon problems with largevalues for the discount factor, the number of non-stationary policies increases exponentiallyand therefore it can lead to an explosion of the sets. In order to make the algorithmpractically applicable, Wiering and de Jong (2007) proposed the CON-MODP algorithmwhich solves the problem of non-stationary policies by introducing a consistency operator,but their work is limited to deterministic transition functions.Several multi-policy algorithms were inspired by the work of White. For instance, Barrett and Narayanan (2008) proposed the convex hull value-iteration (CHVI) algorithm whichcomputes the deterministic stationary policies that are on the convex hull of the Paretofront. The convex hull is a set of policies for which the linear combination of the value ofpolicy π, Vπ , and some weight vector w is maximal (Roijers et al., 2013). In Figure 2 (a),white dots denote the Pareto front of a bi-objective problem and in Figure 2 (b) the redline represents the corresponding convex hull. The 4 deterministic policies denoted by reddots are the ones that CHVI would learn. SCHVI bootstraps by calculating the convex hullof the union over all actions in s0 , that is a0 Q(s0 , a0 ). The most computationally expensive operator is the procedure of combining convex hulls in the bootstrapping rule. Lizotteet al. (2010) reduce the asymptotic space and time complexity of the bootstrapping ruleby simultaneously learning several value functions corresponding to different weights andby calculating their piecewise linear spline representation. Recently, they validated theirwork on clinical trial data for three objectives, although the practical possibilities for higherdimensional spaces are not straightforward (Lizotte et al., 2012).3671

Van Moffaert and Nowéobjective 2objective 2To conclude, it is important to note that (1) these methods are batch algorithms thatassume a model of the environment is known and that (2), in general, only policies that lieon a subset of the Pareto front, i.e., the convex hull, are obtained.While the aforementioned multi-policy algorithms learn only a finite set of deterministicpolicies, it might be interesting to employ probabilistic combinations of these policies. Sucha stochastic combination of two policies is called a mixture policy (Vamplew et al., 2009)and can be explained with the following example. Take for instance a very easy multi-objective 1objective 1(a)(b)Figure 2: (a) A Pareto front of bi-objective policies represented by white dots. (b) Theconvex hull of the same Pareto front is represented by a red line. The 4 red dotsdenote policies that CHVI and Lizotte’s method would learn.objective problem where the agent can only follow two deterministic policies π1 and π2 withVπ1 (s0 ) (1, 0) and Vπ2 (s0 ) (0, 1), where s0 denotes the start state. If one would followpolicy π1 with probability p and policy π2 with probability (1 p), the average rewardvector would be (p, 1 p). Thus, although there are only two deterministic policies for theoriginal problem, a mixture policy implicates that we can sample the entire convex hull ofpolicies by combining the deterministic policies with a certain probability. Hence, stochasticcombinations of the policies of the Pareto front in Figure 2 (a) can represent every solutionon the red line in Figure 2 (b).However, mixture policies might not be appropriate in all situations, as highlightedby Roijers et al. (2013). For instance, in the setting of Lizotte et al. (2010), clinical data isanalyzed to propose a treatment to patients based on a trade-off between the effectivenessof the drugs and severity of the side effects. Consider the case where only two policies existthat either maximize the effectiveness and the severity of the side effects and vice versa.While the average performance of the mixture policy of these two basic policies might yieldgood performance across a number of episodes, the policy itself might be unacceptable ineach episode individually, i.e., for each patient independently.3. Pareto Q-learningIn this section, we will propose a novel on-line TD-based multi-objective learning algorithm,named Pareto Q-learning or PQL, which uses the algorithm of White (1982) as a startingpoint. As a result, PQL also learns deterministic non-stationary policies. Before we present3672

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policiesthe details of our algorithm, we first describe the assumption that we make. We currentlyonly focus on episodic problems, i.e., environments with terminal states that end the episode.In Section 5, we analyze the challenges to extend PQL to ergodic environments.The Pareto Q-learning algorithm does not assume a given model, i.e., it works modelfree. Therefore, we present three mechanisms that allow action selection based on thecontent of the sets of Q̂-vectors. We name them set evaluation mechanisms as they providea scalar evaluation of the sets. These scalar evaluations can then be used to guide thestandard exploration strategies, such as -greedy. The details of the evaluation mechanismsare presented in Section 3.2. First, we elaborate on how we can learn and update sets ofvectors in Section 3.1.3.1 Set-Based BootstrappingThe single-objective Q-learning bootstrapping rule updates an estimate of an (s, a)-pairbased on the reward and an estimate of the next state (Watkins, 1989). The update ruleguarantees that the Q̂-values converge to their expected future discounted reward, evenwhen the environment has a stochastic transition function. In this section, we analyze theproblem of bootstrapping sets of vectors. We first present a naive approach whereupon wepresent our novel Pareto Q-learning algorithm.3.1.1 Naive ApproachThe set-based bootstrapping problem boils down to the general problem of updating the setof vectors of the current state-action (s, a)-pair with an observed reward vector r and a setof non-dominated vectors of the next state, N D( a0 Q̂set (s0 , a0 )) over time. The difficultyin this process arises from the lack of correspondence between the vectors in the two sets,i.e., it is not clear which vector of the set of the current (s, a)-pair to update with whichvector in s0 . This correspondence is needed to perform a pairwise update of each vector inQ̂set (s, a) with the corresponding vector (if any) in the other set (see Figure 3).r[0, 0][0, 0.5]?[1, 0][0.9, 0.1][0.5, 0][0, 1][0.7, 0.3][0.2, 0.8][0.1, 0.9][0.15, 0.3][0.3, 0.7][0.3, 0.15][0.5, 0.5][0.8, 0.2][0.4, 0.6][0.6, 0.4]N D( a′ Q̂set (s′ , a′ ))Q̂set (s, a)Figure 3: Set-based bootstrapping: the problem of updating over time the set of vectorsof the current state-action pair with the observe

In this section, we present related work and background concepts such as reinforcement learning and multi-objective reinforcement learning. 2.1 Reinforcement Learning A reinforcement learning (Sutton and Barto, 1998) environment is typically formalized by means of a Markov decision process (MDP). An MDP can be described as follows. Let S fs 1 .

Related Documents:

Multi-Agent Patrolling with Reinforcement Learning1

learning techniques, such as reinforcement learning, in an attempt to build a more general solution. In the next section, we review the theory of reinforcement learning, and the current efforts on its use in other cooperative multi-agent domains. 3. Reinforcement Learning Reinforcement learning is often characterized as the

11 Views

1y ago

1 Introduction to reinforcement learning - GitHub Pages

IEOR 8100: Reinforcement learning Lecture 1: Introduction By Shipra Agrawal 1 Introduction to reinforcement learning What is reinforcement learning? Reinforcement learning is characterized by an agent continuously interacting and learning from a stochastic environment. Imagine a robot movin

24 Views

2y ago

Multi-Agent Reinforcement Learning in Common Interest and Fixed Sum ...

2. Multi-Agent Reinforcement Learning and Stochastic Games Multi-Agent Reinforcement Learning (MARL) is an extension of RL (Sutton and Barto, 1998; Kaelbling et al., 1996) to multi-agent environments. It deals with the problems associated with the learning of optimal behavior from the point of view of an agent acting in a multi-agent en-vironment.

14 Views

1y ago

Designing Self-organizing Systems With Deep Multi-agent Reinforcement ...

In contrast to the centralized single agent reinforcement learning, during the multi-agent reinforcement learning, each agent can be trained using its own independent neural network. Such approach solves the problem of curse of dimensionality of action space when applying single agent reinforcement learning to multi-agent settings.

27 Views

1y ago

Applying Deep Reinforcement Learning to Berkeley's Capture the Flag game

2.3 Deep Reinforcement Learning: Deep Q-Network 7 that the output computed is consistent with the training labels in the training set for a given image. [1] 2.3 Deep Reinforcement Learning: Deep Q-Network Deep Reinforcement Learning are implementations of Reinforcement Learning methods that use Deep Neural Networks to calculate the optimal policy.

102 Views

1y ago

Multi-Agent Systems on Sensor Networks: A Distributed Reinforcement ...

A Distributed Reinforcement Learning Approach Chen-KhongTham& Jean-Christoph Renaud Presented by Howard Luu. Presentation Structure Paper Objective Background Multi-Agent Systems (MAS) & Wireless Sensor Networks (WSN) Reinforcement Learning Distributed Reinforcement Learning (DRL) Approaches

9 Views

1y ago

Exponential Moving Average Based Multiagent Reinforcement Learning ...

Keywords Multi-agent learning systems Reinforcement learning. 1 Introduction Reinforcement learning (RL) is a learning technique that maps situations to actions so that an agent learns from the experience of interacting with its environment (Sutton and Barto, 1998; Kaelbling et al., 1996). Reinforcement learning has attracted attention and been .

9 Views

1y ago

(Rev. 09/03} 2

For AOl or A02 Transactions where abo·ve situations apply. 9 characters 853: Date (mm/dd/yy) DESCRIPTION: REQUIRED: LENGTH: ENTER: CODES DEFINED: EXAII1PLE: 033-1 c6/0084 2.105 (Rev. 05/90) ITEM 856 - DEMOTION REASON Explanation of employee's preference in voluntarily demoting or choosing a demotion other than that which was directed (e.g., layoff, reassignment, etc.). For A02 Transaction .

56 Views

3y ago

Recent Views

BETTER NUTRITION BRIGHTER FUTURE - Maryland.gov Enterprise Agency Template

TOFU BUY: 12- to 16-ounce container Brands and types shown here ONLY Not WIC Approved: With added fats, sugar, oil, or salt With added ﬂavorings, sauces, or seasonings Azumaya Extra Firm Franklin Farms Firm, Medium Firm, Extra Firm, Soft House Foods Organic: Soft, Firm, Medium Firm, Extra Firm Premium: Soft, Firm, Medium Firm, Extra Firm

1y ago

192 Views

Leaving a Law Firm: A Guide to the Ethical Obligations in Law Firm .

associates or otherwise employed in the firm "not to (1) actively exploit their positions within the [law firm] for their own personal benefits, or (2) hinder the ability of the [law firm] to conduct the business for which it was developed." Burke v. Lakin Law Firm, 2008 WL 64521 (S.D.Ill. Jan. 3, 2008), quoting FoodComm Intern. V.

1y ago

113 Views

Uses of Special Purpose Vehicles (SPVs) in structuring financing .

TFR "Best Law Firm in Trade Finance" Trade & Forfaiting Review (TFR) named Sullivan & Worcester "Best Law Firm in Trade Finance" in its 2014, 2015 and 2016 TFR Excellence Awards . GTR "Best Law Firm" Sullivan & Worcester UK LLP was top ranked firm in the . Global Trade Review (GTR) Best Law Firm 2015 and 2016 polls . The Legal 500 UK . 2016

1y ago

139 Views

Global Elite Law Firm Brand Index 2022 - thomsonreuters

such areas as law firm brand, firm usage, and legal market trends. The responses are distilled . down into four different and non-related measures gathered from the Sharplegal research and . then used to generate the individual Global Elite Law Firm Brand Index score for each law firm. How we generate our insights. In-depth interviews with

1y ago

166 Views

Notice and Order - Law Firm Names - Amendments to RPC 7.5 and Related Rules

LAW FIRM NAMES - AMENDMENTS TO RPC 7.5 AND COURT RULES 1:21-1A, 1:21-1B, AND 1:21-1C The Supreme Court has adopted amendments to Rule of Professional Conduct 7.5 ("Law Firm Names and Letterheads") so as to remove the requirement that the law firm name include the name of a lawyer and describe the nature of the firm's legal practice.

1y ago

134 Views

Law Student's Guide to the Washington, DC-Area Law Firm Market

Years 6-8: Return to law firm as senior associate or counsel . Benefit: In addition to your government experience, law firm employers will value your prior firm experience with billing time, working with private sector clients, etc. In other words, you already "know how law firms work" and this provides a smoother transition back. *Disclaimer:

1y ago

151 Views

12 PUBLIC LAW AND PRIVATE LAW - Home: The National .

INTRODUCTION TO LAW MODULE - 3 Public Law and Private Law Classification of Law 164 Notes z define Criminal Law; z list the differences between Public and Private Law; and z discuss the role of Judges in shaping Law 12.1 MEANING AND NATURE OF PUBLIC LAW Public Law is that part of law, which governs relationship between the State

3y ago

745 Views

Dr. Ram Manohar Lohiya National Law University, Lucknow

2. Health and Medicine Law 3. Int. Commercial Arbitration 4. Law and Agriculture IXth SEMESTER 1. Consumer Protection Law 2. Law, Science and Technology 3. Women and Law 4. Land Law (UP) Xth SEMESTER 1. Real Estate Law 2. Law and Economics 3. Sports Law 4. Law and Education **Seminar Courses Xth SEMESTER (i) Law and Morality (ii) Legislative .

3y ago

496 Views

Overcoming Ethical Challenges for Multi-Firm Lawyers and Their Firms .

- Florida Bar Op. 94-7: o Law firm refers personal injury cases to a lawyers who is "of counsel" to the firm and who sometimes works in the law firm's offices, but who also . Formal Ops. 1995-9: o A law firm named "A B & C" is a NY partnership consisting of partners A, B, and C. Motivated by tax concerns, C retires and becomes .

1y ago

116 Views

Companies Law - Cayman Islands dollar

Law 1 of 1971-15th December, 1970 Law 7 of 2000- 20th July, 2000 Law 7 of 1973-28th June, 1973 Law 5 of 2001-20th April, 2001 Law 24 of 1974-22nd November, 1974 Law 10 of 2001-25th May, 2001 Law 25 of 1975-9th December, 1975 Law 29 of 2001-26th September, 2001 Law 19 of 1977-10th November, 1977 Law 46 of 2001-14th January, 2002

3y ago

454 Views

It’s the Law!

ciples stated in Boyle’s Law, Charles’ Law, Gay-Lussac’s Law, Henry’s Law, and Dalton’s Law. Students will be able to explain the application of Boyle’s Law, Charles’ Law, Gay-Lussac’s Law, Henry’s Law, and Dalton’s Law to observations or events related to SCUBA diving. MateriaLs None audio/visuaL MateriaLs None teachinG tiMe

2y ago

378 Views

WHAT LAW IS ? An Introduction to Law

common law system civil law system!! sources of law in civil law !! a1. primary: statutes (written law) enacted by legislative power are the principal source of law. ! a2. two subsidiary sources of law: ! a2.1 administrative regulations a.2.2 customs!! ! sources of law in common law !!! b1. two primary sources of

2y ago

385 Views

Law Firm Performance Metrics - Thomson Reuters

ProLaw XII reporting offers a firm the capability to turn data into knowledge for law firm performance management. The new reporting features within ProLaw XII provide key financial and operational metrics necessary to monitor firm performance - many of which can be self‐defined by the firm.

1y ago

104 Views

CHAPTER 11 35 per hour to firm A but differ in their .

flock to the piece rate firm. After the price of output falls, firm A values all workers at 17.50 per hour, while worker 1’s value at firm B falls to 50 cents, worker 2’s value falls to 1 at firm B, etc. The question is what happens to the wage. Presumably wage also falls, to 17.50 per hour in firm A.

2y ago

165 Views

Faculty of Juridical, Social and Political Sciences Year .

Law L Law IV 8 Drept procesual civil II / Civil Procedure Law II 5 Law L Law IV 8 Dreptul comerțului internațional / International ommercial Law 4 Law L Law IV 8 riminalistică / Forensics 4 Law L Law IV 8 Practică de cercetare pentru elaborarea lucrării de lincență(3 săptămân

2y ago

384 Views

Multi-Objective Reinforcement Learning Using Sets Of Pareto Dominating .

It looks like you're using an ad-blocker