Multi-Agent Patrolling with Reinforcement Learning*


Hugo Santana, Geber Ramalho
Universidade Federal de Pernambuco, Centro de Informática
Caixa Postal 7851, CEP 50732-970, Recife, Brazil
{hps, glr}@cin.ufpe.br

Vincent Corruble
Université Paris 6, Laboratoire d'Informatique de Paris VI
Boîte 169, 4 Place Jussieu, 75252 PARIS CEDEX 05
Vincent.Corruble@lip6.fr

Bohdana Ratitch
School of Computer Science, McGill University
3480 University St., Montreal, Canada H3A 2A7
bohdana@cs.mcgill.ca

* Supported by the Brazilian-French project "capes-cofecub 371/01".

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AAMAS'04, July 19-23, 2004, New York, New York, USA. Copyright 2004 ACM 1-58113-864-4/04/0007...$5.00.

Abstract

Patrolling tasks can be encountered in a variety of real-world domains, ranging from computer network administration and surveillance to computer wargame simulations. Patrolling is a complex multi-agent task, which usually requires agents to coordinate their decision making in order to achieve optimal performance of the group as a whole. In this paper, we show how the patrolling task can be modeled as a reinforcement learning (RL) problem, allowing continuous and automatic adaptation of the agents' strategies to their environment. We demonstrate that efficient cooperative behavior can be achieved by using RL methods, such as Q-Learning, to train individual agents. The proposed approach is fully distributed, which makes it computationally efficient. The empirical evaluation confirms the effectiveness of our approach: the results obtained are substantially better than the results available so far for this domain.

1. Introduction

Patrolling is literally "the act of walking or traveling around an area, at regular intervals, in order to protect or supervise it" [1]. Performing this task efficiently can be useful in various application domains where distributed surveillance, inspection or control is required. For instance, patrolling agents can be used to help administrators monitor failures or specific situations in an intranet [3], to detect recently modified or new web pages to be indexed by search engines [6], or to identify objects or people in dangerous situations that should be rescued by robots [13].

Despite its high potential utility, this problem has only recently been rigorously addressed. Our research group, in particular, has done pioneering work on this matter. Initially, in [10], we presented an original in-depth analysis of the patrolling task and proposed different multi-agent-based solutions, which were empirically evaluated on a simulator that we developed. Later, in [2], more sophisticated (multi-agent-based but non-adaptive) solutions were proposed and evaluated, and more complex instances of the problem were considered. More recently, colleagues addressed the problem of performing this task with real robots [15].

Although many of the previously developed solutions showed good empirical results, we noticed that, for each proposed architecture, there were always some problem settings (e.g., a particular environment topology) on which it performed badly [2].
One of the reasons is that, for this domain, it is very difficult to design a general distributed strategy beforehand, since the task requires a fair amount of topology-dependent coordination between the agents' actions. Adaptive machine learning techniques can therefore be very useful here. In fact, some of them have been used successfully in other multi-agent domains, such as robotic soccer [16], as a way to automatically achieve coordination.

In this paper, we investigate the creation of adaptive agents that learn to patrol using reinforcement learning techniques [17]. The use of such techniques, in this case, is not straightforward. In order to use most RL algorithms, it is necessary to model the task as a Markov Decision Process (MDP), but many characteristics of this domain, such as the multi-agent setting, make it difficult to do so. One of the challenges, for example, lies in the definition of an appropriate model of the instantaneous rewards that would be conducive to a high long-term collective performance. Such difficulties, and the ways to overcome them, are discussed in this work. Two RL-based agent architectures, using different communication schemes, are proposed, implemented, empirically evaluated and compared to non-adaptive architectures from our previous work. The optimality of these solutions is also studied analytically.

The remainder of this paper is structured as follows: Section 2 defines our task and reviews previous work. Section 3 summarizes the relevant reinforcement learning concepts. Section 4 presents a model of this task as an RL problem. Section 5 shows the experimental results, and Section 6 provides our conclusions and future work.

2. The Patrolling Task

In order to obtain a more general, yet precise, definition of the patrolling task, we adopted a more abstract representation of the terrain being patrolled: a graph, where the nodes represent specific locations and the edges represent possible paths, as shown in Fig. 2. This abstract representation can easily be mapped to many different domains, from terrains and maps to computer networks. Given a graph, the patrolling task studied in the remainder of this paper consists in continuously visiting its nodes.

Intuitively, a good patrolling strategy is one that, for each node, minimizes the time lag between two visits to the same node. To be more precise, our previous work suggests some evaluation criteria for patrolling strategies, based on the notion of idleness [10]. Considering that a cycle is a single simulation step, the instantaneous node idleness of a node n at a cycle t is the number of cycles elapsed since the last visit before t (the number of cycles n remained unvisited). The instantaneous graph idleness is the average instantaneous idleness over all nodes in a given cycle. As long-term performance criteria, the average idleness is the mean of the instantaneous graph idleness over a t-cycle simulation, and the worst idleness is the highest value of the instantaneous node idleness encountered during the simulation.

With reasonable strategies, performance with respect to these measures tends to improve as the number of agents grows. However, the lack of appropriate coordination may diminish the improvement expected from the insertion of new agents. In order to assess coordination quality, we can measure the individual contribution of each agent by normalizing these criteria:

    \text{normalized value} = \text{absolute value} \times \frac{\text{number of agents}}{\text{number of nodes}}    (1)
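To make these criteria concrete, the following minimal Python sketch computes the average, worst and normalized idleness from a log of visits. It is an illustration only: the visit-log format, function name and per-cycle bookkeeping are assumptions, not the simulator described later in the paper.

# Illustrative sketch (not the authors' simulator): idleness metrics on a visit log.
def idleness_metrics(num_nodes, num_agents, visits, horizon):
    """visits: list of (cycle, node) pairs recording every visit made during the run."""
    last_visit = [0] * num_nodes                 # cycle of the last visit to each node
    visits_by_cycle = {}
    for cycle, node in visits:
        visits_by_cycle.setdefault(cycle, []).append(node)

    graph_idleness_sum = 0.0
    worst_idleness = 0
    for t in range(1, horizon + 1):
        for node in visits_by_cycle.get(t, []):
            last_visit[node] = t
        node_idleness = [t - last_visit[v] for v in range(num_nodes)]
        worst_idleness = max(worst_idleness, max(node_idleness))
        graph_idleness_sum += sum(node_idleness) / num_nodes   # instantaneous graph idleness

    average_idleness = graph_idleness_sum / horizon            # mean over the whole simulation
    normalized = average_idleness * num_agents / num_nodes     # normalization of Eq. (1)
    return average_idleness, worst_idleness, normalized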
2.1. Overview of Previous Work

Our first work on this task considered the problem of patrolling in non-weighted graphs (the distance between adjacent nodes is one) [10]. The implemented solutions were simple, but covered several multi-agent architectures with varying parameters, such as agent communication (allowed vs. forbidden), coordination scheme (central and explicit vs. emergent), agent perception (local vs. global), etc. Then, we considered the real distances between nodes, representing them as weights on the edges of the graph [2]. With this representation, we explored more sophisticated (non-adaptive) solutions that employed several heuristics based on node idleness values and path lengths, combined with negotiation mechanisms [11]. These studies improved the previous results, and we use them in this work as a baseline for evaluating our adaptive agents in Section 5. Recently, [15] addressed the problem of patrolling with robots, dealing with challenges inherent to the complexity of the real world, such as the necessity for robots to recharge.

One of the lessons learned from these works is that the solutions were too sensitive to variations in the problem settings [2]. This conclusion led us to look for machine learning techniques, such as reinforcement learning, in an attempt to build a more general solution. In the next section, we review the theory of reinforcement learning and the current efforts on its use in other cooperative multi-agent domains.

3. Reinforcement Learning

Reinforcement learning is often characterized as the problem of "learning what to do (how to map situations to actions) so as to maximize a numerical reward signal" [17]. This framework is mostly defined over the theory of Markov Decision Processes (MDP), which we briefly review in the following sections.

3.1. Markov Decision Processes

Consider an agent that sequentially makes decisions in an environment. At each step, this agent chooses an action from a finite set A, based on a state signal from the environment. This state comes from a finite set S, and summarizes the present and past sensations in a way that all relevant information is retained [17]. The process dynamics is described by two components: the state transition probability distribution, P, and the expected immediate reward function, R. Both of them are defined on triples (s, a, s'), where s is the current state, a is the agent's action and s' is the next state. Both functions are assumed to satisfy the Markov property. In summary, a finite Markov Decision Process is specified by a tuple (S, A, P, R) of the elements defined above. In this formalism, the agent acts according to some policy π(s, a), representing the probability of choosing action a ∈ A at state s ∈ S. The main goal is to maximize a long-term performance criterion, called the return, which is often defined as a sum of discounted rewards. The agent then tries to learn an optimal policy π*, which maximizes the expected return, called the value function, V^π(s), as shown in (2), over the set of all possible policies:

    V^\pi(s) = E_\pi \left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right\}    (2)

where γ is a discount factor, 0 ≤ γ < 1, and r_t is the immediate reward received at step t.

Similarly, we can define the action-value function, Q^π(s, a), as the expected return when starting from state s, performing action a, and then following π thereafter. If one can learn the optimal action-value function Q*(s, a), an optimal policy can be constructed greedily: for each state s, the best action a is the one that maximizes Q.
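As a small illustration of the return in Eq. (2), the function below computes the discounted sum of a finite reward sequence; it is an expository sketch, not part of the paper.

# Illustrative only: discounted return of a finite reward sequence (Eq. 2, truncated).
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards [1, 0, 2] with gamma = 0.9 give 1 + 0 + 0.81 * 2 = 2.62.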

3.2. Q-Learning

Q-Learning [17] is a traditional RL algorithm for solving MDPs. Skipping technical details, it consists in iteratively computing the values of state-action pairs, online, using the following update rule:

    Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma V(s') - Q(s, a) \right]    (3)

where V(s') = max_a Q(s', a), and α is a learning rate. This algorithm is guaranteed to converge to the optimal Q function in the limit, under the standard stochastic approximation conditions. Note that a priori knowledge about the process dynamics is not necessary.
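For concreteness, a minimal tabular Q-Learning step might look like the sketch below. The dictionary-based table, the epsilon-greedy helper and the default parameter values are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the tabular Q-Learning update of Eq. (3); illustrative only.
from collections import defaultdict
import random

Q = defaultdict(float)                          # Q[(state, action)] -> estimated return

def q_update(s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    v_next = max(Q[(s_next, a2)] for a2 in actions_next)      # V(s') = max_a Q(s', a)
    Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])

def epsilon_greedy(s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q[(s, a)])               # exploit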
3.3. Semi-Markov Decision Processes

Conventional MDPs do not involve temporal abstraction or temporally extended actions [18]: the unitary action taken at time t affects only the state and reward at time t+1. Semi-Markov Decision Processes (SMDPs) can be seen as a more general version of the MDP formalism, where actions may take variable amounts of time. There are extensions of MDP algorithms, such as Q-Learning, which can also solve SMDPs in tractable time. SMDPs are more suitable for modeling our patrolling task, as will be shown in the next sections.

3.4. Cooperative Multi-Agent Reinforcement Learning

As the MDP theory only deals with the single-agent case, one can try to consider the whole multi-agent system as a centralized single agent, but this solution is intractable, as it has an exponentially large number of actions. In contrast to this approach, many solutions use RL agents as independent learners [11], however, with no guarantee of achieving a satisfactory global behavior. The issue of creating cooperative agents that learn to coordinate their actions through RL has been studied in the current literature [5][9][7][19]. Although the problem of finding a globally optimal solution for a cooperative group of agents with partial information is known to be intractable in theory [4], many proposed approaches have achieved good results in practice.

The main difficulty is that, when a group of agents is solving a common task in a distributed manner, each agent optimizing its own rewards does not necessarily lead to a globally optimal solution. In this context, [19] investigates the design of a more collective intelligence through the use of different utility functions. In the Wonderful Life Utility (WLU), the agents optimize a private utility that is aligned with the global utility. In other words, the utility functions for each agent ensure that the agents do not work at cross-purposes. This approach was implemented in the past for some tasks by including penalties when more than one agent competed for the same reward. This utility has been compared to others, such as the selfish utility (SU), where each agent tries to maximize its own utility, and the team game utility (TG), where the global performance is given to each agent as a reward. The TG seems to suffer from very poor learnability, as each agent's actions contribute little individually to the global reward/penalty received.

From the communication point of view, distributed RL approaches can be classified into three major groups, according to the information available to each agent [11]. In Black-Box systems, the agents do not know about the actions of other agents, whose effect is perceived as part of the environmental stochasticity. In White-Box systems, also known as joint action learners, each agent knows about the actions of all other agents. Finally, in Gray-Box systems, agents can communicate their actions, or their intentions for future actions. In this paper, we consider Black-Box and Gray-Box architectures, with the latter exhibiting a superior performance.
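As a rough illustration of how these utility schemes differ, the sketch below assigns per-agent rewards for a single node visit under SU, TG and a WLU-style penalty. The specific penalty rule, the function signature and the "first visitor collects" convention are assumptions made for exposition, not the exact formulations of [19].

# Illustrative comparison of reward-assignment schemes; not the formulation from [19].
def assign_rewards(node_idleness, visitors, all_agents, scheme="SU"):
    """visitors: agents that arrived at the node this cycle, in arrival order."""
    rewards = {agent: 0.0 for agent in all_agents}
    if scheme == "SU":                    # selfish utility: every visitor claims the local reward
        for agent in visitors:
            rewards[agent] = node_idleness
    elif scheme == "TG":                  # team game: the global signal is given to every agent
        for agent in all_agents:
            rewards[agent] = node_idleness
    elif scheme == "WLU":                 # WLU-style: redundant visitors are penalized
        for i, agent in enumerate(visitors):
            rewards[agent] = node_idleness if i == 0 else -node_idleness
    return rewards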

4. Patrolling Task as a Reinforcement Learning Problem

Some characteristics of this task have a huge impact on the difficulty of building the MDP model. Because of this, we discuss our model incrementally, from simpler to more complex instances. As we do so, we introduce the main aspects of our approach.

4.1. The Simplest Single-Agent Case

Consider a simplified patrolling task, in which there is only a single agent and the terrain abstraction consists of a graph without weights (unitary distance between nodes). For this task to be considered an MDP, it is necessary to define the tuple (S, A, P, R), as stated in section 3.1.

A natural design for the action space A is a set of actions that let the agent navigate between adjacent nodes in the graph. In other words, with one action, the agent can traverse one edge of the terrain abstraction. Although it is not necessary to define P for a model-free algorithm like Q-Learning, the state transition probabilities can be easily obtained, since with a single agent our environment is completely deterministic.

To properly define a reward function, R, we should take into account the evaluation criteria that we would like to optimize. The agents created in this work were designed to optimize a single criterion, the average idleness, which turned out to be very natural for the RL setting. Denoting by I(t) the instantaneous graph idleness at cycle t, the average idleness, M(T), after a T-cycle simulation, is:

    M(T) = \frac{\sum_{t=0}^{T} I(t)}{T}    (4)

Let us denote by Pos(t) the node visited by the agent at cycle t, and by Φ(Pos(t), t) the idleness of this node at cycle t. Considering that the idleness of each of the n nodes increases by one at each cycle of the simulation (if the node is not visited), the agent can easily calculate the value of I(t) with the following recurrence relation:

    I(0) = \text{InitialIdleness}; \quad I(t) = I(t-1) + \frac{n - \Phi(Pos(t), t)}{n}    (5)

From equation (4), it can be seen that if the sum of the instantaneous idleness is minimized, the average idleness is also minimized. A plausible reinforcement (actually, a punishment) would be the instantaneous idleness of the graph, I(t), after each agent action, calculated using (5). Then, if the return is defined as a sum of discounted rewards, such a model would not be optimal (as equation (4) is not discounted over time), but would yield an approximation. Actually, to seek optimal performance with this reward function, we should consider RL methods that optimize the value function relative to the average expected reward per time step, such as R-Learning [17]. However, we do not deal with such methods in this paper, for two main reasons. First, their use would not be a reasonable starting point for our task, as there is still very little experience with these methods in the community. Second, although using I(t) as a reward function is good enough for the single-agent case, it makes a strong assumption that would make it difficult to generalize to the multi-agent case: it assumes that the agent has a complete model of the whole environment (it encapsulates the model of the idleness of the whole graph). In the single-agent case this works, because the environment is only modified by the agent itself. In the multi-agent case, communication between the agents or with a central information point would be necessary to form such a reward signal. We show now how a more general reward function can be obtained.

Expanding Eq. (5), and considering that InitialIdleness is zero, a closed-form expression is:

    I(t) = \frac{t \cdot n - \sum_{i=0}^{t} \Phi(Pos(i), i)}{n}    (6)

Substituting (6) in (4), M(T) is given by:

    M(T) = \frac{\frac{n (T + T^2)}{2} - \sum_{i=0}^{T} \left( (T - i) \, \Phi(Pos(i), i) \right)}{n \cdot T}    (7)

In Eq. (7), we notice that the only term that depends on the agent's policy is:

    \sum_{i=0}^{T} \left( (T - i) \, \Phi(Pos(i), i) \right)    (8)

Thus, if the agent can learn a policy that maximizes (8), that policy is optimal for the average idleness criterion. If we use the function Φ(Pos(t), t) as a reward function, the policy learned by Q-Learning is the one that maximizes the sum of the discounted rewards:

    \sum_{i=0}^{T} \left( \gamma^{i} \, \Phi(Pos(i), i) \right)    (9)

The difference between (8) and (9) is that (8) is discounted over an arithmetic progression, while (9) is discounted over a geometric progression. By using an appropriate value for γ, we can approximate (8), achieving a policy that is optimal in many cases. Furthermore, this reward function is entirely local: it depends only on the idleness of the node currently being visited by the agent, and it does not need to assume anything about the rest of the environment.

At first sight, it could seem contradictory to minimize the average idleness of the graph by maximizing the idleness values of the visited nodes. Intuitively, however, if the agent's reward is the idleness of the node that it visits, it will try to visit the nodes with the highest idleness, thus decreasing the global idleness, as shown above.
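To make the single-agent model concrete, here is a minimal environment sketch in which the step reward is exactly the idleness Φ(Pos(t), t) of the node just visited. The class name, the adjacency-list representation and the reset-to-zero convention after a visit are illustrative assumptions. A tabular Q-Learning agent trained on this reward, with γ chosen as discussed above, approximately maximizes (8).

# Illustrative single-agent patrolling environment with the local reward of Section 4.1.
class PatrolEnv:
    def __init__(self, adjacency):             # adjacency: dict node -> list of neighbor nodes
        self.adjacency = adjacency
        self.idleness = {v: 0 for v in adjacency}
        self.position = next(iter(adjacency))
        self.cycle = 0

    def step(self, next_node):
        assert next_node in self.adjacency[self.position]
        self.cycle += 1
        for v in self.idleness:                 # every node ages by one cycle
            self.idleness[v] += 1
        reward = self.idleness[next_node]       # Phi(Pos(t), t): idleness of the visited node
        self.idleness[next_node] = 0            # the visit resets the node's idleness
        self.position = next_node
        return next_node, reward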
Having defined A, P and R, we are left with the definition of the state space S to obtain a complete MDP model. We would like our state representation to be such that the resulting model has the Markov property, so that the choice of the best next node to visit can be made based only on the current state, independently of the whole history of positions of the agent. Past positions are important for future decisions only in that they impact the idleness of the nodes of the graph. Thus the state of the environment can be fully captured, under the Markov assumption, by the current position of the agent and the idleness (the number of cycles) of each node on the map. However, as the idleness values are not bounded, this state space is not finite. One possible solution is to group the idleness values into a finite set of levels, but that would still lead to a great number of states: if k is the number of considered levels and n is the number of nodes in the graph, there are n × k^n possible states, which grows exponentially in the number of nodes.
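To get a sense of this growth, the snippet below counts the discretized states for a hypothetical graph size; the values of n and k are illustrative and not taken from the paper's maps.

# Hypothetical example of the n * k**n state count; n and k are not the paper's values.
n, k = 20, 4                               # 20 nodes, 4 idleness levels
print(n * k ** n)                          # 21990232555520, i.e. about 2.2e13 states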

This issue becomes even more problematic with real-valued distances between nodes; we discuss it in the next sections.

4.2. Considering the Real Distance

In the formulation of the patrolling task presented in the previous section, we considered that every edge of the graph had a length of one. Intuitively, the problem with the case where edges may have arbitrary lengths is that, if the agent is trying to maximize rewards over time, it should avoid actions taking longer time intervals, since during such periods it does not receive any rewards (the idleness of the graph only grows).

The following example illustrates the problem. Consider the nodes A, B and C in Fig. 1. Suppose that an agent is at A and has to decide between the path (A, B, C) (non-dotted) or the path (A, C, B) (dotted), considering the utility of each path as the return estimated using Q-Learning, for example. Using the reward function defined in section 4.1, considering that all idleness values were zero initially, and γ = 0.8, we would have: Reward(A,B,C) = 0 + γ · 6 + γ² · 12 ≈ 12.5, and Reward(A,C,B) = 0 + γ · 10 + γ² · 16 ≈ 18.2. The agent would therefore choose the path (A, C, B); but if we use equation (7) to calculate the average idleness (considering that the function Φ is zero while on an edge), we see that the path (A, B, C) is actually better. In fact, this path takes only 12 cycles to be executed, while the dotted one takes 16, and the agent does not take this into account.

Fig. 1. Weighted graph for patrolling: the agent (at A) has to choose path (A,B,C) or (A,C,B). Edge lengths: A-B = 6, B-C = 6, A-C = 10.

Fortunately, the Semi-Markov Decision Process formalism [18] already deals with this issue. By using a discrete-time finite SMDP, the γ factor can be discounted over all time steps, instead of being discounted only at the times of the actions' final outcomes. In this case, for our previous example, the new value function, calculated using equation (2), would be: Reward(A,B,C) = 0 + γ⁶ · 6 + γ¹² · 12 ≈ 2.4, and Reward(A,C,B) = 0 + γ¹⁰ · 10 + γ¹⁶ · 16 ≈ 1.52, thus yielding a correct policy.

If the function Φ(Pos(t), t) is redefined to be the idleness of the visited node at time t if the agent visited any node, and zero otherwise, all equations defined in section 4.1 remain unchanged, and the model developed earlier naturally generalizes to weighted graphs.
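The path comparison above can be reproduced numerically. The sketch below, an illustration rather than the authors' code, evaluates a path by discounting each visit's idleness reward either once per action (MDP-style) or once per elapsed cycle (SMDP-style); since every node starts with zero idleness and is visited for the first time, the reward at each visited node equals the number of cycles elapsed so far.

# Reproduces the Fig. 1 example: idleness rewards collected along a path, gamma = 0.8.
def path_return(edge_lengths, gamma=0.8, per_cycle=False):
    """edge_lengths: travel times of the successive edges of the chosen path."""
    total, elapsed = 0.0, 0
    for step, length in enumerate(edge_lengths, start=1):
        elapsed += length
        reward = elapsed                          # idleness of the node reached for the first time
        discount = gamma ** elapsed if per_cycle else gamma ** step
        total += discount * reward
    return total

print(path_return([6, 6]))                   # MDP-style:  ~12.48, path (A,B,C)
print(path_return([10, 6]))                  # MDP-style:  ~18.24, path (A,C,B) is wrongly preferred
print(path_return([6, 6], per_cycle=True))   # SMDP-style: ~2.40,  path (A,B,C) is correctly preferred
print(path_return([10, 6], per_cycle=True))  # SMDP-style: ~1.52,  path (A,C,B)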
4.3. The Multi-Agent Case

Our approach for extending the model presented in the previous sections to the multi-agent case is based on the concept of independent learners [11]: we try to solve our global (collective) optimization problem by solving local (individual) optimization problems. In the rest of this section, we describe how this can be done.

The action set, A, can remain the same as in the single-agent case. However, there are some problems regarding the rest of the items.

By the same arguments that showed that an optimal policy with respect to the average idleness criterion is the one that maximizes equation (8), we can generalize this result to the multi-agent case: if G is the number of agents and Pos_j(t) gives the position (node) of a specific agent j, 1 ≤ j ≤ G, we now want to maximize:

    \sum_{j=1}^{G} \sum_{i=0}^{T} \left( (T - i) \, \Phi(Pos_j(i), i) \right)    (10)

Let us consider that each agent of the group uses the same reward function as in the single-agent case, Φ(Pos_j(t), t), that is, the idleness of the nodes it visits. Consequently, each of them will try to independently maximize an inner summation of (10), in the same way as shown in section 4.1. However, as the value of Φ, for any node, can potentially be affected by the policy of any agent j, this approach is selfish (each agent makes no effort to help the other agents maximize their rewards), and even if we had a Markovian representation for each agent, it would not guarantee a globally optimal solution. Recalling section 3.4, this scheme is equivalent to the selfish utility. We can also adapt this reward model with the concept of the wonderful life utility, in the same way as [8], by giving penalties when agents compete for idleness (rewards) on the same node.

Although the model presented in this section has no guarantee of global optimality, it is perfectly well-suited for a distributed implementation: the only feedback that each agent receives is the idleness of the node currently being visited. This reward is completely local, and can be implemented by a very simple scheme of flag communication: for every node that the agent visits, it places a flag containing the number of the cycle at which it was there, and this flag can be seen by the other agents.

Analyzing now the state space S: as the single-agent case was already intractable, the same can be said of the multi-agent case. With that in mind, we decided to represent only partial information about the state of our agents.
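The flag scheme can be pictured with a small sketch: each node stores the cycle of its last visit, and an agent's local reward is derived from that flag. The data structures, names and the zero-initial-idleness convention are illustrative assumptions; a WLU-style variant would additionally penalize an agent whose visit finds a flag placed in the same cycle by another agent.

# Illustrative flag-based local reward for independent patrolling learners.
class SharedFlags:
    def __init__(self, nodes):
        self.last_visit = {v: 0 for v in nodes}   # flag: cycle of the last visit to each node

    def visit(self, node, cycle):
        """Return the local reward (node idleness) and update the node's flag."""
        reward = cycle - self.last_visit[node]    # idleness seen by the visiting agent
        self.last_visit[node] = cycle             # place the flag for the other agents
        return reward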

In particular, we represent a few characteristics of the agent's surroundings, analyzing empirically the effect of each new characteristic added to the state vector, as will be detailed later. Although this is a simplification, it has some advantages. First, this formulation is cheaper in terms of computational complexity. Second, it is more realistic than the one proposed for the single-agent case, in which the agent needs access to the state of the whole world, which is not possible in most practical instances of patrolling. Third, it is well-suited for a distributed implementation, since each agent only needs to observe its immediate surroundings.

Considering the state transition probabilities P in the multi-agent case, each agent now faces a non-deterministic environment if it does not know the actions that the other agents will perform. Worse than that, since all agents are adapting their behavior simultaneously, P will be a non-stationary distribution. This problem can make our attempt to learn by reinforcement useless, since we would no longer have an MDP or an SMDP. In order to diminish the side effects of this non-determinism, we investigate architectures using different schemes of communication, which we detail in the next section.

4.4. Agents in more Details

Two agents were modeled in this work, using the different communication schemes explained in section 3.4: a) a Black-Box Learner Agent (BBLA), which communicates only by placing a flag in each node visited; this flag can be seen by the other agents and considered in their state space representation; and b) a Gray-Box Learner Agent (GBLA), which also communicates by flags, but can additionally communicate its intentions of actions. By including these intentions in the state representation of each agent, the non-stationarity of the environment is diminished.

We did several experiments in order to achieve a satisfactory state representation, doing our best to balance the complexity of the state representation with the performance of the agent. After these experiments, the set of possible states for the BBLA was defined to be a vector of the following characteristics (let m be the maximum node connectivity, and n the number of nodes): i) the node in which the agent is (n possible values); ii) the edge from which it came, informing the agent about its past actions (m values); iii) the neighbor node with the highest idleness (m values); iv) the neighbor node with the lowest idleness (m values). The set of possible states for the GBLA additionally takes into account the intentions communicated by the other agents: each agent sends information about the node it plans to visit, and when another agent receives a message containing a neighbor node, it includes this information in its state. If communication is expensive, these messages do not need to be broadcast to all agents, but only to those that are adjacent to the node informed in the message. With this representation, each BBLA has about 6,250 possible states for Maps A and B shown in Fig. 2, and each GBLA about 200,000 possible states.
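A compact way to picture the BBLA state vector, and the GBLA extension, is sketched below; the exact encoding, the tie-breaking for the highest/lowest-idleness neighbors and the "no announced intention" default are illustrative assumptions, not the paper's implementation.

# Illustrative encoding of the BBLA/GBLA state vectors described above.
def bbla_state(node, came_from_edge, neighbor_idleness):
    """neighbor_idleness: dict neighbor -> idleness read from the flags."""
    highest = max(neighbor_idleness, key=neighbor_idleness.get)   # feature iii
    lowest = min(neighbor_idleness, key=neighbor_idleness.get)    # feature iv
    return (node, came_from_edge, highest, lowest)                # features i-iv

def gbla_state(node, came_from_edge, neighbor_idleness, announced_neighbor=None):
    # The GBLA adds the neighbor node (if any) announced as another agent's next target.
    return bbla_state(node, came_from_edge, neighbor_idleness) + (announced_neighbor,)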

5. Experimental Results

A simulator was developed to evaluate patrolling strategies, and different test instances (maps) were created; two of them are shown in Fig. 2. Map A has few obstacles (it is highly connected). Map B has the same number of nodes, but fewer edges.
