Reinforcement Learning For Optimal Control Of Queueing Systems

1y ago

22 Views

4 Downloads

1,005.50 KB

8 Pages

Last View : 11d ago

Last Download : 3m ago

Upload by : Shaun Edmunds

Report this link

Download PDF

Transcription

2019 57th Annual Allerton Conference on Communication, Control, and Computing(Allerton)Allerton Park and Retreat CenterMonticello, IL, USA, September 24-27, 2019Reinforcement Learning for Optimal Control ofQueueing SystemsBai Liu , Qiaomin Xie† , and Eytan Modiano for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA† School of Operations Research and Information Engineering, Cornell University, Ithaca, NY LaboratoryAbstract—With the rapid advance of information technology,network systems have become increasingly complex and hencethe underlying system dynamics are typically unknown ordifficult to characterize. Finding a good network control policyis of significant importance to achieving desirable networkperformance (e.g., high throughput or low average job delay).Online/sequential learning algorithms are well-suited to learningthe optimal control policy from observed data for systemswithout the information of underlying dynamics. In this work,we consider using model-based reinforcement learning (RL)to learn the optimal control policy of queueing networksso that the average job delay (or equivalently the averagequeue backlog) is minimized. Existing RL techniques, however,cannot handle the unbounded state spaces of the networkcontrol problem. To overcome this difficulty, we propose a newalgorithm, called Piecewise Decaying -Greedy ReinforcementLearning (PDGRL), which applies model-based RL methodsover a finite subset of the state space. We establish that theaverage queue backlog under PDGRL with an appropriatelyconstructed subset can be arbitrarily close to the optimal result.We evaluate PDGRL in dynamic server allocation and routingproblems. Simulations show that PDGRL minimizes the averagequeue backlog effectively.I. I NTRODUCTIONThe rapid growth of information technology has resulted inincreasingly complex network systems and poses challengesin obtaining explicit knowledge of system dynamics. Forinstance, due to security or economic concerns, a number ofnetwork systems are built as overlay networks, e.g. cachingoverlays, routing overlays and security overlays [1]. In thesecases, only the overlay part is fully controllable by the network administrator, while the underlay part remains uncontrollable and/or unobservable. The “black box” componentsmake network control policy design challenging.In addition to the challenges brought by unknown systemdynamics, many of the current network control algorithms(e.g. MaxWeight [2] and Drift-plus-Penalty [3]) aim at stabilizing the system, while general optimization methods forlong-term performances metrics (e.g. queueing backlog anddelay) are relatively rare.To overcome the above challenges, it is desirable to applyinference and learning schemes. A natural approach is reinforcement learning, which learns and reinforces a good decision policy by repeatedly interacting with the environment.Reinforcement learning methods provide a framework thatenables the design of learning policies for general networks.There have been two main lines of work on reinforcement978-1-7281-3151-1/19/ 31.00 2019 IEEElearning methods: model-free reinforcement learning (e.g. Qlearning [4], policy gradient [5]) and model-based reinforcement learning (e.g., UCRL [6], PSRL [7]). In this work, wefocus on the model-based framework.Related Work. In network control, a widely used algorithmis the MaxWeight [2], which can be applied to generalmulti-server networks with an arbitrary topology. MaxWeightalgorithm has been shown to be throughput-optimal (i.e. canstabilize the system whenever the system is stabilizable).Moreover, the MaxWeight algorithm does not require explicitarrival information but only the current queue backlog andthe knowledge of service rates, which makes it suitable forcomplex systems. The work in [3] extends MaxWeight toutility optimization using the Drift-plus-Penalty algorithm.To design control policies for networks with unknowndynamics, an intuitive approach is as follows: estimatingthe parameters of the underlay components first, and thenapplying classic network control techniques based on theestimated parameters. A variety of learning methods havebeen applied. A popular method is probing, i.e., sendingprobe packets at certain time intervals and collect tunnelinformation. For instance, the work in [8], [9] gathers directand indirect path information by collecting ping data. In[10], simulation results illustrate that the probing approachcould achieve optimal throughput. Recently reinforcementlearning has emerged as a popular and powerful approachfor complex systems. In [11], the authors apply the Qlearning algorithm in overlay non-cooperative multi-agentwireless sensor networks (WSNs) to achieve optimal mutualresponse between two agents. The work of [12] combinesreinforcement learning with neural networks and improvesscalability compared with probe-based inference methods.Reinforcement learning is well-suited to learning the optimal control for a system with unknown parameters. Weconsider model-based reinforcement learning methods, whichtend to be more tractable in analysis. Conventional modelbased reinforcement learning methods like UCRL [6] andPSRL [7] only work for finite-state-space systems, yet queueing systems are usually modeled to have unbounded buffersizes. The work in [13] assumes that the system has anadmission control scheme to keep queue backlogs finite sothat we can turn the system into a finite state MDP modeland apply UCRL. However, the practical queueing systemsmay not have admission control schemes and this approach663

might not apply directly. In [14], the authors modify PSRLalgorithm to deal with MDPs with large state space, yet thealgorithm requires the MDP to have a finite bias span, whichis unrealistic for the MDP problems with unbounded costfunctions. In both [15] and [16], deep reinforcement learning methods are applied to spectrum access problems andoutperform traditional baseline algorithm, but lack rigoroustheoretical performance analysis.To sum up, there have been some work on network controlthat aims at stabilizing the systems, yet the performancemetric of interest here is queue backlog and delay. Among theexisting work on queue backlog optimization, most proposead-hoc solutions for some specific scenarios. Model-basedreinforcement learning is a potential approach for the optimalcontrol of the general queueing system, yet the classicalmethods (UCRL and PSRL) can only solve bounded-statespace MDPs.A. Real SystemThe system can be modeled as a countable-state MDP Mas follows. State space S:For ease of exposition, we focus on queueing networkswhere the system state can be represented by the lengthsof all queues. We denote the number of queues as D.The system state is given by a D-dimensional queuebacklog vectors Q. The system space is denoted as S N · · · N. {z}D times Our contributions. We apply model-based reinforcementlearning to queueing networks with unbounded state spacesand unknown dynamics. Our approach leverages the fact thatfor a vast class of stable queueing systems, the probabilityof the queue backlog being large is relatively small. Thisobservation motivates us to focus on learning control policiesover a finite subset of states where the system visits withhigh probability. Our main contributions are summarized asfollows. We propose a model-based RL algorithm that can dealwith unbounded state spaces. In particular, we introducean auxiliary system with the state space bounded bya threshold of U . Our approach employs a piecewisepolicy: for states below the threshold, a model-based RLalgorithm is used; for all other states, a simple baselinealgorithm is applied.We establish that the average queue backlog underthe proposed algorithm can be made arbitrarily closeto the optimal result with a large threshold of U . Inparticular, by utilizing the Lyapunov analysis technique,we characterize how the gap to the optimal performancedecays with the threshold U .Simulation results on dynamic server allocation androuting problems corroborate the validity of our theoretical guarantees. In particular, the proposed algorithmeffectively achieves a small average queue backlog,with the gap to the optimal policy diminishing as thethreshold U increases.II. M ODELIn this paper, we target at optimizing the average queuebacklog of a general discrete-time queueing network systemthat can be formulated by Markov decision process (MDP)framework. The system consists of a set of nodes and links.Each node maintains one or more queues for the undeliveredpackets, and each queue has an unbounded buffer. The systemmay have an arbitrary topology and the underlying dynamicscan be partially or fully unknown. Action space A:The exact form of action space depends on the problemsetting. For instance, in the server allocation problemwhere D parallel queues compete for the service of asingle server [17], the action is the queue served by theserver at each time slot and the action space is naturallythe set of queue indexes. The action space is representedby A and we assume that A 1.State-transition probability p:The system dynamics satisfy the Markov property. Formally, the probability of transitioning into a particularstate Q0 only depends on the current state Q and theaction a, denoted as p Q0 Q, a . We assume that thenumber of newly arrived and served packets duringeach time slot are both bounded. That is, there existsa constant W such that for every Q(t) 2 S,kQ(t 1)Q(t)k1 6 W.We define the set of states within the one-step reachableregion of Q asnoR (Q, a) , Q0 2 S : p Q0 Q, a 0 .Cost function c (Q):Since we aim at minimizing the average queueP backlog,we define the cost function as c (Q) i Qi . Wedenote the optimal average queue backlog as , andthe corresponding optimal policy as .We define that Qmax , maxi Qi . To deal with theunbounded state space, we consider a natural assumption onthe existence of a policy that stabilizes the system. Assumption 1. There exists a known policy 0 satisfyingthe following condition: there exist a Lyapunov function 0 :S ! R and constants a, , 0 , B0 0 such that (Q) 6aQ max for every Q(t) 2 S. If Qmax (t) B0 , we havehiE 0 0 Q(t 1) 0 ,0 Q(t) Q(t) 6where E 0 denotes expectations under policy 0 .A broad class of queueing systems have been proven tohave a 0 that satisfies Assumption 1. For instance, stabilizing policies are proposed for dynamic server allocationproblem [17]–[19], multiclass routing network [20]–[24] andinventory control [25], [26], with linear or quadratic formof Lyapunov functions. This assumption allows us to upperbound the tail probability of queue backlog.664

B. Auxiliary SystemThe key challenge of applying model-based reinforcementlearning to countable-state MDP is that classical modelbased reinforcement learning techniques usually operate inepisodic manners: for each episode, the system dynamics areestimated and an approximated optimal policy is obtainedbased on the learned dynamics. However, there is no effective solution for general countable-state MDPs with optimalaverage cost (in contrast to discounted cost) criteria.Here we introduce an auxiliary system M̃ with boundedstate space. The auxiliary system has threshold U : the systemhas exact dynamics as the real one, with the only differencethat each queue has buffer size U . In the bounded system, foreach queue, when the queue backlog reaches U , new packetsto the queue will get dropped. Mathematically, the state spaceof M̃ can be defined as S̃ , {Q 2 S : Qmax 6 U }. M̃shares the same action space A and cost function c (Q) asM . We use p̃ to represent the state-transition function in M̃ .We denote the optimal average queue backlog in M̃ by ,and the corresponding optimal policy by . We make an assumption on for technical reasons. Forclarity, we use Ẽ[·] for the expectations in M̃ (to distinguishfrom E[·] for M ).Assumption 2. There exist a Lyapunov function (·) : S̃ !R , and constants 0 6 1, b1 , b2 , b3 , B̃ , 0, suchthat for any U 0, the following properties hold:1 1 For each Q 2 S̃, b1 Qmax 6 (Q) 6 b2 Qmax ; Q(t) 6 For each Q(t) 2 S̃, Q(t 1)b3 U ; If Qi increases (while other entries remain the same), (Q) will not decrease; When (Q) B̃ , we havehi Q(t) Q(t) Q 6 .Ẽ Q(t 1)For network systems that are stabilized under Max-Weight policy [27] or Back-Pressure- policy [28], the Lyapunovfunctions have been proven to have the form of weighted sumof Q1 . When 0 1 and is Max-Weight- or BackiPressure- method, Assumption 2 holds. Other examplesinclude the multi-class Markovian queueing network in [24]and the server allocation problem described in Section V-A.Both problems have linear Lyapunov functions under backlogoptimal policies.C. Our ApproachFor simplicity, we partition S as follows:8no S in , Q 2 S̃ : (Q) 6 b (U W )1 1:S out , S \ S inis used for states in S in ; while for S out , a stabilizing 0(defined in Assumption 1) is applied. See Figure 1 for illustration. A detailed description of the algorithm is providedin Section III.Fig. 1: Schemetic illustration of our approach (when D 2).For performance analysis, we decouple the process intotwo stages: before is learned and after is learned.For the first stage, our model-based reinforcement learningapproach applies -greedy exploration. We establish that theproposed algorithm gradually obtains with arbitrarily highprobability (cf. Theorem 1). For the second stage, by applyingdrift analysis on Markov chain, we show that when isinoutapplied for states in S and 0 for states in S , theprobability of queue backlog exceeding into S out decaysexponentially with U . In addition, every time when queuebacklog leaves S in , the expected queue backlog can bebounded as polynomial terms of U . Together, we show that the gap between our result and is upper boundedby O poly(U )/ exp poly(U ) , which diminishes as Uincreases (cf. Theorem 2).D. Other AssumptionsDue to mathematical requirements, we impose some restrictions on the communication properties on M̃ . We denoteTQ!Q0 as the first hitting time from Q to Q0 and proposethe following assumption.Assumption 3. There exist constants c,any U 0 and every Q, Q0 2 S̃,.Our reinforcement learning method can be decomposed intotwo stages: exploration stage and exploitation stage. For eachepisode, with some relatively small probability, we applyrandomized policy to learn the dynamics of M̃ and estimate . Meanwhile, in our exploitation scheme, we apply a piecewise policy: the policy obtained from exploration stages min Ẽ TQ!Q0 6 ckQ0 0, such that forQk1 ,where is a policy that can be applied to S̃.Directly verifying Assumption 3 might be computationallydifficult. Theorem 11.3.11 from [29] offers a drift analysisapproach for justifying Assumption 3.665

Lemma 1 (Theorem 11.3.11 in [29]). For a -irreducibleMarkov chain, if there exists a Lyapunov function (·) anda petite set C such that for every Q(t) 2 S,hiE Q(t 1)Q(t) Q(t) 6 1 b Q(t)2C ,The details are presented in Algorithm 1.Algorithm 1 The PDGRL algorithmInput: A, U , l 0, L 0Initialization: t1, N (·, ·)0, P̃ (·, ·)0for episodes k p1, 2, · · · , K dopSet LkL · k, kl/ k and uniformly draw 2 [0, 1].5:if 6 k then6:Set( rand (Q) , for Q 2 S̃ k (Q) . 0 (Q) , for Q 2 S \ S̃1:2:3:4:then for every Q, Q0 2 S, there exists c Q0 1 such that E TQ!Q0 6 (Q) c Q0 .A possible method is to analyze the Markov chain under 0 and (re-scaled) 0 (·) in Assumption 1. In this case, if weselect a suitable measure for the -irreducible Markov chain,c (Q) may have a polynomial upper bound regarding Qmaxand Assumption 3 holds.As the learning process proceeds, the estimation for M̃becomes increasingly accurate. However, it is unrealistic forus to obtain the exact M̃ . Therefore, we make the assumptionthat if we estimate the state-transition function accurateenough (i.e. within a certain error bound), the solution tothe estimated M̃ is the same as . The assumption is statedas follows.7:8:9:Assumption 4. There exists a p 0, such that for anyMDP M̃ 0 with the same state space, action space and costfunction as M̃ , if for each Q 2 S̃ and a 2 A, we havep̃ · Q, ap0 · Q, a1610:p,11:12:13:14:then the optimal policy of M is the same as the otimal policy for M̃ .0Notice that in most queueing networks, when systemdynamics (e.g. exogenous arrival rates, service rates, channelcapacities) vary slightly, the optimal policy remains the same.Assumption 4 is reasonable for queueing systems.III. A LGORITHMWe propose an algorithm called Piecewise Decaying Greedy Reinforcement Learning (PDGRL). PDGRL operatesin an episodic manner: at the beginning of episode k, weuniformly draw a real number 2 [0, 1] to decide whetherto explore or exploit during this episode. The length of theepisodes is not fixed butp depends on the observations. If 6 k , l/ k (where 0 l 6 1), we performexploration during this episode. For states in S̃, we applya random policy rand , which selects an action in Auniformly. For states in S \ S̃, we apply 0 . If k , we enter the exploitation stage. We firstcalculate sample means to estimate the state-transitionfunction p̃ of M̃ . We then apply value iteration on theestimated system M̃k and obtain an estimated optimalpolicy k . For the rest of the episode, we apply k forinstates in S and 0 otherwise.in Whenp the number of visits to states in S exceeds Lk L · k (where L 0), PDGRL enters episode k 1 andrepeat the process above.To further illustrate the algorithm, we define a mappingT R(·) : S ! S̃ that describes the packet dropping schemein the bounded system:Q̃ T R (Q) , min {U, Qi }Di 1.15:16:elseFor each Q, Q0 2 S̃ and a 2 A, estimatethat p̃ Q0 Q, a P̃ Q, a, Q0 /N (Q, a) forN (Q, a) 0 and p̃ Q0 Q, a 1/ R (Q, a)otherwise.Solve the estimated MDP M̃k and obtain the estimated optimal policy k .Set( k (Q) , for Q 2 S in k (Q) . 0 (Q) , for Q 2 S outend ifwhile visits to states in S in is smaller that Lk doTake at k Q(t) .Implement at to the real system and observe thenext state Q(t 1).if Q(t) 2 S in thenIncrease N Q(t), at by 1. Increase P̃ Q(t), at , T R Q(t 1)end iftt 1.end while21: end for 22: Output: estimated optimal policy K17:18:19:20:by 1.IV. P ERFORMANCE A NALYSISWe analyze the performance of our algorithm from bothexploration and exploitation perspectives. We first provethat PDGRL can learn within finite episodes with highprobability, which implies that PDGRL explores differentstates sufficiently to obtain an accurate estimation of M̃(cf. Theorem 1). We then show that PDGRL exploits theestimated optimal policy and achieves a performance closeto the true optimal result of (cf. Theorem 2). Due to thespace limit, for all the results presented in this paper, we onlyprovide proof outlines and omit proof details.In this paper, we focus on MDPs such that all states arereachable from each other under the following policies: (a) rand 0 : applying rand to S̃ and 0 to S \ S̃; (b) 0 :applying to S in and 0 to S out ; and (c) : applying (atruncated version of) to S̃ in M̃ . That is, the correspondingMarkov chains under the above policies are irreducible.666

A. Convergence to the Optimal PolicyThe following theorem states that, with arbitrarily highprobability, PDGRL learns within a finite number ofepisodes.Theorem 1. Suppose Assumption 4 holds. For each2(0, 1), there exists k 1 such that PDGRL learns (i.e. k ) within k episodes with probability at least1.Proof. We first calculate the required sample size of (Q, a)pairs to obtain with high probability under Assumption 4.We then show that the number of samples obtained byfollowing the policy rand for states in S̃ is unbounded asK ! 1.Theorem 1 indicates that PDGRL explores (i.e. samples)state-transition functions of each state-action pair (Q, a) inM̃ sufficiently.B. Average Queue BacklogSection IV-A illustrates the sufficient exploration aspect ofPDGRL. In reinforcement learning, the trade-off between exploration and exploitation is of significant importance to thealgorithm performance. In this section, we show that PDGRLalso exploits the learned policy such that the expected averagequeue backlog is bounded and can get arbitrarily close to theoptimal performance of as we increase U .We denote the time step at the end of episode k by tkand the length of episode k by L0k (i.e. L0k tk tk 1with t0 , 0). We use kin to represent the policy applied toS in during episode k and p 0 (·) to denote the stationarydistribution of states when applying to states in S in andout 0 to states in S .By Theorem 1, PDGRL learns with high probability.Note that the probability of utilizing the learned policyconverges to 1 as episode increases. Hence the episodicaverage queue backlog when kin constitutes a largeproportion of the overall expected average queue backlog.Therefore, the key step to upper bound the expected averagequeue backlog is to upper bound the episodic average queuebacklog when kin .For the purpose of analysis, we further partition S in asfollows:8no S in , Q 2 S in : (Q) 6 b (U W )1 bU13in.:S in , S in \ S inbdinWe first upper bound the episodic average queue backlogwhen kin , as stated in the following lemma.Lemma 2. Under Assumptions 1-3, we have32PPtkt tk 1 1i Qi (t) kin 5lim E 4k!1L0k in p 0 Sbd· O U 1 max{2 , } .Proof. For a given k such that kin , we decomposethe accumulated queue backlog into the three parts: theininaccumulated queue backlog when Q 2 Sin, Q 2 SbdandoutQ 2 S . For the first part, we utilize Bellman equationanalysis and Assumption 3. The second part is trivially upperbounded by N U . For the third part, we use Assumption 1 andprove that the time it takes to return back S in is polynomial(using techniques as the proof of Theorem 1.1 in Chapter 5of [34]). The details are omitted in this paper.We further establish the following lemma, which upper inbounds p 0 Sbd.Lemma 3. Under Assumption 2, we have inp 0 Sbd O expU1 .Proof. We show that under Assumption 2 when kin ,there exists a sub-quadratic Lyapunov function with negativedrift for states that have relatively large Lyapunov valueregarding (·). We then apply similar techniques as Lemma1 in [30] to complete the proof. The details are omitted inthis paper.From Lemma 2 and Lemma 3, we can upper boundthe episodic average queue backlog when kin . Bycombining with Theorem 1, we establish the following mainresult of this paper.Theorem 2. Suppose Assumptions 1-4 hold. ApplyingPDGRL to M , the expected average queue backlog is upperbounded ashPi!tK PEt 1i Qi (t)U 1 max{2 , } lim O.K!1tKexp U 1Proof. We first show that the expected average queue backlogbrought by episodes where kin 6 becomes negligible asK ! 1 and obtain the gap between the expected averagequeue backlog and . We then apply similar techniquesas the proof of Lemma 2 to characterize the relationshipbetween and .Theorem 2 provides an upper bound on the performanceguarantee of PDGRL regarding the threshold parameter U :by increasing U , the long-term average queue backlog approaches exponentially fast.V. N UMERICAL E XPERIMENTSIn this section, we evaluate the performance of proposedPDGRL for three queueing networks.A. Dynamic Server AllocationWe first consider a simple server allocation problem:exogenous packets arrive at two nodes according to Bernoulliprocess with rate 1 and 2 respectively. Both nodes haveunbounded buffers. At each time slot, a central server needsto select one of the two queues to serve. The head of line jobin the selected queue i then completes the required serviceand leaves the system with probability pi . The system modeland parameters are illustrated in Figure 2.According to [17], whenever 1 /p1 2 /p2 1, astabilizing policy is to always serve the node with the longest667

Average Queue Backlog7Fig. 2: System model of the dynamic server allocationproblem with permanent connectivity.54321000.511.522.533.544.55104Time Slot(a) U 57Average Queue Backlogconnected queue (LCQ). Therefore, we can use the LCQpolicy as 0 . Note that in our setting, the channels are alwaysconnected, 0 is actually serving the node with the longestqueue (LQ).On the other hand, according to cµ-rule in [19], theoptimal policy that minimizes the average queue backlogis to select the node with the largest successful transmissionrate among all the non-empty queues. However, cµ-rulerequires knowledge of the dynamics, which is infeasibleunder our setting. We include it in simulation only to studythe performance gap between PDGRL and optimum.For the model depicted in Figure 2, under policy thenode 2 is served whenever it is nonempty. However, sincenode 1 has a larger arrival rate and a smaller successfultransmission rate, the queue of node 1 is easier to get queuedup. Therefore, we would expect that node 1 is served morefrequently under policy 0 , leading a performance gap to theoptimal result under .We compare the performances of four policies: 0 (LCQ),PDGRL, 0 (applying for Q 2 S in and 0 otherwise)and (true optimal policy). Note that the policy 0 isexactly the best policy PDGRL can learn. We simulate it toillustrate the convergence rate of PDGRL.We conduct the simulation with U 5 and U 10 and theresults are plotted in Figure 3. It can be observed from Figure3 that PDGRL outperforms 0 , and quickly converges to 0 in both cases. The performance gap between PDGRL and is not negligible when U 5. Theorem 2 shows thatwhen U increases, the average queue backlog of PDGRLapproaches the optimal result. Figure 3 (b) shows that whenU 10 the gap between PDGRL and almost diminishes.665432100123456Time Slot78910104(b) U 10Fig. 3: Simulation results for the dynamic server allocationproblem.Fig. 4: System model of the dynamic server allocationproblem with stochastic connectivity.B. Dynamic Server Allocation (with Stochastic Connectivity)We now turn to a more general case with stochasticconnectivity: during each time slot, the nodes are connectedwith the central server with probability c1 and c2 respectively.The connectivity status is known before taking action. Onlywhen the selected queue i is connected, it is successfullyserved with probability pi . The system model and parametersare shown in Figure 4.According to [17], whenever the condition that(12(1 c1 )(1 c2 )p1 p2 112 c, c21 p2p1is satisfied, a stabilizing policy is to always serve the nodewith the longest connected queue (LCQ). Therefore, we canuse LCQ policy as 0 .However, unlike the setting in Section V-A, for stochasticconnectivity cases, unless the parameters are highly symmetric (i.e. 1 , pi p and ci c), the optimal policy thatminimizes average queue backlog remains an open problem[31].From simulations we find that for the system depicted inFigure 4, the optimal policy for the truncated systemalways chooses serving queue 2 whenever it is connected andnonempty. When only one queue is connected, selects theconnected queue. We thus consider a policy that inherits thebehavior of to approximate the true optimal policy forthe real system: when only one queue is connected, selectthat queue; when both queues are connected, select the nonempty queue with the largest successful service probabilitypi .668

8Average Queue Backlog7654Fig. 6: System model of routing32The parameter pi s are queue-dependent here:((0.9, 0.1), Q2 (t) 6 5(p1 , p2 ) .(0.1, 0.9), Q2 (t) 510012345678106Time Slot(a) U 5An obvious policy that stabilizes the queue backlog is toalways choose s ! 2 ! d, which maintains Q2 (t) 5.Therefore, we can use the policy that always routing throughs ! 2 ! d as 0 .We simulate U 10. The results are plotted in Figure7, which shows that PDGRL outperforms 0 and learns quickly.Average Queue Backlog1086482002468Time Slot101214Average Queue Backlog716106(b) U 10Fig. 5: Simulation results for the dynamic server allocationproblem with stochastic connectivity.654321We implement the simulation for U 5 and U 10. Inthis setting, we incorporate the connectivity status into thestate (i.e. (Q1 , Q2 , C1 , C2 ) 2 S N N {0, 1} {0, 1}).The results are shown in Figure 5. Similar to the results inSection V-A, PDGRL outperforms 0 , and learns gradually. However, since the state space expands (connectivity isalso incorporated into the state space), the convergence rateis smaller. Also, by comparing the gap between 0 and the approximated with different thresholds U , we observethat the average queue backlog of PDGRL approaches theoptimal result as U increases.C. RoutingWe consider a routing problem: exogenous packets arriveat the source node s according to Bernoulli process withrate . Node 1 and node 2 are two intermediate nodes withunbounded buffers and can serve at most one packet duringeach time slot, with probability p1 and p2 respectively. Noded is the destination node. At each time slot, node s has tochoose between routes s ! 1 ! d and s ! 2 ! d to transitnew exogenous packets. Specifically, the system model andparameters are shown in Figure 6.00123456789Time Slot10104Fig. 7: Simulation results of routing.VI. C ONCLUSIONIn this work, we apply a model-based reinforcement learning framework to general queueing networks with unboundedstate space. We propose the PDGRL algorithm, which appliesa -greedy exploration scheme. By leveraging Lyapunovanalysis, we show that the average queue backlog of theproposed approach can get arbitrarily close to the optimalaverage queue backlog under oracle policy. The proposedPDGRL algorithm requires the knowledge of a stable policy.An interes

Reinforcement learning methods provide a framework that enables the design of learning policies for general networks. There have been two main lines of work on reinforcement learning methods: model-free reinforcement learning (e.g. Q-learning [4], policy gradient [5]) and model-based reinforce-ment learning (e.g., UCRL [6], PSRL [7]). In this .

Related Documents:

Bruksanvisning för bilstereo Bruksanvisning for bilstereo ... - Jula

Bruksanvisning för bilstereo . Bruksanvisning for bilstereo . Instrukcja obsługi samochodowego odtwarzacza stereo . Operating Instructions for Car Stereo . 610-104 . SV . Bruksanvisning i original

375 Views

1y ago

Applying Deep Reinforcement Learning to Berkeley's Capture the Flag game

2.3 Deep Reinforcement Learning: Deep Q-Network 7 that the output computed is consistent with the training labels in the training set for a given image. [1] 2.3 Deep Reinforcement Learning: Deep Q-Network Deep Reinforcement Learning are implementations of Reinforcement Learning methods that use Deep Neural Networks to calculate the optimal policy.

102 Views

1y ago

1 Introduction to reinforcement learning - GitHub Pages

IEOR 8100: Reinforcement learning Lecture 1: Introduction By Shipra Agrawal 1 Introduction to reinforcement learning What is reinforcement learning? Reinforcement learning is characterized by an agent continuously interacting and learning from a stochastic environment. Imagine a robot movin

26 Views

2y ago

10 tips och tricks för att lyckas med ert sap-projekt

10 tips och tricks för att lyckas med ert sap-projekt 20 SAPSANYTT 2/2015 De flesta projektledare känner säkert till Cobb’s paradox. Martin Cobb verkade som CIO för sekretariatet för Treasury Board of Canada 1995 då han ställde frågan

735 Views

2y ago

Nordens 25 största medieföretag efter omsättning

service i Norge och Finland drivs inom ramen för ett enskilt företag (NRK. 1 och Yleisradio), fin ns det i Sverige tre: Ett för tv (Sveriges Television , SVT ), ett för radio (Sveriges Radio , SR ) och ett för utbildnings program (Sveriges Utbildningsradio, UR, vilket till följd av sin begränsade storlek inte återfinns bland de 25 största

333 Views

1y ago

SS 02 52 68 Ljudklassning av utrymmen i byggnader - byggtjanst.se

Hotell För hotell anges de tre klasserna A/B, C och D. Det betyder att den "normala" standarden C är acceptabel men att motiven för en högre standard är starka. Ljudklass C motsvarar de tidigare normkraven för hotell, ljudklass A/B motsvarar kraven för moderna hotell med hög standard och ljudklass D kan användas vid

357 Views

1y ago

Apple Developer Program License Agreement (Swedish)

LÄS NOGGRANT FÖLJANDE VILLKOR FÖR APPLE DEVELOPER PROGRAM LICENCE . Apple Developer Program License Agreement Syfte Du vill använda Apple-mjukvara (enligt definitionen nedan) för att utveckla en eller flera Applikationer (enligt definitionen nedan) för Apple-märkta produkter. . Applikationer som utvecklas för iOS-produkter, Apple .

344 Views

1y ago

EN - European Commission

playing field within the internal market, even in exceptional economic circumstances. This White Paper intends to launch a broad discussion with Member States, other European institutions, all stakeholders, including industry, social partners, civil society organisations,

69 Views

3y ago

Recent Views

AUTOMOTIVE INDUSTRY ANALYSIS REPORT and GUIDE

3.1 General Outlook of the Automotive Industry in the World 7 3.2 Overview of the Automotive Industry in Turkey 10 3.3 Overview of the Automotive Industry in TR42 Region 12 4 Effects of COVID-19 Outbreak on the Automotive Industry 15 5 Trends Specific to the Automotive Industry 20 5.1 Special Trends in the Automotive Industry in the World 20

1y ago

86 Views

Automotive Pathway Automotive Services Fundamentals

Automotive Pathway Automotive Services Fundamentals Course Number: IT11 Prerequisite: None Aligned Industry Credential: S/P2- Safety and Pollution Prevention and SP2- Mechanical and Pollution Prevention Description: This course introduces automotive safety, basic automotive terminology, system & component identification, knowledge and int

2y ago

228 Views

Articulation Agreements: College of Applied Technologies .

Hernando High School FL Automotive . Central Nine Career Center IN Automotive Elkhart Area Career Center IN Automotive . Kokomo Area Career Center IN Automotive North Lawrence Vo-Tech IN MLR Porter County Career Center IN Automotive Richmond High School IN Automotive Southeastern Career

2y ago

376 Views

Automotive Basics - Auto Upkeep

Automotive Basics - Course Description "Automotive Basics includes knowledge of the basic automotive systems and the theory and principles of the components that make up each system and how to service these systems. Automotive Basics includes applicable safety and environmental rules and regulations. In Automotive Basics, students will gain

1y ago

197 Views

Automotive Automotive Automotive - HSBC Bank Malaysia

This Merchant list is subject to change from time to time. Merchant(s) who are terminated from the Instalment program after the published date might still be reflected in this list. HSBC Cardholder(s) are advised to confirm the availability of HSBC Card Instalment Plan with the merchant. Automotive Automotive Automotive

1y ago

173 Views

On the Road: U.S. Automotive Parts Industry Annual Assessment

Table 12: Acquisitions of U.S. Automotive Parts Companies (SIC 3714) Table 13: Automotive Parts Exports, 2000-2010 Table 14: Automotive Parts Imports, 2000-2010 . Automotive parts consumption is linked to the demand for new vehicles, since roughly 70 percent of U.S. automotive parts production is for Original Equipment (OE) products. .

10m ago

72 Views

EMC TEST SYSTEMS FOR AUTOMOTIVE

AUTOMOTIVE EMC TEST SYSTEMS FOR AUTOMOTIVE ELECTRONICS AUTOMOTIVE EMC TEST SYSTEMS FOR AUTOMOTIVE ELECTRONICS Step 1 Step 2 Step 3: Set the parameters Step 4: Active test. Load dump pulses have high pulse energy, which can be highly destructive to electrical or electronic equipment. The LD 200N series simulates these pulses with high energy in a range of up to 1.2 seconds. The LD 200N .

3y ago

266 Views

Automotive Manufacturing - Select Georgia

Jobs created by Georgia’s automotive-related locations Toyo Tire North America Manufacturing and expansions in the last three years 32,000 Automotive-related engineers and production workers in Georgia Sources: EMSI 2020.3, press releases and Automotive Database, Georgia Power Community & Economic Development, 2020 Automotive Manufacturing

2y ago

166 Views

#1 OSAT for Automotive Packaging and Test

We Know Automotive Amkor has extensive experience with automotive process requirements shipping billions of units every year for automotive applications. Our packages meet or exceed automotive quality, reliability, burn-in and safe launch plan criteria. Amkor also has failure analysis, tri-temp test and statistical process capability in all .

1y ago

145 Views

Ipsos Automotive Center of Excellence

Global Automotive Center of Excellence -2014 Ipsos Automotive 9 Automotive Center of Excellence As global automotive markets get more sophisticated, they require vehicle manufacturers to offer the most relevant market propositions to match consumer needs. There is greater value than ever before for a global research partner, who understands

1y ago

126 Views

All about automotive engineering in a pocketbook The 8th edition has .

Automotive Automotive Handbook Handbook All about automotive engineering in a pocketbook The 8th edition has been revised and extended. Automotive Handbook Reference handbook for academic and personal use. ISBN 978--7680-4851-3 Contents - central themes Basic principles: physics, materials, machine parts, joining and bonding techniques

1y ago

135 Views

Brochure: Advanced Flash Storage Solutions for Automotive Applications

iNAND Automotive Embedded Flash Drives (EFDs) are designed to support the harsh environments, high reliability and quality required by the automotive industry. The automotive iNAND product portfolio supports both UFS and e.MMC interfaces in a small 11.5x13mm package with a wide range of capacities to provide automotive OEMs and Tier-1

1y ago

161 Views

Industry Skills Forecast and Proposed Schedule of Work Automotive

Executive summary The Automotive Retail, Service and Repair (AUR) and Automotive Manufacturing (AUM) Training Packages are critical elements in the Vocational Education and Training (VET) system, playing central roles in the training of learners that engage in the automotive industries. A productive and valuable Automotive Training

1y ago

134 Views

Automotive Programs Student Handbook - SCCIowa

include a basic knowledge of all facets of the automotive repair industry, followed by classroom practice and drills of basic skills utilized in the automotive repair industry. The curriculum includes an internship experience in an automotive repair business. The curriculum is evaluated and revised as automotive repair needs change in the industry.

10m ago

72 Views

automotIve

automotive manufacturers worldwide. Those companies that take a forward-thinking approach will gain a competitive advantage and secure a leadership position in a realigned automotive value chain. At Seco, we partner with OEMs and other vehicle-based organisations around the globe to help automotive manufacturers overcome their

3y ago

145 Views

Reinforcement Learning For Optimal Control Of Queueing Systems

It looks like you're using an ad-blocker