Applying Reinforcement Learning To Economic Problems

1y ago

6 Views

3 Downloads

870.44 KB

32 Pages

Last View : 28d ago

Last Download : 3m ago

Upload by : Noelle Grant

Report this link

Download PDF

Transcription

Applying reinforcement learning to economicproblemsNeal HughesAustralian National UniversityNovember 14, 2014AbstractReinforcement learning provides algorithms for solving Markov DecisionProcesses (MDPs), which do not require prior knowledge of the payoff and ortransition functions. Rather agents ‘learn’ optimal polices by observing theoutcomes (e.g., the payoffs and state transitions) of their actions.In this paper, we introduce methods based around fitted Q iteration (Ernstet al. 2005): a batch version of Q-learning (Watkins 1992). Fitted Q iterationis proven to converge (in the single agent case) for certain types of functionapproximators - here we focus on the popular approximation method tilecoding.We consider how reinforcement learning can be applied to complex multiagent problems: stochastic games, where each agent faces a MDP with transition and payoff functions that are dependent on the actions of the otherplayers. We demonstrate these methods in the context of single and multiagent water storage problems.1

1 IntroductionReinforcement learning provides a wide range of algorithms for solving MarkovDecision Processes (MDPs), which do not require ‘models’ of the ‘environment’(the payoff and transition functions). Rather agents ‘learn’ optimal polices byobserving the outcomes of their actions: optimisation by simulation. Similar todynamic programming, reinforcement learning methods work by exploiting theBellman principle.While used extensively in artificial intelligence and operations research, reinforcement learning has received limited attention in economics. One reason, is that forall their simplicity and intuitive appeal, many practical challenges are faced inadapting these methods to economic problems.We focus on the method of fitted Q iteration (Ernst et al. 2005); a batch versionof standard Q-learning (Watkins and Dayan 1992). In fitted Q iteration, a largenumber of action, payoff and state transition samples are simulated, to which anaction-value or Q function is then fit. Similar to fitted value iteration, the methodis proven to converge subject to assumptions on function approximation.Our goal is to develop solution methods for complex multi-agent problems, specifically stochastic games, where each agent faces a MDP with state transition andpayoff functions dependent on the behaviour of other agents. Here we developan approach in the spirit of ‘multi-agent learning’ (Fudenberg and Levine 2007),which combines reinforcement learning with learning concepts from game theory.Our approach provides a middle ground between the dynamic programmingmethods used in heterogeneous agent macro models and the simulation andsearch methods commonly used in agent based computational economics — bothin terms of the size and complexity of the models it can be applied to and thedegree of rationality or ‘intelligence’ assumed for the agents.We begin by defining our problem space: the single agent MDP and the stochasticgame. We then provide an introduction to reinforcement learning for single agentproblems before detailing the function approximation techniques we employ,particularly tile coding. We then demonstrate the single agent method with anapplication to a water storage problem.We then move on to multiple agent problems. Here we summarise the literature onsolution concepts for stochastic games including the Markov Perfect Equilibriumand Oblivious Equilibrium, before discussing the overlapping computer scienceand economic literature on multi-agent learning. We then introduce our multipleagent version of fitted Q iteration, and detail its application to a water storageproblems.2

2 The problems2.1 Markov decision processA MDP represents the problem of an agent selecting some action in an environmentin order to maximise a reward. A MDP operates in discrete time: each periodgiven current state st the agent takes an action at , the environment then producesa reward rt and a state transition st 1 .Formally, a Markov decision process is a tuple (S, A, T, R, β). S RDS is the statespace, where DS {1, 2, .} is the dimensionality of the state space. A RD A isthe action space. T : S A S [0, 1] is the transition function, a probabilitydensity function such thatZS0T (s, a, s0 ) ds0 Prob(st 1 S0 st s, at a)R : S A R is the reward function. Finally β (0, 1) is the discount rate. Theusers problem is to choose at to maximise the expected discounted rewardmax E{ at } t 0(t βt R( at , st )t 0)given T, R, s0 and at Γ(st ) A, where Γ is the feasibility correspondence.A (Markovian) policy function for the MDP is a mapping from states to actionsf : S A. The discounted expected reward of following policy f is definedV f (s) E( βt R( f (st ), st ) s0 st 0)where st 1 T (st , f (st )).The value function associated with the optimal policy is defined asV (s) sup{V f (s)}Typically the value function also satisfies the Bellman principle V (s) max R(s, a) βa3Z0S 0T (s, a, s )V (s ) ds0

2.2 Stochastic gameA stochastic game is essentially a multiple agent MDP.There is a finite set of players I {0, 1, ., N }. The agents take actions ait Ai RD A . We define the action profile as at ( ait )i I and the action spaceA A0 A1 . A N . The state of the game st , can include both agent specificgstates sit Si RDS and a global state st S g RDG , the state space is S S0 S1 . S N S g .Each agent has reward function Ri : S A R and as before we have a transitionfunction T : S A S [0, 1] and discount rate β. The agents problem is tochoose ait to maximise their expected discounted rewardmax E{ ait } t 0(t β t Ri ( a t , s t )t 0)iiigiven s0 , Ri , T, a t and at Γi ( st ) A .We define player policy functions f i : S Ai and the policy profile functionff : S A. For any policy profile each player has a value function Vi : S Rdefined byfVi (s) E( βt R(( f (st ), st ) s0 st 0)Stochastic games were first introduced by Shapley (1953), who showed that atwo player zero-sum stochastic game could be solved by value iteration. Someof the main economic applications of stochastic games have been in industrialorganisation, particularly models of oligopoly with investment and firm entry andexit (Ericson and Pakes 1995).Stochastic games have also been applied to common pool resource extractionproblems (see for example Levhari and Mirman 1980). Recent applications haveincluded fisheries (Kennedy 1987), groundwater (Negri 1989, Rubio and Casino2001, Burt and Provencher 1993) and even surface water reservoirs (Ganji et al.2007). Stochastic games have also been applied to commodity storage problems(Murphy et al. 1987, Rui and Miranda 1996).3 Single agent reinforcement learningReinforcement Learning is a sub-field of machine learning concerned with solvingMDPs. For a detailed introduction see Sutton and Barto (1998), Bertsekas and4

Tsitsiklis (1995) or Weiring and Otterlo (2012).Central to reinforcement learning are so called ‘model free’ approaches: wherethe transition and payoff functions are assumed unknown. Here the agent mustlearn only by observing the outcomes — the reward and state transition — ofits interactions with the environment (figure 1). Learning a good policy requiressome degree of ‘exploration’: that is, testing a range of actions in each state.Figure 1: Reinforcement learningAgentState, stAction, atReward, rtEnvironmentst 1While historically related, the reinforcement learning methods sometimes appliedin repeated games (see for example Erev and Roth 1998) are distinct from themethods we refer to here, as they involve estimating value functions via theBellman principle.Reinforcement learning has some computational advantages over dynamic programming, particularly in larger problems. As a simulation method, attention islimited to realised state combinations (see Judd et al. 2010), rather than a regular(tensor product) grid over the state space. Since, state variables are often correlatedthis can greatly reduce the complexity of the problem (Judd et al. 2010).For example, figure 2 shows 10,000 simulated state points (storage St by inflow It )for a water storage problem (see section 5). Note that all inflows above the storagecapacity of 1000 GL are associated with storage of 1000.Reinforcement learning methods can also be easy to parallelize and generallyprovide greater flexibility to trade-off computation time and accuracy.3.1Q-learningQ-learning (Watkins and Dayan 1992) is the canonical ‘model free’ reinforcementlearning method. Q-learning works on the ‘state-action’ value function Q : S 5

Figure 2: Planner’s storage problem, sample state points2000Inflow (GL)15001000500002004006008001000Storage (GL)A R, defined as the present value payoff from taking action a in state s andfollowing an optimal policy thereafter Q ( a, s) R(s, a) βZST (s, a, s0 ) max Q ( a, s0 ) ds0aOnce in possession of Q , we can compute an optimal (aka greedy) policy withoutthe payoff and transition functionsf (s) arg max Q ( a, s)amax Q( a, s) V (s)aIn standard Q-learning we incrementally update the Q function by observingstate-action transitions {st , at , rt , st 1 }. For a discrete state and action space, thealgorithm operates as follows:Algorithm 1: Q-learning with discrete state and actions123456Initialise Q, s0for t 0 to T doselect at A ;// from some exploration policysimulate ( at , st ) and observe rt and st 1set Q( at , st ) (1 αt ) Q( at , st ) αn {rt β maxa Q( a, st 1 )}endThe actions at must be selected according to an ‘exploration’ (partially randomised)policy. A simple option is an e-greedy policy: a random policy with probabilitye and an optimal policy otherwise. The choice of exploration policy involves6

an exploration-exploitation trade-off: a highly random policy will provide goodcoverage of the state-action space, but risks spending too much time at irrelevantpoints.αt (0, 1) is known as the learning rate. α may be constant, but more commonlyfollows a decreasing schedule. Watkins and Dayan (1992) shows (for discrete stateand actions) that Q-learning converges as t , subject to conditions over theexploration policy and learning rate αt .Q-learning can be extended to the continuous state and action case through function approximation. However, this typically voids convergence guarantees. Further, Q-learning is known to be unreliable (prone to spectacular divergence) in thecontinuous case (Weiring and Otterlo 2012).3.2 Fitted Q iterationFitted Q iteration (Riedmiller 2005, Ernst et al. 2005) is a batch algorithm. First, asimulation is run for T periods (again using an exploration policy). Then a seriesof Q function updates is applied to the accumulated set of state-action samples(see Algorithm 13).The approach has two main advantages: data efficiency since all samples are storedand reused and stability since Q functions can be fit to large samples. Further, ithas some attractive properties in multiple agent problems (Weiring and Otterlo2012).Algorithm 2: Fitted Q Iteration (continuous state and action)12345678910111213initialise s0for t 0 to T do// Simulate the system for T periodsselect at A ;// from some exploration policysimulate ( at , st )store the sample (st , at , rt , st 1 )endinitialise Q( at , st )repeat// Iterate until convergencefor t 0 to T doset Q̂t rt β. maxa .Q( a, st 1 )endestimate Q by regressing Q̂t against ( at , st )until a stopping rule is satisfied;The separation of simulation and the fitting stages provides also additional flexibility. Firstly, it allows any type of parametric or non-parametric regression(supervised learning) model to be applied. Secondly, it facilitates parallel computing, since the simulation stage is so called embarrassingly parallel.7

The success of fitted Q iteration depends crucially on the type of function approximation schemes employed. A variety of schemes have been used in the literature,including random forests (Ernst et al. 2005), neural networks (Riedmiller 2005)and tile coding (Timmer and Riedmiller 2007). Similar to continuous dynamicprogramming, the algorithm is guaranteed to converge for certain classes (i.e.,non-expansive) function approximators (Ernst et al. 2005) (see Section ?).3.3 Fitted Q-V iterationIn noisy economic problems large samples T may be required. In this case optimising Q for each state point st may be an unnecessary burden (especially in themulti-agent case). One option is to optimise over a representative subset of statepoints, then estimate a continuous state value function V (Algorithm 18).Algorithm 3: Fitted Q-V iteration123456789101112131415161718initialise s0for t 0 to T do// Simulate the system for T periodsselect at A ;// from some exploration policysimulate ( at , st )store the sample (st , at , rt , st 1 )endinitialise Q( at , st )repeat// Iterate until convergenceKt Tselect a subset {sk }k 1 {st }t 0for k 0 to K doset V̂k maxak .Q( ak , sk )endestimate V (st ) by regressing V̂k on skfor t 0 to T doset Q̂t rt β.V (st 1 )endestimate Q by regressing Q̂t against ( at , st )until a stopping rule is satisfied;3.4 Sample gridsA natural choice for our subset of state points {sk }kK 1 is a sample of approximatelyequidistant points (i.e., a sample grid). Our starting point here is a simple distancebased method (Algorithm 4), which provides a subset of points at least r distanceapart. 1 . This method is very similar to the approach of Judd et al. (2010) 2 .1 Notetypically we scale all input data to the range [0, 1]our method also counts the sample points within the radius r of each point: in orderto identify outliers. Judd et al. (2010) remove outliers by separately estimating a density function.2 However,8

Figure 3 provides a demonstration in two dimensions. This method is sufficientfor moderate sample sizes but can become inefficient for large dense data sets.One option is to add an early stopping condition, another is to employ some formof function approximation (i.e., tilecoding).Algorithm 4: Selecting an approximately equidistant grid123456789101112131415161718set J 0, c0 X0 , n0 0for t 0 to T doset rmin for j 0 to J doset r Xt c j if r rmin thenset rmin rset j jendendif rmin r thenset j j 1set c j Xtset n j 0elseset n j n j 1endend// Find the nearest center c j // Add Xt as the next center c j// Increment counter for center c j by 1Figure 3: An approximately equidistant grid in two dimensions4433221100 1 1 2 2 3 3 4 4 3 2 10123 4 44(a) 10000 iid standard normal points 3 2 10123(b) 100 points at least 0.4 apart94

4 Function approximationThe success of batch reinforcement learning depends crucially on function approximation. In machine learning this is known as ’supervised learning’: to ‘learn’(estimate) a model for a ‘target’ (dependent) variable Y conditional a vector of‘input’ (explanatory) variables X, given only a set of ‘training data’ {Yt , Xt }tT 0 . Ademonstration of the problem is provided in figure 4.The goal with function approximation is prediction: we want a model that canaccurately predict Y given realisations of X outside our training sample. In attempting to minimise prediction error we face a bias-variance trade-off. A highlyflexible model is at risk of ‘overfitting’ noisy data (figure 4a) while an inflexiblemodel may lead to systematically biased predictions (figures 4b, 4e).For our purposes, computation time is also important: both fitting (estimation)time and prediction (function call) time. In practice, subtle tradeoffs are facedbetween predictive power, fitting time and prediction time. Unfortunately, there isno general purpose method that achieves the optimal balance of all factors in allapplications: there is no free lunch (Wolpert 1996).Our Fitted Q V iteration approach, poses two distinct approximation problems:A big problem (i.e., the Q function) and a small problem (i.e., the policy and valuefunctions f , V). The features of these problems are summarised in table 1.Table 1: Two approximation problemsBig ProblemSample sizeLarge (0.5-1 106 )Input dimensionsSmall (5)Data structureNoneTarget noiseHighTime constraintFittingExtrapolation important NoExampleQ( at , st )Small problemSmall (500-2000)Small (4)GriddedLowPredictionYesf (st )For various reasons, standard methods employed in economics — such as orthogonal polynomials — are not ideal for these problems. Below we introduce themethod of tile coding, which we use to approximate Q, V and f in all our waterstorage problems.10

110.40.60.80.20.10.00.00.20.10.00.0(d) Tilecoding (TC-A) NT 3, NL 0.2(a) Tilecoding (TC-A) NL 1, NT 50.60.70.40.60.80.20.6(e) Linear model0.40.8(b) Tilecoding (TC-A) NL 1, NT 60.21.01.00.20.40.60.8(c) Tilecoding (TC-SGD) NT 4, NL 400.00.00.10.20.30.40.50.60.7Figure 4: A comparison of function approximation methods in one dimension1.0

For various reasons, standard methods employed in economics — such as orthogonal polynomials — are not ideal for these problems. Below we introduce a methodknown as tile coding, which is particularly well suited to noisy problems in low(i.e., less than 10) dimensions.4.1 Tile codingTile coding (Albus 1975, Sutton and Barto 1998) is a function approximationscheme popular in reinforcement learning. With tile coding, the input space ispartitioned into tiles. A whole partition is referred to as a tilling or a layer. A tilecoding scheme then involves multiple overlapping layers, each offset from eachother according to some displacement vector.Figure 5: Tile codinginput spacetiling layer 1tiling layer 2input point Xtactivated tile, layer 1activated tile, layer 2Tile coding is best understood visually. In the simplest approach the tilings arejust regular grids (i.e. rectangular tiles) and each grid is offset uniformly (i.e.diagonally) as in figure 5.More formally, a tile coding scheme involves i 1 to NL layers. Each layercontains j 1 to NT binary basis functionsφij (X) 1 0if X Xijif otherwiseFor each point Xt one tile j in each layer i is activated and our predicted value isthe mean of the weights wij attached to the active tiles12

Y 1NLNL NT wij φij (X)i 1 j 1Tile coding is a constant piecewise approximation scheme: the ‘resolution’ ofthe approximation depends on the number of layers and the number of tilesper layer. For example, figure 4a shows a tile coding scheme with NL 1 andNT 6. Just increasing the number of tiles gives a finer resolution but provides lessgeneralisation leading to over fitting (see figure 4b). Figure 4c, shows a model withNL 40 and NT 4 which provides both high resolution and good generalisation.Tile coding has some computational advantages schemes with basis functionsof global support. Tile weights are stored in arrays and accessed directly (computing array indexes just involves integer conversion of X). Each function callthen involves only the NL active weights. As such, prediction time is low andgrows linearly in the number of layers rather than exponential in the number ofdimensions.Function predict(X)123456set Ŷ 0for i 0 to NL doj index (X, i ) ;Ŷ Ŷ N1L wijendreturn Ŷ// returns index of active tileThis speed gain comes at the cost of higher memory usage. However, since manyreinforcement learning applications are CPU bound, this is an efficient use ofresources. Memory limits are not an issue in any of our problems. In higherdimensions, weight arrays can be compressed through ’hashing’.4.1.1 FittingThe standard method of training the weights is SGD. An alternative, is to defineeach weight as a simple average, as in Algorithm 5.Timmer and Riedmiller (2007) considers the use of tile coding in fitted-Q-iteration.When tile weights are simple averages, fitted-Q-iteration is guaranteed to convergeon a unique fixed point (Timmer and Riedmiller 2007). A convergence result ispossible, because this form of tile coding is a non-expansive approximator: that isa smoother or averager (see Stachurski 2008, Gordon 1995)3 .3 Thisform of tile coding is in fact closely related to other more common averaging methodssuch as k-nearest neighbours and random forests.13

Algorithm 5: Fitting tile weights by averaging123456789101112for t 0 to T dofor j 0 to NL doi index (Xt , j) ;set nij nij 1 ;set wij wij Yij ;endendfor j 0 to Nl dofor i 0 to NT doset wij wij /nij ;endend// returns index i of active tile// count tile sample size// calculate sum// calculate meanWhile fitting by averaging provides a convergence guarantee it is unlikely toprovide ideal performance: it will suffer badly from bias if the tiles are too wide(see figure 4d) and variance if the tiles are too small. An alternative approachis ‘Averaged SGD’ (ASGD) (Bottou 2010), where the weights are defined as theaverage of a single SGD pass over the data (Algorithm 6). .Algorithm 6: Averaged Stochastic Gradient Descent (ASGD) - Tile coding123456789101112131415Initialise wij by averaging (Algorithm 5)set w̄ij 0 for all i, jfor t 0 to T doset δt predict (X) Ytfor j 0 to J doi index (Xt , j)set wij wij αt δtset w̄ij w̄ij wij ;endendfor j 0 to NL dofor i 0 to NL doset wij w̄ij /nij ;endend// A single SGD pass// sum the weight updates// calculate meanBottou (2010) demonstrates the superiority of ASGD over SGD from problemswith large samples. In Section 5 we show that fitting the Q function by ASGDachieves a performance gain over averaging with no loss of stability.14

4.1.2 Big problemsFor big problems (i.e., the Q function) we use tile coding with regularly spacedgrids. We use the ‘optimal’ displacement vectors of (Brown and Harris 1994). Tileweights are fit by ASGD.Tile coding can suffer from noise in regions where training data are sparse, so forinput variables with tails, we limit the tiling to a percentile range of the trainingdata (e.g., the 1st to 99th). We then pass on the job of extrapolating into theunrepresented parts of the input space to our policy and value functions. 4 .4.1.3 Small problemsFor small problems we again use regular grids and optimal displacement vectors.The tile weights are fit either by averaging — for value functions — or standardSGD using averaging for starting values.For extrapolation, we combine our tile coding scheme with a sparse linear splinemodel. The combined scheme replaces the tile code weight wij with the linearspline predicted value if nij 0.5 The planner’s water storage problemHere we apply fitted Q-V iteration to the planners storage problem and compareit to the benchmark of dynamic programming.5.1 The problemThe problem is that of a planner managing a reservoir on a river. The reservoirreceives stochastic water inflows makes storage releases to satisfy multiple waterusers (i.e., irrigation farmers) located at a single demand node.The planners problems is to maximise the benefits from water use, subject to thehydrological (water supply) constraints:max E {Wt }tt 0( βt Π(Qt , It )t 04 There)are a number of other more complex options here. One is the idea of ‘adaptive tilecoding’ where the tile sizes are endogenous and may for example be larger in regions with lessdata (see Whiteson et al. 2007). Another is to apply a non-linear scaling to the input data to make itmore uniformly distributed.15

Figure 6: A simple river systemInflow, It 1Storage, St1Release point, F1tExtraction, Et2Extraction point, F2tDemand nodeReturn flow, RtEnd of system, F3t3Subject to:St 1 min{St Wt δ0 αSt2/3 It 1 , K }0 Wt StQt max{(1 δ1b )Wt δ1a , 0}where Π is a concave payoff function (i.e., irrigation profit function), Qt is wateruse, It 1 is the stochastic storage inflow (following an AR-1 process), St the storagevolume, Wt the storage withdrawals, K is the fixed storage capacity, δ0 , δ1a , δ1b , αare all loss (i.e., evaporation) parameters and β is the discount rate. Further detailon parameter assumptions is contained in my thesis.An important feature of the problem are storage ‘spills’: where the storage reachescapacity and further inflows are ‘lost’ (i.e. flow uncontrolled downstream) wedefine spills Zt asZt max{ It 1 (K St Wt δ0 αSt2/3 ), 0}16

5.2 The solutionFor fitted Q-V iteration we adopt a two stage ‘growing batch’ approach. We simulate an initial batch of samples assuming uniform (i.e., uninformed) exploration:Wt et .Stet U [0, 1]After running fitted Q-V on the first first batch we add a second batch of samples,this time with Gaussian exploration:Wt min{max{ fˆ(St , Ĩt ) et St , 0}, St }et N (0, δ)0 δ 1where fˆ is the policy function obtained from the first batch.Algorithm 4 is used to build a grid of state points with r 0.02. Tilecoding isused to approximate Q, V and f . We test fitting the Q function both by averaging(TC-A) with ASGD (TC-ASGD).For dynamic programming we employ fitted policy iteration and use tile codingto approximate V over a 35 35 grid of the state space. We begin both methodswith the initial guess V (X) 0.Both methods are coded predominantly in cython. Both make user of parallelization and run on a standard 4-core i7 desktop. Mean welfare, storage, and solutiontime are shown in tables 2, 3, 4 (each the average of 10 runs).Fitted Q-V iteration obtains a policy comparable with SDP in less time. The factthat fitted Q-V iteration compares well for trivial single agent problems, suggestssignificant gains (in computation time) may be achievable in larger problems.Table 2: Mean social TC-A185.4TC-ASGD 186.6185.9186.2181.4186.6186.0186.3Reinforcement learning achieves welfare levels up to 99.8 per cent of the SDPsolution and results in similar mean storage levels.17

Figure 7: Performance of fitted Q-V iteration, planner’s storage problemSocial welfare as percentage of 940.99310000200003000040000500006000070000Number of samplesTable 3: Mean TC-A691.5TC-ASGD 697.8685.8710.8577.4697.8690.5699.6Table 4: Computation 47.20.40.67.50.50.97.40.61.37.40.81.91880000

For a given sample size ASGD outperforms averaging. Note that in this trivialexample, TC-A still performs well on a computation time basis, because fitting timeis longer with ASGD. However in more complex problems — where simulation ismore time intensive — the gains from ASGD become more important.6 Equilibrium in stochastic games6.1 Markov Perfect EquilibriumA natural equilibrium concept for stochastic games is a ‘Markov perfect equilibrium’ (MPE) (Maskin and Tirole 1988). An MPE is defined by a set of markovianpolicies functions { f 0 , f 1 , ., f N } which simultaneously solve each agents’ problem,forming a sub-game perfect Nash equilibrium.MPE existence results have been established for specific classes of stochastic gamesas early as Shapley (1953). These early results typically assume finite state space ortime horizons and mixed strategies. A general (pure strategy, infinite horizon andstate space) existence result has remained elusive despite much recent attention(Duggan 2012, Escobar 2011, Balbus et al. 2011, Horst 2005, Amir 2005).Broadly, MPE existence results involve two steps: one, show that for any feasibleset of opponent policies f i the agents’ problems have unique solutions Vi (s);two, show that the static ‘stage game’, with payoff functions πi ( a, s, Vi (s))πi ( a, s, Vi (s)) R(s, a) βZST (s, a, s0 )Vi (s0 ) ds0has a Nash equilibrium for any s S and any feasible set of Vi .Recent existence results all rely on particular regularizing assumptions. For example Escobar (2011) adopts an assumption of concave reduced payoffs: concavityof πi with respect to a. Horst (2005) relies on a weak interaction condition: thatplayers utility is affected more their own actions than by all other players actions.Amir (2005) applies the lattice theory concepts of supermodularity and increasingdifferences. Recently Duggan (2012) proves existence of MPE where the transitionfunctions are subject to a particular form of noise.In general, uniqueness of equilibra in stochastic games is not guaranteed. Asdemonstrated by Doraszelski and Satterthwaite (2010) multiple equilibira arecommonly observed in the Ericson and Pakes (1995) style models. The uniquenessof equilibria is usually considered numerically, by testing invariance to startingvalues. Although standard algorithms are not guaranteed to locate all possibleequilibria (Borkovsky et al. 2008).19

6.2 Oblivious EquilibriumA general problem with stochastic games is that the dimensionality of the statespace, DSn DG , can be too large, particularly when n is large and DS 0. Furtherthe assumption that the agents have information on all opponent state variablesbecomes unrealistic.Oblivious Equilibrium (OE) is an alternative to MPE in the case were n is large(Weintraub et al. 2008). Here opponent state variables are replaced with relevantsummary statistics, such that the state space for the agents’ problems is condensedto Si SG .Weintraub et al. (2008) shows that under certain conditions (a large number ofsimilarly sized firms) OE approximates MPE for oligopoly type models. This resultis generalised to a broader class of dynamic stochastic games by Abhishek et al.(2007). In context of oligopoly models opponent state variables are replaced withtheir long-run average means. Weintraub et al. (2010) extend OE to models withaggregate shocks, here opponent states are replaced with their mean conditionalon the aggregate shock.While uniqueness of OE is not guaranteed Weintraub et al. (2008) find no examplesof multiple equilibria in applied pr

Reinforcement learning methods can also be easy to parallelize and generally provide greater ﬂexibility to trade-off computation time and accuracy. 3.1 Q-learning Q-learning (Watkins and Dayan 1992) is the canonical 'model free' reinforcement learning method. Q-learning works on the 'state-action' value function Q : S 5

Related Documents:

Applying Deep Reinforcement Learning to Berkeley's Capture the Flag game

2.3 Deep Reinforcement Learning: Deep Q-Network 7 that the output computed is consistent with the training labels in the training set for a given image. [1] 2.3 Deep Reinforcement Learning: Deep Q-Network Deep Reinforcement Learning are implementations of Reinforcement Learning methods that use Deep Neural Networks to calculate the optimal policy.

96 Views

1y ago

1 Introduction to reinforcement learning - GitHub Pages

IEOR 8100: Reinforcement learning Lecture 1: Introduction By Shipra Agrawal 1 Introduction to reinforcement learning What is reinforcement learning? Reinforcement learning is characterized by an agent continuously interacting and learning from a stochastic environment. Imagine a robot movin

23 Views

2y ago

Lecture Slides 'Reinforcement Learning - uni-paderborn.de

applying reinforcement learning methods to the simulated experiences just as if they had really happened. Typically, as in Dyna-Q, the same reinforcement learning method is used both for learning from real experience and for planning from simulated experience. The reinforcement learning method is thus the ÒÞnal common pathÓ for both learning

13 Views

1y ago

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating ...

In this section, we present related work and background concepts such as reinforcement learning and multi-objective reinforcement learning. 2.1 Reinforcement Learning A reinforcement learning (Sutton and Barto, 1998) environment is typically formalized by means of a Markov decision process (MDP). An MDP can be described as follows. Let S fs 1 .

11 Views

1y ago

Multi-Agent Patrolling with Reinforcement Learning1

learning techniques, such as reinforcement learning, in an attempt to build a more general solution. In the next section, we review the theory of reinforcement learning, and the current efforts on its use in other cooperative multi-agent domains. 3. Reinforcement Learning Reinforcement learning is often characterized as the

10 Views

1y ago

MetaLight: Value-based Meta-reinforcement Learning for Traffic Signal ...

Meta-reinforcement learning. Meta reinforcement learn-ing aims to solve a new reinforcement learning task by lever-aging the experience learned from a set of similar tasks. Currently, meta-reinforcement learning can be categorized into two different groups. The ﬁrst group approaches (Duan et al. 2016; Wang et al. 2016; Mishra et al. 2018) use an

14 Views

1y ago

Reinforcement Learning for Optimal Control of Queueing Systems

Reinforcement learning methods provide a framework that enables the design of learning policies for general networks. There have been two main lines of work on reinforcement learning methods: model-free reinforcement learning (e.g. Q-learning [4], policy gradient [5]) and model-based reinforce-ment learning (e.g., UCRL [6], PSRL [7]). In this .

22 Views

1y ago

Detailing Aspects of the Reinforcement in Reinforced Concrete Structures

Using a retaining wall as a case-study, the performance of two commonly used alternative reinforcement layouts (of which one is wrong) are studied and compared. Reinforcement Layout 1 had the main reinforcement (from the wall) bent towards the heel in the base slab. For Reinforcement Layout 2, the reinforcement was bent towards the toe.

8 Views

1y ago

Recent Views

Grammar as a Foreign Language - List of Proceedings

Grammar as a Foreign Language Oriol Vinyals Google vinyals@google.com Lukasz Kaiser Google lukaszkaiser@google.com Terry Koo Google terrykoo@google.com Slav Petrov Google slav@google.com Ilya Sutskever Google ilyasu@google.com Geoffrey Hinton Google geoffhinton@google.com Abstract Synta

2y ago

445 Views

Attention is All you Need - NIPS

Google Brain avaswani@google.com Noam Shazeer Google Brain noam@google.com Niki Parmar Google Research nikip@google.com Jakob Uszkoreit Google Research usz@google.com Llion Jones Google Research llion@google.com Aidan N. Gomezy University of Toronto aidan@cs.toronto.edu Łukasz Kaiser Google Brain lukaszkaiser@google.com Illia Polosukhinz illia .

1y ago

303 Views

GSA Implementation of Google (G) Suite

Google Meet Classic Hangouts Google Chat Google Calendar Google Drive and Shared Drive Google Docs Google Sheets Google Slides Google Forms Google Sites Google Keep Apps Script D

2y ago

316 Views

Google Drive (Google Docs, Google Sheets, Google Slides)

Google Drive (Google Docs, Google Sheets, Google Slides) Employees are automatically issued a Kyrene Google account. Navigate to drive.google.com. Use Kyrene email address and network password to login. Launch in Chrome browser for best experience. Google Drive is a cloud storage sys

2y ago

388 Views

Quick Guide of Using Google Home to Control Smart Devices

Configuration needs Google Home app. Search "Google Home" in App Store or Google Play to install the app. 3.1 Set up Google Home with Google Home app You can skip this part if your Google Home is already set up. 1. Make sure your Google Home is energized. 2. Open the Google Home app by tapping the app icon on your mobile device. 3.

1y ago

326 Views

Elaboração de Provas Online usando o Formulário Google Docs

2 Após o login acesse o Google Drive ou o Google Docs e selecione a ferramenta Google Forms (Formulários). Clique na caixa de Ferramentas do Google, localizada no canto direito superior da tela e selecione o Google Drive. Na tela do Google Drive clique em New , opção More e selecione Google Forms. OBS: É possível acessar o google

10m ago

123 Views

ACS WASC Templates

File upload, Folder upload, Google Docs, Google Sheets, or Google Slides. You can also create Google Forms, Google Drawings, Google My Maps, etc. Share with exactly who you want — without email attachments. Search or sort your list of files, folders, and Google Docs. Preview files and Google Docs.

2y ago

366 Views

Share a Google Doc in Schoology - fcps.edu

After you have connected your Google Drive to Schoology (directions in a separate handout), another way to share a Doc with students is to use the Google Drive Resource App. To share a Google Doc using the Google Drive Resources App: 1. From the Add Materials drop down menu, select Import from Resources. 2. Select Apps. Then Google Drive .

1y ago

92 Views

Google Drive - San Bernardino City Unified School District

Google Apps All of the Google applications that are available upon logging into Google.com (G , Gmail, Gphotos, Gdrive, etc.). Google Suite Google’s online cloud based office companion applications (Docs, Sheets, Slides). Google Drive Google’s online cloud storage and file sharing/collaboration application.

2y ago

378 Views

Single Sign On for Google Apps with NetScaler Unified Gateway

Google Apps for Work is a suite of cloud computing productivity and collaboration applications provided by Google on a subscription basis. It includes Google’s popular web applications including Gmail, Google Drive, Google Hangouts, Google Calendar and Google

2y ago

295 Views

Serviceteil

Google 84, 87, 124 Google 110 Google AdWords 101, 103 Google Alerts 127 Google Analytics 89 Google Maps 100, 110, 173 Google-Maps 63 Google Places 100, 103, 124 Graphiken 66 H Haftung 170 Haftungsausschluss 72 Hausfarbe 11 Headline 35 Heilmittelwerbegesetz 14, 69, 163 Heilversprechen 164 HONcode 78 HTML 58 HWG 31 I Imagefilm 31

2y ago

336 Views

Best practices for managing identities when you move to Google Cloud

Google Cloud. To provide t he informat ion an organizat ion would ne e d to transfer data and ownership from one Google Account to anot her for s ome of t he noncore Google s er vice s, such as Google Ads, Google Analyt ics, or DV360. Intende d audience Organizat ion administrators. Sta planning Google Cloud / Google Wor kspace migrat ion. Key .

1y ago

481 Views

Google Analytics 101 - Content Jam

Google Analytics 101 201 301 Google Ads 101 201 Google Tag Manager 101 Google Data Studio 101 Google Optimize 101. Welcome Fun Facts: Share . Google Analytics 301 35 Web Property The web property ID is of the form UA-XXXXXX-YY. It's often called the "UA number" since it starts with

1y ago

107 Views

Introduction - Google Earth User Guide

Google Earth Community: Learn from other Google Earth users by asking questions and sharing answers on the Google Earth Community forums. Using Google Earth: This blog describes how you can use some of the interesting features of Google Earth. Selecting a Server Note: This section is relevant to Google Earth Pro and EC users.

3y ago

288 Views

Using Google Forms to Manage Officials Signups

Google Sheets, deleting a response from the form or sheet will not affect the other. Once the Google Form is linked to a Google Sheet, clicking on the spreadsheet icon will open the linked Google Sheet. Google Responses Sheet Google automatically creates and populates the sp

2y ago

276 Views

Applying Reinforcement Learning To Economic Problems

It looks like you're using an ad-blocker