
MARKOV DECISION PROCESSES

LODEWIJK KALLENBERG

UNIVERSITY OF LEIDEN

Preface

Branching out from operations research roots of the 1950s, Markov decision processes (MDPs) have gained recognition in such diverse fields as economics, telecommunication, engineering and ecology. These applications have been accompanied by many theoretical advances. Markov decision processes, also referred to as stochastic dynamic programming or stochastic control problems, are models for sequential decision making when outcomes are uncertain. The Markov decision process model consists of decision epochs, states, actions, transition probabilities and rewards. Choosing an action in a state generates a reward and determines the state at the next decision epoch through a transition probability function. Policies or strategies are prescriptions of which action to choose under any eventuality at every future decision epoch. Decision makers seek policies which are optimal in some sense.

These lecture notes aim to present a unified treatment of the theoretical and algorithmic aspects of Markov decision process models. They can serve as a text for an advanced undergraduate or graduate level course in operations research, econometrics or control engineering. As a prerequisite, the reader should have some background in linear algebra, real analysis, probability and linear programming. Throughout the text there are many examples. At the end of each chapter there is a section with bibliographic notes and a section with exercises. A solution manual is available on request (e-mail to kallenberg@math.leidenuniv.nl).

Chapter 1 introduces the Markov decision process model as a sequential decision model with actions, transitions, rewards and policies. We illustrate these concepts with nine different applications: red-black gambling, how-to-serve in tennis, optimal stopping, replacement problems, maintenance and repair, production control, optimal control of queues, stochastic scheduling, and the multi-armed bandit problem.

Chapter 2 deals with the finite horizon model with nonstationary transitions and rewards, and the principle of dynamic programming: backward induction. We present an equivalent stationary infinite horizon model. We also study under which conditions optimal policies are monotone, i.e. nondecreasing or nonincreasing in the ordering of the state space.

In chapter 3 the discounted rewards over an infinite horizon are studied. This results in the optimality equation and methods to solve this equation: policy iteration, linear programming, value iteration and modified value iteration. Furthermore, we study under which conditions monotone optimal policies exist.
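To make the value-iteration method mentioned in the description of chapter 3 concrete, here is a minimal Python sketch. It is not taken from the lecture notes: the small two-state MDP (states, actions, transition probabilities P, rewards r) and the discount factor alpha are hypothetical data chosen only for illustration, and the stopping rule is a simple tolerance rather than a formal epsilon-optimality bound.

# Minimal value-iteration sketch for a discounted MDP (hypothetical toy data).
# P[s][a] lists (next_state, probability) pairs; r[s][a] is the one-step reward.
P = {
    0: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
}
r = {
    0: {0: 5.0, 1: 10.0},
    1: {0: -1.0, 1: 2.0},
}
alpha = 0.9   # discount factor (hypothetical)
eps = 1e-8    # simple stopping tolerance, not a formal epsilon-optimality bound

v = {s: 0.0 for s in P}   # initial value estimates
while True:
    # Bellman update: best expected one-step reward plus discounted future value
    v_new = {
        s: max(r[s][a] + alpha * sum(p * v[t] for t, p in P[s][a]) for a in P[s])
        for s in P
    }
    if max(abs(v_new[s] - v[s]) for s in P) < eps:
        v = v_new
        break
    v = v_new

# A greedy (deterministic, stationary) policy with respect to the computed values
policy = {
    s: max(P[s], key=lambda a: r[s][a] + alpha * sum(p * v[t] for t, p in P[s][a]))
    for s in P
}
print(v, policy)

Running this prints an approximation of the optimal value for each state together with a greedy deterministic policy for the toy data.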

Chapter 4 discusses the total rewards over an infinite horizon under the assumption that the transition matrices are substochastic. We first present some background material on square matrices, eigenvalues and the spectral radius. Then, we introduce the linear program and its correspondence to policies. We derive equivalent statements for the properties that the model is a so-called contracting or normalized dynamic programming model. Next, we present the optimality equation and results on the computation of optimal transient policies. For contracting dynamic programming, results and algorithms can be formulated which are similar to the results and algorithms in the discounted reward model. Special sections are devoted to finite horizon and transient MDPs, to positive, negative and convergent MDPs, and to special models such as red-black gambling and the optimal stopping problem.

Chapter 5 discusses the criterion of average rewards over an infinite horizon, in the most general case. Firstly, polynomial algorithms are developed to classify MDPs as irreducible or communicating. The distinction between unichain and multichain turns out to be NP-complete, so there is no hope of a polynomial algorithm. Then, the stationary, the fundamental and the deviation matrices are introduced, and their internal relations and properties are derived. Next, an extension of a theorem by Blackwell and the Laurent series expansion are presented. These results are fundamental to analyze the relation between discounted, average and more sensitive optimality criteria. With these results, as in the discounted case but via a more complicated analysis, the optimality equation is derived and methods to solve this equation are presented (policy iteration, linear programming and value iteration).

In chapter 6 special cases of the average reward criterion (irreducible, unichain and communicating) are considered. In all these cases the optimality equation and the methods of policy iteration, linear programming and value iteration can be simplified. Furthermore, we present the method of modified value iteration for these special cases.
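For readers who want the key relation behind chapters 5 and 7 at a glance, the link between discounted and average rewards for a stationary policy f can be written as follows (in LaTeX, using the common notation P*(f) for the stationary matrix and D(f) for the deviation matrix; the notation in the notes may differ in detail):

% Gain/bias relation for a stationary policy f (common notation, not copied from the notes)
\[
  v^{\alpha}(f) \;=\; \bigl(I - \alpha P(f)\bigr)^{-1} r(f)
  \;=\; \frac{\phi(f)}{1-\alpha} \;+\; u(f) \;+\; o(1)
  \qquad \text{as } \alpha \uparrow 1,
\]
\[
  \text{where } \phi(f) = P^{*}(f)\, r(f) \text{ is the average reward (gain)}
  \quad\text{and}\quad
  u(f) = D(f)\, r(f) \text{ is the bias.}
\]

Expanding further in powers of the interest rate \rho = (1-\alpha)/\alpha yields the full Laurent series, whose coefficients underlie the n-discount and Blackwell optimality criteria treated in chapter 7.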
Chapter 7 introduces more sensitive optimality criteria: bias optimality, n-discount and n-average optimality, and Blackwell optimality. The criteria of n-discount and n-average optimality are equivalent. We present a unifying framework, based on the Laurent series expansion, to derive sensitive discount optimality equations. Using a lexicographic ordering of the Laurent series, we derive the policy iteration method for n-discount optimality. In the irreducible case, one can derive a sequence of nested linear programs to compute n-discount optimal policies for any n. Also for Blackwell optimality, even in the most general case, linear programming can be applied. However, then the elements are not real numbers, but lie in a much more general ordered field, namely an ordered field of rational functions. For bias optimality, an optimal policy can be found with a three-step linear programming approach. When in addition the model is a unichain MDP, the linear programs for bias optimality can be simplified. In this unichain case, we also derive a simple policy iteration method and turnpike results. The last sections of this chapter deal with some special optimality criteria. We consider overtaking, average overtaking and cumulative overtaking optimality. The next section deals with a weighted combination of the total discounted rewards and the long-run average rewards. For this criterion an optimal policy might not exist, even when we allow nonstationary randomized policies. We present an iterative algorithm for computing an ε-optimal nonstationary policy with a simple structure. Finally, we study an optimality criterion which is the sum of expected total discounted rewards with different one-step rewards and discount factors. It turns out that for this criterion an optimal deterministic policy exists which is nonstationary in an initial period and stationary thereafter. We present an algorithm to compute such a policy.

In chapter 8, six of the applications introduced in chapter 1 (replacement problems, maintenance and repair, production and inventory control, optimal control of queues, stochastic scheduling and multi-armed bandit problems) are analyzed in much more detail. In most cases theoretical and computational (algorithmic) results are presented. It turns out that in many cases polynomial algorithms exist, e.g. of order O(N^3), where N is the number of states. Finally, we present separable MDP problems.

Chapter 9 deals with some other topics. We start with complexity results (e.g. MDPs are P-complete, deterministic MDPs are in NC), additional constraints (for discounted and average rewards, and for MDPs with a sum of discounted rewards and different discount factors) and multiple objectives (both for discounted as well as for average MDPs). Then, the linear program approach for average rewards is revisited. Next, we consider mean-variance tradeoffs, followed by deterministic MDPs (models in which each action determines the next state with probability 1). In the last section of this chapter semi-Markov decision problems are analyzed.

The subject of the last chapter (chapter 10) is stochastic games, particularly the two-person zero-sum stochastic game. There, both players choose actions from their own action sets, and transitions and rewards are determined by both players. Zero-sum means that the reward for player 1 has to be paid by player 2. Hence, there is a conflicting situation: player 1 wants to maximize the rewards, while player 2 tries to minimize them. We discuss the value of the game and the concept of optimal policies for discounted, total as well as average rewards. We also derive mathematical programming formulations and iterative methods. In some special cases we can present finite solution methods to find the value and optimal policies. In the last section before the bibliographic notes and the exercises we discuss two-person general-sum stochastic games, in which each player has his own reward function and tries to maximize his own payoff.

These lecture notes use a lot of material, collected over the years from various sources. The bibliographic notes refer to many books, papers and reports. I close this preface by expressing my gratitude to Arie Hordijk, who introduced me to the topic of MDPs. Furthermore, he was my supervisor and, after my PhD, a colleague for many years.

Lodewijk Kallenberg
Leiden, October 2016


Contents

1 Introduction
  1.1 The MDP model
  1.2 Policies and optimality criteria
    1.2.1 Policies
    1.2.2 Optimality criteria
  1.3 Examples
    1.3.1 Red-black gambling
    1.3.2 Gaming: How to serve in tennis
    1.3.3 Optimal stopping
    1.3.4 Replacement problems
    1.3.5 Maintenance and repair
    1.3.6 Production control
    1.3.7 Optimal control of queues
    1.3.8 Stochastic scheduling
    1.3.9 Multi-armed bandit problem
  1.4 Bibliographic notes
  1.5 Exercises

2 Finite Horizon
  2.1 Introduction
  2.2 Backward induction
  2.3 An equivalent stationary infinite horizon model
  2.4 Monotone optimal policies
  2.5 Bibliographic notes
  2.6 Exercises

3 Discounted rewards
  3.1 Introduction
  3.2 Monotone contraction mappings
  3.3 The optimality equation
  3.4 Policy iteration
  3.5 Linear programming
  3.6 Value iteration
  3.7 Value iteration and bisection
  3.8 Modified policy iteration
  3.9 Monotone optimal policies
  3.10 Bibliographic notes
  3.11 Exercises

4 Total reward
  4.1 Introduction
  4.2 Square matrices, eigenvalues and spectral radius
  4.3 The linear program
  4.4 Transient, contracting, excessive and normalized MDPs
  4.5 The optimality equation
  4.6 Optimal transient policies
  4.7 The contracting model
  4.8 Finite horizon and transient MDPs
  4.9 Positive MDPs
  4.10 Negative MDPs
  4.11 Convergent MDPs
  4.12 Special models
    4.12.1 Red-black gambling
    4.12.2 Optimal stopping
  4.13 Bibliographic notes
  4.14 Exercises

5 Average reward - general case
  5.1 Introduction
  5.2 Classification of MDPs
    5.2.1 Definitions
    5.2.2 Classification of Markov chains
    5.2.3 Classification of Markov decision chains
  5.3 Stationary, fundamental and deviation matrix
    5.3.1 The stationary matrix
    5.3.2 The fundamental matrix and the deviation matrix
  5.4 Extension of Blackwell's theorem
  5.5 The Laurent series expansion
  5.6 The optimality equation
  5.7 Policy iteration
  5.8 Linear programming
  5.9 Value iteration
  5.10 Bibliographic notes
  5.11 Exercises

6 Average reward - special cases
  6.1 The irreducible case
    6.1.1 Optimality equation
    6.1.2 Policy iteration
    6.1.3 Linear programming
    6.1.4 Value iteration
    6.1.5 Modified policy iteration
  6.2 The unichain case
    6.2.1 Optimality equation
    6.2.2 Policy iteration
    6.2.3 Linear programming
    6.2.4 Value iteration
    6.2.5 Modified policy iteration
  6.3 The communicating case
    6.3.1 Optimality equation
    6.3.2 Policy iteration
    6.3.3 Linear programming
    6.3.4 Value iteration
    6.3.5 Modified value iteration
  6.4 Bibliographic notes
  6.5 Exercises

7 More sensitive optimality criteria
  7.1 Introduction
  7.2 Equivalence between n-discount and n-average optimality
  7.3 Stationary optimal policies and optimality equations
  7.4 Lexicographic ordering of Laurent series
  7.5 Policy iteration for n-discount optimality
  7.6 Linear programming and n-discount optimality (irreducible case)
    7.6.1 Average optimality
    7.6.2 Bias optimality
    7.6.3 n-discount optimality
  7.7 Blackwell optimality and linear programming
  7.8 Bias optimality and policy iteration (unichain case)
  7.9 Bias optimality and linear programming
    7.9.1 The general case
    7.9.2 The unichain case
  7.10 Turnpike results and bias optimality (unichain case)
  7.11 Overtaking, average overtaking and cumulative overtaking optimality
  7.12 A weighted combination of discounted and average rewards
  7.13 A sum of discount factors
  7.14 Bibliographic notes
  7.15 Exercises

8 Special models
  8.1 Replacement problems
    8.1.1 A general replacement model
    8.1.2 A replacement model with increasing deterioration
    8.1.3 Skip to the right model with failure
    8.1.4 A separable replacement problem
  8.2 Maintenance and repair problems
    8.2.1 A surveillance-maintenance-replacement model
    8.2.2 Optimal repair allocation in a series system
    8.2.3 Maintenance of systems composed of highly reliable components
  8.3 Production and inventory control
    8.3.1 No backlogging
    8.3.2 Backlogging
    8.3.3 Inventory control and single-critical-number policies
    8.3.4 Inventory control and (s, S)-policies
  8.4 Optimal control of queues
    8.4.1 The single-server queue
    8.4.2 Parallel queues
  8.5 Stochastic scheduling
    8.5.1 Maximizing finite-time returns on a single processor
    8.5.2 Optimality of the µc-rule
    8.5.3 Optimality of threshold policies
    8.5.4 Optimality of join-the-shortest-queue policies
    8.5.5 Optimality of LEPT and SEPT policies
    8.5.6 Maximizing finite-time returns on two processors
    8.5.7 Tandem queues
  8.6 Multi-armed bandit problems
    8.6.1 Introduction
    8.6.2 A single project with a terminal reward
    8.6.3 Multi-armed bandits
    8.6.4 Methods for the computation of the Gittins indices
  8.7 Separable problems
    8.7.1 Introduction
    8.7.2 Examples (part 1)
    8.7.3 Discounted rewards
    8.7.4 Average rewards - unichain case
    8.7.5 Average rewards - general case
    8.7.6 Examples (part 2)
  8.8 Bibliographic notes
  8.9 Exercises

9 Other topics
  9.1 Complexity results
    9.1.1 Complexity theory
    9.1.2 MDPs are P-complete
    9.1.3 DMDPs are in NC
    9.1.4 For discounted MDPs, the policy iteration and linear programming methods are strongly polynomial
    9.1.5 Value iteration for discounted MDPs
  9.2 Additional constraints
    9.2.1 Introduction
    9.2.2 Infinite horizon and discounted rewards
    9.2.3 Infinite horizon and total rewards
    9.2.4 Infinite horizon and total rewards for transient MDPs
    9.2.5 Finite horizon
    9.2.6 Infinite horizon and average rewards
    9.2.7 Constrained MDPs with sum of discounted rewards and different discount factors
    9.2.8 Constrained discounted MDPs with two discount factors
  9.3 Multiple objectives
    9.3.1 Multi-objective linear programming
    9.3.2 Discounted rewards
    9.3.3 Average rewards
  9.4 The linear program approach for average rewards revisited
  9.5 Mean-variance tradeoffs
    9.5.1 Formulations of the problem
    9.5.2 A unifying framework
    9.5.3 Determination of an optimal solution
    9.5.4 Determination of an optimal policy
    9.5.5 The unichain case
    9.5.6 Finite horizon variance-penalized MDPs
  9.6 Deterministic MDPs
    9.6.1 Introduction
    9.6.2 Average costs
    9.6.3 Discounted costs
  9.7 Semi-Markov decision processes
    9.7.1 Introduction
    9.7.2 Model formulation
    9.7.3 Examples
    9.7.4 Discounted rewards
    9.7.5 Average rewards - general case
    9.7.6 Average rewards - special cases
    9.7.7 Continuous-time Markov decision processes
  9.8 Bibliographic notes
  9.9 Exercises

10 Stochastic Games
  10.1 Introduction
    10.1.1 The model
    10.1.2 Optimality criteria
    10.1.3 Matrix games
    10.1.4 Bimatrix games
  10.2 Discounted rewards
    10.2.1 Value and optimal policies
    10.2.2 Mathematical programming
    10.2.3 Iterative methods
    10.2.4 Finite methods
  10.3 Total rewards
    10.3.1 Value and optimal policies
    10.3.2 Mathematical programming
    10.3.3 Single-controller stochastic game: the transient case
    10.3.4 Single-controller stochastic game: the general case
  10.4 Average rewards
    10.4.1 Value and optimal policies
    10.4.2 The Big Match
    10.4.3 Mathematical programming
    10.4.4 Perfect information and irreducible games
    10.4.5 Finite methods
  10.5 Two-person general-sum stochastic game
    10.5.1 Introduction
    10.5.2 Discounted rewards
    10.5.3 Single-controller stochastic games
  10.6 Bibliographic notes
  10.7 Exercises

Chapter 1

Introduction

1.1 The MDP model
1.2 Policies and optimality criteria
  1.2.1 Policies
  1.2.2 Optimality criteria
1.3 Examples
  1.3.1 Red-black gambling
  1.3.2 Gaming: How to serve in tennis
  1.3.3 Optimal stopping
  1.3.4 Replacement problems
  1.3.5 Maintenance and repair
  1.3.6 Production control
  1.3.7 Optimal control of queues
  1.3.8 Stochastic scheduling
  1.3.9 Multi-armed bandit problems
1.4 Bibliographic notes
1.5 Exercises

1.1 The MDP model

An MDP is a model for sequential decision making under uncertainty, taking into account both the short-term outcomes of current decisions and opportunities for making decisions in the future. While the notion of an MDP may appear quite simple, it encompasses a
