Probabilistic Goal Markov Decision Processes


Huan Xu
Department of Mechanical Engineering
National University of Singapore, Singapore
mpexuh@nus.edu.sg

Shie Mannor
Department of Electrical Engineering
Technion, Israel
shie@ee.technion.ac.il

(H. Xu is supported by an NUS startup grant R-265-000-384-133. S. Mannor is supported by the Israel Science Foundation under contract 890015.)

Abstract

The Markov decision process model is a powerful tool in planning tasks and sequential decision-making problems. The randomness of state transitions and rewards implies that the performance of a policy is often stochastic. In contrast to the standard approach that studies the expected performance, we consider the policy that maximizes the probability of achieving a pre-determined target performance, a criterion we term probabilistic goal Markov decision processes. We show that this problem is NP-hard, but can be solved using a pseudo-polynomial algorithm. We further consider a variant dubbed "chance-constrained Markov decision problems," which treats the probability of achieving the target performance as a constraint instead of the maximizing objective. This variant is also NP-hard, but can be solved in pseudo-polynomial time.

1 Introduction

The Markov Decision Process (MDP) model is a powerful tool in planning tasks and sequential decision-making problems [Puterman, 1994; Bertsekas, 1995]. In MDPs, the system dynamics are captured by transitions between a finite number of states. In each decision stage, a decision maker picks an action from a finite action set, then the system evolves to a new state accordingly, and the decision maker obtains a reward. Both the state transition and the immediate reward are often random in nature, and hence the cumulative reward $X$ may be inherently stochastic. The classical approach deals with the maximization of the expected value of $X$, which implicitly assumes that the decision maker is risk-neutral.

In this paper we propose and study an alternative optimization criterion: fix a target level $V \in \mathbb{R}$; the decision goal is to find a policy $\pi$ that maximizes the probability of achieving the target, i.e., to maximize $\Pr(X_\pi \geq V)$. The criterion of maximizing the probability of achieving a goal, which we call the probabilistic goal, is an intuitively appealing objective, and is justified by axiomatic decision theory [Castagnoli and LiCalzi, 1996]. Moreover, empirical research has also concluded that in daily decision making, people tend to regard risk primarily as failure to achieve a predetermined goal [Lanzillotti, 1958; Simon, 1959; Mao, 1970; Payne et al., 1980; Payne et al., 1981].

Different variants of the probabilistic goal formulation, such as chance constrained programs (instead of maximizing the probability of achieving a target, a chance constrained program requires this probability to be no less than a fixed threshold), have been extensively studied in single-period optimization [Miller and Wagner, 1965; Prékopa, 1970]. However, little has been done in the context of sequential decision problems, including MDPs. The standard approaches in risk-averse MDPs include maximization of an expected utility function [Bertsekas, 1995], and optimization of a coherent risk measure [Riedel, 2004; Le Tallec, 2007]. Both approaches lead to formulations that cannot be solved in polynomial time, except for special cases including the exponential utility function [Chung and Sobel, 1987], piecewise linear utility functions with a single breakdown point [Liu and Koenig, 2005], and risk measures that can be reduced to robust MDPs satisfying the so-called "rectangular condition" [?; Iyengar, 2005].

Two notable exceptions that explicitly investigate the probabilistic goal or chance constrained formulation in the context of MDPs are [Filar et al., 1995] and [Delage and Mannor, 2010]. The first reference considers the average-reward case of MDPs, and reduces the probability of achieving a target to the probability of entering irreducible chains. The analysis is thus very specific and cannot be extended to the finite horizon case or the discounted reward infinite horizon case. The second reference investigated, instead of the internal randomness, the parametric uncertainty of the cumulative reward. That is, the parameters of the MDP, namely the transition probability and the reward, are not precisely known. While the decision maker is risk-neutral to internal randomness (performance fluctuations due to the stochasticity of the state transitions, the immediate rewards, and possible randomness of the actions), he/she is risk-averse to the performance deviation due to the parameter uncertainty. Conceptually, in this setup one can think of having many identical machines all with the same uncertain parameter, and the goal is to maximize the probability that the average reward (w.r.t. all machines) achieves a certain target. Because of this apparent difference in modeling, the techniques used in [Delage and Mannor, 2010] do not extend to the probabilistic goal with respect to the internal randomness, the setup that we consider.

To the best of our knowledge, the probabilistic goal or chance constrained formulation has not been investigated for finite-horizon or infinite-horizon discounted reward MDPs. Our contributions include the following:

1. We compare different policy sets: we show that for a probabilistic goal MDP, an optimal policy may depend on the accumulated reward, and randomization does not improve the performance.

2. We show that the probabilistic goal MDP is NP-hard. Thus, there is little hope that such a problem can be solved in polynomial time in general.

3. We propose a pseudo-polynomial algorithm based on state augmentation that solves the probabilistic goal MDP.

4. We investigate chance constrained MDPs and show that they can also be solved in pseudo-polynomial time.

Before concluding this section, let us briefly discuss some practical examples that motivate this study. Consider the following scenario that many of us have experienced. Suppose one wants to drive to the airport to catch a plane which is soon to take off. He/she needs to decide which route to take, while the time he/she will spend on each link is random. This can be modeled as an MDP with a deterministic transition probability and random reward. It is clear that the expected time the decision maker will spend en route is less critical than whether he/she will catch the plane, a natural fit for the probabilistic goal MDP. Other examples can be found in finance, where synthetic derivatives that are triggered by discrete events are becoming common, and hence minimizing or maximizing the probability of such an event is relevant and important; and in airplane design, where one seeks a reconfiguration policy that maximizes the chance of not having a critical failure.

2 Setup

In this section we present the problem setup and some necessary notation. A (finite) MDP is a sextuple $\langle T, S, A, R, p, g \rangle$ where:

- $T$ is the time horizon, assumed to be finite. (While we focus in this paper on the finite horizon case, most results easily extend to the infinite horizon discounted reward case, because the contribution of the tail of the time horizon can be made arbitrarily small by increasing the time horizon.)

- $S$ is a finite set of states, with $s_0 \in S$ the initial state.

- $A$ is a collection of finite action sets, one set for each state. For $s \in S$, we denote its action set by $A_s$.

- $R$ is a finite subset of $\mathbb{R}$, and is the set of possible values of the immediate rewards. We let $K = \max_{r \in R} |r|$.

- $p$ is the transition probability. That is, $p_t(s' \mid s, a)$ is the probability that the state at time $t+1$ is $s'$, given that the state at time $t$ is $s$ and the action chosen at time $t$ is $a$.

- $g$ is a set of reward distributions. In particular, $g_t(r \mid s, a)$ is the probability that the immediate reward at time $t$ is $r$, if the state is $s$ and the action is $a$.

As standard, we use the symbol $\pi$ to denote a policy of an MDP. A history-dependent, deterministic policy is a mapping from the history $H_t = (s_{0:t}, a_{0:t}, r_{0:t})$ of the process to an action $a \in A_{s_t}$. We use $\Pi_h$ to represent the set of history-dependent deterministic policies. The set of deterministic policies that depend only on the current state (and the time horizon), which are often called Markovian policies, is denoted by $\Pi_{t,s}$. As we show in the sequel, it is sometimes beneficial to incorporate the accumulated reward in the decision. The set of policies that depend on the time horizon, the current state, and the accumulated reward up to now, which we call "pseudo-Markovian" policies, is denoted by $\Pi_{t,s,x}$.

Besides deterministic policies, we also consider randomized policies. That is, assume there is available a sequence of i.i.d. random variables $U_1, \dots, U_T$, independent of everything else; a history-dependent, randomized policy is a mapping from $(H_t, U_t)$ to an action. We denote the set of such policies by $\Pi_{h,u}$. The sets of Markovian and pseudo-Markovian randomized policies are defined similarly, and denoted respectively by $\Pi_{t,s,u}$ and $\Pi_{t,s,x,u}$.

Note that given the distribution of the reward parameter, the total reward of the MDP under a policy $\pi$ is a well-defined random variable, denoted by $X_\pi$. We are interested in the following problems. As standard in the study of computational complexity, the first problem we consider is a "yes/no" decision problem.

Problem 1 (Decision Problem). Given $V \in \mathbb{R}$ and $\alpha \in (0, 1)$: is there a $\pi \in \Pi_{h,u}$ such that
$$\Pr(X_\pi \geq V) \geq \alpha\,?$$
We call this problem $D(\Pi_{h,u})$. Similarly, we define $D(\Pi_{t,s,u})$, $D(\Pi_{t,s,x,u})$, and their deterministic counterparts.

A related problem of more practical interest is the optimization one.

Problem 2 (Probabilistic goal MDP). Given $V \in \mathbb{R}$, find $\pi \in \Pi_{h,u}$ that maximizes
$$\Pr(X_\pi \geq V).$$
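To make the notation above concrete, the following sketch (ours, not the authors') shows one plausible way to represent the sextuple $\langle T, S, A, R, p, g \rangle$ in code and to estimate $\Pr(X_\pi \geq V)$ for a given pseudo-Markovian policy by simulation. The names FiniteMDP, sample and estimate_goal_probability are illustrative assumptions, and the Monte-Carlo estimate only approximates the quantity appearing in Problems 1 and 2.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

State, Action = int, int

@dataclass
class FiniteMDP:
    """Finite-horizon MDP <T, S, A, R, p, g> (hypothetical representation)."""
    T: int                                                   # horizon
    states: List[State]                                      # S, with states[0] = s0
    actions: Dict[State, List[Action]]                       # A_s for each state s
    p: Callable[[int, State, Action], Dict[State, float]]    # p_t(s' | s, a)
    g: Callable[[int, State, Action], Dict[float, float]]    # g_t(r | s, a)

def sample(dist: Dict) -> object:
    """Draw one outcome from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for value, prob in dist.items():
        acc += prob
        if r <= acc:
            return value
    return value  # guard against floating-point round-off

def estimate_goal_probability(mdp: FiniteMDP,
                              policy: Callable[[int, State, float], Action],
                              V: float, n_runs: int = 10_000) -> float:
    """Monte-Carlo estimate of Pr(X_pi >= V) for a pseudo-Markovian policy
    policy(t, s, accumulated_reward) -> action."""
    hits = 0
    for _ in range(n_runs):
        s, x = mdp.states[0], 0.0
        for t in range(mdp.T):
            a = policy(t, s, x)
            x += sample(mdp.g(t, s, a))      # random immediate reward
            s = sample(mdp.p(t, s, a))       # random next state
        hits += (x >= V)
    return hits / n_runs
```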

3 Comparison of policy sets

In this section we compare different policy sets. We say a policy set $\Pi$ is "inferior" to another $\Pi'$ if $D(\Pi)$ being true implies that $D(\Pi')$ is true, and there exists an instance where the reverse does not hold. We say two policy sets $\Pi$ and $\Pi'$ are "equivalent" if $D(\Pi)$ is true if and only if $D(\Pi')$ is true.

We first show that randomization does not help: $\Pi_h$ is equivalent to $\Pi_{h,u}$, $\Pi_{t,s,x}$ is equivalent to $\Pi_{t,s,x,u}$, and similarly $\Pi_{t,s}$ is equivalent to $\Pi_{t,s,u}$. This essentially means that for the probabilistic goal MDP, it suffices to focus on $\Pi_h$, $\Pi_{t,s,x}$ and $\Pi_{t,s}$. Note that it suffices to show the following theorem, since the reverse direction holds trivially.

Theorem 1. Given an MDP, $\alpha \in [0, 1]$, and $V$, if there exists $\pi_u \in \Pi_{h,u}$ (respectively $\Pi_{t,s,x,u}$ and $\Pi_{t,s,u}$) such that
$$\Pr(X_{\pi_u} \geq V) \geq \alpha,$$
then there exists $\pi \in \Pi_h$ (respectively $\Pi_{t,s,x}$ and $\Pi_{t,s}$) such that
$$\Pr(X_\pi \geq V) \geq \alpha.$$

Proof. Since $T < \infty$, the set $\Pi_h$ is finite, i.e., there is a finite number of deterministic history-dependent policies. For succinctness of presentation, we write $\pi_u \equiv \pi$, where $\pi_u \in \Pi_{h,u}$ and $\pi \in \Pi_h$, to denote the event (on the probability space of $U$) that $\pi_u(H_t, U_{0:t}) = \pi(H_t)$ for all $t$ and $H_t$. Fix a randomized policy $\pi_u \in \Pi_{h,u}$; by the theorem of total probability we have
$$\Pr(X_{\pi_u} \geq V) = \sum_{\pi \in \Pi_h} \Pr(X_{\pi_u} \geq V \mid \pi_u \equiv \pi)\,\Pr(\pi_u \equiv \pi) = \sum_{\pi \in \Pi_h} \Pr(X_\pi \geq V)\,\Pr(\pi_u \equiv \pi).$$
Noting that $\sum_{\pi \in \Pi_h} \Pr(\pi_u \equiv \pi) = 1$, we have that
$$\max_{\pi \in \Pi_h} \Pr(X_\pi \geq V) \geq \Pr(X_{\pi_u} \geq V),$$
which establishes the theorem for the history-dependent case. The other two cases are essentially the same and hence omitted.

It is straightforward to generalize the finite horizon case to the discounted infinite horizon case, due to the fact that the contribution of the tail of the time horizon can be made arbitrarily small. (The proof is by contradiction: if there is a difference in performance in the limit, it must appear in finite time as well.)

We next show that in general, $\Pi_{t,s}$ is inferior to $\Pi_{t,s,x}$. This essentially means that including the information of the accumulated reward can improve the performance of a policy for the probabilistic goal MDP.

Example 1. Consider the MDP in Figure 1: $S = \{s_0, s_1, t\}$, where $t$ is the terminal state. The initial state $s_0$ has one action, which leads to state $s_1$; the immediate reward is $1$ with probability $0.5$, and $-1$ otherwise. State $s_1$ has two actions $a$ and $b$, both leading to state $t$. The immediate reward of $a$ is always $0$, and that of $b$ is $1$ with probability $0.5$, and $-2$ otherwise. Observe that for either fixed action $a$ or $b$, $\Pr(X \geq 0) = 0.5$. In contrast, consider the policy $\pi$ that, at state $s_1$, takes action $a$ if the accumulated reward is $1$, and otherwise takes action $b$. Observe that $\Pr(X_\pi \geq 0) = 0.75$, which shows that $\Pi_{t,s}$ is inferior to $\Pi_{t,s,x}$.

[Figure 1: Illustration of Example 1]

Finally, we remark that adding extra information beyond the cumulative reward does not improve the performance for the probabilistic goal MDP, as we will show in Section 5. The policy classes $\Pi_{t,s,x}$ and $\Pi_h$ are therefore equivalent.
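The probabilities in Example 1 can be checked by enumerating the four equally likely reward realizations; the short script below is a minimal verification sketch under our own naming, not part of the original paper.

```python
from itertools import product

# Exact check of Example 1: step 1 gives reward +1 or -1, each w.p. 0.5;
# at s1, action "a" gives 0, action "b" gives +1 or -2, each w.p. 0.5.
# The target is V = 0.

def success_prob(policy):
    """policy(first_reward) -> 'a' or 'b'; returns Pr(X_pi >= 0)."""
    prob = 0.0
    for r1, r2b in product([+1, -1], [+1, -2]):       # four equally likely scenarios
        action = policy(r1)
        total = r1 + (0 if action == 'a' else r2b)
        prob += 0.25 * (total >= 0)
    return prob

print(success_prob(lambda r1: 'a'))                      # fixed action a    -> 0.5
print(success_prob(lambda r1: 'b'))                      # fixed action b    -> 0.5
print(success_prob(lambda r1: 'a' if r1 == 1 else 'b'))  # reward-dependent  -> 0.75
```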
4 Complexity

In this section we show that in general, solving the probabilistic goal MDP is a computationally hard problem.

Theorem 2. The problem $D(\Pi)$ is NP-hard, for $\Pi$ equal to $\Pi_h$, $\Pi_{h,u}$, $\Pi_{t,s}$, $\Pi_{t,s,u}$, $\Pi_{t,s,x}$ or $\Pi_{t,s,x,u}$.

Proof. We start by proving the result for $\Pi_{t,s}$, by showing that we can reduce a well-known NP-complete problem, the Knapsack Problem (KSP), to a probabilistic goal MDP, and hence establish its NP-hardness. Recall that a KSP is the following problem. There are $n$ items, $1$ through $n$; each item $i$ has a value $v_i$ and a weight $w_i$, which we can assume are all non-negative integers. The KSP is to decide, given two positive numbers $W$ and $V$, whether there exists $I \subseteq [1:n]$ such that
$$\sum_{i \in I} w_i \leq W; \qquad \sum_{i \in I} v_i \geq V.$$

Given $n$, $w_i$, $v_i$, $W$ and $V$, we construct the following MDP (see Figure 2). Let $T = n + 2$. There are $n + 2$ states, $S = \{s_1, \cdots, s_n, s_{bad}, t\}$, where $s_1$ is the initial state and $t$ is the terminal state. Each state $s_i$ has two actions: if action $a$ is taken, then the immediate reward is $0$ and the next state is $s_{i+1}$; if action $b$ is taken, then the immediate reward is $v_i$, and furthermore, with probability $1/2^{w_i}$ the next state is $s_{i+1}$ (respectively the terminal state $t$ if $i = n$), and otherwise the next state is $s_{bad}$. The state $s_{bad}$ has only one action, which incurs an immediate reward $-L$, where $L \triangleq 2\sum_{i=1}^{n} v_i$, and the next state is $t$.

[Figure 2: Example of NP-hard probabilistic goal MDP]

Now consider the decision problem $D(\Pi_{t,s})$ with $\alpha = 1/2^W$, that is, to answer whether there exists $\pi \in \Pi_{t,s}$ such that
$$\Pr(X_\pi \geq V) \geq \alpha.$$
We now show that the answer to $D(\Pi_{t,s})$ is positive if and only if the answer to KSP is positive.

Suppose the answer to $D(\Pi_{t,s})$ is positive, i.e., there exists a policy $\pi \in \Pi_{t,s}$ such that $\Pr(X_\pi \geq V) \geq \alpha$. Define the set $I'$ as all $i \in [1:n]$ such that $\pi$ takes $b$ in $s_i$. Observe that this implies
$$\sum_{i \in I'} v_i \geq V.$$
Furthermore, due to the extremely large negative reward in $s_{bad}$,
$$\Pr(X_\pi \geq V) \leq \Pr(s_{bad} \text{ is never reached}) = \prod_{i \in I'} \frac{1}{2^{w_i}} = \frac{1}{2^{\sum_{i \in I'} w_i}},$$
which implies that
$$\frac{1}{2^{\sum_{i \in I'} w_i}} \geq \frac{1}{2^W} \;\Longrightarrow\; \sum_{i \in I'} w_i \leq W.$$
Thus, the answer to KSP is also positive.

Now suppose the answer to KSP is positive, i.e., there exists $I \subseteq [1:n]$ such that
$$\sum_{i \in I} w_i \leq W; \qquad \sum_{i \in I} v_i \geq V.$$
Consider the following policy $\pi'$ of the MDP: take action $b$ for all $s_i$ with $i \in I$, and $a$ otherwise. We have
$$\Pr(s_{bad} \text{ is never reached}) = \prod_{i \in I} \frac{1}{2^{w_i}} = \frac{1}{2^{\sum_{i \in I} w_i}} \geq \frac{1}{2^W} = \alpha.$$
Furthermore, when $s_{bad}$ is never reached, the cumulative reward is $\sum_{i \in I} v_i \geq V$. Thus, we have that
$$\Pr(X_{\pi'} \geq V) \geq \alpha,$$
i.e., the answer to $D(\Pi_{t,s})$ is also positive.

Thus, determining the answer to KSP is reduced to answering $D(\Pi_{t,s})$, the probabilistic goal MDP. Since the former is NP-complete, we conclude that the probabilistic goal MDP is NP-hard.

Notice that in this example, $\Pi_{t,s} \subseteq \Pi_{t,s,x} \subseteq \Pi_h$. Thus, NP-hardness of the decision problems with respect to these policy sets is established as well. Furthermore, Theorem 1 implies that $D(\Pi_{h,u})$, $D(\Pi_{t,s,u})$, $D(\Pi_{t,s,x,u})$ are also NP-hard.
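To illustrate the reduction in the proof of Theorem 2, the following toy sketch (our own naming, and assuming $V > 0$ so that reaching $s_{bad}$ always misses the target) computes the success probability of the policy "play $b$ exactly on the index set $I$" and checks, by brute force on a small instance, that $D(\Pi_{t,s})$ with $\alpha = 2^{-W}$ and the knapsack feasibility question give the same answer.

```python
from itertools import combinations

# Toy illustration of the Theorem 2 reduction: in the constructed MDP a
# Markovian policy is just the subset I of states where action b is taken;
#   Pr(s_bad never reached) = prod_{i in I} 2**-w[i],
# and on that event the cumulative reward is sum_{i in I} v[i].

def policy_success_prob(v, w, I, V):
    """Pr(X_pi >= V) for the policy playing b exactly on I (assumes V > 0)."""
    reach_ok = 2.0 ** -sum(w[i] for i in I)   # probability s_bad is never reached
    reward_ok = sum(v[i] for i in I) >= V     # cumulative reward on that event
    return reach_ok if reward_ok else 0.0     # hitting s_bad makes the target unreachable

def decision_problem(v, w, W, V):
    """Brute-force D(Pi_{t,s}) with alpha = 2**-W (exponential; illustration only)."""
    alpha = 2.0 ** -W
    n = len(v)
    return any(policy_success_prob(v, w, I, V) >= alpha
               for k in range(n + 1) for I in combinations(range(n), k))

def knapsack_feasible(v, w, W, V):
    n = len(v)
    return any(sum(w[i] for i in I) <= W and sum(v[i] for i in I) >= V
               for k in range(n + 1) for I in combinations(range(n), k))

# The two answers coincide, as the proof shows:
v, w, W, V = [4, 3, 5], [2, 3, 4], 5, 7
assert decision_problem(v, w, W, V) == knapsack_feasible(v, w, W, V) == True
```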

5 Pseudo-polynomial algorithms

In the previous section we showed that there is little hope that the probabilistic goal MDP can be solved in polynomial time. In this section we develop a pseudo-polynomial algorithm to handle this question. Recall that the running time of a pseudo-polynomial algorithm is polynomial in the number of parameters and in the size of the parameters, as opposed to polynomial algorithms, whose running time is polynomial in the number of parameters and polylogarithmic in the size of the parameters.

5.1 Integer Reward

The main technique of our pseudo-polynomial algorithm is state augmentation. That is, we construct a new MDP in which the state space is the Cartesian product of the original state space and the set of possible accumulated rewards. To better understand the algorithm, we first consider in this subsection a special case where the immediate reward is integer valued, i.e., $R \subseteq \mathbb{Z}$. We show that the probabilistic goal MDP can then be solved in time polynomial in $|S|$, $|A|$ and $K$.

Theorem 3. Suppose that $R \subseteq \mathbb{Z}$. In computational time polynomial in $T$, $|S|$, $|A|$ and $K$, one can solve $D(\Pi_{h,u})$, i.e., determine the correctness of the following claim:
$$\exists\, \pi \in \Pi_{h,u} : \Pr(X_\pi \geq V) \geq \alpha.$$
Similarly, in computational time polynomial in $T$, $|S|$, $|A|$ and $K$, one can solve the probabilistic goal MDP, i.e., find $\pi \in \Pi_{h,u}$ that maximizes $\Pr(X_\pi \geq V)$. Moreover, there exists an optimal solution to the probabilistic goal MDP that belongs to $\Pi_{t,s,x}$.

Proof. It suffices to show that an optimal solution to the probabilistic goal MDP,
$$\pi^* \in \arg\max_{\pi \in \Pi_h} \Pr(X_\pi \geq V),$$
together with the optimal value $\Pr(X_{\pi^*} \geq V)$, can be obtained by solving an MDP with $2TK|S|$ states.

We construct a new MDP as follows: each state is a pair $(s, C)$ where $s \in S$ and $C$ is an integer in $[-TK, TK]$, and the initial state is $(s_0, 0)$. The action set for $(s, C)$ is $A_s$, for all $C$. At time $t$, the transition between states is defined as follows:
$$\Pr\big((s', C') \,\big|\, a, (s, C)\big) = p_t(s' \mid s, a)\; g_t(C' - C \mid s, a).$$
That is, a transition to $(s', C')$ happens if, in the original MDP, a transition to state $s'$ happens and the immediate reward is $C' - C$. Notice that since the immediate reward of the original MDP is bounded in $[-K, K]$ and there are only $T$ stages, the accumulated reward of the original MDP is bounded in $[-TK, TK]$. The immediate reward of the new MDP is zero except in the last stage, in which for states $(s, C)$ with $C \geq V$, a reward of $1$ is incurred.

It is easy to see that the solution to the probabilistic goal MDP is equivalent to a policy that maximizes the expected reward of the new MDP. Since there exists a deterministic, Markovian policy for the latter, the probabilistic goal MDP has an optimal solution in $\Pi_{t,s,x}$.
Theorem 3 leads to the following algorithm for solving probabilistic goal MDPs. Note that, not surprisingly, the proposed algorithm parallels the standard (also pseudo-polynomial) algorithm for solving KSP.

Input: MDP $\langle T, S, A, R, p, g \rangle$ with $R \subseteq \mathbb{Z}$, and $V \in \mathbb{R}$.
Output: $\pi \in \Pi_{t,s,x}$ that maximizes $\Pr(X_\pi \geq V)$.
Algorithm:
1. Construct the MDP with the augmented state space.
2. Find a Markovian policy $\hat{\pi}$ that maximizes the expected reward of the new MDP.
3. Construct the policy of the original MDP as follows: the action taken at stage $t$, in state $s$, with accumulated reward $C$, equals $\hat{\pi}_t(s, C)$. Output this policy.
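A possible rendering of this algorithm in code is sketched below, assuming integer rewards bounded by $K$ and transition/reward distributions given as dictionaries; it performs backward induction on the augmented states $(s, C)$ and is meant as an illustration rather than the authors' implementation. The usage example at the bottom re-creates Example 1 and recovers the value 0.75.

```python
def solve_probabilistic_goal(T, states, actions, p, g, K, s0, V):
    """Backward induction on augmented states (s, C); C is the reward accumulated
    so far.  Returns (max Pr(X_pi >= V), policy[t][(s, C)] -> action).
    Pseudo-polynomial: the table has O(T*K*|S|) entries per stage."""
    # Terminal stage: payoff 1 iff the accumulated reward meets the target V.
    value = {(s, C): 1.0 if C >= V else 0.0
             for s in states for C in range(-T * K, T * K + 1)}
    policy = [dict() for _ in range(T)]
    for t in reversed(range(T)):
        new_value = {}
        for s in states:
            for C in range(-t * K, t * K + 1):   # reachable accumulated rewards at stage t
                best_a, best_q = None, -1.0
                for a in actions[s]:
                    # Probability of eventually reaching the target if we take a now:
                    q = sum(ps * pr * value[(s2, C + r)]
                            for s2, ps in p(t, s, a).items()
                            for r, pr in g(t, s, a).items())
                    if q > best_q:
                        best_a, best_q = a, q
                new_value[(s, C)] = best_q
                policy[t][(s, C)] = best_a
        value = new_value
    return value[(s0, 0)], policy

# Example 1 from Section 3 (state 2 is terminal and absorbs with reward 0):
if __name__ == "__main__":
    states = [0, 1, 2]
    actions = {0: [0], 1: [0, 1], 2: [0]}
    def p(t, s, a):
        return {2: 1.0} if s in (1, 2) else {1: 1.0}
    def g(t, s, a):
        if s == 0:
            return {+1: 0.5, -1: 0.5}
        if s == 1 and a == 1:                 # action "b"
            return {+1: 0.5, -2: 0.5}
        return {0: 1.0}                       # action "a", or the terminal state
    best, pi = solve_probabilistic_goal(T=2, states=states, actions=actions,
                                        p=p, g=g, K=2, s0=0, V=0)
    print(best)                               # prints 0.75
```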

5.2 General Reward

In this section we relax the assumption that the reward is integer valued. We discretize the real-valued reward parameters to obtain a new MDP that approximates the original one. The new MDP is equivalent to an integer-valued MDP, and hence can be solved in pseudo-polynomial time thanks to the results in Section 5.1. To show that such an approximation is legitimate, we bound the performance gap, which diminishes as the discretization becomes finer.

Theorem 4. There exists an algorithm, whose running time is polynomial in $T$, $|S|$, $|A|$, $K$ and $1/\epsilon$, that finds a policy $\hat{\pi}$ such that
$$\Pr(X_\pi \geq V + \epsilon) \leq \Pr(X_{\hat{\pi}} \geq V), \qquad \forall \pi \in \Pi_{h,u}.$$

Proof. Given an MDP $\langle T, S, A, R, p, g \rangle$, with $K = \max_{r \in R} |r|$, and a grid size $\delta > 0$, we construct a new MDP $\langle T, S, A, \hat{R}, p, \hat{g} \rangle$, where $\hat{R} = \{ i\delta \mid i \in \mathbb{Z};\; -K/\delta \leq i \leq K/\delta \}$, and
$$\hat{g}_t(r \mid s, a) = \begin{cases} \sum_{r' :\; r \leq r' < r + \delta} g_t(r' \mid s, a), & \text{if } r \in \hat{R};\\ 0, & \text{otherwise.} \end{cases}$$
Observe that, by scaling the parameters, the new MDP is equivalent to an integer-reward MDP whose reward parameters are bounded by $K/\delta$. Hence the probabilistic goal MDP for the new MDP can be solved in time polynomial in $T$, $|S|$, $|A|$ and $K/\delta$, by Theorem 3.
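The rounding step in the proof of Theorem 4 can be sketched as follows; the helper name and representation are our own assumptions, with each reward rounded down to the grid $\{i\delta\}$ exactly as in the definition of $\hat{g}$ above.

```python
import math
from typing import Dict

def discretize_reward_distribution(g_tsa: Dict[float, float],
                                   delta: float) -> Dict[int, float]:
    """Round each reward down to the grid {i * delta : i integer}, merging
    probabilities, as in the construction of g-hat in the proof of Theorem 4:
        g_hat(i*delta) = sum of g(r') over i*delta <= r' < (i+1)*delta.
    Returning the integer index i makes the rounded MDP an integer-reward MDP
    (with rewards bounded by K/delta), so Section 5.1 applies directly."""
    g_hat: Dict[int, float] = {}
    for r, prob in g_tsa.items():
        i = math.floor(r / delta)          # grid index of the rounded reward
        g_hat[i] = g_hat.get(i, 0.0) + prob
    return g_hat

# Rewards 0.37 and 0.33 both round down to grid index 3 (i.e., 0.3 when delta = 0.1);
# with delta = epsilon / T each stage loses at most delta of reward, which is where
# the epsilon slack in Theorem 4 comes from.
print(discretize_reward_distribution({0.37: 0.5, 0.33: 0.25, -0.52: 0.25}, delta=0.1))
# -> {3: 0.75, -6: 0.25}
```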
