Markov Decision Processes – Solution (Idm-lab)


1) Invent a simple Markov decision process (MDP) with the following properties: a) it has a goal state, b) its immediate action costs are all positive, c) all of its actions can result with some probability in the start state, and d) the optimal policy without discounting differs from the optimal policy with discounting and a discount factor of 0.9. Prove d) using value iteration.

Answer:

[MDP diagram and value-iteration tables omitted: the example has three states with goal state S3, action costs of 9 and 1000, and transition probabilities of 0.999 and 0.001.]

With no discount factor, action a1 is preferred over action a2 in state S1. With a discount factor of 0.9, action a2 is preferred over action a1 in state S1.

2) Consider the following problem (with thanks to V. Conitzer): Consider a rover that operates on a slope and uses solar panels to recharge. It can be in one of three states: high, medium and low on the slope. If it spins its wheels, it climbs the slope in each time step (from low to medium or from medium to high) or stays high. If it does not spin its wheels, it slides down the slope in each time step (from high to medium or from medium to low) or stays low. Spinning its wheels uses one unit of energy per time step. Being high or medium on the slope gains three units of energy per time step via the solar panels, while being low on the slope does not gain any energy per time step. The robot wants to gain as much energy as possible.

a) Draw the MDP graphically. b) Solve the MDP using value iteration with a discount factor of 0.8. c) Describe the optimal policy.

Answer:

[MDP diagram omitted: states L (low), M (medium) and H (high); spinning moves the rover one state up the slope (or keeps it in H) and yields reward −1 in L and 2 in M or H; not spinning moves it one state down (or keeps it in L) and yields reward 0 in L and 3 in M or H.]

Starting with 0 as initial values, value iteration calculates the values of the states (per-iteration value table omitted). At each iteration, the value of a state is the value of the maximizing action in that state (since we are trying to maximize energy). For instance, in iteration 4, the value of L, v4(L), is computed as follows:

v4(L, spin) = 0.8 v3(M) − 1 = 0.8 · 6.32 − 1 ≈ 4.06
v4(L, don't) = 0.8 v3(L) + 0 = 0.8 · 2.52 ≈ 2.02
v4(L) = max(v4(L, spin), v4(L, don't)) ≈ 4.06

The optimal policy is to spin when the rover is low or medium on the slope and not to spin when it is high on the slope.
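As a cross-check on these numbers, here is a small value-iteration program in the style of the C code given for problem 3 below. It is only a sketch of mine, not part of the original solution; the variable names and the cap of 100 iterations are my own choices.

#include <stdio.h>

#define max(x, y) (((x) > (y)) ? (x) : (y))

int main(void)
{
    /* Values of the states L (low), M (medium) and H (high), initialized to 0. */
    float l = 0.0f, m = 0.0f, h = 0.0f;
    float gamma = 0.8f;   /* discount factor */

    for (int iteration = 1; iteration <= 100; iteration++) {
        /* Action values: immediate reward plus discounted value of the successor state. */
        float l_spin = -1.0f + gamma * m;   /* spin in L: climb to M */
        float l_dont =  0.0f + gamma * l;   /* don't spin in L: stay in L */
        float m_spin =  2.0f + gamma * h;   /* spin in M: climb to H */
        float m_dont =  3.0f + gamma * l;   /* don't spin in M: slide to L */
        float h_spin =  2.0f + gamma * h;   /* spin in H: stay in H */
        float h_dont =  3.0f + gamma * m;   /* don't spin in H: slide to M */

        l = max(l_spin, l_dont);
        m = max(m_spin, m_dont);
        h = max(h_spin, h_dont);
        printf("%d: L=%.2f M=%.2f H=%.2f\n", iteration, l, m, h);
    }
    return 0;
}

In iteration 4 it prints L=4.06, which matches the hand computation above.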

Now answer the three questions above for the following variant of the robot problem: If it spins its wheels, it climbs the slope in each time step (from low to medium or from medium to high) or stays high, all with probability 0.3. It stays where it is with probability 0.7. If it does not spin its wheels, it slides down the slope to low with probability 0.4 and stays where it is with probability 0.6. Everything else remains unchanged from the previous problem.

Answer:

[MDP diagram omitted: the rewards are as before (−1 for spinning in L, 0 for not spinning in L, 2 for spinning in M or H, 3 for not spinning in M or H); spinning now succeeds with probability 0.3 and leaves the rover where it is with probability 0.7, while not spinning drops it to L with probability 0.4 and leaves it where it is with probability 0.6.]

Starting with 0 as initial values, value iteration calculates the values of the states (per-iteration value table omitted). In this variant, in iteration 4, the value of L, v4(L), is computed as follows:

v4(L, spin) = 0.8 (0.3 v3(M) + 0.7 v3(L)) − 1 = 0.8 (0.3 · 5.55 + 0.7 · 0.07) − 1 ≈ 0.37
v4(L, don't) = 0.8 v3(L) + 0 = 0.8 · 0.07 = 0.056
v4(L) = max(v4(L, spin), v4(L, don't)) ≈ 0.37

The optimal policy is to spin, wherever the rover is on the slope.
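The same kind of sketch can be adapted to this stochastic variant. Again, this is my own illustration rather than part of the original solution (variable names and the iteration cap are assumptions); the backups now weight the successor values by the transition probabilities 0.3/0.7 and 0.4/0.6.

#include <stdio.h>

#define max(x, y) (((x) > (y)) ? (x) : (y))

int main(void)
{
    /* Values of the states L, M and H, initialized to 0. */
    float l = 0.0f, m = 0.0f, h = 0.0f;
    float gamma = 0.8f;   /* discount factor */

    for (int iteration = 1; iteration <= 100; iteration++) {
        /* Spinning: climb with probability 0.3, stay with probability 0.7 (H stays either way).
           Not spinning: drop to L with probability 0.4, stay with probability 0.6 (L stays either way). */
        float l_spin = -1.0f + gamma * (0.3f * m + 0.7f * l);
        float l_dont =  0.0f + gamma * l;
        float m_spin =  2.0f + gamma * (0.3f * h + 0.7f * m);
        float m_dont =  3.0f + gamma * (0.4f * l + 0.6f * m);
        float h_spin =  2.0f + gamma * h;
        float h_dont =  3.0f + gamma * (0.4f * l + 0.6f * h);

        l = max(l_spin, l_dont);
        m = max(m_spin, m_dont);
        h = max(h_spin, h_dont);
        printf("%d: L=%.2f M=%.2f H=%.2f\n", iteration, l, m, h);
    }
    return 0;
}

In iteration 4 it prints L=0.37, which matches the computation above.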

3) Consider the following problem (with thanks to V. Conitzer): Consider a rover that operates on a slope. It can be in one of four states: top, high, medium and low on the slope. If it spins its wheels slowly, it climbs the slope in each time step (from low to medium or from medium to high or from high to top) with probability 0.3. It slides down the slope to low with probability 0.7. If it spins its wheels rapidly, it climbs the slope in each time step (from low to medium or from medium to high or from high to top) with probability 0.5. It slides down the slope to low with probability 0.5. Spinning its wheels slowly uses one unit of energy per time step. Spinning its wheels rapidly uses two units of energy per time step. The rover is low on the slope and aims to reach the top with minimum expected energy consumption.

a) Draw the MDP graphically. b) Solve the MDP using undiscounted value iteration (that is, value iteration with a discount factor of 1). c) Describe the optimal policy.

Answer:

[MDP diagram omitted: states L (low), M (medium), H (high) and T (top, the goal); spinning slowly costs 1 and climbs one state with probability 0.3 or drops the rover to L with probability 0.7; spinning rapidly costs 2 and climbs one state with probability 0.5 or drops it to L with probability 0.5.]

Starting with 0 as initial values, value iteration calculates the values of the states (per-iteration value table omitted); the values converge to approximately 25.33 for L, 22.00 for M and 14.67 for H.

At each iteration, the value of a state is the value of the minimizing action in that state (since we are trying to minimize cost). For instance, in iteration 4, the value of L, v4(L), is computed as follows:

v4(L, slowly) = 0.3 v3(M) + 0.7 v3(L) + 1 ≈ 3.97
v4(L, rapidly) = 0.5 v3(M) + 0.5 v3(L) + 2 ≈ 4.96
v4(L) = min(v4(L, slowly), v4(L, rapidly)) ≈ 3.97

The optimal policy is to spin slowly in the low state and to spin rapidly in the other states.

Here is a sample C program for this value iteration.

#include <stdio.h>

#define min(x, y) (((x) < (y)) ? (x) : (y))

int main(void)
{
    int iteration = 0;
    /* Values of the states low, medium and high; the goal state top always keeps value 0. */
    float l = 0.0;
    float m = 0.0;
    float h = 0.0;
    float t = 0.0;
    float l_slow, l_rapid, m_slow, m_rapid, h_slow, h_rapid;

    while (1) {   /* runs until interrupted */
        /* Action values: action cost plus expected value of the successor states. */
        l_slow  = 1 + 0.7 * l + 0.3 * m;
        l_rapid = 2 + 0.5 * l + 0.5 * m;
        m_slow  = 1 + 0.7 * l + 0.3 * h;
        m_rapid = 2 + 0.5 * l + 0.5 * h;
        h_slow  = 1 + 0.7 * l + 0.3 * t;
        h_rapid = 2 + 0.5 * l + 0.5 * t;

        printf("%d: ", iteration);
        printf("%.2f %.2f ", l_slow, l_rapid);
        printf("%.2f %.2f ", m_slow, m_rapid);
        printf("%.2f %.2f\n", h_slow, h_rapid);

        l = min(l_slow, l_rapid);
        m = min(m_slow, m_rapid);
        h = min(h_slow, h_rapid);
        iteration++;
    }
}
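Compiled with any standard C compiler, the printed action values settle at roughly 25.33 and 25.67 for L, 23.13 and 22.00 for M, and 18.73 and 14.67 for H (slowly and rapidly, respectively), consistent with the converged state values quoted above.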

4) You won the lottery and they will pay you one million dollars each year for 20 years (starting this year). If the interest rate is 5 percent, how much money do you need to get right away to be indifferent between this amount of money and the annuity?

Answer:

A million dollars we get right away is worth a million dollars to us now. A million dollars we get a year from now is worth γ = 1/(1 + 0.05) million dollars to us now because, with interest, it would be (1/1.05) · 1.05 = 1 million dollars in a year. Similarly, a million dollars we get 19 years from now (at the beginning of the 20th year) is worth only (1/1.05)^19 ≈ 0.4 million dollars to us now. Therefore, getting paid a million dollars each year for 20 years is worth 1 + γ + γ^2 + ··· + γ^19 = (1 − γ^20)/(1 − γ) ≈ 0.623/0.0476 ≈ 13.08 million dollars to us now.

5) Assume that you are trying to pick up a block from the table. You drop it accidentally with probability 0.7 while trying to pick it up. If this happens, you try again to pick it up. How many attempts does it take on average before you pick up the block successfully?

Answer:

We can model this problem as an MDP as follows:

[MDP diagram omitted: from the state "on table" the action "pick up" costs 1 and leads to the state "picked up" with probability 0.3 and back to "on table" with probability 0.7.]

Now, we can calculate the expected cost with X = 1 + 0.7 X + 0.3 Y and Y = 0 (where X = "on table" and Y = "picked up"), which gives us X = 1/0.3 ≈ 3.33.

The Markov assumption is the assumption that, whenever one executes an action in a state, the probability distribution over its outcomes is the same, no matter how one got into the state. Is this assumption realistic here?

No. If the pickup operation does not work the first time (e.g. because the block is too slippery), it will likely not work the next time either.
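To double-check the arithmetic in problem 4, here is a short C sketch of my own (not part of the original solution) that computes the present value both by summing the 20 discounted payments and with the closed form (1 − γ^20)/(1 − γ).

#include <stdio.h>
#include <math.h>

int main(void)
{
    double gamma = 1.0 / 1.05;   /* discount factor for a 5 percent interest rate */
    double sum = 0.0;

    /* Present value of 20 payments of 1 (million dollars), the first one paid right away. */
    for (int year = 0; year < 20; year++)
        sum += pow(gamma, year);

    double closed_form = (1.0 - pow(gamma, 20)) / (1.0 - gamma);

    printf("sum = %.2f, closed form = %.2f\n", sum, closed_form);
    return 0;
}

It prints about 13.09 for both quantities; the hand computation above gets 13.08 because it rounds the intermediate values.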

6) Assume that you use undiscounted value iteration (that is, value iteration with a discount factor of 1) for a Markov decision process with goal states, where the action costs are greater than or equal to zero. Give a simple example that shows that the values that value iteration converges to can depend on the initial values of the states, in other words, the values that value iteration converges to are not necessarily equal to the expected goal distances of the states.

Answer:

[MDP diagram omitted: state S has two actions, a1 with cost 0 that leads back to S and a2 with cost 1 that leads to the goal state G.]

Consider the initial values v0(S) = 0 and v0(G) = 0. Value iteration determines the values after convergence to be v(S) = 0 and v(G) = 0, yet the (expected) goal distance of S is 1, not 0. Now consider the initial values v0(S) = 2 and v0(G) = 0. Value iteration determines the values after convergence to be v(S) = 1 and v(G) = 0.
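This dependence on the initial values can also be seen by running the undiscounted backup v(S) = min(0 + v(S), 1 + v(G)) in code. The following C sketch is my own illustration of the example above, under the reading that a1 is a cost-0 self-loop in S and a2 reaches G with cost 1.

#include <stdio.h>

#define min(x, y) (((x) < (y)) ? (x) : (y))

/* Undiscounted value iteration on the two-state example: action a1 (cost 0) loops
   in S, action a2 (cost 1) leads to the goal G, whose value stays 0. */
static double converge(double v_s_init)
{
    double v_s = v_s_init, v_g = 0.0;
    for (int i = 0; i < 100; i++)
        v_s = min(0.0 + v_s, 1.0 + v_g);
    return v_s;
}

int main(void)
{
    printf("v0(S) = 0: value iteration converges to v(S) = %.1f\n", converge(0.0));
    printf("v0(S) = 2: value iteration converges to v(S) = %.1f\n", converge(2.0));
    return 0;
}

It prints 0.0 for the first initialization and 1.0 for the second, even though the expected goal distance of S is 1 in both cases.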

7) An MDP with a single goal state (S3) is given below. a) Given the expected goal distances c(S1) = 7, c(S2) = 4.2 and c(S3) = 0, calculate the optimal policy. b) Suppose that we want to follow a policy where we pick action a2 in state S1 and action a3 in state S2. Calculate the expected goal distances of S1 and S2 for this policy.

[MDP diagram omitted: the transition probabilities and action costs appear in the computations below.]

Answer:

a) We use c(s, a) to denote the expected cost of reaching a goal state if one starts in state s, executes action a and then acts according to the policy. Since S3 is a goal state and S2 has only one available action, we only need to calculate c(S1, a1) and c(S1, a2) in order to decide whether to execute a1 or a2 at S1.

c(S1, a1) = 0.25 (1 + c(S3)) + 0.75 (2 + c(S1)) = 0.25 (1 + 0) + 0.75 (2 + 7) = 7
c(S1, a2) = 0.5 (1 + c(S2)) + 0.5 (2 + c(S1)) = 0.5 (1 + 4.2) + 0.5 (2 + 7) = 7.1

Since c(S1, a1) < c(S1, a2), in the optimal policy we execute a1 at S1.

b) Since the given policy executes a2 at S1, we simply ignore a1 during our computation. We first generate the following set of equations:

c(S1) = c(S1, a2) = 0.5 (1 + c(S2)) + 0.5 (2 + c(S1))
c(S2) = c(S2, a3) = 0.6 (1 + c(S3)) + 0.4 (2 + c(S1))
c(S3) = 0

Plugging c(S3) = 0 into the second equation, we get:

c(S2) = 0.6 (1 + 0) + 0.4 (2 + c(S1))
c(S2) = 1.4 + 0.4 c(S1)

Plugging this into the first equation, we get:

c(S1) = 0.5 (1 + 1.4 + 0.4 c(S1)) + 0.5 (2 + c(S1))
c(S1) = 2.2 + 0.7 c(S1)
c(S1) = 2.2/0.3 ≈ 7.33

Finally, we get:

c(S2) = 1.4 + 0.4 c(S1) ≈ 1.4 + 0.4 · 7.33 ≈ 4.33
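Instead of solving the two equations by hand, the expected goal distances under the fixed policy of part b) can also be obtained by iterative policy evaluation. The following C sketch is my own illustration; the iteration count of 1000 is an arbitrary but more than sufficient choice.

#include <stdio.h>

int main(void)
{
    /* Iterative policy evaluation for the policy of part b): a2 in S1, a3 in S2.
       The goal state S3 always has expected goal distance 0. */
    double c1 = 0.0, c2 = 0.0, c3 = 0.0;

    for (int i = 0; i < 1000; i++) {
        double new_c1 = 0.5 * (1 + c2) + 0.5 * (2 + c1);
        double new_c2 = 0.6 * (1 + c3) + 0.4 * (2 + c1);
        c1 = new_c1;
        c2 = new_c2;
    }

    printf("c(S1) = %.2f, c(S2) = %.2f\n", c1, c2);
    return 0;
}

It prints c(S1) = 7.33 and c(S2) = 4.33, matching the values derived above.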
