Reinforcement Learning - University Of Colorado Boulder


Reinforcement Learning
Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder
Lecture 25. Slides adapted from Tom Mitchell and Pieter Abbeel.

Reinforcement Learning
- Control learning: control policies that choose optimal actions
- Q learning
- Feature-based representations
- Policy search

Plan: (1) Control Learning, (2) Q-Learning, (3) Policy Search. First: Control Learning.

Control Learning
Consider learning to choose actions, e.g.:
- a Roomba learning to dock on its battery charger
- learning to choose actions to optimize factory output
- learning to play Backgammon
Note several problem characteristics:
- delayed reward
- opportunity for active exploration
- possibility that the state is only partially observable
- possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon [Tesauro, 1995]
Learn to play Backgammon.
Immediate reward: +100 if win, -100 if lose, 0 for all other states.
Trained by playing 1.5 million games against itself; now approximately equal to the best human player.

The Reinforcement Learning Problem
[Diagram: the agent observes state $s_t$ and reward $r_t$ from the environment and chooses action $a_t$, yielding the interaction sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots$]

Markov Decision Processes
Assume:
- a finite set of states $S$ and a set of actions $A$
- at each discrete time step the agent observes state $s_t \in S$ and chooses action $a_t \in A$
- it then receives immediate reward $r_t$, and the state changes to $s_{t+1}$
- Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$, i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
- the functions $\delta$ and $r$ may be nondeterministic
- the functions $\delta$ and $r$ are not necessarily known to the agent
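To make the notation concrete, here is a minimal Python sketch of a deterministic MDP: a small grid world with a transition function delta and a reward function r. The 2x3 grid, the goal location, and the reward of 100 for reaching the goal are illustrative assumptions, not the exact example pictured later in these slides.

```python
# A minimal sketch of a deterministic MDP: finite states S, actions A,
# transition function delta, reward function r. Layout and rewards are
# illustrative assumptions.

GAMMA = 0.9

STATES = [(row, col) for row in range(2) for col in range(3)]
ACTIONS = ["up", "down", "left", "right"]
GOAL = (0, 2)

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def delta(state, action):
    """Deterministic transition function s' = delta(s, a); the goal absorbs."""
    if state == GOAL:
        return state
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if nxt in STATES else state  # bumping into a wall: stay put

def r(state, action):
    """Immediate reward r(s, a): 100 for entering the goal, 0 otherwise."""
    return 100 if delta(state, action) == GOAL and state != GOAL else 0
```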

Agent's Learning Task
Execute actions in the environment, observe the results, and learn an action policy $\pi : S \to A$ that maximizes
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$
from any starting state in $S$; here $0 \le \gamma < 1$ is the discount factor for future rewards.
Note something new: the target function is $\pi : S \to A$, but we have no training examples of the form $\langle s, a \rangle$; training examples are of the form $\langle \langle s, a \rangle, r \rangle$.
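As a small illustration of this objective, the sketch below (not from the slides) computes the discounted return for a finite reward sequence; with $0 \le \gamma < 1$ the corresponding infinite sum converges.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_{t+i} over a (finite) reward sequence."""
    return sum((gamma ** i) * reward for i, reward in enumerate(rewards))

# Example: reward 0 now, 0 next step, 100 two steps later.
print(discounted_return([0, 0, 100]))  # 0.9**2 * 100 = 81.0
```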

Q-Learning

Value Function
To begin, consider deterministic worlds.
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states:
$V^\pi(s) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
where $r_t, r_{t+1}, \dots$ are the rewards obtained by following policy $\pi$ starting at state $s$.

Q-learning
Restated, the task is to learn the optimal policy $\pi^*$:
$\pi^* = \arg\max_\pi V^\pi(s), \ (\forall s)$
[Grid-world figure, built up over three slides: the $r(s,a)$ immediate reward values (100 for entering the goal state G, 0 elsewhere), the corresponding $Q(s,a)$ values (100, 90, 81, 72, ...), and one optimal policy.]

What to Learn
We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$). It could then do a lookahead search to choose the best action from any state $s$, because
$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$
A problem: this works well if the agent knows $\delta : S \times A \to S$ and $r : S \times A \to \mathbb{R}$, but when it doesn't, it can't choose actions this way.

Q Function
Define a new function very similar to $V^*$:
$Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a))$
If the agent learns Q, it can choose the optimal action even without knowing $\delta$:
$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))] = \arg\max_a Q(s,a)$
Q is the evaluation function the agent will learn.

Training Rule to Learn Q
Note that Q and $V^*$ are closely related:
$V^*(s) = \max_{a'} Q(s, a')$
which allows us to write Q recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Nice! Let $\hat{Q}$ denote the learner's current approximation to Q, and consider the training rule
$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
where $s'$ is the state resulting from applying action $a$ in state $s$.

Q Learning for Deterministic Worlds
For each $s, a$, initialize the table entry $\hat{Q}(s,a) \leftarrow 0$.
Observe the current state $s$.
Do forever:
- select an action $a$ and execute it
- receive immediate reward $r$
- observe the new state $s'$
- update the table entry: $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
- $s \leftarrow s'$
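As a concrete illustration of the table-based algorithm above, here is a minimal Python sketch that reuses the hypothetical grid-world delta and r from earlier. The uniformly random action selection and the reset to a random start state at the beginning of each episode are assumptions; the slide's pseudocode only says to select an action and execute it.

```python
import random
from collections import defaultdict

def q_learning_deterministic(states, actions, delta, r, goal,
                             gamma=0.9, episodes=5000):
    """Tabular Q-learning for a deterministic world, following the slide."""
    q_hat = defaultdict(float)  # table entries Q^(s, a), initialized to 0

    for _ in range(episodes):
        s = random.choice([st for st in states if st != goal])
        while s != goal:                    # one episode of "do forever"
            a = random.choice(actions)      # select an action a and execute it
            reward = r(s, a)                # receive immediate reward r
            s_next = delta(s, a)            # observe the new state s'
            # Q^(s, a) <- r + gamma * max_a' Q^(s', a')
            q_hat[(s, a)] = reward + gamma * max(q_hat[(s_next, a2)]
                                                 for a2 in actions)
            s = s_next                      # s <- s'
    return q_hat

# q_hat = q_learning_deterministic(STATES, ACTIONS, delta, r, GOAL)
# Greedy policy: pi(s) = argmax over a of q_hat[(s, a)]
```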

Updating $\hat{Q}$
[Figure: a grid world before and after the agent moves right from initial state $s_1$ to next state $s_2$; the actions out of $s_2$ have current $\hat{Q}$ values 63, 81, and 100, and the immediate reward is 0.]
$\hat{Q}(s_1, a_{\text{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$
If rewards are non-negative, then
$(\forall s, a, n)\ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$
and
$(\forall s, a, n)\ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$
so $\hat{Q}$ converges to $Q$.

Nondeterministic Case
What if the reward and next state are non-deterministic?
We redefine $V$ and $Q$ by taking expected values:
$V^\pi(s) = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots] = \sum_{i=0}^{\infty} E[\gamma^i r_{t+i}]$
$Q(s,a) = E[r(s,a) + \gamma V^*(\delta(s,a))]$

Nondeterministic Case
Q learning generalizes to nondeterministic worlds. Alter the training rule to
$\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\hat{Q}_{n-1}(s,a) + \alpha_n \left[r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')\right]$
where
$\alpha_n = \frac{1}{1 + \text{visits}_n(s,a)}$
Convergence of $\hat{Q}$ to $Q$ can still be proved [Watkins and Dayan, 1992].
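A minimal sketch of this update with the decaying learning rate $\alpha_n = 1/(1 + \text{visits}_n(s,a))$. Only the per-step update is shown; the table and visit counters are module-level dictionaries for brevity, and the surrounding exploration loop is assumed to look like the deterministic one above.

```python
from collections import defaultdict

q_hat = defaultdict(float)    # Q^_n(s, a)
visits = defaultdict(int)     # visit counts for each (s, a) pair

def nondeterministic_update(s, a, reward, s_next, actions, gamma=0.9):
    """One Q-learning step with alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(q_hat[(s_next, a2)] for a2 in actions)
    # Q^_n(s,a) <- (1 - alpha) Q^_{n-1}(s,a) + alpha [r + gamma max_a' Q^_{n-1}(s',a')]
    q_hat[(s, a)] = (1 - alpha) * q_hat[(s, a)] + alpha * target
```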

Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
$Q^{(1)}(s_t, a_t) = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$
Why not two steps?
$Q^{(2)}(s_t, a_t) = r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$
Or $n$?
$Q^{(n)}(s_t, a_t) = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$
Blend all of these:
$Q^\lambda(s_t, a_t) = (1 - \lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \dots\right]$

Temporal Difference Learning
$Q^\lambda(s_t, a_t) = (1 - \lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \dots\right]$
Equivalent recursive expression:
$Q^\lambda(s_t, a_t) = r_t + \gamma\left[(1 - \lambda)\max_a \hat{Q}(s_{t+1}, a) + \lambda Q^\lambda(s_{t+1}, a_{t+1})\right]$
The TD($\lambda$) algorithm uses the above training rule:
- sometimes converges faster than Q learning
- converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992)
- Tesauro's TD-Gammon uses this algorithm
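The blend of $n$-step returns can be computed directly from a recorded trajectory. The sketch below illustrates the formula rather than the eligibility-trace implementation normally used for TD($\lambda$): it truncates the infinite blend at the trajectory length, and the trajectory format (a reward list plus a list of bootstrapped max-$\hat{Q}$ values) is an assumption.

```python
def n_step_return(rewards, bootstrap_values, n, gamma=0.9):
    """Q^(n): n discounted rewards plus gamma**n times the bootstrapped max-Q.

    bootstrap_values[i] is assumed to be max_a Q^(s_{t+i}, a), so the list is
    one element longer than the reward list.
    """
    ret = sum((gamma ** i) * rewards[i] for i in range(n))
    return ret + (gamma ** n) * bootstrap_values[n]

def lambda_return(rewards, bootstrap_values, lam=0.7, gamma=0.9):
    """Blend Q^(1), Q^(2), ... with weights (1 - lam) * lam**(n-1).

    The infinite series is truncated at the length of the trajectory.
    """
    horizon = len(rewards)
    blended = sum((lam ** (n - 1)) * n_step_return(rewards, bootstrap_values, n, gamma)
                  for n in range(1, horizon + 1))
    return (1 - lam) * blended

# Example with dummy numbers for a 3-step trajectory:
# print(lambda_return([0, 0, 100], [50, 60, 80, 0]))
```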

What if the number of states is huge and/or structured?
- Say we discover that a particular state is bad.
- In Q learning, we know nothing about similar states.
- Solution: a feature-based representation, e.g. for Pacman: distance to the closest ghost, distance to the closest dot, number of ghosts, is Pacman in a tunnel?

Function Approximation
$Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots$
Q-learning now has perceptron-style updates (least-squares regression on the features).
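A minimal sketch of linear function approximation with a TD-style weight update, in the spirit of the perceptron-style update mentioned above. The feature function, learning rate, and environment interface are hypothetical.

```python
import numpy as np

class LinearQ:
    """Q(s, a) = w . f(s, a) with a perceptron/least-squares style update."""

    def __init__(self, feature_fn, num_features, actions,
                 gamma=0.9, learning_rate=0.01):
        self.feature_fn = feature_fn   # f(s, a) -> np.ndarray of length num_features
        self.weights = np.zeros(num_features)
        self.actions = actions
        self.gamma = gamma
        self.learning_rate = learning_rate

    def q_value(self, state, action):
        return float(self.weights @ self.feature_fn(state, action))

    def update(self, state, action, reward, next_state):
        """Move the weights along the TD error times the feature vector."""
        target = reward + self.gamma * max(self.q_value(next_state, a)
                                           for a in self.actions)
        td_error = target - self.q_value(state, action)
        self.weights += self.learning_rate * td_error * self.feature_fn(state, action)
```

One appeal of this representation is generalization: updating the weights after visiting one bad state also lowers the estimated value of other states with similar features.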

Policy Search

Policy Search
- Problem: often the feature-based policies that work well are not those that approximate $V$/$Q$ best.
- Solution: find policies that maximize rewards directly, rather than the value function that predicts rewards.
- Successful

Example: Imitation Learning
- Take examples from experts: $\{(s_1, a_1), \dots\}$
- Learn a classifier mapping $s \to a$ (a minimal sketch of this step follows below)
- Use the negative reward as the loss
- Problems?
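Below is a minimal sketch of the classifier step (behavior cloning) using scikit-learn; the toy feature vectors and the choice of logistic regression are assumptions for illustration, not the method prescribed on the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Expert demonstrations: state feature vectors and the actions the expert took.
# These arrays are placeholders; in practice f(s) would come from the task.
expert_states = np.array([[0.1, 3.0], [0.4, 1.0], [0.9, 0.5]])   # f(s_1), f(s_2), ...
expert_actions = np.array(["wait", "wait", "buzz"])               # a_1, a_2, ...

# Learn a classifier mapping s -> a from the expert pairs.
policy = LogisticRegression().fit(expert_states, expert_actions)

# The learned policy imitates the expert on new states.
print(policy.predict(np.array([[0.8, 0.6]])))
```

One well-known problem (the "Problems?" on the slide): at test time the learned policy visits states the expert never demonstrated, where its predictions can be arbitrarily bad.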

How do we find a good policy?
- Find an optimal policy through dynamic programming: $\pi_0 = \pi^*$
- Represent states $s$ through a feature vector $f(s)$
- Until convergence:
  - generate examples of state-action pairs $(\pi_t(s), s)$
  - create a classifier that maps states to actions (an apprentice policy) $h_t : f(s) \mapsto A$ (the loss of the classifier is the negative reward)
  - interpolate with the learned classifier: $\pi_{t+1} = \lambda \pi_t + (1 - \lambda) h_t$
SEARN: Searching to Learn (Daumé & Marcu, 2006). A small sketch of the interpolation loop follows below.
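A minimal Python sketch of this interpolation loop, under simplifying assumptions: the roll-in, classifier training, and stochastic mixture are stand-ins for SEARN's cost-sensitive machinery, and every function name here is hypothetical.

```python
import random

def searn_like_training(initial_policy, train_classifier, generate_examples,
                        iterations=10, lam=0.8):
    """Iteratively replace pieces of the starting (oracle) policy with learned ones.

    initial_policy:    pi_0, e.g. the optimal/oracle policy
    train_classifier:  fits an apprentice policy h_t from (f(s), a) examples
    generate_examples: rolls out pi_t and returns (f(s), pi_t(s)) pairs
    """
    policy = initial_policy
    for _ in range(iterations):
        examples = generate_examples(policy)       # (f(s), pi_t(s)) pairs
        apprentice = train_classifier(examples)    # h_t : f(s) -> A

        def mixture(state, old=policy, new=apprentice):
            # pi_{t+1} = lam * pi_t + (1 - lam) * h_t, realized as a
            # stochastic mixture: follow the old policy with probability lam.
            return old(state) if random.random() < lam else new(state)

        policy = mixture
    return policy
```

The interpolation is the key design choice: early iterations lean on the oracle, and the learned classifiers gradually take over as training proceeds.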

Applications of Imitation Learning
- Car driving
- Flying helicopters
- Question answering
- Machine translation

Question Answering
- State: the words seen so far, the opponent
- Action: buzz or wait
- Reward: points

Why machine translation really hard is
- State: the words you've seen, the output of a machine translation system
- Action: take the translation, predict the verb
- Reward: translation quality

Comparing Policies
[Figure, built up over many slides: for the source sentence "Er ist zum Laden gegangen", several simultaneous-translation policies are compared over time. The Psychic policy produces the good translation "He went to the store" immediately; the Monotone policy produces the bad translation "He to the store"; the Batch policy waits for the full sentence and then produces the good translation; the learned Policy Prediction outputs "He", "He went", "He went to the store" incrementally, trading off delay against translation quality.]

Applying SEARN
[Figure, built up over several slides: an oracle example rolls out actions (Wait, Wait, Predict, Commit) over time; each visited state $s_i$ yields a feature vector $f(s_i)$ paired with the oracle's action and its reward, and these cost-sensitive classification examples are used to fit a classifier $h : s \mapsto a$, which is then rolled out in place of the oracle.]

Recap
- Reinforcement learning: interacting with environments
- Important to scale to large state spaces
- Connection with classification
- Lets computers observe and repeat

