Deep Reinforcement Learning: Q-Learning


Deep Reinforcement Learning: Q-Learning
Garima Lalwani, Karan Ganju, Unnat Jain

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

Q-Learning recap: David Silver’s Introduction to RL lectures; Pieter Abbeel’s Artificial Intelligence course, Berkeley (Spring 2015)

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

Function Approximation - Why?
Value functions: every state s has an entry V(s), and every state-action pair (s, a) has an entry Q(s, a). How do we get Q(s, a)? Table lookup. What about large MDPs? Estimate the value function with function approximation, and generalise from seen states to unseen states.

Function Approximation - How? Why Q? How to approximate?
- Features for state-action pair (s, a): x(s, a)
- Linear model: Q(s, a) ≈ w^T x(s, a)
- Deep Neural Nets (CS598): Q(s, a) ≈ NN(s, a)
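As a minimal illustration of the linear case above, a semi-gradient Q-learning update for Q(s, a) ≈ w^T x(s, a) can be sketched as follows; the 4-dimensional features are made up for the example:

import numpy as np

def q_value(w, x):
    """Linear value-function approximation: Q(s, a) ~= w^T x(s, a)."""
    return np.dot(w, x)

def q_learning_update(w, x, r, x_next_best, alpha=0.1, gamma=0.99):
    """One SGD-style update toward the TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * q_value(w, x_next_best)   # x_next_best: features of the greedy action in s'
    td_error = td_target - q_value(w, x)
    return w + alpha * td_error * x                   # gradient of Q w.r.t. w is just x for a linear model

# Hypothetical 4-dimensional features for a (state, action) pair:
w = np.zeros(4)
x_sa, x_next = np.array([1.0, 0.5, 0.0, 1.0]), np.array([0.0, 1.0, 1.0, 0.0])
w = q_learning_update(w, x_sa, r=1.0, x_next_best=x_next)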

Function Approximation - Demo

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

Deep Q Network
1) Input: 4 images (the current frame and the 3 previous frames), stacked to form the state s.
2) Output: Q(s, a_i) for every action, i.e. Q(s, a1), Q(s, a2), Q(s, a3), ..., Q(s, a18).
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
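For concreteness, a minimal PyTorch sketch of such a network; the transcription only specifies the 4-frame input and the 18 Q-value outputs, so the layer sizes below follow the standard Nature-DQN architecture and 84x84 Atari preprocessing:

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Nature-DQN-style network: 4 stacked 84x84 frames in, one Q-value per action out."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),   # Q(s, a1) ... Q(s, a18)
        )

    def forward(self, s):
        return self.head(self.features(s))

q_net = DQN(num_actions=18)
frames = torch.zeros(1, 4, 84, 84)   # a batch of one stacked state
print(q_net(frames).shape)           # torch.Size([1, 18])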

Supervised SGD (lec 2) vs Q-Learning SGD
SGD update assuming supervision, versus the SGD update for Q-Learning.
David Silver’s Deep Learning Tutorial, ICML 2016
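The two updates contrasted on this slide are not reproduced in the transcription; in their standard form they are, for a known supervised target q*(s, a) versus the Q-learning TD target:

\Delta w = \alpha \, \big( q^{*}(s,a) - Q(s,a;w) \big) \, \nabla_w Q(s,a;w)

\Delta w = \alpha \, \big( r + \gamma \max_{a'} Q(s',a';w) - Q(s,a;w) \big) \, \nabla_w Q(s,a;w)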

Training tricks
Issues:
a. Data is sequential: successive samples are correlated (non-iid), and an experience is visited only once in online learning.
b. The policy changes rapidly with slight changes to Q-values, so the policy may oscillate.

Solution to (a): ‘Experience Replay’: work on a dataset, sampling randomly and repeatedly.
- Build the dataset: take action a_t according to the ε-greedy policy and store the transition/experience (s_t, a_t, r_{t+1}, s_{t+1}) in dataset D (the ‘replay memory’).
- Sample a random mini-batch (32 experiences) of (s, a, r, s') from D.
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
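A minimal sketch of the replay memory described above; the capacity here is illustrative (the paper stores on the order of a million transitions and samples mini-batches of 32):

import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions, sampled uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random, repeated sampling breaks the correlation between successive frames.
        return random.sample(self.buffer, batch_size)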

Solution to (b): ‘Target Network’: stale updates, with a C-step delay between updates of Q and its use as targets.
- Network 1 (older weights w_{i-1}) provides the Q(s, a) targets and is held fixed.
- Network 2 (weights w_i) has its Q-values updated at every SGD step.
- After C steps (10,000 SGD updates), the target network is refreshed with the current weights.
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
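Putting the two tricks together, one SGD step might look like the sketch below. It assumes the DQN and ReplayMemory sketches above and a mini-batch already converted to tensors; the optimizer and loss choices are illustrative, not the paper’s exact settings:

import copy
import torch
import torch.nn.functional as F

# Sketch of one DQN training step, assuming `q_net` is the online network from the earlier sketch.
target_net = copy.deepcopy(q_net)                       # Network 1: frozen copy for targets
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def dqn_step(batch, gamma=0.99):
    s, a, r, s_next, done = batch                       # a: LongTensor of action indices
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s, a; w_i)
    with torch.no_grad():                                           # stale targets from w_{i-1}
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.smooth_l1_loss(q_sa, target)               # Huber-style loss as an illustrative choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C SGD steps (10,000 in the paper), refresh the stale copy:
# target_net.load_state_dict(q_net.state_dict())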

DQN: Results
Why not just use VGGNet features?
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

Q-Learning for Roulette
Hasselt, Hado V. "Double Q-learning." In Advances in Neural Information Processing Systems, pp. 2613-2621. 2010.

Q-Learning Overestimation: Function Approximation
[Figure: the learned Q estimate vs. the actual Q value]
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.

One (Estimator) Isn’t Good Enough? Use two estimators.

Double Q-Learning
Two estimators:
- Estimator Q1: obtain the best action.
- Estimator Q2: evaluate Q for that action.
The chance of both estimators overestimating at the same action is smaller.
Q target vs. Double Q target: Q1 selects, Q2 evaluates.
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
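A sketch of the two targets named on the slide, reusing q_net (online) and target_net from the earlier sketches; the Double DQN target selects the action with the online network and evaluates it with the target network:

import torch

def dqn_target(r, s_next, done, target_net, gamma=0.99):
    # Q target: r + gamma * max_a' Q_target(s', a')  (the same network selects and evaluates)
    with torch.no_grad():
        return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    # Double Q target: q_net (Q1) picks the best action, target_net (Q2) evaluates it
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)
        return r + gamma * (1 - done) * target_net(s_next).gather(1, best_a).squeeze(1)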

Results - All Atari Games
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.

Results - Solves Overestimations
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

Pong - Up or Down
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.

Enduro - Left or Right

Advantage Function
A(s, a) = Q(s, a) - V(s). Learning action values inherently learns both state values and the relative value of each action in that state! We can use this to help generalize learning for the state values.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).

Dueling DQN
[Figure: dueling architecture with separate value and advantage streams combined by an aggregating module]
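A sketch of the aggregating module: the two streams are combined as Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)), which could replace the final fully connected layers of the DQN sketch above (layer sizes are illustrative):

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, in_features=512, num_actions=18):
        super().__init__()
        self.value = nn.Linear(in_features, 1)                 # V(s) stream
        self.advantage = nn.Linear(in_features, num_actions)   # A(s,a) stream

    def forward(self, h):
        v = self.value(h)                            # shape (batch, 1)
        a = self.advantage(h)                        # shape (batch, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)   # aggregating module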

Results
Where does V(s) attend to? Where does A(s,a) attend to?
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).

Results
Improvements of the dueling architecture over the Prioritized DDQN baseline, measured by the metric above, over 57 Atari games.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

Moving to more General and Complex Games
All games may not be representable using MDPs; some may be POMDPs: FPS shooter games, Scrabble, even Atari games. Is the entire history a solution? LSTMs!

Deep Recurrent Q-Learning
[Figure: the DQN convolutional stack followed by an LSTM (hidden states h1, h2, h3, ...) in place of the first fully connected (FC) layer]
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
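A minimal sketch of the idea: feed one frame per time step through the DQN convolutional stack, replace the first fully connected layer with an LSTM, and read Q(h_t, a) from the hidden state (layer sizes follow the earlier DQN sketch and are illustrative):

import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Sketch of a DRQN: the conv stack feeds an LSTM, and Q-values are read from h_t."""
    def __init__(self, num_actions=18, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)
        self.q = nn.Linear(hidden, num_actions)

    def forward(self, frames, state=None):
        # frames: (batch, time, 1, 84, 84) -- one observation per step instead of a 4-frame stack
        b, t = frames.shape[:2]
        z = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        h, state = self.lstm(z, state)
        return self.q(h), state      # Q(h_t, a) at every time step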

DRQN Results: Misses, Paddle Deflections, Wall Deflections
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).

Results - Robustness to partial observability
[Figure: performance under POMDP vs. MDP conditions]
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

Application of DRQN: Playing ‘Doom’
Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS games with deep reinforcement learning."

Doom Demo

How does DRQN help?
- Observe o_t instead of s_t: limited field of view.
- Instead of estimating Q(s_t, a_t), estimate Q(h_t, a_t) where h_t = LSTM(h_{t-1}, o_t).

Architecture: Comparison with Baseline DRQN

Training Tricks
Jointly training the DRQN model and game feature detection. What do you think is the advantage of this? The CNN layers capture relevant information about features of the game that maximise action value scores.

Modular Architecture
- Enemy spotted: the Action Network (a DRQN) takes over.
- All clear: the Navigation Network (a DQN) takes over.

Modular Network: Advantages
- Can be trained and tested independently.
- Both can be trained in parallel.
- Reduces the state-action space: faster training.
- Mitigates camper behavior: the “tendency to stay in one area of the map and wait for enemies”.

Rewards Formulation for Doom
What do you think? Positive rewards for kills and negative rewards for suicides. Small intermediate rewards:
- Positive reward for object pickup
- Negative reward for losing health
- Negative reward for shooting or losing ammo
- Small positive reward proportional to the distance travelled since the last step (so the agent avoids running in circles)
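As an illustration only, such a shaped reward could be assembled from game variables as below; every coefficient value is an invented placeholder, not the paper’s setting:

# Illustrative shaped reward following the list above; coefficients are placeholders.
def doom_reward(prev, cur, kills=0, suicides=0, pickups=0,
                kill_bonus=1.0, suicide_penalty=1.0, pickup_bonus=0.05,
                health_coef=0.05, ammo_coef=0.05, dist_coef=1e-4):
    r = kill_bonus * kills - suicide_penalty * suicides + pickup_bonus * pickups
    r -= health_coef * max(0.0, prev["health"] - cur["health"])   # losing health
    r -= ammo_coef * max(0.0, prev["ammo"] - cur["ammo"])         # shooting / losing ammo
    r += dist_coef * cur["distance_since_last_step"]              # discourages running in circles
    return r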

Performance with Separate Navigation Network

Results

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

h-DQN

Double DQN
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.

Dueling Networks
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).

How is this game different?
- Complex game environment
- Sparse and longer-range delayed rewards
- Insufficient exploration: we need temporally extended exploration
Solution: dividing the extrinsic goal into hierarchical intrinsic subgoals.
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.

Intrinsic Goals in Montezuma’s Revenge
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.

Hierarchy of DQNs
[Figure: the agent-environment loop, with a meta-controller DQN choosing goals and a controller DQN choosing actions]

Architecture Block for h-DQN
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.

h-DQN Learning Framework (1)
- V(s, g): value function of a state for achieving a given goal g ∈ G.
- Option: a multi-step action policy to achieve an intrinsic goal g ∈ G (options can also be primitive actions); a policy over options is used to achieve goal g.
- The agent learns which intrinsic goals are important and the correct sequence of such policies.
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.

h-DQN Learning Framework (2)
- Objective function for the meta-controller: maximise the cumulative extrinsic reward F_t.
- Objective function for the controller: maximise the cumulative intrinsic reward R_t.
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.

Training
- Two disjoint memories D1 and D2 for experience replay: experiences (s_t, g_t, f_t, s_{t+N}) for Q2 are stored in D2, and experiences (s_t, a_t, g_t, r_t, s_{t+1}) for Q1 are stored in D1.
- Different time scales: transitions for the controller (Q1) are collected at every time step, while transitions for the meta-controller (Q2) are collected only when the controller terminates, on reaching the intrinsic goal or at the end of the episode.
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
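A rough sketch of the resulting two-timescale data collection; meta_q, ctrl_q, env, D1, D2 and their methods are placeholders standing in for the meta-controller, controller, game interface and replay memories, not an actual API:

def run_episode(env, meta_q, ctrl_q, goals, D1, D2, epsilon=0.1):
    s = env.reset()
    done = False
    while not done:
        g = meta_q.select_goal(s, goals, epsilon)        # meta-controller (Q2) picks an intrinsic goal g
        s0, F_extrinsic = s, 0.0
        reached = False
        while not (done or reached):
            a = ctrl_q.select_action(s, g, epsilon)      # controller (Q1) acts toward g
            s_next, f, done = env.step(a)                # f: extrinsic reward from the game
            reached = env.goal_reached(g)
            r = 1.0 if reached else 0.0                  # intrinsic reward from the internal critic
            D1.store((s, a, g, r, s_next))               # controller transition: every time step
            F_extrinsic += f
            s = s_next
        D2.store((s0, g, F_extrinsic, s))                # meta-controller transition: per completed goal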

Results:
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN

References
Basic RL:
- David Silver's Introduction to RL lectures
- Pieter Abbeel's Artificial Intelligence - Berkeley (Spring 2015)
DQN:
- Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
- Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
DDQN:
- Hasselt, Hado V. "Double Q-learning." In Advances in Neural Information Processing Systems, pp. 2613-2621. 2010.
- Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
Dueling DQN:
- Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).

References
DRQN:
- Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
Doom:
- Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS games with deep reinforcement learning."
h-DQN:
- Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
Additional NLP/Vision applications:
- Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.
- Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." Proceedings of the IEEE International Conference on Computer Vision. 2015.
- Zhu, Yuke, et al. "Target-driven visual navigation in indoor scenes using deep reinforcement learning." arXiv preprint arXiv:1609.05143 (2016).

Deep Q-Learning for text-based games
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.

Text-Based Games: Back in the 1970s
Predecessors to modern graphical games; MUDs (Multi-User Dungeon games) are still prevalent.
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.

State Spaces and Action Spaces
- Hidden state space h ∈ H, but a textual description is given: ψ : H → S.
- Actions are commands (action-object pairs), A = {(a, o)}; T_{h,h'}(a, o) are the transition probabilities.
- Jointly learn state representations and control policies; the learned strategy/policy directly builds on the text interpretation.
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.

Learning Representations and Control Policies
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.
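A rough sketch in the spirit of the paper’s LSTM-DQN: an LSTM encodes the textual description into a state representation, from which separate heads score actions and objects (all dimensions here are illustrative):

import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    """Sketch of an LSTM-DQN for text games: encode the description, then score actions and objects."""
    def __init__(self, vocab_size, num_actions, num_objects, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.q_action = nn.Linear(hidden, num_actions)   # Q(s, a)
        self.q_object = nn.Linear(hidden, num_objects)   # Q(s, o)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))   # tokens: (batch, words) word indices
        s_rep = h.mean(dim=1)                  # mean-pool word states into a state representation
        return self.q_action(s_rep), self.q_object(s_rep)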

Results (1): Learnt useful representations for the game.
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.

Results (2):
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN. More applications: Text-based games, Object Detection, Indoor Navigation.

Object detection as an RL problem?
- States: the fc6 feature of a pretrained VGG19 for the current bounding box.
- Actions: relative translations c*(x2-x1), c*(y2-y1); scale; aspect ratio; a trigger action when IoU is high.
- Reward: based on whether each action improves the IoU with the ground-truth box, with a larger terminal reward for the trigger (the figure itself is not transcribed).
J. Caicedo and S. Lazebnik, ICCV 2015

Object detection as an RL problem?
State (s): the current bounding box, plus a history of past actions. Outputs: Q(s, a1 = scale up), Q(s, a2 = scale down), Q(s, a3 = shift left), ..., Q(s, a9 = trigger).
J. Caicedo and S. Lazebnik, ICCV 2015
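A sketch of the slide’s framing: a Q-network scores the 9 box-transformation actions from the region’s fc6 feature concatenated with a short history of past actions (layer sizes and history length are assumptions for illustration):

import torch
import torch.nn as nn

NUM_ACTIONS = 9          # translations, scale, aspect-ratio changes, and a trigger
HISTORY_STEPS = 10       # illustrative length of the past-action history

class BoxRefinementQ(nn.Module):
    """Sketch: Q-values over box-transformation actions from a region feature plus action history."""
    def __init__(self, feat_dim=4096):   # fc6 of VGG19 is 4096-dimensional
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + HISTORY_STEPS * NUM_ACTIONS, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, NUM_ACTIONS),
        )

    def forward(self, region_feat, action_history):
        # region_feat: (batch, 4096) fc6 feature of the current box
        # action_history: (batch, HISTORY_STEPS, NUM_ACTIONS) one-hot past actions
        return self.mlp(torch.cat([region_feat, action_history.flatten(1)], dim=1))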

Object detection as an RL problem? Fine details:
- Class-specific, attention-action model.
- Does not follow a fixed sliding-window trajectory; the trajectory is image dependent.
- Uses a 16-pixel neighbourhood to incorporate context.
J. Caicedo and S. Lazebnik, ICCV 2015

Object detection as an RL problem?
J. Caicedo and S. Lazebnik, ICCV 2015

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN. More applications: Text-based games, Object Detection, Indoor Navigation.

Navigation as an RL problem?
- States: ResNet-50 features (of the current frame and the target frame).
- Actions: forward/backward 0.5 m, turn left/right 90 deg, trigger.
“Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning”, Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, Ali Farhadi

Navigation as an RL problem?
State (s): the current frame and the target frame. Outputs: Q(s, a1 = forward), Q(s, a2 = backward), Q(s, a3 = turn left), Q(s, a4 = turn right), ..., Q(s, a6 = trigger).
“Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning”, Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, Ali Farhadi
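A minimal sketch of the Q(s, a) framing used on these slides, scoring actions from ResNet-50 features of the current and target frames; the paper’s full model is richer, and the dimensions and exact action list here are assumptions:

import torch
import torch.nn as nn

ACTIONS = ["forward", "backward", "turn left", "turn right", "trigger"]  # per the slide; exact set assumed

class TargetDrivenQ(nn.Module):
    """Sketch: Q-values over navigation actions from current-frame and target-frame features."""
    def __init__(self, feat_dim=2048, num_actions=len(ACTIONS)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, current_feat, target_feat):
        # current_feat, target_feat: (batch, 2048) ResNet-50 features
        return self.mlp(torch.cat([current_feat, target_feat], dim=1))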

Navigation as an RL problem?
Simulated environment vs. real environment.

Today’s takeaways: Bonus RL recap, Function Approximation, Deep Q Network, Double Deep Q Network, Dueling Networks, Recurrent DQN, Solving “Doom”, Hierarchical DQN. More applications: Text-based games, Object Detection, Indoor Navigation.

Q-Learning Overestimation: Function Approximation
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.

Q-Learning Overestimation: Intuition [Jensen’s Inequality]
What we estimate: E[max_a Q(s,a)]. What we want: max_a E[Q(s,a)]. Since max is convex, Jensen’s inequality gives E[max_a Q(s,a)] ≥ max_a E[Q(s,a)], so taking the max over noisy Q estimates is biased upward.

Double Q-Learning: Function Approximation
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.

Results
Mean and median scores across all 57 Atari games, measured in percentages of human performance.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).

Results: Comparison to 10-frame DQN
DRQN captures in one frame (and its history state) what DQN captures in a stack of 10 for Flickering Pong:
- 10-frame DQN conv-1 captures paddle information.
- 10-frame DQN conv-2 captures paddle and ball-direction information.
- 10-frame DQN conv-3 captures paddle, ball direction, velocity and deflection information.
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).

Results: Comparison to 10-frame DQN
Scores are comparable to 10-frame DQN, outperforming in some games and losing in some.
Hausknecht, Matthew, and Peter Stone. "Deep recurrent q-learning for partially observable mdps." arXiv preprint arXiv:1507.06527 (2015).
