Deep Reinforcement Learning For Continuous Control


Deep Reinforcement Learning for Continuous Control
Deep Reinforcement Learning für kontinuierliche Regelungen

Bachelor thesis by Simon Ramstedt from Frankfurt am Main
Date of submission:
1st reviewer: Prof. Dr. Gerhard Neumann
2nd reviewer: Prof. Dr. Jan Peters
3rd reviewer: M.Sc. Simone Parisi


Declaration

I hereby certify that I have written this Bachelor thesis without the help of third parties, using only the cited sources and aids. All passages taken from sources are marked as such. This thesis has not previously been submitted in the same or a similar form to any examination authority.

Darmstadt, 1 May 2016
(Simon Ramstedt)

Abstract

Reinforcement learning is a mathematical framework for agents to interact intelligently with their environment. In this field, real-world control problems are particularly challenging because of the noise and the high dimensionality of input data (e.g., visual input). In the last few years, deep neural networks have been successfully used to extract meaning from such data. Building on these advances, deep reinforcement learning achieved stunning results in the field of artificial intelligence, being able to solve complex problems like Atari games [1] and Go [2]. However, in order to apply the same methods to real-world control problems, deep reinforcement learning has to be able to deal with continuous action spaces. In this thesis, Deep Deterministic Policy Gradients, a deep reinforcement learning method for continuous control, has been implemented, evaluated and put into context to serve as a basis for further research in the field.

Zusammenfassung

Reinforcement-Learning ist ein mathematischer Rahmen, um intelligent mit ihrer Umgebung interagierende Agenten zu erzeugen. Regelungsprobleme in realen Umgebungen sind dabei wegen stark verrauschten, hochdimensionalen Eingabedaten (z.B. Video) besonders anspruchsvoll. In den letzten Jahren wurden dafür jedoch erfolgreich neuronale Netze benutzt. Deep-Reinforcement-Learning (Reinforcement-Learning mit neuronalen Netzen) hatte bereits große Erfolge in der künstlichen Intelligenz und war in der Lage, komplexe Probleme wie Go [2] oder Atari-Spiele [1] zu lösen. Um diese Methoden aber in der echten Welt anwenden zu können, muss Deep-Reinforcement-Learning mit kontinuierlichen Handlungsräumen umgehen können. Deshalb wurde in dieser Thesis Deep Deterministic Policy Gradients, eine Deep-Reinforcement-Learning-Methode für kontinuierliche Regelungen, implementiert, evaluiert und in Bezug zu anderen Methoden gesetzt.

Acknowledgments

I thank my supervisors Simone Parisi and Gerhard Neumann for their patience and helpful discussions, with special thanks to Simone for his support during time-critical periods.

Contents

1 Introduction
2 Foundations
  2.1 Deep Learning and Neural Networks
  2.2 Reinforcement Learning
3 Deep Reinforcement Learning
  3.1 Deep Q-Network (DQN)
  3.2 Deep Deterministic Policy Gradient (DDPG)
4 Implementation
  4.1 DDPG
  4.2 The Cart-Pole Problem
5 Evaluation
  5.1 Reference Results Without Batch Normalization
  5.2 Applying Batch Normalization
  5.3 Disabling Target Network and Replay Memory
  5.4 Sparse Reward Functions
6 Conclusion and Future Work
  6.1 Data Efficiency
  6.2 Exploration
  6.3 Imitation
  6.4 Curriculum Learning
Bibliography

Figures and Tables

List of Figures
2.1 Neuron
2.2 Activation functions
2.3 Deep Neural Network
2.4 Automatic Differentiation
2.5 Cost Surfaces
2.6 Agent and Environment
3.1 DQN
3.2 DDPG
4.1 Ezex
4.2 Network Layouts
4.3 Cart-pole
5.1 Cart-pole trajectories
5.2 Returns
6.1 Phi-network

List of Tables
4.1 DDPG Hyperparameters

1 Introduction

Reinforcement learning is a mathematical framework for agents to interact intelligently with their environment. Unlike supervised learning, where a system learns with the help of labeled data, reinforcement learning agents learn how to act by trial and error, only receiving a reward signal from their environment. A field where reinforcement learning has been prominently successful is robotics [3]. However, real-world control problems are also particularly challenging because of the noise and high dimensionality of input data (e.g., visual input). In recent years, in the field of supervised learning, deep neural networks have been successfully used to extract meaning from this kind of data. Building on these advances, deep reinforcement learning was used to solve complex problems like Atari games and Go. Mnih et al. [1] built a system with fixed hyperparameters able to learn to play 49 different Atari games only from raw pixel inputs. However, in order to apply the same methods to real-world control problems, deep reinforcement learning has to be able to deal with continuous action spaces. Discretizing continuous action spaces would scale poorly, since the number of discrete actions grows exponentially with the dimensionality of the action. Furthermore, having a parametrized policy can be advantageous because it can generalize in the action space. Therefore, in this thesis we study a state-of-the-art deep reinforcement learning algorithm, Deep Deterministic Policy Gradients. We provide a theoretical comparison to other popular methods, evaluate its performance, identify its limitations and investigate future directions of research.

The remainder of the thesis is organized as follows. We start by introducing the field of interest, machine learning, focusing our attention on deep learning and reinforcement learning. We continue by describing in detail the two main algorithms at the core of this study, namely Deep Q-Network (DQN) and Deep Deterministic Policy Gradients (DDPG). We then provide implementation details of DDPG and our test environment, followed by a description of benchmark test cases. Finally, we discuss the results of our evaluation, identifying limitations of the current approach and proposing future avenues of research.

2 Foundations

Machine learning is an approach to design and optimize information processing systems directly from data. As in real-world problems the training data is typically limited and does not cover all possible scenarios, the system has to learn how to behave also in the presence of data it has not been trained on. In this regard, overfitting is one of the biggest challenges in machine learning: it consists in having a system that strictly adapts its behavior to the training data, without being able to generalize to different, unseen input data.

Machine learning is usually divided into supervised learning, reinforcement learning and unsupervised learning. Supervised learning systems learn input-output mappings from a dataset of desired input-output pairs, i.e., they are explicitly told how to behave. As we will see in the next section, most deep learning and neural network research so far has been done in the supervised setting. Reinforcement learning systems, on the contrary, learn how to behave by receiving feedback from the environment encoding a specific goal. For example, a trash-collecting robot would receive a reward for collecting trash or would be punished for hitting a wall. It is common for reinforcement learning to exploit supervised learning techniques. Unsupervised learning, on the other hand, is about finding patterns in data and will not be discussed further in this thesis. In the next sections we describe in more detail one of the most prominent supervised learning techniques, namely deep learning and neural networks.

2.1 Deep Learning and Neural Networks

Deep learning is an area of machine learning concerned with deep neural networks. Neural networks are non-linear parametric functions loosely inspired by the human brain. They are composed of layers of units called neurons, as shown in Figure 2.1.

Figure 2.1: A canonical artificial neuron. The sum of the inputs $x_i$ weighted by $w_i$ and the bias $b$ is fed into the activation function $f$, producing the neuron output $y = f(\sum_i w_i x_i + b)$. The weights $w_i$ represent a pattern in the neuron's input space. The closer the input is to that pattern, the higher the output of the neuron will be.

Multiple parallel neurons form a layer, each characterized by an output called activation, $y_j = f(\sum_i w_{i,j} x_i)$. Single-layer (shallow) neural networks, called perceptrons, were first described by Rosenblatt in 1958 [4]. In deep neural networks, however, multiple layers are stacked on top of each other. Krizhevsky et al. [5] achieved state-of-the-art computer vision results with an 8-layer neural network with 500,000 neurons and 60 million parameters. Since the outputs of a layer are non-linear features of the input, the more layers are in the network, the more non-linear and abstract those features are compared to the original input. This hierarchy of features enables neural networks to approximate complex functions.
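The neuron computation above can be written in a few lines. The following is a minimal NumPy sketch, not part of the thesis code; inputs, weights and the choice of tanh are purely illustrative:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Canonical artificial neuron: y = f(sum_i w_i * x_i + b)."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_i
w = np.array([0.1, 0.4, -0.2])   # weights w_i: a pattern in the input space
b = 0.05                         # bias b
y = neuron(x, w, b)              # scalar activation y
```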

Figure 2.2: Examples of activation functions f. (a) Rectified Linear Unit (ReLU): f(x) = max(0, x). (b) Hyperbolic tangent (tanh): f(x) = tanh(x). Usually, the function is monotonically increasing and saturates for low / negative inputs.

Neural networks are usually initialized randomly and then trained on a dataset with regard to a cost function. Typically, cost functions are distance measures between the desired outputs (targets $t_i$) and the actual network outputs ($y_i$). A common and simple cost function is the mean squared error $\frac{1}{n}\sum_{i=1}^{n} (y_i - t_i)^2$. Training a neural network consists in optimizing the network parameters $\theta = (w_1, b_1, \dots, w_n, b_n)$ to minimize the cost on the training dataset. However, training a neural network can be challenging. First, as the parameter space can be non-convex, neural networks are usually optimized with techniques that only guarantee convergence to a local optimum (e.g., gradient descent). Second, the number of parameters grows with the complexity of the network. As deep, complex networks are typically required to learn difficult functions, the training can be highly demanding, both in terms of computational time and data. Nevertheless, in practice neural networks can be optimized quite reliably by gradient descent, as we will see in the next section.

Figure 2.3: Multi-layer (deep) neural network. Neurons in higher layers represent more abstract features of the network input.
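As a small, self-contained illustration of the forward pass and the mean squared error cost described above, the following NumPy sketch stacks a few fully connected layers; layer sizes, initialization scale and targets are made-up values, not the networks used in this thesis:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer(x, W, b, f=relu):
    """One fully connected layer: y = f(W x + b)."""
    return f(W @ x + b)

def mse(y, t):
    """Mean squared error cost: 1/n * sum_i (y_i - t_i)^2."""
    return np.mean((y - t) ** 2)

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 2]                       # a small "deep" network
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)                       # network input
h = x
for W, b in params[:-1]:
    h = layer(h, W, b)                       # hidden layers: non-linear features
W, b = params[-1]
y = layer(h, W, b, f=lambda v: v)            # linear output layer
cost = mse(y, np.array([1.0, -1.0]))         # cost against the targets t_i
```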

2.1.1 Stochastic Gradient Descent

Using the gradient for optimization (i.e., first-order optimization) is common to many machine learning algorithms with a high number of parameters, since zeroth-order methods (e.g., genetic algorithms) need many function evaluations and higher-order methods are too expensive per evaluation. Therefore, gradient descent is a key element of deep learning. It consists in following the direction pointed by the gradient of the cost function with respect to the parameters of the system. In the case of neural networks, the gradient of the cost function with respect to the network parameters θ tells us how to change the parameters in order to reduce the cost. More specifically, given the gradient of the cost with respect to the parameters $\frac{dC}{d\theta}$, we can update the parameters as $\theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{dC}{d\theta}$, where α is a learning rate. However, computing the cost and the gradient on large datasets is expensive. Stochastic gradient descent alleviates this issue by only computing the cost and the gradient for a small subset (minibatch) of the dataset. Like gradient descent, stochastic gradient descent is guaranteed to converge to a local minimum.

In high-dimensional optimization spaces like those of neural networks, optimization often gets stuck near saddle points or in valleys where gradients are only high in directions orthogonal to the valley, in which no progress can be made. This issue can be alleviated by having adaptive learning rates for each parameter. A recent version of stochastic gradient descent using these techniques, and used in this thesis, is ADAM [6]. Another interesting property of ADAM is that its step size is independent of the scale of the gradients.

To compute the gradient in a neural network, we can use automatic differentiation, a technique to compute derivatives in computational graphs in a modular way. It is based on the chain rule, which says that the derivative of a composition $y = f(g(x))$ of two functions $g: x \mapsto w$ and $f: w \mapsto y$ can be decomposed into $\frac{dy}{dx} = \frac{dy}{dw}\,\frac{dw}{dx} = f'(w)\, g'(x)$. In a computational graph, we can compute the derivative of any edge y with respect to any other edge x by first computing the derivative of the last node (here $f'(w)$) and then the derivatives of all previous nodes (here $g'(x)$), decomposing them in the same way. Derivatives are then passed backwards through the computational graph. An example is depicted in Figure 2.4.

Figure 2.4: Learning iteration in a computational graph. First, the network output and cost C are computed (left). Subsequently, a backward pass is executed to compute derivatives (right).
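A minimal sketch of such training in practice: minibatch stochastic gradient descent on a single linear neuron with MSE cost, with the gradient written out by hand via the chain rule instead of automatic differentiation. The dataset and hyperparameters are invented for illustration; ADAM would additionally keep per-parameter running averages of the gradient and its square.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # dataset inputs
T = X @ np.array([1.5, -2.0, 0.5]) + 0.3        # dataset targets
w, b = np.zeros(3), 0.0                         # parameters theta = (w, b)
alpha, m = 0.1, 32                              # learning rate, minibatch size

for step in range(500):
    idx = rng.choice(len(X), size=m, replace=False)   # random minibatch
    x, t = X[idx], T[idx]
    y = x @ w + b                                     # forward pass
    err = y - t
    grad_w = 2.0 / m * x.T @ err                      # dC/dw via the chain rule
    grad_b = 2.0 / m * err.sum()                      # dC/db
    w -= alpha * grad_w                               # theta_new = theta_old - alpha * dC/dtheta
    b -= alpha * grad_b
```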

2.1.2 Normalization

Neural network training is highly sensitive to the mean and variance of the data. They affect the cost, the gradients, the activations and the operating region of the activation functions. This makes it hard to select good learning rates and parameter initializations and to set other hyperparameters. Also, scale differences between dimensions within a dataset skew the cost surfaces (Figure 2.5a), resulting in a wrong direction of the gradients. Normalizing the data to zero mean and unit variance alleviates these issues (Figure 2.5b). Furthermore, normalization is not only important for the network inputs and targets but also for the activations inside the network. During training, each layer has to constantly adapt to its changing input distribution caused by the optimization of the previous layers. To alleviate this issue, Ioffe and Szegedy [7] proposed batch normalization, a technique ensuring zero mean and unit variance of the activations for each minibatch, which reduced training time by an order of magnitude.

Figure 2.5: Examples of contour plots of cost surfaces of one neuron with two weights. (a) Skewed cost surface due to unnormalized data: the gradient (red arrow) points in a direction almost orthogonal to the optimum. (b) Normalized cost surface: the red arrows (gradients at each timestep) correctly point towards the optimum.

In the next section we extend the discussion of deep learning techniques to reinforcement learning.

2.2 Reinforcement Learning

Reinforcement learning is concerned with learning through interaction with an environment. At every timestep, a learner or decision-maker called agent executes an action, and the environment in turn yields a new observation and a reward, as shown in Figure 2.6. The task of the agent is to maximize the sum of rewards received during the interaction with the environment.

Figure 2.6: Agent-environment interaction in reinforcement learning. The agent can be anything interacting with an environment and able to improve its behavior (a human, an animal, a control system). Everything not learned is part of the environment (sensors, motors, rewards).

More formally, we can define a state $s \in S \subseteq \mathbb{R}^{d_s}$ as all the information the agent has about the environment at a given timestep. Generally, this information might not include the full state of the environment (e.g., in a card game the state would include the agent's hand but not the opponents' hands, even though they are part of the full state of the game).

An action $a \in A \subseteq \mathbb{R}^{d_a}$ encodes how the agent can interact with the environment. The mapping from states to actions is called the policy $\pi: S \to A$. A policy can either be stochastic (e.g., a probability distribution over actions conditioned on states) or deterministic.

The reward $r \in \mathbb{R}$ is a feedback informing the agent about the immediate quality of its actions. Typically, the function generating the rewards is defined by an expert and can depend on the last state and action, $r: S \times A \to \mathbb{R}$. The reward signal can encode the goal of the agent at different levels. For instance, in chess the agent can be rewarded at each capture or only at the end of the match.

The goal of the agent is to maximize the sum of the rewards received from the environment, namely the return $R = \sum_t \gamma^t r_t$. The discount factor $\gamma \in (0, 1]$ guarantees the convergence of the sum if the time horizon is not finite (infinite horizon).
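The interaction loop and the discounted return can be summarized in a short sketch. Here `env` and `policy` are hypothetical stand-ins with a reset/step interface, not part of the thesis code:

```python
# Agent-environment loop accumulating the discounted return R = sum_t gamma^t r_t.
# `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done);
# `policy` maps a state to an action (pi: S -> A).
def rollout(env, policy, gamma=0.99, horizon=1000):
    s = env.reset()
    R, discount = 0.0, 1.0
    for t in range(horizon):
        a = policy(s)                 # select action
        s, r, done = env.step(a)      # environment yields observation and reward
        R += discount * r             # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
    return R
```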

In the next section, we present a brief overview of classical approaches to solving reinforcement learning problems.

2.2.1 Learning Approaches

Reinforcement learning algorithms can be categorized as value-based, policy-based, or a combination of the two. Value-based methods explicitly learn the value of states and use it to select the action that leads to the highest-valued state. In this setting, the value function $V^\pi$ is defined as the expected return of the agent being in state $s_t$ and then following the policy π, while the action-value function $Q^\pi$ is defined as the expected return of the agent being in state $s_t$, executing action $a_t$ and then following the policy π. The advantage function $A^\pi$ connects both:

$$V^\pi(s_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E;\; a_{i \ge t} \sim \pi} \Big[ \sum_{i=t}^{T} \gamma^{(i-t)}\, r(s_i, a_i) \Big],$$
$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E;\; a_{i > t} \sim \pi} \Big[ \sum_{i=t}^{T} \gamma^{(i-t)}\, r(s_i, a_i) \Big],$$
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t).$$

For deterministic policies, Q can be formulated recursively by the so-called Bellman equation

$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E} \big[ r_t + \gamma\, Q^\pi(s_{t+1}, \pi(s_{t+1})) \big].$$

Knowing $Q^{\pi^*}(s_t, a_t)$ for each state-action pair, the best policy $\pi^*$ selects the action with the highest action value at every timestep, i.e., $\pi^*(s_t) = \operatorname{argmax}_a Q^{\pi^*}(s_t, a)$. In practice, we know neither $Q^{\pi^*}$ nor $\pi^*$ in the first place. However, even when starting from a random Q, it is possible to iteratively update Q by exploiting the Bellman equation and to converge to $Q^{\pi^*}$ (and therefore to $\pi^*$). This powerful approach is a key element of the recent Deep Q-Network (DQN), which will be discussed further in Section 3.1.

In continuous action spaces, however, maximizing over Q is infeasible, as there are infinitely many actions to consider. Even discretizing the action space becomes intractable already for relatively low-dimensional action spaces (curse of dimensionality). To overcome this issue, it might be better to directly learn a (parametrized) policy instead of a value function. Algorithms following this approach are called policy-based. One of the most successful classes of policy-based algorithms is policy gradient [8, 9, 10]. Policy gradient approaches directly optimize the policy parameters ξ by following the direction of the gradient of the expected return with respect to the policy parameters ($\nabla_\xi R$), which can be directly estimated from samples.

A mixture of the value-based and policy-based approaches are actor-critic algorithms. These approaches make use of both a parametric policy (actor) and a value function estimator (critic), which is used to improve the policy. A very recent actor-critic approach for learning deterministic policies is Deterministic Policy Gradients (DPG) [11]. DPG uses a differentiable action-value function approximator to obtain the policy gradient by taking the derivative of its output with respect to the action input, $dQ/da$. The policy gradient is then computed as $\nabla_\xi Q = \nabla_a Q \cdot \nabla_\xi \pi$. Deep Deterministic Policy Gradient (DDPG), a DPG with neural network function approximators, will be discussed further in Section 3.2.
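As a concrete, minimal instance of the value-based idea of iteratively applying the Bellman equation, the following tabular sketch performs the classic Q-learning update on indexed states and actions; the table sizes and step size are illustrative and unrelated to the experiments in this thesis:

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))      # arbitrary initial Q
gamma, alpha = 0.99, 0.1

def bellman_update(Q, s, a, r, s_next):
    """Move Q(s, a) towards the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_policy(Q, s):
    """pi*(s) = argmax_a Q(s, a)."""
    return int(np.argmax(Q[s]))
```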
2.2.2 Exploration and Exploitation

One of the biggest issues in reinforcement learning is the tradeoff between exploration and exploitation. Acting greedily (exploitation) with respect to an approximated function (e.g., the Q-function) and always choosing the current best action might prevent the agent from discovering new, better states and therefore prevent improvement of the policy. On the contrary, excessive exploration might slow down learning or even result in harmful policies. A tradeoff is therefore necessary. Usually, noise is added to the actions during training. In the case of discrete actions, the ε-greedy policy is a common solution: the agent acts randomly with probability ε and greedily with probability 1 − ε. In the case of continuous actions, Gaussian noise can instead be added. In this thesis, we will use these simple strategies. However, the exploration-exploitation tradeoff is still an open problem in reinforcement learning. Alternative exploration strategies might consist in artificially rewarding exploration or even leaving the decision about exploration completely to the agent. We will come back to this topic in Section 6.2.
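Both simple strategies can be sketched in a few lines; the ε and σ values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Discrete actions: act randomly with probability epsilon, greedily otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def gaussian_exploration(action, sigma=0.1):
    """Continuous actions: add zero-mean Gaussian noise to the policy output."""
    return action + rng.normal(scale=sigma, size=np.shape(action))
```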

3 Deep Reinforcement Learning

As discussed in the previous section, reinforcement learning heavily relies on either approximating the Q-function (in the case of value-based algorithms) or on the policy parameterization (in the case of policy-based algorithms). In both cases, the use of rich function approximators or policies allows reinforcement learning to scale to complex problems. Neural networks have been successfully used as function approximators in supervised learning and are differentiable, a useful quality for many reinforcement learning algorithms. In this section, we focus on two recent reinforcement learning algorithms. The first one, Deep Q-Network (DQN) [1, 12], is a value-based algorithm able to play 49 Atari games from pixels better than human experts. The second one, Deep Deterministic Policy Gradients (DDPG) [13], is an actor-critic algorithm that extends DQN to continuous actions.

3.1 Deep Q-Network (DQN)

DQN is a value-based algorithm using a neural network Q(s, a | θ) to approximate the optimal action value Q* of each action in a given state. The training targets for Q are computed via the Bellman equation, $y_j = r_j + \gamma\, Q(s_{j+1}, \pi(s_{j+1} \mid \theta') \mid \theta')$. The network is trained to minimize the mean squared error with respect to the Q-function, i.e.,

$$C(\theta \mid \theta') = \frac{1}{m} \sum_j \Big( \underbrace{r_j + \gamma\, Q\big(s_{j+1}, \pi(s_{j+1} \mid \theta') \mid \theta'\big)}_{y_j} - Q\big(s_j, \pi(s_j \mid \theta) \mid \theta\big) \Big)^2$$
$$\qquad\; = \frac{1}{m} \sum_j \Big( \underbrace{r_j + \gamma \max_a Q\big(s_{j+1}, a \mid \theta'\big)}_{y_j} - \max_a Q\big(s_j, a \mid \theta\big) \Big)^2.$$

However, the dependence of the Q targets on Q itself can lead to instabilities or even divergence during learning. Having a second set of network parameters θ' = LP(θ), where LP is a low-pass filter (e.g., an exponential moving average), stabilizes the learning.

Additional instabilities can arise from training directly on the incoming states and rewards since, unlike in supervised learning, the input data (state-action pairs) is highly correlated, being part of a trajectory. Furthermore, the policy, and therefore the data distribution, might change quickly as Q evolves. DQN solves both issues by storing all transitions (s_t, a_t, r_t, s_{t+1}) in a replay memory dataset D and then learning on minibatches of random transitions from D. This trick breaks the correlation of the input data and smoothes the changes in the input distribution. It also increases data efficiency by allowing the agent to perform multiple gradient descent steps on the same transition.

Figure 3.1: Architecture of DQN. The Q-network outputs an estimate of the action-value function for each action. Subsequently, the policy π chooses the action with the highest value.

Finally, to ensure sufficient exploration, a simple ε-greedy policy is used on top of the greedy policy π(s | θ) = argmax_a Q(s, a | θ). This off-policy approach is possible because the algorithm does not learn on full trajectories but only on isolated transitions. The complete algorithm is shown in Algorithm 1.
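Before the full listing, a schematic sketch of the two stabilization tricks, the replay memory and the low-pass-filtered target parameters. Parameters are treated as flat NumPy vectors here, which is a simplification of the actual networks; names and sizes are illustrative:

```python
import random
from collections import deque

import numpy as np

replay = deque(maxlen=100_000)                  # replay memory D

def store(s, a, r, s_next):
    replay.append((s, a, r, s_next))            # keep every transition

def sample_minibatch(m=32):
    return random.sample(replay, m)             # random transitions break correlation

def update_target(theta, theta_target, tau=0.001):
    """theta' <- LP(theta): exponential moving average of the online weights."""
    return tau * theta + (1 - tau) * theta_target

theta, theta_prime = np.ones(4), np.zeros(4)
theta_prime = update_target(theta, theta_prime)  # theta' slowly tracks theta
```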

Algorithm 1: DQN
 1: Initialize replay memory D
 2: Initialize Q with random weights θ
 3: Initialize target weights θ' ← θ
 4: for t = 1 to T do
 5:   Select action a_t = ε-greedy[argmax_a Q(s_t, a | θ)]
 6:   Execute a_t and observe reward r_t and next state s_{t+1}
 7:   Store transition (s_t, a_t, r_t, s_{t+1}) in D
 8:   Sample a random minibatch of m transitions from D
 9:   Set targets y_j = r_j + γ max_a Q(s_{j+1}, a | θ')
10:   Perform gradient descent on the cost C = (1/m) Σ_j (y_j − Q(s_j, a_j | θ))² with respect to θ
11:   Update θ' ← LP(θ)
12: end

3.2 Deep Deterministic Policy Gradient (DDPG)

As discussed in the introduction, a parametrized policy is advantageous for control because it allows for learning in continuous action spaces. Since DDPG is an actor-critic policy gradient algorithm, there is a policy network π(s | ξ) with parameters ξ in addition to the action-value network Q(s, a | θ) with parameters θ.

The training targets for Q are computed as in DQN, with the only difference that π now depends on ξ. Using the mean squared error, we derive the cost function for Q:

$$C(\theta \mid \theta', \xi') = \frac{1}{m} \sum_j \Big( \underbrace{r_j + \gamma\, Q\big(s_{j+1}, \pi(s_{j+1} \mid \xi') \mid \theta'\big)}_{y_j} - Q\big(s_j, \pi(s_j \mid \xi) \mid \theta\big) \Big)^2.$$

As the targets depend on the explicit policy network, we also need target policy parameters ξ' = LP(ξ). Here, LP will be an exponential moving average with update rules ξ' ← τξ + (1 − τ)ξ' and θ' ← τθ + (1 − τ)θ'.

The policy is trained via the policy gradient

$$\nabla_\xi\, Q\big(s_j, \pi(s_j \mid \xi) \mid \theta\big) = \nabla_a Q\big(s_j, \pi(s_j \mid \xi) \mid \theta\big) \cdot \nabla_\xi\, \pi(s_j \mid \xi),$$

that is, the negative action value serves as the cost function for π: C(ξ | θ) = −Q(s_j, π(s_j | ξ) | θ).

As actions are continuous, correlated Gaussian noise M_t is added to the actions to ensure exploration. More specifically, M_{t+1} = ϑ · M_t + N(0, σ), where N(0, σ) is a normally distributed random variable and ϑ a hyperparameter controlling the frequency of the noise. Again, this off-policy approach is possible because the algorithm does not learn on trajectories but only on isolated transitions. Algorithm 2 shows the complete learning procedure.

Figure 3.2: Architecture of DDPG. The policy π is trained by back-propagating the Q-gradient with respect to the action a.
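The correlated exploration noise M_{t+1} = ϑ · M_t + N(0, σ) used in Algorithm 2 below can be generated as follows; the values of ϑ, σ and the action dimension are illustrative, not the thesis hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noise(dim, vartheta=0.9, sigma=0.2):
    """Temporally correlated noise process M_{t+1} = vartheta * M_t + N(0, sigma)."""
    M = np.zeros(dim)
    def step():
        nonlocal M
        M = vartheta * M + rng.normal(scale=sigma, size=dim)
        return M
    return step

noise = make_noise(dim=1)
# during training: a_t = pi(s_t | xi) + noise()
```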

Algorithm 2: DDPG
 1: Initialize replay memory D
 2: Initialize π with random weights ξ and target weights ξ' ← ξ
 3: Initialize Q with random weights θ and target weights θ' ← θ
 4: for t = 1 to T do
 5:   Select action a_t = π(s_t | ξ) + M_t
 6:   Execute a_t and observe reward r_t and next state s_{t+1}
 7:   Store transition (s_t, a_t, r_t, s_{t+1}) in D
 8:   Sample a random minibatch of m transitions from D
 9:   Set Q targets y_j = r_j + γ Q(s_{j+1}, π(s_{j+1} | ξ') | θ')
10:   Perform gradient descent on the cost C = (1/m) Σ_j (y_j − Q(s_j, a_j | θ))² with respect to θ
11:   Perform gradient ascent on Q(s_j, π(s_j | ξ) | θ) with respect to ξ
12:   Update θ' ← LP(θ)
13:   Update ξ' ← LP(ξ)
14: end
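To make the updates in Algorithm 2 concrete, the following NumPy sketch performs a single DDPG update with linear function approximators, so that every gradient can be written by hand. It illustrates the equations only and is not the thesis implementation; all shapes, data and hyperparameters are invented:

```python
import numpy as np

# Q(s, a | theta) = theta^T [s; a],  pi(s | xi) = xi^T s   (linear toy models)
rng = np.random.default_rng(0)
ds, da = 3, 1                               # state / action dimensions
theta = rng.normal(size=ds + da)            # critic weights
xi = rng.normal(size=(ds, da))              # actor weights
theta_t, xi_t = theta.copy(), xi.copy()     # target parameters theta', xi'
gamma, tau, lr = 0.99, 0.01, 1e-3

def Q(s, a, th):
    return th @ np.concatenate([s, a])

def pi(s, x):
    return s @ x

# one minibatch of stored transitions (s, a, r, s')
m = 4
S = rng.normal(size=(m, ds)); A = rng.normal(size=(m, da))
R = rng.normal(size=m); S1 = rng.normal(size=(m, ds))

# critic: targets y_j = r_j + gamma * Q(s_{j+1}, pi(s_{j+1} | xi') | theta')
y = np.array([R[j] + gamma * Q(S1[j], pi(S1[j], xi_t), theta_t) for j in range(m)])
grad_theta = np.zeros_like(theta)
for j in range(m):
    feat = np.concatenate([S[j], A[j]])
    grad_theta += 2.0 / m * (Q(S[j], A[j], theta) - y[j]) * feat
theta -= lr * grad_theta                    # gradient descent on the critic cost

# actor: chain rule  grad_xi Q = dQ/da * dpi/dxi
grad_xi = np.zeros_like(xi)
for j in range(m):
    dQ_da = theta[ds:]                      # for a linear critic, dQ/da is constant
    grad_xi += np.outer(S[j], dQ_da) / m
xi += lr * grad_xi                          # gradient ascent on Q

# target updates (exponential moving average / low-pass filter)
theta_t = tau * theta + (1 - tau) * theta_t
xi_t = tau * xi + (1 - tau) * xi_t
```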

4 Implementation

A big challenge in deep reinforcement learning implementations is the efficiency of the neural network routines. The most critical parts are the inner products of inputs and weights and the respective derivatives in each layer. We first coded in MATLAB, a popular machine learning tool providing efficient linear algebra routines. Due to the lack of automatic differentiation, we implemented gradient propagation routines ourselves, as well as the RMSProp and ADAM optimizers. The code can be found on the CD of this thesis.

However, due to the high computational demands of neural networks, we also developed a TensorFlow implementation. TensorFlow [14] is an open-source Python/C++ library providing fast routines (roughly 30 times faster than MATLAB in our experience) for deep learning, accomplished through GPU support. It features automatic differentiation and therefore allows for much more flexible network and algorithm design. Additionally, TensorFlow provides a wide variety of built-in optimizers (e.g., ADAM) and operations (e.g., batch normalization). The results shown in this thesis have been produced by the TensorFlow implementation, which can be found on the CD of this thesis.

Another concern in setting up the testing framework regarded the ability to run several experiments on computing clusters with job schedulers while having a changing codebase. For this purpose, we wrote ezex, a small framework that provides basic operations (e.g., starting and aborting jobs) and a visualizat
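To illustrate the automatic differentiation and built-in optimizers mentioned above, here is a minimal TensorFlow 1.x-style example; it is illustrative only, not the thesis code, and all variable names and sizes are made up:

```python
import tensorflow as tf   # TensorFlow 1.x-style graph API

# A one-neuron "network" with MSE cost; the gradient of the cost with respect
# to the weights is derived by the framework, no hand-written backward pass.
x = tf.placeholder(tf.float32, [None, 3])
t = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([3, 1], stddev=0.1))
y = tf.tanh(tf.matmul(x, w))
cost = tf.reduce_mean(tf.square(y - t))
grad_w = tf.gradients(cost, [w])[0]                    # dC/dw via the chain rule
train = tf.train.AdamOptimizer(1e-3).minimize(cost)    # built-in ADAM optimizer

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: [[0.1, 0.2, 0.3]], t: [[0.0]]}
    print(sess.run([cost, grad_w], feed))
    sess.run(train, feed)
```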

