Deep Reinforcement Learning For Robotic Manipulation


Deep Reinforcement Learning for Robotic Manipulation
arXiv:1610.00633v1 [cs.RO] 3 Oct 2016
Shixiang Gu*,1,2,3, Ethan Holly*,1, Timothy Lillicrap4, and Sergey Levine1,5
(*Equal contribution. 1 Google Brain, 2 University of Cambridge, 3 MPI Tübingen, 4 Google DeepMind, 5 UC Berkeley)

Abstract: Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.

I. INTRODUCTION

Reinforcement learning methods have been applied to a range of robotic control tasks, from locomotion [1], [2] to manipulation [3], [4], [5], [6] and autonomous vehicle control [7]. However, practical real-world applications of reinforcement learning have typically required significant additional engineering beyond the learning algorithm itself: an appropriate representation for the policy or value function must be chosen so as to achieve training times that are practical for physical hardware [8], and example demonstrations must often be provided to initialize the policy and mitigate safety concerns during training [9]. In this work, we show that a recently proposed deep reinforcement learning algorithm based on off-policy training of deep Q-functions [10], [11] can be extended to learn complex manipulation policies from scratch, without user-provided demonstrations, and using only general-purpose neural network representations that do not require task-specific domain knowledge.

Fig. 1: Two robots in the process of learning a door opening task. We present a method that allows multiple robots to cooperatively learn a single policy with deep reinforcement learning.

One of the central challenges with applying direct deep reinforcement learning algorithms to real-world robotic platforms has been their apparent high sample complexity. We demonstrate that, contrary to commonly held assumptions, recently developed off-policy deep Q-function based algorithms such as the Deep Deterministic Policy Gradient algorithm (DDPG) [10] and the Normalized Advantage Function algorithm (NAF) [11] can achieve training times that are suitable for real robotic systems. We also demonstrate that we can further reduce training times by parallelizing the algorithm across multiple robotic platforms.
To that end, we present a novel asynchronous variant of NAF, evaluate the speedup obtained with varying numbers of learners in simulation, and demonstrate real-world results with parallelism across multiple robots. An illustration of these robots learning a door opening task is shown in Figure 1.

The main contribution of this paper is a demonstration of asynchronous deep reinforcement learning using our parallel NAF algorithm across a cluster of robots. Our technical contribution consists of the asynchronous variant of the NAF algorithm, as well as practical extensions of the method to enable sample-efficient training on real robotic platforms. We also introduce a simple and effective safety mechanism for constraining exploration at training time, and present simulated experiments that evaluate the speedup obtained from parallelizing across a variable number of learners. Our experiments also evaluate the benefits of deep neural network representations for several complex manipulation tasks, including door opening and pick-and-place, by comparing to more standard linear representations. Our real-world experiments show that our approach can be used to learn a door opening skill from scratch using only general-purpose neural network representations and without any human demonstrations. To the best of our knowledge, this is the first demonstration of autonomous door opening that does not use human-provided examples for initialization.

II. RELATED WORK

Applications of reinforcement learning (RL) in robotics have included locomotion [1], [2], manipulation [3], [4], [5], [6], and autonomous vehicle control [7]. Many of the RL methods demonstrated on physical robotic systems have used relatively low-dimensional policy representations, typically with under one hundred parameters, due to the difficulty of efficiently optimizing high-dimensional policy parameter vectors [12]. Although there has been considerable research on reinforcement learning with general-purpose neural networks for some time [13], [14], [15], [16], [17], such methods have only recently been developed to the point where they could be applied to continuous control of high-dimensional systems, such as 7 degree-of-freedom (DoF) arms, and with large and deep neural networks [18], [10], [11]. This has made it possible to learn complex skills with minimal manual engineering, though it has remained unclear whether such approaches could be adapted to real systems given their sample complexity.

In real robot environments, particularly those with contact events, environment dynamics are rarely available or cannot be accurately modeled. In this work we thus focus on model-free reinforcement learning, which includes policy search methods [19], [3], [20] and value-iteration methods [21], [22], [14]. Both approaches have recently been combined with deep neural networks to achieve unprecedented successes in learning complex tasks [23], [24], [18], [10], [11], [25]. However, while policy search methods [23], [18], [25] offer a simple and direct way to optimize the true objective, they often require significantly more data than value-iteration methods because of on-policy learning, making them a less obvious choice for robotic applications. We therefore build on two value-iteration methods based on Q-learning with function approximation [22], Deep Deterministic Policy Gradient (DDPG) [10] and Normalized Advantage Functions (NAF) [11], as they successfully extend Deep Q-Learning [24] to continuous action spaces and are significantly more sample-efficient than competing policy search methods due to off-policy learning. DDPG is closely related to the NFQCA [26] algorithm, with the principal differences being that NFQCA uses full-batch updates and parameter resetting between episodes.

Accelerating robotic learning by pooling experience from multiple robots has long been recognized as a promising direction in the domain of cloud robotics, where it is typically referred to as collective robotic learning [27], [28], [29], [30]. In deep reinforcement learning, parallelized learning has also been proposed to speed up simulated experiments [25]. The goals of this prior work are fundamentally different from ours: while prior asynchronous deep reinforcement learning work seeks to reduce overall training time, under the assumption that simulation time is inexpensive and the training is dominated by neural network computations, our work instead seeks to minimize the training time when training on real physical robots, where experience is expensive and computing neural network backward passes is comparatively cheap. In this case, we retain the use of a replay buffer, and focus on asynchronous execution and neural network training. Our results demonstrate that we achieve significant speedup in overall training time from simultaneously collecting experience across multiple robotic platforms.

III. BACKGROUND

In this section, we will formulate the robotic reinforcement learning problem, introduce essential notation, and describe the existing algorithmic foundations on which we build the methods for this work.
The goal in reinforcement learning is to control an agent attempting to maximize a reward function which, in the context of a robotic skill, denotes a user-provided definition of what the robot should try to accomplish. At state x_t in time t, the agent chooses and executes action u_t according to its policy π(u_t | x_t), transitions to a new state x_{t+1} according to the dynamics p(x_{t+1} | x_t, u_t), and receives a reward r(x_t, u_t). Here, we consider infinite-horizon discounted return problems, where the objective is the γ-discounted future return from time t to ∞, given by R_t = Σ_{i=t}^{∞} γ^{(i−t)} r(x_i, u_i). The goal is to find the optimal policy π* which maximizes the expected sum of returns from the initial state distribution, given by R = E_π[R_1].

Among reinforcement learning methods, off-policy methods such as Q-learning offer significant data efficiency compared to on-policy variants, which is crucial for robotics applications. Q-learning trains a greedy deterministic policy π(u_t | x_t) = δ(u_t = µ(x_t)) by iterating between learning the Q-function of a policy, Q^{π_n}(x_t, u_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i>t}∼π_n}[R_t | x_t, u_t], and updating the policy by greedily maximizing the Q-function, µ^{n+1}(x_t) = argmax_u Q^{π_n}(x_t, u). Let θ^Q parametrize the action-value function, β be an arbitrary exploration policy, and ρ^β be the state visitation distribution induced by β; the learning objective is to minimize the Bellman error, where we fix the target y_t:

L(θ^Q) = E_{x_t∼ρ^β, u_t∼β, r_t, x_{t+1}∼E}[ (Q(x_t, u_t | θ^Q) − y_t)^2 ]
y_t = r(x_t, u_t) + γ Q(x_{t+1}, µ(x_{t+1}))

For continuous action problems, the policy update step is intractable for a Q-function parametrized by a deep neural network. Thus, we will investigate Deep Deterministic Policy Gradient (DDPG) [10] and Normalized Advantage Functions (NAF) [11]. DDPG circumvents the problem by adopting an actor-critic method, while NAF restricts the class of Q-functions to the expression below to enable closed-form updates, as in the discrete action case. During exploration, a temporally-correlated noise is added to the policy network output. For more details and comparisons of DDPG and NAF, please refer to [10], [11] as well as the experimental results in Section V-B.

Q(x, u | θ^Q) = A(x, u | θ^A) + V(x | θ^V)
A(x, u | θ^A) = −(1/2) (u − µ(x | θ^µ))^T P(x | θ^P) (u − µ(x | θ^µ))

We evaluate both DDPG and NAF in our simulated experiments, where they yield comparable performance, with NAF producing slightly better results overall for the tasks examined here. On real physical systems, we focus on variants of the NAF method, which is simpler, requires only a single optimization objective, and has fewer hyperparameters.
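To make the preceding objective concrete, the short NumPy sketch below computes the fixed Bellman targets y_t and the squared TD loss for a batch of transitions, and evaluates the NAF form of the Q-function from its three components µ(x), P(x), and V(x). This is a minimal illustration, not the authors' implementation: the network outputs are stand-ins (random arrays), and names such as bellman_targets and naf_q are ours.

import numpy as np

def bellman_targets(rewards, next_q, gamma=0.98, terminal=None):
    """y_t = r_t + gamma * Q(x_{t+1}, mu(x_{t+1})), with no bootstrap at terminal steps."""
    targets = rewards + gamma * next_q
    if terminal is not None:
        targets = np.where(terminal, rewards, targets)
    return targets

def td_loss(q_values, targets):
    """Mean squared Bellman error over a minibatch."""
    return np.mean((q_values - targets) ** 2)

def naf_q(u, mu, P, V):
    """NAF Q-function: Q(x,u) = V(x) - 1/2 (u - mu(x))^T P(x) (u - mu(x))."""
    diff = u - mu
    adv = -0.5 * np.einsum('bi,bij,bj->b', diff, P, diff)
    return V + adv

# Toy batch: 4 transitions with 7-dimensional actions (e.g., joint velocities).
rng = np.random.default_rng(0)
B, dU = 4, 7
u = rng.normal(size=(B, dU))
mu = rng.normal(size=(B, dU))
L = np.tril(rng.normal(size=(B, dU, dU)))           # Cholesky factor of P(x)
P = L @ np.transpose(L, (0, 2, 1))                  # positive semi-definite precision term
V = rng.normal(size=B)
q = naf_q(u, mu, P, V)
y = bellman_targets(rewards=rng.normal(size=B), next_q=rng.normal(size=B))
print(td_loss(q, y))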

This RL formulation can be applied to robotic systems to learn a variety of skills defined by reward functions. However, the learning process is typically time consuming, and requires a number of practical considerations. In the next section, we will present our main technical contribution, which consists of a parallelized variant of NAF, and also discuss a variety of technical contributions necessary to apply NAF to real-world robotic skill learning.

IV. ASYNCHRONOUS TRAINING OF NORMALIZED ADVANTAGE FUNCTIONS

In this section, we present our primary contribution: an extension of NAF that makes it practical for use with real-world robotic platforms. To that end, we describe how online training of the Q-function estimator can be performed asynchronously, with a learner thread that trains the network and one or more worker threads that collect data by executing the current policy on one or more robots. Besides making NAF suitable for real-time applications, this approach also makes it straightforward to collect experience from multiple robots in parallel. This is crucial in real-world robot learning, since the learning time is often constrained by the data collection rate in real time, rather than network training speed. When data collection is the limiting factor, 2-3 times quicker data collection may translate directly into 2-3 times faster skill acquisition on a real robot. We also describe practical considerations, such as safety constraints, which are necessary in order to allow the exploration required to train complex policies from scratch on real systems. To the best of our knowledge, this is the first direct deep RL method that has been demonstrated on a real robotics platform with many DoFs and contact dynamics, and without demonstrations or simulated pretraining [18], [10], [11]. As we will show in our experimental evaluation, this approach can be used to learn complex tasks such as door opening from scratch, which previously required additional information such as human demonstrations to succeed [6].

A. Asynchronous Learning

In asynchronous NAF, the learner thread is separated from the experience-collecting worker threads. The asynchronous learning algorithm is summarized in Algorithm 1. The learner thread uses the replay buffer to perform asynchronous updates to the deep neural network Q-function approximator. This thread runs on a central server, and dispatches updated policy parameters to each of the worker threads. The experience-collecting worker threads run on the individual robots, and send the observation, action, and reward for each time step to the central server to append to the replay buffer. This decoupling between the training and the collecting threads allows the controllers on each of the robots to run in real time, without experiencing delays due to the computational cost of backpropagation through the network. Furthermore, it makes it straightforward to parallelize experience collection across multiple robots simply by adding additional worker threads. We only use one thread for training the network; however, the gradient computation can also be distributed in the same way as [25] within our framework. While the trainer thread keeps training from the centralized replay buffer, the collector threads sync their policy parameters with the trainer thread at the beginning of each episode, execute commands on the robots, and push experience into the buffer.

Algorithm 1 Asynchronous NAF - N collector threads and 1 trainer thread

// trainer thread
Randomly initialize normalized Q network Q(x, u | θ^Q), where θ^Q = {θ^µ, θ^P, θ^V} as in the NAF parametrization above
Initialize target network Q' with weights θ^{Q'} ← θ^Q
Initialize shared replay buffer R ← ∅
for iteration = 1, I do
    Sample a random minibatch of m transitions from R
    Set y_i = r_i + γ V'(x'_i | θ^{Q'}) if t_i < T, and y_i = r_i if t_i = T
    Update the weights θ^Q by minimizing the loss: L = (1/m) Σ_i (y_i − Q(x_i, u_i | θ^Q))^2
    Update the target network: θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
end for

// collector thread n, n = 1..N
Randomly initialize policy network µ(x | θ^µ_n)
for episode = 1, M do
    Sync policy network weights θ^µ_n ← θ^µ
    Initialize a random process N for action exploration
    Receive initial observation state x_1 ∼ p(x_1)
    for t = 1, T do
        Select action u_t = µ(x_t | θ^µ_n) + N_t
        Execute u_t and observe r_t and x_{t+1}
        Send transition (x_t, u_t, r_t, x_{t+1}, t) to R
    end for
end for
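To make the division of labor in Algorithm 1 concrete, the Python sketch below runs one trainer thread against two collector threads that share a replay buffer and re-sync policy parameters at the start of each episode. It is only a structural sketch under assumed placeholders: select_action, step_environment, and train_on_batch stand in for the policy, the robot or simulator, and the NAF update, and the buffer size, thread counts, and dimensions are arbitrary choices of ours.

import random
import threading
import time
from collections import deque

REPLAY = deque(maxlen=1_000_000)   # shared replay buffer R
LOCK = threading.Lock()            # guards buffer and parameter access
PARAMS = {"version": 0}            # trainer-owned policy parameters (placeholder)

def select_action(params, state):
    # Placeholder for u_t = mu(x_t | theta_n) + exploration noise N_t.
    return [random.gauss(0.0, 0.1) for _ in range(7)]

def step_environment(action):
    # Placeholder for executing the action on a robot or simulator.
    return [0.0] * 20, 0.0, False   # next state, reward, done

def train_on_batch(batch):
    # Placeholder for one gradient step on the Bellman loss and the target-network update.
    with LOCK:
        PARAMS["version"] += 1

def trainer(iterations=200, minibatch=64):
    updates = 0
    while updates < iterations:
        with LOCK:
            batch = random.sample(REPLAY, minibatch) if len(REPLAY) >= minibatch else None
        if batch is None:
            time.sleep(0.01)        # wait for the collectors to fill the buffer
            continue
        train_on_batch(batch)
        updates += 1

def collector(episodes=5, horizon=150):
    for _ in range(episodes):
        with LOCK:
            local_params = dict(PARAMS)   # sync theta_n <- theta at episode start
        state = [0.0] * 20
        for t in range(horizon):
            action = select_action(local_params, state)
            next_state, reward, done = step_environment(action)
            with LOCK:
                REPLAY.append((state, action, reward, next_state, t))
            state = next_state
            if done:
                break

threads = [threading.Thread(target=trainer)] + [threading.Thread(target=collector) for _ in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()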

B. Safety Constraints

Ensuring safe exploration poses a significant challenge for real-world training with reinforcement learning. Q-learning requires a significant amount of noisy exploration for gathering the experience necessary for action-value function approximation. For all experiments, we set a maximum commanded velocity allowed per joint, as well as strict position limits for each joint. In addition to joint position limits, we used a bounding sphere for the end-effector position. If the commanded joint velocities would send the end-effector outside of the sphere, we used the forward kinematics to project the commanded velocity onto the surface of the sphere, plus some correction velocity to force toward the center. For experiments with no contacts, these safety constraints were sufficient to prevent unsafe exploration; for experiments with contacts, additional heuristics were required for safety. A sketch of the bounding-sphere check is shown below.
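The following snippet illustrates the bounding-sphere constraint just described: it checks whether a commanded Cartesian end-effector velocity would leave the sphere and, if so, removes the outward radial component and adds a small velocity back toward the center. It is a simplified stand-in rather than the authors' controller: the paper works with joint velocities mapped through the forward kinematics, whereas here ee_pos and ee_vel are assumed to be the end-effector position and velocity already computed from the arm's kinematics, and the gains and radius are made-up values.

import numpy as np

def clip_to_bounding_sphere(ee_pos, ee_vel, center, radius, dt=0.05, pull_gain=0.1):
    """Project an end-effector velocity command so it cannot leave a safety sphere.

    ee_pos, ee_vel, center: 3-vectors; radius: sphere radius in meters;
    dt: control period; pull_gain: strength of the correction toward the center.
    """
    offset = (ee_pos + dt * ee_vel) - center
    dist = np.linalg.norm(offset)
    if dist <= radius:
        return ee_vel                          # command stays inside: pass through unchanged
    outward = offset / dist                    # unit vector pointing out of the sphere
    radial_speed = ee_vel @ outward
    tangential = ee_vel - max(radial_speed, 0.0) * outward    # drop the outward component
    correction = -pull_gain * (dist - radius) / dt * outward  # nudge back toward the center
    return tangential + correction

# Example: a command that would push the end-effector out of a 0.8 m sphere.
safe_vel = clip_to_bounding_sphere(
    ee_pos=np.array([0.75, 0.0, 0.3]),
    ee_vel=np.array([0.5, 0.0, 0.0]),
    center=np.zeros(3),
    radius=0.8)
print(safe_vel)
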
C. Network Architectures

To minimize manual engineering, we use a simple and readily available state representation consisting of joint angles and end-effector positions, as well as their time derivatives. In addition, we append a target position to the state, which depends on the task: for the reaching task, this is the goal position for the end-effector; for the door opening, this is the handle position when the door is closed and the quaternion measurement of the sensor attached to the door frame. Since the state representation is compact, we use standard feed-forward networks to parametrize the action-value functions and policies. We use a two-hidden-layer network with 100 units in each layer to parametrize each of µ(x), L(x) (the Cholesky factor of P(x)), and V(x) in NAF, and µ(x) and Q(x, u) in DDPG. For Q(x, u) in DDPG, the action vector u is added as another input to the second hidden layer, followed by a linear projection. ReLU is used for the hidden activations, and the hyperbolic tangent (tanh) is used as the final-layer activation in the policy networks µ(x) to bound the action scale.

To illustrate the importance of deep neural networks for representing policies or action-value functions, we compare these neural network models against a simpler parametrization. Specifically, we study a variant of NAF (Linear-NAF) as below, where µ(x) = f(k + Kx); P, k, K, B, b, c are learnable matrices, vectors, or scalars of appropriate dimension, and f is tanh to enforce bounded actions:

Q(x, u) = −(1/2) (u − µ(x))^T P (u − µ(x)) + x^T B x + x^T b + c

If f is the identity, this expression corresponds to a globally quadratic Q-function and a linear feedback policy, though due to the tanh non-linearity, the Q-function is not linear with respect to state-action features.
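For concreteness, a NumPy forward pass matching the architecture just described might look as follows: two hidden layers of 100 ReLU units feeding a tanh-bounded µ(x), a lower-triangular L(x) whose diagonal is exponentiated so that P(x) = L(x) L(x)^T is positive definite, and a scalar V(x). The exponentiated diagonal and the random weight initialization are our assumptions for the sketch; only the layer sizes and activations come from the text.

import numpy as np

rng = np.random.default_rng(0)
dX, dU, H = 20, 7, 100                      # state dim, action dim, hidden width

def dense(d_in, d_out):
    return rng.normal(0, 0.05, (d_in, d_out)), np.zeros(d_out)

W1, b1 = dense(dX, H)
W2, b2 = dense(H, H)
Wmu, bmu = dense(H, dU)                     # policy head mu(x)
Wl, bl = dense(H, dU * (dU + 1) // 2)       # entries of the Cholesky factor L(x)
Wv, bv = dense(H, 1)                        # value head V(x)

def naf_forward(x, u):
    h = np.maximum(0.0, x @ W1 + b1)        # ReLU hidden layer 1
    h = np.maximum(0.0, h @ W2 + b2)        # ReLU hidden layer 2
    mu = np.tanh(h @ Wmu + bmu)             # tanh-bounded action mean
    v = (h @ Wv + bv)[0]
    L = np.zeros((dU, dU))
    L[np.tril_indices(dU)] = h @ Wl + bl
    L[np.diag_indices(dU)] = np.exp(L[np.diag_indices(dU)])   # keep P positive definite
    P = L @ L.T
    diff = u - mu
    return v - 0.5 * diff @ P @ diff        # Q(x, u) = V(x) + A(x, u)

print(naf_forward(rng.normal(size=dX), rng.normal(size=dU)))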

V. SIMULATED EXPERIMENTS

We first performed a detailed investigation of the learning algorithms using simulated tasks modeled in the MuJoCo physics simulator [31]. Simulated environments enable fast comparisons of design choices, including update frequencies, parallelism, network architectures, and other hyperparameters. We modeled a 7-DoF lightweight arm that was also used in our physical robot experiments, as well as a 6-DoF Kinova JACO arm with 3 additional degrees of freedom in the fingers, for a total of 9 degrees of freedom. Both arms were controlled at the level of joint velocities, except the three JACO finger joints, which are controlled with torque actuators. The 7-DoF arm is controlled at 20 Hz to match the real-world robot experiments, and the JACO arm is controlled at 100 Hz. Gravity is turned off for the 7-DoF arm, which is a valid assumption given that the actual robot uses built-in gravity compensation. Gravity is enabled for the JACO arm. The different arm geometries, control frequencies, and gravity settings illustrate the learning algorithm's robustness to different learning environments.

A. Simulation Tasks

Tasks include random-target reaching, door pushing, door pulling, and pick & place in a 3D environment, as detailed below. The 7-DoF arm is set up for the random-target reaching and door tasks, while the JACO arm is used for the pick & place task (see Figure 2). Details of each task are below, where d is the Huber loss and the c_i are non-negative constants. A discount factor of γ = 0.98 is chosen, and the Adam optimizer [32] with a base learning rate of either 0.0001 or 0.001 is used for all the experiments. Importantly, almost no hyperparameter search was required to ensure that the employed algorithms were successful across robots and tasks.

Fig. 2: The 7-DoF arm and JACO arm in simulation.

1) Reaching (7-DoF arm): The 7-DoF arm tries to reach a random target in space from a fixed initial configuration. A random target is generated per episode by sampling points uniformly from a cube of size 0.2 m centered around a point. State features include the 7 joint angles and their time derivatives, the end-effector position, and the target position, totalling 20 dimensions. Each episode lasts 150 time steps (7.5 seconds). Success rate is computed from 5 random test episodes, where an episode is successful if the arm reaches within 5 cm of the target. Given the end-effector position e and the target position y, the reward function is

r(x, u) = −c1 d(y, e(x)) − c2 u^T u

2) Door Pushing and Pulling (7-DoF arm): The 7-DoF arm tries to open the door by pushing or pulling the handle (see Figure 2). For each episode, the door position is sampled randomly within a rectangle of 0.2 m by 0.1 m. The handle can be turned downward by up to 90 degrees, while the door can be opened up to 90 degrees in both directions. The door has a spring such that it closes gradually when no force is applied, and a latch such that the door can only be opened when the handle is turned past approximately 60 degrees. To make the setting similar to the real robot experiment, where the quaternion readings from the VectorNav IMU are used for door angle measurements, the quaternion of the door handle is used to compute the loss. The reward function is composed of two parts: the closeness of the end-effector to the handle, and a measure of how much the door is opened in the right direction. The first part depends on the distance between the end-effector position e and the handle position h in its neutral state. The second part depends on the distance between the quaternion of the handle q and its value q_o when the handle is turned and the door is opened. We also add the distance when the door is in its neutral position as an offset d_i = d(q_o, q_i), such that, when the door is opened the correct way, the agent receives positive reward. State features include the 7 joint angles and their time derivatives, the end-effector position, the resting handle position, the door frame position, the door angle, and the handle angle, totalling 25 dimensions. Each episode lasts 300 time steps (15 seconds). Success rate is computed from 20 random test episodes, where an episode is successful if the arm opens the door in the correct direction by a minimum of 10 degrees. The reward function is

r(x, u) = −c1 d(h, e(x)) + c2 (−d(q_o, q(x)) + d_i) − c3 u^T u

3) Pick & Place (JACO): The JACO arm tries to pick up a stick suspended in the air by a string and place it near a target upward in space (see Figure 2). The hand begins near to, but not in contact with, the stick, so the grasp must be learned. The task is similar to a task previously explored with on-policy methods [25], except that here the task requires moving the stick to multiple targets. For each episode, a new target is sampled from a square of size 0.24 m at a fixed height, while the initial stick position and the arm configuration are fixed. State features include the positions and rotation matrices of all geometries in the environment, the target position, and the vector from the stick to the target, totalling 180 dimensions. The large observation dimensionality creates an interesting comparison with the above two tasks. Each episode lasts 300 time steps (3 seconds). Success rate is computed from 20 random test episodes, where an episode is judged successful if the arm brings the stick within 5 cm of the target. Given the grip site position g (where the three fingers meet when closed), the three fingertip positions f_1, f_2, f_3, the stick position s, and the target position y, the reward function is

r(x, u) = −c1 d(s(x), g(x)) − c2 Σ_{i=1}^{3} d(s(x), f_i(x)) − c3 d(y, s(x)) − c4 u^T u
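As a concrete instance of the reward shaping above, the snippet below implements the reaching reward r(x, u) = −c1 d(y, e(x)) − c2 u^T u with a Huber distance d. The constants c1, c2 and the Huber threshold are illustrative choices of ours, not values reported in the paper.

import numpy as np

def huber(a, b, delta=0.05):
    """Huber distance d(a, b): quadratic for small errors, linear for large ones."""
    err = np.linalg.norm(np.asarray(a) - np.asarray(b))
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)

def reaching_reward(target, ee_pos, u, c1=1.0, c2=0.01):
    """r(x, u) = -c1 * d(target, ee) - c2 * u^T u  (illustrative constants)."""
    u = np.asarray(u)
    return -c1 * huber(target, ee_pos) - c2 * float(u @ u)

print(reaching_reward(target=[0.5, 0.1, 0.4],
                      ee_pos=[0.45, 0.12, 0.38],
                      u=np.zeros(7)))
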
Fig. 3: Learning curves for (a) door pulling and (b) JACO pick & place, comparing DDPG, Linear-NAF, and NAF. Note that the linear model struggles to learn the tasks, indicating the importance of expressive nonlinear policy representations.

B. Neural Network Policy Representations

Neural networks are powerful function approximators, but they have significantly more parameters than the simpler linear models that are often used in robotic learning [20], [8]. In this section, we compare the empirical performance of DDPG, NAF, and Linear-NAF as described in Section IV-C. In particular, we want to verify whether deep representations for the policy and value functions are necessary for solving complex tasks from scratch, and evaluate how they compare with linear models in terms of convergence rate. For the 7-DoF arm tasks, the DDPG and NAF models have significantly more parameters than Linear-NAF, while the pick & place task has a high-dimensional observation, and thus the parameter sizes are more comparable. Of course, many other linear representations are possible, including DMPs [33], splines [3], and task-specific representations [34]. This comparison only serves to illustrate that our tasks are complex enough that simple, fully generic linear representations are not by themselves sufficient for success. For the experiments in this section, batch normalization [35] is applied. These experiments were conducted synchronously, where one parameter update is applied per time step in simulation.

Figure 3 shows the experimental results on the 7-DoF door pulling and JACO pick & place tasks, and the table in Figure 4 summarizes the overall results. For reaching and pick & place, Linear-NAF learns good policies competitive with those of NAF and DDPG, but converges significantly more slowly than both NAF and DDPG. This is contrary to the common belief that neural networks take significantly more data and update steps to converge to good solutions. One possible explanation is that in RL the data collection and the model learning are coupled, and if the model is more expressive, it can explore a greater variety of complex policies efficiently and thus collect diverse and good data quickly. This is not a problem for well pre-trained policy learning, but could be an important issue when learning from scratch. In the case of the door tasks, the linear model completely fails to learn perfect policies. More thorough investigation into how the expressivity of the policy interacts with reinforcement learning is a promising direction for future work.

Additionally, the experimental results on the door tasks show that Linear-NAF does not succeed in learning such tasks. The difference from the above tasks likely comes from the complexity of the policies. For reaching and pick & place, the tasks mainly require learning single-motion policies, e.g. closing the fingers to grasp the stick and moving it to the target. For the door tasks, the robot is required to learn how to hook onto the door handle in different locations, turn it, and push or pull. See the supplementary video for the resulting learned behaviors for each task.

Max. success rate (%):
             Reach      Door Pull   Door Push   Pick & Place
  DDPG       100 ± 0    100 ± 0     100 ± 0     100 ± 0
  Lin-NAF    100 ± 0    5 ± 6       40 ± 10     100 ± 0
  NAF        100 ± 0    100 ± 0     100 ± 0     100 ± 0

Episodes to 100% success (in 1000s):
             Reach       Door Pull   Door Push   Pick & Place
  DDPG       3.2 ± 0.7   10 ± 8      3.1 ± 1.0   4.4 ± 0.6
  Lin-NAF    8 ± 3       N/A         N/A         12 ± 3
  NAF        3.6 ± 1.0   6 ± 3       4.2 ± 1.0   2.9 ± 0.9

Fig. 4: The table summarizes the performance of DDPG, Linear-NAF, and NAF across the four tasks. Note that the linear model learns the perfect reaching and pick & place policies given enough time, but fails to learn either of the door tasks.

Fig. 5: Asynchronous training of NAF in simulation on (a) reaching and (b) door pushing. Note that both the learning speed and the final policy success rate depend significantly on the number of workers.

C. Asynchronous Training

In asynchronous training, the training thread continuously trains the network at a fixed frequency determined by the network size and the computational hardware, while each collector thread runs at a specified control frequency. The main question to answer is: given these constraints, how much speedup can we gain from increasing the number of workers, i.e. the data collection speed? To analyze this in a realistic but controlled setting, we first set up the following experiment in simulation. We locked each collector thread to run at S times the speed of the training thread. Then, we varied the number of collector threads N. Thus, the overall data collection speed is approximately S·N times that of the trainer thread. For our experiments, we varied N and fixed S = 1/5, since our training thread runs at approximately 100 updates per second on a CPU, while the collector thread on the real robot will be locked to 20 Hz. Layer normalization [36] is applied.

Figure 5 shows the results on reaching and door pushing. The x-axis shows the number of parameter updates, which is proportional to the amount of wall-clock time required for training, since the amount of data per step increases with the number of workers. The results demonstrate three points: (1) under some circumstances, increasing data collection makes the learning converge significantly faster with respect to the number of gradient steps, (2) final policy performance depends strongly on the ratio between collecting and training speeds, and (3) there is a limit beyond which collecting more data does not help speed up learning. However, we hypothesize that accelerating the speed of neural network training, which in these cases was pegged to one update per time step, could allow the model to ingest more data and benefit more from greater parallelism. This is particularly relevant as parallel computational hardware, such as GPUs, is improved and deployed more widely. Videos of the learned policies are available in the supplementary materials and online.

VI. REAL-WORLD EXPERIMENTS

The real-world experiments are conducted with the 7-DoF arm shown in Figure 6. The tasks are the same as the simulation tasks in Section V-A, with some minor changes. For reaching, the same state representation and reward functions are used. The randomized target position is sampled from a cube of 0.4

