Learning Agile Robotic Locomotion Skills by Imitating Animals


Xue Bin Peng†, Erwin Coumans, Tingnan Zhang, Tsang-Wei Edward Lee, Jie Tan, Sergey Levine†
Google Research, †University of California, Berkeley
Email: xbpeng@berkeley.edu, {erwincoumans, tingnan, tsangwei, jietan}@google.com, svlevine@eecs.berkeley.edu

Fig. 1. Laikago robot performing locomotion skills learned by imitating motion data recorded from a real dog. Top: Motion capture data recorded from a dog. Middle: Simulated Laikago robot imitating reference motions. Bottom: Real Laikago robot imitating reference motions.

Abstract—Reproducing the diverse and agile locomotion skills of animals has been a longstanding challenge in robotics. While manually-designed controllers have been able to emulate many complex behaviors, building such controllers involves a time-consuming and difficult development process, often requiring substantial expertise in the nuances of each skill. Reinforcement learning provides an appealing alternative for automating the manual effort involved in the development of controllers. However, designing learning objectives that elicit the desired behaviors from an agent can also require a great deal of skill-specific expertise. In this work, we present an imitation learning system that enables legged robots to learn agile locomotion skills by imitating real-world animals. We show that by leveraging reference motion data, a single learning-based approach is able to automatically synthesize controllers for a diverse repertoire of behaviors for legged robots. By incorporating sample-efficient domain adaptation techniques into the training process, our system is able to learn adaptive policies in simulation that can then be quickly adapted for real-world deployment. To demonstrate the effectiveness of our system, we train an 18-DoF quadruped robot to perform a variety of agile behaviors ranging from different locomotion gaits to dynamic hops and turns. (Video¹)

¹ Supplementary video: xbpeng.github.io/projects/Robotic_Imitation/

I. INTRODUCTION

Animals can traverse complex environments with remarkable agility, bringing to bear broad repertoires of agile and acrobatic skills. Reproducing such agile behaviors has been a long-standing challenge in robotics, with a large body of work devoted to designing control strategies for various locomotion skills [37, 49, 54, 18, 3]. However, designing control strategies often involves a lengthy development process, and requires substantial expertise in both the underlying system and the desired skills. Despite the many successes in this domain, the capabilities achieved by these systems are still far from the fluid and graceful motions seen in the animal kingdom.

Learning-based approaches offer the potential to improve the agility of legged robots, while also automating a substantial portion of the manual effort involved in the development of controllers. In particular, reinforcement learning (RL) can be an effective and general approach for developing controllers that can perform a wide range of sophisticated skills [7, 43, 25, 44, 34]. While these methods have demonstrated promising results in simulation, agents trained through RL are prone to adopting unnatural behaviors that are dangerous or infeasible when deployed in the real world.
Furthermore, designing reward functions that elicit the desired behaviors can itself require a laborious task-specific tuning process.

The comparatively superior agility seen in animals, as compared to robots, might lead one to wonder: can we build more agile robotic controllers with less effort by directly imitating animal motions? In this work, we propose an imitation learning framework that enables legged robots to learn agile locomotion skills from real-world animals. Our framework leverages reference motion data to provide priors regarding feasible control strategies for a particular skill. The use of reference motions alleviates the need to design skill-specific reward functions, thereby enabling a common framework to learn a diverse array of behaviors. To address the high sample requirements of current RL algorithms, the initial training phase is performed in simulation.

In order to transfer policies learned in simulation to the real world, we propose a sample-efficient adaptation technique, which fine-tunes the behavior of a policy using a learned dynamics representation.

The central contribution of our work is a system that enables legged robots to learn agile locomotion skills by imitating animals. We demonstrate the effectiveness of our framework on a variety of dynamic locomotion skills with the Laikago quadruped robot [61], including different locomotion gaits, as well as dynamic hops and turns. In our ablation studies, we explore the impact of different design decisions made for the various components of our system.

II. RELATED WORK

The development of controllers for legged locomotion has been an enduring subject of interest in robotics, with a large body of work proposing a variety of control strategies for legged systems [37, 49, 54, 20, 18, 64, 8, 3]. However, many of these methods require in-depth knowledge and manual engineering for each behavior, and as such, the resulting capabilities are ultimately limited by the designer's understanding of how to model and represent agile and dynamic behaviors. Trajectory optimization and model predictive control can mitigate some of the manual effort involved in the design process, but due to the high-dimensional and complex dynamics of legged systems, reduced-order models are often needed to formulate tractable optimization problems [11, 17, 12, 2]. These simplified abstractions tend to be task-specific, and again require significant insight into the properties of each skill.

Motion imitation. Imitating reference motions provides a general approach for robots to perform a rich variety of behaviors that would otherwise be difficult to manually encode into controllers [48, 21, 55, 63]. But applications of motion imitation to legged robots have predominantly been limited to behaviors that emphasize upper-body motions, with fairly static lower-body movements, where balance control can be delegated to separate control strategies [39, 27, 30]. In contrast to physical robots, substantially more dynamic skills can be reproduced by agents in simulation [38, 33, 9, 35]. Recently, motion imitation with reinforcement learning has been effective for learning a large repertoire of highly acrobatic skills in simulation [44, 34, 45, 32]. But due to the high sample complexity of RL algorithms and other physical limitations, many of the capabilities demonstrated in simulation have yet to be replicated in the real world.

Sim-to-real transfer. The challenges of applying RL in the real world have driven the use of domain transfer approaches, where policies are first trained in simulation (source domain), and then transferred to the real world (target domain). Sim-to-real transfer can be facilitated by constructing more accurate simulations [58, 62], or adapting the simulator with real-world data [57, 23, 26, 36, 5]. However, building high-fidelity simulators remains a challenging endeavour, and even state-of-the-art simulators provide only a coarse approximation of the rich dynamics of the real world. Domain randomization can be incorporated into the training process to encourage policies to be robust to variations in the dynamics [52, 60, 47, 42, 41]. Sample-efficient adaptation techniques, such as fine-tuning [51] and meta-learning [13, 16, 6], can also be applied to further improve the performance of pre-trained policies in new domains.
In this work, we leverage a class of adaptation techniques, which we broadly refer to as latent space methods [24, 65, 67], to transfer locomotion policies from simulation to the real world. During pre-training, these methods learn a latent representation of different behaviors that are effective under various scenarios. When transferring to a new domain, a search can be conducted in the latent space to find behaviors that successfully execute a desired task in the new domain. We show that by combining motion imitation and latent space adaptation, our system is able to learn a diverse corpus of dynamic locomotion skills that can be transferred to legged robots in the real world.

RL for legged locomotion. Reinforcement learning has been effective for automatically acquiring locomotion skills in simulation [44, 34, 32] and in the real world [31, 59, 14, 58, 22, 26]. Kohl and Stone [31] applied a policy gradient method to tune manually-crafted walking controllers for the Sony Aibo robot. By carefully modeling the motor dynamics of the Minitaur quadruped robot, Tan et al. [58] were able to train walking policies in simulation that can be directly deployed on a real robot. Hwangbo et al. [26] proposed learning a motor dynamics model using real-world data, which enabled direct transfer of a variety of locomotion skills to the ANYmal robot. Their system trained policies using manually-designed reward functions for each skill, which can be difficult to specify for more complex behaviors. Imitating reference motions can be a general approach for learning diverse repertoires of skills without the need to design skill-specific reward functions [35, 44, 45]. Xie et al. [62] trained bipedal walking policies for the Cassie robot by imitating reference motions recorded from existing controllers and keyframe animations. The policies are again transferred from simulation to the real world with the aid of careful system identification. Yu et al. [65] transferred bipedal locomotion policies from simulation to a physical Darwin OP2 robot using a latent space adaptation method, which mitigates the dependency on accurate simulators. In this work, we leverage a similar latent space method, but by combining it with motion imitation, our system enables real robots to perform more diverse and agile behaviors than have been demonstrated by these previous methods.

III. OVERVIEW

The objective of our framework is to enable robots to learn skills from real animals. Our framework receives as input a reference motion that demonstrates a desired skill for the robot, which may be recorded using motion capture (mocap) of real animals (e.g. a dog). Given a reference motion, it then uses reinforcement learning to synthesize a policy that enables a robot to reproduce that skill in the real world. A schematic illustration of our framework is shown in Figure 2.

Fig. 2. The framework consists of three stages: motion retargeting, motion imitation, and domain adaptation. It receives as input motion data recorded from an animal, and outputs a control policy that enables a real robot to reproduce the motion.

The process is organized into three stages: motion retargeting, motion imitation, and domain adaptation. 1) The reference motion is first processed by the motion retargeting stage, where the motion clip is mapped from the original subject's morphology to the robot's morphology via inverse-kinematics. 2) Next, the retargeted reference motion is used in the motion imitation stage to train a policy to reproduce the motion with a simulated model of the robot. To facilitate transfer to the real world, domain randomization is applied in simulation to train policies that can adapt to different dynamics. 3) Finally, the policy is transferred to a real robot via a sample-efficient domain adaptation process, which adapts the policy's behavior using a learned latent dynamics representation.

IV. MOTION RETARGETING

Fig. 3. Inverse-kinematics (IK) is used to retarget mocap clips recorded from a real dog (left) to the Laikago robot (right). Corresponding pairs of keypoints (red) are specified on the dog and robot's bodies, and then IK is used to compute a pose for the robot that tracks the keypoints.

When using motion data recorded from animals, the subject's morphology tends to differ from that of the robot. To address this discrepancy, the source motions are retargeted to the robot's morphology using inverse-kinematics [19]. First, a set of source keypoints are specified on the subject's body, which are paired with corresponding target keypoints on the robot's body. An illustration of the keypoints is available in Figure 3. The keypoints include the positions of the feet and hips. At each timestep, the source motion specifies the 3D location x̂_i(t) of each keypoint i. The corresponding target keypoint x_i(q_t) is determined by the robot's pose q_t, represented in generalized coordinates [15]. IK is then applied to construct a sequence of poses q_{0:T} that track the keypoints at each frame,

\arg\min_{q_{0:T}} \sum_t \left[ \sum_i \left\| \hat{x}_i(t) - x_i(q_t) \right\|^2 + (\bar{q} - q_t)^T W (\bar{q} - q_t) \right].    (1)

An additional regularization term is included to encourage the poses to remain similar to a default pose q̄, and W = diag(w_1, w_2, ...) is a diagonal matrix specifying regularization coefficients for each joint. A minimal sketch of this per-frame retargeting objective is given below.
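The following sketch illustrates how one frame of Equation 1 could be solved as a nonlinear least-squares problem. The forward-kinematics function `fk` and the calling conventions are assumptions for illustration; in practice the keypoint positions would be computed from the robot model (e.g. via PyBullet).

```python
# Minimal sketch of one frame of the retargeting objective in Eq. (1), assuming a
# user-supplied forward-kinematics function fk(q) -> (K, 3) keypoint positions.
import numpy as np
from scipy.optimize import least_squares

def retarget_frame(fk, source_keypoints, q_init, q_default, joint_weights):
    """Track the source keypoints while regularizing toward a default pose q_default."""
    def residuals(q):
        track = (fk(q) - source_keypoints).ravel()          # keypoint tracking error
        reg = np.sqrt(joint_weights) * (q_default - q)      # weighted pose regularization
        return np.concatenate([track, reg])

    return least_squares(residuals, q_init).x

# The full clip is retargeted frame by frame, warm-starting each solve from the
# previous frame's solution so that the resulting pose sequence stays smooth.
```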
V. MOTION IMITATION

We formulate motion imitation as a reinforcement learning problem. In reinforcement learning, the objective is to learn a control policy π that enables an agent to maximize its expected return for a given task [56]. At each timestep t, the agent observes a state s_t from the environment, and samples an action a_t ∼ π(a_t | s_t) from its policy π. The agent then applies this action, which results in a new state s_{t+1} and a scalar reward r_t = r(s_t, a_t, s_{t+1}). Repeated applications of this process generate a trajectory τ = {(s_0, a_0, r_0), (s_1, a_1, r_1), ...}. The objective then is to learn a policy that maximizes the agent's expected return J(π),

J(\pi) = \mathbb{E}_{\tau \sim p(\tau|\pi)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right],    (2)

where T denotes the time horizon of each episode, and γ ∈ [0, 1] is a discount factor. p(τ|π) represents the likelihood of a trajectory τ under a given policy π,

p(\tau|\pi) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t),    (3)

with p(s_0) being the initial state distribution, and p(s_{t+1}|s_t, a_t) representing the dynamics of the system, which determines the effects of the agent's actions.

To imitate a given reference motion, we follow a motion imitation approach similar to Peng et al. [44]. The input to the policy is augmented with an additional goal g_t, which specifies the motion that the robot should imitate. The policy is modeled as a feedforward network that maps a given state s_t and goal g_t to a distribution over actions π(a_t | s_t, g_t). The policy is queried at 30 Hz for a new action at each timestep. The state s_t = (q_{t-2:t}, a_{t-3:t-1}) is represented by the poses q_{t-2:t} of the robot in the three previous timesteps, and the three previous actions a_{t-3:t-1}. The pose features q_t consist of IMU readings of the root orientation (roll, pitch, yaw) and the local rotations of every joint. The root position is not included among the pose features to avoid the need to estimate the root position during real-world deployment. The goal g_t = (q̂_{t+1}, q̂_{t+2}, q̂_{t+10}, q̂_{t+30}) specifies target poses from the reference motion at four future timesteps, spanning approximately 1 second. The action a_t specifies target rotations for PD controllers at each joint. To ensure smoother motions, the PD targets are first processed by a low-pass filter before being applied on the robot [4]. A sketch of this observation and goal construction is given below.
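The sketch below assembles the policy inputs described above. The containers `pose_history`, `action_history`, and the `reference` object with a `pose(t)` accessor are hypothetical names introduced only for illustration.

```python
# Minimal sketch of the policy inputs: a short history of poses and actions as the
# state, plus four future reference poses as the goal.
import numpy as np

def build_observation(pose_history, action_history, reference, t):
    # State: poses from the three previous timesteps (roll/pitch/yaw + joint rotations)
    # and the three previous actions. The root position is deliberately excluded.
    state = np.concatenate(pose_history[-3:] + action_history[-3:])

    # Goal: target poses from the reference motion at four future timesteps,
    # spanning roughly one second at the 30 Hz control rate.
    goal = np.concatenate([reference.pose(t + k) for k in (1, 2, 10, 30)])

    return np.concatenate([state, goal])
```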

Reward Function. The reward function encourages the policy to track the sequence of target poses (q̂_0, q̂_1, ..., q̂_T) from the reference motion at every timestep. The reward function is similar to the one used by Peng et al. [44], where the reward r_t at each timestep is given by:

r_t = w^p r_t^p + w^v r_t^v + w^e r_t^e + w^{rp} r_t^{rp} + w^{rv} r_t^{rv}    (4)
w^p = 0.5, \quad w^v = 0.05, \quad w^e = 0.2, \quad w^{rp} = 0.15, \quad w^{rv} = 0.1

The pose reward r_t^p encourages the robot to minimize the difference between the joint rotations specified by the reference motion and those of the robot. In the equation below, q̂_t^j represents the 1D local rotation of joint j from the reference motion at time t, and q_t^j represents the robot's joint rotation,

r_t^p = \exp\left[ -5 \sum_j \left\| \hat{q}_t^j - q_t^j \right\|^2 \right].    (5)

Similarly, the velocity reward r_t^v is calculated according to the joint velocities, with \hat{\dot{q}}_t^j and \dot{q}_t^j being the angular velocities of joint j from the reference motion and the robot respectively,

r_t^v = \exp\left[ -0.1 \sum_j \left\| \hat{\dot{q}}_t^j - \dot{q}_t^j \right\|^2 \right].    (6)

Next, the end-effector reward r_t^e encourages the robot to track the positions of the end-effectors, where x_t^e denotes the relative 3D position of end-effector e with respect to the root,

r_t^e = \exp\left[ -40 \sum_e \left\| \hat{x}_t^e - x_t^e \right\|^2 \right].    (7)

Finally, the root pose reward r_t^{rp} and root velocity reward r_t^{rv} encourage the robot to track the reference root motion. x_t^{root} and \dot{x}_t^{root} denote the root's global position and linear velocity, while q_t^{root} and \dot{q}_t^{root} are the rotation and angular velocity,

r_t^{rp} = \exp\left[ -20 \left\| \hat{x}_t^{root} - x_t^{root} \right\|^2 - 10 \left\| \hat{q}_t^{root} - q_t^{root} \right\|^2 \right]    (8)
r_t^{rv} = \exp\left[ -2 \left\| \hat{\dot{x}}_t^{root} - \dot{x}_t^{root} \right\|^2 - 0.2 \left\| \hat{\dot{q}}_t^{root} - \dot{q}_t^{root} \right\|^2 \right].    (9)

A direct transcription of these reward terms is sketched below.
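The following sketch transcribes Equations 4-9. The `ref` and `robot` frame objects and their field names are assumptions for illustration; the weights and exponent coefficients are taken from the equations above.

```python
# Minimal sketch of the imitation reward in Eqs. (4)-(9), assuming hypothetical
# `ref` and `robot` frame objects exposing joint rotations/velocities, root-relative
# end-effector positions, and root pose/velocity as numpy arrays.
import numpy as np

W_P, W_V, W_E, W_RP, W_RV = 0.5, 0.05, 0.2, 0.15, 0.1

def imitation_reward(ref, robot):
    r_pose = np.exp(-5.0  * np.sum((ref.joint_rot - robot.joint_rot) ** 2))        # Eq. (5)
    r_vel  = np.exp(-0.1  * np.sum((ref.joint_vel - robot.joint_vel) ** 2))        # Eq. (6)
    r_ee   = np.exp(-40.0 * np.sum((ref.ee_pos - robot.ee_pos) ** 2))              # Eq. (7)
    r_root_pose = np.exp(-20.0 * np.sum((ref.root_pos - robot.root_pos) ** 2)      # Eq. (8)
                         - 10.0 * np.sum((ref.root_rot - robot.root_rot) ** 2))
    r_root_vel  = np.exp(-2.0 * np.sum((ref.root_lin_vel - robot.root_lin_vel) ** 2)  # Eq. (9)
                         - 0.2 * np.sum((ref.root_ang_vel - robot.root_ang_vel) ** 2))
    return (W_P * r_pose + W_V * r_vel + W_E * r_ee
            + W_RP * r_root_pose + W_RV * r_root_vel)                              # Eq. (4)
```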
VI. DOMAIN ADAPTATION

Due to discrepancies between the dynamics of the simulation and the real world, policies trained in simulation tend to perform poorly when deployed on a physical system. Therefore, we propose a sample-efficient adaptation technique for transferring policies from simulation to the real world.

A. Domain Randomization

Domain randomization is a simple strategy for improving a policy's robustness to dynamics variations [52, 60, 42]. Instead of training a policy in a single environment with fixed dynamics, domain randomization varies the dynamics during training, thereby encouraging the policy to learn strategies that are functional across different dynamics. However, there may be no single strategy that is effective across all environments, and due to unmodeled effects in the real world, strategies that are robust to different simulated dynamics may nonetheless fail when deployed on a physical system.

B. Domain Adaptation

In this work, we aim to learn strategies that are robust to variations in the dynamics of the environment, while also being able to adapt their behavior as necessary for new environments. Let µ represent the values of the dynamics parameters that are randomized during training in simulation (Table I). At the start of each episode, a random set of parameters is sampled according to µ ∼ p(µ). The dynamics parameters are then encoded into a latent embedding z ∼ E(z|µ) by a stochastic encoder E, and z is provided as an additional input to the policy π(a|s, z). For brevity, we exclude the goal input g to the policy. When transferring a policy to the real world, we follow a similar approach as Yu et al. [66], where a search is performed to find a latent encoding z that enables the policy to successfully execute the desired behaviors on the physical system. Next, we propose an extension that addresses potential issues due to over-fitting with the previously proposed method.

A potential degeneracy of the previously described approach is that the policy may learn strategies that depend on z being an accurate representation of the true dynamics of the system. This can result in brittle behaviors, where the strategies utilized by the policy for a given z overfit to the precise dynamics from the corresponding parameters µ. Furthermore, due to unmodeled effects in the real world, there might be no µ that accurately models real-world dynamics. Therefore, to encourage the policy to be robust to uncertainty in the dynamics, we incorporate an information bottleneck into the encoder. The information bottleneck enforces an upper bound I_c on the mutual information I(M, Z) between the dynamics parameters M and the encoding Z. This results in the following constrained policy optimization objective,

\arg\max_{\pi, E}\ \mathbb{E}_{\mu \sim p(\mu)}\, \mathbb{E}_{z \sim E(z|\mu)}\, \mathbb{E}_{\tau \sim p(\tau|\pi, \mu, z)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right]    (10)
\text{s.t.} \quad I(M, Z) \leq I_c,    (11)

where the trajectory distribution is now given by,

p(\tau|\pi, \mu, z) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t, \mu)\, \pi(a_t|s_t, z).    (12)

Since computing the mutual information is intractable, the constraint in Equation 11 can be approximated with a variational upper bound using the KL divergence between E and a variational prior ρ(z) [1],

I(M, Z) \leq \mathbb{E}_{\mu \sim p(\mu)} \left[ D_{\mathrm{KL}}\left[ E(\cdot|\mu)\, \|\, \rho(\cdot) \right] \right].    (13)

We can further simplify the objective by converting Equation 11 into a soft constraint, to yield the following information-regularized objective,

\arg\max_{\pi, E}\ \mathbb{E}_{\mu \sim p(\mu)}\, \mathbb{E}_{z \sim E(z|\mu)}\, \mathbb{E}_{\tau \sim p(\tau|\pi, \mu, z)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right] - \beta\, \mathbb{E}_{\mu \sim p(\mu)} \left[ D_{\mathrm{KL}}\left[ E(\cdot|\mu)\, \|\, \rho(\cdot) \right] \right],    (14)

with β ≥ 0 being a Lagrange multiplier. In our experiments, we model the encoder E(z|µ) = N(m(µ), Σ(µ)) as a Gaussian distribution with mean m(µ) and standard deviation Σ(µ), and the prior ρ(z) = N(0, I) is given by the unit Gaussian. A sketch of the resulting KL penalty is given below.
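Because both the encoder and the prior are diagonal Gaussians, the KL term in Equation 14 has a closed form. The sketch below computes this penalty; PyTorch is used for illustration, and the way it is combined with the policy-gradient loss is an assumption about the training setup.

```python
# Minimal sketch of the information penalty in Eq. (14): KL( N(m(mu), diag(std^2)) || N(0, I) ),
# scaled by beta, averaged over a batch of sampled dynamics parameters mu ~ p(mu).
import torch

def information_penalty(enc_mean, enc_log_std, beta):
    var = torch.exp(2.0 * enc_log_std)
    kl = 0.5 * torch.sum(var + enc_mean ** 2 - 1.0 - 2.0 * enc_log_std, dim=-1)
    return beta * kl.mean()

# During training, this penalty is added to the RL loss, e.g.:
#   total_loss = policy_loss + information_penalty(enc_mean, enc_log_std, beta)
```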

This objective can be interpreted as training a policy that maximizes the agent's expected return across different dynamics, while also being able to adapt its behaviors when necessary by relying on only a minimal amount of information from the ground-truth dynamics parameters. In our formulation, the Lagrange multiplier β provides a trade-off between robustness and adaptability. Large values of β restrict the amount of information that the policy can access from µ. In the limit β → ∞, the policy converges to a robust but non-adaptive policy that does not access the underlying dynamics parameters. Conversely, small values of β → 0 provide the policy with unfettered access to the dynamics parameters, which can result in brittle strategies where the policy's behaviors overfit to the nuances of each setting of the dynamics parameters, potentially leading to poor generalization to real-world dynamics.

TABLE I
DYNAMICS PARAMETERS AND THEIR RESPECTIVE RANGES OF VALUES USED DURING TRAINING AND TESTING. A LARGER RANGE OF VALUES IS USED DURING TESTING TO EVALUATE THE POLICIES' ABILITY TO GENERALIZE TO UNFAMILIAR DYNAMICS.

Parameter          Training Range                 Testing Range
Mass               [0.8, 1.2] × default value     [0.5, 2.0] × default value
Inertia            [0.5, 1.5] × default value     [0.4, 1.6] × default value
Motor Strength     [0.8, 1.2] × default value     [0.7, 1.3] × default value
Motor Friction     [0, 0.05] N·m·s/rad            [0, 0.075] N·m·s/rad
Latency            [0, 0.04] s                    [0, 0.05] s
Lateral Friction   [0.05, 1.25] N·s/m             [0.04, 1.35] N·s/m

C. Real World Transfer

To adapt a policy to the real world, we directly search for an encoding z* that maximizes the return on the physical system,

z^* = \arg\max_z\ \mathbb{E}_{\tau \sim p^*(\tau|\pi, z)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right],    (15)

with p*(τ|π, z) being the trajectory distribution under real-world dynamics. To identify z*, we use advantage-weighted regression (AWR) [40, 46], a simple off-policy RL algorithm. Algorithm 1 summarizes the adaptation process.

Algorithm 1 Adaptation with Advantage-Weighted Regression
1: π ← trained policy
2: ω_0 ← N(0, I)
3: D ← ∅
4: for iteration k = 0, ..., k_max − 1 do
5:   z_k ← sampled encoding from ω_k(z)
6:   Rollout an episode with π conditioned on z_k and record the return R_k
7:   Store (z_k, R_k) in D
8:   v̄ ← (1/k) Σ_{i=1}^{k} R_i
9:   ω_{k+1} ← arg max_ω Σ_{i=1}^{k} log ω(z_i) exp( (1/α)(R_i − v̄) )
10: end for

The search distribution is initialized with the prior ω_0(z) = N(0, I). At each iteration k, we sample an encoding from the current distribution z_k ∼ ω_k(z) and execute an episode with the policy conditioned on z_k. The return R_k for the episode is recorded and stored along with z_k in a replay buffer D containing all samples from previous iterations. ω_k(z) is then updated by fitting a new distribution that assigns higher likelihoods to samples with larger advantages. The likelihood of each sample z_i is weighted by the exponentiated advantage exp( (1/α)(R_i − v̄) ), where the baseline v̄ is the average return of all samples in D, and α is a manually specified temperature parameter. Note that, since ω_k(z) is Gaussian, the optimal distribution at each iteration (Line 9) can be determined analytically. However, we found that the analytic solution is prone to premature convergence to a suboptimal solution. Instead, we update ω_k(z) incrementally using a few steps of gradient descent. This process is repeated for k_max iterations, and the mean of the final distribution ω_{k_max}(z) is used as an approximation of the optimal encoding z* for deploying the policy in the real world. A sketch of this adaptation loop is given below.
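The sketch below follows the structure of Algorithm 1. The `rollout_return(policy, z)` callback, which runs one episode on the robot with the policy conditioned on z, is a hypothetical interface; for simplicity the search distribution is refit with the analytic advantage-weighted Gaussian update (plus a noise floor) rather than the incremental gradient steps described above.

```python
# Minimal sketch of Algorithm 1: latent-space adaptation with advantage-weighted regression.
import numpy as np

def adapt_latent(policy, rollout_return, latent_dim, k_max=50, alpha=1.0):
    mean, std = np.zeros(latent_dim), np.ones(latent_dim)   # omega_0(z) = N(0, I)
    zs, returns = [], []                                     # replay buffer D

    for k in range(k_max):
        z = mean + std * np.random.randn(latent_dim)         # z_k ~ omega_k(z)
        zs.append(z)
        returns.append(rollout_return(policy, z))             # one real-world trial

        if k >= 1:  # need at least two samples before refitting the search distribution
            baseline = np.mean(returns)                        # v_bar: average return in D
            weights = np.exp((np.array(returns) - baseline) / alpha)
            weights /= np.sum(weights)
            Z = np.stack(zs)
            mean = weights @ Z                                 # advantage-weighted Gaussian fit
            std = np.maximum(np.sqrt(weights @ (Z - mean) ** 2), 0.1)  # keep some exploration

    return mean                                                # approximation of z*
```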
VII. EXPERIMENTAL EVALUATION

We evaluate our robotic learning system by learning to imitate a variety of dynamic locomotion skills using the Laikago robot [61], an 18 degrees-of-freedom quadruped with 3 actuated degrees-of-freedom per leg, and 6 under-actuated degrees of freedom for the root (torso). Behaviors learned by the policies are best seen in the supplementary video¹, and snapshots of the behaviors are also available in Figure 4. In the following experiments, we aim to evaluate the effectiveness of our framework on learning a diverse set of quadruped skills, and study how well real-world adaptation can enable more agile behaviors. We show that our adaptation method can efficiently transfer policies trained in simulation to the real world with a small number of trials on the physical system. We further study the effects of regularizing the latent dynamics encoding with an information bottleneck, and show that this provides a mechanism to trade off between the robustness and adaptability of the learned policies.

A. Experimental Setup

Retargeting via inverse-kinematics and simulated training is performed using PyBullet [10]. Table I summarizes the dynamics parameters and their respective ranges of values; at the start of each training episode the parameters are sampled from the training ranges, as sketched below. The motion dataset contains a mixture of mocap clips recorded from a dog and clips from artist-generated animations. The mocap clips are collected from a public dataset [68] and retargeted to the Laikago following the procedure in Section IV. Figure 5 lists the skills learned by the robot and summarizes the performance of the policies when deployed in the real world. Motion clips recorded from a dog are designated with "Dog", and the other clips correspond to artist-animated motions. Performance is recorded as the average normalized return, with 0 corresponding to the minimum possible return per episode and 1 being the maximum return. Note that the maximum return may not be achievable, since the reference motions are generally not physically feasible for the robot. Performance is calculated using the average of 3 policies initialized with different random seeds.
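The sketch below draws one setting of the dynamics parameters µ ∼ p(µ) over the Table I training ranges. The parameter names and the dictionary interface are illustrative assumptions; in practice each value would be applied to the PyBullet model at episode reset.

```python
# Minimal sketch of episode-level domain randomization over the Table I training ranges.
import numpy as np

TRAINING_RANGES = {
    "mass_scale":       (0.8, 1.2),    # multiplier on default link masses
    "inertia_scale":    (0.5, 1.5),    # multiplier on default link inertias
    "motor_strength":   (0.8, 1.2),    # multiplier on default motor torque limits
    "motor_friction":   (0.0, 0.05),   # N·m·s/rad
    "latency":          (0.0, 0.04),   # s
    "lateral_friction": (0.05, 1.25),  # N·s/m
}

def sample_dynamics(rng=np.random):
    """Draw one setting of the dynamics parameters mu ~ p(mu) for a training episode."""
    return {name: rng.uniform(low, high) for name, (low, high) in TRAINING_RANGES.items()}
```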

Fig. 4. Laikago robot performing skills learned by imitating reference motions: (a) Dog Pace, (b) Dog Backwards Trot, (c) Side-Steps, (d) Turn, (e) Hop-Turn, (f) Running Man. Top: Reference motion. Middle: Simulated robot. Bottom: Real robot.

Each policy is trained with proximal policy optimization using about 200 million samples in simulation [53]. Both the encoder and policy are trained end-to-end using the reparameterization trick [29]. Domain adaptation is performed on the physical system with AWR in the latent dynamics space, using approximately 50 real-world trials to adapt each policy. Trials vary between 5 s and 10 s in length depending on the space requirements of each skill. Hyperparameter settings are available in Appendix A.

Model representation. All policies are modeled using the neural network architecture shown in Figure 6; a minimal code sketch is given below. The encoder E(z|µ) is represented by a fully-connected network that maps the dynamics parameters µ to the mean m_E(µ) and standard deviation Σ_E(µ) of the encoder distribution. The policy network π(a|s, g, z) receives as input the state s, goal g, and dynamics encoding z, then outputs the mean m_π(s, g, z) of a Gaussian action distribution. The standard deviation Σ_π = diag(σ_π^1, σ_π^2, ...) of the action distribution is represented by a fixed matrix. The value function V(s, g, µ) receives as input the state, goal, and dynamics parameters.
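The sketch below mirrors the architecture description above, using the layer sizes from Figure 6 (256/128 ReLU units for the encoder, 512/256 units for the policy). PyTorch is used for illustration, and the input and output dimensions are placeholders; the value function follows the same pattern with a separate 512/256 network.

```python
# Minimal sketch of the adaptive policy architecture (encoder + policy) from Fig. 6.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """E(z|mu): maps dynamics parameters mu to a diagonal Gaussian over the latent z."""
    def __init__(self, mu_dim, z_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(mu_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU())
        self.mean = nn.Linear(128, z_dim)
        self.log_std = nn.Linear(128, z_dim)

    def forward(self, mu):
        h = self.trunk(mu)
        return self.mean(h), self.log_std(h)

class Policy(nn.Module):
    """pi(a|s, g, z): outputs the mean of a Gaussian action distribution; fixed std."""
    def __init__(self, s_dim, g_dim, z_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + g_dim + z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 256), nn.ReLU(),
                                 nn.Linear(256, a_dim))
        self.log_std = nn.Parameter(torch.zeros(a_dim), requires_grad=False)  # fixed diagonal std

    def forward(self, s, g, z):
        return self.net(torch.cat([s, g, z], dim=-1))
```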

Fig. 5. Performance statistics of imitating various skills in the real world. Performance is recorded as the average normalized return between [0, 1]. Three policies initialized with different random seeds are trained for each combination of skill and method. The performance of each policy is evaluated over 5 episodes, for a total of 15 trials per method. The adaptive policies outperform the non-adaptive policies on most skills.

Fig. 6. Schematic illustration of the network architecture used for the adaptive policy. The encoder E(z|µ) receives the dynamics parameters µ as input, which are processed by two fully-connected layers with 256 and 128 ReLU units, and then mapped to a Gaussian distribution over the latent space Z with mean m_E(µ) and standard deviation Σ_E(µ). An encoding z is sampled from the encoder distribution and provided to the policy π(a|s, g, z) as input, along with the state s and goal g. The policy is modeled with two layers of 512 and 256 units, followed by an output layer which specifies the mean m_π(s, g, z) of the action distribution. The standard deviation Σ_π of the action distribution is specified by a fixed diagonal matrix. The value function V(s, g, µ) is modeled by a separate network with 512 and 256 hidden units.

B. Learned Skills

Our framework is able to learn a diverse set of locomotion skills for the Laikago, including dynamic gaits, such as pacing and trotting, as well as agile turning and spinning motions (Figure 4). Pacing is typically used for walking at slower speeds, and is characterized by each pair of legs on the same side of the body moving in unison (Figure 4(a)) [50]. Trotting is a faster gait, where diagonal pairs of legs move together (Figure 1). We are able to train policies for these different gaits just by providing the system with different reference motions. Furthermore, by simply playing the mocap clips backwards, we are able to train policies for different backwards walking gaits (Figure 4(b)). The gaits learned by our policies are faster than those of the manually-designed controller from the manufacturer. The fastest manufacturer gait reaches a top speed of about 0.84 m/s, while the Dog Trot policy reaches a speed of 1.08 m/s. The backwards trotting gait reaches an even higher speed of 1.20 m/s. In addition to imitating mocap data from animals, our system is also able to learn from artist-animated motions. While these hand-animated motions are generally not physically correct, the policies are nonetheless able to closely imitate most motions with the real robot. This includes a highly dynamic Hop-Turn motion, in which the robot performs a 90 degree turn midair (Figure 4(e)). While our system is able to imitate a variety of motions, some motions, such as Running Man (Figure 4(f)), prove challenging to reproduce. The motion requires the robot to travel backwards while moving in a forward-walking manner.
