Episodic Learning with Control Lyapunov Functions for Uncertain Robotic Systems*

Andrew J. Taylor¹, Victor D. Dorobantu¹, Hoang M. Le, Yisong Yue, Aaron D. Ames

Abstract: Many modern nonlinear control methods aim to endow systems with guaranteed properties, such as stability or safety, and have been successfully applied to the domain of robotics. However, model uncertainty remains a persistent challenge, weakening theoretical guarantees and causing implementation failures on physical systems. This paper develops a machine learning framework centered around Control Lyapunov Functions (CLFs) to adapt to parametric uncertainty and unmodeled dynamics in general robotic systems. Our proposed method proceeds by iteratively updating estimates of Lyapunov function derivatives and improving controllers, ultimately yielding a stabilizing quadratic program model-based controller. We validate our approach on a planar Segway simulation, demonstrating substantial performance improvements by iteratively refining on a base model-free controller.

Fig. 1. CAD model & physical system, a modified Ninebot Segway.

I. INTRODUCTION

The use of Control Lyapunov Functions (CLFs) [4], [38] for nonlinear control of robotic systems is becoming increasingly popular [26], [17], [29], often utilizing quadratic program (QP) controllers [2], [1], [17]. While effective, one major challenge is the need for extensive tuning, which is largely due to modeling deficiencies such as parametric error and unmodeled dynamics (cf. [26]). While there has been much research in developing robust control methods that maintain stability under uncertainty (e.g., via input-to-state stability [39]) or in adapting to limited forms of uncertainty (e.g., adaptive control [23], [20]), relatively little work has been done on systematically reducing uncertainty while maintaining stability for general function classes of models.

We take a machine learning approach to address the above limitations.
Learning-based approaches have already shown great promise for controlling imperfectly modeled robotic platforms [22], [35]. Successful learning-based approaches have typically focused on learning model-based uncertainty [5], [8], [7], [37], or direct model-free controller design [25], [36], [14], [42], [24].

We are particularly interested in learning-based approaches that guarantee Lyapunov stability [21]. From that perspective, the bulk of previous work has focused on using learning to construct a Lyapunov function [31], [12], [30], or to assess the region of attraction for a Lyapunov function [9], [6]. One limitation of previous work is that learning is conducted over the full-dimensional state space, which can be data inefficient. We instead constructively prescribe a CLF, and focus on learning only the necessary information to choose control inputs that achieve the associated stability guarantees, which can be much lower-dimensional.

¹ Both authors contributed equally. All authors are with the Department of Computing roban@caltech.edu, hmle@caltech.edu, yyue@caltech.edu, ames@caltech.edu

One challenge in developing learning-based methods for controller improvement is how best to collect training data that accurately reflects the desired operating environment and control goals. In particular, exhaustive data collection typically scales exponentially with the dimensionality of the joint state and control output space, and so should be avoided. But pre-collecting data upfront can lead to poor performance, as downstream control behavior may enter states that are not present in the pre-collected training data. We will leverage episodic learning approaches such as Dataset Aggregation (DAgger) [33] to address these challenges in a data-efficient manner, leading to iteratively refined controllers.

In this paper we present a novel episodic learning approach that utilizes CLFs to iteratively improve controller design and achieve Lyapunov stability.
To the best of our knowledge, our approach is the first that combines CLFs with general supervised learning (e.g., including deep learning) in a mathematically integrated way. Another distinctive aspect is that our approach performs learning on the projection of state dynamics onto the CLF time derivative, which can be much lower dimensional than learning the full state dynamics or the region of attraction.

Our paper is organized as follows. Section II reviews input-output feedback linearization focused on constructing CLFs for unconstrained robotic systems. Section III discusses model uncertainty of a general robotic system and establishes assumptions on the structure of this uncertainty.

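These assumptions allow us to prescribe a CLF for the true system, but leave open the question of how to model its time derivative. Section IV provides an episodic learning approach to iteratively improving a model of the time derivative of the CLF. We also present a variant of optimal CLF-based control that integrates the learned representation. Finally, Section V provides simulation results on a model of a modified Ninebot by Segway E+, seen in Fig. 1. We also provide a Python software package (LyaPy) implementing our experiments and learning framework.¹

¹ https://github.com/vdorobantu/lyapy

II. PRELIMINARIES ON CLFS

This section provides a brief review of input-output feedback linearization, a control technique which can be used to synthesize a CLF. The resulting CLF will be used to quantify the impact of model uncertainty and specify the learning problem outlined in Section III.

A. Input-Output Linearization

Input-Output (IO) Linearization is a nonlinear control method that creates stable linear dynamics for a selected set of outputs of a system [34]. Outputs encode information such as the position of a floating-base robot or robotic arm end effector as a function of configuration in a way that is useful for designing controllers. Additionally, IO Linearization provides a constructive method for generating Lyapunov functions, a central tool in certifying stability and synthesizing controllers for nonlinear systems.

Consider an affine robotic control system with configuration space $\mathcal{Q} \subseteq \mathbb{R}^n$ and an input space $\mathcal{U} \subseteq \mathbb{R}^m$. Assume $\mathcal{Q}$ is path-connected and non-empty. The dynamics of the system are specified by:

$$D(q)\ddot{q} + \underbrace{C(q,\dot{q})\dot{q} + G(q)}_{H(q,\dot{q})} = Bu, \tag{1}$$

with generalized coordinates $q \in \mathcal{Q}$, coordinate rates $\dot{q} \in \mathbb{R}^n$, input $u \in \mathcal{U}$, inertia matrix $D : \mathcal{Q} \to \mathbb{S}^n_{++}$, centrifugal and Coriolis terms $C : \mathcal{Q} \times \mathbb{R}^n \to \mathbb{R}^{n \times n}$, gravitational forces $G : \mathcal{Q} \to \mathbb{R}^n$, and static actuation matrix $B \in \mathbb{R}^{n \times m}$. Here $\mathbb{S}^n_{++}$ denotes the set of $n \times n$ symmetric positive definite matrices. Define twice-differentiable outputs $y : \mathcal{Q} \to \mathbb{R}^k$, with $k \le m$, and assume each output has relative degree 2 on some domain $\mathcal{R} \subseteq \mathcal{Q}$ (see [34] for details). Intuitively, the relative degree assumption implies that no configuration in $\mathcal{R}$ results in an inability to actuate the system. Consider the time interval $\mathcal{I} = [t_0, t_f]$ for initial and final times $t_0$, $t_f$ and define twice-differentiable time-dependent desired outputs $y_d : \mathcal{I} \to \mathbb{R}^k$ with $r(t) = [\,y_d(t)^\top \;\; \dot{y}_d(t)^\top\,]^\top$. The error between actual and desired outputs (referred to as virtual constraints [45]) yields the dynamic system:

$$\frac{d}{dt}\begin{bmatrix} y(q) - y_d(t) \\ \dot{y}(q,\dot{q}) - \dot{y}_d(t) \end{bmatrix} = \underbrace{\begin{bmatrix} \frac{\partial y}{\partial q}\dot{q} \\ \frac{\partial \dot{y}}{\partial q}\dot{q} - \frac{\partial y}{\partial q} D(q)^{-1} H(q,\dot{q}) \end{bmatrix}}_{f(q,\dot{q})} - \underbrace{\begin{bmatrix} \dot{y}_d(t) \\ \ddot{y}_d(t) \end{bmatrix}}_{\dot{r}(t)} + \underbrace{\begin{bmatrix} 0_{k \times m} \\ \frac{\partial y}{\partial q} D(q)^{-1} B \end{bmatrix}}_{g(q)} u, \tag{2}$$

noting that $\frac{\partial \dot{y}}{\partial \dot{q}} = \frac{\partial y}{\partial q}$. For all $q \in \mathcal{R}$, $g(q)$ is full rank by the relative degree assumption. Define $\eta : \mathcal{Q} \times \mathbb{R}^n \times \mathcal{I} \to \mathbb{R}^{2k}$, $\tilde{f} : \mathcal{Q} \times \mathbb{R}^n \to \mathbb{R}^k$, and $\tilde{g} : \mathcal{Q} \to \mathbb{R}^{k \times m}$ as:

$$\eta(q,\dot{q},t) = \begin{bmatrix} y(q) - y_d(t) \\ \dot{y}(q,\dot{q}) - \dot{y}_d(t) \end{bmatrix} \tag{3}$$

$$\tilde{f}(q,\dot{q}) = \frac{\partial \dot{y}}{\partial q}\dot{q} - \frac{\partial y}{\partial q} D(q)^{-1} H(q,\dot{q}) \tag{4}$$

$$\tilde{g}(q) = \frac{\partial y}{\partial q} D(q)^{-1} B, \tag{5}$$

and assume $\mathcal{U} = \mathbb{R}^m$. The input-output linearizing control law $k_{IO} : \mathcal{Q} \times \mathbb{R}^n \times \mathcal{I} \to \mathcal{U}$ is specified by:

$$k_{IO}(q,\dot{q},t) = \tilde{g}(q)^{\dagger}\left(-\tilde{f}(q,\dot{q}) + \ddot{y}_d(t) + \nu(q,\dot{q},t)\right), \tag{6}$$

with auxiliary input $\nu(q,\dot{q},t) \in \mathbb{R}^k$ for all $q \in \mathcal{Q}$, $\dot{q} \in \mathbb{R}^n$, and $t \in \mathcal{I}$, where $\dagger$ denotes the Moore-Penrose pseudoinverse. Eliminating nonlinear terms, this controller used in (2) generates linear error dynamics of the form:

$$\dot{\eta}(q,\dot{q},t) = \underbrace{\begin{bmatrix} 0_{k \times k} & I_{k \times k} \\ 0_{k \times k} & 0_{k \times k} \end{bmatrix}}_{F} \eta(q,\dot{q},t) + \underbrace{\begin{bmatrix} 0_{k \times k} \\ I_{k \times k} \end{bmatrix}}_{G} \nu(q,\dot{q},t), \tag{7}$$

where $(F, G)$ are a controllable pair. Defining gain matrix $K = [\,K_p \;\; K_d\,]$ where $K_p, K_d \in \mathbb{S}^k_{++}$, the auxiliary control input $\nu(q,\dot{q},t) = -K\eta(q,\dot{q},t)$ induces error dynamics:

$$\dot{\eta}(q,\dot{q},t) = A_{cl}\,\eta(q,\dot{q},t), \tag{8}$$

where $A_{cl} = F - GK$ is Hurwitz. This implies the desired output trajectory $y_d$ is exponentially stable, allowing us to construct a Lyapunov function for the system using converse theorems found in [21]. With $A_{cl}$ Hurwitz, for any $Q \in \mathbb{S}^{2k}_{++}$, there exists a unique $P \in \mathbb{S}^{2k}_{++}$ such that the Continuous Time Lyapunov Equation (CTLE):

$$A_{cl}^\top P + P A_{cl} = -Q, \tag{9}$$

is satisfied.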
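The construction of $P$ from the CTLE (9) can be sketched numerically. The block below is a minimal illustration and not the LyaPy implementation: it assumes a single output ($k = 1$), hypothetical gains $K_p = K_d = 4$, the choice $Q = I$, and a helper named `solve_ctle` introduced here, which solves the linear matrix equation by vectorization with NumPy.

```python
import numpy as np

def solve_ctle(A_cl, Q):
    """Solve A_cl^T P + P A_cl = -Q for P by vectorizing the linear equation."""
    n = A_cl.shape[0]
    I = np.eye(n)
    # With row-major flattening, vec(A^T P) = (A^T kron I) vec(P)
    # and vec(P A) = (I kron A^T) vec(P).
    M = np.kron(A_cl.T, I) + np.kron(I, A_cl.T)
    P = np.linalg.solve(M, -Q.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off

# Error dynamics (7) for k = 1 with hypothetical gains K = [K_p, K_d] = [4, 4].
F = np.array([[0.0, 1.0], [0.0, 0.0]])
G = np.array([[0.0], [1.0]])
K = np.array([[4.0, 4.0]])
A_cl = F - G @ K          # Hurwitz: both eigenvalues equal -2
Q = np.eye(2)
P = solve_ctle(A_cl, Q)

def V(eta):
    """Quadratic Lyapunov function V(eta) = eta^T P eta."""
    return eta @ P @ eta

def V_dot(eta):
    """Time derivative of V along the closed-loop dynamics eta_dot = A_cl eta."""
    return eta @ (A_cl.T @ P + P @ A_cl) @ eta
```

Because $A_{cl}$ is Hurwitz, the vectorized system is nonsingular, and the resulting $P$ is positive definite for any positive definite $Q$.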
Let $\mathcal{C} = \{\eta(q,\dot{q},t) : (q,\dot{q}) \in \mathcal{R} \times \mathbb{R}^n,\ t \in \mathcal{I}\}$. Then $V(\eta) = \eta^\top P \eta$, implicitly a function of $q$, $\dot{q}$, and $t$, is a Lyapunov function certifying exponential stability of (8) on $\mathcal{C}$ satisfying:

$$\lambda_{\min}(P)\|\eta\|_2^2 \le V(\eta) \le \lambda_{\max}(P)\|\eta\|_2^2, \qquad \dot{V}(\eta) \le -\lambda_{\min}(Q)\|\eta\|_2^2, \tag{10}$$

for all $\eta \in \mathcal{C}$. Here $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the minimum and maximum eigenvalues of a symmetric matrix, respectively. A similar Lyapunov function can be constructed directly from (7) using the Continuous Algebraic Riccati Equation (CARE) [1], [21].

B. Control Lyapunov Functions

Lyapunov functions encode information about the dynamics in a low-dimensional representation suitable for learning. The preceding formulation of a Lyapunov function required the choice of the specific control law given in (6) to analyze stability of a closed-loop system. More generally, Control Lyapunov Functions (CLFs) extend this idea to enable synthesis of optimal nonlinear controllers. Let $\mathcal{C} \subseteq \mathbb{R}^{2k}$. A function $V : \mathbb{R}^{2k} \to \mathbb{R}_+$ is a Control Lyapunov Function (CLF) for (1) on $\mathcal{C}$ certifying exponential stability if there exist constants $c_1, c_2, c_3 > 0$ such that:

$$c_1\|\eta\|_2^2 \le V(\eta) \le c_2\|\eta\|_2^2, \qquad \inf_{u \in \mathcal{U}} \dot{V}(\eta, u) \le -c_3\|\eta\|_2^2, \tag{11}$$

for all $\eta \in \mathcal{C}$. We see that the previously constructed Lyapunov function satisfying (10) satisfies (11) by choosing the control law specified in (6). In the absence of a specific control law, we may write the CLF time derivative as:

$$\dot{V}(\eta, u) = \frac{\partial V}{\partial \eta}\dot{\eta} = \frac{\partial V}{\partial \eta}\left(f(q,\dot{q}) - \dot{r}(t) + g(q)u\right). \tag{12}$$

Dynamic information directly appears within the scalar function $\dot{V}$. Also note that $\dot{V}$ is affine in $u$, leading to a QP based control law $k_{QP} : \mathcal{Q} \times \mathbb{R}^n \times \mathcal{I} \to \mathcal{U}$ given by:

$$k_{QP}(q,\dot{q},t) = \arg\min_{u \in \mathcal{U}} \ \tfrac{1}{2}u^\top M u + s^\top u + r \quad \text{s.t.} \quad \dot{V}(\eta, u) \le -c_3\|\eta\|_2^2, \tag{13}$$

for $M \in \mathbb{S}^m_{+}$, $s \in \mathbb{R}^m$, and $r \in \mathbb{R}$, provided $\mathcal{U}$ is a polyhedron. Here $\mathbb{S}^m_{+}$ denotes the set of $m \times m$ symmetric positive semi-definite matrices.

III. UNCERTAINTY MODELS & LEARNING

This section defines the class of model uncertainty we consider in this work, investigates its impact on the control system, and concludes with motivation for a data-driven approach to mitigate this impact.

A. Uncertainty Modeling Assumptions

As defined in Section II, we consider affine robotic control systems that evolve under dynamics described by (1). In practice, we do not know the dynamics of the system exactly, and instead develop our control systems using the estimated model:

$$\hat{D}(q)\ddot{q} + \underbrace{\hat{C}(q,\dot{q})\dot{q} + \hat{G}(q)}_{\hat{H}(q,\dot{q})} = \hat{B}u. \tag{14}$$

We assume the estimated model (14) satisfies the relative degree condition on the domain $\mathcal{R}$, and thus may use the method of input-output linearization to produce a Control Lyapunov Function (CLF), $V$, for the system. Using the definitions established in (2) in conjunction with the estimated model, we see that the true system evolves as:

$$\dot{\eta} = \hat{f}(q,\dot{q}) - \dot{r}(t) + \hat{g}(q)u + \underbrace{\left(g(q) - \hat{g}(q)\right)}_{A(q)} u + \underbrace{f(q,\dot{q}) - \hat{f}(q,\dot{q})}_{\mathbf{b}(q,\dot{q})}. \tag{15}$$

We note the following features of modeling uncertainty in this fashion:

• Uncertainty is allowed to enter the system dynamics via parametric error as well as through completely unmodeled dynamics. In particular, the function $H$ can capture a wide variety of nonlinear behavior and only needs to be Lipschitz continuous.
• This formulation explicitly allows uncertainty in how the input is introduced into the dynamics via uncertainty in the inertia matrix $D$ and static actuation matrix $B$. This definition of uncertainty is also compatible with a dynamic actuation matrix $B : \mathcal{Q} \times \mathbb{R}^n \to \mathbb{R}^{n \times m}$ given proper assumptions on the relative degree of the system.

Given this formulation of our uncertainty, we make the following assumptions of the true dynamics:

Assumption 1. The true system is assumed to be deterministic, time invariant, and affine in the control input.

Assumption 2. The CLF $V$, formulated for the estimated model, is a CLF for the true system.

It is sufficient to assume that the true system has relative degree 2 on the domain $\mathcal{R}$ to satisfy Assumption 2. This holds since the true values of $\tilde{f}$ and $\tilde{g}$, if known, enable choosing control inputs as in (6) that respect the same linear error dynamics (8). Given that $V$ is a CLF for the true system, its time derivative under uncertainty is given by:

$$\dot{V}(\eta, u) = \underbrace{\frac{\partial V}{\partial \eta}\left(\hat{f}(q,\dot{q}) - \dot{r}(t) + \hat{g}(q)u\right)}_{\hat{\dot{V}}(\eta, u)} + \underbrace{\frac{\partial V}{\partial \eta}A(q)}_{a(\eta, q)^\top} u + \underbrace{\frac{\partial V}{\partial \eta}\mathbf{b}(q,\dot{q})}_{b(\eta, q, \dot{q})}, \tag{16}$$

for all $\eta \in \mathbb{R}^{2k}$ and $u \in \mathcal{U}$. While $V$ is a CLF for the true system, it is no longer possible to determine if a specific control value will satisfy the derivative condition in (11) due to the unknown components $a$ and $b$.
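The decomposition in (16) is an algebraic identity, which the following sketch checks numerically at a single state. All vectors and matrices are random placeholders (hypothetical values, not a robot model), and the dimensions `k` and `m` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 2, 2                      # hypothetical output and input dimensions

# Placeholder true and estimated terms of (2), evaluated at one state.
f, f_hat = rng.normal(size=2 * k), rng.normal(size=2 * k)
g, g_hat = rng.normal(size=(2 * k, m)), rng.normal(size=(2 * k, m))
r_dot = rng.normal(size=2 * k)
eta, u = rng.normal(size=2 * k), rng.normal(size=m)

# V(eta) = eta^T P eta with an arbitrary symmetric positive definite P.
L = rng.normal(size=(2 * k, 2 * k))
P = L @ L.T + np.eye(2 * k)
grad_V = 2.0 * eta @ P           # row vector dV/d(eta)

V_dot_true = grad_V @ (f - r_dot + g @ u)          # (12) for the true system
V_dot_hat = grad_V @ (f_hat - r_dot + g_hat @ u)   # model-based estimate
a = grad_V @ (g - g_hat)         # a(eta, q): m unknown scalars
b = grad_V @ (f - f_hat)         # b(eta, q, q_dot): one unknown scalar
```

In this example the dynamics residuals $A$ and $\mathbf{b}$ have $2km + 2k = 12$ entries, whereas only $m + 1 = 3$ scalars ($a$ and $b$) enter $\dot{V}$.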
Rather than form a new Lyapunov function, we seek to better estimate the Lyapunov function derivative $\dot{V}$ to enable control selection that satisfies the exponential stability requirement. This estimate should be affine in the control input, enabling its use in the controller described in (13). Instead of learning the unknown dynamics terms $A$ and $\mathbf{b}$, which scale with both the dimension of the configuration space and the number of inputs, we will learn the terms $a$ and $b$, which scale only with the number of inputs. In the case of the planar Segway model we simulate, we reduce the number of learned components from 4 to 2 (assuming kinematics are known). These learned representations need to accurately capture the uncertainty over the domain in which the system is desired to evolve to ensure stability during operation.

B. Motivating a Data-Driven Learning Approach

The formulation from (15) and (16) defines a general class of dynamics uncertainty. It is natural to consider a data-driven method to estimate the unknown quantities $a$ and $b$ over the domain of the system. To motivate our learning-based framework, first consider a simple approach of learning $a$ and $b$ via supervised regression [19]: we operate the system using some given state-feedback controller to gather data points along the system's evolution and learn a function that approximates $a$ and $b$ via supervised learning.

Concretely, let $q_0 \in \mathcal{Q}$ be an initial configuration. An experiment is defined as the evolution of the system over a finite time interval from the initial condition $(q_0, 0)$ using a discrete-time implementation of the given controller. A resulting discrete-time state history is obtained, which is then transformed with the Lyapunov function $V$ and finally differentiated numerically to estimate $\dot{V}$ throughout the experiment. This yields a data set comprised of input-output pairs:

$$\mathcal{D} = \{((q_i, \dot{q}_i, \eta_i, u_i), \dot{V}_i)\}_{i=1}^{N} \subseteq (\mathcal{Q} \times \mathbb{R}^n \times \mathbb{R}^{2k} \times \mathcal{U}) \times \mathbb{R}. \tag{17}$$

Consider a class $\mathcal{H}_a$ of nonlinear functions mapping from $\mathbb{R}^{2k} \times \mathcal{Q}$ to $\mathbb{R}^m$ and a class $\mathcal{H}_b$ of nonlinear functions mapping from $\mathbb{R}^{2k} \times \mathcal{Q} \times \mathbb{R}^n$ to $\mathbb{R}$. For a given $\hat{a} \in \mathcal{H}_a$ and $\hat{b} \in \mathcal{H}_b$, define $\hat{\dot{W}}$ as:

$$\hat{\dot{W}}(\eta, q, \dot{q}, u) = \hat{\dot{V}}(\eta, u) + \hat{a}(\eta, q)^\top u + \hat{b}(\eta, q, \dot{q}), \tag{18}$$

and let $\mathcal{H}$ be the class of all such estimators mapping $\mathbb{R}^{2k} \times \mathcal{Q} \times \mathbb{R}^n \times \mathcal{U}$ to $\mathbb{R}$. Defining a loss function $\mathcal{L} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, the supervised regression task is then to find a function in $\mathcal{H}$ via empirical risk minimization (ERM):

$$\inf_{\hat{a} \in \mathcal{H}_a,\ \hat{b} \in \mathcal{H}_b} \ \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\left(\hat{\dot{W}}(\eta_i, q_i, \dot{q}_i, u_i), \dot{V}_i\right). \tag{19}$$

This experiment protocol can be executed either in simulation or directly on hardware. While simple to implement, supervised learning critically assumes independently and identically distributed (i.i.d.) training data.
Each experiment violates this assumption, as the regression target of each data point is coupled with the input data of the next time step. As a consequence, standard supervised learning with sequential, non-i.i.d. data collection often leads to error cascades [24].

IV. INTEGRATING EPISODIC LEARNING & CLFS

In this section we present the main contribution of this work: an episodic learning algorithm that captures the uncertainty present in the Lyapunov function derivative in a learned model and utilizes it in a QP based controller.

Algorithm 1 Dataset Aggregation for Control Lyapunov Functions (DaCLyF)
Require: Control Lyapunov Function $V$, derivative estimate $\hat{\dot{V}}_0$, model classes $\mathcal{H}_a$ and $\mathcal{H}_b$, loss function $\mathcal{L}$, set of initial configurations $\mathcal{Q}_0$, nominal state-feedback controller $k_0$, number of experiments $T$, sequence of trust coefficients $0 \le w_1 \le \cdots \le w_T \le 1$
  $\mathcal{D} = \emptyset$    ▷ Initialize data set
  for $k = 1, \ldots, T$ do
    $(q_0, 0) \leftarrow \text{sample}(\mathcal{Q}_0 \times \{0\})$    ▷ Get initial condition
    $\mathcal{D}_k \leftarrow \text{experiment}((q_0, 0), k_{k-1})$    ▷ Run experiment
    $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_k$    ▷ Aggregate data set
    $\hat{a}, \hat{b} \leftarrow \text{ERM}(\mathcal{H}_a, \mathcal{H}_b, \mathcal{L}, \mathcal{D}, \hat{\dot{V}}_0)$    ▷ Fit estimators
    $\hat{\dot{V}}_k \leftarrow \hat{\dot{V}}_0 + \hat{a}^\top u + \hat{b}$    ▷ Update derivative estimator
    $k_k \leftarrow k_0 + w_k \cdot \text{augment}(k_0, \hat{\dot{V}}_k)$    ▷ Update controller
  end for
  return $\hat{\dot{V}}_T$, $u_T$

A. Episodic Learning Framework

Episodic learning refers to learning procedures that iteratively alternate between executing an intermediate controller (also known as a roll-out in reinforcement learning [22]), collecting data from that roll-out, and designing a new controller using the newly collected data. Our approach integrates learning $a$ and $b$ with improving the performance and stability of the control policy $u$ in such an iterative fashion. First, assume we are given a nominal state-feedback controller $k : \mathcal{Q} \times \mathbb{R}^n \times \mathcal{I} \to \mathcal{U}$, which may not stabilize the system. With an estimator $\hat{\dot{W}} \in \mathcal{H}$ as defined in (18), we specify an augmenting controller as:

$$k'(q,\dot{q},t) = \arg\min_{u' \in \mathbb{R}^m} \ J(u') \quad \text{s.t.} \quad \hat{\dot{W}}(\eta, q, \dot{q}, k(q,\dot{q},t) + u') \le -c_3\|\eta\|_2^2, \quad k(q,\dot{q},t) + u' \in \mathcal{U}, \tag{20}$$

where $J : \mathbb{R}^m \to \mathbb{R}$ is any positive semi-definite quadratic cost function.
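When $\mathcal{U} = \mathbb{R}^m$ and the cost is the minimum-norm choice $J(u') = \tfrac{1}{2}\|k + u'\|_2^2$, problem (20) has a single affine constraint and admits a closed-form solution. The sketch below illustrates this special case only. It assumes the estimator has already been evaluated at the current state in the form $\hat{\dot{W}} = \beta + \alpha^\top u$; the names `alpha`, `beta`, and `augmenting_input` are introduced here for illustration and are not taken from the paper or from LyaPy.

```python
import numpy as np

def augmenting_input(alpha, beta, k_nom, c3, eta):
    """Closed-form solution of (20) for U = R^m and J(u') = 0.5 * ||k + u'||^2.

    alpha, beta: estimator at the current state, W_hat(u) = beta + alpha @ u.
    k_nom: nominal input k(q, q_dot, t). Returns the augmenting input u'.
    """
    psi = beta + c3 * (eta @ eta)        # constraint reads alpha @ u <= -psi
    if psi <= 0.0:
        u_total = np.zeros_like(alpha)   # the zero input already satisfies it
    else:
        # Projection of the origin onto the half-space {u : alpha @ u <= -psi};
        # this branch requires alpha != 0.
        u_total = -psi * alpha / (alpha @ alpha)
    return u_total - k_nom
```

For a general polyhedral $\mathcal{U}$ or a different quadratic cost, a QP solver would be needed instead of this projection formula.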
This augmenting control law effectively finds the minimal addition $u'$ to the input determined by the nominal control law $k$ such that the sum stabilizes the system; however, stability degrades with error in the estimator $\hat{\dot{W}}$.

In an effort to reduce the remaining error, we use this new controller to obtain better estimates of $a$ and $b$. One option, as seen in Section III-B, is to perform experiments and use conventional supervised regression to update $\hat{a}$ and $\hat{b}$. To overcome the limitations of conventional supervised learning, we leverage reduction techniques: a sequential prediction problem is reduced to a sequence of supervised learning problems over multiple episodes [15], [32]. In particular, in each episode, an experiment generates data using a different controller. The data set is aggregated and a new ERM problem is solved after each episode. Our episodic learning implementation is inspired by the Data Aggregation algorithm (DAgger) [32], with some key differences:

• DAgger is a model-free policy learning algorithm, which trains a policy directly in each episode using optimal computational oracles. Our algorithm defines a controller indirectly via a CLF to ensure stability.
• The ERM problem is underdetermined, i.e., different approximations $(\hat{a}, \hat{b})$ may achieve similar loss for a given data set while failing to accurately model $a$ and $b$. This potentially introduces error in estimating $\dot{V}$ for control inputs not reflected in the training data, and necessitates the use of exploratory control action to constrain the estimators $\hat{a}$ and $\hat{b}$. Such exploration can be achieved by randomly perturbing the controller used in an experiment at each time step. This need for exploration is an analog to the notion of persistent excitation from adaptive systems [28].

Fig. 2. (Left) Model based QP controller fails to track trajectory. (Right) Improvement in angle tracking of system with augmented controller over nominal PD controller. (Bottom) Corresponding visualizations of state data. Note that Segway is tilted in the incorrect direction at the end of the QP controller simulation, but is correctly aligned during the augmented controller simulation. Video of this animation is found at https://youtu.be/cB5MY 8vCrQ.

Algorithm 1 specifies a method of computing a sequence of Lyapunov function derivative estimates and augmenting controllers. During each episode, the augmenting controller associated with the estimate of the Lyapunov function derivative is scaled by a heuristically chosen factor reflecting trust in the estimate and added to the nominal controller for use in the subsequent experiment. The trust coefficients form a monotonically non-decreasing sequence on the interval $[0, 1]$. Importantly, this experiment need not take place in simulation; the same procedure may be executed directly on hardware.
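The episode loop of Algorithm 1 can be sketched on a toy problem. The block below is an illustrative assumption throughout, not the paper's Segway experiment or the LyaPy code: it uses a hypothetical scalar system $\dot{\eta} = \theta\eta + \beta u$ with $V(\eta) = \eta^2$, linear-in-parameter model classes $\hat{a} = w_a\eta$ and $\hat{b} = w_b\eta^2$ fitted by least squares (in place of neural networks), the cost $J(u') = \tfrac{1}{2}u'^2$ for the augmenting controller, and a linearly increasing trust schedule. For this toy system the exact residual terms are $a = 2(\beta - \hat{\beta})\eta$ and $b = 2(\theta - \hat{\theta})\eta^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar example: true eta_dot = THETA*eta + BETA*u,
# estimated model eta_dot = THETA_HAT*eta + BETA_HAT*u, and V(eta) = eta^2.
THETA, BETA = 1.0, 1.0
THETA_HAT, BETA_HAT = 0.5, 2.0
C3, DT, STEPS = 1.0, 0.01, 300

def v_dot_hat0(eta, u):
    """Model-based derivative estimate of V."""
    return 2.0 * eta * (THETA_HAT * eta + BETA_HAT * u)

def nominal(eta):
    """Nominal controller k_0 (too weak to give V_dot <= -C3 * eta^2)."""
    return -1.2 * eta

def augment(eta, w_a, w_b):
    """Smallest u' such that the learned estimate satisfies the CLF condition."""
    alpha = (2.0 * BETA_HAT + w_a) * eta                 # coefficient of u
    psi = (2.0 * THETA_HAT + w_b + C3) * eta**2 + alpha * nominal(eta)
    if psi <= 0.0 or abs(alpha) < 1e-9:
        return 0.0
    return -psi / alpha

def experiment(eta0, w_a, w_b, trust, noise=0.2):
    """Roll out k_0 + trust * u' with exploration and record (eta, u, V_dot)."""
    etas, us, vs = [], [], [eta0**2]
    eta = eta0
    for _ in range(STEPS):
        u = nominal(eta) + trust * augment(eta, w_a, w_b)
        u *= 1.0 + rng.uniform(-noise, noise)            # exploratory perturbation
        etas.append(eta)
        us.append(u)
        eta += DT * (THETA * eta + BETA * u)             # Euler step, true system
        vs.append(eta**2)
    return np.array(etas), np.array(us), np.diff(vs) / DT

def erm(etas, us, v_dots):
    """Least-squares fit of a = w_a * eta and b = w_b * eta^2 to the residual."""
    residual = v_dots - v_dot_hat0(etas, us)
    features = np.column_stack([etas * us, etas**2])
    (w_a, w_b), *_ = np.linalg.lstsq(features, residual, rcond=None)
    return w_a, w_b

def daclyf(episodes=5):
    """Aggregate data over episodes, refitting the estimators after each one."""
    data = [np.empty(0)] * 3
    w_a, w_b, trust = 0.0, 0.0, 0.0                      # episode 1 runs k_0 alone
    for w_k in np.linspace(0.2, 1.0, episodes):
        new = experiment(rng.uniform(0.5, 1.5), w_a, w_b, trust)
        data = [np.concatenate([old, x]) for old, x in zip(data, new)]
        w_a, w_b = erm(*data)                            # refit on aggregated data
        trust = w_k                                      # controller for next episode
    return w_a, w_b
```

Calling `daclyf()` returns the fitted weights, which can be compared against $2(\beta - \hat{\beta})$ and $2(\theta - \hat{\theta})$. Without the exploratory perturbation the two regression features $\eta u$ and $\eta^2$ would be collinear under a linear feedback law, which is a small instance of the underdetermination discussed above.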
It may be infeasible to choose a specific configuration for an initial condition on a hardware platform; therefore we specify a set of initial configurations $\mathcal{Q}_0 \subseteq \mathcal{Q}$ from which an initial condition may be sampled, potentially randomly.

At a high level, this episodic approach makes progress by gathering more data in relevant regions of the state space, such as states close to a target trajectory. This extends the generalizability of the estimator in its use by subsequent controllers, and improves stability results as explored in [43].

B. Additional Controller Details

During augmentation, we specify the controller in (20) by selecting the minimum-norm cost function:

$$J(u') = \tfrac{1}{2}\|k(q,\dot{q},t) + u'\|_2^2, \tag{21}$$

for all $u' \in \mathbb{R}^m$, $q \in \mathcal{Q}$, $\dot{q} \in \mathbb{R}^n$, and $t \in \mathcal{I}$. For practical considerations we incorporate the following smoothing term into the cost:

$$R(u') = R\,\|u' - u_{prev}\|_2^2,$$

for all $u' \in \mathbb{R}^m$, where $u_{prev} \in \mathbb{R}^m$ is the previously computed augmenting controller and $R > 0$. This is done to avoid chatter that may arise from the optimization based nature of the CLF-QP formulation [27].

Note that for this choice of Lyapunov function, the gradient $\frac{\partial V}{\partial \eta}$, and therefore $a$, approach 0 as $\eta$ approaches 0, which occurs close to the desired trajectory. While the estimated Lyapunov function derivative may be fit with low absolute error on the data set, the relative error may still be high for states near the desired trajectory. Such relative error causes the optimization problem in (20) to be poorly conditioned near the desired trajectory. We therefore add a slack term $\delta \in \mathbb{R}_+$ to the decision variables, which appears in the inequality constraint [2]. The slack term is additionally incorporated into the cost as:

$$C(\delta) = \tfrac{1}{2}\,C\,\left\|\left(\frac{\partial V}{\partial \eta}\hat{g}(q)\right)^{\!\top} + \hat{a}(\eta, q)\right\|_2^2 \delta^2, \tag{22}$$

for all $\delta \in \mathbb{R}_+$, where $C > 0$. As states approach the trajectory, the coefficient of the quadratic term decreases and enables relaxation of the exponential stability inequality constraint. In practice this leads to input-to-state stable behavior, described in [40], around the trajectory.

The exploratory control during experiments is naively chosen as additive noise from a centered uniform distribution, with each coordinate drawn i.i.d. The variance is scaled by the norm of the underlying controller to introduce exploration while maintaining a high signal-to-noise ratio.

Fig. 3. Augmenting controllers consistently improve trajectory tracking across episodes. 10 instances of the algorithm are executed with the shaded region formed from minimum and maximum angles (rad) for each time step within an episode. The corresponding average angle trajectories are also displayed.

V. APPLICATION ON SEGWAY PLATFORM

In this section we apply the episodic learning algorithm constructed in Section IV to the Segway platform. In particular, we consider a 4-dimensional planar Segway model based on the simulation model in [18]. The system states consist of horizontal position and velocity, pitch angle, and pitch angle rate. Control is specified as a single voltage input supplied to both motors. The parameters of the model (including mass, inertias, and motor parameters but excluding gravity) are randomly modified by up to 10% of their nominal values and are fixed for the simulations.

We seek to track a pitch angle trajectory² generated for the estimated model. The nominal controller is a linear proportional-derivative (PD) controller on angle and angle rate error. 20 experiments are conducted with trust values varying from 0.01 to 0.99 in a sigmoid fashion. The exploratory control is drawn uniformly at random between −20% and 20% of the norm of the underlying controller for an episode for the first 10 episodes.
The percentages decay linearly to 0 in the remaining 10 episodes. The model classes selected are sets of two-layer neural networks with ReLU nonlinearities with hidden layer width of 2000 nodes³. The inputs to both models are all states and the CLF gradient.

² Trajectory was generated using the GPOPS-II Optimal Control Software.
³ Models were implemented in Keras.

Failure of the controller (13) designed for the estimated model to track the desired trajectory is seen in the left portion of Fig. 2. The baseline PD controller and the augmented controller after 20 experiments can be seen in the right portion of Fig. 2. Corresponding visualizations of the Segway states are displayed at the bottom of Fig. 2. The augmented controller exhibits a notable improvement over the model-based and PD controllers in tracking the trajectory.

To verify the robustness of the learning algorithm, the 20-experiment process was conducted 10 times. After each experiment the intermediate augmented controller was tested without exploratory perturbations. For the last three experiments and a test of the final augmented controller, the minimum, mean, and maximum angles across all 10 instances are displayed for each time step in Fig. 3. The mean trajectory consistently improves in these later episodes as the trust factor increases. The variation increases but remains small, indicating that the learning problem is robust to randomness in the initialization of the neural networks, in the network training algorithm, and in the noise added during the experiments. The performance of the controller in the earlier episodes displayed negligible variation from the baseline PD controller due to small trust factors.

VI. CONCLUSIONS & FUTURE WORK

We presented an episodic learning framework that directly integrates into an established method of nonlinear control using CLFs.
Our method allows for the effects of both parametric error and unmodeled dynamics to be learned from experimental data and incorporated into a QP controller. The success of this approach was demonstrated in simulation on a Segway, showing improvement upon a model estimate based controller.

There are two main directions for future work. First, a more thorough investigation of episodic learning algorithms can yield superior performance as well as learning-theoretic convergence guarantees. Other episodic learning approaches to consider include SEARN [13], AggreVaTeD [41], and MoBIL [10], amongst others. Second, our approach can also be applied to learning with other forms of guarantees, such as with Control Barrier Functions (CBFs) [3]. Existing work on learning CBFs is restricted to learning with Gaussian processes [44], [16], [11], and also learns over the full state space rather than over the low-dimensional projection onto the CBF time derivative.

Acknowledgements. This work was supported in part by funding and gifts from DARPA, Intel, PIMCO, and Google.

REFERENCES

[1] A. Ames, K. Galloway, K. Sreenath, and J. Grizzle. Rapidly exponentially stabilizing control Lyapunov functions and hybrid zero dynamics. IEEE Transactions on Automatic Control, 59(4):876–891, 2014.
[2] A. Ames and M. Powell. Towards the unification of locomotion and manipulation through control Lyapunov functions and quadratic programs. In Control of Cyber-Physical Systems, pages 219–240. Springer, 2013.
[3] A. Ames, X. Xu, J. Grizzle, and P. Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2017.
[4] Z. Artstein. Stabilization with relaxed controls. Nonlinear Analysis: Theory, Methods & Applications, 7(11):1163–1173, 1983.
[5] T. Beckers, D. Kulić, and S. Hirche. Stable Gaussian process based tracking control of Euler–Lagrange systems. Automatica, 103:390–397, 2019.
[6] F. Berkenkamp, R. Moriconi, A. Schoellig, and A. Krause. Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes. In 55th Conference on Decision and Control (CDC), pages 4661–4666. IEEE, 2016.
[7] F. Berkenkamp and A. Schoellig. Safe and robust learning control with Gaussian processes. In 2015 European Control Conference (ECC), pages 2496–2501. IEEE, 2015.
[8] F. Berkenkamp, A. Schoellig, and A. Krause. Safe controller optimization for quadrotors with Gaussian processes. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 491–496. IEEE, 2016.
[9] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 908–918. Curran Associates, Inc., 2017.
[10] C. Cheng, X. Yan, E. Theodorou, and B. Boots. Accelerating imitation learning with predictive models. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
[11] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. 2019.
[12] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8092–8101. Curran Associates, Inc., 2018.
[13] H. Daumé, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.
[14] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
[15] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Jo
