Combining Geometric Nonlinear Control With Reinforcement Learning

Combining Geometric Nonlinear Control with Reinforcement Learning-Enabled Control
Tyler Westenbroek
Dept. EECS, University of California, Berkeley
Adapted for EECS 290-005, February 24, 2021

Geometric Nonlinear Control
- Main idea: exploit underlying structures in the system to systematically design feedback controllers
- Explicitly connects 'global' and 'local' system structures
- Gives fine-grained control over system behavior
- Amenable to formal analysis
- Difficult to exploit under non-parametric uncertainties

Deep Reinforcement Learning
- Main idea: sample system trajectories to find an (approximately) optimal feedback controller
- 'Discovers' the connection between global and local structure
- Automatically generates complex behaviors, but requires reward shaping [Levine et al.] (IJRR 2020)
- Effectively handles non-parametric uncertainty
- Can require large amounts of data [OpenAI] (2019)

Motivating Questions
- Can we design local reward signals with global structural information 'baked in' using geometric control?
- Can we use these structures to provide correctness and safety guarantees for the learning?
- Does reinforcement learning implicitly take advantage of these structures?
- What structures make a system 'easy' to control?

Thesis Proposal
Part 1: Overcome non-parametric uncertainty by combining RL and geometric control
[Diagram: model-based (MB) controller + learned correction]

Tyler's PhD Research
Example: learning a stable walking gait with 20 seconds of data
Part 1: Overcome non-parametric uncertainty by combining RL and geometric control
[Diagram: model-based (MB) controller + learned correction]
Use structures from geometric control as a 'template' for the learning

Project Flow
Part 1: Overcome non-parametric uncertainty by combining RL and geometric control
Part 2: Provide correctness and safety guarantees for specific learning algorithms (high-probability tracking tubes)
Part 3: Future work:
- What makes a reward signal difficult to learn from?
- What makes a system fundamentally difficult to control?
- Where should geometric control be used in the long run?
[Diagram: model-based (MB) controller + learned correction]

Part 1 Outline
- Steps in the design process
- Example control architectures
  - Feedback Linearization
  - Control Lyapunov Functions
  - Other architectures
- Trade-offs with 'model-based' RL

Steps in Design Process
Step 1: Choose a geometric control architecture which produces the desired global behavior (e.g. a feedback linearizing controller)
Step 2: Augment the nominal controller with a learned component (a learned augmentation)
Step 3: Formulate a reward which captures the desired local behavior; minimize the loss with RL
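As a rough illustration of these three steps, the sketch below uses hypothetical names (`nominal_controller`, `learned_correction`, `tracking_reward`, the `model` and `features` callables) and is only a minimal Python sketch of the pattern described on this slide, not the code used in the talk.

```python
import numpy as np

def nominal_controller(x, v, model):
    """Step 1: geometric (model-based) controller computed from the nominal model."""
    b_hat, A_hat = model(x)                      # estimated drift term and decoupling matrix
    return np.linalg.solve(A_hat, v - b_hat)     # invert the nominal input-output map

def learned_correction(x, v, theta, features):
    """Step 2: learned augmentation, linear in its parameters theta."""
    return features(x, v) @ theta

def augmented_controller(x, v, theta, model, features):
    """Nominal controller plus learned correction."""
    return nominal_controller(x, v, model) + learned_correction(x, v, theta, features)

def tracking_reward(y_deriv_measured, v):
    """Step 3: local reward penalizing deviation from the desired linear behavior."""
    return -np.sum((y_deriv_measured - v) ** 2)
```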

Feedback Linearization

Goal: Output Tracking
- Consider a control-affine system with an input, a state, and an output.
- Goal: track any smooth reference with one controller

Calculating a Linearizing Controller
- For the time being, assume [...].
- To obtain a direct relationship between the inputs and outputs, we differentiate the output.
- If the term multiplying the input is nonzero at each state, then the corresponding controller yields a linear input-output relationship.

Calculating a Linearizing Controller
- For the time being, assume [...].
- To obtain a direct relationship between the inputs and outputs, we differentiate the output.
- If the term multiplying the input is nonzero at each state, then the corresponding controller yields a linear input-output relationship.
- (If this term is zero, the controller is undefined.)

Calculating a Linearizing Controller
- If the input term vanishes identically, we differentiate a second time and obtain an expression of the same form.
- If the new term multiplying the input is nonzero at each state, then the corresponding control law yields a linear input-output relationship.

Calculating a Linearizing Controller
- In general, we can keep differentiating until the input appears.
- At this point we can apply the linearizing control, which yields the desired linear input-output behavior.

'Inverting' the Dynamics
- Take time derivatives of the outputs to obtain an input-output relationship in which the input enters through a decoupling matrix.
- Applying the corresponding inverting control law yields a decoupled linear relationship from the virtual input to the outputs.
- The numbers of derivatives required for each output form the vector relative degree.
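For reference, the textbook form of this relationship for a square system is given below; the symbols are the standard ones and are assumed rather than taken from the slide, whose equations were lost in transcription.

```latex
% Standard square MIMO input-output linearization (assumed notation)
\begin{aligned}
  \begin{bmatrix} y_1^{(\gamma_1)} \\ \vdots \\ y_q^{(\gamma_q)} \end{bmatrix}
    &= b(x) + A(x)\,u, \\
  u &= A(x)^{-1}\bigl(-b(x) + v\bigr)
    \;\Longrightarrow\; y_i^{(\gamma_i)} = v_i, \quad i = 1,\dots,q .
\end{aligned}
```

Here A(x) is the decoupling matrix, assumed invertible, and (γ₁, …, γ_q) is the vector relative degree.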

Normal Form
- Choose the outputs and their derivatives as new states for the system.
- If these do not account for the full state dimension, we can 'complete the basis' by appropriately selecting extra variables.
- The output coordinates can track the reference using linear control; the remaining coordinates (the zero dynamics) may become unstable!
- (The system is minimum-phase if these are asymptotically stable.)

Zero Dynamics
- We refer to the un-driven dynamics as the zero dynamics.
- We say that the overall control system is minimum-phase if the zero dynamics are asymptotically stable.
- We say that the system is non-minimum-phase if the zero dynamics are unstable.

Tracking Desired Outputs
- To track the desired output, apply a control consisting of a feedforward term plus a feedback term on the tracking error.
- If we design the feedback gains so that the error dynamics are Hurwitz, then this control law drives the tracking error to zero exponentially quickly.
- However, the zero dynamics may not remain stable!

Model Mismatch
- Suppose we only have an approximate dynamics model.
- Why not just learn the forward dynamics? The decoupling matrix obtained from a learned model may be singular!

Directly Learning the Linearizing Controller
- We know the linearizing controllers are of a specific parametric form.
- There is a "gap" between the nominal (model-based) controller and the true linearizing controller.
- To overcome the gap, we approximate the missing term with a learned component.
"Feedback Linearization for Uncertain Systems via RL" [WFMAPST] (2020)

Penalize Deviations from Desired Linear Behavior
- We want to find a set of learned parameters such that the closed loop reproduces the desired linear input-output behavior.
- Thus, we define a point-wise loss penalizing deviation from this behavior.
- We then define the optimization problem of minimizing this loss in expectation over a distribution of states and a distribution of virtual inputs.
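A plausible reconstruction of the point-wise loss and the resulting optimization problem is sketched below; X and V denote the state and virtual-input distributions named on the slide, and the remaining notation is assumed rather than taken from the lost equations.

```latex
% Assumed notation: u_theta is the learned controller, y^{(gamma)} the resulting output derivatives
\ell(x, v; \theta) = \bigl\| y^{(\gamma)}\bigl(x,\, u_\theta(x, v)\bigr) - v \bigr\|_2^{2},
\qquad
\min_{\theta}\; \mathbb{E}_{x \sim X,\; v \sim V}\bigl[\ell(x, v; \theta)\bigr].
```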

Solutions to the Problem
Theorem [1]: Assume that the learned controller is of the given form, where the two sets of features are linearly independent. Then the optimization problem is strongly convex.
Corollary: Further assume that the true linearizing controller is realized for some feasible set of parameters. Then that set of parameters is the unique optimizer.
Remark: There are many known bases which can recover any continuous function up to a desired accuracy (e.g. radial basis functions).

Discrete-Time Approximations with Reinforcement Learning
- In practice, we use a discretized version of the reward as a running cost in an RL problem.
- The output derivatives are replaced with finite-difference approximations.
- Gaussian noise is added to the input for exploration, which enables the use of policy gradient algorithms.
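The sketch below shows one way the discretized running cost and the exploratory input could be computed; the function names, the history-based finite-difference scheme, and the shapes are assumptions for illustration, not the talk's implementation.

```python
import numpy as np

def finite_difference(y_hist, dt, order):
    """Finite-difference approximation of the order-th output derivative
    from a short history of sampled outputs (rows = time steps)."""
    d = np.asarray(y_hist, dtype=float)
    for _ in range(order):
        d = np.diff(d, axis=0) / dt
    return d[-1]

def running_cost(y_hist, v, dt, order):
    """Discretized reward: negative squared deviation of the estimated
    output derivative from the commanded virtual input v."""
    y_deriv = finite_difference(y_hist, dt, order)
    return -np.sum((y_deriv - v) ** 2)

def exploratory_input(u_nominal, sigma, rng):
    """Gaussian exploration noise added to the applied input; this is what
    makes policy-gradient estimates possible."""
    return u_nominal + sigma * rng.standard_normal(u_nominal.shape)
```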

12D Quadrotor Model
- Nominal dynamics model
- After feedback linearization
- Choose outputs

Improvement After 1 Hour of Data
"Proximal Policy Optimization Algorithms" [Schulman et al.] (2017)

Effects of Model Accuracy

7-DOF Baxter Arm: After 1 Hour of Data

Learning a Stable Walking Gait in 20 Minutes
- Feedback linearization is commonly used to design stable walking gaits for bipedal robots.
- The outputs are carefully designed so that the zero dynamics generate a stable walking gait.
"Improving I-O Linearizing Controllers for Bipedal Robots Via RL" [CWAWTSS] (2020)
"Continuous Control With Deep Reinforcement Learning" [Lillicrap et al.] (2015)

Control Lyapunov Functions

Generalized 'Energy' Functions
- Consider the plant (a control-affine system).
- We say that a positive definite function V is a control Lyapunov function (CLF) for the system if the input can always be chosen to decrease V at a user-specified energy dissipation rate.

Learning Min-norm Stabilizing Controllers
- Given a control Lyapunov function, the associated min-norm controller for the plant is the smallest-norm input satisfying the CLF decrease condition (see the standard form below).
- To learn the min-norm controller, we want to solve an optimization problem over the controller parameters.
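The min-norm controller referenced here is usually written as the pointwise quadratic program below; the symbols follow the standard formulation rather than the slide's lost notation.

```latex
% Standard CLF min-norm controller (assumed notation)
u^{*}(x) \;=\; \arg\min_{u}\; \tfrac{1}{2}\,\|u\|_2^{2}
\quad \text{s.t.} \quad
\nabla V(x)^{\top}\bigl(f(x) + g(x)\,u\bigr) \;\le\; -\alpha(x),
```

where α(x) ≥ 0 is the user-specified dissipation rate from the CLF definition.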

Penalizing the Constraint
- To remove the constraint, we add a penalty term to the cost, with a scaling parameter and a penalty function.
- If the controller is linear in its parameters, then under the additional assumption that the penalty function is strongly convex, [...].
"Learning Min-norm Stabilizing Control Laws for Systems with Unknown Dynamics" [WCASS] (CDC 2020, To Appear)
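A plausible form of the penalized objective is sketched below; the scaling parameter λ and penalty function ψ are the quantities named on the slide, while the exact expression is an assumption.

```latex
% Assumed reconstruction of the penalized learning objective
\min_{\theta}\;
\mathbb{E}_{x \sim X}\Bigl[
  \tfrac{1}{2}\,\bigl\|u_\theta(x)\bigr\|_2^{2}
  \;+\; \lambda\,\psi\Bigl(\nabla V(x)^{\top}\bigl(f(x) + g(x)\,u_\theta(x)\bigr) + \alpha(x)\Bigr)
\Bigr],
```

where ψ penalizes violation of the CLF decrease constraint.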

Learning the 'Forward' Terms
- Other approaches estimate the terms appearing in the constraint and then incorporate them into the QP [Taylor et al.] (2019), [Choi et al.] (2020).
- Advantages of our approach:
  - Faster update rates for the learned controller
  - The learned controller is always 'feasible'
  - Does not require implicit 'inversion' of the learned terms

Double Pendulum: 4 Minutes of Data

Learning a Stable Walking Controller With 20 Seconds of Data
Nominal Controller vs. Learned Controller
"Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor" [Haarnoja et al.] (2018)

Steps in Design Process (recap)
Step 1: Choose a geometric control architecture which produces the desired global behavior (e.g. a feedback linearizing controller)
Step 2: Augment the nominal controller with a learned component (a learned augmentation)
Step 3: Formulate a reward which captures the desired local behavior; minimize the loss with RL

Specific Architectures
- Control Barrier Functions [Ames et al.] (2019)
- Time-Varying CLFs [Kim et al.] (2019)
- Geometric Controllers on SE(3) [Lee et al.] (2010)

Trade-offs With 'Model-Based' RL
- Model-based RL: learn a neural network dynamics model from scratch, then use it for online planning or for training controllers offline with model-free RL [Nagabandi et al.] (2018)
- Main advantage of model-based RL: can be used when the 'ideal' control architecture is not known.
- Advantage of our approach: fine-grained control over system behavior.

Key Takeaways
- Connecting local and global geometric structure allows us to efficiently overcome model uncertainty.
- Learning a forward dynamics model may be incompatible with geometric control.

Relevant Papers
- "Feedback Linearization for Uncertain Systems via Reinforcement Learning" [WFMAPST] (ICRA 2020)
- "Improving Input-Output Linearizing Controllers for Bipedal Robots via Reinforcement Learning" [CWAWTSS] (L4DC 2020)
- "Learning Min-norm Stabilizing Control Laws for Systems with Unknown Dynamics" [WCASS] (IEEE CDC 2020, Dec. 2020)
- "Learning Feedback Linearizing Controllers with Reinforcement Learning" [WFPMST] (IJRR, In Prep)
- "Directly Learning Safe Controllers with Control Barrier Functions" (TBD)
- "Learning Time-based Stabilizing Controllers for Quadrupedal Locomotion" (TBD)

Current Work Extensions
- Can we use model-free policy optimization to overcome model mismatch in high dimensions for specific control architectures?
- Can we use geometric control to systematically design rewards which are 'easy' to optimize over and achieve the desired objective?
- Control architectures: feedback linearization, CLFs, CBFs
- Combining learning and adaptive control
- Other choices of learning algorithm

Project Flow
Part 1: Overcome non-parametric uncertainty by combining RL and geometric control
Part 2: Provide correctness and safety guarantees for specific learning algorithms (high-probability tracking tubes)
Part 3: Future work:
- What makes a reward signal difficult to learn from?
- What makes a system fundamentally difficult to control?
- Where should geometric control be used in the long run?

Part 2 Outline
- Goal: show that we can safely learn a linearizing controller online using standard RL algorithms
- Provide probabilistic tracking bounds for the overall learning system
- Simple policy gradient algorithms
- More sophisticated algorithms (future work)
- Comparison with 'model-based' adaptive control

Analysis and Design Steps
Step 1: Use our loss function from before to design an 'ideal' continuous-time (CT) update rule
Step 2: Model discrete-time (DT) model-free policy gradient algorithms as noisy discretizations of the CT process
Step 3: Provide probabilistic safety guarantees (high-probability tracking tubes) for the overall learning system

Modeling Learning as a CT Process
- Goal: track a desired trajectory while improving the estimated parameters

Modeling Learning as a CT Process
- Recall the normal form.
- Goal: track a desired trajectory while improving the estimated parameters
- Apply the estimated controller

Modeling Learning as a CT Process
- Recall the normal form.
- Goal: track a desired trajectory while improving the estimated parameters
- Apply the estimated controller
- Assumption: the controller is linear in its parameters

Modeling Learning as a CT Process
- Recall the normal form.
- Goal: track a desired trajectory while improving the estimated parameters
- Apply the estimated controller
- Assumption: the controller is linear in its parameters
- Assumption: there exists a unique set of ideal parameters

Modeling Learning as a CT Process
- Recall the normal form.
- Goal: track a desired trajectory while improving the estimated parameters
- Apply the estimated controller; this yields the tracking error dynamics
- Assumption: the controller is linear in its parameters
- Assumption: there exists a unique set of ideal parameters

Modeling Learning as a CT Process
- Goal: track a desired trajectory while improving the estimated parameters
- Assumption: the controller is linear in its parameters
- CT reward function
- Ideal CT update rule

Modeling Online Learning as a CT Process
- Goal: track the desired trajectory while improving the estimated parameters, using the estimated controller
- We apply the 'ideal' update (built from the least-squares loss from before)
- Assumption 1: the controller is linear in its parameters
- Assumption 2: there exists a unique set of ideal parameters
- Under a persistency of excitation condition, we show the parameter estimates converge exponentially quickly

Modeling Online Learning as a CT Process
- Recall the normal form.
- Goal: track a desired trajectory while improving the estimated parameters
- Apply the estimated controller; this yields the tracking error dynamics
- CT reward function and ideal CT update rule (see the sketch below)
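A natural reading of the 'ideal CT update rule' is a gradient flow on the instantaneous loss; this is an assumption about the slide's lost equation, sketched below in standard notation.

```latex
% Assumed gradient-flow form of the ideal continuous-time parameter update
\dot{\hat\theta}(t) \;=\; -\,\eta\,\nabla_{\theta}\,
\ell\bigl(x(t),\, v(t);\, \hat\theta(t)\bigr),
\qquad \eta > 0 .
```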

Adaptive Control Approach
- Goal: track a desired trajectory while improving the estimated parameters
- Apply the estimated tracking controller
- Tracking error dynamics
- CT reward

Persistency of Excitation
- We say that a signal is persistently exciting if its time-averaged outer product stays uniformly positive definite (see the standard form below).
- Under this condition, the parameter estimates converge to the ideal parameters exponentially quickly.
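The persistency-of-excitation condition is standard; in its usual form (the precise constants on the slide were lost in transcription) it reads:

```latex
% Standard persistency-of-excitation condition (assumed constants)
\int_{t}^{t+T} \phi(\tau)\,\phi(\tau)^{\top}\, d\tau \;\succeq\; \alpha I
\quad \text{for all } t \ge 0, \text{ for some } \alpha, T > 0,
```

where φ is the regressor (feature) signal appearing in the parameter update.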

Analyzing DT RL Algorithms
- On each sampling interval, we apply a noisy control and then apply noisy parameter updates.
- Key quantities: the sampling period, the learning rate, and a noisy estimate of the gradient.

Implementable DT Stochastic Approximations
- Main idea: model standard policy gradient updates as a (noisy) discretization of the ideal parameter update.
- To explore the dynamics, we apply a noise-perturbed control.

Implementable DT Stochastic Approximations
- Main idea: model standard policy gradient updates as a (noisy) discretization of the ideal parameter update.
- To explore the dynamics, we apply a noise-perturbed control.
- This leads to a discrete-time process.

Implementable DT Stochastic Approximations
- Main idea: model standard policy gradient updates as a (noisy) discretization of the ideal parameter update.
- To explore the dynamics, we apply a noise-perturbed control.
- This leads to a discrete-time process driven by a learning rate and an estimate of the gradient of the reward.

'Vanilla' Policy Gradient
- As a first step in the analysis, we consider the simple policy gradient estimator (a sketch follows below).
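The sketch below shows a one-step, REINFORCE-style gradient estimate for a Gaussian-perturbed, linear-in-parameters controller; the single-step setting and the names are assumptions for illustration, not the talk's exact estimator.

```python
import numpy as np

def policy_mean(theta, phi):
    """Learned controller, linear in its parameters: mean input = phi^T theta."""
    return phi @ theta

def vanilla_pg_estimate(theta, phi, u_applied, reward, sigma):
    """REINFORCE-style estimate: reward * grad_theta log pi_theta(u_applied | x)
    for a Gaussian policy with mean phi^T theta and std sigma."""
    score = np.outer(phi, u_applied - policy_mean(theta, phi)) / sigma**2
    return reward * score

def update(theta, grad_estimate, learning_rate):
    """Noisy discrete-time parameter update (gradient ascent on the reward)."""
    return theta + learning_rate * grad_estimate
```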

Convergence of 'Vanilla' Policy Gradient
Theorem: Further assume the PE condition holds. Then for each [...] there exist [...] such that [...], with probability [...].

Double Pendulum
Tracking with learning vs. tracking without learning

Step-size Selection
- Many convergence results from the ML literature require certain step-size conditions (see the note below).
- In a forthcoming article, we will show that the learning 'converges' if we take [...].
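The step-size conditions usually required in the ML/stochastic-approximation literature are the classical Robbins–Monro conditions; whether these are exactly the conditions the slide referenced is an assumption, since the equation was lost in transcription.

```latex
% Classical Robbins–Monro step-size conditions (assumed to be the conditions referenced)
\sum_{k=0}^{\infty} \alpha_k = \infty,
\qquad
\sum_{k=0}^{\infty} \alpha_k^{2} < \infty .
```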

Trade-offs with Model-Based Adaptive Control
Advantages:
- Can deal with non-parametric uncertainty
- More freedom in choosing the function approximator
Disadvantages:
- Generally slower
- Loss of deterministic guarantees

L1 Adaptive Control
- Model unknown nonlinearities as a disturbance to be identified, and estimate its components online.
- Control the size of the tracking error by using fast adaptation.

(Near) Future Work: More Sophisticated Algorithms
- Reinforcement learning: baselining / advantage estimation, data reuse / off-policy methods, reward clipping / gradient rescaling
- Adaptive control: large batches / hybrid updates, filtering theory, bounding the effects of reconstruction error

Thesis Proposal
Part 1: Overcome non-parametric uncertainty by combining RL and geometric control
Part 2: Provide correctness and safety guarantees for specific learning algorithms (high-probability tracking tubes)
Part 3: Future work:
- What makes a reward signal difficult to learn from?
- What makes a system fundamentally difficult to control?
- Where should geometric control be used in the long run?

Relevant Papers
- "Adaptive Control for Linearizable Systems Using On-Policy Reinforcement Learning" [WMFPTS] (CDC 2020, To Appear)
- "Reinforcement Learning for the Adaptive Control of Linearizable Systems" [WSMFTS] (Transactions on Automatic Control, In Prep)
- "Data-Efficient Off-Policy Reinforcement Learning for Nonlinear Adaptive Control" [WSMFTS] (TBD)

Where do these techniques fit in?
- Can we use geometric control to partially reduce the complexity of learning more difficult tasks?
- Can we combine our approach with techniques such as meta-learning? [Finn et al.] (2017)
- Can we automatically synthesize rewards for families of tasks?

Understanding Geometric 'Templates'
- So far: use geometric structures as 'templates' for learning.
- Can we formalize what makes a local reward signal 'compatible' with the global structure of the problem?
- Can we quantify the difficulty of an RL problem in terms of how much global information the reward contains?
- Does reinforcement learning implicitly take advantage of the structures we've identified?
- Can we apply general structural results from geometric control?

What makes a system difficult to control?
- Control theory has many colloquial ways to describe what makes a system difficult to control.
- Can we use sample complexity to make these notions rigorous? [1][2]
- Can we use ideas from geometric control to separate out different 'complexity classes' of problems? For example, minimum-phase vs. non-minimum-phase?
[1] Dean, Mania, Matni, Recht, Tu (2018)
[2] Fazel, Ge, Kakade, Mesbahi (2018)

Non-Minimum Phase Tracking Control
- When the zero dynamics are non-minimum phase (NMP), we cannot 'forget' them.
- Example: steering a bike
- Recall the normal form.

Non-Minimum Phase Tracking Control
- When the zero dynamics are NMP, we cannot 'forget' them.
- Example: steering a bike requires "counter-steering".
- Can we learn these behaviors?
- Recall the normal form.

Non-Minimum Phase Tracking Control
- When the zero dynamics are NMP, we cannot 'forget' them.
- Example: steering a bike requires "counter-steering".
- Can we learn these behaviors?
- Can we quantify what makes these behaviors difficult to learn?
- How much preview (look-ahead window) do we need to learn?
- Can we learn safely?
[Devasia et al.] (1999)

Questions?
