F.L. Lewis and Draguna Vrabie
Moncrief-O'Donnell Endowed Chair, Head, Controls & Sensors Group
Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington
Supported by: NSF - Paul Werbos
Adaptive Dynamic Programming (ADP) for Discrete-Time Systems
Talk available online at http://ARRI.uta.edu/acs

Bill Wolovich
"Linear Multivariable Systems," New York: Springer-Verlag, 1974.
"Robotics: Basic Analysis and Design," 1987.
"Automatic Control Systems: Basic Analysis and Design," 1994.
Interactor Matrix & Structure:
Falb and Wolovich, "Decoupling in the design and synthesis of multivariable control systems," IEEE Trans. Automatic Control, 1967.
Wolovich and Falb, "On the structure of multivariable systems," SIAM J. Control, 1969.
Wolovich, "The use of state feedback for exact model matching," SIAM J. Control, 1972.
Falb and Wolovich, "The role of the interactor in decoupling," JACC, 1977.
Wolovich and Falb, "Invariants and canonical forms under dynamic compensation," SIAM J. Control, vol. 14, 1976.
The solution of the input-output cover problems: Wolovich [1972], Morse [1976], Hammer and Heymann [1981], Wonham [1974].
Pole Placement via Static Output Feedback is NP-Hard:
Morse, A.S., Wolovich, W.A., and Anderson, B.D.O., "Generic pole assignment - preliminary results," IEEE Transactions on Automatic Control, vol. 28, pp. 503-506, 1983.

Discrete-Time Optimal Control
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$; example: $r(x_i, u_i) = x_i^T Q x_i + u_i^T R u_i$
Value function recursion: $V_h(x_k) = r(x_k, u_k) + \gamma \sum_{i=k+1}^{\infty} \gamma^{i-(k+1)} r(x_i, u_i) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, with $V_h(0) = 0$
Control policy: $u_k = h(x_k)$, the prescribed control input function; example: $u_k = -K x_k$, linear state-variable feedback

Discrete-Time Optimal Control
Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$
Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, with $u_k = h(x_k)$ the prescribed control policy
Hamiltonian: $H(x_k, V(x_k), h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$
Optimal cost: $V^*(x_k) = \min_h \big( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \big)$
Bellman's principle: $V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$ - a backwards-in-time solution
Optimal control: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$
The system dynamics does not appear.

The Solution: Hamilton-Jacobi-Bellman Equation
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)$
DT HJB equation: $V(x_k) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V(x_{k+1}) \big) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V(f(x_k) + g(x_k) u_k) \big)$
Minimizing with respect to $u_k$: $2 R u_k + g(x_k)^T \frac{dV(x_{k+1})}{dx_{k+1}} = 0$, so $u(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV(x_{k+1})}{dx_{k+1}}$
Difficult to solve; contains the dynamics.

DT Optimal Control - Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$; cost: $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)$
Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$.
HJB = DT Riccati equation: $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$
Optimal control: $u_k = -L x_k$, $L = (R + B^T P B)^{-1} B^T P A$; optimal cost: $V^*(x_k) = x_k^T P x_k$
Off-line solution; the dynamics must be known.
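
For concreteness, a minimal Python/NumPy sketch of this off-line, model-based step (assuming SciPy is available; the A, B, Q, R below are illustrative placeholders, not matrices from the talk):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative system and weights (placeholders, not matrices from the talk)
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Off-line, model-based solution of the DT algebraic Riccati equation
# 0 = A'PA - P + Q - A'PB (R + B'PB)^{-1} B'PA
P = solve_discrete_are(A, B, Q, R)

# Optimal state-feedback gain: u_k = -L x_k with L = (R + B'PB)^{-1} B'PA
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("P =\n", P, "\nL =", L)
```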

Discrete-Time Optimal Adaptive Control
Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$
Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, with $u_k = h(x_k)$ the prescribed control policy
Hamiltonian: $H(x_k, V(x_k), h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$
Optimal cost: $V^*(x_k) = \min_h \big( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \big)$
Bellman's principle: $V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$
Optimal control: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$
Focus on these two equations.

Discrete-Time Optimal Control - Solutions by the Computational Intelligence Community
Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, with $u_k = h(x_k)$ the prescribed control policy - the Lyapunov equation.
Theorem: Let $V_h(x_k)$ solve the Lyapunov equation. Then $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, h(x_i))$.
This gives the value for any prescribed control policy: policy evaluation for any given current policy. The policy must be stabilizing, and $V_h(0) = 0$.

Bellman's result: $h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V^*(x_{k+1}) \big)$
What about $h'(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_h(x_{k+1}) \big)$ for a given policy $h(\cdot)$?
Theorem (Bertsekas). Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then $V_{h'}(x_k) \le V_h(x_k)$.
Policy improvement - the one-step improvement property of rollout algorithms.

DT Policy Iteration
e.g. control policy as state-variable feedback (SVFB): $h(x_k) = -L x_k$
The cost for any given control policy $h(x_k)$ satisfies the recursion $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$ - a Lyapunov equation, a recursive form, a consistency equation.
Recursive solution: pick a stabilizing initial control.
Policy evaluation: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$
Policy improvement: $h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$
$f(\cdot)$ and $g(\cdot)$ do not appear. Howard (1960) proved convergence for MDP.

Adaptive Critics - The Adaptive Critic Architecture
Value update (policy evaluation, critic network): $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$
Control policy update (action network): $h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$
The action network applies the control policy $h_j(x_k)$ to the system; the critic network observes the cost.
Leads to an ONLINE, FORWARD-IN-TIME implementation of optimal control.

Different Methods of Learning
Reinforcement learning (Ivan Pavlov, 1890s). We want OPTIMAL performance - ADP, Approximate Dynamic Programming.
Actor-Critic Learning: the critic compares the desired performance with the reinforcement signal from the environment and tunes the actor; the actor (adaptive learning system) applies the control inputs to the system and observes the system outputs.

Adaptive (Approximate) Dynamic Programming
Four ADP methods proposed by Paul Werbos, classified by what the critic NN approximates:
Heuristic Dynamic Programming (HDP) - the value $V(x_k)$
Dual Heuristic Programming (DHP) - the gradient $\partial V / \partial x$
Action-Dependent Heuristic Dynamic Programming (ADHDP; Watkins' Q-learning) - the Q function $Q(x_k, u_k)$
Action-Dependent Dual Heuristic Programming (ADDHP) - the gradients $\partial Q / \partial x$, $\partial Q / \partial u$
An action NN approximates the control.
Bertsekas - Neurodynamic Programming. Barto & Bradtke - Q-learning proof (imposed a settling time).

DT Policy Iteration - Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$. For any stabilizing policy, the cost is $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u^T(x_i) R u(x_i) \big)$; the LQR value is quadratic, $V(x) = x^T P x$.
DT policy iteration (solves the Lyapunov equation without knowing A and B):
$V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k) + V_{j+1}(x_{k+1})$
$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$
Equivalent to an underlying problem - the DT LQR:
$(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0$ (DT Lyapunov equation)
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$
Hewer proved convergence in 1971. ADP solves the Riccati equation WITHOUT knowing the system dynamics.
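
A minimal sketch of the underlying model-based iteration (Hewer's algorithm), assuming SciPy and an initial stabilizing gain L0; the ADP implementations on the following slides replace the Lyapunov solve with measured data:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hewer_policy_iteration(A, B, Q, R, L0, iters=20):
    """Model-based DT policy iteration (Hewer's algorithm) for the LQR.

    Policy evaluation:  (A - B L_j)' P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j' R L_j = 0
    Policy improvement: L_{j+1} = (R + B' P_{j+1} B)^{-1} B' P_{j+1} A
    """
    L = L0                                   # must be stabilizing
    for _ in range(iters):
        Acl = A - B @ L
        # solve_discrete_lyapunov solves  a X a' - X + q = 0, so pass Acl transposed
        P = solve_discrete_lyapunov(Acl.T, Q + L.T @ R @ L)
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L
```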

DT Policy Iteration - How to Implement Online? Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$; $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u^T(x_i) R u(x_i) \big)$. The LQR cost is quadratic, $V(x) = x^T P x$ for some matrix $P$.
DT policy iteration (solves the Lyapunov equation without knowing A and B):
$V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k) + V_{j+1}(x_{k+1})$, i.e. $x_k^T P_{j+1} x_k - x_{k+1}^T P_{j+1} x_{k+1} = x_k^T Q x_k + u_j^T R u_j$.
For a two-state example, writing $x_k^T P x_k = \begin{bmatrix} x_k^1 & x_k^2 \end{bmatrix} \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix} \begin{bmatrix} x_k^1 \\ x_k^2 \end{bmatrix} = [p_{11} \; p_{12} \; p_{22}] \begin{bmatrix} (x_k^1)^2 \\ 2 x_k^1 x_k^2 \\ (x_k^2)^2 \end{bmatrix}$ (a quadratic basis set), the recursion becomes $W_{j+1}^T \big[ \varphi(x_k) - \varphi(x_{k+1}) \big] = x_k^T Q x_k + u_j^T R u_j$.

Implementation - DT Policy Iteration
Value Function Approximation (VFA): $V(x) = W^T \varphi(x)$ - weights and basis functions.
LQR case: $V(x)$ is quadratic, $V(x) = x^T P x = W^T \varphi(x)$ with quadratic basis functions $\varphi(x)$ and $W^T = [p_{11} \; p_{12} \; \cdots]$.
Nonlinear system case: use a neural network.
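
A small illustrative sketch (for an assumed 2-state system) of the correspondence between the weight vector W and the kernel matrix P in the LQR case:

```python
import numpy as np

def phi(x):
    """Quadratic basis for a 2-state system: [x1^2, 2*x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([x1**2, 2.0 * x1 * x2, x2**2])

def weights_to_P(W):
    """Recover the symmetric kernel P from W so that V(x) = x'Px = W'phi(x)."""
    p11, p12, p22 = W
    return np.array([[p11, p12], [p12, p22]])

# Quick check that both forms of V(x) agree
P = np.array([[2.0, 0.5], [0.5, 1.0]])
W = np.array([P[0, 0], P[0, 1], P[1, 1]])
x = np.array([0.3, -1.2])
assert np.isclose(x @ P @ x, W @ phi(x))
```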

Implementation - DT Policy Iteration
Value function update for a given control: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$
Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$.
VFA: $V_j(x_k) = W_j^T \varphi(x_k)$. Then $W_{j+1}^T \big[ \varphi(x_k) - \gamma \varphi(x_{k+1}) \big] = r(x_k, h_j(x_k))$, with regression matrix $[\varphi(x_k) - \gamma \varphi(x_{k+1})]$.
Since $x_{k+1}$ is measured, knowledge of $f(x)$ or $g(x)$ is not needed for the value function update. This is indirect adaptive control with identification of the optimal value.
Solve for the weights using RLS, or using many trajectories with different initial conditions over a compact set.
Then update the control using $h_j(x_k) = -L_j x_k = -(R + B^T P_j B)^{-1} B^T P_j A x_k$.
Model-based policy iteration: one needs to know $f(x_k)$ AND $g(x_k)$ for the control update. Robustness?
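
A minimal data-driven sketch of this value function update, using batch least squares over recorded samples instead of RLS (the basis function and data arrays are illustrative assumptions):

```python
import numpy as np

def evaluate_policy_from_data(X, Xnext, U, Q, R, phi, gamma=1.0):
    """Batch-LS policy evaluation: solve W'[phi(x_k) - gamma*phi(x_{k+1})] = r(x_k, u_k).

    X, Xnext : arrays of measured states x_k and x_{k+1} (one sample per row)
    U        : the controls h_j(x_k) actually applied at each sample
    phi      : basis function, e.g. the quadratic basis of the previous slide
    Note: f(.) and g(.) never appear - only measured data and the utility r.
    """
    Phi = np.array([phi(x) - gamma * phi(xn) for x, xn in zip(X, Xnext)])
    r = np.array([x @ Q @ x + u @ R @ u for x, u in zip(X, U)])
    W, *_ = np.linalg.lstsq(Phi, r, rcond=None)
    return W   # weights of V_{j+1}(x) = W' phi(x)
```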

1. Select a control policy.
2. Find the associated cost: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$ - solves the Lyapunov equation without knowing the dynamics.
3. Improve the control: $u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$
Online loop: observe $x_k$; apply $u_k$; observe the cost $r_k$; observe $x_{k+1}$; update $V$; set $k \leftarrow k+1$; repeat until convergence to $V_{j+1}$; then update the control to $u_{j+1}$.
Needs 10 lines of MATLAB code. Direct optimal adaptive control.

Adaptive Control
Indirect adaptive control: identify the system model.
Direct adaptive control: identify the controller.
Optimal adaptive control: identify the performance value.
(Plant with control input and measured output.)

Greedy Value Function Update - Approximate Dynamic Programming
ADP Method 1 - Heuristic Dynamic Programming (HDP), Paul Werbos
Policy iteration:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$ (Lyapunov equation)
$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$
Underlying Riccati recursion for the LQR: $(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0$, $L_j = (R + B^T P_j B)^{-1} B^T P_j A$ (Hewer 1971). An initial stabilizing control is needed.
ADP greedy cost update (two occurrences of the cost allow a greedy update):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$ (simple recursion)
$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + \gamma V_{j+1}(x_{k+1}) \big)$
Underlying Riccati recursion for the LQR: $P_{j+1} = (A - B L_j)^T P_j (A - B L_j) + Q + L_j^T R L_j$, $L_j = (R + B^T P_j B)^{-1} B^T P_j A$. Lancaster & Rodman proved convergence. An initial stabilizing control is NOT needed.
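
A minimal sketch of the greedy (HDP) update for the LQR case, written as the underlying matrix recursion starting from P_0 = 0; the model-based form is shown here only to make the recursion explicit:

```python
import numpy as np

def hdp_lqr(A, B, Q, R, iters=200):
    """Greedy (HDP) cost update for the LQR, starting from P_0 = 0:
    P_{j+1} = (A - B L_j)' P_j (A - B L_j) + Q + L_j' R L_j,
    with L_j = (R + B' P_j B)^{-1} B' P_j A.  No stabilizing initial gain is needed."""
    P = np.zeros((A.shape[0], A.shape[0]))
    L = np.zeros((B.shape[1], A.shape[0]))
    for _ in range(iters):
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        Acl = A - B @ L
        P = Acl.T @ P @ Acl + Q + L.T @ R @ L
    return P, L
```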

Implementation - DT HDP
Value function update for a given control: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$
Since $x_{k+1}$ is measured, knowledge of $f(x)$ or $g(x)$ is not needed for the value function update.
Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$.
VFA: $V_j(x_k) = W_j^T \varphi(x_k)$. Then $W_{j+1}^T \varphi(x_k) = r(x_k, h_j(x_k)) + \gamma W_j^T \varphi(x_{k+1})$ - regression matrix on the left, old weights on the right.
Solve for the weights using RLS, or using many trajectories with different initial conditions over a compact set.
Then update the control using $h_j(x_k) = -L_j x_k = -(R + B^T P_j B)^{-1} B^T P_j A x_k$. One needs to know $f(x_k)$ AND $g(x_k)$ for the control update.

DT HDP vs. Receding Horizon Optimal Control
Forward-in-time HDP: $P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A$, with $P_0 = 0$.
Backward-in-time optimization (RHC): $P_k = A^T P_{k+1} A + Q - A^T P_{k+1} B (R + B^T P_{k+1} B)^{-1} B^T P_{k+1} A$, with terminal condition $P_N$ a control Lyapunov function overbounding $P$.

Hongwei Zhang, Dr. Jie Huang - Adaptive Terminal Cost RHC
Standard RHC: $x_{k+1} = A x_k + B u_k$, $V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_0 x_{k+N}$, where $P_0$ is the same for each stage.
$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A$, starting from $P_0$; $u^{RH}_{k+1} = -(R + B^T P_{N-1} B)^{-1} B^T P_{N-1} A x_{k+1} = -L_N x_{k+1}$.
Requires $P_0$ to be a CLF that overbounds the optimal infinite-horizon cost, or a large $N$.
Our ATC RHC: $V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_k^N x_{k+N}$, with the final cost taken from the previous stage; $P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A$, starting from $P_k^N$.
HWZ Theorem: Let $N \ge 1$. Under the usual observability and controllability assumptions, ATC RHC guarantees uniform ultimate exponential stability for ANY $P_0 \ge 0$. Moreover, our solution converges to the optimal infinite-horizon cost.

Q Learning - Action Dependent ADP
Value function recursion for a given policy $h(x_k)$: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$
Define the Q function: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1})$ - note $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time $k$, so $Q_h(x_k, h(x_k)) = V_h(x_k)$.
Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma Q_h(x_{k+1}, h(x_{k+1}))$
Simple expression of Bellman's principle: $V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$, $h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$
Optimal adaptive control (for unknown DT systems).

Draguna Vrabie - Continuous-Time Optimal Control
System: $\dot{x} = f(x, u)$; cost: $V(x(t)) = \int_t^{\infty} r(x, u) \, d\tau = \int_t^{\infty} \big( Q(x) + u^T R u \big) \, d\tau$
Hamiltonian: $H\big(x, \tfrac{\partial V}{\partial x}, u\big) = \dot{V} + r(x, u) = \Big( \tfrac{\partial V}{\partial x} \Big)^T \dot{x} + r(x, u) = \Big( \tfrac{\partial V}{\partial x} \Big)^T f(x, u) + r(x, u)$
(c.f. the DT Hamiltonian $H(x_k, V(x_k), h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$)
Optimal cost (Bellman): $0 = \min_{u(t)} \Big[ r(x, u) + \Big( \tfrac{\partial V^*}{\partial x} \Big)^T f(x, u) \Big]$
Optimal control: $h^*(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \tfrac{\partial V^*}{\partial x}$
HJB equation: $0 = Q(x) + \Big( \tfrac{dV^*}{dx} \Big)^T f - \tfrac{1}{4} \Big( \tfrac{dV^*}{dx} \Big)^T g R^{-1} g^T \tfrac{dV^*}{dx}$, with $V^*(0) = 0$.
Off-line solution; the dynamics must be known.

Bill Wolovich
Interactor Matrix & Structure Theorem
The solution of the input-output cover problems
Pole Placement via Static Output Feedback
Thank you for your inspiration and motivation in 1970.

Q Function Definition
Specify a control policy $u_j = h(x_j)$, $j = k, k+1, \ldots$
Define the Q function: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1})$ - note $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time $k$, so $Q_h(x_k, h(x_k)) = V_h(x_k)$.
Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma Q_h(x_{k+1}, h(x_{k+1}))$
Optimal Q function: $Q^*(x_k, u_k) = r(x_k, u_k) + \gamma V^*(x_{k+1})$, i.e. $Q^*(x_k, u_k) = r(x_k, u_k) + \gamma Q^*(x_{k+1}, h^*(x_{k+1}))$
Optimal control solution: $V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q_h(x_k, h(x_k))$, $h^*(x_k) = \arg\min_h Q_h(x_k, h(x_k))$
Simple expression of Bellman's principle: $V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$, $h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$

Q Function ADP - Action Dependent ADP
The Q function for any given control policy $h(x_k)$ satisfies the recursion $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma Q_h(x_{k+1}, h(x_{k+1}))$.
Recursive solution: pick a stabilizing initial control policy.
Find the Q function: $Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_j(x_{k+1}, h_j(x_{k+1}))$
Update the control: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$
Now $f(x_k, u_k)$ is not needed. Bradtke & Barto (1994) proved convergence for the LQR.

Q Learning does not need to know $f(x_k)$ or $g(x_k)$
For the LQR: $V(x) = W^T \varphi(x) = x^T P x$ - $V$ is quadratic in $x$.
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$
$= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$
so $Q$ is quadratic in $x$ and $u$.
The control update is found from $0 = \frac{\partial Q}{\partial u_k} = 2 \big[ B^T P A x_k + (R + B^T P B) u_k \big] = 2 \big[ H_{ux} x_k + H_{uu} u_k \big]$,
so $u_k = -(R + B^T P B)^{-1} B^T P A x_k = -H_{uu}^{-1} H_{ux} x_k = -L_{j+1} x_k$.
The control is found only from the Q function; A and B are not needed.
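
A small illustrative sketch of this last step: once the Q-function kernel H is available, the feedback gain comes from its blocks alone (the helper that builds H from A, B, P is included only for checking and is not needed online):

```python
import numpy as np

def gain_from_Q_kernel(H, n):
    """Feedback gain u_k = -L x_k with L = H_uu^{-1} H_ux, using only blocks of H.
    n is the state dimension; the remaining rows/columns of H correspond to u."""
    H_ux = H[n:, :n]
    H_uu = H[n:, n:]
    return np.linalg.solve(H_uu, H_ux)

# For checking only: H built from a known model (not needed by the online learner)
def build_H(A, B, P, Q, R):
    return np.block([[Q + A.T @ P @ A, A.T @ P @ B],
                     [B.T @ P @ A,     R + B.T @ P @ B]])
```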

Implementation - DT Q Function Policy Iteration
For the LQR, the Q function update for the control $u_k = -L_j x_k$ is given by $Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_{j+1}(x_{k+1}, -L_j x_{k+1})$.
Assume measurements of $u_k$, $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$.
QFA - Q function approximation: $Q(x, u) = W^T \varphi(x, u)$. Now $u$ is an input to the NN (Werbos: action-dependent NN).
Regression: $W_{j+1}^T \big[ \varphi(x_k, u_k) - \gamma \varphi(x_{k+1}, -L_j x_{k+1}) \big] = r(x_k, -L_j x_k)$
Solve for the weights using RLS or backpropagation. For the LQR case, $\varphi(x, u)$ is the quadratic basis in $(x, u)$.
Since $x_{k+1}$ is measured, knowledge of $f(x)$ or $g(x)$ is not needed for the value function update.

Model-Free Policy Iteration - Q Policy Iteration (Bradtke, Ydstie, Barto)
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_{j+1}(x_{k+1}, -L_j x_{k+1})$
$W_{j+1}^T \big[ \varphi(x_k, u_k) - \gamma \varphi(x_{k+1}, -L_j x_{k+1}) \big] = r(x_k, -L_j x_k)$
Control policy update (a stable initial control is needed): $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$, i.e. $u_k = -H_{uu}^{-1} H_{ux} x_k = -L_{j+1} x_k$
Greedy Q Function Update - Approximate Dynamic Programming
ADP Method 3: Q Learning, Action-Dependent Heuristic Dynamic Programming (ADHDP), Paul Werbos - model-free ADP.
Greedy Q update: $Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_j(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T \varphi(x_k, u_k) = r(x_k, -L_j x_k) + \gamma W_j^T \varphi(x_{k+1}, -L_j x_{k+1}) \equiv \text{target}_{j+1}$
Update the weights by RLS or backpropagation.
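
A minimal simulation sketch of the greedy Q update (ADHDP-style Q-learning with batch least squares) for the LQR case, under illustrative assumptions: a small plant is used only to generate data, a quadratic Kronecker basis in z = [x; u] parameterizes Q, and probing noise provides persistence of excitation. This is one possible realization of the update above, not the authors' code:

```python
import numpy as np

def zbar(z):
    """Quadratic Kronecker basis: upper-triangular products z_i z_j."""
    q = len(z)
    return np.array([z[i] * z[j] for i in range(q) for j in range(i, q)])

def weights_to_H(W, q):
    """Rebuild the symmetric kernel H from the basis weights."""
    H = np.zeros((q, q))
    idx = 0
    for i in range(q):
        for j in range(i, q):
            H[i, j] = H[j, i] = W[idx] if i == j else W[idx] / 2.0
            idx += 1
    return H

# Illustrative plant (used only to generate data; A, B never enter the learner)
A = np.array([[1.0, 0.1], [0.0, 0.9]]); B = np.array([[0.0], [0.1]])
Q, R, gamma = np.eye(2), np.array([[1.0]]), 1.0
n, m = 2, 1
L = np.zeros((m, n))                         # initial gain (need not stabilize)
W = np.zeros(zbar(np.zeros(n + m)).shape)    # Q-function weights

for j in range(30):                          # greedy Q (ADHDP) iterations
    Phi, targets = [], []
    x = np.random.randn(n)
    for k in range(60):                      # collect data with probing noise (PE)
        u = -L @ x + 0.1 * np.random.randn(m)
        xn = A @ x + (B @ u).ravel()
        un = -L @ xn                         # policy action at the next state
        r = x @ Q @ x + u @ R @ u
        Phi.append(zbar(np.concatenate([x, u])))
        targets.append(r + gamma * W @ zbar(np.concatenate([xn, un])))
        x = xn if np.linalg.norm(xn) < 1e3 else np.random.randn(n)
    W, *_ = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)
    H = weights_to_H(W, n + m)
    L = np.linalg.solve(H[n:, n:], H[n:, :n])   # L_{j+1} = H_uu^{-1} H_ux

print("Learned gain L =", L)
```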

Q learning actually solves the Riccati equation WITHOUT knowing the plant dynamics - model-free ADP, direct OPTIMAL ADAPTIVE CONTROL. It works for nonlinear systems.
Proofs? Robustness? Comparison with adaptive control methods?

Discrete-Time Zero-Sum Games
Consider the following discrete-time dynamical system with continuous state and action spaces:
$x_{k+1} = A x_k + B u_k + E w_k$, $y_k = x_k$, with $x_k \in R^n$, $u_k \in R^{m_1}$, $w_k \in R^{m_2}$, $y_k \in R^p$,
with quadratic cost $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T u_i - \gamma^2 w_i^T w_i \big)$.
The zero-sum game problem can be formulated as follows: $V^*(x_k) = \min_u \max_w \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T u_i - \gamma^2 w_i^T w_i \big)$.
The goal is to find the optimal strategies (state feedback) $u^*(x) = L x$ and $w^*(x) = K x$.

Asma Al-Tamimi - DT Game Heuristic Dynamic Programming: Forward-in-Time Formulation
An Approximate Dynamic Programming (ADP) scheme with the incremental optimization
$V_{i+1}(x_k) = \min_{u_k} \max_{w_k} \big\{ x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1}) \big\}$,
which is equivalently written as $V_{i+1}(x_k) = x_k^T Q x_k + u_i^T(x_k) u_i(x_k) - \gamma^2 w_i^T(x_k) w_i(x_k) + V_i(x_{k+1})$.

Game Algebraic Riccati Equation
Using Bellman's optimality principle ("dynamic programming"):
$V(x_k) = \min_{u_k} \max_{w_k} \big( x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V(x_{k+1}) \big)$, i.e. $x_k^T P x_k = \min_{u_k} \max_{w_k} \big( r(x_k, u_k, w_k) + x_{k+1}^T P x_{k+1} \big)$.
The game algebraic Riccati equation (GARE):
$P = A^T P A + Q - \begin{bmatrix} A^T P B & A^T P E \end{bmatrix} \begin{bmatrix} I + B^T P B & B^T P E \\ E^T P B & E^T P E - \gamma^2 I \end{bmatrix}^{-1} \begin{bmatrix} B^T P A \\ E^T P A \end{bmatrix}$
The conditions for a saddle point are $I + B^T P B > 0$ and $I - \gamma^{-2} E^T P E > 0$.
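
A minimal sketch of iterating the GARE forward in time (the HDP-style recursion of the previous slide), for illustrative dimensions; the saddle-point conditions are checked at the returned P:

```python
import numpy as np

def gare_iteration(A, B, E, Q, gamma, iters=500):
    """Iterate P_{i+1} = A'P_iA + Q - [A'P_iB  A'P_iE] M_i^{-1} [B'P_iA; E'P_iA],
    with M_i = [[I + B'P_iB, B'P_iE], [E'P_iB, E'P_iE - gamma^2 I]], from P_0 = 0."""
    n, m1, m2 = A.shape[0], B.shape[1], E.shape[1]
    P = np.zeros((n, n))
    for _ in range(iters):
        M = np.block([[np.eye(m1) + B.T @ P @ B, B.T @ P @ E],
                      [E.T @ P @ B, E.T @ P @ E - gamma**2 * np.eye(m2)]])
        N = np.vstack([B.T @ P @ A, E.T @ P @ A])
        P = A.T @ P @ A + Q - np.hstack([A.T @ P @ B, A.T @ P @ E]) @ np.linalg.solve(M, N)
    # Saddle-point conditions at the returned P
    assert np.all(np.linalg.eigvalsh(np.eye(m1) + B.T @ P @ B) > 0)
    assert np.all(np.linalg.eigvalsh(gamma**2 * np.eye(m2) - E.T @ P @ E) > 0)
    return P
```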

Game Algebraic Riccati Equation
The optimal policies for the control and the disturbance are
$L = \big( I + B^T P B - B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P B \big)^{-1} \big( B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P A - B^T P A \big)$,
$K = \big( E^T P E - \gamma^2 I - E^T P B (I + B^T P B)^{-1} B^T P E \big)^{-1} \big( E^T P B (I + B^T P B)^{-1} B^T P A - E^T P A \big)$.

Asma Al-Tamimi - Q Learning for H-infinity Control
Linear quadratic case: V and Q are quadratic, $V(x_k) = x_k^T P x_k$ and
$Q(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V(x_{k+1}) = [x_k^T \; u_k^T \; w_k^T] \, H \, [x_k^T \; u_k^T \; w_k^T]^T$, with $H = \begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix}$.
Q function update:
$Q_{i+1}(x_k, \hat{u}_i(x_k), \hat{w}_i(x_k)) = x_k^T R x_k + \hat{u}_i(x_k)^T \hat{u}_i(x_k) - \gamma^2 \hat{w}_i(x_k)^T \hat{w}_i(x_k) + Q_i(x_{k+1}, \hat{u}_i(x_{k+1}), \hat{w}_i(x_{k+1}))$,
i.e. $[x_k^T \; u_k^T \; w_k^T] H_{i+1} [x_k^T \; u_k^T \; w_k^T]^T = x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k + [x_{k+1}^T \; u_{k+1}^T \; w_{k+1}^T] H_i [x_{k+1}^T \; u_{k+1}^T \; w_{k+1}^T]^T$.
Control action and disturbance updates: $u_i(x_k) = L_i x_k$, $w_i(x_k) = K_i x_k$, with
$L_i = \big( H_{uu}^i - H_{uw}^i (H_{ww}^i)^{-1} H_{wu}^i \big)^{-1} \big( H_{uw}^i (H_{ww}^i)^{-1} H_{wx}^i - H_{ux}^i \big)$,
$K_i = \big( H_{ww}^i - H_{wu}^i (H_{uu}^i)^{-1} H_{uw}^i \big)^{-1} \big( H_{wu}^i (H_{uu}^i)^{-1} H_{ux}^i - H_{wx}^i \big)$.
A, B, E are NOT needed.

Compare to the Q function for the H2 optimal control case:
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$
$= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$
versus the H-infinity game Q function above.

Asma Al-Tamimi - A quadratic basis set is used to allow on-line solution:
$\hat{Q}(z, h_i) = z^T H_i z = h_i^T \bar{z}$, where $z = [x^T \; u^T \; w^T]^T$ and $\bar{z} = (z_1^2, \ldots, z_1 z_q, z_2^2, z_2 z_3, \ldots, z_{q-1} z_q, z_q^2)$ is the quadratic Kronecker basis.
Q function update: $Q_{i+1}(x_k, \hat{u}_i(x_k), \hat{w}_i(x_k)) = x_k^T R x_k + \hat{u}_i(x_k)^T \hat{u}_i(x_k) - \gamma^2 \hat{w}_i(x_k)^T \hat{w}_i(x_k) + Q_i(x_{k+1}, \hat{u}_i(x_{k+1}), \hat{w}_i(x_{k+1}))$
Solve for the "NN weights" - the elements of the kernel matrix $H$: $h_{i+1}^T \bar{z}(x_k) = x_k^T R x_k + \hat{u}_i(x_k)^T \hat{u}_i(x_k) - \gamma^2 \hat{w}_i(x_k)^T \hat{w}_i(x_k) + h_i^T \bar{z}(x_{k+1})$. Use batch LS or online RLS.
Control and disturbance updates: $\hat{u}_i(x) = L_i x$, $\hat{w}_i(x) = K_i x$.
Probing noise is injected to get persistence of excitation: $\hat{u}_i^e(x_k) = L_i x_k + n_{1k}$, $\hat{w}_i^e(x_k) = K_i x_k + n_{2k}$.
Proof: the algorithm still converges to the exact result.
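
The weight solve is left to batch LS or RLS; a generic recursive-least-squares step that could play that role (a standard textbook RLS update, written here as an illustrative sketch rather than code from the talk) is:

```python
import numpy as np

def rls_update(theta, Pcov, phi, target, lam=1.0):
    """One recursive-least-squares step for target ~ theta' phi.

    theta  : current weight estimate (e.g. the elements h_i of the kernel H)
    Pcov   : current covariance matrix of the estimate
    phi    : regressor, e.g. the quadratic Kronecker basis zbar(z_k)
    target : r(x_k, u_k, w_k) + h_i' zbar(z_{k+1}) from the Q-function update
    lam    : forgetting factor (1.0 = no forgetting)
    """
    Pphi = Pcov @ phi
    gain = Pphi / (lam + phi @ Pphi)
    theta = theta + gain * (target - theta @ phi)
    Pcov = (Pcov - np.outer(gain, Pphi)) / lam
    return theta, Pcov
```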

Asma Al-Tamimi

ADHDP Application for a Power System - System Description
State: $x(t) = [\Delta f(t) \;\; \Delta P_g(t) \;\; \Delta X_g(t) \;\; \Delta F(t)]^T$
$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0 \\ 0 & -1/T_T & 1/T_T & 0 \\ -1/(R T_G) & 0 & -1/T_G & -1/T_G \\ K_E & 0 & 0 & 0 \end{bmatrix}$, $B = \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix}$, $E = \begin{bmatrix} -K_p/T_p \\ 0 \\ 0 \\ 0 \end{bmatrix}$
Parameter ranges: $1/T_p \in [0.033, 0.1]$, $K_p/T_p \in [4, 12]$, $1/T_T \in [2.564, 4.762]$, $1/T_G \in [9.615, 17.857]$, $1/(R T_G) \in [3.081, 10.639]$.
The discrete-time model is obtained by applying a ZOH to the CT model.
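
A small sketch of that last discretization step using SciPy's cont2discrete, with nominal parameter values picked from the ranges above and an assumed sample time (both are illustrative assumptions; K_E is not given a range on the slide and is also assumed):

```python
import numpy as np
from scipy.signal import cont2discrete

# Nominal values picked from the parameter ranges above; K_E and Ts are assumed
inv_Tp, Kp_Tp, inv_TT, inv_TG, inv_RTG, KE = 0.0665, 8.0, 3.663, 13.736, 6.86, 0.6
Ts = 0.1   # assumed sampling period (s)

A = np.array([[-inv_Tp,  Kp_Tp,    0.0,     0.0],
              [ 0.0,    -inv_TT,   inv_TT,  0.0],
              [-inv_RTG, 0.0,     -inv_TG, -inv_TG],
              [ KE,      0.0,      0.0,     0.0]])
B = np.array([[0.0], [0.0], [inv_TG], [0.0]])
E = np.array([[-Kp_Tp], [0.0], [0.0], [0.0]])    # load-disturbance input

# ZOH discretization of (A, [B E]) with sample time Ts
Ad, BEd, *_ = cont2discrete((A, np.hstack([B, E]), np.eye(4), np.zeros((4, 2))),
                            Ts, method='zoh')
Bd, Ed = BEd[:, :1], BEd[:, 1:]
```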

ADHDP Application for a Power System
The system states: $\Delta f$ - incremental frequency deviation (Hz); $\Delta P_g$ - incremental change in generator output (p.u. MW); $\Delta X_g$ - incremental change in governor position (p.u. MW); $\Delta F$ - incremental change in integral control. $\Delta P_d$ is the load disturbance (p.u. MW).
The system parameters are: $T_G$ - governor time constant; $T_T$ - turbine time constant; $T_P$ - plant model time constant; $K_p$ - plant model gain; $R$ - speed regulation due to governor action; $K_E$ - integral control gain.

ADHDP Application for a Power System - ADHDP Policy Tuning
[Figures: convergence of the kernel-matrix entries $P_{11}, P_{12}, \ldots, P_{44}$ and of the control policy gains $L_{11}, L_{12}, L_{13}, L_{14}$ over roughly 3000 time steps.]

ADHDP Application for a Power System - Comparison
[Figures: state trajectories (frequency deviation, incremental change of the generator output, incremental change of the governor position, incremental change of the integral control) for the ADHDP controller design and for the design from [1].]
The maximum frequency deviation when using the ADHDP controller (about -0.20 at t = 0.5 s, versus about -0.25 for the design from [1]) is improved by 19.3% over the controller designed in [1].
[1] Wang, Y., R. Zhou, and C. Wen, "Robust load-frequency controller design for power systems," IEE Proc.-C, vol. 140, no. 1, 1993.

Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof - Problem Formulation
System: $x_{k+1} = f(x_k) + g(x_k) u_k$; cost: $V(x_k) = \min_u \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)$.
This requires solving the DT HJB: $V(x_k) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V(x_{k+1}) \big) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V(f(x_k) + g(x_k) u_k) \big)$,
with $u(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV(x_{k+1})}{dx_{k+1}}$.

Asma Al-Tamimi - Discrete-Time Nonlinear Adaptive Dynamic Programming
System dynamics: $x_{k+1} = f(x_k) + g(x_k) u(x_k)$; $V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)$
Value function recursion: $V(x_k) = x_k^T Q x_k + u_k^T R u_k + \sum_{i=k+1}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
HDP:
$u_i(x_k) = \arg\min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)$
$V_{i+1}(x_k) = \min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big) = x_k^T Q x_k + u_i^T(x_k) R u_i(x_k) + V_i\big( f(x_k) + g(x_k) u_i(x_k) \big)$

Asma Al-Tamimi Proof of convergence of DT nonlinear HDP Flavor of proofs

Standard Neural Network VFA for On-Line Implementation
Critic NN for the value (HDP, can use a 2-layer NN): $\hat{V}_i(x_k, W_{Vi}) = W_{Vi}^T \varphi(x_k)$; action NN for the control: $\hat{u}_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$.
HDP:
$V_{i+1}(x_k) = \min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big) = x_k^T Q x_k + u_i^T(x_k) R u_i(x_k) + V_i\big( f(x_k) + g(x_k) u_i(x_k) \big)$
$u_i(x_k) = \arg\min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)$
Define the target cost function $d\big( \varphi(x_k), W_{Vi}^T \big) = x_k^T Q x_k + \hat{u}_i^T(x_k) R \hat{u}_i(x_k) + \hat{V}_i(x_{k+1}) = x_k^T Q x_k + \hat{u}_i^T(x_k) R \hat{u}_i(x_k) + W_{Vi}^T \varphi(x_{k+1})$.
Explicit equation for the cost - use LS for the critic NN update:
$W_{V,i+1} = \arg\min_W \int_{\Omega} \big| W^T \varphi(x_k) - d(\varphi(x_k), W_{Vi}^T) \big|^2 dx_k$, giving $W_{V,i+1} = \Big( \int_{\Omega} \varphi(x_k) \varphi(x_k)^T dx \Big)^{-1} \int_{\Omega} \varphi(x_k) \, d^T\big( \varphi(x_k), W_{Vi}^T, W_{ui}^T \big) dx$.
Implicit equation for the DT control - use gradient descent for the action update:
$W_{ui} = \arg\min_{\alpha} \int_{\Omega} \big[ x_k^T Q x_k + \hat{u}^T(x_k, \alpha) R \hat{u}(x_k, \alpha) + \hat{V}_i\big( f(x_k) + g(x_k) \hat{u}(x_k, \alpha) \big) \big] dx$
$W_{ui(j+1)} = W_{ui(j)} - \alpha \frac{\partial \big[ x_k^T Q x_k + \hat{u}_{i(j)}^T R \hat{u}_{i(j)} + \hat{V}_i(x_{k+1}) \big]}{\partial W_{ui(j)}}$, i.e. $W_{ui}^{j+1} = W_{ui}^j - \alpha \, \sigma(x_k) \Big( 2 R \hat{u}_{i(j)} + g(x_k)^T \frac{\partial \varphi(x_{k+1})}{\partial x_{k+1}} W_{Vi} \Big)^T$
Backpropagation - P. Werbos.
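
A compact Python sketch of one HDP iteration with linear-in-parameter approximators over a sampled training set: a gradient-descent actor update against V_i followed by a batch-LS critic update giving V_{i+1}. The basis choices, sample set, and step size are illustrative assumptions, and the value gradient is computed analytically for the chosen basis:

```python
import numpy as np

# Illustrative basis choices for a 2-state, 1-input system (assumptions, not from the talk)
def phi(x):        # critic basis: quadratic monomials
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])

def dphi_dx(x):    # Jacobian of phi: rows = basis functions, columns = states
    x1, x2 = x
    return np.array([[2 * x1, 0.0], [x2, x1], [0.0, 2 * x2]])

def sigma(x):      # actor basis: linear in the state
    return np.asarray(x)

def hdp_iteration(WV, Wu, samples, f, g, Q, R, alpha=0.01, actor_steps=50):
    """One HDP step over a sampled training set: gradient-descent actor update
    against V_i, then a batch-LS critic update giving V_{i+1}."""
    # Actor: gradient steps on x'Qx + u'Ru + V_i(f(x) + g(x)u)
    for _ in range(actor_steps):
        grad = np.zeros_like(Wu)
        for x in samples:
            u = Wu.T @ sigma(x)
            xn = f(x) + g(x) @ u      # online, x_{k+1} would be measured instead
            grad += np.outer(sigma(x), 2 * R @ u + g(x).T @ dphi_dx(xn).T @ WV)
        Wu = Wu - alpha * grad / len(samples)
    # Critic: solve WV_new' phi(x) = x'Qx + u'Ru + WV' phi(x_next) in the LS sense
    Phi, d = [], []
    for x in samples:
        u = Wu.T @ sigma(x)
        xn = f(x) + g(x) @ u
        Phi.append(phi(x))
        d.append(x @ Q @ x + u @ R @ u + WV @ phi(xn))
    WV_new, *_ = np.linalg.lstsq(np.array(Phi), np.array(d), rcond=None)
    return WV_new, Wu
```

Iterating this over i, with samples drawn over a compact region of the state space, mirrors the scheme above; for a linear test case one can pass f = lambda x: A @ x and g = lambda x: B.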

Issues with Nonlinear ADP - Selection of the NN Training Set
LS solution for the critic NN update: $W_{V,i+1} = \Big( \int_{\Omega} \varphi(x_k) \varphi(x_k)^T dx \Big)^{-1} \int_{\Omega} \varphi(x_k) \, d^T\big( \varphi(x_k), W_{Vi}^T, W_{ui}^T \big) dx$
The integral over a region of the state space is approximated using a set of sample points (batch LS), or sample points are taken along a single trajectory (recursive least squares, RLS).
[Figures: sample points over a region of the state space vs. points along a single trajectory.]
Set of points over a region vs. points along a trajectory: for linear systems these are the same. Conjecture: for nonlinear systems they are the same under a persistence of excitation condition - exploration.

Interesting Fact for HDP for Nonlinear Systems
Linear case: $h_j(x_k) = -L_j x_k = -(I + B^T P_j B)^{-1} B^T P_j A x_k$ - one must know the system A and B matrices.
NN for the control action: $\hat{u}_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$
Implicit equation for the DT control - use gradient descent for the action update:
$W_{ui} = \arg\min_{\alpha} \int_{\Omega} \big[ x_k^T Q x_k + \hat{u}^T(x_k, \alpha) R \hat{u}(x_k, \alpha) + \hat{V}_i\big( f(x_k) + g(x_k) \hat{u}(x_k, \alpha) \big) \big] dx$
$W_{ui(j+1)} = W_{ui(j)} - \alpha \frac{\partial \big[ x_k^T Q x_k + \hat{u}_{i(j)}^T R \hat{u}_{i(j)} + \hat{V}_i(x_{k+1}) \big]}{\partial W_{ui(j)}}$, i.e. $W_{ui}^{j+1} = W_{ui}^j - \alpha \, \sigma(x_k) \Big( 2 R \hat{u}_{i(j)} + g(x_k)^T \frac{\partial \varphi(x_{k+1})}{\partial x_{k+1}} W_{Vi} \Big)^T$
Note that the internal dynamics $f(x_k)$ is NOT needed in the nonlinear case, since: 1. an NN approximation for the action is used; 2. $x_{k+1}$ is measured.

Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof - Simulation Example 1
The linear system - aircraft longitudinal dynamics (an unstable, two-input system):
$A = \begin{bmatrix} 1.0722 & 0.0954 & 0 & -0.0541 & -0.0153 \\ 4.1534 & 1.1175 & 0 & -0.8000 & -0.1010 \\ 0.1359 & 0.0071 & 1.0 & 0.0039 & 0.0097 \\ 0 & 0 & 0 & 0.1353 & 0 \\ 0 & 0 & 0 & 0 & 0.1353 \end{bmatrix}$, $B = \begin{bmatrix} -0.0453 & -0.0175 \\ -1.0042 & -0.1131 \\ 0.0075 & 0.0134 \\ 0.8647 & 0 \\ 0 & 0.8647 \end{bmatrix}$
The HJB, i.e. ARE, solution:
$P = \begin{bmatrix} 55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\ 7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\ 16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\ -4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\ -0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240 \end{bmatrix}$, $L = \begin{bmatrix} -4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\ -0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798 \end{bmatrix}$

Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof - Simulation
The cost function approximation: $\hat{V}_{i+1}(x_k, W_{V,i+1}) = W_{V,i+1}^T \varphi(x_k)$, with
$\varphi^T(x) = [x_1^2 \;\; x_1 x_2 \;\; x_1 x_3 \;\; x_1 x_4 \;\; x_1 x_5 \;\; x_2^2 \;\; x_2 x_3 \;\; x_2 x_4 \;\; x_2 x_5 \;\; x_3^2 \;\; x_3 x_4 \;\; x_3 x_5 \;\; x_4^2 \;\; x_4 x_5 \;\; x_5^2]$ and $W_V^T = [w_{V1} \; w_{V2} \; \cdots \; w_{V15}]$.
The policy approximation: $\hat{u}_i = W_{ui}^T \sigma(x_k)$, with $\sigma^T(x) = [x_1 \;\; x_2 \;\; x_3 \;\; x_4 \;\; x_5]$ and $W_u^T = \begin{bmatrix} w_{u11} & w_{u12} & w_{u13} & w_{u14} & w_{u15} \\ w_{u21} & w_{u22} & w_{u23} & w_{u24} & w_{u25} \end{bmatrix}$.

Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof - Simulation
The convergence of the cost: the critic weights converge to
$W_V^T = [55.5411 \;\; 15.2789 \;\; 31.3032 \;\; -9.3255 \;\; -1.4536 \;\; 2.3142 \;\; 2.9234 \;\; -1.6594 \;\; -0.2430 \;\; 24.8262 \;\; -1.3076 \;\; 0.0920 \;\; 1.5388 \;\; 0.1564 \;\; 1.0240]$.
With the kernel matrix recovered as
$P = \begin{bmatrix} w_{V1} & 0.5 w_{V2} & 0.5 w_{V3} & 0.5 w_{V4} & 0.5 w_{V5} \\ 0.5 w_{V2} & w_{V6} & 0.5 w_{V7} & 0.5 w_{V8} & 0.5 w_{V9} \\ 0.5 w_{V3} & 0.5 w_{V7} & w_{V10} & 0.5 w_{V11} & 0.5 w_{V12} \\ 0.5 w_{V4} & 0.5 w_{V8} & 0.5 w_{V11} & w_{V13} & 0.5 w_{V14} \\ 0.5 w_{V5} & 0.5 w_{V9} & 0.5 w_{V12} & 0.5 w_{V14} & w_{V15} \end{bmatrix}$,
this closely matches the ARE solution
$P = \begin{bmatrix} 55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\ 7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\ 16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\ -4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\ -0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240 \end{bmatrix}$.

Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof - Simulation
The convergence of the control policy: the actor weights converge to
$W_u = \begin{bmatrix} 4.1068 & 0.7164 & 0.3756 & -0.5274 & -0.0707 \\ 0.6330 & 0.1005 & -0.1216 & -0.0653 & -0.0798 \end{bmatrix}$,
which matches (with $\hat{u} = W_u^T \sigma(x) = -L x$, i.e. $w_{u11}, \ldots, w_{u25}$ corresponding to $-L_{11}, \ldots, -L_{25}$) the ARE gain
$L = \begin{bmatrix} -4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\ -0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798 \end{bmatrix}$.
Note: in this example, the internal dynamics matrix A is NOT needed.

Falb and Wolovich, "Decoupling in the design and synthesis of multivariable control systems," IEEE Trans. Automatic Control, 1967.
Wolovich and Falb, "On the structure of multivariable systems," SIAM J. Control, 1969.
Wolovich, "The use of state feedback for exact model matching," SIAM J. Control, 1972.
