# Adaptive Optimal Control of Partially-Unknown Constrained-Input Systems


## II. Optimal Control of Constrained-Input Systems

### A. Constrained optimal control and policy iteration

In this section, the optimal control problem for affine-in-the-input nonlinear systems with input constraints is formulated, and an offline PI algorithm is given for solving it.

Consider system dynamics described by the differential equation

$$\dot{x} = f(x) + g(x)u(x) \tag{1}$$

where $x \in \mathbb{R}^n$ is the measurable system state vector, $f(x) \in \mathbb{R}^n$ is the drift dynamics of the system, $g(x) \in \mathbb{R}^{n \times 1}$ is the input dynamics of the system, and $u(x) \in \mathbb{R}$ is the control input. We denote by $\Omega_u = \{ u \mid u \in \mathbb{R},\ |u(x)| \le \lambda \}$ the set of all inputs satisfying the input constraints, where $\lambda$ is the saturating bound of the actuators. It is assumed that $f(x) + g(x)u$ is Lipschitz and that the system is stabilizable. It is further assumed that the drift dynamics $f(x)$ is unknown while $g(x)$ is known.

Define the performance index

$$V(x(t)) = \int_t^{\infty} \big( Q(x(\tau)) + U(u(\tau)) \big)\, d\tau \tag{2}$$

where $Q(x)$ is a positive definite, monotonically increasing function and $U(u)$ is a positive definite integrand function.

**Assumption 1:** The performance functional (2) satisfies zero-state observability.

**Definition 1 (Admissible control)** [6, 7]: A control policy $\mu(x)$ is said to be admissible with respect to (2) on $\Omega$, denoted $\mu \in \pi(\Omega)$, if $\mu(x)$ is continuous on $\Omega$, $\mu(0) = 0$, $u(x) = \mu(x)$ stabilizes system (1) on $\Omega$, and $V(x_0)$ is finite for all $x_0 \in \Omega$.

To deal with the input constraints, the following generalized nonquadratic functional can be used [7, 22]:

$$U(u) = 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \tag{3}$$

Using the cost functional (3) in Eq. (2), the value function becomes

$$V(x(t)) = \int_t^{\infty} \Big( Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \Big)\, d\tau \tag{4}$$

Differentiating $V$ along the system trajectories, the following Bellman equation is obtained:

$$Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv + \nabla V^T(x) \big( f(x) + g(x)u(x) \big) = 0 \tag{5}$$

where $\nabla V(x) = \partial V(x) / \partial x$.
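As a quick numerical sanity check on the nonquadratic penalty (3), the sketch below (scalar input, with λ = 1 and R = 1 chosen purely for illustration) compares a midpoint-rule quadrature of the integral against the closed form obtained by integrating by parts. The closed-form expression is a standard antiderivative identity, not a formula quoted from the text above.

```python
import math

def U_quad(u, lam=1.0, R=1.0, n=200_000):
    """Numerically evaluate U(u) = 2 * integral_0^u lam*atanh(v/lam)*R dv (Eq. (3))."""
    h = u / n
    total = 0.0
    for k in range(n):
        v = (k + 0.5) * h              # midpoint rule
        total += lam * math.atanh(v / lam) * R * h
    return 2.0 * total

def U_closed(u, lam=1.0, R=1.0):
    """Closed form of (3) via integration by parts:
    U(u) = 2*lam*R*u*atanh(u/lam) + lam^2*R*ln(1 - (u/lam)^2)."""
    return (2.0 * lam * R * u * math.atanh(u / lam)
            + lam**2 * R * math.log(1.0 - (u / lam) ** 2))

u = 0.7
print(U_quad(u), U_closed(u))          # the two evaluations closely agree
```

Note that $U(u)$ is positive for $u \ne 0$ and blows up as $|u| \to \lambda$, which is what makes the resulting optimal control respect the actuator bound.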
The optimal value function $V^*(x)$ satisfies the HJB equation

$$\min_{u \in \pi(\Omega)} \Big[ Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv + \nabla V^{*T}(x) \big( f(x) + g(x)u(x) \big) \Big] = 0 \tag{6}$$

The optimal control for the given problem is obtained by differentiating Eq. (6) with respect to $u$ and is given as

$$u^*(x) = -\lambda \tanh\Big( \frac{1}{2\lambda} R^{-1} g^T(x) \nabla V^*(x) \Big) \tag{7}$$

*American Institute of Aeronautics and Astronautics*

Using Eq. (7) in Eq. (3) yields

$$U(u^*) = \lambda \nabla V^{*T}(x)\, g(x) \tanh(D^*) + \lambda^2 R \ln\big( 1 - \tanh^2(D^*) \big) \tag{8}$$

where $D^* = \frac{1}{2\lambda} R^{-1} g^T(x) \nabla V^*(x)$. Substituting $u^*(x)$ from (7) back into Eq. (5) and using $U(u^*)$ from (8), the following HJB equation is obtained:

$$H(x, u^*, \nabla V^*) = Q(x) + \nabla V^{*T}(x) f(x) + \lambda^2 R \ln\big( 1 - \tanh^2(D^*) \big) = 0 \tag{9}$$

To find the optimal control solution directly, the HJB equation (9) must first be solved for the optimal value function; the optimal control input that achieves this minimal performance is then obtained from Eq. (7). However, the HJB equation (9) is a nonlinear PDE, which may be impossible to solve in practice.

Instead of directly solving the HJB equation, an iterative PI algorithm is presented in [?]. The PI algorithm starts with a given admissible control policy and then performs a sequence of two-step iterations to find the optimal control policy. In the policy evaluation step, the Bellman equation (5) is used to find the value function of a given fixed policy; in the policy improvement step, using the value function found in the policy evaluation step, the algorithm finds an improved control policy of the form of Eq. (7). However, to evaluate the value of a fixed policy using the Bellman equation (5), complete knowledge of the system dynamics must be available a priori. To find an equivalent formulation of the Bellman equation for the policy evaluation step that does not involve the dynamics, we use the integral reinforcement learning (IRL) idea introduced in [?]. Note that for any time $t$ and time interval $T > 0$, the value function (4) satisfies

$$V(x_t) = \int_t^{t+T} \Big( Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \Big)\, d\tau + V(x_{t+T}) \tag{10}$$

In [?], it is shown that Eq. (10) and Eq. (4) are equivalent and have the same solution. Therefore, Eq. (10) can be viewed as a Bellman equation for CT systems. Note that the IRL form of the Bellman equation does not involve the system dynamics. Using Eq. (10) instead of Eq. (5) to evaluate the value function, the following PI algorithm is obtained.

**Algorithm 3.1: Integral Reinforcement Learning**

1. (policy evaluation) Given a control input $u^i(x)$, find $V^i(x)$ using the Bellman equation

$$V^i(x_t) = \int_t^{t+T} \Big( Q(x) + 2 \int_0^{u^i} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \Big)\, d\tau + V^i(x_{t+T}) \tag{11}$$

2. (policy improvement) Update the control policy using

$$u^{i+1}(x) = -\lambda \tanh\Big( \frac{1}{2\lambda} R^{-1} g^T(x) \nabla V^i(x) \Big) \tag{12}$$

The above PI algorithm only needs knowledge of the input dynamics. The online implementation of this PI algorithm is introduced in Section III.

### B. Value function approximation and the approximated HJB equation

In this subsection, we discuss value function approximation to solve for the cost function $V(x)$ in the policy evaluation step (11). Assuming the value function is a smooth function, according to the Weierstrass high-order approximation theorem [?], there exists a single-layer NN such that the solution $V(x)$ and its gradient can be uniformly approximated as

$$V(x) = W_1^T \phi(x) + \varepsilon_v(x) \tag{13}$$

$$\nabla V(x) = \nabla \phi^T(x) W_1 + \nabla \varepsilon_v(x) \tag{14}$$

where $\phi(x) \in \mathbb{R}^l$ is a suitable basis function vector, $\varepsilon_v(x)$ is the approximation error, $W_1 \in \mathbb{R}^l$ is a constant parameter vector, and $l$ is the number of neurons.
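The policy improvement step (12), combined with the critic parameterization (13)–(14), can be sketched as below. The polynomial basis, the critic weights, and the constant input-dynamics vector `g` are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np

lam, R = 2.0, 1.0                     # saturation bound and input weight (assumed values)

def phi(x):
    """Illustrative quadratic basis: phi(x) = [x1^2, x1*x2, x2^2]."""
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def grad_phi(x):
    """Jacobian of the basis, grad_phi(x) in R^{3x2}."""
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    2*x[1]]])

def policy_improvement(W, x, g):
    """Eq. (12): u = -lam * tanh( (1/(2*lam)) * R^{-1} * g(x)^T * grad_phi(x)^T * W )."""
    dV = grad_phi(x).T @ W            # gradient of the critic, Eq. (14) with eps_v dropped
    return -lam * np.tanh((1.0 / (2.0 * lam)) * (1.0 / R) * (g.T @ dV))

W = np.array([3.0, -1.0, 2.0])        # hypothetical critic weights
g = np.array([0.0, 1.0])              # constant input dynamics, for illustration
for x in ([5.0, -4.0], [0.1, 0.2], [-30.0, 10.0]):
    u = policy_improvement(W, np.array(x), g)
    assert abs(u) <= lam              # tanh keeps the policy inside the actuator bound
```

The key design point is that the improved policy is saturated by construction: no projection or clipping step is needed, because the $\tanh$ in (12) maps any critic gradient into $[-\lambda, \lambda]$.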

**Assumption 2.** The NN reconstruction error and its gradient are bounded over a compact set. Also, the NN activation functions and their gradients are bounded.

Before presenting the actor and critic update laws, it is necessary to examine the effect of the reconstruction error on the HJB equation. Approximating the optimal value function by Eq. (13) and using its gradient (14) in the Bellman equation (10) yields

$$\int_{t-T}^{t} \Big( Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \Big)\, d\tau + W_1^T \Delta\phi(x(t)) = \varepsilon_B(t) \tag{15}$$

where

$$\Delta\phi(x(t)) = \phi(x(t)) - \phi(x(t-T)) \tag{16}$$

and $\varepsilon_B(t)$ is the Bellman approximation error, which under Assumption 2 is bounded on the compact set $\Omega$. Also, the optimal policy is obtained as

$$u^* = -\lambda \tanh\Big( \frac{1}{2\lambda} R^{-1} g^T \big( \nabla\phi^T W_1 + \nabla\varepsilon_v \big) \Big) \tag{17}$$

Using Eq. (17) in Eq. (15), the following HJB equation is obtained:

$$\int_{t-T}^{t} \Big( Q + W_1^T \nabla\phi\, f + \lambda^2 R \ln\big( 1 - \tanh^2(D) \big) + \varepsilon_{HJB} \Big)\, d\tau = 0 \tag{18}$$

where $D = \frac{1}{2\lambda} R^{-1} g^T \nabla\phi^T W_1$ and $\varepsilon_{HJB}$ is the residual error due to the function reconstruction error. In [?], the authors show that for each constant bound $\varepsilon_h$, a NN can be constructed such that $\sup_x |\varepsilon_{HJB}| \le \varepsilon_h$. Note that in Eq. (18) and in the sequel, the variable $x$ is dropped for ease of exposition.

## III. Online Integral Reinforcement Learning to Solve the Constrained Optimal Control Problem

An online IRL algorithm based on the policy iteration (PI) algorithm is now given. The learning structure uses two NNs, i.e., an actor NN and a critic NN, which approximate the solution of the Bellman equation and its corresponding policy. The offline PI Algorithm 3.1 is used to motivate the structure of this online PI algorithm. Instead of sequentially updating the critic and actor NNs, as in Algorithm 3.1, both are updated simultaneously in real time. We call this synchronized online PI. This is the continuous-time version of Generalized Policy Iteration (GPI) introduced in [?].
### A. Critic NN and tuning using experience replay

This subsection presents tuning and convergence of the critic NN weights for a fixed admissible control policy, in effect designing an observer for the unknown value function for use in feedback.

Consider a fixed admissible control policy $u(x)$ and assume that its corresponding value function is approximated by Eq. (13). Then, the Bellman equation (15) can be used to find the value function related to this control policy. However, the ideal weights of the critic NN, i.e., $W_1$, which provide the best approximate solution of Eq. (15), are unknown and must be estimated in real time. Hence, the output of the critic NN can be written as

$$\hat{V}(x) = \hat{W}_1^T \phi \tag{19}$$

where the weights $\hat{W}_1$ are the current estimated values of $W_1$. The approximate Bellman equation then becomes

$$\int_{t-T}^{t} \Big( Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \Big)\, d\tau + \hat{W}_1^T \Delta\phi(x(t)) = e(t) \tag{20}$$

Equation (20) can be written as

$$e(t) = \hat{W}_1^T(t) \Delta\phi(t) + p(t) \tag{21}$$

where
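To make Eqs. (20)–(21) concrete, the sketch below builds synthetic samples $(\Delta\phi(t), p(t))$ for which a chosen "ideal" weight vector satisfies the Bellman equation exactly (i.e., $\varepsilon_B = 0$), and checks that the TD error $e$ vanishes at the ideal weights but not at perturbed ones. All numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = np.array([1.5, -0.5, 2.0])          # hypothetical "ideal" critic weights

# Synthetic samples of dphi(t); p(t) is generated so that the Bellman
# equation W_true^T dphi + p = 0 holds exactly (eps_B = 0 in Eq. (15)).
dphi = rng.normal(size=(10, 3))
p = -dphi @ W_true

def bellman_error(W_hat, dphi_t, p_t):
    """Eq. (21): e(t) = W_hat^T dphi(t) + p(t)."""
    return W_hat @ dphi_t + p_t

# Exact weights drive the TD error to zero; perturbed weights do not.
e_exact = [bellman_error(W_true, d, q) for d, q in zip(dphi, p)]
e_wrong = [bellman_error(W_true + 0.3, d, q) for d, q in zip(dphi, p)]
print(max(abs(e) for e in e_exact))   # essentially zero (round-off only)
print(max(abs(e) for e in e_wrong))   # clearly nonzero
```

This is exactly the residual the critic tuning law in the next subsection drives toward zero, using both the current sample and the recorded history stack.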

$$p(t) = \int_{t-T}^{t} \Big( Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \Big)\, d\tau \tag{22}$$

Note that the Bellman error $e$ in Eqs. (20) and (21) is the continuous-time counterpart of the temporal difference (TD) error [?]. The problem of finding the value function is now converted to adjusting the parameters of the critic NN such that the TD error $e$ is minimized.

In the following, a real-time learning algorithm based on the experience replay technique is applied for updating the critic NN weights. In contrast to traditional learning algorithms, in which only the instantaneous Bellman equation error is used to update the critic weights, recorded data are used concurrently with current data for adaptation of the critic NN weights. Using this learning law, a simple condition on the richness of the recorded data is sufficient to guarantee exponential convergence of the parameter estimation error.

The proposed experience replay-based update rule for the critic NN weights stores recent transition samples and repeatedly presents them to the gradient-based update rule. It can be interpreted as a gradient-descent algorithm that tries to minimize not only the instantaneous Bellman error, but also the Bellman equation error for the stored transition samples evaluated with the current critic NN weights. These samples are stored in a history stack. To collect a history stack, let $t_j$, $j = 1, \ldots, N$, denote some recorded times during learning. Let

$$\Delta\phi_j = \Delta\phi(t_j) = \phi(x(t_j)) - \phi(x(t_j - T)) \tag{23}$$

and

$$p_j = p(t_j) = \int_{t_j - T}^{t_j} \Big( Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R \, dv \Big)\, d\tau \tag{24}$$

denote $\Delta\phi(t)$ and $p(t)$ evaluated at time $t_j$, $j = 1, \ldots, N$. Then

$$e_j = \hat{W}_1^T(t) \Delta\phi_j + p_j \tag{25}$$

is the Bellman equation error at time $t_j$ using the current critic NN weights. Note that using Eqs. (15), (21), and (25), we have

$$e_j = -\tilde{W}_1^T(t) \Delta\phi_j + \varepsilon_B(t_j) \tag{26}$$

$$e(t) = -\tilde{W}_1^T(t) \Delta\phi(t) + \varepsilon_B(t) \tag{27}$$

where $\varepsilon_B(t_j)$ is the reconstruction error of Eq. (15) at time $t_j$, and $\tilde{W}_1 = W_1 - \hat{W}_1$.
The proposed gradient-descent learning algorithm for the critic NN is now given as

$$\dot{\hat{W}}_1(t) = -\alpha_1 \frac{\Delta\phi(t)}{\big( 1 + \Delta\phi^T(t)\Delta\phi(t) \big)^2} \Big( p(t) + \Delta\phi^T(t)\hat{W}_1(t) \Big) - \alpha_1 \sum_{j=1}^{N} \frac{\Delta\phi_j}{\big( 1 + \Delta\phi_j^T \Delta\phi_j \big)^2} \Big( p_j + \Delta\phi_j^T \hat{W}_1(t) \Big) \tag{28}$$

**Remark 2.** Note that in this experience replay tuning law the last term depends on the history stack of previously recorded activation function differences. Furthermore, note that the updates based on both current and recorded data use the current estimate of the weights.

Using Eqs. (26), (27), and (28), and the notations $\overline{\Delta\phi}(t) = \Delta\phi(t) / \big( 1 + \Delta\phi^T(t)\Delta\phi(t) \big)$ and $m_s(t) = 1 + \Delta\phi^T(t)\Delta\phi(t)$, the critic NN weight error dynamics become

$$\dot{\tilde{W}}_1(t) = -\alpha_1 \Big( \overline{\Delta\phi}(t)\,\overline{\Delta\phi}^T(t) + \sum_{j=1}^{N} \overline{\Delta\phi}_j\, \overline{\Delta\phi}_j^T \Big) \tilde{W}_1(t) + \alpha_1 \Big( \overline{\Delta\phi}(t) \frac{\varepsilon_B(t)}{m_s(t)} + \sum_{j=1}^{N} \overline{\Delta\phi}_j \frac{\varepsilon_B(t_j)}{m_{s_j}} \Big)$$
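A minimal discretized simulation of the experience replay tuning law (28) is sketched below, under two simplifying assumptions stated up front: the Bellman residual is zero ($\varepsilon_B = 0$), and the regressors $\Delta\phi$ are synthetic random vectors whose recorded history stack spans the parameter space (the richness condition mentioned above). This is an illustration of the update rule, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
W_true = np.array([1.0, -2.0, 0.5])           # hypothetical ideal weights

# Recorded history stack {dphi_j, p_j}; with eps_B = 0, p_j = -W_true^T dphi_j.
# Six random vectors span R^3, satisfying the richness condition.
stack = rng.normal(size=(6, 3))
p_stack = -stack @ W_true

alpha, dt = 5.0, 1e-3                         # gain and Euler step (assumed values)
W_hat = np.zeros(3)
for step in range(20000):
    dphi_t = rng.normal(size=3)               # "current" measured dphi(t)
    p_t = -dphi_t @ W_true
    dW = np.zeros(3)
    for dphi, p in [(dphi_t, p_t)] + list(zip(stack, p_stack)):
        ms = 1.0 + dphi @ dphi                # normalization (1 + dphi^T dphi)
        dW += -alpha * dphi / ms**2 * (p + dphi @ W_hat)   # terms of Eq. (28)
    W_hat += dt * dW                          # Euler step of the ODE (28)

print(np.abs(W_hat - W_true).max())           # small: the estimate converges
```

Because the stacked outer products $\overline{\Delta\phi}_j \overline{\Delta\phi}_j^T$ sum to a positive definite matrix here, the error dynamics above are exponentially stable even without persistent excitation of the current signal, which is the practical appeal of the experience replay term.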

