UTA Research Institute (UTARI) - University Of Texas At Arlington


F.L. Lewis, UTA Research Institute (UTARI), The University of Texas at Arlington, USA. Reinforcement Methods for Autonomous Online Learning in Robotic Systems. Talk available online at http://ARRI.uta.edu/acs

Optimal Control; Reinforcement Learning; Policy Iteration; Q Learning; Humanoid Robot Learning Control Using RL; Telerobotic Interface Learning Using RL.

It is man’s obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer. Man’s task is to understand patterns in nature and society. The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls. Ibn Sina 1002-1042 (Avicenna)

Synthesis of Robot Control and Biologically Inspired Learning

F.L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits & Systems Magazine, invited feature article, pp. 32-50, Third Quarter 2009. IEEE Control Systems Magazine feature article to appear.

Importance of Feedback Control. Darwin: feedback and natural selection. Volterra: feedback and fish population balance. Adam Smith: feedback and the international economy. James Watt: feedback and the steam engine. Feedback and cell homeostasis. The resources available to most species for their survival are meager and limited; nature uses optimal control.

Controller Topologies. Indirect scheme: a system identifier driven by the identification error produces the estimated output ŷ(t), which the controller uses to generate the control u(t) so that the plant output y(t) follows the desired output yd(t). Direct scheme: the controller acts directly on the tracking error between the desired output yd(t) and the plant output y(t). Feedback/feedforward scheme: controller #1 acts on the tracking error while controller #2 provides a feedforward control term.

Issues in Feedback Control. Inputs and conditions: desired trajectories, disturbances, sensor noise, unknown system dynamics, measured outputs. The feedforward and feedback controllers must provide stability, tracking, boundedness, and robustness to disturbances and to unknown dynamics.

Discrete-Time Optimal Control. System: $x_{k+1} = f(x_k) + g(x_k)u_k$. Cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i,u_i)$, with the difference-equation equivalent $V_h(x_k) = r(x_k,u_k) + \sum_{i=k+1}^{\infty} r(x_i,u_i)$. Control policy: $u_k = h(x_k)$, the prescribed control input function; example: $u_k = -Kx_k$, linear state-variable feedback. Example utility: $r(x_k,u_k) = x_k^T Q x_k + u_k^T R u_k$. Bellman equation: $V_h(x_k) = r(x_k,h(x_k)) + V_h(x_{k+1})$, $V_h(0)=0$; for the quadratic utility, $V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1})$. Bellman's principle gives the Bellman optimality equation (DT HJB) $V^*(x_k) = \min_{u_k}\,(r(x_k,u_k) + V^*(x_{k+1}))$ and the optimal control $h^*(x_k) = \arg\min_{u_k}\,(r(x_k,u_k) + V^*(x_{k+1}))$, i.e. $u(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V(x_{k+1})/\partial x_{k+1}$. Off-line solution; dynamics must be known.

DT Optimal Control – Linear Systems, Quadratic Cost (LQR). System: $x_{k+1} = Ax_k + Bu_k$. Cost: $V(x_k) = \sum_{i=k}^{\infty}(x_i^T Q x_i + u_i^T R u_i)$. Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$. The HJB becomes the DT Riccati equation $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$. Optimal control: $u_k = -Kx_k$ with $K = (R + B^T P B)^{-1} B^T P A$. Optimal cost: $V^*(x_k) = x_k^T P x_k$. Off-line solution; dynamics must be known.
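
For concreteness, here is a minimal sketch, not taken from the slides, of the off-line route: the DT Riccati equation is solved by backward iteration for a hypothetical two-state example (A, B, Q, R are made up). Note that every step uses A and B, which is exactly what the online methods described later avoid.

```matlab
% Off-line LQR design by iterating the DT Riccati recursion to a fixed point.
% A, B, Q, R are an assumed example, not from the talk.
A = [1.0 0.1; 0 0.9];
B = [0; 0.1];
Q = eye(2);  R = 1;

P = Q;                               % initial value for the Riccati iteration
for i = 1:500
    K = (R + B'*P*B) \ (B'*P*A);     % gain implied by the current P
    P = A'*P*A - A'*P*B*K + Q;       % Riccati recursion
end
K_opt = (R + B'*P*B) \ (B'*P*A);     % optimal feedback gain, u_k = -K_opt*x_k
```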

We want robot controllers that learn optimal control solutions online in real time: a synthesis of computational intelligence, control systems, and neurobiology. Different methods of learning. Machine learning is the formal study of learning systems: supervised learning, unsupervised learning, reinforcement learning.

Different methods of learning: we want OPTIMAL performance, hence ADP (Approximate Dynamic Programming, Paul Werbos) and reinforcement learning (Ivan Pavlov, 1890s). Actor-critic learning: the critic compares the system outputs with the desired performance and generates a reinforcement signal that tunes the actor, the adaptive learning system that produces the control inputs applied to the environment.

RL Policy Iterations to Solve the Optimal Control Problem. System: $x_{k+1} = f(x_k) + g(x_k)u_k$; cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i,u_i)$, with the difference-equation equivalent $V_h(x_k) = r(x_k,u_k) + \sum_{i=k+1}^{\infty} r(x_i,u_i)$. Bellman equation: $V_h(x_k) = r(x_k,h(x_k)) + V_h(x_{k+1})$, $V_h(0)=0$; for the quadratic utility, $V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1})$. Bellman's principle gives the Bellman optimality equation (DT HJB) $V^*(x_k) = \min_{u_k}(r(x_k,u_k) + V^*(x_{k+1}))$ and the optimal control $h^*(x_k) = \arg\min_{u_k}(r(x_k,u_k) + V^*(x_{k+1}))$, i.e. $u(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V(x_{k+1})/\partial x_{k+1}$. Focus on these two equations: the Bellman equation and the minimizing control.

Bellman Equation: $V_h(x_k) = r(x_k,h(x_k)) + V_h(x_{k+1})$. It can be interpreted as a consistency equation that must be satisfied by the value function at each time stage. It expresses a relation between the current value of being in state $x$ and the value of being in the next state $x'$ given that the policy $h$ is followed. It captures the action, observation, evaluation, and improvement mechanisms of reinforcement learning. Temporal difference idea: $e_k = -V_h(x_k) + r(x_k,h(x_k)) + V_h(x_{k+1})$.

Policy Evaluation and Policy Improvement. Consider algorithms that repeatedly interleave the two procedures. Policy evaluation by the Bellman equation: $V_h(x_k) = r(x_k,h(x_k)) + V_h(x_{k+1})$. Policy improvement: $h'(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V_h(x_{k+1})/\partial x_{k+1}$. Policy improvement makes $V_{h'}(x) \le V_h(x)$ (Bertsekas and Tsitsiklis 1996, Sutton and Barto 1998); the policy $h'(x_k)$ is said to be greedy with respect to the value function $V_h(x)$. At each step, one obtains a policy that is no worse than the previous policy. One can prove convergence under fairly mild conditions to the optimal value and optimal policy; most such proofs are based on the Banach Fixed Point Theorem, since one step is a contraction map. There is a large family of algorithms that implement the policy evaluation and policy improvement procedures in various ways.

DT Policy Iteration to Solve the HJB. The cost for any given control policy $h(x_k)$ satisfies the recursion $V_h(x_k) = r(x_k,h(x_k)) + V_h(x_{k+1})$, the Bellman equation: a recursive form, a consistency equation, with a recursive solution. Pick a stabilizing initial control. Policy evaluation (solve the Bellman equation): $V_{j+1}(x_k) = r(x_k,h_j(x_k)) + V_{j+1}(x_{k+1})$. Policy improvement: $h_{j+1}(x_k) = \arg\min_{u_k}(r(x_k,u_k) + V_{j+1}(x_{k+1}))$. Howard (1960) proved convergence for MDPs (Bertsekas and Tsitsiklis 1996, Sutton and Barto 1998); the policy $h_{j+1}(x_k)$ is said to be greedy with respect to the value function $V_{j+1}(x)$. At each step, one obtains a policy that is no worse than the previous policy. One can prove convergence under fairly mild conditions to the optimal value and optimal policy; most such proofs are based on the Banach Fixed Point Theorem, since one step is a contraction map.

Methods to implement Policy Iteration: exact computation (needs full system dynamics); temporal difference (for robot trajectory following); Monte Carlo learning (for learning episodic robot tasks).

DT Policy Iteration – Linear Systems, Quadratic Cost (LQR). System: $x_{k+1} = Ax_k + Bu_k = (A - BL)x_k$ with $u_k = -Lx_k$. For any stabilizing policy the cost is $V(x_k) = \sum_{i=k}^{\infty}(x_i^T Q x_i + u^T(x_i)Ru(x_i))$, and the LQR value is quadratic: $V(x) = x^T P x$. DT policy iteration solves the Lyapunov equation without knowing A and B: $V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k)Ru_j(x_k) + V_{j+1}(x_{k+1})$, $u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V_{j+1}(x_{k+1})/\partial x_{k+1}$. This is equivalent to an underlying DT LQR problem: $(A-BL_j)^T P_{j+1}(A-BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$ (a DT Lyapunov equation) and $L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$. Hewer proved convergence in 1971. Policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics.
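
A minimal sketch of this model-based iteration (Hewer 1971) for the same hypothetical two-state example is given below; each policy-evaluation step solves the DT Lyapunov equation exactly, so A and B must be known. The data-driven version sketched later replaces exactly this step with least squares on measured data.

```matlab
% Model-based policy iteration (Hewer): Lyapunov solve + gain update.
% A, B, Q, R and the initial gain L are assumed example values.
A = [1.0 0.1; 0 0.9];  B = [0; 0.1];  Q = eye(2);  R = 1;
L = [1 1];                                   % stabilizing initial policy u = -L*x
n = size(A,1);
for j = 1:20
    Acl  = A - B*L;
    Qbar = Q + L'*R*L;
    % Policy evaluation: solve P = Acl'*P*Acl + Qbar by vectorization
    vecP = (eye(n^2) - kron(Acl', Acl')) \ Qbar(:);
    P    = reshape(vecP, n, n);
    % Policy improvement
    L = (R + B'*P*B) \ (B'*P*A);
end
```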

DT Policy Iteration – How to Implement Online? Linear systems, quadratic cost (LQR): $x_{k+1} = Ax_k + Bu_k$, $V(x_k) = \sum_{i=k}^{\infty}(x_i^T Q x_i + u^T(x_i)Ru(x_i))$. The LQR cost is quadratic, $V(x) = x^T P x$ for some matrix $P$. DT policy iteration solves the Lyapunov equation without knowing A and B: $V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k)Ru_j(x_k) + V_{j+1}(x_{k+1})$, i.e. $x_k^T P_{j+1} x_k - x_{k+1}^T P_{j+1} x_{k+1} = x_k^T Q x_k + u_j^T R u_j$. For two states, $x^T P x = p_{11}(x^1)^2 + 2p_{12}x^1x^2 + p_{22}(x^2)^2$, so in terms of a quadratic basis set the evaluation step becomes $W_{j+1}^T[\phi(x_k) - \phi(x_{k+1})] = x_k^T Q x_k + u_j^T(x_k)Ru_j(x_k)$. Then update the control using $h_j(x_k) = -L_j x_k = -(R + B^T P_j B)^{-1} B^T P_j A x_k$. One needs to know A AND B for the control update.

Implementation – DT Policy Iteration, Nonlinear Case. Value function approximation (VFA): $V(x) = W^T\phi(x)$, with weights $W$ and basis functions $\phi(x)$. LQR case: $V(x)$ is quadratic, $V(x) = x^T P x = W^T\phi(x)$ with quadratic basis functions and weights given by the entries of $P$ ($p_{11}$, $p_{12}$, ...). Nonlinear system case: use a neural network.

Implementation – DT Policy Iteration, Nonlinear Case. Value function approximation (VFA): $V(x) = W^T\phi(x)$, with weights and basis functions; the neural network is a Weierstrass-type universal approximator. The value-function update for a given control, the Bellman equation $V_{j+1}(x_k) = r(x_k,h_j(x_k)) + V_{j+1}(x_{k+1})$, becomes an algebraic equation for the weights: $W_{j+1}^T[\phi(x_k) - \phi(x_{k+1})] = r(x_k,h_j(x_k))$.

Implementation – DT Policy Iteration. The regression equation $W_{j+1}^T[\phi(x_k) - \phi(x_{k+1})] = r(x_k,h_j(x_k))$ is solved for the weights in real time using RLS, or by batch least squares over many trajectories with different initial conditions over a compact set. Then update the control using $u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V_{j+1}(x_{k+1})/\partial x_{k+1} = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\nabla\phi(x_{k+1})^T W_{j+1}$. Now one can write simple MATLAB code for policy iteration.

1. Select a control policy. 2. Find the associated cost: $V_{j+1}(x_k) = r(x_k,h_j(x_k)) + V_{j+1}(x_{k+1})$, i.e. $W_{j+1}^T[\phi(x_k) - \phi(x_{k+1})] = r(x_k,h_j(x_k))$. 3. Improve the control: $u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V_{j+1}(x_{k+1})/\partial x_{k+1}$. Online loop: observe $x_k$, apply $u_k$, observe the cost $r_k$, observe $x_{k+1}$, update $V$, set $k \leftarrow k+1$; continue until convergence to $V_{j+1}$, then update the control to $u_{j+1}$. Needs about 20 lines of MATLAB code. This is direct optimal adaptive control.
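
A minimal sketch of this loop for the hypothetical two-state LQR example is given below (an illustration, not the author's code). Policy evaluation uses only measured data $(x_k, u_k, r_k, x_{k+1})$ and batch least squares on the quadratic basis $\phi(x) = [x_1^2,\; x_1x_2,\; x_2^2]^T$; A and B appear only to simulate the plant and in the control update, as the earlier slide notes. Excitation comes from running several short trajectories with different initial conditions over a compact set.

```matlab
% Data-driven DT policy iteration for the assumed 2-state LQR example.
A = [1.0 0.1; 0 0.9];  B = [0; 0.1];  Q = eye(2);  R = 1;
L = [1 1];                                   % stabilizing initial policy
phi = @(x) [x(1)^2; x(1)*x(2); x(2)^2];      % quadratic basis

for j = 1:15                                 % policy-iteration steps
    Phi = [];  rho = [];                     % batch least-squares data
    for traj = 1:20                          % many short trajectories (excitation)
        x = 2*rand(2,1) - 1;                 % random initial state on a compact set
        for k = 1:10
            u  = -L*x;                       % apply the current policy
            r  = x'*Q*x + u'*R*u;            % observed stage cost
            xn = A*x + B*u;                  % observed next state
            Phi = [Phi; (phi(x) - phi(xn))'];
            rho = [rho; r];
            x = xn;
        end
    end
    W = Phi \ rho;                           % solve W'*(phi(x_k)-phi(x_{k+1})) = r_k
    P = [W(1) W(2)/2; W(2)/2 W(3)];          % recover P from the basis weights
    L = (R + B'*P*B) \ (B'*P*A);             % control update (still needs A and B)
end
```

As the slides emphasize, the evaluation step here is model-free, while the control update still uses A and B; the Q-learning form later removes even that requirement.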

Persistence of Excitation. In $W_{j+1}^T[\phi(x_k) - \phi(x_{k+1})] = r(x_k,h_j(x_k))$, the regression vector must be persistently exciting (PE).

Adaptive Critics: the adaptive critic architecture. Value update (policy evaluation, critic network): $V_{j+1}(x_k) = r(x_k,h_j(x_k)) + V_{j+1}(x_{k+1})$, using RLS until convergence. Control policy update (action network): $u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V_{j+1}(x_{k+1})/\partial x_{k+1}$, giving the control policy $h_j(x_k)$ applied to the system. This leads to an ONLINE FORWARD-IN-TIME implementation of optimal control: optimal adaptive control, a two-timescale controller.

Adaptive Control. Indirect adaptive control identifies the system model; direct adaptive control identifies the controller; optimal adaptive control identifies the performance value $V(x) = W^T\phi(x)$ of the plant-control-output loop.

Oscillation is a fundamental property of neural tissue. The brain has multiple adaptive clocks with different timescales. Gamma rhythms, 30-100 Hz, hippocampus and neocortex: high cognitive activity, consolidation of memory, spatial mapping of the environment (place cells); the high-frequency processing is due to the large amounts of sensory data to be processed. Theta rhythm, hippocampus and thalamus, 4-10 Hz: sensory processing, memory, and voluntary control of movement. Spinal cord: motor control, 200 Hz. D. Vrabie, F. Lewis, and Dan Levine: RL for continuous-time systems.

Limbic system Doya, Kimura, Kawato 2001

Picture by E. Stingu and D. Vrabie. Summary of Motor Control in the Human Nervous System: cerebral cortex (motor areas; long-term memory functions); basal ganglia and limbic system (reinforcement learning, dopamine; gamma rhythms 30-100 Hz); hippocampus (unsupervised learning; theta rhythms 4-10 Hz); thalamus; cerebellum (short-term memory; supervised learning); brainstem (eye movement; inferior olive); spinal cord (motor control, 200 Hz; reflex); exteroceptive and interoceptive receptors; muscle contraction and movement. A hierarchy of multiple parallel loops.

Adaptive Critic structure Reinforcement learning Theta waves 4-8 Hz Motor control 200 Hz

Cerebral cortex (motor areas) and basal ganglia; hippocampus and thalamus, theta rhythms 4-10 Hz and gamma rhythms 30-100 Hz: intense processing due to the amount of information to be processed, and a cognitive map of the environment (place cells). Cerebellum: theta rhythms 4-10 Hz. Brainstem: behavior reference information sent to the lower processing levels; inferior olive. Spinal cord: motor control, 200 Hz. Exteroceptive and interoceptive receptors; muscle contraction and movement.

Adaptive (Approximate) Dynamic Programming: four ADP methods proposed by Paul Werbos, classified by what the critic NN approximates. Heuristic dynamic programming (HDP, value iteration): the value $V(x_k)$. Dual heuristic programming (DHP): the gradient/costate $\partial V/\partial x$. Action-dependent heuristic dynamic programming (ADHDP, Watkins' Q-learning): the Q function $Q(x_k,u_k)$. Action-dependent dual heuristic programming (ADDHP): the gradients $\partial Q/\partial x$, $\partial Q/\partial u$. An action NN approximates the control. Bertsekas: neuro-dynamic programming. Barto & Bradtke: Q-learning proof (imposed a settling time).

Q Learning (Watkins); Action-Dependent ADP (Paul Werbos). Robot system: $x_{k+1} = f(x_k) + g(x_k)u_k$. Value-function recursion for a given policy $h(x_k)$: $V_h(x_k) = r(x_k,h(x_k)) + V_h(x_{k+1})$. Define the Q function $Q_h(x_k,u_k) = r(x_k,u_k) + V_h(x_{k+1})$; note that $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time $k$, so $Q_h(x_k,h(x_k)) = V_h(x_k)$. Bellman equation for Q: $Q_h(x_k,u_k) = r(x_k,u_k) + Q_h(x_{k+1},h(x_{k+1}))$. Simple expression of Bellman's principle: $V^*(x_k) = \min_{u_k} Q^*(x_k,u_k)$ and $h^*(x_k) = \arg\min_{u_k} Q^*(x_k,u_k)$. One does NOT need the model-based update $u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V_j(x_{k+1})/\partial x_{k+1}$: this gives optimal adaptive control for completely unknown DT systems.

Q learning does not need to know $f(x_k)$ or $g(x_k)$. For the LQR, $V(x) = W^T\phi(x) = x^T P x$, so V is quadratic in x. With $x_{k+1} = Ax_k + Bu_k$,
$Q_h(x_k,u_k) = r(x_k,u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$,
so Q is quadratic in x and u. The control update is found from $0 = \partial Q/\partial u_k = 2[B^T P A x_k + (R + B^T P B)u_k] = 2[H_{ux}x_k + H_{uu}u_k]$, so $u_k = -(R + B^T P B)^{-1} B^T P A x_k = -H_{uu}^{-1}H_{ux}x_k = -L_{j+1}x_k$. The control is found only from the Q function; A and B are not needed.

Q Learning – Action-Dependent HDP (Paul Werbos). The Q function for any given control policy $h(x_k)$ satisfies the Bellman equation $Q_h(x_k,u_k) = r(x_k,u_k) + Q_h(x_{k+1},h(x_{k+1}))$. Policy iteration using the Q function, a recursive solution to the HJB: pick a stabilizing initial control policy; find the Q function from $Q_{j+1}(x_k,u_k) = r(x_k,u_k) + Q_{j+1}(x_{k+1},h_j(x_{k+1}))$; update the control by $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k,u_k)$. Bradtke & Barto (1994) proved convergence for the LQR. The method does not require the dynamics model $x_{k+1} = f(x_k) + g(x_k)u_k$.

Implementation – DT Q-Function Policy Iteration. For the LQR, the Q-function update for the control $u_k = -L_j x_k$ (Bradtke and Barto) is $Q_{j+1}(x_k,u_k) = r(x_k,u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})$. Assume measurements of $u_k$, $x_k$, and $x_{k+1}$ are available to compute $u_{k+1}$. QFA (Q-function approximation): $Q(x,u) = W^T\phi(x,u)$; now u is an input to the NN (Werbos: an action-dependent NN). The regression equation is then $W_{j+1}^T[\phi(x_k,u_k) - \phi(x_{k+1},-L_j x_{k+1})] = r(x_k,-L_j x_k)$. Solve for the weights using RLS or backpropagation. Since $x_{k+1}$ is measured in the training phase, knowledge of $f(x)$ or $g(x)$ is not needed for the value-function update.
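
Below is a minimal sketch, again for the hypothetical two-state example, of this model-free Q-function policy iteration. The quadratic basis in $z = [x; u]$ has six independent terms, so the kernel matrix H, and hence the gain $L = H_{uu}^{-1}H_{ux}$, is recovered purely from measured data; A and B appear only to simulate the plant. A small probing noise is added to the applied control so that the regression is persistently exciting; the Q-function Bellman equation still holds exactly because the next action used on the right-hand side is the policy action.

```matlab
% Model-free Q-function policy iteration for the assumed 2-state LQR example.
A = [1.0 0.1; 0 0.9];  B = [0; 0.1];  Q = eye(2);  R = 1;
L = [1 1];                                       % stabilizing initial policy
phi = @(z) [z(1)^2; z(1)*z(2); z(1)*z(3); z(2)^2; z(2)*z(3); z(3)^2];

for j = 1:15
    Phi = [];  rho = [];
    for traj = 1:30
        x = 2*rand(2,1) - 1;
        for k = 1:10
            u  = -L*x + 0.05*randn;              % probing noise for excitation
            r  = x'*Q*x + u'*R*u;                % measured utility
            xn = A*x + B*u;                      % measured next state
            un = -L*xn;                          % policy action at the next state
            Phi = [Phi; (phi([x;u]) - phi([xn;un]))'];
            rho = [rho; r];
            x = xn;
        end
    end
    W = Phi \ rho;                               % W'*(phi(z_k)-phi(z_{k+1})) = r_k
    H = [ W(1)   W(2)/2 W(3)/2;                  % rebuild the symmetric kernel H
          W(2)/2 W(4)   W(5)/2;
          W(3)/2 W(5)/2 W(6)  ];
    Hux = H(3,1:2);  Huu = H(3,3);
    L = Huu \ Hux;                               % policy improvement: u = -L*x
end
```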

Reinforcement learning uses policy evaluation based on the Bellman equation. PI based on the value function: $V_{j+1}(x_k) = r(x_k,u_k) + V_{j+1}(x_{k+1})$. PI based on the Q function: $Q_{j+1}(x_k,u_k) = r(x_k,u_k) + Q_{j+1}(x_{k+1},-L_j x_{k+1})$. The key to the desired performance is to properly specify the utility $r(x_k,u_k)$. A computer program implements PI and takes the utility function as user input. Examples: minimum-energy regulator, $r(x_k,u_k) = x_k^T Q x_k + u_k^T R u_k$; tracking a desired trajectory, $r(x_k,u_k) = (y_k - y_k^{des})^T Q (y_k - y_k^{des}) + (u_k - u_{k-1})^T R (u_k - u_{k-1})$.
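
As a small illustration of "the utility is user input," the two example utilities can be written as function handles and passed to a PI routine such as the loops sketched above; Q, R, and the desired output are assumed to be defined by the user.

```matlab
% Example utility functions as user-supplied handles (names are illustrative).
r_regulator = @(x, u) x'*Q*x + u'*R*u;                      % minimum-energy regulator
r_tracking  = @(y, y_des, u, u_prev) ...
              (y - y_des)'*Q*(y - y_des) + (u - u_prev)'*R*(u - u_prev);  % tracking
```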

Model-free policy iteration (Q policy iteration; Bradtke, Ydstie, Barto): $Q_{j+1}(x_k,u_k) = r(x_k,u_k) + Q_{j+1}(x_{k+1},-L_j x_{k+1})$, i.e. $W_{j+1}^T[\phi(x_k,u_k) - \phi(x_{k+1},-L_j x_{k+1})] = r(x_k,-L_j x_k)$; a stable initial control is needed. Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k,u_k)$, i.e. $u_k = -H_{uu}^{-1}H_{ux}x_k = -L_{j+1}x_k$. Greedy Q-function update, ADP method 3: Q learning, Action-Dependent Heuristic Dynamic Programming (ADHDP), model-free HDP (Paul Werbos): $Q_{j+1}(x_k,u_k) = r(x_k,u_k) + Q_j(x_{k+1},h_j(x_{k+1}))$, i.e. $W_{j+1}^T\phi(x_k,u_k) = r(x_k,-L_j x_k) + W_j^T\phi(x_{k+1},-L_j x_{k+1})$, where the right-hand side is the target; a stable initial control is NOT needed. Update the weights by RLS or backpropagation.

Q learning solves the Control design Riccati Equation WITHOUT knowing the plant dynamics Model-free ADP Direct OPTIMAL ADAPTIVE CONTROL Works for Nonlinear Systems Proofs? Robustness? Comparison with adaptive control methods?

A Q-Learning Based Adaptive Optimal Controller Implementation for a Humanoid Robot Arm. Said Ghani Khan, Guido Herrmann, Tony Pipe, Chris Melhuish, Bristol Robotics Laboratory, University of Bristol and University of the West of England, Bristol, UK. Conference on Decision and Control (CDC) 2011, Orlando, 11 December 2011.

BRL BERT II ARM. The mechanical design and manufacturing of the BERT II torso, including the hand and arm, was carried out by Elumotion (www.elumotion.com), a Bristol-based company.

ADP Actor-Critic Scheme, Stingu et al. 2010.

Algorithm The cost of control is modeled via an NN

Algorithm. The function $z_k(x_k,u_k,d_k)$, e.g. $z_k = [u_{k1},\dots,u_{km},\; x_{k1}-d_{k1},\dots,x_{kn}-d_{kn},\; x_{k1},\dots,x_{kn}]^T$, is a vector that is linear in the control, the control error, and the system states. Note that the control signals $u_k$ then appear at most quadratically in the Q function; this again is a practical assumption. Selected elements of the Kronecker product of $z_k$ will be used as the basis functions of a polynomial neural network (greater detail later).
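
A minimal sketch of how such a quadratic basis can be built from the Kronecker product of $z_k$; the dimensions and vectors below are placeholders, not the paper's values.

```matlab
% Build the quadratic basis of the polynomial NN from the Kronecker product of z.
uk = zeros(4,1);  xk = zeros(7,1);  dk = zeros(7,1);   % placeholder dimensions
z   = [uk; xk - dk; xk];            % stacked vector: control, control error, states
K   = kron(z, z);                   % all products z_i * z_j
n   = numel(z);
idx = find(tril(ones(n)));          % keep one copy of each symmetric product
phi = K(idx);                       % selected elements: basis for Q(z) = W'*phi
```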

Introducing Constraints. The cost function is modified to include constraints, where qL is the joint limit; this yields a new cost function.

Introducing Constraints – Modelling Q. The NN nodes are obtained from the Kronecker product of $z_k$; additional neurons are added to deal with the extra nonlinearity due to the constraints.

Constrained Case-Experiment

BRL BERT II ARM Bristol Robotics Lab

Reinforcement Learning for Human-Robot Interfaces. Build an adaptive interface system that allows a single operator to manipulate a robot with multiple degrees of freedom and/or multiple robots. The interface system should be intuitive, easy to use, quick to learn, and applicable to different robot configurations. Structure: operator input, interface system, output, robot system, with performance evaluation feeding back to the interface.

Adaptive Interface Mapping for Intuitive Teleoperation of Multi-DOF Robots Jartuwat Rajruangrabin, Isura Ranatunga, Dan O. Popa corresponding author: popa@uta.edu Multiscale Robotics and Systems Lab Department of Electrical Engineering, University of Texas at Arlington, Texas, USA Support from National Science Foundation, Grants #CPS 1035913, and #CNS 0923494. Presentation at FIRA-TAROS 2012 Congress, Bristol, UK, 4pm Tuesday 21 August 2012

Experiment Platform: a 6-degree-of-freedom robotic manipulator mounted on a differential-drive mobile robot platform, with end-effector pose $q_e(t) = [x(t)\;\; y(t)\;\; z(t)\;\; \phi(t)\;\; \theta(t)\;\; \psi(t)]^T$.

Past Work / Challenges. Operator input u goes to the interface system, whose output y drives the robot system (state $x_p$), with performance evaluation of the robot-interface system. Challenges: a set of input/output pairs has to be specified, and the desired trajectory must be known. What if we cannot specify the desired output trajectory directly? Use reinforcement learning.

Interface Mapping: what is it exactly? $y = f(u)$. What can we do to get $f(u)$? The simplest way is to obtain a set of inputs and a set of outputs and calculate the relationship (curve fitting): a static nonlinear map $y = f(u) = \sum_{i=0}^{P} w_i\,\sigma\big(\sum_{j=0}^{M} w_{ij}\,u_j\big)$.

NN Training. The log-sigmoid function $\sigma(x) = 1/(1+e^{-x})$ is used as the neuron activation function. Training with the MATLAB NN Toolbox is very easy if input/output pairs are known.
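
A minimal sketch of such a fit, assuming the MATLAB Neural Network (Deep Learning) Toolbox is available; the input/output data here is a made-up placeholder standing in for recorded operator/robot pairs.

```matlab
% Fit a static interface map y = f(u) from known input/output pairs,
% using log-sigmoid hidden neurons as on the slide.
u = 2*pi*rand(1, 500);                    % placeholder operator inputs
y = [sin(u); cos(u)];                     % placeholder robot commands

net = feedforwardnet(10);                 % one hidden layer with 10 neurons
net.layers{1}.transferFcn = 'logsig';     % log-sigmoid activation 1/(1 + e^-x)
net = train(net, u, y);                   % supervised training on (u, y) pairs
y_hat = net(u);                           % evaluate the learned map
```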

Static Mapping Approach: found by training a neural network with known input/output pairs; $f(u)$ is a coordinate transformation.

Reinforcement Learning. What if we cannot specify the desired output trajectory directly for curve fitting of $y = f(u)$? "Learning by interacting with an environment": with RL we do not have to specify a desired trajectory; instead, a reward function is used.

RL Implementation: haptic device and robot arm on a mobile robot. Objectives: Experiment 1, implement RL with a reward function that allows the user to control the movement of the mobile platform; Experiment 2, reverse the direction mapping of the mobile platform based on Experiment 1. Step 1 (initialization): train the NN so that the weights are optimal according to the desired trajectory. Step 2 (online learning): implement the TD(λ) learning algorithm. Mapping model: a nonlinear dynamic model $y_{k+1} = f_w(y_k, u)$. Reward function: piecewise rewards $r_{V_L}$ and $r_{V_R}$ for the left and right wheel velocities, equal to a positive value when the mapped output $y$ moves toward the allowed region between $x_{min}$ and $x_{max}$, and zero otherwise.
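
For reference, a minimal generic sketch of the TD(λ) update used in the online-learning step is given below; it uses a linear value approximator with an eligibility trace, and all signals (features, reward, state transition) are placeholders rather than the authors' mapping model.

```matlab
% Generic TD(lambda) with eligibility traces for V(s) = w'*phi(s).
gamma = 0.95;  lambda = 0.8;  alpha = 0.05;                % discount, trace decay, step
nBasis = 10;
phi = @(s) exp(-(s - linspace(-1, 1, nBasis)').^2 / 0.1);  % placeholder RBF features

w = zeros(nBasis, 1);            % value-function weights
e = zeros(nBasis, 1);            % eligibility trace
s = 0;                           % placeholder initial state
for k = 1:1000
    s_next = max(-1, min(1, s + 0.05*randn));      % placeholder transition
    r      = -abs(s_next);                         % placeholder reward
    delta  = r + gamma*w'*phi(s_next) - w'*phi(s); % TD error
    e      = gamma*lambda*e + phi(s);              % accumulating trace
    w      = w + alpha*delta*e;                    % TD(lambda) weight update
    s      = s_next;
end
```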

Experiment Result – Contour Shaping Reverse and Scale Y - Mapping Update Through TD(λ) Learning Algorithm

Experiment Result – Online TD(λ) Learning Interface device is the Emotiv BCI device, with a set of 15 electrodes recording neuro-muscular activity in the face. Robot is Neptune with a camera attached to Katana arm. Task is visual servoing to center a light green rectangular object in the field of view.

Learning of Interface Mapping: Reward Function. The reward function is defined as $R = \sum_{j=1}^{5} r_j$, an "Emotion Energy", where $r_1$, $r_2$, $r_3$ are related to the affective state of the user (engagement/boredom, concentration, and excitement, classified by the EPOC headband) and $r_4$, $r_5$ are metrics related to visual-servoing goal attainment (e.g., centering of the green object in the image; see the paper).

Adaptive Interface Mapping for Intuitive Teleoperation of Multi-DOF Robots Jartuwat Rajruangrabin, Isura Ranatunga, Dan O. Popa corresponding author: popa@uta.edu Multiscale Robotics and Systems Lab Department of Electrical Engineering, University of Texas at Arlington, Texas, USA Support from National Science Foundation, Grants #CPS 1035913, and #CNS 0923494. Presentation at FIRA-TAROS 2012 Congress, Bristol, UK, 4pm Tuesday 21 August 2012 MORE LATER TODAY!

Meng Tze 550 BC He who exerts his mind to the utmost knows nature’s pattern. The way of learning is none other than finding the lost mind. Man’s task is to understand patterns in nature and society. INTERPLAY OF CONTROL THEORY AND GRAPH THEORY.

Patterns in Nature and Society

Cooperative Control for Multi‐Agent Systems

Reynolds, Computer Graphics 1987: Flocking. Reynolds' rules. Alignment: align headings, $\dot\theta_i = \sum_{j\in N_i} a_{ij}(\theta_j - \theta_i)$, a LOCAL VOTING PROTOCOL. Cohesion: steer towards the average position of the neighbors, i.e. towards their center of gravity. Separation: steer to maintain separation from neighbors.

Communication Graph. Strongly connected: for all nodes i and j there is a path from i to j. The communication graph can change as team players move. Tree: a leader or root node with followers.

Consensus Control for Swarm Motions. Heading-angle consensus: $\dot\theta_i = \sum_{j\in N_i^c} a_{ij}(\theta_j - \theta_i)$, with vehicle kinematics $\dot x_i = V\cos\theta_i$, $\dot y_i = V\sin\theta_i$. Simulation plots show the convergence of the headings over time and the vehicle trajectories in the (x, y) plane: the nodes converge to a consensus heading.
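
A minimal sketch of this protocol for a hypothetical six-node ring graph (not the simulation shown on the slide):

```matlab
% Heading consensus with unicycle kinematics on an assumed ring graph.
N = 6;  V = 1;  dt = 0.01;  T = 4000;
Adj = circshift(eye(N), 1) + circshift(eye(N), -1);      % ring adjacency (assumed)
theta = 2*pi*rand(N, 1);                                 % random initial headings
x = 10*rand(N, 1);  y = 10*rand(N, 1);

for k = 1:T
    theta = theta + dt*(Adj*theta - sum(Adj,2).*theta);  % local voting protocol
    x = x + dt*V*cos(theta);                             % unicycle kinematics
    y = y + dt*V*sin(theta);
end
% theta now holds an (approximately) common consensus heading.
```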

Controlled Consensus: Cooperative Tracker. Control (leader) node state $v$; node dynamics $\dot x_i = u_i$. Local voting protocol with the control node: $u_i = \sum_{j\in N_i} a_{ij}(x_j - x_i) + b_i(v - x_i)$, with $b_i \ne 0$ if the control node $v$ is in the neighborhood of node i. For a strongly connected graph with the control node in some neighborhoods, the closed loop is $\dot x = -(L + B)x + B\,\mathbf{1}\,v$, where $L$ is the graph Laplacian and $B = \mathrm{diag}\{b_i\}$. Theorem: let at least one $b_i \ne 0$; then $L + B$ is nonsingular with all eigenvalues in the open right half plane, and $-(L + B)$ is asymptotically stable. So the initial conditions of the nodes in the graph die away, and the consensus value depends only on $v$; in fact, $v$ is now the only spanning node.
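
A minimal sketch of the cooperative tracker on the same hypothetical ring graph, with the leader value pinned into node 1 only:

```matlab
% Cooperative tracker: x_dot = -(L + B)x + B*1*v, leader seen by node 1 only.
N = 6;  dt = 0.01;  T = 3000;  v = 2.5;               % assumed leader value
Adj = circshift(eye(N), 1) + circshift(eye(N), -1);   % ring adjacency (assumed)
Lap = diag(sum(Adj, 2)) - Adj;                        % graph Laplacian
Bp  = diag([1 zeros(1, N-1)]);                        % pinning gains b_i

x = rand(N, 1);                                       % arbitrary initial states
for k = 1:T
    x = x + dt*( -(Lap + Bp)*x + Bp*ones(N,1)*v );
end
% All node states converge to the leader value v, regardless of initial conditions.
```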

Second-Order Consensus – Kevin Moore and Wei Ren. Let each node have an associated state with $\ddot x_i = u_i$. Second-order local voting protocol (position/velocity feedback): $u_i = \gamma_0\sum_{j\in N_i} a_{ij}(x_j - x_i) + \gamma_1\sum_{j\in N_i} a_{ij}(\dot x_j - \dot x_i)$. Consensus is reached in both $x_i$ and $\dot x_i$ iff the graph has a spanning tree and the gains are chosen for stability. Having two integrators, it can follow a ramp consensus input.

Second-Order Controlled Consensus for Position Offset Control – Kevin Moore and Wei Ren. Leader node state $x_0$; node dynamics $\ddot x_i = u_i$. Second-order controlled protocol: $u_i = \gamma_0\sum_{j\in N_i} a_{ij}(x_j - x_i - \Delta_{ij}) + \gamma_1\sum_{j\in N_i} a_{ij}(\dot x_j - \dot x_i) + b_i[(x_0 - x_i) + (\dot x_0 - \dot x_i)]$, where node 0 is a leader node and $\Delta_{ij}$ is a desired separation vector. Good for formation offset position control: the leader $x_0$ with followers holding offsets $\Delta_{ij}$.

Herd and Panic Behavior During Emergency Building Egress. Helbing, Farkas, Vicsek, Nature 2000.

Modeling Crowd Behavior in Stress Situations: combine cooperative control and potential fields (Helbing, Farkas, Vicsek, Nature 2000). The model includes a consensus term, an interaction potential field, a wall potential field, a repulsive force, a radial compression term, and a tangential friction term.

Our revels now are ended. These our actors, As I foretold you, were all spirits, and Are melted into air, into thin air. The cloud-capped towers, the gorgeous palaces, The solemn temples, the great globe itself, Yea, all which it inherit, shall dissolve, And, like this insubstantial pageant faded, Leave not a rack behind. We are such stuff as dreams are made on, and our little life is rounded with a sleep. Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare

ARRI Automation & Robotics Research Institute, University of Texas at Arlington. An Approximate Dynamic Programming Based Controller for an Underactuated 6DoF Quadrotor. Emanuel Stingu and Frank Lewis. Supported by ARO grant W91NF-05-1-0314 and NSF grant ECCS-0801330.

3 control loops. The quadrotor has 17 states and only 4 control inputs, so it is very under-actuated. Three control loops with dynamic inversion are used to generate the 4 control signals. The translation and yaw controller takes the position and velocity errors from the reference trajectory and uses dynamic inversion from accelerations to attitude (yaw, direction of tilt, amount of tilt), with altitude control through the thrust force, producing the desired attitude. The attitude controller uses dynamic inversion from rotational accelerations to forces, with momentum theory applied to the propeller, producing the desired forces. The motor and propeller controller uses dynamic inversion from thrust forces to motor voltages, producing the motor commands.

Approximate Dynamic Programming. The actor is $u_k = h(x_k, z_k)$ with $h(x_k,z_k) = [\,h_1^T(x_k^1,z_k^1)\;\; h_2^T(x_k^2,z_k^2)\;\; h_3^T(x_k^3,z_k^3)\,]^T$, one component per control loop, and the critic is $V_h(x_k,z_k) = \sum_{j=k}^{\infty} r(x_j,z_j,u_j)$, where $x_k^i$ and $z_k^i$ are subsets of the state and tracking-error vectors. Architecture: the reference trajectory and the position and velocity errors feed the translation and yaw controller (local actor), which produces the desired attitude for the attitude controller (local actor), which produces the desired forces for the motor and propeller controller (local actor), which produces the motor commands; a global critic tunes all the local actors.

Approximate Dynamic Programming. Once the value of the Q function at $(x_{k+1}, z_{k+1}, u_{k+1})$ is known, a backup of it is made into the RBF neural network by adjusting the weights W and/or by adding more neurons and reconfiguring their other parameters. This is a separate process that only needs to know the $(x, z, u)$ coordinates and the new value to store. The Q function is represented as $Q(x_k,z_k,u_k) = W^T\phi(x_k,z_k,u_k)$, the backup uses the Bellman equation $W^T\phi(x_k,z_k,u_k) = r(x_k,z_k,u_k) + W^T\phi(x_{k+1},z_{k+1},h(x_{k+1},z_{k+1}))$, and the utility is $r(x_k,z_k,u_k) = z_k^T Q z_k + (w_{k+1}-w_k)^T R (w_{k+1}-w_k) + (u_k - u_{ek})^T S (u_k - u_{ek})$. The update of the Q value is not made completely towards the new value: $Q_{stored} = Q_{old} + \alpha(Q_{new} - Q_{old})$, $0 < \alpha \le 1$. This slows down the learning but adds robustness. The policy update step is done by simply solving $\partial Q(x_k,z_k,u)/\partial u = 0$ for $u = h(x_k,z_k)$ after the new Q value has been stored. The value of h is stored into the actor RBF neural network using the same mechanism as before.
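
A minimal self-contained sketch of the partial backup rule $Q_{stored} = Q_{old} + \alpha(Q_{new} - Q_{old})$ applied to a toy RBF approximator; the one-dimensional coordinate c stands in for the full $(x, z, u)$ coordinates, and the "new" Q values are placeholders.

```matlab
% Soft write-back of new Q values into an RBF network.
centers = linspace(-1, 1, 15)';                  % RBF centers
rbf     = @(c) exp(-(c - centers).^2 / 0.05);    % feature vector at coordinate c
W       = zeros(15, 1);                          % network weights
alpha   = 0.3;                                   % partial-backup factor, 0 < alpha <= 1

for k = 1:2000
    c      = 2*rand - 1;                         % coordinate where a new value arrives
    Q_new  = sin(pi*c);                          % placeholder "new" Q value at c
    phi_c  = rbf(c);
    Q_old  = W' * phi_c;                         % current network output at c
    target = Q_old + alpha*(Q_new - Q_old);      % move only partially toward Q_new
    W = W + (target - Q_old) * phi_c / (phi_c'*phi_c + 1e-6);  % write target back
end
```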

The Curse of Dimensionality. The actor acts as a nonlinear function approximator. Normally we have $u_{k+1} = h(x_k)$; in the quadrotor case, because the reference is not zero and the system is nonlinear, we need $u_{k+1} = h(x_k, z_k)$. For each of the position, attitude, and motor/propeller loops the state vector includes the local states and the external states that have a large coupling effect on the loop performance. It is easy to see that this way the input space can easily have $n = 14$ or more dimensions. An RBF neural network with the neurons placed on a grid with N elements in each dimension would require $N^n$ neurons; for $N = 5$ and $n = 14$, about $6\times 10^9$ neurons are required. Placing neurons on a grid is no better than a look-up table. The solutions to reducing the number of neurons are the following: preprocess the states to provide signals with physical significance as inputs; combine multiple states into a lower-dimension signal; map multiple equivalent regions of the state space into only one.

Flight results: using the standard controller vs. using the reinforcement learning controller.

Our revels now are ended. These our actors, As I foretold you, were all spirits, and Are melted into air, into thin air.

Adaptive Critics. Value update: $V_{j+1}(x_k) = r(x_k,h_j(x_k)) + V_{j+1}(x_{k+1})$, using RLS until convergence. Control policy update: $u_{j+1}(x_k) = -\tfrac{1}{2}R^{-1}g(x_k)^T\,\partial V_{j+1}(x_{k+1})/\partial x_{k+1}$. Leads to an ONLINE FORWARD-IN-TIME implementation of optimal control: optimal adaptive control, a two-timescale controller.
