Extreme Learning Machine: Theory and Applications


Neurocomputing 70 (2006) 489–501
www.elsevier.com/locate/neucom
doi:10.1016/j.neucom.2005.12.126

Extreme learning machine: Theory and applications

Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew
School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore

Received 27 January 2005; received in revised form 1 December 2005; accepted 3 December 2005
Communicated by J. Tin-Yau Kwok
Available online 16 May 2006

Corresponding author. Tel.: +65 6790 4489; fax: +65 6793 3318. E-mail: egbhuang@ntu.edu.sg (G.-B. Huang). URL: http://www.ntu.edu.sg/home/egbhuang/.

Abstract

It is clear that the learning speed of feedforward neural networks is in general far slower than required, and this has been a major bottleneck in their applications for past decades. Two key reasons behind this may be: (1) slow gradient-based learning algorithms are extensively used to train neural networks, and (2) all the parameters of the networks are tuned iteratively by such learning algorithms. Unlike these conventional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden layer feedforward neural networks (SLFNs), which randomly chooses hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed. Experimental results on a few artificial and real benchmark function approximation and classification problems, including very large complex applications, show that the new algorithm can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feedforward neural networks.¹

© 2006 Elsevier B.V. All rights reserved.

Keywords: Feedforward neural networks; Back-propagation algorithm; Extreme learning machine; Support vector machine; Real-time learning; Random node

¹ For the preliminary idea of the ELM algorithm, refer to "Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks", Proceedings of the International Joint Conference on Neural Networks (IJCNN2004), Budapest, Hungary, 25–29 July 2004.

1. Introduction

Feedforward neural networks have been extensively used in many fields due to their ability (1) to approximate complex nonlinear mappings directly from the input samples, and (2) to provide models for a large class of natural and artificial phenomena that are difficult to handle with classical parametric techniques. On the other hand, fast learning algorithms for neural networks are still lacking. The traditional learning algorithms are usually far slower than required, and it is not surprising that it may take several hours, several days, or even longer to train neural networks using traditional methods.

From a mathematical point of view, research on the approximation capabilities of feedforward neural networks has focused on two aspects: universal approximation on compact input sets and approximation in a finite set of training samples. Many researchers have explored the universal approximation capabilities of standard multilayer feedforward neural networks.
Hornik [7] proved that if the activation function is continuous, bounded and nonconstant, then continuous mappings can be approximated in measure by neural networks over compact input sets. Leshno [17] improved the results of Hornik [7] and proved that feedforward networks with a nonpolynomial activation function can approximate (in measure) continuous functions. In real applications, however, neural networks are trained on a finite training set. For function approximation on a finite training set, Huang and Babri [11] showed that a single-hidden layer feedforward neural network (SLFN) with at most N hidden nodes and with almost any nonlinear activation function can exactly learn N distinct observations. It should be noted that the input weights (linking the input layer to the first hidden layer) and hidden layer biases need to be adjusted in all these previous theoretical works, as well as in almost all practical learning algorithms for feedforward neural networks.

Traditionally, all the parameters of feedforward networks need to be tuned, and thus there exists a dependency between the parameters of different layers (weights and biases). For past decades, gradient descent-based methods have mainly been used in various learning algorithms for feedforward neural networks. However, gradient descent-based learning methods are generally very slow due to improper learning steps, or may easily converge to local minima, and many iterative learning steps may be required in order to obtain better learning performance.

It has been shown [23,10] that SLFNs (with N hidden nodes) with randomly chosen input weights and hidden layer biases (such hidden nodes can thus be called random hidden nodes) can exactly learn N distinct observations. Unlike the popular thinking and most practical implementations in which all the parameters of the feedforward networks are tuned, one may not necessarily adjust the input weights and first hidden layer biases in applications. In fact, simulation results on artificial and real large applications in our work [16] have shown that this method not only makes learning extremely fast but also produces good generalization performance.

In this paper, we first rigorously prove that the input weights and hidden layer biases of SLFNs can be randomly assigned if the activation functions in the hidden layer are infinitely differentiable. After the input weights and the hidden layer biases are chosen randomly, SLFNs can simply be considered as a linear system, and the output weights (linking the hidden layer to the output layer) can be analytically determined through a simple generalized inverse operation of the hidden layer output matrices.

Based on this concept, this paper proposes a simple learning algorithm for SLFNs called extreme learning machine (ELM), whose learning speed can be thousands of times faster than traditional feedforward network learning algorithms such as the back-propagation (BP) algorithm, while obtaining better generalization performance. Different from traditional learning algorithms, the proposed algorithm tends to reach not only the smallest training error but also the smallest norm of weights. Bartlett's [1] theory on the generalization performance of feedforward neural networks states that, for networks reaching smaller training error, the smaller the norm of the weights, the better the generalization performance the networks tend to have. Therefore, the proposed learning algorithm tends to have good generalization performance for feedforward neural networks.

As the newly proposed learning algorithm can be easily implemented, tends to reach the smallest training error, obtains the smallest norm of weights and hence good generalization performance, and runs extremely fast, it is called the extreme learning machine in the context of this paper in order to differentiate it from the other popular SLFN learning algorithms.

This paper is organized as follows. Section 2 rigorously proves that the input weights and hidden layer biases of SLFNs can be randomly assigned if the activation functions in the hidden layer are infinitely differentiable. Section 3 proposes the new ELM learning algorithm for single-hidden layer feedforward neural networks (SLFNs). Performance evaluation is presented in Section 4.
Discussions and conclusions are given in Section 5. The Moore–Penrose generalized inverse and the minimum norm least-squares solution of a general linear system, which play an important role in developing our new ELM learning algorithm, are briefed in the Appendix.

2. Single hidden layer feedforward networks (SLFNs) with random hidden nodes

For $N$ arbitrary distinct samples $(x_i, t_i)$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in \mathbb{R}^n$ and $t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in \mathbb{R}^m$, standard SLFNs with $\tilde{N}$ hidden nodes and activation function $g(x)$ are mathematically modeled as

$$\sum_{i=1}^{\tilde{N}} \beta_i g_i(x_j) = \sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, \ldots, N, \qquad (1)$$

where $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the weight vector connecting the $i$th hidden node and the input nodes, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node and the output nodes, and $b_i$ is the threshold of the $i$th hidden node. $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$. The output nodes are chosen linear in this paper.

That standard SLFNs with $\tilde{N}$ hidden nodes and activation function $g(x)$ can approximate these $N$ samples with zero error means that $\sum_{j=1}^{N} \|o_j - t_j\| = 0$, i.e., there exist $\beta_i$, $w_i$ and $b_i$ such that

$$\sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N. \qquad (2)$$

The above $N$ equations can be written compactly as

$$H\beta = T, \qquad (3)$$

where

$$H(w_1, \ldots, w_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}, x_1, \ldots, x_N) = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \cdots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}}, \qquad (4)$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}. \qquad (5)$$

As named in Huang et al. [11,10], $H$ is called the hidden layer output matrix of the neural network; the $i$th column of $H$ is the $i$th hidden node's output with respect to inputs $x_1, x_2, \ldots, x_N$.
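To make the construction of $H$ in Eq. (4) concrete, the following NumPy sketch builds the $N \times \tilde{N}$ hidden layer output matrix for a sigmoid activation. The function name and the default sigmoid are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def hidden_layer_output(X, W, b, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Hidden layer output matrix H of Eq. (4).

    X : (N, n) array of input samples x_1, ..., x_N (one sample per row)
    W : (N_tilde, n) array of input weights w_i (one hidden node per row)
    b : (N_tilde,) array of hidden node biases b_i
    g : activation function, applied elementwise (sigmoid by default)

    Entry (j, i) of the returned matrix is g(w_i . x_j + b_i).
    """
    return g(X @ W.T + b)  # shape (N, N_tilde)
```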

If the activation function $g$ is infinitely differentiable, we can prove that the required number of hidden nodes satisfies $\tilde{N} \le N$.² Strictly speaking, we have

Theorem 2.1. Given a standard SLFN with $N$ hidden nodes and activation function $g: \mathbb{R} \to \mathbb{R}$ which is infinitely differentiable in any interval, for $N$ arbitrary distinct samples $(x_i, t_i)$, where $x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$, for any $w_i$ and $b_i$ randomly chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, then with probability one, the hidden layer output matrix $H$ of the SLFN is invertible and $\|H\beta - T\| = 0$.

Proof. Let us consider a vector $c(b_i) = [g_i(x_1), \ldots, g_i(x_N)]^T = [g(w_i \cdot x_1 + b_i), \ldots, g(w_i \cdot x_N + b_i)]^T$, the $i$th column of $H$, in Euclidean space $\mathbb{R}^N$, where $b_i \in (a, b)$ and $(a, b)$ is any interval of $\mathbb{R}$.

Following the same proof method of Tamura and Tateishi ([23], p. 252) and our previous work ([10], Theorem 2.1), it can be easily proved by contradiction that vector $c$ does not belong to any subspace whose dimension is less than $N$.

Since $w_i$ are randomly generated based on a continuous probability distribution, we can assume that $w_i \cdot x_k \ne w_i \cdot x_{k'}$ for all $k \ne k'$. Let us suppose that $c$ belongs to a subspace of dimension $N - 1$. Then there exists a vector $\alpha$ which is orthogonal to this subspace:

$$(\alpha, c(b_i) - c(a)) = \alpha_1 g(b_i + d_1) + \alpha_2 g(b_i + d_2) + \cdots + \alpha_N g(b_i + d_N) - z = 0, \qquad (6)$$

where $d_k = w_i \cdot x_k$, $k = 1, \ldots, N$, $z = \alpha \cdot c(a)$, and this holds for all $b_i \in (a, b)$. Assuming $\alpha_N \ne 0$, Eq. (6) can be further written as

$$g(b_i + d_N) = -\sum_{p=1}^{N-1} \gamma_p g(b_i + d_p) + z / \alpha_N, \qquad (7)$$

where $\gamma_p = \alpha_p / \alpha_N$, $p = 1, \ldots, N - 1$. Since $g(x)$ is infinitely differentiable in any interval, we have

$$g^{(l)}(b_i + d_N) = -\sum_{p=1}^{N-1} \gamma_p g^{(l)}(b_i + d_p), \quad l = 1, 2, \ldots, N, N+1, \ldots, \qquad (8)$$

where $g^{(l)}$ is the $l$th derivative of the function $g$ with respect to $b_i$. However, there are only $N - 1$ free coefficients $\gamma_1, \ldots, \gamma_{N-1}$ for the derived more than $N - 1$ linear equations; this is contradictory. Thus, vector $c$ does not belong to any subspace whose dimension is less than $N$.

Hence, from any interval $(a, b)$ it is possible to randomly choose $N$ bias values $b_1, \ldots, b_N$ for the $N$ hidden nodes such that the corresponding vectors $c(b_1), c(b_2), \ldots, c(b_N)$ span $\mathbb{R}^N$. This means that for any weight vectors $w_i$ and bias values $b_i$ chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, then with probability one, the column vectors of $H$ can be made full-rank. □

² In fact, the theorem and its proof are also linearly valid for the case $g_i(x) = g(\|x - w_i\| / b_i)$, $w_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}^+$.

Such activation functions include the sigmoidal functions as well as the radial basis, sine, cosine, exponential, and many other nonregular functions as shown in Huang and Babri [11].

Furthermore, we have

Theorem 2.2. Given any small positive value $\epsilon > 0$ and activation function $g: \mathbb{R} \to \mathbb{R}$ which is infinitely differentiable in any interval, there exists $\tilde{N} \le N$ such that for $N$ arbitrary distinct samples $(x_i, t_i)$, where $x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$, for any $w_i$ and $b_i$ randomly chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, then with probability one, $\|H_{N \times \tilde{N}} \beta_{\tilde{N} \times m} - T_{N \times m}\| < \epsilon$.

Proof. The validity of the theorem is obvious; otherwise, one could simply choose $\tilde{N} = N$, which makes $\|H_{N \times \tilde{N}} \beta_{\tilde{N} \times m} - T_{N \times m}\| < \epsilon$ according to Theorem 2.1. □
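As a quick numerical illustration of Theorem 2.1 (a sketch under our own choices of a sigmoid activation and weights drawn uniformly from [-1, 1]; none of these choices come from the paper), one can draw $\tilde{N} = N$ random hidden nodes for $N$ distinct samples and check that $H$ is full rank and that solving $H\beta = T$ fits the targets with essentially zero error:

```python
import numpy as np

rng = np.random.default_rng(0)

N, n, m = 20, 3, 2                        # N distinct samples, n inputs, m outputs
X = rng.standard_normal((N, n))           # distinct input samples x_j
T = rng.standard_normal((N, m))           # arbitrary targets t_j

W = rng.uniform(-1.0, 1.0, size=(N, n))   # random input weights w_i (here N_tilde = N)
b = rng.uniform(-1.0, 1.0, size=N)        # random hidden biases b_i
H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # sigmoid hidden layer output matrix

print(np.linalg.matrix_rank(H))           # N with probability one (Theorem 2.1)
beta = np.linalg.solve(H, T)              # H is square and invertible, so beta = H^-1 T
print(np.linalg.norm(H @ beta - T))       # ~0: zero training error on the N samples
```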
3. Proposed extreme learning machine (ELM)

Based on Theorems 2.1 and 2.2, we can propose in this section an extremely simple and efficient method to train SLFNs.

3.1. Conventional gradient-based solution of SLFNs

Traditionally, in order to train an SLFN, one may wish to find specific $\hat{w}_i$, $\hat{b}_i$, $\hat{\beta}$ ($i = 1, \ldots, \tilde{N}$) such that

$$\|H(\hat{w}_1, \ldots, \hat{w}_{\tilde{N}}, \hat{b}_1, \ldots, \hat{b}_{\tilde{N}})\hat{\beta} - T\| = \min_{w_i, b_i, \beta} \|H(w_1, \ldots, w_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}})\beta - T\|, \qquad (9)$$

which is equivalent to minimizing the cost function

$$E = \sum_{j=1}^{N} \left( \sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) - t_j \right)^2. \qquad (10)$$

When $H$ is unknown, gradient-based learning algorithms are generally used to search for the minimum of $\|H\beta - T\|$. In the minimization procedure, the vector $\mathbf{W}$, which collects all the weight ($w_i$, $\beta_i$) and bias ($b_i$) parameters, is iteratively adjusted as follows:

$$\mathbf{W}_k = \mathbf{W}_{k-1} - \eta \frac{\partial E(\mathbf{W})}{\partial \mathbf{W}}. \qquad (11)$$

Here $\eta$ is a learning rate. The most popular learning algorithm for feedforward neural networks is the back-propagation (BP) algorithm, in which gradients are computed efficiently by propagation from the output to the input.
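For contrast with the analytic solution derived in Section 3.2, the following is a generic sketch of the iterative update in Eq. (11). It is not the particular BP implementation used in the paper's experiments; grad_E stands for whatever routine computes the gradient of E with respect to W (by back-propagation in the BP case).

```python
import numpy as np

def gradient_descent(grad_E, W0, eta=0.01, n_iters=10_000):
    """Iteratively apply Eq. (11): W_k = W_{k-1} - eta * dE(W)/dW.

    grad_E  : callable returning the gradient of the cost E at W
    W0      : initial parameter vector (all weights w_i, beta_i and biases b_i stacked)
    eta     : learning rate (too small -> slow convergence, too large -> divergence)
    n_iters : number of iterative learning steps
    """
    W = np.asarray(W0, dtype=float).copy()
    for _ in range(n_iters):
        W = W - eta * grad_E(W)
    return W
```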

There are several issues with BP learning algorithms:

(1) When the learning rate $\eta$ is too small, the learning algorithm converges very slowly. However, when $\eta$ is too large, the algorithm becomes unstable and diverges.

(2) Another peculiarity of the error surface that impacts the performance of the BP learning algorithm is the presence of local minima [6]. It is undesirable that the learning algorithm stops at a local minimum if it is located far above a global minimum.

(3) The neural network may be over-trained by BP algorithms and obtain worse generalization performance. Thus, validation and suitable stopping methods are required in the cost function minimization procedure.

(4) Gradient-based learning is very time-consuming in most applications.

The aim of this paper is to resolve the above issues related to gradient-based algorithms and to propose an efficient learning algorithm for feedforward neural networks.

3.2. Proposed minimum norm least-squares (LS) solution of SLFNs

As rigorously proved in Theorems 2.1 and 2.2, unlike the traditional function approximation theories which require the input weights and hidden layer biases to be adjusted, the input weights and hidden layer biases can be randomly assigned provided only that the activation function is infinitely differentiable. It is very interesting and surprising that, unlike the common understanding that all the parameters of SLFNs need to be adjusted, the input weights $w_i$ and the hidden layer biases $b_i$ are in fact not necessarily tuned, and the hidden layer output matrix $H$ can actually remain unchanged once random values have been assigned to these parameters at the beginning of learning. For fixed input weights $w_i$ and hidden layer biases $b_i$, seen from Eq. (9), training an SLFN is simply equivalent to finding a least-squares solution $\hat{\beta}$ of the linear system $H\beta = T$:

$$\|H(w_1, \ldots, w_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}})\hat{\beta} - T\| = \min_{\beta} \|H(w_1, \ldots, w_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}})\beta - T\|. \qquad (12)$$

If the number $\tilde{N}$ of hidden nodes is equal to the number $N$ of distinct training samples, $\tilde{N} = N$, matrix $H$ is square and invertible when the input weight vectors $w_i$ and the hidden biases $b_i$ are randomly chosen, and SLFNs can approximate these training samples with zero error.

However, in most cases the number of hidden nodes is much less than the number of distinct training samples, $\tilde{N} \ll N$, so $H$ is a nonsquare matrix and there may not exist $w_i$, $b_i$, $\beta_i$ ($i = 1, \ldots, \tilde{N}$) such that $H\beta = T$. According to Theorem 5.1 in the Appendix, the smallest norm least-squares solution of the above linear system is

$$\hat{\beta} = H^{\dagger} T, \qquad (13)$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse of matrix $H$ [22,19].

Remark 1. As discussed in the Appendix, we have the following important properties:

(1) Minimum training error. The special solution $\hat{\beta} = H^{\dagger}T$ is one of the least-squares solutions of the general linear system $H\beta = T$, meaning that the smallest training error can be reached by this special solution:

$$\|H\hat{\beta} - T\| = \|HH^{\dagger}T - T\| = \min_{\beta} \|H\beta - T\|. \qquad (14)$$

Although almost all learning algorithms wish to reach the minimum training error, most of them cannot reach it because of local minima, or because an infinite number of training iterations is not allowed in applications.

(2) Smallest norm of weights. Furthermore, the special solution $\hat{\beta} = H^{\dagger}T$ has the smallest norm among all the least-squares solutions of $H\beta = T$:

$$\|\hat{\beta}\| = \|H^{\dagger}T\| \le \|\beta\|, \quad \forall \beta \in \{\beta : \|H\beta - T\| \le \|Hz - T\|, \ \forall z \in \mathbb{R}^{\tilde{N} \times N}\}. \qquad (15)$$

(3) The minimum norm least-squares solution of $H\beta = T$ is unique, and it is $\hat{\beta} = H^{\dagger}T$.

3.3. Proposed learning algorithm for SLFNs

Thus, a simple learning method for SLFNs called the extreme learning machine (ELM) can be summarized as follows.

Algorithm ELM: Given a training set $\aleph = \{(x_i, t_i) \mid x_i \in \mathbb{R}^n, t_i \in \mathbb{R}^m, i = 1, \ldots, N\}$, activation function $g(x)$, and hidden node number $\tilde{N}$:

Step 1: Randomly assign input weights $w_i$ and biases $b_i$, $i = 1, \ldots, \tilde{N}$.
Step 2: Calculate the hidden layer output matrix $H$.
Step 3: Calculate the output weight $\beta$:

$$\beta = H^{\dagger} T, \qquad (16)$$

where $T = [t_1, \ldots, t_N]^T$.
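The three steps above reduce to a few lines of linear algebra. The following is a minimal sketch, assuming a sigmoid activation and weights drawn uniformly from [-1, 1] (the paper leaves the distribution and interval open); np.linalg.pinv computes the Moore–Penrose generalized inverse $H^{\dagger}$, so beta is the minimum norm least-squares solution discussed in Remark 1. Function names are ours for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, seed=0):
    """ELM training. X: (N, n) inputs, T: (N, m) targets."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n))  # Step 1: random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)       #         and random hidden biases b_i
    H = sigmoid(X @ W.T + b)                        # Step 2: hidden layer output matrix H
    beta = np.linalg.pinv(H) @ T                    # Step 3: beta = H^dagger T, Eq. (16)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """SLFN output for new inputs, using the fixed random (W, b) and the learned beta."""
    return sigmoid(X @ W.T + b) @ beta
```

For regression the network output $H\beta$ is used directly; for classification the predicted class is typically taken as the output node with the largest value.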

ARTICLE IN PRESSG.-B. Huang et al. / Neurocomputing 70 (2006) 489–501are allowed to be tuned (as done in most traditionalimplementations) SLFNs with at most N hidden nodes andwith almost any nonlinear activation function can exactlylearn N distinct observations and these activation functionsinclude differentiable and nondifferentiable functions,continuous and noncontinuous functions, etc.This paper rigorously proves that for any infinitelydifferentiable activation function SLFNs with N hiddennodes can learn N distinct samples exactly and SLFNs mayrequire less than N hidden nodes if learning error isallowed. Different from previous works [11,23,10,9,4] andthe ELM algorithm introduced in this paper, Ferrari andStengel [4] shows that SLFNs with N sigmoidal hiddennodes and with input weights randomly generated but hiddenbiases appropriately tuned can exactly learn N distinctobservations. Hidden nodes are not randomly generated inthe work done by Ferrari and Stengel [4], although theinput weights are randomly generated, the hidden biasesneed to be determined based on the input weights and inputtraining data (cf. Eq. (18) of [4]).Remark 4. Modular networks have also been suggested inseveral works [23,10,16,9,4], which partition the trainingsamples into L subsets each learned by an SLFNseparately. Suppose that t
