A Sufficient Condition For Convergences Of Adam And RMSProp


Fangyu Zou†, Li Shen‡, Zequn Jie‡, Weizhong Zhang‡, Wei Liu‡
†Stony Brook University    ‡Tencent AI Lab
fangyu.zou@stonybrook.edu, mathshenli@gmail.com, zequn.nus@gmail.com, zhangweizhongzju@gmail.com, wl2223@columbia.edu
(The first two authors contribute equally. This work was partially done when Fangyu Zou was a research intern at Tencent AI Lab, China.)

Abstract

Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, yet they have been shown to be divergent even in the convex setting via a few simple counterexamples. Many attempts, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have been tried to promote Adam/RMSProp-type algorithms to converge. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and the combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization. Moreover, we show that the convergences of several variants of Adam, such as AdamNC, AdaEMA, etc., are directly implied by the proposed sufficient condition in the non-convex setting. In addition, we illustrate that Adam is essentially a specifically weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. This observation, coupled with the sufficient condition, gives much deeper interpretations of their divergences. At last, we validate the sufficient condition by applying Adam and RMSProp to tackle a certain counterexample and to train deep neural networks. The numerical results are exactly in accord with our theoretical analysis.

1. Introduction

Large-scale non-convex stochastic optimization [5], covering a slew of applications in statistics and machine learning [13, 5] such as learning a latent variable from massive data whose probability density is unknown, takes the following generic formulation:

    min_{x ∈ R^d} f(x) := E_{ξ∼P}[ f̃(x, ξ) ],    (1)

where f(x) is a non-convex function and ξ is a random variable with an unknown distribution P.

Due to the uncertainty of the distribution P, the batch gradient descent algorithm [4], which requires the full gradient ∇f(x), is impractical for solving problem (1). Alternatively, a compromise to handle this difficulty is to use an unbiased stochastic estimate of ∇f(x), denoted g(x, ξ), which leads to the stochastic gradient descent (SGD) algorithm [21]. Its coordinate-wise version is defined as follows:

    x_{t+1,k} = x_{t,k} − η_{t,k} g_{t,k}(x_t, ξ_t),    (2)

for k = 1, 2, ..., d, where η_{t,k} > 0 is the learning rate of the k-th component of the stochastic gradient g(x_t, ξ_t) at the t-th iteration. A sufficient condition [21] to ensure the global convergence of vanilla SGD (2) is to require η_t to satisfy

    Σ_{t≥1} ‖η_t‖ = ∞  and  Σ_{t≥1} ‖η_t‖² < ∞.    (3)

Although the vanilla SGD algorithm with a learning rate η_t satisfying condition (3) does converge, its empirical performance can still stagnate, since it is difficult to tune an effective learning rate η_t via condition (3).

To further improve the empirical performance of SGD, a large variety of adaptive SGD algorithms, including AdaGrad [9], RMSProp [11], Adam [14], Nadam [8], etc., have been proposed to automatically tune the learning rate η_t by using second-order moments of historical stochastic gradients.
Let v_{t,k} and m_{t,k} be the linear combinations of the historical second-order moments (g²_{1,k}, g²_{2,k}, ..., g²_{t,k}) and the stochastic gradient estimates (g_{1,k}, g_{2,k}, ..., g_{t,k}), respectively. Then, the generic iteration scheme of these adaptive SGD algorithms [20, 6] is summarized as

    x_{t+1,k} = x_{t,k} − η_{t,k} m_{t,k},  with  η_{t,k} = α_t / √v_{t,k},    (4)

for k = 1, 2, ..., d, where α_t > 0 is called the base learning rate and it is independent of the stochastic gradient estimates (g_{1,k}, g_{2,k}, ..., g_{t,k}) for all t ≥ 1. Although RMSProp, Adam, and Nadam work well for solving large-scale convex and non-convex optimization problems such as training deep neural networks, they have been pointed out to be divergent in some scenarios via convex counterexamples [20].
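To make the generic scheme (4) concrete, here is a minimal NumPy sketch of the coordinate-wise update. It is an illustration of Eqs. (2)-(4) rather than code from the paper; the exponential-moving-average choices for m_t and v_t, the function names (`generic_adaptive_sgd`, `grad_fn`, `base_lr`), and the hyperparameter values are our own placeholders.

```python
import numpy as np

def generic_adaptive_sgd(grad_fn, x0, base_lr, T, beta=0.9, theta=0.999, eps=1e-8):
    """Sketch of the generic adaptive scheme (4), applied coordinate-wise.

    grad_fn(x, t) returns a stochastic gradient g_t, and base_lr(t) is the
    base learning rate alpha_t. Here m_t and v_t are exponential moving
    averages of past gradients and past squared gradients (one possible
    choice of the 'linear combinations' in the text). Setting beta = 0 and
    replacing v by all-ones recovers vanilla SGD (2) with eta_t = alpha_t.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    v = np.full_like(x, eps)
    for t in range(1, T + 1):
        g = grad_fn(x, t)
        m = beta * m + (1 - beta) * g          # combination of g_1, ..., g_t
        v = theta * v + (1 - theta) * g ** 2   # combination of g_1^2, ..., g_t^2
        x = x - base_lr(t) * m / np.sqrt(v)    # update (4): eta_t = alpha_t / sqrt(v_t)
    return x
```

For instance, `generic_adaptive_sgd(lambda x, t: 2 * x + 0.01 * np.random.randn(*x.shape), np.ones(5), lambda t: 0.1 / np.sqrt(t), 2000)` drives a noisy quadratic toward its minimizer at the origin.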

The divergence findings of [20] thoroughly dispel any hope of using these algorithms directly without further assumptions or corrections. Recently, developing sufficient conditions that guarantee the global convergence of Adam and RMSProp-type algorithms has attracted much attention from both the machine learning and optimization communities. The existing successful attempts can be divided into four categories:

(C1) Decreasing a learning rate. Reddi et al. [20] have argued that the core cause of the divergences of Adam and RMSProp is largely controlled by the difference between two adjacent inverse learning rates, i.e.,

    Γ_t := 1/η_t − 1/η_{t−1} = √v_t/α_t − √v_{t−1}/α_{t−1}.    (5)

Once the positive definiteness of Γ_t is violated, Adam and RMSProp may suffer from divergence [20] (see the sketch following (C4) below). Based on this observation, two variants of Adam called AMSGrad and AdamNC have been proposed with convergence guarantees in both the convex [20] and non-convex [6] stochastic settings by requiring Γ_t ⪰ 0. In addition, Padam [24], extended from AMSGrad, has been proposed to contract the generalization gap in training deep neural networks; its convergence has also been ensured by requiring Γ_t ⪰ 0. In the strongly convex stochastic setting, by using the long-term memory technique developed in [20], Huang et al. [12] have proposed NosAdam, which attaches more weight to historical second-order moments to ensure convergence. Prior to that, the convergence rate of RMSProp [19] had already been established in the convex stochastic setting by employing parameters similar to those of AdamNC [20].

(C2) Adopting a big batch size. Basu et al. [2], for the first time, showed that deterministic Adam and RMSProp with the original iteration schemes are actually convergent when the full-batch gradient is used. On the other hand, both Adam and RMSProp can be reshaped as specific signSGD-type algorithms [1, 3], whose O(1/√T) convergence rates have been established in the non-convex stochastic setting by setting the batch size as large as the maximum number of iterations [3]. Recently, Zaheer et al. [23] have established an O(1/√T) convergence rate of original Adam directly in the non-convex stochastic setting by requiring the batch size to be of the same order as the maximum number of iterations. We comment that this type of requirement is impractical when Adam and RMSProp are applied to tackle large-scale problems like (1), since these approaches require a huge amount of computation to estimate big-batch stochastic gradients in each iteration.

(C3) Incorporating a temporal decorrelation. By exploring the structure of the convex counterexample in [20], Zhou et al. [25] have pointed out that the divergence of RMSProp is fundamentally caused by the unbalanced learning rate rather than the absence of Γ_t ⪰ 0. Based on this viewpoint, Zhou et al. [25] have proposed AdaShift, which incorporates a temporal decorrelation technique to eliminate the inappropriate correlation between v_{t,k} and the current second-order moment g²_{t,k}; the adaptive learning rate η_{t,k} is required to be independent of g²_{t,k}. However, the convergence of AdaShift in [25] was merely restricted to RMSProp for solving the convex counterexample in [20].

(C4) Seeking an analogous surrogate. Due to the divergences of Adam and RMSProp [20], Zou et al. [26] recently proposed a class of new surrogates called AdaUSM to approximate Adam and RMSProp by integrating weighted AdaGrad with unified heavy-ball and Nesterov accelerated gradient momentums. Its O(log(T)/√T) convergence rate has also been established in the non-convex stochastic setting by requiring a non-decreasing weighted sequence.
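To make (C1) concrete, the following sketch (ours, not from the paper) tracks Γ_t from Eq. (5) on a synthetic one-dimensional gradient stream, once for the plain EMA second moment used by Adam/RMSProp and once with the max-based second-moment correction applied by AMSGrad (not spelled out in this paper) so that Γ_t ⪰ 0. The rare-spike pattern loosely mimics the flavor of the counterexample in [20].

```python
import numpy as np

rng = np.random.default_rng(0)
theta, alpha = 0.9, 0.1            # constant EMA factor and base learning rate
v_ema, v_hat = 1e-8, 1e-8
gamma_ema, gamma_ams = [], []
inv_lr_ema_prev = inv_lr_ams_prev = None

for t in range(1, 201):
    # synthetic gradient stream: mostly small noise, with rare large spikes
    g = 10.0 if t % 50 == 0 else rng.normal(0.0, 0.1)
    v_ema = theta * v_ema + (1 - theta) * g ** 2   # Adam/RMSProp second moment
    v_hat = max(v_hat, v_ema)                      # AMSGrad: keep v_hat non-decreasing
    inv_lr_ema = np.sqrt(v_ema) / alpha            # 1/eta_t with the plain EMA
    inv_lr_ams = np.sqrt(v_hat) / alpha            # 1/eta_t with the max correction
    if inv_lr_ema_prev is not None:
        gamma_ema.append(inv_lr_ema - inv_lr_ema_prev)   # Gamma_t of Eq. (5)
        gamma_ams.append(inv_lr_ams - inv_lr_ams_prev)
    inv_lr_ema_prev, inv_lr_ams_prev = inv_lr_ema, inv_lr_ams

print("min Gamma_t (EMA)    :", min(gamma_ema))   # typically negative
print("min Gamma_t (AMSGrad):", min(gamma_ams))   # always >= 0 by construction
```

On this stream the EMA version produces negative Γ_t between spikes (the adaptive learning rate grows back), whereas the max-corrected version keeps Γ_t ≥ 0, which is exactly the property the (C1)-type fixes enforce.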
Besides, many other adaptive stochastic algorithms without momentum, such as AdaGrad [22, 18] and stagewise AdaGrad [7], have been guaranteed to be convergent and to work well in the non-convex stochastic setting.

In contrast with the above four types of modifications and restrictions, we introduce an alternative easy-to-check sufficient condition (abbreviated as (SC)) to guarantee the global convergence of original Adam and RMSProp. The proposed (SC) merely depends on the parameters used in estimating v_{t,k} and the base learning rate α_t. (SC) neither requires the positive definiteness of Γ_t like (C1) nor needs a batch size of the same order as the maximum number of iterations like (C2), in both the convex and non-convex stochastic settings. Thus, it is easier to verify and more practical compared with (C1)-(C3). On the other hand, (SC) partially overlaps with (C1), since the proposed (SC) covers AdamNC [20], AdaGrad with exponential moving average (AdaEMA) momentum [6], and RMSProp [19] as instances, whose convergences were all originally obtained by requiring the positive definiteness of Γ_t. Meanwhile, based on (SC), we can directly derive their global convergences in the non-convex stochastic setting as byproducts, without checking the positive definiteness of Γ_t step by step. Besides, (SC) can serve as an alternative explanation of the divergences of original Adam and RMSProp, which are possibly due to incorrect parameter settings for accumulating the historical second-order moments rather than the unbalanced learning rate caused by the inappropriate correlation between v_{t,k} and g²_{t,k} as in (C3). In addition, AdamNC and AdaEMA are convergent under (SC), but violate (C3) in each iteration.

Moreover, by carefully reshaping the iteration scheme of Adam, we obtain a specific weighted AdaGrad with exponential moving average momentum, which extends the weighted AdaGrad with heavy-ball momentum and Nesterov accelerated gradient momentum [26] in two aspects: the new momentum mechanism and the new base learning rate setting. This provides a new perspective for understanding Adam and RMSProp. At last, we experimentally verify (SC) by applying Adam and RMSProp with different parameter settings to solve the counterexample [20] and to train deep neural networks, including LeNet [16] and ResNet [10]. In summary, the contributions of this work are five-fold:

(1) We introduce an easy-to-check sufficient condition to ensure the global convergences of original Adam and RMSProp in the non-convex stochastic setting. Moreover, this sufficient condition is distinct from the existing conditions (C1)-(C4) and is easier to verify.

(2) We reshape Adam as a weighted AdaGrad with exponential moving average momentum, which provides a new perspective for understanding Adam and RMSProp and also complements AdaUSM in [26].

(3) We provide a new explanation of the divergences of original Adam and RMSProp, which, based on (SC), are possibly due to an incorrect parameter setting of the combinations of historical second-order moments.

(4) We find that the sufficient condition extends the restrictions of RMSProp [19] and covers many convergent variants of Adam, e.g., AdamNC, AdaGrad with momentum, etc. Thus, their convergences in the non-convex stochastic setting naturally hold.

(5) We conduct experiments to validate the sufficient condition for the convergences of Adam/RMSProp. The experimental results match our theoretical results.

2. Generic Adam

For readers' convenience, we first clarify a few necessary notations used in the forthcoming Generic Adam. We denote x_{t,k} as the k-th component of x_t ∈ R^d, and g_{t,k} as the k-th component of the stochastic gradient at the t-th iteration, and call α_t > 0 the base learning rate and β_t the momentum parameter, respectively. Let ǫ > 0 be a sufficiently small constant. Denote 0 = (0, ..., 0) ∈ R^d and ǫ = (ǫ, ..., ǫ) ∈ R^d. All operations, such as multiplying, dividing, and taking the square root, are executed in the coordinate-wise sense.

Algorithm 1 Generic Adam
1: Parameters: Choose {α_t}, {β_t}, and {θ_t}. Choose x_1 ∈ R^d and set initial values m_0 = 0 and v_0 = ǫ.
2: for t = 1, 2, ..., T do
3:   Sample a stochastic gradient g_t;
4:   for k = 1, 2, ..., d do
5:     v_{t,k} = θ_t v_{t−1,k} + (1 − θ_t) g²_{t,k};
6:     m_{t,k} = β_t m_{t−1,k} + (1 − β_t) g_{t,k};
7:     x_{t+1,k} = x_{t,k} − α_t m_{t,k}/√v_{t,k};
8:   end for
9: end for

Generic Adam covers RMSProp by setting β_t = 0. Moreover, it covers Adam with a bias correction [14] as follows:

Remark 1. The original Adam with the bias correction [14] takes constant parameters β_t ≡ β and θ_t ≡ θ. The iteration scheme is written as x_{t+1} = x_t − α m̂_t/√v̂_t, with m̂_t = m_t/(1 − β^t) and v̂_t = v_t/(1 − θ^t). Let α_t = α √(1 − θ^t)/(1 − β^t). Then, the above can be rewritten as x_{t+1} = x_t − α_t m_t/√v_t. Thus, it is equivalent to taking constant β_t, constant θ_t, and the new base learning rate α_t in Generic Adam.

2.1. Weighted AdaGrad Perspective

Now we show that Generic Adam can be reformulated as a new type of weighted AdaGrad algorithm with exponential moving average momentum (Weighted AdaEMA).

Algorithm 2 Weighted AdaEMA
1: Parameters: Choose parameters {α_t}, momentum factors {β_t}, and weights {w_t}. Set W_0 = 1, m_0 = 0, V_0 = ǫ, and x_1 ∈ R^d.
2: for t = 1, 2, ..., T do
3:   Sample a stochastic gradient g_t;
4:   W_t = W_{t−1} + w_t;
5:   for k = 1, 2, ..., d do
6:     V_{t,k} = V_{t−1,k} + w_t g²_{t,k};
7:     m_{t,k} = β_t m_{t−1,k} + (1 − β_t) g_{t,k};
8:     x_{t+1,k} = x_{t,k} − α_t m_{t,k}/√(V_{t,k}/W_t);
9:   end for
10: end for

Remark 2. Weighted AdaEMA is a natural generalization of the AdaGrad algorithm. The classical AdaGrad is obtained by taking the weights w_t = 1, the momentum factors β_t = 0, and the parameters α_t = η/√(t + 1) for a constant η.

The following proposition states the equivalence between Generic Adam and Weighted AdaEMA.

Proposition 3. Algorithm 1 and Algorithm 2 are equivalent.
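As a numerical illustration of Proposition 3 (our sketch, not the paper's proof), the script below runs Algorithm 1 and Algorithm 2 on the same stochastic gradient stream. The weight choice w_t = (1 − θ_t)/(θ_1 θ_2 ··· θ_t) is one correspondence that realizes the equivalence when v_0 = V_0 = ǫ; it is obtained by unrolling the two recursions and is not stated explicitly in this excerpt.

```python
import numpy as np

def run_generic_adam(grads, alphas, thetas, betas, x1, eps=1e-8):
    """Algorithm 1 (Generic Adam), coordinate-wise."""
    x, m, v = x1.astype(float), np.zeros_like(x1, dtype=float), np.full(x1.shape, eps)
    traj = []
    for t, g in enumerate(grads, start=1):
        v = thetas[t - 1] * v + (1 - thetas[t - 1]) * g ** 2
        m = betas[t - 1] * m + (1 - betas[t - 1]) * g
        x = x - alphas[t - 1] * m / np.sqrt(v)
        traj.append(x.copy())
    return traj

def run_weighted_adaema(grads, alphas, weights, betas, x1, eps=1e-8):
    """Algorithm 2 (Weighted AdaEMA), coordinate-wise."""
    x, m = x1.astype(float), np.zeros_like(x1, dtype=float)
    V, W = np.full(x1.shape, eps), 1.0
    traj = []
    for t, g in enumerate(grads, start=1):
        W = W + weights[t - 1]
        V = V + weights[t - 1] * g ** 2
        m = betas[t - 1] * m + (1 - betas[t - 1]) * g
        x = x - alphas[t - 1] * m / np.sqrt(V / W)
        traj.append(x.copy())
    return traj

# Same gradient stream and parameters for both algorithms.
rng = np.random.default_rng(1)
T, d = 200, 3
grads = [rng.normal(size=d) for _ in range(T)]
thetas = [1.0 - 1.0 / (t + 1) for t in range(1, T + 1)]   # example schedule with theta_t in (0, 1)
betas = [0.9] * T
alphas = [0.1 / np.sqrt(t) for t in range(1, T + 1)]
# Weights realizing the equivalence: w_t = (1 - theta_t) / prod_{j<=t} theta_j (our inference).
prods = np.cumprod(thetas)
weights = [(1 - thetas[t]) / prods[t] for t in range(T)]

xa = run_generic_adam(grads, alphas, thetas, betas, np.zeros(d))
xb = run_weighted_adaema(grads, alphas, weights, betas, np.zeros(d))
print("max deviation over the run:", max(np.max(np.abs(a - b)) for a, b in zip(xa, xb)))
```

Up to floating-point rounding the two trajectories coincide, which is exactly the equivalence asserted by Proposition 3; with θ_t = 1 − 1/(t + 1) the induced weights are constant (w_t = 1), i.e., the uniform weighting of Remark 2.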
The divergence issue of Adam/RMSProp. When θ_t is taken constant, i.e., θ_t ≡ θ, Reddi et al. [20] have pointed out that Adam and RMSProp (β_t = 0) can be divergent even in the convex setting. They conjectured that the divergence is possibly due to the uncertainty of the positive definiteness of Γ_t in Eq. (5). This idea has motivated many new convergent variants of Adam obtained by forcing Γ_t ⪰ 0. Recently, Zhou et al. [25] further argued, by studying the counterexample in [20], that the nature of the divergences of Adam and RMSProp is possibly the unbalanced learning rate η_t caused by the inappropriate correlation between v_{t,k} and g²_{t,k}. However, this explanation is contradicted by many existing convergent Adam-type algorithms such as AdamNC, NosAdam [12], etc. So far, there is no satisfactory explanation of the core reason for the divergence issue. We will provide more insights in Section 4 based on our theoretical analysis.

3. Main Results

In this section, we characterize the upper bound of the gradient residual of problem (1) as a function of the parameters (θ_t, α_t). The convergence rate of Generic Adam is then derived directly by specifying appropriate parameters (θ_t, α_t). Below, we state the necessary assumptions that are commonly used for analyzing the convergence of a stochastic algorithm for non-convex problems:

(A1) The minimum value of problem (1) is lower-bounded, i.e., f* := min_{x∈R^d} f(x) > −∞;

(A2) The gradient of f is L-Lipschitz continuous, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d;

(A3) The stochastic gradient g_t is an unbiased estimate, i.e., E[g_t] = ∇f(x_t);

(A4) The second-order moment of the stochastic gradient g_t is uniformly upper-bounded, i.e., E‖g_t‖² ≤ G.

To establish the upper bound, we also suppose that the parameters {β_t}, {θ_t}, and {α_t} satisfy the following restrictions:

(R1) The parameters {β_t} satisfy 0 ≤ β_t ≤ β < 1 for all t, for some constant β;

(R2) The parameters {θ_t} satisfy 0 < θ_t < 1 and θ_t is non-decreasing in t with θ := lim_{t→∞} θ_t > β²;

(R3) The parameters {α_t} satisfy that χ_t := α_t/√(1 − θ_t) is "almost" non-increasing in t, by which we mean that there exist a non-increasing sequence {a_t} and a positive constant C_0 independent of t such that a_t ≤ χ_t ≤ C_0 a_t.

The restriction (R3) indeed says that χ_t is the product of a non-increasing sequence {a_t} and a bounded sequence. This is a slight generalization of χ_t itself being non-increasing. If χ_t itself is non-increasing, we can then take a_t = χ_t and C_0 = 1. For most of the well-known Adam-type methods, χ_t is indeed non-increasing; for instance, for AdaGrad with EMA momentum we have α_t = η/√t and θ_t = 1 − 1/t, so χ_t = η is constant; for Adam with constant θ_t ≡ θ and non-increasing α_t (say α_t = η/√t or α_t = η), χ_t = α_t/√(1 − θ) is non-increasing. The motivation for using "almost" non-increasing, instead of requiring χ_t itself to be non-increasing, is that it allows us to deal with the bias-correction steps in Adam [14].

We fix a positive constant θ′ such that β² < θ′ ≤ θ (in the special case that θ_t ≡ θ is constant, we can directly set θ′ = θ). Let γ := β²/θ′ < 1 and

    C_1 := ∏_{j=1}^{N} (θ′/θ_j),    (6)

where N is the maximum of the indices j with θ_j < θ′. The finiteness of N is guaranteed by the fact that lim_{t→∞} θ_t = θ ≥ θ′. When there are no such indices, i.e., θ_1 ≥ θ′, we take C_1 = 1 by convention. In general, C_1 ≥ 1. Our main results on estimating the gradient residual are stated as follows:

Theorem 4. Let {x_t} be a sequence generated by Generic Adam with initial values x_1, m_0 = 0, and v_0 = ǫ. Assume that f and the stochastic gradients g_t satisfy assumptions (A1)-(A4). Let τ be randomly chosen from {1, 2, ..., T} with equal probabilities p_τ = 1/T. We have

    ( E[ ‖∇f(x_τ)‖^{4/3} ] )^{3/2} ≤ ( C + C′ Σ_{t=1}^{T} α_t √(1 − θ_t) ) / ( T α_T ),

where

    C′ = 2C_0² C_3 d √(G² + ǫd) / [ (1 − β) θ_1 ],   C = ( 2C_0 √(G² + ǫd) / (1 − β) ) ( C_4 + C_3 C_0 d χ_1 log(1 + G²/(ǫd)) ),

and where C_4 and C_3 are defined as C_4 = f(x_1) − f* and

    C_3 = C_0² χ_1 L / ( 2 C_1 (1 − γ)² ) + C_0 ( 2β/(1 − β) + 1 ) G / √( C_1 (1 − γ) θ_1 ).

Theorem 5. Suppose the same setting and hypotheses as in Theorem 4. Let τ be randomly chosen from {1, 2, ..., T} with equal probabilities p_τ = 1/T. Then for any δ > 0, the following bound holds with probability at least 1 − δ^{2/3}:

    ‖∇f(x_τ)‖² ≤ ( C + C′ Σ_{t=1}^{T} α_t √(1 − θ_t) ) / ( δ T α_T ) =: Bound(T),

where C and C′ are defined as in Theorem 4.

Remark 6. (i) The constants C and C′ depend on the a priori known constants C_0, C_1, β, θ′, G, L, ǫ, d, f* and θ_1, α_1, x_1. (ii) Convergence in expectation in Theorem 4 is slightly stronger than convergence in probability in Theorem 5. Convergence in expectation is stated for the term (E[‖∇f(x_τ)‖^{4/3}])^{3/2}, which is slightly weaker than E[‖∇f(x_τ)‖²]. The latter is adopted for most SGD variants with global learning rates, namely, where the learning rate for each coordinate is the same.
This is due to the fact that (1/Σ_{t=1}^{T} α_t) E[ Σ_{t=1}^{T} α_t ‖∇f(x_t)‖² ] is exactly E[‖∇f(x_τ)‖²] if τ is randomly selected via the distribution P(τ = k) = α_k / Σ_{t=1}^{T} α_t. This does not apply to coordinate-wise adaptive methods, because the learning rate for each coordinate is different, and hence one is unable to randomly select an index according to a distribution that is uniform across coordinates. On the other hand, the proofs of AMSGrad and AdaEMA [6] are able to achieve a bound for E[‖∇f(x_τ)‖²]. This is due to the strong assumption ‖g_t‖ ≤ G, which results in a uniform lower bound η_{t,k} ≥ α_t/G for each coordinate of the adaptive learning rate. Thus, the proof of AMSGrad [6] can proceed in a way similar to the case of a global learning rate. In our paper we use a coordinate-wise adaptive learning rate and assume the weaker assumption E[‖g_t‖²] ≤ G instead of ‖g_t‖ ≤ G. To separate the term ‖∇f(x_t)‖² from ‖∇f(x_t)‖²_{η̂_t}, we can only apply Hölder's inequality, which yields a bound for (E[‖∇f(x_τ)‖^{4/3}])^{3/2}.
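The T-dependent factor in the bounds of Theorems 4 and 5 is the ratio (Σ_{t=1}^{T} α_t √(1 − θ_t))/(T α_T). The short script below (an illustration we add here, not from the paper; constants C and C′ omitted) evaluates this ratio for α_t = η/√t under two choices of θ_t: a constant θ_t ≡ θ < 1, as in original Adam/RMSProp, and the schedule θ_t = 1 − 1/t covered by the sufficient condition.

```python
import numpy as np

def bound_ratio(T, alpha, theta):
    """Evaluate (sum_{t<=T} alpha_t * sqrt(1 - theta_t)) / (T * alpha_T),
    the T-dependent factor in the bounds of Theorems 4 and 5."""
    t = np.arange(1, T + 1)
    a, th = alpha(t), theta(t)
    return float(np.sum(a * np.sqrt(1.0 - th)) / (T * a[-1]))

eta, theta_const = 0.1, 0.999
for T in (10**3, 10**4, 10**5, 10**6):
    r_const = bound_ratio(T, lambda t: eta / np.sqrt(t),
                          lambda t: np.full(t.shape, theta_const))
    r_sc = bound_ratio(T, lambda t: eta / np.sqrt(t),
                       lambda t: 1.0 - 1.0 / t)
    print(f"T = {T:>7}   constant theta_t: {r_const:.4f}   theta_t = 1 - 1/t: {r_sc:.4f}")
```

With constant θ_t ≡ 0.999 the ratio levels off near 2√(1 − θ) ≈ 0.063, so Bound(T) does not vanish as T grows (consistent with the lower bound in Corollary 7 below), whereas with θ_t = 1 − 1/t the ratio behaves like (log T)/√T and tends to zero, which is the regime in which Theorems 4 and 5 guarantee convergence.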

Corollary 7. Take α_t = η/t^s with 0 ≤ s < 1. Suppose lim_{t→∞} θ_t = θ < 1. Then Bound(T) in Theorem 5 is bounded from below by a constant:

    Bound(T) ≥ C′ √(1 − θ) / δ.    (7)

In particular, when θ_t ≡ θ < 1, we have the following more subtle lower and upper bounds for Bound(T):

    C/(δ η T^{1−s}) + C′ √(1 − θ)/δ ≤ Bound(T) ≤ C/(δ η T^{1−s}) + C′ √(1 − θ)/(δ(1 − s)).

Remark 11. Corollary 10 recovers and extends the results of some well-known algorithms below:

AdaGrad with exponential moving average (EMA) momentum. When θ_t = 1 − 1/t, α_t = η/√t, and β_t ≡ β < 1, Generic Adam is exactly AdaGrad with EMA momentum (AdaEMA) [6]. In particular, if β = 0, this is the vanilla coordinate-wise AdaGrad. It corresponds to taking r = 1 and s = 1/2 in Corollary 10. Hence, AdaEMA has
