Robust, Accurate Stochastic Optimization for Variational Inference



Robust, Accurate Stochastic Optimization for Variational Inference

Akash Kumar Dhaka, Aalto University, akash.dhaka@aalto.fi
Alejandro Catalina, Aalto University, alejandro.catalina@aalto.fi
Michael Riis Andersen, Technical University of Denmark, miri@dtu.dk
Måns Magnusson, Uppsala University, mans.magnusson@statistik.uu.se
Jonathan H. Huggins, Boston University, huggins@bu.edu
Aki Vehtari, Aalto University, aki.vehtari@aalto.fi

Abstract

We consider the problem of fitting variational posterior approximations using stochastic optimization methods. The performance of these approximations depends on (1) how well the variational family matches the true posterior distribution, (2) the choice of divergence, and (3) the optimization of the variational objective. We show that even in the best-case scenario when the exact posterior belongs to the assumed variational family, common stochastic optimization methods lead to poor variational approximations if the problem dimension is moderately large. We also demonstrate that these methods are not robust across diverse model types. Motivated by these findings, we develop a more robust and accurate stochastic optimization framework by viewing the underlying optimization algorithm as producing a Markov chain. Our approach is theoretically motivated and includes a diagnostic for convergence and a novel stopping rule, both of which are robust to noisy evaluations of the objective function. We show empirically that the proposed framework works well on a diverse set of models: it can automatically detect stochastic optimization failure or inaccurate variational approximation.

1 Introduction

Bayesian inference is a popular approach due to its flexibility and theoretical foundation in probabilistic reasoning [2, 46]. The central object in Bayesian inference is the posterior distribution of the parameter of interest given the data. However, using Bayesian methods in practice usually requires approximating the posterior distribution. Due to its computational efficiency, variational inference (VI) has become a commonly used approach for large-scale approximate inference in machine learning [26, 56]. Informally, VI methods find a simpler approximate posterior that minimizes a divergence measure $D[q \,\|\, p]$ from the approximate posterior $q$ to the exact posterior distribution $p$ – that is, they compute an optimal variational approximation $q^* = \arg\min_{q \in \mathcal{Q}} D[q \,\|\, p]$. The variational family is often parametrized by a vector $\lambda \in \mathbb{R}^K$, so the parameter of $q^*$ is given by

$$\lambda^* = \arg\min_{\lambda \in \mathbb{R}^K} D[q_\lambda \,\|\, p]. \qquad (1)$$
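To make Eq. (1) concrete, the following sketch fits a univariate Gaussian $q_\lambda$ to a Gaussian target by plain gradient descent; because both distributions are Gaussian, the KL divergence and its gradient are available in closed form. (The code is purely illustrative – a Python toy with made-up values, not part of our experimental setup.)

    import numpy as np

    # Toy instance of Eq. (1): target p = N(mu_p, sigma_p^2), family q_lambda =
    # N(mu, sigma^2) with lambda = (mu, log_sigma), so q can match p exactly.
    mu_p, sigma_p = 2.0, 0.5

    def kl_and_grad(mu, log_sigma):
        """Closed-form KL[q_lambda || p] for univariate Gaussians, with gradient."""
        sigma = np.exp(log_sigma)
        kl = np.log(sigma_p / sigma) + (sigma**2 + (mu - mu_p)**2) / (2 * sigma_p**2) - 0.5
        d_mu = (mu - mu_p) / sigma_p**2
        d_log_sigma = -1.0 + sigma**2 / sigma_p**2
        return kl, np.array([d_mu, d_log_sigma])

    lam = np.array([0.0, 0.0])          # initial variational parameters
    for t in range(500):
        _, g = kl_and_grad(*lam)
        lam -= 0.1 * g                  # exact (noise-free) gradient descent

    print(lam)                          # converges to (2.0, log 0.5), i.e. q = p

With exact gradients the optimization is unproblematic; the difficulties studied in this paper arise once the gradients must be estimated stochastically.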

[Figure 1: two panels. Left: distance $D$ between moments versus the number of dimensions $K$ of the variational parameter, for the ΔELBO and MCSE stopping rules with last-iterate (LI) and iterate-averaged (IA) estimates. Right: the negative ELBO and the distance $D$ versus iterations, with stopping points marked.]

Figure 1: (left) The distance between the variational and ground truth moments for a full-rank VI approximation on linear regression models with posteriors of varying dimension (see Section 4 for a precise definition of the distance). ΔELBO denotes the standard stopping rule, MCSE denotes our proposed stopping rule, and IA indicates that our iterate averaging approach was used, while LI means the last iterate was used. IA and our proposed stopping rule both improve accuracy, particularly in higher dimensions. (right) The negative evidence lower bound (-ELBO) and the distances between the variational and ground truth moments based on the current iterate and using IA. The stopping point based on ΔELBO is shown by the dotted red line and occurs prematurely. Using our proposed algorithm, the starting and stopping points for IA are shown by the dotted orange and black lines, respectively.

Variational approximations in machine learning are typically used for prediction, but recent work has shown that these approximations possess good statistical properties as point estimators and as posterior approximations [7, 39, 57, 58]. Variational inference is therefore becoming an attractive statistical method, since variational approximations can often be computed more efficiently than either the maximum likelihood estimate or more precise posterior estimates – particularly when there are local latent variables that need to be integrated out. Therefore, there is a need to develop variational methods that are appropriate for statistical inference: where the model parameters are themselves the object of interest, and thus the accuracy of the approximate posterior compared to the true posterior is important. In addition, we would ideally like to refine a variational approximation further using importance sampling [23, 60] – as in the adaptive importance sampling literature [38].

Meanwhile, two developments have greatly increased the applicability of VI methods. The first is stochastic variational inference (SVI), where Eq. (1) is solved using stochastic optimization with mini-batching [21]. The increased computational efficiency of mini-batching allows SVI to scale to datasets with tens of millions of observations. The second is black-box variational inference methods, which have extended variational inference to a wide range of models in the probabilistic programming context by removing the need for model-specific derivations [28, 44, 51]. This flexibility is obtained by approximating local expectations and their auto-differentiated gradients using Monte Carlo approximations. While using stochastic optimization to solve Eq. (1) makes variational inference scalable as well as flexible, there is a drawback: it becomes increasingly difficult to solve the optimization problem with sufficiently high accuracy, particularly as the dimensionality of the variational parameter $\lambda$ increases. Figure 1 (left, solid lines) demonstrates this phenomenon on a simple linear regression problem where the exact posterior belongs to the variational family. Since $q^* = p$, all of the error is due to the stochastic optimization. Because in machine learning the quality of a posterior approximation is usually evaluated by out-of-sample predictive performance, the additional error from the stochastic optimization is not necessarily problematic.
Therefore, there has been less attention paid to developing stochastic optimization schemes that provide very accurate variational parameter estimates and, ideally, have good importance sampling properties too. And, as seen in Fig. 1 (left, solid blue line), standard VI optimization schemes remain insufficient for statistical inference because they do not provide accurate variational parameter estimates – particularly in higher dimensions.

Moreover, existing optimizers are fragile, in that they require the choice of many hyperparameters and can fail badly. For example, the common stopping rule ΔELBO [28] is based on the change in the variational objective function value (the negative ELBO). But, as illustrated in Fig. 1 (right), using ΔELBO results in termination before the optimizer converges, resulting in an inaccurate variational approximation (intersection of the blue line and the purple vertical line).

Using a smaller cutoff for ΔELBO to ensure convergence resulted in the criterion never being met, because the stochastic estimates of the negative ELBO were too noisy. To remedy this problem, a combination of a smaller step size (resulting in slower convergence) and more accurate Monte Carlo gradient estimates (resulting in greater per-iteration computation) must be used. Thus, the standard optimization algorithm is fragile due to a non-trivial interplay between its many hyperparameters, which requires the user to carefully tune all of them jointly.

In this paper, we address the shortcomings of current stochastic optimizers for VI by viewing the underlying optimization algorithm as producing a Markov chain. While such a perspective has been pursued in theoretical contexts [12, 43] and in the deep neural network literature [15, 22, 24, 35], the potential algorithmic consequences of such a perspective, particularly in the VI context, have not been explored. Our Markov chain perspective allows us to create more accurate variational parameter estimates by using iterate averaging, which is particularly effective in high dimensions (see the red dotted lines in Fig. 1). But, even when using iterate averaging, the problems of fragility remain. In particular, we need to decide (A) when to start averaging (or when the optimizer has failed) and (B) when to terminate the optimization. For (A), we use the $\widehat{R}$ diagnostic [16, 54], a well-established method from the MCMC literature. For (B), we use Monte Carlo standard error estimates based on the chain's effective sample size (ESS), and the ESS itself [54], to ensure convergence of the parameter estimate (again drawing on a rich MCMC literature [13, 14]). We also use the $\hat{k}$ diagnostic from the importance sampling literature to check the quality of the variational approximation and to determine whether it can be used as an importance distribution [55, 60]. By combining all of these ideas, we develop an optimization framework that is robust to the selection of optimization hyperparameters such as step size and mini-batch size, while also producing substantially more accurate posterior approximations. We empirically validate our proposed framework on a wide variety of models and datasets.

2 Background: Variational Inference

Let $p(y, \theta)$ denote the joint density for a model of interest, where $y \in \mathcal{Y}^N$ is a vector of $N$ observations and $\theta \in \mathbb{R}^P$ is a vector of model parameters. In this work, we assume that the observations are conditionally independent given $\theta$; that is, the joint density factorizes as $p(y, \theta) = \prod_{i=1}^{N} p(y_i \mid \theta)\, p_0(\theta)$.¹ The goal is to approximate the resulting posterior distribution, $p(\theta) := p(\theta \mid y)$, by finding the best approximating distribution $q \in \mathcal{Q}$ in the variational family $\mathcal{Q}$, as measured by a divergence measure. We focus on two commonly used variational families – the mean-field and the full-rank Gaussian families – and the standard Kullback–Leibler (KL) divergence objective, but our approach generalizes to other variational families and divergences as well.

¹ In addition, we may have that $p(y_i \mid \theta) = \int p(y_i \mid \theta, z_i)\, p(z_i \mid \theta)\, dz_i$. But, for simplicity, we do not write the explicit dependence on the local latent variable $z_i$.
It can be shown thatminimizing the KL divergence is equivalent to maximizing the functional known as the evidencelower bound (ELBO) L : RK R given by [3] XN NX1L (λ) Eq [ln p(y, θ)] Eq [ln q(θ)] Eq [ln p(yi θ)] KL [q p0 ] Li (λ) ,Ni 1i 1where q is parametrized by λ RK and Li (λ) Eq [ln p(yi θ)] approximation is qλ for λ arg maxλ L (λ).2.11N KL [q p0 ].The optimalStochastic Optimization for VIWe will consider approximately finding λ using the stochastic optimization schemeλt 1 λt ηγt ĝt ,(2)where ĝt is an unbiased, stochastic estimator of the gradient L at λt (i.e., E [ĝt ] L (λt )), η isa base step size, and γt 0 is the learning rate at iteration t, which may depend on current andpast iterates and gradients. The noise in the gradients is a consequence of using mini-batching, orapproximating the local expectations Li (λ) using Monte Carlo estimators, or both [21, 37, 44]. For1RIn addition, we may have that p(yi θ) p(yi θ, zi )p(zi θ)dzi . But, for simplicity, we do not write theexplicit dependence on the local latent variable zi .3

For standard stochastic gradient descent (SGD), $\gamma_t$ is a deterministic function of $t$ only, and the iterates converge asymptotically if $\gamma_t$ satisfies the Robbins–Monro conditions $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$ [45]. SGD is very sensitive to the choice of step size, since too large a step size will result in the algorithm diverging, while too small a step size will lead to very slow convergence. The shortcomings of SGD have led to the development of more robust, adaptive stochastic optimization schemes such as Adagrad [11], Adam [27, 52], and RMSProp [20], which modify the step size schedule according to the norm of current and past gradient estimates.

Even when using adaptive stochastic optimization schemes, however, it remains non-trivial to check for convergence because we only have access to unbiased estimates of the value and gradient of the optimization objective $\mathcal{L}$. Practitioners often run the optimization for a pre-defined number of iterations or use simple moving-window statistics of $\mathcal{L}$, such as the running median or the running mean, to test for convergence. We refer to the approach based on looking at the change in $\mathcal{L}$ as the ΔELBO stopping rule. This stopping rule can be problematic, as the scale of the ELBO makes it non-trivial to specify a universal convergence tolerance $\epsilon$. For example, Kucukelbir et al. [28] used $\epsilon = 10^{-2}$, but Yao et al. [60] demonstrate that $\epsilon = 10^{-4}$ might be needed for good accuracy. More generally, sometimes the objective estimates are too noisy relative to the chosen step size $\eta$, learning rate $\gamma_t$, threshold $\epsilon$, and the scale of $\mathcal{L}$, which results in the stopping rule never triggering because the step size is too large relative to the threshold. The stopping rule can also trigger too early, if $\epsilon$ is too large relative to $\eta$ and the scale of $\mathcal{L}$. In either case, the user might have to adjust any or all of $\eta$, $\gamma_t$, and $\epsilon$; run the optimizer again; and then hope for the best.

2.2 Refining a Variational Approximation

Another challenge with variational inference is assessing how close the variational approximation $q_\lambda(\theta)$ is to the true posterior distribution $p$. Recently, the $\hat{k}$ diagnostic has been suggested as a diagnostic for variational approximations [60]. Let $\theta_1, \ldots, \theta_S \sim q_\lambda$ denote draws from the variational posterior. Using (self-normalized) importance sampling, we can then estimate an expectation under the true posterior as $\mathbb{E}[f(\theta)] \approx \sum_{s=1}^{S} f(\theta_s) w(\theta_s) / \sum_{s=1}^{S} w(\theta_s)$, where $w(\theta_s) = p(\theta_s \mid y) / q(\theta_s)$. If the proposal distribution is far from the true posterior, the weights $w(\theta_s)$ will have high or infinite variance. The number of finite moments of a distribution can be estimated using the shape parameter $k$ of the generalized Pareto distribution (GPD) [55]. If $k > 0.5$, then the variance of the importance sampling estimate of $\mathbb{E}[f(\theta)]$ is infinite. Theoretical and empirical results show that values below 0.7 indicate that the approximation is close enough to be used for importance sampling, while values above 1 indicate that the approximation is very poor [55] (a toy illustration of this check appears at the end of this subsection).

Recent work [18] suggests that SGD iterates can converge towards a heavy-tailed stationary distribution with infinite variance even for simple models (e.g., linear regression). Furthermore, even in cases without infinite variance, the heavy-tailed distribution may not be consistent for the mean; that is, the mean of the stationary distribution might not coincide with the mode of the objective. In this work we again rely on $\hat{k}$ to provide an estimate of the tail index of the iterates (at convergence) and warn the user when the empirical tail index indicates a very poor approximation. We leave a more thorough study of this phenomenon for future work.
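As a toy illustration of the $\hat{k}$ check, the sketch below draws from a deliberately underdispersed Gaussian approximation to a Gaussian posterior, forms self-normalized importance weights, and fits a generalized Pareto distribution to the largest weights. (This is a simplified stand-in for the PSIS procedure of [55]: scipy's maximum-likelihood GPD fit replaces the regularized fit used there, and the densities are toys.)

    import numpy as np
    from scipy.stats import genpareto

    rng = np.random.default_rng(2)

    # Toy check: true posterior p = N(0, 1); the approximation q = N(0, 0.5^2)
    # is too narrow, so the importance ratios p/q are heavy-tailed.
    log_p = lambda th: -0.5 * th**2
    log_q = lambda th: -0.5 * (th / 0.5) ** 2 - np.log(0.5)

    S = 4000
    theta = 0.5 * rng.standard_normal(S)          # draws theta_s ~ q_lambda
    log_w = log_p(theta) - log_q(theta)
    w = np.exp(log_w - log_w.max())               # stabilized importance weights

    # Self-normalized importance sampling estimate of E[f(theta)], f = identity.
    f_hat = np.sum(theta * w) / np.sum(w)

    # Crude k-hat: ML fit of a GPD to the exceedances of the largest weights.
    w_sorted = np.sort(w)
    M = int(3 * np.sqrt(S))                       # tail size suggested in [55]
    excess = w_sorted[-M:] - w_sorted[-M - 1]
    k_hat, _, _ = genpareto.fit(excess, floc=0)
    print(f_hat, k_hat)                           # k_hat > 0.7 flags this q

For this q the weights have only about 1.3 finite moments ($k = 0.75$), so $\hat{k}$ typically exceeds the 0.7 threshold and the approximation is flagged as unreliable for importance sampling.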
3 Stochastic Optimization as a Markov Chain

Figure 1 (left) shows that as the dimensionality of the variational parameter increases, the quality of the variational approximation degrades. To understand the source of the problem, we can view a stochastic optimization procedure as producing a discrete-time stochastic process $(\lambda_t)_{t \geq 1}$ [5, 8, 32, 36, 59]. Under Robbins–Monro-type conditions, many stochastic optimization procedures converge asymptotically to the exact solution $\lambda^*$ [33, 45], but any iterate $\lambda_t$ obtained after a finite number of iterations will be a realization of a diffuse probability distribution $\pi_t$ (i.e., $\lambda_t \sim \pi_t$) that depends on the objective function, the optimization scheme, and the number of iterations $t$.

We can gain further insight into the behavior of $(\lambda_t)_{t \geq 1}$ by considering SGD with constant learning rate (that is, with $\gamma_t = 1$). Under regularity assumptions, SGD admits a stationary distribution $\pi_\infty$ (that is, $\lim_{t \to \infty} \pi_t = \pi_\infty$). Moreover, $\pi_\infty$ will have covariance $\Sigma$ and mean $\bar{\lambda}$ such that $\|\bar{\lambda} - \lambda^*\| = O(\eta)$ [8]. Thus, for some sufficiently large $t_0$, once $t \geq t_0$ the SGD iterates will reach approximate stationarity: $\pi_t \approx \pi_\infty$. This implies that $\mathbb{E}[\lambda_t]$ is within $O(\eta)$ of $\lambda^*$.

However, the variance $\mathbb{V}[\lambda_t] \approx \Sigma$ could be large. Indeed, we expect that as the number of model parameters increases – and hence the number of variational parameters $K$ increases – the expected squared distance from $\lambda_t$ to the optimal parameter $\lambda^*$ will increase. For example, assuming for simplicity that the stationary distribution is isotropic with $\Sigma = \alpha^2 I_K$ (where $I_K$ denotes the $K \times K$ identity matrix), the expected squared distance from $\lambda_t$ to the optimal value is given by $\mathbb{E}[\|\lambda_t - \lambda^*\|^2] = \alpha^2 K + O(\eta^2)$. Therefore, we should expect the distance between $\lambda_t$ and $\lambda^*$ to be $O(\sqrt{K})$, which implies that the variational parameter estimates output by SGD become increasingly inaccurate as the dimensionality of the variational parameter increases. As demonstrated in Fig. 1 (left), one should be particularly careful when fitting a full-rank variational family, since the number of parameters grows as $P(P+1)/2$. Although the preceding discussion only applies directly to SGD, it is reasonable to expect that robust stochastic optimization schemes such as Adagrad, Adam, and RMSProp will behave similarly as long as $\gamma_t$ and $\hat{g}_t$ depend at most very weakly on iterates far in the past.

3.1 Improving Optimization Accuracy with Iterate Averaging

While we have shown that we should not expect a single iterate $\lambda_t$ to be close to $\lambda^*$ in high-dimensional settings, the expected value of $\lambda_t$ is equal to (or, more realistically, close to) $\lambda^*$. Therefore, we can use iterate averaging (IA) to construct a more accurate estimate of $\lambda^*$, given by

$$\bar{\lambda} = \frac{1}{T} \sum_{i=1}^{T} \lambda_{t+i}, \qquad (3)$$

where we should aim to choose $t \geq t_0$. In the fixed-step-size setting described above, the estimator $\bar{\lambda}$ has bias of order $\eta$ and covariance $\mathbb{V}[\bar{\lambda}] = \Sigma / T + \frac{2}{T^2} \sum_{1 \leq i < j \leq T} \mathrm{cov}[\lambda_{t+i}, \lambda_{t+j}]$. Hence, as long as the iterates $\lambda_t$ are not too strongly correlated, we can reduce the variance and alleviate the effect of dimensionality by using iterate averaging.

Iterate averaging has been previously considered in a number of scenarios. Ruppert [50] proposed using a moving average of SGD iterates to improve SGD algorithms in the context of linear one-dimensional models. Polyak and Juditsky [42] extended the moving-average approach to multi-dimensional and nonlinear models and showed that it improves the rate of convergence in several important scenarios; thus, it is often referred to as Polyak–Ruppert averaging. In related work, Bach and Moulines [1] show that an averaged stochastic gradient scheme with constant step size can achieve optimal convergence for linear models.
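A small simulation illustrates the variance reduction of Eq. (3) under the isotropic setting discussed above. (A sketch with arbitrary constants, not the full algorithm proposed in this paper, which chooses the averaging window adaptively.)

    import numpy as np

    rng = np.random.default_rng(3)

    # Constant-step-size SGD (gamma_t = 1) on the toy objective L = -||lam||^2 / 2
    # in K = 100 dimensions: the iterates reach a stationary distribution around
    # lambda* = 0 instead of converging to it.
    K, eta, t0, T = 100, 0.05, 2000, 4000
    lam = np.full(K, 3.0)
    running_sum = np.zeros(K)

    for t in range(t0 + T):
        g_hat = -lam + rng.standard_normal(K)   # unbiased estimate of grad L
        lam = lam + eta * g_hat                 # Eq. (2) with gamma_t = 1
        if t >= t0:                             # average once roughly stationary
            running_sum += lam

    lam_bar = running_sum / T                   # the iterate average of Eq. (3)
    print(np.linalg.norm(lam), np.linalg.norm(lam_bar))

The last iterate stays at distance roughly $\alpha\sqrt{K}$ from $\lambda^*$ (about 1.6 here), while the average over the $T$ correlated post-warm-up iterates is roughly an order of magnitude closer, consistent with the covariance expression above.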

