Variational Zero-Inflated Gaussian Processes with Sparse Kernels (UAI)


Pashupati Hegde, Markus Heinonen, Samuel Kaski
Helsinki Institute for Information Technology HIIT
Department of Computer Science, Aalto University

Abstract

Zero-inflated datasets, which have an excess of zero outputs, are commonly encountered in problems such as climate or rare event modelling. Conventional machine learning approaches tend to overestimate the non-zeros, leading to poor performance. We propose a novel model family of zero-inflated Gaussian processes (ZiGP) for such zero-inflated datasets, produced by sparse kernels through learning a latent probit Gaussian process that can zero out kernel rows and columns whenever the signal is absent. The ZiGPs are particularly useful for making the powerful Gaussian process networks more interpretable. We introduce sparse GP networks where variable-order latent modelling is achieved through sparse mixing signals. We derive the non-trivial stochastic variational inference tractably for scalable learning of the sparse kernels in both models. The novel output-sparse approach improves both prediction of zero-inflated data and interpretability of latent mixing models.

1 INTRODUCTION

Zero-inflated quantitative datasets with an overabundance of zero output observations are common in many domains, such as climate and earth sciences (Enke & Spekat, 1997; Wilby, 1998; Charles et al., 2004), ecology (del Saz-Salazar & Rausell-Köster, 2008; Ancelet et al., 2009), social sciences (Böhning et al., 1997), and count processes (Barry & Welsh, 2002). Traditional regression modelling of such data tends to underestimate zeros and overestimate nonzeros (Andersen et al., 2014).

A conventional way of forming zero-inflated models is to estimate a mixture of a Bernoulli "on-off" process and a Poisson count distribution (Johnson & Kotz, 1969; Lambert, 1992). In hurdle models a binary "on-off" process determines whether a hurdle is crossed, and the positive responses are governed by a subsequent process (Cragg, 1971; Mullahy, 1986). The hurdle model is analogous to first performing classification and then training a continuous predictor on the positive values only, while the zero-inflated model would regress with all observations. Both stages can be combined for simultaneous classification and regression (Abraham & Tan, 2010).

Gaussian process models have not been proposed for zero-inflated datasets since their posteriors are Gaussian, which are ill-fitted for zero predictions. A suite of Gaussian process models have been proposed for partially related problems, such as mixture models (Tresp, 2001; Rasmussen & Ghahramani, 2002; Lázaro-Gredilla et al., 2012) and change point detection (Herlands et al., 2016). Structured spike-and-slab models place smoothly sparse priors over the structured inputs (Andersen et al., 2014).

In contrast to other approaches, we propose a Bayesian model that learns the underlying latent prediction function, whose covariance is sparsified through another Gaussian process switching between the 'on' and 'off' states, resulting in a zero-inflated Gaussian process model. This approach introduces a tendency of predicting exact zeros to Gaussian processes, which is directly useful in datasets with excess zeros.

A Gaussian process network (GPRN) is a latent signal framework where multi-output data are explained through a set of latent signals and mixing weight Gaussian processes (Wilson et al., 2012). The standard GPRN tends to have dense mixing that combines all latent signals for all latent outputs.

By applying the zero-predicting Gaussian processes to latent mixture models, we introduce sparse GPRNs where latent signals are mixed with sparse instead of dense mixing weight functions. The sparse model induces variable-order mixtures of latent signals, resulting in simpler and more interpretable models. We demonstrate both of these properties in our experiments with spatio-temporal and multi-output datasets.

Main contributions. Our contributions include (the TensorFlow compatible code will be made publicly available):

1. A novel zero-inflated Gaussian process formalism consisting of a latent Gaussian process and a separate 'on-off' probit-linked Gaussian process that can zero out rows and columns of the model covariance. The novel sparse kernel adds to GPs the ability to predict zeros.

2. Novel stochastic variational inference (SVI) for such sparse probit covariances, which in general are intractable due to having to compute expectations of GP covariances with respect to probit-linked processes. We derive the SVI for learning both of the underlying processes.

3. A novel sparse GPRN with an on-off process in the mixing matrices, leading to sparse and variable-order mixtures of latent signals.

4. A solution to the stochastic variational inference of sparse GPRN, where the SVI is derived for the network of full probit-linked covariances.

Figure 1: Illustration of a zero-inflated GP (a) and standard GP regression (b). The standard approach is unable to model sudden loss of signal (at 4...5) and signal close to zero (at 0...1 and 7...9).

2 GAUSSIAN PROCESSES

We begin by introducing the basics of conventional Gaussian processes. Gaussian processes (GP) are a family of non-parametric, non-linear Bayesian models (Rasmussen & Williams, 2006). Assume a dataset of n inputs X = (x_1, ..., x_n) with x_i ∈ ℝ^D and noisy outputs y = (y_1, ..., y_n) ∈ ℝ^n. The observations y = f(x) + ε are assumed to have additive, zero-mean noise ε ∼ N(0, σ_y²), with a zero-mean GP prior on the latent function f(x),

f(x) ∼ GP(0, K(x, x')),    (1)

which defines a distribution over functions f(x) whose mean and covariance are

E[f(x)] = 0    (2)
cov[f(x), f(x')] = K(x, x').    (3)

Then for any collection of inputs X, the function values follow a multivariate normal distribution f ∼ N(0, K_XX), where f = (f(x_1), ..., f(x_n))^T ∈ ℝ^n, and where K_XX ∈ ℝ^{n×n} with [K_XX]_ij = K(x_i, x_j). (In the following we omit the implicit conditioning on the data inputs X for clarity.)

The key property of Gaussian processes is that they encode functions that predict similar output values f(x), f(x') for similar inputs x, x', with similarity determined by the kernel K(x, x'). In this paper we assume the Gaussian ARD kernel

K(x, x') = σ_f² exp( -½ Σ_{j=1}^D (x_j - x'_j)² / ℓ_j² ),    (4)

with a signal variance σ_f² and dimension-specific lengthscale ℓ_1, ..., ℓ_D parameters.

The inference of the hyperparameters θ = (σ_y, σ_f, ℓ_1, ..., ℓ_D) is commonly performed by maximizing the marginal likelihood

p(y | θ) = ∫ p(y | f) p(f | θ) df,    (5)

which results in a convenient marginal likelihood called the evidence, p(y | θ) = N(y | 0, K_XX + σ_y² I), for a Gaussian likelihood.

The Gaussian process defines a univariate normal predictive posterior distribution f(x) | y, X ∼ N(μ(x), σ²(x)) for an arbitrary input x, with prediction mean and variance

μ(x) = K_xX (K_XX + σ_y² I)^{-1} y,    (6)
σ²(x) = K_xx - K_xX (K_XX + σ_y² I)^{-1} K_Xx,    (7)

where K_Xx = K_xX^T ∈ ℝ^n is the kernel column vector over the pairs (X, x), and K_xx = K(x, x) ∈ ℝ is a scalar. The predictions μ(x) come with uncertainty estimates σ(x) in GP regression.
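For concreteness, the following NumPy sketch implements the Gaussian ARD kernel of equation (4) and the GP predictive mean and variance of equations (6)-(7). It is a minimal illustration only; the helper names (ard_kernel, gp_predict) are ours and are not taken from the paper or its released code.

    import numpy as np

    def ard_kernel(X1, X2, sigma_f, lengthscales):
        # Gaussian ARD kernel, equation (4): per-dimension scaled squared distances.
        d = (X1[:, None, :] - X2[None, :, :]) / lengthscales
        return sigma_f**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

    def gp_predict(X, y, Xstar, sigma_f, lengthscales, sigma_y):
        # Predictive mean and variance of equations (6)-(7).
        Kxx = ard_kernel(X, X, sigma_f, lengthscales) + sigma_y**2 * np.eye(len(X))
        Ksx = ard_kernel(Xstar, X, sigma_f, lengthscales)
        Kss = ard_kernel(Xstar, Xstar, sigma_f, lengthscales)
        alpha = np.linalg.solve(Kxx, y)
        mean = Ksx @ alpha
        var = np.diag(Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T))
        return mean, var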

3 ZERO-INFLATED GAUSSIAN PROCESSES

Figure 2: Illustration of the zero-inflated GP (a) and the sparse kernel (b), composed of a smooth latent function (c,d) filtered by a probit support function (e,f), which is induced by the underlying latent sparsity (g,h).

We introduce zero-inflated Gaussian processes that have, in contrast to standard GPs, a tendency to produce exactly zero predictions (see Figure 1). Let g(x) denote the latent "on-off" state of a function f(x). We assume GP priors for both functions with a joint model

p(y, f, g) = p(y | f) p(f | g) p(g),    (8)

where

p(y | f) = N(y | f, σ_y² I)    (9)
p(f | g) = N(f | 0, Φ(g)Φ(g)^T ∘ K_f)    (10)
p(g) = N(g | β1, K_g).    (11)

The sparsity values g(x) are squashed between 0 and 1 through the standard normal cumulative distribution, or probit link function, Φ : ℝ → [0, 1],

Φ(g) = ∫_{-∞}^{g} φ(τ) dτ = ½ [ 1 + erf(g / √2) ],    (12)

where φ(τ) = (1/√(2π)) e^{-τ²/2} is the standard normal density function. The structured probit sparsity Φ(g) models the "on-off" state smoothly, due to the latent sparsity function g having a GP prior with prior mean β. The latent function f is modeled throughout, but it is only visible during the "on" states. This masking effect has similarities to both zero-inflated and hurdle models: the underlying latent function f is learned from only non-zero data, similarly to hurdle models, but the function f is allowed to predict zeros, similarly to zero-inflated models.

The key part of our model is the probit-sparsified covariance Φ(g)Φ(g)^T ∘ K, where the "on-off" state Φ(g) has the ability to zero out rows and columns of the kernel matrix at the "off" states (see Figure 2f for the probit pattern Φ(g)Φ(g)^T and Figure 2b for the resulting sparse kernel). Since the sparse kernel is a Hadamard product between a covariance kernel K and an outer-product kernel Φ(g)Φ(g)^T, the Schur product theorem implies that it is a valid kernel. As the sparsity g(x) converges towards minus infinity, the probit link Φ(g(x)) approaches zero, which leads to the function distribution approaching N(f_i | 0, 0), or f_i = 0. Numerical problems are avoided since in practice Φ(g) > 0, and due to the conditioning noise variance term σ_y² > 0.

The marginal likelihood of the zero-inflated Gaussian process is intractable due to the probit sparsification of the kernel. We derive a stochastic variational Bayes approximation, which we show to be tractable due to the choice of the probit link function.
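As a small illustration of the construction in equations (10)-(12), the sketch below builds the probit-sparsified covariance and draws one latent function from it, reusing the ard_kernel helper from the previous sketch. The toy sparsity draw and all variable names are ours, chosen only for illustration.

    import numpy as np
    from scipy.stats import norm

    def sparse_probit_cov(X, g, sigma_f, lengthscales):
        # Sparse kernel of equation (10): Hadamard product of the outer product
        # Phi(g) Phi(g)^T with the Gaussian ARD covariance K_f.
        phi = norm.cdf(g)                      # probit link, equation (12)
        K = ard_kernel(X, X, sigma_f, lengthscales)
        return np.outer(phi, phi) * K

    # Rows and columns where Phi(g) is near zero are effectively zeroed out,
    # so the corresponding function values are pinned to zero.
    rng = np.random.default_rng(0)
    X = np.linspace(0, 10, 50)[:, None]
    g = rng.normal(-1.0, 1.0, size=50)         # stand-in for a latent sparsity draw
    C = sparse_probit_cov(X, g, 1.0, np.array([1.0]))
    f = rng.multivariate_normal(np.zeros(50), C + 1e-8 * np.eye(50))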

3.1 STOCHASTIC VARIATIONAL INFERENCE

Inference for standard Gaussian process models is difficult to scale, as the complexity grows as O(n³) in the data size n. Titsias (2009) proposed a variational inference approach for GPs using m ≪ n inducing variables, with a reduced computational complexity of O(m³) for m inducing points. The novelty of this approach lies in the idea that the locations and values of the inducing points can be treated as variational parameters and optimized. Hensman et al. (2013, 2015) introduced more efficient stochastic variational inference (SVI) with factorised likelihoods, which has been demonstrated with up to a billion data points (Salimbeni & Deisenroth, 2017). This approach cannot be directly applied to sparse kernels due to having to compute the expectation of the probit product in the covariance. We derive the SVI bound tractably for the zero-inflated model and its sparse kernel, which is necessary in order to apply efficient parameter estimation techniques with automatic differentiation in frameworks such as TensorFlow (Abadi et al., 2016).

We begin by applying the inducing point augmentations f(z_f) = u_f and g(z_g) = u_g for both the latent function f(·) and the sparsity function g(·). We place m inducing points u_f1, ..., u_fm and u_g1, ..., u_gm for the two functions. The augmented joint distribution is p(y, f, g, u_f, u_g) = p(y | f) p(f | g, u_f) p(g | u_g) p(u_f) p(u_g), where (we drop the implicit conditioning on the z's for clarity)

p(f | g, u_f) = N(f | diag(Φ(g)) Q_f u_f, Φ(g)Φ(g)^T ∘ K̃_f)    (13)
p(g | u_g) = N(g | Q_g u_g, K̃_g)    (14)
p(u_f) = N(u_f | 0, K_fmm)    (15)
p(u_g) = N(u_g | 0, K_gmm),    (16)

and where

Q_f = K_fnm K_fmm^{-1}    (17)
Q_g = K_gnm K_gmm^{-1}    (18)
K̃_f = K_fnn - K_fnm K_fmm^{-1} K_fmn    (19)
K̃_g = K_gnn - K_gnm K_gmm^{-1} K_gmn.    (20)

We denote the kernels for the functions f and g by the corresponding subscripts. The kernel K_fnn is between all n data points, the kernel K_fnm is between all n data points and the m inducing points, and the kernel K_fmm is between all m inducing points (similarly for g).

The distributions p(f | u_f) and p(g | u_g) can be obtained by conditioning the joint GP prior between the respective latent and inducing functions. Further, the conditional distribution p(f | g, u_f) can be obtained by the sparsity augmentation of the latent conditional f | u_f, similar to equation (10) (see Supplements).

Next we use the standard variational approach by introducing approximative variational distributions for the inducing points,

q(u_f) = N(u_f | m_f, S_f)    (21)
q(u_g) = N(u_g | m_g, S_g),    (22)

where S_f, S_g ∈ ℝ^{m×m} are square positive semi-definite matrices. The variational joint posterior is

q(f, g, u_f, u_g) = p(f | g, u_f) p(g | u_g) q(u_f) q(u_g).    (23)

We minimize the Kullback-Leibler divergence between the true augmented posterior p(f, g, u_f, u_g | y) and the variational distribution q(f, g, u_f, u_g), which is equivalent to maximizing the following evidence lower bound (as shown by e.g. Hensman et al. (2015)):

log p(y) ≥ E_{q(f)} log p(y | f) - KL[q(u_f, u_g) || p(u_f, u_g)],    (24)

where we define

q(f) = ∫∫∫ p(f | g, u_f) q(u_f) p(g | u_g) q(u_g) du_f du_g dg = ∫ q(f | g) q(g) dg,    (25)

and where the variational approximations are tractably

q(g) = ∫ p(g | u_g) q(u_g) du_g = N(g | μ_g, Σ_g)    (26)
q(f | g) = ∫ p(f | g, u_f) q(u_f) du_f = N(f | diag(Φ(g)) μ_f, Φ(g)Φ(g)^T ∘ Σ_f),    (27)

with

μ_f = Q_f m_f    (28)
μ_g = Q_g m_g    (29)
Σ_f = K_fnn + Q_f (S_f - K_fmm) Q_f^T    (30)
Σ_g = K_gnn + Q_g (S_g - K_gmm) Q_g^T.    (31)

We additionally assume that the likelihood p(y | f) = Π_{i=1}^{n} p(y_i | f_i) factorises.

We solve the final ELBO of equations (24) and (25) as (see Supplements for the detailed derivation)

L_ZI = Σ_{i=1}^{n} { log N(y_i | ⟨Φ(g_i)⟩_{q(g_i)} μ_fi, σ_y²)
       - (1 / (2σ_y²)) [ Var[Φ(g_i)] μ_fi² + ⟨Φ(g_i)²⟩_{q(g_i)} σ_fi² ] }
       - KL[q(u_f) || p(u_f)] - KL[q(u_g) || p(u_g)],    (32)

where μ_fi is the i'th element of μ_f and σ_fi² is the i'th diagonal element of Σ_f (similarly with g).
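The interpolation quantities above have a direct numerical counterpart. The sketch below computes Q = K_nm K_mm^{-1} from equations (17)-(18) and the marginal moments of equations (28)-(31) for one of the two processes; it is a schematic re-implementation under our own naming, not the paper's code.

    import numpy as np

    def variational_marginal(Knn, Knm, Kmm, m, S, jitter=1e-6):
        # Q = K_nm K_mm^{-1} (equations (17)-(18)) via a Cholesky solve, then
        # mu = Q m and Sigma = K_nn + Q (S - K_mm) Q^T (equations (28)-(31)).
        L = np.linalg.cholesky(Kmm + jitter * np.eye(len(Kmm)))
        Q = np.linalg.solve(L.T, np.linalg.solve(L, Knm.T)).T
        mu = Q @ m
        Sigma = Knn + Q @ (S - Kmm) @ Q.T
        return mu, Sigma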

The expectations are tractable,

⟨Φ(g_i)⟩_{q(g_i)} = Φ(λ_gi),   λ_gi = μ_gi / √(1 + σ_gi²)    (33)
⟨Φ(g_i)²⟩_{q(g_i)} = Φ(λ_gi) - 2 T(λ_gi, 1/√(1 + 2σ_gi²))    (34)
Var[Φ(g_i)] = Φ(λ_gi) - 2 T(λ_gi, 1/√(1 + 2σ_gi²)) - Φ(λ_gi)².    (35)

The Owen's T function T(a, b) = φ(a) ∫_0^b φ(aτ)/(1 + τ²) dτ (Owen, 1956) has efficient numerical solutions in practice (Patefield & Tandy, 2000).
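These moments have a simple closed-form implementation. The sketch below evaluates them with SciPy's Owen's T routine and checks them against Monte Carlo; it assumes the standard Gaussian-probit identity E[Φ(g)²] = Φ(λ) - 2T(λ, 1/√(1+2σ²)) stated above, and the function and variable names are ours.

    import numpy as np
    from scipy.stats import norm
    from scipy.special import owens_t

    def probit_moments(mu_g, var_g):
        # <Phi(g)>, <Phi(g)^2> and Var[Phi(g)] under q(g) = N(mu_g, var_g),
        # as in equations (33)-(35).
        lam = mu_g / np.sqrt(1.0 + var_g)
        e_phi = norm.cdf(lam)
        e_phi2 = e_phi - 2.0 * owens_t(lam, 1.0 / np.sqrt(1.0 + 2.0 * var_g))
        return e_phi, e_phi2, e_phi2 - e_phi**2

    # Monte Carlo sanity check of the closed forms.
    rng = np.random.default_rng(1)
    g = rng.normal(0.3, 0.8, size=200_000)
    print(probit_moments(0.3, 0.8**2))
    print(norm.cdf(g).mean(), (norm.cdf(g)**2).mean(), norm.cdf(g).var())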

The ELBO is considerably more complex than the standard stochastic variational bound of a Gaussian process (Hensman et al., 2013), due to the probit-sparsified covariance. The bound is likely only tractable for the choice of the probit link function Φ(g), while other link functions such as the logit would lead to intractable bounds necessitating slower numerical integration (Hensman et al., 2015).

We optimize L_ZI with stochastic gradient ascent techniques with respect to the inducing locations z_g, z_f, the inducing value means m_f, m_g and covariances S_f, S_g, the sparsity prior mean β, the noise variance σ_y², the signal variances σ_f, σ_g, and finally the dimension-specific lengthscales ℓ_f1, ..., ℓ_fD; ℓ_g1, ..., ℓ_gD of the Gaussian ARD kernel.

4 GAUSSIAN PROCESS NETWORK

The Gaussian Process Regression Networks (GPRN) framework by Wilson et al. (2012) is an efficient model for multi-target regression problems, where each individual output is a linear but non-stationary combination of shared latent functions. Formally, a vector-valued output function y(x) ∈ ℝ^P with P outputs is modeled using vector-valued latent functions f(x) ∈ ℝ^Q with Q latent values and mixing weights W(x) ∈ ℝ^{P×Q} as

y(x) = W(x) [ f(x) + ε ] + ε_y,    (36)

where for all q = 1, ..., Q and p = 1, ..., P we assume GP priors and additive zero-mean noises,

f_q(x) ∼ GP(0, K_f(x, x'))    (37)
W_qp(x) ∼ GP(0, K_w(x, x'))    (38)
ε_q ∼ N(0, σ_f²)    (39)
ε_p ∼ N(0, σ_y²).    (40)

The subscripts are used to denote the individual components of f and W, with p and q indicating the p'th output dimension and the q'th latent dimension, respectively. We assume shared latent and output noise variances σ_f², σ_y² without loss of generality. The distributions of both functions f and W have been inferred either with variational EM (Wilson et al., 2012) or by a variational mean-field approximation with diagonalized latent and mixing functions (Nguyen & Bonilla, 2013).

4.1 STOCHASTIC VARIATIONAL INFERENCE

Variational inference for GPRN has been proposed earlier with a diagonalized mean-field approximation by Nguyen & Bonilla (2013). Further, stochastic variational inference with inducing variables has been proposed for GPRN (Nguyen et al., 2014). In this section we rederive the SVI bound for the standard GPRN for completeness, and then propose the novel sparse GPRN model and solve its SVI bounds as well in the following section.

We begin by introducing the inducing variable augmentation technique for the latent functions f(x) and mixing weights W(x), with u_f, z_f = {u_fq, z_fq}_{q=1}^{Q} and u_w, z_w = {u_wqp, z_wqp}_{q,p=1}^{Q,P}:

p(y, f, W, u_f, u_w) = p(y | f, W) p(f | u_f) p(W | u_w) p(u_f) p(u_w)    (41)

p(f | u_f) = Π_{q=1}^{Q} N(f_q | Q_fq u_fq, K̃_fq)    (42)
p(W | u_w) = Π_{q,p=1}^{Q,P} N(w_qp | Q_wqp u_wqp, K̃_wqp)    (43)
p(u_f) = Π_{q=1}^{Q} N(u_fq | 0, K_fq,mm)    (44)
p(u_w) = Π_{q,p=1}^{Q,P} N(u_wqp | 0, K_wqp,mm),    (45)

where we have separate kernels K and extrapolation matrices Q for each component of W(x) and f(x), of the same form as in equations (17–20). Here w is a vectorised form of W. The variational approximation is then

q(f, W, u_f, u_w) = p(f | u_f) p(W | u_w) q(u_f) q(u_w)    (46)
q(u_f) = Π_{q=1}^{Q} N(u_fq | m_fq, S_fq)    (47)
q(u_w) = Π_{q,p=1}^{Q,P} N(u_wqp | m_wqp, S_wqp),    (48)

where u_wqp and u_fq indicate the inducing points for the functions W_qp(x) and f_q(x), respectively. The ELBO can now be stated as

log p(y) ≥ E_{q(f,W)} log p(y | f, W) - KL[q(u_f, u_w) || p(u_f, u_w)],    (49)

where the variational distributions decompose as q(f, W) = q(f) q(W), with marginals of the same form as in equations (28–31),

q(f) = ∫ q(f | u_f) q(u_f) du_f = N(f | μ_f, Σ_f)    (50)
q(W) = ∫ q(W | u_w) q(u_w) du_w = N(w | μ_w, Σ_w).    (51)

Since the noise term ε is assumed to be isotropic Gaussian, the density p(y | W, f) factorises across all target observations and dimensions. The expectation term in equation (49) then reduces to solving the following integral for the i'th observation and p'th target dimension,

Σ_{i,p=1}^{N,P} ∫∫ log N(y_p,i | w_p,i^T f_i, σ_y²) q(f_i, w_p,i) dw_p,i df_i.    (52)

The above integral has a closed-form solution, resulting in the final ELBO (see Supplements)

L_GPRN = Σ_{i=1}^{N} Σ_{p=1}^{P} { log N(y_p,i | Σ_{q=1}^{Q} μ_wqp,i μ_fq,i, σ_y²)
         - (1 / (2σ_y²)) Σ_{q=1}^{Q} [ μ_wqp,i² σ_fq,i² + μ_fq,i² σ_wqp,i² + σ_wqp,i² σ_fq,i² ] }
         - Σ_{q,p} KL[q(u_wqp, u_fq) || p(u_wqp, u_fq)],    (53)

where μ_fq,i is the i'th element of μ_fq and σ_fq,i² is the i'th diagonal element of Σ_fq (similarly for the W_qp's).
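The expectation inside equation (53) has a particularly compact form for one observation and one output: a Gaussian log-density at the product of the means, minus a variance correction. The sketch below evaluates that per-point term; the names are ours, and the full bound additionally subtracts the KL terms of equation (53).

    import numpy as np

    def gprn_expected_loglik(y_pi, mu_w, var_w, mu_f, var_f, sigma_y2):
        # Per-observation, per-output expected log-likelihood term of equation (53).
        # mu_w, var_w, mu_f, var_f hold the Q marginal means and variances of the
        # mixing weights W_qp(x_i) and latent functions f_q(x_i).
        mean = np.dot(mu_w, mu_f)
        corr = np.sum(mu_w**2 * var_f + mu_f**2 * var_w + var_w * var_f)
        loglik = -0.5 * np.log(2 * np.pi * sigma_y2) - 0.5 * (y_pi - mean)**2 / sigma_y2
        return loglik - 0.5 * corr / sigma_y2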

5 SPARSE GAUSSIAN PROCESS NETWORK

In this section we demonstrate how zero-inflated GPs can be used as plug-in components in other standard models. In particular, we propose a significant modification to the GPRN by adding sparsity to the mixing matrix components. This corresponds to each of the P outputs being a sparse mixture of the Q latent functions, i.e. they can effectively use any subset of the Q latent dimensions by having zeros for the rest in the mixing functions. This makes the mixture more easily interpretable, and induces a variable number of latent functions to explain the output at each input x. The latent function f can also be sparsified, with a derivation analogous to the derivation below.

We extend the GPRN with probit sparsity for the mixing matrix W, resulting in the joint model

p(y, f, W, g) = p(y | f, W) p(f) p(W | g) p(g),    (54)

where all individual components of the latent function f and the mixing matrix W are given GP priors. We encode the sparsity terms g for all the Q × P mixing functions W_qp(x) as

p(W_qp | g_qp) = N(w_qp | 0, Φ(g_qp)Φ(g_qp)^T ∘ K_w).    (55)

To introduce variational inference, the joint model is augmented with three sets of inducing variables for f, W and g. After marginalizing out the inducing variables as in equations (25–27), the marginal likelihood can be bounded as

log p(y) ≥ E_{q(f,W,g)} log p(y | f, W) - KL[q(u_f, u_w, u_g) || p(u_f, u_w, u_g)].    (56)

The joint distribution in the variational expectation factorizes as q(f, W, g) = q(f) q(W | g) q(g). Also, with a Gaussian noise assumption, the expectation term factorises across all observations and target dimensions. The key step reduces to solving the following integrals:

Σ_{i,p=1}^{N,P} ∫∫∫ log N(y_p,i | (w_p,i ∘ Φ(g_p,i))^T f_i, σ_y²) q(f_i, w_p,i, g_p,i) dw_p,i df_i dg_p,i.    (57)

The above integral has a tractable solution, leading to the final sparse GPRN evidence lower bound (see Supplements)

L_sGPRN = Σ_{i=1}^{N} Σ_{p=1}^{P} { log N(y_p,i | Σ_{q=1}^{Q} μ_wqp,i μ_gqp,i μ_fq,i, σ_y²)
          - (1 / (2σ_y²)) Σ_{q=1}^{Q} (μ_gqp,i² + σ_gqp,i²)(μ_wqp,i² σ_fq,i² + μ_fq,i² σ_wqp,i² + σ_wqp,i² σ_fq,i²)
          - (1 / (2σ_y²)) Σ_{q=1}^{Q} σ_gqp,i² μ_fq,i² μ_wqp,i² }
          - Σ_{q,p} KL[q(u_fq, u_wqp, u_gqp) || p(u_fq, u_wqp, u_gqp)],    (58)

where μ_fq,i and μ_wqp,i are the variational expectation means for f(·) and W(·) as in equations (28, 29), μ_gqp,i is the variational expectation mean of the probit sparsity Φ(g(·)) as in equation (33), and analogously for the variances.
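Relative to the dense GPRN term after equation (53), the sparse bound of equation (58) only adds the probit moments of the support process. A per-point sketch with our own illustrative names, again omitting the KL terms of the bound:

    import numpy as np

    def sgprn_expected_loglik(y_pi, mu_w, var_w, mu_f, var_f, e_phi, var_phi, sigma_y2):
        # Per-observation, per-output expected log-likelihood term of equation (58).
        # e_phi and var_phi are the probit moments <Phi(g_qp,i)> and Var[Phi(g_qp,i)]
        # from equations (33)-(35); all arrays have length Q.
        mean = np.sum(e_phi * mu_w * mu_f)
        corr = np.sum((e_phi**2 + var_phi) * (mu_w**2 * var_f + mu_f**2 * var_w + var_w * var_f)
                      + var_phi * mu_w**2 * mu_f**2)
        loglik = -0.5 * np.log(2 * np.pi * sigma_y2) - 0.5 * (y_pi - mean)**2 / sigma_y2
        return loglik - 0.5 * corr / sigma_y2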

6 EXPERIMENTS

First we demonstrate how the proposed method can be used for regression problems with zero-inflated targets. We do that both on a simulated dataset and in a real-world climate modeling scenario on a Finnish rain precipitation dataset with approximately 90% zeros. Finally, we demonstrate the GPRN model and how it improves both interpretability and predictive performance on the Jura geological dataset.

We use the squared exponential kernel with ARD in all experiments. All the parameters, including inducing locations, values and variances as well as kernel parameters, were learned through stochastic Adam optimization (Kingma & Ba, 2014) on the TensorFlow (Abadi et al., 2016) platform.

We compare our approach ZiGP to a baseline Zero model (majority voting), and to conventional Gaussian process regression (GPr) and classification (GPc) with SVI approximations from the GPflow package (Matthews et al., 2017). Finally, we also compare to first classifying the non-zeros and successively applying regression either to all data points (GPcr), or to only the predicted non-zeros (GPcr≠0, a hurdle model).

We record the predictive performance by considering the mean squared error and mean absolute error. We also compare the models' ability to predict true zeros with the F1 score, accuracy, precision, and recall of the optimal models.

Figure 3: ZiGP model fit on the precipitation dataset. Sample of the actual data (a) against the sparse rain function estimate (b), with the probit support function (c) showing the rain progress.

6.1 SPATIO-TEMPORAL DATASET

Zero-inflated cases are commonly found in the climatology and ecology domains. In this experiment we demonstrate the proposed method by modeling precipitation in Finland. The dataset consists of hourly quantitative non-negative observations of precipitation amount across 105 observatory locations in Finland for the month of July 2018. The dataset contains 113015 datapoints with approximately 90% zero precipitation observations. The data inputs are three-dimensional: latitude, longitude and time. Due to the size of the data, this experiment illustrates the scalability of the variational inference.

We randomly split 80% of the data for training and the remaining 20% for testing. We split across time only, such that at a single measurement time, all locations are simultaneously either in the training set or in the test set.

Table 1: Results for the precipitation dataset over the baseline (Zero; majority voting), four competing methods (GPc, GPr, GPcr, GPcr≠0) and the proposed method ZiGP on test data. The columns list both quantitative and qualitative performance criteria; best performance is boldfaced.

We further utilize the underlying spatio-temporal grid structure of the data to perform inference in an efficient manner with Kronecker techniques (Saatchi, 2011). All the kernels for the latent processes are assumed to factorise as K = K_space ⊗ K_time, which allows placing inducing points independently on the spatial and temporal grids.

Figure 4: The distribution of errors on the rain dataset with the ZiGP and the GPr. The zero-inflated GP achieves a much higher number of perfect (zero) predictions.

Figure 3 depicts the components of the zero-inflated GP model on the precipitation dataset. As shown in panel (c), the latent support function models the presence or absence of rainfall. It smoothly follows the change in rain patterns across hourly observations. The amount of precipitation is modeled by the other latent process, and the combination of these two results in sparse predictions. Figure 4 shows that the absolute error distribution is remarkably better with the ZiGP model due to it identifying the absence of rain exactly. While both models fit the high rainfall regions well, for zero and near-zero regions GPr does not refine its small errors. Table 1 indicates that the ZiGP model achieves the lowest mean squared error, while also achieving the highest F1 score, which takes into account the class imbalance that biases the elementary accuracy, precision and recall quantities towards the majority class.

6.2 MULTI-OUTPUT PREDICTION - JURA

In this experiment we model the multi-response Jura dataset with the sparse Gaussian process regression network (sGPRN) model and compare it with the standard GPRN as a baseline. Jura contains concentration measurements of cadmium, nickel and zinc metals in the region of Swiss Jura. We follow the experimental procedure of Wilson et al. (2012) and Nguyen & Bonilla (2013). The training set consists of n = 259 observations across D = 2 dimensional geo-spatial locations, and the test set consists of 100 separate locations. For both models we use Q = 2 latent functions with the stochastic variational inference techniques proposed in this paper. The sparse GPRN uses a sparsity-inducing kernel in the mixing weights. The locations of the inducing points for the weights W(x) and the support g(x) are shared. The kernel length-scales are given a gamma prior with shape parameter α = 0.3 and rate parameter β = 1.0 to induce smoothness. We train both models 30 times with random initialization.

Figure 5: The sparse GPRN model fit on the Jura dataset with 11 inducing points. The Q = 2 (dense) latent functions (a) are combined with the 3 × 2 sparse mixing functions (b) into the P = 3 output predictions (c). The real data are shown in (d). The white mixing regions are estimated 'off'.

Table 2 shows that our model performs better than the state-of-the-art SVI-GPRN, both with m = 5 and m = 10 inducing points. Figure 5 visualises the optimized sparse GPRN model, while Figure 6 indicates the sparsity pattern in the mixing weights. The weights have considerable smooth 'on' regions (black) and 'off' regions (white). The 'off' regions indicate that for certain locations, only one of the two latent functions is adaptively utilised.

Figure 6: The sparse probit support (a) and latent functions (b) of the weight function W(x) of the optimized sparse GPRN model. The black regions of (a) show regional activations, while the white regions show where the latent functions are 'off'. The elementwise product of the support and weight functions is shown in Figure 5(b).

Table 2: Results for the Jura dataset for the sparse GPRN and vanilla GPRN models on test data, with m = 5, 10 and 15 inducing points and RMSE and MAE reported per metal (cadmium, nickel, zinc). Best performance is boldfaced. We do not report RMSE and MAE values for GPc, since it is a classification method.

6.3 MULTI-OUTPUT PREDICTION - SARCOS

In this experiment we tackle the problem of learning the inverse dynamics of the seven degrees of freedom of the SARCOS anthropomorphic robot arm (Vijayakumar et al., 2005). The dataset consists of 48,933 observations with an input space of 21 dimensions (7 joint positions, 7 joint velocities, 7 joint accelerations). The multi-output prediction task is to learn a mapping from these input variables to the corresponding 7 joint torques of the robot. Multi-output GPs have been previously used for inverse dynamics modeling (Williams et al., 2009), but in a different model setting and on a smaller dataset. GPRN with a stochastic inference framework has also been explored to model the SARCOS dataset (Nguyen et al., 2014); however, they use a different experimental setting and consider 2 of the 7 joint torques as multi-output targets.

We consider an 80%-20% random split of the full dataset for training and testing, respectively. Both the GPRN and sGPRN models are trained with m = 50, 100 and 150 inducing points and Q = 2 and 3 latent functions. We repeat the experiment 20 times and report the normalized MSE in Table 3. The sparse GPRN gives better results than the standard GPRN in all our experimental settings. Moreover, the sparse model (nMSE 0.0096) gives a 12% improvement over the standard model (nMSE 0.0108) for the best test performance, with Q = 3 latent functions and m = 150.

Table 3: Normalized MSE results on the SARCOS test data for the sparse GPRN and standard GPRN models. Best performance is boldfaced.

    Model   Q    m = 50    m = 100   m = 150
    GPRN    2    0.0167    0.0145    0.0127
    GPRN    3    0.0146    0.0121    0.0108
    sGPRN   2    0.0159    0.0131    0.0125
    sGPRN   3    0.0140    0.0117    0.0096

7 DISCUSSION

We proposed a novel paradigm of zero-inflated Gaussian processes with a novel sparse kernel. The sparsity in the kernel is modeled with smooth probit filtering of the covariance rows and columns. This model induces zeros in the prediction function outputs, which is highly useful for zero-inflated datasets with an excess of zero observations. Furthermore, we showed how the zero-inflated GP can be used to model sparse mixtures of latent signals with the proposed sparse Gaussian process network. The latent mixture model with sparse mixing coefficients leads to locally using only a subset of the latent functions, which improves interpretability and reduces model complexity. We demonstrated tractable solutions to the stochastic variational inference of the sparse probit kernel for the zero-inflated GP, the conventional GPRN, and the sparse GPRN models, which lends itself to efficient exploration of the parameter space of the model.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions and comments. This work has been supported by the Academy of Finland grants no. 294238 and 292334.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Abraham, Z. and Tan, P.-N. An integrate

