Bayesian Inference: An Introduction to Principles and Practice in Machine Learning


Bayesian Inference: An Introduction to Principles and Practice in Machine Learning

Michael E. Tipping
Microsoft Research, Cambridge, U.K.

Published as: "Bayesian inference: An introduction to principles and practice in machine learning." In O. Bousquet, U. von Luxburg, and G. Rätsch (Eds.), Advanced Lectures on Machine Learning, pp. 41–62. Springer, 2004.

Abstract

This article gives a basic introduction to the principles of Bayesian inference in a machine learning context, with an emphasis on the importance of marginalisation for dealing with uncertainty. We begin by illustrating concepts via a simple regression task before relating ideas to practical, contemporary techniques with a description of 'sparse Bayesian' models and the 'relevance vector machine'.

1 Introduction

What is meant by "Bayesian inference" in the context of machine learning? To assist in answering that question, let's start by proposing a conceptual task: we wish to learn, from some given number of example instances of them, a model of the relationship between pairs of variables A and B. Indeed, many machine learning problems are of the type "given A, what is B?".¹

Verbalising what we typically treat as a mathematical task raises an interesting question in itself. How do we answer "what is B?"? Within the appealingly well-defined and axiomatic framework of propositional logic, we 'answer' the question with complete certainty, but this logic is clearly too rigid to cope with the realities of real-world modelling, where uncertainty over 'truth' is ubiquitous. Our measurements of both the dependent (B) and independent (A) variables are inherently noisy and inexact, and the relationships between the two are invariably non-deterministic. This is where probability theory comes to our aid, as it furnishes us with a principled and consistent framework for meaningful reasoning in the presence of uncertainty.

We might think of probability theory, and in particular Bayes' rule, as providing us with a "logic of uncertainty" [1]. In our example, given A we would 'reason' about the likelihood of the truth of B (let's say B is binary, for example) via its conditional probability P(B|A): that is, "what is the probability of B given that A takes a particular value?". An appropriate answer might be "B is true with probability 0.6". One of the primary tasks of 'machine learning' is then to approximate P(B|A) with some appropriately specified model based on a given set of corresponding examples of A and B.²

¹ In this article we will focus exclusively on such 'supervised learning' tasks, although of course there are other modelling applications which are equally amenable to Bayesian inferential techniques.
² In many learning methods, this conditional probability approximation is not made explicit, though such an interpretation may exist. However, one might consider it a significant limitation if a particular machine learning procedure cannot be expressed coherently within a probabilistic framework.

It is in the modelling procedure where Bayesian inference comes to the fore. We typically (though not exclusively) deploy some form of parameterised model for our conditional probability:

    P(B|A) \approx f(A; w),    (1)

where w denotes a vector of all the 'adjustable' parameters in the model. Then, given a set D of N examples of our variables, D = {A_n, B_n}_{n=1}^{N}, a conventional approach would involve the maximisation of some measure of 'accuracy' (or minimisation of some measure of 'loss') of our model for D with respect to the adjustable parameters. We can then make predictions, given A, for unknown B by evaluating f(A; w) with the parameters w set to their optimal values. Of course, if our model f is made too complex (perhaps there are many adjustable parameters w), we risk over-specialising to the observed data D, and consequently realising a poor model of the true underlying distribution P(B|A).

The first key element of the Bayesian inference paradigm is to treat parameters such as w as random variables, exactly the same as A and B. So the conditional probability now becomes P(B|A, w), and the dependency of the probability of B on the parameter settings, as well as on A, is made explicit. Rather than 'learning' comprising the optimisation of some quality measure, a distribution over the parameters w is inferred from Bayes' rule. We will demonstrate this concept by means of a simple example regression task in Section 2.

To obtain this 'posterior' distribution over w alluded to above, it is necessary to specify a 'prior' distribution p(w) before we observe the data. This may be considered an inconvenience, but Bayesian inference treats all sources of uncertainty in the modelling process in a unified and consistent manner, and forces us to be explicit as regards our assumptions and constraints; this in itself is arguably a philosophically appealing feature of the paradigm.

However, the most attractive facet of a Bayesian approach is the manner in which "Ockham's Razor" is automatically implemented by 'integrating out' all irrelevant variables. That is, under the Bayesian framework there is an automatic preference for simple models that sufficiently explain the data without unnecessary complexity. We demonstrate this key feature in Section 3, and in particular underline the point that this property holds even if the prior p(w) is completely uninformative. We show that, in practical terms, the concept of Ockham's Razor enables us to 'set' regularisation parameters and 'select' models without the need for any additional validation procedure.

The practical disadvantage of the Bayesian approach is that it requires us to perform integrations over variables, and many of these computations are analytically intractable. As a result, much contemporary research in Bayesian approaches to machine learning relies on, or is directly concerned with, approximation techniques. However, we show in Section 4, where we describe the "sparse Bayesian" model, that a combination of analytic calculation and straightforward, practically efficient approximation can offer state-of-the-art results.

2 From Least-Squares to Bayesian Inference

We introduce the methodology of Bayesian inference by considering an example prediction (regression) problem. Let us assume we are given a very simple data set (illustrated later within Figure 1) comprising N = 15 samples artificially generated from the function y = sin(x) with added Gaussian noise of variance 0.2. We will denote the 'input' variables in our example by x_n, n = 1, ..., N. For each such x_n there is an associated real-valued 'target' t_n, n = 1, ..., N, and from these input-target pairs we wish to 'learn' the underlying functional mapping.
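To make the running example concrete, here is a minimal sketch of generating such a data set in Python/NumPy. The input range [0, 2π], the random seed and the variable names are illustrative assumptions rather than details taken from the text; the noise variance of 0.2 corresponds to a standard deviation of sqrt(0.2).

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed seed, for reproducibility only

N = 15
noise_variance = 0.2

# Inputs assumed spread over one period of sin(x); the exact spacing is illustrative.
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=N))

# Noisy targets: t_n = sin(x_n) + eps_n, with eps_n ~ N(0, 0.2).
t = np.sin(x) + rng.normal(0.0, np.sqrt(noise_variance), size=N)
```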

2.1 Linear Models

We will model this data with some parameterised function y(x; w), where w = (w_1, w_2, ..., w_M) is the vector of adjustable model parameters. Here we consider linear (strictly, "linear-in-the-parameters") models which are a linearly-weighted sum of M fixed (but potentially nonlinear) basis functions \phi_m(x):

    y(x; w) = \sum_{m=1}^{M} w_m \phi_m(x).    (2)

For our purposes here, we make the common choice to utilise Gaussian data-centred basis functions \phi_m(x) = \exp\{-(x - x_m)^2 / r^2\}, which gives us a 'radial basis function' (RBF) type model.

2.1.1 "Least-squares" Approximation

Our objective is to find values for w such that y(x; w) makes good predictions for new data: i.e. it models the underlying generative function. A classic approach to estimating y(x; w) is "least-squares", minimising the error measure:

    E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \Big[ t_n - \sum_{m=1}^{M} w_m \phi_m(x_n) \Big]^2.    (3)

If t = (t_1, ..., t_N)^T and \Phi is the 'design matrix' such that \Phi_{nm} = \phi_m(x_n), then the minimiser of (3) is obtained in closed form via linear algebra:

    w_{LS} = (\Phi^T \Phi)^{-1} \Phi^T t.    (4)

However, with M = 15 basis functions and only N = 15 examples here, we know that minimisation of squared error leads to a model which exactly interpolates the data samples, as shown in Figure 1.

Figure 1: Overfitting? The 'ideal fit' is shown on the left, while the least-squares RBF fit using 15 basis functions is shown on the right and perfectly interpolates all the data points.

Now, we may look at Figure 1 and exclaim "the function on the right is clearly over-fitting!". But, without prior knowledge of the 'truth', can we really judge which model is genuinely better? The answer is that we can't: in a real-world problem, the data could quite possibly have been generated by a complex function such as that shown on the right. The only way that we can proceed to meaningfully learn from data such as this is by imposing some a priori prejudice on the nature of the complexity of functions we expect to elucidate. A common way of doing this is via 'regularisation'.
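Continuing the sketch above, the least-squares RBF fit of (2)–(4) can be written in a few lines. The basis centres are placed on the training inputs, the width r = 1.0 is an illustrative guess (how to choose it is discussed later), and np.linalg.lstsq is used in place of the explicit inverse in (4) for numerical stability.

```python
import numpy as np

def design_matrix(x, centres, r):
    """Phi[n, m] = exp(-(x_n - x_m)^2 / r^2), the Gaussian RBF basis of equation (2)."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / r ** 2)

r = 1.0                        # illustrative basis width
Phi = design_matrix(x, x, r)   # x, t from the data-generation sketch; M = N = 15

# Least-squares weights, equation (4); lstsq avoids forming (Phi^T Phi)^{-1} explicitly.
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Predictions on a dense grid (this is the interpolating curve of Figure 1, right panel).
x_grid = np.linspace(0.0, 2.0 * np.pi, 200)
y_grid = design_matrix(x_grid, x, r) @ w_ls
```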

2.2 Complexity Control: Regularisation

A common, and generally very reasonable, assumption is that we typically expect data to be generated from smooth, rather than complex, functions. In a linear model framework, smoother functions typically have smaller weight magnitudes, so we can penalise complex functions by adding an appropriate penalty term to the cost function that we minimise:

    \hat{E}(w) = E_D(w) + \lambda E_W(w).    (5)

A standard choice is the squared-weight penalty, E_W(w) = \frac{1}{2} \sum_{m=1}^{M} w_m^2, which conveniently gives the "penalised least-squares" (PLS) estimate for w:

    w_{PLS} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t.    (6)

The hyperparameter \lambda balances the trade-off between E_D(w) and E_W(w), i.e. between how well the function fits the data and how smooth it is. Given that we can compute the weights directly for a given \lambda, the learning problem is now transformed into one of finding an appropriate value for that hyperparameter. A very common approach is to assess potential values of \lambda according to the error calculated on a set of 'validation' data (i.e. data which is not used to estimate w), and examples of fits for different values of \lambda and their associated validation errors are given in Figure 2.

Figure 2: Function estimates (solid line) and validation error for three different values of the regularisation hyperparameter \lambda (the true function is shown dashed); the three panels have validation errors E = 2.11, 0.52 and 0.70. The training data is plotted in black, and the validation set in green (gray).

In practice, we might evaluate a large number of models with different hyperparameter values and select the model with lowest validation error, as demonstrated in Figure 3. We would then hope that this would give us a model which was close to 'the truth'. In this artificial case, where we know the generative function, the deviation from 'truth' is illustrated in the figure with the measurement of 'test error', the error on noise-free samples of sin(x). We can see that the minimum validation error does not quite localise the best test error, but it is arguably satisfactorily close. We'll come back to this graph in Section 3 when we look at marginalisation and how Bayesian inference can be exploited in order to estimate \lambda. For now, we look at how this regularisation approach can be initially reformulated within a Bayesian probabilistic framework.
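The penalised least-squares estimate (6) and a simple validation sweep over λ are easy to sketch. The code below continues from the earlier sketches (x, t, rng, N, noise_variance, design_matrix and r); the additional 15-example validation set and the grid of λ values are illustrative assumptions.

```python
import numpy as np

def penalised_ls(Phi, t, lam):
    """Penalised least-squares weights, equation (6): (Phi^T Phi + lam*I)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Assumed validation set, generated in the same way as the training data.
x_val = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=N))
t_val = np.sin(x_val) + rng.normal(0.0, np.sqrt(noise_variance), size=N)

Phi_train = design_matrix(x, x, r)
Phi_val = design_matrix(x_val, x, r)

best = None
for log_lam in np.linspace(-14.0, 4.0, 50):   # illustrative search grid over log(lambda)
    lam = np.exp(log_lam)
    w = penalised_ls(Phi_train, t, lam)
    val_err = 0.5 * np.sum((t_val - Phi_val @ w) ** 2)
    if best is None or val_err < best[0]:
        best = (val_err, lam, w)

val_err, lam_best, w_pls = best   # model with lowest validation error
```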

Figure 3: Plots of normalised error computed on the separate 15-example training and validation sets, along with 'test' error measured on a third, noise-free set, as a function of log λ. The minimum test and validation errors are marked with a triangle, and the intersection with the best λ computed via validation is shown.

2.3 A Probabilistic Regression Framework

We assume as before that the data is a noisy realisation of an underlying functional model: t_n = y(x_n; w) + ε_n. Applying least-squares resulted in us minimising \sum_n ε_n^2, but here we first define an explicit probabilistic model over the noise component ε_n, chosen to be a Gaussian distribution with mean zero and variance σ²: that is, p(ε_n|σ²) = N(0, σ²). Since t_n = y(x_n; w) + ε_n, it follows that p(t_n|x_n, w, σ²) = N(y(x_n; w), σ²). Assuming that each example from the data set has been generated independently (an often realistic assumption, although not always true), the likelihood of all the data is given by the product:

    p(t|x, w, \sigma^2) = \prod_{n=1}^{N} p(t_n|x_n, w, \sigma^2)    (7)
                        = \prod_{n=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{\{t_n - y(x_n; w)\}^2}{2\sigma^2} \right\}.    (8)

Note that, from now on, we will write terms such as p(t|x, w, σ²) as p(t|w, σ²), since we never seek to model the given input data x. Omitting such conditioning variables is purely for notational convenience (it implies no further model assumptions) and is common practice.
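The log of the Gaussian likelihood (7)–(8) is straightforward to evaluate numerically. The sketch below assumes x, t, design_matrix, r and the fitted weights w_pls from the earlier sketches, and uses the true generating noise variance purely for illustration.

```python
import numpy as np

def log_likelihood(t, Phi, w, sigma2):
    """Log of equation (8): independent Gaussian densities N(t_n | y(x_n; w), sigma2)."""
    resid = t - Phi @ w
    n = len(t)
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2

sigma2 = 0.2   # true noise variance of the synthetic data, used here for illustration
print(log_likelihood(t, design_matrix(x, x, r), w_pls, sigma2))
```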

2.4 Maximum Likelihood and Least-Squares

The 'maximum-likelihood' estimate for w is that value which maximises p(t|w, σ²).³ In fact, this is identical to the 'least-squares' solution, which we can see by noting that minimising squared error is equivalent to minimising the negative logarithm of the likelihood, which here is:

    -\log p(t|w, \sigma^2) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} \{t_n - y(x_n; w)\}^2.    (9)

Since the first term on the right in (9) is independent of w, this leaves only the second term, which is proportional to the squared error.

³ Although 'probability' and 'likelihood' functions may be identical, a common convention is to refer to "probability" when it is primarily interpreted as a function of the random variable t, and "likelihood" when interpreted as a function of the parameters w.

2.5 Specifying a Bayesian Prior

Of course, giving an identical solution for w as least-squares, maximum-likelihood estimation will also result in overfitting. To control the model complexity, instead of the earlier regularisation weight penalty E_W(w), we now define a prior distribution which expresses our 'degree of belief' over values that w might take:

    p(w|\alpha) = \prod_{m=1}^{M} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left\{ -\frac{\alpha}{2} w_m^2 \right\}.    (10)

This (common) choice of a zero-mean Gaussian prior expresses a preference for smoother models by declaring smaller weights to be a priori more probable. Though the prior is independent for each weight, there is a shared inverse-variance hyperparameter α, analogous to λ earlier, which moderates the strength of our 'belief'.

2.6 Posterior Inference

Previously, given our error measure and regulariser, we computed a single point estimate w_{LS} for the weights. Now, given the likelihood and the prior, we compute the posterior distribution over w via Bayes' rule:

    p(w|t, \alpha, \sigma^2) = \frac{\text{likelihood} \times \text{prior}}{\text{normalising factor}} = \frac{p(t|w, \sigma^2)\, p(w|\alpha)}{p(t|\alpha, \sigma^2)}.    (11)

As a consequence of combining a Gaussian prior and a linear model within a Gaussian likelihood, the posterior is also conveniently Gaussian: p(w|t, α, σ²) = N(µ, Σ) with

    \mu = (\Phi^T \Phi + \sigma^2 \alpha I)^{-1} \Phi^T t,    (12)
    \Sigma = \sigma^2 (\Phi^T \Phi + \sigma^2 \alpha I)^{-1}.    (13)

So instead of 'learning' a single value for w, we have inferred a distribution over all possible values. In effect, we have updated our prior 'belief' in the parameter values in light of the information provided by the data t, with more posterior probability assigned to values which are both probable under the prior and which 'explain the data'.
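The Gaussian posterior (12)–(13) is simple to compute; a minimal sketch, reusing x, t, design_matrix and r from the earlier sketches with illustrative values of α and σ², is given below.

```python
import numpy as np

def weight_posterior(Phi, t, alpha, sigma2):
    """Posterior mean and covariance, equations (12)-(13), for the shared-alpha Gaussian prior."""
    M = Phi.shape[1]
    B = Phi.T @ Phi + sigma2 * alpha * np.eye(M)
    mu = np.linalg.solve(B, Phi.T @ t)
    Sigma = sigma2 * np.linalg.inv(B)
    return mu, Sigma

alpha, sigma2 = 1.0, 0.2   # illustrative hyperparameter values
mu, Sigma = weight_posterior(design_matrix(x, x, r), t, alpha, sigma2)
```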

2.6.1 MAP Estimation: a 'Bayesian' Short-cut

The "maximum a posteriori" (MAP) estimate for w is the single most probable value under the posterior distribution p(w|t, α, σ²).

Figure 7: Top: negative log model probability, −log p(t|Φ, r), for various basis sets, evaluated by analytic integration over w and Monte-Carlo averaging over α and σ². Bottom: corresponding test error for the posterior-mean predictor. Basis sets examined were 'Gaussian', exp{−(x − x_m)²/r²}; 'Laplacian', exp{−|x − x_m|/r}; sin(x); and sin(x) with cos(x). For the Gaussian and Laplacian bases, the horizontal axis denotes the varying 'width' parameter r. For the sine/cosine bases, the horizontal axis has no significance and the values are placed to the left for convenience.

Nevertheless, regarding these points, we can still leverage Bayesian techniques to considerable benefit by exploiting carefully-applied approximations. In particular, marginalised likelihoods within the Bayesian framework allow us to estimate fixed values of hyperparameters where desired and, most beneficially, choose between models and their varying parameterisations. This can all be done without the need to use validation data. Furthermore:

- it is straightforward to estimate other parameters in the model that may be of interest, e.g. the noise variance,
- we can sample from both prior and posterior models of the data,
- the exact parameterisation of the model is irrelevant when integrating out,
- we can incorporate other priors of interest in a principled manner.

We now further demonstrate these points, notably the last one, in the next section, where we present a practical framework for the inference of 'sparse' models.

4 Sparse Bayesian Models

4.1 Bayes and Contemporary Machine Learning

In the previous section we saw that marginalisation is a valuable component of the Bayesian paradigm which offers a number of advantageous features applicable to many data modelling tasks. Disadvantageously, we also saw that the integrations required for full Bayesian inference can often be analytically intractable, although approximations for simple linear models could be very effective. Historically, interest in Bayesian "machine learning" (but not statistics!) has focussed on approximations for non-linear models, e.g. for neural networks, the "evidence procedure" [7] and "hybrid Monte Carlo" sampling [5]. More recently, flexible (i.e. many-parameter) linear kernel methods have attracted much renewed interest, thanks mainly to the popularity of the "support vector machine". These kinds of models are, of course, particularly amenable to Bayesian techniques.

4.1.1 Linear Models and Sparsity

Much interest in linear models has focused on sparse learning algorithms, which set many weights w_m to zero in the estimated predictor function y(x) = \sum_m w_m \phi_m(x). Sparsity is an attractive concept; it offers elegant complexity control, feature extraction and the potential for elucidation of meaningful input variables, along with the practical benefits of computational speed and compactness.

How do we impose a preference for sparsity in a model? The most common approach is via an appropriate regularisation term or prior. The regularisation term that we have already met, E_W(w) = \sum_{m=1}^{M} |w_m|^2, corresponds to a Gaussian prior and is easy to work with, but while it is an effective way to control complexity, it does not promote sparsity. In the regularisation sense, the 'correct' term would be E_W(w) = \sum_m |w_m|^0, but this, being discontinuous in w_m, is very difficult to work with. Instead, E_W(w) = \sum_m |w_m|^1 is a workable compromise which gives reasonable sparsity and reasonable tractability, and is exploited in a number of methods, including as a Laplacian prior p(w) \propto \exp(-\sum_m |w_m|) [8]. However, there is an arguably more elegant way of obtaining sparsity within a Bayesian framework that builds effectively on the ideas outlined in the previous section, and we conclude this article with a brief outline thereof.

4.2 A Sparse Bayesian Prior

In fact, we can obtain sparsity by retaining the traditional Gaussian prior, which is great news for tractability. The modification to our earlier Gaussian prior (10) is subtle:

    p(w|\alpha_1, \ldots, \alpha_M) = \prod_{m=1}^{M} (2\pi)^{-1/2} \alpha_m^{1/2} \exp\left\{ -\frac{\alpha_m w_m^2}{2} \right\}.    (26)

In contrast to the model in Section 2, we now have M hyperparameters α = (α_1, ..., α_M), one α_m independently controlling the (inverse) variance of each weight w_m.
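A direct sketch of the log-density of (26) follows; it differs from the single-α prior (10) only in that the precision is now a vector with one entry per weight. The example values are illustrative.

```python
import numpy as np

def log_sparse_prior(w, alpha):
    """Log of equation (26): independent zero-mean Gaussians, weight m having precision alpha[m]."""
    w = np.asarray(w, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return np.sum(0.5 * np.log(alpha / (2.0 * np.pi)) - 0.5 * alpha * w ** 2)

# Example: a large alpha_m forces the prior on w_m to concentrate tightly around zero.
w_example = np.array([0.0, 1.3, 0.0, -0.7])
alpha_example = np.array([1e6, 0.5, 1e6, 0.5])
print(log_sparse_prior(w_example, alpha_example))
```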

4.2.1 A Hierarchical Prior

The prior p(w|α) is nevertheless still Gaussian, and superficially seems to have little preference for sparsity. However, it remains conditioned on α, so for full Bayesian consistency we should now define hyperpriors over all α_m. Previously, we utilised a log-uniform hyperprior; this is a special case of a Gamma hyperprior, which we introduce here for greater generality. This combination of the prior over α_m controlling the prior over w_m gives us what is often referred to as a hierarchical prior. Now, if we have p(w_m|α_m) and p(α_m), and we want to know the 'true' p(w_m), we already know what to do: we must marginalise:

    p(w_m) = \int p(w_m|\alpha_m)\, p(\alpha_m)\, d\alpha_m.    (27)

For a Gamma p(α_m), this integral is computable and we find that p(w_m) is a Student-t distribution, illustrated as a function of two parameters in Figure 8; its equivalent as a regularising penalty function would be \sum_m \log|w_m|.

Figure 8: Contour plots of Gaussian and Student-t prior distributions over two parameters. While the marginal prior p(w_1, w_2) for the 'single' hyperparameter model of Section 2 has a much sharper peak at zero than the Gaussian, it can be seen that it is not sparse, unlike the multiple 'independent' hyperparameter prior which, as well as having a sharp peak at zero, places most of its probability mass along axial ridges where the magnitude of one of the two parameters is small.
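As a quick numerical check of (27), the sketch below marginalises the Gaussian p(w_m|α_m) over a Gamma hyperprior on α_m by quadrature and compares the result with the closed-form Student-t density. The Gamma shape and rate values (a = b = 2) are illustrative; the log-uniform hyperprior corresponds to the limit of both tending to zero.

```python
import numpy as np
from scipy import integrate, stats

a, b = 2.0, 2.0   # illustrative Gamma(shape=a, rate=b) hyperprior parameters

def marginal_pdf(w):
    """Numerically evaluate p(w) = integral of N(w | 0, 1/alpha) * Gamma(alpha | a, b) d(alpha), equation (27)."""
    integrand = lambda alpha: (stats.norm.pdf(w, scale=alpha ** -0.5)
                               * stats.gamma.pdf(alpha, a, scale=1.0 / b))
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

# Closed form: a Student-t with 2a degrees of freedom and scale sqrt(b/a).
w_test = 1.7
print(marginal_pdf(w_test), stats.t.pdf(w_test, df=2 * a, scale=np.sqrt(b / a)))
```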

4.3 A Sparse Bayesian Model for Regression

We can develop a sparse regression model by following an identical methodology to the previous sections. Again, we assume independent Gaussian noise: t_n ~ N(y(x_n; w), σ²), which gives a corresponding likelihood:

    p(t|w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{\|t - \Phi w\|^2}{2\sigma^2} \right\},    (28)

where, as before, we denote t = (t_1, ..., t_N)^T, w = (w_1, ..., w_M)^T, and Φ is the N × M 'design' matrix with Φ_{nm} = \phi_m(x_n).

Following the Bayesian framework, we desire the posterior distribution over all unknowns:

    p(w, \alpha, \sigma^2|t) = \frac{p(t|w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2)}{p(t)},    (29)

which we can't compute analytically. So, as previously, we decompose this as:

    p(w, \alpha, \sigma^2|t) = p(w|t, \alpha, \sigma^2)\, p(\alpha, \sigma^2|t),    (30)

where p(w|t, α, σ²) is the 'weight posterior' distribution, and is tractable. This leaves p(α, σ²|t), which must be approximated.

4.3.1 The Weight Posterior Term

Given the data, the posterior distribution over weights is Gaussian:

    p(w|t, \alpha, \sigma^2) = \frac{p(t|w, \sigma^2)\, p(w|\alpha)}{p(t|\alpha, \sigma^2)}
                             = (2\pi)^{-(N+1)/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2}(w - \mu)^T \Sigma^{-1} (w - \mu) \right\},    (31)

with

    \Sigma = (\sigma^{-2}\Phi^T\Phi + A)^{-1},    (32)
    \mu = \sigma^{-2}\Sigma\Phi^T t,    (33)

and where we collect all the hyperparameters into a diagonal matrix: A = diag(α_1, α_2, ..., α_M). A key point to note from (31)–(33) is that if any α_m → ∞, the corresponding µ_m → 0.

4.3.2 The Hyperparameter Posterior Term

Again we will adopt the "type-II maximum likelihood" approximation, where we maximise p(t|α, σ²) to find α_MP and σ²_MP. As before, for uniform hyperpriors over log α and log σ, p(α, σ²|t) ∝ p(t|α, σ²), where the marginal likelihood p(t|α, σ²) is obtained by integrating out the weights:

    p(t|\alpha, \sigma^2) = \int p(t|w, \sigma^2)\, p(w|\alpha)\, dw
                          = (2\pi)^{-N/2} |\sigma^2 I + \Phi A^{-1}\Phi^T|^{-1/2} \exp\left\{ -\frac{1}{2} t^T (\sigma^2 I + \Phi A^{-1}\Phi^T)^{-1} t \right\}.    (34)

In Section 2 we found the single α_MP empirically, but here, for multiple (in practice, perhaps thousands of) hyperparameters …

The marginal likelihood p(t|α) is a normalised distribution over the space of all possible data sets t. Models with high α only fit (assign significant marginal probability to) data from smooth functions. Models with low values of α can fit data generated from functions that are both smooth and complex. However, because of normalisation, the low-α model must generally assign lower probability to data from smooth functions, so the marginal likelihood naturally prefers the simpler model if the data is smooth, which is precisely the meaning of Ockham's Razor. Furthermore, one can see from Figure 6 that for a data set of 'intermediate' complexity, a 'medium' value of α can be preferred. This is qualitatively analogous to the case of our example set, where we indeed find that an intermediate value of α is optimal. Note, crucially, that this is achieved without any prior preference for any particular value of α, as we originally assumed a uniform hyperprior over its logarithm. The effect of Ockham's Razor is an automatic and pleasing consequence of applying the Bayesian framework.

3.8 Model Selection

While we have concentrated so far on the search for an appropriate value of the hyperparameter α (and, to an extent, σ²), our model is also conditioned on other variables we have up to now overlooked: the choice of basis set Φ and, for our Gaussian basis, its width parameter r (as defined in Section 2.1). Ideally, we should define priors P(Φ) and p(r), and integrate out those variables when making predictions. More practically, we could use p(t|Φ, r) as a criterion for model selection, with the expectation that Ockham's Razor will assist us in selecting a model that is sufficient to explain the data but is not over-complex. In our example model, we previously optimised the marginal likelihood to find a value for α. In fact, as there are only two nuisance parameters here, it is feasible to integrate out α and σ² numerically.

In Figure 7 we evaluate several basis sets Φ and width values r by computing the integral

    p(t|\Phi, r) = \int p(t|\alpha, \sigma^2, \Phi, r)\, p(\alpha)\, p(\sigma^2)\, d\alpha\, d\sigma^2    (24)
                 \approx \frac{1}{S} \sum_{s=1}^{S} p(t|\alpha_s, \sigma_s^2, \Phi, r),    (25)

with a Monte-Carlo average where we obtain S samples log-uniformly from α ∈ [10^{-12}, 10^{12}] and σ ∈ [10^{-4}, 10^{0}].
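A minimal sketch of this Monte-Carlo average is given below, using a Gaussian marginal likelihood of the form (34) with the single shared α of Section 2 (so A = αI). The number of samples S, the random seed, and the reuse of x, t, design_matrix and r from the earlier sketches are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def log_marginal_likelihood(t, Phi, alpha, sigma2):
    """log p(t | alpha, sigma^2) as in (34), with shared prior precision A = alpha * I."""
    n = Phi.shape[0]
    C = sigma2 * np.eye(n) + Phi @ Phi.T / alpha
    return stats.multivariate_normal.logpdf(t, mean=np.zeros(n), cov=C)

rng_mc = np.random.default_rng(1)
S = 2000                                              # illustrative number of Monte-Carlo samples
log_alpha = rng_mc.uniform(np.log(1e-12), np.log(1e12), size=S)
log_sigma = rng_mc.uniform(np.log(1e-4), np.log(1e0), size=S)

Phi = design_matrix(x, x, r)
log_vals = np.array([log_marginal_likelihood(t, Phi, np.exp(la), np.exp(2.0 * ls))
                     for la, ls in zip(log_alpha, log_sigma)])

# Equation (25): log of the sample average of p(t | alpha_s, sigma_s^2), computed stably.
log_p_t_given_model = -np.log(S) + np.logaddexp.reduce(log_vals)
```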

Figure 6: A schematic plot of three marginal probability distributions for 'high', 'medium' and 'low' values of α. The figure is a simplification of the case for the actual distribution p(t|α), where for illustrative purposes the N-dimensional space of t has been compressed onto a single axis and where, notionally, data sets (instances of t) arising from simpler (smoother) functions lie towards the left-hand end of the horizontal scale, and data from complex functions to the right.

The results of Figure 7 are quite compelling: with uniform priors over all nuisance variables, i.e. with absolutely no prior knowledge imposed, we observe that test error appears very closely related to marginal likelihood. The qualitative shapes of the curves, and the relative merits, of Gaussian and Laplacian basis functions are also captured. For the Gaussian basis we come very close to obtaining the optimal value of r, in terms of test error, from just 15 examples and no validation data. Reassuringly, the simplest model that contains the 'truth', y = w_1 sin(x), is the most probable model here. We also show in the figure the model y = w_1 sin(x) + w_2 cos(x), which is also an ideal fit for the data, but it is penalised in marginal probability terms since the addition of the w_2 cos(x) term allows it to explain more data sets, and normalisation thus requires it to assign less probability to our particular set. Nevertheless, it is still some orders of magnitude more probable than the Gaussian basis model.

3.9 Summary So Far ...

Marginalisation is the key element of Bayesian inference, and hopefully some of the examples above have persuaded the reader that it can be an exceedingly powerful one.

Figure 9: The relevance vector and support vector machines applied to a regression problem using a Gaussian kernel, which demonstrates some of the advantages of the Bayesian approach (relevance vector regression: 7 relevance vectors, maximum error 0.0664, RMS error 0.0322, estimated noise 0.107; support vector regression: 29 support vectors, maximum error 0.0896, RMS error 0.0420, with C and ε found by cross-validation; true noise 0.100 in both cases). Of particular note is the sparsity of the final Bayesian model, which qualitatively appears near-optimal. It is also worth underlining that the 'nuisance' parameters C and ε for the SVM had to be found by a separate cross-validation procedure, whereas the RVM algorithm estimates them automatically, and arguably quite accurately in the case of the noise variance.

References

[1] Edward T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[2] M. Evans and T. B. Swartz. Methods for approximating integrals in statistics with special emphasis on Bayesian integration. Statistical Science, 10(3):254–272, 1995.
[3] Matt Beal and Zoubin Ghahramani. The Variational Bayes website at http://www.variational-bayes.org/, 2003.
[4] Christopher M. Bishop and Michael E. Tipping. Variational relevance vector machines. In Craig Boutilier and Moisés Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 46–53. Morgan Kaufmann, 2000.
[5] Radford M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
[6] James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, second edition, 1985.
[7] David J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720–736, 1992.
[8] Peter M. Williams. Bayesian regularisation and pruning using a Laplace prior. Neural Computation, 7(1):117–143, 1995.
[9] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
[10] Michael E. Tipping and Anita C. Faul. Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, January 3–6, 2003.
[11] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
[12] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[13] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall, 1995.
[14] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Bayesian" model, that a combination of analytic calculation and straightforward, practically e–-cient, approximation can ofier state-of-the-art results. 2 From Least-Squares to Bayesian Inference We introduce the methodology of Bayesian inference by considering an example prediction (re-gression) problem.

Related Documents:

Computational Bayesian Statistics An Introduction M. Antónia Amaral Turkman Carlos Daniel Paulino Peter Müller. Contents Preface to the English Version viii Preface ix 1 Bayesian Inference 1 1.1 The Classical Paradigm 2 1.2 The Bayesian Paradigm 5 1.3 Bayesian Inference 8 1.3.1 Parametric Inference 8

Bayesian Modeling Using WinBUGS, by Ioannis Ntzoufras, New York: Wiley, 2009. 2 PuBH 7440: Introduction to Bayesian Inference. Textbooks for this course Other books of interest (cont’d): Bayesian Comp

Comparison of frequentist and Bayesian inference. Class 20, 18.05 Jeremy Orloff and Jonathan Bloom. 1 Learning Goals. 1. Be able to explain the difference between the p-value and a posterior probability to a doctor. 2 Introduction. We have now learned about two schools of statistical inference: Bayesian and frequentist.

Why should I know about Bayesian inference? Because Bayesian principles are fundamental for statistical inference in general system identification translational neuromodeling ("computational assays") - computational psychiatry - computational neurology

of inference for the stochastic rate constants, c, given some time course data on the system state, X t.Itis therefore most natural to first consider inference for the earlier-mentioned MJP SKM. As demonstrated by Boys et al. [6], exact Bayesian inference in this settin

variety of modeling problems. With this work, we provide a general introduction to amortized Bayesian parameter estima-tion and model comparison and demonstrate the applicability of the proposed methods on a well-known class of intractable response-time models. Keywords: Bayesian inference; Neural netwo

value of the parameter remains uncertain given a nite number of observations, and Bayesian statistics uses the posterior distribution to express this uncertainty. A nonparametric Bayesian model is a Bayesian model whose parameter space has in nite dimension. To de ne a nonparametric Bayesian model, we have

Introduction to Digital Logic with Laboratory Exercises 6 A Global Text. This book is licensed under a Creative Commons Attribution 3.0 License Preface This lab manual provides an introduction to digital logic, starting with simple gates and building up to state machines. Students should have a solid understanding of algebra as well as a rudimentary understanding of basic electricity including .