Lecture 2 Bayesian Linear Regression


Advanced Probabilistic Machine Learning
Lecture 2 – Bayesian linear regression

Niklas Wahlström
Division of Systems and Control
Department of Information Technology
Uppsala University
niklas.wahlstrom@it.uu.se

Summary of lecture 1 (I/IV)

Conditional probability is defined as

    p(x | y) = p(x, y) / p(y),   where p(y) ≠ 0.

Marginalization is defined as

    p(x) = Σ_y p(x, y)   or   p(x) = ∫ p(x, y) dy.

Much of probability theory can be derived from these two rules.

Bayes' theorem is derived by using the definition of conditional probability twice:

    p(x | y) = p(y | x) p(x) / p(y).

Summary of lecture 1 (II/IV)

In this course we solve problems using Bayes' theorem

    p(θ | D) = p(D | θ) p(θ) / p(D)

- D: observed data
- θ: parameters of some model explaining the data
- p(θ): prior belief about the parameters before we collected any data
- p(θ | D): posterior belief about the parameters after observing the data
- p(D | θ): likelihood of the data in view of the parameters
- p(D): the marginal likelihood

Summary of lecture 1 (III/IV)

If we view the quantities as functions of θ, we can disregard the normalization constant p(D):

    p(θ | D) ∝ p(D | θ) · p(θ)
    (posterior ∝ likelihood · prior)

Conjugate prior: a prior ensuring that the posterior and the prior belong to the same probability distribution family.

Example: Beta-Binomial

    Beta(µ; a', b') ∝ Bin(m; N, µ) · Beta(µ; a, b)
    (posterior ∝ likelihood · prior)

with a' = a + m and b' = b + N − m. The Beta distribution is a conjugate prior to the binomial likelihood.

Summary of lecture 1 (IV/IV)

Assume you get N = 1 data point, of which m = 1 is heads: D = {1}.

    Prior: p(µ) = Beta(µ; a, b),   a = 1, b = 1
    Likelihood function: p(m | µ) = Bin(m; N, µ),   m = 1, N = 1
    Posterior: p(µ | m) = Beta(µ; a', b'),   a' = 2, b' = 1

(Figure: prior, likelihood function, and posterior plotted over µ ∈ [0, 1].)

Summary of lecture 1 (IV/IV)

Assume you get N = 5 data points, of which m = 4 are heads: D = {1, 0, 1, 1, 1}.

    Prior: p(µ) = Beta(µ; a, b),   a = 1, b = 1
    Likelihood function: p(m | µ) = Bin(m; N, µ),   m = 4, N = 5
    Posterior: p(µ | m) = Beta(µ; a', b'),   a' = 5, b' = 2

(Figure: prior, likelihood function, and posterior plotted over µ ∈ [0, 1].)
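The conjugate update above is a one-liner in code. A minimal sketch, assuming NumPy/SciPy are available (variable names are illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

# Beta(a, b) prior on the head probability mu; a = b = 1 is the uniform prior.
a, b = 1.0, 1.0
D = np.array([1, 0, 1, 1, 1])        # N = 5 flips, m = 4 heads
m, N = D.sum(), D.size

# Conjugacy: Beta prior x Binomial likelihood -> Beta posterior,
# with a' = a + m and b' = b + N - m, exactly as on the slide.
posterior = stats.beta(a + m, b + (N - m))
print(posterior.mean())              # E[mu | D] = a' / (a' + b') = 5/7
```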

Supervised machine learning

Learning a model from labeled data (labels e.g. mat, mirror, boat, ...):

    Training data → Learning algorithm → Model

Predicting the output of new data based on this model:

    Unseen data → Model → Prediction

How do we rephrase supervised machine learning within the probabilistic methodology?

(Figure: an example application, a tree-structured taxonomy of skin diseases with benign and malignant lesion classes such as basal cell carcinoma, squamous cell carcinoma, and cutaneous lymphoma; the full taxonomy contains 2,032 diseases.)

Supervised machine learning – probabilistic perspective

Given: data of inputs and outputs, D = {(x1, y1), ..., (xN, yN)}.
Task: predict the output y* for a new, unseen input x*.

Solution:
1. Likelihood: define the likelihood p(y | θ, X).
2. Prior: define the prior p(θ).
3. Learning: do inference by applying Bayes' theorem,
       p(θ | y, X) ∝ p(y | θ, X) p(θ).
4. Prediction: compute the predictive distribution by marginalizing,
       p(y* | x*, y, X) = ∫ p(y* | θ, x*) p(θ | y, X) dθ.

Here X = (x1^T; ...; xN^T) stacks the inputs as rows, and y = (y1, ..., yN)^T stacks the outputs.

Example: Linear regression model

Recall the linear regression model from lecture 2 in the SML course.

Linear regression model:

    yn = w^T xn + εn,   εn ~ N(0, σ²),   n = 1, ..., N.

Present assumptions:
1. yn – observed random variable.
2. w – unknown deterministic variable.
3. xn – known deterministic variable.
4. εn – unknown random variable.
5. σ – known deterministic variable.

Linear regression: Maximum likelihood

Two equivalent ways of expressing the linear regression model:
1. yn = w^T xn + εn,   εn ~ N(0, σ²)
2. p(yn | w) = N(yn; w^T xn, σ²).

The likelihood p(y | w) is given by

    p(y | w) = Π_{n=1}^{N} p(yn | w) = Π_{n=1}^{N} N(yn; w^T xn, σ²) = N(y; Xw, σ² I_N),

where X = (x1^T; ...; xN^T) and y = (y1, ..., yN)^T.

The solution is found by maximizing the likelihood:

    ŵ = arg max_w p(y | w).
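Since the likelihood is Gaussian, maximizing it is equivalent to minimizing the squared residuals. A minimal sketch of the maximum-likelihood fit; the synthetic data and parameter values are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 2
X = rng.normal(size=(N, D))                    # rows are the inputs x_n^T
w_true = np.array([0.5, -0.3])                 # hypothetical true parameters
y = X @ w_true + 0.1 * rng.normal(size=N)      # y_n = w^T x_n + eps_n

# argmax_w N(y; Xw, sigma^2 I) = argmin_w ||y - Xw||^2  (least squares)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                                   # close to w_true
```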

Example: Linear regression model

- Recall the linear regression from lecture 2 in the SML course.
- Now we introduce a prior over the parameter w.

Bayesian linear regression model:

    yn = w^T xn + εn,   εn ~ N(0, σ²),   n = 1, ..., N,
    w ~ p(w).

Present assumptions:
1. yn – observed random variable.
2. w – unknown random variable. (difference from SML)
3. xn – known deterministic variable.
4. εn – unknown random variable.
5. σ – known deterministic variable.

Bayesian linear regression model

Remember Bayes' theorem:

    p(w | y) = p(y | w) p(w) / p(y)

- Prior distribution: p(w) describes the knowledge we have about w before observing any data.
- Likelihood: p(y | w) describes how "likely" the observed data is for a particular parameter value.
- Posterior distribution: p(w | y) summarizes all our knowledge about w from the observed data and the model.

In Bayesian linear regression we use a Gaussian distribution as prior:

    p(w) = N(w; m0, Σ0).

Scalar Gaussian (Normal) distribution

For a scalar variable x, the Gaussian distribution can be written on the form

    N(x; µ, σ²) = (1 / √(2πσ²)) · exp(−(x − µ)² / (2σ²))

- µ is the mean (expected value of the distribution)
- σ is the standard deviation
- σ² is the variance
- Z = √(2πσ²) is the normalization constant

(Figure: scalar Gaussian pdfs for a few (µ, σ²) pairs, e.g. (1, 0.5), (6, 0.2), (3, 4).)

What if x is a vector, x = (x1, x2, ..., xD)^T?

Multivariate Gaussian

For a D-dimensional vector x, the multivariate Gaussian distribution can be written on the form

    N(x; µ, Σ) = (1 / ((2π)^{D/2} √(det Σ))) · exp(−(1/2) (x − µ)^T Σ^{-1} (x − µ))

- µ is the mean vector
- Σ is the covariance matrix
- Z = (2π)^{D/2} √(det Σ) is the normalization constant
- the exponent (x − µ)^T Σ^{-1} (x − µ) is a quadratic form: Gaussian ∝ e^(quadratic form)
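For reference, the density above translates directly to code. A sketch of the log-density (the log form avoids numerical overflow; only standard NumPy is assumed):

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) for a D-dimensional x, following the formula above."""
    D = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(Sigma)         # log det(Sigma), numerically stable
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```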

Multivariate Gaussian

(Figure: surfaces of the two-dimensional Gaussian pdf over (x1, x2) for two covariance matrices, e.g. Σ = (1 0; 0 1) and Σ = (1 0.4; 0.4 0.5).)

Partitioned Gaussian – marginalization

Partition the Gaussian random vector x ~ N(µ, Σ), where x ∈ R^n, into two sets of random variables xa ∈ R^{na} and xb ∈ R^{nb},

    x = (xa; xb),   µ = (µa; µb),   Σ = (Σaa Σab; Σba Σbb).

Task: compute the marginal distribution p(xa),

    p(xa) = ∫ p(xa, xb) dxb.

Partitioned Gaussian – marginalization

Theorem 1 (Marginalization)
Partition the Gaussian random vector x ~ N(µ, Σ) according to

    x = (xa; xb),   µ = (µa; µb),   Σ = (Σaa Σab; Σba Σbb).

The marginal distribution p(xa) is then given by

    p(xa) = N(xa; µa, Σaa).
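Theorem 1 says marginalization costs nothing computationally: you just keep the blocks belonging to xa. A sketch (the numbers and index sets are illustrative):

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])

a = [0, 1]                          # indices belonging to x_a
mu_a = mu[a]                        # mean of p(x_a)
Sigma_aa = Sigma[np.ix_(a, a)]      # covariance of p(x_a)
# p(x_a) = N(x_a; mu_a, Sigma_aa) -- no integration needed
```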

Partitioned Gaussian – marginalizationxa17 / 38niklas.wahlstrom@it.uu.sexbBayesian linear regression

Partitioned Gaussian – Theorems

    Thm 1 (marginalization):  p(xa, xb) → p(xa), p(xb)

Partitioned Gaussian – conditioning

Theorem 2 (Conditioning)
Partition the Gaussian random vector x ~ N(µ, Σ) according to

    x = (xa; xb),   µ = (µa; µb),   Σ = (Σaa Σab; Σba Σbb).

The conditional distribution p(xa | xb) is then given by

    p(xa | xb) = N(xa; µ_{a|b}, Σ_{a|b}),
    µ_{a|b} = µa + Σab Σbb^{-1} (xb − µb),
    Σ_{a|b} = Σaa − Σab Σbb^{-1} Σba.
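Theorem 2 is also easy to turn into code. A sketch of the conditioning formulas (assuming a symmetric Σ, so that Σba = Σab^T):

```python
import numpy as np

def condition(mu, Sigma, a, b, xb):
    """p(x_a | x_b = xb) for x ~ N(mu, Sigma), per Theorem 2."""
    K = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])  # Sigma_ab Sigma_bb^{-1}
    mu_cond = mu[a] + K @ (xb - mu[b])                            # mu_{a|b}
    Sigma_cond = Sigma[np.ix_(a, a)] - K @ Sigma[np.ix_(b, a)]    # Sigma_{a|b}
    return mu_cond, Sigma_cond
```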

Partitioned Gaussian – conditioningxa20 / 38niklas.wahlstrom@it.uu.sexbBayesian linear regression

Partitioned Gaussian – Theorems

    Thm 1 (marginalization):  p(xa, xb) → p(xa), p(xb)
    Thm 2 (conditioning):     p(xa, xb) → p(xa | xb), p(xb | xa)

Affine transformation of multivariate Gaussians

We can also do the opposite: compute p(xa, xb) based on p(xb | xa) and p(xa).

Theorem 3 (Affine transformation)
Assume that xa, as well as xb conditioned on xa, are Gaussian distributed according to

    p(xa) = N(xa; µa, Σa),   p(xb | xa) = N(xb; M xa, Σ_{b|a}).

Then the joint distribution of xa and xb is

    p(xa, xb) = N((xa; xb); (µa; M µa), R)

with

    R = (Σa, Σa M^T; M Σa, Σ_{b|a} + M Σa M^T).
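Theorem 3 in code: given µa, Σa, M, and Σ_{b|a}, assemble the joint mean and covariance blocks. A sketch:

```python
import numpy as np

def joint(mu_a, Sigma_a, M, Sigma_b_given_a):
    """Joint N(mu, R) of x_a ~ N(mu_a, Sigma_a) and x_b | x_a ~ N(M x_a, Sigma_b_given_a)."""
    mu = np.concatenate([mu_a, M @ mu_a])
    R = np.block([[Sigma_a,      Sigma_a @ M.T],
                  [M @ Sigma_a,  Sigma_b_given_a + M @ Sigma_a @ M.T]])
    return mu, R
```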

Partitioned Gaussian – Theorems

    Thm 1 (marginalization):  p(xa, xb) → p(xa), p(xb)
    Thm 2 (conditioning):     p(xa, xb) → p(xa | xb), p(xb | xa)
    Thm 3 (affine transf.):   p(xa), p(xb | xa) → p(xa, xb)

Bayesian linear regression model

Bayesian linear regression model:

    yn = w^T xn + εn,   εn ~ N(0, β^{-1}),
    w ~ p(w),

where β = σ^{-2} is called the precision.

The probabilistic model is given by:

    p(y | w) = N(y; Xw, β^{-1} I_N)    (likelihood)
    p(w) = N(w; m0, S0)                (prior distribution)

Task: compute the posterior distribution p(w | y).

Partitioned Gaussian – Theorems

    Thm 1 (marginalization):  p(xa, xb) → p(xa), p(xb)
    Thm 2 (conditioning):     p(xa, xb) → p(xa | xb), p(xb | xa)
    Thm 3 (affine transf.):   p(xa), p(xb | xa) → p(xa, xb)
    Cor 1 (= Thm 3 + Thm 2):  p(xa), p(xb | xa) → p(xa | xb)

Affine transformation of multivariate Gaussians

By combining Theorem 3 and Theorem 2 we get

Corollary 1 (Affine transformation – conditional)
Assume that xa, as well as xb conditioned on xa, are Gaussian distributed according to

    p(xa) = N(xa; µa, Σa),   p(xb | xa) = N(xb; M xa, Σ_{b|a}).

Then the conditional distribution of xa given xb is

    p(xa | xb) = N(xa; µ_{a|b}, Σ_{a|b}),

with

    µ_{a|b} = Σ_{a|b} (Σa^{-1} µa + M^T Σ_{b|a}^{-1} xb),
    Σ_{a|b} = (Σa^{-1} + M^T Σ_{b|a}^{-1} M)^{-1}.

Bayesian linear regression

The probabilistic model is given by:

    p(y | w) = N(y; Xw, β^{-1} I_N)    (likelihood)
    p(w) = N(w; m0, S0)                (prior distribution)

Task: compute the posterior distribution p(w | y).

Solution: identify xa = w and xb = y. With Corollary 1 we get the posterior distribution

    p(w | y) = N(w; mN, SN),

where

    mN = SN (S0^{-1} m0 + β X^T y),
    SN^{-1} = S0^{-1} + β X^T X.
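The posterior formulas translate directly to a few lines of NumPy. A minimal sketch (a practical implementation would use Cholesky solves instead of explicit inverses):

```python
import numpy as np

def blr_posterior(X, y, m0, S0, beta):
    """Posterior p(w | y) = N(w; mN, SN) for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * X.T @ X)   # SN^{-1} = S0^{-1} + beta X^T X
    mN = SN @ (S0_inv @ m0 + beta * X.T @ y)      # mN = SN (S0^{-1} m0 + beta X^T y)
    return mN, SN
```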

ex) Bayesian linear regression

Consider the problem of fitting a straight line to noisy measurements. Let the model be (yn ∈ R)

    yn = w0 + w1 xn + εn,   εn ~ N(0, β^{-1}),   n = 1, ..., N,

where w0 + w1 xn = w^T xn with xn = (1, xn)^T, w = (w0, w1)^T, and β = 5². Furthermore, let the prior be

    p(w) = N(w; (0, 0)^T, α^{-1} I2),   where α = 2.

ex) Bayesian linear regression

Plot of the situation before any data arrives.

    Prior: p(w) = N(w; (0, 0)^T, (1/2) I2)

(Figure: left, the prior density over (w0, w1); right, a few realizations of the line w0 + w1 x drawn from the prior, plotted for x ∈ [−1, 1].)

ex) Bayesian linear regression

Plot of the situation after one measurement has arrived.

    Prior: p(w) = N(w; m0, S0)
    Likelihood: p(y1 | w) = N(y1; w0 + w1 x1, β^{-1})
    Posterior/prior: p(w | y1) = N(w; m1, S1),
        m1 = β S1 X^T y1,
        S1 = (α I2 + β X^T X)^{-1}.

(Figure: the prior, the likelihood, and the resulting posterior over (w0, w1); below, a few realizations of the line drawn from the posterior together with the first measurement (black circle).)

ex) Bayesian linear regression

Plot of the situation after two measurements have arrived.

    Prior: p(w | y1) = N(w; m1, S1)
    Likelihood: p(y2 | w) = N(y2; w0 + w1 x2, β^{-1})
    Posterior/prior: p(w | y2) = N(w; m2, S2),
        m2 = β S2 X^T y,
        S2 = (α I2 + β X^T X)^{-1}.

(Figure: the prior, the likelihood, and the resulting posterior over (w0, w1); below, a few realizations of the line drawn from the posterior together with the measurements (black circles).)

ex) Bayesian linear regression

Plot of the situation after 30 measurements have arrived. As before, the posterior after the previous measurements acts as the prior for the next one:

    Prior: the posterior from the previous measurements
    Likelihood: p(yn | w) = N(yn; w0 + w1 xn, β^{-1})
    Posterior/prior: p(w | y) = N(w; mN, SN),
        mN = β SN X^T y,
        SN = (α I2 + β X^T X)^{-1},

where X and y now contain all measurements collected so far.

(Figure: the posterior over (w0, w1) is now tightly concentrated; realizations of the line drawn from the posterior nearly coincide and pass through the measurements (black circles).)
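This animation can be reproduced by updating the posterior one measurement at a time, using the blr_posterior helper sketched above. A sketch, assuming the true line w = (−0.3, 0.5)^T as an illustrative choice (not given on the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 5.0**2
w_true = np.array([-0.3, 0.5])                 # hypothetical true (w0, w1)
m, S = np.zeros(2), np.eye(2) / alpha          # prior N(0, alpha^{-1} I_2)

for n in range(30):
    x = rng.uniform(-1, 1)
    X_n = np.array([[1.0, x]])                 # single-row design matrix (1, x_n)
    y_n = np.array([w_true @ X_n[0] + rng.normal(scale=beta**-0.5)])
    m, S = blr_posterior(X_n, y_n, m, S, beta) # posterior becomes the next prior
```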

Partitioned Gaussian – Theorems

    Thm 1 (marginalization):  p(xa, xb) → p(xa), p(xb)
    Thm 2 (conditioning):     p(xa, xb) → p(xa | xb), p(xb | xa)
    Thm 3 (affine transf.):   p(xa), p(xb | xa) → p(xa, xb)
    Cor 1 (= Thm 3 + Thm 2):  p(xa), p(xb | xa) → p(xa | xb)
    Cor 2 (= Thm 3 + Thm 1):  p(xa), p(xb | xa) → p(xb)

Affine transformation of multivariate Gaussians

By combining Theorem 3 and Theorem 1 we get

Corollary 2 (Affine transformation – marginalization)
Assume that xa, as well as xb conditioned on xa, are Gaussian distributed according to

    p(xa) = N(xa; µa, Σa),   p(xb | xa) = N(xb; M xa, Σ_{b|a}).

Then the marginal distribution of xb is given by

    p(xb) = N(xb; µb, Σb),

where

    µb = M µa,
    Σb = Σ_{b|a} + M Σa M^T.

Predictive distribution

For a new data point (y*, x*), we have:

    p(y* | w) = N(y*; x*^T w, β^{-1})    (likelihood)
    p(w | y) = N(w; mN, SN)              (posterior)

Identify xa = w and xb = y*. With Corollary 2 we get the predictive distribution

    p(y* | y) = N(y*; m*, s*²),

where

    m* = x*^T mN,
    s*² = β^{-1} + x*^T SN x*.
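Continuing the sketch, the predictive distribution is cheap to evaluate once mN and SN are known (x_star is the feature vector of the new input):

```python
import numpy as np

def blr_predict(x_star, mN, SN, beta):
    """Predictive p(y* | y) = N(y*; m_star, s2_star), per the slide."""
    m_star = x_star @ mN                         # m* = x*^T mN
    s2_star = 1.0 / beta + x_star @ SN @ x_star  # s*^2 = beta^{-1} + x*^T SN x*
    return m_star, s2_star

# e.g. for the straight-line example at x = 0.3: x_star = np.array([1.0, 0.3])
```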

ex) Predictive distribution

Investigating the predictive distribution for the example above.

(Figure: three panels showing the predictive distribution after N = 2, N = 5, and N = 200 observations.)

- Gray shaded area: one standard deviation of the predictive distribution as a function of x*, p(y* | y) = N(y*; x*^T mN, β^{-1} + x*^T SN x*), where x* = (1, x)^T.
- Blue line: mean of the predictive distribution.
- Black circles: observations.
- Red line: true model.

Conjugate priors (I/II)

The probabilistic model with unknown w is given by:

    p(w) = N(w; m0, S0)                (prior distribution)
    p(y | w) = N(y; Xw, β^{-1} I_N)    (likelihood)

which gives the posterior

    p(w | y) = N(w; mN, SN)            (posterior)

Note that using a Gaussian prior gives a Gaussian posterior:

    p(w | y) ∝ p(y | w) · p(w)
    (Gaussian ∝ Gaussian · Gaussian)

Hence, the Gaussian prior is a conjugate prior for the Gaussian likelihood with unknown w.

Q: What if the precision β is also unknown?

Conjugate prior (II/II)

The probabilistic model with unknown w and β is given by:

    p(w, β) = N(w; m0, β^{-1} S0) Gam(β; a0, b0)    (prior)
    p(y | w, β) = N(y; Xw, β^{-1} I_N)              (likelihood)

which gives the posterior

    p(w, β | y) = N(w; mN, β^{-1} SN) Gam(β; aN, bN)    (posterior)

Using a Gauss-Gamma prior gives a Gauss-Gamma posterior:

    p(w, β | y) ∝ p(y | w, β) · p(w, β)
    (Gauss-Gamma ∝ Gaussian · Gauss-Gamma)

Hence, the Gauss-Gamma prior is a conjugate prior for the Gaussian likelihood with unknown w and unknown precision β.

See further in Exercise 2.11.
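For completeness, a hedged sketch of the Gauss-Gamma update. These are the standard normal-inverse-gamma-style formulas, not derived on these slides (the derivation is Exercise 2.11), so treat them as an assumption to verify:

```python
import numpy as np

def blr_posterior_unknown_precision(X, y, m0, S0, a0, b0):
    """Posterior N(w; mN, beta^{-1} SN) Gam(beta; aN, bN) -- standard result,
    to be checked against Exercise 2.11."""
    N = y.size
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + X.T @ X)
    mN = SN @ (S0_inv @ m0 + X.T @ y)
    aN = a0 + N / 2
    bN = b0 + 0.5 * (y @ y + m0 @ S0_inv @ m0 - mN @ np.linalg.inv(SN) @ mN)
    return mN, SN, aN, bN
```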

Non-conjugate priors

In the first two lectures we could solve Bayes' theorem analytically since we used conjugate priors:

    p(w | y) = p(y | w) p(w) / p(y)

However, you often have a personal belief that is incompatible with conjugacy. For example:

- likelihoods with heavy tails
- multi-modal distributions

We then have to use approximate inference methods. In this course we will discuss two such methods:

- Monte Carlo (lecture 4)
- variational inference (lecture 6)

A few concepts to summarize lecture 2

Prior distribution: p(w) represents what we know about the unknown parameters w before we have considered any data.

Likelihood: p(y | w) describes how likely the measurements are for a particular parameter value.

Posterior distribution: p(w | y) summarizes our knowledge about the parameters w based on the information we have from the measurements y and the model.

Predictive distribution: p(y* | y) is the distribution of unobserved observations y* conditional on the observed data y.

