Lecture 18(a): Linear Regression: OLS, Ridge, LASSO Setup and Practical Considerations


Lecture 18(a): Linear Regression: OLS, Ridge, LASSO
Setup and Practical Considerations
Foundations of Data Science: Algorithms and Mathematical Foundations
Mihai Cucuringu (mihai.cucuringu@stats.ox.ac.uk)
CDT in Mathematics of Random Systems, University of Oxford
October 2, 2020

Advertising data set
- sales of a product in 200 different markets
- budgets for the product in each of those markets for three different media: TV, radio, and newspaper
- goal: predict sales given the three media budgets
- input variables (denoted by X1, X2, ...): X1 = TV budget, X2 = radio budget, X3 = newspaper budget
- inputs are also known as predictors, independent variables, features, variables, or covariates
- the output variable (sales) is the response or dependent variable (denoted by Y)
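To make the setup concrete, here is a minimal sketch of loading such a data set with pandas; the file name Advertising.csv and the column names are assumptions about how the data is stored locally, not part of the lecture.

```python
# A minimal sketch of loading the Advertising data; "Advertising.csv" and the
# column names are assumed here, so adjust them to your local copy of the data.
import pandas as pd

ads = pd.read_csv("Advertising.csv")           # columns: TV, radio, newspaper, sales
X = ads[["TV", "radio", "newspaper"]]          # media budgets (inputs X1, X2, X3)
y = ads["sales"]                               # response Y
print(X.shape, y.shape)                        # expected: (200, 3) (200,)
```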

Advertising data set (figure)

Linear Regression
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media contribute to sales?
- How accurately can we estimate the effect of each medium on sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media? E.g., is spending $50k on TV plus $50k on radio better than spending $100k on either one alone? (interaction effect)

Errors
Model: Y ≈ β0 + β1 X
Example: sales ≈ β0 + β1 × radio
Define the residual sum of squares (RSS):
RSS = e1² + e2² + ... + en² = Σ_{i=1}^n ei²   (1)
where ei = yi − β̂0 − β̂1 xi,  i = 1, ..., n   (2)

Figure: For the Advertising data, the least squares fit for the regression of sales onto TV. The fit is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.

Figure: A simulated data set. Left: The red line represents the true relationship, f(X) = 2 + 3X, which is known as the population regression line. The blue line is the least squares line; it is the least squares estimate for f(X) based on the observed data, shown in black. Right: The population regression line is again shown in red, and the least squares line in dark blue. In light blue, ten least squares lines are shown, each computed on the basis of a separate random set of observations. Each least squares line is different, but on average, the least squares lines are quite close to the population regression line.

Recall the OLS estimators
The least squares coefficient estimates for simple linear regression are
β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²   (3)
β̂0 = ȳ − β̂1 x̄   (4)
where ȳ = (1/n) Σ_i yi and x̄ = (1/n) Σ_i xi denote the sample means.
The corresponding standard errors are given by
SE(β̂0)² = σ² [ 1/n + x̄² / Σ_{i=1}^n (xi − x̄)² ]   (5)
SE(β̂1)² = σ² / Σ_{i=1}^n (xi − x̄)²   (6)
with σ² = Var(ε).
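These closed-form estimates are easy to compute directly; below is a minimal numpy sketch on synthetic data (the data-generating parameters are assumptions chosen for illustration). Since σ² = Var(ε) is unknown in practice, it is replaced by the residual-based estimate RSS/(n − 2).

```python
# A minimal numpy sketch of equations (3)-(6): closed-form OLS estimates and
# their standard errors for simple linear regression, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=n)        # true beta0 = 2, beta1 = 3

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)

beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx      # eq. (3)
beta0_hat = y_bar - beta1_hat * x_bar                    # eq. (4)

residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)            # plug-in estimate of Var(eps)

se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))   # eq. (5)
se_beta1 = np.sqrt(sigma2_hat / sxx)                            # eq. (6)

print(beta0_hat, beta1_hat, se_beta0, se_beta1)
```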

Confidence intervals
- 95% confidence interval for β1: β̂1 ± 2 · SE(β̂1), i.e., with roughly 95% probability the true β1 lies in [β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1)]
- similarly for β0
For the Advertising data, the 95% confidence intervals are:
- β0 ∈ [6.130, 7.935]: without any advertising, sales will on average fall between about 6,130 and 7,940 units
- β1 ∈ [0.042, 0.053]: each $1,000 increase in TV advertising is associated with an average increase in sales of between 42 and 53 units
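The factor 2 is a rule of thumb for the exact t quantile; a small sketch of the exact interval follows, where the coefficient and standard error are illustrative placeholders consistent with the TV interval quoted above.

```python
# A small sketch of the 95% confidence interval beta1_hat +/- t_crit * SE(beta1_hat).
from scipy import stats

n = 200                                   # the Advertising data has 200 markets
beta1_hat, se_beta1 = 0.0475, 0.0027      # illustrative placeholder values for the TV slope
t_crit = stats.t.ppf(0.975, df=n - 2)     # ~1.97 here, hence the rule-of-thumb factor of 2
print(beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
```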

Hypothesis testing: the null hypothesis
H0: There is no relationship between X and Y (β1 = 0)
H1: There is some relationship between X and Y (β1 ≠ 0)
For the model Y = β0 + β1 X + ε, compute the t-statistic
t = (β̂1 − 0) / SE(β̂1),
i.e., the number of standard deviations that β̂1 is away from 0.
- if there is no relationship between X and Y, then t follows a t-distribution with n − 2 degrees of freedom
- for n ≳ 30, the t-distribution is very similar to the Gaussian

Hypothesis testing: the p-value
- p-value: the probability of observing any value equal to |t| or larger, assuming β1 = 0
- small p-value: it is unlikely to observe such a substantial association between X and Y due to chance (i.e., if X and Y were truly unrelated)
- typical p-value thresholds for rejecting the null hypothesis: 5% or 1%
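A minimal, self-contained sketch of this t-test on synthetic data follows (the data-generating model is an assumption for illustration); the same numbers can also be obtained from scipy.stats.linregress or statsmodels.

```python
# A minimal sketch of the t-test for H0: beta1 = 0 in simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)                 # a genuine but modest relationship

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
se_beta1 = np.sqrt(rss / (n - 2) / sxx)

t_stat = beta1_hat / se_beta1                          # number of SEs away from 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)        # two-sided p-value
print(t_stat, p_value)
```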

Quality metrics
- RSE: measures the lack of fit of the model to the data,
  RSE = sqrt( RSS / (n − 2) ) = sqrt( (1/(n − 2)) Σ_{i=1}^n (yi − ŷi)² )   (7)
- R²: measures the proportion of variance explained,
  R² = (TSS − RSS) / TSS = 1 − RSS / TSS
- TSS = Σ (yi − ȳ)², the total variance in the response Y
- RSS = Σ (yi − ŷi)², the amount of variability that is left unexplained after the regression
- for simple linear regression: R² = ρ², where ρ is the usual Pearson correlation
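A small sketch of computing RSE and R² for a simple linear fit, again on synthetic data; it also checks numerically that R² equals the squared Pearson correlation in the simple-regression case.

```python
# RSE and R^2 for a simple least squares fit on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n = 80
x = rng.uniform(0, 5, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.7, size=n)

slope, intercept = np.polyfit(x, y, deg=1)   # degree-1 fit returns [slope, intercept]
y_hat = intercept + slope * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

rse = np.sqrt(rss / (n - 2))                 # residual standard error, eq. (7)
r2 = 1 - rss / tss                           # proportion of variance explained
rho = np.corrcoef(x, y)[0, 1]
print(rse, r2, rho ** 2)                     # r2 and rho**2 agree for simple regression
```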

From Simple to Multiple Linear Regression
Figure: A $1,000 increase in radio spending is associated with an average increase in sales of 203 units. A $1,000 increase in newspaper spending is associated with an average increase in sales of around 55 units.

Multiple Linear Regression
Y = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε
ŷ = β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂p xp
ŷi = β̂0 + β̂1 xi,1 + β̂2 xi,2 + ... + β̂p xi,p,  i = 1, ..., n
sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε
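Below is a minimal sketch of fitting a multiple linear regression with np.linalg.lstsq; the three synthetic predictors stand in for the TV, radio, and newspaper budgets, and the coefficient values are assumptions for illustration.

```python
# Multiple linear regression via least squares on a design matrix with an intercept column.
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.uniform(0, 100, size=(n, p))                   # synthetic "budgets"
beta_true = np.array([3.0, 0.045, 0.18, 0.0])          # intercept + 3 slopes (assumed)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.0, size=n)

X_design = np.column_stack([np.ones(n), X])            # prepend the intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)                                        # should be close to beta_true
```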

Errors being minimized
Figure: In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.
The values β̂0, β̂1, ..., β̂p that minimize the RSS are the multiple least squares coefficient estimates.

Multiple Linear Regression
Fixing TV and newspaper advertising, spending an additional $1,000 on radio is associated with an increase in sales of approximately 189 units.
Note that β̂_newspaper is now very close to zero, with a small t-statistic and a large (non-significant) p-value.

- corr(radio, newspaper) ≈ 0.35, so newspaper gets "credit" for the effect of radio on sales
- analogy: shark attacks vs. ice cream sales at a given beach show a positive relationship
- higher temperatures → more people visit the beach → more ice cream sales and more shark attacks
- ice cream sales are no longer significant after adjusting for temperature

Hypothesis testing: the null hypothesis
H0: β1 = β2 = ... = βp = 0
H1: at least one of the βj's is non-zero
Compute the F-statistic
F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ]
- TSS = Σ (yi − ȳ)², the total variance in the response Y
- RSS = Σ (yi − ŷi)², the amount of variability that is left unexplained after the regression
- under the linear model assumptions, one can show E[RSS / (n − p − 1)] = σ²
- if H0 is true: E[(TSS − RSS) / p] = σ², so F ≈ 1
- if H1 is true: E[(TSS − RSS) / p] > σ², so F > 1
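A minimal sketch of computing this F-statistic and its p-value on synthetic data (the data-generating coefficients are assumptions); statsmodels reports the same quantity in its regression summary.

```python
# F-test for H0: beta1 = ... = betap = 0 in a multiple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([0.5, 0.0, -0.3]) + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(f_stat, dfn=p, dfd=n - p - 1)     # upper-tail probability
print(f_stat, p_value)
```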

Figure: Least squares model for the regression of number of units sold on TV, newspaper, and radio advertising budgets in the Advertising data.

Variable selection
Which predictors are associated with the response? (in order to fit a single model involving only those d predictors)
- Note: R² always increases as you add more variables to the model
- adjusted R²: 1 − [RSS / (n − p − 1)] / [TSS / (n − 1)] = 1 − (1 − R²) (n − 1) / (n − p − 1)
- Mallow's Cp: (1/n) (RSS + 2 p σ̂²)
- Akaike information criterion: AIC = (1 / (n σ̂²)) (RSS + 2 p σ̂²)
We cannot consider all 2^p models. Instead:
- Best subset selection: fit a separate least squares regression for each possible k-combination of the p predictors, and select the best one
- Forward selection: start with the null model and keep adding predictors one by one (a rough sketch follows below)
- Backward selection: start with all variables in the model, and remove the variable with the largest p-value
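The sketch below implements forward selection with adjusted R² as the scoring criterion; the criterion, the stopping rule (stop when the score no longer improves), and the helper names are our choices for illustration, not prescribed by the lecture.

```python
# A rough sketch of forward selection scored by adjusted R^2, on synthetic data.
import numpy as np

def adjusted_r2(X_cols, y):
    """Fit OLS on the given columns (plus intercept) and return adjusted R^2."""
    n, p = X_cols.shape
    X_design = np.column_stack([np.ones(n), X_cols])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - (rss / (n - p - 1)) / (tss / (n - 1))

rng = np.random.default_rng(5)
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # only 2 relevant predictors

selected, remaining = [], list(range(p))
best_score = -np.inf
while remaining:
    scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
    score, j = max(scores)
    if score <= best_score:        # stop when adding a variable no longer helps
        break
    selected.append(j)
    remaining.remove(j)
    best_score = score

print(selected, best_score)        # typically picks columns 0 and 3 first
```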

Other considerations (see the textbook)
- prediction intervals
- extensions of the linear model, e.g. interaction terms:
  Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε
  sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
        = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε
- R² for this model is 96.8%, vs. 89.7% for the model that uses TV and radio without an interaction term
- the hierarchical principle: if we include the interaction X1 X2, we should also include the main effects X1 and X2 (even if their p-values are not significant)
- non-linear relationships, e.g. Y = β0 + β1 X + β2 X²
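A small sketch of fitting the interaction model and comparing its R² with the main-effects model; the synthetic data and its coefficients are assumptions standing in for TV and radio budgets.

```python
# Main-effects model vs. model with an interaction term, compared by R^2.
import numpy as np

rng = np.random.default_rng(6)
n = 200
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
sales = 6.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(size=n)

X_main = np.column_stack([np.ones(n), tv, radio])              # main effects only
X_int = np.column_stack([np.ones(n), tv, radio, tv * radio])   # with interaction term

def r2(X_design, y):
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta) ** 2)
    return 1 - rss / np.sum((y - y.mean()) ** 2)

print(r2(X_main, sales), r2(X_int, sales))   # the interaction model fits better here
```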

Potential Problems with Linear Regression
- Non-linearity of the response-predictor relationships
- Correlation of error terms
- Non-constant variance of error terms
- Outliers
- High-leverage points
- Collinearity

(1) Non-linearity of the Data
Figure: Residuals vs. predicted (or fitted) values for the Auto data set. In each plot, the red line is a smooth fit to the residuals. Left: a linear fit of Y on X. Right: a fit that also includes a quadratic term X².

(2) Time series of residuals
Figure: Plots of residuals from simulated time series data sets generated with differing levels of correlation between error terms for adjacent time points.

(3) Residual plots
Figure: Red line: smooth fit to the residuals. Blue lines: track the outer quantiles of the residuals. Left: The funnel shape indicates heteroscedasticity (the variance of the errors is not constant). Right: The response has been log-transformed, and there is now no evidence of heteroscedasticity.
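A rough sketch of reproducing this kind of diagnostic: plot residuals against fitted values before and after log-transforming the response, using synthetic heteroscedastic data (the data-generating model is an assumption for illustration).

```python
# Residuals-vs-fitted plots for Y and log(Y) on synthetic heteroscedastic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(1, 10, size=n)
y = np.exp(0.3 + 0.25 * x + rng.normal(scale=0.3, size=n))   # noise grows with the mean

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, response, title in [(axes[0], y, "Y"), (axes[1], np.log(y), "log(Y)")]:
    slope, intercept = np.polyfit(x, response, deg=1)
    fitted = intercept + slope * x
    ax.scatter(fitted, response - fitted, s=10)
    ax.axhline(0.0, color="red")
    ax.set_xlabel("fitted values")
    ax.set_ylabel("residuals")
    ax.set_title(title)
plt.tight_layout()
plt.show()   # left panel shows a funnel shape; right panel does not
```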

Make sure you read the entire Chapter 3 of the textbook.
