Lecture 2: Linear Methods For Regression


Rafael A. Irizarry and Hector Corrada Bravo
January, 2010

The next three lectures will cover basic methods for regression and classification. We'll see linear and tree-based methods for both in some detail, and will see nearest-neighbor methods without details for comparison. Regardless of lecture title, they are pretty much one long sequence on basics seen through linear methods and tree-based methods.

Terminology and notation

We will be mixing the terminology of statistics and computer science. For example, we will sometimes call Y and X the outcome/predictors, sometimes observed/covariates, and even input/output.

We will denote predictors with X and outcomes with Y (quantitative) and G (qualitative). Notice G are not numbers, so we cannot add or multiply them.

Height and weight are quantitative measurements. These are sometimes called continuous measurements.

Gender is a qualitative measurement. These are also called categorical or discrete. This is a particularly simple example because there are only two values. With two values we sometimes call it binary. We will use G to denote the set of possible values. For gender it would be G = {Male, Female}. A special case of qualitative variables are ordered qualitative variables, where one can impose an order. With men/women this can't be done, but with, say, G = {low, medium, high} it can.

For both types of variables, it makes sense to use the inputs to predict the output. Statisticians call the prediction task regression when the outcome is quantitative and classification when we predict qualitative outcomes. We will see that these have a lot in common and that they can be viewed as a task of function approximation (as with scatter plots).

Notice that inputs also vary in measurement type.

Technical notation

We will follow the notation of Hastie, Tibshirani and Friedman. Observed values will be denoted in lower case, so x_i means the ith observation of the random variable X. Matrices are represented with bold face upper case; for example, X will represent all observed predictors. N will usually mean the number of observations, or the length of Y. i will be used to denote which observation and j to denote which covariate or predictor. Vectors will not be bold; for example, x_i may mean all predictors for subject i, unless it is the vector of a particular predictor x_j. All vectors are assumed to be column vectors, so the i-th row of X will be x_i′, i.e., the transpose of x_i.

A regression problem

Recall the example data from AIDS research mentioned previously. Here we are plotting the data along with a curve from which the data could have plausibly been generated. In fact, this curve was estimated from the data by a regression technique called loess, which we will discuss in a future lecture.

For now, let's consider this curve as truth and simulate CD4 counts from it.

We will use this simulated data to compare two simple but commonly used methods to predict Y (CD4 counts) from X (Time), and discuss some of the issues that will arise throughout this course. In particular: what is overfitting, and what is the bias-variance tradeoff?

Linear regression

Probably the most used method in statistics. In this case, we predict the output Y via the model

Y = β_0 + β_1 X.

However, we do not know what β_0 or β_1 are. We use the training data to estimate them. We can also say we train the model on the data to get numeric coefficients. We will use the hat to denote the estimates: β̂_0 and β̂_1.

We will start using β to denote the vector (β_0, β_1)′. A statistician would call these the parameters of the model.

The most common way to estimate the βs is by least squares. In this case, we choose the β that minimizes

RSS(β) = Σ_{i=1}^N {y_i − (β_0 + β_1 x_i)}².

If you know linear algebra and calculus you can show that β̂ = (X′X)^{-1} X′y.

Notice we can predict Y for any X:

Ŷ = β̂_0 + β̂_1 X.

The next figure shows the prediction graphically. However, the data seem to suggest we could do better by considering more flexible models.
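As an illustration, here is a minimal NumPy sketch of fitting the simple linear model by the least squares solution β̂ = (X′X)^{-1}X′y. This is not code from the lecture; the data are made up and the variable names are my own.

```python
import numpy as np

# Illustrative simulated data; any (x, y) pairs would do.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, size=100))          # e.g. time
y = 2.5 - 0.8 * x + rng.normal(0, 0.5, size=100)  # outcome with noise

# Design matrix with a column of ones for the intercept beta_0.
X = np.column_stack([np.ones_like(x), x])

# Least squares fit: lstsq is the numerically stable way to compute
# the closed-form solution beta_hat = (X'X)^{-1} X'y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions and residual sum of squares on the training data.
y_hat = X @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print("beta_hat:", beta_hat, "train RSS:", rss)
```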

K-nearest neighbor

Nearest neighbor methods use the points closest in predictor space to x to obtain an estimate of Y. For the K-nearest neighbor method (KNN) we define

Ŷ = (1/k) Σ_{x_k ∈ N_k(x)} y_k.

Here N_k(x) contains the k nearest points to x. Notice that, as for linear regression, we can predict Y for any X.

In the next figure we see the results of KNN using the 15 nearest neighbors. This estimate looks better than the linear model.

We do better with KNN than with linear regression. However, we have to be careful about overfitting. Roughly speaking, overfitting is when you mold an algorithm to work very well (sometimes perfectly) on a particular data set, forgetting that the data set is the outcome of a random process and that our trained algorithm may not do as well in other instances.

Next, we see what happens when we use KNN with k = 1. In this case we make no mistakes in prediction on the training data, but do we really believe we can do well in general with this estimate?
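A minimal sketch of the KNN estimate above, again with made-up data; the helper name knn_predict is mine, not the lecture's.

```python
import numpy as np

def knn_predict(x0, x_train, y_train, k):
    """Average the y values of the k training points closest to x0."""
    dist = np.abs(x_train - x0)      # 1-d predictor: distance is |x - x0|
    nearest = np.argsort(dist)[:k]   # indices of the k nearest neighbors
    return y_train[nearest].mean()

# Illustrative training data.
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 3, size=100))
y_train = np.sin(2 * x_train) + rng.normal(0, 0.3, size=100)

# Predict on a grid, as one would to draw the fitted curve.
grid = np.linspace(0, 3, 200)
y_hat_15 = np.array([knn_predict(x0, x_train, y_train, k=15) for x0 in grid])
y_hat_1 = np.array([knn_predict(x0, x_train, y_train, k=1) for x0 in grid])
```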


It turns out we have been hiding a test data set. Now we can see which of these trained algorithms performs best on an independent test set generated by the same stochastic process. We can see how good our predictions are using RSS again.

Method    Train set   Test set
Linear    99.70       93.58
K = 15    67.41       75.32
K = 1     0.00        149.10

Notice that RSS is worse in the test set than in the training set for the KNN methods, especially for KNN with k = 1. The spikes the estimate needed to predict the training data perfectly no longer help.

So, how do we choose k? We will study various ways. First, let's talk about the bias/variance trade-off. Smaller k gives more flexible estimates, but too much flexibility can result in over-fitting and thus estimates with more variance. Larger k gives more stable estimates but may not be flexible enough. Not being flexible is related to being biased.
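To make the comparison concrete, here is a short sketch of my own (reusing knn_predict, x_train and y_train from the previous sketch) that computes training and test RSS for several values of k; the numbers will of course differ from the table above, since the data here are made up.

```python
import numpy as np

# Assumes knn_predict, x_train, y_train from the previous sketch.
rng = np.random.default_rng(2)
x_test = np.sort(rng.uniform(0, 3, size=100))
y_test = np.sin(2 * x_test) + rng.normal(0, 0.3, size=100)

def rss(x_eval, y_eval, k):
    """Residual sum of squares of the k-NN fit evaluated on (x_eval, y_eval)."""
    preds = np.array([knn_predict(x0, x_train, y_train, k) for x0 in x_eval])
    return np.sum((y_eval - preds) ** 2)

for k in (1, 5, 15, 25, 50):
    print(k, rss(x_train, y_train, k), rss(x_test, y_test, k))
```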

The next figure shows the RSS in the test and training sets for KNN with varying k. Notice that for small k we are clearly overfitting.

An illustration of the bias-variance tradeoff

The next figure illustrates the bias/variance tradeoff. Here we plot histograms of f(1) − f̂(1), where f̂(1) is the estimate for each method, trained on 1000 simulations.

We can see that the prediction from the linear model is consistently inaccurate. That is, it is biased, but stable (little variance). For k = 1 we get the opposite: there is a lot of variability, but once in a while it is very accurate (unbiased). For k = 15 we get a decent tradeoff of the two.

In this case we have simulated data as follows: for a given x,

Y = f(x) + ε,

where f(x) is the "true" curve we are using and ε is normal with mean zero and some variance σ².
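The following sketch (my own construction, with an arbitrary stand-in for f) mimics that experiment: simulate many data sets from Y = f(x) + ε, refit each method, and look at the center and spread of f(1) − f̂(1).

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x)          # stand-in for the "true" curve
sigma, n_sims, x0 = 0.3, 1000, 1.0   # noise sd, number of simulations, target point

errors_linear, errors_knn1, errors_knn15 = [], [], []
for _ in range(n_sims):
    x = np.sort(rng.uniform(0, 3, size=100))
    y = f(x) + rng.normal(0, sigma, size=100)

    # Linear fit evaluated at x0.
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    errors_linear.append(f(x0) - (b[0] + b[1] * x0))

    # KNN fits evaluated at x0.
    order = np.argsort(np.abs(x - x0))
    errors_knn1.append(f(x0) - y[order[:1]].mean())
    errors_knn15.append(f(x0) - y[order[:15]].mean())

for name, e in [("linear", errors_linear), ("k=1", errors_knn1), ("k=15", errors_knn15)]:
    print(name, "bias ~", np.mean(e), "sd ~", np.std(e))
```

The linear fit should show a nonzero mean error (bias) with small spread, k = 1 the reverse, and k = 15 something in between, matching the histograms described above.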


We have been measuring how good our predictions are by using RSS. Recall from the last lecture that we sometimes refer to this as a loss function. Recall also that for this loss function, if we want to minimize the expected prediction error for a given x,

E_{Y|X=x}[{Y − f(X)}² | X = x],

we get the conditional expectation f(x) = E[Y | X = x]. With some algebra we see that the RSS for this optimal selection is σ² in our setting. That is, we can't do better than this, on average, with any other predictor.

Notice that KNN is an intuitive estimator of this optimal predictor. We do not know what the function E[Y | X = x] looks like, so we estimate it with the y's of nearby x's. The larger k is, the less precise my estimate might be, since the radius of x's I use for the average is larger. Predictions are not always perfect.

Linear methods for regression

Linear predictors

Before computers became fast, linear regression was almost the only way of attacking certain prediction problems. To see why, consider a model such as this:

Y = β_0 + β_1 e^{β_2 X} + ε.

Finding the βs that minimize, for example, least squares is not straightforward. A grid search would require many computations because we are minimizing over a 3-dimensional space.

Technical note: for minimizing least squares in this case the Newton-Raphson algorithm would work rather well, but we still don't get an answer in closed form.

As mentioned, the least squares solution to the linear regression model

Y = β_0 + Σ_{j=1}^p β_j X_j + ε

has a closed-form linear solution. In linear algebra notation we write β̂ = (X′X)^{-1}X′y, with β = (β_0, β_1, . . . , β_p)′.
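To make the contrast concrete, here is a hedged sketch (not from the lecture; data and starting values are my own) that fits the nonlinear model Y = β_0 + β_1 e^{β_2 X} + ε by iterative least squares with scipy.optimize.curve_fit, next to the one-step closed-form fit of the linear model.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2, size=200))
y = 1.0 + 0.5 * np.exp(1.3 * x) + rng.normal(0, 0.2, size=200)  # data from the nonlinear model

# Nonlinear model: no closed form, so we iterate from a starting guess p0.
def model(x, b0, b1, b2):
    return b0 + b1 * np.exp(b2 * x)

(b0, b1, b2), _ = curve_fit(model, x, y, p0=(1.0, 1.0, 1.0))

# Linear model: a single closed-form solve, no iteration needed.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("nonlinear:", b0, b1, b2, "linear:", beta_hat)
```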

The important point here is that for any set of predictors x the prediction can be written as a linear combination of the observed data,

Ŷ = Σ_{i=1}^N w_i(x) y_i,

where the w_i(x) are determined by the X_j's and do not depend on y. What is the prediction Ŷ for x?

When we say linear regression we do not necessarily mean that we model Y as an actual line. All we mean is that the expected value of Y is a linear combination of predictors. For example, this is a linear model:

Y = β_0 + β_1 X + β_2 X² + β_3 X³ + ε.

To see this, simply define X_1 = X, X_2 = X², and X_3 = X³. For the model we saw above, we cannot do the same because X_1 = e^{β_2 X} contains a parameter.

If the linear regression model holds, then the least squares solution has various nice properties. For example, if the εs are normally distributed, then β̂ is the maximum likelihood estimate and is normally distributed as well. Estimating the variance components is simple: var(β̂) = (X′X)^{-1}σ², with σ² the error variance var(ε). σ² is usually well estimated using the residual sum of squares.

If the εs are independent and identically distributed (IID), then β̂ is the linear unbiased estimate with the smallest variance. This is called the Gauss-Markov theorem.

Technical note: linear regression also has a nice geometrical interpretation. The prediction is the orthogonal projection of the vector defined by the data onto the hyperplane defined by the regression model. We also see that the least squares estimates can be obtained by using the Gram-Schmidt algorithm, which orthogonalizes the covariates and then uses simple projections. This algorithm also helps us understand the QR decomposition. For more details see Hastie, Tibshirani and Friedman.

Testing hypotheses

The fact that we can get variance estimates from regression permits us to test simple hypotheses. For example,

β̂_j / se(β̂_j) ∼ t_{N−p−1}

under the assumption of normality for ε. When ε is not normal but IID, then the above is asymptotically normal.
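As an illustration of these formulas, here is a sketch on simulated data (my own; not the lecture's example) that computes β̂, the standard errors from σ̂²(X′X)^{-1}, and the resulting t-statistics.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(0, 1.0, size=n)  # x2 has no real effect

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# sigma^2 estimated from the residual sum of squares, with N - p - 1 degrees of freedom.
resid = y - X @ beta_hat
sigma2_hat = np.sum(resid ** 2) / (n - p - 1)

# var(beta_hat) = (X'X)^{-1} sigma^2; standard errors are the square roots of the diagonal.
se = np.sqrt(np.diag(XtX_inv * sigma2_hat))
t_stats = beta_hat / se
print("beta_hat:", beta_hat, "t-statistics:", t_stats)
```

The t-statistic for x2 should be small (its true coefficient is zero), while the ones for the intercept and x1 should be large.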

If we want to test the significance of various coefficients, we can generalize to the F-test:

F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(N − p_1 − 1)].

RSS_1 is for the least squares fit of the bigger model with p_1 + 1 parameters, and RSS_0 is the same for the smaller model with p_0 + 1 parameters, having p_1 − p_0 parameters constrained to be 0. Under normality assumptions this statistic (the F-statistic) follows an F_{p_1 − p_0, N − p_1 − 1} distribution.

Similarly, we can form confidence intervals (or balls). For the case of multiple coefficients, we can use the fact that

(β̂ − β)′ X′X (β̂ − β) / σ̂²

follows a χ²_{p+1} distribution.

Gram-Schmidt

One can show that the regression coefficient for the j-th predictor is the simple regression coefficient of y on this predictor adjusted for all others (obtained using Gram-Schmidt).

For the simple regression problem (with no intercept)

Y = Xβ + ε,

the least squares estimate is

β̂ = Σ_{i=1}^N x_i y_i / Σ_{i=1}^N x_i².

Can you see what it is for the constant model?

Mathematicians write the above solution as

β̂ = ⟨x, y⟩ / ⟨x, x⟩.

We will call this operation regressing y on x (it is the projection of y onto the space spanned by x). The residual can then be written as

r = y − β̂x.

What was the solution for β_1 in Y = β_0 + β_1 X + ε? We can write the result as

β̂_1 = ⟨x − x̄1, y⟩ / ⟨x − x̄1, x − x̄1⟩.

Regression by Successive Orthogonalization

1. Initialize z_0 = x_0 = 1.
2. For j = 1, 2, . . . , p: regress x_j on z_0, z_1, . . . , z_{j−1} to produce coefficients γ_{lj} = ⟨z_l, x_j⟩/⟨z_l, z_l⟩, l = 0, . . . , j − 1, and the residual vector z_j = x_j − Σ_{k=0}^{j−1} γ_{kj} z_k.
3. Regress y on the residual z_p to give the estimate β̂_p.

Notice that the Gram-Schmidt algorithm permits us to estimate the β_j in a multivariate regression problem by successively regressing (orthogonalizing) the x_j to produce residual vectors that form an orthogonal basis for the column space of X. The least squares estimate is found by regressing y on the final residuals, i.e. projecting onto this orthogonal basis.

Notice that if all the x's are correlated, then each predictor affects the coefficients of the others. The interpretation is that the coefficient of a predictor, β̂_j, is the regression of y on x_j after x_j has been adjusted for all other predictors.
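The algorithm above translates almost line by line into NumPy. The following is a sketch of my own (function and variable names are mine), checked against the ordinary least squares fit, which recovers the coefficient of the last predictor by successive orthogonalization.

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Regression by successive orthogonalization (Gram-Schmidt).

    X is an N x (p+1) design matrix whose first column is all ones.
    Returns the coefficient of the last column, obtained by regressing
    y on the final residual z_p, as in the algorithm above.
    """
    n, p1 = X.shape
    Z = np.empty_like(X, dtype=float)
    Z[:, 0] = X[:, 0]                      # z_0 = x_0 = 1
    for j in range(1, p1):
        zj = X[:, j].astype(float)
        for l in range(j):                 # regress x_j on z_0, ..., z_{j-1}
            gamma = (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l])
            zj = zj - gamma * Z[:, l]
        Z[:, j] = zj                       # residual vector z_j
    zp = Z[:, -1]
    return (zp @ y) / (zp @ zp)            # beta_hat_p = <z_p, y> / <z_p, z_p>

# Check against the usual least squares solution on made-up correlated data.
rng = np.random.default_rng(6)
x1 = rng.normal(size=50)
x2 = 0.5 * x1 + rng.normal(size=50)        # correlated with x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=50)
X = np.column_stack([np.ones(50), x1, x2])

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print(successive_orthogonalization(X, y), "vs", beta_full[-1])  # should agree
```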
