Lecture 17 Outliers & Influential Observations


Lecture 17: Outliers & Influential Observations
STAT 512, Spring 2011
Background reading: KNNL Sections 10.2-10.4

Topic Overview
- Statistical methods for identifying outliers / influential observations
- CDI/Physicians case study
- Remedial measures

Outlier Detection in MLR
- We can have both X and Y outliers.
- In SLR, outliers were relatively easy to detect via scatterplots or residual plots.
- In MLR, it becomes more difficult to detect outliers via simple plots:
  - Univariate outliers may not be as extreme in MLR.
  - Some multivariate outliers may not be detectable in single-variable analyses.

Using Residuals: Detecting Outliers in the Response (Y)
- We have seen how residuals can identify problems with normality, constancy of variance, and linearity.
- Residuals can also identify outlying values in Y (large magnitude implies an extreme value).
- But residuals don't really have a "scale", so what defines a large magnitude? We need something more standardized.

Semi-studentized Residuals
- Recall that e_i ~ iid N(0, sigma^2), so (e_i - 0)/sigma ~ N(0, 1); these are "standardized errors".
- However, we don't know the true errors or sigma, so we use the residuals e_i and sqrt(MSE).
- Dividing the residuals by sqrt(MSE) gives the semi-studentized residuals e_i* = e_i / sqrt(MSE).
- Slightly better than regular residuals; they can be used in the same ways we used residuals.

Studentized Residuals
- The previous version is a "quick fix", because the standard deviation of a residual is actually s{e_i} = sqrt(MSE * (1 - h_ii)),
  where h_ii is the ith element on the main diagonal of the hat matrix, between 0 and 1.
- The goal is to consider the magnitude of each residual relative to its standard deviation.
- Studentized residuals are r_i = e_i / sqrt(MSE * (1 - h_ii)) ~ t(n - p).
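The two scalings above (semi-studentized and studentized) can be computed directly from the hat matrix. A minimal NumPy sketch on made-up data; the course itself does this in SAS, and all variable names here are illustrative:

```python
import numpy as np

# Hypothetical regression data, just to illustrate the formulas (not the CDI data).
rng = np.random.default_rng(1)
n, p = 30, 3                                     # n cases, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                                 # ordinary residuals
MSE = e @ e / (n - p)                            # SSE / (n - p)

H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
h = np.diag(H)                                   # leverages h_ii, each in (0, 1)

semi_stud = e / np.sqrt(MSE)                     # semi-studentized residuals
r = e / np.sqrt(MSE * (1 - h))                   # studentized residuals
```

Since 1 - h_ii < 1, each studentized residual is at least as large in magnitude as its semi-studentized counterpart.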

Studentized Deleted Residuals
- Another refinement: each residual is obtained by regressing using all of the data except the point in question.
- Similar to what is done to compute the PRESS statistic: d_i = Y_i - Yhat_i(i).
- Note: a formula is available to avoid recomputing the entire regression over and over: d_i = e_i / (1 - h_ii).

Studentized Deleted Residuals (2)
- The standard deviation of this residual is s{d_i} = sqrt(MSE_(i) / (1 - h_ii)).
- t_i = d_i / s{d_i} = e_i / sqrt(MSE_(i) * (1 - h_ii)) is called the studentized deleted residual (SDR).
- It follows a t-distribution with n - p - 1 degrees of freedom, allowing us to know what constitutes an "extreme value".

Studentized Deleted Residuals (3)
- Alternative formula to calculate these without rerunning the regression n times:
  t_i = e_i * sqrt[(n - p - 1) / (SSE * (1 - h_ii) - e_i^2)]
- SAS of course uses this, and matrices, to do all of the arithmetic quickly.
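The shortcut can be verified numerically against the brute-force approach of actually deleting each case and refitting. A NumPy sketch with made-up data (not the course's SAS workflow):

```python
import numpy as np

# Hypothetical data for checking the no-refit SDR formula.
rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
SSE = e @ e
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Shortcut: t_i = e_i * sqrt((n - p - 1) / (SSE * (1 - h_ii) - e_i^2))
t_short = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))

# Brute force: refit without case i, then t_i = e_i / sqrt(MSE_(i) * (1 - h_ii))
t_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    resid_i = y[keep] - X[keep] @ b_i
    mse_i = resid_i @ resid_i / (n - 1 - p)      # MSE from the fit without case i
    t_brute[i] = e[i] / np.sqrt(mse_i * (1 - h[i]))
```

The two vectors agree to machine precision, which is why software never needs to run n separate regressions.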

Using Studentized Residuals
- Both studentized and studentized deleted residuals can be quite useful for identifying outliers.
- Since we know they have a t-distribution, for a reasonable sample size n an SDR of magnitude 3 or more (in absolute value) will be considered an outlier. Any with magnitude between 2 and 3 may be close, depending on the significance level used (see tables).
- Many high SDRs indicate an inadequate model.

Regular vs. "Deleted"
- Both generally give similar information.
- The "deleted" version is perhaps preferred: each data point is not used in computing its own residual, and the known t-distribution gives us a reference for judging what counts as an "extreme value".

Formal Test for Outliers in Y
- Test each of the n residuals to determine whether it is an outlier.
- Bonferroni adjustment for the n tests: the significance level becomes 0.05 / n.
- Compare the studentized deleted residuals (in absolute value) to a t critical value using the above alpha and n - p - 1 degrees of freedom.
- SDRs that are larger in magnitude than the critical value identify outliers.
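The Bonferroni test can be sketched in a few lines of NumPy/SciPy. The data are made up, with one gross Y outlier planted deliberately so the test has something to find; this is an illustration of the procedure, not the course's SAS output:

```python
import numpy as np
from scipy import stats

# Made-up data with one planted Y outlier (case 0).
rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)
y[0] += 8.0                                      # gross outlier in the response

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
SSE = e @ e
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
sdr = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))   # studentized deleted residuals

alpha = 0.05
crit = stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1)     # Bonferroni-adjusted critical value
outliers = np.flatnonzero(np.abs(sdr) > crit)             # cases flagged as Y outliers
```

Note how the Bonferroni adjustment inflates the critical value well above the usual two-sided cutoff, so only genuinely extreme cases are flagged.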

CDI / Physicians Example (cdi_outliers.sas)
- Note: we leave LA and Chicago in the model this time.
- More "options" for the MODEL statement:
  - /r produces an analysis of the residuals
  - /influence produces influence statistics
- Work with the 5-variable model from last time (tot_income, beds, crimes, hsgrad, unemploy).

Example (2)

proc reg data=cdi outest=fits;
  model lphys = beds tot_income hsgrad
                crimes unemploy / r;
run;

- Produces several pages of output, since residual information is given for each of the 440 data points.
- We'll look at only a small part of this output, for illustration.

Output (first 10 observations)

Obs   Student Residual   Cook's D
  1       -9.380          12.186
  2       -5.535           1.130
  3       -1.627           0.029
  4        0.974           0.006
  5        0.773           0.008
  6        3.676           6.541
  7        0.611           0.001
  8       -0.676           0.004
  9        0.711           0.005
 10        0.633           0.002

Note: Obs 1 = LA, Obs 2 = Cook, Obs 6 = Kings.

Leverage Values
- Outliers in X can be identified because they will have large leverage values. The leverage is just h_ii from the hat matrix.
- In general, 0 <= h_ii <= 1 and sum(h_ii) = p.
- Large leverage values indicate that the ith case is distant from the center of all X observations.
- Leverage is considered large if it is bigger than twice the mean leverage value, 2p / n.
- Leverages can also be used to identify hidden extrapolation (page 400 of KNNL).
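These rules are easy to check numerically. A NumPy sketch with a made-up design matrix containing one deliberately extreme X point (the course itself gets leverages from SAS's /influence option):

```python
import numpy as np

# Made-up design matrix with one planted X outlier (case 0), for illustration only.
rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X[0, 1:] = [6.0, -6.0]                           # far from the cloud of other X points

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverages h_ii
cutoff = 2 * p / n                               # twice the mean leverage p/n
high_leverage = np.flatnonzero(h > cutoff)       # cases flagged as X outliers
```

The leverages sum to p (so the mean is p/n), and the planted point is flagged by the 2p/n rule.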

Physicians Example
- /influence in the MODEL statement gives the leverage values (labeled "Hat Diag H" in the output).
- These statistics can also be saved to a dataset with an OUTPUT statement:

proc reg data=cdi;
  model lphys = beds tot_income hsgrad crimes
                unemploy / influence;
  output out=diag student=studresids h=leverage
                  rstudent=studdelresid;
proc sort data=diag; by studdelresid;
proc print data=diag;
  var county studresids leverage studdelresid;
run;

Output (sorted by studentized deleted residual)
Remember we can compare leverage to 2p/n = 0.03.

Obs   County     studresids
  1   Los Ange     -9.380
  2   Cook         -5.535
  3   Sarpy        -3.378
  4   Livingst     -2.174
 ...
437   San Fran      1.935
438-440: the largest studentized deleted residuals run from 1.941 and 2.055 up through 2.338 and 3.730.

Other Influence Statistics
- Not all outliers have a strong influence on the fitted model. Some measures to detect the influence of each observation:
  - Cook's distance measures the influence of an observation on all fitted values.
  - DFFITS measures the influence of an observation on its own fitted value.
  - DFBETAS measures the influence of an observation on a particular regression coefficient.

Cook's Distance
- Assesses the influence of a data point on ALL predicted values.
- Obtain from SAS using /r.
- Large values suggest that an observation has a lot of influence (can compare to an F(p, n - p) distribution).
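Cook's distance has the closed form D_i = e_i^2 h_ii / (p * MSE * (1 - h_ii)^2), which can be checked against its definition: the (scaled) total shift in all n fitted values when case i is deleted. A NumPy sketch on made-up data, not the CDI example:

```python
import numpy as np

# Made-up data for illustrating Cook's distance.
rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
MSE = e @ e / (n - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Closed form: D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2)
D = e**2 * h / (p * MSE * (1 - h)**2)

# Definition check for case 0: total squared shift in ALL fitted values
# when case 0 is deleted, scaled by p * MSE.
keep = np.arange(n) != 0
b_0, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
D0_direct = ((X @ beta - X @ b_0)**2).sum() / (p * MSE)
```

The closed form agrees with the delete-and-refit definition, so no refitting is needed in practice.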

DFFITS
- Assesses the influence of a data point on ITS OWN prediction only.
- Obtain from SAS using /influence.
- Essentially measures the difference between the prediction of that point with and without the observation in the computation.
- Large absolute values (bigger than 1, or bigger than 2*sqrt(p/n)) suggest that an observation has a lot of influence on its own prediction.
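DFFITS also has a no-refit form, DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)) with t_i the studentized deleted residual, which can be checked against the with/without-the-point definition. A NumPy sketch on made-up data:

```python
import numpy as np

# Made-up data for illustrating DFFITS.
rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
SSE = e @ e
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

sdr = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))  # studentized deleted residuals
dffits = sdr * np.sqrt(h / (1 - h))                      # DFFITS, no refitting needed

# Definition check for case 0: (Yhat_0 - Yhat_0(0)) / sqrt(MSE_(0) * h_00)
keep = np.arange(n) != 0
b_0, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
r_0 = y[keep] - X[keep] @ b_0
mse_0 = r_0 @ r_0 / (n - 1 - p)
dffits0 = (X[0] @ beta - X[0] @ b_0) / np.sqrt(mse_0 * h[0])
```

Cases with |DFFITS| above 1 (or above 2*sqrt(p/n) for larger datasets) would then be flagged.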

DFBETAS
- One per parameter per observation.
- Obtained using /influence in PROC REG.
- Assesses the influence of each observation on each parameter individually.
- Absolute values bigger than 1 (or bigger than 2/sqrt(n) for larger datasets) are considered large.
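DFBETAS_{k(i)} = (b_k - b_k(i)) / sqrt(MSE_(i) * c_kk), where c_kk is the kth diagonal element of (X'X)^{-1}. A NumPy sketch on made-up data, including a sanity check of the leave-one-out coefficient shift (the course obtains these from SAS's /influence):

```python
import numpy as np

# Made-up data for illustrating DFBETAS.
rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
c = np.diag(XtX_inv)                      # c_kk = kth diagonal of (X'X)^{-1}
h = np.diag(X @ XtX_inv @ X.T)

# One DFBETAS value per case, per parameter.
dfbetas = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    r_i = y[keep] - X[keep] @ b_i
    mse_i = r_i @ r_i / (n - 1 - p)
    dfbetas[i] = (beta - b_i) / np.sqrt(mse_i * c)
    if i == 0:
        # Leave-one-out identity: beta - b_(i) = (X'X)^{-1} x_i e_i / (1 - h_ii)
        delta_direct = beta - b_i
        delta_formula = XtX_inv @ X[0] * e[0] / (1 - h[0])

flagged = np.abs(dfbetas) > 2 / np.sqrt(n)    # rule-of-thumb cutoff for larger n
```

The leave-one-out identity means software can compute all n*p DFBETAS values without ever refitting the model.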

Example

proc reg data=cdi;
  model lphys = beds tot_income hsgrad
                crimes unemploy / r influence;
  output out=diag dffits=dffit cookd=cooksd;
proc sort data=diag; by descending cooksd;
proc print data=diag;
  var county dffit cooksd;
run;

Output (top 8 observations, sorted by descending Cook's distance)
Los Angeles heads the list; after the few highly influential counties, Cook's distance falls off quickly (to roughly 0.0253, 0.0218, 0.0217).

Conclusions
- Compare DFFITS to 2*sqrt(p/n) = 0.23.
- Could assess Cook's distance using the F distribution.
- Los Angeles, Kings, and Cook counties have an overwhelming amount of influence, both on their own fitted values and on the regression line itself.
- Looking at the DFBETAS (the only way to do this is to view the output from /influence) shows similar influence on the parameters; compare to 2/sqrt(n) = 2/sqrt(440), about 0.095.

Influential Observations
- The big question now: once we identify an outlier or influential observation, what do we do with it?
- For a good understanding of the regression model, this analysis IS needed. In our example, we now know that we have three cases holding a lot of influence. We may want to:
  - See what happens when we exclude these from the model.
  - Investigate these cases separately.

What Not to Do
- Never simply exclude or ignore a data point just because you don't like what it does to the results.
- Never ignore the fact that you have one or two overly influential observations.

Some Remedial Measures
- See Section 11.3.
- Robust regression procedures decrease the emphasis on outlying observations.
- Doing this is slightly beyond the scope of the class, but it doesn't hurt to be aware that such methods exist.

Upcoming in Lecture 18
- Miscellaneous topics in MLR (Chapter 8, Section 10.1)

