Robust Regression Modeling With STATA Lecture Notes


Robust Regression Modeling with STATA
Lecture notes

Robert A. Yaffee, Ph.D.
Statistics, Social Science, and Mapping Group
Academic Computing Services
Office: 75 Third Avenue, Level C-3
Phone: 212-998-3402
Email: yaffee@nyu.edu

What does Robust mean?

1. Definitions differ in scope and content. In the most general construction, "robust" pertains to stable and reliable models.
2. Strictly speaking: threats to stability and reliability include influential outliers, which play havoc with statistical estimation. Since 1960, many robust techniques of estimation have been developed that are resistant to the effects of such outliers.
   - SAS: Proc Robustreg in Version 9
   - S-Plus: the robust library
   - Stata: rreg, prais, and arima models
3. Broadly speaking: heteroskedasticity. Heteroskedastically consistent variance estimators:
       regress y x1 x2, robust
4. Non-normal residuals:
   1. Nonparametric regression models: qreg, rreg
   2. Bootstrapped regression: bstrap, bsqreg

Outline

1. Regression modeling preliminaries
   1. Tests for misspecification
      1. Outlier influence
      2. Testing for normality
      3. Testing for heteroskedasticity
      4. Autocorrelation of residuals
2. Robust techniques
   1. Robust regression
   2. Median or quantile regression
   3. Regression with robust standard errors
   4. Robust autoregression models
3. Validation and cross-validation
   1. Resampling
   2. Sample splitting
4. Comparison of STATA with S-PLUS and SAS

Preliminary Testing

Prior to linear regression modeling, use a matrix graph to confirm linearity of relationships:

    graph y x1 x2, matrix

[Scatterplot matrix of y, x1, and x2 omitted.]

The independent variables appear to be linearly related with y.

We try to keep the models simple. If the relationships are linear, then we model them with linear models. If the relationships are nonlinear, then we model them with nonlinear or nonparametric models.

Theory of Regression Analysis

What is linear regression analysis? Finding the relationship between a dependent and an independent variable:

    Y = a + bX + e

Graphically, this can be done with a simple Cartesian graph.

The Bivariate Regression Formula

    Y = a + bX + e

where
  Y is the dependent variable
  a is the intercept
  b is the regression coefficient
  X is the predictor variable

Graphical Decomposition of Effects

[Figure: decomposition of the total effect (y_i − ȳ) into the regression effect (ŷ_i − ȳ) and the error (y_i − ŷ_i) around the fitted line ŷ = a + bx.]

Derivation of the Intercept

    y = a + bx + e
    e = y − a − bx

Summing over all n observations:

    Σ e_i = Σ y_i − Σ a − b Σ x_i

Because by definition Σ e_i = 0:

    0 = Σ y_i − na − b Σ x_i
    na = Σ y_i − b Σ x_i
    a = ȳ − b x̄

Derivation of the Regression Coefficient

Given: y_i = a + b x_i + e_i, so e_i = y_i − a − b x_i.

Minimize the sum of squared errors:

    Σ e_i² = Σ (y_i − a − b x_i)²

Taking the derivative with respect to b and setting it to zero:

    −2 Σ x_i (y_i − a − b x_i) = 0

Solving for b (with x and y expressed as deviations from their means):

    b = Σ x_i y_i / Σ x_i²
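
To make the arithmetic concrete, here is a minimal Python sketch (the lecture itself uses Stata) of the closed-form solution just derived: b = Σxy/Σx² on mean-centered data and a = ȳ − b·x̄. The data values are hypothetical.

```python
# Hypothetical toy data for a simple regression of y on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
xc = [xi - xbar for xi in x]   # mean-centered predictor
yc = [yi - ybar for yi in y]   # mean-centered response

# Slope from the centered cross-product and sum of squares,
# then the intercept from a = ybar - b * xbar.
b = sum(xi * yi for xi, yi in zip(xc, yc)) / sum(xi * xi for xi in xc)
a = ybar - b * xbar
```

With these numbers the slope is 1.99 and the intercept 0.09, matching the derivation above.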

If we recall that the formula for the correlation coefficient can be expressed as follows:

    r = Σ x_i y_i / sqrt( Σ x_i² · Σ y_i² )

where x = x_i − x̄ and y = y_i − ȳ, and

    b = Σ x_i y_i / Σ x_i²

from which it can be seen that the regression coefficient b is a function of r:

    b = r · (sd_y / sd_x)

Extending the Bivariate to the Multivariate Case

For two predictors, the coefficients can be written in terms of the correlations:

    b1 = [ (r_yx1 − r_yx2 · r_x1x2) / (1 − r²_x1x2) ] · (sd_y / sd_x1)    (6)
    b2 = [ (r_yx2 − r_yx1 · r_x1x2) / (1 − r²_x1x2) ] · (sd_y / sd_x2)    (7)

It is also easy to extend the bivariate intercept to the multivariate case as follows:

    a = ȳ − b1 x̄1 − b2 x̄2    (8)

Linear Multiple Regression

Suppose that we have the following data set.

[Data listing omitted.]

Stata OLS Regression Model Syntax

[Regression output omitted.]

We now see that the significance levels reveal that x1 and x2 are both statistically significant. The R² and adjusted R² have not been significantly reduced, indicating that this model still fits well. Therefore, we leave the interaction term pruned from the model.

What are the assumptions of multiple linear regression analysis?

Regression Modeling and the Assumptions

What are the assumptions?
1. Linearity
2. Homoskedasticity (constant error variance)
3. No influential outliers in small samples
4. No multicollinearity
5. No autocorrelation of residuals
6. Fixed independent variables (no measurement error)
7. Normality of residuals

Testing the Model for Misspecification and Robustness

- Linearity: matrix graphs (shown above)
- Multicollinearity: vif
- Misspecification tests:
  - heteroskedasticity tests: rvfplot, hettest
  - residual autocorrelation tests: corrgram
  - outlier detection: tabulation of standardized residuals; influence assessment
  - residual normality tests: sktest
- Specification tests (not covered in this lecture)

Misspecification Tests

We need to test the residuals for normality. We can save the residuals in STATA by issuing a command that creates them, after we have run the regression command. The command to generate the residuals is:

    predict resid, residuals

Generation of the regression residuals

[Output listing omitted.]

Generation of Standardized Residuals

    predict rstd, rstandard

Generation of Studentized Residuals

    predict rstud, rstudent

Testing the Residuals for Normality

1. We use a skewness-kurtosis test.
2. The command for the test is:

    sktest resid

This jointly tests the skewness and kurtosis of the residuals against those of the theoretical normal distribution with a chi-square test, to determine whether there is a statistically significant difference. The null hypothesis is that there is no difference. When the probability is less than .05, we must reject the null hypothesis and infer that the residuals are non-normally distributed.

Testing the Residuals for Heteroskedasticity

1. We may graph the standardized or studentized residuals against the predicted scores to obtain a graphical indication of heteroskedasticity.
2. The Cook-Weisberg test is used to test the residuals for heteroskedasticity.

A Graphical Test of Heteroskedasticity

    rvfplot, border yline(0)

This displays any problematic patterns that might suggest heteroskedasticity. But it doesn't tell us which residuals are outliers.

Cook-Weisberg Test

The test models the error variance as

    Var(e_i) = σ² exp(z_i t)

where
  e_i = error in the regression model
  z_i = x_i β̂, or a variable list supplied by the user.

The test is whether t = 0. hettest estimates the auxiliary model e_i² = α + z_i t + ν_i and forms a score test

    S = (model sum of squares) / 2,

which under H0 is distributed as χ² with df = p, the number of parameters.

Cook-Weisberg Test Syntax

The command for this test is:

    hettest resid

An insignificant result indicates lack of heteroskedasticity; that is, such a result indicates the presence of equal variance of the residuals along the predicted line. This condition is otherwise known as homoskedasticity.

Testing the Residuals for Autocorrelation

1. One can use the command dwstat after the regression to obtain the Durbin-Watson d statistic to test for first-order autocorrelation.
2. There is a better way. Generate a case-number variable:

    gen casenum = _n

Create a time-dependent series

Run the Ljung-Box Q Statistic

The Ljung-Box Q statistic tests previous lags for autocorrelation and partial autocorrelation. The STATA command is:

    corrgram resid

The significance of the AC (autocorrelation) and PAC (partial autocorrelation) is shown in the Prob column. None of these residuals has any significant autocorrelation.

One Can Run Autoregression in the Event of Autocorrelation

This can be done with:

    newey y x1 x2 x3, lag(1) t(time)
    prais y x1 x2 x3

Outlier Detection

- Outlier detection involves determining whether the residual (error = actual − predicted) is an extreme negative or positive value.
- We may plot the residual versus the fitted values to determine which errors are large, after running the regression.
- The command syntax was already demonstrated with the graph on page 16:

    rvfplot, border yline(0)

Create Standardized Residuals

A standardized residual is one divided by its standard deviation:

    standardized residual = (y_i − ŷ_i) / s

where s = standard deviation of the residuals.

Standardized Residuals

    predict residstd, rstandard
    list residstd
    tabulate residstd

Limits of Standardized Residuals

If the standardized residuals have values in excess of 3.5 or below −3.5, they are outliers. If the absolute values are less than 3.5, as these are, then there are no outliers. While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge the influence of outliers.
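
As a rough Python sketch (not Stata) of the rule just stated: divide each residual by the residual standard deviation and flag anything beyond ±3.5. The residual values below are hypothetical, with one gross error planted at the end.

```python
# 30 small hypothetical residuals plus one gross error.
resid = [0.2, -0.3, 0.1, -0.1, 0.25, -0.2, 0.15, -0.05, 0.3, -0.25] * 3
resid.append(6.0)   # the planted gross error

n = len(resid)
mean = sum(resid) / n
# Sample standard deviation of the residuals.
s = (sum((e - mean) ** 2 for e in resid) / (n - 1)) ** 0.5

# Standardize and flag values beyond the +/-3.5 limits.
rstd = [e / s for e in resid]
outliers = [i for i, r in enumerate(rstd) if abs(r) > 3.5]
```

Only the planted observation is flagged; note that with very small n a single huge residual inflates s so much that no standardized residual can reach 3.5, which is one reason influence measures are needed as well.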

Outlier Influence

Suppose we had a different data set with two outliers. We tabulate the standardized residuals and obtain the following output:

[Tabulation omitted.]

Outlier a does not distort the regression line but outlier b does.

[Figure: fitted line Y = a + bx with outliers a and b marked.]

Outlier b has bad leverage and outlier a does not.

In this data set, we have two outliers. One is negative and the other is positive.

Studentized Residuals

Alternatively, we could form studentized residuals. These are distributed as a t distribution with df = n − p − 1, though they are not quite independent. Therefore, we can approximately determine whether or not they are statistically significant. Belsley et al. (1980) recommended the use of studentized residuals.

Studentized Residual

    e_i* = e_i / sqrt( s²(i) · (1 − h_i) )

where
  e_i* = studentized residual
  s(i) = standard deviation with the ith observation deleted
  h_i = leverage statistic

These are useful in estimating the statistical significance of a particular observation, for which a dummy variable indicator is formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier. The command to generate studentized residuals, called rstudt, is:

    predict rstudt, rstudent

Influence of Outliers

1. Leverage is measured by the diagonal elements of the hat matrix.
2. The hat matrix comes from the formula for the regression prediction of Y:

    Ŷ = X β̂ = X (X'X)⁻¹ X' Y

where X (X'X)⁻¹ X' = the hat matrix, H. Therefore,

    Ŷ = H Y

Leverage and the Hat Matrix

1. The hat matrix transforms Y into the predicted scores.
2. The diagonals of the hat matrix indicate which values will be outliers or not.
3. The diagonals are therefore measures of leverage.
4. Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
5. The trace of the hat matrix equals the number of parameters in the model.
6. When the leverage exceeds 2p/n, there is high leverage, according to Belsley et al. (1980), cited in Long, J.S., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.
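
For the simple-regression case the hat diagonals have a closed form, h_i = 1/n + (x_i − x̄)²/Σ(x_j − x̄)², which lets the properties above be checked directly. A small Python sketch with hypothetical x values (one deliberately far from the rest):

```python
# Hypothetical predictor with one high-leverage point at x = 10.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
p = 2                      # parameters: intercept and slope

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Hat diagonals for simple regression: 1/n + (x_i - xbar)^2 / Sxx.
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

trace = sum(h)             # should equal p, the number of parameters
high = [i for i, hi in enumerate(h) if hi > 2 * p / n]   # Belsley cutoff
```

The trace comes out to 2 (= p), every h_i lies between 1/n and 1, and only the point at x = 10 exceeds the 2p/n cutoff.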

Cook's D

1. Another measure of influence.
2. This is a popular one. The formula for it is:

    D_i = (e_i² / (p s²)) · (h_i / (1 − h_i)²)

Cook and Weisberg (1982) suggested that values of D that exceed the 50th percentile of the F distribution (df = p, n − p) are large.
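
The formula combines the size of the residual with the leverage of the observation. A minimal Python sketch (hypothetical residuals, leverages, and residual variance):

```python
# Cook's D per observation: D_i = e_i^2 * h_i / (p * s2 * (1 - h_i)^2)
e = [0.5, -0.3, 2.0]       # hypothetical residuals
h = [0.2, 0.3, 0.9]        # hypothetical leverages
p = 2                      # parameters in the model
s2 = 1.0                   # residual variance (assumed)

D = [ei ** 2 * hi / (p * s2 * (1 - hi) ** 2) for ei, hi in zip(e, h)]
```

The third observation, with a large residual and leverage near 1, gets a D of 180, dwarfing the others; the quick 4/n screening cutoff mentioned on the next slide would also flag it.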

Using Cook's D in STATA

    predict cook, cooksd

Finding the influential outliers:

    list cook if cook > 4/n

Belsley suggests 4/(n − k − 1) as a cutoff.

Graphical Exploration of Outlier Influence

    graph cook residstd, xlab ylab

The two influential outliers can be found easily here in the upper right.

DFbeta

One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted:

    DFbeta_j = b_j − b_(i)j

where b_(i)j is the estimate of b_j with observation i deleted; the computation uses u_j, the residuals of the regression of x_j on the remaining x's.

Obtaining DFbetas in STATA

[Output omitted.]

Robust Statistical Options When Assumptions Are Violated

1. Nonlinearity
   1. Transformation to linearity
   2. Nonlinear regression
2. Influential outliers
   1. Robust regression with robust weight functions: rreg y x1 x2
3. Heteroskedasticity of residuals
   1. Regression with Huber/White/sandwich variance-covariance estimators: regress y x1 x2, robust
4. Residual autocorrelation correction
   1. Autoregression with prais y x1 x2, robust
   2. Newey-West regression
5. Nonnormality of residuals
   1. Quantile regression: qreg y x1 x2
   2. Bootstrapping the regression coefficients

Nonlinearity: Transformations to Linearity

1. When the equation is not intrinsically nonlinear, the dependent variable or independent variable may be transformed to effect a linearization of the relationship.
2. Semi-log, translog, Box-Cox, or power transformations may be used for these purposes. Box-Cox regression determines the optimal parameters for many of these transformations.

Fix for Nonlinear Functional Form: Nonlinear Regression Analysis

Examples of two exponential growth curve models, the first of which we estimate with our data:

    nl exp2 y x        estimates y = b1 * b2^x
    nl exp3 y x        estimates y = b0 + b1 * b2^x

Nonlinear Regression in Stata

    . nl exp2 y x
    (obs = 15)

    Iteration 0:  residual SS = 56.08297
    Iteration 1:  residual SS = 49.46372
    Iteration 2:  residual SS = 49.4593
    Iteration 3:  residual SS = 49.4593

    Source   |       SS     df       MS         Number of obs =      15
    ---------+----------------------------      F(2, 13)      = 1585.01
    Model    | 12060.5407    2  6030.27035      Prob > F      =  0.0000
    Residual | 49.4592999   13  3.80456153      R-squared     =  0.9959
    ---------+----------------------------      Adj R-squared =  0.9953
    Total    |      12110   15  807.333333      Root MSE      = 1.950529
                                                Res. dev.     = 60.46465

    2-param. exp. growth curve, y = b1*b2^x

    y  |    Coef.    Std. Err.      t     P>|t|   [95% Conf. Interval]
    ---+---------------------------------------------------------------
    b1 | 58.60656    1.472156    39.81   0.000    55.42616   61.78696
    b2 | .9611869    .0016449   584.36   0.000    .9576334   .9647404

    (SEs, P values, CIs, and correlations are asymptotic approximations)

Heteroskedasticity Correction

1. Prof. Halbert White showed that heteroskedasticity could be handled in a regression with a heteroskedasticity-consistent covariance matrix estimator (Davidson & MacKinnon (1993), Estimation and Inference in Econometrics, Oxford University Press, p. 552).
2. This variance-covariance matrix under ordinary least squares is shown on the next page.

OLS Covariance Matrix Estimator

    Var(b) = (X'X)⁻¹ (X'ΣX) (X'X)⁻¹

where Σ = E(ee'). Under homoskedasticity, Σ = σ²I and this collapses to the usual s²(X'X)⁻¹.

White's Heteroskedasticity-Consistent Estimator

1. White's estimator is for large samples.
2. White's heteroskedasticity-corrected variances and standard errors can be larger or smaller than the OLS variances and standard errors.

Heteroskedastically Consistent Covariance Matrix: the "Sandwich" Estimator (H. White)

    V = n⁻¹ (X'X)⁻¹ (n⁻¹ X'ΩX) (X'X)⁻¹
        [bread]     [meat (tofu)]  [bread]

However, there are different versions of the meat, with diagonal elements:

    HC0: ω_t = e_t²
    HC1: ω_t = e_t² · n/(n − k)
    HC2: ω_t = e_t² / (1 − h_t)
    HC3: ω_t = e_t² / (1 − h_t)²
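
The four HC variants differ only in how each squared residual is rescaled before entering the meat of the sandwich. A small Python sketch of the diagonal terms, using hypothetical residuals and leverages:

```python
# Hypothetical residuals e_t and leverages h_t for n = 4, k = 2.
e = [0.5, -1.0, 0.8, -0.3]
h = [0.2, 0.4, 0.3, 0.1]
n, k = len(e), 2

# Diagonal "meat" terms for each White-type variant.
hc0 = [ei ** 2 for ei in e]                              # raw squares
hc1 = [n / (n - k) * ei ** 2 for ei in e]                # df adjustment
hc2 = [ei ** 2 / (1 - hi) for ei, hi in zip(e, h)]       # leverage-adjusted
hc3 = [ei ** 2 / (1 - hi) ** 2 for ei, hi in zip(e, h)]  # stronger adjustment
```

HC2 and HC3 inflate the contribution of high-leverage observations most, which is why they are often preferred in smaller samples.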

Regression with Robust Standard Errors for Heteroskedasticity

    regress y x1 x2, robust

Options other than robust are hc2 and hc3, referring to the versions mentioned by Davidson and MacKinnon above.

Robust Options for the VCV Matrix in Stata

    regress y x1 x2, hc2
    regress y x1 x2, hc3

These correspond to Davidson and MacKinnon's heteroskedastically consistent vcv versions 2 and 3.

Problems with Autoregressive Errors

1. Problems in estimation with OLS: when there is first-order autocorrelation of the residuals,

    e_t = ρ1 e_{t−1} + v_t

2. Effect on the variance: taking expectations,

    Var(e_t) = ρ1² Var(e_{t−1}) + σ_v², so Var(e_t) = σ_v² / (1 − ρ1²)

Sources of Autocorrelation

1. Lagged endogenous variables
2. Misspecification of the model
3. Simultaneity, feedback, or reciprocal relationships
4. Seasonality or trend in the model

Prais-Winsten Transformation (cont'd)

    Var(e_t) = σ_v² / (1 − ρ²)

It follows that multiplying through by √(1 − ρ²) stabilizes the error variance:

    √(1 − ρ²) Y_t = √(1 − ρ²) a + √(1 − ρ²) b x_t + v_t

    Y_t* = a* + b x_t* + v_t
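
In practice the Prais-Winsten quasi-differencing keeps the first observation by scaling it with √(1 − ρ²) and subtracts ρ times the lag from each later observation. A small Python sketch of that transformation (the function name and data are illustrative, not Stata's internals):

```python
import math

def prais_winsten(series, rho):
    """Quasi-difference a series for AR(1) errors: sqrt(1 - rho^2) * y_1
    for the first observation, y_t - rho * y_{t-1} thereafter."""
    out = [math.sqrt(1 - rho ** 2) * series[0]]
    out += [series[t] - rho * series[t - 1] for t in range(1, len(series))]
    return out

z = prais_winsten([1.0, 2.0, 3.0], 0.5)
```

The same transformation is applied to the regressors (and to a column of ones for the intercept), after which OLS on the starred variables has serially uncorrelated errors if the AR(1) model is correct.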

Autocorrelation of the Residuals: prais & newey Regression

To test whether the variable is autocorrelated:

    tsset time
    corrgram y

Then:

    prais y x1 x2, robust
    newey y x1 x2, lag(1) t(time)

Testing for Autocorrelation of Residuals

    regress mna10 l5sumprc
    predict resid10, residual
    corrgram resid10

Prais-Winsten Regression for AR(1) Errors

Using the robust option here guarantees that the White heteroskedasticity-consistent sandwich variance-covariance estimator will be used in the autoregression procedure.

Newey-West Robust Standard Errors

An autocorrelation correction is added to the meat (or tofu) of the White sandwich estimator by Newey-West; the sandwich form V = (X'X)⁻¹ (X'ΩX) (X'X)⁻¹ is as before.

Central Part of the Newey-West Sandwich Estimator

    X'Ω̂X (Newey-West) = X'Ω̂X (White)
        + n/(n − k) · Σ_{l=1}^{m} Σ_{i=l+1}^{n} [1 − l/(m+1)] e_i e_{i−l} (x_i' x_{i−l} + x_{i−l}' x_i)

where
  k = number of predictors
  l = time lag
  m = maximum time lag
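
The weights 1 − l/(m+1) in that double sum are the Bartlett kernel weights: they taper linearly with the lag, so nearby autocovariances count more than distant ones and the estimate stays positive semi-definite. A one-line Python sketch:

```python
def bartlett_weights(m):
    """Bartlett kernel weights 1 - l/(m+1) applied to the lag-l
    cross-product terms in the Newey-West correction."""
    return [1 - l / (m + 1) for l in range(1, m + 1)]

w = bartlett_weights(4)   # e.g. with maximum lag m = 4
```

With m = 4 the weights decline as 0.8, 0.6, 0.4, 0.2, and any lag beyond m gets weight zero.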

Newey-West Robust Standard Errors

Newey-West standard errors are robust to autocorrelation and heteroskedasticity in time series regression models.

Assume OLS Regression

We regress y on x1 x2 x3 and obtain the following output. Next we examine the residuals.

[Output omitted.]

Residual Assessment

The data set is too small to drop case 21, so I use robust regression.

Robust Regression Algorithm: rreg

1. A regression is performed and absolute residuals are computed:

    r_i = | y_i − x_i b |

2. These residuals are then scaled:

    u_i = r_i / s = | y_i − x_i b | / s

Scaling the Residuals

    s = M / 0.6745

where

    M = med( | r_i − med(r_i) | )

The residuals are scaled by the median absolute deviation from the median residual.
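
Dividing the MAD by 0.6745 makes s a consistent estimate of the standard deviation under normality, while remaining immune to a few huge residuals. A short Python sketch (hypothetical residuals; the helper name is illustrative):

```python
def mad_scale(resid):
    """Robust scale: median absolute deviation from the median
    residual, divided by 0.6745 for consistency under normality."""
    r = sorted(resid)
    n = len(r)
    med = r[n // 2] if n % 2 else 0.5 * (r[n // 2 - 1] + r[n // 2])
    dev = sorted(abs(e - med) for e in resid)
    mad = dev[n // 2] if n % 2 else 0.5 * (dev[n // 2 - 1] + dev[n // 2])
    return mad / 0.6745

s = mad_scale([1.0, 2.0, 3.0, 4.0, 100.0])   # gross error barely moves s
```

For these values the median is 3 and the MAD is 1, so s ≈ 1.48; the ordinary standard deviation, by contrast, would be blown up to roughly 44 by the single gross error.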

Essential Algorithm

The estimator of the parameter b minimizes the sum of a less rapidly increasing function of the residuals (SAS Institute, The Robustreg Procedure, draft copy, p. 3505, forthcoming):

    Q(b) = Σ ρ( r_i / σ )

where r_i = y_i − x_i b, and σ is estimated by s.

Essential Algorithm (cont'd)

1. If this were OLS, ρ would be a quadratic function.
2. If we can ascertain s, we can, by taking the derivatives with respect to b, find a first-order solution:

    Σ_{i=1}^{n} ψ( r_i / s ) x_ij = 0,   j = 1, ..., p

where ψ = ρ'.

Case Weights Are Developed from Weight Functions

1. Case weights are formed based on those residuals.
2. Weight functions for those case weights are first the Huber weights and then the Tukey bisquare (biweight) weights.
3. A weighted regression is rerun with the case weights.

Iteratively Reweighted Least Squares

The case weight w(x) is defined as:

    w(x) = ψ(x) / x

It is updated at each iteration until it converges on a value and the change from iteration to iteration declines below a criterion.

Weight Functions for Reducing Outlier Influence

c is the tuning constant used in determining the case weights. For the Huber weights, c = 1.345 by default.

Weight Functions: Tukey Biweight (Bisquare)

c is also the biweight tuning constant; c is set at 4.685 for the biweight.
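
The two weight functions can be sketched directly in Python (the standard textbook forms, with the default tuning constants named above; u is the scaled residual r/s):

```python
def huber_weight(u, c=1.345):
    """Huber case weight: 1 inside the tuning band, c/|u| outside,
    so large residuals are downweighted but never fully discarded."""
    return 1.0 if abs(u) <= c else c / abs(u)

def biweight(u, c=4.685):
    """Tukey bisquare weight: smooth descent to 0; residuals beyond
    |u| = c receive zero weight and are effectively ignored."""
    return (1 - (u / c) ** 2) ** 2 if abs(u) <= c else 0.0
```

This shows the qualitative difference: Huber weights bound the influence of an outlier, while the bisquare eventually rejects it outright.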

Tuning Constants

- When the residuals are normally distributed and the tuning constants are set at the default, they give the procedure about 95% of the efficiency of OLS.
- The tuning constants may be adjusted to provide more downweighting of the outliers at the expense of Gaussian efficiency.
- Higher tuning constants cause the estimator to more closely approximate OLS.

Robust Regression Algorithm (cont'd)

3. WLS regression is performed using those case weights.
4. Iterations cease when the change in case weights drops below a tolerance level.
5. Weights are based initially on Huber weights; then Beaton and Tukey biweights are used.
6. Caveat: M estimation is not that robust with regard to leverage points.
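
The whole loop can be illustrated for the simplest possible model, a single location parameter: start from the median, compute residuals, re-estimate the robust scale, form Huber weights, and take a weighted mean, repeating until stable. This is only a one-parameter sketch of the reweighting idea (rreg iterates the same way over the full coefficient vector, and also switches to biweights):

```python
def irls_center(y, c=1.345, iters=25):
    """Huber IRLS for a single location parameter: a sketch of the
    reweighting loop used (per coefficient vector) by rreg."""
    mu = sorted(y)[len(y) // 2]              # start from the median
    for _ in range(iters):
        r = [v - mu for v in y]              # residuals at current fit
        dev = sorted(abs(e) for e in r)
        s = dev[len(dev) // 2] / 0.6745 or 1.0   # MAD-based robust scale
        # Huber case weights on the scaled residuals r/s.
        w = [1.0 if abs(e) <= c * s else c * s / abs(e) for e in r]
        mu = sum(wi * v for wi, v in zip(w, y)) / sum(w)
    return mu

est = irls_center([1.0, 2.0, 3.0, 4.0, 100.0])
```

The gross error at 100 gets a tiny weight, so the estimate settles near 3, whereas the ordinary mean of these values is 22.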

Robust Regression for Down-weighting Outliers

    rreg y x1 x2 x3

rreg uses Huber and Tukey biweights to downweight the influence of outliers in the estimation of the mean of y; its output is shown in the upper panel, whereas OLS regression is given in the lower panel.

A Corrective Option for Nonnormality of the Residuals

1. Quantile regression (median regression is the default) is one option.
2. Algorithm:
   1. Minimizes the sum of the absolute residuals.
   2. The residual in this case is the value minus the unconditional median.
   3. This produces a formula that predicts the median of the dependent variable:

    Y_med = a + bx

Quantile Regression

qreg in STATA estimates least absolute value (LAV, MAD, or L1-norm) regression. The algorithm minimizes the sum of the absolute deviations about the median. The formula generated estimates the median, rather than the mean that rreg estimates:

    Y_median = constant + bx
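
The key fact behind the L1 criterion is that, among all constants, the sample median minimizes the sum of absolute deviations, which is why the default qreg fit runs through conditional medians rather than means. A brute-force Python check with hypothetical data (one gross error included):

```python
# Hypothetical y values, with one gross error at 100.
y = [1.0, 2.0, 3.0, 4.0, 100.0]

def l1_loss(m, ys):
    """Sum of absolute deviations of ys about the constant m."""
    return sum(abs(v - m) for v in ys)

# Grid search over [0, 100]; the minimizer is the median, 3.0,
# while the mean (22.0) would be dragged toward the gross error.
candidates = [v / 10 for v in range(0, 1001)]
best = min(candidates, key=lambda m: l1_loss(m, y))
```

The minimizing constant is 3.0 (the median), completely unaffected by how far the outlier sits from the rest of the data.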

Median Regression

[Output omitted.]

Bootstrapping

- Bootstrapping may be used to obtain empirical regression coefficients, standard errors, confidence intervals, etc., when the distribution is non-normal.
- Bootstrapping may be applied to qreg with bsqreg.

Bootstrapping Quantile or Median Regression Standard Errors

    qreg y x1 x2 x3
    bsqreg y x1 x2 x3, reps(1000)
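
The mechanics of the case bootstrap are the same regardless of the estimator: resample (x, y) pairs with replacement, refit, and read the empirical percentiles of the refitted coefficients. A minimal Python sketch for the OLS slope (hypothetical data; bsqreg does the analogous thing for the quantile-regression coefficients):

```python
import random

random.seed(1)
data = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]  # hypothetical pairs

def slope(pairs):
    """Simple OLS slope for a list of (x, y) pairs."""
    n = len(pairs)
    xb = sum(x for x, _ in pairs) / n
    yb = sum(y for _, y in pairs) / n
    sxx = sum((x - xb) ** 2 for x, _ in pairs)
    return sum((x - xb) * (y - yb) for x, y in pairs) / sxx

boot = []
for _ in range(500):
    sample = [random.choice(data) for _ in data]   # resample cases
    if len({x for x, _ in sample}) > 1:            # need variation in x
        boot.append(slope(sample))
boot.sort()
# Empirical 95% percentile interval for the slope.
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
```

The full-sample slope here is 1.97, and the bootstrap replicates cluster tightly around it because the data are nearly collinear; with skewed or heavy-tailed errors the percentile interval can be usefully asymmetric, which is the point of bootstrapping.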

Methods of Model Validation

These methods may be necessary where the sampling distributions of the parameters

