Missing Data Using Stata - Statistical Horizons


Missing Data Using Stata
Paul Allison, Ph.D.
Upcoming Seminar: August 15-16, 2017, Stockholm, Sweden

Contents
1 Missing Data Using Stata
2 Basics
3 For Further Reading
4 Many Methods
5 Assumptions
6 Assumptions
7 Ignorability
8 Assumptions
9 Listwise Deletion (Complete Case)
10 Listwise Deletion (continued)
11 Listwise Deletion (continued)
12 Pairwise Deletion (Available Case)
13 Dummy Variable Adjustment
14 Imputation
15 Maximum Likelihood
16 Properties of Maximum Likelihood
17 ML with Ignorable Missing Data
18 ML for 2 x 2 Contingency Table
19 Maximizing the Likelihood with ℓEM
20 ML for Quantitative Variables
21 EM Algorithm
22 EM for Multivariate Normal Data
23 EM for Multivariate Normal Data
24 College Example
25 Preliminary Analysis 1
26 Preliminary Analysis 2
27 Preliminary Analysis 3
28 EM in Stata
29 Convert Covariances to Correlations
30 EM As Input to regress
31 Direct ML
32 Direct ML
33 Direct ML (cont.)
34 SEM without Auxiliary Variable
35 SEM with Auxiliary Variable
36 SEM Output with Auxiliary Variable
37 Compare with Listwise Deletion
38 Regression with Mplus
39 Regression with Mplus
40 Logistic Regression with Mplus
41 Other Capabilities of Mplus
42 ML for Repeated Measures Data
43 Binary Example
44 Estimation in Stata
45 Figure 1
46 Limitations of Maximum Likelihood
47 Multiple Imputation
48 Regression Imputation
49 Adding a Random Component
50 Multiple, Random Imputations
51 Combining the Imputations
52 Formula for Standard Error
53 Random Variation in Parameters
54 Monotonic Missing Data
55 MI for Monotone Missing Data
56 Non-Monotone Missing Data
57 Two Iterative Solutions
58 MCMC
59 MCMC for Multivariate Normal
60 Software
61 Steps for MCMC in Stata
62 MCMC With Stata
63 Stata Output 1
64 Stata Output 2
65 Formulas
66 Imputation with the Dependent Variable
67 Should Missing Data on the Dependent Variable Be Imputed?
68 How Many Data Sets?
69 Options for mi impute mvn
70 Change the Number of Iterations
71 Change the Prior Distribution
72 Categorical Variables
73 Categorical Variables (cont.)
74 Some Things NOT to Do
75 Fully Conditional Specification
76 Logit Imputation of a Binary Variable
77 Predictive Mean Matching
78 Fill-In Phase of FCS
79 Imputation Phase of FCS
80 Downside of FCS
81 Software
82 FCS in Stata for NLSY Data
83 Impute Output
84 Estimate Output
85 Test Output
86 mi estimate with Other Commands
87 Multi-Parameter Inference
88 Restricted FMI Test
89 Unrestricted FMI Test
90 mi test command
91 Combining Chi-Squares
92 Stats Not Reported by mi estimate
93 mibeta for R-square & Standardized
94 mibeta Output
95 Interactions and Nonlinearities
96 Interaction Results
97 Imputation Model vs. Analysis Model
98 MI for Panel Data
99 Hip Fracture Example
100 Imputing Clustered Data in Stata
101 Imputation with Cluster Dummies
102 Imputation in Wide Form
103 Imputation Via Random Effects
104 Hip Fracture Example (cont.)
105 Why Didn't Imputation Do Better?
106 Nonignorable Missing Data
107 Nonignorable Missing Data
108 Heckman's Model for Selection Bias
109 Heckman's Model in Stata
110 Heckman's Model (cont.)
111 Pattern-Mixture Models with MI
112 MI for Pattern-Mixture Models
113 Summary and Review
114 Summary and Review

Missing Data Using Stata
Paul D. Allison, Ph.D.
February 2016
www.StatisticalHorizons.com

Basics
Definition: Data are missing on some variables for some observations.
Problem: How to do statistical analysis when data are missing?
Three goals:
- Minimize bias
- Maximize use of available information
- Get good estimates of uncertainty
NOT a goal: imputed values "close" to real values.

For Further Reading
Also: Allison, Paul D. (2009) "Missing Data." Pp. 72-89 in The SAGE Handbook of Quantitative Methods in Psychology, edited by Roger E. Millsap and Alberto Maydeu-Olivares. Thousand Oaks, CA: Sage Publications. ds/2012/01/MilsapAllison.pdf

Many Methods
Conventional:
- Listwise deletion (complete case analysis)
- Pairwise deletion (available case analysis)
- Dummy variable adjustment
- Imputation: replacement with means, regression, hot deck
Novel:
- Maximum likelihood
- Multiple imputation
- Inverse probability weighting (not discussed here)

Assumptions: Missing completely at random (MCAR)
Suppose some data are missing on Y. These data are said to be MCAR if the probability that Y is missing is unrelated to Y or to other variables X (where X is a vector of observed variables):

Pr(Y is missing | X, Y) = Pr(Y is missing)

MCAR is the ideal situation.
- What variables must be in the X vector? Only variables in the model of interest.
- If data are MCAR, the complete-data subsample is a random sample from the original target sample.
- MCAR allows for the possibility that missingness on one variable may be related to missingness on another; e.g., sets of variables may always be missing together.

Assumptions: Missing at random (MAR)
Data on Y are missing at random if the probability that Y is missing does not depend on the value of Y, after controlling for observed variables:

Pr(Y is missing | X, Y) = Pr(Y is missing | X)

E.g., the probability of missing income depends on marital status, but within each marital status the probability of missing income does not depend on income.
- Considerably weaker assumption than MCAR.
- Only X's in the model must be considered. But including other X's (correlated with Y) can make MAR more plausible.
- Can test whether missingness on Y depends on X.
- Cannot test whether missingness on Y depends on Y.

Ignorability
The missing data mechanism is said to be ignorable if:
- The data are missing at random, and
- Parameters that govern the missing data mechanism are distinct from the parameters to be estimated (unlikely to be violated).
In practice, "MAR" and "ignorable" are used interchangeably.
If MAR but not ignorable (parameters not distinct), methods assuming ignorability would still be good, just not optimal.
If the missing data mechanism is ignorable, there is no need to model it.
Any general-purpose method for handling missing data must assume that the missing data mechanism is ignorable.

Assumptions: Not missing at random (NMAR)
If the MAR assumption is violated, the missing data mechanism must be modeled to get good parameter estimates.
Heckman's regression model for sample selection bias is a good example.
Effective estimation for NMAR missing data requires very good prior knowledge about the missing data mechanism:
- Data contain no information about what models would be appropriate.
- No way to test goodness of fit of the missing data model.
- Results are often very sensitive to choice of model.
Listwise deletion is able to handle one important kind of NMAR.

Listwise Deletion (Complete Case)
Delete any unit with any missing data (only use complete cases).
Strengths:
- Easy to implement.
- Works for any kind of statistical analysis.
- If data are MCAR, does not introduce any bias in parameter estimates.
- Standard error estimates are appropriate.

Listwise Deletion (continued)
Weaknesses:
- May delete a large proportion of cases, resulting in loss of statistical power.
- May introduce bias if MAR but not MCAR.
Robust to NMAR for predictor variables in regression analysis: let Y be the dependent variable in a regression (any kind) and X one of the predictors. Suppose

Pr(X is missing | X, Y) = Pr(X is missing | X)

Then listwise deletion will not introduce bias.

Listwise Deletion (continued)
Example: Estimate a regression with number of children as the dependent variable and income as an independent variable.
- 30% of cases have missing data on income; persons with high income are less likely to report income.
- But the probability of missing income does not depend on number of children.
- Then listwise deletion will not introduce any bias into estimates of the regression coefficients.
For logistic regression, listwise deletion is robust to NMAR on the independent OR the dependent variable (but not both).
Caveat: This property of listwise deletion presumes that regression coefficients are invariant across subgroups (no omitted interactions).

Pairwise Deletion (Available Case)
For linear models, parameters are functions of means, variances, and covariances (moments):
- Estimate each moment with all available nonmissing cases.
- Plug the moment estimates into formulas for the parameters.
Strengths:
- Approximately unbiased if MCAR.
- Uses all available information.
Weaknesses:
- Standard errors incorrect (no appropriate sample size).
- Biased if MAR but not MCAR.
- May break down (correlation matrix not positive definite).
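The robustness property above can be checked with a small simulation. This is a hypothetical sketch (Python for convenience, rather than Stata; all numbers invented): missingness on the predictor x depends on x itself, which is NMAR for x, yet the complete-case regression recovers the true coefficients.

```python
import random

random.seed(42)
n = 20000
b0, b1 = 1.0, 2.0
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    y = b0 + b1 * x + random.gauss(0, 1)
    data.append((x, y))

# Missingness on x depends on x itself (NMAR for x, but not on y):
# all high-x cases are deleted, about 30% of the sample.
complete = [(x, y) for x, y in data if x < 0.5]

# OLS on the complete cases only (listwise deletion).
mx = sum(x for x, _ in complete) / len(complete)
my = sum(y for _, y in complete) / len(complete)
sxx = sum((x - mx) ** 2 for x, _ in complete)
sxy = sum((x - mx) * (y - my) for x, y in complete)
slope = sxy / sxx
intercept = my - slope * mx
print(slope, intercept)  # close to the true values 2.0 and 1.0
```

The slope stays unbiased because deleting on x leaves E[y | x] unchanged for the cases that remain.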

Dummy Variable Adjustment
A popular method for handling missing data on predictors in regression analysis (Cohen and Cohen 1985).
In a regression predicting Y, suppose there is missing data on a predictor X:
1. Create a new variable D = 1 if X is missing and D = 0 if X is present.
2. When X is missing, set X = c, where c is some constant (e.g., the mean of X).
3. Regress Y on both X and D (and any other variables).
Produces biased coefficient estimates (Jones, JASA, 1996). So does a related method: for categorical variables, create a separate missing-data category.
But it may be appropriate for "doesn't apply" missing data, and may also be useful for predictive modeling with missing data.

Imputation
Any method that substitutes estimated values for missing values:
- Replacement with means
- Regression imputation (replace with conditional means)
Problems:
- Often leads to biased parameter estimates (e.g., variances).
- Usually leads to standard error estimates that are biased downward.
- Treats imputed data as real data; ignores the inherent uncertainty in imputed values.
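The variance bias from mean imputation is easy to demonstrate. The sketch below (a hypothetical Python illustration, not code from the seminar; the data and missingness rate are invented) makes 40% of an MCAR sample missing, mean-imputes it, and compares variances.

```python
import random
import statistics

random.seed(1)
# Fully observed sample, then ~40% of values made missing completely at random.
full = [random.gauss(50, 10) for _ in range(5000)]
vals = [v if random.random() > 0.4 else None for v in full]

obs = [v for v in vals if v is not None]
mean_obs = statistics.mean(obs)

# Mean imputation: replace every missing value with the observed mean.
imputed = [v if v is not None else mean_obs for v in vals]

var_full = statistics.pvariance(full)
var_imp = statistics.pvariance(imputed)
print(var_full, var_imp)  # the mean-imputed variance is markedly smaller
```

With a fraction f of values imputed at the mean, the variance shrinks by roughly a factor (1 - f), since the imputed values contribute nothing to the spread.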

Maximum Likelihood
Choose as parameter estimates those values which, if true, would maximize the probability of observing what has, in fact, been observed.
Likelihood function: Expresses the probability of the data as a function of the data and the unknown parameter values.
Example: Let p(y | θ) be the probability density for y, given θ (a vector of parameters). For a sample of n independent observations, the likelihood function is

$L(\theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$

Properties of Maximum Likelihood
To get ML estimates, we find the value of θ that maximizes the likelihood function.
Under the usual conditions, ML estimates have the following properties:
- Consistent (implies approximately unbiased in large samples)
- Asymptotically efficient
- Asymptotically normal
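As a toy illustration of "choose the value that maximizes the likelihood" (a hypothetical example, not from the seminar; the counts are invented), the sketch below finds the Bernoulli ML estimate by brute-force search over a grid of candidate values, recovering the familiar sample proportion.

```python
import math

# Hypothetical data: 7 successes in 10 Bernoulli trials.
successes, n = 7, 10

def loglik(p):
    # Log-likelihood of the sample under success probability p.
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Evaluate the log-likelihood on a fine grid and keep the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)
print(p_hat)  # 0.7, the sample proportion
```

In this case the maximizer has a closed form (the sample proportion), but the same "maximize over θ" logic carries over to models where no closed form exists.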

ML with Ignorable Missing Data
Suppose we have 2 discrete variables X and Y, and there is ignorable missing data on X. Let p(x, y | θ) be the joint probability function.
For a single observation with X missing, the likelihood is the marginal probability

$p(y \mid \theta) = \sum_{x} p(x, y \mid \theta)$

The likelihood for the entire sample, with m complete cases out of n, is

$L(\theta) = \prod_{i=1}^{m} p(x_i, y_i \mid \theta) \prod_{i=m+1}^{n} \sum_{x} p(x, y_i \mid \theta)$

This likelihood may be maximized like any other.

ML for 2 x 2 Contingency Table

Vote    Male   Female
Yes      36      22
No       37      52

Furthermore, voting was missing for 10 males and 15 females.
The parameters are p11, p12, p21, p22. If we exclude cases with missing data, the likelihood is

(p11)^36 (p12)^37 (p21)^22 (p22)^52

If we allow for missing data, the likelihood is

(p11)^36 (p12)^37 (p21)^22 (p22)^52 (p11 + p12)^10 (p21 + p22)^15
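Because sex is always observed and only vote is missing, the likelihood above factors into a piece for P(sex) and a piece for P(vote | sex), so the ML estimates have a closed form: P(sex) uses everyone, P(vote | sex) uses the complete cases. This Python sketch (illustrative, not seminar code) reproduces the four cell probabilities from those counts.

```python
# Counts from the slide: male yes 36, male no 37, female yes 22, female no 52,
# plus 10 males and 15 females with vote missing.
m_yes, m_no, f_yes, f_no, m_miss, f_miss = 36, 37, 22, 52, 10, 15
n = m_yes + m_no + f_yes + f_no + m_miss + f_miss  # 172

# The likelihood factors, so ML of P(male) uses all 172 cases,
# while ML of P(yes | sex) uses only the cases with vote observed.
p_male = (m_yes + m_no + m_miss) / n
p_yes_m = m_yes / (m_yes + m_no)
p_yes_f = f_yes / (f_yes + f_no)

p11 = p_male * p_yes_m            # P(male, yes)
p12 = p_male * (1 - p_yes_m)      # P(male, no)
p21 = (1 - p_male) * p_yes_f      # P(female, yes)
p22 = (1 - p_male) * (1 - p_yes_f)
print(f"{p11:.4f} {p12:.4f} {p21:.4f} {p22:.4f}")  # 0.2380 0.2446 0.1538 0.3636
```

In effect, the 25 cases with missing votes are distributed across the vote categories in proportion to the complete-case rates within their own sex, which is exactly what an EM algorithm for this table converges to.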

Maximizing the Likelihood with ℓEM
Freeware for Windows by Jeroen Vermunt.
Input file (sections man, dim, lab, sub, mod, dat) with the data

dat [36 37 22 52 10 15]

Output, P(sv) estimates with standard errors:

s v
1 1   0.2380  (0.0339)
1 2   0.2446  (0.0342)
2 1   0.1538  (0.0297)
2 2   0.3636  (0.0384)

ℓEM fits a large class of models for categorical data, including log-linear, logit, latent class, and discrete-time event history models.

ML for Quantitative Variables
Assume multivariate normality, which implies:
- All variables are normally distributed.
- All conditional expectation functions are linear.
- All conditional variance functions are homoscedastic.
A strong assumption, but widely invoked as the basis for multivariate analysis.
Several ways to get ML estimates with missing data, based on this assumption:
- Factoring the likelihood for monotone missing data patterns
- EM algorithm
- Direct maximization of the likelihood

EM Algorithm
A general approach to getting ML estimates with missing data.
Two-step procedure:
1. Expectation (E): Find the expected value of the log-likelihood for the observed data, based on current parameter values.
2. Maximization (M): Maximize the expected log-likelihood to get new parameter estimates.
Repeat until convergence.
For multivariate normal data, the parameters are means, variances, and covariances.

EM for Multivariate Normal Data
1. Choose starting values for the means and covariance matrix.
2. If data are missing on x, use current values of the parameters to calculate the linear regression of x on all variables present for each case.
3. Use the linear regressions to impute values of x. (E-step)
4. After all data have been imputed, recalculate the means and covariance matrix, with corrections for variances and covariances (see next slide). (M-step)
5. Repeat steps 2-4 until convergence.

EM for Multivariate Normal Data
Correction: Suppose X was imputed using variables W and Z. Let S2x.wz be the residual variance from that regression. Then, in calculating the variance for X, wherever you would use xi^2, substitute xi^2 + S2x.wz.
For covariances between two variables with missing values, there's a similar correction in which you add the residual covariance.
The EM algorithm for multivariate normal data is available in many commercial software packages: SPSS, Systat, SAS, S-Plus, Stata.

College Example
1994 U.S. News Guide to Best Colleges:
- 1,302 four-year colleges in the U.S.
- Goal: estimate a regression model predicting graduation rate (# graduating / # enrolled 4 years earlier, x 100).
- 98 colleges have missing data on graduation rate.
Independent variables:
- 1st-year enrollment (logged; 5 cases missing)
- Room & board fees (40% missing)
- Student/faculty ratio (2 cases missing)
- Private = 1, Public = 0
- Mean combined SAT score (40% missing)
Auxiliary variable: Mean ACT score (45% missing)
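Steps 1-5 and the variance correction can be sketched for the simplest case: two variables, with y sometimes missing at random given x. This is a hypothetical Python simulation, not code from the seminar; the true values are mu_y = 1, var(y) = 1, cov(x, y) = 0.8, and the missingness rate depends only on the observed x.

```python
import random

random.seed(7)
n = 4000
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    y = 1.0 + 0.8 * x + random.gauss(0, 0.6)
    xs.append(x)
    # MAR: whether y is missing depends only on the observed x, never on y.
    ys.append(None if random.random() < (0.6 if x > 0 else 0.2) else y)

# x is fully observed, so its moments are fixed.
mu_x = sum(xs) / n
sxx = sum((x - mu_x) ** 2 for x in xs) / n

# Start from (biased) complete-case moments for y.
obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
mu_y = sum(y for _, y in obs) / len(obs)
syy = sum((y - mu_y) ** 2 for _, y in obs) / len(obs)
sxy = sum((x - mu_x) * (y - mu_y) for x, y in obs) / len(obs)

for _ in range(200):
    # Current regression of y on x implied by the current moments.
    beta = sxy / sxx
    alpha = mu_y - beta * mu_x
    resvar = syy - sxy ** 2 / sxx  # residual variance of y given x
    # E-step: expected sufficient statistics; imputed y's get the
    # residual-variance correction when squared.
    s_y = s_yy = s_xy = 0.0
    for x, y in zip(xs, ys):
        if y is None:
            yhat = alpha + beta * x
            s_y += yhat
            s_yy += yhat ** 2 + resvar   # the correction from this slide
            s_xy += x * yhat
        else:
            s_y += y
            s_yy += y ** 2
            s_xy += x * y
    # M-step: recompute mean, variance, and covariance.
    mu_y = s_y / n
    syy = s_yy / n - mu_y ** 2
    sxy = s_xy / n - mu_x * mu_y

print(mu_y, syy, sxy)  # near the true values 1.0, 1.0, 0.8
```

Without the `+ resvar` term the variance of y would be understated, which is exactly the bias that naive regression imputation produces.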

Preliminary Analysis 1
use c:\data\college.dta, clear
mi set wide
This declares the data to be a missing data set. It also specifies that imputed data are to be stored in the wide format. There are four different storage formats, but how the data are stored usually doesn't matter, and we're not imputing yet anyway.
mi misstable summarize
This requests basic descriptive statistics.

Preliminary Analysis 2

Variable    Missing (Obs=.)   Not missing (Obs<.)   Unique values        Min        Max
gradrat            98               1,204                 89               8        118
lenroll             5               1,297                500        2.890372   8.912608
rmbrd             519                 783                500            1.26        8.7
stufac              2               1,300                208             2.3       91.8
csat              523                 779                339             600       1410
act               588                 714

Preliminary Analysis 3
mi misstable patterns
Missing-value patterns (1 means complete). Variables are (1) stufac (2) lenroll (3) gradrat (4) rmbrd (5) csat (6) act:

Percent    1 2 3 4 5 6
  23%      1 1 1 1 1 1
  12%      1 1 1 0 1 1
  12%      1 1 1 1 1 0
  12%      1 1 1 1 0 0
   9%      1 1 1 1 0 1
   9%      1 1 1 0 1 0
   8%      1 1 1 0 0 0
   6%      1 1 1 0 0 1
  (remaining patterns 1% or less each)
 100%

EM in Stata
mi register imputed gradrat lenroll rmbrd stufac csat act private
mi impute mvn gradrat lenroll rmbrd stufac csat act private, emonly

These are the maximum likelihood estimates of the means and the covariance matrix (lower triangle shown; the private row is cut off in the output):

Sigma     gradrat    lenroll    rmbrd      stufac     csat      act       private
gradrat   355.7137
lenroll   -.4998451  .9936801
rmbrd     10.38471   -.0188409  1.32903
stufac    -31.14171  1.382231   -1.685404  26.88555
csat      1352.981   23.23804   67.11875   -198.4039  14745.07
act       30.58451   .4695323   1.514341   -4.121786  298.9068  7.353064
private   3.608253   -.2964039  .1885311   -.9156043  9.381542  .29118    ...

Convert Covariances to Correlations
ML covariance matrix to ML correlation matrix:

matrix Sigma = r(Sigma_em)
matrix M = r(Beta_em)    // we'll need these means later
getcovcorr Sigma, corr
matrix C = r(C)
matlist C

          gradrat    lenroll    rmbrd      stufac     csat      act       private
gradrat   1
lenroll   -.0265865  1
rmbrd     .4776137   -.0163951  1
stufac    -.3184437  .2674224   -.2819532  1
csat      .5907693   .1919786   .4794608   -.3151137  1
act       .598022    .1737033   .4844202   -.2931513  .9077751  1
private   .3983337   -.6191004  .3404992   -.367662   .1608612  .2235773  1

EM As Input to regress
corr2data gradrat lenroll rmbrd stufac csat act private, cov(Sigma) mean(M) clear
regress gradrat lenroll rmbrd stufac csat private

This produces ML estimates of the regression coefficients. But the standard errors and associated statistics are incorrect, because the sample size is taken to be 1,302.

gradrat       Coef.     Std. Err.      t     P>|t|    [95% Conf. Interval]
lenroll    2.083176     .5393847    3.86    0.000     1.025013   3.141339
rmbrd      2.403941     .4000983    6.01    0.000     1.61903    3.188852
stufac     -.1813901    .0841226   -2.16    0.031    -.3464216  -.0163587
csat        .066875     .0039007   17.14    0.000     .0592227   .0745273
private    12.91442     1.146564   11.26    0.000     10.66509   15.16374
_cons     -32.39475     4.354628   -7.44    0.000    -40.93764  -23.85186

The coefficients are ML estimates; the standard errors, t-statistics, and p-values are biased.
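The corr2data trick works because, for linear regression, the coefficients are functions only of the means and covariance matrix, so ML estimates of the moments yield ML estimates of the coefficients. A hypothetical two-predictor sketch in Python (the covariance matrix and means are made up for illustration, ordered y, x1, x2) shows the computation that regress performs implicitly: solve the normal equations Sxx b = sxy.

```python
# Made-up ML moment estimates, ordered (y, x1, x2).
S = [[4.0, 1.2, 0.8],
     [1.2, 1.0, 0.3],
     [0.8, 0.3, 2.0]]
means = [10.0, 2.0, 5.0]

# Normal equations: [ [s11, s12], [s12, s22] ] [b1, b2]' = [c1, c2]'
s11, s12, s22 = S[1][1], S[1][2], S[2][2]   # predictor covariance matrix
c1, c2 = S[0][1], S[0][2]                   # covariances of y with x1, x2

# Solve the 2x2 system by Cramer's rule.
det = s11 * s22 - s12 * s12
b1 = (c1 * s22 - s12 * c2) / det
b2 = (s11 * c2 - s12 * c1) / det
b0 = means[0] - b1 * means[1] - b2 * means[2]  # intercept from the means
print(round(b1, 4), round(b2, 4), round(b0, 4))  # 1.1309 0.2304 6.5864
```

This is also why the EM means matter (the `mean(M)` option): the slopes need only the covariances, but the intercept needs the means.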

Direct ML
Also known as "raw ML" or "full information ML" (FIML).
Directly maximize the likelihood for the specified model.
Several structural equation modeling (SEM) packages can do this for a large class of linear models:
- Mplus (www.statmodel.com)
- LISREL (www.ssicentral.com/lisrel)
- OpenMx (R package) (openmx.psyc.virginia.edu)
- EQS (www.mvsoft.com)
- PROC CALIS (support.sas.com)
- Stata sem (www.stata.com)
- lavaan (R package) (lavaan.ugent.be)

Direct ML
With no missing data, the multivariate normal log-likelihood is

$\log L(\mu, \Sigma) = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)'\Sigma^{-1}(y_i - \mu) + \text{constant}$

where $y_i$ is the vector of observed variables for case i, with mean vector $\mu$ and covariance matrix $\Sigma$.

