Econometrics Course: Cost As The Dependent Variable (II)


Econometrics Course: Cost as the Dependent Variable (II)
Paul G. Barnett, PhD
April 10, 2019

POLL QUESTION #1
Which method(s) have you used to evaluate health care costs? (answer all that apply)
  – None yet
  – Rank test (non-parametric method)
  – Ordinary Least Squares regression with raw cost
  – OLS log transformed cost
  – GLM model (gamma regression)

Health care costs are difficult to analyze
  – Skewed by rare but extremely high cost events
  – Zero cost incurred by enrollees who don’t use care
  – No negative values
  – Variance can vary with independent variable

Limitations of Ordinary Least Squares (OLS)
OLS with raw cost
  – Non-normal dependent variable can generate biased parameters
  – Can predict negative costs
OLS with log transformation of cost
  – Log cost is closer to normally distributed than raw cost, so it can be used in OLS
  – Predicted cost is affected by retransformation bias
  – Can’t take log of zero
  – Assumes variance of errors is constant

Topics for today’s course
What is heteroscedasticity, and what should be done about it?
What should be done when there are many zero values?
How to test differences in groups with no assumptions about distribution?
How to determine which model is best?


What is heteroscedasticity?
Heteroscedasticity
  – Variance depends on x (or on predicted y)
  – For example, the variation in income increases with age
OLS assumes homoscedasticity
  – Identical variance: $E(\varepsilon_i^2) = \sigma^2$

Homoscedasticity
  – Errors have identical variance: $E(\varepsilon_i^2) = \sigma^2$
[Figure: plot of errors e (axis from -4 to 4) against values from 0 to 20,000]

Heteroscedasticity
  – Errors depend on x (or on predicted y)
[Figure: plot of errors e (axis from -4 to 4) against values from 0 to 20,000]

Why worry about heteroscedasticity?
Predictions based on an OLS model can be biased
Retransformation assumes homoscedastic errors
Predicted cost when the error is heteroscedastic can be “appreciably biased”

What should be done about heteroscedasticity?
Use a Generalized Linear Model (GLM)
Analyst specifies a link function g( )
Analyst specifies a variance function
  – Key reading: “Estimating log models: to transform or not to transform?” Manning and Mullahy, J Health Econ 20:461, 2001

Link function g( ) in GLM
$g(E(y \mid x)) = \alpha + \beta x$
Link function can be natural log, square root, or other function
  – E.g. $\ln(E(y \mid x)) = \alpha + \beta x$
  – When the link function is natural log, β approximates the proportional change in y for a unit change in x (100·β percent when β is small)

GLM vs. OLS
OLS of log estimate: $E(\ln(y) \mid x)$
GLM estimate: $\ln(E(y \mid x))$
  – Log of expectation of y is not the same as expectation of log y!
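The difference between the two targets can be checked with a few numbers. The sketch below uses three made-up cost values (an assumption for illustration, not course data) and compares the log of the mean with the mean of the logs.

```python
import math

# Hypothetical costs for illustration only (not from the course data).
costs = [100.0, 1_000.0, 10_000.0]

# Target of a log-link GLM: the log of the expected cost.
arithmetic_mean = sum(costs) / len(costs)
log_of_mean = math.log(arithmetic_mean)

# Target of OLS on log cost: the expected value of log cost.
mean_of_logs = sum(math.log(c) for c in costs) / len(costs)

# Exponentiating the mean of logs gives the geometric mean, which is
# never larger than the arithmetic mean (Jensen's inequality).
geometric_mean = math.exp(mean_of_logs)

print(f"arithmetic mean = {arithmetic_mean:.1f}")
print(f"geometric mean  = {geometric_mean:.1f}")
```

With skewed values like these, exponentiating the OLS log-scale mean understates the arithmetic mean, which is why a retransformation correction is needed after OLS of log cost.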

GLM advantages
Dependent variable can be zero
No retransformation bias when predicting
  – Smearing estimator is not used
Does not assume homoscedastic errors

GLM variance function
GLM does not assume constant variance
GLM assumes there is a function that explains the relationship between the variance and the mean
  – $\mathrm{var}(y \mid x)$

Variance assumptions for GLM cost models
Gamma distribution (most common)
  – Variance is proportional to the square of the mean
Poisson distribution
  – Variance is proportional to the mean
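The two assumptions can be written as one power-of-the-mean relationship. The symbols $\phi$ (a proportionality constant) and $p$ (the power) are notation introduced here for illustration; they do not appear on the slides.

```latex
\operatorname{var}(y \mid x) = \phi \, \bigl[ E(y \mid x) \bigr]^{p},
\qquad
p = 1 \ \text{(Poisson)}, \quad p = 2 \ \text{(gamma)}
```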

Estimation methods
How to specify log link and gamma distribution with dependent variable COST and independent variables X1, X2, X3

GLM with log link and gamma distribution in Stata
    glm COST X1 X2 X3, fam(gam) link(log)

GLM with log link and gamma distribution in SAS
Basic syntax (drops zero cost observations)
    PROC GENMOD;
    MODEL COST = X1 X2 X3 / DIST=GAMMA LINK=LOG;
Refined syntax (keeps zero cost observations)
    PROC GENMOD;
    A = _MEAN_;
    B = _RESP_;
    D = B/A + LOG(A);
    VARIANCE VAR = A**2;
    DEVIANCE DEV = D;
    MODEL COST = X1 X2 X3 / LINK=LOG;

Choice between GLM and OLS of log cost
GLM advantages:
  – Handles heteroscedasticity
  – Predicted cost is not subject to retransformation error
OLS of log transform advantages:
  – OLS is more efficient (standard errors are smaller than with GLM)

Dataset for worked examples
New primary care episodes of nonspecific low-back pain in 2016
10% sample, N = 43,909
VA and community care costs in the following year (excluding residential, nursing home)
Cost difference of episodes that started in CBOC?

Worked example
Try gamma regression (GLM log link, gamma distribution)
Evaluate link function with Box-Cox regression
Evaluate distribution with GLM family test

GLM Regression
Log link, gamma distribution

Parameter                                          Estimate   Standard Error   Wald Chi-sq
Intercept                                            7.9401       0.0240         109230
Index visit at CBOC                                 -0.2750       0.0106          675.9
Baseline pain score                                  0.0452       0.0017          726.7
Age                                                  0.0026       0.0004           45.5
Number of chronic conditions                         0.1785       0.0036         2397.4
Indicator of diagnosis for substance use
  or psychiatric diagnosis                           0.2246       0.0128          308.2

Every P-value recovered from the slide is <.0001.
The rows for Female, History of opiate Rx in prior year, MRI within 6 weeks, and Number of visits for physical therapy within 6 weeks were damaged in extraction. The surviving fragments are a Wald chi-square of 60.7 and one complete row (estimate 0.0949, standard error 0.0120, Wald chi-square 62.8) whose variable cannot be identified.

Which GLM link function?
  – Maximum likelihood estimation of the Box-Cox parameter (called θ or λ)
$\frac{COST^{\theta} - 1}{\theta} = \alpha + \beta x + \varepsilon$

Link function depends on Box-Cox parameter

Link function          Box-Cox parameter
Inverse (1/cost)            -1
Log(cost)                    0
Square root (cost)           .5
Cost                         1
Cost squared                 2
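The correspondence between the Box-Cox parameter and the link function can be seen by applying the transform directly. This is a minimal sketch; the function name `box_cox` is hypothetical, and it only evaluates the transform for a given parameter (it does not estimate the parameter by maximum likelihood, which is what the Box-Cox regression does).

```python
import math


def box_cox(cost: float, theta: float) -> float:
    """Box-Cox transform of a positive cost for a given parameter theta."""
    if cost <= 0:
        raise ValueError("the Box-Cox transform requires cost > 0")
    if theta == 0:
        # Limit of (cost**theta - 1) / theta as theta approaches zero.
        return math.log(cost)
    return (cost ** theta - 1.0) / theta


# theta = 0.5 is a shifted and rescaled square root: (sqrt(4) - 1) / 0.5
print(box_cox(4.0, 0.5))
# theta = 1 is cost minus a constant, so the relationship stays linear
print(box_cox(5.0, 1.0))
```

Apart from shifting and rescaling, θ = -1 gives the inverse, θ = 0 the log, θ = 0.5 the square root, and θ = 1 the untransformed cost.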

Box-Cox regression
Stata
    boxcox cost {indep. vars} if cost > 0
SAS
    proc transreg data={dataset};
    model boxcox(cost / lambda=-1 to 2 by .5) = identity(&ind_vars);

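Box-Cox regression with example data
Box-Cox Transformation Information for Cost in year after index stay

Lambda    R-Square    Log Like
 -1.0       0.05      -409600
 -0.5       0.11      -369122
  0.0       0.15      -354795

Rows for larger values of lambda were not recovered from the slide (only the fragment “596029” survives). The legend marked the best lambda, the 95% confidence interval, and the convenient lambda, but the marker symbols were lost. Among the recovered rows, lambda = 0.0 has the highest log likelihood.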

Which variance structure with GLM?
Is it appropriate to assume the gamma distribution?
GLM family test
  – Also called modified Park test
Run GLM gamma regression
Evaluate residuals with second regression

GLM family test (step 1)
Run a gamma regression
  – Assume log link, gamma variance
  – Include independent variables
Find predicted cost (in log scale)
  – Xβ from first regression
Find residual (in raw scale)
  – $COST - e^{X\beta}$
Square these residuals

GLM family test (step 2)
Run second gamma regression
  – Dependent variable is squared residuals in raw scale
  – Independent variable is predicted cost in log scale (Xβ from first regression)
$(COST - e^{X\beta})^2 = \gamma_0 + \gamma_1 (X\beta) + \nu$

GLM family test (step 3)
Evaluate the regression coefficient γ1

γ1    Variance to be used in GLM
0     Gaussian (Normal)
1     Poisson
2     Gamma
3     Wald (Inverse Normal)

GLM Family Test with example data

Parameter    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
γ0             1.57         0.1846              72.1            <.0001
γ1             1.98         0.0213            8587.6            <.0001

Result: Since γ1 is close to 2, the gamma distribution was appropriate
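The logic of the family test can be sketched in a few lines. Note the substitution: the slides run the second step as a gamma regression, whereas this sketch uses ordinary least squares of the log squared residual on the log predicted cost (the classic Park test form), chosen only so the example runs with the Python standard library. The function names and the data are hypothetical.

```python
import math


def park_slope(observed, predicted):
    """OLS slope of ln(squared raw-scale residual) on ln(predicted cost)."""
    xs, ys = [], []
    for y, mu in zip(observed, predicted):
        resid_sq = (y - mu) ** 2
        if resid_sq > 0 and mu > 0:  # logs need positive values
            xs.append(math.log(mu))
            ys.append(math.log(resid_sq))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def nearest_family(gamma1):
    """Pick the variance family whose reference value is closest to gamma1."""
    families = {0: "Gaussian (Normal)", 1: "Poisson", 2: "Gamma",
                3: "Wald (Inverse Normal)"}
    nearest = min(families, key=lambda k: abs(k - gamma1))
    return families[nearest]


# Hypothetical data built so the residual is always half the predicted
# cost, i.e. the squared residual is proportional to the mean squared.
predicted = [100.0, 200.0, 400.0, 800.0, 1600.0]
observed = [1.5 * mu for mu in predicted]

gamma1 = park_slope(observed, predicted)
print(round(gamma1, 6), nearest_family(gamma1))
```

A real application would also test whether the estimate differs significantly from each reference value rather than simply picking the nearest one.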

Other models for skewed data
Generalized gamma models
  – Estimate link function, distribution, and parameters in a single model
  – Stata ado file “pglm”
  – See: Basu & Rathouz (2005)

Questions?

Topics for today’s course
What is heteroscedasticity, and what should be done about it? (GLM models)
What should be done when there are many zero values?
How to test differences in groups with no assumptions about distribution?
How to determine which model is best?

What should be done when there are many zero values?
Examples
  – Members of a health plan without utilization
  – Clinical trial participants with no cost

The two-part model
Part 1: Dependent variable is an indicator that any cost is incurred
  – 1 if cost is incurred (Y > 0)
  – 0 if no cost is incurred (Y = 0)
Part 2: Regression of how much cost, among those who incurred any cost

[Figure: Annual per person VHA costs FY10 among those who used VHA in FY09. Series: “Medical Only” and “Medical Rx”. Cost categories run from no cost to 30K; the other axis runs from 0.00 to 0.40.]

The two-part model
Expected value of Y conditional on X:
$E(Y \mid X) = P(Y > 0 \mid X) \times E(Y \mid Y > 0, X)$
is the product of:
  – Part 1. The probability that Y is greater than zero, conditional on X
  – Part 2. Expected value of Y, conditional on Y being greater than zero, conditional on X

Predicted cost in two-part model
Predicted value of Y:
$E(Y \mid X) = P(Y > 0 \mid X) \times E(Y \mid Y > 0, X)$
is the product of:
  – Part 1. Probability of any cost being incurred
  – Part 2. Predicted cost conditional on incurring any cost
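As a small worked illustration of the product, the sketch below combines the two parts for one person. The function name and the numbers are hypothetical; in practice the probability would come from the first-part regression and the conditional cost from the second.

```python
def two_part_prediction(prob_any_cost: float, conditional_cost: float) -> float:
    """Predicted cost: P(Y > 0 | X) times E(Y | Y > 0, X)."""
    if not 0.0 <= prob_any_cost <= 1.0:
        raise ValueError("prob_any_cost must be a probability")
    if conditional_cost < 0:
        raise ValueError("conditional_cost cannot be negative")
    return prob_any_cost * conditional_cost


# Hypothetical person: 80% chance of incurring any cost, and an expected
# cost of 5,000 if any cost is incurred.
predicted = two_part_prediction(0.8, 5000.0)
print(f"{predicted:.0f}")
```

The unconditional prediction (about 4,000 here) is lower than the conditional cost because it averages in the people who incur no cost.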

Question for class
$P(Y > 0 \mid X)$
Part one estimates the probability that Y > 0
  – Y > 0 is a dichotomous indicator
  – 1 if cost is incurred (Y > 0)
  – 0 if no cost is incurred (Y = 0)

POLL QUESTION #2
Which regression method(s) are used for a dichotomous (zero/one) dependent variable? (check all that apply)
  – Ordinary Least Squares
  – Logistic Regression
  – Probit Regression
  – Survival (Cox) regression

First part of model
Regression with dichotomous variable
Logistic regression or probit
Logistic regression uses maximum likelihood to estimate the log odds:
$\log\left(\frac{P_i}{1 - P_i}\right) = \alpha + \beta_1 X$
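The predicted probability used in part one is obtained by inverting the log odds. The sketch below assumes coefficient values for illustration only; the function name and the numbers are hypothetical and are not estimates from the course data.

```python
import math


def predicted_probability(alpha: float, beta1: float, x: float) -> float:
    """Invert log(P / (1 - P)) = alpha + beta1 * x to recover P."""
    log_odds = alpha + beta1 * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients: intercept 0.5, slope 0.2, evaluated at x = 3.
p = predicted_probability(0.5, 0.2, 3.0)

# Converting back to the log odds recovers alpha + beta1 * x.
recovered_log_odds = math.log(p / (1.0 - p))
print(round(p, 3), round(recovered_log_odds, 3))
```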

Logistic regression syntax in SAS
    Proc Logistic;
    Model HASCOST = X1 X2 X3 / Descending;
    Output out={dataset} prob={variable name};
HASCOST is an indicator variable
The Output statement saves the predicted probability that the dependent variable equals one (cost was incurred)
The Descending option in the model statement is required; otherwise SAS estimates the probability that the dependent variable equals zero

Logistic regression syntax in Stata
    logit HASCOST X1 X2 X3
    predict {variable name}, pr
The predict statement generates the predicted probability that the dependent variable equals one (cost was incurred)

Second part of model
Conditional quantity
Regression involves only observations with non-zero cost (conditional cost regression)
Use GLM or OLS with log cost

Two-part models
Separate parameters for participation and conditional quantity
  – How independent variables predict
      participation in care
      quantity of cost conditional on participation
  – Each parameter may have its own policy relevance

Stata TPM command
Fits two-part regressions
  – First part: binary choice, Prob(depvar > 0)
  – Second part: distribution of depvar conditional on depvar > 0
User-developed ADO file
  – Must be installed from web
Federico Belotti & Partha Deb (2012)

Stata TPM command
First part options
  – Logit or Probit
Second part options
  – OLS of raw value, OLS of log, or GLM
Example syntax
    tpm COST X1 X2 X3, f(logit) s(glm, fam(gamma) link(log))

Stata TPM command
Post-estimation commands
  – Predict values of depvar
  – Allows out-of-sample predictions
  – Corrects for retransformation bias in OLS models

Alternatives to two-part model
OLS with untransformed cost
OLS with log cost, using small positive values in place of zero
  – Not recommended
GLM models, e.g. gamma regression
  – Cannot have “too many” values with zero

Topics for today’s course
What is heteroscedasticity, and what should be done about it? (GLM models)
What should be done when there are many zero values? (Two-part models)
How to test differences in groups with no assumptions about distribution?
How to determine which model is best?

Non-parametric statistical tests
Make no assumptions about distribution, variance
Wilcoxon rank-sum test
  – Assigns rank to every observation
  – Compares ranks of groups
  – Calculates the probability that the rank order occurred by chance alone
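The ranking step of the Wilcoxon rank-sum test can be sketched as follows. This computes only the rank sum for one group (ties get the average rank); it does not compute the p-value. The function name and the cost values are hypothetical.

```python
def rank_sum(group_a, group_b):
    """Sum of the ranks of group_a in the pooled sample (midranks for ties)."""
    pooled = sorted(list(group_a) + list(group_b))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 (0-based) hold ranks i+1..j; store their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    return sum(rank_of[value] for value in group_a)


# Hypothetical costs for two groups.
low = [100, 200, 300]
high = [400, 500, 600]
print(rank_sum(low, high))

# Making the most expensive case far more expensive changes no ranks.
extreme = [400, 500, 1_000_000]
print(rank_sum(low, extreme))
```

Both calls give the same rank sum, because only the ordering of the observations enters the test and not the size of the differences between them.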

Extension to more than two groups
Group variable with more than two mutually exclusive values
Kruskal-Wallis test
  – Is there any difference between any pairs of the mutually exclusive groups?
If KW is significant, then a series of Wilcoxon tests allows comparison of pairs of groups

Limits of non-parametric test
It is too conservative
  – Compares ranks, not means
  – Ignores influence of outliers
  – E.g. all other ranks being equal, Wilcoxon will give the same result regardless of whether
      the top ranked observation is 1 million dollars more costly than the second observation, or
      the top ranked observation is just 1 dollar more costly
Doesn’t allow for additional explanatory variables

Topics for today’s course
What is heteroscedasticity, and what should be done about it? (GLM models)
What should be done when there are many zero values? (Two-part models)
How to test differences in groups with no assumptions about distribution? (Non-parametric statistical tests)
How to determine which model is best?

Which model is best?
Find predictive accuracy of models
Estimate regressions with half the data, test their predictive accuracy on the other half of the data
Find
  – Mean Absolute Error (MAE)
  – Root Mean Square Error (RMSE)

Mean Absolute Error
For each observation
  – Find difference between observed and predicted cost
  – Take absolute value
  – Find the mean
Model with smallest value is best
$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|$

Root Mean Square Error
Square the differences between predicted and observed, find their mean, find its square root
Best model has smallest value
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}$
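Both accuracy measures are easy to compute once predictions are available for the held-out half of the data. The sketch below uses made-up observed and predicted costs; the function names are hypothetical.

```python
import math


def mean_absolute_error(observed, predicted):
    """Mean of |observed - predicted| in the raw (dollar) scale."""
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / len(observed)


def root_mean_square_error(observed, predicted):
    """Square root of the mean squared difference in the raw scale."""
    mean_sq = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)
    return math.sqrt(mean_sq)


# Hypothetical held-out observations and model predictions.
observed = [0.0, 100.0, 1000.0, 10000.0]
predicted = [50.0, 150.0, 900.0, 9000.0]

mae = mean_absolute_error(observed, predicted)
rmse = root_mean_square_error(observed, predicted)
print(f"MAE = {mae:.1f}, RMSE = {rmse:.1f}")
```

Because RMSE squares each error before averaging, the single large miss on the most expensive case dominates it, while MAE weights every error in proportion to its size.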

MAE and RMSE in example
(10% sample of low back pain patients)

Model                          Mean Absolute Error (MAE)    Root Mean Square Error (RMSE)
4 age categorical variables            5,972                        14,054
7 additional variables                 5,608                        13,745

Find in raw scale to be more transparent
Lower values are more desirable!

Evaluations of residuals
Mean residual (predicted less observed) or mean predicted ratio (ratio of predicted to observed)
  – Calculate separately for each decile of observed Y
  – A good model should have equal residuals (or equal mean ratio) for all deciles

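Evaluation of Residuals
(10% sample of low back pain patients)

Decile of cost    Ratio of predicted to observed cost
      1                      17.5
      2                       6.5
      3                       4.2
      4                       3.2
      5                       2.4
      6                       1.9
      7                       1.5
      8                       1.1
      9                       0.8
     10                       0.4

The columns for mean predicted cost and mean observed cost in each decile were not recovered from the slide; only the fragments “9,971” and “32,232” survive.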
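The decile check can be sketched as follows: sort observations by observed cost, split them into ten equal groups, and compare mean predicted with mean observed cost in each group. The function name and data are hypothetical, and the sketch assumes at least ten observations and a positive mean observed cost in every decile.

```python
def ratio_by_decile(observed, predicted):
    """Mean predicted / mean observed cost within each decile of observed cost."""
    pairs = sorted(zip(observed, predicted))  # order by observed cost
    n = len(pairs)
    ratios = []
    for d in range(10):
        chunk = pairs[d * n // 10:(d + 1) * n // 10]
        mean_observed = sum(obs for obs, _ in chunk) / len(chunk)
        mean_predicted = sum(pred for _, pred in chunk) / len(chunk)
        ratios.append(mean_predicted / mean_observed)
    return ratios


# Hypothetical data: observed costs of 100, 200, ..., 2000 and a poor
# model that predicts the overall mean (1,050) for everyone.
observed = [100.0 * i for i in range(1, 21)]
predicted = [sum(observed) / len(observed)] * len(observed)

ratios = ratio_by_decile(observed, predicted)
print([round(r, 2) for r in ratios])
```

A model that ignores the drivers of cost over-predicts in the low deciles (ratio above 1) and under-predicts in the high deciles (ratio below 1), the same direction of pattern as in the example table.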

Formal tests of residuals
Variant of Hosmer-Lemeshow Test
  – F test of whether residuals in raw scale in each decile are significantly different
Pregibon’s Link Test
  – Tests if linearity assumption was violated
See Manning, Basu, & Mullahy, 2005

Variant of Hosmer-Lemeshow Test
(10% sample of low back pain patients)
Dependent variable: residual in raw scale
F = 2,516, p < 0.0001

Variable     Parameter Estimate    T-value    p
Intercept         -4,235            -25.1     [not recovered]
decile5              746              3.1     0.002
decile6            1,245              5.2     <.0001
decile7            2,056              8.6     <.0001
decile8            3,374             14.2     <.0001
decile9            6,352             26.7     <.0001
decile10          27,573            115.7     <.0001

The rows for the lower deciles were damaged in extraction; the only surviving fragment is “811.60.11”.

Questions?

Review of presentation
Cost is a difficult dependent variable
  – Skewed to the right by high outliers
  – May have many observations with zero values
  – Cost is non-negative

When cost is skewed
OLS of raw cost is prone to bias
  – Especially in small samples with influential outliers
  – “A single case can have tremendous influence”

When cost is skewed (cont.)
Log transformed cost
  – Log cost is more normally distributed than raw cost
  – Log cost can be estimated with OLS

When cost is skewed (cont.)
To find predicted cost, must correct for retransformation bias
  – Smearing estimator assumes errors are homoscedastic
  – Biased if errors are heteroscedastic
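A minimal sketch of the smearing correction (Duan's smearing estimator) follows. It assumes homoscedastic errors, as stated above: a single factor, the mean of the exponentiated log-scale residuals, multiplies every retransformed prediction. The function names and residual values are hypothetical.

```python
import math


def smearing_factor(log_residuals):
    """Mean of exp(residual) over the log-scale OLS residuals."""
    return sum(math.exp(r) for r in log_residuals) / len(log_residuals)


def retransformed_cost(log_prediction, log_residuals):
    """Predicted raw-scale cost: exp(X * beta) times the smearing factor."""
    return math.exp(log_prediction) * smearing_factor(log_residuals)


# Hypothetical log-scale residuals that average zero, as OLS residuals do.
residuals = [-1.0, 0.0, 1.0]
factor = smearing_factor(residuals)

# A log-scale prediction of ln(1000) retransforms to more than 1,000.
cost = retransformed_cost(math.log(1000.0), residuals)
print(round(factor, 3), round(cost, 1))
```

The factor exceeds 1 whenever the residuals vary, so simply exponentiating the log-scale prediction would understate mean cost. If the residual variance differs across groups, one common factor is no longer appropriate.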

When cost is skewed and errors are heteroscedastic
GLM with log link and gamma variance
  – Considers heteroscedastic errors
  – Not subject to retransformation bias
  – May not be very efficient
  – Alternative GLM specifications
      Poisson instead of gamma variance function
      Square root instead of log link function

When cost has many zero values
Two-part model
  – Logit or probit is the first part
  – Conditional cost regression is the second part

Comparison without distributional assumptions
Non-parametric tests can be useful
May be too conservative
Don’t allow covariates

Evaluating models
Mean Absolute Error
Root Mean Square Error
Other evaluations and tests of residuals

Key sources on GLM
MANNING, W. G. (1998) The logged dependent variable, heteroscedasticity, and the retransformation problem, J Health Econ, 17, 283-95.
* MANNING, W. G. & MULLAHY, J. (2001) Estimating log models: to transform or not to transform?, J Health Econ, 20, 461-94.
* MANNING, W. G., BASU, A. & MULLAHY, J. (2005) Generalized modeling approaches to risk adjustment of skewed outcomes data, J Health Econ, 24, 465-88.
BASU, A. & RATHOUZ, P. J. (2005) Estimating marginal and incremental effects on health outcomes using flexible link and variance function models, Biostatistics, 6(1), 93-109.

Key sources on two-part models
* MULLAHY, J. (1998) Much ado about two: reconsidering retransformation and the two-part model in health econometrics, J Health Econ, 17, 247-81.
JONES, A. (2000) Health econometrics, in: Culyer, A. & Newhouse, J. (Eds.) Handbook of Health Economics, pp. 265-344 (Amsterdam, Elsevier).

References to worked examples
DEB, P. & NORTON, E. C. (2018) Modeling health care expenditures and use, Ann Rev Public Health, 39, 489-505.
FLEISHMAN, J. A., COHEN, J. W., MANNING, W. G. & KOSINSKI, M. (2006) Using the SF-12 health status measure to improve predictions of medical expenditures, Med Care, 44, I54-63.
MONTEZ-RATH, M., CHRISTIANSEN, C. L., ETTNER, S. L., LOVELAND, S. & ROSEN, A. K. (2006) Performance of statistical models to predict mental health and substance abuse cost, BMC Med Res Methodol, 6, 53.

References to worked examples (cont.)
MORAN, J. L., SOLOMON, P. J., PEISACH, A. R. & MARTIN, J. (2007) New models for old questions: generalized linear models for cost prediction, J Eval Clin Pract, 13, 381-9.
DIEHR, P., YANEZ, D., ASH, A., HORNBROOK, M. & LIN, D. Y. (1999) Methods for analyzing health care utilization and costs, Ann Rev Public Health, 20, 125-144. (Also gives an accessible overview of methods, but lacks information from more recent developments.)

Link to HERC Cyberseminar
HSR&D study of worked example
Performance of Statistical Models to Predict Mental Health and Substance Abuse Cost
Maria Montez-Rath, M.S. 11/8/2006
The audio: http://vaww.hsrd.research.va.gov/for researchers/cyber seminars/HERC110806.asx
The PowerPoint slides: http://vaww.hsrd.research.va.gov/for researchers/cyber seminars/HERC110806.pdf

Book chapters
MANNING, W. G. (2006) Dealing with skewed data on costs and expenditures, in: Jones, A. (Ed.) The Elgar Companion to Health Economics, pp. 439-446 (Cheltenham, UK, Edward Elgar).
