
A Non-Parametric Bayesian Approach to the Instrumental Variable Problem

by

Tim Conley
Chris Hansen
Rob McCulloch
Peter E. Rossi

Graduate School of Business
University of Chicago

June 2006

Keywords: instrumental variables, non-parametric Bayesian inference, Dirichlet process priors

JEL classification: C11, C14, C3

Abstract

We develop a Bayesian non-parametric approach to the instrumental variable problem. A Dirichlet process prior is used for the joint distribution of structural and instrumental variable equations errors. This can be interpreted as modeling the unknown joint distribution with a mixture of normal distributions with a variable number of mixture components. We demonstrate that this procedure is both feasible and sensible using actual and simulated data. Sampling experiments compare inferences from the non-parametric Bayesian procedure with those based on procedures from the recent literature on weak instrument asymptotics. When errors are non-normal, our non-parametric procedure is more efficient than standard Bayesian or classical methods.

1. Introduction

Instrumental variables (IV) methods are fundamental in applied economic research. However, because IV methods exploit only that portion of the variation in the endogenous variable induced by shifting the instrumental variable, inference tends to be imprecise. This problem is exacerbated when the instruments are only weakly related to the endogenous variables and induce only small amounts of variation. The recent econometrics literature has included numerous papers which present potential improvements to the usual asymptotic approaches to obtaining estimates and performing inference in IV models.¹

¹ See, for example, Stock, Wright, and Yogo (2002) or Andrews, Stock, and Moreira (2006) for excellent overviews of this literature.

We present a Bayesian instrumental variables approach that allows nonparametric estimation of the distribution of error terms in a set of simultaneous equations. Bayesian methods are directly motivated by the fact that researchers are quite likely to have informative prior views about potential values of treatment parameters, placing a premium upon methods allowing the use of such information in estimation. Weak instrument problems are inherently small sample problems in that there is little information available in the data to identify the parameter of interest. As Bayesian methods are inherently small sample, they are a coherent choice. Even in the absence of a direct motivation for using Bayesian methods, we provide evidence that Bayesian interval estimators perform well compared to available frequentist estimators, under frequentist performance criteria.

The Bayesian non-parametric approach attempts to uncover and exploit structure in the data. For example, if the errors are truly non-normal, the version of our model with varying error distribution parameters would fit this distribution and may provide efficiency gains from this information. In contrast, traditional instrumental variable methods are

designed to be robust to the error distribution and, therefore, may be less efficient. In the case of normal or nearly normal errors, our procedure should have a small efficiency loss.

Our nonparametric method for estimating error term distributions can be interpreted as a type of mixture model. Rather than choosing a fixed number of base distributions to be mixed, we specify a Dirichlet Process Prior that allows the number of mixture components to be determined by both the prior and the data. The alternative of using a pre-specified number of mixture components requires some sort of auxiliary computations, such as Bayes Factors, to select the number of components. In our approach, this is unnecessary, as the Bayes Factor computations are undertaken as part of the MCMC method. There is also a sense in which the DP prior is a very general approach which allows model parameters to vary from observation to observation. A posteriori, observations which could reasonably correspond to the same error distribution parameter are grouped together. This means that the DP approach handles very general forms of heterogeneity in the error distributions.

Our implementation of the normal base model and conjugate prior is entirely vectorized except for one loop in a sub Gibbs Sampler for drawing the parameters governed by the Dirichlet process prior, which is currently implemented in C. This makes DP calculations feasible even in a sampling experiment and for the sample sizes often encountered in applied cross-sectional work. Computational issues are discussed in appendix A.

We conduct an extensive Monte Carlo evaluation of our proposed method and compare it to a variety of classical approaches to estimation and inference in IV models. We examine estimators' finite sample performance over a range of instrument strength and under departures from normality. The non-parametric Bayes estimators have dramatically

smaller RMSE than standard classical estimators. In comparison to Bayesian methods that assume normal errors, the non-parametric Bayes method has nearly identical RMSE for normal errors and much smaller RMSE for log-normal errors. For both weak and strong instruments, our procedure produces credibility regions that are much smaller than those of competing classical procedures, particularly in the case of non-normal errors. For the weak instrument cases, our coverage rates are four to twelve percent below the nominal coverage rate of 95 percent. Recent methods from the weak instrument classical literature produce intervals with coverage rates that are close to nominal levels, but at the cost of producing extremely large intervals. For log-normal errors, we find these methods produce infinite intervals more than 40 percent of the time.

The remainder of this paper is organized as follows. Section 2 presents the main model and the essence of our computational algorithm. Section 3 discusses choices of priors. In Section 4, we present two illustrative empirical applications of our method. Section 5 presents results from sampling experiments which compare the inference and estimation properties of our proposed method to alternatives in the econometrics literature. Appendices detail the computational strategy and provide specifics of the alternative classical inference procedures considered in the paper.

2. Model and MCMC

In this section, we present a version of the instrumental variable problem and explain how to conduct Bayesian inference for it. Our focus will be on models in which the distribution of error terms is not restricted to any specific parametric family. We also indicate how this same approach can be used in models with unknown regression or mean functions.

2.1 The Linear Model

Consider the case with one linear structural equation and one "first-stage" or reduced form equation.

(2.1)
$$x_i = z_i'\delta + \varepsilon_{1,i}$$
$$y_i = \beta x_i + w_i'\gamma + \varepsilon_{2,i}$$

The generalization of (2.1) to more than one right-hand-side endogenous variable is obvious. If ε₁ and ε₂ are dependent, then the treatment parameter, β, is identified by the variation from the variables in z, which are excluded from the structural equation and are commonly termed "instrumental variables." Classical instrumental variables estimators such as two stage least squares do not make any specific assumptions regarding the distribution of the error terms in (2.1). In contrast, the Bayesian treatment of this model has relied on the assumption that the error terms are bivariate normal (c.f. Chao and Phillips (1998), Geweke (1996), Kleibergen and Van Dijk (1998), Kleibergen and Zivot (2003), and Rossi et al (2005)).²

(2.2)
$$\varepsilon_i = \begin{pmatrix} \varepsilon_{1,i} \\ \varepsilon_{2,i} \end{pmatrix} \sim N(\mu, \Sigma)$$

² An exception is Zellner (1998), whose BMOM procedure does not use a normal or any other specific parametric family of distributions for the errors.
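To make the setup in (2.1)-(2.2) concrete, the following minimal Python sketch (ours, not part of the paper) simulates data from the model with bivariate normal errors. The sample size, coefficient values, and error correlation are illustrative assumptions; the correlation ρ between ε₁ and ε₂ is what makes x endogenous and OLS on the structural equation inconsistent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions, not taken from the paper)
n, beta, delta, gamma, rho = 500, 1.0, 0.5, 0.3, 0.8

# (2.2): bivariate normal errors; rho controls the endogeneity of x
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
eps = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)

z = rng.normal(size=n)   # instrument, excluded from the structural equation
w = rng.normal(size=n)   # exogenous covariate in the structural equation

# (2.1): first-stage and structural equations
x = delta * z + eps[:, 0]
y = beta * x + gamma * w + eps[:, 1]

# OLS of y on (x, w) is inconsistent for beta because x is correlated with eps_2
X = np.column_stack([x, w])
print("OLS beta-hat (biased):", np.linalg.lstsq(X, y, rcond=None)[0][0])
```

A two stage least squares estimator, or the Bayesian sampler described in section 2.3, would instead use the variation in x induced by z to recover β.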

For reasons that will become apparent later, we will include the intercepts in the error terms by allowing them to have non-zero mean, μ.

Most researchers regard the assumption of normality as only an approximation to the true error distribution. Some methods of inference, such as those based on TSLS and the more recent weak and many instruments literature, do not make any explicit distributional assumptions. In addition to outliers, some forms of conditional heterogeneity and mis-specification of the functional forms of the regression functions can produce non-normal error terms. For these reasons, we develop a Bayesian procedure that uses a flexible error distribution that can be given a non-parametric interpretation.

Our approach builds on the normal-based model but allows for separate error distribution parameters, $\theta_i = (\mu_i, \Sigma_i)$, for every observation. As discussed below, this affords a great deal of flexibility in the error distribution. However, as a practical matter, some sort of structure must be imposed on this set of parameters; otherwise we face a problem of parameter proliferation. One solution to this problem is to use a prior over the collection, {θ_i}, which creates dependencies. In our approach, we use a prior that clusters together "similar" observations into I* groups, each with its own unique value of θ. The number of these groups will be random as well, allowing for a truly non-parametric method in which the number of clusters can increase with the sample size. In any fixed size sample, our full Bayesian implementation will introduce additional parameters only if necessary, avoiding the problem of over-fitting.

With a normal base distribution, the resulting predictive distribution of the error terms (see section 2.4 for details) will involve a mixture of normal distributions where the number and shape of the normal components is influenced by both the prior and the data. A mixture of normals can provide a very flexible approximation device. Thus, our

procedure enjoys much of the flexibility of a finite mixture of normals without requiring additional computations/procedures to determine the number of components and impose penalties for over-fitting. It should be noted that a sensible prior is required for any procedure that relies explicitly or implicitly (as ours does) on Bayes Factor computations.

In our procedure, observations with large errors can be grouped separately from observations with small errors. The coarseness of this clustering is dependent on the information content of the data and the prior settings. In principle, this allows for a general form of heteroskedasticity with different variances for each observation.

2.2 Flexible Specifications through a Hierarchical Model with a Dirichlet Process Prior

Our approach to building a flexible model is to allow for a subset of the parameters to vary from observation to observation. We can partition the full set of parameters into a part that is fixed, η, and one that varies from observation to observation, θ. For example, we could assign $\eta = (\beta, \delta, \gamma)$ and $\theta = (\mu, \Sigma)$ as suggested above. The problem becomes how to put a flexible prior on the collection, $\{\theta_i\}_{i=1}^N$. The standard hierarchical approach is to assume that each θ_i is iid $G_0(\lambda)$, where $G_0$ is some parametric family of distributions with hyperparameters, λ. Frequently, a prior is put on λ, and this has the effect of inducing dependencies between the θ_i.

A more flexible approach is to specify a Dirichlet Process prior for G instead.

(2.3)
$$\theta_i \overset{iid}{\sim} G, \qquad G \sim DP(\alpha, G_0)$$

$DP(\alpha, G_0)$ denotes the Dirichlet process with concentration parameter α and base distribution $G_0$. G is a random distribution such that, with probability one, G is discrete.

This means that different θ_i may correspond to the same atom of G and hence be the same. This is a form of dependency in the prior, achieved by clustering together some of the θ_i. It should be noted that while each draw of G is discrete, this does not mean that the joint prior distribution on $\{\theta_i\}_{i=1}^N$ is discrete once G has been margined out. This distribution is called a mixture of Dirichlet Processes and is shown in Antoniak (1974) to have continuous support. It is worth noting that the marginal distribution of any θ_i is $G_0$. The sole purpose of the DP prior is to introduce dependencies in the collection, $\{\theta_i\}_{i=1}^N$.

A useful way to gain some intuition as to the DP prior is to consider the "stick-breaking" representation of this prior. Each draw from G is a discrete distribution. The support or "atoms" of this distribution are iid draws from $G_0$. The probability weights are obtained as $\pi_k = \omega_k \prod_{j=1}^{k-1} (1 - \omega_j)$, with $\omega_0 \equiv 0$ and $\omega_k \sim \mathrm{Beta}(1, \alpha)$. Thus, a draw G can be represented as $G = \sum_{k=1}^{\infty} \pi_k I_{\theta_k}$, where $I_\theta$ is a point mass at atom θ and the $\theta_k$ are i.i.d. draws from $G_0$.

The distribution of the atom weights depends only on α. We obtain the π weights by starting with the full mass of one and repeatedly taking bites of size $\omega_k$ out of the remaining weight. If α is big, we will take small bites so that the mass will be spread out over a large number of atoms. In this case, G will be a discrete approximation of $G_0$, so that the draws of G will be close to $G_0$ and the {θ_i} will essentially be i.i.d. draws from $G_0$. If α is small, we will take big bites and a draw of G will put large weight on a few random draws from $G_0$. In this case, the {θ_i} contain only a few unique values. The number of unique values is random, with values between one and N being possible.
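The stick-breaking weights are easy to simulate, which makes the role of α tangible. Below is a small Python sketch (ours; it uses a truncated approximation of G purely for illustration, whereas the paper's MCMC never needs to truncate) that draws G ~ DP(α, G0) with a N(0,1) base distribution and counts the unique values among N = 200 draws of θ_i for several α.

```python
import numpy as np

def stick_breaking_draw(alpha, base_draw, K=2000, rng=None):
    """Truncated stick-breaking draw from DP(alpha, G0).

    Returns atom locations and weights pi_k = omega_k * prod_{j<k}(1 - omega_j).
    """
    rng = rng or np.random.default_rng()
    omega = rng.beta(1.0, alpha, size=K)
    pi = omega * np.concatenate([[1.0], np.cumprod(1.0 - omega[:-1])])
    atoms = base_draw(K, rng)
    return atoms, pi

rng = np.random.default_rng(1)
base = lambda K, rng: rng.normal(size=K)   # G0 = N(0, 1), an illustrative base

for alpha in (0.5, 5.0, 50.0):
    atoms, pi = stick_breaking_draw(alpha, base, rng=rng)
    # Sample theta_1, ..., theta_N from the discrete random measure G
    # (renormalizing the truncated weights so they sum to one).
    theta = rng.choice(atoms, size=200, p=pi / pi.sum())
    print(f"alpha={alpha:5.1f}: unique theta values among N=200 draws:",
          np.unique(theta).size)
```

Small α yields a handful of unique values (big bites); large α yields many (small bites), exactly the behavior described above.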

Suppressing the fixed parameters η, we can write our basic model in hierarchical form as the set of the following conditional distributions:

(2.4)
$$G \sim DP(\alpha, G_0)$$
$$\{\theta_i\} \mid G$$
$$(x_i, y_i) \mid \theta_i, z_i$$

In the posterior distribution for this model, the prior and the information in the data combine to identify groups of observations which could reasonably share the same θ. In section 3, we consider priors on the DP concentration parameter α and the selection of the base prior distribution, $G_0$. Roughly, these two priors delineate the number and type of atoms generated by the DP prior.

2.3 MCMC Algorithms

The fixed parameter, linear model in (2.1) and (2.2) has a Gibbs Sampler as defined in Rossi et al (2005) (see also Geweke (1996)) consisting of the following conditional posterior distributions:

(2.5) β, γ | δ, μ, Σ, y, x, Z, W
(2.6) δ | β, γ, μ, Σ, y, x, Z, W
(2.7) μ, Σ | β, γ, δ, y, x, Z, W

where y, x, Z, W denote vectors and an array formed by stacking the observations. The key insight needed to draw from (2.5) is that, given δ, we "observe" ε₁ and we can compute the conditional distribution of y given x, Z, W and ε₁. The parameters of this conditional distribution are "known" and we simply standardize to obtain a draw from a Bayes regression with N(0,1) errors. The draw in (2.6) is effected by transforming to the reduced form, which is still linear in δ (given β). This exploits the linearity (in x) of the structural equation. Again, we standardize the reduced form equations and stack them to obtain a

draw from a Bayes regression with N(0,1) errors. The last draw (2.7) simply uses standard Bayesian multivariate normal theory, using the errors as "data."

If some subset of the parameters is allowed to vary from observation to observation with a DP prior, we then must add a draw of these varying parameters to this basic set-up. For example, if we define $\theta_i = (\mu_i, \Sigma_i)$, then the Gibbs Sampler becomes

(2.8) β, γ | δ, Θ, y, x, Z, W
(2.9) δ | β, γ, Θ, y, x, Z, W
(2.10) Θ | β, γ, δ, y, x, Z, W
(2.11) α | Θ, β, γ, δ, y, x, Z, W

where Θ = {θ_i}. The draws in (2.8) and (2.9) are the same as for the fixed parameter case except that the regression equations must be standardized to have zero mean errors and unit variance. Since Θ contains only I* unique elements, we can group observations by unique θ_i value and standardize each group with different error means and covariance matrices. This presents some computing challenges for full vectorization, but it is conceptually straightforward. The draw of Θ in (2.10) is done by a Gibbs sampler which cycles through each of the N θ_i's (Escobar and West (1998); see appendix A for full details). The input to this Gibbs Sampler as "data" is the N × 2 matrix of error terms computed using the last draws of (β, δ, γ). Each draw of Θ will contain a different number (I* ≤ N) of unique values. The draw of α is a straightforward univariate draw (see appendix A for details). Thus, this model can be interpreted as a linear structural equations model with errors following a mixture of normals with a random number of components which are determined by the data and prior information.
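As an illustration of the standardization trick behind (2.5) and (2.8), the Python sketch below is our reconstruction under simplifying assumptions: scalar z and w, fixed (μ, Σ), and a N(0, A⁻¹) prior with zero mean on (β, γ). It computes ε₁ given δ, conditions ε₂ on it, and rescales the structural equation so that a conjugate normal regression draw applies.

```python
import numpy as np

def draw_beta_gamma(y, x, w, z, delta, mu, Sigma, A, rng):
    """One Gibbs step for (beta, gamma) | delta, mu, Sigma.

    Given delta we "observe" eps_1 = x - z*delta. Conditional on eps_1,
    eps_2 is normal, so subtracting its conditional mean and dividing by
    its conditional standard deviation yields a Bayes regression with
    N(0,1) errors.
    """
    eps1 = x - z * delta
    cond_mean = mu[1] + (Sigma[0, 1] / Sigma[0, 0]) * (eps1 - mu[0])
    cond_sd = np.sqrt(Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0])

    y_star = (y - cond_mean) / cond_sd        # standardized: N(0,1) errors
    X_star = np.column_stack([x, w]) / cond_sd

    # Conjugate draw under a N(0, A^{-1}) prior on (beta, gamma)
    Q_inv = np.linalg.inv(X_star.T @ X_star + A)
    post_mean = Q_inv @ (X_star.T @ y_star)
    return rng.multivariate_normal(post_mean, Q_inv)

# Example call with simulated inputs (illustrative only)
rng = np.random.default_rng(3)
n = 200
z, w = rng.normal(size=n), rng.normal(size=n)
mu, Sigma = np.zeros(2), np.array([[1.0, 0.8], [0.8, 1.0]])
eps = rng.multivariate_normal(mu, Sigma, size=n)
x = 0.5 * z + eps[:, 0]
y = 1.0 * x + 0.3 * w + eps[:, 1]
print(draw_beta_gamma(y, x, w, z, 0.5, mu, Sigma, 0.01 * np.eye(2), rng))
```

In the DP version (2.8), μ and Σ vary by observation, so cond_mean and cond_sd become observation-specific quantities grouped by unique θ_i; the algebra is otherwise unchanged.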

2.4 Bayesian Density Estimation

One useful by-product of our DP MCMC algorithm is a very simple way of obtaining an error density estimate directly from the MCMC draws without significant additional computations. In the empirical examples in section 4, we will display some of these density estimates in an effort to document departures from normality. The Bayesian analogue of a density estimate is the predictive distribution of the random variables for which a density estimate is required. In our case, we are interested in the predictive distribution of the error terms. This can be written as follows:

(2.12)
$$p(\varepsilon_{N+1} \mid Data) = \int p(\varepsilon_{N+1} \mid \theta_{N+1}) \, p(\theta_{N+1} \mid Data) \, d\theta_{N+1}$$

We can obtain draws from $\theta_{N+1} \mid Data$ using

(2.13)
$$p(\theta_{N+1} \mid Data) = \int p(\theta_{N+1} \mid \Theta) \, p(\Theta \mid Data) \, d\Theta$$

Since each draw of Θ has I* ≤ N unique values, and the base model, $p(\varepsilon \mid \theta)$, is a normal distribution, we can interpret the predictive distribution or density estimation problem as involving a mixture of normals. This mixture involves a random number of components. To implement this, we simply draw from $\theta_{N+1} \mid \Theta$ for each draw of Θ returned by our MCMC procedure (see appendix A for the details of this draw). Denote these draws by $\theta_{N+1}^r$, $r = 1, \ldots, R$. The Bayesian density "estimate" is simply the MCMC estimate of the posterior mean of the density ordinate,

(2.14)
$$\hat{p}(\varepsilon) = \frac{1}{R} \sum_{r=1}^{R} \varphi\left(\varepsilon \mid \theta_{N+1}^r\right)$$

where $\varphi(\cdot)$ is the bivariate normal density function.
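Equation (2.14) is just an average of bivariate normal ordinates over the retained draws of θ_{N+1}, as in the following sketch (ours). The theta_draws list here stands in for MCMC output; generating each $\theta_{N+1}^r$ from $\theta_{N+1} \mid \Theta^r$ follows appendix A and is not reproduced.

```python
import numpy as np
from scipy.stats import multivariate_normal

def density_estimate(theta_draws, grid):
    """Posterior-mean density ordinate (2.14): average the bivariate normal
    pdf phi(eps | theta) over the R retained draws of theta_{N+1}.

    theta_draws: list of (mu, Sigma) pairs, one per retained MCMC draw
    grid:        (M, 2) array of points at which to evaluate p_hat(eps)
    """
    dens = np.zeros(len(grid))
    for mu, Sigma in theta_draws:
        dens += multivariate_normal.pdf(grid, mean=mu, cov=Sigma)
    return dens / len(theta_draws)

# Hypothetical draws standing in for MCMC output (illustrative only)
rng = np.random.default_rng(2)
theta_draws = [(rng.normal(size=2), np.eye(2) * rng.uniform(0.5, 2.0))
               for _ in range(100)]
grid = np.column_stack([g.ravel() for g in
                        np.meshgrid(np.linspace(-4, 4, 50),
                                    np.linspace(-4, 4, 50))])
p_hat = density_estimate(grid=grid, theta_draws=theta_draws)
print(p_hat.shape)   # (2500,) density ordinates over the grid
```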

2.5 Generalizations of the Linear Model

The model and MCMC algorithm considered here can easily be extended. We have emphasized the use of the DP prior for the parameters of the error terms, but we could easily put the same prior on the regression coefficients and allow these to vary from observation to observation as in

(2.15)
$$y_i = \beta_i x_i + w_i'\gamma_i + \varepsilon_i$$

This is a general method for approximating an unknown mean function. The regression coefficients will be grouped together and can assume different values in different regions of the regressor space. In addition, a model of the form in (2.15) would allow for conditional heteroskedasticity. Implementation of this approach would require a separate DP prior for the coefficients. Some of the computations in the Gibbs sampler for the DP parameters would have to change, but our modular computing method would easily allow one to plug in just a few sub-routines. Moreover, the conjugate prior computations required would be less elaborate than for the DP model for multivariate normal error terms. An interesting special case of (2.15) would be the model with heterogeneous treatment effects.

(2.16)
$$y_i = \beta_i x_i + w_i'\gamma + \varepsilon_i$$

Here interest would focus on identifying subsets of the observations with different effects of x on y.

The computational algorithms for (2.15) or (2.16) are straightforward extensions of what we have already implemented. The real work would be in the assessment of reasonable priors and in methods for interpreting the results.
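To illustrate what (2.16) captures, here is a toy Python sketch (ours, with illustrative parameter values) that generates data with two latent groups of observations having different treatment effects β_i. A DP prior on {β_i} would aim to recover exactly this kind of grouping, which a pooled IV estimate averages over.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400

# Two latent clusters of treatment effects (illustrative values)
group = rng.integers(0, 2, size=n)
beta_i = np.where(group == 0, 0.5, 2.0)       # heterogeneous effects

z = rng.normal(size=n)
w = rng.normal(size=n)
eps = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)

x = 0.7 * z + eps[:, 0]
y = beta_i * x + 0.3 * w + eps[:, 1]          # model (2.16), common gamma

# A pooled IV estimate, cov(z, y) / cov(z, x), recovers a weighted average
# of the beta_i, masking the two groups a DP prior on {beta_i} could uncover.
num = np.cov(z, y, bias=True)[0, 1]
den = np.cov(z, x, bias=True)[0, 1]
print("pooled IV estimate:", num / den)
```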

3. Hyperparameters and Prior

In this section, we develop a prior for the model (2.1)-(2.2) and associated Gibbs sampler (2.8)-(2.11). Priors are chosen to enable us to capture reasonable prior information and for convenience in making the draws. The choices include the family $G_0$ and associated parameters λ, the prior on (β, δ, γ), and the prior on the Dirichlet Process parameter α. We will let (β, γ), δ and α be a priori independent. Our approach will be to put a prior on α which will admit a reasonable a priori distribution of the number of unique θ_i values. We will choose λ rather than put a prior on this quantity.
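A prior on α implies a prior on the number of unique θ_i values. A standard Dirichlet process result (textbook DP theory, not a formula stated in this excerpt) gives the prior expectation $E[I^*] = \sum_{i=1}^{N} \alpha / (\alpha + i - 1) \approx \alpha \log(1 + N/\alpha)$. The short sketch below tabulates this map so one can see how a choice of α translates into an a priori cluster count.

```python
import numpy as np

def expected_num_clusters(alpha, N):
    """Prior expected number of unique theta_i values (clusters I*) among N
    observations under DP(alpha, G0): E[I*] = sum_{i=1}^N alpha/(alpha + i - 1).
    """
    i = np.arange(1, N + 1)
    return np.sum(alpha / (alpha + i - 1))

N = 500
for alpha in (0.1, 1.0, 5.0, 20.0):
    print(f"alpha={alpha:5.1f}  E[I*] = {expected_num_clusters(alpha, N):6.2f}")
```

For N = 500, α = 1 implies roughly log N (about seven) expected clusters, while larger α values spread the observations over many more.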

