Causal Inference in Social Science: An Elementary Introduction


Causal Inference in Social Science
An elementary introduction

Hal R. Varian
Google, Inc
Jan 2015
Revised: March 21, 2015

Abstract

This is a short and very elementary introduction to causal inference in social science applications targeted to machine learners. I illustrate the techniques described with examples chosen from the economics and marketing literature.

1 A motivating problem

Suppose you are given some data on ad spend and product sales in various cities and are asked to predict how sales would respond to a contemplated change in ad spend. If y_c denotes per capita sales in city c and x_c denotes per capita ad spend in city c, it is tempting to run a regression of the form y_c = b x_c + e_c, where e_c is an error term and b is the coefficient of interest.[1] (The machine learning textbook by James et al. [2013] describes a problem of this sort on page 59.)

Such a regression is unlikely to provide a satisfactory estimate of the causal effect of ad spend on sales. To see why, suppose that the sales, y_c, are per capita box office receipts for a movie about surfing and x_c are per capita TV ads for that movie. There are only two cities in the data set: Honolulu, Hawaii and Fargo, North Dakota.

[1] We assume all data has been centered, so we can ignore the constant in the regression.

Suppose that the data set indicates that the advertiser spent 10 cents per capita on TV advertising in Fargo and observed $1 in sales per capita, while in Honolulu the advertiser spent $1 per capita and observed $10 in sales per capita. Hence the model y_c = 10 x_c fits the data perfectly.

But here is the critical question: do you really believe that increasing per capita spend in Fargo to $1 would result in box office sales of $10 per capita? For a surfing movie? This seems unlikely, so what is wrong with our regression model?

The problem is that there is an omitted variable in our regression, which we may call "interest in surfing." Interest in surfing is high in Honolulu and low in Fargo. What's more, the marketing executives that determine ad spend presumably know this, and they choose to advertise more where interest is high and less where it is low. So this omitted variable, interest in surfing, affects both y_c and x_c. Such a variable is called a confounding variable.

To express this point mathematically, think of (y, x, e) as being the population analogs of the sample (y_c, x_c, e_c). The regression coefficient is given by b_hat = cov(x, y)/cov(x, x). Substituting y = bx + e, we have

    b_hat = cov(x, xb + e)/cov(x, x) = b + cov(x, e)/cov(x, x).

The regression coefficient will be unbiased when cov(x, e) = 0.[2]

If we are primarily interested in predicting sales as a function of spend and the advertiser's behavior remains constant, this simple regression may be just fine. But usually simple prediction is not the goal; what we want to know is how box office receipts would respond to a change in the data generating behavior.
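The bias formula can be checked numerically. Below is a minimal simulation (the variable names and numbers are my own, not from the paper) in which ad spend x follows a confounder s that also sits in the error term; the simple regression coefficient comes out equal to b + cov(x, e)/cov(x, x) rather than the true b.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

s = rng.normal(size=n)            # confounder: interest in surfing
x = 0.8 * s + rng.normal(size=n)  # ad spend responds to interest
e = 2.0 * s + rng.normal(size=n)  # error term absorbs the confounder
b_true = 1.0
y = b_true * x + e                # sales

# Simple regression coefficient: cov(x, y) / cov(x, x)
b_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# The identity from the text: b + cov(x, e) / cov(x, x)
b_identity = b_true + np.cov(x, e)[0, 1] / np.var(x, ddof=1)

print(b_hat, b_identity)  # both near 1.98, far from b_true = 1.0
```

The two quantities agree exactly in sample, and both are far from the causal coefficient of 1.0, because cov(x, e) is large whenever the confounder drives both spend and sales.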
The choice of ad expenditure was based on many factors observed by the advertiser; but now we want to predict what the outcome would have been if the advertiser's choice had been different, without observing the factors that actually influenced the original choices.

To put it slightly more formally: we have observations that were generated by a process such as "choose spend based on factors you think are important," and we want to predict what would happen if we change to a data generating process such as "increase your spend everywhere by x percent."

[2] Note that the problem is not inherently statistical in nature. Suppose that there is no error term, so that the model "revenue = spend + interest in surfing" fits exactly. If we only look at the variation in spend and ignore the variation in surfing interest, we get a misleading estimate of the relationship between spend and revenue.
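The gap between the two data generating processes can be made concrete in a small simulation (my own illustration; all numbers are invented). A regression fit to data where spend tracks surfing interest suggests roughly two units of sales per unit of spend, but actually raising spend everywhere by one unit, holding interest fixed, delivers only the true causal effect of one.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

s = rng.normal(size=n)                        # interest in surfing (unobserved)
x = s + rng.normal(size=n)                    # spend chosen with knowledge of s
y = 1.0 * x + 2.0 * s + rng.normal(size=n)    # true causal effect of spend is 1.0

# What the observational regression reports per unit of spend
b_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Intervention: raise spend everywhere by 1 unit, holding s fixed
y_new = 1.0 * (x + 1.0) + 2.0 * s + rng.normal(size=n)
actual_change = y_new.mean() - y.mean()

print(b_hat)          # about 2.0: the regression's predicted gain
print(actual_change)  # about 1.0: what the intervention actually delivers
```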

It is important to understand that the problem isn't simply that there is a missing variable in the regression. There are always missing variables; that's what the error term represents. The problem is that the missing variable, "interest in surfing," affects both the outcome (sales) and the predictor (ads), so the simple regression of sales on ads won't give us a good estimate of the causal effect: what would happen to sales if we explicitly intervened and changed ad expenditure across the board.

This problem comes up all the time in statistical analysis of human behavior. In our example, the amount of advertising in a city, x_c, is chosen by decision makers who likely have some views about how various factors affect outcomes, y_c. However, the analyst is not able to observe these factors; they are part of the error term, e_c. But this means that it is very unlikely that x_c and e_c are uncorrelated. In our example, cities with high interest in surfing may have high ad expenditure and high box office receipts, meaning a simple regression of y_c on x_c would overestimate the effect of ad expenditure on sales.[3]

In this simple example, we have described a particular confounding variable. But in realistic cases, there will be many confounding variables: variables that affect both the outcome and the variables we are contemplating changing.

Everyone knows that adding an extra predictor to a regression will typically change the values of the estimated coefficients on the other predictors, since the relevant predictors are generally correlated with each other. Nevertheless, we seem comfortable in assuming that the predictors we don't observe, those in the error term, are magically orthogonal to the predictors we do observe!

The "ideal" set of data, from the viewpoint of the analyst, would be data from a totally incompetent advertiser who allocated advertising expenditures totally randomly across cities.
If ad expenditure is truly random, then we don't have to worry about confounding variables, since the predictors will automatically be orthogonal to the error term. But statisticians are seldom lucky enough to have a totally incompetent client.

There are many other examples of confounding variables in economics. Here are a few classic examples.

[3] It wouldn't have to be that way. Perhaps surfing is so popular in Honolulu that everyone already knows about the movie and it is pointless to advertise it. Again, this is the sort of thing the advertiser might know but the analyst doesn't.

How does fertilizer affect crop yields? If farmers apply more fertilizer to more fertile land, then more fertilizer will be associated with higher yields, and a simple regression of yields on fertilizer will not give the true causal effect.

How does education affect income? Those who have more education tend to have higher incomes, but that doesn't mean that education caused those higher incomes. Those who have wealthy parents or high ability tend to acquire both more education and more income. Hence simple regressions of income on education tend to overstate the impact of education. (See James et al. [2013], p. 283 for a machine learning approach to this problem and Card [1999] for an econometric approach.)

How does health care affect income? Those who have good jobs tend to have health care, so a regression of income on health care will show a positive effect, but the direction of the causality is unclear.

In each of these cases, we may contemplate some intervention that will change behavior.

How would crop yields change if we changed the amount of fertilizer applied?
How would income change if we reduced the cost of acquiring education?
How would income change if we changed the availability of health care?

Each of these policies asks what happens to some output if we change an input and hold other factors constant. But the data was generated by parties who were aware of those other factors and made choices based on their perceptions. We want an answer to a ceteris paribus question, but our data was generated mutatis mutandis.

2 Experiments

As Box et al. [2005] put it, "To find out what happens when you change something, it is necessary to change it." As we will see, that may be slightly overstated, but the general principle is right: the best way to answer causal questions is usually to run an experiment.
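A quick simulation shows what randomization buys (my own illustration, with invented numbers). When decision makers assign treatment by following the confounder, the naive difference in means is inflated; when a coin flip assigns treatment, the same estimator recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
effect = 2.0                               # true causal effect of treatment

s = rng.normal(size=n)                     # confounder (e.g. interest in surfing)

# Self-selected treatment: more likely where the confounder is high
t_selected = (s + rng.normal(size=n) > 0).astype(float)
# Randomized treatment: coin flip, independent of s
t_random = rng.integers(0, 2, size=n).astype(float)

def diff_in_means(t):
    # Draw outcomes that depend on the confounder as well as the treatment
    y = effect * t + 3.0 * s + rng.normal(size=n)
    return y[t == 1].mean() - y[t == 0].mean()

naive = diff_in_means(t_selected)
randomized = diff_in_means(t_random)
print(naive)       # about 5.4: true effect plus selection bias
print(randomized)  # about 2.0: the true effect
```

Nothing about the outcome model changes between the two runs; only the assignment rule does, which is exactly the point of the Box et al. dictum.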

However, experiments are often costly and in some cases are actually infeasible. Consider the example of the impact of education on income. An ideal experiment would require randomly selecting the amount of education students acquire, which would be rather difficult.

But this is an extreme case. Actual education policies being contemplated might involve things like student loans or scholarships, and small scale experiments with such policies may well be feasible. Furthermore, there may be "natural experiments" that can shed light on such issues without requiring explicit intervention.

In an experiment, one applies a "treatment" to some set of subjects and observes some outcomes. The outcomes for the treated subjects can be compared to the outcomes for the untreated subjects (the control group) to determine the causal effect of the treatment on the subjects.

One may be interested in the "impact of the treatment on the population," in which case one would like the subjects to be a representative sample from the population. Or one might be interested in how the treatment affected those who actually were treated, in which case one is concerned with the "impact of the treatment on the treated." Or you might be interested in those who were invited to be treated, whether or not they actually agreed to be treated; this is called an "intention to treat" analysis.

If the proposed policy is going to be applied universally to some population, then one is likely interested in the impact of the treatment on the population. If the proposed policy to be implemented involves voluntary participation, then one may be interested in the impact of the treatment on those who choose (or agree) to be treated.

In marketing, we are often interested in how a change in advertising policies affects a particular firm: the impact of a treatment on a subject that chooses to be treated. This impact may well be different from the impact on a subject
This impact may well be different from a subjectwhere treatment is 37138139140Fundamental identity of causal inferenceFollowing Angrist and Pischke [2009] we can decompose the observed outcome of a treatment into two effects.Outcome for treated Outcome for untreated [Outcome for treated Outcome for treated if not treated] [Outcome for treated if not treated Outcome for untreated]5

      = Impact of treatment on treated + Selection bias

The first bracketed term is the impact of the treatment on the treated, while the second bracketed term is the selection bias: the difference between the outcome for the treated if they were not treated and the outcome for those who were, in reality, not treated.

This "basic identity of causal inference" shows that the critical concept for understanding causality is the comparison of the actual outcome (what happens to the treated) to the counterfactual (what would have happened if they were not treated), an insight that goes back to Neyman [1923] and Rubin [1974]. As Rubin emphasized, we can't actually observe what would have happened to the treated if they hadn't been treated, so we have to estimate that counterfactual some other way.

As an example, think of our Fargo/Honolulu data set. The true model is

    y_c = a + x_c b + s_c d + e_c,

where s_c is a variable that measures "interest in surfing". If the counterfactual is no ad expenditure at all, we would still see variation in revenue across cities due to s_c. To determine the causal impact of additional ad expenditure on revenue, we have to compare the observed revenue to a counterfactual revenue that would be associated with some default ad expenditure.

By the way, the basic identity nicely shows why randomized trials are the gold standard for causal inference. If the treated group is a random sample of the population, then the first term is an estimate of the causal impact of the treatment on the population, and if the assignment is random then the second term has an expected value of zero.

4 Impact of an ad campaign

Angrist and Pischke [2014] describe what they call the "Furious Five methods of causal inference": random assignment, regression, instrumental variables, regression discontinuity, and differences in differences.
We will outline these techniques in the next few sections, though we organize the topics slightly differently.

As a baseline case for the analysis, let us consider a single firm that is running a randomized experiment to determine whether it is beneficial to increase its ad spend. We could imagine applying the increase in ad spend to

some consumers and not others, to some geographic locations but not others, or at some times but not at other times.

In each case, the challenge is to predict what would have happened if the treatment had not been applied. This is particularly difficult for an experiment, since the likelihood that a randomly chosen person buys a particular product during a particular period is typically very small. As Lewis and Rao [2013] have indicated, estimating such small effects can be very difficult.

The challenge here is something quite familiar to machine learning specialists: predictive modeling. We have time-tested ways to build such a model. In the simplest case, we divide the data into a training set and a test set and adjust the parameters on the training set until we find a good predictive model for the test set. Once we have such a model, we can apply it to the treated units to predict the counterfactual: what would have happened in the absence of treatment. This train-test-treat-compare process is illustrated in Figure 4.
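A minimal sketch of the train-test-treat-compare process, using a plain least-squares predictor (the cities, weeks, and lift figure are all invented for illustration): fit a model of the to-be-treated city's sales from control cities' sales in the pre-treatment period, check it on held-out pre-treatment weeks, then use it to predict the post-treatment counterfactual and compare with what actually happened.

```python
import numpy as np

rng = np.random.default_rng(2)
weeks, lift = 120, 5.0       # hypothetical campaign adds 5 units/week from week 100

demand = rng.normal(10, 2, size=weeks)                    # shared demand conditions
controls = demand[:, None] + rng.normal(size=(weeks, 3))  # 3 untreated cities
treated = 1.5 * demand + rng.normal(size=weeks)           # the to-be-treated city
treated[100:] += lift                                     # treatment begins at week 100

# Train on weeks 0-79, test on the held-out pre-treatment weeks 80-99
X = np.column_stack([np.ones(weeks), controls])
w, *_ = np.linalg.lstsq(X[:80], treated[:80], rcond=None)
test_rmse = np.sqrt(np.mean((X[80:100] @ w - treated[80:100]) ** 2))

# Treat and compare: predicted counterfactual vs. actual outcome
counterfactual = X[100:] @ w
estimated_lift = np.mean(treated[100:] - counterfactual)
print(test_rmse, estimated_lift)  # lift estimate lands near the true 5
```

The test-set RMSE tells us how much to trust the counterfactual before we ever look at the treated period, which is the discipline the train-test-treat-compare cycle adds over an informal before/after comparison.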

The train-test-treat-compare cycle is a generalization of the classic treatment-control approach to experimentation. In that model, the control group provides an estimate of the counterfactual. However, if we can build a predictive model that improves on predictions of what happens in the absence of treatment, all the better.

The train-test-treat-compare cycle I have outlined is similar to the synthetic control method described by Abadie et al. [2010].[4] Synthetic control methods use a particular way to build a predictive model of to-be-treated subjects based on a convex combination of other subjects' outcomes. However, machine learning offers a variety of other modeling techniques which may lead to better predictions on the test set and, therefore, better predictions of the counterfactual.

One important caveat: we don't want to use predictors that are correlated with the treatment; otherwise we run into the confounding variable problem described earlier. For example, during the Holiday Season, we commonly observe both an increase in ad spend and an increase in sales. So the "Holiday Season" is a confounding variable, and a simple regression of sales on spend would give a misleading estimate. The solution here is simple: pull the confounder out of the error term and model the seasonality as an explicit predictor.

5 Regression discontinuity

As I indicated earlier, it is important to understand the data generating process when trying to develop a model of who was selected for the treatment. One particularly common selection rule is to use a threshold. In this case, observations close to, but just below, a threshold should be similar to those close to, but just above, the threshold.
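This local-comparison idea can be sketched in a toy simulation (the score, cutoff, and effect size are invented): among subjects scored near the cutoff, which side they land on is nearly random, so a difference in mean outcomes within a narrow window isolates the jump caused by the treatment, while a naive comparison of all treated to all untreated mixes in the underlying trend.

```python
import numpy as np

rng = np.random.default_rng(3)
n, cutoff, effect = 500_000, 50.0, 4.0

score = rng.uniform(0, 100, size=n)
treated = score >= cutoff
# Outcome trends smoothly in the score; treatment adds a jump at the cutoff
y = 0.2 * score + effect * treated + rng.normal(size=n)

# Naive comparison mixes the smooth trend into the estimate
naive = y[treated].mean() - y[~treated].mean()

# Local comparison in a narrow window around the cutoff
window = np.abs(score - cutoff) < 1.0
local = y[window & treated].mean() - y[window & ~treated].mean()

print(naive, local)  # naive near 14; local near 4.2 (effect plus a tiny trend term)
```

Shrinking the window trades a smaller trend contamination against noisier means, which is the usual bandwidth choice in regression discontinuity designs.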
So if we are interested in the causal effect of crossing the threshold, comparing subjects on each side of the threshold is appealing.

For example, Angrist and Lavy [1999] observe that in Israel, elementary school classes that have 40 students enrolled on the first day remain at that size throughout the year. But classes with 41 or more students have to be divided in half, or as close to that as possible.

[4] See also the time-series literature on interrupted regression, intervention analysis, structural change detection, etc.

This allows them to compare student performance in classes with 40 initial students to that

in classes with (say) 41 initial students (who end up with 20-person classes), thereby teasing out the causal effect of class size on educational performance. It is essentially random which side of the threshold a particular subject ends up on, so this is almost as good as random assignment to different sized classes.[5]

Another nice example is the study by Valletti et al. [2014] that aims to estimate the impact of broadband speed on housing values. Just looking at the observational data will not resolve this issue, since houses in newer areas may be both more expensive and have better broadband connections. But looking at houses that are just on the boundary of internet service areas allows one to identify the causal effect of broadband on house valuation.

As a final example, consider Carpenter and Dobkin [2011], who examine the impact of the minimum legal drinking age on mortality. The story is told in Figure 1, which is taken from this paper.[6]

[Figure 1: Death rates by age]

[5] The actual policies used are a bit more complicated than I have described; see the cited source or Angrist and Pischke [2009] for a more detailed description.
[6] See also the helpful discussion in Angrist and Pischke [2014].

As you can see, there is a major jump in motor vehicle accidents at the age of 21. Someone who is 20.5 years old isn't that different from someone who is 21 years old, on average,

but 21 year olds have much higher death rates from automobile accidents, suggesting that the minimum drinking age causes this effect.

Regression discontinuity design is very attractive when algorithms are used to make a choice. For example, ads may receive some special treatment, such as appearing in a prominent position, if they have a score that exceeds some threshold. We can then compare ads that just missed the threshold to those that just passed the threshold to determine the causal effect of the treatment. Effectively, the counterfactual for the treated ads is provided by the ads that just missed being treated. See Narayanan and Kalyanam [2014] for an example in the context of ranking search ads.

Even better, we might explicitly randomize the algorithm. Instead of a statement like "if (score > threshold) do treatment" we have a statement like "if (score + e > threshold) do treatment", where e is a small random number. This explicit randomization allows us to estimate the causal effect of the treatment on outcomes of interest. Note that restricting e to be small means that our experiment will not be very costly compared to the status quo, since only cases close to the threshold are affected.

6 Natural experiments

If there is a threshold involved in making a decision, by focusing only on those cases close to the threshold we may have something that is almost as good as random assignment to treatment and control. But we may also be able to find a "natural experiment" that is "as good as random."

Consider, for example, the Super Bowl. It is well known that the home cities of the teams that are playing have an audience about 10-15% larger than cities not associated with the teams playing. It is also well known that companies that advertise during the Super Bowl have to purchase their ads months before it is known which teams will actually be playing. The combination of these two facts implies that two essentially randomly chosen cities will experience a 10% increase

