ADA1: Chapter 8, Correlation & Regression


Chapter 8: Correlation & Regression

We can think of ANOVA and the two-sample t-test as applicable to situations where there is a quantitative response variable and another variable that indicates group membership, which we might think of as a categorical predictor variable. In the slides on categorical data, all variables are categorical, and we keep track of the counts of observations in each category or combination of categories. In this section, we analyze cases where we have multiple quantitative variables.

Chapter 8: Correlation & Regression

In the simplest case, there are two quantitative variables. Examples include the following:

- heights of fathers and sons (this is a famous example from Galton, Darwin's cousin)
- ages of husbands and wives
- systolic versus diastolic pressure for a set of patients
- high school GPA and college GPA
- college GPA and GRE scores
- MCAT scores before and after a training course

In the past, we might have analyzed pre versus post data using a two-sample t-test to see whether there was a difference. It is also possible to try to quantify the relationship: instead of just asking whether the two sets of scores are different, or getting an interval for the average difference, we can also try to predict the new score based on the old score, and the amount of improvement might depend on the old score.

Chapter 8: Correlation & Regression

Here is some example data for husbands and wives. Heights are in mm.

[Table: Couple, HusbandAge, HusbandHeight, WifeAge, WifeHeight for couples 1-12; the individual values did not survive transcription.]

Correlation: Husband and wife ages

[Scatterplot] Correlation is 0.88.

Correlation: Husband and wife heights

[Scatterplot] Correlation is 0.18 with the outlier, but -0.54 without the outlier.

Correlation: scatterplot matrix

pairs(x[,2:5]) allows you to look at all the data simultaneously.

Correlation: scatterplot matrix

library(ggplot2)
library(GGally)
p1 <- ggpairs(x[,2:5])
print(p1)

Correlation: scatterplot matrix

[Figure: scatterplot matrix of the four variables]

Chapter 8: Correlation & Regression

For a data set like this, you might not expect age to be significantly correlated with height for either men or women (but you could check). You could also check whether differences in couples' ages are correlated with differences in their heights. The correlation between two variables is computed as follows:

cor(x$WifeAge, x$HusbandAge)

Note that the correlation is looking at something different than the t-test. A t-test for this data might look at whether the husbands and wives had the same average age. The correlation looks at whether younger wives tend to have younger husbands and older husbands tend to have older wives, whether or not there is a difference in the ages overall. Similarly for height. Even if husbands tend to be taller than wives, that doesn't necessarily mean that there is a relationship between the heights within couples.

Pairwise Correlations

The pairwise correlations for an entire dataset can be computed as follows. What would it mean to report the correlation between the Couple variable and the other variables? Here I only get the correlations for variables other than the ID variable.

options(digits = 4)  # done so that the output fits on the screen!
cor(x[,2:5])

              HusbandAge HusbandHeight  WifeAge WifeHeight
HusbandAge        1.0000      -0.24716  0.88003    -0.5741
HusbandHeight    -0.2472       1.00000  0.02124     0.1783
WifeAge           0.8800       0.02124  1.00000    -0.5370
WifeHeight       -0.5741       0.17834 -0.53699     1.0000

Chapter 8: Correlation & Regression

The correlation measures the linear relationship between variables X and Y as seen in a scatterplot. The sample correlation between X_1, ..., X_n and Y_1, ..., Y_n is denoted by r and has the following properties:

- −1 ≤ r ≤ 1
- if Y_i tends to increase linearly with X_i, then r > 0
- if Y_i tends to decrease linearly with X_i, then r < 0
- if there is a perfect linear relationship between X and Y, then r = 1 (points fall on a line with positive slope)
- if there is a perfect negative relationship between X and Y, then r = −1 (points fall on a line with negative slope)
- the closer the points (X_i, Y_i) are to a straight line, the closer r is to 1 or −1
- r is not affected by linear transformations (i.e., converting from inches to centimeters, Fahrenheit to Celsius, etc.)
- the correlation is symmetric: the correlation between X and Y is the same as the correlation between Y and X

Chapter 8: Correlation & Regression

For n observations on two variables, the sample correlation is calculated by

$$ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{S_{XY}}{S_X S_Y} $$

Here S_X and S_Y are the sample standard deviations and

$$ S_{XY} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

is the sample covariance. All the (n − 1) terms cancel out from the numerator and denominator when calculating r.
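As a quick numerical check, the formula can be computed directly in R and compared with cor(). This is a minimal sketch; it assumes the couples data frame x from these notes is loaded.

xv <- x$WifeAge; yv <- x$HusbandAge
# sample covariance and sample standard deviations
Sxy <- sum((xv - mean(xv)) * (yv - mean(yv))) / (length(xv) - 1)
r.by.hand <- Sxy / (sd(xv) * sd(yv))
r.by.hand
cor(xv, yv)   # should give the same value (about 0.88 for the ages)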

Chapter 8: Correlation & Regression

[Figure]

Chapter 8: Correlation & Regression

[Figure]

Chapter 8: Correlation & Regression

CIs and hypothesis tests can be done for correlations using cor.test(). The test is usually based on testing whether the population correlation ρ is equal to 0, so

H0: ρ = 0

and you can have either a two-sided or one-sided alternative hypothesis. We think of r as a sample estimate of ρ, the Greek letter for r. The test is based on a t-statistic which has the formula

$$ t_{obs} = r \sqrt{\frac{n - 2}{1 - r^2}} $$

and this is compared to a t distribution with n − 2 degrees of freedom. As usual, you can rely on R to do the test and get the CI.
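To make the formula concrete, here is a small sketch that computes the t statistic and two-sided p-value by hand; it assumes the couples data frame x is loaded, and the results can be compared against cor.test().

r <- cor(x$WifeAge, x$HusbandAge)
n <- nrow(x)
t.obs <- r * sqrt((n - 2) / (1 - r^2))       # t statistic for H0: rho = 0
p.value <- 2 * pt(-abs(t.obs), df = n - 2)   # two-sided p-value
c(t.obs, p.value)   # should match the t and p-value from cor.test()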

Correlation

The t distribution derivation of the p-value and CI assumes that the joint distribution of X and Y follows what is called a bivariate normal distribution. A necessary condition for this is that X and Y each individually have normal distributions, so you can do the usual tests or diagnostics for normality on each variable. Similar to the t-test, the correlation is sensitive to outliers. For the husband and wife data, the sample sizes are small, making it difficult to detect outliers. However, there is not clear evidence of non-normality.

Chapter 8: Correlation & Regression

[Figure]

Chapter 8: Correlation & Regression

Shapiro-Wilk tests for normality would all be not rejected, although the sample sizes are quite small for detecting deviations from normality:

> shapiro.test(x$HusbandAge)$p.value
[1] 0.8934
> shapiro.test(x$WifeAge)$p.value
[1] 0.2461
> shapiro.test(x$WifeHeight)$p.value
[1] 0.1304
> shapiro.test(x$HusbandHeight)$p.value
[1] 0.986

Correlation

Here we test whether ages are significantly correlated and also whether heights are positively correlated.

> cor.test(x$WifeAge, x$HusbandAge)

        Pearson's product-moment correlation

data:  x$WifeAge and x$HusbandAge
t = 5.9, df = 10, p-value = 2e-04
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6185 0.9660

Correlation

Here we test whether ages are significantly correlated and also whether heights are positively correlated.

> cor.test(x$WifeHeight, x$HusbandHeight)

        Pearson's product-moment correlation

data:  x$WifeHeight and x$HusbandHeight
t = 0.57, df = 10, p-value = 0.6
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4407 0.6824
sample estimates:
   cor
0.1783

Correlation

We might also test the heights with the bivariate outlier removed:

> cor.test(x$WifeHeight[x$WifeHeight > 1450], x$HusbandHeight[x$WifeHeight > 1450])

        Pearson's product-moment correlation

data:  x$WifeHeight[x$WifeHeight > 1450] and x$HusbandHeight[x$WifeHeight > 1450]
t = -1.9, df = 9, p-value = 0.1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8559 0.1078
sample estimates:
    cor
-0.5261

Correlation

Removing the outlier changes the direction of the correlation (from positive to negative). The result is still not significant at the α = 0.05 level, although the p-value is 0.1, suggesting slight evidence against the null hypothesis of no relationship between heights of husbands and wives. Note that the negative correlation here means that, with the one outlier couple removed, taller wives tended to be associated with shorter husbands and vice versa.

Correlation

A nonparametric approach for dealing with outliers or otherwise non-normal distributions for the variables being correlated is to rank the data within each sample and then compute the usual correlation on the ranked data. Note that in the Wilcoxon two-sample test, you pool the data first and then rank the data. For the Spearman correlation, you rank each group separately.

The idea is that large observations will have large ranks in both groups, so that if the data are correlated, large ranks will tend to get paired with large ranks, and small ranks will tend to get paired with small ranks. If the data are uncorrelated, then the ranks will be random with respect to each other.
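A small sketch of this idea in R (assuming the couples data frame x): ranking each variable and then taking the Pearson correlation of the ranks reproduces the Spearman correlation.

cor(rank(x$WifeHeight), rank(x$HusbandHeight))            # Pearson correlation of the ranks
cor(x$WifeHeight, x$HusbandHeight, method = "spearman")   # same value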

The Spearman correlation is implemented in cor.test() using the option method = "spearman". Note that the correlation is negative using the Spearman ranking even with the outlier, but the correlation was positive using the usual (Pearson) correlation. The Pearson correlation was negative when the outlier was removed. Since the results depended so much on the presence of a single observation, I would be more comfortable with the Spearman correlation for this example.

> cor.test(x$WifeHeight, x$HusbandHeight, method = "spearman")

        Spearman's rank correlation rho

data:  x$WifeHeight and x$HusbandHeight
S = 370, p-value = 0.3
alternative hypothesis: true rho is not equal to 0
sample estimates:
    rho
-0.3034

Chapter 8: Correlation & Regression

[Figure] A more extreme example of an outlier. Here the correlation changes from 0 to negative.

Correlation

Something to be careful of is that if you have many variables (which often occurs), then testing every pair of variables for a significant correlation leads to multiple comparison problems, for which you might want to use a Bonferroni correction, or limit yourself to only testing a small number of pairs of variables that are interesting a priori.
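As an illustration of one way to do this, the sketch below runs all pairwise correlation tests among the four quantitative variables and applies a Bonferroni adjustment with p.adjust(); it assumes the couples data frame x from these notes.

vars <- names(x)[2:5]
prs <- combn(vars, 2)   # all pairs of variable names
pvals <- apply(prs, 2, function(v) cor.test(x[[v[1]]], x[[v[2]]])$p.value)
names(pvals) <- apply(prs, 2, paste, collapse = ":")
p.adjust(pvals, method = "bonferroni")   # Bonferroni-adjusted p-values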

Regression

In regression, we try to make a model that predicts the average response of one quantitative variable given one or more predictor variables. We start with the case where there is one predictor variable, X, and one response, Y, which is called simple linear regression.

Unlike correlation, the model depends on which variable is the predictor and which is the response. While the correlation of x and y is the same as the correlation of y and x, the regression of y on x will generally lead to a different model than the regression of x on y. In the phrase "regressing y on x", we mean that y is the response and x is the predictor.

Regression

In the basic regression model, we assume that the average value of Y has a linear relationship to X, and we write

y = β0 + β1 x

Here β0 is the intercept and β1 is the slope of the line. This is similar to equations of lines from courses like College Algebra where you write

y = a + bx

or

y = mx + b

But we think of β0 and β1 as unknown parameters, similar to µ for the mean of a normal distribution. One possible goal of a regression analysis is to make good estimates of β0 and β1.

Regression

Review of lines, slopes, and intercepts. The slope is the number of units that y changes for a change of 1 unit in x. The intercept (or y-intercept) is where the line intersects the y-axis.

Regression

In real data, the points almost never fall exactly on a line, but there might be a line that describes the overall trend. (This is sometimes even called the trend line.) Given a set of points, which line through the points is "best"?

Regression

[Scatterplot] Husband and wife age example.

Regression

Husband and wife age example. Here we plot the line y = x. Note that 9 out of 12 points are above the line: for the majority of couples, the husband was older than the wife. The points seem a little shifted up compared to the line.

Now we've added the usual regression line in black. It has a smaller slope but a higher intercept. The lines seem to make similar predictions at higher ages, but the black line seems a better fit to the data for the lower ages. Although this doesn't always happen, exactly half of the points are above the black line.

Regression

It is a little difficult to tell visually which line is best. Here is a third line, which is based on regressing the wives' heights on the husbands' heights.

Regression

It is difficult to tell which line is "best" or even what is meant by a best line through the data. What to do?

One possible solution to the problem is to consider all possible lines of the form

y = β0 + β1 x

or here

Husband height = β0 + β1 (Wife height)

In other words, consider all possible choices of β0 and β1 and pick the one that minimizes some criterion. The most common criterion used is the least squares criterion: here you pick β0 and β1 that minimize

$$ \sum_{i=1}^n [y_i - (\beta_0 + \beta_1 x_i)]^2 $$

Regression

Graphically, this means minimizing the sum of squared vertical deviations from each point to the line.

Regression

Rather than testing all possible choices of β0 and β1, formulas are known for the optimal choices that minimize the sum of squares. We think of these optimal values as estimates of the true, unknown population parameters β0 and β1. We use b0 or β̂0 to mean an estimate of β0, and b1 or β̂1 to mean an estimate of β1:

$$ b_1 = \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = r \frac{S_Y}{S_X} $$

$$ b_0 = \hat{\beta}_0 = \bar{y} - b_1 \bar{x} $$

Here r is the Pearson (unranked) correlation coefficient, and S_X and S_Y are the sample standard deviations. Note that r is positive if, and only if, b1 is positive. Similarly, if one is negative the other is negative. In other words, r has the same sign as the slope of the regression line.
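To connect the formulas to the fitted model in R, here is a short sketch that computes b1 and b0 by hand and compares them with lm(); it assumes the couples data frame x is loaded.

xv <- x$WifeAge; yv <- x$HusbandAge
b1 <- sum((xv - mean(xv)) * (yv - mean(yv))) / sum((xv - mean(xv))^2)
b0 <- mean(yv) - b1 * mean(xv)
c(b0, b1)
coef(lm(yv ~ xv))              # same estimates
cor(xv, yv) * sd(yv) / sd(xv)  # b1 also equals r * (SY / SX)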

Regression

The equation for the regression line is

ŷ = b0 + b1 x

where x is any value (not just values that were observed), and b0 and b1 were defined on the previous slide. The notation ŷ is used to mean the predicted or average value of y for the given x value. You can think of it as meaning the best guess for y if a new observation will have the given x value.

A special thing to note about the regression line is that it necessarily passes through the point (x̄, ȳ).

Regression: scatterplot with least squares line

Options make the dots solid and a bit bigger.

plot(WifeAge, HusbandAge, xlim = c(20,60), ylim = c(20,60),
     xlab = "Wife Age", ylab = "Husband Age", pch = 16,
     cex = 1.5, cex.axis = 1.3, cex.lab = 1.3)
abline(model1, lwd = 3)

Regression: scatterplot with least squares line

You can always customize your plot by adding to it. For example you can add the point (x̄, ȳ). You can also add reference lines, points, annotations using text at your own specified coordinates, etc.

points(mean(WifeAge), mean(HusbandAge), pch = 17, col = "red")
text(40, 30, "r = 0.88", cex = 1.5)
text(25, 55, "Hi Mom!", cex = 2)
lines(c(20,60), c(40,40), lty = 2, lwd = 2)

The points statement adds a red triangle at the mean of both ages, which is the point (37.58, 39.83). If a single coordinate is specified in the points() function, it adds that point to the plot. To add a curve or line to a plot, you can use points() with x and y vectors (just like the original data). For lines(), you specify the beginning and ending x and y coordinates, and R fills in the line.

Regression

To fit a linear regression model in R, you can use the lm() command, which is similar to aov().

The following assumes you have the file couples.txt in the same directory as your R session:

x <- read.table("couples.txt", header = TRUE)
attach(x)
model1 <- lm(HusbandAge ~ WifeAge)
summary(model1)

Regression

Call:
lm(formula = HusbandAge ~ WifeAge)

Residuals:
    Min      1Q  Median      3Q     Max
-8.1066 -3.2607 -0.0125  3.4311  6.8934

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.4447     5.2350   1.995 0.073980 .
WifeAge       0.7820     0.1334   5.860 0.000159 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.197 on 10 degrees of freedom
Multiple R-squared: 0.7745,     Adjusted R-squared: 0.7519
F-statistic: 34.34 on 1 and 10 DF,  p-value: 0.0001595

Regression

The lm() command generates a table similar to the ANOVA table generated by aov().

To go through some elements in the table, it first gives the formula used to generate the output. This is useful when you have generated several models, say model1, model2, model3, ..., and you can't remember how you generated each model. For example, you might have one model with an outlier removed, another model with one of the variables on a log-transformed scale, etc.

The next couple of lines deal with residuals. Residuals are the differences between the observed and fitted values. That is,

$$ y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i) $$
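A quick way to see this relationship in R (assuming model1 and the attached couples data from above): the residuals stored in the model are exactly the observed values minus the fitted values.

x$HusbandAge - fitted(model1)   # observed minus fitted
residuals(model1)               # same values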

Regression

[Figure]

Regression

The next part gives a table similar to ANOVA. Here we get the estimates for the coefficients, b0 and b1, in the first quantitative column. We also get standard errors for these, corresponding t-values, and p-values. The p-values are based on testing the null hypotheses

H0: β0 = 0

and

H0: β1 = 0

The first null hypothesis says that the intercept is 0. For this problem, this is not very meaningful, as it would mean that the husband of a 0-yr-old woman would also be predicted to be 0 years old!

Often the intercept term is not very meaningful in the model. The second null hypothesis is that the slope is 0, which would mean that the wife's age increasing would not be associated with the husband's age increasing.

Regression

For this example, we get a significant result for the wife's age. This means that the wife's age has some statistically significant ability to predict the husband's age. The coefficients give the model

Mean Husband's Age = 10.4447 + 0.7820 (Wife's Age)

The low p-value for the wife's age suggests that the coefficient 0.7820 is statistically significantly different from 0. This means that the data show there is evidence that the wife's age is associated with the husband's age. The coefficient of 0.7820 means that for each year of increase in the wife's age, the mean husband's age is predicted to increase by 0.782 years.

As an example, based on this model, a 30-yr-old woman who was married would be predicted to have a husband who was

10.4447 + (0.782)(30) = 33.9

or about 34 years old. A 55-yr-old woman would be predicted to have a husband who was

10.4447 + (0.782)(55) = 53.5

Regression

The fitted values are found by plugging the observed x values (wife ages) into the regression equation. This gives the expected husband age for each wife. They are given automatically using

> model1$fitted.values
       1        2        3        4        5        6        7        8
44.06894 32.33956 33.90348 55.01637 51.10658 31.55760 51.10658 44.06894
       9       10       11       12
28.42976 29.99368 40.94111 35.46740
> x$WifeAge
 [1] 43 28 30 57 52 27 52 43 23 25 39 32

For example, if the wife is 43, the regression equation predicts 10.4447 + (0.782)(43) = 44.069 for the husband's age.

Regression

To see what is stored in model1, type

names(model1)
#  [1] "coefficients"  "residuals"     "effects"       "rank"
#  [5] "fitted.values" "assign"        "qr"            "df.residual"
#  [9] "xlevels"       "call"          "terms"         "model"

The residuals are also stored; they are the observed husband ages minus the fitted values:

$$ e_i = y_i - \hat{y}_i $$

Regression: ANOVA table

More details about the regression can be obtained using the anova() command on the model1 object:

> anova(model1)
Analysis of Variance Table

Response: HusbandAge
          Df Sum Sq Mean Sq F value    Pr(>F)
WifeAge    1 927.53  927.53  34.336 0.0001595 ***
Residuals 10 270.13   27.01

Here the sum of squared residuals, sum(model1$residuals^2), is 270.13.

Regression: ANOVA table

Other components from the table are (SS means sums of squares):

Residual SS = $\sum_{i=1}^n e_i^2$

Total SS = $\sum_{i=1}^n (y_i - \bar{y})^2$

Regression SS = $b_1 \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$ = Total SS − Residual SS

$R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = r^2$

The Mean Square values in the table are the SS values divided by the degrees of freedom. The degrees of freedom is n − 2 for the residuals and 1 for the single predictor. The F statistic is MSR/MSE (mean square for regression divided by mean square error), and the p-value can be based on the F statistic.
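These pieces can be pulled out of the fitted model directly; a short sketch (assumes model1 and the couples data frame x from above):

res.SS   <- sum(residuals(model1)^2)                     # Residual SS
total.SS <- sum((x$HusbandAge - mean(x$HusbandAge))^2)   # Total SS
reg.SS   <- total.SS - res.SS                            # Regression SS
c(reg.SS, res.SS, reg.SS / total.SS)   # about 927.5, 270.1, and R^2 = 0.77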

Regression

Note that R² = 1 occurs when the Regression SS is equal to the Total SS. This means that the Residual SS is 0, so all of the points fall on the line. In this case, r = ±1 and R² = 1.

On the other hand, R² = 0 means that the Total SS is equal to the Residual SS, so the Regression SS is 0. We can think of the Regression SS and Residual SS as partitioning the Total SS:

Total SS = Regression SS + Residual SS

or

$\frac{\text{Regression SS}}{\text{Total SS}} + \frac{\text{Residual SS}}{\text{Total SS}} = 1$

If a large proportion of the Total SS is from the Regression rather than from the Residuals, then R² is high. It is common to say that R² is a measure of how much variation is explained by the predictor variable(s). This phrase should be used cautiously because it doesn't refer to a causal explanation.

Regression

For the husband and wife age example, the R² value was 0.77. This means that 77% of the variation in husband ages was "explained by" variation in the wife ages. Since R² is just the correlation squared, regressing wife ages on husband ages would also result in R² = 0.77, and 77% of the variation in wife ages would be "explained by" variation in husband ages. Typically, you want the R² value to be high, since this means you can use one variable (or a set of variables) to predict another variable.

Regression

The least-squares line is mathematically well-defined and can be calculated without thinking about the data probabilistically. However, p-values and tests of significance assume the data follow a probabilistic model with some assumptions. Assumptions for regression include the following:

- each pair (x_i, y_i) is independent
- the expected value of y is a linear function of x: E(y) = β0 + β1 x, sometimes denoted µ_{Y|X}
- the variability of y is the same for each fixed value of x; this is sometimes denoted σ²_{Y|X}
- the distribution of y given x is normal with mean β0 + β1 x and variance σ²_{Y|X}
- in the model, x is not treated as random

Regression

Note that the assumption that the variance is the same regardless of x is similar to the assumption of equal variance in ANOVA.

Regression

Less formally, the assumptions, in their order of importance, are:

1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient.
2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors.
3. Independence of errors (i.e., residuals). This assumption depends on how the data were collected.
4. Equal variance of errors.
5. Normality of errors.

It is easy to focus on the last two (especially when teaching) because the first assumptions depend on the scientific context and are not possible to assess just by looking at the data in a spreadsheet (the later, error-related assumptions can at least be examined with residual diagnostics, as sketched below).
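R's built-in residual diagnostics give a quick visual check of the linearity, equal-variance, and normality assumptions; a minimal sketch, assuming model1 from the fit above:

par(mfrow = c(2, 2))
plot(model1)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
shapiro.test(residuals(model1))   # a formal normality check of the residuals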

Regression

To get back to the regression model, the parameters of the model are β0, β1, and σ² (which we might call σ²_{Y|X}, but it is the same for every x). Usually σ² is not directly of interest, but it is necessary to estimate it in order to do hypothesis tests and confidence intervals for the other parameters. σ²_{Y|X} is estimated by

$$ s_{Y|X}^2 = \text{Residual MS} = \frac{\text{Residual SS}}{\text{Residual df}} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2} $$

This formula is similar to the sample variance, but we subtract the predicted values for y instead of the mean of y, ȳ, and we divide by n − 2 instead of dividing by n − 1. We can think of this as two degrees of freedom being lost since β0 and β1 need to be estimated. Usually, the sample variance uses n − 1 in the denominator due to one degree of freedom being lost for estimating µ_Y with ȳ.
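As a numerical check, here is a short sketch computing s²_{Y|X} from the residuals and comparing its square root with the residual standard error reported by summary(model1) (assumes model1 and x from above):

n <- nrow(x)
s2 <- sum(residuals(model1)^2) / (n - 2)   # Residual SS / Residual df
sqrt(s2)                                   # about 5.197
summary(model1)$sigma                      # the same residual standard error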

Regression

Recall that there are observed residuals, which are observed minus fitted values, and unobserved residuals:

e_i = y_i − ŷ_i = y_i − (b_0 + b_1 x_i)

ε_i = y_i − E(y_i) = y_i − (β_0 + β_1 x_i)

The difference in meaning here is whether the estimated or the unknown regression coefficients are used. We can think of e_i as an estimate of ε_i.

Regression

Two ways of writing the regression model are

E(y_i) = β_0 + β_1 x_i

and

y_i = β_0 + β_1 x_i + ε_i

Regression

To get a confidence interval for β1, we can use

b_1 ± t_crit SE(b_1)

where

$$ SE(b_1) = \frac{s_{Y|X}}{\sqrt{\sum_i (x_i - \bar{x})^2}} $$

Here t_crit is based on the Residual df, which is n − 2.

To test the null hypothesis that β1 = β10 (i.e., a particular value for β1), you can use the test statistic

$$ t_{obs} = \frac{b_1 - \beta_{10}}{SE(b_1)} $$

and then compare to a critical value (or obtain a p-value) using n − 2 df (i.e., the Residual df).

Regression

The p-value in the R output is for testing H0: β1 = 0, which corresponds to a flat regression line. But the theory allows testing any particular slope. For the couples data, you might be interested in testing H0: β1 = 1, which would mean that for every year older that the wife is, the husband's age is expected to increase by 1 year.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.4447     5.2350   1.995 0.073980 .
WifeAge       0.7820     0.1334   5.860 0.000159 ***

To test H0: β1 = 1, t_obs = (0.782 − 1.0)/(0.1334) = −1.634. The critical value is qt(.975, 10) = 2.228. So comparing −1.634 to ±2.228 for a two-sided test, we see that the observed test statistic is not as extreme as the critical value, so we cannot conclude that the slope is significantly different from 1. For a p-value, we can use pt(-1.634, 10)*2 = 0.133.

Regression

Instead of getting values such as the SE by hand from the R output, you can also save the output to a variable and extract the values. This reduces roundoff error and makes it easier to repeat the analysis in case the data change. For example, from the previous example, we could use

model1.values <- summary(model1)
b1 <- model1.values$coefficients[2,1]
b1
#[1] 0.781959
se.b1 <- model1.values$coefficients[2,2]
t <- (b1-1)/se.b1
t
#[1] -1.633918

The object model1.values$coefficients here is a matrix object, so the values can be obtained from the rows and columns.

Regression

For the CI for this example, we have

df <- model1.values$fstatistic[3]   # this is hard to find
t.crit <- qt(1 - 0.05/2, df)
CI.lower <- b1 - t.crit * se.b1
CI.upper <- b1 + t.crit * se.b1
print(c(CI.lower, CI.upper))
#[1] 0.4846212 1.0792968

Consistent with the hypothesis test, the CI includes 1.0, suggesting we can't be confident that the ages of husbands increase at a different rate from the ages of their wives.
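As a side note, the same interval for the slope can be obtained in one step with confint(), which uses the same t-based formula; a quick check, assuming model1 from above:

confint(model1, "WifeAge", level = 0.95)
# should reproduce the interval (0.485, 1.079) computed by hand above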

Regression

As mentioned earlier, the R output tests H0: β1 = 0, so you need to extract information to do a different test for the slope. We showed using a t-test for testing this null hypothesis, but it is also equivalent to an F test. Here the F statistic is t²_obs when there is only 1 numerator degree of freedom (one predictor in the regression).

t <- (b1-0)/se.b1
t
#[1] 5.859709
t^2
#[1] 34.33619

which matches the F statistic from the earlier output.

In addition, the p-value matches that for the correlation using cor.test(). Generally, the correlation will be significant if and only if the slope is significantly different from 0.

Regression

Another common application of confidence intervals in regression is for the regression line itself. This means getting a confidence interval for the expected value of y for each value of x.

Here the CI for the mean of y given x is

$$ b_0 + b_1 x \pm t_{crit} \, s_{Y|X} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} $$

where the critical value is based on n − 2 degrees of freedom.

Regression

In addition to a confidence interval for the mean, you can make prediction intervals for a new observation. This gives a plausible interval for a new observation. Here there are two sources of uncertainty: uncertainty about the mean, and uncertainty about how much an individual observation deviates from the mean. As a result, the prediction interval is wider than the CI for the mean.

The prediction interval for y given x is

$$ b_0 + b_1 x \pm t_{crit} \, s_{Y|X} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} $$

Regression

For a particular wife age, such as 40, the CIs and PIs (prediction intervals) are obtained in R by

predict(model1, data.frame(WifeAge = 40), interval = "confidence", level = .95)
#        fit      lwr      upr
# 1 41.72307 38.30368 45.14245
predict(model1, data.frame(WifeAge = 40), interval = "prediction", level = .95)
#        fit     lwr      upr
# 1 41.72307 29.6482 53.79794

Here the predicted husband's age for a 40-yr-old wife is 41.7 years. A CI for the mean husband's age is (38.3, 45.1), but the prediction interval says that 95% of husbands of wives this age would be between 29.6 and 53.8 years old.
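The same predict() call over a grid of ages can be used to draw the confidence and prediction bands around the fitted line; a sketch, assuming model1 and the couples data frame x from above:

new.ages <- data.frame(WifeAge = seq(20, 60, by = 1))
ci <- predict(model1, new.ages, interval = "confidence")
pi <- predict(model1, new.ages, interval = "prediction")
plot(x$WifeAge, x$HusbandAge, xlab = "Wife Age", ylab = "Husband Age", pch = 16)
lines(new.ages$WifeAge, ci[, "fit"], lwd = 2)   # fitted line
lines(new.ages$WifeAge, ci[, "lwr"], lty = 2)   # 95% CI band
lines(new.ages$WifeAge, ci[, "upr"], lty = 2)
lines(new.ages$WifeAge, pi[, "lwr"], lty = 3)   # 95% PI band
lines(new.ages$WifeAge, pi[, "upr"], lty = 3)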

