ADA1: Chapter 8, Correlation & Regression


Chapter 8: Correlation & Regression

We can think of ANOVA and the two-sample t-test as applicable to situations where there is a quantitative response variable and another variable that indicates group membership, which we might think of as a categorical predictor variable. In the slides on categorical data, all variables are categorical, and we keep track of the counts of observations in each category or combination of categories. In this section, we analyze cases where we have multiple quantitative variables.

Chapter 8: Correlation & Regression

In the simplest case, there are two quantitative variables. Examples include the following:

- heights of fathers and sons (this is a famous example from Galton, Darwin's cousin)
- ages of husbands and wives
- systolic versus diastolic pressure for a set of patients
- high school GPA and college GPA
- college GPA and GRE scores
- MCAT scores before and after a training course

In the past, we might have analyzed pre versus post data using a two-sample t-test to see whether there was a difference. It is also possible to try to quantify the relationship: instead of just asking whether the two sets of scores are different, or getting an interval for the average difference, we can also try to predict the new score based on the old score, and the amount of improvement might depend on the old score.

Chapter 8: Correlation & Regression

Here is some example data for husbands and wives. Heights are in mm.

[Table: Couple, HusbandAge, HusbandHeight, WifeAge, WifeHeight for couples 1-12; the individual values did not survive transcription.]

Correlation: Husband and wife ages

[Scatterplot] Correlation is 0.88.

Correlation: Husband and wife heights

[Scatterplot] Correlation is 0.18 with the outlier, but -0.54 without the outlier.

Correlation: scatterplot matrix

pairs(x[,2:5]) allows you to look at all the data simultaneously.

Correlation: scatterplot matrix

library(ggplot2)
library(GGally)
p1 <- ggpairs(x[,2:5])
print(p1)

Correlation: scatterplot matrix

[Figure: scatterplot matrix of the four variables]

Chapter 8: Correlation & Regression

For a data set like this, you might not expect age to be significantly correlated with height for either men or women (but you could check). You could also check whether differences in couples' ages are correlated with differences in their heights. The correlation between two variables is computed as follows:

cor(x$WifeAge, x$HusbandAge)

Note that the correlation is looking at something different than the t-test. A t-test for this data might look at whether the husbands and wives had the same average age. The correlation looks at whether younger wives tend to have younger husbands and older husbands tend to have older wives, whether or not there is a difference in the ages overall. Similarly for height. Even if husbands tend to be taller than wives, that doesn't necessarily mean that there is a relationship between the heights within couples.

Pairwise Correlations

The pairwise correlations for an entire dataset can be computed as follows. What would it mean to report the correlation between the Couple variable and the other variables? Here I only get the correlations for variables other than the ID variable.

options(digits = 4)  # done so that the output fits on the screen!
cor(x[,2:5])

              HusbandAge HusbandHeight  WifeAge WifeHeight
HusbandAge        1.0000      -0.24716  0.88003    -0.5741
HusbandHeight    -0.2472       1.00000  0.02124     0.1783
WifeAge           0.8800       0.02124  1.00000    -0.5370
WifeHeight       -0.5741       0.17834 -0.53699     1.0000

Chapter 8: Correlation & Regression

The correlation measures the linear relationship between variables X and Y as seen in a scatterplot. The sample correlation between X_1, ..., X_n and Y_1, ..., Y_n is denoted by r and has the following properties:

- −1 ≤ r ≤ 1
- if Y_i tends to increase linearly with X_i, then r > 0
- if Y_i tends to decrease linearly with X_i, then r < 0
- if there is a perfect linear relationship between X and Y, then r = 1 (points fall on a line with positive slope)
- if there is a perfect negative relationship between X and Y, then r = −1 (points fall on a line with negative slope)
- the closer the points (X_i, Y_i) are to a straight line, the closer r is to 1 or −1
- r is not affected by linear transformations (i.e., converting from inches to centimeters, Fahrenheit to Celsius, etc.)
- the correlation is symmetric: the correlation between X and Y is the same as the correlation between Y and X

Chapter 8: Correlation & Regression

For n observations on two variables, the sample correlation is calculated by

$$ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{S_{XY}}{S_X S_Y} $$

Here S_X and S_Y are the sample standard deviations and

$$ S_{XY} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

is the sample covariance. All the (n − 1) terms cancel out from the numerator and denominator when calculating r.
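As a quick numerical check, the formula can be computed directly in R and compared with cor(). This is a minimal sketch; it assumes the couples data frame x from these notes is loaded.

xv <- x$WifeAge; yv <- x$HusbandAge
# sample covariance and sample standard deviations
Sxy <- sum((xv - mean(xv)) * (yv - mean(yv))) / (length(xv) - 1)
r.by.hand <- Sxy / (sd(xv) * sd(yv))
r.by.hand
cor(xv, yv)   # should give the same value (about 0.88 for the ages)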

Chapter 8: Correlation & Regression

[Figure]

Chapter 8: Correlation & Regression

[Figure]

Chapter 8: Correlation & Regression

CIs and hypothesis tests can be done for correlations using cor.test(). The test is usually based on testing whether the population correlation ρ is equal to 0, so

H0: ρ = 0

and you can have either a two-sided or one-sided alternative hypothesis. We think of r as a sample estimate of ρ, the Greek letter for r. The test is based on a t-statistic which has the formula

$$ t_{obs} = r \sqrt{\frac{n - 2}{1 - r^2}} $$

and this is compared to a t distribution with n − 2 degrees of freedom. As usual, you can rely on R to do the test and get the CI.
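To make the formula concrete, here is a small sketch that computes the t statistic and two-sided p-value by hand; it assumes the couples data frame x is loaded, and the results can be compared against cor.test().

r <- cor(x$WifeAge, x$HusbandAge)
n <- nrow(x)
t.obs <- r * sqrt((n - 2) / (1 - r^2))       # t statistic for H0: rho = 0
p.value <- 2 * pt(-abs(t.obs), df = n - 2)   # two-sided p-value
c(t.obs, p.value)   # should match the t and p-value from cor.test()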

Correlation

The t distribution derivation of the p-value and CI assumes that the joint distribution of X and Y follows what is called a bivariate normal distribution. A necessary condition for this is that X and Y each individually have normal distributions, so you can do the usual tests or diagnostics for normality on each variable. Similar to the t-test, the correlation is sensitive to outliers. For the husband and wife data, the sample sizes are small, making it difficult to detect outliers. However, there is not clear evidence of non-normality.

Chapter 8: Correlation & Regression

[Figure]

Chapter 8: Correlation & Regression

Shapiro-Wilk tests for normality would all be not rejected, although the sample sizes are quite small for detecting deviations from normality:

> shapiro.test(x$HusbandAge)$p.value
[1] 0.8934
> shapiro.test(x$WifeAge)$p.value
[1] 0.2461
> shapiro.test(x$WifeHeight)$p.value
[1] 0.1304
> shapiro.test(x$HusbandHeight)$p.value
[1] 0.986

Correlation

Here we test whether ages are significantly correlated and also whether heights are positively correlated.

> cor.test(x$WifeAge, x$HusbandAge)

        Pearson's product-moment correlation

data:  x$WifeAge and x$HusbandAge
t = 5.9, df = 10, p-value = 2e-04
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6185 0.9660

Correlation

Here we test whether ages are significantly correlated and also whether heights are positively correlated.

> cor.test(x$WifeHeight, x$HusbandHeight)

        Pearson's product-moment correlation

data:  x$WifeHeight and x$HusbandHeight
t = 0.57, df = 10, p-value = 0.6
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4407 0.6824
sample estimates:
   cor
0.1783

Correlation

We might also test the heights with the bivariate outlier removed:

> cor.test(x$WifeHeight[x$WifeHeight > 1450], x$HusbandHeight[x$WifeHeight > 1450])

        Pearson's product-moment correlation

data:  x$WifeHeight[x$WifeHeight > 1450] and x$HusbandHeight[x$WifeHeight > 1450]
t = -1.9, df = 9, p-value = 0.1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8559 0.1078
sample estimates:
    cor
-0.5261

Correlation

Removing the outlier changes the direction of the correlation (from positive to negative). The result is still not significant at the α = 0.05 level, although the p-value is 0.1, suggesting slight evidence against the null hypothesis of no relationship between heights of husbands and wives. Note that the negative correlation here means that, with the one outlier couple removed, taller wives tended to be associated with shorter husbands and vice versa.

Correlation

A nonparametric approach for dealing with outliers or otherwise non-normal distributions for the variables being correlated is to rank the data within each sample and then compute the usual correlation on the ranked data. Note that in the Wilcoxon two-sample test, you pool the data first and then rank the data. For the Spearman correlation, you rank each group separately.

The idea is that large observations will have large ranks in both groups, so that if the data are correlated, large ranks will tend to get paired with large ranks, and small ranks will tend to get paired with small ranks. If the data are uncorrelated, then the ranks will be random with respect to each other.
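A small sketch of this idea in R (assuming the couples data frame x): ranking each variable and then taking the Pearson correlation of the ranks reproduces the Spearman correlation.

cor(rank(x$WifeHeight), rank(x$HusbandHeight))            # Pearson correlation of the ranks
cor(x$WifeHeight, x$HusbandHeight, method = "spearman")   # same value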

The Spearman correlation is implemented in cor.test() using the option method = "spearman". Note that the correlation is negative using the Spearman ranking even with the outlier, but the correlation was positive using the usual (Pearson) correlation. The Pearson correlation was negative when the outlier was removed. Since the results depended so much on the presence of a single observation, I would be more comfortable with the Spearman correlation for this example.

> cor.test(x$WifeHeight, x$HusbandHeight, method = "spearman")

        Spearman's rank correlation rho

data:  x$WifeHeight and x$HusbandHeight
S = 370, p-value = 0.3
alternative hypothesis: true rho is not equal to 0
sample estimates:
    rho
-0.3034

Chapter 8: Correlation & Regression

[Figure] A more extreme example of an outlier. Here the correlation changes from 0 to negative.

Correlation

Something to be careful of is that if you have many variables (which often occurs), then testing every pair of variables for a significant correlation leads to multiple comparison problems, for which you might want to use a Bonferroni correction, or limit yourself to only testing a small number of pairs of variables that are interesting a priori.
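As an illustration of one way to do this, the sketch below runs all pairwise correlation tests among the four quantitative variables and applies a Bonferroni adjustment with p.adjust(); it assumes the couples data frame x from these notes.

vars <- names(x)[2:5]
prs <- combn(vars, 2)   # all pairs of variable names
pvals <- apply(prs, 2, function(v) cor.test(x[[v[1]]], x[[v[2]]])$p.value)
names(pvals) <- apply(prs, 2, paste, collapse = ":")
p.adjust(pvals, method = "bonferroni")   # Bonferroni-adjusted p-values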

Regression

In regression, we try to make a model that predicts the average response of one quantitative variable given one or more predictor variables. We start with the case where there is one predictor variable, X, and one response, Y, which is called simple linear regression.

Unlike correlation, the model depends on which variable is the predictor and which is the response. While the correlation of x and y is the same as the correlation of y and x, the regression of y on x will generally lead to a different model than the regression of x on y. In the phrase "regressing y on x", we mean that y is the response and x is the predictor.

Regression

In the basic regression model, we assume that the average value of Y has a linear relationship to X, and we write

y = β0 + β1 x

Here β0 is the intercept and β1 is the slope of the line. This is similar to equations of lines from courses like College Algebra where you write

y = a + bx

or

y = mx + b

But we think of β0 and β1 as unknown parameters, similar to µ for the mean of a normal distribution. One possible goal of a regression analysis is to make good estimates of β0 and β1.

Regression

Review of lines, slopes, and intercepts. The slope is the number of units that y changes for a change of 1 unit in x. The intercept (or y-intercept) is where the line intersects the y-axis.

Regression

In real data, the points almost never fall exactly on a line, but there might be a line that describes the overall trend. (This is sometimes even called the trend line.) Given a set of points, which line through the points is "best"?

Regression

[Scatterplot] Husband and wife age example.

Regression

Husband and wife age example. Here we plot the line y = x. Note that 9 out of 12 points are above the line: for the majority of couples, the husband was older than the wife. The points seem a little shifted up compared to the line.

Now we've added the usual regression line in black. It has a smaller slope but a higher intercept. The lines seem to make similar predictions at higher ages, but the black line seems a better fit to the data for the lower ages. Although this doesn't always happen, exactly half of the points are above the black line.

Regression

It is a little difficult to tell visually which line is best. Here is a third line, which is based on regressing the wives' heights on the husbands' heights.

Regression

It is difficult to tell which line is "best" or even what is meant by a best line through the data. What to do?

One possible solution to the problem is to consider all possible lines of the form

y = β0 + β1 x

or here

Husband height = β0 + β1 (Wife height)

In other words, consider all possible choices of β0 and β1 and pick the one that minimizes some criterion. The most common criterion used is the least squares criterion: here you pick β0 and β1 that minimize

$$ \sum_{i=1}^n [y_i - (\beta_0 + \beta_1 x_i)]^2 $$

Regression

Graphically, this means minimizing the sum of squared vertical deviations from each point to the line.

Regression

Rather than testing all possible choices of β0 and β1, formulas are known for the optimal choices that minimize the sum of squares. We think of these optimal values as estimates of the true, unknown population parameters β0 and β1. We use b0 or β̂0 to mean an estimate of β0, and b1 or β̂1 to mean an estimate of β1:

$$ b_1 = \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = r \frac{S_Y}{S_X} $$

$$ b_0 = \hat{\beta}_0 = \bar{y} - b_1 \bar{x} $$

Here r is the Pearson (unranked) correlation coefficient, and S_X and S_Y are the sample standard deviations. Note that r is positive if, and only if, b1 is positive. Similarly, if one is negative the other is negative. In other words, r has the same sign as the slope of the regression line.
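To connect the formulas to the fitted model in R, here is a short sketch that computes b1 and b0 by hand and compares them with lm(); it assumes the couples data frame x is loaded.

xv <- x$WifeAge; yv <- x$HusbandAge
b1 <- sum((xv - mean(xv)) * (yv - mean(yv))) / sum((xv - mean(xv))^2)
b0 <- mean(yv) - b1 * mean(xv)
c(b0, b1)
coef(lm(yv ~ xv))              # same estimates
cor(xv, yv) * sd(yv) / sd(xv)  # b1 also equals r * (SY / SX)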

Regression

The equation for the regression line is

ŷ = b0 + b1 x

where x is any value (not just values that were observed), and b0 and b1 were defined on the previous slide. The notation ŷ is used to mean the predicted or average value of y for the given x value. You can think of it as meaning the best guess for y if a new observation will have the given x value.

A special thing to note about the regression line is that it necessarily passes through the point (x̄, ȳ).

Regression: scatterplot with least squares line

Options make the dots solid and a bit bigger.

plot(WifeAge, HusbandAge, xlim = c(20,60), ylim = c(20,60),
     xlab = "Wife Age", ylab = "Husband Age", pch = 16,
     cex = 1.5, cex.axis = 1.3, cex.lab = 1.3)
abline(model1, lwd = 3)

Regression: scatterplot with least squares line

You can always customize your plot by adding to it. For example you can add the point (x̄, ȳ). You can also add reference lines, points, annotations using text at your own specified coordinates, etc.

points(mean(WifeAge), mean(HusbandAge), pch = 17, col = "red")
text(40, 30, "r = 0.88", cex = 1.5)
text(25, 55, "Hi Mom!", cex = 2)
lines(c(20,60), c(40,40), lty = 2, lwd = 2)

The points statement adds a red triangle at the mean of both ages, which is the point (37.58, 39.83). If a single coordinate is specified in the points() function, it adds that point to the plot. To add a curve or line to a plot, you can use points() with x and y vectors (just like the original data). For lines(), you specify the beginning and ending x and y coordinates, and R fills in the line.

Regression

To fit a linear regression model in R, you can use the lm() command, which is similar to aov().

The following assumes you have the file couples.txt in the same directory as your R session:

x <- read.table("couples.txt", header = TRUE)
attach(x)
model1 <- lm(HusbandAge ~ WifeAge)
summary(model1)

Regression

Call:
lm(formula = HusbandAge ~ WifeAge)

Residuals:
    Min      1Q  Median      3Q     Max
-8.1066 -3.2607 -0.0125  3.4311  6.8934

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.4447     5.2350   1.995 0.073980 .
WifeAge       0.7820     0.1334   5.860 0.000159 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.197 on 10 degrees of freedom
Multiple R-squared: 0.7745,     Adjusted R-squared: 0.7519
F-statistic: 34.34 on 1 and 10 DF,  p-value: 0.0001595

Regression

The lm() command generates a table similar to the ANOVA table generated by aov().

To go through some elements in the table, it first gives the formula used to generate the output. This is useful when you have generated several models, say model1, model2, model3, ..., and you can't remember how you generated each model. For example, you might have one model with an outlier removed, another model with one of the variables on a log-transformed scale, etc.

The next couple of lines deal with residuals. Residuals are the differences between the observed and fitted values. That is,

$$ y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i) $$
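A quick way to see this relationship in R (assuming model1 and the attached couples data from above): the residuals stored in the model are exactly the observed values minus the fitted values.

x$HusbandAge - fitted(model1)   # observed minus fitted
residuals(model1)               # same values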

Regression

[Figure]

Regression

The next part gives a table similar to ANOVA. Here we get the estimates for the coefficients, b0 and b1, in the first quantitative column. We also get standard errors for these, corresponding t-values, and p-values. The p-values are based on testing the null hypotheses

H0: β0 = 0

and

H0: β1 = 0

The first null hypothesis says that the intercept is 0. For this problem, this is not very meaningful, as it would mean that the husband of a 0-yr-old woman would also be predicted to be 0 years old!

Often the intercept term is not very meaningful in the model. The second null hypothesis is that the slope is 0, which would mean that the wife's age increasing would not be associated with the husband's age increasing.

Regression

For this example, we get a significant result for the wife's age. This means that the wife's age has some statistically significant ability to predict the husband's age. The coefficients give the model

Mean Husband's Age = 10.4447 + 0.7820 (Wife's Age)

The low p-value for the wife's age suggests that the coefficient 0.7820 is statistically significantly different from 0. This means that the data show there is evidence that the wife's age is associated with the husband's age. The coefficient of 0.7820 means that for each year of increase in the wife's age, the mean husband's age is predicted to increase by 0.782 years.

As an example, based on this model, a 30-yr-old woman who was married would be predicted to have a husband who was

10.4447 + (0.782)(30) = 33.9

or about 34 years old. A 55-yr-old woman would be predicted to have a husband who was

10.4447 + (0.782)(55) = 53.5

Regression

The fitted values are found by plugging the observed x values (wife ages) into the regression equation. This gives the expected husband age for each wife. They are given automatically using

> model1$fitted.values
       1        2        3        4        5        6        7        8
44.06894 32.33956 33.90348 55.01637 51.10658 31.55760 51.10658 44.06894
       9       10       11       12
28.42976 29.99368 40.94111 35.46740
> x$WifeAge
 [1] 43 28 30 57 52 27 52 43 23 25 39 32

For example, if the wife is 43, the regression equation predicts 10.4447 + (0.782)(43) = 44.069 for the husband's age.

Regression

To see what is stored in model1, type

names(model1)
#  [1] "coefficients"  "residuals"     "effects"       "rank"
#  [5] "fitted.values" "assign"        "qr"            "df.residual"
#  [9] "xlevels"       "call"          "terms"         "model"

The residuals are also stored; they are the observed husband ages minus the fitted values:

$$ e_i = y_i - \hat{y}_i $$

Regression: ANOVA table

More details about the regression can be obtained using the anova() command on the model1 object:

> anova(model1)
Analysis of Variance Table

Response: HusbandAge
          Df Sum Sq Mean Sq F value    Pr(>F)
WifeAge    1 927.53  927.53  34.336 0.0001595 ***
Residuals 10 270.13   27.01

Here the sum of squared residuals, sum(model1$residuals^2), is 270.13.

Regression: ANOVA table

Other components from the table are (SS means sums of squares):

Residual SS = $\sum_{i=1}^n e_i^2$

Total SS = $\sum_{i=1}^n (y_i - \bar{y})^2$

Regression SS = $b_1 \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$ = Total SS − Residual SS

$R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = r^2$

The Mean Square values in the table are the SS values divided by the degrees of freedom. The degrees of freedom is n − 2 for the residuals and 1 for the single predictor. The F statistic is MSR/MSE (mean square for regression divided by mean square error), and the p-value can be based on the F statistic.
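These pieces can be pulled out of the fitted model directly; a short sketch (assumes model1 and the couples data frame x from above):

res.SS   <- sum(residuals(model1)^2)                     # Residual SS
total.SS <- sum((x$HusbandAge - mean(x$HusbandAge))^2)   # Total SS
reg.SS   <- total.SS - res.SS                            # Regression SS
c(reg.SS, res.SS, reg.SS / total.SS)   # about 927.5, 270.1, and R^2 = 0.77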

Regression

Note that R² = 1 occurs when the Regression SS is equal to the Total SS. This means that the Residual SS is 0, so all of the points fall on the line. In this case, r = ±1 and R² = 1.

On the other hand, R² = 0 means that the Total SS is equal to the Residual SS, so the Regression SS is 0. We can think of the Regression SS and Residual SS as partitioning the Total SS:

Total SS = Regression SS + Residual SS

or

$\frac{\text{Regression SS}}{\text{Total SS}} + \frac{\text{Residual SS}}{\text{Total SS}} = 1$

If a large proportion of the Total SS is from the Regression rather than from the Residuals, then R² is high. It is common to say that R² is a measure of how much variation is explained by the predictor variable(s). This phrase should be used cautiously because it doesn't refer to a causal explanation.

Regression

For the husband and wife age example, the R² value was 0.77. This means that 77% of the variation in husband ages was "explained by" variation in the wife ages. Since R² is just the correlation squared, regressing wife ages on husband ages would also result in R² = 0.77, and 77% of the variation in wife ages would be "explained by" variation in husband ages. Typically, you want the R² value to be high, since this means you can use one variable (or a set of variables) to predict another variable.

Regression

The least-squares line is mathematically well-defined and can be calculated without thinking about the data probabilistically. However, p-values and tests of significance assume the data follow a probabilistic model with some assumptions. Assumptions for regression include the following:

- each pair (x_i, y_i) is independent
- the expected value of y is a linear function of x: E(y) = β0 + β1 x, sometimes denoted µ_{Y|X}
- the variability of y is the same for each fixed value of x; this is sometimes denoted σ²_{Y|X}
- the distribution of y given x is normal with mean β0 + β1 x and variance σ²_{Y|X}
- in the model, x is not treated as random

Regression

Note that the assumption that the variance is the same regardless of x is similar to the assumption of equal variance in ANOVA.

Regression

Less formally, the assumptions, in their order of importance, are:

1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient.
2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors.
3. Independence of errors (i.e., residuals). This assumption depends on how the data were collected.
4. Equal variance of errors.
5. Normality of errors.

It is easy to focus on the last two (especially when teaching) because the first assumptions depend on the scientific context and are not possible to assess just by looking at the data in a spreadsheet (the later, error-related assumptions can at least be examined with residual diagnostics, as sketched below).
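R's built-in residual diagnostics give a quick visual check of the linearity, equal-variance, and normality assumptions; a minimal sketch, assuming model1 from the fit above:

par(mfrow = c(2, 2))
plot(model1)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
shapiro.test(residuals(model1))   # a formal normality check of the residuals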

Regression

To get back to the regression model, the parameters of the model are β0, β1, and σ² (which we might call σ²_{Y|X}, but it is the same for every x). Usually σ² is not directly of interest, but it is necessary to estimate it in order to do hypothesis tests and confidence intervals for the other parameters. σ²_{Y|X} is estimated by

$$ s_{Y|X}^2 = \text{Residual MS} = \frac{\text{Residual SS}}{\text{Residual df}} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2} $$

This formula is similar to the sample variance, but we subtract the predicted values for y instead of the mean of y, ȳ, and we divide by n − 2 instead of dividing by n − 1. We can think of this as two degrees of freedom being lost since β0 and β1 need to be estimated. Usually, the sample variance uses n − 1 in the denominator due to one degree of freedom being lost for estimating µ_Y with ȳ.
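As a numerical check, here is a short sketch computing s²_{Y|X} from the residuals and comparing its square root with the residual standard error reported by summary(model1) (assumes model1 and x from above):

n <- nrow(x)
s2 <- sum(residuals(model1)^2) / (n - 2)   # Residual SS / Residual df
sqrt(s2)                                   # about 5.197
summary(model1)$sigma                      # the same residual standard error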

Regression

Recall that there are observed residuals, which are observed minus fitted values, and unobserved residuals:

e_i = y_i − ŷ_i = y_i − (b_0 + b_1 x_i)

ε_i = y_i − E(y_i) = y_i − (β_0 + β_1 x_i)

The difference in meaning here is whether the estimated or the unknown regression coefficients are used. We can think of e_i as an estimate of ε_i.

Regression

Two ways of writing the regression model are

E(y_i) = β_0 + β_1 x_i

and

y_i = β_0 + β_1 x_i + ε_i

Regression

To get a confidence interval for β1, we can use

b_1 ± t_crit SE(b_1)

where

$$ SE(b_1) = \frac{s_{Y|X}}{\sqrt{\sum_i (x_i - \bar{x})^2}} $$

Here t_crit is based on the Residual df, which is n − 2.

To test the null hypothesis that β1 = β10 (i.e., a particular value for β1), you can use the test statistic

$$ t_{obs} = \frac{b_1 - \beta_{10}}{SE(b_1)} $$

and then compare to a critical value (or obtain a p-value) using n − 2 df (i.e., the Residual df).

Regression

The p-value in the R output is for testing H0: β1 = 0, which corresponds to a flat regression line. But the theory allows testing any particular slope. For the couples data, you might be interested in testing H0: β1 = 1, which would mean that for every year older that the wife is, the husband's age is expected to increase by 1 year.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.4447     5.2350   1.995 0.073980 .
WifeAge       0.7820     0.1334   5.860 0.000159 ***

To test H0: β1 = 1, t_obs = (0.782 − 1.0)/(0.1334) = −1.634. The critical value is qt(.975, 10) = 2.228. So comparing −1.634 to ±2.228 for a two-sided test, we see that the observed test statistic is not as extreme as the critical value, so we cannot conclude that the slope is significantly different from 1. For a p-value, we can use pt(-1.634, 10)*2 = 0.133.

Regression

Instead of getting values such as the SE by hand from the R output, you can also save the output to a variable and extract the values. This reduces roundoff error and makes it easier to repeat the analysis in case the data change. For example, from the previous example, we could use

model1.values <- summary(model1)
b1 <- model1.values$coefficients[2,1]
b1
#[1] 0.781959
se.b1 <- model1.values$coefficients[2,2]
t <- (b1-1)/se.b1
t
#[1] -1.633918

The object model1.values$coefficients here is a matrix object, so the values can be obtained from the rows and columns.

Regression

For the CI for this example, we have

df <- model1.values$fstatistic[3]   # this is hard to find
t.crit <- qt(1 - 0.05/2, df)
CI.lower <- b1 - t.crit * se.b1
CI.upper <- b1 + t.crit * se.b1
print(c(CI.lower, CI.upper))
#[1] 0.4846212 1.0792968

Consistent with the hypothesis test, the CI includes 1.0, suggesting we can't be confident that the ages of husbands increase at a different rate from the ages of their wives.
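As a side note, the same interval for the slope can be obtained in one step with confint(), which uses the same t-based formula; a quick check, assuming model1 from above:

confint(model1, "WifeAge", level = 0.95)
# should reproduce the interval (0.485, 1.079) computed by hand above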

Regression

As mentioned earlier, the R output tests H0: β1 = 0, so you need to extract information to do a different test for the slope. We showed using a t-test for testing this null hypothesis, but it is also equivalent to an F test. Here the F statistic is t²_obs when there is only 1 numerator degree of freedom (one predictor in the regression).

t <- (b1-0)/se.b1
t
#[1] 5.859709
t^2
#[1] 34.33619

which matches the F statistic from the earlier output.

In addition, the p-value matches that for the correlation using cor.test(). Generally, the correlation will be significant if and only if the slope is significantly different from 0.

Regression

Another common application of confidence intervals in regression is for the regression line itself. This means getting a confidence interval for the expected value of y for each value of x.

Here the CI for the mean of y given x is

$$ b_0 + b_1 x \pm t_{crit} \, s_{Y|X} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} $$

where the critical value is based on n − 2 degrees of freedom.

Regression

In addition to a confidence interval for the mean, you can make prediction intervals for a new observation. This gives a plausible interval for a new observation. Here there are two sources of uncertainty: uncertainty about the mean, and uncertainty about how much an individual observation deviates from the mean. As a result, the prediction interval is wider than the CI for the mean.

The prediction interval for y given x is

$$ b_0 + b_1 x \pm t_{crit} \, s_{Y|X} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} $$

Regression

For a particular wife age, such as 40, the CIs and PIs (prediction intervals) are obtained in R by

predict(model1, data.frame(WifeAge = 40), interval = "confidence", level = .95)
#        fit      lwr      upr
# 1 41.72307 38.30368 45.14245
predict(model1, data.frame(WifeAge = 40), interval = "prediction", level = .95)
#        fit     lwr      upr
# 1 41.72307 29.6482 53.79794

Here the predicted husband's age for a 40-yr-old wife is 41.7 years. A CI for the mean husband's age is (38.3, 45.1), but the prediction interval says that 95% of husbands of wives this age would be between 29.6 and 53.8 years old.
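The same predict() call over a grid of ages can be used to draw the confidence and prediction bands around the fitted line; a sketch, assuming model1 and the couples data frame x from above:

new.ages <- data.frame(WifeAge = seq(20, 60, by = 1))
ci <- predict(model1, new.ages, interval = "confidence")
pi <- predict(model1, new.ages, interval = "prediction")
plot(x$WifeAge, x$HusbandAge, xlab = "Wife Age", ylab = "Husband Age", pch = 16)
lines(new.ages$WifeAge, ci[, "fit"], lwd = 2)   # fitted line
lines(new.ages$WifeAge, ci[, "lwr"], lty = 2)   # 95% CI band
lines(new.ages$WifeAge, ci[, "upr"], lty = 2)
lines(new.ages$WifeAge, pi[, "lwr"], lty = 3)   # 95% PI band
lines(new.ages$WifeAge, pi[, "upr"], lty = 3)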

