14. Correlation and Regression

# 14.6 r² and the Standard Error of the Estimate of y′

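Consider the deviations for a single data point: the total deviation $y - \bar{y}$ of the observed value from the mean, the explained deviation $y' - \bar{y}$ of the predicted value from the mean, and the unexplained deviation $y - y'$ of the observed value from the regression line. These satisfy

$$(y - \bar{y}) = (y' - \bar{y}) + (y - y')$$

Remember that variance is the sum of the squared deviations (divided by degrees of freedom), so squaring the above and summing gives:

$$\sum (y - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y - y')^2$$

(the cross terms all cancel because $y'$ is the least squares solution, so that $\sum (y - y') = 0$ and $\sum x (y - y') = 0$; see Section 14.6.1, below, for details). This is also a sum of squares statement:

$$SS_T = SS_R + SS_E$$

where $SS_E$, $SS_T$ and $SS_R$ are the error sum of squares, the total sum of squares and the regression (explained) sum of squares, respectively.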
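The sum of squares statement, total = regression + error, can be checked numerically. The following is a minimal sketch, assuming a small made-up data set and hypothetical helper names (`fit_line`, `sums_of_squares`); none of these come from the text.

```python
# Hypothetical data, chosen only for illustration (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def fit_line(x, y):
    """Return intercept a and slope b of the least squares line y' = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def sums_of_squares(x, y):
    """Return (SS_T, SS_R, SS_E) for the least squares line through the data."""
    a, b = fit_line(x, y)
    y_bar = sum(y) / len(y)
    y_pred = [a + b * xi for xi in x]
    ss_t = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ss_r = sum((yp - y_bar) ** 2 for yp in y_pred)            # regression (explained)
    ss_e = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))   # error (unexplained)
    return ss_t, ss_r, ss_e

ss_t, ss_r, ss_e = sums_of_squares(x, y)
print(ss_t, ss_r + ss_e)  # the two numbers agree up to floating point rounding
```

The identity only holds when $y'$ comes from the least squares line; for any other line through the data the cross term does not vanish and the two printed numbers differ.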

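Dividing by the degrees of freedom, which is $n - 2$ in this *bivariate* situation, we get:

$$\frac{\sum (y - \bar{y})^2}{n - 2} = \frac{\sum (y' - \bar{y})^2}{n - 2} + \frac{\sum (y - y')^2}{n - 2}$$

that is, total variance = explained variance + unexplained variance. It turns out that

$$r^2 = \frac{\text{explained variance}}{\text{total variance}} = \frac{SS_R}{SS_T}$$

The quantity $r^2$ is called the coefficient of determination and gives the fraction of variance explained by the model (here the model is the equation of a line). The quantity $r^2$ appears with many statistical models. For example, with ANOVA it turns out that the “effect size” eta-squared, $\eta^2$, is the fraction of variance explained by the ANOVA model¹.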
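To see that the fraction of explained variance matches the square of the correlation coefficient, both can be computed for the same data. This is a sketch under stated assumptions: the data are made up, and the function names (`centered_sums`, `r_squared_from_correlation`, `r_squared_from_sums_of_squares`) are hypothetical.

```python
import math

# Hypothetical data, chosen only for illustration (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def centered_sums(x, y):
    """Return (S_xx, S_yy, S_xy): sums of squares and products about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy

def r_squared_from_correlation(x, y):
    """Square of the Pearson correlation coefficient r."""
    s_xx, s_yy, s_xy = centered_sums(x, y)
    r = s_xy / math.sqrt(s_xx * s_yy)
    return r ** 2

def r_squared_from_sums_of_squares(x, y):
    """Fraction of variance explained by the least squares line: SS_R / SS_T."""
    s_xx, s_yy, s_xy = centered_sums(x, y)
    n = len(x)
    y_bar = sum(y) / n
    b = s_xy / s_xx                    # slope
    a = y_bar - b * sum(x) / n         # intercept
    ss_r = sum((a + b * xi - y_bar) ** 2 for xi in x)
    ss_t = s_yy                        # total sum of squares
    return ss_r / ss_t

print(r_squared_from_correlation(x, y), r_squared_from_sums_of_squares(x, y))
```

Because the degrees of freedom are the same in the numerator and denominator, they cancel, so the ratio of variances equals the ratio of sums of squares.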

The standard error of the estimate is the standard deviation of the noise (the square root of the unexplained variance) and is given by

$$s_{\text{est}} = \sqrt{\frac{\sum (y - y')^2}{n - 2}}$$

Example 14.4: Continuing with the data of Example 14.3, $s_{\text{est}}$ follows by substituting the quantities computed there into this formula.

[Figure: graphical interpretation of $s_{\text{est}}$]

The assumption for computing confidence intervals for $y'$ is that $s_{\text{est}}$ is independent of $x$. This is the assumption of homoscedasticity. You can think of the regression situation as a generalized one-way ANOVA where, instead of having a finite number of discrete populations for the IV, we have an infinite number of (continuous) populations. All the populations have the same variance (and they are assumed to be normal), and $s_{\text{est}}^2$ is the pooled estimate of that variance.
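As an illustration of the standard error of the estimate, the sketch below computes the square root of the unexplained sum of squares divided by $n - 2$. The data are made up (they are not the data of Example 14.3) and the function name `standard_error_of_estimate` is hypothetical.

```python
import math

# Hypothetical data, chosen only for illustration (not the data of Example 14.3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def standard_error_of_estimate(x, y):
    """Return sqrt( sum (y - y')^2 / (n - 2) ) for the least squares line."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 points so that n - 2 > 0")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    ss_e = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ss_e / (n - 2))

s_est = standard_error_of_estimate(x, y)
print(round(s_est, 4))  # approximately 0.8944 for these made-up data
```

The divisor is $n - 2$ rather than $n - 1$ because two quantities, the slope and the intercept, are estimated from the data before the residuals are formed.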

# 14.6.1: **Details: from deviations to variances

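Squaring both sides of

$$(y - \bar{y}) = (y' - \bar{y}) + (y - y')$$

and summing gives

$$\sum (y - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y - y')^2 + 2 \sum (y' - \bar{y})(y - y')$$

Working on that cross term, using $y' = a + bx$ (with intercept $a$ and slope $b$), we get

$$\begin{aligned}
2 \sum (y' - \bar{y})(y - y') &= 2 \sum (a + bx - \bar{y})(y - y') \\
&= 2 (a - \bar{y}) \sum (y - y') + 2 b \sum x (y - y') \\
&= 0
\end{aligned}$$

where the least squares conditions $\sum (y - y') = 0$ and $\sum x (y - y') = 0$ were used in the last line.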
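The vanishing of the cross term $2 \sum (y' - \bar{y})(y - y')$ can also be checked numerically. The sketch below, which assumes made-up data and a hypothetical function name `cross_term_pieces`, evaluates the two least squares conditions $\sum (y - y')$ and $\sum x (y - y')$ together with the cross term itself.

```python
# Hypothetical data, chosen only for illustration (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def cross_term_pieces(x, y):
    """Return (sum of residuals, sum of x*residuals, cross term) for the
    least squares line y' = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sum_res = sum(residuals)
    sum_x_res = sum(xi * ri for xi, ri in zip(x, residuals))
    cross = 2 * sum((a + b * xi - y_bar) * ri for xi, ri in zip(x, residuals))
    return sum_res, sum_x_res, cross

sum_res, sum_x_res, cross = cross_term_pieces(x, y)
print(sum_res, sum_x_res, cross)  # all three are zero up to floating point rounding
```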

1. In ANOVA the “model” is the difference of means between the groups. We will see more about this aspect of ANOVA in Chapter 17.