14. Correlation and Regression
14.6 r² and the Standard Error of the Estimate of y′
Consider the deviations $y_i - \bar{y}$ (total), $y_i - y_i'$ (of the data from the fitted line), and $y_i' - \bar{y}$ (of the fitted line from the mean):

Looking at the picture we see that

$$y_i - \bar{y} = (y_i - y_i') + (y_i' - \bar{y}).$$
Remember that variance is the sum of the squared deviations (divided by degrees of freedom), so squaring the above and summing gives:

$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - y_i')^2 + \sum_i (y_i' - \bar{y})^2$$

(the cross terms all cancel because $y'$ is the least squares solution, which satisfies $\sum_i (y_i - y_i') = 0$ and $\sum_i (y_i - y_i')x_i = 0$; see Section 14.6.1, below, for details). This is also a sum of squares statement:

$$SS_T = SS_E + SS_R,$$

where $SS_E$, $SS_T$ and $SS_R$ are the error, total, and regression (explained) sums of squares, respectively.
Dividing by the degrees of freedom, which is $n-2$ in this *bivariate* situation, we get:

$$\frac{SS_T}{n-2} = \frac{SS_E}{n-2} + \frac{SS_R}{n-2}.$$

It turns out that

$$r^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}.$$

The quantity $r^2$ is called the coefficient of determination and gives the fraction of variance explained by the model (here the model is the equation of a line). This quantity, the fraction of variance explained, appears with many statistical models. For example, with ANOVA it turns out that the "effect size" eta-squared is the fraction of variance explained by the ANOVA model[1], $\eta^2 = SS_{\text{between}}/SS_{\text{total}}$.
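To make the decomposition concrete, here is a short Python sketch (not from the text, using made-up numbers) that fits a least squares line, forms $SS_T$, $SS_E$ and $SS_R$, and checks that $SS_T = SS_E + SS_R$ and that $SS_R/SS_T$ equals the square of the Pearson correlation:

```python
import math

# made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = Sxy / Sxx                       # least squares slope
a = ybar - b * xbar                 # least squares intercept
y_pred = [a + b * xi for xi in x]   # the y' values

SS_T = Syy                                                # total sum of squares
SS_E = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))   # error (unexplained)
SS_R = sum((yp - ybar) ** 2 for yp in y_pred)             # regression (explained)

print(SS_T, SS_E + SS_R)            # equal up to rounding: SS_T = SS_E + SS_R

r = Sxy / math.sqrt(Sxx * Syy)      # Pearson correlation
print(SS_R / SS_T, r ** 2)          # coefficient of determination equals r^2
```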
The standard error of the estimate, $s_{est}$, is the standard deviation of the noise (the square root of the unexplained variance) and is given by

$$s_{est} = \sqrt{\frac{SS_E}{n-2}} = \sqrt{\frac{\sum_i (y_i - y_i')^2}{n-2}}.$$
Example 14.4: Continuing with the data of Example 14.3, we had
so
▢
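Since the numbers from Example 14.3 are not reproduced here, the following Python sketch computes $s_{est}$ for made-up data, just to show the arithmetic of $s_{est} = \sqrt{SS_E/(n-2)}$:

```python
import math

# made-up data (not Example 14.3), for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

# sum of squared residuals about the fitted line, then the standard error of the estimate
SS_E = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_est = math.sqrt(SS_E / (n - 2))   # standard deviation of the noise
print(s_est)
```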
Here is a graphical interpretation of $s_{est}$:
The assumption for computing confidence intervals for $y'$ is that the variance of $y$ about the line is independent of $x$. This is the assumption of homoscedasticity. You can think of the regression situation as a generalized one-way ANOVA where, instead of having a finite number of discrete populations for the IV, we have an infinite number of (continuous) populations. All the populations have the same variance (and they are assumed to be normal), and $s_{est}^2$ is the pooled estimate of that variance.
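One way to see this "infinitely many populations, one common variance" picture is by simulation. The sketch below (made-up parameters, not from the text) draws $y$ values whose mean depends on $x$ but whose spread does not, and checks that $s_{est}$ comes out close to the common population standard deviation:

```python
import math
import random

random.seed(0)
a_true, b_true, sigma = 1.0, 2.0, 0.5          # made-up population parameters

# at every x, y is normal with mean a_true + b_true*x and the SAME sd sigma
x = [random.uniform(0, 10) for _ in range(2000)]
y = [a_true + b_true * xi + random.gauss(0, sigma) for xi in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

SS_E = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_est = math.sqrt(SS_E / (n - 2))
print(s_est, sigma)   # s_est should be close to the common population sd
```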
14.6.1 Details: from deviations to variances
Squaring both sides of

$$y_i - \bar{y} = (y_i - y_i') + (y_i' - \bar{y})$$

and summing gives

$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - y_i')^2 + \sum_i (y_i' - \bar{y})^2 + 2\sum_i (y_i - y_i')(y_i' - \bar{y}).$$

Working on that cross term, using $y_i' = a + b x_i$, we get

$$\sum_i (y_i - y_i')(y_i' - \bar{y}) = (a - \bar{y})\sum_i (y_i - y_i') + b \sum_i (y_i - y_i') x_i = 0,$$

where the fact that the least squares solution satisfies

$$\sum_i (y_i - y_i') = 0 \quad \text{and} \quad \sum_i (y_i - y_i') x_i = 0$$

was used in the last line.
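A quick numeric sanity check of this cancellation (again with made-up numbers): for the least squares fit, the residuals sum to zero and are orthogonal to $x$, so the cross term vanishes up to rounding error:

```python
# made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]   # y_i - y'_i

print(sum(resid))                                     # ~ 0 (up to rounding)
print(sum(r * xi for r, xi in zip(resid, x)))         # ~ 0 (up to rounding)

cross = sum(r * ((a + b * xi) - ybar) for r, xi in zip(resid, x))
print(cross)                                          # the cross term itself, ~ 0
```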
[1] In ANOVA the "model" is the difference of means between the groups. We will see more about this aspect of ANOVA in Chapter 17.