14. Correlation and Regression

14.10 Multiple Regression

Multiple regression is to the simple linear regression we just covered as m-way ANOVA is to one-way ANOVA. In m-way ANOVA we have one DV and m discrete IVs. With multiple regression we have one DV (univariate) and k continuous IVs. We will label the DV with y and the IVs with x_{1}, x_{2}, \ldots, x_{k}. The idea is to predict y with y^{\prime} via

    \[y^{\prime} = a + b_{1} x_{1} + b_{2} x_{2} + \cdots + b_{k} x_{k}\]

or, using summation notation

    \[y^{\prime} = a +\sum_{j=1}^{k} b_{j} x_{j}\]

Sometimes we (and SPSS) write a = b_{0}. The explicit formulas for the coefficients a and b_{j} are long, so we won't give them here; instead, we will rely on SPSS to compute the coefficients for us. Just the same, we should remember that the coefficients are computed using the least squares method, where the sum of the squared deviations is minimized. That is, a and the b_{j} are such that

    \begin{eqnarray*} E & = & \sum_{i=1}^{n} (y_{i} - y^{\prime}_{i})^{2} \\ & = & \sum_{i=1}^{n} (y_{i} - [a + \sum_{j=1}^{k} b_{j} x_{ji}])^{2} \end{eqnarray*}

is minimized. (Here we are using (y_{i}, x_{1i}, x_{2i}, \ldots, x_{ki}) to represent data point i.) If you like calculus and have a few minutes to spare, the equations for a and the b_{j} can be found by solving:

    \[ \frac{\partial E}{\partial a} = 0, \;\;\; \frac{\partial E}{\partial b_{1}} = 0, \;\;\; \cdots \;\;\; \frac{\partial E}{\partial b_{k}} = 0 \]

for a and the b_{j}. The result will contain all the familiar terms like \sum y, \sum y x_{j}, etc. It also turns out that the “normal equations” for a and the b_{j} that result have a pattern that can be captured with a simple linear algebra equation that we will see in Chapter 17.
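To see the least squares machinery in action, here is a minimal sketch in Python (assuming NumPy is available; the data are the five students from Example 14.6 below). The design matrix gets a leading column of ones so that the intercept a = b_{0} comes out as the first coefficient:

```python
import numpy as np

# Data from Example 14.6: each row is (x1 = GPA, x2 = Age, y = Score).
data = np.array([
    [3.2, 22, 550],
    [2.7, 27, 570],
    [2.5, 24, 525],
    [3.4, 28, 670],
    [2.2, 23, 490],
])

# Design matrix with a leading column of ones so that the first
# coefficient returned is the intercept a (= b0).
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
y = data[:, 2]

# lstsq minimizes E = sum_i (y_i - y'_i)^2, i.e. it solves the
# normal equations for a, b1 and b2 in one call.
(a, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
```

The coefficients agree, up to small rounding differences in the hand-worked sums, with the SPSS regression equation quoted in Example 14.6.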

Some terminology: the b_{j} (including b_{0}) are known as partial regression coefficients.

14.10.1: Multiple correlation coefficient, r

An overall correlation coefficient, r, can be computed using pairwise bivariate correlation coefficients as defined back in Section 14.2. This overall correlation is defined as r = r_{y^{\prime} y}, the bivariate correlation coefficient of the predicted values y^{\prime} versus the data y. For the case of two IVs, the formula is

    \[ r = \sqrt{\frac{r_{y x_{1}}^{2} + r^{2}_{y x_{2}} - 2 r_{y x_{1}} r_{y x_{2}} r_{x_{1} x_{2}} }{1 - r^{2}_{x_{1} x_{2}}}} \]

where r_{y x_{1}} is the bivariate correlation coefficient between y and x_{1}, etc. Note that, unlike the bivariate r, the multiple correlation coefficient satisfies 0 \leq r \leq 1: as the correlation between y^{\prime} and y it can never be negative.
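For instance, plugging the pairwise correlations worked out in Example 14.6 below into this two-IV formula (a sketch using only the Python standard library):

```python
import math

# Pairwise bivariate correlations, as computed in Example 14.6:
# r_y1 = r(y, x1), r_y2 = r(y, x2), r_12 = r(x1, x2).
r_y1, r_y2, r_12 = 0.845, 0.791, 0.371

# Multiple correlation coefficient for the two-IV case.
r = math.sqrt((r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12)
              / (1 - r_12**2))
```

This returns r ≈ 0.989, matching the worked example.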

Example 14.6 : Suppose that you have used SPSS to obtain the regression equation

    \[ y^{\prime} = -44.572 + 87.679 x_{1} + 14.519 x_{2} \]

for the following data:

Student | GPA, x_{1} | Age, x_{2} | Score, y | x^{2}_{1} | x^{2}_{2} | y^{2}  | x_{1}y | x_{2}y | x_{1}x_{2}
A       | 3.2        | 22         | 550      | 10.24     | 484       | 302500 | 1760   | 12100  | 70.4
B       | 2.7        | 27         | 570      | 7.29      | 729       | 324900 | 1539   | 15390  | 72.9
C       | 2.5        | 24         | 525      | 6.25      | 576       | 275625 | 1312.5 | 12600  | 60
D       | 3.4        | 28         | 670      | 11.56     | 784       | 448900 | 2278   | 18760  | 95.2
E       | 2.2        | 23         | 490      | 4.84      | 529       | 240100 | 1078   | 11270  | 50.6
n=5     | \sum x_{1} = 14 | \sum x_{2} = 124 | \sum y = 2805 | \sum x^{2}_{1} = 40.18 | \sum x^{2}_{2} = 3102 | \sum y^{2} = 1592025 | \sum x_{1}y = 7967.5 | \sum x_{2}y = 70120 | \sum x_{1}x_{2} = 349.1

Compute the multiple correlation coefficient.

Solution :

First we need to compute the pairwise correlations r_{x_{1}y}, r_{x_{2}y}, and r_{x_{1}x_{2}}. (Note that r_{x_{1}y} = r_{yx_{1}}, etc. because the correlation matrix is symmetric.)

    \begin{eqnarray*} r_{x_{1}y} & = & \frac{n(\sum x_{1} y) - (\sum x_{1}) (\sum y)}{\sqrt{[n (\sum x_{1}^{2}) - (\sum x_{1})^{2}] [n (\sum y^{2}) - (\sum y)^{2}]}}\\ & = & \frac{5(7967.5) - (14)(2805)}{\sqrt{[5 (40.18) - (14)^{2}] [5 (1592025) - (2805)^{2}]}}\\ & = & 0.845 \end{eqnarray*}

    \begin{eqnarray*} r_{x_{2}y} & = & \frac{n(\sum x_{2} y) - (\sum x_{2}) (\sum y)}{\sqrt{[n (\sum x_{2}^{2}) - (\sum x_{2})^{2}] [n (\sum y^{2}) - (\sum y)^{2}]}}\\ & = & \frac{5(70120) - (124)(2805)}{\sqrt{[5 (3102) - (124)^{2}] [5 (1592025) - (2805)^{2}]}}\\ & = & 0.791 \end{eqnarray*}

    \begin{eqnarray*} r_{x_{1}x_{2}} & = & \frac{n(\sum x_{1} x_{2}) - (\sum x_{1}) (\sum x_{2})}{\sqrt{[n (\sum x_{1}^{2}) - (\sum x_{1})^{2}] [n (\sum x_{2}^{2}) - (\sum x_{2})^{2}]}}\\ & = & \frac{5(349.1) - (14)(124)}{\sqrt{[5 (40.18) - (14)^{2}] [5 (3102) - (124)^{2}]}}\\ & = & 0.371 \end{eqnarray*}

Now use these in the two-IV formula for r:

    \begin{eqnarray*} r & = & \sqrt{\frac{r_{y x_{1}}^{2} + r^{2}_{y x_{2}} - 2 r_{y x_{1}} r_{y x_{2}} r_{x_{1} x_{2}} }{1 - r^{2}_{x_{1} x_{2}}}}\\ & = & \sqrt{\frac{(0.845)^{2} + (0.791)^{2} - (2)(0.845)(0.791)(0.371) }{1 -(0.371)^{2}}} \\ & = & 0.989 \end{eqnarray*}

14.10.2: Significance of r

Here we want to test the hypotheses :

    \begin{eqnarray*} H_{0} &:& \rho = 0 \\ H_{1} &:& \rho \neq 0 \end{eqnarray*}

where \rho is the population multiple regression correlation coefficient.

To test the hypothesis we use

    \[ F_{\mbox{test}} = \frac{r^{2} / k}{(1-r^{2})/(n-k-1)} \]

with

    \[ \nu_{1} = k \mbox{\ \ \ (d.f.N.)\ \ \ \ and\ \ \ \ } \nu_{2} = n-k-1 \mbox{\ \ \ (d.f.D.)} \]

where:

    \begin{eqnarray*} n & = & \mbox{ sample size} \\ k & = & \mbox{ number of IVs} \\ r & = & \mbox{ multiple correlation coefficient} \end{eqnarray*}

(Note: Up to rounding, this “F-test” is the same as the “ANOVA” F that SPSS reports when you run a regression.)
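As a sketch of the whole test in Python (assuming SciPy is available; r, n and k are from Example 14.6, and \nu_{1} = k is taken as the numerator degrees of freedom for this statistic):

```python
from scipy.stats import f

# Values from Example 14.6: sample size, number of IVs and
# multiple correlation coefficient.
n, k, r = 5, 2, 0.989

# F statistic for the significance of the multiple correlation.
F_test = (r**2 / k) / ((1 - r**2) / (n - k - 1))

# Critical value at alpha = 0.05 with (k, n - k - 1) degrees of freedom.
F_crit = f.ppf(1 - 0.05, k, n - k - 1)

# Reject H0 when the test statistic exceeds the critical value.
reject = F_test > F_crit
```

Carrying full precision gives F_{test} ≈ 44.7; the hand calculation in Example 14.7 rounds r^{2} to 0.978 first and lands on 44.45. Either way the statistic far exceeds the critical value, so H_{0} is rejected.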

Example 14.7 : Continuing with Example 14.6, test the significance of r.

Solution :

1. Hypotheses.

    \begin{eqnarray*} H_{0} &:& \rho = 0 \\ H_{1} &:& \rho \neq 0 \end{eqnarray*}

2. Critical statistic. From the F Distribution Critical Values Table with

    \begin{eqnarray*} \nu_{1} & = & k = 2 \\ \nu_{2} & = & n - k - 1 = 5 - 2 - 1 = 2 \\ \alpha & = & 0.05 \end{eqnarray*}

find

    \[ F_{\mbox{crit}} = 19.00 \]

3. Test statistic.

    \begin{eqnarray*} F_{\mbox{test}} & = & \frac{r^{2} / k}{(1-r^{2})/(n-k-1)} \\ & = & \frac{(0.989)^{2} / 2}{(1-(0.989)^{2})/(5-2-1)} \\ & = & 44.45 \end{eqnarray*}

4. Decision.

Reject H_{0}.

5. Interpretation.

r = 0.989 is significant.

14.10.3: Other descriptions of correlation

  1. Coefficient of multiple determination: r^{2}. This quantity still has the interpretation as fraction of variance explained by the (multiple regression) model.
  2. Adjusted r^{2}:

        \[ r^{2}_{\mbox{adj}} = 1 - \left[ \frac{(1-r^{2})(n-1)}{n-k-1} \right] \]

    r^{2}_{\mbox{adj}} gives a less biased estimate of the population value \rho^{2} by correcting for degrees of freedom, just as the sample variance s^{2}, with its degrees of freedom equal to n-1, gives an unbiased estimate of the population \sigma^{2}.

Example 14.8 : Continuing Example 14.6, we had r = 0.989 so

    \[ r^{2} = 0.978 \]

and

    \begin{eqnarray*} r^{2}_{\mbox{adj}} & = & 1 - \left[ \frac{(1-r^{2})(n-1)}{n-k-1} \right] \\ & = & 1 - \left[ \frac{(1-0.978)(5-1)}{5-2-1} \right] \\ & = & 0.956 \end{eqnarray*}
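These two quantities are easy to check numerically (plain Python, with the values from Example 14.6):

```python
# Values from Example 14.6: sample size, number of IVs and
# multiple correlation coefficient.
n, k, r = 5, 2, 0.989

# Coefficient of multiple determination.
r_sq = r**2

# Adjusted r^2, correcting for degrees of freedom.
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
```

This reproduces r^{2} ≈ 0.978 and r^{2}_{adj} ≈ 0.956 as above.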