17.2 The General Linear Model (GLM) for Univariate Statistics

Gordon E. Sarty

17. Overview of the General Linear Model

In abstract form, the GLM is

$\vec{y} = [X] \vec{\beta} + \vec{\epsilon}$

where

$\vec{y}$ is the data vector, an $n$ dimensional column vector.
$[X]$ is the design matrix, an $n \times p$ matrix whose form differs from one type of test to another.
$\vec{\beta}$ is the parameter vector, a lower $p$ -dimensional vector that summarizes the data in terms of the model given by the design matrix.
$\vec{\epsilon}$ is the error vector, the $n$ dimensional column vector of deviations or differences between the model predictions and the data in $\vec{y}$ .

The solution for $\vec{\beta}$ is the least squares solution

$\vec{\beta} = ([X]^{T}[X])^{-1} [X]^{T} \vec{y}$

In terms of the linear algebra that we just reviewed, $[X]^{\dagger} = ([X]^{T}[X])^{-1} [X]^{T}$ (known as the pseudo-inverse) transforms the data vector $\vec{y}$ in data space ( $\mathbb{R}^{n}$ ) to a vector $\vec{\beta}$ in parameter space ( $\mathbb{R}^{p}$ ) that presumably explains the data.
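As a minimal sketch of this mapping, the NumPy code below builds the pseudo-inverse from the formula and applies it to a data vector. The small design matrix and data are hypothetical values chosen for illustration (they are not from the text), and the comparison with `np.linalg.pinv` assumes that $[X]$ has full column rank.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 2 parameters.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # lies exactly on y = 1 + 2x, so no error

# Pseudo-inverse from the formula (X^T X)^{-1} X^T; its shape is p x n.
X_dagger = np.linalg.inv(X.T @ X) @ X.T

# Map the data vector in R^n to the parameter vector in R^p.
beta = X_dagger @ y

# NumPy's built-in pseudo-inverse agrees when X has full column rank.
assert np.allclose(X_dagger, np.linalg.pinv(X))

print(X_dagger.shape)  # (2, 4)
print(beta)            # approximately [1, 2]
```

In practice, forming the inverse explicitly is usually avoided in favour of a solver such as `np.linalg.lstsq`, but the explicit form mirrors the formula above.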

17.2.1 Linear Regression in GLM Format

We can express the linear regression model $y = a + bx$ in GLM format as

$\left[ \begin{array}{c} y_{1} \\ y_{2} \\ y_{3} \\ \vdots \\ y_{n} \end{array} \right] = \left[ \begin{array}{cc} 1 & x_{1} \\ 1 & x_{2} \\ 1 & x_{3} \\ \vdots \\ 1 & x_{n} \end{array} \right] \left[ \begin{array}{c} a \\ b \end{array} \right] + \left[ \begin{array}{c} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \vdots \\ \epsilon_{n} \end{array} \right]$

Note, importantly, that the design matrix is

$[X] = \left[ \begin{array}{cc} 1 & x_{1} \\ 1 & x_{2} \\ 1 & x_{3} \\ \vdots \\ 1 & x_{n} \end{array} \right]$

…where the second column is composed of the IV values, $x_{i}$ . This is typical for the GLM: the DV is represented by the data vector and the IV is represented by the design matrix. If we do the matrix multiplication, the model is:

$\left[ \begin{array}{c} y_{1} \\ y_{2} \\ y_{3} \\ \vdots \\ y_{n} \end{array} \right] = \left[ \begin{array}{c} a + bx_{1} \\ a + bx_{2} \\a + b x_{3} \\ \vdots \\ a + bx_{n} \end{array} \right] + \left[ \begin{array}{c} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \vdots \\ \epsilon_{n} \end{array} \right]$

…so $[X] \vec{\beta} = \vec{\hat{y}}$ is the prediction vector

$\vec{\hat{y}} = \left[ \begin{array}{c} \hat{y}_{1} \\ \hat{y}_{2} \\ \hat{y}_{3} \\ \vdots \\ \hat{y}_{n} \end{array} \right] = \left[ \begin{array}{c} a + bx_{1} \\ a + bx_{2} \\a + b x_{3} \\ \vdots \\ a + bx_{n} \end{array} \right]$

Abstractly, the GLM $\vec{y} =[X] \vec{\beta} + \vec{\epsilon}$ is $\vec{y} = \vec{\hat{y}} + \vec{\epsilon}$ and the components of $\vec{\epsilon}$ are clearly the deviations $\epsilon_{i} = y_{i} - \hat{y}_{i}$ .

The least squares solution $\vec{\beta} = ([X]^{T}[X])^{-1} [X]^{T} \vec{y}$ written out explicitly for this linear regression case is (without going into the calculation details):

$\begin{eqnarray*} \left[ \begin{array}{c} a \\ b \end{array} \right] &=& \left( \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_{1} & x_{2} & \cdots & x_{n} \end{array} \right] \left[ \begin{array}{cc} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \end{array} \right] \right)^{-1} \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_{1} & x_{2} & \cdots & x_{n} \end{array} \right] \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] \\ &=& \left[ \begin{array}{c} \frac{(\sum y)(\sum x^{2}) - (\sum x)(\sum x y)}{n(\sum x^{2}) - (\sum x)^{2}} \\ \\ \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^{2}) - (\sum x)^{2}} \end{array} \right] \end{eqnarray*}$

…and this is exactly the solution for $a$ and $b$ that we saw in Section 14.5: Linear Regression.

Example 17.8 : Let’s look at the data of Example 14.3 in a new light. The data were :

Subject	x	y
A	6	82
B	2	86
C	15	43
D	9	74
E	12	58
F	5	90
G	8	78

and we found that $a = 102.5$ (intercept) and $b = -3.6$ (slope).

In GLM format, this is:

$\left[ \begin{array}{c} 82 \\ 86 \\ 43 \\ 74 \\ 58 \\ 90 \\ 78 \end{array} \right] = \left[ \begin{array}{cc} 1 & 6 \\ 1 & 2 \\ 1 & 15 \\ 1 & 9 \\ 1 & 12 \\ 1 & 5 \\ 1 & 8 \end{array} \right] \left[ \begin{array}{c} 102.5 \\ -3.6 \end{array} \right] + \left[ \begin{array}{c} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \\ \epsilon_{6} \\ \epsilon_{7} \end{array} \right]$

Exercise: Compute $\vec{\epsilon}$ .

▢
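One way to check the numbers in Example 17.8 is to let NumPy carry out the matrix arithmetic. The sketch below uses the seven data points from the table, computes $\vec{\beta}$ both from the matrix formula and from the closed-form sums, and then forms the residual vector. The variable names are our own. Note that the residuals here use the unrounded coefficients, so they will differ slightly from a hand calculation that uses $a = 102.5$ and $b = -3.6$.

```python
import numpy as np

# Data from Example 17.8 (originally Example 14.3).
x = np.array([6, 2, 15, 9, 12, 5, 8], dtype=float)
y = np.array([82, 86, 43, 74, 58, 90, 78], dtype=float)
n = len(x)

# Design matrix: a column of ones (intercept) and the IV values.
X = np.column_stack([np.ones(n), x])

# Least squares solution from the matrix formula.
beta = np.linalg.inv(X.T @ X) @ X.T @ y
a, b = beta

# The same coefficients from the closed-form sums.
denom = n * np.sum(x**2) - np.sum(x)**2
a_sum = (np.sum(y) * np.sum(x**2) - np.sum(x) * np.sum(x * y)) / denom
b_sum = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom
assert np.allclose([a, b], [a_sum, b_sum])

# Prediction vector and error (residual) vector.
y_hat = X @ beta
eps = y - y_hat

print(f"a = {a:.1f}, b = {b:.1f}")  # a = 102.5, b = -3.6
print(np.round(eps, 2))
```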

17.2.2 Multiple Linear Regression in GLM Format

The model for multiple linear regression with 2 IVs is:

$y = b_{0} + b_{1} x_{1} + b_{2} x_{2}$

To see how to cast this model in GLM format, let’s take an $n=5$ size dataset with data vector

$\vec{y} = \left[ \begin{array}{c} y(1) \\ y(2) \\ y(3) \\ y(4) \\ y(5) \end{array} \right]$

…then the GLM $\vec{y} = [X] \vec{\beta} + \vec{\epsilon}$ becomes (note the form of $[X]$ ) :

$\left[ \begin{array}{c} y(1) \\ y(2) \\ y(3) \\ y(4) \\ y(5) \end{array} \right] = \left[ \begin{array}{ccc} 1 & x_{1}(1) & x_{2}(1) \\ 1 & x_{1}(2) & x_{2}(2) \\ 1 & x_{1}(3) & x_{2}(3) \\ 1 & x_{1}(4) & x_{2}(4)\\ 1 & x_{1}(5) & x_{2}(5) \end{array} \right] \left[ \begin{array}{c} b_{0} \\ b_{1} \\ b_{2} \end{array} \right] + \left[ \begin{array}{c} \epsilon(1) \\ \epsilon(2) \\ \epsilon(3) \\ \epsilon(4) \\ \epsilon(5) \end{array} \right]$

Doing the matrix multiplication and looking at the vector components brings us back to

$y(1) = b_{0} + b_{1} x_{1}(1) + b_{2} x_{2}(1) + \epsilon(1)$

$y(2) = b_{0} + b_{1} x_{1}(2) + b_{2} x_{2}(2) + \epsilon(2)$

$y(3) = b_{0} + b_{1} x_{1}(3) + b_{2} x_{2}(3) + \epsilon(3)$

$y(4) = b_{0} + b_{1} x_{1}(4) + b_{2} x_{2}(4) + \epsilon(4)$

$y(5) = b_{0} + b_{1} x_{1}(5) + b_{2} x_{2}(5) + \epsilon(5)$

The solution for $\vec{\beta}$ is again given by^[1] $\vec{\beta} = ([X]^{T}[X])^{-1} [X]^{T} \vec{y}$ . This is a prescription for deriving the regression formulae, but we won’t dive into the details.
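Although we skip the algebra, the prescription is easy to carry out numerically. The following sketch uses made-up values for $x_{1}$ , $x_{2}$ and $y$ (an $n=5$ dataset invented for illustration, not taken from the text) and compares the matrix formula with NumPy's general least squares solver.

```python
import numpy as np

# Hypothetical n = 5 dataset with two IVs (values invented for illustration).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1])

# Design matrix with columns [1, x1, x2], so beta = [b0, b1, b2].
X = np.column_stack([np.ones(len(y)), x1, x2])

# Solution from the formula (X^T X)^{-1} X^T y.
beta_formula = np.linalg.inv(X.T @ X) @ X.T @ y

# Solution from NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_formula, beta_lstsq)

# Residuals: one deviation per observation.
eps = y - X @ beta_formula
print(np.round(beta_formula, 3))
print(np.round(eps, 3))
```

The two routes agree here because the three columns of this $[X]$ are linearly independent.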

The design matrix (the model) again maps the $n$ dimensional data vector in $\mathbb{R}^{n}$ to a parameter vector $\vec{\beta}$ in $\mathbb{R}^{p}$ . As with all these GLMs, the dimension $p$ of the parameter space is smaller than the dimension $n$ of the data space. Up until now we have been considering $\mathbb{R}^{n}$ and $\mathbb{R}^{p}$ as separate vector spaces, but we can set things up so that^[2] $\mathbb{R}^{p} \subset \mathbb{R}^{n}$ , with the parameter space being a subspace of the data space; in the example here the parameter space is a 3-dimensional subspace of the 5-dimensional data space. That leaves another $(n-p)$ -dimensional subspace of the data space that is the noise space. Now we can start to see the signal and noise concepts again. We can also call the parameter space the model space or the signal space, so that the $n$ -dimensional data space is composed of a $p$ -dimensional signal space and an $(n-p)$ -dimensional noise space. Perfect data would lie in the signal space, but in reality the data vector has components in the noise space: it has $n-p$ degrees of freedom for generating random noise. We’ll briefly look at this aspect of data space again in Section 17.2.4.
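The split into signal and noise spaces can be made concrete with projection matrices. The sketch below assumes the usual identification of the signal space with the column space of $[X]$ , which the text does not spell out; the matrices `P` and `M` and the $n=5$ , $p=3$ data are our own illustrative choices.

```python
import numpy as np

# Hypothetical n = 5, p = 3 design matrix and data (not from the text).
X = np.column_stack([
    np.ones(5),
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([2.0, 1.0, 4.0, 3.0, 6.0]),
])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1])
n, p = X.shape

# P projects onto the signal space (assumed here to be the column space of X).
P = X @ np.linalg.inv(X.T @ X) @ X.T
# M projects onto the orthogonal complement, the noise space.
M = np.eye(n) - P

y_signal = P @ y   # the prediction vector, X @ beta
y_noise = M @ y    # the error vector

# The two pieces add back up to the data and are orthogonal to each other.
assert np.allclose(y_signal + y_noise, y)
assert np.isclose(y_signal @ y_noise, 0.0, atol=1e-9)

# The trace of a projection matrix equals the dimension it projects onto.
print(int(round(np.trace(P))), int(round(np.trace(M))))  # 3 2
```

The traces recover the dimensions $p = 3$ and $n - p = 2$ quoted in the paragraph above.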

17.2.3 One-Way ANOVA in GLM Format

There are two ways to formulate a GLM design matrix for one-way ANOVA. It depends on whether the grand mean is explicitly included in the model definition or not. The two model definitions are :

1.) With the grand mean:

$y_{j}(i) = \mu + \tau_{j} + \epsilon_{j}(i)$

…for group $j$ .

2.) Without the grand mean:

$y_{j}(i) = \tau_{j} + \epsilon_{j}(i)$

…for group $j$ .

We’ll illustrate how to construct the $[X]$ corresponding to each case by means of a simple example that has 3 groups with 2 subjects per group.

Case 1 : With the grand mean.

$[X] = \left[ \begin{array}{cccc} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{array} \right]$

The first column of 1’s is for the grand mean and the last three columns are coding vectors for the groups. SPSS uses the GLM setup in its programming. When you enter data for a one-way ANOVA into SPSS, you enter an IV vector that looks like:

$\left[ \begin{array}{c} 1 \\ 1 \\ 2 \\ 2 \\ 3 \\ 3 \end{array} \right]$

Such a vector is not in GLM form so SPSS takes your IV vector and, behind the scenes^[3], produces the 3 coding vectors:

$\left[ \begin{array}{c} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{array} \right], \;\; \left[ \begin{array}{c} 0 \\ 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{array} \right], \;\; \left[ \begin{array}{c} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{array} \right]$
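The conversion from a group-label vector to coding vectors is simple to sketch in NumPy. This is only an illustration of the idea and makes no claim about how SPSS actually implements it; the variable names are our own.

```python
import numpy as np

# IV vector of group labels, as it would be entered for a one-way ANOVA.
group = np.array([1, 1, 2, 2, 3, 3])

# One coding (indicator) column per distinct group label.
levels = np.unique(group)
coding = (group[:, None] == levels[None, :]).astype(int)

# Design matrix with the grand mean: a column of ones, then the coding vectors.
X_with_gm = np.column_stack([np.ones(len(group), dtype=int), coding])

print(coding)
# [[1 0 0]
#  [1 0 0]
#  [0 1 0]
#  [0 1 0]
#  [0 0 1]
#  [0 0 1]]
print(X_with_gm.shape)  # (6, 4)
```

Each column of `coding` is one of the three coding vectors, and `X_with_gm` is the Case 1 design matrix.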

Using the $[X]$ given above in the GLM, and setting $\mu = \beta_{0}$ , $\tau_{j} = \beta_{j}$ , we get:

$\left[ \begin{array}{c} y_{1}(1) \\ y_{1}(2) \\ y_{2}(1) \\ y_{2}(2) \\ y_{3}(1) \\ y_{3}(2) \end{array} \right] = \left[ \begin{array}{cccc} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} \beta_{0} \\ \beta_{1} \\ \beta_{2} \\ \beta_{3} \end{array} \right] + \left[ \begin{array}{c} \epsilon_{1}(1) \\ \epsilon_{1}(2) \\ \epsilon_{2}(1) \\ \epsilon_{2}(2) \\ \epsilon_{3}(1) \\ \epsilon_{3}(2) \end{array} \right]$

…which, with matrix multiplication, expands out to

$y_{1}(1) = \beta_{0} + \beta_{1} + \epsilon_{1}(1)$

$y_{1}(2) = \beta_{0} + \beta_{1} + \epsilon_{1}(2)$

$y_{2}(1) = \beta_{0} + \beta_{2} + \epsilon_{2}(1)$

$y_{2}(2) = \beta_{0} + \beta_{2} + \epsilon_{2}(2)$

$y_{3}(1) = \beta_{0} + \beta_{3} + \epsilon_{3}(1)$

$y_{3}(2) = \beta_{0} + \beta_{3} + \epsilon_{3}(2)$

The solution^[4] for $\vec{\beta}$ is:

$\beta_{0} = \overline{x}_{GM}$

$\beta_{j} = \overline{x}_{j} - \overline{x}_{GM}, \;\;\; j \neq 0$

…where $\overline{x}_{GM}$ is the grand mean of all the data ( $y$ ) and $\overline{x}_{j}$ is the mean of group $j$ .
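A short numerical check can make this solution less mysterious. The sketch below uses hypothetical data for 3 groups of 2 subjects (values invented for illustration). It shows that this $[X]$ has rank 3 rather than 4, so $[X]^{T}[X]$ cannot be inverted, and then verifies that the stated $\vec{\beta}$ nevertheless satisfies the normal equations $[X]^{T}[X]\vec{\beta} = [X]^{T}\vec{y}$ .

```python
import numpy as np

# Hypothetical data: 3 groups with 2 subjects each (not from the text).
y = np.array([4.0, 6.0, 1.0, 3.0, 9.0, 7.0])

# Case 1 design matrix: grand-mean column plus three coding vectors.
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1]], dtype=float)

# The first column is the sum of the other three, so X is rank deficient.
rank = np.linalg.matrix_rank(X)
print(rank, X.shape[1])  # 3 4

# The solution quoted in the text: grand mean, then group-mean deviations.
gm = y.mean()
group_means = y.reshape(3, 2).mean(axis=1)
beta = np.concatenate([[gm], group_means - gm])

# It satisfies the normal equations even though (X^T X)^{-1} does not exist.
assert np.allclose(X.T @ X @ beta, X.T @ y)

eps = y - X @ beta
print(beta)
print(eps)
```

Many other vectors satisfy the same normal equations; the one above is the choice whose group effects sum to zero in this balanced example.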

Case 2 : Without the grand mean.

$[X] = \left[ \begin{array}{ccc} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{array} \right]$

Now $[X]$ only contains coding vectors. Using that design matrix in the GLM explicitly for our small example with $\tau_{j} = \beta_{j}$ gives:

$\left[ \begin{array}{c} y_{1}(1) \\ y_{1}(2) \\ y_{2}(1) \\ y_{2}(2) \\ y_{3}(1) \\ y_{3}(2) \end{array} \right] = \left[ \begin{array}{ccc} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} \beta_{1} \\ \beta_{2} \\ \beta_{3} \end{array} \right] + \left[ \begin{array}{c} \epsilon_{1}(1) \\ \epsilon_{1}(2) \\ \epsilon_{2}(1) \\ \epsilon_{2}(2) \\ \epsilon_{3}(1) \\ \epsilon_{3}(2) \end{array} \right]$

Expanding this to the vector components gives:

$y_{1}(1) = \beta_{1} + \epsilon_{1}(1)$

$y_{1}(2) = \beta_{1} + \epsilon_{1}(2)$

$y_{2}(1) = \beta_{2} + \epsilon_{2}(1)$

$y_{2}(2) = \beta_{2} + \epsilon_{2}(2)$

$y_{3}(1) = \beta_{3} + \epsilon_{3}(1)$

$y_{3}(2) = \beta_{3} + \epsilon_{3}(2)$

Solving $\vec{\beta} = ([X]^{T} [X])^{-1} [X]^{T} \vec{y}$ gives:

$\vec{\beta} = \left[ \begin{array}{c} \beta_{1} \\ \beta_{2} \\ \beta_{3} \end{array} \right] = \left[ \begin{array}{c} \overline{x}_{1} \\ \overline{x}_{2} \\ \overline{x}_{3} \end{array} \right]$

Let’s work through a numerical example.

Example 17.9 : Given the one-way ANOVA data:

DV	Group (IV)
5	1
6	1
7	1
3	2
2	2
1	2
12	3
11	3
7	3
20	4
21	4
25	4

…we set up the GLM explicitly without the grand mean :

$\left[ \begin{array}{c} 5 \\ 6 \\ 7 \\ 3 \\ 2 \\ 1 \\ 12 \\ 11 \\ 7 \\ 20 \\ 21 \\ 25 \end{array} \right] = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \beta_{4} \end{array} \right] + \left[ \begin{array}{c} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \\ \epsilon_{6} \\ \epsilon_{7} \\ \epsilon_{8} \\ \epsilon_{9} \\ \epsilon_{10} \\ \epsilon_{11} \\ \epsilon_{12} \end{array} \right]$

The solution for $\vec{\beta}$ is

$\left[ \begin{array}{c} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \beta_{4} \end{array} \right] = \left[ \begin{array}{c} \overline{x}_{1} \\ \overline{x}_{2} \\ \overline{x}_{3} \\ \overline{x}_{4} \end{array} \right] = \left[ \begin{array}{c} 6 \\ 2 \\ 10 \\ 22 \end{array} \right]$

…so:

$\left[ \begin{array}{c} 5 \\ 6 \\ 7 \\ 3 \\ 2 \\ 1 \\ 12 \\ 11 \\ 7 \\ 20 \\ 21 \\ 25 \end{array} \right] = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} 6 \\ 2 \\ 10 \\ 22 \end{array} \right] + \left[ \begin{array}{c} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \\ \epsilon_{6} \\ \epsilon_{7} \\ \epsilon_{8} \\ \epsilon_{9} \\ \epsilon_{10} \\ \epsilon_{11} \\ \epsilon_{12} \end{array} \right]$

Exercise 1 : Do the matrix multiplication and compute $\vec{\epsilon}$ .

Exercise 2 : Formulate $[X]$ with the grand mean and compute $\vec{\epsilon}$ .

Hint: in that case

$\vec{\beta} = \left[ \begin{array}{c} \beta_{0} \\\beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \beta_{4} \end{array} \right] = \left[ \begin{array}{c} \overline{x}_{GM} \\ \overline{x}_{1} - \overline{x}_{GM} \\ \ \overline{x}_{2} - \overline{x}_{GM} \\ \overline{x}_{3} - \overline{x}_{GM} \\ \overline{x}_{4} - \overline{x}_{GM} \end{array} \right]$

…and $\vec{\epsilon}$ will be the same as in Exercise 1.

▢
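After working the exercises by hand, you can check your answers with a few lines of NumPy. The sketch below uses the data of Example 17.9, solves the GLM without the grand mean, and then builds the with-grand-mean version from the hint. The variable names are our own.

```python
import numpy as np

# Data from Example 17.9.
y = np.array([5, 6, 7, 3, 2, 1, 12, 11, 7, 20, 21, 25], dtype=float)
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# Design matrix without the grand mean: one coding vector per group.
X = (group[:, None] == np.unique(group)[None, :]).astype(float)

# Least squares solution; the entries are the group means.
beta = np.linalg.inv(X.T @ X) @ X.T @ y
eps = y - X @ beta

# Design matrix with the grand mean, using the beta from the hint.
X_gm = np.column_stack([np.ones(len(y)), X])
gm = y.mean()
beta_gm = np.concatenate([[gm], beta - gm])
eps_gm = y - X_gm @ beta_gm

# Both formulations give the same error vector.
assert np.allclose(eps, eps_gm)

print(np.round(beta, 6))     # group means 6, 2, 10, 22
print(np.round(beta_gm, 6))  # grand mean 10, then deviations -4, -8, 0, 12
print(np.round(eps, 6))
```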

17.2.4 Test Statistics in GLM Format

In all GLM cases the inferential statistics (the $t_{\mbox{test}}$ or $F_{\mbox{test}}$ values) come from an analysis of the $\vec{\epsilon}$ error (or residual) vector. Roughly, the approach begins with the observation that $\vec{\epsilon} \in \mathbb{R}^{n-p} \subset \mathbb{R}^{n}$ . The error vector has $n-p$ degrees of freedom. Then we consider a variance^[5] that has the form

$\sigma^{2} = \frac{\vec{\epsilon}^{\;T} \vec{\epsilon}}{n-p}.$

The $t$ and $F$ statistics describe how the component values of $\vec{\epsilon}$ will be distributed if $H_{0}$ is true.

In an ANOVA set up, for example, we can do post hoc testing using contrast vectors^[6], $\vec{c}$ , and use the following formula for the $t$ test statistic :

$t_{\mbox{test}} = \frac{\vec{c}^{\;T} (\vec{\beta} - \vec{\beta}_{0})}{\sqrt{\sigma^{2} \vec{c}^{\;T}([X]^{T}[X])^{-1} \vec{c}}}$

…where $[X]$ must be the version without the grand mean and $\vec{\beta}_{0}$ is the parameter vector associated with $H_{0}$ (all zeros usually). As examples of contrast vectors, if we have three groups then:

$\vec{c}_{1} = \left[ \begin{array}{c} 1 \\ -1 \\ 0 \end{array} \right] \;\;\; \mbox{ compares groups 1 and 2}$

$\vec{c}_{2} = \left[ \begin{array}{c} 1 \\ 0 \\ -1 \end{array} \right] \;\;\; \mbox{ compares groups 1 and 3}$

$\vec{c}_{3} = \left[ \begin{array}{c} 0 \\ 1 \\ -1 \end{array} \right] \;\;\; \mbox{ compares groups 2 and 3}$

There are similar formulae for $F$ that use the GLM matrices and vectors.
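To see the variance and $t$ formulae in action, the sketch below applies them to the four-group data of Example 17.9. Because there are four groups, the contrast vector has four components; the choice of comparing groups 1 and 2, and the use of $\vec{\beta}_{0} = \vec{0}$ , are our own assumptions for illustration. Looking up a $p$ -value for the result would be a further step that is not shown.

```python
import numpy as np

# Data from Example 17.9.
y = np.array([5, 6, 7, 3, 2, 1, 12, 11, 7, 20, 21, 25], dtype=float)
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# Design matrix without the grand mean, as the t formula requires.
X = (group[:, None] == np.unique(group)[None, :]).astype(float)
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
eps = y - X @ beta

# Variance estimate from the error vector with n - p degrees of freedom.
sigma2 = eps @ eps / (n - p)

# Contrast comparing groups 1 and 2 (an illustrative choice).
c = np.array([1.0, -1.0, 0.0, 0.0])
beta0 = np.zeros(p)  # parameter vector under H0, assumed to be all zeros

t_test = c @ (beta - beta0) / np.sqrt(sigma2 * (c @ XtX_inv @ c))

print(f"sigma^2 = {sigma2:.3f}, t = {t_test:.3f}")  # sigma^2 = 4.000, t = 2.449
```

Here the numerator is the difference between the means of groups 1 and 2, and the statistic has $n - p = 8$ degrees of freedom.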

[1] A more appropriate notation for the parameter vector would be $\vec{b}$ to emphasize that it is an estimate from a sample of some population vector $\vec{\beta}$ . But, as we did for the symbols $r$ and $\rho$ for correlation, we'll be a little sloppy with the notation we use for sample and population values.
[2] The set symbol $\subset$ means "proper subset".
[3] The actual operation of SPSS is a black box that may not run exactly as described here, but conceptually its GLM operation requires the pieces of $[X]$ as described here.
[4] You may see that $[X]$ here is not of full rank, so that a least squares solution is not actually possible. But we pick out the solution, from the infinity of possible solutions for $\vec{\beta}$ , that fits with what we'll find when we look at Case 2, in which $[X]$ is of full rank.
[5] Again we are being sloppy with sample and population symbols.
[6] A modern approach, which replaces the traditional omnibus ANOVA followed by post hoc testing, skips the ANOVA and jumps directly to comparing groups of interest using contrast vectors.