17. Overview of the General Linear Model
17.2 The General Linear Model (GLM) for Univariate Statistics
In abstract form, the GLM is
$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where
- $\mathbf{y}$ is the data vector, an $n$-dimensional column vector.
- $X$ is the design matrix which is different from test type to test type.
- $\boldsymbol{\beta}$ is the parameter vector, a lower $p$-dimensional vector that summarizes the data in terms of the model given by the design matrix.
- $\boldsymbol{\epsilon}$ is the error vector, the $n$-dimensional column vector of deviations or differences between the model predictions and the data in $\mathbf{y}$.
The solution for $\boldsymbol{\beta}$ is the least squares solution
$$\boldsymbol{\beta} = (X^T X)^{-1} X^T \mathbf{y}$$
In terms of the linear algebra that we just reviewed, $(X^T X)^{-1} X^T$ (known as the pseudo-inverse) transforms the data vector in data space ($\mathbb{R}^n$) to a vector in parameter space ($\mathbb{R}^p$) that presumably explains the data.
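As a minimal sketch of the least squares solution, the pseudo-inverse can be formed numerically with NumPy. The toy design matrix and data below are invented for illustration (they are not from the text), and the sketch assumes the design matrix has full column rank so that the inverse exists.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 2 parameters.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # lies exactly on the line y = 1 + 2x

# Pseudo-inverse (X^T X)^{-1} X^T; requires X to have full column rank.
pinv = np.linalg.inv(X.T @ X) @ X.T

beta = pinv @ y        # parameter vector, lives in R^p
eps = y - X @ beta     # error vector, lives in R^n

print(pinv.shape)      # p x n: maps data space to parameter space
print(beta)
print(eps)
```

Because the toy data lie exactly on a line, the error vector here is zero up to rounding. For real data, `np.linalg.lstsq` is usually the numerically safer choice over forming the inverse explicitly.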
17.2.1 Linear Regression in GLM Format
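We can express the linear regression model in GLM format as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
Note, importantly, that the design matrix is
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
…where the second column is composed of the IV values, $x_i$. This is typical for the GLM: the DV is represented by the data vector and the IV is represented by the design matrix. If we do the matrix multiplication the model is:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} a + b x_1 \\ a + b x_2 \\ \vdots \\ a + b x_n \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
…so $\hat{\mathbf{y}} = X\boldsymbol{\beta}$, with components $\hat{y}_i = a + b x_i$, is the prediction vector.
Abstractly, the GLM is $\mathbf{y} = \hat{\mathbf{y}} + \boldsymbol{\epsilon}$ and the components of $\boldsymbol{\epsilon}$ are clearly the deviations $\epsilon_i = y_i - \hat{y}_i$.
The least squares solution $\boldsymbol{\beta} = (X^T X)^{-1} X^T \mathbf{y}$ written out explicitly for this linear regression case is (without going into the calculation details):
$$a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}, \qquad b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$
…and this is exactly the solution for $a$ and $b$ that we saw in Section 14.5: Linear Regression.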
Example 17.8 : Let’s look at the data of Example 14.3 in a new light. The data were:
Subject | x | y |
A | 6 | 82 |
B | 2 | 86 |
C | 15 | 43 |
D | 9 | 74 |
E | 12 | 58 |
F | 5 | 90 |
G | 8 | 78 |
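and we found that $a = 102.493$ (intercept) and $b = -3.622$ (slope).
In GLM format this all is:
$$\begin{bmatrix} 82 \\ 86 \\ 43 \\ 74 \\ 58 \\ 90 \\ 78 \end{bmatrix} = \begin{bmatrix} 1 & 6 \\ 1 & 2 \\ 1 & 15 \\ 1 & 9 \\ 1 & 12 \\ 1 & 5 \\ 1 & 8 \end{bmatrix} \begin{bmatrix} 102.493 \\ -3.622 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \\ \epsilon_7 \end{bmatrix}$$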
Exercise: Compute $\boldsymbol{\epsilon}$.
▢
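The intercept and slope for the data of Example 17.8 can be checked numerically. The sketch below, which assumes NumPy and uses variable names chosen here for illustration, solves the normal equations for the tabulated data and compares the result with the usual summation formulas for the intercept and slope.

```python
import numpy as np

# Data from the table in Example 17.8.
x = np.array([6, 2, 15, 9, 12, 5, 8], dtype=float)
y = np.array([82, 86, 43, 74, 58, 90, 78], dtype=float)
n = len(x)

# Design matrix: a column of ones (intercept) and the IV values (slope).
X = np.column_stack([np.ones(n), x])

# Solve the normal equations (X^T X) beta = X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# The same quantities from the summation formulas.
denom = n * np.sum(x**2) - np.sum(x)**2
a = (np.sum(y) * np.sum(x**2) - np.sum(x) * np.sum(x * y)) / denom
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom

print(beta)   # [intercept, slope]
print(a, b)
```

Both routes give the same two numbers, which is the point of the GLM formulation: the matrix expression is a compact form of the regression formulas.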
17.2.2 Multiple Linear Regression in GLM Format
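The model for multiple linear regression with 2 IVs is:
$$y = a + b_1 x_1 + b_2 x_2 + \epsilon$$
To see how to cast this model in GLM format, let’s take an $n = 5$ size dataset with data vector
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}$$
…then the GLM becomes (note the form of $X$):
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ 1 & x_{14} & x_{24} \\ 1 & x_{15} & x_{25} \end{bmatrix} \begin{bmatrix} a \\ b_1 \\ b_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \end{bmatrix}$$
Doing the matrix multiplication and looking at the vector components brings us back to
$$y_i = a + b_1 x_{1i} + b_2 x_{2i} + \epsilon_i$$
The solution for $\boldsymbol{\beta}$ again is given by[1] $\boldsymbol{\beta} = (X^T X)^{-1} X^T \mathbf{y}$. This is a prescription for deriving the regression formulae but we won’t dive into the details.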
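A short sketch of the two-IV case follows. The five data points are hypothetical, made up only to have something to fit, and the variable names are chosen here for illustration. The solution from the explicit pseudo-inverse formula is compared with NumPy's `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical data, invented for illustration: n = 5 subjects, 2 IVs.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = np.array([6.1, 6.9, 12.2, 12.8, 19.0])

# Design matrix: ones for the intercept, then one column per IV (5 x 3).
X = np.column_stack([np.ones(5), x1, x2])

# Least squares solution from the pseudo-inverse formula.
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check with NumPy's least squares solver.
beta_check, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)         # [a, b1, b2]
print(beta_check)
```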
The design matrix (the model), through its pseudo-inverse, again maps the $n$-dimensional data vector in $\mathbb{R}^5$ to a parameter vector in $\mathbb{R}^3$. As with all these GLMs, the dimension of the parameter space is smaller than the dimension of the data space. Up until now we have been considering $\mathbb{R}^n$ and $\mathbb{R}^p$ as separate vector spaces, but we can set things up with[2] $\mathbb{R}^p \subset \mathbb{R}^n$, with the parameter space being a subspace of the data space; in the example here the parameter space is a 3-dimensional subspace of the 5-dimensional data space. That leaves another $n - p = 2$ dimensional subspace of the data space that is the noise space.
Now we can start to see the signal and noise concepts again. We can also call the parameter space the model space or the signal space, so that the $n$-dimensional data space is composed of a $p$-dimensional signal space and an $(n - p)$-dimensional noise space. Perfect data would lie in the signal space, but in reality the data vector has components in the noise space: it has $n - p$ degrees of freedom for generating random noise. We’ll briefly look at this aspect of data space again in Section 17.2.4.
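The split of the data space into a signal space and a noise space can be illustrated numerically. The sketch below is not from the text: it uses hypothetical data with 5 observations and 3 parameters, and it introduces the projection matrix `H` (often called the hat matrix) as an assumed device for separating the two components of the data vector.

```python
import numpy as np

# Hypothetical data, invented for illustration: n = 5, p = 3.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = np.array([6.1, 6.9, 12.2, 12.8, 19.0])
X = np.column_stack([np.ones(5), x1, x2])

# H projects any data vector onto the signal (model) space.
H = X @ np.linalg.inv(X.T @ X) @ X.T

signal = H @ y          # component of y in the signal space
noise = y - signal      # component of y in the noise space

# For a projection matrix, the trace equals the dimension it projects onto.
signal_dim = int(round(np.trace(H)))
noise_dim = int(round(np.trace(np.eye(5) - H)))

print(signal_dim, noise_dim)
print(signal @ noise)   # the two components are orthogonal (about 0)
```

The two dimensions add up to the dimension of the data space, and the noise dimension is the degrees of freedom of the error vector.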
17.2.3 One-Way ANOVA in GLM Format
There are two ways to formulate a GLM design matrix for one-way ANOVA. It depends on whether the grand mean is explicitly included in the model definition or not. The two model definitions are:
1.) With the grand mean:
$$y_{ij} = \mu + \tau_j + \epsilon_{ij}$$
…for group $j$.
2.) Without the grand mean:
$$y_{ij} = \mu_j + \epsilon_{ij}$$
…for group $j$.
We’ll illustrate by means of a simple example that has 3 groups with 2 subjects per group how to construct the $X$ corresponding to each case.
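Case 1 : With the grand mean.
$$X = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}$$
The first column of 1’s is for the grand mean and the last three columns are coding vectors for the groups. SPSS uses the GLM setup in its programming. When you enter data for a one-way ANOVA into SPSS, you enter an IV vector that looks like:
$$\begin{bmatrix} 1 \\ 1 \\ 2 \\ 2 \\ 3 \\ 3 \end{bmatrix}$$
Such a vector is not in GLM form so SPSS takes your IV vector and, behind the scenes[3], produces the 3 coding vectors:
$$\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$$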
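The conversion from an IV vector of group labels to coding vectors can be sketched in a few lines of NumPy. This is only an illustration of the idea with names chosen here, not a description of what SPSS actually does internally.

```python
import numpy as np

# IV vector as entered: a group label for each of the 6 subjects.
iv = np.array([1, 1, 2, 2, 3, 3])

# One coding column per distinct group: 1 where the subject is in
# that group, 0 otherwise.
groups = np.unique(iv)
coding = (iv[:, None] == groups[None, :]).astype(int)   # 6 x 3

# Design matrix with the grand mean: prepend a column of ones.
X_with_gm = np.column_stack([np.ones(len(iv), dtype=int), coding])

print(coding)
print(X_with_gm)
```

The columns of `coding` are the three coding vectors, and `coding` on its own is the design matrix for the formulation without the grand mean.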
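Using the $X$ given above in the GLM, and setting $\boldsymbol{\beta} = [\mu, \tau_1, \tau_2, \tau_3]^T$, $\mathbf{y} = [y_{11}, y_{21}, y_{12}, y_{22}, y_{13}, y_{23}]^T$, we get:
$$\begin{bmatrix} y_{11} \\ y_{21} \\ y_{12} \\ y_{22} \\ y_{13} \\ y_{23} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{21} \\ \epsilon_{12} \\ \epsilon_{22} \\ \epsilon_{13} \\ \epsilon_{23} \end{bmatrix}$$
…which, with matrix multiplication, expands out to
$$\begin{aligned} y_{11} &= \mu + \tau_1 + \epsilon_{11} \\ y_{21} &= \mu + \tau_1 + \epsilon_{21} \\ y_{12} &= \mu + \tau_2 + \epsilon_{12} \\ y_{22} &= \mu + \tau_2 + \epsilon_{22} \\ y_{13} &= \mu + \tau_3 + \epsilon_{13} \\ y_{23} &= \mu + \tau_3 + \epsilon_{23} \end{aligned}$$
The solution[4] for $\boldsymbol{\beta}$ is:
$$\boldsymbol{\beta} = \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} = \begin{bmatrix} \bar{y} \\ \bar{y}_1 - \bar{y} \\ \bar{y}_2 - \bar{y} \\ \bar{y}_3 - \bar{y} \end{bmatrix}$$
…where $\bar{y}$ is the grand mean of all the data ($\bar{y} = \frac{1}{6}\sum_{i,j} y_{ij}$) and $\bar{y}_j$ is the mean of group $j$.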
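Case 2 : Without the grand mean.
Now $X$ only contains coding vectors. Using that design matrix in the GLM explicitly for our small example with $\boldsymbol{\beta} = [\mu_1, \mu_2, \mu_3]^T$ gives:
$$\begin{bmatrix} y_{11} \\ y_{21} \\ y_{12} \\ y_{22} \\ y_{13} \\ y_{23} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{21} \\ \epsilon_{12} \\ \epsilon_{22} \\ \epsilon_{13} \\ \epsilon_{23} \end{bmatrix}$$
Expanding this to the vector components gives:
$$\begin{aligned} y_{11} &= \mu_1 + \epsilon_{11} \\ y_{21} &= \mu_1 + \epsilon_{21} \\ y_{12} &= \mu_2 + \epsilon_{12} \\ y_{22} &= \mu_2 + \epsilon_{22} \\ y_{13} &= \mu_3 + \epsilon_{13} \\ y_{23} &= \mu_3 + \epsilon_{23} \end{aligned}$$
Solving gives:
$$\boldsymbol{\beta} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} = \begin{bmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \bar{y}_3 \end{bmatrix}$$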
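The difference between the two formulations shows up in the rank of the design matrix. The sketch below, with variable names chosen here for illustration, builds both design matrices for the example with 3 groups and 2 subjects per group and checks their ranks with NumPy. The vector `v` is an assumption introduced here to show one direction in parameter space that the Case 1 design matrix cannot detect.

```python
import numpy as np

# Case 2: coding vectors only (6 x 3).
X2 = np.repeat(np.eye(3), 2, axis=0)

# Case 1: grand-mean column of ones plus the coding vectors (6 x 4).
X1 = np.column_stack([np.ones(6), X2])

rank1 = np.linalg.matrix_rank(X1)   # fewer than the 4 columns of X1
rank2 = np.linalg.matrix_rank(X2)   # equal to the 3 columns of X2

# Adding any multiple of v to a Case 1 parameter vector leaves the
# predictions unchanged, so the Case 1 solution is not unique.
v = np.array([1.0, -1.0, -1.0, -1.0])

print(rank1, rank2)
print(X1 @ v)   # all zeros
```

The column of ones in Case 1 is the sum of the three coding columns, which is why that matrix is not of full rank while the Case 2 matrix is.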
Let’s work through a numerical example.
Example 17.9 : Given the one-way ANOVA data:
DV | Group (IV) |
5 | 1 |
6 | 1 |
7 | 1 |
3 | 2 |
2 | 2 |
1 | 2 |
12 | 3 |
11 | 3 |
7 | 3 |
20 | 4 |
21 | 4 |
25 | 4 |
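…we set up the GLM explicitly without the grand mean:
$$\begin{bmatrix} 5 \\ 6 \\ 7 \\ 3 \\ 2 \\ 1 \\ 12 \\ 11 \\ 7 \\ 20 \\ 21 \\ 25 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{bmatrix} + \boldsymbol{\epsilon}$$
The solution for $\boldsymbol{\beta}$ is
$$\boldsymbol{\beta} = (X^T X)^{-1} X^T \mathbf{y}$$
…so:
$$\boldsymbol{\beta} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \\ 10 \\ 22 \end{bmatrix}$$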
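Exercise 1 : Do the matrix multiplication $X\boldsymbol{\beta}$ and compute $\boldsymbol{\epsilon}$.
Exercise 2 : Formulate $X$ with the grand mean and compute $\boldsymbol{\epsilon}$.
Hint: in that case
$$\boldsymbol{\beta} = \begin{bmatrix} 10 \\ -4 \\ -8 \\ 0 \\ 12 \end{bmatrix}$$
…and $\boldsymbol{\epsilon}$ will be the same as in Exercise 1.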
▢
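The parameter values for Example 17.9 can be checked with a short NumPy sketch. The design matrix without the grand mean is built from the group labels in the table. For the version with the grand mean, the sketch assumes the parameter vector is the grand mean followed by each group mean's deviation from it, and confirms that both formulations make identical predictions. Variable names are chosen here for illustration.

```python
import numpy as np

# Data from the table in Example 17.9.
y = np.array([5, 6, 7, 3, 2, 1, 12, 11, 7, 20, 21, 25], dtype=float)
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# Without the grand mean: one coding column per group (12 x 4).
X2 = (group[:, None] == np.unique(group)[None, :]).astype(float)
beta2 = np.linalg.solve(X2.T @ X2, X2.T @ y)    # the four group means

# With the grand mean: prepend a column of ones (12 x 5) and use the
# grand mean plus each group mean's deviation from it.
X1 = np.column_stack([np.ones(len(y)), X2])
beta1 = np.concatenate([[y.mean()], beta2 - y.mean()])

# Both formulations give the same prediction vector.
same_fit = np.allclose(X1 @ beta1, X2 @ beta2)

print(beta2)
print(beta1)
print(same_fit)
```

Since the predictions agree, the error vector is the same in both formulations.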
17.2.4 Test Statistics in GLM Format
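In all GLM cases the inferential statistics (the $t$ or $F$ values) come from an analysis of the error (or residual) vector. Roughly, the approach begins with the observation that $\boldsymbol{\epsilon} = \mathbf{y} - X\boldsymbol{\beta}$. The error vector has $n - p$ degrees of freedom. Then we consider a variance[5] that has the form
$$\sigma^2 = \frac{\boldsymbol{\epsilon}^T \boldsymbol{\epsilon}}{n - p}$$
The $t$ and $F$ statistics describe how the component values of $\boldsymbol{\beta}$ will be distributed if $H_0$ is true.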
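In an ANOVA set up, for example, we can do post hoc testing using contrast vectors[6], $\mathbf{c}$, and use the following formula for the $t$ test statistic:
$$t = \frac{\mathbf{c}^T (\boldsymbol{\beta} - \boldsymbol{\beta}_0)}{\sqrt{\sigma^2 \, \mathbf{c}^T (X^T X)^{-1} \mathbf{c}}}$$
…where $X$ must be the version without the grand mean and $\boldsymbol{\beta}_0$ is the parameter vector associated with $H_0$ (all zeros usually). As examples of contrast vectors, if we have three groups then:
$$\mathbf{c} = \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix} \text{ compares groups 1 and 2,} \quad \mathbf{c} = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \text{ compares groups 1 and 3,} \quad \mathbf{c} = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix} \text{ compares groups 2 and 3.}$$
There are similar formulae for $F$ that use the GLM matrices and vectors.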
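A contrast test can be sketched numerically with the data of Example 17.9. The sketch assumes the standard GLM form of the contrast statistic: the contrast applied to the parameter estimates, divided by the square root of the error variance times the contrast's quadratic form in the inverse of the design cross-product matrix. The contrast vector `c`, comparing groups 1 and 2, and the all-zero null vector `beta0` are choices made here for illustration.

```python
import numpy as np

# Data from Example 17.9; design matrix without the grand mean.
y = np.array([5, 6, 7, 3, 2, 1, 12, 11, 7, 20, 21, 25], dtype=float)
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
X = (group[:, None] == np.unique(group)[None, :]).astype(float)
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y          # group means
eps = y - X @ beta                # error vector
sigma2 = eps @ eps / (n - p)      # error variance with n - p df

# Contrast comparing group 1 with group 2 (illustrative choice).
c = np.array([1.0, -1.0, 0.0, 0.0])
beta0 = np.zeros(p)               # null hypothesis: all zeros

t = c @ (beta - beta0) / np.sqrt(sigma2 * (c @ XtX_inv @ c))

print(sigma2)
print(t)
```

The resulting value would then be compared with a critical $t$ value on $n - p = 8$ degrees of freedom.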
[1] A more appropriate notation for the parameter vector would be $\hat{\boldsymbol{\beta}}$, to emphasize that it is an estimate from a sample of some population vector $\boldsymbol{\beta}$. But, as we did for the symbols $r$ and $\rho$ for correlation, we'll be a little sloppy with the notation we use for sample and population values.
[2] The set symbol $\subset$ means "proper subset".
[3] The actual operation of SPSS is a black box that may not run exactly as described here, but conceptually its GLM operation requires the pieces of $X$ as described here.
[4] You may see that $X$ here is not of full rank, so that $X^T X$ cannot be inverted and a unique least squares solution is not actually possible. But we pick out the solution, from the infinity of possible solutions for $\boldsymbol{\beta}$, that fits with what we'll find when we look at case 2, in which $X$ is of full rank.
[5] Again we are being sloppy with sample and population symbols.
[6] A modern approach, which replaces the traditional omnibus ANOVA followed by post hoc testing, skips the ANOVA and jumps directly to comparing groups of interest using contrast vectors.