It’s time to respond to: “I’m curious about what you tell PhD students about multiple linear regression.” I tend to focus first on regression coefficients: what they are and are not, why we might care, and how we compute them. Almost fifty years ago, I was lucky enough to be introduced to Yule’s new system of notation, new in 1907, that is. (Thank you, Dr Geoffrey Jowett.) Given a collection X1, X2, … , Xp of random variables, the expression b12•3…p denotes the (linear least-squares) regression coefficient of X1 on X2, when X3, … , Xp are also in the regression equation. As Yule put it in his paper, the first subscript gives the dependent variable, the second the variable of which the given regression is the coefficient, and the subscripts after the period show the remaining independent variables which enter into the equation. This avoids having to emphasize that the regression coefficient of X1 on X2 depends on the other variables in the equation: it’s right there in the notation! Mosteller and Tukey say it another way in chapter 13, “Woes of regression coefficients,” of their magnificent 1977 book Data Analysis and Regression: “a coefficient in a multiple regression – either in a theory or in a fit – depends on MORE than just: the set of data and the method of fitting [and] the carrier it multiplies. It also depends on: what else is offered as part of the fit.”
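To see Mosteller and Tukey’s point in action, here is a minimal sketch in R on simulated data (the variable names, coefficients, and seed are mine, invented purely for illustration): the fitted coefficient of X2 changes once X3 is also offered as part of the fit.

```r
# Simulated illustration: the coefficient of x2 depends on what else is in the fit
set.seed(1)
n  <- 200
x3 <- rnorm(n)
x2 <- 0.7 * x3 + rnorm(n)          # x2 and x3 are correlated
x1 <- 2 * x2 + 3 * x3 + rnorm(n)

coef(lm(x1 ~ x2))["x2"]            # b12:   x3 not in the equation
coef(lm(x1 ~ x2 + x3))["x2"]       # b12.3: x3 also in the equation; the two differ markedly
```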

Having got this point clear, we now need to address the vexed question of how we interpret b12•3, that is, the words we use when we say informally what it means. As we all know, some people call it the regression coefficient of X1 on X2, controlling for X3. But we also know that in general X’s in regressions are not under any control, so this cannot be a good description. My preference is to say adjusting for X3. This is vague, but less likely to mislead, and definitely conveys the fact that X3 is in the model along with X2. It is also connected to the use of regression for linear adjustment. But what exactly is a regression coefficient? Again we all know the simplistic interpretation of b12•3 as the average change in X1 per unit change in X2, when X3 is held fixed. Why simplistic? At times “held fixed” makes no sense, an example being X3 = X2², where X3 cannot stay put while X2 changes.
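A few lines of R (again a toy simulation of my own devising) show that the fit, and with it b12•3, is perfectly well defined even though nothing can change X2 while holding X2² fixed.

```r
# Toy example: x3 is a deterministic function of x2, so "holding x3 fixed" is meaningless,
# yet the partial regression coefficient of x2 is still perfectly well defined.
set.seed(2)
x2 <- rnorm(100)
x3 <- x2^2
x1 <- 1 + 2 * x2 + 0.5 * x3 + rnorm(100)
coef(lm(x1 ~ x2 + x3))   # returns b12.3 and b13.2 without complaint
```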

What can we say? A lengthy, but basically correct, interpretation goes like this: b12•3 tells us how X1 responds, on average, to change in X2, after allowing for simultaneous linear change in X3 in the data at hand.

Mosteller and Tukey point out that sometimes X’s can be held constant, and then the important thing is to recognize just how large the difference can be between (i) X2 changing while X3 is left alone, neither disturbed nor clamped, and (ii) changing X2 while holding X3 fast. The first corresponds to the interpretation I gave, and the second is what people usually wish for. Complicated? Indeed, but as Oscar Wilde told us, “The truth is rarely pure and never simple.”

Yule also introduced the notation X1•23…p = X1 − b12•3…pX2 − … − b1p•2…p−1Xp for the residual of X1 after regression on X2, … , Xp. This can be very helpful when we want to show that multiple linear regression may be viewed as a sequence of simple linear regressions, of residuals on residuals. It is closely related to added variable plots. I think it’s important for students to know this, and how to derive it using the fact that (least-squares) residuals are orthogonal to all the variables after the period. For example, one can easily derive the identity
b12•3 = b12 − b13•2b32, which I have found extremely useful over the years. Here’s one thing you can see from this identity: the regression coefficient of X1 on X2 doesn’t change when X3 is added into the regression equation, if either b32 = 0, i.e., if X2 and X3 are orthogonal, or b13•2 = 0. Another is the relation between adjusted and unadjusted means in ANCOVA. These identities are not hard to understand if you learn them when you are doing all your multiple regression computations with a mechanical calculator. Jowett showed us that if we use Jordan’s procedure for matrix inversion, “every intermediate quantity occurring in the calculation is either a partial regression coefficient or a partial covariance, and therefore of potential interest.” Try this step-by-step in R.
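Here is one way to try it step by step. It is a small sketch on simulated data (the variable names, the seed, and the sweep_op helper are mine, not from Yule or Jowett): first the residuals-on-residuals view and the identity b12•3 = b12 − b13•2b32, then a bare-bones sweep of the covariance matrix in the spirit of Jowett’s remark.

```r
set.seed(3)
n  <- 500
x2 <- rnorm(n)
x3 <- 0.6 * x2 + rnorm(n)
x1 <- 1 + 2 * x2 - 1.5 * x3 + rnorm(n)

## Multiple regression as a simple regression of residuals on residuals
b12.3 <- coef(lm(x1 ~ x2 + x3))["x2"]
r1 <- resid(lm(x1 ~ x3))               # X1 adjusted for X3
r2 <- resid(lm(x2 ~ x3))               # X2 adjusted for X3
coef(lm(r1 ~ r2))["r2"]                # equals b12.3 (the slope of the added variable plot)

## The identity b12.3 = b12 - b13.2 * b32
b12   <- coef(lm(x1 ~ x2))["x2"]
b13.2 <- coef(lm(x1 ~ x2 + x3))["x3"]
b32   <- coef(lm(x3 ~ x2))["x2"]
b12 - b13.2 * b32                      # equals b12.3 again

## A bare-bones sweep of the covariance matrix of (x1, x2, x3),
## one pivot at a time, as a stand-in for Jordan's procedure.
sweep_op <- function(A, k) {
  d <- A[k, k]
  B <- A - outer(A[, k], A[k, ]) / d   # adjust every entry for the pivot variable
  B[, k] <- A[, k] / d                 # pivot column: regression coefficients on X_k
  B[k, ] <- A[k, ] / d                 # pivot row, likewise
  B[k, k] <- -1 / d
  B
}
S   <- cov(cbind(x1, x2, x3))
S2  <- sweep_op(S, 2)                  # sweep on x2
S23 <- sweep_op(S2, 3)                 # then on x3
S2[1, 2]                               # b12, the simple regression coefficient
S2[1, 3]                               # partial covariance of x1 and x3, adjusting for x2
S23[1, 2:3]                            # b12.3 and b13.2, matching lm(x1 ~ x2 + x3)
```

After sweeping on X2 alone, the matrix already holds b12 and the partial covariance of X1 and X3 adjusting for X2; after sweeping on X3 as well, it holds b12•3 and b13•2. Every intermediate quantity is indeed a partial regression coefficient or a partial covariance.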

In a sense, our problems in interpreting regression coefficients are consequences of their simplicity when (X1, X2, … , Xp) are jointly normally distributed. In that case the conditional expectation of X1 given X2, … , Xp is exactly linear, with the partial regression coefficients as its coefficients, and everything works out so beautifully that we are seduced into thinking it applies more generally. But it doesn’t.

Next column: it’s why and how.