I really like multiple linear regression (MLR), even though I think that it must be the most widely misused of all statistical methods. There are so many different reasons why we might use it, and so many variations on linear least squares, that I feel MLR can be seen as a microcosm of statistics as a whole. At a conference recently I heard a speaker discuss MLRs with 15–20 variables. He spoke of model complexity, of functional forms, of whether or not variables should be selected, and he discussed model (in)stability and resampling techniques for diagnosing and improving models. All without stating a reason for doing MLR!

Why do we run MLRs? Let me reel off a few possible responses before commenting on why I think asking “why” matters. To summarize. To predict. To estimate a parameter. To attempt a causal analysis. To find a model. I hope it is clear that these are different reasons.

If you concede this, then perhaps you will agree that going through the same moves with a data set (y,X) to produce the familiar estimates

$\hat{\beta} = (X^{\prime} X)^{-1} X^{\prime} y$     and     $\widehat{\mathrm{var}}(\hat{\beta}) = (X^{\prime} X)^{-1} \hat{\sigma}^2$,

and doing all the standard regression diagnostics (the “core” approach) is unlikely to be the right thing in any of these cases. Sharpening the question is just as necessary when considering regression as it is with any other statistical analysis. At the end we will want to assess how well we have answered our question, and in doing so, we’ll go far beyond the standard formulae, in different ways with different questions.
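For concreteness, here is a minimal sketch of that "core" computation in Python with NumPy. The function name `ols_core`, the toy data, and the choice of $\hat{\sigma}^2$ as the residual sum of squares divided by $n - p$ are assumptions made for illustration; the text above does not specify them. Note that nothing in the code knows which of the five questions is being asked, which is exactly the point.

```python
import numpy as np

def ols_core(X, y):
    """Return (beta_hat, var_hat, sigma2_hat) for an ordinary least-squares fit.

    X is an n-by-p design matrix (include a column of ones for an intercept),
    y is a length-n response vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)        # (X'X)^{-1}
    beta_hat = XtX_inv @ X.T @ y            # (X'X)^{-1} X'y
    resid = y - X @ beta_hat
    # Assumption: the usual unbiased variance estimate with n - p degrees of freedom.
    sigma2_hat = resid @ resid / (n - p)
    var_hat = XtX_inv * sigma2_hat          # (X'X)^{-1} sigma_hat^2
    return beta_hat, var_hat, sigma2_hat

# Synthetic data, invented purely to exercise the formulas.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat, var_hat, sigma2_hat = ols_core(X, y)
std_errors = np.sqrt(np.diag(var_hat))
```

In practice one would solve the least-squares problem with a QR- or SVD-based routine such as `np.linalg.lstsq` rather than forming the inverse explicitly; the explicit inverse is used here only to mirror the formulas.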

Think of the world of difference between using a regression model for prediction and using one for estimating a parameter with a causal interpretation, for example, the effect of class size on school children’s test scores. With prediction, we don’t need our relationship to be causal, but we do need to be concerned with the relation between our training and our test set. If we have reason to think that our future test set may differ from our past training set in unknown ways, nothing, including cross-validation, will save us. When estimating the causal parameter, we do need to ask whether the children were randomly assigned to classes of different sizes, and if not, we need to find a way to deal with possible selection bias. If we have not measured suitable covariates on our children, we may not be able to adjust for any bias.
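The selection-bias worry can be made concrete with a small simulation. Everything in this sketch is synthetic and hypothetical: the variable `ability`, the effect size of -0.5 points per additional pupil, and the assignment rules are invented for illustration and are not taken from any study. It compares the same least-squares slope under random assignment, under selective assignment, and under selective assignment with the relevant covariate measured.

```python
import numpy as np

def slope_on_first(y, *columns):
    """Fit OLS with an intercept; return the coefficient on the first column."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(1)
n = 5000
true_effect = -0.5                  # assumed causal effect of class size on score
ability = rng.normal(size=n)        # hypothetical covariate, often unmeasured

def scores(size):
    """Test scores depend on class size (causally) and on ability."""
    return 70.0 + true_effect * size + 5.0 * ability + rng.normal(scale=2.0, size=n)

# Case 1: class size assigned at random, independent of ability.
size_random = 25.0 + 5.0 * rng.normal(size=n)
est_random = slope_on_first(scores(size_random), size_random)

# Case 2: stronger pupils tend to be placed in larger classes (selection).
size_selected = 25.0 + 3.0 * ability + 4.0 * rng.normal(size=n)
y_selected = scores(size_selected)
est_naive = slope_on_first(y_selected, size_selected)

# Case 3: same selected data, but ability was measured and is adjusted for.
est_adjusted = slope_on_first(y_selected, size_selected, ability)
```

Under these assumptions `est_random` and `est_adjusted` land near the true effect, while `est_naive` does not even get the sign right. If `ability` had not been measured, the third regression would not be available, and the standard formulae would give no warning.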

What’s my point here? I would like to see multiple regression taught as a series of case studies, each study addressing a sharp question, and focussing on those aspects of the topic that are relevant to that question. Instead, what happens all too often is that writers and instructors distil all uses of multiple linear regression down to the “core” mentioned above, and students come away not having seen the fascinating and important interplay between question, context, data and answer. It’s a “baby and bath-water” problem.

Who does it to my liking? I mentioned Mosteller & Tukey in my last piece on this topic, and once again I’m happy to say that they do a fine job on the different questions that lead us to MLR, with their own colorful terminology, e.g. regression to “set aside the effect of” a variable, to get the variable “out of the way,” or “regression as exclusion.” In their book Mostly Harmless Econometrics: An Empiricist’s Companion, Angrist and Pischke have a very nice chapter 3 entitled “Making Regression Make Sense.” Near the beginning of their book, they say that, “the most interesting research in social science is about cause and effect, such as the effect of class size on children’s test scores.”

How do we run regressions? Overwhelmingly, the answer is by using least squares, justified by the Gauss-Markov theorem. In a characteristically brilliant, though at times challenging, 1975 book chapter, “Instead of Gauss-Markov Least Squares, What?” Tukey deconstructs this theorem, and in so doing opens our eyes to the richness of our statistical world, in comparison with the poverty of the “core”. He views his task as “idol management.” After listing the seven “ifs” of the theorem, leading to the conclusion that the best estimate of any individual β or any linear combination of β’s is to be had by “least squares,” Tukey questions each “if” in turn, and uses each “to point a direction in which to move a suitable distance away from our idol.” In the discussion which follows, we meet nonlinear least squares, “minimizing potential” vs “balance of forces”, “indirect and imperfect” measurements, instrumental variables, weighting and misweighting, robustness via iterative reweighting, “insulation” and “transparency”, penalized regression and much more.
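To give one of these directions some shape, here is a rough sketch of robustness via iterative reweighting. It uses Huber-type weights with a MAD-based scale, which is a common textbook choice and not necessarily the scheme Tukey discusses; the function name `irls_huber`, the tuning constant 1.345, and the contaminated toy data are all assumptions for illustration.

```python
import numpy as np

def irls_huber(X, y, c=1.345, n_iter=100, tol=1e-8):
    """Iteratively reweighted least squares with Huber-type weights."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # start from plain least squares
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale: median absolute deviation, normalised for Gaussian errors.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0.0:
            break
        u = np.abs(resid) / scale
        # Weight 1 for small residuals, c/u for large ones (guarding against u = 0).
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        # Weighted least squares via rescaling rows by sqrt(weight).
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic line y = 1 + 2x with noise, then five grossly contaminated responses.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[-5:] += 30.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_robust = irls_huber(X, y)
```

On data like these the reweighted slope should stay much closer to 2 than the ordinary least-squares slope does. Whether that is an improvement still depends on the question: if the contaminated points are the phenomenon of interest, downweighting them is the wrong move.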

The beauty of Tukey’s approach to MLR is that it can be revisited at any time, and applied to other areas. Idol management should always be with us.