Not long ago, I extolled the virtues of Tukey’s deconstruction of the Gauss–Markov theorem. Earlier, I conjured up the image of Gauss carrying out the triangulation of Hanover. But I have never focussed on linear least squares, something we all recognize as being at the very heart of statistics. It’s time to do so, for I have always viewed the pioneering work of Legendre and Gauss in astronomy and geodesy with a sense of awe and wonderment, and I never lose interest in new developments. Their ideas have evolved into a huge part of our discipline, into tools which we all use every day. Here I must be brief; the justifications, theory and algorithms each have a vast literature of their own.

These days we introduce ordinary least squares (OLS) with vector and matrix notation as the minimization in $\beta$ of the sum of squares $\|y - X\beta\|^2$. However, many of Gauss’s problems involved minimizing $\|y - \beta\|^2$ in $\beta$ subject to linear constraints $C'\beta = d$. Indeed his constraints were usually non-linear, and so had to be linearized, and his observations usually had to be weighted as well. I’ve always thought it would be good for our students to meet these extra features quite early on, for example, by adjusting surveying measurements.
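
For readers who have not met the constrained form, here is a minimal sketch in modern notation (mine, not Gauss’s), assuming $C$ has full column rank. Minimizing $\|y - \beta\|^2$ subject to $C'\beta = d$ with a Lagrange multiplier $\lambda$ gives $\hat\beta = y - C\lambda$, with $\lambda$ chosen so that the constraint holds; that is,

$$\hat\beta = y - C(C'C)^{-1}(C'y - d),$$

the orthogonal projection of the raw observations $y$ onto the set of values satisfying the constraints. With a weight matrix $W$ one minimizes $(y - \beta)'W(y - \beta)$ instead, and the same argument gives $\hat\beta = y - W^{-1}C(C'W^{-1}C)^{-1}(C'y - d)$.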

I am not aware of many significant extensions to Gauss’s basic framework arising later in the 19th century, but I’d like to hear from you if you know of any. One exception is random effects models, which arose in astronomy.

A century after Gauss, new ideas started to appear. In an elegant, almost Bayesian 1923 paper, E.T. Whittaker gave “a new method of graduation”: the replacement of the observed values of a function with smoother ones estimated by penalized least squares, where the smoothness penalty was presented in the form of a prior distribution. Actuaries used these graduated values. I.J. Schoenberg later connected this to the theory of splines, part of which can now be viewed as linear mixed modelling.
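
In modern notation, and hedging a little on the details, Whittaker’s graduated values $f_1, \dots, f_n$ of the observations $y_1, \dots, y_n$ minimize a criterion of the form

$$\sum_{i=1}^{n} (y_i - f_i)^2 + \lambda \sum_{i} (\Delta^3 f_i)^2,$$

where $\Delta$ denotes differencing (third differences, as I recall) and $\lambda$ trades fidelity to the data against smoothness; in the Bayesian reading, the penalty corresponds to (minus) the log of a prior density on the $f_i$.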

In 1934, Whittaker’s student A.C. Aitken introduced generalized least squares with a known covariance matrix, presenting his results in the matrix notation we adopt today. In my view, this and his later 1945 paper are the first truly modern works on least squares, and both papers are well worth reading today. For example, Aitken gives an illuminating comparison between least squares estimation with exact constraints on the parameters and penalized least squares where a quadratic penalty partially imposes the constraint.
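
In today’s notation, for the model $y = X\beta + e$ with $\mathrm{E}(e) = 0$ and known covariance matrix $\mathrm{var}(e) = V$, Aitken’s generalized least squares estimate minimizes $(y - X\beta)'V^{-1}(y - X\beta)$, giving (assuming $X$ of full column rank and $V$ positive definite)

$$\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y, \qquad \mathrm{var}(\hat\beta) = (X'V^{-1}X)^{-1}.$$

OLS and weighted least squares are the special cases $V = \sigma^2 I$ and $V$ diagonal.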

In the first century of the life of least squares, probabilistic assumptions, if any, were made on the errors. From some time in the 20th century, perhaps earlier, least-squares-like theory was developed with general random variables, leading to minimum mean-square error estimates of relevant quantities (interpolants, predictors, smoothers). Such theory might use normality, or just means, variances and covariances. This step had certainly been made by World War II, when A.N. Kolmogorov and, independently, N. Wiener developed linear prediction theory for stationary time series. There were several wartime applications of this work. It is a genuine extension of Gauss’s least squares, but of a slightly different nature. Of course numbers enter the picture eventually, when the general theory is applied to actual data.
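
The basic object here, requiring only second moments, is the best linear (minimum mean-square error) predictor of a random quantity $Y$ from a random vector $Z$, namely (assuming $\mathrm{var}(Z)$ is invertible)

$$\hat Y = \mathrm{E}(Y) + \mathrm{cov}(Y, Z)\,\mathrm{var}(Z)^{-1}\big(Z - \mathrm{E}(Z)\big),$$

the population version of a least squares fit; under joint normality it coincides with $\mathrm{E}(Y \mid Z)$.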

In the 1950s the theory of mixed models was developed, spearheaded by the statistician and animal breeder C.R. Henderson. Generalized least squares for these models and the theory of best linear unbiased prediction (BLUP) were advances with broad applicability. With hindsight, many things are found to be BLUPs.
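
A minimal sketch, in one common notation and assuming the variance components are known: the linear mixed model is $y = X\beta + Zu + e$, with random effects $u$ and errors $e$ uncorrelated, $\mathrm{var}(u) = G$ and $\mathrm{var}(e) = R$. Writing $V = ZGZ' + R$, generalized least squares gives the estimate of the fixed effects, and the BLUP of the random effects is

$$\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}y, \qquad \tilde u = GZ'V^{-1}(y - X\hat\beta).$$

These are also the solutions of Henderson’s mixed model equations.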

Arguably the most important advance in least squares in the 20th century was the development of linear state-space models, the body of work associated with R.E. Kalman, R.S. Bucy and others, from around 1960. Application areas include engineering systems, satellite navigation, and Gauss’s topic, surveying, though these days using global positioning systems. The path from Gauss to Kalman is not a simple one, as the principal inspiration for Kalman’s work was the Wiener–Kolmogorov theory. However, his extension of least squares to non-stationary time series was an important step, and, just as Gauss showed, it can be justified with or without normal theory.
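
For concreteness, one minimal form of the linear state-space model (notation varies across fields) is

$$x_{t+1} = F_t x_t + w_t, \qquad y_t = H_t x_t + v_t,$$

where $x_t$ is an unobserved state, $y_t$ an observation, and $w_t$, $v_t$ are uncorrelated noise terms with known covariances. The Kalman filter computes, recursively in $t$, the best linear predictor of $x_t$ given $y_1, \dots, y_t$: least squares again, now organized sequentially.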

The 1970s saw several important variants and extensions of OLS appear, including ridge (regularized) regression, non-negative least squares, robust regression and generalized linear models. The last two make use of iteratively (re-)weighted least squares, showing that maximum-likelihood estimation for some important classes of models can be achieved using a form of least squares.
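
A sketch of the iterative scheme in the generalized linear model case, in one common presentation: with link $g$, current estimate $\beta^{(k)}$, linear predictors $\eta_i = x_i'\beta^{(k)}$ and fitted means $\mu_i = g^{-1}(\eta_i)$, one forms working responses and weights and solves a weighted least squares problem,

$$z_i = \eta_i + (y_i - \mu_i)\,g'(\mu_i), \qquad w_i = \big[g'(\mu_i)^2\,V(\mu_i)\big]^{-1}, \qquad \beta^{(k+1)} = (X'WX)^{-1}X'Wz,$$

where $V(\mu)$ is the variance function and $W = \mathrm{diag}(w_i)$; at convergence $\hat\beta$ is the maximum likelihood estimate. Robust regression uses a similar loop, with weights chosen to downweight large residuals.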

Sample developments in least squares since the 1970s include partial least squares, used widely in chemometrics; the lasso, which is OLS with an $L_1$ penalty; and the elastic net, OLS with both $L_2$ and $L_1$ penalties. I have probably omitted your favourite variant on least squares, but you can write and remind me. I wonder where least squares will go next?
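
(For the record, and hedging on the parametrization: with tuning constants $\lambda_1, \lambda_2 \ge 0$, the lasso minimizes $\|y - X\beta\|^2 + \lambda_1 \sum_j |\beta_j|$ and the elastic net minimizes $\|y - X\beta\|^2 + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$; ridge regression keeps only the quadratic penalty.)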