A Commentary on “The Kids Are Alright: Divide by n when estimating variance,” by Jeffrey S. Rosenthal, IMS Bulletin (December 2015), Vol. 44, No. 8, Page 9
Dear Editor,
Professor Rosenthal’s piece is persuasive and very clearly written. I thank Professor Rosenthal for taking us back to this old concern that never truly goes away. Indeed, the basic issue resurfaces every time one teaches a new cohort of students.
With nearly 40 years of teaching experience now, I have a different, but easy, way to explain why the divisor in the customary sample variance is $n - 1$ instead of $n$. Some readers may find the simple argument below, in favor of the traditional divisor $n - 1$, persuasive.
Suppose that I have a random sample $X_1, \ldots, X_n$ of size $n$ from a single population with population mean $\mu$. Customarily, in many elementary courses, $\mu$ is estimated by the sample mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Here, the divisor is $n$, and no one really objects to that idea.
Then comes the idea of variation around $\mu$. First, I explain why no one considers $E[X - \mu]$ as a measure of variation. The explanation is simple: $E[X - \mu] = 0$ under the population distribution. In other words, the positive and negative deviations of $X$ from $\mu$ cancel out on average.
Thus, many proceed to the next step.
Define the population variance $\sigma^2$ as $E[(X - \mu)^2]$, which is positive unless every observation equals $\mu$ with probability 1. After all, who wants to waste time and money collecting data where every data point is the same!
So, how should one estimate $\sigma^2$? Well, I begin with $\sum_{i=1}^{n}(X_i - \bar{X})^2$. But I note that $\sum_{i=1}^{n}(X_i - \bar{X})$ is identically zero for any set of $n$ numbers. That is, among the $n$ numbers (residuals) $X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X}$, exactly $n - 1$ are free to vary, since all $n$ residuals add up to zero: the remaining $n$th residual is fully determined by the other $n - 1$. Thus, to obtain the sample variance, one divides $\sum_{i=1}^{n} (X_i - \bar{X})^2$ by $n - 1$ instead of $n$. In this sense, $n - 1$ is customarily called the “degrees of freedom,” that is, an indication of how many among the $n$ residuals are truly free to vary.
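The zero-sum constraint on the residuals can be checked numerically. The following is a minimal sketch in Python; the function name `residuals` and the simulated data are illustrative assumptions, not part of the argument itself. It shows that once any $n - 1$ residuals are known, the last one is forced.

```python
import random

def residuals(xs):
    """Return the deviations of each value from the sample mean."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

# Hypothetical data: six draws from a normal population (assumed for illustration).
random.seed(0)
xs = [random.gauss(10.0, 2.0) for _ in range(6)]

res = residuals(xs)

# All n residuals add up to zero (up to floating-point rounding).
total = sum(res)

# The last residual is determined by the other n - 1 residuals.
recovered_last = -sum(res[:-1])

print(abs(total) < 1e-9)
print(abs(recovered_last - res[-1]) < 1e-9)
```

Any other data set would do equally well; the constraint holds for every set of $n$ numbers.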
In a first-year pre-calculus course that is often mandatory for all (or a large majority of) undergraduate students, an argument based on mean squared error (MSE) never really convinces our first-year undergraduates, since they have never heard of MSE before taking Stat 100 or Stat 110.
Especially for them, to keep the discussion painless, I take a very small set of numbers, say, 3, 4, 2, 4, 2 with $n = 5$. Obviously, $\bar{x} = 3$ and
$\sum_{i=1}^{n}(x_i - \bar{x}) = 0 + 1 - 1 + 1 - 1 = 0$
but
$\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0 + 1 + 1 + 1 + 1 = 4.$
Thus, the sample variance should be the customary
$s^2 = \frac{1}{4} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 1$.
The divisor is 4 instead of 5 because 4 is the “degrees of freedom,” as explained.
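The arithmetic in this small example can be reproduced with a short Python sketch. The variable names are illustrative choices; `statistics.variance` and `statistics.pvariance` are the standard-library functions that divide by $n - 1$ and by $n$, respectively.

```python
import statistics

data = [3, 4, 2, 4, 2]
n = len(data)

mean = statistics.mean(data)             # sample mean: 3
deviations = [x - mean for x in data]    # 0, 1, -1, 1, -1

sum_dev = sum(deviations)                # deviations add up to 0
sum_sq = sum(d ** 2 for d in deviations) # squared deviations add up to 4

s2 = sum_sq / (n - 1)                    # divisor n - 1 = 4 gives 1.0
s2_div_n = sum_sq / n                    # divisor n = 5 gives 0.8

print(mean, sum_dev, sum_sq, s2, s2_div_n)

# The standard library agrees with the hand calculation.
print(statistics.variance(data) == s2)
print(abs(statistics.pvariance(data) - s2_div_n) < 1e-12)
```

The two divisors give 1 and 0.8 here, so the difference is visible even in a five-point data set.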
Nitis Mukhopadhyay
Professor of Statistics
University of Connecticut, Storrs, USA