Contributing Editor Dimitris Politis writes:

Consider the standard setup where $X_1,\ldots,X_n$ are i.i.d. from distribution $F$ with mean $\mu = EX_i$, variance $\sigma^2 = E(X_i-\mu)^2 > 0$, skewness $\gamma = E(X_i-\mu)^3/\sigma^3$, and kurtosis $\kappa = E(X_i-\mu)^4/\sigma^4$ assumed finite. As usual, define the sample mean $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$, the sample variance $\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$, and the t-statistic $T = \sqrt{n}(\bar X - \mu)/\hat\sigma$.

At the beginning of the 20th century, the classical theory for the t and $\chi^2$ tests was developed by the statistical pioneers W.S. Gosset (“Student”), R.A. Fisher, and others. These were undoubtedly major breakthroughs at the time, but they are intimately tied to the assumption that $F$ is Gaussian, which is often hard to justify. Indeed, the true sampling distributions of $T$ and $(n-1)\hat\sigma^2$ can be quite different from the textbook $t_{n-1}$ and $\sigma^2\chi^2_{n-1}$ distributions, respectively, even under moderate non-normality.

Skewness and the t-statistic

Nowadays, when samples are typically large, practitioners may be reluctant to start out with an assumption as restrictive as normality, relying instead on a Central Limit Theorem (CLT). It is well known that the accuracy of the CLT for $\bar X$ depends on the skewness of the data. Under regularity conditions, an Edgeworth expansion yields [Eq. 1]
$$P\big(\sqrt{n}\,(\bar X - \mu)/\sigma \le x\big) = \Phi(x) + b(x)\,\gamma/\sqrt{n} + O(1/n)$$
for all $x$, where $\Phi(\cdot)$ is the standard normal distribution function and $b(\cdot)$ is a bounded function. Eq. [1] indicates that the distribution of $\bar X$ carries a skewness term of order $\gamma/\sqrt{n}$ that is not captured by the symmetric shape of the limiting normal distribution.

What is perhaps less well known is that skewness induces a strong dependence between the numerator and denominator of the t-statistic, thereby casting doubt on the accuracy of the t-tables; Hesterberg [5] brings out this point.

Figure 1 depicts a scatterplot of $\bar X$ vs. $\hat\sigma$ based on 250 simulated samples, each of size $n=30$, from $F = \frac{1}{2}\chi^2_2$, i.e., exponential; the correlation between $\bar X$ and $\hat\sigma$ is unmistakable.
Figure 1: Scatterplot of $\bar X$ vs. $\hat\sigma$ based on 250 exponential samples, each of size $n=30$.

One might think that this is a phenomenon particular to the exponential distribution, where $\bar X$ and $\hat\sigma$ estimate the same parameter. In fact, the phenomenon holds true for any skewed distribution $F$, and the correlation persists regardless of sample size. To elaborate, the $\delta$-method implies $\mathrm{Corr}(\bar X, \hat\sigma) \approx \mathrm{Corr}(\bar X, \hat\sigma^2) \to \gamma/\sqrt{\kappa-1}$ as $n \to \infty$. When $F$ is exponential, $\gamma = 2$, $\kappa = 9$, and $\mathrm{Corr}(\bar X, \hat\sigma) \approx 0.7$. Incidentally, this large-sample correlation calculation shows as a by-product the nontrivial general inequality $\gamma^2 \le \kappa - 1$, which is sharper than what Littlewood’s inequality implies; see p. 643 of DasGupta [2] with $r=4$, $s=3$ and $t=2$. Interestingly, when $F$ is Bernoulli($p$), the identity $\gamma^2 = \kappa - 1$ holds for any $p \in (0,1)$.
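
This large-sample correlation is easy to check by simulation; below is a minimal R sketch (the sample size, replication count, and seed are arbitrary choices):

    # Monte Carlo check of Corr(Xbar, sigma-hat) for exponential data;
    # the delta-method limit is gamma/sqrt(kappa - 1) = 2/sqrt(8), approx 0.707.
    set.seed(1)
    B <- 10000; n <- 200                # n large so the limit is nearly attained
    xbar <- sig <- numeric(B)
    for (b in 1:B) {
      x <- rexp(n)                      # F = (1/2)*chi^2_2, i.e., exponential with mean 1
      xbar[b] <- mean(x); sig[b] <- sd(x)
    }
    cor(xbar, sig)                      # should come out close to 0.707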

An Edgeworth expansion for the t-statistic yields [Eq. 2]
$$P\big(\sqrt{n}\,(\bar X - \mu)/\hat\sigma \le x\big) = \Phi(x) + c(x)\,\gamma/\sqrt{n} + O(1/n)$$
where $c(\cdot)$ is another bounded function. Figure 2 shows the true density of $T$, obtained via Monte Carlo simulation from an exponential sample of size $n=30$, with the $t_{29}$ density superimposed. The true density is skewed left, although visually this might not appear particularly striking. However, the true 2.5\% and 5\% quantiles of $T$ are $-2.84$ and $-2.28$ respectively, which are poorly approximated by the corresponding $t_{29}$ quantiles $-2.04$ and $-1.70$.
Figure 2: True density of the t-statistic calculated from an exponential sample of size $n=30$ (in blue), with the $t_{29}$ density superimposed (in black).
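
The quantiles quoted above are straightforward to reproduce by Monte Carlo, as in the following R sketch (the replication count and seed are arbitrary):

    # Monte Carlo approximation to the sampling distribution of
    # T = sqrt(n)*(Xbar - mu)/sigma-hat for exponential data with mu = 1.
    set.seed(1)
    B <- 1e5; n <- 30
    Tstat <- replicate(B, {x <- rexp(n); sqrt(n) * (mean(x) - 1) / sd(x)})
    quantile(Tstat, c(0.025, 0.05))   # approximately -2.84 and -2.28
    qt(c(0.025, 0.05), df = n - 1)    # t_29 gives -2.04 and -1.70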

Luckily, Efron’s bootstrap [3] can be applied to the studentized sample mean, i.e., the statistic $T$, yielding a better approximation. As expounded upon by Hall [4], the studentized bootstrap captures the skewness, i.e., [Eq. 3]
$$P^*\big(\sqrt{n}\,(\bar X^* - \bar X)/\hat\sigma^* \le x\big) = \Phi(x) + c(x)\,\gamma/\sqrt{n} + O_P(1/n)$$
where $\bar X^* = \frac{1}{n}\sum_{i=1}^n X_i^*$, $\hat\sigma^{*2} = \frac{1}{n-1}\sum_{i=1}^n (X_i^* - \bar X^*)^2$, and the re-sample $X_1^*,\ldots,X_n^*$ is obtained by i.i.d. sampling from the empirical distribution of $X_1,\ldots,X_n$. Let $q^*(\alpha)$ denote the $\alpha$-quantile of the LHS of Eq. [3], which serves as an estimator of $q(\alpha)$, the $\alpha$-quantile of the LHS of Eq. [2]. In the exponential example with $n=30$, the expected values of $q^*(0.025)$ and $q^*(0.050)$ are (approximately) $-2.90$ and $-2.31$, i.e., very close to the true ones.
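
In practice, $q^*(\alpha)$ is computed by Monte Carlo over the re-samples; a minimal R sketch for a single observed data set follows (B = 2000 and the seed are arbitrary choices):

    # Studentized (bootstrap-t) quantiles from one observed sample x.
    set.seed(1)
    n <- 30; x <- rexp(n)                   # the data at hand
    xbar <- mean(x); B <- 2000
    Tstar <- replicate(B, {
      xs <- sample(x, n, replace = TRUE)    # i.i.d. draws from the empirical distribution
      sqrt(n) * (mean(xs) - xbar) / sd(xs)  # bootstrap analogue of T
    })
    quantile(Tstar, c(0.025, 0.05))         # estimates of q(0.025) and q(0.05)

The resulting quantiles can then replace the $t_{29}$ table entries in tests and confidence intervals for $\mu$.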

Kurtosis and asymptotic failure of the $\chi^2$ test

The saving grace for Student’s t approximation is that $\hat\sigma \to \sigma$ almost surely as $n \to \infty$. Since a constant is independent of all random variables, $\hat\sigma$ will gradually, i.e., asymptotically, become independent of $\sqrt{n}(\bar X - \mu)$, and the $t_{n-1}$ distribution tends to the standard normal. Hence, Student’s t approximation to the distribution of $T$ is asymptotically valid for non-Gaussian data (with finite variance).

The $\chi^2$ test does not even have this excuse: it is asymptotically {\it invalid} when $\kappa \ne 3$. This might appear surprising, since the $\chi^2_m$ distribution is asymptotically normal when $m$ is large; however, it has the wrong asymptotic variance. To see why, let $Y_i = (X_i - \mu)^2$ and $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$. Then, a CLT on the i.i.d. variables $Y_1,\ldots,Y_n$ yields [Eq. 4]
$$P\big(\sqrt{n}\,(\bar Y - \sigma^2) \le x\big) \to \Phi(x/\tau) \quad \text{as } n \to \infty$$
where $\tau^2 = (\kappa - 1)\sigma^4$. But $\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n (X_i - \mu)^2 + V_n$ with $V_n = O_P(1)$, i.e., bounded in probability. Hence, Eq. [4] implies Eq. [5]
$$P\big(\sqrt{n}\,(\hat\sigma^2 - \sigma^2) \le x\big) \to \Phi(x/\tau) \quad \text{as } n \to \infty.$$

Recall that a $\chi^2_m$ variable has mean $m$ and variance $2m$. Hence, the $\chi^2_{n-1}$ approximation to the distribution of $(n-1)\hat\sigma^2/\sigma^2$ is associated with a large-sample approximation analogous to Eq. [5] but with $\tau^2$ replaced by $2\sigma^4$; this is only admissible when the kurtosis $\kappa$ equals 3, as it does in the Gaussian case. If $\kappa > 3$, the $\chi^2$ approximation underestimates the variance of $\hat\sigma^2$, leading to confidence intervals whose coverage is smaller than the nominal level. In the exponential case, where $\kappa = 9$ and hence $\tau^2 = 8\sigma^4$, the $\chi^2$ approximation effectively estimates the variance of $\hat\sigma^2$ to be only $1/4$ of what it truly is.
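
The mismatch is easy to exhibit numerically; a minimal R sketch (sample size, replication count and seed are arbitrary):

    # True variance of sqrt(n)*(sigma-hat^2 - sigma^2) for exponential data
    # (kappa = 9, sigma^2 = 1, so tau^2 = 8) vs. the value 2*sigma^4 = 2
    # implicit in the chi-square approximation.
    set.seed(1)
    B <- 10000; n <- 200
    s2 <- replicate(B, var(rexp(n)))   # sigma-hat^2 from each sample
    var(sqrt(n) * (s2 - 1))            # close to 8, i.e., four times 2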

Eq. [5] is valid in general, but in order to be practically useful for tests and confidence intervals it needs a consistent estimate of $\tau^2$ to be plugged in, e.g., $\hat\tau^2 = (\hat\kappa - 1)\hat\sigma^4$ where $\hat\kappa$ is the sample kurtosis. Furthermore, finite-sample accuracy can be improved via a skewness-reducing transformation such as the logarithmic one; Eq. [5] together with the $\delta$-method yields Eq. [6]
$$P\big(\sqrt{n}\,(\log\hat\sigma^2 - \log\sigma^2) \le x\big) \to \Phi(x/v) \quad \text{as } n \to \infty$$
with $v^2 = \tau^2/\sigma^4 = \kappa - 1$. In addition, Eq. [5] and Eq. [6] imply that the distributions of $\hat\sigma^2$ and/or $\log\hat\sigma^2$ can be bootstrapped. Focusing on Eq. [6], one may readily approximate the quantiles of the distribution $P(\sqrt{n}(\log\hat\sigma^2 - \log\sigma^2) \le x)$ by the respective quantiles of the bootstrap distribution $P^*(\sqrt{n}(\log\hat\sigma^{*2} - \log\hat\sigma^2) \le x)$. Studentized versions of Eq. [5] and [6] can also be derived, whose bootstrap approximations may give higher-order refinements.
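
For example, a confidence interval for $\sigma^2$ based on Eq. [6] with bootstrapped quantiles can be coded along the following lines (a minimal sketch; B and the seed are arbitrary):

    # 95% CI for sigma^2 via Eq. [6]: bootstrap the distribution of
    # sqrt(n)*(log sigma-hat*^2 - log sigma-hat^2).
    set.seed(1)
    n <- 30; x <- rexp(n); s2 <- var(x); B <- 2000
    Lstar <- replicate(B, {
      xs <- sample(x, n, replace = TRUE)
      sqrt(n) * (log(var(xs)) - log(s2))
    })
    q <- quantile(Lstar, c(0.025, 0.975))
    exp(log(s2) - q[2:1] / sqrt(n))    # back-transform to the sigma^2 scale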

Classroom strategies

It is important to convey to students that there is nothing wrong with the $T$ and $\hat\sigma^2$ statistics per se; the issue is how to accurately approximate their sampling distributions without relying on the often unjustifiable assumption of Gaussianity. Luckily, simulation and resampling have been slowly finding their way into the undergraduate classroom (see, for example, the textbooks [1], [6] and [7] in the references below). However, in their first course on Mathematical Statistics, students may find the notion of resampling from the empirical distribution hard to grasp.

The following steps may prove useful in easing undergraduates into the matter.

1. Introduce Monte Carlo simulation from the start as an alternative, computer-intensive methodology to approximate probabilities and expectations, and to create quantile tables.
2. Employ the parametric bootstrap – which is easy to motivate – in order to wean students off reliance on textbook tables (see the sketch after this list).
3. Finally, dispose of the parametric framework altogether, and show students how to generate quantiles for the $T$ and $\hat\sigma^2$ statistics by resampling from a histogram of the $X_1,\ldots,X_n$ data, used as an approximation to the underlying true density (or mass) function.
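
As an illustration of step 2, here is a minimal parametric-bootstrap sketch, assuming (for the sake of the example) an exponential model fitted by maximum likelihood:

    # Parametric bootstrap quantiles for T under a fitted exponential model.
    set.seed(1)
    n <- 30; x <- rexp(n)
    rate.hat <- 1 / mean(x)               # MLE of the exponential rate
    B <- 2000
    Tstar <- replicate(B, {
      xs <- rexp(n, rate = rate.hat)      # simulate from the fitted model
      sqrt(n) * (mean(xs) - 1 / rate.hat) / sd(xs)
    })
    quantile(Tstar, c(0.025, 0.05))       # parametric-bootstrap quantiles for T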

If $F$ is (absolutely) continuous, resampling from the histogram is closely related to the smoothed bootstrap; however, it still works when $F$ is discrete, as long as the bins are small enough. To elaborate, suppose the histogram of $X_1,\ldots,X_n$ is based on $m$ bins centered at midpoints $v_1,\ldots,v_m$ that are collected in a vector V, with their respective counts collected in a vector Vcounts. Basic resampling can then be performed in R by letting Xstar = sample(V, size=n, replace=TRUE, prob=Vcounts/sum(Vcounts)); note that both the midpoints V and the counts Vcounts are part of the list output of the R function hist(). If $F$ is continuous, one can then add a uniform “jitter” to each coordinate of Xstar; i.e., the final bootstrap sample (from which $T$ and $\hat\sigma^2$ are to be re-computed) would be given by Xstar + Ujit, where Ujit = runif(n, min=-b/2, max=b/2) and $b = v_{j+1} - v_j$ is the bin width.
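
Putting the pieces together, a minimal end-to-end version of the histogram resampling might read as follows (the bin choice is left to hist’s default; the seed and sample size are arbitrary):

    # Histogram-based resampling with uniform jitter for continuous F.
    set.seed(1)
    n <- 30; x <- rexp(n)
    h <- hist(x, plot = FALSE)     # h$mids = bin midpoints, h$counts = bin counts
    V <- h$mids; Vcounts <- h$counts
    b <- V[2] - V[1]               # bin width (hist uses equal-width bins by default)
    Xstar <- sample(V, size = n, replace = TRUE, prob = Vcounts / sum(Vcounts))
    Ujit  <- runif(n, min = -b/2, max = b/2)
    Xboot <- Xstar + Ujit          # final bootstrap sample; re-compute T and sigma-hat^2 from it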

References

[1] Chihara, L.M. and Hesterberg, T.C. (2012). Mathematical Statistics with Resampling and R, John Wiley, New York.

[2] DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.

[3] Efron, B. (1979). Bootstrap methods: another look at the jackknife, Ann. Statist., 7, pp. 1-26.

[4] Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer, New York.

[5] Hesterberg, T.C. (2015). What teachers should know about the bootstrap: resampling in the undergraduate statistics curriculum, American Statistician, 69, pp. 371-386.

[6] Larsen, R.J. and Marx, M.L. (2012). An Introduction to Mathematical Statistics and its Applications, (5th ed.), Prentice Hall, New York.

[7] Moore, D.S., McCabe, G.P. and Craig, B.A. (2012). Introduction to the Practice of Statistics, (8th ed.), WH Freeman, New York.