Charles M. Stein, one of the most original statisticians and probabilists of the 20th century, died in Fremont, California on November 24, 2016. Stein’s paradox, showing that the classical least squares estimates of several parameters can be usefully improved by combining seemingly unrelated pieces of information, is one of the most surprising and useful contributions of decision theory to statistical practice. Stein’s method for proving approximation theorems, such as the central limit and Poisson approximation theorems for complicated, dependent sums of random variables, now permeates modern probability.

Born in Brooklyn, New York in 1920, Stein was a prodigy who started at the University of Chicago at age 16. There, he fell under the spell of abstraction through Saunders Mac Lane and Adrian Albert. His first applied statistical work was done making weather forecasts during World War II. Working with another youngster, Gil Hunt, Stein studied the interaction of invariance and accuracy, a lifetime theme.

Suppose that $P_\theta(dx), \theta \in \Theta$ is a family of probabilities on a space $X$, and suppose that a group $G$ acts on $X$, taking $x$ to $x^g$ (think of taking Fahrenheit to Celsius). The group is said to act on the family if, for every $g$, there is a $\bar{g}$ so that $P_\theta(x^g)=P_{\theta^{\bar{g}}}(x)$. An estimator $\hat{\theta}(x)$ is equivariant if $\hat{\theta}(x^g)=\hat{\theta}(x)^{\bar{g}}$. If $L(\theta,\hat{\theta})$ is a loss function, an estimator $\theta^*$ is minimax if
$$\inf_{\hat{\theta}}\sup_{\theta} E\,L(\theta,\hat{\theta}) = \sup_{\theta} E\,L(\theta,\theta^*).$$
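To make the definitions concrete, here is the standard textbook illustration (our example): take $P_\theta$ to be $n$ independent observations from $N(\theta,1)$, so $X=\mathbb{R}^n$, and let $G$ be the translations $x^g = x + c$. Then $\bar{g}$ takes $\theta$ to $\theta + c$, and the sample mean $\hat{\theta}(x)=\bar{x}$ is equivariant, since $\hat{\theta}(x+c)=\bar{x}+c=\hat{\theta}(x)^{\bar{g}}$.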

The Hunt–Stein theorem shows that if an estimator $\theta^*$ is minimax among all equivariant estimators, then it is minimax among all estimators, provided that the group is amenable. This is a remarkable piece of work: often it is straightforward to write down all equivariant estimators and find the best one. That such an estimator enjoys global optimality, and that amenability of the group should intervene, is remarkable. That it was done under wartime conditions by two college kids is astounding.

Stein’s work on invariance under a group energized Erich Lehmann, who wrote up the Hunt–Stein theorem in the first edition of his testing book (the original manuscript is lost) and established equivariance as a general statistical principle.

After the war, Stein entered graduate school at Columbia to work with Abraham Wald. Following Wald’s tragic death in an airplane accident, Stein’s thesis was read by Harold Hotelling and Ted Anderson.

Stein’s thesis [1] solved a problem posed by Neyman: find a fixed-width confidence interval for a normal mean when the variance is unknown. The usual t-interval has random width governed by the sample standard deviation, and George Dantzig had proved that no such confidence interval exists based on a fixed sample of size $n$. Stein introduced a two-stage procedure: a preliminary sample is taken, this is used to determine the size of a second sample, and the combined sample finally yields the estimator. Combining these ideas to get an exact procedure takes a very original piece of infinite-dimensional analysis, still impressive 70 years later.
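For readers who want to see the mechanics, here is a minimal simulation sketch of the two-stage idea in its standard textbook form (the code, the pilot size, and all parameter names are our choices, not Stein’s exact construction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def stein_two_stage(draw, n0=15, half_width=0.5, alpha=0.05):
    """Two-stage fixed-width confidence interval, in the spirit of [1].

    draw(k) returns k observations from N(theta, sigma^2) with sigma
    unknown.  The t quantile uses only the pilot's n0 - 1 degrees of
    freedom, which is what protects the coverage even though the
    total sample size is random.
    """
    pilot = draw(n0)
    s2 = pilot.var(ddof=1)                       # pilot variance estimate
    t = stats.t.ppf(1 - alpha / 2, df=n0 - 1)
    n = max(n0, int(np.ceil(s2 * t ** 2 / half_width ** 2)))
    rest = draw(n - n0) if n > n0 else np.empty(0)
    xbar = np.concatenate([pilot, rest]).mean()
    return xbar - half_width, xbar + half_width

# Coverage check; theta = 2 and sigma = 3 are arbitrary test values.
theta, sigma = 2.0, 3.0
draw = lambda k: rng.normal(theta, sigma, size=k)
hits = sum(lo <= theta <= hi
           for lo, hi in (stein_two_stage(draw) for _ in range(2000)))
print(f"empirical coverage: {hits / 2000:.3f}")  # at least 0.95
```

The point the simulation makes concrete is that the interval’s width is fixed in advance; only the sample size is random.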

Stein taught at UC Berkeley from 1947 to 1950. Having refused to sign Berkeley’s loyalty oath during the McCarthy era, he moved to Chicago and then, in 1953, to Stanford University’s Department of Statistics, where he spent the rest of his career.

A celebrated contribution to decision theory is Stein’s necessary and sufficient condition for admissibility. Roughly, this says that any admissible procedure in a statistical decision problem is a limit of Bayes rules. The setting is general enough to encompass both estimation and testing. Stein had a lifetime aversion to the “cult of Bayes”: in the Statistical Science interview with DeGroot [2] he said, “The Bayesian point of view is often accompanied by an insistence that people ought to agree to a certain doctrine, even without really knowing what that doctrine is”. He told us that it took him five years to publish, until he could find a non-Bayesian proof of the result. He softened in later years: discussing his estimate of the multivariate mean, for which the theory allows shrinkage towards any point, Stein said, “I guess you might as well shrink towards your best guess at the mean.” He made one further philosophical point to us regarding his theorem: the theory says that good estimators are Bayes, but it is perfectly permissible to use one prior to estimate one component of a vector of parameters and a completely different prior to estimate other coordinates. For Stein, priors could suggest estimators, but their properties should be understood through the mathematics of decision theory.

Stein contributed to several other areas of statistics. “Stein’s lemma” [3] for bounding the tails of stopping times in sequential analysis is now a standard tool. In nonparametrics, he showed that to estimate $\theta$ given $X_i=\theta+\epsilon_i$, where the law of $\epsilon_i$ is unknown, one can first estimate the law of $\epsilon$ by a nonparametric density estimate and then combine it with a Pitman-type estimator; the resulting procedure has remarkable optimality properties.

The “Sherman–Stein–Cartier–Fell–Meyer” theorem formed the basis of Stein’s (unpublished) Wald lecture. This started the healthy field of comparison of experiments, brilliantly developed by Lucien Le Cam [4]. Two brief but telling notes [5, 6] show Stein’s idiosyncratic use of clever counterexamples to undermine preconceptions that everyone believed to be true.

Throughout his statistical work, Stein preferred “properties over principles”. Here is the way he explained the content of his shrinkage estimate to us: if you ask working scientists whether estimators should obey natural transformation rules (changing from feet to meters should change the estimate accordingly), they will agree this is mandatory. Most would also prefer an estimator which always has a smaller expected error. Stein’s paradox shows that these two principles are incompatible (shrinkage estimators are not equivariant).
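The dominance half of this incompatibility is easy to check numerically. Below is a minimal sketch (our code; the positive-part variant and all parameter choices are ours) of a James–Stein-type shrinkage estimator beating the usual estimator in total squared error for a ten-dimensional normal mean:

```python
import numpy as np

rng = np.random.default_rng(1)

def james_stein(x):
    """Positive-part James-Stein estimator, shrinking toward the origin.

    For x ~ N(theta, I_p) with p >= 3, shrinkage dominates the usual
    estimator (x itself) in total squared-error risk.
    """
    p = x.size
    factor = max(0.0, 1.0 - (p - 2) / np.dot(x, x))
    return factor * x

p, reps = 10, 20000
theta = np.linspace(-1.0, 1.0, p)            # an arbitrary true mean
x = rng.normal(theta, 1.0, size=(reps, p))

risk_usual = ((x - theta) ** 2).sum(axis=1).mean()
risk_js = np.mean([((james_stein(row) - theta) ** 2).sum() for row in x])
print(f"usual risk ~ {risk_usual:.2f}   James-Stein risk ~ {risk_js:.2f}")
```

Shrinking toward the origin is only a convenience here; shrinking toward any fixed point, including one’s “best guess at the mean”, gives the same dominance.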

Roughly, the second half of Stein’s research career was spent developing a new method of proving limit theorems (usually with explicit finite-sample error bounds): what is now called Stein’s method. This separation is artificial, because Stein saw the subjects of statistics and probability as intertwined. For example, in working out better estimates of an unknown covariance matrix, Stein discovered, independently, Wigner’s semi-circle law for the eigenvalues of the usual sample estimator. His estimator shrank those eigenvalues to make them closer to the true eigenvalues, and he then proved that this estimator beats the naive one.

Stein’s method of exchangeable pairs seems to have been developed as a new way of proving Hoeffding’s combinatorial central limit theorem, whose starting point is a non-random $n \times n$ matrix $A$. Form the random diagonal sum $W_{\pi}=\sum_{i=1}^n A_{i\pi(i)}$, where $\pi$ is a uniformly chosen permutation of $\{1,2,\ldots,n\}$. Under mild conditions on $A$, this $W_\pi$ has an approximately normal limit. This unifies the normality of sampling without replacement from an urn, limit theorems for standard nonparametric tests, and much else. Stein compared the distributions of $W_\pi$ and $W_{t\pi}$, where $t$ is a random transposition. These differ by a small amount, and he was able to show that this mimicked his famous characterization of the normal: a random variable $W$ has a standard normal distribution if and only if $E(Wf(W))=E(f'(W))$ for all smooth bounded $f$. The $W_\pi$ satisfy this identity approximately, and Stein proved that this was enough. The earliest record of this work is in class notes taken by Lincoln Moses in 1962 (many faculty regularly sat in on Stein’s courses). His first publication on this approach used Fourier analysis and is recorded in [7]. The Fourier analysis was dropped and the method expanded into a definitive theory, published as a book by the IMS in its monograph series [8].
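As a small numerical companion (our illustration; the matrix, its size, and all names are arbitrary choices), the sketch below samples $W_\pi$, standardizes it with the classical mean and variance formulas for the combinatorial central limit theorem, builds one step of the exchangeable pair $(W_\pi, W_{t\pi})$, and measures the distance to the normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n = 50
A = rng.uniform(0.0, 1.0, size=(n, n))       # an arbitrary fixed matrix

def W(perm):
    """Hoeffding's statistic: one entry per row, chosen by perm."""
    return A[np.arange(n), perm].sum()

# Exact mean and variance of W under a uniform random permutation
# (classical formulas: center A by row, column and grand means).
Ahat = A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()
mu, var = n * A.mean(), (Ahat ** 2).sum() / (n - 1)

samples = [(W(rng.permutation(n)) - mu) / np.sqrt(var) for _ in range(5000)]

# One step of the exchangeable pair: apply a random transposition t.
pi = rng.permutation(n)
i, j = rng.choice(n, size=2, replace=False)
tpi = pi.copy()
tpi[[i, j]] = tpi[[j, i]]
print("W(t.pi) - W(pi) =", W(tpi) - W(pi))   # a small, O(1) perturbation

# Kolmogorov-Smirnov distance of the standardized W_pi to N(0, 1).
print("KS distance to N(0,1):", stats.kstest(samples, "norm").statistic)
```

The smallness of this one-step change, combined with the exchangeability of the pair, is what forces $E(Wf(W)) \approx E(f'(W))$ and hence approximate normality.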

Two important parts of Stein’s life were family and politics. Charles met Margaret Dawson while she was a graduate student at UC Berkeley. She shared his interest in statistics; they translated A. A. Markov’s letters to A. A. Chuprov [9] and worked together as political activists. While Charles almost always had his head in the clouds, Margaret made everything work and guided many of his professional decisions. Their three children, Charles Jr., Sarah and Anne, grew up to be politically active adults. Margaret passed away a few months before her husband. He is survived by two daughters: Sarah Stein, her husband, Gua-su Cui, and their son, Max Cui-Stein, of Arlington, Massachusetts; by Anne Stein and her husband, Ezequiel Pagan, of Peekskill, NY; and by his son Charles Stein Jr. and his wife, Laura Stoker, of Fremont, California.

Politics of a very liberal bent were a central part of the Steins’ world. Charles led protests against the war (and was even arrested for it); Margaret was a singing granny and a community organizer. The family traveled to the Soviet Union, and the kids went around to churches and schools upon returning to try to humanize the USSR’s image.

Charles shared his ideas and expertise selflessly. He read our papers, taught us his tools and inspired all of us by his integrity, depth and humility. All of our worlds are a better place for his being.


Written by Persi Diaconis and Susan Holmes, Stanford, CA

References:

[1] Stein, C.M. (1953) A two-sample test for a linear hypothesis having power independent of the variance. PhD thesis, Columbia University.
[2] DeGroot, M.H. (1986) A conversation with Charles Stein. Statistical Science 1: 454–462.
[3] Stein, C. (1946) A note on cumulative sums. Ann Math Statist 17: 498–499.
[4] Le Cam, L., Yang, G.L. (2012) Asymptotics in statistics: some basic concepts. Springer Science & Business Media.
[5] Stein, C. (1959) An example of wide discrepancy between fiducial and confidence intervals. Ann Math Statist 30: 877–880.
[6] Stein, C. (1962) A remark on the likelihood principle. J Roy Statist Soc Ser A 125: 565–568.
[7] Stein, C. (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 2: 583–602.
[8] Stein, C. (1986) Approximate computation of expectations. IMS Lecture Notes–Monograph Series, vol. 7.
[9] Markov, A., Chuprov, A. (1971) The correspondence on the theory of probability and mathematical statistics. Springer-Verlag. Translated from the Russian by Charles M. and Margaret D. Stein.