Some thoughts about the relations between statistics and probability theory

Erwin Bolthausen, University of Zürich and Kyoto University, delivered this IMS Presidential Address at the Joint Statistical Meetings in Seattle, WA, on August 10, 2015.

If one opens any scientific work on a topic where statistics plays a role, there are usually probabilistic concepts behind it. How did it come about, then, that probability theory and statistics have become more and more separated in research? The answer is to some extent evident:

• Probability theory has nowadays many relations with other mathematical fields, and also with applied fields outside statistics.
• For modern statistics, probability is just one crucial basis, but there are many more, often non-mathematical ones. For instance, one has to decide which probabilistic models lead to computationally feasible procedures and still mirror reality closely enough. This cannot be answered by probability theory.

Nonetheless, I would argue that the separation of the fields has become deeper than is healthy, and furthermore, that IMS should be more active in bridging the gap.

When I started as IMS President-elect two years ago, I became fully aware of how far statistics and probability had moved apart in research, and that the scientific communication between the groups, also within IMS, has been reduced to a trickle. Of course, I already knew that this interaction had become weaker over the years. For instance, until about 30 years ago, there was a joint statistics-probability meeting every year in Oberwolfach, on a theme of common interest, and I attended them for many years. That went on until the mid-eighties, and then Oberwolfach stopped them. Conferences of this type still exist, but at least in Oberwolfach or in Banff, they have become very rare. The Joint Statistical Meetings, which are the main meetings for the IMS in odd years, no longer have any probability sections.

I am far from blaming just the statisticians. There is also a declining interest of probabilists in statistics. The great majority of them, at least in Europe, know nothing about statistics, except perhaps least squares, which they might have learned in a basic linear algebra course.

I think that in the modern development of probability, the relations with pure mathematics and with mathematical physics have become stronger than those with statistics. Many probabilists, including me, have found the problems from mathematical physics very appealing. In parallel, closer connections also developed with other mathematical fields, like algebra, matrix theory, representation theory, complex analysis, and differential geometry, to name only a few. The relations with differential equations have always been very close. Probability theory is now much more present than 20 years ago in pure mathematical institutions, like the Ecole Normale Supérieure, the elite school in France, where some of the scientific directors of the mathematics school in the last decades were probabilists. It is also much more present at the International Congresses of Mathematicians.

The lack of communication will create problems for IMS. For instance, we certainly don’t want to have fixed quotas of the form: so many of the IMS fellows selected have to be probabilists, and so many statisticians. If we don’t want that, there has to be a rational discussion inside the committee, and a mutual basis of understanding. This would mean that the committee members should ideally have some view of the whole field, but it has become fairly difficult to find such people. Also, if young probabilists are not joining IMS, then in the long run we will have a problem with handling the probability journals, for instance in filling committees for choosing the editors.

I would now like to present and briefly discuss some of the most interesting recent developments in probability theory, with a personal bias of course, and comment on their (possible) relations with statistics. But to set up the historical perspective, first a look back in time.

It is actually not so well known that Bernoulli’s law of large numbers (LLN) was motivated by thoughts of a statistical nature. This is only marginally present in the posthumously published “Ars conjectandi”, but it is revealed in an exchange of letters Bernoulli had with Leibniz in 1703, two years before he died, which I found extraordinarily interesting. Bernoulli thought of using the LLN as a foundation for obtaining, through repeated measurements, better and better approximations of unknown probabilities, and he compares the problem with the possibility of approximating $π$. Although he was very proud of his mathematical proof of the LLN, he writes to Leibniz that every fool would know the truth of it anyway, and that his main aim is to apply it to real-world problems like estimating “true” survival probabilities. He probably didn’t have a precise concept of confidence intervals, but it is also clear from his letter that he was thinking intensively about this. However, he died before he could develop his ideas further. Leibniz was actually not at all convinced of the concept.

A century later, much of probability theory was still motivated by statistical considerations. The method of least squares, invented independently by Gauss and Legendre, entered the scientific world with a spectacular success, namely Gauss’s prediction of the position of Ceres.

Some of the 20th century probability theory has of course very close connections with statistics, for instance branching processes, measure valued processes, and coalescent processes, which are still very much alive today, and which play a fundamental role in biostatistics. Then Markov processes with many applications in statistics and computer science, and empirical process theory with its motivation from goodness-of-fit and other statistical problems.

The most recent, and perhaps most spectacular, developments in probability theory are however only very loosely connected with statistics, if at all, and were barely motivated by statistical questions. I briefly discuss some of the main themes in modern probability theory: a personally biased selection, with a view towards applications.

Stochastic analysis and martingale theory:

This had many applied sources: finance mathematics (Bachelier), games, and insurance. There are close relations with other mathematical fields, mainly harmonic analysis, partial differential equations, differential geometry, and others. Today, it plays a huge role in finance mathematics.

A recent and very powerful advance came through the rough path theory initiated by Terry Lyons. It is basically a deterministic theory. One of the main achievements is that one can write solutions of SDEs as continuous deterministic functionals of the Brownian path on an enlarged path space. The key point is an extension of the path space by the “signature”, which comes from iterated integrals. On the enlarged space, with an appropriate topology, the solutions of the SDE are smooth functionals. This has now found wide applications, also in statistics and machine learning, with some quite spectacular successes. For instance, Benjamin Graham won a competition on automatic Chinese character identification with a method based on rough paths. Rough paths are now also heavily used in finance mathematics.
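To make the notion of iterated integrals concrete, here is a minimal sketch in Python that computes the first two levels of the signature of a piecewise-linear path. The function name `signature_level2` and the test path are hypothetical choices made for this illustration, not taken from the rough path literature; the update rule is Chen's identity for appending one straight segment to a path.

```python
import numpy as np

def signature_level2(points):
    """Levels 1 and 2 of the signature of the piecewise-linear path through `points`.

    Hypothetical helper for illustration: level 1 is the total increment,
    level 2 is the matrix of iterated integrals  int int_{s<t} dx^i_s dx^j_t.
    """
    points = np.asarray(points, dtype=float)
    increments = np.diff(points, axis=0)   # one row per straight segment
    d = points.shape[1]
    level1 = np.zeros(d)                   # increment accumulated so far
    level2 = np.zeros((d, d))              # iterated integrals accumulated so far
    for delta in increments:
        # Chen's identity for concatenating the path so far with one segment
        level2 += np.outer(level1, delta) + 0.5 * np.outer(delta, delta)
        level1 += delta
    return level1, level2

# Unit square traversed counterclockwise, starting and ending at the origin
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
s1, s2 = signature_level2(square)

# The antisymmetric part of level 2 is the Levy area (signed enclosed area)
levy_area = 0.5 * (s2[0, 1] - s2[1, 0])
print("level 1:", s1)
print("level 2:\n", s2)
print("Levy area:", levy_area)
```

For this closed loop the first level vanishes, so all the information sits in the second level: its antisymmetric part records the area enclosed by the path. This is the kind of extra information the enlarged path space carries.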

This was very recently pushed much further by the regularity structures of Martin Hairer, for which he obtained the Fields Medal in 2014, and which make sense of ill-posed stochastic PDEs that have solutions only after renormalization by infinite “counterterms”. An example is the KPZ equation:

$\partial _{t}h=\partial _{x}^{2}h+(\partial _{x}h)^{2}+\text{white noise}.$

The white noise is in space–time. This equation does not make sense as it stands because $h$ has to be a distribution, and one cannot square a distribution. The basic idea for the renormalization comes from quantum field theory. There is presently a lot of interest in the so-called KPZ universality class, which encompasses models, some from experimental physics, for which the scaling limits are described by the KPZ equation.
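To indicate what renormalization by a counterterm means here, a rough sketch, assuming the noise is first smoothed at a small scale $\varepsilon$ (the symbols $\xi _{\varepsilon }$, $h_{\varepsilon }$ and $C_{\varepsilon }$ are notation introduced only for this illustration): one solves

$\partial _{t}h_{\varepsilon }=\partial _{x}^{2}h_{\varepsilon }+(\partial _{x}h_{\varepsilon })^{2}-C_{\varepsilon }+\xi _{\varepsilon },$

with constants $C_{\varepsilon }\rightarrow \infty $ as $\varepsilon \rightarrow 0$, chosen so that $h_{\varepsilon }$ converges to a limit. Formally, the substitution $Z=e^{h}$ turns the equation into the linear stochastic heat equation

$\partial _{t}Z=\partial _{x}^{2}Z+Z\cdot \text{white noise},$

since $\partial _{x}^{2}h+(\partial _{x}h)^{2}=\partial _{x}^{2}Z/Z$ and $\partial _{t}h=\partial _{t}Z/Z$.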

Random media and spin glasses.

This is mainly motivated by mathematical physics. Spin glasses are interacting systems like the Ising model, but with “disordered”, meaning random, interactions. There are many fields in random media outside spin glasses, for instance percolation, random walks in random environments, and others.

The non-rigorous physicists’ picture of mean-field spin glasses like the SK model predicts a spectacular mathematical structure with an infinite-dimensional hierarchical “symmetry breaking”, going back to Giorgio Parisi. There are important recent advances in the mathematical understanding, by Talagrand, Guerra, Panchenko, and others, but many aspects are still poorly understood. The theory has found many applications, for instance in:

• Theoretical computer science, where some ideas led to very efficient algorithms.

• Combinatorial optimization, like the Traveling Salesperson Problem, assignment problems and others.

• Coding theory

• Most recently: Statistics. More on that a bit later.

Spectral properties of large random matrices.

The motivation came from quantum mechanics. Eugene Wigner had the idea that spectra of large atoms could be modeled by random Hamiltonian matrices. This is a very active field connected with some of the most challenging open problems in mathematical physics and in pure mathematics.

A spectacular achievement was the proof of the Tracy–Widom limit distributions for the law of the largest eigenvalues. This is closely connected with the KPZ equation, and I believe that it must be of interest in statistics.
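The largest-eigenvalue fluctuations are easy to look at numerically. The following sketch in Python samples symmetric Gaussian (GOE-type) matrices and rescales their largest eigenvalue; the helper name `largest_goe_eigenvalues`, the matrix size and the sample count are assumptions made for this illustration. With off-diagonal entries of variance 1, the largest eigenvalue sits near $2\sqrt{n}$, and the fluctuations are of order $n^{-1/6}$, on which scale they are described by the Tracy–Widom law for real symmetric matrices.

```python
import numpy as np

def largest_goe_eigenvalues(n, samples, rng):
    """Largest eigenvalue of `samples` independent n x n symmetric Gaussian matrices.

    Hypothetical helper for illustration. Off-diagonal entries have variance 1,
    diagonal entries have variance 2 (the usual GOE normalization).
    """
    out = np.empty(samples)
    for s in range(samples):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / np.sqrt(2.0)          # symmetrize
        out[s] = np.linalg.eigvalsh(h)[-1]    # eigvalsh returns ascending eigenvalues
    return out

rng = np.random.default_rng(1)
n = 100
lam = largest_goe_eigenvalues(n, samples=200, rng=rng)

# Location of the spectral edge relative to sqrt(n); should be close to 2
edge_ratio = lam.mean() / np.sqrt(n)

# Centre at 2*sqrt(n) and blow up by n^(1/6): the Tracy-Widom scaling
rescaled = (lam - 2.0 * np.sqrt(n)) * n ** (1.0 / 6.0)

print("mean of largest eigenvalue / sqrt(n):", edge_ratio)
print("mean and std of rescaled fluctuations:", rescaled.mean(), rescaled.std())
```

A histogram of `rescaled` is visibly asymmetric, unlike a Gaussian; the limiting law has a negative mean of about $-1.2$. At this matrix size the agreement is only approximate.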

Random matrices are strongly related to algebra (free probability), representation theory and number theory. Some of the results on random matrices (proved by John Keating and coauthors) lead to conjectures about the Riemann Zeta function which go beyond the Riemann Hypothesis.

Schramm-Löwner-evolution equations and two-dimensional random geometry problems.

This is about two-dimensional models from statistical physics at criticality, like percolation, self-avoiding random walks, the Ising model, and others. The basic mathematical construction to attack these problems was invented by Oded Schramm, who tragically died in a mountain accident near Seattle.

The topic led to the first two Fields Medals for probabilists: Wendelin Werner and Stas Smirnov. It is one of the most active fields in probability theory presently.

Let me now briefly discuss a particular example which I found very striking and which reveals a close connection of modern (21st-century) regression analysis and compressed sensing with problems which arose from spin glass theory and neural nets. I like it, as I was marginally involved in it.

The statistical side I learned from a survey article by Donoho and Tanner and a recent article by Bayati, Lelarge and Montanari which appeared in 2015 in the Annals of Applied Probability.

Take a linear regression:

$Y_{i}=\sum\nolimits_{j=1}^{p}X_{ij}\beta _{j}+Z_{i},\ i=1,\ldots ,n$

$\beta _{j}$ being the regression parameters and $Z$ the noise, in the situation which Donoho calls the 21st-century setting, namely when $n<p$, which is the opposite of the classical Gauss–Legendre setup.

However, only a part of the parameters is relevant; that is, one has a sparsity assumption:

$k:=\#\left\{ j:\beta _{j}\neq 0\right\} <n$

but one does not know which ones, of course.

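The first question is about identification: Given a matrix $(X_{ij})$, and $(\beta _{j})$ sparse, satisfying the equation

$y_{i}=\sum\nolimits_{j=1}^{p}X_{ij}\beta _{j},\ i=1,\ldots ,n,$

is it possible to identify the $\beta _{j}$ from $(y_{i})$ and $(X_{ij})$ by

$\beta =\arg \min \left\{ \sum\nolimits_{j}\left\vert \beta _{j}\right\vert :y=X\beta \right\} ?$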

That depends, of course, very much on the matrix $X$, and therefore one chooses a probabilistic formulation: Given a “typical” matrix $(X_{ij})$, where “typical” has to be specified, is it true that for most $(\beta _{j})$ satisfying $k:=\#\left\{ j:\beta _{j}\neq 0\right\} <n$, the answer is “yes”?

It can then be phrased as a problem on random convex polytopes. Given the $\ell _{1}$ unit ball $C$ in $\mathbb{R}^{p}$, map it with a random matrix $X$ to the random convex polytope $XC\subset \mathbb{R}^{n}$: Does this convex polytope have, with high probability, the property that the convex hull of $k$ nodes, with no antipodal pairs, is a $(k-1)$-dimensional face of the polytope, for most choices of the set of $k$ nodes? Donoho proved that if $X$ has i.i.d. standard Gaussian entries, then the answer is “yes”, for large $k,n,p$, provided $\delta :=k/n$ is below a critical value $\delta ^{\ast }\left(n/p\right) \in \left( 0,1\right) $ given by a semi-explicit formula. Donoho and Tanner showed by extensive computer simulations that there is universality in the sense that the distribution of $X$ is not important, and they conjectured that such universality holds as $k,n,p\rightarrow \infty $ with quotients fixed. Bayati et al. finally proved this conjecture in somewhat restricted cases in a difficult 70-page paper.
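The identification question can be tried out numerically on a small instance. The sketch below, in Python, draws a Gaussian $X$ with $n<p$, a $k$-sparse $\beta$, and solves the $\ell _{1}$ minimization as a linear program via `scipy.optimize.linprog`. The helper name `l1_recover` and the sizes $n=40$, $p=80$, $k=5$ are assumptions chosen for illustration (with $k/n$ well inside the range where recovery is expected); this is not the procedure used in any of the cited papers.

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(X, y):
    """Solve  min sum_j |b_j|  subject to  X b = y  as a linear program.

    Hypothetical helper for illustration. Writes b = u - v with u, v >= 0,
    so that sum(u) + sum(v) equals the l1 norm at the optimum.
    """
    n, p = X.shape
    cost = np.ones(2 * p)
    A_eq = np.hstack([X, -X])                 # X u - X v = y
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(0)
n, p, k = 40, 80, 5                           # fewer equations than unknowns, sparse truth

X = rng.standard_normal((n, p))               # i.i.d. standard Gaussian design
beta = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta[support] = rng.standard_normal(k)
y = X @ beta                                  # noiseless observations

beta_l1 = l1_recover(X, y)                    # l1 minimizer
beta_ls = np.linalg.pinv(X) @ y               # minimum l2-norm solution, for contrast

err_l1 = np.max(np.abs(beta_l1 - beta))
err_ls = np.max(np.abs(beta_ls - beta))
print("max error, l1 minimization:", err_l1)
print("max error, minimum-norm least squares:", err_ls)
```

Both candidates satisfy $y=X\beta $ exactly, but only the $\ell _{1}$ minimizer picks out the sparse solution; the minimum-norm solution spreads mass over all $p$ coordinates. Increasing $k$ towards $n$ makes the $\ell _{1}$ recovery fail, which is the threshold phenomenon described above.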

In the geometric formulation, the problem fairly evidently has close connections with spin glass theory. There is a first random object, namely the random polytope, and then one asks probabilistic questions about this random object. It turns out that there are close similarities with problems around the Thouless–Anderson–Palmer (TAP) equations in spin glass theory. Bayati et al. use a (complicated) variant of an iterative procedure for the TAP equations used by me for the Sherrington–Kirkpatrick model in a paper which appeared last year in Communications in Mathematical Physics.

One may question whether this is a good example. After all, the fact that there is universality had been checked by Donoho and Tanner on the basis of extensive simulations, and the Bayati et al. result still has restrictions on the choice of the $X_{ij}$. This after a complicated and compact proof of 70 pages. Still, in my view, it opens an understanding which would be impossible to gain just by computer simulations, but that’s of course the view of a mathematician, which may not be shared by many applied statisticians.

This is just a very special example, and I presented it only because of my special interest in it. In my view, however, it shows how much both sides can learn from each other if we take the time to pay attention to what the others are doing, and if there are opportunities for both sides to meet.

Overall, I think IMS would be the society to support interactions and communications, and I think it is important for us in the long run, unless IMS wants to chip off the probability side. To a large extent, the interactions have to come bottom-up, through scientists organizing workshops and meetings, and asking for support. That is happening to some extent. On the other hand, I think it would be good for the society to become more active, in order to keep some coherence.