David J. Hand, Imperial College London, chaired the selection committee for the 2024 Rousseeuw Prize, awarded as previously announced to Yoav Benjamini, Ruth Heller and Daniel Yekutieli (with posthumous acknowledgment to Yosef Hochberg). David gave the following speech at the award presentation.
Introduction
I was honoured to be asked to chair the selection committee again. I would like to say how grateful the statistical community is to Peter Rousseeuw for establishing this prize, and to thank the administrators, who promote, advertise, and solicit nominations, as well as my fellow selection committee members for our discussions on who should be awarded this year’s prize.
The Nobel Prizes were established by Alfred Nobel’s will of 1895, for the disciplines of physics, chemistry, physiology or medicine, literature, and peace. A sixth prize, for economic sciences, was added in 1968 and first awarded in 1969. The Turing Award for computer science was launched in 1966. The Abel Prize for mathematics was established in 2001. But the gaping lacuna in that list is difficult to miss. Statistics—the discipline of extracting understanding and illumination from data—is central to modern civilisation. It’s ubiquitous, impacting all walks of life, from medicine to finance to public policy and government, even though one may have to look under the hood to see it. I might go so far as to say that none of the disciplines included in the Nobel awards would be possible in their modern form without statistics—perhaps with the exception of literature. So it was a gap that was crying out to be filled, and the Rousseeuw Prize does just that.
My role here is to describe the winning work in a high-level, informal and non-technical way, and introduce the prizewinners. I shall begin by looking at a couple of examples.
Motivating background examples
Your blood pressure fluctuates from day to day, with the time of day, and even depending on whether you have been sitting or walking just before it was measured. It’s affected by a multitude of random influences. So if I were to measure the blood pressure of everyone in this room, some of you would have a high value, perhaps even above the threshold which would prompt concern. But only for some of you would it be because you really had an underlying condition that might merit treatment; for others it would be high simply because one of those other random influences had temporarily pushed it up.
In this context, a false positive or a false discovery is a test result which indicates something abnormal when really there is nothing abnormal.
If I rushed everyone who had a high blood pressure reading off to the hospital, many of them would turn out to be perfectly normal. This would be bad news for all those who had been rushed off, not merely because of the time wasted but also because of the anxiety caused. Indeed, some medical screening programmes have been criticised for these very reasons. Being told you might have cancer—and then suffering intrusive investigations and perhaps even surgical operations—only to later discover that it was a false positive is clearly to be avoided.
So how can we reduce the number of such false positives?
One way would be to use a higher threshold which must be exceeded if we are to classify blood pressure as concerning. Indeed, by choosing a high enough blood pressure threshold I can make the number who have been unnecessarily rushed to the hospital as small as I like.
But there are a lot of people with normal blood pressure in this room. Even if I choose a threshold sufficiently high that any one person has a very small chance of being incorrectly flagged, with so many people the chance that I incorrectly flag one or more is large. To make my chance of incorrectly sending anyone to the hospital acceptably small, I would need to set the threshold so high that most of those who do have hypertension would be missed. This rather defeats the object of the exercise. We seem to have something of an impasse.
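To put a rough number on it (taking around a hundred people, as suggested by this room, and assuming the measurements are independent): if each test has a 5% chance of producing a false flag, the chance of at least one false flag somewhere in the room is

$$1 - (1 - 0.05)^{100} \approx 0.994,$$

so a false alarm is all but guaranteed.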
It is at this point that today’s laureates came galloping to the rescue. They developed a very clever novel strategy for tackling this problem, which I will describe in a moment. But first let me give my second example.
In image processing a pixel is one of the tiny dots that make up the picture. Look at an image or a computer screen under high magnification and you can see the dots, but from a distance they all merge together to give a continuous picture.
Medical scanning involves using X-rays or other methods to take images of slices through the body. Put all those slices together and you have a three-dimensional image of the inside of your body. So that three-dimensional image is made up of lots of tiny three-dimensional dots. These are called voxels. What we’d like to know is whether any of those voxels, or regions of voxels, appear abnormal. Are they too bright? Again, we can test each of those voxels, but random variation means we should expect that sometimes perfectly normal regions of the scan will show up brightly. And if we test lots of them we are pretty well guaranteed that some regions of the scans of perfectly healthy people will show up brightly just by chance. It’s just like the blood pressure example.
A paper published in 2010, which has since become a minor classic, illustrated the problem perfectly. The paper says:
“One mature Atlantic Salmon [a fish]… participated in the fMRI study. The salmon measured approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning [the fish was dead]… Image acquisition was completed on a 1.5-tesla GE Signa MR scanner. A quadrature birdcage head coil was used for RF transmission and reception….” and the paper goes on like that for several paragraphs giving details of the experimental setup, before continuing: “The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations …. The salmon was asked to determine which emotion the individual in the photo must have been experiencing.”
Now at this point you are probably thinking to yourself that it’s pretty clear what response the dead fish will give to being presented with the photographs. But here comes the crunch. The paper says: “Voxelwise statistics on the salmon data were calculated”.
Now, you will remember from my description of what a voxel is that there will be a large number of them in any particular three-dimensional image. In the salmon’s case there were 130,000.
This is like the blood pressure example, but instead of just 100 or so tests, there is a vast number of tests being conducted.
Carrying out separate tests on such a large number of voxels will almost certainly result in some showing apparent effects purely by chance, because of measurement error and so on. And that is indeed what the researchers found. They report, “Several active voxels were observed in a cluster located within the salmon’s brain cavity…”
They went on to say, “Either we have stumbled onto a rather amazing discovery in terms of post-mortem ichthyological cognition [they mean “dead fish thinking”], or there is something a bit off with regard to our uncorrected statistical approach. Could we conclude from this data that the [dead] salmon is engaging in the perspective-taking task? Certainly not. By controlling for the cognitive ability of the subject we have thoroughly eliminated that possibility [they mean, the fish was dead and so could not think]. What we can conclude is that random noise … may yield spurious results if multiple testing is not controlled for.”
Of course, the aim of the authors of that paper was to drive home the consequences of failing to take account of the number of tests being conducted. They also included more appropriate analyses, including those based on the work of today’s laureates.
I’ve given you two examples. A third occurs more generally in science. If scientists explore a great many different hypotheses, arranging things so that each one has only a 5% chance of being supported incorrectly, then the chance of incorrectly accepting at least one is very large. And, given enough such hypotheses, this could result in accepting a large number of hypotheses which are not true. It means that scientific claims are made which are later found to be false. This has become a real problem in some scientific disciplines, where it is known as the reproducibility crisis.
So, we have a problem.
The FDR method
Major breakthroughs in science often come from looking at things from a different angle, followed by a lot of hard work and a struggle for acceptance. I’ll come to the hard work and struggle in a moment, but first let’s look at the insight the laureates had that led to the work being presented today.
In my first example I looked at the chance of incorrectly flagging someone with normal blood pressure as hypertensive, and, conversely, the chance of incorrectly labelling someone with genuinely high blood pressure as normal. That’s all very well, but now put yourself in the physician’s position. What you’d really like to know is: of all those that I send to the hospital, what proportion really have high blood pressure? If it turns out that only 1% of them really have high blood pressure, the hospital would ask me to stop wasting their time and resources, but if it turned out that 99% did then they’d be grateful.
Let me make this more general. Suppose, instead of people, I am testing many different drugs, or many different scientific hypotheses, or many different financial trading systems. The traditional perspective tells us what proportion of the ineffective drugs I claim are effective. The new perspective tells us, of those drugs I claim are effective, what proportion really are? That is often a much more relevant question. It’s telling us how many of my claimed scientific discoveries are real. And its complement is telling us how many of my claimed discoveries are false—it’s telling us the false discovery proportion.
But you may have spotted a difficulty here. We can’t tell how many of the claimed discoveries are false because we don’t actually know which are true and which are false. That’s the whole point: it’s what we are trying to find out. It means we can’t actually calculate the false discovery proportion, so we can’t directly control it.
In a seminal breakthrough paper published in 1995, Yoav Benjamini and Yosef Hochberg cracked the problem. Again, it’s all about perspective and looking at problems in the right way. Instead of looking at the false discovery proportion itself, which cannot be observed, they looked at its expected value: the proportion of false discoveries you would see on average over repetitions of the whole testing procedure. They called this the false discovery rate.
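In the standard notation of the multiple testing literature (the symbols are my own addition here, not part of the speech): if R hypotheses are declared significant and V of them are in fact null, then the false discovery proportion and the false discovery rate are

$$\mathrm{FDP} = \frac{V}{\max(R, 1)}, \qquad \mathrm{FDR} = \mathbb{E}\left[\mathrm{FDP}\right].$$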
And using some clever mathematics, they showed that it is possible to control this quantity. That is, by applying a simple rule to the ordered p-values, we can guarantee that the false discovery rate is no larger than any level we choose.
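For the technically minded reader, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in Python. The function name and the toy p-values are mine, purely for illustration; in practice one would normally reach for a vetted implementation such as multipletests(..., method='fdr_bh') in the statsmodels package.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Flag which hypotheses to reject while controlling the false
    discovery rate at level q (assumes independent test statistics,
    as in the original 1995 paper)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # smallest p-value first
    thresholds = q * np.arange(1, m + 1) / m     # step-up thresholds i*q/m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()           # largest i with p_(i) <= i*q/m
        rejected[order[:k + 1]] = True           # reject all p-values up to that point
    return rejected

# Toy example: at q = 0.05 the three smallest p-values are declared discoveries.
print(benjamini_hochberg([0.001, 0.008, 0.012, 0.041, 0.20, 0.74]))
```

The rule sorts the p-values, compares the i-th smallest against i times q/m, and rejects everything up to the largest p-value that passes its threshold.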
Struggle for acceptance
However, it has to be said that few people initially recognised the importance of the work, or how widespread its use would become. This is the struggle part.
It’s traditional in science that papers are sent out to other experts in the field to be reviewed and commented upon. Is the data sound? Are the methods fully described? Do the conclusions follow? The paper will then probably be sent back for revision, often several times, with no guarantee of acceptance. Often journal editors decide that the submission does not quite match their aims for papers and it will be rejected. So one submits the paper to another journal. This is all standard procedure, but it means that the process of getting a paper published is slow and painstaking. Yoav Benjamini says of the 1995 paper, “Five years and three journals later the paper was accepted for publication.” I imagine that the editors of those journals who rejected it felt a bit like all the publishers who rejected J.K. Rowling’s first Harry Potter book.
Other papers on the false discovery rate concept suffered a similar protracted struggle before they were eventually published.
Now, the discipline of statistics is constantly advancing. New kinds of problems are arising, new types of data are being captured, new questions are being asked, and new methods must be developed to cope with these changes. The development of the Benjamini and Hochberg method illustrates this perfectly.
The method gradually gained traction, as researchers recognised that it often answered a more pertinent question than the traditional approaches. But it was the advent of new kinds of data which really gave it a boost, and catapulted the 1995 paper into being one of the most highly cited papers in statistics.
In particular, genetic data in the form of DNA microarrays arrived, characterised by thousands or tens of thousands of genes being tested simultaneously: a perfect illustration of the multiple testing challenge I outlined in my earlier examples. And this new kind of data required exactly the solution which had already been developed in the 1995 paper.
But, of course, things don’t stop there. The original paper assumed statistical independence of the multiple test statistics, which is not always a realistic assumption. Yoav Benjamini and Daniel Yekutieli went on to relax this constraint and, together with Ruth Heller, have subsequently extended the approach in various other directions.
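Both the original 1995 procedure and the dependence-robust Benjamini-Yekutieli variant are available in standard statistical software. As a small illustration using Python's statsmodels package (the p-values below are invented purely for the example):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.012, 0.041, 0.20, 0.74])  # illustrative only

# Benjamini-Hochberg (1995): valid under independence (and positive dependence)
reject_bh, adjusted_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

# Benjamini-Yekutieli (2001): valid under arbitrary dependence, more conservative
reject_by, adjusted_by, _, _ = multipletests(pvals, alpha=0.05, method="fdr_by")

print(reject_bh)
print(reject_by)
```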
The laureates
Daniel Yekutieli is an Associate Professor in the Department of Statistics and Operations Research at Tel Aviv University. He took a BSc in Mathematics from the Hebrew University of Jerusalem in 1992, a Master’s in 1996, and a PhD in Applied Statistics in 2002, supervised by Yoav Benjamini. Between 1992 and 1997 he worked in the research department of the Israel Meteorological Service. It was Daniel’s father, a physicist, who suggested he consider statistics as a career, which is why he moved to Tel Aviv University. While there he was encouraged by Yosef Hochberg and started working with Yoav Benjamini. Daniel says of Hochberg: “Yossi was a very generous and smart guy, he made me feel very welcome and even gave me consulting projects to do,” and of Benjamini, “He seemed very kind, dependable and always open to new ideas [all true!] and I liked his outlook on statistics.”
Ruth Heller is a Professor in the Department of Statistics and Operations Research at Tel Aviv University. She took a BSc in Mathematics from McGill University in 1996, where she received some very wise guidance from Professor David Wolfson, who told her that biostatistics is the most beautiful profession. Quite right, of course. Professor Heller then went on to take an MSc in biostatistics from the University of Washington in Seattle in 1998, and a PhD in statistics from Tel Aviv University in 2007, supervised by Yoav Benjamini. She has spent extended periods at the National Cancer Institute, the University of Pennsylvania, and Technion, the Israel Institute of Technology.
By this point you will have spotted the common factor throughout all of this work. Yoav Benjamini is the Nathan and Lily Silver Professor of Applied Statistics at Tel Aviv University. He studied Mathematics at the Hebrew University of Jerusalem, from which he graduated in 1973 and received his Master’s in 1976. He then went to the United States to take a PhD at Princeton, graduating in 1981. He has spent time as Visiting Professor at various universities, including the University of California, Stanford University, and the Wharton School in Pennsylvania. He has received many awards for his work, including the 2012 Israel Prize, the Medallion Lecture of the IMS, and election to the Israeli Academy of Sciences and Humanities and the US National Academy of Sciences.
Those are today’s three laureates, but I also want to mention one other person, whose name has cropped up repeatedly. This is Yosef (or Yossi) Hochberg. Regrettably, he passed away in 2013. Had he still been with us, he would certainly have been one of the laureates. His earlier work, such as his 1987 book Multiple Comparison Procedures, underpinned what I have been describing, and his continued work, through the seminal 1995 paper into the present century, led to significant development of the ideas.
The 1995 Benjamini and Hochberg paper is one of the most highly cited of all scientific papers. What that means is that a huge number of other researchers have used the method and developed it further. That 1995 paper was the start of a revolution, one which has benefited all of humanity in countless ways.
Read more about the Rousseeuw Prize at https://www.rousseeuwprize.org/