Contributing Editor David J. Hand (Imperial College London) counters the argument that the numbers speak for themselves: indeed they can, but they can also lie…
In 2008, in an article in Wired magazine, Chris Anderson famously wrote that “with enough data, the numbers speak for themselves.” This was in the context of arguing that “more is different” as far as data are concerned. He was claiming that the vast masses of data now being created, collected automatically as people go about their everyday lives, mean we can actually see what people do without having to construct theoretical models of behaviour. And there is certainly an element of truth in the claim – if the aim is simply prediction or decision-making, then understanding what is going on is unnecessary. All that is needed is to know how things are related and what will happen when interventions are made. That information can be gained from past data and, along with an assumption of stationarity in supposing that the future will be like the past, it allows prediction of what will happen. However, if the aim is deeper, if the aim is actually to understand underlying mechanisms and processes, then models are necessary. Indeed, in one sense “models” are what understanding means.
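The reliance on stationarity can be made concrete with a toy sketch (my illustration, not from the article, with invented numbers): a theory-free predictor fitted to past data works only so long as the future resembles the past.

```python
import random

rng = random.Random(0)

# "Past" data generated by the rule y = 2*x plus small noise.  A purely
# data-driven fit needs no theory of WHY y relates to x; the observed
# association alone is enough to predict.
past = [(x, 2 * x + rng.gauss(0, 0.1)) for x in range(50)]

# Least-squares slope through the origin.
slope = sum(x * y for x, y in past) / sum(x * x for x, _ in past)

# While the world stays stationary, prediction works well:
assert abs(slope * 10 - 20) < 1

# But if the mechanism silently shifts (say, to y = -2*x), the
# theory-free predictor fails badly, and nothing in the past data
# could have warned us.
shifted_y = -2 * 10
assert abs(slope * 10 - shifted_y) > 30
```

The point is not that such predictors are useless, but that their validity rests on an assumption the numbers themselves cannot certify.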
Although big data has driven the modern notion that numbers can speak for themselves, Anderson’s statement was not the first time the idea had arisen. For example, in their 1988 book The Likelihood Principle, James Berger and Robert Wolpert wrote (Berger and Wolpert, 1988, p78): “[i]t was apparently this feeling, that data should be able to speak for itself, that led Barnard to first support the Stopping Rule Principle”. The argument there was that the data were adequate for inference, and how they were collected was irrelevant. This position has received diminishing support over time, as is demonstrated by the furore following John Ioannidis’s 2005 article “Why Most Published Research Findings Are False.”
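Why how the data were collected matters can be seen in a small simulation (my sketch, not from the article): an experimenter who tests repeatedly as data accumulate, and stops as soon as significance is reached, inflates the false positive rate well beyond the nominal 5%, even though the final data set looks just like any other.

```python
import math
import random

def p_value(sample_mean: float, n: int) -> float:
    """Two-sided p-value for H0: mu = 0, with known sigma = 1."""
    z = abs(sample_mean) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def false_positive_rate(optional_stopping: bool, n_max: int = 100,
                        trials: int = 2000, seed: int = 1) -> float:
    """Simulate experiments under H0 and count 'significant' results."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0, 1)  # data genuinely generated under H0
            # Peek at the p-value from n = 10 onwards, stopping on success.
            if optional_stopping and n >= 10 and p_value(total / n, n) < 0.05:
                hits += 1
                break
        else:
            # Fixed-n design: test once, at the planned sample size.
            if not optional_stopping and p_value(total / n_max, n_max) < 0.05:
                hits += 1
    return hits / trials
```

Running `false_positive_rate(False)` gives roughly the nominal 5%, while `false_positive_rate(True)` gives a much larger rate: identical-looking numbers, but the stopping rule changes what they mean.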
Various people have pushed back against Anderson’s assertion. Nate Silver, author of The Signal and the Noise, said (Silver, 2012, p9): “The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.” And Deborah Mayo, in her recent book Statistical Inference as Severe Testing, said (Mayo, 2019, p79): “In this day of fascination with Big Data’s ability to predict what book I’ll buy next, a healthy Popperian reminder is due: humans also want to understand and explain.”
But the fact is that the mistaken notion that numbers require no interpretation had been addressed long ago. Alfred Marshall, in his inaugural lecture for his Chair in Political Economy at Cambridge in 1885, wrote (Hodgson, 2005): “Experience in controversies such as these brings out the impossibility of learning anything from facts till they are examined and interpreted by reason; and teaches that the most reckless and treacherous of all theorists is he who professes to let facts and figures speak for themselves”. Although mainly concerned with the difficulty of deducing causal relationships from the “facts and figures” alone, Marshall was also very aware of the dangers of taking numbers out of context, of failing to allow for data quality, of perversions in how the data were collected, and the host of other risks associated with the blind use of data as descriptions of the phenomenon they purport to represent.
The phrase “the numbers speak for themselves” is taken to mean that what they say is obvious, requiring no interpretation and brooking no disagreement. But data alone are not sufficient to understand phenomena. Understanding requires more than simple description of observed structures in data sets – not least because, as I sometimes put it, if data can speak for themselves, they can also lie for themselves.
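One classic way in which numbers “lie for themselves” is Simpson’s paradox, sketched here using the well-known kidney-stone treatment figures (Charig et al., 1986), not data from this article: treatment A is better within every subgroup, yet the pooled numbers say the opposite.

```python
# (successes, patients) for each treatment, split by stone size.
data = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, n):
    return successes / n

# Within each subgroup, treatment A has the higher success rate.
for size, d in data.items():
    assert rate(*d["A"]) > rate(*d["B"])

# Pooled across subgroups, the comparison reverses: B looks better,
# because A was given disproportionately to the harder (large-stone) cases.
pooled = {t: tuple(sum(data[s][t][i] for s in data) for i in (0, 1))
          for t in ("A", "B")}
print(rate(*pooled["A"]), rate(*pooled["B"]))  # A: 0.78, B: about 0.83
```

Nothing in the pooled table is wrong; it is the uninterpreted reading of it that misleads. Only an account of how the data arose resolves the contradiction.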
Anderson C. (2008) The end of theory: the data deluge makes the scientific method obsolete. http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Berger J.O. and Wolpert R.L. (1988) The Likelihood Principle. Institute of Mathematical Statistics, Lecture Notes – Monograph Series, Vol. 6, 2nd ed.
Hodgson G.M. (2005) “The present position of economics” by Alfred Marshall. Journal of Institutional Economics, 1, 121–127.
Ioannidis J.P.A. (2005) Why Most Published Research Findings Are False. PLoS Medicine, 2(8), 696–701.
Mayo D. (2019) Statistical Inference as Severe Testing. Cambridge University Press, Cambridge.
Silver N. (2012) The Signal and the Noise: The Art and Science of Prediction. Penguin Books, London.