Contributing Editor David Hand has been re-reading John Tukey’s “The Future of Data Analysis”:
Most readers will be familiar with the name John Tukey. He is renowned for co-developing the Fast Fourier Transform and inventing the box plot, and for coining the term “bit” as used in computer science, as well as the term “exploratory data analysis.” Several statistical tools and methods are named after him. Over sixty years ago, in 1962 in the Annals of Mathematical Statistics, he published a 67-page paper on “The Future of Data Analysis.” For context, S-PLUS, Python, and R were all developed much later, in the 1980s and 1990s.
Tukey began the paper by saying “For a long time I have thought I was a statistician, interested in inferences from the particular to the general … I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
The paper is a tour de force. Amongst the special growth areas he identifies are (using his terms, which the modern reader will readily be able to translate): spotty data, multiple-response data, problems of selection, ways of assessing error, data heterogeneous in precision, incomplete data, and others. He stresses the need for iteration in data analysis, something which is fundamental as one tries to understand data or fit a predictive model to it.
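To make the iteration concrete, here is a minimal sketch in Python (my own illustration, not anything from the paper; the toy data and the function name fit_line are invented) of the loop Tukey advocated: fit, examine the residuals, re-express the data, and fit again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y grows multiplicatively with x, so a straight line on the
# raw scale answers the wrong question.
x = np.linspace(1.0, 10.0, 100)
y = 2.0 * x**1.5 * np.exp(rng.normal(0.0, 0.1, x.size))

def fit_line(x, y):
    """Least-squares straight line; return the residuals."""
    coeffs = np.polyfit(x, y, deg=1)
    return y - np.polyval(coeffs, x)

# Iteration 1: fit on the raw scale and inspect the residuals.
resid_raw = fit_line(x, y)

# Iteration 2: the residuals show systematic curvature, so re-express
# both variables on a log scale and refit.
resid_log = fit_line(np.log(x), np.log(y))

# A crude numerical stand-in for "look at the residual plot": the
# correlation of the residuals with a quadratic term should shrink
# once the re-expression is right.
print("raw scale :", np.corrcoef(x**2, resid_raw)[0, 1])
print("log scale :", np.corrcoef(np.log(x)**2, resid_log)[0, 1])
```

In practice the examination step is graphical rather than numerical, but the point is the cycle: each look at the residuals prompts a revised question, exactly the iteration Tukey had in mind.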
To initiate new forms of analysis he draws attention to the need to seek wholly new questions (he mentions “more complexly organised data,” and complex data has been a recurring theme in modern statistics and data science), to tackle old problems in more realistic frameworks (e.g. nonparametric methods), to find novel ways to summarise data (look at the data this way and that; don’t just follow the “standard” approach), and to find and evade deeper constraints (the notion that different experts, with different ways of looking at things, could arrive at different insights). Regarding the last point, he stresses the danger in supposing that “all statisticians should treat a given set of data in the same way…” I really like his description of the danger: “…all British admirals, in the days of sail, maneuvered in accord with the same principles. The admirals could not communicate with one another, and a single basic doctrine was essential to coordinated and effective action. Today, statisticians can communicate with one another, and have more to gain by using special knowledge (subject-matter or methodological) and flexibility of attack than they have to lose by not all behaving alike.”
He remarks that, for a discipline to be a science, three constituents will be judged essential: intellectual content, organization into an understandable form, and reliance upon the test of experience as the ultimate standard of validity. And then he goes on to say that, as he sees it, data analysis passes all three tests, and so he would regard it as a science (“one defined by a ubiquitous problem rather than by a concrete subject”).
The paper includes the famous remark: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” He continues by observing that, “Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate.”
An interesting aspect he considers is at what stage a proposed new method should be put into practice. How much do we need to know about it first? I suspect that, increasingly in the modern world, the question is moot, since new tools tend to be used immediately, in the hope that the gain outweighs the risk. He also discusses the dangers of seeking “the best” solution to a problem.
I was especially struck by his comment that, “Some would say that one should not automate such procedures of examination, that one should encourage the study of the data. (Which is somehow discouraged by automation?)” He gives three counter-arguments: that most analysis will be done by non-experts in data analysis, that known procedures must be easy and quick to apply so that experts are free to explore the use of new methods, and that automating a procedure requires it to have been fully specified. He says he finds these counter-arguments conclusive.
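His third counter-argument, that automation forces full specification, is easy to illustrate. The snippet below is my own sketch, in Python, of a fully specified examination procedure in Tukey’s spirit: the five-number summary, plus the quartile fence rule for flagging suspicious points (the fence rule comes from his later book Exploratory Data Analysis, not from this paper, and the function names are invented).

```python
import numpy as np

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    data = np.asarray(data, dtype=float)
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    return data.min(), q1, med, q3, data.max()

def flag_outliers(data, k=1.5):
    """Flag points outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return data[(data < q1 - k * iqr) | (data > q3 + k * iqr)]

sample = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 9.7]  # one suspicious value
print(five_number_summary(sample))
print(flag_outliers(sample))  # flags 9.7
```

Once written down like this, the procedure is quick, repeatable by non-experts, and every choice (here the multiplier k) is explicit, which is precisely what automation demands.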
In the section on the impact of the computer he gives a nuanced answer to the question of how essential computers are to data analysis, pointing out, quite correctly, that “In many instances the answer may surprise many by being ‘important but not vital’ although in others there is no doubt but [that] the computer has been ‘vital’.” He did not foresee how dramatically computing power would advance over the succeeding half-century, nor how many areas of data analysis would be strikingly transformed by the power we now have.
Naturally, some areas have advanced a long way in the past 60 years. One such area, because of its need for computational power, is multivariate analysis, although he does discuss factor analysis and the discipline of cluster analysis, such as it was at the time. Other topics he examines include stochastic process data, selection and screening, multiple sources of error, plots, the teaching of data analysis, and the role of judgment in data analysis, all of them central to modern data science. It is, as I say, a tour de force, and one which fully justifies regarding him as the first data scientist.