David Hand, Imperial College London, writes on the problem of translating substantive questions to statistical questions:
If we wanted to decide whether the total weight of passengers on a plane exceeded that allowed by the airline given the flight conditions, we could gather all of them together on a giant scale and see if they were over the permitted weight. Or, we could multiply an estimated “average passenger weight” by the number of passengers on the plane (as airlines actually do). In doing this, we have moved away from the actual physical situation we are studying, representing it in numerical terms and using mathematical and statistical tools to find an answer, which we then map back to the physical world (perhaps in the form “ask for volunteers to leave the flight”).
That vignette captures the essence of statistical (and, more generally, applied mathematical) methods. We map the real-world problem (be it the physical world, economic world, psychological world, or whatever) to a formal representation, typically in numerical terms, carry out our manipulations on the numbers, draw some numerical conclusions, and map those conclusions back to the real world. Our formal representation permits deduction and inference much more easily than would corresponding manipulations in the real world (getting all those passengers onto the giant scales).
The core notions here are mapping and representation, and it is obvious that the accuracy and validity of our conclusions rely on the veracity of these notions. Moreover, not all aspects of the real world will be relevant to our objectives. Passenger names, food preferences, or ages are irrelevant to our objective of deciding if the plane is safe to take off. In any statistical approach to a problem, we must decide what is relevant, what should be mapped to the numbers which we will analyse, what variables we should measure.
And we must decide how to construct a good representation. Multiplying individual passenger weights together would not be helpful. We need to add them. Addition here is representing the notion of passengers being together on the plane, at least for our question about total weight. We must decide what our model should look like and how an algorithm should be constructed so that it parallels the question and the world being studied.
This is obvious, indeed so obvious that it may not need saying. Yet sometimes insufficient care is taken in the mapping, resulting in mistaken statistical conclusions. Or, insufficient care is taken to ensure that the statistical question answers the substantive question.
Simpson’s paradox provides an illustration in which the need to ensure valid mapping between the two questions is obvious. This “paradox” arises when comparing two populations, and is the phenomenon in which every subgroup of population A has a larger mean than the corresponding subgroup of population B, so that one concludes that “A is greater than B”, but where the overall mean of A is smaller than the overall mean of B. The apparent contradiction arises from the fact that the two analyses are actually answering different questions. The first asks about the average conditional difference between A and B, conditioning on the groups, while the second asks about the unconditional difference. They are different questions, so we should not be surprised when they give different answers. What we should ask ourselves is which of these two questions properly represents the substantive question?
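The two questions can be made concrete with a minimal Python sketch. The data below are hypothetical, invented purely for illustration: the subgroup labels "low" and "high", the values, and the helper names `subgroup_mean` and `overall_mean` are all assumptions, not taken from any study. Population A has the larger mean in each subgroup, yet the smaller mean overall, because most of A's observations fall in the low-valued subgroup while most of B's fall in the high-valued one.

```python
from statistics import mean

# Hypothetical observations keyed by (population, subgroup).
# A is concentrated in the "low" subgroup, B in the "high" subgroup.
data = {
    ("A", "low"): [11] * 8,
    ("A", "high"): [21] * 2,
    ("B", "low"): [10] * 2,
    ("B", "high"): [20] * 8,
}

def subgroup_mean(pop, group):
    """Mean of one population within one subgroup (the conditional question)."""
    return mean(data[(pop, group)])

def overall_mean(pop):
    """Mean of one population with subgroups pooled (the unconditional question)."""
    pooled = [x for (p, _), xs in data.items() if p == pop for x in xs]
    return mean(pooled)

# Question 1: difference between A and B within each subgroup.
conditional_diffs = {
    g: subgroup_mean("A", g) - subgroup_mean("B", g) for g in ("low", "high")
}

# Question 2: difference between A and B ignoring subgroups.
unconditional_diff = overall_mean("A") - overall_mean("B")

print(conditional_diffs)   # A exceeds B in both subgroups
print(unconditional_diff)  # yet A is below B overall
```

Neither number is wrong. Each is the correct answer to its own question, and the code cannot say which question matches the substantive one.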
When presented with the two possible mappings, as in the Simpson’s paradox illustration, one can make a choice about which (if either) properly represents the substantive question. But what about situations in which just one mapping is presented? Could it be that sometimes people have rushed in to use familiar or standard approaches even though they are inaccurate representations of the substantive question? The reader may be familiar with the “fallacy of the instrument” (in picturesque terms, the “hammer fallacy”: if all you have is a hammer, everything looks like a nail).
Some examples are familiar: misinterpreting the meaning of regression coefficients; using the F-measure to evaluate a machine learning system, when diagnostic odds ratio, partial AUC, or one of the host of other measures better captures the relevant aspect of performance; misinterpreting group behaviour as individual behaviour as in the ecological fallacy; the use of p-values as measures of effect size; the misinterpretation of p-values as the probability that the null hypothesis is true; interpreting correlation as causation (despite the familiar adage!); using single-link cluster analysis, with its potential long straggling clusters, when what is really wanted are compact spherical clusters, such as those produced by k-means analysis; and so on. But others are less familiar, and may not even have been recognised, not least by the researchers conducting the study. My 2026 book What’s the Question? Deciding What You Really Want to Know explores this issue in depth. It gives a wide variety of examples, ranging from simple questions relating to averages to complex questions relating to sophisticated statistical tools, covering those mentioned above and others. If even averages have been misused and misunderstood, misrepresenting the substantive question, then how much greater is the opportunity for misuse and misunderstanding with advanced tools?
In the previous issue, Marianne Huebner wrote about the importance of initial data analysis (IDA). I certainly agree with her about its importance, and about how it can protect against mistaken conclusions. But there is a critical step which is even more “initial” than IDA. This is the step of statistical question formulation: of mapping the substantive question to the statistical question.