Contributing Editor David J. Hand writes:
George Box once said, “You have a big approximation and a small approximation. The big approximation is your approximation to the problem you want to solve. The small approximation is involved in getting the solution to the approximate problem.” In a similar vein, John Tukey said, “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
I’ve never been entirely convinced by these statements. They have the ring of nice soundbites (especially when polished up, for example to “An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question”) but it seems to me that the critical thing is the accuracy of both approximations.
Nonetheless, the underlying point, that people should think carefully about the problem they actually want to solve, holds good. Researchers should not expend energy answering the “wrong question” unless they are confident that it is near enough to the right one.
A particularly simple example of this is whether to use the mean or the median to summarise a set of data. Since these statistics are different, they naturally have different properties. Indeed, as all statisticians will know, it’s possible for one of two groups to have a higher mean but a lower median than the other group. Changes in the extreme values will affect the mean but (provided the ordering of the observations is preserved) not the median. So, for example, one can make the sample mean as large as one likes by increasing the single largest value enough, while the median remains unchanged. If the world’s richest person’s wealth increases sufficiently while everyone else’s declines, the mean wealth goes up while the median wealth decreases.
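As a quick sketch of this behaviour, using invented wealth figures (arbitrary units) and only Python’s standard library:

```python
from statistics import mean, median

wealth = [1, 2, 3, 4, 5]
print(mean(wealth), median(wealth))  # both are 3

# Inflate the single largest value: the mean can be made as large
# as we like, while the median is unchanged.
wealth = [1, 2, 3, 4, 1000]
print(mean(wealth))    # 202
print(median(wealth))  # 3

# The richest observation grows while everyone else's shrinks:
# the mean rises, the median falls.
wealth = [0, 1, 2, 3, 2000]
print(mean(wealth))    # 401.2
print(median(wealth))  # 2
```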
For such reasons, it’s common to see statements to the effect that the choice should depend on how the data are distributed, and that the mean is a better measure of “location” than the median for symmetric distributions, while otherwise the median should be used.
But this is an oversimplification, and you will note that this prescription for which measure is appropriate makes no reference at all to the question being asked.
If, as an employer, I choose the remuneration of new recruits randomly from a pronouncedly skewed distribution of salaries, then the average which will interest me is the mean of the distribution: those receiving large salaries will be compensated for by the larger number receiving small salaries, and my total wage bill is the product of the number of employees and their mean salary. In contrast, a potential new recruit considering joining the firm will be interested in the median salary. To her the mean is of little interest, since she is very likely to receive substantially less than that.
The distribution has the same shape in each case, but the appropriate average depends on what one wants to know.
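A toy numerical version of the salary example, with invented figures (none of these numbers come from the column itself):

```python
from statistics import mean, median

# Hypothetical right-skewed salary distribution, in units of £1000:
salaries = [20, 22, 24, 25, 26, 28, 30, 35, 90, 200]
n = len(salaries)

# The employer budgets with the mean: total wage bill = n * mean.
total_bill = n * mean(salaries)
print(total_bill)        # 500, i.e. exactly the sum of the salaries
print(mean(salaries))    # 50: the figure that matters to the employer
print(median(salaries))  # 27.0: closer to what a typical recruit gets
```

The mean (50) sits well above what most employees earn; the median (27) is the better guide for the prospective recruit, for the same skewed distribution.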
I picked the mean/median example because it was the simplest example I could think of, but the principle is ubiquitous: the choice of statistical method depends on the question you want to answer.
Correlation coefficients are another very widely used basic summary statistic. The Pearson product-moment coefficient is known to be a measure of the strength of linear relationship. Often, however, one wants a weaker measure of relationship—perhaps merely a measure of strength of monotonic relationship. Correlation coefficients for this are sometimes called nonparametric measures of correlation, and they are invariant to monotonic increasing transformations of the two variables involved. The Spearman coefficient is an example. This works by transforming the observed values to ranks and calculating the Pearson coefficient of the ranked data. That’s equivalent to transforming the raw data to uniform scores, before applying the Pearson measure. But the choice of a uniform distribution here is arbitrary—or at least, in almost all the applications I have encountered, it’s arbitrary. No-one has told me why, for their problem, they believed uniform scores were appropriate, rather than, for example, normal scores, or scores derived from some other distribution. Unfortunately, the derived value of the Pearson coefficient will depend on the chosen distribution. What this means is that the value of the coefficient is using arbitrary “information” that the researcher has injected into the calculation, not merely the information in the data.
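To make the dependence on the chosen scores concrete, here is a small standard-library Python sketch with invented data. The rank-to-quantile convention r/(n+1) used for the normal scores below is one common choice among several, not a canonical one:

```python
from statistics import NormalDist, mean

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(x):
    """Ranks 1..n (assuming no ties, for simplicity)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Invented data: a monotonic but imperfect relationship.
x = [1, 2, 3, 4, 5, 6]
y = [20, 10, 40, 30, 60, 50]
rx, ry = ranks(x), ranks(y)

# Spearman = Pearson applied to the ranks (equivalently, uniform scores).
spearman = pearson(rx, ry)  # about 0.829

# Normal scores: replace rank r by the standard normal quantile at
# r / (n + 1) -- a different, and equally arbitrary, choice of scores.
n = len(x)
inv = NormalDist().inv_cdf
nx = [inv(r / (n + 1)) for r in rx]
ny = [inv(r / (n + 1)) for r in ry]
normal_corr = pearson(nx, ny)  # about 0.788: a different value
```

Both coefficients are invariant to monotonic increasing transformations of the raw data, yet they disagree, because each bakes in a different assumed distribution for the scores.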
To overcome this, we need to step back and think more carefully about how the invariance to monotonic transformations of the two variables is achieved. The Spearman coefficient does it by mapping to a standard representation, but an alternative approach would be to base one’s measure solely on the ordinal properties of the data. A measure which does this is the Kendall coefficient. This measure thus sidesteps the intrinsic arbitrariness implicit in the Spearman measure. Again the two measures are different, with different properties, and which is appropriate depends on what you want to know.
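A minimal sketch of the Kendall coefficient (the simple tau-a version, assuming no ties), again with invented data; note that it uses only order comparisons, and the final check illustrates the invariance to monotone increasing transformations:

```python
from itertools import combinations
from math import exp

def sign(a, b):
    return (a > b) - (a < b)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / (number of pairs).
    Assumes no ties; uses only ordinal comparisons of the data."""
    pairs = list(combinations(range(len(x)), 2))
    s = sum(sign(x[j], x[i]) * sign(y[j], y[i]) for i, j in pairs)
    return s / len(pairs)

# Invented data, monotonic but imperfect:
x = [1, 2, 3, 4, 5, 6]
y = [20, 10, 40, 30, 60, 50]
tau = kendall_tau(x, y)
print(tau)  # 0.6: 12 concordant pairs, 3 discordant, out of 15

# Invariant under monotone increasing transformations of either variable,
# with no distribution of scores to choose:
tau2 = kendall_tau([exp(v) for v in x], [v ** 3 for v in y])
assert tau2 == tau
```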
As Box, Tukey, and other great statisticians have pointed out, it is critical in a statistical analysis to make sure you solve the right problem.
David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He serves on the Board of the UK Statistics Authority. He is a Fellow of the British Academy, and a recipient of the RSS Guy Medal. He was made OBE for services to research and innovation.