Me: “Is there any evidence of x effects in the data?”

Them: “No.”

Me: “Have you looked?”

Them: “No.”

I’ve had this conversation many times, where x = batch, temporal, spatial, machine, reagent, operator or other effects, and I expect to have it again. Why? Perhaps people want to reverse Matthew 7:7 into “Don’t seek, and you won’t find”, expecting that whatever they do find will be bad. Perhaps they don’t dare look at their data before they formulate their model and address the questions of interest, for fear that looking will invalidate their answers. (It may, but that may also be right.) I think the style of some of our teaching (statistician as police officer, or keeper of the disciplinary faith) and some (regulatory) practice encourages such a view.

I prefer to go with Yogi Berra, who said, “You can see a lot by just looking.” One of the reasons I’m so keen on looking is that in my data-rich world there are always things to find, and that can be fun. I call them artifacts, though features might be a more positive term (cf. the 1970s quip: “Is it a bug or a feature?”). The question for me is not, “Are there artifacts in my data?”, for the answer to this question is invariably, “Yes!” What concerns me is whether they are a major problem.

On a related point, one of the things I’ve noticed over time is that as we get confronted with more and more data, we tend to look at it less and less. It should be the other way around.

Do some people have a problem looking at large data sets, and if so, why? I think the answer is yes, some do, and I offer a few possible reasons. One is that large data sets are frequently produced by complex, multi-step processes, involving technologies that can be a challenge to understand. As a result, like the Little Prince (“When a mystery is too overpowering, one dare not disobey”), people take such data at face value. Another possibility is a blind faith in numbers: a feeling that if there is a lot of data, the answer that falls out must be overwhelmingly more probable than any of the alternatives, and that no artifact will change the conclusions. My third reason is that we all need to think harder, because simply repeating what we used to do with 10 variables is not an option when we have 10,000 variables. A change in perspective is required. We used to look at all our data, do some analyses, and finish off with further looks; with large data sets the first step is necessarily reduced, so we need a much more thorough third step. That is, our focus needs to be more on looking for things that might change our conclusions, not things that support (or fail to support) our assumptions. Also, we may be unsure what to do if we see problems. Or perhaps there is now so much data that no single data set seems to warrant the careful consideration it might have received in the past, before we move on to the next one.

None of these reasons should be entertained. We must work hard to understand our measurement processes; artifacts are frequently the largest effects in a data set; and there are good ways of looking, and of responding when we find something untoward. We should use them, though more ways will always be welcome. Lastly, there is no reason to become complacent: some large data sets can be very rich indeed, and deserve thorough examination.

How should we seek, and what can we do when we find? In the last decade much use has been made of histograms or qq-plots of test statistics or p-values. These are valuable indicators of the health of an analysis: if your p-value distribution has problems, your analysis has problems. Also useful are negative controls, variables that should be unaffected by your treatments, and positive controls, variables that should be affected by your treatments in known ways. If your controls don’t behave as expected, then you have a problem, and something needs to be done. Further, most large data sets come with other information, sometimes called metadata, part of which might be associated with your final estimates, test statistics or p-values. Your task is to decide wisely which pieces are worth looking at, and then to do so. There are statistical ways to try to deal with known or unknown artifacts, which don’t necessarily require that you understand how they arose. You should seek evidence of their fingerprints, and do something about what you find. Explanations may come later, as with the “Wednesday effect” in Primo Levi’s “Silver” or the “method effect” in Lord Rayleigh’s “Anomaly Encountered in Determinations of the Density of Nitrogen Gas”. Seek, find and fix, perhaps understand.
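To make the looking concrete, here is a minimal sketch in Python on simulated data, not a recipe from any particular field: the two-group design, the batch labels and the choice of negative controls are all hypothetical illustrations, and the only libraries assumed are numpy, scipy and matplotlib. It computes a p-value per variable, draws the histogram and uniform qq-plot, checks the negative controls, and looks for one metadata fingerprint by testing each variable against batch.

```python
# A minimal sketch of the "looking" described above, on simulated data:
# p-value diagnostics, a negative-control check, and a metadata (batch)
# fingerprint. Everything here (the two-group design, the batch labels,
# the choice of controls) is a hypothetical illustration.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Toy data: 100 samples by 2000 variables, with a deliberate batch artifact.
n, p = 100, 2000
treatment = np.repeat([0, 1], n // 2)   # first half control, second half treated
batch = np.tile([0, 1], n // 2)         # two processing batches, balanced by design
X = rng.normal(size=(n, p))
X[batch == 1] += 0.5                    # the artifact: a shift in one batch
negative_controls = np.arange(50)       # variables believed unaffected by treatment

# One p-value per variable: a two-sample t-test comparing treatment groups.
t_stat, pvals = stats.ttest_ind(X[treatment == 1], X[treatment == 0], axis=0)

# Check 1: histogram and uniform qq-plot of the p-values. If nothing is
# going on, they should look uniform on [0, 1]; systematic inflation or
# deflation says the analysis has problems.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.hist(pvals, bins=40)
ax1.set(title="p-value histogram", xlabel="p-value", ylabel="count")
expected = (np.arange(1, p + 1) - 0.5) / p
ax2.plot(-np.log10(expected), -np.log10(np.sort(pvals)), ".", markersize=3)
ax2.axline((0, 0), slope=1)             # the y = x reference line
ax2.set(title="uniform qq-plot", xlabel="expected -log10(p)",
        ylabel="observed -log10(p)")
fig.tight_layout()

# Check 2: do the negative controls behave? They should not be enriched
# for small p-values; much more than 5% below 0.05 signals trouble.
print("negative controls with p < 0.05:",
      np.mean(pvals[negative_controls] < 0.05))

# Check 3: a metadata fingerprint. Run the same test against batch instead
# of treatment; widespread small p-values are the artifact's signature.
_, p_batch = stats.ttest_ind(X[batch == 1], X[batch == 0], axis=0)
print("variables associated with batch at p < 0.05:",
      np.mean(p_batch < 0.05))
plt.show()
```

In this toy example the batch is balanced across treatment, so the damage to the treatment p-values is subtle, while the fingerprint in the third check is loud; that is exactly the situation where putting batch terms in the model, or using batch-adjustment methods along the lines of ComBat or RUV, earns its keep.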

“The thing that is important is the thing that is not seen…” says the Little Prince,
sculpted here on his B-612 Asteroid, at the French theme park in Hakone, Japan