
Rob Tibshirani and Daniela Witten’s matching outfits feature the geometric interpretation of ridge regression and the lasso
Our contributing editor Daniela Witten teams up with Rob Tibshirani:
This column is about the fact that statistics is hard. Of course, statistics is hard for non-statisticians: anyone who has ever taught a non-major stat class knows this to be true! But this column is about the fact that statistics is hard, full stop.
In statistics, the issues at play can be subtle, and there is often not a clear answer. And the inherent statistical gray zone in which so many real-world questions reside can create a conflict between the scientist, who wants a black-and-white answer, and the statistician, who does not want to peddle snake oil.
One of us is currently teaching a course to PhD students in epidemiology. Many of the students enrolled in this course eagerly (and quite reasonably) hope to learn how to answer causal questions on the basis of the messy, real-world data that will form the heart of their dissertations. They are looking for “yes or no” answers to questions such as “Are the observations independent?” and “Does the linearity assumption hold?” and “Is this variable a confounder?” and “Is this a valid estimate of the causal effect?”. We certainly wish we could give them the answers they seek! But instead, we tell them: “It depends … how did you collect the data?” and “I’m not sure … do you believe in magic?” and “What does the science tell you?” and “It depends!”
The problem is that the students want easy and clean answers to hard and messy questions. And maybe, if we knew less statistics, or if we believed in fairy tales, then we could give them the “just so” answers that they are looking for. But as statisticians, we understand that statistics is hard, and questions that seem incredibly simple can have very, very complicated answers.
As another example, there is huge interest within the ML/AI community in understanding the “importance” of a variable. We understand why this is of interest, and of course, we aim to please! But what does “variable importance” even mean? As any statistician knows (and as any student in a linear models course should know, though perhaps some don’t), a variable’s importance depends on the model under consideration. Are we considering that variable within the context of a simple linear model? Or a spline model using only that variable? Or a multiple linear regression model with other covariates? Or a gradient-boosted tree, or a convolutional neural network? No measure of variable importance computed in a vacuum (i.e., outside the context of a specific model) can be meaningful. For instance, a measure of variable importance that accounts for interactions between variables may work well in the context of a random forest, but poorly in the context of linear regression. So a “one-size-fits-all-models” measure of variable importance is statistical snake oil.
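The model-dependence of “importance” is easy to see in a toy simulation (ours, purely illustrative, not from the column): below, hypothetical variables are generated so that the response y depends only on a confounder z, never on x. Yet x looks highly “important” in a simple regression of y on x, and essentially unimportant once z enters the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# z is a confounder: it drives both x and y.
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = 2.0 * z + rng.normal(size=n)  # y does not depend on x at all

# "Importance" of x in a simple linear model: regress y on x alone.
X1 = np.column_stack([np.ones(n), x])
b_simple = np.linalg.lstsq(X1, y, rcond=None)[0]

# "Importance" of x after adjusting for z: regress y on x and z.
X2 = np.column_stack([np.ones(n), x, z])
b_multi = np.linalg.lstsq(X2, y, rcond=None)[0]

print(f"coefficient on x in y ~ x     : {b_simple[1]:.2f}")  # near 1.6
print(f"coefficient on x in y ~ x + z : {b_multi[1]:.2f}")   # near 0.0
```

Neither coefficient is the “true importance” of x; each answers a different question, posed by a different model.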
This phenomenon also arises in the context of recent interest in visualizing high-dimensional data. The usual tools that work in low dimensions—like pairwise scatterplots of the data—cease to be useful in higher dimensions. Principal component analysis and other well-studied dimension reduction techniques can help and are well understood from a statistical perspective, but for the most part they rely on a linear representation of the data. More recent techniques promise to reveal non-linear structure in the data. But we hesitate to use them. What do these visualizations actually mean? Are they displaying real signal, or simply artifacts of the dimension reduction procedure? It would be easy enough to show a collaborator the output of one of these techniques, and to pretend that we believe it. But we cannot do that in good faith without knowledge of the technique’s operating characteristics, its performance in a variety of settings, and its theoretical guarantees… none of which are yet understood to our satisfaction, because statistics is hard.
It comes down to this: the more statistics we know, the less comfortable we feel answering real scientific questions with statistical fairy tales. What, then, is our role as statisticians? We can educate scientists about why statistics is hard, so that they can (i) arrive at the best imperfect answer to their very hard question, and (ii) recognize statistical snake oil for what it is the next time it comes their way.
Many of us are drawn to the field of statistics because we are natural-born skeptics. Through our statistical training, we become more skeptical still. Our experience—both personal and professional—has taught us that if something seems too good to be true, then it probably is. We seek the simplest possible answer to every question, and no simpler. After all, statistics is hard. The key is to enjoy the challenge.