Daniela Witten. Photo: Jenny Jimenez

Daniela Witten is a professor of Statistics and Biostatistics at the University of Washington, and the Dorothy Gilford Endowed Chair in Mathematical Statistics. She develops statistical machine learning methods for high-dimensional data, with a focus on unsupervised learning. Daniela is the recipient of an NIH Director’s Early Independence Award, a Sloan Research Fellowship, an NSF CAREER Award, a Simons Investigator Award in Mathematical Modeling of Living Systems, and a Mortimer Spiegelman Award. She is also a co-author of the popular textbook An Introduction to Statistical Learning. Daniela completed a BS in Math and Biology with Honors and Distinction at Stanford University in 2005 and a PhD in Statistics at Stanford University in 2010.

 

Testing a hypothesis after unsupervised learning

In recent years, a number of techniques have been proposed for performing selective inference, also known as post-selection inference, in a wide variety of settings. For instance, approaches are available to test the null hypothesis that a particular coefficient in a regression model equals zero, given that the coefficient was selected from the data at hand using, say, the lasso or forward stepwise regression.

In this talk, I will discuss the application of the selective inference framework to some well-studied problems in unsupervised learning. The selective inference framework is particularly well-suited to unsupervised learning, since unsupervised learning can be viewed as a way to conduct hypothesis generation. Once these hypotheses are generated, testing them typically requires an independent dataset. However, the selective inference framework allows us to conduct both hypothesis generation and hypothesis testing on the same dataset.

First, I will consider the task of changepoint detection. Given a sequence of observations in $\mathbb{R}$, suppose we apply (for instance) binary segmentation or $\ell_0$ segmentation to detect a changepoint. We can then ask the question: is the mean to the left of this estimated changepoint equal to the mean to the right of this estimated changepoint? Of course, a p-value for this null hypothesis computed via naïve application of a z-test or t-test would fail to account for the fact that the changepoint was estimated from the data, and thus would not control Type 1 error. Therefore, in order to conduct valid inference, we must instead ask a more refined question that accounts for the process by which we estimated the changepoint from the data. In particular, we ask: is the mean to the left of this estimated changepoint equal to the mean to the right of this estimated changepoint, given that we estimated a changepoint at this position?
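
The failure of the naïve test can be illustrated with a small simulation. The sketch below is not the method from the talk: it only shows the problem that the selective approach is designed to fix. It assumes independent $N(0,1)$ observations with known variance and no true changepoint, estimates a single changepoint with one step of binary segmentation (the maximizer of the scaled CUSUM statistic), and then applies a naïve two-sided z-test at that location. The function names and default parameters are illustrative choices, not taken from the source.

```python
import math
import random


def estimate_changepoint(y):
    """One step of binary segmentation.

    Returns (tau, z), where tau in 1..n-1 maximizes the absolute scaled
    difference in means between y[:tau] and y[tau:], and z is that maximum.
    Assumes unit noise variance (an assumption of this sketch).
    """
    n = len(y)
    total = sum(y)
    best_tau, best_stat = 1, -1.0
    left_sum = 0.0
    for tau in range(1, n):
        left_sum += y[tau - 1]
        right_sum = total - left_sum
        diff = left_sum / tau - right_sum / (n - tau)
        stat = abs(diff) / math.sqrt(1.0 / tau + 1.0 / (n - tau))
        if stat > best_stat:
            best_tau, best_stat = tau, stat
    return best_tau, best_stat


def naive_z_pvalue(z):
    """Two-sided p-value that treats z as a standard normal draw."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def naive_rejection_rate(n=50, reps=500, alpha=0.05, seed=0):
    """Fraction of null datasets in which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Null data: constant mean, so any rejection is a Type 1 error.
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        _, z = estimate_changepoint(y)
        if naive_z_pvalue(z) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print("nominal level: 0.05")
    print("naive rejection rate under the null:", naive_rejection_rate())
```

Because the changepoint location is chosen to maximize the very statistic being tested, the naïve rejection rate under the null sits well above the nominal level. Conditioning on the selection event is what restores Type 1 error control.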

We develop a computationally efficient approach to answer this question for changepoints estimated via binary segmentation and $\ell_0$ segmentation. In contrast to recent proposals in the literature, we are able to avoid conditioning on unnecessary information; we see a clear benefit in terms of the power of our approach.

Next, we show that related ideas can be used to improve upon recently published proposals for testing hypotheses that are based upon the output of the fused lasso. This applies both to hypotheses that involve changepoints estimated via the one-dimensional fused lasso and to hypotheses that are based on the output of the fused lasso applied to an arbitrary graph.

Finally, suppose that we cluster n observations using hierarchical clustering, and then we cut the dendrogram at a particular height in order to obtain (say) K clusters. We can then ask the question: is the mean of the observations in one particular cluster equal to the mean of the observations in another cluster? Once again, this question has a selective inference flavor, since we must somehow account for the fact that these clusters were estimated from the data. It turns out that in the special cases of average, single, and centroid linkage, we can efficiently compute p-values that account for the clustering process and that control Type 1 error. This means that it is possible to assign a p-value to each split on a hierarchical clustering dendrogram. We demonstrate the performance of our proposed approaches on a number of applications, including DNA sequence data and calcium imaging data.
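
To see why the clustering question needs a selective treatment, consider the following sketch. It is not the procedure proposed in the talk. It draws one-dimensional $N(0,1)$ observations from a single population (so there are no true clusters), splits them into K = 2 groups with a simple centroid-linkage agglomerative clustering, and then applies a naïve z-test for a difference in cluster means. The known unit variance, the one-dimensional setting, and all function names are assumptions made for illustration.

```python
import math
import random


def centroid_linkage_two_clusters(x):
    """Agglomerative clustering of 1-D points with centroid linkage.

    Repeatedly merges the two clusters whose centroids are closest, and
    stops when two clusters remain. Returns the two clusters as lists.
    """
    clusters = [[v] for v in x]
    while len(clusters) > 2:
        centroids = [sum(c) / len(c) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0], clusters[1]


def naive_cluster_pvalue(a, b, sigma=1.0):
    """Two-sided z-test p-value for equal means, ignoring how a and b were formed."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = sigma * math.sqrt(1.0 / len(a) + 1.0 / len(b))
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def naive_cluster_rejection_rate(n=20, reps=200, alpha=0.05, seed=1):
    """Fraction of single-population datasets in which the naive test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # One homogeneous population: every rejection is a Type 1 error.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a, b = centroid_linkage_two_clusters(x)
        if naive_cluster_pvalue(a, b) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print("nominal level: 0.05")
    print("naive rejection rate with no true clusters:",
          naive_cluster_rejection_rate())
```

Clustering places dissimilar observations in different groups by construction, so the naïve test rejects far more often than the nominal level even though all observations share one mean. A valid p-value has to condition on the event that the clustering produced these particular groups.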

This is joint work with Jacob Bien (University of Southern California), Yiqun Chen (University of Washington), Paul Fearnhead (Lancaster University), Lucy Gao (University of Washington), and Sean Jewell (University of Washington).