Emmanuel Candès is the Barnum-Simons Chair in Mathematics and Statistics, and professor of Electrical Engineering (by courtesy) at Stanford University, where he currently chairs the Department of Statistics. Emmanuel’s work lies at the interface of mathematics, statistics, information theory, signal processing and scientific computing: finding new ways of representing information and of extracting information from complex data. Candès graduated from the Ecole Polytechnique in 1993 with a degree in science and engineering, and received his PhD in Statistics from Stanford in 1998. He received the 2006 NSF Alan T. Waterman Award, the 2013 Dannie Heineman Prize from Göttingen, SIAM’s 2010 George Pólya Prize, and the 2015 AMS-SIAM George David Birkhoff Prize in Applied Mathematics. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences.
The Wald Lectures are delivered this year at JSM in Baltimore.
The 2017 Wald Lectures: What’s happening in Selective Inference?
For a long time, science has operated as follows: a scientific theory can only be tested empirically, and only after it has been advanced. Predictions are deduced from the theory and compared with the results of decisive experiments so that they can be falsified or corroborated. This principle, formulated independently by Karl Popper and by Ronald Fisher, has guided the development of scientific research and statistics for nearly a century. We have, however, entered a new world where large data sets are available prior to the formulation of scientific theories. Researchers mine these data relentlessly in search of new discoveries and it has been observed that we have run into the problem of irreproducibility. Consider the April 23, 2013 Nature editorial: “Over the past year, Nature has published a string of articles that highlight failures in the reliability and reproducibility of published research.” The field of Statistics needs to re-invent itself and adapt to this new reality in which scientific hypotheses/theories are generated by data snooping. In these lectures, we will make the case that statistical science is taking on this great challenge and discuss exciting achievements.
An example of how these dramatic changes in data acquisition have informed a new way of carrying out scientific investigation is provided by genome-wide association studies (GWAS). Nowadays we routinely collect information on an exhaustive collection of possible explanatory variables to predict an outcome or understand what determines an outcome. For instance, certain diseases have a genetic basis and an important biological problem is to find which genetic features (e.g., gene expressions or single nucleotide polymorphisms) are important for determining a given disease. Even though we believe that a disease status depends on a comparatively small set of genetic variations, we have a priori no idea about which ones are relevant and therefore must include them all in our search. In statistical terms, we have an outcome variable and a potentially gigantic collection of explanatory variables, and we would like to know which of the many variables the response depends on. In fact, we would like to do this while controlling the false discovery rate (FDR) or other error measures so that the results of our investigation do not run into the problem of irreproducibility. The lectures will discuss problems of this kind. We introduce “knockoffs,” an entirely new framework for finding the variables on which the response depends while provably controlling the FDR in finite samples and complicated models. The key idea is to make up fake variables, called knockoffs, which are created from knowledge of the explanatory variables alone (requiring neither new data nor knowledge of the response variable) and can be used as a kind of negative control to estimate the FDR (or any other type 1 error measure). We explain how one can leverage haplotype models and genotype imputation strategies, which describe the distribution of alleles at consecutive markers, to design a full multivariate knockoff processing pipeline for GWAS!
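To make the negative-control idea concrete, here is a minimal Python sketch of one selection rule from the knockoff filter literature, often called the "knockoff+" threshold. The abstract itself does not spell out this rule, so treat it as an illustration under assumptions. It assumes each variable j already has a statistic W[j] that tends to be large and positive when the real variable looks more important than its knockoff, and is equally likely to be positive or negative when the variable is irrelevant. How the knockoffs and the statistics are built is not shown, and the function names are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Return the smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q,
    or infinity if no such t exists."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        # Large negative statistics act as the negative control:
        # they estimate how many nulls sit among the large positive ones.
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Indices of the variables selected at target FDR level q."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    # Made-up statistics, for illustration only.
    w = [5.0, 4.0, 3.5, 3.0, -0.5, 2.5, 0.2, -0.1, 2.0, 1.5]
    print(knockoff_select(w, q=0.2))
```

The ratio inside the loop is the estimate of the false discovery proportion at threshold t. The rule picks the most liberal threshold at which that estimate stays below the target level q, and selects nothing when no threshold qualifies.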
The knockoffs machinery is a selective inference procedure in the sense that the method finds as many relevant variables as possible without having too many false positives, thus controlling a type 1 error averaged over the selected variables. We shall discuss other approaches to selective inference, where the goal is to correct for the bias introduced by a model constructed after looking at the data, as is now routinely done in practice. For example, in the high-dimensional linear regression setup, it is common to use the lasso to select variables. It is well known that if one applies classical techniques after the selection step, as if no search had been performed, inference is distorted and can be completely wrong. How then should one adjust the inference so that it is valid? We plan on presenting new ideas from Jonathan Taylor and his group to resolve such issues, as well as from a research group including Berk, Brown, Buja, Zhang and Zhao on post-selection inference.
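The distortion caused by ignoring the selection step can be seen in a toy simulation. The sketch below does not use the lasso and does not implement the methods of Taylor's group or of Berk et al. It swaps in a much simpler setting as an assumption: p independent null z-statistics, from which the one with the largest magnitude is selected and then tested. The Bonferroni cutoff is included only as a crude valid adjustment for comparison, and all names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def rejection_rates(p=20, trials=2000, alpha=0.05, seed=0):
    """Under a global null, select the largest |z| among p statistics and
    report how often it is declared significant by (a) the classical cutoff
    that ignores selection and (b) a Bonferroni-adjusted cutoff."""
    rng = random.Random(seed)
    naive_cut = NormalDist().inv_cdf(1 - alpha / 2)
    adjusted_cut = NormalDist().inv_cdf(1 - alpha / (2 * p))
    naive = adjusted = 0
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        top = max(abs(v) for v in z)  # the "selected" statistic
        naive += top > naive_cut
        adjusted += top > adjusted_cut
    return naive / trials, adjusted / trials


if __name__ == "__main__":
    naive_rate, adjusted_rate = rejection_rates()
    print(f"naive: {naive_rate:.3f}  adjusted: {adjusted_rate:.3f}")
```

With p = 20 and a nominal level of 5%, the naive procedure is expected to reject in roughly 1 - 0.95^20, or about 64%, of the simulated data sets even though no effect exists, while the adjusted cutoff stays near the nominal level. The post-selection methods discussed in the lectures aim for valid inference with far less conservatism than this blunt correction.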
Some of the work I will be presenting is joint with many great young researchers including Rina Foygel Barber, Lucas Janson, Jinchi Lv, Yingying Fan, Matteo Sesia as well as many other graduate students and post-docs, and also with Professor Chiara Sabatti who played an important role in educating me about pressing contemporary problems in genetics. I am especially grateful to Yoav Benjamini: Yoav visited Stanford in the Winter of 2011 and taught a course titled “Simultaneous and Selective Inference”. These lectures inspired me to contribute to the enormously important enterprise of developing statistical theory and tools adapted to the new scientific paradigm — collect data first, ask questions later.