Kathryn Roeder joined the faculty at Carnegie Mellon University in 1994, where she is now the UPMC Professor of Statistics and Life Sciences in the Departments of Statistics & Data Science and Computational Biology. She earned her PhD in statistics at Pennsylvania State University, after which she was on the faculty at Yale University for six years. In 1997 she received the COPSS Presidents’ Award for the outstanding statistician under age 40. In 2019 she was inducted into the National Academy of Sciences. She was awarded the 2020 COPSS Distinguished Achievement Award and Lectureship. Her research group develops statistical tools for genetic and genomic data to understand the workings of the human brain and their interplay with genetic variation. These tools draw on a range of statistical and machine learning methods, including causal inference, latent space embedding, sparse PCA, and high-dimensional nonparametric techniques.

This IMS Rietz lecture will take place at JSM Nashville, on Sunday, August 3, at 4:00pm.

 

Genomic Inferences in the Era of Black-Box Predictions

Since the advent of high-throughput genomic techniques, myriad statistical challenges have arisen due to high dimensionality and missing data. To obtain sufficient sample size, it is often necessary to combine data across related studies; in the process, valuable data can be lost due to high rates of missingness and due to differing experimental designs. Intriguingly, however, powerful black-box models have been remarkably successful in filling in the missing data. The question that arises is, how can we adjust inferential techniques to account for the additional uncertainty induced by imputation? Here we illustrate several genomic applications in which we overcome these challenges.

While quantitative measurements produced by mass spectrometry proteomics experiments offer a direct way to explore the role of thousands of proteins in molecular mechanisms, analysis of such data is challenging due to the large proportion of missing values. A common strategy is to impute the missing values, but this often introduces systematic bias into downstream analyses when the imputation errors are ignored. We develop a statistical framework inspired by doubly robust estimators that offers valid and efficient inferences for proteomic data. Our framework uses a customized variational autoencoder (VAE) to obtain excellent imputation quality, and a propensity score to debias the imputed outcomes. Our estimator is compatible with the double machine learning framework, which allows us to make additional, meaningful discoveries while maintaining good control of false positives.
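To make the debiasing idea concrete, here is a minimal sketch of a textbook doubly robust (augmented inverse-probability-weighted) estimate of a mean when some outcomes are missing. It is not the estimator from the lecture: the function name `aipw_mean`, the simulated data, the deliberately poor constant imputer, and the use of the true observation probabilities in place of an estimated propensity score are all assumptions made for illustration.

```python
import numpy as np

def aipw_mean(y, observed, m_hat, pi_hat):
    """Doubly robust (AIPW) estimate of E[Y] when some outcomes are missing.

    y        : outcomes; entries where observed == 0 are ignored (may be NaN)
    observed : 1 if the outcome was measured, 0 if it is missing
    m_hat    : imputed (predicted) outcome for every unit
    pi_hat   : estimated probability that each unit's outcome is observed
    """
    y = np.where(observed == 1, y, 0.0)            # neutral placeholder for missing entries
    psi = m_hat + observed / pi_hat * (y - m_hat)  # imputation plus weighted residual correction
    est = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))       # plug-in standard error
    return est, se

# Simulated example (hypothetical data): the true mean of Y is 1.0.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y_full = 1.0 + 2.0 * x + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))              # missingness depends on x
observed = (rng.uniform(size=n) < pi).astype(int)
y = np.where(observed == 1, y_full, np.nan)

m_bad = np.full(n, 0.3)                            # deliberately biased imputation
complete_case = y_full[observed == 1].mean()       # ignores the missingness mechanism
est, se = aipw_mean(y, observed, m_bad, pi)
print(f"complete-case mean: {complete_case:.3f}")
print(f"AIPW estimate: {est:.3f} (SE {se:.3f})")
```

In this simulation the complete-case mean is biased upward because units with larger `x` are more likely to be observed, while the weighted residual term corrects the poor imputation. The correction works as long as either the imputation model or the observation probabilities are accurate, which is the sense in which such estimators are called doubly robust.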

Recent advances in single-cell technologies enable joint profiling of multiple omics. These profiles can reveal the complex interplay of different regulatory layers in single cells; still, new challenges arise when integrating datasets with some features shared across experiments and others exclusive to a single source. Combining information across these sources is called mosaic integration. The missing data can be imputed with surprising success using our customized VAE, but conducting inference across these integrated samples is still challenging. We frame this problem as semi-supervised learning: the modality of interest is measured in a small labeled dataset and unmeasured in a much larger unlabeled sample (the vanishing-overlap regime). We extend available theoretical results to accommodate our setting. Our methods apply to a wide range of smooth statistical targets, including means, linear coefficients, quantiles, and causal effects, and remain valid under high-dimensional nuisance estimation, distributional shift between labeled and unlabeled samples, and overlap that vanishes as sample size increases. We construct estimators that are doubly robust and asymptotically normal by deriving influence functions under this regime. A key insight is that classical root-n convergence fails under vanishing overlap; we instead provide corrected asymptotic rates that capture the impact of the decay in overlap. We apply our methods to multi-omic single-cell samples.
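The semi-supervised setup can be illustrated with a simplified sketch for the easiest target, a mean. It assumes no distributional shift between the labeled and unlabeled samples and a fixed black-box predictor, so it is far narrower than the setting described above; the function name `semi_supervised_mean` and the simulated data are hypothetical.

```python
import numpy as np

def semi_supervised_mean(y_lab, pred_lab, pred_unlab):
    """Estimate E[Y] from predictions on unlabeled units plus a labeled correction.

    y_lab      : measured outcomes in the small labeled sample
    pred_lab   : black-box predictions for the labeled units
    pred_unlab : black-box predictions for the large unlabeled sample
    Assumes both samples come from the same distribution (no shift).
    """
    n, N = len(y_lab), len(pred_unlab)
    resid = y_lab - pred_lab                     # prediction error, seen only where Y is measured
    est = pred_unlab.mean() + resid.mean()       # predictions corrected by the average residual
    var = pred_unlab.var(ddof=1) / N + resid.var(ddof=1) / n
    return est, np.sqrt(var)

# Simulated example (hypothetical data): the true mean of Y is 0.
rng = np.random.default_rng(1)
N, n = 50000, 500                                # unlabeled sample is 100x larger
x_unlab = rng.normal(size=N)
x_lab = rng.normal(size=n)
y_lab = np.sin(x_lab) + x_lab + 0.5 * rng.normal(size=n)

def predict(x):
    return x + 0.2                               # imperfect, biased predictor

est, se = semi_supervised_mean(y_lab, predict(x_lab), predict(x_unlab))
print(f"estimate: {est:.3f} (SE {se:.3f})")
```

In this sketch the residual term, which is divided by the labeled sample size `n`, dominates the variance when `n` is much smaller than `N`. That gives some intuition for why precision in such problems is governed by the labeled sample rather than the total sample size, although it does not reproduce the corrected rates mentioned in the abstract.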

Single-cell RNA sequencing used in conjunction with CRISPR-based perturbation (Perturb-seq) can uncover the function of genes; however, it can be costly to perform as many perturbation experiments as desired. Ideally, it would be possible to use a model to predict the outcome of perturbations related to those already performed. Despite their high dimensionality and sparsity, these data have proven amenable to analysis by deep learning methods, which provide us with a framework for this task. We use a model combining a VAE and denoising diffusion models to generate realistic single-cell RNA-seq data. Remarkably, we have had success in generating data for perturbation experiments that were never performed, provided we have a rich set of data from related experiments. We use ideas from the semiparametric inference literature to obtain inferential techniques that are somewhat successful in this challenging setting.