Sandrine Dudoit is Executive Associate Dean in the College of Computing, Data Science, and Society and Professor in the Department of Statistics and the School of Public Health at the University of California, Berkeley. She obtained a BSc (1992) and an MSc (1994) in Mathematics from Carleton University, Canada. She earned a PhD in Statistics from UC Berkeley (1999) under the supervision of Professor Terence P. Speed. She was a postdoctoral fellow in the laboratory of Professor Patrick O. Brown, in the Department of Biochemistry, at Stanford University (1999–2001). She was named Fellow of the American Statistical Association (2010) and Fellow of the Institute of Mathematical Statistics (2021).

Much of Professor Dudoit’s research concerns the development and application of statistical learning methods and software for the analysis of high-throughput -omic data in both basic biology and precision health. Her methodological interests lie in high-dimensional statistical learning and include exploratory data analysis, unsupervised learning (e.g., cluster analysis, dimensionality reduction), loss-based estimation with cross-validation (e.g., density estimation, classification, regression, model selection), and causal inference. She is a founding core developer of the Bioconductor Project (https://www.bioconductor.org/), an open-source and open-development software project for the analysis of biomedical and genomic data.

This IMS Medallion lecture will take place at JSM Nashville, on Monday, August 4, at 8:30am.

Learning from Data in Single-Cell Transcriptomics

Novel measurement platforms are allowing researchers to investigate biological processes at ever-increasing scale and ever-finer resolution, with transformative impact on both fundamental biology and personalized health. Single-cell spatially resolved transcriptomics allows the high-throughput measurement of gene expression levels for entire genomes at the resolution of single cells (vs. pools of cells), while simultaneously recording the spatial location of cells and molecules within tissues. Such resolution is crucial to address many important biological and medical questions, such as the study of stem cell differentiation, the detection of rare mutations in cancer, or the discovery of cellular subtypes in the brain.

The availability of reliable and efficient statistical methods and software is a limiting factor in our ability to reap the benefits from novel biotechnologies, which generate massive amounts of data at a rate outpacing analysis capabilities. The data are complex in a variety of ways, including dimensionality, heterogeneity (in terms of data types, levels of processing, and sources), quality, and correlation structures. Transcriptomic studies provide a great example of the range of questions one encounters in a data science workflow, where the data are complex, questions are not always clearly formulated, there are multiple analysis steps, and drawing on rigorous statistical principles and methods is essential to derive meaningful and reliable biological results.

In this lecture, I will provide a survey of statistical questions related to the analysis of single-cell transcriptome sequencing (scRNA-Seq) data to investigate the differentiation of stem cells in the brain. I will discuss exploratory data analysis and expression quantitation and, as for DNA microarrays at the turn of the century, illustrate the impact of data quality and preprocessing on the results of downstream analysis. I will present the Slingshot method for inferring cellular lineages from scRNA-Seq data, which comprises two main steps: (1) the inference of the global lineage structure (i.e., the number of lineages and where they branch) using a cluster-based minimum spanning tree; and (2) the inference of cell pseudotimes along each lineage using simultaneous principal curves.
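
As a rough illustration of the first step only, the sketch below builds a minimum spanning tree over cluster centroids in a reduced-dimensional space and reports which clusters act as branch points. It is not the Slingshot implementation: the function name `cluster_mst`, the toy data, and the use of plain Euclidean distance between centroids are assumptions made for this example, and the actual method may measure distances between clusters differently.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def cluster_mst(embedding, labels):
    """Return cluster ids and the MST edges between cluster centroids.

    Hypothetical helper for illustration; Euclidean distance between
    centroids is a simplifying assumption.
    """
    ids = np.unique(labels).tolist()
    # One centroid per cluster in the reduced-dimensional space.
    centroids = np.vstack([embedding[labels == c].mean(axis=0) for c in ids])
    # Dense matrix of pairwise distances between centroids.
    dist = squareform(pdist(centroids))
    # Zero entries are read as "no edge", so centroids must be distinct.
    mst = minimum_spanning_tree(dist).tocoo()
    edges = sorted(
        (ids[min(i, j)], ids[max(i, j)]) for i, j in zip(mst.row, mst.col)
    )
    return ids, edges


# Toy data: four clusters laid out in a Y shape, four cells per cluster.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [2.0, -1.0]])
offsets = np.array([[-0.1, 0.0], [0.1, 0.0], [0.0, 0.1], [0.0, -0.1]])
embedding = np.vstack([c + offsets for c in centers])
labels = np.repeat(np.arange(4), len(offsets))

ids, edges = cluster_mst(embedding, labels)

# A cluster touching three or more tree edges is a candidate branch point.
degree = {c: sum(c in e for e in edges) for c in ids}
branch_points = [c for c in ids if degree[c] >= 3]

print("MST edges:", edges)
print("Branch points:", branch_points)
```

In this toy layout the tree connects cluster 0 to cluster 1, which then splits toward clusters 2 and 3, so two lineages would share the segment from 0 to 1. The second step, fitting simultaneous principal curves to obtain pseudotimes, is not sketched here.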

I will also discuss tradeSeq, a generalized additive model framework based on the negative binomial distribution that allows flexible inference of both within-lineage and between-lineage differential expression. Finally, I will address differential expression analysis in spatial transcriptomics and outline a new approach based on a genewise Poisson generalized linear model to identify differentially expressed genes both within and between samples while accounting for the spatial structure of the data.