Institute of Mathematical Statistics | Preview: Rong Ma, IMS Lawrence D. Brown PhD Student Award Lecture

Preview: Rong Ma, IMS Lawrence D. Brown PhD Student Award Lecture

February 16, 2022

Rong Ma is currently a postdoctoral scholar in the Statistics Department at Stanford University, advised by Professor David Donoho. Prior to that, he received his PhD in Biostatistics in 2021 from the University of Pennsylvania, jointly advised by Professors T. Tony Cai and Hongzhe Li. He received a BS in Statistics in 2015 from Nankai University, China, and an MS in Statistics in 2016 from the University of Wisconsin, Madison. His research interests lie in understanding and strengthening the statistical foundations of data science. Currently his research focuses on (i) statistical inference for large disordered data and high-dimensional models, (ii) improving the theoretical understanding of data visualization and dimension reduction algorithms, and (iii) their applications in interdisciplinary research such as microbiomics and integrative genomics, among many other fields.

Rong Ma was another of the three winners of the Lawrence D. Brown PhD Student Awards, and will be presenting this lecture in a special session at the 2022 IMS Annual Meeting in London, UK, June 27–30, 2022.

Statistical Inference for High-Dimensional Generalized Linear Models with Binary Outcomes

Generalized linear models (GLMs) with binary outcomes are ubiquitous in modern data-driven scientific research: binary outcome variables arise frequently in applications such as genetics, metabolomics, finance, and econometrics, and play important roles in many observational studies. In this talk, I will present our recent work in which we develop a unified statistical inference framework for high-dimensional binary GLMs with general link functions.
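For readers less familiar with the setting, the model class can be written out as follows. The notation here (n observations, p covariates, link function f, coefficient vector β) is assumed for illustration and is not taken from the talk itself.

```latex
\[
  \mathbb{P}(y_i = 1 \mid x_i) = f(x_i^\top \beta),
  \qquad y_i \in \{0, 1\}, \quad x_i \in \mathbb{R}^p, \quad i = 1, \dots, n,
\]
% "High-dimensional" means that p may be comparable to or larger than n,
% typically with a sparse coefficient vector \beta.
% Two standard choices of the link function f:
\[
  f_{\mathrm{logistic}}(u) = \frac{e^{u}}{1 + e^{u}}
  \quad \text{(canonical)},
  \qquad
  f_{\mathrm{probit}}(u) = \Phi(u)
  \quad \text{(non-canonical)},
\]
% where \Phi is the standard normal distribution function.
```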

Both unknown and known design distribution settings are considered. For the former, we propose a two-step procedure for constructing confidence intervals (CIs) and performing statistical tests for the regression coefficients of a given high-dimensional binary GLM. First, a penalized maximum-likelihood estimator is used to estimate the high-dimensional regression vector; second, a Link-Specific Weighting (LSW) method is proposed to correct the bias of the penalized estimator. CIs and statistical tests are then constructed by quantifying the uncertainty of the LSW estimator. The asymptotic normality of the LSW estimator is established, and the validity of the constructed CIs and statistical tests is justified.

A key methodological advance is the construction of the link-specific weights. With this novel weight construction, the LSW method is shown to be effective for a general class of link functions, covering both canonical and non-canonical binary GLMs. Furthermore, the method is effective for general unknown sub-Gaussian designs with a regular population design covariance matrix.

The minimax optimality of CIs for a single regression coefficient of binary GLMs with general link functions is established, and our proposed CIs are shown to achieve the optimal expected length up to a logarithmic factor over the sparse regime. The analysis provides important insights into the adaptivity of the optimal CIs with respect to a collection of nested parameter spaces indexed by the sparsity of the coefficients. It is shown that adaptive CIs for individual coefficients can be constructed only in the ultra-sparse regime. New lower bound techniques are developed, which may be of independent interest for other high-dimensional binary GLM inference problems. Moreover, for both theoretical and practical interest, we study the optimal CIs and statistical tests in the case of known design distributions.
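The first step of the two-step procedure described above, a penalized maximum-likelihood fit under a general link function, can be sketched in a few lines. This is only an illustration under stated assumptions: an L1 (lasso-type) penalty, a plain proximal-gradient solver, and hypothetical helper names such as `fit_penalized_glm`. It is not the authors' implementation (which is in the R package SIHR), and it does not reproduce the LSW bias-correction step, whose weights are not specified in this preview. The factor f'/(f(1-f)) in the gradient is simply what differentiating the Bernoulli log-likelihood produces.

```python
import math
import random


def logistic(u):
    """Logistic link: returns (f(u), f'(u)), computed in a numerically stable way."""
    if u >= 0:
        f = 1.0 / (1.0 + math.exp(-u))
    else:
        e = math.exp(u)
        f = e / (1.0 + e)
    return f, f * (1.0 - f)


def probit(u):
    """Probit link: returns (Phi(u), phi(u)) for the standard normal."""
    f = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    df = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return f, df


def soft_threshold(z, t):
    """Proximal operator of t * |.|, used to apply the L1 penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def neg_loglik_gradient(X, y, beta, link):
    """Gradient of the average negative Bernoulli log-likelihood for a general link."""
    n, p = len(X), len(beta)
    grad = [0.0] * p
    for xi, yi in zip(X, y):
        eta = sum(a * b for a, b in zip(xi, beta))
        f, df = link(eta)
        f = min(max(f, 1e-10), 1.0 - 1e-10)  # guard against division by zero
        r = df / (f * (1.0 - f)) * (yi - f)  # score contribution of observation i
        for j in range(p):
            grad[j] -= xi[j] * r / n
    return grad


def fit_penalized_glm(X, y, link, lam, step=0.5, n_iter=200):
    """L1-penalized MLE via proximal gradient descent (assumed solver, fixed step)."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = neg_loglik_gradient(X, y, beta, link)
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta


def simulate(n, p, beta, link, seed=0):
    """Draw a standard Gaussian design and binary outcomes from the given link."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = []
    for xi in X:
        prob = link(sum(a * b for a, b in zip(xi, beta)))[0]
        y.append(1 if rng.random() < prob else 0)
    return X, y


if __name__ == "__main__":
    true_beta = [2.0] + [0.0] * 9  # one active coefficient out of ten
    for name, link in (("logistic", logistic), ("probit", probit)):
        X, y = simulate(300, 10, true_beta, link, seed=1)
        beta_hat = fit_penalized_glm(X, y, link, lam=0.05)
        print(name, [round(b, 2) for b in beta_hat])
```

The penalized estimate is shrunk toward zero, which is why a bias-correction step such as LSW is needed before CIs or tests can be built from it.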

Simulation studies indicate several practical advantages of the LSW method over existing methods. Specifically, our proposed method is flexible with respect to the underlying link function and computationally efficient. The proposed CIs have more accurate empirical coverage probabilities and shorter lengths. For hypothesis testing in the sparse setting, the proposed test is more powerful than existing likelihood ratio tests. In addition, an analysis of a real single-cell RNA-seq data set yields interesting biological insights that fit well with the current literature on cellular immune response mechanisms as characterized by single-cell transcriptomics. Our proposed method has been included in the R package SIHR, which is available from CRAN.

This is joint work with Tony Cai and Zijian Guo, carried out while I was a PhD candidate in biostatistics at the University of Pennsylvania.