Yingying Fan received her BS in Statistics and Finance from the University of Science and Technology of China in 2003, and her PhD in Operations Research and Financial Engineering from Princeton University in 2007. She is currently Centennial Chair in Business Administration and Professor in the Data Sciences and Operations Department of the Marshall School of Business at the University of Southern California. Yingying’s research interests include statistics, data science, machine learning, economics, big data and business applications, and artificial intelligence. Her latest work has focused on statistical inference for networks, texts, and AI models, empowered by some of the most recent developments in random matrix theory and statistical learning theory. She is the recipient of a number of honors including the International Congress of Chinese Mathematicians 45-minute Invited Lecture (2023), Fellow of the IMS (2020) and ASA (2019), the Royal Statistical Society Guy Medal in Bronze (2017), the ASA Noether Young Scholar Award (2013), and the NSF Faculty Early Career Development (CAREER) Award (2012). She is the IMS Editor of Statistics Surveys (2023–25) and an associate editor of journals including Annals of Statistics (2022– ), Information and Inference (2022– ), Journal of Business & Economic Statistics (2018– ), and Journal of the American Statistical Association (2014– ). This Medallion Lecture will be given at the Joint Statistical Meetings in Toronto, August 5–10, 2023.
High-Dimensional Random Forests Estimation and Inference
As a flexible nonparametric learning tool, the random forests algorithm has been widely applied in a broad range of real-world applications with appealing empirical performance, even in the presence of a high-dimensional feature space. Indeed, it is arguably the most widely used nonparametric learning method besides deep learning. Yet, because of its black-box nature, the results produced by random forests can be hard to interpret in many big data applications. This talk contributes to a fine-grained understanding of the random forests algorithm by discussing its consistency and feature selection properties in a general high-dimensional nonparametric regression setting.
Specifically, we derive consistency rates for the random forests algorithm associated with the sample CART splitting criterion used in the original version of the seminal algorithm proposed by Breiman (2001). Our derivation is built on a bias–variance decomposition analysis. Our new theoretical results show that random forests can indeed adapt to high dimensionality and allow for discontinuous regression functions. Our bias analysis takes a global approach that characterizes explicitly how the random forests bias depends on the sample size, tree height, and column subsampling parameter; our variance analysis takes a local approach that bounds the forests variance by bounding the tree variance. A major technical innovation of our work is the introduction of the sufficient impurity decrease (SID) condition, which makes our bias analysis possible and precise. This condition also allows us to characterize the complexity of the random forests algorithm with the sample CART splitting criterion. We verify that many commonly used high-dimensional sparse nonlinear models satisfy the identified SID condition.
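The setting above can be illustrated with a minimal sketch (not the lecture's own implementation or experiments): a high-dimensional sparse regression with a discontinuous component, fit by a random forest whose trees use CART splits with column subsampling. The example uses scikit-learn's RandomForestRegressor; the sample size, dimension, and model below are illustrative choices, not taken from the talk.

```python
# Hypothetical example: a forest with CART splits adapting to a sparse,
# partly discontinuous regression function in a high-dimensional design.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 500, 200                       # many more features than are relevant
X = rng.uniform(size=(n, p))
# Sparse nonlinear regression function: a jump in feature 0 plus a smooth
# term in feature 1; the remaining 198 features are pure noise.
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + np.sin(2 * np.pi * X[:, 1])
y += 0.1 * rng.standard_normal(n)

forest = RandomForestRegressor(
    n_estimators=200,
    max_features=0.3,                 # column subsampling at each split
    random_state=0,
).fit(X, y)

# Impurity-based importances should concentrate on the two relevant features.
top_two = np.argsort(forest.feature_importances_)[-2:]
```

Despite the 198 irrelevant coordinates, the CART impurity decreases concentrate the splits (and hence the importances) on features 0 and 1, which is the kind of adaptation to sparsity that the consistency theory formalizes.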
We further proceed to quantify the usefulness of individual features in random forests learning, which can greatly enhance the interpretability of the learning outcome. Existing studies have shown that some commonly used feature importance measures suffer from bias. In addition, most existing methods lack comprehensive size and power analyses. We approach the problem via hypothesis testing and suggest a general framework of the self-normalized feature-residual correlation test (FACT) for evaluating the significance of a given feature. The vanilla version of our FACT test can suffer from the same bias issue as existing methods in the presence of feature dependence. We exploit the techniques of sample imbalancing and feature conditioning for bias correction. We further incorporate the ensemble idea into the FACT statistic through feature transformations for enhanced power. We formally establish that FACT can provide theoretically justified random forests feature p-values and enjoys appealing power, through nonasymptotic analyses. These new theoretical results and finite-sample advantages of FACT for random forests inference are illustrated with several simulation examples. We also investigate an economic forecasting application in relation to COVID-19 using the suggested FACT framework.
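The raw ingredient behind this framework can be sketched as follows. This is emphatically not the FACT procedure itself (which is self-normalized and adds bias correction via imbalancing, conditioning, and ensembling); it only illustrates the naive feature-residual correlation idea: fit a forest without a candidate feature, then check whether that feature correlates with the held-out residuals. All model and sample-size choices below are illustrative assumptions.

```python
# Hypothetical sketch of a raw feature-residual correlation statistic,
# the vanilla ingredient underlying a FACT-style test.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def residual_correlation(X_tr, y_tr, X_te, y_te, j, seed=0):
    """Correlation between held-out feature j and the residuals of a
    forest trained with feature j removed."""
    forest = RandomForestRegressor(n_estimators=100, random_state=seed)
    forest.fit(np.delete(X_tr, j, axis=1), y_tr)
    resid = y_te - forest.predict(np.delete(X_te, j, axis=1))
    return np.corrcoef(X_te[:, j], resid)[0, 1]

rng = np.random.default_rng(1)
n, p = 400, 10
X = rng.uniform(size=(n, p))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(n)   # only feature 0 matters
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

corr_signal = residual_correlation(X_tr, y_tr, X_te, y_te, 0)  # relevant
corr_null = residual_correlation(X_tr, y_tr, X_te, y_te, 5)    # irrelevant
```

For the relevant feature the residuals retain the unexplained signal, so the correlation is large; for a null feature it hovers near zero. Turning this heuristic into valid p-values under feature dependence is precisely what the bias-corrected, self-normalized FACT construction addresses.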
Despite exciting recent progress toward a better understanding and enhanced interpretation of the random forests algorithm, many open questions remain that deserve further in-depth investigation. The lecture will end with a discussion of these open questions.