Runze Li received his PhD in Statistics from the University of North Carolina at Chapel Hill in 2000. He is currently the Eberly Family Chair in Statistics and Professor of Public Health Sciences in the Department of Statistics, Pennsylvania State University at University Park. Runze's research interests include theory and methodology development in variable selection, feature screening, robust statistics, and nonparametric and semiparametric regression. His interdisciplinary research aims to promote the better use of statistics in social behavioral research, neuroscience research, and climate studies. He served as Co-Editor of the Annals of Statistics from 2013 to 2015. Runze Li is a Fellow of the American Association for the Advancement of Science, the ASA, and the IMS. Recent honors include the 2017 ICSA Distinguished Achievement Award; the Faculty Research Recognition Awards for Outstanding Collaborative Research, College of Medicine, Penn State University, in 2018; and the Distinguished Mentoring Award, Eberly College of Science, Penn State University, in 2023. This Medallion Lecture will be given at JSM Toronto, August 5–10, 2023.

Feature screening for ultrahigh dimensional data: Methods and Applications

Analysis of ultrahigh-dimensional data plays a critical role in big data analysis. Feature screening aims to reduce the dimensionality quickly by filtering out as many irrelevant variables as possible without excluding important ones. Feature screening is thus an important statistical tool for analyzing ultrahigh-dimensional data, and there have been many developments on this topic. In this lecture, I will present a general strategy for feature screening and some of its applications.

My lecture will start with connections between sure independence screening and the t-test for the high-dimensional two-sample mean problem with false discovery rate control. I will then present an overview of marginal screening procedures for linear models and generalized linear models along with their theoretical properties, and further introduce a general strategy for model-based feature screening procedures. Next, I will present a brief review of model-free feature screening procedures, and I will also briefly introduce conditional feature screening procedures.
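To make the marginal screening idea concrete, here is a minimal sketch of sure independence screening for a linear model: each feature is ranked by its absolute marginal correlation with the response, and the top d = ⌊n / log n⌋ features are retained. The function name `sis_screen` and the simulated data are my own illustration, not code from the lecture.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Sure independence screening: keep the d features with the
    largest absolute marginal correlation with the response y."""
    n, p = X.shape
    if d is None:
        d = int(np.floor(n / np.log(n)))  # a commonly used cutoff
    Xc = X - X.mean(axis=0)               # center each column
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.sort(np.argsort(-np.abs(corr))[:d])

# Toy example: n = 100 observations, p = 1000 features,
# only the first three features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))
y = X[:, 0] * 3 + X[:, 1] * 3 + X[:, 2] * 3 + rng.standard_normal(100)
keep = sis_screen(X, y)  # the true signals should survive screening
```

Here screening shrinks the candidate set from 1000 features to 21, after which a refined method such as penalized regression can be applied to the retained features.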

In the last part of my talk, I will present an application of feature screening via an empirical analysis of online job advertisement data to quantify the differences in returns to skills. Studying the relationship between the posted salary and the job requirements in online labor markets is of great interest in both labor economics and statistics. We propose a new feature screening method, Absolute Distribution Difference Sure Independence Screening (ADD-SIS), to select important skill words for an interval-valued response (such as an annual salary of $75k–80k). The marginal utility for feature screening is based on the difference of distribution functions estimated by nonparametric maximum likelihood, which makes full use of the interval information. In the empirical analysis, we study the text of job advertisements for data scientists and data analysts on a major Chinese online job posting website and explore the skill words that are important for salary. We find that skill words like optimization, long short-term memory (LSTM), convolutional neural networks (CNN), and collaborative filtering are positively correlated with salary, while words like Excel, Office, and data collection may contribute negatively to salary.
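The ADD-SIS marginal utility can be sketched as follows: for each binary skill-word indicator, compare the salary distribution functions of postings that contain the word and postings that do not, and rank words by the average absolute difference between the two curves. ADD-SIS proper estimates the distribution function of the interval-valued salary by nonparametric maximum likelihood; as a simplification, this sketch substitutes interval midpoints and empirical CDFs, and all names and data below are illustrative.

```python
import numpy as np

def add_utility(y_lo, y_hi, x, grid):
    """Simplified ADD-SIS-style marginal utility for one binary feature:
    the average absolute difference between the response distribution
    functions estimated in the x == 1 and x == 0 groups. (ADD-SIS itself
    uses a nonparametric MLE for the interval-valued response; interval
    midpoints and empirical CDFs are used here as a stand-in.)"""
    mid = (y_lo + y_hi) / 2.0
    def ecdf(vals):
        return np.searchsorted(np.sort(vals), grid, side="right") / len(vals)
    return np.abs(ecdf(mid[x == 1]) - ecdf(mid[x == 0])).mean()

# Toy example: a relevant skill word shifts salary up; a noise word does not.
rng = np.random.default_rng(1)
n = 400
relevant = rng.integers(0, 2, n)
noise = rng.integers(0, 2, n)
center = 60.0 + 15.0 * relevant + rng.normal(0.0, 5.0, n)  # salary in $k
y_lo, y_hi = center - 2.5, center + 2.5                    # interval response
grid = np.linspace(40.0, 100.0, 200)
u_relevant = add_utility(y_lo, y_hi, relevant, grid)
u_noise = add_utility(y_lo, y_hi, noise, grid)  # u_relevant should dominate
```

Ranking all candidate skill words by this utility and keeping the top-scoring ones mirrors how ADD-SIS screens the vocabulary of job advertisements before a more detailed analysis of returns to skills.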