Jiashun Jin is Professor of Statistics at Carnegie Mellon University. He received his PhD in Statistics from Stanford University in 2003. He was a faculty member at Purdue University from 2003 to 2007, after which he joined the Department of Statistics at Carnegie Mellon University, where he remains a faculty member today. Dr. Jin’s main research interest is in Large-Scale Inference, especially in the regime where the signals are both rare and weak. In such a regime, many conventional approaches fail, and it is desirable to find new methods and theory appropriate for such a situation.
Dr. Jin was a recipient of the NSF CAREER Award and in 2009 was given the IMS Tweedie New Researcher Award. He is also an elected IMS fellow. Dr. Jin has served as an Associate Editor for several journals, including The Annals of Statistics and Journal of the American Statistical Association. He also served as the organizational co-chair for the Second IMS-China International Conference in Statistics and Probability.
Jiashun Jin’s Medallion Lecture will be delivered at JSM in Seattle, on Wednesday August 12, 2015 at 2:00pm.
New approaches to spectral clustering, with applications to gene microarrays and social network community detection
Pearson’s PCA (principal component analysis) is a flexible and easy-to-use method, but faces challenges in modern settings. How to adapt PCA to problems of contemporary interest is a very active research topic, where new ideas, methods and theory, and applications are in urgent need.
This work is motivated by two application problems. In the first one, we have recently collected a network data set (co-authorship and citation) for statisticians, based on all published papers in AoS, Biometrika, JASA, and JRSS-B, 2003–2012. We are interested in network community detection (i.e., finding tight research groups). In the second one, we have microarray data measured for n different patients from K different classes. The class labels are unknown, and the goal is to estimate them (i.e., clustering).
For the first problem, the main challenge is that the information of the community labels is largely distorted by degree heterogeneity. We attack this using a new method: Spectral Clustering On Ratios of Eigenvectors (SCORE). In SCORE, we obtain the first few leading eigenvectors, “remove” the degree heterogeneity by taking entry-wise ratios between these vectors and the first one, and then clustering by applying the classical k-means. Applying SCORE to the networks aforementioned, we have identified a handful of meaningful communities, such as “Large-Scale Multiple Testing,” “Objective Bayes,” and “Variable Selection.”
For the second problem, the main challenge is that only a small fraction of genes are related to the clustering decision, each contributes weakly. We also use a new method to attack this: Important-Features PCA (IF-PCA). In IF-PCA, we rank the features by the Kolmogorov–Smirnov statistic and then apply PCA to only the small fraction of the most significant features. Here, a difficult problem is how to set the threshold; we attack this by adapting the recent notion of Higher Criticism. We have applied IF-PCA to ten gene microarray data sets, where it consistently outperforms existing procedures, and significantly so in several data sets.
The nice real-data performances of SCORE and IF-PCA motivate careful theoretical studies on high-dimensional clustering. We focus on the most challenging regime where the signals are rare and weak. In the two-dimensional phase space calibrating the signal sparsity and signal strength, we identify the precise demarcations for the Region of Impossibility and the Region of Possibility. In the Region of Impossibility, signals are simply too rare and weak so that any post-selection PCA type of methods fail. In the Region of Possibility, signals are strong enough so some methods may succeed, and IF-PCA is one of such methods. In this sense, IF-PCA has optimal clustering behaviors.
We also study the fundamental limits and the computationally feasible fundamental limits associated with clustering problems. For the former, we consider all possible clustering procedures (maybe NP hard). For the latter, we only consider procedures that are computationally feasible. Similarly, we find the precise demarcations for the Region of Possibility and the Region of Impossibility, and identify optimal clustering procedures.
The study requires very delicate post-selection Random Matrix Theory, and many theoretical results presented are new. The work is connected to the literature on Higher Criticism, sparse PCA, and low-rank matrix recovery.