Boaz Nadler earned his BSc, MSc, and PhD degrees, all from Tel-Aviv University, Israel. After his PhD, he spent three years as a Gibbs Instructor / Assistant Professor at Yale University. Since 2005, he has been a faculty member at the Weizmann Institute of Science in Israel, where he is now a full professor. His research interests include mathematical statistics, statistical machine learning, and, more generally, applied and computational mathematics. He is also interested in statistical applications in signal processing and computational biology. In 2018 he was awarded the Abarbanel Prize in applied mathematics by the Israel Mathematical Union, and in 2023 he was named a Fellow of the IMS.

Boaz Nadler’s Medallion Lecture will be at JSM Nashville, August 2–8, 2025.


Finding Structure in High-Dimensional Data: Statistical and Computational Challenges

A fundamental task in the statistical analysis of data is to detect and estimate interesting structures hidden in it. In this talk I’ll focus on aspects of this problem in a high-dimensional regime, where each observed sample has many coordinates and the number of samples is limited. We will review some well-known results as well as some more recently derived phenomena.

Specifically, we will show the following in a high-dimensional regime.

First, standard methods to detect structure in high dimensions, such as principal component analysis (PCA), may not work well. In fact, we will also discuss some fundamental limitations on structure discovery: regardless of the method deployed, it may be impossible to detect weak structures in the data.
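This breakdown of PCA can be seen in a short simulation of the standard spiked covariance model (the model and all parameters below are illustrative choices for this sketch, not necessarily those of the talk): with p/n = 5, a rank-one spike is recoverable by PCA only when its strength exceeds the classical detection threshold of roughly sqrt(p/n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000              # few samples, many coordinates: p/n = 5
v = np.zeros(p)
v[0] = 1.0                    # hidden unit direction of the planted spike

def top_pc_overlap(spike_strength):
    """Correlation between the leading sample PC and the planted direction v."""
    # x_i = sqrt(lambda) * g_i * v + z_i : rank-one spike plus white Gaussian noise
    g = rng.standard_normal(n)
    Z = rng.standard_normal((n, p))
    X = np.sqrt(spike_strength) * g[:, None] * v[None, :] + Z
    # leading right-singular vector of X = leading eigenvector of the sample covariance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return abs(Vt[0] @ v)

weak = top_pc_overlap(0.5)     # well below the threshold sqrt(p/n) ≈ 2.24
strong = top_pc_overlap(20.0)  # well above the threshold
print(f"weak spike:   |<v_hat, v>| ≈ {weak:.2f}")
print(f"strong spike: |<v_hat, v>| ≈ {strong:.2f}")
```

Below the threshold the leading sample eigenvector is essentially uncorrelated with the planted direction; well above it, the overlap is substantial.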

Second, sparsity can come to the rescue: when the structure to be discovered is concentrated in a small unknown subset of relatively few variables, it can be detected with far fewer samples than traditional methods require. However, this may bring with it significant statistical and computational challenges.
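As an illustrative sketch (the model and parameters here are standard textbook choices, not necessarily those of the talk), consider a spike supported on only k = 2 coordinates, with strength below the PCA detection threshold sqrt(p/n). Vanilla PCA fails, but a simple coordinate-wise statistic in the spirit of the diagonal thresholding method of Johnstone and Lu still singles out the support:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 2000, 2        # n << p; spike supported on only k coordinates
lam = 3.0                     # spike strength, below the PCA threshold sqrt(p/n) ≈ 4.47

support = np.array([3, 17])   # hypothetical support of the sparse direction
v = np.zeros(p)
v[support] = 1.0 / np.sqrt(k)

g = rng.standard_normal(n)
X = np.sqrt(lam) * g[:, None] * v[None, :] + rng.standard_normal((n, p))

# vanilla PCA: leading right-singular vector of the data matrix
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pca_overlap = abs(Vt[0] @ v)

# sparse alternative: rank coordinates by their sample variance; support
# coordinates have variance 1 + lam/k, null coordinates have variance 1
sample_var = X.var(axis=0)
top = np.sort(np.argsort(sample_var)[::-1][:k])

print(f"PCA overlap with v: {pca_overlap:.2f}")
print(f"coordinates with largest sample variance: {top}")
```

With the same n samples for which PCA returns an essentially uninformative direction, the support coordinates stand out among the largest sample variances.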

Finally, some interesting phenomena may occur in semi-supervised learning settings, where for a few of the samples we are also given their underlying labels. Specifically, we will consider a simple example involving a mixture of two high-dimensional Gaussians with a sparse difference in their means. When the separation is small and the number of unlabeled samples is not extremely large, unsupervised learning of the mixture is believed to be computationally challenging, and there are no known polynomial-time methods that succeed at this task. In a fully supervised setting, given only a small labeled dataset in which we know from which Gaussian each sample came, accurately estimating the Gaussians and their difference is impossible from an information-theoretic perspective, regardless of computational issues. However, as we will show, by combining these relatively small labeled and unlabeled datasets, in some parameter regimes it is possible to accurately estimate the two Gaussians with a simple polynomial-time semi-supervised learning algorithm. In simple terms, in sparse high-dimensional settings semi-supervised learning may offer non-trivial computational benefits.
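A minimal sketch of this idea, under an assumed symmetric two-component model x = y·μ + noise with labels y = ±1 and a sparse mean difference μ (all names, parameters, and steps below are hypothetical illustrations, not the algorithm of the talk): use the few labeled samples to crudely estimate μ and select its support, then refine the estimate on that low-dimensional support with the unlabeled data via hard-EM iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, mu0 = 500, 5, 1.5                  # dimension, sparsity, per-coordinate signal
n_lab, n_unlab = 30, 2000                # few labeled samples, many unlabeled ones

mu = np.zeros(p)
mu[:k] = mu0                             # sparse mean difference (unknown to the algorithm)

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    return y[:, None] * mu[None, :] + rng.standard_normal((n, p)), y

X_lab, y_lab = sample(n_lab)
X_unlab, _ = sample(n_unlab)

# Step 1: crude supervised estimate of mu and support selection from the labels;
# each coordinate of mu_lab is mu_j plus noise of standard deviation 1/sqrt(n_lab)
mu_lab = (y_lab[:, None] * X_lab).mean(axis=0)
thresh = 1.1 * np.sqrt(2 * np.log(p) / n_lab)
S = np.flatnonzero(np.abs(mu_lab) > thresh)    # estimated support

# Step 2: refine on the estimated support using the unlabeled data (hard-EM:
# alternate between classifying points and re-estimating the restricted mean)
mu_S = mu_lab[S]
for _ in range(10):
    labels = np.sign(X_unlab[:, S] @ mu_S)
    mu_S = (labels[:, None] * X_unlab[:, S]).mean(axis=0)

mu_semi = np.zeros(p)
mu_semi[S] = mu_S

err_lab = np.linalg.norm(mu_lab - mu)
err_semi = np.linalg.norm(mu_semi - mu)
print(f"labeled-only error:    {err_lab:.2f}")
print(f"semi-supervised error: {err_semi:.2f}")
```

The labeled-only estimate is swamped by noise accumulated over all p coordinates, while the semi-supervised refinement, working only on the estimated support, is far more accurate — the qualitative point of the abstract, under this toy model.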