Florentina Bunea is Professor of Statistics in the Department of Statistics and Data Science at Cornell University. She received her PhD in Statistics from the University of Washington, Seattle, in 2000. Her research focuses on statistical machine learning methodology and theory. Her work spans model selection theory, optimal estimation in high-dimensional models with low-dimensional structures, mixture models, clustering, inference in factor models, and elements of optimal transport. Her most recent interests revolve around interpretable machine learning methods and theory, with applications to LLM evaluation. She has served on the editorial boards of the Annals of Statistics, JASA, Bernoulli, the Annals of Applied Statistics, and JRSSB, among others. She is an elected Fellow of the IMS and a recipient of the Cornell University Bowers Research Excellence Award.
This IMS Medallion lecture will take place at JSM Nashville, on Tuesday, August 5, at 8:30am.
Softmax Mixture Ensembles for Interpretable Latent Discovery
Extracting latent information from complex data sets plays a central role in statistics, machine learning and AI. This lecture will explore solutions to this problem for data that can be well modeled via an ensemble of discrete mixtures with shared mixture components.
The first model of this type, the topic model, was originally proposed four decades ago for extracting semantically meaningful, latent topics from a text corpus. The model is now used as an exploratory tool in virtually any scientific area where a model for a collection of multinomial samples with latent structure is of interest. The basic topic model formulation starts by associating to each observed multinomial sample its generating p-dimensional probability vector. The topic model assumption is that each of these vectors is a mixture of K latent, p-dimensional probability vectors that are common to the ensemble, with sample-specific mixture weights. In the original jargon, adopted in this lecture, a sample is a document, viewed as the vector of relative frequencies of its constituent words over a given vocabulary; the mixture components are topics covered by the entire corpus; and the mixture weights are the proportions in which a document (sample) covers each topic. Estimation of this collection of non-parametrized, discrete mixture models enables the evaluation of a corpus (collection of samples) in terms of its latent topical content.
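In matrix form (a notational sketch only; the symbols below are not part of the lecture abstract), writing \(\pi_i\) for the probability vector generating document \(i\), the assumption above reads
\[
\pi_i = A\, w_i, \qquad A = [A_{\cdot 1}, \dots, A_{\cdot K}] \in \mathbb{R}^{p \times K}, \qquad w_i \in \Delta_K,
\]
where each column \(A_{\cdot k}\) is a latent topic lying in the p-dimensional probability simplex and shared across the corpus, \(\Delta_K\) denotes the K-dimensional probability simplex, and \(w_i\) collects the mixture weights specific to document \(i\). Stacking the n documents gives the factorization \(\Pi = AW\), whose constrained estimation is the subject of Part 1.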
Part 1 of this lecture will briefly review methods with statistical guarantees for the basic topic model. Computationally efficient estimation of the model parameters (mixture weights and mixture components) reduces to performing non-negative matrix factorization under constraints. While many such constraints can be considered, I will present those that simultaneously allow for model identifiability and computational tractability, as well as minimax-rate optimal, interpretable topic estimation. Moreover, the cross-entropy mixture weight estimates, in any identifiable topic model, have a notable property: they automatically adapt to the unknown sparsity of the true mixture weights, without extra regularization. Furthermore, a one-step correction of these potentially sparse estimates allows for rigorous inference on the mixture weights. Taken together, these theoretically justified properties enable nuanced analyses not only at the corpus level, but also at the document level. However, both theory and practice suggest that the quality of estimation may deteriorate when p is very large. In addition, by definition, the basic topic model cannot incorporate information on the support points of the p-dimensional probability vectors that are modeled as mixtures.
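As a concrete instance of the cross-entropy weight estimates mentioned above (a sketch in assumed notation, with \(\hat{A}\) a previously estimated topic matrix and \(X_i\) the vector of observed word frequencies in document \(i\)), the weights of document \(i\) may be estimated by
\[
\hat{w}_i \in \arg\min_{w \in \Delta_K} \; - \sum_{j=1}^{p} X_{ij} \log \big( \hat{A} w \big)_j ,
\]
that is, by minimizing the cross-entropy between the observed word frequencies and the fitted mixture \(\hat{A} w\) over the probability simplex. It is this type of estimate that, per the properties reviewed in Part 1, adapts to the unknown sparsity of the true weights without added regularization and admits a one-step correction for inference.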
Part 2 of this lecture will offer solutions to these issues and cover contemporary extensions of this model, inspired by LLM technology, in which each mixture component (topic) has a softmax parametrization relative to a large collection of p feature vectors from an L-dimensional space, with L < p. Since a topic is in one-to-one correspondence with its L-dimensional softmax parameter, it is directly interpretable in the embedding space. I will discuss very recent theoretical and practical results on an EM algorithm developed for ensembles of softmax mixtures. In identifiable models, and under specific initialization schemes, the EM algorithm provably yields rate-optimal parameter estimates; the mixture weight estimates continue to enjoy the adaptation to sparsity established for the non-parametrized model. Just as the softmax parametrization has come to play a prominent role in the generation of any LLM output, softmax mixtures can become important tools for estimating and interpreting the topical richness explored by algorithms in generating such output.
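For concreteness, a sketch of the softmax parametrization of a topic, in assumed notation: given feature (embedding) vectors \(x_1, \dots, x_p \in \mathbb{R}^L\) and a topic-specific parameter \(\beta_k \in \mathbb{R}^L\), the k-th mixture component has entries
\[
A_{jk}(\beta_k) = \frac{\exp\!\big(x_j^\top \beta_k\big)}{\sum_{\ell=1}^{p} \exp\!\big(x_\ell^\top \beta_k\big)}, \qquad j = 1, \dots, p,
\]
so that each p-dimensional topic is determined by, and interpretable through, its L-dimensional parameter, while documents remain mixtures of such components with document-specific weights, estimable via the EM algorithm discussed in Part 2.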