George Stepaniants is an NSF MSPRF postdoctoral fellow in the Department of Computing and Mathematical Sciences at the California Institute of Technology (Caltech), working with Professor Andrew Stuart. He received his PhD in 2024 from the Department of Mathematics at the Massachusetts Institute of Technology (MIT), where he was co-advised by Professors Philippe Rigollet and Jörn Dunkel and funded by the NSF GRFP and an MIT Presidential Fellowship. He was also part of the Interdisciplinary Doctoral Program in Statistics (IDPS) through the Institute for Data, Systems, and Society (IDSS). Prior to MIT, George graduated in 2019 from the University of Washington (UW) with a Bachelor of Science in Mathematics and Computer Science, where he performed research in the Department of Applied Mathematics under Professor Nathan Kutz.
George Stepaniants’ talk will take place as part of the IMS Lawrence D Brown PhD Student Award session, at JSM Nashville, August 2–8, 2025.
Alignment of Untargeted Data through their Covariances: A Novel Perspective on a Classical Tool in Optimal Transport
Dataset or feature alignment is a fundamental problem in statistics and machine learning, arising in a variety of fields including computer vision, machine translation, and biostatistics. Despite the progress of feature alignment methods in computer science, they are not immediately applicable to biostatistical problems, where tasks such as data comparison, pooling, and annotation from a reference dataset must be guided by the correct biological constraints. These tasks frequently arise in “untargeted” metabolomics, proteomics, and lipidomics studies, where concentrations of compounds (metabolites, proteins, and lipids) are measured across a collection of patients. Because the compounds in these studies are not preselected and are unlabeled, they allow for the discovery of new biomarkers that indicate the health status of a patient. The unlabeled nature of these studies gives rise to a wide range of feature (compound) matching problems, as practitioners often wish to compare features, merge datasets, or transfer feature annotations between two untargeted experimental studies.
We begin this talk by discussing the relevance of optimal transport to such problems in biological feature matching. Even when studying the same phenomena, biological datasets collected in different labs do not have identical cohorts or similar sample sizes. In order to match the features between such datasets, we discuss an important extension of the optimal transport method known as the Gromov–Wasserstein (GW) algorithm, which performs matchings between the feature similarity matrices of the two datasets. To the best of our knowledge, we develop the first application of optimal transport to the analysis and matching of metabolomic (liquid chromatography–mass spectrometry; LC-MS) datasets. Our method, GromovMatcher, a constrained GW solver, accurately matches corresponding metabolic features between studies, delivering superior alignment accuracy and robustness compared to existing approaches. Applying GromovMatcher to experimental metabolic studies of liver and pancreatic cancer, we discover shared metabolic features between these cancer groups and showcase the potential of our method for advancing alcohol biomarker discovery.
Motivated by these real-world feature alignment problems and the success of Gromov–Wasserstein in these settings, we propose and analyze a novel statistical framework for feature matching between two unlabeled datasets. In this framework, the features of both datasets follow the same joint Gaussian distribution with unknown covariance, and features of the second dataset are permuted by an unknown permutation which we wish to recover. We show how the correct permutation of the features can be recovered through a quasi-maximum likelihood estimator (QMLE) as well as through the GW method. Both estimators aim to align the empirical covariance matrices of both datasets, which we term the “covariance alignment” problem, offering a previously unstudied setting for graph matching with Wishart random matrices. The QMLE and GW estimators are instances of quadratic assignment problems which require combinatorial optimization over the discrete space of permutations. However, unlike the QMLE, the GW estimator can be lifted to the continuous space of coupling matrices and hence can be optimized with gradient methods, allowing it to scale to far larger matching problems as shown by our numerical experiments.
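The covariance alignment setup described above can be illustrated with a small simulation. The sketch below is a minimal illustration, not the QMLE or GW estimator from the talk: it assumes a hypothetical covariance matrix, uses a plain least-squares (Frobenius) mismatch between empirical covariances as the quadratic assignment objective, and searches all permutations by brute force, which is only feasible for a handful of features. All names (`frobenius_misfit`, `align_by_brute_force`) and all parameter values are assumptions made for this example.

```python
import itertools

import numpy as np


def frobenius_misfit(S1, S2, pi):
    """Squared Frobenius distance between S2 and S1 with rows/columns reordered by pi."""
    idx = np.array(pi)
    return float(np.sum((S1[np.ix_(idx, idx)] - S2) ** 2))


def align_by_brute_force(S1, S2):
    """Search all d! permutations for the one that best aligns the two covariances.

    This is a quadratic assignment problem; exhaustive search is used here
    only because d is tiny in this illustration.
    """
    d = S1.shape[0]
    best = min(
        itertools.permutations(range(d)),
        key=lambda pi: frobenius_misfit(S1, S2, pi),
    )
    return np.array(best)


rng = np.random.default_rng(0)
d = 5

# Hypothetical shared covariance: variances 1..d with AR(1)-style correlations.
std = np.sqrt(np.arange(1, d + 1, dtype=float))
corr = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
sigma = corr * np.outer(std, std)

# Two datasets with different sample sizes drawn from the same Gaussian.
X = rng.multivariate_normal(np.zeros(d), sigma, size=8000)
Y = rng.multivariate_normal(np.zeros(d), sigma, size=12000)

# Features of the second dataset are shuffled by an unknown permutation:
# column j of Y_shuffled is original feature perm[j].
perm = rng.permutation(d)
Y_shuffled = Y[:, perm]

# Only the empirical covariance matrices are used to recover the permutation.
S1 = np.cov(X, rowvar=False)
S2 = np.cov(Y_shuffled, rowvar=False)
recovered = align_by_brute_force(S1, S2)

print("true permutation:     ", perm)
print("recovered permutation:", recovered)
```

The exhaustive search scales as d!, which is why a relaxation over coupling matrices, as with the GW estimator, matters for larger problems.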
The novelty of our statistical framework lies in the fact that the unknown covariance of both datasets is treated as a nuisance parameter. This allows us to show that QMLE and GW achieve the same minimax optimal rate for the covariance alignment problem. This rate has a non-standard dimension scaling, interpolating between the rate of permutation estimation and the rate of estimation of the nuisance covariance. Finally, these results give the first statistical justification of the Gromov–Wasserstein algorithm for feature alignment.
This talk is based on my PhD work with Philippe Rigollet and Yanjun Han at MIT along with collaborators in the group of Vivian Viallon at IARC in Lyon, France.