Lester Mackey is a Senior Principal Researcher at Microsoft Research, where he develops machine learning methods, models, and theory for large-scale learning tasks driven by applications in weather and climate forecasting, healthcare, and the social good. Lester co-organized the second-place team in the Netflix Prize competition; won the Prize4Life ALS disease progression prediction challenge; won prizes for temperature and precipitation forecasting in the yearlong real-time Subseasonal Climate Forecast Rodeo; and received best paper awards from the ACM Conference on Programming Language Design and Implementation, the Conference on Neural Information Processing Systems, and the International Conference on Machine Learning. He is a 2023 MacArthur Fellow, a Fellow of the IMS and ASA, an elected member of the COPSS Leadership Academy, and the recipient of the 2023 Ethel Newbold Prize and the 2025 COPSS Presidents’ Award. This lecture will be delivered at JSM 2026 in Boston, August 1–6, 2026.
Better-than-i.i.d. Sampling
How do you succinctly summarize a probability distribution? The gold standard is to sample $n$ representative points, either independently from the target or from a convergent Markov chain. However, these standard sampling strategies are not especially concise: since $n$ independent points yield an order $1/\sqrt{n}$ approximation, ten thousand points are required for $1\%$ relative error and one million points for $0.1\%$ error. Such bloated sample representations preclude applications with expensive downstream costs, like computational cardiology, where each sample point demands a 1000-CPU-hour tissue or organ simulation.
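To see this rate in action, here is a minimal Python sketch of the i.i.d. baseline: estimating $\mathbb{E}[X^2] = 1$ for a standard normal $X$, the error of the sample average shrinks only tenfold when the sample grows a hundredfold.

```python
import numpy as np

# Monte Carlo baseline: the error of an n-point sample average decays like
# 1/sqrt(n), so each additional decimal digit of accuracy costs a 100x
# larger sample.
rng = np.random.default_rng(0)
for n in [10**2, 10**4, 10**6]:
    errs = [abs(np.mean(rng.standard_normal(n) ** 2) - 1.0) for _ in range(100)]
    print(f"n = {n:>9,d}   average relative error ~ {np.mean(errs):.4f}")
# Typical output: ~0.11 for n = 1e2, ~0.011 for n = 1e4, ~0.0011 for n = 1e6.
```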
This lecture will introduce three tools for summarizing a probability distribution more effectively than independent sampling or standard Markov chain Monte Carlo (MCMC).
Kernel thinning
Given an initial $n$-point summary (for example, from independent sampling or MCMC), kernel thinning finds a subset of only $\sqrt{n}$ points with $O(\sqrt{\log(n)/n})$ integration error. In contrast, an independent sample of size $\sqrt{n}$ would suffer substantially larger $\Omega(n^{-1/4})$ integration error. This improved rate of approximation is reminiscent of quasi-Monte Carlo but applies to general target distributions.
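For intuition, the following Python sketch illustrates the kernel-halving step at the core of kernel thinning: the points are visited in pairs, and one point of each pair is retained, with the coin flip biased so that the kernel sums of the retained and discarded halves stay close. Repeating the halving $\log_2\sqrt{n}$ times leaves $\sqrt{n}$ points. The sketch fixes a Gaussian kernel, uses a simplified non-adaptive threshold, and omits the final refinement (swap) stage, so it is an illustration of the idea rather than the algorithm analyzed in the references.

```python
import numpy as np

def gauss_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_halve(X, rng, sigma=1.0):
    """Simplified kernel halving: visit the points in pairs, keep one point of
    each pair, and bias the coin so the running signed kernel sum
    f = sum_{kept} k(x, .) - sum_{discarded} k(x, .) stays small."""
    n = len(X)
    K = gauss_kernel(X, X, sigma)
    signs = np.zeros(n)                      # +1 = keep, -1 = discard, 0 = unassigned
    for i in range(0, n - 1, 2):
        g = K[:, i] - K[:, i + 1]            # evaluations of k(., x_i) - k(., x_{i+1})
        theta = signs @ g                    # inner product <f, g> in the RKHS
        a = max(K[i, i] + K[i + 1, i + 1] - 2 * K[i, i + 1], 1e-12)  # ||g||^2
        p_keep_i = min(1.0, max(0.0, 0.5 * (1.0 - theta / a)))
        s = 1.0 if rng.random() < p_keep_i else -1.0
        signs[i], signs[i + 1] = s, -s
    return X[signs > 0]

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 2))           # n = 1024 input points in R^2
coreset = X
for _ in range(5):                           # log2(sqrt(1024)) = 5 halving rounds
    coreset = kernel_halve(coreset, rng)
print(coreset.shape)                         # (32, 2): a sqrt(n)-point summary
```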
Stein kernel thinning
Often, one only has access to a biased sample of points targeting the wrong distribution. Such biases are a common occurrence in MCMC-based inference due to tempering (where one targets a less peaked and more dispersed distribution to achieve faster mixing), burn-in (where the initial state of a Markov chain biases the distribution of chain iterates), or approximate MCMC (where one runs a cheaper approximate Markov chain to avoid the prohibitive costs of an exact MCMC algorithm). In these settings, our aim is to transform the potentially large and biased input sample into a compact and faithful representation of the target. Stein kernel thinning achieves this by optimizing a kernel Stein discrepancy, a quality measure based on Stein’s method that allows one to directly measure, and hence correct for, errors relative to the target distribution.
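For concreteness, one standard instantiation (presented here only as an illustration) builds the kernel Stein discrepancy from a base kernel $k$ and the score $s_p = \nabla \log p$ of the target density $p$:
$$
\mathrm{KSD}^2\big(\{x_i\}_{i=1}^n, p\big) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k_p(x_i, x_j), \qquad
k_p(x, y) = \nabla_x \cdot \nabla_y k(x, y) + \langle \nabla_x k(x, y),\, s_p(y)\rangle + \langle \nabla_y k(x, y),\, s_p(x)\rangle + k(x, y)\,\langle s_p(x),\, s_p(y)\rangle.
$$
Because $k_p$ depends on $p$ only through its score, the discrepancy can be evaluated even when the target is unnormalized, exactly the situation in tempered and approximate MCMC.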
Compress++
Natively, kernel thinning runs in $O(n^2)$ time, which is tolerable for moderate-sized problems but prohibitive for large sample sizes $n$. Our final tool, Compress++, resolves this issue by converting any unbiased quadratic-time thinning algorithm into a near-linear-time algorithm with error inflated by no more than a factor of 4. The same simple meta-procedure also accelerates super-quadratic thinning algorithms by square-rooting their runtime.
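The recursion behind this speed-up is simple enough to sketch in Python. In the sketch below, the names `compress`, `naive_halve`, and the oversampling level `g` are illustrative choices rather than a reference implementation, and the placeholder halving step (keep every other point) would be replaced in practice by a kernel-halving routine plus one final thinning round down to $\sqrt{n}$ points.

```python
import numpy as np

def compress(S, halve, g=2):
    """Recursive Compress sketch: returns roughly 2**g * sqrt(len(S)) points.
    Assumes len(S) is 4**g times a power of 4."""
    n = len(S)
    if n <= 4**g:
        return S
    quarter = n // 4
    # Compress each quarter recursively, then halve the concatenation.
    parts = [compress(S[i * quarter:(i + 1) * quarter], halve, g) for i in range(4)]
    T = np.concatenate(parts)        # about 2**(g+1) * sqrt(n) points
    return halve(T)                  # about 2**g * sqrt(n) points

def naive_halve(T):
    # Placeholder: standard thinning. Substitute a kernel-halving step
    # (as sketched above) to retain the quality guarantees discussed here.
    return T[::2]

rng = np.random.default_rng(0)
X = rng.standard_normal((4**6, 2))           # n = 4096 input points
out = compress(X, naive_halve, g=2)
print(len(X), "->", len(out))                # 4096 -> 256 = 2**2 * sqrt(4096)
```

Even when the halving subroutine runs in quadratic time, each recursive call hands it only about $2^{g+1}\sqrt{m}$ points (for an input of size $m$), so the total work across the recursion remains near-linear in $n$.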
These tools are especially well-suited for tasks that incur substantial downstream costs per summary point and have been used to compress uncertainty in digital twins of human hearts, to develop fast, high-quality attention approximations in transformers, to accelerate stochastic gradient training through reordering, and to powerfully test for distributional differences in near-linear time.
References
Dwivedi, R., and Mackey, L. (2024). Kernel thinning. Journal of Machine Learning Research, 25(152):1–77.
Riabiz, M., Chen, W. Y., Cockayne, J., Swietach, P., Niederer, S. A., Mackey, L., and Oates, C. J. (2022). Optimal thinning of MCMC output. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(4):1059–1081.
Dwivedi, R., and Mackey, L. (2022). Generalized kernel thinning. In International Conference on Learning Representations.
Shetty, A., Dwivedi, R., and Mackey, L. (2022). Distribution compression in near-linear time. In International Conference on Learning Representations.
Domingo-Enrich, C., Dwivedi, R., and Mackey, L. (2023). Compress then test: Powerful kernel testing in near-linear time. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics.
Li, L., Dwivedi, R., and Mackey, L. (2024). Debiased distribution compression. In Proceedings of the 41st International Conference on Machine Learning.
Carrell, A. M., Gong, A., Shetty, A., Dwivedi, R., and Mackey, L. (2025). Low-rank thinning. In Proceedings of the 42nd International Conference on Machine Learning.