David Dunson is Arts and Sciences Distinguished Professor of Statistical Science and Mathematics at Duke University. He is known for his wide-ranging contributions to statistical methodology, with a particular focus on novel modeling frameworks and Bayesian approaches motivated by complex, high-dimensional data collected in the sciences. These include latent factor models, dimensionality reduction, nonparametric methods, and machine learning methodology. Primary areas of application include neuroscience and brain network modeling, environmental health, ecology, and human fertility, among others. David is a Fellow of the ASA, IMS, and ISBA and has won numerous awards, most notably the 2010 COPSS Presidents' Award. His work is very widely cited, and he has an h-index of 70 on Google Scholar.

David Dunson’s Medallion Lecture will be given at JSM 2019 in Denver, USA (provisionally on Monday July 29, but check the program at http://ww2.amstat.org/meetings/jsm/2019/onlineprogram/index.cfm when it is finalized in late March).

 

Learning & exploiting low-dimensional structure in high-dimensional data

Characterizing low-dimensional structure in complex and high-dimensional data is one of the canonical problems in statistics and machine learning. There is a very rich associated literature spanning from classical methods, such as principal components analysis (PCA), to recent popular non-linear approaches, such as various manifold learning algorithms and variational auto-encoders (VAEs). The majority of this literature is focused on algorithmic approaches that lack uncertainty quantification and, in particular, the ability to propagate uncertainty across different components in inference and prediction tasks. Most commonly, one applies a two-stage approach in which the original high-dimensional data are replaced with lower-dimensional scores, and these scores are then used as the basis of data visualization and subsequent statistical analyses.
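To make the two-stage practice concrete, here is a minimal sketch (assuming scikit-learn; the toy data and the clustering step are illustrative choices, not from the talk): the raw data are replaced by PCA scores, and only the scores enter the downstream analysis, so any uncertainty from the reduction step is ignored.

```python
# Minimal sketch of the common two-stage approach: reduce to low-dimensional
# scores first, then do all further analysis on the scores alone.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))              # toy high-dimensional data (n=500, p=100)

# Stage 1: dimensionality reduction to a handful of scores.
scores = PCA(n_components=2).fit_transform(X)

# Stage 2: the scores (not the raw data) feed visualization and analysis,
# e.g. clustering; uncertainty from stage 1 is not propagated.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(scores)
```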

A particular focus of this talk is on fully model-based frameworks for flexible non-linear dimensionality reduction, in which one has a hierarchical likelihood specification of the data generating process. The associated literature is surprisingly limited, with most of the focus being on some variation of locally linear models. For example, one could approximate a non-linear subspace or manifold using a collection of hyperplanes, with the density of the data having elliptical contours around these planes. This leads naturally to mixtures of Gaussians, potentially with a factor-analytic structure on the covariance to reduce dimension. However, a critical disadvantage of locally linear models, including mixtures of Gaussians, is the inability to parsimoniously represent data lying close to non-linear subspaces having high curvature.
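The parsimony issue is easy to see on a toy example. The sketch below (assuming scikit-learn; the circle data and BIC-based model selection are our own illustrative choices) fits a mixture of Gaussians to data lying near a highly curved one-dimensional manifold: many locally linear components are needed to track the curvature.

```python
# Approximating data near a curved manifold (a noisy circle) with a locally
# linear model, i.e. a mixture of Gaussians. Many components are needed.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, size=1000)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.02 * rng.normal(size=(1000, 2))

# Choose the number of Gaussian components by BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 21)}
print(min(bics, key=bics.get))   # typically far more than one component for curved data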

Motivated by this problem, we propose a useful new class of spherelet dictionaries and kernels for concisely representing non-linear, low-dimensional structure in complex data. We start by proposing a simple generalization of PCA to allow curvature; we refer to this approach as spherical PCA (SPCA) and show that SPCA has substantial theoretical and practical advantages in many settings, using manifold learning as a motivating example. SPCA has a simple analytic form, making it easy to use as an alternative to PCA in a broad range of settings. We show improvements over competitors in a variety of applications including manifold estimation, image denoising, geodesic distance estimation and classification. A simple nearest-neighbor spherelets classifier can be defined that outperforms a wide range of competitors, including convolutional neural networks, in canonical image classification problems, such as for digits data. Relative to neural networks, dramatically fewer training examples are needed.
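Because SPCA replaces the flat PCA subspace with a sphere, one simple way to sketch the idea (assuming NumPy and scikit-learn) is to project the data onto a (d+1)-dimensional principal subspace and then fit a sphere by linear least squares. The algebraic fit below is our own illustrative choice, not necessarily the exact SPCA estimator from the talk.

```python
# Sketch of a spherical-PCA-style fit: PCA projection to d+1 dimensions,
# then an algebraic least-squares fit of a sphere (center and radius).
import numpy as np
from sklearn.decomposition import PCA

def spherical_pca(X, d):
    """Fit a d-dimensional sphere to X after projecting onto d+1 principal components."""
    pca = PCA(n_components=d + 1)
    Y = pca.fit_transform(X)                       # coordinates in the affine subspace
    # Sphere equation ||y||^2 = 2 y.c - (||c||^2 - r^2) is linear in (c, b).
    A = np.column_stack([2 * Y, -np.ones(len(Y))])
    coef, *_ = np.linalg.lstsq(A, (Y ** 2).sum(axis=1), rcond=None)
    center, b = coef[:-1], coef[-1]
    radius = np.sqrt(center @ center - b)
    return pca, center, radius

# Example: points near a unit circle embedded in 10 dimensions.
rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.zeros((300, 10))
X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)
X += 0.01 * rng.normal(size=X.shape)
_, center, radius = spherical_pca(X, d=1)
print(radius)   # close to 1
```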

Spherelets can also be used to create new kernels for multivariate density estimation and associated problems. In particular, spherelet kernels are obtained by generating from a von Mises-Fisher density on a sphere and then adding Gaussian noise. The resulting kernels can be curved to an extent controlled by the radius of the sphere, generalize the Gaussian, and have an analytic expression. We define spherelet kernel mixture models and develop supporting MCMC algorithms and theory, showing dramatically better performance compared with mixtures of Gaussians in a variety of examples.
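A minimal sketch of this generative description is below: draw a direction from a von Mises-Fisher density, place it on a sphere of a given center and radius, then add isotropic Gaussian noise. It assumes SciPy ≥ 1.11 (which provides scipy.stats.vonmises_fisher); the function name and parameter choices are illustrative, not from the talk.

```python
# Sampling from a single spherelet-style kernel: von Mises-Fisher on a sphere
# plus Gaussian noise. The sphere radius controls how curved the kernel is.
import numpy as np
from scipy.stats import vonmises_fisher

def sample_spherelet_kernel(n, center, radius, mu, kappa, sigma, rng=None):
    """Draw n points in R^d from a sphere-plus-noise kernel."""
    rng = np.random.default_rng(rng)
    directions = vonmises_fisher(mu=mu, kappa=kappa).rvs(n, random_state=rng)
    on_sphere = center + radius * directions       # points on the sphere
    return on_sphere + sigma * rng.normal(size=on_sphere.shape)

# Example: a curved kernel in R^3.
samples = sample_spherelet_kernel(
    n=1000, center=np.zeros(3), radius=2.0,
    mu=np.array([0.0, 0.0, 1.0]), kappa=20.0, sigma=0.1, rng=0)
```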