Preview of a Special IMS Lecture
Didong Li is a fifth-year graduate student in the Department of Mathematics at Duke University, supervised by David B. Dunson and Sayan Mukherjee. His research focuses on bridging statistics and differential geometry to develop fundamentally new algorithms, statistical methods and theory. In particular, he is interested in manifold learning, nonparametric Bayes and information geometry. His PhD thesis focuses on learning and exploiting low-dimensional geometric structures hidden in high-dimensional data. Li earned a Bachelor’s degree in 2012 and a Master’s degree in 2015, both in Mathematics, from Beijing Institute of Technology, under the supervision of Huafei Sun. He will give this talk at the Bernoulli–IMS World Congress in Seoul in August.
Efficient Manifold Approximation with Spherelets
Data lying in a high-dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional manifold. If one ignores the hidden geometry in the data and instead works directly in the high-dimensional ambient Euclidean space, statistical and computational efficiency have both been shown to be extremely low. In contrast, an accurate approximation of the unknown manifold benefits a variety of tasks, including dimension reduction, feature selection, density estimation, classification, clustering, data denoising and data visualization. Most of the data analysis literature relies on linear or locally linear methods. However, when the manifold has essential curvature, these linear methods suffer from low accuracy and efficiency. There is also an immense literature on non-linear methods, such as variational autoencoders and Gaussian process latent variable models, that aim to improve approximation performance. However, these methods are complex black boxes lacking identifiability and interpretability, trading one problem (bad performance) for another (high complexity). As a result, new non-linear tools need to be developed without introducing too much extra complexity.
In this talk, I will focus on exploiting the geometry in the data through the curvature of the unknown manifold, improving performance when the manifold has essential curvature while keeping the simple and clean closed forms of linear methods. In particular, we propose a simple and general alternative to locally linear manifold learning methods, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown manifold. We also develop spherical principal components analysis (SPCA) as a non-linear alternative to PCA, to find the best sphere fitting the data. SPCA provides simple tools that can be implemented efficiently for big and complex data and enables learning about geometric structure in the data, without introducing much more complexity than linear methods. Time permitting, I will also introduce a curved kernel called the Fisher–Gaussian kernel, which outperforms multivariate Gaussians in many cases, together with a Bayesian nonparametric methodology for inference. I will also present some applications of spherelets, including classification, geodesic distance estimation and clustering.
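To make the idea of "finding the best sphere fitting the data" concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation. It assumes the fit can be done in two steps: ordinary PCA to find a (d+1)-dimensional affine subspace, followed by an algebraic least-squares sphere fit within that subspace. The SPCA estimator in the papers listed below may differ in detail. The function names `fit_sphere` and `project_to_sphere` are hypothetical.

```python
import numpy as np


def fit_sphere(X, d):
    """Fit a d-dimensional sphere to the rows of X (n points in R^D).

    Illustrative sketch: PCA to a (d+1)-dimensional affine subspace, then an
    algebraic least-squares sphere fit inside it. Needs D >= d + 1 and at
    least d + 2 points in general position.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)

    # Step 1: the top d+1 right singular vectors of the centered data span
    # the affine subspace in which the sphere is assumed to live.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[: d + 1].T                 # D x (d+1), orthonormal columns
    Y = (X - mean) @ V                # n x (d+1) subspace coordinates

    # Step 2: a point y on a sphere with center c and radius r satisfies
    # ||y||^2 = 2 c.y + b with b = r^2 - ||c||^2, which is linear in (c, b).
    A = np.hstack([2.0 * Y, np.ones((len(Y), 1))])
    rhs = (Y ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    c_sub, b = sol[:-1], sol[-1]

    radius = np.sqrt(b + c_sub @ c_sub)
    center = mean + V @ c_sub         # center in ambient coordinates
    return center, radius, V


def project_to_sphere(X, center, radius, V):
    """Map each row of X to the nearest point on the fitted sphere.

    Assumes no point projects exactly onto the center (zero norm).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - center) @ V              # coordinates relative to the center
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return center + radius * Z @ V.T


if __name__ == "__main__":
    # Noisy samples from a circular arc (a 1-dimensional sphere) in R^3.
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 1.5, size=200)
    arc = np.column_stack([2.0 * np.cos(t), 2.0 * np.sin(t), np.zeros_like(t)])
    noisy = arc + 0.01 * rng.standard_normal(arc.shape)

    center, radius, V = fit_sphere(noisy, d=1)
    denoised = project_to_sphere(noisy, center, radius, V)
    print("estimated center:", center)
    print("estimated radius:", radius)
    print("mean residual:", np.linalg.norm(noisy - denoised, axis=1).mean())
```

In a spherelet-style local approximation, one would presumably partition the data into small neighborhoods and run a fit like this on each piece. The partitioning scheme is not specified in this abstract, so it is left out of the sketch.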
 D. Li, M. Mukhopadhyay, D.B. Dunson, Efficient manifold and subspace approximations with spherelets, arXiv:1706.08263, 2018.
 M. Mukhopadhyay, D. Li, D.B. Dunson, Estimating densities with nonlinear support using Fisher–Gaussian kernels, arXiv:1907.05918, 2019.
 D. Li, D.B. Dunson, Classification via local manifold approximation, arXiv:1903.00985, 2019.
 D. Li, D.B. Dunson, Geodesic distance estimation with spherelets, arXiv:1907.00296, 2019.