Yee Whye Teh is a Professor of Statistical Machine Learning at the Department of Statistics, University of Oxford, and a Research Scientist at DeepMind. He was programme co-chair for AISTATS 2010 and ICML 2017. His research interests span machine learning and Bayesian statistics, including probabilistic methods, Bayesian nonparametrics and deep learning.

Yee Whye’s Medallion lecture will be delivered at the Joint Statistical Meetings in Denver, July 27–August 1, 2019.

A shorter version of this article appears below, or you can download a longer PDF version here.

**On Statistical Thinking in Deep Learning**

Historically, machine learning has its roots in pattern recognition and connectionist systems whose intelligent behaviours are learnt from data. In the 90s, the community started realising the widespread connections with statistics, which led to a period when statistical approaches flourished and became the dominant framework for both theoretical foundations and methodological developments. In the last decade, with the growing popularity of deep learning, this coming together with statistics has started to unravel, and the research frontier has moved from statistical learning to artificial intelligence, from graphical models to neural networks, and from Markov chain Monte Carlo to stochastic gradient descent.

In this new era, what is the role of statistical thinking in advancing the state of the art in machine learning? It is my belief, and that of many others, that statistical thinking continues to play an important role in machine learning. The deep theoretical roots of statistics and probability have continued to nourish our understanding of learning phenomena; in unsupervised learning, generative modelling continues to be a popular paradigm; and the deep concern for uncertainty and robustness prevalent in statistics is now being increasingly felt as machine learning techniques are applied in the real world. In the following I will illustrate how statistical thinking has helped with two inter-related examples from my own research.

**Meta Learning and Neural Processes**

While much of machine learning excels for large datasets, there is increasing interest in systems that can learn efficiently from much less data. For example, in few-shot image classification, with just a few example images of each class, we would like a system that can generalise well to classifying other images. Meta learning is an idea whereby if our system has seen many examples of such few-shot image classification tasks (each with its own small dataset), we might conceivably expect there to be sufficient information spread across tasks for a system to learn to generalise sensibly from few examples.

While most recent approaches to meta learning are based on the idea of optimising learning algorithms, an interesting alternative, which we call neural processes, considers it from the statistical perspectives of hierarchical Bayes and stochastic processes (Garnelo et al., 2018a,b; Kim et al., 2019; Galashov et al., 2019). The idea is that in order to learn effectively from small datasets, prior knowledge is necessary, which from a Bayesian perspective takes the form of a prior distribution. In the case of image classification and supervised learning, each task corresponds to a function, and the prior of interest is a distribution over functions, i.e. a stochastic process. While standard approaches in Bayesian nonparametrics might posit simple prior distributions that enable tractable posterior computation, we instead propose to use neural networks to directly learn, from data, the predictive distributions induced by the stochastic process.

Viewing meta learning from a statistical perspective has allowed us to better understand the underlying learning phenomena. This has in turn allowed us to make links with other ideas like Bayesian nonparametrics and Gaussian processes, and motivated new approaches which better handle uncertainty (Garnelo et al., 2018b) and learn more accurately (Kim et al., 2019), as well as new applications of meta learning in Bayesian optimisation and sequential decision making (Galashov et al., 2019).

**Probabilistic Symmetries and Neural Networks**

In neural processes, the central function being learnt has the form $y = f(x, \mathcal{D}^{\mathsf{train}})$, giving an output $y$ for an input $x$ and an iid training set $\mathcal{D}^{\mathsf{train}} = \lbrace (x_i^{\mathsf{train}}, y_i^{\mathsf{train}}) \rbrace_{i=1}^n$. The question is, how should we choose the architecture of the neural network used to learn it? Specifically, the function should be invariant with respect to permuting the indices of $\mathcal{D}^{\mathsf{train}}$. We enforced this permutation invariance explicitly by choosing a specific neural architecture,

$f(x, \lbrace (x_i^{\mathsf{train}}, y_i^{\mathsf{train}}) \rbrace_{i=1}^n) = h\left(x, \sum_{i=1}^n g(x_i^{\mathsf{train}}, y_i^{\mathsf{train}}) \right)$

where both $g$ and $h$ are neural networks.
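
As a rough illustration of this architecture, the following NumPy sketch builds $g$ and $h$ as small randomly initialised networks and checks numerically that shuffling the training set leaves the output unchanged. The layer sizes, the `tanh` activations, the helper names (`mlp_params`, `mlp`) and the `pool` argument are all assumptions made for the example; they are not taken from the neural process papers, and nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes, rng):
    """Random weights for a small fully connected network (illustrative only)."""
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, z):
    """Apply the network: tanh hidden layers, linear output layer."""
    for W, b in params[:-1]:
        z = np.tanh(z @ W + b)
    W, b = params[-1]
    return z @ W + b

# Hypothetical sizes: scalar x and y, 8-dimensional pooled representation.
g_params = mlp_params([2, 16, 8], rng)       # g: (x_i, y_i) -> R^8
h_params = mlp_params([1 + 8, 16, 1], rng)   # h: (x, pooled g) -> y

def f(x, x_train, y_train, pool=np.sum):
    """f(x, D) = h(x, pool_i g(x_i, y_i)); pool must be a commutative reduction."""
    pairs = np.stack([x_train, y_train], axis=1)     # shape (n, 2)
    r = pool(mlp(g_params, pairs), axis=0)           # shape (8,)
    return mlp(h_params, np.concatenate([[x], r]))   # shape (1,)

x_train = rng.normal(size=5)
y_train = np.sin(x_train)
perm = rng.permutation(5)

out = f(0.3, x_train, y_train)
out_perm = f(0.3, x_train[perm], y_train[perm])   # same data, shuffled order
print(np.allclose(out, out_perm))
```

Passing `pool=np.max` or `pool=np.min` swaps in a different commutative reduction while keeping the same invariance.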

By construction the function is invariant to permutations of the dataset, since addition is commutative. However, there are other commutative operators, for example element-wise product, max, or min. This raises the following questions: Which operator is best? Are there other neural architectures or function classes that have this permutation invariance property? And can we characterise all permutation-invariant functions?

In Bloem-Reddy and Teh (2019), we developed a general framework to answer these questions using tools from probabilistic symmetries and statistical sufficiency. The core idea is that an invariance means that some information is ignorable. The rest of the information then forms an adequate statistic for computing the function, and we can identify what the adequate statistic is. In the case of permutation invariance this is the empirical measure, and the implication is that the form we chose above is the natural one.
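
One way to see the connection, using notation that is introduced here only for illustration (the symbol $\mathbb{M}_n$ does not appear above): the empirical measure places a unit point mass on each training pair, and the sum inside $h$ is the integral of $g$ against it, so the architecture depends on the training set only through the empirical measure.

$\mathbb{M}_n = \sum_{i=1}^n \delta_{(x_i^{\mathsf{train}}, y_i^{\mathsf{train}})}, \qquad \sum_{i=1}^n g(x_i^{\mathsf{train}}, y_i^{\mathsf{train}}) = \int g \, \mathrm{d}\mathbb{M}_n$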

We have generalised this result in a few ways. Firstly, we can generalise to invariance under the action of some compact group. The results are structurally the same, except that the empirical measure is replaced by an appropriate adequate statistic called a maximal invariant. We have also derived analogous results for a different notion of symmetry called equivariance, where transformations of the input lead to output that is transformed in the same way.
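
To make the notion of equivariance concrete for the permutation group, here is a small NumPy sketch of a linear layer whose output rows are permuted in the same way as its input rows. The particular form, a per-row map plus a pooled term shared by all rows, is a common construction chosen for illustration; it is an assumption of this example and not necessarily the form derived in Bloem-Reddy and Teh (2019). The function name and the weight matrices `A` and `B` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def equivariant_layer(X, A, B):
    """Row i of the output is X[i] @ A + (sum over all rows of X) @ B."""
    pooled = X.sum(axis=0, keepdims=True)   # permutation-invariant summary, shape (1, d_in)
    return X @ A + pooled @ B               # pooled term broadcasts over rows, shape (n, d_out)

n, d_in, d_out = 6, 3, 4
X = rng.normal(size=(n, d_in))
A = rng.normal(size=(d_in, d_out))
B = rng.normal(size=(d_in, d_out))
perm = rng.permutation(n)

lhs = equivariant_layer(X[perm], A, B)   # permute the inputs, then apply the layer
rhs = equivariant_layer(X, A, B)[perm]   # apply the layer, then permute the outputs
print(np.allclose(lhs, rhs))
```

The check compares the two orders of operations; they agree because the per-row term moves with its row and the pooled term is the same for every row.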

**References**

Bloem-Reddy, B. and Teh, Y. W. (2019). Probabilistic symmetry and invariant neural networks. arXiv:1901.06082.

Galashov, A., Schwarz, J., Kim, H., Garnelo, M., Saxton, D., Kohli, P., Eslami, S., and Teh, Y. W. (2019). Meta-learning surrogate models for sequential decision making. In *ICLR Workshop on Structure & Priors in Reinforcement Learning*. arXiv:1903.11907.

Garnelo, M., Rosenbaum, D., Maddison, C. J., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, S. (2018a). Conditional neural processes. In *International Conference on Machine Learning (ICML)*.

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and Teh, Y. W. (2018b). Neural processes. In *ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models*. arXiv:1807.01622.

Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. (2019). Attentive neural processes. In *International Conference on Learning Representations (ICLR)*. arXiv:1901.05761.
