Stéphane Boucheron works in the Statistics Group at the Laboratoire de Probabilités et Modèles Aléatoires, Université Paris Diderot. He writes as one of our team of Contributing Editors:
Since 2011, the growing hype around Big Data has provoked mixed reactions among statisticians. For a while, there will be jobs in the data science industry. This provides arguments when trying to recruit students in a statistics curriculum. But beyond this opportunity, the next question is how connected to statistics is this emerging data science? Comparing trends in the use of terms like big data analytics and big data statistics suggests that the Big Data industry is not reducible to Statistics. I will not try to define here what Big Data is or could be. Several articles in this Bulletin have already safely avoided this point. But I will comment on connections between (mathematical) statistics and computer science that are emphasized by the so-called Data Deluge.
Big Data is sometimes said to have been hijacked by computer science at the expense of statistics. This is a real concern on the applied side, where databases, business intelligence, analytics, visualization and reporting tools get most of the attention, while advances in computational statistics and statistical learning remain in the shadows.
Statistical theory might also be challenged by the Big Data movement. For a century, high church statistical theory has been shaped by the analysis of experiments in agronomy, physics, experimental psychology… In those contexts, data are usually made of a matrix (or rather a data-frame) where n rows correspond to individuals and p columns to variables. In the mathematical statistics community, Big Data is often equated with High Dimension, that is with n ≪ p. For almost two decades, this has been the playground for inference under sparsity constraints. This attempt to cope with various aspects of the curse of dimensionality has stimulated many exciting developments both on the theoretical and the algorithmic side. This endeavor has already delivered Compressed Sensing, the Lasso and a variety of sparsity-inducing penalization techniques, renewed interest in Greedy methods and a surge of interest in optimization in the statistics community.
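As a point of reference, the Lasso mentioned above is usually written as a penalized least-squares problem. The notation below (response vector $y$, design matrix $X$, coefficient vector $\beta$, regularization parameter $\lambda$) is assumed for illustration and does not come from this column.

```latex
\hat{\beta}_{\lambda} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p}}
  \left\{ \frac{1}{2n} \lVert y - X\beta \rVert_2^2
          + \lambda \lVert \beta \rVert_1 \right\},
\qquad y \in \mathbb{R}^{n},\quad X \in \mathbb{R}^{n \times p},\quad \lambda > 0 .
```

The $\ell_1$ penalty tends to set many coordinates of $\hat{\beta}_{\lambda}$ exactly to zero. This is the sense in which the method is sparsity-inducing, and why it remains usable when n ≪ p, provided the underlying coefficient vector is sparse and the design satisfies suitable conditions.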
There might be something else. For mathematical statistics, Big Data might not be reducible to high dimension. In the language of databases, a data-frame corresponds to a relational table. The fact that many working statisticians are now able to query databases or even data warehouses (collections of possibly heterogeneous databases) changes the status of the traditional data-frame. In the old days, building a data-frame represented a lot of work, and the data had to be milked thoroughly. It is now possible to reshape, enrich or decimate a data-frame by querying the databases again, that is, by filtering, joining and projecting over a complex database schema. Things may turn out to be slightly different with NoSQL databases (Not Only SQL databases), such as document databases and databases made of semi-structured data. But the statistician’s job is quietly broadening. Statistical techniques like resampling methods (bootstrap, subsampling) are making their way into OLAP (OnLine Analytical Processing) databases, but OLAP databases also pose a challenge for those very methods.
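The workflow described above can be sketched in a few lines of Python using only the standard library. Everything in the example is hypothetical: the `customers` and `orders` tables, their columns, the toy values and the helper names are invented for illustration. The query filters, joins and projects to build a small data-frame, and a plain percentile bootstrap is then run on one of its columns.

```python
import random
import sqlite3
import statistics


def make_toy_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, age INTEGER);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 25), (2, 34), (3, 41), (4, 58);
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 2, 20.0), (3, 2, 35.0),
                                  (4, 3, 15.0), (5, 4, 50.0), (6, 4, 5.0);
        """
    )
    return conn


def build_frame(conn, min_age):
    """Filter, join and project: return (age, amount) rows as tuples."""
    query = """
        SELECT c.age, o.amount                      -- projection
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id    -- join
        WHERE c.age >= ?                            -- filter
        ORDER BY o.id
    """
    return conn.execute(query, (min_age,)).fetchall()


def bootstrap_means(values, n_resamples, seed=0):
    """Resample with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n))
            for _ in range(n_resamples)]


conn = make_toy_db()
frame = build_frame(conn, min_age=30)
amounts = [amount for _, amount in frame]

# Rough 95% percentile interval for the mean order amount.
means = sorted(bootstrap_means(amounts, n_resamples=1000))
low, high = means[25], means[974]
```

Changing `min_age` or the join rebuilds a different data-frame at almost no cost, which is the shift in status described above. The sketch does not address the harder question of running resampling inside an analytical database at scale.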
The rate of data acquisition and data dimensionality are not the only changes in the landscape. Thanks to the possibility of mining databases, data acquisition is also becoming more flexible. Classical statistics has a well-developed theory. On the asymptotic side, it culminates with the theory of comparison of experiments. When we add rows to the single data-frame (that is, when the sample becomes large) we are sometimes able to realize that apparently very different statistical problems are actually equivalent to a Gaussian shift experiment. One may wonder whether there could be a comparable and useful theory for statistical inference in this new, broader framework. Plausible directions may come from computational or statistical learning theory. The classical supervised learning setting has been supplemented by interesting variants, for example semi-supervised learning, where the data (labeled examples) of a classification problem are supplemented by a (large) collection of unlabeled examples, and active learning, where the statistician is allowed to request the labeling of well-chosen data. The asymptotic nature of the theory of comparison of experiments may seem unattractive to many statistical learners. The theory of statistical learning has largely followed the line pioneered by Vapnik and Chervonenkis in the 60s and 70s, sticking to non-asymptotic risk guarantees and avoiding assumptions about the existence of a correct model. Nevertheless, the ability to compare two classification experiments would nicely complement the picture that has been built during the last 20 years, which shows that the risk incurred in classification depends primarily on the complexity of the dictionary and on the so-called noise conditions. The semi-supervised scenario and its companion, active learning, provide us with plausible abstractions for Big Data analytics.
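To make the active learning scenario concrete, here is a minimal sketch under strong assumptions that are not part of the column: points lie on the real line, the true classifier is a threshold, and labels are noise-free. The names `make_oracle` and `active_learn_threshold` are hypothetical. Given a large pool of unlabeled points, the learner requests labels only where they are informative, which here reduces to a binary search.

```python
import random


def make_oracle(threshold):
    """Return a labeling function and a counter of requested labels."""
    calls = {"count": 0}

    def label(x):
        calls["count"] += 1
        return 1 if x >= threshold else 0

    return label, calls


def active_learn_threshold(pool, label):
    """Binary search over a sorted pool for the first point labeled 1.

    Returns the index of that point, or len(pool) if every label is 0.
    """
    lo, hi = 0, len(pool)
    while lo < hi:
        mid = (lo + hi) // 2
        if label(pool[mid]) == 1:
            hi = mid          # boundary is at mid or to its left
        else:
            lo = mid + 1      # boundary is strictly to the right
    return lo


rng = random.Random(0)
pool = sorted(rng.random() for _ in range(1000))   # unlabeled examples
label, calls = make_oracle(threshold=0.37)         # hypothetical truth

boundary = active_learn_threshold(pool, label)
queries_used = calls["count"]
```

With 1000 unlabeled points, the search needs about ten label requests to separate the pool exactly, whereas labeling the whole pool would need 1000. Under noisy labels or richer classes the gain is far less clear-cut, and that is where the noise conditions mentioned above come into play.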
In my next columns, I will elaborate on plausible interactions between (theoretical) statistics and Big Data.