Juan C. Meza is Director of the Division of Mathematical Sciences at the US National Science Foundation. IMS President Xiao-Li Meng invited him to share some opportunities to reach out to Computer Science and Math or explore new partnerships with the domain sciences.

 

These are exciting times in mathematics and statistics. One reason for this is the exponential increase in the amount of data in many fields due to the increased power of observational, experimental, and computational tools and techniques. The McKinsey report, Big data: The next frontier for innovation, competition, and productivity [1] found that, “the percentage of data stored in digital form increased from only 25 percent in 2000 (analog forms such as books, photos, and audio/video tapes making up the bulk of data storage capacity at that time) to a dominant 94 percent share in 2007”. This data comes from all sectors of life, including the health care industry, social media, and applications in the Internet of Things. Similar trends have occurred in many scientific and engineering fields.

This explosion of data has led to situations where scientists must analyze massive data sets. Other applications require analysis of large numbers of streams of small data sets. Today, other issues have also come to the forefront, including the increasingly heterogeneous, unstructured, and real-time aspects of many data sets. The question before us is this: how does one manage and make use of all of this data to generate new knowledge and solve today’s problems?

While thinking about this, I had the opportunity to reread a favorite article of mine, “The Future of Data Analysis,” by John Tukey [2]. One statement in that article stood out to me: “Statistics has contributed much to data analysis. In the future it can, and in my view should, contribute much more.” I found this encouraging, because I believe that mathematics and statistics are in a wonderful position to contribute to data science, even more so than back in 1962 when Tukey wrote the paper above.

At the National Science Foundation, ten new initiatives, called the 10 Big Ideas, were proposed in 2016. These were intended to be “long-term research and process ideas that identify areas for future investment at the frontiers of science and engineering.” The Big Ideas also represented “unique opportunities to position our Nation at the cutting edge—indeed, to define that cutting edge—of global science and engineering leadership”.

One of those Big Ideas, Harnessing the Data Revolution, is of particular importance to the statistics community. With a view towards understanding how to take advantage of the Big Data revolution, two of the main goals of this program are to engage NSF’s research community in the pursuit of fundamental research in data science and engineering, and to develop a 21st-century data-capable workforce.

But what do we mean by data science? There has been much debate on what constitutes data science, who practices this new science, and how one should teach it. Dhar [3] gave one answer, saying, “Data science is the study of the generalizable extraction of knowledge from data,” and that “a data scientist requires an integrated skill set spanning mathematics, machine learning, artificial intelligence, statistics, databases, and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions”. As Tukey had pointed out, there are many different ways to extract knowledge from data, and statisticians have been studying this area and developing statistical models for many years. Shmueli [4] in turn provided interesting insights into statistical models and some of the differences between two types, explanatory and predictive modeling, arguing that we need to do both. In one area that has received a lot of attention, machine learning, Daubechies [5] says, “Our current mathematical understanding of many techniques that are central to the ongoing big-data revolution is inadequate, at best”.

Within the Harnessing the Data Revolution (HDR) initiative, we are seeking to support innovative research in data science. One of the first initiatives within the HDR program was the Transdisciplinary Research in Principles of Data Science (TRIPODS) program. In 2017, the Division of Mathematical Sciences (DMS), along with the Division of Computing and Communication Foundations, made 12 awards to 14 different institutions for a total of $17.7 million. The overarching goal is to bring together the statistics, mathematics, and theoretical computer science communities to develop the foundations of data science. These 12 awards are a first attempt to bring the communities together to form interdisciplinary teams to study these problems. This year, a new solicitation complemented these awards by bringing in domain-specific application teams to work with the initial awardees. An anticipated Phase 2 of the TRIPODS program will then select a smaller number of larger institutes.

Another initiative is a new joint solicitation between DMS and the NIH National Library of Medicine for Generalizable Data Science Methods for Biomedical Research. Here again, the goal is to develop and strengthen ties between different disciplines to address the questions of data science. In particular, this solicitation plans to support the development of innovative and transformative mathematical and statistical approaches to address important data-driven biomedical and health challenges.

The educational component of data science is also important. In May of this year, the National Academies released a report on Data Science for Undergraduates that had been sponsored by the NSF [6]. The goal of this study was to explore what data science skills are essential for undergraduates, now and in the future, and how academic institutions can structure their data science education programs to best meet those needs. Two key findings were that data science is in its infancy and that it is a unique field that borrows heavily from multiple other fields.

Specifically, with regard to the educational component, the report’s authors also found that education at all levels will need to evolve as the field evolves and that, as a result, there must be multiple pathways for undergraduates. They also called out two noteworthy points: first, that all students will need a certain amount of data acumen, and second, that data ethics should be incorporated into the curriculum.

What will the future hold for us, then? At NSF, we are continually looking to see what the emerging trends are and what the community sees as future areas of interest. Towards that end, DMS convened a workshop on October 15–16, 2018, to discuss future trends in statistics. The workshop brought together over 50 researchers from all areas of statistics to discuss six broad themes: foundations of statistics and data science, statistics and computation, emerging applications, data challenges, inference in the age of big data, and statistics education in the new era. I am looking forward to reading the workshop report.

I started by saying that these are exciting times for our communities. There are great challenges, but also great opportunities. Obviously, there are many different ways of approaching these problems, and these discussions will continue for some time. Whatever the ultimate outcomes of these discussions are, it is clear that data in its many forms is now an integral part of how science is done. And as the field evolves, new strategies, methods, and theory will be needed to address the complex data issues that arise.

How, then, might we proceed in developing future strategies? Perhaps Tukey once again can provide us with some guidance. If I may paraphrase him, “Is it not time to seek out novelty in data sciences?” And who better to do this than those who have already contributed so much to data sciences?

 


Footnotes

[1] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers, Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.

[2] Tukey, John W., The Future of Data Analysis. Ann. Math. Statist. 33 (1962), no. 1, 1–67. doi:10.1214/aoms/1177704711.
https://projecteuclid.org/euclid.aoms/1177704711

[3] Dhar, Vasant, Data Science and Prediction, Communications of the ACM, Vol. 56, No. 12, 2013.

[4] Shmueli, Galit, To Explain or to Predict?, Statistical Science, Vol. 25, No. 3, 289–310, Institute of Mathematical Statistics, 2010. DOI:10.1214/10-STS330.

[5] Daubechies, Ingrid, Machine Learning Works Great—Mathematicians Just Don’t Know Why, Wired Magazine, Dec. 12, 2015.

[6] National Academies of Sciences, Engineering, and Medicine. 2018. Envisioning the Data Science Discipline: The Undergraduate Perspective: Interim Report. Washington, DC: The National Academies Press. https://doi.org/10.17226/24886.