Institute of Mathematical Statistics | What is the core of Data Science?

August 28, 2015

David Dunson, Arts and Sciences Professor of Statistical Science at Duke University, writes:

What is the core of data science? To address this, I think it is necessary to first touch on the question of what data science is. Certainly there is no single agreed-upon definition. At Duke we had a recent search for an open-rank data science faculty position, and we received extremely disparate applicants, ranging from theoretically focused researchers studying properties of machine learning algorithms to optimization experts to image processors to applied mathematicians interested in large-scale applications in neurosciences and power grid optimization. The field of the PhD degree for these applicants varied extremely widely, including, but not limited to, statistics, computer science, mathematics, electrical engineering and physics. I received candidates with similarly varied backgrounds when I recently advertised for a “Bayesian data science” postdoctoral fellow.

The consensus that we came up with in our search, and my own view, is that a data scientist is an individual who is driven primarily by the application and uses whatever statistical, computational and algorithmic tools they can come up with to develop new knowledge and insights in that application area. If the field of study is an area of science (e.g., neuroscience, genomics), then the data scientist is a full-fledged scientist in the corresponding area, but instead of collecting new data in their labs they exploit existing large, complex and disparate data sources to obtain new scientific insights.

Given this view of data science, it is not at all surprising that the rise of data science has ended up blurring disciplines and attracting individuals with highly disparate backgrounds in the mathematical sciences (broadly defined). Many view this as a threat to statistics as a discipline. Increasingly, the caricature of a statistician is a reserved, conservatively thinking stickler for foundations and theoretical support, who is so slowed down by their own principles that they study toy algorithms that aren’t useful in real-world, large-scale applications. Meanwhile the hip and cool machine learning types charge ahead in creatively developing wild new algorithms and approaches and diving right into big exciting applications. Then, not surprisingly, the lion’s share of the increasing research dollars associated with data science topics goes to the latter group. These stereotypes, which have a seed of underlying truth to them, should serve as a wake-up call to statisticians to make their work more relevant to modern applications.

The over-arching motivation for organizing this conference [the IMS-Microsoft Research workshop on Foundations of Data Science; see the interview with the other organizers], and, I hope, for kick-starting an IMS group focused on this topic, is to bring together leaders in different aspects of data science to move towards establishing the foundations of data science. Classical statistical theory, methods and principles are increasingly not relevant in modern data science problems and new foundations need to be established, going well beyond statistical theory for large p, small n problems.

There are several directions to take in closing the gulf between statistical foundations and data science practice. One is to have mathematical statisticians become more seriously engaged in understanding why highly successful algorithms, such as deep learning, have such good behavior. This is a type of top-down approach. Another is to become more cognizant of, and seriously engaged in, what successful data scientists are actually doing in terms of the process of obtaining the data to analyze, reducing dimension, doing many analyses, reporting and summarizing the results, etc. One can then attempt to develop a realistic statistical formalism for establishing optimality and other properties, taking into account more of the pipeline, including computational time, storage, etc.

My own view is that the data science revolution has been extremely intellectually stimulating and exciting. In a very short time, it has dramatically reduced the siloing of data scientists based on their PhD field and department affiliation. For many years, different communities proceeded independently, working on essentially identical problems but with different notation, publication outlets and perspectives. Mostly these communities were unaware of each other even when working on exactly the same problems. This has shifted dramatically, partly due to the growing tendency to establish interdisciplinary big data and data science centers or institutes. For example, at Duke we have the “Information Initiative at Duke” (IID), which has wonderful dedicated space, a core faculty having PhDs and primary department affiliations in many different fields, a vibrant seminar series, and great cross-talk between research groups at all levels, including undergraduates. I have grown to enjoy the “data seminar,” organized by a topologist and focusing on cool math-y stuff people do with data, more than our regular statistics seminars. I’m more likely to see new intellectually stimulating ideas that will deeply impact my work, while there isn’t as much that surprises me in a usual statistics seminar after 20+ years in the field. This has definitely improved the quality of my work, and I’m hoping efforts such as the Foundations of Data Science conference will similarly stimulate others.
