On June 11–12, 2015, the IMS-Microsoft Research workshop Foundations of Data Science brought together parallel communities of statisticians and machine learning researchers to discuss various problems that have both statistical and computational aspects. The organizers Jennifer Chayes from Microsoft Research, former IMS president Bin Yu from UC Berkeley Statistics Department, Sham Kakade from Microsoft Research, and Rafael Irizarry from Dana-Farber Cancer Institute were interviewed by Karl Rohe from UW-Madison Statistics Department via teleconference on July 24. This transcript has been edited for clarification and brevity.

David Dunson also helped organize the conference, but he was unable to join the teleconference. His essay was written before this interview.


Karl Rohe: Why did you organize this conference?

Jennifer Chayes: One of the reasons for setting up this conference is that we felt machine learning people should be more involved in statistics, and statistics people should be more involved in machine learning.

Bin Yu: We wanted to show the younger generation of statisticians that a lot of work is being done at the intersection of these areas.

Chayes: Data science covers a wide breadth of challenges. These challenges have both statistical and computational aspects. These two fields provide complementary approaches to these problems.

Yu: We want to integrate these two fields.

Chayes: Right! You need both of them to address all of these data science questions.

Yu: IMS wants to facilitate this foundational work. The next step is to organize a data science group to host the next conference.

Chayes: That would be great!

Rohe: To give a flavor of the conference, what were some of the talks about?

Sham Kakade: Axel Munk’s session on super-resolution imaging had some really cool stuff. For example, Victor Panaretos spoke about functional data analysis of DNA molecular dynamics. There was also a lot of cool computational social science stuff. Sharad Goel analyzed over three million stops by New York City police officers to investigate racial disparities. There were also numerous talks from physicists.

Rohe: Do you have any advice for people who are interested in data science?

Yu: They have to be quantitative already…

Chayes: And people can be quantitative in many ways. Expose yourself to the richness of the kinds of problems that are addressed in data science and the breadth of techniques that can be brought to bear. Whatever a researcher is doing, there are things that they can do to contribute to data science. Expose yourself to the breadth of this so that you can say to yourself, “Wow, I can do this.”

Yu: Also, research is always human-centered. You need to make the right connections. You can go to conferences, you can organize webinar meetings, you can have blogs. There are many ways, but it is people coming together. We are forging data science together. It is not like it is already made and you should learn it. This is really an invitation: let’s do data science! Let’s invent data science together! Whatever you can bring to the table.

Chayes: You need to get out there and meet people!

Yu: Yeah! For me, research is always social.

Chayes: This is true in all areas of research. I do think that one of the nice things about data science, and one of the things that was inherent in the workshop, is the breadth of the communities. If you’ve been doing statistical biology, it doesn’t mean that you can’t be doing statistical environmental science. A lot of the techniques apply across multiple application areas.

Yu: That is the essence of statistics. Statistics is always the hub of ideas and you can transport and learn. That is why you have a field. Data science is playing that role too. I don’t think we need to worry too much about “Statistics vs Data Science”—the important thing is to get real problems solved, alone or together.

Chayes: Data science does have computational aspects that were not at the fore in statistics. As the data sets have gotten larger, the computational questions have entered more profoundly than they did in the past.

Yu: Computation has always been essential to statistics. The motivation for the Hollerith Tabulating Machine arose from the 1890 US Census. It was developed by a statistician and inventor. In 1911, his company merged with three others to form a company that we now call IBM. I discussed this in my [2014] IMS Presidential Address. For me, the role of computation is a key part of statistics. Statistics has to address computational challenges.

Chayes: Statistics is not going to be useful if it is computationally inefficient.

Rohe: Do you think there is a difference between data science and statistics?

Yu: Defining the difference between data science and statistics will make things really difficult right now since we are defining data science as we engage and solve data problems.

Chayes: A key point is that the people from machine learning and people from statistics are already starting to form a vibrant and joint community. Right?

Yu: Yes. Yes. Intellectually, statistics and machine learning have much overlap as academic fields and some (like me) think machine learning represents a recent development in statistics, but individual people are more clearly defined. I’m in a statistics department and you [Jennifer] are in a machine learning department, but we do many similar things.

Chayes: What we had hoped was to encourage statisticians and machine learning people to start reaching out beyond the boundaries of their departments, their fields, their journals, and collaborating more extensively with each other on this wide breadth of application areas.

Yu: The conference took steps in this direction. Statisticians need to continue reaching out.

Chayes: The conference was just the beginning for both communities. What we really want is for people to be reaching out between these two communities because the questions that arise in data science are being addressed simultaneously by these two communities with complementary skill sets and techniques. They should be working with each other more.

Yu: I agree. We have a lot to learn from each other and we have a lot to gain from working together to forge the next frontier.

Rohe: Rafael, is there anything that you would like to add?

Rafael Irizarry: I agree with what has been said in this previous discussion. One point that has been missing is that there is often a disconnect between theory and applications. Sometimes the theory that is being developed is not necessarily targeting, as specifically as it could, the problems that are currently being faced by people who have data problems.

Yu: That is an excellent point. We need more of a dialogue between the theoretical and applied communities. Working with domain experts is a key part as well.

Irizarry: One of the things that is also missing among people who claim to do data science is an understanding of theory! So, I think it goes both ways.

Rohe: Thanks to Jennifer, Bin, Sham and Rafael for sharing your thoughts.