Institute of Mathematical Statistics | Data science: how is it different to statistics ?

Data science: how is it different to statistics ?

September 4, 2014

Contributing Editor Hadley Wickham is Chief Scientist at RStudio and Adjunct Professor of Statistics at Rice University. He is interested in building better tools for data science. His work includes R packages for data analysis (ggplot2, plyr, reshape2); packages that make R less frustrating (lubridate for dates, stringr for strings, httr for accessing web APIs); and packages that make it easier to do good software development in R (roxygen2, testthat, devtools, lineprof, staticdocs). He is also a writer, educator, and frequent contributor to conferences promoting more accessible and more effective data analysis. He writes:

Recently, there has been much hand-wringing about the role of statistics in data science. In this and future columns, I’ll discuss both the threat and opportunity of data science. I believe that statistics is a crucial part of data science, but at the same time, most statistics departments are at grave risk of becoming irrelevant. Statistics is flourishing, yet by and large academic statistics continues to focus on problems that are not relevant to most data analyses. In this first column, I’ll discuss why I think data science isn’t just statistics, and highlight important parts of data science that are typically considered to be out of bounds for statistics research.

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results. It’s rare to walk this process in one direction: often your analysis will reveal that you need new or different data, or when presenting results you’ll discover a flaw in your model.

Statistics has a lot to say about collecting data: survey sampling and design of experiments are well established fields backed by decades of research. Statisticians, however, have little to say about collecting and refining questions. Good questions are crucial for good analysis, but there is little research in statistics about how to solicit and polish good questions, and it’s a skill rarely taught in core PhD curricula.

Once the data has been collected, it needs to be tidied (or normalized) into a form that’s amenable for analysis. Organizing data into the right ‘shape’ is essential for fluent data analysis: if it’s in the wrong shape you’ll spend the majority of your time fighting your tools, not questioning the data. I’ve worked on this problem for quite some time (culminating in the tidy data framework) but I’m aware of little similar work by statisticians.
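As a rough illustration of what getting data into the right 'shape' can mean, here is a minimal Python sketch. It is not taken from the column or from any of the packages mentioned above: the example table, its column names, and the `tidy` helper are all hypothetical. It assumes the common reading of tidy data in which each row holds one observation and each column holds one variable.

```python
# Hypothetical example data: one row per patient, one column per treatment.
# The column names "treatment_a" and "treatment_b" are really values of a
# hidden "treatment" variable, which makes the table awkward to analyze.
wide = [
    {"patient": "A", "treatment_a": 12, "treatment_b": 15},
    {"patient": "B", "treatment_a": 9, "treatment_b": None},
]


def tidy(rows, id_key, var_name, value_name):
    """Reshape wide rows into one record per (id, variable) observation."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the identifier is repeated on every output record
            long_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return long_rows


tidy_rows = tidy(wide, id_key="patient", var_name="treatment", value_name="result")
for record in tidy_rows:
    print(record)
```

In the reshaped form every record has the same three fields, so grouping, filtering, or plotting by treatment no longer requires knowing which columns happen to exist.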

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling. Visualization and modelling are complementary. Visualizations surprise you, and can help refine vague questions. However, visualizations rely on human interpretation, so the ability to scale is fundamentally constrained. Models scale much better, and it’s usually possible to throw more computing at the problem. But models are constrained by their assumptions: fundamentally a model cannot surprise you. In any real analysis you may use both visualizations and models. But the vast majority of statistics research is on modelling, much less is on visualization, and less still on how to iterate between modelling and visualization to get to a good place.

The end product of an analysis is not a model: it is rhetoric. An analysis is meaningless unless it convinces someone to take action. In business, this typically means convincing senior management who have little statistical expertise. In science, it typically means convincing reviewers. Communication is not a mainstream thread of statistics research (if you attend the JSM, it’s easy to come to the conclusion that some academic statisticians couldn’t care less about the communication of results). Communication is a part of some PhD programs, but it tends to focus on professional communication (to other statisticians), not communicating with people who have substantive expertise in other domains.

In business, analyses are often not done just once, but need to be performed again and again as new data come in. These data products need to be robust in both the statistical sense (i.e. to changes in the underlying distributions/assumptions) and in the software engineering sense (i.e. to changes in the underlying technological infrastructure). This is a ripe field for research.

Statistics is a part of data science, not the whole thing. Statistics research focuses on data collection and modelling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products.

There are people in statistics doing great work in all these areas, but it’s not mainstream statistics. If you’re interested in these areas, it’s harder to get tenure, harder to get grants, and most of the ‘top’ statistics journals are unavailable to you.

Attempting to claim that data science is ‘just’ statistics makes statisticians look out of touch, and belittles the many other contributions outside of statistics.

What do you think? Let me know your thoughts at hadley@rstudio.com, or @hadleywickham.

Editor’s note: The opinions expressed are exclusively of the columnist and do not necessarily reflect opinions of the IMS or editorial opinions of the IMS Bulletin.

31 comments on “Data science: how is it different to statistics ?”

Mahbubul Majumder

September 5, 2014 at 3:55 pm

I am currently working to develop a data science program at the University of Nebraska at Omaha and teaching an introduction to data science. While talking to the business community around Omaha, we got positive feedback, and everyone indicated the necessity of such a program. The existing programs in statistics, computing, or business do not fully satisfy their needs. No matter how hard statisticians or computer scientists insist they are doing data science, it indeed makes them 'look out of touch'. Your discussion in the column nicely complements how Dr. Cleveland felt about statistics when he wrote his article "Data science: an action plan for expanding the technical areas of the field of statistics" in 2001. Finally, I share the same feeling when you say "it’s harder to get tenure, harder to get grants, and most of the ‘top’ statistics journals are unavailable to you". Looking forward to seeing your future columns.

MasterG

September 5, 2014 at 10:42 pm

Should the question be framed differently? How is data science different from data analysis? Data analysis seems to encompass all of "data science" and more because it involves models, visualization, simulation and significance testing or testing for independence, and attempts to answer questions about causality or what factors account for variability in a distribution. Data science is but one aspect of statistics as a discipline.

Stephen McDaniel

September 5, 2014 at 11:30 pm

A great summary of what I also see in the rapidly changing world of data. My experience is that many people who are strong in analytics and data management could definitely benefit from a strong course on thinking like a statistician (a use case review of many of the methods). In particular, developing a keen understanding that much of the power of statistics is underpinned by a strong focus on variability and how prior observed variability can inform your findings for better decisions. Likewise, many who are expert in statistics would benefit greatly from becoming more knowledgeable about the business or field(s) they analyze data within. This includes speaking more often with leaders and executives about goals, challenges and better understanding the language of the decision-maker. This would lead them to easily reduce excessive analysis of topics that will have low impact and better inform their assumptions about the models they build. Statisticians can also benefit from a stronger focus on managing dirty data, cleaning and tidying it. This should be a required course for most stats majors at any degree level. Finally, courses and practicums on data presentation skills including graph and dashboard design will radically improve the value of their degrees.

Vincent Granville

September 7, 2014 at 5:31 am

Statistics and data science are very different. It's like comparing astronomy and physics. Read my article "16 analytic disciplines compared to data science" at http://www.datasciencecentral.com/profiles/blogs/17-analytic-disciplines-compared.

HUFFPOLLSTER: New Polls Show How Close The Midterms Remain | TheAllNews

September 8, 2014 at 12:52 pm

[…] Wickham explains how data science differs from statistics. [IMS Bulletin via […]

Garfield Fisher

September 9, 2014 at 12:56 pm

I wish you had a twitter share link on this post. You make good points that I will share with my statistician colleagues. Thanks for the article.

Leo Godin

September 12, 2014 at 12:48 pm

Could you say that data science is the concrete and practical application of statistics to real-world problems? In this idea, domain knowledge is important. For example, the statistician might study the effects of supplementing vitamin D, because she is asked to by researchers. The data scientist might tell researchers that vitamin D interacts with vitamins A and K, so the three need to be researched together. The data scientist helps to ask better questions. That's how I see the difference.

Mark Samuel Tuttle

September 12, 2014 at 11:24 pm

Nice blog. I heard you speak in San Francisco in July 2012 on "The Future of Data Analysis". Glad to see you're still working on topics you mentioned there, including the "tidy data framework". I'm reminded of an analogous discussion with the then president of a related professional society. In response to my predicting a future of irrelevance for said society, he replied "Not everyone wants to be relevant." That said, let me support your observations as follows: I think the field of statistics, as opposed to professional statisticians, has become too narrow and constrained. Perhaps you're arguing the same thing. For example, proposing that "everything" in healthcare should undergo a double-blind randomized trial is self-defeating. As Russ Altman observes, "We don't have time, we'll all be dead." Thus, as you imply, we need data-driven approaches that construct experiments as best we can. Because so many U.S. patients are treated differently even though they present "identically", we have an ongoing natural experiment. The few professional statisticians I've spoken to about such things are deeply pragmatic, and, in their own way, they are data-driven, instead of being method-driven. But the many books on statistics I have are almost all method-driven, implying that the method is the magic, etc. If all we had to worry about was as you posit (visualization vs. modeling), I think we'd be in good shape. Instead what I see is endless refinement and esoterica re statistical tests. And, incidentally apropos "surprises", one way I know my model is good is when for a given input the model is right and my estimation from that data is wrong. Lastly, now that we have (more or less) unbounded computing resources, the field of statistics should take on "method" with the goal of making statistics more uniform, letting data drive differences and not method. I look forward to your future posts.

Homi

September 15, 2014 at 11:42 pm

I hope your friends at JHU read this. They seem to think that data science is another name for statistics.

Editor

September 17, 2014 at 5:06 pm

Dear Garfield. I'm embarrassed to say I had neglected to include a Twitter share link, but have now rectified this: please do share! Thanks for visiting.

Fred

September 18, 2014 at 3:53 am

I think the talk is more relevant to academic statisticians, not those working in industries such as health or finance. I have never been engaged in a project that is purely about statistical modelling. Each project is a collaboration with domain experts: discussing the data and the research questions we are going to answer, and communicating via visualisation and our modelling approach. And in the end, we have a fully developed protocol for the project. I have heard many, many times people saying that statisticians only care about analysis, not the business. I don't know where people are getting this feeling. Those working in the academic environment or those working as theoretical statisticians are developing and inventing methods for the analysis of different types of data. With the emergence of "Big Data", not only computing and IT structure matter, but the methods as well.

First Fall 2014 Newsletter | Newsletter

September 28, 2014 at 2:24 am

[…] Data science: how is it different to statistics ? […]

September 28, 2014 at 4:14 pm

Great, I will definitely share!

ThomasV

December 10, 2014 at 12:20 pm

As is well described in this post, data science is a process that at some point involves statistics and computer programming. However, too often it is associated only with machine learning and computer programming: https://skim.it/u/ThomasV/data-science-is-not-computer-science

M. Pace

July 14, 2015 at 8:58 pm

All the blather about how “statistics” is becoming irrelevant and “data science” is the right way to approach data is just hogwash. I don’t know whether the lies involved are deliberate or just ignorant, but it hardly matters. The name of the identical field as "data science" is “statistics”. “Data science” is just a fad name for the field of statistics. All the claims that “data science” does things that poor old irrelevant statistics doesn’t do are 100% untrue. Anyone familiar with statistics papers from the past 20 or 30 years will recognize what utter, complete nonsense articles like the one on this web page are.

Magister Dixit | Data Analytics & R

September 25, 2015 at 8:01 am

[…] “Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.” Hadley Wickham […]

alen owen

December 31, 2015 at 5:40 pm

Interesting article. I hold the view that data science is repackaged academic statistics, nothing new. A data scientist is a statistician with advanced computer programming expertise.

Vic Duoba

March 9, 2016 at 12:57 am

I think that data science encompasses all of statistics, not the other way around.

John Jimenez

April 19, 2016 at 8:39 pm

Spot on. It's just another buzzword, like "Big Data". Data Science, as the word is used today, is basically a very small subset of statistics and a fancy way of saying that one is practicing the data collection, categorization, and/or combination of existing statistical methods. Glad I'm not the only one that sees this.

Alvin Hsieh

May 16, 2016 at 8:03 pm

I found your article from a search trying to understand the differences between Data Science and Statistics. What I don't understand is why you are blaming Statistics for not asking good questions. It is not Statistics or Data Science that helps you ask good questions. It is the person. Even before you look at data, you should already be asking questions. Otherwise, no amount of data can ever help you with that. Asking the "good questions" is very subjective. One person's good question may not be the same as the next. I'm sorry that your PhD program failed you. Statistics courses that I have taken focused a lot of attention on interpretation of the results from data collection and modeling. Statistics is about drawing conclusions/interpretations about the pattern in the data. This article did not really state the difference between Statistics and Data Science. Sadly, your ranting does not help people understand the difference. All this did, as the comments show, is produce "more ranting". To me, your article is still clear as mud about what the differences are, and I know there are many. I am not taking sides and just want the facts.

Yanming Di

June 16, 2016 at 11:47 pm

If we define data science as the science of answering scientifically relevant questions using data, then we need to treat the entire learning process in a scientific manner. In statistics, we have extensive experience with experimental designs, learning from the models, and communicating the results (I actually think many basic visualization tools from statistics, box plots, scatter plots, etc, are still the most useful ones. One can certainly add new dimensions to these tools.) But as Hadley pointed out, there are gaps. For example, think about the entire pipeline of doing a biological experiment and drawing conclusions from the data. Statisticians can help with design, analysis, and reporting of the results, but in the past, we have paid less attention to helping with documenting all details of an experiment, curating a database of experiment settings for published results (not just experiment data), and so on. (Hadley obviously has a lot of experience with cleaning dirty data.) I also find it challenging in practice to convince all scientists to talk to a statistician about their designs before they carry out their experiments. If we can achieve that, we will keep many more statisticians busy, and will never need to worry about our job security (and we will be dealing with more high-quality data sets). I am also not clear how a statistician can help with developing the scientific questions. It seems more the job of a scientist. I do see that, sometimes, scientists want to find questions from the data: they call this process hypothesis generating. But I am not fully convinced that is the best way to generate scientific questions.

Jason Bodnar

July 18, 2016 at 2:42 am

I agree. I think also a key difference is that we as statisticians are taught the ins and outs of the scientific process. Take the theory of hypothesis testing for example. I have reviewed many data science program courses, and none appear to be calculus based nor as in-depth as theoretical statistics.

Khan

March 28, 2017 at 10:28 pm

Based on how the author defined the Data Science, I would say 'Data Science' is nothing but Statistics.

Jerzy Neyman

July 13, 2017 at 8:26 pm

Statistics is based on a theory. Does a single, unifying theory serve as the basis for "data science"?

Luna W

July 21, 2017 at 11:44 pm

I agree with this. As of now my definition of data scientist is an applied statistician in industry (most often tech industry) with advanced programming skills.

Data Scienteur

April 7, 2018 at 3:52 pm

To say that “Data Science is just Statistics” would have to imply that Statisticians use deep convolutional neural networks for classification, long short-term memory recurrent networks for forecasting, particle swarm optimizers and other computational intelligence algorithms for non-convex optimization, develop bespoke interactive real-time visualizations using libraries like D3 (implying they can write a modern JavaScript web application) and engineer solutions using all these building blocks end-to-end. If someone who proclaims to be a Statistician cannot do all the above, then they can’t claim that Statistics = Data Science. Because the above is by no means beyond the level of a 4 year graduate in a modern Computer Science curriculum, who happens to have 3 years of calculus, linear algebra, numerical analysis, discrete structures, mathematical statistics, statistical learning, artificial intelligence, deep learning, and the usual “core” Computer Science curriculum under his belt. This is what a modern, well-equipped Data Scientist possesses. They are a one man Statistical Computer Scientist Swiss Army knife.

Xianyi Wu

April 8, 2018 at 10:24 am

Before the emergence of the name "data science", we had spoken of statistics and data analysis for a long time. Some people divide statistics into two groups: theoretical statistics, which works with random variables, proposes statistical methods for assumed models and investigates the statistical properties of those methods, and applied statistics, which analyzes real-world data using the methods developed by theoretical statisticians. When data are analyzed, data preparation (data cleaning, data tidying and so on) is usually inevitable. Someone once said in a textbook on linear regression (obviously a statistics book) that, when analyzing data, one usually spends more than 80% of the time on data preparation and less time running a program to compute the statistics; sometimes that even takes only a few minutes. Compared to statistics, data science puts more emphasis on question formulation. This appears to be the only difference.

Arturo Erdely

August 14, 2018 at 9:49 pm

The expression “Data Science” is misleading because it is not a science. “Data scientists” are not scientists, they are more like engineers since they apply results obtained by sciences such as Statistics, Mathematics and Computer Science, among others. Data Engineering would be perhaps a more accurate expression for such activities.
