Contributing Editor David J. Hand, Imperial College London, returns to his Hand Writing column with some thoughts on the unglamorous but fundamentally important topic of data quality.
Countless articles and many books have been written on the topic of data quality. High-quality data are clearly central to statistics and data science: to understanding underlying processes, making operational decisions, and building effective algorithms. As most readers of this article will know, the consequences of poor-quality data can be catastrophic. And yet, somehow, data quality almost always appears as a secondary consideration in the teaching of statistics and data science.
Perhaps this is understandable. If I am going to apply regression analysis, I first need to know the nature of the regression model, what it is intended to do, and how to use it. Only then can I sensibly study its limitations. Indeed, the excitement of a powerful new statistical method lies in learning the method itself, and in seeing how useful its application can be. Learning how something can fail is much less thrilling. But a consequence of this is that issues of data quality are typically pushed to the margins.
Such courses on data quality as there are tend to emphasise computer science aspects (storage and retrieval, database design, and so on) or regulatory compliance (in clinical trials, data governance, and the like), and to focus on particular application domains (official statistics, health services). This is in contrast to courses about statistical methods themselves, which aim to teach the student how to apply the tools in general, in any appropriate context.
Of course, many data quality issues are highly problem- and context-specific. But one might argue that the same applies to statistical tools themselves. When I apply regression to data from a physics experiment, I might be searching for a real (and significant) but very slight departure from a specific model (e.g., a tiny departure from a linear relationship between the dependent and independent variables), but when I apply the same tool to social science data I might hope merely to explore the proportion of variation accounted for. The way the tool is used, and what it is used for, depend on the application domain and the problem.
It is a familiar adage that in many projects the bulk of the work lies in the data preparation. For example, we have, "Data scientists spend approximately 80% of the time on preparing the data and about 20% on actual model implementation and deployment" (Hameed and Naumann, 2020), and, "…the process of data cleaning […] will often take up to 50 percent to 80 percent of the entire time spent on the project" (Yu and Barter, 2024, Chapter 4). If so, it is curious that the teaching of tools for this critical initial step has been relegated to a secondary role. Surely, any university or other institution concerned with teaching statistics or data science should also offer a (perhaps mandatory) module on data quality and how to achieve it, or at least work towards it.
Such modules might cover topics that are second nature to experienced statisticians and data scientists, but not to those newly exposed to the discipline. Sensitising new data analysts to the problems can only benefit both the discipline and the areas in which they apply it. These topics will doubtless be familiar to most readers of this article, and include:
• One should never simply assume that the data are sound.
• Applying data error-detection tools is a useful step, but such tools should not be relied upon unthinkingly.
• A good understanding of the data collection process can save one from many egregious errors, not only in detecting data problems but also in correcting them.
• Data might be "clean" for one purpose but not for another.
• One cannot guarantee that data can be cleaned sufficiently to answer any given question. If one could, then any data, no matter how limited or poorly related to the question, would suffice.
• An audit trail of any data cleaning activity should be retained, in sufficient detail that the original data can be reconstructed (the first sketch after this list illustrates the idea).
• New types of data come with new types of data quality issues. For example, administrative data collected for one purpose may not be helpful for another; data sets constructed by merging other sets have their own risks; and inferential tools based on an assumption of a random sample may be risky when applied to data with arbitrary and unknown underlying selection processes.
• It might be useful to have two modelling stages: one modelling the data selection and creation process, and one producing the substantive model of interest (the second sketch after this list illustrates this).
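To make the first few of these points concrete, here is a minimal sketch in Python of a cleaning step that flags, rather than silently fixes, suspicious records; never modifies the raw data in place; and logs every action. The file names, column names, and permissible ranges are hypothetical, purely for illustration:

    import pandas as pd

    # Load the raw data and work on a copy, so the original can always
    # be reconstructed (file and column names are hypothetical).
    raw = pd.read_csv("survey_raw.csv")
    df = raw.copy()
    audit_log = []  # one entry per cleaning action

    def log(action, n_rows):
        """Record each cleaning step so the whole pipeline is auditable."""
        audit_log.append({"action": action, "rows_affected": int(n_rows)})

    # Simple error-detection rules: flag, don't silently "fix".
    bad_age = ~df["age"].between(0, 120)
    log("flag ages outside [0, 120]", bad_age.sum())

    missing_income = df["income"].isna()
    log("flag missing income", missing_income.sum())

    duplicates = df.duplicated(subset="respondent_id", keep=False)
    log("flag duplicate respondent ids", duplicates.sum())

    # Rows are marked, not deleted: data that are too dirty for one
    # analysis may be perfectly usable for another.
    df["suspect"] = bad_age | missing_income | duplicates

    pd.DataFrame(audit_log).to_csv("cleaning_audit_trail.csv", index=False)

Flagging rather than deleting leaves the decision of what counts as "clean" to each analysis, and the saved log means the cleaning itself can be audited and, if necessary, undone.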
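The two-stage idea can also be sketched. Under the strong, and here purely illustrative, assumption that selection into the sample depends only on a covariate observed for every unit, one possible approach is inverse-probability weighting: a first model estimates each unit's probability of being observed, and the substantive estimate then weights each observed unit by the inverse of that probability. The simulation below is a toy example, not a recipe:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Simulated population: covariate x is known for everyone (say, from
    # a register); outcome y is seen only for the selected units.
    n = 20_000
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)  # true population mean of y is 0

    # Non-random selection: units with large x are more likely observed.
    observed = rng.random(n) < 1 / (1 + np.exp(-2 * x))

    # Stage 1: model the selection process.
    sel = LogisticRegression().fit(x.reshape(-1, 1), observed)
    p_hat = sel.predict_proba(x[observed].reshape(-1, 1))[:, 1]

    # Stage 2: the substantive estimate, reweighted to undo selection.
    naive = y[observed].mean()                             # biased upwards
    weighted = np.average(y[observed], weights=1 / p_hat)  # near the true 0
    print(f"naive mean: {naive:.2f}, weighted mean: {weighted:.2f}")

The naive sample mean reflects the selection process as much as the population; the weighted estimate recovers (approximately) the population quantity, but only because the selection process was explicitly modelled.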
Kim et al. (2003) have made a sterling effort to organise data quality issues into a "taxonomy of dirty data."
These and other aspects of data quality may not make for the most glamorous of statistical courses to teach. But such a course is certainly among those most likely to have a beneficial impact on the quality of the results, and not merely by reducing mistakes. After all, in domains where machine learning methods outstrip the accuracy of human diagnosticians, this has often been found to be attributable to the closer attention to data quality and completeness required when preparing input for the machine learning system.
References
• Hameed, M. and Naumann, F. (2020) Data preparation: a survey of commercial tools. SIGMOD Record, 49(3).
• Kim, W., Choi, B.-Y., Hong, E.-K., Kim, S.-K., and Lee, D. (2003) A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7, 81–99.
• Yu, B. and Barter, R.L. (2024) Veridical Data Science. MIT Press. https://vdsbook.com/04-data_cleaning