Marianne Huebner, Michigan State University, responds to David Hand’s recent “Hand Writing” column about data quality, which appeared in the January/February 2026 issue. She stresses the importance of Initial Data Analysis (IDA):

David Hand makes an excellent case for the crucial role of data quality, an aspect that too often gets sidelined in favor of “the excitement of learning about a powerful new statistical method” [1]. There are initiatives to support data quality workflows [2]. However, data quality checks are just one thread in the larger fabric required to weave a coherent and trustworthy data story.

The concept of Initial Data Analysis (IDA) is not new. It was discussed by Chatfield [3], who wrote: “The initial examination of data is a valuable stage of most statistical investigations, not only for scrutinizing and summarizing data, but also for model formulations.” Commentaries on this paper called for reform in statistical teaching. That was 40 years ago, yet IDA remains largely absent from most curricula.

But what exactly is IDA? Is it data cleaning? Basic summaries? Exploratory analyses?

The Topic Group “Initial Data Analysis”, part of the STRATOS Initiative [https://www.stratos-initiative.org], currently has six members from five countries [4]. The aim is to improve awareness of IDA as an important part of the research process and to provide guidance on conducting IDA in a systematic and reproducible manner. We began by developing a framework for IDA [4 and references therein]. Initial reactions were skeptical:

• “Isn’t this all common sense? Everybody does it.”

• “You can’t define it—statisticians have personal preferences.”

• “Every dataset is different. It must be ad hoc.”

Once we presented the framework, the reactions changed to sharing numerous horror stories of analyses gone wrong because data properties had not been considered. But sharing stories alone does not fix the issue, and it happens to the best. One example involves two classic papers analyzing the same dataset on optical isomers and sleep: Student (1908, Biometrika) and Fisher (1925, Statistical Methods for Research Workers). Using different analytical approaches, they reached the same conclusion… and both were wrong. The dataset had been mislabeled.

Today, researchers routinely fit increasingly complex models, thanks to powerful software. But have they assured themselves that the chosen methods are appropriate for the data at hand?

To understand current practice, we conducted a literature review, “Hidden Analyses” [4 and references therein]. We found that although many authors seem to conduct some form of IDA, it is often selective, unsystematic, and poorly documented. Reviewers and readers have no clear insight into whether IDA occurred at all, or into the extent of what was done. This is problematic because IDA may substantially and non-transparently influence results and conclusions.

To provide practical support, we wrote “Ten Simple Rules for IDA” [5] and developed a checklist for cross-sectional studies (“Regression Without Regrets”), and a checklist for longitudinal studies with worked examples [4 and references therein]. As a minimum, the data screening aspect of IDA should be conducted before carrying out planned statistical modeling:

1. Missingness (unit and item missingness)

2. Univariable descriptions (may include smallest and largest values, quantiles, mean, high-resolution histograms, frequencies and proportions—for all variables)

3. Multivariable descriptions (may include stratified summaries, scatterplots, correlation matrices, redundancy analyses—without the outcome variables)

A core principle of IDA is to avoid examining associations between covariates and outcome variables. This is a key distinction between IDA and exploratory data analysis (EDA).
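The three screening steps above can be sketched in a few lines of code. This is a minimal illustration, not the Topic Group’s recommended implementation: the dataset, variable names, and the 5%-style error injection are all hypothetical, and pandas/NumPy are assumed. Note that the outcome variable is deliberately excluded from every summary, in line with the IDA principle just stated.

```python
# Hypothetical minimal IDA data screening sketch (all data simulated).
import numpy as np
import pandas as pd

# Simulated cross-sectional dataset with one outcome and three covariates.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "age": rng.normal(50, 10, 200).round(1),
    "sex": rng.choice(["F", "M"], 200),
    "bmi": rng.normal(27, 4, 200).round(1),
    "outcome": rng.choice([0, 1], 200),
})
# Inject some item missingness so step 1 has something to find.
df.loc[rng.choice(200, 15, replace=False), "bmi"] = np.nan

# IDA principle: screen covariates only, never against the outcome.
covariates = [c for c in df.columns if c != "outcome"]

# 1. Missingness: per-variable (item) missingness counts.
item_missing = df[covariates].isna().sum()

# 2. Univariable descriptions: extremes, quantiles, mean for numeric
#    variables; frequencies and proportions for categorical ones.
numeric_summary = df[covariates].select_dtypes("number").describe()
sex_freq = df["sex"].value_counts(normalize=True)

# 3. Multivariable descriptions: stratified summaries and correlations,
#    still without the outcome variable.
by_sex = df.groupby("sex")[["age", "bmi"]].mean()
corr = df[["age", "bmi"]].corr()

print(item_missing)
print(numeric_summary)
```

In a real project these summaries would be written to a reproducible report and compared against the data dictionary before any planned modeling begins.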

Even better is to incorporate IDA elements directly into a statistical analysis plan [https://stratosida.github.io/activities.html]. This defines the scope of IDA activities in advance and allows for considering the consequences of IDA results for the chosen main data analysis (MDA) methods, or for the presentation and interpretation of MDA results. “Two modelling stages” were also mentioned by David Hand [1]. In fact, sometimes the “exciting” models become unnecessary, because IDA reveals that simpler approaches may be sufficient, or may be all that the dataset supports.

IDA takes time and resources, a fact often forgotten during project planning and budgeting. Occasionally, everything proceeds smoothly, but experience shows that even the best-curated datasets may contain about 5% errors or have data properties not directly suitable for the chosen statistical models. Secondary analyses of a dataset may save time, but only if metadata, data properties, and code were properly documented the first time around. A systematic, pre-planned IDA process prevents repeated analyses, revised tables, model changes, and avoidable delays.

The bottom line: IDA saves time.

References:

[1] Hand, D. Hand Writing: Data quality, the missing module. https://imstat.org/2025/12/14/hand-writing-data-quality-the-missing-module/

[2] Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, Huebner M, Schmidt B, Sauerbrei W, Richter A. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21(1):63. doi: 10.1186/s12874-021-01252-7.

[3] Chatfield C. The Initial Examination of Data. J R Stat Soc Ser A. 1985;148(3):214–31. doi: 10.2307/2981969.

[4] STRATOS Initiative Topic Group Initial Data Analysis, https://stratosida.github.io/

[5] Baillie M, le Cessie S, Schmidt CO, Lusa L, Huebner M. Ten simple rules for initial data analysis. PLoS Comput Biol. 2022;18(2):e1009819. doi: 10.1371/journal.pcbi.1009819.