Ruobin Gong writes in praise of Bin Yu and Rebecca Barter’s Veridical Data Science, which she used to modernize a course:
As any seasoned instructor could attest, there is a certain kind of comfort to teaching a course on repeat. We know the lay of the land like the back of our hand: where, on the quest for knowledge, lie the twists and turns, the rough terrain, and the scenic viewpoints for unexpected delights. We know what excites our students, as well as when to slow our pace and give them time to soak it all in. It doesn’t hurt that we already have a thick deck of notes meticulously tailored to our teaching style, with every page bearing witness to our (let’s admit it) impeccable taste in the subject matter. On the very last day of every semester, we package up the material and tuck it neatly into a corner of our computers, like stowing a piece of cake in the fridge, savoring a sweetness that is bound to linger until the moment of return.
But cakes might spoil. Sometimes faster than we would like.
Last fall I was assigned to teach a course on regression and time series, a foundational course in Rutgers Statistics’ professional Master of Data Science program. The course occupies a special place in my heart because it was the first course I ever taught. After inheriting it from a senior colleague in 2018, I taught it three times before passing the baton. Though it had been a few years, the assignment instantly revived all the fond memories, cloaked in an aura of excitement that characteristically belonged to a newly minted junior faculty member who had her mind set on pouring her knowledge and passion into the mission of training the next generation of data scientists.
But as soon as I dusted off the old course material and began looking, the excitement quickly subsided. Something didn’t feel right. I stared at the citations for the two textbooks in the syllabus: one on regression, one on time series. Both books were published more than twenty years ago. Measured in data science time, twenty years ago may well be the equivalent of antiquity. Two decades ago, the repertoire of tools and concepts that we regard as bread and butter today was either nonexistent or in its unrecognizable infancy. Think ChatGPT (and deep learning, for that matter), tidyverse, A/B testing, even the notion of data science itself—not to mention the namesake departments and programs that are now hallmarks of numerous competitive institutions of higher education. Some of my prospective students might have been born after the textbooks were written! How are we preparing our students to meet the demands of prospective employers looking for talent to deliver data-based insights at the cutting edge, with textbooks that witnessed none of this revolution?
To be sure, statistics as a discipline is filled with timeless subjects. The art and the science of regression and time series are perennial topics. But professional data science programs have an aim that is markedly different from the faithful pursuit of epistemology. We have an obligation to be timely, and more fundamentally, to stay grounded in the realities that make data science a valuable course of study.
So I sat down with Matteo Bonvini, the instructor of the other section (yes, our program was now popular enough to require two instructors for this course), and we conjured a plan to modernize the curriculum. We decided to introduce Veridical Data Science (VDS), a fresh-off-the-press textbook by Bin Yu and Rebecca Barter (https://vdsbook.com), as a third reading and a guidebook to a written course project culminating in a prediction contest.
What made VDS particularly fitting for our course is that it situates theoretical knowledge about regression and time series analysis—solid foundations of a data scientist’s technical strength—within a broad and operational worldview. VDS takes the stance that statistical modeling, traditionally and narrowly construed, is but one component of the “Data Science Life Cycle,” which comprehensively describes the workflow of a modern data science project. Put simply, there is much to be done before we get to hit “lm()” in R. Work begins with an understanding of the domain problem. What is the question for which an answer is sought? How can we collect data (or, if some data has already been collected, use that data) to shed light on an answer? There is also the issue of raw data wrangling: cleaning, preprocessing, and exploratory analyses are crucial procedures we routinely perform, yet they are easily overlooked in pedagogical conversations. They are crucial because decisions taken during these phases can exert a fundamental impact on the outcome of the project. They are overlooked, frankly, because of their messiness—because we cannot develop beautiful mathematical theories to trace that impact in simple terms, and that murky reality hinders the practicality of teaching them.

On the other hand, work does not end with a printout of the “lm()” output, either. Like any branch of science, data science is iterative in nature, and scientific iterations are fueled by skepticism. The scrutinizing gaze the data scientist casts at everything she has done leading to a result, be it unimpressive or too good to be true, is the driver for her next explorations. When she is truly satisfied with her work, she still needs to synthesize the lessons learned from the exercise and convey them to the target audience with all strings attached: assumptions, qualifications, and limitations.
A well-distilled message is worthy of millions, if not billions, of data points. That is the essence of evidence-based learning.
VDS proved to be a great success with the class. Students wrote in their instructional feedback and in emails to us how much they appreciated the open-ended project, backed by real data, as an opportunity to understand and deeply engage with the data science workflow. The prediction contest—with a dose of competitiveness created with local gourmet chocolates from Edison, New Jersey—kept them going iteration after iteration in search of models with superior performance [1]. What I appreciate most about VDS goes beyond its articulation of an overarching, workable structure with which a data scientist can organize her thoughts. VDS infuses data science with learned humility, a scarce resource in an unavoidably technical area of study in which one could easily get carried away by the complex minutiae and lose sight of the greater picture.
At the 2025 JSM in Nashville, Matteo and I told our story in front of a panel of like-minded data science educators. The panel discussion, “Veridical Data Science Education,” was inspirational. We learned from Bin Yu how VDS had been deployed in full force in a Statistics Master’s capstone course on data analysis and machine learning at UC Berkeley [2], and likewise from Andrew Bray about a thoughtfully orchestrated adaptation of VDS to an undergraduate introductory course [3]. We also learned from Joshua Rosenberg who, speaking from his rich expertise as both a researcher of data science education and a data science educator, alerted us to the unique challenges facing the K–12 audience as the data science education communities exhibit signs of disciplinary divides [4]. Our audience engaged us in an active conversation with their thoughts and considerations about revolutionizing the data science curriculum at their own institutions to help their students meet the requirements of the modern workforce.
The groundswell of realist awakening was encouraging to see. But to effect change is a whole different challenge. Every successful pedagogical revamp is a testament to the unyielding vision of the course designer and the passionate devotion of the instructional team. Graduate teaching assistants may end up carrying the day, and institutional support often proves to be an indispensable ingredient. When the stars aren’t all aligned, it is up to each one of us instructors to step out of our comfort zone and bring in something new for our students. As the fall term draws to a close, we will soon reach the point of reflection. What would that something be the next time we teach?
—
[1] We would like to thank Mark Glickman and Shaoyang Ning for their valuable suggestions in formulating the prediction contest.
[2] https://classes.berkeley.edu/content/2025-spring-stat-214-001-lec-001
[3] https://stat20.berkeley.edu
[4] Rosenberg, J., & Jones, R. S. (2024). Data Science Learning in Grades K–12: Synthesizing Research Across Divides. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.b1233596