Linjun Zhang continues with the second in our invited series of articles on LLMs and AI, and their implications for the statistics profession. Should statisticians be worried about being replaced or left behind? Or should we, in fact, seize this opportunity and lead our neighboring disciplines? Linjun argues that the role of the statistician is not peripheral; it is central.
In the first piece of this series, I explored the practical side of our new reality: how tools like ChatGPT are already becoming daily companions in our research, programming, and teaching workflows. The “how-to” is immediate and compelling. But this new practical reality forces a deeper, more foundational question: What does the LLM era mean for us as statisticians? What is our unique, irreplaceable role when a machine can draft text, write code, and even appear to reason?
We inhabit a field built on rigor, inference, and the structured understanding of data. But the new wave of generative AI, built on a massive scale and with emergent, often unpredictable, behaviors, can feel alien to our traditions. It’s a moment that can provoke anxiety. Are our methods still relevant? Are we being left behind?
I believe the opposite is true. Our role has never been more critical. But to grasp why, we must first define our position. My undergraduate training was in mathematics, which left me partial to structured frameworks and axiomatic thinking.
To that end, let’s begin with three guiding “axioms” (or working assumptions) to frame our role in this new landscape.
Axiom 1:
Most recent breakthroughs in LLM development—the scaling laws, the architectural innovations, the massive training runs—have come from the AI industry, not from traditional academia.
This is a function of immense computational and financial resources. Some of us statisticians have felt that our field is being left behind, but this concern applies to academia as a whole. This academia-wide challenge requires each field to find its unique contribution. For us, that means defining our strengths relative to our closest academic neighbors, particularly mathematics and computer science.
Axiom 2:
Compared with many pure mathematicians who focus on abstraction, our strength is statistical modeling of real-world problems. We are comfortable with noise, data-driven structure, heterogeneity, latent variables, and the messy, adaptive nature of real-world data.
Axiom 3:
Compared with many computer science efforts that prioritize predictive accuracy or system engineering, our core strength as statisticians is our toolkit for uncertainty quantification, rigorous inference, and formal evaluation.
With these three axioms to hand, we can derive a set of logical consequences—corollaries—about our evolving role.
These corollaries don’t just suggest new research problems; they define our responsibility to the scientific community and to society.
Corollary A:
We engage with these tools as their auditors and calibrators.
Given Axiom 1, it’s not our primary job to build the next trillion-parameter model. Instead, our role is to bring statistical rigor to what these models produce and how they are used. This role as the ecosystem’s “auditors” and “calibrators” is critical.
The “auditor” role involves applying hypothesis testing to check whether outputs satisfy desired properties. The “calibrator” role involves post-processing. LLM outputs are notoriously black-box and uncalibrated; the probabilities spat out by a model after RLHF (Reinforcement Learning from Human Feedback) are not true likelihoods but artifacts of a complex optimization. Here, we can apply our rich toolkit of methods to filter, abstain, or adjust outputs to satisfy fairness, safety, or reliability constraints (see, e.g., [1,2]).
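To make the “calibrator” idea concrete, here is a minimal sketch of one of the simplest post-hoc recalibration tools, temperature scaling: fit a single scalar on a held-out labeled set so that reported confidences better match observed accuracy. This is an illustration only, not the multicalibration methods of [1,2], and the data arrays below are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(temperature, logits, labels):
    """Negative log-likelihood of the labels under temperature-scaled softmax."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Find the temperature minimizing NLL on a held-out calibration set."""
    result = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded",
                             args=(logits, labels))
    return result.x

# Hypothetical calibration data: raw model scores and true labels.
rng = np.random.default_rng(0)
cal_labels = rng.integers(0, 4, size=500)
cal_logits = rng.normal(size=(500, 4))
cal_logits[np.arange(500), cal_labels] += 1.0       # signal toward the true class
cal_logits *= 3.0                                   # inflate scores: overconfident

T = fit_temperature(cal_logits, cal_labels)
print(f"fitted temperature: {T:.2f}")               # T > 1 shrinks overconfidence
```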
This role also extends to the use of LLMs as synthetic data generators. While promising for augmenting sparse datasets, this practice is statistically fraught. It raises immediate concerns of distributional mismatch, bias amplification, and unknown sampling variability. The statistician’s role is not just to use this data, but to model the synthetic data generation itself, quantifying the bias it introduces and designing validation pipelines to debias the downstream statistical inference [3].
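Reference [3] develops this program formally as prediction-powered inference. The sketch below shows only its simplest instance, estimating a population mean: predictions on a large unlabeled sample are debiased using a small gold-labeled sample. All arrays are hypothetical stand-ins for real data and model predictions.

```python
import numpy as np
from scipy.stats import norm

def ppi_mean(preds_unlabeled, preds_labeled, y_labeled, alpha=0.05):
    """Prediction-powered estimate of a mean, in the spirit of [3]:
    average the predictions on the large unlabeled sample, then correct
    with the prediction error observed on the small gold sample."""
    n, N = len(y_labeled), len(preds_unlabeled)
    rectifier = y_labeled - preds_labeled            # observed prediction bias
    estimate = preds_unlabeled.mean() + rectifier.mean()
    # Simple normal-approximation confidence interval.
    se = np.sqrt(preds_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = norm.ppf(1 - alpha / 2)
    return estimate, (estimate - z * se, estimate + z * se)

# Hypothetical setup: the "LLM" systematically over-predicts by 0.3.
rng = np.random.default_rng(1)
truth_unlabeled = rng.normal(1.0, 1.0, size=10_000)
preds_unlabeled = truth_unlabeled + 0.3 + rng.normal(0, 0.2, size=10_000)
y_labeled = rng.normal(1.0, 1.0, size=200)
preds_labeled = y_labeled + 0.3 + rng.normal(0, 0.2, size=200)

est, ci = ppi_mean(preds_unlabeled, preds_labeled, y_labeled)
print(f"debiased estimate: {est:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```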
Finally, we must audit the use of LLMs within larger analytical pipelines. When an LLM is used to, for example, extract information from unstructured data, its output is not a “true” variable. It is a variable with error. That error must be propagated. Statisticians are uniquely trained to understand and model this error propagation, preventing the rest of the scientific community from treating AI-generated data as ground truth.
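As one small, hedged illustration of what “propagating the error” can mean: if an LLM extracts a binary variable with imperfect sensitivity and specificity (estimated from a gold-standard subsample), the naive prevalence estimate is biased, and a classical misclassification correction (the Rogan–Gladen estimator) recovers the truth. The numbers below are hypothetical.

```python
import numpy as np

def corrected_prevalence(p_observed, sensitivity, specificity):
    """Rogan-Gladen correction: back out the true prevalence from a
    proportion measured with known misclassification rates."""
    return (p_observed + specificity - 1) / (sensitivity + specificity - 1)

# Hypothetical scenario: true prevalence 0.20, but the LLM extractor has
# sensitivity 0.85 and specificity 0.90, estimated from a gold subsample.
rng = np.random.default_rng(2)
true_labels = rng.random(5_000) < 0.20
flip_pos = rng.random(5_000) < 0.15          # misses among true positives
flip_neg = rng.random(5_000) < 0.10          # false alarms among true negatives
llm_labels = np.where(true_labels, ~flip_pos, flip_neg)

naive = llm_labels.mean()
corrected = corrected_prevalence(naive, sensitivity=0.85, specificity=0.90)
print(f"naive: {naive:.3f}, corrected: {corrected:.3f}, truth: 0.200")
```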
In short, our role evolves: from directly analyzing data to critically auditing and calibrating the AI tools that assist in that analysis.
Corollary B:
We frame LLM problems with a statistical modeling lens.
Axiom 2 highlights our unique strength in modeling real-world systems. We can reconceptualize many of the thorniest open problems in AI not as pure engineering challenges, but as statistical modeling problems that are familiar to us.
Consider watermarking. The problem of detecting AI-generated text can be cast as a classic hypothesis test [4,5]. The null hypothesis (H0) is that a given text is drawn from the (incredibly complex) distribution of human language. The alternative (H1) is that it is drawn from the watermarked model’s distribution. This immediately brings up statistical concepts like the power of the test, the Type I error rate (falsely accusing a human writer), and the Type II error rate (failing to detect an AI).
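References [4,5] develop the optimal theory. Purely as an illustration of this hypothesis-testing framing (and not their procedures), consider a popular “green list” style of watermark, in which each generated token is nudged toward a pseudorandom subset containing a fraction (gamma) of the vocabulary. Under H0, human text hits the green list roughly independently with probability gamma per token, so detection reduces to a one-sided binomial test. The counts below are hypothetical.

```python
from scipy.stats import binom

def watermark_pvalue(green_count, n_tokens, gamma=0.5):
    """One-sided p-value under H0: human text hits the 'green list'
    independently with probability gamma per token (a simplification)."""
    return binom.sf(green_count - 1, n_tokens, gamma)   # P(X >= green_count)

# Hypothetical document of 300 tokens, 190 of which fall in the green list.
p = watermark_pvalue(green_count=190, n_tokens=300, gamma=0.5)
print(f"p-value: {p:.2e}")   # a tiny p-value is strong evidence of a watermark

# Rejecting only when p is below a chosen level (say 0.01) controls the
# Type I error: the rate at which human writers are falsely accused.
```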
Or take data misappropriation. Did a model train on copyrighted data and is now “leaking” it? This can be framed as a problem of hypothesis testing and membership inference. Statisticians can design tests to determine if a specific piece of text or data was likely in the training set with optimal guarantees [6].
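Reference [6] provides a formal framework with guarantees; the toy sketch below conveys only the shape of such a test: compare the model’s loss on a candidate passage to a null distribution of losses on passages known not to be in the training set, and report an empirical one-sided p-value (unusually low loss is evidence of memorization). The loss values are hypothetical placeholders for quantities computed from the model.

```python
import numpy as np

def membership_pvalue(candidate_loss, null_losses):
    """Empirical one-sided p-value under H0: the candidate text was NOT in
    the training set. Unusually low loss is evidence of memorization."""
    null_losses = np.asarray(null_losses)
    return (1 + np.sum(null_losses <= candidate_loss)) / (len(null_losses) + 1)

# Hypothetical per-token cross-entropy losses.
rng = np.random.default_rng(3)
null_losses = rng.normal(3.2, 0.3, size=999)      # passages known to be unseen
candidate_loss = 2.1                              # suspiciously easy to predict

print(f"p-value: {membership_pvalue(candidate_loss, null_losses):.4f}")
```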
Even hallucinations can be reframed. A hallucination can be viewed as an out-of-distribution problem, where a prompt has pushed the model into an uncharted, low-density region of its latent space. Neyman–Pearson classification can also help by producing an optimal test that controls the probability of certifying a false answer, thereby providing reliable factuality assessments and helping to avoid hallucinations [7].
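Reference [7] establishes finite-sample, distribution-free guarantees; the sketch below conveys only the Neyman–Pearson flavor of the idea, with hypothetical confidence scores: calibrate an answer-acceptance threshold on examples known to be incorrect, so that the chance of certifying a hallucinated answer stays near a target level alpha.

```python
import numpy as np

def np_threshold(scores_incorrect, alpha=0.05):
    """Neyman-Pearson style cutoff: set the threshold at a high quantile of
    the confidence scores of known-INCORRECT answers, so that accepting
    answers scoring above it keeps the Type I error near alpha."""
    return np.quantile(np.asarray(scores_incorrect), 1 - alpha)

# Hypothetical calibration data: model confidence on answers labeled wrong/right.
rng = np.random.default_rng(4)
scores_incorrect = rng.beta(2, 5, size=400)       # hallucinations: lower scores
scores_correct = rng.beta(5, 2, size=400)         # factual answers: higher scores

t = np_threshold(scores_incorrect, alpha=0.05)
type1 = (scores_incorrect > t).mean()             # ~0.05 by construction
power = (scores_correct > t).mean()               # fraction of facts we certify
print(f"threshold: {t:.3f}, Type I error: {type1:.3f}, power: {power:.3f}")
```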
By framing these issues as statistical modeling problems, we move beyond heuristic, ad-hoc fixes and toward theoretically grounded solutions that allow us to understand trade-offs and provide performance guarantees.
Corollary C:
We lead the design of rigorous, scientific benchmarks.
Axiom 3, our strength in evaluation, positions us to fix one of the most significant gaps in the current LLM landscape: benchmarking. The field is currently driven by “leaderboards” that report single-point estimates (like accuracy or MMLU [Massive Multitask Language Understanding] scores) without uncertainty bounds, paying little attention to prompt variability or random seeds. This culture encourages “score chasing” and overfitting to specific benchmarks.
As statisticians, we know this is not scientifically sound. Benchmarking is an experiment, and it should be treated with the same rigor as a clinical trial. We must champion a shift from “leaderboards” to “uncertainty-aware, stability-aware evaluation.” This means designing benchmarks that collect distributions of performance, not single numbers, and reporting confidence intervals, calibration curves, and full error-decomposition analyses.
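As one small, hypothetical example of what “uncertainty-aware” can look like in practice: rather than reporting a single accuracy, average over prompt paraphrases and bootstrap over benchmark questions to attach a confidence interval to the score.

```python
import numpy as np

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for benchmark accuracy.
    `correct` has shape (n_questions, n_prompt_variants): 1 if the model
    answered that question correctly under that prompt paraphrase."""
    rng = np.random.default_rng(seed)
    per_question = correct.mean(axis=1)            # average over prompt variants
    n = len(per_question)
    idx = rng.integers(0, n, size=(n_boot, n))     # resample questions
    boot_scores = per_question[idx].mean(axis=1)
    lo, hi = np.quantile(boot_scores, [alpha / 2, 1 - alpha / 2])
    return per_question.mean(), (lo, hi)

# Hypothetical results: 500 questions, each asked with 5 prompt paraphrases.
rng = np.random.default_rng(5)
results = (rng.random((500, 5)) < 0.72).astype(float)

acc, (lo, hi) = bootstrap_accuracy_ci(results)
print(f"accuracy: {acc:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```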
Furthermore, our tools are perfectly suited for better benchmark construction. We can move beyond simple data scraping, for example, by using stratified sampling to ensure a benchmark’s questions are representative across different topics and difficulty levels. We can also apply principles from the Design of Experiments to the benchmark construction process itself.
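A minimal sketch of the stratified-sampling idea, with hypothetical topic labels: allocate benchmark items proportionally (or by any chosen design) across strata, rather than scraping whatever is easiest to find.

```python
import numpy as np

def stratified_sample(items, strata, n_total, seed=0):
    """Draw a benchmark of size ~n_total with proportional allocation
    across strata (e.g., topic x difficulty cells)."""
    rng = np.random.default_rng(seed)
    items, strata = np.asarray(items), np.asarray(strata)
    chosen = []
    labels, counts = np.unique(strata, return_counts=True)
    for label, count in zip(labels, counts):
        n_stratum = int(round(n_total * count / len(items)))  # proportional share
        pool = np.flatnonzero(strata == label)
        chosen.extend(rng.choice(pool, size=min(n_stratum, len(pool)),
                                 replace=False))
    return items[np.array(chosen)]

# Hypothetical item pool: 10,000 candidate questions tagged by topic.
topics = np.random.default_rng(6).choice(
    ["algebra", "biology", "law", "coding"], size=10_000, p=[0.4, 0.3, 0.2, 0.1])
question_ids = np.arange(10_000)

benchmark = stratified_sample(question_ids, topics, n_total=1_000)
print(len(benchmark), "items sampled, proportionally across topics")
```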
This rigor is the only way to move the field from “it works” to “we understand how and when it works, and how to make it work.”
Reflection: Our Role is Central, Not Peripheral
Putting this all together, what should we do now? The speed of AI advancement might raise the question, Are statisticians being left behind? The three axioms above invite a different answer. Yes, industry leads in scale. But statisticians lead in uncertainty, modeling, and inference.
Our role is not peripheral; it is central. If an LLM can write code and draft text, then what truly remains at stake is judgment, calibration, interpretation, and uncertainty awareness.
That is our domain.
This framework is a call to action. In our research, we should focus on these corollary areas: uncertainty quantification, auditing, rigorous benchmarking, and statistical monitoring tools for deployed models. In our teaching, we must update our curricula. We must teach students not just how to use an LLM, but how to critique it: how to spot a non-calibrated probability, how to test for bias, how to ask for uncertainty.
And we must collaborate. While we may not train the next foundation model, we are essential for its validation. We should partner with industry and computer science to ensure that statistical rigor is built into the production pipeline, not just sprinkled on as an afterthought. If we do this, we won’t be playing catch-up; we will be shaping the future.
In the next and final article, I will turn to the ethical dimension that underpins all of this: governance, authorship, accountability, and the vital role statisticians must play in building public trust in an AI-driven world.
—
References
[1] Zhun Deng, Cynthia Dwork, and Linjun Zhang. “HappyMap: A Generalized Multicalibration Method.” ITCS 2023.
[2] Lujing Zhang, Aaron Roth, and Linjun Zhang. “Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks.” ICML 2024.
[3] Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. “Prediction-powered inference.” Science 382, no. 6671 (2023): 669–674.
[4] Baihe Huang, Banghua Zhu, Hanlin Zhu, Jason Lee, Jiantao Jiao, and Michael Jordan. “Towards Optimal Statistical Watermarking.” In Socially Responsible Language Modelling Research.
[5] Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J. Su. “A statistical framework of watermarks for large language models: Pivot, detection efficiency and optimal rules.” The Annals of Statistics 53, no. 1 (2025): 322–351.
[6] Yinpeng Cai, Lexin Li, and Linjun Zhang. “A statistical hypothesis testing framework for data misappropriation detection in large language models.” arXiv preprint arXiv:2501.02441 (2025).
[7] Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, and Linjun Zhang. “FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees.” ICML 2025.