Institute of Mathematical Statistics | XL-Files: When a Statistician becomes a (COVID) Statistic

XL-Files: When a Statistician becomes a (COVID) Statistic

July 18, 2022

Xiao-Li Meng got it. COVID, that is. But the silver lining is that he now makes a welcome return to writing the “XL-Files”:

What happens when a statistician becomes a COVID statistic? Well, first of all, a COVID fever reignited my XL fervor. No, the missing “XL-Files” have not been due to lack of shareable stories in my life (if I have one). Launching and editing Harvard Data Science Review (HDSR) alone has given me abundant excitements to regale and frustrations to vent, including not having time to vent. The general data science community apparently lifts up more those who have developed the skills to benefit from NeurIPS or ICML kinds of pressing deadlines, instead of Annals-esque requests for deeper probing. However, illness tends to reset priorities, rearrange calendars, and remind us of our roots. Usually temporary, unfortunately, or maybe fortunately, depending on how you look at it.

Regardless of how long the fever lasted, here I am, reopening the “XL-Files,” while appreciating the luxury of being able to repeat this mantra for the (n+1)st time: Every cloud has a silver lining.

An unexpected motivation, from a participant at a conference (the one where COVID finally got hold of me after chasing me unsuccessfully for over two years), was also responsible for this reopening. Her message that she had used some “XL-Files” for teaching was as encouraging to me as, I imagine, it would be for a novice winemaker to find that their product is being sampled by a WSET class.

Secondly, my COVID episode personalized several research areas that have been competing for much of my (non-feverish and non-HDSR) time: individualized risk and prediction, imprecise probability, data quality, and data privacy. I submit these stories of personalization for your judgment as to whether they are results of an overfitting neural network attempting to bootstrap itself out of a natural annealing process. But regardless of how your non-artificial neural network differs from mine, I hope we share a time-honored lesson: preaching is far easier than practicing.

The onset of COVID was signaled by a rather sudden sense of chill, much like entering a wine cellar without being prepared for the immediate temperature drop. There could be a variety of reasons for feeling sick after a week of travel, but knowing that someone at the conference had just tested positive for COVID obviously should increase my chance of being infected. However, what does “my chance” actually mean here, and in what ways is it affected by my other data? As my body was getting busy with a rising temperature, my brain had its own fervent self-dialogue. “I just tested negative, and I am fully vaccinated and boosted.” “But I have symptoms.” “But I had symptoms before, and most COVID cases have no symptoms.” “This feels more like a bad flu.” “But most infected people reported that it is like a bad flu.” “I wore masks.” “But I went out for lunch with others, and at the banquet few had masks on.” “But …”

Wait. Where was Bayes’ theorem? What events were being conditioned upon? Did my brain just commit a prosecutor’s fallacy? Wait, wait. How could anyone apply Bayes’ theorem here? What numbers can be plugged in? Where could those numbers (any numbers) come from? Wait, wait, wait. Where was Dempster’s rule of combination when it’s most needed? And what were the pieces to be combined? Where were p(I have COVID), q(I don’t have COVID), and r(I have no idea)? Did my brain just convince itself that r increases with the duration of the dialog? Is that a form of dilation or more a hallucination?
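For readers who would like to see what combining such pieces might look like, here is a minimal sketch of Dempster’s rule on the two-element frame {COVID, no COVID}. The class name `Mass`, the function `combine`, and every number below are hypothetical and chosen purely for illustration; nothing here was elicited from my feverish brain, and the sketch assumes the two pieces of evidence are independent, which is itself a strong assumption.

```python
from typing import NamedTuple


class Mass(NamedTuple):
    """Belief masses on the frame {COVID, no COVID}; p + q + r = 1."""
    p: float  # mass on {COVID}: "I have COVID"
    q: float  # mass on {no COVID}: "I don't have COVID"
    r: float  # mass on the whole frame: "I have no idea"


def combine(a: Mass, b: Mass) -> Mass:
    """Combine two independent pieces of evidence by Dempster's rule."""
    # Conflict: one source points to COVID while the other points away.
    conflict = a.p * b.q + a.q * b.p
    if conflict >= 1.0:
        raise ValueError("The two sources are in total conflict.")
    norm = 1.0 - conflict
    return Mass(
        p=(a.p * b.p + a.p * b.r + a.r * b.p) / norm,
        q=(a.q * b.q + a.q * b.r + a.r * b.q) / norm,
        r=(a.r * b.r) / norm,
    )


# Hypothetical inputs, for illustration only.
evidence_a = Mass(p=0.4, q=0.1, r=0.5)
evidence_b = Mass(p=0.1, q=0.6, r=0.3)
combined = combine(evidence_a, evidence_b)
print(combined)
```

With these made-up inputs the combined r is smaller than either input r. Combining with the vacuous mass `Mass(0.0, 0.0, 1.0)` returns the other input unchanged, which is one way to read “I have no idea” as contributing nothing.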

Fortunately, there is a rapid test that can rapidly stop the hallucination, or at least give me an instrument to greatly reduce r. Having reduced r, I could concentrate on reducing the fever. Another instrument came to help: the fever was 99.9°F (37.7°C) the first night I returned from the conference and got a digital thermometer from a local pharmacy. The measuring process took longer than I expected, but everything I preached about using n>1 was completely suppressed by my annealed brain. I surmised that it was afraid of engaging in another r-increasing exercise.

Waking up soaked in Tylenol-enhanced sweat the next morning, I took another measurement. The thermometer quickly “peeped,” reporting 100°F (37.8°C). Wait. That couldn’t be right. I felt less feverish, and the peep came way too fast, compared to that of the night before. I had to measure it again. It took a bit longer, but it gave a number that stopped me from employing n=3: 99.4°F (37.4°C). It just felt right. And when my concerned family members called, I could honestly tell them not to worry as my fever had gone down.

Honestly? Well, I don’t think I need to insult any IMS Bulletin reader’s intelligence by explaining the logical equivalence between, “What’s wrong with choosing numbers that can make me feel better and comfort others?” and, “What’s wrong with choosing data to support my values and ideology and to unite all people who support the same?” But actively reflecting upon how we behave differently (consciously or subconsciously) as private individuals as opposed to professional members can remind us, minimally, to pay more attention to data minding before data mining or analysis. For example, modeling measurement errors for self-reported measures such as blood pressure, weight, food intake, amount of exercise, etc., should never be done by only adding a convenient Gaussian error or any other symmetric error.
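A small simulation can make the point about symmetric errors concrete. The sketch below is a toy model, not a description of any real study: it assumes each thermometer reading has symmetric Gaussian noise around the true value, and that the person reports the lowest of their readings because it “feels right.” The function names and the noise level of 0.3 degrees are hypothetical.

```python
import random
import statistics


def reported_error(n_readings: int, sd: float, rng: random.Random) -> float:
    """Error of the reported value when the lowest of n readings is kept."""
    # Each individual reading has symmetric, zero-mean Gaussian error.
    readings = [rng.gauss(0.0, sd) for _ in range(n_readings)]
    return min(readings)


def simulate(n_readings: int, sd: float = 0.3,
             n_people: int = 20000, seed: int = 0) -> list[float]:
    """Simulate reported errors for many hypothetical self-reporters."""
    rng = random.Random(seed)
    return [reported_error(n_readings, sd, rng) for _ in range(n_people)]


single = simulate(n_readings=1)  # report the only reading taken
picked = simulate(n_readings=2)  # report the more comforting of two

print("mean error, one reading:  ", round(statistics.mean(single), 3))
print("mean error, best of two:  ", round(statistics.mean(picked), 3))
```

Under these assumptions the error of a single reading averages out near zero, while the error of the selected reading is shifted downward, even though every individual reading was symmetric. The asymmetry comes from the selection, not from the instrument.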

Yes, our individual behaviors are sufficient to cast strong doubts about such convenient assumptions when we have reliable priors on the similarities of human behaviors. I have never witnessed any of you ignoring or discarding any survey and hence making yourself a contributor to the big headache of non-response bias, which I have invested a considerable amount of my professional time to ease. Yet, I am willing to put my professional reputation (if I have one) on the line to state that statistically speaking, we all have contributed to this problem multiple times in our private lives, drawing from my experience of not being able to answer over 95% of the surveys I receive every year, no matter how hard I compel myself on professional and moral grounds. (If I have just insulted you by implying your moral standard is as low as mine, please be in touch so I can send you an HDSR readership survey as a token of my apology.)

Data privacy is another area where reflecting on our private behaviors may have professional benefits (and vice versa). Bluntly, data privacy is an oxymoronic term, because data are born to reveal, yet privacy requires us to conceal. Periodically reflecting upon our private behaviors should help us better appreciate how complex the issue is, be more sensible in making professional demands, and give others the benefit of the doubt when they seem to make the data less private or useful than our preferred level of trade-off (or lack thereof).

My COVID encounter reminded me of this possibility because of the actual instance of trading between providing timely contact tracing information and protecting the privacy of the infected individuals. Identities are critical for contact tracing, and I’m deeply grateful for such information volunteered by an Individual, who also shared the experience of the rapid onset without warning signs. This timely information, and the knowledge that it could strike rather suddenly, gave me just enough time, and reason, to make arrangements with my family for a minimax quarantine strategy before I got home—an arrangement that, retrospectively, we are all glad that we made.

If physical health is the only metric for optimization, one may argue that any personal information that can help others to reduce the risk of delayed treatments or the spreading of the virus should be shared as quickly and as widely as possible among (in this case) the conference attendees, regardless of how the information is obtained. Again, I don’t need to insult anyone’s intelligence by explaining why a single metric, however well-intended and well-designed, would almost always fall short in addressing problems in the human ecosystem. But my intelligence is seriously auto-insulted by my failure to find a privacy-preserving narrative that would reveal an additional privacy dilemma this conference faced, but without increasing the privacy-loss budget for any meeting attendees, especially those who do not wish to disclose their COVID status. Protecting privacy is extremely hard, because information travels like a virus. (It also mutates as it spreads.)

I may as well take the cue and stop here before this self-invited feverish reopening remark becomes an editor-invited closing remark for the “XL-Files.” But I still need to give credit where credit is due. Can anyone help to locate the original source of this inspiration for the title of this column? “Don’t become a statistic, drive safely. Go to graduate school—become a statistician”?