Institute of Mathematical Statistics | XL-Files: A Blitz of DS, Stat and AI Down Under

XL-Files: A Blitz of DS, Stat and AI Down Under

March 30, 2024

Xiao-Li Meng writes:

Here is a moral test (especially for statisticians). Which of the following acronyms baffles you the most: ANOVA, ADSN, ASC, ANU, ABS, or AITA?

Regardless of your answer (or questioning—a moral test??), the eight A’s in the above acronyms will forever remind me of my recent eight-day journey through the data science landscape in Australia. (Am I flirting with numerology? Yes: see the Dec 2022 “XL-Files,” https://imstat.org/2022/12/13/xl-files-xl-is-x-or-lx/). It was an intense and intoxicating whirlwind tour. With a frame now ripe for senior discounts—a charming reminder served by a joyous staff member at the National Museum of Mathematics in New York City—I initially hesitated at the thought of the lengthy voyage through Doha’s skies, 24 or 30 hours, west- or eastbound. Yet, it wasn’t long before I persuaded myself: what better path to spiritual renewal than to soar closer to the empyrean, or at least beyond the reach of my daily spammers?

My intoxicating tour started on December 8, 2023, inside a colossal wine barrel. If you suspect someone is jesting, that’s precisely how I felt upon being invited to give a keynote at the second gathering of the Australian Data Science Network (ADSN) in Adelaide. The humor wasn’t that a statistician was addressing data scientists; rather, it was the venue itself—the National Wine Center of Australia. Apparently, Aussie data scientists understand well that effective networking requires much social lubrication.

Fortuitously, I had just published my first article in a wine magazine, FONDATA. (Not kidding; if you want proof, check out Episode 32 of the Harvard Data Science Review podcast: https://hdsr.mitpress.mit.edu/podcast). This provided an apt theme—or at least a title—for my keynote: “Seeking Simplicity in Statistics, Complexity in Wine, and Everything Else in Fortune Cookies.” FONDATA has graciously permitted me to reproduce the article in the IMS Bulletin, allowing me to share my oenological journey through several upcoming “XL-Files.” As a teaser, it includes an ANOVA assessing the impact of wine tasting order on preference ranking, based on a blind tasting of four Rieslings (from the XL-cellar), conducted by two teaching fellows of the Gen Ed course: “Real Life Statistics—Your Chance for Happiness (or Misery)” (https://statistics.fas.harvard.edu/statistics-your-chance-happiness-or-misery)

Presenting ANOVA at a data science conference might seem to be masochistic—are statisticians so removed from the age of deep learning that they still reminisce about techniques devised by R.A. Fisher, buried 60 years ago (and, incidentally, not far from the wine center)? Well, I ventured—perhaps under the influence of you-now-know-what—that ANOVA and deep learning accomplish the same task, that is, separating revelatory variations (patterns) from obfuscatory variations (noises).

Admittedly, deep learning reveals patterns with a power (and peril) ANOVA could never match, unthinkable in Fisher’s time due to its computational demands. Yet, they share the fundamental idea of leveraging the most salient data variations for inference and prediction. Crafting a complex wine requires much more effort than a simple one, yet the essence of fermentation remains unchanged across all worthy wines (at least in English, since the Chinese translation of the term wine, “酒”, could refer to anything from Chateau Lafite to Chateau La-Gee, a distinction made clear by a promotional bottle in a Chinese supermarket, explicitly stating its mix of p percent grape juice and (1−p) percent alcohol. I, of course, was humbled by the explicit declaration, which included the exact p value).

Leaving Adelaide was heart-wrenching, for I had no time to visit any winery (or Fisher’s tomb). My schedule demanded that I rush to teach a short course on December 10 before delivering the Foreman Lecture on December 11 at the Australian Statistical Conference (ASC) in Wollongong. The five-hour short course, “Deep statistics for more rigorous and efficient data science,” nano-sizes a graduate-level course, “Deep Statistics: AI and Earth Observations for Sustainable Development,” that I have developed and taught since the spring of 2022. In case you are put off by “deep” or “AI” in the title, the course is a deep collaboration with the AI and Global Development Lab (https://liu.se/en/research/global-lab-ai) led by Adel Daoud of Linköping University, shaped by many deep contemplations with him and the course teaching assistant (and my PhD student), James Bailie.

Deep statistics studies three key environments for inference and learning from data: multisource, multiphase, and multiresolution. The short course focused on the first two, while the Foreman Lecture dived into the last: “Multi-resolution Meandering: Personalized Treatments, Individual Privacy, Machine Unlearning, and a World without Randomness.”

The journey from Wollongong to Canberra, on December 12, was significantly more relaxed, thanks to my ANU (Australian National University) host’s thoughtful arrangements. They had arranged for two students to pick James and me up; one managed the driving while the other steered the conversation, aiding our adjustment to the new time zone. Upon our arrival, it was exceptionally delightful to partake in a warm outdoor graduation celebration, with healthy hors d’œuvres (veggie beef sticks) and beverages (fermented grape juice). For James, an ANU alumnus, Canberra’s gentle breeze and Boston’s harsh freeze must be a day-and-night pairing, even without differentiating the time zones.

The following morning, James and I delivered a paired keynote speech on “Privacy, Data Privacy, and Differential Privacy,” kicking off the 2023 AI in Society Workshop by ANU’s Center for Harmonizing Machine Intelligence. It was an intellectual buffet, surveying much food for thought, with topics such as “Design Justice AI,” “Critical AI in the Art Museum,” “Chasing Storms with AI-enhanced DAS,” “Virtues of Robot Inaction,” “Human–Machine Aesthetics,” and yes, “AITA and Daily Moral Decisions.”

I asked myself AITA for having never asked or even thought about that question. Those of you who are on higher moral ground or are more introspective might be pleased to know that AITA is the subject of a serious study on daily moral decisions, thanks to the Reddit forum (you can search “reddit aita”), which has accumulated over 100,000 everyday moral dilemmas. It is a fascinating data set to dig into, with some surprises; see https://cmlab.dev/post/aita_overview/. My epiphany came when I realized that any time I consider AITA beneath me, it’s the time to ask AITA about having that thought.

Feeling morally enlightened, James and I went on to visit the Australian Bureau of Statistics (ABS), also in Canberra; for James, it was another homecoming, having worked at ABS before (to my great fortune) joining Harvard. We spent the entire December 14 meeting with ABS researchers, discussing challenges from handling data quality in automated systems to producing statistics from unlinkable data. Many ABS challenges humbled me, especially because they require more qualitative (and quality) thinking than quantitative analysis. Thanks to five years (and counting) of exposure to the qualitative world via editing Harvard Data Science Review, I have learned that the qualitative paradigm is just as conceptualizable, contemplable, and construable as the quantitative world, albeit along different dimensions (see, for example, “Why the Data Revolution Needs Qualitative Thinking”: https://hdsr.mitpress.mit.edu/pub/u9s6f22y/release/4). The ABS discussions were a field test for what I had learned, though practicing is always more arduous than preaching or reading about it.

The finale of the eight-day blitz was a presentation I relish giving to statistical agencies, “Miniaturizing Data Defect Correlation: A Versatile Strategy for Handling Non-Probability Samples.” Whereas the examples I used were US-centric (predicting the 2016 presidential election and assessing 2021 COVID vaccination uptake), the underlying issues and methodologies obviously transcend borders.

Reflections on such issues continued after an outdoor Christmas lunch, where I was introduced to the game of number toss. To end this delayed trip diary for those who love numbers, please indulge my bragging about my beginner’s luck. The game requires each player to toss a wooden baton at twelve consecutively numbered wooden pins, placed on the ground in reasonable proximity to each other. If only one pin is knocked down, the added score is the number on the pin. But if multiple pins are hit, the added score is simply the number of pins knocked down. All the knocked-down pins are stood back up where they fell, ready for the next player. Whoever first reaches a total score of exactly 50 wins the game. If you overshoot 50, then your score gets reduced to 25, and the game continues.
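For readers who prefer rules in executable form, here is a minimal Python sketch of the scoring as described above. The function and constant names are my own illustrative choices, and treating a complete miss as zero points is an assumption, since the rules as stated do not cover it.

```python
from typing import Iterable

TARGET = 50        # total needed to win, per the rules described above
RESET_SCORE = 25   # total a player falls back to after overshooting


def points_for_toss(pins_down: Iterable[int]) -> int:
    """Points for one toss, given the numbers on the pins knocked down."""
    pins = list(pins_down)
    if len(pins) == 1:
        return pins[0]   # a single pin scores the number painted on it
    # Several pins score the count of pins; a miss scores 0 (assumption).
    return len(pins)


def update_score(score: int, pins_down: Iterable[int]) -> int:
    """Running total after one toss, applying the overshoot penalty."""
    new_score = score + points_for_toss(pins_down)
    return RESET_SCORE if new_score > TARGET else new_score
```

For instance, a player sitting on 45 who knocks down pin 5 alone wins (`update_score(45, [5])` gives 50), while knocking down pin 7 alone sends the player back to 25.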

In my first try, I won the game after six single hits, which involved four different pins, with numbers that happened to be consecutive, {n, n+1, n+2, n+3}. What is n? And what were my six hits?
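If you would rather let a computer do the guessing, the puzzle is small enough for brute force. The sketch below is one possible encoding, not an official solution: it assumes no overshoot occurred along the way, so the six single hits sum to exactly 50, and it ignores the order of the hits. The function name is hypothetical.

```python
from itertools import combinations_with_replacement

NUM_PINS = 12   # pins are numbered 1 through 12
TARGET = 50     # winning total
HITS = 6        # number of single-pin hits in the winning game


def candidate_solutions():
    """Yield (n, hits) pairs consistent with the puzzle as stated."""
    for n in range(1, NUM_PINS - 2):     # n + 3 must not exceed 12
        pins = range(n, n + 4)           # four consecutive pin numbers
        # Order of hits is ignored, so enumerate sorted multisets only.
        for hits in combinations_with_replacement(pins, HITS):
            # Every one of the four pins must be hit at least once.
            if sum(hits) == TARGET and set(hits) == set(pins):
                yield n, hits


if __name__ == "__main__":
    for n, hits in candidate_solutions():
        print(n, hits)
```

Running the script prints every combination that fits, so it can confirm (or correct) a pencil-and-paper answer.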

Don’t ask me how I did afterwards, as I don’t even remember what Aussie beverages were served at the lunch. But I do have a visual memory of the departing day when the Champagne-fueled minivan took a whole day to get to the Canberra airport—driving through wineries is apparently not for the faint-livered.

(In case you feel “XL-Files” is dedicating too much space to Australia’s data science landscape, I wish to express my sincere appreciation to my gracious hosts, particularly those from ASC, for their unwavering invitations spanning three years. My calendar for 2027, however, remains notably unoccupied, save for the commemoration of my department’s 70th anniversary—a milestone that promises another extensive installment of XL-Files. Stay tuned…)