Contributing Editor Anirban DasGupta writes:
After our last puzzle with probabilities on hyperspheres [see solution below], it is now time to turn our thoughts again to something in statistics. This time it’s a problem on epidemiology. Anirban DasGupta deliberately leaves this problem incompletely formulated. A correct solution involves identifying all the parameters, and then formulating and answering the question in terms of the model parameters. Be careful: the values of your parameters may be known!

In a certain state in a country, each of m families has k members. We assume m and k to be given to us. Suppose a total of X residents of the state are found to have contracted an infectious viral disease; we assume X to be observable. Suppose these X infected residents come from a total of Y different families. Thus, Y denotes the number of families in the state affected by this virus; data on Y, unfortunately, is not available.
(a) Find a closed form expression for E(Y).
(b) Hence, or otherwise, provide a statistical estimate for Y.
(c) This is more complex: write an expression for the probability mass function of Y.

Solution to Puzzle 28

Contributing Editor Anirban DasGupta writes on the previous problem, which was about probabilities on hyperspheres:
Congratulations to Andrew Thomas, who is a PhD student in the Department of Statistics at Purdue University. Andrew sent a very detailed, rigorous, and well-written solution.

Suppressing the dimension $n$, let us denote the Euclidean distance between $P$ and $Q$ by $X$. Clearly, $0 \leq X \leq 2$, because $P$ and $Q$ are points on the unit hypersphere (in $n$ dimensions). The density of $X$ can be written down as
$f(x) = \frac{\Gamma(n/2)}{2^{n-3}\,\sqrt{\pi }\,\Gamma((n-1)/2)}\,x^{n-2}\,(4-x^2)^{\frac{n-3}{2}}, 0 \leq x \leq 2.$
Fortunately, one can integrate in closed form and get
$c_1 = E(X) = \frac{2^{n-1}\,[\Gamma(n/2)]^2}{\sqrt{\pi}\,\Gamma(n-1/2)}.$
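As a sanity check on the two formulas above, here is a minimal numerical sketch. It integrates the density with a simple midpoint rule and compares the resulting mean with the closed form. The function names and the choice of $n = 3, 4, 10$ are illustrative assumptions, not part of the submitted solution.

```python
import math

def chord_density(x, n):
    """Density of the distance X between two uniform points on the unit sphere in R^n."""
    log_const = (math.lgamma(n / 2) - (n - 3) * math.log(2)
                 - 0.5 * math.log(math.pi) - math.lgamma((n - 1) / 2))
    return math.exp(log_const) * x ** (n - 2) * (4 - x * x) ** ((n - 3) / 2)

def mean_closed_form(n):
    """E(X) = 2^(n-1) Gamma(n/2)^2 / (sqrt(pi) Gamma(n - 1/2)), computed on the log scale."""
    return math.exp((n - 1) * math.log(2) + 2 * math.lgamma(n / 2)
                    - 0.5 * math.log(math.pi) - math.lgamma(n - 0.5))

def midpoint_integral(g, a, b, steps=20_000):
    """Midpoint rule; the endpoints are never evaluated, so 4 - x^2 stays positive."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

for n in (3, 4, 10):
    total = midpoint_integral(lambda x: chord_density(x, n), 0.0, 2.0)
    mean = midpoint_integral(lambda x: x * chord_density(x, n), 0.0, 2.0)
    # Columns: n, integral of f (should be close to 1), numerical mean, closed-form mean
    print(n, round(total, 6), round(mean, 6), round(mean_closed_form(n), 6))
```

The case $n = 2$ is left out of the loop only because the density is then unbounded near $x = 2$, where a plain midpoint rule converges slowly.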
As $n$ increases, a plot of the density $f(x)$ starts to look like a spike around a single point; let us see if we can work out exactly where this spike is located.

Recall that
$E(X) = c_1 = \frac{2^{n-1}\,[\Gamma(n/2)]^2}{\sqrt{\pi }\,\Gamma(n-1/2)}.$

Hence, by using Stirling’s approximation:
$\Gamma(z) = e^{-z}\,z^{z-1/2}\,\sqrt{2\pi }\,(1+o(1)),\,\,\,z \to \infty ,$
$\lim_{n \to \infty }\, E(X) = \sqrt{2}$. A higher order expansion can be derived too by using Stirling’s series for $\Gamma(z)$. The limiting value of $\sqrt{2}$ for $E(X)$ says that, in very high dimensions, two random points on the surface of the hypersphere behave like the endpoints of two nearly orthogonal radii, which lie at distance $\sqrt{2}$ from each other. For the special case of $n = 3$, the formula gives $E(X) = \frac{4}{3}$.
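
One way to organize the limit computation is sketched here; it is an illustration, not necessarily the route taken in the submitted solution. By the Gamma duplication formula, $\Gamma(n/2)\,\Gamma((n+1)/2) = 2^{1-n}\,\sqrt{\pi }\,\Gamma(n)$, so the closed form for $E(X)$ can be rewritten as
$E(X) = \frac{\Gamma(n)}{\Gamma(n-1/2)}\cdot \frac{\Gamma(n/2)}{\Gamma((n+1)/2)}.$
Stirling's approximation implies $\Gamma(z+a)/\Gamma(z+b) = z^{a-b}\,(1+o(1))$ as $z \to \infty$ for fixed $a, b$. Taking $z = n$, $a = 0$, $b = -1/2$ gives $\Gamma(n)/\Gamma(n-1/2) = n^{1/2}\,(1+o(1))$, and taking $z = n/2$, $a = 0$, $b = 1/2$ gives $\Gamma(n/2)/\Gamma((n+1)/2) = (n/2)^{-1/2}\,(1+o(1))$. Hence
$E(X) = \sqrt{n}\,\sqrt{2/n}\,(1+o(1)) \to \sqrt{2}.$
As a check, the rewritten form at $n = 3$ is $\Gamma(3)\,\Gamma(3/2)/[\Gamma(5/2)\,\Gamma(2)] = \frac{4}{3}$, in agreement with the value quoted above.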

Next, again using the expression for the density, we get that the $2m$-th moment of $X$ in four dimensions is
$c_{2m,4} = \frac{4^{m+1}\,\Gamma(m+3/2)}{\sqrt{\pi }\,\Gamma(m+3)},$
which on using the Gamma duplication formula simplifies to
$c_{2m,4} = C_{m+1},$
the $(m+1)$th Catalan number. This is a very interesting coincidence!
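
The coincidence is easy to confirm numerically. The following short sketch evaluates the Gamma-function expression for the $2m$-th moment and compares it with the Catalan number $C_{m+1} = \binom{2m+2}{m+1}/(m+2)$. The function names are illustrative choices.

```python
import math

def even_moment_4d(m):
    """E(X^(2m)) for n = 4: 4^(m+1) Gamma(m + 3/2) / (sqrt(pi) Gamma(m + 3))."""
    log_val = ((m + 1) * math.log(4) + math.lgamma(m + 1.5)
               - 0.5 * math.log(math.pi) - math.lgamma(m + 3))
    return math.exp(log_val)

def catalan(k):
    """k-th Catalan number, C_k = binom(2k, k) / (k + 1), in exact integer arithmetic."""
    return math.comb(2 * k, k) // (k + 1)

for m in range(6):
    # The two values on each row should agree up to floating-point rounding.
    print(m, even_moment_4d(m), catalan(m + 1))
```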

If we instead compute the expected geodesic distance between $P$ and $Q$ in three dimensions, we will get an expected value larger than $\frac{4}{3}$, because the arc joining two points on the sphere is never shorter than the chord. You can find the expected geodesic distance exactly by a trigonometric integration (the final answer will involve $\pi$). You may find it interesting to generate two random points on the three-dimensional unit sphere (for example, by normalizing vectors of independent standard normals) and simulate the average geodesic distance. See if you can guess the exact value of the expected geodesic distance!
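
A minimal simulation sketch of this experiment follows. It uses only the Python standard library. The function names, the number of trials, and the seed are arbitrary illustrative choices. The chord average is printed alongside the geodesic one so that it can be compared with $\frac{4}{3}$.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Uniform point on the unit sphere in R^dim: normalize i.i.d. standard normals."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def simulate_mean_distances(dim=3, trials=100_000, seed=0):
    """Monte Carlo averages of the chord and geodesic distances between two random points."""
    rng = random.Random(seed)
    chord_sum = 0.0
    geodesic_sum = 0.0
    for _ in range(trials):
        p = random_unit_vector(dim, rng)
        q = random_unit_vector(dim, rng)
        # Clamp the inner product to [-1, 1] to guard acos against rounding error.
        dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
        geodesic_sum += math.acos(dot)   # arc length on the unit sphere
        chord_sum += math.dist(p, q)     # straight-line (Euclidean) distance
    return chord_sum / trials, geodesic_sum / trials

chord_mean, geodesic_mean = simulate_mean_distances()
print("average chord distance:   ", chord_mean)
print("average geodesic distance:", geodesic_mean)
```

Increasing `trials` tightens the estimate, which may help in guessing the exact value.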