The Student Puzzle Corner contains one or two problems in statistics or probability. Sometimes, solving the problems may require a literature search. Current student members of the IMS are invited to submit solutions electronically (to bulletin@imstat.org with subject “Student Puzzle Corner”). Deadline June 15, 2014. The names and affiliations of (up to)the first 10 student members to submit correct solutions, and the answer(s) to the problem(s), will be published in the next issue of the Bulletin. The Editor’s decision is final.

Student Puzzle Corner 4

Suppose a parameter $μ$ was measured at two different laboratories, of which one is more renowned and reliable than the other. Formally, $X ~ N(μ, 1), Y ~ N(μ, σ^2)$, where $X, Y$ are independent, and $σ^2≥1$. Find, explicitly, a 95% confidence interval of finite length for $σ^2$.

It seems a little odd at first that one can estimate the variance of the second laboratory with only one observation from the second laboratory. In some sense, a more basic question is how will you estimate $μ$ in such a case, or what are the maximum likelihood estimates of $μ, σ^2$, but they are not being asked here. Just to provide a little context, combining $P$-values from different experiments is a widely studied problem. Fisher himself did classic work on that. Combining likelihoods from different experiments and the corresponding asymptotic story is a less studied problem.

Solution to Student Puzzle 3

Anirban DasGupta, IMS Bulletin Editor, explains:

In the puzzle of the last issue, we asked what is the average Euclidean distance between two random points on the surface of the Earth. We can easily calculate the Euclidean as well as the geodesic distance between two fixed points. For example, the geodesic distance between New York and Beijing is about 7175 miles, and the Euclidean distance is about 6235 miles, a saving of about $13\%$. A nearly hilarious question is how much time you’d have saved flying from New York to Beijing if we knew how to fly through the Earth’s interior. The cruising speed of a Boeing 777 is about 580 mph. So, a back-of-the-envelope calculation gives that a nonstop flight along the geodesic would take about 12 hours and 20 minutes, while going through the Earth’s interior at the same speed would take 10 hours and 45 minutes; so you save just about an hour and a half. For traveling from London to Sydney, you’d save 4 hours and 55 minutes; Singapore to Bangkok, you save almost nothing. But what about two random points? We will answer this question below, and you might be surprised that the average savings is just about $11.5\%$.

We may as well solve the problem for the case of $n$ dimensions for a general $n \geq 2$. Thus, let $\bf{X}, \bf{Y}$ be two random independent picks from the surface of an $n$-dimensional ball with radius $\rho $ and suppose we want to find $E(||\bf{X}-\bf{Y}||)$, where $||\cdot ||$ denotes Euclidean norm. Notice that the center of the ball has no effect on the problem and the radius has only a scaling effect; so we may as well take the center to be the origin and the radius to be one. The expected value of the square of the Euclidean distance is a much easier problem, and indeed, you can show that in any number of dimensions, for the unit ball, $E(||\bf{X}-\bf{Y}||^2) = 2$; Cauchy-Schwarz gives us a quick bound: we must have $E(||\bf{X}-\bf{Y}||) \leq \sqrt{2}$. We will see how the $\sqrt{2}$ bound gets almost attained when the number of dimensions $n$ gets large.

Now, $||\bf{X}-\bf{Y}||^2 = ||\bf{X}||^2 + ||\bf{Y}||^2 -2\bf{X}’\bf{Y} = 1 + 1 – 2\bf{X}’\bf{Y} = 2\bigg [1 – \bf{X}’\bf{Y}\bigg ]$. Here, as usual, $\bf{X}’\bf{Y}$ denotes the Euclidean inner product. Therefore, $E(||\bf{X}-\bf{Y}||) = \sqrt{2}E\bigg [\sqrt{1 – \bf{X}’\bf{Y}}\bigg ]$. Now, condition with respect to $\bf{Y}$.

The conditional expectation $E\bigg [\sqrt{1 – \bf{X}’\bf{Y}}\,\vert \bf{Y} = \bf{y}\bigg ]$ is independent of the unit vector $\bf{y}$ and hence, the unconditional expectation $E\bigg [\sqrt{1 – \bf{X}’\bf{Y}}\bigg ] = E\bigg [\sqrt{1 – \bf{X}’\bf{e_1}}\bigg ] = E\bigg [\sqrt{1 – X_1}\bigg ]$, where $\bf{e_1}$ is the first standard unit vector. Use now the fact that if $\bf{X}$ is a random pick from the surface of an $n$-dimensional ball of radius $1$, then its first coordinate $X_1$ has the density $c(1-x_1^2)^{(n-3)/2}, – 1 \leq x_1 \leq 1$, where the normalizing constant $c = c_n = \frac{\Gamma (\frac{n}{2})} {\sqrt{\pi }\Gamma (\frac{n-1}{2})}$.

Thus, $E\bigg [\sqrt{1 – \bf{X}’\bf{Y}}\bigg ] = c\,\int_{-1}^1 \sqrt{1-x}(1-x^2)^{(n-3)/2}dx = c\,\int_{-1}^1 (1-x)^{\frac{n}{2}-1}(1+x)^{\frac{n-1}{2}-1}dx$.

This becomes a Beta integral by changing variables; just substitute $z = \frac{1+x}{2}$.
On a little algebra, we eventually get
$E(||\bf{X} – \bf{Y}||) = \sqrt{2}\,c\,2^{n-\frac{3}{2}}\,B(\frac{n-1}{2},\frac{n}{2}) = \frac{2^{n-1}[\Gamma (\frac{n}{2})]^2}{\sqrt{\pi}\,\Gamma (n – \frac{1}{2})}.$ This is the answer when the basic ball has radius $1$; if it has radius $\rho $, then multiply this by $\rho $.

For the specific case of the Earth, modeled as a perfect sphere, we get from this $E(||\bf{X} – \bf{Y}||) = 5280$ (miles).

What happens asymptotically, i.e., in very high dimensions? That is easy enough to figure out as well; we need nothing more than simple tools. Simply use the standard facts that $\Gamma (x) =e^{-x}\,x^{x-1/2}\,\sqrt{2\pi }\bigg [1 + \frac{1}{12x} + \frac{1}{288x^2} + \cdots \bigg ]$ as $x \to \infty $ and that for $|x| < 1$, $\log (1-x) = -x - x^2/2 -x^3/3 - \cdots $. Then, our formula $E\bigg [||\bf{X} - \bf{Y}||\bigg ] = \frac{2^{n-1}[\Gamma (\frac{n}{2})]^2}{\sqrt{\pi}\,\Gamma (n - \frac{1}{2})}$ for the case of the unit ball results in $E\bigg [||\bf{X} - \bf{Y}||\bigg ] = \sqrt{2}\,\bigg [1 - \frac{7}{24n} - \frac{95}{1152n^2} + O(n^{-3}) \bigg ]$. This means, that in high dimensions, the Euclidean distance between two random points on the surface would be close to as large as it can possibly be, namely $\sqrt{2}$. An interesting question is exactly how much smaller is the Euclidean distance between two points than the geodesic distance between them. This will of course depend on the specific two points. What happens if the two points are chosen at random? Let's use some notation to get things very clear. Let $\bf{X}, \bf{Y}$ be two independent random picks from the surface of the $n$-dimensional unit ball, and let $D = D_n$ and $G = G_n$ be the Euclidean and the geodesic distance between them. Then, using spherical trigonometry, a now forgotten subject, you can show that $E\bigg (\frac{D}{G}\bigg )$ has an integral representation. Let us see what it is; it is given by $a_n\,\int_{0}^\pi \frac{2\sin (\frac{\theta }{2})}{\theta } (\sin \theta )^{n-2}d\theta $, where $a_n = \frac{2^{n-2}[\Gamma (\frac{n}{2})]^2}{\pi (n-2)!}$. You can evaluate it numerically, or in closed form by using certain special functions, or you can derive an asymptotic expansion for it. The asymptotic expansion takes a lot more work now, but one is $E\bigg (\frac{D}{G}\bigg ) = \frac{2\sqrt{2}}{\pi } + \frac{\pi ^2-8\pi + 32}{2\sqrt{2}\pi ^3n} + O(n^{-2})$. Now, $\frac{2\sqrt{2}}{\pi }$ is about $.900316$. So, in very high dimensions, the Euclidean distance is just about $10\%$ smaller than the geodesic distance on the average. How about for the Earth, i.e., $n = 3$? Then, $E\bigg (\frac{D}{G}\bigg ) = .88451$. Not much savings!