Have you ever calculated a Pearson correlation without doing a scatter plot? Long ago I vowed never to do so, but from time to time I forget my vow. I did so a couple of weeks ago, when a student and I wanted to use our data to see whether we could replicate a plot we’d seen in a published paper. Our plot showed a noisy but more or less monotonically decreasing relationship between the Pearson correlation of lagged pairs of proportions and the lag: it went from about 0.9 at lag 1 down to about 0.3 at lag 1,000 and 0.2 at lag 1,500, where it levelled out. As we saw this in six different data sets, and it was in broad agreement with the published plots, we accepted it as a replication and felt happy.

However, we had another approach to measuring the decay of that association with lag, and by that measure the association dropped monotonically to independence at lags around 100. This reduced our level of happiness. After some thought I remembered my vow, and looked at the scatter plots for lags other than lag 1, the only one we had previously examined. By lag 100 the scatter plot already looked pretty bad, yet it gave a Pearson correlation of 0.7, because there were lots of points near (0,0) and (1,1). By lag 200 there was little evidence of meaningful association, with the points lying mostly along the edges of the square connecting (0,0), (0,1), (1,1) and (1,0). We concluded that Pearson’s linear correlation was a very bad measure to be using in this context: by lag 100 the correlations were meaningless. (It might have been our data: correlations are quite susceptible to selection bias.)
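(For concreteness, here is a minimal sketch in Python of the kind of calculation we did, on entirely made-up data: a random walk squashed onto (0,1) to mimic proportions. None of the names or numbers below come from our actual analysis.)

```python
import numpy as np
import matplotlib.pyplot as plt

def lagged_pearson(p, lags):
    """Pearson correlation between p[t] and p[t + k] for each lag k."""
    return [np.corrcoef(p[:-k], p[k:])[0, 1] for k in lags]

# Made-up stand-in for proportions along a genome: a slowly wandering
# series squashed onto (0, 1), so values pile up near 0 and 1.
rng = np.random.default_rng(0)
p = 1 / (1 + np.exp(-np.cumsum(rng.normal(size=20_000)) / 50))

lags = list(range(1, 1501, 10))
r = lagged_pearson(p, lags)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].plot(lags, r)                        # the seductive decay curve
axes[0].set(xlabel="lag", ylabel="Pearson r")
for ax, k in zip(axes[1:], (100, 200)):      # the scatter plots we skipped
    ax.scatter(p[:-k], p[k:], s=2, alpha=0.2)
    ax.set(title=f"lag {k}", xlabel="p[t]", ylabel=f"p[t+{k}]")
plt.tight_layout()
plt.show()
```

On data like these, the correlation-versus-lag curve can look perfectly respectable long after the scatter plots have stopped showing anything one would want to summarize with a line.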
This little experience resonated with two themes I’ve been reading about recently, quite independently of the project involving lagged correlations. One was the strong aversion Tukey and others had to calculating correlation coefficients, something of which I should have been aware long ago. C. P. Winsor started the Society for the Suppression of the Correlation Coefficient, whose guiding principle was “that most correlation coefficients should never be calculated.” Tukey was a member, one who frequently held “that correlation coefficients are justified in two and only two circumstances, when they are regression coefficients, or when the measurement of one or both variables on a determinate scale is hopeless.” It’s not clear how many other members the Society had, but Fisher was of the same mind, having written more than a decade earlier that “regression coefficients are of interest and scientific importance in many classes of data where the correlation coefficient, if used at all, is an artificial concept of no real utility.”
Of course Pearson’s r measures linear association, it is not robust, correlation is not causation, and in many, perhaps most, contexts regression coefficients are more meaningful than correlations. But what if we are genuinely interested in the association between two variables, and not at all in the regression of one on the other? For example, what if the two variables are the same quantity, separated in space or time by some lag, even when the measurement is on a determinate scale? Our interest in the association between certain proportions along the genome was exactly of this kind. I’d go further: we expected the association to be caused by something, and were seeking some indication of the spatial scale over which this cause might be operating. We might even be interested in periodicity in the lagged association measure, and be led to take the Fourier transform of what we calculated; one of the papers we were following did just that. Should we be discouraged from calculating lagged correlation coefficients, or some other measure of association? I certainly hope not. But we must look at our scatter plots!
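(To make the Fourier idea concrete, here is a small sketch on a fabricated lag-correlation curve with a built-in period of 200 lag units; it stands in for whatever lagged association measure one actually computed.)

```python
import numpy as np

# Fabricated correlation-versus-lag curve: a slow decay plus a
# period-200 oscillation, plus a little noise.
rng = np.random.default_rng(1)
lags = np.arange(1, 1601)
r = (0.4 * np.exp(-lags / 800)
     + 0.2 * np.cos(2 * np.pi * lags / 200)
     + 0.02 * rng.normal(size=lags.size))

spectrum = np.abs(np.fft.rfft(r - r.mean()))   # magnitude spectrum
freqs = np.fft.rfftfreq(lags.size, d=1.0)      # cycles per unit lag
k = 1 + spectrum[1:].argmax()                  # skip the zero-frequency bin
print(f"dominant period ~ {1 / freqs[k]:.0f} lag units")  # prints 200
```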
What are our options for measuring association if it’s not linear? Over the years a large number of measures have appeared, including measures based on ranks (Spearman’s ρ, Kendall’s τ), nonparametric measures (Hoeffding’s D and, following Mosteller, Blomqvist’s q), mutual information (following Shannon, Linfoot’s r1, and many others), measures based on principal curves (Delicado & Smrekar’s covGC), and, more recently, distance and Brownian correlation (Székely & Rizzo’s dCor and corW).
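(To see why these alternatives earn their keep, here is a toy comparison on a relationship that is strong but neither linear nor monotone. The distance correlation below is a naive O(n²) rendering of Székely & Rizzo’s sample formula, written just for this sketch.)

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def dcor(x, y):
    """Distance correlation (Székely & Rizzo), naive O(n^2) version."""
    def centered(a):
        d = np.abs(a[:, None] - a[None, :])            # pairwise distances
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                             # squared sample dCov
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 500)
y = x**2 + 0.05 * rng.normal(size=x.size)              # strong, nonlinear

print(f"Pearson  {pearsonr(x, y)[0]:+.2f}")            # near 0
print(f"Spearman {spearmanr(x, y)[0]:+.2f}")           # near 0: not monotone
print(f"Kendall  {kendalltau(x, y)[0]:+.2f}")          # near 0
print(f"dCor     {dcor(x, y):+.2f}")                   # clearly positive
```

Rank-based measures rescue us from non-linearity only when the relationship is monotone; a parabola defeats all three classical coefficients, while distance correlation picks it up.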
In parallel with these newer ideas, several authors going back to Hirschfeld (including Fisher, Maung, Gebelein and Rényi) have used the maximal linear correlation over nonlinear transformations of the original variables. Very recently, a new measure of association building on mutual information (Reshef et al.’s MIC) was proposed. Its authors claimed that it gives a meaningful measure for a wide range of nonlinear relationships, (almost) independently of the nature of the nonlinearity.
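(For binned or discrete data, the maximal correlation of Hirschfeld, Gebelein and Rényi is computable in closed form: it is the second-largest singular value of the matrix with entries p_ij/√(p_i· p_·j), the classical correspondence-analysis result. Below is a rough sketch; the binning is an arbitrary choice of mine.)

```python
import numpy as np

def maximal_correlation(x, y, n_bins=10):
    """Binned estimate of the Hirschfeld-Gebelein-Renyi maximal correlation.

    For discrete variables, the maximal correlation over all score
    functions f, g is the second-largest singular value of the matrix
    with entries p_ij / sqrt(p_i. * p_.j); continuous data are simply
    binned first, so the bin count is a tuning choice.
    """
    counts, _, _ = np.histogram2d(x, y, bins=n_bins)
    p = counts / counts.sum()                # joint cell probabilities
    px, py = p.sum(axis=1), p.sum(axis=0)    # marginals of the binned data
    b = (p[np.ix_(px > 0, py > 0)]           # drop empty bins
         / np.sqrt(np.outer(px[px > 0], py[py > 0])))
    s = np.linalg.svd(b, compute_uv=False)
    return s[1]                              # s[0] = 1 is the trivial pair

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 5_000)
print(maximal_correlation(x, x**2 + 0.05 * rng.normal(size=x.size)))  # near 1
print(maximal_correlation(x, rng.normal(size=x.size)))                # small
```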
In 1954 Tukey asked, “Does anyone know when the correlation coefficient is useful, as opposed to when it is used? What substitutes are better for which purposes?”
Can we answer him yet?