Talk:Correlation and dependence

From Wikipedia, the free encyclopedia
WikiProject Mathematics (Rated B-Class, Top-priority; Field: Probability and statistics)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
One of the 500 most frequently viewed mathematics articles.

Please update this rating as the article progresses, or if the rating is inaccurate.

WikiProject Statistics (Rated B-Class, Top-importance)
This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

Pseudocode

I suggest removing the sections with pseudo code. Wikipedia is not the place for computing tips and tricks (there's some rule about Wikipedia not being a 'how to' site). If there are important algorithmic considerations, they should be presented more formally using correct numerical analysis terminology. And surely, there's no need to show the same algorithm in two languages. —G716 <T·C> 11:10, 12 October 2008 (UTC)

I suggest NOT removing it; it is quite convenient to skip all these "important" formulas and get to the point. However, there seems to be an extra division by N there:
 pop_sd_x = sqrt( sum_sq_x / N )
 pop_sd_y = sqrt( sum_sq_y / N )
 cov_x_y = sum_coproduct / N
 correlation = cov_x_y / (pop_sd_x * pop_sd_y)

seems to be the same as

 pop_sd_x = sqrt( sum_sq_x )
 pop_sd_y = sqrt( sum_sq_y )
 cov_x_y = sum_coproduct 
 correlation = cov_x_y / (pop_sd_x * pop_sd_y)

92.112.247.110 (talk) 17:04, 9 November 2008 (UTC)

They are the same apart from numerical accuracy questions. But the former might be preferable if there were later other uses for the standard deviations and covariance, such as writing them out. Melcombe (talk) 10:34, 18 December 2008 (UTC)
I agree with removing the pseudo code, given that this is not a text book. Melcombe (talk) 10:34, 18 December 2008 (UTC)
Look, the argument is indeed rational, but deleting information without finding a new home for it is evil, whether or not it fits with the guidelines! Could you perhaps add it to the Ada programming wikibook and/or one of the Python wikibooks? --mcld (talk) 14:32, 19 December 2008 (UTC)
Great idea. I know nothing about wikibooks - could you either move the info yourself or leave a note on the wikibooks talk page to get someone there to help. —G716 <T·C> 15:42, 19 December 2008 (UTC)
I agree with removing the Ada source, but the methodology for computing the correlation in one fast, accurate, and stable pass is very important information.
Nightbit (talk) 03:12, 20 December 2008 (UTC)


I'm the guy who put the pseudocode there in the first place, because so many people come here looking for how to compute correlation. I agree with Mcld ... don't just delete it! It is fine for the pseudocode to be moved elsewhere with just a link to it, but it is important and hard-to-find material.

A language-specific site is not really an acceptably general home for it -- this is pseudocode, not an implementation how-to. In addition, if there's a nontrivial risk of the link going stale, then I think the code ought to stay here, where it can actually be useful to people.

For reference, researching stable one-pass algorithms and distilling my findings into that pseudocode took me several hours (and I have a PhD in mathematics). I hope and believe it has saved many people many hours of work.

Frankly, G716 has the right idea: someone should properly write up the numerical analysis, providing the appropriate context for the pseudocode snippet. I simply lack the time to do it myself. But removing the pseudocode because the entry lacks the contextual information seems hasty and wrong. Brianboonstra (talk) 18:37, 6 February 2009 (UTC)

You need to reference the sources of your research though (don't worry about formatting). Else you're asking any users of the code to just take it on trust, or someone else to repeat your research. Qwfp (talk) 20:00, 6 February 2009 (UTC)
A reference for this code is certainly required. Further description of the code is also needed. For example, why is a one-pass algorithm better than a two-pass? What is the trade-off in accuracy? How much is the speed increased? Will this speed increase really improve someone's application overall (i.e. is the calculation of r likely to be a bottleneck)? Darkroll (talk) 02:55, 10 February 2009 (UTC)

The pseudocode and related text which was in the article (as of 22 Feb 2009) is reproduced below.

Computing correlation accurately in a single pass

The following algorithm (in pseudocode) will calculate Pearson correlation with good numerical stability[citation needed]. Note that this is not an exact computation: at each iteration only the running (updated) mean is used, not the final mean over all the data, and the delta is then squared, so this error is not corrected by the sweep factor.

 sum_sq_x = 0
 sum_sq_y = 0
 sum_coproduct = 0
 mean_x = x[1]
 mean_y = y[1]
 for i in 2 to N:
     sweep = (i - 1.0) / i
     delta_x = x[i] - mean_x
     delta_y = y[i] - mean_y
     sum_sq_x += delta_x * delta_x * sweep
     sum_sq_y += delta_y * delta_y * sweep
     sum_coproduct += delta_x * delta_y * sweep
     mean_x += delta_x / i
     mean_y += delta_y / i 
 pop_sd_x = sqrt( sum_sq_x )
 pop_sd_y = sqrt( sum_sq_y )
 cov_x_y = sum_coproduct
 correlation = cov_x_y / (pop_sd_x * pop_sd_y)
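For anyone who wants to experiment with the algorithm, here is a direct Python transcription of the pseudocode above (my own sketch, not part of the removed article text):

```python
import math

def pearson_one_pass(xs, ys):
    """One-pass Pearson correlation, Welford-style, as in the pseudocode above."""
    n = len(xs)
    mean_x, mean_y = xs[0], ys[0]
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    for i in range(2, n + 1):              # i is 1-based, matching the pseudocode
        sweep = (i - 1.0) / i
        delta_x = xs[i - 1] - mean_x
        delta_y = ys[i - 1] - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / i
        mean_y += delta_y / i
    # The N divisors cancel, so they are omitted, as in the final lines above
    return sum_coproduct / math.sqrt(sum_sq_x * sum_sq_y)
```

For perfectly linear data such as y = 2x it returns exactly 1, and for perfectly anti-correlated data exactly -1, which is a quick sanity check of the transcription.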

I have removed this to this discussion page so that it is still available to anyone who wants to see it, but is not included in the article where the general consensus is that it does not belong on this page.

The correct calculation does not require subtracting the mean from each observation while passing through the data: even though this is one way to calculate the correlation, it is not the computationally simplest way, and it would require the means to be found before calculating the deviations from them (thus requiring a second pass through the data to do it accurately using this approach). Rather, the sum of products

 \sum x_i y_i and the sums of squares \sum x_i^2 and \sum y_i^2

are collected, along with the sums

 \sum x_i and \sum y_i,

then allowance is made for the means by subtraction at the end of the calculation, i.e. once the means are known, using the sample correlation coefficient formula given in the article:


r_{xy} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n-1)\, s_x s_y} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - (\sum x_i)^2}\;\sqrt{n\sum y_i^2 - (\sum y_i)^2}}.

I have explained this on this discussion page, hoping that it will satisfy those who thought the code was a useful part of the article. The code is still available, but you are strongly recommended not to use it. Instead, do the calculation using the above formula. Alternatively you can use the formula


r_{xy}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(n-1) s_x s_y},

but if you do, you need to calculate the means before you can start, so although this formula is easy to understand, it is slightly less easy to use in practical calculations.
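The single-pass calculation described above is easy to sketch in Python (my own illustration of the formula, not from the article; note that this expanded form can lose precision through cancellation when the means are large relative to the spread of the data, which is the usual argument for the one-pass updating algorithm):

```python
import math

def pearson_from_sums(xs, ys):
    # One pass: accumulate n, sum x, sum y, sum x^2, sum y^2, sum xy
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Allowance for the means is made at the end, per the formula above
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den
```

As a quick check, perfectly linear data gives exactly 1 and reversed data exactly -1.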

Hey, the proposed formula is wrong: where did you get this (n-1) in the divisor? —Preceding unsigned comment added by 178.94.5.109 (talk) 00:45, 27 December 2010 (UTC)

—Preceding unsigned comment added by SciberDoc (talkcontribs) 12:39, 22 February 2009 (UTC)

I have moved this section to the Pearson correlation page as the algorithm is specific to the Pearson correlation. Skbkekas (talk) 02:40, 5 June 2009 (UTC)


I have a question about the pseudocode. The previous version:

 pop_sd_x = sqrt( sum_sq_x / N )
 pop_sd_y = sqrt( sum_sq_y / N )
 cov_x_y = sum_coproduct / N
 correlation = cov_x_y / (pop_sd_x * pop_sd_y)

The current version

 pop_sd_x = sqrt( sum_sq_x )
 pop_sd_y = sqrt( sum_sq_y )
 cov_x_y = sum_coproduct 
 correlation = cov_x_y / (pop_sd_x * pop_sd_y)

And someone has said that they are the same, but I am missing one N for them to be the same. Do you know which is the correct one? —Preceding unsigned comment added by 195.75.244.91 (talk) 14:57, 5 October 2009 (UTC)

They are the same. The numerators differ by a factor of N, while the two factors in the denominator each differ by a factor of \sqrt{N}, so the Ns cancel. JamesBWatson (talk) 11:09, 6 October 2009 (UTC)
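The cancellation is easy to check numerically. Using accumulator values as they would stand after the loop for the hypothetical data x = (1, 2, 3), y = (2, 4, 6) (any data would do):

```python
import math

# Accumulator values after the loop for x = (1, 2, 3), y = (2, 4, 6)
sum_sq_x, sum_sq_y, sum_coproduct, N = 2.0, 8.0, 4.0, 3

# Previous version, with the division by N:
r1 = (sum_coproduct / N) / (math.sqrt(sum_sq_x / N) * math.sqrt(sum_sq_y / N))

# Current version, without it:
r2 = sum_coproduct / (math.sqrt(sum_sq_x) * math.sqrt(sum_sq_y))

print(r1, r2)  # identical: the N in the numerator cancels sqrt(N)*sqrt(N) below
```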

Correlation coefficient

Correlation coefficient currently directs here. Should it direct to Coefficient of determination (i.e. r-squared) instead? (note: I'm cross-listing this post at Talk:Coefficient of determination.) rʨanaɢ talk/contribs 03:30, 22 September 2009 (UTC)

No, it shouldn't. This is the main article on correlation, and defines the correlation coefficient. The article on coefficient of determination mentions the correlation coefficient, but does not define it; in fact it rather presupposes a knowledge of the correlation coefficient. What is more this is as it should be, both because correlation coefficient is a much more widely known concept than coefficient of determination, and because it makes more sense to redirect upwards to a more general topic than to redirect sideways to a different concept at the same level. JamesBWatson (talk) 13:05, 27 September 2009 (UTC)

Numerical instability

You need to be a little bit careful running around claiming algorithms are numerically unstable. Instability depends on the range of numbers used. The one pass algorithm is stable if the full calculation can be done using integer arithmetic without intermediate results overflowing. Charles Esson (talk) 08:11, 8 October 2009 (UTC)

Non-linear correlation

(JamesBWatson left this comment on my user page; since I believe it's of general interest, I'm taking the liberty of moving it here.)

I see that you reverted an edit to the article Correlation with the edit summary "Undid revision 320015831 by JamesBWatson (talk) in stats, "correlation" always refers to linear -- not any -- relationship". I have restored the edit, together with references to three textbooks which use the expression "nonlinear correlation". I could have given many more references; for example, here are just a few papers with the expression in their titles:

  1. A. Mitropolsky, “On the multiple non-linear correlation equations”, Izv. Akad. Nauk SSSR Ser. Mat., 3:4 (1939), 399–406
  2. Non-linear canonical correlation analysis with a simulated annealing solution, Sheng G. Shi, Winson Taam (Journal of Applied Statistics, Volume 19, Issue 1 1992 , pages 155 - 165)
  3. Non-Linear Correlation Discovery-Based Technique in Data Mining, Liu Bo (Intelligent Information Technology Application Workshops, 2008. IITAW '08)
  4. Ravi K. Sheth (UC Berkeley), Bhuvnesh Jain (MPA-garching), The non-linear correlation function and the shapes of virialized halos.

Google Scholar gives 2790 citations for "non-linear correlation" and 3650 for "nonlinear correlation". I assure you, "correlation" usually, but by no means always, refers to linear correlation. JamesBWatson (talk) 15:37, 31 October 2009 (UTC)

Thanks for the references, JamesBWatson. I guess the generalization of correlation coefficient for both linear and non-linear associations would require rewriting, e.g., Correlation#Correlation and linearity. Furthermore, we need to define and show how to calculate it. I can see how it could be obtained as \rho_{xy} = \sigma_{xy}^2 / (\sigma_x \sigma_y), where the variances \sigma_x^2, \sigma_y^2 and the covariance \sigma_{xy} come from a non-simple linear regression (for simple vs. non-simple linear regression, see Regression analysis#Linear regression). Is that what you mean? 128.138.43.211 (talk) 06:34, 1 November 2009 (UTC)
First, unless "non-linear correlation" is precisely defined, I see no point in just mentioning it in this article. Secondly, if the interpretation above (in terms of variance and covariances) is correct, wouldn't such a non-linearity extend to the PMCC as well? 128.138.43.211 (talk) 05:20, 4 November 2009 (UTC)

Section merger proposal

I disagree with the proposal to move material from this article to the Pearson correlation article. In fact, this issue has been discussed quite a bit in the past, and the consensus was to use the Pearson correlation article for issues related to linear correlation measures of the product-moment type, while the correlation article could cover topics related to pairwise association measures in general. The section on "sensitivity to the data distribution" applies specifically to the Pearson correlation measure. Some parts of it may be more general, but not most of it. I was the person who originally created this section, in both articles. I later came to feel that the section in the correlation article needed to be merged to Pearson correlation, not the other way around. I just hadn't had a chance to do it yet. The proposed merger takes us in the wrong direction. Skbkekas (talk) 03:54, 2 November 2009 (UTC)

I don't see that it does any harm to keep both sections, but I certainly agree with Skbkekas that if there is to be a merge it should be from Correlation to Pearson correlation, not the other way around. JamesBWatson (talk) 12:10, 2 November 2009 (UTC)

I have changed the merge templates on both articles to indicate moving material to Pearson correlation, with the discussion pointer still pointing here. I have reverted the move already made of some stuff, and I think much more should be moved. If this direction of change is what is wanted, it may be best to rename this article to something like "correlation and dependence" to give a better indication of its scope. Melcombe (talk) 16:58, 2 November 2009 (UTC)

If Skbkekas is right in saying that previous discussion has resulted in a consensus for keeping both sections then I don't see that a merger is justified. I should also like to put it on record that I agree that it is better to keep both of them. JamesBWatson (talk) 08:32, 4 November 2009 (UTC)
But the question is how much and what material should be in both. The use of the "Main" tag to point to the Pearson correlation article would mean that what should be here is only a summary, plus whatever other material is relevant to the main topic of the current article. Do we agree that the topic should be "pairwise association measures in general"? I think that topic deserves an article of its own; that is the way the article starts, and the direction in which it was being pushed. But there are a number of problems with the articles taken together that can hopefully be reduced by an appropriate separation of topics. For example, in the case of the product-moment correlation there are three separate concepts: the population value, the "raw" estimate obtained by the usual formula, and other estimates of correlation derived from appropriate non-normal joint distributions. It is hardly made clear which of these is being thought of for the various points being discussed. Melcombe (talk) 10:32, 4 November 2009 (UTC)
I don't think I said that the consensus of the earlier discussion was to keep both sections. The earlier discussion dealt with how to divide material between the two articles. I like Melcombe's proposal to retitle the correlation article as something along the lines of "correlation and dependence." As far as "sensitivity to the data distribution" goes, I think a section like that belongs in nearly every article about a summary statistic, although the contents of the section would obviously differ. If the correlation article moves to "correlation and dependence," I'm not sure there are any statements that apply generally to correlation and dependence, whereas it is of course possible to say things specifically about Pearson correlation. Skbkekas (talk) 19:57, 4 November 2009 (UTC)

Correction of a misunderstanding

The following comment was placed in the article by 82.46.170.196, in the section Pearson's product-moment coefficient.

It is important to appreciate that the above description applies to the population and not to a small sample. It does not take into account the degrees of freedom. A simple test in Excel shows that the covariance of an array divided by the product of the standard deviations does not give the correct value. For example, when all x = y, r does not equal 1. However, if the products of the z scores of each (x,y) pair are divided by (n-1) rather than n, then the correct value is obtained.

Firstly, this comment belongs here, not in the article, so I have moved it. Secondly, I shall try to clear up the misunderstanding. The covariance of a sample is calculated by dividing by n, while dividing by n-1 is used to calculate an unbiased estimate of the covariance of the population. Exactly the same applies to calculating the variance of a sample and an unbiased estimate of a population variance. The standard definition, as given in the article, uses the sample covariance and the sample variances. Alternatively you can use unbiased estimates of population values in both cases: the result is exactly the same. However, the result is not the same if you mix the sample covariance and unbiased estimates of population variances: you have to be consistent.

I do not normally use Excel, but to prepare for writing this I have looked at it. The function COVAR calculates the covariance of the numbers given (which may or may not be a sample: that is irrelevant). On the other hand the function VAR does not calculate the variance of the numbers given, but rather an estimate of the variance of a population which the numbers are assumed to be a sample from. In order to calculate the actual variance of the numbers given, you have to use the function VARP. Why VAR and COVAR work inconsistently is something only Microsoft programmers can explain.

Unfortunately the Excel help files make things even more confusing: for example they say that VARP "Calculates variance based on the entire population", although the numbers are frequently not a population at all. I think it comes down to Microsoft programmers being programmers with a little knowledge of statistical techniques, rather than statisticians. JamesBWatson (talk) 20:41, 19 November 2009 (UTC)
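The mix-up is easy to reproduce outside Excel. A minimal Python sketch (made-up data) of what happens when a population covariance (COVAR-style, dividing by n) is combined with sample variances (VAR-style, dividing by n-1) in the x = y case:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = xs[:]                    # x = y, so the correlation should be exactly 1
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

cov_pop = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # COVAR-style: divide by n
var_sample = sum((x - mx) ** 2 for x in xs) / (n - 1)            # VAR-style: divide by n - 1
var_pop = sum((x - mx) ** 2 for x in xs) / n                     # VARP-style: divide by n

r_mixed = cov_pop / (math.sqrt(var_sample) * math.sqrt(var_sample))
r_consistent = cov_pop / (math.sqrt(var_pop) * math.sqrt(var_pop))

print(r_mixed)       # 0.75, i.e. (n-1)/n: the inconsistent mixture
print(r_consistent)  # 1.0: consistent divisors give the right answer
```

The inconsistent mixture is always off by exactly the factor (n-1)/n, which matches the behaviour described above.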


Which one is known as canonical correlation

1. Scatter Diagram. 2. Karl Pearson. 3. Graphic Method. 4. Rank Correlation. —Preceding unsigned comment added by 117.193.144.20 (talk) 11:09, 6 June 2010 (UTC)

None of them: see Canonical correlation. However, this page is not for questions of this kind: it is for discussing editing of the article. JamesBWatson (talk) 19:34, 6 June 2010 (UTC)

Pearson correlation "mainly sensitive" to linear relationships??

In the second paragraph we have the sentence:

[...] The Pearson correlation coefficient, [is] mainly sensitive to a linear relationship between two variables.

Shouldn't the word "mainly" be changed to "strictly"?

watson (talk) 01:14, 12 September 2010 (UTC)

I disagree. To the contrary, I think "mainly" is already too strict. I think it should be "somewhat more". Skbkekas (talk) 05:09, 12 September 2010 (UTC)
what's an example of two 1D data sets that have high Pearson correlation but lack linear relationship? watson (talk) 01:46, 13 September 2010 (UTC)
If X is uniformly distributed on (0,1) and Y = log(X) the correlation is around 0.86.Skbkekas (talk) 01:04, 15 September 2010 (UTC)

This example shows that a linear relationship between x and log(x) is present to a large extent. log(x) is far from linear near zero but very nearly linear out near 1; thus the correlation coefficient is reduced by the former but not the latter. I coded up this short program in Python to demonstrate. The visual demonstration of the Pearson correlation is linear regression, as shown below (the blue line is log(x), the red line is the regression, of course). Note that the correlation coefficient is actually near 0.787.

Linear regression onto log(x) on unit interval

The code for this example is as follows:

import numpy as N
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, linregress
 
# 1000 evenly spaced points on (0, 1], starting just above zero to avoid log(0)
x = N.linspace(.00000001, 1, 1000)
y = N.log(x)
 
# Least-squares fit of y on x; r is the Pearson correlation coefficient
(a, b, r, tt, stderr) = linregress(x, y)
z = a*x + b
print r
 
plt.plot(x, z, 'r')   # regression line in red
plt.plot(x, y)        # log(x) in blue
plt.savefig('x_vs_logx.png')

# Independent check of the correlation coefficient
r, p = pearsonr(x, y)
print r

it returns the above plot as well as this output of the Pearson correlation coefficient (calculated in two places independently):

0.787734089775
0.787734089775

watson (talk) 20:49, 15 September 2010 (UTC)

This isn't a big deal, but the numerical calculation above is giving the wrong answer, since the numerical approximation to the definite integral is very sensitive to how the limiting behavior at zero is handled. Doing the calculation analytically, you get -1/4 for E(X*Y) (using integration by parts), and you get -1/2 for EX*EY (using the fact that -Y follows a standard exponential distribution, so EY = -1). Thus cov(X,Y) = 1/4. The variance of X is 1/12 and the variance of Y is 1. Thus the correlation is sqrt(12)/4 = 0.866.
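For anyone checking the arithmetic, the closed-form moments can be plugged in directly, with no numerical integration needed:

```python
import math

# X ~ Uniform(0,1), Y = log(X); closed-form moments from the derivation above
E_X = 0.5
E_Y = -1.0          # -Y is standard exponential, so E[Y] = -1
E_XY = -0.25        # integral of x*log(x) over (0,1), by integration by parts
var_X = 1.0 / 12.0
var_Y = 1.0         # variance of a standard exponential

cov_XY = E_XY - E_X * E_Y                 # -1/4 - (-1/2) = 1/4
r = cov_XY / math.sqrt(var_X * var_Y)     # sqrt(12)/4 = sqrt(3)/2
print(r)                                  # 0.866...
```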

The larger issue about whether the Pearson correlation is "strictly" sensitive to a linear relationship amounts to how you interpret the word "strictly". Many people would incorrectly interpret this as implying that the Pearson correlation is blind to relationships that aren't perfectly linear. I also would argue that the plot above exaggerates the approximate linearity of log(x), based on the very large range of the vertical axis.Skbkekas (talk) 02:12, 16 September 2010 (UTC)

Thanks for that correction, Skbkekas. I was playing fast and loose with my numerical approximation, and you're totally right about the limiting behavior at zero, considering log(x) shoots off to -inf. I reran my code with my function representation parameters maxed out, i.e. changing the line
x = N.linspace(.00000001,1,1000)

to

x = N.linspace(1e-150, 1, int(6.5e7))

(the smallest interval start and the largest vector size, respectively, that Python on my computer can handle)

and I get the value
0.865246469782
Linear regression onto log(x) on unit interval (y-axis restricted)
Linear regression onto log(x) on [.2,1] (y-axis restricted)
Note that the correlation actually increases towards your analytic limit, because using a finer discretization of the x-axis amounts to giving less weight in the calculation to the values of log(x) really close to (and including) x = 1e-150.
As to how correlation handles non-linear relationships, I think we're getting caught up on the word "relationship". Yes, log(x) has an explicit non-linear relationship to x, but that's a different sense of the word "relationship" than what Pearson correlation measures. Pearson correlation measures the degree to which log(x) is "linear-ish", to use some colloquial language. That is, it measures the relationship of log(x) not to x, but to a linear approximation of itself over the specified interval of x. And, as the last paragraph points out, there is only a relatively small sub-interval on which log(x) is not close to linear.
To demonstrate this, I ran my code again, but now with the above line changed to
x = N.linspace(.2, 1, int(6.5e7))

which returns

0.980302482584
Your comment about the log(x) axes is fair (I let Python choose them before), and I have plotted again with the y-axis restricted to a minimum of -8. I include that figure here, and also a figure showing the regression on the sub-interval [.2, 1] mentioned above.

watson (talk) 21:44, 19 September 2010 (UTC)

@Watson The log(x) function on the unit interval (i.e., [0,1]) is not a suitable example for a numerical solution: the numerical estimate of the correlation coefficient is ill-conditioned there. Actually, in view of your edit history, you should have come to this conclusion yourself. How can you seriously change the value of this correlation coefficient in the article without understanding that even the new value is not the true value, simply because you cannot find it with your tool? Remember, at first you thought 0.787 was close enough to the true value. Tomeasy T C 06:50, 23 September 2010 (UTC)

Reference update?

There is a citation given, near the bold term anticorrelation (ref number 5) to Dowdy, S. and Wearden, S. (1983). "Statistics for Research". Wiley. ISBN 0471086029 pp 230. This is the first edition and the latest is the 3rd (Detail and online subscription version)... can anyone say whether this term does (still) appear and so update the reference and page number? Melcombe (talk) 14:17, 21 September 2010 (UTC)

Mistakes in formulas

Yesterday I changed some major mistakes in the correlation coefficient formulas (see the edits eliminating the (n-1) in the denominator). I think these kinds of mistakes are unacceptable and inexcusable, and a warning should be added to the article saying that its reliability or quality is poor, at least until a couple of experts devote some time to verifying the quality of the info. 71.191.7.89 (talk) 16:08, 17 January 2011 (UTC)

It was right before, with the n–1 terms in the divisors. However, these just serve to cancel out the n–1 in the formula for s given at Standard deviation#With sample standard deviation. I'll add another expression to the first display formula for rxy to make this a bit clearer. One quick way to see that the formulas you left had to be wrong is to consider what would happen if you computed the correlation for a sample with two copies of all the observations in the original sample. Clearly this should have no effect on the estimated correlation. --Qwfp (talk) 21:15, 17 January 2011 (UTC)
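Qwfp's duplication argument is easy to verify. A small Python sketch with arbitrary made-up data, using the deviations-from-means formula:

```python
import math

def pearson(xs, ys):
    # Sample correlation via deviations from the means; the n-1 factors
    # in the numerator and denominator cancel, so none appear here
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1.0, 2.0, 4.0, 5.0]
ys = [1.0, 3.0, 3.0, 6.0]

print(pearson(xs, ys))          # some r strictly between -1 and 1
print(pearson(xs * 2, ys * 2))  # duplicating every observation leaves r unchanged
```

Doubling the sample doubles the numerator and each sum of squares, so the factor of 2 cancels and r is unaffected, exactly as argued above.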