I wrote "Correlated: Surprising Connections Between Seemingly Unrelated Things" (http://www.correlated.org), and one of the things I run into a lot when promoting the book is the "correlation doesn't imply causation" line.
Obviously, the statistics I present in the book and on the website are tongue-in-cheek, but I like to take issue with that line, because typically, people don't fully understand what it means. I use a variation of the line in the OP's post -- that "nothing other than correlation implies causation."
When something implies correlation it doesn't mean there actually is correlation. It just means it might be worth looking deeper in that direction, right? Therefore you need all these different attributes to show that a causation exists, but a single one should be enough to imply causation.
note: I might actually be really confusing my word definitions here, btw.
You can't exactly imply correlation (in simple models); you merely observe it. You can imply causation, since it's a theoretical claim, not just an observation.
What I meant to say before is that correlation cannot imply causation in and of itself because it is merely an observation—you must also build a framework of causal assumptions within which your observed correlation could have only arisen if there was a causal relationship of interest. That final statement is the "implication" we're all talking about.
So, "correlation, along with an acceptable and proper set of counterfactual assumptions, is capable of implying causation sometimes". Just rolls off the tongue, really.
The right way to disprove "Nothing other than correlation implies causation." would be to find one causation whose existence we know from something other than correlation.
But you won't find it. Think about any causation. Now ask yourself how we know that it exists. The answer will always be correlation.
Causal calculus, mutual information, random forest importance scores, various hypothesis tests, and other methods can all imply causation as well as or better than correlation (especially in the case of non-linear or multivariate association). All these methods and more are widely used in the literature.
Those methods still rely on correlation. To be clear: We are not only talking about linear correlation here, which is only one of several kinds of correlations.
No they don't. They work directly with underlying estimates of probability distributions, entropy or impurity decrease in machine learning models.
Another example: Mendelian inheritance patterns in a pedigree study.
If you know of a good measure of non-linear correlation, please let me know. And publish a paper in Science or Nature like the MIC/MINE people did (a measure that has issues in practice).
To estimate probability distributions, you need data that is non-random. Non-random means there's a pattern. That's another word for correlation.
In using those methods you may never calculate correlation as a number, but when those methods find something you still rely on the fact that there is a correlation.
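Both sides of this exchange can be illustrated numerically. Here is a minimal numpy sketch (the mutual-information estimator is a crude histogram plug-in, purely for illustration): on a symmetric quadratic, Pearson correlation comes out near zero, yet mutual information clearly detects the dependence, i.e. the "pattern" the parent comment is pointing at.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 10_000)
y = x ** 2 + rng.normal(0, 0.1, x.size)  # nearly deterministic, but symmetric

# Linear (Pearson) correlation misses the relationship almost entirely.
pearson = np.corrcoef(x, y)[0, 1]

# A crude plug-in estimate of mutual information from a 2-D histogram.
def mutual_info(a, b, bins=20):
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

mi = mutual_info(x, y)
print(f"Pearson r = {pearson:.3f}, estimated MI = {mi:.2f} nats")
```

The point is not that this estimator is good (it isn't; bin counts matter a lot), only that "dependence" is a broader notion than linear correlation.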
How about this? Playing soccer causes broken hips, but soccer playing is in fact anti-correlated with broken hip incidence. Why? Because most broken hips occur among extremely old people, who generally don't play soccer.
[I'm not sure whether this is actually statistically true, but you get the idea, and I'm sure there's a similar example which actually is true.]
Just because old people tend to break their hips more often than younger people wouldn't necessarily make soccer anti-correlated with hip injuries. In your example, if you compared two groups that are identical, the only difference being one plays soccer and the other doesn't, you'd find a correlation between soccer and broken hips.
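The confounding story above is easy to simulate. In this numpy sketch (all rates invented purely for illustration), soccer genuinely raises fracture risk, yet the marginal correlation comes out negative because the young both play more and break fewer hips; within an age stratum where playing rates are flat, the true positive association reappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
age = rng.integers(10, 90, n)

# Made-up rates: the young play soccer, the old break hips.
p_plays = np.clip(1.2 - age / 80, 0.05, 0.95)
plays = rng.random(n) < p_plays
p_fracture = np.clip(age / 300 + 0.04 * plays, 0, 1)  # soccer adds real risk
fracture = rng.random(n) < p_fracture

overall = np.corrcoef(plays, fracture)[0, 1]  # confounded by age: negative
young = age < 20                 # here p_plays is constant (clipped at 0.95),
stratum = np.corrcoef(plays[young], fracture[young])[0, 1]  # so age can't confound
print(f"overall r = {overall:.3f}, under-20s r = {stratum:.3f}")
```

This is exactly the "compare two otherwise identical groups" move from the comment above, done by stratifying on the confounder.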
Isn't this statement taking it too far? As far as I understand, "imply X" means that there is a chance >0% that X might be true. Therefore "not imply X" would mean that the chance is <=0%. But when you have an unknown dataset showing X (with X being "A correlates with B", for instance), then the chance of X being true is in fact bigger than 0%. It might be only negligibly more likely than 0%, so little more that you might conclude it is basically 0%, but it still is >0% in the mathematical sense. Therefore seeing correlation should imply correlation.
"Imply X" means "one can logically prove X follows as a consequence." When you say correlation does not imply causation you're saying "there are examples of things which correlate but have no causal relationship." Proving such a claim (by finding an example) is a logical disproof of the claim that correlation implies causation. Logical implication has nothing to do with probabilities in this sense.
The point of the article is that correlation (of observed data) does not logically imply correlation (of the underlying phenomena, usually in a more general setting than the data allows).
I guess if you want to be formal you need to actually define imply like you have but the author hasn't.
I'd go with "A implies B" means (not A) or B, i.e. binary implication. That would seem to capture the spirit of "correlation does not imply causation", since we can have correlation (A) and no causation (not B), i.e. not((not A) or B), i.e. no implication. So I guess we can use that as at least a possible definition here too.
If we say A is correlation in the sample set and B is correlation in the population, I think we can replicate the above argument and say we can have correlation in the sample set (A) and no correlation in the population (not B), i.e. no implication. An obvious example: if plotting two variables gave a simple quadratic curve (say, y=x^2 across x=[-10,10]), we can find a sample set with strong positive correlation and another with strong negative correlation, but neither really describes the population as a whole.
So being mathematical and using the boolean logic implication definition I think it's a valid statement - but defining implication in context is pretty important.
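The quadratic example is concrete enough to check directly. A small numpy sketch, using the same y=x^2 over [-10,10] as above:

```python
import numpy as np

x = np.linspace(-10, 10, 201)
y = x ** 2  # the "population": a simple quadratic

full = np.corrcoef(x, y)[0, 1]                    # whole range: correlation ~ 0
left = np.corrcoef(x[x <= 0], y[x <= 0])[0, 1]    # sample from [-10, 0]: strongly negative
right = np.corrcoef(x[x >= 0], y[x >= 0])[0, 1]   # sample from [0, 10]: strongly positive
print(full, left, right)
```

Two perfectly "significant" but opposite sample correlations, neither of which describes the population.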
So the statement "correlation does not imply causation" says nothing more than the old "correlation does not equal causation", but is easier to misinterpret (accidentally or wilfully) due to the multiple definitions of the word "implies"?
I'd trust the definitions of the other people more than my own, since English is only my second language. Also I am not a scientist. It's totally possible that my interpretation of the word "imply" is wrong, which is also what all the other comments suggest.
That is not a very useful definition of the word imply, since all events have a non-zero probability in a trivial, quantum mechanical anything-can-happen sense. Gelman likely uses the word imply in the more common meaning of indirect logical consequence.
> It might be only negligibly more likely than 0%, so little more that you might conclude it is basically 0%, but it still is >0% in the mathematical sense. Therefore seeing correlation should imply correlation.
I don't agree in the general case. A counter-example is that a purely random data set will have correlations, which by definition imply absolutely nothing about anything other than the random data set.
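This is easy to demonstrate: fish through enough purely random variables and a "strong" correlation always turns up. A short numpy sketch (the sizes here are arbitrary, chosen only to make the effect obvious):

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars = 30, 1000
target = rng.normal(size=n_obs)
noise = rng.normal(size=(n_vars, n_obs))  # 1000 purely random "predictors"

# The best of many random correlations is reliably large, despite meaning nothing.
corrs = np.array([np.corrcoef(target, v)[0, 1] for v in noise])
best = np.abs(corrs).max()
print(f"strongest |r| among {n_vars} random variables: {best:.2f}")
```

With 30 observations and 1000 candidate variables, the winning |r| is routinely above 0.5, which is exactly why "we searched until we found a correlation" implies nothing about the population.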
This is addressed in the last sentence of the OP:
> That is, correlation in the data you happen to have (even if it happens to be “statistically significant”) does not necessarily imply correlation in the population of interest.
The key point is that there is more to science than statistics. To reword: statistics alone are not sufficient to create scientific knowledge.
To create scientific knowledge you need to make a prediction (a.k.a. a hypothesis) and test it. To make a hypothesis you need some notion of a causal mechanism; to run the experiment, you interfere with that mechanism and observe what happens.
If all you have is a statistical correlation, and you haven't identified or altered a mechanism...you really don't have much. That is what this author is getting at.
Statistical correlation (revealed by some statistical test) is one thing.
The common meaning of correlation is a close relationship embedding some causation in it (A implies B, or B implies A, or the same thing is the root of both A and B).
You can find that two series have high correlation (statistically) when in fact this happens just by chance.
In other words, it equivocates: http://en.wikipedia.org/wiki/Equivocation Wikipedia concentrates on the fallacy angle... in this case it isn't a fallacy, it's sloganeering, since the point is to encourage the viewer to be confused and dig into the two different meanings being used. I think it's not likely to work too well, though.
Non-correlation is so correlated with non-causation that it requires it. And that forms the basis of a lot of "you accept the data you're dealt" science.
Not true. It is easy to find/construct a causal non-linear relationship that won't show up in correlation tests. Correlation really isn't that great a measure.
You are correct if what you're implying is that correlation isn't a robust measure, whereas non-correlation can be quite robust if your test isn't constructed badly (or, as you imply, 'constructed' badly). It's important to consider the meaning of such measures without regard to the quality of the test, which can always be faulty.
That is my point exactly, though I am extremely skeptical of any test for non-correlation. Gelman actually has some other articles worth reading on how dangerous it can be to make policy decisions based on such tests, with real-world examples including traffic laws.
You can always linearize a non-linear relationship to create a linear correlation. Even in cases such as an exclusive-or, neural networks demonstrate that layering linear functions (with non-linear activations) can mimic any relationship.
Sure, but it no longer has any meaning as a significance test unless you somehow bound the complexity of the relationships you allow yourself, so you're left with cross-validation. I.e., I can non-parametrically relate any two variables.
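For concreteness, here is the "linearize it" move in a few lines of numpy: a hand-picked transform recovers a perfect linear correlation from a relationship that raw correlation misses entirely. This also illustrates the rejoinder: if you are allowed to pick the transform after looking at the data, the resulting "correlation" proves little on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
y = x ** 2  # non-linear and symmetric, so corr(x, y) is ~ 0 ...

raw = np.corrcoef(x, y)[0, 1]
linearized = np.corrcoef(x ** 2, y)[0, 1]  # ... but the right transform makes it exactly linear
print(f"corr(x, y) = {raw:.3f}, corr(x**2, y) = {linearized:.3f}")
```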
Man, for all the critics out there quick to point out that correlation is not equal to causation, nobody ever seems to explain what needs to be in place to actually show causation.
To establish causation you need 3 things:
1) Correlation
2) Temporal precedence. That is, you have to show that the cause occurs in time before the effect.
3) A lack of other plausible explanations
If more people knew the above the world would be a better place. Even the Udacity statistics course failed to mention it, despite hammering home that correlation does not imply causation.
When I get a cold I pray to Zeus. It always clears up in a few days. Can't think of any other reason for it getting better, therefore by your 3 criteria I'm justified in thinking Zeus cures my colds.
Actually, to establish causation, in as much as this is possible at all [1], you need a predictive model and controlled experiments.
[1] following Hume, you can never definitively show causation.
Even when all three are satisfied, there may still not be a causal relationship, since the correlation may have arisen by coincidence. For example, if Google shows that the stock-market change on a given day seems to correlate surprisingly well with search volume for a particular phrase on the day before, that may be just because the particular phrase was chosen from among a million others based on its correlation. Causation still cannot be established (though a link cannot be denied either).
Note: If I recall correctly, Google did show one such study.
But coincidences are covered by his (3). By definition, coincidences are low-probability events; if one does not account for them, then yes, with low probability one will get it wrong. I think this gets more into the probably approximately correct (PAC) learning domain.
I was also wondering whether they were covered by #3, but I am not sure:
>> 3) A lack of other plausible explanations
This means that there are not other parameters involved, satisfying the other two criteria, that can also plausibly "explain" the measured effect. In the example I gave, there may not be any such other available correlation.
That the analyst picked the plausible cause by hunting through a large number of others is by itself not a plausible explanation of the kind asked for by #3.
The definition of cause is more of a philosophy/epistemology question than a statistics question. I'm not sure those 3 points are necessary. Even temporal precedence is called into question by quantum mechanics. Positrons are just electrons traveling backwards in time and all that.
Most troubling is point 3 -- a lack of other plausible explanations. There are many things, such as smoking cigarettes causing cancer, that have other plausible explanations. It is clear that Americans understand causality to be some form of influence, and not necessarily the only possible influence for an event. It is the degree of influence that matters.
This isn't really a great system for reporting causation. It's at best OK and is used from time to time (see any literature on Granger Causality).
As I linked elsewhere in this thread, if you really want to prove causality then you need to be making counterfactual arguments. Judea Pearl's framework of causal networks is a good system for doing this.
At the end of the day, your claim of causality rests upon a network of assumptions about causality (and this is a formal description of your point (3), basically) and someone accepts this proof by accepting your assumption network and the strength of your new data.
No. Model-building relies on a given set of assumptions that may or may not be true. Once you've agreed to a set of assumptions, the data becomes meaningful within the framework. The framework can be bent and shaped as the assumptions evolve.
If we can assume that on average, a random set of people who are randomly asked about a political topic is a reflection of the population of a whole, we can start to draw meaningful conclusions about the political topic.
The OP article points to the concept of spurious correlation [1] which is a danger if you have very little domain expertise in the data that you're working with. e.g. If your regression shows US GDP is statistically significantly affected by Bangladesh butter production, you may want to discuss the results with domain experts about why the result may or may not be spurious.
1) every variable you calculate with follows the law of large numbers (not to be confused with the central limit theorem) (this means, amongst other things, that if you find correlation, you can't tell anyone involved in the variables, even yourself, because you'll act on it and change it, at which point the correlation won't hold. Also it won't work for chaotic variables, like pretty much anything involving human actions)
2) the correlation remains intact over longer time periods. The variables must be sampled without bias. You have properly separated your concepts, and made sure they actually represent what you want to be looking at. Etc. Etc. Etc. (this is essentially stating : don't fuck up the math before you calculate correlation) (obviously don't fuck up the math afterwards either, in other words : correctly check for statistical significance)
3) You have to check with every time offset, and even with variable time offsets. This is too complex to go into.
Then correlation implies a "causal relationship". Meaning corr(A, B) > 0 implies at least one of the following:
1) A causes B
2) B causes A
3) there is some external factor C that causes both A and B
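Case 3) is the classic confounder, and it is worth seeing how cheaply it manufactures correlation. A minimal numpy sketch (variances chosen arbitrarily) in which A and B never influence each other at all:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
c = rng.normal(size=n)       # hidden common cause C
a = c + rng.normal(size=n)   # A is caused by C (plus independent noise)
b = c + rng.normal(size=n)   # B is caused by C; A and B never touch

r = np.corrcoef(a, b)[0, 1]  # ~ 0.5, with no A->B or B->A link at all
print(r)
```

Here the population correlation is exactly Var(C) / (Var(C) + Var(noise)) = 0.5, driven entirely by C.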
Now keep in mind that there is no such thing as a root cause. There is simply a chain of events, and if the next event wouldn't have happened without the previous one, the first event is said to be a cause of the second. "I shot him because his bees annoyed me": this "algorithm" would find the bees as guilty as the perpetrator; they are both causes of the death.
Also keep in mind that this only works AS LONG AS YOU LEAVE THE CAUSAL CHAIN ALONE. If you calculate correlation, and find, say "BAC" always goes up 2 days before "JPM" goes up, and then you proceed to buy JPM when BAC goes up, you've just invalidated your conclusion (translation : this works better the smaller an investor you are).
There is also absolutely no guarantee that killing off one chain of events won't simply lead to another. Say you have a harbor with two entrances, and because one is wider, every ship uses it. Therefore "getting to that harbor" correlates perfectly with "going through entrance A". Obviously, blocking entrance A won't lead to no more ships in the harbor.
So this DOES NOT match the human/legal idea of "cause", and can never yield useful actions to take. It is nevertheless a useful metric.