People will often talk about finding ‘strong’ or ‘weak’ relationships between two variables, X and Y. I think this language can be unhelpfully vague and prone to misconception. It could mean any of three quite different things:
- When X increases, Y also tends to increase
- When X increases, Y increases by a large amount
- X causes Y
Sometimes people seem to use this vague language intentionally. It allows them to imply that they have discovered that X causes Y when really they’ve only shown that Y tends to increase when X increases. These are two very different claims!
It’s important to understand each of these claims, how they are made, and the type of evidence you’d expect when making them.
When X increases, Y also tends to increase
We are often interested in how much two variables tend to vary together. That is, when X increases, does Y tend to increase? And when X decreases, does Y tend to decrease? The amount that two variables vary together is called the correlation.
For example, let’s say you’re comparing the number of hours students study and their grades. If you find that students who study more tend to get higher grades, then there’s a positive correlation between study hours and grades – as one goes up, so does the other.
On the other hand, if you find that as one thing increases, the other decreases, that’s a negative correlation. If the more hours students spend playing computer games, the lower their grades are, then gaming and grades have a negative correlation.
One way of thinking about correlation is that it describes how close to a straight ‘line of best fit’ the data points fall.
The two variables in the following graph are perfectly, positively correlated. Every time X increases, Y also increases. The data points are neatly aligned in a straight line on the line of best fit.
In this next graph the two variables are positively correlated, but it’s not perfect. Y tends to increase when X does, but not always. The data points are close to the line of best fit, but not on it.
This is what perfect negative correlation looks like (the left-hand chart) and imperfect negative correlation looks like (the right-hand chart):
Finally, here is an example of two variables that are not correlated at all.
Correlation is often measured using Pearson’s correlation coefficient (‘R’), which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). So, if someone shows you a correlation coefficient, they are telling you the degree to which two variables vary together.
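To make this concrete, here is a minimal sketch of Pearson’s coefficient computed by hand, using the study-hours-and-grades example from earlier. The numbers are made up for illustration:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient: covariance of xs and ys,
    scaled so the result always lies between -1 and +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]            # hypothetical study hours
grades = [52, 55, 61, 64, 70]      # grades tend to rise with hours
print(round(pearson_r(hours, grades), 3))  # → 0.993, a strong positive correlation
```

Because grades rise almost perfectly in step with hours here, R comes out close to +1; a negative relationship (like the gaming example) would give a value below 0.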
When X increases, Y increases by a large amount
The correlation doesn’t tell you whether an increase in X is associated with a small or a large increase in Y. Increases in X could be associated with very small increases in Y, yet the two variables could still be perfectly correlated.
If you want to know how much Y increases when X increases, you need to look at the slope or gradient of the line of best fit.
When X increases, Y increases by a much greater amount on the red line than on the blue line. The red line has a slope of 4, meaning Y increases by 4 whenever X increases by 1. The blue line has a slope of 2, meaning Y increases by only 2 whenever X increases by 1.
The data used to plot the red line and blue line might have the same correlation, but the slopes are quite different.
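A quick sketch makes the distinction concrete. The two datasets below are invented so that both are perfectly correlated (R = 1), yet their least-squares slopes differ, matching the red and blue lines described above:

```python
def slope(xs, ys):
    """Least-squares slope of the line of best fit:
    covariance of xs and ys divided by the variance of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x

xs = [0, 1, 2, 3]
red = [0, 4, 8, 12]   # Y rises by 4 per unit of X
blue = [0, 2, 4, 6]   # Y rises by 2 per unit of X

print(slope(xs, red), slope(xs, blue))  # → 4.0 2.0
```

Both datasets lie exactly on their lines of best fit, so their correlations are identical; only the slopes reveal how much Y changes with X.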
X causes Y
We might find that X and Y are correlated and that Y tends to increase a lot when X increases, but this does not mean that X causes Y. Hence the old saying: ‘correlation does not imply causation’.
X and Y might be correlated because they are both caused by some other variable. For example, the number of people carrying umbrellas and the number of people wearing coats are likely to correlate. But umbrella-carrying doesn’t cause coat-wearing. They are both caused by something else: bad weather.
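This confounding effect is easy to simulate. In the hedged sketch below, a made-up ‘weather badness’ variable drives both umbrella-carrying and coat-wearing; neither causes the other, yet they come out strongly correlated:

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

random.seed(0)
badness = [random.random() for _ in range(1000)]          # how bad the weather is
umbrellas = [b + random.gauss(0, 0.1) for b in badness]   # depends only on weather
coats = [b + random.gauss(0, 0.1) for b in badness]       # also depends only on weather

# Strongly positive, despite no causal link between umbrellas and coats:
print(round(pearson_r(umbrellas, coats), 2))
```

The correlation here is entirely manufactured by the shared cause; removing the weather term from either variable would send it to roughly zero.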
Establishing that X causes Y requires a lot more work than plotting the variables and drawing a line of best fit. It requires careful thought about study design.
If you want to show that X causes Y, you have to carefully rule out all other explanations for the relationship. This means using study designs that try to hold everything else constant apart from changes in X, such as randomised controlled trials and other experiments. This is not easy!
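The logic of randomisation can be sketched in a few lines (all numbers below are hypothetical). Because treatment is assigned by coin flip, it is independent of any hidden confounder, so a remaining difference in Y between the groups is evidence of a genuine effect of X:

```python
import random

random.seed(1)
people = range(2000)
treated = {p for p in people if random.random() < 0.5}  # coin-flip assignment

# Suppose Y depends on a hidden confounder plus a true treatment effect of size 2.
confounder = {p: random.gauss(0, 1) for p in people}
y = {p: confounder[p] + (2 if p in treated else 0) for p in people}

def mean(vals):
    return sum(vals) / len(vals)

effect = (mean([y[p] for p in treated])
          - mean([y[p] for p in people if p not in treated]))
print(round(effect, 1))  # close to the true effect of 2
```

The confounder still influences Y, but because it is spread evenly across both groups by the random assignment, it washes out of the between-group comparison.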