Most people, including me, find it really hard to understand p-values and statistical significance. Even proper scientists struggle to explain them.
This blog is my attempt to explain them as clearly as I can, for my own benefit. It’s an iron law of social science that any definition of statistical significance will have at least one mistake in it. That includes this one, so please let me know when you find it.
Why do we calculate p-values?
If we are trying to estimate an effect (e.g. the impact of an educational programme on children’s reading) we have to do several things that create uncertainty and make it impossible to get a perfect estimate. For example, we can’t do our study on all children so we need to select a sample of children to work with. This introduces the possibility that, through sheer bad luck, our sample is different in some important way from the general population of children. There’s now a risk that we might mistakenly detect an effect when the true effect is zero.
This is the problem that p-values and significance testing are addressing. Every time we estimate an effect, we will face uncertainty about that estimate. Significance tests and p-values can help us think about this uncertainty.
So what are p-values?
When we calculate a p-value we start by imagining that there is no real effect. To continue the example above, we’d assume the educational programme has no impact on children’s reading. If we ran our study many times, the average effect size would be zero. However, thanks to the uncertainty described above, these studies would find a distribution of different effects. These effects would be normally distributed, meaning they would be arranged in a bell shape: many studies would find effects close to zero, fewer studies would find moderate effects, and a small number would find quite large effects.
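If it helps to see this concretely, here’s a minimal simulation sketch in Python. It’s my own illustration, not from any real study: the group size of 50 children and the spread of reading scores are invented numbers. It runs 10,000 imaginary studies in which the true effect is zero and collects the estimated effects, which pile up in exactly the bell shape described above.

```python
# Simulate many studies where the true effect is zero (illustrative
# numbers only: 50 children per group, reading scores with sd 10).
import numpy as np

rng = np.random.default_rng(42)
n_studies = 10_000
n_per_group = 50
score_sd = 10

# Each "study" compares a treatment group to a control group drawn
# from the same population, so the true effect is zero by construction.
treatment = rng.normal(0, score_sd, size=(n_studies, n_per_group))
control = rng.normal(0, score_sd, size=(n_studies, n_per_group))
effects = treatment.mean(axis=1) - control.mean(axis=1)

print(f"mean estimated effect: {effects.mean():.3f}")   # close to zero
print(f"sd of estimated effects: {effects.std():.3f}")  # sampling noise
# A histogram of `effects` would show the bell shape described above.
```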
We calculate a p-value by comparing the effect we actually estimated in our real study to this theoretical distribution of effects. Is our estimate close to the mean of this distribution or is it in the extreme end of one of the tails? If it’s in the tails, that means we’d be unlikely to see an effect this large if the true effect is zero.
So, to calculate a p-value we ask: in the imaginary situation where there is no real effect, how likely is an effect as large as the one we saw in our real study? This is what the p-value is: the probability that, assuming the true effect is zero, we would see an effect at least as large as the one we actually saw when we collected real data.
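Here’s a hedged sketch of that calculation, again with invented numbers (an observed effect of 3 reading-score points, 50 children per group, scores with a standard deviation of 10). It asks how often the imaginary zero-effect world would produce an effect at least as large as ours, in either direction.

```python
# Illustrative p-value calculation; all numbers are made up.
import numpy as np
from scipy import stats

n_per_group = 50
score_sd = 10
observed_effect = 3.0  # hypothetical estimate from our "real" study

# Standard error of the difference between two group means.
se = score_sd * np.sqrt(2 / n_per_group)

# Two-sided p-value: how often the zero-effect world produces an
# effect at least this far from zero, in either direction.
z = observed_effect / se
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these invented numbers the p-value comes out around 0.13, so an effect of 3 points would not be especially surprising even if the programme did nothing.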
A small p-value suggests we’d be unlikely to see an effect this large if there was no real effect. We could interpret this as evidence against the claim that there is no effect.
A larger p-value suggests that effects this size or larger would be quite common if there was no real effect. It’s therefore harder to reject the idea that there is no effect.
In other words: the p-value tells us how surprising our result would be if the true effect was zero.
What does statistical significance mean?
Typically, if a p-value is less than 0.05 researchers will declare that the finding is statistically significant. There is nothing special about the threshold of 0.05: it was suggested by the statistician Ronald Fisher, who popularised the p-value, and it stuck, in social science anyway. Sometimes researchers will set other thresholds.
That’s all ‘statistically significant’ really means: a result this large is unlikely to have occurred (p<0.05) if there was no true effect.
Some researchers argue that statistical significance means you can reject the idea that the true effect is zero. That seems reasonable enough. But there are many other interpretations that are simply wrong.
What does statistical significance not mean?
- ‘Statistics show this finding is important’. This is a common misconception that happens when we treat statistics as a magical black box that takes in data and churns out truths. Unfortunately, there isn’t one statistical test that can tell us whether something is important. We still have to judge the importance of findings ourselves, based on lots of different bits of information.
- ‘This finding is true’. Statistical significance means a finding would be unlikely if there was no real effect. This is not the same thing as a finding being true. We need to get a lot of things right to make truthful claims, such as a well-designed study and replication of results.
- ‘There is a less than 5% chance that this finding is false’. See above. The p-value is calculated by assuming there is no real effect, so it cannot tell us the probability that the finding itself is true or false.
- ‘This finding is not due to chance’. In a way, the logic behind the p-value is exactly the opposite of this. It starts by assuming that there is no effect and that any effects are due to chance, and then asks: how likely is it we’d see an effect this large? ‘A result this large is unlikely to have occurred if there was no true effect’ is not the same as ‘this result is not due to chance’.
- ‘The intervention we tested is highly effective.’ A low p-value tells you nothing about the size of an effect. For that you need to calculate an effect size. It’s quite possible to find p-values smaller than 0.05 for small, unimportant effects, as the sketch after this list illustrates.
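To illustrate that last point, here’s a small sketch with invented numbers: a trivially small effect (half a reading-score point) tested on a very large sample still comes out ‘statistically significant’.

```python
# A tiny effect can still produce p < 0.05 if the sample is large.
# All numbers are invented for illustration.
import numpy as np
from scipy import stats

score_sd = 10
tiny_effect = 0.5     # half a point on the reading score: negligible
n_per_group = 5_000   # a very large study

se = score_sd * np.sqrt(2 / n_per_group)
z = tiny_effect / se
p_value = 2 * stats.norm.sf(z)
print(f"p = {p_value:.4f}")  # about 0.012: below 0.05, yet the effect is tiny
```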
Where can I read more?
I wrote a more technical explanation here if you would like to read more. I’d be grateful for any comments on this document.
This podcast has an excellent overview too.