I’ve heard and read several statisticians use the following phrase or an equivalent, such as Andrew Gelman on his blog or Nassim Taleb on his YouTube channel:

But what does this mean exactly? I did not fully understand it at first. I decided to research it and think I now have a good grasp of the idea, so I am writing this short post to share my understanding, which might still be imperfect. Statistical significance is a tool that is very often misinterpreted, and I think it is important to have an honest discussion about what this tool can and cannot achieve. Here is my explanation of the issue; hopefully you’ll find it succinct and clear:

When you run a hypothesis test, you’re trying to decide whether or not to reject the null hypothesis. Take the Z-test as an example. You first fix the value that the parameter of interest takes under the null hypothesis. You then gather data from a sample of the population and estimate the same parameter from that sample. Assuming the estimate follows a Gaussian distribution, you reject the null hypothesis if the sample estimate and the null value are separated by enough standard errors. The number of standard errors separating the two values is often called the z-score, and the probability, computed under the null hypothesis, of observing a z-score at least as extreme as the one you obtained is called the p-value. If the p-value is smaller than the significance level chosen in advance, the result is deemed statistically significant.
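
To make the steps above concrete, here is a minimal sketch of a one-sample, two-sided Z-test in Python. It assumes the population standard deviation is known, and the function name `z_test`, its parameters `mu0` and `sigma`, and the sample values are all made up for illustration.

```python
import math
from statistics import NormalDist


def z_test(sample, mu0, sigma):
    """One-sample, two-sided Z-test.

    Assumption: the population standard deviation `sigma` is known,
    and the sample mean is (approximately) Gaussian.
    """
    n = len(sample)
    sample_mean = sum(sample) / n
    standard_error = sigma / math.sqrt(n)
    # z-score: distance between the estimate and the null value,
    # measured in standard errors.
    z = (sample_mean - mu0) / standard_error
    # p-value: probability, under the null, of a z-score at least
    # as extreme as the observed one (in either direction).
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data, for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7, 5.3]
z, p_value = z_test(sample, mu0=5.0, sigma=0.5)
alpha = 0.05  # significance level, chosen before looking at the data
print(f"z = {z:.3f}, p = {p_value:.4f}, significant: {p_value < alpha}")
```

Note that the significance level is fixed beforehand and the p-value is then compared against it, not the other way around.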

But here is the catch: if you draw many samples of the same size from the same population, the parameter estimates will vary from sample to sample and follow their own probability distribution. Their z-scores and p-values will also vary and follow their own probability distributions! In fact, **p-values of samples of the same size from the same population can vary a lot**, and the difference between a significant result and a non-significant result is itself not necessarily statistically significant once the spread of the p-value distribution is taken into account.
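
This variability is easy to see in a small simulation. The sketch below repeatedly draws samples of the same size from one Gaussian population and runs the same two-sided Z-test on each. The population parameters (`true_mean=0.3`, `sigma=1.0`), the sample size, and the helper name `simulate_p_values` are arbitrary choices for illustration, not values taken from any of the sources mentioned in this post.

```python
import math
import random
from statistics import NormalDist


def simulate_p_values(true_mean, mu0, sigma, n, n_samples, seed=0):
    """Draw `n_samples` samples of size `n` from N(true_mean, sigma)
    and return the two-sided Z-test p-value of each against `mu0`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    standard_error = sigma / math.sqrt(n)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_samples):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / standard_error
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


# Same population, same sample size, same test: only the sample changes.
p_values = simulate_p_values(true_mean=0.3, mu0=0.0, sigma=1.0,
                             n=30, n_samples=2000)
p_sorted = sorted(p_values)
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

print(f"smallest p-value: {p_sorted[0]:.5f}")
print(f"median p-value:   {p_sorted[len(p_sorted) // 2]:.3f}")
print(f"largest p-value:  {p_sorted[-1]:.3f}")
print(f"share below 0.05: {share_significant:.1%}")
```

With these settings, some samples come out clearly "significant" and others clearly not, even though every one of them was drawn from exactly the same population.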

In his paper on this subject, Nassim Taleb generated the p-value distribution through Monte Carlo simulation. He found that if the “true” p-value of the population is 0.12, 60% of the estimated p-values from the samples could be below the traditional significance threshold of 0.05.

This problem has serious consequences. People very often say “statistical significance does not imply practical significance”, but finding a statistically significant result in your sample does not even imply that it is truly “statistically significant” at the population level.

P-values and statistical significance are tools that many researchers misunderstand, and I think this information needs to be spread. The fact that p-values can vary greatly from sample to sample makes p-hacking much easier than it would otherwise be, and this has terrible consequences for the scientific literature. One solution advanced by several statisticians is to lower the significance threshold substantially, to 0.01 or 0.005. This might be a good start, but will it be sufficient? Hopefully, time will tell.

Post-publication modifications:

Corrected a few imprecisions.