What does it mean for a result to be “statistically significant”?
How can we tell whether two events happen at the same time by chance, or for a reason? A headache cured by an aspirin might have gone away without the aspirin. The fact that a sequence of five coin flips turned up five heads may or may not indicate a biased coin. When are the observations we make – such as that Republicans go to church more than Democrats, and men earn more money than women in similar jobs – due to chance, and when are they truly correlated events, with an underlying reason?

Measuring the likelihood that an event occurs by chance when there is no underlying pattern is the idea behind “statistical significance.” If there is, at most, a 5 percent chance that two events would happen together by coincidence, we may legitimately infer that there is a reason that the events occurred together.

To put it another way, if there is no underlying pattern in the whole population (the null hypothesis), what is the chance that we would see the data that we see in a sample of the population? If it is less than 5 percent, then we suspect that there is something going on with the whole population. Such results are called statistically significant. If the chance of occurring randomly is not small, the possibility that the events occurred together just by luck is too high to dismiss, and we conclude nothing. Small is relative, but many scientific disciplines use 5 percent (.05) as the border between small and not small. The 5 percent line is arbitrary, but has become standard in the field of biomedical research; statistical significance is the golden measuring stick for evaluating data.
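
To make the coin example concrete, here is a minimal Python sketch of the calculation. It assumes a fair coin as the null hypothesis and counts only outcomes with at least as many heads as we observed. The function name prob_at_least_k_heads is ours, introduced for illustration.

```python
from math import comb

def prob_at_least_k_heads(n_flips, k_heads, p_heads=0.5):
    """Chance of k_heads or more in n_flips of a coin with P(heads) = p_heads."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(k_heads, n_flips + 1)
    )

# Null hypothesis: the coin is fair. Observation: five heads in five flips.
p_five_heads = prob_at_least_k_heads(5, 5)
print(p_five_heads)         # prints 0.03125, i.e. 1 chance in 32
print(p_five_heads < 0.05)  # prints True
```

A fair coin lands heads five times in a row about 3 percent of the time, which falls under the 5 percent line. If we also counted five tails as equally surprising, the chance would double to 6.25 percent and no longer clear that line, so the answer depends on which outcomes we decide in advance to count as unusual.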

Why 5 percent and not ten? More on that below.
Suppose we want to test the effectiveness of a medicine to reduce the likelihood of a heart attack. We design a controlled study of two groups of people. Group A takes the medicine, and Group B takes a placebo. Suppose that Group A has a much lower rate of heart attacks than Group B. Is this due to chance, or the medicine?

If the rate of Group A heart attacks is just slightly lower than that of Group B, then we are more likely to believe that the medicine didn’t necessarily cause the effect, since any two groups of people are likely to have small differences due to random fluctuations. Similarly, if there are a small number of people in the study, we believe that chance plays a larger role – maybe the people in the sample just got lucky. The formula for determining statistical significance therefore depends not only on the actual rates of heart attack in these two groups, but also on the number of people in each group. The p-value is the likelihood that we would observe a relationship at least this strong between medicine and heart attack rate, assuming there is no difference in the populations.
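
The dependence on both the rates and the group sizes can be sketched in Python. The article does not say which formula a given study would use; this sketch assumes one common choice, a one-sided Fisher's exact test, which asks how often a random split of the same people would leave Group A with as few heart attacks as observed, or fewer. The function name and all of the study numbers below are hypothetical.

```python
from math import comb

def one_sided_p_value(attacks_a, size_a, attacks_b, size_b):
    """One-sided Fisher's exact test (an assumed choice of formula).

    Returns the chance that Group A would have attacks_a or fewer heart
    attacks if the people who had heart attacks were spread across the
    two groups purely at random.
    """
    total_attacks = attacks_a + attacks_b
    total_people = size_a + size_b
    # Fewest heart attacks Group A could possibly have in a random split.
    lowest = max(0, total_attacks - size_b)
    favorable = sum(
        comb(total_attacks, x) * comb(total_people - total_attacks, size_a - x)
        for x in range(lowest, attacks_a + 1)
    )
    return favorable / comb(total_people, size_a)

# Hypothetical studies with identical rates: 10% (medicine) vs 20% (placebo).
small_study_p = one_sided_p_value(2, 20, 4, 20)      # 20 people per group
large_study_p = one_sided_p_value(20, 200, 40, 200)  # 200 people per group

print(f"small study p = {small_study_p:.3f}")
print(f"large study p = {large_study_p:.4f}")
```

With these made-up numbers, the small study's p-value lands well above .05 and the large study's lands well below it, even though the heart attack rates are the same in both. The difference comes entirely from the number of people in each group.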

If we find a small p-value, then we conclude that it is unlikely we would have randomly chosen these data from the whole population if the medicine had no impact.

Suppose the p-value for the study is .04. This means that, if Group A and Group B have no differences between them, there is a 4 percent (.04 x 100) chance that Group A would have as low a rate (or lower) of heart attack as we observed in the study just by chance. Since p < .05, researchers conclude that such a result is very unlikely under that assumption and that the medicine and lower heart attack rates are correlated. In other words, we first assume that the medicine has no effect on heart attack rates. Given this assumption, what is the chance that we would see these data in Groups A and B? In this case, there’s a 4 percent chance. That chance is so small that we dismiss our assumption.

If p had been .1, then assuming the medicine has no benefit for the whole population, there would be a 10 percent (.1 x 100) chance that by choosing a random sample of people on the medication – Group A – we would see this kind of benefit, even though there really isn’t one in the whole population. Since this is not so unlikely, we would have less confidence that the medicine played a role, and the result would be inconclusive. It does not mean that the drug does not help, but that it has not been proven to help.

The fact that statistical significance is achieved when p < .05 is controversial. Assume that there really is no effect from the medicine on heart attack rates. For some, a 5 percent chance that you might nonetheless see a big difference between two groups in a sample is very high; for others it’s very low. In a fundamental way, the choice of where the cutoff is for “significance” is arbitrary. This also means that when a small p-value is found (and the null hypothesis is rejected), scientists might publish results that do not reflect what is happening with the population as a whole.
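
A small simulation illustrates what the cutoff allows through. This Python sketch invents many studies in which the medicine truly does nothing (both groups share the same heart attack rate) and counts how often a study still comes out with p < .05. The group size, true rate, number of studies, and random seed are all arbitrary assumptions, and the p-value again assumes a one-sided Fisher's exact test rather than any particular study's method.

```python
import random
from math import comb

def one_sided_p_value(attacks_a, size_a, attacks_b, size_b):
    """One-sided Fisher's exact test: chance Group A has this few attacks or fewer."""
    total_attacks = attacks_a + attacks_b
    total_people = size_a + size_b
    lowest = max(0, total_attacks - size_b)
    favorable = sum(
        comb(total_attacks, x) * comb(total_people - total_attacks, size_a - x)
        for x in range(lowest, attacks_a + 1)
    )
    return favorable / comb(total_people, size_a)

def simulate_null_studies(n_studies=2000, group_size=50, true_rate=0.2,
                          alpha=0.05, seed=0):
    """Share of no-effect studies that are nonetheless declared significant."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    false_alarms = 0
    for _ in range(n_studies):
        # Both groups draw from the same rate: the medicine does nothing.
        attacks_a = sum(rng.random() < true_rate for _ in range(group_size))
        attacks_b = sum(rng.random() < true_rate for _ in range(group_size))
        if one_sided_p_value(attacks_a, group_size, attacks_b, group_size) < alpha:
            false_alarms += 1
    return false_alarms / n_studies

false_alarm_rate = simulate_null_studies()
print(f"No-effect studies declared significant: {false_alarm_rate:.1%}")
```

The share should come out at or somewhat below 5 percent. It tends to fall a little short because heart attack counts are whole numbers, which makes this exact test slightly cautious. Whether a false alarm rate of that size is acceptable is exactly the judgment call that makes the cutoff controversial.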

For some, this makes biomedical research untrustworthy. For others, the fact that a result with p = .1 is not considered reliable means that important correlations are not being reported to the public, with possibly serious consequences. There are cases when scientists hold research to higher (or lower) standards of demonstrating statistical significance, and certainly stronger or weaker correlations are remarked upon in the literature. However, no matter how you conduct the research, there is always a small possibility that you observed an association in the sample population when there really isn’t one in the larger population that the study subjects were chosen to represent. For the sake of having a standard of some kind, scientists have agreed on p = .05.

A result that is statistically significant has more weight in the scientific community than one that is not. There is nonetheless a (small) possibility that the observation in the sample of participants does not reflect what would happen in the whole population, just by chance. This is one reason scientists keep their eyes open for other data that might discredit the earlier reported results. Similarly, even if a conjectured correlation has not been demonstrated to be statistically significant in a particular data set, the association may nonetheless exist in the population as a whole. In both cases, additional studies can shed light on whether the results are spurious or descriptive.

Independent follow-up studies are crucial to science for exactly this reason. If a medicine is correlated with lowered heart attack rate in a sample of the population, we may have unluckily chosen a sample that does not reflect how the medicine would behave more generally. But if we take a second sample and we still see an impact, we are much less likely to see a spurious correlation twice in a row. Independent data sets and analysis greatly reduce the possibility that our sample does not represent the population it was chosen to represent.
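
The arithmetic behind "twice in a row" can be sketched in a few lines of Python. The sketch assumes that the medicine truly has no effect, that the studies are fully independent, and that each has at most a 5 percent chance of a spurious significant result in the same direction. The function name is ours.

```python
def chance_all_spurious(alpha, n_studies):
    """Chance that n_studies independent no-effect studies all look significant."""
    return alpha ** n_studies

one_study = chance_all_spurious(0.05, 1)
two_studies = chance_all_spurious(0.05, 2)

print(f"{one_study:.4f}")    # prints 0.0500
print(f"{two_studies:.4f}")  # prints 0.0025
```

Under these assumptions, a fluke that would fool us once in 20 single studies would fool us twice in a row only about once in 400 pairs of studies. The calculation leans heavily on independence: two studies that share participants, data, or the same flawed design do not multiply out this way.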