STATISTICAL SIGNIFICANCE
What does it mean for a result to be “statistically significant”?
How can we tell whether two events happen at the same time by chance, or for a reason? A headache cured by an aspirin might have gone away without the aspirin. The fact that a sequence of five coin flips turned up five heads may or may not indicate a biased coin. When are the observations we make – such as that Republicans go to church more than Democrats, and men earn more money than women in similar jobs – due to chance, and when are they truly correlated events, with an underlying reason?
Measuring the likelihood that an event occurs by chance when there is no underlying pattern is the idea behind “statistical significance.” If there is, at most, a 5 percent chance that two events would happen together by coincidence, we may legitimately infer that there is a reason that the events occurred together. To put it another way, if there is no underlying pattern in the whole population, what is the chance that we would see the data that we see in a sample of the population? If it is less than 5 percent, then we suspect that there is something going on with the whole population. Such results are called statistically significant. If the chance of the events occurring together randomly is not small, the possibility that they occurred together just by luck is too high to dismiss, and we conclude nothing. Small is relative, but many scientific disciplines use 5 percent (.05) as the border between small and not small. The 5 percent line is arbitrary, but has become standard in the field of biomedical research; statistical significance is the golden measuring stick for evaluating data. Why 5 percent and not ten? More on that below.
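As a small worked example of the 5 percent line, the coin flips mentioned earlier can be computed directly. The sketch below is in Python. The function name prob_at_least_k_heads is ours, not a standard one, and the calculation assumes a fair coin and asks only about runs of heads (a one-sided question).

```python
from math import comb


def prob_at_least_k_heads(n_flips, k_heads, p_heads=0.5):
    """Chance of seeing k_heads or more in n_flips of a coin that lands
    heads with probability p_heads (0.5 means a fair coin)."""
    return sum(
        comb(n_flips, i) * p_heads**i * (1 - p_heads) ** (n_flips - i)
        for i in range(k_heads, n_flips + 1)
    )


# Hypothetical illustration: all heads in five flips versus four flips.
five_of_five = prob_at_least_k_heads(5, 5)
four_of_four = prob_at_least_k_heads(4, 4)

print(f"Five heads in five flips: {five_of_five:.4f}")
print(f"Four heads in four flips: {four_of_four:.4f}")
print("Below the 5 percent line?", five_of_five < 0.05, four_of_four < 0.05)
```

Under these assumptions, five heads in five flips falls below the 5 percent line, while four heads in four flips does not. If we also counted five tails as equally surprising, the chance would double to about 6 percent, so the answer depends on the question being asked.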
Statistical significance is extremely important. Suppose we want to test the effectiveness of a medicine to reduce the likelihood of a heart attack. We design a controlled study of two groups of people. Group A takes the medicine, and Group B takes a placebo. Suppose that Group A has a much lower rate of heart attacks than Group B. Is this due to chance, or the medicine?
If the rate of Group A heart attacks is just slightly lower than that of Group B, then we are more likely to believe that the medicine didn’t cause the effect, since any two groups of people are likely to have small differences due to random fluctuations. Similarly, if there are a small number of people in the study, we believe that chance plays a larger role. The formula for determining statistical significance therefore depends not only on the actual rates of heart attack in these two groups, but also on the number of people in each group. The p-value is the probability that we would observe a relationship between medicine and heart attack rate at least as strong as the one seen in the study, assuming there is no difference in the populations. (The p-value takes into account both the difference between the two groups and the number of people in each group.)
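To illustrate how both the size of the difference and the number of people enter the calculation, here is a minimal sketch in Python. It uses a pooled two-proportion z-test with a normal approximation, which is one common formula and not necessarily the one a given study would use. The function name and all of the heart attack counts are invented for illustration.

```python
from math import erfc, sqrt


def one_sided_p_value(events_a, n_a, events_b, n_b):
    """Approximate chance that Group A's rate would be this far below
    Group B's (or further) if the two groups really had the same rate.

    Assumption: pooled two-proportion z-test, normal approximation.
    The approximation is rough when event counts are very small.
    """
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (events_a / n_a - events_b / n_b) / std_err
    # Standard normal probability of a value at or below z.
    return 0.5 * erfc(-z / sqrt(2))


# Hypothetical studies (Group A takes the medicine, Group B the placebo).
small_gap_small_study = one_sided_p_value(8, 100, 10, 100)          # 8% vs 10%
small_gap_large_study = one_sided_p_value(800, 10_000, 1_000, 10_000)  # 8% vs 10%
large_gap_small_study = one_sided_p_value(2, 100, 10, 100)          # 2% vs 10%

print(f"8% vs 10%, 100 people per group:    p = {small_gap_small_study:.3f}")
print(f"8% vs 10%, 10,000 people per group: p = {small_gap_large_study:.2g}")
print(f"2% vs 10%, 100 people per group:    p = {large_gap_small_study:.3f}")
```

In this sketch the same 8 percent versus 10 percent gap is unremarkable with 100 people per group but falls far below the .05 line with 10,000 per group, and a larger gap can fall below .05 even in the small study.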
Suppose the p-value for the study is .04. This means that, if Group A and Group B have no differences between them, there is a 4 percent (.04 x 100) chance that Group A would have as low a rate (or lower) of heart attack as we observed in the study just by chance. Since p < .05, researchers conclude that such a result is very unlikely to arise by chance alone and that the medicine and the lower rate of heart attacks are correlated. In other words, we first assume that the medicine has no effect on heart attack rates. Given this assumption, what is the chance that we would see these data in Groups A and B? In this case, there’s a 4 percent chance. It’s so small that we dismiss our assumption.
If p had been .1, then assuming the medicine has no benefit for the whole population, there would be a 10 percent (.1 x 100) chance that by choosing a random sample of people on the medication – Group A – we would see this kind of benefit, even though there really isn’t one in the whole population. Since this is not so unlikely, we would have less confidence that the medicine played a role, and the result would be inconclusive. It does not mean that the drug does not help, but that it has not been proven to help.
The choice of p < .05 as the threshold for statistical significance is controversial. Assume that there really is no effect from the medicine on heart attack rates. For some, a 5 percent chance that you might nonetheless see a big difference between two groups is very high; for others it’s very low. Put another way, if twenty studies were run on a medicine that has no effect, we would expect about one of them to reach p < .05 by chance alone. For some, this makes biomedical research untrustworthy. For others, the fact that a result with p = .1 is not considered reliable means that important correlations are not being reported to the public, with possibly serious consequences. There are cases when scientists hold research to higher (or lower) standards of demonstrating statistical significance, and certainly stronger or weaker correlations are remarked upon in the literature. However, no matter how you conduct the research, there is always a small possibility that you observed an association when one isn’t really there. For the sake of having a standard of some kind, scientists have agreed on p = .05.
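The 5 percent figure can be checked by simulation: run many imaginary studies in which the medicine truly does nothing, and count how often the result still crosses the .05 line. The Python sketch below does this under assumptions of our own choosing (200 people per group, a 10 percent heart attack rate in both groups, and the same pooled z-test approximation a simple analysis might use). All names and numbers are hypothetical.

```python
import random
from math import erfc, sqrt


def one_sided_p_value(events_a, n_a, events_b, n_b):
    """Pooled two-proportion z-test (normal approximation), one-sided:
    chance of Group A's rate being this far below Group B's by luck."""
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so nothing to call significant
    z = (events_a / n_a - events_b / n_b) / std_err
    return 0.5 * erfc(-z / sqrt(2))


def simulate_false_positive_rate(n_studies=2000, group_size=200,
                                 true_rate=0.10, alpha=0.05, seed=0):
    """Fraction of no-effect studies that still come out 'significant'."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_studies):
        # Both groups share the same true rate: the medicine does nothing.
        attacks_a = sum(rng.random() < true_rate for _ in range(group_size))
        attacks_b = sum(rng.random() < true_rate for _ in range(group_size))
        if one_sided_p_value(attacks_a, group_size,
                             attacks_b, group_size) < alpha:
            significant += 1
    return significant / n_studies


false_positive_rate = simulate_false_positive_rate()
print(f"Share of no-effect studies with p < .05: {false_positive_rate:.3f}")
```

With these assumed settings the share should land in the neighborhood of 5 percent, though the exact figure shifts with the random seed and with the fact that heart attack counts come in whole numbers. Raising or lowering alpha in the sketch shows how a stricter or looser standard changes how often chance alone produces a "significant" result.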
A result that is statistically significant has more weight in the scientific community than one that is not. There is nonetheless a (small) possibility that the result is just due to chance, which is why scientists keep their eyes open for other studies that might discredit the first. Similarly, even if a conjectured correlation has not been demonstrated to be statistically significant, there may still be a good chance that the association really exists, which is why more tests are often called for.