Climate Change, Statistical Significance, and Science
A recent plea for scientists to “stop playing dumb on climate change” shows why The New York Times Op-Ed page needs a statistician.
Recently, The New York Times published an opinion piece, “Playing Dumb on Climate Change,” by Naomi Oreskes, a professor of the history of science at Harvard University, which argues that in the case of climate change, scientists are too conservative in their scientific standards. Scientists adhere to standards that call for 95 percent confidence levels, meaning that if all climate behavior were random fluctuation, there would be a less than 5 percent chance that temperatures would be as warm as or warmer than they are. In Oreskes’ opinion, scientists ought to use a standard less stringent than 95 percent because there is a plausible causal mechanism for climate change and because the risk from inaction is so great. Unfortunately, to make her argument, the author confuses several different aspects of confidence, evidence, belief, and decision-making.
The purpose here is to point out the confusion and clarify the statistical issues. This article is not about climate change; it’s about statistics. Oreskes’ mistaken interpretation of these statistical ideas does not imply that climate change is in question. The evidence for climate change consists of mechanistic as well as statistical arguments, and has little to do with the topic under discussion here: a misinterpretation of what is called the p-value, a statistic that tells us the probability of seeing temperatures as high as those observed, or higher, if they were occurring just by chance. The American Statistical Association already has a statement on climate change, and nothing in this article is intended to contradict the ASA’s statement or to deny the reality of anthropogenic climate change. This article is only about the proper use and interpretation of statistics.
The piece begins by noting that “Science is conservative, and new claims of knowledge are greeted with high degrees of skepticism. When Copernicus said the Earth orbited the sun, when Wegener said the continents drifted, and when Darwin said species evolved by natural selection, the burden of proof was on them to show that it was so. In the 18th and 19th centuries, this conservatism generally took the form of a demand for a large amount of evidence; in the 20th century, it took on the form of a demand for statistical significance.” So far, so good: Many scientists and statisticians demand statistical significance at 5 percent. But it is important to note that “significance” and “confidence,” as used below, are technical terms in statistics and should not be interpreted according to their usual colloquial meaning.
“Typically, scientists apply a 95 percent confidence limit, meaning that they will accept a causal claim only if they can show that the odds of the relationship’s occurring by chance are no more than one in 20,” writes Oreskes. In light of the full article, this suggests she thinks that “odds of the relationship’s occurring by chance are no more than one in 20” means, in technical notation, that:
Pr(the relationship occurred by chance) / Pr(the relationship did not occur by chance) ≤ 1/19
Here a chance of 1/20 is the same as odds of 1/19 (the odds of an event are the chance of the event divided by the chance that the event does not occur). But this interpretation of 95 percent confidence is not correct. To understand why, suppose data show that the last decade was warmer than previous decades. Using statistical theory and climatological data and models, we can determine how much year-to-year variation there is in temperature and find the upper 5 percent of that variation; that is, we can find a cut-off temperature C such that about 5 percent of all decades are warmer than C and about 95 percent are cooler. We consider only decades in recent history, recent enough that the climate was roughly what it would be today without climate change. By choosing the critical temperature C this way, there is only about one decade warmer than C for every 19 decades cooler than C, or about one in 20. To put it in mathematical language, let “Pr” denote “probability.” Then, using this critical temperature, the odds that a decade is warmer than the temperature C are:
Pr(decade is warmer than C) / Pr(decade is cooler than C) ≈ 1/19
The probabilities are calculated under climatological and statistical models for recent climate assuming there is no climate change. Finally, if the most recent decade’s temperature is C or more, we call it significant at the 5 percent level, because the chance that the temperature would be that high assuming only random fluctuation is at most 5 percent.
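The cut-off calculation can be sketched in a few lines of Python. This is only an illustration under a made-up assumption: that decade-average temperatures without climate change follow a normal distribution with a mean of 14.0 degrees and a standard deviation of 0.2 degrees. Those numbers, and the names null_model and chance_to_odds, are hypothetical and do not come from any climate model.

```python
from fractions import Fraction
from statistics import NormalDist


def chance_to_odds(p):
    """Odds of an event: chance it occurs divided by chance it does not."""
    return p / (1 - p)


# Hypothetical "no climate change" model for decade-average temperature.
null_model = NormalDist(mu=14.0, sigma=0.2)

# Critical temperature C: about 95 percent of decades are cooler than C.
C = null_model.inv_cdf(0.95)

# Chance and odds that a decade is warmer than C under the null model.
p_warmer = 1 - null_model.cdf(C)
odds_warmer = chance_to_odds(p_warmer)

print("critical temperature C:", round(C, 3))
print("chance warmer than C:", round(p_warmer, 4))
print("odds warmer than C:", round(odds_warmer, 4))
print("1/20 chance as odds:", chance_to_odds(Fraction(1, 20)))
```

Both probabilities here are computed assuming only random fluctuation, which is exactly the condition under which significance is defined.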
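Importantly, significance at the 5 percent level does not mean:
Pr(decade is warmer than C by chance) / Pr(decade is warmer than C because of climate change) ≤ 1/19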
It’s important to note that calculating how likely the data would be to occur, assuming they are random, does not determine the reasons behind the data that we see, even though it may contribute evidence in the bigger picture for reasons behind climate change.
Also, when Oreskes says scientists treat the “95 percent confidence limit” as a “causal claim,” she is wrong. The confidence level (and the confidence interval) speaks to the probability that we would see temperatures this warm or warmer if they were simply random fluctuations. No cause can be inferred from temperature data alone, however the chips fall. It may be unlikely for the data to occur if they were all random fluctuations, but the reasons that the data did occur cannot be deduced from this simple observation.
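To make concrete what the calculation does and does not use, here is a minimal Python sketch of a p-value for a warm decade. It assumes, purely for illustration, the same hypothetical normal model of temperatures without climate change (mean 14.0 degrees, standard deviation 0.2 degrees); the function name p_value_warm and the observed values are invented.

```python
from statistics import NormalDist


def p_value_warm(observed, mu=14.0, sigma=0.2):
    """Chance of a decade at least as warm as `observed`,
    assuming only random fluctuation (the hypothetical null model)."""
    return 1 - NormalDist(mu, sigma).cdf(observed)


# Hypothetical decade averages, in degrees.
for temperature in (14.0, 14.3, 14.5):
    print(temperature, round(p_value_warm(temperature), 4))
```

Nothing about carbon dioxide, or any other possible cause, enters the function. Its only inputs are the observed temperature and the model of random fluctuation, which is why its output cannot by itself be a statement about causes.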
To continue the illustration, let’s examine other decades. Imagine we had data from 1000 BC to 1 BC. That’s 100 decades, none of which had global warming caused by humans; nevertheless, we know that, just by chance, about 5 percent will have had temperatures above the critical threshold C. Those decades would be labeled significant by the 95 percent rule even though we would make no causal claim and even though we know they occurred by chance. (Often when we examine many things, such as the 100 decades here, we use a stricter standard than 95 percent for declaring significance. We’re ignoring the stricter standard here.)
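A small simulation illustrates the point. It is a sketch under the same hypothetical assumption as before (decade temperatures without climate change drawn from a normal distribution with mean 14.0 and standard deviation 0.2), and the function name and seeds are arbitrary choices.

```python
import random
from statistics import NormalDist


def count_significant_decades(n_decades, seed, mu=14.0, sigma=0.2, level=0.05):
    """Count simulated chance-only decades at or above the cut-off C."""
    cutoff = NormalDist(mu, sigma).inv_cdf(1 - level)
    rng = random.Random(seed)
    return sum(rng.gauss(mu, sigma) >= cutoff for _ in range(n_decades))


# 100 decades with no human-caused warming: a handful are still "significant".
print(count_significant_decades(100, seed=1))

# Over many decades the share labeled significant settles near 5 percent.
n = 100_000
print(count_significant_decades(n, seed=2) / n)
```

Every decade flagged here was produced by chance alone, so the label “significant” cannot be read as a causal claim.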
For another example, consider the question of how many genes are associated with breast cancer. We have samples of people with and without breast cancer and, for each variant of each gene, we check whether its prevalence is different, or about the same, in the two groups of people. (We’re oversimplifying again, for the purpose of illustration.) Let’s use 10,000 as the number of genes for our calculation even though there are many more. Just by chance we would find that about 5 percent of the 10,000 genes, or about 500, would be labeled significant because they have variants that randomly occur more frequently among the breast cancer patients in our sample. And if there were a small number of genes, say three, that really did affect the probability of developing breast cancer, they would likely be labeled significant too, so there would be a total of about 503 genes labeled significant. Yet about 500 of them would be spurious and three would not, but based on the statistics, they would all be labeled significant together; we wouldn’t know how many or which ones are spurious and how many or which ones are not.
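The arithmetic of this example can be written out as a short Python sketch. It uses the rounded figures from the text and assumes, as a simplification, that all three real genes are labeled significant; the variable names are ours.

```python
from fractions import Fraction

n_genes = 10_000          # genes tested (rounded, as in the text)
n_real = 3                # genes that truly affect breast cancer risk
level = Fraction(1, 20)   # 5 percent significance level

# Genes labeled significant purely by chance (about 500).
expected_spurious = level * n_genes

# Assumption: every real gene is also labeled significant.
total_significant = expected_spurious + n_real

# Share of significant genes that are real, and the matching odds.
share_real = n_real / total_significant
odds_real = n_real / expected_spurious

print("labeled significant:", total_significant)
print("share that are real:", share_real, "=", round(float(share_real), 4))
print("odds a significant gene is real:", odds_real)
```

Under these figures, fewer than 1 percent of the genes labeled significant are real, even though each one passed the 95 percent standard.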
So the words “odds of occurring by chance,” ought to refer to the fact that only about 500 of the 10,000 genes not involved in breast cancer are found to be significant or to the fact that about 5 percent of all decades are significantly warmer than usual. But is that what Oreskes meant? Her piece continues, “if there’s more than even a scant 5 percent possibility that an event occurred by chance, scientists will reject the causal claim. It’s like not gambling in Las Vegas even though you had a nearly 95 percent chance of winning.” “Winning” sounds like identifying a gene that’s implicated in breast cancer or like offering a convincing demonstration that humans are changing the climate.
In this hypothetical case of breast cancer genes, a gene identified as significant still has a very small chance of being causally related to breast cancer: only 3 of the 503 identified genes are. The odds that a gene identified as significant is actually related to cancer are 3/500. Oreskes’ phrasing suggests she believes scientists are upholding a rule that declares significance only when the chance of such a gene being implicated in cancer is very high, over 95 percent. In other words, she believes that statistical significance occurs when the gene in question is unlikely to be there for reasons of chance alone: Pr(this gene is significant by chance) < .05. In terms of odds, it sounds like Oreskes thinks the 5 percent means:
Pr(this gene is significant by chance) / Pr(this gene is significant not by chance) < 1/19
Or, to bring it back to climate change:
Pr(this decade is warm by chance) / Pr(this decade is warm because of climate change) < 1/19
As we explained above, that’s not what 5 percent significance and 95 percent confidence mean. It seems the BBC made a similar mistake a few years ago, when it attempted to describe research results declaring global warming statistically significant, stating, “scientists use a minimum threshold of 95% to assess whether a trend is likely to be down to an underlying cause, rather than emerging by chance.” Well, no: the minimum threshold of 95 percent reflects a judgment of how likely the data are to occur if we assume they emerged by chance (less than 5 percent likely, if the 95 percent threshold is met).
In the breast cancer example the contribution of statistics was to reduce from about 10,000 to about 500 the number of genes to pursue in future investigations. That’s a real win, a real scientific advance, and one we were very sure of being able to accomplish.
The piece continues, “there have been enormous arguments among statisticians about what a 95 percent confidence level really means.” We disagree; we think almost every statistician would accept our explanation of what a 95 percent confidence level means. But we do agree that many working scientists, the BBC, and Oreskes, a historian, get it wrong.
“But the 95 percent level has no actual basis in nature. It is a convention, a value judgment. The value it reflects is one that says that the worst mistake a scientist can make is to think an effect is real when it is not. This is the familiar “Type 1 error.” You can think of it as being gullible, fooling yourself, or having undue faith in your own ideas. To avoid it, scientists place the burden of proof on the person making an affirmative claim. But this means that science is prone to “Type 2 errors”: being too conservative and missing causes and effects that are really there.”
A Type I error would be labeling “significant” a gene that is truly unrelated to breast cancer. A Type II error would be failing to label as significant a gene that is truly related to breast cancer. As Oreskes notes, we can decrease the number of Type I errors by using a stricter standard for labeling. But that would increase the number of Type II errors. Or we can decrease the number of Type II errors by adopting a looser standard, but that would increase the Type I errors. There is no one right way to balance them, as the piece acknowledges next.
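The trade-off can be seen in a small Python sketch. The setup is hypothetical: a test statistic that follows a standard normal distribution when a gene is unrelated to cancer, and a normal distribution shifted to a mean of 2 when it is related. The shift of 2 and the three cut-offs are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

unrelated = NormalDist(0.0, 1.0)   # hypothetical: gene has no effect
related = NormalDist(2.0, 1.0)     # hypothetical: gene has a real effect


def error_rates(cutoff):
    """Label a gene significant when its statistic is at or above `cutoff`."""
    type_1 = 1 - unrelated.cdf(cutoff)   # unrelated gene labeled significant
    type_2 = related.cdf(cutoff)         # related gene not labeled
    return type_1, type_2


# Looser to stricter standards (roughly 16, 5 and 1 percent levels).
cutoffs = [1.0, 1.645, 2.326]
rates = [error_rates(c) for c in cutoffs]
for cutoff, (type_1, type_2) in zip(cutoffs, rates):
    print(cutoff, round(type_1, 3), round(type_2, 3))
```

As the cut-off rises, the Type I rate falls and the Type II rate rises. Which point on that curve to pick is a judgment the statistics themselves do not make.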
“Is a Type 1 error worse than a Type 2? It depends on your point of view, and on the risks inherent in getting the answer wrong.” But there are no risks associated with either Type I or Type II errors. Risks arise only when we contemplate actions. Type I and Type II errors do not prescribe actions; they describe whether data are consistent with chance mechanisms. Type I errors occur when data generated by chance appear to be inconsistent with chance. Type II errors occur when data not generated by chance (at least not by a null or uninteresting chance mechanism) appear to be consistent with chance.
“The fear of the Type 1 error asks us to play dumb; in effect, to start from scratch and act as if we know nothing. That makes sense when we really don’t know what’s going on, as in the early stages of a scientific investigation. It also makes sense in a court of law, where we presume innocence to protect ourselves from government tyranny and overzealous prosecutors —but there are no doubt prosecutors who would argue for a lower standard to protect society from crime. When applied to evaluating environmental hazards, the fear of gullibility can lead us to understate threats. It places the burden of proof on the victim rather than, for example, on the manufacturer of a harmful product. The consequence is that we may fail to protect people who are really getting hurt.”
Statisticians and scientists know that declarations of significance are different from actions. That’s why we have decision theory. We minimize expected loss and maximize expected utility. We don’t develop therapies that target all the genes found to be significant. We investigate further and we consider probabilities, risks, and utilities in addition to declarations of significance.
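A toy decision-theory calculation shows how an action threshold differs from a significance threshold. Everything in this Python sketch is hypothetical: the two actions, the loss figures (in arbitrary units), and the belief probabilities are invented for illustration only.

```python
def expected_loss(action, p_real, losses):
    """Average loss of an action, weighting each state of the world
    by our belief that the effect is real (p_real) or not."""
    return (p_real * losses[action]["real"]
            + (1 - p_real) * losses[action]["not_real"])


def best_action(p_real, losses):
    """Choose the action with the smallest expected loss."""
    return min(losses, key=lambda action: expected_loss(action, p_real, losses))


# Hypothetical losses: acting always costs 10; waiting costs 100
# if the effect is real and nothing if it is not.
losses = {
    "act": {"real": 10, "not_real": 10},
    "wait": {"real": 100, "not_real": 0},
}

for belief in (0.05, 0.2, 0.9):
    print(belief, best_action(belief, losses))
```

With these made-up losses, acting becomes the better choice once the belief that the effect is real exceeds 10 percent. The break-even point comes from the losses and the beliefs, and the 95 percent confidence level appears nowhere in the calculation.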
“And what if we aren’t dumb? What if we have evidence to support a cause-and-effect relationship? Let’s say you know how a particular chemical is harmful; for example, that it has been shown to interfere with cell function in laboratory mice. Then it might be reasonable to accept a lower statistical threshold when examining effects in people …”
A lower threshold for what: confidence, beliefs, or action? Oreskes talks about 95 percent confidence but she also seems to be calling for us to accept the reality of climate change and to do something about it. To statisticians, confidence, beliefs, and thresholds for actions are different things. Confidence is about the probability that chance mechanisms can produce data similar to, or even more extreme than, the data we’ve seen. Beliefs have to do with assessing which is really responsible for the warm decade: chance or climate change. Action thresholds depend on our beliefs but also on costs, risks, and benefits. By not distinguishing between confidence, beliefs, and actions in her call for a lower threshold, Oreskes helps perpetuate the confusion surrounding these concepts.
Oreskes points out that “In the case of climate change, we are not dumb at all. We know that carbon dioxide is a greenhouse gas, we know that its concentration in the atmosphere has increased by about 40 percent since the industrial revolution, and we know the mechanism by which it warms the planet.” Knowing these facts about carbon dioxide is evidence about possible non-random causes of the data. It may affect our beliefs and our actions, but it does not affect statistical confidence, which is based on probabilities calculated under the assumption that fluctuations are random. Oreskes is again failing to distinguish between confidence and beliefs.
After urging scientists to adopt a threshold less stringent than 95 percent in the case of climate change, the piece continues:
“WHY don’t scientists pick the standard that is appropriate to the case at hand, instead of adhering to an absolutist one? The answer can be found in a surprising place: the history of science in relation to religion. The 95 percent confidence limit reflects a long tradition in the history of science that valorizes skepticism as an antidote to religious faith. Even as scientists consciously rejected religion as a basis of natural knowledge, they held on to certain cultural presumptions about what kind of person had access to reliable knowledge. One of these presumptions involved the value of ascetic practices. Nowadays scientists do not live monastic lives, but they do practice a form of self-denial, denying themselves the right to believe anything that has not passed very high intellectual hurdles.”
Yes, most scientists are skeptics. We do not accept claims lightly, we expect proof, and we try to understand our subject before we speak publicly and admonish others.