Dim data: Why the statistics behind brain scan stories don’t always add up
Rebecca Goldin, Ph.D., Cindy Merrick, January 18, 2013
2012 was a banner year for functional magnetic resonance imaging (fMRI), which has emerged as one of the most exciting new technologies for research in the areas of cognition, emotion, and personality. But as with any “hot” science, the question remains as to whether the media can respond with light, refracting the critical questions the neuroscience community asks of its grandest claims rather than reporting, glibly, on whatever makes for a good story.
Alarm bells have already sounded: writing in Forbes, David DiSalvo warned about the “brain scan circus,” while in the New York Times, Alissa Quart chided reductive “brain porn” for blinding us to other legitimate forms of explanation for human behavior.
But elsewhere, the media’s fMRI narrative is one of indubitable benefit – from understanding the effects of marketing to how short-term memory, logical reasoning and verbal acuity predict cognitive ability. Brain imaging can, according to the media, tell us how humans associate meaning with pictures and how we make decisions or what we are thinking about unconsciously. The most exciting – and ethically problematic – proposals involve using fMRI research in the courtroom.
Missing from all this wonder are the statistics powering these fMRI analyses and how sure we can be that the correlations indicated by fMRI data are meaningful identifiers of causality.
The perils of confusing causation and correlation in fMRI studies are a topic generating a lot of interest. But here we investigate the correlations themselves. If an fMRI study finds that certain brain patterns and certain behaviors are highly correlated, can we trust the result?
Traditional methods of behavioral research rely heavily on self-reported data and clinical interviews. They include tests such as the Minnesota Multiphasic Personality Inventory (MMPI), clinical interviews using the Diagnostic and Statistical Manual of Mental Disorders (DSM), and various personality measurement tests. These methods are used to type personalities and assess states like emotional stability and mental health. Subject to many of the errors of any self-reported survey, their results lead, at best, to inferences about the human condition that are, frustratingly, removed from causal mental processes. Researchers and clinicians thus remain at a remove from understanding what motivates our behavior, how we make decisions, and how to improve treatment of mental disorders.
Unlike these traditional methods, fMRI scans seem to offer objective “hard data.” Scientists can now watch live brain activity while a specific trait or emotion, like jealousy or fear, is induced by some stimulus, such as a question about sexual infidelity, or a picture of a spider. Complicated data acquired through these tests is analyzed to try to determine just what areas of the brain are responsible for the trait or emotion. While the push for big data has led scientists and media sources alike to embrace fMRI as a way of avoiding the perilous terrain of subjective self-reporting, many of the findings are undercut by statistical problems.
In a somewhat antagonistic paper in 2009, MIT graduate student Edward Vul and colleagues blasted more than 20 recent fMRI-based studies for committing a basic statistical error, and named names. While this exposé created a firestorm of protests and counter-accusations (covered by Nature, Scientific American, and The Daily Beast, among others), the view that there are fundamental flaws in the way much fMRI data analysis is being done has gained considerable traction within the field of neuroscience. In another literature search of top neuroscience journals, Nikolaus Kriegeskorte et al. found that a jaw-dropping 42 percent of published research findings contained the same statistical error observed by Vul.
The Correlations Are Too Good to Be True
Vul et al. demonstrated that something is amiss with a simple calculation. Most studies use a specific number – the correlation coefficient – to demonstrate the strength of a relationship between two data sets, and how well they track together. In the case of fMRI, the correlation coefficients may indicate relationships between some trait or characteristic measured by a traditional (self-reported) test or diagnosis, and brain region activity evinced during the fMRI test. Basically, says Vul, many of these numbers are impossibly high. The correlation coefficient is a number between -1 and 1; values close to zero indicate little or no relationship, and numbers close to 1 or -1 indicate strong (positive or negative) relationships. When interpreting a correlation coefficient, one must take into account not just the data values, but the reliability of the test or process by which the data is gathered. Every test method involves some amount of error, so that even for two perfectly correlated underlying quantities, we expect the calculation to yield a less-than-perfect (i.e. never exactly 1 or -1) correlation coefficient. Vul suggests that for the testing methods most commonly used, like the MMPI or clinical diagnoses, as well as for testing with fMRI technology, we should reasonably expect to see positive correlations no closer to 1 than about .74; but in their literature search, the authors found that more than 25 percent of the studied relationships claimed correlations above this number. An investigation ensued as to why such high correlations might be found. Lo and behold, a simple reasoning error can lead to selecting data based on its ability to give such a result.
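One way to see where a ceiling like .74 could come from is the classical attenuation bound, under which the highest correlation one can expect to observe between two noisy measures is the square root of the product of their reliabilities. The sketch below is illustrative only: the reliability values (0.8 for a personality measure, 0.7 for fMRI) are assumed figures chosen to be consistent with the ceiling quoted above, and the function names are hypothetical.

```python
import math


def max_observable_correlation(reliability_x: float, reliability_y: float) -> float:
    """Attenuation bound: sqrt of the product of the two measures' reliabilities."""
    for value in (reliability_x, reliability_y):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie in [0, 1]")
    return math.sqrt(reliability_x * reliability_y)


def exceeds_ceiling(reported_r: float, reliability_x: float, reliability_y: float) -> bool:
    """True when a reported correlation is higher than the bound allows."""
    return abs(reported_r) > max_observable_correlation(reliability_x, reliability_y)


# Assumed reliabilities (not taken from any specific study in this article).
PERSONALITY_RELIABILITY = 0.8
FMRI_RELIABILITY = 0.7

ceiling = max_observable_correlation(PERSONALITY_RELIABILITY, FMRI_RELIABILITY)
print(f"ceiling: {ceiling:.3f}")  # sqrt(0.56), about 0.748
print(exceeds_ceiling(0.9, PERSONALITY_RELIABILITY, FMRI_RELIABILITY))  # True
```

Under these assumptions, a reported correlation of 0.9 fails the check, which is the kind of result that prompted a closer look at how the numbers were produced.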
Double Dipping or the Nonindependence Error
One significant culprit for excessively high correlations, say Vul and Kriegeskorte, is something known as the “nonindependence error,” also called, more colloquially, “double dipping” – or the “Texas Sharpshooter Fallacy.” This form of selection bias is capable of producing correlations higher than would otherwise be reported. Here is how double dipping can happen: A poor marksman uses his rifle to fire a barrage of bullets into the side of a barn. He approaches the barn to scan the holes he made, paints a target around the best looking cluster, and announces, “Look what a great shot I am!” With the painted bullseye, he sure does look like a great shot – but he may not actually have been aiming for the target he drew.
The Texas sharpshooter fallacy is a simple example of double dipping with data. The data set (the holes in the side of the barn) contained what looked like a target only after the fact, and the marksman insisted that the tightest cluster exemplified his marksmanship. The claim would be much more convincing if he performed his experiment again with his target already drawn, and obtained a second cluster of holes in or very close to the new target.
It may be surprising that scientists would base research results on such a simple and obvious statistical error; and yet, it turns out that variations of this mistake are quite prevalent with fMRI.
This error happens when researchers aren’t careful enough about how they choose certain regions of the brain for analysis. fMRI technology is based on tracking changes in blood oxygen level over time, as these changes are believed to indicate neurons firing – the physical basis for brain activity (see our previous article about fMRI). When a test subject is in the fMRI machine and is shown, say, a picture of spiders (as in this study), the brain is activated in response to the picture. In such a study, each test subject has been evaluated according to an accepted measure of arachnophobia. This diagnostic measure is then compared to observed activity in various regions of his or her brain during the test. It is from correlations drawn this way that researchers draw conclusions about parts of the brain responsible for processing emotions like fear.
The orderly process by which activation changes are recorded is based on the brain being subdivided into cubic units of volume – known as voxels. As the brain is activated, the fMRI machine records activity levels for each voxel through the duration of the test. Since most fMRI techniques divide the brain into between 40,000 and 500,000 voxels, some selection must be made of voxels most worthy of attention. This selection is usually based on the regions of the brain comprising adjacent voxels that experience activity above some threshold.
The nonindependence error is committed when voxels are selected for study because of their observed correlation with some behavioral measure: the researcher is, in effect, drawing his target based on the best looking cluster of neural activity. On his blog, prefrontal.org, University of California at Santa Barbara neuroscientist Craig Bennett puts it succinctly. “Bottom line: you cannot use the data to tell you where to look for results and then use the same data to calculate your results.” Many researchers do exactly this, using their test data to decide which voxels seem most interesting for their behavioral experiment, and then computing correlations using the data from those same voxels. “Not surprisingly,” says Bennett, “the resulting values from this method were sky-high.” Any correlation researchers compute based on voxels chosen this way is likely to affirm their hypothesis, and where a true correlation exists, it will probably appear to researchers to be stronger than it actually is. The error is in using nonindependent data sets.
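The effect of selecting voxels and measuring them with the same data can be illustrated with a small simulation. The sketch below is a toy model, not a reconstruction of any study discussed here: the "voxels" are pure random noise with no real relationship to the "behavioral score," and the subject count, voxel count, threshold, and function names are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def double_dip(n_subjects=16, n_voxels=5000, threshold=0.6, seed=0):
    """Select noise voxels by their correlation, then report that same correlation."""
    rng = random.Random(seed)
    # One behavioral score per subject (e.g. a questionnaire result).
    behavior = [rng.gauss(0, 1) for _ in range(n_subjects)]
    # Each voxel is independent noise: the true correlation is zero everywhere.
    correlations = [
        pearson([rng.gauss(0, 1) for _ in range(n_subjects)], behavior)
        for _ in range(n_voxels)
    ]
    # The nonindependence error: keep only voxels that already look correlated...
    selected = [abs(r) for r in correlations if abs(r) > threshold]
    # ...and report the strength of the correlation among exactly those voxels.
    reported = sum(selected) / len(selected) if selected else 0.0
    typical = sum(abs(r) for r in correlations) / n_voxels
    return reported, len(selected), typical


reported, n_selected, typical = double_dip()
print(f"voxels passing the threshold: {n_selected}")
print(f"mean |r| among selected voxels: {reported:.2f}")
print(f"mean |r| across all voxels: {typical:.2f}")
```

By construction the reported value sits above the selection threshold even though nothing real is being measured, while the average across all voxels stays much lower. With a small sample and thousands of voxels, some will clear almost any threshold by chance alone.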
Unfortunately for the media, it is not always obvious whether scientists investigating some particular emotion and its neural correlate have abused their data this way. To find out, one has to read the methods section and see whether the voxels to consider were determined by the data (i.e. choosing voxels that are more active than others) or whether they were picked purely based on a known anatomical region. Authors are often willing to share this description of how they chose the specific voxels to analyze – but journalists need to ask.
Toward Better Interpretations
Kriegeskorte and others are careful to point out that inflated correlations do not necessarily mean that the correlations do not exist. There is often no way to know how distorted a particular correlation is, and therefore the problem is usually one of effect size. In a supplement to their critique, they suggest various methods for avoiding the nonindependence error. One straightforward way is to choose predefined regions of the brain, like saying “let’s look at what happens in this particular region of the amygdala while asking questions about sexual infidelity.” By drawing the target before collecting measurable data, scientists use a preselected set of voxels (the ones generally accepted as comprising the specified region), and make observations and calculations based on whatever activity happens (or doesn’t) in that region. Another approach is to make use of the fact that there are multiple experimental runs over which fMRI data are gathered. Kriegeskorte suggests, for example, using the odd-numbered runs (first, third, fifth, etc.) of fMRI subjects to look for interesting voxel regions, and the even-numbered runs to use activity data on these voxels for correlation analysis.
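The odd/even split can be sketched with a toy simulation. As before, this is an assumption-laden illustration rather than anyone's actual pipeline: the "voxels" are pure noise, the two independent noise draws stand in for odd- and even-numbered runs, and the sizes, threshold, and function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_run_analysis(n_subjects=16, n_voxels=5000, threshold=0.6, seed=1):
    """Select voxels on 'odd' runs, then measure the correlation on 'even' runs."""
    rng = random.Random(seed)
    behavior = [rng.gauss(0, 1) for _ in range(n_subjects)]
    selection_rs, held_out_rs = [], []
    for _ in range(n_voxels):
        odd_runs = [rng.gauss(0, 1) for _ in range(n_subjects)]   # used to choose voxels
        even_runs = [rng.gauss(0, 1) for _ in range(n_subjects)]  # used only to measure
        r_odd = pearson(odd_runs, behavior)
        if abs(r_odd) > threshold:
            selection_rs.append(abs(r_odd))
            # Score the held-out correlation in the direction seen during selection.
            held_out_rs.append(math.copysign(1.0, r_odd) * pearson(even_runs, behavior))
    if not selection_rs:
        return 0.0, 0.0, 0
    n = len(selection_rs)
    return sum(selection_rs) / n, sum(held_out_rs) / n, n


selection_r, held_out_r, n_selected = split_run_analysis()
print(f"voxels selected on odd runs: {n_selected}")
print(f"mean correlation on the selection data: {selection_r:.2f}")
print(f"mean correlation on the held-out data: {held_out_r:.2f}")
```

Because these simulated voxels carry no real signal, the correlation that looked impressive on the selection data should fall to roughly zero on the held-out runs. A genuine effect, by contrast, would be expected to survive the split, though likely at a smaller size than the selection data suggested.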
While popular science publications tend to scrutinize technology and research, such skepticism is rarely evident in mainstream media reporting. The Wall Street Journal, for instance, reported on research using fMRI to study fandom among teen music lovers – “Inside the Brains of Bieber Fans” – without actually saying how fMRI was used to test anything. The New York Times published a story noting the strong connection between experience and language. The article pointed to the response of the brain’s olfactory center when reading words suggesting olfactory behavior (like “cinnamon”), or the activity of the motor cortex when one reads the sentence, “Pablo kicked the ball.” The New York Times claimed that the “brain, it seems, does not make much of a distinction between reading about an experience and encountering it in real life; in each case, the same neurological regions are stimulated.” This interpretation is a myopic leap, since the scientists presumably noticed similar activity in specific regions, but did not compare reactions in ostensibly “irrelevant” voxels.
The double dipping phenomenon rears its head with regularity. Science Daily reported on research on how people diagnosed with bipolar disorder experience emotions differently when shown images of happy or sad faces, compared to people without such a diagnosis. In the experiment, the researchers asked patients to respond (press a button) or withhold response when shown the pictures. The news source claimed the authors find “that bipolar depressed patients abnormally activated brain areas when they had to withhold responses to sad faces. Manic patients, on the other hand, had abnormal activation regardless of whether they were trying to withhold response to sad faces, happy faces or non-emotional material.”
Yet a close look at the original study, conducted by Amit Anand, M.D., of Indiana University and published in the Jan 15, 2013 issue of Biological Psychiatry, shows that double dipping lives on. Specifically, the section on image analysis reveals:
"From ANCOVA tests, significant clusters indicating a main effect of diagnostic group on each contrast were identified. To clarify and visualize between-group differences, mean contrast data (i.e., beta coefficients) of each subject were extracted from these clusters for the relevant contrast and analyzed within SPSS."
In other words, the authors identified clusters of voxels first, and then they analyzed these very clusters to identify the most correlated neural activity.
Indeed, many news articles that turn fascinating hypotheses into easily consumed ideas fail to acknowledge the complexity of actually drawing correlations between brain activity detected through fMRI and what a mind may actually be processing.
There are two questions journalists can ask when reporting on fMRI correlations.
Are the brain regions involved in the correlation chosen independently of the observations being used to calculate the correlation?
How does a reported correlation compare to the best-case scenario, once error in testing is accounted for (does the number pass the sniff test)?
Much research using fMRI for social and cognitive neuroscience is based on the scientific search for the underlying physical process behind an external behavior – expressions of fear or anxiety, actions based on jealousy or grief. But so far, our best non-invasive method of investigating the brain, despite generating millions of pieces of data with each single experimental test, suffers from a dramatically limited perspective. Worse, the media’s reporting suffers from a distressing lack of curiosity about how valid conclusions using fMRI technology should be drawn – and whether valid conclusions have in fact been drawn with each new study claiming a new insight into how our brain works. Our methods of discovery apply ever more sophisticated techniques; they need equally sophisticated journalism.
Rebecca Goldin, Ph.D. is the Director of Research at STATS.org. Dr. Goldin was supported in part by National Science Foundation Grant #202726