We recently read an interesting paper by Vul, Harris, Winkielman, and Pashler which effectively accuses many researchers of having used laughably poor statistical methodology in using brain-imaging to establish conclusions about where in the brain the processing occurs for many psychological states including emotion, personality, and social cognition.  If this accusation is correct, it calls into question many of the conclusions of this research.  And whether or not this accusation turns out to be correct, there is a significant possibility that it will generate controversy that will taint this research in the public perception.  (Vul links to numerous popular media portrayals here.)

 

The research in question uses fairly standard fMRI techniques.  Experimental subjects perform two different tasks in a brain scanner: an “experimental” task that engages the psychological capacity in question and a “control” task that is used for comparison.  The scanner measures changes in blood oxygenation, which serve as a proxy for how much energy different areas of the brain use during these two tasks.  The brain scans are divided into many thousands of tiny cubes (typically a few millimeters on a side) called voxels.  Whichever voxels “light up” more in the experimental task than in the control task are taken to be the brain areas responsible for the special processing involved in the experimental task.

 

Researchers often identify a Region of Interest (or ROI), consisting of one or more voxels that light up during the performance of the experimental task, and they report the degree of correlation between the task subjects are performing (experimental vs. control) and how much the voxels in the ROI light up.  (Two things have a correlation of 1 if you’d never see one without the other; they have a correlation of -1 if you never see them together; and they have a correlation of 0 if the presence of one says nothing about how likely the other is to be present.)

 

Some researchers have reported correlations well above 0.8.  These remarkably high correlations set off alarm bells for Vul et al.  There is a theoretical limit on how high a correlation a particular experimental method should be able to reveal: the less reliably you can measure two things, the lower the correlation you can expect to observe between them.  Suppose two things are actually perfectly correlated.  If you can measure them perfectly reliably, then they’ll look perfectly correlated to you.  The less reliable your measurements are, the more often you’ll mismeasure one or the other of them, so the less correlated they will look to you.  (Mathematically, the maximum observed correlation is the geometric mean of the reliabilities of the methods for measuring the two correlates.)  Vul et al argue that the reported reliabilities for various psychological measures rarely exceed 0.8, and that the reported reliabilities for fMRI measures rarely exceed 0.7.  Given these estimates, even if some particular brain area were perfectly correlated with a particular psychological state, this should show up in experiments as a correlation of only about 0.74, due to errors in measuring one or the other.  Of course, hardly anyone expects that there actually are perfect correlations, at least not at the coarse-grained resolution of current imaging technology, which means observed correlations should be even lower than this upper limit of 0.74.  Surprisingly, Vul et al cite roughly 70 recently published claims of correlations higher than the theoretical best-case limit of 0.74.
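
The geometric-mean ceiling described above is easy to check by hand.  The following is only a minimal sketch of that arithmetic; the function name and its `true_correlation` parameter are my own illustrative choices, not anything taken from Vul et al.  With reliabilities of 0.8 and 0.7 it comes out near 0.748, in line with the roughly 0.74 figure quoted above.

```python
import math


def max_observed_correlation(reliability_x, reliability_y, true_correlation=1.0):
    """Expected observed correlation once measurement error is accounted for.

    The true correlation is scaled by the geometric mean of the two
    reliabilities, so a perfect true correlation (1.0) gives the ceiling.
    """
    return true_correlation * math.sqrt(reliability_x * reliability_y)


# Reliabilities quoted in the text: 0.8 (psychological measure), 0.7 (fMRI).
ceiling = max_observed_correlation(0.8, 0.7)
print(f"best-case observed correlation: {ceiling:.3f}")
```

Passing a `true_correlation` below 1.0 shows how quickly the expected observed value falls under that ceiling.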

 

Jabbi et al are among the authors that Vul et al criticize.  In their response to Vul, they take issue with the computation of 0.74 as the best-case scenario.  Their main argument is that there is at least one fMRI study which reported a reliability of 0.98, which (together with a 0.8-reliable psychological measure) would yield a higher theoretical best-case observed correlation of about 0.9.  I grant that a few researchers might have gotten lucky and studied tasks and brain regions that fMRI methods happen to detect with abnormally high reliability.  So it could be that these few researchers are reporting genuinely high correlations, but it’s not credible that all 70 of the claimed correlations above 0.74 involved brain measures with such abnormally high reliability.  If Vul et al are right that it is extremely rare to encounter fMRI reliabilities above 0.7, then we should expect that most researchers are working with reliabilities no higher than 0.7, so we should be very suspicious when 70 out of the 256 claimed correlations that Vul et al encountered in their literature review are too good for prevailing methods to deliver.

 

How is it that so many researchers got such incredible results?  Vul et al sent out surveys to 55 research teams to ask about the details of their computations.  The survey responses painted a disturbing picture.  Roughly 54% of the research teams (who reported the correlations marked in red in the figure below) had first defined as their region of interest (ROI) whichever voxel or voxels had displayed the highest correlation to the task in question, and then reported how high that correlation was.  Given how many voxels there are in the brain, there were bound to be some that were highly correlated (and others that were highly anti-correlated) with any given task, so it isn’t really all that informative to be told how high the correlation was for some cherry-picked set of highly-correlated voxels.

 

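[Figure: voodoo-correlations-1. Reported correlations from the surveyed studies; red marks correlations from teams that selected their ROI by its correlation with the task, green marks the rest.]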

 

Here’s an analogy that might help to make this clear.  (I think my analogy is slightly more apt than the weather station analogy Vul et al propose.)  Suppose I turn on my TV to a static channel.  Given the large number of pixels on my TV, it’s bound to be the case that a few of these pixels will have changes in brightness that correlate with changes in, say, the Dow Jones index.  If we carefully go through all the pixels and select some that are highly correlated with the Dow Jones, we can then report that those pixels together have a high average correlation with the Dow Jones.  But this average correlation will be virtually meaningless: it is nothing more than an artifact of how we cherry-picked the pixels that we were talking about.  If Vul et al are right, many of the correlations reported in fMRI research are similarly meaningless.
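
To see how strong this effect can be, here is a rough simulation of the TV-static analogy.  Everything in it is an assumption made for illustration: the pixel count, the number of days, the number of selected pixels, and the use of Gaussian noise for both the pixels and the index changes.  By construction, no pixel has any real relationship to the index.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cherry_pick_demo(n_pixels=5000, n_days=20, n_selected=10, seed=0):
    """Correlate pure-noise pixels with a pure-noise index, then cherry-pick."""
    rng = random.Random(seed)
    # Hypothetical daily index changes: random noise, unrelated to any pixel.
    index_changes = [rng.gauss(0, 1) for _ in range(n_days)]
    # Each pixel's brightness changes are independent random noise as well.
    correlations = [
        pearson([rng.gauss(0, 1) for _ in range(n_days)], index_changes)
        for _ in range(n_pixels)
    ]
    # Cherry-pick the pixels with the highest observed correlations.
    top = sorted(correlations, reverse=True)[:n_selected]
    return sum(correlations) / n_pixels, sum(top) / n_selected


mean_all, mean_selected = cherry_pick_demo()
print(f"mean correlation, all pixels:      {mean_all:+.3f}")
print(f"mean correlation, selected pixels: {mean_selected:+.3f}")
```

The average over all pixels sits near zero, as it should for noise, while the average over the hand-picked pixels is large simply because they were picked for being large.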

 

How could researchers compute correlations that would be more meaningful?  A correlation between some region of interest (ROI) and some experimental task will be much more interesting if it wasn’t guaranteed to be present by the procedure used to select that ROI.  For example, (1) researchers could choose an ROI that had been implicated in certain sorts of processing in previous research and then test to see how highly correlated that ROI is with some independent measure.  Or (2) they could divide their subjects into two groups, use the first group to select an ROI, and then report how high a correlation there was between that ROI and the task in the second group.
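
Proposal (2) can be sketched in the same simulated setting.  This is an illustration under assumptions of my own, not a reconstruction of anyone's actual analysis pipeline: the data are pure noise, the group sizes and voxel counts are arbitrary, and `scores` stands in for whatever measure is being correlated with the voxels.  The ROI is chosen using the first group only, and its correlation is then measured in the second group.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def noise(rng, n):
    """A list of n independent standard-normal draws."""
    return [rng.gauss(0, 1) for _ in range(n)]


def split_half_demo(n_voxels=5000, n_per_group=16, n_selected=10, seed=1):
    """Select an ROI in group A, then measure its correlation in group B."""
    rng = random.Random(seed)
    # Hypothetical scores and voxel activations for two separate subject groups.
    scores_a, scores_b = noise(rng, n_per_group), noise(rng, n_per_group)
    voxels_a = [noise(rng, n_per_group) for _ in range(n_voxels)]
    voxels_b = [noise(rng, n_per_group) for _ in range(n_voxels)]

    # Step 1: pick the ROI using group A only.
    r_a = [pearson(v, scores_a) for v in voxels_a]
    roi = sorted(range(n_voxels), key=lambda i: r_a[i], reverse=True)[:n_selected]

    # Step 2: report the correlation of those same voxels in group B.
    selection_r = sum(r_a[i] for i in roi) / n_selected
    holdout_r = sum(pearson(voxels_b[i], scores_b) for i in roi) / n_selected
    return selection_r, holdout_r


selection_r, holdout_r = split_half_demo()
print(f"ROI correlation in the selection group: {selection_r:+.3f}")
print(f"ROI correlation in the held-out group:  {holdout_r:+.3f}")
```

Because the data contain no real effect, the held-out correlation should hover around zero even though the selection-group correlation looks impressive.  A real effect, if there were one, would survive the move to the second group.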

 

Jabbi et al rightly point out that Vul et al’s survey didn’t completely rule out the possibility that some of the surveyed teams might have been employing a version of (1), where they made multiple psychological measures during task performance and used one measure to select an ROI and an independent measure to calculate a correlation.  Insofar as the psychological measures really are independent, this maneuver would avoid the charge of cherry-picking.  However, researchers who employed this maneuver still should not have been able to surpass the theoretical best-case limit of 0.74 discussed above.  Since many of the researchers that Vul et al accuse of cherry-picking do exceed this theoretical limit, it seems unlikely that very many of them actually did employ this maneuver (or if they did, then we need to seek a new way of explaining why they ended up reporting results that are apparently too good to be credible).

 

Jabbi et al also rightly point out that, contrary to a trying-to-be-helpful suggestion from Vul et al, proposal (2) actually wouldn’t work if we divided the experimental runs in half in such a way that some runs from each subject were used to choose the ROI (because, given the fact that within-subject performances are fairly highly correlated over time, this would amount to a sort of cherry-picking too).  Instead, proposal (2) will work only if we divide the subject pool in half, which means you’d need to run more total subjects to get statistically significant results out of your second measure-the-correlation group.  Of course, given the immense expense of fMRI, researchers will be loath to accept any proposal that requires them to run more total subjects through their machines.  However, the fact that good methodology would be expensive doesn’t seem to be an excuse for employing bad methodology, especially if it’s as bad as Vul et al accuse it of being.  If we can’t afford to do brain-imaging in a way that delivers meaningful results (and perhaps even if we can), we could find lots of other, and arguably much more useful, ways to spend the money instead (as Jerry Fodor amusingly argues here).

 

So, what’s the upshot of all this?  It’s important, especially as far as the public perception of these issues is concerned, not to inflate these negative results too much.  45% of the groups Vul et al surveyed (the green ones in the figure above) used methodology that wasn’t objectionable (at least not in this way).  Of the 54% that did report objectionable correlations, it may still be interesting to note where in the brain these objectionable correlations were located; insofar as these regions are implicated in other studies on similar tasks, we’ll have converging evidence that there were real correlations here, even if they aren’t as strong as advertised.  Reported correlations are only one part of these research papers, and other work in these papers might be very good, even if the reported correlations themselves are meaningless artifacts of the cherry-picking procedure researchers used.

 

There are deeper questions here about the philosophy and sociology of science, and about public policy decisions regarding research spending. 

 

Once it is pointed out, the error in these reported correlations is glaring and obvious.  It seems unlikely that such an error would be tolerated in other fields.  What is it about the field of brain-imaging that allowed such glaring statistical errors to make it into print so often?  One must suspect that it has something to do with all the money and all the pretty pictures that brain imaging research involves – there’s a sort of mystique and exuberant excitement surrounding this work that plausibly poses barriers to the sort of critical evaluation that really should be done.

 

There is an interesting parallel between Vul et al’s worries about brain-imaging research and potential worries about our institutions of scientific publication: e.g., the tendency to print only positive results, and the fact that many attempted lines of research never make it to publication.  The publication process effectively involves cherry-picking only that research that finds high correlations, and silencing all the rest.  Given the variability and noise in our world and the huge number of researchers who are looking at all this noise, some of them were bound to observe some apparent correlations, and only those who do are allowed to publish.  So, one must worry, maybe the publication process is itself guilty of the very same error that Vul et al accuse brain researchers of committing.  Perhaps we should dismiss many of the correlations we read in scientific journals as telling us little more than we already knew by the fact that we’ve paid a whole bunch of researchers to go out and look for correlations?
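
The worry can be made concrete with a toy model of a publication filter.  This is a deliberately crude sketch: the number of labs, the sample size, and the rule "publish only if the observed correlation exceeds 0.4" are all hypothetical choices of mine, and every lab is studying an effect that does not exist.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def publication_filter_demo(n_labs=2000, n_subjects=20, threshold=0.4, seed=2):
    """Many labs measure a nonexistent effect; only large results get published."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n_labs):
        # Two unrelated noise variables, so the true correlation is zero.
        xs = [rng.gauss(0, 1) for _ in range(n_subjects)]
        ys = [rng.gauss(0, 1) for _ in range(n_subjects)]
        observed.append(pearson(xs, ys))
    # Hypothetical editorial rule: only correlations above the threshold appear.
    published = [r for r in observed if r > threshold]
    return observed, published


observed, published = publication_filter_demo()
print(f"labs: {len(observed)}, published: {len(published)}")
print(f"mean correlation across all labs:    {sum(observed) / len(observed):+.3f}")
print(f"mean correlation in published papers: {sum(published) / len(published):+.3f}")
```

A reader who sees only the published papers would conclude that the effect is sizable, even though the full set of results averages out to nothing.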

 

Counter to this line of thought, it is clear that science does make real progress in many areas.  We can predict, with some reliability, which brain areas are likely to be lit up in the next fMRI study on emotions.  We can predict, with some reliability, what sorts of deficits will arise in people with brain damage in certain areas.  And we can build all sorts of marvelous devices that can do all sorts of things we couldn’t do before.  So science, as a whole, must be working. 

 

These two lines of thought are actually consistent.  Science can make wonderful progress even if each scientific paper doesn’t mean all that much, and many papers are mere reports of statistical noise.  Science as a whole may proceed by seeing which research programs consistently produce new, interesting, replicable results.  Even if many particular papers are mere reports of statistical noise, we can’t explain as mere statistical noise results that are repeatedly replicated, and that produce useful predictions and technological advances.

 

If individual papers aren’t really the appropriate unit of scientific progress, then it’s probably a mistake for the media to report, gleefully and authoritatively, each new paper that comes out with an exciting result, for many of these papers will likely fail to engender replication or much further fruitful research.  And it may also be less important for Vul et al to criticize these particular papers in brain science.  If these papers really are reporting real correlations, those correlations should be borne out in future replications, and eventually in practical consequences involving neurosurgery and technology.  If these papers are instead reporting spurious correlations, that fact will eventually be uncovered by the failure of their research programs to yield fruitful results.

 

Still, one must wonder, wouldn’t it be better if these researchers weren’t allowed to report meaningless correlations as though they were meaningful?  Wouldn’t that help science to march on a little faster or at a bit less expense to us, the taxpayers, who foot most of the bill for this research?  I, for one, am very glad to have people like Vul hold a critical spotlight to this expensive research. 

One Response to “Voodoo Correlations in Neuroscience?”

I totally agree that researchers should refrain from reporting meaningless correlations as though it was one of the most ‘meaningful’ reports on a single subject being published.

The media seem to pick up on any fact that is printed without looking into the feasibility of the meaningless correlations being reported. The taxpayer is ultimately the one who will feel the burden of all this in the end.
