Friday, July 17, 2009

Weak Signals, Lots of Noise: Problems with fMRI Brain Scanning

EXECUTIVE SUMMARY: After noticing that recently published social-neuroscience articles tended to report impossibly high correlations between brain activity (observed in fMRI) and whatever behavioral or psychological variable was being studied, a group of researchers surveyed the authors of fifty-five recent fMRI studies, asking them how they chose the data points they included in their statistical analyses. Many used anatomical criteria: if their hypothesis dictated that a specific part of the brain should be active under certain conditions, they confined their analysis to those fMRI data points corresponding to where that specific brain structure would be. But most of the study's respondents said they used "functional" criteria, meaning that if a data point activates under the conditions they're testing, they include it in their results. This is a problem because it all but guarantees positive results, overselecting for data points that fit the pattern being looked for while screening out those that don't fit it.

On NPR last week, I heard about an interesting pair of articles published in May dealing with the limitations of fMRI studies in brain research.

This type of brain imaging is widely used in neuroscience, and autism research is no exception. The past decade or so has brought us lots of different hypotheses about how autistic brains work: probably the most extensively studied one I can think of is the observation that, when they're shown images of human faces, autistic people tend to activate a different part of their brains* than non-autistic people do. Brain-imaging studies have also yielded a jumble of tentative observations about differences between autistic and neurotypical brain anatomy: this or that structure appears bigger or smaller on average in this or that series of brain scans.

What the authors of the second of the two studies (full text here) I linked in my first paragraph --- Edward Vul, Christine Harris, Piotr Winkielman**, and Harold Pashler --- noticed about many fMRI studies of emotion, personality and social cognition was the phenomenally high correlations between activity observed in a certain brain region in response to a given stimulus (say, images of happy, angry or frightened faces, or recordings of angry speech, or a semi-scripted interaction meant to make the subject feel lonely or rejected) and individual personality traits (empathy, say, or extraversion or anxiety).

They noticed that the correlations reported in many of these studies --- often above 0.8, on a scale of 0 to 1 --- were higher than they could be given the reliability of the tools used to measure each variable:

This, then, is the puzzle. Measures of personality and emotion evidently do not often have reliabilities greater than .8. Neuroimaging measures seem typically to be reliable at .7 or less. If we assume that a neuroimaging study is performed in a case where the underlying correlation between activation in the brain area and the individual difference measure (i.e., the correlation that would be observed if there were no measurement error) is perfect, then the highest expected correlation would be √(.8 x .7), or .74. Surprisingly, correlations exceeding this upper bound are often reported in recent fMRI studies on emotion, personality, and social cognition.

To solve this mystery, Vul et al. surveyed the authors of fifty-five recent social-neuroscience articles describing fMRI studies, asking them exactly how they settled on which values to use (out of the tens or hundreds of thousands of individual data points, or "voxels," making up each image!) in their calculations.
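As an aside, the arithmetic behind the .74 ceiling quoted above is easy to check. Here is a minimal sketch in Python; the function name is my own invention, and it assumes the standard attenuation relationship the quote relies on (observed correlation = true correlation times the square root of the product of the two reliabilities).

```python
import math


def max_observable_correlation(reliability_a, reliability_b, true_r=1.0):
    """Ceiling on an observed correlation between two noisy measures.

    Hypothetical helper: assumes observed r = true r * sqrt(rel_a * rel_b).
    """
    return true_r * math.sqrt(reliability_a * reliability_b)


# Reliabilities quoted by Vul et al.: .8 for the personality/emotion
# measure, .7 for the neuroimaging measure, with a perfect true correlation.
ceiling = max_observable_correlation(0.8, 0.7)
print(f"{ceiling:.3f}")
```

The exact value is just under .75, which the paper gives as .74. Any reported correlation above that would need measures more reliable than the ones these studies typically had.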

In the articles we are focusing on here, the final result, as we have seen, was always a correlation value --- a correlation between each person's score on some behavioral measure and some summary statistic of their brain activation. The latter summary statistic reflects the activation or activation contrast within a certain set of voxels. ... [V]oxels may be selected based on anatomical criteria [i.e., those roughly corresponding to the targeted brain structure in spatial terms], functional criteria [i.e., those determined to show activity in response to relevant but not irrelevant stimuli], or both. Within those broad options, there are a number of additional more fine-grained choices. It is hardly surprising, then, that brief method sections rarely suffice to describe how the analyses were done in adequate detail to really understand what choices were being made.
In our survey, we first inquired whether the fMRI signal measure that was correlated across subjects with a behavioral measure represented the average of some number of voxels or the activity from just one voxel that was deemed most informative (referred to as the peak voxel).

If it was the average of some number of voxels, we asked whether the voxels were selected on the basis of anatomy, or activation seen in those voxels, or both. If activation was used to select voxels, or if one voxel was determined to be most informative based on its activation, we asked what measure of activation was used. Was it the difference in activation between two task conditions computed on individual subjects, or was it a measure of how this task contrast correlated with the individual difference measure? Finally, if functional data were used to select the voxels, we asked if the same functional data were used to compute the reported correlation.

While there was a lot of diversity in the approaches respondents employed, and the distribution across the different approaches was fairly even, there was one fairly important trend that emerged.

First, to lead into what this trend was, I'd like to point out the two places where the distribution was not even: when the average of a group of voxels was used, those voxels were much more often selected using functional criteria than not --- i.e., the functional-only and mixed functional-anatomical approaches accounted for more than three-quarters of the articles (23 of 30, as opposed to only 7 studies using only anatomical criteria) --- and every study that used functional criteria to identify voxels of interest then re-used the same data they had used to select the voxels as their output measure for correlating with the behavioral data.

If your Circular-Reasoning Alarm is starting to sound, you're not alone:

The key [to explaining the "implausibly high" correlations often reported in fMRI studies] ... lies in the 53% of respondents who said that "regression across subjects" was the functional constraint used to select voxels, indicating that voxels were selected because they correlated highly with the behavioral measure of interest.

Figure 3 shows very concretely the sequence of steps that these respondents reported following when analyzing their data. A separate correlation across subjects was performed for each voxel within a specific brain region. Each correlation relates some measure of brain activity in that voxel (which might be a difference between responses in two tasks or in two conditions) with the behavioral measure for that individual. Thus, the number of correlations computed was equal to the number of voxels, meaning that thousands of correlations were computed in some cases. At the next stage, researchers selected the set of voxels for which this correlation exceeded a certain threshold, and reported the correlation within this set of voxels.

In other words, because the pool of available data points is so vast, patterns will crop up wherever you choose to look for them. This is what the study's authors term "non-independence error": using the same functional measures for data analysis that you've already used to select your data set.
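To make the circularity concrete, here is a rough simulation sketch of the procedure described above, run on pure noise. Everything in it is an assumption of mine for illustration (the subject count, voxel count, cutoff of 0.6, and all variable names); it is not the analysis code from any of the surveyed studies. It correlates every voxel with a behavioral score, keeps the voxels that clear the cutoff, and then reports the mean correlation twice: once on the same data used for selection, and once on a second, independent set of noise.

```python
import numpy as np

rng = np.random.default_rng(0)

n_subjects, n_voxels = 16, 10_000  # hypothetical study size
threshold = 0.6                    # hypothetical voxel-selection cutoff

# Pure noise: by construction, no voxel is truly related to behavior.
behavior = rng.standard_normal(n_subjects)
scan_a = rng.standard_normal((n_subjects, n_voxels))  # used for selection
scan_b = rng.standard_normal((n_subjects, n_voxels))  # independent data


def voxelwise_r(scores, activation):
    """Pearson r between the behavioral scores and every voxel (column)."""
    z_scores = (scores - scores.mean()) / scores.std()
    z_act = (activation - activation.mean(axis=0)) / activation.std(axis=0)
    return (z_scores[:, None] * z_act).mean(axis=0)


# Step 1: one correlation per voxel, then keep the voxels above the cutoff.
r_a = voxelwise_r(behavior, scan_a)
selected = r_a > threshold

# Step 2a (non-independent): report the correlation in the selected voxels
# using the same data that selected them.
r_circular = r_a[selected].mean()

# Step 2b (independent): report it for the same voxels in fresh data.
r_independent = voxelwise_r(behavior, scan_b)[selected].mean()

print(f"voxels selected: {selected.sum()} of {n_voxels}")
print(f"non-independent estimate: {r_circular:.2f}")
print(f"independent estimate:     {r_independent:.2f}")
```

The non-independent estimate has to come out above the cutoff, because every voxel contributing to it was picked for exceeding that cutoff. The independent estimate should land near zero, which is the right answer for data with no real effect in it.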

This graph shows all the studies Vul et al. reviewed; as you can see, the correlations each study reported range from about 0.25 to 1.0. The red squares represent correlations derived using non-independent analyses, and the green ones represent independent analyses. You can see that the red ones tend to have much higher values than the green ones: most of the green ones are clustered between 0.5 and 0.65, which is well below the "upper bound" of 0.74 cited earlier.

Further reading: Mind Hacks, the Neurocritic, and the Neuroskeptic all posted on this a while ago; also worth reading are these two posts by Andrew Gelman at Statistical Modeling, Causal Inference, and Social Science, this post at The Amazing World of Psychiatry, and two articles in response to Vul et al., one also published in Perspectives on Psychological Science and the other posted online as a draft.

*This would be the "fusiform face area," the existence of which was proposed in 1997 by Nancy Kanwisher and her colleagues. Kanwisher, interestingly enough, seems to be working with Ed Vul --- she co-wrote this longer article about non-independence error with him. Sadly, none of the articles Vul et al. evaluated dealt with the fusiform gyrus, face recognition, or autism, so I can't draw any conclusions about the robustness of that theory. Nor can I (right now, anyway) apply this article's findings more directly to the neuroscience of autism.

**Autism researcher!

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition. Perspectives on Psychological Science, 4(3), 274-290. DOI: 10.1111/j.1745-6924.2009.01125.x


Larry Arnold PhD FRSA said...

As someone who has long observed (anecdotally I might add without correlating) the psychological effect that enthusiasm brings when one desperately wishes to discover what one is looking for, and noting the contrary nature of the conclusions of various studies which if subjected to a bit of overarching logic ought if all true to contradict each other unless the subjects of the experiments are from alternative universes, ... this does not surprise me and adds to my own armoury of "evidence".

It is also a warning to me as a researcher to go for falsifying my hypothesis, not confirming what I suspect I already know.

So how then right now do I set about falsifying the existence of confirmation bias in the vast majority of studies?

There's no answer to that is there :)

Socrates said...

"the psychological effect that enthusiasm brings when one desperately wishes to discover what one is looking for"

And how has Day 1, of their PhD's been so easily forgotten?

Amanda said...

That face study was debunked by Michelle Dawson IIRC. The problem was that the autistic people weren't looking at the faces. When they did look at faces that area of the brain did light up. This was discussed on a Canadian radio show called Quirks and Quarks.

The Dissident said...

I agree this is a significant problem within FMRI analysis. However, the techniques to overcome this problem are already available.

The answer is quite simple really. Multi variate pattern analysis, adopted from machine learning, is not susceptible to the problems discussed here. It considers the entire set of data generated by the FMRI machines and will allow cognitive psychologists to accurately classify the voxel patterns associated with autism for example.

I am a machine learning researcher working with FMRI data and these techniques are available, but researchers can only police themselves. Until the word gets out that the conventional univariate analyses are flawed the effort put into the studies could be wasted.