Brain Imaging Studies Keep Failing Replication. A New Preprint Explains Part of Why.


The problem with most fMRI research isn't that the findings are wrong. It's that they're underpowered to be right — and the field has known this for years without doing much about it.

A preprint posted to bioRxiv this week puts a sharper point on that problem. The paper examines what its authors call "the hidden landscape of missed effects" in human functional neuroimaging — the systematic pattern by which small studies fail to detect real signals, then get published anyway, then fail to replicate. The full text wasn't extractable from the preprint server, so I'll be careful about specifics. But the framing alone is worth sitting with: the replication failures aren't random noise. They have a structure.

The Sample Size Problem Has a Name Now

Neuroimaging research has a long-documented tendency toward small samples. Scan time is expensive, participant recruitment is hard, and the field developed its statistical conventions in an era when fifty subjects felt like a lot. The result is a literature full of studies whose effect sizes look impressive but whose confidence intervals are wide enough to drive a truck through.
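To make the confidence-interval point concrete, here is a minimal sketch using the standard Fisher z-transform approximation for a correlation. The observed correlation of 0.3 and the sample sizes are illustrative assumptions, not figures from any of the studies discussed here.

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via Fisher's z-transform."""
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# Hypothetical observed correlation, same value at every sample size.
for n in (20, 50, 200, 1000):
    lower, upper = correlation_ci(0.3, n)
    print(f"n={n:5d}  r=0.30  95% CI = [{lower:+.2f}, {upper:+.2f}]")
```

Under these assumptions, the interval at twenty subjects spans zero, while at a thousand it is narrow enough to say something about the size of the effect.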

This is not a new observation. What's newer is the accumulating evidence that the problem is worse than even skeptics assumed — and that it interacts badly with the field's publication incentives. Small studies that find effects get published. Small studies that don't find effects often don't. The published literature therefore systematically overstates effect sizes, which means replication studies — often better-powered — find smaller effects or none at all, and get framed as failures rather than corrections.
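The inflation mechanism is easy to reproduce in a toy simulation. The sketch below is not the preprint's method. It assumes a true standardized effect of 0.2, a known standard deviation of 1, and a crude publication filter that keeps only studies reaching significance in the predicted direction. All of those numbers are hypothetical.

```python
import random
import statistics

def simulate_published_effects(true_d, n, n_studies=2000, z_crit=1.96, seed=0):
    """Simulate many studies; 'publish' only those with z above z_crit."""
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_studies):
        sample = [rng.gauss(true_d, 1.0) for _ in range(n)]
        d_hat = statistics.fmean(sample)   # SD is 1, so the mean is the effect estimate
        z = d_hat * n ** 0.5               # z-test with known variance
        all_estimates.append(d_hat)
        if z > z_crit:                     # toy publication filter
            published.append(d_hat)
    return all_estimates, published

for n in (20, 200):
    everything, published = simulate_published_effects(true_d=0.2, n=n)
    print(
        f"n={n:3d}  mean of all studies={statistics.fmean(everything):.2f}  "
        f"mean of published={statistics.fmean(published):.2f}  "
        f"published share={len(published) / len(everything):.0%}"
    )
```

In this setup a twenty-subject study can only clear the significance bar if its estimate lands above roughly 0.44, so every "published" small study reports more than double the true effect. A well-powered replication that recovers something near 0.2 is correcting the record, not failing.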

A commentary published in Aperture Neuro in late April makes a related point from the publishing side: current models "often privilege scale over depth, reinforcing biases that undervalue investigator-led research." The authors argue that large-scale population datasets and small-N mechanistic studies aren't competing approaches — they answer different questions. But journals haven't built review criteria that reflect that distinction, so both types of work get evaluated by the same metrics, and both end up distorted.

What Good Neuroimaging Actually Looks Like

The contrast case is instructive. A study published this week in Nature Communications examined sex differences in brain function using over 700 hours of fMRI data across 978 individuals — a sample size that's genuinely unusual for this kind of work. The findings are interesting (sex differences in task activation are widespread but largely task-specific and of small to moderate effect size), but what's more interesting methodologically is what that sample size allows: the researchers could actually test whether their findings held up across different tasks, different brain measures, and different analytical approaches. They found that machine learning could classify sex from brain activation, volume, or behavior — but that these three data types provided largely independent information. That's the kind of nuanced, cross-validated finding that small studies structurally cannot produce.

The study was funded through the intramural research program of the National Institute of Mental Health and drew on Human Connectome Project data — which is part of the point. The large, well-powered neuroimaging studies tend to be the ones backed by major infrastructure investments. Most labs don't have that.

The Meta-Science Moment: Effect Sizes Are the Story

The replication crisis in neuroimaging isn't primarily a story about fraud or sloppy methods (though both exist). It's a story about what happens when a field's statistical practices are calibrated for the resources available rather than the precision required.

John Ioannidis made this argument broadly about biomedical research in his 2005 PLOS Medicine paper, "Why Most Published Research Findings Are False," still one of the most downloaded papers in the scientific literature. The problem it describes remains largely unresolved twenty years later. Neuroimaging is a particularly clean case study because the effect sizes are often genuinely small, the measurements are noisy by nature, and the analytical flexibility (which brain regions to examine, which contrasts to run, which participants to exclude) is enormous. Small effects and noisy measurements drain statistical power, so a smaller share of significant results reflect real effects; analytical flexibility inflates the false-positive rate directly. Together, they compound.
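A rough way to see how these factors compound is to put numbers on them. The sketch below uses two textbook formulas: the chance of at least one false positive across k analyses, and the bias-free positive predictive value from the Ioannidis paper, PPV = (1 - β)R / (R - βR + α), where R is the pre-study odds that a tested effect is real. Treating the analyses as independent is a simplifying assumption (real fMRI analysis choices are correlated), and the values for k, power, and R are hypothetical.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def positive_predictive_value(power, alpha, prior_odds):
    """Bias-free PPV from Ioannidis (2005): (1 - beta) R / (R - beta R + alpha)."""
    true_positives = power * prior_odds
    return true_positives / (true_positives + alpha)

# Analytical flexibility: trying more analyses at a nominal alpha of 0.05.
for k in (1, 5, 20):
    print(f"{k:2d} analyses -> P(at least one false positive) = "
          f"{familywise_error(0.05, k):.2f}")

# Low power: same alpha, same assumed prior odds (1 real effect per 4 nulls).
for power in (0.8, 0.2):
    print(f"power={power:.1f} -> PPV = "
          f"{positive_predictive_value(power, 0.05, prior_odds=0.25):.2f}")
```

With these made-up inputs, twenty independent looks at the data push the false-positive chance well past one half, and cutting power from 0.8 to 0.2 drops the share of significant findings that are real from 0.8 to 0.5.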

The Aperture Neuro commentary's proposed fix — reforming peer review criteria to reward transparency, reproducible workflows, and methodological rigor rather than just sample scale — is the right direction. Whether journals actually move that way is a different question. The incentives that created the problem haven't changed.


Bottom Line: The bioRxiv preprint on missed effects in neuroimaging is worth watching as it moves toward peer review — it's attempting to quantify a problem the field has mostly described qualitatively. The more immediate takeaway is that the studies worth trusting look like the Nature Communications sex-differences paper: large samples, multiple validation approaches, explicit effect size reporting, and conclusions that don't outrun the data. Most published fMRI research doesn't look like that. Until the publishing ecosystem rewards the ones that do, the replication failures will keep coming.