One morning in August, the social science reporter for National Public Radio, a man named Shankar Vedantam, sounded a little shell-shocked. You couldn’t blame him.
Like so many science writers in the popular press, he is charged with reporting provocative findings from the world of behavioral science: “. . . and researchers were very surprised at what they found. The peer-reviewed study suggests that [dog lovers, redheads, Tea Party members] are much more likely to [wear short sleeves, participate in hockey fights, play contract bridge] than cat lovers, but only if [the barometer is falling, they are slapped lightly upside the head, a picture of Jerry Lewis suddenly appears in their cubicle . . . ].”
I’m just making these up, obviously, but as we shall see, there’s a lot of that going around.
On this August morning Science magazine had published a scandalous article. Its subject was the practice of behavioral psychology, a wellspring of modern journalism and the source of most of those thrilling studies that keep reporters like Vedantam in business.
Over 270 researchers, working as the Reproducibility Project, had gathered 100 studies from three of the most prestigious journals in the field of social psychology. Then they set about redoing the experiments to see whether they could get the same results, mostly using the materials and methods the original researchers had used. Direct replications are seldom attempted in the social sciences, even though the ability to repeat an experiment and get the same findings is supposed to be a cornerstone of scientific knowledge. It’s the way to separate real information from flukes and anomalies.
These 100 studies had cleared the highest hurdles that social science puts up. They had been edited, revised, reviewed by panels of peers, revised again, published, widely read, and taken by other social scientists as the starting point for further experiments. Except . . .
The researchers, Vedantam glumly told his NPR audience, “found something very disappointing. Nearly two-thirds of the experiments did not replicate, meaning that scientists repeated these studies but could not obtain the results that were found by the original research team.”
“Disappointing” is Vedantam’s word, and it was commonly heard that morning and over the following several days, as the full impact of the project’s findings began to register in the world of social science. Describing the Reproducibility Project’s report, other social psychologists, bloggers, and science writers tried out “alarming,” “shocking,” “devastating,” and “depressing.”
But in the end most of them rallied. They settled for just “surprised.” Everybody was surprised that two out of three experiments in behavioral psychology have a fair chance of being worthless.
The most surprising thing about the Reproducibility Project, however—the most alarming, shocking, devastating, and depressing thing—is that anybody at all was surprised. The warning bells about the feebleness of behavioral science have been clanging for many years.
For one thing, the “reproducibility crisis” is not unique to the social sciences, and it should have been no surprise that it would touch social psychology too. The widespread failure to replicate findings has afflicted physics, chemistry, geology, and other real sciences. Ten years ago a Stanford researcher named John Ioannidis published a paper called “Why Most Published Research Findings Are False.”
“For most study designs and settings,” Ioannidis wrote, “it is more likely for a research claim to be false than true.” He used medical research as an example, and since then most systematic efforts at replication in his field have borne him out. His main criticism involved the misuse of statistics: He pointed out that almost any pile of data, if sifted carefully, could be manipulated to show a result that is “statistically significant.”
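To see how easy the sifting is, consider a toy simulation (my own sketch, not anything from Ioannidis’s paper). It generates two groups of pure noise, measures forty different “outcomes,” and runs a standard t-test on each. No real effect exists anywhere, yet a handful of outcomes will clear the conventional p < 0.05 bar by luck alone.

```python
# A toy demonstration: no real effects anywhere, yet sifting many
# outcomes still turns up "statistically significant" results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

n_outcomes = 40      # forty different ways to slice the data
n_per_group = 30     # a typically small sample per group

false_positives = []
for outcome in range(n_outcomes):
    # Both groups are drawn from the same distribution,
    # so the true effect is exactly zero.
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    treatment = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    result = stats.ttest_ind(control, treatment)
    if result.pvalue < 0.05:
        false_positives.append((outcome, round(result.pvalue, 3)))

print(f"{len(false_positives)} of {n_outcomes} outcomes cleared p < 0.05")
# With a 5 percent false-positive rate per test, about two of the
# forty will "succeed," and any one of them could anchor a paper.
```

Run it a few times and the “significant” outcomes move around from trial to trial, which is exactly what a fluke looks like.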
Statistical significance is the holy grail of social science research, the sign that an effect in an experiment is real and not an accident. It has its uses. It is indispensable in opinion polling, where a randomly selected sample of people can be statistically weighted and then assumed, within a calculable margin of error, to represent a much larger population.
But the participants in behavioral science experiments are almost never randomly selected, and the samples are often quite small. Even the wizardry of statistical significance cannot show them to be representative of any people other than themselves.
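The arithmetic behind that distinction is easy to check. The sketch below (mine, with made-up numbers) computes the textbook margin of error for a proportion measured in a truly random sample. A national poll of 1,000 respondents pins an answer down to about three points either way; a lab sample of 30 leaves a spread of nearly eighteen points; and even that generous figure assumes random selection, which 30 volunteers from a psychology course are not.

```python
# Margin of error for a proportion, valid only for a random sample.
import math

def margin_of_error(p, n, z=1.96):
    """95 percent margin of error for proportion p in a random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# A national poll: 1,000 randomly selected respondents, 52 percent say yes.
print(f"n=1000: +/- {margin_of_error(0.52, 1000):.1%}")  # about +/- 3 points

# A typical lab study: 30 volunteers from an intro psychology course.
print(f"n=30:   +/- {margin_of_error(0.52, 30):.1%}")    # about +/- 18 points

# The second number is itself too generous: the formula assumes the 30
# were randomly drawn from the population they are meant to represent.
```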