Scientists Perturbed by Loss of Stat Tools to Sift Research Fudge from Fact

The journal Basic and Applied Social Psychology recently banned the use of p-values and other significance-testing methods that researchers rely on to quantify uncertainty in their results

Psychology researchers have recently found themselves engaged in a bout of statistical soul-searching. In what is apparently the first such move for a scientific journal, the editors of Basic and Applied Social Psychology announced in a February editorial that researchers who submit studies for publication would not be allowed to use a common suite of statistical methods, including a controversial measure called the p-value.

These methods, referred to as null hypothesis significance testing, or NHST, are deeply embedded in the modern scientific research process, and some researchers have been left wondering where to turn. “The p-value is the most widely known statistic,” says biostatistician Jeff Leek of Johns Hopkins University. Leek has estimated that the p-value has been used in at least three million scientific papers. Significance testing is so popular that, as the journal editorial itself acknowledges, there are no widely accepted alternative ways to quantify the uncertainty in research results—and uncertainty is crucial for estimating how well a study’s results generalize to the broader population.

Unfortunately, p-values are also widely misunderstood, often believed to furnish more information than they do. Many researchers have labored under the misconception that the p-value gives the probability that their study’s results are just pure random chance. But statisticians say the p-value’s information is far less specific, and it can be interpreted only in the context of hypothetical alternative scenarios: The p-value summarizes how often results at least as extreme as those observed would show up if the study were repeated an infinite number of times when in fact only pure random chance were at work.

This means that the p-value is a statement about imaginary data in hypothetical study replications, not a statement about actual conclusions in any given study. Instead of being a “scientific lie detector” that can get at the truth of a particular scientific finding, the p-value is more of an “alternative reality machine” that lets researchers compare their results with what random chance would hypothetically produce. “What p-values do is address the wrong questions, and this has caused widespread confusion,” says psychologist Eric-Jan Wagenmakers at the University of Amsterdam.
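To make the “alternative reality machine” concrete, here is a minimal Python sketch of that idea, using invented data (two hypothetical groups of 30 measurements each, chosen purely for illustration and not drawn from any study discussed here). The loop repeatedly shuffles the group labels so that only chance is at work, and counts how often shuffled data produce a difference at least as extreme as the one actually observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two groups of 30 measurements each (illustrative only).
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.0, size=30)
observed_diff = group_b.mean() - group_a.mean()

# Simulate the hypothetical replications: shuffle the group labels so that
# only pure random chance is at work, and record how often the shuffled data
# yield a difference at least as extreme as the observed one.
pooled = np.concatenate([group_a, group_b])
n_replications = 10_000
count_extreme = 0
for _ in range(n_replications):
    rng.shuffle(pooled)
    diff = pooled[30:].mean() - pooled[:30].mean()
    if abs(diff) >= abs(observed_diff):
        count_extreme += 1

p_value = count_extreme / n_replications
print(f"observed difference: {observed_diff:.3f}, permutation p-value: {p_value:.4f}")
```

Note what the output does and does not say: it reports how often chance alone would match or beat the observed result, not the probability that the observed result itself is a fluke.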

Ostensibly, p-values allow researchers to draw nuanced, objective scientific conclusions as long as they are used as part of a careful process of experimental design and analysis. But critics have complained that in practice the p-value in the context of significance testing has been bastardized into a sort of crude spam filter for scientific findings: If the p-value on a potentially interesting result is smaller than 0.05, the result is deemed “statistically significant” and passed on for publication, according to the recipe; anything with a larger p-value is destined for the trash bin.

Quitting p-values cold turkey was a drastic step. “The null hypothesis significance testing procedure is logically invalid, and so it seems sensible to eliminate it from science,” says psychologist David Trafimow of New Mexico State University in Las Cruces, editor of the journal. A strongly worded editorial discouraged significance testing in the journal last year. But after researchers failed to heed the warning, Trafimow says, he and associate editor Michael Marks decided this year to go ahead with the new diktat. “Statisticians have critiqued these concepts for many decades but no journal has had the guts to ban them outright,” Wagenmakers says.

Significance testing became enshrined in textbooks in the 1940s when scientists, in desperate search of data-analysis “recipes” that were easy for nonspecialists to follow, ended up mashing together two incompatible statistical systems—p-values and hypothesis testing—into one rote procedure. “P-values were never meant to be used the way we’re using them today,” says biostatistician Steven Goodman of Stanford University.

Although the laundry list of gripes against significance testing is long and rather technical, the complaints center around a common theme: Significance testing’s “scientific spam filter” does a poor job of helping researchers separate the true and important effects from the lookalike ones. The implication is that scientific journals might be littered with claims and conclusions that are not likely to be true. “I believe that psychologists have woken up and come to the realization that some work published in high-impact journals is plain nonsense,” Wagenmakers says.

Not that psychology has a monopoly on publishing results that collapse on closer inspection. For example, gene-hunting researchers in large-scale genomic studies used to be plagued by too many false-alarm results that flagged unimportant genes. But since the field developed new statistical techniques and moved away from the automatic use of p-values, the reliability of results has improved, Leek says.

Confusing as p-values are, however, not everyone is a fan of taking them from researchers’ statistical tool kits. “This might be a case in which the cure is worse than the disease,” Goodman says. “The goal should be the intelligent use of statistics. If the journal is going to take away a tool, however misused, they need to substitute it with something more meaningful.”

One possible replacement that might fit the bill is a rival approach to data analysis called Bayesianism. (The journal said it will consider its use in submitted papers on a “case-by-case basis.”) Bayesianism starts from different principles altogether: Rather than striving for scientifically objective conclusions, this statistical system embraces the subjective, allowing researchers to incorporate their own prior knowledge and beliefs. One obstacle to the widespread use of Bayesianism has been the lack of user-friendly statistical software. To this end Wagenmakers’ team is working to develop a free, open-source statistical software package called JASP. It boasts the tagline: “Bayesian statistics made accessible.”
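As a rough illustration of the Bayesian style of reasoning (not the JASP software itself, just a minimal conjugate-prior sketch with hypothetical numbers), the snippet below updates a prior belief about a success rate with observed data and reports a posterior probability and a credible interval rather than a significant-or-not verdict.

```python
from scipy import stats

# Hypothetical experiment: 34 successes in 50 trials (illustrative numbers).
successes, trials = 34, 50

# Prior belief about the success rate, expressed as a Beta distribution.
# Beta(1, 1) is a flat prior; a researcher with stronger prior knowledge
# could encode it here instead.
prior_a, prior_b = 1, 1

# Conjugate update: the posterior is again a Beta distribution.
posterior = stats.beta(prior_a + successes, prior_b + (trials - successes))

# Instead of a single significance verdict, report the posterior probability
# that the rate exceeds chance (0.5) and a 95% credible interval.
print("P(rate > 0.5 | data) =", 1 - posterior.cdf(0.5))
print("95% credible interval:", posterior.interval(0.95))
```

The choice of prior is exactly the subjective element the article describes: two researchers with different prior beliefs can reach different posteriors from the same data, and Bayesian practice asks them to state that choice openly.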

Other solutions attack the problem from a different angle: human nature. Because researchers in modern science face stiff competition and need to churn out enough statistically significant results for publication, and therefore promotion, it is no surprise that research groups somehow manage to find significant p-values more often than would be expected, a phenomenon dubbed “p-hacking” in 2011 by psychologist Uri Simonsohn of the University of Pennsylvania.
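A small simulation, again with invented numbers, hints at why p-hacking inflates the false-alarm rate: if a researcher measures many outcomes and reports whichever happens to clear the 0.05 bar, pure noise yields “significant” findings far more often than the nominal five percent of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate many "studies" in which the null hypothesis is true (no real effect),
# but the researcher tests 10 different outcome measures and reports whichever
# one happens to reach p < 0.05.
n_studies, n_outcomes, n_subjects = 2_000, 10, 40
false_positives = 0
for _ in range(n_studies):
    data = rng.normal(size=(n_outcomes, n_subjects))  # pure noise
    p_values = [stats.ttest_1samp(outcome, 0.0).pvalue for outcome in data]
    if min(p_values) < 0.05:
        false_positives += 1

print(f"Studies reporting a 'significant' effect: {false_positives / n_studies:.1%}")
# With 10 chances per study, roughly 40 percent of pure-noise studies clear the
# 0.05 bar, versus the nominal 5 percent when only one pre-specified test is run.
```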

Several journals are trying a new approach, spearheaded by psychologist Christopher Chambers of Cardiff University in Wales, in which researchers publicly “preregister” all their study analysis plans in advance. This gives them less wiggle room to engage in the sort of unconscious—or even deliberate—p-hacking that happens when researchers change their analyses in midstream to yield results that are more statistically significant than they would be otherwise. In exchange, researchers get priority for publishing the results of these preregistered studies—even if they end up with a p-value that falls short of the normal publishable standard.

Finally, some statisticians are banking on education being the answer. “P-values are complicated and require training to understand,” Leek says. Science education has yet to fully adapt to a world in which data are both plentiful and unavoidable, without enough statistical consultants to go around, he says, so most researchers are stuck analyzing their own data with only a couple of stats courses under their belts. “Most researchers do not care about the details of statistical methods,” Wagenmakers says. “They use them only to support their claims in a general sense, to be able to tell their colleagues, ‘see, I am allowed to make this claim, because p is less than .05, now stop questioning my result.’”

A new, online nine-course “data science specialization” for professionals with very little background in statistics might change that. Leek and his colleagues at Johns Hopkins rolled out the free courses last year, available via the popular Coursera online continuing education platform, and already two million students have registered. As part of the sequence, Leek says, a full monthlong course will be devoted specifically to understanding methods that allow researchers to convey uncertainty and generalizability of study findings—including, yes, p-values.