How ‘Tiny Shortcuts’ Are Poisoning Science

Seemingly harmless data tweaks are undermining the integrity of the entire field. We must define the problem to prevent it.
By: Thomas Plümper and Eric Neumayer

In 1999, Time magazine featured a famous photo of Albert Einstein on its cover — looking old and tired, his forehead covered in wrinkles, his hair long and gray. The photograph was taken in 1947, during a portrait session with Philippe Halsman in which Einstein expressed remorse for his inadvertent role in the Manhattan Project, the initiative that ultimately culminated in the devastating bombings of Hiroshima and Nagasaki. It would go on to become Halsman’s most iconic image.

This article is adapted from Thomas Plümper and Eric Neumayer’s book “The Credibility Crisis in Science.”

Time magazine rarely places a picture of a celebrity from a historical period on its cover. But in 1999, the editors had good reasons to ignore this rule: The magazine had designated Einstein as the “person of the century,” a distinction that placed him above notable figures like Mahatma Gandhi and Franklin D. Roosevelt, who were the runners-up. It was a great honor for Einstein and for the profession he represented. And Einstein was not the only scientist on Time’s list of the 100 most influential people of the 20th century: The list featured 19 scientists, making them the third most prominently represented professional group, just a shade behind politicians and industrialists. The 20th century was the century of man-made political disasters. But it was also the century of science, and Einstein was its figurehead.

Those days seem to be over. And they may never come back.

In the 21st century, the role and relevance of scientists have changed. Science is no longer triumphant: It is in the midst of a severe crisis. Public trust in scientific results and findings has dwindled, and science does not know how to regain credibility. This crisis has many facets. More than anything else, however, it is a credibility crisis. The public no longer believes that scientists merely make honest mistakes on the long and winding road to truth. Instead, scientists are increasingly seen as partial, ideological agents, activists in an armchair, or, worse still, simply fraudsters who fabricate or manipulate data and tweak the specifications of their empirical models to get their desired results.

The credibility crisis of science is not about scientific progress invalidating previously held scientific beliefs, which is intrinsic to the very nature of scientific revolutions. Rather, the crisis has been caused by scientists who deliberately publish overconfident, misleading, and often simply false empirical results based on research designs or model specifications intentionally chosen to produce the desired results. We call this practice “tweaking.” In extreme cases, published results rely on manipulated or outright fabricated data. Whether tweaked, manipulated, or fabricated, the results often cannot be replicated — not even if replication analysts use identical research designs.

By itself, failure to replicate does not necessarily indicate, let alone prove, scientific fraud. Empirical results can vary for many reasons. However, replication analyses usually show that replicated effect sizes are, on average, systematically smaller and often statistically insignificant. If 90 percent of replications deviate from the original article in the one direction that is less favorable to what the authors wanted to demonstrate, then these deviations are not innocent random errors or acts of nature. If the deviations were random, they would cancel each other out, and their mean would be close to zero.


Instead, these deviations indicate that many published results were likely tweaked, manipulated, or fabricated.
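The statistical logic here can be sketched in a minimal simulation (all numbers are invented for illustration): purely random deviations from a true effect average out to roughly zero, whereas publishing only the most favorable estimates leaves replications systematically smaller than the published originals.

```python
import random
import statistics

random.seed(42)

TRUE_EFFECT = 0.1    # small real effect (made-up value)
NOISE_SD = 0.5       # sampling noise in each study's point estimate
N_STUDIES = 10_000

# Each "original" study estimates the effect with random error.
estimates = [random.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(N_STUDIES)]

# If deviations from the truth are purely random, they cancel out:
# their mean is close to zero.
deviations = [e - TRUE_EFFECT for e in estimates]
print(f"mean deviation, all studies: {statistics.mean(deviations):+.3f}")

# Now suppose only large, "interesting" estimates get published.
published = [e for e in estimates if e > 0.6]

# Replications are fresh, unselected draws, so they fall back
# toward the true effect -- systematically below the published originals.
replications = [random.gauss(TRUE_EFFECT, NOISE_SD) for _ in published]
print(f"mean published estimate:   {statistics.mean(published):.3f}")
print(f"mean replication estimate: {statistics.mean(replications):.3f}")
```

Under these assumed numbers, the replications land well below the published estimates in one direction only — the pattern the replication literature reports.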

Tweaking is potentially more damaging to science in the long run than data manipulation and fabrication. That might be hard to believe, since any single tweaked result does less damage to the fabric of science than a case of fabricated or manipulated data. But the cumulative effect of tweaking can still be larger, because fabrication and manipulation are rare, whereas tweaking is common.

Ever since the online platform Retraction Watch began monitoring and reporting retractions in 2010, the number of retracted articles per year has steadily increased. Some of this is due to “bulk retractions” of thousands of articles published by so-called paper mills, where authors pay to have fake articles published. We are not interested in these retracted paper-mill publications but in variants of data fraud, a subset of retractions that have also been steadily increasing. Most notably, there have been several high-profile retractions involving work by Francesca Gino of Harvard University and Marc Tessier-Lavigne of Stanford University. And these are just the most recent cases — the ones that stick in the public mind for a while before attention is drawn to other, more spectacular cases of scientific fraud.

All of this is to say that scientists no longer sit at God’s table, so to speak. They have become mere mortals in the midst of a massive crisis of trust. Could we go so far as to say that today’s scientific process is broken? Perhaps. But the more correct answer is: It depends.


One of the things it depends most on, of course, is how we define fraud itself. Lee McIntyre, one of the foremost philosophers of science, defines scientific fraud as “the intentional fabrication and falsification of the scientific record.” He distinguishes between fraud, on the one hand, and honest error, on the other, plus a third category in between, which he labels “murky,” where scientists’ motives are not “pure.”

What McIntyre calls the murky category, we call “tweaks.” Tweaks are the intentional manipulation of empirical results through changes in and choices of research design, model specification, and/or estimation procedures. McIntyre restricts fraud to data fabrication and manipulation; the “murky” third category, in his view, does not qualify as fraud. Here is why:

“What about all of those less-than-above-board research practices, p-hacking and cherry-picking data . . . ? Why aren’t those considered fraud the minute they are done intentionally? But the relevant question to making a determination of fraud is not just whether those actions are done intentionally, it is whether they also involve fabrication or falsification. . . . The reason that p-hacking isn’t normally considered fraud isn’t that the person who did it didn’t mean to, it’s that . . . p-hacking is not quite up to the level of falsifying or fabricating data.”

In our view, McIntyre’s definition of data fraud is incomplete and imprecise. It conceals that the fabrication and manipulation of data — and the manipulation of empirical results through tweaking — serve the same purpose: to promote the researcher’s interests.

Consider the case of Diederik Stapel, a fraudster with at least 58 retracted articles under his belt, ranking eighth on the Retraction Watch leaderboard. Stapel came to fame as a fraudster; he has contributed massively to the existential crisis of social psychology. Joel Achenbach, in an article for The Washington Post, called him the “Lying Dutchman.” A fraudster he is, but he is surprisingly willing to talk and write about his fraudulent career. He even wrote a book-length manuscript about his life — an autobiography titled “Faking Science: A True Story of Academic Fraud.” Whenever we need insights from a fraudster’s perspective, Stapel is a good, perhaps the best, source.

Stapel kick-started his fraudulent career, as he himself recounts, by becoming “impatient, overambitious, reckless.” Data analyses do not always align with researchers’ expectations and interests. And so Stapel took the truth into his own hands and decided to take “one, tiny little shortcut.” He tortured the data to bring the results into line with the arguments in his articles. In his autobiography, Stapel explains how he drifted further and further away from the path of virtue: “Everything had to be neat and orderly. No mess. I opened the computer file with the data that I had entered and changed a . . . 2 into a 4; then . . . I changed a 3 into a 5. I . . . made a few mouse clicks to tell the computer to run the statistical analyses. When I saw the results, the world had become logical again.”

In the early stages of his fraudulent career, he eliminated cases he classified as “deviant” — cases that prevented the results from turning out as he expected and wanted. These, in his view, were common practices among social psychologists. “Tiny little shortcuts,” he calls them. Tweaks were Stapel’s gateway drug. Soon after he started to tweak empirical results, he resorted to data fabrication and outright data manipulation. But, in his book, Stapel draws a line in the sand: While he accepts data manipulation and fabrication as fraud, his “tiny little shortcuts” are common practice, and thus not fraud, at least not really. In other words, if everyone cheats, is it still cheating?

The cold reality is that tweaks are not just “tiny little shortcuts”; they are tiny little shortcuts with substantively large consequences. They change the results of empirical analyses, often making manuscripts more interesting. Manuscripts that become more interesting change reviewers’ attitudes toward them, allowing tweakers to publish in more visible journals and with better publishers. When tweakers publish more interesting results in more visible places, they get additional attention for their work, receive better job offers and promotions, and rise to ever greater power and influence.

Make no mistake: Tweaking is not about changing the course of science. Nor is it, at least not primarily, about the misuse of public research funds (although it is a scandal that hard-working taxpayers fund the research of tweakers). Rather, tweaking is about scientists pursuing their own interests in a competitive, vulnerable system based on trust and on freedom from control by institutions that enforce rules.

Is the intentional manipulation of statistical quantities of interest always fraudulent? As with any categorization, there are gray areas.


One of the most common gray areas involves the experimental researcher who, after a first round of experiments, fails to achieve a statistically significant treatment effect. So, they organize a second round of experiments with the very rational expectation that the sheer number of observations will eventually push the p-value below the threshold that separates publishable from unpublishable results. This research practice is common in the life sciences because experiments are costly and may cause unnecessary harm to participants. It may therefore make sense to start with a small sample and only add participants when the results are “not yet significant.”

The problem with ever-increasing sample sizes is that, as the number of observations approaches infinity, the standard error of an estimate (i.e., the measure of its sampling variability) approaches zero. Thus, if your model indicates any effect at all, then as you collect more and more data, the statistical test will inevitably register the effect as significant — no matter how small the effect may be.
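A minimal sketch of this dynamic, using invented numbers and a simple z-test against the conventional 1.96 cutoff: a researcher keeps adding rounds of data until even a trivially small true effect registers as “significant.”

```python
import math
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.05   # a tiny but nonzero true mean (made-up value)
BATCH = 500          # observations added in each new round of data collection

data = []
for _ in range(100):  # keep collecting rounds until "significance" appears
    data.extend(random.gauss(TRUE_EFFECT, 1.0) for _ in range(BATCH))
    n = len(data)
    se = statistics.stdev(data) / math.sqrt(n)   # standard error shrinks with n
    z = statistics.mean(data) / se               # test statistic grows ~ sqrt(n)
    if abs(z) > 1.96:                            # the conventional 5% cutoff
        print(f"'significant' after {n} observations (z = {z:.2f})")
        break
```

The tiny effect is real in this toy setup, so the test is not strictly wrong; what would make the resulting paper misleading is reporting only the final, significant sample while suppressing the earlier insignificant rounds.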

RelatedUnintended Consequences: The Perils of Publication and Citation Bias

Scientists may be reluctant to call the above practice, or p-hacking, “fraudulent.” And indeed, this practice is not fraudulent if a p-hacked study clearly states that the results are insignificant given the original small number of observations and only become significant in a larger sample. But this holds for all adjustments: A change in model specification or research design is not fraudulent if the change and its effects on results are clearly discussed and not suppressed. What makes tweaks fraudulent is not the tweak itself, but the selective reporting of results for the relevant quantities of interest. For example, a gradual increase in sample size is fraudulent if the authors suppress the results of the smaller sample.

Now, are all researchers actually aware of this problem? And do they all collect more and more data until the desired significance appears? Perhaps not. But as we have said, when it comes to tweaking, it is usually impossible to prove intention. At the same time, the existence of a gray area of manipulations that border on the fraudulent does not exonerate what lies beyond it: intentionally dropping a control variable from the list of regressors, adjusting the operationalization of a key variable, or dropping cases from the sample in order to produce desired results does constitute scientific fraud.


Rules have the greatest effect when they are clear, violations are easy to detect, and enforcement is simple and not prohibitively expensive. And here lies the problem with scientific fraud: The more broadly we define scientific fraud, the larger the share of fraudulent analyses that are extremely difficult to detect. The more broadly we define scientific fraud, the more costly enforcement becomes. However, if we define it narrowly and exclude tweaks, science will not be able to appropriately address, let alone overcome, its credibility crisis.

Science is ill-advised to narrow the definition of scientific fraud just to make detection easier and rule enforcement less costly. The negative consequences of scientific fraud are not limited to data manipulation and fabrication; tweaks, too, have the same distorting effect on competition for academic merit and research funding, and the same devastating effect on public confidence in scientific results and on trust between scientists.

Both scientists and the public lose confidence in science when there is a non-trivial chance that scientists manipulated empirical results to support the arguments, theories, hypotheses, and stories they wish to corroborate, or to cast doubt on the arguments, theories, hypotheses, and stories that contradict the worldview they believe in.

Science has lost some of its standing with the public. While skepticism about scientific findings can be healthy and is an inherent part of the scientific process, general disbelief and distrust pose significant challenges. Scientists have a vested interest in regaining some of that lost trust. This is easier said than done. But much would be gained if scientists were honest about the uncertainties associated with scientific results — honest with other scientists in scientific publications and honest in public statements. Scientists must learn to distinguish between scientific results and their personal opinions, promote full transparency in scientific research — not hide potential conflicts of interest — and find ways to improve communication with the public to rebuild trust.


Thomas Plümper is Professor of Quantitative Social Research at the Vienna University of Economics and Business and Head of the Department of Socioeconomics. Eric Neumayer is a Professor at the London School of Economics and Political Science (LSE) and its Deputy President and Vice Chancellor. Together they have coauthored several books, including “Robustness Tests for Quantitative Research” and “The Credibility Crisis in Science,” from which this article is adapted.
