Late last month, many scientists submitted their first grant proposals of the year to the National Institutes of Health (NIH), the largest U.S. funder of biomedical research. Each will be graded by a panel of external reviewers—scientists who volunteer to rate the merit of the ideas presented by their peers. The scores peer reviewers award will play a big role in deciding which proposals NIH funds, and the process is extremely competitive: In 2014, the agency funded just 18.8% of the more than 27,000 proposals for a bread-and-butter R01 grant.
In recent years, however, some observers have been questioning whether merit alone determines the outcome of such peer reviews, which many agencies around the world use to award research grants. Some studies have found that certain geographic or demographic groups, such as minorities or researchers from certain states, can fare unusually poorly in funding competitions, raising concerns that bias—conscious or unconscious—is skewing scores. Other experiments have raised questions about the role of randomness and subjectivity in scoring, showing that two groups of reviewers can give the same proposals very different scores, leading to different funding outcomes.
Now, a new computer simulation explores just how sensitive the process might be to bias and randomness. Its answer: very. Small biases can have big consequences, concludes Eugene Day, a health care systems engineer at the Children's Hospital of Philadelphia, in Research Policy. He found that bias that skews scores by just 3% can result in noticeable disparities in funding rates.
The “results are astonishing—funding is exceptionally sensitive to bias,” says ecologist Ruth Hufbauer of Colorado State University, Fort Collins. And although the simplified simulation may not explain exactly what’s happening in the real world, it does offer “a framework for quantifying the big-picture financial effects of institutionalized bias,” says geobiologist Hope Jahren of the University of Hawaii, Manoa. “It shows how systemic bias against any group translates into fewer dollars and cents to [a scientist] belonging to that group, irrespective of other factors.”
To explore the role of bias in peer review, Day created a simplified mathematical model of the review process. Like anatomical models that allow medical students to practice tricky surgical techniques without hurting anyone, Day’s model allowed him to conduct an inexpensive—and victimless—funding experiment. First, Day created two classes of imaginary grant applicants: a “preferred class” and a “non-preferred class.” Each class submitted a pool of 1000 grants to a funder; the two pools were statistically identical in quality.
Next, three computer-generated reviewers scored each grant. (That’s the number of reviewers used in many real-world competitions.) In an ideal world, the scores would match the intrinsic quality of the grant. In the simulation, however, Day introduced some more realistic randomness: Not all reviewers scored the same grant the same way. And in what outsiders call a creative move, Day based the simulation’s randomness on the differences he’s seen in scores given to his own, real-world grants.
“This was a very clever approach,” Hufbauer says, because few researchers publicly report how their grants were scored, obscuring such variation. So Day’s study “highlights that public data on variability in scores are not readily available, yet are a key piece of the puzzle of who is and who isn’t funded,” she says.
In the final step, Day introduced bias. For the grants submitted by the “non-preferred” investigators, the three reviewers could reduce their scores by a little, a lot, or not at all. (No reduction represented unbiased scoring.)
Overall, Day found that he could begin to detect bias in the grant scores of the two groups when it reduced the scores given to the “non-preferred” grants by just 1.9%. And when biased reviewers took more than 2.8% off the scores awarded to grants from the non-preferred applicants, the bias began to affect funding decisions, even though that reduction is smaller than the score differences produced by randomness alone.
In practical terms, the results meant that non-preferred investigators had to submit higher-quality grants to get money, whereas preferred investigators could get relatively lower-quality grants funded. For instance, in a simulation that assumed the top 10% of grants were funded, and that bias reduced the scores of non-preferred applicants by an average of 3.7%, the preferred applicants got 118 funded grants, compared with just 82 for non-preferred applicants. (Unbiased scoring would produce a roughly equal number of grants awarded in each group.)
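The mechanism Day describes can be sketched in a few lines of Python. The quality distribution, the size of the reviewer noise, and the scoring rule below are illustrative assumptions, not the parameters of Day's actual model (he calibrated his randomness to the scores his own grants received), but the structure is the same: two statistically identical pools of 1000 grants, three noisy reviewers per grant, a small percentage knocked off one group's scores, and a funding line drawn at the top 10%.

```python
import random
import statistics

random.seed(1)

N_GRANTS = 1000       # grants submitted by each class
N_REVIEWERS = 3       # reviewers per grant, as in many real competitions
NOISE_SD = 0.05       # per-reviewer random scatter (assumed magnitude)
BIAS = 0.037          # 3.7% average score reduction for non-preferred grants
FUND_FRACTION = 0.10  # the top 10% of all grants are funded

def panel_score(quality, bias):
    """Average of three noisy reviewer scores, reduced by a bias factor."""
    return statistics.mean(
        quality * (1 - bias) + random.gauss(0, NOISE_SD)
        for _ in range(N_REVIEWERS)
    )

# Two statistically identical pools of intrinsic grant quality.
pools = {
    "preferred": [random.gauss(0.5, 0.1) for _ in range(N_GRANTS)],
    "non-preferred": [random.gauss(0.5, 0.1) for _ in range(N_GRANTS)],
}

# Score every grant; only the non-preferred pool is penalized.
scored = [(panel_score(q, BIAS if group == "non-preferred" else 0.0), group)
          for group, qualities in pools.items()
          for q in qualities]

# Fund the top 10% across both pools combined and count awards per class.
scored.sort(reverse=True)
n_funded = int(FUND_FRACTION * len(scored))
counts = {"preferred": 0, "non-preferred": 0}
for _, group in scored[:n_funded]:
    counts[group] += 1
print(counts)
```

Even though the bias here is small relative to the reviewer noise, runs of this sketch consistently award noticeably more grants to the "preferred" class, a gap on the order of the 118-versus-82 split described above.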
The simplified simulation has limitations. For example, many funding agencies don’t rely on scores alone to award all grants; they have program managers who review the scores and use them as one, albeit important, guide to picking winners. And Day writes that “it remains unknown how much bias there is in the real world, and how much that bias influences scores—and thus awards—of grants.”
But the study does highlight “how little reviewer bias is necessary to result in noticeable outcome biases,” he says. And although “bias may be small in comparison to reviewer randomness,” he notes, it can still give an edge to preferred groups. If funding agencies recognize that fact and figure out how to address subtle bias, he adds, they could improve the overall quality of the grant applications they fund.
“That such a minor bias could change funding outcomes is frankly shocking,” Hufbauer says. She adds that it should be a reminder to those who review grant and employment applications that even subtle bias can alter decisions.