
A new study of reviewer bias is part of NIH’s effort to increase the racial diversity of U.S. biomedical research.

UC Davis College of Engineering (Biomedical Engineering Lab)/Wikimedia Commons

Can fake names tease out NIH reviewer bias?

When the label “white male” is attached to a research grant application, do peer reviewers give it a better score?

That’s the question psychologist Patricia Devine of the University of Wisconsin in Madison has spent the past 4 years, and more than $1 million from the National Institutes of Health (NIH) in Bethesda, Maryland, trying to answer with an unusual experiment. Devine and her team have substituted fictitious names (those stereotypically borne by whites or blacks, and by men or women) on past NIH grant applications to test whether reviewers are biased by race and gender. The study is one of two NIH-funded projects now underway that were spawned by a 2011 finding that black scientists have a much lower chance of receiving an NIH grant than their white counterparts. The other project strips previous applications of all identifying characteristics before subjecting them to a new round of reviews.

That earlier study, led by economist Donna Ginther of the University of Kansas in Lawrence, prompted NIH to launch a $250 million diversity initiative aimed at boosting the tiny number of black scientists earning NIH’s bread-and-butter research grants. Those projects hope to increase the number of competitive applications from this group. In contrast, the Wisconsin study looks at the next step in the process, namely, how those proposals are scored. And though altering names to detect a reviewer’s bias may sound like a simple exercise, it turns out to be quite complicated.

Name that researcher

Devine began by asking researchers who were funded by NIH in 2012 to submit their proposals for the study. (The work is part of a larger project led by Wisconsin’s Molly Carnes to explore how NIH’s vaunted study sections operate.) From that pool she chose 48 grants intended to be representative of NIH’s vast portfolio. (The projects had been funded by NIH’s four largest institutes and reviewed by its three largest study sections.)

The next step was to recruit reviewers. The volunteers, 432 in all, were told that the goal of the study was to improve the NIH review process and that the applications were genuine proposals, but that their judgment would not “count” in determining funding. Each volunteer was asked to complete three reviews, the same number they would have handled had they been serving on a real study section, and was paid $300 as an inducement to participate.

Meanwhile, Devine asked a colleague, psychologist William Cox, to generate four versions of each application. The key difference was the name, chosen so that the reviewers might infer the race—white or black—and gender—male or female—of the applicant.

Although the researchers wanted to preserve as much of the original application as possible, substituting a fictitious name occasionally forced them to make other adjustments as well. One involved the biosketch, the portion of the application that describes the investigator’s background. For example, Devine and Cox decided that having a “black” applicant who was trained at an Asian university might arouse suspicions among reviewers and jeopardize the integrity of the study. “So we chose a U.S. one, if possible, and one of relatively equal prestige,” Devine says.

Another complication was the racial makeup of NIH’s applicant pool. Devine and Cox wanted to simulate the actual NIH review process. So out of the three proposals each reviewer received, at least two were from white men whose names had not been altered. Only one was drawn from the experimental pool, meaning it might or might not include a name suggesting a black applicant or a woman. “We couldn’t send an equal number of black and white applicants because it wouldn’t have been believable,” Devine explains.

Cox says previous studies have used fake names to measure racial and/or gender bias in hiring and promotion practices and in the review of journal articles. But this is the first time the approach has been applied to the NIH grantmaking process. “That’s why NIH was so interested,” he says.

The Wisconsin study, managed by the National Institute of General Medical Sciences, is a bolder step than NIH would feel comfortable taking on its own, says Richard Nakamura, head of NIH’s Center for Scientific Review (CSR). “We didn’t want to deceive them by substituting a fake name,” says Nakamura, referring to the anonymization study CSR is funding that is expected to begin this summer. For that project, any information that might identify the applicant is being removed before a new set of reviewers is asked to judge the proposal’s scientific merit.

Nakamura made his comments last week to the NIH Advisory Committee to the Director, which was discussing the agency’s diversity initiatives. But his boss, NIH Director Francis Collins, provided the punchline. “So you’re saying that the government cannot deceive you, but the government can fund other people to deceive you,” Collins quipped.

Devine hopes to begin the analysis next month, once she receives comments from a scientist who was recruited belatedly to replace a reviewer who dropped out. Devine acknowledges limitations, such as that the sample doesn’t draw from the entire range of proposals that NIH receives. None of the proposals up for re-review came from blacks, for example, and all of them have been funded.

Devine is also cautious about interpreting the results. Even a negative answer won’t rule out bias, she acknowledges. “There could still be bias in the system even if we don’t see reviewer bias,” she notes. “After all, the Ginther study documents what is a real disparity.”