Artificial intelligence (AI) researchers are hoping to use the tools of their discipline to solve a growing problem: how to identify and choose reviewers who can knowledgeably vet the rising flood of papers submitted to large computer science conferences.
In most scientific fields, journals act as the main venues of peer review and publication, and editors have time to assign papers to appropriate reviewers using professional judgment. But in computer science, finding reviewers is often by necessity a more rushed affair: Most manuscripts are submitted all at once for annual conferences, leaving some organizers only a week or so to assign thousands of papers to a pool of thousands of reviewers.
This system is under strain: In the past 5 years, submissions to large AI conferences have more than quadrupled, leaving organizers scrambling to keep up. One example of the workload crush: The annual AI Conference on Neural Information Processing Systems (NeurIPS)—the discipline’s largest—received more than 9000 submissions for its December 2020 event, 40% more than the previous year. Organizers had to assign 31,000 reviews to about 7000 reviewers. “It is extremely tiring and stressful,” says Marc’Aurelio Ranzato, general chair of this year’s NeurIPS. “A board member called this a herculean effort, and it really is!”
Fortunately, they had help from AI. Organizers used existing software, called the Toronto Paper Matching System (TPMS), to help assign papers to reviewers. TPMS, which is also used at other conferences, calculates the affinity between submitted papers and reviewers’ expertise by comparing the text in submissions and reviewers’ papers. The sifting is part of a matching system in which reviewers also bid on papers they want to review.
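To make the idea of text-based affinity concrete, here is a minimal sketch that scores a submission against a reviewer's past papers by comparing word counts. It is an illustration only, not the TPMS model: the bag-of-words representation, the cosine score, and the names `bag_of_words` and `cosine_affinity` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_affinity(paper_text, reviewer_texts):
    """Cosine similarity between a submission and a reviewer's pooled papers.

    Hypothetical stand-in for an affinity score; real systems use
    richer text models than raw word counts.
    """
    paper = bag_of_words(paper_text)
    reviewer = bag_of_words(" ".join(reviewer_texts))
    # Counter returns 0 for words the reviewer never used.
    dot = sum(count * reviewer[word] for word, count in paper.items())
    paper_norm = math.sqrt(sum(c * c for c in paper.values()))
    reviewer_norm = math.sqrt(sum(c * c for c in reviewer.values()))
    norm = paper_norm * reviewer_norm
    return dot / norm if norm else 0.0


submission = "neural network image classification"
vision_reviewer = ["image classification with deep neural network models"]
theory_reviewer = ["regret bounds for online convex optimization"]

vision_score = cosine_affinity(submission, vision_reviewer)
theory_score = cosine_affinity(submission, theory_reviewer)
```

In this toy case the reviewer whose past paper shares vocabulary with the submission scores higher than the one whose paper shares none, which is the signal a matching system would then combine with reviewers' bids.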
But newer AI software could improve on that approach. One newer affinity-measuring system, developed by the paper-reviewing platform OpenReview, uses a neural network—a machine learning algorithm inspired by the brain’s wiring—to analyze paper titles and abstracts, creating a richer representation of their content. Several computer science conferences, including NeurIPS, will begin to use it this year in combination with TPMS, say Melisa Bok and Haw-Shiuan Chang, computer scientists at OpenReview and the University of Massachusetts, Amherst.
AI conference organizers hope that by improving the quality of the matches, they will improve the quality of the resulting peer reviews and the conferences’ published literature. A 2014 study suggests there’s room for progress: As a test, 10% of papers submitted to NeurIPS that year were reviewed by two sets of reviewers. Of papers accepted by one group, the other group accepted only 57%. Many factors could explain the discrepancy, but one possibility is that at least one panel for each paper lacked sufficient relevant expertise to evaluate it.
To promote good matches, Ivan Stelmakh, a computer scientist at Carnegie Mellon University, developed an algorithm called PeerReview4All. Typically, a matching system maximizes the average affinity between papers and reviewers, even if that means some papers get very well-matched reviewers while others unfairly get poorly matched ones. PeerReview4All instead maximizes the quality of the worst match, with an eye toward avoiding poor matches and increasing fairness.
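The difference between the two objectives can be seen in a toy example. The sketch below is not PeerReview4All itself: it brute-forces every one-to-one assignment for a tiny, made-up affinity table (one reviewer per paper, equal numbers of each) and compares the assignment with the highest total affinity against the one whose weakest match is strongest. The function name `best_assignment` and the numbers are hypothetical.

```python
from itertools import permutations


def best_assignment(affinity, objective):
    """Return the one-to-one assignment that maximizes `objective`.

    affinity[p][r] is the affinity of paper p to reviewer r (square table).
    The result is a tuple where entry p is the reviewer given to paper p.
    Brute force over all permutations, so only suitable for tiny examples.
    """
    n = len(affinity)

    def score(perm):
        return objective([affinity[p][perm[p]] for p in range(n)])

    return max(permutations(range(n)), key=score)


# Made-up affinities: rows are papers, columns are reviewers.
affinity = [
    [1.0, 0.5],
    [0.5, 0.2],
]

# With a fixed number of papers, maximizing the sum also maximizes the average.
by_total = best_assignment(affinity, sum)
# Maximizing the minimum protects the worst-matched paper.
by_worst = best_assignment(affinity, min)
```

Here the total-affinity objective gives paper 0 its ideal reviewer and leaves paper 1 with a 0.2 match, while the worst-match objective swaps the reviewers so that both papers get a 0.5 match.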
Last year, Stelmakh experimented with using PeerReview4All at the International Conference on Machine Learning (ICML) and reported the results in February at another meeting, the Association for the Advancement of Artificial Intelligence (AAAI) conference. The method improved fairness significantly without harming average match quality, he concluded. OpenReview has also begun to offer a system aimed at increasing fairness, called FairFlow. NeurIPS will try at least one of these this year, says Alina Beygelzimer, a computer scientist at Yahoo and the NeurIPS 2021 senior program chair. “NeurIPS has a long history of experimentation.”
These systems all match a known set of papers to a known set of reviewers. But as the field grows, it will need to recruit, evaluate, and train new reviewers, conference organizers say. A recent experiment led by Stelmakh explored one way, which did not rely on AI, to ease those tasks. At last year’s ICML, he and collaborators used emails and word of mouth to invite students and recent graduates to review unpublished papers collected from colleagues; 134 agreed. Based on evaluations of those reviews, the team invited 52 to join the ICML reviewer pool and assigned each a senior researcher who acted as a mentor. In the end, the novices’ ICML reviews were at least as good as those of seasoned reviewers, as judged by metareviewers, Stelmakh reported at the AAAI meeting. He says organizers could potentially scale up the process to recruit hundreds of reviewers without too much burden. “There was a lot of enthusiasm from candidate reviewers” who participated in the experiment, Stelmakh says.
Matching systems that use affinity to measure reviewer expertise also let prospective reviewers bid on papers to review, and some recent work has attempted to address potential bias in this approach. Researchers have heard tales of bidders picking only their friends’ papers, essentially hacking the algorithm. A preprint posted on the arXiv server in February describes a countermeasure that uses machine learning to filter out suspicious bids. On a simulated data set, it reduced manipulation, even when potential cheaters knew how the system operated, without reducing match quality. Another algorithm, presented at NeurIPS last year, caps any one reviewer’s chance of being assigned a given manuscript, making it unlikely that several friends of an author who bid on it will all be assigned. Researchers demonstrated the method’s effectiveness in reducing manipulation using a combination of simulated bids and real data from a previous conference.
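A rough calculation shows why capping each reviewer's assignment probability makes coordinated bidding less effective. The sketch below assumes, purely for illustration, that each colluding reviewer is assigned independently with probability no greater than the cap; the published method does not necessarily sample assignments independently, and the function name and numbers here are hypothetical.

```python
def chance_all_friends_assigned(cap, num_friends):
    """Upper bound on the chance that every colluding reviewer gets the paper.

    Assumes (for illustration only) that each friend is assigned
    independently with probability at most `cap`.
    """
    if not 0.0 <= cap <= 1.0:
        raise ValueError("cap must be a probability between 0 and 1")
    if num_friends < 0:
        raise ValueError("num_friends must be non-negative")
    return cap ** num_friends


# Without a cap, strong bids could make each assignment nearly certain.
uncapped = chance_all_friends_assigned(1.0, 3)
# With each reviewer's chance limited to one-half, three friends rarely all succeed.
capped = chance_all_friends_assigned(0.5, 3)
```

Under these assumptions, a cap of one-half cuts the chance that three friends all land on the same paper from certainty to at most one in eight, and the bound shrinks further as the cap is lowered.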
One problem with the tools is that it’s difficult to evaluate how much they outperform alternative methods in real-world settings. Hard evidence would require controlled trials, but there have been none, says Laurent Charlin, a computer scientist at the University of Montreal. In part, that’s because many of these tools are new.
As they evolve, methods like these could also one day help journal editors outside computer science find peer reviewers—but so far uptake has been limited, says Charlin, who led the development of the TPMS affinity-measuring tool about 10 years ago. (Meagan Phelan, a spokesperson for AAAS, which publishes the Science family of journals, says they do not use AI in assigning peer reviewers.)
But in AI, Charlin says, “We are quite comfortable as a field with some level of automation. We have no reason not to use our own tools.”
*Correction, 8 April, 10:10 a.m.: This article has been updated to correctly describe software developed to reduce the risk that would-be reviewers of conference manuscripts can manipulate the process.