Every day, judges across the United States face harrowing decisions: How many years should they give the bipolar woman convicted of murder? Should they jail the young—possibly innocent—man awaiting trial, or release him on bail, where he could commit a crime? Facing overflowing dockets, courts are increasingly using computer-based tools to help make those choices. Now, a new study suggests that one widely used tool—an algorithm that calculates “risk scores” for defendants in sentencing or bail hearings—is no better than people armed with a few key pieces of information.
“A fancy model isn’t necessarily a better model,” says David Robinson, who studies predictive analytics and governance at Georgetown University in Washington, D.C., but wasn’t involved in the new work.
Being accused of a crime—even a minor one such as trespassing—could land you in jail. But if you’re considered “low risk,” or if jails are overcrowded, you might get to go home before your trial. To make sure judges were treating all defendants fairly, U.S. courts in the 1980s started requiring jail staff to collect data on defendants’ finances, families, friends, and drug and criminal histories. That information was often packaged into a recommendation and passed on to judges, who were free to use it—or not.
But in dozens of states, those risk assessment tools are moving from pen-and-paper calculations to complex algorithms, many of them proprietary. Few have been independently studied, raising concerns among researchers and civil rights advocates. Some worry that machines carry an authority unmatched by humans, leading to a greater reliance on their data; others say the “secret sauce” of the algorithms can lead to unfair outcomes. For example, a contested 2016 study by investigative reporters at ProPublica found that one system, Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), disproportionately classified black offenders as being at high risk of rearrest—and white offenders as low risk.
Those findings intrigued Julia Dressel, a computer science major at Dartmouth College. She set out to answer a more basic question: Are humans or machines better at assessing risk? To find out, she downloaded the ProPublica database, a collection of COMPAS scores for 10,000 defendants awaiting trial in Broward County, Florida, as well as their arrest records for the next 2 years.
Dressel randomly selected 1000 of the defendants and recorded seven pieces of information about each, including their age, sex, and number of previous arrests. She then recruited 400 people using Amazon Mechanical Turk, an online crowdsourcing service for finding research volunteers. Each volunteer received profiles of 50 defendants and was asked to predict whether each would be rearrested within 2 years, the same standard COMPAS uses. The humans got it right about as often as the algorithm: between 63% and 67% of the time, compared with about 65% for COMPAS, she reports today in Science Advances.
Dressel was surprised. So was Megan Stevenson, an economist and legal scholar at George Mason University in Arlington, Virginia, who found that a similar risk assessment system in Kentucky hasn’t changed the number of prisoners released on bail. Stevenson says she always assumed algorithms were at least somewhat better than people at assessing risk, so the new study—which she calls the first “horse race” between man and algorithm—left her “quite shocked.”
In a second experiment, Dressel and her adviser, Dartmouth computer scientist Hany Farid, explored whether a simple algorithm could beat COMPAS, which typically uses six factors from a 137-item questionnaire to assess risk. (A common misperception is that all 137 items are used to score risk, when most determine which rehabilitation programs an offender might qualify for.) They created their own algorithm, ultimately settling on just two factors: age and number of prior convictions. Plugging that information into a simple formula yielded predictions that were right about 67% of the time, similar to the roughly 65% that COMPAS achieved.
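To make the idea of a two-factor formula concrete, here is a minimal sketch in Python. The article does not spell out the formula, so logistic regression is assumed here as one common choice. The records in `TOY_DATA`, the age rescaling in `features`, and the function names are all hypothetical and invented for illustration; none of the numbers come from the study.

```python
import math

# Hypothetical toy records: (age, prior_convictions, rearrested_within_2_years).
# These values are invented for illustration and are not from the study's data.
TOY_DATA = [
    (19, 3, 1), (22, 5, 1), (24, 0, 1), (27, 4, 1), (30, 6, 1), (45, 8, 1),
    (23, 0, 0), (35, 1, 0), (41, 0, 0), (50, 2, 0), (58, 1, 0), (62, 0, 0),
]


def features(age, priors):
    """Return [bias, rescaled age, priors]; age is centered so training is stable."""
    return [1.0, (age - 40.0) / 10.0, float(priors)]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit(data, learning_rate=0.1, epochs=2000):
    """Fit logistic-regression weights with plain batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        gradient = [0.0, 0.0, 0.0]
        for age, priors, label in data:
            x = features(age, priors)
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - label
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        weights = [w - learning_rate * g / len(data) for w, g in zip(weights, gradient)]
    return weights


def predict_risk(weights, age, priors):
    """Estimated probability of rearrest for one defendant."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, features(age, priors))))


# Assumed usage: fit on the toy records, then compare two made-up profiles.
weights = fit(TOY_DATA)
print(round(predict_risk(weights, 20, 6), 2), round(predict_risk(weights, 60, 0), 2))
```

On these made-up records the model rates a young defendant with several priors as riskier than an older one with none, which is the pattern Robinson describes. A real evaluation would fit on one set of defendants and measure accuracy on another.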
Robinson says those results reflect something long known in criminology: If you’re young, you’re risky. But just how those results would be translated into the criminal justice system is a mystery, he adds. That’s because the study looked at untrained volunteers, rather than real judges. What’s more, the volunteers were given a real-time feedback score—something impossible to introduce in a courtroom.
Tim Brennan, who created COMPAS in 1998 while at Northpointe (now Equivant) in Canton, Ohio, says that far from undercutting his approach, the new study validates it. Seventy percent accuracy, he says, has long been considered the “speed limit” of such prediction systems, and the fact that humans did no better is encouraging.
But humans are still no better than machines at eliminating bias, notes mathematician Cathy O’Neil, founder of O’Neil Risk Consulting & Algorithmic Auditing in New York City. Dressel’s study found that people were about as likely as COMPAS to overstate rearrest risks for black defendants and understate risks for white defendants: Volunteers incorrectly flagged black defendants as high risk 37.1% of the time (compared with 40.4% for COMPAS) and white defendants as low risk 40.3% of the time (compared with 47.9% for COMPAS).
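Error rates like these can be tallied per group from a list of predictions and outcomes. The Python sketch below assumes the usual definitions: the false positive rate is the share of people who were not rearrested but were flagged high risk, and the false negative rate is the share of people who were rearrested but were flagged low risk. The `Outcome` record, the group labels, and the toy counts are hypothetical and are not the study's data.

```python
from collections import defaultdict
from typing import NamedTuple


class Outcome(NamedTuple):
    group: str               # hypothetical group label
    flagged_high_risk: bool  # the prediction (human or algorithm)
    rearrested: bool         # what actually happened within the follow-up window


def error_rates_by_group(outcomes):
    """Return false positive and false negative rates for each group."""
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for o in outcomes:
        if o.rearrested:
            key = "tp" if o.flagged_high_risk else "fn"
        else:
            key = "fp" if o.flagged_high_risk else "tn"
        counts[o.group][key] += 1

    rates = {}
    for group, c in counts.items():
        not_rearrested = c["fp"] + c["tn"]
        rearrested = c["fn"] + c["tp"]
        rates[group] = {
            # Wrongly flagged high risk, among those who were not rearrested.
            "false_positive_rate": c["fp"] / not_rearrested if not_rearrested else float("nan"),
            # Wrongly flagged low risk, among those who were rearrested.
            "false_negative_rate": c["fn"] / rearrested if rearrested else float("nan"),
        }
    return rates


# Invented counts, chosen only so the arithmetic is easy to follow.
TOY_OUTCOMES = (
    [Outcome("group_a", True, False)] * 2 + [Outcome("group_a", False, False)] * 2
    + [Outcome("group_a", True, True)] * 3 + [Outcome("group_a", False, True)] * 1
    + [Outcome("group_b", True, False)] * 1 + [Outcome("group_b", False, False)] * 3
    + [Outcome("group_b", True, True)] * 2 + [Outcome("group_b", False, True)] * 2
)

rates = error_rates_by_group(TOY_OUTCOMES)
print(rates)
```

In the toy data both groups have the same number of correct predictions, yet the mistakes fall differently: one group is more often wrongly flagged high risk, the other more often wrongly flagged low risk. That is the kind of disparity the percentages above describe.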
That’s troubling, given that similar algorithms are increasingly influencing not only court decisions, but also loan approvals, teacher evaluations, and even whether child abuse charges can be investigated by the state.
“People get awed by mathematical sophistication, but it’s mostly a distraction,” says O’Neil. She notes our algorithms are no better than us—or the data we feed them. “At the end of the day… all we can do is make it biased in a way we’re comfortable with. There’s nothing objective about putting people in prison.”