3D protein model of CobS Cobalamin-5-phosphate synthase

Scientists have predicted the 3D structures of hundreds of proteins such as this one.

S. Ovchinnikov et al., Science 355, 6321 (20 January 2017)

Hundreds of elusive protein structures pinned down from genome data

“Once in a while you get shown the light in the strangest of places if you look at it right.”

—“Scarlet Begonias,” The Grateful Dead

When Robert Hunter (lyricist for many Dead tunes) wrote those lines, it’s a safe bet he didn’t have the prediction of new protein structures in mind. But researchers report today that they’ve figured out how to predict the structures of hundreds of unmapped proteins by gleaning insights from one of the strangest of places: “metagenomics” projects that sequence DNA from broad swaths of microbes in the soils and seas.

“This is a major step forward” in determining how proteins fold, says Peter Preusch, who heads cell biology and biophysics research at the National Institute of General Medical Sciences in Bethesda, Maryland. The new work predicts 614 protein structures, representing 12% of the estimated 5211 protein families for which no experimental structure exists. (A protein family is a group of closely related proteins.) “That’s a very significant change,” Preusch adds. The new structures are expected to lead to a raft of insights into the inner workings of cells and possibly pave the way for new medicines. And the technique will likely grow even more powerful as metagenome sequencing efforts proceed.

To appreciate the new results, it helps to take a step back and recall biology’s central dogma. Genes, made up of strings of nucleotides (As, Gs, Cs, and Ts), tell cellular machines how to string together the 20 different amino acid building blocks of proteins. Once constructed, those amino acid strings spontaneously fold up into proteins, each with a specific 3D shape that governs its cellular function.

The trouble is that it’s impossible to know how a protein will fold based on its gene sequence alone. The number of possible configurations is astronomical, although computational biologists have made progress in narrowing the possibilities. Decades of experiments and computational work have revealed which amino acids prefer to nestle close to one another and which stay at arm’s length. That has helped researchers compute the most energetically stable folding patterns, though mostly for relatively small proteins. But for larger ones, the number of variables makes the computation intractable.
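To get a feel for why the search space is "astronomical," here is a back-of-the-envelope sketch in Python. It is not taken from the study: the three shapes per amino acid, the 100-residue chain length, and the sampling rate of 10 trillion shapes per second are all illustrative assumptions, and the function names are hypothetical.

```python
def count_conformations(num_residues, states_per_residue=3):
    """Count chain shapes if each residue independently picks one of a few states.

    Both arguments are illustrative assumptions, not measured values.
    """
    return states_per_residue ** num_residues


def years_to_enumerate(num_conformations, samples_per_second=1e13):
    """Estimate the years needed to try every shape once at an assumed sampling rate."""
    seconds_per_year = 365.25 * 24 * 3600
    return num_conformations / samples_per_second / seconds_per_year


if __name__ == "__main__":
    shapes = count_conformations(100)  # assumed 100-residue protein
    print(f"possible shapes: {shapes:.2e}")
    print(f"years to try them all: {years_to_enumerate(shapes):.2e}")
```

Even under these generous assumptions, trying every shape would take far longer than the age of the universe, which is why researchers need extra constraints to narrow the search.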

In the 1990s, Chris Sander, a computational biologist now at Harvard University, suggested that gene sequence data could help. Sander reasoned that as proteins fold, pairs of amino acids distant from each other on the linear amino acid chain could end up adjacent in the 3D-folded protein, providing a key interaction that allows the protein to hold its shape. If a genetic mutation causes a change to one of those amino acids, it could destroy this interaction, disabling the protein and possibly killing the organism. But in rare cases, genetic mutations may alter both key amino acids at the same time, preserving their interaction so that the protein can continue to do its job. Evolution would favor such tandem mutations, which would cause the amino acid partners to coevolve.

The trick to finding these coevolving pairs, Sander suggested, was to look at the gene sequence of a protein from not just a single organism, but many. Organisms from bacteria to humans share many closely related proteins, and thus genes. By comparing the gene sequences of these shared proteins—say from yeast to bats to bonobos to humans—researchers might be able to spot coevolving snippets of DNA. Any such pairs code for amino acids that are likely to wind up as close neighbors in a 3D structure—just the sort of constraint needed to improve computer folding algorithms.
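The idea of spotting coevolving positions can be illustrated with a small Python sketch. It scores every pair of columns in an aligned set of sequences by mutual information, a simple textbook measure of how strongly two positions vary together. This is an assumption chosen for illustration, not the method used in the study, and published approaches rely on more sophisticated statistical models. The toy alignment, the function names, and the `min_separation` parameter are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def top_coevolving_pairs(alignment, min_separation=3, top=1):
    """Rank column pairs at least min_separation apart by mutual information."""
    length = len(alignment[0])
    scored = [
        (mutual_information(alignment, i, j), i, j)
        for i, j in combinations(range(length), 2)
        if j - i >= min_separation
    ]
    scored.sort(reverse=True)
    return scored[:top]


# Toy alignment (made up): positions 0 and 4 always swap together (K...E or E...K),
# as if a charged pair were being preserved across species.
toy_alignment = ["KAGLE", "KAGLE", "EAGLK", "EAGLK", "KAGIE", "EAGIK"]

if __name__ == "__main__":
    for score, i, j in top_coevolving_pairs(toy_alignment):
        print(f"positions {i} and {j}: {score:.2f} bits")
```

In this made-up example the first and last positions stand out because a change at one is always matched by a change at the other. A real analysis needs many diverse sequences for the signal to rise above chance, which is where large sequencing projects come in.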

Several years ago, separate groups led by Sander and David Baker, a biochemist at the University of Washington in Seattle, showed that the idea worked. Up to now, it has helped them pin down structures for a few dozen proteins. “The limiting thing was getting more sequence data,” Baker says.

Baker’s group has now turbocharged the approach. Today in Science he and his colleagues report that they’ve used their technique in conjunction with metagenome sequencing, in which researchers sequence vast swaths of genome data from unknown organisms in the ocean and soil. By sifting through the sequence data, they were able to track enough coevolving amino acids to pin down the structures of 614 proteins, each one representing an entire family of proteins for which no structures exist. Using these structures as templates, computational biologists should be able to model the structures of thousands of related family members.

The new approach will likely continue to grow more powerful with more sequencing data, Preusch says. However, he says, the effort remains laborious for now, because there is no single repository of metagenome data.