David Baker shows off models of some of the unnatural proteins his team has designed and made.

© Rich Frishman

This protein designer aims to revolutionize medicines and materials

David Baker appreciates nature’s masterpieces. “This is my favorite spot,” says the Seattle native, admiring the views from a terrace at the University of Washington (UW) here. To the south rises Mount Rainier, a 4400-meter glacier-draped volcano; to the west, the white-capped Olympic Mountain range.

But head inside to his lab and it’s quickly apparent that the computational biochemist is far from satisfied with what nature offers, at least when it comes to molecules. On a low-slung coffee table lie eight toy-sized, 3D-printed replicas of proteins. Some resemble rings and balls, others tubes and cages—and none existed before Baker and his colleagues designed and built them. Over the last several years, with a big assist from the genomics and computer revolutions, Baker’s team has all but solved one of the biggest challenges in modern science: figuring out how long strings of amino acids fold up into the 3D proteins that form the working machinery of life. Now, he and colleagues have taken this ability and turned it around to design and then synthesize unnatural proteins intended to act as everything from medicines to materials.


Already, this virtuoso proteinmaking has yielded an experimental HIV vaccine, novel proteins that aim to combat all strains of the influenza virus simultaneously, carrier molecules that can ferry reprogrammed DNA into cells, and new enzymes that help microbes suck carbon dioxide out of the atmosphere and convert it into useful chemicals. Baker’s team and collaborators report making cages that assemble themselves from as many as 120 designer proteins, which could open the door to a new generation of molecular machines.

If the ability to read and write DNA spawned the revolution of molecular biology, the ability to design novel proteins could transform just about everything else. “Nobody knows the implications,” because it has the potential to impact dozens of different disciplines, says John Moult, a protein-folding expert at the University of Maryland, College Park. “It’s going to be totally revolutionary.”

Baker is by no means alone in this pursuit. Efforts to predict how proteins fold, and use that information to fashion novel versions, date back decades. But today he leads the charge. “David has really inspired the field,” says Guy Montelione, a protein structure expert at Rutgers University, New Brunswick, in New Jersey. “That’s what a great scientist does.”

Baker, 53, didn't start out with any such vision. Though both his parents were professors at UW—in physics and atmospheric sciences—Baker says he wasn’t drawn to science growing up. As an undergraduate at Harvard University, Baker tried studying philosophy and social studies. That was “a total waste of time,” he says now. “It was a lot of talk that didn’t necessarily add content.” Biology, where new insights can be tested and verified or discarded, drew him instead, and he pursued a Ph.D. in biochemistry. During a postdoc at the University of California, San Francisco, when he was studying how proteins move inside cells, Baker found himself captivated instead by the puzzle of how they fold. “I liked it because it’s getting at something fundamental.”

In the early 1960s, biochemists at the U.S. National Institutes of Health (NIH) recognized that each protein folds itself into an intrinsic shape. Heat a protein in a solution and its 3D structure will generally unravel. But the NIH group noticed that the proteins they tested refold themselves as soon as they cool, implying that their structure stems from the interactions between different amino acids, rather than from some independent molecular folding machine inside cells. If researchers could determine the strength of all those interactions, they might be able to calculate how any amino acid sequence would assume its final shape. The protein-folding problem was born.

From DNA to proteins

The machinery for building proteins is essential for all life on earth.

One way around the problem is to determine protein structures experimentally, through methods such as x-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. But that’s slow and expensive. Even today, the Protein Data Bank, an international repository, holds the structures of only roughly 110,000 proteins out of the hundreds of millions or more thought to exist.

Knowing the 3D structures of those other proteins would offer biochemists vital insights into each molecule’s function, such as whether it serves to ferry ions across a cell membrane or catalyze a chemical reaction. It would also give chemists valuable clues to designing new medicines. So, instead of waiting for the experimentalists, computational researchers such as Baker have tackled the folding problem with computer models.

They’ve come up with two broad kinds of folding models. So-called homology models compare the amino acid sequence of a target protein with that of a template—a protein with a similar sequence and a known 3D structure. The models adjust their prediction for the target’s shape based on the differences between its amino acid sequence and that of the template. But there’s a major drawback: There simply aren’t enough proteins with known structures to provide templates—despite costly efforts to perform industrial-scale x-ray crystallography and NMR spectroscopy.
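The first step of the homology approach, choosing a template whose sequence resembles the target's, can be illustrated with a toy calculation. The Python sketch below is only an illustration under stated assumptions: the function names `percent_identity` and `pick_template`, the template labels, and the short sequences are all invented for this example, and the sequences are assumed to be already aligned to equal length (with "-" marking gaps), a step that real modeling software handles itself.

```python
from typing import Dict, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions carrying the same amino acid.

    Columns where either sequence has a gap ("-") are ignored.
    Assumes the two sequences were aligned beforehand.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def pick_template(target: str, templates: Dict[str, str]) -> Tuple[str, float]:
    """Return the name and identity score of the most similar template."""
    best = max(templates, key=lambda name: percent_identity(target, templates[name]))
    return best, percent_identity(target, templates[best])


# Hypothetical toy data: one target and two candidate templates.
target = "MKTAYIAK"
templates = {"tmplA": "MKTAYLAK", "tmplB": "MRSGY-AK"}
best_name, best_score = pick_template(target, templates)
print(best_name, best_score)
```

With these made-up sequences, the first template differs from the target at a single position, so it is the one selected. When no candidate scores well, there is no usable template, which is the drawback described above.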

V. Altounian/Science

Templates were even scarcer more than 2 decades ago, when Baker accepted his first faculty position at UW. That prompted him to pursue a second path, known as ab initio modeling, which calculates the push and pull between neighboring amino acids to predict a structure. Baker also set up a biochemistry lab to study amino acid interactions, in order to improve his models.

Early on, Baker and Kim Simons, one of his first students, created an ab initio folding program called Rosetta, which broke new ground by scanning a target protein for short amino acid stretches that typically fold in known patterns and using that information to help pin down the molecule’s overall 3D configuration. Rosetta required such extensive computations that Baker’s team quickly found themselves outgrowing their computer resources at UW.

Seeking more computing power, they created a crowdsourcing extension called Rosetta@home, which allows people to contribute idle computer time to crunching the calculations needed to survey all the likely protein folds. Later, they added a video game extension called Foldit, allowing remote users to apply their instinctive protein-folding insights to guide Rosetta’s search. The approach has spawned an international community of more than 1 million users and nearly two dozen related software packages that do everything from designing novel proteins to predicting the way proteins interact with DNA.

“The most brilliant thing David has done is build a community,” says Neil King, a former Baker postdoc, now an investigator at UW’s Institute for Protein Design (IPD). Some 400 active scientists continually update and improve the Rosetta software. The program is free for academics and nonprofit users, but there’s a $35,000 fee for companies. Proceeds are plowed back into research and an annual party called RosettaCon in Leavenworth, Washington, where attendees mix mountain hikes and scientific talks.

Despite this success, Rosetta was limited. The software was often accurate at predicting structures for small proteins, fewer than 100 amino acids in length. Yet, like other ab initio programs, it struggled with larger proteins. Several years ago, Baker began to doubt that he or anyone else would ever manage to solve most protein structures. “I wasn’t sure whether I would get there.”

Now, he says, “I don’t feel that way anymore.”

What changed his outlook was a technique first proposed in the 1990s by computational biologist Chris Sander, then with the European Molecular Biology Laboratory in Heidelberg, Germany, and now with Harvard. Those were the early days of whole genome sequencing, when biologists were beginning to decipher the entire DNA sequences of microbes and other organisms. Sander and others wondered whether gene sequences could help identify pairs of amino acids that, although distant from each other on the unfolded proteins, have to wind up next to each other after the protein folds into its 3D structure.

Clues from genome sequences

Comparing the DNA of similar proteins from different organisms shows that certain pairs of amino acids evolve in tandem—when one changes, so does the other. This suggests they are neighbors in the folded protein, a clue for predicting structure. 

V. Altounian/Science

Sander reasoned that the juxtaposition of those amino acids must be crucial to a protein’s function. If a mutation occurs, changing one of the amino acids so that it no longer interacts with its partner, the protein might no longer work, and the organism could suffer or die. But if both neighboring amino acids are mutated at the same time, they might continue to interact, and the protein might work as well or even better.

The upshot, Sander proposed, was that certain pairs of amino acids necessary to a protein’s structure would likely evolve together. And researchers would be able to read out that history by comparing the DNA sequences of genes from closely related proteins in different organisms. Whenever such DNA revealed pairs of amino acids that appeared to evolve in lockstep, it would suggest that they were close neighbors in the folded protein. Put enough of those constraints on amino acid positions into an ab initio computer model, and the program might be able to work out a protein’s full 3D structure.
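The idea of reading coevolution out of aligned sequences can be sketched in a few lines of Python. The example below scores every pair of alignment columns with mutual information, a simple textbook measure of how strongly two positions vary together. It is not the statistical algorithm described in this article, and plain mutual information is prone to the kind of false positives that more careful methods are designed to remove. The function names, the `min_separation` parameter, and the four-sequence alignment are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def top_coevolving_pairs(alignment, min_separation=3, top=5):
    """Rank column pairs that are far apart in sequence by how strongly they covary."""
    columns = list(zip(*alignment))  # transpose: one tuple per alignment column
    scores = []
    for i, j in combinations(range(len(columns)), 2):
        if j - i >= min_separation:
            scores.append((mutual_information(columns[i], columns[j]), i, j))
    scores.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j, round(s, 3)) for s, i, j in scores[:top]]


# Hypothetical toy alignment of the "same" protein from four organisms.
# Columns 0 and 5 change together (A with D, E with K); column 2 varies independently.
alignment = [
    "AKLGVD",
    "AKIGVD",
    "EKLGVK",
    "EKIGVK",
]
print(top_coevolving_pairs(alignment, top=1))
```

In this toy alignment, the first and last positions always change together, so they receive the highest score and would be flagged as likely neighbors in the folded protein. The position that varies on its own scores zero against both.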

Unfortunately, Sander says, his idea “was a little ahead of its time.” In the 1990s, there weren’t enough high-quality DNA sequence data from enough similar proteins to track coevolving amino acids.

By the early part of this decade, however, DNA sequences were flooding in thanks to new gene-sequencing technology. Sander had also teamed up with Debora Marks at Harvard Medical School in Boston to devise a statistical algorithm capable of teasing out real coevolving pairs from the false positives that plagued early efforts. In a 2011 article in PLOS ONE, Sander, Marks, and colleagues reported that the coevolution technique could constrain the position of dozens of pairs of amino acids in 15 proteins—each from a different structural family—and work out their structures. Since then, Sander and Marks have shown that they can decipher the structure of a wide variety of proteins for which there are no homology templates. “It has changed the protein-folding game,” Sander says.

It certainly did so for Baker. When he and colleagues realized that scanning genomes offered new constraints for Rosetta’s ab initio calculations, they seized the opportunity. They were already incorporating constraints from NMR and other techniques. So they rushed to write a new software program, called Gremlin, to automatically compare gene sequences and come up with all the likely coevolving amino acid pairs. “It was a natural for us to put them into Rosetta,” Baker says.

The results have been powerful. Rosetta was already widely considered the best ab initio model. Two years ago, Baker and colleagues used their combined approach for the first time in an international protein-folding competition, the 11th Critical Assessment of protein Structure Prediction (CASP). The contest asks modelers to compute the structures of a suite of proteins for which experimental structures are just being worked out by x-ray crystallography or NMR. After modelers submit their predictions, CASP’s organizers then reveal the actual experimental structures. One submission from Baker’s team, on a large protein known as T0806, came back nearly identical to the experimental structure. Moult, who heads CASP, says the judge who reviewed the predicted structure immediately fired off an email to him saying “either someone solved the protein-folding problem, or cheated.”
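One common way to put a number on how closely a predicted structure matches an experimental one is the root-mean-square deviation (RMSD) between corresponding atoms. The Python sketch below is an assumption made for illustration, not a description of how CASP's judges actually score submissions. It also assumes the two coordinate sets have already been superimposed, and the function name `rmsd` and the coordinates are invented.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between matched (x, y, z) coordinates.

    Assumes both lists hold the same atoms in the same order and that
    the two structures have already been superimposed.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(predicted, experimental):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(predicted))


# Hypothetical coordinates in angstroms for a two-atom "structure".
experimental = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
perfect_guess = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
poor_guess = [(0.0, 3.0, 4.0), (3.0, 3.0, 4.0)]  # every atom is 5 angstroms off
print(rmsd(perfect_guess, experimental), rmsd(poor_guess, experimental))
```

A value near zero means the prediction sits almost exactly on top of the experimental structure. Larger values mean the atoms are, on average, farther from where the experiment places them.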

“We didn’t [cheat],” Sergey Ovchinnikov, a grad student in Baker’s lab, says with a chuckle.

The implications are profound. Five years ago, ab initio models had determined structures for just 56 proteins of the estimated 8000 protein families for which there is no template. Since then, Baker’s team alone has added 900 and counting, and Marks believes the approach will already work for 4700 families. With genome sequence data now pouring into scientific databases, it will likely be only a couple of years before protein-folding models have enough coevolution data to solve structures for nearly any protein, Baker and Sander predict. Moult agrees. “I have been waiting 10 years for a breakthrough,” he says. “This seems to me a breakthrough.”

For Baker, it's only the beginning. With Rosetta’s steadily improving algorithms and ever-greater computing power, his team has in essence mastered the rules for folding—and they’ve begun to use that understanding to try to one-up nature’s creations. “Almost everything in biomedicine could be impacted by an ability to build better proteins,” says Harvard synthetic biologist George Church.

In a protein-folding competition, Baker's team stunned judges by almost matching the actual structure.

V. Altounian/Science

Baker notes that for decades researchers pursued a strategy he refers to as “Neandertal protein design,” tweaking the genes for existing proteins to get them to do new things. “We were limited by what existed in nature. ... We can now short-cut evolution and design proteins to solve modern-day problems.”

Take medicines, such as drugs to combat the influenza virus. Flu viruses come in many strains that mutate rapidly, which makes it difficult to find molecules that can knock them all out. But every strain contains a protein called hemagglutinin that helps it invade host cells, and a portion of the molecule, known as the stem, remains similar across many strains. Earlier this year, Baker teamed up with researchers at the Scripps Research Institute in San Diego, California, and elsewhere to develop a novel protein that would bind to the hemagglutinin stem and thereby prevent the virus from invading cells.

The effort required 80 rounds of designing the protein, engineering microbes to make it, testing it in the lab, and reworking the structure. But in the 4 February issue of PLOS Pathogens, the researchers reported that when they administered their final creation to mice and then injected them with a normally lethal dose of flu virus, the rodents were protected. “It’s more effective than 10 times the dose of Tamiflu,” an antiviral drug currently on the market, says Aaron Chevalier, a former Baker Ph.D. student who now works at Virvio, a Seattle biotech company that is working to commercialize the protein as a universal antiflu drug.

Another potential addition to the medicine cabinet: a designer protein that chops up gluten, the infamous substance in wheat and other grains that people with celiac disease or gluten sensitivity have trouble digesting. Ingrid Swanson Pultz began crafting the gluten-breaker even before joining Baker’s lab as a postdoc and is now testing it in animals and working with IPD to commercialize the research. And those self-assembling cages that debut this week could one day be filled with drugs or therapeutic snippets of DNA or RNA that can be delivered to disease sites throughout the body.

The potential of these unnatural proteins isn’t limited to medicines. Baker, King, and their colleagues have also attached up to 120 copies of a molecule called green fluorescent protein to the new cages, creating nano-lanterns that could aid research by lighting up as they move through tissues.

Church says he believes that designer proteins might soon rewrite the biology inside cells. In a paper last year in eLife, he, Baker, and colleagues designed proteins to bind to either a hormone or a heart disease drug inside cells, and then regulate the activity of a DNA-cutting enzyme, Cas9, that is part of the popular CRISPR genome-editing system. “The ability to design sensors [inside cells] is going to be big,” Church says. The strategy could allow researchers or physicians to target the powerful gene-editing system to a specific set of cells—those that are responding to a hormone or drug. Biosensors could also make it possible to switch on the expression of specific genes as needed to break down toxins or alert the immune cells to invaders or cancer.

Protein for every purpose

The ability to predict how an amino acid sequence will fold—and hence how the protein will function—opens the way to designing novel proteins that can catalyze specific chemical reactions or act as medicines or materials. Genes for these proteins can be synthesized and inserted into microbes, which build the proteins.

2D arrays can be used as nanomaterials in various applications.

Information can be coded into protein sequences, like DNA.

Antagonists bind to a target protein, blocking its activation.

Channels through membranes act as gateways.

Cages can contain medicinal cargo or carry it on their surfaces.

Sensors travel throughout the body to detect various signals.

Baker’s lab is abuzz with other projects. Last year, his group and collaborators reported engineering into bacteria a completely new metabolic pathway, complete with a designer protein that enabled the microbes to convert atmospheric carbon dioxide into fuels and chemicals. Two years ago, they unveiled in Science proteins that spontaneously arrange themselves in a flat layer, like interlocking tiles on a bathroom floor. Such surfaces may lead to novel types of solar cells and electronic devices.

In perhaps the most thought-provoking project, Baker’s team has designed proteins to carry information, imitating the way DNA’s four nucleic acid letters bind and entwine in the genetic molecule’s famed double helix. For now, these protein helixes can’t convey genetic information that cells can read. But they symbolize something profound: Protein designers have shed nature’s constraints and are now only limited by their imagination. “We can now build a whole new world of functional proteins,” Baker says.