In early 2017, epidemiologist Rory Collins at the University of Oxford in the United Kingdom and his team faced a test of their principles. They run the UK Biobank (UKB), a huge research project probing the health and genetics of 500,000 British people. They were planning their most sought-after data release yet: genetic profiles for all half-million participants. Three hundred research groups had signed up to download 8 terabytes of data—the equivalent of more than 5000 streamed movies. That's enough to tie up a home computer for weeks, threatening a key goal of the UKB: to give equal access to any qualified researcher in the world.
"We wanted to create a level playing field" so that someone at a big center with a supercomputer was at no more of an advantage than a postdoc in Scotland with a smaller computer and slower internet link, says Oxford's Naomi Allen, the project's chief epidemiologist. They came up with a plan: They gave researchers 3 weeks to download the encrypted files. Then, on 19 July 2017, they released a final encryption key, firing the starting gun for a scientific race.
Within a couple of days, one U.S. group had done quick analyses linking more than 120,000 genetic markers to more than 2000 diseases and traits, data it eventually put up on a blog. Only 60,000 markers had previously been tied to disease, says human geneticist Eric Lander, president and director of the Broad Institute in Cambridge, Massachusetts. "[They] doubled that in a week."
Within 2 weeks, others had begun to post draft manuscripts on the bioRxiv preprint site. By now, those data have spawned dozens of papers in journals or on bioRxiv, firming up how particular genes contribute to heart disease, diabetes, Alzheimer's, and other conditions, as well as genes' role in shaping personality, depression, birth weight, insomnia, and other traits. More controversially, data from the trove also pointed to DNA markers linked to education level and sexual orientation, stoking long-running controversies about the application of genetics to behavior in people.
When the Manchester-based UKB enrolled its first volunteer 13 years ago, some critics wondered whether it would be a waste of time and money. But by now, any skepticism is long gone. "It's now clear that it has been a massive success—largely because the big data they have are being made widely available," says Oxford developmental neuropsychologist Dorothy Bishop, a participant. Other biobanks are bigger or collect equally detailed health data. But the UKB has both large numbers of participants and high-quality clinical information. It "allows us to do research on a scale that we've never been able to do before," says Peter Visscher, a quantitative geneticist at the University of Queensland in Brisbane, Australia.
The crucial ingredient, however, may be open access. Researchers around the world can freely delve into the UKB data and rapidly build on one another's work, resulting in unexpected dividends in diverse fields, such as human evolution. In a crowdsourcing spirit rare in the hypercompetitive world of biomedical research, groups even post tools for using the data without first seeking credit by publishing in a journal.
"The U.K. is getting all of the world's best brains" to study its citizens, says Ewan Birney, director of the EMBL European Bioinformatics Institute in Hinxton, U.K., and a member of the UKB's steering committee. The U.K. focus is also the project's chief downside, as it explores just one slice of humanity: northern Europeans. It holds data for only about 20,000 people of African or Asian descent, for example. Yet as new papers appear every few days, researchers say the UKB remains a shining example of the power of curiosity unleashed. "It's the thing we always dreamed of," Lander says.
The UKB was announced in the early 2000s as a classical epidemiological study—the kind used to associate risk factors such as diet and smoking with the development of disease over time. The model was the famous Framingham Heart Study, a long-term study that initially analyzed 5200 residents of Framingham, Massachusetts, seeking factors that influence heart disease. The UKB project, which has received $308 million in funding so far from the Wellcome Trust medical charity, the U.K. government, and disease foundations, "was going to be like Framingham, only 100 times bigger," says principal investigator Collins.
From 2006 to 2010, the UKB enrolled 500,000 people aged 40 to 69 through the United Kingdom's National Health Service. Mailed invitations were sent widely, including to people in poor and ethnically diverse areas of cities such as Birmingham. But in the end, participants were "anybody you could persuade," Collins says. Investigators sampled their blood and urine, surveyed their habits, and examined them for more than 2400 different traits or phenotypes, including data on their social lives, cognitive state, lifestyle, and physical health.
The blood samples yielded DNA for genomic analyses. Links to other U.K. databases added information such as cancer diagnoses, deaths, and hospitalizations. "If you're talking about common phenotypes, the Biobank shines," Lander says. "There's arm fat, smoking behavior, miserableness, neurotic behavior, time on your computer, eating behavior, drinking behavior."
Other biobanks have comparably rich health data, such as deCODE Genetics's detailed database on Iceland's population and biobanks run by U.S. health care providers. Some, such as the U.S. Million Veteran Program and the DNA testing company 23andMe, are bigger. But in most cases researchers can use these databases only by collaborating with their creators.
In contrast, the Wellcome Trust and U.K. Medical Research Council insisted that any researcher approved by the UKB board, anywhere in the world, be able to download anonymized data sets on all 500,000 participants. (Users pay a relatively modest fee of $2500 and agree to return their raw data, results, and code to the UKB after publishing. They also sign a legal agreement not to try to reidentify any participant.)
"It was a novel concept," says Collins, who says he's lost track of the times someone has asked him after a talk whether he's interested in collaborating. "I have to say, ‘You just request the data.’ To some extent people don't believe it."
The aim is to maximize the scientific pay-off: "By making data available to 100 people around the world, we can get a lot more research done than if I sit here and do one study a year with the data," he says.
In 2015, his team released the first batch of genetic data on a subset of 150,000 participants. Then came the July 2017 release of full genotyping data for all 500,000. Two months later, Benjamin Neale's group at the Broad Institute put up its blog doubling the number of markers linked to traits and disorders, as well as a web browser for looking up specific markers. "We viewed it as a service to the community," Neale says.
Today, about 7000 researchers have registered to use UKB data on 1400 projects, and nearly 600 papers have been published. Some studies simply link behaviors and disease, for example reporting that drinking more coffee can reduce mortality but that binge-watching TV is associated with more colon cancer. But most studies compare the genomes of people with some trait or disease with those without it, in order to home in on genes that influence that attribute; these projects are known as genome-wide association studies.
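A single-marker version of that comparison can be sketched as a chi-square test on allele counts in people with and without a trait. This is a simplified illustration, not the method of any study mentioned here: real genome-wide association studies typically use regression models that adjust for ancestry and other covariates, and the function name and the counts in the example are hypothetical.

```python
import math

def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Rows are cases and controls; columns are the alternate and reference
    allele. Assumes every row and column total is nonzero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Odds of carrying the alternate allele in cases relative to controls.
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio

if __name__ == "__main__":
    # Hypothetical counts for one marker, invented for illustration.
    chi2, p, odds = allelic_test(30, 70, 20, 80)
    print(f"chi2={chi2:.3f}  p={p:.3f}  odds ratio={odds:.2f}")
```

A genome-wide study repeats a test like this across hundreds of thousands of markers, which is why the field conventionally demands a very small p-value (5e-8) before calling an association significant, and why large samples matter for detecting variants with subtle effects.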
The result, every few days, is a new paper using UKB data to link particular gene variants to a disease or trait—arthritis, type 2 diabetes, depression, neuroticism, heart disease. "It's so easy for people who don't collect their own data," says statistical geneticist Danielle Posthuma of Vrije University in Amsterdam, who studies brain diseases. By combining data from the UKB and other collections, investigators can amass samples of a million people or more, amplifying the signal of gene variants with subtle effects. For some diseases, dozens or hundreds of genes appear to play a role. The genetic links are suggestive correlations; establishing cause and effect will take more genetics work and lab studies, which could reveal new disease pathways that might be drug targets.
In the near term, the large sample sizes are boosting the power of "polygenic" risk scores, which calculate a person's disease risk by combining many genetic markers. For example, one study published in August 2018 in Nature Genetics drew on the July 2017 data to devise risk scores for five diseases, including breast cancer and heart disease. The authors, at Massachusetts General Hospital in Boston and the Broad Institute, found that a surprisingly high 8% of people of European descent have at least a threefold elevated risk for heart disease. And up to 6% have a threefold increase in risk for one of the four other diseases, suggesting they should be screened early and consider lifestyle changes or other measures that could improve their odds.
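In its simplest form, a polygenic score is a weighted sum: each marker's effect size, estimated in an association study, is multiplied by the number of copies of the risk allele a person carries (0, 1, or 2), and the products are added up. The sketch below assumes that simple additive form. The marker names and weights are hypothetical, and treating an ungenotyped marker as zero copies is a simplification made here, not a description of how the published scores were built.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of risk-allele counts across markers.

    dosages: marker ID -> copies of the risk allele carried (0, 1, or 2)
    weights: marker ID -> per-allele effect size from an association study
    Markers missing from `dosages` count as 0 copies (a simplification).
    """
    return sum(weight * dosages.get(marker, 0)
               for marker, weight in weights.items())

if __name__ == "__main__":
    # Hypothetical markers and effect sizes, invented for illustration.
    weights = {"marker_a": 0.2, "marker_b": -0.1, "marker_c": 0.05}
    person = {"marker_a": 2, "marker_b": 1, "marker_c": 0}
    print(f"score = {polygenic_score(person, weights):.2f}")
```

A raw score like this means little on its own. It becomes informative only when compared with the distribution of scores in a reference population, which is how a study can say that some fraction of people sit at a threefold elevated risk.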
The most provocative studies have probed for genetic influences on human behavior. One, published in Nature Genetics in July 2018, drew on the UKB and 23andMe to pin down genetic contributions to a person's level of education. Together, 1300 genetic markers accounted for 11% of the variability among individuals, the researchers found. That's comparable to certain environmental influences in the UKB sample, such as family income, which predicted just 7% of the variance in educational attainment among participants; and mother's education level, which predicted 15%. Another study presented at a meeting last fall found four genetic markers that appear to have a strong influence on whether a person has had sex with someone of their own sex at least once.
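The "share of variability" figures quoted for educational attainment can be read as a squared correlation between a predictor and an outcome. The snippet below is a minimal sketch of that quantity for a single predictor, assuming a simple straight-line relationship. The function name and the toy numbers are hypothetical, and the published analysis was considerably more involved.

```python
def variance_explained(predictor, outcome):
    """Squared Pearson correlation between two equal-length sequences.

    This is the fraction of the outcome's variance that a straight-line
    fit on the predictor accounts for. Assumes both sequences vary.
    """
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(outcome) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in outcome)
    return sxy ** 2 / (sxx * syy)

if __name__ == "__main__":
    # Toy values invented for illustration: a score and an outcome.
    scores = [1, 2, 3, 4]
    outcomes = [1, 3, 2, 4]
    print(f"variance explained: {variance_explained(scores, outcomes):.0%}")
```

Because the measure describes spread across a whole sample, a modest value leaves most of the differences between individuals unaccounted for.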
Such studies are raising concerns that genetic tests could be used to screen embryos for desired traits or discriminate against individuals with certain genetic profiles. That would be a misuse of the findings, say the researchers who identified these links. They stress that the probabilities mean little on the individual level.
The UKB's unusual design does have some limitations. The big one: Ninety-four percent of participants are white. "It's really good if you're British or European," Lander says. But "if you're an American without European ancestry or an African or Asian, you're going to be poorly serviced by the new polygenic risk scores." Nor will scores for traits such as educational attainment be meaningful in people with non-European ancestry.
The mailed invitation recruitment strategy didn't work as well as hoped, says Collins, who notes that young, low-income, white men are also scarce in the database. "We were aiming to get heterogeneity, but it's difficult."
Bishop blames the project's slant toward higher-income, healthy, white people on a lack of incentives for participants: they don't get even a small payment or the promise of receiving their test results. The people attracted to the project were those with enough spare time to participate or "who [wanted] to help research," she says.
One problem is that many immigrants to the United Kingdom have little experience with the research world, says Naveed Sattar, an adviser to the UKB and a clinical researcher and epidemiologist at the University of Glasgow. "Most first generation Asians simply have no prior experience of what research is and that it may help their community and their children in the future," he says. Surveys have found that immigrants are often suspicious of participating in research—perhaps because of unethical past studies in some countries, or concern that genetic findings could be used to discriminate.
Engaging such groups is possible, says geneticist David Van Heel of Queen Mary University of London, who heads the Genes & Health study, which so far has enrolled 33,000 Britons of Bangladeshi and Pakistani ancestry. In his experience, South Asians in the United Kingdom are less likely to respond to mailed invitations. His project achieved success by approaching potential participants in person—sometimes in their native language—in "trusted" settings such as health clinics and community centers.
Collins and other geneticists hope other biobanks can help fill the gap. For example, the Wellcome Trust is now the main funder of the China Kadoorie Biobank, with data on 515,000 people from mainland China, belonging to 10 ethnic groups. In the United States, the All of Us biobank funded by the National Institutes of Health (NIH) aims to use community outreach to help enroll at least half of its 1 million participants from minority groups, and like the UKB, promises to make data freely available. The Human Heredity & Health in Africa initiative has 70,000 participants so far across the continent, with funding from NIH and the Wellcome Trust. "There are ways of fixing this up. But we've got a long ways to go," Birney says.
Meanwhile, the UKB's riches are growing. About half of the participants' primary care data, including clinical data and prescriptions, will become available next spring. The UKB has also done MRI scans of the brains, hearts, and abdomens of 25,000 participants, with plans to scan 100,000; researchers are examining and annotating the images.
Collins has been promoting the UKB's scientific treasure in Silicon Valley in California, where he hopes bioinformatics experts will dig in and come up with unexpected findings. The genetic data are ballooning, too: Several companies are now sequencing the exomes, or protein-coding regions, of all UKB participants, and the United Kingdom's public Sanger Institute is sequencing whole genomes from 50,000 volunteers. Unlike the genotyping data, which don't usually point to specific genes, the sequences will allow researchers who have found a genetic marker linked to a disease to quickly zero in on the causative gene and see the specific mutations at work.
Because of the $150 million cost of this sequencing work, the UKB had to compromise on open access: Companies have 9 to 12 months to use the exome data before they are made widely available. But Collins and his team, as well as geneticists around the world, are already gearing up for the wide release of the first batch of exome data on 50,000 participants. Again, they'll allow time for the download, then release a code. The starting gun in the next scientific race is set for March.