Hoping to tame the torrent of data churning out of biology labs, the National Institutes of Health (NIH) today announced $32 million in awards in 2014 to help researchers develop ways to analyze and use large biological data sets.
The awards come out of NIH’s Big Data to Knowledge (BD2K) initiative, announced last year after NIH concluded it needed to invest more in efforts to use the growing number of data sets—from genomics, proteins, and imaging to patient records—that biomedical researchers are amassing. For example, in one such “dry biology” project, researchers mixed public data on gene expression in cells and patients with diseases to predict new uses for existing drugs.
The BD2K awards “will help us overcome the obstacles to maximizing the utility of the mammoth data sets that are emerging at an accelerated pace,” said NIH Director Francis Collins in a call today with reporters. The grants, he said, will support computational tools, software, standards, and methods for sharing and using large data sets.
Eleven centers of excellence will receive $2 million to $3 million a year over 4 years to develop tools and methods for everything from modeling cell signaling in cancer to integrating data from mobile sensors worn by volunteers in health studies. Another center award will support a global brain data-collection effort called ENIGMA, which aims to unearth the genetic roots of psychiatric disorders.
One of ENIGMA’s aims is to allow neuroscientists and geneticists to pool hundreds of thousands of DNA samples in the hope of finding genetic variants underlying diseases such as major depressive disorder. Smaller studies have failed to turn up anything of statistical significance for this disorder, perhaps because many different genes each contribute effects on depression risk that have been too small to detect, says Paul Thompson, a neuroscientist at the University of Southern California in Los Angeles. He will lead the ENIGMA Center for Worldwide Medicine, Imaging and Genomics.
Neuroimaging studies have also long struggled with insufficient data, says Hugh Garavan, a cognitive neuroscientist at the University of Vermont in Burlington who recently joined ENIGMA. Roughly “95% of all imaging studies have maybe 20 participants per group,” largely because of the cost of brain scans, which can run roughly $500 to $600 per person, he says. Garavan’s group plans to use the pooled data to explore the genetics and neurobiology of addiction, he says.
ENIGMA may also help scientists study differences in the thickness of the human cortex, the wrinkly layer of tissue that lies on the brain’s surface and performs most of our higher-level thinking. Normally, it takes at least 24 hours to extract information about the cortex’s thickness from an MRI scan: One must digitally strip off the skull, separate the white matter from the gray matter, and delete the cerebrospinal fluid. Access to supercomputer clusters will allow neuroscientists to process that type of data set much more quickly for hundreds of thousands of patients, he says.
Although bigger data sets do raise the risk of getting more false-positive results and of missing rare variants, overall the data-pooling strategy “makes perfect sense,” says psychiatrist Jack McClellan of Seattle Children’s Hospital in Washington, who is not involved in the ENIGMA project.
The BD2K program will also fund a “data discovery” coordinating center at the University of California, San Diego, that will work with projects at eight other institutions to find ways to make it easier for researchers to find and use data sets. Right now, “you can’t Google scientific data very successfully,” says Philip Bourne, who this past January became the first NIH associate director for data science.
Finally, a set of training awards will support courses and the research of young scientists on big data projects.
NIH expects to commit a total of $656 million by 2020 to the BD2K initiative.