Meteorologist Joshua Wurman and his Center for Severe Weather Research (CSWR) in Boulder, Colorado, study how thunderstorms become full-blown tornadoes. CSWR has developed a fleet of mobile Doppler radars mounted on trucks that follow tornado formation as it happens, capturing detailed data on the rotating columns of air in three-dimensional maps. "The data set allows us to see things that we have never seen before," Wurman says. It also brings a host of new challenges: Radar data occupy terabytes of storage, and although storing terabytes of data is cheap, analyzing the data -- and connecting them with other relevant data -- is difficult.
Wurman's situation is common. Technological advances, increases in computing power, and large-scale international experiments have all contributed to huge increases in the amount of data that scientists must parse. In most scientific fields, making sense of data was once something scientists just did. But today's scientists, whether they're just working with their own data or interacting with public databases, need special skills for making sense of data. It has become an essential career skill.
Heaps of data
The amount of data generated in many of today's experiments dwarfs previous efforts. In genetics, for example, sequencing technologies have become so cheap and efficient that individual researchers now decipher quantities of genetic information that not long ago could be tackled only by large sequencing centers. And the increase is likely to continue. "When you're moving to ... the next-gen sequencing outputs, you're looking at a 10-fold, 100-fold, 1000-fold increase in the amount of data you got," says David McAllister, strategy and policy manager for the genomics, data, and technologies sector of the Biotechnology and Biological Sciences Research Council in the United Kingdom.
This week, Science, Science Careers, Science Translational Medicine, and Science Signaling have joined forces to take a broad look at the challenges and opportunities researchers face in dealing with data.
This article is one of three in Science Careers on the topic.
See the entire list of articles in all the Science publications at www.sciencemag.org/special/data/.
An individual researcher may now generate a lot of data, but that's nothing compared with the data generated by large-scale projects in, for example, particle physics and astronomy, most of which now release their data in online databases. One such project is the Sloan Digital Sky Survey; partnering institutions worldwide have released many terabytes of images, photometric data, and spectroscopic data describing celestial objects in more than a quarter of the sky. Astronomers expect the Large Synoptic Survey Telescope to generate even more data in a few years, on the order of 30 terabytes per day. It would take 1000 Ph.D. students watching DVDs for 12 hours just to view the new data coming in each day, says astronomer and computational scientist John Wallin of Middle Tennessee State University in Murfreesboro. (Wallin is currently on a sabbatical at the University of Oxford in the United Kingdom.) Doing anything useful with those data would, of course, take far more time.
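Wallin's comparison can be checked with a rough calculation. The sketch below is only a back-of-envelope estimate: the 30 terabytes per day and the 1000 students come from the text, while the DVD capacity (about 4.7 gigabytes) and playing time (about 2 hours) are assumptions made here for illustration.

```python
# Back-of-envelope check of the "1000 students, 12 hours" comparison.
# From the article: 30 TB of new data per day, 1000 Ph.D. students.
# Assumptions (not from the article): a single-layer DVD holds about
# 4.7 GB and plays for about 2 hours; 1 TB is counted as 1000 GB.
DAILY_DATA_TB = 30
STUDENTS = 1000
DVD_CAPACITY_GB = 4.7      # assumption
DVD_PLAYTIME_HOURS = 2.0   # assumption


def viewing_hours_per_student(daily_tb, dvd_gb, dvd_hours, students):
    """Hours each student would spend watching one day's worth of data."""
    dvds_per_day = daily_tb * 1000 / dvd_gb   # number of DVDs filled per day
    total_hours = dvds_per_day * dvd_hours    # total viewing time for all DVDs
    return total_hours / students             # split evenly among students


hours = viewing_hours_per_student(
    DAILY_DATA_TB, DVD_CAPACITY_GB, DVD_PLAYTIME_HOURS, STUDENTS
)
print(f"{hours:.1f} hours per student per day")
```

Under these assumptions the result is a little under 13 hours per student, in line with the figure Wallin quotes.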
The ascent of large data sets and public databases has been accompanied by a rise in informatics tools and software platforms that promise to help scientific communities manage and analyze large quantities of data. Scientists need to know how to take advantage of these resources, which is seldom easy. In biology, "There's too much for you to know where to start," says Carole Goble, an information management scientist at the University of Manchester in the United Kingdom, who nevertheless offers some suggestions: Start by asking your colleagues which tools and databases they use. Read the literature and go to conferences to learn about new options. Check the public databases and tools being offered by major data centers such as the U.S. National Center for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory's European Bioinformatics Institute (EBI). Other data-intensive fields have similar problems.
Before you start using a database, make sure you know it well. "It takes energy to be familiar with not only the databases and what they do and don't have in them but also ... what is perfect and what is imperfect about the database, what it will not do for you, and how [databases] complement each other and how they evolve over time," says Huda Akil, co-director of the Molecular and Behavioral Neuroscience Institute (MBNI) in Ann Arbor, Michigan. (Akil discusses the need for neuroinformatics approaches in a related Science Perspective. Subscription or site license required for access.)
More advice from Goble: Ideally, data in public data sets will be accompanied by high-quality metadata, which give you additional information surrounding the data and the experimental setup and will help you understand, compare, and integrate data across different databases. "Do you understand what those data sets really are about? Do you know how to use them? Do you know how to cross-link them?" Goble asks. Scientists who are not savvy can end up doing flawed research -- e.g., comparing data that are not compatible, Goble says.
The same goes for informatics tools and software: There's danger in using them "in an unknowing way," Goble says. Make sure you know enough that you can choose a tool that's appropriate to the type of data you intend to use and the questions you're asking in your research, and make sure you know how to use that tool properly. Goble gives an example: NCBI's Basic Local Alignment Search Tool (BLAST), a program molecular biologists use to identify matches for specific nucleotide or protein sequences in sequence databases. "Hardly anybody changes the configuration of their BLAST queries. They only stick to the coefficients that they understand [from] when they were trained, and these may be completely inappropriate," Goble says. Also bear in mind error bounds and software limitations, she adds.
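One way to avoid running on settings you never chose is to spell out every search parameter. The sketch below builds a command line for the BLAST+ `blastn` program without running it. The helper function, file name, and database name are hypothetical, and the parameter values are illustrative only, not recommendations; the right settings depend on your sequences and your question.

```python
import shlex


def build_blastn_command(query_fasta, database, *, task, evalue, word_size):
    """Return a blastn argument list with the key search settings explicit.

    Hypothetical helper: the keyword-only arguments have no defaults, so the
    caller has to make a deliberate choice for each setting.
    """
    if evalue <= 0:
        raise ValueError("evalue must be positive")
    if word_size <= 0:
        raise ValueError("word_size must be a positive integer")
    return [
        "blastn",
        "-task", task,                 # search algorithm variant
        "-query", query_fasta,         # FASTA file with the query sequence(s)
        "-db", database,               # name of a formatted BLAST database
        "-evalue", str(evalue),        # expectation-value cutoff for hits
        "-word_size", str(word_size),  # length of the initial exact match
        "-outfmt", "6",                # tabular output
    ]


# Illustrative values for a short query; file and database names are made up.
cmd = build_blastn_command(
    "query.fasta", "my_nt_db", task="blastn-short", evalue=1000, word_size=7
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run` on a machine where BLAST+ is installed. Recording the full command alongside your results also documents which settings produced them.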
Filling the computation gap
Even in a field such as biology, where many informatics and computational tools are available, many data tools are difficult to use, Goble says, so seek help if you need it. Many big laboratories employ informatics specialists whose job is to help scientists handle and analyze large and complex data sets. If you're not at a big laboratory, your department may have informatics experts on the faculty who could do some of the heavy lifting for you. Or maybe your institution employs information technology (IT) specialists, perhaps in the library, who can help you deal with big data, Goble says.
In some disciplines, especially those with a strong quantitative or computational tradition such as particle physics and astronomy, scientists are expected to write, or at least tweak, their own software tools. This means "figuring out what science you're looking for and finding the appropriate statistical and algorithmic methodologies to use to solve your problem," Wallin says. The work may involve writing database queries or pattern-recognition algorithms, or "using statistical techniques to find outliers."
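As a small illustration of using statistics to find outliers, the sketch below flags values that sit far from the median, measured in units of the median absolute deviation (MAD). This is one common, robust approach chosen here as an example, not necessarily what Wallin has in mind; the function name, the threshold of 3.5, and the sample readings are assumptions made for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD) instead of the mean and standard deviation, so one extreme value
    does not distort the scale. The constant 0.6745 makes the score
    comparable to an ordinary z-score for normally distributed data.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [v for v, d in zip(values, abs_dev) if 0.6745 * d / mad > threshold]


# Made-up readings with one obviously odd value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print(mad_outliers(readings))  # prints [25.0]
```

Whether a flagged value is an instrument error or the interesting science is still a judgment the researcher has to make.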
The balance between creating your own software and tweaking software that already exists depends on your project and your circumstances. Wurman's field -- tornadogenesis -- is so small that there are few ready-made tools, and CSWR is too small to hire a dedicated programmer or IT person. So everyone at CSWR writes their own software or customizes programs developed by other groups. "We are looking for scientists who have a good understanding of meteorology and ... also an understanding of how a problem is solved using computer technologies and programming," Wurman says. That blend of skills, he says, is hard to find.
Whatever your field, many of the most interesting scientific questions are likely to require you to develop or adapt analysis tools. Collaboration with scientists who have the appropriate skills -- and the ability to cultivate and maintain those collaborations -- is essential.
All this means that early-career scientists in most empirical fields need, at a minimum, a working knowledge of public databases and tools for working with big data. A basic grounding in topics such as mathematics, statistics, numerical analysis, informatics, and computer science -- pick any two or three -- will help you in this. If you want to develop tools, you need deeper skills in programming, high-performance computing, and computational science. Most early-career scientists are likely to find themselves somewhere in the middle, needing to use existing tools in sophisticated ways or to customize and help develop new tools. They'll need a blend of skills and collaborations appropriate for their particular needs and research interests. As always in science, the research question dictates the skill set.
Such training is rarely part of standard science curricula, though ad hoc training is coming online. Scientists who wish to prepare themselves well should take extra courses, seek out workshops, and attend conferences in relevant fields. Major data centers such as EBI and most universities with large biology departments offer bioinformatics courses and other types of training, Goble says. But because tools and programming languages change fast, the ability to learn on your own is paramount, Wallin adds. And be prepared to learn on the job.
There is much to learn. "Unless you know where to go for what, you might feel like you're just playing and wasting your time," Akil says. But ultimately, scientists' ability to manage and analyze massive and complex data will be a big factor in their scientific success. Scientists in training need to learn to take advantage of the vast and rapidly growing storehouse of existing knowledge, "either through their own skills or through tapping into the skills in partnerships or through the tools that they are able to convene," Goble says.
The payoff for science should be big. Take neuroscience. Ongoing efforts to create a framework in which all relevant databases are listed and well interfaced could help us understand how the brain works, Akil says. "The hope is that one can begin to navigate seamlessly across databases as well as drill down deeply into any given one," she says. This would allow neuroscientists to take a novel, systemic, discovery-focused approach to science, relating molecules to neurons and on to neural circuits and behavior, she says. Neuroscientists could "not only find out what's there but use it as a means to understand the secrets that are buried in there that can only be discovered if you put it all together."