As genome sequencers churn out terabytes of new data daily, researchers are increasingly turning to an information-handling strategy already favored by internet companies: cloud computing.
By Alan Dove
Inclusion of companies in this article does not indicate endorsement by either AAAS or Science, nor is it meant to imply that their products or services are superior to those of other companies.
In February of 1977, Frederick Sanger and his colleagues published the first sequence of an organism's complete genome, the 5,375 nucleotides of bacteriophage phiX174. Even then, it was clear that studying whole genomes would become cumbersome and tedious as scientists sequenced more complex organisms. Fortunately, the nascent field of genomics didn't have to wait long for a solution; just four months later, a small startup company in Cupertino, California, began selling the Apple II to electronics hobbyists. Scientists quickly discovered that the new, relatively inexpensive computing systems were ideal for storing and analyzing gene data.
Today, it's virtually impossible to imagine molecular biology without computers. Researchers routinely search massive online databases for novel connections between genes while highly automated sequencing systems deliver terabytes of new data daily. Indeed, an entirely new scientific specialty, bioinformatics, has arisen to sort and study the growing trove of biological information.
Many institutions have built dedicated computing centers to handle the glut of data, but recently bioinformatics experts have started borrowing another strategy from the computer industry to avoid that expense: cloud computing (or distributed computing). Instead of storing and analyzing data locally, cloud-based systems divide computationally intense work among hundreds or thousands of remote servers that are available on demand. Early adopters of cloud-based genomics had to write their own software for it, but computer scientists and service companies are now adding user-friendly interfaces to make the technique more broadly available.
The most obvious argument for cloud computing is the sheer volume of new sequence data. "We're not a particularly large campus, and we have the capacity to generate about one terabyte of data per day," says Michael Schatz, assistant professor of quantitative biology at Cold Spring Harbor Laboratory in Cold Spring Harbor, New York. That's enough to fill the entire hard drive of a typical desktop computer in just two or three days.
Worldwide, explains Schatz, DNA sequencing machines produce about 15 petabytes of data per year, a figure that is increasing rapidly; a petabyte is 1,000 terabytes. Writing 15 petabytes to DVDs would produce a stack of discs about two and a half miles tall, just for the raw sequences. Experiments that include phenotypic information such as microscopy slides multiply the storage problem even further.
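That stack height is easy to sanity check. The short sketch below assumes standard 4.7 GB single-layer DVDs about 1.2 mm thick (both values are my assumptions, not figures from the article):

```python
# Rough estimate of the DVD stack needed for 15 petabytes of raw sequence.
# Disc capacity and thickness are assumed values, not from the article.
PETABYTE = 1_000_000_000_000_000  # bytes, decimal (SI) units

data_bytes = 15 * PETABYTE
dvd_capacity = 4.7e9       # bytes per single-layer disc
disc_thickness_m = 0.0012  # 1.2 mm per disc

discs = data_bytes / dvd_capacity
stack_miles = discs * disc_thickness_m / 1609.344

print(f"{discs:,.0f} discs, stack ~{stack_miles:.1f} miles tall")
```

This works out to roughly 3.2 million discs in a stack about 2.4 miles high, consistent with the article's figure.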
Fortunately, companies with deep pockets and extensive computing experience have already solved data handling problems on that scale. Google, for example, collects and processes dozens of petabytes of information about its users daily. "That's more data processed in a single day than the amount of [sequencing] data generated in the entire world in a single year," says Schatz.
To accomplish this, Google relies on cloud computing, splitting the work among a "cloud" of hundreds or thousands of servers in computing centers scattered around the world. Researchers can access the same kind of distributed computing power inexpensively and easily through services such as Amazon's EC2 system, which lets anyone rent time on a similarly huge server cloud.
Before rushing into the cloud, though, researchers should assess their needs and local resources. A computing center at a scientist's home institution can often provide faster, cheaper service than a remote cloud system for data that don't need to be shared with distant collaborators. As a rule of thumb, Schatz suggests that "if your data involves more than a hundred terabytes that needs to be shared amongst collaborators, I think you'll get the most accessibility using a cloud platform."
Institutions that don't have dedicated computing centers may also find the cloud attractive. "Traditionally you go and build a big data center with lots of machines in it, but that's expensive and half the time it's sitting around idle, so the benefit of cloud computing is [that] you're only paying whilst you're using the service, and the rest of the time it's no cost to you whatsoever," says Richard Holland, chief business officer at Eagle Genomics in Cambridge, United Kingdom.
Besides access to a huge number of remote servers, a typical cloud service also provides fundamental software. Much of the cloud computing industry now relies on free, open-source tools such as the ubiquitous Apache web server and Hadoop, another Apache Software Foundation project. The former handles the basic communication between each server and the network, while the latter takes complex computing tasks and distributes them efficiently across thousands of servers.
Web companies originally developed this type of architecture to handle their own needs—Hadoop processes all of the world's Facebook photos and Yahoo searches—but in 2009 Schatz and his colleagues began using it for genomic data. Since then, Hadoop has become a top choice for bioinformatics in the cloud. Having "many hundreds of terabytes or petabytes of data that all need to be analyzed at once [is] becoming the de facto standard in the life sciences," says Schatz.
One major attraction of Hadoop is that it's easy to use, at least for scientists familiar with computer programming. "Just a little knowledge of Java programming is enough to be able to run large analysis tasks on very large clusters, and that's a big advantage of using Hadoop," says Jens Dittrich, a professor of information systems at Saarland University in Saarbrücken, Germany. Instead of having to keep track of which processor is handling which tasks, programmers can simply write their algorithms as if a single machine were doing the work, and Hadoop handles the underlying complexities of dividing the processing across thousands of servers.
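The division of labor Dittrich describes can be sketched without Hadoop itself. In the toy example below (plain Python rather than Hadoop's Java API; the k-mer counting task and all names are illustrative), the programmer writes only the map and reduce steps, while the grouping in between stands in for the work the framework would normally spread across thousands of machines:

```python
from collections import defaultdict

def map_fn(read, k=3):
    """Map step: emit (k-mer, 1) pairs for one sequencing read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Group values by key -- the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    """Reduce step: sum the counts for one k-mer."""
    return key, sum(values)

reads = ["GATTACA", "ATTACCA"]
mapped = (pair for read in reads for pair in map_fn(read))
counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(counts["ATT"])  # "ATT" appears once in each read
```

On a real cluster, `map_fn` would run on the machines holding each chunk of reads and `reduce_fn` on the machines assigned each group of keys; the programmer's code looks the same either way.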
Cloud-based computing in general, and Hadoop in particular, do have some drawbacks. In order to analyze data in the cloud, researchers first have to put it there, and terabyte-size uploads often take hours even over fast internet connections. Because it lacks the sophisticated indexing systems many databases use, Hadoop can also be inefficient for some types of analyses. A properly structured index allows a program to identify the specific pieces of data that are most likely to be necessary for a particular query; a system without indexes has to search the entire data set, which takes much longer.
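The difference an index makes can be seen in a small contrast between a full scan and a prebuilt lookup table (a hypothetical sketch with invented records, not code from any of the systems discussed):

```python
# Contrast between scanning a whole dataset and consulting a prebuilt index.
# Records and field names are invented for illustration.
records = [
    {"id": 0, "gene": "BRCA1", "variant": "c.68_69delAG"},
    {"id": 1, "gene": "TP53",  "variant": "c.524G>A"},
    {"id": 2, "gene": "BRCA1", "variant": "c.5266dupC"},
]

def full_scan(gene):
    """Without an index: examine every record (time grows with data size)."""
    return [r for r in records if r["gene"] == gene]

# Build the index once, e.g. while the data are being uploaded.
index = {}
for r in records:
    index.setdefault(r["gene"], []).append(r["id"])

def indexed_lookup(gene):
    """With an index: jump straight to the matching records."""
    return [records[i] for i in index.get(gene, [])]

assert full_scan("BRCA1") == indexed_lookup("BRCA1")
```

At three records the difference is invisible, but at petabyte scale the full scan touches every byte while the indexed lookup touches only the matching ones.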
Dittrich and his colleagues recently tackled both problems. The team's new Hadoop Aggressive Indexing System creates numerous indexes of a set of data while it's being uploaded to the cloud, using what would normally be wasted computing time to build a useful tool for optimizing subsequent analyses. Depending on the types of questions researchers are asking, the indexes can accelerate processing by a hundredfold. "It's not a silver bullet, to be fair, it depends on the analysis task ... but for many tasks we're doing pretty well," says Dittrich.
Even as new techniques make Hadoop more useful, experts in the field stress that it will never be a universal solution. Both Dittrich and Schatz argue that cloud-based systems excel at answering some biological questions, but not others. Aligning sequencing reads, identifying gene variants, and sorting through RNA expression profiles are all good candidates for cloud-based solutions, as they require searching through large sets of data for individual pieces of information. Metabolic modeling, on the other hand, involves performing complex calculations on smaller sets of data, and may work better on a local computing system.
Hadoop also doesn't make sense for biologists who aren't comfortable writing their own computer programs. For those scientists, several companies now offer user-friendly interfaces for cloud-based data analysis.
"The cloud comes in a variety of different sorts of flavors," says Eagle's Holland. Options range from bare-bones server leasing arrangements, often called infrastructure as a service, to fully built applications or software as a service (SaaS).
With SaaS, a service company provides the cloud infrastructure, data storage, and bioinformatics software. In many cases, researchers can have their sequencing data sent directly to the company, then perform common types of analyses in a point-and-click web environment. Sequencing companies such as Illumina in San Diego, California now offer their own SaaS systems, while numerous startup companies are also exploring this new niche.
Each service company has its own approach. Eagle Genomics, for example, links different prebuilt programs together to tailor software for each user. "People usually come to us and say 'we need to build an analysis pipeline that will do SNP prediction or variant calls,'" says Holland, adding that the company will then take published algorithms and "plumb them together into a ... workflow that answers their question." Researchers can then use the customized workflow to analyze their data on a cloud of servers. More experienced users can also delve into the computer code themselves to modify it.
Investigators looking for an even easier entry into the cloud can turn to one of the many companies now offering general-purpose software that addresses common types of queries. "There's a lot of functionality that a biologist can access in our service just by logging in and clicking buttons in their web browser," says Andreas Sundquist, chief executive officer and co-founder of SaaS provider DNA Nexus in Mountain View, California.
Though SaaS companies often develop their own proprietary code and interfaces, scientists shopping for cloud services should still ask about the underlying algorithms. "Researchers are a conservative bunch of people really, they like algorithms that have been published and tested and peer reviewed and well understood, and are less keen to experiment with new techniques on important data," says Holland.
Fortunately, most of the new bioinformatics companies are happy to discuss their systems. "All of the algorithms that are currently integrated into Spiral are peer reviewed, [and] we definitely get that people want to use things that are open source," says Adina Mangubat, chief executive officer of Spiral Genetics in Seattle, Washington. Spiral puts its own interface and data-handling layer on the published algorithms to make them easy to use. Other companies in the field echoed that sentiment, and most SaaS leases allow researchers to access the underlying software code directly.
Because cloud computing is relatively new, researchers in some fields remain skeptical of it. That's particularly true for pharmaceutical and biomedical scientists who handle sensitive proprietary data or information about patients. "There's definitely a perception of being able to control what's going on in your local cluster more than being able to control what's happening in a cloud environment," says Mangubat.
That concern may not be well founded. Studies have shown that three quarters of recent medical security breaches in the United States have been the result of clinicians losing laptop computers or portable storage drives. "If they had used the cloud instead ... stealing a laptop would've been a non-issue, because you would've never had the patients' data on that laptop to begin with," says Sundquist.
Indeed, as banking, government, and e-commerce companies have moved their data into cloud storage, security at server facilities has become extremely robust. Companies targeting the medical research market have also paid close attention to data security laws. "One of our fundamental tenets is making sure we have enterprise-grade security and all the features necessary to operate in a clinical or diagnostic setting," says Sundquist.
Even scientists renting bare cloud infrastructure and writing their own algorithms should expect strong security. Mangubat points out that the popular Amazon EC2 cloud leasing service already complies with regulations for the physical security of medical data, leaving only the researchers' own software as a potential weak point.
Another common concern with cloud computing—and something researchers should ask about before signing a server lease—is archiving. If a SaaS company shuts down or a researcher decides to switch to a different system, the lease should specify a way to retrieve the data. "We offer services that will allow us to burn all of that stuff to disk and ship them a big stack of hard drives, you're not married to the cloud for life," says Mangubat.
For general storage, though, the cloud can offer protection from accidents and local disasters, as cloud services typically replicate data across multiple locations. "You could have a meteor hit one of the data centers and a volcano erupt in the other one, but you'll still have another copy of your data," Sundquist explains.
Cloud storage could also help address the tricky problem of archiving digital information. For example, data stored on standard computer floppy disks just a few decades ago are often unreadable today, as the now-obsolete disk drives and operating systems are no longer available. In cloud-based storage, workers constantly transfer data to new media, while version-control systems preserve old editions of the software. Future researchers should be able to resurrect both the data and the tools used to analyze them.
Not everyone is satisfied with that solution, though. "As long as it can be overwritten, it's not an archive," says Dittrich. To prevent valuable sequencing data from being devoured by a computer bug or human error, he recommends storing an extra copy on another type of media. "A good way of doing a backup is to have a write-once medium, a [non-rewritable] DVD is a good example, you burn it once and you can't overwrite it anymore," he says.
As the petabytes continue piling up, though, some experts suggest that the ultimate archiving system for genomic data might be DNA itself, completing the connection between computing and biology. In this view, it may soon be cheaper and faster to resequence a stored biological sample than to retrieve the original sequence data from a digital archive. "Today there is a several day lag and expense to sequencing DNA, but it's kind of a sneak peek into the future ... where if sequencing was really more or less instantaneous, it does have some merit to being an information storage medium," says Schatz.
ADDITIONAL RESOURCES
Amazon EC2
Note: Readers can find out more about the companies and organizations listed by accessing their sites on the World Wide Web (WWW). If the listed organization does not have a site on the WWW or if it is under construction, we have substituted its main telephone number. Every effort has been made to ensure the accuracy of this information.
Alan Dove is a science writer and editor based in Massachusetts.
This Custom Publishing Office feature was published on 14 June 2013.
NEXT GENERATION SEQUENCE ANALYSIS
Lasergene—an integrated suite of software for Sanger and next generation sequence assembly and analysis—is now available on the Amazon Cloud with version 11. By using the cloud, researchers can more easily collaborate globally, access powerful hardware for occasional large projects that require it, run as many assembly projects as desired sequentially or concurrently, and take full advantage of the flexibility offered by our software for any application anywhere, anytime. Other significant improvements with Lasergene 11 include the introduction of a new application, MegAlign Pro, which includes the Muscle sequence alignment algorithm; new 16S rRNA and host-viral integration workflows in support of next generation sequencing platforms and technologies; enhanced Copy Number Variation analysis capability; and numerous improvements to Protean 3D, the integrated protein structure, sequence, and bioinformatic application within the Lasergene software suite.
In today's laboratories, experimental datasets are growing larger, and critical tasks such as data storage, processing, mining, and sharing have become increasingly cumbersome, error prone, and expensive. The revolutionary i3D Enterprise Service overcomes these challenges by integrating storage, processing, and data mining in an enterprise-level private cloud. Historically, to offer enterprise-level informatics, labs required a large team of information technology specialists as well as an associated computer cluster and corresponding data center. With i3D Enterprise Service, laboratory data can be automatically and securely uploaded from instruments to a private cloud and processed on the cloud. This enables workflow execution and data mining in a fraction of the time when compared with processing on a local PC. Researchers with an Internet browser can quickly access all of their data, interrogate it from any location, and share data globally in seconds. Additionally, i3D Enterprise Service supports all major instrument vendor data file formats.
Shimadzu Scientific Instruments
For info: 800-477-1227
The new Genefficiency RNA Sequencing (RNA-Seq) Service overcomes the need for expensive and time-intensive in-house bioinformatics analysis and has a wide range of other benefits. To ensure that researchers obtain informative data, OGT provides expert assistance from the initial stages of experimental design all the way through to the final results. In addition, the results are presented in an easy-to-use, interactive report that makes the final expression analysis as fast and straightforward as possible. By taking care of the complexity of sample and data processing, the new service makes RNA-Seq a widely accessible tool for revealing the complexities of the transcriptome. RNA-Seq allows much broader discoveries at the transcriptome level than many other approaches. It provides the experimental freedom to identify unknown genes and isoforms without having to wait for new versions of exon or custom arrays and genome annotation updates.
Oxford Gene Technology
For info: +44-(0)-1865-856826
Services are now available for investigating DNA methylation (5-mC) and hydroxymethylation (5-hmC) as well as targeted DNA methylation analysis of single or multiple genetic loci. Multiple service options are available to researchers, each offering varying degrees of genome coverage. No experience with sequencing or bioinformatics technologies is required—simply submit the samples and receive high-quality, easy-to-interpret publication-ready data and figures. There are many unique benefits to using the services, including the most highly cited chemistries for bisulfite treatment of DNA for 5-mC analysis, novel library prep workflows for ultralow DNA inputs, and custom-designed bioinformatics pipelines to ensure seamless data handling, analysis, and delivery. Additionally, a novel technique is available to determine DNA hydroxymethylation levels on a genome-wide scale called Reduced Representation Hydroxymethylation Profiling (RRHP), which is the only method available for reliable single base-pair resolution and strand-specific profiling of 5-hmC modifications in DNA.
For info: 888-882-9682
Electronically submit your new product description or product literature information! Go to www.sciencemag.org/products/newproducts.xhtml for more information.
Newly offered instrumentation, apparatus, and laboratory materials of interest to researchers in all disciplines in academic, industrial, and governmental organizations are featured in this space. Emphasis is given to purpose, chief characteristics, and availability of products and materials. Endorsement by Science or AAAS of any products or materials mentioned is not implied. Additional information may be obtained from the manufacturer or supplier.
Look for these Upcoming Articles
Separation Techniques — July 12