This Special Advertising Section is brought to you by AAAS OPMS

Integrating Informatics Data
After so many years of waiting–stretching back to the discovery of DNA's overall structure in 1953–the scientific community embraced the publication of the human genome sequence. In the 16 February 2001 issue of Science, an article by Craig Venter's group at Celera Genomics detailed the enormity of this project. After completing 27.3 million high quality sequence reads, which provided 5.11-fold coverage of the genome, and mapping 2.1 million single nucleotide polymorphisms, or SNPs, the investigators unveiled 32,000 human genes. Although the total number of genes fell short of what biologists expected, the volume of DNA sequence data created incredible challenges in managing and analyzing information. Consequently, the field of bioinformatics quickly gained prominence.

Thousands of life science and computer science experts worked in laboratories around the world for 15 years to generate the initial draft of the human genome. Nevertheless, more sequencing remains, because the sequence of the human genome is not complete. In addition, investigators will unravel the genomes of many other organisms. Sequencing is already completed for a few others, including S. cerevisiae (baker's yeast) with 12.1 million bases, C. elegans (nematode) with 97 million bases, and D. melanogaster (fruit fly) with 180 million bases. As with the human project, more sequencing from any genome creates more work in data management and analysis.

When asked what lies ahead, beyond the human genome sequencing project, Robert Waterston, director of the Genome Sequencing Center at Washington University in St. Louis, said, "I see more genome sequencing projects, applying this ability to an increasing number of organisms." But he added that every increase in the volume of available data increases the difficulty of searching through it. As a result, he expects even more complexity ahead. "The nice thing about sequencing," he said, "is that it is inherently digital in form.
Once we get it, that's it. Expression data is different. It's quantitative and depends on the technique used to generate it." He also sees a need to look at variation and phenotype. He said, "Clearly, associating human phenotypic variation with the underlying sequence is a major challenge." He expects that this will require even more data and statistical analysis.

Today, biologists want to make sense of the sequence data and then turn their attention to the function of genes. Future projects will describe more genes, as well as RNA intermediates, resulting proteins, protein-protein interactions, and more–all producing large volumes of data that are likely to reside in different computer formats on different platforms. So far, biologists find themselves bogged down by the growing volume of data, when they would rather be liberated by it. All of these data could create a new freedom–to discover new genetic relations or create new drugs–but that demands more computational power, enormous increases in data storage, and methods to integrate and analyze results from various experiments and techniques. Many projects under way should resolve these issues and open pathways to unexplored territory.

Let's see where this all started. In the early 1980s, the National Institutes of Health, working with Los Alamos National Laboratory, created a public database called GenBank, which housed short stretches of DNA sequences that were just beginning to be identified by researchers. This probably marked the beginning of the bioinformatics age. In 1982, GenBank contained about 600 DNA sequences, but today more than 12 million squeeze into this information warehouse, which contains sequence information from many organisms. Currently, the National Center for Biotechnology Information (NCBI) of the National Institutes of Health manages and builds this database.
GenBank is one of three centers–along with the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ)–that collaborate in collecting information, which comes almost entirely from submissions of the authors of the data themselves. Scientists can access these data through the Internet and look for similarities and differences between DNA sequences in their search for new genes.

Despite the growing number of sequences housed in GenBank, David J. Lipman, director of NCBI, sees much more work ahead. He said, "I'm sure we'll see more and more of the same kind of computational comparative sequence analyses being done. You'll also see more of the kind of hybrid experiment-computational work being done, like what Celera did for their whole genome shotgun assembly." He added, "Essentially experimental approaches are becoming more feasible and attractive because of the ability to do some of the work in the computer."

In the 1990s, Expressed Sequence Tags–known as ESTs–arrived on the scene. These are short, about 300-500 base pair, single-pass sequence reads from mRNA. Typically they are produced in large batches and represent a snapshot of genes expressed in a given tissue. In the beginning, scientists considered these sequences long, and the growing numbers of them created some analytical challenges. Consequently, several companies capitalized on these sequences in hopes of discovering new genes and ultimately new drug targets. Incyte Genomics and Millennium Pharmaceuticals represented some of the first players in this business. These organizations believed that the ability to access and study large amounts of this information would provide them with a significant competitive advantage over the more traditional drug hunters. Nevertheless, the large volume created the need for more powerful tools to manage and analyze these sequence data. Once the Human Genome Project got under way, scientists generated even more DNA sequence data.
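
The database searches described above, in which scientists look for similarities between DNA sequences, rest on sequence alignment. The following is a minimal Python sketch of that idea, assuming a simple Smith-Waterman style local alignment score. The function names, the scoring values, and the toy sequences are hypothetical illustrations; they are not taken from GenBank or from any tool or company named in this article, and production search programs rely on faster heuristics.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Smith-Waterman dynamic programming; the scoring values are
    illustrative assumptions, not settings from any real tool.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + step, # extend with a match or mismatch
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def best_match(query, database):
    """Return (name, score) for the highest-scoring database entry."""
    scored = ((name, local_alignment_score(query, seq))
              for name, seq in database.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy database: names and sequences are made up.
toy_database = {
    "gene_a": "CCGATTACAGG",
    "gene_b": "TTTTTTTT",
}
name, score = best_match("GATTACA", toy_database)
print(name, score)
```

In this toy case the query occurs exactly inside `gene_a`, so that entry receives the highest score. A real search would also estimate whether a score of that size could arise by chance, which this sketch does not attempt.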
High throughput screening helped investigators sift through more targets in the drug discovery pipeline. Today, scientists can turn to Sequenom for a high throughput genotyping system called the MassARRAY. Charles Cantor, Sequenom's chief scientific officer, said, "We have 100,000 proven assays available." He added that this fall Sequenom will introduce a system that can crank out more than one million genotypes per day. Beyond high throughput, Sequenom can also integrate data across subjects. Cantor explained: "We can scan the genotypes in pools from a large number of people. So we can get the average genotype for 300 to 400 people right away. That saves users a factor of 300 to 400 in time and cost." Using this technology, Sequenom plans to scan the entire human genome in a population of 15,000 people by 2002. This process should expose more genes that could be associated with diseases.

Sorting through the data being generated by sequencing projects demanded computer programs for data storage, management, and analysis. For instance, one of the most basic methods in bioinformatics compares new DNA sequences to those previously identified as relevant genes. If a new stretch of DNA resembles a gene already shown to be related to a particular disease, then the new sequence might be targeted as a potential treatment site. One of the software programs developed for searching DNA sequence data is the Basic Local Alignment Search Tool, better known simply as BLAST. This software program is available through NCBI as a part of a suite of programs for database analysis. NCBI also offers other programs for searching databases of three-dimensional protein structures and other biological data platforms. All of these programs make it possible to comb the ever-growing volumes of data.

Keeping biologists on track, however, demands a suite of user friendly tools. Some companies concentrate on creating products with improved interfaces and capabilities.
As a result, many instruments evolved from crude, hard-to-program keypads to today's Windows-driven operating systems. SynApps Software, Inc., for example, is a relatively new arrival in this business. According to Skip Martin, president of SynApps: "We help our clients develop products." This work focuses on creating custom software that ranges from automating laboratory processes to data mining and bioinformatics. They also offer a series of tools–including ones that help import data and search it–that attach to commercial products. This lets a customer customize a software package that is already in use. Martin added: "Scientists need flexibility, because technology and approaches to analysis change so rapidly. The architecture that we build can evolve with the changing needs of the organization." Martin believes that software itself will continue to evolve in ways that make it easier to operate. He said, "When biologists can think and work as biologists–not computer scientists–then they will be more efficient in their work."

In some cases, companies try to enhance data for customers. Douglas Brutlag, chief scientist for DoubleTwist, Inc., said, "We reexamine and reevaluate information from public and private databases and, thereby, add value. Then, we redistribute this information to pharmaceutical companies and biotech firms." This approach could challenge current expectations of the genome. For example, Brutlag said, "We look for genes in several ways, and we find twice the number of genes described in the public databases, or about 70,000." Users of DoubleTwist software can adjust its sensitivity and specificity for customized applications. This company's software also integrates information from a customer's data, other data sources, and various techniques–all put up in a single viewer to analyze. Users can load the software on their machines or use it at DoubleTwist's website.
Brutlag added: "I don't think we will know all the human genes until we sequence every mRNA at every stage of development and in every cell." Finding all of the proteins could be an even bigger challenge, because Brutlag said that most genes take an average of eight alternatively spliced forms, which could create more than half a million proteins overall.

Growing Needs for Handling Data

The data from genome projects, proteomics, and the questions that lie beyond will necessitate powerful software. Accordingly, a growing number of companies offer software for analysis of DNA sequences and protein structures. These products and services often include access to proprietary databases with large volumes of sequence data. For example, Biomax Informatics AG, DoubleTwist, Inc., Entigen, LION Bioscience, and others offer suites of bioinformatics programs. According to Reinhard Schneider, chief executive officer for LION Bioscience Research in Cambridge in the United States, they base part of their software product line on SRS, which stands for sequence retrieval system. Schneider said, "The system links databases such that the users can make single queries that go to more than one database." LION offers additional products and solutions in the areas of sequence analysis, expression profiling, and metabolic pathway analysis. Schneider added: "Our major focus is linking all of our tools together so that users can exchange data between products."

Despite the power of probing databases, Schneider notes that valuable information can be found in other places, too. He said, "Most of the useful information is not in databases, but is in the primary literature. So, we are creating a text mining tool that can, for example, extract protein-protein interactions from Medline abstracts."
Schneider and his colleagues are working with Bayer on a combination of text mining and database access that could be used for various tasks, including starting with a phenotype, such as an early flowering plant, and then trying to find a genotype that provides that feature.

In the end, everyone hopes to use the knowledge created in the genome projects and the studies beyond it. As one example, Pyrosequencing moved beyond the genome as fast as it could. Bjorn Ekstrom, executive vice president and chief technical officer, said, "We provide tools for applied genetic analysis that enable the scientists to generate enormous amounts of data to correlate genotypes and phenotypes." In fact, Pyrosequencing's techniques can very accurately analyze 100,000 SNP assays in a day with just two instruments. In addition, the same hardware can perform many other types of assays relevant to applied genomics–including de novo sequencing, typing microorganisms, viral load, and allele frequency measurements–by simply changing the software and the reagent kit. Ekstrom said that Pyrosequencing is also working on reducing the assay volumes through microfluidic techniques and arrays, which should continue to decrease the required size of a sample and reduce the cost.

The volumes of data must be put together. Scientists can do that with the Discovery Center™, which NetGenics describes as an open, extendable software environment that provides an integrated view of chemical and biological information held in both internal and external repositories. This online system lets customers create an environment that analyzes and displays data through a variety of algorithms and takes data from many sources. Mike Dickson, chief technology officer for NetGenics, said, "Integration will lead to better utilization of knowledge."

Different companies, as one would expect, take different approaches to this field.
For instance, Klaus Heumann, CEO and founder of Biomax Informatics AG, said, "We are more problem focused, rather than only product focused." He added that today's biologists face heterogeneous problems that off-the-shelf software will not always address comprehensively or efficiently. He said, "We look at all the dimensions of a problem faced by a customer, and then we create a customized solution." Such solutions include global proprietary and public database integration and search capabilities, integration of clustered expression data with functional categories, software for automatic selection of genes for specific disease areas by linguistic analysis from scientific literature databases, and computer simulations of experiments. Furthermore, Biomax offers the unique service of manually annotating complete genomes.

As genomics and its resulting spin-off fields continue to grow, companies have more at stake, from intellectual property rights to economic opportunities. As a result, companies demand security and reliability in software. For some organizations, that means using software that resides on their own company servers. Companies that offer software packages designed for in-house use include Informax, Oxford Molecular Group, and Molecular Mining Corporation. These suppliers create software for use in small laboratories as well as for larger research organizations.

Part of the flood of data in molecular biology comes from new collection techniques, including DNA chips. With these miniature devices, investigators can conduct large numbers of experiments on a single small slide, not much different from the ones used in basic light microscopy. Companies that offer ready-to-use microarrays include Affymetrix, Inc., Genomic Solutions, and Mergen, Ltd. To fabricate custom chips, investigators can turn to Beckman Coulter, Inc., Genetix Ltd., and GeneMachines.

A DNA chip, or microarray, literally is DNA on a chip.
Oligonucleotides, or cDNAs, make up the DNA part, and the chip is just glass, plastic, or some other material. A chip includes thousands of different DNA sequences in an orderly pattern, essentially on a grid, to serve as probes. In general, an investigator collects mRNA from cells being studied, converts it to cDNA, and applies it to a DNA chip. The cDNA hybridizes with a gene, or DNA sequence, like the one that made it. The cDNA can be tagged–say, with a fluorescent dye–so that the hybridization site can be located. Scanners record the images for digital analysis, and software programs make sense of the thousands of data points. Software programs are available from various companies, including Affymetrix, BioDiscovery, Hitachi Genetic Systems, and Silicon Genetics. The software uses the locations of hybridization sites to determine which sequences are being expressed in test cells.

With the increasing variety in techniques for data generation, collection, and analysis, a biologist could get trapped in a spider web of automation and instrumentation, instead of doing biology. Fortunately, a number of companies tackle the entire product portfolio for both genomic and proteomic research. These companies–including Amersham Pharmacia Biotech and Bio-Rad Laboratories–develop broad product lines, and make every component compatible with all of the others.

Genomic Solutions focuses exclusively on genomics and proteomics research. This organization assembles the tools and techniques needed for DNA and protein research, and it also offers contract services in both areas. If an organization does not want to invest in the instrumentation needed to perform this research, Genomic Solutions can provide the staff and know-how to do the work and provide the results. Nisha Sahay, manager of genomic production and research services at Genomic Solutions, said, "We're kind of a turn-key solution.
We provide a complete package: genomics, proteomics, analysis, and contract research services." She adds, "Every product made here is used by our scientists to provide custom research services, thereby validating the instruments and consumables that are being built at Genomic Solutions."

The breadth of today's biomedical technology often encourages companies to spread their capabilities by teaming up with other companies. For example, some large pharmaceutical companies build in-house bioinformatics capabilities, and others obtain these skills through partnerships with biotechnology companies that specialize in software for data mining. Both approaches come with benefits and costs. Creating an in-house capability for bioinformatics, for instance, can protect proprietary methods, but it can be a very costly venture, even for a large organization. On the other hand, sending bioinformatics jobs out to another company forces a company to disclose certain confidential data to an outside supplier, which might present additional risk.

A quick survey reveals well-known companies on both sides of this partnering strategy. For example, Millennium Pharmaceuticals built extensive data mining capabilities in-house and even provides this service to other companies as a source of additional revenue. GlaxoSmithKline also depends on in-house bioinformatics specialists. Alternatively, Celera Genomics, Incyte Genomics, Rosetta Inpharmatics, and others supply services in the bioinformatics market. These suppliers work with companies that prefer the flexibility and quick access to new innovations that these suppliers offer.

One of the most recent examples of teaming up is Merck & Co., Inc.'s ongoing acquisition of Rosetta Inpharmatics. When asked about the expected benefits of this merger, Richard Blevins, director of bioinformatics at Merck, simply said, "Rosetta has the capacity to perform high throughput gene expression studies that are not possible at Merck."
So far, though, Blevins thinks that many of the benefits of genomics lie in the future for pharmaceuticals. He said, "The effects of sequencing the human genome are just starting to trickle in here." He did indicate, however, that human sequencing turned up several new genes that could be potential drug targets, but that work remains in the research stage. Still, he said, "Finding a gene does not mean–in any way, shape, or form–that you are closer to a drug target." Instead, he expects to learn more about potential drug targets from the interactions between gene products and pathways. The merger with Rosetta should help Merck explore in greater depth the effects that chemical compounds have on genes and their interactions.

Cranking Up the Computing Power

Although genomics projects alone demanded amazing levels of computing power, future projects will create skyrocketing computational needs. An organism's genome contains many parts, but at least the list remains relatively constant. In contrast, the transcriptome–messenger RNA–generated from the DNA of a living cell and the proteome–proteins–created from the same DNA change from cell to cell and throughout development. Venturing into these dynamic realms requires immense computer storage and seemingly instant computation. Some of the largest hardware and software companies see these growing needs for computational power in the life sciences, and they are creating separate divisions to attack specific biological problems.

According to Caroline Kovac, vice president of IBM Life Sciences, today's life sciences research depends on computation. She said, "At IBM, we are focused on three areas: providing systems that can scale to handle the exponential growth in data; integrating the data so that it can be mined for knowledge; and creating knowledge based information systems that support collaboration among the community of scientists from universities, government funded research labs, and the private sector."
She added: "What life sciences companies need–close to the top of the list–is a computing infrastructure partner that can deliver end-to-end solutions, including high-performance computing and storage solutions, scalable data management and data integration environments for heterogeneous data, and the capability to put it all together." For instance, IBM's DiscoveryLink data integration software helps scientists integrate and analyze large data sets from multiple databases–in different formats and file types–through a single query. In addition, IBM and MDS Proteomics recently formed a nonprofit organization called blueprint WORLDWIDE, Inc., which oversees a public database of protein-protein interactions called the Biomolecular Interaction Network Database, or BIND. This database uses IBM technology for processing, storing, and managing the data.

Biologists will probably need increasing computer power for some time. Siamak H. Zadeh, manager of the life sciences group at Sun Microsystems, Inc., said, "The role of information technology is increasingly pronounced in the life sciences." As a result, biologists are calling for ever more computing power. In fact, Zadeh said that some clients already want petabytes of storage, where a single petabyte is 1,000 trillion bytes.

In some cases, though, an organization might find ways to use its current computing capabilities to gain more power. Paul Renaud, general manager for the ActiveCluster at Platform Computing, Inc., said, "We've been tackling the central problem of making enough computer cycles available to do the work." Most organizations only use 5 percent to 10 percent of the potential in their desktop computers, and even servers only work to about 30 percent of their capability. That leaves millions and millions of computing cycles unused. Consequently, Platform offers a series of software products, including LSF ActiveCluster, that exploit unused power in an organization's computers.
This software can link together anything from desktop PCs to supercomputers, and thereby creates a super-supercomputer to attack the toughest computational problems. Renaud said, "It's about high throughput computing. The more processors you can combine, the more powerful the computer." Moreover, this software even links computers that use different operating systems.

The wide variety of techniques being employed by many groups in biomedical research creates data in many formats and on assorted computer platforms. Consequently, investigators struggle when they try to extract meaningful information from a handful of different databases. To use all of these data effectively, informatics needs uniform standards for how information is housed and exchanged. Such standards began with the Interoperable Informatics Infrastructure Consortium (I3C), which was announced at the recent BIO 2001 Conference. This group consists of an assortment of organizations, including IBM, Incogen, LabBook, Millennium Pharmaceuticals, and Sun Microsystems. Working as a group, these organizations will develop common protocols for data exchange and knowledge management in the life sciences.

In the post genomic era, Zadeh said that biologists will need significant improvements in data integration, data management, and knowledge management. By settling on a common protocol that will standardize the way data are treated, all software packages and hardware tools will be able to communicate, much like the Internet. As a result, Zadeh said, "Computer providers will be able to concentrate on core business, instead of worrying about integration. With a common protocol, the integration will come on its own."

This effort to help scientists work together should reveal additional knowledge, even about data already available. Blevins said, "There's got to be some information buried in this. A standard vocabulary could increase our ability to mine different databases."
With the ongoing efforts of I3C, improved mining will soon be possible. With these computing standards, biomedical researchers can focus on the next great challenge: unraveling how a genome creates each specific protein, which eventually determines how a cell functions.

In looking toward the future, David Lipman said, "Perhaps the area that could have the most impact on the role of computation in biology will have to do with analysis of signaling pathways and so on–a sort of 'meteorology' of the cell." He added: "If one could even get qualitative predictions based on computational analysis for understanding aspects of cellular physiology, then computation could have a similar role in biology to what it has in physics, but it's far too early to tell on that one."

Wherever genomics and bioinformatics lead researchers, each major step seems at first like the end of a journey, but scientists always find new questions to ask and set new expectations.
Note: Readers can find out more about the companies and organizations listed by accessing their sites on the World Wide Web (WWW). If the listed organization does not have a site on the WWW or if it is under construction, we have substituted its main telephone number. Every effort has been made to ensure the accuracy of this information. The companies and organizations in this article were selected at random. Their inclusion in this article does not indicate endorsement by either AAAS or Science nor is it meant to imply that their products or services are superior to those of other companies.
Science. ISSN 0036-8075 (print), 1095-9203 (online)