Science 13 June 1997:
Vol. 276. no. 5319, pp. 1724 - 1725
DOI: 10.1126/science.276.5319.1724

Technical Comments

Dealing with Database Explosion: A Cautionary Note


Carol J. Bult et al. (1) report the first complete genome sequence of an archaeon, Methanococcus jannaschii (Mja). Because the initial gene assignments were conservative (1, 2), we anticipated that much interesting biological information would be missing. We searched the database for additional open reading frames (ORFs) and found 15: four within intergenic regions (M1 through M4, Table 1); five that overlap previously identified ORFs (1, 2) but are read in a different frame (M5 through M9, Table 1); and six that are extended or truncated as a result of potential frameshifts (M10 through M15, Table 2).

Table 1. New ORFs in M. jannaschii (Mja) identified on the basis of similarity. ORFs were identified after the protein-coding regions reported for the organism (1) had been purged, and were searched using BLASTX against the combined SwissProt+PIR+GenBank translations database through the NCBI Network BLAST server with a score cutoff of 60, as described previously (6). For each ORF, the table gives the matching protein; the matching species [Methanococcus vannielii (Mva), Bacillus subtilis (Bsu), Haemophilus influenzae (Hin)]; the 5' start position; the strand (+ or -); the length of the ORF in amino acids (AA); the 5' and 3' flanking ORFs; and the Poisson probability estimate (p). Other details are available at http://www.golgi.harvard.edu/bhatia/neworfs/mja/table1.html


ORF  Matching protein                  Matching  Start 5'  Strand  Length  Flanking ORFs   p
                                       species                     (AA)    5'      3'

M1   30S Ribosomal protein S14         Mva       415652    +       55      469     470     10^-19
M2   Yqgp protein                      Bsu       540515    +       190     610     611     10^-9
M3   Amido phosphoribosyl transferase  Mja       1301085   -       362     1352    1351    10^-5
M4   Unknown                           Mja       1230530   -       255     1283    282     10^-18
M5   Asparagine synthetase             Bsu       994621    +       318     1055    1056    10^-26
M6   Modification methylase            Mja       1153501   -       81      1208    1206    10^-12
M7   Modification methylase HINCII     Hin       1277783   +       286     1327    1328    10^-35
M8   Helicase                          Bsu       1548365   -       182     1573    1572    10^-6
M9   Unknown                           Mja       1329000   +       58      1380    1381    10^-12
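
The first step in the Table 1 caption (setting aside the annotated coding regions, then keeping only database matches above a score cutoff) can be illustrated with a minimal sketch. It covers only the intergenic case (M1 through M4), does not call BLASTX, and is not the authors' pipeline. The function names, the coordinate convention (1-based, inclusive), the hit records, and the choice to treat the cutoff of 60 as inclusive are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed convention: 1-based, inclusive (start, end)


def intergenic_regions(genome_length: int, coding: List[Interval]) -> List[Interval]:
    """Return the stretches of the genome not covered by any annotated coding interval."""
    gaps: List[Interval] = []
    cursor = 1  # first position not yet known to be coding
    for start, end in sorted(coding):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)  # handles overlapping annotations
    if cursor <= genome_length:
        gaps.append((cursor, genome_length))
    return gaps


def passing_hits(hits: List[Dict], cutoff: int = 60) -> List[Dict]:
    """Keep hits whose score reaches the cutoff (inclusive; an assumption)."""
    return [h for h in hits if h["score"] >= cutoff]


# Hypothetical toy data, not taken from the M. jannaschii annotation.
toy_coding = [(100, 300), (250, 400), (600, 900)]
toy_gaps = intergenic_regions(1000, toy_coding)
toy_hits = [{"name": "hit_a", "score": 75}, {"name": "hit_b", "score": 41}]
print(toy_gaps)
print(passing_hits(toy_hits))
```

In practice each gap would be extracted and submitted to a translated search, and only the matches that survive the cutoff would be inspected by hand.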

Table 2. Identification of potential frameshift(s) by similarity. Highly significant BLAST matches to similar genes in alternative coding frames were classified as frameshifts, manually assembled, and confirmed. The effect of the frameshift (extension or truncation) and the length of the ORF resulting from the frameshift are also provided. M10 through M14 have each undergone a single frameshift event, whereas M15 has apparently undergone a second frameshift. Other details are available at http://www.golgi.harvard.edu/bhatia/neworfs/mja/table2.html


ORF  Matching protein                            Matching  Start 5'  Strand  Length  Frameshift               p
                                                 species                     (AA)    Effect      Length (AA)

M10  Restriction modification enzyme subunit M1  Mja       128577    -       359     extension   583          10^-93
M11  Transposase                                 Mja       276289    +       91      truncation  38           10^-43
M12  Polyferredoxin                              Mja       457630    -       410     extension   567          10^-40
M13  Unknown                                     Mja       14344     -       72      truncation  16           10^-35
M14  Unknown                                     Mja       202169    +       177     truncation  50           10^-8
M15  Unknown                                     Mja       809431    +       32      extension   131          10^-20
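
The Table 2 caption describes classifying matches to the same gene in alternative coding frames as potential frameshifts. The sketch below shows one way such candidates might be flagged before the manual assembly and confirmation the caption mentions. It is an illustration under stated assumptions, not the authors' procedure: the `Hit` record, the `frameshift_candidates` function, the `max_gap` threshold of 50 nucleotides, and the example coordinates are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    subject: str   # matching database protein
    strand: str    # "+" or "-"
    frame: int     # reading frame on that strand: 1, 2, or 3
    q_start: int   # genome coordinates, q_start <= q_end
    q_end: int


def frameshift_candidates(hits: List[Hit], max_gap: int = 50) -> List[Tuple[Hit, Hit]]:
    """Pair neighbouring hits to the same protein on the same strand
    that lie in different frames and are at most max_gap nucleotides apart."""
    grouped: Dict[Tuple[str, str], List[Hit]] = {}
    for h in hits:
        grouped.setdefault((h.subject, h.strand), []).append(h)

    pairs: List[Tuple[Hit, Hit]] = []
    for group in grouped.values():
        group.sort(key=lambda h: h.q_start)
        for a, b in zip(group, group[1:]):
            gap = b.q_start - a.q_end - 1  # negative when the hits overlap
            if a.frame != b.frame and gap <= max_gap:
                pairs.append((a, b))
    return pairs


# Hypothetical hits; the coordinates are invented for the example.
toy_hits = [
    Hit("polyferredoxin", "-", 1, 100, 400),
    Hit("polyferredoxin", "-", 3, 405, 700),   # same protein, new frame: flagged
    Hit("helicase", "+", 2, 1000, 1200),
    Hit("helicase", "+", 2, 1210, 1500),       # same frame: not flagged
]
for a, b in frameshift_candidates(toy_hits):
    print(a.subject, a.frame, "->", b.frame)
```

A flagged pair is only a candidate: as the response below points out for M11, a frame change can also reflect a vestigial gene rather than a sequencing error.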

Although the potential frameshifts we describe might be bona fide, it cannot be ruled out that they are sequencing artifacts. Erroneous sequences in public databases are a substantial problem; error rates have been estimated at 0.37 to 2.9 errors per 1000 nucleotides (3), which sometimes makes data interpretation difficult. This is especially true, for example, in studies that use protein and DNA sequence information to estimate evolutionary distances (4). It is not known how the error rate in this study (1) compares with error rates in the database, but a previous study suggests that error rates generally vary between 1 in 5000 and 1 in 10,000 nucleotides (5).
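
The two sets of figures in the paragraph above use different units, so a direct comparison takes one conversion. The short sketch below expresses the database estimates (errors per 1000 nucleotides) as "one error every N nucleotides"; the function name is an assumption introduced for illustration.

```python
def nucleotides_per_error(errors_per_1000_nt: float) -> float:
    """Convert an error rate per 1000 nucleotides into nucleotides per error."""
    return 1000.0 / errors_per_1000_nt


# Database-wide estimates quoted from reference (3).
database_low_rate = nucleotides_per_error(0.37)   # about one error per 2700 nt
database_high_rate = nucleotides_per_error(2.9)   # about one error per 345 nt

print(f"0.37 per 1000 nt = 1 error in {database_low_rate:.0f} nt")
print(f"2.9 per 1000 nt  = 1 error in {database_high_rate:.0f} nt")
```

On these numbers, the database-wide estimates correspond to roughly one error every 345 to 2700 nucleotides, compared with one in 5000 to 10,000 for the study cited as reference (5).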

The issue of sequencing artifacts is important and is expected to remain a problem, given the surge of genome sequencing projects on model organisms and the human genome sequencing initiative.

Umesh Bhatia
Department of Molecular and
Cellular Biology,
Harvard University,
Cambridge, MA 02138, USA
E-mail: bhatia{at}nucleus.harvard.edu
Keith Robison
Millennium Pharmaceuticals, Inc.,
640 Memorial Drive,
Cambridge, MA 02139, USA
Walter Gilbert
Department of Molecular and
Cellular Biology,
Harvard University

REFERENCES

  1. C. J. Bult et al., Science 273, 1058 (1996).
  2. N. C. Kyrpides et al., Microb. & Comp. Genomics 1, 329 (1996).
  3. S. A. Krawetz, Nucleic Acids Res. 17, 3951 (1989); J. Claverie, J. Mol. Biol. 234, 1140 (1993).
  4. W. H. Li et al., Genetics 129, 513 (1991).
  5. R. D. Fleischmann et al., Science 269, 496 (1995).
  6. K. Robison et al., Nature Genet. 7, 205 (1994); K. Robison et al., Science 271, 1302 (1996).
24 February 1997; accepted 23 April 1997

Response: Bhatia et al. express concerns about erroneous sequences in public databases, which sometimes make the interpretation of sequence data difficult. We share these concerns because faulty entries in public databases, especially sequence annotations, often complicate our research efforts. We therefore dedicate considerable resources to maintaining a curated in-house database and to carefully checking the sequences and annotations that we provide to the public. The challenge is to find a suitable compromise between the quick release of newly sequenced genomes and responsible sequence quality and annotation. We estimate our error rate at the time of release to be 1 base in 5000 to 10,000 (1), which is about the quality requested for the Human Genome Project. For the 1.7-Mbp M. jannaschii genome (1), this would account for about 250 putative errors, which would mainly result in frameshifts in ORFs that as yet have no recognizable homologs in any database. Bhatia et al. specify 15 regions in this genome where they suspect additional ORFs or frameshift problems resulting from sequencing artifacts.

We encourage the input of the scientific community in ongoing efforts to further elucidate the wealth of biological information still hidden in this genome; however, without access to the original electropherograms that were used to generate the final genome sequence data, it is not always possible to determine definitively whether a presumed frameshift reflects an error in the DNA sequence (Table 1). For example, ORF M11 in table 2 of the comment suggests that we truncated a transposase gene by a frameshift, but this "ORF" is a vestigial gene that is missing a significant portion of the central part of its homologs. The nucleotide necessary for a correction of the frameshift, A-276,294, is absent in all 12 sequences covering this area of the genome.
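
The figure of about 250 putative errors can be checked against the stated error rate and genome size. The sketch below does that arithmetic; the response does not say how the figure was derived, so reading it as roughly the midpoint of the range is an assumption, as are the function and constant names.

```python
GENOME_LENGTH_NT = 1_700_000  # the 1.7-Mbp genome size quoted in the response


def expected_errors(genome_length_nt: int, nucleotides_per_error: int) -> float:
    """Expected number of erroneous bases at a rate of one error per N nucleotides."""
    return genome_length_nt / nucleotides_per_error


upper = expected_errors(GENOME_LENGTH_NT, 5_000)    # 1 error in 5000 nt
lower = expected_errors(GENOME_LENGTH_NT, 10_000)   # 1 error in 10,000 nt
midpoint = (upper + lower) / 2

print(f"expected errors: {lower:.0f} to {upper:.0f} (midpoint {midpoint:.0f})")
```

The range works out to 170 to 340 expected errors, with a midpoint of 255, which is consistent with the "about 250" quoted in the response.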

Some of the treasures (and some of the errors) still hidden in the genome of M. jannaschii will not be discovered by any automated computer system. We are therefore grateful to colleagues who, after the release of the M. jannaschii genome sequence, contacted us to share their biological, biochemical, and genetic experience and expertise, which has resulted in quick updates and corrections of our freely accessible database at http://www.tigr.org.

Hans-Peter Klenk
E-mail: hpklenk{at}tigr.org
Owen White
E-mail: owhite{at}tigr.org
J. Craig Venter
The Institute for Genomic Research, TIGR,
9712 Medical Center Drive,
Rockville, MD 20850, USA

REFERENCES

  1. R. D. Fleischmann et al., Science 269, 496 (1995); C. M. Fraser et al., ibid. 270, 397 (1995); C. J. Bult et al., ibid. 273, 1058 (1996).
22 April 1997; accepted 3 April 1997


