Technical Comments
Although the potential frameshifts we describe might be bona fide, it cannot be ruled out that they represent actual sequencing artifacts. Erroneous sequences in public databases are a substantial problem; error rates have been estimated to be in the range of 0.37 to 2.9 errors per 1000 nucleotides (3), which sometimes makes data interpretation difficult. This is especially true, for example, in studies that use protein and DNA sequence information to estimate evolutionary distances (4). It is not known how the error rate in this study (1) compares with error rates in the database, but a previous study suggests that error rates generally vary between 1 in 5000 and 1 in 10,000 nucleotides (5).
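The two sets of figures quoted above use different units, so a direct comparison takes one conversion step. The following is a minimal sketch that expresses the "1 in N" rates as errors per 1000 nucleotides; the function and variable names are illustrative assumptions, not part of any cited study.

```python
def errors_per_1000(bases_per_error):
    """Convert a '1 error in N bases' rate to errors per 1000 nucleotides."""
    return 1000 / bases_per_error

# Database estimate quoted in the text (3), already in errors per 1000 nucleotides.
database_low, database_high = 0.37, 2.9

# Rates suggested by the previous study (5): 1 in 5000 to 1 in 10,000 nucleotides.
study_high = errors_per_1000(5_000)    # the higher of the two error rates
study_low = errors_per_1000(10_000)    # the lower of the two error rates

print(f"Database estimate: {database_low} to {database_high} errors per 1000 nt")
print(f"Previous study:    {study_low} to {study_high} errors per 1000 nt")
```

On a common scale, the rates from the previous study fall below the lower bound of the database estimate.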
The issue of sequencing artifacts is important and is expected to remain a problem, considering the surge of genome sequencing projects on model organisms, as well as the human genome sequencing initiative.
Umesh Bhatia
Department of Molecular and Cellular Biology,
Harvard University,
Cambridge, MA 02138, USA
E-mail: bhatia{at}nucleus.harvard.edu
Keith Robison
Millennium Pharmaceuticals, Inc.,
640 Memorial Drive,
Cambridge, MA 02139, USA
Walter Gilbert
Department of Molecular and Cellular Biology,
Harvard University
Response: Bhatia et al. express concerns about erroneous sequences in public databases, which make the interpretation of sequence data sometimes difficult. We share these concerns because faulty entries in public databases, especially sequence annotations, often complicate our research efforts. Therefore, we dedicate considerable resources to maintain a curated in-house database and to carefully check the sequences and annotations provided by us to the public. The challenge is to find a suitable compromise between the quick release of newly sequenced genomes and responsible sequence quality and annotation. We estimate our error rate at the time of release to be 1 base in 5000 to 10,000 (1), which is about the quality requested for the Human Genome Project. For the 1.7-Mbp M. jannaschii genome (1), this would account for about 250 putative errors, which would mainly result in frameshifts in ORFs that as yet have no recognizable homologs in any database. Bhatia et al. specify 15 regions in this genome where they suspect ORFs or frameshift problems resulting from sequencing artifacts. We encourage the input of the scientific community in ongoing efforts to further elucidate the wealth of biological information still hidden in this genome; however, without access to the original electropherograms that were used to generate the final genome sequence data, it is not always possible to definitively determine whether a presumed frameshift reflects an error in the DNA sequence or not (Table 1). For example, ORF M11 in table 2 of the comment suggests that we truncated a transposase gene by a frameshift, but this "ORF" is a vestigial gene that is missing a significant portion of the central part of its homologues. The nucleotide necessary for a correction of the frameshift, A-276,294, is absent in all 12 sequences covering this area of the genome.
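The figure of about 250 putative errors can be checked with simple arithmetic. The sketch below assumes a genome length of exactly 1.7 million base pairs and a uniform error rate; the names are illustrative only.

```python
def expected_errors(genome_length_bp, bases_per_error):
    """Expected number of errors at a rate of 1 error per `bases_per_error` bases."""
    return genome_length_bp / bases_per_error

GENOME_LENGTH_BP = 1_700_000  # assumed: 1.7 Mbp, as quoted for M. jannaschii

upper = expected_errors(GENOME_LENGTH_BP, 5_000)    # at 1 error in 5000 bases
lower = expected_errors(GENOME_LENGTH_BP, 10_000)   # at 1 error in 10,000 bases
midpoint = (lower + upper) / 2

print(f"Expected errors: {lower:.0f} to {upper:.0f} (midpoint {midpoint:.0f})")
```

Under these assumptions the range is 170 to 340 errors, and the midpoint of 255 is consistent with the estimate of about 250.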
No automated computer system will discover some of the treasures (and some of the errors) still hidden in the genome of M. jannaschii. We are therefore grateful to colleagues who, after the release of the M. jannaschii genome sequence, contacted us to provide their biological, biochemical, and genetic experience and expertise, which has resulted in quick updates and corrections of our freely accessible database at http://www.tigr.org.
Hans-Peter Klenk
E-mail: hpklenk{at}tigr.org
Owen White
E-mail: owhite{at}tigr.org
J. Craig Venter
The Institute for Genomic Research, TIGR,
9712 Medical Center Drive,
Rockville, MD 20850, USA
Science. ISSN 0036-8075 (print), 1095-9203 (online)