Science 22 November 2002:
Vol. 298. no. 5598, p. 1509
DOI: 10.1126/science.298.5598.1509a

Technical Comments

Are 100,000 "SNPs" Useless?


Bailey et al. (1) used public and private genome sequences to define segmental duplications within the human genome. Their excellent study demonstrated that the 5.2% of the genome present as segmental duplications contains 6.1% of known exons and roughly 100,000 more single nucleotide polymorphisms (SNPs) from the public SNP database (dbSNP) than expected. This latter feature led the authors to conclude that "about 100,000 paralogous sequence variants currently contaminate dbSNP." In other words, these entries in dbSNP do not represent allelic variants (polymorphisms), but differences between paralogous sequences (cismorphisms). This calculation assumed that "there is no reason to expect that polymorphic variation is increased within duplicated regions."

In contrast to this assertion, however, it has long been recognised that nonallelic gene conversion is capable of generating allelic diversity as well as homogenizing paralogous sequences. For example, the promoter of the growth hormone gene GH1 exhibits roughly 20 times more nucleotide diversity than other autosomal loci, as a consequence of gene conversion with neighbouring paralogous genes (2). Gene conversion has also been detected between dispersed segmental duplications (3-6). In addition, mathematical modeling has shown that heterozygosity increases with gene conversion rate (7).

Gene conversion between segmental duplications raises the additional possibility that etiologically important variants within them might skip between chromosomal locations, thus changing their haplotypic background and potentially rendering such regions opaque to haplotype-based whole genome association studies of complex disease. Variants defining the haplotypic background are themselves subject to gene conversion and, in view of typically short conversion tract lengths, are unlikely to be co-converted with the etiologically important variants, a factor that increases the confusion.
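The two effects described above, a conversion event turning a paralogous difference into an allelic one and a short tract failing to carry flanking variants with it, can be illustrated with a toy example. This is only a sketch: the sequences, the four-chromosome "population", the tract coordinates, and the helper names `convert` and `segregating_sites` are all hypothetical and chosen for illustration, not taken from any of the cited studies.

```python
def convert(acceptor, donor, start, length):
    """Copy a conversion tract donor[start:start+length] into the acceptor."""
    end = start + length
    return acceptor[:start] + donor[start:end] + acceptor[end:]


def segregating_sites(seqs):
    """Return positions at which the aligned sequences are not all identical."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


# Hypothetical paralogs that differ at positions 2 and 7 (0-based).
locus_a = "ACGTACGTAC"
locus_b = "ACTTACGAAC"

# Hypothetical population: four chromosomes, all identical at locus A.
before = [locus_a] * 4

# One chromosome receives a short tract (positions 1-3) from locus B.
converted = convert(locus_a, locus_b, start=1, length=3)
after = [converted] + [locus_a] * 3

print(segregating_sites(before))  # no allelic variation at locus A
print(segregating_sites(after))   # position 2 now varies among alleles
```

In this toy case the paralogous difference at position 2 becomes a genuine polymorphism at locus A, while the difference at position 7 lies outside the tract and is not co-converted, so the new variant sits on the acceptor's original flanking background.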

Investigating the extent to which variants within 6.1% of our genes might escape haplotype-based association studies, and the degree to which 100,000 is an overestimate of useless "SNPs" in these segmental duplications, will require greater characterization of the poorly understood dynamics of gene conversion in the human genome.

Matthew Hurles
Molecular Genetics Laboratory
McDonald Institute for
Archaeological Research
University of Cambridge
Cambridge, CB2 3ER, UK

REFERENCES

1. J. A. Bailey, et al., Science 297, 1003 (2002).
2. M. Giordano, C. Marchetti, E. Chiorboli, G. Bona, P. Momigliano Richiardi, Hum. Genet. 100, 249 (1997).
3. S. Aradhya, et al., Hum. Mol. Genet. 10, 2557 (2001).
4. P. Blanco, et al., J. Med. Genet. 37, 752 (2000).
5. L. L. Han, M. P. Keller, W. Navidi, P. F. Chance, N. Arnheim, Hum. Mol. Genet. 9, 1881 (2000).
6. M. E. Hurles, BMC Genom. 2, 11 (2001).
7. T. Nagylaki, Proc. Natl. Acad. Sci. U.S.A. 81, 3796 (1984).
9 September 2002; accepted 5 November 2002

Response: Hurles raises an excellent point: Assembly errors may not be the sole basis for the observed "SNP" enrichment. There are at least two possible explanations: (i) duplication-induced collapse of paralogous sequence variants (PSVs) (1), and (ii) gene conversion events among the duplicated segments (2). Both events likely contribute--but which is more probable in light of the current state of the genome assembly within duplicated regions?

In previous analyses (1, 3), we found that duplicated regions were in fact underrepresented (by 30 to 40%) within public assemblies. There were fewer copies in the sequence assembly than could be shown by experimental methods (1, 4). The large size of the duplication (100 kilobases) and the high degree of sequence identity between many duplications have led to such sequences being considered as allelic copies rather than representing independent loci. In this respect, it is noteworthy that "overlap" SNPs, which were largely determined by electronic comparison of GenBank sequences, contributed more to the enrichment (2.6 times) than did SNPs assigned randomly (1.28 times). In addition to collapse, subsequent examination of dbSNP has revealed that many other "overlap" SNPs are annotated as "ambiguously mapped" and are in fact assigned to more than one location (5). Thus, although gene conversion remains a likely source for some of the "SNP" abundance, this effect cannot be satisfactorily addressed without concomitant elimination of the artifacts. We think that these artifacts of our genome provide the most prosaic explanation for this increase. Further experimental validation is required. The regions that we have identified as being increased in SNP density and at the transition of unique and duplicated sequence provide logical targets to assess this effect, especially as the genome nears completion and its quality substantially improves within these areas.
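For readers who want to see how an enrichment figure of this kind is formed, the following sketch compares an observed SNP count in duplicated sequence with the count expected if SNPs were spread uniformly across the genome. The function name `snp_enrichment` and the counts passed to it are placeholders invented for illustration; they are not dbSNP totals, and only the 5.2% duplicated fraction comes from the text.

```python
def snp_enrichment(total_snps, observed_in_dup, dup_fraction):
    """Return (expected, excess, fold) under a uniform-density null model.

    expected: SNPs expected in duplicated sequence if density were uniform
    excess:   observed count minus expected count
    fold:     observed count divided by expected count
    """
    expected = total_snps * dup_fraction
    excess = observed_in_dup - expected
    fold = observed_in_dup / expected
    return expected, excess, fold


# Placeholder counts (hypothetical, NOT real dbSNP figures).
total_snps = 1_000_000
observed_in_dup = 80_000

# 5.2% is the duplicated fraction of the genome quoted in the Comment.
expected, excess, fold = snp_enrichment(total_snps, observed_in_dup, 0.052)
print(f"expected={expected:.0f} excess={excess:.0f} fold={fold:.2f}")
```

The same calculation can be run separately for each SNP class (for example "overlap" versus random SNPs) to compare their fold enrichments, provided the per-class totals and per-class counts within duplications are known.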

Finally, it was not our intention to intimate that the 100,000 variants underlying these duplicated regions were "useless." The variants are, in fact, incredibly important from a practical and evolutionary perspective. Such variants have proved valuable in resolving the structure of these duplicated regions (6, 7) and in providing a baseline to begin to address such issues as positive selection and gene conversion (8). However, for the average user of dbSNP interested in using SNPs in association-based mapping studies, there is the tacit assumption that the SNP maps to a unique region in the genome. The increased density of SNPs within duplicated regions, whether they arise from errors in assembly or gene conversion, will certainly obfuscate and frustrate these types of analyses. We believe that acknowledging this potential contaminant within dbSNP, and precisely demarcating the positions of these regions which associate with duplications, constitutes a useful--indeed, an essential--first step.

Jeff Bailey
Evan Eichler
Department of Genetics,
Center for Computational Genomics, and Center for Human Genetics
Case Western Reserve University School of Medicine and University
Hospitals of Cleveland
Cleveland, OH 44060, USA

REFERENCES

1. J. A. Bailey, A. M. Yavor, H. F. Massa, B. J. Trask, E. E. Eichler, Genome Res. 11, 1005 (2001).
2. M. E. Hurles, BMC Genom. 2, 11 (2001).
3. The International Human Genome Sequencing Consortium, Nature 409, 860 (2001).
4. V. G. Cheung, et al., Nature 409, 953 (2001).
5. X. Estivill, et al., Hum. Mol. Genet. 11, 1987 (2002).
6. J. Horvath, S. Schwartz, E. Eichler, Genome Res. 10, 839 (2000).
7. T. Kuroda-Kawaguchi, et al., Nature Genet. 29, 279 (2001).
8. M. E. Johnson, et al., Nature 413, 514 (2001).
4 October 2002; accepted 5 November 2002





Science. ISSN 0036-8075 (print), 1095-9203 (online)