Science 13 April 2007:
Vol. 316. no. 5822, pp. 222 - 234
DOI: 10.1126/science.1139247

Research Articles

Evolutionary and Biomedical Insights from the Rhesus Macaque Genome

Rhesus Macaque Genome Sequencing and Analysis Consortium: *{dagger} Richard A. Gibbs,1,2 Jeffrey Rogers,3 Michael G. Katze,4 Roger Bumgarner,4 George M. Weinstock,1,2 Elaine R. Mardis,5 Karin A. Remington,6 Robert L. Strausberg,6 J. Craig Venter,6 Richard K. Wilson,5 Mark A. Batzer,7 Carlos D. Bustamante,8 Evan E. Eichler,9 Matthew W. Hahn,10 Ross C. Hardison,11 Kateryna D. Makova,11 Webb Miller,11 Aleksandar Milosavljevic,1,2 Robert E. Palermo,4 Adam Siepel,8 James M. Sikela,12 Tony Attaway,1,2 Stephanie Bell,1,2 Kelly E. Bernard,5 Christian J. Buhay,1,2 Mimi N. Chandrabose,1,2 Marvin Dao,1,2 Clay Davis,1,2 Kimberly D. Delehaunty,5 Yan Ding,1,2 Huyen H. Dinh,1,2 Shannon Dugan-Rocha,1,2 Lucinda A. Fulton,5 Ramatu Ayiesha Gabisi,1,2 Toni T. Garner,1,2 Jennifer Godfrey,5 Alicia C. Hawes,1,2 Judith Hernandez,1,2 Sandra Hines,1,2 Michael Holder,1,2 Jennifer Hume,1,2 Shalini N. Jhangiani,1,2 Vandita Joshi,1,2 Ziad Mohid Khan,1,2 Ewen F. Kirkness,6 Andrew Cree,1,2 R. Gerald Fowler,1,2 Sandra Lee,1,2 Lora R. Lewis,1,2 Zhangwan Li,1,2 Yih-shin Liu,1,2 Stephanie M. Moore,1,2 Donna Muzny,1,2 Lynne V. Nazareth,1,2 Dinh Ngoc Ngo,1,2 Geoffrey O. Okwuonu,1,2 Grace Pai,6 David Parker,1,2 Heidie A. Paul,1,2 Cynthia Pfannkoch,6 Craig S. Pohl,5 Yu-Hui Rogers,6 San Juana Ruiz,1,2 Aniko Sabo,1,2 Jireh Santibanez,1,2 Brian W. Schneider,1,2 Scott M. Smith,5 Erica Sodergren,1,2 Amanda F. Svatek,1,2 Teresa R. Utterback,1,2 Selina Vattathil,1,2 Wesley Warren,5 Courtney Sherell White,1,2 Asif T. Chinwalla,5 Yucheng Feng,5 Aaron L. Halpern,6 LaDeana W. Hillier,5 Xiaoqiu Huang,13 Pat Minx,5 Joanne O. Nelson,5 Kymberlie H. Pepin,5 Xiang Qin,1,2 Granger G. Sutton,6 Eli Venter,6 Brian P. Walenz,6 John W. Wallis,5 Kim C. Worley,1,2 Shiaw-Pyng Yang,5 Steven M. Jones,14 Marco A. Marra,14 Mariano Rocchi,15 Jacqueline E. Schein,14 Robert Baertsch,16 Laura Clarke,17 Miklós Csürös,18 Jarret Glasscock,5 R. Alan Harris,1,2 Paul Havlak,1,2 Andrew R. Jackson,1,2 Huaiyang Jiang,1,2 Yue Liu,1,2 David N. 
Messina,5 Yufeng Shen,1,2 Henry Xing-Zhi Song,1,2 Todd Wylie,5 Lan Zhang,1,2 Ewan Birney,17 Kyudong Han,7 Miriam K. Konkel,7 Jungnam Lee,7 Arian F. A. Smit,19 Brygg Ullmer,20 Hui Wang,7 Jinchuan Xing,7,21 Richard Burhans,11 Ze Cheng,9 John E. Karro,11 Jian Ma,22 Brian Raney,22 Xinwei She,9 Michael J. Cox,12 Jeffery P. Demuth,10 Laura J. Dumas,12 Sang-Gook Han,10 Janet Hopkins,12 Anis Karimpour-Fard,23 Young H. Kim,24 Jonathan R. Pollack,24 Tomas Vinar,8 Charles Addo-Quaye,11 Jeremiah Degenhardt,8 Alexandra Denby,8 Melissa J. Hubisz,25 Amit Indap,8 Carolin Kosiol,8 Bruce T. Lahn,25,26 Heather A. Lawson,11 Alison Marklein,8 Rasmus Nielsen,27 Eric J. Vallender,25,26 Andrew G. Clark,28 Betsy Ferguson,29 Ryan D. Hernandez,8 Kashif Hirani,1,2 Hildegard Kehrer-Sawatzki,30 Jessica Kolb,30 Shobha Patil,1,2 Ling-Ling Pu,1,2 Yanru Ren,1,2 David Glenn Smith,3 David A. Wheeler,1,2 Ian Schenck,11 Edward V. Ball,31 Rui Chen,1,2 David N. Cooper,31 Belinda Giardine,11 Fan Hsu,22 W. James Kent,22 Arthur Lesk,11 David L. Nelson,2 William E. O'Brien,2 Kay Prüfer,32 Peter D. Stenson,31 James C. Wallace,4 Hui Ke,33 Xiao-Ming Liu,34 Peng Wang,33 Andy Peng Xiang,33 Fan Yang,33 Galt P. Barber,22 David Haussler,35,16 Donna Karolchik,22 Andy D. Kern,22 Robert M. Kuhn,22 Kayla E. Smith,22 Ann S. Zwieg22

The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.

1 Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA.
2 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.
3 Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, TX 78227, USA.
4 Department of Microbiology, University of Washington, Seattle, WA 98195, USA.
5 Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA.
6 J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA.
7 Department of Biological Sciences, Biological Computation and Visualization Center, Center for BioModular Multi-scale Systems, Louisiana State University, Baton Rouge, LA 70803, USA.
8 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA.
9 Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.
10 Department of Biology and School of Informatics, Indiana University, Bloomington, IN 47405, USA.
11 Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, PA 16802, USA.
12 Human Medical Genetics and Neuroscience Programs, Department of Pharmacology, University of Colorado at Denver and Health Sciences Center, Aurora, CO 80045, USA.
13 Department of Computer Science, Iowa State University, Ames, IA 50011, USA.
14 Genome Sciences Centre, British Columbia Cancer Agency, 570 West 7th Avenue, Vancouver, BC, Canada.
15 Department of Genetics and Microbiology, University of Bari, Bari, Italy.
16 Department of Bioinformatics, University of California Santa Cruz, Santa Cruz, CA 95060, USA.
17 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.
18 Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, QC H3C 3J7, Canada.
19 Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103–8904, USA.
20 Center for Computation and Technology, Department of Computer Sciences, Louisiana State University, Baton Rouge, LA 70803, USA.
21 Eccles Institute of Human Genetics, University of Utah, Salt Lake City, UT 84112, USA.
22 Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
23 Department of Preventative Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, Aurora, CO 80045, USA.
24 Department of Pathology, Stanford University, Stanford, CA 94305, USA.
25 Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
26 Howard Hughes Medical Institute, Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
27 Institute of Biology, University of Copenhagen, Copenhagen DK-1017, Denmark.
28 Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA.
29 Genetics Research and Informatics Program, Oregon National Primate Research Center, Beaverton, OR 97006, USA.
30 Institute of Human Genetics, University of Ulm, Ulm, 89081, Germany.
31 Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK.
32 Department Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, 04103, Germany.
33 Centre for Stem Cell Biology and Tissue Engineering, Sun Yat-sen University, Guangzhou 510080, China.
34 South-China Primate Research and Development Center, Guangzhou 510080, China.
35 Howard Hughes Medical Institute, Santa Cruz, CA 95060, USA.

{dagger} All authors with their contributions and affiliations appear at the end of this paper.

* To whom correspondence should be addressed. Richard A. Gibbs, E-mail: agibbs{at}bcm.edu

Rhesus macaques (Macaca mulatta) (1) are one of the most frequently encountered and thoroughly studied of all nonhuman primates (table S1.1). They have a broad geographic distribution that reaches from Afghanistan and India across Asia to the Chinese shore of the Pacific Ocean. As an Old World monkey (superfamily Cercopithecoidea, family Cercopithecidae), this species is closely related to humans and shares a last common ancestor from about 25 million years ago (Mya) (2). The two species often live in close association, and macaques exhibit complex and intensely social behavioral repertoires.

The relationship between humans and macaques is even more important because biomedical research has come to depend on these primates as animal models. Compared with rodents, which are separated from humans by more than 70 million years (2, 3), macaques exhibit greater similarity to human physiology, neurobiology, and susceptibility to infectious and metabolic diseases. Critical progress in biomedicine attributed to macaques includes the identification of the "rhesus factor" blood groups and advances in neuroanatomy and neurophysiology. Most important, their response to infectious agents related to human pathogens, including simian immunodeficiency virus and influenza, has made macaques the preferred model for vaccine development. Lesser-known contributions of these animals include their early use in the U.S. space program—a rhesus monkey was launched into space more than a dozen years before any chimpanzee.

The cynomolgus macaque (M. fascicularis), pigtailed macaque (M. nemestrina), and Japanese macaque (M. fuscata) have all contributed to research, but the rhesus macaque has been used most widely. Taxonomists recognize six M. mulatta subspecies (1), which differ substantially in their geographical range, body size, and a variety of morphological, physiological, and behavioral characteristics. North American research colonies include animals representing both Indian and Chinese subspecies, although India ended the exportation of these animals in the 1970s.

With the advent of whole-genome sequencing, a highly accurate human genome sequence and a draft of the chimpanzee genome have been generated and compared. The chimpanzee shared a common ancestor with humans approximately 6 Mya (4, 5), and the major impact of the chimpanzee genome sequence data has been in their direct comparison with data from the human genome. However, the chimpanzee data have major limitations. First, because the alignable sequence is only 1 to 2% different from that of the human, there is no informative "signal" to distinguish conserved elements from the overall high background level of conservation. This is exacerbated by the fact that the chimpanzee genome was an incomplete draft, containing sequence errors that could potentially mask true divergence. Second, the differences that are found between humans and chimpanzees are difficult to assign as specific to either the chimpanzee or the human. As a result, the chimpanzee analyses have on their own provided relatively few answers to the fundamental question of the nature of the specific molecular changes that make us human.

By contrast, the genome of the rhesus macaque has diverged farther from our own, with an average human-macaque sequence identity of ~93%. Figure 1 shows the inferred common ancestor for all three species, as well as a common ancestor that predated the human-chimpanzee divergence. A characteristic that is found in humans but not in the chimpanzee can be recognized as a loss in the chimpanzee if it is present in the macaque, or it can be recognized as a gain in the human if it is absent in macaque. In principle, this three-way comparison should make it possible to pinpoint many changes and identify specific underlying mutational mechanisms, which could have been critically important during the past 25 million years in shaping the biology of the three primate species.
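The outgroup logic described above can be written down as a small decision rule. The following is a minimal sketch, not the consortium's pipeline: the function name `polarize` and the boolean presence/absence encoding are hypothetical, and the mirrored case (a feature present in the chimpanzee but absent in the human) is an assumption added by symmetry under simple parsimony.

```python
from typing import Optional


def polarize(human: bool, chimp: bool, macaque: bool) -> Optional[str]:
    """Assign a human/chimpanzee difference to a lineage, using macaque as outgroup.

    Each argument says whether the feature is present in that species.
    Returns None when human and chimpanzee agree (nothing to polarize).
    """
    if human == chimp:
        return None
    if human and not chimp:
        # Case stated in the text: present in human, absent in chimpanzee.
        return "loss in chimpanzee" if macaque else "gain in human"
    # Mirrored case (assumption, by symmetry): present in chimpanzee only.
    return "loss in human" if macaque else "gain in chimpanzee"


print(polarize(human=True, chimp=False, macaque=True))   # loss in chimpanzee
print(polarize(human=True, chimp=False, macaque=False))  # gain in human
```

The rule assumes a single change on the tree; parallel gains or losses in two lineages would be misassigned, which is one reason real analyses combine several lines of evidence.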


Fig. 1. Evolutionary triangulation in the human, chimpanzee, and rhesus macaque lineages (lineage-specific breaks), showing a summary of chromosomal breakpoints on a microscopic scale (Fig. 3) (7). Circled numbers indicate numbers of lineage-specific breaks.

We examined the basic elements of the rhesus macaque genome and undertook reconstruction of the major changes in the human-chimpanzee–rhesus macaque (HCR) trio. The regions of the genome that were duplicated in macaque were then identified and correlated with other genome features. Individual macaque genes were studied, and the orthologous genes in the HCR trio were aligned to reveal evidence for the action of selection on individual loci. Additional animals from other populations were also sampled by DNA sequencing to study their genetic diversity. Throughout, complementary methods were applied and the different results combined in order to represent the most complete picture of macaque biology. For a visual representation of some of the insights gained from the genome and more information about the importance of the macaque as a model organism, see the poster in this issue (6).


Sequencing the Genome

To generate a draft genome sequence for the rhesus macaque, whole-genome shotgun sequences were assembled. The bulk of the sequencing used DNA from a single M. mulatta female, whereas DNA from an unrelated male was used to construct a bacterial artificial chromosome (BAC) library to provide BAC end sequences and to aid in selective finishing. We used several whole-genome shotgun libraries with different insert sizes (~3.0, 10, 35, and 180 kb) to generate a total of 18.4 Gb of raw DNA sequence through standard fluorescent Sanger sequencing technologies. Initial assemblies to the intermediate scaffold stage were carried out by the three different assembly methods: Atlas–whole-genome shotgun, parallel contig assembly program (PCAP), and the Celera Assembler (7). These were compared by means of more than 200 metrics, including gross sequence statistics, agreement with finished sequence, utility for gene predictions in the Ensembl pipeline, and accuracy of alignment to the human genome. The three unpolished assemblies were found to be largely similar and of high quality, so all were used in combination with other genome data for the subsequent assembly and placement of long sequence segments on the macaque chromosomes (tables S2.1 to S2.4).

To produce an optimal representation of the genome, the three intermediate assemblies were merged (Fig. 2). Melding the assemblies involved mapping the Atlas–whole-genome shotgun and PCAP data to the Celera Assembler output, which had longer contiguity than the other two data sets at this stage of the process. There was little difference between assemblies at the sequence contig level, at which robust sequence alignments guide the reconstructions, so we focused our attention instead on contigs that were joined into scaffolds. Additional pairs of Celera Assembler scaffolds were joined based on their mapping to the other two macaque assemblies. Analysis of the output showed that this composite assembly was superior to any of its components (table S2.4).


Fig. 2. Assembly by three methods of the rhesus macaque genome. WGS, whole-genome shotgun. BCM-HGSC, Baylor College of Medicine Human Genome Sequencing Center; WashU-GSC, Washington University Genome Sequencing Center; JCVI, J. Craig Venter Institute. QA/QC, quality assurance and quality control.

During assembly, a comparison with the human genome sequence [National Center for Biotechnology Information (NCBI) accession code bld35] identified a small number (<100) of obvious inconsistencies, such as improper joins of different chromosomes. These scaffolds were therefore split at the misassembly point. The human map was also used to help place large merged scaffolds onto the macaque chromosomes (8, 9) [the chromosome numbering of Rogers et al. (8) was used] at the highest level of the assembly process. Given that the human data were only used to split scaffolds and that de novo macaque assemblies were always given precedence over the mapping to the human genome in the macaque assembly merging and chromosome assignment process, the final product should not be regarded as a "humanized assembly."

The total length of the combined genome assembly was approximately 2.87 Gb (Table 1). This incorporated ~14.9 Gb of raw sequence, which represents about a 5.2-fold coverage of the macaque genome. Comparison with expressed sequence tag (EST) sequence data and approximately 1.8 Mb of finished sequence (see "Selected sequence finishing," below) indicated that ~98% of the available genome was represented. No misassemblies were identified in that comparison. Contigs showed an N50 (minimum length of contigs representing half of the total length of the assembly) of >25 kb; the N50 for sequence scaffolds was >24 Mb. GenBank accession codes are available online (table S2.5).
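The N50 statistic and the fold-coverage figure quoted above are simple to compute. The sketch below is illustrative only: the function name `n50` and the toy contig lengths are hypothetical, and only the 14.9 Gb and 2.87 Gb values come from the text.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach half the total).

    Sequences are taken from longest to shortest until their combined
    length reaches at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")


# Toy contig lengths (hypothetical): total 300, so half is 150.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # (80, 2)

# Fold coverage from the figures in the text: raw sequence / assembly length.
fold_coverage = 14.9e9 / 2.87e9
print(round(fold_coverage, 1))  # 5.2
```

The second value returned corresponds to the "Number to N50" row of Table 1.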


Table 1. M. mulatta assembly statistics. Total bases, excluding gaps, number 2,871,189,834.
                  Contigs     Scaffolds
Total number      301,039       122,580
N50 size in bp     25,707    24,345,431
Number to N50      32,114            36
Largest in bp     219,335    98,200,701

Selected sequence finishing. The rhesus macaque genome assembly is a draft DNA sequence, and it contains many gaps. A higher data quality with greater contiguity was desired at several genomic regions that attracted additional interest. In these cases, individual BAC clones were isolated, and data quality was improved by sequence "finishing." Many of these BACs were in regions of pronounced genome duplication, whereas others were gene-rich. All finished BACs, their gene content, and their genome coordinates are listed in table S2.6.


Overview of Genome Features

General organization and content. The macaque genome is organized into 20 autosomes and the XY sex chromosomes. Apart from 48 breakpoints (Fig. 1), there is a superficial similarity between the macaque and human chromosomes (8–11). These breakpoints include three fusions, one fission, and breakpoints induced by inversions, each detectable through chromosome staining, radiation hybrid mapping, or comparative linkage mapping. Several chromosomes in the macaque are also more acrocentric than their human counterparts, but many from the two species are difficult to distinguish.

Nucleotide sequences that aligned between the human and rhesus average 93.54% identity. If, however, small insertions and deletions are included in the calculation, identity is reduced to 90.76%. Considering regions that are difficult to align, such as lineage-specific interspersed repeat elements, would further decrease the level of computed identity. Moreover, evolutionary distances exhibit local fluctuations, as in other mammals (3), and less divergence was observed in chromosome X (94.26% identity of aligned bases). The GC-content of the rhesus in aligned bases was not notably lower than that of the human (40.71% versus 40.74%).
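The gap between the two identity figures above comes from whether indel positions enter the denominator. The following sketch shows the distinction on a toy pairwise alignment. It assumes that every gap column counts as one non-identical position, which may differ from how the consortium scored indels; the function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aln_a: str, aln_b: str, count_gaps: bool = False) -> float:
    """Percent identity between two rows of a pairwise alignment ('-' marks a gap).

    With count_gaps=False, only columns where both rows carry a base are compared.
    With count_gaps=True, gap columns are added to the denominator as mismatches
    (an assumption made for this sketch).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned rows must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aln_a, aln_b):
        if a == "-" and b == "-":
            continue  # column carries no information for this pair
        if a == "-" or b == "-":
            if count_gaps:
                compared += 1
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


row_a = "ACGT-ACGTA"
row_b = "ACGTTACCTA"
print(round(percent_identity(row_a, row_b), 2))         # 88.89
print(percent_identity(row_a, row_b, count_gaps=True))  # 80.0
```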

Gene content. A human-centric approach was used to generate new macaque gene sets (table S3.1 and fig. S3.1). These sets include (i) Ensembl (12) gene models based primarily on the alignment of the human UniProt and RefSeq resources with the current assembly to define the overall gene model, followed by the introduction of the macaque-specific sequences (mainly as lineage-specific paralogs) in that framework; (ii) Gnomon (NCBI) models that include the consideration of the available (~50,000) macaque ESTs along with the human RefSeq; and (iii) Nscan data that include multiple-species alignments along with cDNA alignments (13). Overall, ~20,000 loci were predicted by our methods in which at least one exon was found by two additional predictors. An additional ~5000 loci were each predicted by a single method, but manual inspection of a subset of these loci shows that they are enriched in gene-prediction errors, mainly due to misclassification of evidence (e.g., cDNAs from untranslated regions that were classified as protein coding). On average, high-confidence orthologs have 97.5% identity between the human and macaque at both the nucleotide and amino acid sequence levels. (The nucleotide and amino acid percentages agree because roughly one-third of nucleotide differences within coding regions change an amino acid.)

Overall repetitive landscape. Repeat elements account for ~50% of the genomes of all sequenced primates (14) (Table 2). Similar to the human, the rhesus macaque contains about 320,000 recognizable copies from more than 100 different families of DNA transposons and more than half a million recognizable copies of endogenous retroviruses (ERVs). In general, the DNA transposons show no new lineages, but the ERVs demonstrate a complex phylogeny and many examples of new and expanded family members, some resulting from horizontal transmission. In addition, we conservatively estimate that ~20,000 L1s [a family of long interspersed elements (LINEs)], and ~110,000 Alu elements [a primate-specific family of short interspersed elements (SINEs)], were specifically acquired in the Old World monkey lineage. These two retrotransposon families accounted for most lineage-specific insertions and have played a major role in shaping genomic architecture. Among them, rhesus macaque–specific subsets (derived from the L1PA5 lineage and AluY) are frequently polymorphic and can be assayed by polymerase chain reaction (PCR) genotyping analyses for genetic studies (15).


Table 2. Summary of repeat content of the rhesus macaque genome compared with the human and chimpanzee genomes. hg18, human genome version 18; panTro2, Pan troglodytes version 2; rheMac2, rhesus macaque version 2; LTR, long terminal repeat; MIR, mammalian interspersed repeat. SVA is a composite repetitive element named after its main components, SINE, variable number of tandem repeats, and Alu; includes SVA precursor elements.
Species    DNA       LTR/ERV   LINE L1   LINE L2   SINE Alu    SINE MIR   SVA
hg18       355,000   506,000   572,000   363,000   1,144,000   584,000    3400
panTro2    305,000   453,000   558,000   315,000   1,111,000   553,000    4400
rheMac2    327,000   432,000   531,000   298,000   1,094,000   539,000    150


Determining Ancestral Genome Structure

Cytogenetically visible rearrangements. The most notable genomic differences among the HCR trio are cytogenetically visible rearrangements. The human and chimpanzee karyotypes are distinguishable by one chromosome fusion and nine cytogenetically visible pericentric inversions (16); with the use of the macaque as an outgroup, all of these breakpoints (except those induced by two inversions) have now been characterized at the DNA sequence level (17). Analysis of genomic sequence confirms that 14 breakpoints, corresponding to seven inversions, occurred in the chimpanzee lineage, as indicated in Fig. 1. (Five of the inversions are summarized in table S4.1.) The pericentric inversions of human chromosomes 1 and 18 and the fusion creating human chromosome 2 are specific to the human. Comparison of the reconstructed human-chimpanzee ancestral genome and the rhesus genome reveals 43 breakpoints on the microscopic scale (Figs. 1 and 3).


Fig. 3. Chromosomal breakpoints between rhesus macaque and the human-chimpanzee ancestor. Each chromosome is represented by a white bar (left) and a colored bar (right). A total of 820 thin horizontal lines in the white bars represent submicroscopic breakpoints (10-kbp to 4-Mbp range) detected by genomic triangulation (19), and 43 thick black lines in the colored bars represent breakpoints on a microscopic scale (>4 Mbp) (7). Numbers above each bar show the total lines within the bar.

Submicroscopic rearrangements. Previous analyses [reviewed in (14)] have indicated that primate genomes harbor more structural differences than are visible by cytogenetic staining. Analysis of these events is complicated by two issues: the draft state of the genomes and the presence of extensive segmental duplications. We analyzed these structural rearrangements by using the distance between orthologous blocks in each species to infer the ancestral genome structure and determine where rearrangements occurred on the phylogenetic tree. We excluded events smaller than 10 kilobase pairs (kbp), which are mostly due to retroposon insertions, and focused on cytogenetically undetectable breakpoints induced by insertions, deletions, inversions, and complex rearrangements of sizes between 10 kbp and 4 Mbp. Data were combined from inversion detection and ancestral reconstructions by the contiguous ancestral regions method (18) and gap detection by the genomic triangulation method (19), which further integrates data from genomic sequence comparisons (20) and comparative maps (8, 9, 21). The analysis revealed more than 1000 rearrangement-induced breakpoints across the HCR lineages, of which 820 occur between rhesus and the reconstructed human-chimpanzee ancestor (Fig. 3 and fig. S4.1). Each chromosome therefore constitutes a complex mosaic, with multiple changes introduced to orthologous counterparts. When rhesus macaque is compared with the human-chimpanzee ancestor, the X chromosome exhibits three times more rearrangements per megabase than the autosomes. This is both statistically significant and consistent with a slightly more than threefold difference observed in the human lineage following the branching off of chimpanzee (19). Given that a slower rate of variability at the single-nucleotide level in the X chromosome compared with autosomes has been interpreted as support for speciation models, this difference is worthy of further investigation (22).
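The size bands used in this analysis (events under 10 kbp excluded, 10 kbp to 4 Mbp treated as submicroscopic, and larger events as microscopic, as in Fig. 3) amount to a simple binning rule. The sketch below is an illustration only; the function name `size_class` is hypothetical, and treating both boundary values as submicroscopic is an assumption, since the text does not say how exact boundary sizes were handled.

```python
SUBMICROSCOPIC_MIN_BP = 10_000      # events below this were excluded
SUBMICROSCOPIC_MAX_BP = 4_000_000   # events above this are microscopic


def size_class(span_bp: int) -> str:
    """Bin a rearrangement by its span in base pairs.

    Boundary values are assigned to the submicroscopic band (an assumption).
    """
    if span_bp < SUBMICROSCOPIC_MIN_BP:
        return "excluded"
    if span_bp <= SUBMICROSCOPIC_MAX_BP:
        return "submicroscopic"
    return "microscopic"


for span in (5_000, 250_000, 6_500_000):  # toy spans, not data from the study
    print(span, size_class(span))
```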


Duplications in the Genome and Gene Family Expansions

Genomic duplications. Segmental duplications of genomic regions and the genes they contain are well known in mammals and are postulated to drive fundamental processes, including the birth of new genes and the subsequent expansion of gene families (23). To discover duplications in the macaque genome, we used a battery of different complementary approaches. Two of these, whole-genome assembly comparison (24) and BLASTZ (25) analysis of segmental duplications, depended directly on the assembly. We used a third method, whole-genome shotgun sequence detection (26), that calculated depth of coverage of the raw shotgun sequence reads relative to the assembly. A fourth procedure was created on the basis of BAC end sequence reads combined with BACs that were directly mapped by means of the pooled genomic indexing method (21). The common interspersed repeat families were not considered in any of these analyses.

The first two approaches identified approximately 35.0 Mb of a recently duplicated sequence in the macaque assembly. A further ~15 Mb were collapsed in the assembly and discovered by whole-genome shotgun sequence detection (fig. S5.1 and table S5.1). Adjusting for these collapsed duplications and the overall assembly coverage, we estimate that approximately 66.7 Mb or 2.3% of the macaque genome consists of segmental duplication (Fig. 4)—this proportion is substantially lower than that of either the human or chimpanzee genome (5 to 6%) (26, 27).
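The idea behind depth-of-coverage detection is that a duplication collapsed into one copy in the assembly attracts reads from every copy, so its read depth is elevated. The following is a minimal sketch of that idea, not the published whole-genome shotgun sequence detection method: the function name `flag_excess_depth`, the fixed windows, the use of the median as the single-copy baseline, and the 1.5-fold cutoff are all assumptions chosen for illustration.

```python
from statistics import median


def flag_excess_depth(window_depths, ratio_cutoff=1.5):
    """Return indices of windows whose read depth exceeds ratio_cutoff x median.

    window_depths: average read depth per fixed-size assembly window.
    The median stands in for the expected single-copy depth (assumption).
    """
    baseline = median(window_depths)
    cutoff = ratio_cutoff * baseline
    return [i for i, depth in enumerate(window_depths) if depth > cutoff]


# Toy depths (hypothetical): windows 5 and 6 carry roughly twice the typical depth.
depths = [5, 6, 5, 4, 5, 11, 12, 5, 6, 4]
print(flag_excess_depth(depths))  # [5, 6]
```

A median baseline is used here because a mean would be pulled upward by the very windows being searched for.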


Fig. 4. Global pattern of macaque segmental duplications. The statistics are based on all WGAC duplications (>90%, >1 kb in length), whereas the figure displays only those between 90 and 95% sequence identity and >10 kb in length for simplicity. Red lines indicate interchromosomal (Inter) duplications, blue ticks show intrachromosomal (Intra) events, and purple bars show centromeric, acrocentric, and/or large-gap regions. WGAC, whole-genome assembly comparison. nr, nonredundant.

The pooled genomic indexing and BAC end sequence read methods suggested slightly higher levels of overall duplication, on the basis of fluorescence in situ hybridization analysis of randomly selected large-insert BAC clones (28). However, this estimate was still less than the 4.8% recently estimated for the baboon genome (28). Overall, we consider 2.3% to be the lower bound of duplicated genomic DNA in the macaque genome.

As with the human and chimpanzee, the analysis of the macaque assembly revealed an enrichment of segmental duplications near gaps, centromeres, and telomeres (14, 29). The study also identified segmental duplications that contain genes of high biological significance. For example, the CCL3L1-CCL4 gene region [for which copy-number variation in humans is correlated with susceptibility to HIV infection (30)], cytochrome P450 (associated with toxicity response), KRAB-C2H2 zinc finger (a developmental regulatory transcription factor), olfactory receptor (smell), human leukocyte antigen (HLA), and other immune and autoantigen gene families were all observed in regions of genome duplication.

Expansion of gene families. Two approaches were used to study gene family structure directly within the draft genome sequence: (i) a statistical approach, based on a likelihood model of gene gain and loss across the mammalian tree (31) and (ii) hybridization of whole genomic DNA to cDNA arrays [a variation of array-based comparative genomic hybridization (array CGH)] to observe changes in gene content directly (32). The results are shown in Tables 3 and 4.


Table 3. Gene families with significant copy-number expansions (P < 0.0001) in the human and the identical statistic for the rhesus macaque. Gene family ID, identification numbers from Ensembl version 41. Family size, number of gene copies in the current genome assemblies. Gains and losses, number of genes gained and lost since the human's split with chimpanzee or the macaque's split with human-chimpanzee lineage. IG, immunoglobulin; IGE, immunoglobulin E; Pre, precursor; MHC, major histocompatibility complex; TCR, T cell receptor; ENV, envelope; ATP, adenosine 5'-triphosphate.
Gene family ID Description Family size Gains Losses

Expanded in human
    ENSF00000000020 IG heavy chain V region 42 10 0
    ENSF00000000073 Receptor 56 16 0
    ENSF00000000233 Peptidyl prolyl cis trans isomerase 38 <