These ORFs overlap the coding region of the N protein.
The coronavirus rep gene products are translated from genomic RNA, but the remaining viral proteins are translated from subgenomic mRNAs that form a 3'-coterminal nested set, each with a 5' end derived from the genomic 5' leader sequence. The coronavirus subgenomic mRNAs are synthesized through a discontinuous transcription process, the mechanism of which has not been unequivocally established (8, 13). The SARS-CoV leader sequence was mapped by comparing the sequence of 5' RACE (rapid amplification of cDNA ends) (11) products synthesized from the N gene mRNA with those synthesized from genomic RNA. A sequence, AAACGAAC (genomic nucleotides 65 to 72), was identified immediately upstream of the site where the N gene mRNA and genomic sequences diverged. This sequence was also present upstream of ORF1a and immediately upstream of five other ORFs (Fig. 1, A and B, and table S1), suggesting that it functions as the conserved core of the transcription-regulating sequences (TRSs). The nucleotides required for TRS function must be identified experimentally.
The favored model for production of subgenomic mRNAs of coronaviruses proposes that discontinuous transcription occurs during synthesis of the negative strand (13). Subgenomic negative strands containing a complementary copy of the leader sequence at their 3' termini serve as templates for synthesis of subgenomic mRNAs. In addition to the site at the 5' terminus of the genome, the TRS conserved core sequence appears six times in the remainder of the genome. The positions of the TRS in the genome of SARS-CoV predict that subgenomic mRNAs of 8.3, 4.5, 3.4, 2.5, 2.0, and 1.7 kb, not including the poly(A) tail, should be produced (Fig. 1, A and B, and table S1). At least five subgenomic mRNAs were detected by Northern hybridization of RNA from SARS-CoV-infected cells, using a probe derived from the 3' untranslated region (Fig. 1C). The calculated sizes of the five predominant bands correspond to the sizes of five of the predicted subgenomic mRNAs of SARS-CoV; we cannot exclude the possibility that other, low-abundance mRNAs are present. Full-length genomic RNA was not detected, probably because it is the least prevalent viral RNA in infected cells (8). The predicted 2.0-kb transcript was also not detected, which suggests that the consensus TRS at nt 27,771 to 27,778 is not used or that it is a low-abundance mRNA. By analogy with other coronaviruses (8), the 8.3-kb and 1.7-kb subgenomic mRNAs are predicted to be monocistronic, directing translation of S and N, respectively, whereas multiple proteins could be translated from the 4.5-kb (X1, X2, and E), 3.4-kb (M and X3), and 2.5-kb (X4 and X5) mRNAs. A consensus TRS is not found directly upstream of the ORF encoding the predicted E protein (14), and a monocistronic mRNA that would be predicted to code for E could not be clearly identified by Northern blot analysis.
It is possible that the 3.4-kb band contained more than one mRNA species that were not resolved in the gel or that the monocistronic mRNA for E is a low-abundance message. Also, in some coronaviruses, the E protein is translated from the second ORF on a polycistronic mRNA (15, 16).
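The size predictions above follow from simple arithmetic on the TRS positions. As a minimal sketch, assuming that each subgenomic mRNA consists of the leader sequence upstream of the 5'-proximal TRS core joined to the genome from a downstream TRS core through the 3' end (poly(A) tail excluded), the expected sizes could be computed as follows. The function names and the toy sequence are hypothetical and illustrate only the calculation; they are not the SARS-CoV sequence or the procedure used for table S1.

```python
TRS_CORE = "AAACGAAC"  # conserved TRS core sequence reported in the text


def find_trs(genome, core=TRS_CORE):
    """Return the 1-based start position of every occurrence of the core."""
    positions = []
    i = genome.find(core)
    while i != -1:
        positions.append(i + 1)
        i = genome.find(core, i + 1)
    return positions


def predicted_mrna_sizes(genome, core=TRS_CORE):
    """Map each body TRS position to a predicted subgenomic mRNA length.

    Assumption (not taken from the paper): the mRNA is the leader upstream
    of the first core joined to the genome from the body core to the 3' end.
    """
    positions = find_trs(genome, core)
    if not positions:
        raise ValueError("no TRS core found")
    leader_trs, body_trs = positions[0], positions[1:]
    leader_len = leader_trs - 1  # nucleotides upstream of the leader core
    return {p: leader_len + (len(genome) - p + 1) for p in body_trs}


# Hypothetical toy genome with one leader core and two body cores.
toy_genome = ("G" * 10 + TRS_CORE + "T" * 50 + TRS_CORE
              + "C" * 30 + TRS_CORE + "G" * 20)
sizes = predicted_mrna_sizes(toy_genome)
print(sizes)
```

Applied to a real genome, the same calculation would list one predicted transcript per body TRS, which could then be compared with the band sizes observed by Northern hybridization.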
Phylogenetic analyses of the sequence of SARS-CoV. To determine the relationship between SARS-CoV and the previously characterized coronaviruses, we compared the predicted amino acid sequences for three well-defined enzymatic proteins encoded by the rep gene and the four major structural proteins of SARS-CoV with those from representative viruses for each of the species of coronavirus for which complete genomic sequence information was available (Fig. 2). The topologies of the resulting phylograms are remarkably similar (Fig. 2A). For each protein analyzed, the species formed monophyletic clusters consistent with the established taxonomic groups. In all cases, SARS-CoV sequences segregated into a fourth, well-resolved branch. These clusters were supported by bootstrap values above 90% [1000 replicates (17)]. Consistent with pairwise comparisons between the previously characterized coronavirus species (Fig. 2B), there was greater sequence conservation in the enzymatic proteins [3CLpro, polymerase (POL), and helicase (HEL)] than among the structural proteins (S, E, M, and N). These results indicate that SARS-CoV is not closely related to any of the previously characterized coronaviruses and forms a distinct group within the genus Coronavirus. SARS-CoV is approximately equidistant from all previously characterized coronaviruses, just as the existing groups are from one another. Detailed pairwise comparison by dot-plot analysis identified many regions of amino acid conservation within each protein (fig. S1), but the overall level of similarity between SARS-CoV and the other coronaviruses was low (Fig. 2B). No evidence for recombination was detected when the predicted protein sequences were analyzed with the program Sim-Plot (17, 18).
Fig. 2. Phylogenetic analysis and pairwise identities of coronavirus proteins. Predicted amino acid sequences of SARS-CoV proteins were compared with those from reference viruses representing each species in the three groups of coronaviruses for which complete genomic sequence information was available [group 1 (G1): human coronavirus 229E (HCoV-229E), af304460; porcine epidemic diarrhea virus (PEDV), af353511; transmissible gastroenteritis virus (TGEV), aj271965. Group 2 (G2): bovine coronavirus (BCoV), af220295; murine hepatitis virus (MHV), af201929. Group 3 (G3): infectious bronchitis virus (IBV), m95169]. Sequences for representative strains of other coronavirus species, for which partial sequence information was available, were included for some of the structural protein comparisons [group 1: canine coronavirus (CCoV), d13096; feline coronavirus (FCoV), ay204704; porcine respiratory coronavirus (PRCoV), z24675. Group 2: human coronavirus OC43 (HCoV-OC43), m76373, l14643, m93390; porcine hemagglutinating encephalomyelitis virus (HEV), ay078417; rat coronavirus (RtCoV), af207551]. (A) Sequence alignments and neighbor-joining trees were generated by the use of ClustalX 1.83 with the Gonnet protein comparison matrix. The resulting trees were adjusted for final output with treetool 2.0.1. (B) Uncorrected pairwise distances were calculated from the aligned sequences with the Distances program from the Wisconsin Sequence Analysis Package, version 10.2 (Accelrys, Burlington, MA). Distances were converted to percent identity by subtracting from 100. aa, amino acid.
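The conversion described for panel (B) can be illustrated with a short sketch. It assumes that the uncorrected distance is the percentage of mismatched positions among aligned columns and that columns containing a gap in either sequence are skipped; the gap handling of the Distances program may differ, and the function names and example sequences here are hypothetical.

```python
def uncorrected_distance(seq_a, seq_b, gap="-"):
    """Percent of mismatched residues among gap-free aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are not scored
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * mismatches / compared


def percent_identity(seq_a, seq_b):
    """Percent identity obtained by subtracting the distance from 100."""
    return 100.0 - uncorrected_distance(seq_a, seq_b)


# Hypothetical aligned fragments: 8 gap-free columns, 2 mismatches.
aligned_a = "MKTAY-LVG"
aligned_b = "MRTAYQLIG"
print(uncorrected_distance(aligned_a, aligned_b))
print(percent_identity(aligned_a, aligned_b))
```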
Predicted replicase gene products of SARS-CoV. Coronaviruses encode a chymotrypsin-like protease, 3CLpro, that is analogous to the main picornaviral protease 3Cpro (19). They also encode one (group 3) or two (groups 1 and 2) papain-like proteases, termed PLP1pro and PLP2pro, which are analogous to the foot-and-mouth disease virus leader protease Lpro. Overall, gene products of ORF1a are poorly conserved among different coronaviruses, except for these protease sequences (fig. S1). The predicted gene product of ORF1a of SARS-CoV appears to contain only one PLPpro domain at amino acids 1632 to 1847. The 3CLpro catalytic histidine and cysteine residues are fully conserved among all coronaviruses (SARS-CoV amino acids His3281 and Cys3385), but coronaviruses appear to lack the conserved catalytic acidic residue that is characteristic of other 3C-like proteases (19). The coronavirus replicase polyprotein is synthesized by a -1 ribosomal frameshift at a conserved "slippery" site (UUUAAAC) immediately upstream of a pseudoknot structure in the overlap of ORF1a and ORF1b. This polyprotein is autocatalytically processed to yield the mature viral proteases PLPpro and 3CLpro, the RNA-dependent polymerase (POL), the RNA helicase (HEL), and other proteins whose functions have not been well characterized. The predicted ribosomal frameshift at the SARS-CoV slippery site (nt 13,392 to 13,398) would result in translation of 7073 amino acids from a single start site.
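The effect of the frameshift on the reading frame can be sketched in a few lines. The example below assumes the commonly described model in which the codons of the slippery site are decoded in the original frame and the ribosome then slips back one nucleotide, so that the final C of UUUAAAC is read a second time as the first base of the next codon. The toy RNA, the function names, and the requirement that the site end on a codon boundary are illustrative assumptions, not the SARS-CoV sequence or the analysis used in this study.

```python
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built from the conventional UCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(rna, start=0):
    """Translate from `start` until a stop codon or the end of the RNA."""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def frameshift_product(rna, slippery="UUUAAAC"):
    """Protein made when the ribosome shifts -1 after the slippery site.

    Assumes the RNA starts at the initiation codon, that the slippery site
    ends on a codon boundary of that frame, and that no stop codon lies
    upstream of the site.
    """
    site = rna.find(slippery)
    if site == -1:
        raise ValueError("slippery site not found")
    site_end = site + len(slippery)
    if site_end % 3 != 0:
        raise ValueError("slippery site is not aligned with the 0 frame")
    upstream = translate(rna[:site_end])
    # Slip back one nucleotide: the last base of the site is read again.
    downstream = translate(rna, start=site_end - 1)
    return upstream + downstream


# Hypothetical toy message: the 0 frame ends at a stop codon.
toy_rna = "AUG" "UUU" "UUA" "AAC" "GGG" "UUU" "GCG" "GUG" "UAA"
print(translate(toy_rna))           # unshifted product
print(frameshift_product(toy_rna))  # product after the -1 shift
```

In the toy message the unshifted frame terminates at a stop codon, whereas the shifted frame reads past it, which mirrors how ORF1b is expressed as an extension of ORF1a.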
Analysis of the predicted structural proteins of SARS-CoV. The structural proteins of coronaviruses (S, E, M, and N) function during host cell entry and virion morphogenesis and release (20). During virion assembly, N binds to a defined packaging signal on viral RNA, leading to the formation of the helical nucleocapsid. M is localized at specialized intracellular membrane structures, and interactions between the M and E proteins and nucleocapsids result in budding through the membrane. In some group 2 coronaviruses, the C terminus of M interacts with the nucleocapsid to form a core structure (21). The S protein is incorporated into the viral envelope, again by interaction with M, and mature virions are released from smooth vesicles (22). Bands corresponding to the predicted N and S proteins of SARS-CoV were visible in preparations of purified virions that were analyzed by SDS-polyacrylamide gel electrophoresis; however, the assignment of other proteins in virions awaits the availability of specific antibodies to identify these viral proteins (fig. S4).
The S proteins of coronaviruses are large type-I membrane glycoproteins that are responsible both for binding to receptors on host cells and for membrane fusion. The S proteins of some coronaviruses are cleaved into S1 and S2 subunits. S proteins also contain important virus-neutralizing epitopes, and amino acid changes in the S proteins can dramatically affect the virulence and in vitro host cell tropism of the virus (23, 24). Because of the low level of similarity (20 to 27% pairwise amino acid identity) between the predicted amino acid sequence of the S protein of SARS-CoV and the S proteins of other coronaviruses (Fig. 2B and fig. S1A), the comparison of primary amino acid sequences does not provide insight into the receptor-binding specificity or antigenic properties of SARS-CoV.
The S protein of SARS-CoV has 23 potential N-linked glycosylation sites (table S2). Functional motifs at the amino (N) and carboxyl (C) termini of the S protein that are conserved among the coronaviruses are also present in the predicted SARS-CoV S protein, although the S2 domain is more conserved than the S1 domain. The N terminus of the SARS-CoV S protein contains a short type-I signal sequence composed of hydrophobic amino acids that are presumably removed during cotranslational transport through the endoplasmic reticulum. The C terminus, consisting of a transmembrane domain and a cytoplasmic tail rich in cysteine residues, is highly conserved in SARS-CoV (Fig. 3). At 52 amino acids in length, the SARS-CoV S protein is predicted to have the shortest transmembrane domain and cytoplasmic tail of any coronavirus analyzed (Fig. 3) (range, 61 to 74 amino acids).
Fig. 3. Conserved motifs in coronavirus S proteins. Alignment of the C-terminal region of the SARS-CoV and reference coronavirus S proteins was generated with ClustalX 1.83. Residues that match the SARS-CoV sequence exactly are boxed. The membrane-spanning domain and cytoplasmic tails are delineated with arrows. The amino acid sequence Y(V/I)KWPW(Y/W)VWL (26) is a conserved motif in all three coronavirus groups. The cysteine-rich region, which overlaps the membrane-spanning region and the cytoplasmic region, is also found in all coronavirus groups.
The current paradigm of protein-mediated membrane fusion proposes the collapse of alpha-amphipathic regions in the C half of the coronavirus S protein into coiled coils, thus bringing a fusion peptide toward the transmembrane domain, resulting in cellular and viral membrane fusion. Two or three alpha-amphipathic regions are predicted for the C half of coronavirus S proteins. An alpha-amphipathic region of 116 amino acids was predicted with high confidence at positions 884 to 999 of the SARS-CoV S protein (fig. S2). Syncytia formation, however, is not a prominent feature of SARS-CoV infection of Vero cells (5). The SARS-CoV S protein lacks the basic amino acid cleavage site found in group 2 and group 3 coronaviruses (25), suggesting that the SARS-CoV S protein is probably not cleaved into S1 and S2 subunits.
Although overall sequence conservation is low (Fig. 2B), the predicted E, M, and N proteins of SARS-CoV contain conserved motifs that are found in other coronaviruses. Consistent with the E proteins of other coronaviruses, the predicted E protein of SARS-CoV contains a hydrophobic domain (residues 12 to 37) flanked by charged residues and followed by a cysteine-rich region. The N-terminal domains of coronavirus M proteins are exposed on the viral surface, whereas the C terminus is inside the viral membrane. Most coronavirus M proteins, including the predicted M protein of SARS-CoV, contain three hydrophobic transmembrane domains in the N-terminal half of the protein, although some viruses have four. A highly conserved amino acid sequence [SwWSFNPE (26)], immediately following the third hydrophobic domain, is SMWSFNPE in the SARS-CoV M protein. The M proteins of coronaviruses are invariably glycosylated near the N terminus. The M proteins of group 1 and group 3 coronaviruses are N-glycosylated, whereas those of group 2 viruses are O-glycosylated (27, 28). The predicted M protein of SARS-CoV has an NGT near its N terminus, suggesting that this protein is N-glycosylated at position 4.
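Potential N-linked glycosylation sites such as the NGT noted above are usually predicted from the sequon N-X-S/T, in which X is any residue except proline. A minimal sketch of such a scan is shown below; the function name and the test peptide are hypothetical, and a match indicates only a potential site, not demonstrated glycosylation.

```python
import re

# Sequon N-X-[S/T] with X != P; the lookahead keeps overlapping sites.
SEQUON = re.compile(r"N(?=[^P][ST])")


def n_glycosylation_sites(protein):
    """Return 1-based positions of Asn residues in N-X-S/T sequons."""
    return [match.start() + 1 for match in SEQUON.finditer(protein)]


# Hypothetical peptide: NGT at position 4 is a sequon, NPS at 9 is not.
toy_peptide = "MADNGTITNPSA"
print(n_glycosylation_sites(toy_peptide))
```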
The predicted N protein of SARS-CoV is a highly charged basic protein of 422 amino acids (range for other coronaviruses, 377 to 454) with seven successive hydrophobic residues near the middle of the protein. Although the overall amino acid sequence homology among coronavirus N proteins is low (Fig. 2B), a highly conserved motif [FYYLGTGP (26)] occurs in the N-terminal half of all coronavirus N proteins, including that of SARS-CoV. Other conserved residues occur near this highly conserved motif (fig. S3).
Conclusion. The completion of the genomic sequence of SARS-CoV provides a first look at the molecular characteristics of this virus and clearly demonstrates that this virus has features typical of a coronavirus, while it also has features that distinguish it from all previously sequenced coronaviruses. Relative to other coronaviruses, no significant major genomic rearrangements or any examples of large insertions or deletions in the genes coding for the replicase, S, E, M, or N proteins were found. Like some other coronaviruses, SARS-CoV has several small nonstructural ORFs that are found between the genes for S and E and between the genes for M and N. SARS-CoV is a novel virus that is phylogenetically distinct from other characterized coronaviruses. The genetic distance between SARS-CoV and any other coronavirus in all gene regions implies that no large part of the SARS-CoV genome was derived from other known viruses. The SARS-CoV genomic sequence does not provide obvious clues concerning the potential animal origins of this pathogen.
The genome of SARS-CoV has several unique features that could be of biological significance. The short anchor of the S protein, the specific number and location of small ORFs, and the presence of only one copy of the PLPpro provide a combination of genetic features that readily differentiate this virus from previously described coronaviruses. Of course, the significance of any of these features remains to be determined experimentally.
Successful control of the global SARS epidemic will require the development of vaccines and antiviral compounds that effectively prevent or treat this disease, as well as rapid and sensitive diagnostic tests to monitor its spread. The availability of complete genomic sequences (table S3) (29) of SARS-CoV in just a few weeks after the discovery of the virus should have an immediate impact on disease control efforts by making it possible to develop improved diagnostic tests, vaccines, and antiviral agents. The sequence information will also make it possible to identify the origin and natural reservoir of this virus and to contribute to studies of the immune response to this virus and the pathogenesis of SARS-CoV-related disease. The stage is set for the international scientific community to respond and to rapidly develop the tools to control this emerging infectious disease.
References and Notes
- 1. S. M. Poutanen et al., N. Engl. J. Med., available 17 April 2003 at http://nejm.org/earlyrelease/sars.asp#4-2.
- 2. N. Lee et al., N. Engl. J. Med., available 17 April 2003 at http://nejm.org/earlyrelease/sars.asp#4-2.
- 3. K. W. Tsang et al., N. Engl. J. Med., available 17 April 2003 at http://nejm.org/earlyrelease/sars.asp#4-2.
- 4. Centers for Disease Control and Prevention, Morb. Mortal. Wkly. Rep. 52, 357 (2003).
- 5. T. G. Ksiazek et al., N. Engl. J. Med. 348, 1947 (2003).
- 6. J. S. Peiris et al., Lancet 361, 1319 (2003).
- 7. C. Drosten et al., N. Engl. J. Med., available 17 April 2003 at http://nejm.org/earlyrelease/sars.asp#4-2.
- 8. M. M. C. Lai, K. V. Holmes, in Fields Virology, D. M. Knipe, P. M. Howley, Eds. (Lippincott Williams & Wilkins, New York, ed. 4, 2001), chap. 35.
- 9. L. Enjuanes et al., in Virus Taxonomy, M. H. V. van Regenmortel et al., Eds. (Academic Press, New York, 2000), pp. 835-849.
- 10. K. V. Holmes, in Fields Virology, D. M. Knipe, P. M. Howley, Eds. (Lippincott Williams & Wilkins, New York, ed. 4, 2001), chap. 36.
- 11. Materials and methods are available as supporting material on Science Online.
- 12. Although the match was not statistically significant, the C half of potential protein X1 contains a region of similarity with calcium-transporting adenosine triphosphatases.
- 13. G. S. Sawicki, D. L. Sawicki, Adv. Exp. Med. Biol. 440, 215 (1998).
- 14. The sequence immediately upstream of the ORF coding for the predicted E protein is GTACGAAC and differs from the sequence of the consensus TRS at the first two positions.
- 15. D. X. Liu, S. C. Inglis, J. Virol. 66, 6143 (1992).
- 16. V. Thiel, S. G. Siddell, J. Gen. Virol. 75, 3041 (1994).
- 17. P. Rota et al., data not shown.
- 18. K. S. Lole et al., J. Virol. 73, 152 (1999).
- 19. J. Ziebuhr, E. J. Snijder, A. E. Gorbalenya, J. Gen. Virol. 81, 853 (2000).
- 20. S. G. Siddell, Ed., The Coronaviridae (Plenum, New York, 1995).
- 21. D. Escors, J. Ortego, H. Laude, L. Enjuanes, J. Virol. 75, 1312 (2001).
- 22. H. Garoff, R. Hewson, D.-J. E. Opstelten, Microbiol. Mol. Biol. Rev. 62, 1171 (1998).
- 23. C. M. Sanchez et al., J. Virol. 73, 7607 (1999).
- 24. I. Leparc-Goffart et al., J. Virol. 72, 9628 (1998).
- 25. Cleavage sites in the S proteins of coronaviruses are RRFRR, RRSRR, RRSRR, RSRR, RARS, and RARR (26) in infectious bronchitis virus, bovine coronavirus, human coronavirus OC43, porcine hemagglutinating encephalomyelitis virus, mouse hepatitis virus, and rat coronavirus, respectively.
- 26. Single-letter abbreviations for the amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.
- 27. C. A. M. de Haan et al., Virus Res. 82, 77 (2002).
- 28. C. A. M. de Haan, L. Kuo, P. S. Masters, H. Vennema, P. J. M. Rottier, J. Virol. 72, 6838 (1998).
- 29. As of this writing, complete genomic sequences of three additional SARS-CoV isolates were available at GenBank (Tor-2 strain, Canada, accession no. ay274119; CUHK-W1 isolate, Hong Kong, accession no. ay278554; and HKU-39849 isolate, Hong Kong, accession no. ay278491). A comparison of these sequences to the sequence described in this paper is shown in table S3.
- 30. M. A. Marra et al., Science 300, 1399 (2003); published online 1 May 2003 (10.1126/science.1085953).
- 31. The authors thank the WHO SARS Aetiology Laboratory Investigation Group (Bernhard-Nocht Institute, Hamburg, Germany; Erasmus Universiteit, National Influenza Centre, Rotterdam, Netherlands; Federal Microbiology Laboratories for Health Canada, Winnipeg, Canada; Institut für Virologie, Marburg, Germany; Frankfurt A. M. University Hospital, Klinikum der Johann Wolfgang Goethe-Universität, Frankfurt, Germany; Chinese Center for Disease Control, Beijing, China; Public Health Laboratory Service Central Public Health Laboratory, London; Prince of Wales Hospital, Hong Kong; National Institute of Infectious Disease, Tokyo, Japan; The Chinese University of Hong Kong, Hong Kong; Government Virus Unit, Hong Kong; Queen Mary Hospital, Hong Kong; and Institut Pasteur, Paris, France) for the open collaboration and sharing of information; Centers for Disease Control (CDC) Laboratory Partners Group for support and suggestions; the Coronavirology Partners Group (S. C. Baker, R. Baric, D. A. Brian, D. Cavanagh, M. R. Denison, M. S. Diamond, B. G. Hogue, K. V. Holmes, J. Leibowitz, S. Perlman, L. J. Saif, L. Sturman, and S. R. Weiss) for many helpful reagents, guidance, and discussion; B. W. J. Mahy for advice and discussions and for organizing the Laboratory Partners Conferences; S. Emery for technical support; J. Osborne and S. Sammons for help with the figures; and C. Chesley for editorial assistance. M-h.C. is supported by a CDC/Georgia State University interagency agreement.
Supporting Online Material
www.sciencemag.org/cgi/content/full/1085952/DC1
Materials and Methods
Figs. S1 to S4
Tables S1 to S3
References
Received for publication 18 April 2003. Accepted for publication 30 April 2003.