|| N/A, not applicable.
The coding potential of the 29,751-base genome is depicted in Fig. 2. Recognizable ORFs include the replicase 1a and 1b translation products, the S glycoprotein, the E protein, the M protein, and the N protein. We have, in addition, conducted a preliminary analysis of the nine novel ORFs in an attempt to ascribe to them a possible functional role. These analyses are summarized below.
The replicase 1a ORF (base pairs 265 to 13,398) and replicase 1b ORF (base pairs 13,398 to 21,485) occupy 21.2 kb of the SARS virus genome (Fig. 2). Conserved in both length and amino acid sequence to other coronavirus replicase proteins, the genes encode a number of proteins that are produced by proteolytic cleavage of a large polyprotein (20). As seen in other coronaviruses and as anticipated, a frame shift interrupts the protein-coding region and separates the 1a and 1b reading frames.
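As a quick arithmetic check on the figures above, the following sketch computes the combined footprint of the two replicase ORFs. It assumes the reported coordinates are 1-based and inclusive, a convention the text does not state explicitly; the function and variable names are illustrative only.

```python
def span_length(start: int, end: int) -> int:
    """Number of bases in a 1-based, inclusive range (assumed convention)."""
    return end - start + 1

GENOME_LENGTH = 29_751        # bases, as reported in the text
REP1A = (265, 13_398)         # replicase 1a ORF
REP1B = (13_398, 21_485)      # replicase 1b ORF

# Footprint from the first base of 1a to the last base of 1b.
replicase_bases = span_length(REP1A[0], REP1B[1])
replicase_kb = replicase_bases / 1000
genome_fraction = replicase_bases / GENOME_LENGTH

print(f"{replicase_kb:.1f} kb ({genome_fraction:.0%} of the genome)")
```

Under that assumption the span rounds to the 21.2 kb quoted above.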
The Spike (S) glycoprotein gene (Fig. 2; base pairs 21,492 to 25,259) encodes a surface projection glycoprotein precursor predicted to be 1255 amino acids in length. Mutations in the gene encoding the Spike protein have previously been correlated with altered pathogenesis and virulence in other coronaviruses (5). In some coronaviruses, the mature Spike protein is inserted in the viral envelope, with most of the protein exposed on the surface of the viral particles. It is believed that three molecules of the Spike protein form the characteristic peplomers or corona-like structures of this virus family. Our analysis of the Spike glycoprotein with SignalP (21) reveals a high probability of a signal peptide (probability 0.996) with cleavage between residues 13 and 14. TMHMM (22) reveals a strong transmembrane domain near the C-terminal end. Together these data predict a type I membrane protein with the N terminus and the majority of the protein (residues 14 to 1195) on the outside of the cell surface or virus particle, in agreement with other coronavirus Spike protein data. Supporting this conclusion, it has recently been shown that for HCoV-229E virions, residues 417 to 546 are required for binding to the cellular receptor, aminopeptidase N (23). However, it is known that various coronaviruses use different receptors, and hence it is likely that different receptor binding sites are also used.
ORF 3 (Fig. 2; base pairs 25,268 to 26,092) encodes a predicted protein of 274 amino acids that lacks significant BLAST (24), FASTA (25), or PFAM (26) similarities to any known protein. Analysis of the N-terminal 70 amino acids with SignalP provides weak evidence for the existence of a signal peptide and a cleavage site (probability 0.540). Both TMpred (27) and TMHMM predict the existence of three transmembrane regions spanning approximately residues 34 to 56, 77 to 99, and 103 to 125. The most likely model from these analyses is that the C terminus and a large 149-amino acid N-terminal domain would be located inside the viral or cellular membrane. The C-terminal (interior) region of the protein may encode a protein domain with ATP-binding properties (ProDom ID PD037277).
ORF 4 (Fig. 2; base pairs 25,689 to 26,153) encodes a predicted protein of 154 amino acids. This ORF overlaps entirely with ORF 3 and the E protein. Our analysis failed to locate a potential TRS sequence at the 5' end of this putative ORF. However, it is possible that this protein is expressed from the ORF 3 mRNA using an internal ribosomal entry site. BLAST analyses fail to identify matching sequences. Analysis with TMpred weakly predicts a single transmembrane helix.
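The overlap between ORF 4 and its neighbors can be quantified directly from the reported coordinates. The sketch below is illustrative only: it assumes 1-based inclusive coordinates, and the helper name `overlap` is hypothetical rather than part of any annotation tool used in this study.

```python
def overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Bases shared by two 1-based inclusive ranges (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

ORF3 = (25_268, 26_092)
ORF4 = (25_689, 26_153)
E_GENE = (26_117, 26_347)

shared_with_orf3 = overlap(ORF4, ORF3)   # bases of ORF 4 also in ORF 3
shared_with_e = overlap(ORF4, E_GENE)    # bases of ORF 4 also in the E gene

print(f"ORF 4 / ORF 3 overlap: {shared_with_orf3} bases")
print(f"ORF 4 / E overlap: {shared_with_e} bases")
```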
The gene encoding the small envelope (E) protein (Fig. 2; base pairs 26,117 to 26,347) yields a predicted protein of 76 amino acids. BLAST and FASTA comparisons indicate that the predicted protein exhibits significant matches to multiple envelope (alternatively known as small membrane) proteins from several coronaviruses. PFAM analysis of the protein reveals that the predicted protein is a member of the well-characterized NS3_EnvE protein family (26). InterProScan (28, 29) analysis reveals that the protein is a component of the viral envelope, and conserved sequences are also found in other viruses, including gastroenteritis virus and murine hepatitis virus. SignalP analysis predicts the presence of a transmembrane anchor (probability 0.939). TMpred analysis of the predicted protein reveals a similar transmembrane domain at positions 17 to 34, consistent with the known association of this protein with the viral envelope. TMHMM predicts a type II membrane protein with most of the hydrophilic domain (46 residues) and the C terminus located on the surface of the viral particle. In some coronaviruses such as porcine transmissible gastroenteritis virus (TGEV), the E protein is essential for virus replication (30). In contrast, in mouse hepatitis virus (MHV), although deletion of the gene encoding the E protein reduces virus replication by more than four orders of magnitude, the virus still can replicate (31).
The gene encoding the membrane (M) glycoprotein (Fig. 2; base pairs 26,398 to 27,063) yields a predicted protein of 221 amino acids. BLAST and FASTA analyses of the protein reveal significant matches to a large number of coronaviral matrix glycoproteins. The association of the Spike glycoprotein (S) with the matrix glycoprotein (M) is an essential step in the formation of the viral envelope and in the accumulation of both proteins at the site of virus assembly (5). Analysis of the amino acid sequence with SignalP predicts a signal sequence (probability 0.932) that is not likely cleaved. TMHMM and TMpred analyses indicate the presence of three transmembrane helices, located at approximately residues 15 to 37, 50 to 72, and 77 to 99, with the 121-amino acid hydrophilic domain on the inside of the virus particle, where it is believed to interact with the nucleocapsid. PFAM analysis reveals a match to PFAM domain PF01635 and alignments to 85 other sequences in the PFAM database bearing this domain, which is indicative of the coronavirus matrix glycoprotein.
ORF 7 (Fig. 2; base pairs 27,074 to 27,265) encodes a predicted protein of 63 amino acids. BLAST and FASTA searches yield no significant matches indicative of function. TMHMM and SignalP predict no transmembrane region; however, TMpred analysis predicts a likely transmembrane helix located between residues 3 and 22, with the N terminus located outside the viral particle. Similarly, ORF 8 (Fig. 2; base pairs 27,273 to 27,641), encoding a predicted protein of 122 amino acids, has no significant BLAST or FASTA matches to known proteins. Analysis of this sequence with SignalP indicates a cleaved signal sequence (probability 0.995) with the predicted cleavage site located between residues 15 and 16. TMpred and TMHMM analyses also predict a transmembrane helix located approximately at residues 99 to 117. Together these data indicate that ORF 8 is likely to be a type I membrane protein, with the major hydrophilic domain of the protein (residues 16 to 98) and the N terminus oriented inside the lumen of the ER/Golgi or on the surface of the cell membrane or virus particle, depending on the membrane localization of the protein.
ORF 9 (Fig. 2; base pairs 27,638 to 27,772) encodes a predicted protein of 44 amino acids. FASTA analysis of this sequence reveals some weak similarities (37% identity over a 35-amino acid overlap) to Swiss-Prot accession Q9M883, annotated as a putative sterol-C5 desaturase. A similarly weak match to a hypothetical Clostridium perfringens protein (Swiss-Prot accession CPE2366) is also detected. The functional implications, if any, of these matches are unknown. TMpred predicts the existence of a single strong transmembrane helix, with little preference for alternate models in which the N terminus is located inside or outside the particle. Similarly, ORF 10 (Fig. 2; base pairs 27,779 to 27,898), encoding a predicted protein of 39 amino acids, exhibits no significant matches in BLAST and FASTA searches but is predicted to encode a transmembrane helix by TMpred, with the N terminus located within the viral particle. The region immediately upstream of ORF 10 exhibits a strong match to the TRS consensus (Table 2), providing support for the notion that a transcript initiates from this site. ORF 11 (Fig. 2; base pairs 27,864 to 28,118), encoding a predicted protein of 84 amino acids, exhibits only very short (9 or 10 residues) matches to a region of the human coronavirus S glycoprotein precursor (starting at residue 801). Analyses by SignalP and TMHMM predict a soluble protein. As was the case for ORF 10, a detectable alignment to the TRS consensus sequence was found (Table 2).
The gene encoding the nucleocapsid protein (Fig. 2; base pairs 28,120 to 29,388) yields a predicted protein of 422 amino acids. This protein aligns well with nucleocapsid proteins from other representative coronaviruses, although a short lysine-rich region (KTFPPTEPKKDKKKKTDEAQ) (32) appears to be unique to SARS. This region is suggestive of a nuclear localization signal, and although it contains a hit to InterProDomain IPR001472 (bipartite nuclear localization signal), the function of this insertion remains unknown. It is possible that the SARS virus nucleocapsid protein has a novel nuclear function, which could play a role in pathogenesis. In addition, the basic nature of this peptide suggests that it may assist in RNA binding.
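The basic character of the lysine-rich peptide can be illustrated with a simple residue count. This is a rough sketch, not part of the analysis reported here: it assumes a crude charge model in which Lys and Arg each contribute +1 and Asp and Glu each contribute -1, ignoring histidine, the termini, and any pKa effects.

```python
from collections import Counter

PEPTIDE = "KTFPPTEPKKDKKKKTDEAQ"  # lysine-rich region quoted in the text

counts = Counter(PEPTIDE)

# Crude charge model (assumption): K/R = +1, D/E = -1, everything else neutral.
n_basic = counts["K"] + counts["R"]
n_acidic = counts["D"] + counts["E"]
net_charge = n_basic - n_acidic
lysine_fraction = counts["K"] / len(PEPTIDE)

print(f"length={len(PEPTIDE)} basic={n_basic} acidic={n_acidic} "
      f"net={net_charge:+d} lysine={lysine_fraction:.0%}")
```

Even under this simplified model the peptide carries a net positive charge, consistent with the suggestion that it may assist in RNA binding.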
ORF 13 (Fig. 2; base pairs 28,130 to 28,426) encodes a predicted protein of 98 amino acids. BLAST analysis fails to identify similar sequences, and no transmembrane helices are predicted. ORF 14 (Fig. 2; base pairs 28,583 to 28,795) encodes a predicted protein of 70 amino acids. BLAST analysis fails to identify similar sequences. TMpred weakly predicts a single transmembrane helix.
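The predicted protein lengths quoted for each ORF can be cross-checked against the reported base-pair coordinates. The sketch below assumes that coordinates are 1-based and inclusive and that each range includes the terminating stop codon; neither convention is stated in the text, so both are assumptions. The replicase ORFs are omitted because the frame shift makes a simple length calculation inappropriate.

```python
# name: (start, end, reported protein length in amino acids), from the text
ORFS = {
    "S":      (21_492, 25_259, 1255),
    "ORF 3":  (25_268, 26_092, 274),
    "ORF 4":  (25_689, 26_153, 154),
    "E":      (26_117, 26_347, 76),
    "M":      (26_398, 27_063, 221),
    "ORF 7":  (27_074, 27_265, 63),
    "ORF 8":  (27_273, 27_641, 122),
    "ORF 9":  (27_638, 27_772, 44),
    "ORF 10": (27_779, 27_898, 39),
    "ORF 11": (27_864, 28_118, 84),
    "N":      (28_120, 29_388, 422),
    "ORF 13": (28_130, 28_426, 98),
    "ORF 14": (28_583, 28_795, 70),
}

def predicted_protein_length(start: int, end: int) -> int:
    """Codons in the range minus one for the stop codon (assumed included)."""
    n_bases = end - start + 1
    if n_bases % 3 != 0:
        raise ValueError(f"range {start}-{end} is not a whole number of codons")
    return n_bases // 3 - 1

# Collect any ORF whose computed length disagrees with the reported one.
mismatches = {
    name: (predicted_protein_length(start, end), reported)
    for name, (start, end, reported) in ORFS.items()
    if predicted_protein_length(start, end) != reported
}

print("all lengths consistent" if not mismatches else mismatches)
```

With these assumptions, every listed coordinate range is a whole number of codons and reproduces the amino acid count given in the text.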
Conclusions. We used genome sequencing to determine that the virus named by the WHO as causally associated with SARS is a novel coronavirus. This has been confirmed by the sequence of two independent isolates: the Tor2 isolate, reported here, and the Urbani isolate, reported by the CDC (16). Although morphologically a coronavirus (3), this SARS virus is not more closely related to any of the three known classes of coronavirus, and we propose that it defines a fourth class of coronavirus (group 4) and that it be referred to as SARS-CoV. Our sequence data do not support a recent interviral recombination event between the known coronavirus groups as the origin of this virus, but this may be due to the limited number of known coronavirus genome sequences. Apart from the s2m motif located in the 3'UTR, there is also no evidence of any exchange of genetic material between the SARS virus and non-Coronaviridae. These data are consistent with the hypothesis that an animal virus for which the normal host is currently unknown recently mutated and developed the ability to productively infect humans. There also remains the possibility that the SARS virus evolved from a previously harmless human coronavirus. However, preliminary evidence suggests that antibodies to this virus are absent in people not infected with SARS-CoV (3), which implies that a benign virus closely related to the Tor2 isolate is not resident in humans. Identification of the normal host of this coronavirus and comparison of the sequences of the ancestral and SARS forms will further elucidate the process by which this virus arose.
The availability of the SARS virus genome sequence is important from a public health perspective. It will allow the rapid development of PCR-based assays for this virus that capitalize on novel sequence features, enabling discrimination between this and other circulating coronaviruses. Such assays will allow the diagnosis of SARS virus infection in humans and, critically, will consolidate the association of this virus with SARS. If the association is further borne out, SARS virus genome-based PCR assays may form an important part of a public health strategy to control the spread of this syndrome. In the longer term, this information will assist in the development of antiviral treatments, including neutralizing antibodies and development of a vaccine to treat this emerging and deadly disease.
References and Notes
- 1. C. A. Donnelly et al., Lancet; published online 7 May 2003 (http://image.thelancet.com/extras/03art4453web.pdf).
- 2. J. S. M. Peiris et al., Lancet; published online 8 April 2003 (http://image.thelancet.com/extras/03art3477web.pdf).
- 3. T. G. Ksiazek et al., N. Engl. J. Med.; published online 10 April 2003 (10.1056/NEJMoa030781).
- 4. R. Munch, Microbes Infect. 5, 69 (2003).
- 5. B. N. Fields, D. M. Knipe, P. M. Howley, D. E. Griffin, Fields Virology (Lippincott Williams & Wilkins, Philadelphia, ed. 4, 2001).
- 6. M. M. C. Lai, D. Cavanagh, Adv. Virus Res. 48, 1 (1997).
- 7. S. G. Sawicki, D. L. Sawicki, Adv. Exp. Med. Biol. 440, 215 (1998).
- 8. D. L. Sawicki et al., J. Gen. Virol. 82, 386 (2001).
- 9. S. G. Sawicki, D. L. Sawicki, J. Virol. 64, 1050 (1990).
- 10. M. Schaad, R. S. J. Baric, J. Virol. 68, 8169 (1994).
- 11. P. B. Sethna et al., Proc. Natl. Acad. Sci. U.S.A. 86, 5626 (1989).
- 12. S. H. Myint, in The Coronaviridae, S. G. Siddell, Ed. (Plenum, New York, 1995), pp. 389–401.
- 13. L. Enjuanes et al., in Virus Taxonomy. Classification and Nomenclature of Viruses, M. H. V. van Regenmortel et al., Eds. (Academic Press, New York, 2000), pp. 835–849.
- 14. Information on materials and methods is available on Science Online.
- 15. S. M. Poutanen et al., N. Engl. J. Med.; published online 31 March 2003 (10.1056/NEJMoa030634).
- 16. P. A. Rota et al., Science 300, 1394 (2003); published online 1 May 2003 (10.1126/science.1085952).
- 17. C. M. Jonassen, T. O. Jonassen, B. Grinde, J. Gen. Virol. 79, 715 (1998).
- 18. W. Lapps, B. G. Hogue, D. A. Brian, Virology 157, 47 (1987).
- 19. R. Krishnan, R. Y. Chang, D. A. Brian, Virology 218, 400 (1996).
- 20. J. Ziebuhr, E. J. Snijder, A. E. Gorbalenya, J. Gen. Virol. 81, 853 (2000).
- 21. H. Nielsen, J. Engelbrecht, S. Brunak, G. von Heijne, Protein Eng. 10, 1 (1997).
- 22. E. L. Sonnhammer, G. von Heijne, A. Krogh, Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175 (1998).
- 23. J. C. Tsai, B. D. Zelus, K. V. Holmes, S. R. Weiss, J. Virol. 77, 841 (2003).
- 24. S. F. Altschul et al., Nucleic Acids Res. 25, 3389 (1997).
- 25. W. R. Pearson, D. J. Lipman, Proc. Natl. Acad. Sci. U.S.A. 85, 2444 (1988).
- 26. A. Bateman et al., Nucleic Acids Res. 30, 276 (2002).
- 27. K. Hofmann, W. Stoffel, Biol. Chem. Hoppe-Seyler 374, 166 (1993).
- 28. R. Apweiler et al., Nucleic Acids Res. 29, 37 (2001).
- 29. E. M. Zdobnov, R. Apweiler, Bioinformatics 17, 847 (2001).
- 30. J. Ortego et al., J. Virol. 76, 11518 (2002).
- 31. L. Kuo et al., paper presented at the annual meeting of the American Society for Virology, Lexington, KY, 20 to 24 July 2002.
- 32. Abbreviations for amino acids: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.
- 33. J. D. Thompson, D. G. Higgins, T. J. Gibson, Nucleic Acids Res. 22, 4673 (1994).
- 34. J. Felsenstein, PHYLIP (Phylogeny Inference Package) version 3.5c (1993). Distributed by the author, Department of Genetics, University of Washington, Seattle.
- 35. We thank all the staff at the BCCA Genome Sciences Centre for helping to facilitate the rapid sequencing of the SARS-CoV genome; R. Tellier (Hospital for Sick Children) for information on primer sequences that amplify a 216-base pair region of the Pol gene; I. Sadowski (Department of Biochemistry and Molecular Biology) and J. Hobbs and his staff (Nucleic Acid and Protein Services Unit) of the University of British Columbia for rapid synthesis of PCR primers; F. Ouellette (University of British Columbia Bioinformatics Centre) for advice and assistance; the staff at the National Center for Biotechnology Information for rapidly processing and making available our sequence data; and anonymous reviewers for their useful suggestions. The BCCA Genome Sciences Centre is supported by the British Columbia Cancer Foundation, Genome Canada/Genome British Columbia, Western Economic Diversification, Canada Foundation for Innovation, British Columbia Knowledge Development Fund, Canadian Institutes of Health Research, Michael Smith Foundation for Health Research, and Natural Sciences and Engineering Research Council of Canada. Clones derived from the SARS virus are available from the Genome Sciences Centre (www.bcgsc.bc.ca).
Supporting Online Material
www.sciencemag.org/cgi/content/full/1085953/DC1
Materials and Methods
References
Received for publication 19 April 2003. Accepted for publication 30 April 2003.