Supplementary Material
2. Database Integration
3. The AMCtagmap Database
4. Statistical Analysis
5. Expression Profiles and RIDGEs in Sequence-Based Maps of Chromosomes 6 and 21
6. References

1. Relationship RIDGEs with Gene Density
Supplemental Figure 1. Correlation between RIDGEs and gene density maps of all chromosomes. Expression levels are shown as a moving median with a window size of 39 genes in blue. There are 74 regions with one or more consecutive moving medians that have a lower limit of four times the genomic median; 27 of them have a length of at least 10 consecutive moving medians (indicated by green bars). For each chromosome, we calculated the average distance between adjacent UniGene clusters in a window of 39 adjacent UniGene clusters. We plotted the inverse of this value (cR⁻¹/gene) in black, using the same scaling as for the expression levels.
2. Database Integration
For the construction of the Human Transcriptome Map, we integrated four databases [GB4 map of GeneMap´99, rhdb_xrefs_human (Radiation Hybrid Database), AMCtagmap, and SAGE libraries]. The rhdb_xrefs_human database (http://corba.ebi.ac.uk/RHdb) is an intermediate database that links several biological databases through their unique identifiers (e.g., RH-code, accession code, UniGene identifier). The AMCtagmap database contains all reliable tags assigned to UniGene clusters (see following section). A data model was designed (see Web fig. 2) that defined two pathways to map SAGE tags to genomic positions. This model was implemented as a relational database. A subset of data extracted from the above-mentioned databases was loaded into the relational database.
The relational database enables two routes from a genomic position defined by an STS marker to one or more tags and their expression levels in a SAGE library:
1. Starting with an STS marker (assigned to a unique RH-code) in GeneMap´99, we retrieve the accession code of the corresponding clone in the rhdb_xrefs_human database. Subsequently, the UniGene cluster containing this clone is identified in the AMCtagmap database. Most UniGene clusters contain multiple SAGE transcript tags. Finally, these tags are linked to the expression levels in the selected SAGE libraries.
2. It is possible that the accession code of the STS marker in GeneMap´99 is not included in the rhdb_xrefs_human database or in the AMCtagmap database. In these situations, we first retrieve the UniGene number from the rhdb_xrefs_human database and use this to obtain the corresponding tags in the AMCtagmap database. These tags are linked to the expression levels in the selected SAGE libraries.
Supplemental Figure 2. Data model used to develop the Human Transcriptome Map.
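The two routes described above can be illustrated with a minimal Python sketch. It is not the relational schema of the Human Transcriptome Map: the dictionaries stand in for database tables, and every table name, marker name, RH-code, accession code, and tag count below is invented for illustration.

```python
# Toy in-memory stand-ins for the integrated databases.
# All names and values are hypothetical examples.
genemap99 = {"markerA": "RH1001", "markerB": "RH1002"}   # STS marker -> RH-code
rhdb_accession = {"RH1001": "AA000001"}                  # RH-code -> clone accession code
rhdb_unigene = {"RH1001": "Hs.1", "RH1002": "Hs.2"}      # RH-code -> UniGene identifier
tagmap_by_accession = {"AA000001": "Hs.1"}               # accession code -> UniGene cluster
tagmap_tags = {                                          # UniGene cluster -> reliable tags
    "Hs.1": ["AAAAAAAAAC", "CCCCCCCCCG"],
    "Hs.2": ["GGGGGGGGGT"],
}
sage_counts = {"AAAAAAAAAC": 12, "CCCCCCCCCG": 3, "GGGGGGGGGT": 7}  # tag -> count in one library


def tags_for_marker(marker):
    """Map an STS marker to its tags and their counts in one SAGE library."""
    rh_code = genemap99[marker]

    # Route 1: marker -> clone accession code -> UniGene cluster.
    accession = rhdb_accession.get(rh_code)
    cluster = tagmap_by_accession.get(accession) if accession else None

    # Route 2: fall back to the UniGene number stored in rhdb_xrefs_human.
    if cluster is None:
        cluster = rhdb_unigene.get(rh_code)
    if cluster is None:
        return {}

    return {tag: sage_counts.get(tag, 0) for tag in tagmap_tags.get(cluster, [])}


print(tags_for_marker("markerA"))  # resolved through route 1
print(tags_for_marker("markerB"))  # resolved through route 2
```

In a relational database both routes would be joins on the shared identifiers; the fallback order shown here simply mirrors the order in which the two routes are described.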
3. The AMCtagmap Database
The AMCtagmap database, which contains reliable tags assigned to UniGene clusters, was constructed with the use of the following databases:
1. NCBI SAGEmap (accession codes and clone orientation label for each UniGene cluster)
2. EMBL database (EST and gene clone sequences in UniGene clusters)
3. GenBank Genomes H-Sapiens database (PAC sequences and annotated STS/EST markers)
4. rhdb_xrefs_human, to link information from the previous databases.
The actual assignment of tags to UniGene clusters was performed by the following steps (see Web fig. 3):
- selection of 3´-end clones
- electronic tag extractions from reliable 3´-end clones
- corrections for 10-bp sequence errors in ESTs
- corrections for CATG sequence errors in ESTs
- identification of UniGene cluster errors
- sense and antisense oriented tags
- censored tags (linker sequences and tags from >3 UniGene clusters)
These steps are described in detail in the following sections.
3-I. Selection of 3´-end cDNA clones
For the reliable extraction of SAGE sequence tags from UniGene clusters, only cDNA clones containing the full-length 3´ end of the transcript can be used. Also, the correct orientation of those sequences needs to be determined.
1. Orientation of cDNA sequences in the EMBL database
Sequencing of cDNA clones can either result in the nucleotide sequence of the "sense" or the "opposite" strand, due to the directionality of the DNA polymerase used in the sequencing reaction. This implies that the most likely orientation of sequences in databases of sequenced cDNA clones is either "sense" or "complementary-reverse." In the case of 3´-end sequences, this will respectively show the poly(A) tail as an A-stretch at the end or as a T-stretch at the beginning of the sequence. The two other possible sequence orientations occur in a database as a result of human errors in submitting or processing the sequence.
We analyzed the frequency of the four possible sequence orientations, using 718,271 clones in the CGAP database and selecting the 12,381 clones containing a stretch of >30 A´s or T´s at either end of the sequence. Of these clones, 11,476 (99.9%) end with >30 A´s (sense) or start with >30 T´s (complementary-reverse). Only 0.1% of the poly(A)-containing clones belong to one of the remaining sequence orientations. Therefore, we only considered the sense and complementary-reverse sequence orientations in our electronic tag extraction procedures.
2. Identification of 3´-end cDNA clones
The 3´ end of a processed gene transcript is characterized by a poly(A) tail and a poly(A) signal. Besides the two "classical" poly(A) signals (AATAAA and ATTAAA), other poly(A) signals have been reported (1, 2). We analyzed the clones included in the NCBI SAGEmap database for the occurrence of "alternative" polyadenylation signals in 3´-end cDNA clones. We selected the clones containing either >30 A´s at the end or >30 T´s at the beginning of their sequence. Polyadenylation signals are thought to occur within 50 to 100 bp from the poly(A) addition site (3). We therefore analyzed the 150 nucleotides adjacent to the poly(A) or poly(T) stretch for the presence of the two classical polyadenylation signals, nine possible alternative poly(A) signals (AATTAA, AATAAC, AATAAT, AATACA, ACTAAA, AGTAAA, CATAAA, GATAAA, TATAAA) and six random hexamer sequences. The two classical poly(A) signals were found in 55.8 and 17.7% of those clones, respectively, and showed a clear preference for occurring within the first 50 nucleotides from the poly(A) tail. Four possible alternative poly(A) signals (AATTAA, AATAAT, CATAAA, AGTAAA) show patterns comparable to the two classical signals and occur with a frequency ranging from 5.7 to 8.4%. The other five possible poly(A) signals and the six random hexamers showed no appreciable preference for occurring within the 3´ end of transcripts.
We therefore configured our sequence orientation algorithms so that they searched for the six poly(A) signals within 50 bp from the poly(A) site. The same patterns for the six poly(A) signals were found in cDNA clones ending with at least 10 A´s or starting with at least 10 T´s. This indicates that the occurrence of stretches of 10 or more A´s or T´s at the end and the beginning of a cDNA sequence, respectively, is likely to represent a poly(A) tail. Additional information to identify 3´-end clones can be obtained from the depositors of the cDNA sequences. The NCBI SAGEmap database assigns a label to cDNA clones from GenBank, based on information on the cloning and sequencing procedures. We combined the information on clone labels with the presence of one of the six poly(A) signals at either end of the clone sequence (within 50 bp) and/or a poly(A) tail (>10 A´s at the end or >10 T´s at the beginning) as depicted in Web table 1 to select for reliable 3´-end clones.
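The orientation test described in section 3-I can be sketched in Python as follows. This is an illustrative reading, not the original algorithm: the function names are hypothetical, the minimum tail length is taken as "at least 10" (the text quotes both "at least 10" and ">10"), and the combination with SAGEmap clone labels from Web table 1 is left out.

```python
# The two classical and four alternative poly(A) signals named in the text.
POLY_A_SIGNALS = ("AATAAA", "ATTAAA", "AATTAA", "AATAAT", "CATAAA", "AGTAAA")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def sense_evidence(seq, min_tail=10, signal_window=50):
    """Return (has_tail, has_signal) for a sequence read in the sense orientation."""
    tail = len(seq) - len(seq.rstrip("A"))           # length of the terminal A-stretch
    start = max(0, len(seq) - tail - signal_window)  # window upstream of the tail
    region = seq[start:]                             # tail is kept so a signal abutting it is still found
    has_signal = any(signal in region for signal in POLY_A_SIGNALS)
    return tail >= min_tail, has_signal


def classify_orientation(seq):
    """Classify a cDNA sequence as sense, complementary-reverse, conflict, or unknown."""
    seq = seq.upper()
    sense_hit = any(sense_evidence(seq))
    # A leading T-stretch becomes a trailing A-stretch after reverse complementing.
    reverse_hit = any(sense_evidence(reverse_complement(seq)))
    if sense_hit and reverse_hit:
        return "conflict"  # assumption: such clones are set aside rather than used
    if sense_hit:
        return "sense"
    if reverse_hit:
        return "complementary-reverse"
    return "unknown"


example = "GCGC" * 10 + "AATAAA" + "GCGCGCGCGCGC" + "A" * 15
print(classify_orientation(example))
print(classify_orientation(reverse_complement(example)))
```

Working on the reverse complement keeps a single code path for both orientations, so the 50-bp window and the signal list are defined only once.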
3-II. Electronic tag extractions from UniGene clusters
To minimize the risk of extracting erroneous tags from UniGene clusters, we only used "reliable 3´-end" clones (see Web table 1) for electronic tag extractions. When the two strands of a cDNA carried conflicting polyadenylation signals and/or poly(A)/poly(T) stretches, the clone was not used for tag extraction.
3-III. Ten-base pair tag sequencing errors
The tags assigned to UniGene clusters were checked for errors in the 10-bp sequence resulting from sequencing errors in ESTs. We designed algorithms that detected any pair of tags differing by at most two base substitutions, insertions, or deletions. In the first step, all EST clones in a UniGene cluster were collected, together with their extracted tags. Subsequently, all tags were compared pair-wise and checked for substitutions, insertions, or deletions. If two tags were identical, except for one or two mismatches, a sequencing error was assumed. The tag corresponding to the largest number of clones was considered to be the correct tag. The tag with the putative sequencing error was removed when it was found in fewer than five ESTs/cDNAs and if the correct tag occurred at least five times more frequently. This ensured that variant tags resulting from frequent single nucleotide polymorphisms (SNPs) were not discarded in the AMCtagmap database.
Supplemental Figure 3. Error sources in 14 ESTs from one UniGene cluster and selection of reliable tags. Red tags are extracted from the EST clones. These tags are run through sequence comparison algorithms that reject tags with an error in the 10-bp tag sequence and tags based on an error in the anchoring CATG sequence. See text for details.
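The pair-wise comparison of section 3-III can be sketched in Python as below. The sketch assumes that "one or two mismatches" can be measured with a plain Levenshtein distance between the two 10-bp tags; the original comparison algorithm may count insertions and deletions differently. The function names and example tags are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def filter_tags(est_tags, max_edits=2, max_variant_count=5, min_fold=5):
    """Drop rare tags that look like sequencing-error variants of a frequent tag.

    est_tags holds one extracted tag per EST clone of a single UniGene cluster.
    A variant is dropped when it occurs in fewer than max_variant_count clones
    and a tag within max_edits of it is at least min_fold times more frequent.
    """
    counts = Counter(est_tags)
    rejected = set()
    for tag, n in counts.items():
        for other, m in counts.items():
            if other == tag:
                continue
            if (n < max_variant_count and m >= min_fold * n
                    and edit_distance(tag, other) <= max_edits):
                rejected.add(tag)
                break
    return {tag: n for tag, n in counts.items() if tag not in rejected}


cluster = ["ACGTACGTAC"] * 12 + ["ACGTACGTAA"] * 2 + ["TTTTGGGGCC"]
print(filter_tags(cluster))
```

A variant seen in five or more clones is kept regardless of its distance to the major tag, which corresponds to the SNP safeguard mentioned in the text.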
3-IV. CATG-sequencing errors
Sequence errors in the most 3´ CATG sequence of an EST will result in skipping of the corresponding tag by the extraction algorithm and erroneous use of the next CATG for tag extraction. Also, an EST sequence error may create a new CATG distal to the true most 3´ CATG. This also results in extraction of a false tag for an EST. An algorithm to remove these tags should preserve tags from alternatively spliced transcripts of the same gene (see Web fig. 3). Each gene can have a series of tags belonging to alternatively spliced or alternatively polyadenylated transcripts. Furthermore, SNPs in ESTs can cause extraction of alternative tags that are correct and should be preserved. Our algorithms were directed to the identification and removal of all tags that are caused by CATG sequence errors. The remaining tags were accepted as reliable tags. We used the following principles for the algorithms:
1. A CATG is destroyed by an EST sequence error. If a tag and its preceding CATG are found in a series of EST clones, while the same tag sequence without a preceding CATG is found in one or a few other clones, the latter clones are likely to contain a sequencing error that destroyed the most 3´ CATG. Such a clone therefore has caused the extraction of a wrong tag, belonging to the next CATG. The clone is marked as unreliable and is not used for tag extraction.
2. A new CATG is created by an EST sequence error. If a particular tag and its preceding CATG are found in one or a few clones and if that tag is found without a preceding CATG in many other clones, the CATG sequences in the first clone(s) are likely to result from sequencing errors. The clone is marked as unreliable and is not used for tag extraction.
3. Introduction or removal of CATG by alternative splicing or polyadenylation. Following this principle, we preserve tags in which an alternative splicing or polyadenylation is responsible for extraction of a new tag. If alternative splicing changes the 3´ sequence, a situation that is similar to destroying the most 3´ CATG exists. The next CATG will be used for the extraction of a correct alternative tag. The more prevalent correct tag is not present in this clone (with or without a CATG), and therefore the clone is not rejected (the algorithm only removes clones where the correct tag is found with a non-CATG). If alternative splicing extends the sequence, a situation similar to the introduction of a new 3´ CATG can occur. In this case, the new tag is not found at all in the other, shorter clones and is therefore preserved.
4. SNPs. When a specific tag was extracted several times from a UniGene cluster but was rejected by the above-described principles, we considered it unlikely that it was caused by the same sequence error each time. Such a tag was considered to result from a SNP when it was present in 20% or more of the EST clones, and it was accepted.
The CATG-error detection algorithms were therefore defined as follows:
1. The algorithm lists all tags extracted from all ESTs in a UniGene cluster and subsequently scans all individual ESTs for the presence of each of these 10-bp tag sequences. The number of times that a specific tag sequence is preceded by a CATG or non-CATG in an EST is counted, and the ratio between CATG and non-CATG is established. The algorithm takes into account that tag sequences may contain sequencing errors (any combination of at most two substitutions, insertions, or deletions).
2. If the total number of clones (with and without a preceding CATG) in which a particular tag sequence was observed was five or more and if the ratio between CATG and non-CATG was less than or equal to 0.2, the clones in which the CATG was observed were removed. If the total number of clones was four, the clone with a CATG was removed if the ratio was 0.25. If the total number of clones was less than four, the number of clones was considered as too small to make a decision and no clones were removed.
3. If the total number of clones (with and without a preceding CATG) in which a particular tag sequence was observed was five or more and if the ratio between CATG and non-CATG was greater than or equal to five, clones in which a non-CATG was observed were removed. If the total number of clones was four, the non-CATG clone was removed if the ratio was 4. If the total number of clones was less than four, the number of clones was considered as too small to make a decision and no clones were removed.
3-V. UniGene clustering errors
Hybrid UniGene clusters caused a problem, as they include ESTs from different genes. These genes, which usually have different map positions, each yield their own reliable correct tags. We therefore searched the GenBank Genomes H-Sapiens (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/H_sapiens/) to identify the PAC sequenced in the Human Genome Project, as well as two adjacent PACs, for the markers mapped on GeneMap´99. Tags from the gene corresponding to the marker are expected to be present on these PACs, whereas tags from a "contaminating" gene in a hybrid cluster are not. The PACs were analyzed for the presence of the 10-bp tag sequence plus adjacent CATG. When positive, the tag was marked on the extended interval view with a P in a light green box. A one-nucleotide mismatch between tag and PAC sequence was accepted to cover SNPs or PAC sequencing errors (marked P in a dark green box). When a PAC for a marker was known but the tag was not found in the sequence, the tag was marked P in a red box. This check is not yet available for all markers, but the progress in sequencing and annotation will provide this function for all UniGene clusters.
3-VI. Sense and antisense orientation
UniGene clustering algorithms place overlapping genes encoded on opposite DNA strands in one UniGene cluster.
Our tag extraction routines extract the tags from both genes. We therefore designed algorithms to recognize oppositely oriented tags. In such clusters, we arbitrarily chose to consider the orientation of the most frequent tag as "sense." In the extended interval view, tags with the other orientation are marked as AS in a purple box. In the compressed interval view, the cumulative expression levels for "sense" and "antisense" tags are shown as separate bars for each UniGene cluster. Antisense expression levels are not included in the whole chromosome views.
3-VII. Censored tags
Two types of tags were considered unreliable for use in the Human Transcriptome Map. They are marked as L or >3 in a yellow box.
1. Linker tags. The SAGE technique produces, at low frequency, tags derived from the linker oligos used in library construction (4). These 73 tags are marked L in a yellow box on the extended interval view, but their expression levels in the SAGE libraries are not shown.
2. Redundant tags. Some tags are found for more than three UniGene clusters. This may be explained by coincidental limited sequence homologies between genes. Other redundant tags are derived from genes with a CATG close to the poly(A) tail. This generates tags with a strongly reduced sequence variability, as most of the tag consists of an A stretch. We found 619 tags belonging to more than three UniGene clusters. They are marked >3 in a yellow box in the extended interval view, and their expression levels in the SAGE libraries are not shown. Tags belonging to two or three UniGene clusters are marked in a yellow box with R2 and R3, respectively, and their expression levels in the libraries are shown.
3-VIII. Characteristics of unmapped UniGene clusters
In the AMCtagmap database, 69,754 tags have been electronically extracted for 54,925 UniGene clusters. Of these clusters (37,491 tags), 18,954 have been mapped on the GB4 radiation hybrid panel in GeneMap´99.
The expression profile characteristics of the mapped and unmapped UniGene clusters are different. In general, the unmapped clusters have lower expression levels in the 12 tissue-type libraries, whereas most highly expressed genes are found in the mapped UniGene clusters (see Web table 2).
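Returning to the CATG-error thresholds of section 3-IV, the decision rule can be sketched in Python. The sketch rests on one loudly stated assumption: the values quoted for four clones (0.25 and 4) cannot be reached by dividing the CATG count by the non-CATG count when the total is four, so the rule is read here as "the minority class makes up at most one fifth of the clones (five or more clones in total), or exactly one of four clones". The function name and return values are hypothetical, and the 20% SNP exception is not modelled.

```python
from fractions import Fraction


def clones_to_remove(n_catg, n_non_catg):
    """Decide which clones carrying a given tag sequence are treated as errors.

    n_catg     -- clones in which the tag sequence is preceded by CATG
    n_non_catg -- clones in which it is preceded by something else
    Returns "catg", "non_catg", or None when no clones are removed.
    """
    total = n_catg + n_non_catg
    if total < 4:
        return None  # too few clones to make a decision

    # Assumed reading of the thresholds: minority share of all clones.
    limit = Fraction(1, 5) if total >= 5 else Fraction(1, 4)

    if n_catg > 0 and Fraction(n_catg, total) <= limit:
        return "catg"      # the rare CATG is taken to be created by a sequence error
    if n_non_catg > 0 and Fraction(n_non_catg, total) <= limit:
        return "non_catg"  # the rare non-CATG is taken to be a destroyed CATG
    return None


for pair in [(1, 4), (1, 3), (3, 1), (2, 3), (1, 2), (10, 2)]:
    print(pair, clones_to_remove(*pair))
```

Exact fractions are used instead of floating-point division so that the boundary cases (one clone in five, one clone in four) compare exactly.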
4. Statistical Analysis We used three statistical approaches to analyze whether the observed RIDGEs could be explained by the random variation in the distribution of expression levels of the 18,422 UniGene clusters in the Human Transcriptome Map. 4-I. Computing the probability of RIDGE frequencies under a random permutation of data
Given the data and several cluster definitions, we computed the probability of finding the actual number of RIDGEs, or a greater number, under a random permutation of the data. A RIDGE is defined by window size, length of the run, and threshold for the lower limit of the median. With a lower limit of the median of four times the genomic median and a run length of at least 10 consecutive windows, we analyzed window sizes of 29, 39, and 59 genes. The genome yields 111, 27, and 19 RIDGEs for these window sizes. We computed the probability of finding this number of clusters under a random permutation of the gene order (see below). This probability is very low (P = 1 × 10⁻¹², P = 8 × 10⁻¹², and P = 2 × 10⁻¹⁶, respectively), from which we conclude that there is strong evidence against a random origin of the clusters so defined. We used the following calculation: suppose we have a random permutation of X1, ..., XN (normalized) counts, N = 18,422. Given the data and a fixed threshold T (=2 in our case), we compute P[Xi
4-II. Monte Carlo simulations
We performed Monte Carlo simulations to analyze whether this number of observed RIDGEs is more than can be expected from a random variation in the distribution of highly expressed genes over the genome. We permuted the genomic order of all 18,422 UniGene clusters in the Human Transcriptome Map and analyzed 10,000 permuted data sets for the incidence of RIDGEs. We analyzed the frequencies of RIDGEs at window sizes of 29, 39, and 59. The lower limit of the median was four times the genomic median. Length of the RIDGEs was either
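The permutation procedure of section 4-II can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original simulation code: a RIDGE is counted as a run of at least `min_run` consecutive moving medians at or above `factor` times the genomic median, the function names are hypothetical, and the toy data are invented.

```python
import random
from statistics import median


def count_ridges(levels, window=39, factor=4.0, min_run=10):
    """Count runs of >= min_run consecutive moving medians >= factor * genomic median."""
    threshold = factor * median(levels)
    ridges = 0
    run = 0
    for start in range(len(levels) - window + 1):
        if median(levels[start:start + window]) >= threshold:
            run += 1
        else:
            if run >= min_run:
                ridges += 1
            run = 0
    if run >= min_run:  # a run that reaches the end of the chromosome
        ridges += 1
    return ridges


def permutation_fraction(levels, n_perm=100, seed=0, **ridge_args):
    """Fraction of random gene orders with at least the observed number of RIDGEs."""
    rng = random.Random(seed)
    observed = count_ridges(levels, **ridge_args)
    shuffled = list(levels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_ridges(shuffled, **ridge_args) >= observed:
            hits += 1
    return observed, hits / n_perm


# Toy expression levels: one block of highly expressed genes in a flat background.
toy = [1] * 200 + [10] * 60 + [1] * 200
print(permutation_fraction(toy, n_perm=20))
```

Shuffling keeps the distribution of expression levels fixed and destroys only the gene order, which is the null hypothesis the simulations are meant to test.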
4-III. Bayesian statistical modeling Count data combined from many sources may exhibit random clusters, i.e., clusters that can be described by underlying distributions. Furthermore, the definition of a cluster will influence the probability of finding such clusters. We therefore fitted for chromosomes 3, 6, and 11 a "null model," assuming a random distribution of gene expression, and an extended model, where tag counts of adjacent genes are assumed to be similar. As a null model for randomness, we used the concept of exchangeability (6) of the genes on the chromosome, where the mean of the genes was assumed to be distributed with a log-normal distribution with a mean equal to the genome-wide mean, while the raw tag counts were assumed to be distributed following a Poisson distribution with a mean equal to each gene mean. Parameters were estimated using the Bayesian paradigm with the programs BUGS (7) and S-Plus. Several alternative models, with more structure in the mean, were fitted. These models were assessed using the concept of expected predictive deviance (EPD) (8). In all cases, the best model was an extension of the exchangeable model where the mean of a gene depends solely on the mean of the foregoing gene of the chromosome. This model is also known as a random walk model (9). Comparison of both models using EPD showed that the extended model fit much better. 5. Expression Profiles and RIDGEs in Sequence-Based Maps of Chromosomes 6 and 21 We used the "GenBank Genomes" database (http://www.ncbi.nlm.nih.gov/genome/guide) to retrieve annotated genomic sequences of sequenced genomic contigs for chromosomes 6 and 21. Chromosome 21 has almost completely been sequenced, and chromosome 6 has a large sequenced contig of 4.3 Mb containing a clear-cut RIDGE (MHC region). Of the entire chromosome 6 sequence, a total of 30.7% has been finished, including the MHC region. 
To establish expression profiles of the sequence-based maps, we needed to define the genomic position and order of the mapped genes and their corresponding UniGene clusters. We retrieved the annotated sequence for each genomic BAC/PAC clone from GenBank Genomes. We searched the annotation for STS content and extracted the corresponding STS marker name and/or RH-code. UniGene cluster numbers corresponding to these STSs were retrieved through three routes:
a. Direct search of UniGene database for STS marker name.
b. Search RHdb_xref for STS marker name, continue as in step c.
c. Search RHdb_xref for RH-code, retrieve EMBL EST accession code, search UniGene for EST accession code.
The UniGene clusters were thus mapped to the annotated genomic sequences. The contigs were aligned along the chromosome using the GenBank annotated order of contigs. For chromosome 6, we could map 562 UniGene clusters, of which 138 fall in the MHC contig. For chromosome 21, we could map 209 UniGene clusters. The expression profiles were established by the same algorithms we used to build our "whole chromosome view" in the Human Transcriptome Map. In short, we selected the tags from those UniGene clusters as established by the AMCtagmap routines. Tag counts of the individual tags from a UniGene cluster were combined to produce a "gene expression level." Tags with an antisense orientation and tags belonging to >3 UniGene clusters were excluded. We only included UniGene clusters for which we could identify a tag by our AMCtagmap routines. We used the 2.4 × 10⁶ tags from the tissue-type library "all tissues" to establish expression profiles for chromosomes 6 and 21. We plotted the expression levels (normalized per 100,000 tags) using the moving median procedure as in Figs. 1 and 3 of the report. On the vertical axis, each unit represents one UniGene cluster using the order of the genome-based maps. For chromosome 21, we did not find any RIDGEs (see Web fig. 4).
For chromosome 6, the clear-cut RIDGE in the MHC region is also present in the sequence-based expression map. Also, the pattern along the entire chromosome 6 shows striking similarity with the moving median plot based on RH mapping (Fig. 4 in the report and Web fig. 4). Supplemental Figure 4. Expression profiles of sequence-based maps of chromosomes 6 and 21. The MHC region forms a prominent RIDGE on the chromosome 6 map. For database integration and map construction, see section 5 of supplementary information.
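The "gene expression level" used for these profiles can be sketched in Python. The record layout, field names, and numbers below are hypothetical; only the exclusion rules (antisense tags, tags belonging to more than three UniGene clusters) and the normalization per 100,000 tags come from the text, and the tag counts are assumed to be summed.

```python
def gene_expression_level(cluster_tags, tag_counts, library_size, per=100_000):
    """Combine the tag counts of one UniGene cluster into a normalized level.

    cluster_tags -- list of dicts with keys "sequence", "antisense", "n_clusters"
    tag_counts   -- raw count of each tag sequence in the SAGE library
    library_size -- total number of tags in that library
    """
    total = 0
    for tag in cluster_tags:
        if tag["antisense"] or tag["n_clusters"] > 3:
            continue  # excluded from the chromosome profiles
        total += tag_counts.get(tag["sequence"], 0)
    return total * per / library_size


tags = [
    {"sequence": "AAAAAAAAAC", "antisense": False, "n_clusters": 1},
    {"sequence": "CCCCCCCCCG", "antisense": True, "n_clusters": 1},   # antisense: skipped
    {"sequence": "GGGGGGGGGT", "antisense": False, "n_clusters": 4},  # >3 clusters: skipped
    {"sequence": "TTTTTTTTTA", "antisense": False, "n_clusters": 2},
]
counts = {"AAAAAAAAAC": 10, "CCCCCCCCCG": 5, "GGGGGGGGGT": 7, "TTTTTTTTTA": 6}
print(gene_expression_level(tags, counts, library_size=200_000))
```

The resulting per-cluster levels, ordered along the sequence-based map, are what the moving median procedure is then applied to.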
6. References
Science. ISSN 0036-8075 (print), 1095-9203 (online)