3. Smoothed %HL(total) curves.
4. Origin predictions and confidence levels.
5. Replication timing curves.
6. Results and conclusions.
SUMMARY
For our purposes, the ultimate key graphical features of a plot of % HL(total) values across a chromosome will be its local extrema -- i.e., local maxima (peaks) and local minima (valleys) of the graph. The reasoning is that regions of the genome that replicate before their neighboring sequences should have higher % HL(total) values than their neighbors: an origin of replication would appear as a "peak" in the plot, and a termination zone as a "valley".
Identification of putative origins on a genome scale requires a practical peak/valley detection algorithm. However, each % HL(total) curve b will contain some amount of noise (e.g., see Fig. 4), which will complicate the automated identification of peaks and valleys. Our approach is to extract these extrema from an auxiliary "smoothed % HL(total) curve". A guiding principle in the choice of algorithms was to avoid a priori knowledge of genome replication information, so as to yield a method readily applicable to yeast mutants and to other organisms.
DETAILED DESCRIPTION (motivation)
The goal is to obtain from b a "smoothed % HL(total) curve", to be denoted b*, using the technique of Fourier convolution smoothing (FCS). Roughly speaking, any graph is a composite of real features ("signal") and the underlying scatter ("noise") in the data. The FCS procedure breaks down the graph into a set of individual components, then attempts to discard those components that constitute the noise in the data while preserving the real features. However, the outcome of the procedure is dictated by the choice of a convolution kernel; depending on the kernel, the output can range from being essentially identical to the input (i.e., no damping of noise) to being essentially a flat line (excessive smoothing). Fig. 5 illustrates the consequence of smoothing the pooled HL curve b for chromosome VI with three different choices of convolution kernel (red curves); the initial raw plot b is included for reference (black curve).
Figure 5. Comparison of three smoothings of b for Chromosome VI. The original % HL(total) curve b (in black) is shown along with three different smoothings (shades of red). The sets of local extrema for these three smoothings differ, depending on the degree to which the original curve was smoothed.
Clearly, the choice of which smoothing to use is an important one. We describe below how that choice is made. The first section, giving the precise definitions of a convolution smoothing kernel and FCS, is highly technical; if one wishes, one may simply accept the existence of such an FCS procedure and proceed to the second section, where the smoothing algorithm is described.
DEFINITION OF FCS
Begin with a finite ordered data set A consisting of pairs of real numbers (x,a). For our eventual applications, A = b = the % HL(total) curve, which consists of a finite collection of pairs (x,a), where x is a chromosomal coordinate and a = 100(bHL/(bHL+bHH)) for that coordinate, calculated as in part I.2. The strategy is to order the data set by the values of x, perform the smoothing on the set of values of (a) taken from the ordered list, and then reattach the (x) values to the corresponding smoothed (a) values. Some definitions to start with:
T= # of pairs of points in A.
A+ = {a : (x,a) is in A }; i.e., the ordered set of (a) values in A.
[[z]] = greatest integer less than or equal to the number z.
If L is an ordered list of numbers (NOT pairs), let
RotateLeft[ L , n] = cycle elements in list L n-positions to the left.
Sum[L] = sum of all the numbers in the list L.
For S a positive real number, define the ordered list of numbers by the formula
Note that the set k(S) contains T elements. Form a new ordered table of data by the formula
The finite list of numbers K(S) is called the convolution kernel of index S.
Given an indexed ordered list L ={a r : 1 ≤ r ≤ T}, the Fourier transform is a NEW ordered indexed list
Fourier[L]={bs : 1 ≤ s ≤ T},
where
the notation "i" in this formula represents the square root of "-1". The Inverse Fourier transform InverseFourier[L] is defined the same way, except that we replace "2pi" by "-2pi" in this formula.
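For reference, a transform consistent with this description can be written out under one assumption: that it follows the default convention of Mathematica's Fourier function. The source does not state this outright; the function names and the sqrt[T] factor in the convolution formula are what suggest it.

$$
b_s = \frac{1}{\sqrt{T}} \sum_{r=1}^{T} a_r \, e^{2\pi i (r-1)(s-1)/T}, \qquad 1 \le s \le T .
$$

With this normalization, InverseFourier[ sqrt[T] Fourier[A] Fourier[B] ] is the cyclic convolution of the lists A and B, whose n-th entry is the sum of the products a_r b_q over all index pairs with (r-1)+(q-1) congruent to n-1 modulo T.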
Starting with the ordered list A+, form the NEW ordered list of data
A+ *K(S) = InverseFourier[ sqrt[T] Fourier[A+] Fourier[K(S)] ].
Finally, define the K(S)-Fourier Convolution Smoothing (FCS) of A to be the new ordered list of pairs of points
A(S)={(x , a*) : the ordering index of (x , a*) is taken to be the ordering index of a* in A+ *K(S) }.
Notice that A(S) is again an ordered set of pairs; a comparison of the graphs of A and A(S) indicates we have smoothed out the data to an extent dictated by the kernel K(S).
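The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the kernel formula is not reproduced in the text, so the sketch assumes a Gaussian of width S, rotated so that its peak sits at the first position and normalized to sum to 1. The function names (fourier, gaussian_kernel, fcs) are hypothetical, and the transform is a naive O(T^2) implementation with the 1/sqrt(T) normalization.

```python
import cmath
import math

def fourier(values, sign=1):
    """Discrete Fourier transform with 1/sqrt(T) normalization.

    sign=+1 gives the forward transform, sign=-1 the inverse.
    """
    T = len(values)
    out = []
    for s in range(T):
        total = sum(v * cmath.exp(sign * 2j * math.pi * r * s / T)
                    for r, v in enumerate(values))
        out.append(total / math.sqrt(T))
    return out

def gaussian_kernel(T, S):
    """ASSUMED kernel K(S): a Gaussian of width S with T entries summing to 1."""
    half = T // 2                                   # [[T/2]]
    k = [math.exp(-((r - half) ** 2) / (2 * S * S)) for r in range(T)]
    total = sum(k)                                  # Sum[k]
    rotated = k[half:] + k[:half]                   # RotateLeft[k, half]
    return [v / total for v in rotated]

def fcs(pairs, S):
    """Smooth the (x, a) pairs: order by x, smooth the a values, reattach x."""
    ordered = sorted(pairs)
    xs = [x for x, _ in ordered]
    a_plus = [a for _, a in ordered]
    T = len(a_plus)
    fa = fourier(a_plus)
    fk = fourier(gaussian_kernel(T, S))
    product = [math.sqrt(T) * u * v for u, v in zip(fa, fk)]
    smoothed = [z.real for z in fourier(product, sign=-1)]
    return list(zip(xs, smoothed))
```

Because the convolution is cyclic, smoothed values near one end of the ordered list are influenced by values at the other end. A kernel whose entries sum to 1 leaves a constant list unchanged.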
For each chromosome, the algorithm begins with the % HL(total) data set b and selects a smoothing kernel to smooth b as follows:
- A moving average is computed for every 20 consecutive values of b along the chromosome (i.e., for every 10 kb window). While this linear filtering step indiscriminately smooths out local noise as well as real features in the data, it provides a basis for assessment of the FCS output. Note that the choice of a 10 kb moving average is consistent with the final normalization described in part I.1 above.
- The FCS method is applied to b using the kernels K(S), where S is of the form 2+0.25 k, k=0,1,2,...,57. As described in the previous section, fifty-eight convolution smoothings b(S) are obtained, ranging from very under-smoothed to very over-smoothed.
- The convolution smoothing b(S) closest to the moving average of b is chosen based on the least-squares metric--for each location along the chromosome, the square of the difference between b(S) and the moving average of b is computed. The FCS output b(S) that gives the smallest sum of squares of differences over all coordinates is chosen as the final smoothed area plot, denoted b*.
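The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names moving_average and choose_smoothing are hypothetical, the smoother is passed in as a callable smooth(values, S), and the handling of the window at the chromosome ends (it simply shrinks) is an assumption, since the text does not specify it.

```python
def moving_average(values, window=20):
    """Moving average over `window` consecutive values (20 values = 10 kb).

    ASSUMPTION: near the ends the window is truncated to the available values.
    """
    T = len(values)
    averaged = []
    for i in range(T):
        lo = max(0, i - window // 2)
        hi = min(T, i - window // 2 + window)
        averaged.append(sum(values[lo:hi]) / (hi - lo))
    return averaged

def choose_smoothing(values, smooth, indices=None, window=20):
    """Return (S, smoothed values) for the smoothing closest to the moving average.

    `smooth(values, S)` is a hypothetical callable returning the smoothed list.
    """
    if indices is None:
        indices = [2 + 0.25 * k for k in range(58)]   # S = 2 + 0.25 k, k = 0..57
    target = moving_average(values, window)
    best = None
    for S in indices:
        candidate = smooth(values, S)
        # Sum of squared differences between b(S) and the moving average.
        error = sum((c - t) ** 2 for c, t in zip(candidate, target))
        if best is None or error < best[0]:
            best = (error, S, candidate)
    return best[1], best[2]
```

Passing the smoother as an argument keeps the selection step independent of how the kernel K(S) is defined.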
As an example, this three-step algorithm as applied to the pooled HL curve b for Chromosome VI is shown below.
Figure 6. The FCS Algorithm as applied to Chromosome VI. (A) The computed 10 kb moving average (blue) is compared to the % HL(total) curve (black). (B) Convolution smoothings b(S) (red) of the % HL(total) data set b (black). Ten of the wide range of possible smoothings of b are shown, ranging from under-smoothed to over-smoothed. (C) Comparison of the 10 kb moving average of data set b with each smoothed set b(S). The error is computed as the sum of squares of the difference between the 10 kb average values of b (see Fig. 6A) and the corresponding values of b(S). The plot reveals an optimal smoothing index S that gives the smallest error when compared to the moving average. (D) Comparison of smoothed (red) with unsmoothed (black) and the moving average (blue) values of % HL(total). The smoothing closest to the moving average (i.e., giving the smallest error in Fig. 6C) is shown. This particular smoothing has a convolution index 6.25; therefore b(6.25) is the smoothed % HL(total) curve b* selected by the algorithm.
The optimal index of smoothing for each chromosome, calculated as described above, is tabulated:
| Chromosome | S=optimal index of smoothing (part II.3) |
|---|---|
| I | 6.50 |
| II | 6.25 |
| III | 6.25 |
| IV | 6.50 |
| V | 6.25 |
| VI | 6.25 |
| VII | 6.50 |
| VIII | 6.25 |
| IX | 6.00 |
| X | 6.25 |
| XI | 6.25 |
| XII | 6.25 |
| XIII | 6.25 |
| XIV | 6.25 |
| XV | 6.25 |
| XVI | 6.25 |
The smoothed % HL(total) curves b* were computed for each chromosome using the above algorithm and optimal smoothing indices; these files are available for downloading at
SUMMARY
For each chromosome, the smoothed pooled HL curve b* from part II.3 is used to obtain a set of origin predictions for the genome. By studying the persistence of a local extremum throughout an entire spectrum of smoothings, we additionally arrive at a confidence level for each origin location; these confidence levels measure the extent to which the extremum is a real feature of the % HL(total) curve (as opposed to possible noise in the signal).
ORIGIN LOCATIONS
For each chromosome, the smoothed % HL(total) plot b* was constructed as described above (part II.3). Local maxima in the plot of b* were defined as locations along the profile where the slope changes from positive to negative; these locations were tagged as peaks. Conversely, locations where the slope changes from negative to positive were tagged as valleys; these are the local minima of the plot of b*. Determining the coordinates of peaks and valleys requires only an easily implemented test comparing the slopes of successive line segments. As already noted above, the crucial observation is the following:
Regions of the genome that replicate before their neighboring sequences should have higher % HL(total) values than their neighbors. Therefore, an origin of replication would appear as a "peak" in the smoothed % HL(total) curve and a termination zone would appear as a "valley".
It should be noted that as an alternative method, one could fit a cubic spline to the data points of b*, then apply elementary derivative tests from calculus to detect local extrema (peaks and valleys). The minor differences obtained in peak and valley locations by this method are much smaller than the chromosomal coordinate resolution of 0.5 kb; for this reason, we have opted for the more straightforward linear method outlined above.
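The slope-comparison test described above can be sketched as follows. The function name find_extrema is hypothetical, the input is assumed to be the list of (x, b*) pairs already ordered by x, and flat segments (zero slope) are simply ignored, which is a simplification not discussed in the text.

```python
def find_extrema(pairs):
    """Return (peaks, valleys) as lists of x coordinates.

    `pairs` must be (x, value) tuples sorted by increasing x, so the sign of
    each segment's slope equals the sign of the difference in values.
    """
    peaks, valleys = [], []
    for (_, a0), (x1, a1), (_, a2) in zip(pairs, pairs[1:], pairs[2:]):
        left = a1 - a0      # sign of the slope of the segment ending at x1
        right = a2 - a1     # sign of the slope of the segment starting at x1
        if left > 0 and right < 0:
            peaks.append(x1)        # slope changes from positive to negative
        elif left < 0 and right > 0:
            valleys.append(x1)      # slope changes from negative to positive
    return peaks, valleys
```

For example, the values 1, 2, 3, 2, 1, 2 at coordinates 0.0 through 2.5 kb give one peak at 1.0 kb and one valley at 2.0 kb.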
A typical smoothed % HL(total) curve reveals that some of the extrema are pronounced, while others are less prominent -- e.g., on chromosome VI, the peak occurring near 140 kb is less prominent than the peak near 200 kb (see Fig. 7 below).
Figure 7. Origin predictions. Origins predicted for Chromosome VI (gray dots) superimposed on the smoothed % HL(total) curve.
ORIGIN CONFIDENCE LEVELS
Recalling that an origin corresponds to a peak in the smoothed pooled HL plot, the persistence of a predicted origin can be measured by counting the number of successive smoothings (starting from, and including, the optimal smoothing determined in part II.3) through which the peak survives. The best way to understand this concept is to plot the origin location predictions for a sequence of successive smoothings and ask through how many of those smoothings each prediction persists. This procedure is illustrated for chromosome VI (Fig. 8).
Figure 8. Confidence levels for Chromosome VI origin predictions. (A) Successive smoothings of the % HL(total) curve. For purposes of illustration, a wide range of smoothings is shown, with smoothing index = 6.25 through 18.25 in increments of 1.0. To assign confidence values (see below), only smoothing indices of 6.25 through 8.25 in increments of 0.25 are used -- i.e., 9 smoothings in total. (B) Chromosome coordinates of maxima for each successive smoothing of the % HL(total) curve. Persistence of a maximum at any given smoothing is depicted as a large black dot. The confidence number for each predicted origin corresponds to the number of smoothings for which that origin persists as a maximum. In this example, the confidence numbers (reading left to right) are 4, 9, 9, 9, 6, 9, 9, and 4, as pictured.
Confidence numbers were calculated for each of the 332 predicted origins in the genome. These data are available at Origin predictions and confidence.
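The counting procedure can be sketched as below. This is a hedged illustration: the text does not say how a peak in one smoothing is matched to a peak in the next, so the sketch assumes a peak "survives" if the next smoothing has a peak within a fixed distance of its last known position. The function name confidence_numbers and the tolerance of 2.5 kb are hypothetical.

```python
def confidence_numbers(peak_lists, tolerance=2.5):
    """Count the successive smoothings through which each predicted origin persists.

    peak_lists[0] holds the peak coordinates (kb) at the optimal smoothing;
    the following entries hold the peaks of successively heavier smoothings.
    ASSUMPTION: a peak persists if some peak lies within `tolerance` kb of its
    position in the previous smoothing.
    """
    result = []
    for origin in peak_lists[0]:
        position = origin
        count = 0
        for peaks in peak_lists:
            nearby = [p for p in peaks if abs(p - position) <= tolerance]
            if not nearby:
                break                       # the peak has been smoothed away
            position = min(nearby, key=lambda p: abs(p - position))
            count += 1                      # includes the optimal smoothing
        result.append((origin, count))
    return result
```

With nine peak lists (smoothing indices 6.25 through 8.25 in steps of 0.25), the counts range from 1 to 9, matching the scale of the confidence numbers described above.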
SUMMARY
Starting with the optimally smoothed % HL(total) curve b*, we describe how to obtain a replication timing curve across the entire genome. This procedure assigns a "trep value" to each of the presumptive origins computed in part II.4.
DETAILED DESCRIPTION
Thus far, we have not taken advantage of the time course in the experiment; this information can provide estimates of the time of replication (trep) of chromosomal loci, including origins. Specifically, the experiment involved eight time points:
t1=0, t2=10, t3=14, t4=19, t5=25, t6=33, t7=44, t8=60.
The formula to calculate percent replication at coordinate x and time point i is

These percent replication data can be downloaded at
For each chromosome and for each chromosomal data coordinate "x" (spaced every 0.5kb) there are eight data points
{ (t1 , p(t1)(x)), (t2 , p(t2)(x)), ... ,(t8 , p(t8)(x)) }.
These eight data points can be fit to a sigmoidal timing curve using the model

for some constants a,c,d,m. (Note: The fit varies with position along the chromosome; i.e., the constants a,c,d,m all depend on the chromosomal coordinate "x".) For example, Fig. 9 shows a timing curve fit to the raw data at the location of ARS607 on Chromosome VI.
Figure 9. A timing curve for ARS607 on Chromosome VI. The eight data points { (t1 , p(t1)(x)), ... ,(t8 , p(t8)(x)) } corresponding to the chromosomal data coordinate "x=200kb" (the location of ARS607) are plotted. The fitted timing curve is shown, along with the "trep value", the time at which half-maximal replication is achieved.
Fitting timing curves requires a nonlinear algorithm; we used the standard Levenberg-Marquardt least squares algorithm, as implemented by the mathematical software package Mathematica 4.0 (Wolfram Research). Some of the fits obtained were deemed "bad". Our criteria for a good fit were that the following conditions hold:
0 < a ≤ 80;
a - d > 0.
![]()
The formula for trep becomes:

In particular, if d=0, then c = trep. A plot is made of the points
(trep (x) , b (x) ),
where "x" ranges over the coordinates at which the data produce a good fit to a timing curve. These points are linearly correlated. For each chromosome, the linear function
L(z)= y
converts a pooled HL value "z" to a raw trep value "y":
| Chromosome | L(z) |
|---|---|
| I | 66.0504 -1.29899 z |
| II | 64.5418 -1.09306 z |
| III | 68.2714 -1.2455 z |
| IV | 69.106 -1.21438 z |
| V | 53.17139 -0.77632 z |
| VI | 70.6334 -1.19704 z |
| VII | 59.9107 -0.82513 z |
| VIII | 67.6624 -1.04731 z |
| IX | 66.5806 -0.986341 z |
| X | 66.79604 -1.03044 z |
| XI | 55.8688 -0.763706 z |
| XII | 58.5138 -0.814019 z |
| XIII | 60.284 -0.957527 z |
| XIV | 45.75579 -0.63542 z |
| XV | 53.3642 -0.85139 z |
| XVI | 52.4603 -0.80842 z |
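The text does not state how the linear functions in the table were computed. Assuming an ordinary least-squares fit of trep against % HL(total), a sketch could look like the following. The names fit_line and chromosome_vi_trep are hypothetical; the coefficients in chromosome_vi_trep are the chromosome VI entry of the table.

```python
def fit_line(z_values, y_values):
    """Least-squares line y = intercept + slope * z (ASSUMED fitting method)."""
    n = len(z_values)
    mean_z = sum(z_values) / n
    mean_y = sum(y_values) / n
    szz = sum((z - mean_z) ** 2 for z in z_values)
    szy = sum((z - mean_z) * (y - mean_y) for z, y in zip(z_values, y_values))
    slope = szy / szz
    intercept = mean_y - slope * mean_z
    return intercept, slope

def chromosome_vi_trep(z):
    """Convert a % HL(total) value z on chromosome VI to a trep value."""
    return 70.6334 - 1.19704 * z
```

As a usage example, a % HL(total) value of 50 on chromosome VI maps to 70.6334 - 1.19704 x 50 = 10.7814. The negative slope reflects that higher % HL(total) values correspond to earlier replication.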
Using L(z), ALL of the % HL(total) data b can be transformed to trep values and then smoothed (by the FCS analysis in part II.3) to obtain a full chromosomal smoothed trep curve. However, properties of FCS and the fact that L(z) is a linear function allow us to simply apply L(z) to the smoothed area data b* to obtain the smoothed trep curve. We refer to these smoothed trep curves as the replication profiles. These data sets are available online at Replication timing data.
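The property invoked here can be written out explicitly. Write L(z) = α + βz and assume (the text does not say so explicitly) that the kernel K(S) is normalized so that its entries sum to 1. Then, for each position n, with indices taken cyclically,

$$
\bigl(L(b) * K(S)\bigr)_n
= \sum_{r} \bigl(\alpha + \beta\, b_r\bigr)\, K(S)_{n-r}
= \alpha \sum_{r} K(S)_{n-r} + \beta\, \bigl(b * K(S)\bigr)_n
= \alpha + \beta\, b^{*}_n
= L\bigl(b^{*}_n\bigr).
$$

Under that assumption, transforming to trep values and then smoothing gives the same curve as smoothing first and then applying L(z).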
For example, Fig. 10 illustrates the replication profile for Chromosome VI:
Figure 10. Replication timing curve for Chromosome VI. The gap in the plot marks a region where the microarray probe density was low. The numbers above predicted origins denote the confidence levels of the origin predictions. The blue bars indicate locations of origins known to be at least 50% efficient; the width of the bar corresponds to the size of the fragment exhibiting ARS activity.
Plots showing replication profiles for all 16 chromosomes are available at Replication_profiles.pdf.
ORIGIN LOCATIONS AND ACTIVATION TIMES
The earliest-activated origins are on chromosomes III and IV (ARS306 and YDori917.0, respectively (1)), while the latest-firing origins are on chromosomes VIII and IV (YHori214.4 and YDori167.3, respectively). No sequence elements that were absolute predictors of origin location were found, nor have we, so far, uncovered any DNA sequence determinants that allow prediction of replication time. The majority (81%) of the 800 perfect matches to the 11 bp ACSs do not lie within the 5 kb regions predicted to contain origins, and only 38% of the 5 kb origin regions (126/332) have one or more perfect matches to the ACS within them. This latter finding is not altogether surprising, since a number of origins have been shown by mutational analysis to rely on an ACS that has only a 10 or even a 9 bp match to the ACS (2).
Although the subtelomeric Y' elements are known to influence transcriptional silencing (3, 4), they do not appear to have any marked effect on the activation time of origins: the average activation time of the most distal origins was about the same for those ends with Y' elements (5) as for those without (p = 0.70; (6)).
ORIGIN LOCATIONS AND Ty ELEMENTS
The locations of transposons or transposon long-terminal repeats (LTRs) show a striking correlation with those of origins. Among all classes of Ty elements, 75% of full-length Tys (36/48, p = 5.3 x 10^-4) and 64% of LTRs (179/281, p = 4.4 x 10^-6) were found in the origin-proximal halves of origin-terminus intervals (6). However, because Ty element locations in S288C do not seem to match the Ty locations in our strain, we hesitate to draw firm conclusions about this apparent correlation.
References for Part II
1. Each previously unknown origin is assigned a name based on the chromosome number and the chromosome coordinate (in kb) at which the origin is located. Thus, "YDori917.0" indicates Yeast (Y) chromosome IV (D) origin (ori) located 917.0 kb from the left end of the chromosome.
2. J. V. Van Houten, C. S. Newlon, Mol. Cell. Biol. 10, 3917 (1990).
3. G. Fourel, E. Revardel, C. E. Koering, E. Gilson, EMBO J. 18, 2522 (1999).
4. F. E. Pryde, E. J. Louis, EMBO J. 18, 2538 (1999).
5. R. J. Britten, Proc. Natl. Acad. Sci. USA 95, 5906 (1998).
6. The analysis of Ty elements and LTRs, as well as of Y' and non-Y' ends, is based on data from strain S288C as described in the Saccharomyces Genome Database, and may differ from the locations in KK14-3a.