II.  Secondary Data Analysis


3.  Smoothed % HL(total) curves.

4.  Origin predictions and confidence levels.

5.  Replication timing curves.

6.  Results and conclusions.


3.  Smoothed % HL(total) curves.

SUMMARY

For our purposes, the key graphical features of a plot of % HL(total) values across a chromosome are its local extrema, i.e., the local maxima (peaks) and local minima (valleys) of the graph.  The reasoning is that regions of the genome that replicate before their neighboring sequences should have higher % HL(total) values than those neighbors: an origin of replication would appear as a "peak" in the plot, and a termination zone as a "valley".

DETAILED DESCRIPTION (motivation)

The goal is to obtain from b a "smoothed % HL(total) curve", to be denoted b*, using the technique of Fourier convolution smoothing (FCS).  Roughly speaking, any graph is a composite of real features ("signal") and the underlying scatter ("noise") in the data.  The FCS procedure decomposes the graph into a set of individual components, then attempts to discard those components that constitute the noise while preserving the real features.  The outcome of the procedure, however, is dictated by the choice of convolution kernel: depending on the kernel, the output can range from essentially identical to the input (i.e., no damping of noise) to essentially a flat line (excessive smoothing).  Fig. 4 illustrates the consequence of smoothing the pooled HL curve b for chromosome VI with three different choices of convolution kernel (red curves); the initial raw plot b is included for reference (black curve).

DEFINITION OF FCS

Begin with a finite ordered data set A consisting of pairs of real numbers (x,a).  For our eventual applications, A = b = the % HL(total) curve, which consists of a finite collection of pairs (x,a), where x is a chromosomal coordinate and a = 100(bHL/(bHL+bHH)) for that coordinate, calculated as in part I.2.  The strategy is to order the data set by the values of x, perform the smoothing on the set of values of (a) taken from the ordered list, and then reattach the (x) values to the corresponding smoothed (a) values.  Some definitions to start with:

SMOOTHED % HL(total) CURVES

For each chromosome, the algorithm begins with the % HL(total) data set b and selects a smoothing kernel to smooth b as follows:

  1. A moving average is computed for every 20 consecutive values of b along the chromosome (i.e., for every 10 kb window).  While this linear filtering step indiscriminately smooths out local noise as well as real features in the data, it provides a basis for assessment of the FCS output.  Note that the choice of a 10 kb moving average is consistent with the final normalization described in part I.1 above.
  2. The FCS method is applied to b using the kernels K(S), where S is of the form 2+0.25 k, k=0,1,2,...,57.  As described in the previous section, fifty-seven convolution smoothings b(S) are obtained, ranging from very under-smoothed to very over-smoothed.
  3. The convolution smoothing b(S) closest to the moving average of b is chosen by a least-squares criterion: for each location along the chromosome, the square of the difference between b(S) and the moving average of b is computed.  The FCS output b(S) that gives the smallest sum of squared differences over all coordinates is chosen as the final smoothed curve, denoted b*.
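
The three selection steps above can be sketched in Python as follows.  This is only an illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, the handling of the moving-average window at the chromosome ends is an assumption, and gaussian_smooth is a stand-in Gaussian-weighted smoother used in place of the FCS kernels K(S), whose definition is not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int = 20) -> List[float]:
    """Moving average over `window` consecutive values (step 1).

    Assumption: the window is centered and simply truncated at the ends.
    """
    n = len(values)
    half = window // 2
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def kernel_parameters() -> List[float]:
    """Values S = 2 + 0.25 k for k = 0, 1, ..., 57, as given in step 2."""
    return [2 + 0.25 * k for k in range(58)]


def gaussian_smooth(values: Sequence[float], s: float) -> List[float]:
    """Stand-in smoother (NOT the FCS kernel K(S)): Gaussian weights of width s points."""
    radius = int(3 * s) + 1
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (j / s) ** 2) for j in offsets]
    n = len(values)
    out = []
    for i in range(n):
        num = den = 0.0
        for j, w in zip(offsets, weights):
            if 0 <= i + j < n:  # ignore positions beyond the chromosome ends
                num += w * values[i + j]
                den += w
        out.append(num / den)
    return out


def sum_squared_diff(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum over all coordinates of the squared difference (step 3)."""
    return sum((p - q) ** 2 for p, q in zip(u, v))


def choose_smoothing(
    b: Sequence[float],
    smooth: Callable[[Sequence[float], float], List[float]],
    s_values: Sequence[float],
    window: int = 20,
) -> Tuple[float, List[float]]:
    """Return (S, b(S)) for the smoothing closest to the moving average of b."""
    reference = moving_average(b, window)
    best_s, best_curve, best_score = s_values[0], list(b), float("inf")
    for s in s_values:
        curve = smooth(b, s)
        score = sum_squared_diff(curve, reference)
        if score < best_score:
            best_s, best_curve, best_score = s, curve, score
    return best_s, best_curve
```

A call such as choose_smoothing(b, gaussian_smooth, kernel_parameters()) returns the selected S together with the corresponding smoothed values, which play the role of b*.  Passing the smoother as an argument keeps the selection logic independent of the particular kernel family.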

4.  Origin predictions and confidence levels.

SUMMARY

For each chromosome, the smoothed pooled HL curve b* from part II.3 is used to obtain a set of origin predictions for the genome.  By studying the persistence of a local extremum throughout an entire spectrum of smoothings, we additionally arrive at a confidence level for each origin location; these confidence levels measure the extent to which the extremum is a real feature of the % HL(total) curve (as opposed to possible noise in the signal).

ORIGIN LOCATIONS

For each chromosome, the smoothed % HL(total) plot b* was constructed as described above (part II.3).  Local maxima in the plot of b* were defined as locations along the profile where the line segment slope changes from positive to negative; these locations were tagged as peaks.  Conversely, locations where the line segment slope changes from negative to positive were tagged as valleys; these are the local minima of the plot of b*.  The coordinates of peaks and valleys are determined by an easily implemented test that compares the slopes of successive line segments.  As already noted above, the crucial observation is that an origin of replication appears as a peak in the plot, and a termination zone as a valley.
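
The slope test for peaks and valleys can be sketched in Python as follows.  The function name find_extrema is hypothetical, and the sketch makes one simplifying assumption: flat segments (zero slope) are ignored rather than resolved to a single coordinate.

```python
from typing import List, Sequence, Tuple


def find_extrema(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (peaks, valleys) as lists of x coordinates.

    x holds chromosomal coordinates in increasing order and y the smoothed
    % HL(total) values.  An interior point is a peak if the slope of the
    segment on its left is positive and the slope on its right is negative;
    it is a valley in the opposite case.
    """
    peaks: List[float] = []
    valleys: List[float] = []
    for i in range(1, len(y) - 1):
        left = (y[i] - y[i - 1]) / (x[i] - x[i - 1])
        right = (y[i + 1] - y[i]) / (x[i + 1] - x[i])
        if left > 0 and right < 0:
            peaks.append(x[i])
        elif left < 0 and right > 0:
            valleys.append(x[i])
    return peaks, valleys


# Small illustrative profile (made-up values, coordinates in kb).
coords = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
values = [1.0, 3.0, 2.0, 1.0, 2.0, 4.0, 3.0]
example_peaks, example_valleys = find_extrema(coords, values)
```

For the illustrative profile, the peaks fall at 0.5 and 2.5 kb and the single valley at 1.5 kb.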

ORIGIN CONFIDENCE LEVELS

Recalling that an origin corresponds to a peak in the smoothed pooled HL plot, the persistence of a predicted origin can be measured by counting the number of successive smoothings (beginning with, and including, the optimal smoothing determined in part II.3) through which the peak survives.  The best way to understand this concept is to plot the origin location predictions for a sequence of successive smoothings and ask up to what extent of smoothing a prediction persists.  This procedure is illustrated for chromosome VI (Fig. 8).
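
The counting of successive smoothings can be sketched in Python as follows.  This is a minimal illustration under assumptions not stated in the text: the function name persistence is hypothetical, a peak is taken to "survive" if the next smoothing has a peak within a tolerance (in kb, an arbitrary illustrative value) of its current position, and the peak is allowed to drift gradually from one smoothing to the next.

```python
from typing import List, Sequence


def persistence(
    origin: float,
    peaks_by_smoothing: Sequence[Sequence[float]],
    tolerance: float = 5.0,
) -> int:
    """Count the successive smoothings through which a predicted origin survives.

    peaks_by_smoothing[0] holds the peak coordinates of the optimal smoothing;
    later entries hold the peaks of progressively heavier smoothings.
    `tolerance` (kb) is a hypothetical matching distance.
    """
    count = 0
    position = origin
    for peaks in peaks_by_smoothing:
        nearby = [p for p in peaks if abs(p - position) <= tolerance]
        if not nearby:
            break  # the peak has been smoothed away
        # Follow the closest surviving peak, which may have shifted slightly.
        position = min(nearby, key=lambda p: abs(p - position))
        count += 1
    return count


# Made-up peak lists for four successive smoothings (coordinates in kb).
example_levels: List[List[float]] = [
    [10.0, 50.0, 90.0],
    [10.5, 52.0],
    [11.0],
    [30.0],
]
```

With these made-up lists, the peak at 10.0 kb persists through three smoothings, the peak at 50.0 kb through two, and the peak at 90.0 kb only through the optimal smoothing itself.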


5.  Replication timing curves.

SUMMARY

Starting with the optimally smoothed % HL(total) curve b*, we describe how to obtain a replication timing curve across the entire genome.  This procedure assigns a "trep value" to each of the presumptive origins computed in part II.4.

DETAILED DESCRIPTION

Thus far, we have not taken advantage of the time course in the experiment; this information can provide estimates of the time of replication (trep) of chromosomal loci, including origins.  Specifically, the experiment involved eight time points:

0 < a ≤ 80;

a - d > 0.


6.  Results and conclusions.

ORIGIN LOCATIONS AND ACTIVATION TIMES

ORIGIN LOCATIONS AND Ty ELEMENTS




References for Part II

1.  Each previously unknown origin is assigned a name based on the chromosome number and the chromosome coordinate (in kb) at which the origin is located. Thus, "YDori917.0" indicates Yeast (Y) chromosome IV (D) origin (ori) located 917.0 kb from the left end of the chromosome.

2.  J. V. Van Houten, C. S. Newlon, Mol. Cell. Biol. 10, 3917 (1990).

3.  G. Fourel, E. Revardel, C. E. Koering, E. Gilson, EMBO J. 18, 2522 (1999).

4.  F. E. Pryde, E. J. Louis, EMBO J. 18, 2538 (1999).

5.  R. J. Britten, Proc. Natl. Acad. Sci. USA 95, 5906 (1998).

6.  The analysis of Ty and LTRs, as well as of Y’ and non-Y’ ends, is based on data from strain S288c as described in the Saccharomyces Genome Database, and may differ from the locations in KK14-3a.