In their report, Lee and Blake (1) asked
whether the visual system could use temporal microstructure to bind
image regions into unified objects, as has been proposed in some neural models (2). Lee and Blake presented two regions of dynamic texture. The elements of the target region changed in synchrony according to a random sequence, while the elements of the background region changed at independent times. The stimulus was designed in an
attempt to remove all classical form-giving cues such as luminance,
contrast, or motion, so that timing itself would provide the only cue.
Subjects were readily able to distinguish the shape of the target
region. Lee and Blake posited the existence of new visual mechanisms
"exquisitely sensitive to the rich temporal structure contained in
these high-order stochastic events." The results have generated much
excitement (3).
However, we believe that the effects can be explained with well-known
mechanisms. The filtering properties of early vision can convert the
task into a simple static or dynamic texture discrimination problem. A
sustained cell (temporal lowpass) will emphasize static texture through
the mechanisms of visual persistence; a transient cell (temporal
bandpass) will emphasize texture that is flickering or moving.
We simulated a lowpass mechanism to see what would emerge. Lee and
Blake's stimuli were composed of randomly oriented Gabor elements,
where the Gabor phase shifted forward or backward on each frame
according to a coin-flip. We downloaded one such movie from their Web
site and ran it through a temporal lowpass filter (4). An input frame is shown in Fig.
1A; a filtered output frame is shown in Fig. 1B. At
the particular moment shown, the target region has a lower effective
contrast than does the background, providing a strong form cue. At
other moments the target's contrast may be above or below the
background's contrast because of statistical fluctuations in the
reversal sequences. If a single Gabor element happens to have a run of
multiple shifts in one direction, its effective contrast is low because
of the temporal averaging. Conversely, if it has a run of
alternating forward and backward shifts, thus "jittering" in place,
its contrast remains fairly high. Within the unsynchronized
background the local contrasts fluctuate randomly, but within the
synchronized target region they all rise and fall in unison,
revealing a distinct rectangular form.
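The averaging argument above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the original stimuli: the Gabor carrier is reduced to a pure sinusoid, the phase step per frame is a hypothetical quarter cycle, and the lowpass filter of note 4 is replaced by a plain four-frame average. Under those assumptions the amplitude of the averaged sinusoid equals the magnitude of the mean phasor, which the function `effective_contrast` (an illustrative name) computes.

```python
import numpy as np

def effective_contrast(shifts, phase_step=np.pi / 2):
    """Amplitude of a unit sinusoid after averaging over phase-shifted frames.

    shifts     : sequence of +1 (forward) or -1 (backward), one per frame
    phase_step : assumed phase change per frame (hypothetical value)
    """
    # Cumulative phase of the carrier on each frame.
    phases = np.cumsum(np.asarray(shifts) * phase_step)
    # Averaging cos(x + phase) over frames gives a sinusoid whose amplitude
    # is the magnitude of the mean unit phasor exp(i * phase).
    return float(np.abs(np.mean(np.exp(1j * phases))))

run = [+1, +1, +1, +1]      # four shifts in one direction
jitter = [+1, -1, +1, -1]   # alternating shifts, "jittering" in place

run_contrast = effective_contrast(run)
jitter_contrast = effective_contrast(jitter)
print(f"run: {run_contrast:.3f}  jitter: {jitter_contrast:.3f}")
```

With these assumed values the run averages to essentially zero contrast while the jittering element retains about 0.71 of its original contrast. A smaller phase step or a different filter shape would change the numbers but not the direction of the effect.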
Fig. 1. (A) One input frame; (B)
result of temporal integration with synchronized target and
unsynchronized background; (C and D) results of
temporal integration when the target and background are each
synchronized.
In a second experiment Lee and Blake synchronized both the target and
background region, each to its own random sequence. Here the target was
even more clearly visible. This result is predicted by our hypothesis.
Since both background and target are synchronized, they will both yield
uniform texture contrasts after temporal filtering. There will be
moments when, by chance, one region's contrast is high while the
other's is low, and the target will become especially clear. Figure 1C
shows one such moment, again the result of filtering a movie from the Web site with the lowpass filter. Figure 1D shows a moment when the
relative contrasts are reversed. We also ran movies through a temporal
bandpass filter (5) with a biphasic impulse response, to
simulate a transient mechanism. Again, the target was clearly
revealed.
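For readers who wish to experiment, the two impulse responses can be sketched as follows. This is an illustration under stated assumptions, not the code used for Fig. 1: the formulas are read from notes 4 and 5 with the time constant written as `tau`, time is assumed to be in seconds, and the sampling step and duration are arbitrary choices. The sketch does not attempt to reproduce the 40 msec or 5 Hz figures quoted in the notes, which depend on the units actually used.

```python
import math
import numpy as np

TAU = 0.01  # time constant from notes 4 and 5; units assumed to be seconds

def lowpass_impulse(t, tau=TAU):
    """h(t) = (t/tau)^2 * exp(-t/tau), monophasic (sustained)."""
    x = t / tau
    return x**2 * np.exp(-x)

def bandpass_impulse(t, tau=TAU, k=2, n=4):
    """h(t) = (kt/tau)^n exp(-kt/tau) [1/n! - (kt/tau)^2/(n+2)!], biphasic (transient)."""
    x = k * t / tau
    return x**n * np.exp(-x) * (1 / math.factorial(n) - x**2 / math.factorial(n + 2))

dt = 1e-5                      # assumed sampling step
t = np.arange(0.0, 0.5, dt)    # long enough for both responses to decay

# The area under an impulse response is the filter's response to a static input.
lowpass_area = float(np.sum(lowpass_impulse(t)) * dt)
bandpass_area = float(np.sum(bandpass_impulse(t)) * dt)
print(f"lowpass area: {lowpass_area:.5f}  bandpass area: {bandpass_area:.2e}")
```

The lowpass response is everywhere non-negative and has nonzero area, so it passes static texture. The bandpass response has a positive lobe followed by a negative lobe whose areas cancel, so it is silent to unchanging input and responds only to temporal change.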
Our hypothesis also predicts, with the use of either filter, Lee and
Blake's finding that discrimination will be best when the reversal
sequences have high entropy, that is, when the coin-flip is unbiased.
The contrast cue is best when the target "jitters" in place while
the background has a run in a single direction (or vice versa). This
condition happens most frequently at high entropy.
Lee and Blake's stimuli are designed to remove form cues from single
frames and from frame pairs. However, when one considers the full
sequence, strong contrast cues can emerge due to the spatio-temporal
filtering present in early vision. These cues probably suffice to
explain the perception of form in the experiments. We do not see the
need to posit mechanisms other than those already known to exist.
Edward H. Adelson
Hany Farid
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139, U.S.A.
E-mail:
adelson,farid{at}persci.mit.edu
REFERENCES AND NOTES
1. S. Lee and R. Blake, Science 284, 1165 (1999).
2. W. Singer and C. Gray, Ann. Rev. Neurosci. 18, 555 (1995).
3. M. Barinaga, Science 284, 1098 (1999).
4. The lowpass impulse response was of the form $h(t) = (t/\tau)^2 e^{-t/\tau}$, with $\tau = 0.01$. The integration time was roughly 40 msec.
5. The bandpass impulse response was of the form $h(t) = (kt/\tau)^n e^{-kt/\tau} [1/n! - (kt/\tau)^2/(n + 2)!]$, with $\tau = 0.01$, $k = 2$, and $n = 4$. The peak response was at 5 Hz.
Response: We agree with Adelson and Farid that an
appropriately designed, lowpass temporal filter applied to our
stochastic animation sequences (1) could extract form
defined by luminance contrast without resort to temporal synchrony. We raised that possibility in our report, noting that temporal integration could produce occasional pulses in apparent contrast when, by chance,
motion elements repetitively switched back and forth in direction over
several successive frames (called "jitter" in Adelson and Farid's
comment). The output from Adelson and Farid's model (Fig. 1) confirms
our intuition, showing that contrast pulses could be synthesized by a
hypothetical temporal filter with the right time constant. But to
assert that these infrequent, hypothetical events "explain the
perception of form" seems conjectural. In our experiments, observers
never saw static single frames such as the one Adelson and Farid point
to in their filtered example; successive frames were rapidly animated,
and contrast pulses were not conspicuous in these animations. But
perhaps this cue, although not salient in the animations, is available
and utilized by observers when performing our shape task. In our
research, we created conditions in which the putative contrast pulses
would occur in the figure and in the background regions. Distributing
identical contrast pulses throughout the display, we reasoned, should
impair figure/ground segmentation based on perceived contrast. But
exactly the opposite was found [see figure 2A in (1)],
implying that contrast summation does not mediate performance on our
task.
Adelson and Farid's hypothetical temporal filter uncovers a possible
consequence that we did not address in our report. Specifically, in
animation sequences containing multiple successive frames without change in the direction of motion (runs), effective contrast produced by temporal integration could be temporarily reduced within the figural
region where all elements are doing the same thing. When that happens,
this region could stand out from the background, where elements are
changing independently. Adelson and Farid's figure 1B depicts this
hypothetical situation. Because strings of "no change" frames are
more probable at lower entropy values, shape discrimination based on
global reductions in contrast within the figure should be particularly
easy at low entropy. But just the opposite is true: shape from temporal
synchrony is best at high entropy, where "no change" sequences are
highly unlikely [see figure 2B in (1)].
We are grateful to Adelson and Farid for formalizing a plausible model
of temporal integration. Using their model, we have quantitatively
indexed the potential strength of luminance cues from temporal
integration (2). We find no correlation between this
strength index and psychophysical performance on our shape
discrimination task. We have gone one step further, using this index to
create animations from which temporal integration could produce no
luminance cues whatsoever in any frames of the sequence. We did three
things to achieve this: (i) the contrast of each moving element was
randomized throughout the array and from frame to frame of the
animation, (ii) the average luminance of each motion element was
assigned randomly throughout the array, and (iii) those frames causing
"runs" and "jitter" were selectively pruned from the sequence.
Observers still readily perceive shape from temporal synchrony in these
sequences that have been purged of potential luminance cues
(3). This observation is remarkable considering that
contrast randomization and luminance randomization actually introduce
conflicting cues for spatial structure. Our findings undermine the
supposition that temporal integration alone can "explain the
perception of form" (1) in these stochastic displays. On
the contrary, it is revealing that temporal integration does not erase
visual signals generated by these kinds of dynamic, stochastic events.
This constitutes one more piece of evidence that human vision contains
mechanisms that preserve the temporal fine structure in dynamic events,
structures that operate in the interests of spatial grouping.
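The strength index defined in note 2 can be sketched in a few lines. This is a minimal illustration, assuming the filtered frame is available as a 2-D array and the figure region as a Boolean mask of the same shape; the function name, the use of the population standard deviation, and the zero returned for a completely uniform frame are choices made here for illustration, not details taken from the note.

```python
import numpy as np

def luminance_cue_strength(filtered_frame, figure_mask):
    """Index |SDf - SDb| / (SDf + SDb) for one temporally filtered frame."""
    frame = np.asarray(filtered_frame, dtype=float)
    mask = np.asarray(figure_mask, dtype=bool)
    sd_f = frame[mask].std()     # SD of pixels inside the figure region
    sd_b = frame[~mask].std()    # SD of pixels in the background region
    total = sd_f + sd_b
    # Assumed convention: a frame with no variation anywhere carries no cue.
    return 0.0 if total == 0 else float(abs(sd_f - sd_b) / total)

# Toy 4 x 4 example: a +1/-1 checkerboard with the left half as the "figure".
checker = np.indices((4, 4)).sum(axis=0) % 2 * 2.0 - 1.0
left_half = np.zeros((4, 4), dtype=bool)
left_half[:, :2] = True

# Same texture contrast in both regions: no cue.
equal_texture = luminance_cue_strength(checker, left_half)
# Figure flattened to zero contrast: maximal cue.
flat_figure = luminance_cue_strength(np.where(left_half, 0.0, checker), left_half)
print(equal_texture, flat_figure)
```

The index runs from 0, when figure and background have equal contrast after filtering, to 1, when one region has no contrast at all. As the note points out, computing it requires knowing where the figure is, which observers in the experiments did not.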
Adelson and Farid also suggest that a filter with a biphasic impulse
response could be involved in the extraction of shape from our dynamic
displays. Here, too, they confirm a point made in our report where we
noted that reversals in motion direction--the carriers of temporal
structure in our displays--could produce brief neural transients that
accurately denote points in time when reversals occur. When applied to
our displays, an appropriately tuned biphasic temporal filter
accomplishes this operation (change detection). So we agree with
Adelson and Farid that there is no need to posit the existence of new
visual mechanisms sensitive to stochastic temporal structure. Existing
mechanisms provide a reasonable point of departure. Still, change
detection is just a first step in extracting shape from temporal
structure. It remains a challenge to explain how spatial grouping is
accomplished based only on irregularly occurring transients distributed
among local neural mechanisms tuned to different directions of motion.
Sang-Hun Lee
Randolph Blake
Vanderbilt Vision Research Center
Vanderbilt University
Nashville, TN 37240, U.S.A.
E-mail:
sang-hun.lee{at}vanderbilt.edu
randolph.blake{at}vanderbilt.edu
REFERENCES AND NOTES
1. S. Lee and R. Blake, Science 284, 1165 (1999).
2. The strength of the putative luminance cue for each individual frame of the filtered sequence is computed in the following way: First, for each frame we calculate the standard deviation of the luminance values of all pixels within the figure region of the filtered array ($SD_f$) and the standard deviation of all pixels within the background region of the filtered array ($SD_b$). (Note that this computation requires knowing the exact location of the figural region within the array, whereas in our experiments the location of this region varied unpredictably.) We express the strength of the luminance cue as the ratio $|SD_f - SD_b| / (SD_f + SD_b)$. From this ratio, we can predict whether performance under any given condition should be better than performance under another condition, based on luminance from temporal integration.
3. Readers may view versions of these animations at http://www.psy.vanderbilt.edu/faculty/blake/Demos/TI/TI.html. Note that these Web animations are running at slower frame rates than we use in the laboratory, and that the spatial resolution of the animations has been significantly reduced to minimize downloading time.
16 July 1999; accepted 26 October 1999