All astronomy data and literature will soon be online
and accessible via the Internet. The community is building the Virtual Observatory, an organization of this worldwide data into a coherent whole that can be accessed by anyone, in any form, from anywhere. The
resulting system will dramatically improve our ability to do
multi-spectral and temporal studies that integrate data from multiple
instruments. The Virtual Observatory data also provide a wonderful base
for teaching astronomy, scientific discovery, and computational
science.
Many fields are now coping with a
rapidly mounting problem: how to organize, use, and make sense of the
enormous amounts of data generated by today's instruments and
experiments. The data should be accessible to scientists and educators,
so that the gap between cutting-edge research on one side and education
and public knowledge on the other is minimized, and they should be
presented in a form that will facilitate integrative research. This
problem is becoming particularly acute in many fields, notably genomics,
neuroscience, and astrophysics. The availability of the Internet is
allowing new ideas and concepts for data sharing and use. Here we
describe a plan to develop an Internet data resource in astronomy to
help address this problem. Because of the nature of the data and the
analyses required of them, the data in this resource remain widely
distributed rather than gathered in one or a few
databases (e.g., GenBank). This approach may be applicable to many
other fields. Our goal is to make the Internet act as the world's best
telescope--a World-Wide Telescope.
Today, there are many impressive archives painstakingly constructed
from observations associated with an instrument. The Hubble Space
Telescope (HST) (1), the Chandra X-Ray Observatory
(2), the Sloan Digital Sky Survey (SDSS) (3), the
Two Micron All Sky Survey (2MASS) (4), and the
Digitized Palomar Observatory Sky Survey (DPOSS) (5) are
examples of this. Each of these archives is interesting in itself, but
temporal and multi-spectral studies require combining data from
multiple instruments. Furthermore, yearly advances in electronics bring
new instruments, doubling the amount of data we collect each year (Fig.
1). For example, approximately a gigapixel is deployed
on all telescopes today, and new gigapixel instruments are under
construction. A night's observation requires a few hundred gigabytes
of storage. The processed data for a single spectral band over the whole
sky amount to a few terabytes. It is impossible for each astronomer to have a
private copy of all the data they use. Many of these new instruments are being used for systematic surveys of our galaxy and of the distant
universe. Together they will give us an unprecedented catalog to study
the evolving universe, provided that the data can be systematically
studied in an integrated fashion.
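To make the growth rate concrete, the short sketch below works out what annual doubling implies. The function name and the normalization to one unit of data in the first year are illustrative assumptions, not figures taken from any survey.

```python
def yearly_volumes(years, first_year=1.0):
    """Volume collected in each year if the yearly volume doubles annually.

    `first_year` is an arbitrary unit (an assumption for illustration).
    """
    return [first_year * 2 ** n for n in range(years)]


volumes = yearly_volumes(10)

# Everything collected before the most recent year, versus that year alone.
all_earlier_years = sum(volumes[:-1])
latest_year = volumes[-1]

# Under strict doubling, the newest year always outweighs all earlier
# years combined, so any private copy is out of date within a year.
print(latest_year > all_earlier_years)
```

Under this assumption, half of all data ever collected is always less than a year old, which is one way to see why private copies cannot keep up.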
Online archives already contain raw and derived astronomical
observations of billions of objects from both temporal and
multi-spectral surveys. Together, they house an order of magnitude more
data than any single instrument. In addition, all the astronomy
literature is online and is cross-indexed with the observations
(6, 7).
Why is it necessary to study the sky in such detail? Celestial
objects radiate energy over an extremely wide range of wavelengths from
radio waves to infrared, optical to ultraviolet, x-rays and even gamma
rays. Each of these observations carries important information about
the nature of the objects. The same physical object can appear to be
totally different in different wavebands (Fig. 2). A
young spiral galaxy appears as many concentrated "blobs," the
so-called HII regions, in the ultraviolet, whereas in the optical it
appears as smooth spiral arms. A galaxy cluster can only be seen as an
aggregation of galaxies in the optical, whereas x-ray observations show
the hot and diffuse gas between the galaxies.
The physical processes inside these objects can only be understood by
combining observations at several wavelengths. Today, we already have
large sky coverage in 10 spectral regions; soon we will have additional
data in at least five more bands. These will reside in different
archives, making their integration all the more complicated.
Raw astronomy data are complex. They can be in the form of fluxes measured
in finite size pixels on the sky, spectra (flux as a function of
wavelength), individual photon events, or even phase information from
the interference of radio waves.
In many other disciplines, once data are collected, they
can be frozen and distributed to other locations. This is not the case for astronomy. Astronomy data need to be calibrated for the
transmission of the atmosphere and for the response of the instruments.
This requires an exquisite understanding of all the properties of the whole system, which sometimes takes several years. With each new understanding of how corrections should be made, the data are reprocessed and recalibrated. As a result, data in astronomy stay "live" much longer than in other disciplines--they need active "curation," mostly by the expert group that collected the data.
Consequently, astronomy data reside at many different geographical
locations, and that will not change. There will not be a central
"Astronomy database." Each group has its own historical reasons to
archive the data one way or another. Any solution that tries to
federate the astronomy data sets must start with the premise that this
trend is not going to change substantially in the near future; there is
no top-down way to simultaneously rebuild all data sources.
To solve these problems, the astrophysical community is
developing the World-Wide Telescope, often called the "Virtual
Observatory" (8). In this approach, the data will
primarily be accessed via digital archives that are widely distributed.
The actual telescopes will either be dedicated to surveys that feed the
archives, or telescopes will be scheduled to follow up on
"interesting" phenomena found in the archives. Astronomers will
look for patterns in the data--spectral and temporal, known and
unknown--and use these to study various object classes. They will have
a variety of tools at their fingertips: a unified search engine, to
collect and aggregate data from several large archives simultaneously,
and a huge distributed computing resource, to perform the analyses
close to the data, in order to avoid moving petabytes of data across
the networks.
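As a rough illustration of this access pattern, the sketch below sends one positional query to several archives and aggregates only the small result sets, so the bulk data never leave their home sites. The `Archive` class, the survey names, and the sample rows are hypothetical stand-ins; a real archive would sit behind a network service rather than an in-memory list.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions (in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


class Archive:
    """Hypothetical stand-in for one remote archive holding one waveband."""

    def __init__(self, name, band, rows):
        self.name = name
        self.band = band
        self._rows = rows  # list of (ra_deg, dec_deg, magnitude)

    def cone_search(self, ra, dec, radius_deg):
        """Filter at the archive and return only the matching rows."""
        return [
            {"archive": self.name, "band": self.band,
             "ra": r, "dec": d, "mag": m}
            for (r, d, m) in self._rows
            if angular_separation_deg(ra, dec, r, d) <= radius_deg
        ]


def federated_cone_search(archives, ra, dec, radius_deg):
    """Send the same query to every archive and merge the small results."""
    results = []
    for archive in archives:
        results.extend(archive.cone_search(ra, dec, radius_deg))
    return results


# Made-up sample data: two archives, each with one source near (10, 20).
archives = [
    Archive("optical_survey", "r", [(10.000, 20.000, 18.2), (50.0, -5.0, 17.1)]),
    Archive("infrared_survey", "K", [(10.001, 20.001, 15.9), (120.0, 30.0, 14.0)]),
]
matches = federated_cone_search(archives, ra=10.0, dec=20.0, radius_deg=0.01)
for match in matches:
    print(match["archive"], match["band"], match["mag"])
```

The design choice worth noting is that the filtering happens inside each archive's own `cone_search`, so only the matching rows cross the network; the caller never sees the full tables.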
Other sciences have comparable efforts to put all their data online
and in the public domain--GenBank in genomics is a good
example--but so far these are centralized rather than federated systems.
The Virtual Observatory will give everyone access to data that span the
entire spectrum, the entire sky, all historical observations, and all
the literature. For publications, data will reside at a few sites
maintained by the publishers. These archive sites will support simple
searches. More complex analyses will be done with imported data
extracts at the user's facility.
Time on this virtual instrument will be available to all. Thus, the
Virtual Observatory should make it easy to conduct such temporal and
multi-spectral studies by automating the discovery and the assembly of
the necessary data.
One of the main uses of the Virtual Observatory will be to
facilitate searches where statistics are critical. We need large samples of galaxies in order to understand the fine details of the
expanding universe and of galaxy formation. These statistical studies
require multicolor imaging of millions of galaxies and measurement of
their distances. We need to perform statistical analyses as a function
of their observed type, environment, and distance.
Other projects study rare objects, ones that do not fit typical
patterns; they search for the needles in the haystack. To this end, the
use of multi-spectral observations is an enormous help. Colors of
objects reflect their temperature. In the expanding universe, the
light emitted by distant objects is redshifted. Therefore, searching
for extremely red objects finds either extremely cold objects or
extremely distant ones. Data mining studies of extremely red objects
discovered distant quasars, the latest at a redshift of 6.28 (9). Mining the 2MASS and SDSS archives found many cold
objects such as brown dwarfs, which are bigger than a planet yet
smaller than a star. These are good examples of multiwavelength searches that are not possible with a single observation of the sky;
they are done by hand today and will be automated in the future. We do not even know all of the
data that exist; we will have to discover them on the fly.
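The reasoning behind a search for extremely red objects can be sketched in a few lines. The example uses the standard relation between redshift z and wavelength, observed = rest x (1 + z), and a simple color cut on the difference between two magnitudes. The band labels, the threshold, and the three catalog rows are hypothetical values chosen for illustration, not the selection actually used in the surveys.

```python
LYMAN_ALPHA_REST_NM = 121.567  # rest wavelength of hydrogen Lyman-alpha


def observed_wavelength_nm(rest_nm, z):
    """Wavelength at which light emitted at `rest_nm` arrives from redshift z."""
    return rest_nm * (1.0 + z)


def color_index(mag_short, mag_long):
    """Magnitude in the shorter band minus the longer band.

    Magnitudes grow as objects get fainter, so a large positive
    value means the object is red.
    """
    return mag_short - mag_long


def select_red_objects(catalog, min_color):
    """Keep objects whose color index is at least `min_color`."""
    return [
        obj for obj in catalog
        if color_index(obj["mag_short"], obj["mag_long"]) >= min_color
    ]


# At z = 6.28 the ultraviolet Lyman-alpha line arrives near 885 nm,
# beyond the red end of the visible spectrum.
print(observed_wavelength_nm(LYMAN_ALPHA_REST_NM, 6.28))

# Made-up catalog rows; only "B" is far redder than the rest.
catalog = [
    {"name": "A", "mag_short": 20.0, "mag_long": 19.8},
    {"name": "B", "mag_short": 23.5, "mag_long": 20.9},
    {"name": "C", "mag_short": 21.0, "mag_long": 20.1},
]
candidates = select_red_objects(catalog, min_color=2.0)
print([obj["name"] for obj in candidates])
```

A color cut alone cannot tell a cold nearby object from a distant one, which is why candidates selected this way still need follow-up observations.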
1 The Johns Hopkins University, Baltimore, MD
21218, USA.
2 Microsoft Bay Area Research Center,
San Francisco, CA, USA.