The human genome, the sum total of hereditary information in a person, contains a lot more than the protein-coding genes teenagers learn about in school, a massive international project has found. When researchers decided to sequence the human genome in the late 1990s, they were focused on finding those traditional genes so as to identify all the proteins necessary for life. Each gene was thought to be a discrete piece of DNA; the order of its DNA bases (the well-known "letter" molecules that are the building blocks of DNA) was thought to code for a particular protein. But scientists deciphering the human genome found, to their surprise, that these protein-coding genes took up less than 3% of the genome. In between were billions of other bases that seemed to have no purpose.
Now a U.S.-funded project, called the Encyclopedia of DNA Elements (ENCODE), has found that many of these bases do, nevertheless, play a role in human biology: They help determine when a gene is turned on or off, for example. This regulation is what makes one cell a kidney cell, for instance, and another a brain cell. "There's a lot more to the genome than genes," says Mark Gerstein, a bioinformatician at Yale University.
The insights from this project are helping researchers understand the links between genetics and disease. "We are informing disease studies in a way that would be very hard to do otherwise," says Ewan Birney, a bioinformatician at the European Bioinformatics Institute in Hinxton, U.K., who led the ENCODE analysis.
As part of ENCODE, 32 institutions did computer analyses, biochemical tests, and sequencing studies on 147 cell types (six fairly extensively) to find out what each of the genome's 3 billion bases does. About 80% of the genome is biochemically active, ENCODE's 442 researchers report today in Nature. Some of these DNA bases serve as landing spots for proteins that influence gene activity. Others are converted into strands of RNA that perform functions themselves, such as gene regulation. (RNA is typically thought of as the intermediary messenger molecule that helps make proteins, but ENCODE showed that much RNA is an end product and is not used to make proteins.) And many bases are simply places where chemical modifications serve to silence stretches of our chromosomes.
ENCODE's results are changing how scientists think about genes. It found about 76% of the genome's DNA is transcribed into RNA of one sort or another, way more than researchers had originally expected. That DNA includes slightly fewer than 21,000 protein-coding genes (some researchers once estimated we had more than 100,000 such genes); "genes" for 8800 small RNA molecules and 9600 long noncoding RNA molecules, each of which is at least 200 bases long; and 11,224 stretches of DNA that are classified as pseudogenes, "dead" genes now known to really be active in some cell types or individuals. In addition, efforts to define the beginning, end, and coding regions of these genes revealed that genes can overlap and have multiple beginnings and ends.
The project uncovered 4 million spots in our DNA that act as switches controlling gene activity. Those switches can be both near and far from the gene they regulate and act in different combinations in different cell types to give each cell type a unique genomic identity. In addition, at least some of the RNA strands produced by the genome also help to control how much protein results from a particular gene's activity. Thus, the regulation of a gene is proving much more complex than expected.
These and other findings appear today in six papers in Nature, and 24 in Genome Research and Genome Biology. Two additional papers are published today on Science online. In a database, ENCODE has created a map showing the roles of all the different bases. "It's like Google Maps for the human genome," says Elise Feingold, a program director for the National Human Genome Research Institute in Bethesda, Maryland, which funded ENCODE. With Google Maps one can choose various views to see different aspects of the landscape. Likewise, in the ENCODE map, one can zoom in from the chromosome level to the individual bases and switch from looking at whether those bases yield RNA or are places where DNA-regulatory proteins bind, for example.
This catalog "will change the way people think about and actually use the human genome," says John A. Stamatoyannopoulos, an ENCODE researcher at the University of Washington, Seattle.
He and others are already harnessing this information, much of which is publicly available, to learn about genetic influences on disease. Many large-scale studies have linked specific base changes to higher or lower risks for disorders ranging from diabetes to arthritis. Now researchers can look to see whether those variants are involved in regulation of some sort and, if so, what genes are being regulated. For his study of cancer and epigenetics, "ENCODE data were fundamental," says Mathieu Lupien, a molecular biologist at the University of Toronto in Canada who was not associated with ENCODE.