For many molecular biologists, a specimen that is too degraded for easy DNA sequencing is a minor annoyance. They simply prepare another, better sample from the abundant source material they have on hand in cultured cells, laboratory animals, or a freezer full of tissues. But what if the only available specimen is an arsenic-preserved pelt from 150 years ago, a pile of mummified dung found in a cave, or a single formalin-fixed pathology slide? As sequencing technology steadily improves, such irremediably difficult samples have gradually begun yielding usable sequences.
The goals of these hard-core sequencing projects range from studying climate change, to classifying—or even resurrecting—extinct species, to improving cancer diagnosis. Nonetheless, they share many of the same challenges, as stored DNA breaks down into progressively smaller fragments over time. Moreover, standard preservatives or compounds in the environment chemically modify the nucleic acids, thereby making modern sequencing techniques produce vast quantities of often cryptic data. Despite those barriers, scientists and equipment makers are pushing sequencing techniques steadily forward, often with surprising results.
Not dead yet
In 2001, a team of scientists visiting Ball’s Pyramid, an isolated rock spire off Lord Howe Island in the Tasman Sea, discovered the world’s rarest invertebrate: an apparent relict population of two dozen Lord Howe Island stick insects (Dryococelus australis). Once abundant on their nearby namesake island, the insects went extinct there shortly after the introduction of rats in 1918. The population on Ball’s Pyramid looks like the same species, but to be sure, researchers want to compare its DNA to that of museum specimens collected over a century ago.
For Alexander Mikheyev, assistant professor in the ecology and evolution unit at the Okinawa Institute of Science and Technology in Okinawa, Japan, it’s a familiar problem. “Nothing I work with is really preserved well, but there’s some bad and some really bad,” says Mikheyev, who specializes in sequencing preserved insects. The D. australis samples he hopes to analyze have been stored dried on pins in museum drawers.
Old specimens’ DNA is typically fragmented into small pieces. “This is where next-gen[eration sequencing] technology comes to help us, because it’s actually good at sequencing small pieces and lots of them,” says Mikheyev. A trickier problem is that old DNA, especially in samples preserved with alcohol or other fixatives, tends to become chemically derivatized, changing both the backbone structure and apparent base sequence. After a simple DNA preparation, researchers use polymerase chain reaction (PCR) to amplify a library of all the fragments in the sample. The PCR process often substitutes incorrect bases for ones that have been chemically derivatized. “Once you have prepared your library, you also have to make sure that when you analyze it you’re not introducing a lot of biases as a consequence of these postmortem information content changes,” Mikheyev adds.
The internal quality controls built into sequencing systems don’t help with these types of artifacts. “The sequencer confidently substitutes [bases] and tells you with great certainty the wrong answer,” says Mikheyev.
Investigators typically control for such biases by comparing their sequence reads to known sequences from the same or a closely related species. “In insects this could be a huge problem, because insects are highly polymorphic genetically,” says Mikheyev. With the high background level of polymorphism, postmortem changes in the DNA can make sequences from long-dead animals hard to align with those of freshly sampled ones. Sequencing multiple specimens to establish population-level statistics can help reduce those errors. For D. australis, a successful captive breeding program means that at least the Ball’s Pyramid population will be relatively easy to sample.
With a badly degraded sample, researchers may also need to apply extremely rigorous statistical filters, discarding the overwhelming majority of the raw data from the sequencer in order to generate an accurate final sequence. Patience and flexibility also help. “For a lot of these methods, there really are no established protocols,” says Mikheyev, adding that “every sample is going to have its own challenges.”
Besides having degraded DNA, preserved museum specimens are often irreplaceable. “We were able to analyze one specimen collected by Alfred Russel Wallace in 1860 in Raja Ampat during his travel through the ‘Malay Archipelago,’” says Guillaume Besnard, a researcher at the Laboratory of Evolution and Biological Diversity at the Université Paul Sabatier in Toulouse, France. To conserve the famous naturalist’s specimen, a southern crowned pigeon (Goura scheepmakeri), Besnard and his colleagues snipped a tiny piece of dried flesh from the bird’s toe pad. They chose this region because it has a large number of cells and because the bird’s feet hadn’t been treated with arsenic, leaving the DNA there in better shape than on the rest of the preserved carcass.
In their studies on phylogenetics and biogeography, Besnard and his colleagues have also sequenced DNA from preserved plants in herbaria. “For plants, usually we choose one leaf as green as possible,” says Besnard, but he adds that seeds are also good sources for DNA. In all cases, he says, “I recommend choosing the best samples that are as young as possible, collected and preserved in good conditions.”
After identifying a suitable sample and designing a general strategy, Besnard says researchers should test their plan on easily replaced specimens. That will allow them to refine their techniques and ensure that they only have to dip into the valuable specimen once. “Depending on the research questions, it may also be important to define the appropriate strategy to use, either whole-genome sequencing or just targeting some genomic regions with a gene baiting approach,” says Besnard. In gene-baiting, scientists use targeted PCR primers to amplify specific genes during the library preparation step, rather than amplifying all of the DNA fragments in the sample.
Focusing on specific genes or regions is an especially good strategy for taxonomy projects, where variation in a few genes is often sufficient to place an organism on a phylogenetic tree. “We generally focus on abundant genomic regions such as organellar DNA, and when sequencing depth is sufficiently high, it’s relatively easy to assemble high-quality sequences,” says Besnard. Sequencing nuclear genomes requires sequencing the DNA library many times over, increasing the “depth” of sequence reads at each base. This procedure amplifies both the valid and invalid data, though, so researchers have to apply more stringent filters to their results.
Rare samples also need to be analyzed in a clean environment to minimize contamination. Even so, investigators should expect to spend some time scrubbing bacterial and fungal sequences out of their data. Sequencing-equipment vendors can help researchers choose appropriate bioinformatics algorithms for all of these analyses.
Ironically, some of the best DNA sources for extinct species and ancient humans are specimens nobody has tried to preserve. Dried feces from caves and pit toilets have proven especially fruitful. In a dry environment, the Maillard reaction—the same chemical process that browns a steak—causes feces to develop a protective outer shell. The resulting paleofeces can survive for centuries, encapsulating a mixed pool of DNA that includes cells from the animal that produced it, as well as a sampling of the animal’s diet.
Hendrik Poinar, professor of physical anthropology at McMaster University in Hamilton, Ontario, was one of the first researchers to dig into this trove of data. Since the late 1990s, Poinar and his colleagues have analyzed everything from ancient human to extinct ground sloth samples. Besides paleofeces, the team has also successfully sequenced DNA from animal carcasses found in Arctic permafrost, including woolly mammoths.
Poinar says that “the technology has changed dramatically” since he started, adding that “everything for the copying and the sequencing of those molecules has grown at an exponential rate, so we’re doing things now that I couldn’t envision we could do even [a few] years ago.” Despite the progress in sequencing technology, however, Poinar says he’s frustrated that the tools for preparing the samples have hardly changed. “I think the access to samples in both deeper time and from more complicated remains is still a limiting factor because of these rather rudimentary extraction techniques,” he says. Standard laboratory DNA isolation methods, such as sonication, ribonuclease (RNase) treatment, and ethanol precipitation, may reduce the available pool of DNA enough to prevent recovering usable sequences from the oldest samples.
Investigators studying paleofeces and other unpreserved samples also face thoroughly degraded DNA. Indeed, some of the breakdown products occur often enough to be useful internal controls. Poinar’s team has cataloged specific degradation artifacts that can distinguish ancient from modern DNA. “We use that as a way to say, ‘This is real to the sample, and is not a modern contaminant coming in,’” says Poinar.
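To make the idea of a degradation signature concrete, here is a minimal sketch in Python. The article does not say which artifacts Poinar’s team tracks; the sketch assumes one widely reported pattern, an excess of C-to-T substitutions near the start of ancient reads caused by cytosine deamination. The function name, the input format (gap-free pairs of reference segment and read), and the toy data are all hypothetical.

```python
def c_to_t_rate_by_position(aligned_pairs, window=3):
    """Fraction of reference C sites read as T at each of the first
    `window` read positions.

    aligned_pairs: list of (reference_segment, read) strings that are
    already aligned without gaps (an assumption made for simplicity).
    Returns a list of fractions; None where no reference C was seen.
    """
    c_sites = [0] * window
    c_to_t = [0] * window
    for ref, read in aligned_pairs:
        for i in range(min(window, len(ref), len(read))):
            if ref[i] == "C":
                c_sites[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [t / c if c else None for t, c in zip(c_to_t, c_sites)]


# Toy data, invented for illustration only.
pairs = [
    ("CACGT", "TACGT"),
    ("CCCGA", "TCCGA"),
    ("CGCTA", "CGCTA"),
    ("ACCGT", "ACTGT"),
]
rates = c_to_t_rate_by_position(pairs, window=3)
```

In this toy set the rate is highest at the first read position. Under the stated assumption, a real sample whose C-to-T rate peaks at read ends would be behaving like ancient DNA, while a flat profile would point toward modern contamination.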
Scientists who sequence ancient or preserved specimens seem to favor the Illumina next-generation platform to perform the sequencing itself, though Thermo Fisher Scientific’s IonTorrent offers similar capabilities. Poinar says the choice is largely a matter of convenience: “The platforms themselves are not going to make any difference as far as I can tell; the difference will be the repair of the molecules that come from your extracts, and then the library prep that’s done.”
For researchers just starting to explore ancient samples, Poinar echoes Mikheyev’s advice to be flexible. “Play around a bit; I think the biggest issue that people have is they’re just using standardized methods for extraction of samples, and I don’t think that’s very successful,” says Poinar.
Although a pile of dung from a cave poses serious analytical challenges, it’s probably not the most difficult specimen researchers are sequencing now. Instead, one of the most challenging types of samples is also the kind biomedical researchers are most likely to find interesting: formalin-fixed, paraffin-embedded (FFPE) tissue.
Pathologists and histologists have been fixing tissues with formalin for more than a century, and FFPE sections are a mainstay of clinical pathology labs. The technique is simple and robust. Unfortunately, this robustness promotes a casual attitude toward fixation. “In some cases the sample may have been fixed for several days or over the weekend; in other cases it may be just overnight, so there’s a big difference between the samples,” says Jakob Hedegaard, a postdoctoral fellow in the Department of Molecular Medicine at Aarhus University Hospital in Aarhus, Denmark, who has worked extensively with FFPE samples.
Varying the fixation time has little or no effect on a tissue’s morphology, but it plays havoc at the molecular level. Over time, formalin crosslinks proteins in the cells, and fragments and derivatizes both DNA and RNA. Without knowing how long the chemical insult lasted, researchers have a hard time predicting the quality of the DNA they’ll recover.
As with other degraded DNA specimens, derivatized bases pose the biggest challenge. The standard library amplification step in most sequencing protocols substitutes erroneous bases for the ones that have been modified, yielding high-quality but incorrect sequences. Investigators then have to filter the raw data to separate real polymorphisms from artifacts. “Fixation-introduced variants tend to be randomly distributed all over the place, so if you sequence very deep, then you should be able to see the true variants and exclude the noise,” says Hedegaard, adding that “in general the data are much more noisy when the DNA is of FFPE origin.”
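Hedegaard’s point about depth can be illustrated with a small Python sketch. It assumes that randomly scattered artifacts each appear in only a small fraction of the reads covering a position, while a true variant appears in a substantial fraction. The function name, the input format, and the thresholds (`min_depth`, `min_fraction`) are hypothetical choices for illustration, not values from any published pipeline.

```python
from collections import Counter


def call_variants(pileup, reference, min_depth=20, min_fraction=0.2):
    """Report positions where a non-reference base is well supported.

    pileup: dict mapping position -> string of bases observed there.
    reference: reference sequence as a string.
    Returns dict mapping position -> (variant_base, fraction_of_reads).
    """
    calls = {}
    for pos, bases in pileup.items():
        depth = len(bases)
        if depth < min_depth:
            continue  # too shallow to separate signal from noise
        alt_counts = Counter(b for b in bases if b != reference[pos])
        if not alt_counts:
            continue
        base, count = alt_counts.most_common(1)[0]
        if count / depth >= min_fraction:
            calls[pos] = (base, count / depth)
    return calls


# Toy data, invented for illustration only.
reference = "ACGT"
pileup = {
    0: "A" * 38 + "G" * 2,   # deep, but the G is rare: treated as noise
    1: "C" * 20 + "T" * 20,  # deep and well supported: reported
    2: "G" * 5 + "A" * 5,    # well supported but too shallow: skipped
}
calls = call_variants(pileup, reference)
```

Only position 1 is reported here. Real variant callers also weigh base quality, strand, and the expected error profile, so this is a sketch of the reasoning rather than a usable caller.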
Scientists analyzing FFPE tissues usually work with more samples than those studying rare or unusual specimens. For a project on tumor genetics, for example, a team may need to sequence hundreds or thousands of FFPE tissue slices from different patients to find statistically meaningful variations. Even as sequencing costs decline, such high-throughput efforts require pooling the DNA samples.
Hedegaard and his colleagues typically attach specific tags to the DNA from each tissue sample before pooling them, allowing them to sequence numerous specimens in each sequencing run. They then use the tag sequences to separate individual samples from the raw data.
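The tagging-and-pooling step described above can be sketched in a few lines of Python. This version assumes each read begins with a short, fixed-length tag that maps to exactly one sample; on real instruments the tags are often read separately from the insert, and mismatches are usually tolerated. The function name, tag sequences, and sample names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_length):
    """Group pooled reads by the sample tag at the start of each read.

    Reads whose tag is not recognized are collected under "unassigned".
    The tag itself is trimmed from the stored sequence.
    """
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = tag_to_sample.get(tag, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Toy tags and reads, invented for illustration only.
tags = {"ACGT": "tumor_A", "TGCA": "tumor_B"}
pooled = ["ACGTGGGCCC", "TGCATTTAAA", "ACGTCCCGGG", "NNNNAAAAAA"]
result = demultiplex(pooled, tags, tag_length=4)
```

Keeping unrecognized reads in their own bin, rather than dropping them silently, makes it easier to notice when a tag has been damaged or mistyped.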
Some molecular biologists may have enough clout to coax pathologists into mending their poorly controlled ways. “The main thing we’ve been working on over the last 6 to 18 months is to look at how we might improve getting sequenceable samples, and it does seem to be that controlling those fixation steps is going to have a marked impact,” says James Hadfield, director of the genomics core for the Cancer Research UK Cambridge Institute at the University of Cambridge in Cambridge, United Kingdom.
Hadfield and his colleagues are doing large-scale genomic analyses of different tumor types, as part of the massive Genomics England project to sequence 100,000 British genomes. But even with the leverage of a big, well-supported project, changing old habits is hard. Hadfield says packaging the change as a general quality-control improvement may help: “In our research institute, our histopathology core is very carefully controlled in the time of fixation, [and] being controlled in any scientific or diagnostic process means that things are more robust.”
For researchers whose pathology collaborators remain intransigent or those working on historical specimens, equipment and reagent makers may be able to help. Illumina offers a range of microarrays and other tools designed to optimize results from badly degraded FFPE samples, and New England BioLabs sells the NEBNext FFPE DNA repair mix, for example.
Hadfield and others are also trying to develop and promote DNA-friendly fixatives that pathologists could use in lieu of formalin. Although that work has produced promising results, Hadfield emphasizes that getting clinical labs to switch from well-tested methods remains a major challenge.
Scientists with very focused projects may also have the option of avoiding genome-level sequencing entirely. Hadfield echoes Besnard’s suggestion to amplify and sequence specific genes rather than whole genomes, if that will answer the research question.
Regardless of the types of samples they’re sequencing or the strategies they’re using, those working with difficult DNA samples agree that the field calls for a strong dose of skepticism. As Mikheyev says, “Always question your data and prove to yourself using some kind of orthogonal method that the data are telling you what you think they’re telling you.”