Technical Comments
Comment on "Phylogenetic MCMC Algorithms Are Misleading on Mixtures of Trees"
Mossel and Vigoda (Reports, 30 September 2005, p. 2207) show that nearest neighbor interchange transitions, commonly used in phylogenetic Markov chain Monte Carlo (MCMC) algorithms, perform poorly on mixtures of dissimilar trees. However, the conditions leading to their results are artificial. Standard MCMC convergence diagnostics would detect the problem in real data, and correction of the model misspecification would solve it.
1 School of Computational Science, Florida State University, Tallahassee, FL 32306-4120, USA.
2 Department of Statistics, University of Wisconsin, Madison, WI 53706, USA.
3 Division of Biological Sciences, University of California at San Diego, San Diego, CA 92093-0116, USA.
4 Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
5 Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, PA 15282, USA.
* To whom correspondence should be addressed. E-mail: ronquist{at}scs.fsu.edu
Phylogenetic inference has become an essential tool in the life sciences, with applications ranging from identification of virus transmission pathways to reconstruction of the universal ancestor of all life. Among the many approaches to the phylogeny problem, Bayesian inference with MCMC sampling has rapidly gained popularity in recent years because of its statistical rigor and computational efficiency. It is natural, then, that this approach has increasingly become the focus of detailed scrutiny.
Adequate mixing is essential to the success (convergence) of MCMC algorithms when sampling a Bayesian posterior probability distribution. Mossel and Vigoda (1) show that nearest neighbor interchange (NNI) transitions, which are commonly used in phylogenetic MCMC sampling, suffer from poor mixing when the data come from an equal mixture of two dissimilar trees. However, their theoretical results, which show an exponential increase (instead of the desired decrease) in mixing time with sequence length, depend critically on the equal mixture assumption. If the proportions were 0.499 and 0.501, for instance, then the more frequent tree type would quickly become dominant in the posterior as more data were accumulated, and there would be no exponential increase in mixing time. Thus, the extreme phenomenon discussed in (1) only occurs under highly artificial settings and is unlikely to be encountered in real data.
Even when the proportions are not exactly equal, however, mixtures of trees can be difficult to analyze. This is true for all phylogenetic methods, not only for Bayesian MCMC inference. For instance, optimization methods, such as parsimony and maximum likelihood, may encounter difficulties because of isolated islands of near-optimal trees when the data are generated from tree mixtures. Fortunately, mixed tree signals are easily discovered in the Bayesian context by using standard MCMC convergence checking. We particularly recommend comparison of tree samples from independent runs, which has been invoked by default in our software for some time (2, 3). This method readily detects slow convergence on tree mixtures (Fig. 1; Simple MCMC, NNI; and 1 heated chain, NNI).
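The comparison of tree samples from independent runs can be made concrete with a small sketch. One common summary of this kind is the average standard deviation of split (bipartition) frequencies across runs, which approaches zero as the runs converge on the same distribution of trees. The Python below is illustrative only: the function names, the representation of a sampled tree as a set of splits, the use of the sample standard deviation, and the `min_freq` cutoff are all assumptions made here, not the implementation in the software cited above.

```python
from collections import Counter
from statistics import stdev


def split_frequencies(trees):
    """Return the fraction of sampled trees that contain each split.

    Each tree is assumed (for this sketch) to be a set of splits, and each
    split a frozenset of the taxa on one side of an internal branch.
    """
    counts = Counter(split for tree in trees for split in tree)
    n = len(trees)
    return {split: count / n for split, count in counts.items()}


def average_sd_of_split_frequencies(runs, min_freq=0.1):
    """Average, over splits, of the across-run standard deviation of frequency.

    `runs` is a list of at least two tree samples. Splits whose frequency
    stays below `min_freq` in every run are ignored (a hypothetical cutoff).
    """
    freqs = [split_frequencies(trees) for trees in runs]
    all_splits = set().union(*freqs)
    sds = []
    for split in all_splits:
        values = [f.get(split, 0.0) for f in freqs]
        if max(values) >= min_freq:
            sds.append(stdev(values))
    return sum(sds) / len(sds) if sds else 0.0


# Toy four-taxon example: an unrooted tree has a single internal split.
ab_cd = {frozenset("AB")}  # tree ((A,B),(C,D))
ac_bd = {frozenset("AC")}  # tree ((A,C),(B,D))

# Two runs, each trapped on a different tree of the mixture.
stuck_runs = [[ab_cd] * 100, [ac_bd] * 100]
# Two runs that both sample the two trees in equal proportion.
mixed_runs = [[ab_cd] * 50 + [ac_bd] * 50, [ac_bd] * 50 + [ab_cd] * 50]

print(average_sd_of_split_frequencies(stuck_runs))
print(average_sd_of_split_frequencies(mixed_runs))
```

In this toy case the trapped runs give a large value and the well-mixed runs give zero. A single-run trace plot could look stable in both cases, so the disagreement between runs is what exposes the problem.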
Mossel and Vigoda (1) point out that single-run trace plots can be unreliable indicators of MCMC convergence, but this is well known among Bayesian MCMC practitioners (4–6). Inexperienced users can be misled when analyzing difficult phylogenetic problems, regardless of the method they choose. It is difficult to describe phylogenetic MCMC algorithms as particularly misleading when they may, through the multiple-run convergence diagnostics, offer some of the best tools phylogeneticists have today for detecting problems such as those caused by tree mixtures.
If convergence checking reveals an unexpected mixture of two conflicting phylogenetic signals, then one must conclude that the evolutionary model is misspecified and that the statistical results are of doubtful value regardless of whether convergence can be achieved. The correct approach is then to account for the tree mixture in the evolutionary model. This is easily done using existing software and a partitioned model (7) or a hidden Markov model or mixture model (8, 9). In the former case, the data are divided into fixed partitions before analysis; in the latter, the partitions are themselves random variables. An analysis under one of these models would not only retrieve both of the underlying signals but would also be likely to mix rapidly with NNI-like tree updates (Fig. 1; Simple MCMC, NNI, correct model).
For the phylogeneticist who insists on analyzing tree mixtures under an erroneous model assuming homogeneity, there are still two good options available. Metropolis coupling, now standard in phylogenetic MCMC analysis (4, 10), uses disparate starting points and heating to improve sampling efficiency. In fact, the default settings in MrBayes may often suffice for rapid convergence even on difficult tree mixtures like the one described by Mossel and Vigoda (Fig. 1; 3 heated chains, NNI). Mixing can also be improved by using a combination of tree updates, including some not in the NNI family (2, 3, 11, 12) (Fig. 1; Simple MCMC, subtree swapper).
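The effect of heating in Metropolis coupling can be illustrated on a toy problem. The sketch below is not a phylogenetic sampler: it replaces tree space with a one-dimensional target that has two well-separated modes, standing in for two islands of trees, and uses small random-walk proposals as a loose analogue of local moves such as NNI. The target, the heating schedule beta_i = 1 / (1 + temp * i), the value of `temp`, the adjacent-chain swap rule, and the temperature-scaled step size are all assumptions chosen for illustration and are not taken from MrBayes or from the analyses reported here.

```python
import math
import random


def log_target(x):
    """Unnormalized log density with narrow modes at -4 and +4."""
    a = -0.5 * ((x + 4.0) / 0.5) ** 2
    b = -0.5 * ((x - 4.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def accept(rng, log_ratio):
    """Metropolis acceptance test for a given log acceptance ratio."""
    return rng.random() < math.exp(min(0.0, log_ratio))


def mc3(n_steps, n_chains=4, temp=3.0, step=0.5, seed=1):
    """Return the cold-chain samples from a toy Metropolis-coupled sampler.

    Chain i targets the density raised to the power beta_i; chain 0 is the
    cold chain (beta = 1). With n_chains=1 this is plain Metropolis MCMC.
    """
    rng = random.Random(seed)
    betas = [1.0 / (1.0 + temp * i) for i in range(n_chains)]
    states = [-4.0] * n_chains  # every chain starts in the left mode
    cold_samples = []
    for _ in range(n_steps):
        # Local random-walk update within each chain on its heated target.
        for i, beta in enumerate(betas):
            proposal = states[i] + rng.gauss(0.0, step / math.sqrt(beta))
            log_ratio = beta * (log_target(proposal) - log_target(states[i]))
            if accept(rng, log_ratio):
                states[i] = proposal
        # Propose exchanging the states of one random pair of adjacent chains.
        if n_chains > 1:
            i = rng.randrange(n_chains - 1)
            j = i + 1
            log_ratio = (betas[i] - betas[j]) * (
                log_target(states[j]) - log_target(states[i])
            )
            if accept(rng, log_ratio):
                states[i], states[j] = states[j], states[i]
        cold_samples.append(states[0])
    return cold_samples


if __name__ == "__main__":
    for chains in (1, 4):
        samples = mc3(20000, n_chains=chains)
        right = sum(x > 0 for x in samples) / len(samples)
        print(f"{chains} chain(s): fraction of samples in right mode = {right:.3f}")
```

In this toy setting a single unheated chain is expected to remain in the mode where it started, whereas the cold chain of the coupled sampler can reach the other mode through swaps with heated chains, which see a flattened landscape and cross the low-probability region more readily.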
Received for publication 8 December 2005. Accepted for publication 15 March 2006.
Science. ISSN 0036-8075 (print), 1095-9203 (online)