Molecular Phylogenetics: Principles And Practice


REVIEWS
STUDY DESIGNS

Ziheng Yang1,2 and Bruce Rannala1,3

Abstract: Phylogenies are important for addressing various biological questions such as relationships among species or genes, the origin and spread of viral infection and the demographic changes and migration patterns of species. The advancement of sequencing technologies has taken phylogenetic analysis to a new height. Phylogenies have permeated nearly every branch of biology, and the plethora of phylogenetic methods and software packages that are now available may seem daunting to an experimental biologist. Here, we review the major methods of phylogenetic analysis, including parsimony, distance, likelihood and Bayesian methods. We discuss their strengths and weaknesses and provide guidance for their use.

Systematics: The inference of phylogenetic relationships among species and the use of such information to classify species.
Taxonomy: The description, classification and naming of species.
Coalescent: The process of joining ancestral lineages when the genealogical relationships of a random sample of sequences from a modern population are traced back.

1Center for Computational and Evolutionary Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
2Department of Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK.
3Genome Center and Department of Evolution and Ecology, University of California, Davis, California 95616, USA.
Correspondence to Z.Y. e-mail: z.yang@ucl.ac.uk
doi:10.1038/nrg3186
Published online 28 March 2012

Before the advent of DNA sequencing technologies, phylogenetic trees were used almost exclusively to describe relationships among species in systematics and taxonomy. Today, phylogenies are used in almost every branch of biology.
Besides representing the relationships among species on the tree of life, phylogenies are used to describe relationships between paralogues in a gene family1, histories of populations2, the evolutionary and epidemiological dynamics of pathogens3,4, the genealogical relationship of somatic cells during differentiation and cancer development5 and the evolution of language6. More recently, molecular phylogenetics has become an indispensable tool for genome comparisons. In this context, it is used: to classify metagenomic sequences7; to identify genes, regulatory elements and non-coding RNAs in newly sequenced genomes8–10; to interpret modern and ancient individual genomes11–13; and to reconstruct ancestral genomes14,15.

In other applications, the phylogeny itself may not be of direct interest but must nevertheless be accounted for in the analysis. This 'tree thinking' has transformed many branches of biology. In population genetics, the development of the coalescent theory16,17 and the widespread availability of gene sequences for multiple individuals from the same species have prompted the development of genealogy-based inference methods, which have revolutionized modern computational population genetics. Here, the gene trees that describe the genealogy of sequences in a sample are highly uncertain; they are not of direct interest but nevertheless contain valuable information about parameters in the model. Tree thinking has also forged a deep synthesis of population genetics and phylogenetics, creating the emerging field of statistical phylogeography. In species tree methods2,18,19, the gene trees at individual loci may not be of direct interest and may be in conflict with the species tree. By averaging over the unobserved gene trees under the multi-species coalescent model20, those methods infer the species tree despite uncertainty in the gene trees.
In comparative analysis, inference of associations between traits (for example, testis size and sexual promiscuity) using the observed traits of modern species should consider the species phylogeny to avoid misinterpreting historical contingencies as causal relationships21. In the inference of adaptive protein evolution, the phylogeny is used to trace the synonymous and nonsynonymous substitutions along branches to identify cases of accelerated amino acid change22, even though the phylogeny is not of direct interest.

Nowadays, every biologist needs to know something about phylogenetic inference. However, to an experimental biologist who is unfamiliar with the field, the existence of many analytical methods and software packages might seem daunting. In this Review, we describe the suite of current methodologies for phylogenetic inference using sequence data. We also discuss various statistical criteria that are useful for choosing the methods that are best suited for a particular question and data type. Next-generation sequencing (NGS) technologies are generating huge data sets. In the analysis of such data sets, reducing systematic errors and increasing robustness to model violations are much more important than reducing random sampling errors. We discuss several issues in the analysis of large data sets, such as the impact of missing data and strategies of data partitioning. The literature of molecular phylogenetics is large and complex23,24; the aim of this Review is to provide a starting point for exploring the methods further.

Nature Reviews Genetics, Volume 13, May 2012. © 2012 Macmillan Publishers Limited. All rights reserved.

Box 1 | Tree concepts

[Figure: a | Rooted tree. b | Unrooted tree. Three species (1, 2 and 3); a time axis with speciation times τ0 and τ1; branch lengths b0, b1, b2 and b3 on the rooted tree and b1, b2 and b′3 on the unrooted tree.]

A phylogeny is a model of genealogical history in which the lengths of the branches are unknown parameters. For example, the phylogeny on the left is generated by two speciation events that occurred at time points τ0 and τ1. The branch lengths (b0, b1, b2 and b3) are typically expressed in units of expected number of substitutions per site and measure the amount of evolution along the branches.

If the substitution rate is constant over time or among lineages, we say that the molecular clock holds60. The tree will then have a root and be ultrametric, meaning that the distances from the tips of the tree to the root are all equal (for example, b0 + b1 = b0 + b2 = b3). A rooted tree for s species can then be represented by the ages of the s – 1 ancestral nodes and thus involves s – 1 branch-length parameters. The procedure of inferring rooted trees by assuming the molecular clock is called molecular clock rooting.

For distantly related species, the clock hypothesis should not be assumed. Most phylogenetic analyses are therefore conducted without the assumption of the clock. If every branch on the tree is allowed to have an independent evolutionary rate, commonly used models and methods are unable to identify the location of the root, so only unrooted trees are inferred. An unrooted tree for s species then has 2s – 3 branch-length parameters. A commonly used strategy to 'root the tree' is to include outgroup species in the analysis, which are known to be more distantly related than the species of interest. Although the inferred tree for all species is unrooted, the root is believed to be located along the branch that leads to the outgroup so that the tree for the ingroup species is rooted. This strategy is called outgroup rooting.

Gene trees: The phylogenetic or genealogical tree of sequences at a gene locus or genomic region.
Statistical phylogeography: The statistical analysis of population data from closely related species to infer population parameters and processes such as population sizes, demography, migration patterns and rates.
Species tree: A phylogenetic tree for a set of species that underlies the gene trees at individual loci.
Systematic errors: Errors that are due to an incorrect model assumption. They are exacerbated when the data size increases.
Random sampling errors: Errors or uncertainties in parameter estimates owing to limited data.
Cluster algorithm: An algorithm of assigning a set of individuals to groups (or clusters) so that objects of the same cluster are more similar to each other than those from different clusters. Hierarchical cluster analysis can be agglomerative (starting with single elements and successively joining them into clusters) or divisive (starting with all objects and successively dividing them into partitions).
Markov chain: A stochastic sequence (or chain) of states with the property that, given the current state, the probabilities for the next state do not depend on the past states.
Transitions: Substitutions between the two pyrimidines (T ↔ C) or between the two purines (A ↔ G).
Transversions: Substitutions between a pyrimidine and a purine (T or C ↔ A or G).

Phylogenetic tree reconstruction: basic concepts

A phylogeny is a tree containing nodes that are connected by branches. Each branch represents the persistence of a genetic lineage through time, and each node represents the birth of a new lineage (BOX 1). If the tree represents the relationship among a group of species, then the nodes represent speciation events.
In other contexts, the interpretation might be different. For example, in a gene tree of sequences sampled from a population, the nodes represent birth events of individuals who are ancestral to the sample, whereas in a tree of paralogous gene families, the nodes might represent gene duplication events.

Phylogenetic trees are not directly observed and are instead inferred from sequence or other data. Phylogeny reconstruction methods are either distance-based or character-based. In distance matrix methods, the distance between every pair of sequences is calculated, and the resulting distance matrix is used for tree reconstruction. For instance, neighbour joining25 applies a cluster algorithm to the distance matrix to arrive at a fully resolved phylogeny. Character-based methods include maximum parsimony, maximum likelihood and Bayesian inference methods. These approaches simultaneously compare all sequences in the alignment, considering one character (a site in the alignment) at a time to calculate a score for each tree. The 'tree score' is the minimum number of changes for maximum parsimony, the log-likelihood value for maximum likelihood and the posterior probability for Bayesian inference. In theory, the tree with the best score should be identified by comparing all possible trees. In practice, because of the huge number of possible trees, such an exhaustive search is not computationally feasible except for very small data sets. Instead, heuristic tree search algorithms are used. These approaches often generate a starting tree using a fast algorithm and then perform local rearrangements to attempt to improve the tree score. A heuristic tree search is not guaranteed to find the best tree under the criterion, but it makes it feasible to analyse large data sets.
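To give a sense of why exhaustive search is infeasible, the short sketch below counts the distinct bifurcating tree topologies for s labelled species. It relies on the standard combinatorial result that there are (2s – 5)!! unrooted and (2s – 3)!! rooted binary trees; that formula is not stated in the text above, and the function names are purely illustrative.

```python
def count_unrooted_trees(s: int) -> int:
    """Number of unrooted bifurcating trees for s labelled taxa: 3 * 5 * ... * (2s - 5)."""
    if s < 3:
        raise ValueError("an unrooted tree needs at least three taxa")
    count = 1
    for k in range(3, 2 * s - 4, 2):  # odd factors 3, 5, ..., 2s - 5
        count *= k
    return count


def count_rooted_trees(s: int) -> int:
    """Number of rooted bifurcating trees for s labelled taxa: 3 * 5 * ... * (2s - 3)."""
    if s < 2:
        raise ValueError("a rooted tree needs at least two taxa")
    count = 1
    for k in range(3, 2 * s - 2, 2):  # odd factors 3, 5, ..., 2s - 3
        count *= k
    return count


if __name__ == "__main__":
    for s in (4, 10, 20, 50):
        print(s, count_unrooted_trees(s), count_rooted_trees(s))
```

Four species give only three unrooted trees, but ten species already give more than two million, which is why heuristic searches are used for all but very small data sets.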
To describe the data, distance matrix, maximum likelihood and Bayesian inference all make use of a substitution model and are therefore model-based, whereas maximum parsimony does not have an explicit model and its assumptions are implicit.

Distance matrix method

Distance calculation. Pairwise sequence distances are calculated assuming a Markov chain model of nucleotide substitution. Several commonly used models are illustrated in FIG. 1. The JC69 model26 assumes an equal rate of substitution between any two nucleotides, whereas the K80 model27 assumes different rates for transitions and transversions. Both models predict equal frequencies of the four nucleotides. The assumption of equal base frequencies is relaxed in the HKY85 model28 and the general time reversible (GTR) model29,30. Because of the variation in local mutation rate and in selective constraint, different sites in a DNA or protein sequence often evolve at different rates. In distance calculation, such rate variation is accommodated by assuming a gamma (Γ) distribution of rates for sites31, leading to models such as JC69 + Γ, HKY85 + Γ or GTR + Γ.

Figure 1 | Markov models of nucleotide substitution. [Panels: JC69, K80 and HKY85.] The thickness of the arrows indicates the substitution rates of the four nucleotides (T, C, A and G), and the sizes of the circles represent the nucleotide frequencies when the substitution process is in equilibrium. Note that both JC69 and K80 predict equal proportions of the four nucleotides.

Distance matrix methods. After the distances have been calculated, the sequence alignment is no longer used in distance matrix methods. Here we mention three such methods: least squares, minimum evolution and neighbour joining. The least squares method32 (see also REF. 33) minimizes a measure of the differences between the calculated distances (d_ij) in the distance matrix and the expected distances (d̂_ij) on the tree (that is, the sum of branch lengths on the tree linking the two species i and j):

$$Q = \sum_{i=1}^{s} \sum_{j=1}^{s} (\hat{d}_{ij} - d_{ij})^2 \qquad (1)$$

This is the same least squares method used in statistics for fitting a straight line y = a + bx to a scatter plot. Optimizing branch lengths (or d̂_ij) leads to the score Q for the given tree, and the tree with the smallest score is the least squares estimate of the true tree.

The minimum evolution method34,35 uses the tree length (which is the sum of branch lengths) instead of Q for tree selection, even though the branch lengths can still be estimated using the least squares criterion. Under the minimum evolution criterion, shorter trees are more likely to be correct than longer trees are.

The most widely used distance method is neighbour joining25. This is a cluster algorithm and operates by starting with a star tree and successively choosing a pair of taxa to join together (based on the taxon distances), until a fully resolved tree is obtained. The taxa to be joined are chosen in order to minimize an estimate of tree length36. The two joined taxa (for example, species 1 and 2 in FIG. 2) are then represented by their ancestor (for example, node y in FIG. 2), and the number of taxa that are connected to the root (node x in FIG. 2) is reduced by one (FIG. 2). The distance matrix is then updated with the joined taxa replacing the two original taxa. See REF. 36 for a discussion of the neighbour joining updating formula.
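The two stages described above, distance calculation and pair selection, can be sketched in a few lines. This is only an illustration under stated assumptions: the JC69 distance uses the standard correction d = –(3/4) ln(1 – 4p/3), and the pair to join is chosen with the widely used criterion Q(i, j) = (n – 2)d_ij – r_i – r_j, where r_i is the sum of distances from taxon i. The text only says that the pair minimizes an estimate of tree length, so this particular criterion, the function names and the example matrix are assumptions of the sketch, not the authors' implementation.

```python
import math


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """JC69 distance d = -(3/4) * ln(1 - 4p/3), with p the proportion of differing sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption of this sketch: sites with gaps or ambiguity codes are skipped.
    sites = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in sites) / len(sites)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def nj_pair_to_join(dist):
    """Return the index pair (i, j) minimizing (n - 2) * d_ij - r_i - r_j."""
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three taxa")
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, math.inf
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair


# Hypothetical five-taxon distance matrix, used only for illustration.
EXAMPLE = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

if __name__ == "__main__":
    print(jc69_distance("ACGTACGTAC", "ACGTACGTAA"))
    print(nj_pair_to_join(EXAMPLE))
```

In the example matrix the first pair joined is taxa 0 and 1, even though taxa 3 and 4 have the smallest raw distance. This illustrates that neighbour joining does not simply cluster the closest pair.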
An efficient implementation of neighbour joining is found in the program MEGA37 (TABLE 1).

Unrooted trees: Phylogenetic trees for which the location of the root is unspecified.

Strengths and weaknesses of distance methods. One advantage of distance methods (especially of neighbour joining) is their computational efficiency. The cluster algorithm is fast because it does not need to compare as many trees under an optimality criterion as maximum parsimony and maximum likelihood do. For this reason, neighbour joining is useful for analysing large data sets that have low levels of sequence divergence. Note that it might be important to use a realistic substitution model to calculate the pairwise distances. Distance methods can perform poorly for very divergent sequences because large distances involve large sampling errors, and most distance methods (such as neighbour joining) do not account for the high variances of large distance estimates. Distance methods are also sensitive to gaps in the sequence alignment38.

Maximum parsimony

Parsimony tree score. The maximum parsimony method minimizes the number of changes on a phylogenetic tree by assigning character states to interior nodes on the tree. The character (or site) length is the minimum number of changes required for that site, whereas the tree score is the sum of character lengths over all sites. The maximum parsimony tree is the tree that minimizes the tree score.

Some sites are not useful for tree comparison by parsimony. For example, constant sites, for which the same nucleotide occurs in all species, have a character length of zero on any tree. Singleton sites, at which only one of the species has a distinct nucleotide, whereas all others are the same, can also be ignored, as the character length is always one. The parsimony-informative sites are those at which at least two distinct characters are observed, each at least twice.
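The definitions of character length, tree score and parsimony-informative sites can be made concrete with a small sketch. It computes the character length with a Fitch-style set-intersection pass over a binary tree written as nested tuples of tip indices. The representation, the function names and the toy alignment are assumptions made for illustration, and an arbitrary root is used because the parsimony score does not depend on where the tree is rooted.

```python
from collections import Counter


def fitch(tree, site):
    """Return (state set, minimum number of changes) for one site on a binary tree."""
    if isinstance(tree, int):  # a tip: its observed state, no changes
        return {site[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, site)
    right_states, right_changes = fitch(right, site)
    shared = left_states & right_states
    if shared:  # children can agree: no extra change at this node
        return shared, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def tree_score(tree, alignment):
    """Sum of character lengths over all sites of the alignment."""
    n_sites = len(alignment[0])
    return sum(fitch(tree, [seq[k] for seq in alignment])[1] for k in range(n_sites))


def is_informative(site):
    """True if at least two distinct states each occur at least twice."""
    counts = Counter(site)
    return sum(1 for c in counts.values() if c >= 2) >= 2


# Hypothetical alignment of four species (rows) and five sites (columns):
# constant, singleton, xxyy, xyxy, xxyy.
ALIGNMENT = ["ACAAT", "AAAGT", "AAGAC", "AAGGC"]
T1 = ((0, 1), (2, 3))
T2 = ((0, 2), (1, 3))
T3 = ((0, 3), (1, 2))

if __name__ == "__main__":
    for name, tree in (("T1", T1), ("T2", T2), ("T3", T3)):
        print(name, tree_score(tree, ALIGNMENT))
    print([is_informative([seq[k] for seq in ALIGNMENT]) for k in range(5)])
```

With this toy alignment the three trees score 5, 6 and 7 changes, so T1 is the maximum parsimony tree. The constant and singleton sites contribute the same amount to every tree, which is why they can be ignored when comparing trees.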
For four species, only three site patterns are informative: xxyy, xyxy and xyyx, where x and y are any two distinct nucleotides. There are three possible unrooted trees for four species, and which of them is the maximum parsimony tree depends on which of the three site patterns occurs most often in the alignment.

An algorithm for finding the minimum number of changes on a binary tree (and for reconstructing the ancestral states to achieve the minimum) was developed by Fitch39 and Hartigan40. PAUP41, MEGA37 and TNT42 are commonly used parsimony programs.

Parsimony was originally developed for use in analysing discrete morphological characters. During the late 1970s, it began to be applied to molecular data. A controversy arose concerning whether parsimony (without explicit assumptions) or likelihood (with an explicit evolutionary model) was a better method for phylogenetic analysis23. The controversy has subsided, and the importance of model-based inference methods is broadly recognized. The use of parsimony is still common: not because it is believed to be assumption-free, but because it often produces reasonable results and is computationally efficient.

Figure 2 | The neighbour joining algorithm. The neighbour joining algorithm is a divisive cluster algorithm. It starts from a star tree: two nodes are then joined together on this tree (in this example, nodes 1 and 2), reducing the number of nodes at the root (node x) by one. The process is repeated until a fully resolved tree is generated.

Strengths and weaknesses of parsimony. A strength of parsimony is its simplicity; it is easy to describe and to understand, and it is amenable to rigorous mathematical analysis. The simplicity also helps in the development of efficient computer algorithms.

A major weakness of parsimony is its lack of explicit assumptions, which makes it nearly impossible to incorporate any knowledge of the process of sequence evolution in tree reconstruction. The failure of parsimony to correct for multiple substitutions at the same site makes it suffer from a problem known as long-branch attraction43. If the correct tree (T1 in FIG. 3a) has two long external branches separated by a short internal branch, parsimony tends to infer the incorrect tree (T2 in FIG. 3b), and the long branches are grouped together. When the branch lengths in T1 are extreme enough, the probability for site pattern xxyy, which supports the correct tree T1, may be smaller than that for xyxy, which supports the incorrect tree T2. Thus, the more sites there are in the sequence, the more probable it is for the pattern xxyy to be observed at fewer sites than xyxy, and the more certain that the incorrect tree T2 will be chosen to be the maximum parsimony tree. Parsimony thus converges to a wrong tree and is statistically inconsistent. Long-branch attraction has been demonstrated in many real and simulated data sets44 and is due to the failure of parsimony to correct for multiple changes at the same site or to accommodate parallel changes on the two long branches. See REFS 24,45 for more discussions of the issue.

Note that model-based methods (namely, distance, likelihood and Bayesian methods) also suffer from long-branch attraction if the assumed model is too simplistic and ignores among-site rate variation46. In the reconstruction of deep phylogenies, long-branch attraction (as well as unequal nucleotide or amino acid frequencies among species) is an important source of systematic error47,48 (FIG. 3c,d). In such analyses, it is advisable to use realistic substitution models and likelihood or Bayesian methodologies.
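The site-pattern argument behind long-branch attraction can be checked numerically. The sketch below computes the probabilities of the three informative patterns on the four-species tree ((1,2),(3,4)) by summing over the states at the two interior nodes. It assumes the JC69 model purely for convenience, and the branch lengths (long branches leading to species 1 and 3) are hypothetical values chosen to be extreme; neither is taken from the article.

```python
import math
from itertools import product

BASES = "TCAG"


def jc69_prob(x: str, y: str, b: float) -> float:
    """JC69 probability of going from x to y along a branch of length b (substitutions per site)."""
    e = math.exp(-4.0 * b / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def pattern_probs(b1, b2, b3, b4, b0):
    """Probabilities of patterns xxyy, xyxy and xyyx on tree ((1,2),(3,4)) with internal branch b0."""
    probs = {"xxyy": 0.0, "xyxy": 0.0, "xyyx": 0.0}
    for x1, x2, x3, x4 in product(BASES, repeat=4):
        if x1 == x2 and x3 == x4 and x1 != x3:
            key = "xxyy"
        elif x1 == x3 and x2 == x4 and x1 != x2:
            key = "xyxy"
        elif x1 == x4 and x2 == x3 and x1 != x2:
            key = "xyyx"
        else:
            continue  # constant, singleton or three-or-more-state site
        # Sum over the unobserved states u and v at the two interior nodes.
        for u, v in product(BASES, repeat=2):
            probs[key] += (0.25 * jc69_prob(u, x1, b1) * jc69_prob(u, x2, b2)
                           * jc69_prob(u, v, b0)
                           * jc69_prob(v, x3, b3) * jc69_prob(v, x4, b4))
    return probs


if __name__ == "__main__":
    # Hypothetical branch lengths: two long branches separated by a short internal branch.
    print(pattern_probs(1.5, 0.05, 1.5, 0.05, 0.05))
    # All branches equal and short, for comparison.
    print(pattern_probs(0.1, 0.1, 0.1, 0.1, 0.1))
```

With the extreme branch lengths, pattern xyxy (which groups the two long branches) is more probable than xxyy (which supports the true tree), so adding sites makes parsimony more certain of the wrong tree. With equal short branches the ordering is reversed.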
Dense taxon sampling to break long branches and removing fast-evolving proteins or sites can also be helpful.

Long-branch attraction: The phenomenon of inferring an incorrect tree with long branches grouped together by parsimony or by model-based methods under simplistic models.

Maximum likelihood

Basis of maximum likelihood. Maxi

