Lecture 18 Molecular Evolution And Phylogenetics

2y ago
4 Views
3 Downloads
2.10 MB
78 Pages
Last View : 2m ago
Last Download : 3m ago
Upload by : Julius Prosser
Transcription

6.047/6.878 - Computational Biology: Genomes, Networks, EvolutionPatrick Winston’s 6.034Lecture 18Molecular Evolution and PhylogeneticsSomewhere, something went wrong 1

Challenges in Computational Biology4 Genome Assembly5 Regulatory motif discovery1 Gene FindingDNA2 Sequence alignment6 Comparative Genomics7 Evolutionary TT3 Database lookupGene expression analysisRNA transcript9 Cluster discovery 10 Gibbs sampling11 Protein network analysis12 Metabolic modelling13 Emerging network properties2

Concepts of Darwinian EvolutionSelectionImage in the public domain.Courtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 20143

Concepts of Darwinian EvolutionImage in the public domain.Charles Darwin 1859. Origin of Species [one and onlyillustration]: "descent with modification"Courtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 20144

Tree of Life Neal Olander. All rights reserved. This content is excluded from ourCreative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.Image in the public domain.5

Goals for today: Phylogenetics Basics of phylogeny: Introduction and definitions– Characters, traits, nodes, branches, lineages, topology, lengths– Gene trees, species trees, cladograms, chronograms, phylograms1. From alignments to distances: Modeling sequence evolution– Turning pairwise sequence alignment data into pairwise distances– Probabilistic models of divergence: Jukes Cantor/Kimura/hierarchy2. From distances to trees: Tree-building algorithms– Tree types: Ultrametric, Additive, General Distances– Algorithms: UPGMA, Neighbor Joining, guarantees and limitations– Optimality: Least-squared error, minimum evolution (require search)3. From alignments to trees: Alignment scoring given a tree– Parsimony: greedy (union/intersection) vs. DP (summing cost)– ML/MAP (includes back-mutations, lengths): peeling algorithm (DP)4. Tree of Life in Genomic Era– The prokaryotic problem (no real taxa and HGT)– Interpreting the forest of life6

Introduction: Basics and DefinitionsCharacters, traits, gene/species trees7

Common Phylogenetic Tree TerminologyBranches orLineagesTerminal NodesABCRepresent theTAXA (genes,populations,species, etc.)used to inferthe phylogenyDAncestral Nodeor ROOT ofthe TreeInternal Nodes orDivergence Points(represent hypotheticalancestors of the taxa)E8

Extinctions part of lifePhylogenetic tree showing archosaurs, dinosaurs, birds, etc. through geologic time removed due to copyright restrictions.9

PhylogeneticsGeneral Problem:Infer complete ancestry ofa set of ‘objects’ based onknowledge of their ‘traits’Mammal family tree removed due to copyright restrictions.‘Objects’ can be: Species,Genes, Cell types, Diseases,Cancers, Languages, Faiths,Cars, Architectural Styles‘Traits’ can be: Morphological, molecular,gene expression, TF binding, motifs, words Historical record varies: Fossils, imprints,timing of geological events, ‘living fossils’,sequencing of extinct species, paintings, stories.Today: Phylogenies using only extant species data gene trees (paralog / ortholog / homolog trees)10

Inferring Phylogenies: Traits and CharactersTrees can be inferred by several criteria:– Traditional traits: Morphology data Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.– Modern traits: Molecular AGCAAACGACCTGTGACGTAGCAAACGA11

From physiological traits to DNA characters Traditional phylogenetics– Building species trees– Small number of traits Hoofs, nails, teeth, horns– Well-behaved traits, each arose once Parsimony principle, Occam’s razor Modern phylogenetics– Building gene trees and species trees– Very large number of traits Every DNA base and every protein residue– Frequently ill-behaved traits Back-mutations are frequent (convergent evolution) Small number of letters, arise many times independently12

Three types of treesCladogramChronogramTaxon BTaxon BTaxon CTaxon CTaxon ATaxon ATaxon DTaxon DTopology onlyt1 t2 t3Topology Divergence timesPhylogram131561Taxon BTaxon CTaxon ATaxon DTopology Divergence times Divergence rates13

Inferring a tree from quence data:-Nucleotide alignments-Peptide alignmentsEvolutionary historyrepresented as abinary tree Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.14

Two basic approaches for phylogenetic inferenceDistance based12From SequencesTo DistancesSequence alignmentTree buildingalgorithmsPair-wise distance matrixOutput treeCharacter based3From alignmentsTo phylogeniesSequence alignmentCouple totree proposaland scoring4Output tree Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.15

Goals for today: Phylogenetics Basics of phylogeny: Introduction and definitions– Characters, traits, nodes, branches, lineages, topology, lengths– Gene trees, species trees, cladograms, chronograms, phylograms1. From alignments to distances: Modeling sequence evolution1– Turning pairwise sequence alignment data into pairwise distances– Probabilistic models of divergence: Jukes Cantor/Kimura/hierarchy2. From distances to trees: Tree-building algorithms2– Tree types: Ultrametric, Additive, General Distances– Algorithms: UPGMA, Neighbor Joining, guarantees and limitations– Optimality: Least-squared error, minimum evolution (require search)3. From alignments to trees: Alignment scoring given a tree3 – Parsimony: greedy (union/intersection) vs. DP (summing cost)– ML/MAP (includes back-mutations, lengths): peeling algorithm (DP)4. Tree of Life in Genomic Era4– The prokaryotic problem (no real taxa and HGT)– Interpreting the forest of life16

1. From alignments to distancesModeling evolutionary ratesDistance estimation Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.17

Measuring evolutionary rates Nucleotide divergence– Uniform rate. Overall percent identity. Transitions and transversions– Two-parameter model. A-G, C-T more frequent. Synonymous and non-synonymous substitutions– Ka/Ks rates. Amino-acid changing substitutions Nactual mutations N observed substitutions– Some fraction of “conserved” positions mutated twice.6.1A.1.1C.2GTAC.2T.6.1G18

‘Evolving’ a nucleotide under random model At time step 0, start with letter A At time step 1:.7– Remain A with probability 0.7– Change to C,G,T with prob. 0.1 each.1A.1 At time step 2:G.1.1.1– In state A with probability 0.52 Remain A with probability 0.7 * 0.7 Go back to A from C,G,T with 0.1*0.1 each.7C.1.7T.7– In states C,G,T with prob. 0.16 eacht 1t 2t 3t 4t 0.1960.2176T00.10.160.1960.217619

Modeling Nucleotide EvolutionDuring infinitesimal time t, there is not enough time for twosubstitutions to happen on the same nucleotideSo we can estimate P(x y, t), for x, y {A, C, G, T}Then letS( t) P(A A, t) P(A T, t) P(T A, t) P(T T, t)20

Modeling Nucleotide EvolutionReasonable assumption: multiplicative(implying a stationary Markov process)S(t t’) S(t)S(t’)That is, P(x y, t t’) z P(x z, t) P(z y, t’)Jukes-Cantor: constant rate of evolutionFor short time , S( ) 1 - 3 1 - 3 1 - 3 1 - 3 21

Modeling Nucleotide Evolution Jukes-Cantor:AFor longer times,r(t)S(t) s(t)s(t)s(t)s(t)r(t)s(t)s(t) s(t)s(t)r(t)s(t)s(t)s(t)s(t)r(t)Where we can derive:G C1-3 A 3 T1- otherr(t) ¼ (1 3 e-4 t)s(t) ¼ (1 – e-4 t)Geometric asymptote to 1/422

Modeling Nucleotide EvolutionKimura:Transitions: A/G, C/TTransversions: A/T, A/C, G/T, C/GTransitions (rate ) are much more likely than transversions (rate )AS(t) )r(t)s(t)Tu(t)u(t)s(t)r(t)s(t) ¼ (1 – e-4 t)u(t) ¼ (1 e-4 t – e-2( )t)r(t) 1 – 2s(t) – u(t)23

Distance between two sequencesGiven (well-aligned portion of) sequences xi, xj,Definedij distance between the two sequencesOne possible definition:dij fraction f of sites u where xi[u] xj[u]Better model (Jukes-Cantor):dij - ¾ log(1 – 4f / 3)r(t) ¼ (1 3 e-4 t)s(t) ¼ (1 – e-4 t)Observed F [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])ActualD [0.11, 0.23, 0.38, 0.57, 0.82, 1.21, 2.03]24

Many nucleotide models have been developedVarying levels of complexity (parameters) Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.Models also exist for peptides and codons25

Goals for today: Phylogenetics Basics of phylogeny: Introduction and definitions– Characters, traits, nodes, branches, lineages, topology, lengths– Gene trees, species trees, cladograms, chronograms, phylograms1. From alignments to distances: Modeling sequence evolution– Turning pairwise sequence alignment data into pairwise distances– Probabilistic models of divergence: Jukes Cantor/Kimura/hierarchy2. From distances to trees: Tree-building algorithms– Tree types: Ultrametric, Additive, General Distances– Algorithms: UPGMA, Neighbor Joining, guarantees and limitations– Optimality: Least-squared error, minimum evolution (require search)3. From alignments to trees: Alignment scoring given a tree– Parsimony: greedy (union/intersection) vs. DP (summing cost)– ML/MAP (includes back-mutations, lengths): peeling algorithm (DP)4. Tree of Life in Genomic Era– The prokaryotic problem (no real taxa and HGT)– Interpreting the forest of life26

2. Distance-based tree-building algorithmsMapping a distance matrix to a treeUPGMA, NJ,LSE, ME Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.27

Distance matrix Phylogenetic .z.x.cr.y.z.x.cd.c0Tree impliesa distance matrixMijdxzyDogcCathHumanmMouserRatMap distances Dijto a treemin ij (Dij-Mij)2Goal:Minimize discrepancy between observed distances and tree-based distances28

Ultrametric distances & 3 Point Condition For all points i, j, k– two distances are equal and third is smallerd(i,j) d(i,k) d(j,k)a a a b a biabjakwhere a b Result:– All paths from leaves are equidistant to the root– Rooted tree with uniform rates of evolution29

Ultrametric treesABCA B C0 6 66 0 46 4 0ABCA033B302C320For now imagine that theseSymmetric 0-diagonalare just the number of substitutions matrix of divergencebetween pairs:timesA:GCCCAACTAB:GTTTCCCTCTaken from Ran Libeskind-Hadas, Lecture Slides, Fall, 201330

Ultrametric Trees Given a symmetric n x n 0-diagonal matrix D, an ultrametric tree T forthat matrix is one in which:– There are n leaves, one for each row (column) of D– Each internal node is labeled by a time in D and has exactly twochildren– Along any path from the root to a leaf, the (divergence) times at theinternal nodes strictly decrease– For any two leaves i, j of T, the LCA of i, j is labeled with time D(i, j)3ABCA0332B302C320ABCTaken from Ran Libeskind-Hadas, Lecture Slides, Fall, 201331

Ultrametric Matrix ConstructionAA 0B 5C 2D 5E 7B50537C25057D53507E77770AA 0B 5C 2D 5E 7B50437C24057D53507E77770 Algorithms exist for “ultrametrifying” matrices.Taken from Ran Libeskind-Hadas, Lecture Slides, Fall, 201332

Minimum Spanning Tree (MST) There is a unique path between any two vertices in aspanning tree Adding an edge to a spanning tree creates a cycle Any edge on that cycle can be removed and we’ll stillhave a spanning tree MST is found using Prim’s Algorithm (graph traversal)Taken from Ran Libeskind-Hadas, Lecture Slides, Fall, 201333

The “Ultrametrification” AlgorithmGiven n x n symmetric 0-diagonal matrix D that is notultrametric1.Construct a completely connected graph with nvertices, one per row of A. The edge weight fromvertex i to vertex j is D(i, j).2.Find a minimum spanning tree (MST) of this graph.3.Build a new matrix D’ such that D’(i, j) is the largestweight on the unique path from i to j in the MST.Taken from Ran Libeskind-Hadas, Lecture Slides, Fall, 201334

Distances: (b) Additive distances All distances satisfy the four-point condition– Any quartet can be labeled i,j,k,l such that: d(i,j) d(k,l) d(i,k) d(j,l) d(i,l) d(j,k) (a b) (c d) (a m c) (b m d) (a m d) (b m c)ikamjcbdl Result:– All pairwise distances obtained by traversing a tree35

Distances: (c) General distances In practice, a distance matrix is neither ultrametric nor additive– Noise Measured distances are not exact Evolutionary model is not exact– Fluctuations Regions used to measure distances not representative of the speciestree Gene replacement (gene conversion), lateral transfer Varying rates of mutation can lead to discrepancies In the general case, tree-building algorithms must handle noisydistance matrices– Such a tree can be obtained by Enumeration and scoring of all trees (too expensive) Neighbor-Joining (typically gives a good tree) UPGMA (typically gives a poor tree)36

Algorithms: (a) UPGMA (aka Hierarchical Clustering)(Unweighted Pair Group Method with Arithmetic mean)Initialization:Assign each xi into its own cluster CiDefine one leaf per sequence, height 014Iteration:Find two clusters Ci, Cj s.t. dij is minLet Ck Ci CjDefine node connecting Ci, Cj,& place it at height dij/2Delete Ci, Cj352Termination:When two clusters i, j remain,place root at height dij/21423537

Ultrametric Distances & UPGMA14235UPGMA is guaranteed to build the correct tree if distance isultrametricProof:1.2.The tree topology is unique, given that the tree is binaryUPGMA constructs a tree obeying the pairwise distances38

Weakness of UPGMAMolecular clock assumption:implies time is constant for all speciesHowever, certain species (e.g., mouse, rat) evolve much fasterExample where UPGMA messes up:UPGMACorrect tree2134142339

Algorithms: (b) Neighbor-Joining Guaranteed to produce the correct tree if distance is additive May produce a good tree even when distance is not additiveStep 1: Finding neighboring leaves1Define0.10.10.1Dij dij – (ri rj)Where1ri ––––– k dik L - 230.420.44Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighborsProof: Beyond the scope of this lecture (Durbin book, p. 189)40

Algorithm: Neighbor-joiningInitialization:Define T to be the set of leaf nodes, one per sequenceLet L TIteration:Pick i, j s.t. Dij is minimalDefine a new node k, and set dkm ½ (dim djm – dij) for all m LAdd k to T, with edges of lengths dik ½ (dij ri – rj)Remove i, j from L;Add k to LTermination:When L consists of two nodes, i, j, and the edge between them of length dij41

Algorithms: (c) Distance-fitting algoriths With distance-based algorithms, we can also aim todirectly minimize discrepancy between originaldistance matrix and tree-based distance matrixCOMPUTATIONAL METHODDistances CharactersDATA TYPEOptimality criterion Clustering algorithmPARSIMONYMAXIMUM LIKELIHOODMINIMUM EVOLUTIONUPGMALEAST SQUARESNEIGHBOR-JOINING42

Distance matrix Phylogenetic .z.x.cr.y.z.x.cd.c0Tree impliesa distance matrixMijdxzyDogcCathHumanmMouserRatMap distances Dijto a treemin ij (Dij-Mij)2Goal:Minimize discrepancy between observed distances and tree-based distances43

Aside: Alternative to Molecular clock?Divergence between orthologous sequences is proportional to timeseparating the species.Different genes evolve at specific, roughly constant rates.Zuckerkandl & Pauling 1962ratedistancesamplingerrordivergence timeCourtesy of Yuri Wolf; slide in the public domain.timeTaken from Yuri Wolf, Lecture Slides, Feb. 201444

Molecular ClockUnder MC all individual gene trees are ultrametric (up to a sampling error)and identical to the species tree up to a scaling factor (evolution rate).Are these really ultrametric?speciestreegene 1ABCDEFGHABCDEFGHtimegene 2distanceABCDEFGHdistanceCourtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 201445

Molecular ClockMost of the real phylogenetic trees are far from being ultrametric.Molecular clock is substantially overdispersed.rateobservedidealexpected basedon sampling errortime0.2Courtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 201446

Relaxed Molecular ClockRelaxed molecular clock models allows for rate variation.Rates are sampled from prior distributions with limited variance,independently or in autocorrelated manner.rateGenes are either analyzed individually, or as concatenated alignments(implying evolution as a single unit).timeCourtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 201447

Universal PacemakerUniversal Pacemaker model assumes that evolutionary time runs atdifferent pace in each lineage.Under the UPM, species trees are intrinsically Courtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 201448

Pacemaker vs ClockBoth overdispersed MC and UPM models predict that individual genetrees would deviate from ultrametricity.Under MC these deviations are expected to be uncorrelated.Under UPM these deviations are expected to be correlated, so thereexists a non-ultrametric pacemaker tree that can significantly reducevariance of observed rates.A testable hypothesis!2,300 trees of 100 prokaryotic species;7,000 trees of 6 Drosophila species1,000 trees of 9 yeast species5,700 trees of 8 mammalian speciesCourtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 201449

Pacemaker vs Clock2,300 trees of 100 prokaryotic species;7,000 trees of 6 Drosophila species1,000 trees of 9 yeast species5,700 trees of 8 mammalian speciesAll show an overwhelming support to UPM model.Snir 2012; work in progress at NCBI (NIH)Courtesy of Yuri Wolf; slide in the public domain.Taken from Yuri Wolf, Lecture Slides, Feb. 201450

Goals for today: Phylogenetics Basics of phylogeny: Introduction and definitions– Characters, traits, nodes, branches, lineages, topology, lengths– Gene trees, species trees, cladograms, chronograms, phylograms1. From alignments to distances: Modeling sequence evolution– Turning pairwise sequence alignment data into pairwise distances– Probabilistic models of divergence: Jukes Cantor/Kimura/hierarchy2. From distances to trees: Tree-building algorithms– Tree types: Ultrametric, Additive, General Distances– Algorithms: UPGMA, Neighbor Joining, guarantees and limitations– Optimality: Least-squared error, minimum evolution (require search)3. From alignments to trees: Alignment scoring given a tree– Parsimony: greedy (union/intersection) vs. DP (summing cost)– ML/MAP (includes back-mutations, lengths): peeling algorithm (DP)4. Tree of Life in Genomic Era– The prokaryotic problem (no real taxa and HGT)– Interpreting the forest of life51

3. Character-based tree-scoring algorithms3a: Parsimony (set-based)3b: Parsimony (Dyn. Prog.)3c: Maximum Likelihood Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.52

Basic algorithms of phylogenetic methodsDistance based12From SequencesTo DistancesSequence alignmentTree buildingalgorithmsPair-wise distance matrixOutput treeCharacter based3From alignmentsTo phylogeniesSequence alignmentCouple totree proposaland scoring4Output tree Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.53

Character-based phylogenetic inference Really about tree scoring techniques, not treefinding techniques– Couple them with tree proposal and update and youhave an algorithm (part 4 of the lecture) Two approaches exist, all use same architecture:– Minimize events: Parsimony (union/intersection)– Probabilistic: Max Likelihood / MAP54

Parsimony scoring (a): Union and intersectionGiven a tree, and an alignment columnLabel internal nodes to minimize thenumber of required substitutions{A, B}C 1Initialization:Set cost C 0; k 2N – 1{A}Iteration:If k is a leaf, set Rk { xk[u] }{A, B}If k is not a leaf,Let i, j be the daughter nodes;Set Rk Ri Rj if intersection isnonemptySet Rk Ri Rj, and C 1, ifintersection is emptyTermination:Minimal cost of tree for column u, CC 1A{A}B{B}A{A}B{B}55

Parsimony traceback to find ancestral nucleotidesTraceback:1. Choose an arbitrary nucleotide from R2N – 1 for the root2. Having chosen nucleotide r for parent k,If r Ri choose r for daughter iElse, choose arbitrary nucleotide from RiEasy to see that this traceback produces some assignment of cost CStill optimal, butnot found by tracebackAccessible to traceback{A, B}x{A}B A{B} {A}AA{A, B}A{A}BAB{B}AAAxBABABxxxBBABAxBBAB56

Parsimony Scoring (b): Dynamic 314 Each cell (N,C) represents themin cost of the subtree rooted atN, if the label at N is C. Update table by walking up thetree from the leaves to the root,remembering max choices. Traceback from root to leaves toconstruct a min cost assignmentB3B2B1AMouseGAGRatHumanDog57

Goals for today: Phylogenetics Basics of phylogeny: Introduction and definitions– Characters, traits, nodes, branches, lineages, topology, lengths– Gene trees, species trees, cladograms, chronograms, phylograms1. From alignments to distances: Modeling sequence evolution– Turning pairwise sequence alignment data into pairwise distances– Probabilistic models of divergence: Jukes Cantor/Kimura/hierarchy2. From distances to trees: Tree-building algorithms– Tree types: Ultrametric, Additive, General Distances– Algorithms: UPGMA, Neighbor Joining, guarantees and limitations– Optimality: Least-squared error, minimum evolution (require search)3. From alignments to trees: Alignment scoring given a tree– Parsimony: greedy (union/intersection) vs. DP (summing cost)– ML/MAP (includes back-mutations, lengths): peeling algorithm (DP)4. Tree of Life in Genomic Era– The prokaryotic problem (no real taxa and HGT)– Interpreting the forest of life58

Scoring (c) Maximum Likelihood & Max-a-PosterioriInput: Sequence alignmentOutput: tree with maximum likelihood / max a posteriori prob.Search: Heuristic search for max likelihood tree.Maximum Likelihood (ML)B ,T argmaxB,T P(D B,T)D seq. alignment dataB branch lengthsT topologyMaximum a Posteriori (MAP)B ,T argmaxB,T P(B,T D) argmaxB,T P(B,T,D) / P(D) argmaxB,T P(B,T,D) argmaxB,T P(D B,T)P(B,T)likelihoodlikelihoodP(D B,T) is the likelihood of data given model Computerecursively Use seq evolution model: JC,K2P,HKY.using DPP(B,T) is a prior on trees/branch lengths Use Yule process, Birth-Death process to model59

‘Peeling’ algorithm for P(D B,T) term1. Assume sites j evolve independently. Treat each column of the alignment in isolation2. Assume branch independence, conditioned on parent Expand total joint probability into prod of P(xi xparent,ti) Only P(x2n-1) remains, root prior, background nucl. freq.3. We know how to compute P(xi xparent(i),ti) for fixed pair Defined by our sequence model (JC, K2P, HKY, etc) Easily calculate for any given assignment of internal nodes4. As internal node values are not known marginalize Sum over all possible values of all internal/root nodes Let xn 1, ,x2n-1 represent seqs of n-1 internal nodes60

1. Site evolution over single branchRemember: Jukes-Cantor (JC)JC is a Continuous-TimeMarkov Chain (CTMC) Defines instantaneousrates of transition betweenstates (bases) GA CDiscrete MC version Given time t, we define a discrete MC with transitionmatrix is S(t), also called a substitution probability matrix. Gives the probability of seeing base a given initial base bafter duration time t. T“A”Use JC to define single site evolution:tP(a “C” b “A”, t) S(t)ba“C”61

2. Sequence evolution over single branch Assume site independence– P(xi xk, ti) Πj P(b xij a xkj, ti)Use product to define sequence evolution:xk “AAACTG”tiP(xi xk, ti)xi “CAAGTC”62

3. Sequence evolution over entire tree Assume branch independence– P(x1, xn, , x2n-1 T, t) P(x2n-1)Πi P(xi xparent(i), ti) Assume prior on root sequence, e.g.– P(x2n-1) P(x2n-1,j) (1/4) m for sequence length mUse product and prior to define sequence evolution over tree:x9 “AAACTG”t8x8t6t1x6t2x1x2P(x1, xn, , x2n-1 T, t)t7x7t3t4x3 x4t5x563

4. Integrate (marginalize) over hidden ancestral seqs! Notice, all sequences are needed, both internal nodes and leaves– P(x1, xn, , x2n-1 T, t) But, only leaves are given: x1, xn Therefore, need to marginalize (sum) over unknowns: xn 1, , x2n-1 This looks expensive!x9 “AAACTG”– P(x1, xn T, t) Σxn 1, , Σ x2n-1 P(x1, xn, , x2n-1 T, t)t8 Don’t worry, dynamic programmingcan do it efficiently.x8t6t1x6t2x1t7x2x7t3t4x3 x4t5x564

Basic trick to efficient marginalizationApply factorization trick to every internal node in the tree.x7t5t6x6x5t1x1t2t3x2 x3t4x4P(x1,x2,x3,x4 T, t) Σx5Σx6Σx7 P(x1,x2,x3,x4,x5,x6,x7 T, t) Σx5Σx6Σx7 P(x1 x5,t1) P(x2 x5,t1)P(x3 x6,t3) P(x4 x6,t4)P(x5 x7,t5) P(x6 x7,t6) P(x7) Σx7 P(x7)[Σx5 P(x5 x7,t5) P(x1 x5,t1) P(x2 x5,t1)][Σx6 P(x6 x7,t6) P(x3 x6,t3) P(x4 x6,t4)]Peeling algorithm L(i,j,a) is the DP table. Each entry contains the probabilityof seeing the leaf data below node i,given that node i has base a at site j. The leaves of the table are initializedbased on the observed sequence.Entries populated in post-order traversal. Runtime: O(2n * k 2)65

Use DP to compute argmax P(D B,T) efficientlyLi[a]Char a at node itleftb at jLj[b]trightc at jLk[c] If we know the branch lengths tleft & tright. And we already have the likelihood tablesLj&Lk of left and right subtrees(for each possible ending character at b, c) Fill in likelihood table Li for each char a at iLiLjLkACP(. ‘C’)GTLi[a] Σb {ACGT} Σc {ACGT} ( P(b a,tleft)*Lleft[b] * P(c a,tright)*Lright[c]Prob(a b))Prob(a c)66

Initialization and TerminationRootInternal Nodesn 321A00100C10001G00000T010102n-1 ijk Leavesn 1 Characters at the leaves are already known– Their likelihood is 1 or 0, indicating the known char Fill in internal node likelihood vectors iteratively Once we reach the root, multiply by the base freqs Maximization over Topologies and Lengths Numerical: gradient descent, Newton’s method67

Advantages/disadvantages of ML/MAP methods Advantages:– Inherently statistical and evolutionary model-based.– Usually the most ‘consistent’ of the methods available.– Used for both character and rate analyses– Can be used to infer the sequences of the extinct ancestors.– Account for branch-length effects in unbalanced trees.– Nucleotide or amino acid sequences, other types of data. Disadvantages:– Not as intuitive as parsimony (e.g. may choose more eventsif they’re more likely in our probabilistic model)– Computationally intense (Iimits num taxa, sequence length).– Like parsimony, can be fooled by high levels of homoplasy.– Violations of model assumptions can lead to incorrect trees.68

Tree reliability: Bootstrapping1. Re-sample alignments:––Randomly sample alignment columnswith replacementCreate many alignments of equal size.2. Build a phylogenetic treefor each sample3. Repeat (1) and (2) many times–1000s of times4. Output summary tree–––Tree constructed most frequentlyConsensus tree (even if not most freq)Other options5. Report observation frequencyof each branch–Each branch is a binary split Source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.69

Goals for today: Phylogenetics Basics of phylogeny: Introduction and definitions– Characters, traits, nodes, branches, lineages, topology, lengths– Gene trees, species trees, cladograms, chronograms, phylograms1. From alignments to distances: Modeling sequence evolution1– Turning pairwise sequence alignment data into pairwise distances– Probabilistic models of divergence: Jukes Cantor/Kimura/hierarchy2. From distances to trees: Tree-building algorithms2– Tree types: Ultrametric, Additive, General Distances– Algorithms: UPGMA, Neighbor Joining, guarantees and limitations– Optimality: Least-squared error, minimum evolution (require search)3. From alignments to trees: Alignment scoring given a tree3 – Parsimony: greedy (union/intersection) vs. DP (summing cost)– ML/MAP (includes back-mutations, lengths): peeling algorithm (DP)4. Tree of Life in Genomic Era4– The prokaryotic problem (no real taxa and HGT)– Interpreting the forest of life70

Tree of Life in Genomic EraGenomic era – growing frustration with discrepancies between the treesreconstructed for individual genes and heroic efforts to overcome thenoise. Role of horizontal gene transfer in the evolution of prokaryoticgenomes is established.Major lines of approach: gene repertoire and gene orderdistribution of distances between orthologsconcatenated alignments of "non-transferable" gene coresconsensus trees and supertreesCiccarelli 2006. Towards automatic reconstruction of a highlyresolved tree of li

Lecture 18 . Molecular Evolution and Phylogenetics . 6.047/6.878 - Computational Biology: Genomes, Networks,

Related Documents:

Introduction of Chemical Reaction Engineering Introduction about Chemical Engineering 0:31:15 0:31:09. Lecture 14 Lecture 15 Lecture 16 Lecture 17 Lecture 18 Lecture 19 Lecture 20 Lecture 21 Lecture 22 Lecture 23 Lecture 24 Lecture 25 Lecture 26 Lecture 27 Lecture 28 Lecture

The journal Molecular Biology covers a wide range of problems related to molecular, cell, and computational biology, including genomics, proteomics, bioinformatics, molecular virology and immunology, molecular development biology, and molecular evolution. Molecular Biology publishes reviews, mini-reviews, and experimental and theoretical works .

Lecture 1: A Beginner's Guide Lecture 2: Introduction to Programming Lecture 3: Introduction to C, structure of C programming Lecture 4: Elements of C Lecture 5: Variables, Statements, Expressions Lecture 6: Input-Output in C Lecture 7: Formatted Input-Output Lecture 8: Operators Lecture 9: Operators continued

Jan 31, 2011 · the molecular geometries for each chemical species using VSEPR. Below the picture of each molecule write the name of the geometry (e. g. linear, trigonal planar, etc.). Although you do not need to name the molecular shape for molecules and ions with more than one "central atom", you should be able to indicate the molecular geometryFile Size: 890KBPage Count: 7Explore furtherLab # 13: Molecular Models Quiz- Answer Key - Mr Palermowww.mrpalermo.comAnswer key - CHEMISTRYsiprogram.weebly.comVirtual Molecular Model Kit - Vmols - CheMagicchemagic.orgMolecular Modeling 1 Chem Labchemlab.truman.eduHow to Use a Molecular Model for Learning . - Chemistry Hallchemistryhall.comRecommended to you b

Xiangrun's Molecular sieve Email:info@xradsorbent.com Tel:86-533-3037068 Website: www.aluminaadsorbents.com Molecular sieve Types 3A Molecular sieve 4A Molecular sieve 5A Molecular sieve 13X Molecular sieve PSA Molecular Sieve Activated zeolite powder 3A Activated zeolite powder 4A Activated zeolite powder 5A

Lecture 1: Introduction and Orientation. Lecture 2: Overview of Electronic Materials . Lecture 3: Free electron Fermi gas . Lecture 4: Energy bands . Lecture 5: Carrier Concentration in Semiconductors . Lecture 6: Shallow dopants and Deep -level traps . Lecture 7: Silicon Materials . Lecture 8: Oxidation. Lecture

TOEFL Listening Lecture 35 184 TOEFL Listening Lecture 36 189 TOEFL Listening Lecture 37 194 TOEFL Listening Lecture 38 199 TOEFL Listening Lecture 39 204 TOEFL Listening Lecture 40 209 TOEFL Listening Lecture 41 214 TOEFL Listening Lecture 42 219 TOEFL Listening Lecture 43 225 COPYRIGHT 2016

Partial Di erential Equations MSO-203-B T. Muthukumar tmk@iitk.ac.in November 14, 2019 T. Muthukumar tmk@iitk.ac.in Partial Di erential EquationsMSO-203-B November 14, 2019 1/193 1 First Week Lecture One Lecture Two Lecture Three Lecture Four 2 Second Week Lecture Five Lecture Six 3 Third Week Lecture Seven Lecture Eight 4 Fourth Week Lecture .