Lecture 6: Genome Assembly - MIT OpenCourseWare

2y ago
12 Views
3 Downloads
948.98 KB
85 Pages
Last View : 2d ago
Last Download : 3m ago
Upload by : Matteo Vollmer
Transcription

Lecture 6Genome AssemblyFoundations of Computational Systems BiologyDavid K. Gifford1

- # 7 "#! ' #( " '' ! /Courtesy of Nature Education. Used with permission.Source: Green, Eric D. "Strategies for the Systematic Sequencing of ComplexGenomes." Nature Reviews Genetics 2, no. 8 (2001): 573-83. !'2 5 8 D9 #! . "#! '4 #( " ' % " " 5 ( & )#" 8 92

AssemblyWhole-genome “shotgun” sequencing starts by copying andfragmenting the DNA(“Shotgun” refers to the random fragmentation of the wholegenome; like it was fired from a shotgun)Input: Copy: Fragment: Courtesy of Ben Langmead. Used with terials/3

AssemblyAssume sequencing produces such a large # fragments that almostall genome positions are covered by many fragments.Reconstructthis From these Courtesy of Ben Langmead. Used with terials/4

Assembly.but we don’t know what came from whereReconstructthis From these Courtesy of Ben Langmead. Used with terials/5

AssemblyKey term: coverage. Usually it’s short for average coverage: the averagenumber of reads covering a position in the genome. 177 nucleotides35 nucleotidesAverage coverage 177 / 35 7xCourtesy of Ben Langmead. Used with aterials/

')! )" " #, & ' ' 8 " & ( &! "9 1 ' 0 6 1 / C / C λ ( # 3– 28) λ3 C λ & # – 6 " # C5 7λ– 6 C 5 7λ7

' ,'5 #, & #& "#! ' ( ' (' source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.8

AssemblyCoverage could also refer to the number of reads covering a particularposition in the genome: Coverage at this position 6Courtesy of Ben Langmead. Used with aterials/

AssemblySay two reads truly originate from overlapping stretches of thegenome. Why might there be differences? 1. Sequencing error2. Difference between inhereted copies of a chromosomeE.g. humans are diploid; we have two copies of each chromosome,one from mother, one from father. The copies can differ:Read from Mother: Read from Father: Sequence from Mother: Sequence from Father: We’ll mostly ignore ploidy, butreal tools must consider itCourtesy of Ben Langmead. Used with materials/

-# &# ' (# ' #&( & '' ! / , & /# ( #"' "' ' 7 (& " & '' ! &'– #"'(& ( #, & & & ( / &#! & '2 ! " )" & " "( & '3 (& ( #& '' ! /– . ! '4 2 &! & " & 7 ' '' ! &'– #"'(& ( 7! & & &#! & '3 #& " & ' & ' & – & ( " & #& '' ! /11

Assembly alternativesAlternative 1: Overlap-Layout-Consensus (OLC) assemblyAlternative 2: de Bruijn graph (DBG) assemblyOverlapError correctionLayoutde Bruijn graphConsensusRefineScaffoldingCourtesy of Ben Langmead. Used with materials/

Overlap Layout ConsensusOverlapBuild overlap graphLayoutBundle stretches of the overlap graph into contigsConsensusPick most likely nucleotide sequence for each contigCourtesy of Ben Langmead. Used with terials/13

OverlapsFinding all overlaps is like building a directed graph where directededges connect overlapping nodes (reads) Suffix of source issimilar to prefix of sink Courtesy of Ben Langmead. Used with terials/14

Directed graph reviewDirected graph G(V, E) consists of set of vertices, V and set ofdirected edges, EDirected edge is an ordered pair of vertices.First is the source, second is the sink.abVertex is drawn as a circleEdge is drawn as a line with an arrowconnecting two circlesVertex also called node or pointdcV { a, b, c, d }E { (a, b), (a, c), (c, b) }Edge also called arc or lineSourceSinkDirected graph also called digraphCourtesy of Ben Langmead. Used with materials/

Overlap graphBelow: overlap graph, where an overlap is a suffix/prefix matchof at least 3 charactersA vertex is a read, a directed edge is an overlap between suffix ofsource and prefix of sinkVertices (reads): { a: , b: , c: }Edges (overlaps): { (a, b), (b, c) }To keep our presentation uncluttered we will not show thetreatment of read reverse complementsa: 3b: 4c: Courtesy of Ben Langmead. Used with materials/

Overlap graphOverlap graph could contain cycles. A cycle is a path beginningand ending at the same vertex.7a: 3b: 4c: These happen when the DNA stringitself is circular. E.g. bacterialgenomes are often circular;mitochondrial DNA is circular.Cycles could also be due to repetitiveDNA, as we’ll seeCourtesy of Ben Langmead. Used with terials/17

Finding overlapsa: b: c: How do we build the overlap graph?Assume for now an “overlap” is when a suffix of X oflength l exactly matches a prefix of Y, and k is thelength of reads A merged FM index of all reads allows us to match readprefixes and suffixes to discover overlaps. See SGA paper Efficient de novo assembly of large genomes using compressed data structuresJared T Simpson and Richard Durbin Genome Res. 2012. 22: 549-556SGA algorithm excludes redundant (transitive) edgesA - B - C excludes A - CCourtesy of Ben Langmead. Used with terials/18

Finding overlapsEdge label isoverlap lengthExample overlap graph with l 3k 7 Original string: Courtesy of Ben Langmead. Used with terials/19

Overlap Layout ConsensusOverlapBuild overlap graphLayoutBundle stretches of the overlap graph into contigsConsensusPick most likely nucleotide sequence for each contigCourtesy of Ben Langmead. Used with terials/20

Formulating the assembly problemFinding overlaps is important, and we’ll return to it, but our ultimategoal is to recreate (assemble) the genomeHow do we formulate this problem?First attempt: the shortest common superstring (SCS) problemCourtesy of Ben Langmead. Used with terials/21

Shortest common superstringGiven a collection of strings S, find SCS(S): the shortest string thatcontains all strings in S as substringsWithout requirement of “shortest,” it’s easy: just concatenate themExample:S: Concatenation: 24SCS(S): 10 Courtesy of Ben Langmead. Used with terials/22

Shortest common superstringS: SCS(S): Can we solve it?Imagine a modified overlapgraph where each edge hascost - (length of overlap)AABSCS corresponds to a path thatvisits every node once, minimizingtotal cost along pathThat’s the Traveling SalesmanProblem (TSP), which is NP-hard!-2-2-1-1AAA-1ABB-1-2-1BBBCourtesy of Ben Langmead. Used with permission.23 rials/

Shortest common superstringS: SCS(S): Say we disregard edge weights andjust look for a path that visits all thenodes exactly onceThat’s the Hamiltonian Path problem:NP-completeAAB Indeed, it’s well established that SCSis NP-hardAAAABBBBBCourtesy of Ben Langmead. Used with ng-materials/

Shortest common superstringLet’s take the hint give up on finding the shortest possible superstringNon-optimal superstrings can be found with a greedy algorithmAt each step, the greedy algorithm “greedily” chooses longestremaining overlap, merges its source and sinkCourtesy of Ben Langmead. Used with terials/25

Shortest common superstring: greedyGreedy-SCS algorithm in action (l 1):Input strings ! ! ! In red are strings that get! merged before the next round! ! Greedy answer: SuperstringActual SCS: Rounds of merging, one merge per line.Number in first column length of overlap merged before that round.Courtesy of Ben Langmead. Used with materials/

Shortest common superstring: greedyGreedy algorithm is not guaranteed to choose overlaps yielding SCSBut greedy algorithm is a good approximation; i.e. the superstringyielded by the greedy algorithm won’t be more than 2.5 times longerthan true SCS (see Gusfield, Algorithms on Strings, Trees andSequences: Computer Science and Computational Biology, 16.17.1)Courtesy of Ben Langmead. Used with terials/27

Shortest common superstring: greedyGreedy-SCS algorithm in action again (l 3):Input strings % % " SuperstringCourtesy of Ben Langmead. Used with terials/28

Shortest common superstring: greedyAnother setup for Greedy-SCS: assemble all substrings of length 6from string . l 3. # We only got back: (missing a )What happened?Courtesy of Ben Langmead. Used with terials/29

Shortest common superstring: greedyThe overlap graph for that scenario (l 3): Courtesy of Ben Langmead. Used with materials/

Shortest common superstring: greedyThe overlap graph for that scenario (l 3): Total overlap: 39Courtesy of Ben Langmead. Used with permission.31 http://www.langmead-lab.org/teaching-materials/

Shortest common superstring: greedyThe overlap graph for that scenario (l 3): Total overlap: 44 Better butwrong!Courtesy of Ben Langmead. Used with permission. http://www.langmead-lab.org/teaching-materials/32

Shortest common superstring: greedySame example, but increased the substring length from 6 to 8 & & & & & & & & " Got the whole thing: Courtesy of Ben Langmead. Used with terials/33

Shortest common superstring: greedyWhy are substrings of length 8 long enough for Greedy-SCS to figureout there are 3 copies of ? One length-8 substring spans all three sCourtesy of Ben Langmead. Used with terials/34

RepeatsRepeats often foil assembly. They certainly foil SCS, with its“shortest” criterion!Reads might be too short to “resolve” repetitive sequences. This iswhy sequencing vendors try to increase read length.Algorithms that don’t pay attention to repeats (like our greedySCS algorithm) might collapse them collapse The human genome is 50% repetitive!Courtesy of Ben Langmead. Used with terials/35

RepeatsBasic principle: repeats foil assemblyAnother example using Greedy-SCS:Input: Extract every substring of length k, then run Greedy-SCS.Do this for various l (min overlap length) and k.outputl, k3, 5 3, 7 3, 10 3, 13 Courtesy of Ben Langmead. Used with materials/

RepeatsBasic principle: repeats foil assemblyLonger and longer substrings allow us to “anchor” more of therepeat to its non-repetitive context: Often we can “walk in” from both sides. When we meet in themiddle, the repeat is resolved: Courtesy of Ben Langmead. Used with terials/37

RepeatsBasic principle: repeats foil assemblyYet another example using Greedy-SCS:Input: l, k3, 7output 3, 13 3, 19 3, 25 longer and longer substrings allowus to “reach” further into the repeatCourtesy of Ben Langmead. Used with terials/38

RepeatsPicture the portion of the overlap graph involving repeat AStretches ofgenomeRepeat AR1R2R3R4L1L2L3L4Assume A is longerthan read lengthReadsLots of overlapsamong reads from AL1L2L3L4R1R2R3R4Even if we avoid collapsing copies of A, we can’t know which pathsin correspond to which paths outCourtesy of Ben Langmead. Used with terials/39

LayoutThe overlap graph is big and messy. Contigs don’t “pop out” at us.Below: part of the overlap graph for Courtesy of Ben Langmead. Used with materials/ l 4, k 7

LayoutPicture gets clearer after removing some transitively-inferribleedges1222 2 Courtesy of Ben Langmead. Used with terials/41

LayoutRemove transitively-inferrible edges, starting with edges that skip onenode:xCourtesy of Ben Langmead. Used with permission.42 als/

LayoutRemove transitively-inferrible edges, starting with edges that skip onenode:x After:Courtesy of Ben Langmead. Used with terials/43

LayoutRemove transitively-inferrible edges, starting with edges that skip oneor two nodes:xx After:Even simplerCourtesy of Ben Langmead. Used with terials/44

Layout Emit contigs corresponding to the non-branching stretchesContig 1Contig 2 Unresolvable repeatCourtesy of Ben Langmead. Used with terials/45

LayoutIn practice, layout step also has to deal with spurious subgraphs, e.g.because of sequencing errorPossible repeatboundaryprunebaMismatchba.Mismatch could be due to sequencing error or repeat. Since the paththrough b ends abruptly we might conclude it’s an error and prune b.Courtesy of Ben Langmead. Used with terials/46

Overlap Layout ConsensusOverlapBuild overlap graphLayoutBundle stretches of the overlap graph into contigsConsensusPick most likely nucleotide sequence for each contigCourtesy of Ben Langmead. Used with terials/47

Consensus Take reads that makeup a contig and linethem up Take consensus, i.e.majority voteAt each position, ask: what nucleotide (and/or gap) is here?Complications: (a) sequencing error, (b) ploidySay the true genotype is AG, but we have a high sequencing error rateand only about 6 reads covering the position.Courtesy of Ben Langmead. Used with terials/48

Overlap Layout ConsensusOverlapLayoutConsensusBuild overlap graphBundle stretches of the overlap graph into contigsPick most likely nucleotide sequence for each contigWhat’s the main drawback of OLC?Building overlap graph can be slow.2nd-generation sequencing datasets are 100s of millions or billions ofreads, hundreds of billions of nucleotides totalCourtesy of Ben Langmead. Used with terials/49

Assembly alternativesAlternative 1: Overlap-Layout-Consensus (OLC) assemblyAlternative 2: de Bruijn graph (DBG) assemblyOverlapError correctionLayoutde Bruijn graphConsensusRefineScaffoldingCourtesy of Ben Langmead. Used with materials/

Flow chart removed due to copyright restrictions.51

Table removed due to copyright restrictions.52

# ? B " 2 - . 3 9: 9*: % 98E ;8% # Courtesy of Cold Spring Harbor Laboratory Press. Used with permission.Source: Simpson, Jared T., and Richard Durbin. "Efficient De Novo Assembly of Large Genomesusing Compressed Data Structures." Genome Research 22, no. 3 (2012): 549-56.53

Assembly alternativesAlternative 1: Overlap-Layout-Consensus (OLC) assemblyAlternative 2: de Bruijn graph (DBG) assemblyOverlapError correctionLayoutde Bruijn graphConsensusRefineScaffoldingCourtesy of Ben Langmead. Used with materials/

De Bruijn graph assemblyA formulation conceptually similar to overlapping/SCS, but has somepotentially helpful properties not shared by SCS.Courtesy of Ben Langmead. Used with terials/55

k-mer“k-mer” is a substring of length kS:A 4-mer of S:All 3-mers of S: mer: from Greek meaning “part”I’ll use “k-1-mer” to refer to a substring of length k - 1Courtesy of Ben Langmead. Used with materials/

De Bruijn graphAs usual, we start with a collection of reads, which are substrings ofthe reference genome. , , , , is a k-mer (k 3). is its left k-1-mer, and is its right k-1-mer. 3-mer L R ’s left 2-mer ’s right 2-merCourtesy of Ben Langmead. Used with terials/57

De Bruijn graphTake each length-3 input string and split it into two overlapping substringsof length 2. Call these the left and right 2-mers.take all 3-mers: , , , , , , , , , , , , , form L/R 2-mers:LRLRLRLRLRLet 2-mers be nodes in a new graph. Draw a directed edge from each left2-mer to corresponding right 2-mer: Each edge in this graphcorresponds to a length-3input stringCourtesy of Ben Langmead. Used with materials/

De Bruijn graph An edge corresponds to an overlap (of length -2) between two k-1 mers.More precisely, it corresponds to a k-mer from the input.Courtesy of Ben Langmead. Used with terials/59

De Bruijn graph If we add one more B to our input string: , and rebuild theDe Bruijn graph accordingly, we get a multiedge.Courtesy of Ben Langmead. Used with terials/60

Eulerian walk definitions and statementsNode is balanced if indegree equals outdegreeNode is semi-balanced if indegree differs from outdegree by 1Graph is connected if each node can be reached by some other nodeEulerian walk visits each edge exactly onceNot all graphs have Eulerian walks. Graphs that do are Eulerian.(For simplicity, we won’t distinguish Eulerian from semi-Eulerian.)A directed, connected graph is Eulerian ifand only if it has at most 2 semi-balancednodes and all other nodes are balanced Jones and Pevzner section 8.8 Courtesy of Ben Langmead. Used with terials/61

De Bruijn graphBack to our De Bruijn graph , , , , , , , , , , , , , LRLRLRLRLRIs it Eulerian? YesArgument 1: Argument 2: and are semi-balanced, and are balancedCourtesy of Ben Langmead. Used with terials/62

De Bruijn graph A procedure for making a De Bruijn graphfor a genomeAssume perfect sequencing where each length-ksubstring is sequenced exactly once with no errors Pick a substring length k: 5Start with each read:Take each k mer and splitinto left and right k-1 mers Add k-1 mers as nodes to De Bruijn graph(if not already there), add edge from left k-1mer to right k-1 mer Courtesy of Ben Langmead. Used with terials/63

De Bruijn graph For genome assembly each k-mer isrecorded in “twin” nodes – one node in theforward direction and one node in reversecomplement – k is odd so no node can be its own reversecomplement We will not show reverse complementtwin nodes to cut down on clutter Courtesy of Ben Langmead. Used with permission. http://www.langmead-lab.org/teaching-materials/64

De Bruijn graph First 8 k-mer additions, k 5 Courtesy of Ben Langmead. Used with materials/

De Bruijn graph Last 5 k-mer additions, k 5 Finished graphCourtesy of Ben Langmead. Used with permission.66 http://www.langmead-lab.org/teaching-materials/

De Bruijn graph With perfect sequencing, this procedure alwaysyields an Eulerian graph. Why? Node for k-1-mer from left end is semi-balancedwith one more outgoing edge than incoming *Node for k-1-mer at right end is semi-balancedwith one more incoming than outgoing * Other nodes are balanced since # times k-1-mer occursas a left k-1-mer # times it occurs as a right k-1-mer * Unless genome is circularCourtesy of Ben Langmead. Used with permission.67 http://www.langmead-lab.org/teaching-materials/

De Bruijn graph Assuming perfect sequencing, procedure yieldsgraph with Eulerian walk that can be foundefficiently. We saw cases where Eulerian walk corresponds tothe original superstring. Is this always the case? Courtesy of Ben Langmead. Used with terials/68

De Bruijn graph No: graph can have multiple Eulerian walks, only one ofwhich corresponds to original superstring Right: graph for ZABCDABEFABY, k 3Alternative Eulerian walks: These correspond to two edge-disjoint directedcycles joined by node is a repeat: ZABCDABEFABY Courtesy of Ben Langmead. Used with terials/69

De Bruijn graph This is the first sign that Eulerian walks can’t solveall our problems Other signs emerge when we think about how actualsequencing differs from our idealized construction Courtesy of Ben Langmead. Used with terials/70

De Bruijn graph Gaps in coverage can lead to disconnected graphGraph for , k 5: Courtesy of Ben Langmead. Used with terials/71

De Bruijn graphGaps in coverage can lead to disconnected graphGraph for , k 5 but omitting : Connected components are individuallyEulerian, overall graph is not Courtesy of Ben Langmead. Used with terials/72

De Bruijn graphDifferences in coverage also lead to nonEulerian graph Graph for ,k 5 but with extra copy of :Graph has 4 semi-balanced nodes,isn’t Eulerian Courtesy of Ben Langmead. Used with terials/73

De Bruijn graphErrors and differences between chromosomesalso lead to non-Eulerian graphsGraph for , k 5 but witherror that turns a copy of into Graph is not connected; largestcomponent is not Eulerian Courtesy of Ben Langmead. Used with terials/74

De Bruijn graphHow much work to build graph? For each k-mer, add 1 edge and up to 2 nodes Reasonable to say this is O(1) expected workAssume hash map encodes nodes & edges Assume k-1-mers fit in O(1) machine words,and hashing O

Overlap graph. Below: overlap graph, where an overlap is a suffix/prefix match of at least 3 characters A vertex is

Related Documents:

Introduction of Chemical Reaction Engineering Introduction about Chemical Engineering 0:31:15 0:31:09. Lecture 14 Lecture 15 Lecture 16 Lecture 17 Lecture 18 Lecture 19 Lecture 20 Lecture 21 Lecture 22 Lecture 23 Lecture 24 Lecture 25 Lecture 26 Lecture 27 Lecture 28 Lecture

The human genome is the first genome entirely sequenced. b. The human genome is about the same size as the genome of E. coli. c. Researchers completed the genomes of yeast and fruit flies during the same time they sequenced the human genome. d. The sequence of the human genome was completed in June 2000. 10.

The human genome is the first genome entirely sequenced. b. The human genome is about the same size as the genome of E. coli. c. Researchers completed the genomes of yeast and fruit flies during the same time they sequenced the human genome. d. Aworking copy of the human genome was completed in June 2000. 10.

Lecture 1: A Beginner's Guide Lecture 2: Introduction to Programming Lecture 3: Introduction to C, structure of C programming Lecture 4: Elements of C Lecture 5: Variables, Statements, Expressions Lecture 6: Input-Output in C Lecture 7: Formatted Input-Output Lecture 8: Operators Lecture 9: Operators continued

(1) MIT/Whitehead Institute Center for Genome Research, 320 Charles St., Cambridge MA 02139 (2) MIT Lab for Computer Science, 200 Technology Square, Cambridge MA 02139 (3) MIT Department of Mathematics, 77 Massachusetts Ave, Cambridge MA 02139 (4) MIT Department of Biology, 31 Ames St, Cambridge MA 02139 (5) Corresponding author ABSTRACT

sequencing-by-synthesis on a PicoTiterPlate device image and signal processing whole genome mapping or assembly Comparison of high-throughput Sanger technology to the 454 technology used by the Genome Sequencer 20 System, in whole genome sequencing 7 days * Weeks ** 2.5 days 1 day † De novo s

Lecture 1: Introduction and Orientation. Lecture 2: Overview of Electronic Materials . Lecture 3: Free electron Fermi gas . Lecture 4: Energy bands . Lecture 5: Carrier Concentration in Semiconductors . Lecture 6: Shallow dopants and Deep -level traps . Lecture 7: Silicon Materials . Lecture 8: Oxidation. Lecture

TOEFL Listening Lecture 35 184 TOEFL Listening Lecture 36 189 TOEFL Listening Lecture 37 194 TOEFL Listening Lecture 38 199 TOEFL Listening Lecture 39 204 TOEFL Listening Lecture 40 209 TOEFL Listening Lecture 41 214 TOEFL Listening Lecture 42 219 TOEFL Listening Lecture 43 225 COPYRIGHT 2016