Lecture 18: Approximate Pattern Matching

2y ago
12 Views
2 Downloads
1.54 MB
36 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Gideon Hoey
Transcription

Lecture 18:Approximate Pattern MatchingStudy Chapter 9.6 – 9.811/4/2014Comp 555 Bioalgorithms (Fall 2014)1

Approximate vs. Exact Pattern Matching Previously we have discussed exact patternmatching algorithms Usually, because of mutations, it makes muchmore biological sense to find approximatepattern matches Biologists often use fast heuristic approaches tofind approximate matches11/4/2014Comp 555 Bioalgorithms (Fall 2014)2

Heuristic Similarity Searches Why heuristics?– Genomes are huge: Smith-Waterman quadraticalignment algorithms are too slow Observation: Good alignments of two sequencesusually have short identical or highly similarsubsequences Many heuristic methods (i.e., BLAST, FASTA) arebased on the idea of filtration– Find short exact matches, and use them as“seeds” for potential match extension– “Filter” out positions with no extendablematches11/4/2014Comp 555 Bioalgorithms (Fall 2014)3

Dot Plot A dot matrix or dot plotshows similaritiesbetween two sequences FASTA makes animplicit dot matrix oflength l matches,– tries to find longdiagonals (allowing forsome mismatches) Nucleotide matchesl 111/4/2014Comp 555 Bioalgorithms (Fall 2014)4

Dot Plot A dot matrix or dot plotshows similaritiesbetween two sequences FASTA makes animplicit dot matrix oflength l matches,– tries to find longdiagonals (allowing forsome mismatches) Dinucleotide matchesl 211/4/2014Comp 555 Bioalgorithms (Fall 2014)5

Dot Plot Identify diagonalsabove a thresholdlength Diagonals in the dotmatrix indicate exactsubstring matchingl 211/4/2014Comp 555 Bioalgorithms (Fall 2014)6

Diagonals in Dot Plots Extend diagonals andtry to link themtogether, allowing forminimalmismatches/indels Linking diagonalsreveals approximatematches over longersubstringsl 211/4/2014Comp 555 Bioalgorithms (Fall 2014)7

A Realistic Dot-Plot On the right is adot-plot ofapproximately 200 KB ofgenomic sequencecompared to itself. L 20 with 90%concordance What do the offdiagonal tracesrepresent?11/4/2014Comp 555 Bioalgorithms (Fall 2014)8

Approximate Pattern Matching (APM) Goal: Find all approximate occurrences of a patternin a text Input:– pattern p p1 pn– text t t1 tm– the maximum number of mismatches k Output: All positions 1 i (m – n 1) such thatti ti n-1 and p1 pn have at most k mismatches– i.e., Hamming distance between ti ti n-1 and p k11/4/2014Comp 555 Bioalgorithms (Fall 2014)9

APM: A Brute-Force AlgorithmApproximatePatternMatching(p, t, k)1 n length of pattern p2 m length of text t3 for i 1 to m – n 14dist 05for j 1 to n6if ti j-1 ! pj7dist dist 18if dist k9output i11/4/2014Comp 555 Bioalgorithms (Fall 2014)10

APM: Running Time That algorithm runs in O(nm). Extend “Approximate Pattern Matching” to a moregeneral “Query Matching Problem”:– Match n-length substring of the query (not the fullpattern) to a substring in a text with at most kmismatches– Motivation: we may seek similarities to somegene, but not know which parts of the gene toconsider11/4/2014Comp 555 Bioalgorithms (Fall 2014)11

Query Matching Problem Goal: Find all substrings of the query that approximatelymatch the text Input: Query q q1 qw,text t t1 tm,n (length of matching substrings n w m),k (maximum number of mismatches) Output: All pairs of positions (i, j) such that then-letter substring of q starting at iapproximately matches then-letter substring of t starting at j,with at most k mismatches11/4/2014Comp 555 Bioalgorithms (Fall 2014)12

Approximate Pattern Matching vs Query Matching11/4/2014Comp 555 Bioalgorithms (Fall 2014)13

Query Matching: Main Idea Approximately matching strings share someperfectly matching substrings. Instead of searching for approximately matchingstrings (difficult) search for perfectly matchingsubstrings first (easy).11/4/2014Comp 555 Bioalgorithms (Fall 2014)14

Filtration in Query Matching We want all n-matches between a query and atext with up to k mismatches “Filter” out positions that do not match betweentext and query Potential match detection: find all matches ofl -tuples in query and text for some small l Potential match verification: Verify eachpotential match by extending it to the left andright, until (k 1) mismatches are found11/4/2014Comp 555 Bioalgorithms (Fall 2014)15

Filtration: Match Detection If x1 xn and y1 yn match with at most k nmismatches they must share l –mers that areperfect matches, with l n/(k 1) Break string of length n into k 1 parts, each oflength n/(k 1) – k mismatches can affect at most k of these k 1parts– At least one of these k 1 parts is perfectlymatched11/4/2014Comp 555 Bioalgorithms (Fall 2014)16

Filtration: Match Detection (cont’d) Suppose k 3. We would then have l n/(k 1) n/4:1 ll 1 2l2l 1 3l12k3l 1 nk 1 There are at most k mismatches in n, so at the very leastthere must be one out of the k 1 l –tuples without amismatch11/4/2014Comp 555 Bioalgorithms (Fall 2014)17

Filtration: Match Verification For each l -match we find, try to extend thematch further to see if it is substantialtextquery11/4/2014Comp 555 Bioalgorithms (Fall 2014)Extend perfect matchof length l until wefind an approximatematch of length nwith no more than kmismatches18

Filtration: Examplel -tuplelengthk 0k 1k 2k 3k 4k 5nn/2n/3n/4n/5n/6Shorter perfect matches requiredPerformance decreases11/4/2014Comp 555 Bioalgorithms (Fall 2014)19

Local alignment is too slow Quadratic local alignment is too slowwhen looking for similarities between 0long strings (e.g. the entire GenBank sdatabase) i 1, j δ (vi , )si , j max Guaranteed to find the optimal si , j 1 δ ( , w j ) si 1, j 1 δ (vi , w j )local alignment Sets the standard for sensitivity Basic Local Alignment Search Tool– Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J.Journal of Mol. Biol., 1990 Search sequence databases for local alignments to a query11/4/2014Comp 555 Bioalgorithms (Fall 2014)20

BLAST Great improvement in speed, with only amodest decrease in sensitivity Opts to minimizes search space instead ofexploring entire search space between twosequences Finds short exact matches (“seeds”), explorelocally around these “hits”Search space of Local Alignment11/4/2014Search space of BLASTComp 555 Bioalgorithms (Fall 2014)21

Similarity BLAST only continues it’s search as long asregions are sufficiently similar Measuring the extent of similarity between twosequences– Based on percent sequence identity– Based on conservation11/4/2014Comp 555 Bioalgorithms (Fall 2014)22

Percent Sequence Identity The extent to which two nucleotide or aminoacid sequences are invariantAC C TG A G – AGAC G TG – G C AGmismatchindel70% identical11/4/2014Comp 555 Bioalgorithms (Fall 2014)23

Conservation Amino acid changes that preserve the physicochemical properties of the original residue– Polar to polar aspartate glutamate– Nonpolar to nonpolar alanine valine– Similarly behaving residues leucine to isoleucine Nucleotide changes that preserve molecularshape– Transitions (A-G, C-T) are more similar thanTransversions (A-C, A-T, C-G, G-T)11/4/2014Comp 555 Bioalgorithms (Fall 2014)24

Assessing Sequence Similarity How good of a local alignment score can be expectedfrom chance alone “Chance” relates to comparison of sequences that aregenerated randomly based upon a certain sequencemodel Sequence models may take into account:– nucleotide frequency– dinucelotide frequency(e.g. C G content in mammals)– common repeats– etc.11/4/2014Comp 555 Bioalgorithms (Fall 2014)25

BLAST: Segment Score BLAST uses scoring matrices (δ) to improve onefficiency of match detection (we did this earlierfor pairwise alignments)– Some proteins may have very different aminoacid sequences, but are still similar (PAM,Blosum) For any two l -mers x1 xl and y1 yl :– Segment pair: pair of l -mers, one from eachsequence– Segment score: Σli 1 δ(xi, yi)11/4/2014Comp 555 Bioalgorithms (Fall 2014)26

BLAST: Locally Maximal Segment Pairs A segment pair is maximal if it has the best scoreover all segment pairs A segment pair is locally maximal if its scorecan’t be improved by extending or shortening Statistically significant locally maximal segmentpairs are of biological interest BLAST finds all locally maximal segment pairs(MSPs) with scores above some threshold– A significantly high threshold will filter outsome statistically insignificant matches11/4/2014Comp 555 Bioalgorithms (Fall 2014)27

BLAST: Statistics Threshold: Altschul-Dembo-Karlin statistics– Identifies smallest segment score that is unlikely to happen bychance # matches with score θ is approximately Poissondistributed with mean:E(θ) Kmne-λθK is a constant, m and n are the lengths of the twocompared sequences, λ is a positive root of:Σx,y in A(pxpyeλδ (x,y)) 1where px and py are frequencies of amino acids x and y, δis the scoring matrix, and A is the twenty letter aminoacid alphabet11/4/2014Comp 555 Bioalgorithms (Fall 2014)28

P-values The probability of finding exactly k MSPswith a score θ is given by:(E(θ)k e-E(θ))/k! For k 0, that chance is:e-E(θ) Thus the probability of finding at least one MSPwith a score θ is:p(MSP 0) 1 – e-E(θ)11/4/2014Comp 555 Bioalgorithms (Fall 2014)29

BLAST algorithm Keyword search of all substrings of length wfrom the query of length n, in database of lengthm with score above threshold– w 11 for DNA queries, w 3 for proteins Local alignment extension for each foundkeyword– Extend result until longest match abovethreshold is achieved Running time O(nm)11/4/2014Comp 555 Bioalgorithms (Fall 2014)30

Original BLAST Dictionary– All words of length w Alignment– Ungapped extensions until score falls belowsome statistical threshold Output– All local alignments with score threshold11/4/2014Comp 555 Bioalgorithms (Fall 2014)32

Original BLAST: Example w 4 Exact keywordmatch of GGTC Extend diagonalswith mismatchesuntil score is undersome threshold(65%) Trim to until allmismatches areinterior Output result:GTAAGGTCC GTTAGGTCC11/4/2014From lectures by Serafim BatzoglouComp 555 Bioalgorithms (Fall 2014) (Stanford)33

Gapped BLAST : Example Original BLASTexact keywordsearch, then: Extend with gapsaround ends ofexact match untilscore threshold Output result:GTAAGGTCCAGT GTTAGGTC-AGT11/4/2014From lectures by Serafim Batzoglou(Stanford)Comp 555 Bioalgorithms (Fall 2014)34

Incarnations of BLAST blastn: Nucleotide-nucleotideblastp: Protein-proteinblastx: Translated query vs. protein databasetblastn: Protein query vs. translated databasetblastx: Translated query vs. translateddatabase (6 frames each)11/4/2014Comp 555 Bioalgorithms (Fall 2014)35

Incarnations of BLAST (cont’d) PSI-BLAST– Find members of a protein family or build acustom position-specific score matrix Megablast:– Search longer sequences with fewer differences WU-BLAST: (Wash U BLAST)– Optimized, added features11/4/2014Comp 555 Bioalgorithms (Fall 2014)36

Timeline 1970: Needleman-Wunsch global alignment algorithm1981: Smith-Waterman local alignment algorithm1985: FASTA1990: BLAST (basic local alignment search tool)2000s: BLAST has become too slow in “genome vs.genome” comparisons - new faster algorithms evolve!– Pattern Hunter– BLAT11/4/2014Comp 555 Bioalgorithms (Fall 2014)39

Approximate vs. Exact Pattern Matching Previously we have discussed exact pattern matching algorithms Usually, because of mutations, it makes much more biological sense to find approximate pattern matches . (PAM, Blosum) For any two : l -

Related Documents:

Introduction of Chemical Reaction Engineering Introduction about Chemical Engineering 0:31:15 0:31:09. Lecture 14 Lecture 15 Lecture 16 Lecture 17 Lecture 18 Lecture 19 Lecture 20 Lecture 21 Lecture 22 Lecture 23 Lecture 24 Lecture 25 Lecture 26 Lecture 27 Lecture 28 Lecture

Lecture 1: A Beginner's Guide Lecture 2: Introduction to Programming Lecture 3: Introduction to C, structure of C programming Lecture 4: Elements of C Lecture 5: Variables, Statements, Expressions Lecture 6: Input-Output in C Lecture 7: Formatted Input-Output Lecture 8: Operators Lecture 9: Operators continued

Default rule is one PO for one Invoice (allows automatic matching). Matching of one line (or a few but not all) of an order number with a PO can be done via manual matching. Matching of the invoice with order is done in Arco Invoice. 7.1.1 Automatic matching on header level Automatic m

struction. Therefore, fast and accurate image matching is crucial for 3D reconstruction. Image matching techniques can be roughly divided into three categories: point matching, line matching and region matching. Due to its robustness to changes of illumination, affine transformation and viewpoint changes, point match-

Lecture 1: Introduction and Orientation. Lecture 2: Overview of Electronic Materials . Lecture 3: Free electron Fermi gas . Lecture 4: Energy bands . Lecture 5: Carrier Concentration in Semiconductors . Lecture 6: Shallow dopants and Deep -level traps . Lecture 7: Silicon Materials . Lecture 8: Oxidation. Lecture

TOEFL Listening Lecture 35 184 TOEFL Listening Lecture 36 189 TOEFL Listening Lecture 37 194 TOEFL Listening Lecture 38 199 TOEFL Listening Lecture 39 204 TOEFL Listening Lecture 40 209 TOEFL Listening Lecture 41 214 TOEFL Listening Lecture 42 219 TOEFL Listening Lecture 43 225 COPYRIGHT 2016

Partial Di erential Equations MSO-203-B T. Muthukumar tmk@iitk.ac.in November 14, 2019 T. Muthukumar tmk@iitk.ac.in Partial Di erential EquationsMSO-203-B November 14, 2019 1/193 1 First Week Lecture One Lecture Two Lecture Three Lecture Four 2 Second Week Lecture Five Lecture Six 3 Third Week Lecture Seven Lecture Eight 4 Fourth Week Lecture .

Keywords: Korean, heritage language, multiliteracies, university-level language classroom, multimodal reading response Journal of Language and Literacy Education Vol. 11 Issue 2—Fall 2015 117 eritage language (HL) learners1 who are exposed to and speak a language other than English exclusively in their homes and communities exhibit relatively lower reading and writing skills compared to .