Lecture 18: Approximate Pattern Matching

Upload by : Gideon Hoey

Transcription

Lecture 18: Approximate Pattern Matching
Study Chapter 9.6 – 9.8
11/4/2014, Comp 555 Bioalgorithms (Fall 2014)

Approximate vs. Exact Pattern Matching
Previously we discussed exact pattern matching algorithms
Because of mutations, it usually makes much more biological sense to find approximate pattern matches
Biologists often use fast heuristic approaches to find approximate matches

Heuristic Similarity Searches
Why heuristics?
– Genomes are huge: quadratic alignment algorithms such as Smith-Waterman are too slow
Observation: Good alignments of two sequences usually have short identical or highly similar subsequences
Many heuristic methods (e.g., BLAST, FASTA) are based on the idea of filtration
– Find short exact matches, and use them as “seeds” for potential match extension
– “Filter” out positions with no extendable matches

Dot Plot
A dot matrix or dot plot shows similarities between two sequences
FASTA makes an implicit dot matrix of length-l matches,
– tries to find long diagonals (allowing for some mismatches)
[Figure: nucleotide matches, l = 1]

Dot Plot (cont’d)
[Figure: dinucleotide matches, l = 2]

Dot Plot
Identify diagonals above a threshold length
Diagonals in the dot matrix indicate exact substring matching
[Figure: l = 2]

Diagonals in Dot Plots
Extend diagonals and try to link them together, allowing for minimal mismatches/indels
Linking diagonals reveals approximate matches over longer substrings
[Figure: l = 2]
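
The dot-plot slides above can be sketched in a few lines of Python. This is only an illustration, not FASTA's actual implementation: the function names `lmer_dots` and `diagonal_runs` and the `min_run` parameter are made up for this example. A dot is an exact l-mer match, and a run of consecutive dots on one diagonal (constant j − i) is an exact substring match.

```python
from collections import defaultdict

def lmer_dots(s, t, l):
    """Return all 0-based (i, j) with s[i:i+l] == t[j:j+l]."""
    index = defaultdict(list)               # l-mer -> start positions in t
    for j in range(len(t) - l + 1):
        index[t[j:j + l]].append(j)
    dots = []
    for i in range(len(s) - l + 1):
        for j in index.get(s[i:i + l], []):
            dots.append((i, j))
    return dots

def diagonal_runs(dots, min_run):
    """Group dots by diagonal d = j - i and keep runs of at least
    min_run consecutive dots, reported as (d, first_i, last_i)."""
    by_diag = defaultdict(list)
    for i, j in dots:
        by_diag[j - i].append(i)
    runs = []
    for d, starts in by_diag.items():
        starts.sort()
        run_start = prev = starts[0]
        for i in starts[1:] + [None]:       # None flushes the last run
            if i is not None and i == prev + 1:
                prev = i
                continue
            if prev - run_start + 1 >= min_run:
                runs.append((d, run_start, prev))
            if i is not None:
                run_start = prev = i
    return runs

runs = diagonal_runs(lmer_dots("ACGTACGT", "TTACGTAA", 2), min_run=3)
```

Linking nearby runs across mismatches or indels, as the slide describes, would be a further step on top of `diagonal_runs`.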

A Realistic Dot-Plot
The figure on this slide is a dot-plot of approximately 200 KB of genomic sequence compared to itself
l = 20 with 90% concordance
What do the off-diagonal traces represent?

Approximate Pattern Matching (APM)
Goal: Find all approximate occurrences of a pattern in a text
Input:
– pattern p = p_1 … p_n
– text t = t_1 … t_m
– the maximum number of mismatches k
Output: All positions 1 ≤ i ≤ (m – n + 1) such that t_i … t_{i+n-1} and p_1 … p_n have at most k mismatches
– i.e., Hamming distance between t_i … t_{i+n-1} and p ≤ k

APM: A Brute-Force Algorithm
ApproximatePatternMatching(p, t, k)
    n ← length of pattern p
    m ← length of text t
    for i ← 1 to m – n + 1
        dist ← 0
        for j ← 1 to n
            if t_{i+j-1} ≠ p_j
                dist ← dist + 1
        if dist ≤ k
            output i
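
A direct Python rendering of the brute-force pseudocode might look like the sketch below. The function name and the early exit once the mismatch count exceeds k are additions for this example; positions are reported 1-based to agree with the problem statement.

```python
def approximate_pattern_matching(p, t, k):
    """Return all 1-based positions i where t[i..i+n-1] is within
    Hamming distance k of the pattern p."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):          # 0-based window start
        dist = 0
        for j in range(n):
            if t[i + j] != p[j]:
                dist += 1
                if dist > k:            # early exit: window already fails
                    break
        if dist <= k:
            positions.append(i + 1)     # report 1-based, as on the slide
    return positions

print(approximate_pattern_matching("ACG", "ACGTACCAAG", 1))  # [1, 5, 8]
```

The early exit does not change the O(nm) worst case; it only helps when most windows fail quickly.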

APM: Running Time
The brute-force algorithm runs in O(nm) time
Extend “Approximate Pattern Matching” to a more general “Query Matching Problem”:
– Match an n-length substring of the query (not the full pattern) to a substring of the text with at most k mismatches
– Motivation: we may seek similarities to some gene, but not know which parts of the gene to consider

Query Matching Problem
Goal: Find all substrings of the query that approximately match the text
Input:
– query q = q_1 … q_w
– text t = t_1 … t_m
– n (length of matching substrings, n ≤ w ≤ m)
– k (maximum number of mismatches)
Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i approximately matches the n-letter substring of t starting at j, with at most k mismatches
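
To make the problem statement concrete, here is a minimal brute-force sketch that compares every n-letter window of the query against every n-letter window of the text. The function name is an assumption for this example, and this is a baseline only, not the filtration method the lecture develops next.

```python
def query_matching_brute_force(q, t, n, k):
    """Return all 1-based pairs (i, j) such that q[i..i+n-1] and
    t[j..j+n-1] differ in at most k positions."""
    pairs = []
    for i in range(len(q) - n + 1):
        for j in range(len(t) - n + 1):
            mismatches = sum(a != b for a, b in zip(q[i:i + n], t[j:j + n]))
            if mismatches <= k:
                pairs.append((i + 1, j + 1))
    return pairs

print(query_matching_brute_force("ACGT", "TACGA", 3, 1))  # [(1, 2), (2, 3)]
```

Every pair of windows costs up to n comparisons, so the running time is O(n · w · m), which is what motivates filtration.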

Approximate Pattern Matching vs Query Matching

Query Matching: Main Idea
Approximately matching strings share some perfectly matching substrings
Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings first (easy)

Filtration in Query Matching
We want all n-matches between a query and a text with up to k mismatches
“Filter” out positions that do not match between text and query
Potential match detection: find all matches of l-tuples in query and text for some small l
Potential match verification: verify each potential match by extending it to the left and right, until (k + 1) mismatches are found

Filtration: Match Detection
If x_1 … x_n and y_1 … y_n match with at most k < n mismatches, they must share l-mers that are perfect matches, with l = ⌊n/(k + 1)⌋
Break the string of length n into k + 1 parts, each of length ⌊n/(k + 1)⌋
– k mismatches can affect at most k of these k + 1 parts
– At least one of these k + 1 parts is perfectly matched

Filtration: Match Detection (cont’d)
Suppose k = 3. We would then have l = n/(k + 1) = n/4:
[Figure: the string is split into k + 1 = 4 blocks covering positions 1..l, l+1..2l, 2l+1..3l, and 3l+1..n]
There are at most k mismatches in n, so at least one of the k + 1 l-tuples must be free of mismatches

Filtration: Match Verification
For each l-match we find, try to extend the match further to see if it is substantial
Extend the perfect match of length l until we find an approximate match of length n with no more than k mismatches
[Figure: an l-match between text and query being extended]
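
The detection and verification steps can be put together in a short sketch. It assumes l = ⌊n/(k + 1)⌋ and relies on the block argument from the detection slides: a true match must agree exactly on one of the k + 1 blocks starting at offsets 0, l, 2l, … inside the window. The function names are made up for this example, and verification here simply recounts mismatches over the whole candidate window rather than extending base by base as on the slide.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def query_matching_filtration(q, t, n, k):
    """Return sorted 1-based pairs (i, j) with hamming(q[i..], t[j..]) <= k
    over n-letter windows, using exact l-mer hits as a filter."""
    l = n // (k + 1)
    if l == 0:
        raise ValueError("need k < n so that l-mers are non-empty")
    index = defaultdict(list)                 # l-mer -> start positions in t
    for b in range(len(t) - l + 1):
        index[t[b:b + l]].append(b)
    found = set()
    for a in range(len(q) - l + 1):
        for b in index.get(q[a:a + l], []):   # detection: exact l-mer hit
            for c in range(k + 1):            # hit may be block c of a window
                i, j = a - c * l, b - c * l
                if i < 0 or j < 0 or i + n > len(q) or j + n > len(t):
                    continue
                # verification: count mismatches over the full window
                if (i, j) not in found and hamming(q[i:i + n], t[j:j + n]) <= k:
                    found.add((i, j))
    return sorted((i + 1, j + 1) for i, j in found)

print(query_matching_filtration("ACGT", "TACGA", 3, 1))  # [(1, 2), (2, 3)]
```

Only windows that contain an exact l-mer hit are ever verified, which is the point of the filter; the smaller l becomes, the more spurious hits have to be checked.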

Filtration: Example
                  k = 0   k = 1   k = 2   k = 3   k = 4   k = 5
l-tuple length    n       n/2     n/3     n/4     n/5     n/6
As k grows, shorter perfect matches are required and performance decreases

Local alignment is too slow
Quadratic local alignment is too slow when looking for similarities between long strings (e.g. the entire GenBank database)
s_{i,j} = max {
    0,
    s_{i-1,j} + δ(v_i, –),
    s_{i,j-1} + δ(–, w_j),
    s_{i-1,j-1} + δ(v_i, w_j)
}
Guaranteed to find the optimal local alignment
Sets the standard for sensitivity
Basic Local Alignment Search Tool
– Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J., Journal of Mol. Biol., 1990
– Search sequence databases for local alignments to a query
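
For reference, the local alignment recurrence on this slide can be evaluated with the sketch below. The scoring values (match +1, mismatch −1, indel −1) are placeholder assumptions standing in for the matrix δ, and the function returns only the best score, not the alignment itself.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best local alignment score of v and w under a simple scoring
    scheme, computed row by row in O(len(v) * len(w)) time."""
    prev = [0] * (len(w) + 1)
    best = 0
    for i in range(1, len(v) + 1):
        cur = [0] * (len(w) + 1)
        for j in range(1, len(w) + 1):
            diag = prev[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            cur[j] = max(0,                    # start a new local alignment
                         prev[j] + indel,      # v_i aligned to a gap
                         cur[j - 1] + indel,   # w_j aligned to a gap
                         diag)                 # v_i aligned to w_j
            best = max(best, cur[j])
        prev = cur
    return best

print(local_alignment_score("TTACGTT", "GGACGGG"))  # 3
```

The two nested loops are the quadratic cost the slide refers to: every cell of the table is filled, whether or not it lies near a real similarity.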

BLAST
Great improvement in speed, with only a modest decrease in sensitivity
Opts to minimize the search space instead of exploring the entire search space between two sequences
Finds short exact matches (“seeds”) and explores locally around these “hits”
[Figures: search space of local alignment; search space of BLAST]

Similarity
BLAST only continues its search as long as regions are sufficiently similar
Measuring the extent of similarity between two sequences
– Based on percent sequence identity
– Based on conservation

Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant
    A C C T G A G – A G
    A C G T G – G C A G
The third column is a mismatch; the columns containing “–” are indels
70% identical
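
The 70% figure can be checked with a small sketch. It assumes identity is counted as matching columns divided by the total number of alignment columns, with gap columns counting as non-matches; other conventions for the denominator exist. The function name and the use of "-" as the gap character are choices made for this example.

```python
def percent_identity(row1, row2, gap="-"):
    """Percentage of alignment columns in which both rows carry the
    same non-gap character."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = sum(a == b and a != gap for a, b in zip(row1, row2))
    return 100.0 * matches / len(row1)

print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```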

Conservation
Amino acid changes that preserve the physicochemical properties of the original residue
– Polar to polar: aspartate to glutamate
– Nonpolar to nonpolar: alanine to valine
– Similarly behaving residues: leucine to isoleucine
Nucleotide changes that preserve molecular shape
– Transitions (A-G, C-T) are more similar than transversions (A-C, A-T, C-G, G-T)
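
The transition/transversion distinction is easy to state in code: a substitution is a transition when both bases are purines (A, G) or both are pyrimidines (C, T), and a transversion otherwise. The function name below is made up for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Classify a DNA base pair as identity, transition, or transversion."""
    if a == b:
        return "identity"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"       # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"         # purine<->pyrimidine

print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```

A conservation-aware scoring scheme could then penalize transversions more heavily than transitions.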

Assessing Sequence Similarity
How good a local alignment score can be expected from chance alone?
“Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model
Sequence models may take into account:
– nucleotide frequency
– dinucleotide frequency (e.g. C+G content in mammals)
– common repeats
– etc.

BLAST: Segment Score
BLAST uses scoring matrices (δ) to improve the efficiency of match detection (we did this earlier for pairwise alignments)
– Some proteins may have very different amino acid sequences, but are still similar (PAM, Blosum)
For any two l-mers x_1 … x_l and y_1 … y_l:
– Segment pair: pair of l-mers, one from each sequence
– Segment score: Σ_{i=1}^{l} δ(x_i, y_i)
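
The segment score is just a column-by-column sum of δ. In the sketch below, `toy_delta` (+1 for a match, −1 for a mismatch) is a hypothetical stand-in for a real scoring matrix such as PAM or BLOSUM, and both function names are assumptions for this example.

```python
def toy_delta(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def segment_score(x, y, delta):
    """Sum of delta(x_i, y_i) over an ungapped segment pair."""
    if len(x) != len(y):
        raise ValueError("a segment pair must consist of equal-length l-mers")
    return sum(delta(a, b) for a, b in zip(x, y))

print(segment_score("GGTC", "GGTC", toy_delta))  # 4
print(segment_score("GGTC", "GATC", toy_delta))  # 2
```

Passing `delta` as a parameter keeps the function independent of the alphabet, so the same code works for nucleotides or amino acids.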

BLAST: Locally Maximal Segment Pairs
A segment pair is maximal if it has the best score over all segment pairs
A segment pair is locally maximal if its score can’t be improved by extending or shortening
Statistically significant locally maximal segment pairs are of biological interest
BLAST finds all locally maximal segment pairs (MSPs) with scores above some threshold
– A sufficiently high threshold will filter out some statistically insignificant matches

BLAST: Statistics
Threshold: Altschul-Dembo-Karlin statistics
– Identifies the smallest segment score that is unlikely to happen by chance
The number of matches with score ≥ θ is approximately Poisson distributed with mean:
    E(θ) = K m n e^{-λθ}
K is a constant, m and n are the lengths of the two compared sequences, and λ is a positive root of:
    Σ_{x,y in A} p_x p_y e^{λ δ(x,y)} = 1
where p_x and p_y are the frequencies of amino acids x and y, δ is the scoring matrix, and A is the twenty-letter amino acid alphabet
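
As a numerical illustration of these formulas, the sketch below finds λ by bisection and evaluates E(θ). It assumes the expected score per aligned pair is negative and at least one score is positive, which is what makes the positive root unique. The function names, the toy ±1 scoring, the uniform DNA frequencies, and the value K = 0.1 are all assumptions for this example, not values used by BLAST.

```python
import math

def solve_lambda(freqs, delta, iterations=200):
    """Positive root of sum_{x,y} p_x p_y exp(lam * delta(x, y)) = 1."""
    letters = list(freqs)

    def f(lam):
        return sum(freqs[x] * freqs[y] * math.exp(lam * delta(x, y))
                   for x in letters for y in letters) - 1.0

    expected = sum(freqs[x] * freqs[y] * delta(x, y)
                   for x in letters for y in letters)
    best = max(delta(x, y) for x in letters for y in letters)
    if expected >= 0 or best <= 0:
        raise ValueError("need negative expected score and a positive score")
    hi = 1.0
    while f(hi) <= 0:        # move right until f is positive
        hi *= 2
    lo = hi
    while f(lo) >= 0:        # move left until f is negative (but lam > 0)
        lo /= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def expected_hits(K, m, n, lam, theta):
    """E(theta) = K * m * n * exp(-lam * theta)."""
    return K * m * n * math.exp(-lam * theta)

uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, lambda x, y: 1 if x == y else -1)
```

For this toy model the equation reduces to 0.25·e^λ + 0.75·e^{−λ} = 1, whose positive root is λ = ln 3, so the numerical result can be checked by hand.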

P-values
The probability of finding exactly k MSPs with a score ≥ θ is given by:
    (E(θ)^k e^{-E(θ)}) / k!
For k = 0, that chance is:
    e^{-E(θ)}
Thus the probability of finding at least one MSP with a score ≥ θ is:
    p(MSP > 0) = 1 – e^{-E(θ)}
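
These Poisson formulas translate directly into code. In the sketch below, `E` stands for the mean E(θ) and is passed in as a plain number; the function names are made up for this example.

```python
import math

def prob_exactly(k, E):
    """Poisson probability of exactly k MSPs when the mean count is E."""
    return E ** k * math.exp(-E) / math.factorial(k)

def prob_at_least_one(E):
    """P-value: probability of at least one MSP, 1 - exp(-E)."""
    return -math.expm1(-E)   # numerically stable form of 1 - exp(-E)

E = 0.5
print(prob_exactly(0, E), prob_at_least_one(E))
```

For small E the P-value is close to E itself (1 − e^{−E} ≈ E), which is why E-values and P-values are nearly interchangeable for highly significant hits.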

BLAST algorithm
Keyword search of all substrings of length w from the query of length n, in a database of length m, with score above threshold
– w = 11 for DNA queries, w = 3 for proteins
Local alignment extension for each found keyword
– Extend result until longest match above threshold is achieved
Running time O(nm)

Original BLAST
Dictionary
– All words of length w
Alignment
– Ungapped extensions until score falls below some statistical threshold
Output
– All local alignments with score > threshold

Original BLAST: Example
w = 4
Exact keyword match of GGTC
Extend diagonals with mismatches until score is under some threshold (65%)
Trim until all mismatches are interior
Output result:
    GTAAGGTCC
    GTTAGGTCC
From lectures by Serafim Batzoglou (Stanford)
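
The seed, extend, and trim steps of this example can be imitated with the sketch below. It is not BLAST's actual extension rule: here the 65% threshold is read as a minimum fraction of identical columns in the growing segment, extension goes right first and then left, and the two input sequences are invented so that the GGTC seed reproduces the output shown on the slide. All function names are assumptions.

```python
from collections import defaultdict

def find_seeds(query, db, w):
    """All 0-based (i, j) with an exact w-mer match query[i:i+w] == db[j:j+w]."""
    index = defaultdict(list)
    for j in range(len(db) - w + 1):
        index[db[j:j + w]].append(j)
    return [(i, j) for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]

def extend_ungapped(query, db, i, j, w, min_identity=0.65):
    """Extend an exact seed along its diagonal while the fraction of
    identical columns stays >= min_identity, then trim mismatched ends."""
    start_q, start_d, length, matches = i, j, w, w
    # extend to the right
    while start_q + length < len(query) and start_d + length < len(db):
        hit = query[start_q + length] == db[start_d + length]
        if (matches + hit) / (length + 1) < min_identity:
            break
        matches += hit
        length += 1
    # extend to the left
    while start_q > 0 and start_d > 0:
        hit = query[start_q - 1] == db[start_d - 1]
        if (matches + hit) / (length + 1) < min_identity:
            break
        matches += hit
        start_q -= 1
        start_d -= 1
        length += 1
    # trim so that every remaining mismatch is interior
    while length > w and query[start_q] != db[start_d]:
        start_q += 1
        start_d += 1
        length -= 1
    while length > w and query[start_q + length - 1] != db[start_d + length - 1]:
        length -= 1
    return query[start_q:start_q + length], db[start_d:start_d + length]

query = "TGGTAAGGTCCAAT"   # hypothetical sequences for illustration
db = "CAGTTAGGTCCGGC"
print(extend_ungapped(query, db, 6, 6, 4))  # ('GTAAGGTCC', 'GTTAGGTCC')
```

The seed (6, 6) is the GGTC hit returned by `find_seeds(query, db, 4)`; neighbouring seeds on the same diagonal would extend to overlapping segments, which a fuller implementation would merge.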

Gapped BLAST: Example
Original BLAST exact keyword search, then:
Extend with gaps around ends of exact match until score < threshold
Output result:
    GTAAGGTCCAGT
    GTTAGGTC-AGT
From lectures by Serafim Batzoglou (Stanford)

Incarnations of BLAST
blastn: Nucleotide-nucleotide
blastp: Protein-protein
blastx: Translated query vs. protein database
tblastn: Protein query vs. translated database
tblastx: Translated query vs. translated database (6 frames each)

Incarnations of BLAST (cont’d)
PSI-BLAST
– Find members of a protein family or build a custom position-specific score matrix
Megablast
– Search longer sequences with fewer differences
WU-BLAST (Wash U BLAST)
– Optimized, added features

Timeline
1970: Needleman-Wunsch global alignment algorithm
1981: Smith-Waterman local alignment algorithm
1985: FASTA
1990: BLAST (basic local alignment search tool)
2000s: BLAST has become too slow in “genome vs. genome” comparisons; new faster algorithms evolve!
– Pattern Hunter
– BLAT
