11/5/09 Comp 590/Comp 790-90 Fall 2009 1

2y ago
9 Views
3 Downloads
5.67 MB
57 Pages
Last View : 2d ago
Last Download : 3m ago
Upload by : Shaun Edmunds
Transcription

11/5/09Comp 590/Comp 790-90Fall 20091

Last time we discussed exact pattern matchingalgorithms Usually, because of mutations, it makes muchmore biological sense to find approximatepattern matches Biologists often use fast heuristic approaches(rather than local alignment) to findapproximate matches11/5/09Comp 590/Comp 790-90Fall 20092

Genomes are huge: Smith-Waterman quadraticalignment algorithms are too slow Good alignments of two sequences usually haveshort identical or highly similar subsequences Many heuristic methods (i.e., BLAST, FASTA) arebased on the idea of filtration– Find short exact matches, and use them asseeds for potential match extension– “Filter” out positions with no extendablematches11/5/09Comp 590/Comp 790-90Fall 20093

Dot matrices showsimilarities between twosequences FASTA makes animplicit dot matrix fromshort exact matches, andtries to find longdiagonals (allowing forsome mismatches)l 111/5/09Comp 590/Comp 790-90Fall 20094

Dot matrices showsimilarities between twosequences FASTA makes animplicit dot matrix fromshort exact matches, andtries to find longdiagonals (allowing forsome mismatches)l 211/5/09Comp 590/Comp 790-90Fall 20095

Identify diagonalsabove a thresholdlength Diagonals in the dotmatrix indicate exactsubstring matchingl 211/5/09Comp 590/Comp 790-90Fall 20096

Extend diagonals andtry to link themtogether, allowing forminimal mismatches/indels Linking diagonalsreveals approximatematches over longersubstringsl 211/5/09Comp 590/Comp 790-90Fall 20097

Goal: Find all approximate occurrences of a patternin a text Input:– pattern p p1 pn– text t t1 tm– the maximum number of mismatches k Output: All positions 1 i (m – n 1) such thatti ti n-1 and p1 pn have at most k mismatches– i.e., Hamming distance between ti ti n-1 and p k11/5/09Comp 590/Comp 790-90Fall 20098

ApproximatePatternMatching(p, t, k)1 n length of pattern p2 m length of text t3 for i 1 to m – n 14dist 05for j 1 to n6if ti j-1 ! pj7dist dist 18if dist k9output i11/5/09Comp 590/Comp 790-90Fall 20099

That algorithm runs in O(nm). We can generalize the “Approximate PatternMatching Problem” into a “Query MatchingProblem”:– We want to match substrings in a query tosubstrings in a text with at most k mismatches– Motivation: we want to see similarities to somegene, but we may not know which parts of thegene to look for11/5/09Comp 590/Comp 790-90Fall 200910

Goal: Find all substrings of the query that approximatelymatch the text Input: Query q q1 qw,text t t1 tm,n (length of matching substrings n w m),k (maximum number of mismatches) Output: All pairs of positions (i, j) such that then-letter substring of q starting at iapproximately matches then-letter substring of t starting at j,with at most k mismatches11/5/09Comp 590/Comp 790-90Fall 200911

11/5/09Comp 590/Comp 790-90Fall 200912

Approximately matching strings share someperfectly matching substrings. Instead of searching for approximately matchingstrings (difficult) search for perfectly matchingsubstrings first (easy).11/4/09Comp 590/Comp 790-90Fall 200913

We want all n-matches between a query and atext with up to k mismatches “Filter” out positions that do not match betweentext and query Potential match detection: find all matches ofl -tuples in query and text for some small l Potential match verification: Verify eachpotential match by extending it to the left andright, until (k 1) mismatches are found11/4/09Comp 590/Comp 790-90Fall 200914

If x1 xn and y1 yn match with at most kmismatches, they must share an l-tuple that isperfectly matched, with l n/(k 1) Break string of length n into k 1 parts, each oflength n/(k 1) – k mismatches can affect at most k of these k 1parts– At least one of these k 1 parts is perfectlymatched11/4/09Comp 590/Comp 790-90Fall 200915

Suppose k 3. We would then have l n/(k 1) n/4:1 l1l 1 2l22l 1 3lk3l 1 nk 1 There are at most k mismatches in n, so at the very leastthere must be one out of the k 1 l –tuples without amismatch11/4/09Comp 590/Comp 790-90Fall 200916

For each l -match we find, try to extend thematch further to see if it is substantialtextExtend perfectmatch oflength luntil we find anapproximatematch oflength n with kmismatchesquery11/4/09Comp 590/Comp 790-90Fall 200917

l -tuplelengthk 0k 1k 2k 3k 4k 5nn/2n/3n/4n/5n/6Shorter perfect matches requiredPerformance decreases11/4/09Comp 590/Comp 790-90Fall 200918

Quadratic local alignment is too slowwhile looking for similarities betweenlong strings (e.g. the entire GenBankdatabase)11/4/09Comp 590/Comp 790-90Fall 200919

Quadratic local alignment is too slowwhile looking for similarities betweenlong strings (e.g. the entire GenBankdatabase) Guaranteed to find the optimallocal alignment Sets the standard for sensitivity11/4/09Comp 590/Comp 790-90Fall 200920

Quadratic local alignment is too slowwhile looking for similarities betweenlong strings (e.g. the entire GenBankdatabase) Basic Local Alignment Search Tool– Altschul, S., Gish, W., Miller, W.,Myers, E. & Lipman, D.J.Journal of Mol. Biol., 1990 Search sequence databases for localalignments to a query11/4/09Comp 590/Comp 790-90Fall 200921

Great improvement in speed, with a modest decrease insensitivity Minimizes search space instead of exploring entiresearch space between two sequences Finds short exact matches (“seeds”), only exploreslocally around these “hits”11/4/09Comp 590/Comp 790-90Fall 200922

BLASTing a new gene– Evolutionary relationship– Similarity between protein function BLASTing a genome– Potential genes11/4/09Comp 590/Comp 790-90Fall 200923

Measuring the extent of similarity between twosequences– Based on percent sequence identity– Based on conservation11/4/09Comp 590/Comp 790-90Fall 200924

The extent to which two nucleotide or aminoacid sequences are invariantAC C TG A G – AGAC G TG – G C AGmismatchindel70% identical11/4/09Comp 590/Comp 790-90Fall 200925

Amino acid changes that preserve the physicochemical properties of the original residue– Polar to polar aspartate glutamate– Nonpolar to nonpolar alanine valine– Similarly behaving residues leucine to isoleucine11/4/09Comp 590/Comp 790-90Fall 200926

Amino acid substitution matrices– PAM– BLOSUM DNA substitution matrices– DNA: less conserved than protein sequences– Less effective to compare coding regions atnucleotide level11/4/09Comp 590/Comp 790-90Fall 200927

Point Accepted Mutation (Dayhoff et al.) 1 PAM PAM1 1% average change of allamino acid positions– After 100 PAMs of evolution, not everyresidue will have changed May have mutated several times Returned to their original state Not changed at all11/4/09Comp 590/Comp 790-90Fall 200928

PAM250 is most widely used:ORIGINAL AMINO 22WYV017214024014034014014014135024Comp 590/Comp 790-90Fall 2009LeuL622213Lys .K .7 .955151215011029

Blocks Substitution Matrix (Henikoff andHenikoff) Scores derived from observations of thefrequencies of substitutions– From blocks of local alignments in relatedproteins Matrix name indicates evolutionary distance– BLOSUM62 was created using sequencessharing no more than 62% identity11/4/09Comp 590/Comp 790-90Fall 200930

Keyword search of all words of length w fromthe query of length n in database of length mwith score above threshold– w 11 for DNA queries, w 3 for proteins Local alignment extension for each foundkeyword– Extend result until longest match abovethreshold is achieved Running time O(nm)11/4/09Comp 590/Comp 790-90Fall 200931

keywordQuery: IRDGVK 18GAK 16NeighborhoodGIK 16wordsGGK 14neighborhoodGLK 13score thresholdGNK 12(T 13)GRK 11GEK 11GDK 11extensionQuery: 22VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 DN G IR LG K I L E RG KSbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263High-scoring Pair (HSP)11/4/09Comp 590/Comp 790-90Fall 200932

Dictionary– All words of length w Alignment– Ungapped extensions until score falls belowsome statistical threshold Output– All local alignments with score threshold11/4/09Comp 590/Comp 790-90Fall 200933

C T G A T C C T G G A T T G C G A w 4 Exact keywordmatch of GGTC Extenddiagonals withmismatchesuntil score isunder 50% Output resultGTAAGGTCCGTTAGGTCCA C G A A G T A A G G T C C A G TFrom lectures by Serafim Batzoglou(Stanford)11/4/09Comp 590/Comp 790-90Fall 200934

GTAAGGTCCAGTGTTAGGTC-AGTFrom lectures by Serafim Batzoglou(Stanford)11/4/09C T G A T C C T G G A T T G C G A Original BLASTexact keywordsearch, THEN: Extend with gapsaround ends ofexact match untilscore threshold Output resultA C G A A G T A A G G T C C A G TComp 590/Comp 790-90Fall 200935

blastn: Nucleotide-nucleotideblastp: Protein-proteinblastx: Translated query vs. protein databasetblastn: Protein query vs. translated databasetblastx: Translated query vs. translateddatabase (6 frames each)11/4/09Comp 590/Comp 790-90Fall 200936

PSI-BLAST– Find members of a protein family or build acustom position-specific score matrix Megablast:– Search longer sequences with fewer differences WU-BLAST: (Wash U BLAST)– Optimized, added features11/4/09Comp 590/Comp 790-90Fall 200937

Need to know how strong an alignment can beexpected from chance alone “Chance” relates to comparison of sequences that aregenerated randomly based upon a certain sequencemodel Sequence models may take into account:– G C content– Poly-A tails– “Junk” DNA– Codon bias– Etc.11/4/09Comp 590/Comp 790-90Fall 200938

BLAST uses scoring matrices (δ) to improve onefficiency of match detection– Some proteins may have very different aminoacid sequences, but are still similar For any two l-mers x1 xl and y1 yl :– Segment pair: pair of l-mers, one from eachsequence– Segment score: Σli 1 δ(xi, yi)11/4/09Comp 590/Comp 790-90Fall 200939

A segment pair is maximal if it has the best scoreover all segment pairs A segment pair is locally maximal if its scorecan’t be improved by extending or shortening Statistically significant locally maximal segmentpairs are of biological interest BLAST finds all locally maximal segment pairswith scores above some threshold– A significantly high threshold will filter outsome statistically insignificant matches11/4/09Comp 590/Comp 790-90Fall 200940

Threshold: Altschul-Dembo-Karlin statistics– Identifies smallest segment score that isunlikely to happen by chance # matches above θ has mean E(θ) Kmne-λθ; K isa constant, m and n are the lengths of the twocompared sequences– Parameter λ is a positive root of:Σ x,y in A(pxpyeδ(x,y)) 1, where px and py arefrequencies of amino acids x and y, δ is thescoring matrix, and A is the twenty letteramino acid alphabet11/4/09Comp 590/Comp 790-90Fall 200941

The probability of finding b HSPs with a score θ is given by:– (e-EEb)/b! For b 0, that chance is:– e-E Thus the probability of finding at least one HSPwith a score θ is:– P 1 – e-E11/4/09Comp 590/Comp 790-90Fall 200942

Blast of human beta globin protein against zebrafishEScoreSequences producing significant alignments:(bits) Valuegi 18858329 ref NP 571095.1 ba1 globin [Danio rerio] gi 147757.gi 18858331 ref NP 571096.1 ba2 globin; SI:dZ118J2.3 [Danio rer.gi 37606100 emb CAE48992.1 SI:bY187G17.6 (novel beta globin) [D.gi 31419195 gb AAH53176.1 Ba1 protein [Danio rerio]1711701701683e-447e-447e-443e-43ALIGNMENTS gi 18858329 ref NP 571095.1 ba1 globin [Danio rerio]Length 148Score 171 bits (434), Expect 3e-44Identities 76/148 (51%), Positives 106/148 (71%), Gaps 1/148 (0%)Query: 1Sbjct: 1Query: 61Sbjct: LSTPDAVMGNPK 60MV T E A LWGK N DE G AL R L VYPWTQR F FG LS P A FGNLSSPAAIMGNPK NVLVCVLAHHFG 120V AHG V G DN K T A LS H KLHVDP NFRLL A DCITVCAAMKFG 120Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147 FVQ A QK A V AL YHSbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 14811/4/09Comp 590/Comp 790-90Fall 200943

Blast of human beta globin DNA against humanScoreDNAESequences producing significant alignments:(bits) Valuegi 19849266 gb AF487523.1 Homo sapiens gamma A hemoglobin (HBG1.gi 183868 gb M11427.1 HUMHBG3E Human gamma-globin mRNA, 3' endgi 44887617 gb AY534688.1 Homo sapiens A-gamma globin (HBG1) ge.gi 31726 emb V00512.1 HSGGL1 Human messenger RNA for gamma-globingi 38683401 ref NR 001589.1 Homo sapiens hemoglobin, beta pseud.gi 18462073 gb AF339400.1 Homo sapiens haplotype PB26 -343e-33ALIGNMENTS gi 28380636 ref NG 000007.3 Homo sapiens beta globin region (HBB@) on chromosome 11Length 81706Score 149 bits (75), Expect 3e-33Identities 183/219 (83%)Strand Plus / PlusQuery: ccagctgagtgaa 326 Sbjct: 54409 actgagtgac 54468Query: 327ctgcactgtgacaagctgcatgtggatcctgagaacttc 365 Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 5450711/4/09Comp 590/Comp 790-90Fall 200944

1970: Needleman-Wunsch global alignment algorithm1981: Smith-Waterman local alignment algorithm1985: FASTA1990: BLAST (basic local alignment search tool)2000s: BLAST has become too slow in “genome vs.genome” comparisons - new faster algorithms evolve!– Pattern Hunter– BLAT11/4/09Comp 590/Comp 790-90Fall 200945

BLAST: matches shortconsecutive sequences(consecutive seed) Length k Example (k 11):11111111111 PatternHunter: matches shortnon-consecutive sequences(spaced seed) Increases sensitivity by locatinghomologies that wouldotherwise be missed Example (a spaced seed oflength 18 w/ 11 “matches”):111010010100110111Each 1 represents a “match”11/4/09Each 0 represents a “don’t care”, sothere can be a match or amismatchComp 590/Comp 790-90Fall 200946

Example of a hit using a spaced seed:How does this result in better sensitivity?"11/4/09Comp 590/Comp 790-90Fall 200947

BLAST: redundanthitsThis results in 1 hit andcreates clusters ofredundant hits"11/4/09 PatternHunter "This results in very fewredundant hitsComp 590/Comp 790-90Fall 200948

BLAST may also miss a hitGAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT GAATACTCAACAGCAACATCAATGGGCAGCAGAAAATIn this example, despite a clear homology, there is no sequenceof continuous matches longer than length 9. BLAST uses alength 11 and because of this, BLAST does not recognize thisas a hit!"Resolving this would require reducing the seed length to 9,which would have a damaging effect on speed11/4/09Comp 590/Comp 790-90Fall 200949

11 positions11 positions10 positions11/4/09Comp 590/Comp 790-90Fall 200950

Higher hit probability Lower expected number of random hits11/4/09Comp 590/Comp 790-90Fall 200951

Basic Searching Algorithm1. Select a group of spaced seed models2. For each hit of each model, conduct extension tofind a homology.11/4/09Comp 590/Comp 790-90Fall 200952

BLAT (BLAST-Like Alignment Tool) Same idea as BLAST - locate short sequence hitsand extend11/4/09Comp 590/Comp 790-90Fall 200953

BLAT builds an index of the database and scanslinearly through the query sequence, whereasBLAST builds an index of the query sequenceand then scans linearly through the database Index is stored in RAM which is memoryintensive, but results in faster searches11/4/09Comp 590/Comp 790-90Fall 200954

Steps:1. Break cDNA into 500 base chunks.2. Use an index to find regions in genome similar to eachchunk of cDNA.3. Do a detailed alignment between genomic regions andcDNA chunk.4. Use dynamic programming to stitch together detailedalignments of chunks into detailed alignment of whole.A sophisticated divide and conquer approach11/4/09Comp 590/Comp 790-90Fall 200955

BLAT was designed to find sequences of 95%and greater similarity of length 40; may missmore divergent or shorter sequence alignments11/4/09Comp 590/Comp 790-90Fall 200956

PatternHunter is 5-100 times faster than Blastn,depending on data size, at the same sensitivity BLAT is several times faster than BLAST, butbest results are limited to closely relatedsequences11/4/09Comp 590/Comp 790-90Fall 200957

– PAM – BLOSUM DNA substitution matrices – DNA: less conserved than protein sequences – Less effective to compare coding regions at nucleotide level . 11/4/09 Comp 590/Comp 790

Related Documents:

590-209 Chevrolet Colorado, GMC Canyon 2008-04 590-210 Buick LaCrosse 2007-05 590-211 Pontiac Bonneville, LeSabre 2005 590-212 Buick Terraza, Chevrolet Uplander, Venture, Pontiac Montana, Saturn Relay 2008-05 590-213 Chevrolet Trailblazer, GMC Envoy 2004-02 590-214 Chevrolet Silverado, GMC Sierra 2002-01 Failure sets warning light Power Seat .

Song of St. Patrick – Haugen – G Comp II # 685 Taste and See – Haugen – G Comp II # 34 Taste and See – Moore – G Comp II # 827 The Love of The Lord – Joncas – G Comp II # 680 The Servant Song – Richard Gillard– – G Comp II # 661 We Have Been Told – Haas – G Comp II # 69

2016-17 HERI Faculty Survey Institutional Profile Report Full-time Undergraduate Faculty Total Men Women CIRP Construct Note: Significance * p .05, ** p .01, *** p .001 Page 1 of 76 1A. American University of Beirut Your Inst Comp 1 Comp 2 Your Inst Comp 1 Comp 2 Your Inst Comp 1 Comp 2

The boiler model and serial number are given on the boiler data label which is located on the right hand side of the chassis and visible after opening the controls door. The models covered by these instructions are:-Suprima 30L - G.C. No. 41 590 42 Suprima 40L - G.C. No. 41 590 43 Suprima 50L - G.C. No. 41 590 44 Suprima 60L - G.C. No. 41 590 45

APES Servo Electric Press Brake "Belt&Pulley" APES SERVO 31100 Model APES SERVO "BELT&PULLEY" No # 15040 20050 26080 31100 Bending Length mm 1530 2040 2550 3050 Bending Force Tons 40 50 80 100 Inside Frames mm 1790 2300 2810 3350 Daylight opening mm 590/505* 590/505* 590/505* 590/505* Max.

October 2019 5 Salary Tables 208-day schedule (cont.) LANE 3 2019-2020 2020-2021 2021-2022 2022-2023 2023-2024 Year Step Salary Total Comp. Salary Total Comp. Salary Total Comp. Salary Total Comp. Salary Total Comp. 7 70

Adding an Absence in KRONOS – COMP TIME Comp time can be entered by the hour and in increments as small as 15 minutes Comp can be used if an employee has less than 37.5 total hours at the end of the work week. Comp time will preferably be used before Leave time. If Comp time is used, the workweek must total t

of its Animal Nutrition Series. The Food and Drug Administration relies on information in the report to regulate and ensure the safety of pet foods. Other reports in the series address the nutritional needs of horses, dairy cattle, beef cattle, nonhuman primates, swine, poultry, fish, and small ruminants. Scientists who study the nutritional needs of animals use the Animal Nutrition Series to .