Pairwise Sequence Comparison

2y ago
6 Views
2 Downloads
2.85 MB
66 Pages
Last View : 1m ago
Last Download : 5m ago
Upload by : Shaun Edmunds
Transcription

Pairwise Sequence ComparisonDotplots, scoring matrices, dynamicprogrammingIan Hoffeckerian.hoffecker@ki.seDepartment of Medical Biochemistry &BiophysicsKarolinska Institutet, Stockholm, Sweden

INTRODUCTIONBackground: Sequencing techniques from the 1970s Full genomes: drosophila, human, caenorhabditis, saccharomyces,various bacteria, viruses Databases GenBank, UniProt, etc Many sequences available, much unknown aboutfunction of proteins Converting existing data into useful biologicalknowledge is today’s challenge2Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

INTRODUCTIONOUTLINE – analyzing matches & aligning sequences1. Dotplots – comparing sequence similarity visually Window size Tolerance2. Score Matrices – quantifying sequence similarity PAM BLOSUM Pairwise distance, hierarchy, and information theory3. Alignment types4. Dynamic Programming – break problem into small ones Global optimal alignment - Needleman-Wunsch Local optimal alignment - Smith-Waterman3Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

INTRODUCTION Reasons for comparing 2 sequences: Determine if a gene is protein coding Determine if genes are of common descent – homology Infer structure Infer functionSequences: DNA RNA Protein4Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

INTRODUCTION What differences are we likely to encounter? Point mutations Insertions/delections (indels) Fusion of sequences DuplicationWhat processes are they?5Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

INTRODUCTION2 Sequences- are they different?- how different?- and in what way?6Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplots 2D visualization of sequence similarity No convention for the order of sequence labeling If agreement between two elements/bases/amino acidsexists – place a mark “dot”7Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot versus linear inspection sequence 1 gagaaggagaa'sequence 2 gagcaagcgaa'8Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot versus linear inspection sequence 1 gagaaggagaa'sequence 2 gagcaagcgaa'9Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsNoise depends on alphabet size and redundancy ofbases DNA bases: 4 Amino acids: 20 Codons – triplets of DNA bases Sliding window size as a filtering approach:10Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot versus linear inspection sequence 1 gagaaggagaa'sequence 2 gagcaagcgaa'11Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot versus linear inspection sequence 1 gagaaggagaa'sequence 2 gagcaagcgaa'12Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot versus linear inspection sequence 1 gagaaggagaa'sequence 2 gagcaagcgaa'13Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot versus linear inspection sequence 1 gagaaggagaa'sequence 2 gagcaagcgaa'14Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsComparisons with different sizes – scanning sequence 4 'gatcgatc'sequence 3 gatcgatcgcaagcgaa'15Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsTolerating mismatch- Count the number ofmismatches in the ATGACCTT'- Max score is the window size- Thresholdingwindow size:[[8 0 1 .,[0 8 0 .,[1 0 8 .,.,[0 2 0 .,[0 0 2 .,[1 0 0 .,80 0 1]2 0 0]0 2 0]8 0 0]0 8 0]0 0 8]]0816Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identity17Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identityb. duplication18Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identityb. duplicationc. palindrome19Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identityb. duplicationc. palindromed. partial palindrome20Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identityb. duplicationc. palindromed. partial palindromee. microsatellite repeats21Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identityb. duplicationc. palindromed. partial palindromee. microsatellite repeatsf. minisatellite repeats (patterns)22Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identityb. duplicationc. palindromed. partial palindromee. microsatellite repeatsf. minisatellite repeats (patterns)g. homology23Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34a. identityb. duplicationc. palindromed. partial palindromee. microsatellite repeatsf. minisatellite repeats (patterns)g. homologyh. insertion seq2, deletion seq124Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsDotplot zoo – interpretation by recognition of visual featuresH T T P://T I N Y U R L.C O M/6X Y WQ34sequence 'I got, I got, I got, I got Loyalty, got royalty inside my DNA Cocaine quarterpiece, got war and peace inside my DNA I got power, poison, pain and joy inside my DNA Igot hustle though, ambition, flow, inside my DNA I was born like this, since one like thisImmaculate conception I transform like this, perform like this Was Yeshuas new weapon Idont contemplate, I meditate, then off your fucking head This that put-the-kids-to-bed Thisthat I got, I got, I got, I got Realness, I just kill shit cause its in my DNA I got millions, I gotriches buildin’ in my DNA I got dark, I got evil, that rot inside my DNA I got off, I gottroublesome, heart inside my DNA I just win again, then win again like Wimbledon, I serveYeah, thats'25Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

DotplotsSoftware options R packages: Seqinr DotplotPython Github.com/kn-bibs/dotplotBrowser based Dotplotter DotmatcherDownloadable Synmap GepardMaking your own ian’s github: Github.com/Intertangler/bioinformatics ith26Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

INTRODUCTIONOUTLINE – analyzing matches & aligning sequences1. Dotplots – comparing sequence similarity visually Window size Tolerance2. Score Matrices – quantifying sequence similarity PAM BLOSUM Pairwise distance, hierarchy, and information theory3. Alignment types4. Dynamic Programming – break problem into small ones Global optimal alignment - Needleman-Wunsch Local optimal alignment - Smith-Waterman27Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesScoring a matchAssessing multiplecandidates for an alignment28Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring Matrices29Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesSCORE – How good is an alignment30Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesThe substitution matrix – a linguistic analogy: biology vs biologi31Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesThe substitution matrix – a linguistic analogy32Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesThe score matrixNow try scoring some different alignments.--biologybiologI-score S(b,o) S(i,l) S(o,o) S(l,g) S(o,i) -1 -1 1 -1 0 -2biologybiologiscore S(b,b) S(i,i) S(o,o) S(l,l) S(o,o) S(g,g) S (y,i) 6Note that maximum possible score with our matrix is 7.33Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesThe Meaning Behind Scores The random model assumes that mutations occur depending only on theamino acid frequency e.g. p(I L) is the same as p(E L) Nonrandom models assume a strong correlation between the residues inthe alignment in question Odds ratio – the ratio of probability of the nonrandom vs the random modelsproducing the alignment Logarithms are used to generate better behaved mathematics Logarithms usually indicative of hierarchy or information (e.g. bitsrepresenting forks in a decision tree) and here we have a structuralconnection to phylogenetics or evolutionary changes34Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesTypes of scoring systems: Identity Positive scores for identical residues only Still used for nucleotides Similarity Positive scores for similar residues Based on chemical or mutational similarity Deriving scores Theoretical, count mutations Physicochemical properties Empirical – evidence from evolution35Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring Matrices 36Introduced by Margaret Dayhoff1978PAM point accepted mutationsBased on observed amino acidsubstitution frequencies infamilies of proteins Globins Cytochrome InsulinMultiple alignments amongmembers of the protein family1 PAM 1 accepted amino acidchange per 100 residuesTo obtain probabilities for higheraverage rates of mutation –multiply the matrix by itselfPAM-250 is PAM raised to the250 power – 250 times theevolutionary distance of PAMIan T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesPAM is an empirically calibrated model of evolution 37Multiple alignmentsCalculate the exposure of each amino acid type tomutation Fraction of the residue type in the alignment multipliedby total number of mutations of any kind per 100alignment positionsCalculate the mutability of a residue type Ratio of the total number of mutations in the datainvolving that residue divded by the total exposure ofthat residue Reported relative to alanine – the relative mutability m aMake a phylogenetic tree (family tree) Mutations determine most likely positions in the treeTabulate observed substitutions in matrix A, acceptedpoint mutation matrix – a tally of mutations basically overa decided length of evolutionary timeAssemble a mutation probability matrix MCreate scoring matrix based on odds ratio of the modelvs random chanceIan T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesScore matrix for amino acid statistical frequencies 38BLOSUM BLOck SUBstitutionMatrixDerived from multiple localalignments BLOCKS database 1991 Ungapped multiple localalignments Conserved protein regionsGroup sequences according topercent similarity and look atsubstitutionsDoes not model realsubstitutionsShould reflect evolutionaryevents if alignment is correctNumber (eg in BLOSUM 62)represents threshold of percentidentity used to clustersequences that are too relatedand biasing the substitutionratesIan T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesPAM in action39Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesPAM in action40Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesBLOSUM62 in action41Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesChoosing a MatrixPAM models – evolutionary distanceBLOSUM – sequence similarityRules of thumb Choice depends on situation Distantly related PAM-250, BLOSUM-50 Closely related PAM-120, BLOSUM-80 Short sequences PAM-40, BLOSUM-80 BLOSUM62 standard for ungappedmatching BLOSUM50 better for gapped42Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Scoring MatricesScoring Schemes for nucleotides43Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

GapsGaps Insertions and deletions Homologous sequences have different lengths due toinsertions and deletions Treat by inserting gaps in the alignment Gap penalties Gaps can be overused to create false alginments -L---INE ALAN I N E Gap penalty – gap opening Introduction of a gap costs in score Limits number of inserted gaps Gap extension penalty Smaller than opening penalty Once a gap, easier to extend Magnitude depends on setting High for closely related sequences Low for distantly related sequences44Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

GapsAlignment B has higher similarity than A but is notnecessarily correct45Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

GapsHeuristic scores46Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

GapsHeuristic scores47Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

AlignmentOUTLINE – analyzing matches & aligning sequences1. Dotplots – comparing sequence similarity visually Window size Tolerance2. Score Matrices – quantifying sequence similarity PAM BLOSUM Pairwise distance, hierarchy, and information theory3. Alignment types4. Dynamic Programming – break problem into small ones Global optimal alignment - Needleman-Wunsch Local optimal alignment - Smith-Waterman48Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

AlignmentAffine types49Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

AlignmentLocal alignment50Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

AlignmentLocal &Globalalignment51Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

AlignmentNeedleman Wunsch52Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

AlignmentNeedleman Wunsch53Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedleman-WunschGlobal alignment54Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedleman-WunschGlobal alignment55Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedleman-WunschGlobal alignment56Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedleman-WunschGlobal alignment57Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedleman-WunschGlobal alignment58Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedleman-WunschGlobal alignmentThe Tracebackbegins at thehighest scoringend value –follow arrowsof highestscores back to059Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingChanging the gap penalty value – must be compatible with substitution matrixGap penalty -860Gap penalty -4Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedlemanWunschNo penalty for endgaps61Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingNeedleman-Wunsch Construct a matrix with scores Begin by filling in end gap scores according to either Fixed gap penalty n*d Forgiving end gaps 0 Compute hypothetical alignment scores from boxes with 3 corner-formingneighbors that already have values beginning from the far corner Accept highest scores and directions so that each box has a winner Iterate until matrix is full Perform a traceback Start from highest scoring element – somewhere on the end axes Follow arrows and highest scores backward to zero Fill out the alignment transcript as you go62Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingSmith-Waterman Local Alignment63Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingSmith-WatermanLocal Alignment64Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Dynamic ProgrammingGap penalty -4Gap penalty -865Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

Pairwise Comparison and AlignmentSummary Comparisons Dot plots are a visual way to compare a pair Score matrices are built from empirical data about substitution frequencies Gaps can be incorporated into score matrices heuristically PAM is based on closely related evolutionary changes PAM 1, 100, 250 refer to evolutionary distances, powers of PAM BLOSUM is based on conserved domains in distantly related proteins BLOSUM 50, 62 etc are % threshold identities to cluster sequences Pairwise alignments Local alignments give optimal alignment even for multidomain proteins withunrelated segments Smith waterman Global alignments give optimal alignment for similar sized proteins withmostly corresponding positions Needleman-Wunsch66Ian T. Hoffecker, PhD Stockholm March 2018ian.hoffecker@ki.se

OUTLINE – analyzing matches & aligning sequences 1.Dotplots – comparing sequence similarity visually Window size Tolerance 2.Score Matrices – quantifying sequence similarity PAM BLOSUM Pairwise distance, hierarchy, and information theory 3.Alignment types

Related Documents:

bitopological space has been introduced by Pervin[10]. The study of supra topology was . Gowri and Jegadeesan[2] studied the concept of pairwise connectedness in soft biCechˇ closure spaces. The purpose of this article is to introduce and . Supra Pairwise Connected and Pairwise Semi-Connected Spaces

Therefore, (ˆ ,/ ,/ ) is pairwise connected. Theorem 3.12. If soft biČech closure space is pairwise disconnected such that ˆ / and let 2 be a pairwise connected soft subset of ˆ then 2 need not to be holds the following conditions ( )2 ( )2 . Proof.

pairwise testing in situations where supposedly independent options might interact In our previous webinars, decision tables, state based methods, and use cases Pairwise techniques allow us to cover combinations in a manageable way These advanced three techniques allow you to perform a wide range of important tests AST-Pairwise Testing www.rbcs .

In order to address this issue, a number of pairwise testing (and sampling) strategies have been developed in the literature in the past 15 years. In this paper, we propose and evaluate a novel pairwise strategy called pairwise harmony search algorithm-based . result, many strategies (and their tool implementations) have been designed in the .

Pairwise testing has become an indispensable tool in a software tester's toolbox. This paper pays special attention to usability of the pairwise testing technique. In this paper, we propose a new test generation strategy for pairwise testing using Genetic Algorithm (GA). We compare the result with the random

a pairwise test suite generator tool based on Harmony Search Algorithm (HS-PTSGT). Pairwise testing is a combinatorial technique used to reduce the number of test case inputs to a system, consider all interactions of at most two factors (McCaffrey, J. D., 2009). Pairwise testing normally begins by choosing individual inputs variables

procedures [2]. The linguistic pairwise comparison matrix becomes inconsistent when the transitivity and reciprocity rules are violated [3]. Formally, a positive reciprocal com-parison matrix A (a ij) n n is consistent, if a ika kj a ij, 1 i;j;k n[4]. Since in most cases it is too difficult for evaluators to give a comparison matrix without .

Practical application of biology is of utmost importance in the field of physiology, neurology, biochemistry, cardiology, zoology, pisciculture, epiculture, sericulture etc. Therefore it is necessary to have a firm grip over such an extensive subject and its practical application. Hence we bring to you “Std. XII Sci. : BIOLOGY PRACTICAL HANDBOOK” a handbook which is a complete and thorough .