2y ago

23 Views

2 Downloads

2.03 MB

51 Pages

Transcription

Pairwise SequenceAlignmentStuart M. BrownNYU School of Medicinew/ slides byFourie Joubert

Protein Evolution“For many protein sequences, evolutionary historycan be traced back 1-2 billion years”-William Pearson When we align sequences, we assume that they share a commonancestor– They are then homologous Protein fold is much more conserved than protein sequence DNA sequences tend to be less informative than proteinsequences

Definition Homology: related by descent Homologous sequence positionsATTGCGC!CATTGCGC!à ATTGCGCATTGCGCà AT-CCGCà ATCCGC

Orthologous and paralogous Orthologous sequences differ because they arefound in different species (a speciation event) Paralogous sequences differ due to a geneduplication event Sequences may be both orthologous andparalogous

Pairwise Alignment The alignment of two sequences (DNA orprotein) is a relatively straightforwardcomputational problem.– There are lots of possible alignments. Two sequencescan always be aligned. Sequence alignments have to be scored. Often there is more than one solution with thesame score.

Methods of Alignment By hand - slide sequences on two lines of a wordprocessor Dot plot– with windows Rigorous mathematical approach– Dynamic programming (slow, optimal) Heuristic methods (fast, approximate)– BLAST and FASTA Word matching and hash tables0

Align by HandGATCGCCTA TTACGTCCTGGAC --- AGGCATACGTA GCCCTTTCGCYou still need some kind of scoring system to findthe best alignment

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoPercent Sequence Identity The extent to which two nucleotide or aminoacid sequences are invariantAC C TG A G – AGAC G TG – G C AGmismatch70% identicalindel

Dotplot:A dotplot gives an overview of all possible alignmentsSequence 2ATTCACATAl l l l l l l l l l l l l l l l l l l l l l l Tl l l l l l l A Cl A TTl l A CSequence 1GTl A C

Dotplot:In a dotplot each diagonal corresponds to a possible (ungapped) alignmentSequence 2ATTCACATAl l l l l l l l l l l l l l l l l l l l l l l Tl l l l l l l A Cl A Tl l A CTGTl A CSequence 1One possible alignment:T A C A T T A C G T A CA T A C A C T T A

Insertions / Deletions in a DotplotSequence 2 TACTGTCATTACTGTTCATSequence 1T A C T G - T C A T T A C T G T T C A T

Dotplot(Window 130 / Stringency 9)Hemoglobinβ-chainHemoglobin α-chain

Word Size AlgorithmT A C G G T A T GWord Size 3A C A G T A T CCTATGACAT A C G G T A T GA C A G T A T CT A C G G T A T GA C A G T A T CT A C G G T A T GA C A G T A T C T A C G G T A T G

Window / StringencyScore 11PTHPLASKTQILPEDLASEDLTIPTHPLAGERAIGLARLAEEDFGM Score ore ring Matrix FilteringMatrix: PAM250 Window 12Stringency 9

Dotplot(Window 18 / Stringency 10)Hemoglobinβ-chainHemoglobin α-chain

Considerations The window/stringency method is more sensitive than the wordsizemethod (ambiguities are permitted). The smaller the window, the larger the weight of statistical(unspecific) matches. With large windows the sensitivity for short sequences is reduced. Insertions/deletions are not treated explicitly.

Alignment methods Rigorous algorithms Dynamic Programming– Needleman-Wunsch (global)– Smith-Waterman (local) Heuristic algorithms(faster but approximate) BLAST FASTA

Basic principles of dynamic programming- Creation of an alignment path matrix- Stepwise calculation of score values- Backtracking (evaluation of the optimal path)

Dynamic Programming Dynamic Programming is a very generalprogramming technique. " It is applicable when a large search space canbe structured into a succession of stages,such that: "– the initial stage contains trivial solutions to subproblems"– each partial solution in a later stage can becalculated by recurring a fixed number ofpartial solutions in an earlier stage"– the final stage contains the overall solution "

Creation of an alignment path matrixIdea:Build up an optimal alignment using previous solutions foroptimal alignments of smaller subsequences Construct matrix F indexed by i and j (one index for each sequence) F(i,j) is the score of the best alignment between the initialsegment x1.i of x up to xi and the initial segment y1.j of y up to yj Build F(i,j) recursively beginning with F(0,0) 0

Creation of an alignment path matrix If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j) Three possibilities: xi and yj are aligned, F(i,j) F(i-1,j-1) s(xi ,yj) xi is aligned to a gap, F(i,j) F(i-1,j) - d yj is aligned to a gap, F(i,j) F(i,j-1) - d The best score up to (i,j) will be the largest of the three options

80-8-2-9-17-25-33-42-49-57-65-73A -16-10-3-4-12-20-28-36-44-52-60W -24-18-11-6-7-15-5-13-21-29-37H -32-14-18-13-8-9-13-7-3-11-19E -40-22-8-16-16-9-12-15-73-5A -48-30-16-3-11-11-12-12-15-52E -56-38-24-11-6-12-14-15-12-91POptimal global alignment: HEAG AWGHE- E--P- AW-HEA E

Global vs. Local Alignments Global alignment algorithms start at thebeginning of two sequences and add gaps to eachuntil the end of one is reached. Local alignment algorithms finds the region (orregions) of highest similarity between twosequences and build the alignment outward fromthere.

Global AlignmentTwo closely related sequences:needle (Needleman & Wunsch) creates an end-to-end alignment.

Global AlignmentTwo sequences sharing several regions of local similarity:1 AAAGAGGAGGTAGACCG. 67 1 CTAAAGCGTCAGCGAGACCG 70

Global Alignment(Needleman -Wunsch) The the Needleman-Wunsch algorithm creates aglobal alignment over the length of both sequences(needle) Global algorithms are often not effective for highlydiverged sequences - do not reflect the biologicalreality that two sequences may only share limitedregions of conserved sequence.– Sometimes two sequences may be derived from ancientrecombination events where only a single functional domainis shared. Global methods are useful when you want to forcetwo sequences to align over their entire length

Local Alignment(Smith-Waterman) Local alignment– Identify the most similar sub-region sharedbetween two sequences– Smith-Waterman– EMBOSS: water

Parameters of Sequence AlignmentScoring Systems: Each symbol pairing is assigned anumerical value, based on a symbolcomparison table.Gap Penalties: Opening: The cost to introduce a gap Extension: The cost to elongate a gap

DNA Scoring Systems-very simpleSequence aaggacttaaagactSequence 2AGCTA1000G0100C0010T0001Match: 1Mismatch: 0Score 5

Protein Scoring SystemsSequence 1PTHPLASKTQILPEDLASEDLTISequence 2PTHPLAGERAIGLARLAEEDFGMScoringmatrixCCSTPAGN9S -14T-115P -3-1-17A010-14G -30-2-206N -310-2-205D -30-1-1-2-11.D6.T:G -2T:T 5Score 48

Protein Scoring Systems Amino acids have different biochemical and physical propertiesthat influence their relative replaceability in evolution.aliphaticLhydrophobicPC S archarged

Protein Scoring Systems Scoring matrices reflect:– # of mutations to convert one to another– chemical similarity– observed mutation frequencies– the probability of occurrence of each amino acid Widely used scoring matrices: PAM BLOSUM

PAM matricesØ Family of matrices PAM 80, PAM 120,PAM 250The number with a PAM matrix represents theevolutionary distance between the sequences onwhich the matrix is basedØ Ø Greater numbers denote greater distances

PAM (Percent Accepted Mutations) matrices The numbers of replacements were used to compute a so-calledPAM-1 matrix. The PAM-1 matrix reflects an average change of 1% of all aminoacid positions. PAM matrices for larger evolutionary distances canbe extrapolated from the PAM-1 matrix. PAM250 250 mutations per 100 residues. Greater numbers mean bigger evolutionary distance.

PAM (Percent Accepted Mutations) matrices Derived from global alignments of protein families . Family membersshare at least 85% identity (Dayhoff et al., 1978). Construction of phylogenetic tree and ancestral sequences ofeach protein family Computation of number of replacements for each pair of amino acids

PAM -4-3056

PAM - limitationsØ BasedØ ExaminesØ Basedon only one original datasetproteins with few differences(85% identity)mainly on small globular proteinsso the matrix is biased

BLOSUM matricesDifferent BLOSUMn matrices are calculatedindependently from BLOCKS (ungapped localalignments)Ø Ø BLOSUMn is based on a cluster of BLOCKS ofsequences that share at least n percent identityØ BLOSUM62 represents closer sequences thanBLOSUM45

BLOSUM (Blocks Substitution Matrix) Derived from alignments of domains of distantly relatedproteins (Henikoff & Henikoff,1992).AACEC Occurrences of each amino acid pairin each column of each block alignmentis counted. The numbers derived from all blocks wereused to compute the BLOSUM matrices.AACECA-C 4A-E 2C-E 2A-A 1C-C 1

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.infoThe Blosum50 Scoring Matrix

BLOSUM (Blocks Substitution Matrix) Sequences within blocks are clustered according to their level of identity. Clusters are counted as a single sequence. Different BLOSUM matrices differ in the percentage of sequence identityused in clustering. The number in the matrix name (e.g. 62 in BLOSUM62) refers to thepercentage of sequence identity used to build the matrix. Greater numbers mean smaller evolutionary distance.

PAM Vs. BLOSUMPAM100PAM120PAM160PAM200PAM250 BLOSUM90BLOSUM80BLOSUM60BLOSUM52BLOSUM45More distant sequencesPAM120 for general usel PAM60 for close relationsl PAM250 for distant relationsl BLOSUM62 for general usel BLOSUM80 for close relationsl BLOSUM45 for distant relationsl

TIPS on choosing a scoring matrix Generally, BLOSUM matrices perform better than PAM matricesfor local similarity searches (Henikoff & Henikoff, 1993). When comparing closely related proteins one should use lowerPAM or higher BLOSUM matrices, for distantly related proteinshigher PAM or lower BLOSUM matrices. For database searching the commonly used matrix is BLOSUM62.

Scoring Insertions and DeletionsA T G T A A T G C AT A T G T G G A A T G AA T G T - - A A T G C AT A T G T G G A A T G Ainsertion / deletionThe creation of a gap is penalized with a negative score value.

Why Gap Penalties?Gaps not permittedScore:1 GTGATAGACACAGACCGGTGGCATTGTGG 29 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29Gaps allowed but not penalized0Match 5Mismatch -4Score: 881 GTG.ATAG.ACACAGA.CCGGT.GGCATTGTGG 29 1 GTGTAT.GGA.AGAGATACC.TCCG.ATGGTTG 29

Why Gap Penalties? Theoptimal alignment of two similar sequences is usuallythat which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two- adding gaps reduces mismatches Permitting the insertion of arbitrarily many gaps can leadto high scoring alignments of non-homologoussequences. Penalizing gaps forces alignments to have relatively fewgaps.

Gap Penalties How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end upwith nonsense alignments. In real sequences, muti-base (or amino acid) gaps arequit common genetic insertion/deletion events “Affine” gap penalties give a big penalty for eachnew gap, but a much smaller “gap extension” penalty.

Scoring Insertions and Deletionsmatch 1mismatch 0Total Score:4A T G T T A T A CT A T G T G C G T A T ATotal Score:8 - 3.2 4.8Gap parameters:d 3 (gap opening)e 0.1 (gap extension)g 3 (gap lenght)γ(g) -3 - (3 -1) 0.1 -3.2A T G T - - - T A T A CT A T G T G C G T A T Ainsertion / deletion

Modification of Gap PenaltiesScore Matrix: BLOSUM62gap opening penaltygap extension penaltyscore 3 0.1 6.31 .VLSPADKFLTNV 12 1 VFTELSPAKTV. 11gap opening penaltygap extension penaltyscore 0 0.1 11.31 V.LSPADKFLTNV 12 1 VFTELSPA.K.T.V 11

PAM BLOSUM Protein Scoring Systems. PAM matrices Family of matrices PAM 80, PAM 120, PAM 250 The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based Greater numbers denote greater distances .

Related Documents: