Sequence Alignment

2y ago
66 Views
3 Downloads
448.28 KB
42 Pages
Last View : 21d ago
Last Download : 3m ago
Upload by : Julia Hutchens
Transcription

Sequence alignmentlAlignment specifies which positions in two sequencesmatchacgtctag actctag-acgtctag -actctagacgtctag ac-tctag2 matches5 mismatches1 not aligned5 matches2 mismatches1 not aligned7 matches0 mismatches1 not alignedIntroduction to bioinformatics, Autumn 200741

Mutations: Insertions, deletions andsubstitutionsIndel: insertion ordeletion of a basewith respect to theancestor sequencelacgtctag -actctagMismatch: substitution(point mutation) ofa single baseInsertions and/or deletions are called indels We can’t tell whether the ancestor sequence had a base ornot at indel positionIntroduction to bioinformatics, Autumn 200742

ProblemslWhat sorts of alignments should be considered?lHow to score alignments?lHow to find optimal or good scoring alignments?lHow to evaluate the statistical significance of scores?In this course, we discuss each of these problemsbriefly.Course Biological sequence analysis tackles all four indepth.Introduction to bioinformatics, Autumn 200743

Sequence Alignment (chapter 6)lThe biological problemlGlobal alignmentlLocal alignmentlMultiple alignmentIntroduction to bioinformatics, Autumn 200744

Global alignmentlProblem: find optimal scoring alignment between twosequences (Needleman & Wunsch 1970)lEvery position in both sequences is included in the alignmentlWe give score for each position in alignmentl Identity (match) 1WHAT Substitution (mismatch)-µ IndelWH-YTotal score: sum of position scoresS(WHAT/WH-Y) 1 1 –Introduction to bioinformatics, Autumn 2007–µ45

Dynamic programminglllHow to find the optimal alignment?We use previous solutions for optimal alignments ofsmaller subsequencesThis general approach is known as dynamicprogrammingIntroduction to bioinformatics, Autumn 200746

Introduction to dynamic programming:the money change problemlllSuppose you buy a pen for 4.23 and pay for with a5 noteYou get 77 cents in change – what coins is the cashiergoing to give you if he or she tries to minimise thenumber of coins?The usual algorithm: start with largest coin(denominator), proceed to smaller coins until nochange is left: l50, 20, 5 and 2 centsThis greedy algorithm is incorrect, in the sense that itdoes not always give you the correct answerIntroduction to bioinformatics, Autumn 200747

The money change problem How else to compute the77change?50205 We could consider all possible275772ways to reduce the amount ofchange Suppose we have 77 cents7 22 7 37 52 22 52 67change, and the followingcoins: 50, 20, 5 cents We can compute the change Many values are computedwith recursionmore than once! Figure shows the recursion This leads to a correct buttree for the examplevery inefficient algorithmIntroduction to bioinformatics, Autumn 200748

The money change problemlWe can speed the computation up by solving thechange problem for all i n llExample: solve the problem for 9 cents with available coinsbeing 1, 2 and 5 centsSolve the problem in steps, first for 1 cent, then 2cents, and so onIn each step, utilise the solutions from the previousstepsIntroduction to bioinformatics, Autumn 200749

The money change problemAmount ofchange leftll0123456789Algorithm runs in time proportional to Md, where M isthe amount of change and d is the number of cointypesThe same technique of storing solutions ofsubproblems can be utilised in aligning sequencesIntroduction to bioinformatics, Autumn 200750

Representing alignments and scoresAlignments can berepresented in thefollowing tabular form.Each alignmentcorresponds to a paththrough the table.-WHAXXTWWHATH YXXWH-YIntroduction to bioinformatics, Autumn 200751

Representing alignments and scores-WHAT -WH-YHA22-T0WHGlobal alignmentscore S3,4 2- -µWYIntroduction to bioinformatics, Autumn 200712- -µ52

Filling the alignment matrix-WHAWConsider the alignment processat shaded square.Case 1. Align H against H(match or substitution).Case 1Case 2HCase 3YTCase 2. Align H in WHY against– (indel) in WHAT.Case 3. Align H in WHATagainst – (indel) in WHY.Introduction to bioinformatics, Autumn 200753

Filling the alignment matrix (2)-WHATScoring the alternatives.Case 1. S2,2 S1,1 s(2, 2)-Case 2. S2,2 S1,2WCase 1Case 2s(i, j) 1 for matching positions,HCase 3YCase 3. S2,2 S2,1s(i, j) - µ for substitutions.Choose the case (path) thatyields the maximum score.Keep track of path choices.Introduction to bioinformatics, Autumn 200754

Global alignment: formaldevelopmentA a1a2a3 an,B b1b2b3 bmb1b2b3b4--a1-a2a3Any alignment can be writtenas a unique path through thematrixlScore for aligning A and B upto positions i and j:0-1a12a23a301234-b1b2b3b4lSi,j S(a1a2a3 ai, b1b2b3 bj)Introduction to bioinformatics, Autumn 200755

Scoring partial alignmentslAlignment of A a1a2a3 an with B b1b2b3 bm can end inthree ways Case 1: (a1a2 ai-1) ai(b1b2 bj-1) bj Case 2: (a1a2 ai-1) ai(b1b2 bj) - Case 3: (a1a2 ai) –(b1b2 bj-1) bjIntroduction to bioinformatics, Autumn 200756

Scoring alignmentslScores for each case: Case 1: (a1a2 ai-1) ai(b1b2 bj-1) bj s(ai, bj) { -µ otherwiseCase 2: (a1a2 ai-1) ai(b1b2 bj) – 1 if ai bjCase 3: (a1a2 ai) –s(ai, -) s(-, bj) -(b1b2 bj-1) bjIntroduction to bioinformatics, Autumn 200757

Scoring alignments (2) First row and first columncorrespond to initial alignmentagainst indels:S(i, 0) -iS(0, j) -j0 Optimal global alignmentscore S(A, B) Sn,m-01234-b1b2b3b4-2-3-401a12a2 -23a3 -3Introduction to bioinformatics, Autumn 200758

Algorithm for global alignmentInput sequences A, B, n A , m B Set Si,0 : - i for all iSet S0,j : - j for all jfor i : 1 to nfor j : 1 to mSi,j : max{Si-1,j – , Si-1,j-1 s(ai,bj), Si,j-1 – }endendAlgorithm takes O(nm) time and space.Introduction to bioinformatics, Autumn 200759

Global alignment: example-TGGTG-2-4-6-8-10µ 1-0 2A-2T-4C-6G-8T-10Introduction to bioinformatics, Autumn 2007?60

Global alignment: example (2)-TGGTGµ 1-0-2-4-6-8-10 3-4T-10-7-4-30-2ATCGT -TGGTGIntroduction to bioinformatics, Autumn 200761

Sequence Alignment (chapter 6)lThe biological problemlGlobal alignmentlLocal alignmentlMultiple alignmentIntroduction to bioinformatics, Autumn 200762

Local alignment: rationale Otherwise dissimilar proteins may have local regions ofsimilarity- Proteins may share a functionHuman bonemorphogenic proteinreceptor type IIprecursor (left) has a300 aa region thatresembles 291 aaregion in TGFreceptor (right).The shared functionhere is protein kinase.Introduction to bioinformatics, Autumn 200763

Local alignment: rationaleABRegions ofsimilarity Global alignment would be inadequate Problem: find the highest scoring local alignmentbetween two sequences Previous algorithm with minor modifications solves thisproblem (Smith & Waterman 1981)Introduction to bioinformatics, Autumn 200764

From global to local alignmentlModifications to the global alignment algorithm Look for the highest-scoring path in the alignment matrix(not necessarily through the matrix), or in other words: Allow preceding and trailing indels without penaltyIntroduction to bioinformatics, Autumn 200765

Scoring local alignmentsA a1a2a3 an, B b1b2b3 bmLet I and J be intervals (substrings) of A and B,respectively:,Best local alignment score:where S(I, J) is the score for substrings I and J.Introduction to bioinformatics, Autumn 200766

Allowing preceding and trailingindels First row and columninitialised to zero:Mi,0 M0,j 0b1 b2 b3- - a101234-b1b2b3b4000000-1a1 02a2 03a3 0Introduction to bioinformatics, Autumn 200767

Recursion for local alignment Mi,j max {Mi-1,j-1 s(ai, bi),Mi-1,j ,Mi,j-1 0Introduction to bioinformatics, Autumn 200768

Finding best local alignment Optimal score is the highestvalue in the matrix maxi,j Mi,j Best local alignment can befound by backtracking fromthe highest value in ntroduction to bioinformatics, Autumn 200769

Local alignment: C04T05A06A07G08G0Introduction to bioinformatics, Autumn 200770

Local alignment: exampleScoringMatch: 2Mismatch: -1Indel: -2C T – A AC T C A ion to bioinformatics, Autumn 200771

Non-uniform mismatch penaltieslWe used uniform penalty for mismatches:s(’A’, ’C’) s(’A’, ’G’) s(’G’, ’T’) µlTransition mutations (A- G, G- A, C- T, T- C) areapproximately twice as frequent than transversions (A T, T- A, A- C, G- T) use non-uniform mismatchACGTpenalties collected into aA1-1-0.5-1substitution matrixC-11-1-0.5G-0.5-11-1T-1-0.5-11Introduction to bioinformatics, Autumn 200772

Gaps in alignmentlGap is a succession of indels in alignmentC T – - - A AC T C G C A AllPrevious model scored a length k gap as w(k) -kReplication processes may produce longer stretchesof insertions or deletions In coding regions, insertions or deletions of codons maypreserve functionalityIntroduction to bioinformatics, Autumn 200773

Gap open and extension penalties (2)lWe can design a score that allows the penalty openinggap to be larger than extending the gap:w(k) -ll(k – 1)Gap open cost , Gap extension costOur previous algorithm can be extended to use w(k)(not discussed on this course)Introduction to bioinformatics, Autumn 200774

Amino acid sequenceslWe have discussed mainly dna sequenceslAmino acid sequences can be aligned as wellllHowever, the design of the substitution matrix is moreinvolved because of the larger alphabetMore on the topic in the course Biological sequenceanalysisIntroduction to bioinformatics, Autumn 200775

Sequence Alignment (chapter 6)lThe biological problemlGlobal alignmentlLocal alignmentlMultiple alignmentIntroduction to bioinformatics, Autumn 200776

Multiple alignment Consider a set of nsequences on the right– Orthologous sequences fromdifferent organisms– Paralogs from multipleduplications How can we studyrelationships between tion to bioinformatics, Autumn 200777

Optimal alignment of threesequenceslllllAlignment of A a1a2 ai and B b1b2 bj can endeither in (-, bj), (ai, bj) or (ai, -)22 – 1 3 alternativesAlignment of A, B and C c1c2 ck can end in 23 – 1ways: (ai, -, -), (-, bj, -), (-, -, ck), (-, bj, ck), (ai, -, ck), (ai,bj, -) or (ai, bj, ck)Solve the recursion using three-dimensional dynamicprogramming matrix: O(n3) time and spaceGeneralizes to n sequences but impractical withmoderate number of sequencesIntroduction to bioinformatics, Autumn 200778

Multiple alignment in practicellIn practice, real-world multiple alignment problems areusually solved with heuristicsProgressive multiple alignment Choose two sequences and align them Choose third sequence w.r.t. two previous sequences andalign the third against them Repeat until all sequences have been aligned Different options how to choose sequences and scorealignmentsIntroduction to bioinformatics, Autumn 200779

Multiple alignment in practicelProfile-based progressive multiple alignment:CLUSTALW Construct a distance matrix of all pairs of sequences usingdynamic programming Progressively align pairs in order of decreasing similarity CLUSTALW uses various heuristics to contribute toaccuracyIntroduction to bioinformatics, Autumn 200780

Additional materiallllR. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biologicalsequence analysisN. C. Jones, P. A. Pevzner: An introduction tobioinformatics algorithmsCourse Biological sequence analysis in Spring 2008Introduction to bioinformatics, Autumn 200781

Demonstration of the EBI web sitelEuropean Bioinformatics Institute (EBI) offers manybiological databases and bioinformatics tools athttp://www.ebi.ac.uk/Introduction to bioinformatics, Autumn 200782

Introduction to bioinformatics, Autumn 2007 49 The money change problem l We can speed the computation up by solving the change problem for all i n Example: solve the problem for 9 cents with available coins being 1, 2 and 5 cents l S

Related Documents:

Alignment REPORTER website www.alignment-reporter.com. 3.2.1. Installing the Windows software If using the Alignment REPORTER CD, place it in the CD-ROM drive. The Alignment REPORTER welcome screen should appear automatically. If not in possession of the CD, visit www.alignment-reporter.com to create an account and download the software.

The profile matrixfor a given motif contains frequency counts for each letter at each position of the isolated conserved region. 8 Sequence logo and consensus sequence We can extract the so-called consensus sequence, i.e. the string of most frequent letters: A graphical representation of the consensus sequence is called a sequence logo:

Sequence alignment To identify branch specific positive selection, it is necessary to obtain a query and a . reference alignment. We downloaded six high quality reference genomes from the . to build a sequence alignment. Next, we refined the alignment using a gene by gene tation(i.e.,ORF1a .

Perl Pipelines Using perl as bioinformatics glue Simon Prochnik with code from Scott Cain 1. Some of what cover in the course Gene prediction EST alignment to genome EST sequence Phylogenetics: paup, MrBayes, phyml, PAML multiple sequence alignment: clustalw, muscle pairwise sequence alignment:

Sequence comparison: -Find the best alignment of two sequences -Find the best match (alignment) of a given sequence in a large dataset of sequences -Find the best alignment of multiple sequences Motif and gene finding Relationship between sequences -Phylogeny Clustering and classification Many many many more

for each alignment (1). Estimates of phylogeny and inferences of pos-itive selection were sensitive to alignment treat-ment. Confirming previous studies showing that alignment method has a considerable effect on tree topology (12–14), we found that 46.2% of the 1502 ORFs had one or more differing trees depending on the alignment procedure used.

This procedure was developed for the alignment of the Wide Field Corrector for the Hobby Eberly Telescope, which uses center references to provide the data for the system alignment. From previous experiments, we determined that using an alignment telescope or similar instrument would not achieve the required alignment uncertainty.

Insert Chuck into Alignment Port Press the Alignment Button down and hold it. Match the either of the Alignment Guides on the Chuck with the 118 Notch on the Alignment Port. Insert the Chuck. While holding the button down, slide the drill bit forward until it touches the Drill Stop and the Chuck is pushed all the way into the Alignment Port .