Sequence Pattern Lecture 7: Sequence Motif Discovery - Otago

1y ago
19 Views
2 Downloads
524.96 KB
5 Pages
Last View : Today
Last Download : 3m ago
Upload by : Julia Hutchens
Transcription

Sequence motif: definitionsCOSC 348: Computing for Bioinformatics In Bioinformatics, a sequence motif is a nucleotide oramino-acid sequence pattern that is widespread and hasbeen proven or assumed to have a biological significance.Lecture 7:Sequence Motif Discovery Once we know the sequence pattern of the motif, then wecan use the search methods to find it in the sequences (i.e.Boyer-Moore algorithm, Rabin-Karp, suffix trees, etc.)Lubica Benuskova The problem is to discover the motifs, i.e. what is the orderof letters the particular motif is comprised of.http://www.cs.otago.ac.nz/cosc348/12Sequence motif: notationsExamples of motifs in DNA The TATA promoter sequence is an example of a highlyconserved DNA sequence motif found in eukaryotes. Another example of motifs: binding sites for transcriptionfactors (TF) near promoter regions of genes, etc. Another notation: each ‘.’ signifies any single AA, and each ‘*’indicates one member of a closely-related AA family:Gene 1Gene 2Gene 3Gene 4Gene 5Binding sites for TF An example of a motif in a protein: N, followed by anything but P,followed by either S or T, followed by anything but P One convention is to write N{P}[ST]{P} where {X} meansany amino acid except X; and [XYZ] means either X or Y or Z. WDIND*.*P.*.D.F.*W***.**.IYS**.A.*H*S*WAMRN In the 1st assignment we have motifs like A?CG, where thewildcard ? Stands for any of A,U,C,G.34Sequence motif discovery from conservation Sequence motifs are conserved sequences of similar or identicalpatterns that may occur within nucleic acids (DNA, RNA) orproteins either– within different molecules produced by the same organism or– within molecules from multiple species of organisms In the case of cross-species conservation, conserved motifindicates that a particular sequence pattern may have beenconserved during evolution to perform certain function, thus¾ Motif conservation is the basis of motif discovery by studyingsimilar genes (or proteins) in different species;¾ A motif discovery program that considers phylogeneticconservation is named PhyloGibbs.5Motif discovery based on alignment profile analysis is another word for this. This is usually done by– first constructing a local alignment of multiple sequences,– after which the highly conserved regions are isolated, based ontheir high alignment scoringProteinIDConservedregion6

Motif discovery based on alignmentSequence logo and consensus sequence After the highly conserved regions are isolated, they are used toconstruct profile matrices for each conserved region. We can extract the so-called consensus sequence, i.e. the stringof most frequent letters: The profile matrix for a given motif contains frequency countsfor each letter at each position of the isolated conserved region. A graphical representation of the consensus sequence is called asequence logo: The height of different letters at the same position isproportional to their frequency in motifs: the better the baseconservation is at that position, the higher the letters will be.78Sequence logo and information theoryShannon’s information Sequence logo is often displayed using the information theory,i.e. Shannon’s information (in bits) is a measure of the information content I(xi)associated with the particular outcome xi of a random variable X, which canhave n values/outcomes, i.e. x1, x2, xn I(xi) log2 P(xi) log2 (1 / P(xi))– When choosing from 4 nucleotides, the probability P(xi) 1/4. Whenparticular nucleotide occurred, the amount of information is– I(‘nucleotide') log2 (1/(1/4)) log2 (4) 2 bits. If the possible values xi of variable X have probabilities P(xi)then Shannon’s expected information (entropy) is:nnH P( xi ) log2 P( xi )9i 1Discovering motifs without alignmentH P( xi ) log2 P( xi )Scoring motifs First, let’s assume we know where the motif starts in the referencesequence set. The motif start positions in sequences can be represented as theset s (s1, s2, s3, , st) where si is the position index Given s (s1, st), wealign t motifs of length lfrom all sequences Construct profile matrixla G g t a c T tC c A t a c g ta c g t T A g ta c g t C c A tC c g t a c g GA 3 0 1 0 3 1 1 0C 2 4 0 0 1 4 0 0lG 0 1 4 0 0 0 3 1[]count(k,i) maxT 0 0 0 5 1 0 1 4i 1 k { A ,T ,C ,G }Motif Score 3 4 4 5 3 4 3 4 3012 Motif Score(s) 1110i 1t

Let’s start againMotif finding problem Given a list of t sequences each of length n, find the “best”pattern of length l that appears in each of the t sequences. The problem is to find the starting positions s (s1, st) tomaximize the Score(s) of the resulting profile matrix. Let s (s1,.,st) be the set of starting positions for l-mers inour t sequences. Several kinds of profile matrices are used: A position frequency matrix (PFM) records the positiondependent frequency f of each letter, i.e. how many times a letteroccurs at a given position in N sequences. The strings corresponding to these starting positions will form:- 4 x l profile matrix* P for DNA- and 20 x l profile matrix* P for proteins A position probability matrix (PPM). When normalized to 1this frequency turns into a probability, i.e. P f / N. A position weight matrix (PWM): β { A ,C ,G ,T }f β k logProfile P Construct the profile P0.60.40.00.0qβ130.00.80.20.014Probability of l-mersla G g t a c T tC c A t a c g ta c g t T A g ta c g t C c A tC c g t a c g G Given s (s1, st), wealign t l-mers from allsequences andACGT* The profile matrix will be defined in terms of the probabilityof letters, and not as the count of letters.f 00.20.00.60.20.00.00.20.8 Pr(a P) is defined as the probability that an l-mer a wascreated by the Profile P.t If a is very similar to the consensus string (i.e. motif)then Pr(a P) will be high. If Pai,k is the probability of letter ai at position k, thenthe probablity of an l-mer a is equal to the product ofindividual probabilities Pai,k , i.e.:lPr( a P) Pai ,kk 115Scoring l-mers with a profile16Scoring l-mers with a profile (cont’d)Given a profile P Given a profile P 1/41/8G1/401/83/81/41/8Pr(aaacct P) ?0Pr(aaacct P) 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336461718

Scoring l-mers with a profile (cont’d)Motif – the P-most probable l-mer Define the P-most probable l-mer from a sequence as anl-mer in that sequence which has the highest probabilityof being created from the profile P.Given a profile: P 1/83/81/41/8 Task: given a sequence ctataaaccttacatc and theknown profile P, find the P-most probable 6-mer:Prob(aaacct P) 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .033646P Prob(atacag P) 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 /8G1/401/83/81/41/81920P-most probable l-mer (cont’d)P-most probable l-mer (cont’d)A1/27/83/801/80Compute Pr(a P) for every possible 6-mer:C1/801/25/83/80Window, Highlighted RedCalculationsT1/81/8001/47/8ctataaaccttacat1/8 x 1/8 x 3/8 x 0 x 1/8 x 00G1/401/83/81/41/8ctataaaccttacat1/2 x 7/8 x 0 x 0 x 1/8 x 000First try: c t a t a a a c c t t a c a t cSecond try: c t a t a a a c c t t a c a t cThird try: c t a t a a a c c t t a c a t cSlide the window to evaluate every possible 6-mer– brute force approachPr(a P)ctataaaccttacat1/2 x 1/8 x 3/8 x 0 x 1/8 x 0ctataaaccttacat1/8 x 7/8 x 3/8 x 0 x 3/8 x 00ctataaaccttacat1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8.0336ctataaaccttacat1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8.0299ctataaaccttacat1/2 x 0 x 1/2 x 0 1/4 x 00ctataaaccttacat1/8 x 0 x 0 x 0 x 0 x 1/8 x 00ctataaaccttacat1/8 x 1/8 x 0 x 0 x 3/8 x 00ctataaaccttacat1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8.00042122Dealing with zeroes and small probabiltiesP-most probable l-mer (cont’d) In our toy example Pr(a P) 0 in many cases. In practice, there willbe enough sequences so that the number of elements in the profilewith a frequency of zero is likely to be small but still we must ensurezeroes are taken care of.P-most probable 6-mer in the sequence is aaacct:Window, Highlighted RedCalculationsctataaaccttacat1/8 x 1/8 x 3/8 x 0 x 1/8 x 0Pr(a P)0ctataaaccttacat1/2 x 7/8 x 0 x 0 x 1/8 x 00ctataaaccttacat1/2 x 1/8 x 3/8 x 0 x 1/8 x 00ctataaaccttacat1/8 x 7/8 x 3/8 x 0 x 3/8 x 00ctataaaccttacat1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8.0336ctataaaccttacat1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8.02990ctataaaccttacat1/2 x 0 x 1/2 x 0 1/4 x 0ctataaaccttacat1/8 x 0 x 0 x 0 x 0 x 1/8 x 00ctataaaccttacat1/8 x 1/8 x 0 x 0 x 3/8 x 00ctataaaccttacat1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8.0004 There exist several techniques to equate zero to a very small numberso that one zero does not make the entire probability of a string zero.– The simplest one is to replace 0 with a small number, e.g. 1 / 10n.23 Another problem is that the product of small probabilities is a verysmall number. Thus, we replace the product with the sum oflogarithms:llPr( a P) Pai ,k log Pr( a P) log(Pai ,k )k 1k 124

Finding the profile P iterativelyP-most probable l-mers are motifsctataaacgttacatc Task: Find the P-mostprobable l-mer in eachof the sequences givenprofile t The P-most probable lmer is our tcacctccaaatcctttaca How do we find P?ggtcatcctttatcct25Finding the profile P ccactcacctccaaatcctttacaggtctacctttatcct Let l 6. Start at random positions (underlined)and calculate the initial profile P.26Comparing new and old profilesUse this initial profile tofind the P-mostprobable l-mer in eachsequence, and set thenew starting positionsaccording to thebeginnings of P-mostprobable l-mer in eachsequence. P-most probable l-mers form a new profile byrecalculating probabilities at all the 47/8G1/401/83/81/41/8 According to this new profile P, find the most probablel-mers in each sequence, set new positions and re-iterate.2827Algorithm for greedy profile motif searchSummary of greedy motif discoveryUse P-most probable l-mers to adjust new start positions until we reachthe “best” profile; this will be pronounced as the motif.Select random starting positions, then: Since we choose starting positions randomly, there is little chancethat our guess will be close to an optimal motif, meaning it willtake a very long time to find the optimal motif. In practice, this algorithm is run many times with the hope thatrandom starting positions will be close to the optimum solutionsimply by chance.1. Create a profile P from the l-mers at these starting positions.2. Find the P-most probable l-mer a in each sequence and change thestarting positions to the starting positions of a’s.3. Go to step 1 and re-iterate until we cannot increase the score anymore.29 The algorithm may be improved by heuristic knowledge, whereapproximately we should start or by more sophisticated statisticaltechniques, like Gibbs sampling that estimates the most probablestart positions for motifs, where we ought to start our iterativeprocess of motif discovery.30

The profile matrixfor a given motif contains frequency counts for each letter at each position of the isolated conserved region. 8 Sequence logo and consensus sequence We can extract the so-called consensus sequence, i.e. the string of most frequent letters: A graphical representation of the consensus sequence is called a sequence logo:

Related Documents:

Introduction of Chemical Reaction Engineering Introduction about Chemical Engineering 0:31:15 0:31:09. Lecture 14 Lecture 15 Lecture 16 Lecture 17 Lecture 18 Lecture 19 Lecture 20 Lecture 21 Lecture 22 Lecture 23 Lecture 24 Lecture 25 Lecture 26 Lecture 27 Lecture 28 Lecture

Lecture 1: A Beginner's Guide Lecture 2: Introduction to Programming Lecture 3: Introduction to C, structure of C programming Lecture 4: Elements of C Lecture 5: Variables, Statements, Expressions Lecture 6: Input-Output in C Lecture 7: Formatted Input-Output Lecture 8: Operators Lecture 9: Operators continued

Lecture 1: Introduction and Orientation. Lecture 2: Overview of Electronic Materials . Lecture 3: Free electron Fermi gas . Lecture 4: Energy bands . Lecture 5: Carrier Concentration in Semiconductors . Lecture 6: Shallow dopants and Deep -level traps . Lecture 7: Silicon Materials . Lecture 8: Oxidation. Lecture

TOEFL Listening Lecture 35 184 TOEFL Listening Lecture 36 189 TOEFL Listening Lecture 37 194 TOEFL Listening Lecture 38 199 TOEFL Listening Lecture 39 204 TOEFL Listening Lecture 40 209 TOEFL Listening Lecture 41 214 TOEFL Listening Lecture 42 219 TOEFL Listening Lecture 43 225 COPYRIGHT 2016

Partial Di erential Equations MSO-203-B T. Muthukumar tmk@iitk.ac.in November 14, 2019 T. Muthukumar tmk@iitk.ac.in Partial Di erential EquationsMSO-203-B November 14, 2019 1/193 1 First Week Lecture One Lecture Two Lecture Three Lecture Four 2 Second Week Lecture Five Lecture Six 3 Third Week Lecture Seven Lecture Eight 4 Fourth Week Lecture .

Class- VI-CBSE-Mathematics Knowing Our Numbers Practice more on Knowing Our Numbers Page - 4 www.embibe.com Total tickets sold ̅ ̅ ̅̅̅7̅̅,707̅̅̅̅̅ ̅ Therefore, 7,707 tickets were sold on all the four days. 2. Shekhar is a famous cricket player. He has so far scored 6980 runs in test matches.

Introduction to Quantum Field Theory for Mathematicians Lecture notes for Math 273, Stanford, Fall 2018 Sourav Chatterjee (Based on a forthcoming textbook by Michel Talagrand) Contents Lecture 1. Introduction 1 Lecture 2. The postulates of quantum mechanics 5 Lecture 3. Position and momentum operators 9 Lecture 4. Time evolution 13 Lecture 5. Many particle states 19 Lecture 6. Bosonic Fock .

astrology has nowhere to go but "up" as an undeniable diagnostic tool that will pinpoint areas to be tested. In doing so, it saves the patient pain and time—not to mention, money. '.it-. In the hospitals of Leningrad, Russia, the MST software is being utilized by physicians to help in patient diagnosis—especially in the areas of undiagnosible or hard-to-diagnose patients. My hope is that .