Jun Liu - Harvard University

3y ago
7 Views
3 Downloads
685.60 KB
46 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Azalea Piercy
Transcription

Biological Sequence Analysisand Motif DiscoveryIntroductory Overview LectureJoint Statistical Meetings 2001, AtlantaJun LiuDepartment of StatisticsHarvard Universityhttp://www.fas.harvard.edu/ junliujliu@stat.harvard.eduBiological Sequence Analysis1

Topics to be covered Basic Biology: DNA, RNA, Protein; genetic code. Biological Sequence Analysis– Pairwise alignment --- dynamic programming Needleman-Wunsch Smith-Waterman Blast – Multiple sequence alignment Heuristic approaches The hidden Markov model– Motif finding in DNA and protein sequenceBiological Sequence Analysis2

Central ParadigmCourtesyof Doug BrutlagBiological Sequence Analysis3

Buliding Blocks of Biological Systems:nucleotides and amino acids DNA (nucleotides, 4 types): information carrier/encoder. RNA: bridge from DNA to protein. Protein (amino acids, 20 types): action molecules. Genetic code: deciphering genetic information.atgaatcgta ggggtttgaa cgctggcaat acgatgactt ctcaagcgaacattgacgac ggcagctgga aggcggtctc cgagggcgga MNRRGLNAGNTMTSQANIDDGSWKAVSEGG Biological Sequence Analysis4

DNA ical Sequence Analysis5

From DNA to ProteinBiological Sequence Analysis6

GeneticCodeBiological Sequence Analysis7

Main Resources for Data NCBI --- national center for ank maintainanceBLAST searching and serversEntrez databaseTaxonomy databaseStructure databaseBankit and SEQUIN submission softwareWeb access http://www.ncbi.nlm.nih.gov EMBL --- European equivalent of NCBI.– Web access http://www.embl-heidelberg.de– EBI --- outstation of EMBL. http://www.ebi.ac.uk/Biological Sequence Analysis8

Finding “Patterns” inBiological SequencesBiological Sequence Analysis9

Everything is related to everythingelse in a logical way .Biological Sequence Analysis10

A MotifBiological Sequence Analysis11

Brute-forceStringSearchBiological Sequence Analysis12

Boyer-Moore String SearchBiological Sequence Analysis13

Consensus in Sequences A database for protein families and patterns:– Prosite http://www.expasy.ch/prosite/Pattern: [LIVM]-[ST]-A-[STAG]-H-CInterpretation: A-x-[ST](2)-x(0,1)-V-{LI}This pattern, which must be in the Nterminal of the sequence ( '), means:Ala-any-[Ser or Thr]-[Ser or Thr](any or none)-Val-(any but Leu & Ile)Biological Sequence Analysis14

Pattern Finding Programs Prosite web program (http://www.expasy.ch/prosite/)– ScanProsite - Scan a sequence against PROSITE or a pattern againstSWISS-PROT– ProfileScan - Scan a sequence against the profile entries in PROSITE Motif web programs ( http://motif.stanford.edu/identify/)– IDENTIFY– SCAN GCG SEQWEB programs (commercial)– Stringsearch– Findpatterns– MotifsBiological Sequence Analysis15

Pairwise SequenceAlignment MethodsBiological Sequence Analysis16

The Sequence Alignment problemCATTGGiven a pair of sequences,how do we view them?TCATGCATTGA better A-TGWhat determines the choice? Gap & mismatch penaltiesBiological Sequence Analysis17

Biological Sequence Analysis18

Global: Needleman-Wunsch Algorithm Finding the optimal alignment via dynamic programming Example alignmentC A T T G TCATGCATT - - GoTC - - ATGoBiological Sequence Analysis19

Alignment Recursion F (i 1, j 1) s( xi , y j ) F (i, j ) max F (i 1, j ) γ F (i, j 1) γ E.g.,s ( x, y ) 2 I { x y } , γ 1F(i-1,j-1)F(i,j-1)F(i-1,j)F(i,j)Biological Sequence AnalysisTCATGC A0 -1 -2-1 0 -1-2 1 0-3 0 3-4 -1 3-5 -2 2T T G-3 -4 -50-12 15 44 5 620

Substitution/Scoring Matrices Pam matrices (Dayhoff et al. 1978) --- phylogeny-based.PAM1: expected number of mutation 1%PAM250 Biologicalmatrix,Sequencelog-oddsAnalysis representation21

BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992) Derived from a set (2000) of aligned and ungapped regions from proteinfamilies; emphasizing more on chemical similarities (versus how easy it isto mutate from one residue to another). BLOSUMx is derived from the setof segments of x% identity.BLOSUM62 Matrix, log-odds representationBiological Sequence Analysis22

Gap Penalties Linear score: (g) -gd– Typically: d 8, in unit of half-bits ( 4log2) Affine score: (g) -d-(g-1)e– d: gap opening penalty; e: gap extension penalty– Typical d 12, e 2 (in unit of half-bits) Gap penalty corresponds to log-probability ofopening a gaps. For example, under the standardlinear score, P(g k) exp(-dk) 2-4kBiological Sequence Analysis23

Local Alignment:Smith-Waterman Algorithm 0 F(i 1, j 1) s(x , y ) ijF (i, j) max F(i 1, j) d F(i, j 1) dBiological Sequence Analysis24

Sequence comparison and data base searchBiological Sequence Analysis25

BLAST(Altschul et al. 1990)QTVGLMIVYDDA Create a word list from the querry;– word length 3 for protein and 12 for DNA. For each listed word, find “neighboring words” ( 50), S (W , W ' ) T For each sequence in the database, search exact matches to eachword in the set. Extend the hits in both directions until score drops below X No gap allowed; use Karlin-Altschul statistics for significance New versions ( 1.4) of BLAST gives gapped alignments. Compute Smith-Waterman for “significant” alignments BLASTP (protein), BLASTN (DNA), BLASTX (pr DNA).Biological Sequence Analysis26

BLAST 2.0 Two word hits must be found within a window of Aresidues in order to trigger extension Gapped extension from the middle of ungapped HSPs Position-specific iterative (PSI-) BLAST.– Profile constructed on the fly and iteratively refined.– Begin with a single query, profile constructed from thosesignificant hits; use the profile to do another search, and iterate theprocedure till “convergence”Biological Sequence Analysis27

A Bayesian Model for Pairwise AlignmentMissing data --- Alignment matrixA i,j 1 if residue i of sequence 1 aligns withresidue j of sequence 2, 0 otherwise.Observed data pair of sequence R(1), R(2)(1)iP( R , RΨR1 , R2( 2)j A, Θ) Θ R (1) Θ R ( 2 ) Ai , j ΨR (1) , R ( 2 )iPAM or Blosumj Ai, jiBiological Sequence Analysis 1i Ai, jj 1j28

Multiple AlignmentBiological Sequence Analysis29

ClustalW Step 2: bBuild the TreeBiological Sequence Analysis30

Biological Sequence Analysis31

However . No explicit model to guide for thealignment --- heuristic driven. Tree construction has problems. Overall, not sensitive enough for remotelyrelated sequences.Biological Sequence Analysis32

The Hidden Markov Model For given zs, ys f(ys zs, θ), and the zs follow aMarkov process with transition ps(zs zs-1, φ).y1z1y2z2y3ysz3 .zsyt .ztπ t ( z t ) p ( z1 , l , zt y1 , l , y t ; φ , θ )“The State Space Model”Biological Sequence Analysis33

What Are Hidden in Sequence Alignment? HMM Architecture: transition diagram for theunderlying Markov chain.D1D2D3D4I0I1I2I3I4m0m1m2m3m4Biological Sequence AnalysismE34

The Pathr0 r1 r2δ i # o f d e le ti o n s 0Gi 1if generated by insertionif generated by a matchr1(δ1,G1)Solid path: (0,1)Dashed path: (0,1)r3 r4 r5 al Sequence Analysisr6(δ6,G6)(2,0)(0,0)35

Pfam alignment view (http://pfam.wustl.edu/browse.shtml)Biological Sequence Analysis36

More Restricted Models(Liu et al. 99, Neuwald et al. 97) Block Motif Propagation Model: limit the totalnumber of gaps but no deletions allowed.Motif 1Motif 2Motif 3Biological Sequence AnalysisMotif 437

Representative AlignmentBiological Sequence Analysis38

Finding Repetitive PatternsBiological Sequence Analysis39

Motif Alignment Modela1 Motifa2width waklength nkThe missing data: Alignment variable: A {a1, a2, , ak} Every non-site positions follows a common multinomialwith p0 (p0,1 , , p0,20) Every position i in the motif element follows probabilitydistribution pi (pi,1 , , pi,20)Biological Sequence Analysis40

The Algorithm Initialized by choosing random starting positionsa 1( 0 ) , a 2( 0 ) ,., a K( 0 ) Iterate the following steps many times:– Randomly or systematically choose a sequence, say,sequence k, to exclude.– Carry out the predictive-updating step to update ak Stop when no more observable changes in likelihood Available Programs:– Gibbs motif sampler (http://www.wadsworth.org/res&res/bioinfo)– Bioprospector (http://bioprospector.stanford.edu)– MACAW (http://www.ncbi.nlm.nih.gov)Biological Sequence Analysis41

The PU-Stepa1a2a3ak ?1. Compute predictive frequencies of each position i in motifcij count of amino acid type j at position i.c0j count of amino acid type j in all non-site positions.qij (cij bj)/(K-1 B), B b1 bK “pseudo-counts”2. Sample from the predictive distriubtion of ak .wqi , R k ( l i )i 1q 0 , Rk ( l i )P ( a k l 1) Biological Sequence Analysis42

Using MACAWBiological Sequence Analysis43

Idea 2: Mixture modeling View the dataset as a long sequence with k motif types: Idea: partition the input sequence into segments thatcorrespond to different (unknown) motif models. It is a mixture model (unsupervised learning). Implement a predictive updating scheme.Biological Sequence Analysis44

Special Case: Bernoulli Sampler Sequence data:R r1 r2 r3 rN Indicator variable: δ1δ2δ3 . . . δΝδi 1 , if it is the start of an element 0 , if not.parameter for the motif model Likelihood: π(R, Θ, ε), ε is the prior prob for δi 1 Predictive Update:π (δk 1 [ k ] , R )ε w p i ,rk i 1 π (δk 0 [ k ] , R ) 1 ε i 1 p 0 ,rk i 1 Biological Sequence Analysis45

References (self) Durbin R. et al. (1997). Biological Sequence Analysis, CambridgeUniversity Press, London. Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing, SpringerVerlag, New York. Liu, J.S. and Lawrence, C.E. (1999). Bayesian analysis on biopolymersequences. Bioinformatics, 138-143. Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1999) . Markovianstructures in biological sequence alignments. J. Amer. Statist. Assoc., 1-15. Neuwald, A.F., Liu, J.S., Lipman, D.J., and Lawrence, C.E. (1997) .Extracting protein alignment models from the sequence database. NucleicAcids. Res. 25, 1665-1677. Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1995) . Bayesian modelsfor multiple local sequence alignment and Gibbs sampling strategies. J.Amer. Statist. Assoc. 90, 1156-1170. Lawrence, et al. (1993). Science 262, 208-214.Biological Sequence Analysis46

Biological Sequence Analysis 2 Topics to be covered Basic Biology: DNA, RNA, Protein; genetic code. Biological Sequence Analysis – Pairwise alignment --- dynamic programming Needleman-Wunsch Smith-Waterman Blast – Multiple sequence alignment Heuristic approaches The hidden Markov model – Motif finding in .

Related Documents:

Los Angeles, CA Jun 26 Gonzales, LA Jun 26 Salt Lake City, UT Jun 30 Atlanta, GA Jun 30 Chehalis, WA Jun 30 Denver, CO Aug 5 Tipton, CA Aug 12 Phoenix, AZ Aug 28 Las Vegas, NV Sep 30 Wasilla, AK Oct 3 Canada Thunder Bay, ON Jun 15 Montreal, QC Jun 17–18 Edmonton, AB Jun 24–26 North Battleford, SK Jun 25–26 Truro, NS Jun 29 International

Life science graduate education at Harvard is comprised of 14 Ph.D. programs of study across four Harvard faculties—Harvard Faculty of Arts and Sciences, Harvard T. H. Chan School of Public Health, Harvard Medical School, and Harvard School of Dental Medicine. These 14 programs make up the Harvard Integrated Life Sciences (HILS).

nIke proCter & GAmBle StArBUCkS StArwood SteelCASe tArGet wAlt dISney wHIrlpool 5,000 Jun ’ 03 ’ 04 Jun ’ 05 Jun 06 Jun 07 Jun ’ 08 Jun 09 10 Jun ’ 11 Jun 12 Jun 13 D e C ’ 03 C 04 D e C ’ 05 D e C ’ 06 07 D e C ’ 08 D e C 09 10 D e C ’ 11 C 12 C 13 39,922.89 17,5

10 12 14 16 18 10 12 14 16 18 Jun-13 Jun-14 Jun-15 Jun-16 Jun-17 Jun-18 Jun-19 Jun-20 Jun-21 Net Effective Rent* Payback Ratio 2

Sciences at Harvard University Richard A. and Susan F. Smith Campus Center 1350 Massachusetts Avenue, Suite 350 Cambridge, MA 02138 617-495-5315 gsas.harvard.edu Office of Diversity and Minority Affairs minrec@fas.harvard.edu gsas.harvard.edu/diversity Office of Admissions and Financial Aid admiss@fas.harvard.edu gsas.harvard.edu/apply

bali143144 anuradha 8-jul-1981 16-jun-2014 bali141519 avadhesh kumar 13-mar-1946 16-jun-2014 bali154669 bhagyashree bhide 22-jul-1983 16-jun-2014 bali149154 deepak kumar 18-aug-1994 16-jun-2014 bali150052 dhaval ajaykumar oza 5-dec-1983 16-jun-2014 bali53663 doulath basha katike 11-jun-1981 16-jun-2014

Faculty of Arts and Sciences, Harvard University Class of 2018 LEGEND Harvard Buildings Emergency Phones Harvard University Police Department Designated Pathways Harvard Shuttle Bus Stops l e s R i v e r a C h r YOKE ST YMOR E DRIVE BEACON STREET OXFORD ST VENUE CAMBRIDGE STREET KIRKLAND STREET AUBURN STREET VE MEMORIAL

Harvard University Press, 1935) and Harvard College in the Seventeenth Century (Cambridge: Harvard University Press, 1936). Quotes, Founding of Harvard, 168, 449. These works are summarized in Three Centuries of Harvard (Cambridge: Harvard U