Sequence-Based Data Mining - Cornell University

2y ago
466.72 KB
46 Pages
Last View : 23d ago
Last Download : 5m ago
Upload by : Emanuel Batten

Sequence-Based Data MiningJaroslaw PillardyComputational Biology Service UnitCornell University

Sequence analysis: what for? Finding coding regions (gene finding) Finding regulatory regions Analyzing mutation rates Determine properties of a sequence (repeats, lowcomplexity regions) Functionally annotate genes Associate ESTs with genes Make cross-species comparison Build a model for a protein in order to understand itsfunction, mutations etc And many more

Sequence analysis: an example of aproblemQuiz:A human geneticist identified a new gene that wouldsignificantly increase the risk of colon cancer whenmutated. By using BLASTP, she found that this proteinexists in a few vertebrate and invertebrate species with verylow homology, but she was not able to find any goodBLAST hits in Drosophila melanogaster.Before making the conclusion that this gene does not existin fly, what other approaches would you take?

sequenceSequence analysis: how?resultsresultsresultsstructureSimple sequence search (BLAST)Profile-sequence search (HMMER)Structure-sequence search (threading)Homology modeling (MODELLER)Structure-structure search (CE)

Searching for similar proteins in a DatabaseSimple sequencesearchProfile-sequencesearchSensitivity: Least sensitiveStructure-sequencesearchMost sensitiveSpeed:SecondsMinutesHoursDB size:4 x 1064 x 1064 x 104 (PDB)

sequenceSequence analysis: how?resultsresultsresultsstructureSimple sequence search (BLAST)Profile-sequence search (HMMER)Structure-sequence search (threading)Homology modeling (MODELLER)Structure-structure search (CE)

Simple sequence search Sequence similarity search looks like syntactic problem: comparingstrings using alphabets Sequence homology is based of common ancestor and is semanticin nature orthologs similar genes in different species, usually with same function paralogs similar genes created by duplication, may be in samespecies, may not have the same function High sequence similarity does not imply homology, it is only a basefor further investigation Physics can be reintroduced to sequence similarity search viascoring matrices

Scoring alignmentsScoring Matrices Relative entropy: H Σ qijcij Shows information content per pair Matrices with larger entropy values are more sensitive to less divergentsequences Matrices with smaller entropy values are more sensitive to distantly 24a3c31c32c33c34a4c41c42c43c44 Relative entropy can be used tocompare matrices Scores can be related to biology:negative dissimilarity,zero indifference, positive similar

Scoring DNA alignmentsIdentity MatrixAATTGGCTAGCTAA 100C0010G0001Relative entropy: 1.0Matches: 10Mismatches: 4Score: 10 x 1 4 x 0 10Max score: 14Expected score: 3.5Minimum score: 0Score: 71%

Scoring DNA alignmentsBLAST MatrixAATTGGCTAGCTAA 4T-45-4-4C-4-45-4G-4-4-45Relative entropy: -1.0Matches: 10Mismatches: 4Score: 10 x 5 4 x (-4) 36Max score: 70Expected score: -24.5Minimum score: -56Score: 73%

Scoring DNA alignmentsTransition-Transversion MatrixAATTGGCTAGCTAA : 1T-51-1-5C-5-11-5G-1-5-51Relative entropy: -4.5Matches: 10 (1)Mismatches: 3Score: 10 x 1 3 x (-5) 1 x (-1) -6Max score: 14Expected score: -35Minimum score: -70Score: 42%

Scoring protein alignments 20 letter sequences, more possibilities Scoring may be based on physicalproperties of amino acids (polarity,size, hydrophobicity etc) Scoring may based on genetic code:minimum number of nucleotidessubstitutions necessary to convert Hard to put the above into a consistentscoring table Most popular matrices (PAM,BLOSUM) are based on observedsubstitution ratesADCFDGGFAA AECFCGGEAAScore 4 2 9 6 -3 6 6 -3 4 4 35

Scoring protein alignments : PAMDeriving Point Accepted Mutation matrix Dataset of families of very closely related proteins(identity 85%) Phylogenetic tree was constructed for each family Substitution frequency Fij was computed Relative mutability mi was computed for each aminoacid (ratio of occurring mutation to all possible ones) Mutation probability Mij mj Fij / ΣI Fij cij log(Mij/fi) – log odds matrix, fj is frequency ofoccurrence

Scoring protein alignments : PAMUsing Point Accepted Mutation matrix Matrix normalization to PAM-1 unit: 1 substitutionover 100 residues“what is the probability of substitution of a residueduring the time when 1% of residues mutated” Multiplication of PAM-1 unit produces substitutionrates for multiple units PAM-1 is good for very closely related sequences,PAM-250 for intermediate and PAM-1000 for verydistant

Scoring protein alignments : BLOSUMBLOck SUbstitution Matrix Based on comparisons of Blocks of sequences derived from theBlocks database (derived from Prosite) The Blocks database contains multiply aligned ungapped segmentscorresponding to the most highly conserved regions of proteins BLOSUM matrices are categorized by sequence identity above whichblocks were clustered (i.e. BLOSUM62 is derived from blocksclustered at 62% sequence identity)AABCD---BBCDADABCD-A-BBCBBBBBCDBA-BCCAA Focused on highly conserved regionsAAACDC-DCBCDBCCBADB-DBBDCCAAACA---BBCCC

Scoring protein alignments : BLOSUM vs. 3500.186-0.701

Scoring protein alignments :BLOSUM vs. PAMEquivalent PAM and BLOSUMmatrices based on relative entropyPAM100 Blosum90PAM120 Blosum80PAM160 Blosum60PAM200 Blosum52PAM250 Blosum45 PAM matrices have lower expected scores for the BLOSUMmatrices with the same entropy BLOSUM matrices “generally perform better” than PAM matrices

Simple sequence search : scoring TA Gap should correspond to insertion/deletion (indel)even in evolution Multiple (block) nucleotide indels are common assingle nucleotide indels It is then more probable that fewer indel eventsoccurred, i.e. gaps should be grouped Gaps are scored negatively (penalty) Two scores for gaps: origination and continuation Origination score continuation score

Substitution Matrix and Gap CostQuery LengthGap cost 35SubstitutionMatrixPAM-3035-50PAM-70(10, 1)50-85BLOSUM-80(10, 1) 85BLOSUM-62(11, 1)(9,1)

Simple sequence search - alignment Direct enumeration impossible: 100 vs. 95 with 5 gaps 55 millionchoices Optimal solution comes from Dynamic Programming: extendingsolution to n based on all optimal solutions for n-1 problems(Needleman-Wunsh) Solution is a path in the Dynamic Programming score tableA0CTCG-1 -2 -3 -4 -5 Initiate table with gap penalties (1,1) Fill table top-left to low-rightA-1C-2A-3 take left cell add gap penaltyG-4 take upper cell add gap penaltyT-5 take diagonal cell add scoreA-6G-7 Fill element with maximum value of

Simple sequence search - alignment This alignment uses identity scoring table with (1,1) gaps Aligns full sequences: global alignmentACAGTAGAC--TCGA0CTCGA-1 -2 -3 -4 -50CTCGA-1 -2 -3 -4 -50CTCG-1 -2 -3 -4 -5A-1A-1 10-1 -2 -3A-1 10-1 -2 -3C-2C-2 0210-1C-2 0210-1A-3A-3 -1 1210A-3 -1 1210G-4G-4 -2 0122G-4 -2 0122T-5T-5 -3 -1 112T-5 -3 -1 112A-6A-6 -4 -2 011A-6 -4 -2 011G-7G-7 -5 -3 -1 02G-7 -5 -3 -1 02

Simple sequence search - alignment Global alignment is not useful when searching databases Semiglobal alignment: terminal gaps allowed Achieved by initializing gaps to zero in the first step and allowing nogap penalties in the last -2T000002AACACGGTGTCT---ACG-TC---

Simple sequence search - alignment Local alignment: best subsequence matching Dynamic programming algorithm for local alignment: Smith-Waterman Starts like semiglobal alignment with fourth option for filling table: place 0 in the cell when maximum possible value is negative Start with the cell with maximum 14T001111112T001111124AACCTATAGCTGCGATATA

FASTA search algorithm Breaks up query sequence into words (like BLAST) Using lookup tables with words finds areas of identity Areas of identity are joint to form larger pieces Full Smith-Waterman algorithm is used to align these pieces FASTA is slower than BLAST, but produces optimalalignment for pieces

Bit Score and E-valueBit Score: S' (λS-ln K)/ln2Expect Value: E mn 2-S'E 0.01 - 1% chance that the match is due to a random matchE value depends on database sizeE value: expected number of HSPs with score S or higherP value: probability of finding zero HSPs with score S or higherP 1 – exp(-E)

Programs and Database selection1. nucleotide sequence: blastnQuery: nucleotide sequenceDatabase: nucleotide sequence databasee.g. nt htg est

Programs and Database selection2. protein sequence: blastpQuery: protein sequenceDatabase: protein sequence databasee.g. nr

Programs and Database selection3. translated blast search:blastxnucleotide sequence - protein databasetblastnprotein sequence - nucleotide databasetblastxnucleotide sequence- nucleotide

Programs and Database selectionProtein sequence alignment is more sensitivethan nucleotide sequence alignment !

Filtering the low complexity and repetitive sequences1. Low complexity: DUST and SEG programs2. Repetitive sequences: RepeatMasker(DNA sequences: "NNNNNNNN" )(Protein sequences: "XXXXXXXXX")

BLAST Servers1. NCBI Batch Blast s.aspxInput files: Fasta format sequence filesOutput files:1. standard2. -m 8 format3. CBSU parsed format4. CBSU parsed format 2

sequenceSequence analysis: how?resultsresultsresultsstructureSimple sequence search (BLAST)Profile-sequence search (HMMER)Structure-sequence search (threading)Homology modeling (MODELLER)Structure-structure search (CE)

Scoring system of BLASTQuery:ACCGGEFFGACD Target: ACGGGCFCGAGGScore: 493664626431

Sequence alignment of domain ACLGPEFFGACACG1100-100 -100 -100 -1002-100100-100 -100 0.AC.

What is Hidden Markov .0ACGT0. 0.8 1.0 0.8 1.0 0.8 4.7 10-21.0ACGT0.

What is Hidden Markov G--ATCLog-odds: log2(Ps/Pnull)A -0.22C 0.47G -0.22T -0.22-0.51-0.51A 1.16CGT -0.220AC 1.16G -0.22T0A 1.16C -0.22GTA 1.0C-0.92GT0ACG -0.22T 1.16Log-odds(ACACATC) 1.16 0 1.16 0 1.16 6.640AC 1.16G -0.22T

What is Hidden Markov CConsensusBad sequenceSequenceP %Log CT--AGG0.0023-0.97


HMM model table

PSI-BLASTPosition-Specific Iterative BLASTBLAST searchAlign the sequences of the blast targetsConstruct profile from the blast targetsModify substitution matrix to fit profileSearch the database with the new scoringPSI-BLAST uses position-dependent substitution matrix instead ofprobabilities (HMM)

Build a model and search the sequence database formotifs that fit the TGATCTGTTTAAATGTThmmbuildSequence alignmenthmmsearchModelMore sequence motifs that fit this model


Web based programs:PFAM: HMM library based on the Swissprot 48.9 andSP-TrEMBL 31.9 protein sequence databases. 8296protein families in current version.SMART: than 500 extensively annotated domain familiesInterProScan: many HMM and other methods


Evaluating the significance of a hit:1. E-value: 0.1(10% chance that you would've seen a hit this good in asearch of random sequences)2. Raw score GA (the scores used as cutoffs inconstructing Pfam, you may consider TC and NC as well)3. Raw score log2(number of sequences in the database)(20 for the nr)

BLOSUM vs. PAM Equivalent PAM and BLOSUM matrices based on relative entropy PAM100 Blosum90 PAM120 Blosum80 PAM160 Blosum60 PAM200 Blosum52 PAM250 Blosum45 PAM matrices have lower expected scores for the BLOSUM matrices with the same entropy BLOSUM matrices “generally perform better” than PAM matrices

Related Documents:

Project Report Yi Li Cornell University Rudhir Gupta Cornell University Yoshiyuki Nagasaki Cornell University Tianhe Zhang Cornell University Abstract—For our project, we decided to experiment, desig

Preface to the First Edition xv 1 DATA-MINING CONCEPTS 1 1.1 Introduction 1 1.2 Data-Mining Roots 4 1.3 Data-Mining Process 6 1.4 Large Data Sets 9 1.5 Data Warehouses for Data Mining 14 1.6 Business Aspects of Data Mining: Why a Data-Mining Project Fails 17 1.7 Organization of This Book 21 1.8 Review Questions and Problems 23

DATA MINING What is data mining? [Fayyad 1996]: "Data mining is the application of specific algorithms for extracting patterns from data". [Han&Kamber 2006]: "data mining refers to extracting or mining knowledge from large amounts of data". [Zaki and Meira 2014]: "Data mining comprises the core algorithms that enable one to gain fundamental in

Data Mining and its Techniques, Classification of Data Mining Objective of MRD, MRDM approaches, Applications of MRDM Keywords Data Mining, Multi-Relational Data mining, Inductive logic programming, Selection graph, Tuple ID propagation 1. INTRODUCTION The main objective of the data mining techniques is to extract .

October 20, 2009 Data Mining: Concepts and Techniques 7 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization October 20, 2009 Data Mining: Concepts and Techniques 8 Why Not Traditional Data Analysis? Tremendous amount of data - October 19, 2020 -American Linear Collider Workshop 1 Ongoing and potential Cornell contributions to the EIC Potential ILC contributions from Cornell Georg Hoffstaetter for Cornell Laboratory for Accelerator Based Sciences and Education Cornell has experience in using CESR to study wiggler-dominated ILC

Aman Agarwal Cornell University Ithaca, NY Ivan Zaitsev Cornell University Ithaca, NY Xuanhui Wang, Cheng Li, Marc Najork Google Inc. Mountain View, CA {xuanhui,chgli,najork} Thorsten Joachims Cornell University Ithaca, NY AB

WEILL CORNELL DIRECTOR OF PUBLICATIONS Michael Sellers WEILL CORNELL EDITORIAL ASSISTANT Andria Lam Weill Cornell Medicine (ISSN 1551-4455) is produced four times a year by Cornell Alumni Magazine, 401 E. State St., Suite 301, Ithaca, NY 14850-4400 for Weill Cornell Medical College and Weill Corn