Sequence-Based Data Mining - Cornell University


Sequence-Based Data Mining
Jaroslaw Pillardy
Computational Biology Service Unit
Cornell University

Sequence analysis: what for?
- Finding coding regions (gene finding)
- Finding regulatory regions
- Analyzing mutation rates
- Determining properties of a sequence (repeats, low-complexity regions)
- Functionally annotating genes
- Associating ESTs with genes
- Making cross-species comparisons
- Building a model of a protein in order to understand its function, mutations, etc.
- And many more

Sequence analysis: an example of a problem
Quiz: A human geneticist identified a new gene that significantly increases the risk of colon cancer when mutated. Using BLASTP, she found that this protein exists in a few vertebrate and invertebrate species with very low homology, but she was not able to find any good BLAST hits in Drosophila melanogaster. Before concluding that this gene does not exist in the fly, what other approaches would you take?

Sequence analysis: how?
[Diagram: methods leading from sequence to structure, each producing results]
- Simple sequence search (BLAST)
- Profile-sequence search (HMMER)
- Structure-sequence search (threading)
- Homology modeling (MODELLER)
- Structure-structure search (CE)

Searching for similar proteins in a database

Sensitivity runs from least sensitive (simple sequence search) to most sensitive (structure-sequence search).

           Simple sequence search   Profile-sequence search   Structure-sequence search
Speed      Seconds                  Minutes                   Hours
DB size    4 x 10^6                 4 x 10^6                  4 x 10^4 (PDB)

Simple sequence search
- Sequence similarity search looks like a syntactic problem: comparing strings over an alphabet
- Sequence homology is based on a common ancestor and is semantic in nature
  - orthologs: similar genes in different species, usually with the same function
  - paralogs: similar genes created by duplication; they may be in the same species and may not have the same function
- High sequence similarity does not imply homology; it is only a basis for further investigation
- Physics can be reintroduced into sequence similarity search via scoring matrices

Scoring alignments: scoring matrices
- Relative entropy: $H = \sum_{i,j} q_{ij} c_{ij}$
- Shows information content per pair
- Matrices with larger entropy values are more sensitive to less divergent sequences
- Matrices with smaller entropy values are more sensitive to distantly related sequences
- Relative entropy can be used to compare matrices
- Scores can be related to biology: negative = dissimilar, zero = indifferent, positive = similar
[Figure: generic scoring matrix with entries c_11 ... c_44 for letters a_1 ... a_4]

Scoring DNA alignments: identity matrix

AATTGGCTAGCTAA

      A   T   C   G
  A   1   0   0   0
  T   0   1   0   0
  C   0   0   1   0
  G   0   0   0   1

Relative entropy: 1.0
Matches: 10
Mismatches: 4
Score: 10 x 1 + 4 x 0 = 10
Max score: 14
Expected score: 3.5
Minimum score: 0
Score: 71%

Scoring DNA alignments: BLAST matrix

AATTGGCTAGCTAA

      A   T   C   G
  A   5  -4  -4  -4
  T  -4   5  -4  -4
  C  -4  -4   5  -4
  G  -4  -4  -4   5

Relative entropy: -1.0
Matches: 10
Mismatches: 4
Score: 10 x 5 + 4 x (-4) = 34 (the slide states 36)
Max score: 70
Expected score: -24.5
Minimum score: -56
Score: 73%

Scoring DNA alignments: transition-transversion matrix

AATTGGCTAGCTAA

      A   T   C   G
  A   1  -5  -5  -1
  T  -5   1  -1  -5
  C  -5  -1   1  -5
  G  -1  -5  -5   1

Relative entropy: -4.5
Matches: 10
Mismatches: 3 transversions and 1 transition
Score: 10 x 1 + 3 x (-5) + 1 x (-1) = -6
Max score: 14
Expected score: -35
Minimum score: -70
Score: 42%
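The maximum, minimum and expected scores quoted for the three DNA matrices can be reproduced with a short sketch. It assumes 14 aligned positions, uniform base frequencies (0.25 each) and the base order A, T, C, G; the helper names `make_matrix` and `score_summary` are illustrative and do not come from the slides.

```python
from itertools import product

BASES = "ATCG"
PURINES = {"A", "G"}


def make_matrix(match, transition, transversion):
    """Build a symmetric 4x4 DNA scoring matrix keyed by (base, base)."""
    matrix = {}
    for a, b in product(BASES, repeat=2):
        if a == b:
            matrix[a, b] = match
        elif (a in PURINES) == (b in PURINES):
            matrix[a, b] = transition      # A<->G or C<->T
        else:
            matrix[a, b] = transversion    # purine <-> pyrimidine
    return matrix


def score_summary(matrix, length):
    """Max, min and expected score for `length` positions.

    The expected score assumes every base pair is equally likely,
    i.e. two random sequences with uniform composition.
    """
    values = list(matrix.values())
    expected_per_position = sum(values) / len(values)
    return {
        "max": length * max(values),
        "min": length * min(values),
        "expected": length * expected_per_position,
    }


identity = make_matrix(match=1, transition=0, transversion=0)
blast = make_matrix(match=5, transition=-4, transversion=-4)
ts_tv = make_matrix(match=1, transition=-1, transversion=-5)

for name, matrix in [("identity", identity), ("BLAST", blast), ("ts/tv", ts_tv)]:
    print(name, score_summary(matrix, 14))
```

Under these assumptions the expected scores come out as 3.5, -24.5 and -35, matching the figures on the slides.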

Scoring protein alignments
- 20-letter sequences, more possibilities
- Scoring may be based on physical properties of amino acids (polarity, size, hydrophobicity, etc.)
- Scoring may be based on the genetic code: the minimum number of nucleotide substitutions necessary to convert one amino acid into another
- It is hard to put the above into a consistent scoring table
- The most popular matrices (PAM, BLOSUM) are based on observed substitution rates

Example:
  A  D  C  F  D  G  G  F  A  A
  A  E  C  F  C  G  G  E  A  A
  4  2  9  6 -3  6  6 -3  4  4     Score = 35
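The example score of 35 is simply the sum of the per-column substitution scores. The sketch below uses only the pair scores visible in the slide's example (they agree with the corresponding BLOSUM62 entries); the dictionary `PAIR_SCORES` and the function names are illustrative, and a real search would load a full 20 x 20 matrix.

```python
# Pair scores read off the slide example; only the pairs needed here.
PAIR_SCORES = {
    ("A", "A"): 4,
    ("D", "E"): 2,
    ("C", "C"): 9,
    ("F", "F"): 6,
    ("D", "C"): -3,
    ("G", "G"): 6,
    ("F", "E"): -3,
}


def pair_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[a, b]
    return PAIR_SCORES[b, a]


def alignment_score(seq1, seq2):
    """Score an ungapped alignment of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


columns = [pair_score(a, b) for a, b in zip("ADCFDGGFAA", "AECFCGGEAA")]
print(columns)
print(alignment_score("ADCFDGGFAA", "AECFCGGEAA"))
```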

Scoring protein alignments: PAM
Deriving the Point Accepted Mutation matrix
- Dataset of families of very closely related proteins (identity > 85%)
- A phylogenetic tree was constructed for each family
- The substitution frequency $F_{ij}$ was computed
- The relative mutability $m_i$ was computed for each amino acid (ratio of occurring mutations to all possible ones)
- Mutation probability: $M_{ij} = m_j F_{ij} / \sum_i F_{ij}$
- Log-odds matrix: $c_{ij} = \log(M_{ij} / f_i)$, where $f_i$ is the frequency of occurrence of amino acid $i$

Scoring protein alignments: PAM
Using the Point Accepted Mutation matrix
- Matrix normalization to the PAM-1 unit: 1 substitution per 100 residues ("what is the probability of substitution of a residue during the time when 1% of residues mutated")
- Multiplying the PAM-1 matrix by itself produces substitution rates for multiple units
- PAM-1 is good for very closely related sequences, PAM-250 for intermediate ones, and PAM-1000 for very distant ones
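Extrapolating from PAM-1 to PAM-n amounts to raising the PAM-1 mutation probability matrix to the n-th power. The sketch below illustrates this with a toy two-letter alphabet in which each letter has a 1% chance of substitution per PAM unit; the matrix `TOY_PAM1` is an assumption made for illustration, whereas the real PAM-1 matrix is 20 x 20 and derived from data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def pam_n(pam1, n):
    """Return the n-th power of the PAM-1 probability matrix (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy PAM-1: a residue stays unchanged with probability 0.99.
TOY_PAM1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]

pam2 = pam_n(TOY_PAM1, 2)
pam250 = pam_n(TOY_PAM1, 250)
print(pam2)
print(pam250)
```

Each row still sums to 1 after multiplication, and at 250 units the toy matrix approaches 0.5 in every cell, which illustrates why high-numbered PAM matrices carry little information per position.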

Scoring protein alignments: BLOSUM
BLOck SUbstitution Matrix
- Based on comparisons of blocks of sequences derived from the Blocks database (derived from Prosite)
- The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
- BLOSUM matrices are categorized by the sequence identity above which blocks were clustered (e.g. BLOSUM62 is derived from blocks clustered at 62% sequence identity)
- Focused on highly conserved regions

Example alignment:
  AABCD---BBCDA
  DABCD-A-BBCBB
  BBBCDBA-BCCAA
  AAACDC-DCBCDB
  CCBADB-DBBDCC
  AAACA---BBCCC

Scoring protein alignments: BLOSUM vs. PAM
[Table residue: only the last row survives: 350, 0.186, -0.701]

Scoring protein alignments: BLOSUM vs. PAM
Equivalent PAM and BLOSUM matrices based on relative entropy:

  PAM100   Blosum90
  PAM120   Blosum80
  PAM160   Blosum60
  PAM200   Blosum52
  PAM250   Blosum45

- PAM matrices have lower expected scores than BLOSUM matrices with the same entropy
- BLOSUM matrices "generally perform better" than PAM matrices

Simple sequence search: scoring
- A gap should correspond to an insertion/deletion (indel) event in evolution
- Multiple (block) nucleotide indels are as common as single-nucleotide indels
- It is therefore more probable that fewer indel events occurred, i.e. gaps should be grouped
- Gaps are scored negatively (penalty)
- Two scores for gaps: origination and continuation
- The origination penalty is larger than the continuation penalty
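A two-part gap score of this kind is usually called an affine gap penalty. The sketch below shows why it favors grouped gaps. It assumes the convention that a gap of length k costs origination + k x continuation; other tools charge the continuation only from the second position onward, so the exact formula and the function name `gap_penalty` are assumptions.

```python
def gap_penalty(length, origination=10, continuation=1):
    """Penalty for one contiguous gap of `length` positions.

    Assumed convention: cost = origination + continuation * length.
    """
    if length <= 0:
        return 0
    return origination + continuation * length


# One indel event spanning three positions ...
grouped = gap_penalty(3)
# ... versus three separate single-position indel events.
scattered = 3 * gap_penalty(1)

print(grouped, scattered)
```

With the (10, 1) costs used here, a single gap of length three is penalized far less than three isolated gaps, which encodes the idea that fewer indel events are more probable.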

Substitution matrix and gap cost

  Query length   Substitution matrix   Gap cost
  < 35           PAM-30                (9, 1)
  35-50          PAM-70                (10, 1)
  50-85          BLOSUM-80             (10, 1)
  > 85           BLOSUM-62             (11, 1)

Simple sequence search - alignment
- Direct enumeration is impossible: aligning 100 vs. 95 residues with 5 gaps gives 55 million choices
- The optimal solution comes from dynamic programming: extending the solution to n based on all optimal solutions for the n-1 problems (Needleman-Wunsch)
- The solution is a path in the dynamic programming score table
- Initialize the table with gap penalties (1,1)
- Fill the table from top-left to bottom-right
- Fill each element with the maximum of:
  - the left cell plus the gap penalty
  - the upper cell plus the gap penalty
  - the diagonal cell plus the score

Initial table:
           A   C   T   C   G
       0  -1  -2  -3  -4  -5
   A  -1
   C  -2
   A  -3
   G  -4
   T  -5
   A  -6
   G  -7

Simple sequence search - alignment
- This alignment uses the identity scoring table with (1,1) gaps
- It aligns the full sequences: global alignment

  ACAGTAG
  AC--TCG

Completed table:
           A   C   T   C   G
       0  -1  -2  -3  -4  -5
   A  -1   1   0  -1  -2  -3
   C  -2   0   2   1   0  -1
   A  -3  -1   1   2   1   0
   G  -4  -2   0   1   2   2
   T  -5  -3  -1   1   1   2
   A  -6  -4  -2   0   1   1
   G  -7  -5  -3  -1   0   2
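The table-filling rules and the traceback can be sketched in a few lines. This is a minimal illustration of Needleman-Wunsch with a single linear gap penalty, using the identity scoring (match 1, mismatch 0) and a gap penalty of 1 as on the slides; the sequences ACAGTAG and ACTCG are read from the slide's table, and the function name and return layout are assumptions.

```python
def needleman_wunsch(rows, cols, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score table, aligned rows, aligned cols)."""
    n, m = len(rows), len(cols)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column hold cumulative gap penalties.
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    # Each cell is the best of diagonal (substitution), upper and left (gap).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + s,
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + s:
                top.append(rows[i - 1])
                bottom.append(cols[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            top.append(rows[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(cols[j - 1])
            j -= 1
    return table, "".join(reversed(top)), "".join(reversed(bottom))


table, aligned_rows, aligned_cols = needleman_wunsch("ACAGTAG", "ACTCG")
for row in table:
    print(row)
print(aligned_rows)
print(aligned_cols)
```

The bottom-right cell holds the global alignment score, and the traceback visits the cells whose values were produced by the chosen move at each step.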

Simple sequence search - alignment
- Global alignment is not useful when searching databases
- Semiglobal alignment: terminal gaps are allowed
- Achieved by initializing gaps to zero in the first step and applying no gap penalties in the last step

  AACACGGTGTCT
  ---ACG-TC---

[Figure: dynamic programming table for this example]

Simple sequence search - alignment
- Local alignment: best subsequence matching
- Dynamic programming algorithm for local alignment: Smith-Waterman
- Starts like semiglobal alignment, with a fourth option for filling the table: place 0 in the cell when the maximum possible value is negative
- Start with the cell with the maximum value

[Figure: dynamic programming table for a local alignment example]
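A minimal sketch of Smith-Waterman follows, showing the extra zero option and the traceback from the highest-scoring cell. The scoring values (match 2, mismatch -1, gap -2) and the two example sequences are assumptions chosen for illustration; they are not taken from the slide's figure.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, fragment of a, fragment of b)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Fourth option: reset to 0 instead of going negative.
            table[i][j] = max(
                0,
                table[i - 1][j - 1] + s,
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
            if table[i][j] > best:
                best, best_pos = table[i][j], (i, j)
    # Trace back from the maximum cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and table[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if table[i][j] == table[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTAAA", "CCACGTCC"))
```

Only the shared ACGT segment is reported; the unrelated flanking residues are ignored, which is what makes local alignment suitable for database searches.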

FASTA search algorithm
- Breaks up the query sequence into words (like BLAST)
- Uses lookup tables of words to find areas of identity
- Areas of identity are joined to form larger pieces
- The full Smith-Waterman algorithm is used to align these pieces
- FASTA is slower than BLAST, but produces an optimal alignment for the pieces

Bit score and E-value
- Bit score: $S' = (\lambda S - \ln K) / \ln 2$
- Expect value: $E = m n \, 2^{-S'}$
- E = 0.01: 1% chance that the match is due to a random match
- The E-value depends on database size
- E-value: expected number of HSPs with score S or higher
- P-value: probability of finding at least one HSP with score S or higher, $P = 1 - e^{-E}$
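The three formulas chain together as in the sketch below. The values used for lambda and K (0.267 and 0.041), the raw score and the lengths m and n are illustrative assumptions; in practice lambda and K depend on the scoring matrix and gap costs and are reported by the search program.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, m, n):
    """E = m * n * 2^(-S'), with m and n the query and database lengths."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one HSP scoring this high: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e)


# Illustrative parameters only (assumptions, not taken from the slides).
LAMBDA, K = 0.267, 0.041
raw, query_len, db_len = 100, 300, 4_000_000

bits = bit_score(raw, LAMBDA, K)
e = expect_value(bits, query_len, db_len)
print(round(bits, 1), e, p_value(e))
```

For small E the P-value is nearly equal to E, which is why an E-value of 0.01 is read as roughly a 1% chance of a random match; doubling the database length doubles E for the same bit score.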

Programs and database selection
1. Nucleotide sequence: blastn
   Query: nucleotide sequence
   Database: nucleotide sequence database, e.g. nt, htg, est

Programs and database selection
2. Protein sequence: blastp
   Query: protein sequence
   Database: protein sequence database, e.g. nr

Programs and database selection
3. Translated BLAST search:
   - blastx: nucleotide sequence vs. protein database
   - tblastn: protein sequence vs. nucleotide database
   - tblastx: nucleotide sequence vs. nucleotide database

Programs and database selection
Protein sequence alignment is more sensitive than nucleotide sequence alignment!

Filtering low-complexity and repetitive sequences
1. Low complexity: DUST and SEG programs
2. Repetitive sequences: RepeatMasker
Masked regions are replaced by "NNNNNNNN" in DNA sequences and by "XXXXXXXXX" in protein sequences.

BLAST servers
1. NCBI: http://www.ncbi.nlm.nih.gov/BLAST/
2. Batch Blast: http://cbsuapps.tc.cornell.edu/cbsu/blast s.aspx

Input files: FASTA format sequence files
Output files:
1. standard
2. -m 8 format
3. CBSU parsed format
4. CBSU parsed format 2

Scoring system of BLAST

  Query:    A   C   C   G   G   E   F   F   G   A   C   D
  Target:   A   C   G   G   G   C   F   C   G   A   G   G
  Score:    4   9  -3   6   6  -4   6  -2   6   4  -3  -1

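Sequence alignment of domain
[Figure: alignment of a domain (ACLGPEFFGAC ...) with a position-specific score table; the surviving entries are 100 for the observed residue at a position and -100 for the others]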

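What is a Hidden Markov Model?
[Figure: profile HMM with per-state emission probabilities for A, C, G, T; two surviving states are (A 0.0, C 0.0, G 0.2, T 0.8) and (A 0.0, C 0.8, G 0.2, T 0.0)]
P(ACACATC) = 0.8 x 1.0 x 0.8 x 1.0 x 0.8 x ... = 4.7 x 10^-2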

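What is a Hidden Markov Model?
Log-odds: log(Ps/Pnull). The slide writes log2, but the listed values (for example 1.16, 0.47 and -0.22) correspond to the natural logarithm with Pnull = 0.25.
[Figure: the same HMM with probabilities replaced by log-odds scores; emission values include 1.16, 0.47 and -0.22, transition values include -0.51 and -0.92]
Log-odds(ACACATC) = 1.16 + 0 + 1.16 + 0 + 1.16 + ... = 6.64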
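The per-state log-odds values can be checked directly. The sketch below assumes a uniform null model (0.25 per nucleotide) and the natural logarithm; both are assumptions, chosen because they reproduce the numbers printed on the slide, even though the slide's formula is written with log2. The emission probabilities 0.8, 0.4 and 0.2 are likewise inferred from those numbers.

```python
import math


def log_odds(p, p_null=0.25):
    """Log-odds of an emission probability against a uniform null model."""
    return math.log(p / p_null)


for p in (0.8, 0.4, 0.25, 0.2):
    print(p, round(log_odds(p), 2))
```

A probability equal to the null frequency scores 0, anything more likely than background scores positive, and anything less likely scores negative, so summing the values along a path gives the log-odds of the whole sequence.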

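What is a Hidden Markov Model?
[Table residue: columns are Sequence, P (%) and Log-odds, with rows from "Consensus" to "Bad sequence"; the only surviving row is CT--AGG, 0.0023, -0.97]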

Search for matches
GTGTAGCGCTCTGTTTCGTGTGTTTGTGTTCATTTATTGTGTTGT GTAAAGTTAGATTCCACCGA TCCGTTTCTGTTA GAAATTTATGCTTATTGTGT

HMM model table

PSI-BLAST
Position-Specific Iterative BLAST
1. Run a BLAST search
2. Align the sequences of the BLAST targets
3. Construct a profile from the BLAST targets
4. Modify the substitution matrix to fit the profile
5. Search the database with the new scoring
PSI-BLAST uses a position-dependent substitution matrix instead of probabilities (HMM).

Build a model and search the sequence database for motifs that fit the model
Sequence alignment -> hmmbuild -> Model -> hmmsearch -> More sequence motifs that fit this model
[Figure: example aligned sequence TGATCTGTTTAAATGTT]
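The two-step workflow can be scripted as in the sketch below, which only assembles the command lines. It assumes the HMMER argument order of output model followed by input alignment for hmmbuild, and model followed by sequence database for hmmsearch; the file names are hypothetical placeholders.

```python
def hmmer_commands(alignment_file, model_file, database_file):
    """Return argument lists for building a profile HMM and searching with it."""
    build = ["hmmbuild", model_file, alignment_file]
    search = ["hmmsearch", model_file, database_file]
    return build, search


# Hypothetical file names, for illustration only.
build_cmd, search_cmd = hmmer_commands("motif.sto", "motif.hmm", "proteins.fasta")
print(" ".join(build_cmd))
print(" ".join(search_cmd))
```

On a machine with HMMER installed, each list could be passed to `subprocess.run(cmd, check=True)`, running the build step before the search step.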

Programs:
- HMMER
- SAM
- PSI-BLAST

Databases:
- PFAM: http://pfam.wustl.edu/
- SMART: http://smart.embl-heidelberg.de/
- COG
- SUPERFAMILY: supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/

Web-based programs:
- PFAM: http://pfam.wustl.edu/hmmsearch
  An HMM library based on the Swissprot 48.9 and SP-TrEMBL 31.9 protein sequence databases. 8296 protein families in the current version.
- SMART: http://smart.embl-heidelberg.de/
  More than 500 extensively annotated domain families
- InterProScan: http://www.ebi.ac.uk/interpro/scan.html
  Combines many HMM and other methods

The input and ...
ADPASTQDEYRIVYHELETFNGDTSTLTTDRTRFTLESLLPGRNYSL

Evaluating the significance of a hit:
1. E-value: 0.1 (10% chance that you would have seen a hit this good in a search of random sequences)
2. Raw score greater than GA (the scores used as cutoffs in constructing Pfam; you may consider TC and NC as well)
3. Raw score greater than log2(number of sequences in the database) (20 for nr)

