A Hidden Markov Model That Finds Genes In E.coli DNA


Nucleic Acids Research, 1994, Vol. 22, No. 22, 4768-4778. © 1994 Oxford University Press

A hidden Markov model that finds genes in E.coli DNA

Anders Krogh, I. Saira Mian(1) and David Haussler(2,*)

Nordita, Blegdamsvej 17, DK-2100 Copenhagen, Denmark, (1)Sinsheimer Laboratories, University of California, Santa Cruz, CA 95064 and (2)Computer and Information Sciences, University of California, Santa Cruz, CA 95064, USA

*To whom correspondence should be addressed

Received June 21, 1994; Revised and Accepted September 28, 1994

ABSTRACT

A hidden Markov model (HMM) has been developed to find protein coding genes in E.coli DNA using E.coli genome DNA sequence from the EcoSeq6 database maintained by Kenn Rudd. This HMM includes states that model the codons and their frequencies in E.coli genes, as well as the patterns found in the intergenic region, including repetitive extragenic palindromic sequences and the Shine-Dalgarno motif. To account for potential sequencing errors and/or frameshifts in raw genomic DNA sequence, it allows for the (very unlikely) possibility of insertions and deletions of individual nucleotides within a codon. The parameters of the HMM are estimated using approximately one million nucleotides of annotated DNA in EcoSeq6 and the model tested on a disjoint set of contigs containing about 325,000 nucleotides. The HMM finds the exact locations of about 80% of the known E.coli genes, and approximate locations for about 10%. It also finds several potentially new genes, and locates several places where insertion or deletion errors and/or frameshifts may be present in the contigs.

INTRODUCTION

Sequencing of the genomes of organisms and organelles has produced, and will continue to produce, large quantities of complex map and DNA sequence data. The development of algorithms, techniques, software and databases is crucial for accumulating and interpreting these data in a robust and 'automated' manner. Sequencing of the E.coli genome is now about 50% complete [1,2] and, as such, it serves as an important testbed for both laboratory and computer analysis techniques. Here we describe a new computer method for locating the protein coding genes in unannotated E.coli contigs and translating them into protein sequences.

There are two principal methods for finding genes, both of which have been incorporated into systems that analyse eucaryotic DNA [3]. The first locates signals in DNA, such as promoter sequences and splice junctions, using techniques such as neural networks [4,5,6] or statistical methods [7,8,9]. The second approach scores a window of DNA in various ways in order to decide whether the window belongs to a coding or a non-coding region (reviewed in [10]). Staden and McLachlan [11,3] proposed deviation from average codon usage as a way of determining the probability that the window is coding or not. Later, Gribskov et al. [12] used a similar measure as part of their 'codon preference plot', but their measure did not require knowledge of an average codon usage from other sources. Most other scoring methods are related to codon usage in some way [13,3]. Recently, neural networks [4,14,15,16] and Markov chains [17,18,19] have been used to analyze coding (and non-coding) regions. In particular, the program GeneMark [20] finds genes in E.coli DNA using a Markov model for the coding region related to the one discussed here, and a very simple Markov model for the non-coding regions.
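As a minimal illustration of the window-scoring idea reviewed above (and not code from any of the cited programs), the sketch below scores a window of DNA by how strongly its codons follow a reference codon-usage table rather than a uniform choice among the 61 sense codons. The table entries, the window size and the names window_score and sliding_scores are placeholders chosen for the example.

import math

# Placeholder codon-usage values; a real table would cover all 61 sense codons
# with frequencies estimated from known genes.
CODON_USAGE = {"ATG": 0.027, "AAA": 0.034, "GAA": 0.040, "CTG": 0.053}
UNIFORM = 1.0 / 61  # codon probability under a 'no preference' model

def window_score(window: str, frame: int = 0) -> float:
    """Log-likelihood ratio of biased codon usage vs. uniform usage for one frame."""
    score = 0.0
    for start in range(frame, len(window) - 2, 3):
        codon = window[start:start + 3]
        p = CODON_USAGE.get(codon, UNIFORM)  # unseen codons fall back to uniform
        score += math.log(p / UNIFORM)
    return score

def sliding_scores(seq: str, size: int = 120, step: int = 3):
    """Score every window of `size` nucleotides in all three reading frames."""
    for pos in range(0, len(seq) - size + 1, step):
        window = seq[pos:pos + size]
        yield pos, [window_score(window, f) for f in range(3)]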
Whether looking for signals in the DNA or using window scoring, there remains the problem of combining all the scores and/or signals detected in a given contig to produce a coherent 'parse' into genes separated by intergenic regions. The output of this final parsing step could be a list of genes, each represented by its begin and end position within the contig. Snyder and Stormo have recently proposed an elegant dynamic programming method to accomplish this final step [21]. Other, more linguistically motivated approaches to this kind of sequence parsing problem are described in [22,23,24,25].

One aim of this paper is to combine all the aforementioned methods for locating protein coding regions (the search for initiation signals, the scoring of possible coding regions, and the final dynamic programming to get the best parse) in a single simple framework of Hidden Markov Models (HMMs). HMMs have been used to analyse DNA [18], to model certain protein binding sites in DNA [8,9] and in protein analysis [26,27,28,29,30,31,32]. The HMM we use to find genes in E.coli is much larger and more complex than those used in the early HMM work. Since only one strand is modelled, the HMM is applied twice, once to the direct strand and then to the complementary strand. The basic HMM architecture is identical to our earlier work [29], but here it is organised into a series of looping structures (Figure 3) containing explicit submodels for each of the 64 codons and for gene overlaps. It allows for the possibility of insertions and deletions of individual nucleotides within a codon because such errors may result in completely or partially incorrect translated protein sequences (see [33,34,35]). These sequence 'errors' are distinct from real frameshifts and other programmed recoding events, i.e. alternative reading of the genetic code (see [36,37]). In the HMM, if, for example, a base is omitted such that one of the 'codons' is only two bases long, the model compensates by skipping one of the bases in the codon model (and similarly for insertions). To avoid modelling any DNA sequence as a gene with many errors or frameshifts, the probability of this behavior is small. Models for certain intergenic features, such as repetitive extragenic palindromic sequences (REPs) [38,39], emerged from what were initially more generic models during the HMM training procedure, i.e. estimation of the parameters of the HMM.

The HMM was trained on approximately one million nucleotides from the EcoSeq6 database of labelled genes (Kenn Rudd, personal communication; [40]) and tested on the remainder (about 325,000 nucleotides). Since EcoSeq6 is not fully annotated yet (K. Rudd, personal communication), our results should assist in identifying the locations of new genes and highlighting errors and/or inconsistencies in the data. For each contig in this test set we used the Viterbi algorithm [41,29], a standard dynamic programming procedure for HMMs, to find its most likely path through the hidden states of the HMM. Based on the stochastic model represented by our HMM, this path was then used to define a parse of the contig into genes separated by intergenic regions. Of about 240 labelled genes in the test set, we found about 80% of the sequences labelled as protein-coding genes in EcoSeq6 exactly, i.e. with precisely the same start and stop codons. [The actual percentage of exactly correct predictions on the test set is about 85%, but since performance on the training set (about 1000 genes) was only 78% exactly correct, we believe that 80% is a more realistic performance estimate.] Approximately 5% were found within 10 codons of the start codon, 5% overlap by at least 60 bases or 50%, and about 5% were missed completely. For each of the genes predicted by the parser but not labelled in EcoSeq6, we performed a database search using the program BLASTP [42] and the predicted protein sequence. The results indicate that many of these appear to encode known proteins. In addition, there are several instances where the HMM suggests insertion or deletion errors in the labelling of the contigs.
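Because only one strand is modelled, any parser built this way must be run on the direct strand and again on the reverse complement, as noted above. The sketch below shows only that outer loop; viterbi_parse is a hypothetical stand-in for the HMM parser and is assumed to return 0-based, end-exclusive (start, end) positions of predicted genes.

# Complement table for the direct strand; lowercase letters and ambiguity codes are ignored here.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """A and T interchanged, G and C interchanged, and the sequence reversed."""
    return seq.translate(COMPLEMENT)[::-1]

def parse_both_strands(contig: str, viterbi_parse):
    """Run a single-strand parser on the contig and on its reverse complement."""
    genes = [(start, end, "+") for start, end in viterbi_parse(contig)]
    n = len(contig)
    for start, end in viterbi_parse(reverse_complement(contig)):
        # Map coordinates on the reverse complement back onto the direct strand.
        genes.append((n - end, n - start, "-"))
    return sorted(genes)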

The most distinctive aspects of our work are the complexity of the intergenic model and the simplicity of the overall HMM framework for combining coding measures and specific sensors to produce useful parses. The Viterbi algorithm replaces the Snyder-Stormo style dynamic programming approach in this combination of coding measures and specific sensors. To demonstrate the advantages of explicitly modeling the structures in the intergenic region, we also trained and tested a much simpler HMM that did not include a sophisticated intergenic model, but instead relied only on the statistics of the codon models (Figure 1). While this model also performed quite well (about 70% exactly correct), our more complex HMM performed significantly better.

METHODS

A parser with a simple intergenic model

An HMM for DNA patterns generates sequences of A, C, G and T according to a random process. The simplest HMM used in this research is illustrated in Figure 1 and consists of a collection of rings, all connected to a central state. Each ring possesses one or more HMMs whose structure is essentially the same as that used in our work on modelling protein families [29]. There is one codon HMM for each of the 61 DNA triplets that code for amino acids, as well as a ring which generates the intergenic region and its flanking stop and start codons.

The random process used by the HMM to generate a sequence of nucleotides is a random walk starting in the middle of any of the HMMs. Assume we begin at the central state and enter any of the rings by traversing one of the arrows shown in Figure 1. Each such state transition has an associated probability, and transitions out of the central state are chosen at random according to these probabilities (they sum to one). For example, a transition leading to the AAC codon model HMM generates the three nucleotides AAC with very high probability and then, with probability 1, makes the transition back to the central state. Subsequently, a new transition out of the central state is selected randomly and independently of the previous transition. Choosing one of the 61 codon models repeatedly results in a 'random gene'. The gene eventually terminates upon entry into one of the rings below the central state. The probability of such a transition is fairly small. (This probability is roughly determined by the number of intergenic regions divided by the number of codons in a typical contig of E.coli DNA.) One stop codon HMM generates both TAA and TGA, each according to its frequency of occurrence in E.coli, and the other generates TAG. In the simple HMM, the nucleotides of an intergenic region are produced independently and at random by looping in the state labelled 'Intergene model'. Next, the start codon HMM generates either ATG, GTG or TTG, each with the appropriate probability (TTG is very rare in E.coli). A transition is then made back to the central state and the whole process is repeated, i.e. generation of several random codons followed by another intergenic region, and so on. This entire procedure produces a sequence of nucleotides that is statistically similar to a contig of E.coli DNA consisting of a collection of genes interspersed with intergenic regions. Each random walk has a well-defined probability determined by the probability parameters of the HMM.
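The generative random walk just described can be sketched directly. The code below is an illustrative toy version, not the trained model: the transition and codon probabilities are placeholders, the intergene state emits uniformly random nucleotides, and only a handful of codon models are listed.

import random

# Illustrative placeholder probabilities; the trained model has one codon
# model per sense codon, with probabilities estimated from EcoSeq6.
CODON_PROBS = {"AAA": 0.05, "GCT": 0.03, "CTG": 0.07, "GAA": 0.06}
STOP_CODONS = {"TAA": 0.6, "TGA": 0.3, "TAG": 0.1}
START_CODONS = {"ATG": 0.90, "GTG": 0.08, "TTG": 0.02}
P_STOP = 0.003             # chance of leaving the central state towards a stop codon
P_EXTEND_INTERGENE = 0.99  # chance of looping once more in the intergene state

def weighted_choice(table):
    """Pick a key of `table` with probability proportional to its value."""
    return random.choices(list(table), weights=list(table.values()))[0]

def generate_contig(n_steps: int) -> str:
    """Random walk from the central state: mostly codons, occasionally a stop
    codon, an intergenic stretch of independent nucleotides, then a start codon."""
    seq = []
    for _ in range(n_steps):
        if random.random() < P_STOP:
            seq.append(weighted_choice(STOP_CODONS))      # stop codon ring
            while random.random() < P_EXTEND_INTERGENE:   # 'Intergene model' state
                seq.append(random.choice("ACGT"))
            seq.append(weighted_choice(START_CODONS))     # start codon ring
        else:
            seq.append(weighted_choice(CODON_PROBS))      # one of the codon models
        # each step returns to the central state
    return "".join(seq)

# Example: print(generate_contig(1000)) yields a synthetic 'contig'.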
This random-walk probability is inverted and employed to locate the beginnings and ends of genes. For a given contig of E.coli DNA, the most likely random walk through the HMM that generates this sequence is calculated with a dynamic programming method known as the Viterbi algorithm (described in [41]; see also [29]). The Viterbi algorithm generates a parse of the contig, i.e. it labels genes in the DNA by identifying portions of the path that begin with the start codon at the end of the intergenic ring, pass through several amino acid codon HMMs, and return to one of the stop codons at the beginning of the intergenic ring. The model parses a gene in one direction only and thus finds all genes on the direct strand. To locate genes on the opposite strand, the reverse complement (A and T interchanged, G and C interchanged, and the sequence reversed) is parsed as just described.

The gene model

The role of the codon HMMs in Figures 1 and 3 is similar to the role played by codon usage statistics in many other gene finding methods [3]. Codon usage statistics are far from what would be expected if they were based on randomly chosen nucleotides (see Table 1). In our model, the codons in a gene are considered random and independent. Therefore, the probability that a region is coding is simply the product of the probabilities of the individual codons. The probability of an open reading frame (ORF) consisting of codons c_1, c_2, ..., c_k and excluding start and stop codons is

Prob(c_1 c_2 \cdots c_k) = \prod_{i=1}^{k} p(c_i)    (1)

where p(c_i) is the probability of codon c_i given in Table 1 for E.coli. We define the gene index of an ORF to be the negative logarithm of this probability divided by its length,

I(c_1 \cdots c_k) = -\frac{1}{k} \sum_{i=1}^{k} \log p(c_i)    (2)

For genes in the training set the average gene index is

\overline{I} = 0.935    (3)

Relatively few training-set genes have a large gene index: roughly 16% have an index greater than 0.96, 7% greater than 0.98, and only about 2.5% have a gene index larger than 1.0 (see Figure 2). This gene index will be used to rank predictions and to resolve ambiguities in the predictions made by the HMM.

The gene model uses the codon probability as the probability of making a transition into the corresponding codon model. Assume that a particular path through the HMM starts in the intergenic model and goes through the start codon model before looping in the gene model k times (producing k codons), and then enters one of the stop codon models before ending in the intergenic model. This corresponds to an ORF of length k (not counting start and stop codons) flanked by intergenic regions. The probability of that path will contain the probability for the ORF as given in Equation 1. Thus, using the Viterbi algorithm with such a model gives an overall parser similar to Staden and McLachlan's codon-usage method of locating genes [11], or the related method of Gribskov et al. [12], followed by a simple dynamic programming method like that of [21].
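To make Equations 1 and 2 concrete, the sketch below computes an ORF's probability and gene index from a table of codon probabilities. The table entries are placeholders rather than the values in Table 1, and the natural logarithm is used here, so the absolute index values will not match those quoted above.

import math

# Placeholder codon probabilities; Table 1 of the paper provides the real values.
CODON_PROBS = {"ATG": 0.027, "AAA": 0.034, "GAA": 0.040, "CTG": 0.053}
UNIFORM = 1.0 / 61  # fallback for codons missing from this toy table

def orf_codons(orf: str):
    """Split an ORF (start and stop codons already removed) into codons."""
    return [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]

def orf_log_probability(orf: str) -> float:
    # Equation 1 in log space: log Prob(c_1...c_k) = sum_i log p(c_i)
    return sum(math.log(CODON_PROBS.get(c, UNIFORM)) for c in orf_codons(orf))

def gene_index(orf: str) -> float:
    # Equation 2: negative log probability divided by the number of codons k.
    k = len(orf_codons(orf))
    return -orf_log_probability(orf) / k

# Example: gene_index("AAAGAACTGATG") evaluates Equations 1 and 2 for a 4-codon ORF.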

A parser with a complex intergenic model

In the more complex HMM (Figure 3), the intergenic model consists of several parts in addition to the start and stop codon models described above.
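All of the parses discussed in this section are produced by the Viterbi algorithm. For reference, the sketch below decodes the most probable state path for a toy two-state (coding/intergenic) HMM in log space; it is a generic illustration of the dynamic programming involved, not the paper's multi-state gene model, and every probability in it is a placeholder.

import math

STATES = ["coding", "intergenic"]
START = {"coding": 0.5, "intergenic": 0.5}
TRANS = {
    "coding": {"coding": 0.99, "intergenic": 0.01},
    "intergenic": {"coding": 0.02, "intergenic": 0.98},
}
EMIT = {  # per-nucleotide emission probabilities (placeholders)
    "coding": {"A": 0.24, "C": 0.26, "G": 0.28, "T": 0.22},
    "intergenic": {"A": 0.28, "C": 0.22, "G": 0.22, "T": 0.28},
}

def viterbi(seq: str):
    """Return the most probable state path for seq, computed in log space."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        scores, ptrs = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            scores[s] = v[-1][best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][x])
            ptrs[s] = best_prev
        v.append(scores)
        back.append(ptrs)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return list(reversed(path))

# Example: print(viterbi("ATGAAACTGTAA"))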
