
Whole-genome Comparative Annotation and Regulatory Motif Discovery in Multiple Yeast Species

Manolis Kamvysselis (1,2), Nick Patterson (1), Bruce Birren (1), Bonnie Berger (2,3,5), Eric Lander (1,4,5)
manoli@mit.edu, nickp@genome.wi.mit.edu, birren@wi.mit.edu, bab@mit.edu, lander@wi.mit.edu

(1) MIT/Whitehead Institute Center for Genome Research, 320 Charles St., Cambridge MA 02139
(2) MIT Lab for Computer Science, 200 Technology Square, Cambridge MA 02139
(3) MIT Department of Mathematics, 77 Massachusetts Ave, Cambridge MA 02139
(4) MIT Department of Biology, 31 Ames St, Cambridge MA 02139
(5) Corresponding author

ABSTRACT

In [13] we reported the genome sequences of S. paradoxus, S. mikatae and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and non-coding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes.

We developed methods for the automatic comparative annotation of the four species and the determination of orthologous genes and intergenic regions. The algorithms enabled the automatic identification of orthologs for more than 90% of genes despite the large number of duplicated genes in the yeast genome, and the discovery of recent gene family expansions and genome rearrangements. We also developed a test to validate computationally predicted protein-coding genes based on their patterns of nucleotide conservation. The method has high specificity and sensitivity, and enabled us to revisit the current annotation of S. cerevisiae with important biological implications.

We developed statistical methods for the systematic de-novo identification of regulatory motifs. Without making use of co-regulated gene sets, we discovered virtually all previously known DNA regulatory motifs as well as several noteworthy novel motifs. With the additional use of gene ontology information, expression clusters and transcription factor binding profiles, we assigned candidate functions to the novel motifs discovered.

Our results demonstrate that entirely automatic genome-wide annotation, gene validation, and discovery of regulatory motifs is possible. Our findings are validated by the extensive experimental knowledge in yeast, confirming their applicability to other genomes.

Categories and Subject Descriptors
J.3 [Life and medical sciences]: Biology and Genetics

General Terms: Algorithms.

Keywords: Computational biology, Comparative genomics, Genome annotation, Regulatory motif discovery.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
RECOMB ’03, April 10-13, 2003, Berlin, Germany.
Copyright 2003 ACM 1-58113-635-8/03/0004 $5.00.

1. INTRODUCTION

With the availability of complete sequences for a number of model organisms, comparative analysis becomes an invaluable tool for understanding genomes. Complete genomes allow for global views and multiple genomes increase predictive power.

In [13] we used a comparative genomics approach to systematically discover the full set of conserved genes and regulatory elements in yeast. We sequenced and assembled three novel yeast species, S. paradoxus, S. mikatae and S. bayanus and compared them to their close relative S. cerevisiae. The work represented the first genome-wide comparison of four complete eukaryotic genomes. This paper focuses on the mathematical and algorithmic developments underpinning the work.

First, we describe our methods for resolving the gene correspondence between each of the newly sequenced species and S. cerevisiae to identify orthologous regions and validate predicted protein-coding genes. We then describe our methods to identify conserved intergenic sequence elements within these regions and to cluster them into a small number of regulatory motifs.

The gene correspondence method presented here was used for the automatic annotation of the three newly sequenced species, and correctly identified unambiguous orthologs for more than 90% of protein coding genes. It also correctly identified the evolutionary events that separate the four species, discerning segmental duplications and gene loss, while correctly resolving genes that duplicated before the divergence of the species compared.

The methods for regulatory motif discovery presented here do not rely on previous knowledge of co-regulated sets of genes, and in that way differ from the current literature on computational motif discovery. The motifs discovered include most previously published regulatory motifs, adding confidence to our method. Moreover, a number of novel motifs are discovered that appear near functionally related genes. We have used the extensive experimental knowledge in yeast to validate our results, thus confirming that the methods presented here are applicable to other species.

1.1 Comparative annotation: graph separation

The first issue in comparative genomics is determining the correct correspondence of functional elements across the species compared. We decided to use predicted protein coding genes as genomic anchors in order to align and compare the species. Resolving the correspondence between 6000 predicted genes in each species requires an algorithm for comparative annotation that
accounts for gene duplication and loss, and ensures that the 1:1 matches established are true orthologs.

Previously described algorithms for comparing gene sets have been widely used for various purposes, but they were not applicable to the problem at hand. Best Bidirectional Hits (BBH) [6, 7] looks for gene pairs that are best matches of each other and marks them as orthologs. In the case of a recent gene duplication, only one of the duplicated genes will be marked as the ortholog without signaling the presence of additional homologs. Clusters of Orthologous Genes (COG) [22, 23] goes a step further and allows many-to-many orthologous matches. It is able to capture gene duplication events when both copies of a duplicated gene have the same best hit in two other species that are themselves orthologous. It still suffers, though, from letting slight changes in similarity influence the hard decision of a single best match. Moreover, since Saccharomyces underwent a whole-genome duplication event [14] before the divergence of the species compared, individual COGs currently contain both copies of each duplicated pair of genes in a single cluster of orthology, and hence were not applicable in our pairwise comparative annotation.

The comparative annotation algorithm we developed has features that make it useful in many applications. It compares two genomes at a time, and hence can be applied at any range of evolutionary distances, without requiring a balanced phylogenetic tree. Moreover, at its core, it represents the best match of every gene as a set of genes instead of a single best hit, which makes it more robust to slight differences in sequence similarity. Also, it groups the genes into progressively smaller subsets, retaining ambiguities until later in the pipeline when more information becomes available. It progressively refines the synteny map of conserved gene order while resolving ambiguities, one task helping the other. When it terminates, it returns the one-to-one orthologous pairs resolved, as well as sets of genes whose correspondence remains ambiguous in a small number of homology groups.

We applied this algorithm to automatically annotate the assemblies of the three species of yeast. Our Python implementation terminated within minutes for any of the pairwise comparisons. It successfully resolved the graph of sequence similarities between the four species, and found important biological implications in the resulting graph structure. More than 90% of genes were connected in a one-to-one correspondence, and groups of homologous proteins were isolated in small subgraphs. These contain expanding gene families that are often found in rapidly recombining regions near the telomeres, and genes involved in environmental adaptation, such as sugar transport and cell surface adhesion [13]. Not surprisingly, transposon proteins formed the largest homology groups.

This algorithm has also been applied to species at much larger evolutionary distances, with very successful results (Kamvysselis and Lander, unpublished). Despite hundreds of rearrangements and duplicated genes separating S. cerevisiae and K. yarowii, it successfully uncovered the correct gene correspondence between the two species that are more than 100 million years apart.

Finally, the algorithm works well with unfinished genomes. By working with sets of genes instead of one-to-one matches, this algorithm correctly groups in a single orthologous set all portions of genes that are interrupted by sequence gaps and split in two or multiple contigs. A best bi-directional hit would match only the longest portion and leave part of a gene unmatched. Finally, since synteny blocks are only built on one-to-one unambiguous matches, the algorithm is robust to sequence contamination. A contaminating contig will have no unambiguous matches (since all features will also be present in genuine contigs from the species), and hence will never be used to build a synteny block. This has allowed the true orthologs to be determined and the contaminating sequences to be marked as paralogs.

This algorithm provides a good solution to comparative genome annotation, works well at a range of evolutionary distances, and is robust to sequencing artifacts of unfinished genomes.

1.2 Motif discovery: signal from noise

Having accounted for the evolutionary events that gave rise to the gene sets in each species, we can align orthologous genes and intergenic regions and use the multiple alignments to discover conserved features, and in particular regulatory motifs. This amounts to extracting small sequence signals hidden within largely non-functional intergenic sequences. This problem is difficult in a single genome where the signal-to-noise ratio is very small.

Traditional methods for regulatory motif discovery have addressed the signal-to-noise problem by focusing on small subsets of co-regulated genes whose promoter regions are enriched in regulatory motifs. A number of elegant algorithms have been developed to search for subtle sequence signals within unaligned sequences, pioneered by Lawrence and coworkers [15], and made popular in programs such as AlignACE [11, 20, 24], MEME [10] or BioProspector [17]. More recent work has presented additional statistical methods for motif discovery using phylogenetic footprinting [3, 12, 18, 26]. Computational methods have also been developed for finding groups of possibly co-regulated genes that share similar expression profiles in a number of experimental conditions [8]. Additional experimental methods to find co-regulated genes include genome-wide discovery of promoter regions bound by a tagged transcription factor in chromatin IP experiments [16, 21], proteins found in the same protein complex obtained by MS [9] and proteins involved in the same genetically defined pathway [19]. Together, these experiments have allowed the elucidation of a large number of regulatory motifs in yeast [28] that have been categorized in promoter databases [27, 29].

Known regulatory motifs are short and sometimes degenerate, and hence appear frequently throughout the genome, often by chance alone, other times with a functional role. Phylogenetic footprinting has been used to distinguish between functional and non-functional instances, by observing alignments of orthologous promoters across multiple genomes [4]. The functional sites are constrained to contain the motifs since their change disrupts regulation which is detrimental to the organism, whereas non-functional sites are free to change and accumulate mutations.

The use of comparative information thus provides additional information that can help us separate signal from noise. This, together with a genome-wide view of the complete set of aligned orthologous intergenic regions, allows us to approach motif discovery at the genome-wide level. We are no longer constrained to observing subsets of co-regulated genes, but can search for regulatory motifs in all 6000 intergenic regions simultaneously for those sequences that are preferentially
conserved. We can then provide a global view of regulatory sequences that is not constrained by the experimental conditions generated in the laboratory, but instead captures the entire evolutionary history since the divergence of the species compared.

Our motif discovery strategy consists of an exhaustive enumeration and testing of short sequence patterns to find unusually conserved motif cores, followed by a motif refinement and collapsing step that ultimately produces a small number of full motifs. We used three different genome-wide statistics of non-random conservation to select motif cores from a large exhaustive set of short sequence patterns. We extended these cores with correlated surrounding bases that are frequently conserved, and collapsed them hierarchically based on sequence similarity and genome-wide co-occurrence. The final list of 72 genome-wide motifs includes most previously published regulatory motifs, as well as additional motifs that correlate strongly with experimental data.

Our results provide a global view of functionally important regulatory motifs, and provide an important link between protein interaction networks, clusters of gene expression, and transcription binding profiles towards understanding the dynamic nature of the cell and the complexity of regulatory interactions.

2. COMPARATIVE ANNOTATION

The first step to comparative genomics is understanding the correspondence between genes and other functional features across the species compared. Each species is under selective pressure to conserve the sequence of functionally important regions. We can begin to understand these pressures by observing the patterns of change in the sequence of orthologous regions.

In the presence of gene duplication, however, some of the evolutionary constraints a region is under are relieved, and uniform models of evolution no longer capture the underlying selection for these sites. Hence, before any type of motif discovery, we needed to identify unambiguously all orthologous sequences across the four genomes as a guide to our subsequent work.

We used genes as discrete genomic anchors to construct a large scale alignment. The anchors were then used to construct a nucleotide-level alignment of genes and flanking intergenic regions. With the full assemblies of the yeast species available, we predicted all Open Reading Frames (ORFs) of at least 50 amino acids in each of the newly sequenced species, and compared the predicted proteins to the annotated proteins of S. cerevisiae using protein BLAST [1]. Since every predicted protein typically matched multiple S. cerevisiae genes, we first had to resolve the resulting ambiguities.

We formulated the problem of genome-wide gene correspondence in a graph-theoretic framework. We represented the similarities between the genes as a bipartite graph connecting genes between two species (Figure 1). We weighted every edge connecting two genes by the sequence similarity between the two genes, and the overall length of the match. We separated this graph into progressively smaller subgraphs until the only remaining matches connected true orthologs. To achieve this separation, we eliminated edges that are sub-optimal in a series of steps. As a pre-processing step, we eliminated all edges that are not within 20% of the maximum-weight edge incident to each node. We then separated the resulting graph into connected components, and built blocks of conserved gene order (synteny) when neighboring genes in one species had one-to-one matches to neighboring genes in the other species. We used these blocks of conserved synteny to resolve additional ambiguities by preferentially keeping syntenic edges incident to a node, and eliminating its non-syntenic edges. We finally separated out subgraphs that were connected to the remaining edges by solely non-maximal edges as described in the Best Unambiguous Subsets (BUS) algorithm. When the set of edges for each node was no further reducible, we output the connected components of the final graph as the orthology groups between the two species. We finally marked the isolated genes as paralogs of their best match.

Figure 1. Overview of graph separation. We construct a bipartite graph based on the BLAST hits. We consider both forward and reverse matches for near-optimality based on synteny and sequence similarity. Sub-optimal matches are progressively eliminated, simplifying the graph. We return the connected components of the undirected simplified graph.

2.1. Initial pruning of sub-optimal matches

Let G be a weighted bipartite graph describing the similarities between two sets of genes X and Y in the two species compared (Figure 1, top left panel). Every edge e = (x,y) in E that connects nodes x ∈ X and y ∈ Y was weighted by the total number of amino acid similarities in BLAST hits between genes x and y. When multiple BLAST hits connected x to y, we summed the non-overlapping portions of these hits to obtain the total weight of the corresponding edge. We constructed graph M as the directed version of G by replacing every undirected edge e = (x,y) by two directed edges (x,y) and (y,x) with the same weight as e in the undirected graph (Figure 1, top right panel). This allowed us to rank edges incident from a node, and construct subsets of M that contain only the top matches out of every node.

This step drastically reduced the overall graph connectivity by simply eliminating all out-edges that are not near optimal for the node they are incident from. We defined M80 as the subset of M containing for every node only the outgoing edges that are at least 80% of the best outgoing edge. This was mainly a preprocessing
step that eliminated matches that were clearly non-optimal. Virtually all matches eliminated at this stage were due to protein domain similarity between distantly related proteins of the same super-family or proteins of similar function but whose separation well precedes the divergence of the species. Selecting a match threshold relative to the best edge ensured that the algorithm performs at a range of evolutionary distances. After each stage, we separated the resulting subgraph into connected components of the undirected graph (Figure 1, bottom right panel).

2.2. Blocks of conserved synteny

The initial pruning step created numerous two-cycle subgraphs (unambiguous one-to-one matches) between proteins that do not have closely related paralogs. We used these to construct blocks of conserved synteny based on the physical distance between consecutive matched genes, and preferentially kept edges that connect additional genes within the block of conserved gene order. Edges connecting these genes to genes outside the blocks were then ignored, as unlikely to represent orthologous relationships. Without impos [...]

[...] ambiguities. We only considered synteny blocks that had a minimum of three genes before using them for resolving ambiguities, to prevent being misled by rearrangements of isolated genes. We set the maximum distance d for considering two neighboring genes as proximal to 20kb, which corresponds to roughly 10 genes. This parameter should match the estimated density of syntenic anchors. If many genomic rearrangements have occurred since the separation of the species, or if the scaffolds of the assembly are short, the syntenic segments will be shorter and setting d to larger values might hurt the performance. On the other hand, if the number of unambiguous genes is too small at the beginning of this step, the genes used as anchors will be sparse, and no synteny blocks will be possible for small values of d.
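
As an illustration, the 80% pruning rule of Section 2.1 and the synteny-block construction of Section 2.2 can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper. The function names, the toy gene names and the data layout are hypothetical. The sketch assumes that gene names are unique across the two species, that an undirected edge survives only when both of its directed copies survive pruning, and that each one-to-one match is given as a tuple of chromosome and start coordinate in each species. Only the 80% threshold, the 20kb maximum distance and the three-gene minimum come from the text.

```python
from collections import defaultdict


def prune_m80(weights, fraction=0.8):
    """Return the directed edges (source, target) kept by near-optimal pruning.

    weights maps an undirected gene pair (x, y) to its similarity weight.
    Assumption: gene names are unique across the two species.
    """
    out_edges = defaultdict(dict)
    for (x, y), w in weights.items():
        out_edges[x][y] = w  # directed copy x -> y
        out_edges[y][x] = w  # directed copy y -> x, same weight
    kept = set()
    for node, targets in out_edges.items():
        best = max(targets.values())
        for target, w in targets.items():
            if w >= fraction * best:  # within 20% of the best out-edge
                kept.add((node, target))
    return kept


def connected_components(directed_edges):
    """Group genes linked by edges kept in both directions (an assumption)."""
    adjacency = defaultdict(set)
    for source, target in directed_edges:
        if (target, source) in directed_edges:
            adjacency[source].add(target)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


def synteny_blocks(anchors, max_dist=20_000, min_genes=3):
    """Chain one-to-one matches (chrom_x, pos_x, chrom_y, pos_y) into blocks."""
    blocks, current = [], []
    for anchor in sorted(anchors):  # walk along the first genome
        if current:
            prev = current[-1]
            proximal = (anchor[0] == prev[0] and anchor[2] == prev[2]
                        and anchor[1] - prev[1] <= max_dist
                        and abs(anchor[3] - prev[3]) <= max_dist)
            if not proximal:
                if len(current) >= min_genes:
                    blocks.append(current)
                current = []
        current.append(anchor)
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Toy data: cer* and par* are made-up gene names, weights are made-up scores.
similarity = {
    ("cer1", "par1"): 500, ("cer1", "par2"): 450, ("cer1", "par3"): 100,
    ("cer2", "par2"): 480, ("cer3", "par4"): 300,
}
groups = connected_components(prune_m80(similarity))

anchors = [
    ("chrI", 1_000, "c7", 5_000), ("chrI", 9_000, "c7", 12_000),
    ("chrI", 15_000, "c7", 20_000), ("chrI", 90_000, "c7", 95_000),
    ("chrII", 2_000, "c9", 3_000),
]
blocks = synteny_blocks(anchors)
```

In the toy data, cer3 and par4 form a two-cycle and are resolved as a one-to-one match, while cer1, cer2, par1 and par2 stay together in one ambiguous group that a later step would have to resolve. The weak cer1 to par3 edge is pruned, so par3 ends up isolated. Only the first three anchors are close enough in both genomes to form a synteny block.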
