École Normale Supérieure


Master 2 IMaLiS
INTERDISCIPLINARY MASTER IN LIFE SCIENCES

matrix-clustering: a novel tool to cluster and align Transcription Factor binding motifs.

by
CASTRO MONDRAGÓN Jaime Abraham

A Report Submitted in partial fulfillment of the requirements for the degree:
Master in Life Sciences

Supervisor: Jacques VAN HELDEN
Lab. Technological Advances for Genomics and Clinics (TAGC)
Paris, France
JUNE 2014

Table of Contents

ABSTRACT
INTRODUCTION
    Transcription factor binding motifs (TFBM)
    String-based representations of TFBM
    Position specific scoring matrices (PSSMs)
    Collections of reference position-specific scoring matrices
    Objectives
MATERIAL AND METHODS
    Software tools
    Motif datasets
    Implementation
RESULTS
    Development of the software tool matrix-clustering
    Input files
    Motif comparison
    Distance calculation
    Hierarchical clustering
    Progressive alignment
    Branch-wise matrices, logos and consensuses
    Tree export
    Phylogram
    Consensus alignment
    Logos alignment
    Evaluation of the matrix-clustering results and selection of relevant parameters
    Case study 1: grouping redundant matrices resulting from multiple motif discovery tools
    Case study 2: negative control with randomized motifs
    Impact of threshold values
    Impact of clustering method
    Case study 3: identification of motif families in the JASPAR database
CONCLUSIONS AND OUTLOOK
BIBLIOGRAPHY

ABSTRACT

Transcription factor binding motifs (TFBM) are classically represented either as consensus strings (IUPAC, regular expressions), or as position-specific scoring matrices (PSSM). Thousands of curated TFBM are available in specialized databases (JASPAR, RegulonDB, TRANSFAC, etc.), built from collections of transcription factor binding sites (TFBS) obtained by various experimental methods (e.g. EMSA, DNase footprinting, SELEX). TFBM can also be discovered ab initio from genome-scale data sets: promoters of co-expressed genes, ChIP-seq peaks, phylogenetic footprints, etc.

Motif collections sometimes contain groups of similar motifs, for different reasons: curation of alternative motifs for the same TF; homologous proteins sharing a particular DNA-binding domain; motifs discovered with analytic workflows combining several algorithms (e.g. RSAT peak-motifs, or MEME-ChIP). To address the increasing need for efficient tools that identify groups of similar motifs within motif collections, we developed matrix-clustering, which presents significant advantages over existing solutions.

1) Segmentation of the input set of TFBM into separate clusters, displayed as a motif forest rather than a single motif tree (alternative software tools force all motifs to be aligned).
2) Multiple alignment of all motifs belonging to the same cluster.
3) User-friendly display of motif trees with aligned logos and consensuses.
4) At each level of the hierarchical tree, computation of a merged motif (matrix and consensus) summarizing all the descendant motifs.
5) Support for a large series of alternative metrics (correlation, Euclidean distance, SSD, Sandelin-Wasserman, logo dot product, and length-normalized versions of these scores).
6) Possibility to select a custom combination of these scores to compute an integrative threshold.

The potential of the tool is illustrated by case studies: clustering of matrices discovered in ChIP-seq peaks by several motif discovery algorithms, and extraction of a motif-to-motif network and clustering of all motifs from the JASPAR taxon-wise collections. The significance of the clustering results is further assessed by analysing collections of randomized (column-permuted) matrices. In this negative control, most motifs are correctly assigned to a singleton, except for low-complexity motifs (e.g. AAAAAA).

We analyzed the effect of the hierarchical clustering parameters (hierarchical agglomeration rule, similarity metrics) on the number of clusters and on the relationships between motifs, and identified suitable parameters to obtain relevant results.

Availability: matrix-clustering is available on the Regulatory Sequence Analysis Tools (RSAT) Web site (http://www.rsat.eu/). It can also be downloaded with the stand-alone RSAT distribution to be run from the Unix shell.

INTRODUCTION

Gene expression is strongly regulated in the cell at different levels (transcription, translation) by distinct molecules (proteins, RNAs). At the transcriptional level, gene expression can be driven by a set of proteins known as Transcription Factors (TFs), which act either as activators or as repressors of selected target genes by binding DNA in a sequence-specific manner [1,2]. TFs have a DNA-binding domain in which a few amino acid residues interact via weak bonds with specific nucleotides [3,4]. The DNA sequences where a TF binds are called TF binding sites (TFBSs).

TFBSs vary in width between 5 and 30 nucleotides [5]. Although the TFBSs of a particular TF are similar to each other, they are not identical: usually they have a few well-conserved positions, whereas other positions show residue variations between sites.

The study of TFs and their TFBSs has made it possible to build transcriptional regulatory networks and to discover which particular combinations of TFs give rise to complex and diverse biological processes such as morphogenesis, cell differentiation and development [6,2].

Figure 1. Representations of TFBMs built from a collection of TFBSs. This example is illustrated with the Sox2 motif from Transfac (M01272). (a) Alignment of the 16 annotated TFBSs. (b) PSSM representation: each cell of the matrix indicates the number of occurrences of each nucleotide (row) in each column of the aligned sites. (c) IUPAC representation of the degenerate consensus. (d) Logo representation.

Transcription factor binding motifs (TFBM)

The fact that one TF can bind to a large set of similar (but slightly different) sequences makes the search for putative TFBSs a complex task [7]. It is especially important to find a model that (1) encompasses and represents all the already known TFBSs for a TF, and (2) is informative and useful to search for new TFBSs. Building the model requires collecting and aligning a sufficient number of TFBSs, in order to extract significant information about the conserved and variable residues (Figure 1a). Such models, which summarize the conserved and variable residues among a collection of reference sites, are named "transcription factor binding motifs" (TFBM). The most common representations of TFBM are based either on character strings (strict consensus, IUPAC code, regular expressions) or on position-specific scoring matrices (PSSMs). Figure 1 shows different ways to represent TFBMs, using the Sox2 TF as an example.

String-based representations of TFBM

Consensus sequences can be represented either as regular expressions, or using the IUPAC alphabet [8] to denote the combination of nucleotides at each position of the alignment (Figure 1c). Both methods can represent positions with variable residues. However, they do not take into account the nucleotide frequency at each position of the alignment. For example, in Figure 1, the letter Y at the 6th position of the consensus means "C or T", which says nothing about the respective frequencies of these two residues in this column of the aligned sites.

Position specific scoring matrices (PSSMs)

Position-specific scoring matrices [9] indicate the number of occurrences of each residue (rows) in each column of the aligned sites. This model captures the nucleotide variability and conservation of a collection of TFBSs (Figure 1b). PSSMs show that, at many positions of the alignment, one specific nucleotide is markedly more frequent than the others.
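The relation between aligned sites, a count matrix and a degenerate consensus can be illustrated with a minimal Python sketch. The four sites below are toy data (not the Sox2 sites of Figure 1), and the rule used to pick the IUPAC letter (keep every residue observed in at least `min_fraction` of the sites) is an assumption made for illustration, not the rule used by any particular database or by RSAT.

```python
# Standard IUPAC nucleotide codes, keyed by the sorted set of residues they denote.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def count_matrix(sites):
    """Return {nucleotide: [count per column]} for aligned sites of equal length.

    Sites are expected to contain only A, C, G and T.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {nt: [0] * width for nt in "ACGT"}
    for site in sites:
        for col, nt in enumerate(site.upper()):
            matrix[nt][col] += 1
    return matrix

def iupac_consensus(matrix, min_fraction=0.25):
    """Degenerate consensus; min_fraction is a hypothetical inclusion rule."""
    width = len(matrix["A"])
    letters = []
    for col in range(width):
        total = sum(matrix[nt][col] for nt in "ACGT")
        kept = "".join(nt for nt in "ACGT"
                       if matrix[nt][col] / total >= min_fraction)
        letters.append(IUPAC[kept or "ACGT"])  # fall back to N if nothing passes
    return "".join(letters)

sites = ["CATTGT", "CTTTGT", "CATTGT", "CATTGA"]  # toy aligned sites
matrix = count_matrix(sites)
for nt in "ACGT":
    print(nt, matrix[nt])
print(iupac_consensus(matrix))
```

In this toy example the second column contains three A and one T. The consensus reduces it to the single letter W ("A or T"), whereas the matrix keeps the 3:1 ratio, which is the loss of information described above.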
A convenient way to provide a visual and intuitive representation of a PSSM is the sequence logo, which indicates the information content within each column of the matrix (Figure 1d). PSSMs are currently the most widely used computational model to search for TFBSs in a sequence [7], because they take into account the differences in nucleotide composition between the TFBSs and the analyzed sequences, and because the search is supported by several statistical approaches to validate the putative TFBSs.

Collections of reference position-specific scoring matrices

Several specialized databases provide PSSMs built from collections of TFBSs, for example RegulonDB [10], JASPAR [11] and TRANSFAC [12]. The process of building PSSMs is generic and has been automated as part of the analysis of biological sequences and of the study of TFs. Software packages such as RSAT [13] or the MEME suite [14] can build PSSMs from input sequences. This task is relatively easy when the TFBSs are already known (e.g. a collection of binding sites characterized by gel shift or DNase protection experiments), but becomes more complicated when the precise TFBSs are not known and we only have a set of relatively large sequences where a TF possibly binds (e.g. promoters of co-expressed genes). To address this situation, one uses a bioinformatic approach known as de novo motif discovery, which attempts to detect significant motifs [15,16,17] (e.g. over-represented, or positionally biased) in a set of sequences, and builds PSSMs from them. This has been a fundamental problem in computational biology for many years, and a variety of motif discovery algorithms have been designed [18], based for example on over-represented oligonucleotides for monomeric TFs, spaced oligonucleotides for dimeric TFs, the positional distribution of sites inside ChIP-seq peaks, over-represented words in windows of variable or fixed size, etc.

For high-throughput experiments (genomic SELEX, ChIP-seq, microarrays) or studies of the conservation of cis-regulatory elements across species [19], de novo motif discovery tools have to analyze large sets ranging from a few hundred to several tens of thousands of sequences.
More than one algorithm can be used [20], so that they compensate for each other's limitations: some of them find motifs that others miss. Sometimes, however, the same motif is found by more than one algorithm, and the resulting motifs are nearly identical and hence redundant, with small variations in size and in nucleotide frequencies at some positions. Once a set of motifs has been discovered, the user is confronted with the next question: do the different motifs found correspond to known TFBMs?

Motif comparison metrics

This question is one of the current challenges in the field. Many efforts have been made to develop statistical methods and to find adequate metrics to compare motifs. Many metrics exist, but each relies on a different statistical approach and has its own limitations. Consequently, there is neither a standard statistical method nor a standard metric to measure the similarity between PSSMs, an issue discussed in several publications [21,21,23,24]. At least three software tools currently measure the similarity between motifs: compare-matrices, available in RSAT; TomTom [21,24], in the MEME suite; and STAMP [23].

The free software package RSAT [13] integrates a collection of tools for the detection and analysis of cis-regulatory elements in genomic sequences. RSAT includes the program compare-matrices, which compares a set of motifs against a wide range of motif databases. Unlike other motif comparison tools, it can compute several metrics in the same analysis and then select the best matches using rank statistics on the combined scores. A drawback is that the current version does not compute p-values for the different scores.

Matching discovered motifs against reference databases is not the only challenge in motif comparison. Another application is to regroup the redundant motifs discovered from the same sequences. Both issues arise, among other contexts, in tools dedicated to the analysis of ChIP-seq data [20,25,26,27,28]. Some of these tools use several motif discovery variants to search for exceptional motifs in the peaks. Once the motifs have been found, the next step is motif comparison, but the redundancy among the discovered motifs can make the results difficult to interpret. As part of motif analysis, it would therefore be useful to group similar motifs.
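As an illustration of what a matrix comparison involves, the following Python sketch slides one matrix along the other, on both strands, and keeps the offset with the best score. It is not the code of compare-matrices, TomTom or STAMP. The function names, the `min_overlap` parameter and the length normalization (correlation multiplied by the overlap width divided by the total alignment width) are assumptions chosen for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def reverse_complement(columns):
    """Columns hold [A, C, G, T]; reversing a column swaps A<->T and C<->G."""
    return [col[::-1] for col in reversed(columns)]

def best_alignment(m1, m2, min_overlap=3):
    """Return (normalized_cor, cor, offset, strand, overlap_width) of the best match.

    offset is the position of the first column of m2 relative to m1;
    strand is "D" (direct) or "R" (reverse complement of m2).
    """
    best = None
    for strand, query in (("D", m2), ("R", reverse_complement(m2))):
        for offset in range(min_overlap - len(query), len(m1) - min_overlap + 1):
            start1, start2 = max(0, offset), max(0, -offset)
            w = min(len(m1) - start1, len(query) - start2)  # overlapping columns
            if w < min_overlap:
                continue
            xs = [v for col in m1[start1:start1 + w] for v in col]
            ys = [v for col in query[start2:start2 + w] for v in col]
            cor = pearson(xs, ys)
            total_width = len(m1) + len(query) - w  # width of the whole alignment
            candidate = (cor * w / total_width, cor, offset, strand, w)
            if best is None or candidate > best:
                best = candidate
    return best

# Toy frequency matrices, one [A, C, G, T] list per column (consensus ACGTAA).
hi, lo = 0.7, 0.1
m1 = [[hi, lo, lo, lo], [lo, hi, lo, lo], [lo, lo, hi, lo],
      [lo, lo, lo, hi], [hi, lo, lo, lo], [hi, lo, lo, lo]]
m2 = m1[2:5]  # sub-motif GTA, which should match at offset 2 on the direct strand
print(best_alignment(m1, m2))
```

The normalization penalizes matches that cover only a small part of the alignment: here the sub-motif correlates perfectly with its source over three columns, but the normalized score is halved because the full alignment spans six columns.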

Objectives

Knowing the similarity value and the offset for all pairs of motifs in a set provides information that can be used to group the motifs into clusters and to align them. This approach could have many applications, for example: (1) simplifying the interpretation of results, (2) helping to find compound motifs, (3) highlighting the positions shared by a set of motifs [23].

To address the increasing need for efficient tools that identify groups of similar motifs within motif collections, this project created matrix-clustering, a tool that combines motif clustering and motif alignment.

MATERIAL AND METHODS

Software tools

Motif comparison is done with the tool compare-matrices. The tool convert-matrix is used to add empty columns on the flanks of the PSSMs in order to align them, to generate the logo alignments, to change the orientation of the motifs, and to permute the columns of the matrices for the negative control. The tool merge-matrices is used to create the merge-level matrices and consensuses at each branch of the trees. These tools are available in the Regulatory Sequence Analysis Tools [13] (RSAT).

The logo trees are drawn with D3, a JavaScript library for manipulating documents based on data (http://d3js.org/).

The motifs of case studies 1 and 2 were also analyzed with STAMP [23], a tool to cluster, compare and align motifs.

Motif datasets

For case study 1, I used a set of 21 motifs discovered with the tool peak-motifs [20,29] in the peak set of the Oct4 ChIP-seq experiment from Chen et al. [6].

For case study 2, I used the non-redundant sets of insect and vertebrate core motifs from the JASPAR [11] database.

Implementation

matrix-clustering was implemented in Perl, R and the JavaScript library D3.
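The column permutation used for the negative control can be sketched in a few lines of Python. This is only an illustration of the principle, not the convert-matrix implementation. The function name `permute_columns`, the `seed` parameter and the toy matrices are hypothetical.

```python
import random

def permute_columns(columns, seed=None):
    """Return a copy of the matrix with its columns in random order.

    Each column is a list of counts [A, C, G, T]. Shuffling whole columns keeps
    the content of every column, and therefore the overall nucleotide
    composition and information content, but destroys the order of positions.
    """
    rng = random.Random(seed)
    shuffled = [list(col) for col in columns]
    rng.shuffle(shuffled)
    return shuffled

# Toy count matrix (8 sites), one [A, C, G, T] list per column.
motif = [[8, 0, 0, 0], [0, 8, 0, 0], [0, 0, 8, 0], [1, 1, 1, 5]]
randomized = permute_columns(motif, seed=42)
print(randomized)

# A low-complexity motif is left unchanged by any column permutation.
poly_a = [[8, 0, 0, 0]] * 6
print(permute_columns(poly_a, seed=1) == poly_a)
```

The second example suggests why low-complexity motifs such as AAAAAA can still be grouped together in a randomized collection: permuting identical columns yields the same matrix.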

RESULTS

Development of the software tool matrix-clustering

In this work I present a novel bioinformatic tool called matrix-clustering, which addresses one of the current challenges in the analysis of cis-regulatory sequences: the clustering of motifs. This tool is now functional and available on the Regulatory Sequence Analysis Tools [13] (RSAT) Web site (http://www.rsat.eu/). It can also be downloaded with the stand-alone RSAT distribution to be used from the Unix shell, which allows it to be included in automated pipelines.

Figure 2. matrix-clustering pipeline. The figure shows the pipeline from the input motifs and the parameters selected by the user to the final output, as well as the interconnections between the programs and files. Grey boxes represent input/output files. Blue boxes represent the software tools used in this algorithm. Green boxes represent the user-selected parameters.

This tool takes as input a set of motifs (PSSMs), measures the similarity between each pair of motifs, and runs hierarchical clustering to group similar motifs. The clusters are defined based on one or more metrics selected by the user. Once the clusters are defined, they are displayed as separate trees, which are used as guide trees to produce a progressive alignment of the motifs. The result is displayed in different modes: logo phylogram, logo cladogram, and consensus tree. Figure 2 shows the flowchart of the algorithm, which is explained below.

Input files

matrix-clustering takes as input a file containing a set of motifs. Several formats that can store multiple PSSMs in one file are supported: MEME, TRANSFAC, tab-delimited, etc.

Motif comparison

All the input motifs are compared with each other using the program compare-matrices, which computes the similarities using many metrics (correlation, Euclidean distance, SSD, Sandelin-Wasserman, logo dot product, and length-normalized versions of these scores). All the pairwise comparisons are exported to a tab-delimited file, which can be accessed from the matrix-clustering result page.

Distance calculation

Before the clustering step, it is necessary to select one of the supported metrics, which will be used to build the motif-to-motif distance table. However, some of the metrics supported by compare-matrices measure a similarity (e.g. correlation, normalized correlation) whereas others measure a distance (Euclidean, sum of squared deviations, Sandelin-Wasserman). Since hierarchical clustering assumes a distance table as input, the values of the selected metric are transformed into distance values, and the resulting table of distances between each pair of motifs is used for the hierarchical clustering step.
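The two steps described above, converting a similarity into a distance and then grouping motifs, can be sketched in pure Python. The transformation `d = 1 - s` is an assumption that suits correlation-like scores with a maximum of 1; the text does not state which transformation the tool applies. The naive agglomeration below stops merging as soon as the two closest clusters are farther apart than a threshold, which is one simple way to obtain several separate clusters rather than a single tree. The tool's actual procedure may differ. All names and values are hypothetical.

```python
def similarity_to_distance(similarities):
    """Assumed transformation for scores whose maximum is 1: d = 1 - s."""
    return {pair: 1.0 - s for pair, s in similarities.items()}

def cluster_motifs(names, distances, threshold, linkage="average"):
    """Naive agglomerative clustering that stops at a distance threshold.

    distances maps unordered pairs (a, b) to a distance. linkage is one of
    "single", "complete" or "average". Returns a list of clusters (lists of names).
    """
    def pair_distance(a, b):
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    def cluster_distance(c1, c2):
        values = [pair_distance(a, b) for a in c1 for b in c2]
        if linkage == "single":
            return min(values)
        if linkage == "complete":
            return max(values)
        return sum(values) / len(values)

    clusters = [[name] for name in names]
    while len(clusters) > 1:
        # Find the two closest clusters.
        d_min, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d_min > threshold:
            break  # remaining clusters are too dissimilar to be merged
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Toy pairwise correlations between four hypothetical motifs.
names = ["m1", "m2", "m3", "m4"]
similarities = {
    ("m1", "m2"): 0.95, ("m1", "m3"): 0.20, ("m1", "m4"): 0.10,
    ("m2", "m3"): 0.25, ("m2", "m4"): 0.15, ("m3", "m4"): 0.30,
}
distances = similarity_to_distance(similarities)
clusters = cluster_motifs(names, distances, threshold=0.4)
print(clusters)
```

With these toy values only m1 and m2 are close enough to be merged, so the result is one cluster of two motifs and two singletons. Changing `linkage` or `threshold` changes the number of clusters, which is the kind of parameter effect evaluated in the case studies.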
This distance table is also exported in the matrix-clustering results.

Hierarchical clustering

After the distance table between all the motifs has been calculated, hierarchical clustering is applied to produce a global tree encompassing all input motif
