École Normale Supérieure

Master 2 IMaLiS
INTERDISCIPLINARY MASTER IN LIFE SCIENCES

matrix-clustering: a novel tool to cluster and align Transcription Factor binding motifs

by CASTRO MONDRAGÓN Jaime Abraham

A Report Submitted in partial fulfillment of the requirements for the degree: Master in Life Sciences

Supervisor: Jacques VAN HELDEN
Lab. Technological Advances for Genomics and Clinics (TAGC)
Paris, France
JUNE 2014

Table of Contents

ABSTRACT
INTRODUCTION
  Transcription factor binding motifs (TFBM)
  String-based representations of TFBM
  Position specific scoring matrices (PSSMs)
  Collections of reference position-specific scoring matrices
  Objectives
MATERIAL AND METHODS
  Software tools
  Motif datasets
  Implementation
RESULTS
  Development of the software tool matrix-clustering
  Input files
  Motif comparison
  Distance calculation
  Hierarchical clustering
  Progressive alignment
  Branch-wise matrices, logos and consensuses
  Tree export
  Phylogram
  Consensus alignment
  Logos alignment
  Evaluation of the matrix-clustering results and selection of relevant parameters
  Case study 1: grouping redundant matrices resulting from multiple motif discovery tools
  Case study 2: negative control with randomized motifs
  Impact of threshold values
  Impact of clustering method
  Case study 3: identification of motif families in the JASPAR database
CONCLUSIONS AND OUTLOOK
BIBLIOGRAPHY

ABSTRACT

Transcription factor binding motifs (TFBM) are classically represented either as consensus strings (IUPAC, regular expressions) or as position-specific scoring matrices (PSSM). Thousands of curated TFBM are available in specialized databases (JASPAR, RegulonDB, TRANSFAC, etc.), built from collections of transcription factor binding sites (TFBS) obtained by various experimental methods (e.g. EMSA, DNAse footprinting, SELEX). TFBM can also be discovered ab initio from genome-scale data sets: promoters of co-expressed genes, ChIP-seq peaks, phylogenetic footprints, etc.

Motif collections sometimes contain groups of similar motifs, for different reasons: curation of alternative motifs for the same TF; homologous proteins sharing a particular DNA-binding domain; motifs discovered with analytic workflows combining several algorithms (e.g. RSAT peak-motifs or MEME-ChIP). To address the increasing need for efficient tools that detect groups of similar motifs within motif collections, we developed matrix-clustering, which presents significant advantages over existing solutions.

1) Segmentation of the input set of TFBM into separate clusters, displayed as a motif forest rather than a single motif tree (alternative software tools force all motifs to be aligned).
2) Multiple alignment of all motifs belonging to the same cluster.
3) User-friendly display of motif trees with aligned logos and consensuses.
4) At each level of the hierarchical tree, computation of a merged motif (matrix and consensus) summarizing all the descendant motifs.
5) Support for a large series of alternative metrics (correlation, Euclidean distance, SSD, Sandelin-Wasserman, logo dot product, and length-normalized versions of these scores).
6) Possibility to select a custom combination of these scores to compute an integrative threshold.

The potential of the tool is illustrated by case studies: clustering of matrices discovered in ChIP-seq peaks using several motif discovery algorithms, and extraction of a motif-to-motif network and clustering of all motifs from the JASPAR taxon-wise collections. The significance of the clustering results is further assessed by analysing collections of randomized (column-permuted) matrices. In this negative control, most motifs are correctly assigned to a singleton, except for low-complexity motifs (e.g. AAAAAA).

We analyzed the effect of hierarchical clustering parameters (hierarchical agglomeration rule, similarity metrics) on the number of clusters and on the relationships between motifs, and identified suitable parameters to obtain relevant results.

Availability: matrix-clustering is available on the Regulatory Sequence Analysis Tools (RSAT) Web site (http://www.rsat.eu/). It can also be downloaded with the stand-alone RSAT distribution to be run from the Unix shell.

INTRODUCTION

Gene expression is strongly regulated in the cell at different levels (transcription, translation) by distinct molecules (proteins, RNAs). At the transcriptional level, gene expression can be driven by a set of proteins known as Transcription Factors (TFs), which act either as activators or as repressors of selected target genes by binding DNA in a sequence-specific manner [1,2]. TFs have a DNA-binding domain in which a few amino acid residues interact via weak bonds with specific nucleotides [3,4]. The DNA sequences where a TF binds are denoted as TF binding sites (TFBSs).

TFBSs vary in width between 5 and 30 nucleotides [5]. Although the TFBSs of a particular TF are similar to each other, they are not identical: usually they have a few well-conserved positions, whereas other positions show residue variations between sites.

The study of TFs and their TFBSs has allowed the construction of transcriptional regulatory networks and the discovery of which particular combinations of TFs give rise to complex and diverse biological processes such as morphogenesis, cell differentiation, development, etc. [6,2].

Figure 1. Representations of TFBMs built from a collection of TFBSs. This example is illustrated with the Sox2 motif from Transfac (M01272). (a) Alignment of the 16 annotated TFBSs. (b) PSSM representation: each cell of the matrix indicates the number of occurrences of each nucleotide (row) in each column of the aligned sites. (c) IUPAC representation of the degenerate consensus. (d) Logo representation.

Transcription factor binding motifs (TFBM)

The fact that one TF can bind to a large set of similar (but slightly different) sequences makes the search for putative TFBSs a complex task [7]. It is especially important to find a model that (1) encompasses and represents all the already known TFBSs for a TF, and (2) is informative and useful to search for new TFBSs. To build the model, a sufficient number of TFBSs must be collected and aligned, in order to extract significant information about the conserved and variable residues (Figure 1a). Such models, which summarize the conserved and variable residues among a collection of reference sites, are named "transcription factor binding motifs" (TFBM). The most common representations of TFBM are based either on character strings (strict consensus, IUPAC code, regular expressions) or on position specific scoring matrices (PSSMs). Figure 1 shows different ways to represent TFBMs, using the Sox2 TF as an example.

String-based representations of TFBM

Consensus sequences can be represented either as regular expressions or using the IUPAC alphabet [8] to denote the combination of nucleotides at each position of the alignment (Figure 1c). Both methods can represent positions with variable residues. However, they do not take into account the nucleotide frequency at each position of the alignment. For example, in Figure 1, the letter Y at the 6th position of the consensus means "C or T"; it says nothing about the respective frequencies of these two residues in this column of the aligned sites.

Position specific scoring matrices (PSSMs)

Position specific scoring matrices [9] indicate the number of occurrences of each residue (rows) in each column of the aligned sites. This model captures the nucleotide variability and conservation of a collection of TFBSs (Figure 1b). PSSMs show that many positions of the alignment have a higher frequency of one specific nucleotide.
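The difference between the two representations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four aligned sites are hypothetical toy data (not the 16 Sox2 sites of Figure 1), and the function names `count_matrix` and `iupac_consensus` are invented for this example rather than taken from RSAT.

```python
from collections import Counter

# Hypothetical toy sites, already aligned (NOT the Sox2 sites of Figure 1).
SITES = ["CATTGTC", "CATTGTT", "CCTTGTC", "AATTGTT"]

# IUPAC codes keyed by the sorted set of nucleotides observed in a column.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def count_matrix(sites):
    """Return {nucleotide: [count per column]} for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("sites must be aligned to the same width")
    columns = [Counter(site[i] for site in sites) for i in range(width)]
    return {nt: [col[nt] for col in columns] for nt in "ACGT"}


def iupac_consensus(matrix):
    """One IUPAC letter per column; the residue frequencies are discarded."""
    width = len(matrix["A"])
    letters = []
    for i in range(width):
        present = "".join(nt for nt in "ACGT" if matrix[nt][i] > 0)
        letters.append(IUPAC[present])
    return "".join(letters)


matrix = count_matrix(SITES)
consensus = iupac_consensus(matrix)
```

With these toy sites, the first column (three C, one A) and the second column (three A, one C) both collapse to the same letter M in the consensus, whereas the count matrix keeps the two distributions distinct. This is the loss of frequency information that motivates the matrix representation.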
A convenient way to provide a visual and intuitive representation of a PSSM is the sequence logo, which indicates the information content within each column of the matrix (Figure 1d).

PSSMs are currently the most extensively used computational method to search for TFBSs in a sequence [7], because they take into account the differences in nucleotide composition between the TFBSs and the analyzed sequences. The search is supported by different statistical approaches to validate the putative TFBSs.

Collections of reference position-specific scoring matrices

Several specialized databases provide PSSMs built from collections of TFBSs, for example RegulonDB [10], JASPAR [11], TRANSFAC [12], etc. The process of building PSSMs is generic and has been automated as part of the analysis of biological sequences and of the study of TFs. Software packages such as RSAT [13] or the MEME suite [14] build PSSMs from input sequences. This task is relatively easy when the TFBSs are already known (e.g. a collection of binding sites characterized by gel shift or DNAse protection experiments), but becomes more complicated when the precise TFBSs are not known and only a set of relatively large sequences where a TF possibly binds is available (e.g. promoters of co-expressed genes). This situation is addressed with a bioinformatic approach known as de novo motif discovery, which attempts to detect significant motifs [15,16,17] (e.g. over-represented or positionally biased) in a set of sequences, and to build PSSMs from them. This has been a fundamental problem in computational biology for many years, and a variety of motif discovery algorithms have been designed [18], for example searching for over-represented oligonucleotides for monomeric TFs, spaced oligonucleotides for dimeric TFs, positional distribution of sites inside ChIP-seq peaks, over-represented words in windows of variable or fixed size, etc.

For high-throughput experiments (genomic SELEX, ChIP-seq, microarrays) or studies of the conservation of cis-regulatory elements across species [19], de novo motif discovery tools have to analyze large sets ranging from a few hundred to several tens of thousands of sequences.
More than one algorithm can be used [20], each compensating for the limitations of the others: some find motifs that others miss, but sometimes the same motif is found by more than one algorithm, yielding redundant motifs that differ only slightly in size and in nucleotide frequencies at some positions. Once a set of motifs has been discovered, the user is confronted with the next question: do the different motifs found correspond to known TFBMs?
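The information content that a sequence logo displays for each column, mentioned earlier in this introduction, can be illustrated with a short Python sketch. It assumes the common definition relative to a uniform background (0.25 per nucleotide) and applies no pseudo-count or small-sample correction; the text does not state which variant RSAT uses, and the matrix `COUNTS` and the function `column_information` are hypothetical.

```python
import math

# Hypothetical count matrix: rows A, C, G, T; one entry per aligned column.
COUNTS = {
    "A": [4, 2, 1],
    "C": [0, 2, 1],
    "G": [0, 0, 1],
    "T": [0, 0, 1],
}


def column_information(counts, background=0.25):
    """Information content (bits) of each column relative to a uniform background."""
    width = len(counts["A"])
    info = []
    for i in range(width):
        total = sum(counts[nt][i] for nt in "ACGT")
        bits = 0.0
        for nt in "ACGT":
            freq = counts[nt][i] / total
            if freq > 0:  # the term 0 * log(0) is taken as 0
                bits += freq * math.log2(freq / background)
        info.append(bits)
    return info


info = column_information(COUNTS)
```

Under these assumptions a fully conserved column carries 2 bits, a column split evenly between two residues carries 1 bit, and a column where all four residues are equally frequent carries 0 bits.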

Motifs comparison metrics

This question is one of the challenges of the field. Many efforts have been made to develop statistical methods and to find adequate metrics to compare motifs. Plenty of metrics exist, but each relies on a different statistical approach and has its own limitations. For this reason, there is neither a standard statistical method nor a standard metric to measure the similarity between PSSMs, an issue that has been discussed in several publications [21,21,23,24]. Currently at least 3 software tools measure the similarity between motifs: compare-matrices in RSAT, TomTom [21,24] in the MEME suite, and STAMP [23].

The free software package RSAT [13] integrates a collection of tools for the detection and analysis of cis-regulatory elements in genomic sequences. RSAT includes the program compare-matrices, which measures the similarity between a set of motifs and many motif databases. Unlike other motif comparison tools, it computes several metrics in the same analysis and then selects the best matches using rank statistics on the combined scores. A drawback is that the current version does not compute p-values on the different scores.

Matching discovered motifs against reference databases is not the only challenge in motif comparison. Another application is to regroup the redundant motifs discovered from the same sequences. Both issues are faced by, among others, the tools for analyzing ChIP-seq data [20,25,26,27,28]. Some of these tools use many motif discovery variants to search for exceptional motifs in the peaks. Once the motifs have been found, the next step is motif comparison, but given the redundancy among the discovered motifs, the results can be difficult to interpret. As part of motif analysis, it would be useful to group similar motifs.

Objectives

Knowing both the similarity value and the offset for all pairs of a set of motifs provides information that can be used to group the motifs into clusters and to align them. This approach could have many applications, for example: (1) simplifying the interpretation of results, (2) helping to find compound motifs, (3) highlighting the positions shared by a set of motifs [23].

To address the increasing need for efficient tools that detect groups of similar motifs within motif collections, this project created matrix-clustering, a tool that combines motif clustering and motif alignment.
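As an illustration of what "similarity value and offset" mean for one pair of motifs, the sketch below slides one matrix along the other and keeps the offset with the highest Pearson correlation over the overlapping columns. It is a simplified stand-in, not the compare-matrices algorithm: it ignores the reverse strand, uses a single metric, and the names `pearson`, `best_offset`, and the `min_overlap` parameter are assumptions made for this example.

```python
import math

# One-hot columns in the order (A, C, G, T); hypothetical toy motifs.
A, C, G, T = (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
M1 = [A, C, G, T, A]
M2 = [G, T, A]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_offset(m1, m2, min_overlap=3):
    """Return (offset, correlation) maximizing the correlation, or None.

    offset is the column of m1 aligned with the first column of m2;
    a negative offset means m2 starts before m1.
    """
    best = None
    for offset in range(min_overlap - len(m2), len(m1) - min_overlap + 1):
        start1, start2 = max(0, offset), max(0, -offset)
        width = min(len(m1) - start1, len(m2) - start2)
        x = [v for col in m1[start1:start1 + width] for v in col]
        y = [v for col in m2[start2:start2 + width] for v in col]
        r = pearson(x, y)
        if best is None or r > best[1]:
            best = (offset, r)
    return best


result = best_offset(M1, M2)
```

Here the second motif matches columns 3 to 5 of the first one, so the best offset is 2. Repeating this for every pair of motifs gives the kind of pairwise table (similarity plus offset) that clustering and alignment can build on.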

MATERIAL AND METHODS

Software tools

Motif comparison is done with the tool compare-matrices. The tool convert-matrix is used to add empty columns on the flanks of the PSSMs in order to align them, to generate the logo alignment, to change the orientation of the motifs, and to permute the columns of the matrices for the negative control. The tool merge-matrices is used to create the merged matrices and consensuses at each branch of the trees. These tools are available in the Regulatory Sequence Analysis Tools [13] (RSAT).

The logo trees are drawn with D3, a JavaScript library for manipulating documents based on data (http://d3js.org/).

The motifs of case studies 1 and 2 were analyzed with STAMP [23], a tool to cluster, compare and align motifs.

Motif datasets

For case study 1, I used a set of 21 motifs discovered with the tool peak-motifs [20,29] in the Oct4 ChIP-seq peak set from Chen et al. [6].

For case study 2, I used the non-redundant sets of insect and vertebrate core motifs from the JASPAR [11] database.

Implementation

matrix-clustering was implemented in Perl, R, and the JavaScript library D3.
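The three matrix operations attributed to convert-matrix under "Software tools" (flank padding, change of orientation, column permutation) are simple to state on a matrix stored as a list of columns. The Python sketch below only illustrates the operations themselves; it does not reproduce convert-matrix or its options, and the function names and the toy matrix are hypothetical.

```python
import random

# Hypothetical count matrix as a list of columns, each ordered [A, C, G, T].
MATRIX = [[4, 0, 0, 0], [0, 3, 1, 0], [0, 0, 0, 4]]


def pad_columns(matrix, left, right):
    """Add empty (all-zero) columns on both flanks so matrices share one frame."""
    def empty(n):
        return [[0, 0, 0, 0] for _ in range(n)]
    return empty(left) + [list(col) for col in matrix] + empty(right)


def reverse_complement(matrix):
    """Reverse the column order and swap A<->T and C<->G within each column."""
    return [list(reversed(col)) for col in reversed(matrix)]


def permute_columns(matrix, seed=None):
    """Shuffle whole columns: composition per column is kept, order is lost."""
    rng = random.Random(seed)
    shuffled = [list(col) for col in matrix]
    rng.shuffle(shuffled)
    return shuffled


padded = pad_columns(MATRIX, left=2, right=1)
flipped = reverse_complement(MATRIX)
randomized = permute_columns(MATRIX, seed=1)
```

Column permutation keeps every column intact, so the randomized matrix has the same per-column information content as the original while the order of the positions is scrambled. Note that a motif made of identical columns (such as AAAAAA) is unchanged by this permutation.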

RESULTS

Development of the software tool matrix-clustering

In this work I present a novel bioinformatic tool called matrix-clustering, which addresses one of the current challenges in the analysis of cis-regulatory sequences: the clustering of motifs. This tool is now functional and available on the Regulatory Sequence Analysis Tools [13] (RSAT) Web site (http://www.rsat.eu/). It can also be downloaded with the stand-alone RSAT distribution to be used from the Unix shell, which allows it to be included in automated pipelines.

Figure 2. matrix-clustering pipeline. The figure shows the pipeline from the input motifs and the parameters selected by the user to the final output, and the interconnections between the programs and files. Grey boxes represent input/output files. Blue boxes represent software tools used in this algorithm. Green boxes represent the parameters selected by the user.

This tool takes as input a set of motifs (PSSMs), measures the similarity between each pair of motifs, and runs hierarchical clustering to group similar motifs. The clusters are defined based on one or more metrics selected by the user. Once the clusters are defined, they are displayed as separate trees, which are used as guide trees to produce a progressive alignment of the motifs. The result is displayed in different modes: logo phylogram, logo cladogram, and consensus tree. Figure 2 shows the flowchart of the algorithm, which is explained below.

Input files

matrix-clustering takes as input a file with a set of motifs. Several formats that can store multiple PSSMs in one file are supported: MEME, transfac, tab-delimited, etc.

Motif comparison

All the input motifs are compared to each other using the program compare-matrices, which computes the similarities using many metrics (correlation, Euclidean distance, SSD, Sandelin-Wasserman, logo dot product, and length-normalized versions of these scores). All the pairwise comparisons are exported in a tab-delimited file, which can be accessed from the matrix-clustering result page.

Distance calculation

Before the clustering step, one of the supported metrics must be selected; it will be used to build the motif-to-motif distance table. However, some of the metrics supported by compare-matrices measure a similarity (e.g. correlation, normalized correlation), whereas others measure a distance (Euclidean, sum of squared deviations, Sandelin-Wasserman). Since hierarchical clustering assumes a distance table as input, the values of the selected metric are transformed into distance values, and the resulting table of distances between each pair of motifs is used for the hierarchical clustering step.
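The conversion described above, and the clustering step it feeds, can be sketched as follows. Several choices here are assumptions made for illustration only: the transformation `1 - r` for a correlation, the average-linkage agglomeration rule, the threshold value, and the toy correlations between four hypothetical motifs. The text does not specify the formula used by matrix-clustering, and the tool builds a complete tree before separating clusters, whereas this sketch simply stops merging once the closest pair of clusters exceeds the threshold.

```python
# Hypothetical pairwise correlations between four motifs.
NAMES = ["m1", "m2", "m3", "m4"]
CORRELATION = {
    ("m1", "m2"): 0.9, ("m1", "m3"): 0.1, ("m2", "m3"): 0.2,
    ("m1", "m4"): 0.0, ("m2", "m4"): -0.1, ("m3", "m4"): 0.05,
}


def correlation_to_distance(r):
    """Assumed transformation: map a correlation in [-1, 1] to a distance in [0, 2]."""
    return 1.0 - r


def average_linkage(names, dist, threshold):
    """Merge the closest clusters until their average distance exceeds threshold.

    dist maps frozenset({a, b}) to the distance between motifs a and b.
    Returns a list of clusters (lists of motif names): a "forest" rather
    than a single tree.
    """
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)
        if d > threshold:
            break  # remaining clusters stay separate
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


DISTANCE = {frozenset(pair): correlation_to_distance(r)
            for pair, r in CORRELATION.items()}
clusters = average_linkage(NAMES, DISTANCE, threshold=0.4)
```

With these toy values only m1 and m2 are close enough to be merged, so the result is one cluster of two motifs and two singletons. Changing the threshold or the agglomeration rule changes the number of clusters, which is why these parameters deserve evaluation.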
This distance table is also exported in the matrix-clustering results.

Hierarchical clustering

After having calculated the distance table between all the motifs, the hierarchical clustering approach is applied to produce a global tree encompassing all input motif
