Data Mining In Bioinformatics - UQAM

2y ago
63 Views
10 Downloads
1.93 MB
31 Pages
Last View : 7d ago
Last Download : 1m ago
Upload by : Evelyn Loftin
Transcription

DATA MINING INBIOINFORMATICSDHIFLI WAJDIPOSTDOCTORAL RESEARCHER ATUNIVERSITY OF QUEBEC AT MONTREAL (UQAM)D H I F L I . WA J D I@ C O U R RI E R . U Q AM . C AH T T P S : / / S I T E S . G O O G L E . C O M/ S I T E / W A J D I D H IF L I /

THE NEED FOR DATA MININGIN BIOINFORMATICSHigh-throughput technologies: Genome and RNA sequencing Compound screening Genotyping chips BioimagingBGI Hong Kong, Tai Po Industrial Estate, Hong Kong2Molecular databases are growing much faster than our knowledgeof biological processes.

THE NEED FOR DATA MININGIN BIOINFORMATICS Large collections of molecular data Gene and protein sequencesGenome sequenceProtein structuresChemical compounds Problems in Bioinformatics 3Predict the function of a gene given its sequencePredict the structure of a protein given its sequencePredict the boundaries of a gene given a genome segmentPredict the function of a chemical compound given itsmolecular structure .

Manual lab works are no longer able to match the increasing load of dataThe need of automated, fast and accurate computational tools is all themore urgent4THE NEED FOR DATA MININGIN BIOINFORMATICS

THE NEED FOR DATA MININGIN BIOINFORMATICS Additional challengesHighly complexNoisyInconsistentRedundant .Data Mining can Help !5

DATA MININGWhat is data mining?[Fayyad 1996]: "Data mining is the application of specific algorithms forextracting patterns from data".[Han&Kamber 2006]: "data mining refers to extracting or mining knowledgefrom large amounts of data".[Zaki and Meira 2014]: "Data mining comprises the core algorithms thatenable one to gain fundamental insights and knowledge from massive data".6Wikipedia: "it is defined as the computational process of discovering patternsin large data sets involving methods at the intersection of artificial intelligence,machine learning, statistics, and database systems".

DATA MININGData mining and the KDD (Knowledge Discovery inDatabases) process:KnowledgePostprocessingData miningPreprocessing7Data

DATA MININGPre-processing: data cleaning (to remove noise)data integration (combine multiple data source)data selection (to extract subsets of data that are concerned by theanalysis)data transformation (to transform data into convenient formats that arerequired from data mining algorithms).Data mining: applying computational methods to extract knowledge from ation of the discovered knowledge8

DATA MININGWhat do we concretely do? Clustering grouping of similar objects into sets (K-means, meanshift, DBSCAN, Spectral Clustering, ) Classification Identifying to which category an object belongs to(SVM, KNN, Decision Trees, Neural Networks, ) Pattern mining9 Finding existing (hidden) patterns in data (associations,correlations, subsequences, subgraphs, .) And more

DATA MINING Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple10Clustering: K-means

DATA MINING11Remark: Different algorithms can give different Clustering!

DATA MININGX(a) 1-nearest neighborXX(b) 2-nearest neighbor(c) 3-nearest neighbor1.Compute distance between two points:2.K-nearest neighbors of a record x are data points that have the ksmallest distance to x3.Takes the majority vote of class labels among the k-nearestneighbors12Classification: K-Nearest Neighbors

DATA MINING13Remark: Different algorithms can give different Classification!

DATA MININGPattern Mining: Frequent Subgraph MiningFinding subgraphs that occur in graph data, giving a minimum supportExample:SATKTGGFKVSTSSKSFTSFMMGMGraph databaseTSKSSSKSKSTTKKSSFFFrequent Subgraphs(minimum support 3)FTSSKSF14TS

DATA MINING INBIOINFORMATICS: EXAMPLE 115PROTEIN GRAPHREPOSITORY

PGR: PROTEIN GRAPHREPOSITORYYearly growth of protein structures in the 62008201020122014016Need of automatic tools to meet the increasing load of data !!

PGR: PROTEIN GRAPHREPOSITORYA protein 3D structure can be represented by a graph (protein contact map) Amino acids Nodes (labeled with the amino acid type) Connections between amino acids EdgesAmino acids .If distance is σan edge is created .17 Use graph mining techniques for automated analysis of protein 3D structure

PGR: PROTEIN GRAPHREPOSITORYProtein-to-graph transformation techniques: Delaunay triangulation Main Atom Abstract each residue in one atom of it; usually C-alpha atom Two residues are linked by an edge if the distance between their mainatoms threshold All Atoms18 Two residues are linked by an edge if the distance between any pair oftheir atoms threshold

PGR: PROTEIN GRAPHREPOSITORYA real world example:Protein 3D structureProtein graph19A protein 3D-structure (PDB-id: 5AHW) and its corresponding graph(Main Atom, C-alpha).

PGR: PROTEIN GRAPHREPOSITORYA real world example:The Human Hemoglobin protein 3D-structure (PDB-id: 1GZX) and itsProtein 3D structureProtein graph20corresponding graph (Main Atom, C-alpha).

PGR: PROTEIN GRAPHREPOSITORYProtein Graph Repository Transform protein 3Dstructure into graph Several transformationmethods Several output formats Download and visualizationRepository Download protein graphs Visualization Pre-computed GraphattributesPG-Similarity Find protein structuralneighbors Distribution of topologicalattributes for the queryprotein and its structuralneighbors21PG-converter

PGR: PROTEIN GRAPHREPOSITORYNumber of currently holding proteins graphs188 252Number of unique proteins 3D-structures94 126Protein graphs based on Cα94 126Protein graphs based on All Atoms94 12622PGR v1.0: Latest Release StatisticsBased on PDB dated July 11, 2014

PGR: PROTEIN GRAPHREPOSITORYOnline DEMO:23http://wjdi.bioinfo.uqam.ca/

PGR: PROTEIN GRAPHREPOSITORYUsefulness of PGRStructure alignment and comparison: Graph Matching: maximal Clique (Vast / Vast ), Edit distance,Similarity measures (MCS)Graph Embedding: Topological similarityPattern mining: Subgraphs (frequent, discriminative, )FingerprintsFunctional / structural motifsBinding sites .24

DATA MINING INBIOINFORMATICS: EXAMPLE 225Fast and AccurateProtein 3D-structureclassification inStructural andTo p o l o g i c a l S p a c e

PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONExisting protein Classification techniques: Sequence-based classification (Blast, ProtFun,SVM-Prot, .) Structure-based classification (e.g. Combinatorial Extension, Sheba,FatCat, Fragbag, .) Subsequences / substructures-based classificationThe subsequences / substructures are used as features to identifythe function of unknown proteins26

PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATION Sequence (and subsequences)-based classification do notincorporate spatial information ! less efficient in classifying structurally similar proteins with lowsequence similarity Structure and substructure-based classification techniques doincorporate spatial information But suffer computational cost!27It is essential to find an efficient way to incorporate 3Dstructure information with low computational complexity

PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONGeneral FrameworkWe used a set of 18 structural,topological, spectral, and labelgraph attributesNew proteinstructureProtein to graphPreprocessed PDB:(graph embeddings ofall 3D-structures)Graph embeddingNearestNeighborClassificationPredicted Class28 Class?

PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONResults29Accuracy comparison of ProtNN with other classification techniques.

PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONResults30Runtime results of ProtNN, FatCat and CE on the entire Protein Data Bank.

THANKS31QUESTIONS

DATA MINING What is data mining? [Fayyad 1996]: "Data mining is the application of specific algorithms for extracting patterns from data". [Han&Kamber 2006]: "data mining refers to extracting or mining knowledge from large amounts of data". [Zaki and Meira 2014]: "Data mining comprises the core algorithms that enable one to gain fundamental in

Related Documents:

Marie-Michèle Mailloux Patrice Tremblay Direction technique Agent de logistique et location Téléphone : (514) 987-3000 poste 3277 Téléphone : (514) 987-3000 poste 3785 Local: J-2325 Local : J-2320 mailloux.marie-michele@uqam.ca tremblay.patrice@uqam.ca

Preface to the First Edition xv 1 DATA-MINING CONCEPTS 1 1.1 Introduction 1 1.2 Data-Mining Roots 4 1.3 Data-Mining Process 6 1.4 Large Data Sets 9 1.5 Data Warehouses for Data Mining 14 1.6 Business Aspects of Data Mining: Why a Data-Mining Project Fails 17 1.7 Organization of This Book 21 1.8 Review Questions and Problems 23

Bioinformatics Crash Course Ian Misner Ph.D. Bioinformatics Coordinator UMD Bioinformatics Core . Bioinformatics!Core The Plan Monday – Introductions – Linux and Python Hands-on Training Tuesday – NGS Introduction – RNAseq with Sailfish (Dr. Steve Mount, CBCB) – RNAse

Data Mining and its Techniques, Classification of Data Mining Objective of MRD, MRDM approaches, Applications of MRDM Keywords Data Mining, Multi-Relational Data mining, Inductive logic programming, Selection graph, Tuple ID propagation 1. INTRODUCTION The main objective of the data mining techniques is to extract .

SECTION-A: Attempt any five questions. SECTION-B: Attempt any five questions. SECTION–A Short Answer type Questions: (60-80 Words) 5 5 25 Marks 1. What is the role of internet in bioinformatics? 2. How bioinformatics assist in drug designing? 3. Write a short note on Internet Protocol (IP). 4. What is Pattern mining? 5.

October 20, 2009 Data Mining: Concepts and Techniques 7 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization October 20, 2009 Data Mining: Concepts and Techniques 8 Why Not Traditional Data Analysis? Tremendous amount of data

volumes of biological information in bioinformatics database. They also provide some bioinformatics tools for database search and data acquire. With the explosion of sequence information available to researchers, the challenge facing bioinformatics and computational biologists is to aid in biomedical researches and to invent efficient toolkits.

TANK DESIGN & DETAILING Introduction The API 650 standard is designed to provide the petroleum industry with tanks of adequate safety and reasonable economy for use in the storage of petroleum, petroleum products, and other liquid products commonly handled and stored by the various branches of the industry. This standard does not present or establish a fixed series of allowable tank sizes .