DATA MINING INBIOINFORMATICSDHIFLI WAJDIPOSTDOCTORAL RESEARCHER ATUNIVERSITY OF QUEBEC AT MONTREAL (UQAM)D H I F L I . WA J D [email protected] C O U R RI E R . U Q AM . C AH T T P S : / / S I T E S . G O O G L E . C O M/ S I T E / W A J D I D H IF L I /
THE NEED FOR DATA MININGIN BIOINFORMATICSHigh-throughput technologies: Genome and RNA sequencing Compound screening Genotyping chips BioimagingBGI Hong Kong, Tai Po Industrial Estate, Hong Kong2Molecular databases are growing much faster than our knowledgeof biological processes.
THE NEED FOR DATA MININGIN BIOINFORMATICS Large collections of molecular data Gene and protein sequencesGenome sequenceProtein structuresChemical compounds Problems in Bioinformatics 3Predict the function of a gene given its sequencePredict the structure of a protein given its sequencePredict the boundaries of a gene given a genome segmentPredict the function of a chemical compound given itsmolecular structure .
Manual lab works are no longer able to match the increasing load of dataThe need of automated, fast and accurate computational tools is all themore urgent4THE NEED FOR DATA MININGIN BIOINFORMATICS
THE NEED FOR DATA MININGIN BIOINFORMATICS Additional challengesHighly complexNoisyInconsistentRedundant .Data Mining can Help !5
DATA MININGWhat is data mining?[Fayyad 1996]: "Data mining is the application of specific algorithms forextracting patterns from data".[Han&Kamber 2006]: "data mining refers to extracting or mining knowledgefrom large amounts of data".[Zaki and Meira 2014]: "Data mining comprises the core algorithms thatenable one to gain fundamental insights and knowledge from massive data".6Wikipedia: "it is defined as the computational process of discovering patternsin large data sets involving methods at the intersection of artificial intelligence,machine learning, statistics, and database systems".
DATA MININGData mining and the KDD (Knowledge Discovery inDatabases) process:KnowledgePostprocessingData miningPreprocessing7Data
DATA MININGPre-processing: data cleaning (to remove noise)data integration (combine multiple data source)data selection (to extract subsets of data that are concerned by theanalysis)data transformation (to transform data into convenient formats that arerequired from data mining algorithms).Data mining: applying computational methods to extract knowledge from ation of the discovered knowledge8
DATA MININGWhat do we concretely do? Clustering grouping of similar objects into sets (K-means, meanshift, DBSCAN, Spectral Clustering, ) Classification Identifying to which category an object belongs to(SVM, KNN, Decision Trees, Neural Networks, ) Pattern mining9 Finding existing (hidden) patterns in data (associations,correlations, subsequences, subgraphs, .) And more
DATA MINING Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple10Clustering: K-means
DATA MINING11Remark: Different algorithms can give different Clustering!
DATA MININGX(a) 1-nearest neighborXX(b) 2-nearest neighbor(c) 3-nearest neighbor1.Compute distance between two points:2.K-nearest neighbors of a record x are data points that have the ksmallest distance to x3.Takes the majority vote of class labels among the k-nearestneighbors12Classification: K-Nearest Neighbors
DATA MINING13Remark: Different algorithms can give different Classification!
DATA MININGPattern Mining: Frequent Subgraph MiningFinding subgraphs that occur in graph data, giving a minimum supportExample:SATKTGGFKVSTSSKSFTSFMMGMGraph databaseTSKSSSKSKSTTKKSSFFFrequent Subgraphs(minimum support 3)FTSSKSF14TS
DATA MINING INBIOINFORMATICS: EXAMPLE 115PROTEIN GRAPHREPOSITORY
PGR: PROTEIN GRAPHREPOSITORYYearly growth of protein structures in the 62008201020122014016Need of automatic tools to meet the increasing load of data !!
PGR: PROTEIN GRAPHREPOSITORYA protein 3D structure can be represented by a graph (protein contact map) Amino acids Nodes (labeled with the amino acid type) Connections between amino acids EdgesAmino acids .If distance is σan edge is created .17 Use graph mining techniques for automated analysis of protein 3D structure
PGR: PROTEIN GRAPHREPOSITORYProtein-to-graph transformation techniques: Delaunay triangulation Main Atom Abstract each residue in one atom of it; usually C-alpha atom Two residues are linked by an edge if the distance between their mainatoms threshold All Atoms18 Two residues are linked by an edge if the distance between any pair oftheir atoms threshold
PGR: PROTEIN GRAPHREPOSITORYA real world example:Protein 3D structureProtein graph19A protein 3D-structure (PDB-id: 5AHW) and its corresponding graph(Main Atom, C-alpha).
PGR: PROTEIN GRAPHREPOSITORYA real world example:The Human Hemoglobin protein 3D-structure (PDB-id: 1GZX) and itsProtein 3D structureProtein graph20corresponding graph (Main Atom, C-alpha).
PGR: PROTEIN GRAPHREPOSITORYProtein Graph Repository Transform protein 3Dstructure into graph Several transformationmethods Several output formats Download and visualizationRepository Download protein graphs Visualization Pre-computed GraphattributesPG-Similarity Find protein structuralneighbors Distribution of topologicalattributes for the queryprotein and its structuralneighbors21PG-converter
PGR: PROTEIN GRAPHREPOSITORYNumber of currently holding proteins graphs188 252Number of unique proteins 3D-structures94 126Protein graphs based on Cα94 126Protein graphs based on All Atoms94 12622PGR v1.0: Latest Release StatisticsBased on PDB dated July 11, 2014
PGR: PROTEIN GRAPHREPOSITORYOnline DEMO:23http://wjdi.bioinfo.uqam.ca/
PGR: PROTEIN GRAPHREPOSITORYUsefulness of PGRStructure alignment and comparison: Graph Matching: maximal Clique (Vast / Vast ), Edit distance,Similarity measures (MCS)Graph Embedding: Topological similarityPattern mining: Subgraphs (frequent, discriminative, )FingerprintsFunctional / structural motifsBinding sites .24
DATA MINING INBIOINFORMATICS: EXAMPLE 225Fast and AccurateProtein 3D-structureclassification inStructural andTo p o l o g i c a l S p a c e
PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONExisting protein Classification techniques: Sequence-based classification (Blast, ProtFun,SVM-Prot, .) Structure-based classification (e.g. Combinatorial Extension, Sheba,FatCat, Fragbag, .) Subsequences / substructures-based classificationThe subsequences / substructures are used as features to identifythe function of unknown proteins26
PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATION Sequence (and subsequences)-based classification do notincorporate spatial information ! less efficient in classifying structurally similar proteins with lowsequence similarity Structure and substructure-based classification techniques doincorporate spatial information But suffer computational cost!27It is essential to find an efficient way to incorporate 3Dstructure information with low computational complexity
PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONGeneral FrameworkWe used a set of 18 structural,topological, spectral, and labelgraph attributesNew proteinstructureProtein to graphPreprocessed PDB:(graph embeddings ofall 3D-structures)Graph embeddingNearestNeighborClassificationPredicted Class28 Class?
PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONResults29Accuracy comparison of ProtNN with other classification techniques.
PROTNN: FAST PROTEIN 3DSTRUCTURE CLASSIFICATIONResults30Runtime results of ProtNN, FatCat and CE on the entire Protein Data Bank.
DATA MINING What is data mining? [Fayyad 1996]: "Data mining is the application of specific algorithms for extracting patterns from data". [Han&Kamber 2006]: "data mining refers to extracting or mining knowledge from large amounts of data". [Zaki and Meira 2014]: "Data mining comprises the core algorithms that enable one to gain fundamental in