Bioinformatics & Machine Learning
Daniel Glez-Peña
IP Leiria, June 3 2009
Agenda
1. Bioinformatics: definition, major research areas, databases
2. Machine Learning for bioinformatics: algorithm types, examples in bioinformatics
3. DNA Microarrays: technological overview
4. Applications: GeneCBR and WhichGenes?
Bioinformatics
Bioinformatics
"Application of the Information Technologies to the field of molecular biology"
Creation and enhancement of:
- Databases with biological information
- Algorithms
- Statistical techniques
to solve formal and practical problems arising from the management and analysis of biological data
Bioinformatics | Machine Learning | Microarrays | Applications
Major research areas
GENOMICS
- Sequence analysis
- Genome annotation
- Analysis of mutations in cancer
PROTEOMICS
- Protein-protein docking
- Analysis of protein expression
- Prediction of protein structure
MICROARRAYS
- Analysis of gene expression
- Genetic network induction
TEXT MINING
- Gene annotation
- Protein annotation
- Relation extraction
EVOLUTION
- Phylogenetic reconstruction
- Comparative genomics
SYSTEMS BIOLOGY
- Modelling biological systems
OTHER
- Image analysis
Major research areas
Larrañaga et al (2005), Briefings in Bioinformatics 7(1):86-112
[Figure; only the year labels 2005, 2007, 2008 and 2009 survive]
Molecular biology dogma
[Diagram labels: DNA, transcription, gene sequence (ACTTGTCATGGCGACTGTCCTTTGTGC), protein sequence (MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLS), interacts, interactomics, pathway (set of reactions)]
GENOMICS: sequence analysis, genome annotation, analysis of mutations
PROTEOMICS: protein expression analysis [mass spectrometry], protein structure prediction [folding]
Databases
[Diagram of database categories: Genomics, Proteomics, Interactomics & Metabolomics, Sequences, Prot-Prot …, Ontologies, Gene-centric, Experimental data, Bibliome]
Machine Learning for Bioinformatics
Machine Learning & Bioinformatics
- CLASSIFICATION (SUPERVISED LEARNING)
- CLUSTERING (UNSUPERVISED LEARNING)
- GRAPHICAL PROBABILISTIC MODELS
- OPTIMIZATION
ML & Bioinformatics: Classification
Classification (supervised learning)
- Given a set of "instances", each one with a set of measured "attributes" and an "outcome" value, we want to train a model that predicts the outcome for new problem instances
- If the "outcome" is discrete (typically 2 or more different values) we are talking about classification (if not: regression)
[Figure: training data and test data]
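To make the training data / test data split concrete, here is a minimal sketch. It assumes a nearest-centroid classifier, which the slides do not name, and uses made-up toy data; all names (`train_centroids`, `predict`, the class labels) are illustrative only.

```python
from statistics import mean

def train_centroids(instances, outcomes):
    """Training: compute one centroid (mean attribute vector) per outcome class."""
    by_class = {}
    for x, y in zip(instances, outcomes):
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*rows)] for y, rows in by_class.items()}

def predict(centroids, x):
    """Prediction: return the class whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Toy data (made up): two measured attributes per instance, discrete outcome.
train_X = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["healthy", "healthy", "tumour", "tumour"]
test_X = [[0.9, 1.1], [3.1, 3.0]]
test_y = ["healthy", "tumour"]

model = train_centroids(train_X, train_y)            # fit on training data only
predictions = [predict(model, x) for x in test_X]    # evaluate on unseen test data
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

The model never sees the test instances during training, so `accuracy` estimates how well it predicts the outcome for new problem instances.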
ML & Bioinformatics: Classification
Classification: feature subset selection
- Are all input attributes useful?
- Advantages: reduced cost of data acquisition, improved understandability of the model, faster training, and better accuracy
- It is a search problem over 2^n - 1 candidate subsets (for n attributes). In general:
  1. Generate a subset [brute force, deterministic/non-deterministic heuristic search]
  2. Evaluate the subset
     - Statistical estimation: Information Gain, χ², t-test, DFP, CFS
     - Wrapper (use classifier accuracy on the training set)
  3. if (!halt condition) GOTO 1
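The generate / evaluate / halt loop can be sketched as follows. This is only an illustration under stated assumptions: brute-force generation of all 2^n - 1 non-empty subsets, and a wrapper evaluation based on the leave-one-out accuracy of a 1-nearest-neighbour classifier. The toy data and function names are made up.

```python
from itertools import combinations

def all_subsets(n):
    """Step 1 (brute force): yield every non-empty subset of n attribute indices."""
    for size in range(1, n + 1):
        yield from combinations(range(n), size)

def loo_accuracy(X, y, subset):
    """Step 2 (wrapper): leave-one-out accuracy of a 1-nearest-neighbour
    classifier that only looks at the attributes in `subset`."""
    hits = 0
    for i, xi in enumerate(X):
        def dist(j):
            return sum((xi[k] - X[j][k]) ** 2 for k in subset)
        nearest = min((j for j in range(len(X)) if j != i), key=dist)
        hits += y[nearest] == y[i]
    return hits / len(X)

# Toy data (made up): attribute 0 separates the classes, attributes 1 and 2 are noise.
X = [[0.1, 5.0, 2.0], [0.2, 1.0, 9.0], [0.3, 8.0, 4.0],
     [0.9, 4.9, 2.1], [1.0, 1.1, 9.2], [1.1, 8.2, 3.9]]
y = [0, 0, 0, 1, 1, 1]

best_subset, best_score = None, -1.0
n_evaluated = 0
for subset in all_subsets(len(X[0])):   # step 3: halt when the space is exhausted
    score = loo_accuracy(X, y, subset)
    n_evaluated += 1
    if score > best_score:              # strict '>' keeps the first (smallest) best subset
        best_subset, best_score = subset, score
```

With n = 3 attributes the loop evaluates 7 subsets. For thousands of genes brute force is infeasible, which is why heuristic search or filter scores are used instead.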
ML & Bioinformatics: Classification
Classification: popular techniques
- Logistic regression
- Linear discriminant analysis (LDA)
- Bayesian classifiers: Naive Bayes, semi-NB, tree-augmented NB, k-dependence Bayesian
- Classification trees: CART, C4.5, RandomForest, J48
- K-Nearest Neighbours
- Support Vector Machines
- Meta: Bagging, Boosting
[Figures: classification tree, kNN classifier, support vector machine]
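As one example from this list, a k-nearest-neighbours classifier fits in a few lines. The sketch below assumes Euclidean distance and majority voting with k = 3; the data and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label among the k training instances closest to x."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated classes.
train_X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
train_y = ["A", "A", "A", "B", "B", "B"]

label_low = knn_predict(train_X, train_y, [1.5, 1.5])
label_high = knn_predict(train_X, train_y, [6.5, 6.5])
```

kNN has no training phase: all the work happens at prediction time, so its cost grows with the size of the training set.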
ML & Bioinformatics: Classification
Examples of classification in Bioinformatics (I)
Genomics
- Gene finding (is a sequence a coding region?)
- Splice site prediction (is a sequence a splice site?)
- Predicting disease genes (e.g. from sequence length?)
- Prediction of mutation (SNP) effect
- Cancer prediction from gene expression (microarrays)
Proteomics
- Prediction of secondary structure (alpha-helix, beta-sheet, etc.)
- Prediction of the sub-cellular location of a protein
- Cancer prediction from protein expression (mass spectra)
ML & Bioinformatics: Classification
Examples of classification in Bioinformatics (and II)
Systems biology
- Predict the cell migration speed (high, low) from the phosphorylation levels of signalling proteins
- Predict a gene's regulatory level (up-regulated or down-regulated) given the expression of 'related' genes
Text mining
- Protein/gene recognition in biomedical literature (is this word a gene/protein given some word features: orthographic, part-of-speech, suffix, trigger words, etc.?)
ML & Bioinformatics: Clustering
Clustering
- Partition a set of "instances" into several groups (clusters) given the differences between them
- The techniques are based on "distances" between instances; the choice of distance is problem-dependent
- Typical: Euclidean, Pearson, Spearman
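The three typical measures behave quite differently on expression profiles, as the sketch below illustrates. It assumes the common convention of turning a correlation r into a distance as 1 - r, and its `ranks` helper assumes there are no tied values; the gene profiles are made up.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient (linear association)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def ranks(values):
    """Rank of each value, starting at 1 (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks (monotone association)."""
    return pearson(ranks(a), ranks(b))

# Toy expression profiles (made up).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, ten times larger
gene_c = [1.0, 4.0, 9.0, 16.0]      # monotone in gene_a, but not linear

d_euclid = euclidean(gene_a, gene_b)           # large: magnitudes differ
d_pearson = 1 - pearson(gene_a, gene_b)        # about 0: same shape
d_spearman = 1 - spearman(gene_a, gene_c)      # about 0: same ordering
```

Euclidean distance separates `gene_a` and `gene_b` because of their scale, while the correlation-based distances treat them as co-expressed. That difference is why the choice is problem-dependent.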
ML & Bioinformatics: Clustering
Clustering: popular techniques
- Partition clustering: k-means, SOM, GCS, PAM
- Hierarchical clustering with single linkage, complete linkage, centroid linkage and Ward's criterion
  - They produce the popular "dendrograms"
- Model-based clustering
[Figures: hierarchical clustering (dendrogram), partition clustering]
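A minimal k-means sketch shows how partition clustering alternates between assigning instances and updating centroids. Two simplifications are assumptions made here for reproducibility: the first k points serve as initial centroids, and the loop runs for a fixed number of iterations instead of testing convergence. The data are made up.

```python
import math

def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = [list(p) for p in points[:k]]   # naive deterministic initialisation
    assignment = []
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy data (made up): two groups around (1, 1) and (5, 5).
points = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.1, 4.9], [0.9, 1.1], [4.8, 5.2]]
assignment, centroids = kmeans(points, k=2)
```

Real implementations usually pick random or k-means++ initial centroids and restart several times, because the result depends on the initialisation.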
ML & Bioinformatics: Clustering
Clustering in Bioinformatics
Mainly applied to analyze gene expression data
- Co-expression detection (group genes with similar expression)
- Subclass discovery (group samples given the expression of their genes)
- Expression data visualization/summarization with dendrograms
ML & Bioinformatics: Probabilistic graphical models
- DAGs where nodes are random variables and links encode conditional dependences (of any kind) and their probabilities
- Examples
  - Hidden Markov Models
  - Bayesian Networks
[Figures: Hidden Markov Model, Bayesian Network]
ML & Bioinformatics: Probabilistic graphical models
Probabilistic graphical models in Bioinformatics
Genomics
- HMMs for gene finding (does a gene sequence come from a coding or a non-coding DNA region?)
- Bayesian networks to detect splice sites (does a gene sequence come from a splice site?)
Systems Biology
- Inference of regulatory genetic networks: Bayesian networks for expression pattern recognition (which genes cause other genes to be expressed?)
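The gene-finding question can be illustrated with a two-state HMM decoded by the Viterbi algorithm. This is a toy sketch: the states, the transition and emission probabilities, and the GC-rich versus AT-rich assumption are all made up for illustration and do not come from a real gene finder.

```python
import math

def viterbi(sequence, states, start, trans, emit):
    """Return the most probable state path for `sequence` (log-space Viterbi)."""
    # best[s]: log-probability of the best path that ends in state s
    best = {s: math.log(start[s]) + math.log(emit[s][sequence[0]]) for s in states}
    back = []   # back[i][s]: best predecessor of state s at position i + 1
    for symbol in sequence[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda q: prev[q] + math.log(trans[q][s]))
            best[s] = prev[p] + math.log(trans[p][s]) + math.log(emit[s][symbol])
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters: coding regions are GC-rich, non-coding regions AT-rich.
states = ("coding", "noncoding")
start = {"coding": 0.5, "noncoding": 0.5}
trans = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GCGCGCATATAT", states, start, trans, emit)
```

The decoded path labels the GC-rich prefix as coding and the AT-rich suffix as non-coding. Real gene finders use many more states (exons, introns, codon positions) and parameters estimated from annotated genomes.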
ML & Bioinformatics: Optimization
Optimization
- Search for the best solution in a huge (exponential) space
Popular techniques
- Exact optimization: brute force
- Deterministic: hill climbing, local optimization
- Stochastic:
  - Monte Carlo
  - Simulated Annealing
  - Tabu search
  - Evolutionary
    - Genetic algorithms
    - Genetic Programming
    - Estimation of probability
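As an example of a stochastic technique, simulated annealing can be sketched as below. The cost function (mismatches against a hypothetical target bit string), the neighbour move (flip one bit), and the cooling schedule are assumptions chosen only to keep the example small.

```python
import math
import random

def simulated_annealing(cost, neighbour, start, steps=2000, t0=1.0,
                        cooling=0.995, seed=0):
    """Minimise `cost`, accepting worse candidates with probability
    exp(-delta / temperature) so the search can escape local optima."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temperature = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling   # geometric cooling schedule
    return best, best_cost

# Toy problem (made up): recover a hidden bit string.
TARGET = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1)

def cost(bits):
    """Number of positions that differ from TARGET."""
    return sum(b != t for b, t in zip(bits, TARGET))

def neighbour(bits, rng):
    """Flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

start = (0,) * len(TARGET)
best, best_cost = simulated_annealing(cost, neighbour, start)
```

Setting the temperature to a very small value makes the acceptance rule reject almost every worsening move, which turns the same loop into a stochastic hill climber.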
ML & Bioinformatics: Optimization
Optimization techniques in Bioinformatics
Genomics
- Multiple sequence alignment (almost all optimization algorithms have been used)
- Splice site prediction with estimation of distribution algorithms
- DNA sequencing
- Clustering microarray data
Proteomics
- Protein folding (predict 3D structure)
- Protein side-chain prediction (determine the optimal set of 'angles' in the 3D structure that minimizes the energy)
Systems Biology
- Inference of gene networks and estimation of the parameters of bioprocesses
Evolution
- Inference of phylogenetic trees
- Haplotype reconstruction
DNA Microarrays
DNA Microarrays
DNA microarray. Objective: measure gene expression
Description
- A matrix that measures the expression of thousands of genes simultaneously
- Gives a "global" view of gene activity, and allows comparison
  - Between different individuals
  - Of the same individual at different times
  - Between different tissues
DNA Microarrays
How it works
– DNA fragments are spotted or printed as probes on the array surface
  - Each probe is a gene
– Hybridization is performed with a sample placed onto the array
– A scanner measures the intensity at each probe
[Figure: DNA fragments, sample, microarray, image, microarray data]
DNA Microarrays
- Human Genome U133 (HG U133A, HG U133B): approx. 22,000 probes (1 probe per gene)
- Human Genome U133 Plus: 44,000 probes (2 probes per gene)
- Exon array: 1.4 million probes (16 probes per gene)
DNA microarrays
Typical analyses & ML techniques
Gene-based analysis
- Differential gene expression analysis: detect which genes show a significant expression variation among samples of two or more conditions (feature selection)
- Co-expression detection with clustering techniques (unsupervised)
Sample-based analysis
- Class prediction with classification techniques (supervised)
- Class discovery with clustering techniques (unsupervised)
Problems:
- Huge number of features (thousands of genes) and low number of samples (dozens), which is a hard setting for machine learning
- High false positive rate
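Differential expression between two conditions can be sketched by ranking genes on a per-gene test statistic. The example below assumes Welch's t statistic, one of several options, and uses made-up expression values. It deliberately stops at the statistic: a real analysis would also compute p-values and correct them for multiple testing, precisely because of the high false positive rate noted above.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of expression values."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

# Toy data (made up): gene -> (values in condition 1, values in condition 2).
expression = {
    "gene_1": ([5.1, 4.9, 5.0], [5.0, 5.2, 4.8]),   # no real difference
    "gene_2": ([2.0, 2.1, 1.9], [6.0, 6.2, 5.8]),   # large, consistent difference
    "gene_3": ([7.0, 7.5, 6.5], [6.0, 6.4, 5.6]),   # moderate, noisier difference
}

# Rank genes from most to least differentially expressed.
ranked = sorted(expression, key=lambda g: abs(welch_t(*expression[g])), reverse=True)
```

With only three samples per condition the statistic is unstable, which illustrates the "many features, few samples" problem on a small scale.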
DNA microarrays
Functional interpretation after data analysis
- Typically we have a list of genes of interest (e.g. differentially expressed)
- Question: what are those genes?
- Solution: use the available gene annotations (Gene Ontology, pathways, etc.) and see if there is a correlation with a functional module
  - This answers the question: are my genes drawn significantly more often than expected from a given gene function? If so, which function?
- On-line tools
  - List-based: FatiGO, DAVID, Pathjam
  - Gene-set based: GSEA, FatiScan
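For list-based tools, the question "are my genes significantly drawn from a given function?" is commonly answered with an over-representation test. The sketch below assumes a one-sided hypergeometric test and made-up counts; it is not taken from any of the tools named above, whose exact statistics may differ.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k): probability of seeing at least k annotated genes in a list of
    n genes drawn without replacement from N genes, K of which are annotated."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total

# Toy numbers (made up): 20 genes on the array, 5 annotated with the function,
# 5 genes of interest, 4 of which carry the annotation.
p_value = hypergeom_pvalue(N=20, K=5, n=5, k=4)
```

A small `p_value` suggests the function is over-represented in the list. Since one test is run per function or pathway, the resulting p-values need multiple-testing correction.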
Sample applications
geneCBR
- Translational tool for DNA …
- www.genecbr.org
- Glez-Peña et al. BMC Bioinformatics 10:37 2007
- Classification guided by a clustering algorithm (GCS)
WhichGenes?
- On-line geneset building tool
- Create your own genesets from multiple data sources and use them in your favourite geneset-based analysis tools like GSEA
- www.whichgenes.org
- Glez-Peña et al. Nucleic Acids Res (web server issue) 2009
Questions?