Bioinformatics & Machine Learning


Bioinformatics & Machine Learning
Daniel Glez-Peña
IP Leiria, June 3 2009

Agenda
1. Bioinformatics: definition, major research areas, databases
2. Machine Learning for bioinformatics: algorithm types, examples in bioinformatics
3. DNA Microarrays: technological overview
4. Applications: GeneCBR and WhichGenes?

Bioinformatics

Bioinformatics
- “Application of information technologies to the field of molecular biology”
- Creation and enhancement of:
  - Databases with biological information
  - Algorithms
  - Statistical techniques
  to solve formal and practical problems arising from the management and analysis of biological data

Bioinformatics | Machine Learning | Microarrays | Applications

Major research areas
- GENOMICS
  - Sequence analysis
  - Genome annotation
  - Analysis of mutations in cancer
- PROTEOMICS
  - Protein-protein docking
  - Analysis of protein expression
  - Prediction of protein structure
- MICROARRAYS
  - Analysis of gene expression
- SYSTEMS BIOLOGY
  - Genetic network induction
  - Modelling biological systems
- TEXT MINING
  - Gene annotation
  - Protein annotation
  - Relation extraction
- EVOLUTION
  - Phylogenetic reconstruction
  - Comparative genomics
- OTHER
  - Image analysis

Major research areas
Larrañaga et al (2005), Briefings in Bioinformatics 7(1):86-112

Molecular biology dogma
[Diagram; recoverable labels:]
- DNA sequence: ACTTGTCATGGCGACTGTCCTTTGTGC
- Transcription
- Protein sequence: MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLS
- Interactions (interactomics)
- Pathway: set of reactions
- GENOMICS: sequence analysis, genome annotation, analysis of mutations
- PROTEOMICS: protein expression analysis [mass spectrometry], protein structure prediction [folding]

Databases
[Figure: database categories: Genomics, Proteomics, Interactomics & Metabolomics, Sequences, Prot-Prot ..., Ontologies, Gene-centric, Experimental data, Bibliome]

Machine Learning for Bioinformatics

Machine Learning & Bioinformatics
- CLASSIFICATION (SUPERVISED LEARNING)
- CLUSTERING (UNSUPERVISED LEARNING)
- GRAPHICAL PROBABILISTIC MODELS
- OPTIMIZATION

ML & Bioinformatics: Classification
- Classification (supervised learning)
  - Given a set of “instances”, each with a set of measured “attributes” and an “outcome” value, we want to train a model that predicts the outcome for further problem instances
  - If the “outcome” is discrete (typically 2 or more different values), we are talking about classification (if not: regression)
[Figure: training data, test data]
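
The train-then-predict workflow described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the nearest-centroid classifier, the function names, and the toy data are all made up here and are not taken from the slides.

```python
from statistics import mean

def train_centroids(instances, outcomes):
    """Training: compute one mean attribute vector (centroid) per outcome value."""
    groups = {}
    for x, y in zip(instances, outcomes):
        groups.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*rows)] for y, rows in groups.items()}

def predict(centroids, x):
    """Prediction: return the outcome whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Toy data (invented): two measured attributes per instance, discrete outcome.
train_X = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["healthy", "healthy", "tumor", "tumor"]
test_X = [[0.9, 1.1], [3.1, 3.0]]
test_y = ["healthy", "tumor"]

model = train_centroids(train_X, train_y)             # fit on training data only
predictions = [predict(model, x) for x in test_X]     # apply to unseen test data
accuracy = mean(1.0 if p == t else 0.0 for p, t in zip(predictions, test_y))
print(predictions, accuracy)
```

Keeping the test instances out of training is what makes the accuracy an estimate of performance on further problem instances.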

ML & Bioinformatics: Classification
- Classification: feature subset selection
  - Are all input attributes useful?
  - Advantages: reduced cost of data acquisition, improved understandability of the model, faster training, and better accuracy
  - It is a search-space problem (2^n - 1 candidate subsets for n attributes). In general:
    1. Generate a subset [brute force, deterministic/non-deterministic heuristic search]
    2. Evaluate the subset
       - Statistical estimation: Information Gain, χ², t-test, DFP, CFS
       - Wrapper (use classifier accuracy on the training set)
    3. if (!halt condition) GOTO 1
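
The generate/evaluate/halt loop can be sketched as follows. The sketch assumes brute-force generation and a wrapper evaluation; the nearest-centroid classifier, the function names, and the toy data are choices made here for illustration, since the slides do not name a specific classifier.

```python
from itertools import combinations
from statistics import mean

def training_accuracy(X, y, subset):
    """Wrapper evaluation: training-set accuracy of a nearest-centroid
    classifier that only sees the attributes listed in `subset`."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append([row[i] for i in subset])
    centroids = {lab: [mean(c) for c in zip(*rows)] for lab, rows in groups.items()}
    hits = 0
    for row, label in zip(X, y):
        x = [row[i] for i in subset]
        guess = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )
        hits += guess == label
    return hits / len(y)

def best_subset(X, y):
    n = len(X[0])
    best, best_score, evaluated = None, -1.0, 0
    for size in range(1, n + 1):                     # 1. generate (brute force)
        for subset in combinations(range(n), size):
            score = training_accuracy(X, y, subset)  # 2. evaluate (wrapper)
            evaluated += 1
            if score > best_score:                   # ties keep the smaller subset
                best, best_score = subset, score
    return best, best_score, evaluated               # 3. halt: all subsets tried

# Toy data (invented): only attribute 1 separates the two classes.
X = [[5, 0.1, 7], [1, 0.2, 6], [4, 0.9, 7], [2, 1.0, 6]]
y = ["a", "a", "b", "b"]
subset, score, evaluated = best_subset(X, y)
print(subset, score, evaluated)  # evaluated == 2**3 - 1 subsets
```

Brute force is only feasible for a handful of attributes; with thousands of genes the subset generator would have to be replaced by a heuristic search.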

ML & Bioinformatics: Classification
- Classification: popular techniques
  - Logistic regression
  - Linear discriminant analysis (LDA)
  - Bayesian classifiers: Naive Bayes, semi-NB, Tree augmented NB, k-dependence Bayesian
  - Classification trees: CART, C4.5, RandomForest, J48
  - K-Nearest Neighbours
  - Support Vector Machines
  - Meta: Bagging, Boosting
[Figures: classification tree, kNN classifier, support vector machine]
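
As one concrete instance of the techniques listed, a k-nearest-neighbours classifier fits in a few lines. This is a minimal sketch assuming Euclidean distance and majority voting; the function name and the toy data are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Label x with the majority outcome among its k nearest training instances."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data (invented): two attributes, two outcome values.
train_X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, [1.5, 1.5]))
print(knn_predict(train_X, train_y, [6.5, 6.5]))
```

kNN has no training phase: all the work happens at prediction time, so its cost grows with the size of the training set.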

ML & Bioinformatics: Classification
- Examples of classification in Bioinformatics (I)
  - Genomics
    - Gene finding (is a sequence a coding region?)
    - Splice site prediction (is a sequence a splice site?)
    - Predict disease genes (from, e.g., sequence length?)
    - Prediction of mutation (SNP) effect
    - Cancer prediction from gene expression (microarrays)
  - Proteomics
    - Prediction of secondary structure (alpha-helix, beta-sheet, etc.)
    - Prediction of sub-cellular location of the protein
    - Cancer prediction from protein expression (mass spectra)

ML & Bioinformatics: Classification
- Examples of classification in Bioinformatics (and II)
  - Systems biology
    - Predict the cell migration speed (high, low) from the phosphorylation levels of signalling proteins
    - Predict a gene's regulatory level (up-regulated or down-regulated) given the expression of ‘related’ genes
  - Text mining
    - Protein/gene recognition in biomedical literature (is this word a gene/protein, given some word features: orthographic, part-of-speech, suffix, trigger words, etc.?)

ML & Bioinformatics: Clustering
- Clustering
  - Partition a set of “instances” into several groups (clusters) given the differences between them
  - Clusterings are based on “distances” between instances; the choice of distance is problem-dependent
  - Typical: Euclidean, Pearson, Spearman
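
The three measures named above can be written out directly. The sketch below assumes the usual convention of turning a correlation r into a distance as 1 - r, and a simplified Spearman ranking that does not handle ties; the function names and example profiles are invented.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

def ranks(values):
    """Ranks 1..n of the values (simplification: assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Two invented expression profiles: same trend, different scale and shape.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 4.0, 9.0, 16.0]
print(euclidean(gene_a, gene_b))     # large: absolute values differ
print(1 - pearson(gene_a, gene_b))   # small: nearly linear relation
print(1 - spearman(gene_a, gene_b))  # about zero: identical ordering
```

The example shows why the choice is problem-dependent: the two profiles are far apart in Euclidean terms but almost identical under the correlation-based distances.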

ML & Bioinformatics: Clustering
- Clustering: popular techniques
  - Partition clustering: k-means, SOM, GCS, PAM
  - Hierarchical clustering with single-linkage, complete-linkage, centroid-linkage and Ward's criterion
    - They produce the popular “dendrograms”
  - Model-based clustering
[Figures: hierarchical clustering (dendrogram), partition clustering]
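
To make partition clustering concrete, here is a minimal k-means sketch (Lloyd's algorithm). It assumes Euclidean distance and a naive initialization from the first k points, which is chosen only to keep the example deterministic; the function name and data are invented.

```python
from math import dist
from statistics import mean

def k_means(points, k, max_iter=100):
    # Assumption: naive initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:   # converged: nothing moved
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return assignment, centroids

# Toy data (invented): two well-separated groups.
points = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [2.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

In practice the result depends on the initialization, so k-means is usually restarted several times from random starting centroids.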

ML & Bioinformatics: Clustering
- Clustering in Bioinformatics
  - Mainly applied to analyze gene expression data
    - Co-expression detection (group genes with similar expression)
    - Subclass discovery (group samples given the expression of their genes)
    - Expression data visualization/summarization with dendrograms

ML & Bioinformatics: Probabilistic graphical models
- Directed acyclic graphs (DAGs) whose nodes are random variables and whose links encode conditional dependences of any kind, quantified by probabilities
- Examples
  - Hidden Markov Models
  - Bayesian Networks
[Figures: Hidden Markov Model, Bayesian Network]

ML & Bioinformatics: Probabilistic graphical models
- Probabilistic graphical models in Bioinformatics
  - Genomics
    - HMMs for gene finding (does a sequence come from a coding or a non-coding DNA region?)
    - Bayesian networks to detect splice sites (does a sequence come from a splice site?)
  - Systems Biology
    - Inference of regulatory genetic networks: Bayesian networks for expression pattern recognition (which genes cause other genes to express?)
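
The gene-finding question can be illustrated with a two-state HMM decoded by the Viterbi algorithm. Everything numeric below is an assumption made for the sketch: the state names, the start, transition and emission probabilities, and the toy sequence are invented and not estimated from real data.

```python
from math import log

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for the observations (computed in log space)."""
    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best = [{s: (log(start_p[s]) + log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + log(emit_p[s][obs]), prev)
        best.append(column)
    # Backtrack from the best final state to the start.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Invented parameters: "coding" regions are GC-rich, "noncoding" AT-rich.
states = ["coding", "noncoding"]
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {"coding": {"coding": 0.9, "noncoding": 0.1},
           "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit_p = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("ATATGCGCGC", states, start_p, trans_p, emit_p)
print(path)
```

Real gene finders use much richer state structures (codon positions, start and stop signals, introns), but the decoding step has the same shape.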

ML & Bioinformatics: Optimization
- Optimization
  - Search for the best solution in a huge (exponential) space
  - Popular techniques
    - Exact optimization
      - Brute force
    - Deterministic
      - Hill climbing, local optimization
    - Stochastic
      - Monte Carlo
      - Simulated Annealing
      - Tabu search
      - Evolutionary
        - Genetic algorithms
        - Genetic Programming
        - Estimation of probability ...
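
Of the stochastic techniques listed, simulated annealing is short enough to sketch. The cost function, neighbourhood, cooling schedule, and parameter values below are all assumptions chosen for illustration; the cost function is built to have local minima at x = 4 and x = 10 that would trap plain hill climbing.

```python
import math
import random

def simulated_annealing(cost, neighbour, start, t0=10.0, cooling=0.95,
                        steps=500, seed=0):
    """Minimize `cost`, occasionally accepting worse moves to escape local minima."""
    rng = random.Random(seed)           # fixed seed: reproducible runs
    current, best = start, start
    temperature = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with probability
        # exp(-delta / T), which shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temperature = max(temperature * cooling, 1e-9)
    return best

def cost(x):
    # Invented landscape: global minimum at x = 7, penalty bumps elsewhere.
    return (x - 7) ** 2 + (15 if x % 4 == 1 else 0)

def neighbour(x, rng):
    return x + rng.choice([-1, 1])      # move one step left or right

result = simulated_annealing(cost, neighbour, start=20)
print(result, cost(result))
```

Because the best solution seen so far is tracked separately, the returned value is never worse than the starting point, whatever the random moves do.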

ML & Bioinformatics: Optimization
- Optimization techniques in Bioinformatics
  - Genomics
    - Multiple sequence alignment (almost all optimization algorithms have been used)
    - Splice site prediction with estimation of distribution algorithms
    - DNA sequencing
    - Cluster microarray data
  - Proteomics
    - Protein folding (predict 3D structure)
    - Protein side-chain prediction (determine the optimal set of ‘angles’ in the 3D structure that minimizes the energy)
  - Systems Biology
    - Inference of gene networks and estimation of the parameters of bioprocesses
  - Evolution
    - Inference of phylogenetic trees
    - Haplotype reconstruction

DNA Microarrays

DNA Microarrays
- DNA microarray. Objective: measure gene expression
- Description
  - Matrix that measures the expression of thousands of genes simultaneously
  - Gives a “global” vision of gene activity, and allows comparison
    - Between different individuals
    - Same individual at different times
    - Different tissues

DNA Microarrays
- How it works
  - DNA fragments are spotted or printed in probes on the array surface. Each probe is a gene
  - Hybridization is performed with a sample put onto the array
  - A scanner measures the intensity in each probe
[Figure: DNA fragments, sample, microarray, image, microarray data]

DNA Microarrays
- Human Genome U133
  - HG U133A, HG U133B
  - approx. 22,000 probes (1 probe per gene)
- Human Genome U133 Plus
  - 44,000 probes (2 probes per gene)
- Exon array
  - 1.4 million probes (16 probes per gene)

DNA microarrays
- Typical analyses & ML techniques
  - Gene-based analysis
    - Differential gene expression analysis: detect which genes have a significant expression variation among samples of two or more conditions (feature selection)
    - Co-expression detection with clustering techniques (unsupervised)
  - Sample-based analysis
    - Class prediction with classification techniques (supervised)
    - Class discovery with clustering techniques (unsupervised)
  - Problems:
    - Huge number of features (thousands of genes) and low number of samples (dozens), a hard setting for machine learning
    - High false positive rate
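
Differential expression between two conditions can be scored gene by gene with a t statistic, one of the evaluation measures mentioned earlier for feature selection. The sketch below uses Welch's variant as an assumption; the gene names and expression values are invented, and no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups of expression values."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Invented data: expression of each gene in three samples per condition.
expression = {
    "gene_A": ([5.0, 5.2, 4.8], [8.0, 8.1, 7.9]),   # clearly shifted
    "gene_B": ([6.0, 6.1, 5.9], [6.1, 5.9, 6.0]),   # same level in both
}

t_scores = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}
# Rank genes by the magnitude of the statistic, largest first.
ranking = sorted(t_scores, key=lambda g: abs(t_scores[g]), reverse=True)
print(ranking)
print(t_scores)
```

With thousands of genes tested at once, some large statistics appear by chance alone, which is one source of the high false positive rate listed under problems; multiple-testing correction is the usual response.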

DNA microarrays
- Functional interpretation after data analysis
  - Typically we have a list of genes of interest (e.g. differentially expressed)
  - Question: what are those genes?
  - Solution: use the available gene annotations (Gene Ontology, pathways, etc.) and see if there is a correlation with a functional module
- On-line tools
  - They answer the question: are my genes significantly chosen from a given gene function? If so, which function?
  - List-based: FatiGO, DAVID, Pathjam
  - Gene-set based: GSEA, FatiScan
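
For a list-based analysis, the question "are my genes significantly chosen from a given function?" is commonly framed as a hypergeometric (one-sided Fisher) test. The sketch below assumes that framing; the slides do not state which test the listed tools apply, and the function name and numbers are invented.

```python
from math import comb

def enrichment_p_value(total_genes, annotated_genes, list_size, overlap):
    """Probability of seeing at least `overlap` annotated genes when
    `list_size` genes are drawn at random, without replacement, from
    `total_genes` genes of which `annotated_genes` carry the annotation."""
    upper = min(list_size, annotated_genes)
    favourable = sum(
        comb(annotated_genes, i) * comb(total_genes - annotated_genes, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, list_size)

# Invented example: 20 genes on the array, 5 annotated with a function,
# and 3 of the 4 genes of interest carry that annotation.
p = enrichment_p_value(total_genes=20, annotated_genes=5, list_size=4, overlap=3)
print(p)
```

A small p-value suggests the list is enriched for that function. Since many functions are tested for the same list, the resulting p-values would normally be corrected for multiple testing.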

Sample applications

geneCBR
- Translational tool for DNA ...
- www.genecbr.org
- Glez-Peña et al. BMC Bioinformatics 10:37 2007
- Classification guided by a clustering algorithm (GCS)

WhichGenes?
- On-line geneset building tool
- Create your own genesets from multiple datasources and use them in your favourite geneset-based analysis tools like GSEA
- www.whichgenes.org
- Glez-Peña et al. Nucleic Acids Res (web server issue) 2009

Questions?
