
A DATA MINING-BASED APPROACH FOR INVESTIGATING THE RELATIONSHIP BETWEEN DNA REPAIR GENES AND AGEING

Thesis submitted in accordance with the requirements of the University of Liverpool for the degree of Master in Philosophy

by

Alex A. Freitas

January 2011

ABSTRACT

There is a clear motivation for ageing research, since ageing is the greatest risk factor for many diseases, including most types of cancer. Arguably, another strong motivation for ageing research is that, despite the considerable progress in this area in the last two decades, ageing is still to a large extent a poorly understood process, especially in humans.

The vast majority of biogerontology research is still based on "wet lab" experiments done with simpler organisms, due to the problems associated with performing ageing-related experiments with humans. In contrast, this thesis proposes a data mining approach, based on classification algorithms, for analysing data about human DNA repair genes and their relationship to ageing. The classification algorithms (more precisely, decision tree induction and Naive Bayes algorithms) were applied to datasets prepared specifically for this research, by adapting and integrating data from several bioinformatics resources, namely: (a) the GenAge database of ageing-related genes; (b) a web site with a comprehensive list of human DNA repair genes; (c) Uniprot, a centralized repository of richly-annotated data about proteins; (d) the HPRD (Human Protein Reference Database); and (e) the Gene Ontology, a controlled vocabulary for describing gene or protein functions. Some experiments also used a separate dataset including gene expression data. The classification algorithms were applied to these datasets in order to produce classification models that identify which gene properties are most effective in discriminating ageing-related DNA repair genes from other types of genes. These other genes were mainly non-ageing-related DNA repair genes, but in some experiments they also included genes whose protein products interact with DNA repair genes. A related goal of this research was to analyse the automatically-built classification models from two perspectives, namely: (a) measuring the predictive accuracy (or "generalization ability") of those models from a data mining perspective; and (b) interpreting the meaning of the main gene properties relevant for classification in those models, in the light of biological knowledge about DNA repair genes and the process of ageing.

In summary, the main gene properties that were found effective in discriminating ageing-related DNA repair genes from other types of genes (mainly non-ageing-related DNA repair genes) in the datasets created in this research are as follows: the protein products of ageing-related DNA repair genes tend to interact with a considerably larger number of proteins; their protein products are much more likely to interact with WRN (a protein whose defect causes Werner's progeroid syndrome) and XRCC5 (KU80, a key protein in the initiation of DNA double-strand repair by the error-prone non-homologous end joining DNA repair pathway); they are more likely to be involved in response to chemical stimulus and, to a lesser extent, in response to endogenous stimulus or oxidative stress; and they are more likely to have high expression in T lymphocytes.
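As a rough illustration of the kind of classification task the abstract describes, the following minimal sketch trains a Naive Bayes classifier on binary gene attributes. It is not the implementation used in the thesis: the data rows are invented toy values, and the attribute names (`interacts_with_WRN`, `go_response_to_chemical_stimulus`) and function names are hypothetical, chosen only to echo the properties mentioned above.

```python
import math
from collections import Counter


def train_naive_bayes(rows, labels):
    """Estimate class priors and per-class P(attribute = 1) with Laplace smoothing."""
    class_counts = Counter(labels)
    n_attrs = len(rows[0])
    # Count, per class, how often each binary attribute takes the value 1.
    ones = {c: [0] * n_attrs for c in class_counts}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            ones[label][j] += value
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    # Laplace smoothing: add 1 to the count of ones and 2 to the class total.
    likelihoods = {
        c: [(ones[c][j] + 1) / (class_counts[c] + 2) for j in range(n_attrs)]
        for c in class_counts
    }
    return priors, likelihoods


def predict(priors, likelihoods, row):
    """Return the class with the highest log-posterior for one instance."""
    best_class, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for p, value in zip(likelihoods[c], row):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data (NOT from the thesis). Columns:
# [interacts_with_WRN, go_response_to_chemical_stimulus]
rows = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
labels = ["ageing", "ageing", "ageing", "non-ageing", "non-ageing", "non-ageing"]

priors, likelihoods = train_naive_bayes(rows, labels)
print(predict(priors, likelihoods, [1, 1]))
print(predict(priors, likelihoods, [0, 0]))
```

Naive Bayes treats the attributes as conditionally independent given the class, so each attribute contributes one log-probability term to the score. A decision tree, the other method named in the abstract, would instead produce explicit rules over the same attributes.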

CONTENTS

ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
DECLARATION

CHAPTER 1 – INTRODUCTION
1.1 WHAT IS AGEING?
1.1.1 Defining ageing
1.1.2 Ageing at the cellular and tissue levels
1.1.3 The motivation for ageing research
1.2 THEORIES OF AGEING
1.2.1 Evolutionary theories of ageing
1.2.2 DNA damage theory of ageing
1.3 PROGEROID SYNDROMES
1.3.1 An overview of progeroid syndromes
1.3.1.1 Werner syndrome (WS)
1.3.1.2 Hutchinson-Gilford progeroid syndrome (HGPS)
1.3.1.3 Trichothiodystrophy (TTD)
1.3.1.4 Cockayne syndrome (CS)
1.3.1.5 Ataxia telangiectasia (AT)
1.3.1.6 Rothmund-Thomson (RT) syndrome
1.3.1.7 Xeroderma pigmentosum (XP)
1.3.2 On the relevance of progeroid syndromes to the study of human ageing
1.4 DNA DAMAGE
1.4.1 Two major sources of DNA damage
1.4.1.1 Oxidative damage
1.4.1.2 Damage induced by ultraviolet (UV) radiation
1.4.2 An overview of major types of DNA damage
1.4.2.1 Depurination and depyrimidination
1.4.2.2 Deamination
1.4.2.3 Abasic (AP) sites
1.4.2.4 DNA strand breaks
1.4.2.5 Cyclobutane pyrimidine dimers (CPDs)
1.5 DNA REPAIR
1.5.1 Base excision repair (BER)
1.5.2 Nucleotide excision repair (NER)
1.5.3 Repair of double-strand breaks
1.5.3.1 Homologous recombination (HR)
1.5.3.2 Non-homologous end joining (NHEJ)
1.5.4 Mismatch repair
1.6 OBJECTIVES

CHAPTER 2 – BIOINFORMATICS AND DATA MINING
2.1 BIOLOGICAL DATABASES
2.1.1 GenAge
2.1.2 Other ageing-related databases
2.1.3 Uniprot
2.1.4 HPRD (Human Protein Reference Database)
2.2 GENE ONTOLOGY (GO)
2.2.1 The motivation for the gene ontology
2.2.2 The basic structure of the gene ontology
2.3 ANALYSING AGEING-RELATED GENE OR PROTEIN NETWORKS
2.3.1 Types of interactions and reference organisms in ageing-related networks
2.3.2 Analysing ageing-related gene or protein networks
2.4 CONCEPTS AND PRINCIPLES OF DATA MINING
2.4.1 Basic concepts of data mining
2.4.2 The classification task of data mining
2.4.2.1 Overfitting and underfitting
2.4.2.2 Classification versus clustering
2.5 CLASSIFICATION METHODS USED IN THIS RESEARCH
2.5.1 Decision tree induction
2.5.2 Naive Bayes
2.6 RELATED WORK ON PREDICTING PROTEIN FUNCTION WITH CLASSIFICATION METHODS

CHAPTER 3 – DATASET CREATION AND EXPERIMENTAL SET UP
3.1 CREATING DATASETS WITH TWO CLASSES AND MULTIPLE ATTRIBUTE TYPES
3.1.1 Creating two classes: ageing-related vs. non-ageing-related DNA repair
3.1.2 Creating the predictor attribute type of DNA repair
3.1.3 Creating a predictor attribute measuring the rate of evolutionary change (Ka/Ki ratio)
3.1.4 Creating a set of predictor attributes representing GO terms
3.1.5 Creating a set of attributes representing protein-protein interaction information
3.1.6 Removing duplicate data instances
3.1.7 Dataset specifications
3.2 CREATING A DATASET WITH TWO CLASSES AND GENE EXPRESSION ATTRIBUTES
3.3 CREATING DATASETS WITH FOUR CLASSES AND MULTIPLE ATTRIBUTE TYPES
3.3.1 Creating the four classes to be predicted
3.3.2 Creating the predictor attributes
3.3.3 Dataset specifications
3.4 MEASURING PREDICTIVE ACCURACY
3.5 STATISTICAL SIGNIFICANCE

CHAPTER 4 – COMPUTATIONAL RESULTS AND DISCUSSION
4.1 RESULTS AND DISCUSSION FOR DATASETS WITH TWO CLASSES AND MULTIPLE ATTRIBUTE TYPES
4.1.1 Results for the J4.8 decision tree induction algorithm
4.1.2 Results for the CART decision tree induction algorithm
4.1.3 Results for the Naive Bayes algorithm
4.1.4 Discussion on predictive patterns extracted from the decision trees
4.1.4.1 Discussion on attributes chosen as root nodes in the decision trees
4.1.4.2 Issues on selecting and interpreting rules extracted from decision trees
4.1.4.3 Discussion on selected rules extracted from decision trees
4.2 RESULTS AND DISCUSSION FOR DATASETS WITH TWO CLASSES AND GENE EXPRESSION ATTRIBUTES
4.2.1 Predictive accuracies for J4.8, CART and Naive Bayes algorithms
4.2.2 Interpreting a rule extracted from the decision tree built by J4.8
4.2.3 Integrating results for gene expression and other types of predictor attributes
4.3 RESULTS AND DISCUSSION FOR DATASETS WITH FOUR CLASSES AND MULTIPLE ATTRIBUTE TYPES

