Salvador García, Julián Luengo, Francisco Herrera: Data Preprocessing in Data Mining


Intelligent Systems Reference Library, Volume 72
Salvador García, Julián Luengo, Francisco Herrera
Data Preprocessing in Data Mining

Intelligent Systems Reference Library
Volume 72

Series editors
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@ibspan.waw.pl
Lakhmi C. Jain, University of Canberra, Canberra, Australia
e-mail: Lakhmi.Jain@unisa.edu.au

About this Series

The aim of this series is to publish a Reference Library, including novel advances and developments in all aspects of Intelligent Systems in an easily accessible and well structured form. The series includes reference works, handbooks, compendia, textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains well integrated knowledge and current information in the field of Intelligent Systems. The series covers the theory, applications, and design methods of Intelligent Systems. Virtually all disciplines such as engineering, computer science, avionics, business, e-commerce, environment, healthcare, physics and life science are included.

More information about this series at http://www.springer.com/series/8578

Salvador García, Julián Luengo, Francisco Herrera

Data Preprocessing in Data Mining

Springer

Francisco Herrera
Department of Computer Science and Artificial Intelligence
University of Granada
Granada, Spain

Salvador García
Department of Computer Science
University of Jaén
Jaén, Spain

Julián Luengo
Department of Civil Engineering
University of Burgos
Burgos, Spain

ISSN 1868-4394        ISSN 1868-4408 (electronic)
ISBN 978-3-319-10246-7        ISBN 978-3-319-10247-4 (eBook)
DOI 10.1007/978-3-319-10247-4
Library of Congress Control Number: 2014946771

Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

This book is dedicated to all people with whom we have worked over the years and who have made it possible to reach this moment. Thanks to the members of the research group "Soft Computing and Intelligent Information Systems".

To our families.

Preface

Data preprocessing is an often neglected but major step in the data mining process. Data collection is usually a loosely controlled process, resulting in out-of-range values, impossible data combinations (e.g., Gender: Male; Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data come first and foremost before running an analysis. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery is more difficult to conduct. Data preparation can take a considerable amount of processing time.

Data preprocessing includes data preparation, composed of the integration, cleaning, normalization and transformation of data, and data reduction tasks, such as feature selection, instance selection, discretization, etc. The result expected after a reliable chaining of data preprocessing tasks is a final dataset, which can be considered correct and useful for further data mining algorithms.

This book covers the set of techniques under the umbrella of data preprocessing, being a comprehensive book devoted completely to this field of Data Mining and including all important details and aspects of the techniques that belong to these families. In recent years, this area has become of great importance because data mining algorithms require meaningful and manageable data to operate correctly and to provide useful knowledge, predictions or descriptions. It is well known that most of the effort made in a knowledge discovery application is dedicated to data preparation and reduction tasks. Both theoreticians and practitioners are constantly searching for data preprocessing techniques that ensure reliable and accurate results while trading off efficiency and time-complexity. Thus, an exhaustive and updated background in the topic can be very effective in areas such as data mining, machine learning, and pattern recognition. This book invites readers to explore the many advantages that data preparation and reduction provide:

• To adapt and particularize the data for each data mining algorithm.
• To reduce the amount of data required for a suitable learning task, also decreasing its time-complexity.
• To increase the effectiveness and accuracy of predictive tasks.
• To make possible the impossible with raw data, allowing data mining algorithms to be applied over high volumes of data.
• To support the understanding of the data.
• To be useful for various tasks, such as classification, regression and unsupervised learning.

The target audience for this book is anyone who wants a better understanding of the current state-of-the-art in a crucial part of knowledge discovery from data: data preprocessing. Practitioners in industry and enterprise should find new insights and possibilities in the breadth of topics covered. Researchers and data scientists and/or analysts in universities, research centers, and government could find a comprehensive review of the topic addressed and new ideas for productive research efforts.

Granada, Spain, June 2014
Salvador García
Julián Luengo
Francisco Herrera

Contents

1 Introduction
   1.1 Data Mining and Knowledge Discovery
   1.2 Data Mining Methods
   1.3 Supervised Learning
   1.4 Unsupervised Learning
       1.4.1 Pattern Mining
       1.4.2 Outlier Detection
   1.5 Other Learning Paradigms
       1.5.1 Imbalanced Learning
       1.5.2 Multi-instance Learning
       1.5.3 Multi-label Classification
       1.5.4 Semi-supervised Learning
       1.5.5 Subgroup Discovery
       1.5.6 Transfer Learning
       1.5.7 Data Stream Learning
   1.6 Introduction to Data Preprocessing
       1.6.1 Data Preparation
       1.6.2 Data Reduction
   References

2 Data Sets and Proper Statistical Analysis of Data Mining Techniques
   2.1 Data Sets and Partitions
       2.1.1 Data Set Partitioning
       2.1.2 Performance Measures
   2.2 Using Statistical Tests to Compare Methods
       2.2.1 Conditions for the Safe Use of Parametric Tests
       2.2.2 Normality Test over the Group of Data Sets and Algorithms
       2.2.3 Non-parametric Tests for Comparing Two Algorithms in Multiple Data Set Analysis
       2.2.4 Non-parametric Tests for Multiple Comparisons Among More than Two Algorithms
   References

3 Data Preparation Basic Models
   3.1 Overview
   3.2 Data Integration
       3.2.1 Finding Redundant Attributes
       3.2.2 Detecting Tuple Duplication and Inconsistency
   3.3 Data Cleaning
   3.4 Data Normalization
       3.4.1 Min-Max Normalization
       3.4.2 Z-score Normalization
       3.4.3 Decimal Scaling Normalization
   3.5 Data Transformation
       3.5.1 Linear Transformations
       3.5.2 Quadratic Transformations
       3.5.3 Non-polynomial Approximations of Transformations
       3.5.4 Polynomial Approximations of Transformations
       3.5.5 Rank Transformations
       3.5.6 Box-Cox Transformations
       3.5.7 Spreading the Histogram
       3.5.8 Nominal to Binary Transformation
       3.5.9 Transformations via Data Reduction
   References

4 Dealing with Missing Values
   4.1 Introduction
   4.2 Assumptions and Missing Data Mechanisms
   4.3 Simple Approaches to Missing Data
   4.4 Maximum Likelihood Imputation Methods
       4.4.1 Expectation-Maximization (EM)
       4.4.2 Multiple Imputation
       4.4.3 Bayesian Principal Component Analysis (BPCA)
   4.5 Imputation of Missing Values: Machine Learning Based Methods
       4.5.1 Imputation with K-Nearest Neighbor (KNNI)
       4.5.2 Weighted Imputation with K-Nearest Neighbour (WKNNI)
       4.5.3 K-means Clustering Imputation (KMI)
       4.5.4 Imputation with Fuzzy K-means Clustering (FKMI)
       4.5.5 Support Vector Machines Imputation (SVMI)
       4.5.6 Event Covering (EC)
       4.5.7 Singular Value Decomposition Imputation (SVDI)
       4.5.8 Local Least Squares Imputation (LLSI)
       4.5.9 Recent Machine Learning Approaches to Missing Values Imputation
   4.6 Experimental Comparative Analysis
       4.6.1 Effect of the Imputation Methods in the Attributes' Relationships
       4.6.2 Best Imputation Methods for Classification Methods
       4.6.3 Interesting Comments
   References

5 Dealing with Noisy Data
   5.1 Identifying Noise
   5.2 Types of Noise Data: Class Noise and Attribute Noise
       5.2.1 Noise Introduction Mechanisms
       5.2.2 Simulating the Noise of Real-World Data Sets
   5.3 Noise Filtering at Data Level
       5.3.1 Ensemble Filter
       5.3.2 Cross-Validated Committees Filter
       5.3.3 Iterative-Partitioning Filter
       5.3.4 More Filtering Methods
   5.4 Robust Learners Against Noise
       5.4.1 Multiple Classifier Systems for Classification Tasks
       5.4.2 Addressing Multi-class Classification Problems by Decomposition
   5.5 Empirical Analysis of Noise Filters and Robust Strategies
       5.5.1 Noise Introduction
       5.5.2 Noise Filters for Class Noise
       5.5.3 Noise Filtering Efficacy Prediction by Data Complexity Measures
       5.5.4 Multiple Classifier Systems with Noise
       5.5.5 Analysis of the OVO Decomposition with Noise
   References

6 Data Reduction
   6.1 Overview
   6.2 The Curse of Dimensionality
       6.2.1 Principal Components Analysis
       6.2.2 Factor Analysis
       6.2.3 Multidimensional Scaling
       6.2.4 Locally Linear Embedding
   6.3 Data Sampling
       6.3.1 Data Condensation
       6.3.2 Data Squashing
       6.3.3 Data Clustering
   6.4 Binning and Reduction of Cardinality
   References

7 Feature Selection
   7.1 Overview
   7.2 Perspectives
       7.2.1 The Search of a Subset of Features
       7.2.2 Selection Criteria
       7.2.3 Filter, Wrapper and Embedded Feature Selection
   7.3 Aspects
       7.3.1 Output of Feature Selection
       7.3.2 Evaluation
       7.3.3 Drawbacks
       7.3.4 Using Decision Trees for Feature Selection
   7.4 Description of the Most Representative Feature Selection Methods
       7.4.1 Exhaustive Methods
       7.4.2 Heuristic Methods
       7.4.3 Nondeterministic Methods
       7.4.4 Feature Weighting Methods
   7.5 Related and Advanced Topics
       7.5.1 Leading and Recent Feature Selection Techniques
       7.5.2 Feature Extraction
       7.5.3 Feature Construction
   7.6 Experimental Comparative Analyses in Feature Selection
   References

8 Instance Selection
   8.1 Introduction
   8.2 Training Set Selection Versus Prototype Selection
   8.3 Prototype Selection Taxonomy
       8.3.1 Common Properties in Prototype Selection Methods
       8.3.2 Prototype Selection Methods
       8.3.3 Taxonomy of Prototype Selection Methods
   8.4 Description of Methods
       8.4.1 Condensation Algorithms
       8.4.2 Edition Algorithms
       8.4.3 Hybrid Algorithms
   8.5 Related and Advanced Topics
       8.5.1 Prototype Generation
       8.5.2 Distance Metrics, Feature Weighting and Combinations with Feature Selection
       8.5.3 Hybridizations with Other Learning Methods and Ensembles
       8.5.4 Scaling-Up Approaches
       8.5.5 Data Complexity
   8.6 Experimental Comparative Analysis in Prototype Selection
       8.6.1 Analysis and Empirical Results on Small Size Data Sets
       8.6.2 Analysis and Empirical Results on Medium Size Data Sets
       8.6.3 Global View of the Obtained Results
       8.6.4 Visualization of Data Subsets: A Case Study Based on the Banana Data Set
   References

9 Discretization
   9.1 Introduction
   9.2 Perspectives and Background
       9.2.1 Discretization Process
       9.2.2 Related and Advanced Work
   9.3 Properties and Taxonomy
       9.3.1 Common Properties
       9.3.2 Methods and Taxonomy
       9.3.3 Description of the Most Representative Discretization Methods
   9.4 Experimental Comparative Analysis
       9.4.1 Experimental Set up
       9.4.2 Analysis and Empirical Results
   References

10 A Data Mining Software Package Including Data Preparation and Reduction: KEEL
   10.1 Data Mining Softwares and Toolboxes
   10.2 KEEL: Knowledge Extraction Based on Evolutionary Learning
       10.2.1 Main Features
       10.2.2 Data Management
       10.2.3 Design of Experiments: Off-Line Module
       10.2.4 Computer-Based Education: On-Line Module
   10.3 KEEL-Dataset
       10.3.1 Data Sets Web Pages
       10.3.2 Experimental Study Web Pages
   10.4 Integration of New Algorithms into the KEEL Tool
       10.4.1 Introduction to the KEEL Codification Features
   10.5 KEEL Statistical Tests
       10.5.1 Case Study
   10.6 Summarizing Comments
   References

Index

Acronyms

ANN   Artificial Neural Network
CV    Cross Validation
DM    Data Mining
DR    Dimensionality Reduction
EM    Expectation-Maximization
FCV   Fold Cross Validation
FS    Feature Selection
IS    Instance Selection
KDD   Knowledge Discovery in Data
KEEL  Knowledge Extraction based on Evolutionary Learning
KNN   K-Nearest Neighbors
LLE   Locally Linear Embedding
LVQ   Learning Vector Quantization
MDS   Multi Dimensional Scaling
MI    Mutual Information
ML    Machine Learning
MLP   Multi-Layer Perceptron
MV    Missing Value
PCA   Principal Components Analysis
RBFN  Radial Basis Function Network
SONN  Self Organizing Neural Network
SVM   Support Vector Machine

Chapter 1
Introduction

Abstract The main background addressed in this book is presented here, regarding Data Mining and Knowledge Discovery. Major concepts used throughout the rest of the book are introduced, such as learning models, strategies and paradigms. Thus, the whole process known as Knowledge Discovery in Data is described in Sect. 1.1. A review of the main models of Data Mining is given in Sect. 1.2, accompanied by a clear differentiation between Supervised and Unsupervised Learning (Sects. 1.3 and 1.4, respectively). In Sect. 1.5, apart from the two classical data mining tasks, we mention other related problems that assume more complexity or hybridizations with respect to the classical learning paradigms. Finally, we establish the relationship between Data Preprocessing and Data Mining in Sect. 1.6.

1.1 Data Mining and Knowledge Discovery

Vast amounts of data are around us in our world, raw data that is mainly intractable for human or manual processing. The analysis of such data is therefore now a necessity. The World Wide Web (WWW), business related services, society, applications and networks for science or engineering, among others, have been continuously generating data in exponential growth since the development of powerful storage and connection tools. This immense data growth does not easily allow useful information or organized knowledge to be understood or extracted automatically. This fact has led to the start of Data Mining (DM), which is currently a well-known discipline increasingly present in the current world of the Information Age.

DM is, roughly speaking, about solving problems by analyzing data present in real databases. Nowadays, it is qualified as a science and technology for exploring data to discover unknown patterns that are already present. Many people treat DM as a synonym of the Knowledge Discovery in Databases (KDD) process, while others view DM as the main step of KDD [16, 24, 32].

There are various definitions of KDD. For instance, [10] defines it as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data", while [11] considers the KDD process as an automatic exploratory data

analysis of large databases. A key aspect that characterizes the KDD process is the way it is divided into stages, according to the agreement of several important researchers in the topic. There are several methods available to make this division, each with advantages and disadvantages [16]. In this book, we adopt a hybridization widely used in recent years that categorizes these stages into six steps:

1. Problem Specification: Designating and arranging the application domain, the relevant prior knowledge obtained by experts and the final objectives pursued by the end-user.
2. Problem Understanding: Including the comprehension of both the selected data to approach and the associated expert knowledge, in order to achieve a high degree of reliability.
3. Data Preprocessing: This stage includes operations for data cleaning (such as handling the removal of noise and inconsistent data), data integration (where multiple data sources may be combined into one), data transformation (where data is transformed and consolidated into forms which are appropriate for specific DM tasks or aggregation operations) and data reduction, including the selection and extraction of both features and examples in a database. This phase is the focus of study throughout the book; a minimal code sketch of it is given at the end of this section.
4. Data Mining: The essential process where methods are used to extract valid data patterns. This step includes the choice of the most suitable DM task (such as classification, regression, clustering or association), the choice of the DM algorithm itself, belonging to one of the previous families, and finally the employment and accommodation of the selected algorithm to the problem, by tuning essential parameters and validation procedures.
5. Evaluation: Estimating and interpreting the mined patterns based on interestingness measures.
6. Result Exploitation: The last stage may involve using the knowledge directly, incorporating the knowledge into another system for further processing, or simply reporting the discovered knowledge through visualization tools.

Figure 1.1 summarizes the KDD process and reveals the six stages mentioned previously. It is worth mentioning that all the stages are interconnected, showing that the KDD process is actually a self-organized scheme where each stage conditions the remaining stages, and the reverse path is also allowed.

[Fig. 1.1: The KDD process]
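To make stage 3 concrete, the following is a minimal sketch of a few chained preprocessing operations on a toy table: duplicate removal (cleaning), mean imputation of a missing value (a deliberately simple choice; Chap. 4 surveys far more elaborate imputation methods) and min-max normalization (Sect. 3.4.1). The example is not taken from the book: the column names and values are hypothetical, and pandas/scikit-learn are assumed only for illustration.

```python
# A minimal, hypothetical sketch of KDD stage 3 (data preprocessing).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data with typical quality problems: an exact duplicate record
# (rows 0 and 1) and a missing value in 'age'. Column names are made up.
df = pd.DataFrame({
    "age":    [25.0, 25.0, 40.0, np.nan, 61.0],
    "income": [1800.0, 1800.0, 3200.0, 2500.0, 4100.0],
})

# Cleaning: remove exact duplicate records.
df = df.drop_duplicates().reset_index(drop=True)

# Cleaning: impute the missing 'age' with the column mean.
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Transformation: min-max normalization of every column to [0, 1]
# (see Sect. 3.4.1 for the formula and alternatives).
df[df.columns] = MinMaxScaler().fit_transform(df)

print(df)
```

In a real KDD study the imputer and scaler would be fitted on the training partition only and then applied to the test partition, so that no information from the evaluation data leaks into preprocessing.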

1.2 Data Mining Methods

A large number of techniques for DM are well known and used in many applications. This section provides a short review of selected techniques considered the most important and frequent in DM. This review only highlights some of the main features of the different techniques and some of the influences related to the data preprocessing procedures presented in the remaining chapters of this book. Our intention is not to provide a complete explanation of how these techniques operate in detail, but to stay focused on the data preprocessing step.

Figure 1.2 shows a division of the main DM methods according to two ways of obtaining knowledge: prediction and description. In the following, we will give a short description of each method, including references for some representative and concrete algorithms and major considerations from the point of view of data preprocessing.

[Fig. 1.2: DM methods]

Within the prediction family of methods, two main groups can be distinguished: statistical methods and symbolic methods [4]. Statistical methods are usually characterized by the representation of knowledge through mathematical models with computations. In contrast, symbolic methods prefer to represent the knowledge by means of symbols and connectives, yielding more interpretable models for humans. The most applied statistical methods are:

• Regression Models: being the oldest DM models, they are used in estimation tasks, requiring the class of equation modelling to be used [24]. Linear, quadratic and logistic regression are the most well known regression models in DM. There are basic requirements that they impose on the data: among them, they only use numerical attributes, they are not designed for dealing with missing values, they try to fit outliers to the models, and they use all the features independently, whether or not the features are useful or dependent on one another.

• Artificial Neural Networks (ANNs): powerful mathematical models suitable for almost all DM tasks, especially predictive ones [7]. There are different formulations of ANNs, the most common being the multi-layer perceptron (MLP), Radial Basis Function Networks (RBFNs) and Learning Vector Quantization (LVQ). ANNs are based on the definition of neurons, which are atomic parts that compute the aggregation of their inputs into an output according to an activation function. They usually outperform all other models because of their complex structure; however, the complexity and suitable configuration of the networks make them less popular than other methods, being considered the typical example of black box models. Similar to regression models, they require numeric attributes and no MVs. However, if they are appropriately configured, they are robust against outliers and noise.
• Bayesian Learning: positioned within probability theory as a framework for making rational decisions under uncertainty, based on Bayes' theorem [6]. The most applied Bayesian method is Naïve Bayes, which assumes that the effect of an attribute value of a given class is independent of the values of the other attributes (a sketch of this computation is given after this list). Initial definitions of these algorithms only work with categorical attributes, due to the fact that the probability computation can only be made in discrete domains. Furthermore, the independence assumption among attributes causes these methods to be very sensitive to the redundancy and usefulness of some of the attributes and examples of the data, together with noisy examples and outliers. They cannot deal with MVs. Besides Naïve Bayes, there are also complex models based on dependency structures, such as Bayesian networks.
• Instance-based Learning: Here, the examples are stored verbatim, and a
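The Naïve Bayes computation described above can be made concrete with a small, hand-rolled sketch that is not taken from the book: each class c is scored as P(c) multiplied by the product of the per-attribute conditionals P(x_i | c), which is exactly the attribute-independence assumption, with Laplace smoothing to avoid zero probabilities. The toy weather-style attributes and labels are hypothetical.

```python
# A hypothetical, hand-rolled Naive Bayes on categorical attributes,
# illustrating the independence assumption: score(c) = P(c) * prod_i P(x_i|c).
from collections import Counter, defaultdict

# (outlook, windy) -> play?  Toy training set, categorical attributes only.
data = [
    (("sunny", "no"),  "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"),  "yes"),
    (("sunny", "no"),  "yes"),
]

priors = Counter(label for _, label in data)   # class counts for P(c)
cond = defaultdict(Counter)                    # counts for P(x_i | c)
for attrs, label in data:
    for i, value in enumerate(attrs):
        cond[(i, label)][value] += 1

def posterior_scores(attrs):
    """Score each class as P(c) * prod_i P(x_i | c), attributes independent."""
    n = len(data)
    scores = {}
    for c, nc in priors.items():
        p = nc / n
        for i, value in enumerate(attrs):
            # Laplace smoothing (+1 in the numerator, +2 for these binary
            # attributes) avoids zeroing the product on unseen values.
            p *= (cond[(i, c)][value] + 1) / (nc + 2)
        scores[c] = p
    return scores

print(posterior_scores(("sunny", "no")))  # 'yes' dominates on this toy set
```

Note how a single unseen attribute value would, without smoothing, drive the whole product to zero; this sensitivity to the usefulness and redundancy of individual attributes is one reason preprocessing matters so much for this family of methods.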

