Introduction To Data Mining



Why Data Mining?
Explosive growth of data
– Data collection and data availability
  Automated data collection tools, Internet, smartphones, ...
– Major sources of abundant data
  Business: Web, e-commerce, transactions, stocks, ...
  Science: remote sensing, biotechnology, scientific simulation, ...
  Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!

Decision Support
Typical procedure
– Data -> Knowledge -> Action/Decision -> Goal
Examples
– Netflix collects user ratings of movies (data) -> What types of movies you will like (knowledge) -> Recommend new movies to you (action) -> Users stay with Netflix (goal)
– Gene sequences of cancer patients (data) -> Which genes lead to cancer? (knowledge) -> Appropriate treatment (action) -> Save life (goal)
– Road traffic (data) -> Which road is likely to be congested? (knowledge) -> Suggest better routes to drivers (action) -> Save time and energy (goal)

What Is Data Mining?
Data mining
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, etc.
Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems

Data Mining Process
[Figure: the data mining process as a sequence of stages: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data Mining Procedure
[Figure: layered diagram, listed here from top to bottom; potential to support business decisions increases toward the top]
– Decision Making (End User)
– Data Presentation: Visualization Techniques (Business Analyst)
– Data Mining: Information Discovery (Data Analyst)
– Data Exploration: Statistical Summary, Querying, and Reporting
– Data Preprocessing/Integration, Data Warehouses
– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)

Multi-Dimensional View of Data Mining
Data to be mined
– Transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
Knowledge to be mined
– Association, classification, clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
Techniques utilized
– Data warehouse (OLAP), machine learning, statistics, pattern recognition, optimization, visualization, etc.
Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: On What Kinds of Data?
– Relational database, data warehouse, transactional database
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structure data, graphs, social networks and multi-linked data
– Spatial data and spatiotemporal data
– Multimedia data
– Text data
– WWW data

Data Mining Function: (1) Generalization
Information integration and data warehouse construction
– Data cleaning, transformation, integration, and multidimensional data model
Data cube technology
– Scalable methods for computing (i.e., materializing) multidimensional aggregates
– OLAP (online analytical processing)

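Data Warehousing
Aggregate data from different dimensions
[Figure: a three-dimensional data cube of sales. Product dimension: TV, PC, VCR, sum. Date dimension: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Country dimension: U.S.A., Canada, Mexico, sum. Two example cells are annotated: "Total annual sales of TVs in U.S.A." and "Total sales of all products at all the countries within 1Qtr".]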
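The roll-up idea behind the data cube can be sketched in a few lines of Python. This is only an illustration: the sales figures, the `ALL` marker, and the `build_cube` helper are hypothetical and do not come from the slides, and a real warehouse would compute these aggregates far more selectively than this brute-force version.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # hypothetical marker meaning "summed over this dimension"

# Hypothetical fact table: (product, quarter, country) -> sales amount.
SALES = {
    ("TV", "1Qtr", "U.S.A."): 100,
    ("TV", "2Qtr", "U.S.A."): 120,
    ("TV", "1Qtr", "Canada"): 40,
    ("PC", "1Qtr", "U.S.A."): 200,
    ("PC", "1Qtr", "Mexico"): 50,
    ("VCR", "3Qtr", "Canada"): 30,
}

def build_cube(facts):
    """Materialize every aggregate cell by rolling up each subset of dimensions."""
    cube = defaultdict(int)
    for key, amount in facts.items():
        # Each dimension either keeps its value or is rolled up to ALL.
        for mask in product((False, True), repeat=len(key)):
            cell = tuple(ALL if rolled else value for value, rolled in zip(key, mask))
            cube[cell] += amount
    return dict(cube)

cube = build_cube(SALES)
tv_usa_annual = cube[("TV", ALL, "U.S.A.")]  # total annual sales of TVs in U.S.A.
first_quarter_total = cube[(ALL, "1Qtr", ALL)]  # all products, all countries, 1Qtr
grand_total = cube[(ALL, ALL, ALL)]
```

The two named cells correspond to the two annotations in the figure. Each fact contributes to 2^d cells for d dimensions, which is why scalable materialization methods matter for larger cubes.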

Data Mining Function: (2) Association Analysis
Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in Walmart?
Association rules
– A typical association rule: Diaper -> Beer [0.5%, 75%] (support, confidence)
How to mine such patterns and rules efficiently in large datasets?

Association Rule Mining
Data: A set of transactions, and each transaction consists of a set of items
Association rules: A set of rules that characterize associations between items

Market-Basket transactions
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
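As a rough illustration of the (support, confidence) pair attached to a rule, the sketch below computes both measures for the two discovered rules over the five market-basket transactions. It uses the usual definitions (support: fraction of transactions containing every item in the rule; confidence: fraction of transactions containing the left-hand side that also contain the right-hand side). The function names are illustrative, and this is plain counting, not an efficient mining algorithm.

```python
# The five market-basket transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(lhs, rhs, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    return count(set(lhs) | set(rhs), transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(set(lhs) | set(rhs), transactions) / count(lhs, transactions)

milk_coke = (support({"Milk"}, {"Coke"}, TRANSACTIONS),
             confidence({"Milk"}, {"Coke"}, TRANSACTIONS))
diaper_milk_beer = (support({"Diaper", "Milk"}, {"Beer"}, TRANSACTIONS),
                    confidence({"Diaper", "Milk"}, {"Beer"}, TRANSACTIONS))
```

On this data, {Milk} --> {Coke} has support 3/5 and confidence 3/4, and {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.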

Data Mining Function: (3) Classification
Classification and label prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
– Predict some unknown class labels
Typical methods
– Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, ...
Typical applications
– Identifying spam, predicting treatment outcomes, categorizing articles, ...

Classification
A classifier: f(x) -> y, mapping features to class labels

Training (labeled)
user features                     class label
age   gender   education          Ad?
27    Female   Bachelor           Yes
30    Male     PhD                Yes
55    Male     Bachelor           No

Testing (unlabeled)
age   gender   education          Ad?
60    Female   Bachelor
23    Male     Master
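To make f(x) -> y concrete, here is a minimal sketch that trains on the three labeled users and labels the two unlabeled ones with a 1-nearest-neighbor rule. The slides do not say which classifier is used; nearest neighbor is swapped in here only because it is short, and the distance function (age gap in decades plus one per mismatched categorical feature) is an arbitrary assumption.

```python
# Labeled training examples from the slide: (age, gender, education) -> Ad?
TRAIN = [
    ((27, "Female", "Bachelor"), "Yes"),
    ((30, "Male", "PhD"), "Yes"),
    ((55, "Male", "Bachelor"), "No"),
]

# Unlabeled test examples from the slide.
TEST = [(60, "Female", "Bachelor"), (23, "Male", "Master")]

def distance(a, b):
    """Hypothetical mixed distance: age gap in decades plus categorical mismatches."""
    age_gap = abs(a[0] - b[0]) / 10.0
    mismatches = (a[1] != b[1]) + (a[2] != b[2])
    return age_gap + mismatches

def predict(x, train=TRAIN):
    """f(x) -> y: return the label of the closest training example."""
    _, label = min(train, key=lambda pair: distance(x, pair[0]))
    return label

predictions = [predict(x) for x in TEST]
```

With only three training examples the predictions say little about real users; the point is the shape of the workflow: fit on labeled data, then apply the same function to unlabeled data.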

Data Mining Function: (4) Cluster Analysis
– Unsupervised learning (i.e., class label is unknown)
– Partition data into groups based on object similarity
– Principle: Maximizing intra-class similarity & minimizing interclass similarity
– Methods: Partitional, hierarchical, density-based, mixture model, spectral methods
– Applications: document clustering, user log clustering, target marketing, climate modeling, ...

Clustering
Finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups
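One partitional way to find such groups is k-means, sketched below. The slides do not single out an algorithm, so this is just one possible instance; the six 2-D points and the choice of initial centroids are made up for illustration, and the starting centroids are fixed so the run is deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two visibly separated groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(POINTS, centroids=[POINTS[0], POINTS[3]])
```

Points that end up with the same label are closer to their own centroid than to the other one, which is one concrete reading of "similar to one another and different from the objects in other groups".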

Data Mining Function: (5) Anomaly Detection
Anomalies
– Objects that are considerably dissimilar from the remainder of the data
– Occur relatively infrequently
– When they do occur, their consequences can be quite dramatic, and quite often in a negative sense
Approaches
– Statistics-based, depth-based, model-based, by-product of cluster analysis
Applications
– Credit card fraud, network intrusions, system failures, water leaks, ...
“Mining needle in a haystack. So much hay and so little time”
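A statistics-based approach can be as simple as flagging values that lie many standard deviations from the mean. The sketch below does this with z-scores; the transaction amounts and the threshold of 2.0 are assumptions chosen for illustration, not values from the slides.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold std devs."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one unusually large charge.
AMOUNTS = [20, 22, 19, 21, 23, 20, 500]
flagged = zscore_outliers(AMOUNTS)
```

Because the outlier itself inflates the mean and standard deviation, this simple test can miss anomalies in small samples or when several occur together; measures based on the median are a common, more robust alternative.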

Evaluation of Knowledge
Is all mined knowledge interesting?
– One can mine a tremendous amount of “patterns” and knowledge
– Some may fit only a certain dimension space (time, location, ...)
– Some may not be representative, may be transient, ...
Evaluation of mined knowledge
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
– ...

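Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by the contributing disciplines: Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, Database Technology]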

Challenges in Data Mining
Tremendous amount of data
– Algorithms must be highly scalable to handle volumes such as terabytes of data
High dimensionality of data
– Microarray data may have tens of thousands of dimensions
High complexity of data
– Noisy and unreliable
– Dynamically evolving
– High dimensionality
– Multiple heterogeneous sources
New and sophisticated applications

Applications of Data Mining
– Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
– Collaborative analysis & recommender systems
– Basket data analysis to targeted marketing
– Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
– Social media analysis: mine user opinions and obtain insights from data collected from social networking platforms

Major Issues in Data Mining (1)
Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge from different perspectives
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results

Major Issues in Data Mining (2)
Efficiency and Scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining

Take-away Message
– Data Mining refers to non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data
– Data Mining covers topics including warehousing, association analysis, clustering, classification, anomaly detection, etc. (based on the type of mined knowledge), as well as transaction data mining, stream data mining, sequence data mining, graph data mining, etc. (based on the type of data)
– Data Mining has wide applications in many different fields in business, science, engineering, education, and many more

