CSE5334 Data Mining

Transcription

CSE 4334/5334 Data Mining, Fall 2014
Lecture 2: Introduction
Department of Computer Science and Engineering, University of Texas at Arlington
Chengkai Li
(Slides courtesy of Jiawei Han and Vipin Kumar)

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation

Mining Large Data Sets - Motivation
- There is often information “hidden” in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
[Figure: "The Data Gap". Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

What is (not) Data Mining?
- What is not data mining?
  - Look up a phone number in a phone directory
  - Query a Web search engine for information about “Amazon”
- What is data mining?
  - Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly in the Boston area)
  - Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Knowledge Discovery (KDD) Process
- Data mining: core of the knowledge discovery process
[Figure: KDD pipeline, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation.]

Architecture: Typical Data Mining System
[Figure: layered architecture, from top to bottom:]
- Graphical User Interface
- Pattern Evaluation
- Data Mining Engine (both connected to a Knowledge Base)
- Database or Data Warehouse Server
- Data cleaning, integration, and selection
- Database, Data Warehouse, World-Wide Web, Other Info Repositories

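Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by:]
- Database Technology
- Statistics
- Machine Learning
- Pattern Recognition
- Algorithm
- Visualization
- Other Disciplines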

Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle data volumes such as terabytes
- High dimensionality of data
  - A microarray may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Data Mining Tasks
- Prediction methods
  - Use some variables to predict unknown or future values of other variables.
- Description methods
  - Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks
- Classification
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
- Regression
- Deviation/Anomaly Detection

Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
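The workflow in this definition (split the records, build a model on the training set, measure accuracy on the test set) can be sketched as follows. This is only an illustration: the slide does not name a model, so a 1-nearest-neighbor rule is assumed here, and the records, attribute meanings, and function names are all hypothetical.

```python
import random

def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(training_set, attributes):
    """Assumed model: return the class of the closest training record."""
    def squared_distance(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], attributes))
    return min(training_set, key=squared_distance)[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(training_set, attrs) == label
                  for attrs, label in test_set)
    return correct / len(test_set)

# Hypothetical records: ((income in $1000s, years as customer), class)
records = [
    ((20, 1), "no"), ((25, 2), "no"), ((30, 1), "no"),
    ((22, 3), "no"), ((28, 2), "no"),
    ((90, 8), "yes"), ((95, 9), "yes"), ((85, 7), "yes"),
    ((100, 10), "yes"), ((92, 6), "yes"),
]

training_set, test_set = split_records(records)
print("training records:", len(training_set), "test records:", len(test_set))
print("accuracy on test set:", accuracy(training_set, test_set))
```

The test set is never used when predicting from the training set, which is what makes the measured accuracy an estimate for previously unseen records.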

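Classification Example
[Figure: a training set table with columns Tid, Refund, Marital Status, Taxable Income, and Cheat is used to learn a classifier. The resulting model is applied to a test set with columns Refund, Marital Status, and Taxable Income, whose Cheat value is unknown (?).]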

Classification: Application 1
- Direct Marketing
  - Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
    - Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2
- Fraud Detection
  - Goal: Predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on the account holder as attributes (when a customer buys, what they buy, how often they pay on time, etc.).
    - Label past transactions as fraudulent or fair. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3
- Customer Attrition/Churn
  - Goal: To predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.).
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4
- Sky Survey Cataloging
  - Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from Palomar Observatory).
    - 3000 images with 23,040 x 23,040 pixels per image.
  - Approach:
    - Segment the image.
    - Measure image attributes (features): 40 of them per object.
    - Model the class based on these features.
    - Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
- Class: stages of formation (Early, Intermediate, Late)
- Attributes: image features, characteristics of light waves received, etc.
- Data size:
  - 72 million stars, 20 million galaxies
  - Object Catalog: 9 GB
  - Image Database: 150 GB
[Figure: galaxy images labeled Early, Intermediate, and Late. Courtesy: http://aps.umn.edu]

Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.
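As a rough illustration of this definition, the sketch below groups continuous 2-D points using Euclidean distance. The slide does not name an algorithm; k-means is assumed here as one common choice, and the points, starting centroids, and function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, iterations=10):
    """Assumed algorithm: alternate assignment and centroid update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
for cluster, centroid in zip(clusters, centroids):
    print(centroid, cluster)
```

Points end up closer to their own centroid than to the other one, which matches the "more similar within a cluster, less similar across clusters" requirement.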

Illustrating Clustering
- Euclidean distance based clustering in 3-D space
  - Intracluster distances are minimized
  - Intercluster distances are maximized

Clustering: Application 1
- Market Segmentation
  - Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2
- Document Clustering
  - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  - Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
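The approach above (count terms per document, then build a similarity measure from the term frequencies) might look like the sketch below. The slide does not specify the measure; cosine similarity over raw term counts is an assumption, and the three short documents are made up for illustration.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Assumed measure: cosine of the angle between two term-count vectors."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents that all mention "amazon" in different contexts.
docs = {
    "rainforest": "amazon rainforest river trees rainforest",
    "river": "amazon river basin trees",
    "retail": "amazon online store shipping store",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_nature = cosine_similarity(tf["rainforest"], tf["river"])
sim_mixed = cosine_similarity(tf["rainforest"], tf["retail"])
print("rainforest vs river: ", round(sim_nature, 3))
print("rainforest vs retail:", round(sim_mixed, 3))
```

A clustering step would then group documents whose pairwise similarity is high, so the two nature documents would be expected to land together while the retail one stays apart.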

Illustrating Document Clustering
- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: how many words are common in these documents (after some word ...
[Table: results per article category; the only fully legible row is Entertainment: 354, 278.]

Clustering of S&P 500 Stock Data
- Observe stock movements every day.
- Clustering points: Stock-{UP/DOWN}
- Similarity measure: two points are more similar if the events described by them frequently happen together on the same day.
  - We used association rules to quantify a similarity measure.
[Table: discovered clusters; legible entries include Compaq-DOWN, EMC-Corp-DOWN, and Oil-UP.]

Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection:
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
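To make the two discovered rules concrete, the sketch below evaluates them on the five transactions from the table. Support and confidence are the standard measures for association rules, but the slide itself does not define them, so treating them as the evaluation criteria is an assumption; the function names are illustrative.

```python
# The five transactions from the table (TID 1 to 5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          "support =", round(support(antecedent, consequent), 2),
          "confidence =", round(confidence(antecedent, consequent), 2))
```

On this table, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.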

Association Rule Discovery: Application 1
- Marketing and Sales Promotion
  - Let the rule discovered be {Bagels, ...} --> {Potato Chips}
  - Potato Chips as consequent: can be used to determine what should be done to boost its sales.
  - Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
  - Bagels in antecedent and Potato Chips in consequent: can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Association Rule Discovery: Application 2
- Supermarket Shelf Management
  - Goal: To identify items that are bought together by sufficiently many customers.
  - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So, don’t be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application 3
- Inventory Management
  - Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
  - Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit card fraud detection
  - Network intrusion detection
- Typical network traffic at university level may reach over 100 million connections per day
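One very simple way to read "significant deviation from normal behavior" is as a large distance from the mean, measured in standard deviations. The sketch below takes that view; the z-score rule, the threshold of 2.0, and the transaction amounts are all assumptions for illustration, not a method given on the slide.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily card transaction amounts with one unusual purchase.
amounts = [42, 38, 45, 40, 41, 39, 43, 400, 44, 37]
anomalies = find_anomalies(amounts)
print("flagged amounts:", anomalies)
```

A rule like this is sensitive to the outliers it is trying to find, since they inflate both the mean and the standard deviation, so practical fraud or intrusion detection typically needs more robust models.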

Data Mining Tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation/Anomaly Detection [Predictive]

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data

