CSE5334 Data Mining

Upload by : Julius Prosser

Transcription

CSE 4334/5334 Data Mining, Fall 2014
Lecture 2: Introduction
Department of Computer Science and Engineering, University of Texas at Arlington
Chengkai Li
(Slides courtesy of Jiawei Han and Vipin Kumar)

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation

Mining Large Data Sets - Motivation
- There is often information “hidden” in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
[Figure: “The Data Gap”. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

What is (not) Data Mining?

What is not Data Mining?
- Look up a phone number in a phone directory
- Query a Web search engine for information about “Amazon”

What is Data Mining?
- Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Knowledge Discovery (KDD) Process
- Data mining: core of the knowledge discovery process
[Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

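Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom]
- Graphical User Interface
- Pattern Evaluation
- Data Mining Engine
- Database or Data Warehouse Server
- Data cleaning, integration, and selection
- Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
- Knowledge Base (alongside the pattern evaluation and data mining engine layers)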

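Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by]
- Database Technology
- Statistics
- Machine Learning
- Pattern Recognition
- Algorithm
- Visualization
- Other Disciplines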

Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle data volumes such as terabytes
- High dimensionality of data
  - A micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Data Mining Tasks
- Prediction Methods
  - Use some variables to predict unknown or future values of other variables.
- Description Methods
  - Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks
- Classification
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
- Regression
- Deviation/Anomaly Detection

Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
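The train/test workflow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the two attributes (income in thousands, age) and the choice of a 1-nearest-neighbor model are hypothetical and not taken from the lecture, which does not prescribe a particular classifier.

```python
from math import dist

def predict_1nn(training_set, attributes):
    """Return the class of the training record closest to `attributes`."""
    nearest = min(training_set, key=lambda record: dist(record[0], attributes))
    return nearest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(
        1 for attributes, label in test_set
        if predict_1nn(training_set, attributes) == label
    )
    return correct / len(test_set)

# Hypothetical records: ((income in K, age), class attribute)
records = [
    ((125, 35), "no"), ((100, 40), "no"), ((70, 28), "no"), ((120, 50), "no"),
    ((95, 30), "yes"), ((85, 27), "yes"), ((90, 33), "yes"),
    ((110, 45), "no"), ((88, 29), "yes"),
]

# Divide the data: the training set builds the model, the test set validates it.
training_set, test_set = records[:7], records[7:]
print("test accuracy:", accuracy(training_set, test_set))
```

The split here is fixed for reproducibility; in practice the division is usually randomized so that the test set is representative of unseen records.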

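Classification Example
[Figure: a Training Set table with columns Tid, Refund, Marital Status, Taxable Income and Cheat (the class) is fed to "Learn Classifier" to produce a Model. The Model is then applied to a Test Set table with columns Refund, Marital Status, Taxable Income and Cheat, where the Cheat value is unknown ("?").]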

Classification: Application 1
Direct Marketing
- Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    - Type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2
Fraud Detection
- Goal: Predict fraudulent cases in credit card transactions.
- Approach:
  - Use credit card transactions and the information on the account holder as attributes.
    - When does a customer buy, what does he buy, how often does he pay on time, etc.
  - Label past transactions as fraud or fair transactions. This forms the class attribute.
  - Learn a model for the class of the transactions.
  - Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3
Customer Attrition/Churn
- Goal: To predict whether a customer is likely to be lost to a competitor.
- Approach:
  - Use detailed records of transactions with each of the past and present customers to find attributes.
    - How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  - Label the customers as loyal or disloyal.
  - Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4
Sky Survey Cataloging
- Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
  - 3000 images with 23,040 x 23,040 pixels per image.
- Approach:
  - Segment the image.
  - Measure image attributes (features): 40 of them per object.
  - Model the class based on these features.
  - Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
- Class: Stages of Formation (Early, Intermediate, Late)
- Attributes: Image features, characteristics of light waves received, etc.
- Data Size:
  - 72 million stars, 20 million galaxies
  - Object Catalog: 9 GB
  - Image Database: 150 GB
Courtesy: http://aps.umn.edu

Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity Measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.

Illustrating Clustering
- Euclidean distance based clustering in 3-D space.
  - Intracluster distances are minimized
  - Intercluster distances are maximized
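To make the intracluster/intercluster criterion concrete, the sketch below measures both quantities for a given assignment of 3-D points to clusters. The points and labels are hypothetical, and the function only evaluates a clustering; it is not a clustering algorithm.

```python
from itertools import combinations
from math import dist
from statistics import mean

def mean_intra_inter(points, labels):
    """Return (mean intracluster distance, mean intercluster distance).

    Uses Euclidean distance over every pair of points; a pair counts as
    intracluster when both points carry the same cluster label.
    """
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        (intra if label_p == label_q else inter).append(dist(p, q))
    return mean(intra), mean(inter)

# Hypothetical 3-D points forming two well-separated groups.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = mean_intra_inter(points, labels)
print(f"mean intracluster distance: {intra:.2f}")
print(f"mean intercluster distance: {inter:.2f}")
```

A good clustering of these points yields a small first value and a large second value; swapping a few labels between the two groups moves the two numbers toward each other.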

Clustering: Application 1
Market Segmentation
- Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2
Document Clustering
- Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
- Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
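A similarity measure built from term frequencies might look like the following sketch. Cosine similarity is an assumption chosen for illustration, since the slide does not name a specific measure, and the three example documents are hypothetical.

```python
from collections import Counter
from math import sqrt

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[term] * tf_b[term] for term in shared_terms)
    norm_a = sqrt(sum(count * count for count in tf_a.values()))
    norm_b = sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents: two financial, one sports.
docs = [
    "stocks fell as markets closed lower",
    "markets rallied as stocks rose",
    "the team won the game",
]
vectors = [term_frequencies(doc) for doc in docs]

print(cosine_similarity(vectors[0], vectors[1]))  # financial vs. financial
print(cosine_similarity(vectors[0], vectors[2]))  # financial vs. sports
```

Any clustering procedure that accepts a pairwise similarity could then group the documents; common words such as "as" or "the" would normally be filtered out first so they do not dominate the measure.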

Illustrating Document Clustering
- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: How many words are common in these documents (after some word ...).
[Table: results per article category; only the fragments "Entertainment 354 278" and "Financial" survive.]

Clustering of S&P 500 Stock Data
- Observe stock movements every day.
- Clustering points: Stock-{UP/DOWN}
- Similarity measure: Two points are more similar if the events described by them frequently happen together on the same day.
  - We used association rules to quantify a similarity measure.
[Table: discovered clusters; surviving fragments include Compaq-DOWN, EMC-Corp-DOWN and Oil-UP.]

Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection;
  - Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
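The two rules can be checked against the five transactions listed above. The sketch below uses support and confidence, the standard measures for association rules; the slide itself does not name them, so treat the choice of measures and the function names as assumptions made for illustration.

```python
# The five transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count_containing(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count_containing(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count_containing(antecedent | consequent) / count_containing(antecedent)

for antecedent, consequent in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(
        sorted(antecedent), "-->", sorted(consequent),
        f"support={support(antecedent | consequent):.2f}",
        f"confidence={confidence(antecedent, consequent):.2f}",
    )
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.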

Association Rule Discovery: Application 1
Marketing and Sales Promotion
- Let the rule discovered be {Bagels, ...} --> {Potato Chips}
- Potato Chips as consequent: can be used to determine what should be done to boost its sales.
- Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
- Bagels in antecedent and Potato Chips in consequent: can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Association Rule Discovery: Application 2
Supermarket Shelf Management
- Goal: To identify items that are bought together by sufficiently many customers.
- Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
- A classic rule:
  - If a customer buys diapers and milk, then he is very likely to buy beer.
  - So, don’t be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application 3
Inventory Management
- Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
- Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit card fraud detection
  - Network intrusion detection
- Typical network traffic at university level may reach over 100 million connections per day
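One simple way to flag "significant deviations from normal behavior" is a z-score test, sketched below. The technique, the threshold of 2.0 and the sample amounts are assumptions chosen for illustration; the lecture does not specify a detection method, and real fraud or intrusion detectors are considerably more elaborate.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean of `values`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction amounts with one unusually large value.
daily_amounts = [10, 11, 9, 10, 12, 10, 95]
print(find_anomalies(daily_amounts))
```

Because the mean and standard deviation are themselves pulled toward extreme values, this test becomes less sensitive as the number of outliers grows; more robust statistics such as the median are a common substitute.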

Data Mining Tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation/Anomaly Detection [Predictive]

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data
