Data Mining - Brigham Young University


Data Mining

- The extraction of useful information from data
- The automated extraction of hidden predictive information from (large) databases
- Business: huge databases, customer data, mine the data
  – Also medical, genetic, astronomy, etc.
- Data sometimes unlabeled – unsupervised clustering, etc.
- Focuses on learning approaches which scale to massive amounts of data
  – and potentially to a large number of features
  – sometimes requires simpler algorithms with lower big-O complexities

CS 472 - Data Mining

Data Mining Applications

- Often seeks to give businesses a competitive advantage
- Which customers should they target?
  – For advertising – a more focused campaign
  – Customers they most/least want to keep
  – Most favorable business decisions
- Associations
  – Which products should/should not be on the same shelf
  – Which products should be advertised together
  – Which products should be bundled
- Information brokers
  – Make transaction information available to others who are seeking advantages

Data Mining

- Basically, a particular niche of machine learning applications
  – Focused on business and other large-data problems
  – Focused on problems with huge amounts of data which need to be manipulated in order to make effective inferences
  – "Mine" for "gems" of actionable information

Association Analysis – Link Analysis

- Used to discover relationships in large databases
- Relationships represented as association rules
  – Unsupervised learning, any data set
- One example is market basket analysis, which seeks to understand more about what items are bought together
  – This can then lead to improved approaches for advertising, product placement, etc.
  – Example association rule: {Cereal} ⇒ {Milk}

  TID (and who, when, etc.)   Items Bought
  1                           {Ice cream, milk, eggs, cereal}
  2                           {Ice cream}
  3                           {milk, cereal, sugar}
  4                           {eggs, yogurt, sugar}
  5                           {Ice cream, milk, cereal}
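The example rule can be checked numerically. A minimal Python sketch (not course code) that computes the support and confidence of {Cereal} ⇒ {Milk} over the five transactions above:

```python
# Support and confidence for the example rule {Cereal} => {Milk},
# computed over the five transactions in the table above.

transactions = [
    {"ice cream", "milk", "eggs", "cereal"},
    {"ice cream"},
    {"milk", "cereal", "sugar"},
    {"eggs", "yogurt", "sugar"},
    {"ice cream", "milk", "cereal"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

X, Y = {"cereal"}, {"milk"}
sup = support(X | Y, transactions)               # transactions 1, 3, 5 -> 3/5
conf = support(X | Y, transactions) / support(X, transactions)
print(sup, conf)                                 # 0.6 1.0
```

Every transaction containing cereal also contains milk, so the confidence is 1.0 even though the support is only 0.6.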

Data Warehouses

- Companies have large data warehouses of transactions
  – Records of sales at a store
  – On-line shopping
  – Credit card usage
  – Phone calls made and received
  – Visits and navigation of web sites, etc.
- Many/most things are recorded these days, and there is potential information that can be mined to gain business improvements
  – For better customer service/support and/or profits

Data Mining Popularity

- Recent data mining explosion based on:
- Data available – transactions recorded in data warehouses
  – From these warehouses, specific databases for the goal task can be created
- Algorithms available – machine learning and statistics
  – Including special-purpose data mining software products to make it easier for people to work through the entire data mining cycle
- Computing power available
- Competitiveness of modern business – need an edge

Data Mining Process Model

You will use much of this process in your group project.

1. Identify and define the task (e.g. business problem)
2. Gather and prepare the data
   – Build a database for the task
   – Select/transform/derive features
   – Analyze and clean the data, remove outliers, etc.
3. Build and evaluate the model(s) – using training and test data
4. Deploy the model(s) and evaluate business-related results
   – Data visualization tools
5. Iterate through this process to gain continual improvements – both initially and during the life of the task
   – Improve/adjust features and/or machine learning approach

Data Mining Process Model – Cycle

[Figure: the process cycle – monitor, evaluate, and update the deployment]

Data Science and Big Data

- Interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data
  – Machine learning
  – Statistics/math
  – CS/databases/algorithms
  – Visualization
  – Parallel processing
  – Etc.
- Increasing demand in industry!
- Data Science departments and tracks
- New DS emphasis in BYU CS began Fall 2019

Group Projects

- Review timing and expectations
  – Progress report
  – Time purposely available between the Decision Tree and Instance-Based projects to keep going on the group project
- Gathering, cleaning, and transforming the data can be the most critical part of the project, so get that going early!!
- Then plenty of time to try some different ML models and some iterations on your features/ML approaches to get improvements
  – Final report and presentation
- Questions?

Association Analysis – Link Analysis

- Used to discover relationships in large databases
- Relationships represented as association rules
  – Unsupervised learning, any data set
- One example is market basket analysis, which seeks to understand more about what items are bought together
  – This can then lead to improved approaches for advertising, product placement, etc.
  – Example association rule: {Cereal} ⇒ {Milk}

  TID (and who, when, etc.)   Items Bought
  1                           {Ice cream, milk, eggs, cereal}
  2                           {Ice cream}
  3                           {milk, cereal, sugar}
  4                           {eggs, yogurt, sugar}
  5                           {Ice cream, milk, cereal}

Association Discovery

- Association rules are not causal; they show correlations
- A k-itemset is a subset of the possible items – {Milk, Eggs} is a 2-itemset
- Which itemsets does transaction 3 contain?
- Association analysis/discovery seeks to find frequent itemsets

  TID   Items Bought
  1     {Ice cream, milk, eggs, cereal}
  2     {Ice cream}
  3     {milk, cereal, sugar}
  4     {eggs, yogurt, sugar}
  5     {Ice cream, milk, cereal}

Association Rule Quality

  support(X) = |{t ∈ T : X ⊆ t}| / |T|

  support(X ⇒ Y) = |{t ∈ T : (X ∪ Y) ⊆ t}| / |T|

  confidence(X ⇒ Y) = |{t ∈ T : (X ∪ Y) ⊆ t}| / |{t ∈ T : X ⊆ t}|

  lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y)

  TID   Items Bought
  1     {Ice cream, milk, eggs, cereal}
  2     {Ice cream}
  3     {milk, cereal, sugar}
  4     {eggs, yogurt, sugar}
  5     {Ice cream, milk, cereal}

- t ∈ T, the set of all transactions, and X and Y are itemsets
- Rule quality is measured by support and confidence
- Without sufficient support (frequency), a rule will probably overfit, and is also of little interest, since it is rare
  – Note support(X ⇒ Y) = support(Y ⇒ X) = support(X ∪ Y)
  – Note that support(X ∪ Y) is the support for itemsets where both X and Y occur
- Confidence measures the reliability of the inference (to what extent does X imply Y)
  – confidence(X ⇒ Y) ≠ confidence(Y ⇒ X)
  – Support and confidence range between 0 and 1
- Lift is high when X ⇒ Y has high confidence and the consequent Y is less common; thus lift suggests the ability of X to infer a less common value with good probability
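The three measures translate directly into code. A minimal Python sketch (an illustration, not the course software) of support, confidence, and lift over the same five-transaction database:

```python
# The three rule-quality measures, computed over the five-transaction DB.

transactions = [
    {"ice cream", "milk", "eggs", "cereal"},
    {"ice cream"},
    {"milk", "cereal", "sugar"},
    {"eggs", "yogurt", "sugar"},
    {"ice cream", "milk", "cereal"},
]

def support(itemset, db):
    # |{t in T : itemset subset of t}| / |T|
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    # support(X u Y) / support(X)
    return support(X | Y, db) / support(X, db)

def lift(X, Y, db):
    # confidence(X => Y) / support(Y)
    return confidence(X, Y, db) / support(Y, db)

X, Y = {"cereal"}, {"milk"}
print(confidence(X, Y, transactions))       # 1.0: cereal always co-occurs with milk
print(round(lift(X, Y, transactions), 3))   # 1.667: lift > 1, positive correlation
```

Note the asymmetry the bullets point out: confidence({cereal} ⇒ {milk}) = 1.0, but confidence({milk} ⇒ {cereal}) would be computed with milk in the denominator and so differs in general.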

Association Rule Discovery Defined

- User supplies two thresholds
  – minsup (minimum required support level for a rule)
  – minconf (minimum required confidence level for a rule)
- Association rule discovery: given a set of transactions T, find all rules having support ≥ minsup and confidence ≥ minconf
- How do you find the rules?
- Could simply try every possible rule and just keep those that pass
  – The number of candidate rules is exponential in the number of items
- Standard approach – Apriori
  – 1st find frequent itemsets (frequent itemset generation)
  – Then return rules within those frequent itemsets that have sufficient confidence (rule generation)
- Both steps have an exponential number of combinations to consider
  – The number of itemsets is exponential in the number of items m (power set: 2^m)
  – The number of rules per n-itemset is exponential in n (2^n − 2 nonempty antecedent/consequent splits)

Apriori Algorithm

- The support for the rule X ⇒ Y is the same as the support of the itemset X ∪ Y
  – Assume X = {milk, eggs} and Y = {cereal}. Let C = X ∪ Y
  – All the possible rule combinations of itemset C have the same support (the # of possible rules is exponential in the width of the itemset: 2^|C| − 2)
    - {milk, eggs} ⇒ {cereal}
    - {milk} ⇒ {cereal, eggs}
    - {eggs} ⇒ {milk, cereal}
    - {milk, cereal} ⇒ {eggs}
    - {cereal, eggs} ⇒ {milk}
    - {cereal} ⇒ {milk, eggs}
- Do they have the same confidence?
- So rather than finding common rules, we can first just find all itemsets with support ≥ minsup
  – These are called frequent itemsets
  – After that we can find which rules within the frequent itemsets have sufficient confidence to be kept
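The six rules above are exactly the nonempty splits of C into an antecedent X and consequent C − X. A short Python sketch that enumerates them:

```python
# Enumerate every candidate rule X => C - X from the itemset
# C = {milk, eggs, cereal}: nonempty antecedent, nonempty consequent.
from itertools import combinations

C = {"milk", "eggs", "cereal"}

rules = []
for r in range(1, len(C)):                 # antecedent sizes 1 .. |C|-1
    for X in combinations(sorted(C), r):
        X = set(X)
        rules.append((X, C - X))           # the rule X => C - X

print(len(rules))   # 6 = 2^3 - 2 nonempty splits
```

All six share the support of C, but their confidences differ because each has a different antecedent in the denominator.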

Support-Based Pruning

- Apriori principle: if an itemset is frequent, then all subsets of that itemset will be frequent
  – Note that subset refers to the items in the itemset
- If an itemset is not frequent, then any superset of that itemset will also not be frequent
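The contrapositive form is what makes pruning cheap: a candidate k-itemset can be discarded without counting it if any of its (k−1)-subsets failed minsup. A small sketch of that test (the frequent 2-itemsets here are hypothetical, chosen only for illustration):

```python
# Support-based pruning test: a candidate k-itemset is prunable if any
# of its (k-1)-item subsets is not in the frequent set.
from itertools import combinations

def has_infrequent_subset(candidate, frequent):
    """True if some (k-1)-subset of candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(s) not in frequent
               for s in combinations(candidate, k - 1))

# Hypothetical 2-itemsets that survived minsup:
frequent2 = {frozenset(p) for p in [("a", "c"), ("a", "d"), ("c", "d"), ("c", "e")]}

print(has_infrequent_subset(("a", "c", "d"), frequent2))  # False: keep and count it
print(has_infrequent_subset(("a", "c", "e"), frequent2))  # True: (a,e) infrequent, prune
```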

- Example transaction DB with 5 items and 10 transactions
- Minsup = 30%: at least 3 transactions must contain the itemset
- For each itemset at the current level of the tree (depth k), go through each of the n transactions and update the tree's itemset counts accordingly
- All 1-itemsets are kept, since all have support ≥ 30%

- Generate level 2 of the tree (all possible 2-itemsets)
- Normally use lexical ordering in itemsets to generate/count candidates more efficiently
  – (a,b), (a,c), (a,d), (a,e), (b,c), (b,d), ..., (d,e)
  – When looping through the n transactions for (a,b), can stop if a is not first in the set, etc.
- The number of tree nodes will grow exponentially if not pruned
- Which ones can we prune, assuming minsup = .3?

- Generate level 3 of the tree (all 3-itemsets with frequent parents)
- Before calculating the counts, check whether any of these newly generated 3-itemsets contains an infrequent 2-itemset. If so, we can prune it before we count, since it must be infrequent
  – A k-itemset contains k subsets of size k−1
  – Its parent in the tree is only one of those subsets
  – Are there any candidates we can delete?
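This generate-then-prune step can be sketched in Python. The frequent 2-itemsets below are hypothetical (they do not match the figure, which is not reproduced here); the join pairs 2-itemsets sharing their first item, and the prune drops any candidate containing an infrequent pair:

```python
# Level-3 candidate generation: join frequent 2-itemsets that share their
# first (lexically smallest) item, then prune candidates containing an
# infrequent 2-itemset. Hypothetical frequent 2-itemsets for illustration.
from itertools import combinations

frequent2 = [("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("c", "d"), ("c", "e")]
freq2 = {frozenset(p) for p in frequent2}

# Join step: (x, y) + (x, z) with y < z -> candidate (x, y, z)
candidates3 = []
for i, p in enumerate(frequent2):
    for q in frequent2[i + 1:]:
        if p[0] == q[0]:
            candidates3.append((p[0], p[1], q[1]))

# Prune step: every 2-subset of a surviving candidate must be frequent
kept = [c for c in candidates3
        if all(frozenset(s) in freq2 for s in combinations(c, 2))]

print(candidates3)  # [('a','c','d'), ('a','c','e'), ('a','d','e'), ('c','d','e')]
print(kept)         # [('a','c','d'), ('a','c','e')]: (d,e) is infrequent here
```

Only the pruned survivors need to be counted against the transactions, which is where the savings come from.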

- The frequent itemsets are: {a,c}, {a,c,d}, {a,c,e}, {a,d}, {a,d,e}, {a,e}, {b,c}, {c,d}, {c,e}, {d,e}

Rule Generation

- The frequent itemsets were: {a,c}, {a,c,d}, {a,c,e}, {a,d}, {a,d,e}, {a,e}, {b,c}, {c,d}, {c,e}, {d,e}
- For each frequent itemset, generate the possible rules and keep those with confidence ≥ minconf
- The first itemset {a,c} gives the possible rules
  – {a} ⇒ {c} with confidence 4/7, and
  – {c} ⇒ {a} with confidence 4/7
- The second itemset {a,c,d} leads to six possible rules
- Just as with frequent itemset generation, we can use pruning and smart lexical ordering to make rule generation more efficient
  – Project? – Search pruning tricks (312) vs ML
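Rule generation only needs the support counts already gathered during itemset generation. A sketch for the itemset {a,c}, using hypothetical counts consistent with the 4/7 confidences quoted above (count({a,c}) = 4, count({a}) = count({c}) = 7):

```python
# Generate rules from one frequent itemset, keeping those with
# confidence >= minconf. Counts are hypothetical, chosen to match the
# 4/7 confidences on the slide.
from itertools import combinations

counts = {frozenset("a"): 7, frozenset("c"): 7, frozenset("ac"): 4}

def rules_from(itemset, counts, minconf):
    itemset = frozenset(itemset)
    kept = []
    for r in range(1, len(itemset)):                 # antecedent sizes
        for X in combinations(sorted(itemset), r):
            X = frozenset(X)
            conf = counts[itemset] / counts[X]       # count(X u Y) / count(X)
            if conf >= minconf:
                kept.append((set(X), set(itemset - X), conf))
    return kept

print(rules_from("ac", counts, minconf=0.8))       # []: 4/7 < .8, both rules dropped
print(len(rules_from("ac", counts, minconf=0.5)))  # 2: both rules survive
```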

Illustrative Training Set

- What if we had real-valued data?
- What are the steps for this example?

Risk Assessment for Loan Applications
[Table: columns Client #, Credit History, Debt Level, Collateral, Income Level, Risk Rate; the individual client rows were lost in transcription – risk values included High, Moderate, and Low]

Running Apriori (I)

- Choose minsup = .4 and minconf = .8
- 1-itemsets (level 1):
  – (CH = Bad, .29), (CH = Unknown, .36), (CH = Good, .36)
  – (DL = Low, .5), (DL = High, .5)
  – (C = None, .79), (C = Adequate, .21)
  – (IL = Low, .29), (IL = Medium, .29), (IL = High, .43)
  – (RL = High, .43), (RL = Moderate, .21), (RL = Low, .36)

Running Apriori (II)

- Frequent 1-itemsets = {(DL = Low, .5); (DL = High, .5); (C = None, .79); (IL = High, .43); (RL = High, .43)}
- Frequent 2-itemsets = {(DL = High ∧ C = None, .43)}
- Frequent 3-itemsets = {}
- Two possible rules:
  – DL = High ⇒ C = None
  – C = None ⇒ DL = High
- Confidences:
  – Conf(DL = High ⇒ C = None) = .86 → Retain
  – Conf(C = None ⇒ DL = High) = .54 → Ignore
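The level-wise procedure traced in these two slides can be sketched end-to-end. Since the loan table itself did not survive transcription, this toy implementation (an illustration, not the course code) runs the frequent-itemset phase on the ice-cream database from the earlier slides, with minsup = .4:

```python
# Minimal level-wise Apriori sketch (frequent-itemset phase only),
# run on the ice-cream DB from the earlier slides with minsup = .4.
from itertools import combinations

transactions = [
    {"ice cream", "milk", "eggs", "cereal"},
    {"ice cream"},
    {"milk", "cereal", "sugar"},
    {"eggs", "yogurt", "sugar"},
    {"ice cream", "milk", "cereal"},
]

def apriori(db, minsup):
    n = len(db)
    items = sorted(set().union(*db))
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        # Count each candidate at this level; keep those meeting minsup
        kept = [c for c in level if sum(c <= t for t in db) / n >= minsup]
        frequent += kept
        keptset = set(kept)
        # Next level: one-item-larger unions of kept itemsets whose
        # (k-1)-subsets are all frequent (the apriori principle)
        level = sorted({a | b for a in kept for b in kept
                        if len(a | b) == len(a) + 1
                        and all(frozenset(s) in keptset
                                for s in combinations(a | b, len(a)))},
                       key=sorted)
    return frequent

freq = apriori(transactions, minsup=0.4)
print(len(freq))                                            # 9 frequent itemsets
print(frozenset({"ice cream", "milk", "cereal"}) in freq)   # True
```

The loop stops on its own once a level produces no surviving candidates, mirroring the empty 3-itemset level in the loan example above.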

Summary

- Association analysis is useful in many real-world tasks
  – Not a classification approach, but a way to understand relationships in data and use this knowledge to advantage
- Also standard classification and other approaches
- Data mining continues to grow as a field
  – Data and feature issues
    - Gathering, selection and transformation, preparation, cleaning, storing
  – Data visualization and understanding
  – Outlier detection and handling
  – Time series prediction
  – Web mining
  – etc.

Data Warehouse

- Companies have large data warehouses of transactions
  – Records of sales at a store
  – On-line shopping
  – Credit card usage
  – Phone calls made and received
  – Visits and navigation of web sites, etc.
- Many/most things are recorded these days, and there is potential information that can be mined to gain business improvements
  – For better customer service/support and/or profits
- Data Warehouse (DWH)
  – Separate from the operational data (OLTP – online transaction processing)
  – Data comes from heterogeneous company sources
  – Contains static records of data which can be used and manipulated for analysis and business purposes
  – Old data is rarely modified, and new data is continually added
  – OLAP (online analytical processing) – front end to the DWH allowing basic database-style queries
- Useful for data analysis, data gathering, and creating the task database

The Big Picture: DBs, DWH, OLAP & DM

[Figure: operational data sources are refreshed into the data warehouse (data storage), which serves an OLAP engine and front-end tools for analysis, queries, reports, and creating the database for data mining]
