Data Mining - Brigham Young University


Data Mining

- The extraction of useful information from data
- The automated extraction of hidden predictive information from (large) databases
- Business: huge databases, customer data, mine the data
  – Also medical, genetic, astronomy, etc.
- Data sometimes unlabeled – unsupervised clustering, etc.
- Focuses on learning approaches which scale to massive amounts of data
  – and potentially to a large number of features
  – sometimes requires simpler algorithms with lower big-O complexities

CS 472 - Data Mining

Data Mining Applications

- Often seeks to give businesses a competitive advantage
- Which customers should they target?
  – For advertising – a more focused campaign
  – Customers they most/least want to keep
  – Most favorable business decisions
- Associations
  – Which products should/should not be on the same shelf
  – Which products should be advertised together
  – Which products should be bundled
- Information brokers
  – Make transaction information available to others who are seeking advantages

Data Mining

- Basically, a particular niche of machine learning applications
  – Focused on business and other large-data problems
  – Focused on problems with huge amounts of data which need to be manipulated in order to make effective inferences
  – "Mine" for "gems" of actionable information

Association Analysis – Link Analysis

- Used to discover relationships in large databases
- Relationships represented as association rules
  – Unsupervised learning, any data set
- One example is market basket analysis, which seeks to understand more about what items are bought together
  – This can then lead to improved approaches for advertising, product placement, etc.
  – Example association rule: {Cereal} ⇒ {Milk}

  TID (and who, when, etc.)   Items Bought
  1                           {Ice cream, milk, eggs, cereal}
  2                           {Ice cream}
  3                           {milk, cereal, sugar}
  4                           {eggs, yogurt, sugar}
  5                           {Ice cream, milk, cereal}
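The example rule can be checked numerically. A minimal Python sketch (not course code) that computes the support and confidence of {Cereal} ⇒ {Milk} over the five transactions above:

```python
# Support and confidence for the example rule {Cereal} => {Milk},
# computed over the five transactions in the table above.

transactions = [
    {"ice cream", "milk", "eggs", "cereal"},
    {"ice cream"},
    {"milk", "cereal", "sugar"},
    {"eggs", "yogurt", "sugar"},
    {"ice cream", "milk", "cereal"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

X, Y = {"cereal"}, {"milk"}
sup = support(X | Y, transactions)               # transactions 1, 3, 5 -> 3/5
conf = support(X | Y, transactions) / support(X, transactions)
print(sup, conf)                                 # 0.6 1.0
```

Every transaction containing cereal also contains milk, so the confidence is 1.0 even though the support is only 0.6.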

Data Warehouses

- Companies have large data warehouses of transactions
  – Records of sales at a store
  – On-line shopping
  – Credit card usage
  – Phone calls made and received
  – Visits and navigation of web sites, etc.
- Many/most things are recorded these days, and there is potential information that can be mined to gain business improvements
  – For better customer service/support and/or profits

Data Mining Popularity

- Recent data mining explosion based on:
- Data available – transactions recorded in data warehouses
  – From these warehouses, specific databases for the goal task can be created
- Algorithms available – machine learning and statistics
  – Including special-purpose data mining software products to make it easier for people to work through the entire data mining cycle
- Computing power available
- Competitiveness of modern business – need an edge

Data Mining Process Model

You will use much of this process in your group project.

1. Identify and define the task (e.g. business problem)
2. Gather and prepare the data
   – Build a database for the task
   – Select/transform/derive features
   – Analyze and clean the data, remove outliers, etc.
3. Build and evaluate the model(s) – using training and test data
4. Deploy the model(s) and evaluate business-related results
   – Data visualization tools
5. Iterate through this process to gain continual improvements – both initially and during the life of the task
   – Improve/adjust features and/or machine learning approach

Data Mining Process Model – Cycle

[Figure: the process cycle – monitor, evaluate, and update the deployment]

Data Science and Big Data

- Interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data
  – Machine learning
  – Statistics/math
  – CS/databases/algorithms
  – Visualization
  – Parallel processing
  – Etc.
- Increasing demand in industry!
- Data Science departments and tracks
- New DS emphasis in BYU CS began Fall 2019

Group Projects

- Review timing and expectations
  – Progress report
  – Time purposely available between the Decision Tree and Instance-Based projects to keep going on the group project
- Gathering, cleaning, and transforming the data can be the most critical part of the project, so get that going early!!
- Then plenty of time to try some different ML models and some iterations on your features/ML approaches to get improvements
  – Final report and presentation
- Questions?

Association Analysis – Link Analysis

- Used to discover relationships in large databases
- Relationships represented as association rules
  – Unsupervised learning, any data set
- One example is market basket analysis, which seeks to understand more about what items are bought together
  – This can then lead to improved approaches for advertising, product placement, etc.
  – Example association rule: {Cereal} ⇒ {Milk}

  TID (and who, when, etc.)   Items Bought
  1                           {Ice cream, milk, eggs, cereal}
  2                           {Ice cream}
  3                           {milk, cereal, sugar}
  4                           {eggs, yogurt, sugar}
  5                           {Ice cream, milk, cereal}

Association Discovery

- Association rules are not causal; they show correlations
- A k-itemset is a subset of the possible items – {Milk, Eggs} is a 2-itemset
- Which itemsets does transaction 3 contain?
- Association analysis/discovery seeks to find frequent itemsets

  TID   Items Bought
  1     {Ice cream, milk, eggs, cereal}
  2     {Ice cream}
  3     {milk, cereal, sugar}
  4     {eggs, yogurt, sugar}
  5     {Ice cream, milk, cereal}

Association Rule Quality

  support(X) = |{t ∈ T : X ⊆ t}| / |T|

  support(X ⇒ Y) = |{t ∈ T : (X ∪ Y) ⊆ t}| / |T|

  confidence(X ⇒ Y) = |{t ∈ T : (X ∪ Y) ⊆ t}| / |{t ∈ T : X ⊆ t}|

  lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y)

  TID   Items Bought
  1     {Ice cream, milk, eggs, cereal}
  2     {Ice cream}
  3     {milk, cereal, sugar}
  4     {eggs, yogurt, sugar}
  5     {Ice cream, milk, cereal}

- t ∈ T, the set of all transactions, and X and Y are itemsets
- Rule quality is measured by support and confidence
- Without sufficient support (frequency), a rule will probably overfit, and is also of little interest, since it is rare
  – Note support(X ⇒ Y) = support(Y ⇒ X) = support(X ∪ Y)
  – Note that support(X ∪ Y) is the support for itemsets where both X and Y occur
- Confidence measures the reliability of the inference (to what extent does X imply Y)
  – confidence(X ⇒ Y) ≠ confidence(Y ⇒ X)
  – Support and confidence range between 0 and 1
- Lift is high when X ⇒ Y has high confidence and the consequent Y is less common; thus lift suggests the ability of X to infer a less common value with good probability
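The three measures translate directly into code. A minimal Python sketch (an illustration, not the course software) of support, confidence, and lift over the same five-transaction database:

```python
# The three rule-quality measures, computed over the five-transaction DB.

transactions = [
    {"ice cream", "milk", "eggs", "cereal"},
    {"ice cream"},
    {"milk", "cereal", "sugar"},
    {"eggs", "yogurt", "sugar"},
    {"ice cream", "milk", "cereal"},
]

def support(itemset, db):
    # |{t in T : itemset subset of t}| / |T|
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    # support(X u Y) / support(X)
    return support(X | Y, db) / support(X, db)

def lift(X, Y, db):
    # confidence(X => Y) / support(Y)
    return confidence(X, Y, db) / support(Y, db)

X, Y = {"cereal"}, {"milk"}
print(confidence(X, Y, transactions))       # 1.0: cereal always co-occurs with milk
print(round(lift(X, Y, transactions), 3))   # 1.667: lift > 1, positive correlation
```

Note the asymmetry the bullets point out: confidence({cereal} ⇒ {milk}) = 1.0, but confidence({milk} ⇒ {cereal}) would be computed with milk in the denominator and so differs in general.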

Association Rule Discovery Defined

- User supplies two thresholds
  – minsup (minimum required support level for a rule)
  – minconf (minimum required confidence level for a rule)
- Association rule discovery: given a set of transactions T, find all rules having support ≥ minsup and confidence ≥ minconf
- How do you find the rules?
- Could simply try every possible rule and just keep those that pass
  – The number of candidate rules is exponential in the number of items
- Standard approach – Apriori
  – 1st find frequent itemsets (frequent itemset generation)
  – Then return rules within those frequent itemsets that have sufficient confidence (rule generation)
- Both steps have an exponential number of combinations to consider
  – The number of itemsets is exponential in the number of items m (power set: 2^m)
  – The number of rules per n-itemset is exponential in n (2^n − 2 nonempty antecedent/consequent splits)

Apriori Algorithm

- The support for the rule X ⇒ Y is the same as the support of the itemset X ∪ Y
  – Assume X = {milk, eggs} and Y = {cereal}. Let C = X ∪ Y
  – All the possible rule combinations of itemset C have the same support (the # of possible rules is exponential in the width of the itemset: 2^|C| − 2)
    - {milk, eggs} ⇒ {cereal}
    - {milk} ⇒ {cereal, eggs}
    - {eggs} ⇒ {milk, cereal}
    - {milk, cereal} ⇒ {eggs}
    - {cereal, eggs} ⇒ {milk}
    - {cereal} ⇒ {milk, eggs}
- Do they have the same confidence?
- So rather than finding common rules, we can first just find all itemsets with support ≥ minsup
  – These are called frequent itemsets
  – After that we can find which rules within the frequent itemsets have sufficient confidence to be kept
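The six rules above are exactly the nonempty splits of C into an antecedent X and consequent C − X. A short Python sketch that enumerates them:

```python
# Enumerate every candidate rule X => C - X from the itemset
# C = {milk, eggs, cereal}: nonempty antecedent, nonempty consequent.
from itertools import combinations

C = {"milk", "eggs", "cereal"}

rules = []
for r in range(1, len(C)):                 # antecedent sizes 1 .. |C|-1
    for X in combinations(sorted(C), r):
        X = set(X)
        rules.append((X, C - X))           # the rule X => C - X

print(len(rules))   # 6 = 2^3 - 2 nonempty splits
```

All six share the support of C, but their confidences differ because each has a different antecedent in the denominator.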

Support-Based Pruning

- Apriori principle: if an itemset is frequent, then all subsets of that itemset will be frequent
  – Note that subset refers to the items in the itemset
- If an itemset is not frequent, then any superset of that itemset will also not be frequent
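The contrapositive form is what makes pruning cheap: a candidate k-itemset can be discarded without counting it if any of its (k−1)-subsets failed minsup. A small sketch of that test (the frequent 2-itemsets here are hypothetical, chosen only for illustration):

```python
# Support-based pruning test: a candidate k-itemset is prunable if any
# of its (k-1)-item subsets is not in the frequent set.
from itertools import combinations

def has_infrequent_subset(candidate, frequent):
    """True if some (k-1)-subset of candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(s) not in frequent
               for s in combinations(candidate, k - 1))

# Hypothetical 2-itemsets that survived minsup:
frequent2 = {frozenset(p) for p in [("a", "c"), ("a", "d"), ("c", "d"), ("c", "e")]}

print(has_infrequent_subset(("a", "c", "d"), frequent2))  # False: keep and count it
print(has_infrequent_subset(("a", "c", "e"), frequent2))  # True: (a,e) infrequent, prune
```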

- Example transaction DB with 5 items and 10 transactions
- Minsup = 30%: at least 3 transactions must contain the itemset
- For each itemset at the current level of the tree (depth k), go through each of the n transactions and update the tree's itemset counts accordingly
- All 1-itemsets are kept, since all have support ≥ 30%

- Generate level 2 of the tree (all possible 2-itemsets)
- Normally use lexical ordering in itemsets to generate/count candidates more efficiently
  – (a,b), (a,c), (a,d), (a,e), (b,c), (b,d), ..., (d,e)
  – When looping through the n transactions for (a,b), can stop if a is not first in the set, etc.
- The number of tree nodes will grow exponentially if not pruned
- Which ones can we prune, assuming minsup = .3?

- Generate level 3 of the tree (all 3-itemsets with frequent parents)
- Before calculating the counts, check whether any of these newly generated 3-itemsets contains an infrequent 2-itemset. If so, we can prune it before we count, since it must be infrequent
  – A k-itemset contains k subsets of size k−1
  – Its parent in the tree is only one of those subsets
  – Are there any candidates we can delete?
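This generate-then-prune step can be sketched in Python. The frequent 2-itemsets below are hypothetical (they do not match the figure, which is not reproduced here); the join pairs 2-itemsets sharing their first item, and the prune drops any candidate containing an infrequent pair:

```python
# Level-3 candidate generation: join frequent 2-itemsets that share their
# first (lexically smallest) item, then prune candidates containing an
# infrequent 2-itemset. Hypothetical frequent 2-itemsets for illustration.
from itertools import combinations

frequent2 = [("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("c", "d"), ("c", "e")]
freq2 = {frozenset(p) for p in frequent2}

# Join step: (x, y) + (x, z) with y < z -> candidate (x, y, z)
candidates3 = []
for i, p in enumerate(frequent2):
    for q in frequent2[i + 1:]:
        if p[0] == q[0]:
            candidates3.append((p[0], p[1], q[1]))

# Prune step: every 2-subset of a surviving candidate must be frequent
kept = [c for c in candidates3
        if all(frozenset(s) in freq2 for s in combinations(c, 2))]

print(candidates3)  # [('a','c','d'), ('a','c','e'), ('a','d','e'), ('c','d','e')]
print(kept)         # [('a','c','d'), ('a','c','e')]: (d,e) is infrequent here
```

Only the pruned survivors need to be counted against the transactions, which is where the savings come from.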

- The frequent itemsets are: {a,c}, {a,c,d}, {a,c,e}, {a,d}, {a,d,e}, {a,e}, {b,c}, {c,d}, {c,e}, {d,e}

Rule Generation

- The frequent itemsets were: {a,c}, {a,c,d}, {a,c,e}, {a,d}, {a,d,e}, {a,e}, {b,c}, {c,d}, {c,e}, {d,e}
- For each frequent itemset, generate the possible rules and keep those with confidence ≥ minconf
- The first itemset {a,c} gives the possible rules
  – {a} ⇒ {c} with confidence 4/7, and
  – {c} ⇒ {a} with confidence 4/7
- The second itemset {a,c,d} leads to six possible rules
- Just as with frequent itemset generation, we can use pruning and smart lexical ordering to make rule generation more efficient
  – Project? – Search pruning tricks (312) vs ML
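Rule generation only needs the support counts already gathered during itemset generation. A sketch for the itemset {a,c}, using hypothetical counts consistent with the 4/7 confidences quoted above (count({a,c}) = 4, count({a}) = count({c}) = 7):

```python
# Generate rules from one frequent itemset, keeping those with
# confidence >= minconf. Counts are hypothetical, chosen to match the
# 4/7 confidences on the slide.
from itertools import combinations

counts = {frozenset("a"): 7, frozenset("c"): 7, frozenset("ac"): 4}

def rules_from(itemset, counts, minconf):
    itemset = frozenset(itemset)
    kept = []
    for r in range(1, len(itemset)):                 # antecedent sizes
        for X in combinations(sorted(itemset), r):
            X = frozenset(X)
            conf = counts[itemset] / counts[X]       # count(X u Y) / count(X)
            if conf >= minconf:
                kept.append((set(X), set(itemset - X), conf))
    return kept

print(rules_from("ac", counts, minconf=0.8))       # []: 4/7 < .8, both rules dropped
print(len(rules_from("ac", counts, minconf=0.5)))  # 2: both rules survive
```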

Illustrative Training Set

- What if we had real-valued data?
- What are the steps for this example?

Risk Assessment for Loan Applications
[Table: columns Client #, Credit History, Debt Level, Collateral, Income Level, Risk Rate; the individual client rows were lost in transcription – risk values included High, Moderate, and Low]

Running Apriori (I)

- Choose minsup = .4 and minconf = .8
- 1-itemsets (level 1):
  – (CH = Bad, .29), (CH = Unknown, .36), (CH = Good, .36)
  – (DL = Low, .5), (DL = High, .5)
  – (C = None, .79), (C = Adequate, .21)
  – (IL = Low, .29), (IL = Medium, .29), (IL = High, .43)
  – (RL = High, .43), (RL = Moderate, .21), (RL = Low, .36)

Running Apriori (II)

- Frequent 1-itemsets = {(DL = Low, .5); (DL = High, .5); (C = None, .79); (IL = High, .43); (RL = High, .43)}
- Frequent 2-itemsets = {(DL = High ∧ C = None, .43)}
- Frequent 3-itemsets = {}
- Two possible rules:
  – DL = High ⇒ C = None
  – C = None ⇒ DL = High
- Confidences:
  – Conf(DL = High ⇒ C = None) = .86 → Retain
  – Conf(C = None ⇒ DL = High) = .54 → Ignore
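The level-wise procedure traced in these two slides can be sketched end-to-end. Since the loan table itself did not survive transcription, this toy implementation (an illustration, not the course code) runs the frequent-itemset phase on the ice-cream database from the earlier slides, with minsup = .4:

```python
# Minimal level-wise Apriori sketch (frequent-itemset phase only),
# run on the ice-cream DB from the earlier slides with minsup = .4.
from itertools import combinations

transactions = [
    {"ice cream", "milk", "eggs", "cereal"},
    {"ice cream"},
    {"milk", "cereal", "sugar"},
    {"eggs", "yogurt", "sugar"},
    {"ice cream", "milk", "cereal"},
]

def apriori(db, minsup):
    n = len(db)
    items = sorted(set().union(*db))
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        # Count each candidate at this level; keep those meeting minsup
        kept = [c for c in level if sum(c <= t for t in db) / n >= minsup]
        frequent += kept
        keptset = set(kept)
        # Next level: one-item-larger unions of kept itemsets whose
        # (k-1)-subsets are all frequent (the apriori principle)
        level = sorted({a | b for a in kept for b in kept
                        if len(a | b) == len(a) + 1
                        and all(frozenset(s) in keptset
                                for s in combinations(a | b, len(a)))},
                       key=sorted)
    return frequent

freq = apriori(transactions, minsup=0.4)
print(len(freq))                                            # 9 frequent itemsets
print(frozenset({"ice cream", "milk", "cereal"}) in freq)   # True
```

The loop stops on its own once a level produces no surviving candidates, mirroring the empty 3-itemset level in the loan example above.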

Summary

- Association analysis is useful in many real-world tasks
  – Not a classification approach, but a way to understand relationships in data and use this knowledge to advantage
- Also standard classification and other approaches
- Data mining continues to grow as a field
  – Data and feature issues
    - Gathering, selection and transformation, preparation, cleaning, storing
  – Data visualization and understanding
  – Outlier detection and handling
  – Time series prediction
  – Web mining
  – etc.

Data Warehouse

- Companies have large data warehouses of transactions
  – Records of sales at a store
  – On-line shopping
  – Credit card usage
  – Phone calls made and received
  – Visits and navigation of web sites, etc.
- Many/most things are recorded these days, and there is potential information that can be mined to gain business improvements
  – For better customer service/support and/or profits
- Data Warehouse (DWH)
  – Separate from the operational data (OLTP – online transaction processing)
  – Data comes from heterogeneous company sources
  – Contains static records of data which can be used and manipulated for analysis and business purposes
  – Old data is rarely modified, and new data is continually added
  – OLAP (online analytical processing) – front end to the DWH allowing basic database-style queries
- Useful for data analysis, data gathering, and creating the task database

The Big Picture: DBs, DWH, OLAP & DM

[Figure: operational data sources are refreshed into the data warehouse (data storage), which serves an OLAP engine and front-end tools for analysis, queries, reports, and creating the database for data mining]
