Data Mining - University Of Texas At San Antonio


Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 6 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Implementation: Real machine learning schemes
- Decision trees: From ID3 to C4.5 (pruning, numeric attributes, ...)
- Classification rules: From PRISM to RIPPER and PART (pruning, numeric data, ...)
- Association rules: Frequent-pattern trees
- Extending linear models: Support vector machines and neural networks
- Instance-based learning: Pruning examples, generalized exemplars, distance functions

Implementation: Real machine learning schemes
- Numeric prediction: Regression/model trees, locally weighted regression
- Bayesian networks: Learning and prediction, fast data structures for learning
- Clustering: Hierarchical, incremental, probabilistic, Bayesian
- Semisupervised learning: Clustering for classification, co-training
- Multi-instance learning: Converting to single-instance, upgrading learning algorithms, dedicated multi-instance methods

Industrial-strength algorithms
For an algorithm to be useful in a wide range of real-world applications it must:
- Permit numeric attributes
- Allow missing values
- Be robust in the presence of noise
- Be able to approximate arbitrary concept descriptions (at least in principle)
Basic schemes need to be extended to fulfill these requirements

Decision trees
Extending ID3:
- to permit numeric attributes: straightforward
- to deal sensibly with missing values: trickier
- stability for noisy data: requires a pruning mechanism
End result: C4.5 (Quinlan)
- Best-known and (probably) most widely-used learning algorithm
- Commercial successor: C5.0

Numeric attributes
Standard method: binary splits
- E.g. temp < 45
Unlike nominal attributes, every attribute has many possible split points
Solution is a straightforward extension:
- Evaluate info gain (or other measure) for every possible split point of the attribute
- Choose the "best" split point
- Info gain for the best split point is the info gain for the attribute
Computationally more demanding

Weather data (nominal attributes)
Excerpt of the nominal weather data (Outlook, Temperature, Humidity, Windy, Play):
  Rainy  Mild  High    False  Yes
  Rainy  Cool  Normal  False  Yes
  Rainy  Cool  Normal  True   No
Rules:
  If outlook = sunny and humidity = high then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity = normal then play = yes
  If none of the above then play = yes

Weather data (numeric attributes)
Excerpt of the numeric weather data (Outlook, Temperature, Humidity, Windy, Play):
  Sunny     85  85  False  No
  Sunny     80  90  True   No
  Overcast  83  86  False  Yes
  Rainy     70  96  False  Yes
  Rainy     68  80  False  Yes
  Rainy     65  70  True   No
Rules with nominal tests:
  If outlook = sunny and humidity = high then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity = normal then play = yes
  If none of the above then play = yes
Rules with numeric thresholds:
  If outlook = sunny and humidity > 83 then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity < 85 then play = yes
  If none of the above then play = yes

Example
Split on the temperature attribute:
  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
E.g. temperature < 71.5: yes/4, no/2
     temperature >= 71.5: yes/5, no/3
Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
Place split points halfway between values
Can evaluate all split points in one pass!
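The one-pass evaluation can be illustrated with a short sketch in Python (illustrative code, not C4.5's; the function names and data layout are assumptions). The values are sorted once, running class counts are kept for each side, and a candidate split halfway between every pair of adjacent distinct values is scored by its expected information.

from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def best_split(values, labels):
    """Return (split point, expected information) minimizing info over all candidate splits."""
    data = sorted(zip(values, labels))
    left, right = {}, {}
    for _, y in data:
        right[y] = right.get(y, 0) + 1
    n, best = len(data), (None, float("inf"))
    for i in range(n - 1):
        x, y = data[i]
        left[y] = left.get(y, 0) + 1            # move instance i from the right side to the left
        right[y] -= 1
        if data[i + 1][0] == x:                 # only split between distinct values
            continue
        split = (x + data[i + 1][0]) / 2        # halfway between adjacent values
        info = (i + 1) / n * entropy(left) + (n - i - 1) / n * entropy(right)
        if info < best[1]:
            best = (split, info)
    return best

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes","no","yes","yes","yes","no","no","yes","yes","yes","no","yes","yes","no"]
print(best_split(temps, labels))                # best split point for temperature and its info

# The slide's check for the 71.5 split: info([4,2],[5,3])
print(round(6/14 * entropy({"yes": 4, "no": 2}) + 8/14 * entropy({"yes": 5, "no": 3}), 3))   # 0.939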

Can avoid repeated sorting
Sort instances by the values of the numeric attribute
- Time complexity for sorting: O(n log n)
Does this have to be repeated at each node of the tree?
No! Sort order for children can be derived from sort order for parent
- Time complexity of derivation: O(n)
- Drawback: need to create and store an array of sorted indices for each numeric attribute
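As a rough sketch of the derivation step (the index-array layout is an assumption for illustration, not the book's implementation): one linear pass over the parent's sorted index array distributes the indices to the children of a nominal split, and each child's list is automatically in sorted order.

def child_sort_orders(sorted_indices, branch_of):
    """sorted_indices: indices of the instances at this node, in ascending order of a numeric attribute.
    branch_of: maps an instance index to the branch it follows after the split.
    Returns branch -> list of indices, each still sorted by the numeric attribute (O(n))."""
    children = {}
    for idx in sorted_indices:                  # one pass, no re-sorting
        children.setdefault(branch_of[idx], []).append(idx)
    return children

# Instances 0..5 sorted by temperature; the node splits on outlook.
sorted_by_temp = [3, 0, 5, 1, 4, 2]
outlook = {0: "sunny", 1: "sunny", 2: "rainy", 3: "rainy", 4: "overcast", 5: "sunny"}
print(child_sort_orders(sorted_by_temp, outlook))
# {'rainy': [3, 2], 'sunny': [0, 5, 1], 'overcast': [4]}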

Binary vs. multiway splits
Splitting (multi-way) on a nominal attribute exhausts all information in that attribute
- Nominal attribute is tested (at most) once on any path in the tree
Not so for binary splits on numeric attributes!
- Numeric attribute may be tested several times along a path in the tree
- Disadvantage: tree is hard to read
- Remedy: pre-discretize numeric attributes, or use multi-way splits instead of binary ones

Computing multi-way splits
Simple and efficient way of generating multi-way splits: greedy algorithm
Dynamic programming can find the optimum multi-way split in O(n^2) time
- imp(k, i, j) is the impurity of the best split of values x_i ... x_j into k sub-intervals
- imp(k, 1, i) = min_{0 < j < i} [ imp(k-1, 1, j) + imp(1, j+1, i) ]
- imp(k, 1, N) gives us the best k-way split
In practice, the greedy algorithm works as well
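The dynamic program can be sketched as follows; this is an illustrative implementation of the recurrence above, with weighted entropy standing in for the unspecified impurity measure and memoized recursion instead of an explicit table.

from math import log2
from functools import lru_cache

def interval_impurity(labels):
    """Weighted entropy (in bits) of one sub-interval: |interval| * H(class distribution)."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * log2(p)
    return n * ent

def best_multiway_split(labels, k):
    """labels: class labels ordered by the (sorted) attribute value.
    Returns (minimum total impurity, cut positions), i.e. imp(k, 1, N) plus the split itself."""
    n = len(labels)

    @lru_cache(maxsize=None)
    def imp(parts, end):                        # best impurity of labels[0:end] using `parts` intervals
        if parts == 1:
            return interval_impurity(labels[:end]), ()
        best = (float("inf"), ())
        for cut in range(parts - 1, end):       # last interval is labels[cut:end]
            left_imp, left_cuts = imp(parts - 1, cut)
            total = left_imp + interval_impurity(labels[cut:end])
            if total < best[0]:
                best = (total, left_cuts + (cut,))
        return best

    return imp(k, n)

# Example: the temperature labels from the earlier slide, split into 3 sub-intervals.
temps = ["yes","no","yes","yes","yes","no","no","yes","yes","yes","no","yes","yes","no"]
print(best_multiway_split(temps, 3))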

Missing values
Split instances with missing values into pieces
- A piece going down a branch receives a weight proportional to the popularity of the branch
- Weights sum to 1
Info gain works with fractional instances
- Use sums of weights instead of counts
During classification, split the instance into pieces in the same way
- Merge probability distributions using weights
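A hypothetical sketch of the classification-time behaviour (the node dictionary layout is invented for the example, not C4.5's data structure): an instance with a missing value for the tested attribute is sent down every branch with a weight proportional to the training weight of that branch, and the leaf class distributions are merged with those weights.

def classify(node, instance, weight=1.0):
    """Return a dict class -> probability mass contributed by this (possibly fractional) instance."""
    if "leaf_dist" in node:                          # leaf: class -> probability
        return {c: weight * p for c, p in node["leaf_dist"].items()}
    value = instance.get(node["attribute"])          # None means the value is missing
    merged = {}
    if value is None:
        total = sum(node["branch_weights"].values())
        for branch_value, child in node["children"].items():
            w = weight * node["branch_weights"][branch_value] / total
            for c, p in classify(child, instance, w).items():
                merged[c] = merged.get(c, 0.0) + p
    else:
        merged = classify(node["children"][value], instance, weight)
    return merged

# Toy tree: test 'outlook'; 9 of 14 training instances went down 'sunny', 5 down 'rainy'.
tree = {"attribute": "outlook",
        "branch_weights": {"sunny": 9, "rainy": 5},
        "children": {"sunny": {"leaf_dist": {"yes": 0.7, "no": 0.3}},
                     "rainy": {"leaf_dist": {"yes": 0.4, "no": 0.6}}}}
print(classify(tree, {"humidity": "high"}))          # 'outlook' missing: weighted merge of both leaves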

Pruning
Prevent overfitting to noise in the data: "prune" the decision tree
Two strategies:
- Postpruning: take a fully-grown decision tree and discard unreliable parts
- Prepruning: stop growing a branch when information becomes unreliable
Postpruning preferred in practice; prepruning can "stop early"

Prepruning
Based on a statistical significance test
- Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
Most popular test: chi-squared test
ID3 used the chi-squared test in addition to information gain
- Only statistically significant attributes were allowed to be selected by the information gain procedure

Early stopping
     a  b  class
  1  0  0  0
  2  0  1  1
  3  1  0  1
  4  1  1  0
Pre-pruning may stop the growth process prematurely: early stopping
Classic example: XOR/Parity problem
- No individual attribute exhibits any significant association with the class
- Structure is only visible in the fully expanded tree
- Prepruning won't expand the root node
But: XOR-type problems rare in practice
And: prepruning faster than postpruning

Postpruning
First, build the full tree; then, prune it
- Fully-grown tree shows all attribute interactions
- Problem: some subtrees might be due to chance effects
Two pruning operations:
- Subtree replacement
- Subtree raising
Possible strategies:
- error estimation
- significance testing
- MDL principle

Subtree replacement
Bottom-up
Consider replacing a tree only after considering all its subtrees

Subtree raising
Delete node
Redistribute instances
Slower than subtree replacement (Worthwhile?)

Estimating error rates
Prune only if it does not increase the estimated error
Error on the training data is NOT a useful estimator (would result in almost no pruning)
Use a hold-out set for pruning ("reduced-error pruning")
C4.5's method
- Derive confidence interval from training data
- Use a heuristic limit, derived from this, for pruning
- Standard Bernoulli-process-based method
- Shaky statistical assumptions (based on training data)

C4.5's method
Error estimate for a subtree is the weighted sum of the error estimates for all its leaves
Error estimate for a node:
  e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)
If c = 25% then z = 0.69 (from the normal distribution)
f is the error on the training data
N is the number of instances covered by the leaf

Example
Leaf error estimates: f = 0.33, e = 0.47; f = 0.5, e = 0.72; f = 0.33, e = 0.47
Combined using ratios 6:2:6 gives 0.51
Parent node: f = 5/14, e = 0.46 < 0.51, so prune!
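The sketch below applies the error-estimate formula from the previous slide to this example (illustrative code, not C4.5's; the helper name pessimistic_error is an assumption).

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit for an observed error rate f on N instances (z = 0.69 for c = 25%)."""
    return (f + z*z/(2*N) + z*sqrt(f/N - f*f/N + z*z/(4*N*N))) / (1 + z*z/N)

# Leaf estimates from the example: f = 2/6, 1/2, 2/6 on 6, 2 and 6 instances
leaves = [(2/6, 6), (1/2, 2), (2/6, 6)]
estimates = [pessimistic_error(f, n) for f, n in leaves]
print([round(e, 2) for e in estimates])                     # approximately [0.47, 0.72, 0.47]

# Combination weighted by coverage 6:2:6, versus the parent node treated as a single leaf (f = 5/14)
combined = sum(e * n for e, (f, n) in zip(estimates, leaves)) / 14
print(round(combined, 2), round(pessimistic_error(5/14, 14), 2))
# The combined leaf estimate (about 0.51) exceeds the parent's estimate, so the subtree is pruned.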

Complexity of tree induction
Assume: m attributes, n training instances, tree depth O(log n)
- Building a tree: O(m n log n)
- Subtree replacement: O(n)
- Subtree raising: O(n (log n)^2)
  - Every instance may have to be redistributed at every node between its leaf and the root
  - Cost for redistribution (on average): O(log n)
Total cost: O(m n log n) + O(n (log n)^2)

From trees to rules
Simple way: one rule for each leaf
C4.5rules: greedily prune conditions from each rule if this reduces its estimated error
- Can produce duplicate rules
- Check for this at the end
Then
- look at each class in turn
- consider the rules for that class
- find a "good" subset (guided by MDL)
Then rank the subsets to avoid conflicts
Finally, remove rules (greedily) if this decreases error on the training data
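The "one rule per leaf" step can be shown with a short sketch; the tree layout and function name are assumptions made for the example, and C4.5rules' condition pruning and MDL-guided subset selection are not shown.

def tree_to_rules(node, conditions=()):
    """Yield (list of conditions, predicted class) pairs, one per leaf of the tree."""
    if "leaf_class" in node:
        yield list(conditions), node["leaf_class"]
        return
    for value, child in node["children"].items():
        test = f"{node['attribute']} = {value}"
        yield from tree_to_rules(child, conditions + (test,))

tree = {"attribute": "outlook",
        "children": {"sunny": {"attribute": "humidity",
                               "children": {"high": {"leaf_class": "no"},
                                            "normal": {"leaf_class": "yes"}}},
                     "overcast": {"leaf_class": "yes"},
                     "rainy": {"leaf_class": "yes"}}}
for conds, cls in tree_to_rules(tree):
    print("IF", " AND ".join(conds), "THEN play =", cls)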

C4.5: choices and options
C4.5rules is slow for large and noisy datasets
Commercial version C5.0rules uses a different technique
- Much faster and a bit more accurate
C4.5 has two parameters
- Confidence value (default 25%): lower values incur heavier pruning
- Minimum number of instances in the two most popular branches (default 2)

Cost-complexity pruning
C4.5's postpruning often does not prune enough
- Tree size continues to grow when more instances are added, even if performance on independent data does not improve
- Very fast and popular in practice
Can be worthwhile in some cases to strive for a more compact tree
- At the expense of more computational effort
Cost-complexity pruning method from the CART (Classification and Regression Trees) learning system

Cost-complexity pruning
Basic idea:
- First prune subtrees that, relative to their size, lead to the smallest increase in error on the training data
- Increase in error (α): average error increase per leaf of the subtree
- Pruning generates a sequence of successively smaller trees
  - Each candidate tree in the sequence corresponds to one particular threshold value α_i
Which tree to choose as the final model?
- Use either a hold-out set or cross-validation to estimate the error of each
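A minimal sketch of the α computation, assuming a simple node layout that stores the training class counts at every node (this is not CART's code; training error is used, as on the slide). Pruning would repeatedly collapse the subtree with the smallest α, record one candidate tree per threshold, and finally pick a tree by hold-out or cross-validated error.

def leaves_and_errors(node):
    """Return (number of leaves, training errors of the subtree, training errors if the node were a leaf)."""
    counts = node["class_counts"]                    # class -> count of training instances at this node
    as_leaf_errors = sum(counts.values()) - max(counts.values())
    if "children" not in node:
        return 1, as_leaf_errors, as_leaf_errors
    n_leaves, sub_errors = 0, 0
    for child in node["children"].values():
        l, e, _ = leaves_and_errors(child)
        n_leaves += l
        sub_errors += e
    return n_leaves, sub_errors, as_leaf_errors

def alpha(node):
    """Increase in training error from collapsing the subtree, per leaf removed by the collapse."""
    n_leaves, sub_errors, as_leaf_errors = leaves_and_errors(node)
    return (as_leaf_errors - sub_errors) / (n_leaves - 1)

node = {"class_counts": {"yes": 9, "no": 5},
        "children": {"left":  {"class_counts": {"yes": 6, "no": 1}},
                     "right": {"class_counts": {"yes": 3, "no": 4}}}}
print(alpha(node))                                   # (5 - (1 + 3)) / (2 - 1) = 1.0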

Discussion
TDIDT: Top-Down Induction of Decision Trees
- The most extensively studied method of machine learning used in data mining
Different criteria for attribute/test selection rarely make a large difference
Different pruning methods mainly change the size of the resulting pruned tree
C4.5 builds univariate decision trees
Some TDIDT systems can build multivariate trees (e.g. CART)

Classification rules
Common procedure: separate-and-conquer
Differences:
- Search method (e.g. greedy, beam search, ...)
- Test selection criteria (e.g. accuracy, ...)
- Pruning method (e.g. MDL, hold-out set, ...)
- Stopping criterion (e.g. minimum accuracy)
- Post-processing step
Also: decision list vs. one rule set for each class

Test selection criteria
Basic covering algorithm:
- keep adding conditions to a rule to improve its accuracy
- add the condition that improves accuracy the most
Measure 1: p/t
- t: total instances covered by the rule; p: number of these that are positive
- Produces rules that don't cover negative instances as quickly as possible
- May produce rules with very small coverage: special cases or noise?
Measure 2: Information gain p (log(p/t) - log(P/T))
- P and T are the positive and total numbers before the new condition was added
- Information gain emphasizes positive rather than negative instances
These interact with the pruning mechanism used
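Both measures are easy to compute from the coverage counts. The sketch below uses illustrative counts to show that the information-gain measure favours a high-coverage condition that the accuracy measure ranks lower.

from math import log2

def accuracy_measure(p, t):
    """Measure 1: fraction of the instances covered by the rule that are positive."""
    return p / t

def info_gain_measure(p, t, P, T):
    """Measure 2: p * (log(p/t) - log(P/T)); P, T are the counts before the condition was added."""
    return p * (log2(p / t) - log2(P / T))

# A rule that so far covers P = 100 positives out of T = 200 instances; compare two candidate conditions:
# one that keeps only 2 covered instances (both positive), one that keeps 80 (60 positive).
print(accuracy_measure(2, 2),   round(info_gain_measure(2, 2, 100, 200), 1))     # 1.0 and 2.0
print(accuracy_measure(60, 80), round(info_gain_measure(60, 80, 100, 200), 1))   # 0.75 and about 35.1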

Missing values, numeric attributes
Common treatment of missing values: for any test, they fail
The algorithm must either
- use other tests to separate out positive instances
- leave them uncovered until later in the process
In some cases it's better to treat "missing" as a separate value
Numeric attributes are treated just like they are in decision trees

Pruning rules
Two main strategies:
- Incremental pruning
- Global pruning
Other difference: pruning criterion
- Error on hold-out set (reduced-error pruning)
- Statistical significance
- MDL principle
Also: post-pruning vs. pre-pruning

Using a pruning set
For statistical validity, must evaluate the measure on data not used for training:
- This requires a growing set and a pruning set
Reduced-error pruning: build the full rule set and then prune it
Incremental reduced-error pruning: simplify each rule as soon as it is built
- Can re-split the data after a rule has been pruned
Stratification advantageous

Incremental reduced-error pruning

Initialize E to the instance set
Until E is empty do
  Split E into Grow and Prune in the ratio 2:1
  For each class C for which Grow contains an instance
    Use basic covering algorithm to create best perfect rule for C
    Calculate w(R):  worth of rule on Prune
          and w(R-): worth of rule with final condition omitted
    If w(R-) >= w(R), prune rule and repeat previous step
  From the rules for the different classes, select the one
    that's worth most (i.e. with largest w(R))
  Print the rule
  Remove the instances covered by rule from E
Continue

Measures used in IREP
[p + (N - n)] / T
- N is the total number of negatives
- Counterintuitive: p = 2000 and n = 1000 vs. p = 1000 and n = 1
Success rate p / t
- Problem: p = 1 and t = 1 vs. p = 1000 and t = 1001
(p - n) / t
- Same effect as success rate because it equals 2p/t - 1
Seems hard to find a simple measure of a rule's worth that corresponds with intuition
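The sketch below evaluates the three measures on the pairs quoted above; the totals P and N are assumptions chosen only to make the ratios concrete (with fixed P and N the first measure orders rules by p - n).

def irep_measure(p, n, P, N):
    """[p + (N - n)] / T: the rule's accuracy when viewed as a classifier over all T = P + N instances."""
    return (p + (N - n)) / (P + N)

def success_rate(p, n):
    return p / (p + n)

def p_minus_n_over_t(p, n):
    return (p - n) / (p + n)

P, N = 3000, 2000                                    # assumed totals of positives and negatives
for p, n in [(2000, 1000), (1000, 1)]:
    print(p, n, round(irep_measure(p, n, P, N), 4))  # ranks the broad, noisier rule higher

for p, t in [(1, 1), (1000, 1001)]:
    n = t - p
    print(p, t, success_rate(p, n), round(p_minus_n_over_t(p, n), 3))   # ranks the one-instance rule higher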

Variations
Generating rules for classes in order
- Start with the smallest class
- Leave the largest class covered by the default rule
Stopping criterion
- Stop rule production if accuracy becomes too low
Rule learner RIPPER:
- Uses an MDL-based stopping criterion
- Employs a post-processing step to modify rules, guided by the MDL criterion

Using global optimization
RIPPER: Repeated Incremental Pruning to Produce Error Reduction (does global optimization in an efficient way)
Classes are processed in order of increasing size
Initial rule set for each class is generated using IREP
An MDL-based stopping condition is used
- DL: bits needed to send the examples with respect to the set of rules, bits needed to send k tests, and bits for k
Once a rule set has been produced for each class, each rule is reconsidered and two variants are produced
- One is an extended version, one is grown from scratch
- Chooses among the three candidates according to DL
Final clean-up step greedily deletes rules to minimize DL

PART
Avoids the global optimization step used in C4.5rules and RIPPER
Generates an unrestricted decision list using the basic separate-and-conquer procedure
Builds a partial decision tree to obtain a rule
- A rule is only pruned if all its implications are known
- Prevents hasty generalization
Uses C4.5's procedures to build a tree

Building a partial tree

Expand-subset (S):
  Choose test T and use it to split the set of examples into subsets
  Sort subsets into increasing order of average entropy
  while there is a subset X that has not yet been expanded
        AND all subsets expanded so far are leaves
    expand-subset(X)
  if all the subsets expanded are leaves
        AND estimated error for subtree >= estimated error for node
    undo expansion into subsets and make node a leaf

Example

Notes on PART
Make the leaf with maximum coverage into a rule
Treat missing values just as C4.5 does
- i.e. split the instance into pieces
Time taken to generate a rule:
- Worst case: same as for building a pruned tree
  - Occurs when the data is noisy
- Best case: same as for building a single rule
  - Occurs when the data is noise free

Rules with exceptions
1. Given: a way of generating a single good rule
2. Then it's easy to generate rules with exceptions
3. Select the default class for the top-level rule
4. Generate a good rule for one of the remaining classes
5. Apply this method recursively to the two subsets produced by the rule
   (i.e. instances that are covered / not covered)

Iris data example
Exceptions are represented as dotted paths, alternatives as solid ones.

Association rules
Apriori algorithm finds frequent item sets via a generate-and-test methodology
- Successively longer item sets are formed from shorter ones
- Each different size of candidate item set requires a full scan of the data
- Combinatorial nature of the generation process is costly, particularly if there are many item sets, or item sets are large
Appropriate data structures can help
FP-growth employs an extended prefix tree (FP-tree)
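A compact, illustrative version of the generate-and-test loop (not Weka's Apriori implementation; the data and minimum support are made up) makes the cost visible: each candidate size triggers another full pass over the transactions.

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= min_support}       # frequent 1-item sets
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join step: candidates of size k from frequent (k-1)-item sets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Test step: one full scan of the data per candidate size
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}], 3))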

FP-growth
FP-growth uses a Frequent Pattern Tree (FP-tree) to store a compressed version of the data
Only two passes are required to map the data into an FP-tree
The tree is then processed recursively to "grow" large item sets directly
- Avoids generating and testing candidate item sets against the entire database

Building a frequent pattern tree
1) First pass over the data: count the number of times individual items occur
2) Second pass over the data: before inserting each instance into the FP-tree, sort its items in descending order of their frequency of occurrence, as found in step 1
- Individual items that do not meet the minimum support are not inserted into the tree
- Hopefully many instances will share items that occur frequently individually, resulting in a high degree of compression close to the root of the tree
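The two passes can be sketched as follows. This is a simplified illustration of the construction described above, with invented names; a full FP-growth implementation would also maintain a header table and node links for the mining phase.

from collections import Counter

class FPNode:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    counts = Counter(i for t in transactions for i in t)          # pass 1: item frequencies
    root = FPNode()
    for t in transactions:                                        # pass 2: insert sorted transactions
        items = [i for i in t if counts[i] >= min_support]        # drop items below minimum support
        items.sort(key=lambda i: (-counts[i], i))                 # descending frequency, ties by name
        node = root
        for i in items:
            node = node.children.setdefault(i, FPNode(i))
            node.count += 1
    return root

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item} ({node.count})")
        depth += 1
    for child in node.children.values():
        show(child, depth)

# Shared frequent prefixes are merged near the root, which is where the compression comes from.
show(build_fp_tree([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]], 3))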

An example using the weather data
Frequency of individual items (minimum support = 6):
  play = yes
  windy = false
  humidity = normal
  humidity = high
  windy = true
  temperature = ...
