Data Mining Classification: Basic Concepts And Techniques


Data Mining Classification: Basic Concepts and Techniques
Lecture Notes for Chapter 3
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar

Classification: Definition
- Given a collection of records (the training set), each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label.
  - x: attribute, predictor, independent variable, input
  - y: class, response, dependent variable, output
- Task: learn a model that maps each attribute set x into one of the predefined class labels y.

Examples of Classification Tasks

Task                          Attribute set, x                                   Class label, y
Categorizing email messages   Features extracted from email header and content   spam or non-spam
Identifying tumor cells       Features extracted from x-rays or MRI scans        malignant or benign cells
Cataloging galaxies           Features extracted from telescope images           elliptical, spiral, or irregular-shaped galaxies

General Approach for Building a Classification Model
[Figure not reproduced in the transcription: a training set is fed to a learning algorithm to induce a model (induction), and the learned model is then applied to previously unseen test records (deduction).]
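To make the induction/deduction picture concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (assuming scikit-learn is installed); the tiny feature vectors and labels are made up for illustration and are not the slides' data.

```python
from sklearn.tree import DecisionTreeClassifier

# Induction: learn a model from the training set (x = attribute set, y = class label)
X_train = [[0, 60], [0, 95], [1, 120], [0, 85]]   # illustrative numeric attributes
y_train = ["No", "Yes", "No", "Yes"]              # illustrative class labels

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Deduction: apply the learned model to previously unseen records (the test set)
X_test = [[0, 80], [1, 200]]
print(model.predict(X_test))
```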

Classification Techniques
- Base classifiers
  - Decision tree based methods
  - Rule-based methods
  - Nearest-neighbor
  - Naïve Bayes and Bayesian belief networks
  - Support vector machines
  - Neural networks, deep neural nets
- Ensemble classifiers
  - Boosting, bagging, random forests

Example of a Decision Tree

Training data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: a decision tree with Home Owner as the root splitting attribute.
- Home Owner = Yes: predict Defaulted = No
- Home Owner = No: test Marital Status
  - Married: predict Defaulted = No
  - Single or Divorced: test Annual Income
    - Annual Income < 80K: predict Defaulted = No
    - Annual Income >= 80K: predict Defaulted = Yes
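The example tree can be written out as a handful of explicit rules. The sketch below is one possible encoding; the field names (home_owner, marital_status, annual_income_k) are labels chosen here, not identifiers from the slides.

```python
# The example decision tree written out as explicit rules.
def classify_defaulted(record):
    if record["home_owner"] == "Yes":
        return "No"
    # home_owner == "No": test marital status next
    if record["marital_status"] == "Married":
        return "No"
    # Single or Divorced: test annual income (in thousands)
    if record["annual_income_k"] < 80:
        return "No"
    return "Yes"

# Example: the test record used on the following slides
print(classify_defaulted({"home_owner": "No",
                          "marital_status": "Married",
                          "annual_income_k": 80}))   # -> "No"
```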

Apply Model to Test Data
Test data record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Start from the root of the tree and apply the test conditions in turn:
- Home Owner = No, so follow the "No" branch to the Marital Status test.
- Marital Status = Married, so follow the "Married" branch to a leaf.
- The leaf is labeled NO, so assign Defaulted = "No" to the test record.

Another Example of a Decision Tree
The same training data also admits a different tree; there could be more than one tree that fits the same data!
- Marital Status = Married: predict Defaulted = No
- Marital Status = Single or Divorced: test Home Owner
  - Home Owner = Yes: predict Defaulted = No
  - Home Owner = No: test Annual Income
    - Annual Income < 80K: predict Defaulted = No
    - Annual Income >= 80K: predict Defaulted = Yes

Decision Tree Classification Task
[Figure not reproduced in the transcription: the general framework, with a tree-induction algorithm learning a decision tree from the training set, which is then applied to the test set.]

Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT

General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t.

General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
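A runnable sketch of Hunt's algorithm for categorical attributes is given below. The Gini-based choice of test condition anticipates the impurity measures introduced later in the chapter, and the majority-label fallback when no attributes remain is a common convention rather than something stated on the slide; the tiny dataset at the end is illustrative only.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunts(records, labels, attributes):
    if len(set(labels)) == 1 or not attributes:        # pure node, or nothing left to test
        return Counter(labels).most_common(1)[0][0]    # leaf labeled with the (majority) class
    # choose the attribute whose multi-way split gives the lowest weighted Gini
    def weighted_gini(attr):
        groups = {}
        for rec, lab in zip(records, labels):
            groups.setdefault(rec[attr], []).append(lab)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    attr = min(attributes, key=weighted_gini)
    node = {"test": attr, "children": {}}
    for value in set(rec[attr] for rec in records):
        subset = [(r, l) for r, l in zip(records, labels) if r[attr] == value]
        sub_recs, sub_labs = zip(*subset)
        node["children"][value] = hunts(list(sub_recs), list(sub_labs),
                                        [a for a in attributes if a != attr])
    return node

# Tiny illustrative run (not the slides' data):
recs = [{"owner": "Yes", "status": "Single"}, {"owner": "No", "status": "Married"},
        {"owner": "No", "status": "Single"}]
labs = ["No", "No", "Yes"]
print(hunts(recs, labs, ["owner", "status"]))
```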

Hunt's Algorithm
[Figures not reproduced in the transcription: four slides step through Hunt's algorithm on the loan training data, growing the tree one split at a time (first on Home Owner, then Marital Status, then Annual Income) until every leaf is pure.]

Design Issues of Decision Tree Induction
- How should training records be split?
  - Method for expressing the test condition, depending on attribute types
  - Measure for evaluating the goodness of a test condition
- How should the splitting procedure stop?
  - Stop splitting if all the records belong to the same class or have identical attribute values
  - Early termination

Methods for Expressing Test Conditions
Depends on attribute type:
- Binary
- Nominal
- Ordinal
- Continuous

Test Condition for Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets.

Test Condition for Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets, preserving the order property among attribute values (e.g., grouping {Small, Large} against {Medium} would violate the order property).

Test Condition for Continuous Attributes
[Figure not reproduced in the transcription: examples of a binary split on a threshold and a multi-way split into ranges for a continuous attribute.]

Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute. Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  - Static: discretize once at the beginning
  - Dynamic: repeat at each node
- Binary decision: (A < v) or (A >= v). Consider all possible splits and find the best cut; this can be more compute intensive.
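As a small sketch of static discretization (assuming NumPy is available), the snippet below bins the running example's income values by equal-interval and by equal-frequency bucketing; the choice of four bins is arbitrary.

```python
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # values from the running example

# Equal-interval bucketing into 4 ranges: evenly spaced cut points
edges_width = np.linspace(income.min(), income.max(), num=5)
bins_width = np.digitize(income, edges_width[1:-1])

# Equal-frequency bucketing into 4 ranges: quartile cut points
edges_freq = np.percentile(income, [25, 50, 75])
bins_freq = np.digitize(income, edges_freq)

print(bins_width, bins_freq)
```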

How to Determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?

- Greedy approach: nodes with a purer class distribution are preferred.
- Need a measure of node impurity (for example, a node with a 5/5 class split has a high degree of impurity, while a 9/1 split has a low degree of impurity).

Measures of Node Impurity
Let $p_i(t)$ be the frequency of class $i$ at node $t$, and $c$ the total number of classes.

- Gini index: $Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
- Entropy: $Entropy(t) = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$
- Misclassification error: $Error(t) = 1 - \max_i [p_i(t)]$

Finding the Best Split
1. Compute the impurity measure (P) before splitting.
2. Compute the impurity measure (M) after splitting:
   - compute the impurity measure of each child node;
   - M is the weighted impurity of the child nodes.
3. Choose the attribute test condition that produces the highest gain, Gain = P - M, or equivalently, the lowest impurity measure after splitting (M).
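The three impurity measures translate directly into code. The sketch below computes them from a node's class counts; the (1, 5) node is one of the examples worked on the following slides.

```python
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

# Node with class counts C1 = 1, C2 = 5:
print(gini([1, 5]), entropy([1, 5]), classification_error([1, 5]))
# -> approximately 0.278, 0.650, 0.167
```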

Finding the Best Split
[Figure not reproduced in the transcription: before splitting, the parent node has class counts (N00, N01) and impurity P; splitting on attribute A yields nodes N1 and N2 with weighted impurity M1, splitting on attribute B yields nodes N3 and N4 with weighted impurity M2. Compare Gain = P - M1 vs. P - M2.]

Measure of Impurity: GINI
Gini index for a given node t:
$Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes.
- Maximum of 1 - 1/c when records are equally distributed among all classes, implying the least beneficial situation for classification.
- Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification.
- The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT.

Measure of Impurity: GINI
For a 2-class problem with class frequencies (p, 1 - p):
$Gini = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$

Class counts (C1, C2):  (0, 6)   (1, 5)   (2, 4)   (3, 3)
Gini:                    0.000    0.278    0.444    0.500

Computing the Gini Index of a Single Node
- C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1; Gini = 1 - 0^2 - 1^2 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

Computing the Gini Index for a Collection of Nodes
When a node p is split into k partitions (children):
$Gini_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, Gini(i)$
where $n_i$ is the number of records at child $i$ and $n$ is the number of records at the parent node p.

Binary Attributes: Computing the GINI Index
The split produces two partitions (child nodes). Effect of weighting partitions: larger and purer partitions are sought.

Example (attribute B): the parent has class counts C1 = 7, C2 = 5, so Gini(parent) = 0.486.
- Node N1: C1 = 5, C2 = 1; Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
- Node N2: C1 = 2, C2 = 4; Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444
Weighted Gini of N1 and N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = 0.486 - 0.361 = 0.125
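A short sketch of the weighted Gini computation, reproducing the numbers from the binary-split example above (parent counts 7 and 5, children (5, 1) and (2, 4)):

```python
def gini(counts):
    # repeated here so the snippet stands alone
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent, n1, n2 = [7, 5], [5, 1], [2, 4]
weighted = (sum(n1) * gini(n1) + sum(n2) * gini(n2)) / sum(parent)
gain = gini(parent) - weighted
print(round(gini(parent), 3), round(weighted, 3), round(gain, 3))
# -> 0.486 0.361 0.125
```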

Categorical Attributes: Computing the Gini Index
- For each distinct value, gather the counts for each class in the dataset.
- Use the count matrix to make decisions: either a multi-way split (one partition per value) or a two-way split (find the best binary partition of the values).

CarType example (class counts C1, C2 per value):
- Multi-way split: Family (1, 3), Sports (8, 0), Luxury (1, 7); Gini = 0.163
- Two-way split {Sports, Luxury} vs. {Family}: (9, 7) vs. (1, 3); Gini = 0.468
- Two-way split {Sports} vs. {Family, Luxury}: (8, 0) vs. (2, 10); Gini = 0.167
Which of these is the best?

Continuous Attributes: Computing the Gini Index
- Use binary decisions based on one value v.
- Several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
- Each splitting value has a count matrix associated with it: the class counts in each of the partitions A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient, since work is repeated.

Example count matrix for Annual Income <= 80 vs. > 80:
Defaulted = Yes: 0 | 3
Defaulted = No:  3 | 4
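For a two-way split of a categorical attribute, one way to find the best partition is to enumerate the possible groupings and score each by weighted Gini. The sketch below does this for the CarType counts above; the brute-force enumeration is an illustration of "find the best partition of values", not an algorithm prescribed by the slide.

```python
from itertools import combinations

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# class counts (C1, C2) per attribute value, from the CarType example
counts = {"Family": (1, 3), "Sports": (8, 0), "Luxury": (1, 7)}
values = list(counts)
total = sum(sum(c) for c in counts.values())

best = None
for r in range(1, len(values) // 2 + 1):
    for left in combinations(values, r):
        right = [v for v in values if v not in left]
        def merged(group):
            # combine the per-value class counts of one side of the partition
            return [sum(counts[v][k] for v in group) for k in (0, 1)]
        g = sum(sum(merged(side)) / total * gini(merged(side)) for side in (left, right))
        if best is None or g < best[0]:
            best = (g, set(left), set(right))

print(best)   # expect {"Sports"} vs. {"Family", "Luxury"} with weighted Gini of about 0.167
```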

Continuous Attributes: Computing the Gini Index (Efficient Computation)
For efficient computation, for each attribute:
- Sort the attribute on its values.
- Linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index.

For the running example, the sorted Annual Income values and class labels are
60 (No), 70 (No), 75 (No), 85 (Yes), 90 (Yes), 95 (Yes), 100 (No), 120 (No), 125 (No), 220 (No),
and the candidate split positions are the midpoints between adjacent values. The Gini index ranges from 0.420 at the outermost positions down to a minimum of 0.300 at the split value 97, which is therefore chosen.
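The sort-and-scan procedure can be sketched as follows, using the running example's sorted incomes and labels; candidate cuts are taken at midpoints between adjacent values (the slide also lists cuts beyond both ends of the range, omitted here since they leave the node unsplit).

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

pairs = sorted(zip(income, cheat))                    # sort once
classes = ["Yes", "No"]
left = {c: 0 for c in classes}                        # counts for records with value <= cut
right = {c: cheat.count(c) for c in classes}          # counts for records with value > cut
total = len(pairs)

best_cut, best_gini = None, float("inf")
for i in range(len(pairs) - 1):
    value, label = pairs[i]
    left[label] += 1                                  # move one record across the cut
    right[label] -= 1
    cut = (pairs[i][0] + pairs[i + 1][0]) / 2         # midpoint between adjacent values
    w = (sum(left.values()) * gini(list(left.values()))
         + sum(right.values()) * gini(list(right.values()))) / total
    if w < best_gini:
        best_cut, best_gini = cut, w

print(best_cut, round(best_gini, 3))   # -> 97.5 0.3, matching the slide's best split at 97
```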

Measure of Impurity: Entropy
Entropy at a given node t:
$Entropy(t) = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$
where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes.
- Maximum of $\log_2 c$ when records are equally distributed among all classes, implying the least beneficial situation for classification.
- Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification.
- Entropy-based computations are quite similar to the Gini index computations.

Computing the Entropy of a Single Node
- C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1; Entropy = -0 log 0 - 1 log 1 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

Computing Information Gain After Splitting
Information gain:
$Gain_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$
where the parent node p is split into k partitions (children) and $n_i$ is the number of records in child node $i$.
- Choose the split that achieves the most reduction in entropy (maximizes the gain).
- Used in the ID3 and C4.5 decision tree algorithms.
- Information gain is the mutual information between the class variable and the splitting variable.
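Information gain follows directly from the entropy function. A small sketch, with an illustrative parent of counts (10, 10) split into children (7, 3) and (3, 7); these counts are made up for the example, not taken from the slide.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    return entropy(parent_counts) - weighted

# Illustrative split: parent (10, 10) into children (7, 3) and (3, 7)
print(round(information_gain([10, 10], [[7, 3], [3, 7]]), 3))   # -> 0.119
```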

Problem with a Large Number of Partitions
Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure. For example, Customer ID has the highest information gain because the entropy of all its children is zero, even though it is useless for predicting new records.

Gain Ratio
$Gain\ Ratio = \frac{Gain_{split}}{Split\ Info}, \qquad Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$
where the parent node p is split into k partitions (children) and $n_i$ is the number of records in child node $i$.
- Adjusts the information gain by the entropy of the partitioning (Split Info): a higher-entropy partitioning (a large number of small partitions) is penalized.
- Used in the C4.5 algorithm.
- Designed to overcome the disadvantage of information gain.

Gain Ratio: CarType Example
Using the same CarType count matrices as before:
- Multi-way split (Family, Sports, Luxury): Gini = 0.163, SplitINFO = 1.52
- Two-way split {Sports, Luxury} vs. {Family}: Gini = 0.468, SplitINFO = 0.72
- Two-way split {Sports} vs. {Family, Luxury}: Gini = 0.167, SplitINFO = 0.97

Measure of Impurity: Classification Error
Classification error at a node t:
$Error(t) = 1 - \max_i [p_i(t)]$
- Maximum of 1 - 1/c when records are equally distributed among all classes, implying the least interesting situation.
- Minimum of 0 when all records belong to one class, implying the most interesting situation.
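Split Info and the gain ratio are easy to check numerically. The sketch below reproduces the SplitINFO values for the three CarType partitionings (partition sizes 4/8/8, 16/4, and 8/12 follow from the count matrices above); the 0.5 gain used at the end is a made-up number, only to show how gain_ratio would be applied.

```python
import math

def split_info(sizes):
    # entropy of the partition sizes (number of records per child)
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

for sizes in ([4, 8, 8], [16, 4], [8, 12]):
    print(sizes, round(split_info(sizes), 2))
# -> 1.52, 0.72, 0.97, matching the SplitINFO values on the slide

# Hypothetical gain of 0.5 for the multi-way split, just to show the ratio:
print(round(gain_ratio(0.5, [4, 8, 8]), 2))
```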

Computing the Error of a Single Node
- C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1; Error = 1 - max(0, 1) = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 - max(1/6, 5/6) = 1/6
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 - max(2/6, 4/6) = 1/3

Comparison Among Impurity Measures
[Figure not reproduced in the transcription: for a 2-class problem, entropy, Gini index, and misclassification error are plotted against the class probability p; all three peak at p = 0.5 and are zero at p = 0 and p = 1.]

Misclassification Error vs. Gini Index
Parent node: C1 = 7, C2 = 3; Gini = 0.42; misclassification error = 0.3.

Splitting on attribute A into N1 (C1 = 3, C2 = 0) and N2 (C1 = 4, C2 = 3):
- Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
- Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
- Gini(children) = 3/10 * 0 + 7/10 * 0.489 = 0.342
The Gini index improves, but the misclassification error remains the same!

A second split into N1 (C1 = 3, C2 = 1) and N2 (C1 = 4, C2 = 2) gives a weighted Gini of 0.416.
The misclassification error for all three cases is 0.3!
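The contrast can be reproduced in a few lines: the weighted Gini index drops for both candidate splits while the misclassification error stays at 0.3. This is a small sketch using the counts from the slide.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

def weighted(measure, children):
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) / n * measure(ch) for ch in children)

parent = [7, 3]
split_1 = [[3, 0], [4, 3]]
split_2 = [[3, 1], [4, 2]]

for name, nodes in [("parent", [parent]), ("split 1", split_1), ("split 2", split_2)]:
    print(name, round(weighted(gini, nodes), 3), round(weighted(error, nodes), 3))
# weighted Gini: 0.42 -> ~0.343 -> ~0.417 (the slide rounds these to 0.342 and 0.416);
# misclassification error: 0.3 in all three cases
```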

Decision Tree Based Classification
Advantages:
- Relatively inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small-sized trees.
- Robust to noise (especially when methods to avoid overfitting are employed).
- Can easily handle redundant attributes.
- Can easily handle irrelevant attributes (unless the attributes are interacting).

Disadvantages:
- Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
- Each decision boundary involves only a single attribute.

Handling Interactions
[Figure not reproduced in the transcription: a scatter plot of 1000 "+" instances and 1000 "o" instances over attributes X and Y, with Entropy(X) = 0.99 and Entropy(Y) = 0.99, so neither attribute looks informative on its own.]

Handling Interactions
[Figure not reproduced in the transcription.]

Handling Interactions Given Irrelevant Attributes
[Figure not reproduced in the transcription: the same 1000 "+" and 1000 "o" instances over X and Y, with Entropy(X) = 0.99, Entropy(Y) = 0.99, and Entropy(Z) = 0.98.]
Adding Z as a noisy attribute generated from a uniform distribution means that attribute Z will be chosen for splitting!

Limitations of Single Attribute-Based Decision Boundaries
[Figure not reproduced in the transcription.] Both the positive (+) and negative (o) classes are generated from skewed Gaussians with centers at (8, 8) and (12, 12) respectively; the axis-parallel splits of a decision tree can only approximate the oblique boundary between them.

