Data Mining Tutorial - User.eng.umd.edu


Quick Review Introduction to Data Mining Entropy, Probability Distributions, and Information Gain Information Ga
Data Mining Tutorial
Mark A. Austin
University of Maryland
austin@umd.edu
ENCE 688P, Fall Semester 2021
October 16, 2021

Overview
1. Quick Review
2. Introduction to Data Mining
3. Entropy, Probability Distributions, and Information Gain
4. Information Gain in Decision Trees
5. Ensemble Learning
6. Metrics of Evaluation
7. Working with Weka
8. Data Mining Examples
Part 01

Quick Review

Artificial Intelligence (AI) and Machine Learning (ML)
Technical Implementation (2020, Google, Siemens, IBM):
  AI and ML will be deeply embedded in new software and algorithms.
Artificial Intelligence:
  Knowledge representation and reasoning with ontologies and rules.
  Semantic graphs.
  Executable event-based processing.
Machine Learning:
  Modern neural networks.
  Input-to-output prediction.
  Data mining.
  Identify objects, events, and anomalies.
  Learn structure and sequence.
  Remember stuff.

Man and Machine (AI-ML View)
Man:
  Good at formulating solutions to problems.
  Can work with incomplete data and information.
  Creative.
  Reasons logically, but very slow. Forgetful.
  Performance is static.
  Humans make the rules, then they break them.
AI-ML Machine:
  Manipulates 0s and 1s.
  Can work with incomplete data and information.
  Creative.
  Fast logical reasoning.
  Performance doubles every 18-24 months.
  Data mining can discover the rules.

[Figure: Traditional Programming vs AI-ML Workflow]

Introduction to Data Mining

Numerous Definitions
Data Mining: The field of data mining addresses the question of how to best use historical data to discover general regularities and improve future decisions (Mitchell, 1999).
Data Mining: Data mining is the extraction of implicit, previously unknown, and potentially useful information (structural patterns) from data (Witten et al., 2017).
The process of discovering useful patterns from data must be automatic (or at least semi-automatic). Useful patterns allow us to make nontrivial predictions on new data.

Data Mining Techniques
Working with Initial Dataset:
  Data cleaning and curation.
  Remove redundant features.
  Identify input variables and output variable.
Preprocessed Dataset:
  Data split: 80% for training, 20% for validation and testing.
[Figure: workflow. Data is split into Training, Validation, and Test sets; the model fit adjusts weights/biases; performance is evaluated; if performance is bad, update the model parameters; if good, evaluate.]
Model Parameters: Learning rate: 0.1. Batch size: 20. No. epochs: 200. No. hidden layers: 2. Optimizer: Adam.

Data Mining Techniques
Training Dataset: The sample of data used to fit the model.
Validation Dataset: The sample of data used to provide an unbiased evaluation of the model fit on the training dataset while training the model parameters.
Testing Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
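The three datasets defined above can be produced with a simple shuffle-and-slice, sketched below. The slides give 80% for training and 20% for validation and testing combined; dividing that 20% evenly (10% each) is an assumption made here for illustration, as are the function name `split_dataset` and the fixed seed.

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle rows and slice them into training, validation, and test lists.

    The 10% validation share is an assumed choice; the remainder goes to test.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical dataset of 100 instance identifiers.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

The test slice is set aside until the end, so that it can serve as the evaluation of the final model.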


Data Mining Techniques
Classification Analysis: Classification analysis learns a method for predicting the instance class from pre-labeled (classified) instances.
[Figure: Classification by Shape/Color (Supervised Learning). Panels: Classification by Shape; Classification by Color. Example labels: Blue, Rectangle; Red Circle; Purple Rectangle.]

Data Mining Techniques
Classification Problem
Given a set of n attributes (ordinal or categorical), a set of k classes, and a set of labeled training instances,
  $[(i_i, l_i), \cdots, (i_j, l_j)]$,   (1)
where $i = (v_1, v_2, \cdots, v_n)$ and $l \in (c_1, c_2, \cdots, c_k)$.
Goal is to determine a classification rule (a sequence of tests on the attributes) that predicts the class of any instance from the values of its attributes.
Note:
  This is a generalization of the concept learning problem since typically there are more than two (outcome) classes.
  Data will contain scatter; may have missing values.

Data Mining Techniques
Decision Trees: A structure that includes a root node, branches, and leaf nodes. Each internal node represents a test on an attribute; each branch represents the outcome of a test; and each leaf represents a class label.
Arbitrary Boolean Functions:
  Each attribute is binary valued (true or false).
  Example trees: XOR, AND, and OR.
Continuous Domains:
  Each attribute is real valued.
  Tests compare an attribute against a threshold (e.g., $a_i > \text{value}$).
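As a small illustration of the Boolean case, the XOR function can be written as a two-level decision tree: the root tests one attribute, each internal node tests the other, and each leaf returns a class label. The function name `xor_tree` and the attribute names `a` and `b` are chosen here for illustration only.

```python
def xor_tree(a, b):
    """XOR as a decision tree over two binary-valued attributes."""
    if a:                  # root node: test on attribute a
        if b:              # internal node: test on attribute b
            return False   # leaf: class label
        return True        # leaf
    if b:                  # internal node on the other branch
        return True        # leaf
    return False           # leaf

for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))
```

XOR needs both levels because neither attribute alone determines the class, whereas a tree for AND can stop at a leaf as soon as one attribute is false.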

Data Mining Techniques
Sample Dataset. Will customer buy a computer?
[Table columns: ID, Age Group, Income, Student, Credit Rating, Buys]

Data Mining Techniques
Sample Decision Tree (Split on Discrete Domain)
[Figure: decision tree with labeled parts: root node, edge, internal node, leaf node.]

Data Mining Techniques
Covering Algorithm and Rule Construction (Split on Continuous Domain)

Data Mining Techniques
Decision Trees for Regression (One-Dimensional Regression)
Goal is to predict real-valued numbers at the leaf nodes.
[Figure: Prediction of a Single Scalar Feature]

Data Mining Techniques
Decision Trees for Regression (Two-Dimensional Regression)
Each node splits the tree according to a single feature.
Mean values of training data are predicted at leaf nodes.
[Figure: Example]
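The idea that leaves predict the mean of their training data can be sketched with the smallest possible regression tree: a single split on one feature. The data values, the hand-picked threshold, and the name `fit_stump` are hypothetical; a real tree learner would also search for the best threshold, which is not shown.

```python
def fit_stump(xs, ys, threshold):
    """One-split regression tree on a single feature.

    Each leaf predicts the mean target of the training points that reach it.
    Assumes at least one training point falls on each side of the threshold.
    """
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)

    def predict(x):
        return left_mean if x <= threshold else right_mean

    return predict

# Hypothetical one-dimensional training data.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
predict = fit_stump(xs, ys, threshold=5.0)
print(predict(2.5), predict(20.0))  # 2.0 8.0
```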

Data Mining Techniques
Basic Questions:
  How to choose the attribute (or value) to split on at each level of the tree?
  When should a node be declared a leaf?
  If a leaf is impure, how should it be labeled?
  If the tree is too large, how can it be pruned?
Notes on Strategy:
  When all of the data in a single node comes from the same class, can declare the node to be a leaf and stop splitting.
  When a group of data points have exactly the same attribute values, we cannot split any further. Declare the node to be a leaf, and output the class that is the majority.
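The two stopping rules in the strategy notes can be sketched as a single check applied to the instances that reach a node. The representation (a list of `(attribute_tuple, class_label)` pairs), the function name `leaf_label`, and the sample values are assumptions made for illustration.

```python
from collections import Counter

def leaf_label(instances):
    """Return the class label if the node should become a leaf, else None.

    instances: list of (attribute_tuple, class_label) pairs (assumed layout).
    """
    labels = [label for _, label in instances]
    attribute_values = {attrs for attrs, _ in instances}
    if len(set(labels)) == 1:
        return labels[0]                              # pure node: one class only
    if len(attribute_values) == 1:
        return Counter(labels).most_common(1)[0][0]   # cannot split: majority class
    return None                                       # keep splitting

# Hypothetical nodes.
pure = [(("young", "high"), "no"), (("senior", "low"), "no")]
identical = [(("young", "high"), "yes"), (("young", "high"), "no"), (("young", "high"), "yes")]
mixed = [(("young", "high"), "yes"), (("senior", "low"), "no")]
print(leaf_label(pure), leaf_label(identical), leaf_label(mixed))  # no yes None
```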

Data Mining Techniques
Algorithms:
  Perceptron.
  Logistic Regression.
  Decision tree algorithms (C4.5, J48).
  Support Vector Machines (SVM).
  Random Forest.
Applications:
  Anomaly (Fraud) detection.
  Medical diagnosis.
  Industrial applications.

Data Mining Techniques
Clustering Problems: Clustering techniques apply when there is no class to be predicted, but when unlabeled instances need to be divided into common natural groups.
[Figure: Clustering Process (Unsupervised Learning). A clustering algorithm maps scattered data to clustered data. Items within a cluster are closely spaced; individual clusters are separated.]

Data Mining Techniques
Example 1. Clustering of House Prices and Floor Areas
[Figure: scatter plot of House Price (200k to 500k) against House Floor Area (500 to 2500 square ft), with clusters labeled Nice neighborhood, City Center, and Far from City Center.]

Data Mining Techniques
Example 2. Hierarchical Clustering and Dendrograms
Dendrogram: A dendrogram is a branching (tree) diagram that represents relationships of similarity among groups of entities.

Data Mining Techniques
Algorithms:
  K-means clustering.
  Hierarchical clustering.
Applications:
  Preprocessing step for many scientific applications.
  Natural language processing.
  Market segmentation.
  Netflix/movie recommendations.
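K-means, the first algorithm listed, alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. The sketch below is a simplified version under stated assumptions: starting centers are supplied by hand rather than chosen at random, the loop runs a fixed number of iterations instead of testing for convergence, and the points are hypothetical.

```python
import math

def kmeans(points, centers, iterations=10):
    """Simplified k-means on tuples of coordinates (fixed iteration count)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(clusters)
```

With well-separated groups like these, the assignment stabilizes after the first pass; real data usually needs several restarts with different starting centers.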

Data Mining Techniques
Association: Association is a data mining function that discovers the probability of the co-occurrence of items (or patterns) in a collection of data.
Association Rules:
  Relationships between co-occurring items can be expressed as association rules (e.g., if X, then Y).
Key Challenges:
  How to identify useful correlations among all correlations?
  Correlation relationships are not the same as dependency relationships: "if X, then Y" does not imply "if Y, then X"!
  Historical data does not necessarily predict the future.

Data Mining Techniques
Goals of Predictive Analysis:
  For a customer who purchases product A, what other products will they purchase?
  Will coupons increase same-store sales?
  Will a reduced price mean higher sales?
Retail Strategies:
  Put the most frequently purchased item (e.g., milk) at the back of the store.
  Co-locate items that are bought together; this can lead to an increase in sales for both.

Data Mining Techniques
Example 1. iPhone Color and Personality Traits.
Phone Color   Personality Traits
Green         Fresh, harmonious, healthy, hopeful.
Blue          Confident, dependable, trustworthy.
Yellow        Happy, honorable, intelligent.
Pink          Compassionate, energetic, playful.
White         Balanced, calm, clean.
Customers want to select an iPhone color that correlates with their personality traits.

Data Mining Techniques
Example 2. Urban Legend from early 1990s: Diapers and Beer
Examples of Association Rules:
  {Diapers} → {Beer},
  {Milk, Bread} → {Eggs, Coke},
  {Beer, Bread} → {Milk}.

Data Mining Techniques
Itemset and k-Itemset:
  An itemset is a collection of one or more items (e.g., {Milk, Bread}).
  A k-itemset is an itemset containing k items.
Support Count ($\sigma$):
  Frequency of occurrence of an itemset.
  Example: $\sigma(\{\text{Milk, Bread, Diaper}\}) = 2$.
Support:
  Indicates how frequently the if/then relationship appears in the data.
Association Rule:
  Expression of the form X → Y, where X and Y are itemsets.

Data Mining Techniques (Rule Evaluation Metrics)
Support (s): Fraction of transactions that contain both X and Y.
  $s = \sigma(\{\text{Milk, Diaper, Beer}\}) / |T| = 2/5 = 0.4$.
Confidence (c): Measures how often items in Y appear in transactions that contain X.
  $c = \sigma(\{\text{Milk, Diaper, Beer}\}) / \sigma(\{\text{Milk, Diaper}\}) = 2/3 \approx 0.67$.
Here $\sigma(\cdot)$ is the support count of an itemset and $|T|$ is the number of transactions.
Data Mining for Association Rules: Given a set of transactions T, find all rules having:
  Support $s \geq$ min support threshold.
  Confidence $c \geq$ min confidence threshold.
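The support and confidence figures above can be reproduced with a few lines of set arithmetic. The transcript does not include the transaction table, so the five transactions below are an assumed dataset, chosen so that the rule {Milk, Diaper} → {Beer} gives the quoted values of 2/5 and 2/3; the function names are also illustrative.

```python
# Assumed transaction table (not given in the slides' text).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Fraction of transactions that contain both X and Y."""
    return support_count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """How often the items in Y appear in transactions that contain X."""
    return support_count(x | y, transactions) / support_count(x, transactions)

x, y = {"Milk", "Diaper"}, {"Beer"}
print(support(x, y, transactions))               # 0.4
print(round(confidence(x, y, transactions), 2))  # 0.67
```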

Data Mining Techniques (Brute-Force Enumeration)
Brute-Force Enumeration:
  Compute support and confidence for all possible association rules.
  Prune rules that do not meet min support/confidence thresholds.
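A minimal sketch of the brute-force approach follows: every itemset of two or more items is split into every possible left-hand and right-hand side, and rules below either threshold are discarded. The four transactions, the thresholds, and the name `brute_force_rules` are hypothetical.

```python
from itertools import combinations

def brute_force_rules(transactions, min_support, min_confidence):
    """Enumerate every rule X -> Y and keep those meeting both thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    rules = []
    for k in range(2, len(items) + 1):
        for itemset in combinations(items, k):
            whole = frozenset(itemset)
            for r in range(1, k):
                for lhs in combinations(itemset, r):
                    x = frozenset(lhs)
                    y = whole - x
                    n_x = count(x)
                    if n_x == 0:
                        continue                      # X never occurs: no rule
                    s = count(whole) / n              # support of X -> Y
                    c = count(whole) / n_x            # confidence of X -> Y
                    if s >= min_support and c >= min_confidence:
                        rules.append((set(x), set(y), s, c))
    return rules

# Hypothetical transactions over three items.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
rules = brute_force_rules(transactions, min_support=0.5, min_confidence=0.6)
for x, y, s, c in rules:
    print(x, "->", y, round(s, 2), round(c, 2))
```

Even with three items the loops visit twelve candidate rules, which is why this approach does not scale.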

Data Mining Techniques (Brute-Force Enumeration)
Computational Complexity: Given d items, there are $2^d$ possible candidate itemsets.
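The count of candidate itemsets can be checked by listing every subset of a small item collection. The six item names are taken from the rule examples earlier in the slides; the helper name `candidate_itemsets` is illustrative, and the count of $2^d$ includes the empty itemset.

```python
from itertools import combinations

def candidate_itemsets(items):
    """List every subset of items, from the empty set up to the full set."""
    items = sorted(items)
    return [set(c) for k in range(len(items) + 1) for c in combinations(items, k)]

items = {"Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke"}
candidates = candidate_itemsets(items)
print(len(candidates), 2 ** len(items))  # 64 64
```

Each additional item doubles the count, so d = 20 already gives more than a million candidates.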

Data Mining Techniques (Brute-Force Enumeration)
Need strategies to reduce computational effort by systematically pruning the low-scoring items from the candidate space.

Data Mining Techniques
Algorithms (see Chapter 6 of Witten et al.):
  Apriori: Follows a generate-and-test methodology for finding frequent item sets, generating successively longer candidate item sets, and then scanning the item sets to see if they meet threshold limits.
  Frequent Pattern Trees: Begins by counting the number of times individual items (attribute-value pairs) occur in the dataset. This is a single pass. Then, a (sorted) tree structure is constructed with the goal of identifying large (frequent) item sets.
Applications:
  Weather prediction.
  Medical diagnosis.
  Purchasing habits of retail customers.
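The generate-and-test loop described for Apriori can be sketched as follows. This is a simplified illustration, not the implementation from Witten et al. or Weka: candidates are generated by joining all pairs of frequent itemsets from the previous level, and the transactions and the minimum count of 3 are assumed values.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every itemset with count >= min_count."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = count(itemset)
        k += 1
        previous = set(level)
        # Generate: join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = [c for c in candidates
                      if all(frozenset(sub) in previous for sub in combinations(c, k - 1))]
        # Test: scan the transactions and keep candidates that meet the threshold.
        level = [c for c in candidates if count(c) >= min_count]
    return frequent

# Assumed transactions (hypothetical data for illustration).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, min_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is what separates this from brute-force enumeration: an itemset is only counted if all of its smaller subsets were already found to be frequent.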

Scientific Research Enabling Applications
[Figure. Source: Mitchell, 1999.]

References
Jaynes E.T., Information Theory and Statistical Mechanics. II, Phys. Rev. 108, 171, October 1957.
Kapur J.N., Maximum-Entropy Models in Science and Engineering, John Wiley and Sons, 1989.
Mitchell T.M., Machine Learning and Data Mining, Communications of the ACM, Vol. 42, No. 11, November 1999.
Russell S., and Norvig P., Artificial Intelligence: A Modern Approach (Third Edition), Prentice-Hall, 2010.
Shannon C.E., and Weaver W., The Mathematical Theory of Communication, University of Illinois, Urbana, Chicago, 1949.
Witten I.H., Frank E., Hall M.A., and Pal C.J., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2017.
