Data Mining Tutorial - User.eng.umd.edu


Quick Review | Introduction to Data Mining | Entropy, Probability Distributions, and Information Gain | Information Ga

Data Mining Tutorial
Mark A. Austin
University of Maryland
austin@umd.edu
ENCE 688P, Fall Semester 2021
October 16, 2021

Overview
1. Quick Review
2. Introduction to Data Mining
3. Entropy, Probability Distributions, and Information Gain
4. Information Gain in Decision Trees
5. Ensemble Learning
6. Metrics of Evaluation
7. Working with Weka
8. Data Mining Examples
Part 01

Quick Review

Artificial Intelligence (AI) and Machine Learning (ML)
Technical Implementation (2020, Google, Siemens, IBM)
AI and ML will be deeply embedded in new software and algorithms.
Artificial Intelligence:
- Knowledge representation and reasoning with ontologies and rules. Semantic graphs. Executable event-based processing.
Machine Learning:
- Modern neural networks. Input-to-output prediction.
- Data mining.
- Identify objects, events, and anomalies.
- Learn structure and sequence. Remember stuff.

Man and Machine (AI-ML View)
Man
- Good at formulating solutions to problems.
- Can work with incomplete data and information.
- Creative.
- Reasons logically, but very slow. Forgetful.
- Performance is static.
- Humans make the rules, then they break them.
AI-ML Machine
- Manipulates 0s and 1s.
- Can work with incomplete data and information.
- Creative.
- Fast logical reasoning.
- Performance doubles every 18-24 months.
- Data mining can discover the rules.

Traditional Programming vs AI-ML Workflow

Introduction to Data Mining

Numerous Definitions
Data Mining
The field of data mining addresses the question of how to best use historical data to discover general regularities and improve future decisions (Mitchell, 1999).
Data Mining
Data mining is the extraction of implicit, previously unknown, and potentially useful information – structural patterns – from data (Witten et al., 2017).
The process of discovering useful patterns from data must be automatic (or at least semi-automatic). Useful patterns allow us to make nontrivial predictions on new data.

Data Mining Techniques
Working with Initial Dataset
- Data cleaning and curation.
- Remove redundant features.
- Identify input variables and output variable.
Preprocessed Dataset:
- Data split: 80% for training, 20% for validation and testing.
[Figure: workflow in which the data is split into training, validation, and test sets. The model is fit (weights / biases) and its performance is evaluated. If performance is bad, update the model parameters; if good, evaluate.]
Model Parameters
- Learning rate: 0.1
- Batch size: 20
- No. epochs: 200
- No. hidden layers: 2
- Optimizer: Adam

Data Mining Techniques
Training Dataset
- The sample of data used to fit the model.
Validation Dataset
- The sample of data used to provide an unbiased evaluation of the model fit on the training dataset while training the model parameters.
Testing Dataset
- The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
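The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code: the function name `split_dataset`, the fixed random seed, and the choice to divide the non-training remainder evenly between validation and testing are all assumptions.

```python
import random

def split_dataset(records, train_frac=0.8, seed=0):
    """Shuffle records and split them into training, validation, and test lists.

    Assumption: the records left over after training are divided evenly
    between validation and testing.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n_train = int(round(train_frac * len(shuffled)))
    remainder = shuffled[n_train:]
    n_val = len(remainder) // 2
    return shuffled[:n_train], remainder[:n_val], remainder[n_val:]

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids carrying any ordering in the original data into the three subsets.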

Data Mining Techniques

Data Mining Techniques
Classification Analysis
Classification analysis learns a method for predicting the instance class from pre-labeled (classified) instances.
[Figure: Classification by Shape/Color (Supervised Learning). Panels: Classification by Shape; Classification by Color. Labels: Blue, Rectangle; Red Circle; Purple Rectangle.]

Data Mining Techniques
Classification Problem
Given a set of n attributes (ordinal or categorical), a set of k classes, and a set of labeled training instances,
$$[(i_i, l_i), \cdots, (i_j, l_j)], \qquad (1)$$
where $i = (v_1, v_2, \cdots, v_n)$ and $l \in (c_1, c_2, \cdots, c_k)$.
Goal is to determine a classification rule – sequence of tests on the attributes – that predicts the class of any instance from the values of its attributes.
Note
- This is a generalization of the concept learning problem since typically there are more than two (outcome) classes.
- Data will contain scatter; may have missing values.

Data Mining Techniques
Decision Trees
A structure that includes a root node, branches, and leaf nodes. Each internal node represents a test on an attribute; each branch represents the outcome of a test; and each leaf represents a class label.
Arbitrary Boolean Functions
- Each attribute is binary valued (true or false).
- Example trees: XOR, AND, OR, etc.
Continuous Domains
- Each attribute is real valued.
- Tests compare an attribute value $a_i$ against a threshold value.
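As a small illustration of a tree over binary-valued attributes, XOR can be written as nested tests, one per internal node. The function name `xor_tree` and the attribute names `a` and `b` are illustrative assumptions, not taken from the slides.

```python
def xor_tree(a, b):
    """Decision tree for XOR: the root tests a, and each branch then tests b."""
    if a:                 # root node: test on attribute a
        if b:             # internal node on the a == True branch
            return False  # leaf: both attributes true
        return True       # leaf: only a is true
    if b:                 # internal node on the a == False branch
        return True       # leaf: only b is true
    return False          # leaf: neither attribute is true

# Check the tree against the XOR truth table.
for a in (False, True):
    for b in (False, True):
        assert xor_tree(a, b) == (a != b)
```

XOR needs a test on both attributes along every path, since neither attribute alone determines the class.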

Data Mining Techniques
Sample Dataset. Will customer buy a computer?
[Table with columns: ID, Age Group, Income, Student, Credit Rating, Buys.]

Data Mining Techniques
Sample Decision Tree (Split on Discrete Domain)
[Figure: sample decision tree with labeled parts: root node, edge, internal node, leaf node.]

Data Mining Techniques
Covering Algorithm and Rule Construction (Split on Continuous Domain)

Data Mining Techniques
Decision Trees for Regression (One-Dimensional Regression)
Goal is to predict real-valued numbers at the leaf nodes.
[Figure: Prediction of a Single Scalar Feature.]

Data Mining Techniques
Decision Trees for Regression (Two-Dimensional Regression)
- Each node splits tree according to a single feature.
- Mean values of training data are predicted at leaf nodes.
[Figure: Example.]
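The idea that a regression tree predicts the mean of the training data at each leaf can be sketched with a single split. This is a hedged illustration only: the data values are made up, the threshold is supplied by the caller rather than learned, and the names `fit_stump` and `predict` are hypothetical.

```python
from statistics import mean

def fit_stump(xs, ys, threshold):
    """Fit a one-split regression tree on a single feature.

    Each leaf stores the mean target value of the training points it receives.
    Assumption: the threshold is given, not searched for.
    """
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return {"threshold": threshold, "left": mean(left), "right": mean(right)}

def predict(stump, x):
    """Route x to a leaf and return that leaf's stored mean."""
    return stump["left"] if x <= stump["threshold"] else stump["right"]

# Made-up training data with two obvious groups.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 2.0, 3.0, 20.0, 22.0, 24.0]
stump = fit_stump(xs, ys, threshold=5.0)
print(predict(stump, 2.5), predict(stump, 11.5))  # prints: 2.0 22.0
```

A full regression tree would repeat this split recursively and choose each threshold to reduce the spread of target values within the leaves.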

Data Mining Techniques
Basic Questions:
- How to choose the attribute (or value) to split on at each level of the tree?
- When should a node be declared a leaf?
- If a leaf is impure, how should it be labeled?
- If the tree is too large, how can it be pruned?
Notes on Strategy:
- When all of the data in a single node comes from the same class, can declare the node to be a leaf and stop splitting.
- When a group of data points have exactly the same attribute values, we cannot split any further. Declare the node to be a leaf, and output the class that is the majority.
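The two stopping rules in the strategy notes can be sketched as a single check. The function name `leaf_label` and the representation of instances as (attribute tuple, label) pairs are assumptions made for illustration.

```python
from collections import Counter

def leaf_label(instances):
    """Return a class label if the node should become a leaf, else None.

    instances: list of (attribute_tuple, label) pairs (assumed representation).
    """
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:
        return labels[0]                       # pure node: one class only
    attributes = {attrs for attrs, _ in instances}
    if len(attributes) == 1:
        # Identical attribute values: no further split is possible,
        # so output the majority class.
        return Counter(labels).most_common(1)[0][0]
    return None                                # node can still be split

pure = [((1, 0), "yes"), ((0, 1), "yes")]
stuck = [((1, 0), "yes"), ((1, 0), "yes"), ((1, 0), "no")]
mixed = [((1, 0), "yes"), ((0, 1), "no")]
print(leaf_label(pure), leaf_label(stuck), leaf_label(mixed))  # prints: yes yes None
```

When the classes are tied, `Counter.most_common` returns the label it encountered first; a real implementation would need an explicit tie-breaking policy.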

Data Mining Techniques
Algorithms
- Perceptron.
- Logistic Regression.
- Decision tree algorithms (C4.5, J48).
- Support Vector Machines (SVM).
- Random Forest.
Applications
- Anomaly (Fraud) detection.
- Medical diagnosis.
- Industrial applications.

Data Mining Techniques
Clustering Problems
Clustering techniques apply when there is no class to be predicted, but when un-labeled instances need to be divided into common natural groups.
[Figure: Clustering Process (Unsupervised Learning). A clustering algorithm turns scattered data into clustered data. Items within a cluster are closely spaced; individual clusters are separated.]

Data Mining Techniques
Example 1. Clustering of House Prices and Floor Areas
[Figure: scatter plot of house price (200k to 500k) against house floor area (500 to 2500 square ft), with clusters labeled "Nice neighborhood", "City Center", and "Far from City Center".]

Data Mining Techniques
Example 2. Hierarchical Clustering and Dendrograms
Dendrogram
A dendrogram is a branching (tree) diagram that represents relationships of similarity among groups of entities.

Data Mining Techniques
Algorithms
- K-means clustering.
- Hierarchical clustering.
Applications
- Preprocessing step for many scientific applications.
- Natural language processing.
- Market segmentation.
- Netflix/movie recommendations.
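To make K-means clustering concrete, here is a minimal one-dimensional sketch of the usual assign-then-update loop (Lloyd's algorithm). The floor-area values, the initial centers, and the function name `kmeans_1d` are assumptions for illustration; practical implementations work in several dimensions and choose the initial centers more carefully.

```python
def kmeans_1d(points, centers, iterations=10):
    """K-means in one dimension with caller-supplied initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Made-up house floor areas (square ft) forming two groups.
areas = [600, 650, 700, 1900, 2000, 2100]
centers, clusters = kmeans_1d(areas, centers=[500, 2500])
print(centers)  # prints: [650.0, 2000.0]
```

The number of clusters is fixed in advance by the number of initial centers, which is the main input K-means requires from the user.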

Data Mining Techniques
Association
Association is a data mining function that discovers the probability of the co-occurrence of items (or patterns) in a collection of data.
Association Rules
- Relationships between co-occurring items can be expressed as association rules (e.g., if X, then Y).
Key Challenges
- How to identify useful correlations among all correlations?
- Correlation relationships are not the same as dependency relationships – if X, then Y does not imply if Y, then X!
- Historical data does not necessarily predict the future.

Data Mining Techniques
Goals of Predictive Analysis
- For a customer who purchases product A, what other products will they purchase?
- Will coupons increase same-store sales?
- Will a reduced price mean higher sales?
Retail Strategies
- Put most frequently purchased item (e.g., milk) at the back of the store.
- Co-locate items that are bought together – can lead to increase in sales for both.

Data Mining Techniques
Example 1. iPhone Color and Personality Traits.
Phone Color   Personality Traits
Green         Fresh, harmonious, healthy, hopeful.
Blue          Confident, dependable, trustworthy.
Yellow        Happy, honorable, intelligent.
Pink          Compassionate, energetic, playful.
White         Balanced, calm, clean.
Customers want to select an iPhone Color that correlates with their personality traits.

Data Mining Techniques
Example 2. Urban Legend from early 1990s: Diapers and Beer
Examples of Association Rules
- $\{\text{Diapers}\} \rightarrow \{\text{Beer}\}$,
- $\{\text{Milk, Bread}\} \rightarrow \{\text{Eggs, Coke}\}$,
- $\{\text{Beer, Bread}\} \rightarrow \{\text{Milk}\}$.

Data Mining Techniques
Itemset and k-Itemset
- An itemset is a collection of one or more items (e.g., {Milk, Bread}).
- A k-itemset is an itemset containing k items.
Support Count
- Frequency of occurrence of an itemset.
- Example: $\sigma(\{\text{Milk, Bread, Diaper}\}) = 2$.
Support
- Indicates how frequently the if/then relationship appears in the data.
Association Rule
- Expression of the form $X \rightarrow Y$, where X and Y are itemsets.

Data Mining Techniques (Rule Evaluation Metrics)
Support (s)
- Fraction of transactions that contain both X and Y.
- $s = \sigma(\{\text{Milk, Diaper, Beer}\}) / |T| = 2/5 = 0.4$.
Confidence (c)
- Measures how often items in Y appear in transactions that contain X.
- $c = \sigma(\{\text{Milk, Diaper, Beer}\}) / \sigma(\{\text{Milk, Diaper}\}) = 2/3 \approx 0.67$.
Data Mining for Association Rules
Given a set of transactions T, find all rules having:
- Support (s) $\geq$ min support threshold.
- Confidence (c) $\geq$ min confidence threshold.
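The support and confidence calculations for the rule {Milk, Diaper} → {Beer} can be reproduced with a short script. The five transactions below are an assumption: the slide's transaction table did not survive transcription, so this hypothetical market-basket dataset was chosen to give the same counts (2 of 5 transactions for support, 2 of 3 for confidence). The function names are also illustrative.

```python
# Hypothetical market-basket data, chosen so the counts match the slide's numbers.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Fraction of transactions that contain both X and Y."""
    return support_count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """How often the items in Y appear in transactions that contain X."""
    return support_count(x | y, transactions) / support_count(x, transactions)

x, y = {"Milk", "Diaper"}, {"Beer"}
print(support(x, y, transactions))               # prints: 0.4
print(round(confidence(x, y, transactions), 2))  # prints: 0.67
```

Swapping `x` and `y` changes the confidence but not the support, which is one way to see that a rule and its converse are not equivalent.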

Data Mining Techniques (Brute-Force Enumeration)
Brute-Force Enumeration
- Compute support and confidence for all possible association rules.
- Prune rules that do not meet min support/confidence thresholds.

Data Mining Techniques (Brute-Force Enumeration)
Computational Complexity: Given d items, there are $2^d$ possible candidate itemsets.
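The exponential growth in candidates is easy to see by enumerating them. The sketch below lists every non-empty subset of a small, made-up item collection; the function name `candidate_itemsets` is an assumption. The count of all subsets of d items is 2 to the power d when the empty set is included, so the non-empty candidates number one fewer.

```python
from itertools import combinations

def candidate_itemsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

items = {"Bread", "Milk", "Diaper", "Beer"}   # made-up example with d = 4
candidates = list(candidate_itemsets(items))
print(len(candidates), 2 ** len(items))       # prints: 15 16
```

With only 20 distinct items the same enumeration already yields over a million candidates, which is why brute force does not scale.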

Data Mining Techniques (Brute-Force Enumeration)
Need strategies to reduce computational effort by systematically pruning the low scoring items from candidate space.

Data Mining Techniques
Algorithms (see Chapter 6 of Witten et al.)
- Apriori: Follows a generate-and-test methodology for finding frequent item sets, generating successively longer candidate item sets, and then scanning the item sets to see if they meet threshold limits.
- Frequent Pattern Trees: Begins by counting the number of times individual items – attribute-value pairs – occur in the dataset. This is a single pass. Then, a (sorted) tree structure is constructed with the goal of identifying large (frequent) item sets.
Applications
- Weather prediction.
- Medical diagnosis.
- Purchasing habits of retail customers.
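The generate-and-test loop of Apriori can be sketched as follows. This is a simplified illustration under stated assumptions, not the version in Witten et al. or in Weka: the transactions are hypothetical, the threshold is an absolute support count (`min_count`), and candidates are generated by joining all pairs of frequent itemsets, which is simple but not the most efficient scheme.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every itemset meeting min_count."""
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Test step: count each candidate and keep those meeting the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Generate step: join frequent k-itemsets into (k+1)-item candidates,
        # keeping only those whose k-item subsets are all frequent.
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        current = {c for c in joined
                   if all(frozenset(s) in survivors
                          for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market-basket data.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning relies on the fact that an itemset cannot be frequent unless all of its subsets are, so infrequent short itemsets rule out many longer candidates without counting them.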

Scientific Research Enabling Applications
Source: Mitchell, 1999.

References
- Jaynes E.T., Information Theory and Statistical Mechanics. II, Phys. Rev. 108, 171, October 1957.
- Kapur J.N., Maximum-Entropy Models in Science and Engineering, John Wiley and Sons, 1989.
- Mitchell T.M., Machine Learning and Data Mining, Communications of the ACM, Vol. 42, No. 11, November 1999.
- Russell S., and Norvig P., Artificial Intelligence: A Modern Approach (Third Edition), Prentice-Hall, 2010.
- Shannon C.E., and Weaver W., The Mathematical Theory of Communication, University of Illinois, Urbana, Chicago, 1949.
- Witten I.H., Frank E., Hall M.A., and Pal C.J., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2017.
