Principles Of Data Mining

2y ago

20 Views

3 Downloads

745.83 KB

41 Pages

Last View : 5d ago

Last Download : 3m ago

Upload by : River Barajas

Report this link

Download PDF

Transcription

Principles of Data MiningInstructor: Sargur N. SrihariUniversity at BuffaloThe State University of New Yorksrihari@cedar.buffalo.edu1Srihari

Introduction: Topics1. Introduction to Data Mining2. Nature of Data Sets3. Types of StructureModels and Patterns4. Data Mining Tasks (What?)5. Components of Data Mining Algorithms(How?)6. Statistics vs Data Mining2Srihari

Flood of DataNew York Times, January 11, 2010Video and Image Data“Unstructured”“Structured and Unstructured”(Text) Data3Srihari

Large Data Sets are Ubiquitous1. Due to advances in digital data acquisition and storagetechnologyBusiness Supermarket transactions Credit card usage records Telephone call details Government statisticsScientific Images of astronomical bodies Molecular databases Medical recordsInternational organizations produce more information in a week thanmany people could read in a lifetime2. Automatic data production leads to need for automaticdata consumption3. Large databases mean vast amounts of information44. Difficulty lies in accessing itSrihari

Data Mining as Discovery Data Mining is Science of extracting useful information fromlarge data sets or databases Also known as KDD Knowledge Discovery and Data Mining Knowledge Discovery in Databases5Srihari

KDD is a multidisciplinary fieldMachine LearningPattern ation6StatisticsArtificial IntelligenceExpert SystemsSrihari

Terminology for DataTraining SetStructured DataMachine LearningPattern RecognitionInformationRetrievalUnstructured izationArtificial IntelligenceExpert SystemsData PointsInstances7Srihari

Data Mining DefinitionAnalysis of (often large) Observational Data to findunsuspected relationships and Summarize data in novel waysthat are understandable and useful to data ownerUnsuspected Relationshipsnon-trivial, implicit, previously unknownEx of Trivial: Those who are pregnant are femaleRelationships and Summaryare in the form of Patterns and ModelsLinear Equations, Rules, Clusters, Graphs, Tree Structures, RecurrentPatterns in Time SeriesUsefulness:meaningful: lead to some advantage, usually economicAnalysis:Process of discovery (Extraction of knowledge)Automatic or Semi-automaticSrihari

Observational Data Observational Data Objective of data mining exercise plays no role indata collection strategy E.g., Data collected for Transactions in a Bank Experimental Data Collected in Response to Questionnaire Efficient strategies to Answer Specific Questions In this way it differs from much of statistics For this reason, data mining is referred to assecondary data analysis9Srihari

KDD Process Stages: Selecting Target DataPreprocessingTransforming themData Mining to Extract Patterns and RelationshipsInterpreting Assesses Structures KDD more complicated than initially thought 80% preparing data 20% mining data10Srihari

Seeking Relationships Finding accurate, convenient and usefulrepresentations of data involves these steps: Determining nature and structure of representation E.g., linear regression Deciding how to quantify and compare two differentrepresentation E.g., sum of squared errors Choosing an algorithmic process to optimize scorefunction E.g., gradient descent optimization Efficient Implementation using data managementSrihari

Example of Regression AnalysisEXAMPLE of Model1. Representation2. Score function3. Process to optimizescore4. Implementation:data management,efficiency121. Regression:y a bxPredictor variable x(income)Response variable y(credit card spending)2. Score: sum of squarederrors

Linear Regression Process:Extracting a Linear ModelLinear regression with one variableData of the form (xi, yi), i 1,.n samplesNeed to find a and b such that y a bxYData Representationyx138911114532XWhat is involved in calculating a and bSo thatthe line fits the points the best?13

Score: Sum of Squared ErrorsWhere yi is the response value obtained from the modelWe wish to minimize SSE14

Minimizing SSE for RegressionDifferentiating SSE with respect to a and b we haveSetting partial derivatives equal to zero and rearranging termsWhich we solve for a and b,the regression coefficients15

Regression CoefficientsTo calculate a and b we need to find the means of the x and y values.Then we calculate b as a function of the x and y values and the meansa as a function of the means and b16

Application to Datayx13891111453217meany 5meanx 6a 0.8, b 1.04Linear RegressionFor the data setOptimal regression line isy 0.8 1.04x10yx10

Multiple Regressionp predictor variablesyy(1)x1x1(1)x2 . xpn objectsX n x d 1 matrixWhere a column of 1’s are addedto incorporate a0 in modely(n)x1(n)Solution:18y is a column vector, a (ao,.,ap)e is a n by 1 vector containingresiduals

Implementation of RegressionSolution:Simple summaries of the data; sums, sums of squares andsums of products of X and Y are sufficientto compute estimates of a and bImplies single pass through the data will yield estimates19

2. Nature of Data Sets Structured Data set of measurements from an environment orprocess Simple case n objects with d measurements each: n x d matrix d columns are called variables, features, attributesor fields20

Structured Data and Data TypesUS Census Bureau DataPublic Use Microdata Sample data sets (PUMS)IDAgeSexQuantitative orical d100000Categorical OrdinalNoisy dataA guess?FemaleMarriedHS iedChild0PUMS 21Data has identifying information removed.Available in 5% and 1% sample sizes. 1% sample has 2.7 million records

Unstructured Data1. Structured Data Well-defined tables, attributes (columns), tuples (rows) UC Irvine data set2. Unstructured Data World wide web Documents and hyperlinks– HTML docs represent tree structure with text and attributesembedded at nodes– XML pages use metadata descriptions Text Documents Document viewed as sequence of words and punctuations– Mining Tasks»»»»Text categorizationClustering Similar DocumentsFinding documents that match a queryAutomatic Essay Scoring (AES)– Reuters collection is at http://www.research.att.com/ lewis22

Representations of Text Documents Boolean Vector Document is a vector where each element is a bitrepresenting presence/absence of word A set of documents can be represented as matrix (d,w)– where document d and word w has value 1 or 0(sparse matrix) Vector Space Representation Each element has a value such as no. of occurrences or frequency A set of documents represented as a document-term matrix23

Vector Space ExampleDocument-Term odt6lineardij represents number of timesthat term appears in that document24

Mixed Data: Structured & Unstructured Medical Patient DataBlood Pressure at different times of dayImage data (x-ray or MRI)Specialistʼs comments (text)Hierarchy of relationships betweenpatients, doctors, hospitalsN x d data matrix is oversimplification of what occurs in practice25

Transaction DataList of store purchases: date, customer ID, list of items and pricesCan be transformedintobinary-valuedmatrix1IndividualsWeb transaction log -sequence of triples: (user id, web page, time)111111111111112611111111Web Page Visited11

3.Types of Structures: Modelsand Patterns Representations sought in data mining Global Model Local Pattern27Srihari

Models and Patterns Global Model Make a statement about any point in d-space E.g., assign a point to a cluster Even when some values are missing Simple model: Y aX c Functional model is linear Linear in variables rather than parameters Local Patterns Make a statement about restricted regions ofspace spanned by variables E.g.1: if X thresh1 then Prob (Y thresh2) p E.g.2: certain classes of transactions do not show peaksand troughs (bank discovers dead peopleʼs openaccounts)28

4. Data Mining Tasks (What?) Not so much a single technique Idea that there is more knowledge hidden in the datathan shows itself on the surface Any technique that helps to extract more out of datais useful Five major task types:1. Exploratory Data Analysis (Visualization)2. Descriptive Modeling (Density estimation, Clustering)Model3. Predictive Modeling (Classification and Regression)building4. Discovering Patterns and Rules (Association rules)5. Retrieval by Content (Retrieve items similar to pattern of interest)29Srihari

Exploratory Data Analysis Interactive and VisualPie Charts (angles represent size)Cox Comb Charts (radii represent size)Intricate spatial displays of users ofGoogle around the world30Srihari

Descriptive Modeling Describe all the data or a process forgenerating the data Probability Distribution using DensityEstimation Clustering and Segmentation Partitioning p-dimensional space into groups Similar people are put in same group31Srihari

Predictive Modeling Classification and Regression Market value of a stock, disease,brittleness of a weld Machine Learning Approaches A unique variable is the objective inprediction unlike in description.32Srihari

Discovering Patterns and Rules Detecting fraudulent behavior bydetermining data that differs significantlyfrom rest Finding combinations of transactionsthat occur frequently in transactionaldata bases Grocery items purchased together33Srihari

Retrieval by Content User has pattern of interest and wishesto find that pattern in database, Ex: Text Search Estimate the relative importance of web pagesusing a feature vector whose elements arederived from the Query-URL pair Image Search Search a large database of images by usingcontent descriptors such as color, texture,relative position34Srihari

Components of Data MiningAlgorithms (How?)Four basic components in each algorithm*1. Model or Pattern StructureDetermining underlying structure or functional form weseek from data2. Score FunctionJudging the quality of the fitted model3. Optimization and Search MethodSearching over different model and pattern structures4. Data Management StrategyHandling data access efficiently*IIlustrated in Regression example35

Statistics vs Data Mining Size of data set (large in data mining) Eyeballing not an option (terabytes of data) Entire dataset rather than a sample Many variables Curse of dimensionality Make predictions Small sample sizes can lead to spurious discovery: Superbowl winner conference correlates to stock market(up/down)

Searching Data Base vs Data MiningData Base:When you know exactly what you are looking for Query Tool: SQL (Structured Query Language) exampleTable called ToveKariAddressTimoteivn 10Borgvn 23Storgt 20 Query: SELECT LastName FROM PersonsLastNameCitySandnesSandnesStavangerresults inHansenSvendsonPettersenData Mining:37When you only vaguely know what you are looking forSrihari

Reference Textbooks1. Hand, David, Heikki Mannila, and Padhraic Smyth,Principles of Data Mining, MIT Press 2001.2. Bishop, Christopher, Pattern Recognition and MachineLearning, Springer 2006Approach:Fundamental principlesEmphasis on Theory and AlgorithmsMany other textbooks:Emphasize business applications, case studies38Srihari

Many Other Textbooks1.Han and Kamber, Data Mining Concepts and Techniques, MorganKaufmann, 2000 (Data Base Perspective)2.Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Toolsand Techniques with Java Implementations, Morgan Kaufmann, 2000.(Machine Learning Perspective)3.Adriaans, P., and D. Zantinge, Data Mining, Addison- Wesley,1998. (LaymanPerspective)4.Groth, R., Data Mining: A Hands-on Approach for Business Professionals,Prentice-Hall PTR,1997. (Business Perspective)5. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through PatternRecognition, Prentice-Hall PTR, 1998. (Pattern Recognition Perspective)6. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide,Morgan Kaufmann, 1998. (Statistical Perspective)39Srihari

More Data Mining Textbooks7. S.Chakrabarti, Mining the web, Morgan Kaufman, 2003 (Emphasis on webpagesand hyperlinks)8 T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning, Wiley,2003 (Focus on data quality)9. K. Cios, W. Pedrycz and R. Swiniarski, Data Mining Methods for KnowledgeDiscovery,Kluwer, 1998,(Focus on Mathematical issues, e.g., rough sets)10. M. Kantardzic, Data Mining: Concepts, Models and Algorithms, IEEE-Wiley,2003 (Focus on Machine Learning)11. A. K. Pujari, Data Mining Techniques, Universities Press, 2001,(Data BasePerspective)12. R. Groth, Data Mining: A hands-on approach for business professionals,Prentice Hall, 1998 (Business user perspective including software CD)40Srihari

Premier Data Mining Conference41Srihari

Introduction to Data Mining 2. Nature of Data Sets 3. Types of Structure Models and Patterns 4. Data Mining Tasks (What?) 5. Components of Data Mining Algorithms(How?) 6. Statistics vs Data Mining 2 Srihari . Flood of Data 3

Related Documents:

DATA MINING - University of Rajshahi

Preface to the First Edition xv 1 DATA-MINING CONCEPTS 1 1.1 Introduction 1 1.2 Data-Mining Roots 4 1.3 Data-Mining Process 6 1.4 Large Data Sets 9 1.5 Data Warehouses for Data Mining 14 1.6 Business Aspects of Data Mining: Why a Data-Mining Project Fails 17 1.7 Organization of This Book 21 1.8 Review Questions and Problems 23

13 Views

1y ago

Data Mining in Bioinformatics - UQAM

DATA MINING What is data mining? [Fayyad 1996]: "Data mining is the application of specific algorithms for extracting patterns from data". [Han&Kamber 2006]: "data mining refers to extracting or mining knowledge from large amounts of data". [Zaki and Meira 2014]: "Data mining comprises the core algorithms that enable one to gain fundamental in

41 Views

2y ago

Multi Relational Data Mining Approaches: A Data Mining Technique

Data Mining and its Techniques, Classification of Data Mining Objective of MRD, MRDM approaches, Applications of MRDM Keywords Data Mining, Multi-Relational Data mining, Inductive logic programming, Selection graph, Tuple ID propagation 1. INTRODUCTION The main objective of the data mining techniques is to extract .

9 Views

7m ago

Principles of Green Data Mining

review the data mining process and develop a set of principles for green data mining. We conclude by discussing limitations and future work. 2. Methodology . We derived our principles by analyzing the CRISP-DM data mining process and literature on green IT and data mining. In a first st

16 Views

2y ago

Data Mining: Why Data Mining? - Leiden University

October 20, 2009 Data Mining: Concepts and Techniques 7 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization October 20, 2009 Data Mining: Concepts and Techniques 8 Why Not Traditional Data Analysis? Tremendous amount of data

41 Views

2y ago

Exploration and Mining in Canada

enable mining to leave behind only clean water, rehabilitated landscapes, and healthy ecosystems. Its objective is to improve the mining sector's environmental performance, promote innovation in mining, and position Canada's mining sector as the global leader in green mining technologies and practices. Source: Green Mining Initiative (2013).

22 Views

1y ago

Data Mining Algorithms - Stanford University

Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning Popularity predates popularity of machine learning "Data mining" often associated with specific data types and patterns We will focus on "market-basket" data Widely applicable (despite the name) And two types of data mining patterns

11 Views

1y ago

Welcome to Unauthorized Use of the GPC

Contractors may use a Government employee's Government Purchase Card as long as the Contracting Office authorizes its use and the contractor notifies the Contracting Officer of all purchases made by the contractor during the billing period. Unauthorized Use of the GPC Page 7 of 29 Knowledge Review Computer supplies from a GSA schedule. Food, drinks, clothing, lodging or travel related expenses .

53 Views

3y ago

Recent Views

Novell SUSE Linux Package Description and Support Level .

aspell-eo An Esperanto Dictionary for Aspell L2 aspell-es A Spanish Dictionary for ASpell L2 aspell-et An Estonian dictionary for aspell L2 aspell-fa A Persian dictionary for aspell L2 aspell-fi Finnish Dictionary Package L2 aspell-fo A Faroese Dictionary for ASpell L2 aspell-fr A French Dictionary for ASpell L2 aspell-ga An Irish Dictionary .

2y ago

348 Views

Dictionary of Aviation - THE AIRLINE PILOTS

Dictionary of Accounting 0 7475 6991 6 . Dictionary of Computing 0 7475 6622 4 Dictionary of Economics 0 7136 8203 5 Dictionary of Environment and Ecology 0 7475 7201 1 Dictionary of Food Science and Nutrition 0 7136 7784 8 Dictionary of Human Resources and Personnel Management 0 7136 8142 X

2y ago

162 Views

12 PUBLIC LAW AND PRIVATE LAW - Home: The National .

INTRODUCTION TO LAW MODULE - 3 Public Law and Private Law Classification of Law 164 Notes z define Criminal Law; z list the differences between Public and Private Law; and z discuss the role of Judges in shaping Law 12.1 MEANING AND NATURE OF PUBLIC LAW Public Law is that part of law, which governs relationship between the State

3y ago

745 Views

Dr. Ram Manohar Lohiya National Law University, Lucknow

2. Health and Medicine Law 3. Int. Commercial Arbitration 4. Law and Agriculture IXth SEMESTER 1. Consumer Protection Law 2. Law, Science and Technology 3. Women and Law 4. Land Law (UP) Xth SEMESTER 1. Real Estate Law 2. Law and Economics 3. Sports Law 4. Law and Education **Seminar Courses Xth SEMESTER (i) Law and Morality (ii) Legislative .

3y ago

496 Views

Oxford and the Dictionary - Oxford English Dictionary

What makes an Oxford Dictionary? People find dictionary-making fascinating. The 250th anniversary of Samuel Johnson’s Dictionary in 2005 was widely celebrated, and the recent BBC television series Balderdash and Piffle had a huge response to its call to viewers to help track down elusive word and phrase or

2y ago

210 Views

Cambridge Essential English Dictionary

These Dictionary Guide Worksheets are downloadable versions of the Guide to the Dictionary presented in the Cambridge Essential English Dictionary, Second Edition. The Guide is designed to help you develop skills in using a dictionary. The worksheets are grouped as five separate units, whi

2y ago

516 Views

The Interactive Arabic Dictionary: Another Collaboratively .

the Interactive Arabic Dictionary” [11], and “Conceptual Design of the Interactive Arabic Dictionary” [12], were the main studies used in HIAST to implement the Interactive dictionary. 2.1. Objectives IAD is a Monolingual dictionary (Arabic-Arabic), targeted to

2y ago

333 Views

Dictionary-guided Scene Text Recognition

A dictionary is an explicit language model, and the ben-eﬁts of a dictionary for scene text recognition are well es-tablished. In most previous works, a dictionary was used to ensure that the output sequence of characters is a legit-imate word from the dictionary, and it improved the accu-r

2y ago

313 Views

Going Online with a German Collocations Dictionary - unibas.ch

dictionary articles on two levels: a minimalistic view for the search and navigation stage and a more detailed view once a collocation is found. Keywords: online dictionary, collocations, dictionary design, learners' dictionary, German language . 1. Introduction Many dictionaries are available on the Web today. However, as yet there areno well-

7m ago

66 Views

A Fault Dictionary-Based Fault Diagnosis Approach for CMOS Analog .

Step 5: Fault dictionary construction: The fault dictionary is a collection of potential faulty and fault-free responses. The signatures obtained will be stored in the dictionary. This dictionary involves for each fault a correspondence between the faulty circuit responses and the defect sites.

4m ago

56 Views

On Entries for Neologisms in English-Chinese Learner's Dictionaries

A New English Chinese Dictionary of Journalism (2007) by Hu Zhiyong, An English -Chinese Dictionary of Neologisms (2009) by Li Mingyi, English-Chinese Neologism Dictionary (2013) by Wu Xuemei, A Dictionary of New Chinese Phrases in English (2015) by China Daily and A Chinese-English Dictionary of New Words and Expressions (2015) by Wu .

4m ago

63 Views

Companies Law - Cayman Islands dollar

Law 1 of 1971-15th December, 1970 Law 7 of 2000- 20th July, 2000 Law 7 of 1973-28th June, 1973 Law 5 of 2001-20th April, 2001 Law 24 of 1974-22nd November, 1974 Law 10 of 2001-25th May, 2001 Law 25 of 1975-9th December, 1975 Law 29 of 2001-26th September, 2001 Law 19 of 1977-10th November, 1977 Law 46 of 2001-14th January, 2002

3y ago

454 Views

It’s the Law!

ciples stated in Boyle’s Law, Charles’ Law, Gay-Lussac’s Law, Henry’s Law, and Dalton’s Law. Students will be able to explain the application of Boyle’s Law, Charles’ Law, Gay-Lussac’s Law, Henry’s Law, and Dalton’s Law to observations or events related to SCUBA diving. MateriaLs None audio/visuaL MateriaLs None teachinG tiMe

2y ago

378 Views

WHAT LAW IS ? An Introduction to Law

common law system civil law system!! sources of law in civil law !! a1. primary: statutes (written law) enacted by legislative power are the principal source of law. ! a2. two subsidiary sources of law: ! a2.1 administrative regulations a.2.2 customs!! ! sources of law in common law !!! b1. two primary sources of

2y ago

385 Views

Ross E. Davies, George Mason University School of Law

Jan 15, 2012 · 4. Bryan A. Garner, Preface to the First Pocket Edition of BLACK‘S LAW DICTIONARY, reprinted in BLACK‘S LAW DICTIONARY vii (3d Pocket ed. 2006). Garner is the current editor-in-chief of Black‟s Law Dictionary and (even more surely than was Black in his own time) the most influential contemporary scho-lar of American legal language. 5.

2y ago

297 Views

Principles Of Data Mining

It looks like you're using an ad-blocker