Principles Of Data Mining

4m ago
745.83 KB
41 Pages
Last View : 1m ago
Last Download : n/a
Upload by : River Barajas

Principles of Data MiningInstructor: Sargur N. SrihariUniversity at BuffaloThe State University of New [email protected]

Introduction: Topics1. Introduction to Data Mining2. Nature of Data Sets3. Types of StructureModels and Patterns4. Data Mining Tasks (What?)5. Components of Data Mining Algorithms(How?)6. Statistics vs Data Mining2Srihari

Flood of DataNew York Times, January 11, 2010Video and Image Data“Unstructured”“Structured and Unstructured”(Text) Data3Srihari

Large Data Sets are Ubiquitous1. Due to advances in digital data acquisition and storagetechnologyBusiness Supermarket transactions Credit card usage records Telephone call details Government statisticsScientific Images of astronomical bodies Molecular databases Medical recordsInternational organizations produce more information in a week thanmany people could read in a lifetime2. Automatic data production leads to need for automaticdata consumption3. Large databases mean vast amounts of information44. Difficulty lies in accessing itSrihari

Data Mining as Discovery Data Mining is Science of extracting useful information fromlarge data sets or databases Also known as KDD Knowledge Discovery and Data Mining Knowledge Discovery in Databases5Srihari

KDD is a multidisciplinary fieldMachine LearningPattern ation6StatisticsArtificial IntelligenceExpert SystemsSrihari

Terminology for DataTraining SetStructured DataMachine LearningPattern RecognitionInformationRetrievalUnstructured izationArtificial IntelligenceExpert SystemsData PointsInstances7Srihari

Data Mining DefinitionAnalysis of (often large) Observational Data to findunsuspected relationships and Summarize data in novel waysthat are understandable and useful to data ownerUnsuspected Relationshipsnon-trivial, implicit, previously unknownEx of Trivial: Those who are pregnant are femaleRelationships and Summaryare in the form of Patterns and ModelsLinear Equations, Rules, Clusters, Graphs, Tree Structures, RecurrentPatterns in Time SeriesUsefulness:meaningful: lead to some advantage, usually economicAnalysis:Process of discovery (Extraction of knowledge)Automatic or Semi-automaticSrihari

Observational Data Observational Data Objective of data mining exercise plays no role indata collection strategy E.g., Data collected for Transactions in a Bank Experimental Data Collected in Response to Questionnaire Efficient strategies to Answer Specific Questions In this way it differs from much of statistics For this reason, data mining is referred to assecondary data analysis9Srihari

KDD Process Stages: Selecting Target DataPreprocessingTransforming themData Mining to Extract Patterns and RelationshipsInterpreting Assesses Structures KDD more complicated than initially thought 80% preparing data 20% mining data10Srihari

Seeking Relationships Finding accurate, convenient and usefulrepresentations of data involves these steps: Determining nature and structure of representation E.g., linear regression Deciding how to quantify and compare two differentrepresentation E.g., sum of squared errors Choosing an algorithmic process to optimize scorefunction E.g., gradient descent optimization Efficient Implementation using data managementSrihari

Example of Regression AnalysisEXAMPLE of Model1. Representation2. Score function3. Process to optimizescore4. Implementation:data management,efficiency121. Regression:y a bxPredictor variable x(income)Response variable y(credit card spending)2. Score: sum of squarederrors

Linear Regression Process:Extracting a Linear ModelLinear regression with one variableData of the form (xi, yi), i 1,.n samplesNeed to find a and b such that y a bxYData Representationyx138911114532XWhat is involved in calculating a and bSo thatthe line fits the points the best?13

Score: Sum of Squared ErrorsWhere yi is the response value obtained from the modelWe wish to minimize SSE14

Minimizing SSE for RegressionDifferentiating SSE with respect to a and b we haveSetting partial derivatives equal to zero and rearranging termsWhich we solve for a and b,the regression coefficients15

Regression CoefficientsTo calculate a and b we need to find the means of the x and y values.Then we calculate b as a function of the x and y values and the meansa as a function of the means and b16

Application to Datayx13891111453217meany 5meanx 6a 0.8, b 1.04Linear RegressionFor the data setOptimal regression line isy 0.8 1.04x10yx10

Multiple Regressionp predictor variablesyy(1)x1x1(1)x2 . xpn objectsX n x d 1 matrixWhere a column of 1’s are addedto incorporate a0 in modely(n)x1(n)Solution:18y is a column vector, a (ao,.,ap)e is a n by 1 vector containingresiduals

Implementation of RegressionSolution:Simple summaries of the data; sums, sums of squares andsums of products of X and Y are sufficientto compute estimates of a and bImplies single pass through the data will yield estimates19

2. Nature of Data Sets Structured Data set of measurements from an environment orprocess Simple case n objects with d measurements each: n x d matrix d columns are called variables, features, attributesor fields20

Structured Data and Data TypesUS Census Bureau DataPublic Use Microdata Sample data sets (PUMS)IDAgeSexQuantitative orical d100000Categorical OrdinalNoisy dataA guess?FemaleMarriedHS iedChild0PUMS 21Data has identifying information removed.Available in 5% and 1% sample sizes. 1% sample has 2.7 million records

Unstructured Data1. Structured Data Well-defined tables, attributes (columns), tuples (rows) UC Irvine data set2. Unstructured Data World wide web Documents and hyperlinks– HTML docs represent tree structure with text and attributesembedded at nodes– XML pages use metadata descriptions Text Documents Document viewed as sequence of words and punctuations– Mining Tasks»»»»Text categorizationClustering Similar DocumentsFinding documents that match a queryAutomatic Essay Scoring (AES)– Reuters collection is at lewis22

Representations of Text Documents Boolean Vector Document is a vector where each element is a bitrepresenting presence/absence of word A set of documents can be represented as matrix (d,w)– where document d and word w has value 1 or 0(sparse matrix) Vector Space Representation Each element has a value such as no. of occurrences or frequency A set of documents represented as a document-term matrix23

Vector Space ExampleDocument-Term odt6lineardij represents number of timesthat term appears in that document24

Mixed Data: Structured & Unstructured Medical Patient DataBlood Pressure at different times of dayImage data (x-ray or MRI)Specialistʼs comments (text)Hierarchy of relationships betweenpatients, doctors, hospitalsN x d data matrix is oversimplification of what occurs in practice25

Transaction DataList of store purchases: date, customer ID, list of items and pricesCan be transformedintobinary-valuedmatrix1IndividualsWeb transaction log -sequence of triples: (user id, web page, time)111111111111112611111111Web Page Visited11

3.Types of Structures: Modelsand Patterns Representations sought in data mining Global Model Local Pattern27Srihari

Models and Patterns Global Model Make a statement about any point in d-space E.g., assign a point to a cluster Even when some values are missing Simple model: Y aX c Functional model is linear Linear in variables rather than parameters Local Patterns Make a statement about restricted regions ofspace spanned by variables E.g.1: if X thresh1 then Prob (Y thresh2) p E.g.2: certain classes of transactions do not show peaksand troughs (bank discovers dead peopleʼs openaccounts)28

4. Data Mining Tasks (What?) Not so much a single technique Idea that there is more knowledge hidden in the datathan shows itself on the surface Any technique that helps to extract more out of datais useful Five major task types:1. Exploratory Data Analysis (Visualization)2. Descriptive Modeling (Density estimation, Clustering)Model3. Predictive Modeling (Classification and Regression)building4. Discovering Patterns and Rules (Association rules)5. Retrieval by Content (Retrieve items similar to pattern of interest)29Srihari

Exploratory Data Analysis Interactive and VisualPie Charts (angles represent size)Cox Comb Charts (radii represent size)Intricate spatial displays of users ofGoogle around the world30Srihari

Descriptive Modeling Describe all the data or a process forgenerating the data Probability Distribution using DensityEstimation Clustering and Segmentation Partitioning p-dimensional space into groups Similar people are put in same group31Srihari

Predictive Modeling Classification and Regression Market value of a stock, disease,brittleness of a weld Machine Learning Approaches A unique variable is the objective inprediction unlike in description.32Srihari

Discovering Patterns and Rules Detecting fraudulent behavior bydetermining data that differs significantlyfrom rest Finding combinations of transactionsthat occur frequently in transactionaldata bases Grocery items purchased together33Srihari

Retrieval by Content User has pattern of interest and wishesto find that pattern in database, Ex: Text Search Estimate the relative importance of web pagesusing a feature vector whose elements arederived from the Query-URL pair Image Search Search a large database of images by usingcontent descriptors such as color, texture,relative position34Srihari

Components of Data MiningAlgorithms (How?)Four basic components in each algorithm*1. Model or Pattern StructureDetermining underlying structure or functional form weseek from data2. Score FunctionJudging the quality of the fitted model3. Optimization and Search MethodSearching over different model and pattern structures4. Data Management StrategyHandling data access efficiently*IIlustrated in Regression example35

Statistics vs Data Mining Size of data set (large in data mining) Eyeballing not an option (terabytes of data) Entire dataset rather than a sample Many variables Curse of dimensionality Make predictions Small sample sizes can lead to spurious discovery: Superbowl winner conference correlates to stock market(up/down)

Searching Data Base vs Data MiningData Base:When you know exactly what you are looking for Query Tool: SQL (Structured Query Language) exampleTable called ToveKariAddressTimoteivn 10Borgvn 23Storgt 20 Query: SELECT LastName FROM PersonsLastNameCitySandnesSandnesStavangerresults inHansenSvendsonPettersenData Mining:37When you only vaguely know what you are looking forSrihari

Reference Textbooks1. Hand, David, Heikki Mannila, and Padhraic Smyth,Principles of Data Mining, MIT Press 2001.2. Bishop, Christopher, Pattern Recognition and MachineLearning, Springer 2006Approach:Fundamental principlesEmphasis on Theory and AlgorithmsMany other textbooks:Emphasize business applications, case studies38Srihari

Many Other Textbooks1.Han and Kamber, Data Mining Concepts and Techniques, MorganKaufmann, 2000 (Data Base Perspective)2.Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Toolsand Techniques with Java Implementations, Morgan Kaufmann, 2000.(Machine Learning Perspective)3.Adriaans, P., and D. Zantinge, Data Mining, Addison- Wesley,1998. (LaymanPerspective)4.Groth, R., Data Mining: A Hands-on Approach for Business Professionals,Prentice-Hall PTR,1997. (Business Perspective)5. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through PatternRecognition, Prentice-Hall PTR, 1998. (Pattern Recognition Perspective)6. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide,Morgan Kaufmann, 1998. (Statistical Perspective)39Srihari

More Data Mining Textbooks7. S.Chakrabarti, Mining the web, Morgan Kaufman, 2003 (Emphasis on webpagesand hyperlinks)8 T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning, Wiley,2003 (Focus on data quality)9. K. Cios, W. Pedrycz and R. Swiniarski, Data Mining Methods for KnowledgeDiscovery,Kluwer, 1998,(Focus on Mathematical issues, e.g., rough sets)10. M. Kantardzic, Data Mining: Concepts, Models and Algorithms, IEEE-Wiley,2003 (Focus on Machine Learning)11. A. K. Pujari, Data Mining Techniques, Universities Press, 2001,(Data BasePerspective)12. R. Groth, Data Mining: A hands-on approach for business professionals,Prentice Hall, 1998 (Business user perspective including software CD)40Srihari

Premier Data Mining Conference41Srihari

Introduction to Data Mining 2. Nature of Data Sets 3. Types of Structure Models and Patterns 4. Data Mining Tasks (What?) 5. Components of Data Mining Algorithms(How?) 6. Statistics vs Data Mining 2 Srihari . Flood of Data 3