New York University Computer Science Department Courant .


Data Mining
Session 1 – Main Theme
Introduction to Data Mining
Dr. Jean-Claude Franchitti
New York University
Computer Science Department
Courant Institute of Mathematical Sciences

Adapted from course textbook resources:
Data Mining: Concepts and Techniques (2nd Edition)
Jiawei Han and Micheline Kamber

Agenda
1  Instructor and Course Introduction
2  Introduction to Data Mining
3  Summary and Conclusion

Who am I? - Profile
  » 27 years of experience in the Information Technology industry, including thirteen years working for leading IT consulting firms such as Computer Sciences Corporation
  » PhD in Computer Science from University of Colorado at Boulder
  » Past CEO and CTO
  » Held senior management and technical leadership roles in many large IT Strategy and Modernization projects for Fortune 500 corporations in the insurance, banking, investment banking, pharmaceutical, retail, and information management industries
  » Contributed to several high-profile ARPA and NSF research projects
  » Played an active role as a member of the OMG, ODMG, and X3H2 standards committees and as a Professor of Computer Science, initially at Columbia and at New York University since 1997
  » Proven record of delivering business solutions on time and on budget
  » Original designer and developer of jcrew.com and the suite of products now known as IBM InfoSphere DataStage
  » Creator of the Enterprise Architecture Management Framework (EAMF) and main contributor to the creation of various maturity assessment methodologies
  » Developed partnerships between several companies and New York University to incubate new methodologies (e.g., EA maturity assessment methodology developed in Fall 2008), develop proof-of-concept software, recruit skilled graduates, and increase the companies’ visibility

How to reach me?
Cell: (212) 203-5004
Email: jcf@cs.nyu.edu
AIM, Y! IM, ICQ: jcf2 2003
MSN IM: jcf2 2003@yahoo.com
Skype: jcf2

What is the class about?
Course description and syllabus:
  » http://www.nyu.edu/classes/jcf/g22.3033-002/
  » 2/index.html
Textbooks:
  » Data Mining: Concepts and Techniques (2nd Edition)
    Jiawei Han, Micheline Kamber
    Morgan Kaufmann
    ISBN-10: 1-55860-901-6, ISBN-13: 978-1-55860-901-3, (2006)
  » Microsoft SQL Server 2008 Analysis Services Step by Step
    Scott Cameron
    Microsoft Press
    ISBN-10: 0-73562-620-0, ISBN-13: 978-0-73562-620-31 1st Edition (04/15/09)

Icons / Metaphors
Information
Common Realization
Knowledge/Competency Pattern
Governance
Alignment
Solution Approach

Introduction to Data Mining - Sub-Topics
Why Data Mining?
  » Data Mining: A Natural Evolution of Science and Technology
What Is Data Mining?
  » Data Mining: Essential in a Knowledge Discovery Process
  » Data Mining: A Confluence of Multiple Disciplines
A Multi-Dimensional View of Data Mining
  » Knowledge to Be Mined
  » Data to Be Mined
  » Technology Utilized
  » Applications Adapted
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
  » Generalization
  » Mining Frequent Patterns, Associations, and Correlations
  » Classification
  » Cluster Analysis
  » Outlier Analysis
Data Mining: On What Kinds of Data?
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Structure and Network Analysis
Evaluation of Knowledge
Applications of Data Mining
Major Challenges in Data Mining
A Brief History of Data Mining and Data Mining Society

Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
  » Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s, computational science
  » Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  » Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now, data science
  » The flood of data from new scientific instruments and simulations
  » The ability to economically store and manage petabytes of data online
  » The Internet and computing Grid that makes all these archives universally accessible
  » Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

Evolution of Database Technology (1/2)
1960s:
  » Data collection, database creation, IMS and network DBMS
1970s:
  » Relational data model, relational DBMS implementation
1980s:
  » RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  » Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
  » Data mining, data warehousing, multimedia databases, and Web databases
2000s:
  » Stream data management and mining
  » Data mining and its applications
  » Web technology (XML, data integration) and global information systems

Evolution of Database Technology (2/2)

Why Data Mining? (1/2)
The Explosive Growth of Data: from terabytes to petabytes
  » Data collection and data availability
      - Automated data collection tools, database systems, Web, computerized society
  » Major sources of abundant data
      - Business: Web, e-commerce, transactions, stocks
      - Science: remote sensing, bioinformatics, scientific simulation
      - Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

Why Data Mining? (2/2)
Associations (e.g., linking purchase of pizza with beer)
Sequences (e.g., tying events together: marriage and purchase of furniture)
Classifications (e.g., recognizing patterns such as the attributes of employees that are most likely to quit)
Forecasting (e.g., predicting buying habits of customers based on past patterns)
Expert systems or small ML/statistical programs

What Can Data Mining Do?
Classification
  » Classify credit applicants as low, medium, high risk
  » Classify insurance claims as normal, suspicious
Estimation
  » Estimate the probability of a direct mailing response
  » Estimate the lifetime value of a customer
Prediction
  » Predict which customers will leave within six months
  » Predict the size of the balance that will be transferred by a credit card prospect
Association
  » Find out items customers are likely to buy together
  » Find out what books to recommend to Amazon.com users
Clustering
  » Difference from classification: classes are unknown!

Sample Data Mining Algorithms
[Figure: taxonomy of data mining algorithms. Recoverable labels: Data Mining Algorithms; Online Analytical Processing; SQL; Query Tools; Discovery Driven Methods; Description; Prediction; Classification; ion Trees; Neural Networks; Sequential Analysis]

Why Data Mining? Potential Applications
Data analysis and decision support
  » Market analysis and management
      - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  » Risk analysis and management
      - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  » Fraud detection and detection of unusual patterns (outliers)
Other Applications
  » Text mining (news group, email, documents) and Web mining
  » Stream data mining
  » Bioinformatics and bio-data analysis

Example 1: Market Analysis and Management
Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  » Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  » Determine customer purchasing patterns over time
Direct marketing
  » Identify which prospects should be included in a mailing list
Market segmentation
  » Identify common characteristics of customers who buy the same products
Market basket analysis
  » Identify what products are likely to be bought together
Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
Customer profiling: what types of customers buy what products (clustering or classification)
Customer requirement analysis
  » Identify the best products for different groups of customers
  » Predict what factors will attract new customers
Provision of summary information
  » Multidimensional summary reports
  » Statistical summary information (data central tendency and variation)

Sample Market Basket Analysis
Association and sequence discovery
Principal concepts
  » Support or Prevalence: frequency with which a particular association appears in the database
  » Confidence: conditional predictability of B, given A
Example:
  » Total daily transactions: 1,000
  » Number which include “soda”: 500
  » Number which include “orange juice”: 800
  » Number which include “soda” and “orange juice”: 450
  » SUPPORT for “soda and orange juice”: 45% (450/1,000)
  » CONFIDENCE of “soda → orange juice”: 90% (450/500)
  » CONFIDENCE of “orange juice → soda”: 56% (450/800)
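The support and confidence figures on the market basket slide can be reproduced with a short Python sketch. The function names and the synthetic transaction list are illustrative assumptions. Only the four counts (1,000 / 500 / 800 / 450) come from the slide.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from transaction counts."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        raise ValueError("antecedent never occurs in the transactions")
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a


def build_slide_example():
    """Synthetic transactions matching the slide's counts (assumed layout):
    450 with both items, 50 soda only, 350 orange juice only, 150 neither."""
    return (
        [frozenset({"soda", "orange juice"})] * 450
        + [frozenset({"soda"})] * 50
        + [frozenset({"orange juice"})] * 350
        + [frozenset()] * 150
    )


if __name__ == "__main__":
    tx = build_slide_example()
    print(f"support(soda, orange juice)      = {support(tx, {'soda', 'orange juice'}):.0%}")
    print(f"confidence(soda -> orange juice) = {confidence(tx, {'soda'}, {'orange juice'}):.0%}")
    print(f"confidence(orange juice -> soda) = {confidence(tx, {'orange juice'}, {'soda'}):.0%}")
```

Support is symmetric in the two items, but confidence is not, because the denominator changes with the direction of the rule. That is why the two confidence values on the slide differ.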

Example 2: Corporate Analysis & Risk Management
Finance planning and asset evaluation
  » Cash flow analysis and prediction
  » Contingent claim analysis to evaluate assets
  » Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
  » Summarize and compare the resources and spending
Competition
  » Monitor competitors and market directions
  » Group customers into classes and a class-based pricing procedure
  » Set pricing strategy in a highly competitive market

Example 3: Fraud Detection & Mining Unusual Patterns
Approaches: clustering & model construction for frauds, outlier analysis
Applications: health care, retail, credit card service, telecomm.
  » Auto insurance: ring of collisions, insurance claims analysis
      - Discover patterns of fraudulent transactions
      - Compare current transactions against those patterns
  » Money laundering: suspicious monetary transactions
  » Medical insurance
      - Professional patients, ring of doctors, and ring of references
      - Unnecessary or correlated screening tests
  » Telecommunications: phone-call fraud
      - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  » Retail industry
      - Analysts estimate that 38% of retail shrink is due to dishonest employees
  » Anti-terrorism
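The telecommunications bullet ("analyze patterns that deviate from an expected norm") can be illustrated with a minimal outlier check in Python. This is a sketch under strong simplifying assumptions: the "phone call model" is reduced to call duration alone, the norm is a mean and standard deviation learned from historical calls, and the z-score threshold of 3 is an arbitrary illustrative choice. None of these details come from the slides.

```python
import statistics


def flag_outliers(history, current, threshold=3.0):
    """Return the values in current whose z-score, measured against the
    mean and sample standard deviation of history, exceeds threshold."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)  # needs at least two historical values
    if std == 0:
        # Degenerate norm: every historical value is identical.
        return [x for x in current if x != mean]
    return [x for x in current if abs(x - mean) / std > threshold]


if __name__ == "__main__":
    # Hypothetical call durations in minutes for one subscriber.
    past_calls = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]
    new_calls = [5, 7, 45]
    print(flag_outliers(past_calls, new_calls))
```

A realistic model would also use the other attributes the slide lists (destination, time of day or week), typically through clustering or a learned classifier rather than a single-variable z-score.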

Other Applications (1/2)
Sports
  » IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
  » JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
  » IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Example: Amazon.com book recommendations
Example: identify books to recommend to customers
Company keeps log of past customer purchases
Represent each customer as a vector whose components are the past purchases
Define a “distance” function for comparing customers
Based on this distance function, identify the customer’s nearest neighbor set (NNS)
Identify books that have been purchased by a large percentage of the nearest neighbor set but not by the customer
Recommend these books to the customer as possible next purchases
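The recommendation steps above can be sketched in a few lines of Python. The slide does not say which distance function is used, so Jaccard distance over purchase sets is an assumption here, as are the neighbor count `k`, the `min_share` cutoff for "a large percentage", and the customer names and book labels.

```python
from collections import Counter


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty purchase sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def recommend(purchases, customer, k=3, min_share=0.5):
    """Recommend books bought by at least min_share of the customer's
    k nearest neighbors and not yet bought by the customer."""
    own = purchases[customer]
    others = [c for c in purchases if c != customer]
    # Nearest neighbor set: k closest customers (ties broken by name).
    neighbors = sorted(
        others, key=lambda c: (jaccard_distance(own, purchases[c]), c)
    )[:k]
    if not neighbors:
        return []
    counts = Counter(book for c in neighbors for book in purchases[c] - own)
    return sorted(
        book for book, n in counts.items() if n / len(neighbors) >= min_share
    )


if __name__ == "__main__":
    # Hypothetical purchase log: customer -> set of purchased books.
    log = {
        "ann": {"A", "B", "C"},
        "bob": {"A", "B", "D"},
        "cat": {"A", "C", "D"},
        "dan": {"X", "Y"},
        "eve": {"B", "C", "D", "E"},
    }
    print(recommend(log, "ann"))
```

Representing each customer as a set is equivalent to a binary purchase vector. A different distance (for example cosine distance over purchase counts) could be swapped in without changing the rest of the procedure.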

What Is Data Mining?
Data mining (knowledge discovery from data)
  » Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  » Data mining: a misnomer?
Alternative names
  » Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
  » Simple search and query processing
  » (Deductive) expert systems

Knowledge Discovery (KDD) Process
  » Data mining: core of the knowledge discovery process
[Figure: KDD process. Recoverable labels: Databases; Data Cleansing; Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation and Presentation]

Example: A Web Mining Framework (1/2)
Web mining usually involves
  » Data cleaning
  » Data integration from multiple sources
  » Warehousing the data
  » Data cube construction
  » Data selection for data mining
  » Data mining
  » Presentation of the mining results
  » Patterns and knowledge to be used or stored into a knowledge base

Example: A Web Mining Framework (2/2)

KDD Process: Several Key Steps
Learning the application domain
  » Relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation
  » Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
  » Summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
  » Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: not all of them are interesting
  » Suggested approach: human-centered, query-based, focused mining
Interestingness measures
  » A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
  » Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  » Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
Find all the interesting patterns: completeness
  » Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  » Heuristic vs. exhaustive search
  » Association vs. classification vs. clustering
Search for only interesting patterns: an optimization problem
  » Can a data mining system find only the interesting patterns?
  » Approaches
      - First generate all the patterns and then filter out the uninteresting ones
      - Generate only the interesting patterns: mining query optimization
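The first approach listed above (generate all the patterns, then filter out the uninteresting ones) can be sketched with objective measures only. The Python example below is an illustrative simplification: it enumerates every single-item rule A → B exhaustively and keeps those that meet support and confidence thresholds. The function name, thresholds, and sample transactions are assumptions, not part of the course material.

```python
from itertools import permutations


def mine_rules(transactions, min_support, min_confidence):
    """Enumerate every rule a -> b between distinct single items, then keep
    the rules whose support and confidence meet the given thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    rules = []
    for a, b in permutations(items, 2):  # exhaustive generation step
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        sup = n_ab / n
        conf = n_ab / n_a  # n_a >= 1 because a occurs in some transaction
        if sup >= min_support and conf >= min_confidence:  # filtering step
            rules.append((a, b, sup, conf))
    return rules


if __name__ == "__main__":
    # Hypothetical market baskets.
    baskets = [
        {"pizza", "beer"},
        {"pizza", "beer"},
        {"pizza", "beer", "soda"},
        {"pizza"},
        {"soda"},
    ]
    for a, b, sup, conf in mine_rules(baskets, min_support=0.5, min_confidence=0.7):
        print(f"{a} -> {b}: support={sup:.2f}, confidence={conf:.2f}")
```

Exhaustive generation grows quickly with the number of items, which is the motivation for the second approach on the slide: pushing the thresholds into the search so that uninteresting candidates are never generated.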

Other Pattern Mining Issues
Precise patterns vs. approximate patterns
  » Association and correlation mining: possible to find sets of precise patterns
      - But approximate patterns can be more compact and sufficient
      - How to find high quality approximate patterns?
  » Gene sequence mining: approximate patterns are inherent
      - How to derive efficient approximate pattern mining algorithms?
Constrained vs. non-constrained patterns
  » Why constraint-based mining?
  » What are the possible kinds of constraints? How to push constraints into the mining process?

Data Mining and Business Intelligence
[Figure: layered diagram; potential to support business decisions increases toward the top]
Layers, top to bottom:
  » Decision Making
  » Data Presentation: Visualization Techniques
  » Data Mining: Information Discovery
  » Data Exploration: Statistical Summary, Querying, and Reporting
  » Data Preprocessing/Integration, Data Warehouses
  » Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA

Example: Mining vs. Data Exploration
Business intelligence view
  » Warehouse, data cube, reporting but not much mining
Business objects vs. data mining tools
Supply chain example: tools
Data presentation
Exploration

KDD Process: A Typical View from ML and Statistics
[Figure: processing pipeline. Recoverable labels:]
  » Input Data
  » Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
  » Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
  » on, selection, interpretation, visualization
This is a view from typical machine learning and statistics communities

Example: Medical Data Mining
Health care and medical data mining often adopt this view from statistics and machine learning
Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation

Data Mining: Confluence of Multiple Disciplines
[Figure: disciplines surrounding Data Mining. Recoverable labels: Pattern Recognition; Statistics; Algorithm; Visualization; Other Disciplines]

Multi-Dimensional View of Data Mining
Data to be mined
  » Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
  » Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  » Multiple/integrated functions and mining at multiple levels
Techniques utilized
  » Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
  » Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Why Confluence of Multiple Disciplines?
Tremendous amount of data
  » Algorithms must be highly scalable to handle volumes such as terabytes of data
High-dimensionality of data
  » Micro-array may have tens of thousands of dimensions
High complexity of data
  » Data streams and sensor data
  » Time-series data, temporal data, sequence data
  » Structure data, graphs, social networks and multi-linked data
  » Heterogeneous databases and legacy databases
  » Spatial, spatiotemporal, multimedia, text and Web data
  » Software programs, scientific simulations
New and sophisticated applications
