Data Warehousing And Data Mining - DEI

2y ago
25 Views
2 Downloads
991.49 KB
62 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Genevieve Webb
Transcription

Data WarehousingandData MiningA.A. 04-05Datawarehousing & Datamining1

Outline1. Introduction and Terminology2. Data Warehousing3. Data Mining Association rules Sequential patterns Classification ClusteringA.A. 04-05Datawarehousing & Datamining2

Introduction and TerminologyEvolution of database technologyFile processing (60s)Relational DBMS (70s)Advanced data models e.g.,Object-Oriented,application-oriented (80s)Web-Based Repositories (90s)Data Warehousing andData Mining (90s)Global/Integrated Information Systems (2000s)A.A. 04-05Datawarehousing & Datamining3

Introduction and TerminologyMajor types of information systems within an dusersMANAGEMENT Decision Support Systems (DSS)LEVELManagement Information Sys. (MIS)SYSTEMSKNOWLEDGELEVELSYSTEMSOffice automation, Groupware, ContentDistribution systems, Workflowmangement systemsTRANSACTION Enterprise Resource Planning (ERP)PROCESSINGCustomer Relationship Management (CRM)SYSTEMSA.A. 04-05Datawarehousing & DataminingMany unsophisticatedusers4

Introduction and TerminologyTransaction processing systems: Support the operational level of the organization, possiblyintegrating needs of different functional areas (ERP); Perform and record the daily transactions necessary to theconduct of the business Execute simple read/update operations on traditionaldatabases, aiming at maximizing transaction throughput Their activity is described as:OLTP (On-Line Transaction Processing)A.A. 04-05Datawarehousing & Datamining5

Introduction and TerminologyKnowledge level systems: provide digital support for managingdocuments (office automation), user cooperation andcommunication (groupware), storing and retrievinginformation (content distribution), automation of businessprocedures (workflow management)Management level systems: support planning, controlling andsemi-structured decision making at management level byproviding reports and analyses of current and historical dataExecutive support systems: support unstructured decisionmaking at the strategic level of the organizationA.A. 04-05Datawarehousing & Datamining6

Introduction and TerminologyMulti-Tier architecture forManagement level andExecutive support systemsDecisionMakingData PresentationPRESENTATIONReporting/Visualization enginesData AnalysisOLAP, Data Mining enginesBUSINESSLOGICData Warehouses / Data MartsData SourcesDATATransactional DB, ERP, CRM, Legacy systemsA.A. 04-05Datawarehousing & Datamining7

Introduction and TerminologyOLAP (On-Line Analytical Processing): Reporting based on (multidimensional) data analysis Read-only access on repositories of moderate-large size(typically, data warehouses), aiming at maximizingresponse timeData Mining: Discovery of novel, implicit patterns from, possiblyheterogeneous, data sources Use a mix of sophisticated statistical and highperformance computing techniquesA.A. 04-05Datawarehousing & Datamining8

Outline1. Introduction and Terminology2. Data Warehousing3. Data Mining Association rules Sequential patterns Classification ClusteringA.A. 04-05Datawarehousing & Datamining9

Data WarehousingDATA WAREHOUSEDatabase with the following distinctive characteristics: Separate from operational databases Subject oriented: provides a simple, concise view on one ormore selected areas, in support of the decision process Constructed by integrating multiple, heterogeneous datasources Contains historical data: spans a much longer time horizonthan operational databases (Mostly) Read-Only access: periodic, infrequent updatesA.A. 04-05Datawarehousing & Datamining10

Data WarehousingTypes of Data Warehouses Enterprise Warehouse: covers all areas of interest for anorganization Data Mart: covers a subset of corporate-wide data that isof interest for a specific user group (e.g., marketing). Virtual Warehouse: offers a set of views constructed ondemand on operational databases. Some of the views couldbe materialized (precomputed)A.A. 04-05Datawarehousing & Datamining11

Data WarehousingMulti-Tier ArchitectureDBDBCleaningextractionData sourcesA.A. 04-05Data WarehouseServerAnalysisReportingData MiningData StorageOLAPengineFront-EndToolsDatawarehousing & Datamining12

Data WarehousingMultidimensional (logical) ModelData are organized around one or more FACT TABLEs. EachFact Table collects a set of omogeneous events (facts)characterized by dimensions and dependent attributesExample: Sales at a chain of qtr301500 P2S1St32qtr1009000 dimensionsA.A. 04-05dependent attributesDatawarehousing & Datamining13

Data WarehousingMultidimensional (logical) Model (cont’d)Each dimension can in turn consist of a number of attributes.In this case the value in the fact table is a foreign key referringto an appropriate dimension meAddressA.A. dsupplierStoreSalesFact tableDatawarehousing & DataminingNamedimension AddresstableManagerSTARSCHEMA14

Data WarehousingConceptual Star Schema (E-R 0,N)STORE(1,1)(0,N)SUPPLIERA.A. 04-05Datawarehousing & Datamining15

Data WarehousingOLAP Server ArchitecturesThey are classified based on the underlying storage layoutsROLAP (Relational OLAP): uses relational DBMS to storeand manage warehouse data (i.e., table-orientedorganization), and specific middleware to support OLAPqueries.MOLAP (Multidimensional OLAP): uses array-based datastructures and pre-computed aggregated data. It shows higherperformance than OLAP but may not scale well if not properlyimplementedHOLAP (Hybird OLAP): ROLAP approach for low-level rawdata, MOLAP approach for higher-level data (aggregations).A.A. 04-05Datawarehousing & Datamining16

Data WarehousingMOLAP ApproachTotal sales of PCs ineurope in the 4thquarter of the yearPCDVDCDTV1Qtr2Qtr3Qtr4QtrEuropeNorth AmericaMiddle EastRegionProductPeriodFar EastDATA CUBErepresenting aFact TableA.A. 04-05Datawarehousing & Datamining17

Data WarehousingOLAP Operations: SLICEFix values for one ormore dimensionsE.g. Product CDA.A. 04-05Datawarehousing & Datamining18

Data WarehousingOLAP Operations: DICEFix ranges for one ormore dimensionsE.g.Product CD or DVDPeriod 1Qrt or 2QrtRegion Europe orNorth AmericaA.A. 04-05Datawarehousing & Datamining19

Data WarehousingOLAP Operations: Roll-UpAggregate data by groupingalong one (or .: group quartersNorth AmericaMiddle EastFar EastRegionDrill-Down (Roll-Up)-1A.A. 04-05Datawarehousing & Datamining20

Data WarehousingCube Operator: summaries for each subset of th AmericaMiddle EastFar EastSUMYearly sales of PCsin the middle eastYearly sales ofelectronics in themiddle eastA.A. 04-05Datawarehousing & Datamining21

Data WarehousingCube Operator: it is equivalent to computing the followinglattice of cuboidsproductproduct, periodperiodregionproduct, regionperiod, regionproduct, period, regionFACT TABLEA.A. 04-05Datawarehousing & Datamining22

Data WarehousingCube Operator In SQL:SELECT Product, Period, Region, SUM(Total Sales)FROM FACT-TABLEGROUP BY Product, Period, RegionWITH CUBEA.A. 04-05Datawarehousing & Datamining23

Data WarehousingProductPeriodRegionTot. SalesCD1qtrEurope* *CD1qtrALL* ALL*ALL1qtrEurope*ALL *CDALLEurope* ALL *CDALLALL* *ALL1qtrALL* *ALLALLEurope*ALLALL *ALLALLALL*A.A. 04-05All combinations of Product,Period and RegionAll combinations of Productand PeriodAll combinations of ProductDatawarehousing & Datamining24

Data WarehousingRoll-up ( partial cube) Operator In SQL:SELECT Product, Period, Region, SUM(Total Sales)FROM FACT-TABLEGROUP BY Product, Period, RegionWITH ROLL-UPA.A. 04-05Datawarehousing & Datamining25

Data WarehousingProductPeriodRegionTot. SalesCD1qtrEurope* *CD1qtrALL* ALL*CDALLALL* ALLALL*ALLALLALL*All combinations of Product,Period and RegionAll combinations of Productand PeriodAll combinations of ProductReduces the complexity from exponential to linear in thenumber of dimensionsA.A. 04-05Datawarehousing & Datamining26

Data Warehousingit is equivalent to computing the following subset of the latticeof cuboidsproductproduct, periodperiodregionproduct, regionperiod, regionproduct, period, regionA.A. 04-05Datawarehousing & Datamining27

Outline1. Introduction and Terminology2. Data Warehousing3. Data Mining Association rules Sequential patterns Classification ClusteringA.A. 04-05Datawarehousing & Datamining28

Data MiningData Explosion: tremendous amount of data accumulated indigital repositories around the world (e.g., databases, datawarehouses, web, etc.)Production of digital data /Year: 3-5 Exabytes (1018 bytes) in 2002 30% increase per year (99-02)See:www.sims.berkeley.edu/how-much-infoWe are drowning in data, but starving for knowledgeA.A. 04-05Datawarehousing & Datamining29

Data MiningKnowledge Discovery in DatabasesKNOWLEDGEData MiningSelection &transformationData cleaning& integrationDBEvaluation &presentationTrainingdataData WarehouseDomainknowledgeDBInformation repositoriesA.A. 04-05Datawarehousing & Datamining30

Data MiningTypologies of input data: Unaggregated data (e.g., records, transactions) Aggregated data (e.g., summaries) Spatial, geographic data Data from time-series databases Text, video, audio, web dataA.A. 04-05Datawarehousing & Datamining31

Data MiningDATA MININGProcess of discovering interesting patterns orknowledge from a (typically) large amount of datastored either in databases, data warehouses, orother information repositoriesInteresting: non-trivial, implicit, previously unknown,potentially usefulAlternative names: knowledge discovery/extraction,information harvesting, business intelligenceIn fact, data mining is a step of the more general processof knowledge discovery in databases (KDD)A.A. 04-05Datawarehousing & Datamining32

Data MiningInterestingness measuresPurpose: filter irrelevant patterns to convey concise anduseful knowledge. Certain data mining tasks can producethousands or millions of patterns most of which areredundant, trivial, irrelevant.Objective measures: based on statistics and structure ofpatterns (e.g., frequency counts)Subjective measures: based on user’s belief about thedata. Patterns may become interesting if they confirm orcontradict a user’s hypothesis, depending on the context.Interestingness measures can employed both after and duringthe pattern discovery. In the latter case, they improve thesearch efficiencyA.A. 04-05Datawarehousing & Datamining33

Data MiningMultidisciplinarity of Data Data MiningMachinelearningA.A. 04-05DBTechnologyDatawarehousing & DataminingVisualization34

Data MiningData Mining ProblemsAssociation Rules: discovery of rules X Y (“objects thatsatisfy condition X are also likely to satisfy condition Y”). Theproblem first found application in market basket or transactiondata analysis, where “objects” are transactions and“conditions” are containment of certain itemsetsA.A. 04-05Datawarehousing & Datamining35

Association RulesStatement of the ProblemI Set of itemsD Set of transactions: tFor an itemset XD then tII,support(X) fraction or number of transactions containing XASSOCIATION RULE: X-YY, with YXIsupport support(X)confidence support(X)/support(X-Y)PROBLEM: find all association rules with supportsup and confidence than min confidenceA.A. 04-05Datawarehousing & Dataminingthan min36

Association RulesMarket Basket Analysis:Items products sold in a store or chain of storesTransactions customers’ shopping basketsRule X-Y Y customers who buy items in X-Y are likely tobuy items in YOf diapers and beer .Analysis of customers behaviour in a supermarket chain hasrevealed that males who on thursdays and saturdays buydiapers are likely to buy also beer . That’s why these two items are found close to eachother in most storesA.A. 04-05Datawarehousing & Datamining37

Association RulesApplications: Cross/Up-selling (especially in e-comm., e.g., Amazon)Cross-selling: push complemetary productsUp-selling: push similar products Catalog design Store layout (e.g., diapers and beer close to each other) Financial forecast Medical diagnosisA.A. 04-05Datawarehousing & Datamining38

Association RulesDef.: Frequent itemset itemset with supportmin supGeneral Strategy to discover all association rules:1.Find all frequent itemsets2.frequent itemset X, output all rules X-Y Y, with Y X,which satisfy the min confindence constraintObservation:Min sup and min confidence are objective measures ofinterestingness. Their proper setting, however, requiresuser’s domain knowledge. Low values may yield exponentially(in I ) many rules, high values may cut off interesting rulesA.A. 04-05Datawarehousing & Datamining39

Association RulesExampleTransaction-idItems boughtFrequent ItemsetsSupport10A, B, C{A}75%20A, C{B}50%30A, D{C}50%40B, E, F{A, C}50%Min. sup 50%Min. confidence 50%For rule {A}supportconfidenceFor rule {C}confidenceA.A. 04-05{C} : support({A,C}) 50% support({A,C})/support({A}) 66.6%{A} (same support as {A}{C}): support({A,C})/support({C}) 100%Datawarehousing & Datamining40

Association RulesDealing with Large OutputsObservation: depending on the values of min sup and onthe dataset, the number frequent itemsets can beexponentially large but may contain a lot of redundancyGoal: determine a subset of frequent itemsets ofconsiderably smaller size, which provides the sameinformation content, i.e., from which the complete set offrequent itemsets can be derived without further information.A.A. 04-05Datawarehousing & Datamining41

Association RulesNotion of ClosednessI (items), D (transactions), min sup (support threshold)Closed Frequent Itemsets {X I : supp(X) min sup & supp(Y) supp(X)I Y X} It’s a subset of all frequent itemsets For every frequent itemset X there exists a closedfrequent itemset Y X such that supp(Y) supp(X), i.e., Yand X occur in exactly the same transcations All frequent itemsets and their frequencies can be derivedfrom the closed frequent itemsets without furtherinformationA.A. 04-05Datawarehousing & Datamining42

Association RulesNotion of MaximalityMaximal Frequent Itemsets {X I : supp(X) min sup & supp(Y) min supI Y X} It’s a subset of the closed frequent itemsets, hence of allfrequent itemsets For every (closed) frequent itemset X there exists amaximal frequent itemset Y X All frequent itemsets can be derived from the maximalfrequent itemsets without further information, howevertheir frequencies must be determined from Dinformation lossA.A. 04-05Datawarehousing & Datamining43

Association RulesTid102030405060Example of closed andmaximal frequent itemsetsDATASETMin sup 3Closed FrequentItemsetsBB, A, D, FB, LB, GB, HA.A. 04-05MaximalSupportNOYESYESYESYES64433ItemsB, A, D, F, G, HB, LB, A, D, F, L, HB, L, G, HB, A, D, FB, A, D, F, L, GSupportingTransactionsall10, 30, 50, 6020, 30, 40, 6010, 40, 6010, 30, 40Datawarehousing & Datamining44

Association Rules(All frequent) VS (closed frequent) VS (maximal frequent)( ,4) closed & maximal closed but not maximalA.A. 04-05Datawarehousing & Datamining(B,4)45

Sequential PatternsData Mining ProblemsSequential Patterns: discovery of frequent subsequences ina collection of sequences (sequence database), eachrepresenting a set of events occurring at subsequent times.The ordering of the events in the subsequences is relevant.A.A. 04-05Datawarehousing & Datamining46

Sequential PatternsSequential PatternsPROBLEM: Given a set of sequences, find the complete setof frequent subsequencessequence databaseA sequence: (ef)(ab)(df)(c)(b) SIDsequence10 (a)(abc)(ac)(d)(cf) 20 (ad)(c)(bc)(ae) 30 (ef)(ab)(df)(c)(b) 40 (e)(g)(af)(c)(b)(c) An element may contain a set of items.Items within an element are unorderedand are listed alphabetically. (a)(bc)(d)(c) is a subsequence of (a)(abc)(ac)(d)(cf) (For min sup 2, (ab)(c) is a frequent subsequenceA.A. 04-05Datawarehousing & Datamining47

Sequential PatternsApplications: Marketing Natural disaster forecast Analysis of web log data DNA analysisA.A. 04-05Datawarehousing & Datamining48

ClassificationData Mining ProblemsClassification/Regression: discovery of a model or functionthat maps objects into predefined classes (classification) orinto suitable values (regression). The model/function iscomputed on a training set (supervised learning)A.A. 04-05Datawarehousing & Datamining49

ClassificationStatement of the ProblemTraining Set: T {t1, , tn} set of n examplesEach example ti characterized by m features (ti(A1), , ti(Am)) belongs to one of k classes (Ci : 1ik)GOALFrom the training data find a model to describe the classesaccurately and synthetically using the data’s features. Themodel will then be used to assign class labels to unknown(previously unseen) recordsA.A. 04-05Datawarehousing & Datamining50

ClassificationApplications: Classification of (potential) customers for: Credit approval,risk prediction, selective marketing Performance prediction based on selected indicators Medical diagnosis based on symptoms or reactions totherapyA.A. 04-05Datawarehousing & Datamining51

ClassificationClassification processBUILDTraining SetUnseen DatamodelTest DataVALIDATEPREDICTclass labelsA.A. 04-05Datawarehousing & Datamining52

ClassificationObservations: Features can be either categorical if belonging tounordered domains (e.g., Car Type), or continuous, ifbelonging to ordered domains (e.g., Age) The class could be regarded as an additional attribute ofthe examples, which we want to predict Classification vs Regression:Classification: builds models for categorical classesRegression: builds models for continuous classes Several types of models exist: decision trees, neuralnetworks, bayesan (statistical) classifiers.A.A. 04-05Datawarehousing & Datamining53

ClassificationClassification using decision treesDefinition: Decision Tree for a training set T Labeled tree Each internal node v represents a test on a feature. Theedges from v to its children are labeled with mutuallyexclusive results of the test Each leaf w represents the subset of examples of Twhose features values are consistent with the testresults found along the path from the root to w. Theleaf is labeled with the majority class of the examples itcontainsA.A. 04-05Datawarehousing & Datamining54

ClassificationExample:examples car insurance applicants, class insurance riskfeaturesclassAge 25AgeCar lylow32trucklow20familyhighA.A. 04-05Car type{sports}HighHighLowModel(decision tree)Datawarehousing & Datamining55

ClusteringData Mining ProblemsClustering: grouping objects into classes with theobjective of maximizing intra-class similarity andminimizing inter-class similarity (unsupervised learning)A.A. 04-05Datawarehousing & Datamining56

ClusteringStatement of the ProblemGIVEN: N objects, each characterized by p attributes(a.k.a. variables)GROUP: the objects into K clusters featuring High intra-cluster similarity Low inter-cluster similarityRemark: Clustering is an instance of unsupervised learning orlearning by observations, as opposed to supervised learningor learning by examples (classification)A.A. 04-05Datawarehousing & Datamining57

ClusteringExampleattribute Youtlierattribute Yobjectattribute XA.A. 04-05Datawarehousing & Dataminingattribute X58

ClusteringSeveral types of clustering problems exist dependingon the specific input/output requirements, and on thenotion of similarity: The number of clusters K may be provided in input or not As output, for each cluster one may want a representativeobject, or a set of aggregate measurements, or thecomplete set of objects belonging to the cluster Distance-based clustering: similarity of objects is related tosome kind of geometric distance Conceptual clustering: a group of objects forms a cluster ifthey define a certain conceptA.A. 04-05Datawarehousing & Datamining59

ClusteringApplications: Marketing: identify groups of customers based on theirpurchasing patterns Biology: categorize genes with similar functionalities Image processing: clustering of pixels Web: clustering of documents in meta-search engines GIS: identification of areas of similar land useA.A. 04-05Datawarehousing & Datamining60

ClusteringChallenges: Scalability: strategies to deal with very large datasets Variety of attribute types: defining a good notion ofsimilarity is hard in the presence of different types ofattributes and/or different scales of values Variety of cluster shapes: common distance measuresprovide only spherical clusters Noisy data: outliers may affect the quality of the clustering Sensitivity to input ordering: the ordering of the input datashould not affect the output (or its quality)A.A. 04-05Datawarehousing & Datamining61

ClusteringMain Distance-Based Clustering MethodsPartitioning Methods: create an initial partition of the objectsinto K clusters, and refine the clustering using iterativerelocation techniques. A cluster is represented either by themean value of its component objects or by a centrally locatedcomponent object (medoid)Hierarchical Methods: start with all objects belonging todistinct clusters and then successively merge the pair ofclosest clusters/objects into one single cluster (agglomerativeapproach); or start with all objects belonging to one clusterand successively split up a cluster into smaller ones (divisiveapproach)A.A. 04-05Datawarehousing & Datamining62

Data Warehousing Multidimensional (logical) Model (cont’d) Each dimension can in turn consist of a number of attributes. In this case the value in the fact table is a foreign key referring to an appropriate dimension table Address Name Code supplier Description Code product Address Mana

Related Documents:

Data Warehousing on AWS AWS Whitepaper Introduction Data Warehousing on AWS Publication date: January 15, 2021 (Document histor y and contributors (p. 23)) Enterprises across the globe want to migrate data warehousing to the cloud to improve performance and lower costs. This whitepaper discusses a modern approach to analytics and data warehousing

Data warehousing fundamentals for IT professionals / Paulraj Ponniah.—2nd ed. p. cm. Previous ed. published under title: Data warehousing fundamentals. Includes bibliographical references and index. ISBN 978-0-470-46207-2 (cloth) 1. Data warehousing. I. Ponniah, Paulraj. Data warehousing

Library of Congress Cataloging-in-Publication Data Encyclopedia of data warehousing and mining / John Wang, editor. -- 2nd ed. p. cm. Includes bibliographical references and index. Summary: "This set offers thorough examination of the issues of importance in the rapidly changing field of data warehousing and mining"--Provided by publisher.

Library of Congress Cataloging-in-Publication Data Encyclopedia of data warehousing and mining / John Wang, editor. -- 2nd ed. p. cm. Includes bibliographical references and index. Summary: "This set offers thorough examination of the issues of importance in the rapidly changing field of data warehousing and mining"--Provided by publisher.

Library of Congress Cataloging-in-Publication Data Encyclopedia of data warehousing and mining / John Wang, editor. -- 2nd ed. p. cm. Includes bibliographical references and index. Summary: "This set offers thorough examination of the issues of importance in the rapidly changing field of data warehousing and mining"--Provided by publisher.

BCS6L1 DATA WAREHOUSING AND DATA MINING LABORATORY L T P C Total Contact Hours - 30 0 0 3 2 Prerequisite -Data ware Housing and Data mining Lab Manual Designed by - Dept. of Computer Science and Engineering. OBJECTIVES Data mining is primarily used by the companies with a strong consumer focus. It enables these

COURSE NAME: DATA WAREHOUSING AND MINING LAB COURSE CODE: A70595 COURSE OBJECTIVES: 1. Learn how to build a data warehouse and query it (using open source tools like Pentaho Data Integration Tool, Pentaho Business Analytics). 2. Learn to perform data mining tasks using a data mining toolkit (such as open source WEKA). 3.

Preface to the First Edition xv 1 DATA-MINING CONCEPTS 1 1.1 Introduction 1 1.2 Data-Mining Roots 4 1.3 Data-Mining Process 6 1.4 Large Data Sets 9 1.5 Data Warehouses for Data Mining 14 1.6 Business Aspects of Data Mining: Why a Data-Mining Project Fails 17 1.7 Organization of This Book 21 1.8 Review Questions and Problems 23