ETL Overview: Extract, Transform, Load (ETL)


ETL Overview
- General ETL issues
  - The ETL/DW refreshment process
  - Building dimensions
  - Building fact tables
  - Extract
  - Transformations/cleansing
  - Load
- MS Integration Services

(Original slides by Torben Bach Pedersen, Aalborg University 2007 - DWML course.)

The ETL Process
- The most underestimated process in DW development
- The most time-consuming process in DW development: 80% of development time is spent on ETL!
- Extract: extract relevant data
- Transform: transform data to DW format; build keys, etc.; cleansing of data
- Load: load data into the DW; build aggregates, etc.
- (The slide groups these steps into an integration phase and a preparation phase.)

Refreshment Workflow
(Diagram slide showing the DW refreshment workflow.)

ETL In The Architecture
- Data Staging Area (DSA): transit storage for data in the ETL process
  - Transformations/cleansing are done here
- No user queries against the DSA
- Sequential operations on large data volumes
  - Performed by central ETL logic
  - No need for locking, logging, etc.
  - RDBMS or flat files? (DBMSs have become better at this)
- Finished dimensions are copied from the DSA to the relevant marts
- Allows centralized backup/recovery
  - Often too time-consuming to initially load all data marts after a failure
  - Backup/recovery facilities are needed
  - Better to do this centrally in the DSA than in all data marts

(Architecture diagram: operational source systems feed the ETL side (extract, transform, load) via the Data Staging Area; the presentation servers hold the data warehouse bus with conformed dimensions and facts plus data marts with atomic and aggregate-only data; the query side provides warehouse browsing, access and security, query management, standard reporting, activity monitoring, reporting tools, desktop data access tools, and data mining.)

ETL Construction Process
Plan:
1) Make a high-level diagram of the source-destination flow
2) Test, choose and implement an ETL tool
3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions:
4) Construct and test the build of a static dimension
5) Construct and test change mechanisms for one dimension
6) Construct and test the remaining dimension builds
Construction of fact tables and automation:
7) Construct and test the initial fact table build
8) Construct and test the incremental update
9) Construct and test the aggregate build (you do this later)
10) Design, construct, and test the ETL automation

Building Dimensions
- Static dimension table
  - DW key assignment: production keys are mapped to DW keys using a table
  - Combination of data sources: find a common key?
  - Check one-one and one-many relationships using sorting
- Handling dimension changes
  - Described in the last lecture
  - Find the newest DW key for a given production key (see the SQL sketch below)
  - The table mapping production keys to DW keys must be updated
- Load of dimensions
  - Small dimensions: replace
  - Large dimensions: load only changes
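The key-mapping idea can be sketched in T-SQL. This is a minimal sketch; the table and column names (keymap_customer, prod_key, dw_key, valid_from) and the variable @prod_key are illustrative assumptions, not from the slides.

```sql
-- Hypothetical mapping table from production keys to DW surrogate keys.
CREATE TABLE keymap_customer (
    prod_key   INT      NOT NULL,  -- key in the source (production) system
    dw_key     INT      NOT NULL,  -- surrogate key used in the DW
    valid_from DATETIME NOT NULL   -- when this mapping became current
);

-- Find the newest DW key for a given production key (@prod_key)
SELECT TOP 1 dw_key
FROM keymap_customer
WHERE prod_key = @prod_key
ORDER BY valid_from DESC;
```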

Building Fact Tables
- Two types of load
- Initial load
  - ETL for all data up till now
  - Done when the DW is started for the first time
  - Very heavy - large data volumes
- Incremental update
  - Move only changes since the last load
  - Done periodically (e.g., monthly or weekly) after DW start
  - Less heavy - smaller data volumes
- Dimensions must be updated before facts
  - The relevant dimension rows for new facts must be in place
  - Special key considerations if the initial load must be performed again

Types of Data Sources
- Non-cooperative sources
  - Snapshot sources - provide only a full copy of the source, e.g., files
  - Specific sources - each is different, e.g., legacy systems
  - Logged sources - write a change log, e.g., the DB log
  - Queryable sources - provide a query interface, e.g., an RDBMS
- Cooperative sources
  - Replicated sources - publish/subscribe mechanism
  - Call back sources - call external code (ETL) when changes occur
  - Internal action sources - only internal actions when changes occur (DB triggers are an example)
- The extract strategy depends on the source types

Extract
- Goal: fast extract of relevant data
  - Extract from the source systems can take a long time: days/weeks
  - Drain on the operational systems and on the DW systems
- Extract/ETL only the changes since the last load (the delta)
  - Too time-consuming to ETL all data at each load
- Types of extracts:
  - Extract applications (SQL): co-existence with other applications; the only solution in some scenarios
  - DB unload tools: faster than SQL-based extracts

Computing Deltas
- Delta = changes since the last load (both approaches below are sketched in SQL afterwards)
- Store sorted total extracts in the DSA
  - The delta can easily be computed from the current and the last extract
  - Always possible; handles deletions
  - Minus: high extraction time
- Put an update timestamp on all rows (in the sources)
  - Updated by a DB trigger
  - Extract only rows where "timestamp > time of last extract"
  - Reduces extract time
  - Minus: cannot (alone) handle deletions
  - Minus: the source system must be changed; operational overhead
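A sketch of the first approach, comparing the current total extract with the previous one stored in the DSA, using standard set operations (supported by SQL Server 2005). The table names extract_current and extract_previous are illustrative assumptions:

```sql
-- New or changed rows: present in the current extract, not in the previous one
SELECT * FROM extract_current
EXCEPT
SELECT * FROM extract_previous;

-- Deleted rows: production keys present before but missing now
SELECT prod_key FROM extract_previous
EXCEPT
SELECT prod_key FROM extract_current;
```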

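The timestamp-based approach reduces to a simple filtered extract. A minimal sketch, assuming a source table src_customer with a trigger-maintained last_modified column (both names are illustrative):

```sql
-- Extract only rows changed since the last load
SELECT *
FROM src_customer
WHERE last_modified > @time_of_last_extract;
-- Note: rows deleted in the source never appear in this result,
-- so deletions must be detected by another mechanism.
```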
Changed Data Capture
- Messages
  - Applications insert messages into a "queue" at updates
  - Works for all types of updates and systems
  - Minus: operational applications must be changed; operational overhead
- DB triggers (sketched below)
  - Triggers execute actions on INSERT/UPDATE/DELETE
  - Operational applications need not be changed
  - Enables real-time update of the DW
  - Minus: operational overhead
- Replication based on the DB log
  - Find changes directly in the DB log, which is written anyway
  - Operational applications need not be changed; no operational overhead
  - Minus: not possible in some DBMSs

Common Transformations
- Data type conversions
  - EBCDIC to ASCII/Unicode
  - String manipulations
  - Date/time format conversions
- Normalization/denormalization
  - To the desired DW format
  - Depending on the source format
- Building keys
  - A table matches production keys to surrogate DW keys
  - Correct handling of history - especially for a total reload

Data Quality
- Data almost never has decent quality
- Data in the DW must be:
  - Precise: DW data must match known numbers - or an explanation is needed
  - Complete: the DW has all relevant data, and the users know it
  - Consistent: no contradictory data; aggregates fit with the detail data
  - Unique: the same thing is called the same and has the same key (e.g., customers)
  - Timely: data is updated "frequently enough" and the users know when

Cleansing
- BI does not work on "raw" data
  - Pre-processing is necessary for BI analysis
- Handle inconsistent data formats
  - Spellings, codings, ...
- Remove unnecessary attributes
  - Production keys, comments, ...
- Replace codes with text (why?)
  - City name instead of ZIP code, ...
- Combine data from multiple sources with a common key
  - E.g., customer data from customer address, customer name, ...
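A minimal sketch of trigger-based changed data capture in T-SQL. The table and trigger names are illustrative assumptions, and a real setup would also capture the changed column values:

```sql
-- Hypothetical change log filled by a trigger on the source table
CREATE TABLE customer_changes (
    prod_key    INT      NOT NULL,
    change_type CHAR(1)  NOT NULL,              -- 'I', 'U', or 'D'
    changed_at  DATETIME NOT NULL DEFAULT GETDATE()
);
GO
CREATE TRIGGER trg_customer_cdc
ON src_customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- New and updated rows appear in the "inserted" pseudo-table;
    -- a row that also appears in "deleted" was updated, not inserted.
    INSERT INTO customer_changes (prod_key, change_type)
    SELECT i.prod_key,
           CASE WHEN EXISTS (SELECT 1 FROM deleted d
                             WHERE d.prod_key = i.prod_key)
                THEN 'U' ELSE 'I' END
    FROM inserted i;

    -- Purely deleted rows appear only in the "deleted" pseudo-table.
    INSERT INTO customer_changes (prod_key, change_type)
    SELECT d.prod_key, 'D'
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted i
                      WHERE i.prod_key = d.prod_key);
END;
```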

Types Of Cleansing
- Conversion and normalization
  - Text coding, date formats, etc.
  - The most common type of cleansing
- Special-purpose cleansing
  - Normalize spellings of names, addresses, etc.
  - Remove duplicates, e.g., duplicate customers
- Domain-independent cleansing
  - Approximate, "fuzzy" joins on records from different sources
- Rule-based cleansing
  - User-specified rules, if-then style
  - Automatic rules: use data mining to find patterns in the data
  - E.g., guess a missing sales person based on customer and item

Cleansing (continued)
- Mark facts with a Data Status dimension
  - Normal, abnormal, outside bounds, impossible, ...
  - Facts can be taken in/out of analyses
- Mark facts with a changed status
  - New customer, customer about to cancel contract, ...
- Uniform treatment of NULL
  - Use an explicit NULL value rather than a "special" value (0, -1, ...)
  - Use NULLs only for measure values (or estimates instead?)
  - Use special dimension keys for NULL dimension values (sketched below)
  - Avoids problems in joins, since NULL is not equal to NULL

Improving Data Quality
- Appoint a "data quality administrator"
  - Responsibility for data quality
  - Includes manual inspections and corrections!
- Source-controlled improvements
  - The optimal solution?
- Construct programs that check data quality
  - Are totals as expected?
  - Do results agree with an alternative source?
  - Number of NULL values?
- Do not fix all data quality problems
  - Allow management to see "weird" data in their reports?
  - Such data may be meaningful to them (e.g., fraud detection)

Load
- Goal: fast loading into the DW
  - Loading deltas is much faster than a total load
  - SQL-based update is slow: large overhead (optimization, locking, etc.) for every SQL call
  - DB load tools are much faster
- Indexes on tables slow the load down a lot
  - Drop the indexes and rebuild them after the load
  - Can be done per index partition
- Parallelization
  - Dimensions can be loaded concurrently
  - Fact tables can be loaded concurrently
  - Partitions can be loaded concurrently
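The "special dimension keys for NULL dimension values" advice can be sketched as follows. The names dim_customer and raw_sales and the reserved key 0 are illustrative assumptions:

```sql
-- Reserve a special "unknown" member in the dimension
INSERT INTO dim_customer (customer_key, prod_key, customer_name)
VALUES (0, NULL, 'Unknown');

-- When loading facts, map unmatched or missing customers to key 0
-- instead of storing NULL in the fact table's dimension column
SELECT COALESCE(d.customer_key, 0) AS customer_key,
       s.sale_amount
FROM raw_sales s
LEFT JOIN dim_customer d
       ON d.prod_key = s.customer_prod_key;
```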

Load (continued)
- Relationships in the data
  - Referential integrity and data consistency must be ensured (why?)
  - Can be done by the loader
- Aggregates
  - Can be built and loaded at the same time as the detail data
- Load tuning (see the bulk-load sketch below)
  - Load without log
  - Sort the load file first
  - Make only simple transformations in the loader
  - Use loader facilities for building aggregates
- Should the DW be on-line 24*7?
  - Use partitions or several sets of tables (like MS Analysis)

ETL Tools
- ETL tools from the big vendors
  - Oracle Warehouse Builder
  - IBM DB2 Warehouse Manager
  - Microsoft Integration Services
- Offer much functionality at a reasonable price
  - Data modeling
  - ETL code generation
  - Scheduling of DW jobs
- The "best" tool does not exist
  - Choose based on your own needs
  - Check first if the "standard tools" from the big vendors are OK

Issues
- Pipes
  - Redirect the output of one process to the input of another process
  - ls | grep 'a' | sort -r
- Files versus streams/pipes
  - Streams/pipes: no disk overhead, fast throughput
  - Files: easier restart, often the only possibility
- ETL tool or not
  - Code: easy start, co-existence with the IT infrastructure
  - Tool: better productivity on subsequent projects
- Load frequency
  - ETL time depends on the data volumes
  - A daily load is much faster than a monthly one
  - Applies to all steps in the ETL process

MS Integration Services
- A concrete ETL tool
- Example ETL flow
- Demo
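The load-tuning bullets can be illustrated with SQL Server's bulk loading. A minimal sketch under assumed names (fact_sales, ix_sales_date, and the file path are illustrative):

```sql
-- Drop the index before the load; rebuilding it afterwards is much
-- cheaper than maintaining it row by row during the load
DROP INDEX ix_sales_date ON fact_sales;

-- Bulk load the (pre-sorted) delta file; TABLOCK takes one table lock
-- instead of row locks, which speeds up the load considerably
BULK INSERT fact_sales
FROM 'C:\etl\sales_delta.txt'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', TABLOCK);

-- Rebuild the index after the load
CREATE INDEX ix_sales_date ON fact_sales (date_key);
```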

Integration Services (IS)
- Microsoft's ETL tool
  - Part of SQL Server 2005
- Tools
  - Import/export wizard - simple transformations
  - BI Development Studio - advanced development
- Functionality available in several ways
  - Through the GUI - basic functionality
  - Programming - advanced functionality

Packages
- The central concept in IS
- A package holds:
  - Sources, connections
  - Control flow
  - Tasks, workflows
  - Transformations
  - Destinations
- Arrows: green (success), red (failure)

Package Control Flow
- "Containers" provide
  - Structure to packages
  - Services to tasks
- Container types:
  - Task host container - provides services to a single task
  - Sequence container - groups tasks and containers into control flows that are subsets of the package control flow
  - For loop container - repeats tasks by testing a condition
  - Foreach loop container - repeats tasks by using an enumerator

Tasks
- Data Flow - runs data flows
- Data Preparation tasks
  - File System - operations on files
  - FTP - upload/download data
- Workflow tasks
  - Execute Package - executes other IS packages; good for structure!
  - Execute Process - runs an external application/batch file
- SQL Server tasks
  - Bulk Insert - fast load of data
  - Execute SQL - executes any SQL query
- Scripting tasks
  - Script - executes VB .NET code
- Analysis Services tasks
  - Analysis Services Processing - processes dimensions, cubes, models
  - Analysis Services Execute DDL - create/drop/alter cubes, models
- Maintenance tasks - DB maintenance

Event Handlers
- Executables (packages, containers) can raise events
- Event handlers manage the events
- Similar to event handlers in languages such as Java and C#

Data Flow Elements
- Sources
  - Make external data available
  - All ODBC/OLE DB data sources: RDBMS, Excel, text files, ...
- Transformations
  - Update, summarize, cleanse, merge, distribute
- Destinations
  - Write data to a specific store
  - Create an in-memory data set
- Each element has input, output, and error output

Transformations
- Business intelligence transformations
  - Term Extraction - extracts terms from text
  - Term Lookup - looks up terms and finds term counts
- Row transformations
  - Character Map - applies string functions to character data
  - Derived Column - populates columns using expressions
- Rowset transformations (a rowset is tabular data)
  - Aggregate - performs aggregations
  - Sort - sorts data
  - Percentage Sampling - creates a sample data set by setting a percentage
- Split and join transformations
  - Conditional Split - routes data rows to different outputs
  - Merge - merges two sorted data sets
  - Lookup - looks up reference values by exact match (see the SQL analogy below)
- Other transformations
  - Export Column - inserts data from a data flow into a file
  - Import Column - reads data from a file and adds it to a data flow
  - Slowly Changing Dimension - configures the update of an SCD

A Simple IS Case
- Use BI Dev Studio/Import Wizard to copy the TREO tables
- Save in
  - SQL Server
  - File system
- Look at the package structure
  - DROP, CREATE, source, transformation, destination
- Look at the package parts
- Execute the package
  - Error messages?
  - Steps execute in parallel, but dependencies can be set up

(Available from the mini-project web page.)
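The Lookup transformation is essentially an exact-match join against a reference table; in plain SQL the same effect (with the assumed names staging_sales and dim_customer) is:

```sql
-- Resolve each incoming row against the reference (dimension) table
SELECT s.*, d.customer_key
FROM staging_sales s
JOIN dim_customer d
  ON d.prod_key = s.customer_prod_key;
-- In IS, rows without a match go to the error output; in SQL, a
-- LEFT JOIN ... WHERE d.customer_key IS NULL finds such rows.
```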

ETL Demo
- Load data into the Product dimension table
  - Construct the DW key for the table by using IDENTITY
  - Copy data to the Product dimension table
- Load data into the Sales fact table
  - Join the "raw" sales table with the other tables to get the DW keys for each sales record
  - The output of the query is written into the fact table

ETL Part of Mini Project
- Core:
  - Build an ETL flow using MS DTS that can do an initial (first-time) load of the data warehouse
  - Include logic for generating the special DW surrogate integer keys for the tables
  - Discuss and implement basic transformations/data cleansing
- Extensions:
  - Extend the ETL flow to handle incremental loads, i.e., updates to the DW, both for dimensions and facts
  - Extend the DW design and the ETL logic to handle slowly changing dimensions of Type 2
  - Implement more advanced transformations/data cleansing
  - Perform error handling in the ETL flow

A Few Hints on ETL Design
- Don't implement all transformations in one step! Do one thing at a time:
  - Copy source data one-to-one to the DSA
  - Compute deltas (only if doing incremental load)
  - Handle versions and DW keys (versions only if handling slowly changing dimensions)
  - Implement complex transformations
  - Load dimensions
  - Load facts
- Test SQL before putting it into IS (a sketch of the demo's SQL follows below)
  - Build the first step and check that the result is as expected
  - Add the second step, execute both, and check the result
  - Add the third step, ...
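A minimal T-SQL sketch of the demo's two steps: IDENTITY-generated dimension keys, then a key-resolving fact load. All table and column names (dim_product, src_product, raw_sales, fact_sales, dim_date, etc.) are illustrative assumptions:

```sql
-- Dimension table whose DW key is generated by IDENTITY
CREATE TABLE dim_product (
    product_key INT IDENTITY(1,1) PRIMARY KEY,  -- DW surrogate key
    prod_id     INT,                            -- production key
    name        VARCHAR(100)
);

-- Copy source data; IDENTITY assigns the DW keys automatically
INSERT INTO dim_product (prod_id, name)
SELECT prod_id, name
FROM src_product;

-- Fact load: join the raw sales with the dimension tables to translate
-- production keys into DW keys; the query output goes into the fact table
INSERT INTO fact_sales (product_key, date_key, amount)
SELECT p.product_key, d.date_key, r.amount
FROM raw_sales r
JOIN dim_product p ON p.prod_id = r.prod_id
JOIN dim_date    d ON d.calendar_date = r.sale_date;
```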

Summary
- General ETL issues
  - The ETL process
  - Building dimensions
  - Building fact tables
  - Extract
  - Transformations/cleansing
  - Load
- MS Integration Services

