Pentaho - .e-bookshelf.de

3y ago
29 Views
2 Downloads
943.12 KB
30 Pages
Last View : 20d ago
Last Download : 3m ago
Upload by : Audrey Hope
Transcription

Pentaho Kettle Solutions

Pentaho KettleSolutions Building Open Source ETL Solutionswith Pentaho Data IntegrationMatt CastersRoland BoumanJos van Dongen

Pentaho Kettle Solutions: Building Open Source ETL Solutions withPentaho Data IntegrationPublished byWiley Publishing, Inc.10475 Crosspoint BoulevardIndianapolis, IN 46256www.wiley.comCopyright 2010 by Wiley Publishing, Inc., Indianapolis, IndianaPublished simultaneously in CanadaISBN: 978-0-470-63517-9ISBN: 9780470942420 (ebk)ISBN: 9780470947524 (ebk)ISBN: 9780470947531 (ebk)Manufactured in the United States of America10 9 8 7 6 5 4 3 2 1No part of this publication may be reproduced, stored in a retrieval system or transmitted in anyform or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, withouteither the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923,(978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed tothe Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201)748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this workand specifically disclaim all warranties, including without limitation warranties of fitness for aparticular purpose. No warranty may be created or extended by sales or promotional materials.The advice and strategies contained herein may not be suitable for every situation. This work issold with the understanding that the publisher is not engaged in rendering legal, accounting,or other professional services. If professional assistance is required, the services of a competentprofessional person should be sought. Neither the publisher nor the author shall be liable fordamages arising herefrom. The fact that an organization or Web site is referred to in this work asa citation and/or a potential source of further information does not mean that the author or thepublisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this workmay have changed or disappeared between when this work was written and when it is read.For general information on our other products and services please contact our Customer CareDepartment within the United States at (877) 762-2974, outside the United States at (317) 572-3993or fax (317) 572-4002.Wiley also publishes its books in a variety of electronic formats. Some content that appears inprint may not be available in electronic books.Library of Congress Control Number: 2010932421Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley &Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used withoutwritten permission. Pentaho is a registered trademark of Pentaho, Inc. All other trademarks arethe property of their respective owners. Wiley Publishing, Inc. is not associated with any productor vendor mentioned in this book.

For my wife and kids, Kathleen, Sam and Hannelore.Your love and joy keeps me sane in crazy times.—MattFor my wife, Annemarie, and my children, David, Roos,Anne and Maarten. Thanks for bearing with me—I love you!—RolandFor my children Thomas and Lisa, and for Yvonne, to whomI owe more than words can express.—Jos

About the AuthorsMatt Casters has been an independent business intelligence consultant for many yearsand has implemented numerous data warehouses and BI solutions for large companies.For the last 8 years, Matt kept himself busy with the development of an ETL tool calledKettle. This tool was open sourced in December 2005 and acquired by Pentaho earlyin 2006. Since then, Matt took up the position of Chief Data Integration at Pentaho. Hisresponsibility is to continue to be lead developer for Kettle. Matt tries to help the Kettlecommunity in any way possible; he answers questions on the forum and speaks occasionally at conferences all around the world. He has a blog at http://www.ibridge.be andyou can follow his @mattcasters account on Twitter.Roland Bouman has been working in the IT industry since 1998 and is currently working as a web and business intelligence developer. Over the years he has focused onopen source software, in particular database technology, business intelligence, andweb development frameworks. He’s an active member of the MySQL and Pentaho communities, and a regular speaker at international conferences, such as the MySQL UserConference, OSCON and at Pentaho community events. Roland co-authored the MySQL5.1. Cluster Certification Guide and Pentaho Solutions, and was a technical reviewer fora number of MySQL and Pentaho related book titles. He maintains a technical blog athttp://rpbouman.blogspot.com and tweets as @rolandbouman on Twitter.Jos van Dongen is a seasoned business intelligence professional and well-known authorand presenter. He has been involved in software development, business intelligence, anddata warehousing since 1991. Before starting his own consulting practice, Tholis Consulting,in 1998, he worked for a top tier systems integrator and a leading management consulting firm. Over the past years, he has successfully implemented BI and data warehousesolutions for a variety of organizations, both commercial and non-profit. Jos covers newBI developments for the Dutch Database Magazine and speaks regularly at national andinternational conferences. He authored one book on open source BI and is co-author of thebook Pentaho Solutions. You can find more information about Jos on http://www.tholis.com or follow @josvandongen on Twitter.vii

CreditsExecutive EditorRobert ElliottMarketing ManagerAshley ZurcherProject EditorSara ShlaerProduction ManagerTim TateTechnical EditorsJens BleuelSven BodenKasper de GraafDaniel EinspanjerNick GoodmanMark HallSamatar HassanBenjamin KallmannBryan SensemanJohannes van den BoschVice President and Executive GroupPublisherRichard SwadleyProduction EditorDaniel ScribnerCopy EditorNancy RapoportEditorial DirectorRobyn B. SieskyEditorial ManagerMary Beth WakefieldviiiVice President and Executive PublisherBarry PruettAssociate PublisherJim MinatelProject Coordinator, CoverLynsey StanfordCompositorMaureen Forys,Happenstance Type-O-RamaProofreaderNancy BellIndexerRobert SwansonCover DesignerRyan Sneed

AcknowledgmentsThis book is the result of the efforts of many individuals. By convention, authors receiveexplicit credit, and get to have their names printed on the book cover. But creating this bookwould not have been possible without a lot of hard work behind the scenes. We, the authors,would like to express our gratitude to a number of people that provided substantial contributions, and thus help define and shape the final result that is Pentaho Kettle Solutions.First, we’d like to thank those individuals that contributed directly to the materialthat appears in the book: Ingo Klose suggested an elegant solution to generate keys starting from a givenoffset within a single transformation (this solution is discussed in Chapter 8,“Handling Dimension Tables,” subsection “Generating Surrogate Keys Basedon a Counter,” shown in Figure 8-2). Samatar Hassan provided text as well as working example transformations todemonstrate Kettle’s RSS capabilities. Samatar’s contribution is included almostcompletely and appears in the RSS section of Chapter 21, “Web Services.” Thanks to Mike Hillyer and the MySQL documentation team for creating and maintaining the Sakila sample database, which is introduced in Chapter 4 and appearsin many examples throughout this book. Although only three authors appear on the cover, there was actually a fourth one: Wecannot thank Kasper de Graaf of DIKW-Academy enough for writing the Data Vaultchapter, which has benefited greatly from his deep expertise on this subject. Specialthanks also to Johannes van den Bosch who did a great job reviewing Kasper’s workand gave another boost to the overall quality and clarity of the chapter. Thanks to Bernd Aschauer and Robert Wintner, both from Aschauer EDV(http://www.aschauer-edv.at/en), for providing the examples and screenshots used in the section dedicated to SAP of Chapter 6, “Data Extraction.” Daniel Einspanjer of the Mozilla Foundation provided sample transformationsfor Chapter 7, “Cleansing and Conforming.”ix

xAcknowledgmentsThanks for your contributions. This book benefited substantially from your efforts.Much gratitude goes out to all of our technical reviewers. Providing a good technicalreview is hard and time-consuming, and we have been very lucky to find a collectionof such talented and seasoned Pentaho and Kettle experts willing to find some time intheir busy schedules to provide us with the kind of quality review required to write abook of this size and scope.We’d like to thank the Kettle and Pentaho communities. During and before the writing of this book, individuals from these communities provided valuable suggestionsand ideas to all three authors for topics to cover in a book that focuses on ETL, dataintegration, and Kettle. We hope this book will be useful and practical for everybodywho is using or planning to use Kettle. Whether we succeeded is up to the reader, butif we did, we have to thank individuals in the Kettle and Pentaho communities forhelping us achieve it.We owe many thanks to all contributors and developers of the Kettle software project.The authors are all enthusiastic users of Kettle: we love it, because it solves our dailydata integration problems in a straightforward and efficient manner without gettingin the way. Kettle is a joy to work with, and this is what provided much of the drive towrite this book.Finally, we’d like to thank our publisher, Wiley, for giving us the opportunity to writethis book, and for the excellent support and management from their end. In particular,we’d like to thank our Project Editor, Sara Shlaer. Despite the often delayed deliveriesfrom our end, Sara always kept her cool and somehow managed to make deadlineswork out. Her advice, patience, encouragement, care, and sense of humor made all thedifference and form an important contribution to this book. In addition, we’d like tothank our Executive Editor Robert Elliot. We appreciate the trust he put into our smallteam of authors to do our job, and his efforts to realize Pentaho Kettle Solutions.—The authorsWriting a technical book like the one you are reading right now is very hard to doall by yourself. Because of the extremely busy agenda caused by the release processof Kettle 4, I probably should never have agreed to co-author. It’s only thanks to thededication and professionalism of Jos and Roland that we managed to write this bookat all. I thank both friends very much for their invitation to co-author. Even thoughwriting a book is a hard and painful process, working with Jos and Roland made it allworthwhile.When Kettle was not yet released as open source code it often received a lukewarmreaction. The reason was that nobody was really waiting for yet another closed source ETLtool. Kettle came from that position to being the most widely deployed open sourceETL tool in the world. This happened only thanks to the thousands of volunteers whooffered to help out with various tasks. Ever since Kettle was open sourced it becamea project with an every growing community. It’s impossible to thank this communityenough. Without the help of the developers, the translators, the testers, the bug reporters,the folks who participate in the forums, the people with the great ideas, and even thefolks who like to complain, Kettle would not be where it is today. I would like to especially thank one important member of our community: Pentaho. Pentaho CEO RichardDaley and his team have done an excellent job in supporting the Kettle project ever

Acknowledgmentssince they got involved with it. Without their support it would not have been possiblefor Kettle to be on the accelerated growth path that it is on today. It’s been a pleasureand a privilege to work with the Pentaho crew.A few select members of our community also picked up the tough job of reviewing the often technical content of this book. The reviewers of my chapters, NicholasGoodman, Daniel Einspanjer, Bryan Senseman, Jens Bleuel, Samatar Hassan, and MarkHall had the added disadvantage that this was the first time that I was going throughthe process of writing a book. It must not have been pretty at times. All the same theyspent a lot of time coming up with insightful additions, spot-on advice, and to the pointcomments. I do enormously appreciate the vast amount of time and effort that they putinto the reviewing. The book wouldn’t have been the same without you guys!—Matt CastersI’d like to thank both my co-authors, Jos and Matt. It’s an honor to be working withsuch knowledgeable and skilled professionals, and I hope we will collaborate again inthe future. I feel our different backgrounds and expertise have truly complemented eachother and helped us all to cover the many different subjects covered in this book.I’d also like to thank the reviewers of my chapters: Benjamin Kallman, BryanSenseman, Daniel Einspanjer, Sven Boden, and Samatar Hassan. Your comments andsuggestions made all the difference and I thank you for your frank and constructivecriticism.Finally, I’d like to thank the readers of my blog at http://rpbouman.blogspot.com/.I got a lot of inspiration from the comments posted there, and I got a lot of good feedbackin response to the blog posts announcing the writing of Pentaho Kettle Solutions.—Roland BoumanBack in October 2009, when Pentaho Solutions had only been on the shelves for twomonths and Roland and I agreed never to write another book, Bob Elliot approachedus asking us to do just that. Yes, we had been discussing some ideas and already concluded that if there were to be another book, it would have to be about Kettle. And thiswas exactly what Bob asked us to do: write a book about data integration using Kettle.We quickly found out that Matt Casters was not only interested in reviewing, but inactually becoming a full author as well, an offer we gladly accepted. Looking back, Ican hardly believe that we pulled it off, considering everything else that was going onin our lives. So many thanks to Roland and Matt for bearing with me, and thank youBob and especially Sara for your relentless efforts of keeping us on track.A special thank you is also warranted for Ralph Kimball, whose ideas you’ll findthroughout this book. Ralph gave us permission to use the Kimball Group’s 34 ETLsubsystems as the framework for much of the material presented in his book. Ralph alsotook the time to review Chapter 5, and thanks to his long list of excellent comments thechapter became a perfect foundation for Parts II, III, and IV of the book.Finally I’d like to thank Daniel Einspanjer, Bryan Senseman, Jens Bleuel, Sven Boden,Samatar Hassan, and Benjamin Kallmann for being an absolute pain in the neck andthus doing a great job as technical reviewers for my chapters. Your comments, questionsand suggestions definitely gave a big boost to the overall quality of this book.—Jos van Dongenxi

Contents at a GlanceIntroductionxxxiPart IGetting Started1Chapter 1ETL Primer3Chapter 2Kettle Concepts23Chapter 3Installation and Configuration53Chapter 4An Example ETL Solution—Sakila73Part IIETL111Chapter 5ETL Subsystems113Chapter 6Data Extraction127Chapter 7Cleansing and Conforming167Chapter 8Handling Dimension Tables207Chapter 9Loading Fact Tables245Chapter 10 Working with OLAP Data269Part III293Management and DeploymentChapter 11 ETL Development Lifecycle295Chapter 12 Scheduling and Monitoring321xiii

xivContents at a GlanceChapter 13 Versioning and Migration341Chapter 14 Lineage and Auditing357Part IV375Performance and ScalabilityChapter 15 Performance Tuning377Chapter 16 Parallelization, Clustering, and Partitioning403Chapter 17 Dynamic Clustering in the Cloud433Chapter 18 Real-Time Data Integration449Part V463Advanced TopicsChapter 19 Data Vault Management465Chapter 20 Handling Complex Data Formats497Chapter 21 Web Services515Chapter 22 Kettle Integration569Chapter 23 Extending Kettle593Appendix A The Kettle Ecosystem629Appendix B Kettle Enterprise Edition Features635Appendix C Built-in Variables and Properties Reference637Index643

ContentsIntroductionxxxiPart IGetting Started1Chapter 1ETL Primer3OLTP versus Data WarehousingWhat Is ETL?The Evolution of ETL SolutionsETL Building BlocksETL, ELT, and EIIELTEII: Virtual Data IntegrationData Integration ChallengesMethodology: Agile BIETL DesignData AcquisitionBeware of SpreadsheetsDesign for FailureChange Data CaptureData QualityData ProfilingData ValidationETL Tool RequirementsConnectivityPlatform IndependenceScalabilityDesign 616161717171818191919xv

xviContentsChapter 2Chapter 3Data TransformationsTesting and DebuggingLineage and Impact AnalysisLogging and AuditingSummary2021212222Kettle Concepts23Design PrinciplesThe Building Blocks of Kettle DesignTransformationsStepsTransformation HopsParallelismRows of DataData ConversionJobsJob EntriesJob HopsMultiple Paths and BacktrackingParallel ExecutionJob Entry ResultsTransformation or Job MetadataDatabase ConnectionsSpecial OptionsThe Power of the Relational DatabaseConnections and TransactionsDatabase ClusteringTools and UtilitiesRepositoriesVirtual File SystemsParameters and VariablesDefining VariablesNamed ParametersUsing VariablesVisual ProgrammingGetting StartedCreating New StepsPutting It All 93940414142434344444546474951Installation and Configuration53Kettle Software OverviewIntegrated Development Environment: SpoonCommand-Line Launchers: Kitchen and PanJob Server: CarteEncr.bat and encr.shInstallation535557575858

ContentsChapter 4Java EnvironmentInstalling Java ManuallyUsing Your Linux Package Management SystemInstalling KettleVersions and ReleasesArchive Names and FormatsDownloading and UncompressingRunning Kettle ProgramsCreating a Shortcut Icon or Launcher for SpoonConfigurationConfiguration Files and the .kettle DirectoryThe Kettle Shell ScriptsGeneral Structure of the Startup ScriptsAdding an Entry to the ClasspathChanging the Maximum Heap SizeManaging JDBC DriversSummary5858595959606061626363697070717272An Example ETL Solution—Sakila73SakilaThe Sakila Sample DatabaseDVD Rental Business ProcessSakila Database Schema DiagramSakila Database Subject AreasGeneral Design ConsiderationsInstalling the Sakila Sample DatabaseThe Rental Star SchemaRental Star Schema DiagramRental Fact TableDimension TablesKeys and Change Data CaptureInstalling the Rental Star SchemaPrerequisites and Some Basic Spoon SkillsSetting Up the ETL SolutionCreating Database AccountsWorking with SpoonOpening Transformation and Job FilesOpening the Step’s Configuration DialogExamining StreamsRunning Jobs and TransformationsThe Sample ETL SolutionStatic, Generated DimensionsLoading the dim date Dimension TableLoading the dim time Dimension TableRecurring LoadThe load rentals 4868788xvii

xviiiContentsPart IIChapter 5Chapter 6The load dim staff TransformationDatabase ConnectionsThe load dim customer TransformationThe load dim store TransformationThe fetch address SubtransformationThe load dim actor TransformationThe load dim film TransformationThe load fact rental TL Subsystems113Introduction to the 34 SubsystemsExtracti

web development frameworks. He’s an active member of the MySQL and Pentaho com - munities, and a regular speaker at international conferences, such as the MySQL User Conference, OSCON and at Pentaho community events. Roland co-authored the MySQL 5.1. Cluster Certification Guide and Pentaho Solutions, and was a technical reviewer for

Related Documents:

Course Outline Overview of the Pentaho Stack Introduction to OLAP, DWH and Dimension Modelling Pentaho BA Server Pentaho ETL OLAP Schema creation in Pentaho in detail Dynamic Security in Pentaho OLAP MDX in action Pentaho Reporting in detail (PRD and PME) Analysis Report Dashboar

Pentaho Schema Workbench 3. Pentaho Dashboards : a. Pentaho Design Studio 4. Pentaho Data Integration (ETL) : a. Pentaho Data Integration 5. Pe

This guide is an introduction to Pentaho Business Analytics, and includes: Instructions for the installation of Pentaho Business Analytics on Windows Features of a simple report created using Pentaho Interactive Reporting Features of a

Pentaho BA Server Manual Deployment Guide This document supports Pentaho Business Analytics Suite 4.8 GA and Pentaho Data Integration 4.4 GA, documentation revision October 31, 2012.

1.2 Pentaho BI-Server A Pentaho Corporation gerencia, facilita, suporta e lidera o papel de desenvolvimento do Projeto Pentaho BI - uma iniciativa pioneira da comu-nidade de desenvolvimento Open Source para proporcionar ferramentas de Business Intelligence (BI) para que as organiza c oes melhorem radicalmente

She has worked as a BI consultant for more than 10 years. Over the last four years, she has been dedicated full time to developing BI solutions using Pentaho Suite. Currently, she works for Webdetails, one of the main Pentaho contributors. She is the author of Pentaho 3.2 Data Integration: Beginner's Guide published by Packt Publishing in April .

Pentaho Business Analytics, providing data integration, OLAP services, reporting, dashboarding, data mining and ETL capabilities. 2 Pentaho BI - Analytics for everyone Analytics is all about gaining insights from the data for better decision making [13]. A competitor on the growing market of BI solutions, Pentaho BI is an ongoing effort by

8th Grade Writing and Speaking/Listening Scope and Sequence 1 s t Q u a r te r 2 n d Q u a r te r 3 r d Q u a r te r 4 th Q u a r te r Writing N a rra t i ve I n t ro d u ce ch a ra ct e rs a n d o rg a n i ze a n e ve n t se q u e n ce (W . 8 . 3 a ) U se n a rra t i ve t e ch n i q u e s i n cl u d i n g