Creating A Data-Driven Enterprise With DataOps


Creating a Data-Driven Enterprise with DataOps
Insights from Facebook, Uber, LinkedIn, Twitter, and eBay
Ashish Thusoo & Joydeep Sen Sarma



Creating a Data-Driven Enterprise with DataOps
by Ashish Thusoo and Joydeep Sen Sarma

Copyright 2017 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Kristen Brown
Copyeditor: Octal Publishing, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2017: First Edition

Revision History for the First Edition
2017-04-24: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Creating a Data-Driven Enterprise with DataOps, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97781-1
[LSI]

Table of Contents

Acknowledgments

Part I. Foundations of a Data-Driven Enterprise

1. Introduction
    The Journey Begins
    The Emergence of the Data-Driven Organization
    Moving to Self-Service Data Access
    The Emergence of DataOps
    In This Book

2. Data and Data Infrastructure
    A Brief History of Data
    The Evolution of Data to “Big Data”
    Challenges with Big Data
    The Evolution of Analytics
    Components of a Big Data Infrastructure
    How Companies Adopt Data: The Maturity Model
    How Facebook Moved Through the Stages of Data Maturity
    Summary

3. Data Warehouses Versus Data Lakes: A Primer
    Data Warehouse: A Definition
    What Is a Data Lake?
    Key Differences Between Data Lakes and Data Warehouses
    When Facebook’s Data Warehouse Ran Out of Steam
    Is Using Either/Or a Possible Strategy?
    Common Misconceptions
    Difficulty Finding Qualified Personnel
    Summary

4. Building a Data-Driven Organization
    Creating a Self-Service Culture
    Organizational Structure That Supports a Self-Service Culture
    Roles and Responsibilities
    Summary

5. Putting Together the Infrastructure to Make Data Self-Service
    Technology That Supports the Self-Service Model
    Tools Used by Producers and Consumers of Data
    The Importance of a Complete and Integrated Data Infrastructure
    The Importance of Resource Sharing in a Self-Service World
    Security and Governance
    Self Help Support for Users
    Monitoring Resources and Chargebacks
    The “Big Compute Crunch”: How Facebook Allocates Data Infrastructure Resources
    Using the Cloud to Make Data Self Service
    Summary

6. Cloud Architecture and Data Infrastructure-as-a-Service
    Five Properties of the Cloud
    Cloud Architecture
    Objections About the Cloud Refuted
    What About a Private Cloud?
    Data Platforms for Data 2.0
    Summary

7. Metadata and Big Data
    The Three Types of Metadata
    The Challenges of Metadata
    Effectively Managing Metadata
    Summary

8. A Maturity-Model “Reality Check” for Organizations
    Organizations Understand the Need for Big Data, But Reach Is Still Limited
    Significant Challenges Remain
    Summary

Part II. Case Studies

9. LinkedIn: The Road to Data Craftsmanship
    Tracking and DALI
    Faster Access to Data and Insights
    Organizational Structure of the Data Team
    The Move to Self-Service

10. Uber: Driven to Democratize Data
    Uber’s First Data Challenge: Too Popular
    Uber’s Second Data Challenge: Scalability
    Making Data Democratic

11. Twitter: When Everything Happens in Real Time
    Twitter Develops Heron
    Seven Different Use Cases for Real-Time Streaming Analytics
    Advice to Companies Seeking to Be Data-Driven
    Looking Ahead

12. Capture All Data, Decide What to Do with It Later: My Experience at eBay
    Ensuring “CAP-R” in Your Data Infrastructure
    Personalization: A Key Benefit of Data-Driven Culture
    Building Data Tools and Giving Back to the Open Source Community
    The Importance of Machine Learning
    Looking Ahead

A. A Podcast Interview Transcript

Acknowledgments

This book is an attempt to capture what we have learned building teams, systems, and processes in our constant pursuit of a data-driven approach for the companies that we have worked for, as well as companies that are clients of Qubole today. To capture the essence of those learnings has taken effort and support from a number of people.

We cannot express enough thanks to David Hsieh for noticing the prescient need for a book on this topic and then constantly encouraging us to put our learnings to paper. We are also thankful to him for creating the maturity model for big data based on the patterns of our learnings about the adoption cycle of big data in the enterprise. At all the steps of the creation of this book, David has been a great sounding board and has given timely and useful advice. Thanks are also equally due to Karyn Scott for managing everything and anything related to the book, from coordinating the logistics with O’Reilly, to working behind the scenes with the Qubole team to polish the diagrams and presentations. She has constantly pushed to strive for timely delivery of the manuscript, which at times was understandably frustrating given that both of us were working on this while building out Qubole. Thanks are also due to Mauro Calvi and Dharmesh Desai for capturing some of the discussions in easy-to-digest pictorial representations.

We also want to thank the entire production team at O’Reilly, starting with Nicole Tache who edited a number of versions of the manuscript to ensure that not just the content but also our voice was well represented. We are grateful for her flexibility in the production process so that we could get the content right. Also at O’Reilly, we want to thank Alice LaPlante for diligently capturing our interviews on the subject and for helping build the content based on those interviews.

This book also tries to look for patterns that are common in enterprises that have achieved the “nirvana” of being data-driven. In that aspect, the contributions of Debashis Saha (eBay), Karthik Ramasamy (Twitter), Shrikanth Shankar (LinkedIn), and Zheng Shao (Uber) are some of the most valuable to the book as well as to our collective knowledge. All of these folks are great practitioners of the art and science of making their companies data-driven, and we are very thankful to them for sharing their learnings and experiences, and in the process making this book all the more insightful.

Last but not least, thanks to our families for putting up with us while we worked on this book. Without their constant encouragement and support, this effort would not have been possible.

PART I
Foundations of a Data-Driven Enterprise

This book is divided into two parts. In Part I, we discuss the theoretical and practical foundations for building a self-service, data-driven company.

In Chapter 1, we explain why data-driven companies are more successful and profitable than companies that do not center their decision-making on data. We also define what DataOps is and explain why moving to a self-service infrastructure is so critical.

In Chapter 2, we trace the history of data over the past three decades and how analytics has evolved accordingly. We then introduce the Qubole Self-Service Maturity Model to show how companies progress from a relatively simple state to a mature state that makes data ubiquitous to all employees through self-service.

In Chapter 3, we discuss the important distinctions between data warehouses and data lakes, and why, at least for now, you need to have both to effectively manage big data.

In Chapter 4, we define what a data-driven company is and how to successfully build, support, and evolve one.

In Chapter 5, we explore the need for a complete, integrated, and self-service data infrastructure, and the personas and tools that are required to support this.

In Chapter 6, we talk about how the cloud makes building a self-service infrastructure much easier and more cost effective. We explore the five capabilities of cloud to show why it makes the perfect enabler for a self-service culture.

In Chapter 7, we define metadata, and explain why it is essential for a successful self-service, data-driven operation.

In Chapter 8, we reveal the results of a Qubole survey that show the current state of maturity of global organizations today.

CHAPTER 1
Introduction

The Journey Begins

My journey with big data began at Oracle, led me to Facebook, and, finally, to founding Qubole. It’s been an exciting and informative ride, full of learnings and epiphanies. But two early “ah-ha’s” in particular stand out. They both occurred at Facebook. One was that users were eager to get their hands on data directly, without going through the data engineers in the data team. The second was how powerful data could be in the hands of the people.

I joined Facebook in August 2007 as part of the data team. It was a new group, set up in the traditional way for that time. The data infrastructure team supported a small group of data professionals who were called upon whenever anyone needed to access or analyze data located in a traditional data warehouse. As was typical in those days, anyone in the company who wanted to get data beyond some small and curated summaries stored in the data warehouse had to come to the data team and make a request. Our data team was excellent, but it could only work so fast: it was a clear bottleneck.

I was delighted to find a former classmate from my undergraduate days at the Indian Institute of Technology already at Facebook. Joydeep Sen Sarma had been hired just a month previously. Our team’s charter was simple: to make Facebook’s rich trove of data more available.

Our initial challenge was that we had a nonscalable infrastructure that had hit its limits. So, our first step was to experiment with Hadoop. Joydeep created the first Hadoop cluster at Facebook and the first set of jobs, populating the first datasets to be consumed by other engineers: application logs collected using Scribe and application data stored in MySQL.

But Hadoop wasn’t (and still isn’t) particularly user friendly, even for engineers. Gartner found that even today, because it is so difficult to find people with adequate Hadoop skills, more than half of businesses (54 percent) have no plans to invest in it.[1] It was, and is, a challenging environment. We found that the productivity of our engineers suffered. The bottleneck of data requests persisted (see Figure 1-1).

Figure 1-1. Human bottlenecks (source: Qubole)

SQL, on the other hand, was widely used by both engineers and analysts, and was powerful enough for most analytics requirements. So Joydeep and I decided to make the programmability of Hadoop available to everyone. Our idea: to create a SQL-based declarative language that would allow engineers to plug in their own scripts and programs when SQL wasn’t adequate. In addition, it was built to store all of the metadata about Hadoop-based datasets in one place. This latter feature turned out to be indispensable for creating the data-driven company that Facebook subsequently became.

[1] http://www.gartner.com/newsroom/id/3051717

That language, of course, was Hive, and the rest is history. Still, the idea was very new to us. We had no idea whether it would succeed. But it did. The data team immediately became more productive. The bottleneck eased. But then something happened that surprised us.

In January of 2008, when we released the first version of Hive internally at Facebook, a rush of employees (data scientists and engineers) grabbed the interfaces for themselves. They began to access the data they needed directly. They didn’t bother to request help from the data team. With Hive, we had inadvertently brought the power of big data to the people. We immediately saw tremendous opportunities in completely democratizing data. That was our first “ah-ha!”

One of the things driving employees to Hive was that at that same time (January 2008) Facebook released its Ad product.

Over the course of the next six months, a number of employees began to use the system heavily. Although the initial use case for Hive and Hadoop centered around summarizing and analyzing clickstream data for the launch of the Facebook Ad program, Hive quickly began to be used by product teams and data scientists for a number of other projects. In addition, we first talked about Hive at the first Hadoop summit, and immediately realized the tremendous potential beyond just what Facebook was doing with it.

With this, we had our second “ah-ha”: that by making data more universally accessible within the company, we could actually disrupt our entire industry. Data in the hands of the people was that powerful. As an aside, some time later we saw another example of what happens when you make data universally available.

Facebook used to have “hackathons,” where everyone in the company stayed up all night, ordered pizza and beer, and coded into the wee hours with the goal of coming up with something interesting. One intern, Paul Butler, came up with a spectacular idea. He performed analyses using Hadoop and Hive and mapped out how Facebook users were interacting with each other all over the world. By drawing the interactions between people and their locations, he developed a global map of Facebook’s reach. Astonishingly, it mapped out all continents and even some individual countries.

In Paul’s own words:

    When I shared the image with others within Facebook, it resonated with many people. It’s not just a pretty picture, it’s a reaffirmation of the impact we have in connecting people, even across oceans and borders.

To me, this was nothing short of amazing. By using data, this intern came up with an incredibly creative idea, incredibly quickly. It could never have happened in the old world when a data team was needed to fulfill all requests for data.

Data was clearly too important to be left behind lock and key, accessible only by data engineers. We were on our way to turning Facebook into a data-driven company.

The Emergence of the Data-Driven Organization

    84 percent of executives surveyed said they believe that “most to all” of their employees should use data analysis to help them perform their job duties.

Let’s discuss why data is important, and what a data-driven organization is. First and foremost, a data-driven organization is one that understands the importance of data. It possesses a culture of using data to make all business decisions. Note the word all. In a data-driven organization, no one comes to a meeting armed only with hunches or intuition. The person with the superior title or largest salary doesn’t win the discussion. Facts do. Numbers. Quantitative analyses. Stuff backed up by data.

Why become a data-driven company? Because it pays off. The MIT Center for Digital Business asked 330 companies about their data analytics and business decision-making processes. It found that the more companies characterized themselves as data-driven, the better they performed on objective measures of financial and operational success.[2]

Specifically, companies in the top third of their industries when it came to making data-driven decisions were, on average, five percent more productive and six percent more profitable than their competitors. This performance difference remained even after accounting for labor, capital, purchased services, and traditional IT investments. It was also statistically significant and reflected in increased stock market prices that could be objectively measured.

[2] volution

Another survey, by The Economist Intelligence Unit, showed a clear connection between how a company uses data and its financial success. Only 11 percent of companies said that their organization makes “substantially” better use of data than their peers. Yet more than a third of this group fell into the category of “top performing companies.”[3] The reverse also indicates the relationship between data and financial success. Of the 17 percent of companies that said they “lagged” their peers in taking advantage of data, not one was a top-performing business.

Figure 1-2. Rating an organization’s use of data (data from Economist Intelligence Unit survey, October 2012)

Another Economist Intelligence Unit survey found that 70 percent of senior business executives said analyzing data for sales and marketing decisions is already “very” or “extremely important” to their company’s competitive advantage. A full 89 percent of respondents expect this to be the case within two years.[4]

[3] apers/tableau dataculture 130219.pdf

According to the aforementioned MIT report, 50 percent of “above average” performing businesses said they had achieved a data-driven company by the promotion of data sharing. More than half (57 percent) said that a data-driven company was driven by top-down mandates from the highest level. And an eye-opening 84 percent of executives surveyed said they believe that “most to all” of their employees should use data analysis to help them perform their job duties, not just IT workers or data scientists and analysts.[5]

Figure 1-3. Successful strategies for promoting a data-driven culture (data from Economist In

