Practical Data Science With Hadoop - Pearsoncmg


Practical Data Science with Hadoop and Spark
Mendelevitch Book.indb i 11/16/16 6:39 PM

Practical Data Science with Hadoop and Spark
Designing and Building Effective Analytics at Scale

Ofer Mendelevitch
Casey Stella
Douglas Eadline

Boston Columbus Indianapolis New York San Francisco Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto Delhi Mexico City São Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please contact intlcs@pearson.com.

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2016955465

Copyright © 2017 Pearson Education, Inc.

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

ISBN-13: 978-0-13-402414-1
ISBN-10: 0-13-402414-1

Contents

Foreword
Preface
Acknowledgments
About the Authors

I Data Science with Hadoop—An Overview

1 Introduction to Data Science
  What Is Data Science?
  Example: Search Advertising
  A Bit of Data Science History
  Statistics and Machine Learning
  Innovation from Internet Giants
  Data Science in the Modern Enterprise
  Becoming a Data Scientist
  The Data Engineer
  The Applied Scientist
  Transitioning to a Data Scientist Role
  Soft Skills of a Data Scientist
  Building a Data Science Team
  The Data Science Project Life Cycle
  Ask the Right Question
  Data Acquisition
  Data Cleaning: Taking Care of Data Quality
  Explore the Data and Design Model Features
  Building and Tuning the Model
  Deploy to Production
  Managing a Data Science Project
  Summary

2 Use Cases for Data Science
  Big Data—A Driver of Change
  Volume: More Data Is Now Available
  Variety: More Data Types
  Velocity: Fast Data Ingest
  Business Use Cases
  Product Recommendation
  Customer Churn Analysis
  Customer Segmentation
  Sales Leads Prioritization
  Sentiment Analysis
  Fraud Detection
  Predictive Maintenance
  Market Basket Analysis
  Predictive Medical Diagnosis
  Predicting Patient Re-admission
  Detecting Anomalous Record Access
  Insurance Risk Analysis
  Predicting Oil and Gas Well Production Levels
  Summary

3 Hadoop and Data Science
  What Is Hadoop?
  Distributed File System
  Resource Manager and Scheduler
  Distributed Data Processing Frameworks
  Hadoop’s Evolution
  Hadoop Tools for Data Science
  Apache Sqoop
  Apache Flume
  Apache Hive
  Apache Pig
  Apache Spark
  R
  Python
  Java Machine Learning Packages
  Why Hadoop Is Useful to Data Scientists
  Cost Effective Storage
  Schema on Read
  Unstructured and Semi-Structured Data
  Multi-Language Tooling
  Robust Scheduling and Resource Management
  Levels of Distributed Systems Abstractions
  Scalable Creation of Models
  Scalable Application of Models
  Summary

II Preparing and Visualizing Data with Hadoop

4 Getting Data into Hadoop
  Hadoop as a Data Lake
  The Hadoop Distributed File System (HDFS)
  Direct File Transfer to Hadoop HDFS
  Importing Data from Files into Hive Tables
  Import CSV Files into Hive Tables
  Importing Data into Hive Tables Using Spark
  Import CSV Files into HIVE Using Spark
  Import a JSON File into HIVE Using Spark
  Using Apache Sqoop to Acquire Relational Data
  Data Import and Export with Sqoop
  Apache Sqoop Version Changes
  Using Sqoop V2: A Basic Example
  Using Apache Flume to Acquire Data Streams
  Using Flume: A Web Log Example Overview
  Manage Hadoop Work and Data Flows with Apache Oozie
  Apache Falcon
  What’s Next in Data Ingestion?
  Summary

5 Data Munging with Hadoop
  Why Hadoop for Data Munging?
  Data Quality
  What Is Data Quality?
  Dealing with Data Quality Issues
  Using Hadoop for Data Quality
  The Feature Matrix
  Choosing the “Right” Features
  Sampling: Choosing Instances
  Generating Features
  Text Features
  Time-Series Features
  Features from Complex Data Types
  Feature Manipulation
  Dimensionality Reduction
  Summary

6 Exploring and Visualizing Data
  Why Visualize Data?
  Motivating Example: Visualizing Network Throughput
  Visualizing the Breakthrough That Never Happened
  Creating Visualizations
  Comparison Charts
  Composition Charts
  Distribution Charts
  Relationship Charts
  Using Visualization for Data Science
  Popular Visualization Tools
  R
  Python: Matplotlib, Seaborn, and Others
  SAS
  Matlab
  Julia
  Other Visualization Tools
  Visualizing Big Data with Hadoop
  Summary

III Applying Data Modeling with Hadoop

7 Machine Learning with Hadoop
  Overview of Machine Learning
  Terminology
  Task Types in Machine Learning
  Big Data and Machine Learning
  Tools for Machine Learning
  The Future of Machine Learning and Artificial Intelligence
  Summary

8 Predictive Modeling
  Overview of Predictive Modeling
  Classification Versus Regression
  Evaluating Predictive Models
  Evaluating Classifiers
  Evaluating Regression Models
  Cross Validation
  Supervised Learning Algorithms
  Building Big Data Predictive Model Solutions
  Model Training
  Batch Prediction
  Real-Time Prediction
  Example: Sentiment Analysis
  Tweets Dataset
  Data Preparation
  Feature Generation
  Building a Classifier
  Summary

9 Clustering
  Overview of Clustering
  Uses of Clustering
  Designing a Similarity Measure
  Distance Functions
  Similarity Functions
  Clustering Algorithms
  Example: Clustering Algorithms
  k-means Clustering
  Latent Dirichlet Allocation
  Evaluating the Clusters and Choosing the Number of Clusters
  Building Big Data Clustering Solutions
  Example: Topic Modeling with Latent Dirichlet Allocation
  Feature Generation
  Running Latent Dirichlet Allocation
  Summary

10 Anomaly Detection with Hadoop
  Overview
  Uses of Anomaly Detection
  Types of Anomalies in Data
  Approaches to Anomaly Detection
  Rules-based Methods
  Supervised Learning Methods
  Unsupervised Learning Methods
  Semi-Supervised Learning Methods
  Tuning Anomaly Detection Systems
  Building a Big Data Anomaly Detection Solution with Hadoop
  Example: Detecting Network Intrusions
  Data Ingestion
  Building a Classifier
  Evaluating Performance
  Summary

11 Natural Language Processing
  Natural Language Processing
  Historical Approaches
  NLP Use Cases
  Text Segmentation
  Part-of-Speech Tagging
  Named Entity Recognition
  Sentiment Analysis
  Topic Modeling
  Tooling for NLP in Hadoop
  Small-Model NLP
  Big-Model NLP
  Textual ent Analysis Example
  Stanford CoreNLP
  Using Spark for Sentiment Analysis
  Summary

12 Data Science with Hadoop—The Next Frontier
  Automated Data Discovery
  Deep Learning
  Summary

A Book Web Page and Code Download

B HDFS Quick Start
  Quick Command Dereference
  General User HDFS Commands
  List Files in HDFS
  Make a Directory in HDFS
  Copy Files to HDFS
  Copy Files from HDFS
  Copy Files within HDFS
  Delete a File within HDFS
  Delete a Directory in HDFS
  Get an HDFS Status Report (Administrators)
  Perform an FSCK on HDFS (Administrators)

C Additional Background on Data Science and Apache Hadoop and Spark
  General Hadoop/Spark Information
  Hadoop/Spark Installation Recipes
  HDFS
  MapReduce
  Spark
  Essential Tools
  Machine Learning

Index


Foreword

Hadoop and data science have each been sought-after skillsets over the last five years. However, few publications have attempted to bring the two together, teaching data science within the Hadoop context. For practitioners looking for an introduction to data science combined with solving those problems at scale using Hadoop and related tools, this book will prove to be an excellent resource.

The book introduces data science through topics including data ingest, munging, feature extraction, machine learning, predictive modeling, anomaly detection, and natural language processing. The platform of choice for the examples and implementation of these topics is Hadoop, Spark, and the other parts of the Hadoop ecosystem. Its coverage is broad, with specific examples keeping the book grounded in an engineer’s need to solve real-world problems. For those already familiar with data science, but looking to expand their skillsets to very large datasets and Hadoop, this book is a great introduction.

Throughout, the text focuses on concrete examples and on providing insight into the business value of each approach. Chapter 5, “Data Munging with Hadoop,” provides particularly useful real-world examples on using Hadoop to prepare large datasets for common machine learning and data science tasks. Chapter 10 on anomaly detection is particularly useful for large datasets where monitoring and alerting are important. Chapter 11 on natural language processing will be of interest to those attempting to make chatbots.

Ofer Mendelevitch is the VP of Data Science at Lendup.com and was previously the Director of Data Science at Hortonworks. Few others are as qualified to be the lead author on a book combining data science and Hadoop. Joining Ofer is his former colleague, Casey Stella, a Principal Data Scientist at Hortonworks. Rounding out these experts in data science and Hadoop is Doug Eadline, frequent contributor to the Addison-Wesley Data & Analytics Series with the titles Hadoop Fundamentals LiveLessons, Apache Hadoop 2 Quick-Start Guide, and Apache Hadoop YARN. Collectively, this team of authors brings over a decade of Hadoop experience. I can imagine few others that have as much knowledge on the subject of data science and Hadoop.

I’m excited to have this addition to the Data & Analytics Series. Creating data science solutions at scale in production systems is an in-demand skillset. This book will help you come up to speed quickly to deploy and run production data science solutions at scale.

Paul Dix
Series Editor


Preface

Data science and machine learning are at the core of many innovative technologies and products and are expected to continue to disrupt many industries and business models across the globe for the foreseeable future. Until recently though, most of this innovation was constrained by the limited availability of data.

With the introduction of Apache Hadoop, all of that has changed. Hadoop provides a platform for storing, managing, and processing large datasets inexpensively and at scale, making data science analysis of large datasets practical and feasible. In this new world of large-scale advanced analytics, data science is a core competency that enables organizations to remain competitive and innovate beyond their traditional business models. During our time at Hortonworks, we have had a chance to see how various organizations tackle this new set of opportunities and help them on their journey to implementing data science at scale with Hadoop and Spark. In this book we would like to share some of these lessons and experiences.

We also wish to emphasize the evolution of Apache Hadoop from its early incarnation as a monolithic MapReduce engine (Hadoop version 1) to a versatile data analytics platform that runs on YARN and supports not only MapReduce but also Tez and Spark as processing engines (Hadoop version 2). The current version of Hadoop provides a robust and efficient platform for many data science applications and opens up a universe of opportunities to new business use cases that were previously unthinkable.

Focus of the Book

This book focuses on real-world practical aspects of data science with Hadoop and Spark. Since the scope of data science is very broad, and every topic therein is deep and complex, it is quite difficult to cover the topic thoroughly.
We approached this problem by attempting to strike a good balance between the theoretical coverage of each use case and the example-driven treatment of practical implementation.

This book is not designed to dig deep into many of the mathematical details of each machine learning or statistical approach but rather to provide a high-level description of the main concepts along with guidelines for their practical use in the context of the business problem. We provide some references in the text that offer more in-depth treatment of the mathematical details of these techniques and have compiled a list of relevant resources in Appendix C, “Additional Background on Data Science and Apache Hadoop and Spark.”

When learning about Hadoop, access to a Hadoop cluster environment can become an issue. Finding an effective way to “play” with Hadoop and Spark can be challenging for some individuals. At a minimum, we recommend the Hortonworks virtual machine sandbox for those that would like an easy way to get started with Hadoop. The sandbox is a full single-node Hadoop installation running inside a virtual machine. The virtual machine can be run under Windows, Mac OS, and Linux. Please see http://hortonworks.com/products/sandbox for more information on how to download and install the sandbox. For further help with Hadoop we recommend Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (and supporting videos), all mentioned in Appendix C.

Who Should Read This Book

This book is intended for readers who are interested in learning more about what data science is and some of the practical considerations of its application to large-scale datasets. It provides a strong technical foundation for readers who want to learn more about how to implement various use cases, the tools that are best suited for the job, and some of the architectures that are common in these situations. It also provides a business-driven viewpoint on when application of data science to large datasets is useful to help stakeholders understand what value can be derived for their organization and where to invest their resources in applying large-scale machine learning.

There is also a level of experience assumed for this book. For those not versed in data science, some basic competencies are important to have to understand the different methods, including statistical concepts (for example, mean and standard deviation), and a bit of background in programming (mostly Python and a bit of Java or Scala) to understand the examples throughout the book.

For those with a data science background, you should generally be comfortable with the material, although there may be some practical issues such as understanding the numerous Apache projects.
In addition, all examples are text-based, and some familiarity with the Linux command line is required. It should be noted that we did not use (or test) a Windows environment for the examples. However, there is no reason to assume they will not work in that and other environments (Hortonworks supports Windows).

In terms of a specific Hadoop environment, all the examples and code were run under Hortonworks HDP Linux Hadoop distribution (either laptop or cluster). Your environment may differ in terms of distribution (Cloudera, MapR, Apache Source) or operating systems (Windows). However, all the tools (or equivalents) are available in both environments.

How to Use This Book

We anticipate several different audiences for the book:

• data scientists
• developers/data engineers
• business stakeholders

While these readers come at Hadoop analytics from different backgrounds, their goal is the same: running data analytics with Hadoop and Spark at scale. To this end, we have designed the chapters to meet the needs of all readers, and as such readers may find that they can skip areas where they already have a good practical understanding. Finally, we also want to invite novice readers to use this book as a first step in their understanding of data science at scale. We believe there is value in “walking” through the examples, even if you are not sure what is actually happening, and then going back and buttressing your understanding with the background material.

Part I, “Data Science with Hadoop—An Overview,” spans the first three chapters.

Chapter 1, “Introduction to Data Science,” provides an overview of data science and its history and evolution over the years. It lays out the journey people often take to become a data scientist. For those not versed in data science, this chapter will help you understand why it has evolved into a powerful discipline and provide some insight into how a data scientist designs and refines projects. There is also some discussion about what makes a data scientist and how to best plan your career in that direction.

Chapter 2, “Use Cases for Data Science,” provides a good overview of how business use cases are impacted by the volume, variety, and velocity of modern data streams. It also covers some real-world data science use cases in order to help you gain an understanding of its benefits in various industries and applications.

Chapter 3, “Hadoop and Data Science,” provides a quick overview of Hadoop, its evolution over the years, and the various tools in the Hadoop ecosystem. For first-time Hadoop users this chapter can be a bit overwhelming. There are many new concepts introduced including the Hadoop file system (HDFS), MapReduce, the Hadoop resource manager (YARN), and Spark. While the number of sub-projects (and weird names) that make up the Hadoop ecosystem may seem daunting, not every project is used at the same time, and the applications in the later chapters usually focus on only a few tools at a time.

Part II, “Preparing and Visualizing Data with Hadoop,” includes the next three chapters.

Chapter 4, “Getting Data into Hadoop,” focuses on data ingestion, discussing various tools and techniques to import datasets from external sources into Hadoop. It is useful for many subsequent chapters. We begin by describing the Hadoop data lake concept and then move into the various ways data can be used by the Hadoop platform. The ingestion targets two of the more popular Hadoop tools: Hive and Spark. This chapter focuses on code and hands-on solutions; if you are new to Hadoop, it’s best to also consult Appendix B, “HDFS Quick Start,” to get you up to speed on the HDFS file system.

Chapter 5, “Data Munging with Hadoop,” focuses on data munging with Hadoop, that is, how to identify and handle data quality issues, as well as how to pre-process data and prepare it for modeling. We introduce the concepts of data completeness, validity, consistency, timeliness, and accuracy. Examples of feature generation using a real data set are provided. This chapter is useful for all types of subsequent analysis and, like Chapter 4, is a precursor to many of the techniques mentioned in later chapters.

An important tool in the process of data munging is visualization. Chapter 6, “Exploring and Visualizing Data,” discusses what it means to do visualization with big data. As background, this chapter is useful for reinforcing some of the basic concepts behind data visualization. The charts presented in the chapter were generated using R. Source code for all the plots is available so readers can try these charts with their own data.

Part III, “Applying Data Modeling with Hadoop,” encompasses the final six chapters.

Chapter 7, “Machine Learning with Hadoop,” provides an overview of machine learning at a high level, covering the main tasks in machine learning such as classification and regression, clustering, and anomaly detection. For each task type, we explore the problem and the main approaches to solutions.

Chapter 8, “Predictive Modeling,” covers the basic algorithms and various Hadoop tools for predictive modeling. The chapter includes an end-to-end example of building a predictive model for sentiment analysis of Twitter text using Hive and Spark.

Chapter 9, “Clustering,” dives into cluster analysis, a very common technique in data science. It provides an overview of various clustering techniques and similarity functions, which are at the core of clustering. It then demonstrates a real-world example of using topic modeling on a large corpus of documents using Hadoop and Spark.

Chapter 10, “Anomaly Detection with Hadoop,” covers anomaly detection, describing various types of approaches and algorithms as well as how to perform large-scale anomaly detection on various datasets. It then demonstrates how to build an anomaly detection system with Spark for the KDD99 dataset.

Chapter 11, “Natural Language Processing,” covers applications of data science to the specific area of human language, using a set of techniques commonly called natural language processing (NLP). It discusses various approaches to NLP, open-source tools that are effective at various NLP tasks, and how to apply NLP to large-scale corpuses using Hadoop, Pig, and Spark. An end-to-end example shows an advanced approach to sentiment analysis that uses NLP at scale with Spark.

Chapter 12, “Data Science with Hadoop—The Next Frontier,” discusses the future of data science with Hadoop, covering advanced data discovery techniques and deep learning.

Consult Appendix A, “Book Web Page and Code Download,” for the book web page and code repository (the web page provides a question and answer forum). Appendix B, as mentioned previously, provides a quick overview of HDFS for new users, and the aforementioned Appendix C provides further references and background on Hadoop, Spark, HDFS, machine learning, and many other topics.

Book Conventions

Code and file references are displayed in a monospaced font. Code input lines that wrap because they are too long to fit on one line in this book are denoted with a continuation symbol at the start of the next line. Long output lines are wrapped at page boundaries without the symbol.

Accompanying Code

Again, please see Appendix A, “Book Web Page and Code Download,” for the location of all code used in this book.

Register your copy of Practical Data Science with Hadoop and Spark at informit.com for convenient access to downloads, updates, and corrections as they become available. To start the registration process, go to informit.com/register and log in or create an account. Enter the product ISBN (9780134024141) and click Submit. Once the process is complete, you will find any available bonus content under “Registered Products.”


Acknowledgments

Some of the figures and examples were inspired by or copied from Yahoo! (yahoo.com), the Apache Software Foundation (http://www.apache.org), and Hortonworks (http://hortonworks.com). Any copied items either had permission from the author or were available under an open sharing license.

Many people have worked behind the scenes to make this book possible. Thank you to the reviewers who took the time to carefully read the rough drafts: Fabricio Cannini, Brian D. Davison, Mark Fenner, Sylvain Jaume, Joshua Mora, Wendell Smith, and John Wilson.

Ofer Mendelevitch

I want to thank Jeff Needham and Ron Lee who encouraged me to start this book, many others at Hortonworks who helped with constructive feedback and advice, John Wilson who provided great constructive feedback and industry perspective, and of course Debra Williams Cauley for her vision and support in making this book a reality. Last but not least, this book would not have come to life without the loving support of my beautiful wife, Noa, who encouraged and supported me every step of the way, and my boys, Daniel and Jordan, who make all this hard work so worthwhile.

Casey Stella

I want to thank my patient and loving wife, Leah, and children, William and Sylvia, without whom I would not have the time to dedicate to such a time-consuming and rewarding venture. I want to thank my mother and grandmother, who instilled a love of learning that has guided me to this day. I want to thank the taxpayers of the State of Louisiana for providing a college education and access to libraries, public radio, and television, without which I would have neither the capability, the content, nor the courage to speak. Finally, I want to thank Debra Williams Cauley at Addison-Wesley who used the carrot far more than the stick.

Douglas Eadline

To Debra Williams Cauley at Addison-Wesley, your kind efforts and office at the GCT Oyster Bar made the book-writing process almost easy (again!). Thanks to my support crew, Emily, Carla, and Taylor: yet another book you know nothing about. Of course, I cannot forget my office mate, Marlee, and those two boys. And, finally, another big thank you to my wonderful wife, Maddy, for her constant support.


About the Authors

Ofer Mendelevitch is Vice President of Data Science at Lendup, where he is responsible for Lendup’s machine learning and advanced analytics group. Prior to joining Lendup, Ofer was Director of Data Science at Hortonworks, where he was responsible for helping Hortonworks’ customers apply data science with Hadoop and Spark to big data across various industries including healthcare, finance, retail, and others. Before Hortonworks, Ofer served as Entrepreneur in Residence at XSeed Capital, Vice President of Engineering at Nor1, and Director of Engineering at Yahoo!.

Casey Stella is a Principal Data Scientist at Hortonworks, which provides an open source Hadoop distribution. Casey’s primary responsibility is leading the analytics/data science team for the Apache Metron (Incubating) Project, an open source cybersecurity project. Prior to Hortonworks, Casey was an architect at Explorys, which was a medical informatics startup spun out of the Cleveland Clinic. In the more distant past, Casey served as a developer at Oracle, Research Geophysicist at ION Geophysical, and as a poor graduate student in Mathematics at Texas A&M.

Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering many aspects of HPC and Hadoop computing. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor in chief for ClusterWorld Magazine and was senior HPC editor for Linux Magazine. He has practical hands-on experience in many aspects of HPC and Apache Hadoop, including hardware and software design, benchmarking, storage, GPU, cloud computing, and parallel computing. Currently, he is a writer and consultant to the HPC/analytics industry and leader of the Limulus Personal Cluster project. He is author of the Hadoop Fundamentals LiveLessons and Apache Hadoop YARN Fundamentals LiveLessons videos from Pearson, and is co-author of Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and author of Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem, also from Addison-Wesley, and High Performance Computing for Dummies.


II
Preparing and Visualizing Data with Hadoop


4
Getting Data into Hadoop

You can have data without information, but you cannot have information without data.
Daniel Keys Moran

In This Chapter:

• The data lake concept is presented as a new data processing paradigm.
• Basic methods for importing CSV data into HDFS and Hive tables are presented.
• Additional methods for using Spark to import data into Hive tables or directly for a Spark job are presented.
• Apache Sqoop is introduced as a tool for exporting and importing relational data into and out of HDFS.
• Apache Flume is introduced as a tool for transporting and capturing streaming data (e.g., web logs) into HDFS.
• Apache Oozie is introduced as a workflow manager for Hadoop ingestion jobs.
• The Apache Falcon project is described as a framework for data governance (organization) on Hadoop clusters.

No matter what kind of data needs processing, there is often a tool for importing such data from or exporting such data into the Hadoop Di
