Practical Data Science With Hadoop - Pearsoncmg


Practical Data Science with Hadoop and Spark

Mendelevitch Book.indb i, 11/16/16 6:39 PM

Practical Data Science with Hadoop and Spark
Designing and Building Effective Analytics at Scale

Ofer Mendelevitch
Casey Stella
Douglas Eadline

Boston • Columbus • Indianapolis • New York • San Francisco • Amsterdam • Cape Town • Dubai • London • Madrid • Milan • Munich • Paris • Montreal • Toronto • Delhi • Mexico City • São Paulo • Sydney • Hong Kong • Seoul • Singapore • Taipei • Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please contact intlcs@pearson.com.

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2016955465

Copyright © 2017 Pearson Education, Inc.

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

ISBN-13: 978-0-13-402414-1
ISBN-10: 0-13-402414-1
1 16

Contents

Foreword
Preface
Acknowledgments
About the Authors

I Data Science with Hadoop—An Overview

1 Introduction to Data Science
    What Is Data Science?
    Example: Search Advertising
    A Bit of Data Science History
    Statistics and Machine Learning
    Innovation from Internet Giants
    Data Science in the Modern Enterprise
    Becoming a Data Scientist
    The Data Engineer
    The Applied Scientist
    Transitioning to a Data Scientist Role
    Soft Skills of a Data Scientist
    Building a Data Science Team
    The Data Science Project Life Cycle
    Ask the Right Question
    Data Acquisition
    Data Cleaning: Taking Care of Data Quality
    Explore the Data and Design Model Features
    Building and Tuning the Model
    Deploy to Production
    Managing a Data Science Project
    Summary

2 Use Cases for Data Science
    Big Data—A Driver of Change
    Volume: More Data Is Now Available
    Variety: More Data Types
    Velocity: Fast Data Ingest
    Business Use Cases
    Product Recommendation
    Customer Churn Analysis
    Customer Segmentation
    Sales Leads Prioritization
    Sentiment Analysis
    Fraud Detection
    Predictive Maintenance
    Market Basket Analysis
    Predictive Medical Diagnosis
    Predicting Patient Re-admission
    Detecting Anomalous Record Access
    Insurance Risk Analysis
    Predicting Oil and Gas Well Production Levels
    Summary

3 Hadoop and Data Science
    What Is Hadoop?
    Distributed File System
    Resource Manager and Scheduler
    Distributed Data Processing Frameworks
    Hadoop’s Evolution
    Hadoop Tools for Data Science
    Apache Sqoop
    Apache Flume
    Apache Hive
    Apache Pig
    Apache Spark
    R
    Python
    Java Machine Learning Packages
    Why Hadoop Is Useful to Data Scientists
    Cost Effective Storage
    Schema on Read
    Unstructured and Semi-Structured Data
    Multi-Language Tooling
    Robust Scheduling and Resource Management
    Levels of Distributed Systems Abstractions
    Scalable Creation of Models
    Scalable Application of Models
    Summary

II Preparing and Visualizing Data with Hadoop

4 Getting Data into Hadoop
    Hadoop as a Data Lake
    The Hadoop Distributed File System (HDFS)
    Direct File Transfer to Hadoop HDFS
    Importing Data from Files into Hive Tables
    Import CSV Files into Hive Tables
    Importing Data into Hive Tables Using Spark
    Import CSV Files into HIVE Using Spark
    Import a JSON File into HIVE Using Spark
    Using Apache Sqoop to Acquire Relational Data
    Data Import and Export with Sqoop
    Apache Sqoop Version Changes
    Using Sqoop V2: A Basic Example
    Using Apache Flume to Acquire Data Streams
    Using Flume: A Web Log Example Overview
    Manage Hadoop Work and Data Flows with Apache Oozie
    Apache Falcon
    What’s Next in Data Ingestion?
    Summary

5 Data Munging with Hadoop
    Why Hadoop for Data Munging?
    Data Quality
    What Is Data Quality?
    Dealing with Data Quality Issues
    Using Hadoop for Data Quality
    The Feature Matrix
    Choosing the “Right” Features
    Sampling: Choosing Instances
    Generating Features
    Text Features
    Time-Series Features
    Features from Complex Data Types
    Feature Manipulation
    Dimensionality Reduction
    Summary

6 Exploring and Visualizing Data
    Why Visualize Data?
    Motivating Example: Visualizing Network Throughput
    Visualizing the Breakthrough That Never Happened
    Creating Visualizations
    Comparison Charts
    Composition Charts
    Distribution Charts
    Relationship Charts
    Using Visualization for Data Science
    Popular Visualization Tools
    R
    Python: Matplotlib, Seaborn, and Others
    SAS
    Matlab
    Julia
    Other Visualization Tools
    Visualizing Big Data with Hadoop
    Summary

III Applying Data Modeling with Hadoop

7 Machine Learning with Hadoop
    Overview of Machine Learning
    Terminology
    Task Types in Machine Learning
    Big Data and Machine Learning
    Tools for Machine Learning
    The Future of Machine Learning and Artificial Intelligence
    Summary

8 Predictive Modeling
    Overview of Predictive Modeling
    Classification Versus Regression
    Evaluating Predictive Models
    Evaluating Classifiers
    Evaluating Regression Models
    Cross Validation
    Supervised Learning Algorithms
    Building Big Data Predictive Model Solutions
    Model Training
    Batch Prediction
    Real-Time Prediction
    Example: Sentiment Analysis
    Tweets Dataset
    Data Preparation
    Feature Generation
    Building a Classifier
    Summary

9 Clustering
    Overview of Clustering
    Uses of Clustering
    Designing a Similarity Measure
    Distance Functions
    Similarity Functions
    Clustering Algorithms
    Example: Clustering Algorithms
    k-means Clustering
    Latent Dirichlet Allocation
    Evaluating the Clusters and Choosing the Number of Clusters
    Building Big Data Clustering Solutions
    Example: Topic Modeling with Latent Dirichlet Allocation
    Feature Generation
    Running Latent Dirichlet Allocation
    Summary

10 Anomaly Detection with Hadoop
    Overview
    Uses of Anomaly Detection
    Types of Anomalies in Data
    Approaches to Anomaly Detection
    Rules-based Methods
    Supervised Learning Methods
    Unsupervised Learning Methods
    Semi-Supervised Learning Methods
    Tuning Anomaly Detection Systems
    Building a Big Data Anomaly Detection Solution with Hadoop
    Example: Detecting Network Intrusions
    Data Ingestion
    Building a Classifier
    Evaluating Performance
    Summary

11 Natural Language Processing
    Natural Language Processing
    Historical Approaches
    NLP Use Cases
    Text Segmentation
    Part-of-Speech Tagging
    Named Entity Recognition
    Sentiment Analysis
    Topic Modeling
    Tooling for NLP in Hadoop
    Small-Model NLP
    Big-Model NLP
    Textual ent Analysis Example
    Stanford CoreNLP
    Using Spark for Sentiment Analysis
    Summary

12 Data Science with Hadoop—The Next Frontier
    Automated Data Discovery
    Deep Learning
    Summary

A Book Web Page and Code Download

B HDFS Quick Start
    Quick Command Reference
    General User HDFS Commands
    List Files in HDFS
    Make a Directory in HDFS
    Copy Files to HDFS
    Copy Files from HDFS
    Copy Files within HDFS
    Delete a File within HDFS
    Delete a Directory in HDFS
    Get an HDFS Status Report (Administrators)
    Perform an FSCK on HDFS (Administrators)

C Additional Background on Data Science and Apache Hadoop and Spark
    General Hadoop/Spark Information
    Hadoop/Spark Installation Recipes
    HDFS
    MapReduce
    Spark
    Essential Tools
    Machine Learning

Index