Data Analytics Python


Data Analytics with Spark Using Python

The Pearson Addison-Wesley Data and Analytics Series

Visit informit.com/awdataseries for a complete list of available publications.

The Pearson Addison-Wesley Data and Analytics Series provides readers with practical knowledge for solving problems and answering questions with data. Titles in this series primarily focus on three areas:

1. Infrastructure: how to store, move, and manage data
2. Algorithms: how to mine intelligence or make predictions based on data
3. Visualizations: how to represent data and insights in a meaningful and compelling way

The series aims to tie all three of these areas together to help the reader build end-to-end systems for fighting spam; making recommendations; building personalization; detecting trends, patterns, or problems; and gaining insight from the data exhaust of systems and user interactions.

Make sure to connect with us! informit.com/socialconnect

Data Analytics with Spark Using Python

Jeffrey Aven

Boston, Columbus, Indianapolis, New York, San Francisco, Amsterdam, Cape Town, Dubai, London, Madrid, Milan, Munich, Paris, Montreal, Toronto, Delhi, Mexico City, São Paulo, Sydney, Hong Kong, Seoul, Singapore, Taipei, Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please contact intlcs@pearson.com.

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2018938456

© 2018 Pearson Education, Inc.

All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

Microsoft and/or its respective suppliers make no representations about the suitability of the information contained in the documents and related graphics published as part of the services for any purpose. All such documents and related graphics are provided “as is” without warranty of any kind. Microsoft and/or its respective suppliers hereby disclaim all warranties and conditions with regard to this information, including all warranties and conditions of merchantability, whether express, implied or statutory, fitness for a particular purpose, title and non-infringement. In no event shall Microsoft and/or its respective suppliers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of information available from the services. The documents and related graphics contained herein could include technical inaccuracies or typographical errors. Changes are periodically added to the information herein. Microsoft and/or its respective suppliers may make improvements and/or changes in the product(s) and/or the program(s) described herein at any time. Partial screenshots may be viewed in full within the software version specified.

Microsoft Windows and Microsoft Office are registered trademarks of the Microsoft Corporation in the U.S.A. and other countries. This book is not sponsored or endorsed by or affiliated with the Microsoft Corporation.

ISBN-13: 978-0-13-484601-9
ISBN-10: 0-13-484601-X

Editor-in-Chief: Greg Wiegand
Executive Editor: Trina MacDonald
Development Editor: Amanda Kaufmann
Managing Editor: Sandra Schroeder
Senior Project Editor: Lori Lyons
Technical Editor: Yaniv Rodenski
Copy Editor: Catherine D. Wilson
Project Manager: Dhayanidhi Karunanidhi
Indexer: Erika Millen
Proofreader: Jeanine Furino
Cover Designer: Chuti Prasertsith
Compositor: codemantra

Contents at a Glance

Preface
Introduction

I: Spark Foundations
1 Introducing Big Data, Hadoop, and Spark
2 Deploying Spark
3 Understanding the Spark Cluster Architecture
4 Learning Spark Programming Basics

II: Beyond the Basics
5 Advanced Programming Using the Spark Core API
6 SQL and NoSQL Programming with Spark
7 Stream Processing and Messaging Using Spark
8 Introduction to Data Science and Machine Learning Using Spark

Index

Table of Contents

Preface
Introduction

I: Spark Foundations

1 Introducing Big Data, Hadoop, and Spark
    Introduction to Big Data, Distributed Computing, and Hadoop
    A Brief History of Big Data and Hadoop
    Hadoop Explained
    Introduction to Apache Spark
    Apache Spark Background
    Uses for Spark
    Programming Interfaces to Spark
    Submission Types for Spark Programs
    Input/Output Types for Spark Applications
    The Spark RDD
    Spark and Hadoop
    Functional Programming Using Python
    Data Structures Used in Functional Python Programming
    Python Object Serialization
    Python Functional Programming Basics
    Summary

2 Deploying Spark
    Spark Deployment Modes
    Local Mode
    Spark Standalone
    Spark on YARN
    Spark on Mesos
    Preparing to Install Spark
    Getting Spark
    Installing Spark on Linux or Mac OS X
    Installing Spark on Windows
    Exploring the Spark Installation
    Deploying a Multi-Node Spark Standalone Cluster
    Deploying Spark in the Cloud
    Amazon Web Services (AWS)
    Google Cloud Platform (GCP)
    Databricks
    Summary

3 Understanding the Spark Cluster Architecture
    Anatomy of a Spark Application
    Spark Driver
    Spark Workers and Executors
    The Spark Master and Cluster Manager
    Spark Applications Using the Standalone Scheduler
    Spark Applications Running on YARN
    Deployment Modes for Spark Applications Running on YARN
    Client Mode
    Cluster Mode
    Local Mode Revisited
    Summary

4 Learning Spark Programming Basics
    Introduction to RDDs
    Loading Data into RDDs
    Creating an RDD from a File or Files
    Methods for Creating RDDs from a Text File or Files
    Creating an RDD from an Object File
    Creating an RDD from a Data Source
    Creating RDDs from JSON Files
    Creating an RDD Programmatically
    Operations on RDDs
    Key RDD Concepts
    Basic RDD Transformations
    Basic RDD Actions
    Transformations on PairRDDs
    MapReduce and Word Count Exercise
    Join Transformations
    Joining Datasets in Spark
    Transformations on Sets
    Transformations on Numeric RDDs
    Summary

II: Beyond the Basics

5 Advanced Programming Using the Spark Core API
    Shared Variables in Spark
    Broadcast Variables
    Accumulators
    Exercise: Using Broadcast Variables and Accumulators
    Partitioning Data in Spark
    Partitioning Overview
    Controlling Partitions
    Repartitioning Functions
    Partition-Specific or Partition-Aware API Methods
    RDD Storage Options
    RDD Lineage Revisited
    RDD Storage Options
    RDD Caching
    Persisting RDDs
    Choosing When to Persist or Cache RDDs
    Checkpointing RDDs
    Exercise: Checkpointing RDDs
    Processing RDDs with External Programs
    Data Sampling with Spark
    Understanding Spark Application and Cluster Configuration
    Spark Environment Variables
    Spark Configuration Properties
    Optimizing Spark
    Filter Early, Filter Often
    Optimizing Associative Operations
    Understanding the Impact of Functions and Closures
    Considerations for Collecting Data
    Configuration Parameters for Tuning and Optimizing Applications
    Avoiding Inefficient Partitioning
    Diagnosing Application Performance Issues
    Summary

6 SQL and NoSQL Programming with Spark
    Introduction to Spark SQL
    Introduction to Hive
    Spark SQL Architecture
    Getting Started with DataFrames
    Using DataFrames
    Caching, Persisting, and Repartitioning DataFrames
    Saving DataFrame Output
    Accessing Spark SQL
    Exercise: Using Spark SQL
    Using Spark with NoSQL Systems
    Introduction to NoSQL
    Using Spark with HBase
    Exercise: Using Spark with HBase
    Using Spark with Cassandra
    Using Spark with DynamoDB
    Other NoSQL Platforms
    Summary

7 Stream Processing and Messaging Using Spark
    Introducing Spark Streaming
    Spark Streaming Architecture
    Introduction to DStreams
    Exercise: Getting Started with Spark Streaming
    State Operations
    Sliding Window Operations
    Structured Streaming
    Structured Streaming Data Sources
    Structured Streaming Data Sinks
    Output Modes
    Structured Streaming Operations
    Using Spark with Messaging Platforms
    Apache Kafka
    Exercise: Using Spark with Kafka
    Amazon Kinesis
    Summary

8 Introduction to Data Science and Machine Learning Using Spark
    Spark and R
    Introduction to R
    Using Spark with R
    Exercise: Using RStudio with SparkR
    Machine Learning with Spark
    Machine Learning Primer
    Machine Learning Using Spark MLlib
    Exercise: Implementing a Recommender Using Spark MLlib
    Machine Learning Using Spark ML
    Using Notebooks with Spark
    Using Jupyter (IPython) Notebooks with Spark
    Using Apache Zeppelin Notebooks with Spark
    Summary

Index

Preface

Spark is at the heart of the disruptive Big Data and open source software revolution. The interest in and use of Spark have grown exponentially, with no signs of abating. This book will prepare you, step by step, for a prosperous career in the Big Data analytics field.

Focus of the Book

This book focuses on the fundamentals of the Spark project, starting from the core and working outward into Spark’s various extensions, related projects and subprojects, and the broader ecosystem of open source technologies such as Hadoop, Kafka, Cassandra, and more.

Although the foundational Spark concepts covered in this book (including the runtime, cluster, and application architecture) are language independent, the majority of the programming examples and exercises in this book are written in Python. The Python API for Spark (PySpark) provides an intuitive programming environment for data analysts, data engineers, and data scientists alike, offering developers the flexibility and extensibility of Python with the distributed processing power and scalability of Spark.

The scope of this book is quite broad, covering aspects of Spark from core Spark programming to Spark SQL, Spark Streaming, machine learning, and more. This book provides a good introduction and overview for each topic, enough of a platform for you to build upon in any particular area or discipline within the Spark project.

Who Should Read This Book

This book is intended for data analysts and engineers looking to enter the Big Data space or consolidate their knowledge in this area. The demand for engineers with skills in Big Data and its preeminent processing framework, Spark, is exceptionally high at present. This book aims to prepare readers for this growing employment market and arm them with the skills employers are looking for.

Python experience is useful but not strictly necessary for readers of this book, as Python is quite intuitive for anyone with any programming experience whatsoever. A good working knowledge of data analysis and manipulation would also be helpful. This book is especially well suited to data warehouse professionals interested in expanding their careers into the Big Data area.

How to Use This Book

This book is structured into two parts and eight chapters. Part I, “Spark Foundations,” includes four chapters designed to build a solid understanding of what Spark is, how to deploy Spark, and how to use Spark for basic data processing operations:

Chapter 1, “Introducing Big Data, Hadoop, and Spark,” provides a good overview of the Big Data ecosystem, including the genesis and evolution of the Spark project. Key properties of the Spark project are discussed, including what Spark is and how it is used, as well as how Spark relates to the Hadoop project.

Chapter 2, “Deploying Spark,” demonstrates how to deploy a Spark cluster, including the various Spark cluster deployment modes and the different ways you can leverage Spark.

Chapter 3, “Understanding the Spark Cluster Architecture,” discusses how Spark clusters and applications operate, providing a solid understanding of exactly how Spark works.

Chapter 4, “Learning Spark Programming Basics,” focuses on the basic programming building blocks of Spark using the Resilient Distributed Dataset (RDD) API.

Part II, “Beyond the Basics,” includes the final four chapters, which extend beyond the Spark core into its uses with SQL and NoSQL systems, streaming applications, and data science and machine learning:

Chapter 5, “Advanced Programming Using the Spark Core API,” covers advanced constructs used to extend, accelerate, and optimize Spark routines, including different shared variables and RDD storage and partitioning concepts and implementations.

Chapter 6, “SQL and NoSQL Programming with Spark,” discusses Spark’s integration into the vast SQL landscape as well as its integration with non-relational stores.

Chapter 7, “Stream Processing and Messaging Using Spark,” introduces the Spark streaming project

