
Apache Spark

About the Tutorial

Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations, which includes interactive queries and stream processing. This is a brief tutorial that explains the basics of Spark Core programming.

Audience

This tutorial has been prepared for professionals aspiring to learn the basics of Big Data Analytics using the Spark framework and become a Spark developer. In addition, it would be useful for analytics professionals and ETL developers as well.

Prerequisite

Before you start proceeding with this tutorial, we assume that you have prior exposure to Scala programming, database concepts, and any of the Linux operating system flavors.

Copyright & Disclaimer

Copyright 2015 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About the Tutorial
Audience
Prerequisite
Copyright & Disclaimer
Table of Contents

1. SPARK – INTRODUCTION
   Apache Spark
   Evolution of Apache Spark
   Features of Apache Spark
   Spark Built on Hadoop
   Components of Spark

2. SPARK – RDD
   Resilient Distributed Datasets
   Data Sharing is Slow in MapReduce
   Iterative Operations on MapReduce
   Interactive Operations on MapReduce
   Data Sharing using Spark RDD
   Iterative Operations on Spark RDD
   Interactive Operations on Spark RDD

3. SPARK – INSTALLATION
   Step 1: Verifying Java Installation
   Step 2: Verifying Scala Installation
   Step 3: Downloading Scala
   Step 4: Installing Scala
   Step 5: Downloading Apache Spark
   Step 6: Installing Spark
   Step 7: Verifying the Spark Installation

4. SPARK – CORE PROGRAMMING
   Spark Shell
   RDD
   Transformations
   Actions
   Programming with RDD
   Unpersist the Storage

5. SPARK – DEPLOYMENT
   Spark-submit Syntax

6. ADVANCED SPARK PROGRAMMING
   Broadcast Variables
   Accumulators
   Numeric RDD Operations

1. SPARK – INTRODUCTION

Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce), and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. Here, the main concern is to maintain speed in processing large datasets, in terms of waiting time between queries and waiting time to run the program.

Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.

Contrary to common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.

Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark became a top-level Apache project in February 2014.

Features of Apache Spark

Apache Spark has the following features.

Speed: Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk, since it stores the intermediate processing data in memory.

Supports multiple languages: Spark provides built-in APIs in Java, Scala, and Python. Therefore, you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.

Advanced Analytics: Spark not only supports 'Map' and 'Reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.

Spark Built on Hadoop

The following diagram shows three ways of how Spark can be built with Hadoop components.

Figure: Spark built on Hadoop

There are three ways of Spark deployment, as explained below; a rough sketch of how each mode is selected follows this list.

Standalone: Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.

Hadoop YARN: Hadoop YARN deployment means, simply, that Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.

Spark in MapReduce (SIMR): Spark in MapReduce is used to launch a Spark job in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
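As a rough sketch of how the first two modes are selected in practice, the same Spark shell can be pointed at different cluster managers through its --master option. The host name and port below are placeholders, not values from this tutorial.

Standalone: connect to a Spark standalone master (placeholder host and port):
$ spark-shell --master spark://master-host:7077

Hadoop YARN, client mode, as in Spark 1.x:
$ spark-shell --master yarn-client

Local mode with 4 worker threads, useful for trying things out:
$ spark-shell --master local[4]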

Components of Spark

The following illustration depicts the different components of Spark.

Figure: Components of Spark

Apache Spark Core

Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and referencing of datasets in external storage systems.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework above Spark, owing to the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).

GraphX

GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
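To make the mini-batch idea behind Spark Streaming concrete, here is a minimal word-count sketch against a TCP socket. It is an illustration only, not an example from this tutorial; the host, port, and batch interval are assumed values (such a stream can be fed locally with nc -lk 9999).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive data, one to process it
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")

    // Group the incoming stream into 10-second mini-batches
    val ssc = new StreamingContext(conf, Seconds(10))

    // Read lines from a TCP socket (placeholder host and port)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each mini-batch is handled with ordinary RDD-style transformations
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // run until the job is stopped
  }
}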

2. SPARK – RDD

Resilient Distributed Datasets

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat. A short sketch of both ways is given at the end of this chapter.

Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations, using a set of high-level operators, without having to worry about work distribution and fault tolerance.

Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g. between two MapReduce jobs) is to write it to an external stable storage system (e.g. HDFS). Although this framework provides numerous abstractions for accessing a cluster's computational resources, users still want more.

Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Regarding the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Iterative Operations on MapReduce

Iterative applications reuse intermediate results across multiple computations in multi-stage applications. The following illustration explains how the current framework works while doing iterative operations on MapReduce. This incurs substantial overheads due to data replication, disk I/O, and serialization, which makes the system slow.

Figure: Iterative operations on MapReduce

Interactive Operations on MapReduce

The user runs ad-hoc queries on the same subset of data. Each query does disk I/O against the stable storage, which can dominate the application's execution time.

The following illustration explains how the current framework works while doing interactive queries on MapReduce.

Figure: Interactive operations on MapReduce

Data Sharing using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is Resilient Distributed Datasets (RDD); it supports in-memory processing computation. This means it stores the state of memory as an object across the jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than network and disk.

Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD

The illustration given below shows the iterative operations on Spark RDD. It stores intermediate results in a distributed memory instead of stable storage (disk), which makes the system faster.

Note: If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), then it will store those results on the disk.

Figure: Iterative operations on Spark RDD

Interactive Operations on Spark RDD

This illustration shows interactive operations on Spark RDD. If different queries are run on the same set of data repeatedly, this particular data can be kept in memory for better execution times.

Figure: Interactive operations on Spark RDD

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
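As a concrete illustration of both ways of creating an RDD, and of persisting and releasing one, here is a minimal Scala sketch for the Spark shell, where sc is the preconfigured SparkContext. The HDFS path is a placeholder, not a file from this tutorial.

// Create an RDD by parallelizing an existing collection in the driver
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Or create one by referencing a dataset in external storage (placeholder path)
val lines = sc.textFile("hdfs://namenode:9000/user/hadoop/input.txt")

// Persist a transformed RDD in memory so repeated actions can reuse it
val squares = data.map(x => x * x).cache()

squares.count()   // first action computes the partitions and caches them
squares.collect() // second action reads from memory instead of recomputing

// Release the cached partitions when they are no longer needed
squares.unpersist()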

3. SPARK – INSTALLATION

Spark is Hadoop's sub-project. Therefore, it is better to install Spark on a Linux based system. The following steps show how to install Apache Spark.

Step 1: Verifying Java Installation

Java installation is one of the mandatory things in installing Spark. Try the following command to verify the Java version.

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, then install Java before proceeding to the next step.

Step 2: Verifying Scala Installation

You should have the Scala language installed to implement Spark. Let us verify the Scala installation using the following command.

$ scala -version

If Scala is already installed on your system, you get to see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don't have Scala installed on your system, then proceed to the next step for Scala installation.

Step 3: Downloading Scala

Download the latest version of Scala by visiting the following link: Download Scala. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.

Step 4: Installing Scala

Follow the steps given below for installing Scala.

Extract the Scala tar file

Type the following command for extracting the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move Scala software files

Use the following commands for moving the Scala software files to the respective directory (/usr/local/scala).

$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala

Use the following command for setting the PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, it is better to verify it. Use the following command for verifying the Scala installation.

$ scala -version

If Scala is already installed on your system, you get to see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark

Download the latest version of Spark by visiting the following link: Download Spark. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the download folder.

Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extracting the Spark tar file

Type the following command for extracting the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving Spark software files

Use the following commands for moving the Spark software files to the respective directory (/usr/local/spark).

$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. It means adding the location where the Spark software files are located to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command for sourcing the ~/.bashrc file.

$ source ~/.bashrc

Step 7: Verifying the Spark Installation

Write the following command for opening the Spark shell.

$ spark-shell

If Spark is installed successfully, the shell starts and leaves you at the Spark prompt.
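Once the shell is up, a quick way to confirm that the installation works end to end is to run a small job from the prompt. This check is a sketch added here, not part of the original tutorial; summing the integers 1 to 100 should return 5050.

scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050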
