Limitation of Hadoop Map Reduce

Limitation of Hadoop Map Reduce
Slow due to replication
Inefficient for:
– Iterative algorithms (Machine Learning, Graphs & network analysis)
– Interactive Data Mining (R)
[Diagram: Iter. 1 and Iter. 2 each perform an HDFS Read, Map, Reduce, and HDFS Write of their Input/Output, so every iteration pays the full HDFS read/write and replication cost.]

Why Spark as Solution: In-Memory Data Sharing
[Diagram: with in-memory data sharing, iter. 1 and iter. 2 pass data through memory instead of HDFS, and one-time processing of the Input serves query 1, query 2, query 3, . . .]

What is Spark
A Big Data analytics cluster-computing framework written in Scala
Open source; originally developed at the AMP Lab @ UC Berkeley
Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x)
Designed for iterative algorithms and interactive analytics
Highly compatible with Hadoop's storage APIs
– Can run on an existing Hadoop cluster setup
Developers can write driver programs using multiple languages

The Spark Stack & Architecture

Core Spark Concepts
Spark Context (SparkContext object)
– The driver program's access point
– Represents a connection to a computing cluster
– Used to build resilient distributed datasets, or RDDs
RDD: Resilient Distributed Datasets
– Immutable data structure
– Fault tolerant
– Parallel data structure
– In-memory (explicitly)
[Diagram: Input is loaded into an RDD and transformed through a chain of further RDDs.]
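As a rough illustration of these two concepts, the sketch below (assuming Spark's Scala API; the application name, master URL, and sample data are placeholders) builds a SparkContext in a driver program and uses it to create an RDD:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyDriverApp").setMaster("local[*]")   // placeholder app name and master
val sc = new SparkContext(conf)                 // the driver program's access point to the cluster
val numbers = sc.parallelize(Seq(1, 2, 3, 4))   // an immutable, partitioned RDD built from a local collection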

Spark-Shell

Hello World!! : Text Search
[Screenshot of a spark-shell session with callouts: Spark Context; RDD is created; RDD Operation: Transformation; RDD Operation: Action]
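A minimal spark-shell version of this text-search example might look like the following (the file path and search term are illustrative; in the shell the SparkContext is already available as sc):

val lines = sc.textFile("input.txt")                      // RDD is created from an external file
val hits = lines.filter(line => line.contains("Spark"))   // RDD operation: transformation (lazy)
val n = hits.count()                                      // RDD operation: action (triggers the job)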

RDD: Resilient Distributed Dataset
An RDD in Spark is simply a distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
RDDs can contain any type of Python, Java or Scala objects, including user-defined classes.
An RDD is created in two ways:
– 1. Loading external data
– 2. Distributing a collection of objects from the driver program
Operations on RDDs:
– Transformations: construct a new RDD from a previous one
– Actions: compute a result based on an RDD, and either return it to the driver program or save it to an external storage system
Spark computes RDDs in a lazy fashion
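The second creation method and the lazy evaluation can be sketched as follows (the data is made up for illustration):

val nums = sc.parallelize(1 to 1000000)       // RDD created by distributing a collection from the driver program
val squares = nums.map(x => x * x)            // transformation: nothing is computed yet
println(squares.take(5).mkString(", "))       // action: only now does Spark actually run the computation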

Resilient Distributed Dataset Contd.
// load an external text file as an RDD; the slide filters Spark's own README, whose text
// contains "high-level APIs in Scala, Java, and Python, and an optimized engine that ..."
val lines = sc.textFile("README.md")
val pythonLines = lines.filter(line => line.contains("Python"))

RDD Operations
Transformations
– Create a new dataset from an existing one
– Lazy in nature: they are executed only when some action is performed
– Example: map(), filter(), distinct()
Actions
– Return a value to the driver program, or export data to a storage system, after performing a computation
– Example: count(), reduce(), collect(), take()
Persistence
– For caching data in memory for future operations
– Options to store on disk or RAM or mixed (storage level)
– Example: persist(), cache()
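A small sketch tying the three groups together (the sample data is made up):

val nums = sc.parallelize(Seq(1, 2, 2, 3, 4, 4))
val unique = nums.distinct()               // transformation: builds a new RDD, nothing runs yet
unique.cache()                             // persistence: keep the result in memory once computed
val total = unique.reduce(_ + _)           // action: runs a job and returns a value to the driver
val firstTwo = unique.take(2)              // action: a second job, served from the cached RDD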

RDD: Lazy Fashion & Persist/Caching
Lazy fashion:
– RDDs are actually computed (created) the first time they are used in an action.
Persist:
– Spark's RDDs are by default recomputed each time you run an action on them.
– To use an RDD in multiple actions, ask Spark to store it in memory by using persist, i.e. RDD.persist().
– RDD.unpersist(): remove it from the cache.
– Persisting RDDs on disk instead of memory is also possible.
[Table: storage levels MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, DISK_ONLY, DISK_ONLY_2, compared by space used, CPU time, in-memory vs. on-disk placement, and number of nodes holding each partition; MEMORY_AND_DISK spills to disk if there is too much data to fit in memory.]
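A sketch of persisting an RDD with an explicit storage level (the log file path and filter condition are illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("logs.txt")                       // illustrative path
val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)             // spill to disk if it does not fit in memory
println(errors.count())                                  // first action computes and caches the RDD
println(errors.take(10).mkString("\n"))                  // later actions reuse the cached partitions
errors.unpersist()                                       // remove it from the cache when no longer needed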

RDD Fault Tolerance
Spark keeps track of the set of dependencies between different RDDs, called the lineage graph.
It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost.
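The lineage graph of an RDD can be inspected from the shell; this small sketch (file path illustrative) prints the chain of parent RDDs Spark would replay to recompute lost partitions:

val lines = sc.textFile("input.txt")
val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
println(words.toDebugString)   // prints the lineage graph (the chain of parent RDDs back to the input file)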

Spark Streaming
Framework for large-scale stream processing
– Scales to 100s of nodes
– Can achieve second-scale latencies
– Integrates with Spark's batch and interactive processing
– Provides a simple batch-like API for implementing complex algorithms
– Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

Motivation
Many important applications must process large streams of live data and provide results in near-real-time:
– Social network trends
– Website statistics
– Intrusion detection systems
– etc.
Require large clusters to handle workloads
Require latencies of a few seconds

Stateful Stream Processing
Traditional streaming systems have an event-driven, record-at-a-time processing model
– Each node has mutable state
– For each record, update state & send new records
State is lost if a node dies!
Making stateful stream processing fault-tolerant is challenging

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
– Chop up the live stream into batches of X seconds
– Spark treats each batch of data as RDDs and processes them using RDD operations
– Finally, the processed results of the RDD operations are returned in batches
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
– Batch sizes as low as ½ second, latency of about 1 second
– Potential for combining batch processing and streaming processing in the same system
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Key concepts
DStream – sequence of RDDs representing a stream of data
– Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to another
– Standard RDD operations – map, countByValue, reduce, join, ...
– Stateful operations – window, countByValueAndWindow, ...
Output Operations – send data to an external entity
– saveAsHadoopFiles – saves to HDFS
– foreach – do anything with each batch of results
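To make these concepts concrete, here is a minimal word-count sketch over a DStream (host, port, and batch interval are illustrative, and the TCP socket stands in for any of the sources listed above):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))            // 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)       // DStream from a TCP socket (illustrative source)
val counts = lines.flatMap(_.split(" ")).countByValue()   // standard RDD-style transformations per batch
counts.print()                                            // output operation: print a sample of each batch
ssc.start()                                               // start receiving and processing data
ssc.awaitTermination()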

Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the Twitter Streaming API feeds batch @ t, batch @ t+1, batch @ t+2 into the tweets DStream; each batch is stored in memory as an RDD (immutable, distributed).]

Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
Transformation: modify data in one DStream to create another DStream (hashTags is a new DStream)
[Diagram: flatMap is applied to each batch of the tweets DStream, creating a new RDD for every batch of the hashTags DStream, e.g. [#cat, #dog, ...].]

Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Output operation: push data to external storage
[Diagram: every batch of the hashTags DStream is saved to HDFS.]

Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data
Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
Data lost due to worker failure can be recomputed from the input data
[Diagram: input data for the tweets RDD is replicated in memory; lost partitions of the hashTags RDD, derived via flatMap, are recomputed on other workers.]

Spark: Machine Learning (MLlib)
MLlib is a standard component of Spark providing machine learning primitives on top of Spark
Algorithms supported by MLlib:
– Classification: SVM
– Regression: linear regression and random forests
– Collaborative Filtering: Alternating Least Squares (ALS)
– Clustering: K-means
– Dimensionality Reduction: Singular Value Decomposition (SVD)
– Basic Statistics: summary statistics, correlation, hypothesis testing
– Feature Extraction and Sampling
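As one concrete example, a K-means job against MLlib's RDD-based API could be sketched like this (the input file name and its format, one space-separated vector per line, are assumptions):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val raw = sc.textFile("kmeans_data.txt")                                       // illustrative input file
val points = raw.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()   // iterative algorithm: keep data in memory
val model = KMeans.train(points, 2, 20)                                        // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)                                          // learned cluster centers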

Collaborative Filtering
Recover a rating matrix from a subset of its entries.
[Chart: ALS wall-clock time]

Collaborative Filtering
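A sketch of ALS on MLlib (the ratings file name and its user,item,rating CSV layout are assumptions):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, item, score) = line.split(',')          // assumed format: user,item,rating
  Rating(user.toInt, item.toInt, score.toDouble)
}
val model = ALS.train(ratings, 10, 10, 0.01)              // rank = 10, 10 iterations, lambda = 0.01
println(model.predict(1, 42))                             // predicted rating of item 42 for user 1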

Spark SQL
Spark SQL unifies access to structured data.
Load and query data from a variety of sources.
Run unmodified Hive queries on existing warehouses.
Connect through JDBC or ODBC: Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity.
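A small sketch using the Spark 1.x SQLContext API that this deck is contemporary with (the JSON file, table name, and query are illustrative):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")           // load structured data (illustrative path)
people.registerTempTable("people")                        // expose it to SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)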

The Spark Community

Vision - one stack to rule them all

A New Feature Addition to MLlib
ELM: Extreme Learning Machine
– A recent and fast learning method
– Based on a single-layer neural network model
– Works 100x faster than the backpropagation algorithm
– Better training accuracy compared to SVM*
– It uses Singular Value Decomposition (SVD) for computation, which is already supported by Spark MLlib
– Very recent research publications (2014) demonstrate a parallel and distributed model of ELM

References (google Apache Spark)

Questions?

Thank You!!!!
Video About Google: https://www.youtube.com/watch?v=p0ysH2Glw5w

