YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing


YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing
Eric Charles [http://echarles.net] @echarles
Datalayer [http://datalayer.io] @datalayerio
FOSDEM, 02 Feb 2014, NoSQL DevRoom
@datalayerio hacks@datalayer.io https://github.com/datalayer

Eric Charles (@echarles)
eric@apache.org
Java Developer
Apache Member
Apache James Committer
Apache Onami Committer
Apache HBase Contributor
Worked in London with Hadoop, Hive, Cascading, HBase, Cassandra, Elasticsearch, Kafka and Storm
Just founded Datalayer

Map Reduce V1 Limits
Scalability
  Maximum cluster size: 4,000 nodes
  Maximum concurrent tasks: 40,000
  Coarse synchronization in the JobTracker
Availability
  JobTracker failure kills all queued and running jobs
No alternate paradigms and services
Iterative applications implemented using MapReduce are slow (HDFS read/write)
Map Reduce V2 ("NextGen") is based on YARN (this is not the 'mapred' vs 'mapreduce' package distinction)

YARN as a Layer
"All problems in computer science can be solved by another level of indirection" (David Wheeler)
Diagram layers: Hive, Pig, Map Reduce, Cluster and Resource Management, HDFS
YARN, a.k.a. Hadoop 2.0, separates the cluster and resource management from the processing components

Components
A global ResourceManager
A per-node slave NodeManager
A per-application ApplicationMaster running on a NodeManager
A per-application Container running on a NodeManager
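The division of labour between these four components can be pictured with a toy model. The sketch below is not the YARN API: the class names mirror the slide, while the method names, the memory figures and the first-fit placement rule are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A slice of one node's memory granted to an application (toy model)."""
    app_id: str
    node: str
    memory_mb: int


@dataclass
class NodeManager:
    """Per-node agent: tracks the memory still free on its node."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, memory_mb):
        self.free_mb -= memory_mb
        container = Container(app_id, self.name, memory_mb)
        self.containers.append(container)
        return container


class ResourceManager:
    """Global arbiter: picks a node for every container request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # Assumption: first-fit placement; real schedulers are far richer.
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                return node.launch(app_id, memory_mb)
        return None  # no node has enough free memory


class ApplicationMaster:
    """Per-application coordinator: runs in a container and asks for more."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        # The AM itself occupies a container on some NodeManager.
        self.own_container = rm.allocate(app_id, 512)

    def request_workers(self, count, memory_mb):
        granted = [self.rm.allocate(self.app_id, memory_mb) for _ in range(count)]
        return [c for c in granted if c is not None]


rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 2048)])
am = ApplicationMaster("app-1", rm)
workers = am.request_workers(count=3, memory_mb=1024)
print(am.own_container.node, [w.node for w in workers])
```

The point of the toy is the direction of the calls: the ApplicationMaster negotiates with the ResourceManager, and the NodeManagers only host what has been granted.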

Yahoo!
Yahoo! has been running 35000 nodes of YARN in production for over 8 months now since begin -batch-to-continuous-computing-at-yahoo.html ]

Twitter

Get It!
Download / Unzip and configure
mapred-site.xml
  mapreduce.framework.name = yarn
yarn-site.xml
  yarn.nodemanager.aux-services = mapreduce_shuffle
  yarn.nodemanager.aux-services.mapreduce_shuffle.class
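The two files named on this slide use Hadoop's `<configuration>` / `<property>` XML layout. As a rough sketch, the properties can be rendered with Python's standard library. The helper name `hadoop_site_xml` is hypothetical. The underscore spelling `mapreduce_shuffle` and the `ShuffleHandler` class name are assumptions taken from the stock Hadoop 2.2 setup, since the slide text does not show them in full.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict as a Hadoop-style <configuration> document (hypothetical helper)."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# mapred-site.xml: run MapReduce jobs on YARN.
mapred_site = hadoop_site_xml({"mapreduce.framework.name": "yarn"})

# yarn-site.xml: register the shuffle service on every NodeManager.
yarn_site = hadoop_site_xml({
    "yarn.nodemanager.aux-services": "mapreduce_shuffle",
    # Assumption: class name as in the stock Hadoop 2.2 setup, not read from the slide.
    "yarn.nodemanager.aux-services.mapreduce_shuffle.class":
        "org.apache.hadoop.mapred.ShuffleHandler",
})

print(mapred_site)
print(yarn_site)
```

Writing each string to the matching file in the Hadoop configuration directory is left out, because the install path depends on the distribution.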

Namenode: http://namenode:50070
Namenode Browser: http://namenode:50075/logs
Secondary Namenode: http://snamenode:50090
Resource Manager
Application Status
Resource Node Manager
Mapreduce JobHistory Server
p://manager:8089/proxy/ app-id
http://manager:8042/node

YARNed
Batch: Map Reduce, Hive / Pig / Cascading / .
Streaming: Storm, Spark, Kafka
Realtime: HBase, Memcached
Graph: Giraph, Hama
OpenMPI

Batch
Apache Tez: fast response times and extreme throughput to execute complex DAGs of tasks
“The future of #Hadoop runs on #Tez”
Diagram: three stacks
  MR-V1: Hive, Pig, Cascading on Map Reduce on HDFS
  MR V2: Hive, Pig, Cascading on MR V2 on YARN on HDFS
  Under Development: Hive, Pig, Cascading, MR on Tez on YARN on HDFS
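The "DAG of tasks" idea can be sketched independently of Tez itself. The example below is not the Tez API: it uses Python's standard `graphlib` module (Python 3.9+), and the task names and the `run_dag` helper are hypothetical. It only shows that a job expressed as a dependency graph can run every task as soon as its inputs are finished, instead of being forced into a fixed chain of map and reduce stages.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of tasks it depends on.
dag = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def run_dag(dag, run_task):
    """Run every task after its dependencies, one wave of ready tasks at a time."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs are all done
        for task in ready:
            run_task(task)
            sorter.done(task)
        waves.append(ready)
    return waves


executed = []
waves = run_dag(dag, executed.append)
print(waves)
```

Tasks in the same wave have no dependency on each other, so a real engine could run them in parallel; here they run sequentially to keep the sketch short.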

Streaming: Storm / Spark / Kafka
Storm [https://github.com/yahoo/storm-yarn]
  Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS
Spark
  Need to build a YARN-enabled assembly JAR
  Goal is more to integrate Map Reduce, e.g. SIMR supports MRV1
Kafka with Samza [http://samza.incubator.apache.org]
  Implements StreamTask
  Execution engine: YARN
  Storage layer: Kafka, not HDFS
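Samza's StreamTask is a Java interface built around a per-message process callback. The Python sketch below only mimics that shape and is not Samza code: `Collector`, `WordCountTask` and `run` are hypothetical names, and plain lists stand in for Kafka topics.

```python
from collections import defaultdict


class Collector:
    """Stand-in for an output stream; in Samza this would be a Kafka topic."""

    def __init__(self):
        self.sent = []

    def send(self, stream, message):
        self.sent.append((stream, message))


class WordCountTask:
    """Hypothetical task: called once per incoming message, keeps running counts."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, message, collector):
        for word in message.split():
            self.counts[word] += 1
            collector.send("word-counts", (word, self.counts[word]))


def run(task, input_stream, collector):
    # The framework, not the task, owns the loop that feeds messages in.
    for message in input_stream:
        task.process(message, collector)


collector = Collector()
run(WordCountTask(), ["yarn storm", "yarn samza"], collector)
print(collector.sent)
```

The task never pulls data itself; it reacts to one message at a time and emits results to another stream, which is what lets the execution engine decide where and how many task instances to run.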

@Yahoo!
From “Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing YDN Blog - Yahoo.html”

HBase
Hoya [https://github.com/hortonworks/hoya.git]
Allows users to create on-demand HBase clusters
Allows different users/applications to run different versions of HBase
Allows users to configure different HBase instances differently
Stop / Suspend / Resume clusters as needed
Expand / shrink clusters as needed
CLI based
Diagram: Hoya Client, YARN Resource Manager, YARN Node Managers hosting the Hoya AM [HBase Master] and HBase Region Servers, HDFS

Graph: Giraph / Hama
Giraph
  Offline batch processing of semi-structured graph data on a massive scale
  Compatible with Hadoop 2.x
    "Pure YARN" build profile
  Manages failure scenarios
    Worker/container failure during a job?
    What happens if our App Master fails during a job?
  Application Master allows natural bootstrapping of Giraph jobs
Next steps
  Zookeeper in AM
  Own management web UI
Abstracting the Giraph framework logic away from MapReduce has made porting Giraph to other platforms like Mesos possible (from “Giraph on YARN - Qcon SF”)

Options
Apache Mesos
  Cluster manager
  Can run Hadoop, Jenkins, Spark, Aurora.
  s -vs-mesos/
Apache Helix
  Generic cluster management framework
  YARN automates service deployment, resource allocation, and code distribution. However, it leaves state management and fault-handling mostly to the application developer.
  Helix focuses on service operation but relies on manual hardware provisioning and service deployment.

You Loser!
More DevOps and IO
Tuning and debugging the Application Master and Container is hard
Both AM and RM are based on an asynchronous event framework
  No flow control
Deal with RPC connection loss: split brain, AM recovery. !!!
  What happens if a worker/container or an App Master fails?
New Application Master per MR job: no JVM reuse for MR
  Tez-on-YARN will fix these
No long-living Application Master (see YARN-896)
New application code development is difficult
Resource Manager SPOF (chuch. don't even ask this)
No mixed V1/V2 Map Reduce (supported by some commercial distributions)

You Rocker!
Sort and shuffle speed gain for Map Reduce
Real-time processing with batch processing
  Collocation brings elasticity to share resources (Memory/CPU/.)
  Sharing data between realtime and batch: reduces network transfers and total cost of acquiring the data
High expectations from #Tez
  Long-living sessions
  Avoid HDFS read/write
High expectations from #Twill
  Remote procedure calls between containers
  Lifecycle management
  Logging

Your App?
Why port your app to YARN?
Benefit from existing *-yarn projects
Reuse unused cluster resources
Common monitoring, management and security framework
Avoid HDFS write on reduce (via Tez)
Abstract and port to other platforms

Summary
YARN brings
  One component, one responsibility!!!
    Resource Management
    Data Processing
  Multiple applications and patterns in Hadoop
Many organizations are already building and using applications on YARN
Try YARN and contribute!

Thank You!
Questions?
(Special thanks to @acmurthy and @steveloughran for helping tweets)
@echarles yer.io/jobs

