Amazon EMR Best Practices


Amazon Web Services – Best Practices for Amazon EMR

Best Practices for Amazon EMR

Parviz Deyhim

August 2013

(Please consult http://aws.amazon.com/whitepapers/ for the latest version of this paper)

Table of Contents

Abstract
Introduction
Moving Data to AWS
    Scenario 1: Moving Large Amounts of Data from HDFS (Data Center) to Amazon S3
        Using S3DistCp
        Using DistCp
    Scenario 2: Moving Large Amounts of Data from Local Disk (non-HDFS) to Amazon S3
        Using the Jets3t Java Library
        Using GNU Parallel
        Using Aspera Direct-to-S3
        Using AWS Import/Export
        Using AWS Direct Connect
    Scenario 3: Moving Large Amounts of Data from Amazon S3 to HDFS
        Using S3DistCp
        Using DistCp
Data Collection
    Using Apache Flume
    Using Fluentd
Data Aggregation
    Data Aggregation with Apache Flume
    Data Aggregation Best Practices
        Best Practice 1: Aggregated Data Size
        Best Practice 2: Controlling Data Aggregation Size
        Best Practice 3: Data Compression Algorithms
        Best Practice 4: Data Partitioning
Processing Data with Amazon EMR
    Picking the Right Instance Size
    Picking the Right Number of Instances for Your Job
    Estimating the Number of Mappers Your Job Requires
    Amazon EMR Cluster Types
        Transient Amazon EMR Clusters
        Persistent Amazon EMR Clusters
    Common Amazon EMR Architectures
        Pattern 1: Amazon S3 Instead of HDFS
        Pattern 2: Amazon S3 and HDFS
        Pattern 3: HDFS and Amazon S3 as Backup Storage
        Pattern 4: Elastic Amazon EMR Cluster (Manual)
        Pattern 5: Elastic Amazon EMR Cluster (Dynamic)
Optimizing for Cost with Amazon EMR and Amazon EC2
    Optimizing for Cost with EC2 Spot Instances
Performance Optimizations (Advanced)
    Suggestions for Performance Improvement
        Map Task Improvements
        Reduce Task Improvements
    Use Ganglia for Performance Optimizations
    Locating Hadoop Metrics
Conclusion
Further Reading and Next Steps
Appendix A: Benefits of Amazon S3 compared to HDFS

Abstract

The Amazon Web Services (AWS) cloud accelerates big data analytics. It provides instant scalability and elasticity, letting you focus on analytics instead of infrastructure. Whether you are indexing large data sets, analyzing massive amounts of scientific data, or processing clickstream logs, AWS provides a range of big data tools and services that you can leverage for virtually any data-intensive project.

Amazon Elastic MapReduce (Amazon EMR) is one such service that provides a fully managed, hosted Hadoop framework on top of Amazon Elastic Compute Cloud (Amazon EC2). In this paper, we highlight the best practices of moving data to AWS, collecting and aggregating the data, and discuss common architectural patterns for setting up and configuring Amazon EMR clusters for faster processing. We also discuss several performance and cost optimization techniques so you can process and analyze massive amounts of data at high throughput and low cost in a reliable manner.

Introduction

Big data is all about collecting, storing, processing, and visualizing massive amounts of data so that companies can distill knowledge from it, derive valuable business insights from that knowledge, and make better business decisions, all as quickly as possible. The main challenges in operating data analysis platforms include installation and operational management, dynamically allocating data processing capacity to accommodate variable load, and aggregating data from multiple sources for holistic analysis. The open source Apache Hadoop and its ecosystem of tools help solve these problems because Hadoop can expand horizontally to accommodate growing data volume and can process unstructured and structured data in the same environment.

Amazon Elastic MapReduce (Amazon EMR) simplifies running Hadoop and related big data applications on AWS. It removes the cost and complexity of managing the Hadoop installation.
This means any developer or business has the power to do analytics without large capital expenditures. Today, you can spin up a performance-optimized Hadoop cluster in the AWS cloud within minutes on the latest high-performance computing hardware and network without making a capital investment to purchase the hardware. You have the ability to expand and shrink a running cluster on demand. This means if you need answers to your questions faster, you can immediately scale up the size of your cluster to crunch the data more quickly. You can analyze and process vast amounts of data by using Hadoop's MapReduce architecture to distribute the computational work across a cluster of virtual servers running in the AWS cloud.

In addition to processing, analyzing massive amounts of data also involves data collection, migration, and optimization.

Figure 1: Data Flow (Moving Data to AWS, Data Collection, Data Aggregation, Data Processing, Cost and Performance Optimizations)

This whitepaper explains the best practices of moving data to AWS; strategies for collecting, compressing, and aggregating the data; and common architectural patterns for setting up and configuring Amazon EMR clusters for processing. It also provides examples for optimizing for cost by leveraging a variety of Amazon EC2 purchase options such as Reserved and Spot Instances. This paper assumes you have a conceptual understanding and some experience with Amazon EMR and

Apache Hadoop. For an introduction to Amazon EMR, see the Amazon EMR Developer Guide.1 For an introduction to Hadoop, see the book Hadoop: The Definitive Guide.2

Moving Data to AWS

A number of approaches are available for moving large amounts of data from your current storage to Amazon Simple Storage Service (Amazon S3) or from Amazon S3 to Amazon EMR and the Hadoop Distributed File System (HDFS). When doing so, however, it is critical to use the available data bandwidth strategically. With the proper optimizations, uploads of several terabytes a day may be possible. To achieve such high throughput, you can upload data into AWS in parallel from multiple clients, each using multithreading to provide concurrent uploads or employing multipart uploads for further parallelization. You can adjust TCP settings such as window scaling3 and selective acknowledgement4 to enhance data throughput further. The following scenarios explain three ways to optimize data migration from your current local storage location (data center) to AWS by fully utilizing your available throughput.

Scenario 1: Moving Large Amounts of Data from HDFS (Data Center) to Amazon S3

Two tools—S3DistCp and DistCp—can help you move data stored on your local (data center) HDFS storage to Amazon S3. Amazon S3 is a great permanent storage option for unstructured data files because of its high durability and enterprise-class features, such as security and lifecycle management.

Using S3DistCp

S3DistCp is an extension of DistCp with optimizations to work with AWS, particularly Amazon S3. By adding S3DistCp as a step in a job flow, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where subsequent steps in your EMR clusters can process it. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.

S3DistCp copies data using distributed map-reduce jobs, which is similar to DistCp.
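The parallel-upload idea described above (many concurrent upload threads, each sending one part of an object) can be sketched in a few lines of Python. This is an illustrative sketch only: the upload_part function below is a local stand-in for an HTTP PUT of one multipart-upload part, so the example stays self-contained and runnable.

```python
import concurrent.futures
import hashlib

def upload_part(part):
    # Stand-in for an S3 multipart-upload PUT of one part. A real client
    # would issue an HTTP request here; we just hash the bytes so the
    # sketch has no network dependency.
    part_number, data = part
    return part_number, hashlib.md5(data).hexdigest()

def parallel_upload(blob, part_size=1024 * 1024, max_threads=8):
    """Split a blob into fixed-size parts and 'upload' them concurrently."""
    parts = [(i, blob[off:off + part_size])
             for i, off in enumerate(range(0, len(blob), part_size))]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = list(pool.map(upload_part, parts))
    # Parts may complete in any order; the object is assembled by part number.
    return [etag for _, etag in sorted(results)]

etags = parallel_upload(b"x" * (5 * 1024 * 1024))
print(len(etags))  # 5 parts of 1 MB each
```

Because the parts are independent, throughput scales with the number of threads (and clients) until the network link is saturated, which is the same effect S3DistCp achieves with its multiple HTTP upload threads.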
S3DistCp runs mappers to compile a list of files to copy to the destination. Once mappers finish compiling the list of files, the reducers perform the actual data copy. The main optimization that S3DistCp provides over DistCp is that each reducer runs multiple HTTP upload threads to upload the files in parallel.

To illustrate the advantage of using S3DistCp, we conducted a side-by-side comparison between S3DistCp and DistCp. In this test, we copy 50 GB of data from a Hadoop cluster running on Amazon Elastic Compute Cloud (Amazon EC2) in Virginia to an Amazon S3 bucket in Oregon. This test provides an indication of the performance difference between S3DistCp and DistCp under certain circumstances, but your results may vary.

Method      Data Size Copied    Total Time
DistCp      50 GB               26 min
S3DistCp    50 GB               19 min

Figure 2: DistCp and S3DistCp Performance
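As a back-of-the-envelope check on the comparison above, the effective copy throughput can be computed from size over time (simple arithmetic derived from the table, not figures from the paper):

```python
def throughput_mb_s(gb_copied, minutes):
    """Effective copy throughput in MB/s (1 GB = 1024 MB)."""
    return gb_copied * 1024 / (minutes * 60)

print(round(throughput_mb_s(50, 26), 1))  # DistCp:   32.8 MB/s
print(round(throughput_mb_s(50, 19), 1))  # S3DistCp: 44.9 MB/s
```

So for this particular 50 GB cross-region copy, S3DistCp's parallel upload threads yielded roughly a third more throughput than DistCp.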

To copy data from your Hadoop cluster to Amazon S3 using S3DistCp

The following is an example of how to run S3DistCp on your own Hadoop installation to copy data from HDFS to Amazon S3. We have tested the following steps with: 1) the Apache Hadoop 1.0.3 distribution and 2) Amazon EMR AMI 2.4.1. We have not tested this process with other Hadoop distributions and cannot guarantee that the exact same steps work beyond the Hadoop distribution mentioned here (Apache Hadoop 1.0.3).

1. Launch a small Amazon EMR cluster (a single node).

elastic-mapreduce --create --alive --instance-count 1 --instance-type m1.small --ami-version 2.4.1

2. Copy the required jars from Amazon EMR's master node (/home/hadoop/lib) to your local Hadoop master node under the /lib directory of your Hadoop installation path (for example: /usr/local/hadoop/lib). Depending on your Hadoop installation, you may or may not already have these jars. The Apache Hadoop distribution does not contain these jars.

/home/hadoop/lib/httpclient-4.1.1.jar

3. Edit the core-site.xml file to insert your AWS credentials. Then copy the core-site.xml config file to all of your Hadoop cluster nodes. After copying the file, it is unnecessary to restart any services or daemons for the change to take effect.

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESSKEY</value>
</property>

4. Run S3DistCp using the following example (modify HDFS_PATH, YOUR_S3_BUCKET, and PATH):

hadoop jar /usr/local/hadoop/lib/emr-s3distcp-1.0.jar -libjars /usr/local/hadoop/lib/httpclient-4.1.1.jar --src HDFS_PATH --dest s3://YOUR_S3_BUCKET/PATH/ --disableMultipartUpload

Using DistCp

DistCp (distributed copy) is a tool used for large inter- or intra-cluster copying of data. It uses MapReduce to effect its distribution, error handling, and recovery, as well as reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.

DistCp can copy data from HDFS to Amazon S3 in a distributed manner similar to S3DistCp; however, DistCp is not as fast. DistCp uses the following formula to compute the number of mappers required:

min (total bytes / bytes.per.map, 20 * num task trackers)

Usually, this formula works well, but occasionally it may not compute the right number of mappers. If you are using DistCp and notice that the number of mappers used to copy your data is less than your cluster's total mapper capacity, you may want to increase the number of mappers that DistCp uses to copy files by specifying the -m <number of mappers> option.

The following is an example of a DistCp command copying the /data directory on HDFS to a given Amazon S3 bucket:

hadoop distcp hdfs:///data/ s3n://YOUR_S3_BUCKET/PATH/

For more details and tutorials on working with DistCp, see the Apache Hadoop DistCp documentation.

Scenario 2: Moving Large Amounts of Data from Local Disk (non-HDFS) to Amazon S3

Scenario 1 explained how to use distributed copy tools (DistCp and S3DistCp) to help you copy your data to AWS in parallel. The parallelism achieved in Scenario 1 was possible because the data was stored on multiple HDFS nodes and multiple nodes can copy data simultaneously. Fortunate

