Applying Apache Hadoop To NASA's Big Climate Data


National Aeronautics and Space Administration

Applying Apache Hadoop to NASA's Big Climate Data
Use Cases and Lessons Learned

Glenn Tamkin (NASA/CSC)

Team: John Schnase (NASA/PI), Dan Duffy (NASA/CO), Hoot Thompson (PTP), Denis Nadeau (CSC), Scott Sinno (PTP), Savannah Strong (CSC)

www.nasa.gov

Overview

The NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance analytics because it optimizes computer clusters and combines distributed storage of large data sets with parallel computation. We have built a platform for developing new climate analysis capabilities with Hadoop.

Solution

Hadoop is well known for text-based problems, but our scenario involves binary data. So we created custom Java applications to read/write the data during the MapReduce process. Our solution is different because it: (a) uses a custom composite key design for fast data access, and (b) utilizes the Hadoop Bloom filter, a data structure designed to identify rapidly and memory-efficiently whether an element is present.
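The custom binary read/write path can be pictured as a simple record codec: a fixed-size composite-key header followed by the raw values. A minimal Python sketch follows; the layout, field sizes, and names are hypothetical, not the actual NCCS format.

```python
import struct

# Hypothetical record layout (illustrative only): a composite key of
# (epoch-hour timestamp, 8-byte padded parameter name), then a float32 payload.
KEY_FMT = ">I8s"  # big-endian 4-byte timestamp + 8-byte parameter name

def pack_record(timestamp, param, values):
    """Serialize one record: composite key header, value count, float32 values."""
    header = struct.pack(KEY_FMT, timestamp, param.ljust(8).encode())
    payload = struct.pack(f">{len(values)}f", *values)
    return header + struct.pack(">I", len(values)) + payload

def unpack_record(buf):
    """Parse a record back into (timestamp, param, values)."""
    ts, raw = struct.unpack_from(KEY_FMT, buf, 0)
    (n,) = struct.unpack_from(">I", buf, struct.calcsize(KEY_FMT))
    offset = struct.calcsize(KEY_FMT) + 4
    values = list(struct.unpack_from(f">{n}f", buf, offset))
    return ts, raw.decode().strip(), values
```

Keeping the key at a fixed offset is what lets a custom reader seek directly to a record's key without deserializing the payload.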

Why HDFS and MapReduce?

- Software framework to store large amounts of data in parallel across a cluster of nodes
- Provides fault tolerance, load balancing, and parallelization by replicating data across nodes
- Co-locates the stored data with computational capability to act on the data (storage nodes and compute nodes are typically the same)
- A MapReduce job takes the requested operation and maps it to the appropriate nodes for computation using specified keys

Who uses this technology? Google, Yahoo, Facebook: many PBs and probably even EBs of data.
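The map/shuffle/reduce flow described above can be sketched in-memory in a few lines. This is illustrative only; real Hadoop distributes each phase across nodes and spills to disk. The record shapes and names here are invented for the example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """A toy, single-process model of the MapReduce phases."""
    groups = defaultdict(list)
    # Map phase: each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: combine the values collected under each key.
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: average a parameter per month, keyed the way the MERRA
# records are keyed (timestamp + parameter name).
readings = [("1979-01", "T", 250.5), ("1979-01", "T", 251.5), ("1979-02", "T", 248.0)]
averages = map_reduce(
    readings,
    mapper=lambda r: [((r[0], r[1]), r[2])],
    reducer=lambda k, vs: sum(vs) / len(vs),
)
```

The "specified keys" in the slide are exactly the first element of each emitted pair: they decide which reducer sees which values.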

Background

Scientific data services are a critical aspect of the NASA Center for Climate Simulation's (NCCS) mission. The Modern Era Retrospective-Analysis for Research and Applications Analytic Services (MERRA/AS):

- Is a cyber-infrastructure resource for developing and evaluating a next generation of climate data analysis capabilities
- Is a service that reduces the time spent in the preparation of MERRA data used in data-model inter-comparison

Vision

- Provide a test-bed for experimental development of high-performance analytics
- Offer an architectural approach to climate data services that can be generalized to applications and customers beyond the traditional climate research community

Example Use Case - WEI Experiment

Example Use Case - WEI Experiment (continued)

MERRA Data

- The GEOS-5 MERRA products are divided into 25 collections: 18 standard products, 7 chemistry products
- Comprise monthly means files and daily files at six-hour intervals running from 1979 to 2012
- Total size of the NetCDF MERRA collection in a standard filesystem is 80 TB
- One file per month/day produced, with file sizes ranging from 20 MB to 1.5 GB

MapReduce Workflow

Ingesting MERRA Data into HDFS

- Option 1: Put the MERRA data into Hadoop with no changes
  » Would require us to write a custom mapper to parse
- Option 2: Write a custom NetCDF-to-Hadoop sequencer and keep the files together
  » Basically puts indexes into the files so Hadoop can parse by key
  » Maintains the NetCDF metadata for each file
- Option 3: Write a custom NetCDF-to-Hadoop sequencer and split the files apart (allows smaller block sizes)
  » Breaks the connection of the NetCDF metadata to the data
- Chose Option 2
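Option 2's sequencing step amounts to slicing each variable along its time axis and emitting one keyed record per time step, while keeping all of a file's records together. A minimal sketch, with toy data structures standing in for the real NetCDF arrays:

```python
# Hypothetical sequencer sketch (not the actual NCCS code): walk a variable's
# time axis and yield ((timestamp, parameter), spatial grid) records.
def sequence_variable(name, times, grid_by_time):
    """Yield one composite-keyed record per time step of one variable."""
    for t, grid in zip(times, grid_by_time):
        yield (t, name), grid  # composite key pairs time with parameter name

# Toy 2x2 grids for two six-hour time steps of a temperature field.
temps = [[[250.0, 251.0], [252.0, 253.0]],
         [[248.0, 249.0], [250.0, 251.0]]]
records = list(sequence_variable("T", ["1979-01-01T00", "1979-01-01T06"], temps))
```

Because every record carries its own key, Hadoop can split the map work by key without ever needing to understand the NetCDF container format itself.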

Sequence File Format

During sequencing, the data is partitioned by time, so that each record in the sequence file contains the timestamp and name of the parameter (e.g. temperature) as the composite key, and the value of the parameter (which could have 1 to 3 spatial dimensions) as the value.
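The payoff of the composite key is fast seeking: once records are sorted by (timestamp, parameter), as in a Hadoop MapFile, one parameter at one time step can be found by binary search instead of a scan. A small sketch with invented sample records:

```python
from bisect import bisect_left

# Records sorted by composite key, as a MapFile keeps them (toy data).
records = sorted([
    (("1979-01", "T"), [250.0, 251.0]),
    (("1979-01", "U"), [10.0, 12.0]),
    (("1979-02", "T"), [248.0, 249.0]),
])
keys = [k for k, _ in records]

def lookup(timestamp, param):
    """Binary-search the sorted keys for an exact composite-key match."""
    i = bisect_left(keys, (timestamp, param))
    if i < len(keys) and keys[i] == (timestamp, param):
        return records[i][1]
    return None  # key definitely absent
```

Tuple comparison orders first by timestamp, then by parameter name, which is why partitioning by time comes first in the key.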

Bloom Filter

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive retrieval results are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set".

In Hadoop terms, the BloomMapFile can be thought of as an enhanced MapFile because it contains an additional hash table that leverages the existing indexes when seeking data.
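The mechanism behind those guarantees is small enough to sketch directly. This is a minimal illustrative filter, not Hadoop's implementation: k salted hashes set k bits per added element, and a membership test checks those same bits.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: false positives possible, no false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted MD5 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means only possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("1979-01:T")
present = bf.might_contain("1979-01:T")  # always True for an added key
```

Seeking a missing key in a BloomMapFile can thus be rejected from the in-memory filter alone, skipping the index lookup and disk seek entirely.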

Bloom Filter Performance Increase

The original MapReduce application utilized standard Hadoop Sequence Files. Later it was modified to support three different formats, called Sequence, Map, and Bloom. Dramatic performance increases were observed with the addition of the Bloom filter (30-80%).

Job Description | Host | Sequence (sec) | Map (sec) | Bloom (sec) | Percent Increase
Read a single parameter ("T") from a single sequenced monthly means file | Standalone VM | 6.1 | 1.2 | 1.1 | 81.9%
Single MR job across 4 months of data seeking "T" (period 2) | Standalone VM | 204 | 67 | 36 | 82.3%
Generate sequence file from a single MM file | Standalone VM | 39 | 41 | 51 | -30.7%
Single MR job across 4 months of data seeking "T" (period 2) | Cluster | 31 | 46 | 22 | 29.0%
Single MR job across 12 months of data seeking "T" (period 3) | Cluster | 49 | 59 | 36 | 26.5%
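The "Percent Increase" column appears to compare Bloom against the plain Sequence format; recomputing it from the slide's timings reproduces the figures to within rounding (the slide seems to truncate to one decimal place).

```python
def percent_increase(sequence_sec, bloom_sec):
    """Speedup of Bloom over Sequence, as a percentage of the Sequence time."""
    return (sequence_sec - bloom_sec) / sequence_sec * 100

# (Sequence, Bloom) timings from the table above, in seconds.
timings = [(6.1, 1.1), (204, 36), (39, 51), (31, 22), (49, 36)]
gains = [percent_increase(s, b) for s, b in timings]
```

Note the one negative row: sequencing itself got slower under the Bloom format, since building the extra hash structure is added work at write time.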

Data Set Descriptions

Two data sets:
- MAIMNPANA.5.2.0 (instM 3d ana Np) - monthly means
- MAIMCPASM.5.2.0 (instM 3d asm Cp) - monthly means

Common characteristics:
- Spans years 1979 through 2012
- Two files per year (hdf, xml), 396 total files

[Sizing table: Sequence Type, Raw Total (GB), Sequenced Total (GB), Raw File (MB), Sequenced File (MB); the values are not legible in the transcription]

MERRA Cluster

[Cluster diagram: head nodes (Namenode with /merra 5 TB, /hadoop_fs 1 TB, /mapred 1 TB; JobTracker) on the LAN; Data Nodes 1-8, each with /hadoop_fs 16 TB and /mapred 16 TB, connected over FDR InfiniBand; MERRA data store of 180 TB raw]

Operational Node Configurations

Other Apache Contributions

- Avro - a data serialization system
- Maven - a tool for building and managing Java-based projects
- Commons - a project focused on all aspects of reusable Java components
  » Lang - provides methods for manipulation of core Java classes
  » I/O - a library of utilities to assist with developing IO functionality
  » CLI - an API for parsing command line options passed to programs
  » Math - a library of mathematics and statistics components
- Subversion - a version control system
- Log4j - a framework for logging application debugging messages

Other Open Source Tools

- Using Cloudera (CDH), the open source enterprise-ready distribution of Apache Hadoop
- Cloudera is integrated with configuration and administration tools and related open source packages, such as Hue, Oozie, Zookeeper, and Impala
- Cloudera Manager Free Edition is particularly useful for cluster management, providing centralized administration of CDH

Next Steps

- Tune the MapReduce framework
- Try different ways to sequence the files
- Experiment with data accelerators
- Explore real-time querying services on top of the Hadoop file system:
  » Apache Drill
  » Impala (Cloudera)
  » Ceph, MapR

Conclusions and Lessons Learned

- Design of the sequence file format is critical for big binary data
- Configuration is key: change only one parameter at a time when tuning
- Big data is hard, and it takes a long time; expect things to fail, a lot
- Hadoop craves bandwidth
- HDFS installs easily, but optimizing it is not so easy
- Not as fast as we thought: is there something in Hadoop that we don't understand yet? Ask the mailing list or your support provider

