Applying Apache Hadoop To NASA's Big Climate Data


National Aeronautics and Space Administration

Applying Apache Hadoop to NASA's Big Climate Data
Use Cases and Lessons Learned

Glenn Tamkin (NASA/CSC)

Team: John Schnase (NASA/PI), Dan Duffy (NASA/CO), Hoot Thompson (PTP), Denis Nadeau (CSC), Scott Sinno (PTP), Savannah Strong (CSC)

www.nasa.gov

Overview

The NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance analytics because it optimizes computer clusters and combines distributed storage of large data sets with parallel computation. We have built a platform for developing new climate analysis capabilities with Hadoop.

Solution

Hadoop is well known for text-based problems, but our scenario involves binary data. So we created custom Java applications to read/write the data during the MapReduce process. Our solution is different because it: (a) uses a custom composite key design for fast data access, and (b) utilizes the Hadoop Bloom filter, a data structure designed to identify rapidly and memory-efficiently whether an element is present.
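The custom binary read/write path can be pictured as a simple record codec: a fixed-size composite-key header followed by the raw values. A minimal Python sketch follows; the layout, field sizes, and names are hypothetical, not the actual NCCS format.

```python
import struct

# Hypothetical record layout (illustrative only): a composite key of
# (epoch-hour timestamp, 8-byte padded parameter name), then a float32 payload.
KEY_FMT = ">I8s"  # big-endian 4-byte timestamp + 8-byte parameter name

def pack_record(timestamp, param, values):
    """Serialize one record: composite key header, value count, float32 values."""
    header = struct.pack(KEY_FMT, timestamp, param.ljust(8).encode())
    payload = struct.pack(f">{len(values)}f", *values)
    return header + struct.pack(">I", len(values)) + payload

def unpack_record(buf):
    """Parse a record back into (timestamp, param, values)."""
    ts, raw = struct.unpack_from(KEY_FMT, buf, 0)
    (n,) = struct.unpack_from(">I", buf, struct.calcsize(KEY_FMT))
    offset = struct.calcsize(KEY_FMT) + 4
    values = list(struct.unpack_from(f">{n}f", buf, offset))
    return ts, raw.decode().strip(), values
```

Keeping the key at a fixed offset is what lets a custom reader seek directly to a record's key without deserializing the payload.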

Why HDFS and MapReduce?

- Software framework to store large amounts of data in parallel across a cluster of nodes
- Provides fault tolerance, load balancing, and parallelization by replicating data across nodes
- Co-locates the stored data with computational capability to act on the data (storage nodes and compute nodes are typically the same)
- A MapReduce job takes the requested operation and maps it to the appropriate nodes for computation using specified keys

Who uses this technology? Google, Yahoo, Facebook: many PBs and probably even EBs of data.
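The map/shuffle/reduce flow described above can be sketched in-memory in a few lines. This is illustrative only; real Hadoop distributes each phase across nodes and spills to disk. The record shapes and names here are invented for the example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """A toy, single-process model of the MapReduce phases."""
    groups = defaultdict(list)
    # Map phase: each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: combine the values collected under each key.
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: average a parameter per month, keyed the way the MERRA
# records are keyed (timestamp + parameter name).
readings = [("1979-01", "T", 250.5), ("1979-01", "T", 251.5), ("1979-02", "T", 248.0)]
averages = map_reduce(
    readings,
    mapper=lambda r: [((r[0], r[1]), r[2])],
    reducer=lambda k, vs: sum(vs) / len(vs),
)
```

The "specified keys" in the slide are exactly the first element of each emitted pair: they decide which reducer sees which values.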

Background

Scientific data services are a critical aspect of the NASA Center for Climate Simulation's (NCCS) mission. The Modern Era Retrospective-Analysis for Research and Applications Analytic Services (MERRA/AS):

- Is a cyber-infrastructure resource for developing and evaluating a next generation of climate data analysis capabilities
- Is a service that reduces the time spent in the preparation of MERRA data used in data-model inter-comparison

Vision

- Provide a test-bed for experimental development of high-performance analytics
- Offer an architectural approach to climate data services that can be generalized to applications and customers beyond the traditional climate research community

Example Use Case - WEI Experiment

Example Use Case - WEI Experiment (continued)

MERRA Data

- The GEOS-5 MERRA products are divided into 25 collections: 18 standard products, 7 chemistry products
- Comprise monthly means files and daily files at six-hour intervals running from 1979 to 2012
- Total size of the NetCDF MERRA collection in a standard filesystem is 80 TB
- One file per month/day produced, with file sizes ranging from 20 MB to 1.5 GB

MapReduce Workflow

Ingesting MERRA Data into HDFS

- Option 1: Put the MERRA data into Hadoop with no changes
  » Would require us to write a custom mapper to parse
- Option 2: Write a custom NetCDF-to-Hadoop sequencer and keep the files together
  » Basically puts indexes into the files so Hadoop can parse by key
  » Maintains the NetCDF metadata for each file
- Option 3: Write a custom NetCDF-to-Hadoop sequencer and split the files apart (allows smaller block sizes)
  » Breaks the connection of the NetCDF metadata to the data
- Chose Option 2
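Option 2's sequencing step amounts to slicing each variable along its time axis and emitting one keyed record per time step, while keeping all of a file's records together. A minimal sketch, with toy data structures standing in for the real NetCDF arrays:

```python
# Hypothetical sequencer sketch (not the actual NCCS code): walk a variable's
# time axis and yield ((timestamp, parameter), spatial grid) records.
def sequence_variable(name, times, grid_by_time):
    """Yield one composite-keyed record per time step of one variable."""
    for t, grid in zip(times, grid_by_time):
        yield (t, name), grid  # composite key pairs time with parameter name

# Toy 2x2 grids for two six-hour time steps of a temperature field.
temps = [[[250.0, 251.0], [252.0, 253.0]],
         [[248.0, 249.0], [250.0, 251.0]]]
records = list(sequence_variable("T", ["1979-01-01T00", "1979-01-01T06"], temps))
```

Because every record carries its own key, Hadoop can split the map work by key without ever needing to understand the NetCDF container format itself.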

Sequence File Format

During sequencing, the data is partitioned by time, so that each record in the sequence file contains the timestamp and name of the parameter (e.g. temperature) as the composite key, and the value of the parameter (which could have 1 to 3 spatial dimensions) as the value.
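The payoff of the composite key is fast seeking: once records are sorted by (timestamp, parameter), as in a Hadoop MapFile, one parameter at one time step can be found by binary search instead of a scan. A small sketch with invented sample records:

```python
from bisect import bisect_left

# Records sorted by composite key, as a MapFile keeps them (toy data).
records = sorted([
    (("1979-01", "T"), [250.0, 251.0]),
    (("1979-01", "U"), [10.0, 12.0]),
    (("1979-02", "T"), [248.0, 249.0]),
])
keys = [k for k, _ in records]

def lookup(timestamp, param):
    """Binary-search the sorted keys for an exact composite-key match."""
    i = bisect_left(keys, (timestamp, param))
    if i < len(keys) and keys[i] == (timestamp, param):
        return records[i][1]
    return None  # key definitely absent
```

Tuple comparison orders first by timestamp, then by parameter name, which is why partitioning by time comes first in the key.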

Bloom Filter

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive retrieval results are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set".

In Hadoop terms, the BloomMapFile can be thought of as an enhanced MapFile because it contains an additional hash table that leverages the existing indexes when seeking data.
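The mechanism behind those guarantees is small enough to sketch directly. This is a minimal illustrative filter, not Hadoop's implementation: k salted hashes set k bits per added element, and a membership test checks those same bits.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: false positives possible, no false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted MD5 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means only possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("1979-01:T")
present = bf.might_contain("1979-01:T")  # always True for an added key
```

Seeking a missing key in a BloomMapFile can thus be rejected from the in-memory filter alone, skipping the index lookup and disk seek entirely.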

Bloom Filter Performance Increase

The original MapReduce application utilized standard Hadoop Sequence Files. Later it was modified to support three different formats, called Sequence, Map, and Bloom. Dramatic performance increases were observed with the addition of the Bloom filter (30-80%).

Job Description | Host | Sequence (sec) | Map (sec) | Bloom (sec) | Percent Increase
Read a single parameter ("T") from a single sequenced monthly means file | Standalone VM | 6.1 | 1.2 | 1.1 | 81.9%
Single MR job across 4 months of data seeking "T" (period 2) | Standalone VM | 204 | 67 | 36 | 82.3%
Generate sequence file from a single MM file | Standalone VM | 39 | 41 | 51 | -30.7%
Single MR job across 4 months of data seeking "T" (period 2) | Cluster | 31 | 46 | 22 | 29.0%
Single MR job across 12 months of data seeking "T" (period 3) | Cluster | 49 | 59 | 36 | 26.5%
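The "Percent Increase" column appears to compare Bloom against the plain Sequence format; recomputing it from the slide's timings reproduces the figures to within rounding (the slide seems to truncate to one decimal place).

```python
def percent_increase(sequence_sec, bloom_sec):
    """Speedup of Bloom over Sequence, as a percentage of the Sequence time."""
    return (sequence_sec - bloom_sec) / sequence_sec * 100

# (Sequence, Bloom) timings from the table above, in seconds.
timings = [(6.1, 1.1), (204, 36), (39, 51), (31, 22), (49, 36)]
gains = [percent_increase(s, b) for s, b in timings]
```

Note the one negative row: sequencing itself got slower under the Bloom format, since building the extra hash structure is added work at write time.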

Data Set Descriptions

Two data sets:
- MAIMNPANA.5.2.0 (instM 3d ana Np) - monthly means
- MAIMCPASM.5.2.0 (instM 3d asm Cp) - monthly means

Common characteristics:
- Spans years 1979 through 2012
- Two files per year (hdf, xml), 396 total files

[Sizing table: Sequence Type, Raw Total (GB), Sequenced Total (GB), Raw File (MB), Sequenced File (MB); the values are not legible in the transcription]

MERRA Cluster

[Cluster diagram: head nodes (Namenode with /merra 5 TB, /hadoop_fs 1 TB, /mapred 1 TB; JobTracker) on the LAN; Data Nodes 1-8, each with /hadoop_fs 16 TB and /mapred 16 TB, connected over FDR InfiniBand; MERRA data store of 180 TB raw]

Operational Node Configurations

Other Apache Contributions

- Avro - a data serialization system
- Maven - a tool for building and managing Java-based projects
- Commons - a project focused on all aspects of reusable Java components
  » Lang - provides methods for manipulation of core Java classes
  » I/O - a library of utilities to assist with developing IO functionality
  » CLI - an API for parsing command line options passed to programs
  » Math - a library of mathematics and statistics components
- Subversion - a version control system
- Log4j - a framework for logging application debugging messages

Other Open Source Tools

- Using Cloudera (CDH), the open source enterprise-ready distribution of Apache Hadoop
- Cloudera is integrated with configuration and administration tools and related open source packages, such as Hue, Oozie, Zookeeper, and Impala
- Cloudera Manager Free Edition is particularly useful for cluster management, providing centralized administration of CDH

Next Steps

- Tune the MapReduce framework
- Try different ways to sequence the files
- Experiment with data accelerators
- Explore real-time querying services on top of the Hadoop file system:
  » Apache Drill
  » Impala (Cloudera)
  » Ceph, MapR

Conclusions and Lessons Learned

- Design of the sequence file format is critical for big binary data
- Configuration is key: change only one parameter at a time when tuning
- Big data is hard, and it takes a long time; expect things to fail, a lot
- Hadoop craves bandwidth
- HDFS installs easily, but optimizing it is not so easy
- Not as fast as we thought: is there something in Hadoop that we don't understand yet? Ask the mailing list or your support provider

