11/16/2011, Stanford EE380 Computer Systems Colloquium .

2y ago
39 Views
2 Downloads
3.08 MB
30 Pages
Last View : 2d ago
Last Download : 3m ago
Upload by : Kairi Hasson
Transcription

11/16/2011, Stanford EE380 Computer Systems ColloquiumIntroducing Apache Hadoop:The Modern Data Operating SystemDr. Amr Awadallah Founder, CTO, VP of Engineeringaaa@cloudera.com, twitter: @awadallah

Limitations of Existing Data Analytics ArchitectureBI Reports Interactive AppsCan’t Explore OriginalHigh Fidelity Raw DataRDBMS (aggregated data)ETL Compute GridMoving Data ToCompute Doesn’t ScaleStorage Only Grid (original raw data)Mostly AppendCollectionInstrumentation2 2011 Cloudera, Inc. All Rights Reserved.Archiving PrematureData Death

So What is ApacheHadoop ? A scalable fault-tolerant distributed system fordata storage and processing (open sourceunder the Apache license). Core Hadoop has two main systems:– Hadoop Distributed File System: self-healinghigh-bandwidth clustered storage.– MapReduce: distributed fault-tolerant resourcemanagement and scheduling coupled with ascalable data programming abstraction.3 2011 Cloudera, Inc. All Rights Reserved.

The Key Benefit: Agility/FlexibilitySchema-on-Write (RDBMS):Schema-on-Read (Hadoop): Schema must be created beforeany data can be loaded. Data is simply copied to the filestore, no transformation is needed. An explicit load operation has totake place which transformsdata to DB internal structure. A SerDe (Serializer/Deserlizer) isapplied during read time to extractthe required columns (late binding) New columns must be addedexplicitly before new data forsuch columns can be loadedinto the database. New data can start flowing anytimeand will appear retroactively oncethe SerDe is updated to parse it. Read is Fast Standards/GovernancePros Load is Fast Flexibility/Agility4 2011 Cloudera, Inc. All Rights Reserved.

Innovation: Explore Original Raw DataData Committee5 2011 Cloudera, Inc. All Rights Reserved.Data Scientist

Flexibility: Complex Data Processing1. Java MapReduce: Most flexibility and performance, but tediousdevelopment cycle (the assembly language of Hadoop).2. Streaming MapReduce (aka Pipes): Allows you to develop inany programming language of your choice, but slightly lowerperformance and less flexibility than native Java MapReduce.3. Crunch: A library for multi-stage MapReduce pipelines in Java(modeled After Google’s FlumeJava)4. Pig Latin: A high-level language out of Yahoo, suitable for batchdata flow workloads.5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.6. Oozie: A PDL XML workflow engine that enables creating aworkflow of jobs composed of any of the above.6 2011 Cloudera, Inc. All Rights Reserved.

Scalability: Scalable Software DevelopmentGrows without requiring developers tore-architect their algorithms/application.AUTO SCALE7 2011 Cloudera, Inc. All Rights Reserved.

Scalability: Data Beats AlgorithmSmarter AlgosMore DataA. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 20098 2011 Cloudera, Inc. All Rights Reserved.

Scalability: Keep All Data Alive ForeverArchive to Tape andNever See It AgainExtract Value FromAll Your Data9 2011 Cloudera, Inc. All Rights Reserved.

Use The Right Tool For The Right JobRelational Databases:Use when:Hadoop:Use when: Interactive OLAP Analytics ( 1sec) Structured or Not (Flexibility) Multistep ACID Transactions Scalability of Storage/Compute 100% SQL Compliance Complex Data Processing10 2011 Cloudera, Inc. All Rights Reserved.

HDFS: Hadoop Distributed File SystemA given file is broken down into blocks(default 64MB), then blocks arereplicated across cluster (default 3).Optimized for: Throughput Put/Get/Delete AppendsBlock Replication for: Durability Availability ThroughputBlock Replicas are distributedacross servers and racks.11 2011 Cloudera, Inc. All Rights Reserved.

MapReduce: Computational Frameworkcat *.txt mapper.pl sort reducer.pl out.txtSplit 1(docid, text)(words, counts)Map 1(sorted words, counts)Be, 5Reduce 1“To BeOr NotTo Be?”(sorted words,sum of counts)OutputFile 1Be, 30Be, 12Split i(docid, text)Map iBe, 7Be, 6Split N(docid, text)Map MReduce i(sorted words,sum of counts)Reduce R(sorted words,sum of counts)Shuffle(words, counts)Map(in key, in value) list of (out key, intermediate value)(sorted words, counts)OutputFile iOutputFile RReduce(out key, list of intermediate values) out value(s)12 2011 Cloudera, Inc. All Rights Reserved.

MapReduce: Resource Manager / SchedulerA given job is broken down into tasks,then tasks are scheduled to be asclose to data as possible.Three levels of data locality: Same server as data (local disk) Same rack as data (rack/leaf switch) Wherever there is a free slot (cross rack)Optimized for: Batch Processing Failure RecoverySystem detects laggard tasks andspeculatively executes parallel taskson the same slice of data.13 2011 Cloudera, Inc. All Rights Reserved.

But Networks Are Faster Than Disks!Yes, however, core and disk density per serverare going up very quickly: 1 Hard Disk 100MB/sec ( 1Gbps)Server 12 Hard Disks 1.2GB/sec ( 12Gbps)Rack 20 Servers 24GB/sec ( 240Gbps)Avg. Cluster 6 Racks 144GB/sec ( 1.4Tbps)Large Cluster 200 Racks 4.8TB/sec ( 48Tbps)Scanning 4.8TB at 100MB/sec takes 13 hours.14 2011 Cloudera, Inc. All Rights Reserved.

Hadoop High-Level ArchitectureHadoop ClientContacts Name Node for dataor Job Tracker to submit jobsName NodeJob TrackerMaintains mapping of file namesto blocks to data node slaves.Tracks resources and schedulesjobs across task tracker slaves.Data NodeTask TrackerStores and servesblocks of dataRuns tasks (work units)within a jobShare Physical Node15 2011 Cloudera, Inc. All Rights Reserved.

Changes for Better Availability/ScalabilityHadoop ClientFederation partitionsout the name space,High Availability viaan Active Standby.Contacts Name Node for dataor Job Tracker to submit jobsName NodeEach job has its ownApplication Manager,Resource Manager isdecoupled from MR.Job TrackerData NodeTask TrackerStores and servesblocks of dataRuns tasks (work units)within a jobShare Physical Node16 2011 Cloudera, Inc. All Rights Reserved.

Build/Test: APACHE BIGTOPCDH: Cloudera’s Distribution Including Apache HadoopData MiningUI Framework/SDKFile System MountFUSE-DFSWorkflowHUEAPACHE MAHOUTSchedulingAPACHE OOZIEMetadataAPACHE HIVEAPACHE OOZIELanguages / CompilersDataIntegrationAPACHE PIG, APACHE HIVEAPACHE FLUME,APACHE SQOOPFastRead/WriteAccessAPACHE HBASECoordinationAPACHE ZOOKEEPERSCM Express (Installation Wizard for CDH)17 2011 Cloudera, Inc. All Rights Reserved.

Books18 2011 Cloudera, Inc. All Rights Reserved.

Conclusion The Key Benefits of Apache Hadoop:– Agility/Flexibility (Quickest Time to Insight).– Complex Data Processing (Any Language, Any Problem).– Scalability of Storage/Compute (Freedom to Grow).– Economical Storage (Keep All Your Data Alive Forever). The Key Systems for Apache Hadoop are:– Hadoop Distributed File System: self-healing highbandwidth clustered storage.– MapReduce: distributed fault-tolerant resourcemanagement coupled with scalable data processing.19 2011 Cloudera, Inc. All Rights Reserved.

AppendixBACKUP SLIDES20 2011 Cloudera, Inc. All Rights Reserved.

Unstructured Data is ExplodingComplex, UnstructuredRelational 2,500 exabytes of new information in 2012 with Internet as primary driver Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2“zettabytes” this year21Source: IDC White Paper - sponsored by EMC.As the Economy Contracts, the Digital Universe Expands. May 2009. 2011 Cloudera, Inc. All Rights Reserved.

Hadoop Creation History Fastest sort of a TB,62secs over 1,460 nodes Sorted a PB in 16.25hoursover 3,658 nodes22 2011 Cloudera, Inc. All Rights Reserved.

Hadoop in the Enterprise Data StackData ScientistsIDEsAnalystsETL ToolsEnterpriseReportingBI, AnalyticsDevelopment ToolsSystemOperatorsBusiness UsersBusiness Intelligence ToolsODBC, JDBC,NFS, NativeClouderaMgmt meFlumeFlumeSqoopLogsFilesWeb DataRelationalDatabases23 2011 Cloudera, Inc. All Rights ication

MapReduce Next GenMain idea is to split up the JobTracker functions: Cluster resource management (for tracking andallocating nodes) Application life-cycle management (for MapReducescheduling and execution)Enables: High Availability Better Scalability Efficient Slot Allocation Rolling Upgrades Non-MapReduce Apps24 2011 Cloudera, Inc. All Rights Reserved.

ApplicationIndustryApplicationSocial Network AnalysisWebClickstream SessionizationContent OptimizationMediaClickstream SessionizationNetwork AnalyticsTelcoMediationLoyalty & PromotionsRetailData FactoryFraud AnalysisFinancialTrade ReconciliationEntity AnalysisFederalSIGINTSequencing AnalysisBioinformaticsGenome MappingProduct QualityManufacturingMfg Process Tracking25 2011 Cloudera, Inc. All Rights Reserved.Use CaseDATA PROCESSINGUse CaseADVANCED ANALYTICSTwo Core Use Cases Common Across Many Industries

What is Cloudera Enterprise?Cloudera Enterprise makes opensource Apache Hadoop enterprise-easy Simplify and Accelerate Hadoop Deployment Reduce Adoption Costs and Risks Lower the Cost of Administration Increase the Transparency & Control of Hadoop Leverage the Experience of Our ExpertsCLOUDERA ENTERPRISE COMPONENTSClouderaManagementSuiteProductionLevel SupportComprehensiveToolset for HadoopAdministrationOur Team of ExpertsOn-Call to Help YouMeet Your SLAs3 of the top 5 telecommunications, mobile services, defense &intelligence, banking, media and retail organizations depend on ClouderaEFFECTIVENESSEFFICIENCYEnsuring Repeatable Value fromApache Hadoop DeploymentsEnabling Apache Hadoop to beAffordably Run in Production26 2011 Cloudera, Inc. All Rights Reserved.

Hive vs Pig Latin (count distinct values 0) Hive Syntax:SELECT COUNT(DISTINCT col1)FROM mytableWHERE col1 0; Pig Latin Syntax:mytable LOAD ‘myfile’ AS (col1, col2, col3);mytable FOREACH mytable GENERATE col1;mytable FILTER mytable BY col1 0;mytable DISTINCT col1;mytable GROUP mytable BY col1;mytable FOREACH mytable GENERATE COUNT(mytable);DUMP mytable;27 2011 Cloudera, Inc. All Rights Reserved.

Apache Hive Key Features A subset of SQL covering the most common statementsJDBC/ODBC supportAgile data types: Array, Map, Struct, and JSON objectsPluggable SerDe system to work on unstructured files directlyUser Defined Functions and AggregatesRegular Expression supportMapReduce supportPartitions and Buckets (for performance optimization)Microstrategy/Tableau Compatibility (through ODBC)In The Works: Indices, Columnar Storage, Views, Explode/CollectMore details: http://wiki.apache.org/hadoop/Hive28 2011 Cloudera, Inc. All Rights Reserved.

Hive Agile Data Types STRUCTS:– SELECT mytable.mycolumn.myfield FROM MAPS (Hashes):– SELECT mytable.mycolumn[mykey] FROM ARRAYS:– SELECT mytable.mycolumn[5] FROM JSON:– SELECT get json object(mycolumn, objpath) FROM 29 2011 Cloudera, Inc. All Rights Reserved.

CDH: Cloudera’s Distribution Including Apache Hadoop Coordination Data Integration Fast Read/Write Access Languages / Compilers Workflow Scheduling Metadata APACHE ZOOKEEPER APACHE FLUME, APACHE SQOOP APACHE HBASE APACHE PIG, APACHE HIVE APACHE OOZIE APACHE OOZIE APACHE HIVE File System Mount UI

Related Documents:

September 10, 2013 EE380 (Control Lab) IITK Lab Manual 0.2 Past status of Control Systems Laboratory Up to the August – December semester of 2008 EE380 had 4 sections of up to 24 students. Each section was divided into 6 groups of up to 4 students. 0.2.1 Logistical challenges 1.Six different experiments were done concurrently during each lab .

SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity Qingyuan Zhao Stanford University qyzhao@stanford.edu Murat A. Erdogdu Stanford University erdogdu@stanford.edu Hera Y. He Stanford University yhe1@stanford.edu Anand Rajaraman Stanford University anand@cs.stanford.edu Jure Leskovec Stanford University jure@cs.stanford .

Domain Adversarial Training for QA Systems Stanford CS224N Default Project Mentor: Gita Krishna Danny Schwartz Brynne Hurst Grace Wang Stanford University Stanford University Stanford University deschwa2@stanford.edu brynnemh@stanford.edu gracenol@stanford.edu Abstract In this project, we exa

Computer Science Stanford University ymaniyar@stanford.edu Madhu Karra Computer Science Stanford University mkarra@stanford.edu Arvind Subramanian Computer Science Stanford University arvindvs@stanford.edu 1 Problem Description Most existing COVID-19 tests use nasal swabs and a polymerase chain reaction to detect the virus in a sample. We aim to

Stanford University Stanford, CA 94305 bowang@stanford.edu Min Liu Department of Statistics Stanford University Stanford, CA 94305 liumin@stanford.edu Abstract Sentiment analysis is an important task in natural language understanding and has a wide range of real-world applications. The typical sentiment analysis focus on

Mar 16, 2021 · undergraduate and graduate students, faculty, staff, and members of the community. Anyone interested in auditioning for the Stanford Philharmonia, Stanford Symphony Orchestra, or Stanford Summer Symphony should contact Orchestra Administrator Adriana Ramírez Mirabal at orchestra@stanford.edu. For further information, visit orchestra.stanford.edu.

Stanford Health Care Organizational Overview 3 Contract Administration is a Shared Service of Stanford Health Care to Eight Other Stanford Medicine Entities Stanford Health are ("SH")is the flagship academic medical center associated with the Stanford University School of Medicine. SHC has 15,232 employees and volunteers, 613 licensed

STANFORD INTERNATIONAL nANK, LTD., § STANFORD GROUP COMPANY, § STANFORD CAPITAL MANAGEMENT, LLC, § R. ALLEN STANFORD, JAMES . M. DAVIS, . The false data has helped SGC grow the SAS program from less than 10 million in around 2004 to . I : over 1.2 billion, generating fees for SGC (and ultimately Stanford) in excess of 25 million. .