11/16/2011, Stanford EE380 Computer Systems Colloquium


Introducing Apache Hadoop: The Modern Data Operating System
Dr. Amr Awadallah, Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah

Limitations of Existing Data Analytics Architecture
Layers in the diagram, top to bottom:
– BI Reports + Interactive Apps
– RDBMS (aggregated data)
– ETL Compute Grid
– Storage Only Grid (original raw data)
– Collection (mostly append)
– Instrumentation
Problems called out:
– Can't explore the original high-fidelity raw data.
– Moving data to compute doesn't scale.
– Archiving = premature data death.
© 2011 Cloudera, Inc. All Rights Reserved.

So What is Apache Hadoop?
– A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
– Core Hadoop has two main systems:
  – Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
  – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
– Schema must be created before any data can be loaded.
– An explicit load operation has to take place which transforms data to DB internal structure.
– New columns must be added explicitly before new data for such columns can be loaded into the database.
– Pros: Read is fast; Standards/Governance.
Schema-on-Read (Hadoop):
– Data is simply copied to the file store, no transformation is needed.
– A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
– Pros: Load is fast; Flexibility/Agility.
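The schema-on-read idea can be illustrated with a small sketch. This is not Hive's SerDe API; the names RAW_LINES, serde_v1, serde_v2 and read_table are hypothetical, and JSON lines are assumed as the raw format purely for illustration. The point is that the stored data never changes; only the read-time parser does.

```python
import json

# Raw events are stored exactly as they arrive (no load-time transformation).
RAW_LINES = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "country": "US"}',  # a new field starts flowing
]

def serde_v1(line):
    """Hypothetical read-time parser that only knows two columns."""
    record = json.loads(line)
    return {"user": record["user"], "clicks": record["clicks"]}

def serde_v2(line):
    """Updated parser: also extracts 'country' (None when the field is absent)."""
    record = json.loads(line)
    row = serde_v1(line)
    row["country"] = record.get("country")
    return row

def read_table(lines, serde):
    """Apply the parser at read time (late binding)."""
    return [serde(line) for line in lines]

rows_v1 = read_table(RAW_LINES, serde_v1)  # 'country' is invisible
rows_v2 = read_table(RAW_LINES, serde_v2)  # 'country' appears retroactively
```

Switching from serde_v1 to serde_v2 exposes the new column for data that was already stored, without reloading anything.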

Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.

Use The Right Tool For The Right Job
Relational Databases, use when:
– Interactive OLAP Analytics (<1sec)
– Multistep ACID Transactions
– 100% SQL Compliance
Hadoop, use when:
– Structured or Not (Flexibility)
– Scalability of Storage/Compute
– Complex Data Processing

HDFS: Hadoop Distributed File System
A given file is broken down into blocks (default 64MB), then blocks are replicated across the cluster (default 3).
Optimized for:
– Throughput
– Put/Get/Delete
– Appends
Block Replication for:
– Durability
– Availability
– Throughput
Block replicas are distributed across servers and racks.
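As a rough worked example of the block and replication defaults above, the following sketch computes how many blocks and replicas a file produces. The function name block_layout and the 1000MB file size are illustrative assumptions; the sketch assumes the last block holds only the remaining bytes rather than being padded to a full block.

```python
import math

BLOCK_SIZE_MB = 64   # default block size stated on the slide
REPLICATION = 3      # default replication factor stated on the slide

def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return block count, replica count, and raw storage used for one file."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": num_blocks,
        "replicas": num_blocks * replication,
        # Assumes the final partial block is stored at its actual size.
        "raw_storage_mb": file_size_mb * replication,
    }

layout = block_layout(1000)  # a hypothetical 1000MB file
```

With these defaults the 1000MB file is split into 16 blocks, stored as 48 block replicas occupying 3000MB of raw disk.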

MapReduce: Computational Framework
Unix analogy: cat *.txt | mapper.pl | sort | reducer.pl > out.txt
Data flow:
– Split 1 ... Split N: (docid, text), e.g. "To Be Or Not To Be?"
– Map 1 ... Map M: emit (words, counts), e.g. (Be, 5), (Be, 12), (Be, 7), (Be, 6)
– Shuffle: delivers (sorted words, counts) to the reducers
– Reduce 1 ... Reduce R: emit (sorted words, sum of counts), e.g. (Be, 30)
– Output File 1 ... Output File R
Signatures:
– Map(in_key, in_value) -> list of (out_key, intermediate_value)
– Reduce(out_key, list of intermediate_values) -> out_value(s)
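The map, shuffle, and reduce phases above can be sketched in a single process. This is an in-memory illustration of the programming model only, not Hadoop's API; map_fn, reduce_fn and run_job are hypothetical names, and the tokenization rule (lowercased alphabetic runs) is an assumption.

```python
from collections import defaultdict
import re

def map_fn(docid, text):
    """Map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1

def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) -> out_value."""
    return word, sum(counts)

def run_job(splits):
    # Map phase: every split is processed independently.
    intermediate = []
    for docid, text in splits:
        intermediate.extend(map_fn(docid, text))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # Reduce phase: keys are visited in sorted order, as in the diagram.
    return [reduce_fn(word, groups[word]) for word in sorted(groups)]

result = run_job([("doc1", "To Be Or Not To Be?")])
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network; the sketch keeps only the data flow.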

MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks, then tasks are scheduled to be as close to data as possible.
Three levels of data locality:
– Same server as data (local disk)
– Same rack as data (rack/leaf switch)
– Wherever there is a free slot (cross rack)
Optimized for:
– Batch Processing
– Failure Recovery
System detects laggard tasks and speculatively executes parallel tasks on the same slice of data.
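The three locality levels can be expressed as a simple preference order. The sketch below is a toy model, not the JobTracker's actual scheduling code; locality_level, pick_task, the server names and the rack map are all hypothetical.

```python
def locality_level(replica_servers, slot_server, rack_of):
    """0 = same server as data, 1 = same rack as data, 2 = cross rack."""
    if slot_server in replica_servers:
        return 0
    if rack_of[slot_server] in {rack_of[s] for s in replica_servers}:
        return 1
    return 2

def pick_task(pending, slot_server, rack_of):
    """Pick the pending task whose input data is closest to the free slot.

    `pending` maps a task id to the set of servers holding its input block.
    """
    return min(pending, key=lambda task: locality_level(pending[task], slot_server, rack_of))

# Hypothetical four-server, two-rack cluster.
rack_of = {"s1": "r1", "s2": "r1", "s3": "r2", "s4": "r2"}
pending = {"t1": {"s3"}, "t2": {"s2"}, "t3": {"s1"}}

choice_for_s1 = pick_task(pending, "s1", rack_of)  # t3 has its data on s1 itself
choice_for_s4 = pick_task(pending, "s4", rack_of)  # t1 has its data in the same rack
```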

But Networks Are Faster Than Disks!
Yes, however, core and disk density per server are going up very quickly:
  1 Hard Disk                  = 100MB/sec  (~1Gbps)
  Server        = 12 Hard Disks = 1.2GB/sec  (~12Gbps)
  Rack          = 20 Servers    = 24GB/sec   (~240Gbps)
  Avg. Cluster  = 6 Racks       = 144GB/sec  (~1.4Tbps)
  Large Cluster = 200 Racks     = 4.8TB/sec  (~48Tbps)
Scanning 4.8TB at 100MB/sec takes 13 hours.
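The figures in the table follow from simple multiplication, which the sketch below reproduces. It assumes decimal units (1TB = 1,000,000MB); the function names are illustrative.

```python
DISK_MB_PER_SEC = 100    # one hard disk, from the slide
DISKS_PER_SERVER = 12
SERVERS_PER_RACK = 20

def aggregate_mb_per_sec(racks):
    """Aggregate sequential disk bandwidth of a cluster with `racks` racks."""
    return racks * SERVERS_PER_RACK * DISKS_PER_SERVER * DISK_MB_PER_SEC

def scan_hours(data_mb, mb_per_sec):
    """Time in hours to scan `data_mb` at the given bandwidth."""
    return data_mb / mb_per_sec / 3600

avg_cluster = aggregate_mb_per_sec(6)      # 144,000 MB/sec = 144GB/sec
large_cluster = aggregate_mb_per_sec(200)  # 4,800,000 MB/sec = 4.8TB/sec

data_mb = 4_800_000                        # 4.8TB
one_disk_hours = scan_hours(data_mb, DISK_MB_PER_SEC)        # about 13.3 hours
cluster_seconds = scan_hours(data_mb, large_cluster) * 3600  # about 1 second
```

The same 4.8TB that takes roughly 13 hours on a single disk can in principle be scanned in about a second when every disk in the large cluster reads its local share in parallel.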

Hadoop High-Level Architecture
– Hadoop Client: contacts Name Node for data or Job Tracker to submit jobs.
– Name Node: maintains mapping of file names to blocks to data node slaves.
– Job Tracker: tracks resources and schedules jobs across task tracker slaves.
– Data Node: stores and serves blocks of data.
– Task Tracker: runs tasks (work units) within a job.
– Data Node and Task Tracker share a physical node.

Changes for Better Availability/Scalability
– Hadoop Client: contacts Name Node for data or Job Tracker to submit jobs.
– Name Node: Federation partitions out the name space; High Availability via an Active Standby.
– Job Tracker: each job has its own Application Manager; Resource Manager is decoupled from MR.
– Data Node: stores and serves blocks of data.
– Task Tracker: runs tasks (work units) within a job.
– Data Node and Task Tracker share a physical node.

CDH: Cloudera's Distribution Including Apache Hadoop
  Build/Test:              APACHE BIGTOP
  Data Mining:             APACHE MAHOUT
  UI Framework/SDK:        HUE
  File System Mount:       FUSE-DFS
  Workflow:                APACHE OOZIE
  Scheduling:              APACHE OOZIE
  Metadata:                APACHE HIVE
  Languages / Compilers:   APACHE PIG, APACHE HIVE
  Data Integration:        APACHE FLUME, APACHE SQOOP
  Fast Read/Write Access:  APACHE HBASE
  Coordination:            APACHE ZOOKEEPER
SCM Express (Installation Wizard for CDH)

Conclusion
The Key Benefits of Apache Hadoop:
– Agility/Flexibility (Quickest Time to Insight).
– Complex Data Processing (Any Language, Any Problem).
– Scalability of Storage/Compute (Freedom to Grow).
– Economical Storage (Keep All Your Data Alive Forever).
The Key Systems for Apache Hadoop are:
– Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource management coupled with scalable data processing.

Unstructured Data is Exploding
[Chart: Complex, Unstructured vs. Relational]
– 2,500 exabytes of new information in 2012 with Internet as primary driver.
– Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year.
Source: IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.

Hadoop in the Enterprise Data Stack
Diagram labels:
– Users: Data Scientists, Analysts, Business Users, System Operators
– Tools: IDEs, Development Tools, ETL Tools, Enterprise Reporting, BI, Analytics, Business Intelligence Tools
– Access: ODBC, JDBC, NFS, Native
– Management: Cloudera Mgmt
– Ingestion: Flume, Sqoop
– Sources: Logs, Files, Web Data, Relational Databases

MapReduce Next Gen
Main idea is to split up the JobTracker functions:
– Cluster resource management (for tracking and allocating nodes)
– Application life-cycle management (for MapReduce scheduling and execution)
Enables:
– High Availability
– Better Scalability
– Efficient Slot Allocation
– Rolling Upgrades
– Non-MapReduce Apps

Two Core Use Cases Common Across Many Industries
  Use Case: ADVANCED ANALYTICS   Industry         Use Case: DATA PROCESSING
  Social Network Analysis        Web              Clickstream Sessionization
  Content Optimization           Media            Clickstream Sessionization
  Network Analytics              Telco            Mediation
  Loyalty & Promotions           Retail           Data Factory
  Fraud Analysis                 Financial        Trade Reconciliation
  Entity Analysis                Federal          SIGINT
  Sequencing Analysis            Bioinformatics   Genome Mapping
  Product Quality                Manufacturing    Mfg Process Tracking

What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
– Simplify and Accelerate Hadoop Deployment
– Reduce Adoption Costs and Risks
– Lower the Cost of Administration
– Increase the Transparency & Control of Hadoop
– Leverage the Experience of Our Experts
Cloudera Enterprise components:
– Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
– Production Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera.
– Effectiveness: Ensuring Repeatable Value from Apache Hadoop Deployments
– Efficiency: Enabling Apache Hadoop to be Affordably Run in Production

Hive vs Pig Latin (count distinct values > 0)
Hive Syntax:
  SELECT COUNT(DISTINCT col1)
  FROM mytable
  WHERE col1 > 0;
Pig Latin Syntax:
  mytable = LOAD 'myfile' AS (col1, col2, col3);
  mytable = FOREACH mytable GENERATE col1;
  mytable = FILTER mytable BY col1 > 0;
  -- DISTINCT operates on a whole relation, not on a single column
  mytable = DISTINCT mytable;
  -- GROUP ... ALL puts every row in one group so COUNT yields a single total
  mytable = GROUP mytable ALL;
  mytable = FOREACH mytable GENERATE COUNT(mytable);
  DUMP mytable;
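For readers unfamiliar with either language, the same computation can be written step by step in plain Python. This is only an analogy for the data flow (project, filter, distinct, count); the sample rows and the function name count_distinct_positive are made up for illustration.

```python
# Hypothetical rows with three columns (col1, col2, col3).
rows = [
    (3, "x", "y"),
    (0, "x", "y"),
    (3, "p", "q"),
    (-1, "x", "y"),
    (7, "x", "y"),
]

def count_distinct_positive(rows):
    col1 = [row[0] for row in rows]          # project col1
    positive = [v for v in col1 if v > 0]    # filter col1 > 0
    distinct = set(positive)                 # distinct values
    return len(distinct)                     # count

answer = count_distinct_positive(rows)
```

The distinct positive values here are 3 and 7, so the count is 2.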

Apache Hive Key Features
– A subset of SQL covering the most common statements
– JDBC/ODBC support
– Agile data types: Array, Map, Struct, and JSON objects
– Pluggable SerDe system to work on unstructured files directly
– User Defined Functions and Aggregates
– Regular Expression support
– MapReduce support
– Partitions and Buckets (for performance optimization)
– Microstrategy/Tableau Compatibility (through ODBC)
– In The Works: Indices, Columnar Storage, Views, Explode/Collect
More details: http://wiki.apache.org/hadoop/Hive

Hive Agile Data Types
STRUCTS:
– SELECT mytable.mycolumn.myfield FROM ...
MAPS (Hashes):
– SELECT mytable.mycolumn[mykey] FROM ...
ARRAYS:
– SELECT mytable.mycolumn[5] FROM ...
JSON:
– SELECT get_json_object(mycolumn, objpath) FROM ...

