Outline of Tutorial: Hadoop and Pig Overview, Hands-on


Outline of Tutorial: Hadoop and Pig Overview, Hands-on

Hadoop and Pig Overview
Lavanya Ramakrishnan
Shane Canon
Lawrence Berkeley National Lab
October 2011

Overview
- Concepts & Background
  – MapReduce and Hadoop
- Hadoop Ecosystem
  – Tools on top of Hadoop
- Hadoop for Science
  – Examples, Challenges
- Programming in Hadoop
  – Building blocks, Streaming, C-HDFS API

Processing Big Data
- Internet scale generates Big Data
  – terabytes of data/day
  – just reading 100 TB can be overwhelming
- Using clusters of standard commodity computers for linear scalability
- Timeline
  – Nutch open source search project (2002-2004)
  – MapReduce & DFS implementation and Hadoop splits out of Nutch (2004-2006)

MapReduce
- Computation performed on large volumes of data in parallel
  – divide workload across a large number of machines
  – need a good data management scheme to handle scalability and consistency
- Functional programming concepts
  – map
  – reduce
(OSDI 2004)

Mapping
- Map input to an output using some function
- Example
  – string manipulation
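As a plain-Java illustration of the map concept (not Hadoop code, and with made-up input), this sketch applies a string manipulation, upper-casing, to every element of a list independently:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapExample {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "pig", "hive");   // sample input
        // map: apply the same function (upper-casing) to every element independently
        List<String> upper = words.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
        System.out.println(upper);   // [HADOOP, PIG, HIVE]
    }
}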

Reduce
- Aggregate values together to provide summary data
- Example
  – addition of a list of numbers
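Continuing the plain-Java illustration, this sketch reduces a made-up list of numbers to a single summary value, their sum:

import java.util.Arrays;
import java.util.List;

public class ReduceExample {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(59, 70, 80, 90);   // sample input
        // reduce: combine all elements into one summary value (here, a sum)
        int sum = numbers.stream().reduce(0, Integer::sum);
        System.out.println(sum);   // 299
    }
}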

Google File System
- Distributed file system
  – accounts for component failure
  – multi-GB files and billions of objects
- Design
  – single master with multiple chunkservers per master
  – file represented as fixed-size chunks
  – 3-way mirrored across chunkservers

Hadoop
- Open source, reliable, scalable distributed computing platform
  – implementation of MapReduce
  – Hadoop Distributed File System (HDFS)
  – runs on commodity hardware
- Fault tolerance
  – restarting tasks
  – data replication
- Speculative execution
  – handles stragglers

HDFS Architecture (diagram)

HDFS and Other Parallel Filesystems

                           HDFS                        GPFS and Lustre
Typical Replication        3                           1
Storage Location           Compute Node                Servers
Access Model               Custom (except with Fuse)   POSIX
Stripe Size                64 MB                       1 MB
Concurrent Writes          No                          Yes
Scales with                # of Compute Nodes          # of Servers
Scale of Largest Systems   O(10k) Nodes                O(100) Servers
User/Kernel Space          User                        Kernel

Who is using Hadoop? (slide of adopter logos, including Yahoo! and a university initiative)

Hadoop Ecosystem (diagram of ecosystem projects such as Avro and ZooKeeper; source: Hadoop: The Definitive Guide) - Constantly evolving!

Google vs. Hadoop

Google      Hadoop
MapReduce   Hadoop MapReduce
GFS         HDFS
Sawzall     Pig, Hive
BigTable    HBase
Chubby      ZooKeeper
Pregel      Hama, Giraph

Pig
- Platform for analyzing large data sets
- Data-flow oriented language "Pig Latin"
  – data transformation functions
  – datatypes include sets, associative arrays, tuples
  – high-level language for marshalling data
- Developed at Yahoo!

Hive
- SQL-based data warehousing application
  – features similar to Pig
  – more strictly SQL-like
- Supports SELECT, JOIN, GROUP BY, etc.
- Analyzing very large data sets
  – log processing, text mining, document indexing
- Developed at Facebook

HBase
- Persistent, distributed, sorted, multidimensional, sparse map
  – based on Google BigTable
  – provides interactive access to information
- Holds extremely large datasets (multi-TB)
- High-speed lookup of individual (row, column)
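As a sketch of what such a lookup looks like from the HBase Java client (using the older HTable-style API of that era), the example below writes and then reads a single (row, column) cell; the table name, column family, and row key are hypothetical, not from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "sensor_readings" table and "d" column family are hypothetical
        HTable table = new HTable(conf, "sensor_readings");
        // write one cell: row key, column family, qualifier, value
        Put put = new Put(Bytes.toBytes("station-94089"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("59"));
        table.put(put);
        // high-speed lookup of an individual (row, column)
        Get get = new Get(Bytes.toBytes("station-94089"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}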

ZooKeeper
- Distributed consensus engine
  – runs on a set of servers and maintains state consistency
- Concurrent access semantics
  – leader election
  – service discovery
  – distributed locking/mutual exclusion
  – message board/mailboxes
  – producer/consumer queues, priority queues and multi-phase commit operations
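Patterns like leader election and distributed locking are typically built from ephemeral, sequential znodes. A minimal sketch with the standard ZooKeeper Java client; the connection string and the /election path are placeholders, not taken from the slides:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkElectionSketch {
    public static void main(String[] args) throws Exception {
        // connection string and session timeout are placeholders
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, new Watcher() {
            public void process(WatchedEvent event) { /* react to connection/znode events here */ }
        });
        // persistent parent node for the election
        if (zk.exists("/election", false) == null) {
            zk.create("/election", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // each candidate creates an ephemeral, sequential child; the lowest sequence number leads
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("created " + me);
        zk.close();
    }
}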

Other Related Projects [1/2]
- Chukwa – Hadoop log aggregation
- Scribe – more general log aggregation
- Mahout – machine learning library
- Cassandra – column store database on a P2P backend
- Dumbo – Python library for streaming
- Spark – in-memory cluster for interactive and iterative computing
- Hadoop on Amazon – Elastic MapReduce

Other Related Projects [2/2]
- Sqoop – import SQL-based data to Hadoop
- Jaql – JSON (JavaScript Object Notation) based semi-structured query processing
- Oozie – Hadoop workflows
- Giraph – large-scale graph processing on Hadoop
- HCatalog – relational view of HDFS
- Fuse-DFS – POSIX interface to HDFS

Hadoop for Science

Magellan and Hadoop
- DOE funded project to determine the appropriate role of cloud computing for DOE/SC midrange workloads
- Co-located at Argonne Leadership Computing Facility (ALCF) and National Energy Research Scientific Computing Center (NERSC)
- Hadoop/Magellan research questions
  – Are the new cloud programming models useful for scientific computing?

Data Intensive Science
- Evaluating hardware and software choices for supporting next generation data problems
- Evaluation of Hadoop
  – using a mix of synthetic benchmarks and scientific applications
  – understanding application characteristics that can leverage the model
    - data operations: filter, merge, reorganization
    - compute-data ratio
(Collaboration w/ Shane Canon, Nick Wright, Zacharia Fadika)

MapReduce and HPC
- Applications that can benefit from MapReduce/Hadoop
  – large amounts of data processing
  – science that is scaling up from the desktop
  – query-type workloads
- Data from exascale needs new technologies
  – Hadoop On Demand lets one run Hadoop through a batch queue

Hadoop for Science
- Advantages of Hadoop
  – transparent data replication, data locality aware scheduling
  – fault tolerance capabilities
- Hadoop Streaming
  – allows users to plug in any binary as maps and reduces
  – input comes on standard input

BioPig
- Analytics toolkit for next-generation sequence data
- User defined functions (UDFs) for common bioinformatics programs
  – BLAST, Velvet
  – readers and writers for FASTA and FASTQ
  – pack/unpack for space conservation with DNA sequences
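Pig UDFs of the kind BioPig packages are plain Java classes that extend Pig's EvalFunc. The sketch below is not an actual BioPig function, just a hypothetical example that reverse-complements a DNA sequence passed in as a chararray:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: reverse-complements a DNA sequence passed as a chararray.
public class ReverseComplement extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String seq = (String) input.get(0);
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (seq.charAt(i)) {
                case 'A': out.append('T'); break;
                case 'T': out.append('A'); break;
                case 'C': out.append('G'); break;
                case 'G': out.append('C'); break;
                default:  out.append('N'); break;   // unknown base
            }
        }
        return out.toString();
    }
}

In a Pig script this would be used roughly as REGISTER myudfs.jar; followed by FOREACH reads GENERATE ReverseComplement(seq); (the jar name, relation, and field are hypothetical).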

Application Examples
- Bioinformatics applications (BLAST)
  – parallel search of input sequences
  – managing input data format
- Tropical storm detection
  – binary file formats can't be handled in streaming
- Atmospheric river detection
  – maps are differentiated on file and parameter

"Bring Your Application" Hadoop Workshop
- When: TBD
- Send us email if you are interested
  – LRamakrishnan@lbl.gov
  – Scanon@lbl.gov
- Include a brief description of your application.

HDFS vs. GPFS (Time) — [plot: Teragen run time (minutes) vs. number of maps, comparing HDFS and GPFS with exponential trend lines]

Application Characteristics Affect Choices
- Wikipedia data set
- On 75 nodes, GPFS performs better with large nodes
- Identical data loads and processing load
- Amount of writing in the application affects performance

Hadoop: Challenges
- Deployment
  – all jobs run as user "hadoop", affecting file permissions
  – less control over how many nodes are used; affects allocation policies
- Programming: no turn-key solution
  – using existing code bases, managing input formats and data
- Additional benchmarking and tuning needed; plug-ins for science

Comparison of MapReduce Implementations
[plots: processing time, speedup, and load balancing (with one node under stress) for Hadoop, Twister, and LEMO-MR on 64-core clusters; workloads include producing random floating point numbers, word counting (billions of words), and processing 5 million 33 x 33 matrices]
(Collaboration w/ Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju, SUNY Binghamton)

Programming Hadoop

Programming with Hadoop
- Map and reduce as Java programs using the Hadoop API
- Pipes and Streaming can help with existing applications in other languages
- C-HDFS API
- Higher-level languages such as Pig might help with some applications

Keys and Values
- Maps and reduces produce key-value pairs
  – arbitrary number of values can be output
  – may map one input to 0, 1, ... 100 outputs
  – reducer may emit one or more outputs
- Example: temperature recordings
  – 94089 8:00 am, 59
  – 27704 6:30 am, 70
  – 94089 12:45 pm, 80
  – 47401 1 pm, 90
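To make the key-value idea concrete, the hypothetical mapper and reducer below (written against the same Hadoop Java API as the Word Count example later in these slides) key each temperature record by its ZIP code and reduce each group to the maximum temperature; the one-record-per-line input layout is assumed for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

  // Emits (zip, temperature) for each input line like "94089 8:00 am, 59"
  public static class TempMapper extends Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String zip = line.split("\\s+")[0];                                    // first token is the ZIP code
      int temp = Integer.parseInt(line.substring(line.lastIndexOf(',') + 1).trim());
      context.write(new Text(zip), new IntWritable(temp));
    }
  }

  // Receives all temperatures for one ZIP code and emits the maximum
  public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable val : values) {
        max = Math.max(max, val.get());
      }
      context.write(key, new IntWritable(max));
    }
  }
}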

Keys divide the reduce space

Data Flow (diagram)

Mechanics [1/2]
- Input files
  – large, 10s of GB or more, typically in HDFS
  – line-based, binary, multi-line, etc.
- InputFormat
  – function defines how input files are split up and read
  – TextInputFormat (default), KeyValueInputFormat, SequenceFileInputFormat
- InputSplits
  – unit of work that comprises a single map task
  – FileInputFormat divides files into 64 MB chunks
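A small sketch of how these choices appear in driver code using the standard mapreduce API; the input path and split size here are illustrative, not taken from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "input setup sketch");

    // choose how files are split and turned into records (TextInputFormat is the default)
    job.setInputFormatClass(TextInputFormat.class);

    // where the input lives (path is a placeholder)
    FileInputFormat.addInputPath(job, new Path("/data/input"));

    // optionally cap the split size (in bytes) handed to each map task
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
  }
}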

Mechanics [2/2]
- RecordReader
  – loads data and converts it to key-value pairs
- Sort & Partition & Shuffle
  – intermediate data from map to reducer
- Combiner
  – reduce data on a single machine
- Mapper & Reducer
- OutputFormat, RecordWriter

Word Count Mapper

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

Word Count Reducer

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Word Count Example (driver)

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Pipes
- Allows C++ code to be used for the Mapper and Reducer
- Both key and value inputs to pipes programs are provided as std::string
- Run with: hadoop pipes

C-HDFS API
- Limited C API to read and write from HDFS

#include <fcntl.h>      /* O_WRONLY, O_CREAT */
#include <string.h>     /* strlen */
#include "hdfs.h"

int main(int argc, char **argv) {
  const char *writePath = "/tmp/testfile.txt";   /* example path */
  const char *buffer = "Hello, HDFS!";           /* example payload */
  hdfsFS fs = hdfsConnect("default", 0);
  hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY | O_CREAT, 0, 0, 0);
  tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer) + 1);
  hdfsCloseFile(fs, writeFile);
  return 0;
}

Hadoop Streaming
- Generic API that allows programs in any language to be used as Hadoop Mapper and Reducer implementations
- Inputs are written to stdin as strings, with a tab character separating key and value
- Output to stdout as key \t value \n
- Run with: hadoop jar contrib/streaming/hadoop-[version]-streaming.jar

Debugging
- Test core functionality separately
- Use the Job Tracker
- Run "local" in Hadoop
- Run the job on a small data set on a single node
- Hadoop can save files from failed tasks

Pig – Basic Operations
- LOAD – loads data into a relational form
- FOREACH...GENERATE – adds or removes fields (columns)
- GROUP – group data on a field
- JOIN – join two relations
- DUMP/STORE – dump query results to the terminal or to a file
There are others, but these will be used for the exercises today.

Pig Example
Find the number of gene hits for each model in an hmmsearch (~100 GB of output, 3 billion lines)

Equivalent shell pipeline:
bash# cat * | cut -f 2 | sort | uniq -c

Pig Latin:
hits = LOAD '/data/bio/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
amodels = FOREACH hits GENERATE model;
models = GROUP amodels BY model;
counts = FOREACH models GENERATE group, COUNT(amodels) as count;
STORE counts INTO 'tcounts' USING PigStorage();

Pig – LOAD
Example:
hits = LOAD 'load4/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
- Pig has several built-in data types (chararray, float, integer)
- PigStorage can parse standard line-oriented text files.
- Pig can be extended with custom load types written in Java.
- Pig doesn't read any data until triggered by a DUMP or STORE.

Pig – FOREACH...GENERATE, GROUP
Example:
amodels = FOREACH hits GENERATE model;
models = GROUP amodels BY model;
counts = FOREACH models GENERATE group, COUNT(amodels) as count;
- Use FOREACH...GENERATE to pick out specific fields or generate new fields; also referred to as a projection.
- GROUP will create a new record with the group name and a "bag" of the tuples in each group.
- You can reference a specific field in a bag with bag.field (i.e. amodels.model).
- You can use aggregate functions like COUNT, MAX, etc. on a bag.

Pig – Important Points
- Nothing really happens until a DUMP or STORE is performed.
- Use FILTER and FOREACH early to remove unneeded columns or rows and reduce temporary output.
- Use the PARALLEL keyword on GROUP operations to run more reduce tasks.

Questions?
- Shane Canon – Scanon@lbl.gov
- Lavanya Ramakrishnan – LRamakrishnan@lbl.gov
