Outline of Tutorial: Hadoop and Pig Overview, Hands-on


Outline of Tutorial: Hadoop and Pig Overview, Hands-on

Hadoop and Pig Overview
Lavanya Ramakrishnan
Shane Canon
Lawrence Berkeley National Lab
October 2011

Overview
- Concepts & Background
  – MapReduce and Hadoop
- Hadoop Ecosystem
  – Tools on top of Hadoop
- Hadoop for Science
  – Examples, Challenges
- Programming in Hadoop
  – Building blocks, Streaming, C-HDFS API

Processing Big Data
- Internet scale generates Big Data
  – terabytes of data/day
  – just reading 100 TB can be overwhelming
- Using clusters of standard commodity computers for linear scalability
- Timeline
  – Nutch open source search project (2002-2004)
  – MapReduce & DFS implementation and Hadoop splits out of Nutch (2004-2006)

MapReduce
- Computation performed on large volumes of data in parallel
  – divide workload across a large number of machines
  – need a good data management scheme to handle scalability and consistency
- Functional programming concepts
  – map
  – reduce
(OSDI 2004)

Mapping
- Map input to an output using some function
- Example
  – string manipulation
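As a plain-Java illustration of the map concept (not Hadoop code, and with made-up input), this sketch applies a string manipulation, upper-casing, to every element of a list independently:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapExample {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "pig", "hive");   // sample input
        // map: apply the same function (upper-casing) to every element independently
        List<String> upper = words.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
        System.out.println(upper);   // [HADOOP, PIG, HIVE]
    }
}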

Reduce
- Aggregate values together to provide summary data
- Example
  – addition of a list of numbers
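Continuing the plain-Java illustration, this sketch reduces a made-up list of numbers to a single summary value, their sum:

import java.util.Arrays;
import java.util.List;

public class ReduceExample {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(59, 70, 80, 90);   // sample input
        // reduce: combine all elements into one summary value (here, a sum)
        int sum = numbers.stream().reduce(0, Integer::sum);
        System.out.println(sum);   // 299
    }
}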

Google File System
- Distributed file system
  – accounts for component failure
  – multi-GB files and billions of objects
- Design
  – single master with multiple chunkservers per master
  – file represented as fixed-size chunks
  – 3-way mirrored across chunkservers

Hadoop
- Open source, reliable, scalable distributed computing platform
  – implementation of MapReduce
  – Hadoop Distributed File System (HDFS)
  – runs on commodity hardware
- Fault tolerance
  – restarting tasks
  – data replication
- Speculative execution
  – handles stragglers

HDFS Architecture (diagram)

HDFS and Other Parallel Filesystems

                           HDFS                        GPFS and Lustre
Typical Replication        3                           1
Storage Location           Compute Node                Servers
Access Model               Custom (except with Fuse)   POSIX
Stripe Size                64 MB                       1 MB
Concurrent Writes          No                          Yes
Scales with                # of Compute Nodes          # of Servers
Scale of Largest Systems   O(10k) Nodes                O(100) Servers
User/Kernel Space          User                        Kernel

Who is using Hadoop? (slide of adopter logos, including Yahoo! and a university initiative)

Hadoop Ecosystem (diagram of ecosystem projects such as Avro and ZooKeeper; source: Hadoop: The Definitive Guide) - Constantly evolving!

Google vs. Hadoop

Google      Hadoop
MapReduce   Hadoop MapReduce
GFS         HDFS
Sawzall     Pig, Hive
BigTable    HBase
Chubby      ZooKeeper
Pregel      Hama, Giraph

Pig
- Platform for analyzing large data sets
- Data-flow oriented language "Pig Latin"
  – data transformation functions
  – datatypes include sets, associative arrays, tuples
  – high-level language for marshalling data
- Developed at Yahoo!

Hive
- SQL-based data warehousing application
  – features similar to Pig
  – more strictly SQL-like
- Supports SELECT, JOIN, GROUP BY, etc.
- Analyzing very large data sets
  – log processing, text mining, document indexing
- Developed at Facebook

HBase
- Persistent, distributed, sorted, multidimensional, sparse map
  – based on Google BigTable
  – provides interactive access to information
- Holds extremely large datasets (multi-TB)
- High-speed lookup of individual (row, column)
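As a sketch of what such a lookup looks like from the HBase Java client (using the older HTable-style API of that era), the example below writes and then reads a single (row, column) cell; the table name, column family, and row key are hypothetical, not from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "sensor_readings" table and "d" column family are hypothetical
        HTable table = new HTable(conf, "sensor_readings");
        // write one cell: row key, column family, qualifier, value
        Put put = new Put(Bytes.toBytes("station-94089"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("59"));
        table.put(put);
        // high-speed lookup of an individual (row, column)
        Get get = new Get(Bytes.toBytes("station-94089"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}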

ZooKeeper
- Distributed consensus engine
  – runs on a set of servers and maintains state consistency
- Concurrent access semantics
  – leader election
  – service discovery
  – distributed locking/mutual exclusion
  – message board/mailboxes
  – producer/consumer queues, priority queues and multi-phase commit operations
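Patterns like leader election and distributed locking are typically built from ephemeral, sequential znodes. A minimal sketch with the standard ZooKeeper Java client; the connection string and the /election path are placeholders, not taken from the slides:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkElectionSketch {
    public static void main(String[] args) throws Exception {
        // connection string and session timeout are placeholders
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, new Watcher() {
            public void process(WatchedEvent event) { /* react to connection/znode events here */ }
        });
        // persistent parent node for the election
        if (zk.exists("/election", false) == null) {
            zk.create("/election", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // each candidate creates an ephemeral, sequential child; the lowest sequence number leads
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("created " + me);
        zk.close();
    }
}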

Other Related Projects [1/2]
- Chukwa – Hadoop log aggregation
- Scribe – more general log aggregation
- Mahout – machine learning library
- Cassandra – column store database on a P2P backend
- Dumbo – Python library for streaming
- Spark – in-memory cluster for interactive and iterative computing
- Hadoop on Amazon – Elastic MapReduce

Other Related Projects [2/2]
- Sqoop – import SQL-based data to Hadoop
- Jaql – JSON (JavaScript Object Notation) based semi-structured query processing
- Oozie – Hadoop workflows
- Giraph – large-scale graph processing on Hadoop
- HCatalog – relational view of HDFS
- Fuse-DFS – POSIX interface to HDFS

Hadoop for Science

Magellan and Hadoop
- DOE funded project to determine the appropriate role of cloud computing for DOE/SC midrange workloads
- Co-located at Argonne Leadership Computing Facility (ALCF) and National Energy Research Scientific Computing Center (NERSC)
- Hadoop/Magellan research questions
  – Are the new cloud programming models useful for scientific computing?

Data Intensive Science
- Evaluating hardware and software choices for supporting next generation data problems
- Evaluation of Hadoop
  – using a mix of synthetic benchmarks and scientific applications
  – understanding application characteristics that can leverage the model
    - data operations: filter, merge, reorganization
    - compute-data ratio
(Collaboration w/ Shane Canon, Nick Wright, Zacharia Fadika)

MapReduce and HPC
- Applications that can benefit from MapReduce/Hadoop
  – large amounts of data processing
  – science that is scaling up from the desktop
  – query-type workloads
- Data from exascale needs new technologies
  – Hadoop On Demand lets one run Hadoop through a batch queue

Hadoop for Science
- Advantages of Hadoop
  – transparent data replication, data locality aware scheduling
  – fault tolerance capabilities
- Hadoop Streaming
  – allows users to plug in any binary as maps and reduces
  – input comes on standard input

BioPig
- Analytics toolkit for next-generation sequence data
- User defined functions (UDFs) for common bioinformatics programs
  – BLAST, Velvet
  – readers and writers for FASTA and FASTQ
  – pack/unpack for space conservation with DNA sequences
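Pig UDFs of the kind BioPig packages are plain Java classes that extend Pig's EvalFunc. The sketch below is not an actual BioPig function, just a hypothetical example that reverse-complements a DNA sequence passed in as a chararray:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: reverse-complements a DNA sequence passed as a chararray.
public class ReverseComplement extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String seq = (String) input.get(0);
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (seq.charAt(i)) {
                case 'A': out.append('T'); break;
                case 'T': out.append('A'); break;
                case 'C': out.append('G'); break;
                case 'G': out.append('C'); break;
                default:  out.append('N'); break;   // unknown base
            }
        }
        return out.toString();
    }
}

In a Pig script this would be used roughly as REGISTER myudfs.jar; followed by FOREACH reads GENERATE ReverseComplement(seq); (the jar name, relation, and field are hypothetical).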

Application Examples
- Bioinformatics applications (BLAST)
  – parallel search of input sequences
  – managing input data format
- Tropical storm detection
  – binary file formats can't be handled in streaming
- Atmospheric river detection
  – maps are differentiated on file and parameter

"Bring Your Application" Hadoop Workshop
- When: TBD
- Send us email if you are interested
  – LRamakrishnan@lbl.gov
  – Scanon@lbl.gov
- Include a brief description of your application.

HDFS vs. GPFS (Time) — [plot: Teragen run time (minutes) vs. number of maps, comparing HDFS and GPFS with exponential trend lines]

Application Characteristics Affect Choices
- Wikipedia data set
- On 75 nodes, GPFS performs better with large nodes
- Identical data loads and processing load
- Amount of writing in the application affects performance

Hadoop: Challenges
- Deployment
  – all jobs run as user "hadoop", affecting file permissions
  – less control over how many nodes are used; affects allocation policies
- Programming: no turn-key solution
  – using existing code bases, managing input formats and data
- Additional benchmarking and tuning needed; plug-ins for science

Comparison of MapReduce Implementations
[plots: processing time, speedup, and load balancing (with one node under stress) for Hadoop, Twister, and LEMO-MR on 64-core clusters; workloads include producing random floating point numbers, word counting (billions of words), and processing 5 million 33 x 33 matrices]
(Collaboration w/ Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju, SUNY Binghamton)

Programming Hadoop

Programming with Hadoop
- Map and reduce as Java programs using the Hadoop API
- Pipes and Streaming can help with existing applications in other languages
- C-HDFS API
- Higher-level languages such as Pig might help with some applications

Keys and Values
- Maps and reduces produce key-value pairs
  – arbitrary number of values can be output
  – may map one input to 0, 1, ... 100 outputs
  – reducer may emit one or more outputs
- Example: temperature recordings
  – 94089 8:00 am, 59
  – 27704 6:30 am, 70
  – 94089 12:45 pm, 80
  – 47401 1 pm, 90
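To make the key-value idea concrete, the hypothetical mapper and reducer below (written against the same Hadoop Java API as the Word Count example later in these slides) key each temperature record by its ZIP code and reduce each group to the maximum temperature; the one-record-per-line input layout is assumed for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

  // Emits (zip, temperature) for each input line like "94089 8:00 am, 59"
  public static class TempMapper extends Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String zip = line.split("\\s+")[0];                                    // first token is the ZIP code
      int temp = Integer.parseInt(line.substring(line.lastIndexOf(',') + 1).trim());
      context.write(new Text(zip), new IntWritable(temp));
    }
  }

  // Receives all temperatures for one ZIP code and emits the maximum
  public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable val : values) {
        max = Math.max(max, val.get());
      }
      context.write(key, new IntWritable(max));
    }
  }
}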

Keys divide the reduce space

Data Flow (diagram)

Mechanics [1/2]
- Input files
  – large, 10s of GB or more, typically in HDFS
  – line-based, binary, multi-line, etc.
- InputFormat
  – function defines how input files are split up and read
  – TextInputFormat (default), KeyValueInputFormat, SequenceFileInputFormat
- InputSplits
  – unit of work that comprises a single map task
  – FileInputFormat divides files into 64 MB chunks
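A small sketch of how these choices appear in driver code using the standard mapreduce API; the input path and split size here are illustrative, not taken from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "input setup sketch");

    // choose how files are split and turned into records (TextInputFormat is the default)
    job.setInputFormatClass(TextInputFormat.class);

    // where the input lives (path is a placeholder)
    FileInputFormat.addInputPath(job, new Path("/data/input"));

    // optionally cap the split size (in bytes) handed to each map task
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
  }
}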

Mechanics [2/2]
- RecordReader
  – loads data and converts it to key-value pairs
- Sort & Partition & Shuffle
  – intermediate data from map to reducer
- Combiner
  – reduce data on a single machine
- Mapper & Reducer
- OutputFormat, RecordWriter

Word Count Mapper

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

Word Count Reducer

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Word Count Example (driver)

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Pipes
- Allows C++ code to be used for the Mapper and Reducer
- Both key and value inputs to pipes programs are provided as std::string
- Run with: hadoop pipes

C-HDFS API
- Limited C API to read and write from HDFS

#include <fcntl.h>      /* O_WRONLY, O_CREAT */
#include <string.h>     /* strlen */
#include "hdfs.h"

int main(int argc, char **argv) {
  const char *writePath = "/tmp/testfile.txt";   /* example path */
  const char *buffer = "Hello, HDFS!";           /* example payload */
  hdfsFS fs = hdfsConnect("default", 0);
  hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY | O_CREAT, 0, 0, 0);
  tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer) + 1);
  hdfsCloseFile(fs, writeFile);
  return 0;
}

Hadoop Streaming
- Generic API that allows programs in any language to be used as Hadoop Mapper and Reducer implementations
- Inputs are written to stdin as strings, with a tab character separating key and value
- Output to stdout as key \t value \n
- Run with: hadoop jar contrib/streaming/hadoop-[version]-streaming.jar

Debugging
- Test core functionality separately
- Use the Job Tracker
- Run "local" in Hadoop
- Run the job on a small data set on a single node
- Hadoop can save files from failed tasks

Pig – Basic Operations
- LOAD – loads data into a relational form
- FOREACH...GENERATE – adds or removes fields (columns)
- GROUP – group data on a field
- JOIN – join two relations
- DUMP/STORE – dump query results to the terminal or to a file
There are others, but these will be used for the exercises today.

Pig Example
Find the number of gene hits for each model in an hmmsearch (~100 GB of output, 3 billion lines)

Equivalent shell pipeline:
bash# cat * | cut -f 2 | sort | uniq -c

Pig Latin:
hits = LOAD '/data/bio/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
amodels = FOREACH hits GENERATE model;
models = GROUP amodels BY model;
counts = FOREACH models GENERATE group, COUNT(amodels) as count;
STORE counts INTO 'tcounts' USING PigStorage();

Pig – LOAD
Example:
hits = LOAD 'load4/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
- Pig has several built-in data types (chararray, float, integer)
- PigStorage can parse standard line-oriented text files.
- Pig can be extended with custom load types written in Java.
- Pig doesn't read any data until triggered by a DUMP or STORE.

Pig – FOREACH...GENERATE, GROUP
Example:
amodels = FOREACH hits GENERATE model;
models = GROUP amodels BY model;
counts = FOREACH models GENERATE group, COUNT(amodels) as count;
- Use FOREACH...GENERATE to pick out specific fields or generate new fields; also referred to as a projection.
- GROUP will create a new record with the group name and a "bag" of the tuples in each group.
- You can reference a specific field in a bag with bag.field (i.e. amodels.model).
- You can use aggregate functions like COUNT, MAX, etc. on a bag.

Pig – Important Points
- Nothing really happens until a DUMP or STORE is performed.
- Use FILTER and FOREACH early to remove unneeded columns or rows and reduce temporary output.
- Use the PARALLEL keyword on GROUP operations to run more reduce tasks.

Questions?
- Shane Canon – Scanon@lbl.gov
- Lavanya Ramakrishnan – LRamakrishnan@lbl.gov
