5 Filesystem: GFS / HDFS


TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl

Course overview: Intro, Streams, MapReduce, HDFS, Pig, Graphs, Design Patterns, Hadoop Mix, Giraph, ZooKeeper, Spark.

But first: Partitioner & Combiner

Reminder: map & reduce. Key/value pairs form the basic data structure. Apply a map operation to each record in the input to compute a set of intermediate key/value pairs:

map: (k_i, v_i) → [(k_j, v_j)]
map: (k_i, v_i) → [(k_j, v_x), (k_m, v_y), (k_j, v_n), …]

Apply a reduce operation to all values that share the same key:

reduce: (k_j, [v_x, v_n]) → [(k_h, v_a), (k_h, v_b), (k_l, v_a)]

There are no limits on the number of key/value pairs.

Combiner overview. Combiner: local aggregation of key/value pairs after map() and before the shuffle & sort phase (it occurs on the same machine as map()). Sometimes the reducer code can be used as-is; the combiner is therefore also called a “mini-reducer”. Instead of emitting (the,1) 100 times, the combiner emits (the,100). This can lead to great speed-ups, but it needs to be employed with care.

There is more: the combiner. Setup: a mapper which outputs (term, termFreqInDoc) and a combiner which is simply a copy of the reducer.

Task 1: total term frequency of a term in the corpus.
Without combiner: (the,2),(the,2),(the,1) → reduce → (the,5). Correct!
With combiner (reducer copy): (the,2),(the,1) → combine → (the,3); then (the,2),(the,3) → reduce → (the,5). Still correct!

Task 2: average frequency of a term in the documents.
Without combiner: (the,2),(the,2),(the,1) → reduce → (the, (2+2+1)/3 ≈ 1.67). Correct!
With combiner (reducer copy): (the,2),(the,1) → combine → (the, (2+1)/2 = 1.5); then (the,2),(the,1.5) → reduce → (the, (2+1.5)/2 = 1.75). Wrong!
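To make the averaging task combiner-safe, a common trick is to pass partial (sum, count) pairs through the pipeline and divide only in the final reducer. The sketch below is not from these slides; it assumes, purely for illustration, that the pairs are encoded as comma-separated Text values such as (the, "2,1").

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner for the averaging task: merges partial (sum, count) pairs.
// Summing the pairs is associative and commutative, so running this
// combiner zero, one or many times does not change the final result;
// only the final reducer divides sum by count.
public class PartialSumCombiner extends Reducer<Text, Text, Text, Text> {

  @Override
  public void reduce(Text key, Iterable<Text> values, Context con)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (Text v : values) {
      String[] parts = v.toString().split(",");   // value format: "sum,count"
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    con.write(key, new Text(sum + "," + count));
  }
}

The final reducer would aggregate in the same way and then emit (term, sum/count); a real implementation would more likely use a custom Writable holding the two numbers instead of string parsing.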

There is more: the combiner. Each combiner operates in isolation and has no access to the key/value pairs of other mappers. A combiner cannot be assumed to process all values associated with the same key (it may not run at all; that is Hadoop's decision). The emitted key/value pair types must be the same as those emitted by the mapper. Most often, combiner code ≠ reducer code; the exception: associative & commutative reduce operations.

Hadoop in practice. Specified by the user: Mapper, Reducer, Combiner (optional), Partitioner (optional), and the driver/job configuration. 90% of the code comes from given templates.

Hadoop in practice. Mapper: counting inlinks. Input key/value: (sourceUrl, content); output key/value: (targetUrl, 1).

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InlinkCount extends Mapper<Object, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  // matches [[wiki-style links]] non-greedily
  private final Pattern linkPattern = Pattern.compile("\\[\\[.*?\\]\\]");

  @Override
  public void map(Object key, Text value, Context con)
      throws IOException, InterruptedException {
    String page = value.toString();
    Matcher m = linkPattern.matcher(page);
    while (m.find()) {
      String match = m.group();
      con.write(new Text(match), one);
    }
  }
}

(The template differs slightly in different Hadoop versions.)

Hadoop in practice. Reducer: counting inlinks. Input key/value: (targetUrl, 1); output key/value: (targetUrl, count).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context con)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable iw : values) {
      sum += iw.get();
    }
    con.write(key, new IntWritable(sum));
  }
}

(The template differs slightly in different Hadoop versions.)

Hadoop in practice. Driver: counting inlinks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InlinkCountDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    Job job = new Job(conf, "InlinkAccumulator");
    // mapper and reducer classes as defined on the previous slides
    job.setMapperClass(InlinkCount.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("/user/in/"));
    FileOutputFormat.setOutputPath(job, new Path("/user/out/"));
    job.waitForCompletion(true);
  }
}

Hadoop in practice. Partitioner: two URLs that are the same apart from their #fragment should be sent to the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    // strip the #fragment (if present) so url#a and url#b fall into the same partition
    int fragmentStart = s.lastIndexOf('#');
    String newKey = (fragmentStart >= 0) ? s.substring(0, fragmentStart) : s;
    // mask the sign bit so the partition number is never negative
    return (newKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
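For the custom partitioner to take effect, the driver has to register it. A hedged fragment, to be added inside main() of the driver shown earlier (the number of reduce tasks is an arbitrary example):

// after the other job.set* calls in the driver:
job.setPartitionerClass(CustomPartitioner.class);
// partitioning only matters when there is more than one reducer
job.setNumReduceTasks(4);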

GFS / HDFS

Learning objectives: explain the design considerations behind GFS/HDFS; explain the basic procedures for data replication, recovery from failure, reading and writing; design alternative strategies to handle the issues GFS/HDFS was created for; decide whether GFS/HDFS is a good fit given a usage scenario; implement strategies for handling small files.

GFS introduction. Hadoop's HDFS is heavily inspired by GFS, which is one way (not the only way) to design a distributed file system.

History of MapReduce & GFS. Developed by engineers at Google around 2003, built on principles in parallel and distributed processing. Seminal Google papers: “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003) and “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat (2004). Yahoo! paper: “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (2010).

What is a file system? File systems determine how data is stored and retrieved. Distributed file systems manage the storage across a network of machines, which adds complexity due to the network. GFS (Google) and HDFS (Hadoop) are distributed file systems; HDFS is inspired by GFS.

GFS assumptions, based on Google's main use cases (at the time): hardware failures are common (commodity hardware); files are large (GB/TB) and their number is limited (millions, not billions); there are two main types of reads, large streaming reads and small random reads; workloads consist of sequential writes that append data to files; once written, files are seldom modified again (hence the focus on appends); random modification of files is possible, but not efficient in GFS; high sustained bandwidth trumps low latency.

Disclaimer. GFS/HDFS are not a good fit for: low-latency data access (in the ms range; solutions: HBase, Hive), many small files (solution: stuffing small files into larger binary container files, as sketched below), and constantly changing data. Note that not all details of GFS are public knowledge; the HDFS developers “filled in” the details.
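One common way to implement the “stuff small files into bigger files” strategy is to pack them into a Hadoop SequenceFile keyed by filename. This sketch is not from the slides; the output path and the choice of SequenceFile (rather than, say, HAR files) are assumptions for illustration.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs many small local files (given on the command line) into one
// SequenceFile: key = original filename, value = file content.
public class SmallFilePacker {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/user/out/packed.seq")),   // hypothetical output path
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (String name : args) {
        byte[] content = Files.readAllBytes(Paths.get(name));
        writer.append(new Text(name), new BytesWritable(content));
      }
    }
  }
}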

GFS architecture (remember: one way, not the only way). The figure shows several clients, a single master (metadata) and several data servers (chunkservers); these are user-level processes and can run on the same physical machine. Data does not(!) flow across the GFS master. Image source: .google.com/en//archive/gfs-sosp2003.pdf

GFS: files

Files on GFS. A single file can contain many objects (e.g. Web documents). Files are divided into fixed-size chunks (64 MB) with unique 64-bit identifiers; the IDs are assigned by the GFS master at chunk creation time. Chunkservers store chunks on local disk as “normal” Linux files. Reading and writing of data is specified by the tuple (chunk handle, byte range).

File information at master level. Files are replicated (by default 3 times) across all chunkservers. The master maintains all file system metadata: namespace, access control information, the mapping from files to chunks, chunk locations, garbage collection of orphaned chunks, chunk migration, and so on (distributed systems are complex!). Heartbeat messages flow between master and chunkservers: is the chunkserver still alive? Which chunks are stored at the chunkserver? To read or write data, a client communicates with the master (metadata operations) and with chunkservers (data).

Files on GFS. Clients cache metadata, but clients do not cache file data; chunkservers do not cache file data either (that is the responsibility of the underlying file system: Linux's buffer cache). Advantages of (large) fixed-size chunks: the disk seek time is small compared to the transfer time, a single file can be larger than a node's disk space, and the fixed size makes allocation computations easy. Question: with a seek time of 10 ms (0.01 s) and a transfer rate of 100 MB/s, what chunk size makes the seek time 1% of the transfer time? Answer: 100 MB, as worked out below.
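Worked out (we want the seek time to be 1% of the transfer time):

\[
\text{transfer time} = \frac{\text{seek time}}{0.01} = \frac{0.01\ \text{s}}{0.01} = 1\ \text{s},
\qquad
\text{chunk size} = 100\ \text{MB/s} \times 1\ \text{s} = 100\ \text{MB}.
\]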

GFS: Master

One master. A single master simplifies the design tremendously: chunk placement and replication can be decided with global knowledge. However, a single master in a large cluster can become a bottleneck. Goal: minimize the master's involvement in reads and writes (hence the strict separation of metadata, handled by the master, and data, handled by the chunkservers).

A read operation (in detail). 1. The client translates the filename and byte offset specified by the application into a chunk index within the file and sends the request to the master.

A read operation (in detail). 2. The master replies with the chunk handle and the replica locations.

A read operation (in detail). 3. The client caches the metadata. 4. The client sends a data request to one of the replicas (the closest one); the byte range indicates the wanted part of the chunk. More than one chunk can be included in a single request.

A read operation (in detail). 5. The contacted chunkserver replies with the requested data.
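The HDFS read path is analogous: the client asks the NameNode for block locations and then streams the data directly from a DataNode. A minimal, hedged sketch using the public Hadoop FileSystem API (the file path and offset are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);                    // metadata requests go to the NameNode
    try (FSDataInputStream in = fs.open(new Path("/user/in/pages.txt"))) {   // hypothetical file
      in.seek(128L * 1024 * 1024);                           // jump to an offset in a later block
      byte[] buffer = new byte[4096];
      int n = in.read(buffer);                               // bytes come from a DataNode, not the NameNode
      System.out.println("read " + n + " bytes");
    }
  }
}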

Metadata on the master. Three types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory (fast random access), which sets limits on the entire system's capacity. An operation log is kept on the master's local disk so that the master state can be recovered after a crash: namespaces and mappings are logged, chunk locations are not.

GFS: Chunks

Chunks. A chunk is 64 MB or 128 MB (can be changed) and is stored as a plain Linux file on a chunkserver. Advantages of a large (but not too large) chunk size: reduced need for client/master interaction (one request per chunk suits the target workloads); a client can cache all the chunk locations for a multi-TB working set; reduced size of the metadata on the master (which is kept in memory). Disadvantage: a chunkserver can become a hotspot for popular file(s). Question: how could the hotspot issue be solved?

Chunk locations. The master does not keep a persistent record of chunk replica locations: it polls the chunkservers about their chunks at startup and keeps up to date through periodic HeartBeat messages. Master and chunkservers are thus easily kept in sync when chunkservers leave/join/fail/restart (a regular event). The chunkserver has the final word over what chunks it has.

Operation log. The operation log is a persistent record of critical metadata changes and is critical to the recovery of the system. Changes to metadata are only made visible to clients after they have been written to the operation log. The operation log is replicated on multiple remote machines; before responding to a client operation, the log record must have been flushed locally and remotely. The master keeps writing to the operation log; once it grows too large, a checkpoint is created and the log file is reset. If the master crashes, it recovers its file system state from the latest checkpoint plus the operation log.

Operation log: example. Client C1 issues “mv file name1 file name2”; afterwards clients C2 and C3 issue “get file name1”, and the master may crash in between. Question: when does the master relay the new information to the clients, before or after having written it to the operation log? (The previous slide gives the answer: changes to metadata are made visible to clients only after they have been written to the operation log.)

Chunk replica placement. Creation of (initially empty) chunks: use under-utilised chunkservers, spread replicas across racks, and limit the number of recent creations on each chunkserver. Re-replication: started once the number of available replicas falls below a setting; the master instructs a chunkserver to copy the chunk data directly from an existing valid replica; the number of active clone operations and the bandwidth used for them are limited. Re-balancing: changes in replica distribution for better load balancing; gradual filling of new chunkservers.

GFS: Data integrity

Garbage collection. Question: how can a file be deleted from the cluster? The deletion is logged by the master and the file is renamed to a hidden file; the deletion timestamp is kept. A periodic scan of the master's file system namespace deletes hidden files older than 3 days from the master's memory (after that there is no further connection between the file and its chunks). A periodic scan of the master's chunk namespace identifies orphaned chunks (not reachable from any file) and deletes their metadata. HeartBeat messages are used to synchronise deletion between master and chunkservers.

Stale replica detection. Scenario: a chunkserver misses a change (“mutation”) applied to a chunk, e.g. data was appended to the chunk. The master maintains a chunk version number to distinguish up-to-date and stale replicas. Before an operation on a chunk, the master ensures that the version number is advanced. Stale replicas are removed in the regular garbage collection cycle.

Data corruption. Data corruption or loss can occur at the read and write stage. Question: how can chunkservers detect whether or not their stored data is corrupt? Chunkservers use checksums to detect corruption of stored data (an alternative would be to compare replicas across chunkservers). Each chunk is broken into 64 KB blocks and each block has a 32-bit checksum; checksums are kept in memory and stored persistently. On read requests, the chunkserver verifies the checksums of the data blocks that overlap the read range (i.e. corrupted data is not sent to clients).
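As an illustration of the per-block checksumming idea (the exact checksum algorithm used by GFS is not public; CRC-32 is used here as a stand-in), a chunkserver-style routine could compute one 32-bit checksum per 64 KB block:

import java.util.zip.CRC32;

public class BlockChecksums {

  static final int BLOCK_SIZE = 64 * 1024;     // 64 KB blocks, as in the GFS description

  // One 32-bit checksum per 64 KB block of a chunk held in memory.
  static long[] checksum(byte[] chunk) {
    int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
    long[] sums = new long[blocks];
    CRC32 crc = new CRC32();
    for (int b = 0; b < blocks; b++) {
      crc.reset();
      int offset = b * BLOCK_SIZE;
      int length = Math.min(BLOCK_SIZE, chunk.length - offset);
      crc.update(chunk, offset, length);
      sums[b] = crc.getValue();                // CRC-32 value fits in 32 bits
    }
    return sums;
  }
}

On a read, only the checksums of the blocks that overlap the requested byte range need to be re-verified.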

HDFS: Hadoop Distributed File System

GFS vs. HDFS

GFS                                          HDFS
Master                                       NameNode
chunkserver                                  DataNode
operation log                                journal, edit log
chunk                                        block
random file writes possible                  only append is possible
multiple-writer, multiple-reader model       single-writer, multiple-reader model
chunk: 64 KB data pieces, each with a        per HDFS block, two files on a DataNode:
  32-bit checksum                              a data file & a metadata file (checksums, timestamp)
default chunk size: 64 MB                    default block size: 128 MB
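On the HDFS side of this comparison, a client can choose the replication factor and block size per file at creation time. A hedged sketch using the Hadoop FileSystem API (path, buffer size and values are arbitrary examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out = fs.create(new Path("/user/out/big.dat"),   // hypothetical path
        true, 4096, (short) 3, 128L * 1024 * 1024)) {
      out.writeUTF("HDFS files are written once and appended to, not rewritten in place.");
    }
  }
}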

Hadoop's architecture (MapReduce 1, Hadoop 0.x and 1.x). NameNode: the master of HDFS; it directs the slave DataNode daemons to perform low-level I/O tasks and keeps track of file splitting into blocks, replication, block locations, etc. Secondary NameNode: takes snapshots of the NameNode. DataNode: each slave machine hosts a DataNode daemon.

JobTracker and TaskTracker (MapReduce 1). JobTracker (job scheduling and task progress monitoring): one JobTracker per Hadoop cluster; the middleman between your application and Hadoop (single point of contact); determines the execution plan for the application (files to process, assignment of nodes to tasks, task monitoring); takes care of (supposed) task failures. TaskTracker: one TaskTracker per DataNode; manages individual tasks; keeps in touch with the JobTracker via HeartBeats, sending progress reports and signalling empty task slots.

JobTracker and TaskTracker (MapReduce 1): architecture figure. Image source: http://lintool.github.io/MapReduceAlgorithms/

What about the jobs? (MapReduce 1). A “Hadoop job” is the unit of work to be performed (by a client): input data, a MapReduce program, and configuration information. Hadoop divides the job into tasks (two types: map and reduce) and divides the input data into fixed-size input splits: one map task per split, and one map function call for each record in the split. Splits are processed in parallel (if enough DataNodes exist). Job execution is controlled by the JobTracker and the TaskTrackers; a small illustration of how the driver can influence the split size follows below.
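A small, hedged fragment (not from the slides) showing how a driver can cap the input split size and thereby influence the number of map tasks; the 128 MB value is an arbitrary example:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// inside the driver, after the Job object has been created:
// each input split (and thus each map task) covers at most 128 MB of input
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);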

Hadoop in practice: Yahoo! (2010). 40 nodes per rack sharing one IP switch; 16 GB RAM per cluster node, 1-gigabit Ethernet. 70% of disk space is allocated to HDFS; the remainder is for the operating system and the data emitted by Mappers (which is not stored in HDFS). NameNode: up to 64 GB RAM. Total storage: 9.8 PB, i.e. 3.3 PB net storage (replication factor 3). 60 million files, 63 million blocks; 54,000 blocks hosted per DataNode. 1-2 nodes are lost per day; on an HDFS cluster with 3,500 nodes, re-replicating the lost blocks takes about 2 minutes.

YARN (MapReduce 2). The JobTracker/TaskTracker setup becomes a bottleneck in clusters with thousands of nodes. As an answer, YARN (Yet Another Resource Negotiator) was developed. YARN splits the JobTracker's tasks (job scheduling and task progress monitoring) into two daemons: the resource manager (RM) and the application master, which negotiates with the RM for cluster resources; each Hadoop job has a dedicated application master.

YARN advantages. Scalability: larger clusters are supported. Availability: high availability (high uptime) is supported. Utilization: more fine-grained use of resources. Multitenancy: MapReduce is just one application among many.

Recommended reading: Chapter 3

THE END
