CS6100 Big Data Computing DFS, GFS, HDFS - Western Michigan University


CS6100 Big Data Computing: DFS, GFS, HDFS
Ajay Gupta, B239 CEAS, Computer Science Department, Western Michigan University
ajay.gupta@wmich.edu, 276-3104
WiSe Lab @ WMU, www.cs.wmich.edu/wise
10/8/20

Acknowledgements
I have liberally borrowed these slides and material from a number of sources, including:
– the Web: MIT, Harvard, UMD, UCSD, UW, Clarkson, . . .
– Amazon, Google, IBM, Apache, ManjraSoft, CloudBook, . . .
Thanks to the original authors, including Dyer, Lin, Dean, Buyya, Ghemawat, Fanelli, Bisciglia, Kimball, and Michels-Slettvet. If I have missed any, it is purely unintentional. My sincere appreciation to those authors and their creative minds.

How do we get data to the workers?
– NAS, SAN, compute nodes
– What's the problem here?

Distributed File System
Don't move data to workers; move workers to the data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data local (a data-locality sketch follows below)
Why?
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput is good
A distributed file system is the answer:
– GFS (Google File System)
– HDFS (Hadoop Distributed File System) for Hadoop

DFS Functionality
– Manage files and data blocks across different clusters and racks
– Enhance fault tolerance and concurrency by distributing and replicating data blocks
– Advantages: fault tolerance and high concurrency
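To make "move workers to the data" concrete, here is a toy sketch of data-local scheduling: given the nodes that hold a replica of a block and the workers with a free slot, prefer a worker that already has the data. The class and method names are made up for illustration; this is not GFS or Hadoop code.

import java.util.List;
import java.util.Set;

// Toy data-locality scheduler: prefer a worker that already hosts the block.
public class LocalityScheduler {

    /**
     * Picks a worker for a task that reads the given block.
     * @param replicaHosts nodes that hold a replica of the block
     * @param idleWorkers  nodes with a free task slot
     * @return a data-local worker if one exists, otherwise any idle worker
     */
    public static String pickWorker(Set<String> replicaHosts, List<String> idleWorkers) {
        for (String worker : idleWorkers) {
            if (replicaHosts.contains(worker)) {
                return worker;            // data-local: no network transfer of the block
            }
        }
        // No data-local slot free: fall back to any idle worker (remote read).
        return idleWorkers.isEmpty() ? null : idleWorkers.get(0);
    }

    public static void main(String[] args) {
        Set<String> replicas = Set.of("node3", "node7", "node12");
        List<String> idle = List.of("node5", "node7", "node9");
        System.out.println(pickWorker(replicas, idle));  // prints node7 (data-local)
    }
}

Real schedulers refine this with rack-local and off-rack fallbacks, but the preference order is the same idea.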


GFS: Assumptions
– Commodity hardware over "exotic" hardware
– High component failure rates: inexpensive commodity components fail all the time
– "Modest" number of HUGE files
– Files are write-once, mostly appended to (perhaps concurrently)
– Large streaming reads over random access
– High sustained throughput over low latency
(GFS slides adapted from material by Dean et al.)

GFS: Design Decisions
– Files stored as chunks: fixed size (64 MB); a sketch of the offset-to-chunk mapping follows below
– Reliability through replication: each chunk replicated across 3 chunkservers
– Single master to coordinate access and keep metadata: simple centralized management
– No data caching: little benefit due to large data sets and streaming reads
– Simplify the API: push some of the issues onto the client

(Figure: GFS architecture. Source: Ghemawat et al., SOSP 2003)
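As a small illustration of the fixed 64 MB chunk size, the following sketch shows how a client-side library might map a byte offset in a file to a chunk index and an offset within that chunk. The names are hypothetical; this is not the actual GFS client code.

// Maps a file offset to (chunk index, offset within chunk) for 64 MB chunks.
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB, as in GFS

    static long chunkIndex(long fileOffset) {
        return fileOffset / CHUNK_SIZE;
    }

    static long offsetInChunk(long fileOffset) {
        return fileOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;  // byte 200 MB into the file
        System.out.println("chunk " + chunkIndex(offset)
                + ", offset " + offsetInChunk(offset));
        // prints: chunk 3, offset 8388608 (i.e., 8 MB into the 4th chunk)
    }
}

The client would then ask the master for that chunk's handle and replica locations and read the bytes from a chunkserver directly.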

GFS Single Master
We know this is a:
– Single point of failure
– Scalability bottleneck
GFS solutions:
– Shadow masters
– Minimize master involvement:
  - Never move data through it; use it only for metadata (and cache metadata at clients)
  - Large chunk size
  - Master delegates authority to primary replicas in data mutations (chunk leases)
Simple, and good enough!

GFS Master's Responsibilities (1/2)
– Metadata storage
– Namespace management/locking
– Periodic communication with chunkservers: give instructions, collect state, track cluster health
– Chunk creation, re-replication, rebalancing:
  - Balance space utilization and access speed
  - Spread replicas across racks to reduce correlated failures
  - Re-replicate data if redundancy falls below threshold (see the sketch below)
  - Rebalance data to smooth out storage and request load

GFS Master's Responsibilities (2/2)
– Garbage collection:
  - Simpler, more reliable than traditional file delete
  - Master logs the deletion and renames the file to a hidden name
  - Hidden files are lazily garbage collected
– Stale replica deletion: detect "stale" replicas using chunk version numbers
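The sketch below is a toy model of two of these responsibilities: dropping replicas whose chunk version number is behind the master's current version (stale replica detection) and flagging a chunk for re-replication when fewer than the target number of up-to-date replicas remain. All names and the data layout are assumptions made for illustration.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy model of the master's per-chunk replica bookkeeping.
public class ChunkHealthCheck {
    static final int TARGET_REPLICAS = 3;

    /** Keep only replicas whose version matches the master's current version. */
    static List<String> liveReplicas(Map<String, Integer> replicaVersions, int currentVersion) {
        return replicaVersions.entrySet().stream()
                .filter(e -> e.getValue() == currentVersion)   // stale replicas are dropped
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // chunkserver -> version of the replica it holds
        Map<String, Integer> replicas = Map.of("cs1", 7, "cs2", 7, "cs3", 6);  // cs3 is stale
        int currentVersion = 7;

        List<String> live = liveReplicas(replicas, currentVersion);
        System.out.println("live replicas: " + live);
        if (live.size() < TARGET_REPLICAS) {
            System.out.println("re-replicate: only " + live.size()
                    + " of " + TARGET_REPLICAS + " up-to-date replicas");
        }
    }
}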

GFS: Metadata
Global metadata is stored on the master:
– File and chunk namespaces
– Mapping from files to chunks
– Locations of each chunk's replicas
All in memory (64 bytes / chunk): fast and easily accessible (a back-of-the-envelope estimate follows below)
The master keeps an operation log for persistent logging of critical metadata updates:
– Persistent on local disk
– Replicated
– Checkpoints for faster recovery

GFS: Mutations
A mutation is a write or an append, and it must be done at all replicas.
Goal: minimize master involvement.
Lease mechanism:
– Master picks one replica as primary and gives it a "lease" for mutations
– Primary defines a serial order of mutations
– All replicas follow this order
– Data flow is decoupled from control flow

Parallelization Problems
– How do we assign work units to workers?
– What if we have more work units than workers?
– What if workers need to share partial results?
– How do we aggregate partial results?
– How do we know all the workers have finished?
– What if workers die?
How is MapReduce different?
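A quick back-of-the-envelope estimate (my own numbers, not from the slides) shows why roughly 64 bytes of metadata per chunk fits comfortably in the master's memory even for very large file systems:

// Rough estimate of master metadata memory for a given amount of file data.
public class MetadataEstimate {
    public static void main(String[] args) {
        long dataBytes = 1_000L * 1024 * 1024 * 1024 * 1024;  // ~1 PB of file data (assumed)
        long chunkSize = 64L * 1024 * 1024;                   // 64 MB chunks
        long bytesPerChunk = 64;                              // metadata per chunk (approx.)

        long chunks = dataBytes / chunkSize;                  // ~16.4 million chunks
        long metadataBytes = chunks * bytesPerChunk;

        System.out.printf("%d chunks -> ~%.1f GB of in-memory metadata%n",
                chunks, metadataBytes / (1024.0 * 1024 * 1024));
        // ~16.4 million chunks -> ~1.0 GB of metadata for ~1 PB of data
    }
}

About 1 GB of metadata for roughly a petabyte of file data is easily held in RAM, which is part of what makes the single-master design workable.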

From Theory to Practice (you and a Hadoop cluster)
1. Scp data to the cluster
2. Move the data into HDFS (a Java sketch of this step follows below)
3. Develop code locally
4. Submit the MapReduce job
4a. Go back to Step 3
5. Move the data out of HDFS
6. Scp the data from the cluster

On Amazon: With EC2 (you and your Hadoop cluster on EC2)
0. Allocate a Hadoop cluster
1. Scp data to the cluster
2. Move the data into HDFS
3. Develop code locally
4. Submit the MapReduce job
4a. Go back to Step 3
5. Move the data out of HDFS
6. Scp the data from the cluster
7. Clean up!
Uh oh. Where did the data go?

On Amazon: EC2 and S3
– EC2 (the cloud): your Hadoop cluster
– S3: persistent store
– Copy from S3 to HDFS; copy from HDFS back to S3
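Step 2, moving data into HDFS, is normally done with the Hadoop command-line tools; the equivalent through Hadoop's Java FileSystem API looks roughly like the sketch below. The paths are placeholders, and it assumes the Hadoop client libraries and the cluster configuration files are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies a local file into HDFS, mirroring step 2 of the workflow above.
public class PutIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the cluster's default FS (HDFS)

        Path local = new Path("/tmp/input.txt");              // placeholder local path
        Path remote = new Path("/user/demo/input/input.txt"); // placeholder HDFS path

        fs.copyFromLocalFile(local, remote);        // step 2: local disk -> HDFS
        fs.close();
    }
}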

Questions?

HDFS: Hadoop Distributed File System
– Introduction
– Architecture: NameNode, DataNodes, HDFS Client
– File I/O Operations and Replica Management

HDFS: Introduction
The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. These goals are achieved by replicating file content on multiple machines (DataNodes).

HDFS Architecture
– A file is made of several data blocks, and those blocks are stored across a cluster of one or more machines with data storage capacity.
– Each block of a file is replicated across a number of machines to prevent loss of data.

(Figure: HDFS architecture)

HDFS Architecture: NameNode and DataNodes
– HDFS stores file system metadata and application data separately.
– Metadata refers to file metadata: attributes such as permissions, modification and access times, and namespace and disk space quotas.
– HDFS stores metadata on a dedicated server called the NameNode (master).
– Application data are stored on other servers called DataNodes (slaves).

NameNode (HDFS)
A single NameNode:
– Maintains the namespace tree (a hierarchy of files and directories) and supports operations such as opening, closing, and renaming files and directories.
– Determines the mapping of file blocks to DataNodes (the physical location of file data).
– Collects block reports from DataNodes on block locations.
– Replicates missing data blocks.

HDFS DataNodes Functionality
DataNodes:
– are responsible for serving read and write requests from the file system's clients
– perform block creation, deletion, and replication upon instruction from the NameNode
– periodically send block reports to the NameNode

(Figure: HDFS architecture)

HDFS Data Read/Write Operation
To write or read a file in HDFS:
– the client interacts with the NameNode (master)
– the NameNode provides the addresses of the DataNodes (slaves)
– the client then starts writing/reading the data to/from those DataNodes

(Figures: HDFS read/write operation)
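From an application's point of view, this NameNode/DataNode interaction is hidden behind Hadoop's FileSystem API: the client simply opens streams, and the library contacts the NameNode for block locations and then the DataNodes for the bytes. A minimal sketch, with placeholder paths and assuming the Hadoop client libraries are available:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Writes a small file to HDFS and reads it back via the FileSystem API.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");   // placeholder HDFS path

        // Write: the client library asks the NameNode where to place blocks,
        // then streams the bytes to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, and the bytes
        // come directly from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}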

Heartbeats (HDFS)
DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. (A sketch of the timeout check follows below.)

Reliability, Robustness (HDFS)
– Heartbeat: the signal that each DataNode continuously sends to the NameNode.
  - If the NameNode does not receive a heartbeat from a DataNode, it considers that DataNode dead and takes corrective action.
– Balancing: if a DataNode crashes, the blocks on it are gone, so those blocks become under-replicated compared to the remaining blocks.
  - The master node (NameNode) signals the DataNodes holding replicas of the lost blocks to replicate them, so that the overall distribution of blocks stays balanced.
– Replication: carried out by the DataNodes.

HDFS Example Scenario
(Figure: example scenario)
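As a rough sketch of this liveness check, the toy monitor below marks a DataNode dead when its last heartbeat is older than a timeout. The class names and the 10-minute timeout are assumptions for illustration; the real NameNode logic is more involved.

import java.util.HashMap;
import java.util.Map;

// Toy heartbeat monitor: a DataNode is presumed dead if its last
// heartbeat is older than the timeout, triggering re-replication.
public class HeartbeatMonitor {
    static final long TIMEOUT_MS = 10 * 60 * 1000;   // assumed 10-minute timeout

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    void recordHeartbeat(String dataNode, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
    }

    /** Returns true if the DataNode should be considered dead at time nowMs. */
    boolean isDead(String dataNode, long nowMs) {
        Long last = lastHeartbeat.get(dataNode);
        return last == null || nowMs - last > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor();
        long t0 = System.currentTimeMillis();
        monitor.recordHeartbeat("datanode-17", t0);

        long later = t0 + 11 * 60 * 1000;            // 11 minutes later, no heartbeat seen
        if (monitor.isDead("datanode-17", later)) {
            System.out.println("datanode-17 presumed dead: re-replicate its blocks");
        }
    }
}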

Questions?
