CS6100 Big Data Computing DFS, GFS, HDFS - Western Michigan University


CS6100 Big Data Computing: DFS, GFS, HDFS
Ajay Gupta, B239 CEAS, Computer Science Department, Western Michigan University
ajay.gupta@wmich.edu, 276-3104
WiSe Lab @ WMU, www.cs.wmich.edu/wise
10/8/20

Acknowledgements
I have liberally borrowed these slides and material from a number of sources, including:
– the Web: MIT, Harvard, UMD, UCSD, UW, Clarkson, . . .
– Amazon, Google, IBM, Apache, ManjraSoft, CloudBook, . . .
Thanks to the original authors, including Dyer, Lin, Dean, Buyya, Ghemawat, Fanelli, Bisciglia, Kimball, and Michels-Slettvet. If I have missed any, it is purely unintentional. My sincere appreciation to those authors and their creative minds.

How do we get data to the workers?
– NAS, SAN, compute nodes
– What's the problem here?

Distributed File System
Don't move data to workers; move workers to the data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data local (a data-locality sketch follows below)
Why?
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput is good
A distributed file system is the answer:
– GFS (Google File System)
– HDFS (Hadoop Distributed File System) for Hadoop

DFS Functionality
– Manage files and data blocks across different clusters and racks
– Enhance fault tolerance and concurrency by distributing and replicating data blocks
– Advantages: fault tolerance and high concurrency
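To make "move workers to the data" concrete, here is a toy sketch of data-local scheduling: given the nodes that hold a replica of a block and the workers with a free slot, prefer a worker that already has the data. The class and method names are made up for illustration; this is not GFS or Hadoop code.

import java.util.List;
import java.util.Set;

// Toy data-locality scheduler: prefer a worker that already hosts the block.
public class LocalityScheduler {

    /**
     * Picks a worker for a task that reads the given block.
     * @param replicaHosts nodes that hold a replica of the block
     * @param idleWorkers  nodes with a free task slot
     * @return a data-local worker if one exists, otherwise any idle worker
     */
    public static String pickWorker(Set<String> replicaHosts, List<String> idleWorkers) {
        for (String worker : idleWorkers) {
            if (replicaHosts.contains(worker)) {
                return worker;            // data-local: no network transfer of the block
            }
        }
        // No data-local slot free: fall back to any idle worker (remote read).
        return idleWorkers.isEmpty() ? null : idleWorkers.get(0);
    }

    public static void main(String[] args) {
        Set<String> replicas = Set.of("node3", "node7", "node12");
        List<String> idle = List.of("node5", "node7", "node9");
        System.out.println(pickWorker(replicas, idle));  // prints node7 (data-local)
    }
}

Real schedulers refine this with rack-local and off-rack fallbacks, but the preference order is the same idea.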


GFS: Assumptions
– Commodity hardware over "exotic" hardware
– High component failure rates: inexpensive commodity components fail all the time
– "Modest" number of HUGE files
– Files are write-once, mostly appended to (perhaps concurrently)
– Large streaming reads over random access
– High sustained throughput over low latency
(GFS slides adapted from material by Dean et al.)

GFS: Design Decisions
– Files stored as chunks: fixed size (64 MB); a sketch of the offset-to-chunk mapping follows below
– Reliability through replication: each chunk replicated across 3 chunkservers
– Single master to coordinate access and keep metadata: simple centralized management
– No data caching: little benefit due to large data sets and streaming reads
– Simplify the API: push some of the issues onto the client

(Figure: GFS architecture. Source: Ghemawat et al., SOSP 2003)
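As a small illustration of the fixed 64 MB chunk size, the following sketch shows how a client-side library might map a byte offset in a file to a chunk index and an offset within that chunk. The names are hypothetical; this is not the actual GFS client code.

// Maps a file offset to (chunk index, offset within chunk) for 64 MB chunks.
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB, as in GFS

    static long chunkIndex(long fileOffset) {
        return fileOffset / CHUNK_SIZE;
    }

    static long offsetInChunk(long fileOffset) {
        return fileOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;  // byte 200 MB into the file
        System.out.println("chunk " + chunkIndex(offset)
                + ", offset " + offsetInChunk(offset));
        // prints: chunk 3, offset 8388608 (i.e., 8 MB into the 4th chunk)
    }
}

The client would then ask the master for that chunk's handle and replica locations and read the bytes from a chunkserver directly.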

GFS Single Master
We know this is a:
– Single point of failure
– Scalability bottleneck
GFS solutions:
– Shadow masters
– Minimize master involvement:
  - Never move data through it; use it only for metadata (and cache metadata at clients)
  - Large chunk size
  - Master delegates authority to primary replicas in data mutations (chunk leases)
Simple, and good enough!

GFS Master's Responsibilities (1/2)
– Metadata storage
– Namespace management/locking
– Periodic communication with chunkservers: give instructions, collect state, track cluster health
– Chunk creation, re-replication, rebalancing:
  - Balance space utilization and access speed
  - Spread replicas across racks to reduce correlated failures
  - Re-replicate data if redundancy falls below threshold (see the sketch below)
  - Rebalance data to smooth out storage and request load

GFS Master's Responsibilities (2/2)
– Garbage collection:
  - Simpler, more reliable than traditional file delete
  - Master logs the deletion and renames the file to a hidden name
  - Hidden files are lazily garbage collected
– Stale replica deletion: detect "stale" replicas using chunk version numbers
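The sketch below is a toy model of two of these responsibilities: dropping replicas whose chunk version number is behind the master's current version (stale replica detection) and flagging a chunk for re-replication when fewer than the target number of up-to-date replicas remain. All names and the data layout are assumptions made for illustration.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy model of the master's per-chunk replica bookkeeping.
public class ChunkHealthCheck {
    static final int TARGET_REPLICAS = 3;

    /** Keep only replicas whose version matches the master's current version. */
    static List<String> liveReplicas(Map<String, Integer> replicaVersions, int currentVersion) {
        return replicaVersions.entrySet().stream()
                .filter(e -> e.getValue() == currentVersion)   // stale replicas are dropped
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // chunkserver -> version of the replica it holds
        Map<String, Integer> replicas = Map.of("cs1", 7, "cs2", 7, "cs3", 6);  // cs3 is stale
        int currentVersion = 7;

        List<String> live = liveReplicas(replicas, currentVersion);
        System.out.println("live replicas: " + live);
        if (live.size() < TARGET_REPLICAS) {
            System.out.println("re-replicate: only " + live.size()
                    + " of " + TARGET_REPLICAS + " up-to-date replicas");
        }
    }
}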

GFS: Metadata
Global metadata is stored on the master:
– File and chunk namespaces
– Mapping from files to chunks
– Locations of each chunk's replicas
All in memory (64 bytes / chunk): fast and easily accessible (a back-of-the-envelope estimate follows below)
The master keeps an operation log for persistent logging of critical metadata updates:
– Persistent on local disk
– Replicated
– Checkpoints for faster recovery

GFS: Mutations
A mutation is a write or an append, and it must be done at all replicas.
Goal: minimize master involvement.
Lease mechanism:
– Master picks one replica as primary and gives it a "lease" for mutations
– Primary defines a serial order of mutations
– All replicas follow this order
– Data flow is decoupled from control flow

Parallelization Problems
– How do we assign work units to workers?
– What if we have more work units than workers?
– What if workers need to share partial results?
– How do we aggregate partial results?
– How do we know all the workers have finished?
– What if workers die?
How is MapReduce different?
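A quick back-of-the-envelope estimate (my own numbers, not from the slides) shows why roughly 64 bytes of metadata per chunk fits comfortably in the master's memory even for very large file systems:

// Rough estimate of master metadata memory for a given amount of file data.
public class MetadataEstimate {
    public static void main(String[] args) {
        long dataBytes = 1_000L * 1024 * 1024 * 1024 * 1024;  // ~1 PB of file data (assumed)
        long chunkSize = 64L * 1024 * 1024;                   // 64 MB chunks
        long bytesPerChunk = 64;                              // metadata per chunk (approx.)

        long chunks = dataBytes / chunkSize;                  // ~16.4 million chunks
        long metadataBytes = chunks * bytesPerChunk;

        System.out.printf("%d chunks -> ~%.1f GB of in-memory metadata%n",
                chunks, metadataBytes / (1024.0 * 1024 * 1024));
        // ~16.4 million chunks -> ~1.0 GB of metadata for ~1 PB of data
    }
}

About 1 GB of metadata for roughly a petabyte of file data is easily held in RAM, which is part of what makes the single-master design workable.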

From Theory to Practice (you and a Hadoop cluster)
1. Scp data to the cluster
2. Move the data into HDFS (a Java sketch of this step follows below)
3. Develop code locally
4. Submit the MapReduce job
4a. Go back to Step 3
5. Move the data out of HDFS
6. Scp the data from the cluster

On Amazon: With EC2 (you and your Hadoop cluster on EC2)
0. Allocate a Hadoop cluster
1. Scp data to the cluster
2. Move the data into HDFS
3. Develop code locally
4. Submit the MapReduce job
4a. Go back to Step 3
5. Move the data out of HDFS
6. Scp the data from the cluster
7. Clean up!
Uh oh. Where did the data go?

On Amazon: EC2 and S3
– EC2 (the cloud): your Hadoop cluster
– S3: persistent store
– Copy from S3 to HDFS; copy from HDFS back to S3
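Step 2, moving data into HDFS, is normally done with the Hadoop command-line tools; the equivalent through Hadoop's Java FileSystem API looks roughly like the sketch below. The paths are placeholders, and it assumes the Hadoop client libraries and the cluster configuration files are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies a local file into HDFS, mirroring step 2 of the workflow above.
public class PutIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the cluster's default FS (HDFS)

        Path local = new Path("/tmp/input.txt");              // placeholder local path
        Path remote = new Path("/user/demo/input/input.txt"); // placeholder HDFS path

        fs.copyFromLocalFile(local, remote);        // step 2: local disk -> HDFS
        fs.close();
    }
}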

Questions?

HDFS: Hadoop Distributed File System
– Introduction
– Architecture: NameNode, DataNodes, HDFS Client
– File I/O Operations and Replica Management

HDFS: Introduction
The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. These goals are achieved by replicating file content on multiple machines (DataNodes).

HDFS Architecture
– A file is made of several data blocks, and those blocks are stored across a cluster of one or more machines with data storage capacity.
– Each block of a file is replicated across a number of machines to prevent loss of data.

(Figure: HDFS architecture)

HDFS Architecture: NameNode and DataNodes
– HDFS stores file system metadata and application data separately.
– Metadata refers to file metadata: attributes such as permissions, modification and access times, and namespace and disk space quotas.
– HDFS stores metadata on a dedicated server called the NameNode (master).
– Application data are stored on other servers called DataNodes (slaves).

NameNode (HDFS)
A single NameNode:
– Maintains the namespace tree (a hierarchy of files and directories) and supports operations such as opening, closing, and renaming files and directories.
– Determines the mapping of file blocks to DataNodes (the physical location of file data).
– Collects block reports from DataNodes on block locations.
– Replicates missing data blocks.

HDFS DataNodes Functionality
DataNodes:
– are responsible for serving read and write requests from the file system's clients
– perform block creation, deletion, and replication upon instruction from the NameNode
– periodically send block reports to the NameNode

(Figure: HDFS architecture)

HDFS Data Read/Write Operation
To write or read a file in HDFS:
– the client interacts with the NameNode (master)
– the NameNode provides the addresses of the DataNodes (slaves)
– the client then starts writing/reading the data to/from those DataNodes

(Figures: HDFS read/write operation)
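From an application's point of view, this NameNode/DataNode interaction is hidden behind Hadoop's FileSystem API: the client simply opens streams, and the library contacts the NameNode for block locations and then the DataNodes for the bytes. A minimal sketch, with placeholder paths and assuming the Hadoop client libraries are available:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Writes a small file to HDFS and reads it back via the FileSystem API.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");   // placeholder HDFS path

        // Write: the client library asks the NameNode where to place blocks,
        // then streams the bytes to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, and the bytes
        // come directly from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}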

Heartbeats (HDFS)
DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. (A sketch of the timeout check follows below.)

Reliability, Robustness (HDFS)
– Heartbeat: the signal that each DataNode continuously sends to the NameNode.
  - If the NameNode does not receive a heartbeat from a DataNode, it considers that DataNode dead and takes corrective action.
– Balancing: if a DataNode crashes, the blocks on it are gone, so those blocks become under-replicated compared to the remaining blocks.
  - The master node (NameNode) signals the DataNodes holding replicas of the lost blocks to replicate them, so that the overall distribution of blocks stays balanced.
– Replication: carried out by the DataNodes.

HDFS Example Scenario
(Figure: example scenario)
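As a rough sketch of this liveness check, the toy monitor below marks a DataNode dead when its last heartbeat is older than a timeout. The class names and the 10-minute timeout are assumptions for illustration; the real NameNode logic is more involved.

import java.util.HashMap;
import java.util.Map;

// Toy heartbeat monitor: a DataNode is presumed dead if its last
// heartbeat is older than the timeout, triggering re-replication.
public class HeartbeatMonitor {
    static final long TIMEOUT_MS = 10 * 60 * 1000;   // assumed 10-minute timeout

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    void recordHeartbeat(String dataNode, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
    }

    /** Returns true if the DataNode should be considered dead at time nowMs. */
    boolean isDead(String dataNode, long nowMs) {
        Long last = lastHeartbeat.get(dataNode);
        return last == null || nowMs - last > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor();
        long t0 = System.currentTimeMillis();
        monitor.recordHeartbeat("datanode-17", t0);

        long later = t0 + 11 * 60 * 1000;            // 11 minutes later, no heartbeat seen
        if (monitor.isDead("datanode-17", later)) {
            System.out.println("datanode-17 presumed dead: re-replicate its blocks");
        }
    }
}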

Questions?
