Large-Scale Data Management And Distributed Systems - Introduction

Upload by : Abram Andresen

Transcription

Large-Scale Data Management and Distributed Systems
Introduction
Thomas ropars.github.io/2021

Teaching staff
- Vivien Quema (vivien.quema@grenoble-inp.fr)
- Thomas Ropars (thomas.ropars@univ-grenoble-alpes.fr)

Organization of the course
Two complementary topics:
- Distributed systems (V. Quema): 18 hours
- Data management (T. Ropars): 18 hours
Data management:
- 12 hours of lectures
- 6 hours of practical sessions
Grading:
- Graded lab (25% of the final grade)
- Written exam (75% of the final grade)

Covered topics
- The challenges of Big Data and distributed data processing
- The Map/Reduce programming model
- Batch and stream processing systems
- Distributed (NoSQL) databases
- About the design of these systems:
  - Their underlying design principles
  - The impact of Cloud characteristics

Overview of this lecture
- Introduction to the Big Data challenges
- Challenges of distributed computing
- Introduction to Cloud Computing
- Scalability techniques

Agenda
- The challenges of Big Data
- Distributed and Parallel Systems
- Cloud Computing
- Running at scale

References
- Coursera: Big Data, University of California San Diego
- The lecture notes of V. Leroy
- The lecture notes of R. Lachaize
- Designing Data-Intensive Applications by Martin Kleppmann

The data deluge
Many sources of data:
- Sensors
- Social media
- Scientific experiments
- Industry activity
- Etc.

Some numbers
- Every 2 days, we create as much information as we did since 2013
  - 90% of all data has been created in the last two years
- 40K search queries on Google every second
- 45M messages on WhatsApp every minute
- 40 billion IoT devices by 2025
- 570 new web sites every minute
- Largest database: 3.2 trillion rows (AT&T)
- 40 TB of data every second during an experiment at the Large Hadron Collider

Hardware capacity
Storage
- All the music of the world stored for 500
- Large Amazon EC2 instance: 3.9 TB of RAM, 8 x 7.5 TB of SSD
Computing resources
- Google data centers: more than 2.5M servers (2016)
- Amazon: the capacity added each day matches the size of Amazon in 2005
Huge opportunities for storing and processing data

Big data challenges: The V’s
[Figure. Source: Big Data for Modern Industry: Challenges and Trends]

Big data challenges: The V’s
- Volume: amount of data generated
- Variety: all kinds of data are generated (text, image, voice, time series, etc.)
- Velocity: rate at which data are produced and should be processed
- Veracity: noise/anomalies in data, truthfulness
- Value: how do we extract/learn valuable knowledge from the data?

Big data challenges: The V’s
In this course we are going to deal with:
- Volume
- Velocity
- Variety
Questions to be answered:
- How to build a system and algorithms that can process huge amounts of data?
- How to build a system and algorithms that can process data in a timely manner?
- (Bonus question) How to build software that can deal with the variety of data?

Motivation
The solution for processing large amounts of data: using large amounts of resources.
Note that:
- Different strategies can be used to leverage these resources
- Using large amounts of resources presents new challenges

Increasing the processing power and the storage capacity
Goals
- Increasing the amount of data that can be processed (weak scaling)
- Decreasing the time needed to process a given amount of data (strong scaling)
Two solutions
- Scaling up
- Scaling out
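
The two goals above are usually quantified with speedup and efficiency ratios. The following is a minimal sketch, not taken from the course: the function names and the timing values are hypothetical and only illustrate how the two notions differ.

```python
def strong_scaling_speedup(t_one_node: float, t_n_nodes: float) -> float:
    """Fixed problem size: time on 1 node divided by time on n nodes."""
    return t_one_node / t_n_nodes


def strong_scaling_efficiency(t_one_node: float, t_n_nodes: float, n: int) -> float:
    """Fraction of the ideal n-fold speedup actually obtained (1.0 is ideal)."""
    return strong_scaling_speedup(t_one_node, t_n_nodes) / n


def weak_scaling_efficiency(t_one_node: float, t_n_nodes: float) -> float:
    """Problem size grows with n; ideally the run time stays constant (1.0)."""
    return t_one_node / t_n_nodes


# Hypothetical measurements, in seconds.
print(strong_scaling_efficiency(100.0, 25.0, 4))  # ideal case: 4 nodes, 4x faster
print(weak_scaling_efficiency(100.0, 125.0))      # 4x the data took 25% longer
```

An efficiency well below 1.0 signals that adding nodes costs more (in coordination and communication) than it brings.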

Vertical scaling (scaling up)
Idea: increase the processing power by adding resources to existing nodes:
- Upgrade the processor (more cores, higher frequency)
- Increase memory volume
- Increase storage volume
Pros:
- Performance improvement without modifying the application
Cons:
- Expensive (non-linear costs)
- Limited scalability (capabilities of the hardware, cf. the end of Moore’s law)

Horizontal scaling (scaling out)
Idea: increase the processing power by adding more nodes to the system
- Cluster of commodity servers
Pros:
- Less expensive (nodes can be turned off when not needed)
- Infinite scalability
Cons:
- Often requires modifying applications

Large scale infrastructures
[Figures: Google data center; Barcelona Supercomputing Center; Amazon data center]

Distributed computing: Definition
A distributed computing system is a system including several computational entities where:
- Each entity has its own local memory
- All entities communicate by message passing over a network
Each entity of the system is called a node.

Distributed computing: Challenges
(Read Chapter 1 of Designing Data-Intensive Applications for further details.)
Scalability
- How to take advantage of a large number of distributed resources?
Performance
- How to take full advantage of the available resources?
- Moving data is costly
  - How to maximize the ratio between computation and communication?
- How to ensure that request-processing latency remains below some upper bound?

Distributed computing: Challenges
Fault tolerance
- The more resources, the higher the probability of failure
- MTBF (Mean Time Between Failures)
  - MTBF of one server: 3 years
  - MTBF of 1000 servers ≈ 19 hours (beware: over-simplified computation)
- How to ensure computation completion?
- How to ensure that results are correct?
Programmability
- How to provide programming models that hide the complexity of distributed computing (while remaining efficient)?
- What high-level services should be made available to make programmers' lives easier?
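
A minimal sketch of this kind of simplified computation, assuming independent servers with a constant failure rate, so that failure rates add up and the cluster MTBF is the server MTBF divided by the number of servers. The function name is illustrative. With 3 years and 1000 servers this naive model yields roughly 26 hours rather than the 19 hours quoted above, so the slide's figure presumably rests on assumptions that the transcript does not state.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def cluster_mtbf_hours(server_mtbf_years: float, n_servers: int) -> float:
    """Naive estimate: expected time until some server in the cluster fails.

    Assumes independent servers with a constant failure rate, so the
    cluster failure rate is n_servers times the single-server rate.
    """
    return server_mtbf_years * HOURS_PER_YEAR / n_servers


print(round(cluster_mtbf_hours(3, 1000), 1))  # 26.3
```

Whatever the exact model, the order of magnitude is the point: at this scale, failures happen about daily and have to be treated as the normal case.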

A warning about distributed computing
"You can have a second computer once you’ve shown you know how to use the first one." (P. Barham)
Horizontal scaling is very popular.
- But it is not always the most efficient solution (both in time and cost)
Examples
- Processing a few tens of GB of data is often more efficient on a single machine than on a cluster of machines
- Sometimes a single-threaded program outperforms a cluster of machines (F. McSherry et al. “Scalability! But at what COST?”. 2015.)

Where to find computing resources?
Cloud computing
- A service provider gives access to computing resources through an internet connection.
Pros:
- Pay only for the resources you use
- Get access to a large amount of resources
  - Amazon Web Services features millions of servers
Cons:
- Volatility
  - Little control over the resources
  - Example: access to resources based on bidding
  - See "The Netflix Simian Army"
- Performance variability
  - Physical resources shared with other users

Architecture of a data center (simplified)
[Figure legend: switch, storage, memory, processor]

Architecture of a data center
A shared-nothing architecture
- Horizontal scaling
- No specific hardware
A hierarchical infrastructure
- Resources clustered in racks
- Communication inside a rack is more efficient than between racks
- Resources can even be geographically distributed over several data centers

A hybrid system
Two paradigms for communicating between computing entities:
- Shared memory
- Message passing

Shared memory
- Entities share a global memory
- Communication by reading and writing to the globally shared memory
- Communication between threads inside one node

Message passing
- Entities have their own private memory
- Communication by sending/receiving messages over a network
- Communication between nodes
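
The two paradigms can be contrasted with a small sketch. It is illustrative only: everything runs inside one Python process, and a `queue.Queue` stands in for the network that real nodes would use, which is an assumption made to keep the example self-contained. All function names are hypothetical.

```python
import queue
import threading

# Shared memory: every thread updates the same counter.
counter = 0
counter_lock = threading.Lock()


def add_shared(n: int) -> None:
    global counter
    for _ in range(n):
        with counter_lock:  # concurrent writes must be synchronized
            counter += 1


def run_shared(workers: int, n: int) -> int:
    global counter
    counter = 0
    threads = [threading.Thread(target=add_shared, args=(n,)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


# Message passing: each worker keeps private state and sends its result.
def count_private(n: int, mailbox: queue.Queue) -> None:
    local = 0  # private to this worker, never touched by the others
    for _ in range(n):
        local += 1
    mailbox.put(local)  # the only interaction is sending a message


def run_message_passing(workers: int, n: int) -> int:
    mailbox: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=count_private, args=(n, mailbox)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in range(workers))


print(run_shared(4, 1000), run_message_passing(4, 1000))  # 4000 4000
```

The shared-memory version needs a lock around every update, while the message-passing version needs no lock but pays for one message per worker.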

Running at scale
How to distribute data?
- Partitioning
- Replication
Replication
- Several nodes host a copy of the data
- Main goal: fault tolerance
  - No data lost if one node crashes
Partitioning
- Splitting the data into partitions
- Partitions are assigned to different nodes
- Main goal: performance
  - Partitions can be processed in parallel
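
The two techniques combine naturally: data is split into partitions, and each partition is stored on several nodes. Below is a minimal sketch under simple assumptions. The round-robin placement policy, the node names, and both function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def place_partitions(n_partitions: int, nodes: List[str],
                     replication_factor: int) -> Dict[int, List[str]]:
    """Assign each partition to `replication_factor` distinct nodes (round-robin)."""
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return {
        p: [nodes[(p + i) % len(nodes)] for i in range(replication_factor)]
        for p in range(n_partitions)
    }


def surviving_partitions(placement: Dict[int, List[str]], failed_node: str) -> List[int]:
    """Partitions that still have at least one copy after `failed_node` crashes."""
    return [p for p, hosts in placement.items()
            if any(h != failed_node for h in hosts)]


nodes = ["n0", "n1", "n2"]
replicated = place_partitions(4, nodes, replication_factor=2)
unreplicated = place_partitions(4, nodes, replication_factor=1)

print(surviving_partitions(replicated, "n0"))    # [0, 1, 2, 3]
print(surviving_partitions(unreplicated, "n0"))  # [1, 2]
```

With two copies per partition, the crash of one node loses no data. Without replication, every partition hosted on the failed node is lost.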

Replication
Purposes
- Continuing to serve requests when parts of the system fail
- Keeping data close to the users
- Having multiple servers able to answer read requests
Challenges
- How to handle operations that modify data (write operations)?
  - Consistency (consensus in a distributed system is a very difficult problem)
  - Performance
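
The consistency challenge can be shown in a few lines. This is a toy sketch, not a real replication protocol: the `Replica` class is hypothetical, and message reordering on the network is simulated by simply applying the same two writes in different orders.

```python
class Replica:
    """A toy in-memory replica that applies writes in arrival order."""

    def __init__(self) -> None:
        self.data = {}

    def write(self, key: str, value: int) -> None:
        self.data[key] = value

    def read(self, key: str):
        return self.data.get(key)


r1, r2 = Replica(), Replica()

# Client 1 writes A = 1 and client 2 writes A = 2 at about the same time.
# The two messages reach the replicas in different orders.
for value in (1, 2):
    r1.write("A", value)
for value in (2, 1):
    r2.write("A", value)

print(r1.read("A"), r2.read("A"))  # 2 1
```

Both replicas received the same writes, yet they disagree on the value of A. Making all replicas agree on a single order of writes is where consensus comes in.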

Replication
[Figure: Client 1 (write A = 1) and Client 2 (write A = 2) read and write item A through a switch; the value held by the replicas is shown as "A?"]

Partitioning (sharding)
Purposes
- Performance
  - Distributing the load over several nodes
Challenges
- How to partition the data?
  - Evenly distributed load (even for skewed workloads)
  - Range queries
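
The tension between the two challenges shows up when comparing two common partitioning schemes. This is a hedged sketch: the split points, the sample keys, and both function names are hypothetical, and the slides do not state which schemes the course covers.

```python
import bisect
import hashlib

BOUNDARIES = ["g", "n", "t"]  # hypothetical split points, giving 4 partitions


def range_partition(key: str) -> int:
    """Keys are assigned by sorted range: [..g), [g..n), [n..t), [t..]."""
    return bisect.bisect_right(BOUNDARIES, key)


def hash_partition(key: str, n_partitions: int = 4) -> int:
    """Keys are assigned by a stable hash, independent of their sort order."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


keys = ["adam", "alice", "anna", "bob"]

# Range partitioning: a range query over these keys touches one partition,
# but a workload concentrated on this range overloads that single node.
print({range_partition(k) for k in keys})  # {0}

# Hash partitioning: adjacent keys are scattered regardless of order, so a
# range query generally has to contact every partition.
print(sorted(hash_partition(k) for k in keys))
```

Range partitioning serves range queries well but is sensitive to skew, while hash partitioning trades efficient range queries for a more even spread of keys.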

Partitioning
[Figure: Client 1 and Client 2 read and write items A to D through a switch; the items are spread over different nodes]

Partitioning and replication
[Figure: items A to D behind a switch, each stored on several nodes]

More references
Mandatory reading
- Big data and its technical challenges, by Jagadish et al., CACM 2014.
Suggested reading
- Chapter 1 of Designing Data-Intensive Applications by Martin Kleppmann
- The Netflix Simian Army (…lix-simian-army16e57fbab116)
