Large-Scale Data Management And Distributed Systems - Introduction


Large-Scale Data Management and Distributed Systems
Introduction
Thomas ropars.github.io/2021

Teaching staff
- Vivien Quema (vivien.quema@grenoble-inp.fr)
- Thomas Ropars (thomas.ropars@univ-grenoble-alpes.fr)

Organization of the course
2 complementary topics
- Distributed systems (V. Quema): 18 hours
- Data management (T. Ropars): 18 hours
Data Management
- 12 hours of lectures
- 6 hours of practical sessions
Grading
- Graded lab (25% of the final grade)
- Written exam (75% of the final grade)

Covered topics
- The challenges of Big Data and distributed data processing
- The Map/Reduce programming model
- Batch and stream processing systems
- Distributed (NoSQL) databases
About the design of these systems:
  - Their underlying design principles
  - The impact of Cloud characteristics

Overview of this lecture
- Introduction to the Big Data challenges
- Challenges of distributed computing
- Introduction to Cloud Computing
- Scalability techniques

Agenda
- The challenges of Big Data
- Distributed and Parallel Systems
- Cloud Computing
- Running at scale

References
- Coursera: Big Data, University of California San Diego
- The lecture notes of V. Leroy
- The lecture notes of R. Lachaize
- Designing Data-Intensive Applications by Martin Kleppmann

The data deluge
Many sources of data
- Sensors
- Social media
- Scientific experiments
- Industry activity
- Etc.

Some numbers
- Every 2 days, we create as much information as we did since 2013
  - 90% of all data has been created in the last two years
- 40K search queries on Google every second
- 45M messages on WhatsApp every minute
- 40 billion IoT devices by 2025
- 570 new web sites every minute
- Largest database: 3.2 trillion rows (AT&T)
- 40 TB of data every second during an experiment at the Large Hadron Collider

Hardware capacity
Storage
- All the music of the world stored for 500
- Large Amazon EC2 instance: 3.9 TB of RAM, 8 x 7.5 TB of SSD
Computing resources
- Google data centers: more than 2.5M servers (2016)
- Amazon: the capacity added each day is comparable to the size of Amazon in 2005
Huge opportunities for storing and processing data

Big data challenges: The V's
[Figure: the V's of Big Data. Source: Big Data for Modern Industry: Challenges and Trends]

Big data challenges: The V's
- Volume: amount of data generated
- Variety: all kinds of data are generated (text, image, voice, time series, etc.)
- Velocity: rate at which data are produced and should be processed
- Veracity: noise/anomalies in data, truthfulness
- Value: how do we extract/learn valuable knowledge from the data?

Big data challenges: The V's
In this course we are going to deal with:
- Volume
- Velocity
- Variety
Questions to be answered:
- How to build a system and algorithms that can process huge amounts of data?
- How to build a system and algorithms that can process data in a timely manner?
- (Bonus question) How to build software that can deal with the variety of data?

Motivation
The solution for processing large amounts of data: use a large amount of resources
Note that:
- Different strategies can be used to leverage these resources
- Using a large amount of resources presents new challenges

Increasing the processing power and the storage capacity
Goals
- Increasing the amount of data that can be processed (weak scaling)
- Decreasing the time needed to process a given amount of data (strong scaling)
Two solutions
- Scaling up
- Scaling out
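The two goals above are commonly quantified with speedup and efficiency. The following is a minimal sketch of those metrics; the function names and the timings are hypothetical and only serve as an illustration.

```python
def strong_scaling_speedup(t_one_node: float, t_n_nodes: float) -> float:
    """Fixed problem size: how many times faster N nodes are than one node."""
    return t_one_node / t_n_nodes


def strong_scaling_efficiency(t_one_node: float, t_n_nodes: float, n_nodes: int) -> float:
    """Fraction of the ideal (linear) speedup that is actually obtained."""
    return strong_scaling_speedup(t_one_node, t_n_nodes) / n_nodes


def weak_scaling_efficiency(t_one_node: float, t_n_nodes: float) -> float:
    """Problem size grows with N; ideally the run time stays constant (value 1.0)."""
    return t_one_node / t_n_nodes


# Hypothetical measurements in seconds, for illustration only.
# Strong scaling: same data set, 1 node takes 100 s, 8 nodes take 20 s.
speedup = strong_scaling_speedup(100.0, 20.0)
efficiency = strong_scaling_efficiency(100.0, 20.0, n_nodes=8)

# Weak scaling: 8 times more data on 8 nodes takes 125 s instead of 100 s.
weak_efficiency = weak_scaling_efficiency(100.0, 125.0)

print(speedup, efficiency, weak_efficiency)
```

An efficiency well below 1 indicates that part of the added resources is spent on overheads such as communication rather than on useful computation.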

Vertical scaling (scaling up)
Idea
Increase the processing power by adding resources to existing nodes:
- Upgrade the processor (more cores, higher frequency)
- Increase memory volume
- Increase storage volume
Pros and Cons
- Pro: Performance improvement without modifying the application
- Con: Expensive (non-linear costs)
- Con: Limited scalability (capabilities of the hardware, cf. the end of Moore's law)

Horizontal scaling (scaling out)
Idea
Increase the processing power by adding more nodes to the system
- Cluster of commodity servers
Pros and Cons
- Con: Often requires modifying applications
- Pro: Less expensive (nodes can be turned off when not needed)
- Pro: Infinite scalability
The solution studied in this course

Large scale infrastructures
[Figures: Google data center; Barcelona Supercomputing Center; Amazon data center]

Distributed computing: Definition
A distributed computing system is a system including several computational entities where:
- Each entity has its own local memory
- All entities communicate by message passing over a network
Each entity of the system is called a node.

Distributed computing: Challenges
Scalability
- How to take advantage of a large number of distributed resources?
Performance
- How to take full advantage of the available resources?
- Moving data is costly
  - How to maximize the ratio between computation and communication?
- How to ensure that the latency of request processing remains below some upper bound?
(Read Chapter 1 of Designing Data-Intensive Applications for further details.)

Distributed computing: Challenges
Fault tolerance
- The more resources, the higher the probability of failure
- MTBF (Mean Time Between Failures)
  - MTBF of one server: 3 years
  - MTBF of 1000 servers ≈ 19 hours (beware: over-simplified computation)
- How to ensure computation completion?
- How to ensure that results are correct?
Programmability
- How to provide programming models that hide the complexity of distributed computing (while remaining efficient)?
- What high-level services should be made available to make programmers' lives easier?
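The kind of over-simplified computation mentioned above can be sketched as follows. It assumes that nodes fail independently at a constant rate and that the system counts as failed as soon as any single node fails; under these assumptions, the MTBF of N nodes is the MTBF of one node divided by N. The function name and the constant are illustrative. With a 3-year MTBF per server, this model gives roughly 26 hours for 1000 servers, the same order of magnitude as the figure quoted above; the exact value depends on the failure model that is assumed.

```python
HOURS_PER_YEAR = 365 * 24


def system_mtbf_hours(node_mtbf_years: float, n_nodes: int) -> float:
    """MTBF of a group of nodes, assuming independent failures at a constant
    rate and a system that fails as soon as any one node fails."""
    return node_mtbf_years * HOURS_PER_YEAR / n_nodes


one_server = system_mtbf_hours(3, 1)          # a single server: 26280 hours
thousand_servers = system_mtbf_hours(3, 1000)  # about 26 hours under this model

print(one_server, thousand_servers)
```

Whatever the exact model, the conclusion is the same: at the scale of a thousand servers, failures are a daily event rather than an exception.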

A warning about distributed computing
"You can have a second computer once you've shown you know how to use the first one." (P. Barham)
Horizontal scaling is very popular.
- But not always the most efficient solution (both in time and cost)
Examples
- Processing a few tens of GB of data is often more efficient on a single machine than on a cluster of machines
- Sometimes a single-threaded program outperforms a cluster of machines (F. McSherry et al. "Scalability! But at what COST?". 2015.)

Where to find computing resources?
Cloud computing
- A service provider gives access to computing resources through an internet connection.
Pros and Cons
- Pro: Pay only for the resources you use
- Pro: Get access to a large amount of resources
  - Amazon Web Services features millions of servers
- Con: Volatility
  - Low control over the resources
  - Example: access to resources based on bidding
  - See "The Netflix Simian Army"
- Con: Performance variability
  - Physical resources shared with other users

Architecture of a data center (simplified)
[Figure legend: switch, storage, memory, processor]

Architecture of a data center
A shared-nothing architecture
- Horizontal scaling
- No specific hardware
A hierarchical infrastructure
- Resources clustered in racks
- Communication inside a rack is more efficient than between racks
- Resources can even be geographically distributed over several data centers

A hybrid system
Two paradigms for communicating between computing entities:
- Shared memory
- Message passing

Shared memory
- Entities share a global memory
- Communication by reading and writing to the globally shared memory
- Communication between threads inside one node
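As a minimal sketch of the shared-memory paradigm, the threads below communicate through a variable that lives in the memory of a single process. The names `counter` and `add` are illustrative; the lock is there because concurrent updates to shared memory must be synchronized.

```python
import threading

counter = 0              # lives in memory shared by all threads of the process
lock = threading.Lock()  # protects the shared variable against concurrent updates


def add(n_increments: int) -> None:
    """Each thread updates the same variable; no message is exchanged."""
    global counter
    for _ in range(n_increments):
        with lock:
            counter += 1


threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```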

Message passing
- Entities have their own private memory
- Communication by sending/receiving messages over a network
- Communication between nodes
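As a minimal sketch of the message-passing paradigm, the two entities below share no variables and interact only by exchanging messages. A local socket pair stands in for the network here, and the JSON message format and the `worker` function are assumptions made for the example.

```python
import json
import socket
import threading


def recv_line(sock: socket.socket) -> bytes:
    """Read bytes from the socket until a newline-terminated message is complete."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    return data


def worker(conn: socket.socket) -> None:
    """A 'node': receives a request message and answers with a reply message."""
    with conn:
        request = json.loads(recv_line(conn))
        reply = {"sum": sum(request["values"])}
        conn.sendall((json.dumps(reply) + "\n").encode())


# The socket pair plays the role of the network link between two nodes.
client_end, worker_end = socket.socketpair()
t = threading.Thread(target=worker, args=(worker_end,))
t.start()

client_end.sendall((json.dumps({"values": [1, 2, 3]}) + "\n").encode())
reply = json.loads(recv_line(client_end))
t.join()
client_end.close()

print(reply["sum"])  # 6
```

Unlike in the shared-memory case, every piece of data the worker needs has to be serialized, sent, and parsed, which is one reason why moving data is costly.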

Running at scale
How to distribute data?
- Partitioning
- Replication
Replication
- Several nodes host a copy of the data
- Main goal: Fault tolerance
  - No data lost if one node crashes
Partitioning
- Splitting the data into partitions
- Partitions are assigned to different nodes
- Main goal: Performance
  - Partitions can be processed in parallel

Replication
Purposes
- Continuing to serve requests when parts of the system fail
- Keeping data close to the users
- Having multiple servers able to answer read requests
Challenges
- How to handle operations that modify data (write operations)?
  - Consistency (consensus in a distributed system is a very difficult problem)
  - Performance
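The consistency challenge can be illustrated with a small sketch: two clients write different values for the same item, and the writes reach the replicas in different orders. The `Replica` class is hypothetical, and the leader-ordered log shown at the end is only one possible remedy, chosen for the example rather than taken from the slides.

```python
class Replica:
    """One copy of the data; applies writes in the order it receives them."""

    def __init__(self) -> None:
        self.data = {}

    def apply(self, key: str, value: int) -> None:
        self.data[key] = value


write_1 = ("A", 1)  # issued by client 1
write_2 = ("A", 2)  # issued concurrently by client 2

# Without coordination, the writes may reach the replicas in different orders.
r1, r2 = Replica(), Replica()
for key, value in (write_1, write_2):
    r1.apply(key, value)
for key, value in (write_2, write_1):
    r2.apply(key, value)
diverged = r1.data != r2.data  # r1 ends with A=2, r2 ends with A=1

# One possible remedy (an assumption, not prescribed by the slides):
# a single leader fixes the order and every replica applies the same log.
leader_log = [write_1, write_2]
r3, r4 = Replica(), Replica()
for replica in (r3, r4):
    for key, value in leader_log:
        replica.apply(key, value)
consistent = r3.data == r4.data  # both end with A=2

print(diverged, consistent)  # True True
```

Agreeing on a single order costs extra communication, which is where the tension between consistency and performance comes from.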

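Replication
[Figure: Client 1 writes A = 1 and Client 2 writes A = 2, through a switch, to nodes that each hold a copy of A; subsequent reads show that the value held by each replica is uncertain (A?)]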

Partitioning (sharding)
Purposes
- Performance
  - Distributing the load over several nodes
Challenges
- How to partition the data?
  - Evenly distributed load (even for skewed workloads)
  - Range queries
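The two challenges pull in different directions, as the following sketch of two common partitioning strategies suggests. Hash partitioning tends to spread keys evenly but scatters neighbouring keys; range partitioning keeps neighbouring keys together, which helps range queries but can concentrate load. The function names, the boundaries, and the choice of SHA-256 are assumptions made for the example.

```python
import bisect
import hashlib


def hash_partition(key: str, n_partitions: int) -> int:
    """Assign a key to a partition using a hash of the key."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


def range_partition(key: str, upper_bounds: list) -> int:
    """Assign a key to a partition by key range.

    upper_bounds[i] is the exclusive upper bound of partition i;
    keys beyond the last bound go to one extra, final partition.
    """
    return bisect.bisect_right(upper_bounds, key)


BOUNDS = ["g", "n", "t"]  # 4 partitions: [..g), [g..n), [n..t), [t..]
keys = ["apple", "banana", "cherry", "grape", "zebra"]

by_range = {k: range_partition(k, BOUNDS) for k in keys}
by_hash = {k: hash_partition(k, 4) for k in keys}

print(by_range)  # {'apple': 0, 'banana': 0, 'cherry': 0, 'grape': 1, 'zebra': 3}
print(by_hash)   # placement depends on the hash values
```

With the range scheme, a query for all keys between "a" and "f" touches a single partition, but a workload dominated by such keys loads only that partition. With the hash scheme, the same query has to contact every partition.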

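Partitioning
[Figure: data items A to D are partitioned over several nodes behind a switch; Client 1 reads and writes A, Client 2 reads and writes C, and each request is routed to the node holding that item]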

Partitioning and replication
[Figure: partitions A, B, C and D, each stored on several of the nodes connected to a switch]
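Combining the two techniques means that each partition is stored on several nodes. The sketch below uses a simple round-robin placement, which is an assumption chosen for illustration and not necessarily the layout shown on the slide; the function names are hypothetical.

```python
def place_replicas(n_partitions: int, n_nodes: int, replication_factor: int) -> dict:
    """Map each partition to the list of nodes holding a copy of it."""
    if replication_factor > n_nodes:
        raise ValueError("cannot place more replicas than there are nodes")
    return {
        p: [(p + i) % n_nodes for i in range(replication_factor)]
        for p in range(n_partitions)
    }


def available_partitions(placement: dict, failed_nodes: set) -> set:
    """Partitions that still have at least one copy on a live node."""
    return {p for p, nodes in placement.items() if set(nodes) - failed_nodes}


placement = place_replicas(n_partitions=4, n_nodes=4, replication_factor=2)
print(placement)  # {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 0]}

# With 2 copies per partition, any single node failure leaves all data reachable.
after_one_failure = available_partitions(placement, failed_nodes={1})
# Losing both nodes that hold partition 1 makes that partition unavailable.
after_two_failures = available_partitions(placement, failed_nodes={1, 2})
print(after_one_failure, after_two_failures)
```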

More references
Mandatory reading
- Big data and its technical challenges, by Jagadish et al., CACM 2014.
Suggested reading
- Chapter 1 of Designing Data-Intensive Applications by Martin Kleppmann
- The Netflix Simian Army, ...lix-simian-army16e57fbab116
