Introduction to Databases, Lecture 5: Distributed Databases


Introduction to Databases
Lecture 5 – Distributed databases and NoSQL
Gianluca Quercini
gianluca.quercini@centralesupelec.fr
Master DSBA 2020 – 2021

Objectives / What you will learn
In this lecture you will learn:
- The limitations of the relational data model.
- What a distributed database is.
- How data is distributed across different machines.
- The availability-consistency trade-off (CAP theorem).
- The main characteristics of NoSQL databases.
- The families of NoSQL databases.

Towards NoSQL / Relational data model limitations: impedance mismatch
Definition (Impedance mismatch). Impedance mismatch refers to the challenges encountered when one needs to map objects used in an application to tables stored in a relational database.
[Figure: the tables Author (author id, first name, last name, country), Book (isbn, title, publisher id), Book author (author id, isbn) and Publisher (publisher id, name, country).]

Towards NoSQL / Impedance mismatch: solutions
Object-oriented databases
- Data is stored as objects.
- Object-oriented applications save their objects as they are.
- Examples: ConceptBase, Db4o, Objectivity/DB.
Disadvantages
- Not as popular as relational database systems.
- Requires familiarity with object-oriented concepts.
- No standard query language.

Towards NoSQL / Impedance mismatch: solutions
Object-relational mappers (ORM)
- Use of libraries that map objects to relational tables.
- The application manipulates objects.
- The ORM library translates object operations into SQL queries.
- Examples: SQLAlchemy, Hibernate, Sequelize.
Disadvantages
- Abstraction. Weak control over how queries are translated.
- Portability. Each ORM has a different set of APIs.
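To make the idea of an object-relational mapper concrete, here is a minimal hand-written sketch that uses only Python's standard `sqlite3` module. It is not the API of SQLAlchemy, Hibernate or Sequelize: the `Author` class and the `AuthorMapper` helper are hypothetical names introduced for illustration, and the sample author is made up.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Author:
    author_id: int
    first_name: str
    last_name: str
    country: str


class AuthorMapper:
    """Hypothetical mini-mapper: translates object operations into SQL."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS author ("
            "author_id INTEGER PRIMARY KEY, first_name TEXT, "
            "last_name TEXT, country TEXT)"
        )

    def save(self, author: Author) -> None:
        # Object -> row: the application never writes this INSERT itself.
        self.conn.execute(
            "INSERT OR REPLACE INTO author VALUES (?, ?, ?, ?)",
            (author.author_id, author.first_name,
             author.last_name, author.country),
        )

    def get(self, author_id: int) -> Optional[Author]:
        # Row -> object: the SELECT result is turned back into an Author.
        row = self.conn.execute(
            "SELECT author_id, first_name, last_name, country "
            "FROM author WHERE author_id = ?",
            (author_id,),
        ).fetchone()
        return Author(*row) if row else None


mapper = AuthorMapper(sqlite3.connect(":memory:"))
mapper.save(Author(1, "Jane", "Doe", "France"))  # made-up sample data
```

The application only calls `save` and `get`; the SQL is hidden inside the mapper, which is also why control over the generated queries is weak.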

Towards NoSQL / Limitations of the relational model: graph data
Normalization
- In a relational database, tables are normalized.
- Data on different entities are kept in different tables.
- This reduces redundancy and guarantees integrity.
- In a normalized relational database, links between entities are expressed with foreign key constraints.
- Following a link requires joining different tables, which is an expensive operation.
[Figure: Book (isbn, title, publisher id) joined with Book author (author id, isbn), which is in turn joined with Author (author id, first name, last name, country).]
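As an illustration of the two joins needed to go from a book to its authors, the following sketch builds the three tables in an in-memory SQLite database. The column names follow the slide; the rows and the `authors_of` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (
        author_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE book_author (
        author_id INTEGER REFERENCES author(author_id),
        isbn TEXT REFERENCES book(isbn));
    -- Made-up sample rows.
    INSERT INTO author VALUES (1, 'Jane', 'Doe'), (2, 'John', 'Smith');
    INSERT INTO book VALUES ('isbn-1', 'Example book');
    INSERT INTO book_author VALUES (1, 'isbn-1'), (2, 'isbn-1');
""")


def authors_of(isbn):
    """Return the authors of a book; this needs two joins."""
    return conn.execute(
        """
        SELECT a.first_name, a.last_name
        FROM book AS b
        JOIN book_author AS ba ON ba.isbn = b.isbn
        JOIN author AS a ON a.author_id = ba.author_id
        WHERE b.isbn = ?
        ORDER BY a.last_name
        """,
        (isbn,),
    ).fetchall()


print(authors_of("isbn-1"))  # [('Jane', 'Doe'), ('John', 'Smith')]
```

Each additional hop in the graph of entities adds another join to the query.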

Towards NoSQL / Limitations of the relational model: data distribution
Objective of a relational database system
- Prioritize data integrity and consistency.
- Different mechanisms ensure integrity and consistency:
  - Primary and foreign key constraints.
  - Transactions.
- Mechanisms to enforce data integrity and consistency have a cost:
  - Managing transactions.
  - Checking that new data complies with the given integrity constraints.
- Things get worse in distributed databases:
  - Data is distributed across several machines.
  - Join operations become very expensive.
  - Integrity mechanisms become very expensive.

Data distribution / Distributed database
Definition (Distributed database). A distributed database is one where data is stored across several machines, also known as nodes.
Shared-nothing architecture
- Each node has its own CPU, memory and storage.
- Nodes only share the network connection.
Pros/cons of a distributed database
- Pro: allows storage and management of large volumes of data.
- Con: far more complex than a single-server database.

Data distribution / Distributed database
[Figure: a global user accesses a global schema managed by a distributed DBMS, which sits on top of the local systems DBMS-1, DBMS-2, DBMS-3, ..., DBMS-n; local users access a local DBMS directly.]

Data distribution / Distributing data: when?
Small-scale data
- Data distribution is not a good option when the data scale is small.
- With small-scale data, a distributed database performs worse than a single-server database.
- Overhead. We lose more time distributing and managing data than retrieving it.
Large-scale data
- If the data does not fit in a single machine, data distribution is the only option left.
- Distributed databases allow more concurrent database requests than single-server databases.

Data distribution / Distributing data: how?
Data distribution options
- Replication. Multiple copies of the same data stored on different nodes.
- Sharding. Data partitions stored on different nodes.
- Hybrid. Replication + sharding.
Properties
- Location transparency: applications do not have to be aware of the location of the data.
- Replication transparency: applications do not need to be aware that the data is replicated.

Data distribution / Replication
- The same piece of data is replicated across different nodes.
- Each copy is called a replica.
- Replication factor. The number of nodes on which the data is replicated.
[Figure: the Department table replicated on several nodes.]

codeD | nameD           | budget
14    | Administration  | 300,000
25    | Education       | 150,000
62    | Finance         | 600,000
45    | Human Resources | 150,000

Data distribution / Replication
Advantages
- Scalability. Multiple nodes can serve queries on the same data.
- Latency. Queries can be served by geographically proximate nodes.
- Fault tolerance. The database keeps serving queries even if some nodes fail.
Disadvantages
- Storage cost. Storage is used to keep multiple copies of the same data.
- Consistency. All replicas must be kept in sync.

Data distribution / Replication
Replica consistency
When a replica is updated, the other replicas must be updated as well.
[Figure: an update applied to one replica of the Department table is propagated to the replicas on the other nodes (A, B, C).]
UPDATE Department SET budget = 500000 WHERE codeD = ...

Data distribution / Replication
Synchronous updates
- Updates are propagated immediately to the other replicas.
- Pro: small inconsistency window. The replicas are inconsistent only for a short interval of time.
- Con: if updates are frequent, the database might spend more time propagating updates than serving queries.
Asynchronous updates
- Updates are propagated at regular intervals.
- Pro: more efficient when updates are frequent.
- Con: long inconsistency window.
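The difference between the two propagation strategies can be sketched in a few lines. The `ReplicatedValue` class below is a hypothetical toy model (a single value, in-process "replicas", no network), not the mechanism of any particular database; the budget values reuse those of the Department example.

```python
class ReplicatedValue:
    """Toy model of one value stored on several replicas."""

    def __init__(self, n_replicas, synchronous, initial=None):
        self.replicas = [initial] * n_replicas
        self.synchronous = synchronous
        self.pending = []  # updates not yet propagated (asynchronous mode)

    def write(self, value):
        self.replicas[0] = value  # the replica that receives the write
        if self.synchronous:
            self._propagate(value)      # immediately: small inconsistency window
        else:
            self.pending.append(value)  # later: long inconsistency window

    def flush(self):
        """Propagate queued updates; meant to run at regular intervals."""
        for value in self.pending:
            self._propagate(value)
        self.pending.clear()

    def _propagate(self, value):
        for i in range(1, len(self.replicas)):
            self.replicas[i] = value

    def read(self, replica):
        return self.replicas[replica]


sync_store = ReplicatedValue(3, synchronous=True, initial=300_000)
sync_store.write(500_000)
print(sync_store.read(2))   # 500000: already propagated

async_store = ReplicatedValue(3, synchronous=False, initial=300_000)
async_store.write(500_000)
print(async_store.read(2))  # 300000: stale until the next flush
async_store.flush()
print(async_store.read(2))  # 500000
```

In the synchronous mode every write pays the propagation cost; in the asynchronous mode that cost is deferred to `flush`, at the price of stale reads in between.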

Data distribution / Replication
Master-slave replication
- Write operations are only possible on the master node.
- The master node propagates the updates to the slave nodes.
- Read operations are served by both the master and the slave nodes.
[Figure: a master node and two slave nodes, with read and write arrows.]

Data distribution / Replication
Master-slave replication
- Pro: prevents write conflicts.
  - Only one replica is written at any given time.
- Con: single point of failure.
  - If the master fails, write operations are unavailable.
  - Algorithms exist to elect a new master.
- Con: read conflicts are possible.

Data distribution / Replication
Master-slave replication: read conflict
Two read operations on the same data might return different values.
Write: update (Department, budget = 500,000)
Read: select (Department, budget)
[Figure: after the write, two nodes return 500,000 and one node, which has not yet received the update, still returns 300,000.]
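A read conflict of this kind can be reproduced with a small simulation. The `Node` and `MasterSlaveCluster` classes are hypothetical and only model the behaviour described above: writes go to the master, and each slave receives the update at a different moment.

```python
class Node:
    def __init__(self, data):
        self.data = dict(data)

    def read(self, key):
        return self.data[key]


class MasterSlaveCluster:
    def __init__(self, initial, n_slaves=2):
        self.master = Node(initial)
        self.slaves = [Node(initial) for _ in range(n_slaves)]

    def write(self, key, value):
        # Writes are only accepted by the master.
        self.master.data[key] = value

    def propagate_to(self, slave_index):
        # The master pushes its current state to one slave.
        self.slaves[slave_index].data.update(self.master.data)


cluster = MasterSlaveCluster({"budget": 300_000})
cluster.write("budget", 500_000)
cluster.propagate_to(0)  # slave 1 has not received the update yet

values = [
    cluster.master.read("budget"),
    cluster.slaves[0].read("budget"),
    cluster.slaves[1].read("budget"),
]
print(values)  # [500000, 500000, 300000]
```

Two clients reading the same budget at the same time get different answers depending on which node serves them.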

Data distribution / Replication
Peer-to-peer replication
- Read and write operations are possible on any node.
[Figure: three peer nodes A, B and C; each accepts reads and writes and propagates its writes to the other two.]

Data distribution / Replication
Peer-to-peer replication
- Pro: no single point of failure.
- Con: write and read conflicts are possible.
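The sketch below shows how a write conflict can arise when two peers accept a write on the same data before exchanging updates. It is deliberately simplified: the `Peer` class and the single version counter are assumptions made for illustration, and the counter only detects the case where both peers have applied the same number of writes. Real systems rely on richer mechanisms such as vector clocks.

```python
class Peer:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.version = 0  # number of writes applied to this copy

    def write(self, value):
        self.value = value
        self.version += 1


def sync(a, b):
    """Exchange state between two peers.

    Returns True if a write conflict is detected; otherwise copies the
    newer state onto the older peer and returns False.
    """
    if a.version == b.version:
        # Same number of writes but different values: concurrent writes.
        return a.value != b.value
    newer, older = (a, b) if a.version > b.version else (b, a)
    older.value, older.version = newer.value, newer.version
    return False


a, b = Peer("A", 300_000), Peer("B", 300_000)
a.write(500_000)   # a client writes on peer A
b.write(400_000)   # another client writes on peer B at the same time
print(sync(a, b))  # True: the two writes conflict
```

When only one peer has written, `sync` simply brings the other peer up to date and reports no conflict.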

Data distribution / Sharding
Sharding
- Data is partitioned into balanced, non-overlapping shards.
- Shards are distributed across the nodes.
[Figure: the Department and Employee tables split into shards stored on different nodes.]

Data distribution / Sharding
Advantages
- Load balance. Data can be uniformly distributed across nodes.
- Inconsistencies cannot arise (non-overlapping shards).
Disadvantages
- When a node fails, all its partitions are lost.
- Join operations might need to be performed across nodes.
- When data is added, shards might need to be rebalanced.
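One common way to obtain balanced, non-overlapping shards is to hash each key and take the result modulo the number of shards. The slides do not prescribe a partitioning scheme, so hash-based sharding is an assumption here, and `shard_of` and `partition` are hypothetical helper names.

```python
import hashlib


def shard_of(key, n_shards):
    """Map a key to a shard index with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def partition(keys, n_shards):
    """Split keys into non-overlapping shards."""
    shards = [[] for _ in range(n_shards)]
    for key in keys:
        shards[shard_of(key, n_shards)].append(key)
    return shards


keys = list(range(1000))
shards = partition(keys, 4)
print([len(s) for s in shards])  # roughly 250 keys per shard

# Adding a fifth shard changes the shard of many existing keys,
# which is why rebalancing can be costly with this naive scheme.
moved = sum(1 for k in keys if shard_of(k, 4) != shard_of(k, 5))
print(moved)
```

Every key lands in exactly one shard, so no inconsistency between copies can arise, but changing the number of shards forces a large share of the keys to move.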

Data distribution / Combining replication and sharding
[Figure: the data is partitioned into P1, P2 and P3; the partitions are replicated across the nodes A1-A3, B1-B3 and C1-C3.]
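A hybrid layout can be sketched as a placement function that assigns each shard to several nodes. The round-robin rule and the names `place_replicas` and `available_shards` are assumptions made for this example; they do not reproduce the exact layout of the figure.

```python
def place_replicas(n_shards, nodes, replication_factor):
    """Assign each shard to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        shard: [nodes[(shard + r) % len(nodes)]
                for r in range(replication_factor)]
        for shard in range(n_shards)
    }


def available_shards(placement, failed_nodes):
    """Shards that still have at least one replica on a live node."""
    return [shard for shard, holders in placement.items()
            if any(node not in failed_nodes for node in holders)]


placement = place_replicas(3, ["A", "B", "C"], replication_factor=2)
print(placement)  # {0: ['A', 'B'], 1: ['B', 'C'], 2: ['C', 'A']}
print(available_shards(placement, {"A"}))  # [0, 1, 2]
```

With a replication factor of 2, the failure of any single node leaves every shard reachable, which plain sharding does not offer.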

Data distribution / Consistency in distributed databases
Replication consistency
Keeping in sync all replicas of the same data.
Cross-record consistency
Ensuring the coherence of data in related records. Related records might be on different nodes.
[Figure: the Department table and the Employee table stored on different nodes.]
UPDATE Department SET codeD = 15 WHERE codeD = 14

Data distribution / Consistency in distributed databases
Definition (Distributed transactions). A distributed transaction is a sequence of read/write operations that are applied on data that reside on multiple nodes and are executed as an atomic operation.

Employee (node B)
codeE | last name | codeD
1     | Bennet    | 14
2     | Doe       | 62
3     | Fisher    | 25
4     | Green     | 62

UPDATE Department SET codeD = 15 WHERE codeD = 14
We need to update the codeD of each employee in department 14.

Data distribution / Consistency in distributed databases
Distributed transaction
- The nodes need to coordinate before committing the transaction operations on their data.
- The coordination requires an exchange of messages between the transaction managers on different nodes.
[Figure: the node holding Department asks node B, which holds Employee, "ready to commit?".]
UPDATE Department SET codeD = 15 WHERE codeD = 14
UPDATE Employee SET codeD = 15 WHERE codeD = 14

Data distribution / Consistency in distributed databases
Distributed transaction
- Data being manipulated by a transaction is locked.
- Locked data is unavailable for both read and write operations.
- Locking guarantees the consistency of the database.
- Locking reduces the availability of the database.
[Figure: the node holding Department asks node B, which holds Employee, "ready to commit?".]
UPDATE Department SET codeD = 15 WHERE codeD = 14
UPDATE Employee SET codeD = 15 WHERE codeD = 14
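The coordination and locking described above can be sketched in the style of a two-phase commit. The slides only show the "ready to commit?" message, so the choice of this protocol is an assumption, and the `Participant` class and `run_transaction` function are hypothetical. Failures and timeouts are not modelled.

```python
class Participant:
    """A node holding some rows, each with a codeD column."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.locked = False
        self.staged = None

    def prepare(self, old, new):
        """Phase 1 ("ready to commit?"): lock the data and stage the change."""
        if self.locked:
            return False  # data is held by another transaction: vote no
        self.locked = True
        self.staged = [dict(r, codeD=new) if r["codeD"] == old else dict(r)
                       for r in self.rows]
        return True

    def commit(self):
        """Phase 2: make the staged rows visible and release the lock."""
        self.rows, self.staged, self.locked = self.staged, None, False

    def abort(self):
        self.staged, self.locked = None, False

    def read(self):
        if self.locked:
            raise RuntimeError("data is locked by a transaction")
        return self.rows


def run_transaction(participants, old, new):
    """Rename department code `old` to `new` on all nodes, atomically."""
    prepared = []
    for p in participants:
        if p.prepare(old, new):
            prepared.append(p)
        else:
            for q in prepared:  # one "no" vote aborts everywhere
                q.abort()
            return False
    for p in participants:
        p.commit()
    return True


node_a = Participant("A", [{"codeD": 14, "nameD": "Administration"},
                           {"codeD": 25, "nameD": "Education"}])
node_b = Participant("B", [{"codeE": 1, "last name": "Bennet", "codeD": 14},
                           {"codeE": 2, "last name": "Doe", "codeD": 62}])
print(run_transaction([node_a, node_b], old=14, new=15))  # True
```

Between `prepare` and `commit` a participant refuses reads, which is the loss of availability that locking introduces.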

Data distribution / The CAP theorem
Consistency (C), Availability (A), Partition tolerance (P)
- Consistency. Replicas are in sync and related records are coherent across all nodes.
- Availability. A database can still execute read/write operations when some nodes fail.
- Partition tolerance. The database can still operate when a network partition occurs.
[Figure: nodes A, B and C separated by a network partition.]

Data distribution / The CAP theorem
Theorem (CAP, Brewer 1999). Given the three properties of consistency, availability and partition tolerance, a networked shared-data system can have at most two of these properties.
Proof
Suppose that the system is partition tolerant (P). When a network partition occurs, we have two options.
1. Allow write operations. This makes the database available (A), but not consistent (C). Some of the replicas might not be synced due to the network partition.
2. Disable write operations. This makes the database consistent (C) but not available (A).

Data distribution / The CAP theorem
Proof (continued)
- The only way that we can have a consistent (C) and available (A) database is when network partitions do not occur.
- But if we assume that network partitions never occur, the system is not partition tolerant (P).
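The two options of the proof can be played out on a toy store with two replicas. `TwoReplicaStore` and its `"CP"` and `"AP"` modes are hypothetical names used only to illustrate the trade-off; the network partition is simulated with a flag.

```python
class TwoReplicaStore:
    """Toy store with two replicas, A and B."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"A": {}, "B": {}}
        self.partitioned = False  # True while A and B cannot communicate

    def write(self, node, key, value):
        if not self.partitioned:
            for replica in self.replicas.values():
                replica[key] = value  # both replicas stay in sync
        elif self.mode == "CP":
            # Option 2: disable writes, stay consistent, lose availability.
            raise RuntimeError("write rejected during network partition")
        else:
            # Option 1: accept the write locally, lose consistency.
            self.replicas[node][key] = value

    def consistent(self):
        return self.replicas["A"] == self.replicas["B"]


ap = TwoReplicaStore("AP")
ap.write("A", "x", 1)
ap.partitioned = True
ap.write("A", "x", 2)   # accepted: the store is still available
print(ap.consistent())  # False: replica B still holds x = 1
```

In `"CP"` mode the same write raises an error instead, and the two replicas remain identical.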

Data distribution / Consistency vs Availability
- Relational databases favor consistency over availability.
  - They take a transactional approach to data consistency.
- NoSQL databases favor availability over consistency.
  - In many contexts strong consistency is not necessary.
[Figure: at time t1, Bob posts (1. post) to a server in Europe; the update is propagated to a server in the USA, from which Alice reads (2. read) at time t2.]
Alice does not see Bob's post between t1 and t2. Is it really an issue?

Data distribution / ACID vs BASE
ACID (strong consistency)
- Atomicity (A). "All or nothing".
- Consistency (C). From a consistent state to a consistent state.
- Isolation (I). Serializability of transactions.
- Durability (D). Upon commit, all the updates are permanent.
BASE (availability)
- Basic Availability (BA). The database appears to work most of the time.
- Soft state (S). Write and read inconsistencies can occur.
- Eventually consistent (E). The database will be consistent at some point.
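Eventual consistency can be illustrated with replicas that periodically exchange their state and keep, for each key, the most recent write. The last-write-wins rule and the names `merge` and `anti_entropy` are assumptions chosen for the sketch; timestamps are plain integers, and writes with equal timestamps are not handled.

```python
def merge(local, remote):
    """Last-write-wins merge: keep, per key, the entry with the larger timestamp."""
    for key, (timestamp, value) in remote.items():
        if key not in local or local[key][0] < timestamp:
            local[key] = (timestamp, value)


def anti_entropy(replicas):
    """One round in which every replica pulls the state of all the others."""
    for a in replicas:
        for b in replicas:
            if a is not b:
                merge(a, b)


# Each replica maps a key to (timestamp, value).
europe = {"post": (1, "Bob's post")}
usa = {}

print("post" in usa)  # False: soft state, Alice cannot see the post yet
anti_entropy([europe, usa])
print(usa["post"])    # (1, "Bob's post"): the replicas have converged
```

Until the exchange happens, reads on the two replicas disagree; after it, all replicas hold the same data.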

NoSQL databases / NoSQL databases
NoSQL: interpretations of the acronym
- Non SQL: strong opposition to SQL.
- Not only SQL: NoSQL and SQL coexistence.
Goals
- Address the object-relational impedance mismatch.
- Provide better scalability for distributed databases.
- Provide a better modeling of semi-structured data.

NoSQL databases / NoSQL databases
Families
- Key-value databases.
- Document-oriented databases.
- Column-oriented databases.
- Graph databases.
The first three families use the notion of aggregate to model the data.
- They differ in how the aggregates are organized.
Graph databases are something of an outlier.
- They were not conceived with data distribution in mind.
- They were born ACID-compliant.
There is no single NoSQL database, and there is no "NoSQL" query language.

NoSQL databases / Aggregate
- An aggregate is a data structure used to store the data of a specific entity.
  - In that, it is similar to a row in a relational table.
- We can nest an aggregate into another aggregate.
  - This is a huge difference from a row in a relational table.
- An aggregate is a unit of data for replication and sharding.
  - The data in an aggregate is never split across two shards.
  - All the data in an aggregate is always available on one node.
- Unlike in a relational database, we can control how data is distributed.

NoSQL databases / Aggregate vs relational row
Denormalized table
- In a relational database, the following table would not be in first normal form.
- The column categories contains a list of values.
- Searching for all products in category kitchen would be hard with SQL.

article id | name                 | producer      | categories
234543     | Bamboo utensil spoon | KitchenMaster | home, kitchen, spatulas

In a relational database, we can address this problem by normalizing the table.

NoSQL databases / Aggregate vs relational row
First normal form
- The following table is in first normal form.
- But we introduced redundancy.
- What if we update the producer name of the article 234543?
- In a distributed database, the rows corresponding to this article might be on different nodes.

article id | name                 | producer      | categories
234543     | Bamboo utensil spoon | KitchenMaster | home
234543     | Bamboo utensil spoon | KitchenMaster | kitchen
234543     | Bamboo utensil spoon | KitchenMaster | spatulas

We can further normalize the table to avoid redundancy.

NoSQL databases / Aggregate vs relational row
Second normal form
- To avoid redundancy, we split the table into three tables in second normal form.
- In a distributed database, the rows in these tables might be on different nodes.
- We might need cross-node join operations, which are very expensive.
[Tables: article (article id, name, producer), with the row (234543, Bamboo utensil spoon, KitchenMaster); article category (article id, category id); category (category id, ...).]

NoSQL databases / Aggregate vs relational row
Aggregate
- In an aggregate, lists of values are allowed.
- Searching for all products in category kitchen is supported.

{
  "article id": 234543,
  "name": "Bamboo utensil spoon",
  "producer": "KitchenMaster",
  "categories": ["home", "kitchen", "spatulas"]
}

The data in an aggregate is never split across different nodes.
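As a small illustration of why the category search becomes straightforward, the aggregate can be represented as a Python dictionary and filtered on its list of categories. The second article and the `in_category` helper are made up for the example; an actual document database would offer its own query syntax for this.

```python
articles = [
    {
        "article id": 234543,
        "name": "Bamboo utensil spoon",
        "producer": "KitchenMaster",
        "categories": ["home", "kitchen", "spatulas"],
    },
    {   # Hypothetical second aggregate, not from the lecture.
        "article id": 999999,
        "name": "Desk lamp",
        "producer": "ExampleCorp",
        "categories": ["home", "office"],
    },
]


def in_category(articles, category):
    """Names of the articles whose list of categories contains `category`."""
    return [a["name"] for a in articles if category in a["categories"]]


print(in_category(articles, "kitchen"))  # ['Bamboo utensil spoon']
print(in_category(articles, "home"))     # ['Bamboo utensil spoon', 'Desk lamp']
```

No join is involved: everything needed to answer the query is inside each aggregate.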

NoSQL databases / Aggregate
- Denormalization is allowed in the aggregate.
- Data that are queried together are stored in the same node.

{
  "code employee": 12353,
  "first name": "John",
  "last name": "Smith",
  "salary": 50000,
  "position": "Assistant director",
  "department": {
    "dept code": 12,
    "dept name": "Accounting",
    "budget": 120000
  }
}
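The following sketch shows both sides of nesting the department inside the employee aggregate: the department name is read without any join, but a change to the department has to be applied to every aggregate that embeds it. The second employee and the two helper functions are hypothetical.

```python
employees = [
    {
        "code employee": 12353,
        "first name": "John",
        "last name": "Smith",
        "salary": 50000,
        "position": "Assistant director",
        "department": {"dept code": 12, "dept name": "Accounting",
                       "budget": 120000},
    },
    {   # Hypothetical second employee of the same department.
        "code employee": 12354,
        "first name": "Jane",
        "last name": "Doe",
        "salary": 52000,
        "position": "Accountant",
        "department": {"dept code": 12, "dept name": "Accounting",
                       "budget": 120000},
    },
]


def department_name(employee):
    """Read the nested department: no join, a single aggregate is accessed."""
    return employee["department"]["dept name"]


def set_department_budget(employees, dept_code, budget):
    """Update the embedded copies of a department; returns how many changed."""
    updated = 0
    for employee in employees:
        if employee["department"]["dept code"] == dept_code:
            employee["department"]["budget"] = budget
            updated += 1
    return updated


print(department_name(employees[0]))                 # Accounting
print(set_department_budget(employees, 12, 130000))  # 2
```

Reads that follow the shape of the aggregate are cheap, while updates to duplicated data touch several aggregates.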


