Distributed DatabaseManagement Systems
OutlineDistributed DBMS Introduction Distributed DBMS Architecture Distributed Database Design Distributed Query Processing Distributed Concurrency Control Distributed Reliability Protocols
Outline Introduction What is a distributed DBMS Problems Current state-of-affairsDistributed DBMS Distributed DBMS Architecture Distributed Database Design Distributed Query Processing Distributed Concurrency Control Distributed Reliability Protocols
tionintegration centralizationDistributed DBMS
What is a Distributed DatabaseSystem?A distributed database (DDB) is a collection of multiple,logically interrelated databases distributed over acomputer network.A distributed database management system (D–DBMS) isthe software that manages the DDB and provides anaccess mechanism that makes this distributiontransparent to the users.Distributed database system (DDBS) DDB D–DBMSDistributed DBMS
Centralized DBMS on NetworkSite 1Site 2Site 5CommunicationNetworkSite 4Distributed DBMSSite 3
Distributed DBMS EnvironmentSite 1Site 2Site 5CommunicationNetworkSite 4Distributed DBMSSite 3
Implicit Assumptions Data stored at a number of sites each sitelogically consists of a single processor. Processors at different sites are interconnectedby a computer network no multiprocessors parallel database systems Distributed database is a database, not acollection of files data logically related asexhibited in the users’ access patterns relational data model D-DBMS is a full-fledged DBMS not remote file system, not a TP systemDistributed DBMS
Distributed DBMS Promises Transparent management of distributed,fragmented, and replicated data Improved reliability/availability throughdistributed transactions Improved performance Easier and more economical system expansionDistributed DBMS
Transparency Transparency is the separation of the higherlevel semantics of a system from the lower levelimplementation issues.Fundamental issue is to providedata independencein the distributed environment Network (distribution) transparency Replication transparency Fragmentation transparency horizontalfragmentation: selection vertical fragmentation: projection hybridDistributed DBMS
ExampleASGEMPENOENAMETITLEE1E2E3E4E5E6E7E8J. DoeM. SmithA. LeeJ. MillerB. CaseyL. ChuR. DavisJ. JonesElect. Eng.Syst. Anal.Mech. Eng.ProgrammerSyst. Anal.Elect. Eng.Mech. Eng.Syst. Anal.ENO PNOPROJDistributed umentationDatabase Elect. Eng.Syst. Anal.Mech. Eng.Programmer40000340002700024000
Transparent AccessSELECT ENAME,SALTokyoFROMEMP,ASG,PAYWHEREDUR 12ANDEMP.ENO ASG.ENOANDPAY.TITLE EMP.TITLEParisBostonCommunicationNetworkParis projectsParis employeesParis assignmentsBoston employeesBoston projectsBoston employeesBoston assignmentsMontrealNewYorkBoston projectsNew York employeesNew York projectsNew York assignmentsDistributed DBMSMontreal projectsParis projectsNew York projectswith budget 200000Montreal employeesMontreal assignments
Distributed Database –User ViewDistributed DatabaseDistributed DBMS
Distributed DBMS - ryDistributed ionSubsystemUserQueryDBMSSoftwareUserApplication
Potentially ImprovedPerformance Proximity of data to its points of use Requires some support for fragmentation and replication Parallelism in execution Inter-query parallelism Intra-query parallelismDistributed DBMS
Parallelism Requirements Have as much of the data required by eachapplication at the site where the applicationexecutes Full replication How about updates? Updates to replicated data requires implementation ofdistributed concurrency control and commit protocolsDistributed DBMS
System Expansion Issue is database scaling Emergence of microprocessor and workstationtechnologies Demise of Grosh's law Client-server model of computing Data communication cost vs telecommunicationcostDistributed DBMS
Distributed DBMS Issues Distributed Database Design how to distribute the database replicated & non-replicated database distribution a related problem in directory management Query Processing convert user transactions to data manipulation instructions optimization problem min{cost data transmission local processing} general formulation is NP-hardDistributed DBMS
Distributed DBMS Issues Concurrency Control synchronization of concurrent accesses consistency and isolation of transactions' effects deadlock management Reliability how to make the system resilient to failures atomicity and durabilityDistributed DBMS
Relationship Between buted DBMSReliability
Outline Introduction Distributed DBMS Architecture Implementation Alternatives Component ArchitectureDistributed DBMS Distributed Database Design Distributed Query Processing Distributed Concurrency Control Distributed Reliability Protocols
DBMS ImplementationAlternativesDistributionDistrib utedmulti-DBMSPeer-to-p eerDistrib uted DBMSC lient/serverAutonomyMulti-DBMSFederated DBMSHeterogeneityDistributed DBMS
Dimensions of the Problem Distribution Whether the components of the system are located on the samemachine or not Heterogeneity Various levels (hardware, communications, operating system) DBMS important one data model, query language,transaction management algorithmsAutonomy Not well understood and most troublesome Various versions Distributed DBMSDesign autonomy: Ability of a component DBMS to decide onissues related to its own design.Communication autonomy: Ability of a component DBMS todecide whether and how to communicate with other DBMSs.Execution autonomy: Ability of a component DBMS to executelocal operations in any manner it wants to.
Datalogical DistributedDBMS ArchitectureES1ES2.ESnGCSDistributed DBMSLCS1LCS2LIS1LIS2.LCSnLISn
Datalogical Multi-DBMSArchitectureLES11Distributed DBMS.GESnGES1GES2LES1nGCSLESn1 LCS1LCS2 LCSnLIS1LIS2 LISn LESnm
tsFiltereddata onlyCommunicationsMultiple client/single serverDBMS ServicesDatabaseDistributed DBMSLAN
Task DistributionApplicationQLInterface ProgrammaticInterfaceCommunications ManagerSQLqueryresulttableCommunications ManagerQuery OptimizerLock ManagerStorage ManagerPage & Cache ManagerDatabaseDistributed DBMS
Advantages of ClientServer Architectures More efficient division of labor Horizontal and vertical scaling of resources Better price/performance on client machines Ability to use familiar tools on client machines Client access to remote data (via standards) Full DBMS functionality provided to clientworkstations Overall better system price/performanceDistributed DBMS
Problems With MultipleClient/Single Server Server forms bottleneck Server forms single point of failure Database scaling difficultDistributed DBMS
Multiple Clients/Multiple Servers directoryApplications caching query decompositionClientServicesCommunications commit protocolsLANCommunicationsCommunicationsDBMS ServicesDBMS ServicesDatabaseDistributed DBMSDatabase
Server-to-Server SQL interfaceApplications programmaticinterfaceClientServicesCommunications other applicationsupportenvironmentsDistributed DBMSLANCommunicationsCommunicationsDBMS ServicesDBMS ServicesDatabaseDatabase
Peer-to-PeerComponent ArchitectureSystemresponsesDistributed ssorSystemLocalConceptual LogSchemaLocal RecoveryManagerGD/DGlobalExecutionMonitorGlobal QueryOptimizerUSERGlobalConceptualSchemaSemantic DataControllerUserrequestsUser InterfaceHandlerExternalSchemaDATA PROCESSORLocal QueryProcessorUSER PROCESSOR
Outline Introduction Distributed DBMS Architecture Distributed Database Design Fragmentation Data PlacementDistributed DBMS Distributed Query Processing Distributed Concurrency Control Distributed Reliability Protocols
Design Problem In the general setting : Making decisions about the placement of data andprograms across the sites of a computer network as well aspossibly designing the network itself. In Distributed DBMS, the placement ofapplications entails placement of the distributed DBMS software; and placement of the applications that run on the databaseDistributed DBMS
Distribution Design Top-down mostly in designing systems from scratch mostly in homogeneous systems Bottom-up when the databases already exist at a number of sitesDistributed DBMS
Top-Down DesignRequirementsAnalysisObjectivesUser InputConceptualDesignView CS’sPhysicalDesignLIS’sDistributed DBMSView DesignES’sUser Input
Distribution Design Fragmentation Localize access Horizontal fragmentation Vertical fragmentation Hybrid fragmentation Distribution Placement of fragments on nodes of a networkDistributed DBMS
Horizontal FragmentationPROJPROJ1 : projects with budgetsless than 200,000PNOP1P2P3P4P5PROJ2 : projects with budgetsgreater than or equal to 200,000PROJ1PNOP1BUDGETInstrumentationDatabase 0310000500000LOCMontrealNew YorkNewNew 150000P2 Database Develop. 135000Distributed 00New YorkNew YorkP4Maintenance310000ParisP5CAD/CAM500000Boston
Vertical FragmentationPROJPROJ1: information aboutproject budgetsPNOP1P2P3P4P5PROJ2: information aboutproject names andlocationsPROJ1PROJ2PNOPNOP1P2P3P4P5Distributed PNAMEBUDGETInstrumentationDatabase ionDatabase Develop.CAD/CAMMaintenanceCAD/CAMLOCMontrealNew YorkNew ntrealNew YorkNewNew YorkYorkParisBoston
Correctness of Fragmentation Completeness Decomposition of relation R into fragments R1, R2, ., Rn is completeiff each data item in R can also be found in some Ri Reconstruction If relation R is decomposed into fragments R1, R2, ., Rn, then thereshould exist some relational operator such that DisjointnessR 1 i nRi If relation R is decomposed into fragments R1, R2, ., Rn, and dataitem di is in Rj, then di should not be in any other fragment Rk (k j ).Distributed DBMS
Allocation Alternatives Non-replicated partitioned : each fragment resides at only one site Replicated fully replicated : each fragment at each site partially replicated : each fragment at some of the sites Rule of thumb:If read - only queries 1 replication is advantageous,update queriesotherwise replication may cause problemsDistributed DBMS
Fragment Allocation Problem Statement GivenF {F1, F2, , Fn}fragments S {S1, S2, , Sm}network sites Q {q1, q2, , qq}applications Find the "optimal" distribution of F to S. Optimality Minimal costCommunication storage processing (read & update) Cost in terms of time (usually) Performance Response time and/or throughput Constraints Per site constraints (storage & processing) Distributed DBMS
Allocation ModelGeneral Formmin(Total Cost)subject toresponse time constraintstorage constraintprocessing constraintDecision VariablexijDistributed DBMS 1 0if fragment Fi is stored at site Sjotherwise
Outline Introduction Distributed DBMS Architecture Distributed Database Design Distributed Query Processing Query Processing Methodology Distributed Query OptimizationDistributed DBMS Distributed Concurrency Control Distributed Reliability Protocols
Query Processinghigh level user queryqueryprocessorlow level data manipulationcommandsDistributed DBMS
Query Processing Components Query language that is used SQL: “intergalactic dataspeak” Query execution methodology The steps that one goes through in executing high-level(declarative) user queries. Query optimization How do we determine the “best” execution plan?Distributed DBMS
Selecting AlternativesSELECTFROMWHEREANDENAMEEMP,ASGEMP.ENO ASG.ENODUR 37Strategy 1ΠENAME(σDUR 37EMP.ENO ASG.ENO (EMP ASG))Strategy 2ΠENAME(EMPENO(σDUR 37 (ASG)))Strategy 2 avoids Cartesian product, so is “better”Distributed DBMS
What is the Problem?Site 1Site 2Site 3Site 4ASG1 σENO “E3”(ASG) ASG2 σENO “E3”(ASG) EMP1 σENO “E3”(EMP)result EMP1’ EMP2’EMP1’Site 3Site 1Site 4ENOASG1ASG1’ASG1’ σDUR 37(ASG1)Distributed DBMSEMP2 σENO “E3”(EMP)ResultSite 5Site 5EMP1’ EMP1Site 5’EMP2’EMP2’ EMP2Site 2result2 (EMP1 EMP2)ENOASG2ASG2’ASG2’ σDUR 37(ASG2)’ENOσDUR 37(ASG1 ASG1)ASG1ASG2EMP1Site 1Site 2Site 3EMP2Site 4
Cost of Alternatives Assume: size(EMP) 400, size(ASG) 1000 tuple access cost 1 unit; tuple transfer cost 10 units Strategy 1 produce ASG': (10 10)ѽtuple access cost20 transfer ASG' to the sites of EMP: (10 10)ѽtuple transfer cost produce EMP': (10 10) ѽtuple access costѽ2 transfer EMP' to result site: (10 10) ѽtuple transfer cost200Total cost46020040Strategy 2 transfer EMP to site 5:400ѽtuple transfer cost transfer ASG to site 5 :1000ѽtuple transfer cost produce ASG':1000ѽtuple access cost1,000 join EMP and ASG':400ѽ20ѽtuple access cost8,000Total costDistributed DBMS4,00010,00023,000
Query Optimization ObjectivesMinimize a cost functionI/O cost CPU cost communication costThese might have different weights in differentdistributed environmentsWide area networks communication cost will dominate low bandwidth low speedhigh protocol overhead most algorithms ignore all other cost componentsLocal area networks communication cost not that dominant total cost function should be consideredCan also maximize throughputDistributed DBMS
Query Optimization Issues –Types of Optimizers Exhaustive search cost-based optimal combinatorial complexity in the number of relations Heuristics not optimal regroup common sub-expressions perform selection, projection first replace a join by a series of semijoins reorder operations to reduce intermediate relation size optimize individual operationsDistributed DBMS
Query Optimization Issues –Optimization Granularity Single query at a time cannot use common intermediate results Multiple queries at a time efficient if many similar queries decision space is much largerDistributed DBMS
Query Optimization Issues –Optimization Timing Static compilationoptimize prior to the execution difficult to estimate the size of the intermediate resultserror propagation can amortize over many executions R* Dynamic run time optimizationexact information on the intermediate relation sizeshave to reoptimize for multiple executionsDistributed INGRESHybrid compile using a static algorithm if the error in estimate sizes threshold, reoptimize at runtime MERMAIDDistributed DBMS
Query Optimization Issues –Statistics Relation cardinality size of a tuple fraction of tuples participating in a join with another relation Attribute cardinality of domain actual number of distinct values Common assumptions independence between different attribute values uniform distribution of attribute values within their domainDistributed DBMS
Query OptimizationIssues – Decision Sites Centralized single site determines the “best” schedule simple need knowledge about the entire distributed database Distributed cooperation among sites to determine the schedule need only local information cost of cooperation Hybrid one site determines the global schedule each site optimizes the local subqueriesDistributed DBMS
Query Optimization Issues –Network Topology Wide area networks (WAN) – point-to-point characteristics low bandwidth low speed high protocol overhead communication cost will dominate; ignore all other costfactors global schedule to minimize communication cost local schedules according to centralized query optimization Local area networks (LAN) communication cost not that dominant total cost function should be considered broadcasting can be exploited (joins) special algorithms exist for star networksDistributed DBMS
Distributed Query ProcessingMethodologyCalculus Query on Algebraic Query on GMENTSCHEMAFragment QueryGlobalOptimizationSTATS ONFRAGMENTSOptimized Fragment Querywith Communication OperationsLOCALSITESLocalOptimizationOptimized LocalQueriesDistributed DBMSLOCALSCHEMAS
Step 1 – Query DecompositionInput : Calculus query on global relations Normalization manipulate query quantifiers and qualification Analysis detect and reject “incorrect” queries possible for only a subset of relational calculus Simplification eliminate redundant predicates Restructuring calculus queryalgebraic query more than one translation is possible use transformation rulesDistributed DBMS
Restructuring Convert relational calculus torelational algebraMake use of query treesExampleFind the names of employees otherthan J. Doe who worked on theCAD/CAM project for either 1 or 2years.SELECT ENAMEFROM EMP, ASG, PROJWHERE EMP.ENO ASG.ENOAND ASG.PNO PROJ.PNOAND ENAME “J. Doe”AND PNAME “CAD/CAM”AND (DUR 12 OR DUR 24)Distributed DBMSΠENAMEProjectσDUR 12 OR DUR 24σPNAME “CAD/CAM”SelectσENAME “J. DOE”PNOENOPROJASGJoinEMP
Restructuring –TransformationRules (Examples) Commutativity of binary operations R S R S R S S RS RS RAssociativity of binary operations (R S) T (R S) T R (S T)R (S T )Idempotence of unary operations ΠA’(ΠA’(R))ΠA’(R) σp1(A1)(σp2(A2)(R)) σp1(A1)where R[A] and A' A, A"p2(A2)(R)A and A' A"Commuting selection with projectionDistributed DBMS
ExampleRecall the previous example:Find the names of employees otherthan J. Doe who worked on theCAD/CAM project for either one ortwo years.SELECT ENAMEFROM PROJ, ASG, EMPWHERE ASG.ENO EMP.ENOAND ASG.PNO PROJ.PNOAND ENAME “J. Doe”AND PROJ.PNAME “CAD/CAM”AND (DUR 12 OR DUR 24)ΠENAMEσDUR 12 OR DUR 24σPNAME “CAD/CAM”SelectσENAME “J. DOE”PNOJoinENOPROJDistributed DBMSProjectASGEMP
Equivalent QueryΠENAMEσPNAME “CAD/CAM”(DUR 12DUR 24)ENAME “J. DOE”PNO ENO ASGDistributed DBMSPROJEMP
RestructuringΠENAMEPNOΠPNO,ENAMEENOΠPNOσPNAME "CAD/CAM"PROJDistributed DBMSΠPNO,ENOσDUR 12ASGDUR 24ΠPNO,ENAMEσENAME "J. Doe"EMP
Step 2 – Data LocalizationInput: Algebraic query on distributed relations Determine which fragments are involved Localization program substitute for each global query its materialization program optimizeDistributed DBMS
ExampleΠENAMEAssume EMP is fragmented into EMP1, EMP2,EMP3 as follows: EMP1 σENO “E3”(EMP) EMP2 σ“E3” ENO “E6”(EMP) EMP3 σENO “E6”(EMP)σDUR 12 OR DUR 24σPNAME “CAD/CAM” ASG fragmented into ASG1 and ASG2as follows: ASG1 σENO “E3”(ASG) ASG2 σENO “E3”(ASG)σENAME “J. DOE”PNOENOReplace EMP by (EMP1 EMP2 EMP3 )PROJand ASG by (ASG1 ASG2) in anyqueryDistributed DBMSEMP1 EMP2 EMP3 ASG1ASG2
Provides ParallellismENOEMP1ASG1Distributed DBMSENOEMP2ASG2ENOEMP3ASG1ENOEMP3ASG2
Eliminates Unnecessary WorkENOEMP1Distributed DBMSENOASG1EMP2ENOASG2EMP3ASG2
Step 3 – Global QueryOptimizationInput: Fragment query Find the best (not necessarily optimal) globalschedule Minimize a cost function Distributed join processing Bushy vs. linear trees Which relation to ship where? Ship-whole vs ship-as-needed Decide on the use of semijoins Semijoin saves on communication at the expense ofmore local processing. Join methods Distributed DBMSnested loop vs ordered joins (merge join or hash join)
Cost-Based Optimization Solution space The set of equivalent algebra expressions (query trees). Cost function (in terms of time) I/O cost CPU cost communication cost These might have different weights in different distributedenvironments (LAN vs WAN). Can also maximize throughput Search algorithm How do we move inside the solution space? Exhaustive search, heuristic algorithms (iterativeimprovement, simulated annealing, genetic, )Distributed DBMS
Query Optimization ProcessInput QuerySearch SpaceGenerationTransformationRulesEquivalent QEPSearchStrategyBest QEPDistributed DBMSCost Model
Search Space Search space characterized byalternative execution plansFocus on join treesFor N relations, there are O(N!)equivalent join trees that can beobtained by applyingcommutativity and associativityrulesSELECT ENAME,RESPFROM EMP, ASG, PROJWHERE EMP.ENO ASG.ENOANDASG.PNO PROJ.PNOPNOPROJENOEMPASGENOEMPPNOPROJASGENO,PNO PROJDistributed DBMSASGEMP
Search Space Restrict by means of heuristics Perform unary operations before binary operations Restrict the shape of the join tree Consider only linear trees, ignore bushy onesLinear Join TreeBushy Join TreeR4R3R1Distributed DBMSR2R1R2R3R4
Search Strategy How to “move” in the search space. Deterministic Start from base relations and build plans by adding onerelation at each step Dynamic programming: breadth-first Greedy: depth-first Randomized Search for optimalities around a particular starting point Trade optimization time for execution time Better when 5-6 relations Simulated annealing Iterative improvementDistributed DBMS
Search Strategies DeterministicR4R3R1R2R1R2R3R1R2 RandomizedR3R1Distributed DBMSR2R2R1R3
Cost Functions Total Time (or Total Cost) Reduce each cost (in terms of time) component individually Do as little of each cost component as possible Optimizes the utilization of the resourcesIncreases system throughput Response Time Do as many things as possible in parallel May increase total time because of increased total activityDistributed DBMS
Total CostSummation of all cost factorsTotal costcost CPU cost I/O cost communicationCPU cost unit instruction cost ѽ no.of instructionsI/O cost unit disk I/O cost ѽ no. of disk I/Oscommunication cost message initiation transmissionDistributed DBMS
Total Cost Factors Wide area network message initiation and transmission costs high local processing cost is low (fast mainframes orminicomputers) ratio of communication to I/O costs 20:1 Local area networks communication and local processing costs are more or lessequal ratio 1:1.6Distributed DBMS
Response TimeElapsed time between the initiation and the completion of aqueryResponse time CPU time I/O time communication timeCPU time unit instruction time ѽ no. of sequential instructionsI/O time unit I/O time ѽ no. of sequential I/Oscommunication time unit msg initiation time ѽno. of sequential msg unit transmission time ѽno. of sequential bytesDistributed DBMS
ExampleSite 1x unitsSite 3Site 2y unitsAssume that only the communication cost is consideredTotal time 2 ѽ message initialization time unit transmissiontime ѽ (x y)Response time max {time to send x from 1 to 3, time to sendy from 2 to 3}time to send x from 1 to 3 message initialization time unittransmission time ѽ xtime to send y from 2 to 3 message initialization time unittransmission time ѽ yDistributed DBMS
Join Ordering Alternatives Ordering joinsSemijoin orderingConsider two relations onlyif size (R) size (S)RSif size (R) size (S) Multiple relations more difficult because too manyalternatives. Compute the cost of all alternatives and select thebest one. Necessary to compute the size of intermediaterelations which is difficult. Use heuristicsDistributed DBMS
Join Ordering – ExampleConsiderPROJPNOASGENO EMPSite 2ASGENOEMPSite 1Distributed DBMSPNOPROJSite 3
Join Ordering – ExampleExecution alternatives:1. EMP Site 22. ASG Site 1Site 2 computes EMP' EMPEMP' Site 3Site 3 computes EMP’ASGSite 1 computes EMP' EMPEMP' Site 3PROJSite 3 computes EMP’3. ASG Site 3Site 3 computes ASG' ASGPROJASG' Site 1Site 1 computes ASG' EMPDistributed DBMSPROJPROJ4. PROJ Site 2Site 2 computes PROJ' PROJ ASGPROJ' Site 1Site 1 computes PROJ'EMP5. EMP Site 2PROJ Site 2Site 2 computes EMPASGASG
Semijoin Algorithms Consider the join of two relations: R[A] (located at site 1) S[A] (located at site 2) Alternatives:1Do the join R2Perform one of the semijoin equivalentsRAASS(RR(RDistributed DBMSAA(SAS)S)AASR)A(SAR)
Semijoin Algorithms Perform the join send R to Site 2 Site 2 computes R Consider semijoin (RASAS) Site 1 computes R' RAAS S' A(S) S' Site 1S' R' Site 2 Site 2 computes R'ASSemijoin is better ifsize(ΠA(S)) size(RDistributed DBMSA S)) size(R)
R* Algorithm Cost function includes local processing as wellas transmission Considers only joins Exhaustive search Compilation Published papers provide solutions to handlinghorizontal and vertical fragmentations but theimplemented prototype does notDistributed DBMS
R* AlgorithmPerforming joins Ship whole larger data transfer smaller number of messages better if relations are small Fetch as needed number of messages O(cardinality of external relation) data transfer per message is minimal better if relations are large and the selectivity is goodDistributed DBMS
R* Algorithm –Vertical Partitioning & Joins1. Move outer relation tuples to the site of the innerrelation(a) Retrieve outer tuples(b) Send them to the inner relation site(c) Join them as they arriveTotal Cost cost(retrieving qualified outer tuples) no. of outer tuples fetched ѽcost(retrieving qualified inner tuples) msg. cost ѽ (no. outer tuples fetched ѽavg. outer tuple size) / msg. sizeDistributed DBMS
R* Algorithm –Vertical Partitioning & Joins2. Move inner relation to the site of outer relationcannot join as they arrive; they need to be storedTotal Cost cost(retrieving qualified outer tuples) no. of outer tuples fetched ѽcost(retrieving matching inner tuplesfrom temporary storage) cost(retrieving qualified inner tuples) cost(storing all qualified inner tuplesin temporary storage) msg. cost ѽ (no. of inner tuples fetched ѽavg. inner tuple size) / msg. sizeDistributed DBMS
R* Algorithm –Vertical Partitioning & Joins3. Move both inner and outer relations to another siteTotal cost cost(retrieving qualified outer tuples) cost(retrieving qualified inner tuples) cost(storing inner tuples in storage) msg. cost ѽ (no. of outer tuples fetchedѽ avg. outer tuple size) / msg. size msg. cost ѽ (no. of inner tuples fetchedѽ avg. inner tuple size) / msg. size no. of outer tuples fetched ѽcost(retrieving inner tuples fromtemporary storage)Distributed DBMS
R* Algorithm –Vertical Partitioning & Joins4. Fetch inner tuples as needed(a) Retrieve qualified tuples at outer relation site(b) Send request containing join column value(s) for outer tuplesto inner relation site(c) Retrieve matching inner tuples at inner relation site(d) Send the matching inner tuples to outer relation site(e) Join as they arriveTotal Cost cost(retrieving qualified outer tuples) msg. cost ѽ (no. of outer tuples fetched) no. of outer tuples fetched ѽ (no. ofinner tuples fetched ѽ avg. inner tuplesize ѽ msg. cost / msg. size) no. of outer tuples fetched ѽcost(retrieving matching inner tuplesfor one outer value)Distributed DBMS
Step 4 – Local OptimizationInput: Best global execution schedule Select the best access path Use the centralized optimization techniquesDistributed DBMS
Outline Introduction Distributed DBMS Architecture Distributed Database Design Distributed Query Processing Distributed Concurrency Control Transaction Concepts & Models Serializability Distributed Concurrency Control Protocols Distributed DBMSDistributed Reliability Protocols
TransactionA transaction is a collection of actions that make consistenttransformations of system states while preserving systemconsistency. concurrency transparency failure transparencyDatabase in aconsistentstateBeginTransactionDistributed DBMSDatabase may betemporarily in aninconsistent stateduring executionExecution ofTransactionDatabase in aconsistentstateEndTransaction
Example DatabaseConsider an airline reservation example with therelations:FLIGHT(FNO, DATE, SRC, DEST, STSOLD, CAP)CUST(CNAME, ADDR, BAL)FC(FNO, DATE, CNAME,SPECIAL)Distributed DBMS
Example TransactionBegin transaction Reservationbegininput(flight no, date, customer name);EXEC SQLUPDATEEXEC SQLINSERTSETWHEREINTOVALUESSTSOLD STSOLD 1FNO flight no AND DATE date;FC(FNO, DATE, CNAME, SPECIAL);(flight no, date, customer name, null);output(“reservation completed”)end . {Reservation}Distributed DBMSFLIGHT
Termination of TransactionsBegin transaction Reservationbegininput(flight no, date, customer name);EXEC NO flight no AND DATE date;if temp1 temp2 thenoutput(“no free seats”);AbortelseEXEC SQLUPDATEFLIGHTSETSTSOLD STSOLD 1WHERE FNO flight no AND DATE date;EXEC SQLINSERTINTOFC(FNO, DATE, CNAME, SPECIAL);VALUES (flight no, date, customer name, null);Commitoutput(“reservation completed”)endifend . {Reservation}Distributed DBMS
Properties of TransactionsATOMICITY all or nothingCONSISTENCY no violation of integrity constraintsISOLATION concurrent changes invisible È serializableDURABILITY committed updates persistDistributed DBMS
Transactions Provide Atomic and reliable execution in the presenceof failures Correct execution in the presence of multipleuser accesses Correct management of replicas (if they supportit)Distributed DBMS
Architecture RevisitedBegin transaction,Read, Write,Commit, AbortResultsDistributedExecut
What is a Distributed Database System? A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
Distributed Database Design Distributed Directory/Catalogue Mgmt Distributed Query Processing and Optimization Distributed Transaction Mgmt -Distributed Concurreny Control -Distributed Deadlock Mgmt -Distributed Recovery Mgmt influences query processing directory management distributed DB design reliability (log) concurrency control (lock)
Distributed Database Cont 12 A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. In a distributed database system, the database is stored on several computers. Data management is decentralized but act as if they are centralized. A distributed database system consists of loosely coupled
Distributed databases One particular approach to database decentralization is commonly called distributed database systems. In this ap proach, a single logical database schema is defined, which describes all the data in the database system; the physical realization of the database is then distributed among the computers of a network.
This paper focuses on one of these technologies, the distributed databases. We define a distributed database as a collection of multiple, logically interrelated databases distributed over a computer network. Therefore, a Distributed database system is based on the union of a database system and computer network technologies. [ 1]
What is a Distributed Database System? A Distributed Database System (DDBS)is simply a collection of multiple, and interrelated databases that are distributed over a computer network. It is a relational schema that facilitates database decentralization for the management and manipulation of data in a database with the aid of same or common language
Hadoop Distributed File System (HDFS) 47 48 Distributed Directory Management A distributed DBMS must include a directory that keeps track of where the database tables, the replicated copies of database tables (if any), and the table partitions (if any) are located. When a query is presented at any site on the network, the distributed DBMS can
Database Applications and SQL 12 The DBMS 15 The Database 16 Personal Versus Enterprise-Class Database Systems 18 What Is Microsoft Access? 18 What Is an Enterprise-Class Database System? 19 Database Design 21 Database Design from Existing Data 21 Database Design for New Systems Development 23 Database Redesign 23
additif a en fait des effets secondaires nocifs pour notre santé. De plus, ce n’est pas parce qu’un additif est d’origine naturelle qu’il est forcément sans danger. Car si l’on prend l’exemple d’un champignon ou d’une plante toxique pour l’homme, bien qu’ils soient naturels, ils ne sont pas sans effets secondaires.