Query Processing In A System For Distributed Databases (SDD-1)

3y ago
41 Views
2 Downloads
1.41 MB
24 Pages
Last View : 3d ago
Last Download : 3m ago
Upload by : Nora Drum
Transcription

Query Processing in a Systemfor Distributed Databases (SDD-1)PHILIP A. BERNSTEIN and NATHAN GOODMANHarvard UniversityEUGENE WONGUniversity of California at BerkeleyCHRISTOPHER L. REEVEMassachusetts Institute of TechnologyandJAMES B. ROTHNIE, Jr.Computer Corporation of AmericaThii paper describes the techniques used to optimize relational queries in the SDD-1 distributeddatabase system. Queries are submitted to SDD-1 in a high-level procedural language called Datalanguage. Optimization begins by translating each Datalanguage query into a relational calculus formcalled an envelope, which is essentially an aggregate-free QUEL query. This paper is primarilyconcerned with the optimization of envelopes.Envelopes are processed in two phases. The first phase executes relational operations at varioussites of the distributed database in order to delimit a subset of the database that contains all datarelevant to the envelope. This subset is called a reduction of the database. The second phase transmitsthe reduction to one designated site, and the query is executed locally at that site.The critical optimization problem is to perform the reduction phase efficiently. Success depends ondesigning a good repertoire of operators to use during this phase, and an effective algorithm fordeciding which of these operators to use in processing a given envelope against a given database. Theprincipal reduction operator that we employ is called a sem@oin. In this paper we define the semijoinoperator, explain why semijoin is an effective reduction operator, and present an algorithm thatconstructs a cost-effective program of semijoins, given an envelope and a database.Key Words and Phrases: distributedmization, semijoinsCR Categories: 3.70,4.33databases, relationaldatabases, query processing, query opti-Permission to copy without fee all or part of this material is granted provided that the copies are notmade or distributed for direct commercial advantage, the ACM copyright notice and the title of thepublication and its date appear, and notice is given that copying is by permission of the Associationfor Computing Machinery.To copy otherwise, or to republish, requires a fee and/or specificpermission.This research was supported by the Advanced Research Projects Agency of the Department ofDefense and was monitored by the Naval Electronic System Command under Contract NO6039-77-C0074, ARPA Order 3175-6.Authors’ addresses: P.A. Bernstein and N. Goodman, Center for Research in Computing Technology,Aiken Computation Laboratory, Harvard University, Cambridge, MA 02138; E. Wong, Departmentof Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA94720; C.L. Reeve, Laboratory for Computer Science, Massachusetts Institute of Technology, 545Technology Square, Cambridge, MA 02139; J.B. Rothnie, Jr., Computer Corporation of America, 675Massachusetts Avenue, Cambridge, MA 02139.0 1981 ACM 0362-5915/81/1200-osoz6602 00.75ACM Transactions on Database Systems, Vol. 6, No. 4, December 1961, Pages 602-625

Query Processingin a System for DistributedDatabases*6031. INTRODUCTIONSDD-1 is a distributed database system developed by the Computer Corporationof America [23]. SDD-1 permits a relational database to be distributed amongthe sites of a computer network, yet accessedas if it were stored at a single site.Users interact with SDD-1 by submitting queries coded in a high-level procedurallanguage called Datalanguage [20]. Figures 1 and 2 illustrate an SDD-1 databaseand a Datalanguage query. This paper is concerned with efficient execution ofsuch queries. Other aspects of SDD-1 are discussed in [5, 6, 14, 231.Our objective is to process queries with a minimum quantity of intersite datatransfer. That is, we assume network bandwidth to be the system bottleneck andseek to minimize use of this resource; all other resources are assumed to be free.’This assumption is appropriate in SDD-1 because the network is the slowestsystem component by two orders of magnitude.2 This assumption has beenDatabase MANYCAyw,1,1,3,4,4,p#,1,2,3,1,5,P(P#, name, icrominimainhugemicroS describes suppliers.P describes parts.Y tells which suppliers supply which parts and in what quantity.Assume that S, Y, and P are sorted at sites 1,2, and 3, respectively.Fig. 1. Example database.Description of queyList the supplier name, part name, and quantity supplied for all parts supplied by aMassachusetts supplier. Also, print how many of these are minis.Query QBeginCount : 0;For SIf S.location “MA” then for YIf S.s# Y.s# then for PIf Y.p# P.p# then beginPrint S.name, P.name, Y.qty;If P.type “mini” then Count : Count 1; end;;;Print “Number of minis is”, Count ;End.Fig. 2. Example datalanguage query. Datalanguagein the appendix.used in this example is defined’ In practice, database processing within sites is considered as a secondary objective. For expositoryclarity, we shall not treat this issue.’ Sites in SDD-1 are mainframe computers (PDP-lOs), while the network is a packet-switched longdistance network (Arpanet). Sustainable bandwidth on the network is at most 10 kbits per second(see [24-261).ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.

604P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.lEnvelopeRetrieveRetrieveRetrieveE(S.s#, S.name, Nocation)(Y.s#, Y.p#, Y.qty)(P.p#, P.name, P.type)qualification:Slocation “MA”where qualificationwhere qualificationwhere qualificationA S.s# Y.s# A Y.p# P.p#Fig. 3. Envelope for query of Figure 2. Envelopes aredefied in Section 2. Intuitively,an envelope specifies asubset of each relation in the database. We express envelopes in a relational calculus similar to QUEL [15].The result of envelope E is to retrieveexample,any superset of the data specified by E. 4,5,The specific superset retrievedis determinedThe retrieved relations are also transmittedFig. 4.by s.to a single site, for example, site 3.Processing envelope of Figure 3.adopted by other researchers [7,8,13, 16,17, 33,341, although naturally it is notappropriate in every system [lo, 19, 271. Section 5 discusses the impact of thisassumption on our approach.Our algorithm has three main steps. Step 1 maps a Datalanguage query Q intoa relational calculus form (an envelope) that specifies a superset of the databaseneeded to answer Q (see Figure 3). Step 1 depends on details of Datalanguageand is of general interest only insofar as Datalanguage resembles other proceduralquery languages. This step is described in [ll].Step 2 evaluates the envelope. This step retrieves a superset of the databasespecified by the envelope, assembling the result at a single site S, (see Figure 4).(The specific superset retrieved and the “assembly site” S, are determined byefficiency considerations.) Step 2 is accomplished by translating the envelope intoa program P containing relational operations (a reducer), followed by commandsto move the results of P to S, (see Figure 5). The goal is to construct a reducer Pand select a site S, such that the cost of computing P and moving the results toS, is minimum over aLl reducers and sites. This optimization problem constitutesthe core of the SDD-1 query processing algorithm and is the focus of this paper.Step 3 executes Q at S, using the data assembled by Step 2. Since Step 3 onlyinvolves local query processing, it will not be discussed further. Steps l-3 areoutlined in Figure 6.The paper has five sections. Section 2 defines envelopes and the operationsused to process envelopes. Section 3 presents techniques for estimating the costand effect of a reducer composed of these operations. Section 4 presents aheuristic algorithm that compiles envelopes into efficient (though not necessarilyoptimal) reducers. Section 5 discusses related work and suggests directions forACM Transactionson Database Systems, Vol. 6, No. 4, December 1981

Query Processing in a System for DistributedProgram P1. S : S[location “MA”]2. Y : Y(s# s#]SDatabasesl605; restrict S to MA suppliers; this operation is semijoin-it computes the set of Ytuples that corresponds toMA suppliers.endFigure 4 shows the result of applying P to the database of Figure 1.Fig. 5.pT-- Program for envelope of Figure 3.Q (Datalanguage1query)0 (Distributeddatabase)1E (Relationaland select assemblvenvelope)Distributed executionof reducersiteS, (Assembly\site)-//t-lStep 3Fig. 6.D ’ (Superset of databaseneeded to compute Q(D))I4Main steps of query processing algorithm.ACM Transactionson Database Systems, Vol. 6, No. 4, December1981.

606P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.l(a) Relational Data ObjectsDefinitionTermDomainAttributeRelation schemaRelationTupleDatabaseattr(R)A set of valuesAn alternate name for a domainA description of a relation, consisting of a relation name andlist of attributesA subset of the Cartesian product of the domains of theattributes of the corresponding relation schemaAn element (or row) of a relationA set of relationsAttributes of relation R(b) Relational Algebraic OperationsRestriction:R[A k] (r E R ] r.A k}where r.A is the value of the A-domain in tuple rR[A B] (r E R ] r.A r.B)AlsoProjection:R[AI, AZ, . . . , An] ((r.Ar, r.Az,. . . , r.A,) ] r E R)Join:R[A B]S ((r,s)]rER,sES,andr.A s.B)Semijoin:R(A B]S (RCA BIS) [attr(R)] {r]rERAr.AEs[B])Fig. 7. Relational terminology.future research. We assume reader familiaritylevel of [9]. A review of relational terminologywith relational databases at theappears in Figure 7.2. QUERY PROCESSING STRATEGY2.1 EnvelopesThe attributes of relation R are denoted attr(R). Relation RI is a subrelation ofrelation Ri, if attr(RI) Cattr(Ri) and Rf c Ri[attr(Ri)].Let D {RI, . . . , R,}andD’ {R;,. , Rh} be databases. D’ is a subdcztabase of D, denoted D’ 5 D,if Rf is a subrelation of Ri for i 1, . . . , n. An envelope is a relational calculusexpression that maps a database into a subdatabase. We express envelopes in alanguage similar to QUEL [15].An envelope E consists of a qualificationq and target lists tl, . . . , t,. Theterm q is a Boolean formula with clauses of the form Ri. A Rj .B or Ri. A k.3The terms Ri . A and Rj . B are called indexed variables. Each ti is a set of variablesindexed by Ri: that is, ti is of the form {Ri. Ail, . . . , Ri. Ail}. Envelope E mapsdatabase D into subdatabase D’ defined by the following collection of QUELqueries.Retrieve into R; (tr ) where q.Retrieve into Rb(t,) where q.We limit the form of envelopes in two additionalways. One, qualificationsare3 Note that we avoid tuple variables. These can be accommodated by (conceptually) duplicating arelation, thereby having two relation names ranging over the same relation.ACM Transactionson Database Systems, Vol. 6, No. 4, December1981.

Query Processing in a System for DistributedDatabases-607Join graph for envelope of figure 1.0P.typeassumed to be pure conjunctions; disjunction is handled by placing the qualification in disjunctive normal form and treating each conjunct separately. Two, ifRi. A is a term of q, then 6 must contain Ri. A.E is an envelope for Datalanguage query Q if for all databases D, Q(E(D)) Q(D). I n t u iti ve 1y, an envelope for Q “envelopes” or delimits the portions of thedatabase needed to answer Q. In general, there are many envelopes for a given Q;a good envelope is one that tightly delimits the data needed by Q. Finding goodenvelopes is an optimization problem that depends on details of Datalanguage,and our approach to this problem is described in [ll].A graph representation of qualifications (a join graph) is useful. The nodes ofa join graph represent indexed variables and constants, and the edges representclauses. A join graph contains the edge {Ri. A, Rj .B} (respectively, {Ri. A, k} ) iffthe qualification contains Ri .A Rj .B (respectively, Ri. A k) (see Figure 8).The connected components of a join graph characterize the clauses implied bythe qualification: Let N and N’ be nodes of the join graph for q; q implies N N’iff N and N’ are in the same connected component. (Proofs appear in [2,3]. Notethat if N and N’ represent distinct constants, q is unsatisfiable.)2.2 ReducersA reduction of database D with respect to E is any D’ such that E(D) 5 D’ 5 D.A reducer for E is a sequential program4 P of relational operations such that for4 A reducer is executed as a parallel program, however (see [23]).ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.

608*P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.GivenS(s#,name,Acme,Best,1,2,location)MAMAl Y(s# s#]S Y( s#,1,1,l P(p# p#]Y P( p#,1,2,l S(s# s#]Y S(s#,1,*All of these ,CRI,8080,microminimainhugemicro2,3,4,5,qty) (Y tuples that correspond to a MA supplier)2050name,LSI,Pll,Pdtype) {parts that are supplied by some MAsupplier}microminilocation)MA {MA suppliersthing)are profitable.However,Fig. 9.Semijoin.Y(p#who p#]Psupplywouldanynot beall databases D, P(D) is a reduction of D with respect to E. Given E and D, ouroptimization task is to construct a reducer P and select a site S, such that thecost of computing P(D) and moving the results to S, is minimum over all reducersand sites. A reduction operation for E is an operation that is permitted in areducer for E. A reduction operation reduces the size of D by eliminating datanot specified by E(D). The benefit of a reduction operation is the amount of datait eliminates; the cost is the amount of intersite data transfer required to computethe operation.Restrictions and projections have zero cost and nonnegative benefit, and soevery restriction and projection permitted by E should be included in everyreducer for E. The projections permitted by E are Ri[ti], for i 1, . . . , n. Therestrictions permitted by E can be determined from its join graph: E permitsRi[A B] (respectively, Ri[A k]) iff q implies Ri.A Ri.B (respectively,Ri. A k) iff Ri. A and Ri. B (respectively, k) are connected in the join graph.To reduce the database further, data from two or more relations must becombined. The obvious operation for this purpose isjoin. However, our algorithmuses an operation called semijoin, which we deem to be superior. A semijoin is“half of a join”; the semijoin of relation Ri by relation Rj on clause Ri. A Rj. B,denoted Ri(A B]Rj, equals the join of Ri and Rj on that clause projected backonto attr(Ri) (see Figure 9). (Notice that semijoin, unlike join, is asymmetric;that is, Ri( A B]Rj # Rj( B AIRi. The former reduces Ri, while the latterreduces Rj .) As with restrictions, the semijoins permitted by E can be determinedfrom its join graph: E permits Ri( A B]Rj and Rj (B A]Ri iff Ri. A and Rj . Bare connected in the join graph.We prefer semijoins to joins for three reasons. First, Ri( A B]Rj c Ri, and sosemijoins monotonically reduce the size of the database. By contrast, joins canincrease the size of the database; in the worst case, 1Ri[A B]Rj 1 ) Ri 1 * ] Rj I.ACM Transactionson Database Systems, Vol. 6, No. 4, December1981.

Query Processinglll-609Let D be the database%(A,B)011 0site 1lin a System for Distributed DatabasesMB,C)110 0site 2MA,C)1 10 0site 3MC,D, E,00000001 11F,G,1 1site 4H,1)11Let E be the envelopeq: R,.A &.A A R1.B R2.B A R2.C R3.C A R3.C R.Cfor i 1, . . . ,4.t, attr(RJUsing semijoins, the optimal evaluation of E(D) is to move RI, Rz, and Ra to site4-that is, no semijoins should be used. This requires the transmission of 12 dataitems.Using joins, the optimal evaluation isRIP : R B B]RI at site 2-cost 4RI : RnJA A A C C] at site 2-costNote that RIZS ( )% : Rs[C C]Rn3 at site 4-cost 0.Total cost 8. 4Fig. 10. Bad case for semijoins. This example is adapted from [2].Second, semijoins can be computed with less intersite data transfer than joins.To compute Ri(A B]Rj, we need only transmit a projection of a relation (viz.,Rj[B]), whereas to compute Ri[A B]Rj we must transmit an entire relation. Ofcourse, the semijoin may also have less effect than the join, since Ri(A B]Rjonly reduces Ri, whereas Ri[A B]Rj simultaneously reduces Ri and Rj. However,the third advantage of semijoins is that the “reductive effect” of any single joincan be attained by two semijoins, usually at lower cost, as follows.Let Rij RJA B]Rj. The reductive effect of this join is its effect on Ri andRj, namely, Ru[attr(Ri)]and Rij[attr(Rj)].By definition of projection,Rd[attr(Ri)] {ri ] 3 (ri, rj) E Rij} {ri E Ri] (3 rj E Rj)(ri.A Ri(A B]Rj, rj.B)},by definitionby definitionof joinof semijoin.Similarly, Rij[attr(Rj)] Rj( B AIR. Thus the reductive effect of Ri[A B]Rjis attained by two semijoins as claimed.Now let us compare the cost of the join versus the two semijoins. To computeRi[A B]Rj, one of the relations, Rj say, must be transmitted to the other’s site.This has cost 1Rj I* width(Rj), where width(Rj) equals the number of bits in eachtuple of Rj. TO compute the semijoins, we transmit Rj [B] to Ri and Ri[A] to Rj,for a cost of I Rj [B] I *width(B) I Ri[A] I *width(A). But Ri[A] c Rj[B] afterRi(A B]Rj is executed, and so if we execute the semijoins in sequence, the costis less than or equal to I Rj[B] ) * (width(A) width(B)). This quantity is lessthan or equal to I Rj \ * width(Rj) under the reasonable assumption that width(A) width(B) I width(Rj). Given this assumption, the cost of the semijoins is lessthan or equal the cost of the join, as claimed.Our arguments in support of semijoins are heuristic and there are cases inwhich joins outperform semijoins. Figure 10 illustrates such a case. An optimalquery processing algorithm would almost certainly include both joins and semiACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.

610.Let D P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, n1,(a) Attributty domains dom(S.s#) dom(Y.s#) {idA’s from 1 to 10k)dom(S.name) dom(P.name) (names of length 10)dom(S.location) {states of U.S.}v (provinces of Canada)etc.(b) Auxiliary domains X1 (strings of length 10)XP (integers from 1 to IOk}(c) Subset hierarchiesdom(S.name) dom(P.name) dom(S.location) dom(P.type)dom(S.s#) dom(Y.s#) dom(Y.p# dom(P.p#)dom(Y.qty)Fig. 11. Domains.joins. The graceful integration of these tactics is an open problem, however, andour algorithm only uses semijoins.3. COSTAND BENEFITESTIMATIONTo compile an envelope into an efficient reducer, we need to estimate the costand benefit of reduction operations. This section presents an estimation procedurebased on a statistical model of the database. We only consider the estimationproblem for semijoins; estimation techniques for restrictions and projections aredescribed by [28].Our statistical model is an approximationof a set theoretic model of thedatabase and is described in Section 3.1. Section 3.2 presents a technique forestimating the effect of set operations. Section 3.3 extends this technique toestimate the effect of a sequence of semijoins.3.1 DatabaseModelLet D {RI, . . . . R,} be a database. Associated with each attribute of eachrelation, for example, Ri. A, is a finite domain of values, dom(Ri. A). Ri[A] isconstrained to be a subset of dom(Ri. A) (see Figure lla). The model also containsauxiliary domains (see Figure llb). The set of all domains is partitioned intodomain hierarchies, each of which contains a maximum domain X,,, and alldomains X such that X CX,,, (see Fig

SDD-1 is a distributed database system developed by the Computer Corporation of America [23]. SDD-1 permits a relational database to be distributed among the sites of a computer network, yet accessed as if it were stored at a single site. Users interact with SDD-1 by submitting queries coded in a high-level procedural

Related Documents:

Why should you Query? Centers for Medicare and Medicaid Services supports the use of query forms as a supplement to the health care record. “Use of the physician query form is permissible to the extent it provides clarification and is consistent with other medical record documentation.” 3 File Size: 254KBPage Count: 26Explore furtherPhysician Query Examples Journal Of AHIMAjournal.ahima.org2019 update: Guidelines for achieving a compliant query .acdis.orgGuidelines for Achieving a Compliant Query Practice (2019 .bok.ahima.orgThe Physician Query Process Compliance Issuesassets.hcca-info.orgThe Physician Query: What Every Coder Wants You To Knowcapturebilling.comRecommended to you b

Tags:css media query, css media query examples, css media query for ipad, css media query for mobile, css media query value defined in the query. max-width Rules applied for any browser width below the value defined in the query. min-height Rules applied for any browser height over the value defined in the query. max-height Rules applied for any

This system calls, Advanced SQL Query To Flink Translator This proposed system receives Ad-vanced SQL Query from the user then generate Flink Code for exe-cuting this Query. Finally, it returns the results of Query to the user. General Terms: SQL Query, Apache Flink Keywords Big data, Flink, SQL Translator, Hadoop, Hive, Advanced SQL Query 1 .

PeopleSoft Query Welcome to PeopleSoft Query! This versatile tool is simple to use and will allow Query Developers to create Queries in an effective and efficient manner. Introduction to PeopleSoft Query eopleSoft Query or PS Query is an end

Records for holidays, weather, eyeballs. Forecast is done one week ahead. Measure SMAPE: Query Previous Model Described Neural Network Query #1 10.60 13.05 Query #2 23.23 22.60 Query #3 48.57 18.23 Query #4 47.41 26.35 Query #5 19.40 16.87 Query #6 19.25 22.65 .

Introduction to PeopleSoft Query PeopleSoft Query or PS Query is an end-user reporting tool that allows Query Developers to extract information in the form of a query from the relational database, without the need to write SQL (Structur

2 excel power query tutorial ( Image Search ) 2 power query training ( Image Search ) 2 vba excel power query power pivots ( Image Search ) 1 excel and power query report ( Image Search ) www.accessanalytic.com.au 2 excel power query power pivot ( Image Search ) 2 power query excel 2010 ( Image Search ) 2 vb

Advanced Query for Secondary Schools Conference 201 5 Advanced Query for Secondary Schools Section 270 - Page 1 Session Description: Multi-table queries, the "CHANGE" button and command; advanced applications of Query for secondary school personnel. 1. Query Principles and a Review of Basics Query Tips