Similarity Join Implementation Approach The Similarity .

2y ago
25 Views
3 Downloads
1.34 MB
12 Pages
Last View : 3d ago
Last Download : 3m ago
Upload by : Allyson Cromer
Transcription

The Similarity Join Database Operator*Yasin N. Silva1, Walid G. Aref 1, Mohamed H. Ali21Department of Computer Science, Purdue University, Indiana, USA{ysilva,aref}@cs.purdue.edu2Microsoft Corporation, Washington, USAmali@microsoft.comAbstract— Similarity joins have been studied as key operations inmultiple application domains, e.g., record linkage, data cleaning,multimedia and video applications, and phenomena detection onsensor networks. Multiple similarity join algorithms andimplementation techniques have been proposed. They rangefrom out-of-database approaches for only in-memory andexternal memory data to techniques that make use of standarddatabase operators to answer similarity joins. Unfortunately,there has not been much study on the role and implementation ofsimilarity joins as database physical operators. In this paper, wefocus on the study of similarity joins as first-class databaseoperators. We present the definition of several similarity joinoperators and study the way they interact among themselves,with other standard database operators, and with otherpreviously proposed similarity-aware operators. In particular,we present multiple transformation rules that enable similarityquery optimization through the generation of equivalentsimilarity query execution plans. We then describe an efficientimplementation of two similarity join operators, Ɛ-Join and JoinAround, as core DBMS operators. The performance evaluationof the implemented operators in PostgreSQL shows that theyhave good execution time and scalability properties. Theexecution time of Join-Around is less than 5% of the one of theequivalent query that uses only regular operators while Ɛ-Join’sexecution time is 20 to 90% of the one of its equivalent regularoperators based query for the useful case of small Ɛ (0.01% to10% of the domain range). We also show experimentally that theproposed transformation rules can generate plans with executiontimes that are only 10% to 70% of the ones of the initial queryplans.Similarity Join Implementation ApproachIntegrated inUsing BasicAs StoredOutside of DBDB EngineSQL OperatorsProceduresCertain types maySupportedbe unfeasible orAllAllAllJoin typesrequire verycomplex queriesRequiresRequiresCan reuseQueries use aspecializedspecializedImplementation and extendcomplex mix ofstructures,structures,complexityDB operatorsjoins andmechanisms to spillingand structures aggregationsdeal with large mechanisms,data sets, etc. etc.ComposableYes (fullYes (resultingwith other DBpipelining ofqueries can beNoNooperatorsresults)highly complex)TakeYes (trans. rules,advantage ofpre-aggregation,No directlyNoNoDB optimizerMVs, etc.)Fig. 1 Comparison of similarity join implementation approachesmarketing analysis, multimedia and video applications, etc.Multiple SJ algorithms and implementation techniques havebeen proposed. They range from out-of-database approachesfor only in-memory or external memory data, to techniquesthat use standard database operators to answer SJs. However,there has not been much study on the role and implementationof similarity joins as database operators. Fig. 1 comparesseveral approaches to implement Similarity Joins. Theimplementation of SJ as integrated database operators has thefollowing key advantages: (i) SJ database operators can beinterleaved with other regular and similarity-aware operatorsand their results pipelined for further processing; (ii)important optimization techniques, e.g., pushing certainfiltering operators to lower levels of the execution plan, preI. INTRODUCTIONaggregation, and the use of materialized views can beThe shift from systems that focus on exact semantics of extended to the new operators; and (iii) the implementation ofdata and queries to systems that focus on approximate and these operators can reuse and extend other operators andimprecise semantics is recognized as one of the main current structures to handle large datasets, and use the cost-basedparadigm transitions in data management systems. Different query optimizer machinery to enhance query execution time.This paper focuses on the study of similarity joins as firstareas have made important contributions to this paradigm shift,among them: similarity-aware query processing in database class database operators. Its main contributions are: We study the similarity join as a first-class databasesystems, integration of information retrieval and databaseoperator, its interaction with other non-similarity andoperations, and uncertain or probabilistic databases. The studysimilarity-based operators, and its implementation asof the similarity-aware counterparts of common databaseintegrated component of the DBMS query processing andoperations, i.e., selection, join, and grouping is a central goaloptimization engine.of the work on similarity query processing. Similarity joins(SJ) are operations that combine two sets of data using We present the different types of similarity joins,similarity join predicates that match tuples with similar orintroduce a new useful similarity join type, the Joinapproximate values. Similarity joins have been studied as keyAround, and propose SQL syntax to express similaritycomponents to solve multiple problems, e.g., record linkage,join predicates.data cleaning, phenomena detection on sensor networks, We analyze multiple transformation rules for the SJoperators. These rules enable query optimization ��*the generation of equivalent query execution plans. WeThis work was partially supported by NSF Grant IIS-0811954.

study: (i) multiple core equivalence rules for SJ operators;(ii) the main theorem of Eager and Lazy aggregation forqueries with similarity join and similarity group-by; (iii)the scenarios in which similarity predicates can be pushedfrom similarity join to similarity group-by; and (iv)equivalence rules between different SJ operators andbetween SJ and the similarity group-by operator. We describe an efficient implementation of two SJoperators, the Epsilon-Join and Join-Around, as coreDBMS operators. We consider the case of multiple SJpredicates and one-dimensional (1D) attributes. We evaluate the performance and scalability properties ofour implementation of the Epsilon-Join and Join-Aroundoperators in PostgreSQL. The execution time of JoinAround is less than 5% of the one of the equivalent querythat uses only regular operators while Ɛ-Join’s executiontime is 20 to 90% of the one of its equivalent regularoperators based query for the useful case of small Ɛ(0.01% to 10% of the domain range). We also evaluate experimentally the effectiveness of theproposed transformation rules and show they can generateplans with execution times that are only 10% to 70% ofthe ones of the initial query plans.The rest of this paper is organized as follows. Section IIdiscusses the related work. Section III presents the differenttypes of SJ and the proposed syntax to specify their similaritypredicates. Section IV studies the equivalence rules among SJand other regular and similarity-aware operators. Section Vpresents implementation guidelines based on a prototyperealization of two SJ operators within PostgreSQL. Section VIreports the performance evaluation of the implementedoperators and Section VII presents the conclusions anddirections for future research.blocks to minimize I/O. The Generic External Space Sweep(GESS) algorithm [11] creates hypersquares centered on eachdata point with epsilon length sides, and joins thesehypersquares using a spatial join on rectangles. The Quickjoinalgorithm [12] recursively partitions the data until the subsetsare small enough to be efficiently processed using a nestedloop join. The algorithm makes recursive calls to process eachpartition and a separate recursive call to process the “windows”around the partition boundary. Quickjoin has been shown toperform better than EGO and GESS [12].Also, of importance is the work on similarity jointechniques that make use of relational database technology[17], [18], [19]. These techniques are applicable only to stringor set-based data. The general approach pre-processes the dataand query, e.g., decomposes data and query strings into sets ofq-grams, and stores the results of this stage on separaterelational tables. Then, the result of the similarity join can beobtained using standard aggregate/group-by/join SQLstatements. Indices on the pre-processed data are used toimprove performance. A key difference of this work with ourcontributions in this paper is that we focus on studying theproperties, optimization techniques, e.g., pre-aggregation andquery transformation rules, and implementation techniques ofseveral types of similarity joins as database operatorsthemselves rather than studying the way a SJ can be answeredusing standard operators. In fact, several of the discussedproperties for epsilon-join in this paper are also applicable tothe operators proposed in [17] and [18]. Moreover, theimplementation section of our work focuses on SJ onnumerical data rather than string data.A related type of join is the band join introduced in [32].The join predicate of this join type has the form S.s-Ɛ1 R.r S.s Ɛ2. A key difference of our work with the work on bandjoins is that band joins represent only a special case of one ofII. RELATED WORKthe four types of joins considered in our study. Specifically, aSeveral types of similarity join, and corresponding band join where Ɛ1 Ɛ2 is a special case of Ɛ-Join for the caseimplementation strategies, have been proposed in the literature, of 1D data. We propose transformation rules and propertiese.g., range distance join (retrieves all pairs whose distances for similarity joins that apply in general to multi-dimensionalare smaller than a pre-defined threshold) [1], [2], [3], [8], [9], data. Moreover, a key goal of our implementation is to take[10], k-Distance join (retrieves the k most-similar pairs) [4], advantage of the mechanisms and data structures alreadyand kNN-join (retrieves, for each tuple in one table, the k available in most DBMS’ engines to facilitate the integrationnearest-neighbors in the other table) [5], [6], [7]. The range of similarity joins into real world DBMSs. Thedistance join, also known as the Ɛ-Join, has been the most implementation of band joins in [32] makes use of specializedstudied type of similarity join. Among its most relevant sampling, partitioning, and page replacement mechanisms.Some recent work in the area of similarity joins has focusedimplementation techniques, we find approaches that rely onthe use of pre-built indices, e. g., eD-index [8] and D-index on: proposing a compact way to represent the output of an[9]. These techniques strive to partition the data while epsilon join [11], i.e., reporting groups of nearby pointsclustering together similar objects. However, this approach instead of every join link; efficient algorithms for in-memorymay require rebuilding the index to support queries with similarity join with edit distance constraints [14]; algorithmsdifferent similarity parameter values, i.e., epsilon. for near duplicate detection that exploit the ordering of tokensFurthermore, eD-index and D-index are directly applicable in a record to reduce the number of required distanceonly to the case of self-joins. Several non-index-based computations [15]; and similarity join algorithms that exploittechniques have also been proposed to implement the Ɛ-Join. sorting and searching capabilities of GPUs [16].The extension of other standard operations to theirEGO [10], GESS [11], and QuickJoin [12] are three of themost relevant non-index-based algorithms. The Epsilon Grid similarity-based counterparts, e. g., similarity selection [20],Order (EGO) algorithm [10] imposes an epsilon-sized grid [21], [22], [23], and similarity grouping [24], has been studiedover the space and uses an efficient schedule of reads of previously. Among the important recent contributions in this

area are: the study of fast indices and algorithms for setsimilarity selection using semantic properties that allowpruning large percentages of the search space [20], aquantitative cost-based approach to build high-quality gramsto support selection queries on strings [21], a method thatfinds all data objects that match with a given query object in alow-dimensional subspace instead of the original full space[22], and flexible dimensionality reduction techniques tosupport similarity search using the Earth Mover’s Distance[23]. Of special interest is the work on Similarity Group-by(SGB) presented in [24]. SGB is an extension of the group-bydatabase operator that supports the formation of groups ofsimilar objects. Three SGB instances are introduced, i.e.,group-around, unsupervised group-by, and group-by withdelimiters; and are shown to have good execution time andscalability properties with at most only 25% increase inexecution time over the regular group-by [24]. We study theinteraction and equivalences between SJ and SGB.Furthermore, we discuss scenarios in which the similaritypredicate of SJ can be pushed partially or totally to SGB.The work in [25] proposes an algebra for similarity-basedqueries. This work presents the extension of simple algebrarules, e.g., pushing selection into join, to the case of similarityoperators. The work in [26] proposes an extension to therelational algebra to support similarity queries with severalsimilarity predicates combined using the Boolean operatorsand, or, and not. However, [26] does not consider similarityjoins or queries that combine non-similarity and similaritypredicates. [27] proposes an extended SQL syntax to expressqueries that use both non-similarity and similarity predicates.The work in [28] presents a cost model to estimate the numberof I/O accesses and distance calculations to answer similarityqueries over data indexed using metric access methods. Both[27] and [28] only consider range distance and knn-joins. Aframework for similarity query optimization is presented in[29]. This work makes use of simple equivalence rules togenerate multiple alternative query plans. The main differencebetween [25], [26], [27] and our work is that we focus onanalyzing in detail the properties and equivalence rules thatinvolve the different kinds of similarity join. Our studyconsiders four types of SJ, the equivalences among them andwith the similarity group-by operator. Furthermore, we studyextensions of the important Lazy and Eager aggregationtransformations to the case of similarity join queries.Some of the optimization techniques of SJ presented in thispaper build on previous work on optimization of regular nonsimilarity queries. Larson et al. study pull-up and push-downtechniques that allow the query optimizer to move aggregationoperators up and down the query plan [30], [31]. Thesetechniques enable complete [30] or partial [31] preaggregation that can reduce significantly the input size of ajoin and decrease the execution time of an aggregation query.III. SIMILARITY JOIN OPERATORSThe generic definition of the Similarity Join (SJ) operator isas follows:𝐴 𝜃 𝑆 𝐵 𝑎, 𝑏 𝜃𝑆 𝑎, 𝑏 , 𝑎 𝐴, 𝑏 𝐵}ε-Join:SELECT FROM A, BWHERE A.a WITHIN ε OF B.bAround-Join: SELECT FROM A, BWHERE A.a AROUND B.b [MAX DIAMETER 2r]kNN-Join:SELECT . FROM A, BWHERE B.b k NEAREST NEIGHBOR OF A.akD-Join:SELECT . FROM A, BWHERE A.a k TOP CLOSEST PAIRS B.bFig. 2 Extended SQL syntax for similarity join predicatesABABABk 2ABrk 2εε-JoinkNN-JoinkD-JoinJoin-AroundFig. 3 Types of Similarity Joinwhere θs represents the similarity join predicate. Thispredicate specifies the similarity-based conditions that thepairs a,b need to satisfy to be in the similarity join output.The similarity join predicates for the similarity join operatorsconsidered in our study are as follows. Range Distance Join (Ɛ-Join):𝜃𝜀 𝑑𝑖𝑠𝑡 𝑎, 𝑏 𝜀 kNN-Join:𝜃𝑘𝑁𝑁 𝑏 𝑖𝑠 𝑎 𝑘 𝑐𝑙𝑜𝑠𝑒𝑠𝑡 𝑛𝑒𝑖𝑔 𝑏𝑜𝑟 𝑜𝑓 𝑎 k-Distance-Join (kD-Join):𝜃𝑘𝐷 𝑎, 𝑏 𝑖𝑠 𝑜𝑛𝑒 𝑜𝑓 𝑡 𝑒 𝑜𝑣𝑒𝑟𝑎𝑙𝑙 𝑘 𝑐𝑙𝑜𝑠𝑒𝑠𝑡 𝑝𝑎𝑖𝑟𝑠 Join-Around (A-Join):𝜃𝐴,𝑀𝐷 2𝑟 𝑏 𝑖𝑠 𝑡 𝑒 𝑐𝑙𝑜𝑠𝑒𝑠𝑡 𝑛𝑒𝑖𝑔 𝑏𝑜𝑟 𝑜𝑓 𝑎 𝑎𝑛𝑑𝑑𝑖𝑠𝑡 𝑎, 𝑏 𝑟The range distance, kNN, and k-Distance join operators arecommon and extensively used types of similarity join. TheJoin-Around is a new useful type of similarity join thatcombines some properties of both the range distance and kNNjoins. Every value of the first joined set is assigned to itsclosest value in the second set. Additionally, only the pairsseparated by a distance of at most r are part of the join output.MD stands for Maximum Diameter and r MD/2 representsthe Maximum Radius. As presented in Section IV, the JoinAround operator with MD is equivalent to the kNN-Joinfor k 1. Some queries that show the usefulness of this newtype of similarity join are presented later in this section.Fig. 2 shows an extension of SQL syntax to express thedifferent types of similarity join predicates. Fig. 3 showsexamples of the four types of similarity join operators whenthey are applied to two numerical datasets.Similarity joins are core operations in multiple applicationdomains, e.g., data cleaning, pattern recognition,bioinformatics, multimedia, phenomena detection on sensornetworks, marketing analysis, etc. Many of these scenarios,e.g., pattern recognition and bioinformatics, inherently needthe support of similarity joins on multidimensional data.However, there are also many application scenarios, e.g.,marketing analysis and phenomena detection on sensor

networks, that can greatly benefit from the use of similarityjoins on one dimensional data. Fig. 4 gives four similarityqueries that use similarity joins to answer business-orientedquestions in a decision support system. The presentedsimilarity queries are extensions of several non-similaritybased TPC-H queries [33]. The similarity queries in Fig. 4illustrate that the use of similarity joins allows answeringmore complex and interesting business questions.IV. OPTIMIZING SIMILARITY JOINSThis section presents the study of similarity join propertiesand techniques that enable the optimization of similarity joinqueries through the generation of alternative execution plans.This section introduces: (i) core equivalence rules that exploitspecific properties of SJs, (ii) equivalence rules betweenmultiple SJ operators and between SJ and similarity group-by(SGB) operators, and (iii) the study of Eager and Lazytransformation techniques that exploit pre-aggregation usinggroup-by and similarity group-by to significantly reduce theamount of data to be processed by SJs.A. Core Equivalence RulesThis section presents multiple equivalence rules thatinvolve the different SJ operators. This section not onlyconsiders the extension of common equivalence rules to thecase of similarity joins, but particularly also studies scenariosthat exploit certain specific properties of SJs to enable moreeffective query transformations. The rules in this section andin section IV.B use the notation presented in Fig. 5. Theexamples assume the following relations’ content:E1 E2 E3 {1,2,.,100}, and E4 {21,22,.,25}.1) Basic Distribution of Selection over SJ: The regularselection operation distributes over the similarity joinoperations according to the following rules.When all the attributes of the selection predicate θ involveonly the attributes of one of the expressions being joined (E1):a. 𝜎𝜃 𝐸1 𝜃𝜀 𝐸2 (𝜎𝜃 (𝐸1 )) 𝜃𝜀 𝐸2Similarity Query Example 1Original TPC-H QueryQ4 – Business Question: Study how well the order priority system isworking in a given quarterSimilarity-aware QueryBusiness Question: Study how well the order priority system works arounddates of interest (holydays, marketing campaigns, etc.)Select d refdate, o orderpriority, count(*) as order count from orders, DatesOfInterestWhere o orderdate AROUND d refdateand exists (Select * from lineitemWhere l orderkey o orderkey and l commitdate l receiptdate)group by o orderpriority, d refdate order by o orderpriority, d refdateSimilarity Query Example 2Original TPC-H QueryQ5 – Business Question: Study

numerical data rather than string data. A related type of join is the band join introduced in [32]. The join predicate of this join type has the form S.s-Ɛ1 R.r S.s Ɛ2. A key difference of our work with the work on band joins is that band joins represent only a special case of one of the four types of joins considered in our study.

Related Documents:

May 02, 2018 · D. Program Evaluation ͟The organization has provided a description of the framework for how each program will be evaluated. The framework should include all the elements below: ͟The evaluation methods are cost-effective for the organization ͟Quantitative and qualitative data is being collected (at Basics tier, data collection must have begun)

Silat is a combative art of self-defense and survival rooted from Matay archipelago. It was traced at thé early of Langkasuka Kingdom (2nd century CE) till thé reign of Melaka (Malaysia) Sultanate era (13th century). Silat has now evolved to become part of social culture and tradition with thé appearance of a fine physical and spiritual .

On an exceptional basis, Member States may request UNESCO to provide thé candidates with access to thé platform so they can complète thé form by themselves. Thèse requests must be addressed to esd rize unesco. or by 15 A ril 2021 UNESCO will provide thé nomineewith accessto thé platform via their émail address.

̶The leading indicator of employee engagement is based on the quality of the relationship between employee and supervisor Empower your managers! ̶Help them understand the impact on the organization ̶Share important changes, plan options, tasks, and deadlines ̶Provide key messages and talking points ̶Prepare them to answer employee questions

Dr. Sunita Bharatwal** Dr. Pawan Garga*** Abstract Customer satisfaction is derived from thè functionalities and values, a product or Service can provide. The current study aims to segregate thè dimensions of ordine Service quality and gather insights on its impact on web shopping. The trends of purchases have

Chính Văn.- Còn đức Thế tôn thì tuệ giác cực kỳ trong sạch 8: hiện hành bất nhị 9, đạt đến vô tướng 10, đứng vào chỗ đứng của các đức Thế tôn 11, thể hiện tính bình đẳng của các Ngài, đến chỗ không còn chướng ngại 12, giáo pháp không thể khuynh đảo, tâm thức không bị cản trở, cái được

(ii) Kinematic similarity: geometric similarity and similarity of motion between model and prototype particles (iii) Dynamic similarity: requires geometric and kinematic similarity and in addition that all force ratios in the two systems are identical Perfect model-prototype similarity: Mechanical similarity 8 . 5 Similarities Most relevant forces in fluid dynamics are: Inertial force mass .

5. I can use the AA Similarity Postulate to prove triangle similarity 6. I can use the SSS Similarity Theorem to prove triangle similarity 7. I can use the SAS Similarity Theorem to prove triangle similarity 8. I can use proportion theorems to find missing lengths in geometric problems 9. I can perform dilations