Approximate String Joins In A Database (Almost) For Free

3y ago
43 Views
2 Downloads
207.78 KB
10 Pages
Last View : 16d ago
Last Download : 3m ago
Upload by : Roy Essex
Transcription

Approximate String Joins in a Database (Almost) for FreeLuis GravanoColumbia UniversityPanagiotis G. IpeirotisColumbia UniversityH. V. JagadishUniversity of dujag@eecs.umich.eduNick KoudasAT&T Labs–ResearchS. MuthukrishnanAT&T Labs–ResearchDivesh SrivastavaAT&T ch.att.comdivesh@research.att.comAbstractString data is ubiquitous, and its management hastaken on particular importance in the past fewyears. Approximate queries are very important onstring data especially for more complex queriesinvolving joins. This is due, for example, to theprevalence of typographical errors in data, andmultiple conventions for recording attributes suchas name and address. Commercial databases donot support approximate string joins directly, andit is a challenge to implement this functionality efficiently with user-defined functions (UDFs).In this paper, we develop a technique for building approximate string join capabilities on top ofcommercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length, called -grams, and taking into account bothpositions of individual matches and the total number of such matches. Our approach applies to bothapproximate full string matching and approximatesubstring matching, with a variety of possible editdistance functions. The approximate string matchpredicate, with a suitable edit distance threshold,can be mapped into a vanilla relational expressionand optimized by conventional relational optimizers. We demonstrate experimentally the benefitsof our technique over the direct use of UDFs, using commercial database systems and real data.To study the I/O and CPU behavior of approximate string join algorithms with variations in editdistance and -gram length, we also describe detailed experiments based on a prototype implementation.1IntroductionString data is ubiquitous. To name only a few commonplace applications, consider product catalogs (for books,music, software, etc.), electronic white and yellow pagedirectories, specialized information sources such as patentdatabases, and customer relationship management data.As a consequence, management of string data indatabases has taken on particular importance in the pastfew years. Applications that collect and correlate data fromindependent data sources for warehousing, mining, and statistical analysis rely on efficient string matching to performtheir tasks. Here, correlation between the data is typicallybased on joins between descriptive string attributes in thevarious sources. However, the quality of the string information residing in various databases can be degraded dueto a variety of reasons, including human typing errors andflexibility in specifying string attributes. Hence the resultsof the joins based on exact matching of string attributes areoften of lower quality than expected. The following example illustrates these problems:Example 1.1 [String Joins] Consider a corporation maintaining various customer databases. Requests for correlating data sources are very common in this context. A specific customer might be present in more than one databasebecause the customer subscribes to multiple services thatthe corporation offers, and each service may have developed its database independently. In one database, acustomer’s name may be recorded as John A. Smith,while in another database the name may be recorded asSmith, John. In a different database, due to a typingerror, this name may be recorded as Jonh Smith. A request to correlate these databases and create a unified viewof customers will fail to produce the desired output if exactstring matching is used in the join. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercialadvantage, the VLDB copyright notice and the title of the publication andits date appear, and notice is given that copying is by permission of theVery Large Data Base Endowment. To copy otherwise, or to republish,requires a fee and/or special permission from the Endowment.Proceedings of the 27th VLDB Conference,Roma, Italy, 2001Unfortunately, commercial databases do not directlysupport approximate string processing functionality. Specialized tools, such as those available from Trillium Software1 , are useful for matching specific types of valuessuch as addresses, but these tools are not integrated with1 www.trillium.com

databases. To use such tools for information stored indatabases, one would either have to process data outsidethe database, or be able to use them as user-defined functions (UDFs) in an object-relational database. The formerapproach is undesirable in general. The latter approach isquite inefficient, especially for joins, because relational engines evaluate joins involving UDFs whose arguments include attributes belonging to multiple tables by essentiallycomputing the cross-products and applying the UDFs in apost-processing fashion.To address such difficulties, we need techniques for efficiently identifying all pairs of approximately matchingstrings in a database of strings. Whenever one deals withmatching in an approximate fashion, one has to specify theapproximation metric. Several proposals exist for stringsto capture the notion of “approximate equality.” Amongthose, the notion of edit distance between two strings isvery popular. According to this notion, deletion, insertion,and substitution of a character are considered as unit costoperations and the edit distance between two strings is defined as the lowest cost sequence of operations that cantransform one string to the other.Although there is a fair amount of work on the problemof approximately matching strings (see Section 6), we arenot aware of work related to approximately matching allstring pairs based on edit distance (or variants of it), as isneeded in approximate string joins. Moreover, we are notaware of any work related to this problem in the context ofa relational DBMS.In this paper, we present a technique for computing approximate string joins efficiently. At the core, our technique relies on matching short substrings of length of thedatabase strings (also known as -grams). We show howa relational schema can be augmented to directly represent-grams of database strings in auxiliary tables within thedatabase in a way that will enable use of traditional relational techniques and access methods for the calculation ofapproximate string joins. By taking into account the total number of such matches and the positions of individual -gram matches we guarantee no false dismissals underthe edit distance metric, as well as variations of it, and theidentification of a set of candidate pairs with a few falsepositives that can be later verified for correctness.Instead of trying to invent completely new join algorithms from scratch (which would be unlikely to be incorporated into existing commercial DBMSs), we opted fora design that would require minimal changes to existingdatabase systems. We show how the approximate stringmatch predicate, with a suitable edit distance threshold, canbe mapped into a vanilla SQL expression and optimizedby conventional optimizers. The immediate practical benefit of our technique is that approximate string processingcan be widely and effectively deployed in commercial relational databases without extensive changes to the underlying database system. Furthermore, by not requiring anychanges in the DBMS internals, we can re-use existing facilities, like the query optimizer, join ordering algorithmsand selectivity estimation.The rest of the paper is organized as follows: In Section 2 we give the notation and the definitions that we willuse. Then, in Section 3 we introduce formally the prob-lem of approximate string joins and we present our proposal. In Section 4 we present the results of an experimental study comparing the proposed approach to other applicable methods, demonstrating performance benefits andpresenting performance trends for several parameters of interest. Finally, in Section 5 we describe how we can adaptour techniques to address further problems of interest. Inparticular we show how to incorporate an alternate stringdistance function, namely the block edit distance (whereedit operations on contiguous substrings are inexpensive),and we address the problem of approximate substring joins.2Preliminaries2.1 Notation We use , possibly with subscripts, to denote tables, ,possibly with subscripts, to denote attributes, and , possibly with subscripts, to denote records in tables. We usethe notationto refer to attributeof table , andto refer to the value in attributeof record.Let be a finite alphabet of size. We use lowercase Greek symbols, such as , possibly with subscripts, todenote strings in . Letbe a string of length .We use,, to denote a substring ofof lengthstarting at position . # ! " ! &%' Definition 2.1 [Edit Distance] The edit distance betweentwo strings is the minimum number of edit operations (i.e.,insertions, deletions, and substitutions) of single charactersneeded to transform the first string into the second. 2.2(-grams: A Foundation for Approximate StringProcessingBelow, we briefly review the notion of positional -gramsfrom the literature, and we give the intuition behind theiruse for approximate string matching [16, 15, 13].Given a string , its positional -grams are obtained by“sliding” a window of length over the characters of .Since -grams at the beginning and the end of the stringcan have fewer than characters from , we introduce newcharacters “#” and “ ” not in , and conceptually extendthe string by prefixing it withoccurrences of “#”and suffixing it withoccurrences of “ ”. Thus, each-gram contains exactly characters, though some of thesemay not be from the alphabet . ## *) % # ,. / 0% # %Definition 2.2 [Positional -gram] A positional -gram, whereof a string is a pairis the -gram of that starts at position , counting onthe extended string. The setof all positional -grams ofpairs constructeda string is the set of all thefrom all -grams of .# The intuition behind the use of -grams as a foundationfor approximate string processing is that when two stringsandare within a small edit distance of each other,they share a large number of -grams in common [15, 13].The following example illustrates this observation. 21 43

Example 2.1 [Positional -gram]The positional grams of length 3 for string john smith are(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ),(6,n s), (7, sm), (8,smi), (9,mit), (10,ith),(11,th ), (12,h ) . Similarly, the positional grams of length 3 for string john a smith, whichis at an edit distance of two from john smith, are(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ),(6,n a), (7, a ), (8,a s), (9, sm), (10,smi),(11,mit), (12,ith), (13,th ), (14,h ) . Ifwe ignore the position information, the two -gram setshave 11 -grams in common. Interestingly, only the firstfive positional -grams of the first string are also positional-grams of the second string. However, an additional sixpositional -grams in the two strings differ in their position by just two positions. This illustrates that, in general,the use of positional -grams for approximate string processing will involve comparing positions of “matching” grams within a certain “band.” In the next section we describe how we exploit the concept of -grams to devise effective algorithms for approximate string joins (as opposed to the individual approximatestring matches described above).3Approximate String JoinsIn the context of a relational database, we wish to studytechniques and algorithms enabling efficient calculation ofapproximate string joins. More formally, we wish to address the following problem:1 3 ) 1 0) 3 1Problem 1 (Approximate String Joins) Given tablesandwith string attributesand, and an integer , retrieve all pairs of records such that edit distance( ) .313Our techniques for approximate string processing indatabases share a principle common in multimedia and spatial algorithms. First, a set of candidate answers is obtainedusing a cheap, approximate algorithm that guarantees nofalse dismissals. We achieve this by performing a join onthe -grams along with some additional filters that are guaranteed not to eliminate any real approximate match. Then,as a second step, we use an expensive, in-memory algorithm to check the edit distance between each candidatestring pair and we eliminate all false positives.In the rest of this section we describe in detail the algorithms used, and how they can be mapped into vanillaSQL expressions. More specifically, the rest of the section is organized as follows. In Section 3.1 we describethe naive solution, which involves the direct application ofuser-defined functions (UDFs) to address the problem. InSection 3.2 we describe how to augment a database with-gram information that is needed to run the approximatestring joins. Finally, in Section 3.3 we describe a set of filters that we use to ensure a small set of candidates and wedescribe how to map these filters into SQL queries that canbe subsequently optimized by regular query optimizers.3.1 Exploiting User-Defined FunctionsOur problem can be expressed easily in any objectrelational database system that supports UDFs, such as Oracle or DB2. One could register with the database a ternaryUDF edit distance(s1, s2, k) that returns trueif its two string argumentsare within edit distance ofthe integer argument . Then, the approximate string joinproblem for edit distance could be represented in SQL as:)1 ) 3 "1, 3Q1:SELECTFROMWHEREedit distance(1 *) 3 ) )To evaluate this query, relational engines would essentially have to compute the cross-product of tablesand, and apply the UDF comparison as a post-processingfilter. However, the cross-products of large tables are hugeand the UDF invocation, which is an expensive predicate,on every record in the cross-product makes the cost of thejoin operation prohibitive. For these reasons, we seek abetter solution and we describe our approach next.133.2 Augmenting a Database with Positional -GramsTo enable approximate string processing in a database system through the use of -grams, we need a principled mechanism for augmenting the database with positional -gramscorresponding to the original database strings.Let be a table with schema , suchis the key attribute that uniquely identifies recordsthatin , and some attributes , , are string-valued.For each string attributethat we wish to consider forapproximate string processing, we create an auxiliary ta blewith three attributes. For aof a record of , itspostring in attributesitional -grams are represented as separate records in thetable, where identifies the position ofthe -gram contained in ! " . Theserecords all share the same value for the attribute ,which serves as the foreign key attribute to table .Since the auxiliary -gram tables are used only duringthe approximate join operation, they can be created on-thefly, when the database wants to execute such an operation,and deleted upon completion. In the experimental evaluation (Section 4) we will show that the time overhead is negligible compared to the cost of the actual join. The spaceoverhead for the auxiliary -gram table for a string fieldof a relation with records is: ( ) (# ( )( ( ( ( )* 1 ) ) / %# / *% # ( # %&% % %'% (*) 1 # 1 where % is the size of the additional fields in the auxiliary -gram table (i.e., -, and ./ ). Since(0) , for any reasonable value of , it follows#12% (*) that. Thus, thesize of the auxiliary table is bounded by some linear function of times the size of the corresponding column in theoriginal table.After creating an augmented database with the auxiliarytables for each of the string attributes of interest, we can 1 ( %

efficiently calculate approximate string joins using simpleSQL queries. We describe the methods next.3.3 Filtering Results Using -gram PropertiesIn this section, we present our basic techniques for processing approximate string joins based on the edit distancemetric. The key objective here is to efficiently identify candidate answers to our problems by taking advantage of the-grams in the auxiliary database tables and using featuresalready available in database systems such as traditional access and join methods.For reasons of correctness and efficiency, we require nofalse dismissals and few false positives respectively. Toachieve these objectives our technique takes advantage ofthree key properties of -grams, and uses the three filteringtechniques described below.Count Filtering:The basic idea of C OUNT F ILTERING is to take advantageof the information conveyed by the setsand of-grams of the strings and , ignoring positional information, in determining whetherandare within editdistance . The intuition here is that strings that are withina small edit distance of each other share a large number of-grams in common.This intuition has appeared in the literature earlier [14],and can be formalized as follows. Consider a string , andletbe obtained by a substitution of a single character in. Then, the sets of -gramsand differ by at most(the length of the -gram). This is because -grams thatdo not overlap with the substituted character must be common to the two sets, and there are only -grams that canoverlap with the substituted character. A similar observation holds true for single character insertions and deletions.and must have at leastIn other words, in these cases, -grams in common. When the edit distance betweenandis , the following lower bound on the number of matching-grams holds. 1 1 1, - 1 3, - 1 ) 3 % # # 3, 3 1, - 3 1 ) 3 # 1 1 3 3, 1 ) 43 #Proposition 3.1 Consider stringsand , of lengthsand, respectively. Ifandare within an edit distance of , then the cardinality of , ignoringpositional information, must be at least . 1 3 # # 1 3, - Position Filtering:While C OUNT F ILTERING is effective in improving the efficiency of approximate string processing, it does not takeadvantage of -gram position information.In general, the interaction between -gram match positions and the edit distance threshold is quite complex. Anygiven -gram in one string may not occur at all in the otherstring, and positions of successive -grams may be off dueto insertions and deletions. Furthermore, as always, wemust keep in mind the possibility of a -gram in one stringoccurring at multiple positions in the other string.in one string toWe define a positional -gramcorrespond to a positional -gramin another string ) 1 ) 3 1 , , , , , , ! " AND # ! " AND ' ( 6 * AND. % " & ')( *,- " . ! / " 01243 " 012AND. 5!62 78' 9;: % 8?@3A2 7 ')9;: @ B #?GROUP BY C/ D C/ ECE # 6HAVINGCOUNT(*) FG2 7 ')9;: @ B 8?@3IH 3! 3-H ?KJ L AND6COUNT(*) FG2 7 ')9;: @ B )?@3IH 3! 3IH ?KJ L AND6edit distance( ECE # #C )SELECTFROMWHEREFigure 1: Query Q2: Expressing C OUNT F ILTERING, P O SITION F ILTERING , and L ENGTH F ILTERING as an SQLexpression. 313 ) 1 , after the sequence of edit opera 1 to 3 , “becomes” -gram ) 3 in the Mifandtions that convertedited string.Example 3.1 [Corresponding -grams] Consider the ONQP NKR NQP NSNKP N and & ONKP NSNQP NTNQP N . Thestringsedit distance between these strings is 1 (delete x to transform the first string to the second

approximate string joins. More formally, we wish to ad-dress the following problem: Problem 1 (Approximate String Joins) Given tables 1 and # 3with string attributes 1 ,an integer ), retrieve all pairs of ecords 1 3 such that edit distance(1 0) 3) . Our techniques for approximate string processing in databases share a principlecommon .

Related Documents:

You can also tune your guitar to a keyboard or piano. The open strings of a guitar correspond to certain notes on a keyboard. SESSION 1 3 Starting Off Right Learn &Master Guitar E A D G B E B 6th string 5th string 4th string 3rd string 2nd string 1st string 5th Fret 1st string 6th string 5th string 4th string 3rd string 2nd string E A D GB E .

You can also tune your guitar to a keyboard or piano. The open strings of a guitar correspond to certain notes on a keyboard. SESSION 1 3 Starting Off Right Learn &Master Guitar E A D G B E B 6th string 5th string 4th string 3rd string 2nd string 1st string 5th Fret 1st string 6th string 5th string 4th string 3rd string 2nd string E A D GB E .

is a query string. Approximate selections are special cases of the similarity join operation. While several predicates are introduced and benchmarked in [5], the extension of ap-proximate selection to approximate joins is not considered. Furthermore, the effect of threshold values on accuracy for approximate joins is also not considered. 3 .

queries with fuzzy string predicates, especially in the context of data cleansing [24, 31]. Gravano et al. [11] present a technique to do similar string joins inside a relational database system. Jin et al. [19] develop an efficient approach to approximate string joins using mapping techniques. Chaudhuri et al. [4] propose an

query string. Given a query string and a string tuple , the similarity score of and in this class of predicates is of the form weight of the token,where is the query-based in string and weight of the token is the tuple-based in string . 3.2.1 Tf-idf Cosine Similarity The tf-idf cosine similarity [24] between a query string and a string tuple

Barber, Samuel String Quartet No.1, Op.11 Bartok, Bela String Quartet No.2, Op.17 String Quartet No.4 Beethoven, Ludwig van String Quartet No.1 in F major, Op.18 No.1 String Quartet No.2 in G major, “Compliments” Op.18 No.2 String Quartet No.6 in B-flat major, Op.18 No.6 String Quartet No.7 in F major, “Rasumovsky 1” Op.59 No.1

String Quartet n. 15 op. 144 Anton Webern String Quartet op. 28 Five Movements for String Quartet Six Bagatelles for String Quartet Alexander Von Zemlinsky String Quartet n. 2 op. 15 2) Toshio Hosokawa UTA-ORI. Weaving Song for string quartet (2020) New composition for String Quartet

Description Logic Knowledge Base Exchange Elena Botoeva supervisor: Diego Calvanese PhD Final Examination April 10, 2014 Bolzano Elena Botoeva(FUB)Description Logic Knowledge Base Exchange1/33. Outline 1 Introduction 2 Summary of Work 3 Results 4 Technical Development Universal Solutions Universal UCQ-solutions UCQ-representations Elena Botoeva(FUB)Description Logic Knowledge Base Exchange2/33 .