Approximate String Joins In A Database (Almost) For Free

3y ago

43 Views

2 Downloads

207.78 KB

10 Pages

Last View : 16d ago

Last Download : 3m ago

Upload by : Roy Essex

Report this link

Download PDF

Transcription

Approximate String Joins in a Database (Almost) for FreeLuis GravanoColumbia UniversityPanagiotis G. IpeirotisColumbia UniversityH. V. JagadishUniversity of dujag@eecs.umich.eduNick KoudasAT&T Labs–ResearchS. MuthukrishnanAT&T Labs–ResearchDivesh SrivastavaAT&T ch.att.comdivesh@research.att.comAbstractString data is ubiquitous, and its management hastaken on particular importance in the past fewyears. Approximate queries are very important onstring data especially for more complex queriesinvolving joins. This is due, for example, to theprevalence of typographical errors in data, andmultiple conventions for recording attributes suchas name and address. Commercial databases donot support approximate string joins directly, andit is a challenge to implement this functionality efficiently with user-defined functions (UDFs).In this paper, we develop a technique for building approximate string join capabilities on top ofcommercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length, called -grams, and taking into account bothpositions of individual matches and the total number of such matches. Our approach applies to bothapproximate full string matching and approximatesubstring matching, with a variety of possible editdistance functions. The approximate string matchpredicate, with a suitable edit distance threshold,can be mapped into a vanilla relational expressionand optimized by conventional relational optimizers. We demonstrate experimentally the benefitsof our technique over the direct use of UDFs, using commercial database systems and real data.To study the I/O and CPU behavior of approximate string join algorithms with variations in editdistance and -gram length, we also describe detailed experiments based on a prototype implementation.1IntroductionString data is ubiquitous. To name only a few commonplace applications, consider product catalogs (for books,music, software, etc.), electronic white and yellow pagedirectories, specialized information sources such as patentdatabases, and customer relationship management data.As a consequence, management of string data indatabases has taken on particular importance in the pastfew years. Applications that collect and correlate data fromindependent data sources for warehousing, mining, and statistical analysis rely on efficient string matching to performtheir tasks. Here, correlation between the data is typicallybased on joins between descriptive string attributes in thevarious sources. However, the quality of the string information residing in various databases can be degraded dueto a variety of reasons, including human typing errors andflexibility in specifying string attributes. Hence the resultsof the joins based on exact matching of string attributes areoften of lower quality than expected. The following example illustrates these problems:Example 1.1 [String Joins] Consider a corporation maintaining various customer databases. Requests for correlating data sources are very common in this context. A specific customer might be present in more than one databasebecause the customer subscribes to multiple services thatthe corporation offers, and each service may have developed its database independently. In one database, acustomer’s name may be recorded as John A. Smith,while in another database the name may be recorded asSmith, John. In a different database, due to a typingerror, this name may be recorded as Jonh Smith. A request to correlate these databases and create a unified viewof customers will fail to produce the desired output if exactstring matching is used in the join. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercialadvantage, the VLDB copyright notice and the title of the publication andits date appear, and notice is given that copying is by permission of theVery Large Data Base Endowment. To copy otherwise, or to republish,requires a fee and/or special permission from the Endowment.Proceedings of the 27th VLDB Conference,Roma, Italy, 2001Unfortunately, commercial databases do not directlysupport approximate string processing functionality. Specialized tools, such as those available from Trillium Software1 , are useful for matching specific types of valuessuch as addresses, but these tools are not integrated with1 www.trillium.com

databases. To use such tools for information stored indatabases, one would either have to process data outsidethe database, or be able to use them as user-defined functions (UDFs) in an object-relational database. The formerapproach is undesirable in general. The latter approach isquite inefficient, especially for joins, because relational engines evaluate joins involving UDFs whose arguments include attributes belonging to multiple tables by essentiallycomputing the cross-products and applying the UDFs in apost-processing fashion.To address such difficulties, we need techniques for efficiently identifying all pairs of approximately matchingstrings in a database of strings. Whenever one deals withmatching in an approximate fashion, one has to specify theapproximation metric. Several proposals exist for stringsto capture the notion of “approximate equality.” Amongthose, the notion of edit distance between two strings isvery popular. According to this notion, deletion, insertion,and substitution of a character are considered as unit costoperations and the edit distance between two strings is defined as the lowest cost sequence of operations that cantransform one string to the other.Although there is a fair amount of work on the problemof approximately matching strings (see Section 6), we arenot aware of work related to approximately matching allstring pairs based on edit distance (or variants of it), as isneeded in approximate string joins. Moreover, we are notaware of any work related to this problem in the context ofa relational DBMS.In this paper, we present a technique for computing approximate string joins efficiently. At the core, our technique relies on matching short substrings of length of thedatabase strings (also known as -grams). We show howa relational schema can be augmented to directly represent-grams of database strings in auxiliary tables within thedatabase in a way that will enable use of traditional relational techniques and access methods for the calculation ofapproximate string joins. By taking into account the total number of such matches and the positions of individual -gram matches we guarantee no false dismissals underthe edit distance metric, as well as variations of it, and theidentification of a set of candidate pairs with a few falsepositives that can be later verified for correctness.Instead of trying to invent completely new join algorithms from scratch (which would be unlikely to be incorporated into existing commercial DBMSs), we opted fora design that would require minimal changes to existingdatabase systems. We show how the approximate stringmatch predicate, with a suitable edit distance threshold, canbe mapped into a vanilla SQL expression and optimizedby conventional optimizers. The immediate practical benefit of our technique is that approximate string processingcan be widely and effectively deployed in commercial relational databases without extensive changes to the underlying database system. Furthermore, by not requiring anychanges in the DBMS internals, we can re-use existing facilities, like the query optimizer, join ordering algorithmsand selectivity estimation.The rest of the paper is organized as follows: In Section 2 we give the notation and the definitions that we willuse. Then, in Section 3 we introduce formally the prob-lem of approximate string joins and we present our proposal. In Section 4 we present the results of an experimental study comparing the proposed approach to other applicable methods, demonstrating performance benefits andpresenting performance trends for several parameters of interest. Finally, in Section 5 we describe how we can adaptour techniques to address further problems of interest. Inparticular we show how to incorporate an alternate stringdistance function, namely the block edit distance (whereedit operations on contiguous substrings are inexpensive),and we address the problem of approximate substring joins.2Preliminaries2.1 Notation We use , possibly with subscripts, to denote tables, ,possibly with subscripts, to denote attributes, and , possibly with subscripts, to denote records in tables. We usethe notationto refer to attributeof table , andto refer to the value in attributeof record.Let be a finite alphabet of size. We use lowercase Greek symbols, such as , possibly with subscripts, todenote strings in . Letbe a string of length .We use,, to denote a substring ofof lengthstarting at position . # ! " ! &%' Definition 2.1 [Edit Distance] The edit distance betweentwo strings is the minimum number of edit operations (i.e.,insertions, deletions, and substitutions) of single charactersneeded to transform the first string into the second. 2.2(-grams: A Foundation for Approximate StringProcessingBelow, we briefly review the notion of positional -gramsfrom the literature, and we give the intuition behind theiruse for approximate string matching [16, 15, 13].Given a string , its positional -grams are obtained by“sliding” a window of length over the characters of .Since -grams at the beginning and the end of the stringcan have fewer than characters from , we introduce newcharacters “#” and “ ” not in , and conceptually extendthe string by prefixing it withoccurrences of “#”and suffixing it withoccurrences of “ ”. Thus, each-gram contains exactly characters, though some of thesemay not be from the alphabet . ## *) % # ,. / 0% # %Definition 2.2 [Positional -gram] A positional -gram, whereof a string is a pairis the -gram of that starts at position , counting onthe extended string. The setof all positional -grams ofpairs constructeda string is the set of all thefrom all -grams of .# The intuition behind the use of -grams as a foundationfor approximate string processing is that when two stringsandare within a small edit distance of each other,they share a large number of -grams in common [15, 13].The following example illustrates this observation. 21 43

Example 2.1 [Positional -gram]The positional grams of length 3 for string john smith are(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ),(6,n s), (7, sm), (8,smi), (9,mit), (10,ith),(11,th ), (12,h ) . Similarly, the positional grams of length 3 for string john a smith, whichis at an edit distance of two from john smith, are(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ),(6,n a), (7, a ), (8,a s), (9, sm), (10,smi),(11,mit), (12,ith), (13,th ), (14,h ) . Ifwe ignore the position information, the two -gram setshave 11 -grams in common. Interestingly, only the firstfive positional -grams of the first string are also positional-grams of the second string. However, an additional sixpositional -grams in the two strings differ in their position by just two positions. This illustrates that, in general,the use of positional -grams for approximate string processing will involve comparing positions of “matching” grams within a certain “band.” In the next section we describe how we exploit the concept of -grams to devise effective algorithms for approximate string joins (as opposed to the individual approximatestring matches described above).3Approximate String JoinsIn the context of a relational database, we wish to studytechniques and algorithms enabling efficient calculation ofapproximate string joins. More formally, we wish to address the following problem:1 3 ) 1 0) 3 1Problem 1 (Approximate String Joins) Given tablesandwith string attributesand, and an integer , retrieve all pairs of records such that edit distance( ) .313Our techniques for approximate string processing indatabases share a principle common in multimedia and spatial algorithms. First, a set of candidate answers is obtainedusing a cheap, approximate algorithm that guarantees nofalse dismissals. We achieve this by performing a join onthe -grams along with some additional filters that are guaranteed not to eliminate any real approximate match. Then,as a second step, we use an expensive, in-memory algorithm to check the edit distance between each candidatestring pair and we eliminate all false positives.In the rest of this section we describe in detail the algorithms used, and how they can be mapped into vanillaSQL expressions. More specifically, the rest of the section is organized as follows. In Section 3.1 we describethe naive solution, which involves the direct application ofuser-defined functions (UDFs) to address the problem. InSection 3.2 we describe how to augment a database with-gram information that is needed to run the approximatestring joins. Finally, in Section 3.3 we describe a set of filters that we use to ensure a small set of candidates and wedescribe how to map these filters into SQL queries that canbe subsequently optimized by regular query optimizers.3.1 Exploiting User-Defined FunctionsOur problem can be expressed easily in any objectrelational database system that supports UDFs, such as Oracle or DB2. One could register with the database a ternaryUDF edit distance(s1, s2, k) that returns trueif its two string argumentsare within edit distance ofthe integer argument . Then, the approximate string joinproblem for edit distance could be represented in SQL as:)1 ) 3 "1, 3Q1:SELECTFROMWHEREedit distance(1 *) 3 ) )To evaluate this query, relational engines would essentially have to compute the cross-product of tablesand, and apply the UDF comparison as a post-processingfilter. However, the cross-products of large tables are hugeand the UDF invocation, which is an expensive predicate,on every record in the cross-product makes the cost of thejoin operation prohibitive. For these reasons, we seek abetter solution and we describe our approach next.133.2 Augmenting a Database with Positional -GramsTo enable approximate string processing in a database system through the use of -grams, we need a principled mechanism for augmenting the database with positional -gramscorresponding to the original database strings.Let be a table with schema , suchis the key attribute that uniquely identifies recordsthatin , and some attributes , , are string-valued.For each string attributethat we wish to consider forapproximate string processing, we create an auxiliary ta blewith three attributes. For aof a record of , itspostring in attributesitional -grams are represented as separate records in thetable, where identifies the position ofthe -gram contained in ! " . Theserecords all share the same value for the attribute ,which serves as the foreign key attribute to table .Since the auxiliary -gram tables are used only duringthe approximate join operation, they can be created on-thefly, when the database wants to execute such an operation,and deleted upon completion. In the experimental evaluation (Section 4) we will show that the time overhead is negligible compared to the cost of the actual join. The spaceoverhead for the auxiliary -gram table for a string fieldof a relation with records is: ( ) (# ( )( ( ( ( )* 1 ) ) / %# / *% # ( # %&% % %'% (*) 1 # 1 where % is the size of the additional fields in the auxiliary -gram table (i.e., -, and ./ ). Since(0) , for any reasonable value of , it follows#12% (*) that. Thus, thesize of the auxiliary table is bounded by some linear function of times the size of the corresponding column in theoriginal table.After creating an augmented database with the auxiliarytables for each of the string attributes of interest, we can 1 ( %

efficiently calculate approximate string joins using simpleSQL queries. We describe the methods next.3.3 Filtering Results Using -gram PropertiesIn this section, we present our basic techniques for processing approximate string joins based on the edit distancemetric. The key objective here is to efficiently identify candidate answers to our problems by taking advantage of the-grams in the auxiliary database tables and using featuresalready available in database systems such as traditional access and join methods.For reasons of correctness and efficiency, we require nofalse dismissals and few false positives respectively. Toachieve these objectives our technique takes advantage ofthree key properties of -grams, and uses the three filteringtechniques described below.Count Filtering:The basic idea of C OUNT F ILTERING is to take advantageof the information conveyed by the setsand of-grams of the strings and , ignoring positional information, in determining whetherandare within editdistance . The intuition here is that strings that are withina small edit distance of each other share a large number of-grams in common.This intuition has appeared in the literature earlier [14],and can be formalized as follows. Consider a string , andletbe obtained by a substitution of a single character in. Then, the sets of -gramsand differ by at most(the length of the -gram). This is because -grams thatdo not overlap with the substituted character must be common to the two sets, and there are only -grams that canoverlap with the substituted character. A similar observation holds true for single character insertions and deletions.and must have at leastIn other words, in these cases, -grams in common. When the edit distance betweenandis , the following lower bound on the number of matching-grams holds. 1 1 1, - 1 3, - 1 ) 3 % # # 3, 3 1, - 3 1 ) 3 # 1 1 3 3, 1 ) 43 #Proposition 3.1 Consider stringsand , of lengthsand, respectively. Ifandare within an edit distance of , then the cardinality of , ignoringpositional information, must be at least . 1 3 # # 1 3, - Position Filtering:While C OUNT F ILTERING is effective in improving the efficiency of approximate string processing, it does not takeadvantage of -gram position information.In general, the interaction between -gram match positions and the edit distance threshold is quite complex. Anygiven -gram in one string may not occur at all in the otherstring, and positions of successive -grams may be off dueto insertions and deletions. Furthermore, as always, wemust keep in mind the possibility of a -gram in one stringoccurring at multiple positions in the other string.in one string toWe define a positional -gramcorrespond to a positional -gramin another string ) 1 ) 3 1 , , , , , , ! " AND # ! " AND ' ( 6 * AND. % " & ')( *,- " . ! / " 01243 " 012AND. 5!62 78' 9;: % 8?@3A2 7 ')9;: @ B #?GROUP BY C/ D C/ ECE # 6HAVINGCOUNT(*) FG2 7 ')9;: @ B 8?@3IH 3! 3-H ?KJ L AND6COUNT(*) FG2 7 ')9;: @ B )?@3IH 3! 3IH ?KJ L AND6edit distance( ECE # #C )SELECTFROMWHEREFigure 1: Query Q2: Expressing C OUNT F ILTERING, P O SITION F ILTERING , and L ENGTH F ILTERING as an SQLexpression. 313 ) 1 , after the sequence of edit opera 1 to 3 , “becomes” -gram ) 3 in the Mifandtions that convertedited string.Example 3.1 [Corresponding -grams] Consider the ONQP NKR NQP NSNKP N and & ONKP NSNQP NTNQP N . Thestringsedit distance between these strings is 1 (delete x to transform the first string to the second

approximate string joins. More formally, we wish to ad-dress the following problem: Problem 1 (Approximate String Joins) Given tables 1 and # 3with string attributes 1 ,an integer ), retrieve all pairs of ecords 1 3 such that edit distance(1 0) 3) . Our techniques for approximate string processing in databases share a principlecommon .

Related Documents:

Legacy Learning Systems - Mr. S. R. Brandt

You can also tune your guitar to a keyboard or piano. The open strings of a guitar correspond to certain notes on a keyboard. SESSION 1 3 Starting Off Right Learn &Master Guitar E A D G B E B 6th string 5th string 4th string 3rd string 2nd string 1st string 5th Fret 1st string 6th string 5th string 4th string 3rd string 2nd string E A D GB E .

17 Views

1y ago

Legacy Learning Systems - Weebly

17 Views

1y ago

Accuracy of Approximate String Joins Using Grams

is a query string. Approximate selections are special cases of the similarity join operation. While several predicates are introduced and benchmarked in [5], the extension of ap-proximate selection to approximate joins is not considered. Furthermore, the eﬀect of threshold values on accuracy for approximate joins is also not considered. 3 .

39 Views

3y ago

Selectivity Estimation for Fuzzy String Predicates in ...

queries with fuzzy string predicates, especially in the context of data cleansing [24, 31]. Gravano et al. [11] present a technique to do similar string joins inside a relational database system. Jin et al. [19] develop an eﬃcient approach to approximate string joins using mapping techniques. Chaudhuri et al. [4] propose an

50 Views

3y ago

Benchmarking Declarative Approximate Selection Predicates

query string. Given a query string and a string tuple , the similarity score of and in this class of predicates is of the form weight of the token,where is the query-based in string and weight of the token is the tuple-based in string . 3.2.1 Tf-idf Cosine Similarity The tf-idf cosine similarity [24] between a query string and a string tuple

52 Views

3y ago

The Hausmann String Quartet was formed in the summer of ...

Barber, Samuel String Quartet No.1, Op.11 Bartok, Bela String Quartet No.2, Op.17 String Quartet No.4 Beethoven, Ludwig van String Quartet No.1 in F major, Op.18 No.1 String Quartet No.2 in G major, “Compliments” Op.18 No.2 String Quartet No.6 in B-flat major, Op.18 No.6 String Quartet No.7 in F major, “Rasumovsky 1” Op.59 No.1

29 Views

2y ago

th International String Quartets Competition Premio Paolo ...

String Quartet n. 15 op. 144 Anton Webern String Quartet op. 28 Five Movements for String Quartet Six Bagatelles for String Quartet Alexander Von Zemlinsky String Quartet n. 2 op. 15 2) Toshio Hosokawa UTA-ORI. Weaving Song for string quartet (2020) New composition for String Quartet

30 Views

2y ago

Description Logic Knowledge Base Exchange

Description Logic Knowledge Base Exchange Elena Botoeva supervisor: Diego Calvanese PhD Final Examination April 10, 2014 Bolzano Elena Botoeva(FUB)Description Logic Knowledge Base Exchange1/33. Outline 1 Introduction 2 Summary of Work 3 Results 4 Technical Development Universal Solutions Universal UCQ-solutions UCQ-representations Elena Botoeva(FUB)Description Logic Knowledge Base Exchange2/33 .

74 Views

3y ago

Recent Views

Quotes within Quotes: When Single (') and Double (") Quotes . - SAS

Here the outside double quotes are replaced by a single quote and the apostrophe is replaced by two single quotes. This works because when the parser sees two single (or double) quotes immediately following each other, the parser resolves them into one quote mark after the closing quote has been determined.

1y ago

237 Views

IN THIS ISSUE CAR WASH INSIGHT Recent, Notable M&A Transactions .

9/8/2022 Club Car Wash Sites of Tidal Wave Express Car Wash 8 8/29/2022 Take 5 Car Wash Soft Touch Car Wash, Auto Oasis Car Wash, Clearwater Car Wash and Birdie's Car Wash 5 8/25/2022 WhiteWater Express Geaux Clean Car Wash 7 8/19/2022 ModWash Home Team Car Wash 3 8/18/2022 Splash In ECO Car Wash (Wills Group) Blue Hen Car Wash 2

9m ago

100 Views

Personal insurance - Car & Business insurance King Price Insurance

The king's insurance options 5 Things you need to know 7 The stuff you need to do 14 How to claim 16 Our commitment to you 20 Car insurance 22 Car warranty 37 Shortfall cover 45 Scratch and dent 46 Tyre and rim 48 Motorbike insurance 53 Trailer and caravan insurance 64 Watercraft insurance 68 Home contents insurance 77 Buildings insurance 89

1y ago

673 Views

What These Inspirational Quotes Say

Self Motivation Quotes Success Quotes Teacher Quotes And after reading all of these inspirational quotes you’d like to share which quotation is . -- Brian Tracy "You must constantly ask yourself these questions: Who am I around? What are they doing to me? Wha

2y ago

302 Views

ESSENTIAL PLAN - Discovery

Car insurance only Car and home insurance Car insurance only Car and home insurance 12.5% 25% 5% 10% YOUR FUEL CASH BACK PERCENTAGE GET TO THE HIGHEST CASH BACK PERCENTAGE Add at least R250 000 of home insurance (household contents, buildings or both) Take your car to Tiger Wheel & Tyre and pass the Annual MultiPoint check

1y ago

269 Views

CAR INSURANCE EVERYTHING EXPLAINED - RSA Insurance Group

CAR INSURANCE 93013821.indd 1 15/03/2018 10:46. 2 WELCOME TO µ CAR INSURANCE Thank you for choosing µ to protect you and your car. This booklet is intended to help you check your cover and to reassure you that µ will give you the protection you need for the year ahead. First of all, to help you understand your car insurance policy we want to .

1y ago

274 Views

Describe types and purposes of insurance.

D.O. CAPS Consumer Skills: Insurance—10E 3 Your car - The car you drive can also affect your insurance rates. Insurance companies place certain kinds of cars in special risk categories. You should ask your insurance agent before making a car purchase to make sure you aren't getting a car that will cost you extra for your liability insurance.

1y ago

233 Views

Money Online Price Comparison - WordPress

you to compare car insurance quotes. You'll notice at the top of the screen is a warning regarding telling the truth when completing any form of car insurance quote as something withheld, which later becomes known, can void an insurance claim. 7 The process of completing a car insurance price comparison is broken down into 4

1y ago

174 Views

Quotations - Free Website Builder: Create free websites

cards, but sometimes, playing a poor hand well." . 50th Birthday Quotes 60th Birthday Quotes And there are more. Funny Birthday Quotes Cute Birthday Quotes . it a try, itʼs free. Triumph over failure can be a

2y ago

267 Views

The Top 100 Motivational & Inspirational Quotes for 2015

I've spent hours crawling through the web trying to find the best quotes to keep me motivated and inspired all throughout the New Year. I've saved hundreds of quotes on my laptop and figured that words alone could motivate and inspire me. but if I couple the quotes

2y ago

329 Views

Inspirational Quotes - Guideposts

Inspirational Quotes Inspiring quotes are like vitamins for the soul. From the heartfelt to the humorous, the words of wisdom you’ll find here will strengthen your faith, lift your spirits, and even spark a positive change in your life. This collection of some our favorite inspirational quotes from religious figures, world leaders, authors,

2y ago

553 Views

Contours Options Infant Car Seat Adapter Instruction Sheet

your Infant Car Seat, as described in the instruction manual provided by the Infant Car Seat manufacturer. † WHEN USING ONLY ONE INFANT CAR SEAT ADAPTER OR TWO FOR TWINS, THE FOLLOWING INFANT CAR SEATS CAN BE USED: † If your Infant Car Seat is not one of the models listed above, DO NOT use your infant car seat with this car seat adapter.

2y ago

564 Views

Microsoft Advertising Travel Update

last minute cruise deals -58.50% Car Rental Queries WoW Change car rental -43.80% rental cars -46.30% car rentals -40.60% cheap car rentals -48.00% car rentals cheapest rates -52.20% rent a car- 40.30% cheap rental cars -45.60% rental car -41.80% car rental deals -49.30% rental cars lowest price -53.90% Flight Queries WoW Change cheap flights .

1y ago

337 Views

Design and development of lift for an automatic car parking system

1. Stacker type car parking system 2. Puzzle type car parking system 3. Level type car parking system 4. Chess type car parking system 5. Rotary type car parking system 6. Tower type car parking system But lift is used only in tower type car parking system. Objectives:-

6m ago

172 Views

Gold Tier - MAPFRE Insurance

Foy Insurance of MA, LLC 198 Frank Consolati Insurance Agency, Inc. 198 County Insurance Agency, Inc. 198 Woodrow W Cross Agency 214 Woodland Insurance Agency, Inc. 214 Tegeler Insurance Services of CT, Inc. 214 Pantano/VonKahle Insurance Agency, Inc. 214 . Hanson Insurance Agency, Inc. 287 J.H. Slattery Insurance Agency, Inc. 287

1y ago

565 Views

Approximate String Joins In A Database (Almost) For Free

It looks like you're using an ad-blocker