Accuracy of Approximate String Joins Using Grams


Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller
University of Toronto, 10 King's College Rd., Toronto, ON M5S 3G4, Canada

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.

ABSTRACT

Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures depends heavily on the characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) made out of the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams and thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare the measures based on the highest accuracy they can achieve on different datasets.

1. INTRODUCTION

Data quality is a major concern in operational databases and data warehouses. Errors may be present in the data for a multitude of reasons, including data entry errors, lack of common standards and missing integrity constraints. String data is by nature more prone to such errors. Approximate join is an important part of many data cleaning methodologies and is well studied: given two large relations, identify all pairs of records that approximately match. A variety of similarity measures have been proposed for string data in order to match records. Each measure has certain characteristics that make it suitable for capturing certain types of errors. Using a string similarity function sim(), the approximate join algorithm considers all pairs of records whose similarity score is above a threshold θ to approximately match, and returns them as its output.

Performing an approximate join on a large relation is a notoriously time-consuming task. Recently, there has been increasing interest in approximate join techniques based on q-grams (substrings of length q) made out of strings. Most of the efficient approximate join algorithms (which we describe in Section 2) are based on a specific similarity measure, along with a fixed threshold value, and return the pairs of records whose similarity exceeds the threshold. The effectiveness of the majority of these algorithms depends on the value of the threshold used. However, there has been little work studying the accuracy of the join operation.
The accuracy is known to be dataset-dependent, and there is no common framework for the evaluation and comparison of the accuracy of different similarity measures and techniques. This makes comparing their accuracy a difficult task. Nevertheless, we argue that it is possible to evaluate the relative performance of different measures for approximate joins by using datasets containing different types of known quality problems, such as typing errors and differences in notation and abbreviation. In this paper, we present an overview of several similarity measures for approximate string joins using q-grams and thoroughly evaluate their accuracy for different threshold values on datasets with different amounts and types of errors. Our results include the following:

- We show that for all similarity measures, the threshold value that results in the most accurate join depends heavily on the type and amount of errors in the data.
- We compare the similarity measures by the maximum accuracy they can achieve on different datasets using different thresholds. Although choosing a proper threshold for a similarity measure without prior knowledge of the data characteristics is known to be difficult, our results show which measures can potentially be more accurate, assuming there is a way to determine the best threshold. An interesting direction for future work is therefore to find an algorithm for determining the threshold value for the most accurate measures.
- We show how the amount and type of errors affect the best value of the threshold. An interesting consequence is that many previously proposed algorithms for enhancing the performance of the join operation and making it scalable to large datasets are not effective enough in many scenarios, since their performance depends heavily on choosing a high threshold value, which can result in very low accuracy. This highlights the value of algorithms that are less sensitive to the threshold value, and opens another interesting direction for future work: finding algorithms that are both efficient and accurate at the same threshold.

The paper is organized as follows. In Section 2, we overview related work on approximate joins. Section 3 presents our framework for approximate joins and describes the similarity measures used. Section 4 presents a thorough evaluation of these measures and, finally, Section 5 concludes the paper and outlines future directions.

2. RELATED WORK

Approximate join, also known as similarity join or record linkage, has been extensively studied in the literature. Several similarity measures for string data have been proposed [14, 4, 5], and a recent survey [9] presents an excellent overview of the different types of string similarity measures. Recently, there has been increasing interest in using measures from the Information Retrieval (IR) field along with q-grams made out of strings [10, 6, 2, 18, 5]. In this approach, strings are treated as documents and q-grams are treated as tokens in the documents. This makes it possible to take advantage of several indexing techniques, as well as the various algorithms that have been proposed for efficient set-similarity joins. Furthermore, these measures can be implemented declaratively over a DBMS with vanilla SQL statements [5].

Various recent works address the problem of efficiency and scalability of similarity join operations for large datasets [6, 2, 18]. Many techniques have been proposed for set-similarity joins, which can be used along with q-grams for the purpose of (string) similarity joins. Most of these techniques are based on the idea of creating signatures for sets (strings) to reduce the search space. Some signature generation schemes are derived from dimensionality reduction for the similarity search problem in high-dimensional spaces. One efficient approach uses the idea of Locality Sensitive Hashing (LSH) [13] to hash similar sets to the same values with high probability, and is therefore an approximate solution to the problem. Arasu et al. [2] propose algorithms specifically for set-similarity joins that are exact and outperform previous approximation methods in their framework, although the parameters of the algorithms require extensive tuning. Another class of work is based on indexing algorithms, primarily derived from IR optimization techniques. A recent proposal in this area [3] presents algorithms based on novel indexing and optimization strategies that do not rely on approximation or extensive parameter tuning and outperform previous state-of-the-art approaches. More recently, Li et al. [15] propose VGRAM, a technique based on the idea of using variable-length grams instead of q-grams. At a high level, VGRAM can be viewed as an efficient index structure over the collection of strings, and it can be used along with previously proposed signature-based algorithms to significantly improve their efficiency.

Most of the techniques described above mainly address the scalability of the join operation and not its accuracy, and the choice of similarity measure is often limited in these algorithms. The signature-based algorithm of [6] also considers accuracy by introducing a novel similarity measure called fuzzy match similarity and creating signatures for this measure. However, the accuracy of this measure is not compared with that of other measures.
In [5], several such similarity measures are benchmarked for approximate selection, which is a special case of the similarity join operation. Given a relation R, the approximate selection operation using a similarity predicate sim() reports all tuples t ∈ R such that sim(t_q, t) ≥ θ, where θ is a specified numerical similarity threshold and t_q is a query string. While several predicates are introduced and benchmarked in [5], the extension of approximate selection to approximate joins is not considered. Furthermore, the effect of threshold values on the accuracy of approximate joins is also not considered.

3. FRAMEWORK

In this section, we explain our framework for similarity join. The similarity join of two relations R = {r_i : 1 ≤ i ≤ N_1} and S = {s_j : 1 ≤ j ≤ N_2} outputs a set of pairs (r_i, s_j) ∈ R × S where r_i and s_j are similar. Two records are considered similar when their similarity score, based on a similarity function sim(), is above a threshold θ. For the definitions and experiments in this paper, we assume we are performing a self-join on relation R; therefore the output is a set of pairs (r_i, r_j) ∈ R × R where sim(r_i, r_j) ≥ θ for some similarity function sim() and threshold θ. This is a common operation in many applications such as entity resolution and clustering.

In keeping with many approximate join methods, we model records as strings. We denote by r̄ the set of q-grams (sequences of q consecutive characters of the string) in r. For example, for t = 'db lab', tokenization using 3-grams gives t̄ = {'db ', 'b l', ' la', 'lab'}. In certain cases, a weight may be associated with each token.
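To make the notation concrete, the sketch below shows q-gram tokenization and a naive quadratic threshold self-join. It is only an illustration of the framework, not the implementation evaluated in this paper; the names qgrams and naive_self_join are ours, and any of the similarity measures defined in the following subsections can be supplied as sim().

```python
from typing import Callable, List, Set, Tuple

def qgrams(s: str, q: int = 3) -> Set[str]:
    """Return the set of q-grams (substrings of length q) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def naive_self_join(records: List[str],
                    sim: Callable[[str, str], float],
                    theta: float) -> List[Tuple[str, str, float]]:
    """Naive O(n^2) self-join: report all pairs with sim(r_i, r_j) >= theta.
    The efficient algorithms cited in Section 2 avoid this quadratic scan."""
    out = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = sim(records[i], records[j])
            if score >= theta:
                out.append((records[i], records[j], score))
    return out

if __name__ == "__main__":
    # Reproduces the example from the text: the 3-grams of 'db lab'.
    print(qgrams("db lab", 3))   # {'db ', 'b l', ' la', 'lab'}
```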

The similarity measures discussed here are based on q-grams created out of strings, together with similarity functions that have been shown to be effective in previous work [5]. These measures share one or both of the following properties:

- High scalability: various techniques have been proposed in the literature, as described in Section 2, for enhancing the performance of the similarity join operation using q-grams along with these measures.
- High accuracy: previous work has shown that in most scenarios these measures perform as well as or better than other string similarity measures in terms of accuracy. Specifically, these measures have shown good accuracy in name-matching tasks [8] and in approximate selection [5].

3.1 Edit Similarity

Edit distance is widely used as the measure of choice in many similarity join techniques. Previous work [10] has shown how to use q-grams for an efficient implementation of this measure in a declarative framework, and recent work on enhancing the performance of similarity joins has proposed techniques for its scalable implementation [2, 15].

The edit distance between two string records r_1 and r_2 is defined through the transformation cost tc(r_1, r_2), the minimum cost of edit operations applied to r_1 to transform it into r_2. Edit operations include character copy, insert, delete and substitute [11]. The edit similarity is defined as:

    sim_{edit}(r_1, r_2) = 1 - \frac{tc(r_1, r_2)}{\max\{|r_1|, |r_2|\}}    (1)

There is a cost associated with each edit operation, and several cost models have been proposed for this measure. The most commonly used model, Levenshtein edit distance, which we simply call edit distance in this paper, uses unit cost for all operations except copy, which has cost zero.

3.2 Jaccard and Weighted Jaccard

Jaccard similarity is the fraction of tokens in r_1 and r_2 that are present in both. Weighted Jaccard similarity is the weighted version of Jaccard similarity, i.e.,

    sim_{WJaccard}(r_1, r_2) = \frac{\sum_{t \in \bar{r}_1 \cap \bar{r}_2} w_R(t)}{\sum_{t \in \bar{r}_1 \cup \bar{r}_2} w_R(t)}    (2)

where w_R(t) is a weight function that reflects the commonality of token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which was shown to be more effective than the commonly used Inverse Document Frequency (IDF) weight [5]:

    w_R(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}    (3)

where N is the number of tuples in the base relation R and n_t is the number of tuples in R containing the token t.
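As an illustration of Equations 1-3, the following sketch implements the edit similarity (via a standard Levenshtein computation) and Weighted Jaccard with RSJ weights over q-gram sets. It assumes the qgrams() helper from the earlier sketch; the helper names are ours, and the code is not tuned for the large relations targeted by the algorithms of Section 2.

```python
import math
from collections import Counter
from typing import Dict, List, Set

def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    (unit costs, copy is free) needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # copy / substitute
        prev = cur
    return prev[-1]

def edit_similarity(r1: str, r2: str) -> float:
    """Eq. 1: sim_edit = 1 - tc(r1, r2) / max(|r1|, |r2|)."""
    if not r1 and not r2:
        return 1.0
    return 1.0 - levenshtein(r1, r2) / max(len(r1), len(r2))

def rsj_weights(tokenized: List[Set[str]]) -> Dict[str, float]:
    """Eq. 3: w_R(t) = log((N - n_t + 0.5) / (n_t + 0.5)) over the relation."""
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in toks)
    return {t: math.log((n - nt + 0.5) / (nt + 0.5)) for t, nt in df.items()}

def weighted_jaccard(r1: Set[str], r2: Set[str], w: Dict[str, float]) -> float:
    """Eq. 2: weight of shared q-grams over weight of the union."""
    union = sum(w.get(t, 0.0) for t in r1 | r2)
    if union <= 0:
        return 0.0
    return sum(w.get(t, 0.0) for t in r1 & r2) / union

# Usage pattern (RSJ weights are only meaningful over the full relation R):
#   toks = [qgrams(r, 3) for r in R]      # qgrams() from the earlier sketch
#   w = rsj_weights(toks)
#   score = weighted_jaccard(toks[i], toks[j], w)
```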
3.3 Measures from IR

A well-studied problem in information retrieval is, given a query and a collection of documents, to return the documents most relevant to the query. In the measures in this part, records are treated as documents and q-grams are seen as the words (tokens) of the documents. Therefore, the same techniques used for finding the documents relevant to a query can be used to return the records similar to a query string. In the rest of this section, we present three measures that previous work has shown to perform well for the approximate selection problem [5].

3.3.1 Cosine w/tf-idf

The tf-idf cosine similarity is a well-established measure in the IR community which leverages the vector space model. It determines the closeness of the input strings r_1 and r_2 by first transforming the strings into unit vectors and then measuring the angle between the corresponding vectors. The cosine similarity with tf-idf weights is given by:

    sim_{Cosine}(r_1, r_2) = \sum_{t \in \bar{r}_1 \cap \bar{r}_2} w_{r_1}(t) \cdot w_{r_2}(t)    (4)

where w_{r_1}(t) and w_{r_2}(t) are the normalized tf-idf weights of each common token in r_1 and r_2 respectively. The normalized tf-idf weight of token t in a given string record r is defined as:

    w_r(t) = \frac{w'_r(t)}{\sqrt{\sum_{t' \in \bar{r}} w'_r(t')^2}}, \qquad w'_r(t) = tf_r(t) \cdot idf(t)

where tf_r(t) is the term frequency of token t within string r and idf(t) is the inverse document frequency with respect to the entire relation R.

3.3.2 BM25

The BM25 similarity score for a query r_1 and a string record r_2 is defined as:

    sim_{BM25}(r_1, r_2) = \sum_{t \in \bar{r}_1 \cap \bar{r}_2} \hat{w}_{r_1}(t) \cdot w_{r_2}(t)    (5)

where

    \hat{w}_{r_1}(t) = \frac{(k_3 + 1) \cdot tf_{r_1}(t)}{k_3 + tf_{r_1}(t)}

    w_{r_2}(t) = \frac{(k_1 + 1) \cdot tf_{r_2}(t)}{K(r_2) + tf_{r_2}(t)} \cdot w_R^{(1)}(t)

and w_R^{(1)} is the RSJ weight:

    w_R^{(1)}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}, \qquad K(r) = k_1 \left( (1 - b) + b \cdot \frac{|r|}{avgrl} \right)

Here tf_r(t) is the frequency of token t in string record r, |r| is the number of tokens in r, avgrl is the average number of tokens per record, N is the number of records in the relation R, n_t is the number of records containing token t, and k_1, k_3 and b are independent parameters. We set these parameters based on the TREC-4 experiments [17], where k_1 ∈ [1, 2], k_3 = 8 and b ∈ [0.6, 0.75].

3.3.3 Hidden Markov Model

Approximate string matching can also be modeled by a discrete Hidden Markov process, which has shown better performance than Cosine w/tf-idf in the IR literature [16] and high accuracy and running time for approximate selection [5]. This particular Markov model consists of only two states: the first state models the tokens that are specific to one particular "String", and the second state models the tokens of "General English", i.e., tokens that are common in many records. Refer to [5] and [16] for a complete description of the model and its possible extensions.

The HMM similarity function accepts two string records r_1 and r_2 and returns the probability of generating r_1 given that r_2 is a similar record:

    sim_{HMM}(r_1, r_2) = \prod_{t \in r_1} \big( a_0 P(t \mid GE) + a_1 P(t \mid r_2) \big)    (6)

where a_0 and a_1 = 1 - a_0 are the transition probabilities of the Markov model, and P(t | GE) and P(t | r_2) are given by:

    P(t \mid r_2) = \frac{\text{number of times } t \text{ appears in } r_2}{|r_2|}, \qquad P(t \mid GE) = \frac{\sum_{r \in R} \text{number of times } t \text{ appears in } r}{\sum_{r \in R} |r|}
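A minimal sketch of the two-state HMM score of Equation 6 is given below. Q-grams are kept in a list so that term frequencies are preserved; the default a_0 = 0.2 is an illustrative value, not the parameter setting of [5] or [16], and the helper names are ours.

```python
from collections import Counter
from typing import Dict, List

def qgram_list(s: str, q: int = 3) -> List[str]:
    """q-grams of s as a list, preserving multiplicities (term frequencies)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def general_english_model(corpus_tokens: List[List[str]]) -> Dict[str, float]:
    """P(t | GE): relative frequency of each token over the whole relation R."""
    counts = Counter()
    total = 0
    for toks in corpus_tokens:
        counts.update(toks)
        total += len(toks)
    if total == 0:
        return {}
    return {t: c / total for t, c in counts.items()}

def hmm_similarity(r1: List[str], r2: List[str],
                   p_ge: Dict[str, float], a0: float = 0.2) -> float:
    """Eq. 6: product over the tokens of r1 of a0*P(t|GE) + a1*P(t|r2),
    with a1 = 1 - a0. a0 = 0.2 is an illustrative default."""
    a1 = 1.0 - a0
    tf_r2 = Counter(r2)
    len_r2 = max(len(r2), 1)
    score = 1.0
    for t in r1:
        score *= a0 * p_ge.get(t, 0.0) + a1 * tf_r2[t] / len_r2
    return score

# Usage pattern:
#   corpus = [qgram_list(r, 3) for r in R]
#   p_ge = general_english_model(corpus)
#   score = hmm_similarity(corpus[i], corpus[j], p_ge)
```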

3.4 Hybrid Measures

The implementation of these measures involves two similarity functions: one that compares the strings by comparing their word tokens, and another similarity function that is more suitable for short strings and is used for comparing the word tokens themselves.

3.4.1 GES

The generalized edit similarity (GES) [7], a modified version of the fuzzy match similarity presented in [6], takes two strings r_1 and r_2, tokenizes them into sets of words and assigns a weight w(t) to each token. GES defines the similarity between the two strings through the minimum transformation cost required to convert r_1 into r_2:

    sim_{GES}(r_1, r_2) = 1 - \min\left\{ \frac{tc(r_1, r_2)}{wt(r_1)}, 1.0 \right\}    (7)

where wt(r_1) is the sum of the weights of all tokens in r_1 and tc(r_1, r_2) is the minimum cost of a sequence of the following transformation operations:

- Token insertion: inserting a token t into r_1 with cost w(t) · c_ins, where the insertion factor c_ins is a constant between 0 and 1. In our experiments, c_ins = 1.
- Token deletion: deleting a token t from r_1 with cost w(t).
- Token replacement: replacing a token t_1 by t_2 in r_1 with cost (1 - sim_edit(t_1, t_2)) · w(t), where sim_edit is the edit similarity of Equation 1.

3.4.2 SoftTFIDF

SoftTFIDF is another hybrid measure, proposed by Cohen et al. [8], which relies on the normalized tf-idf weights of word tokens and can work with an arbitrary similarity function for comparing word tokens. In this measure, the similarity score sim_SoftTFIDF is defined as follows:

    sim_{SoftTFIDF}(r_1, r_2) = \sum_{t_1 \in C(\theta, r_1, r_2)} w(t_1, r_1) \cdot w\big(\arg\max_{t_2 \in r_2} sim(t_1, t_2),\, r_2\big) \cdot \max_{t_2 \in r_2} sim(t_1, t_2)    (8)

where w(t, r) is the normalized tf-idf weight of word token t in record r, and C(θ, r_1, r_2) is the set of tokens t_1 ∈ r_1 such that there is some t_2 ∈ r_2 with sim(t_1, t_2) ≥ θ, for a similarity function sim() suitable for comparing word strings. In our experiments, sim(t_1, t_2) is the Jaro-Winkler similarity, as suggested in [8].
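The sketch below illustrates Equation 8. The word-level similarity is a parameter: here Python's difflib ratio is used only as a stand-in for the Jaro-Winkler measure used in the experiments, the tf-idf weight dictionaries are assumed to be precomputed over the relation as in Section 3.3.1, and the default threshold is illustrative.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

def word_sim(a: str, b: str) -> float:
    """Stand-in word-level similarity (difflib ratio); the experiments in this
    paper use Jaro-Winkler here instead."""
    return SequenceMatcher(None, a, b).ratio()

def soft_tfidf(r1: List[str], r2: List[str],
               w1: Dict[str, float], w2: Dict[str, float],
               sim: Callable[[str, str], float] = word_sim,
               theta: float = 0.9) -> float:
    """Eq. 8. r1/r2 are word-token lists; w1/w2 map tokens to their normalized
    tf-idf weights (computed over the relation, as in Section 3.3.1)."""
    if not r2:
        return 0.0
    score = 0.0
    for t1 in set(r1):
        best_t2 = max(r2, key=lambda t2: sim(t1, t2))   # closest token of r2
        best = sim(t1, best_t2)
        if best >= theta:                               # t1 is in C(theta, r1, r2)
            score += w1.get(t1, 0.0) * w2.get(best_t2, 0.0) * best
    return score

# Usage pattern (the weights here are illustrative placeholders):
#   soft_tfidf(["acme", "inc"], ["acme", "incorporated"],
#              {"acme": 0.8, "inc": 0.6}, {"acme": 0.7, "incorporated": 0.7})
```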

4. EVALUATION

4.1 Datasets

In order to evaluate the effectiveness of the similarity measures described in the previous section, we use the same datasets used in [5]. These datasets were created using a modified version of the UIS data generator, which has previously been used for the evaluation of data cleaning and record linkage techniques [12, 1]. The data generator can inject several types of errors into a clean database of string attributes. These errors include commonly occurring typing mistakes (edit errors: character insertion, deletion, replacement and swap), token swaps, and abbreviation errors (e.g., replacing Inc. with Incorporated and vice versa). The generator has several parameters to control the injected errors, such as the size of the dataset to be generated, the distribution of the duplicates (uniform, Zipfian or Poisson), the percentage of erroneous duplicates, the extent of error injected in each string, and the percentage of each type of error. The data generator keeps track of the duplicate records by assigning a cluster ID to each clean record and to all the duplicates generated from that clean record.

[Table 1: Datasets Used in the Experiments]

For the results presented in this paper, the datasets are generated out of a clean dataset of 2139 company names with an average record length of 21.03 and an average of 2.9 words per record. The errors in the datasets have a uniform distribution. For each dataset, on average 5000 dirty records are created out of 500 clean records. We have also run experiments on datasets generated using different parameters. For example, we generated data using a Zipfian distribution, and we also used data from another clean source (
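For illustration only, the following is a hypothetical sketch of character-level edit-error injection in the spirit of the UIS data generator; it is not the generator used to produce the datasets above, and the error counts, alphabet and helper names are arbitrary choices of ours.

```python
import random
import string
from typing import List, Tuple

ALPHABET = string.ascii_lowercase + " "

def inject_edit_errors(s: str, n_errors: int, rng: random.Random) -> str:
    """Apply n_errors random character-level edits (insert, delete, replace,
    swap) -- the same edit-error types the UIS-style generator injects."""
    chars = list(s)
    for _ in range(n_errors):
        op = rng.choice(["insert", "delete", "replace", "swap"])
        i = rng.randrange(len(chars)) if chars else 0
        if op == "insert" or not chars:
            chars.insert(i, rng.choice(ALPHABET))
        elif op == "delete":
            chars.pop(i)
        elif op == "replace":
            chars[i] = rng.choice(ALPHABET)
        elif op == "swap" and len(chars) > 1:
            j = min(i + 1, len(chars) - 1)
            chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def make_dirty_duplicates(clean: str, k: int, max_errors: int,
                          seed: int = 0) -> List[Tuple[str, str]]:
    """Generate k dirty duplicates of a clean record, each paired with its
    source record (playing the role of the generator's cluster ID)."""
    rng = random.Random(seed)
    return [(clean, inject_edit_errors(clean, rng.randint(1, max_errors), rng))
            for _ in range(k)]

# Example: make_dirty_duplicates("acme incorporated", k=5, max_errors=3)
```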
