Benchmarking Declarative Approximate Selection Predicates

Amit Chandel, Oktie Hassanzadeh, Nick Koudas, Mohammad Sadoghi
University of Toronto
{amit, oktie, koudas, mo}@cs.toronto.edu

Divesh Srivastava
AT&T Labs-Research
divesh@research.att.com

ABSTRACT

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last few years, several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this paper we propose new similarity predicates along with their declarative realization, based on notions of probabilistic information retrieval. In particular, we show how language models and hidden Markov models can be utilized as similarity predicates for data quality, and we present their full declarative instantiation. We also show how other scoring methods from information retrieval can be utilized in a similar setting. We then present full declarative specifications of previously proposed similarity predicates in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance and their accuracy for several types of common quality problems encountered in operational databases.

1. INTRODUCTION

The importance of data cleaning and data quality technologies for business practices is well recognized. Data cleaning has been an active research topic in several communities including statistics, machine learning and data management.
The quality of data suffers from typing mistakes, lack of standards for recording database fields, integrity constraints that are not enforced, inconsistent data mappings, etc. For years, data quality technology developed independently from core data management. Data quality tools became part of Extract Transform Load (ETL) technologies, commonly applied during the initial loading phase of data into a warehouse. Although this might be a viable approach for data analytics, where the data processed are static, it is far from acceptable for operational databases, which are dynamic and face proliferating quality problems that degrade common business practices.

Recently, there has been a major focus on tighter integration of data quality technology with database technology. In particular, there has been research work on the efficient realization of popular data cleaning algorithms inside database engines, as well as studies on the efficient realization of data quality primitives in a declarative way. The approaches are complementary, the former assuring great performance and the latter ease of deployment and integration with existing applications without modification of the underlying database engine. We are concerned with declarative implementations of data quality primitives in this paper. In particular, we study declarative realizations of several similarity predicates for the popular approximate (flexible) selection operation for data de-duplication [17]. A similarity predicate f is a predicate that numerically quantifies the "similarity" or "closeness" of two (string) tuples. Given a relation R, the approximate selection operation using similarity predicate f reports all tuples t in R such that f(t, q) ≥ θ, where θ is a specified numerical "similarity threshold" and q is a query tuple. Approximate selections are special cases of the approximate join (record linkage, similarity join) operation [17].
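The approximate selection operation described above can be sketched in a few lines. This is a minimal illustration, not the paper's declarative realization; the trigram Jaccard predicate used here is an arbitrary stand-in for any similarity predicate f.

```python
# Sketch of approximate selection: given a relation R (a list of string
# tuples), a similarity predicate f and a threshold theta, report every
# tuple t whose score f(t, q) against the query q reaches theta.

def qgrams(s, q=3):
    # the set of q-grams (length-q substrings) of s
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    # an illustrative similarity predicate over trigram token sets
    ta, tb = qgrams(a), qgrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def approx_select(R, query, f, theta):
    return [t for t in R if f(t, query) >= theta]

R = ["nick koudas", "nikos koudas", "divesh srivastava"]
print(approx_select(R, "nick kudas", jaccard, 0.3))  # → ['nick koudas']
```

With a low enough threshold the misspelled query still retrieves the intended tuple, which is exactly the flexibility an exact selection lacks.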
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance

Keywords
Declarative data quality, data cleaning, SQL, performance, accuracy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'07, June 12-14, 2007, Beijing, China
Copyright 2007 ACM 978-1-59593-686-8/07/0006 ...$5.00.

Several efficient declarative implementations of this operation for specific similarity predicates have been proposed [17], both for approximate selections and joins.

In this paper, we conduct a thorough study of declarative realizations of similarity predicates for approximate selections. We introduce and adapt novel predicates, realize them declaratively and compare them with existing ones for accuracy and performance. In particular, we make the following contributions:

- Inspired by the success of tf-idf cosine similarity from information retrieval [24] as a similarity predicate for approximate selections, we introduce declarative realizations of other successful predicates from information retrieval, in particular the popular BM25 measure.

- We introduce declarative realizations of probabilistic similarity predicates inspired by Language Models from information retrieval [20] and Hidden Markov Models [19], suitably adapted for the case of approximate selections.

- We present declarative realizations of previously proposed similarity predicates for the approximate selection problem, and we propose a categorization of all measures, both previously proposed and new, according to their characteristics.

- We present a thorough experimental study comparing all similarity predicates for accuracy and performance, under various types of quality problems in the underlying data.

The rest of this paper is organized as follows. In Section 2, we describe related work in the areas of data quality and Information Retrieval. We present our overall framework in Section 3, along with the similarity predicates considered and their classification. Declarative realizations of the predicates in each class are discussed in Section 4, where we also provide the SQL expressions required for the different predicates. Finally, we experimentally evaluate the performance of each of the similarity predicates and compare their accuracy in Section 5.

2. RELATED WORK

Data quality has been an active research topic for many years. A collection of statistical techniques were introduced initially for the record linkage problem [9, 8]. The bulk of early work on data quality was geared towards correcting problems in census files [27]. A number of similarity predicates were developed taking into account the specific application domain (i.e., census files) for assessing closeness between person names (e.g., Jaro [15], Jaro-Winkler [27], etc.).

An initial work geared towards database tuples was the merge/purge technique [14]. The work of Cohen [6] introduced the use of primitives from information retrieval (namely cosine similarity utilizing tf-idf [24]) to identify flexible matches among database tuples. A performance/accuracy study conducted by Cohen et al. [7] demonstrated that such techniques outperform common predicates introduced for specific domains (e.g., Jaro, Jaro-Winkler, etc.).

Several predicates quantifying approximate match between strings have been utilized for dealing with quality problems, including edit distance and its variants [13]. Hybrid predicates combining notions of edit distance and cosine similarity have also been introduced [4, 1]. Recently, [5, 2] presented SSJOIN, a primitive operator for efficient set similarity joins. Utilizing ideas from [26], such an operator can be used for approximate matching based on a number of similarity functions, including hamming distance, edit distance and Jaccard similarity. However, the choice of the similarity predicate in this approach is limited [2]. The bulk of these techniques and predicates, however, have been introduced without a declarative framework in mind. Thus, integrating them with applications utilizing databases in order to enable approximate selections is not easy.

Gravano et al. [11] and Galhardas et al. [10] introduced a declarative methodology for realizing approximate joins and selections for edit distance. Subsequently, a declarative framework for realizing tf-idf cosine similarity was introduced [12, 16, 17].

There has been a great deal of research in the information retrieval literature on weighting schemes beyond cosine similarity with tf-idf weighting. Recent IR research has shown BM25 to be the most effective among the known weighting schemes [23]. This weighting scheme models the distributions of within-document term frequency, document length and query term frequency very accurately. Moreover, in the information retrieval literature, language modeling has been a very active research topic as an alternative scheme to weight documents for their relevance to user queries. Starting with Ponte and Croft [20], language models for information retrieval have been widely studied. Hidden Markov Models (HMM) have been very successful in machine learning and have been utilized for a variety of learning tasks such as named entity recognition and voice recognition [21]. They have been utilized for information retrieval as well [19]. An experimental study on TREC data demonstrated that an extremely simple realization of HMM outperforms standard tf-idf for information retrieval [19]. Finally, researchers (e.g., [22]) have also tried to formally reason about the relative goodness of information retrieval weighting schemes.

3. FRAMEWORK

Let q be a query string and t a string tuple from a base relation R. We denote by T(q) and T(t) the sets of tokens in q and t, respectively. We refer to substrings of a string as tokens in a generic sense. Such tokens can be, for example, words or q-grams (sequences of q consecutive characters of a string). For 'db lab', word-based tokenization yields 'db', 'lab', and tokenization using 3-grams yields 'db ', 'b l', ' la', 'lab'. We refer to tokens throughout the paper when referring to words or q-grams, and make the choice specific (word or q-gram) for the techniques we present only when absolutely required. In certain cases, we may associate a weight with each token; several weighting mechanisms exist. We present our techniques referring to weights of tokens, making the choice of the weighting scheme concrete when required. In Section 5 we realize our techniques for specific choices of tokens and specific weighting mechanisms.

Our goal is to calculate a similarity score between q and t using a similarity predicate. We group similarity predicates into five classes based on their characteristics, namely:

- Overlap predicates: predicates that assess similarity based on the overlap of tokens in T(q) and T(t).

- Aggregate Weighted predicates: predicates that assess similarity by manipulating weights (scores) assigned to the elements of T(q) and T(t).

- Language Modeling predicates: predicates that are based on probabilistic models imposed on the elements of T(q) and T(t).

- Edit Based predicates: predicates based on a set of edit operations applied between q and t.

- Combination predicates: predicates combining features from the classes above.

The classes were defined by studying the properties of previously proposed similarity predicates as well as the ones newly proposed herein. Within each class we discuss declarative realizations of predicates.

3.1 Overlap Predicates

Suppose T(q) is the set of tokens in the query string q and T(t) is the set of tokens in the string tuple t. The IntersectSize predicate [26] is simply the number of common tokens between q and t, i.e.:

  IntersectSize(q, t) = |T(q) ∩ T(t)|    (1)

Jaccard similarity [26] is the fraction of tokens in q and t that are present in both, namely:

  Jaccard(q, t) = |T(q) ∩ T(t)| / |T(q) ∪ T(t)|    (2)
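The tokenization choices and the two overlap predicates above are easy to make concrete. A minimal sketch (function names are ours, not the paper's):

```python
# Word and 3-gram tokenization for the running example 'db lab',
# plus the IntersectSize and Jaccard overlap predicates of Section 3.1.

def word_tokens(s):
    return set(s.split())

def qgram_tokens(s, q=3):
    # all length-q substrings of s
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def intersect_size(t1, t2):
    # |T(q) ∩ T(t)|
    return len(t1 & t2)

def jaccard(t1, t2):
    # |T(q) ∩ T(t)| / |T(q) ∪ T(t)|
    return len(t1 & t2) / len(t1 | t2)

print(word_tokens("db lab"))   # → {'db', 'lab'}
print(qgram_tokens("db lab"))  # → {'db ', 'b l', ' la', 'lab'}
print(intersect_size(qgram_tokens("db lab"), qgram_tokens("db labs")))
print(jaccard(qgram_tokens("db lab"), qgram_tokens("db labs")))
```

Note that the 3-gram tokens of 'db lab' reproduce the example in the text; 'db labs' adds one extra trigram, so IntersectSize is 4 and Jaccard is 4/5.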

If we assign a weight w(w) to each token w (ways to assign such weights to tokens are discussed in subsequent sections), we can define weighted versions of the above overlap predicates. WeightedMatch [26] is the total weight of the common tokens in T(q) and T(t), i.e., the sum of w(w) over w in T(q) ∩ T(t). Similarly, WeightedJaccard is the sum of the weights of the tokens in T(q) ∩ T(t) divided by the sum of the weights of the tokens in T(q) ∪ T(t).

3.2 Aggregate Weighted Predicates

The predicates in this class encompass predicates widely adopted from information retrieval (IR). A basic task in IR is, given a query, identifying the documents relevant to that query. In our context, we would like to identify the tuples in a relation that are similar to a query string. Given a query string q and a string tuple t, the similarity score of q and t in this class of predicates is of the form

  score(q, t) = Σ_{w ∈ T(q) ∩ T(t)} w_q(w) · w_t(w)

where w_q(w) is the query-based weight of the token w in string q and w_t(w) is the tuple-based weight of the token w in string t.

3.2.1 Tf-idf Cosine Similarity

The tf-idf cosine similarity [24] between a query string q and a string tuple t is defined as follows:

  simCosine(q, t) = Σ_{w ∈ T(q) ∩ T(t)} w'(w, q) · w'(w, t)    (3)

where w'(w, q) and w'(w, t) are the normalized tf-idf weights [24]. The normalized tf-idf weight for a token w and a string s is given by:

  w'(w, s) = w(w, s) / sqrt( Σ_{v ∈ T(s)} w(v, s)² ),  with  w(w, s) = tf_s(w) · idf(w)    (4)

where tf_s(w) is the frequency of the token w in s and idf(w) is the inverse document frequency of w over the relation. The idf term makes the weight of a token inversely proportional to its frequency in the database; the tf term makes it proportional to its frequency in s. Intuitively, this assigns low scores to tokens that are frequent in the database and high scores to rare tokens in the database. More discussion is available elsewhere [6, 12].

3.2.2 BM25 Predicate

The similarity score between a query string q and a tuple t is given as:

  simBM25(q, t) = Σ_{w ∈ T(q)} w⁽¹⁾(w) · ((k1 + 1) tf_t(w)) / (K + tf_t(w)) · ((k3 + 1) tf_q(w)) / (k3 + tf_q(w))    (5)

where w⁽¹⁾(w) is a modified form of the Robertson-Sparck Jones weight:

  w⁽¹⁾(w) = log( (N - n_w + 0.5) / (n_w + 0.5) )

and

  K = k1 · ( (1 - b) + b · dl(t) / avdl ).

Here N is the number of tuples in the base relation R, n_w is the number of tuples in R containing the token w, tf_t(w) and tf_q(w) are the frequencies of occurrence of the token w within tuple t and query q respectively, dl(t) is the number of tokens of tuple t, avdl is the average number of tokens per tuple, and k1, b and k3 are independent parameters, set empirically (values used for the TREC-4 experiments are reported in [23]).

3.3 Language Modeling Predicates

A language model is a form of probabilistic model. To make the discussion concrete, we base it on a specific model introduced by Ponte and Croft [20]. Given a collection of documents, a language model is inferred for each; then the probability of generating a given query according to each of these models is estimated, and documents are ranked according to these probabilities. For an approximate selection query, each tuple in the database is treated as a document; a model is inferred for each tuple, and the probability of generating the query given the model is the similarity between the query and the tuple.

3.3.1 Language Modeling

The similarity score between a query string q and a tuple t is given as:

  sim_LM(q, t) = P(q | M_t) = Π_{w ∈ T(q)} P(w | M_t) × Π_{w ∉ T(q)} (1 - P(w | M_t))    (6)

where P(w | M_t), the probability of the token w occurring in tuple t, is given as follows:

  P(w | M_t) = P_ml(w, t)^(1 - R_{w,t}) × P_avg(w)^(R_{w,t})   if tf_t(w) > 0
  P(w | M_t) = cf_w / cs                                        otherwise    (7)

P_ml(w, t) is the maximum likelihood estimate of the probability of the token w under the token distribution for tuple t and is equal to tf_t(w) / dl(t), where tf_t(w) is the raw term frequency and dl(t) is the total number of tokens in tuple t. P_avg(w) is the mean probability of token w in the tuples containing it, i.e.:

  P_avg(w) = ( Σ_{t' : w ∈ T(t')} P_ml(w, t') ) / df_w    (8)

where df_w is the document frequency of the token w. This term is used since we only have a tuple-sized sample from the distribution of w; the maximum likelihood estimate is not reliable enough, and we need an estimate from a larger amount of data. The term R_{w,t} is used to model the risk of using the mean probability for a term w in a tuple t, using a geometric distribution:

  R_{w,t} = ( 1 / (1 + f̄_{w,t}) ) × ( f̄_{w,t} / (1 + f̄_{w,t}) )^{tf_t(w)}    (9)

where f̄_{w,t} = P_avg(w) × dl(t) is the expected term count for token w in tuple t if the token occurred at the average rate. The intuition behind this formula is that as tf_t(w) gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate. Finally, cf_w is the raw count of the token w in the collection, and cs is the raw collection size, i.e., the total number of tokens in the collection; cf_w / cs is used as the probability of observing a non-occurring token.

3.3.2 Hidden Markov Models

The query generation process can be modeled by a discrete Hidden Markov process. Figure 1 shows a simple yet powerful two-state HMM for this process. The first state, labeled "String", represents the choice of a token directly from the string. The second state, labeled "General English", represents the choice of a token that is unrelated to the string but occurs commonly in queries.
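The two aggregate weighted predicates of Section 3.2 can be sketched procedurally as follows. This is an illustrative sketch, not the paper's declarative realization: the idf form log(N/n_w), the BM25 parameter defaults (k1=1.2, b=0.75, k3=1000) and all names are our assumptions.

```python
import math
from collections import Counter

# Tf-idf cosine similarity and BM25 over word tokens (Section 3.2).
# df maps a token to the number of tuples containing it; N is the number
# of tuples; avgdl is the average number of tokens per tuple.

def tfidf_weights(tokens, df, N):
    # raw tf * idf, then L2-normalized (assumed idf form: log(N / n_w))
    w = {t: tf * math.log(N / df[t]) for t, tf in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

def cosine(q_tokens, t_tokens, df, N):
    wq = tfidf_weights(q_tokens, df, N)
    wt = tfidf_weights(t_tokens, df, N)
    return sum(wq[t] * wt[t] for t in wq.keys() & wt.keys())

def bm25(q_tokens, t_tokens, df, N, avgdl, k1=1.2, b=0.75, k3=1000.0):
    tf_t, tf_q = Counter(t_tokens), Counter(q_tokens)
    K = k1 * ((1 - b) + b * len(t_tokens) / avgdl)
    score = 0.0
    for w in tf_q:
        if w not in tf_t:
            continue
        rsj = math.log((N - df[w] + 0.5) / (df[w] + 0.5))  # w(1) weight
        score += (rsj * (k1 + 1) * tf_t[w] / (K + tf_t[w])
                      * (k3 + 1) * tf_q[w] / (k3 + tf_q[w]))
    return score

# toy relation of word-tokenized tuples
tuples = [["nick", "koudas"], ["nikos", "koudas"], ["divesh", "srivastava"],
          ["amit", "chandel"], ["oktie", "hassanzadeh"]]
N = len(tuples)
df = Counter(w for t in tuples for w in set(t))
avgdl = sum(len(t) for t in tuples) / N
q = ["nick", "koudas"]
print(cosine(q, tuples[0], df, N), bm25(q, tuples[0], df, N, avgdl))
```

On this toy relation both predicates rank the exact match above the tuple sharing only the common token 'koudas', with the rare token 'nick' contributing most of the score.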

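The probabilistic predicates of Section 3.3 can also be sketched procedurally. Below, only the two-state HMM predicate is shown (the Ponte-Croft model scores tuples analogously from the per-tuple token distributions); the mixture weight a1 = 0.8 is an arbitrary illustrative value, since the transition probabilities are meant to be tuned on training data.

```python
from collections import Counter

# Two-state HMM predicate (Section 3.3.2): the score of tuple t for
# query q is the product over query tokens of a0*P(w|GE) + a1*P(w|t),
# where the "General English" statistics come from the whole relation.

def hmm_score(q_tokens, t_tokens, relation, a1=0.8):
    a0 = 1.0 - a1                       # transition into "General English"
    tf_t = Counter(t_tokens)
    tf_ge = Counter(w for t in relation for w in t)
    cs = sum(tf_ge.values())            # total number of tokens in R
    score = 1.0
    for w in q_tokens:
        p_t = tf_t[w] / len(t_tokens)   # P(w | t): count of w in t / length of t
        p_ge = tf_ge[w] / cs            # P(w | GE): count of w in R / size of R
        score *= a0 * p_ge + a1 * p_t
    return score

relation = [["nick", "koudas"], ["nikos", "koudas"], ["divesh", "srivastava"]]
q = ["nick", "koudas"]
scores = [hmm_score(q, t, relation) for t in relation]
print(scores)
```

The "General English" state keeps the score nonzero even for tuples missing a query token, which is what lets the mixture rank partial matches sensibly.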
Suppose q is the query string and t a string tuple from the base relation R. The similarity score between q and t, sim_HMM(q, t), is equal to the probability of generating q given that t is similar, that is:

  sim_HMM(q, t) = P(q | t is similar) = Π_{w ∈ T(q)} ( a0 · P(w | GE) + a1 · P(w | t) )    (10)

where

  P(w | t) = (number of times w appears in t) / (length of t)
  P(w | GE) = (number of times w appears in R) / (total length of R)    (11)

and a0 and a1 are the transition probabilities of the HMM. The values of these parameters can be optimized to maximize accuracy given training data.

[Figure 1: Two State Hidden Markov Model]

3.4 Edit-based Predicates

An important and widely used class of string matching predicates is the class of edit-based predicates. In this class, the similarity between q and t is derived from the transformation cost tc(q, t) of string q into t. More specifically, tc(q, t) is defined as the minimum cost sequence of edit operations that converts q to t. Edit operations include copying, inserting, substituting and deleting characters in q and t [13]. Algorithms exist to compute tc(q, t) in polynomial time [13], but their complexity is sensitive to the nature of the operations and their operands (individual characters, blocks of consecutive characters, etc.). The edit similarity is therefore defined as:

  simEdit(q, t) = 1 - tc(q, t) / max(|q|, |t|)    (14)

A generalized edit similarity operates on word tokens with associated weights, using token-level edit operations that include:

- token insertion: inserting a word token w into q with cost c_ins · w(w), where c_ins, a constant token insertion factor, takes values between 0 and 1;

- token deletion: deleting a word token w from q with cost w(w).

Suppose wt(q) is the sum of the weights of all word tokens in the string q. We define the generalized edit similarity predicate between a query string q and a tuple t as follows:

  simGES(q, t) = 1 - min( tc(q, t) / wt(q), 1.0 )    (15)

3.5 Combination Predicates

A related predicate is the SoftTFIDF predicate [7]. In SoftTFIDF, normalized tf-idf weights of word tokens are used along with cosine similarity and any other similarity function sim' to find the similarity between word tokens. The similarity score, simSoftTFIDF(q, t), is equal to:

  simSoftTFIDF(q, t) = Σ_{w ∈ CLOSE(θ, q, t)} w'(w, q) · w'(argmax_{v ∈ T(t)} sim'(w, v), t) · max_{v ∈ T(t)} sim'(w, v)    (12)

where w'(w, q) and w'(w, t) are the normalized tf-idf weights and CLOSE(θ, q, t) is the set of words w ∈ T(q) such that there exists some v ∈ T(t) with sim'(w, v) > θ.

4. DECLARATIVE FRAMEWORK

We now describe declarative realizations of the predicates in each class. For all predicates, there is a preprocessing phase responsible for tokenizing the strings in the base relation R and for calculating and storing related weight values, which are subsequently utilized at query time. Tokenization of relation R (BASE_TABLE) creates the table BASE_TOKENS(tid, token), where tid is a unique tuple identifier for each tuple of BASE_TABLE and token an associated token (from the set of tokens corresponding to the tuple with identifier tid in BASE_TABLE). The query string is tokenized on the fly (at query time), creating the table QUERY_TOKENS(token).

In the rest of this section, we present the SQL expressions required for preprocessing and for query-time approximate selections for the different predicates. In some cases, we rewrite formulas to make them amenable to more efficient declarative realization. Due to space limitations, the discussion and presentation of SQL is abridged.

4.1 Overlap Predicates

The IntersectSize predicate requires only a join of BASE_TOKENS and QUERY_TOKENS on the token attribute, grouping by tuple identifier and counting the matching tokens.

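The character-level edit similarity of Section 3.4 likewise reduces to the classic dynamic program. A minimal sketch, using unit costs for insert, delete and substitute (the weighted, token-level generalization follows the same pattern):

```python
# Levenshtein distance via the standard two-row dynamic program, and
# simEdit(q, t) = 1 - ed(q, t) / max(|q|, |t|)  (Section 3.4).

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # copy / substitute
        prev = cur
    return prev[-1]

def sim_edit(q, t):
    return 1.0 - edit_distance(q, t) / max(len(q), len(t))

print(sim_edit("nick koudas", "nick kudas"))  # one deletion → 1 - 1/11
```

A single dropped character costs one edit, so the similarity stays close to 1; normalizing by the longer string length keeps the score in [0, 1].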
