A Language Modeling Approach to Information Retrieval


Jay M. Ponte and W. Bruce Croft
Computer Science Department
University of Massachusetts, Amherst
{ponte, croft}@cs.umass.edu

Abstract

Models of document indexing and document retrieval have been extensively studied. The integration of these two classes of models has been the goal of several researchers but it is a very difficult problem. We argue that much of the reason for this is the lack of an adequate indexing model. This suggests that perhaps a better indexing model would help solve the problem. However, we feel that making unwarranted parametric assumptions will not lead to better retrieval performance. Furthermore, making prior assumptions about the similarity of documents is not warranted either. Instead, we propose an approach to retrieval based on probabilistic language modeling. We estimate models for each document individually. Our approach to modeling is non-parametric and integrates document indexing and document retrieval into a single model. One advantage of our approach is that collection statistics which are used heuristically in many other retrieval models are an integral part of our model. We have implemented our model and tested it empirically. Our approach significantly outperforms standard tf.idf weighting on two different collections and query sets.

1 Introduction

Over the past three decades, probabilistic models of document retrieval have been studied extensively. In general, these approaches can be characterized as methods of estimating the probability of relevance of documents to user queries. One component of a probabilistic retrieval model is the indexing model, i.e., a model of the assignment of indexing terms to documents. We argue that the current indexing models have not led to improved retrieval results. We believe this is due to two unwarranted assumptions made by these models. We have taken a different approach based on non-parametric estimation that allows us to relax these assumptions. We have implemented our approach, and empirical results on two different collections and query sets are significantly better than the standard tf.idf method of retrieval. Now we take a brief look at some existing models of document indexing.

We begin our discussion of indexing models with the 2-Poisson model, due to Bookstein and Swanson [1] and also to Harter [7]. By analogy to manual indexing, the task was to assign a subset of words contained in a document (the 'specialty words') as indexing terms. The probability model was intended to indicate the useful indexing terms by means of the differences in their rate of occurrence in documents 'elite' for a given term, i.e., a document that would satisfy a user posing that single term as a query, vs. those without the property of eliteness.
The success of the 2-Poisson model has been somewhat limited, but it should be noted that Robertson's tf, which has been quite successful, was intended to behave similarly to the 2-Poisson model [12].

Other researchers have proposed a mixture model of more than two Poisson distributions in order to better fit the observed data. Margulis proposed the n-Poisson model and tested the idea empirically [10]. The conclusion of this study was that a mixture of n Poisson distributions provides a very close fit to the data. In a certain sense, this is not surprising. For large values of n one can fit a very complex distribution arbitrarily closely by a mixture of n parametric models if one has enough data to estimate the parameters [18]. However, what is somewhat surprising is the closeness of fit for relatively small values of n reported by Margulis [10].

Nevertheless, the n-Poisson model has not brought about increased retrieval effectiveness in spite of the close fit to the data. In any event, the semantics of the underlying distributions are less obvious in the n-Poisson case as compared to the 2-Poisson case, where they model the concept of eliteness.

Apart from the adequacy of the available indexing models, estimating the parameters of these models is a difficult problem. Researchers have looked at this problem from a variety of perspectives, and we will discuss several of these approaches in section 2. In addition, as previously mentioned, many of the current indexing models make assumptions about the data that we feel are unwarranted:

- The parametric assumption.
- Documents are members of pre-defined classes.

In our approach we relax these two assumptions. Rather than making parametric assumptions (in the 2-Poisson model, for example, it is assumed that terms follow a mixture of two Poisson distributions), we take the view, as Silverman said, that "the data will be allowed to speak for themselves" [16]. We feel that it is unnecessary to construct a parametric model of the data when we have the actual data. Instead, we rely on non-parametric methods.

Regarding the second assumption, the 2-Poisson model was originally based on the idea of 'eliteness' [7]. It was assumed that a document elite for a given term would satisfy a user if the user posed that single term as a query. Since that time, the prevailing view has come to be that multiple term queries are more realistic. In general, this requires a combinatorial explosion of elite sets for all possible subsets of terms in the collection. We take the view that each query needs to be looked at individually and that documents will not necessarily fall cleanly into elite and non-elite sets.

In order to relax these assumptions and to avoid the difficulties imposed by separate indexing and retrieval models, we have developed an approach to retrieval based on probabilistic language modeling. Our approach provides a conceptually simple but explanatory model of retrieval.

At this time, we should make clear what we mean by the word 'model.' In our view, the word 'model' is used in information retrieval in two senses. The first sense denotes an abstraction of the retrieval task itself. The best example of this is the vector space model, which allows one to talk about the task of retrieval apart from implementation details such as storage media and data structures [15]. A second sense of the word 'model' is the probabilistic sense, where it refers to an explanatory model of the data. This was the intention behind the 2-Poisson model.

We add a third sense of the word when we refer to language modeling. The phrase 'language model' is used by the speech recognition community to refer to a probability distribution that captures the statistical regularities of the generation of language [21]. In the context of the retrieval task, we can treat the generation of queries as a random process. Generally speaking, language models for speech attempt to predict the probability of the next word in an ordered sequence. For the purposes of document retrieval, one can instead model occurrences at the document level without regard to sequential effects; that is the approach taken here. It is also possible to model local predictive effects for features such as phrases, but that will be left for future work. In regarding query generation as a random process, we do not claim that queries really are generated randomly; rather, retrieval systems are not endowed with knowledge of the generation process. Instead, we will treat language generation as a random process modeled by a probability distribution and focus on the estimation of probabilities as a means of achieving effective retrieval.

Our approach to retrieval is to infer a language model for each document and to estimate the probability of generating the query according to each of these models. We then rank the documents according to these probabilities. This means that our data model and our discriminant function for retrieval are one and the same. The intuition behind our approach is that, in our view, users have a reasonable idea of terms that are likely to occur in documents of interest and will choose query terms that distinguish these documents from others in the collection, an intuition discussed in more detail in section 5.
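As a concrete illustration of this ranking principle, consider the following minimal sketch in Python. The names are of our own choosing, not from the original system, and the sketch assumes a strictly positive per-document estimator for p(t | M_d) is already available; constructing that estimator is the subject of section 3.

    from math import log

    def rank_by_query_likelihood(query_terms, doc_models):
        # doc_models maps a document id to a function p(t) estimating
        # p(t | M_d); assumed strictly positive so the log is defined.
        scores = {}
        for doc_id, p in doc_models.items():
            # Independence assumption: the query probability is the
            # product of term probabilities; logs are summed to avoid
            # floating-point underflow.
            scores[doc_id] = sum(log(p(t)) for t in query_terms)
        # Highest query-generation probability first.
        return sorted(scores, key=scores.get, reverse=True)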
By focusing on the query generation probability as opposed to the probability of relevance, our model does not require us to make a set of inferences for indexing and a separate set of inferences for retrieval.

Most retrieval systems use term frequency, document frequency and document length statistics. Typically these are used to compute a tf.idf score with document length normalization. An example of this is the INQUERY ranking formula shown in section 4.3. In our approach, collection statistics such as term frequency, document length and document frequency are integral parts of the language model and are not used heuristically as in many other approaches. For this reason, we do not use the standard tf and idf scores. In addition, length normalization is implicit in the calculation of the probabilities and does not have to be done in an ad hoc manner.

The remainder of the paper is organized as follows. In section 2 we review some existing retrieval models. Section 3 describes a language modeling approach that closely parallels the standard approach to IR. Section 4 shows the effectiveness of this model empirically. Finally we offer concluding remarks and describe future directions of this work.

2 Previous Work

As mentioned previously, the standard probabilistic indexing model is the 2-Poisson model. One of the assumptions of the model was that a subset of terms occurring in a document would be useful for indexing. According to Harter [7], such words can be identified by their distribution and thereby assigned as indexing terms. Documents were assumed to be of approximately equal length, a reasonable assumption for the data used in the initial studies [7]. This model is somewhat similar to ours if one views the probability of term assignment as analogous to the term generation probability. The two main differences are that we do not make distributional assumptions and we do not distinguish a subset of specialty words or assume a preexisting classification of documents into elite and non-elite sets.

Two well known probabilistic approaches to retrieval are the Robertson and Sparck Jones model [14] and the Croft and Harper model [3]. Both of these models estimate the probability of relevance of each document to the query. Our approach differs in that we do not focus on relevance except to the extent that the process of query production is correlated with it.

An additional probabilistic model is that of Fuhr [4]. A notable feature of the Fuhr model is the integration of indexing and retrieval models. The main difference between this approach and ours is that in the Fuhr model the collection statistics are used in a heuristic fashion in order to estimate the probabilities of assigning concepts to documents. In our approach, we are able to avoid using heuristic methods since we are not inferring concepts from terms.

Another recent probabilistic approach is the INQUERY inference network model by Turtle and Croft [19]. Similar to the Fuhr model, Turtle and Croft integrate indexing and retrieval by making inferences of concepts from features. Features include words, phrases and more complex structured features. Evidence from multiple feature sets and multiple queries can be combined by means of a Bayesian network in order to infer the probability that the information need of the user has been met. This distinction between information need and query is a notable feature of this model. As previously noted, in our approach, we have shifted our emphasis from probability of relevance to probability of query production.
We assume these are correlated but do not currently attempt to model that correlation explicitly. We will discuss this point further in section 5.

In section 3 we will discuss our probability estimation procedure. One statistic that we will be using is the average probability of term occurrence. A similar statistic was used by Kwok [9] for a different purpose. Kwok used the unnormalized average tf to estimate the importance of a term with respect to the query.

In our approach we use the average of tf normalized by document length in the estimation of the generation probability.

Wong and Yao [20] proposed a model in which they represented documents according to a probability distribution. They then developed two separate approaches to retrieval, one based on utility theory and the other based on information theory. Regarding the probability distribution, Wong and Yao use a maximum likelihood estimator for term probabilities. In our approach, we use a more robust estimator. Wong and Yao's utility and information theoretic retrieval models are somewhat analogous to other approaches to retrieval in that they have an indexing model apart from their retrieval model. Terms are associated with documents according to the maximum likelihood probability estimate, and the discriminant is a utility theoretic or information theoretic function of this estimate. In our approach we have been able to avoid this extra complexity and perform retrieval according to a single probabilistic model.

The most similar approach to the one we have taken is that of Kalt [8]. In this model, documents are assumed to be generated by a stochastic process, a multinomial model. The task Kalt investigated was text classification. Each document was treated as a sample from a language model representing the class of that document. In this model, tf and document length are both integral parts of the model rather than being heuristics as they are in many other models. The discriminant function is taken to be the maximum likelihood estimator of the query given the document's language model. Note that the 'query' in this case was inferred from the training set in the context of the classification task.

While our approach is conceptually somewhat different, it is clearly related to the Kalt approach and shares the desirable property that the collection statistics are integral parts of the model. In our initial study we have deliberately not made any distributional assumptions. Another major difference from the Kalt approach is that we do not rely on the maximum likelihood estimator but instead use a more robust estimator. Next, we do not assume that documents were necessarily drawn from k language models representing the k classes of interest. Instead, we make a weaker assumption that we can get estimates of each document's language model individually without making inferences about the class membership of documents. We then use these models to compute the query generation probability. We now describe the development of our approach.

3 Model Description

As mentioned, we infer a language model for each document and rank according to our estimate of producing the query according to that model. We would like to estimate p(Q | M_d), the probability of the query given the language model of document d.

The maximum likelihood estimate of the probability of term t under the term distribution for document d is:

    p_{ml}(t | M_d) = tf(t,d) / dl_d

where tf(t,d) is the raw term frequency of term t in document d and dl_d is the total number of tokens in document d. We assume that, given a particular language model, the query terms occur independently. This gives rise to the ranking formula \prod_{t \in Q} p_{ml}(t | M_d) for each document. There are several problems with this estimator. The most obvious practical problem is that we do not wish to assign a probability of zero to a document that is missing one or more of the query terms.
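Before addressing these problems, a minimal sketch of the maximum likelihood estimator makes the first one visible. This is an illustration in Python with names of our own choosing:

    from collections import Counter

    def p_ml(term, doc_tokens):
        # Raw term frequency tf(t,d) over document length dl_d.
        return Counter(doc_tokens)[term] / len(doc_tokens)

    def naive_query_score(query_terms, doc_tokens):
        # Product of p_ml over the query terms, assuming terms are
        # generated independently given the document model.
        score = 1.0
        for t in query_terms:
            score *= p_ml(t, doc_tokens)
        return score

    doc = "language models for document retrieval".split()
    print(naive_query_score(["language", "retrieval"], doc))  # 0.04
    print(naive_query_score(["language", "speech"], doc))     # 0.0

Any document missing even one query term scores exactly zero, regardless of how well it matches the rest of the query.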
In addition to this practical consideration, from a probabilistic perspective it is a somewhat radical assumption to infer that p(t | M_d) = 0; i.e., the fact that we have not seen a term does not make it impossible. Instead, we make the assumption that a non-occurring term is possible, but no more likely than what would be expected by chance in the collection, i.e.:

    p(t | M_d) = cf_t / cs    if tf(t,d) = 0

where cf_t is the raw count of term t in the collection and cs is the raw collection size, i.e., the total number of tokens in the collection. This provides us with a more reasonable distribution and circumvents the practical problem. It should be noted that in homogeneous databases one may need to use a more careful estimate of the collection probability since, in some cases, the absence of a very frequently occurring word (i.e., a word with the characteristics of a stopword) could conceivably contribute more to the score of a document in which it does not occur. This is not a problem in the collections we have studied, as they are heterogeneous in nature and stopwords have been removed, but we plan to address this issue in the future to be sure that our approach will be immune to these pathological cases.

The other problem with this estimator may be less obvious. If we could get an arbitrarily sized sample of data from M_d we could be reasonably confident in the maximum likelihood estimator. However, we only have a document-sized sample from that distribution. To circumvent this problem we are going to need an estimate from a larger amount of data. That estimate is:

    p_{avg}(t) = ( \sum_{d: t \in d} p_{ml}(t | M_d) ) / df_t

where df_t is the document frequency of t; in other words, this is the mean probability of t in documents containing it. This is a robust statistic in the sense that we have a lot more data from which to estimate it, but it too has a problem. We cannot and are not assuming that every document containing t is drawn from the same language model, and so there is some risk in using the mean to estimate p(t | M_d). Furthermore, if we used the mean by itself, there would be no distinction between documents with different term frequencies.

In order to benefit from the robustness of this estimator and to minimize the risk, we model the risk for a term t in a document d using the geometric distribution [22] as follows:

    \hat{R}_{t,d} = (1 / (1 + \bar{f}_t)) \times (\bar{f}_t / (1 + \bar{f}_t))^{tf(t,d)}

where \bar{f}_t is the mean term frequency of term t in documents where t occurs, normalized by document length, i.e., \bar{f}_t = p_{avg}(t) \times dl_d. Another way to say this is that \bar{f}_t is the term count one would get if the term occurred at the average rate. The intuition behind this formula is that as the tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate. For a somewhat related use of the geometric distribution see [5].

Now we will use this risk function as a mixing parameter in our calculation of \hat{p}(Q | M_d), our estimate of the probability of producing the query for a given document model. Let:

    \hat{p}(t | M_d) = p_{ml}(t | M_d)^{(1 - \hat{R}_{t,d})} \times p_{avg}(t)^{\hat{R}_{t,d}}    if tf(t,d) > 0
    \hat{p}(t | M_d) = cf_t / cs                                                                   otherwise

Then:

    \hat{p}(Q | M_d) = \prod_{t \in Q} \hat{p}(t | M_d) \times \prod_{t \notin Q} (1.0 - \hat{p}(t | M_d))

In this formula, the first term is the probability of producing the terms in the query and the second term is the probability of not producing other terms. Notice the risk function \hat{R}_{t,d} and the background probability cf_t / cs mentioned earlier. We compute this function for each candidate document and rank accordingly.
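The estimator can be assembled end to end as in the following sketch of the formulas above. The helper names are ours, and the sketch assumes tokenized, stopped documents and query terms that occur somewhere in the collection (so cf_t > 0):

    from collections import Counter
    from math import log

    def build_stats(docs):
        # docs: list of token lists. Computes per-document term counts
        # and lengths, collection frequency cf_t, collection size cs,
        # and p_avg(t), the mean of p_ml(t|M_d) over documents containing t.
        tfs = [Counter(d) for d in docs]
        dls = [len(d) for d in docs]
        cf, df, p_sum = Counter(), Counter(), Counter()
        for tf, dl in zip(tfs, dls):
            for t, n in tf.items():
                cf[t] += n
                df[t] += 1
                p_sum[t] += n / dl
        p_avg = {t: p_sum[t] / df[t] for t in df}
        return tfs, dls, cf, sum(dls), p_avg

    def p_hat(t, tf, dl, cf, cs, p_avg):
        # Estimated probability of generating term t under M_d.
        if tf.get(t, 0) == 0:
            # Unseen terms are possible, but no more likely than
            # chance in the collection as a whole.
            return cf[t] / cs
        pml = tf[t] / dl
        f_bar = p_avg[t] * dl  # expected count at the average rate
        # Risk term from the geometric distribution: as tf(t,d) moves
        # away from the mean, less weight falls on the averaged estimate.
        risk = (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf[t]
        return pml ** (1.0 - risk) * p_avg[t] ** risk

    def log_p_query(query, tf, dl, cf, cs, p_avg, vocab):
        # log p(Q|M_d): generate the query terms and fail to
        # generate every other term in the vocabulary.
        q = set(query)
        s = sum(log(p_hat(t, tf, dl, cf, cs, p_avg)) for t in q)
        s += sum(log(1.0 - p_hat(t, tf, dl, cf, cs, p_avg))
                 for t in vocab if t not in q)
        return s

Ranking a collection then amounts to computing log_p_query for each document and sorting in descending order; because the estimate for seen terms mixes p_ml and p_avg through the risk function, documents with the same query terms but different term frequencies receive different scores, as required.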
Now we describe our empirical results.

4 Empirical Data

4.1 Implementation

We have implemented a research prototype retrieval engine known as Labrador to test our approach. This engine was originally implemented as a high throughput retrieval system in the context of our previous work on topic segmentation [13]. For these experiments, the system does tokenization, stopping and stemming in the usual way. We have implemented both standard tf.idf weighting as well as our language modeling approach.

4.2 Results

We performed recall/precision experiments on two data sets. The first set was TREC topics 202-250 on TREC disks 2 and 3, the TREC 4 ad hoc task, and the second was TREC topics 51-100 on TREC disk 3 using the concept fields. We chose these query sets because they are quite different from each other. The 51-100 concept fields are essentially lists of good terms, while topics 202-250 are 'natural language' queries consisting of one sentence each.

4.3 Recall/Precision Experiments

Table 1 shows the results for TREC topics 202-250 on TREC disks 2 and 3. In the table we see the eleven point recall/precision results as well as the non-interpolated average precision and precision figures for the top N documents for several values of N. The first two columns compare the baseline result to our new approach. The baseline result was obtained using the INQUERY ranking formula, which uses Robertson's tf score and a standard idf score, defined as follows:

    tf_{t,d} = tf(t,d) / (tf(t,d) + 0.5 + 1.5 \times (dl_d / avg\_dl))
    idf_t = \log((N + 0.5) / df_t) / \log(N + 1)

where N is the number of documents in the collection and avg_dl is the average document length.

[Table 1: Comparison of tf.idf to the language modeling approach on TREC queries 202-250 on TREC disks 2 and 3.]

The third column reports the percent change. Column four is of the form I/D, where I is the count of queries for which performance improved using the new method and D is the count of queries for which performance was different. Column five reports significance values according to the sign test and column six does likewise according to the Wilcoxon test. The entries in these two columns marked with a star indicate a statistically significant difference at the 0.05 level. Note that these are one sided tests.

Notice that on the eleven point recall/precision section, the language modeling approach achieves better precision at all levels of recall, significantly at several levels. Also notice that there is a significant improvement in recall, uninterpolated average precision and R-precision, the precision after R documents where R is equal to the number of relevant documents for each query. On the second part of the table there is again an improvement at all levels of recall, most of them statistically significant.

The results for TREC queries 51-100 on TREC disk 3 are shown in table 2. Once again we see improvement in precision at all levels of recall on the eleven point chart, and we also see improvement in precision for each level of recall by document count. Several levels show a significant improvement.

[Table 2: Comparison of tf.idf to the language modeling approach on TREC queries 51-100 on TREC disk 3.]
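For comparison, a sketch of the baseline belief computation used in these experiments follows. The tf and idf components implement the formula above; summing per-term beliefs over the query is our simplification, not the exact INQUERY implementation, which combines evidence through its inference network:

    from math import log

    def inquery_belief(tf_td, dl_d, avg_dl, df_t, n_docs):
        # Robertson-style tf component with document length normalization.
        tf_part = tf_td / (tf_td + 0.5 + 1.5 * (dl_d / avg_dl))
        # Normalized idf component.
        idf_part = log((n_docs + 0.5) / df_t) / log(n_docs + 1)
        return tf_part * idf_part

    def tfidf_score(query_terms, doc_tf, dl_d, avg_dl, df, n_docs):
        # Simplification: sum per-term beliefs; terms absent from the
        # collection contribute nothing.
        return sum(inquery_belief(doc_tf.get(t, 0), dl_d, avg_dl,
                                  df[t], n_docs)
                   for t in query_terms if t in df)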

4.4 Improving the Basic Model

We now try to improve our probability estimates since, according to our model, this should yield better retrieval performance. A simple improvement of the estimate developed in section 3 is to smooth the estimates of the average probability for terms with low document frequency. The estimate of the average probability of these terms is based on a small amount of data and so could be sensitive to outliers.

In order to correct for this, we binned the low frequency data by document frequency and used the binned estimate for the average, as sketched below. We used a cutoff of df < 100 for low frequency terms, though it turned out that the cutoff is not critical.
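A sketch of this smoothing step, reusing the p_avg and df statistics from the earlier sketch (the binning-by-df scheme is as described; the names are of our own choosing):

    from collections import defaultdict

    def bin_smooth_p_avg(p_avg, df, cutoff=100):
        # Group low-document-frequency terms by their exact df value
        # and average p_avg within each bin, damping outliers that
        # dominate averages estimated from few documents.
        bins = defaultdict(list)
        for t, d in df.items():
            if d < cutoff:
                bins[d].append(p_avg[t])
        bin_mean = {d: sum(v) / len(v) for d, v in bins.items()}
        return {t: (bin_mean[df[t]] if df[t] < cutoff else p_avg[t])
                for t in p_avg}

This new estimate of the average is incorporated into our ranking formula as before and rerun on TREC queries 202-250 against TREC disks 2 and 3; the results are shown in table 3.

[Table 3: Comparison of the original language modeling approach to the new language modeling approach on TREC queries 202-250 on TREC disks 2 and 3.]

The results show a statistically significant improvement in precision at several levels of recall. The average precision is also improved. Running the new model on our second query set, TREC queries 51-100 against TREC disk 3, we get the result shown in table 4.

[Table 4: Comparison of the original language modeling approach to the new language modeling approach on TREC queries 51-100 on TREC disk 3.]

Again we see significant improvements, albeit modest ones, at several levels of recall and on average. Our conjecture is that the smaller improvement on this query set is due to the longer average query length as compared to the other query set. It appears that for low frequency terms, the effects on the average due to outliers are just as likely to overestimate as to underestimate, and so these effects cancel each other out with more terms in the query. However, this is only a conjecture, the verification of which we leave for future work.

5 Conclusions and Future Work

We have presented a novel way of looking at the problem of text retrieval based on probabilistic language modeling that is both conceptually simple and explanatory. We feel that our model will provide effective retrieval and can be improved to the extent that the following conditions can be met:

1. Our language models are accurate representations of the data.
2. Users understand our approach to retrieval.
3. Users have some sense of term distribution.

We feel that condition one has been met reasonably well by the approach we have taken in this study. However, we also feel that our models can and should be improved. Our current language models do not incorporate any knowledge of the language generation process. It is possible that additional knowledge added to the models will yield better estimates.

Regarding point two, we feel that our model is simple enough to be explained to users at an intuitive level and that an understanding of it will facilitate the formation of better queries. It is not that users will need or want to know the details of the model; rather, if users have a general understanding of how the system works, they will be able to use it more effectively. Users are typically instructed to pose natural language descriptions of their information needs as queries. A user who understands our model will tend to think in terms of which words will help the system distinguish the documents of interest from everything else. We feel that if we can get users to think in this manner, they will be able to formulate queries that better express their information needs in a manner useful to the retrieval system.

Regarding point three, in order for users to identify useful words, we feel that they would benefit from a sense of how the words are distributed in the collection. Again, it is not the case that users need or want to know the term distribution in detail, but a sense of which terms are more likely to be useful would be beneficial. We can imagine a variety of both textual and graphical tools to help users get a better sense of the distribution of terms. This is especially important to win over expert users, who often prefer Boolean retrieval because they like the sense of control over the search [2].

Regarding the results of our study, performance on two different query sets was better than that obtained by tf.idf weighting. However, the improvement in performance is not the main point. More significant is that a different approach to retrieval has been shown to be effective. The ability to think about retrieval in a new way can lead to insights that would be less obvious in other approaches.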
