Language Models For Information Retrieval

Language Models for Information Retrieval
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

References:
1. W. B. Croft and J. Lafferty (Eds.). Language Modeling for Information Retrieval. July 2003.
2. T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, January-February 2001.
3. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. (Chapter 12)
4. D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Springer, 2004. (Chapter 2)

Taxonomy of Classic IR Models

[Diagram: taxonomy of retrieval models]
– Classic models, refined as:
  - Set Theoretic: Fuzzy, Extended Boolean
  - Algebraic: Generalized Vector, Neural Networks
  - Probabilistic: Inference Network, Belief Network, Language Model (Probabilistic LSI, Topical Mixture Model)
– Structured Models: Non-Overlapping Lists, Proximal Nodes
– Browsing: Flat, Structure Guided, Hypertext

Statistical Language Models (1/2)

A probabilistic mechanism for "generating" a piece of text
– Defines a distribution over all possible word sequences
  W = w_1 w_2 \cdots w_N, \quad P(W) = ?

What is an LM used for?
– Speech recognition
– Spelling correction
– Handwriting recognition
– Optical character recognition
– Machine translation
– Document classification and routing
– Information retrieval

Statistical Language Models (2/2)

(Statistical) language models (LMs) have been widely used for speech recognition and (machine) translation for more than twenty years.

However, their use for information retrieval started only in 1998 [Ponte and Croft, SIGIR 1998].

Query Likelihood Language Models

Documents are ranked based on the Bayes (decision) rule
  P(D \mid Q) = \frac{P(Q \mid D)\, P(D)}{P(Q)}
– P(Q) is the same for all documents, and can be ignored
– P(D) might have to do with authority, length, genre, etc.
  - There is no general way to estimate it
  - It can be treated as uniform across all documents

Documents can therefore be ranked based on P(Q \mid D) (also denoted P(Q \mid M_D))
– The user has a prototype (ideal) document in mind, and generates a query based on words that appear in this document
– A document D is treated as a model M_D used to predict (generate) the query

Schematic Depiction

[Figure: each document D_1, D_2, D_3, ... in the collection has an associated document model M_{D_1}, M_{D_2}, M_{D_3}, ...; a query Q is scored against every model by its likelihood P(Q \mid M_D).]

n-grams

Multiplication (chain) rule
  P(w_1 w_2 \cdots w_N) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_1 w_2 \cdots w_{N-1})
– Decomposes the probability of a sequence of events into the probability of each successive event conditioned on the earlier events

n-gram assumption
– Unigram
  P(w_1 w_2 \cdots w_N) = P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_N)
  - Each word occurs independently of the other words
  - The so-called "bag-of-words" model
– Bigram
  P(w_1 w_2 \cdots w_N) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})
– Most language-modeling work in IR has used unigram language models
  - IR does not directly depend on the structure of sentences
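As a concrete illustration, here is a minimal Python sketch (not part of the original slides) that scores a short word sequence under the unigram and bigram assumptions; the probability tables are hypothetical values chosen only for the example.

```python
import math

# Hypothetical unigram and bigram probability tables (illustrative values only).
unigram = {"information": 0.02, "retrieval": 0.01, "model": 0.03}
bigram = {("information", "retrieval"): 0.30, ("retrieval", "model"): 0.05}

def unigram_logprob(words):
    # Unigram assumption: every word is generated independently of the others.
    return sum(math.log(unigram[w]) for w in words)

def bigram_logprob(words):
    # Bigram assumption: each word depends only on the immediately preceding word.
    logp = math.log(unigram[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram[(prev, cur)])
    return logp

print(unigram_logprob(["information", "retrieval", "model"]))
print(bigram_logprob(["information", "retrieval", "model"]))
```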

Unigram Model (1/4)

The likelihood of a query Q = w_1 w_2 \cdots w_N given a document D:
  P(Q \mid M_D) = P(w_1 \mid M_D)\, P(w_2 \mid M_D) \cdots P(w_N \mid M_D) = \prod_{i=1}^{N} P(w_i \mid M_D)
– Words are conditionally independent of each other given the document
– How do we estimate the probability of a (query) word given the document, P(w \mid M_D)?

Assume that, given the document, the word counts follow a multinomial distribution (permutations of the words are accounted for here):
  P(C(w_1), \ldots, C(w_V) \mid M_D) = \frac{\big(\sum_{j=1}^{V} C(w_j)\big)!}{\prod_{i=1}^{V} C(w_i)!}\, \prod_{i=1}^{V} \lambda_{w_i}^{C(w_i)}
where C(w_i) is the number of times word w_i occurs, \lambda_{w_i} = P(w_i \mid M_D), and \sum_{i=1}^{V} \lambda_{w_i} = 1.

Unigram Model (2/4)

Use each document itself as a sample for estimating its corresponding unigram (multinomial) model
– If Maximum Likelihood Estimation (MLE) is adopted:
  \hat{P}(w_i \mid M_D) = \frac{C(w_i, D)}{|D|}
where C(w_i, D) is the number of times w_i occurs in D, and |D| is the length of D, i.e. \sum_i C(w_i, D) = |D|.

Example: Doc D = "w_a w_c w_b w_a w_a w_c w_b w_b w_a w_d"
  P(w_a \mid M_D) = 0.4, P(w_b \mid M_D) = 0.3, P(w_c \mid M_D) = 0.2, P(w_d \mid M_D) = 0.1

The zero-probability problem
– If w_e and w_f do not occur in D, then P(w_e \mid M_D) = P(w_f \mid M_D) = 0
– This causes a problem in predicting the query likelihood (see the query-likelihood equation on the preceding slide)
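The following is a minimal Python sketch (not from the slides) of the MLE document model built from the toy document above, showing how an unseen query word gets zero probability.

```python
from collections import Counter

doc = "wa wc wb wa wa wc wb wb wa wd".split()
counts = Counter(doc)
doc_len = len(doc)

def p_mle(word):
    # MLE estimate: relative frequency of the word within the document.
    return counts[word] / doc_len

print(p_mle("wa"))  # 0.4
print(p_mle("wb"))  # 0.3
print(p_mle("we"))  # 0.0 -> any query containing "we" gets query likelihood 0
```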

Unigram Model (3/4)

Smooth the document-specific unigram model with a collection model (a mixture of two multinomials):
  P(Q \mid M_D) = \prod_{i=1}^{N} \big[\, \lambda\, P(w_i \mid M_D) + (1 - \lambda)\, P(w_i \mid M_C) \,\big]

The role of the collection unigram model P(w_i \mid M_C)
– Helps to solve the zero-probability problem
– Helps to differentiate the contributions of the different terms missing from a document (global information, like IDF?)

The collection unigram model can be estimated in a similar way as the document-specific unigram model
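A small sketch of this smoothed query likelihood used for ranking, assuming a hypothetical two-document collection and an interpolation weight of \lambda = 0.7 (both are illustrative choices, not values from the slides).

```python
import math
from collections import Counter

# Hypothetical toy collection.
collection = {
    "d1": "wa wc wb wa wa wc wb wb wa wd".split(),
    "d2": "wb wb we wf wa".split(),
}
coll_counts = Counter(w for doc in collection.values() for w in doc)
coll_len = sum(coll_counts.values())
LAMBDA = 0.7  # assumed interpolation weight

def log_query_likelihood(query, doc):
    counts, dlen = Counter(doc), len(doc)
    logp = 0.0
    for w in query:
        p_doc = counts[w] / dlen            # document unigram model (MLE)
        p_coll = coll_counts[w] / coll_len  # collection unigram model (MLE)
        logp += math.log(LAMBDA * p_doc + (1 - LAMBDA) * p_coll)
    return logp

query = "wa we".split()
ranking = sorted(collection,
                 key=lambda d: log_query_likelihood(query, collection[d]),
                 reverse=True)
print(ranking)  # documents ordered by smoothed query likelihood
```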

Unigram Model (4/4)

An evaluation on the Topic Detection and Tracking (TDT) corpora (mAP)

– Language Model
                  Unigram   Unigram + Bigram
  TDT2   TQ/TD    0.6327    0.5427
         TQ/SD    0.5658    0.4803
  TDT3   TQ/TD    0.6569    0.6141
         TQ/SD    0.6308    0.5808

– Vector Space Model
                  TDT2      TDT3
         TQ/TD              0.6531
         TQ/SD    0.6216    0.6233

  P_{Unigram}(Q \mid M_D) = \prod_{i=1}^{N} \big[\, \lambda\, P(w_i \mid M_D) + (1 - \lambda)\, P(w_i \mid M_C) \,\big]
  P_{Unigram+Bigram}(Q \mid M_D) = \prod_{i=1}^{N} \big[\, \lambda_1 P(w_i \mid M_D) + \lambda_2 P(w_i \mid M_C) + \lambda_3 P(w_i \mid w_{i-1}, M_D) + (1 - \lambda_1 - \lambda_2 - \lambda_3)\, P(w_i \mid w_{i-1}, M_C) \,\big]
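A sketch of the Unigram + Bigram mixture shown above, with assumed interpolation weights \lambda_1, \lambda_2, \lambda_3 and hypothetical MLE-trained document and collection models; the handling of the first query word (which has no preceding word) is also an assumption made for the example.

```python
import math
from collections import Counter

doc = "wa wc wb wa wa wc wb wb wa wd".split()
coll = doc + "wb wb we wf wa".split()   # hypothetical collection text

def mle_unigram(tokens):
    c, n = Counter(tokens), len(tokens)
    return lambda w: c[w] / n

def mle_bigram(tokens):
    c, prev_counts = Counter(zip(tokens, tokens[1:])), Counter(tokens[:-1])
    return lambda w, prev: c[(prev, w)] / prev_counts[prev] if prev_counts[prev] else 0.0

p_d, p_c = mle_unigram(doc), mle_unigram(coll)      # unigram models M_D, M_C
p_d2, p_c2 = mle_bigram(doc), mle_bigram(coll)      # bigram models M_D, M_C
l1, l2, l3 = 0.4, 0.3, 0.2                          # assumed weights

def log_likelihood(query):
    logp, prev = 0.0, None
    for w in query:
        p = l1 * p_d(w) + l2 * p_c(w)
        if prev is None:
            p += (1 - l1 - l2) * p_c(w)             # assumed back-off for the first word
        else:
            p += l3 * p_d2(w, prev) + (1 - l1 - l2 - l3) * p_c2(w, prev)
        logp += math.log(p)
        prev = w
    return logp

print(log_likelihood("wa wb".split()))
```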

Maximum Mutual Information

Documents can be ranked based on their mutual information with the query:
  MI(Q, D) = \log \frac{P(Q, D)}{P(Q)\, P(D)} = \log \frac{P(Q \mid D)\, P(D)}{P(Q)\, P(D)} = \log P(Q \mid D) - \log P(Q)
– \log P(Q) is the same for all documents, and hence can be ignored

Document ranking by mutual information (MI) is therefore equivalent to ranking by query likelihood:
  \arg\max_D MI(Q, D) = \arg\max_D P(Q \mid D)

Probabilistic Latent Semantic Analysis (PLSA)  [Thomas Hofmann, 1999]

Also called the Aspect Model, or Probabilistic Latent Semantic Indexing (PLSI)
– Graphical model representation (a kind of Bayesian network)

Language model:
  sim(Q, D) = P(D \mid Q) = \frac{P(Q \mid D)\, P(D)}{P(Q)} \propto P(Q \mid D)\, P(D)
  P(Q \mid D) = \prod_{w \in Q} \big[\, \lambda\, P(w \mid M_D) + (1 - \lambda)\, P(w \mid M_C) \,\big]^{C(w, Q)}

PLSA:
  sim(Q, D) = P(Q \mid D) = \prod_{w \in Q} P(w \mid D)^{C(w, Q)}
            = \prod_{w \in Q} \Big[ \sum_{k=1}^{K} P(w, T_k \mid D) \Big]^{C(w, Q)}
            = \prod_{w \in Q} \Big[ \sum_{k=1}^{K} P(w \mid T_k)\, P(T_k \mid D) \Big]^{C(w, Q)}

The latent variables: the unobservable class variables T_k (topics or domains)

Reference:
1. T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, January-February 2001.

PLSA: Formulation

Definition
– P(D): the probability of selecting a document D
– P(T_k \mid D): the probability of picking a latent class T_k for the document D
– P(w \mid T_k): the probability of generating a word w from the class T_k

PLSA: Assumptions

Bag-of-words: treat documents as a memoryless source; words are generated independently
  sim(Q, D) = P(Q \mid D) = \prod_{w \in Q} P(w \mid D)^{C(w, Q)}

Conditional independence: the document D and the word w are independent conditioned on the state of the associated latent variable T_k
  P(w, D \mid T_k) = P(w \mid T_k)\, P(D \mid T_k)

  P(w \mid D) = \sum_{k=1}^{K} P(w, T_k \mid D)
              = \sum_{k=1}^{K} \frac{P(w, D, T_k)}{P(D)}
              = \sum_{k=1}^{K} \frac{P(w, D \mid T_k)\, P(T_k)}{P(D)}
              = \sum_{k=1}^{K} \frac{P(w \mid T_k)\, P(D \mid T_k)\, P(T_k)}{P(D)}
              = \sum_{k=1}^{K} \frac{P(w \mid T_k)\, P(T_k, D)}{P(D)}
              = \sum_{k=1}^{K} P(w \mid T_k)\, P(T_k \mid D)

PLSA: Training (1/2)

Probabilities are estimated by maximizing the collection likelihood using the Expectation-Maximization (EM) algorithm:
  L_C = \sum_{D} \sum_{w} C(w, D) \log P(w \mid D)
      = \sum_{D} \sum_{w} C(w, D) \log \Big[ \sum_{T_k} P(w \mid T_k)\, P(T_k \mid D) \Big]

EM tutorial:
- Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," U.C. Berkeley TR-97-021.

PLSA: Training (2/2)

E (expectation) step
  P(T_k \mid w, D) = \frac{P(w \mid T_k)\, P(T_k \mid D)}{\sum_{T_{k'}} P(w \mid T_{k'})\, P(T_{k'} \mid D)}

M (maximization) step
  \hat{P}(w \mid T_k) = \frac{\sum_{D} C(w, D)\, P(T_k \mid w, D)}{\sum_{w'} \sum_{D} C(w', D)\, P(T_k \mid w', D)}
  \hat{P}(T_k \mid D_i) = \frac{\sum_{w} C(w, D_i)\, P(T_k \mid w, D_i)}{\sum_{w'} C(w', D_i)}
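A minimal numpy sketch (not from the slides) of these EM updates on a tiny, hypothetical term-document count matrix; the matrix values, number of topics, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3, 0, 1],   # C[w, d]: count of word w in document d (toy data)
              [2, 1, 0],
              [0, 4, 2],
              [0, 1, 3]], dtype=float)
V, D, K = C.shape[0], C.shape[1], 2

# Random initialization of the two multinomial families.
p_w_given_t = rng.random((V, K)); p_w_given_t /= p_w_given_t.sum(axis=0)  # P(w | T_k)
p_t_given_d = rng.random((K, D)); p_t_given_d /= p_t_given_d.sum(axis=0)  # P(T_k | d)

for _ in range(50):
    # E-step: P(T_k | w, d) is proportional to P(w | T_k) P(T_k | d).
    post = p_w_given_t[:, :, None] * p_t_given_d[None, :, :]   # shape (V, K, D)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate both multinomials from the expected counts C(w, d) P(T_k | w, d).
    expected = C[:, None, :] * post
    p_w_given_t = expected.sum(axis=2)
    p_w_given_t /= p_w_given_t.sum(axis=0, keepdims=True)
    p_t_given_d = expected.sum(axis=0) / C.sum(axis=0, keepdims=True)

print(np.round(p_w_given_t, 3))
print(np.round(p_t_given_d, 3))
```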

PLSA: Latent Probability Space (1/2)

[Figure: example latent classes learned with dimensionality K = 128; the most probable words of sample classes relate to, e.g., medical imaging, image sequence analysis, phonetics, and contour/boundary detection and segmentation.]

  P(w_j, D_i) = \sum_{T_k} P(w_j, T_k, D_i) = \sum_{T_k} P(w_j \mid T_k, D_i)\, P(T_k, D_i) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)

In matrix form, P(W, D) \approx \hat{U} \hat{\Sigma} \hat{V}^T, where
  \hat{U} = \big(P(w_j \mid T_k)\big)_{j,k}, \quad \hat{\Sigma} = \mathrm{diag}\big(P(T_k)\big)_k, \quad \hat{V} = \big(P(D_i \mid T_k)\big)_{i,k}

PLSA: Latent Probability Space (2/2)

[Figure: the m x n joint-probability matrix P = (P(w_j, D_i)), with rows w_1 ... w_m and columns D_1 ... D_n, is factored as P ≈ U Σ V^T, where U (m x k) holds P(w_j | T_k), Σ (k x k) is diagonal with P(T_k), and V (n x k) holds P(D_i | T_k).]

  P(w_j, D_i) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)
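A small numpy sketch (illustrative only) of assembling the joint word-document matrix from the three PLSA factors, mirroring the U Σ V^T form above; the parameter values are hypothetical.

```python
import numpy as np

# Hypothetical PLSA parameters for m = 4 words, n = 3 documents, k = 2 topics.
P_w_given_t = np.array([[0.5, 0.1],
                        [0.3, 0.1],
                        [0.1, 0.4],
                        [0.1, 0.4]])   # U: P(w_j | T_k), each column sums to 1
P_t = np.array([0.6, 0.4])             # Sigma diagonal: P(T_k)
P_d_given_t = np.array([[0.5, 0.2],
                        [0.3, 0.3],
                        [0.2, 0.5]])   # V: P(D_i | T_k), each column sums to 1

# P(w_j, D_i) = sum_k P(w_j | T_k) P(T_k) P(D_i | T_k)  ==  U diag(P(T)) V^T
joint = P_w_given_t @ np.diag(P_t) @ P_d_given_t.T
print(joint.shape)  # (4, 3)
print(joint.sum())  # 1.0: a proper joint distribution over (word, document) pairs
```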

PLSA: One More Example on the TDT1 Dataset

[Figure: example latent topics learned on the TDT1 dataset, e.g. aviation, space missions, family love, Hollywood love.]

PLSA: Experiment Results (1/4)

Experimental results
– Two ways to smooth the empirical distribution with PLSA
  - PLSA-U*: combine the cosine score with that of the vector space model (as LSA does) (see the next slide)
  - PLSA-Q*: combine the multinomials individually
      P_{PLSA}(w \mid D) = \sum_{k=1}^{K} P(w \mid T_k)\, P(T_k \mid D)
      P_{PLSA\text{-}Q^*}(w \mid D) = \lambda\, P_{Empirical}(w \mid D) + (1 - \lambda)\, P_{PLSA}(w \mid D)
      P_{Empirical}(w \mid D) = \frac{C(w, D)}{C(D)}
      P_{PLSA\text{-}Q^*}(Q \mid D) = \prod_{w \in Q} \big[\, \lambda\, P_{Empirical}(w \mid D) + (1 - \lambda)\, P_{PLSA}(w \mid D) \,\big]^{C(w, Q)}
  - Both provide almost identical performance
– It is not known how PLSA would perform if used alone
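A sketch of the PLSA-Q* smoothing (mixing the empirical document distribution with the PLSA topic mixture), using hypothetical PLSA parameters and an assumed \lambda = 0.5.

```python
import math
import numpy as np
from collections import Counter

doc = "wa wc wb wa wa wc wb wb wa wd".split()
vocab = ["wa", "wb", "wc", "wd", "we"]
counts, dlen = Counter(doc), len(doc)

# Hypothetical PLSA parameters: K = 2 topics.
p_w_given_t = np.array([[0.4, 0.1],   # rows follow vocab order: P(w | T_k)
                        [0.3, 0.2],
                        [0.2, 0.2],
                        [0.1, 0.2],
                        [0.0, 0.3]])
p_t_given_d = np.array([0.8, 0.2])    # P(T_k | D) for this document
LAMBDA = 0.5                          # assumed mixture weight

def p_plsa_q_star(w):
    p_emp = counts[w] / dlen                                   # empirical P(w | D)
    p_plsa = float(p_w_given_t[vocab.index(w)] @ p_t_given_d)  # topic mixture P_PLSA(w | D)
    return LAMBDA * p_emp + (1 - LAMBDA) * p_plsa

query = "wa we".split()
print(sum(math.log(p_plsa_q_star(w)) for w in query))  # log P_PLSA-Q*(Q | D)
```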

PLSA: Experiment Results (2/4)

PLSA-U*
– Use the low-dimensional representations P(T_k \mid Q) and P(T_k \mid D) (viewed as vectors in a K-dimensional latent space) to evaluate relevance by means of the cosine measure
– Combine the cosine score with that of the vector space model
– Use an ad hoc approach to re-weight the different model components (dimensions):

  R_{PLSA\text{-}U^*}(Q, D) = \frac{\sum_k P(T_k \mid Q)\, P(T_k \mid D)}{\sqrt{\sum_k P(T_k \mid Q)^2}\, \sqrt{\sum_k P(T_k \mid D)^2}}

  where P(T_k \mid Q) = \frac{\sum_{w \in Q} C(w, Q)\, P(T_k \mid w, Q)}{\sum_{w' \in Q} C(w', Q)}   (obtained by online folding-in)

  R(Q, D) = \lambda\, R_{PLSA\text{-}U^*}(Q, D) + (1 - \lambda)\, R_{VSM}(Q, D)
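A sketch of the PLSA-U* score under assumed inputs: hypothetical K = 3 topic mixtures for the query (folded in) and the document, plus a hypothetical vector-space score and interpolation weight.

```python
import numpy as np

def plsa_u_star(p_t_given_q, p_t_given_d, r_vsm, lam=0.5):
    # Cosine between the K-dimensional latent representations of query and document,
    # interpolated with the vector space model score (lam is an assumed weight).
    cos = np.dot(p_t_given_q, p_t_given_d) / (
        np.linalg.norm(p_t_given_q) * np.linalg.norm(p_t_given_d))
    return lam * cos + (1 - lam) * r_vsm

p_t_q = np.array([0.7, 0.2, 0.1])   # hypothetical folded-in P(T_k | Q)
p_t_d = np.array([0.6, 0.3, 0.1])   # hypothetical P(T_k | D)
print(plsa_u_star(p_t_q, p_t_d, r_vsm=0.42))
```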

PLSA: Experiment Results (3/4)

Why
  R_{PLSA\text{-}U^*}(Q, D_i) = \frac{\sum_k P(T_k \mid Q)\, P(T_k \mid D_i)}{\sqrt{\sum_k P(T_k \mid Q)^2}\, \sqrt{\sum_k P(T_k \mid D_i)^2}} ?

– Recall that in LSA, the relations between any two documents can be formulated as
  A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T
  sim(D_i, D_s) = \cos(\hat{D}_i \Sigma, \hat{D}_s \Sigma) = \frac{\hat{D}_i \Sigma^2 \hat{D}_s^T}{\|\hat{D}_i \Sigma\|\, \|\hat{D}_s \Sigma\|}
  where \hat{D}_i and \hat{D}_s are row vectors

– PLSA mimics LSA in its similarity measure:
  R_{PLSA\text{-}U^*}(D_i, D_s) = \frac{\sum_k P(D_i \mid T_k)\, P(T_k)\, P(T_k)\, P(D_s \mid T_k)}{\sqrt{\sum_k [P(D_i \mid T_k)\, P(T_k)]^2}\, \sqrt{\sum_k [P(D_s \mid T_k)\, P(T_k)]^2}}
  Using P(D_i \mid T_k)\, P(T_k) = P(T_k \mid D_i)\, P(D_i):
  = \frac{\sum_k P(T_k \mid D_i)\, P(D_i)\, P(T_k \mid D_s)\, P(D_s)}{\sqrt{\sum_k [P(T_k \mid D_i)\, P(D_i)]^2}\, \sqrt{\sum_k [P(T_k \mid D_s)\, P(D_s)]^2}}
  = \frac{\sum_k P(T_k \mid D_i)\, P(T_k \mid D_s)}{\sqrt{\sum_k P(T_k \mid D_i)^2}\, \sqrt{\sum_k P(T_k \mid D_s)^2}}

PLSA: Experiment Results (4/4)

[Figure: table of experiment results]

PLSA vs. LSA

Decomposition/approximation
– LSA: least-squares criterion measured on the L2 (Frobenius) norm of the word-document matrix
– PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model

Computational complexity
– LSA: SVD decomposition
– PLSA: EM training, which is iterative and can be time-consuming
