Introduction To Information Retrieval And Web Search


Introduction to Information Retrieval and Web Search
Tao Yang
UCSB CS293S, Winter 2017

Table of Contents
  Information Retrieval
  Search Engine Architecture and Process
  Web Content and Size
  User Behavior in Search
  Sponsored Search: Advertisement
  Impact to Business and Search Engine Optimization
  Related Fields

History of IR and Web Search
1960-70’s:
  § Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents.
  § Development of the basic Boolean and vector-space models of retrieval.
1980’s:
  § Larger document database systems, many run by companies:
    – Lexis-Nexis
    – Dialog
    – MEDLINE
1990’s:
  § Organized competitions
    – NIST TREC
  § Searching FTPable documents on the Internet
    – Archie
    – WAIS
  § Searching the World Wide Web
    – Lycos
    – Yahoo
    – Altavista

History of IR/Web Search
2000’s:
  § Link analysis for Web search
    – Google
    – Inktomi
    – Teoma
  § Feedback-based engine:
    – DirectHit
  § Automated information extraction
    – Whizbang
    – Fetch
    – Burning Glass
  § Question answering
    – TREC Q/A track
    – Jeeves
2000’s continued:
  § Multimedia IR
    – Image
    – Video
    – Audio
    – Music
  § Cross-language IR
  § Document summarization
  § Mobile search

Web search basics
[Figure: web search components (User, Search, Web spider, Indexer, The Web, Indexes, Ad indexes), shown with an example result page for the query “miele”: sponsored links alongside results 1 - 10 of about 7,310,000 (0.12 seconds).]

Search engine architecture: key pieces
Spider (a.k.a. crawler/robot) – builds corpus
  § Collects web pages recursively
    – For each known URL, fetch the page, parse it, and extract new URLs
    – Repeat
  § Additional pages from direct submissions & other sources
Indexer and offline text mining
  § Creates inverted indexes so the online system can search
  § Enriches knowledge on things and their relationships (e.g. names and events) and on documents through data mining and learning
Online query process – serves query results
  § Front end – query reformulation, word processing
  § Back end – finds matching documents and ranks them
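The spider loop on this slide (fetch each known URL, parse it, extract new URLs, repeat) can be sketched as follows. This is only a minimal illustration: the `crawl` function, the `LinkExtractor` helper, and the three-page `FAKE_WEB` dictionary are hypothetical names, and the dictionary stands in for real HTTP fetches so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: fetch each known URL, parse it, queue new URLs."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    corpus = {}
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        corpus[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:  # only queue URLs not known yet
                seen.add(absolute)
                frontier.append(absolute)
    return corpus


# Hypothetical three-page "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a> <a href="/missing.html">?</a>',
}

corpus = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(corpus))
```

The `seen` set is what keeps the recursion from fetching the same page twice; a production spider would also need politeness delays, robots.txt handling, and protection against spider traps, none of which are shown here.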

Inverted index
Linked lists generally preferred to arrays
  § Dynamic space allocation
  § Insertion of terms into documents easy
  § Space overhead of pointers
[Figure: example postings lists of docIDs.]
Postings are sorted by docID (more later on why).
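A small sketch of an inverted index with postings sorted by docID. For brevity it uses Python lists rather than the linked lists the slide prefers; the four sample documents and the function names are made up for illustration. The `intersect` function shows one reason sorted postings are useful: two lists can be merged in a single linear pass.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of docIDs in ascending order."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)  # docIDs arrive in increasing order
    return dict(index)


def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2)) steps."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical toy collection keyed by docID.
docs = {
    1: "miele vacuum cleaners",
    2: "miele dishwashers",
    3: "vacuum cleaners free shipping",
    4: "miele vacuum free shipping",
}
index = build_inverted_index(docs)
print(index["miele"])                              # [1, 2, 4]
print(intersect(index["miele"], index["vacuum"]))  # [1, 4]
```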

Indexing Process
[Figure: indexing process diagram, including a “Knowledge on events/things” component.]

Indexing Process with Mining
Text acquisition
  § Identifies and stores documents for indexing
Text transformation
  § Transforms documents into index terms or features
Index creation
  § Takes index terms and creates data structures (indexes) to support fast searching
Data mining
  § Knowledge learning on things (people’s names, organizations, etc.) and their relationships (knowledge graphs)

Indexing and Mining at Parsing
[Figure: pipeline with stages labeled Web documents, Parsing, Content classification, Spammer removal, Duplicate removal, Inverted index generation, Link graph generation, Click data analysis, Online Database.]

Query Process
User interaction
  § Supports creation and refinement of query, display of results
Ranking
  § Uses query and indexes to generate ranked list of documents
Evaluation
  § Monitors and measures effectiveness and efficiency (primarily offline)

Online Engine Architecture
[Figure: Client, Traffic load balancer, Caches, Clustering Middleware, Ranking, Web page index, Structured DB, Page Info, Abstract description.]

User Interaction
Query transformation
  § Improves initial query
    – Stopword removal, spell correction, long query trimming
    – marriot hotel at golet
  § Spell checking suggestion and query suggestion provide alternatives to original query
    – Did you mean “Marriott hotel at Goleta”?
  § Query expansion and relevance feedback modify the original query with additional terms
    – UC santa babara admission rate
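A rough sketch of two of the query transformations named above, stopword removal and spelling suggestion, applied to the slide's example query. The stopword list and vocabulary are hypothetical, and string similarity from Python's standard `difflib` module is swapped in for whatever correction model a real engine uses (typically one built from query logs).

```python
import difflib

# Hypothetical stopword list and vocabulary, for illustration only.
STOPWORDS = {"a", "at", "in", "of", "the"}
VOCABULARY = ["admission", "barbara", "goleta", "hotel", "marriott", "rate", "santa"]


def remove_stopwords(query):
    """Lowercase the query and drop stopwords."""
    return [t for t in query.lower().split() if t not in STOPWORDS]


def suggest_spelling(terms, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each unknown term with its closest vocabulary entry, if any."""
    corrected = []
    for term in terms:
        if term in vocabulary:
            corrected.append(term)
            continue
        matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else term)
    return corrected


terms = remove_stopwords("marriot hotel at golet")
print(terms)
print("Did you mean:", " ".join(suggest_spelling(terms)))
```

Terms with no sufficiently close vocabulary entry are left unchanged, so the suggestion never drops a word from the query.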

User Interaction
Results output
  § Constructs the display of ranked documents for a query
    – Merges results from multiple channels
    – Retrieves appropriate advertising
  § Generates snippets (dynamic descriptions) to show how queries match documents
    – Highlights important words and passages
  § May provide clustering and other visualization tools
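One simple way to generate a snippet is to pick the fixed-size passage containing the most query terms and mark the matches. The sketch below does that; the window size, the `**` highlight markers, and the sample text are arbitrary choices for illustration, not how any particular engine builds snippets.

```python
import re


def make_snippet(text, query, window=8):
    """Return the `window`-word passage with the most query-term hits,
    with each matching word wrapped in ** markers."""
    terms = {t.lower() for t in query.split()}
    words = text.split()

    def is_hit(word):
        # Compare ignoring case and surrounding punctuation.
        return re.sub(r"\W", "", word).lower() in terms

    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        hits = sum(is_hit(w) for w in words[start:start + window])
        if hits > best_hits:  # keep the earliest best window
            best_start, best_hits = start, hits
    passage = words[best_start:best_start + window]
    return " ".join(f"**{w}**" if is_hit(w) else w for w in passage)


text = ("Welcome to our store. We stock appliances for every home. "
        "Miele vacuum cleaners ship free on all models.")
snippet = make_snippet(text, "miele vacuum", window=5)
print(snippet)
```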

Online System Support
Performance optimization
  § Designing matching & ranking algorithms for efficient processing
    – Term-at-a-time vs. document-at-a-time processing
    – Safe vs. unsafe optimizations
Distribution
  § Processing queries in a distributed environment
  § Query broker distributes queries and assembles results
  § Caching is a form of distributed searching
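The difference between term-at-a-time and document-at-a-time processing can be illustrated with a toy scorer that sums per-term weights. The index contents and weights below are invented; both functions return the same scores and differ only in traversal order.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical index: term -> postings [(docID, term weight)], sorted by docID.
INDEX = {
    "miele": [(1, 2.0), (2, 1.0), (4, 1.5)],
    "vacuum": [(1, 1.0), (3, 2.0), (4, 0.5)],
}


def term_at_a_time(query_terms, index):
    """Walk one postings list fully before the next, keeping an
    accumulator (partial score) for every document seen so far."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)


def document_at_a_time(query_terms, index):
    """Walk all postings lists in parallel in docID order, so each
    document's score is complete before the next document starts."""
    lists = [index.get(term, []) for term in query_terms]
    merged = heapq.merge(*lists, key=itemgetter(0))
    return {
        doc_id: sum(weight for _, weight in group)
        for doc_id, group in groupby(merged, key=itemgetter(0))
    }


query = ["miele", "vacuum"]
print(term_at_a_time(query, INDEX))
print(document_at_a_time(query, INDEX))
```

Document-at-a-time relies on the postings being sorted by docID; in exchange it never holds more than one partial score at a time, which is why it pairs well with top-k selection.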

Evaluation
Logging
  § Logging user queries and interaction is crucial for improving search effectiveness and efficiency
  § Query logs and clickthrough data are used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
Ranking analysis
  § Measuring and tuning ranking effectiveness
Performance analysis
  § Measuring and tuning system efficiency

General Search vs. Vertical Search
General Search: identify relevant information with a horizontal/exhaustive view of the world.
Vertical Search: focus on a specific segment of web content
Integrate domain knowledge (e.g. taxonomies/ontology) & deep web
Examples: travel in Expedia, products in Amazon.

Example of Vertical Search: Question Answering


Characteristics of Web Content
No design/co-ordination
Distributed content creation, linking
Content includes truth, lies, obsolete information, contradictions
Structured (databases), semistructured
Scale -- huge
Growth – slowed down from initial “volume doubling every few months”
Content can be dynamically generated
[Figure: The Web]

Dynamic Web Content
[Figure: Browser, Application server, and Back-end databases, with the request AA129.]
A page without a static html version
  § E.g., current status of flight AA129
  § Current availability of rooms at a hotel
Usually assembled at the time of a request from a browser
  § Typically, URL has a ‘?’ character in it
Most dynamic content is ignored by web spiders
  § Many reasons including malicious spider traps
  § Acquired for some content (e.g. news stories)
    – Application-specific spidering

The web: size
What is being measured?
  § Number of hosts
  § Number of (static) html pages
    – Volume of data
Number of hosts – Netcraft survey
  § server survey.html
    – -2014-web-server-survey.html
  § Gives monthly report on how many web servers are out there
Number of pages – numerous estimates
  § More to follow later in this course
  § For a Web engine: how big its index is

The web: the number of hosts

The web: web server vendors

Static pages: rate of change
Fetterly et al. study: several views of data, 150 million pages over 11 weekly crawls
  § Bucketed into 85 groups by extent of change

Diversity
Languages/Encodings
  § Hundreds (thousands?) of languages
  § W3C encodings
Document & query topic


The user
Diverse in access methodology
  § Increasingly, high bandwidth connectivity
  § Growing segment of mobile users: limitations of form factor – keyboard, display
Diverse in search methodology
  § Search, search + browse, filter by attribute
    – Average query length 2.5 terms
Poor comprehension of syntax
  § Early engines surfaced rich syntax – Boolean, phrase, etc.
  § Current engines hide these

Web Search: How do users find content?
Informational (25%) – want to learn about something
  § Example: autism
Navigational (40%) – want to go to that page
  § Example: United Airlines
Transactional (35%) – want to do something (web-mediated)
  § Access a service
  § Downloads
  § Shop
Gray areas
  § Find a good hub
  § Exploratory search “see what’s there”
Example queries: Santa barbara weather, Mars surface images, Nikon D-SLR, Car rental Finland
Source: Broder 2002, A Taxonomy of Web Search

Users’ evaluation of engines
Relevance and validity of results
UI – Simple, no clutter, error tolerant
Trust – Results are objective, the engine wants to help me
Pre/Post process tools provided
  § Mitigate user errors (auto spell check)
  § Explicit: Search within results, more like this, refine.
  § Anticipative: related searches

Users’ evaluation
Quality of pages varies widely
  § Relevance is not enough
  § Duplicate elimination
Precision vs. recall
What matters
  § Precision at position 1? Precision above the fold?
  § Comprehensiveness – must be able to deal with obscure queries
    – Recall matters when the number of matches is very small
User perceptions may be unscientific, but are significant over a large aggregate
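For reference, precision at position k and recall can be computed as below. The ranked list and the set of relevant documents are made-up judgments for a single query.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall(ranked, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


# Hypothetical result list and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

print(precision_at_k(ranked, relevant, 1))  # precision at position 1
print(precision_at_k(ranked, relevant, 3))
print(recall(ranked, relevant))
```

Here the top result is relevant, so precision at position 1 is perfect even though one relevant document (`d8`) is never retrieved; that gap is what recall measures.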

What about on Mobile?
Query characteristics:
  § Best known studies by Kamvar and Baluja (2006 and 2007) and by Yi, Maghoul, and Pedersen (2008)
Have a different distribution than the query distribution for PC users
  § Bias towards shorter queries
    – Data contradicts that: 2.6 words per query, same # chars as PC
  § Difficulty of query entry is a significant hurdle
  § Much higher location-based activity
More notification-driven tasks

Implications and Challenges
Task-orientation
  § Specialized content packaging
  § “Santa Barbara”
Locality inference from queries and from devices
  § “Dentist”
Minimize typing and round-trips: get results, not just links
  § Less room to display search engine reply page and other accessories
  § Direct answer


[Figure: search result page with the Search query and an Ad labeled.]

Questions
Do you think an “average” user knows the difference between sponsored search links and algorithmic search results?

How it works
[Figure: Advertiser, Ad Index, Sponsored search engine, Landing page.]
Advertiser:
  § “I want to bid 5 on canon camera”
  § “I want to bid 2 on cannon camera”
Engine decides when/where to show this ad.
Engine decides how much to charge advertiser on a click.


Three sub-problems
1. Match ads to query/context
2. Order the ads
3. Pricing on a click-through
[Annotations on the slide: IR, Econ]
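The three sub-problems can be sketched end to end under some loudly stated assumptions. The slides do not say how ads are matched, ordered, or priced; the sketch below assumes phrase matching, ordering by bid times an estimated click-through rate, and generalized second-price charging, which is one commonly described scheme. All names, bids, and click-through rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str
    keyword: str
    bid: float       # bid per click, in arbitrary currency units
    est_ctr: float   # hypothetical estimated click-through rate


def match_ads(query, ads):
    """1. Match: keep ads whose keyword phrase occurs in the query (assumed rule)."""
    q = query.lower()
    return [ad for ad in ads if ad.keyword in q]


def order_ads(ads):
    """2. Order: rank by bid * estimated CTR (assumed scoring rule)."""
    return sorted(ads, key=lambda ad: ad.bid * ad.est_ctr, reverse=True)


def price_per_click(ordered):
    """3. Price: assumed generalized second price, i.e. the lowest bid
    that would still keep the ad above the next ad in the ordering."""
    prices = []
    for i, ad in enumerate(ordered):
        if i + 1 < len(ordered):
            nxt = ordered[i + 1]
            prices.append(nxt.bid * nxt.est_ctr / ad.est_ctr)
        else:
            prices.append(0.0)  # nothing below; a real engine would set a reserve price
    return prices


ads = [
    Ad("A", "canon camera", 5.0, 0.02),
    Ad("B", "cannon camera", 2.0, 0.05),
    Ad("C", "canon camera", 4.0, 0.03),
]
ordered = order_ads(match_ads("cheap canon camera", ads))
prices = price_per_click(ordered)
for ad, price in zip(ordered, prices):
    print(ad.advertiser, round(price, 2))
```

Under these assumptions advertiser C outranks A despite the lower bid, because its estimated click-through rate is higher, and C pays less than its own bid.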


Search Traffic is Important for Business: Example of Site Traffic Analysis

Paid placement vs Search Engine Optimization
Paid placement costs money. What’s the alternative?
Search Engine Optimization:
  § “Tuning” your web page to rank highly in the search results for select keywords
  § Alternative to paying for placement
  § Thus, intrinsically a marketing function
  § Also known as Search Engine Marketing

Search engine optimization
Motives
  § Commercial, political, religious, lobbies
  § Promotion funded by advertising budget
Operators
  § Contractors (Search Engine Optimizers) for lobbies, companies
  § Web masters
  § Hosting services
Forum
  § Web master world
    – Search engine specific tricks
    – Discussions about academic papers :)
    – More pointers in the Resources

The spam industry

Simplest forms
Early engines relied on the density of terms
  § The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s
SEOs responded with dense repetitions of chosen terms
  § e.g., maui resort maui resort maui resort
  § Often, the repetitions would be in the same color as the background of the web page
    – Repeated terms got indexed by crawlers
    – But not visible to humans on browsers
Can’t trust the words on a web page, for ranking.
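A toy demonstration of why pure term counting is easy to game. The scoring function is a deliberately naive stand-in for the density-based ranking of early engines, and both pages are invented.

```python
def term_count_score(query, page_text):
    """Naive ranking signal: total occurrences of the query terms on the page."""
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


# Hypothetical pages: one honest description, one stuffed with repetitions.
pages = {
    "honest": "Our maui resort offers ocean views and a quiet beach",
    "stuffed": "maui resort " * 20 + "book now",
}

query = "maui resort"
ranking = sorted(pages, key=lambda name: term_count_score(query, pages[name]),
                 reverse=True)
for name in ranking:
    print(name, term_count_score(query, pages[name]))
```

The stuffed page scores 40 against 2 for the honest page, so it ranks first even though it says nothing useful.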

Keyword stuffing



Link Farms
Boost PageRank of a website
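The effect can be reproduced on a tiny graph with a textbook power-iteration PageRank. The graph, the damping factor of 0.85, and the ten farm pages are assumptions chosen for illustration; real link farms and real ranking systems are far more elaborate.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank; `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page site graph.
base = {"a": ["b"], "b": ["a", "target"], "target": ["a"]}

# Hypothetical link farm: ten pages whose only purpose is to link to "target".
farm = {f"farm{i}": ["target"] for i in range(10)}

without_farm = pagerank(base)
with_farm = pagerank({**base, **farm})
print(round(without_farm["target"], 3), round(with_farm["target"], 3))
```

Even though the farm pages have no incoming links of their own, each still receives the baseline teleport share of rank and passes most of it to the target, so the target's score rises.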


From Information Retrieval to Web Search
Challenging due to large-scale and noisy data.
  § Retrieving relevant documents to a query.
  § Retrieving from large sets of documents efficiently.
Relevance is a subjective judgment and may include:
  § Simplest notion of relevance is that the query string appears verbatim in the document.
  § More:
    – Being on the proper subject.
    – Being timely (recent information).
    – Being authoritative (from a trusted source).
    – Satisfying the goals of the user and his/her intended use of the information (information need).

Related Areas
Information Management and Data Mining
  § Information Science & CHI
  § Machine learning and data mining
  § Natural Language Processing
Large-scale systems
  § Database/data stores
  § Operating systems/networking support
  § Web language analysis
  § Compression/fast algorithms
  § Fault tolerance/parallel distributed systems

Problems with Keywords
May not retrieve relevant documents that include synonymous terms.
  § “car” vs. “automobile”
  § “UCSB” vs. “UC Santa Barbara”
May retrieve irrelevant documents that include ambiguous terms.
  § “bat” (baseball vs. mammal)
  § “Apple” (company vs. fruit)
  § “bit” (unit of data vs. act of eating)

Search Intent Analysis
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect feedback.
Taking into account the authority of the source.

Topics: Text mining
“Text mining” is a cover-all marketing term
A lot of what we’ve already talked about is actually the bread and butter of text mining:
  § Text classification, clustering, and retrieval
But we will focus in on some of the higher-level text applications:
  § Extracting document metadata
  § Topic tracking and new story detection
  § Cross document entity and event coreference
  § Text summarization
  § Question answering

Topics: Information extraction
Getting semantic information out of textual data
  § Filling the fields of a database record
E.g., looking at an event web page:
  § What is the name of the event?
  § What date/time is it?
  § How much does it cost to attend?
Other applications: resumes, health data
A limited but practical form of natural language understanding
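As a very small illustration of filling the fields of a database record from an event page, the sketch below uses hand-written regular expressions. The page text, the field names, and the rule that the first line is the event name are all assumptions; practical extractors are usually learned from labeled examples rather than hand-coded.

```python
import re

# Hypothetical plain-text rendering of an event web page.
EVENT_PAGE = """\
Santa Barbara Search Meetup
Date: March 3, 2017 at 6:30 PM
Admission: $15 per person
"""


def extract_event_record(text):
    """Fill a small record (name, date, time, cost) from free text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    date = re.search(r"[A-Z][a-z]+ \d{1,2}, \d{4}", text)
    time = re.search(r"\d{1,2}:\d{2}\s?(?:AM|PM)", text)
    cost = re.search(r"\$(\d+(?:\.\d{2})?)", text)
    return {
        "name": lines[0] if lines else None,  # assumption: title is the first line
        "date": date.group(0) if date else None,
        "time": time.group(0) if time else None,
        "cost": float(cost.group(1)) if cost else None,
    }


record = extract_event_record(EVENT_PAGE)
print(record)
```

Fields that cannot be found are left as `None` rather than guessed, which mirrors how a database record would carry a missing value.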

Topics: Recommendation systems
Using statistics about the past actions of a group to give advice to an individual
  § E.g., Amazon book suggestions or Netflix movie suggestions
A matrix problem:
  § but now instead of words and documents, it’s users and “documents”
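The matrix view can be made concrete with a small user-by-item rating matrix. The sketch below uses user-based collaborative filtering with cosine similarity, which is one simple technique among many; the users, items, and ratings are invented.

```python
import math

# Hypothetical user x item rating matrix (0 means "not rated").
ITEMS = ["book_a", "book_b", "book_c", "book_d"]
RATINGS = {
    "alice": [5, 4, 0, 0],
    "bob":   [5, 5, 4, 0],
    "carol": [0, 1, 0, 5],
}


def cosine(u, v):
    """Cosine similarity between two rating rows."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(user, ratings=RATINGS, items=ITEMS):
    """Score each item the user has not rated by the similarity-weighted
    ratings of the other users, and return items from best to worst."""
    target = ratings[user]
    scores = [0.0] * len(items)
    for other, row in ratings.items():
        if other == user:
            continue
        similarity = cosine(target, row)
        for i, rating in enumerate(row):
            if target[i] == 0:  # only score items the user has not rated
                scores[i] += similarity * rating
    candidates = [(score, item) for score, item in zip(scores, items) if score > 0]
    return [item for score, item in sorted(candidates, reverse=True)]


print(recommend("alice"))
```

Alice's ratings resemble Bob's far more than Carol's, so the item Bob liked is recommended ahead of the item Carol liked.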

