How Search Engines Work - Lehigh University


How Search Engines Work
Today we show how a search engine works
  - What happens when a searcher enters keywords
  - What was performed well in advance
  - Also explain (briefly) how paid results are chosen
If we have time, we will also talk about the size of the Web
(If you really want to know how web search engines work, take my CSE345 WWW Search Engines course in the spring!)
Fall 2006 Davison/Lin CSE 197/BIS 197: Search Engine Strategies 2-1

(Figure: Google results example, with the paid results and the organic results labeled)

Building an index
A search engine does not examine every page on the web when a user puts in a query
The engine first builds an index
  - Custom database of all the words on all pages
  - Search engine also stores other information

(Figure: Overview of organic search)

Matching the Search Query
The search query is everything that the user types to get results
  - It is made up of one or more search terms, plus optional special characters
Analyzing the query
  - Expanding the query
      - Word variants: plural/singular, various verb forms
      - Spelling correction
  - Phrases, anti-phrases, and stop words
  - Word order
  - Search operators
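
The query-analysis steps above can be illustrated with a small sketch. It is only an assumption about how such a step might look, not how any particular engine does it: the stop-word list, the function name analyze_query, and the use of double quotes to mark phrases are all hypothetical choices made for the example.

```python
import re

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "in"}

def analyze_query(query):
    """Split a raw query into quoted phrases and remaining search terms."""
    # Quoted spans are treated as phrases and kept intact.
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    # Remove the quoted spans, then tokenize what is left.
    rest = re.sub(r'"[^"]+"', " ", query)
    words = re.findall(r"[a-z0-9]+", rest.lower())
    # Drop stop words from the remaining terms.
    terms = [w for w in words if w not in STOP_WORDS]
    return {"phrases": phrases, "terms": terms}

parsed = analyze_query('the history of "search engines" ranking')
print(parsed)
```

Word variants and spelling correction are left out here; they would need a dictionary or stemmer that the slides do not specify.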

Matching the Search Query
Organic query matches
  - Find pages with each of the remaining query terms
  - Document IDs are listed in a term index
  - Document information is in a separate doc index
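
A minimal sketch of the lookup described above, assuming the term index is a mapping from each term to a list of document IDs and the doc index is a mapping from each ID to stored page details. The data, URLs, and the function name match are made up for illustration.

```python
# Hypothetical toy data: term index (term -> document IDs) and a
# separate doc index (document ID -> stored page information).
term_index = {
    "search": [1, 2, 4],
    "engine": [1, 4],
    "strategies": [2, 4],
}
doc_index = {
    1: {"title": "How Search Engines Work", "url": "http://example.com/a"},
    2: {"title": "Search Strategies", "url": "http://example.com/b"},
    4: {"title": "Search Engine Strategies", "url": "http://example.com/c"},
}

def match(query_terms):
    """Return the IDs of documents that contain every query term."""
    postings = [set(term_index.get(term, [])) for term in query_terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

matches = match(["search", "engine"])
titles = [doc_index[doc_id]["title"] for doc_id in matches]
print(matches, titles)
```

Intersecting the ID lists finds pages containing each term; the doc index is consulted only afterwards, for the few documents that matched.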

Matching the Search Query
Paid placement matches
  - Similar to organic match, but using a separate database of ads
  - Uses similar processing to select which query terms to use
  - Advertisers choose which queries can match
      - Might require exact match, or allow broad matching
Simpler/faster because there are fewer ads to search through

Ranking Organic Matches
This is a complex, active research area
  - Goal is to sort matching results from 'best' to 'worst'
  - Many factors contribute to different rankings in the various engines
  - Ranking functions are under continuous change
Primary factors
  - Text analysis: keyword density and prominence
  - Link analysis: page and site authority estimates
  - Anchor text: terms used to describe page by others
  - Traffic analysis: which results get clicked on

Text Analysis: Keyword Density
A.k.a. keyword weight
Generally refers to the relative frequency of a term on the page
Higher keyword density generally means that a document is more 'about' that keyword
Natural text has a maximum reasonable density
  - The book cites a 7% density threshold
Multi-term queries target keyword proximity
  - Pages with the same terms adjacent in same order would benefit most
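
As a rough illustration of keyword density as relative term frequency, the sketch below counts how often a term occurs among all words on a page. The tokenization rule and the function name keyword_density are assumptions; the 0.07 comparison simply reuses the 7% figure cited above.

```python
import re

def keyword_density(text, term):
    """Fraction of the words in text that are exactly the given term."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)

page = "Search engines index pages. A search engine ranks pages for each search."
density = keyword_density(page, "search")
# Compare against the 7% figure mentioned in the slide (illustrative use only).
above_threshold = density > 0.07
print(density, above_threshold)
```

Here 3 of the 12 words are "search", a density far above what natural text would usually show.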

Text Analysis: Term Prominence
Where do the query terms appear?
Good places include:
  - Title
  - Headings
  - Start of body
Terms in such places could get extra weight
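
One way to picture "extra weight" for prominent places is a per-field multiplier. The sketch below is an assumption for illustration only: the field names and the weight values are hypothetical and are not taken from the slides or from any engine.

```python
# Hypothetical weights; the slides say only that such places
# "could get extra weight", not how much.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body_start": 1.5, "body": 1.0}

def prominence_score(page, term):
    """Sum weighted occurrences of term across the page's fields."""
    term = term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        score += weight * words.count(term)
    return score

page = {
    "title": "Search Engine Strategies",
    "headings": "Building an index",
    "body_start": "A search engine builds an index",
    "body": "the index lists every page for every term",
}
print(prominence_score(page, "index"), prominence_score(page, "page"))
```

With these assumed weights, a term that appears in the heading and the start of the body scores higher than one that appears only deep in the body.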

Link Analysis: Estimating Authority
A typical short query matches millions of pages
  - Many could even have the same textual (relevance) weight from keyword density and prominence
Link analysis estimates the importance of each page, based on the link structure around it
The more respected a site is, the more links point to it
Some links are more important than others
  - A link from Yahoo (or the White House!) signifies much more than a link from geocities.com

Google's PageRank
The best-known link analysis algorithm
  - Algorithm published in 1998
  - Very well-studied; improvements are still being made to it today
The authoritativeness of a page grows if
  - More pages link to it
  - The pages that link to it increase their authority
The original algorithm is not a significant component of Google's ranking approach today
  - Many have shown that it performs poorly now
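
The two conditions above (more inbound links, and more authoritative linking pages) can be seen in a simplified power-iteration sketch of a PageRank-style computation. This is a textbook-style approximation, not Google's implementation: the damping value 0.85, the fixed iteration count, the handling of pages without outlinks, and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: A, B, and D all link to C; C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph C collects the most links and ranks first, while A ranks above B and D because its single inbound link comes from the highly ranked C.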

Anchor Text
What is a page about?
Page builders often summarize a page (or the significant aspect of a page) in the anchor text (the text of a link)
  - These short descriptions look a lot like queries!
  - Can help determine value of link
  - A significant component for ranking today
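
A small sketch of how anchor text might be gathered so that it describes the page being linked to rather than the page containing the link. The (source, target, anchor) tuple layout, the URLs, and the function name collect_anchor_text are hypothetical.

```python
from collections import defaultdict

def collect_anchor_text(links):
    """Group anchor text by the URL that each link points to."""
    anchors = defaultdict(list)
    for _source, target, anchor in links:
        anchors[target].append(anchor.lower())
    return dict(anchors)

# Hypothetical links: (page containing the link, page linked to, link text).
links = [
    ("http://example.com/a", "http://example.com/c", "Search Engine Course"),
    ("http://example.com/b", "http://example.com/c", "Web Search Class"),
    ("http://example.com/a", "http://example.com/b", "Home"),
]
anchors = collect_anchor_text(links)
print(anchors["http://example.com/c"])
```

The collected phrases can then be matched against queries in the same way as the page's own text.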

Traffic Analysis
Many engines will track which links you click on from a results page
Such clicks can be considered "votes" for URLs
Re-ordering based on clicks can improve ranking quality [Joachims et al., 2005]
DirectHit search engine used click-throughs to generate top-10 results (purchased by Ask Jeeves in 2000)

Ranking Paid Placement
Simplest approach: rank by highest bidder
  - Originally developed by Overture (a.k.a. goto.com)
  - Advertisers can change bids continuously, and can specify a particular budget
Google's approach: rank by most valuable
  - Combination of bid and click-through rate
  - More relevant (clicked) ads move up in rank
  - Users find ads more useful
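
The difference between the two approaches can be shown with a toy comparison. The slide says only that the second approach uses a "combination of bid and click-through rate"; treating that combination as a simple product is an assumption made here for illustration, and the ad names and numbers are invented.

```python
# Invented ads: bid in dollars per click, ctr as the fraction of views clicked.
ads = [
    {"name": "ad_a", "bid": 1.00, "ctr": 0.01},
    {"name": "ad_b", "bid": 0.60, "ctr": 0.05},
    {"name": "ad_c", "bid": 0.80, "ctr": 0.02},
]

def rank_by_bid(ads):
    """Highest bidder first."""
    ordered = sorted(ads, key=lambda ad: ad["bid"], reverse=True)
    return [ad["name"] for ad in ordered]

def rank_by_value(ads):
    """Highest bid * click-through rate first (assumed combination)."""
    ordered = sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)
    return [ad["name"] for ad in ordered]

print(rank_by_bid(ads))
print(rank_by_value(ads))
```

Under the assumed product rule, the lowest bidder moves to the top because its ad is clicked far more often.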

Displaying Search Results
Once the set of results has been collected and ranked, the results page needs to be generated
For first page, select top results (typically 10)
  - Look up title, URL for linking (and often display)
  - Generate snippet (portion of page text that illustrates query terms) or look up ad copy
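
Snippet generation can be sketched as picking a window of words around the first query term found in the page text. This is one simple assumed strategy, not the method of any particular engine; the function name make_snippet and the width parameter are hypothetical.

```python
def make_snippet(text, query_terms, width=8):
    """Return up to `width` words of text around the first matching term."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing to query terms.
        if word.lower().strip(".,;:!?()\"'") in terms:
            start = max(0, i - width // 2)
            end = min(len(words), start + width)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No query term found: fall back to the start of the text.
    return " ".join(words[:width]) + (" ..." if len(words) > width else "")

text = ("A search engine does not examine every page on the web "
        "when a user puts in a query.")
snippet = make_snippet(text, ["web"], width=6)
print(snippet)
```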

Collecting Material for the Organic Index
Primarily using a crawler/spider
  - Given a seed list of links, visit each one and add any new URLs found to the list of links to visit
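
The crawl loop described above can be sketched against an in-memory stand-in for the web, so the example runs without network access. The WEB dictionary, the example.com URLs, the fetch_links callback, and the max_pages limit are all assumptions for illustration; a real crawler would also fetch pages over HTTP and follow politeness rules.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages starting from seeds, queueing each newly found URL once."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://example.com/"], lambda url: WEB.get(url, []))
print(order)
```

The seen set is what keeps the crawler from revisiting pages that link back to each other.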

Building the Organic Index
For each page retrieved, extract the text
  - For each term in the text, add the page's ID (and optionally, positions) to the list of docs for that term
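
A minimal sketch of the indexing step above, assuming pages arrive as a mapping from page ID to already-extracted text. The tokenization rule, the nested-dictionary layout, and the function name build_index are illustrative choices.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each term to {page ID: [positions of the term in that page]}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        terms = re.findall(r"[a-z0-9]+", text.lower())
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return index

pages = {
    1: "search engines build an index",
    2: "an index of search terms",
}
index = build_index(pages)
print(index["search"], index["index"])
```

Storing positions is what later makes phrase and proximity matching possible without re-reading the pages.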

Building the Organic Index
For each page retrieved
  - Extract the links
  - Record title and URL
  - Record anchor text for each link
What to crawl?
  - Can't crawl all pages!
  - Need to re-crawl oft-changing pages
Some engines allow trusted feeds (typically a form of paid inclusion) to get content indexed

Content Analysis
Convert different types of documents
  - Use a single standard internal representation
  - Lots of file types: Word, PDF, PostScript, etc.
Recognize language used
They also extract additional text from a page

What search engines (and sight-impaired users) don't see
They cannot read images (even text in images)
Often they do not read Flash content or JavaScript

What search engines can see
Image names
Image alt text
Meta text
  - Title
  - Description
  - Keywords (often ignored)
  - Other directives
URL text

Search Engine Relationships
Business relationships have changed significantly over the past five years or so.
See the Search Engine Relationship Chart as it can also show connections over time.
There are more players than shown (such as Gigablast, Snap.com) and lots of international engines.

Evaluating Organic Search Results
Precision: fraction of search results that are correct (relevant) to a query
Recall: fraction of all correct (relevant) answers included in a set of search results
Improving one usually results in worsening of the other
In web search, neither can be measured exactly!
  - Still useful to think about how a change will affect performance
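
The two definitions above translate directly into a short calculation, assuming the full set of relevant documents is known (which, as noted, is not the case on the real web). The function name and the document IDs are made up for the example.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 results returned, 2 of them relevant; 5 relevant documents exist in total.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
```

Returning more results can only keep or raise recall, but each irrelevant result added lowers precision, which is the trade-off the slide describes.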

How big is the Web?
Depends!
What if I turn on a laptop that can produce links to an infinite number of pages?
  - Proposed by Andrei Broder, who has studied this

How big is the Web?
Perhaps you mean the size of the index used by web search engines?
  - This is a recurring debate
  - In 2005, Google was reporting 8B pages indexed
  - Yahoo then announced it had indexed almost 20B
  - Google declared Yahoo as counting differently
  - Google no longer reports its index size and regularly underreports the number of machines it uses
The estimated intersection of the top 4 indexes in 2005 was only about 2.7B (different crawls!)
What about pages not indexed by the engines?

How big is the Web?
How large is the indexable web?
  - That is, ignoring the pages that require passwords, links within Flash content, or forms to be filled in (search boxes, registration, etc.)
  - Recent estimate is 11.5B [Gulli & Signorini, 2005]
      - Fairly close in time to Yahoo's 20B claim
The hidden web (the rest) is 2-500 times larger!
  - Again, just reported estimates.
So it is impossible to know the size of the Web!
