The Weaknesses of Full-Text Searching


by Jeffrey Beall

Jeffrey Beall is a Metadata Librarian/Assistant Professor, Auraria Library, University of Colorado Denver, 1100 Lawrence Street, Denver, CO 80204, USA (jeffrey.beall@ucdenver.edu).

The Journal of Academic Librarianship, Volume 34, Number 5 (September 2008), pages 438-444

This paper provides a theoretical critique of the deficiencies of full-text searching in academic library databases. Because full-text searching relies on matching words in a search query with words in online resources, it is an inefficient method of finding information in a database. This matching fails to retrieve synonyms, and it also retrieves unwanted homonyms. Numerous other problems also make full-text searching an ineffective information retrieval tool. Academic libraries purchase and subscribe to numerous proprietary databases, many of which rely on full-text searching for access and discovery. An understanding of the weaknesses of full-text searching is needed to evaluate the search and discovery capabilities of academic library databases.

INTRODUCTION

Definition of Full-Text Searching

Full-text searching is the type of search a computer performs when it matches terms in a search query with terms in individual documents in a database and ranks the results algorithmically. This type of searching is ubiquitous on the Internet and includes the type of natural language search we typically find in commercial search engines, Web site search boxes, and in many proprietary databases. The term full-text searching has several synonyms and variations, including keyword searching, algorithmic searching, stochastic searching, and probabilistic searching.

Metadata-Enabled Searching

There is one other main type of online searching. This is metadata-enabled searching, which is also called deterministic searching. In this type of search, searchers pre-select and search individual facets of an information resource, such as author, title, and subject. In this type of search, the system matches terms in the search with terms in structured metadata and generates results, often a browse display sorted alphanumerically. Author, title, and subject searches in online library catalogs are examples of this type of search.

Importance

Understanding the weaknesses of full-text searching is important for academic libraries for several reasons. First, academic libraries purchase or subscribe to numerous proprietary databases, including many full-text databases. When they decide whether to pay for a particular database, libraries need to evaluate the search engine or system that accompanies the database. When these databases provide only full-text searching and not metadata-enabled searching, resource discovery within the resource may be difficult, putting libraries in the position of paying for content that is hard to find. Library-created databases, such as institutional repositories, are another area where an understanding of the weaknesses of full-text searching is needed. Providing only full-text access to a library's digital objects may not provide resource discovery of sufficient quality for the collection's users. Academic libraries need to evaluate these collections and the available search engines and systems and select the best one for their particular databases. Finally, much current debate centers on the need for online library catalogs versus the ability to access academic library materials through a commercial search engine. A thorough knowledge of the weaknesses of full-text searching adds to the debate and helps academic librarians in the evaluation, recommendation and design of library database search engines.
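The two search types defined in this introduction can be contrasted in a small sketch. The following Python is illustrative only and is not taken from the article: the three records, the field names, and the scoring rule (a simple count of matching words) are all assumptions chosen to keep the example short. Real search engines use far more elaborate ranking.

```python
import re
from collections import Counter

# Hypothetical collection; every title, name, and sentence here is invented.
RECORDS = [
    {"author": "Smith, Ann", "title": "Harbor Engineering", "subject": "Harbors",
     "text": "The harbor was dredged so that larger ships could enter the harbor."},
    {"author": "Jones, Bo", "title": "Coastal Trade", "subject": "Shipping",
     "text": "Ships carried goods along the coast to a distant harbor."},
    {"author": "Adams, Cy", "title": "Port Histories", "subject": "Harbors",
     "text": "Ports shaped the growth of coastal towns."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_search(query, records):
    """Match query words against document words and rank by match count."""
    terms = tokenize(query)
    scored = []
    for rec in records:
        counts = Counter(tokenize(rec["text"]))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, rec["title"]))
    # Highest score first; ties are broken by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


def metadata_search(field, value, records):
    """Match the query against one metadata field; return a sorted browse list."""
    hits = [rec["title"] for rec in records if rec[field].lower() == value.lower()]
    return sorted(hits)


if __name__ == "__main__":
    print(full_text_search("harbor", RECORDS))
    print(metadata_search("subject", "Harbors", RECORDS))
```

With these invented records, the word search for "harbor" returns "Harbor Engineering" and then "Coastal Trade", while the subject search for "Harbors" returns "Harbor Engineering" and "Port Histories". The third record is about harbors according to its assigned subject but never uses the word, so only the metadata search finds it.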

Objective

The purpose of this article is to list and describe the chief weaknesses of full-text searching. We limit the scope of this article to true full-text searching that automatically matches words entered in the search box with words in the resources a database contains to generate results. This study does not include in its analysis new, semantic search engines such as Hakia, which stores metadata for each Web page indexed and uses that metadata, along with word matching, to generate search results. Indeed, many popular search engines do incorporate metadata into their searches. For example, the Google advanced search allows for limiting search results to a specific language. This search limit is generated by language metadata that the search engine assigns to each Web page it indexes (the accuracy of this automatically-generated language metadata may not always be high).

Still, the great majority of the searches performed on the Internet are of the type this paper seeks to study: full-text searching that matches words in a search box with words in online documents or online text. This study is not a comparison of full-text searching and metadata-enabled searching. Both types of searching have their strengths and weaknesses. This article seeks chiefly to describe the weaknesses of full-text searching.

This paper is a theoretical critique of full-text searching and focuses on the type of searching done in academic libraries. It describes and categorizes the ways in which full-text searching can fail, failures that most searchers have likely encountered themselves. While outside the scope of this paper, quantitative research that measures the extent of these problems would be valuable and would further inform the debate.

PREVIOUS STUDIES

Most information retrieval and information discovery has transitioned from searching dominated by metadata-enabled searching (academic library card catalogs) to the present full-text or algorithmic searching (Web search engines). This transition occurred without sufficient analysis of the weaknesses of full-text searching. Perhaps if searchers understood the number of resources they were missing because of full-text searching's reliance on word matching to generate retrieval, they would be less satisfied with it. Books and articles on information retrieval often cite one or two examples of the weaknesses of full-text searching, but few have been comprehensive in their analyses, as this one seeks to be.

Among those to write about the weaknesses of full-text searching is Thomas Mann, a reference librarian at the Library of Congress. He states, "Keyword searching fails to map the taxonomies that alert researchers to unanticipated aspects of their subjects. It fails to retrieve literature that uses keywords other than those the researcher can specify; it misses not only synonyms and variant phrases but also all relevant works in foreign languages. Searching by keywords is not the same as searching by conceptual categories". [1] Here he makes reference to the synonym problem in full-text searching (and he prefers to use the term keyword searching rather than full-text searching, providing yet another example of the synonym problem). Mann also states,

When all is said and done, keyword searching necessarily entails the problem of the unpredictability of the many variant ways the same subject can be expressed, within a single language ("capital punishment", "death penalty") and across multiple languages ("peine de mort", "pena capitale"). And no software algorithm will solve this problem when it is confined to dealing with only the actual words that it can retrieve from within the given documents (or citations or abstracts) themselves. [2]

Beall [3, 4] presents two brief but more complete looks at the problems of full-text searching. The present paper aims for a more comprehensive analysis. Moreover, Beall [5] introduces the term "search fatigue" to describe the frustration searchers feel when they are unsuccessful in finding information due to the weaknesses of full-text searching. A recent study by Hemminger, Saelim, Sullivan, and Vision [6] compares full-text searching to metadata searching and finds that "it may be time to make the transition to direct full-text searching as the standard". However, later in the article the authors concede that their study may not be truly representative, for it compared the two searching modes using gene names, which are consistently used in the literature they studied.

THE WEAKNESSES OF FULL-TEXT SEARCHING

The Synonym Problem

Perhaps the biggest and most pervasive weakness of full-text searching is the synonym problem. This problem occurs because there is often more than one way to name or express a given concept, such as a person, place, or thing. There are several different aspects of the synonym problem.

True Synonyms

Synonyms are two words that mean the same thing in one language. In full-text searching, synonyms hinder effective information retrieval when a searcher enters a term in the search box and the system only returns results that match the term and does not return results that refer to the concept only by one of its synonyms. For example, if a searcher seeks information on leprosy, he would likely enter "leprosy" in the search box and expect complete results. However, some online documents refer to this disease as "Hansen's disease".
While it's true that many documents will contain both terms, thus enabling access regardless of which term is searched, a certain percentage of the documents will only contain one term, thus providing an incomplete retrieval.

Variant Spellings

Words that mean exactly the same thing can sometimes be spelled differently, as in variant British and American spellings. In full-text searching, a search for "harbour" will miss results that use the spelling "harbor". It is true that many full-text search engines have developed methods for overcoming this problem; searchers can use wild card or truncation operators to retrieve multiple spellings of a word. But there are also variant spellings within a single dialect of a language, and these differences are often beyond the scope of the truncation or wild card operators. For example, in American English the spellings "donut" and "doughnut" are both common. Unlike the case of synonyms, where both synonyms may appear in a single document, spelling tends to be consistent within a document. A document about harbors written in the United States is unlikely to also contain the spelling "harbour". This means that there is a smaller chance of retrieving documents with variant spellings than there is with true synonyms.

Shortened Forms of Terms

Abbreviations, acronyms, and initialisms can hinder recall in full-text search systems because a document may contain only the short form of the word or only the long form. When this occurs, someone searching on the short form (PETA) will miss documents that use only the long form (People for the Ethical Treatment of Animals). Alternatively, searching on the long form of the term, like Magnetic Resonance Spectroscopy, will miss documents that only refer to the concept by its short form, MRS.

Different Languages or Dialects

When searching a term in one language, a searcher will not match documents that contain the foreign-language version for that concept, unless the two terms happen to be cognates. For example, if you search the term "butter", the search will miss documents that only refer to this by its Spanish equivalent "mantequilla". For many searchers, this exclusion is not a problem; they prefer their search results to be in just one language. However, scholarship supported by academic libraries, such as medical research, or research for a thesis or dissertation, needs to be comprehensive regardless of language. Additionally, variation occurs within a single language. The phrase "football coach" means different things in British and American English. In the United States, this term refers to a person who directs an American football team, that is, the coach.
In British English, a "football coach" refers to a bus (motorcoach) for soccer players. When the words are the same in two or more languages or dialects of a single language, however, such as the word "migration", which means the same in English and French, the different language problem does not occur.

Obsolete Terms

Linguistic change can also prevent complete information retrieval in full-text searching. For example, the phrase "French distemper" is one of many archaic ways of referring to syphilis (the term was also used metaphorically by the English to refer to the French Revolution). Someone researching the history of syphilis and using full-text searching would miss resources that only use the term "French distemper". It is possible in Google Books to find digitized academic library books that only use this term. While it is possible to search every possible variant term to generate a complete search result in full-text searching, this method is not very efficient and requires that one know all the variant terms, an unlikely possibility.

Humanities vs. STM

Overall, despite the above example, we should note that the synonym problem probably occurs more frequently in the humanities than it does in science, technology, and medicine. STM scholarship tends to be more consistent in its terminology, even across languages. For example, the scientific names of plants and animals (binominal nomenclature), such as Tyrannosaurus rex, are the same in most languages. This tendency to use a standard terminology even across languages ameliorates the synonym problem in these fields. (Note, however, that Tyrannosaurus rex is often abbreviated to T-rex, creating an instance of the abbreviation problem described above.) This is not to say that STM fields always use consistent terminology. There are at least sixty different terms that all mean "Atlantic cod", for example. [7] The variation occurs in the common names and not in the scientific names, though. While scientific names tend to be applied consistently within the scientific domain, popular terms for natural things reflect a diverse terminology.

Unlike scientific terminology, humanities terminology varies significantly from one language to another and by time and dialect within a single language. Take the term "short stories" for example. In French it's "nouvelles", in Spanish it's "cuentos", and in German, "Erzählungen". The names for languages themselves differ from language to language too. The names for the German language include alemán, Deutsch, and allemand. Perhaps one area in the humanities where there is some cross-language consistency is music. Many languages share terms like "soprano". Also, as described earlier, regional differences within a single language can lead to problems in information retrieval when using full-text searching. In British English a "solicitor" is a lawyer; in American English, it is someone who goes door to door selling something or asking for contributions for charity.

The Homonym Problem

The homonym problem occurs in full-text searching when a single word or phrase has more than one meaning. Because full-text searching relies on word matching to generate results, a search for a term with several meanings will retrieve documents for all of the meanings, rather than just the one the searcher wants. Homonyms are perhaps the chief cause of low search precision.

True Homonyms

Without metadata, computers do not know the sense of each of a given pair of homonyms. That is, computers cannot effectively disambiguate two concepts when they are called by the same term. For example, a search on "cookies" will pull up documents both about the food and the little files stored on a computer. Searchers are aware of this problem, for it occurs frequently. Many have developed strategies to eliminate unwanted hits and increase the probability of search results matching the particular meaning of the homonym they seek. For example, someone looking for information on computer file cookies might add the word "computer" to the search terms (instead of only searching for "cookies"), because the documents about edible cookies are less likely to have this term in them. Alternatively, a sophisticated searcher might use the "not" operator to try to eliminate unwanted homonyms and increase a search's precision. The searcher might enter "cookies not recipes", for example. While these strategies help, they are not completely effective. Words can have many more meanings than just two, and one often does not anticipate that a search term has homonyms.

Disambiguation of Personal Names

This problem occurs in both full-text searching and in metadata-enabled search systems where the practice of name disambiguation is not employed. Name disambiguation is the process of making each person's name unique in a database. The more common a name in a database, the greater the problem. The problem is made worse by names that also function as other parts of speech, like bill, April, miller, and mike. Because name disambiguation necessarily involves adding metadata, virtually all full-text documents lack this value-added feature. This problem is significant in academic libraries because some style guides prescribe the use of initials instead of given names in citations, making a full-text search for an author's name more difficult.

False Cognates

These are two words that are spelled the same (or almost the same) in two languages but, deceptively, do not mean the same thing. In full-text searching, false cognates are only a problem when they are spelled exactly the same. The problem occurs when a word entered into a search box happens to match a word in a different language that has no semantic relationship to the original search term. For example, the word "location" in French doesn't mean "location" in English; it means a rental or a lease.

Inability to Search by Facets

Sometimes searchers have a need to search by only a specific characteristic or attribute of an online resource, such as author, title, subject, date of creation, etc. These attributes, or facets, help to cluster resources by specific shared characteristics. Clustering, or collocating, is helpful in information retrieval because it helps exclude unwanted resources from a search. Also, clustering matches typical searcher queries, such as "I want all DVDs on agriculture", or "I want all PDF files on land use planning in Utah published before 2000". Pure, full-text searching fails at these tasks, because the search engine doesn't know the format (DVDs) or the subject (agriculture) or the publication date (2000) of the documents it searches. If a search engine does know these attributes, then it's not a pure, full-text search engine.
Instead, it is a metadata-enhanced search engine and draws its ability to sort by facets from metadata assigned to each resource it indexes.

Clustering

Clustering is most helpful when it attempts to solve the homonym problem in subject searches. Here, clustering is the process of grouping and separating out resources by subject. For example, someone searching for information on ocean banks might just enter "banks" as the search term. A search engine with the ability to cluster would then separate out the results that refer to ocean banks from those that refer to banks, the financial institutions. It's probably not uncommon for users who stumble on the homonym problem in a full-text database to do a revised search that includes a second search term, as a strategy for eliminating unwanted documents. For example, a searcher could enter "banks ocean" to eliminate documents in the retrieval that are about banks the financial institutions. This stratagem is not foolproof, however, for there are many resources about financial institutions that contain the words "banks" and "ocean". Increasingly, proprietary databases are performing this type of cluster analysis algorithmically, but with limited success.

Inability to Sort

Just as full-text search engines lack the ability to cluster search results, they also lack the ability to sort results by facets. Sorting plays an important role in and can increase the value of information retrieval because it helps arrange search results in a meaningful and
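The facet and sorting limitations discussed in these last sections can be made concrete with a short sketch. The Python below is an illustration under stated assumptions, not the article's method: the `Resource` fields, the four catalog entries, and the helper names `facet_search` and `word_search` are all hypothetical. It models the example query "all PDF files on land use planning published before 2000" first against assigned metadata and then against words alone.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    title: str
    fmt: str       # format facet, for example "PDF" or "DVD"
    subject: str   # assigned subject heading
    year: int      # publication date facet
    text: str      # the full text a word-matching engine would see


# Hypothetical catalog; every entry is invented for illustration.
CATALOG = [
    Resource("Utah Land Use Plan", "PDF", "Land use planning", 1998,
             "Zoning rules for rural counties."),
    Resource("Zoning in Utah Cities", "PDF", "Land use planning", 1995,
             "Municipal zoning and growth boundaries."),
    Resource("Planning Utah's Future", "PDF", "Land use planning", 2005,
             "Land use planning after 2000."),
    Resource("Farming Today", "DVD", "Agriculture", 2001,
             "A video on agriculture and PDF handouts."),
]


def facet_search(resources, *, fmt=None, subject=None, before=None):
    """Filter on metadata facets, then sort by year and title."""
    hits = [
        r for r in resources
        if (fmt is None or r.fmt == fmt)
        and (subject is None or r.subject == subject)
        and (before is None or r.year < before)
    ]
    return sorted(hits, key=lambda r: (r.year, r.title))


def word_search(resources, word):
    """Pure word matching: the engine sees only the words in the text."""
    target = word.lower()
    return [r for r in resources
            if target in re.findall(r"[a-z0-9]+", r.text.lower())]


if __name__ == "__main__":
    wanted = facet_search(CATALOG, fmt="PDF",
                          subject="Land use planning", before=2000)
    print([r.title for r in wanted])
    print([r.title for r in word_search(CATALOG, "pdf")])
```

In this invented catalog, the facet query returns the two pre-2000 PDF resources in date order, while a word search for "pdf" returns only the DVD whose text happens to mention PDF handouts. The sort key is available only because the year is stored as metadata; word matching alone has nothing comparable to sort on.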
