The Weaknesses of Full-Text Searching


by Jeffrey Beall

Jeffrey Beall is a Metadata Librarian/Assistant Professor, Auraria Library, University of Colorado Denver, 1100 Lawrence Street, Denver, CO 80204, USA; jeffrey.beall@ucdenver.edu.

The Journal of Academic Librarianship, Volume 34, Number 5, September 2008, pages 438-444

This paper provides a theoretical critique of the deficiencies of full-text searching in academic library databases. Because full-text searching relies on matching words in a search query with words in online resources, it is an inefficient method of finding information in a database. This matching fails to retrieve synonyms, and it also retrieves unwanted homonyms. Numerous other problems also make full-text searching an ineffective information retrieval tool. Academic libraries purchase and subscribe to numerous proprietary databases, many of which rely on full-text searching for access and discovery. An understanding of the weaknesses of full-text searching is needed to evaluate the search and discovery capabilities of academic library databases.

INTRODUCTION

Definition of Full-Text Searching

Full-text searching is the type of search a computer performs when it matches terms in a search query with terms in individual documents in a database and ranks the results algorithmically. This type of searching is ubiquitous on the Internet and includes the type of natural language search we typically find in commercial search engines, Web site search boxes, and in many proprietary databases. The term full-text searching has several synonyms and variations, including keyword searching, algorithmic searching, stochastic searching, and probabilistic searching.

Metadata-Enabled Searching

There is one other main type of online searching. This is metadata-enabled searching, which is also called deterministic searching. In this type of search, searchers pre-select and search individual facets of an information resource, such as author, title, and subject. In this type of search, the system matches terms in the search with terms in structured metadata and generates results, often a browse display sorted alphanumerically. Author, title, and subject searches in online library catalogs are examples of this type of search.

Importance

Understanding the weaknesses of full-text searching is important for academic libraries for several reasons. First, academic libraries purchase or subscribe to numerous proprietary databases, including many full-text databases. When they decide whether to pay for a particular database, libraries need to evaluate the search engine or system that accompanies the database. When these databases provide only full-text searching and not metadata-enabled searching, resource discovery within the resource may be difficult, putting libraries in the position of paying for content that is hard to find. Library-created databases, such as institutional repositories, are another area where an understanding of the weaknesses of full-text searching is needed. Providing only full-text access to a library's digital objects may not provide resource discovery of sufficient quality for the collection's users. Academic libraries need to evaluate these collections and the available search engines and systems and select the best one for their particular databases. Finally, much current debate centers on the need for online library catalogs versus the ability to access academic library materials through a commercial search engine. A thorough knowledge of the weaknesses of full-text searching adds to the debate and helps academic librarians in the evaluation, recommendation and design of library database search engines.
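
The two search types defined in the Introduction can be contrasted in a short Python sketch. Everything in it is an assumption made for illustration: the three records, their field values, the scoring rule (counting query-word occurrences), and the function names `full_text_search` and `metadata_search` do not come from any real library system.

```python
import re

# Invented sample records: structured metadata (author, title, subject)
# plus the full text of each resource.
RECORDS = [
    {"author": "Smith, Ann", "title": "Harbor Engineering", "subject": "Harbors",
     "text": "Harbor design and harbor dredging methods."},
    {"author": "Jones, Lee", "title": "Ports of Britain", "subject": "Harbors",
     "text": "A history of each harbour on the British coast."},
    {"author": "Smith, Ann", "title": "Coastal Banks", "subject": "Ocean bottom",
     "text": "Fishing banks near the harbor mouth."},
]


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def full_text_search(query, records):
    """Match query words against document words; rank by match count."""
    terms = tokenize(query)
    scored = []
    for rec in records:
        words = tokenize(rec["text"])
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, rec["title"]))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


def metadata_search(facet, value, records):
    """Match one pre-selected facet exactly; return an alphabetical browse list."""
    hits = [rec["title"] for rec in records if rec[facet].lower() == value.lower()]
    return sorted(hits)


print(full_text_search("harbor", RECORDS))
print(metadata_search("subject", "Harbors", RECORDS))
```

With these invented records, a full-text search for "harbor" skips the record whose text only uses "harbour", while a subject search on the assigned heading returns it, because the match is made against the structured metadata rather than against the words in the text.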

Objective

The purpose of this article is to list and describe the chief weaknesses of full-text searching. We limit the scope of this article to true full-text searching that automatically matches words entered in the search box with words in resources a database contains to generate results. This study does not include in its analysis new, semantic search engines such as Hakia, which stores metadata for each Web page indexed and uses that metadata, along with word matching, to generate search results. Indeed, many popular search engines do incorporate metadata into their searches. For example, the Google advanced search allows for limiting search results to a specific language. This search limit is generated by language metadata that the search engine assigns to each Web page it indexes (the accuracy of this automatically-generated language metadata may not always be high).

Still, the great majority of the searches performed on the Internet are of the type this paper seeks to study: full-text searching that matches words in a search box with words in online documents or online text. This study is not a comparison of full-text searching and metadata-enabled searching. Both of these two types of searching have their various strengths and weaknesses. This article seeks chiefly to describe the weaknesses of full-text searching.

This paper is a theoretical critique of full-text searching and focuses on the type of searching done in academic libraries. It describes and categorizes the ways in which full-text searching can fail, failures that most searchers have likely encountered themselves. While outside the scope of this paper, quantitative research that measures the extent of these problems would be valuable and would further inform the debate.

PREVIOUS STUDIES

Most information retrieval and information discovery has transitioned from searching dominated by metadata-enabled searching (academic library card catalogs) to the present full-text or algorithmic searching (Web search engines). This transition occurred without sufficient analysis of the weaknesses of full-text searching. Perhaps if searchers understood the number of resources they were missing because of full-text searching's reliance on word matching to generate retrieval, they would be less satisfied with it. Generally, books and articles on information retrieval often cite one or two examples of the weaknesses of full-text searching; few have been comprehensive in their analyses, as this one seeks to be.

Among those to write about the weaknesses of full-text searching is Thomas Mann, a reference librarian at the Library of Congress. He states "Keyword searching fails to map the taxonomies that alert researchers to unanticipated aspects of their subjects. It fails to retrieve literature that uses keywords other than those the researcher can specify; it misses not only synonyms and variant phrases but also all relevant works in foreign languages. Searching by keywords is not the same as searching by conceptual categories".¹ Here he makes reference to the synonym problem in full-text searching (and he prefers to use the term keyword searching rather than full-text searching, providing yet another example of the synonym problem). Mann also states,

    When all is said and done, keyword searching necessarily entails the problem of the unpredictability of the many variant ways the same subject can be expressed, within a single language ("capital punishment", "death penalty") and across multiple languages ("peine de mort", "pena capitale"). And no software algorithm will solve this problem when it is confined to dealing with only the actual words that it can retrieve from within the given documents (or citations or abstracts) themselves.²

Beall³,⁴ presents two brief but more complete looks at the problems of full-text searching. The present paper aims for a more comprehensive analysis. Moreover, Beall⁵ introduces the term "search fatigue" to describe the feelings of frustration searchers feel when they are unsuccessful in finding information due to the weaknesses of full-text searching. A recent study by Hemminger, Saelim, Sullivan, and Vision⁶ compares full-text searching to metadata searching and finds that "it may be time to make the transition to direct full-text searching as the standard". However, later in the article the authors concede that their study may not be truly representative, for it compared the two searching modes using gene names, which are consistently used in the literature they studied.

THE WEAKNESSES OF FULL-TEXT SEARCHING

The Synonym Problem

Perhaps the biggest and most pervasive weakness of full-text searching is the synonym problem. This problem occurs because there is often more than one way to name or express a given concept, such as a person, place, or thing. There are several different aspects of the synonym problem.

True Synonyms

Synonyms are two words that mean the same thing in one language. In full-text searching, synonyms hinder effective information retrieval when a searcher enters a term in the search box and the system only returns results that match the term and does not return results that refer to the concept only by one of its synonyms. For example, if a searcher seeks information on leprosy, he would likely enter "leprosy" in the search box and expect complete results. However, some online documents refer to this disease as "Hansen's disease".
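
The leprosy example can be sketched in a few lines of Python. The three documents, their identifiers, and the helper names (`matches`, `search`, `recall`) are all invented for illustration; this is a minimal sketch of literal word matching under those assumptions, not the behavior of any particular database.

```python
import re

# Invented documents: all three are about the same disease,
# but they do not all use the same name for it.
DOCS = {
    "doc1": "Leprosy, also called Hansen's disease, is treated with antibiotics.",
    "doc2": "New cases of leprosy were reported last year.",
    "doc3": "Hansen's disease remains curable when detected early.",
}


def matches(query, text):
    """True when the query phrase occurs in the text as whole words."""
    pattern = r"\b" + re.escape(query.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def search(query, docs):
    """Return the ids of documents that literally contain the query."""
    return sorted(doc_id for doc_id, text in docs.items() if matches(query, text))


def recall(retrieved, relevant):
    """Fraction of the relevant documents that the search retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)


relevant = ["doc1", "doc2", "doc3"]   # every document concerns the disease
hits = search("leprosy", DOCS)
print(hits, recall(hits, relevant))
```

In this toy collection the search for "leprosy" retrieves the two documents that contain the word and misses the one that uses only the synonym, so recall is two thirds even though every document is relevant.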
While it's true that many documents will contain both terms, thus enabling access regardless of which term is searched, a certain percentage of the documents will only contain one term, thus providing an incomplete retrieval.

Variant Spellings

Words that mean the exact same thing can sometimes be spelled differently, as in variant British and American spellings. In full-text searching, a search for "harbour" will miss results that use the spelling "harbor". It is true that many full-text search engines have developed methods for overcoming this problem; searchers can use wild card or truncation operators to retrieve multiple spellings of a word. But there are also variant spellings within a single dialect of a language, and these differences are often beyond the scope of the truncation or wild card operators. For example, in American English the spellings "donut" and "doughnut" are both common. Unlike the case of synonyms, where in a single document both synonyms may appear, spelling tends to be consistent within a document. A document about harbors written in the United States is unlikely to also contain the spelling "harbour". This means that there is a smaller chance of retrieving documents with variant spellings than there is with true synonyms.

Shortened Forms of Terms

Abbreviations, acronyms, and initialisms can hinder recall in full-text search systems because a document may contain only the short form of the word or only the long form. When this occurs, someone searching on the short form (PETA) will miss in his retrieval documents that only use the long form (People for the Ethical Treatment of Animals). Alternately, searching on the long form of the term, like Magnetic Resonance Spectroscopy, will miss documents that only refer to the concept by its short form, MRS.

Different Languages or Dialects

When searching a term in one language, a searcher will not match documents that contain the foreign-language version for that concept, unless the two terms happen to be cognates. For example, if you search the term "butter", the search will miss documents that only refer to this by its Spanish equivalent "mantequilla". For many searchers, this exclusion is not a problem; they prefer their search results to be in just one language. However, scholarship supported by academic libraries, such as medical research, or research for a thesis or dissertation, needs to be comprehensive regardless of language. Additionally, variation occurs within a single language. The phrase "football coach" means different things in British and American English. In the United States, this term refers to a person who directs an American football team, that is, the coach.
In British English, a "football coach" refers to a bus (motorcoach) for soccer players.

When the words are the same in two or more languages or dialects of a single language, however, such as the word "migration", which means the same in English and French, the different language problem does not occur.

Obsolete Terms

Linguistic change can also prevent complete information retrieval in full-text searching. For example, the phrase "French distemper" is one of many archaic ways of referring to syphilis (the term was also used metaphorically by the English to refer to the French Revolution). Someone researching the history of syphilis and using full-text searching would miss resources that only use the term "French distemper". It is possible in Google Books to find digitized academic library books that only use this term. While it is possible to search every possible variant term to generate a complete search result in full-text searching, this method is not very efficient and requires that one know all the variant terms, an unlikely possibility.

Humanities vs. STM

Overall, despite the above example, we should note that the synonym problem probably occurs more frequently in the humanities than it does in science, technology, and medicine. STM scholarship tends to be more consistent in its terminology, even across languages. For example, the scientific names of plants and animals (binominal nomenclature) are the same in most languages (Tyrannosaurus rex, for example). This tendency to use a standard terminology even across languages ameliorates the synonym problem in these fields. (Note, however, that Tyrannosaurus rex is often abbreviated to T-rex, creating an instance of the abbreviation problem described above.) This is not to say that STM fields always use consistent terminology. There are at least sixty different terms that all mean "Atlantic cod", for example.⁷ The variation occurs in the common names and not in the scientific names, though. While scientific names tend to be applied consistently within the scientific domain, popular terms for natural things reflect a diverse terminology.

Unlike scientific terminology, humanities terminology varies significantly from one language to another and by time and dialect within a single language. Take the term "short stories" for example. In French it's "nouvelles", in Spanish it's "cuentos", and in German, "Erzählungen". The names for languages themselves differ from language to language too. The names for the German language include alemán, Deutsch, and allemand. Perhaps one area in the humanities where there is some cross-language consistency is music. Many languages share terms like "soprano". Also, as described earlier, regional differences within a single language can lead to problems in information retrieval when using full-text searching. In British English a "solicitor" is a lawyer; in American English, it is someone who goes door to door selling something or asking for contributions for charity.

The Homonym Problem

The homonym problem occurs in full-text searching when a single word or phrase has more than one meaning. Because full-text searching relies on word matching to generate results, a search for a term with several meanings will retrieve documents for all of the meanings, rather than just the one the searcher wants. Homonyms are perhaps the chief cause of low search precision.

True Homonyms

Without metadata, computers do not know the sense of each of a given pair of homonyms. That is, computers cannot effectively disambiguate two concepts when they are called by the same term. For example, a search on "cookies" will pull up documents both about the food and the little files stored on a computer. Searchers are aware of this problem, for it occurs frequently. Many have developed strategies to eliminate unwanted hits and increase the probability of search results matching the particular meaning of the homonym they seek. For example, someone looking for information on computer file cookies might add the word "computer" to the search terms (instead of only searching for "cookies"), because the documents about edible cookies are less likely to have this term in them. Alternatively, a sophisticated searcher might use the "not" operator to try to eliminate unwanted homonyms and increase a search's precision. The searcher might enter "cookies not recipes", for example. While these strategies help, they are not completely effective. Words can have many more meanings than just two, and one often does not anticipate that a search term has synonyms.

Disambiguation of Personal Names

This problem occurs in both full-text searching and in metadata-enabled search systems where the practice of name disambiguation is not employed. Name disambiguation is the process of making each person's name unique in a database. The more common a name in a database, the greater the problem. The problem is made worse by names that also function as other parts of speech, like bill, April, miller, and mike. Because name disambiguation necessarily involves adding metadata, virtually all full-text documents lack this value-added feature. This problem is significant in academic libraries because some style guides prescribe the use of initials instead of given names in citations, making a full-text search for an author's name more difficult.

False Cognates

These are two words that are spelled the same (or almost the same) in two languages but, deceptively, do not mean the same thing. In full-text searching, false cognates are only a problem when they are spelled exactly the same. The problem occurs when a word entered into a search box happens to match a word in a different language that has no semantic relationship to the original search term. For example, the word "location" in French doesn't mean "location" in English; it means a rental or a lease.

Inability to Search by Facets

Sometimes searchers have a need to search by only a specific characteristic or attribute of an online resource, such as author, title, subject, date of creation, etc. These attributes, or facets, help to cluster resources by specific shared characteristics. Clustering, or collocating, is helpful in information retrieval because it helps exclude unwanted resources from a search. Also, clustering matches typical searcher queries, such as "I want all DVDs on agriculture", or "I want all PDF files on land use planning in Utah published before 2000". Pure, full-text searching fails at these tasks, because the search engine doesn't know the format (DVDs) or the subject (agriculture) or the publication date (2000) of the documents it searches. If a search engine does know these dates, then it's not a pure, full-text search engine.
Instead, it is a metadata-enhanced search engine and draws its ability to sort by facets from metadata assigned to each resource it indexes.

Clustering

Clustering is most helpful when it attempts to solve the homonym problem in subject searches. Here, clustering is the process of grouping and separating out resources by subject. For example, someone searching for information on ocean banks might just enter "banks" as the search term. A search engine with the ability to cluster would then separate out the results that refer to ocean banks from those that refer to banks, the financial institutions. It's probably not uncommon for users who stumble on the homonym problem in a full-text database to do a revised search that includes a second search term, as a strategy for eliminating unwanted documents. For example, a searcher could enter "banks ocean" to eliminate documents in the retrieval that are about banks the financial institutions. This stratagem is not foolproof, however, for there are many resources about financial institutions that contain the words "banks" and "ocean". Increasingly, proprietary databases are performing this type of cluster analysis algorithmically, but with limited success.

Inability to Sort

Just as full-text search engines lack the ability to cluster search results, they also lack the ability to sort results by facets. Sorting plays an important role in and can increase the value of information retrieval because it helps arrange search results in a meaningful and
