INTRODUCTION TO MODERN INFORMATION RETRIEVAL I

2y ago
139 Views
2 Downloads
457.72 KB
15 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Adalynn Cowell
Transcription

INTRODUCTION TO MODERNINFORMATION RETRIEVAL iSCIEN CESER IES

Introductionto ModemInformation Retrieval

McGraw-Hill Computer Science SeriesAhuja: D esign and A nalysis o f Computer Com m unication N etw orksAllen: A natom y o f LISPBarbacci and Siewiorek: T he D esign and A nalysis o f Instruction Set P rocessorsBell and Newell: Computer Structures: Readings and E xam plesDonovan: S ystem s ProgrammingGear: C om puter Organization and ProgrammingGivone: Introduction to Sw itching Circuit TheoryGoodman and Hedetniemi: Introduction to the D esign and A nalysis o f Algorithm sHam acher, Vranesic, and Zaky: Computer OrganizationHamming: Introduction to A pplied Num erical A nalysisHayes: C om puter A rchitecture and OrganizationHellerman: Digital Computer S ystem PrinciplesHellerman and Conroy: Com puter System PerformanceKatzan: Microprogramming PrimerKeller: A First C ourse in C om puter Programming U sin g P A SC A LLiu: E lem ents o f D iscrete M athem aticsLiu: Introduction to Combinatorial M athem aticsMacEwen: Introduction to Com puter System s: U sin g the PDP-11 and PascalM adnick and Donovan: Operating S ystem sManna: M athem atical T heory o f ComputationNewman and Sproull: Principles o f Interactive C om puter GraphicsNilsson: Problem -Solving M ethods in Artificial IntelligencePayne: Introduction to Sim ulation: Programming T echniques and M ethods o f A nalysisRice: Matrix Com putations and M athem atical SoftwareSalton and McGill: Introduction to M odern Information RetrievalShooman: Software Engineering: D esign , R eliability, and M anagementSiewiorek, Bell, and Newell: Com puter Structures: Principles and E xam plesStone: Introduction to Com puter Organization and D ata StructuresStone and Siewiorek: Introduction to Computer Organization and D ata Structures:PDP-11 EditionTonge and Feldman: Computing: An Introduction to Procedures and Procedure-FollowersTremblay and Bunt: A n Introduction to Computer Science: A n Algorithm ic ApproachTremblay and Bunt: An Introduction to Computer Science: A n Algorithm ic Approach,Short EditionTremblay and Manohar: D iscrete M athem atical Structures with A pplications to C om puter S cien ce.Tremblay and Sorenson: A n Introduction to D ata Structures with A pplicationsTucker: Programming LanguagesW iederhold: D atabase D esignMcGraw-Hill Advanced Computer Science SeriesDavis and Lenat: K n ow ledge-B ased S ystem s in Artificial IntelligenceKogge: T he Architecture o f Pipelined ComputersLindsay, Buchanan, Feigenbaum , and Lederberg: A pplications o f Artificial Intelligencefor Organic Chemistry: T he Dendral ProjectNilsson: Problem -Solving M ethods in Artificial IntelligenceW ulf, Levin, and Harbison: Hydra/C.m mp: An Experim ental Computer S ystem

Introductionto ModemInformation RetrievalGerard SaltonProfessor of Computer ScienceCornell UniversityMichael J. McGillAssociate Professor of Information StudiesSyracuse UniversityMcGraw-Hill Book CompanyNew York St. Louis San Francisco Auckland Bogota HamburgJohannesburg London Madrid Mexico Montreal New DelhiPanama Paris Sao Paulo Singapore Sydney Tokyo Toronto

This book was set in Times Roman by Progressive Typographers.The editors were Charles E. Stewart and James E. Vastyan;the production supervisor was John Mancia.The drawings were done by VIP Graphics.R. R. Donnelley & Sons Company was printer and binder.INTRODUCTION TO MODERN INFORMATION RETRIEVALCopyright 1983 by McGraw-Hill, Inc. All rights reserved. Printed in the United Statesof America. Except as permitted under the United States Copyright Act of 1976, no partof this publication may be reproduced or distributed in any form or by any means, orstored in a data base or retrieval system, without the prior written permission of the pub lisher.1 2 3 4 5 6 7 8 9 0 DODO 8 9 8 7 6 5 4 3 2ISBNLibrary of Congress Cataloging in Publication DataSalton, Gerard.Introduction to modern information retrieval.(McGraw-Hill computer science series)Includes indexes.1. Information storage and retrieval systems.I. McGill, Michael J. II. Title. III. Series.Z699.S313025'.0481-20843ISBN 0-07-054484-0AACR2

ContentsPrefaceCHAPTER 1CHAPTER 2xiInformation Retrieval: An Introduction10123P reviewO verviewChanging T echnologyInformation System T ypesA Information Retrieval SystemsB Data Base Management SystemsC Management Information SystemsD Decision Support SystemsE Question-Answering Systems4 Functional Approach to Information Retrieval5 Sim ple F ile StructuresA Linear ListsB Ordered Sequential Files*C Indexed Files6 Summary11157788991012121316 21Systems Based on Inverted Files 24i.0 Preview1 General Considerationsj.24j24i v

vlCHAPTER 3CONTENTSA Boolean ExpressionsB Order of Operations2 A djacency and Term Frequency FeaturesA Adjacency OperationsB Frequency Information3 C om m ercial Inverted File S ystem sA The DIALOG System*B The STAIRS SystemC The Bibliographic Retrieval Services (BRS) SystemD The MEDLARS System .E The ORBIT SystemF The Information BankG The LEXIS System4 E nhancem ents o f B asic R etrieval Strategy2526282830303034414245454646Text Analysis and Automatic 9499P reviewIndexing Environm entManual and A utom atic IndexingA utom atic Term E xtraction and W eightingA General Considerations*B The Inverse Document Frequency Weight**C The Signal-Noise Ratio*D The Term Discrimination Value4 A Sim ple A utom atic Indexing P rocess5 A utom atic Term A ssociation and U se o f C ontextA Thesaurus Rules*B Automatic Thesaurus ConstructionC Thesaurus UseD Construction of Term PhrasesE Automatic Sentence Extraction6 Som e T heoretical A pproaches*A The Use of Linguistic Methods*B Fragment Encoding**C Probabilistic Information Retrieval7 A utom atic Indexing E xperim entsCHAPTER 4The SMART and SIRE ExperimentalRetrieval Systems0 P review1 Introduction2 The SM A R T S ystem Environm ent*A Vector Representation and Similarity Computation*B Vector ManipulationC Vector Generation3 SM A RT S ystem Procedures*A Automatic Indexing*B Automatic Document Classification*C Relevance Feedback Operations*D Dynamic Document Space118118118120120123127130130137140145

CONTENTS4CHAPTER 5A utom atic E nhancem ents o f C onventional Retrieval*A Document Ranking and Term Weighting*B Retrieval through Man-Machine Dialogue andLocal Clustering1570 P review1 Introduction2 E valuation o f Retrieval E ffectiven ess*C The Computation of Recall and Precision3 M easures o f Retrieval E ffectiv en essA Measurement Problems*B Recall, Precision, and Fallout**C Single-Valued Measures**D Utility Measure4 Evaluation o f S ystem C ost and E fficiencyA System Tradeoffs**B Cost Analysis5 91Retrieval Refinements19901*23P reviewIntroductionV ector Similarity FunctionsTerm W eighting S ystem sA Principal Weighting Strategies*B Evaluation of Weighting Systems**C Term Weighting in Boolean Query Systems4 File Clustering*A Main Considerations*B Classification Methods*C Cluster Search Evaluation**D Automatic Pseudoclassification5 D ynam ic Query Adjustm entA General Considerations*B Feedback Theory6 Citation P rocessingA Basic Citation Properties*B Main Citation Usage7 38240244246246247250Natural Language Processing2570 P review1 C om ponents o f Natural Language S ystem sA Interest in Natural Language Processing257258258*C Feedback VariationsD Dynamic Document SpaceCHAPTER 7151Retrieval EvaluationA System ComponentsB Evaluation Viewpoints and the Relevance ProblemCHAPTER 6146146

viiiCONTENTSB Levels of Language ProcessingC Language Understanding Systems2 Language Processing and Inform ation Retrieval3 Syntactic A nalysis S ystem s*A Phrase Structure Grammars*B Transformational Grammars**C Augmented Transition Network Grammars4 Syntactic A nalysis in Information Retrieval5 Linguistic M ethods in Q uestion A nsw ering**A Knowledge Representation6 ss to Information: Hardware andSoftware Approaches303B Question-Answering Environment*C Linguistic Features in Question AnsweringCHAPTER 80 P review1 C onventional Storage D ev ic e sA Punched CardsB Magnetic Tape**E String Matching Using the Finite State Automaton Model**F The Boyer and Moore String Matching Method4 22324326328329329333338339340345348Data Management Systems354CDEFMagnetic DisksRandom Access Storage DevicesData CellAccess to Storage2 Hardware E nhancem ent o f RetrievalA Microprocessors and Processing ChipsB General Characteristics of Retrieval HardwareC Parallel Processors*D Associative Processors*E Fast Computations Using Array Processors*F Content Addressable Segment Sequential Memory (CASSM)*G Relational Associative Processors (RAP)*H Data Base Computer (DBC)I Other Special-Purpose Devices3 T ext A c c e ss M ethods*A*B*CDCHAPTER 9Dictionary Search Methods for Static FilesDictionary Search Methods for Dynamic FilesMultiple Key Dictionary SearchText Scanning Machines0 P review1 T ypes o f Inform ation S ystem sA Information Retrieval and Question AnsweringB Data Management Systems. 2 The Structure o f Data B ase M anagem ent S ystem sA Basic ConceptsB Structure of Information Items354355355357359359362

CONTENTSix*C The Relational Data Base Model*D The Hierarchical Data Base Model*E The Network Data Base Model3 Query P rocessing*A Query Language Types*B Processing Strategies4 D ata Quality*A Integrity and Security**B Concurrent Data Base Operations**0 Restart and Recovery Methods**D Distributed Data Bases5 SummaryCHAPTER 10Future Directions in Information Retrieval0 P review1 Introduction2 T echnological D evelopm entsA Automatic Document InputB Optical Storage3 Inform ation T heories and M odelsA Natural Language Processing**B Fuzzy Set Theory**C Term Dependency Models*D Composite Document Representations4 A dvanced Information S ystem sA Mixed Information Retrieval SystemsB Personal Computing and Paperless Information Systems5 C onclusionIndexesName IndexSubject 410410413418418421422425426426428430437

PrefaceAn information retrieval system is an information system, that is, a systemused to store items of information that need to be processed, searched, re trieved, and disseminated to various user populations. Information retrievalsystems thus share many of the concerns of other information systems, such asdata base management and decision support systems. In particular, it is neces sary to choose efficient organizations for the stored records, rapid search pro cedures capable of finding items of interest in specific cases, and effectivemethods for disseminating the retrieved data and interacting with the systemusers. Information retrieval systems are normally used to handle bibliographicrecords and textual data. This is in contrast to data base management and man agement information systems that process structured data, and to question-an swering systems that use complex information organizations and inference pro cedures designed to answer questions in particular subject areas. In anextended sense, however, any information system designed to augment thestate of human knowledge and to aid human activities does utilize concepts andprocedures from information storage and retrieval.Today, information processing activities are carried out with the assist ance of automatic equipment. Thus, a direct link exists between information1xi

xiiPREFACEretrieval and computer science. On the other hand, information retrieval alsotakes on aspects of behavioral science, since retrieval systems are designed toaid human activities.Most practitioners interested in the design and operations of actual re trieval systems are concerned only about applied computer science. However,many topics in theoretical computer science are also of direct importance toinformation retrieval, including, for example, information theory, probabilitytheory, computational semantics, and programming theory and algebra. Tech niques from these disciplines may be used to build information retrieval modelsand to obtain insights into various aspects of retrieval theory and practice.Although information retrieval is mentioned in many computer sciencecurriculum proposals, retrieval courses are often replaced in practice by ma terial on data structures and data base systems where attractive approacheshave been developed for formalization and abstraction. The study of informa tion retrieval is thus frequently carried out in library science, information sci ence, and information management schools. In these environments, the mathe matical foundations necessary to make a substantial impact are often omittedfrom retrieval system courses.This text is aimed at increasing the understanding of modern informationretrieval by students of computer science as well as by students of informationscience and management science. The book covers the basic aspects of infor mation retrieval theory and practice, and also relates the various techniques tothe design and evaluation of complete retrieval systems. The book is introduc tory in the sense that no prior knowledge of retrieval methodology is assumed;it is modern because currently active trends and developments are examined.The text concentrates in particular on the description of the concepts, func tions, and processes of interest in retrieval rather than on the detailed opera tions of any one existing retrieval system. In order to keep the material at anintroductory level, the more advanced mathematical aspects of retrieval theoryhave been deemphasized or simplified. The text should thus be accessible tostudents with only a cursory knowledge of the operations of digital computersand only a superficial exposure to computer programming. More advancedreaders can supply the relevant mathematical background by consulting the ref erences given.The text begins With an introduction and a description of the main retrievalprocesses incorporated into existing, operational systems based on keyword in dexing and Boolean query formulations (Chapters 1 and 2). Chapter 3 containsa detailed explanation of modern automatic indexing techniques with evalua tion results and assessments of the importance and practical usefulness of thetechniques. Experimental retrieval systems, based in part on fully automaticanalysis, search, and retrieval methods, are covered in Chapter 4 with emphasison the design of the SMART and SIRE systems developed by the authors. Themain evaluation techniques used to assess the effectiveness and efficiency ofinformation retrieval systems are covered in Chapter 5 with emphasis on theuse of the well-known recall and precision measures. Chapter 6 deals with im portant techniques usable in the design of future systems, such as automatic

PREFACExiiiclassification methods, query negotiation, and reformulation processes used inon-line environments, collection restructuring, and bibliographic citation pro cessing. Language processing methods useful in retrieval, including currentsyntactic and semantic methodologies, and artificial intelligence approaches tolanguage understanding are examined in Chapter 7. Chapter 8 introduces spe cialized hardware useful in retrieval, such as parallel processing devices, arrayprocessors, and special back-end search devices useful for manipulating andsearching large data bases. Also covered are modern techniques used for dic tionary searching and for automatic text scanning systems. The relationship be tween information retrieval and other information systems, such as data baseand decision support systems, is examined in detail in Chapter 9, and the cur rent work in data base processing and data base management is described. Fi nally the expected future directions and developments in information retrievalare covered in Chapter 10, including the importance and likely effect of per sonal computers, word processing, advanced display systems, and paperlessinformation systems.The text should be useful for computer science as well as library scienceand information science students on a junior-senior level in college or for begin ning graduate students. The book can also serve the professional reader as anintroduction to the design and operations of information retrieval and manage ment information systems. A common core for computer science as well as in formation science readers is contained in Chapters 1, 3, 4, and 10. Computerscience audiences should profit in addition from a complete treatment of Chap ters 5, 6, and 8, since these chapters emphasize the more mathematical aspectsof the field and the connection to software and hardware implementations. Li brary and information science readers, on the other hand, should concentrateon Chapters 2 ,7 , and 9 in addition to the core to gain a thorough understandingof conventional retrieval and other information systems and of the languageanalysis methods useful for processing natural language texts.To simplify the reader’s task, the material has been graded for technicaldifficulty:*Sections marked with a single asterisk contain technical material that maybe difficult for some readers. Often a modest computer science backgroundmay be useful, or an acquaintance with elementary algebra or basic probabilitytheory. This material is considered important, and the reader is encouraged toread the section and obtain an understanding of the content.** Sections identified by two asterisks contain technical material at a some what more detailed level. A particular procedure may be covered in detail; al ternatively, a theory may be introduced requiring some technical know-how.Readers who find this material difficult may wish to skim the section rather thandwell on the details.Sections not marked by * or ** should be accessible to all readers withoutspecial background.The following sample curricula will provide complete coverage Of the prin cipal aspects of retrieval system design and operations:

xivPREFACEComputer scienceInformation scienceand management scienceChapter 1(Core)What is information retrieval?Functional view of retrievalWhat is information retrieval?Functional view of retrievalChapter 2(IS emphasis)Basic set theory inherent in listprocessingStandard Boolean operationsStandard retrievalConventional systemsChapter 3(Core)Theory of automatic indexingTerm weighting and associativeindexingBasic evaluation resultsTheory of automatic indexingTerm weighting and associativeindexingBasic evaluation resultsChapter 4(Core)The SMART systemRelevance feedback and clustersearchingWeighted retrieval in Booleansystems (SIRE)The SMART systemRelevance feedback and clustersearchingWeighted retrieval in Booleansystems (SIRE)Chapter 5(CS emphasis)Mathematics of evaluationEvaluation parameters andcomputational aspectsBasic definition ofrecall-precision parametersCost evaluationChapter 6(CS emphasis)Term relevance theoryCluster generation and searchPseudoclassificationDocument space alteration anddynamic file processingUse of citations in informationsearch systemsChapter 7(IS emphasis)ATN grammarsCriterion tree processingConcept representationLanguage analysisSyntax and semanticsContext-free andcontext-sensitive grammarsConcept representation ininformation systemsChapter 8(CS emphasis)Basic hardware devicesParallel processing techniquesMicroprocessorsDictionary search methodsString search algorithmsBasic hardware and parallelprocessing techniquesMicroprocessorsText scanning machinesChapter 9(CS and ISemphasis)Relationship of informationretrieval to other informationsystemsData base systems and modelsRelationship of informationretrieval to other informationsystemsFile processing, accessing,searchingFile security, data structuresChapter 10(Core)Future directionsText inputDistributed architectureNew retrieval theoriesAdvanced information systemsFuture directionsText inputDistributed architectureMixed information retrievalsystemsPaperless information systems

PREFACExvBy limiting the coverage to the more basic aspects of the various topics,the material can be assimilated in a one-semester course. A second semestermay be required if the various algorithms and techniques—for word stem gen eration, pseudoclassification, string searching, and so on— are covered in de tail, and if additional sources are consulted.A number of graduate students at Cornell and Syracuse universities havemade substantial contributions to the design of the SMART and SIRE systems,including, in particular: Robert E. Williamson, Clement T. Yu, Chung ShuYang, Anita Wong, Harry Wu, and Edward A. Fox at Cornell and TerryNoreault, Jennifer Kuehn, Judy Tessier, and Matthew Koll at Syracuse. Sev eral readers have reviewed the manuscript and made many valuable commentsand suggestions, including, in particular: Professor Richard H. Austing of theUniversity of Maryland, Professor Michael D. Cooper of the University of Cal ifornia at Berkeley, Professor Jeffrey Katzer of Syracuse University, Dr. Mi chael E. Lesk of Bell Laboratories, Professor J. F. Nunamaker of the Univer sity of Arizona, and Professor Linda C. Smith of the University of Illinois. Theauthors have also profited from discussion with many colleagues and friends.Some early material was typed by Peggy Montgomery at Syracuse. Geri Pinkham at Cornell has typed the complete manuscript over several times with un usual speed and competence; in the process she became familiar with the intri cacies of an automated text editing system. Without her help the text wouldhave remained in manuscript form for a long time to come. Edward Fox andElena Seifrid have also helped to produce a version ready for automatic type setting. The writers are greatly indebted to all these individuals for guidanceand assistance.Gerard SaltonMichael J. McGill

McGraw-Hill Computer Science Series Ahuja: Design and Analysis of Computer Communication Networks Allen: Anatomy of LISP Barbacci and Siewiorek: The Design and Analysis of Instruction Set Processors . data base management and

Related Documents:

The 7 Basic Principles of Retrieval Practice Following are the seven basic principles of retrieval practice. 1. Keep It Short and Simple Retrieval practice should only take a few of minutes of class time and should be easy to explain, set up, and conclude. A perfect example is Agarwal and Bain’s (2019) retrieval

Manipulations of Initial Retrieval Practice Conditions 7 Retrieval Practice Compared to Restudy and Elaborative Study 7 Comparisons of Recall, Recognition, and Initial Retrieval Cueing Conditions 8 Retrieval Practice With Initial Short-Answer and Multiple-Choice Tests 9 Positive and Negative Effects of Initial Multiple-Choice Questions 11

[B]. RETRIEVAL PHASE The retrieval phase is the reverse process of the storage phase. In this phase another automatic monorail will arrive at the retrieval reference point without any load (package) on it. The proximity sensor will sense it, the sensor will change to on state which sends the signal to PLC alerting it about the request of retrieval.

Retrieval practice with short-answer, multiple-choice, and hybrid tests Megan A. Smith and Jeffrey D. Karpicke Department of Psychological Sciences, Purdue University, West Lafayette, IN, USA (Received 29 May 2013; accepted 29 July 2013) Retrieval practice improves meaningful learning, and the most frequent way of implementing retrieval

Retrieval Practice – Why Karpicke & Roediger, 2008 Learning pairs of words (Swahili – English) 15 Retrieval Practice – Why Retrieval Practice strengthens memory and interrupts forgetting. Retrieval Practice makes that knowledge easier to retrieve in the future. Neural pathways that make up a body of learning get stronger. 16

The Role of Episodic Context in Retrieval Practice Effects Joshua W. Whiffen and Jeffrey D. Karpicke Purdue University The episodic context account of retrieval-based learning proposes that retrieval enhances subsequent retention because people must think back to and reinstate a p

Lead Retrieval Solution Guide 4 Here are key reasons to share with your exhibitors why they should purchase lead retrieval: 1. Buyers Abound According to CEIR, 81% (or 4 out of 5 attendees on the show floor) have buying authority2. Lead retrieval is the most effective and efficient way to collect data about potential buyers. 2. No Lost Leads

work/products (Beading, Candles, Carving, Food Products, Soap, Weaving, etc.) ⃝I understand that if my work contains Indigenous visual representation that it is a reflection of the Indigenous culture of my native region. ⃝To the best of my knowledge, my work/products fall within Craft Council standards and expectations with respect to