

DOCUMENT RESUME

ED 048 910    LI 002 719

TITLE: Information Storage and Retrieval. Reports on Analysis, Dictionary Construction, User Feedback, Clustering, and On-Line Retrieval.
INSTITUTION: Cornell Univ., Ithaca, N.Y. Dept. of Computer Science.
SPONS AGENCY: National Library of Medicine (DHEW), Bethesda, Md.; National Science Foundation, Washington, D.C.
REPORT NO: ISR-18
PUB DATE: Oct 70
NOTE: 523p.
EDRS PRICE: EDRS Price MF-$0.65 HC-$19.74
DESCRIPTORS: Automatic Indexing, Automation, Classification, Content Analysis, *Dictionaries, Electronic Data Processing, *Feedback, *Information Retrieval, *Information Storage, Relevance (Information Retrieval), Thesauri, *Use Studies
IDENTIFIERS: On Line Retrieval Systems, *Saltons Magical Automatic Retriever of Texts, SMART

ABSTRACT

Part One, Automatic Content Analysis, contains: "Content Analysis in Information Retrieval;" "The 'Generality' Effect and the Retrieval Evaluation for Large Collections;" "Automatic Indexing Using Bibliographic Citations" and "Automatic Resolution of Ambiguities from Natural Language Text." Part Two, Automatic Dictionary Construction, includes: "The Effect of Common Words and Synonyms on Retrieval Performance;" "Negative Dictionaries" and "Experiments in Automatic Thesaurus Construction for Information Retrieval." Part Three, User Feedback Procedures, includes: "Variations on the Query Splitting Technique with Relevance Feedback;" "Effectiveness of Feedback Strategies on Collections of Differing Generality;" "Selective Negative Feedback Methods" and "The Use of Past Relevance Decisions in Relevance Feedback." Part Four, Clustering Methods, includes: "A Controlled Single Pass Classification Algorithm with Application to Multilevel Clustering" and "A Systematic Study of Query-Clustering Techniques: A Progress Report." Part Five, On-Line Retrieval System Design, contains: "A Prototype On-Line Document Retrieval System" and "Template Analysis in a Conversational System." (Author/NH)

"PERMISSION TO REPRODUCE THIS COPYRIGHTED MATERIAL HAS BEEN GRANTED BY [illegible] TO ERIC AND ORGANIZATIONS OPERATING UNDER AGREEMENTS WITH THE U.S. OFFICE OF EDUCATION. FURTHER REPRODUCTION OUTSIDE THE ERIC SYSTEM REQUIRES PERMISSION OF THE COPYRIGHT OWNER."

Department of Computer Science
Cornell University
Ithaca, New York 14850

Scientific Report No. ISR-18

INFORMATION STORAGE AND RETRIEVAL

to
The National Science Foundation
and to
The National Library of Medicine

Reports on Analysis, Dictionary Construction, User Feedback, Clustering, and On-Line Retrieval

Ithaca, New York
October 1970

Gerard Salton
Project Director

Copyright, 1970
by Cornell University

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

SMART Project Staff

Robert Crawford
Barbara Galaska
Eileen Gudat
Marcia Kerchner
Ellen Lundell
Robert Peck
Jacob Razon
Gerard Salton
Donna Williamson
Robert Williamson
Steven Worona
Joel Zumoff

TABLE OF CONTENTS

SUMMARY

PART ONE
AUTOMATIC CONTENT ANALYSIS

I. WEISS, S. F.
   "Content Analysis in Information Retrieval"
      Abstract
      1. Introduction
      2. ADI Experiments
         A) Statistical Phrases
         B) Syntactic Phrases
         C) Cooccurrence
         D) Elimination of Phrase List
         E) Analysis of ADI Results
      3. The Cranfield Collection
      4. The TIME Subset Collection
         A) Construction
         B) Analysis of Results
      5. A Third [...]

II. SALTON, G.
   "The 'Generality' Effect and the Retrieval Evaluation for Large Collections"

      [...]
      2. Basic System Parameters
      3. Variations in Collection Size
         A) Theoretical Considerations
         B) Evaluation Results
         C) Feedback Performance
      4. Variations in Relevance [...]

III. SALTON, G.
   "Automatic Indexing Using Bibliographic Citations"
      Abstract
      1. Significance of Bibliographic Citations
      2. The Citation Test
      3. Evaluation Results
      References
      Appendix

IV. WEISS, S. F.
   "Automatic Resolution of Ambiguities from Natural Language Text"

      [...]
      1. Introduction
      2. The Nature of Ambiguities
      3. Approaches to Disambiguation
      4. Automatic Disambiguation
         A) Application of Extended Template Analysis to Disambiguation
         B) The Disambiguation Process
         C) Experiments
         D) Further Disambiguation Processes
      5. Learning to Disambiguate Automatically
         A) Introduction
         B) Dictionary and Corpus
         C) The Learning Process
         D) Spurious Rules
         E) Experiments and [...]
      6. [...]
      References

PART TWO
AUTOMATIC DICTIONARY CONSTRUCTION

V. BERGMARK, D.

   "The Effect of Common Words and Synonyms on Retrieval Performance"
      [...]
      2. Experiment Outline
         A) The Experimental Data Base
         B) Creation of the Significant Stem Dictionary
         C) Generation of New Query and Document Vectors
         D) Document Analysis: Search and Average Runs
      3. Retrieval Performance Results
         A) Significant vs. Standard Stem Dictionary
         B) Significant Stem vs. Thesaurus
         C) Standard Stem vs. Thesaurus
         D) Recall Results
         E) Effect of "Query Wordiness" on Search Performance
         F) Effect of Query Length on Search Performance
         G) Effect of Query Generality on Search Performance
         H) Conclusions of the Global Analysis
      4. Analysis of Search Performance
      5. Conclusions
      6. Further [...]
      [...]
      Appendix I
      Appendix II

VI. BONWIT, K. and ASTE-TONSMANN, J.
   "Negative Dictionaries"
      Abstract
      1. Introduction
      2. Theory
      3. Experimental Results
      4. Experimental Method
         A) Calculating Qi
         B) Deleting and Searching
      5. Cost Analysis
      6. Conclusions
      References

VII. SALTON, G.
   "Experiments in Automatic Thesaurus Construction for Information Retrieval"
      Abstract
      1. Manual Dictionary Construction
      2. Common Word Recognition
      3. Automatic Concept Grouping [...]

PART THREE
USER FEEDBACK PROCEDURES

VIII. BAKER, T. P.
   "Variations on the Query Splitting Technique with Relevance Feedback"
      [...]
      2. Algorithms for Query Splitting
      3. Results of Experimental [...]

IX. CAPPS, B. and YIN, M.
   "Effectiveness of Feedback Strategies on Collections of Differing Generality"
      [...]
      2. Experimental Environment
      3. Experimental [...]

X. KERCHNER, M.
   "Selective Negative Feedback Methods"
      Abstract
      1. Introduction
      2. Methodology
      3. Selective Negative Relevance Feedback Strategies
      4. The Experimental Environment
      5. Experimental Results
      6. Evaluation of Experimental Results
      References

XI. PAAVOLA, L.
   "The Use of Past Relevance Decisions in Relevance Feedback"
      Abstract
      1. Introduction
      2. Assumptions and Hypotheses
      3. Experimental [...]
      [...]

PART FOUR
CLUSTERING METHODS

XII. JOHNSON, D. B. and LAFUENTE, J. M.
   "A Controlled Single Pass Classification Algorithm with Application to Multilevel Clustering"
      [...]
      2. Methods of Clustering
      3. Strategy
      4. The Algorithm
         A) Cluster Size
         B) Number of Clusters
         C) Overlap
         D) An Example
      5. Implementation
         A) Storage Management
      6. Results
         A) Clustering Costs
         B) Effect of Document Ordering
         C) Search Results on Clustered ADI Collection
         D) Search Results of Clustered Cranfield Collection
      7. Conclusions
      [...]

XIII. WORONA, S.
   "A Systematic Study of Query-Clustering Techniques: A Progress Report"
      Abstract
      1. Introduction
      2. The Experiment
         A) Splitting the Collection
         B) Phase 1: Clustering the Queries
         C) Phase 2: Clustering the Documents
         D) Phase 3: Assigning [...]
      3. Results
      4. Principles of Evaluation
      References
      Appendix A
      Appendix B
      Appendix C

PART FIVE
ON-LINE RETRIEVAL SYSTEM DESIGN

XIV. WILLIAMSON, D. and WILLIAMSON, R.
   "A Prototype On-Line Document Retrieval System"
      Abstract

      [...]
      2. Anticipated Computer Configuration
      3. On-Line Document Retrieval - A User's View
      4. Console Driven Document Retrieval - An Internal View
         A) The Internal Structure
         B) General Characteristics of SMART Routines
         C) Pseudo-Batching
         D) Attaching Consoles to SMART
         E) Console Handling - The Supervisor Interface
         F) Parameter Vectors
         G) The Flow of Control
         H) Timing Considerations
         I) Noncore Resident Files
         J) Core Resident Files
      5. CONSOL - A Detailed Look
         A) Competition for Core
         B) The SMART On-line Console Control Block
         C) The READY Flag and the TRT Instruction
         D) The Routines LATCH, CONSIN, and CONSOT
         E) CONSOL as a Traffic Controller
         F) A Detailed View of CYCLE
      6. Summary
      Appendix

XV. WEISS, S. F.
   "Template Analysis in a Conversational System"

      Abstract
      1. Motivation
      2. Some Existing Conversational Systems
      3. Goals for a Proposed Conversational System
      4. Implementation of the Conversational System
         A) Capabilities
         B) Input Conventions
         C) The Structure of the Process
         D) Template Analysis in the Conversational System
         E) The Guide [...]
      5. [...]
         A) System Performance
         B) User Performance
         C) Timing
      6. Future Extensions
      7. Conclusion
      References

Summary

The present report is the eighteenth in a series describing research in automatic information storage and retrieval conducted by the Department of Computer Science at Cornell University. The report, covering work carried out by the SMART project for approximately one year (summer 1969 to summer 1970), is separated into five parts: automatic content analysis (Sections I to IV), automatic dictionary construction (Sections V to VII), user feedback procedures (Sections VIII to XI), document and query clustering methods (Sections XII and XIII), and SMART systems design for on-line operations (Sections XIV and XV).

Most recipients of SMART project reports will experience a gap in the series of scientific reports received to date. Report ISR-17, consisting of a master's thesis by Thomas Brauen entitled "Document Vector Modification in On-line Information Retrieval Systems", was prepared for limited distribution during the fall of 1969. Report ISR-17 is available from the National Technical Information Service in Springfield, Virginia 22151, under order number PB 186-135.

The SMART system continues to operate in a batch processing mode on the IBM 360 model 65 system at Cornell University. The standard processing mode is eventually to be replaced by an on-line system using time-shared console devices for input and output. The overall design for such an on-line version of SMART has been completed, and is described in Section XIV of the present report. While awaiting the time-sharing implementation of the system, new retrieval experiments have been performed using larger document collections within the existing system. Attempts to compare the performance of several collections of different sizes must take into account the collection "generality". A study of this problem is made in Section II of the present report. Of special interest may also be the new procedures for the automatic recognition of "common" words in English texts (Section VI), and the automatic construction of thesauruses and dictionaries for use in an automatic language analysis system (Section VII). Finally, a new inexpensive method of document classification and term grouping is described and evaluated in Section XII of the present report.

Sections I to IV cover experiments in automatic content analysis and automatic indexing. Section I by S. F. Weiss contains the results of experiments using statistical and syntactic procedures for the automatic recognition of phrases in written texts. It is shown once again that because of the relative heterogeneity of most document collections, and the sparseness of the document space, phrases are not normally needed for content identification.

In Section II by G. Salton, the "generality" problem is examined which arises when two or more distinct collections are compared in a retrieval environment. It is shown that proportionately fewer nonrelevant items tend to be retrieved when larger collections (of low generality) are used, than when small, high generality collections serve for evaluation purposes. The systems viewpoint thus normally favors the larger, low generality output, whereas the user viewpoint prefers the performance of the smaller collection.

The effectiveness of bibliographic citations for content analysis purposes is examined in Section III by G. Salton. It is shown that in some situations when the citation space is reasonably [illegible], the use of citations attached to documents is even more effective than the use of standard keywords or descriptors. In any case, citations should be added to the normal descriptors whenever they happen to be available.

In the last section of Part 1, certain template analysis methods are applied to the automatic resolution of ambiguous constructions (Section IV by S. F. Weiss). It is shown that a set of contextual rules can be constructed by a semi-automatic learning process, which will eventually lead to an automatic recognition of over ninety percent of the existing textual ambiguities.

Part 2, consisting of Sections V, VI and VII, covers procedures for the automatic construction of dictionaries and thesauruses useful in text analysis systems. In Section V by D. Bergmark it is shown that word stem methods using large common word lists are more effective in an information retrieval environment than some manually constructed thesauruses, even though the latter also include synonym recognition facilities.

A new model for the automatic determination of "common" words (which are not to be used for content identification) is proposed and evaluated in Section VI by K. Bonwit and J. Aste-Tonsmann. The resulting process can be incorporated into fully automatic dictionary construction systems. The complete thesaurus construction problem is reviewed in Section VII by G. Salton, and the effectiveness of a variety of automatic dictionaries is evaluated.

Part 3, consisting of Sections VIII through XI, deals with a number of refinements of the normal relevance feedback process which has been examined in a number of previous reports in this series. In Section VIII by T. P. Baker, a query splitting process is evaluated in which input queries are split into two or more parts during feedback whenever the relevant documents identified by the user are separated by one or more nonrelevant ones.

The effectiveness of relevance feedback techniques in an environment of variable generality is examined in Section IX by B. Capps and M. Yin. It is shown that some of the feedback techniques are equally applicable to collections of small and large generality. Techniques of negative feedback (when no relevant items are identified by the users, but only nonrelevant ones) are considered in Section X by M. Kerchner. It is shown that a number of selective negative techniques, in which only certain specific concepts are actually modified during the feedback process, bring good improvements in retrieval effectiveness over the standard nonselective methods.

Finally, a new feedback methodology in which a number of documents jointly identified as relevant to earlier queries are used as a set for relevance feedback purposes is proposed and evaluated in Section XI by L. Paavola.

Two new clustering techniques are examined in Part 4 of this report, consisting of Sections XII and XIII. A controlled, inexpensive, single-pass clustering algorithm is described and evaluated in Section XII by D. B. Johnson and J. M. Lafuente. In this clustering method, each document is examined only once, and the procedure is shown to be equivalent in certain circumstances to other more demanding clustering procedures.

The query clustering process, in which query groups are used to define the information search strategy, is studied in Section XIII by S. Worona. A variety of parameter values is evaluated in a retrieval environment to be used for cluster generation, centroid definition, and final search strategy.

The last part, number five, consisting of Sections XIV and XV, covers the design of on-line information retrieval systems. A new SMART system design for on-line use is proposed in Section XIV by D. and R. Williamson, based on the concepts of pseudo-batching and the interaction of a cycling program with a console monitor. The user interface and conversational facilities are also described.

A template analysis technique is used in Section XV by S. F. Weiss for the implementation of conversational retrieval systems used in a time-sharing environment. The effectiveness of the method is discussed, as well as its implementation in a retrieval situation.

Additional automatic content analysis and search procedures used with the SMART system are described in several previous reports in this series, including notably reports ISR-11 to ISR-16 published between 1966 and 1969. These reports are all available from the National Technical Information Service in Springfield, Virginia.

G. Salton
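The single-pass clustering idea summarized above for Section XII (each document is examined only once) can be illustrated with a minimal sketch. This is a generic single-pass scheme, not the controlled algorithm of Johnson and Lafuente, whose details are not given in this Summary. The sparse dictionary vectors, the cosine measure, the summed centroid, and the `threshold` parameter are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(docs, threshold=0.5):
    """Cluster documents in one pass; `threshold` is a hypothetical parameter.

    Each document joins the most similar existing cluster if the similarity
    to that cluster's centroid reaches the threshold; otherwise it starts a
    new cluster. The centroid is kept as the sum of its members' vectors,
    which is sufficient because cosine similarity ignores vector length.
    """
    clusters = []  # each cluster: {"centroid": dict, "members": [doc indices]}
    for i, doc in enumerate(docs):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(doc, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            for term, weight in doc.items():
                best["centroid"][term] = best["centroid"].get(term, 0.0) + weight
        else:
            clusters.append({"centroid": dict(doc), "members": [i]})
    return clusters


# Illustrative documents (made-up term weights).
docs = [
    {"retrieval": 1, "feedback": 1},
    {"retrieval": 1, "feedback": 1, "query": 1},
    {"cluster": 1, "centroid": 1},
    {"cluster": 1},
]
groups = [c["members"] for c in single_pass_cluster(docs, threshold=0.5)]
print(groups)
```

In a scheme of this kind the result depends on the order in which documents arrive. The table of contents lists an "Effect of Document Ordering" subsection under Section XII.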

I. Content Analysis in Information Retrieval

S. F. Weiss

Abstract

In information retrieval there exist a number of content analysis schemes which analyze natural language text to varying degrees of complexity. Regardless of how well the text analysis is performed by each process, the true value of a given process lies in its effectiveness as an information retrieval tool. The performance may in each case be investigated by actual retrieval tests using the various proposed content analysis schemes.

Results obtained with a variety of linguistic phrase recognition methods show that very little, if any, improvements in retrieval effectiveness are obtained when any of the refined content analysis schemes are used with existing document collections. The main reason appears to be the fact that the value of refined content analysis systems resides in their effectiveness in separating lexically similar, but semantically different documents. Existing collections are too sparse, and do not contain many close documents. When denser collections are created, it can be shown that linguistic content analysis methods become of increasing value as the density increases.

The queries also influence the type of content analysis to be used. In general, queries of the question-answering variety show improved retrieval results with increasing refinements in the content analysis. Document retrieval queries do not exhibit this type of improvement.

Future work must be devoted to a determination of what makes a user judge a particular document to be relevant. With more insight into the relevance area, the role of linguistic content analysis in information retrieval may become more clearly defined.

1. Introduction

The purpose of a content analysis system as considered in this study is as an information retrieval aid. It is therefore necessary to perform retrieval using various content analysis methods to determine how well it fulfills its actual role. This study presents experiments and results aimed at determining the conditions under which content analysis improves retrieval results as well as the degree of improvement obtained. All information retrieval systems use some degree of content analysis in its broadest sense. This is generally in the form of assignment of concept indicators to individual words. But in this study content analysis refers to the analysis and utilization of multi-word groups as information retrieval tools.
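To make the notion of multi-word groups concrete, the following is a minimal sketch of one simple way to derive candidate phrases from cooccurrence counts. It is an illustration only and should not be read as the statistical phrase procedure evaluated in this section, which is not described in the text above. The stop-word list, the sentence-level cooccurrence window, and the `min_count` cutoff are assumptions.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny common-word list for the example.
STOP_WORDS = {"the", "of", "a", "in", "and", "is", "for", "to"}


def content_words(sentence):
    """Lower-case the sentence and drop the assumed common words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def cooccurrence_phrases(sentences, min_count=2):
    """Return word pairs that cooccur in at least `min_count` sentences.

    Each pair is stored in alphabetical order, so word order within the
    sentence is ignored.
    """
    counts = Counter()
    for sentence in sentences:
        unique = sorted(set(content_words(sentence)))
        counts.update(combinations(unique, 2))
    return {pair for pair, n in counts.items() if n >= min_count}


sentences = [
    "Information retrieval is evaluated.",
    "The retrieval of information.",
    "Clustering documents.",
]
print(cooccurrence_phrases(sentences))
```

A pair found this way could then be treated as an additional concept indicator alongside the single-word indicators mentioned above. Whether doing so improves retrieval is the question the section's experiments address.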

