Intelligent Interface Design For A Question Answering System


INTELLIGENT INTERFACE DESIGN
FOR A QUESTION ANSWERING SYSTEM

By
NICHOLAS ANTONIO

A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA
2001

Copyright 2001
by
Nicholas Antonio

To the innocent victims of terrorism.

ACKNOWLEDGMENTS

I wish to thank my advisor, Dr. Douglas Dankel, for his extensive assistance and support on this thesis and all aspects of my college life at the University of Florida. Despite his hectic schedule, he never failed to respond to my many requests. I also wish to thank Dr. Joseph Wilson and Dr. Paul Fishwick for serving on my supervisory committee. Finally, I wish to thank my mother, Elpida.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 THE PROBLEM
  Historical Attempts to Solve the Problem
  An Ideal Solution to the Problem
  Description of Selected Problem Sub-Area
  No Domain Restrictions
  Using the Internet as a Knowledge Base
  Description of an Intermediate Solution

2 SYSTEM OVERVIEW
  Flow of Operation
  The XML Knowledge Base
  The Parser
  The Query Generator
  The Intelligent Interface
  Summary

3 BACKGROUND MATERIAL
  Natural Language Generation
  Introduction
  Historical Perspective of Natural Language Generation
  Natural Language Generation Perspectives
  Natural Language Generator Tasks
  Traditional Approaches to Text Realization
  XML
  Well-Formed XML
  Valid XML
  Macromedia Flash
  Summary

4 THE INTELLIGENT INTERFACE
  What is an Intelligent Interface?
  Introduction
  Intelligent Interface Issues
  Application Areas
  Description of the Intelligent Interface Module
  Appearance of the Interface
  XML Parser
  Natural Language Generator
  Information Filter
  Examples
  Summary

5 CONCLUSIONS
  Intelligent Interface Evaluation
  Methodology
  Usability and Aesthetics
  Information Filtering
  Limitations of the Implemented System
  Future Research
  Intelligent Interface
  Question Answering System
  Afterword

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF FIGURES

2.1: How the system works
2.2: Sample sub-area XML file fragment from the knowledge base
2.3: Sample fragment of the XMLKB DTD
2.4: Sample fragment of XMLKB directory file
2.5: Desired modified output of parser on input “what are the core classes?”
2.6: Sample XML file written by the query generator
2.7: System’s response to question, “what is the description of COP5555?”
3.2: Sample XML file
3.3: Sample DTD
3.4: Sample XML-Schema document
3.5: Frames and layers in Flash
4.1: How the intelligent interface works
4.2: Appearance of the question answering system’s interface
4.3: The FAQ window
4.4: Part of the code of function askQuestion
4.5: Key features of XML files created by “query generator”
4.6: Part of function convertXML
4.7: How the system handles the COURSE element
4.8: How the system handles the sub-elements of COURSE
4.9: Algorithm used by the information filter
4.10: Initial system response to “what are the core classes?”
4.11: The Master’s core courses
4.12: Description of analysis of algorithms
4.13: The Ph.D. core courses
4.14: Summary of the graduate web pages

Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

INTELLIGENT INTERFACE DESIGN
FOR A QUESTION ANSWERING SYSTEM

By
Nicholas Antonio
December 2001

Chairman: Douglas D. Dankel II
Major Department: Computer and Information Science and Engineering

This thesis describes the design and implementation of an intelligent interface for a question answering system. The system accepts natural language questions and provides natural language answers within the domain of the graduate Web pages. The system has four components: the intelligent interface, the parser, the query generator, and the XML knowledge base.

The system sends the user’s question to the parser, which in turn passes its results on to the query generator. The latter retrieves a part of the knowledge base and writes an XML file. The interface then processes this file and presents the generated answer to the user. Processing the XML file includes natural language generation and information filtering.

The interface, implemented using Macromedia Flash, runs on any Flash-enabled web browser. It works by reading and parsing the XML file created by the query generator. It then generates natural language content using the template-based realization approach. It filters out information by creating links from high-level concepts to specific details, as well as external links to the Internet. Finally, it presents the answer to the user, who in turn can follow the aforementioned links or ask another question.

This interface has been designed for question answering and can be used to view answers generated by the query generator, but it can also be used to view any XML file of similar specification. Thus, the intelligent interface is a solution to the generic problem of presenting filtered information to users, whether it is part of a question answering system or not.

CHAPTER 1
THE PROBLEM

The last decade of the previous millennium saw a revolution unlike any other witnessed by mankind. The information revolution has started a transformation of the world at a speed unrivaled in the three thousand years of recorded history.

The explosion of the Internet has brought access to vast amounts of knowledge, previously available only in some of the world’s finest libraries, into everybody’s home. As with most prior technologies, people have developed a love-hate relationship with the Internet. They love the access to information, but they hate the painstaking process of searching through huge quantities of irrelevant material in the hope of finding the specific information they are looking for.

Historical Attempts to Solve the Problem

In the infancy of the Internet, and more specifically of the World Wide Web, people could only reach information by knowing its web address a priori. The solution to this problem came in the form of search engines (e.g., Altavista, Lycos, and Yahoo).

Search engines attempted to solve the problem of finding relevant information using keywords entered by the user. These engines use the provided keywords to search through indexes they have built and return a list of web pages containing those keywords. At this point, the agonizing procedure begins for the user: manually searching through the returned pages in the hope of finding the information he desires.

It became evident that people wanted a better solution. More specifically, the need for direct question answering arose. Users wanted, and still want, not only to be able to ask questions in natural language but to have their questions precisely answered as well.

A second wave of attempts (e.g., Ask Jeeves! [1]) resulted in search engines that have a sizeable number of prebuilt natural language questions. If the user is lucky, the question he has asked is present in the system and he is redirected to a web page providing an answer. However, if the user is looking for something other than the local weather forecast, chances are that the system just returns a list of web pages that must be manually searched, putting the user back at square one!

An Ideal Solution to the Problem

What is an ideal solution to this problem? Is there only one ideal solution, or are multiple solutions available? Does a solution satisfying the majority of Internet users exist?

The only fact that is beyond doubt is that people desire different solutions to this problem. A solution satisfying a power user will probably leave the novice stranded, while the novice user’s solution will leave the power user unsatisfied. People have been using natural language for communication purposes for thousands of years. Among human societies, fairly universal rules have evolved for the question answering procedure. When you ask someone for the time, you expect a direct response, something along the lines of “it’s four o’clock” or “sorry, I don’t have the time.” And unless you run into someone having a bad day, that is what you usually get as a response. What you definitely do not want and do not get is an answer like “the time is probably available at one of the following places.”

Therefore, it is safe to argue that the main guideline for an ideal solution satisfying the majority of the population is:

    To be able to take a question from a user on any subject and use the Web as a knowledge base to construct and provide a “good” direct answer.

This solution can be explained in more detail. First, it refers to the ability to take natural language questions as input. Second, it considers the lack of restrictions on what the system can answer. Ideally the system should be able to answer any question as long as the information needed to provide the answer resides somewhere on the Internet. This raises the issue of using the Web as a knowledge base. Most of the information on the Web is in Hyper Text Markup Language (HTML) format, with quite a few documents also being in simple text, Adobe Acrobat format, Postscript format, and recently Extensible Markup Language (XML). This means that the system should be able to handle all these file types. Finally, the vaguest part of the description is the construction and provision of a good direct answer. The term good refers to an answer that is intelligent and in a presentable format, where intelligent is judged content-wise and presentable is judged appearance-wise.

A user friendly and aesthetically satisfying web page that does not answer the question is probably better than one that does, but leaves the user helpless in trying to locate the answer among the garbage (i.e., the results returned from the current search engines). In addition, presentable format refers to a page that fits on the user’s screen without the need to scroll.

Description of Selected Problem Sub-Area

The ideal solution to the question-answering problem presents two major obstacles that at this time are too complicated to tackle. The first is the lack of domain restrictions or, more simply, the ability to answer questions on any subject. The second is the use of the Internet as a huge knowledge base.

No Domain Restrictions

In human-to-human conversations, whenever a question is asked, the domain is usually implied by the surrounding circumstances. When a student is asking his advisor about his thesis defense, when a basketball player is asking his coach about defense, and when two politicians are discussing defense, there is no confusion about the terms: thesis defense, basketball defense, and military defense, respectively. People can usually determine the domain and, hence, the meaning of any ambiguous words.

However, when a user fires a question containing the word defense at a computer system, the latter has no contextual knowledge to determine the domain the user is discussing. Natural language and English words, specifically, can have a multitude of different meanings, rendering the task of disambiguation almost impossible at the present time. Therefore, any realistic attempt at a question answering system must be domain specific.

Using the Internet as a Knowledge Base

Knowledge bases are usually in some hierarchical or relational form. This allows easy and efficient querying. When dealing with text, numerous problems arise, including the fact that all efficiency is lost since the document must be searched in a serial manner. With the unprecedented amount of information available on the Internet, inefficiency is not an option. Most importantly, it becomes extremely difficult for queries to return good results, since the most they can do is return a paragraph of text within which the selected keywords are found. This severely limits the ways in which an answer can be formulated for the user.

Ideally, the pages holding the necessary information can be transformed into a knowledge base from which the answer can then be formulated. This requires a module capable of turning text into knowledge. This is a major issue in computer science, one that the field of “textual knowledge acquisition” [2] hopes to solve. Until this solution is developed, we must find intermediate solutions like manually encoding knowledge bases for small domains.

Description of an Intermediate Solution

Since an ideal solution is not achievable at this time, we must look for intermediate solutions that can be used as stepping-stones towards the ideal one. A description of an intermediate solution to question answering, on which our research group has been working, is:

    To take a question from a user about the information in the Computer and Information Science and Engineering (CISE) department’s graduate Web pages and use a specially formatted version of these pages as a knowledge base to construct and provide a “good” direct answer to this question.

There are two key differences between this system and the ideal one. The domain is now restricted to the CISE graduate Web pages, and the Web pages are not used as is, but have instead been manually converted into a hierarchical XML version.

Chapter 2 provides a quick overview of our question answering system and how the various modules of the system should operate. The system consists of four modules: the parser, the query generator, the XML knowledge base, and the intelligent interface, which is the subject of this thesis. Chapter 3 covers the background material that is required for a more in-depth description of the intelligent interface module covered in Chapter 4. Chapter 5 ends the thesis with conclusions and suggested extensions for future research with the ideal solution always being the ultimate goal.

CHAPTER 2
SYSTEM OVERVIEW

The question answering system described in this thesis consists of four modules. Three of these are active while one is inactive. The active modules are the parser, the query generator, and the intelligent interface. All three of these take some input and generate some output, hence the term active. The fourth module is the XML knowledge base (XMLKB), which neither takes input nor generates output, hence the term inactive. This chapter provides a high level description of this question answering system.

Flow of Operation

The only module visible to the user is the interface. As far as he is concerned, he types in questions and receives answers from this interface. Behind the scenes, the interface starts the parser and sends the question asked by the user to it. The parser then generates a specifically modified XML parse tree of the question. The interface then starts the query generator, which reads the parse tree, creates a query, executes it on the XMLKB, and retrieves a part of the XMLKB as the answer. The interface reads this XML structure and in turn generates a page as the answer to the user’s question. This process can be seen more clearly in Figure 2.1.

The XML Knowledge Base

The heart and soul of the system is the knowledge base. This is the repository of all the information that the user desires. As shown in Figure 2.1, a manual transformation of the current graduate web pages from HTML to XML has been performed. This contrasts with what the ideal system should do, which is to perform this operation automatically.

[Figure 2.1: How the system works. Diagram labels: grad web pages; manual transformation from HTML to XML; knowledge base (XML); user; question; answer; interface (Flash-enabled web page); user's question; parser; XML parse tree; query generation module; SQL-type query; XML section containing the answer.]

The manual transformation of the web pages to an agreed upon hierarchical structure enables us to use this structure as a knowledge base that can be queried. It also serves as an indication of what the desired output of an automatic transformation module should look like.

The knowledge base consists of a set of sub-area files, the directory (meta) file, and the DTD file. The sub-area files contain all the information present on the graduate web pages, while the other two files serve as internal meta-knowledge for the system. Figure 2.2 shows a sample XML fragment detailing how sub-area files are structured, Figure 2.3 shows a sample fragment of the XMLKB DTD, and Figure 2.4 shows a sample fragment of the XMLKB directory file.

<?xml version='1.0' encoding="UTF-8" standalone="no"?>
<GRAD_PAGES lastRevised="09/25/01">
  <OVERVIEW>
    <CW>overview</CW>
    <CONTENT>Overview of the information in the Graduate Brochure</CONTENT>
    <TEXT>This document describes the degree requirements for students entering the Graduate Program in Computer and Information Science and Engineering (CISE) with the intention of receiving the Master's, Engineer, or Ph.D. Degree. It is intended to be used in conjunction with the University of Florida's Graduate Catalog. While this guide is intended to be self-contained and accurate, the CISE Department reserves the right to correct errors when found, without further notice to students. It is the student's responsibility to ensure that they are in compliance with both Departmental and University requirements.</TEXT>
    <ROOT_TEXT>DOCUMENT DESCRIBE DEGREE REQUIREMENT STUDENT ENTERING GRADUATE PROGRAM COMPUTER INFORMATION SCIENCE ENGINEERING CISE INTENTION RECEIVING MASTER ENGINEER PHD INTEND CONJUNCTION UNIVERSITY FLORIDA CATALOG GUIDE SELF CONTAINED ACCURATE DEPARTMENT RESERVES RIGHT CORRECT ERROR FOUND WITHOUT FURTHER NOTICE 'S RESPONSIBILITY ENSURE COMPLIANCE DEPARTMENTAL</ROOT_TEXT>
    <LINK>
      <TEXT>CISE</TEXT>
      <TARGET>http://www.cise.ufl.edu</TARGET>
    </LINK>
    <LINK>
      <TEXT>Master's Degree</TEXT>
      <TARGET>http://www.cise.ufl.edu/ ddd/grad/ms.html</TARGET>
    </LINK>
  </OVERVIEW>
</GRAD_PAGES>

Figure 2.2: Sample sub-area XML file fragment from the knowledge base
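As a rough illustration of how a program might read a sub-area file like the one in Figure 2.2, the following Python sketch walks each section and collects its keywords, summary, and links. It is not the thesis implementation (the interface itself is written in Macromedia Flash); the function name `summarize_sub_area` and the abbreviated sample text are hypothetical, and the underscore in `GRAD_PAGES` is an assumption, made because XML element names cannot contain spaces.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical stand-in for a sub-area file such as Figure 2.2.
SUB_AREA_XML = """<GRAD_PAGES lastRevised="09/25/01">
  <OVERVIEW>
    <CW>overview</CW>
    <CONTENT>Overview of the information in the Graduate Brochure</CONTENT>
    <TEXT>This document describes the degree requirements.</TEXT>
    <LINK>
      <TEXT>CISE</TEXT>
      <TARGET>http://www.cise.ufl.edu</TARGET>
    </LINK>
  </OVERVIEW>
</GRAD_PAGES>"""


def summarize_sub_area(xml_text):
    """Return one dictionary per top-level section of a sub-area file."""
    root = ET.fromstring(xml_text)
    summaries = []
    for section in root:
        # Each LINK holds the anchor text and the address it points to.
        links = [
            (link.findtext("TEXT", "").strip(), link.findtext("TARGET", "").strip())
            for link in section.findall("LINK")
        ]
        summaries.append({
            "name": section.tag,
            "keywords": section.findtext("CW", "").split(),
            "content": section.findtext("CONTENT", "").strip(),
            "links": links,
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize_sub_area(SUB_AREA_XML):
        print(summary["name"], summary["keywords"], summary["links"])
```

Because every section carries the same CW, CONTENT, and TEXT children, one generic loop can handle any sub-area file without knowing its section names in advance.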

<?xml version="1.0" encoding="UTF-8"?>
<!-- ************** CORE_COURSES element ********************* -->
<!ELEMENT CORE_COURSES (CW,CONTENT,TEXT?,ROOT_TEXT?,MASTERS_CORE,PHD_CORE)>
<!ELEMENT MASTERS_CORE (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK?,COURSE*)>
<!ELEMENT PHD_CORE (CW,CONTENT,TEXT?,ROOT_TEXT?,COURSE*)>
<!ELEMENT COURSE (CW,CONTENT,TEXT,ROOT_TEXT?,LINK?,NUMBER?,DESCRIPTION?,PREREQ?)>
<!ELEMENT NUMBER (CW,CONTENT,TEXT,ROOT_TEXT?)>
<!ELEMENT DESCRIPTION (CW,CONTENT,TEXT,ROOT_TEXT)>
<!ELEMENT PREREQ (CW,CONTENT,TEXT,ROOT_TEXT,LINK*)>
<!-- *************** OVERVIEW element ************************ -->
<!ELEMENT OVERVIEW (CW,CONTENT,TEXT,ROOT_TEXT,LINK*)>
<!-- **************** GEN_INFO element *********************** -->
<!ELEMENT GEN_INFO (CW,CONTENT,TEXT?,ROOT_TEXT?,DEGREES_OFFERED,STUDY_AREAS,COMPUTING_RESOURCES)>
<!ELEMENT DEGREES_OFFERED (CW,CONTENT,TEXT?,ROOT_TEXT?,LINK*,DEGREE*)>
<!ELEMENT DEGREE (CW,CONTENT,TEXT,ROOT_TEXT?)>
<!ELEMENT STUDY_AREAS (CW,CONTENT,TEXT?,ROOT_TEXT?,STUDY_AREA*)>
<!ELEMENT STUDY_AREA (CW,CONTENT,TEXT,ROOT_TEXT?,DESCRIPTION)>
<!ELEMENT COMPUTING_RESOURCES (CW,CONTENT,TEXT?,ROOT_TEXT?,RESOURCE*)>
<!ELEMENT RESOURCE (CW,CONTENT,TEXT,ROOT_TEXT)>

Figure 2.3: Sample fragment of the XMLKB DTD
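The content models in Figure 2.3 are simple sequences with optional (`?`) and repeatable (`*`) children, so each one can be read as a regular expression over child element names. The Python sketch below illustrates that reading; it is not a DTD validator and not part of the thesis system. The helper names `content_model_to_regex` and `children_conform` are hypothetical, only comma-separated sequence models are handled, and the underscored element names are an assumption.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate a simple sequence model such as "(CW,CONTENT,TEXT?)" to a regex."""
    items = [item.strip() for item in model.strip().strip("()").split(",")]
    pattern = ""
    for item in items:
        # A trailing ?, * or + is the DTD occurrence indicator for that child.
        occurrence = item[-1] if item[-1] in "?*+" else ""
        name = item.rstrip("?*+")
        # Each child name is followed by a space so names cannot run together.
        pattern += "(?:" + re.escape(name) + " )" + occurrence
    return re.compile(pattern)


def children_conform(element, model):
    """Check whether the element's children appear in the order the model allows."""
    child_sequence = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(child_sequence) is not None


MASTERS_CORE_MODEL = "(CW,CONTENT,TEXT?,ROOT_TEXT?,LINK?,COURSE*)"

if __name__ == "__main__":
    good = ET.fromstring(
        "<MASTERS_CORE><CW>core</CW><CONTENT>c</CONTENT><COURSE/><COURSE/></MASTERS_CORE>"
    )
    bad = ET.fromstring("<MASTERS_CORE><CONTENT>c</CONTENT><CW>core</CW></MASTERS_CORE>")
    print(children_conform(good, MASTERS_CORE_MODEL))
    print(children_conform(bad, MASTERS_CORE_MODEL))
```

A full validating parser would also handle choice groups, nested groups, and attribute declarations; the sketch only shows why a fixed child order makes the knowledge base predictable for the modules that read it.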

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE GRAD_PAGES SYSTEM "mainDTD.dtd">
<GRAD_PAGES lastRevised="08/29/01">
  <DIRECTORY domain="www.cise.ufl.edu/ ddd/grad">
    <LISTING file="core courses.xml">
      <CW>core course master master's degree ph.d. doctor philosophy phd ms m.s.</CW>
      <CONTENT>CISE Graduate Program core courses</CONTENT>
    </LISTING>
    <LISTING file="overview.xml">
      <CW>overview summary</CW>
      <CONTENT>Overview of the information in the Graduate Brochure</CONTENT>
    </LISTING>
    <LISTING file="gen info.xml">
      <CW>general information graduate degree offer study area specialization compute computing resource</CW>
      <CONTENT>General information about the CISE graduate program</CONTENT>
    </LISTING>
    <LISTING file="admission.xml">
      <CW>application apply admission information material mail office cise computer science department submit submission process</CW>
      <CONTENT>Information on admission to the CISE graduate program</CONTENT>
    </LISTING>
    <LISTING file="financial.xml">
      <CW>financial assistance option assistantship fellowship tuition payment fee responsibility certification</CW>
      <CONTENT>Information on available financial assistance</CONTENT>
    </LISTING>

Figure 2.4: Sample fragment of XMLKB directory file
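The directory file pairs each sub-area file with a list of keywords (CW). One plausible use of this meta-knowledge, shown here only as an assumption about how such a directory could be consulted, is to rank sub-area files by how many of a question's root words appear among their keywords. The function name `rank_listings` and the two-entry sample directory are hypothetical; the source does not describe the actual lookup procedure at this point.

```python
import xml.etree.ElementTree as ET

# Two listings copied from Figure 2.4; the closing tags are added so the
# sample is well-formed on its own.
DIRECTORY_XML = """<GRAD_PAGES lastRevised="08/29/01">
  <DIRECTORY domain="www.cise.ufl.edu/ ddd/grad">
    <LISTING file="core courses.xml">
      <CW>core course master master's degree ph.d. doctor philosophy phd ms m.s.</CW>
      <CONTENT>CISE Graduate Program core courses</CONTENT>
    </LISTING>
    <LISTING file="financial.xml">
      <CW>financial assistance option assistantship fellowship tuition payment fee responsibility certification</CW>
      <CONTENT>Information on available financial assistance</CONTENT>
    </LISTING>
  </DIRECTORY>
</GRAD_PAGES>"""


def rank_listings(directory_xml, question_roots):
    """Return sub-area file names ordered by keyword overlap with the question."""
    roots = {word.lower() for word in question_roots}
    tree = ET.fromstring(directory_xml)
    ranked = []
    for listing in tree.iter("LISTING"):
        keywords = set(listing.findtext("CW", "").lower().split())
        overlap = len(roots & keywords)
        if overlap:
            ranked.append((overlap, listing.get("file")))
    # Highest overlap first; the stable sort keeps directory order for ties.
    ranked.sort(key=lambda pair: -pair[0])
    return [file_name for _, file_name in ranked]


if __name__ == "__main__":
    print(rank_listings(DIRECTORY_XML, ["what", "be", "the", "core", "class"]))
```

Matching on root forms rather than surface words is what would let "classes" in a question meet a keyword list written in the singular, provided the keyword list contains that root.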

The Parser

After the user submits a question, the interface sends that question to the parser. Parsers are an integral part of every natural language processing system. Their job is to identify and tag every word with an appropriate part of speech and generate a tree structure that correctly represents the syntax of the sentence.

Many parsers exist, each of which operates slightly differently, but all follow the same principles. For our system, a parser that can generate XML parse trees is needed. Since many very good parsers already exist, we decided to modify an existing one to suit our needs rather than to write one from scratch.

After a review of various parsers, the “Link Grammar Parser” from Carnegie Mellon University was chosen. In the developers’ own words, “the Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax.” [3] Given a sentence, the system assigns a syntactic structure consisting of a set of labeled links connecting pairs of words.

The parser has a dictionary of approximately 60,000 word forms. It covers a wide variety of syntactic constructions, including many that are rare and idiomatic. The parser is robust: it skips over portions of the sentence that it cannot understand and assigns some structure to the rest of the sentence. It is able to handle unknown vocabulary, making intelligent guesses about the syntactic categories of unknown words from context and spelling. It has knowledge of capitalization, numerical expressions, and a variety of punctuation symbols.

<SENTENCE>
  <NOUNPHRASE>
    <PRONOUN string="what">
      <ROOT>what</ROOT>
      <NUMBER>indeterminate</NUMBER>
    </PRONOUN>
  </NOUNPHRASE>
  <SENTENCE string="are the core classes?">
    <VERBPHRASE>
      <VERB string="are">
        <ROOT>be</ROOT>
        <TENSE>present</TENSE>
        <NUMBER>plural</NUMBER>
      </VERB>
      <NOUNPHRASE>
        <NOUNPHRASE>
          <ARTICLE string="the">
            <ROOT>the</ROOT>
            <type>definite</type>
          </ARTICLE>
          <ADJECTIVE string="core">
            <ROOT>core</ROOT>
          </ADJECTIVE>
          <NOUN string="classes">
            <ROOT>class</ROOT>
            <NUMBER>plural</NUMBER>
          </NOUN>
        </NOUNPHRASE>
      </NOUNPHRASE>
    </VERBPHRASE>
  </SENTENCE>
</SENTENCE>

Figure 2.5: Desired modified output of parser on input “what are the core classes?”
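To show what the extra ROOT tags make possible, the Python sketch below reads a parse tree shaped like Figure 2.5 and lists each word with its part of speech and root form. This is an illustration only, not the query generator from the thesis; the function name `extract_words` is hypothetical, and the compact sample tree is retyped from the figure.

```python
import xml.etree.ElementTree as ET

# Parse tree for "what are the core classes?", retyped from Figure 2.5.
PARSE_TREE_XML = """<SENTENCE>
  <NOUNPHRASE>
    <PRONOUN string="what"><ROOT>what</ROOT><NUMBER>indeterminate</NUMBER></PRONOUN>
  </NOUNPHRASE>
  <SENTENCE string="are the core classes?">
    <VERBPHRASE>
      <VERB string="are"><ROOT>be</ROOT><TENSE>present</TENSE><NUMBER>plural</NUMBER></VERB>
      <NOUNPHRASE>
        <NOUNPHRASE>
          <ARTICLE string="the"><ROOT>the</ROOT><type>definite</type></ARTICLE>
          <ADJECTIVE string="core"><ROOT>core</ROOT></ADJECTIVE>
          <NOUN string="classes"><ROOT>class</ROOT><NUMBER>plural</NUMBER></NOUN>
        </NOUNPHRASE>
      </NOUNPHRASE>
    </VERBPHRASE>
  </SENTENCE>
</SENTENCE>"""


def extract_words(parse_tree_xml):
    """Return (part of speech, surface string, root form) for every word node."""
    tree = ET.fromstring(parse_tree_xml)
    words = []
    for node in tree.iter():
        # Word nodes carry a "string" attribute and a direct ROOT child;
        # phrase nodes have neither or only the attribute, so they are skipped.
        root_form = node.find("ROOT")
        if root_form is not None and "string" in node.attrib:
            words.append((node.tag, node.get("string"), root_form.text.strip()))
    return words


if __name__ == "__main__":
    for part_of_speech, surface, root in extract_words(PARSE_TREE_XML):
        print(part_of_speech, surface, root)
```

The root forms collected this way ("be", "core", "class") are the kind of normalized words that could be compared against keyword lists in the knowledge base.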

Figure 2.5 shows a sample of the desired modified output of this parser on input “what are the core classes?” Modifications include an XML parse tree as output and extra tags defining a word’s root, tense, and number.

<?xml version='1.0'?>
<RESULT number='1'>
  <QUERY string="How can I get the Degree of Engineer?"></QUERY>
  <ANSWER type="E">
    <ENGINEER>
      <CONTENT>Information on the Degree of Engineer</CONTENT>
      <TEXT>To be admitted to the Degree of Engineer Program students must have completed a Master's Degree in Engineering. To earn the Degree of Engineer, the students must obtain at least a 3.0 GPA in at least 30 graduate credit hours beyond the Master's Degree, within five calendar years of enrollment. These credit hours may include CIS 6972, Research for Engineer's Thesis, hours. The
