

International Journal of Scientific & Engineering Research, Volume 4, Issue 2, February 2013, ISSN 2229-5518
IJSER 2013, http://www.ijser.org

An Extensive Literature Review on CLIR and MT Activities in India

Kumar Sourabh

Kumar Sourabh is pursuing research (PhD) in the Department of CS and IT, University of Jammu. PH- 919469163570. E-mail: kumar9211.sourabh@gmail.com

Abstract: This paper surveys developments in cross-language information retrieval (CLIR) and machine translation (MT) systems in India. The first part discusses CLIR systems for Indian languages and the second part discusses machine translation systems for Indian languages. The three main approaches to CLIR rely on machine translation, a parallel corpus, or a bilingual dictionary. The machine translation-based (MT-based) approach uses existing machine translation techniques to translate queries automatically. End users can retrieve and use the information when the MT system is integrated with other text processing services such as text summarization, information retrieval, and web access, which lets web users perform cross-language information retrieval on the internet. CLIR is thus naturally associated with MT. This survey covers the major ongoing developments in CLIR and MT with respect to the following: existing projects, current projects, status of the projects, participants, government efforts, funding and financial aid, Eleventh Five Year Plan (2007-2012) activities, and Twelfth Five Year Plan (2012-2017) projections.

Keywords: Machine Translation, Cross Language Information Retrieval, NLP

1. INTRODUCTION:

An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Monolingual information retrieval identifies relevant documents in the same language as the query, whereas cross-lingual information retrieval (CLIR) is a subfield of information retrieval that deals with retrieving information written in a language different from the language of the user's query. The ability to search and retrieve information in multiple languages is becoming increasingly important and challenging in today's environment. Consequently, cross-lingual information retrieval has received more research attention and is increasingly being used to retrieve information on the Internet. While numerous search engines exist, few support truly cross-language retrieval. Many search engines are monolingual but can additionally translate the retrieved pages from one language to another, for example Google, Yahoo and AltaVista.

In CLIR, the query and the documents need to be translated, and this translation reduces retrieval performance. Most approaches translate queries into the document language and then perform monolingual retrieval. The three main approaches to CLIR rely on machine translation, a parallel corpus, or a bilingual dictionary. The machine translation-based (MT-based) approach uses existing machine translation techniques to translate queries automatically. End users can retrieve and use the information when the MT system is integrated with other text processing services such as text summarization, information retrieval, and web access, which lets web users perform cross-language information retrieval on the internet. CLIR is thus naturally associated with MT.

Work on machine translation in India has been going on for several decades. During the early 1990s, advanced research in artificial intelligence and computational linguistics led to promising developments in translation technology, which helped produce usable machine translation systems in certain well-defined domains. Work on Indian machine translation is being carried out at various locations such as IIT Kanpur, Computer and Information Science [...], NCST Mumbai, CDAC Pune, and the Department of IT, Ministry of Communication and IT, Government of India. In the mid and late 1990s, further machine translation projects started at IIT Bombay, IIT Hyderabad, the Department of Computer Science and Engineering at Jadavpur University, Kolkata, JNU New Delhi and elsewhere. Research on MT systems between Indian and foreign languages, and also between Indian languages, is under way at these institutions. The Department of Information Technology, under the Ministry of Communication and Information Technology, is also working to promote language technology in India. Other Indian government ministries, departments and agencies, such as the Ministry of Human Resource, DRDO (Defence Research and Development Organisation), the Department of Atomic Energy, the All India Council of Technical Education and UGC (University Grants Commission), are also involved directly and indirectly in research and development of language technology by providing funds and financial aid for major projects in the field of MT and CLIR.

2. CLIR State of Art: Indian Language Perspective

Bengali and Hindi to English CLIR

Debasis Mandal, Mayank Gupta, Sandipan Dandapat, Pratyush Banerjee and Sudeshna Sarkar (Department of Computer Science and Engineering, IIT Kharagpur, India) presented a cross-language retrieval system for retrieving English documents in response to queries in Bengali and Hindi, as part of their participation in the CLEF 2007 Ad-hoc bilingual track. They followed a dictionary-based machine translation approach to generate the equivalent English query from Indian language topics. Their main challenge was working with a limited-coverage dictionary (about 20% coverage) for Hindi-English and a virtually non-existent dictionary for Bengali-English, so they depended mostly on a phonetic transliteration system to overcome this. The CLEF results point to the need for a rich bilingual lexicon, a translation disambiguator, a named entity recognizer and a better transliterator for CLIR involving Indian languages.

Hindi and Marathi to English Cross Language Information Retrieval

Manoj Kumar Chinnakotla, Sagar Ranadive, Pushpak Bhattacharyya and Om P. Damani (Department of CSE, IIT Bombay) presented Hindi to English and Marathi to English CLIR systems developed as part of their participation in the CLEF 2007 Ad-Hoc Bilingual task. They took a query translation based approach using bilingual dictionaries. Query words not found in the dictionary are transliterated using a simple rule-based approach which uses the corpus to return the 'k' closest English transliterations of the given Hindi/Marathi word. The resulting multiple translation/transliteration choices for each query word are disambiguated using an iterative page-rank style algorithm which, based on term-term co-occurrence statistics, produces the final translated query. Using this approach, for Hindi they achieve a Mean Average Precision (MAP) of 0.2366 using the title, which is 61.36% of monolingual performance, and a MAP of 0.2952 using title and description, which is 67.06% of monolingual performance. For Marathi, they achieve a MAP of 0.2163 using the title, which is 56.09% of monolingual performance.

Hindi and Telugu to English Cross Language Information Retrieval

Prasad Pingali and Vasudeva Varma (Language Technologies Research Centre, IIIT Hyderabad) presented the experiments of the Language Technologies Research Centre (LTRC) as part of their participation in the CLEF 2006 ad-hoc document retrieval task. They focused on Afaan Oromo, Hindi and Telugu as query languages for retrieval from an English document collection, and contributed Hindi and Telugu to English CLIR systems through their experiments at CLEF [3].

FIRE-2008 at Maryland: English-Hindi CLIR

Tan Xu and Douglas W. Oard (College of Information Studies and CLIP Lab, Institute for Advanced Computer Studies, University of Maryland) participated in the ad-hoc cross-language document retrieval task, with English queries and Hindi documents. Their experiments focused on evaluating the effectiveness of a "meaning matching" approach based on translation probabilities. The FIRE Hindi test collection provides the first opportunity to carefully assess some of the resources and techniques developed for the Translingual Information Detection, Extraction and Summarization (TIDES) program's "Surprise Language" exercise in 2003, in which a broad range of language resources was developed in a comparatively short period. The results reported appear to confirm that some of the language resources developed for the Surprise Language exercise are indeed reusable, and that meaning matching yields reasonably good results with less carefully constructed language resources than had previously been demonstrated [4].

English to Kannada / Telugu Name Transliteration in [...]

Researchers at the Department of Computer Science and Applications, Bangalore University, present a method for automatically learning a transliteration model from a sample of name pairs in two languages. Transliteration is the mapping of the pronunciation and articulation of words written in one script into another script. The specific problem they address is translating names and technical terms from English to Kannada/Telugu [5].

Kannada and Telugu Native Languages to English Cross Language Information Retrieval

Mallamma V. Reddy and Hanumanthappa (Department of Computer Science and Applications, Bangalore University) conducted experiments on translated queries. One of the crucial challenges in cross-lingual information retrieval is retrieving relevant information for a query expressed in a native language. While retrieving relevant documents is slightly easier, analyzing the relevance of the retrieved documents and presenting the results to users are non-trivial tasks. To address this, they present their Kannada-English and Telugu-English CLIR systems as part of the Ad-Hoc Bilingual task, using a translation-based approach with bilingual dictionaries. When a query word is not found in the dictionary, it is transliterated using a simple rule-based approach which uses the corpus to return the 'k' closest English transliterations of the given Kannada/Telugu word. The resulting multiple translation/transliteration choices for each query word are disambiguated using an iterative page-rank style algorithm which, based on term-term co-occurrence statistics, produces the final translated query. Finally, they conduct experiments on these translated queries using a Kannada/Telugu document collection and a set of English queries to report the improvements and the performance achieved for each task [6].

Bilingual Information Retrieval System for English and Tamil

Dr. S. Saraswathi, Asma Siddhiqaa M., Kalaimagal K. and Kalaiyarasi M. address the design and implementation of a bilingual information retrieval system for the domain of festivals. They build a generic platform for bilingual information retrieval which can be extended to any foreign or Indian language with the same efficiency. The language in which the answer is searched for is not fixed to a predefined set of standard languages but is chosen dynamically while processing the user's query. Their research deals with the Indian language Tamil in addition to English. The task is to retrieve the answer to the user's query in the same language as the query. To do this, an ontological tree is built for the domain such that every node of the tree has entries in both languages. A Part-Of-Speech (POS) tagger is used to determine the keywords in the given query. Based on the context, the keywords are translated into the appropriate language using the ontological tree. A search is performed and documents are retrieved based on the keywords. Information extraction is then carried out with the help of the ontological tree. Finally, the answer is translated back into the query language (if necessary) and presented to the user [7].

Recall Oriented Approaches for improved Indian Language Information Access

Pingali V.V. Prasad Rao (Language Technologies Research Centre, International Institute of Information Technology Hyderabad) investigated Indian language information access. The investigation shows that Indian language information access technologies face a severe recall problem when using conventional IR techniques (those used for English-like languages). During this investigation they examined the web extensively for Indian languages, characterized the Indian language web, and in the process came up with some solutions for the low recall problem. They focused their investigation on the loss of recall in monolingual and cross-lingual IR and text summarization. The following are some of their major contributions.

• They showed that Indian language information access technologies that use the state-of-the-art techniques used for English-like languages face low recall. They observed the recall loss to be relatively higher when the target language corpus is English.
• They came up with a unified information access framework which can address the problems of monolingual and cross-lingual information retrieval and text summarization.
• They showed that word spelling normalization is an essential component of Indian language information access systems, proposed a linguistically motivated rule-based approach, and showed that this approach works better than various approximate string matching algorithms.
• They modeled the problem of dictionary-based query translation as an IR problem [8].

A high recall error identification tool for Hindi Treebank Validation

Bharat Ram Ambati, Mridul Gupta, Samar Husain and Dipti Misra Sharma (Language Technologies Research Centre, International Institute of Information Technology Hyderabad) proposed a tool that has been used for validating the dependency representation of a multi-layered and multi-representational treebank for Hindi. The tool identifies errors in the Hindi annotated data at the POS, chunk and dependency levels. It uses both rule-based and hybrid systems to detect errors during the process of treebank annotation. They tested it on the Hindi dependency treebank and were able to detect 75%, 62.5% and 40.33% of errors in POS, chunk and dependency annotation respectively. For detecting POS and chunk errors, they used the rule-based system. For dependency errors, they used a combination of the rule-based and hybrid systems. The proposed approach works reasonably well for relatively small annotated datasets [9].

English Bengali Ad-hoc Monolingual Information Retrieval Task Result at FIRE

Sivaji Bandhyopadhyay, Amitava Das and Pinaki Bhaskar (Department of Computer Science and Engineering, Jadavpur University, Kolkata). Their experiments suggest that simple TF-IDF based ranking algorithms with positional information may not result in effective ad-hoc monolingual IR systems for Indian language queries. Additional information drawn from corpora, for example for query expansion, could help. Applying certain machine learning approaches to query expansion through theme detection or event tracking may increase performance. A document-level scoring entailment technique could also be a new direction to explore. Applying word sense disambiguation methods to the query words as well as the corpus would have a positive effect on the results. A robust stemmer is required for the highly inflective Indian languages [10].

Using Morphology to Improve Marathi Monolingual Information Retrieval

Ashish Almeida and Pushpak Bhattacharyya (IIT Bombay) study the effects of lexical analysis on Marathi monolingual search over a news domain corpus (obtained through FIRE-2008) and observe the effect of processes such as lemmatization, inclusion of suffixes in indexing, and stop-word elimination on retrieval performance. Their results show that lemmatization significantly improves retrieval performance for a language like Marathi, which is agglutinative in nature. They also observe that indexing of suffix terms [...] improves precision. The effects of stop-word elimination are observed as well. With all three methods combined, they obtain a mean average precision (MAP) of 0.4433 for 25 queries [11].

A Query Answering System for E-Learning Hindi Documents

Praveen Kumar, Shrikant Kashyap and Ankush Mittal (Indian Institute of Technology, Roorkee, India) developed a Question Answering (QA) system for Hindi documents that would be relevant for the many people who use Hindi as their primary language of education. The user should be able to access information from e-learning documents in a user-friendly way, that is, by questioning the system in their native language, Hindi; the system returns the intended answer (also in Hindi) by searching in context in the repository of Hindi documents. The language constructs, query structure, common words and so on are completely different in Hindi as compared to English. A novel strategy, in addition to conventional search and NLP techniques, was used to construct the Hindi QA system. The focus is on context-based retrieval of information. For this purpose they implemented a Hindi search engine that uses locality-based similarity heuristics to retrieve relevant passages from the collection. It also incorporates language analysis modules such as a stemmer and a morphological analyzer, as well as a self-constructed lexical database of synonyms. Experimental results over corpora from two important domains, agriculture and science, show the effectiveness of their approach [12].

Om: One tool for many (Indian) languages

Ganpathiraju Madhavi, Balakrishnan Mini, Balakrishnan N. and Reddy Raj (Language Technologies Institute, Carnegie Mellon University, Pittsburgh; Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India) describe the development of a transliteration scheme, Om, which exploits the phonetic nature of the alphabet. Om uses ASCII characters to represent Indian language alphabets, and thus can be read directly in English by a large number of users who cannot read the scripts of Indian languages other than their mother tongue. It is also useful in computer applications where local language tools such as email and chat are not yet available. Another significa[...]

[...] developed a multimodal interface to the computer that is relevant for India. Although India's average literacy level is about 65%, less than 5% of India's population can use English for communication. And even though the world-wide web and computer communication have given us access to information at the click of a mouse, 95% of the population is excluded from this revolution due to the dominance of English. To overcome this problem they propose to set up an Indian Language Systems Laboratory at IIT Madras. Their initial goal will be to develop a multimodal interface to the computer that is relevant for India, i.e., one that enables Indic computing [14]. The components of this Indian language interface will be:

1. Keyboard and display interface
2. Speech interface
3. Handwriting interface

Part of Speech Taggers for Morphologically Rich Indian Languages

Dinesh Kumar and Gurpreet Singh Josan (Department of Information Technology, DAV Institute of Engineering & Technology, Jalandhar, Punjab, India). Their research reports on the Part of Speech (POS) taggers [...]
