

ICITEE 2011, Yogyakarta, 28 July 2011, Gadjah Mada University
ISSN: 2088-6578

English–Indonesia Machine Translation Using Statistical Approach

Y. Astuti*, T.B. Adji#, S.S. Kusumawardani#
EE&IT Department, Gadjah Mada University
Jalan Grafika 2, 55281, Yogyakarta, INDONESIA
*yennistut s209@mail.te.ugm.ac.id
#{adji, suning}@mti.ugm.ac.id

Abstract: Most digital information is available in English. However, Indonesian people do not use English in daily conversation, so the English proficiency of most Indonesians is low. To overcome this situation, Machine Translation (MT) is needed. Such a system must map English words to Indonesian words in one-to-many, many-to-one, or many-to-many relations, so a method is required to handle these word mappings. This paper proposes an MT technique that uses a statistical approach to solve the problem. With this technique, the English–Indonesian translation of a source word becomes more adaptable to the context of the word within a sentence.

Keywords: machine translation; statistical

I. INTRODUCTION

Digital information is available in many languages, but most of it is in English. Indonesia is one of the countries that do not use English as a daily language, and as a result the English ability of Indonesian people is low. This fact creates a need for Machine Translation (MT). MT is defined in [1] as the use of computers to automate some or all of the process of translating from one language to another.

For many decades, researchers have used many methods to build robust and flexible MT systems. Five methods are well known in the MT development field. The first three are called the classical approaches: the direct approach, the rule-based/transfer approach, and the Interlingua approach [2]. The other two are data-driven approaches: the example-based approach and the statistical approach.

A direct-approach MT system was developed by a research group from Gadjah Mada University, Indonesia [3]. This system could handle many tenses, such as the present, present continuous, present perfect, past, past perfect, and future tenses. However, its precision has not been examined. An Interlingua MT system was built by a project group named the Multilingual Machine Translation System (MMTS) [4]. This was a multinational research project among China, Indonesia, Malaysia, Thailand, and Japan, with Japan as the project leader. The MMTS includes an analysis component for the Indonesian language called the Bahasa Indonesia Analyzer System (BIAS). BIAS takes Indonesian text as input and produces an abstract meaning as output. Unfortunately, the accuracy of BIAS was not reported. A rule-based MT system was built by two researchers from Petra University, Indonesia [5]. It translates many sentences from English to Indonesian, including daily conversation sentences, but its precision has not been evaluated, and it cannot handle a word that has more than one meaning.

The data-driven approaches are also known as corpus-based approaches. They use bilingual corpora to learn translation information automatically. Bilingual corpora are texts that are available in parallel in two different languages. Using bilingual corpora minimizes human involvement, so an MT system can be developed in only a couple of months. Besides enabling rapid development, this approach can also overcome the bottlenecks of the rule-based approach [6]. Brown [7] used the example-based method to translate Spanish into English. An MT application that uses the statistical approach is Google Translate, which can translate many languages (including English) into Indonesian. It uses a statistical approach based on phrase translation [8]. However, this system is not available as open source. Another statistical MT activity was conducted by the Agency for the Assessment and Application of Technology (BPPT) and the National News Agency (ANTARA). Their system was developed with Pharaoh as the decoder and trained on 500K sentence pairs [9].

In this paper, the MT system is developed from available bilingual corpora, in which English is the source language and Indonesian is the target language. The technique of English–Indonesian MT using the statistical approach is explained in several sections. Section II explains the statistical method in MT. Sections III and IV give details about other techniques that statistical MT needs, namely alignment and the decoder. Section V explains the technique of English–Indonesian MT using the statistical approach. Section VI describes the implementation and analysis. Finally, Section VII discusses the work.

II. STATISTICAL METHOD

The statistical method for MT is developed by applying Bayes' rule, as shown in (1) [10].

$$P(T \mid S) = \frac{P(S \mid T)\,P(T)}{P(S)} \tag{1}$$

The denominator P(S) can be discarded because it does not depend on T: it is the same for every candidate target sentence. Thus, (1) can be written as (2), where T refers to the target sentence (Indonesian) and S refers to the source sentence (English).

$$P(T \mid S) \propto P(S \mid T)\,P(T) \tag{2}$$

The factor P(S | T) is called the Translation Model (TM) and represents the faithfulness of the translation. The factor P(T) is called the Language Model (LM) and represents the fluency of the translation [11].

The best translation of the source sentence is found by applying (3).

$$\hat{T} = \arg\max_{T} P(S \mid T)\,P(T) \tag{3}$$

A. Translation Model (TM)

The TM, or P(S | T), is the probability of the source sentence given the target sentence. It indicates how strongly a candidate target sentence is related to the source sentence: the larger the probability, the stronger the relation. Mathematically, the TM is given in (4) [12], where s_n is the source word in the n-th position, which is translated into t_n.

$$P(S \mid T) = \prod_{n} P(s_n \mid t_n) \tag{4}$$

Given the source sentence “I go to the market by bicycle”, its relation to the target sentence Saya pergi ke pasar naik sepeda can be calculated as shown in (5).

$$P(\text{I go to the market by bicycle} \mid \text{Saya pergi ke pasar naik sepeda}) = P(\text{I} \mid \text{Saya}) \times P(\text{go} \mid \text{pergi}) \times \cdots \times P(\text{bicycle} \mid \text{sepeda}) \tag{5}$$

The probability P(go | pergi) can be obtained from the training corpora. If we have a translation table for the word pergi as in Fig. 1, then the relation between the word pergi and the word “go” is equal to 0.4. Besides “go”, the word pergi can also be the translation of the words “went” and “going”.

Figure 1. The word pergi and its translations

B. Language Model (LM)

P(T) describes the word order of the target language. Mathematically, the LM can be computed using the chain rule as shown in (6).

$$P(T) \approx \prod_{i} P(t_i \mid t_{i-N+1}, \ldots, t_{i-1}) \tag{6}$$

Equation (6) states that the occurrence of a target word is affected by the (N-1) previous target words. Equation (6) is known as the N-gram model [13]. If the occurrence of a word is affected by one previous word, the model is called a bigram model. If it is affected by two previous words, the model is a trigram model. The chain rule for the bigram model is therefore expressed as in (7) [10].

$$P(T) \approx \prod_{i} P(t_i \mid t_{i-1}) \tag{7}$$

The bigram P(t_3 | t_2) of the sentence Saya pergi ke pasar naik sepeda, that is, P(ke | pergi), is shown in Fig. 2.

Figure 2. Bigram probability for P(t_3 | t_2) in Saya pergi ke pasar naik sepeda

There are occasions where a word pair does not appear in the training data, i.e. P(t_i | t_x) = 0. In this case, a smoothing technique should be applied. It is stated in [10] that there are two smoothing techniques; one of them is Witten-Bell discounting. The Witten-Bell discounting technique is applied using (8), where Ty(t_x) refers to the number of distinct bigram types that begin with t_x.

(8) [equation not preserved in this transcription]

III. ALIGNMENT

Alignment is the process of training the translation from the parallel corpora. This process must produce the word translations to be used in the TM. However, before obtaining the word alignments, we must solve the sentence alignment. Sentence alignment finds the parallel sentences within parallel paragraphs. These parallel sentences are then processed to obtain the parallel words. In [14], there are two approaches to solving the word alignment. The first is lexical alignment and the second is the EM (Expectation Maximization) algorithm. This work uses lexical alignment because of its simplicity.

Lexical alignment obtains the word translations by using a dictionary. Consider, for example, the parallel sentences “I go to the market by bicycle” and Saya pergi ke pasar naik sepeda. From the English–Indonesian dictionary, we find that the words “I”, “go”, “to”, “market”, “by”, and “bicycle” correspond in parallel to the words Saya, pergi, ke, pasar, naik, and sepeda. In this example, the word “the” does not align to any target word, so we assign “NULL” (no alignment) as the parallel of the word “the”.
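As an illustration of how (2), (4), and (7) fit together, the following is a minimal Python sketch, not the authors' implementation. It scores candidate translations by the product of a word-based translation model and a bigram language model. The smoothing follows the textbook form of Witten-Bell discounting, which is an assumption here because (8) is not available; the toy corpus, the second training sentence, the `<s>` start marker, the uniform fallback for unseen contexts, and all names are hypothetical. Only the probabilities 0.6, 0.4 (for "I") and 0.4 (for "go" and pergi) echo values mentioned in the paper.

```python
from collections import Counter, defaultdict


def translation_model(source, target, table):
    """Eq. (4): product over positions of P(s_n | t_n); unseen pairs score 0."""
    p = 1.0
    for s, t in zip(source, target):
        p *= table.get((s, t), 0.0)
    return p


class WittenBellBigram:
    """Bigram language model, Eq. (7), with textbook Witten-Bell discounting."""

    def __init__(self, sentences):
        self.bigram = Counter()            # counts of (previous, current) pairs
        self.context = Counter()           # bigram tokens that start with a word
        self.followers = defaultdict(set)  # distinct words seen after a word
        self.vocab = set()
        for sentence in sentences:
            words = ["<s>"] + sentence     # "<s>" is an assumed start marker
            self.vocab.update(words)
            for prev, cur in zip(words, words[1:]):
                self.bigram[(prev, cur)] += 1
                self.context[prev] += 1
                self.followers[prev].add(cur)

    def prob(self, prev, cur):
        n = self.context[prev]                   # tokens observed after prev
        t = len(self.followers.get(prev, ()))    # bigram types starting with prev
        if n == 0:
            return 1.0 / len(self.vocab)         # unseen context: uniform fallback
        count = self.bigram[(prev, cur)]
        if count > 0:
            return count / (n + t)               # discounted seen bigram
        z = len(self.vocab) - t                  # word types never seen after prev
        return t / (z * (n + t)) if z else 0.0   # mass shared by unseen bigrams

    def sentence_prob(self, sentence):
        p = 1.0
        words = ["<s>"] + sentence
        for prev, cur in zip(words, words[1:]):
            p *= self.prob(prev, cur)
        return p


corpus = [
    "saya pergi ke pasar naik sepeda".split(),
    "saya pergi ke sekolah".split(),             # invented second sentence
]
lm = WittenBellBigram(corpus)

# Illustrative table keyed by (source word, target word).
table = {("I", "saya"): 0.6, ("I", "aku"): 0.4, ("go", "pergi"): 0.4}

source = ["I", "go"]
candidates = [["saya", "pergi"], ["aku", "pergi"]]

# Eq. (3): keep the candidate that maximizes P(S | T) * P(T).
best = max(
    candidates,
    key=lambda t: translation_model(source, t, table) * lm.sentence_prob(t),
)
print(best)
```

With this smoothing, the probabilities of all words following a seen context sum to one, so a bigram that never occurred in training still receives a small nonzero probability instead of zeroing out the whole sentence score.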

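The dictionary-based lexical alignment of Section III can be sketched as follows. This is a simplified illustration under stated assumptions: the dictionary contents, the function name `lexical_align`, and the rule of taking the first unused matching target word are hypothetical and are not taken from the paper.

```python
def lexical_align(source_words, target_words, dictionary):
    """Pair each source word with the first unused target word that the
    dictionary lists as its translation, or with "NULL" if none is found."""
    used = set()
    alignment = []
    for s in source_words:
        match = "NULL"
        for j, t in enumerate(target_words):
            if j not in used and t.lower() in dictionary.get(s.lower(), ()):
                used.add(j)
                match = t
                break
        alignment.append((s, match))
    return alignment


# Hypothetical dictionary entries for the paper's example sentence.
dictionary = {
    "i": {"saya", "aku"},
    "go": {"pergi"},
    "to": {"ke"},
    "market": {"pasar"},
    "by": {"naik", "oleh"},
    "bicycle": {"sepeda"},
}

pairs = lexical_align(
    "I go to the market by bicycle".split(),
    "Saya pergi ke pasar naik sepeda".split(),
    dictionary,
)
for source_word, target_word in pairs:
    print(source_word, "->", target_word)
```

Because "the" has no dictionary entry that appears in the target sentence, it is paired with NULL, which matches the example in the text.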
IV. DECODER

The decoder is also called the search algorithm. Its function is to find the best translation, that is, the one that satisfies (3). Translation starts from an initial translation, which is then improved to obtain the best translation. In [15], three decoding methods are described: the stack-based decoder, the greedy decoder, and the integer programming decoder. This research uses the greedy decoder because it makes the initial translation easy to produce. In addition, the translation quality of the greedy decoder is good enough compared to the other two decoders [15].

Consider finding the best translation of “I go to the market by bicycle” (see Fig. 3). For the initial translation, the greedy decoder picks, for each source word, the target word with the highest probability in the translation table. For example, the word “I” can be translated into the word saya with a probability of 0.6 or into the word aku with a probability of 0.4. The decoder chooses the word saya because it has the higher probability.

Figure 3. The initial sentence for the source “I go to the market by bicycle”

V. ENGLISH–INDONESIA STATISTICAL MT

The English–Indonesian statistical MT technique is illustrated in Fig. 4. It is divided into two main blocks: the Training block and the Testing block.

Figure 4. The main diagram of English-Indonesian Statistical MT

There are two kinds of input for this system. The first is the parallel corpora, which are the input for the Training block. The second is the English sentences, which are the input for the Testing block. The output of the Training block, namely the Unigram and Bigram translations, affects the output of the Testing block.

The Training block can be seen in Fig. 5. In this work, we use 30 parallel sentences for training. These parallel sentences are aligned in the Alignment component to obtain the word alignments. The word alignments are saved in the bigram and unigram models after being processed in the Bigram and Unigram components.

Figure 5. The Training block of English-Indonesian Statistical MT

The Testing block of the English–Indonesian statistical MT consists of two components (see Fig. 6). The first component is the Decoder. In this component, each English sentence is translated using the greedy decoder technique. This component provides the initial translation, which is then improved by the next component, the Bigram model. If the improved translation is the best translation, the system takes it as the final translation. If the translation can still be improved, the Bigram model reprocesses it until the best translation is obtained.

Figure 6. The Testing block of English-Indonesian Statistical MT

VI. THE RESULT AND ANALYSIS

The technique explained in Section V is applied to 30 English–Indonesian sentences for the Training block and seven English sentences for the Testing block. The English sentences for both training and testing are available on the Internet and are taken from stories about myths, legends, and fables found at popularchildrenstories.com, eastoftheweb.com, and longlongtimeago.com. These sites provide stories only in English. Thus, for the Training block, we had to translate them into Indonesian, the grammar of which is described in detail by Dwipayana [16], Keraf [17], Widyamartaya [18], and Wilujeng [19]. The testing results for the seven English sentences are given in Table I.

Now let us look at the implementation of the Training block. Our first parallel sentence pair is “Otherwise the gods will be angry with you” and Sebaliknya, para dewa akan marah padamu. First, the Alignment component processes this parallel sentence. The result of this alignment can be seen in Table II. After all the word alignments are done, the Unigram translation and the Bigram translation are built. The unigram table stores each word translation, as shown in Table III. Meanwhile, the bigram table stores every pair of adjacent target words, as can be seen in Table IV.
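The Training block described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the unigram table counts aligned word pairs and that the bigram table counts adjacent target words; the `<s>` start marker, the relative-frequency estimate, and the function names are assumptions, not details from the paper. The aligned pairs are those of the first parallel sentence.

```python
from collections import Counter


def train(aligned_sentences):
    """Build the unigram (word translation) and bigram (target order) tables."""
    unigram = Counter()
    bigram = Counter()
    for pairs in aligned_sentences:
        unigram.update(pairs)                       # count (English, Indonesian)
        targets = ["<s>"] + [t for _, t in pairs]   # "<s>" is an assumed marker
        bigram.update(zip(targets, targets[1:]))    # count adjacent target words
    return unigram, bigram


def translation_probability(unigram, source_word, target_word):
    """Relative frequency of target_word among translations of source_word."""
    total = sum(c for (s, _), c in unigram.items() if s == source_word)
    return unigram[(source_word, target_word)] / total if total else 0.0


first_sentence = [
    ("otherwise", "sebaliknya"), ("the", "para"), ("gods", "dewa"),
    ("will", "akan"), ("be", "NULL"), ("angry", "marah"),
    ("with", "pada"), ("you", "mu"),
]

unigram, bigram = train([first_sentence])
print(unigram[("be", "NULL")])       # count of the pair "be" / NULL
print(bigram[("akan", "NULL")])      # count of the target bigram akan (NULL)
print(translation_probability(unigram, "gods", "dewa"))
```

With a single training sentence every count is 1, which is why the tables for the first parallel sentence contain only ones; the counts become informative only as more aligned sentences are added.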

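The two-step Testing block (greedy initial translation, then improvement with the bigram model) might look like the sketch below. The paper does not spell out how the Bigram component improves a translation, so the hill-climbing loop and the scoring by number of seen bigrams are assumptions made purely for illustration. Passing unknown source words through unchanged mirrors the untranslated English words visible in the testing results. The table values 0.6 and 0.4 come from the paper's example; everything else is invented.

```python
def initial_translation(source, table):
    """Greedy step: take the most probable target word for each source word.
    Words missing from the table are passed through unchanged."""
    out = []
    for word in source:
        options = table.get(word.lower())
        out.append(max(options, key=options.get) if options else word)
    return out


def seen_bigram_score(words, seen):
    """Assumed score: number of adjacent word pairs that occurred in training."""
    kept = [w for w in words if w != "NULL"]
    return sum((a, b) in seen for a, b in zip(kept, kept[1:]))


def improve(source, translation, table, seen):
    """Assumed improvement step: swap in one alternative word at a time and
    keep the change only if the bigram score strictly increases."""
    best = list(translation)
    best_score = seen_bigram_score(best, seen)
    improved = True
    while improved:
        improved = False
        for i, word in enumerate(source):
            for alt in table.get(word.lower(), {}):
                candidate = best[:i] + [alt] + best[i + 1:]
                score = seen_bigram_score(candidate, seen)
                if score > best_score:
                    best, best_score, improved = candidate, score, True
    return best


table = {
    "i": {"saya": 0.6, "aku": 0.4},   # probabilities from the paper's example
    "go": {"pergi": 1.0},             # invented entry
}
seen = {("aku", "pergi")}             # invented training bigram

source = ["I", "go"]
initial = initial_translation(source, table)
final = improve(source, initial, table, seen)
print(initial, "->", final)
```

The loop terminates because the score is bounded by the sentence length and every accepted change raises it. When no training bigram supports an alternative, the initial translation is returned unchanged, which is the situation reported for the first test sentence.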
TABLE I. TESTING RESULT

No | English | Result
1 | The cottage was surrounded with the angry people. | Gubug mengelilingi dengan marah penduduk
2 | They catch those fishes with nets. | Mereka menangkap those fishes dengan jaring
3 | The rich man was very pleased with the news. | Man yang kaya itu sangat senang dengan kabar.
4 | The crocodile did not agree with him | Buaya itu did tidak setuju bersamanya.
5 | I begged you to come with me to the party. | I begged kau datang denganku ke pesta
6 | She smiled at the Prince with joy. | Dia tersenyum pada Prince dengan sukacita.
7 | The two sisters married with the two rich gentlemen. | Two sisters nikahi dengan two yang kaya pria.

TABLE II. ALIGNMENT RESULT FOR THE 1ST PARALLEL SENTENCE

[Table body not recoverable from the source; the only surviving fragments are the words bawa, itu, violin and a Count column of eight 1s.]

TABLE III. UNIGRAM TRANSLATION

English | Indonesian | Count
otherwise | sebaliknya | 1
the | para | 1
gods | dewa | 1
will | akan | 1
be | NULL | 1
angry | marah | 1
with | pada | 1
you | mu | 1

TABLE IV. TARGET LANGUAGE BIGRAM

English | Indonesian
Otherwise | sebaliknya
otherwise the | sebaliknya para
the gods | para dewa
gods will | dewa akan
will be | akan (NULL)
be angry | (NULL) marah
angry with | marah pada
with you | pada mu

Now we can proceed to the Testing block. Given the English sentence “The cottage was surrounded with the angry people.”, the sentence is translated into Indonesian as follows. The first step of the Testing block is the Decoder step, which means that we look up the Unigram Translation table. By choosing the highest probability for each word, we obtain the initial translation (NULL) gubug (NULL) mengelilingi dengan (NULL) marah penduduk. This initial translation is then passed to the Bigram component for improvement. Unfortunately, it cannot be improved any further because there are no bigram translations that can improve it. Thus, the translation of the sentence “The cottage was surrounded with the angry people” is the sentence gubug mengelilingi dengan marah penduduk.

The translation accuracy of the system based on unigram and bigram evaluations is 58% and 35%, respectively. These values are obtained by comparing the machine result with the human translation [20].

The performance of the machine translation is not good enough because of the small corpus that we used (only 30 parallel sentences). If we add more parallel sentences, the translation coverage will increase and, as a result, the translation will be better.

If we had another parallel sentence containing the words “was surrounded” and (NULL) dikelilingi, the initial translation could be improved to gubug dikelilingi dengan marah penduduk. If we had another sentence containing the words “angry (NULL)” and “(NULL) people” in parallel with penduduk yang and yang marah, the translation could be improved to gubug dikelilingi dengan penduduk yang marah. If we had another parallel sentence containing the words “the cottage” and gubug itu, the translation would become gubug itu dikelilingi dengan penduduk yang marah. This last result would be much better than the initial sentence. However, this better result can only be achieved with a huge corpus.

VII. DISCUSSION

As explained in Section VI, the number of parallel sentences is not sufficient to provide good translations. Thus, a much larger set of parallel sentences should be added to the Training block.

The judgment of whether a translation is good or bad is based on manual human perception, since this work has not included any automatic evaluation method. There are many automatic evaluation methods in the MT field, and one of them is the BLEU metric [20]. This evaluation, which compares the output of the MT system with four human translations, is a possible direction for future research.

REFERENCES

[1] R. M. Kaplan, “A general syntactic processor,” in Natural Language Processing, Rustin, R., Ed. New York: Algorithmics Press, 1973, pp. 193-241.
[2] T. B. Adji, “Annotated disjunct for machine translation,” Computer and Information Science Department, Universiti Technology Petronas, Unpublished Dissertation Rep. Malaysia, 2010.
[3] F. Novento, “Perangkat Lunak Penerjemah Kalimat Inggris Indonesia Menggunakan Metode Loading Data Sementara,” Electrical Engineering Department, Gadjah Mada University, Final Rep., 2003.
[4] H. Yusuf, “An Analysis of Indonesian Language for Interlingual Machine-Translation System,” Proc. 14th Conf. on Computational Linguistic (COLING), Nantes, France, Aug. 1992, vol. 4, pp. 1228-1232.
[5] E. Utami and S. Hartati, “Pendekatan metode rule based dalam mengalihbahasakan teks Bahasa Inggris ke teks Bahasa Indonesia,” Jurnal Informatika, vol. 8, no. 1, 2007, pp. 42-53.
[6] K. Probst, “Learning transfer rules for machine translation with limited data,” Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Ph.D. dissertation, Aug. 2005.
[7] R. D. Brown, “Example-based machine translation in the PANGLOSS system,” Proc. 16th International Conference on Computational Linguistics, Copenhagen, Denmark, 1996, pp. 169-174.
[8] F. J. Och and H. Ney, “Improved statistical alignment models,” Proc. 38th Annual Meeting of the Association for Computational Linguistics (ACL), Hong Kong, 2000, pp. 440-447.
[9] H. Riza, “Resources report on languages of Indonesia,” Proc. 6th Workshop on Asian Language Resources, Hyderabad, India, Jan. 2008, pp. 93-94.
[10] P. F. Brown, J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, “A Statistical approach to machine translation,” in Journal of Computational Linguistics, vol. 16, no. 2, Jun. 1990, pp. 79-85.
[11] R. Francois and P. Lison, “Probabilistic language modeling with N-grams,” Artificial Intelligence Seminar, Université Catholique de Louvain, Belgium, May 2005.
[12] C. Nusai, Y. Suzuki and H. Yamazaki, “Estimating word translation probabilities for Thai – English machine translation using EM Algorithm,” International Journal of Computational Intelligence 4, 2008.
[13] A. Ramanathan, P. Bhattacharyya and M. Sasikumar, “Statistical Machine Translation,” Mumbai, India, Dissertation Seminar Report, 2005.
[14] P. Koehn, “Empirical Methods in Natural Language Processing Lecture 15,” School of Informatics, California, 2008.
[15] U. Germann, M. Jahr, K. Knight, D. Marcu and K. Yamada, “Fast decoding and optimal decoding for machine translation,” Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, 2001, pp. 228-235.
[16] G. Dwipayana, “Sari Kata Bahasa Indonesia (The Essence of Indonesian Language)”, Surabaya: Terbit Terang, 2001.
[17] G. Keraf, “Tata bahasa rujukan Bahasa Indonesia”, Jakarta: PT Gramedia Widiasarana, 1999.
[18] A. Widyamartaya, “Seni menerjemahkan”, 13th ed. Yogyakarta: Kanisius, 2003.
[19] A. Wilujeng, “Inti sari kata Bahasa Indonesia lengkap”, Surabaya: Serba Jaya, 2002.
[20] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” Proc
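The unigram and bigram accuracies reported in Section VI are described only as a comparison between the machine result and a human translation. One plausible way to compute such figures is the clipped n-gram precision used inside BLEU [20]; the Python sketch below shows that computation as an assumption, not as the authors' exact procedure. The reference sentence is the improved translation discussed in Section VI, used here as a stand-in for a human translation, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous n-word sequences in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference, with each
    n-gram credited at most as often as it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


candidate = "gubug mengelilingi dengan marah penduduk".split()
reference = "gubug itu dikelilingi dengan penduduk yang marah".split()

unigram_precision = modified_precision(candidate, reference, 1)
bigram_precision = modified_precision(candidate, reference, 2)
print(unigram_precision, bigram_precision)
```

For this single sentence, four of the five candidate words appear in the reference but none of the four candidate bigrams do, which illustrates why a bigram-based score is typically much lower than a unigram-based one when word order is wrong.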
