MACHINE TRANSLATION: A CRITICAL LOOK AT THE PERFORMANCE OF RULE-BASED AND STATISTICAL MACHINE TRANSLATION

Brita Banitz1

1 Universidad de las Américas Puebla, San Andrés Cholula, México

Abstract: The essay provides a critical assessment of the performance of two distinct machine translation systems, Systran and Google Translate. First, a brief overview of both rule-based and statistical machine translation systems is provided, followed by a discussion of the issues involved in the automatic and human evaluation of machine translation outputs. Finally, the German translations of Mark Twain's The Awful German Language produced by Systran and Google Translate are critically evaluated, highlighting some of the linguistic challenges faced by each translation system.

Keywords: Rule-based machine translation. Statistical machine translation. Evaluation of machine translation output.

1. Introduction

1.1 Defining machine translation

In today's globalized world, the need for instant translation is constantly growing, a demand human translators cannot meet fast enough (Quah 57). Machine translation (MT), defined by Somers as "a range of computer-based activities involving translation" (Somers 428), is therefore considered a "cost-effective alternative to human translators" (Quah 57).

This work is licensed under a Creative Commons CC BY license: https://creativecommons.org/lice

The goal of MT is, according to Hutchins and Somers, the production of useful automatic translations within specific contexts, requiring the least amount of changes to the output in order to make it acceptable to users (Hutchins and Somers 2). But the early history of MT was driven by an unrealistic expectation of creating computer programs capable of high-quality, fully automatic translation, and the infamous ALPAC report of 1966, which argued that "MT was slower, less accurate, and twice as expensive as human translation" (Somers 428), brought MT research to a standstill in the USA. However, research in other countries continued, leading to the realization that high-quality, fully automatic translation was not feasible and that systems producing acceptable output, often based on restricted texts, were preferable (Somers 429).

Quah distinguishes between three generations of MT architectures: the first generation (1960s to 1980s) was based on direct translation; the second generation (1980s to present) consists of rule-based systems such as the transfer and interlingua systems; and the third generation (1990s to present) includes corpus-based systems that are either statistical-based or example-based (Quah 68). While direct translation systems employed a "word-for-word translation with no clear built-in linguistic component" (Quah 60), the rule-based and corpus-based systems are far more complex and will be dealt with in more detail below.

1.2 Objectives

The purpose of this essay is to provide an overview of two different approaches to MT, rule-based and statistical MT, and to critically analyze the performance of each based on the translation of a short text translated by Systran and Google Translate. Systran is a well-known rule-based system freely available online at http://www.systranet.com/translate. Google, on the other hand, is a statistical MT system which is based on a large corpus of bilingual aligned texts.
The free online translator can be accessed at https://translate.google.com.

Cad. Trad., Florianópolis, v. 40, nº 1, p. 54-71, jan-abr, 2020.

As the source text for the translation, the first 24 sentences (687 words) of the English text The Awful German Language by Mark Twain were used, whereas the German outputs of Systran and Google Translate served as the target texts for the present analysis. In addition, Schneider's human translation into German served as a reference translation to evaluate the MT output.

In section 2 below, a brief overview of both rule-based and statistical machine translation is given, followed by section 3, which presents some of the issues related to the automatic and human evaluation of the outputs provided by MT systems. In this section, the performance of both Systran and Google Translate as well as the linguistic challenges faced by both MT systems are discussed in greater detail.

2. Approaches to MT

Currently, the two most common MT systems are rule-based MT (RBMT) and statistical MT (SMT) (Costa-Jussà et al. 247). Both approaches are dealt with next.

2.1 Rule-based MT

According to Quah, "rule-based approaches involve the application of morphological, syntactic and/or semantic rules to the analysis of a source-language text and synthesis of a target-language text" (Quah 70-71), requiring "linguistic knowledge of both the source and the target languages as well as the differences between them" (Douglas, Arnold et al. 66, emphasis in original). Rule-based systems are further divided into transfer and interlingua systems (Hutchins; Somers). Interlingua systems work with an abstract intermediate representation of the source text out of which the target text is generated "without 'looking back' to the original text" (Hutchins; Somers 73). However, in practice, "designing a general-purpose interlingua is tantamount to designing a complete

model of the real world" (Forcada 219), limiting this approach to translation within specific domains only (ibid.).

Consequently, the more common approach to rule-based MT is transfer systems (218). According to Somers, transfer-based systems analyze a source text sentence by sentence, identifying the part of speech of each word and its possible meanings (Somers 433). If the source language is morphologically rich, language-specific morphological rules are used to analyze the source text. Language-specific syntactic rules are then applied to identify the syntactic categories of the words contained in the sentence. Finally, the system determines the target word and generates the target sentence closely following the structure of the source sentence (ibid.), further subjecting the target sentence to a "simple morphological generation routine" (Hutchins; Somers 134, emphasis in original) in order to apply target language-specific morphological rules to the MT output.

2.2 Statistical MT

Statistical MT, on the other hand, is currently "the overwhelmingly predominant method in MT research" (Somers 434). Working with massive bilingual corpora, the system looks for the target sentence with the highest probability match. This is different from the example-based method, in which the system searches for a previously translated sentence in an aligned corpus of translated source and target sentences, similar to using a translation memory (Forcada). Since both methods work with large corpora of parallel texts, they are commonly classified as corpus-based approaches to MT (ibid.).

Statistical MT systems are further divided into word-based and phrase-based models (Costa-Jussà et al.). Word-based models work with the assumption that for each individual word, the probability for how that word should be translated can be computed.
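The word-based assumption can be illustrated by estimating translation probabilities as relative frequencies over word-aligned data. The following is a minimal sketch; the alignment pairs and counts are invented for illustration and are not drawn from any real corpus:

```python
from collections import Counter

# Toy word-aligned data: (source word, target word) pairs as they
# might be extracted from a bilingual corpus. All counts are invented.
aligned_pairs = [
    ("house", "Haus"), ("house", "Haus"), ("house", "Heim"),
    ("awful", "schrecklich"), ("awful", "furchtbar"),
    ("awful", "schrecklich"), ("awful", "schrecklich"),
]

# Estimate P(target | source) by relative frequency of aligned pairs.
counts = Counter(aligned_pairs)
totals = Counter(source for source, _ in aligned_pairs)

def p(target, source):
    """Probability that `source` translates as `target`."""
    return counts[(source, target)] / totals[source]

print(round(p("Haus", "house"), 2))         # 2 of 3 alignments -> 0.67
print(round(p("schrecklich", "awful"), 2))  # 3 of 4 alignments -> 0.75
```

In a real system these estimates would come from millions of automatically aligned sentence pairs rather than a hand-listed table.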
However, more modern SMT systems use phrases as the unit of translation (251), where a phrase is defined as a "contiguous multiword sequence, without any linguistic motivation" (Koehn 148).
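A phrase-based system chooses among candidate target phrases by combining a phrase translation probability with a target-language model probability. The sketch below illustrates that scoring step only; all phrase pairs and probabilities are invented for illustration:

```python
import math

# Toy phrase-based scoring: rank candidates by P(s|t) * P(t),
# computed in log space. All probabilities below are invented.
translation_model = {  # P(source phrase | target phrase)
    ("the house", "das Haus"): 0.8,
    ("the house", "das Heim"): 0.2,
}
language_model = {  # P(target phrase), from target-side corpus statistics
    "das Haus": 0.6,
    "das Heim": 0.1,
}

def score(source_phrase, target_phrase):
    """Log-probability of a candidate under the two models combined."""
    tm = translation_model.get((source_phrase, target_phrase), 1e-9)
    lm = language_model.get(target_phrase, 1e-9)
    return math.log(tm) + math.log(lm)

candidates = ["das Haus", "das Heim"]
best = max(candidates, key=lambda t: score("the house", t))
print(best)  # das Haus
```

Real decoders additionally handle phrase segmentation, reordering, and many more feature scores, but the core ranking idea is the same.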

After a source text is segmented into phrases, these are subsequently compared to an aligned bilingual corpus, and a statistical measure is used to compute the most probable target-language segment based on the information gathered from the system's translation model and target-language model (Quah 77). The translation model is responsible for calculating the degree to which each source-language word contained in the phrase corresponds to possible target-language words, selecting the most probable lexical choice contained in the corpus (Somers). The target-language model, on the other hand, computes how likely it is that the target segment is considered legitimate, again based on the data contained in the bilingual corpus (ibid.). As a final step, the target text is produced with the newly translated segments (Quah). In the next section, the evaluation of MT output is discussed in greater detail.

3. Evaluation of MT output

As Douglas et al. point out, "the evaluation of MT systems is a complex task" (157) since the adequacy of a system's output largely depends on the purpose of the translation (Forcada; Somers). Therefore, there is "no golden standard against which a translation can be assessed" (Kalyani et al. 54, emphasis in original). In the following section, an overview of both the automatic and human evaluation of MT output is provided, and a critical discussion of the Systran and Google translations of the source text mentioned above is offered.

For the analysis of the MT output, the first 24 sentences of the source text were entered into Systran, translated, and copied into a Word document. The same procedure was followed for Google. Next, all of the sentences were aligned along with the corresponding reference translation and subjected to automatic and human evaluation.

3.1 Automatic evaluation

The automatic evaluation of MT output has "become the norm" (Somers 438) since it is faster and more cost-efficient (Kalyani et al.), more objective (Quah), allows for a large number of outputs to be evaluated (Somers), and provides useful and immediate feedback during system development (Forcada). According to Somers, the most widely used automatic evaluation metric is BLEU. It compares the MT output, segmented into four-word sequences, to a human reference translation in terms of lexical precision and assigns a score of 0 for the worst translation and a score of 1 for the best translation (Costa-Jussà et al. 257). However, the system is limited to a relatively small sequence of words, "penalizes valid translations that differ substantially in choice of target words or structures" (Somers 438), does not efficiently evaluate the MT output of free word order languages such as Hindi (Kalyani et al. 57), and greatly underestimates the quality of non-statistical system output compared to human raters (Callison-Burch; Osborne; Koehn), a shortcoming that also applies to other automatic evaluation engines such as METEOR and Precision and Recall (Callison-Burch et al.). As a consequence, other measures have been proposed.

One such measure is the TER score suggested by Snover, Dorr, Schwartz, Micciulla, and Makhoul. The TER score, or Translation Edit Rate, "measures the number of edits required to change a system output" (Costa-Jussà et al. 257) to match a human reference translation as closely as possible (Snover et al.). According to Snover et al., insertions, deletions, substitutions, and changes in word order count as edits (Definition of translation edit rate, para. 2). Yet, while the measure does give some indication as to how close the MT output is to a human translation, two important shortcomings have to be pointed out. First, the TER score does not necessarily reflect the acceptance or adequacy of the MT output (para. 6) and second, the measure directly depends on the quality of the reference translation since any deviation from the human translation will be penalized. Nonetheless, the TER score offers a

"more intuitive measure of 'goodness' of MT output" (Introduction, para. 2) and can be easily calculated using the Levenshtein distance calculator, a free measurement tool available online at http://planetcalc.com/1721.

Using the Levenshtein distance calculator, the TER score measure was applied to the translations of the source text provided by Systran and Google. The results of the automatic evaluation of the output are presented in Table 1 below.

Table 1: Results of the automatic evaluation of the output (length of each source, Systran, Google, and reference sentence, with TER scores per sentence for Systran and Google)
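The TER-style calculation applied here can be sketched as a word-level edit distance normalized by reference length. Note that full TER also counts phrase shifts as single edits; this simplified sketch, with invented example sentences, omits shifts:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance: minimum number of insertions,
    deletions, and substitutions turning `hyp` into `ref`.
    (Full TER additionally allows phrase shifts, omitted here.)"""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i  # delete all remaining hypothesis words
    for j in range(len(r) + 1):
        d[0][j] = j  # insert all remaining reference words
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(h)][len(r)]

def ter(hyp, ref):
    """Edits divided by reference length, expressed as a percentage."""
    return 100 * word_edit_distance(hyp, ref) / len(ref.split())

# One inserted word against a 5-word reference: 1/5 = 20% TER.
print(ter("das Haus ist gross", "das Haus ist sehr gross"))  # 20.0
```

Lower scores therefore mean the MT output needs fewer post-edits to match the reference.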

Source: Author

Table 1 above lists the word length of each of the source sentences, the length of the target sentences translated by Systran, the length of the target sentences translated by Google, and the length of the sentences of the human reference translation. The TER scores for each of the sentences translated by Systran and Google are provided along with the overall word count of each of the texts, the average word length per sentence, and the average TER scores for the Systran and Google translations.

As the table indicates, the average word length of the source text sentences was 28.6 words per sentence, very similar to Systran's translation with an average sentence length of 28.5 words per sentence. Google's sentence length was slightly less, with an average of 26.3 words per sentence, closer to the reference translation with an average of 26 words per sentence. Similarly, the total word count of the source text was 687 words; Systran's translation consisted of a total of 683 words, whereas Google's translation consisted of fewer words, a total of 630 words, again closer to the reference translation with a total of 624 words. The translation offered by Google is therefore more similar to the human translation in overall word count as well as the average number of words per sentence.

As far as the TER score is concerned, Systran's translation resulted in an average TER score of 92.2, whereas the average TER score for Google was 73.1, indicating that, in general, Google's output requires fewer edits to match the reference translation more closely. In fact, out of the 24 target sentences, only four obtained a higher TER score for the Google translation (marked in bold). It also appears clear that the longer the sentence, the higher the TER score in general. The obtained results suggest that automatic evaluation measures, at least the one used here, evaluate the SMT output more favorably than the RBMT output and that the translation by Systran requires more post-editing to be closer to the human reference translation.

3.2 Human evaluation

Although the human evaluation of MT output is costly, time-consuming, and rather subjective (Kalyani et al.), it does provide a more detailed analysis of the quality of the output depending on the rating criteria applied. From a set of target translations, the evaluator chooses the best translation option based on the provided reference translation (Farrús; Costa-Jussà; Popović). Although different rating scales do exist, the most common evaluation criteria suggested in the literature are fluency and adequacy (Quah). Fluency, also referred to as intelligibility (Douglas et al.), is concerned with both the grammatical correctness and word choice of the translation (Kalyani et al.), whereas adequacy, also called accuracy or fidelity (Douglas et al.), evaluates the degree to which the translation managed to represent the original meaning (Kalyani et al.). The rating scales suggested by Callison-Burch et al. (Implications) are, in my opinion, the most concrete suggested in the literature and were therefore used to assess the MT output provided by Systran and Google. Both scales are represented in Table 2 and Table 3 below:

Table 2: Fluency scale

Fluency: How do you judge the fluency of this translation?
5 Flawless German
4 Good German
3 Non-native German
2 Disfluent German
1 Incomprehensible

Table 3: Adequacy scale

Adequacy: How much of the meaning expressed in the reference translation is also expressed in the hypothesis translation?
5 All
4 Most
3 Much
2 Little
1 None

Source: Author

After having applied both scales to the output provided by Systran and Google, the results were summarized in Table 4 below. The table lists the sentence-by-sentence fluency and adequacy scores for the source text translations along with the average score for each scale as well as the percentage of how often one system was chosen as better. As can be seen in the table, 75% of the fluency scores were better for Google, whereas 25% were rated as equal to Systran. On the other hand, none of the sentences translated by Systran were rated better than Google, with Google achieving an average fluency score of 3.6 compared to Systran's average fluency score of 2.5.

The length of the sentence did not seem to affect the fluency scores since, regardless of length, Google's translation tended to receive a higher fluency score, indicating that the grammaticality of the translation offered by Google was generally better than Systran's. This was an expected result because, as suggested by Costa-Jussà et al., Systran's approach to translation is rule-based, translating each sentence word-for-word, which tends to result in lower fluency scores.
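The summary statistics reported here (average scale score and the share of sentences for which one system was rated better or equal) can be computed as in the following sketch; the score lists are invented examples, not the study's data:

```python
# Aggregating per-sentence 1-5 ratings into the summary statistics
# used above: average score and share of sentences rated better/equal.
# These score lists are invented for illustration only.
google_fluency  = [4, 4, 3, 4, 3, 4]
systran_fluency = [3, 2, 3, 2, 3, 2]

def avg(scores):
    return sum(scores) / len(scores)

print(f"Google avg:  {avg(google_fluency):.1f}")
print(f"Systran avg: {avg(systran_fluency):.1f}")

# Count sentences where Google was rated better than, or equal to, Systran.
better = sum(g > s for g, s in zip(google_fluency, systran_fluency))
equal = sum(g == s for g, s in zip(google_fluency, systran_fluency))
n = len(google_fluency)
print(f"Google better: {100 * better / n:.0f}%, equal: {100 * equal / n:.0f}%")
```

The same aggregation applies unchanged to the adequacy ratings.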

Table 4: Fluency and adequacy scores (sentence-by-sentence fluency and adequacy scores for Systran and Google)

Source: Author

As far as the adequacy score is concerned, Google was also evaluated as better, with 63% of the scores being higher than Systran's and 37% being equal. For shorter-than-average-length sentences, Systran did receive a better result compared to its fluency score, which indicates that the content of the source sentence was represented better than its grammatical structure might suggest. Yet, Google still received a higher score overall in terms of adequacy, representing the original meaning of the source sentence more faithfully than Systran. Therefore, even though the adequacy of Systran's translation was rated slightly better than its fluency, Google was rated better overall for both criteria.

3.3 Linguistic challenges for MT systems

The fluency and adequacy measures discussed above, however, still do not provide any insight into the types of errors both systems committed. In order to gain a better understanding of the challenges faced by both Systran and Google, a linguistic error analysis of the systems' translations of the source text was performed, taking into consideration the following sub-categories within the classification suggested by Farrús et al. (176-177) (see Table 5 below):

Table 5: Classification of linguistic errors

Semantic errors: Homograph; Polysemy
Lexical errors: Incorrect word; Unknown word; Missing target word; Extra target word


