Combination of Machine Translation Systems via Hypothesis Selection from Combined N-Best Lists


[8th AMTA conference, Hawaii, 21-25 October 2008]

Combination of Machine Translation Systems via Hypothesis Selection from Combined N-Best Lists

Almut Silja Hildebrand and Stephan Vogel
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{silja, vogel}@cs.cmu.edu

Abstract

Different approaches in machine translation achieve similar translation quality with a variety of translations in the output. Recently it has been shown that it is possible to leverage the individual strengths of various systems and improve the overall translation quality by combining translation outputs. In this paper we present a method of hypothesis selection which is relatively simple compared to system combination methods which construct a synthesis of the input hypotheses. Our method uses information from n-best lists from several MT systems and features on the sentence level which are independent from the MT systems involved to improve the translation quality.

1 Introduction

In the field of machine translation, systems based on different principles for the generation of automatic translations, such as phrase based, hierarchical, syntax based or example based translation, have advanced to achieve similar translation quality. The different methods of machine translation lead to a variety of translation hypotheses for each of the source sentences.

Extensive work has been done on the topic of system combination for a number of years, for example the ROVER system (Fiscus, 1997), which combines the output of several speech recognition systems using a voting method on the word level. In machine translation, aligning the translation hypotheses to each other poses an additional problem because of the word reordering between the two respective languages. For example, the MEMT system (Jayaraman and Lavie, 2005) and Sim et al. (2007) propose solutions to this problem.

Recently there have been a number of publications in this area, for example (Rosti et al., 2007) and (Huang and Papineni, 2007). Both of these approaches combine the system output on several levels: word, phrase and sentence level. Rosti et al. (2007) use several information sources, such as the internal scores of the input systems, n-best lists and source-to-target phrase alignments, to build three independent combination methods on all three levels. The best single combination method is the one on the word level, only outperformed by the combination of all three methods. Huang and Papineni (2007) also use information on all three levels from the input translation systems. They recalculate word lexicon costs, combine phrase tables, and boost phrase pairs and reorderings used by the input systems. Then they re-decode a lattice constructed from source-target phrase pairs used by all the input systems during the first pass. Finally they apply an independent hypothesis selection step, which uses all original systems as well as the combined system as input.

Our method is very straightforward compared to the elaborate methods mentioned above, while still achieving comparable results. We simply combine n-best lists from all input systems and then select the best hypothesis according to several feature scores on a sentence by sentence basis. Our method is independent of internal translation system scores, because those are usually not comparable. Besides the n-best lists, no further information from the input systems is needed, which makes it possible to also include non-statistical translation systems in the combination.

In section 2 we describe the three types of features used in our combination: language model features, lexical features and n-best list based features. We optimize the feature weights for the linear combination using MERT. We report our results on the large scale Chinese-English translation task combining six MT systems in section 3.

2 Features

All features described in this section are calculated based on the translation hypotheses only. We do not use any feature scores assigned to the hypotheses by the individual translation systems, but recalculate all feature scores in a consistent manner. In preliminary experiments we added the system scores to our features in the combination. This did not improve the combination result and in some cases even hurt the performance, probably because the scores and costs used by the individual systems are generally not comparable.

Using only system independent features enables our method to use the output of any translation system, no matter what method of generation was used there.

Because the strengths of individual systems might vary on the level of different genres as well as on a sentence by sentence basis, we also did not want to assign global weights to the individual systems.

The system independence of our features also leads to a robustness regarding varying performance of the individual systems on the different test sets used for tuning the feature weights and the unseen test data. For example, the NIST test set from 2003 contains only newswire data, while the one from 2006 also contains weblog data; hence global system weights trained on MT03 might not perform well on MT06. It is also robust to incremental changes in an individual system between translation of the tuning and the testing data sets.

2.1 Language Models

To calculate language model scores, we use traditional n-gram language models with n-gram lengths of four and five. We calculate the score for each sentence in the n-best list by summing the log-probability for each word, given its history. We then normalize the sentence log-probability with the target sentence length to get an average word log-probability, which is comparable for translation hypotheses of different length.

2.2 Statistical Word Lexica

Brown et al. (1990) describe five statistical models for machine translation, the so-called IBM model 1 to IBM model 5. We use word lexica from either model 1 or model 4, which contain translation probabilities for source-target word pairs.

The statistical word-to-word translation lexicon allows us to calculate the translation probability P_lex(e) of each word e of the target sentence. P_lex(e) is the sum of all translation probabilities of e for each word f_j from the source sentence f_1^J. This feature does not take word order or word alignment into account.

  P_{lex}(e \mid f_1^J) = \frac{1}{J+1} \sum_{j=0}^{J} p(e \mid f_j)    (1)

where f_1^J is the source sentence, J is the source sentence length, f_0 is the empty source word and p(e \mid f_j) is the lexicon probability of the target word e, given one source word f_j.

Because the sum in equation 1 is dominated by the maximum lexicon probability, as described in (Ueffing and Ney, 2007), we also use it as an additional feature:

  P_{lex\text{-}max}(e \mid f_1^J) = \max_{j=0,\dots,J} p(e \mid f_j)    (2)

For both lexicon score variants we calculate an average word translation probability as the sentence score: we sum over all words e_i in the target sentence and normalize with the target sentence length I.

From the word lexicon we also calculate the percentage of words whose lexicon probability falls under a threshold. In one language direction it represents the fraction of source words that could not be translated, and in the other direction it gives the fraction of target words that were generated from the empty word or were translated unreliably. This word deletion model was described and successfully applied in (Bender et al., 2004) and (Zens et al., 2005).

All three lexicon scores are calculated in both language directions: the source and the target sentence switch roles and the lexicon from the reverse language direction is used. This results in six separate features per pair of statistical word lexica.
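The two lexicon scores of equations (1) and (2) can be sketched as follows. This is an illustrative sketch, not the authors' code: the dictionary layout of the lexicon, the probability floor for unseen pairs and the function name are our assumptions.

```python
import math

def lexicon_scores(target, source, lex, empty_word="NULL"):
    """Average word translation log-probabilities for one hypothesis.

    lex maps (target_word, source_word) -> p(e|f); pairs not in the
    lexicon get a small floor probability (our assumption).  Returns
    the length-normalized scores for eq. (1), the averaged sum over
    all source words including the empty word f_0, and eq. (2), the
    averaged maximum lexicon probability.
    """
    floor = 1e-7
    src = [empty_word] + source          # f_0 is the empty source word
    sum_score, max_score = 0.0, 0.0
    for e in target:
        probs = [lex.get((e, f), floor) for f in src]
        sum_score += math.log(sum(probs) / len(src))   # eq. (1)
        max_score += math.log(max(probs))              # eq. (2)
    I = len(target)                      # normalize by target length I
    return sum_score / I, max_score / I
```

Running the same function with the reverse-direction lexicon and the sentence roles swapped yields the scores for the other language direction.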

2.3 Position Dependent N-best List Word Agreement

The agreement score of a word e occurring in position i of the target sentence is calculated as the relative frequency of the N_k translation hypotheses in the n-best list for source sentence k containing word e at position i. It is the percentage of entries in the n-best list which "agrees" on a translation with e in position i. As described in (Ueffing and Ney, 2007), the relative frequency of e occurring in target position i in the n-best list is computed as:

  h_k(e_i) = \frac{1}{N_k} \sum_{n=1}^{N_k} \delta(e_{n,i}, e)    (3)

Here N_k is the number of entries in the n-best list for the corresponding source sentence k and \delta(w_1, w_2) = 1 if w_1 = w_2.

This feature tries to capture how many entries in the n-best list agree not only on the same word choice in the translation, but also on the same word order. Since corresponding word positions might be shifted due to variations earlier in the sentence, we also use a word agreement score based on a window of size t around position i, covering positions i-t to i+t. The agreement score is calculated accordingly:

  h_k(e_i) = \frac{1}{N_k} \sum_{n=1}^{N_k} \delta(e_{n,i-t} \cdots e_{n,i+t}, e)    (4)

where \delta(e_{n,i-t} \cdots e_{n,i+t}, e) = 1 if word e occurs in the window e_{n,i-t} \cdots e_{n,i+t} of n-best list entry n.

The score for the whole hypothesis is the sum over all word agreement scores normalized by the sentence length. We use window sizes for t = 0 to t = 2 as three separate features.

2.4 Position Independent N-best List N-gram Agreement

The n-gram agreement score of each n-gram in the target sentence is the relative frequency of target sentences in the n-best list for one source sentence that contain the n-gram e_{i-(n-1)} ... e_i, independent of the position of the n-gram in the sentence:

  h_k(e_{i-(n-1)}^{i}) = \frac{1}{N_k} \sum_{j=1}^{N_k} \delta(e_{i-(n-1)}^{i}, e_{1,j}^{I})    (5)

where \delta(e_{i-(n-1)}^{i}, e_{1,j}^{I}) = 1 if the n-gram e_{i-(n-1)}^{i} occurs in n-best list entry e_{1,j}^{I}.

This feature represents the percentage of the translation hypotheses which contain the respective n-gram. If a hypothesis contains an n-gram more than once, it is only counted once; hence the maximum for h is 1.0 (100%). The score for the whole hypothesis is the sum over the word scores normalized by the sentence length. We use n-gram lengths n = 1 to n = 6 as six separate features.

2.5 N-best List N-gram Probability

The n-best list n-gram probability is a traditional n-gram language model probability. The counts for the n-grams are collected on the n-best list entries for one source sentence only. No smoothing is applied, as the model is applied to the same n-best list it was trained on; hence the n-gram counts will never be zero. The n-gram probability for a target word e_i given its history e_{i-(n-1)}^{i-1} is defined as

  p(e_i \mid e_{i-(n-1)}^{i-1}) = \frac{C(e_{i-(n-1)}^{i})}{C(e_{i-(n-1)}^{i-1})}    (6)

where C(e_{i-(n-1)}^{i}) is the count of the n-gram e_{i-(n-1)} ... e_i in all n-best list entries for the respective source sentence.

This feature set is derived from (Zens and Ney, 2006), with the difference that we use simple counts instead of fractional counts. This is because we want to be able to use this feature in cases where no posterior probabilities from the translation system are available.

The probability for the whole hypothesis is normalized by the hypothesis length to get an average word probability. We use n-gram lengths n = 1 to n = 6 as six separate features.

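The n-gram agreement of equation (5) and the unsmoothed n-best-list n-gram model of equation (6) can be sketched as follows. This is an illustrative sketch; the function names and the truncation of histories at the sentence start are our assumptions, and, as in the paper, the scored hypothesis is assumed to be an entry of the list itself, so no count is zero.

```python
import math
from collections import Counter

def ngram_agreement(hyp, nbest, n):
    """Position-independent n-gram agreement, eq. (5): for each n-gram
    of the hypothesis, the fraction of n-best entries containing it
    (counted at most once per entry), summed and normalized by the
    sentence length as in the paper."""
    grams = [tuple(hyp[i - n + 1: i + 1]) for i in range(n - 1, len(hyp))]
    if not grams:
        return 0.0
    sets = [{tuple(e[i - n + 1: i + 1]) for i in range(n - 1, len(e))}
            for e in nbest]
    score = sum(sum(g in s for s in sets) / len(nbest) for g in grams)
    return score / len(hyp)

def ngram_logprob(hyp, nbest, n):
    """N-best-list n-gram model, eq. (6): counts are collected on the
    n-best entries for this source sentence only, no smoothing, and
    the log-probability is normalized by the hypothesis length."""
    num, den = Counter(), Counter()
    for e in nbest:
        for i in range(len(e)):
            lo = max(0, i - n + 1)            # truncate history at start
            num[tuple(e[lo: i + 1])] += 1     # C(e_{i-(n-1)}^{i})
            den[tuple(e[lo: i])] += 1         # C(e_{i-(n-1)}^{i-1})
    lp = 0.0
    for i in range(len(hyp)):
        lo = max(0, i - n + 1)
        lp += math.log(num[tuple(hyp[lo: i + 1])] / den[tuple(hyp[lo: i])])
    return lp / len(hyp)
```

Calling both with n = 1 to 6 yields the twelve n-best-list n-gram features.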
3 Experiments

3.1 Evaluation

In this paper we report results using the BLEU (Papineni et al., 2002) and TER (Snover et al., 2005) metrics. In the MER training we optimize for maximum BLEU.

The Chinese to English test sets from the NIST MT evaluations in 2003 and 2006 were used as development and unseen test data. The MT03 test set contains 919 sentences of newswire data with four reference translations. From the 2006 NIST evaluation we used the text translation portion of the NIST part of the test data, which consists of 1099 sentences of newswire and weblog data with four references.

For each result reported in the following sections we used MER training to optimize the feature weights on an n-best list for MT03. These lists always contained the same set of systems and were combined under the same conditions as for the unseen data.

3.2 Models

In all of the following experiments we used two language models, six features from a pair of statistical word lexica, three features from the position dependent n-best list word agreement, and six features each from the n-best list n-gram agreement as well as the n-best list n-gram probability: 23 features in total.

The language models were trained on the data from all sources in the English Gigaword Corpus V3, which contains several newspapers from the years 1994 to 2006, observing the blackout dates for all NIST test sets. We also included the English side of the bilingual training data, resulting in a total of 2.7 billion running words after tokenization.

From these corpora we trained two language models: a 500 million word 4-gram LM from the bilingual data plus the data from the Chinese Xinhua News Agency, and an interpolated 5-gram LM from the complete 2.7 giga-word corpus.

We trained separate open vocabulary language models for each source and interpolated them using the SRI Language Modeling Toolkit (Stolcke, 2002). Held-out data for the interpolation weights was comprised of one reference translation each from the Chinese MT03, MT04 and MT05 test sets. Table 1 shows the interpolation weights for the different sources. Apart from the English part of the bilingual data, the newswire data from the Chinese Xinhua News Agency and the Agence France Presse have the largest weights. This reflects the makeup of the test data, which comes in large parts from these sources. Other sources, for example the UN parliamentary speeches or the New York Times, differ significantly in style and vocabulary from the test data and, therefore, get small weights.

  xin 0.30   bil 0.26   afp 0.21   cna 0.06
  un  0.07   apw 0.05   nyt 0.03   ltw 0.01

  Table 1: LM interpolation weights per source

The statistical word lexica were trained on the Chinese-English bilingual corpora relevant to GALE available through the LDC. After sentence alignment and data cleaning these sources add up to 10.7 million sentences with 260 million running words on the English side. The lexica were trained with the GIZA++ toolkit (Och and Ney, 2003).

The research groups who provided the system outputs all had access to the same training data.

3.3 Systems

We used the output from six different Chinese-English machine translation systems trained on large data for the GALE and NIST evaluations in the beginning of 2008. They are based on phrase based, hierarchical and example based translation principles, trained on data with different Chinese word segmentations, built by three translation research groups, running four MT decoders. The systems A to F are ordered by their performance in BLEU on the unseen Chinese MT06 test set (see Table 2).

  system      A      B      C      D      E      F
  MT06 BLEU   31.45  31.28  31.25  31.04  30.36  26.00
  MT06 TER    59.43  57.92  57.55  57.20  59.32  62.43

  Table 2: Individual systems sorted by BLEU on MT06
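The interpolated 5-gram LM of section 3.2 combines the per-source models linearly with the weights of Table 1. A minimal sketch of such an interpolation, assuming each per-source model is given as a conditional-probability function; the names are illustrative, not the toolkit's API.

```python
import math

def interpolated_logprob(word, history, models, weights):
    """Linearly interpolated language model:
    p(w|h) = sum_s lambda_s * p_s(w|h), with one component model per
    source and weights tuned on held-out data (cf. Table 1).
    `models` maps a source name to a function p_s(word, history)."""
    p = sum(weights[s] * models[s](word, history) for s in models)
    return math.log(p)
```

In practice such a mixture is built once with the SRI Language Modeling Toolkit rather than evaluated per word as here.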

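The overall selection procedure, pooling the n-best lists of all input systems and keeping, per source sentence, the hypothesis with the highest weighted feature sum, can be sketched as follows. The data layout and function names are our assumptions; real features would be the 23 described in section 2, with weights from MERT.

```python
def select_hypotheses(nbest_lists, features, weights):
    """Hypothesis selection from combined n-best lists.

    nbest_lists: one group per source sentence; each group is a list
    of per-system n-best lists of hypotheses (token lists).
    features: functions f(hyp, pool) -> float, recomputed consistently
    on the pooled list.  weights: one MERT-tuned weight per feature.
    Duplicates are deliberately kept, so agreement features are boosted.
    """
    selected = []
    for per_system in nbest_lists:
        pool = [h for system in per_system for h in system]
        def score(h):
            return sum(w * f(h, pool) for w, f in zip(weights, features))
        selected.append(max(pool, key=score))
    return selected
```

With a single length feature and weight 1.0, the longest pooled hypothesis is selected, which illustrates the mechanics but of course not the tuned behavior.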
3.4 Feature Impact

For comparison to our set of 23 features, we ran our setup with the two language models only as a simple baseline. In (Och et al., 2004) word lexicon features were described as the most useful features for n-best list re-scoring. Thus, we added those to the language model probabilities as a second baseline (LM+Lex). The results in Table 3 show that a system combination which uses these models alone cannot improve over the BLEU score of 31.45 of system A. This is probably the case because the statistical systems among the input systems already use this type of information, and in fact share the training data which was used to build those models.

To explore the question which feature group contributes the most to the improvement in translation quality, and to avoid testing all possible combinations of features, we removed one feature group at a time from the complete set. Table 3 shows that, although adding the word lexicon features to the language models did not improve the result for the LM+Lex baseline, the overall result still drops slightly from 33.72 BLEU for all features to 33.61 BLEU for no Lex. The combination result decreases insignificantly but consistently when removing any feature group.

  features        MT06 BLEU / TER
  LM only         31.17 / 59.34
  LM+Lex          30.97 / 59.41
  no LM           32.83 / 56.23
  no Lex          33.61 / 56.88
  no WordAgr      33.67 / 57.25
  no NgrAgr       33.47 / 56.58
  no NgrProb      33.65 / 57.40
  LM+NgrAgr       33.58 / 57.15
  all features    33.72 / 56.79

  Table 3: Impact of feature groups on the combination result

The biggest drops are caused by removing the language model (-0.89 for no LM) and the n-gram agreement (-0.25 for no NgrAgr) feature groups. Using only those feature groups which have the biggest impact brings the combination result up to 33.58 BLEU, which is close to the best, but using all features still remains the best choice.

3.5 N-Best List Size

To find the optimal size for the n-best list combination, we compared the results of using list sizes from 1-best up to 1000-best for each individual system. Hasan et al. (2007) investigated the impact of n-best list size on rescoring performance. They tested n-best list sizes up to 100,000 hypotheses and found that using more than 10,000 hypotheses does not help to improve the translation quality, and that the difference between using 1,000 and 10,000 hypotheses was very small. Based on their results we decided not to go beyond 1000-best.

Because unique 1000-best lists were only available from the systems A, B, D and E, we ran two series of experiments, using the top two systems as well as these four systems. In the combination of two systems the n-best list sizes of 25 and 50 hypotheses achieve virtually the same score (see Figure 1). The same is true for sizes 50 and 100 in the combination of four systems.

  [Figure 1: Combination results for different n-best sizes for two systems (A & B) and four systems (A, B, D & E) for MT06 in BLEU]

Because the experiments agree on 50 as the optimal size of the n-best list from each input system, and also because we only had unique 50-best lists available for one of the six systems, we chose the n-best list size of 50 hypotheses for all our following experiments.

The reason why the optimal n-best list size is rather small could be that the input lists are ordered by the producing systems. Including hypotheses lower in the list introduces more and more bad hypotheses along with some good candidates. The restriction of the n-best list to the small size of 50 could be interpreted as indirectly using the knowledge of the decoder about the quality of the hypotheses, which is represented in the rank.

3.6 Combination of all Systems

Starting with rescoring the n-best list of system A by itself, we progressively added all systems to the combination. The results in Table 4 show that adding systems one by one improves the result, with a smaller impact for each additional system.

  systems        MT06 BLEU / TER
  A              31.76 / 58.95
  A+B            32.86 / 57.90
  A+B+C          33.32 / 56.87
  A+B+C+D        33.51 / 56.77
  A+B+C+D+E      33.72 / 56.79
  A+B+C+D+E+F    33.63 / 56.45

  Table 4: Combination results for adding in all systems progressively for MT06 in BLEU/TER

We did not remove duplicate hypotheses generated by different input systems, because the boosting effect in the n-best list based features is desired. In the combination of all six systems, for example, 72 of the chosen hypotheses were generated by two systems and 4 by all six systems. These are typically very short sentences, for example by-lines.

  [Figure 2: Contribution of each system to the new first best of the five system combination for MT03 and MT06]

We achieved the best result of 33.72 BLEU by combining five systems, which is a gain of 2.27 points over the best single system. The BLEU score on the tuning

