Machine Translation for Subtitling: A Large-Scale Evaluation


Thierry Etchegoyhen,1 Lindsay Bywood,2 Mark Fishel,3 Panayota Georgakopoulou,4 Jie Jiang,5 Gerard van Loenhout,6 Arantza del Pozo,1 Mirjam Sepesy Maučec,7 Anja Turner,8 Martin Volk3

1 Vicomtech-IK4, Donostia San Sebastián, Spain; 2 Voice & Script International, London, United Kingdom; 3 Text Shuttle GmbH, Zurich, Switzerland; 4 Deluxe Media, London, United Kingdom; 5 Capita TI, Greater Manchester, United Kingdom; 6 inVision Ondertiteling, Amsterdam, The Netherlands; 7 University of Maribor, Maribor, Slovenia; 8 Titelbild Subtitling and Translation, Berlin, Germany

1 {tetchegoyhen, adelpozo}@vicomtech.org, 2 lindsay@vsi.tv, 3 {fishel, volk}@cl.uzh.ch, 4 yota.georgakopoulou@bydeluxe.com, 5 jie.jiang@capita-ti.com, 6 gerard@ondertiteling.nl, 7 mirjam.sepesy@uni-mb.si, 8 Anja.Turner@titelbild.de

Abstract

This article describes a large-scale evaluation of the use of Statistical Machine Translation for professional subtitling. The work was carried out within the FP7 EU-funded project SUMAT and involved two rounds of evaluation: a quality evaluation and a measure of productivity gain/loss. We present the SMT systems built for the project and the corpora they were trained on, which combine professionally created and crowd-sourced data. Evaluation goals, methodology and results are presented for the eleven translation pairs that were evaluated by professional subtitlers. Overall, a majority of the machine translated subtitles received good quality ratings. The results were also positive in terms of productivity, with a global gain approaching 40%. We also evaluated the impact of applying quality estimation and filtering of poor MT output, which resulted in higher productivity gains for filtered files as opposed to fully machine-translated files. Finally, we present and discuss feedback from the subtitlers who participated in the evaluation, a key aspect for any eventual adoption of machine translation technology in professional subtitling.

Keywords: statistical machine translation, user evaluation, subtitling

1. Introduction

Thanks to the availability of large amounts of parallel and monolingual corpora, statistical machine translation (SMT) systems are being developed for a wide range of domains and real-world applications. Subtitling has previously been recognized as a domain likely to benefit from machine translation technology (Volk, 2009). Although the variety of genres and content covered in subtitling represents a challenge for MT technology, subtitles are short and meaningful units which can serve as adequate training material for SMT systems.

In this paper, we describe a large-scale evaluation of SMT technology for professional subtitling work and present results describing the quality and usefulness of SMT systems whose cores were built on professionally created subtitle corpora (Petukhova et al., 2012). Quality evaluation was undertaken by professional subtitlers, who post-edited machine translated output, rated individual subtitles in terms of their quality, and collected recurrent errors. Usefulness of the SMT systems in the domain is also assessed through a measure of productivity gain/loss, comparing timed post-editing of machine translated output to translation from source.

The work we describe is part of the SUMAT project (www.sumat-project.eu), funded through the EU ICT Policy Support Programme (2011-2014) and involving nine partners: four subtitle companies (Deluxe Media, InVision, Titelbild, Voice & Script International) and five technical partners (Athens Technology Center, CapitaTI, TextShuttle, University of Maribor and Vicomtech-IK4). The goal of the project is to explore the impact of machine translation on subtitle translation and to develop an online subtitle translation service catering for nine European languages combined into 14 bidirectional language pairs: English-Dutch, English-French, English-German, English-Portuguese, English-Spanish, English-Swedish, and Serbian-Slovenian. A subset of the language pairs was used for the evaluation, selected in terms of market potential, with Serbian-Slovenian as a test case of an under-resourced language pair. The selected translation pairs were: English into Dutch, French, German, Portuguese, Spanish and Swedish; French, German and Spanish into English; and Serbian-Slovenian in both directions.

We first present an overview of the systems developed for the project and the corpora used to build them, followed by a description of the quality evaluation design and results. We then describe the experimental design and results for the productivity evaluation round, and the feedback collected throughout the evaluation.
2. SUMAT: Corpora & Systems

At their core, the machine translation systems developed within the project are phrase-based SMT systems (Koehn et al., 2003), built with the Moses toolkit (Koehn et al., 2007) and trained on professional parallel corpora provided by the subtitle companies in the SUMAT consortium. More than 2.5 million parallel subtitles were added to the resources described in (Petukhova et al., 2012), resulting in an average of 1 million aligned parallel subtitles for our language pairs, and approximately 15 million monolingual subtitles overall, which were used to train the language model components of the systems.

To improve system coverage and quality, various approaches were explored over the course of the project (Etchegoyhen et al., 2013), from the inclusion of various linguistic features to domain adaptation through additional data incorporation and selection. The most successful approach, in terms of improvement in automated metrics and system efficiency, was translation model domain adaptation (Sennrich, 2012). In this approach, separately trained translation models are combined into a joint model and their combination weights are optimized for a specific domain by minimizing the perplexity of the resulting model on a domain-specific dataset. For our models, the systems were tuned on the SUMAT development sets.
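To make the interpolation step concrete, the following Python fragment is a minimal sketch of perplexity-minimizing linear interpolation of two phrase-table probability distributions, in the spirit of the approach described above. It is not the project's actual implementation (which relied on standard Moses-side tooling such as Sennrich's tmcombine); the toy tables, the invented development pairs and the simple grid search over a single weight are assumptions made purely for illustration.

import math

# Toy phrase translation models: p(target_phrase | source_phrase).
# These tables and the dev set below are invented for illustration only.
subtitle_model = {("hi", "salut"): 0.7, ("hi", "bonjour"): 0.3,
                  ("ok", "d'accord"): 0.8, ("ok", "ok"): 0.2}
europarl_model = {("hi", "bonjour"): 0.9, ("hi", "salut"): 0.1,
                  ("ok", "d'accord"): 0.5, ("ok", "ok"): 0.5}

# In-domain development phrase pairs used to pick the interpolation weight.
dev_pairs = [("hi", "salut"), ("ok", "d'accord"), ("hi", "salut")]

def interpolated_prob(pair, weight):
    """Linear interpolation of the two models, with 'weight' on the subtitle model."""
    return (weight * subtitle_model.get(pair, 1e-9)
            + (1.0 - weight) * europarl_model.get(pair, 1e-9))

def perplexity(weight):
    """Perplexity of the interpolated model on the in-domain dev pairs."""
    log_prob = sum(math.log(interpolated_prob(p, weight)) for p in dev_pairs)
    return math.exp(-log_prob / len(dev_pairs))

# Grid search over interpolation weights, keeping the one with lowest perplexity.
best_weight = min((w / 100.0 for w in range(101)), key=perplexity)
print(f"best subtitle-model weight: {best_weight:.2f}, "
      f"perplexity: {perplexity(best_weight):.3f}")

In practice every phrase-table feature is interpolated and the weights are found with more efficient optimization than a grid search, but the objective, minimal perplexity on in-domain data, is the same.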

We tested various combinations of models built on separate data, eventually retaining the optimal combination, which consisted of models trained on the SUMAT, Europarl and OpenSubs corpora.1 Tables 1 and 2 provide an overview of the parallel corpora used to train the systems that were evaluated, and the systems' respective scores on the SUMAT test sets.2 For each language pair, the development and test sets consisted of 2000 and 4000 subtitles respectively, randomly selected across genres and domains.

Pair       SUMAT        Europarl     OpenSubs
EN-DE      1 488 341    3 763 616     4 631 974
EN-ES        978 705    1 011 054    31 456 400
EN-FR      1 326 616      977 225    19 006 604
EN-NL      1 397 810    3 762 663    21 260 772
EN-PT        762 490    4 223 816    20 128 490
EN-SV        786 783    1 862 234     7 302 603
SL-SR        167 717          n/a     1 921 087

Table 1: Parallel training corpora (number of aligned subtitles per language pair)

[Table 2: Systems evaluation on SUMAT test sets, with automatic scores for each translation direction: EN to DE/ES/FR/NL/PT/SV, DE/ES/FR/NL/PT to EN, SL to SR, SR to SL and SV to EN]

1 For both Europarl and OpenSubs, we used the corpora available in the OPUS repository (Tiedemann, 2012) and experimented with various types of data selection in distinct language pairs (e.g., data selection through bilingual cross-entropy difference (Axelrod et al., 2011)).

2 Equal indicates the percentage of MT output identical to the reference, and Lev5 is a Levenshtein-distance metric measuring the percentage of MT output that can reach a reference translation in less than five character editing steps (Volk, 2009).
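As a concrete illustration of the Equal and Lev5 metrics defined in footnote 2, the sketch below computes both over a list of machine translated subtitles and their references. The example data and the plain dynamic-programming edit distance are our own illustration, not the project's evaluation code.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def equal_and_lev5(mt_subtitles, references):
    """Percentage of MT subtitles identical to the reference (Equal) and
    reachable from the reference in fewer than five character edits (Lev5)."""
    distances = [levenshtein(mt, ref) for mt, ref in zip(mt_subtitles, references)]
    equal = 100.0 * sum(d == 0 for d in distances) / len(distances)
    lev5 = 100.0 * sum(d < 5 for d in distances) / len(distances)
    return equal, lev5

# Toy example with invented subtitles:
mt = ["I am going home.", "See you tomorow!", "That is out of the question."]
ref = ["I am going home.", "See you tomorrow!", "That's not an option."]
print(equal_and_lev5(mt, ref))  # roughly (33.3, 66.7) on this toy data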
3. Quality Evaluation

The first round of evaluation was designed to estimate the quality of the systems. Subtitles were assigned quality scores by subtitlers and we evaluated the correlation between these scores and automated metrics computed on post-edited files. We also asked subtitlers for general feedback on the post-editing experience and for any additional comments regarding their perception of MT output quality. Furthermore, we collected recurrent MT errors in order to gradually improve the systems throughout the three phases of the evaluation, each phase consisting of MT output evaluation followed by system improvement.

Each phase involved two subtitlers per translation pair, who were asked to post-edit to their usual translation quality standards and to perform the task in their usual subtitling software environment. There were two input files for each of the first two phases, and one for the third, consisting of both scripted and unscripted material from different genres and domains (e.g. drama, documentaries, magazine programmes, corporate talk shows). Note that, to increase the overall amount of different subtitles to be annotated, the evaluators did not process the same files; there was thus no measure of inter-annotator agreement in this phase. Correlation measures between ratings and post-editing effort were however computed, and are discussed in Section 3.3. Overall, 27,565 subtitles were post-edited, rated and annotated in this evaluation round. The main aspects and results of the evaluation are described hereafter.

3.1. Quality Rating

First, professional subtitlers evaluated the quality of machine translation output by assigning a score to each machine translated subtitle. The rating scale was the one established for the WMT 2012 Shared Task on MT quality estimation:3 each subtitle was to be annotated on a 1 to 5 scale indicating the amount of post-editing effort, where subtitles rated 1 signal incomprehensible and unusable MT, and subtitles rated 5 denote perfectly clear and intelligible MT output requiring little to no post-editing. Figure 1 summarizes the results for our SMT systems, taking the average of all evaluated translation pairs. The percentages rise from poor to good MT, with a predominance of machine translated output that required little post-editing effort. Given the unrestricted nature of the input data, which covered various genres, domains and language registers, these results can be considered quite satisfactory.

[Figure 1: Global rating results (percentage of subtitles per rating from 1 to 5)]

Table 3 summarizes the average rating assigned by the evaluators, and the average results on automated metrics using post-edited files as references, for all translation pairs in the experiment. With post-editing in mind, two results are worth noting: 1 in 5 machine translated subtitles required no post-editing at all, and more than 1 in 3 required fewer than five character-level editing steps. These two measures indicate a substantial volume of unambiguously useful MT output, with only minor post-editing needed.

[Table 3: Average metrics on post-edited files]

3 http://www.statmt.org/wmt12/quality-estimation-task.html

3.2. Translation Pair Comparison

The previous results were based on global averages for automated metrics and ratings. Figure 2 presents a comparative view, where the following elements were measured for each translation pair: i) the BLEU scores on the SUMAT test sets, ii) the average hBLEU scores on the post-edited files, and iii) the average rating, ported to a [0-50] scale for easier visualization.4

[Figure 2: Language pairs comparative results: hBLEU, test set BLEU and scaled rating for DE2EN, EN2DE, EN2ES, EN2FR, EN2NL, EN2PT, EN2SV, ES2EN, FR2EN, SL2SR and SR2SL]

As hBLEU scores are measured on post-edited files, they are expected to be higher than the BLEU scores on test sets, as there should be a higher number of common n-grams in a transformed (i.e., post-edited) reference text than in an independently translated reference. As can be seen in the figure, this was the case for all but one translation pair, namely Spanish to English. This is one of the surprising results in this evaluation round, given that this translation pair is the highest scoring one on the SUMAT test sets. The hBLEU and hTER results were consistent with the manual quality rating, where Spanish to English was the only translation pair for which the volume of MT output rated as poor was larger than that rated as good. A manual examination of a subset of the annotation showed that a noticeable amount of MT output rated 1 or 2 was actually grammatically correct, but fully discarded by post-editors because they had offered a different translation alternative. Although it could be argued that the letter of the evaluation guidelines was partially respected here, with low scores given for fully discarded MT output, this translation pair stands isolated with respect to the way in which grammatically correct MT output was treated. Finally, the subtitlers working on this language pair noted that several of the source files were difficult to use, with audio and template issues that rendered the post-editing task all the more difficult.

Another notable result is the very positive evaluation scores obtained for Serbian and Slovenian, which scored the lowest on the SUMAT test sets but received the highest ratings and best metrics on post-edited files. Previous manual examination of the test sets had shown them to contain large volumes of difficult and unusual text, and the results from this evaluation round seem to confirm that the quality of the SMT systems for this language pair is undervalued by the current test set scores.

For the other language pairs, the differential between metrics is quite uniform, with hBLEU scores consistently higher than test set BLEU scores, and quality ratings seemingly correlating with the automated metrics. A finer-grained analysis of correlation aspects is presented in the next section.

4 We present results in terms of BLEU scores here, rather than TER, as it makes them easier to compare with the average ratings, an increase in both being positive. The BLEU and TER metrics were very strongly correlated, with a Pearson correlation coefficient of 0.96 (p-value of 0.002), and either one can thus safely be used to present the results.
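For readers who want to reproduce this kind of comparison, the snippet below shows one way to compute hBLEU and hTER by scoring MT output against its own post-edited version with the sacreBLEU library. The subtitle strings are placeholders and sacreBLEU is our choice for the example; the paper does not state which scoring tool was used in the project.

# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

# Placeholder data: one MT subtitle per line and its post-edited counterpart.
mt_lines = ["He go to the market yesterday.", "Thanks a lot for coming."]
postedited_lines = ["He went to the market yesterday.", "Thanks a lot for coming."]

bleu = BLEU()
ter = TER()

# Using the post-edited text as the (single) reference yields hBLEU / hTER.
hbleu = bleu.corpus_score(mt_lines, [postedited_lines])
hter = ter.corpus_score(mt_lines, [postedited_lines])

print(hbleu)  # e.g. "BLEU = ..." on the toy data above
print(hter)   # e.g. "TER = ..."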
3.3. Correlation Measures

To estimate the degree to which the ratings were correlated with the actual post-editing effort, we computed the Pearson correlation coefficient between average ratings and automated metrics for each post-edited file. As can be seen in Table 4, when estimated on all translation pairs, the results ranged from moderate correlation for BLEU to strong correlation for TER (both statistically significant). As expected, the correlation between the percentage of subtitles rated 5 and Lev5 was strong.

A closer examination made apparent that three of the eleven language pairs, namely German to English, English to Spanish and English to Portuguese, showed weak inverse correlation that did not reach statistical significance. Excluding these three pairs resulted in the figures shown in the third and fourth lines of Table 4, with stronger correlation for all metrics. These results indicate that rating was strongly correlated with the actual post-editing effort, except in a minority of cases where a larger number of subtitlers would have been needed to balance out individual disparities between rating and post-editing effort.

[Table 4: Rating-Metric correlations: Pearson r and p-value, computed over all eleven pairs and over the eight pairs remaining after exclusion]
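The correlation analysis itself is straightforward to reproduce. The sketch below computes the Pearson coefficient and p-value between per-file average ratings and per-file hBLEU scores with scipy, using invented numbers in place of the project's per-file results.

from scipy.stats import pearsonr

# Per-file average rating (1-5 scale) and per-file hBLEU, one entry per
# post-edited file. The values below are invented for illustration.
avg_ratings = [3.2, 4.1, 2.8, 3.9, 4.4, 3.0]
hbleu_scores = [48.5, 61.2, 40.3, 58.7, 66.0, 45.1]

r, p_value = pearsonr(avg_ratings, hbleu_scores)
print(f"Pearson r = {r:.2f}, p-value = {p_value:.4f}")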

3.4. Error Collection

As mentioned above, we also collected recurrent MT errors for possible correction by the technical partners in the project. For this purpose, we provided evaluators with an error taxonomy and asked them to indicate errors for subtitles rated 3 or higher only, since we assumed that lower-rated subtitles would contain too many errors to distinguish them properly. The taxonomy included: agr for grammatical agreement errors; miss(ing) for content words/segments lost in the translation process; order for grammatical ordering errors in the target language; phrase for any multiword expression wrongly treated as separate words, or any separate words wrongly translated as a unit; cap for capitalization errors; punc for punctuation errors; spell(ing) for any spelling mistake; length for any machine translated output deemed too long given constraints on subtitle length; and trans(lation) for mistranslations, a large category that includes any lexical or phrasal mistranslation.

The results are given in Figure 3.5 Overall, the distribution shows a dominance of mistranslations, followed by agreement errors and segments lost in the translation process. This is not unexpected for phrase-based SMT systems, which have no access to linguistic information to handle grammatical phenomena such as agreement. Over the three phases, the systems were improved for other, more manageable categories, e.g. punctuation, capitalization and multi-word units. Given the amount of named entities in the overall subtitling domain, improving the systems in this regard was strongly requested by post-editors and led to the systems being retrained with truecasing. Finally, the results on the subtitle-specific category length are also worth noting; further research would be necessary to tune the statistical translation engine towards producing output adjusted to subtitle length constraints in the target language (see (Aziz et al., 2012) for an approach along those lines).

[Figure 3: Global distribution of errors across the categories agr, miss, order, phrase, cap, punc, spell, length and trans]

5 For the distribution of errors shown here, the agr category has been reweighted to account for a change in the error typology which was effected in phases 2 and 3. In the first phase, the trans category was omitted, as this class of errors is difficult to correct in SMT systems and no technical fixes were envisioned. However, subtitlers requested the inclusion of this error category, as they frequently felt the need to indicate such translation errors. During Phase 1, mistranslations were consequently marked as agr errors, thus over-representing this category. The figure provides a more representative view of the distribution of errors, using the ratio of trans and agr errors found in phases 2 and 3.
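Footnote 5 describes how the Phase 1 agr counts were rebalanced. The sketch below shows one way such a reweighting could be applied, assuming, as the footnote suggests, that Phase 1 agr annotations are split between agr and trans according to the trans-to-agr ratio observed in Phases 2 and 3. The counts are invented and the exact reweighting used in the project may well differ.

from collections import Counter

# Invented raw error counts per phase (category -> number of annotations).
phase1 = Counter({"agr": 420, "miss": 150, "order": 90, "cap": 60})  # no 'trans' yet
phase23 = Counter({"agr": 180, "trans": 540, "miss": 200, "order": 110, "cap": 70})

# Share of genuine agreement errors among what the Phase 1 typology
# would have lumped into 'agr' (i.e. agr + trans), estimated from Phases 2-3.
agr_share = phase23["agr"] / (phase23["agr"] + phase23["trans"])

# Split the Phase 1 'agr' counts accordingly before merging the phases.
adjusted_phase1 = Counter(phase1)
adjusted_phase1["trans"] = round(phase1["agr"] * (1 - agr_share))
adjusted_phase1["agr"] = round(phase1["agr"] * agr_share)

global_distribution = adjusted_phase1 + phase23
total = sum(global_distribution.values())
for category, count in global_distribution.most_common():
    print(f"{category:>6}: {100 * count / total:.1f}%")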
4. Productivity Measurement

The second major phase of the evaluation focused on measuring productivity gain/loss by comparing the time needed to translate a subtitle file from source with the time needed to post-edit machine translated output. We hypothesized that this type of evaluation could be a strong additional indicator of the usefulness of machine translation for professional subtitling. A pilot study was carried out in 2012 for the English-Swedish language pair, as described in (Bywood et al., 2012). There were large variations in the results, which showed both increases and decreases in productivity for subtitlers post-editing MT output.

4.1. Experimental Design

The experimental design involved the same translation pairs used for the quality evaluation round, with two subtitlers per pair. Productivity was measured in terms of subtitles per minute, comparing the speed of post-editing to that of translation from source.

In this round, an additional scenario was implemented, with automatic quality estimation and filtering of MT output.6 In this configuration, poor machine translated subtitles were removed from the MT output files, thus providing post-editors with empty MT subtitles to be translated from the source; good quality MT went through the filters unmodified, to be post-edited. The main driver for adding this third use case came from general feedback provided by subtitlers in the quality evaluation round. Although the feedback included comments regarding the surprisingly good MT quality for some translation pairs, with post-editing becoming easier after some practice, it also included repeated men-

6 Quality estimation was performed with the QuEst toolkit (Specia et al., 2013). Space limitations prevent us from providing the complete experimental design and results here. Summarizing the approach, ROC curves were constructed to choose between different
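As an illustration of the filtered scenario, the sketch below blanks out MT subtitles whose quality-estimation score falls below a threshold and leaves the rest untouched for post-editing. The scoring values and the threshold are placeholders: in the project the scores came from the QuEst toolkit and the threshold was derived from ROC curves, neither of which is reproduced here.

from typing import List, Optional

def filter_mt_subtitles(mt_subtitles: List[str],
                        qe_scores: List[float],
                        threshold: float = 0.5) -> List[Optional[str]]:
    """Keep MT output whose estimated quality reaches the threshold;
    replace poor MT with None so the subtitler translates from source."""
    return [mt if score >= threshold else None
            for mt, score in zip(mt_subtitles, qe_scores)]

# Invented example: three MT subtitles with hypothetical QE scores.
mt = ["Je suis rentré chez moi.", "Il pleut des chats et chiens.", "Merci beaucoup."]
scores = [0.82, 0.31, 0.90]

for item in filter_mt_subtitles(mt, scores, threshold=0.5):
    print(item if item is not None else "<empty subtitle: translate from source>")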
