Attention-based Multimodal Neural Machine Translation

Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh†, Chris Dyer
Language Technologies Institute, Robotics Institute†
Carnegie Mellon University, Pittsburgh, PA, USA
{poyaoh, fliu1, sshiang, cdyer}@cs.cmu.edu, jeanoh@nrec.ri.cmu.edu†

Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 639-645, Berlin, Germany, August 11-12, 2016. © 2016 Association for Computational Linguistics

Abstract

We present a novel neural machine translation (NMT) architecture associating visual and textual features for translation tasks with multiple modalities. Transformed global and regional visual features are concatenated with text to form attendable sequences which are dissipated over parallel long short-term memory (LSTM) threads to assist the encoder in generating a representation for attention-based decoding. Experiments show that the proposed NMT outperforms the text-only baseline.

Figure 1: Attention-based neural machine translation framework using a context vector to focus on a subset of the encoding hidden states.

1 Introduction

In machine translation, neural networks have recently attracted a lot of research attention, and the encoder-decoder framework is widely used. Nevertheless, the main drawback of this neural machine translation (NMT) framework is that the decoder depends only on the last state of the encoder, which may deteriorate performance when the sentence is long. To overcome this problem, the attention-based encoder-decoder framework shown in Figure 1 was proposed. With the attention model, at each time step the decoder depends on both the previous LSTM hidden state and the context vector, which is a weighted sum of the hidden states in the encoder. With attention, the decoder can "refresh" its memory to focus on the source words that help to translate the correct words, rather than only seeing the last state of the sentence, where the individual words and their ordering are lost.

Most machine translation tasks only focus on the textual sentences of the source and target languages; in the real world, however, sentences may also carry information about what people see. Beyond bilingual translation, in the WMT 16' multimodal translation task we would like to translate image captions from English into German. With the additional information from images, we can further resolve ambiguity in language. For example, the word "bank" may refer to the financial institution or to the land at the river's edge, which is confusing if we only look at the language itself. In this task, the image may help to disambiguate the meaning: if it shows a river, then "bank" means "river bank".

In this paper, we explore approaches to integrating multimodal information (text and image) into the attention-based encoder-decoder architecture. We transform the visual features into additional encoder steps alongside the text, making it possible to attend to both the text and the image while decoding. The image features we use are (visual) semantic features extracted from the entire image (global) as well as from regional bounding boxes proposed by region-based convolutional neural networks (R-CNN) (Girshick et al., 2014). In the following section we first describe related work; we then introduce the proposed multimodal attention-based NMT in Section 3, followed by re-scoring of the translation candidates in Section 4. Finally, we present the experiments in Section 5.

2 Related Work

With the advances of deep learning, Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Jean et al., 2014) leveraging the encoder-decoder architecture has attracted research attention. Under the NMT framework, less domain knowledge is required, and large training corpora can compensate for it. However, the encoder-decoder structure encodes the source sentence into a single fixed-length vector, which may deteriorate translation performance as source sentences grow longer. (Bahdanau et al., 2014) extended the encoder-decoder structure so that the decoder focuses only on parts of the source sentence. (Luong et al., 2015) further proposed attention-based models that combine a global attentional mechanism, attending to all source words, with a local one, focusing only on a subset of the source words.

Rather than using the embedding of each modality independently, some works (Hardoon et al., 2004; Andrew et al., 2013; Ngiam et al., 2011; Srivastava and Salakhutdinov, 2014) focus on learning a joint space of different modalities. In machine translation, (Zhang et al., 2014; Su et al., 2015) learned phrase-level bilingual representations using recursive auto-encoders. Beyond textual embedding, (Kiros et al., 2014) proposed a CNN-LSTM encoder to project two modalities into the same space. Building on such joint learning of multiple modalities or languages, we find it possible to evaluate translation quality: if the representation of the translated sentence is similar to that of the source sentence or the image, it may imply that the translation is good.

3 Attention-based Multimodal Machine Translation

Based on the encoder-decoder framework, the attention-based model aims to handle the missing order and source information problems of the basic encoder-decoder framework. At each time step t in the decoding phase, the attention-based model attends to subsets of words in the source sentence that form the context helping the decoder predict the next word. The model infers a variable-length alignment weight vector a_t based on the current target state h_t and all source states h_s. The context feature vector c_t = a_t · h_s is the weighted sum of the source states h_s according to a_t, which is defined as:

    a_t(s) = exp(score(h_t, h_s)) / Σ_{s'} exp(score(h_t, h_{s'}))    (1)

The scoring function score(h_t, h_s) can be regarded as a content-based measurement of the similarity between the currently translated target and the source words. We utilize a transformation matrix W_a, which associates source and target hidden states, to learn a general similarity measure:

    score(h_t, h_s) = h_t^T W_a h_s    (2)

We produce an attentional hidden state ĥ_t by learning W_c of a single-layer perceptron activated by tanh. The input is simply the concatenation of the target hidden state h_t and the source-side context vector c_t:

    ĥ_t = tanh(W_c [c_t; h_t])    (3)

After generating the context vector and the attentional hidden state, the target word is predicted through the softmax layer over the attentional hidden state ĥ_t by p(y_t | x) = softmax(W_s ĥ_t). In the following, we introduce how we incorporate image features based on this attention model.
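To make the attention computation concrete, the following is a minimal numpy sketch of one decoding step following Eqs. 1-3. The array shapes, the function name, and the stand-alone softmax helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(h_t, H_s, W_a, W_c, W_s):
    """One decoding step of global attention (Eqs. 1-3).

    h_t : (d,)    current target hidden state
    H_s : (S, d)  all source hidden states
    W_a : (d, d)  general scoring matrix
    W_c : (d, 2d) attentional-state projection
    W_s : (V, d)  output projection over the target vocabulary
    """
    scores = H_s @ (W_a @ h_t)                          # Eq. 2 for every source position, (S,)
    a_t = softmax(scores)                               # Eq. 1: alignment weights, (S,)
    c_t = a_t @ H_s                                     # context vector: weighted sum of source states, (d,)
    h_hat = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # Eq. 3: attentional hidden state, (d,)
    p_y = softmax(W_s @ h_hat)                          # p(y_t | x): distribution over target words, (V,)
    return a_t, c_t, h_hat, p_y
```

With the 256-cell LSTM used later in Section 5.1, for instance, W_c maps the 512-dimensional concatenation [c_t; h_t] back to a 256-dimensional attentional state.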
3.1 Model 1: LSTM with global visual feature

Visual features from a convolutional neural network (CNN) may provide additional information to the textual features in multimodal machine translation. As depicted in Figure 2, we propose to append a visual feature at the head/tail of the original text sequence in the encoding phase. Note that, for simplicity, we omit the attention part in the following figures.

The global (i.e., whole-image) visual feature is extracted from the last fully connected layer known as fc7, a 4096-dimensional semantic layer of the 19-layer VGG network (Simonyan and Zisserman, 2014). Because of the dimension mismatch and the inherent difference in content between the visual and textual embeddings, a transformation matrix W_img is learned to map visual features into the word-embedding space. The encoder then encodes both the textual and visual feature sequences to generate the representation for decoding. In the decoding phase, the attention model weights all the hidden states of the encoding phase and produces the context vector c_t with Eq. 1 and Eq. 2 for NMT decoding.
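As a sketch of Model 1's input construction, the snippet below projects the 4096-dimensional fc7 feature with W_img and prepends (or appends) it to the embedded source words as one extra encoder step. Names and shapes are assumptions for illustration; the attention over this extended sequence is unchanged from the sketch above.

```python
import numpy as np

def build_model1_inputs(word_embeddings, fc7, W_img, at_head=True):
    """Form the encoder input sequence for Model 1.

    word_embeddings : (T, d)   embedded source words
    fc7             : (4096,)  global VGG-19 fc7 feature of the image
    W_img           : (d, 4096) learned projection into the word-embedding space
    at_head         : place the image step before (True) or after (False) the text
    """
    img_step = (W_img @ fc7)[None, :]     # the image treated as one encoder step, (1, d)
    if at_head:
        return np.vstack([img_step, word_embeddings])
    return np.vstack([word_embeddings, img_step])
```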

Figure 2: Model 1: Attention-based NMT with a single additional global visual feature. The decoder may attend to both the text and image steps of the encoding. For clarity, the possible attention paths are hidden here.

Figure 3: Model 2: Attention-based NMT with multiple additional regional visual features.

3.2 Model 2: LSTM with multiple regional visual features

In addition to adding only one global visual feature, we extend the original NMT model by incorporating multiple regional features, in the hope that those regional visual attributes will assist the LSTM in generating better and more accurate representations. The proposed model is depicted in Figure 3. We first explain how multiple regions are determined in an image, and then how these visual features are extracted and sorted.

Intuitively, objects in an image are most likely to appear in both the source and target sentences. Therefore, we utilize the region proposal network (RPN) of the region-based convolutional neural network (R-CNN) (Ren et al., 2015) to identify objects and their bounding boxes in an image and then extract visual features from those regions. In order to integrate these regions into the original sequence of the LSTM model, we design a heuristic approach to sort the visual features: the regional features are fed in ascending order of bounding-box size, followed by the original global visual feature and then the text sequence. Visual features are fed in this order so that important features sit closer to the encoded representation; heuristically, larger objects may be more noticeable and essential in an image described by both the source and target language contexts.

In the implementation, we choose the top 4 regional objects plus the whole image and extract their fc7 features with VGG-19 to form the visual sequence, followed by the text sequence. If fewer than 4 objects are recognized in the original image, zero vectors are padded instead for batch processing during training.
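The ordering heuristic of Model 2 can be sketched as follows. The ascending-area sort, the zero padding to four regions, and the placement of the global feature next to the text follow the description above; the function name, shapes, and box format are assumptions.

```python
import numpy as np

def build_model2_visual_prefix(region_feats, region_boxes, global_feat, W_img,
                               max_regions=4):
    """Build the visual part of the Model 2 encoder input.

    region_feats : list of (4096,) fc7 vectors for the chosen region proposals
                   (at most max_regions of them)
    region_boxes : list of (x1, y1, x2, y2) boxes aligned with region_feats
    global_feat  : (4096,) fc7 feature of the whole image
    W_img        : (d, 4096) shared visual-to-embedding projection

    Regional steps are ordered by ascending bounding-box area, zero-padded to
    max_regions, and followed by the global image feature; the text sequence is
    appended after this prefix elsewhere.
    """
    d = W_img.shape[0]
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    order = sorted(range(len(region_feats)), key=lambda i: area(region_boxes[i]))
    steps = [W_img @ region_feats[i] for i in order]
    while len(steps) < max_regions:          # pad with zero vectors for batching
        steps.append(np.zeros(d))
    steps.append(W_img @ global_feat)        # global feature sits next to the text
    return np.stack(steps)                   # (max_regions + 1, d)
```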

Figure 4: Model 3: Parallel LSTM threads with multiple additional regional visual features.

3.3 Model 3: Parallel LSTM threads

To further relax the assumption that regional objects share some pre-defined order, we propose a parallel structure as shown in Figure 4. The encoder is composed of multiple encoding threads in which all LSTM parameters are shared. In each thread, a (regional) visual feature is followed by the text sequence. This parallel structure associates the text with the most relevant objects in the encoding phase and distinguishes them when computing attention during decoding. Intuitively, the text sequence following a regional object can be interpreted as encoding the visual information together with its textual description (i.e., encoding the caption as well as the visual features for that object). An encoder hidden state used for attention can be interpreted as a "word" imprinted with the semantic features of some regional object. The decoder can therefore distinctively attend to words that describe different visual objects in different threads.

In the encoding phase, the LSTM parameters are shared over the threads, and all hidden states over the multiple threads are recorded for attention. At the end of the encoding phase, the outputs of the different encoding threads are fused to generate the final embedding of the whole sentence together with all the image objects. In the decoding phase, the candidates for global attention are all the text hidden states over the multiple threads. For example, at time t the decoder may choose to attend to "bear" in the second thread (which sees a teddy-bear image at its beginning) as well as to "bear" in the global-image thread. At time t+1, the decoder may switch to another thread and focus on "the man" in the thread with the person image.

For implementation simplicity in batch training, we limit the number of regional objects to 4 and add one global-image thread. We use average pooling in the encoder fusion process and back-propagate accordingly.

4 Re-scoring of Translation Candidates

In neural machine translation, the easiest way to decode is to greedily pick the word with the highest probability at each step. To achieve better performance, an ensemble of models is required: translation candidates are generated from multiple models, and we aim to determine which candidate is the best one. In the following, we describe the approaches we investigated to re-score the translation candidates using monolingual and bilingual information.

4.1 Monolingual Re-scoring

The simplest way to evaluate the quality of a translation is to check whether the translated sentence is readable. A language model is an effective way to check whether a sentence fits a model trained on a large corpus: if the language model score is high, the sentence has a high probability of being generated from that corpus. We trained a single-layer LSTM with 300 hidden states to predict the next word, using the image caption datasets MSCOCO and IAPR TC-12 (56,968 sentences overall) as training data.

4.1.1 Bilingual autoencoder

A good translation should also be recognizable from the source-language sentence. We utilize a bilingual autoencoder (Ngiam et al., 2011), depicted in Figure 5, to reconstruct both the source and target languages given only one of them. The bilingual autoencoder takes a single modality as input (here, either the source or the target language) and reconstructs both modalities. We project the bilingual information into a joint space (the bottleneck layer); if the target and source sentences have similar representations, the model is able to reconstruct both sentences. Moreover, if the similarity of the bottleneck-layer values is high, it may indicate that the source sentence and the translated sentence are similar in concept, and therefore that the translation quality is better. The inputs of the autoencoder are the last LSTM encoder states trained on the monolingual image caption datasets. The dimension of the input layer is 256, with 200 for the middle layer and 128 for the joint layer.
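As a rough illustration of how the bottleneck similarity could be used as a re-scoring signal, the sketch below encodes two 256-dimensional LSTM encoder states (the source sentence and a candidate translation) into the 128-dimensional joint layer and compares them. The layer sizes are those given above; the tanh activations, the omission of biases, the cosine similarity, and all names are assumptions.

```python
import numpy as np

def init_encoder_weights(d_in=256, d_mid=200, d_joint=128, seed=0):
    """Encoder half of the bilingual autoencoder with the layer sizes from
    Section 4.1.1 (256 -> 200 -> 128 joint). The decoder half, which learns to
    reconstruct both 256-d language representations from the joint layer, is
    trained but not needed at re-scoring time, so it is omitted here."""
    rng = np.random.RandomState(seed)
    return {"W1": rng.randn(d_mid, d_in) * 0.01,
            "W2": rng.randn(d_joint, d_mid) * 0.01}

def encode_joint(x, W):
    """Map one sentence representation (a 256-d LSTM encoder state) to the joint layer."""
    return np.tanh(W["W2"] @ np.tanh(W["W1"] @ x))

def bottleneck_similarity(src_state, cand_state, W):
    """Cosine similarity of source and candidate in the joint (bottleneck) space,
    usable as one re-ranking signal for translation candidates."""
    zs, zc = encode_joint(src_state, W), encode_joint(cand_state, W)
    return float(zs @ zc / (np.linalg.norm(zs) * np.linalg.norm(zc) + 1e-8))
```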

Figure 5: Bilingual auto-encoder that reconstructs both English and German using only one of them.

4.2 Bilingual dictionary

In the WMT 16' multimodal task, captions are structured with simple grammar; therefore, considering only a language model may be insufficient to distinguish good translations. In order to directly consider whether all the concepts mentioned in the source sentence are well translated, we utilize the bilingual dictionary Glosbe, in which we use the words in one language to retrieve the corresponding words in the other language. As the re-ranking score, we directly count the number of source-language words whose target-language synonyms also appear in the translated result.

5 Experiments

5.1 Experimental Setup

In the official WMT 2016 multimodal translation task dataset (Elliott et al., 2016), there are 29,000 parallel English-German sentences for training, 1014 for validation, and 1000 for testing. Each sentence describes an image from the Flickr30k dataset (Young et al., 2014). We preprocessed all the descriptions with lowercasing, tokenization, and German compound-word splitting.

Global visual features (fc7) are extracted with VGG-19 (Simonyan and Zisserman, 2014). For regional visual features, the region proposal network in R-CNN (Girshick et al., 2014) first recognizes bounding boxes of objects in an image, and we then compute 4096-dimensional fc7 features from these regions with VGG-19. The RPN of R-CNN is pre-trained on the ImageNet dataset and then fine-tuned on the MSCOCO dataset (http://mscoco.org/) with 80 object classes.
We use a single-layer LSTM with 256 cells and a batch size of 128 for training. The dimension of the word embedding is 256. W_img is a 4096 x 256 matrix transforming visual features into the same embedding space as the words. When training the NMT, we follow (Luong et al., 2015) with similar settings: (a) we uniformly initialize all parameters between -0.1 and 0.1, (b) we train the LSTM for 20 epochs using simple SGD, (c) the learning rate is initialized to 1.0 and multiplied by 0.7 after 12 epochs, and (d) the dropout rate is 0.8. Note that the same dropout mask and NMT parameters are shared by all LSTM threads in model 3.
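For reference, the training settings above can be gathered into a small configuration sketch. The numeric values are those reported in the text; the dictionary layout, the names, and the per-epoch reading of the decay schedule are assumptions.

```python
# Training settings from Section 5.1 collected in one place (values from the text;
# structure and names are illustrative assumptions).
TRAIN_CONFIG = {
    "lstm_layers": 1,
    "lstm_cells": 256,
    "word_embedding_dim": 256,
    "w_img_shape": (4096, 256),      # projection of fc7 features into the embedding space
    "batch_size": 128,
    "epochs": 20,
    "optimizer": "sgd",
    "uniform_init_range": (-0.1, 0.1),
    "initial_lr": 1.0,
    "lr_decay": 0.7,
    "lr_decay_after_epoch": 12,
    "dropout": 0.8,
}

def learning_rate(epoch, cfg=TRAIN_CONFIG):
    """Learning rate at a 1-indexed epoch: 1.0 up to epoch 12, then multiplied by
    0.7 for each further epoch (one reading of "multiplied by 0.7 after 12 epochs")."""
    decay_steps = max(0, epoch - cfg["lr_decay_after_epoch"])
    return cfg["initial_lr"] * (cfg["lr_decay"] ** decay_steps)
```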

5.2 Results of Adding Visual Information

The quantitative performance of the proposed models is shown in Table 1. We evaluate BLEU and METEOR scores with tokenization under the official settings of the WMT 2016 multimodal machine translation challenge. The text-only baseline is the NMT implementation with global attention. Adding a single global visual feature from the image at the head of the text sequence improves BLEU by 0.6% and METEOR by 0.4%, respectively.

Table 1: BLEU and METEOR of the proposed multimodal NMT

                            BLEU         METEOR
  Text baseline             34.5 (0.7)   51.8 (0.7)
  m1: image at tail         34.8 (0.6)   51.6 (0.7)
  m1: image at head         35.1 (0.8)   52.2 (0.7)
  m2: 5 sequential RCNNs    36.2 (0.8)   53.4 (0.6)
  m3: 5 parallel RCNNs      36.5 (0.8)   54.1 (0.7)

The results show that the additional visual information improves the translations on this dataset. However, the lukewarm improvement is not as significant as we expected. One possible explanation is that the information required for the multimodal translation task is mostly self-contained in the source text: adding global features from whole images does not provide much supplementary information and thus results in a subtle improvement.

Detailed regional visual features provide extra attributes and information that may help the NMT translate better. In our experiments, the proposed model 2 with multiple regional and one global visual feature showed an improvement of 1.7% in BLEU and 1.6% in METEOR, while model 3 showed an improvement of 2.0% in BLEU and 2.3% in METEOR. The results correspond to our observation that most sentences describe important objects which can be identified by R-CNN. The most commonly mentioned object is "person"; it is likely that the additional attributes provided by the visual features about the person in an image help to encode more detailed context and thus benefit NMT decoding. Other high-frequency objects are "car", "baseball", "cellphone", etc.

For the proposed LSTM with multiple regional visual features (model 2), the semantic fc7 features of the regions of interest in an image provide additional regional visual information that forms a better sentence representation. We also experimented with other sorting methods for generating the visual sequences, including descending size, random, and categorical order; however, ascending-order sequences achieve the best result.

For the proposed parallel LSTM architecture with regional visual features (model 3), the regional visual features further help the NMT decoder to attend more accurately, focusing on the right thread where the hidden states are shaped by the local visual attributes. The best of our models achieves 36.5% in BLEU and 54.1% in METEOR, which is comparable to the state-of-the-art Moses results in this challenge.

5.3 Results of Re-Scoring

The experimental results of re-scoring are shown in Table 2.

Table 2: Results of re-scoring using the monolingual LSTM, the bilingual auto-encoder, and the dictionary, based on the multimodal NMT results.

                          BLEU         METEOR
  Original Model 3        36.5 (0.8)   54.1 (0.7)
  Language model          36.3 (0.8)   53.3 (0.6)
  Bilingual autoencoder   35.9 (0.8)   53.4 (0.7)
  Bilingual dictionary    35.7 (0.8)   55.2 (0.6)

Re-scoring with the language model and the bilingual autoencoder did not improve over the original model 3 output, which may be related to weaknesses in language modeling and the effects of unknown words. It is clear that more investigation is required to design a better bilingual autoencoder for re-scoring.

The last row shows the results using the bilingual dictionary. For each word in the source sentence and the target candidates, we retrieve the term and its translation in the other language and count the number of matches. This achieves a much larger improvement on METEOR compared to the other methods, because the quality of a caption translation depends on how many of the objects and their modifiers are correctly translated. A bad translation can still achieve fair performance without re-scoring because its sentence structure is similar to that of a good translation; for example, many sentences start with "A man", and both good and bad translations render them as sentences starting with "Ein Mann". The bilingual dictionary proves to be an effective re-scoring approach to distinguish these cases.
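The dictionary-based re-ranking discussed above amounts to counting source words whose translations are covered by a candidate. Below is a minimal sketch, assuming a lookup table that maps each source word to a set of target-language translations (e.g., retrieved from Glosbe); the function names and data layout are illustrative assumptions.

```python
def dictionary_match_score(source_words, candidate_words, bilingual_dict):
    """Count source words whose dictionary translations/synonyms appear in the
    candidate translation (the re-ranking signal of Section 4.2)."""
    candidate_set = set(candidate_words)
    matches = 0
    for w in source_words:
        translations = bilingual_dict.get(w, set())
        if translations & candidate_set:
            matches += 1
    return matches

def rerank(source_words, candidates, bilingual_dict):
    """Pick the candidate (a list of target words) with the most dictionary matches."""
    return max(candidates,
               key=lambda c: dictionary_match_score(source_words, c, bilingual_dict))
```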
