Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics

Douwe Kiela*
University of Cambridge, Computer Laboratory
douwe.kiela@cl.cam.ac.uk

Léon Bottou
Microsoft Research, New York
leon@bottou.org

* This work was carried out while Douwe Kiela was an intern at Microsoft Research, New York.

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45, October 25–29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics.

Abstract

We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network (CNN) trained on a large labeled object recognition dataset. This transfer learning approach brings a clear performance gain over features based on the traditional bag-of-visual-word approach. Experimental results are reported on the WordSim353 and MEN semantic relatedness evaluation tasks. We use visual features computed using either ImageNet or ESP Game images.

1 Introduction

Recent works have shown that multi-modal semantic representation models outperform uni-modal linguistic models on a variety of tasks, including modeling semantic relatedness and predicting compositionality (Feng and Lapata, 2010; Leong and Mihalcea, 2011; Bruni et al., 2012; Roller and Schulte im Walde, 2013; Kiela et al., 2014). These results were obtained by combining linguistic feature representations with robust visual features extracted from a set of images associated with the concept in question. This extraction of visual features usually follows the popular computer vision approach consisting of computing local features, such as SIFT features (Lowe, 1999), and aggregating them as bags of visual words (Sivic and Zisserman, 2003).

Meanwhile, deep transfer learning techniques have gained considerable attention in the computer vision community. First, a deep convolutional neural network (CNN) is trained on a large labeled dataset (Krizhevsky et al., 2012). The convolutional layers are then used as mid-level feature extractors on a variety of computer vision tasks (Oquab et al., 2014; Girshick et al., 2013; Zeiler and Fergus, 2013; Donahue et al., 2014). Although transferring convolutional network features is not a new idea (Driancourt and Bottou, 1990), the simultaneous availability of large datasets and cheap GPU co-processors has contributed to the achievement of considerable performance gains on a variety of computer vision benchmarks: "SIFT and HOG descriptors produced big performance gains a decade ago, and now deep convolutional features are providing a similar breakthrough" (Razavian et al., 2014).

This work reports on results obtained by using CNN-extracted features in multi-modal semantic representation models. These results are interesting in several respects. First, these superior features provide the opportunity to increase the performance gap achieved by augmenting linguistic features with multi-modal features. Second, this increased performance confirms that the multi-modal performance improvement results from the information contained in the images and not the information used to select which images to use to represent a concept. Third, our evaluation reveals an intriguing property of the CNN-extracted features. Finally, since we use the skip-gram approach of Mikolov et al. (2013) to generate our linguistic features, we believe that this work represents the first approach to multi-modal distributional semantics that exclusively relies on deep learning for both its linguistic and visual components.

2 Related work

2.1 Multi-Modal Distributional Semantics

Multi-modal models are motivated by parallels with human concept acquisition. Standard semantic space models extract meanings solely from linguistic data, even though we know that human semantic knowledge relies heavily on perceptual information (Louwerse, 2011). That is, there exists substantial evidence that many concepts are grounded in the perceptual system (Barsalou, 2008). One way to do this grounding in the context of distributional semantics is to obtain representations that combine information from linguistic corpora with information from another modality, obtained from e.g. property norming experiments (Silberer and Lapata, 2012; Roller and Schulte im Walde, 2013) or from processing and extracting features from images (Feng and Lapata, 2010; Leong and Mihalcea, 2011; Bruni et al., 2012). This approach has met with quite some success (Bruni et al., 2014).

2.2 Multi-modal Deep Learning

Other examples that apply multi-modal deep learning use restricted Boltzmann machines (Srivastava and Salakhutdinov, 2012; Feng et al., 2013), auto-encoders (Wu et al., 2013) or recursive neural networks (Socher et al., 2014). Multi-modal models with deep learning components have also successfully been employed in cross-modal tasks (Lazaridou et al., 2014). Work that is closely related in spirit to ours is by Silberer and Lapata (2014). They use a stacked auto-encoder to learn combined embeddings of textual and visual input. Their visual inputs consist of vectors of visual attributes obtained from learning SVM classifiers on attribute prediction tasks. In contrast, our work keeps the modalities separate and follows the standard multi-modal approach of concatenating linguistic and visual representations in a single semantic space model. This has the advantage that it allows for separate data sources for the individual modalities. We also learn visual representations directly from the images (i.e., we apply deep learning directly to the images), as opposed to taking a higher-level representation as a starting point. Frome et al. (2013) jointly learn multi-modal representations as well, but apply them to a visual object recognition task instead of concept meaning.

2.3 Deep Convolutional Neural Networks

A flurry of recent results indicates that image descriptors extracted from deep convolutional neural networks (CNNs) are very powerful and consistently outperform highly tuned state-of-the-art systems on a variety of visual recognition tasks (Razavian et al., 2014). Embeddings from state-of-the-art CNNs (such as Krizhevsky et al. (2012)) have been applied successfully to a number of problems in computer vision (Girshick et al., 2013; Zeiler and Fergus, 2013; Donahue et al., 2014). This contribution follows the approach described by Oquab et al. (2014): they train a CNN on 1512 ImageNet synsets (Deng et al., 2009), use the first seven layers of the trained network as feature extractors on the Pascal VOC dataset, and achieve state-of-the-art performance on the Pascal VOC classification task.

3 Improving Multi-Modal Representations

Figure 1 illustrates how our system computes multi-modal semantic representations.

3.1 Perceptual Representations

The perceptual component of standard multi-modal models that rely on visual data is often an instance of the bag-of-visual-words (BOVW) representation (Sivic and Zisserman, 2003). This approach takes a collection of images associated with words or tags representing the concept in question. For each image, keypoints are laid out as a dense grid. Each keypoint is represented by a vector of robust local visual features such as SIFT (Lowe, 1999), SURF (Bay et al., 2008) and HOG (Dalal and Triggs, 2005), as well as pyramidal variants of these descriptors such as PHOW (Bosch et al., 2007). These descriptors are subsequently clustered into a discrete set of "visual words" using a standard clustering algorithm like k-means and quantized into vector representations by comparing the local descriptors with the cluster centroids. Visual representations are obtained by taking the average of the BOVW vectors for the images that correspond to a given word. We use BOVW as a baseline.
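As an illustrative sketch of this baseline pipeline (a reconstruction from the description above, not the original implementation), the snippet below assumes dense local descriptors (e.g. DSIFT vectors) have already been computed for every image, and uses scikit-learn's MiniBatchKMeans for the visual-word vocabulary (the paper later reports using mini-batch k-means with 100 clusters; see section 4.3).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_visual_vocabulary(all_descriptors, n_words=100, seed=0):
    """Cluster local descriptors (e.g. DSIFT) into a vocabulary of visual words."""
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=seed)
    kmeans.fit(all_descriptors)      # all_descriptors: array of shape (num_descriptors, dim)
    return kmeans

def bovw_histogram(image_descriptors, kmeans):
    """Quantize one image's descriptors against the cluster centroids."""
    words = kmeans.predict(image_descriptors)     # nearest centroid for each keypoint
    # Count how often each visual word occurs in this image.
    return np.bincount(words, minlength=kmeans.n_clusters).astype(float)

def concept_bovw_vector(per_image_descriptors, kmeans):
    """Average the BOVW vectors of all images associated with a given word."""
    hists = [bovw_histogram(d, kmeans) for d in per_image_descriptors]
    return np.mean(hists, axis=0)
```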
Our approach similarly makes use of a collection of images associated with words or tags representing a particular concept. Each image is processed by the first seven layers of the convolutional network defined by Krizhevsky et al. (2012) and adapted by Oquab et al. (2014). This network takes 224×224 pixel RGB images and applies five successive convolutional layers followed by three fully connected layers. Its eighth and last layer produces a vector of 1512 scores associated with the 1000 categories of the ILSVRC-2012 challenge and the 512 additional categories selected by Oquab et al. (2014). This network was trained using about 1.6 million ImageNet images associated with these 1512 categories. We then freeze the trained parameters, chop the last network layer, and use the remaining seventh layer as a filter to compute a 6144-dimensional feature vector on arbitrary 224×224 input images.

We consider two ways to aggregate the feature vectors representing each image (a sketch of both schemes follows below).

1. The first method (CNN-Mean) simply computes the average of all feature vectors.

2. The second method (CNN-Max) computes the component-wise maximum of all feature vectors. This approach makes sense because the feature vectors extracted from this particular network are quite sparse (about 22% non-zero coefficients) and can be interpreted as bags of visual properties.
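As a minimal illustration of the two aggregation schemes (again a sketch rather than the original code), suppose the per-image CNN feature vectors for a concept are stacked into a matrix of shape (num_images, 6144):

```python
import numpy as np

def aggregate_cnn_features(features, method="mean"):
    """Aggregate per-image CNN feature vectors into one concept vector.

    features: array of shape (num_images, 6144), one row per image.
    """
    if method == "mean":       # CNN-Mean: average of all feature vectors
        return features.mean(axis=0)
    elif method == "max":      # CNN-Max: component-wise maximum
        return features.max(axis=0)
    raise ValueError("method must be 'mean' or 'max'")
```

Because the extracted feature vectors are sparse, taking the component-wise maximum behaves roughly like a union of the visual properties detected across a concept's images, which matches the "bag of visual properties" intuition above.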

Figure 1: Computing word feature vectors (training visual features after Oquab et al., 2014; training linguistic features after Mikolov et al., 2013).

3.2 Linguistic Representations

For our linguistic representations we extract 100-dimensional continuous vector representations using the log-linear skip-gram model of Mikolov et al. (2013) trained on a corpus consisting of the 400M word Text8 corpus of Wikipedia text (http://mattmahoney.net/dc/textdata.html) together with the 100M word British National Corpus (Leech et al., 1994). We also experimented with dependency-based skip-grams (Levy and Goldberg, 2014) but this did not improve results. The skip-gram model learns high quality semantic representations based on the distributional properties of words in text, and outperforms standard distributional models on a variety of semantic similarity and relatedness tasks. However we note that Bruni et al. (2014) have recently reported an even better performance for their linguistic component using a standard distributional model, although this may have been tuned to the task.

3.3 Multi-modal Representations

Following Bruni et al. (2014), we construct multi-modal semantic representations by concatenating the centered and L2-normalized linguistic and perceptual feature vectors $v_{\text{ling}}$ and $v_{\text{vis}}$:

$v_{\text{concept}} = \alpha\, v_{\text{ling}} \,\|\, (1 - \alpha)\, v_{\text{vis}}$    (1)

where $\|$ denotes the concatenation operator and $\alpha$ is an optional tuning parameter.
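Equation (1) is easy to make concrete. The snippet below is an illustrative sketch of centering, L2-normalizing and concatenating the two modality vectors; note that the paper does not spell out whether centering is applied per vector or across the whole matrix, so per-vector centering here is an assumption.

```python
import numpy as np

def center_and_normalize(v):
    """Center a vector and scale it to unit L2 norm (one reading of the paper's wording)."""
    v = v - v.mean()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def multimodal_vector(v_ling, v_vis, alpha=0.5):
    """Equation (1): v_concept = alpha * v_ling || (1 - alpha) * v_vis."""
    v_ling = center_and_normalize(np.asarray(v_ling, dtype=float))
    v_vis = center_and_normalize(np.asarray(v_vis, dtype=float))
    return np.concatenate([alpha * v_ling, (1.0 - alpha) * v_vis])
```

With alpha = 0.5 this reduces to a plain concatenation of equally weighted modality vectors.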

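For the linguistic side (section 3.2), comparable 100-dimensional skip-gram vectors can be trained with off-the-shelf tooling. The gensim call below is an assumption for illustration only, not the toolchain used in the paper, and it omits the British National Corpus portion of the training data.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Assumes the Text8 file from http://mattmahoney.net/dc/textdata.html has been
# downloaded and unzipped locally as "text8".
sentences = Text8Corpus("text8")

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # 100-dimensional word projections, as in the paper
    sg=1,              # skip-gram rather than CBOW
    window=5,          # context window size: an assumption, not stated in the paper
    min_count=5,
    workers=4,
)
v_ling = model.wv["dog"]   # linguistic representation for a concept word
```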
Figure 2: Examples of dog in the ESP Game dataset.

Figure 3: Examples of golden retriever in ImageNet.

4 Experimental Setup

We carried out experiments using visual representations computed using two canonical image datasets. The resulting multi-modal concept representations were evaluated using two well-known semantic relatedness datasets.

4.1 Visual Data

We carried out experiments using two distinct sources of images to compute the visual representations.

The ImageNet dataset (Deng et al., 2009) is a large-scale ontology of images organized according to the hierarchy of WordNet (Fellbaum, 1999). The dataset was constructed by manually re-labelling candidate images collected using web searches for each WordNet synset. The images tend to be of high quality with the designated object roughly centered in the image. Our copy of ImageNet contains about 12.5 million images organized in 22K synsets. This implies that ImageNet covers only a small fraction of the existing 117K WordNet synsets.

The ESP Game dataset (Von Ahn and Dabbish, 2004) was famously collected as a "game with a purpose", in which two players must independently and rapidly agree on a correct word label for randomly selected images. Once a word label has been used sufficiently frequently for a given image, that word is added to the image's tags. This dataset contains 100K images, but with every image having on average 14 tags, that amounts to a coverage of 20,515 words. Since players are encouraged to produce as many terms per image as possible, the dataset's increased coverage comes at the expense of accuracy in the word-to-image mapping: a dog in a field with a house in the background might be a golden retriever in ImageNet and could have the tags dog, golden retriever, grass, field, house, door in the ESP dataset. In other words, images in the ESP dataset do not make a distinction between objects in the foreground and in the background, or between the relative size of the objects (tags for images are provided in a random order, so the top tag is not necessarily the best one).

Figures 2 and 3 show typical examples of images belonging to these datasets. Both datasets have attractive properties. On the one hand, ImageNet has higher quality images with better labels. On the other hand, the ESP dataset has an interesting coverage because the MEN task (see section 4.4) was specifically designed to be covered by the ESP dataset.

4.2 Image Selection

Since ImageNet follows the WordNet hierarchy, we would have to include almost all images in the dataset to obtain representations for high-level concepts such as entity, object and animal. Doing so is both computationally expensive and unlikely to improve the results. For this reason, we randomly sample up to N distinct images from the subtree associated with each concept. When this returns fewer than N images, we attempt to increase coverage by sampling images from the subtree of the concept's hypernym instead. In order to allow for a fair comparison, we apply the same method of sampling up to N images on the ESP Game dataset. In all following experiments, N = 1,000. We used the WordNet lemmatizer from NLTK (Bird et al., 2009) to lemmatize tags and concept words so as to further improve the dataset's coverage.

4.3 Image Processing

The ImageNet images were preprocessed as described by Krizhevsky et al. (2012). The largest centered square contained in each image is resampled to form a 256×256 image. The CNN input is then formed by cropping 16 pixels off each border and subtracting 128 from the image components. The ESP Game images were preprocessed slightly differently because we do not expect the objects to be centered. Each image was rescaled to fit inside a 224×224 rectangle. The CNN input is then formed by centering this image in the 224×224 input field, subtracting 128 from the image components, and zero padding. A sketch of both preprocessing variants is given below.

The BOVW features were obtained by computing DSIFT descriptors using VLFeat (Vedaldi and Fulkerson, 2008). These descriptors were subsequently clustered using mini-batch k-means (Sculley, 2010) with 100 clusters. Each image is then represented by a bag of clusters (visual words) quantized as a 100-dimensional feature vector. These vectors were then combined into visual concept representations by taking their mean.
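The following sketch reproduces the two preprocessing variants with Pillow and NumPy. It is an illustration written from the description above, not the original pipeline, so details such as rounding are assumptions.

```python
import numpy as np
from PIL import Image

def preprocess_imagenet(path):
    """Largest centered square -> 256x256 -> crop 16 px per border -> subtract 128."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((256, 256))
    arr = np.asarray(img, dtype=np.float32)
    arr = arr[16:240, 16:240, :]          # 224x224 central crop
    return arr - 128.0

def preprocess_esp(path):
    """Rescale to fit inside 224x224, center in the input field, subtract 128, zero pad."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 224.0 / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    img = img.resize((new_w, new_h))
    arr = np.asarray(img, dtype=np.float32) - 128.0
    canvas = np.zeros((224, 224, 3), dtype=np.float32)   # zero padding
    top, left = (224 - new_h) // 2, (224 - new_w) // 2
    canvas[top:top + new_h, left:left + new_w, :] = arr
    return canvas
```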

4.4 Evaluation

We evaluate our multi-modal word representations using two semantic relatedness datasets widely used in distributional semantics (Agirre et al., 2009; Feng and Lapata, 2010; Bruni et al., 2012; Kiela and Clark, 2014; Bruni et al., 2014).

WordSim353 (Finkelstein et al., 2001) is a selection of 353 concept pairs with a similarity rating provided by human annotators. Since this is probably the most widely used evaluation dataset for distributional semantics, we include it for comparison with other approaches. WordSim353 has some known idiosyncrasies: it includes named entities, such as OPEC, Arafat, and Maradona, as well as abstract words, such as antecedent and credibility, for which it may be hard to find corresponding images. Multi-modal representations are often evaluated on an unspecified subset of WordSim353 (Feng and Lapata, 2010; Bruni et al., 2012; Bruni et al., 2014), making it impossible to compare the reported scores. In this work, we report scores on the full WordSim353 dataset (W353) by setting the visual vector $v_{\text{vis}}$ to zero for concepts without images. We also report scores on the subset (W353-Relevant) of pairs for which both concepts have both ImageNet and ESP Game images using the aforementioned selection procedure.

MEN (Bruni et al., 2012) was in part designed to alleviate the WordSim353 problems. It was constructed in such a way that only frequent words with at least 50 images in the ESP Game dataset were included in the evaluation pairs. The MEN dataset has been found to mirror the aggregate score over a variety of tasks and similarity datasets (Kiela and Clark, 2014). It is also much larger, with 3000 word pairs consisting of 751 individual words. Although MEN was constructed so as to have at least a minimum amount of images available in the ESP Game dataset for each concept, this is not the case for ImageNet. Hence, similarly to WordSim353, we also evaluate on a subset (MEN-Relevant) for which images are available in both datasets.

We evaluate the models in terms of their Spearman ρ correlation with the human relatedness ratings. The similarity between the representations associated with a pair of words is calculated using the cosine similarity:

$\cos(v_1, v_2) = \dfrac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|}$    (2)

5 Results

We evaluate on the two semantic relatedness datasets using solely linguistic, solely visual and multi-modal representations. In the case of MEN-Relevant and W353-Relevant, we report scores for BOVW, CNN-Mean and CNN-Max visual representations. For all datasets we report the scores obtained by BOVW, CNN-Mean and CNN-Max multi-modal representations. Since we have full coverage with the ESP Game dataset on MEN, we are able to report visual representation scores for the entire dataset as well. The results can be seen in Table 1.

There are a number of questions to ask. First of all, do CNNs yield better visual representations? Second, do CNNs yield better multi-modal representations? And third, is there a difference between the high-quality low-coverage ImageNet and the low-quality higher-coverage ESP Game dataset representations?

5.1 Visual Representations

In all cases, CNN-generated visual representations perform better than or as well as BOVW representations (we report results for BOVW-Mean, which performs slightly better than taking the element-wise maximum). This confirms the motivation outlined in the introduction: by applying state-of-the-art approaches from computer vision to multi-modal semantics, we obtain a significant performance increase over standard multi-modal models.
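As a brief sketch of this evaluation protocol (the helper data structures here, such as the word-to-vector dictionary, are illustrative assumptions): model scores are cosine similarities between concept vectors, compared to the human ratings with Spearman's ρ.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(v1, v2):
    """Equation (2): cosine similarity between two concept representations."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def evaluate(pairs, human_ratings, vectors):
    """Spearman correlation between model similarities and human relatedness ratings.

    pairs: list of (word1, word2); human_ratings: list of floats (one per pair);
    vectors: dict mapping each word to its (multi-modal) representation.
    """
    model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_ratings)
    return rho
```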

Table 1: Results (see sections 4 and 5). Spearman ρ for the linguistic, visual and multi-modal representations; visual and multi-modal scores are given for BOVW, CNN-Mean and CNN-Max features.

                              Linguistic   Visual                          Multi-modal
    Dataset                                BOVW   CNN-Mean   CNN-Max       BOVW   CNN-Mean   CNN-Max
    ImageNet visual features
      W353-Relevant           0.51         0.30   0.32       0.30          0.55   0.56       0.57
    ESP game visual features
      W353-Relevant           0.51         0.38   0.44       0.56          0.52   0.55       0.61

5.2 Multi-modal Representations

Higher-quality perceptual input leads to better performing multi-modal representations. In all cases multi-modal models with CNNs outperform multi-modal models with BOVW, occasionally by quite a margin. In all cases, multi-modal representations outperform purely linguistic vectors that were obtained using a state-of-the-art system. This re-affirms the importance of multi-modal representations for distributional semantics.

5.3 Image Datasets

It is important to ask whether the source image dataset has a large impact on performance. Although the scores for the visual representation in some cases differ, the performance of multi-modal representations remains close for both image datasets. This implies that our method is robust over different datasets. It also suggests that it is beneficial to train the network on one dataset and apply it to another: computing features on the ESP game images increases the performance boost without changing the association between word labels.

