Convolutional Neural Network For Sentence Classification


Convolutional Neural Network for Sentence Classification

by

Yahui Chen

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science

Waterloo, Ontario, Canada, 2015

© Yahui Chen 2015

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

The goal of a Knowledge Base–supported Question Answering (KB-supported QA) system is to answer a natural language query by obtaining the answer from a knowledge database, which stores knowledge in the form of (entity, relation, value) triples. QA systems understand questions by extracting entity and relation pairs. This thesis aims at recognizing the relation candidates inside a question. We define a multi-label classification problem for this challenging task. Based on the word2vec representation of words, we propose two convolutional neural networks (CNNs) to solve the multi-label classification problem, namely Parallel CNN and Deep CNN. Parallel CNN contains four parallel convolutional layers, while Deep CNN contains two serial convolutional layers. The convolutional layers of both models capture local semantic features. A max-over-time pooling layer is placed on top of the last convolutional layer to select global semantic features. Fully connected layers with dropout are used to summarize the features. Our experiments show that these two models outperform the traditional Support Vector Classification (SVC)–based method by a large margin. Furthermore, we observe that Deep CNN performs better than Parallel CNN, indicating that the deep structure enables much stronger semantic learning capacity than the wide but shallow network.

Acknowledgements

I would like to thank all the people who made this possible. First and foremost, I want to express profound gratitude to my supervisor Dr. Ming Li for his support. His invaluable, detailed advice on the project and thesis encouraged me greatly. Second, I would like to thank Dr. Pascal Poupart, who provided insight and expertise that assisted my thesis. Third, I would like to thank Dr. Chrysanne Di Marco and Dr. Khuzaima Daudjee for reading my thesis and providing valuable advice. Finally, my sincere thanks go to my father and all my friends for their encouragement and support.

Dedication

This is dedicated to my family.

Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 Background and Motivation
  1.2 Problem Definition
  1.3 Contributions
  1.4 Thesis Organization

2 Background
  2.1 Deep Neural Network
    2.1.1 Convolutional Neural Network
    2.1.2 Recurrent Neural Network
    2.1.3 Recursive Neural Network
  2.2 Motivation and History
  2.3 Basic Assumption
  2.4 Review of Discrete Convolution Definition
  2.5 Volumes of Neurons
  2.6 Architecture
    2.6.1 Convolutional Layer
    2.6.2 Pooling Layer
    2.6.3 Fully Connected Layer
    2.6.4 Dropout
    2.6.5 Activation Function and Cost Function
    2.6.6 Common CNN Architectures

3 Related Work
  3.1 Single-Convolutional-Layer CNNs
  3.2 Multi-Convolutional-Layer CNNs

4 Dataset and Environment
  4.1 Dataset
    4.1.1 Data Format
    4.1.2 Answerable Queries Coverage
  4.2 Tool and Environment
    4.2.1 word2vec
    4.2.2 NER
    4.2.3 CUDA
    4.2.4 Python Libraries
    4.2.5 Spelling Corrector
    4.2.6 Experiment Environment

5 Main Results
  5.1 Methodologies
    5.1.1 Data Processing
    5.1.2 Multi-Label Classification
    5.1.3 Models
    5.1.4 Evaluation Metrics
  5.2 Experiments and Results
    5.2.1 Baseline versus CNNs
    5.2.2 Parallel versus Deep CNN
    5.2.3 Further Observations

6 Conclusions
  6.1 Summary
  6.2 Future Work

APPENDICES

A Python Implementation for Building CNNs
  A.1 Deep CNN
  A.2 Parallel CNN

References

List of Tables

5.1 Three Datasets. Each subset is chosen from the whole dataset according to the minimum number of sentences per class. For example, for subset 1, only classes which have more than 8000 sample sentences are chosen.
5.2 Multi-Label Task for Relation Classification Example.
5.3 Parameters for CNNs.
5.4 F1 Scores of Different Models. SVC stands for a linear-kernel Support Vector Classifier.
5.5 Samples ranked by descending scores of neurons in the 2nd convolutional layer. This result is trained on dataset 1 with Deep CNN. The 137th neuron in the 2nd convolutional layer learns features of phrases like "the population of". In the table, padding symbols (e.g., ˆ) stand for blank words before and after the sentence.

List of Figures

1.1 KB-supported QA.
2.1 Convolutional Neural Network. Neurons in a CNN are locally connected with neurons in the previous layer. Weights of the same filter are shared across the same layer.
2.2 Recurrent Neural Network. An RNN takes an input sequence. Weights of hidden units are updated according to the current input and the previous hidden state at each time step. Outputs of the RNN are calculated from the current hidden state.
2.3 Error Surface of a Single Hidden Unit RNN [41].
2.4 Recursive Neural Network.
2.5 Sparse Connectivity.
2.6 Shared Weights. Connections with the same colour share weights.
2.7 Convolutional Layer [26].
2.8 Pooling Layer.
2.9 Dropout.
2.10 Activation Function Applied to a Neuron.
2.11 Activation Functions. This figure shows the sigmoid, tanh, and ReLU functions. As seen from the figure, sigmoid's output range is [0, 1], tanh's output range is [−1, 1], and ReLU's output range is [0, +∞).
3.1 Neural Network for Relation Classification and Framework for Extracting Sentence-Level Features [55]. In the right-hand figure, WF stands for word features and PF stands for position features.
3.2 CNN Model [44].
3.3 CNN Model [51].
3.4 CNN Model for Several Sentence Classification Tasks [27].
3.5 ARC-II Model [22].
3.6 DCNN Model for Modeling Sentences [25].
4.1 The Curve of Coverage of Answerable Queries versus Number of Relations.
5.1 Sentence Space. Depth is one, width is the number of words in a sentence, and height is three hundred, the dimension of word2vec.
5.2 Padding Zeros.
5.3 Deep CNN.
5.4 Parallel CNN.
5.5 Precision–Recall Curves.
5.6 Receiver Operating Characteristic Curves.

Chapter 1

Introduction

1.1 Background and Motivation

Figure 1.1: KB-supported QA.

Natural language processing (NLP) is a focus of artificial intelligence research and has many applications: machine translation, named entity recognition (NER), question answering (QA), etc. The purpose of a QA system is to automatically answer questions posed by humans in a natural language. A Knowledge Base–supported (KB-supported) QA system obtains answers by querying a structured knowledge database, which stores tuples of (entity, relation, value), and generating answers in natural language according to the query result, as shown in Figure 1.1. For example, for the question "Who's the president of the United States?" the entity is "USA", the relation is "be–president–of", and the value is "Barack Obama". Understanding human questions, especially extracting the entity and relation candidates, is the first and vital step toward implementing the whole system. Many traditional methods depend on keyword or template matching, but they rely heavily on hand-crafted rules, which cannot be scaled up. To reduce the human labour of constructing the keywords or templates, some recent machine learning algorithms have been proposed to automatically learn features and measure semantic distances between queries and known domains.

1.2 Problem Definition

This thesis defines a multi-label classification problem for extracting the relation candidates from a question. We target a widely used question dataset [15], which is crawled from WikiAnswer and consists of a set of questions with over 19K relations. We assume that these open-domain questions have only first-order relations, which we call single-relation questions; for example, "Who's the president of the United States?" has a first-order relation, but "Who's the wife of the United States' president?" has a second-order relation. Single-relation questions are the most commonly observed ones on QA sites [16]. However, since human expressions or understanding can be ambiguous, each question may have several relation candidates; for example, "What is the primary duty of judicial branch?" has relation candidates "be–primary–responsibility–of", "be–primary–role–of", and "have–role–of". Thus we address the problem of recognizing the relations inside a question in a multi-label manner.

1.3 Contributions

We explore various deep learning models to solve the proposed multi-label recognition problem. As the first step, we exploit the widely used word2vec [33] [35] [36] to represent each word as a 300-dimensional vector, and the whole sentence as a matrix obtained by stacking all the word vectors. word2vec converts the semantic relations between words into distances between their vectors; for example, word2vec('Paris') − word2vec('France') + word2vec('China') ≈ word2vec('Beijing').

Based on the matrix representation of each sentence, we propose two kinds of convolutional neural networks (CNNs): Parallel CNN and Deep CNN. Convolutional layers of both networks can learn phrases, such as "where do . . . live" and "the population of". Parallel CNN is a shallow network but has multiple parallel convolutional layers. Deep CNN, on the contrary, has multiple serial convolutional layers.
Our experiments show that both Parallel and Deep CNN outperform the traditional Support Vector Classification (SVC)–based method by a large margin. Furthermore, we observe that Deep CNN has better performance than Parallel CNN, indicating that the deep structure enables much stronger semantic learning capacity than the wide but shallow network.
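The multi-label formulation of Section 1.2 can be made concrete with a small sketch: each question maps to a binary indicator vector with one bit per relation class. The inventory below is a hypothetical four-relation subset drawn from the examples in the text; the real dataset has over 19K relations.

```python
# Sketch of the multi-label target encoding. The relation inventory here is a
# small hypothetical subset for illustration only.
RELATIONS = [
    "be-president-of",
    "be-primary-responsibility-of",
    "be-primary-role-of",
    "have-role-of",
]

def encode_labels(relation_candidates, inventory=RELATIONS):
    """Return a binary indicator vector: 1 where the relation applies."""
    return [1 if rel in relation_candidates else 0 for rel in inventory]

# "What is the primary duty of judicial branch?" has three candidates:
y = encode_labels({"be-primary-responsibility-of",
                   "be-primary-role-of",
                   "have-role-of"})
print(y)  # -> [0, 1, 1, 1]
```

A classifier for this task then outputs one score per relation class, and several bits may be on at once, unlike in single-label classification.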

1.4 Thesis Organization

Chapter 2 mainly introduces background knowledge of the structures and components of CNNs. Chapter 3 introduces recent research on CNNs for NLP tasks. Chapter 4 presents the dataset, tools, and environment used in this work. Chapter 5 describes our method and shows experimental results. Chapter 6 concludes this thesis.

Chapter 2

Background

2.1 Deep Neural Network

Deep learning has shown powerful feature learning skills and achieved remarkable performance in computer vision (CV) [8] [43], speech recognition [11] [21], and natural language processing (NLP) [9]. A deep neural network is a kind of deep learning method. The difference between a deep neural network (DNN) and a shallow artificial neural network (ANN) is that the former contains multiple hidden layers, so it can learn more complex features. It has several variants: convolutional neural networks, recurrent neural networks, and recursive neural networks. DNNs have a forward pass and back propagation. The parameters of the network are updated according to the learning rate and cost function via stochastic gradient descent during back propagation. In the following, we briefly introduce the structures of different DNNs applied in NLP tasks.

2.1.1 Convolutional Neural Network

Convolutional neural networks (CNNs) learn local features and assume that these features are not restricted by their absolute positions. In the field of NLP, they are applied to Part-Of-Speech Tagging (POS), Named Entity Recognition (NER) [9], etc.

Figure 2.1 shows a two-layer CNN. For the green node, h0 = f(W · [x0, x1, x2]^T + b)

Figure 2.1: Convolutional Neural Network. Neurons in a CNN are locally connected with neurons in the previous layer. Weights of the same filter are shared across the same layer.

That is, h0 = f(w0 x0 + w1 x1 + w2 x2 + b); similarly, for the green node h1, h1 = f(W · [x1, x2, x3]^T + b) = f(w1 x1 + w2 x2 + w3 x3 + b). W is shared by the same filter in the same layer.

2.1.2 Recurrent Neural Network

A limitation of convolutional neural networks is that they take fixed-size inputs and produce fixed-size outputs. Recurrent neural networks (RNNs) can operate over sequential input and predict sequential output. They can do one-to-one, one-to-many, many-to-one, and many-to-many jobs, and can be used in machine translation [34] and other NLP tasks.

Figure 2.2 shows a simple recurrent neural network with three layers: input layer x, hidden layer h, and output layer y. Horizontal arrows stand for time steps. The input sequence is x1, x2, . . . , xT. For each time step t, ht = f(Whh h(t−1) + Whx xt) and yt = g(Why ht), where Whh, Whx, and Why are parameters shared across the time sequence. The hidden layer's state is influenced by all the previous inputs. An RNN can also have a bidirectional structure to incorporate both forward and backward inputs. RNNs suffer from a vanishing or exploding gradient problem, as shown in Figure 2.3; initializing the weight matrix to the identity matrix [46] and using the ReLU activation function [29] address this problem to a certain degree.

Figure 2.2: Recurrent Neural Network. An RNN takes an input sequence. Weights of hidden units are updated according to the current input and the previous hidden state at each time step. Outputs of the RNN are calculated from the current hidden state.

Figure 2.3: Error Surface of a Single Hidden Unit RNN [41].
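The recurrence ht = f(Whh h(t−1) + Whx xt), yt = g(Why ht) can be sketched in a few lines of plain Python. This is a toy sketch with made-up dimensions and weights, with tanh standing in for f and a linear readout for g; it is not the thesis implementation.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, Whh, Whx, Why):
    """Unroll h_t = tanh(Whh h_{t-1} + Whx x_t), y_t = Why h_t.
    Whh, Whx, and Why are shared across every time step, as in Figure 2.2."""
    h = [0.0] * len(Whh)              # initial hidden state h_0 = 0
    ys = []
    for x in xs:                      # one step per input vector
        pre = [a + b for a, b in zip(matvec(Whh, h), matvec(Whx, x))]
        h = [math.tanh(v) for v in pre]
        ys.append(matvec(Why, h))
    return ys, h

# Toy sizes: 1-dim inputs, 2-dim hidden state, 1-dim outputs.
Whh = [[0.5, 0.0], [0.0, 0.5]]
Whx = [[1.0], [0.5]]
Why = [[1.0, 1.0]]
ys, h = rnn_forward([[1.0], [0.0], [1.0]], Whh, Whx, Why)
print(len(ys))  # -> 3, one output per time step
```

Because h is carried from step to step, the output at time t depends on the whole prefix x1, . . . , xt, which is exactly the property the fixed-window convolution lacks.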

2.1.3 Recursive Neural Network

Recursive neural networks have been applied to multiple NLP tasks, such as sentence classification [47].

Figure 2.4: Recursive Neural Network.

Figure 2.4 shows a simple recursive neural network. Each node takes two children as inputs: h = f(W x + b) and y = U^T h. For example, for the green node, the parent of x0 and x1, h01 = f(W · [x0; x1] + b), and for the purple node, the parent of h01 and x2, h012 = f(W · [h01; x2] + b). In CNNs, weights are shared within the same filter, while in recursive neural networks, weights are shared across different layers. Recursive neural networks have different composition functions: Matrix–Vector RNNs, Recursive Neural Tensor Networks, Tree-LSTMs, etc.

Recursive neural networks require parsers to get the semantic structures of sentences. Recurrent neural networks are good at learning time-sequential features. Convolutional neural networks perform well in classification and are used as the models for the task described in this thesis.
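The bottom-up composition just described, where each parent is h = f(W · [left; right] + b) with the same W at every level, can be sketched as follows. The tree shape, dimensions, and weights below are toy assumptions for illustration, with tanh standing in for f.

```python
import math

def compose(left, right, W, b):
    """Parent vector h = tanh(W [left; right] + b): the two children are
    concatenated and fed through the shared composition matrix W."""
    child = left + right               # concatenation [left; right]
    return [math.tanh(sum(w * c for w, c in zip(row, child)) + bi)
            for row, bi in zip(W, b)]

# Toy 2-dim word vectors; W is 2 x 4 so parents keep the word-vector size,
# which lets the same W be reused at every level of the tree.
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
b = [0.0, 0.0]
x0, x1, x2 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

h01 = compose(x0, x1, W, b)    # parent of x0 and x1 (the "green node")
h012 = compose(h01, x2, W, b)  # parent of h01 and x2 (the "purple node")
print(len(h012))  # -> 2, same dimension as the leaves
```

The key design choice is that every parent has the same dimension as a word vector, so the root vector h012 can summarize an arbitrarily shaped parse tree.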

2.2 Motivation and History

Convolutional neural networks are inspired by the cat's visual cortex. The visual cortex contains a complex arrangement of cells. These cells are responsible for detecting small sub-fields of the visual field, called receptive fields. The sub-fields are tiled to cover the whole visual field. These cells act as local filters over the input space and are well suited to exploit the strong spatially local correlation present in natural images.

The Neocognitron was introduced by Fukushima in 1980 [18] and improved in 1998 by LeCun, Bottou, Bengio, and Haffner [30], who proposed the famous LeNet-5, a convolutional neural network. It was then generalized by Behnke [6] and simplified by Simard and his collaborators in 2003 [45]. Convolutional neural networks perform well on problems such as recognizing handwritten numbers, but the computational power available at the time limited their ability to solve more complex problems until the rise of efficient GPU computing.

2.3 Basic Assumption

The convolutional layer is based on the assumption that features are learned regardless of their absolute positions. This is reasonable in many cases; for example, in image learning, if detecting a horizontal edge is important at some location in the image, it should also be useful at other locations.

Convolutional layers focus on learning local features. In natural language processing, if in the sentence "give me an example of thank you letter" the phrase "example of" has been learned as a feature, then it should also be recognized in the sentence "what is an example of scientific hypothesis", even though "example of" may not have any relation to "thank you letter" or "scientific hypothesis". Similarly, in audio recognition, features of time spans of audio clips are learned instead of features of the whole input audio.

2.4 Review of Discrete Convolution Definition

Recall the definition of convolution for a 1D signal [12]. The discrete convolution of f and g is given by:

o[n] = f[n] * g[n] = Σ_{u=−∞}^{∞} f[u] g[n − u] = Σ_{u=−∞}^{∞} f[n − u] g[u].   (2.1)
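Equation (2.1) sums over all integers u; for finite signals, terms outside the support of f or g vanish. A short pure-Python sketch (illustrative, not the thesis code) makes the definition concrete:

```python
def conv1d(f, g):
    """Discrete convolution o[n] = sum_u f[u] g[n-u] for finite signals.
    The output has length len(f) + len(g) - 1 (the "full" convolution)."""
    n_out = len(f) + len(g) - 1
    out = [0.0] * n_out
    for n in range(n_out):
        for u in range(len(f)):
            if 0 <= n - u < len(g):   # terms outside g's support are zero
                out[n] += f[u] * g[n - u]
    return out

# Convolving with the unit impulse [1] returns f unchanged, and
# [1, 1] * [1, 1] = [1, 2, 1] (the binomial pattern).
print(conv1d([3.0, 1.0, 2.0], [1.0]))  # -> [3.0, 1.0, 2.0]
print(conv1d([1.0, 1.0], [1.0, 1.0]))  # -> [1.0, 2.0, 1.0]
```

In a CNN the roles are: f is the input (e.g., a sequence of word-vector components) and g is a learned filter, so each output position is a weighted sum over a local window.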

This can be extended to 2D as follows:

o[m, n] = f[m, n] * g[m, n] = Σ_u Σ_v f[u, v] g[m − u, n − v].
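The 2D form can be checked the same way: a direct double-sum sketch under the same finite-support convention as the 1D example (again an illustration, not an efficient implementation).

```python
def conv2d(f, g):
    """o[m][n] = sum_u sum_v f[u][v] g[m-u][n-v] for finite 2D signals."""
    M = len(f) + len(g) - 1
    N = len(f[0]) + len(g[0]) - 1
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for u in range(len(f)):
                for v in range(len(f[0])):
                    if 0 <= m - u < len(g) and 0 <= n - v < len(g[0]):
                        out[m][n] += f[u][v] * g[m - u][n - v]
    return out

# Convolving with the 1x1 unit impulse leaves f unchanged.
f = [[1.0, 2.0],
     [3.0, 4.0]]
print(conv2d(f, [[1.0]]))  # -> [[1.0, 2.0], [3.0, 4.0]]
```

This is the operation a convolutional layer applies at every position of its input volume, with g being a learned filter rather than a fixed kernel.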

