A Fast And Accurate Dependency Parser Using Neural Networks


Danqi Chen
Computer Science Department
Stanford University
danqi@cs.stanford.edu

Christopher D. Manning
Computer Science Department
Stanford University
manning@stanford.edu

Abstract

Almost all current dependency parsers classify based on millions of sparse indicator features. Not only do these features generalize poorly, but the cost of feature computation restricts parsing speed significantly. In this work, we propose a novel way of learning a neural network classifier for use in a greedy, transition-based dependency parser. Because this classifier learns and uses just a small number of dense features, it can work very fast, while achieving about a 2% improvement in unlabeled and labeled attachment scores on both English and Chinese datasets. Concretely, our parser is able to parse more than 1000 sentences per second at 92.2% unlabeled attachment score on the English Penn Treebank.

1 Introduction

In recent years, enormous parsing success has been achieved by the use of feature-based discriminative dependency parsers (Kübler et al., 2009). In particular, for practical applications, the speed of the subclass of transition-based dependency parsers has been very appealing.

However, these parsers are not perfect. First, from a statistical perspective, these parsers suffer from the use of millions of mainly poorly estimated feature weights. While in aggregate both lexicalized features and higher-order interaction term features are very important in improving the performance of these systems, nevertheless, there is insufficient data to correctly weight most such features. For this reason, techniques for introducing higher-support features such as word class features have also been very successful in improving parsing performance (Koo et al., 2008). Second, almost all existing parsers rely on a manually designed set of feature templates, which require a lot of expertise and are usually incomplete. Third, the use of many feature templates causes a less studied problem: in modern dependency parsers, most of the runtime is consumed not by the core parsing algorithm but in the feature extraction step (He et al., 2013). For instance, Bohnet (2010) reports that his baseline parser spends 99% of its time doing feature extraction, despite that being done in standard efficient ways.

In this work, we address all of these problems by using dense features in place of the sparse indicator features. This is inspired by the recent success of distributed word representations in many NLP tasks, e.g., POS tagging (Collobert et al., 2011), machine translation (Devlin et al., 2014), and constituency parsing (Socher et al., 2013). Low-dimensional, dense word embeddings can effectively alleviate sparsity by sharing statistical strength between similar words, and can provide us with a good starting point for constructing features of words and their interactions.

Nevertheless, there remain challenging problems of how to encode all the available information from the configuration and how to model higher-order features based on the dense representations. In this paper, we train a neural network classifier to make parsing decisions within a transition-based dependency parser. The neural network learns compact dense vector representations of words, part-of-speech (POS) tags, and dependency labels.
This results in a fast, compact classifier, which uses only 200 learned dense features while yielding good gains in parsing accuracy and speed on two languages (English and Chinese) and two different dependency representations (CoNLL and Stanford dependencies). The main contributions of this work are: (i) showing the usefulness of dense representations that are learned within the parsing task, (ii) developing a neural network architecture that gives good accuracy and speed, and (iii) introducing a novel activation function for the neural network that better captures higher-order interaction features.

2 Transition-based Dependency Parsing

Transition-based dependency parsing aims to predict a transition sequence from an initial configuration to some terminal configuration, which derives a target dependency parse tree, as shown in Figure 1. In this paper, we examine only greedy parsing, which uses a classifier to predict the correct transition based on features extracted from the configuration. This class of parsers is of great interest because of their efficiency, although they tend to perform slightly worse than the search-based parsers because of subsequent error propagation. However, our greedy parser can achieve comparable accuracy with a very good speed.[1]

[1] Additionally, our parser can be naturally incorporated with beam search, but we leave this to future work.

As the basis of our parser, we employ the arc-standard system (Nivre, 2004), one of the most popular transition systems. In the arc-standard system, a configuration c = (s, b, A) consists of a stack s, a buffer b, and a set of dependency arcs A. The initial configuration for a sentence w_1, ..., w_n is s = [ROOT], b = [w_1, ..., w_n], A = ∅. A configuration c is terminal if the buffer is empty and the stack contains the single node ROOT, and the parse tree is given by A_c. Denoting s_i (i = 1, 2, ...) as the ith top element on the stack, and b_i (i = 1, 2, ...) as the ith element on the buffer, the arc-standard system defines three types of transitions:

- LEFT-ARC(l): adds an arc s1 → s2 with label l and removes s2 from the stack. Precondition: |s| ≥ 2.
- RIGHT-ARC(l): adds an arc s2 → s1 with label l and removes s1 from the stack. Precondition: |s| ≥ 2.
- SHIFT: moves b1 from the buffer to the stack. Precondition: |b| ≥ 1.

In the labeled version of parsing, there are in total |T| = 2N_l + 1 transitions, where N_l is the number of different arc labels. Figure 1 illustrates an example of one transition sequence from the initial configuration to a terminal one.

[Figure 1: An example of transition-based dependency parsing. Above left: a desired dependency tree for "He/PRP has/VBZ good/JJ control/NN ./." (arcs: root, nsubj, dobj, amod, punct); above right: an intermediate configuration (correct transition: SHIFT); bottom: a transition sequence of the arc-standard system, reproduced here.]

Transition         Stack                      Buffer                     A
(initial)          [ROOT]                     [He has good control .]    ∅
SHIFT              [ROOT He]                  [has good control .]
SHIFT              [ROOT He has]              [good control .]
LEFT-ARC(nsubj)    [ROOT has]                 [good control .]           A ∪ nsubj(has, He)
SHIFT              [ROOT has good]            [control .]
SHIFT              [ROOT has good control]    [.]
LEFT-ARC(amod)     [ROOT has control]         [.]                        A ∪ amod(control, good)
RIGHT-ARC(dobj)    [ROOT has]                 [.]                        A ∪ dobj(has, control)
...                ...                        ...                        ...
RIGHT-ARC(root)    [ROOT]                     []                         A ∪ root(ROOT, has)
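The transition system above is simple enough to state directly in code. The following is a minimal sketch with a hypothetical Configuration class, not the implementation released with this paper, using word indices with 0 for ROOT.

```python
# Minimal sketch of the arc-standard transition system (hypothetical names).
class Configuration:
    def __init__(self, words):
        self.stack = [0]                               # 0 denotes ROOT
        self.buffer = list(range(1, len(words) + 1))   # word indices w_1 .. w_n
        self.arcs = []                                 # set A as (head, dependent, label)

    def apply(self, transition, label=None):
        if transition == "SHIFT":                      # precondition: |b| >= 1
            self.stack.append(self.buffer.pop(0))
        elif transition == "LEFT-ARC":                 # precondition: |s| >= 2
            s1, s2 = self.stack[-1], self.stack[-2]
            self.arcs.append((s1, s2, label))          # arc s1 -> s2
            del self.stack[-2]                         # remove s2
        elif transition == "RIGHT-ARC":                # precondition: |s| >= 2
            s1, s2 = self.stack[-1], self.stack[-2]
            self.arcs.append((s2, s1, label))          # arc s2 -> s1
            self.stack.pop()                           # remove s1

    def is_terminal(self):
        return not self.buffer and self.stack == [0]
```

Applying SHIFT, SHIFT, LEFT-ARC(nsubj), SHIFT, SHIFT, LEFT-ARC(amod), RIGHT-ARC(dobj), and so on to the sentence of Figure 1 reproduces the arc set shown there.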
The essential goal of a greedy parser is to predict a correct transition from T, based on one given configuration. Information that can be obtained from one configuration includes: (1) all the words and their corresponding POS tags (e.g., has / VBZ); (2) the head of a word and its label (e.g., nsubj, dobj) if applicable; (3) the position of a word on the stack/buffer, or whether it has already been removed from the stack.

Conventional approaches extract indicator features such as the conjunction of 1 ~ 3 elements from the stack/buffer using their words, POS tags or arc labels. Table 1 lists a typical set of feature templates chosen from the ones of (Huang et al., 2009; Zhang and Nivre, 2011).[2]

Single-word features (9)
s1.w; s1.t; s1.wt; s2.w; s2.t; s2.wt; b1.w; b1.t; b1.wt

Word-pair features (8)
s1.wt◦s2.wt; s1.wt◦s2.w; s1.wt◦s2.t; s1.w◦s2.wt; s1.t◦s2.wt; s1.w◦s2.w; s1.t◦s2.t; s1.t◦b1.t

Three-word features (8)
s2.t◦s1.t◦b1.t; s2.t◦s1.t◦lc1(s1).t; s2.t◦s1.t◦rc1(s1).t; s2.t◦s1.t◦lc1(s2).t; s2.t◦s1.t◦rc1(s2).t; s2.t◦s1.w◦rc1(s2).t; s2.t◦s1.w◦lc1(s1).t; s2.t◦s1.w◦b1.t

Table 1: The feature templates used for analysis. lc1(si) and rc1(si) denote the leftmost and rightmost children of si; w denotes word, t denotes POS tag.

[2] We exclude sophisticated features using labels, distance, valency and third-order features in this analysis, but we will include all of them in the final evaluation.
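To make the cost of such templates concrete, the sketch below shows how an indicator-feature parser typically instantiates them as strings and scores them against a large weight table. The function, the example values, and the table are hypothetical illustrations, not taken from any particular parser.

```python
# Hypothetical sketch: instantiating a few templates from Table 1 as feature strings.
def extract_indicator_features(s1_w, s1_t, s2_w, s2_t, b1_t):
    return [
        "s1.w=" + s1_w,                                                  # single-word
        "s1.t=" + s1_t,
        "s1.wt=" + s1_w + "/" + s1_t,
        "s1.wt|s2.wt=" + s1_w + "/" + s1_t + "|" + s2_w + "/" + s2_t,    # word pair
        "s2.t|s1.t|b1.t=" + s2_t + "|" + s1_t + "|" + b1_t,              # three-word
    ]

weights = {}   # in a real parser: millions of sparse feature weights
features = extract_indicator_features("has", "VBZ", "He", "PRP", "JJ")
score = sum(weights.get(f, 0.0) for f in features)
```

Every parsing step repeats this kind of string concatenation and table lookup for dozens of templates, which is the cost discussed under "Expensive feature computation" below.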

These features suffer from the following problems:

- Sparsity. The features, especially lexicalized features, are highly sparse, and this is a common problem in many NLP tasks. The situation is severe in dependency parsing, because it depends critically on word-to-word interactions and thus on high-order features. To give a better understanding, we perform a feature analysis using the features in Table 1 on the English Penn Treebank (CoNLL representations). The results given in Table 2 demonstrate that: (1) lexicalized features are indispensable; (2) not only are the word-pair features (especially of s1 and s2) vital for predictions, the three-word conjunctions (e.g., {s2, s1, b1}, {s2, lc1(s1), s1}) are also very important.

Features                              UAS
All features in Table 1               88.0
single-word & word-pair features      82.7
only single-word features             76.9
excluding all lexicalized features    81.5

Table 2: Performance of different feature sets. UAS: unlabeled attachment score.

- Incompleteness. Incompleteness is an unavoidable issue in all existing feature templates. Even with expertise and manual handling involved, they still do not include the conjunction of every useful word combination. For example, the conjunction of s1 and b2 is omitted in almost all commonly used feature templates, even though it could indicate that we cannot perform a RIGHT-ARC action if there is an arc from s1 to b2.

- Expensive feature computation. The feature generation of indicator features is generally expensive: we have to concatenate some words, POS tags, or arc labels to generate feature strings, and look them up in a huge table containing several millions of features. In our experiments, more than 95% of the time is consumed by feature computation during the parsing process.

So far, we have discussed the preliminaries of transition-based dependency parsing and the existing problems of sparse indicator features. In the following sections, we will elaborate on our neural network model for learning dense features, along with experimental evaluations that prove its efficiency.

3 Neural Network Based Parser

In this section, we first present our neural network model and its main components. Later, we give details of training and of speeding up the parsing process.

3.1 Model

Figure 2 describes our neural network architecture.

[Figure 2: Our neural network architecture. The input layer [x^w, x^t, x^l] is built from the word, POS tag, and arc label embeddings chosen from the current stack/buffer configuration; the hidden layer is h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3; the softmax layer is p = softmax(W_2 h).]

First, as with the usual word embeddings, we represent each word as a d-dimensional vector e^w_i ∈ R^d, and the full embedding matrix is E^w ∈ R^{d×N_w}, where N_w is the dictionary size. Meanwhile, we also map POS tags and arc labels to a d-dimensional vector space, where e^t_i, e^l_j ∈ R^d are the representations of the ith POS tag and the jth arc label. Correspondingly, the POS and label embedding matrices are E^t ∈ R^{d×N_t} and E^l ∈ R^{d×N_l}, where N_t and N_l are the number of distinct POS tags and arc labels.

We choose a set of elements, based on the stack / buffer positions for each type of information (word, POS or label), which might be useful for our predictions. We denote the sets as S^w, S^t, S^l respectively. For example, given the configuration in Figure 2 and S^t = {lc1(s2).t, s2.t, rc1(s2).t, s1.t}, we will extract PRP, VBZ, NULL, JJ in order. Here we use a special token NULL to represent a non-existent element.

We build a standard neural network with one hidden layer, where the corresponding embeddings of our chosen elements from S^w, S^t, S^l will be added to the input layer. Denoting n_w, n_t, n_l as the number of chosen elements of each type, we add x^w = [e^w_{w_1}; e^w_{w_2}; ...; e^w_{w_{n_w}}] to the input layer, where S^w = {w_1, ..., w_{n_w}}. Similarly, we add the POS tag features x^t and arc label features x^l to the input layer.

We map the input layer to a hidden layer with d_h nodes through a cube activation function:

h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3

where W_1^w ∈ R^{d_h×(d·n_w)}, W_1^t ∈ R^{d_h×(d·n_t)}, W_1^l ∈ R^{d_h×(d·n_l)}, and b_1 ∈ R^{d_h} is the bias.

A softmax layer is finally added on top of the hidden layer for modeling multi-class probabilities p = softmax(W_2 h), where W_2 ∈ R^{|T|×d_h}.
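This forward pass is short enough to sketch directly. The NumPy code below is a minimal illustration under the notation above; the array arguments are assumed to be already-initialized parameters, and it is not the released implementation (for instance, it ignores the restriction to feasible transitions discussed in Section 3.2).

```python
import numpy as np

def forward(word_ids, tag_ids, label_ids, Ew, Et, El, W1w, W1t, W1l, b1, W2):
    """One forward pass of the architecture in Figure 2.

    word_ids, tag_ids, label_ids: indices of the n_w, n_t, n_l chosen
    elements from S^w, S^t, S^l for one configuration.
    Ew, Et, El: embedding matrices of shape (d, N_w), (d, N_t), (d, N_l).
    """
    xw = Ew[:, word_ids].reshape(-1, order="F")       # concatenate d-dim embeddings
    xt = Et[:, tag_ids].reshape(-1, order="F")
    xl = El[:, label_ids].reshape(-1, order="F")
    h = (W1w @ xw + W1t @ xt + W1l @ xl + b1) ** 3    # cube activation
    scores = W2 @ h                                   # one score per transition in T
    exp = np.exp(scores - scores.max())               # numerically stable softmax
    return exp / exp.sum()
```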

POS and label embeddings

To the best of our knowledge, this is the first attempt to introduce POS tag and arc label embeddings instead of discrete representations.

Although the POS tags P = {NN, NNP, NNS, DT, JJ, ...} (for English) and arc labels L = {amod, tmod, nsubj, csubj, dobj, ...} (for Stanford Dependencies on English) are relatively small discrete sets, they still exhibit many semantic similarities, like words. For example, NN (singular noun) should be closer to NNS (plural noun) than to DT (determiner), and amod (adjective modifier) should be closer to num (numeric modifier) than to nsubj (nominal subject). We expect these semantic meanings to be effectively captured by the dense representations.

Cube activation function

As stated above, we introduce a novel activation function, cube: g(x) = x^3, in our model instead of the commonly used tanh or sigmoid functions (Figure 3).

[Figure 3: Different activation functions used in neural networks: cube, sigmoid, tanh, identity.]

Intuitively, every hidden unit is computed by a (non-linear) mapping on a weighted sum of input units plus a bias. Using g(x) = x^3 can directly model the product terms x_i x_j x_k for any three different elements at the input layer:

g(w_1 x_1 + ... + w_m x_m + b) = Σ_{i,j,k} (w_i w_j w_k) x_i x_j x_k + b Σ_{i,j} (w_i w_j) x_i x_j + ...

In our case, x_i, x_j, x_k could come from different dimensions of three embeddings. We believe that this better captures the interaction of three elements, which is a very desired property for dependency parsing.

Experimental results also verify the success of the cube activation function empirically (see more comparisons in Section 4). However, the expressive power of this activation function remains to be investigated theoretically.
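For completeness, the terms abbreviated by the ellipsis in the expansion above follow from the binomial theorem; the constant factors are written out explicitly here, whereas the abbreviated form suppresses them. Writing S = Σ_i w_i x_i:

```latex
% Full expansion of the cube activation, with S = \sum_i w_i x_i.
(S + b)^3 = S^3 + 3 b S^2 + 3 b^2 S + b^3
\;\Longrightarrow\;
g\Big(\sum_i w_i x_i + b\Big)
  = \sum_{i,j,k} w_i w_j w_k \, x_i x_j x_k
  + 3 b \sum_{i,j} w_i w_j \, x_i x_j
  + 3 b^2 \sum_i w_i x_i
  + b^3 .
```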
The choice of S^w, S^t, S^l

Following (Zhang and Nivre, 2011), we pick a rich set of elements for our final parser. In detail, S^w contains n_w = 18 elements: (1) the top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3; (2) the first and second leftmost / rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2; (3) the leftmost of leftmost / rightmost of rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2.

We use the corresponding POS tags for S^t (n_t = 18), and the corresponding arc labels of the words excluding those 6 words on the stack/buffer for S^l (n_l = 12). A good advantage of our parser is that we can add a rich set of elements cheaply, instead of hand-crafting many more indicator features.

3.2 Training

We first generate training examples {(c_i, t_i)}_{i=1}^m from the training sentences and their gold parse trees, using a "shortest stack" oracle which always prefers LEFT-ARC(l) over SHIFT, where c_i is a configuration and t_i ∈ T is the oracle transition.

The final training objective is to minimize the cross-entropy loss, plus an l2-regularization term:

L(θ) = −Σ_i log p_{t_i} + λ ||θ||_2^2

where θ is the set of all parameters {W_1^w, W_1^t, W_1^l, b_1, W_2, E^w, E^t, E^l}. A slight variation is that in practice we compute the softmax probabilities only among the feasible transitions.

For the initialization of parameters, we use pre-trained word embeddings to initialize E^w and use random initialization within (−0.01, 0.01) for E^t and E^l. Concretely, we use the pre-trained word embeddings from (Collobert et al., 2011) for English (#dictionary = 130,000, coverage = 72.7%), and our own 50-dimensional word2vec embeddings (Mikolov et al., 2013), trained on the Wikipedia and Gigaword corpus, for Chinese (#dictionary = 285,791, coverage = 79.0%). We will also compare with random initialization of E^w in Section 4. The training error derivatives will be back-propagated to these embeddings during the training process.

We use mini-batched AdaGrad (Duchi et al., 2011) for optimization and also apply dropout (Hinton et al., 2012) with a 0.5 rate. The parameters which achieve the best unlabeled attachment score on the development set will be chosen for final evaluation.
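The optimization step is standard. The sketch below shows the per-parameter AdaGrad update and the objective above in NumPy; the function names, the mini-batching, and the epsilon constant are illustrative assumptions, and dropout on the hidden layer is omitted for brevity.

```python
import numpy as np

def adagrad_step(theta, grad, G, alpha=0.01, eps=1e-6):
    """One AdaGrad update (Duchi et al., 2011): per-parameter learning
    rates derived from the accumulated squared gradients G."""
    G += grad ** 2
    theta -= alpha * grad / (np.sqrt(G) + eps)
    return theta, G

def objective(probs, gold, lam, theta):
    """Cross-entropy over oracle transitions plus l2 regularization.
    probs[i] holds the (feasible-transition) softmax probabilities for
    example i, and gold[i] is the index of the oracle transition t_i."""
    cross_entropy = -sum(np.log(p[t]) for p, t in zip(probs, gold))
    return cross_entropy + lam * np.sum(theta ** 2)
```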
3.3 Parsing

We perform greedy decoding in parsing. At each step, we extract all the corresponding word, POS and label embeddings from the current configuration c, compute the hidden layer h(c) ∈ R^{d_h}, and pick the transition with the highest score: t = argmax_{t feasible} W_2(t, ·) h(c). We then execute c → t(c).

Compared with indicator features, our parser does not need to compute conjunction features and look them up in a huge feature table, and thus greatly reduces feature generation time. Instead, it involves many matrix addition and multiplication operations. To further speed up parsing, we apply a pre-computation trick, similar to (Devlin et al., 2014). For each position chosen from S^w, we pre-compute matrix multiplications for the 10,000 most frequent words. Thus, computing the hidden layer only requires looking up the table for these frequent words and adding the d_h-dimensional vectors. Similarly, we also pre-compute matrix computations for all positions and all POS tags and arc labels. We only use this optimization in the neural network parser, since it is only feasible for a parser which, like ours, uses a small number of features. In practice, this pre-computation step increases the speed of our parser 8 ~ 10 times.
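The pre-computation trick can be sketched as a cache keyed by (position, id): for every input position and every frequent word (and for all POS tags and arc labels), the product of the corresponding d columns of W_1 with that embedding is computed once, so the hidden layer at parse time is mostly lookups and additions. The code below is an illustrative sketch under the notation above, with hypothetical helper names and only the word part shown.

```python
import numpy as np

def precompute(W1w, Ew, frequent_ids, d, n_w):
    """table[(pos, w)] = W1^w[:, pos*d:(pos+1)*d] @ e^w_w, computed once
    for every stack/buffer position and every frequent word id."""
    table = {}
    for pos in range(n_w):
        block = W1w[:, pos * d:(pos + 1) * d]          # slice for this position
        for w in frequent_ids:
            table[(pos, w)] = block @ Ew[:, w]
    return table

def hidden_preactivation(word_ids, table, W1w, Ew, d, b1):
    """Accumulates b1 + sum over positions of the cached products for one
    configuration, falling back to an explicit product for rare words.
    The POS tag and arc label terms are accumulated the same way, and the
    cube activation is applied to the total afterwards."""
    acc = b1.copy()
    for pos, w in enumerate(word_ids):
        cached = table.get((pos, w))
        if cached is None:                             # rare word: multiply on the fly
            cached = W1w[:, pos * d:(pos + 1) * d] @ Ew[:, w]
        acc += cached
    return acc                                         # h = (acc + POS + label terms) ** 3
```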
4 Experiments

4.1 Datasets

We conduct our experiments on the English Penn Treebank (PTB) and the Chinese Penn Treebank (CTB) datasets.

For English, we follow the standard splits of PTB3, using sections 2-21 for training, section 22 as the development set and section 23 as the test set. We adopt two different dependency representations: CoNLL Syntactic Dependencies (CD) (Johansson and Nugues, 2007) using the LTH Constituent-to-Dependency Conversion Tool[3] and Stanford Basic Dependencies (SD) (de Marneffe et al., 2006) using the Stanford parser v3.3.0.[4] The POS tags are assigned using the Stanford POS tagger (Toutanova et al., 2003) with ten-way jackknifing of the training data (accuracy 97.3%).

For Chinese, we adopt the same split of CTB5 as described in (Zhang and Clark, 2008). Dependencies are converted using the Penn2Malt tool[5] with the head-finding rules of (Zhang and Clark, 2008). Following (Zhang and Clark, 2008; Zhang and Nivre, 2011), we use gold segmentation and POS tags for the input.

Table 3 gives statistics of the three datasets.[6] In particular, over 99% of the trees are projective in all datasets.

Dataset    #Test sentences    #words (N_w)    #POS (N_t)    #labels (N_l)    projective (%)
PTB: CD    2,416              44,352          45            17               99.4
PTB: SD    2,416              44,389          45            45               99.9
CTB        1,910              34,577          35            12               100.0

Table 3: Data statistics. "Projective" is the percentage of projective trees on the training set.

4.2 Results

The following hyper-parameters are used in all experiments: embedding size d = 50, hidden layer size h = 200, regularization parameter λ = 10^-8, initial learning rate of AdaGrad α = 0.01.

To situate the performance of our parser, we first make a comparison with our own implementation of greedy arc-eager and arc-standard parsers. These parsers are trained with the structured averaged perceptron using the "early-update" strategy. The feature templates of (Zhang and Nivre, 2011) are used for the arc-eager system, and they are also adapted to the arc-standard system.[7]

Furthermore, we also compare our parser with two popular, off-the-shelf parsers: MaltParser, a greedy transition-based dependency parser (Nivre et al., 2006),[8] and MSTParser

[3] http://nlp.cs.lth.se/software/treebank ser.shtml
[5] http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
[6] Pennconverter and Stanford dependencies generate slightly different tokenization, e.g., Pennconverter splits the token WCRS\/Boston NNP into three tokens WCRS NNP, / CC, and Boston NNP.
[7] Since arc-standard is bottom-up, we remove all features using the he
