Semi-supervised New Event Type Induction and Event Detection


Lifu Huang and Heng Ji
Computer Science Department, University of Illinois at Urbana-Champaign

Abstract

Most previous event extraction studies assume that a set of target event types and corresponding event annotations are given, which could be very expensive. In this paper, we work on a new task of semi-supervised event type induction, aiming to automatically discover a set of unseen types from a given corpus by leveraging annotations available for a few seen types. We design a Semi-Supervised Vector Quantized Variational Autoencoder framework to automatically learn a discrete latent type representation for each seen and unseen type and optimize them using seen type event annotations. A variational autoencoder is further introduced to enforce the reconstruction of each event mention conditioned on its latent type distribution. Experiments show that our approach can not only achieve state-of-the-art performance on supervised event detection but also discover high-quality new event types. (The programs are publicly available for research purposes at https://github.com/wilburOne/SSVQVAE.)

1 Introduction

Event extraction is the task of automatically identifying and typing event trigger words (Event Detection) and extracting participants for each trigger (Argument Extraction) from natural language text. Traditional event extraction studies (Ji and Grishman, 2008; McClosky et al., 2011; Li et al., 2013; Chen et al., 2015; Yang and Mitchell, 2016; Liu et al., 2018; Nguyen and Nguyen, 2019; Lin et al., 2020; Li et al., 2020) usually assume there exists a set of predefined event types and argument roles, so that supervised machine learning models, e.g., deep neural networks, can be employed to extract events for each type based on human annotations. However, in practice, it is usually very expensive and time-consuming to manually craft an event schema, which defines the types and complex templates of the expected events. Moreover, the coverage of manually crafted schemas is often very low, making them fail to generalize to new scenarios.

Figure 1: Semi-supervised new event type induction: discovering a set of new event types and their event mentions given the annotations for a few seen types.

Recent studies have shown that it is possible to automatically induce an event schema from raw text. Some researchers explore probabilistic generative methods (Chambers, 2013; Nguyen et al., 2015; Yuan et al., 2018; Liu et al., 2019) or ad-hoc clustering-based algorithms (Huang et al., 2016) to discover a set of event types and argument roles. Several studies (Huang et al., 2018; Lai and Nguyen, 2019) also explore zero-shot and few-shot learning approaches to leverage available resources and extend event extraction to new types. Generally, event schema induction can be divided into two steps: event type induction, which aims to discover a set of new event types for the given scenario, and argument role induction, which discovers a set of argument roles for each type. In this work, we focus on tackling the first problem only.

We propose the task of semi-supervised event type induction, shown in Figure 1, which aims to leverage available event annotations for a few types, called seen types, and to automatically discover a set of new unseen types as well as their corresponding event mentions. As a solution, we design a new Semi-supervised Vector Quantized Variational Autoencoder framework (SS-VQ-VAE), which first assigns a discrete latent type representation to each seen and unseen type and then optimizes these representations while projecting each candidate trigger into a particular seen or unseen type. The candidate triggers are discovered with a heuristic approach. Experiments under the settings of both supervised event detection and new event type induction demonstrate that our approach can not only detect event mentions for seen types with high precision, but also discover high-quality new unseen types.

2 Approach

Figure 2: Architecture (illustrated with the sentence "Ayman was arrested and was sentenced to life in prison").

As Figure 2 shows, given an input sentence, we first automatically discover all candidate triggers and encode each trigger with a contextual vector using a pre-trained BERT (Devlin et al., 2019) encoder. Then, we predict the type of each candidate trigger by looking it up in a dictionary of discrete latent representations of all seen and unseen types. Meanwhile, to avoid the type prediction being over-fitted to the seen types, we apply a variational autoencoder (VAE) as a regularizer that first projects each trigger into a latent variational embedding and then reconstructs the trigger conditioned on its type distribution.

2.1 Event Trigger Identification

Similar to Huang et al. (2016), we identify all candidate triggers based on word sense induction. Specifically, for each word, we disambiguate its senses and link each sense to OntoNotes (Hovy et al., 2006) using a word sense disambiguation system, IMS (Zhong and Ng, 2010); we use the OntoNotes-based IMS word sense disambiguator available at https://github.com/c-amr/camr. We consider all noun and verb concepts that can be mapped to OntoNotes senses as candidate triggers. In addition, the concepts that can be matched with verb or nominal lexical units in FrameNet (Baker et al., 1998) are also considered as candidate triggers.

2.2 Trigger Representation Learning

Given a sentence s = [w_1, ..., w_n] in which w_i is identified as a candidate trigger, we use a pre-trained BERT encoder to encode the whole sentence and obtain a contextual representation for w_i. If w_i is split into multiple subwords or words, we use the average of all subword vectors as the final trigger representation.

2.3 Event Type Prediction with Vector Quantization

To predict a type for a candidate trigger, an intuitive approach is to learn a classifier using the event annotations of the seen types. However, as we also aim to discover a set of unseen types, for which no annotations exist, the classifier for the unseen types cannot be optimized this way.

To solve this problem, we employ a Vector Quantization (Gersho and Gray, 2012) strategy. We first define a discrete latent event type embedding space E ∈ R^{k×d}, where k is the number of candidate event types and d is the dimensionality of each type embedding e_i. Each e_i can be viewed as the centroid of the triggers belonging to the corresponding event type. For each seen type, we initialize e with the contextual vector of a trigger randomly selected from the corresponding annotations. For each unseen type, we initialize e with the contextual vector of a trigger randomly picked from all unannotated event mentions. Assuming there are m seen types, we arbitrarily assign E_[1:m] as their type representations.

Given a candidate trigger t and its contextual vector v_t, we first apply a linear encoder f_c(v_t) ∈ R^d to extract type-specific features. Then, we compute a type distribution y_t based on f_c(v_t) by looking up all the discrete latent event type embeddings with an inner-product operation:

    y_t = Softmax(E_[1:k] · f_c(v_t))    (1)
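To make Equation 1 and the trigger representation of Section 2.2 concrete, here is a minimal PyTorch/HuggingFace sketch; the variable names (f_c, type_embeddings) and the subword positions of the example trigger are our own illustrative choices and are not taken from the released code.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")
    bert = BertModel.from_pretrained("bert-large-cased")

    k, d = 34, 500                               # number of candidate types, type embedding size
    f_c = nn.Linear(bert.config.hidden_size, d)  # linear encoder for type-specific features
    type_embeddings = nn.Embedding(k, d)         # discrete latent event type dictionary E

    sentence = "Ayman was arrested and was sentenced to life in prison."
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():                        # inference-only illustration
        hidden_states = bert(**enc).last_hidden_state[0]   # (seq_len, hidden_size)

    # Hypothetical subword positions of the candidate trigger "arrested"; the trigger
    # representation v_t is the average of its subword vectors (Section 2.2).
    trigger_positions = [3]
    v_t = hidden_states[trigger_positions].mean(dim=0)

    # Equation 1: inner product with every type embedding, followed by a softmax.
    y_t = torch.softmax(type_embeddings.weight @ f_c(v_t), dim=-1)   # (k,)
    predicted_type = int(y_t.argmax())

During training, f_c and the type dictionary would be updated by the objectives in Equations 2 and 3 below.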

The feature encoder f_c(.) is optimized using all event annotations for seen types (the cross-entropy term in Equation 2) and event mentions for unseen types (the second term in Equation 2, which we only apply when we know the new event mentions do not belong to any seen types). The intuition of the second term in Equation 2 is that, for each new event mention, we do not know the correct type, but we do know that the type must come from the set of unseen types, so we maximize the margin between the probability of the most likely unseen type and the highest probability of the (incorrect) seen types:

    L_c = Σ_{(t, ỹ_t) ∈ D_s} −ỹ_t log(y_t) + Σ_{t ∈ D_u} (max(y_t^[1:m]) − max(y_t^[m:k]))    (2)

where ỹ_t is the ground-truth label, D_s and D_u denote the sets of annotated event mentions for seen types and new event mentions for unseen types, and y_t^[1:m] and y_t^[m:k] are the type prediction scores for seen and unseen types respectively.

To optimize the type embeddings E, we follow the VQ objective (van den Oord et al., 2017) and use an l2 error to move the type vector e_i towards the type-specific feature f_c(v_t) (the first term in Equation 3), where the e_i of t is determined by y_t. To make sure f_c(.) commits to an embedding, we add a commitment loss (the second term in Equation 3):

    L_vq = ||sg(f_c(v_t)) − e_i||_2 + ||f_c(v_t) − sg(e_i)||_2    (3)

where sg stands for the stop-gradient operator, which turns its operand into a non-updated constant: the output of sg equals its input in the forward pass, and its gradient is zero during training.

2.4 Variational Autoencoder as Regularizer

To avoid the type prediction being over-fitted to the seen types, we employ a semi-supervised variational autoencoder as a regularizer. The intuition is that each event mention can be generated conditioned on a latent variational embedding z and its corresponding type distribution y, which is predicted by the approach described in Section 2.3.

We first describe the semi-supervised variational inference process. It consists of an inference network q(z|t), which is a posterior over the latent variable z given the trigger t, and a generative network p(t|z, y), which reconstructs the candidate trigger t from the latent variable z and the type information y. For each candidate trigger t with a human-annotated label y, the likelihood p(t, y) is approximated with a variational lower bound

    log p(t, y) ≥ log p(t|y, z) − KL(q(z|t) || p(z)) = −L(t, y)

where log p(t|z, y) is the expected reconstruction of t conditioned on z and y, and p(z) is the prior Gaussian distribution. For each unlabeled candidate trigger t, the likelihood p(t) is approximated with another variational lower bound

    log p(t) ≥ Σ_y q(y|t)(−L(t, y)) − Σ_y q(y|t) log q(y|t) = −L(t)

where q(y|t) is obtained from Equation 1.

As for the model implementation, given a candidate trigger t and its contextual embedding v_t, we first pass it through an encoder f_e(v_t) to extract features. As we assume the latent variational embedding z_t follows a Gaussian distribution z_t ∼ N(µ_t, σ_t), we apply two linear functions to obtain the mean vector µ_t = f_µ(f_e(v_t)) and the variance vector σ_t = f_σ(f_e(v_t)). For decoding, we employ another linear function to reconstruct v_t from the concatenation of z_t and y_t: v′_t = f_r([z_t : y_t]). We optimize the following objective for the semi-supervised VAE:

    L_v = Σ_{t ∈ D_u} L(t) + Σ_{(t, y) ∈ D_s} L(t, y)    (4)

The overall loss function for optimizing the whole SS-VQ-VAE framework is

    L = α L_c + β L_vq + γ L_v    (5)

where α, β and γ are hyper-parameters that balance the three objectives.
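To show how Equations 2-5 could be assembled in code, here is a rough PyTorch sketch that reuses the illustrative names from the snippet above, realizes sg(.) with .detach(), and substitutes a few common VAE/VQ-VAE conventions (squared l2 terms, a log-variance head); it is an approximation of the objective, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def classification_loss(y, gold_types, is_seen, m):
        """Equation 2 (approximate): cross-entropy on seen-type mentions plus a margin-style
        term pushing unseen-type mentions away from all seen types.
        y: (batch, k) probabilities from Equation 1; gold_types: LongTensor of seen-type ids;
        is_seen: BoolTensor marking mentions from D_s; m: number of seen types."""
        loss = y.new_zeros(())
        if is_seen.any():
            loss = loss + F.nll_loss(torch.log(y[is_seen] + 1e-12), gold_types[is_seen])
        if (~is_seen).any():
            y_u = y[~is_seen]
            loss = loss + (y_u[:, :m].max(dim=-1).values
                           - y_u[:, m:].max(dim=-1).values).sum()
        return loss

    def vq_loss(fc_vt, e_i):
        """Equation 3: pull the selected type embedding e_i towards the trigger feature
        (stop-gradient on the feature) plus a commitment term (stop-gradient on e_i)."""
        return ((fc_vt.detach() - e_i) ** 2).sum(-1).mean() + \
               ((fc_vt - e_i.detach()) ** 2).sum(-1).mean()

    def vae_term(v_t, v_t_rec, mu, logvar):
        """The L(t, y) term summed in Equation 4: reconstruction error plus the KL divergence
        to a standard Gaussian prior (the unlabeled case additionally weights it by q(y|t))."""
        rec = F.mse_loss(v_t_rec, v_t)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl

    def overall_loss(L_c, L_vq, L_v, alpha=1.0, beta=0.1, gamma=0.1):
        """Equation 5; the default weights are just one point of the grid in Section 3.1."""
        return alpha * L_c + beta * L_vq + gamma * L_v

A full training step would additionally sample z_t = µ_t + σ_t · ε with the reparameterization trick before decoding, and sum the VAE term over D_s and D_u as in Equation 4.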
3 Experiments and Results

3.1 Dataset

We perform experiments on the Automatic Content Extraction (ACE) 2005 dataset and evaluate our approach under two settings. (1) Supervised event extraction, where the target types include the 33 ACE predefined types plus "other", so k is set to 34. Given all candidate triggers, the goal is to correctly identify all ACE event mentions and classify them into the corresponding types. We follow the same data split as prior work (Li et al., 2013; Nguyen et al., 2016; Yang and Mitchell, 2016), in which 529/30/40 newswire documents are used as the training/dev/test sets. (2) New event type induction, where we follow a previous study (Huang et al., 2018) and use the 10 most popular event types of the ACE05 data as seen types and the remaining 23 types as unseen. Given all ACE annotated event mentions, the goal of this task is to test whether the approach can automatically discover the remaining 23 unseen ACE types and categorize each candidate trigger into a particular seen or unseen type. In this experiment, k is set to 500.
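As a small, hypothetical illustration of the second setting, the helper below partitions event types by frequency into 10 seen types and the remaining unseen types; the mentions list of (trigger, event_type) pairs is an assumed pre-processed structure rather than part of the ACE distribution, and the gold labels of unseen-type mentions are used only for evaluation.

    from collections import Counter

    def split_seen_unseen(mentions, num_seen=10):
        """Treat the num_seen most frequent event types as seen and the rest as unseen."""
        counts = Counter(etype for _, etype in mentions)
        seen_types = {t for t, _ in counts.most_common(num_seen)}
        unseen_types = set(counts) - seen_types
        seen_mentions = [(trg, t) for trg, t in mentions if t in seen_types]
        unseen_mentions = [(trg, t) for trg, t in mentions if t in unseen_types]
        return seen_types, unseen_types, seen_mentions, unseen_mentions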

In terms of implementation details, we use the pre-trained bert-large-cased model for fine-tuning and optimize our model with BertAdam. We tune the parameters with grid search: training epochs 15, learning rate l ∈ {1e-5, 2e-5, 3e-5, 5e-5}, gradient accumulation steps g ∈ {1, 2, 3}, training batch size b ∈ {5g, 8g, 10g}, and the hyper-parameters of the overall loss function α ∈ {1.0, 5.0, 10.0}, β ∈ {0.1, 0.5, 1.0}, γ ∈ {0.1, 0.5, 1.0}. The dimensionality of the type embeddings, of the latent variational embedding, and of the hidden states of f_c(.) is 500, while the hidden states of f_e(.), f_µ(.) and f_σ(.) all have dimensionality 1024.

3.2 Supervised Event Detection

Table 1 compares our approach with several baselines. We conduct an ablation study to verify the impact of the VQ and VAE components: SS-VQ-VAE w/o VQ-VAE is optimized only with the classification loss (Equation 2), while SS-VQ-VAE w/o VAE is optimized with the classification loss (Equation 2) and the VQ objective (Equation 3).

As we can see, BERT-based approaches generally outperform the methods using CNN, RNN or GRU encoders, and our approach achieves the state of the art among all methods. In particular, the recall of our approach is much higher than that of the other methods, which demonstrates the effectiveness of the trigger identification step: it narrows the learning space of the model. The ablation studies also prove the effectiveness of the VQ and VAE components.

Table 1: Supervised Event Detection Performance on ACE 2005 (F-score %), reporting precision, recall and F-score for trigger identification and trigger detection for DMCNN (Chen et al., 2015), JRNN (Nguyen et al., 2016), JMEE (Liu et al., 2018), Joint3EE (Nguyen and Nguyen, 2019), MOGANED (Yan et al., 2019), BERT-CRF, DMBERT (Wang et al., 2019), SS-VQ-VAE w/o VQ-VAE, SS-VQ-VAE w/o VAE, and the full SS-VQ-VAE.

3.3 New Event Type Induction

For new event type induction, we compare our approach with another intuitive baseline, BERT-C-Kmeans, which takes the BERT-based trigger representations and groups all candidate triggers into clusters with Constrained K-means (Wagstaff et al., 2001), a semi-supervised clustering algorithm that enforces all trigger candidates annotated with the same seen type to belong to the same cluster. Table 2 shows the performance with several clustering metrics (Chen and Ji, 2010), which measure the agreement between the ground-truth class assignment and the system's unseen type predictions.

Table 2: Evaluation of New Event Type Induction on the 23 Unseen Types of ACE 2005 (%), comparing BERT-C-Kmeans, SS-VQ-VAE w/o VAE and the full SS-VQ-VAE on Normalized Mutual Info, Fowlkes Mallows, Completeness, Homogeneity and V-Measure; the full SS-VQ-VAE scores 40.88, 31.46, 53.57, 31.19 and 39.43 on the five metrics respectively.

Normalized Mutual Info is a normalization of the Mutual Information (MI) score that scales the MI score to lie between 0 and 1:

    NMI(Y, C) = 2 · I(Y; C) / [H(Y) + H(C)]

where Y denotes the ground-truth class labels, C denotes the cluster labels, H(.) denotes the entropy function and I(Y; C) is the mutual information between Y and C.

Fowlkes Mallows (Fowlkes and Mallows, 1983) evaluates the similarity between the clusters obtained from our approach and the ground-truth labels of the data:

    FM(Y, C) = TP / √((TP + FP) · (TP + FN))

where TP (True Positive) is the number of data point pairs that are in the same cluster in both Y and C, FP (False Positive) is the number of data point pairs that are in the same cluster in Y but not in C, and FN (False Negative) is the number of data point pairs that are not in the same cluster in Y but are in the same cluster in C.

Completeness: a clustering result satisfies completeness if all members of a given class are assigned to the same cluster:

    c(Y, C) = 1 − H(C|Y) / H(C)

where H(C|Y) is the conditional entropy of the clustering output given the class labels.

Homogeneity: a clustering result satisfies homogeneity if all of its clusters contain only data points that are members of a single class:

    h(Y, C) = 1 − H(Y|C) / H(Y)

V-Measure (Rosenberg and Hirschberg, 2007) is the weighted harmonic mean of the homogeneity score and the completeness score:

    V(Y, C) = ((1 + β) · h · c) / (β · h + c)

where h denotes the homogeneity score and c the completeness score.
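All five clustering metrics above have standard implementations in scikit-learn, so the evaluation itself can be sketched in a few lines; the label lists below are hypothetical placeholders, with gold labels drawn from the 23 unseen ACE types and predictions being induced type indices.

    from sklearn import metrics

    # Hypothetical gold unseen-type ids (e.g. 0=Attack, 1=Die, 2=Convict) and induced cluster ids.
    gold = [0, 0, 1, 1, 2]
    pred = [2, 2, 7, 7, 5]

    print("NMI:            ", metrics.normalized_mutual_info_score(gold, pred))
    print("Fowlkes-Mallows:", metrics.fowlkes_mallows_score(gold, pred))
    print("Completeness:   ", metrics.completeness_score(gold, pred))
    print("Homogeneity:    ", metrics.homogeneity_score(gold, pred))
    print("V-measure:      ", metrics.v_measure_score(gold, pred))

scikit-learn's default "arithmetic" averaging for NMI corresponds to the 2·I(Y;C)/[H(Y)+H(C)] normalization given above.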

As a qualitative analysis, we further pick 6 unseen ACE types and randomly select at most 100 event mentions for each type. We visualize their type distributions y using t-SNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). As Figure 3 shows, most of the event mentions that are annotated with the same ACE type tend to be predicted as the same new unseen type.

Figure 3: Type Distribution of 6 Unseen Types of ACE.

4 Related Work

Traditional event extraction studies (Ji and Grishman, 2008; McClosky et al., 2011; Li et al., 2013; Chen et al., 2015; Yang and Mitchell, 2016; Liu et al., 2018; Nguyen and Nguyen, 2019; Lin et al., 2020; Li et al., 2020) assume that all the target event types and annotations are given. They can extract high-quality event mentions for the given types, but cannot extract mentions for any new types. Recent studies (Huang et al., 2018; Chan et al., 2019; Ferguson et al., 2018) leverage annotations for a few seen event types, or several keywords provided for the new types, to extract mentions for new types. However, all these studies assume the target types are given, which is very costly when moving to a new scenario.

Recent studies have also explored probabilistic generative methods (Chambers, 2013; Nguyen et al., 2015; Yuan et al., 2018; Liu et al., 2019) or ad-hoc clustering-based algorithms (Huang et al., 2016) to automatically discover a set of event types as well as argument roles. Most of these studies are completely unsupervised and mainly rely on statistical patterns or semantic matching, while our work tries to leverage the knowledge learned from available annotations to discover new event types.

5 Conclusion and Future Work

We have designed a semi-supervised vector quantized variational autoencoder approach which automatically learns a discrete representation for each seen and unseen type and predicts a type for each candidate trigger. Experiments show that our approach achieves the state of the art on supervised event extraction and discovers a set of high-quality unseen types. In the future, we will extend this approach to argument role induction to discover complete event schemas.

Acknowledgement

This research is based upon work supported in part by U.S. DARPA KAIROS Program No. FA8750-19-2-1004, U.S. DARPA AIDA Program No. FA8750-18-2-0014 and Air Force No. FA8650-17-C-7715. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics (Volume 1), pages 86-90. Association for Computational Linguistics.

Nathanael Chambers. 2013. Event schema induction with a probabilistic entity-driven model. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1797-1807.

Yee Seng Chan, Joshua Fasching, Haoling Qiu, and Bonan Min. 2019. Rapid customization for event extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 31-36.

Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

