

Natural Language Engineering 11 (2): 207–238. © 2005 Cambridge University Press
doi:10.1017/S135132490400364X Printed in the United Kingdom

The Penn Chinese TreeBank: Phrase structure annotation of a large corpus

NAIWEN XUE, FEI XIA, FU-DONG CHIOU and MARTA PALMER
University of Pennsylvania, Philadelphia, PA 19104, USA

(Received 3 October 2002; revised 4 November 2003)

Abstract

With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.

1 Introduction

The creation of annotated corpora has led to major advances in corpus-based natural language processing technologies. Most notably, the Penn English Treebank (Marcus, Santorini and Marcinkiewicz 1993) has proven to be a crucial resource in the recent success of English Part-Of-Speech (POS) taggers and parsers (Collins 1997, 2000; Charniak 2000), as it provides common training and testing material so that different algorithms can be compared and progress be gauged. Its success triggered the development of treebanks in a variety of languages.
As displayed in a recent book on treebanks (Abeillé 2003), there are efforts in progress for Czech, German, French, Japanese, Polish, Spanish and Turkish, to name just a few. Specific to Chinese, however, most of the annotation effort has been devoted to word segmentation (tokenization) and POS tagging. Several segmented and POS tagged corpora have been developed, based on standards published in different Chinese-speaking regions,

This work was done while the author was a graduate student at the University of Pennsylvania. The author currently is a research staff member at the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA.

most notably the Beijing University Institute of Computational Linguistics Corpus (PKU) (Yu, Zhu, Wang and Zhang 1998) and the Academia Sinica (Taiwan) Balanced Corpus (ABSC) (CKIP 1995). More recently, the LIVAC synchronous corpus [1] has been developed at City University of Hong Kong. However, there has been a general lack of syntactically annotated Chinese corpora, which hinders the development of Chinese NLP tools and makes it difficult to compare results and measure progress in Chinese language processing. In fact, there was no publicly available syntactically bracketed Chinese treebank when the Penn Chinese Treebank was started in late 1998 to address this need. The first installment of the Penn Chinese Treebank (CTB-I hereafter), 100 thousand words of annotated Xinhua [2] newswire articles, along with its segmentation (Xia 2000b), POS-tagging (Xia 2000a) and syntactic bracketing guidelines (Xue and Xia 2000), was released in the fall of 2000 (see the Appendix for the timeline). The second installment of the Penn Chinese Treebank (CTB-II hereafter) [3], containing an additional 150,000 words, and beginning to include Hong Kong News and Sinorama [4] articles in an attempt to diversify its data source, was released in the spring of 2003. The eventual goal of this on-going project is to build a large-scale Chinese corpus as a sharable resource that addresses the need for training and testing material in the Chinese NLP community.

Building a treebank requires tremendous human effort. Ensuring high quality while maintaining reasonable annotation speed is a major challenge. In order to speed up the annotation, we use a series of NLP tools to preprocess the data at different stages of annotation.
We also adopt several strategies to control the quality of the annotation: (i) a significant effort is devoted to the creation of clear, consistent, and complete annotation guidelines; (ii) all the annotation in the treebank is double-checked by a second annotator; (iii) a gold standard is created and annotation accuracy and inter-annotator agreement are monitored; and (iv) the treebank goes through a final cleanup with semi-automatic tools before the release.

While the engineering strategies may be language-independent, creating a treebank for a particular language also requires a thorough study of the language itself, especially its morphology and syntax. The properties of the language should be taken into consideration when designing the overall annotation paradigm and writing annotation guidelines. For instance, Chinese written texts do not contain word delimiters. To build a treebank for Chinese, we need to break a sentence into a word sequence before adding POS tags and phrase structures. Chinese also lacks inflectional morphology, a property that complicates all aspects of Chinese text annotation: word segmentation, POS tagging, and syntactic bracketing. As a result, many diagnostic tests that work well for English do not work for Chinese, and new diagnostic tests have to be found when developing annotation guidelines. The multitude of differences between Chinese and Indo-European languages have led

[1] More information can be found at www.rcl.cityu.edu.hk/english/livac.
[2] Xinhua is the official news agency of the People's Republic of China.
[3] CTB-I is released by LDC as Chinese Treebank Versions 1.0 and 2.0. CTB-II is included in Chinese Treebank Version 3.0.
[4] Sinorama is a Taiwan news magazine.

many Chinese linguists to doubt the feasibility of applying standard Western-style phrase structure analysis to Chinese. As a result, other recent efforts to build Chinese treebanks have adopted a different approach, putting more emphasis on providing semantic analysis. For example, Li, Li, Dong, Wang and Lu (2003) have elected to annotate dependency structures, along the lines of the Prague Dependency Treebank (Böhmová, Hajič, Hajičová and Hladká 2003). The Sinica Treebank (Chen, Huang, Chen, Luo, Chang and Chen 2003) also has a more semantic orientation, although it does provide simple syntactic analysis. The Penn Chinese Treebank represents the only attempt to provide full phrase structure for complete sentences in Chinese as the Penn English Treebank did for English. However, CTB goes further than the English Treebank in marking dropped arguments, providing argument/adjunct distinctions, and some NP-internal structure. Its efficacy for training statistical parsers has been validated by the development of several different systems (Bikel and Chiang 2000; Levy and Manning 2003; Luo 2003).

This paper is organized as follows. In section 2, we discuss several Chinese linguistic issues and our basic strategies for creating a high-quality treebank. In section 3, we address major problems that we encountered when creating three sets of annotation guidelines (for word segmentation, POS tagging and syntactic bracketing, respectively). In section 4, we briefly compare the design of our treebank with that of the Penn English Treebank (Marcus, Kim and Marcinkiewicz 1994) and the Sinica Treebank (Chen et al. 2003). In section 5, we discuss our approach to speed up the annotation and to control quality. Specifically, we describe how we use a word segmenter, a POS tagger, and a parser to speed up annotation, and use LexTract (Xia 2001) and CorpusSearch to find annotation errors.
Section 6 concludes this paper, and describes future directions.

2 Linguistic issues and engineering strategies

In this section we first give an overview of the Penn Chinese Treebank as a treebanking task. Then we outline several Chinese linguistic issues that have to be addressed when preparing the guidelines. Next we discuss the engineering issues of this project and our basic strategies for addressing them as well.

2.1 An overview of the Penn Chinese Treebank

The data in the Penn Chinese Treebank are mostly newswire and magazine articles from Xinhua newswire, Hong Kong news and the Sinorama magazine. The structure of the original articles is maintained as much as possible without modification or editing. CTB-I, the first installment of the Penn Chinese Treebank, includes 325 articles of Xinhua newswire. Most of the articles focus on economic development from 1994 to 1998, while the remaining documents describe general political and cultural topics from the same period. The average sentence length is 28.7 words. [5]

[5] In our treebank, we use periods, exclamation marks, and question marks to break a document into a sequence of sentences. We do not use commas, in contrast with the Academia Sinica Treebank. See section 4.2 for details.
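As a rough illustration of the sentence-breaking convention described in footnote 5, the Python sketch below splits a document at periods, exclamation marks and question marks but not at commas. The function name `split_sentences` and the choice of full-width punctuation characters are assumptions made for this example; they are not taken from the treebank guidelines or tools.

```python
import re

# Assumed for this sketch: full-width Chinese period, exclamation mark and
# question mark end a sentence. Commas are deliberately not in the set.
_SENTENCE = re.compile(r"[^。！？]+[。！？]*")

def split_sentences(text):
    """Split text into sentences, keeping the final punctuation attached."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

if __name__ == "__main__":
    for sentence in split_sentences("他来了。你呢？好，走吧！"):
        print(sentence)
```

Because the full-width comma is not a break point, the clause pair in the last sentence stays together, which is the contrast with comma-based splitting that the footnote draws.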

Starting with CTB-II, we began to include data sources other than Xinhua newswire. CTB-II, the second installment of the treebank, contains an additional 150,000 words and includes 373 articles of Xinhua newswire (130,000 words), 55 articles of Hong Kong News (15,000 words), and two articles from Sinorama (6000 words). The average sentence length is 28.9 words. We are currently working on the third installment of the treebank, which will continue to diversify our data sources.

The task of annotating sentences in this treebank can be broken into three subtasks: word segmentation, part-of-speech tagging and syntactic bracketing. This process is illustrated in (1): (1a) is an example Chinese sentence before annotation; (1b), (1c) and (1d) illustrate the same sentence after segmentation, POS-tagging and syntactic bracketing, respectively. [6] The actual annotation is carried out in two stages: word segmentation and POS tagging are performed first, and phrase structures are added later.

2.2 Addressing Chinese linguistic issues

The development of a large-scale annotated Chinese corpus pushes to the forefront some fundamental issues in Chinese linguistics. In this section we outline a few of them, and discuss their implications for our annotation efforts. We focus on three issues: (i) the feasibility of the word segmentation task; (ii) the impoverished inflectional morphology of Chinese; and (iii) difficult constructions in Chinese syntax.

2.2.1 The feasibility of the word segmentation task

As demonstrated in Example 1, a Chinese sentence is a sequence of Chinese characters without natural delimiters between words. As a result, it has to be segmented into words before POS tags and phrase structures can be added. The feasibility of word segmentation as an annotation task for Chinese [7] has been a subject of considerable research interest.
Sproat, Gale, Shih and Chang (1996), for example, reported experimental results that show native speakers of Chinese have a very low degree of agreement among them as to what a word is. In their experiments, six native speakers were asked to mark all the places where they might pause if they were reading the text aloud. The inter-judge agreement reported is only 76%. However, the experiments were set up in the context of a text-to-speech synthesis system and thus the results may not speak directly to the feasibility of a more general word segmentation task.

To test how well native speakers agree on word segmentation of written texts, we randomly chose 100 sentences (5060 hanzi) from the Xinhua newswire and

[6] The English gloss of the Chinese examples throughout this paper is not part of the annotation. It is included for the convenience of non-Chinese speakers.
[7] Even for languages which use delimiters between words, such as English, the distinction between a word and a non-word is not always clear-cut. For example, pro- normally cannot stand alone, therefore, it is like a prefix. However, it can appear in a coordinated structure, such as pro- and anti-abortion, and under the assumption that only words and phrases can be coordinated, it is a word. As a reviewer pointed out, deciding word boundaries is also a difficult task for other languages, such as Portuguese (Santos, Costa and Rocha 2003).

asked the participants of the First Chinese Language Processing Workshop, which was held at the University of Pennsylvania in 1998, to segment them according to their personal preferences. [8] We got replies from eight groups, and all but one of them hand corrected their output before sending it. To measure the agreement between each pair of the groups that did hand correction, we use three measures that are widely used to measure parsing accuracy: precision, recall, and the number of crossing brackets (Black, Abney, Flickinger, Gdoniec et al. 1991). [9] Following Sproat et al. (1996), we calculate the arithmetic mean of the precision and the recall as one measure of agreement between each output pair, which produces an average agreement of 87.6%, much higher than the 76% reported in Sproat et al. (1996). Table 1 shows the results of comparing the output between each group pair. For each x/y/z in the table, x and y are precision and recall rates, respectively, and z is the total number of crossing brackets in the 100 sentences.

(a) Raw data: [Chinese characters not preserved in this transcription]
(b) Segmented:
    He also propose one series concrete measure and policy essential .
    (He also proposed a series of concrete measures and essentials on policy.)
(c) POS-tagged: /PN /AD /VV /CD /M /JJ /NN /CC /NN /NN /PU
(d) Bracketed:
    (IP (NP-SBJ (PN /he))
        (VP (ADVP (AD /also))
            (VP (VV /propose)
                (NP-OBJ (QP (CD /one)
                            (CLP (M /series)))
                        (NP (NP (ADJP (JJ /concrete))
                                (NP (NN /measure)))
                            (CC /and)
                            (NP (NN /policy)
                                (NN /essential))))))
        (PU ))

Example 1. A sample Chinese sentence.

Table 1. Comparison of hand-corrected word segmentation results from seven groups [table body not preserved in this transcription]

The fact that the average agreement in our experiment is 87.6% and the highest agreement among all the pairs is 91.5% confirms the belief that native speakers do have significant disagreement on where word boundaries should be. On the other hand, on average there are only 5.4 crossing brackets in the 100 sentences, and most of these crossing brackets turned out to be human errors. This suggests that much of the disagreement is not critical and if native speakers are given good segmentation guidelines, consistent word segmentation can be achieved. There are several possible explanations for the discrepancy between our results and those reported in Sproat et al. (1996). One is that the instructions given to the judges are different. In our experiment, the judges were asked to segment the sentences into words based on their own definitions, while in their experiment, the judges were asked to mark all places where they might possibly pause if they were reading the text aloud. There are places in Chinese, such as the place between a verb and an aspect marker that follows the verb, where native speakers normally do not pause but would add word boundaries if asked to segment the sentence. Pragmatic factors can also influence a decision to pause which would be independent of word segmentation. Another reason why the degree of agreement in our experiment was much higher is that in our experiment all the judges were well-trained computational linguists who are familiar with both the linguistic and computational issues of the word segmentation task. Some judges had their own segmentation guidelines and/or segmenters. They either followed their guidelines or used their segmenters to automatically segment the

[8] We did not give them any segmentation guidelines. Some participants applied their own guideline standards for which they had automatic segmenters while others simply used their intuitions.
[9] Given a candidate file and a Gold Standard file, the three metrics are defined as: precision is the number of correct constituents in the candidate file divided by the number of constituents in the candidate file; recall is the number of correct constituents in the candidate file divided by the number of constituents in the Gold Standard file; and the number of crossing brackets is the number of constituents in the candidate file that cross a constituent in a Gold Standard file. If we treat each word as a constituent, a segmented sentence is similar to a bracketed sentence and its depth is one. To compare two outputs, we chose one as the Gold Standard, and evaluated the other output against it. As noted in Sproat et al. (1996), for two outputs J1 and J2, taking J1 as the Gold Standard and computing the precision and recall for J2 yields the same results as taking J2 as the Gold Standard and computing the recall and the precision respectively for J1. However, the number of crossing brackets when J1 is the standard is not the same as when J2 is the standard. For example, if the string is ABCD and J1 segments it into AB CD and J2 marks it as A BC D, then the number of crossing brackets is 1 if J1 is the standard and the number is 2 if J2 is the standard.
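The three metrics defined in footnote 9, together with the arithmetic-mean agreement score, can be sketched in Python as follows. This is a minimal illustration and not the evaluation software used in the experiment: the function names (`spans`, `crosses`, `compare`, `agreement`) and the representation of each word as a character-offset span are assumptions introduced here, and both segmentations are assumed to be non-empty and to cover the same string.

```python
def spans(words):
    """Map a segmentation (list of words) to a set of (start, end) offsets."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result

def crosses(a, b):
    """True if spans a and b overlap while neither contains the other."""
    return (a[0] < b[0] < a[1] < b[1]) or (b[0] < a[0] < b[1] < a[1])

def compare(candidate, gold):
    """Return (precision, recall, crossing brackets) of candidate vs. gold."""
    cand, ref = spans(candidate), spans(gold)
    correct = len(cand & ref)  # words with identical boundaries in both outputs
    precision = correct / len(cand)
    recall = correct / len(ref)
    crossing = sum(1 for c in cand if any(crosses(c, g) for g in ref))
    return precision, recall, crossing

def agreement(candidate, gold):
    """Arithmetic mean of precision and recall."""
    precision, recall, _ = compare(candidate, gold)
    return (precision + recall) / 2

if __name__ == "__main__":
    j1 = ["AB", "CD"]      # the ABCD example from footnote 9
    j2 = ["A", "BC", "D"]
    print(compare(j2, j1))  # J1 as the standard: 1 crossing bracket
    print(compare(j1, j2))  # J2 as the standard: 2 crossing brackets
```

Swapping the two arguments exchanges precision and recall, so the agreement score is the same whichever output is treated as the standard, while the crossing-bracket count can differ, as the ABCD example shows.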

data and then hand corrected the output. As a result, their resulting segmentation is more consistent. Taken as a whole, the results show that word segmentation is feasible as an annotation task. Therefore, it is reasonable to assume that, given the same set of guidelines, the human agreement on segmentation would be well over 90%.

2.2.2 The impoverished morphological system

The second characteristic of Chinese that has far-reaching consequences on the annotation of Chinese text is the fact that Chinese has very little, if any, inflectional morphology. This general lack of morphological clues affects every aspect of Chinese text annotation: word segmentation, POS-tagging and syntactic bracketing. For instance, if there were abundant prefixes or suffixes in Chinese, they could be used to signal the beginning or the end of a word even in the absence of natural delimiters. Without these convenient means for word boundary detection, in the word segmentation guidelines we have to resort to phonological, syntactic, and semantic tests to decide on proper word boundaries.

The lack of inflectional morphology simplifies some aspects of the POS tagging task. For instance, lemmatization is usually not necessary in Chinese POS tagging. In general, however, this characteristic of the language makes the POS-tagging task harder. Determining the POS tag of a word becomes less straightforward because of the lack of morphological clues. For a word that is ambiguous between a noun and a verb, to determine its part-of-speech requires a careful analysis of its syntactic environment. (See section 3.2 for a detailed discussion of our methodologies for POS-tagging nouns and verbs.)

The lack of morphological clues also has implications for determining the subcategorization frames of verbs, which is crucial in deciding the syntactic structure of a clause.
For a language like English, morphological clues can be used to signal the subcategorization frame of a verb, thus the syntactic structure of a clause. For instance, based on the morphological clues, it is easy to distinguish a verb with a sentential complement such as say from an object control verb such as force: the structural distinction between “John said that he would come” and “John forced him to come” can easily be
