Korean Language Resources For Everyone


PACLIC 30 Proceedings

Jeen-Pyo Hong
NAVER LABS, NAVER Corporation, Republic of Korea
jeenpyo.hong@navercorp.com

Jungyeul Park
Department of Linguistics, University of Arizona, Tucson, AZ 85721
jungyeul@email.arizona.edu

Jeong-Won Cha
Department of Computer Engineering, Changwon National University, Republic of Korea
jcha@changwon.ac.kr

Abstract

This paper presents open language resources for Korean. It includes several language processing models and systems including morphological analysis, part-of-speech tagging, syntactic parsing for Korean, and standard evaluation Korean-English machine translation data with the Korean-English statistical machine translation baseline system. We make them publicly available to pave the way for further development regarding Korean language processing.

1 Introduction

This paper presents open language resources (LRs) for Korean. We provide necessary data, models, tools, and systems to analyze Korean sentences. It includes the whole working pipeline from part-of-speech (POS) tagging to syntactic parsing for Korean. We also provide the Korean-English statistical machine translation (SMT) baseline system and newly created standard data for MT evaluation. All LRs described in this paper will be publicly available under the MIT License (MIT).

2 Korean Language

Korean is an agglutinative language in which “words typically contain a linear sequence of MORPHS” (Crystal, 2008). Words in Korean (eojeols), therefore, can be formed by joining content and functional morphemes to indicate such meaning. These eojeols can be interpreted as the basic segmentation unit and they are separated by a blank space in the Korean sentence. Let us consider the sentence in (1). For example, unggaro is a content morpheme (a proper noun) and a postposition -ga (a nominative case marker) is a functional morpheme. They form together a single word unggaro-ga (‘Ungaro NOM’).
For convenience, we add - at the beginning of functional morphemes, such as -ga for NOM, to distinguish between content and functional morphemes. The nominative case marker is realized as -ga or -i depending on whether the preceding letter is a vowel or a consonant. The predicate naseo-eoss-da also consists of the content morpheme naseo (‘become’) and its functional morphemes (-eoss ‘PAST’ and -da ‘DECL’).

(1) a. 프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
    b. peurangseu-ui segyejeok-in uisang dijaineo emmanuel unggaro-ga silnae jangsikyong jikmul dijaineo-ro naseo-eoss-da.
       France-GEN world class-REL fashion designer Emanuel Ungaro-NOM interior decoration textile designer-AJT become-PAST-DECL.
       ‘The world class French fashion designer Emanuel Ungaro became an interior textile designer.’
Figure 1: Example of the Korean sentence

3 Morphological analysis and POS tagging

Numerous studies pertaining to morphological analysis and POS tagging for Korean have been conducted over the past decades (Cha et al., 1998; Lee and Rim, 2004; Kang et al., 2007; Lee, 2011). Most morphological analysis and POS tagging for Korean has been conducted on the basis of the eojeol. In Korean POS tagging systems, morphological analysis is generally followed by a POS tagging step. That is, all possible sequences of morphological segmentation for a given word are generated during the morphological analysis, and the possible (or best) correct sequences are then selected during POS tagging.

ESPRESSO, a Korean POS tagger described in Hong (2009), is publicly available.[1] It improves the accuracy of POS tagging by using POS patterns of words, obtaining up to 95.85% accuracy for Korean. Figure 2 shows the input and output formats of ESPRESSO for Korean POS tagging. Even though ESPRESSO can yield several output formats, we only show the Sejong corpus-like format in this paper, which we use as the input for syntactic analysis. While ESPRESSO indicates BOS and EOS (the beginning and the end of a sentence, respectively), the actual Sejong corpus does not contain BOS and EOS labels. The original Sejong morphologically analyzed corpus annotates the sentence boundary using the markup language.

[1] Note that there is another resource with the same name (Pantel and Pennacchiotti, 2006).

30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30), Seoul, Republic of Korea, October 28-30, 2016

Input:
프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
Output:
BOS
프랑스/NNP 의/JKG
세계/NNG 적/XSN 이/VCP [...]
[...]
웅가로/NNP [...]
[...]
디자이너/NNG 로/JKB
나서/VV 었/EP 다/EF ./SF
EOS
Figure 2: Input and output examples of ESPRESSO for Korean POS tagging

We use Sejong POS tags, the most widely used POS tag set for Korean. Figure 3 shows the summary of the Sejong POS tag set and its mapping to the Universal POS tag (Petrov et al., 2012). We convert XR (non-autonomous lexical root) into NOUN because such roots are mostly considered as a noun or a part of a noun (e.g. minju/XR (‘democracy’)). The current Universal POS tag mapping for Sejong POS tags is based on a handful of POS patterns of Korean words. However, combinations of words in Korean are very productive and grow exponentially. Therefore, the number of POS patterns of the word does not converge as the number of words increases. For example, the Sejong Treebank contains about 450K words and almost 5K POS patterns. We also test with the Sejong morphologically analyzed corpus, which contains over 10M words.
The number of POS patterns still does not converge: it increases to over 50K. The wide range of POS patterns is mainly due to the fine-grained morphological analysis results, which show all possible segmentations divided into lexical and functional morphemes. These various POS patterns carry useful morpho-syntactic information for Korean. For example, Oh et al. (2011) predicted function labels (phrase-level tags) using POS patterns in order to improve dependency parsing results.
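The notion of a word-level POS pattern discussed in this section can be illustrated with a minimal sketch. It assumes each word (eojeol) is already available as a list of `morpheme/TAG` strings; that in-memory form, and the function names `pos_pattern` and `count_patterns`, are hypothetical and are not part of ESPRESSO. The morphological analysis of the example sentence follows Figure 4.

```python
from collections import Counter
from typing import Iterable, List


def pos_pattern(word: List[str]) -> str:
    """Return the POS pattern of one word (eojeol), e.g. 'NNP+JKG'.

    `word` is a list of 'morpheme/TAG' strings (an assumed representation).
    """
    tags = []
    for item in word:
        # Split on the last '/' so that a morpheme that is itself '/' survives.
        _morpheme, tag = item.rsplit("/", 1)
        tags.append(tag)
    return "+".join(tags)


def count_patterns(words: Iterable[List[str]]) -> Counter:
    """Count how often each POS pattern occurs over a collection of words."""
    return Counter(pos_pattern(word) for word in words)


# The example sentence (1), one inner list per eojeol.
sentence = [
    ["프랑스/NNP", "의/JKG"],
    ["세계/NNG", "적/XSN", "이/VCP", "ㄴ/ETM"],
    ["의상/NNG"],
    ["디자이너/NNG"],
    ["엠마누엘/NNP"],
    ["웅가로/NNP", "가/JKS"],
    ["실내/NNG"],
    ["장식/NNG", "용/XSN"],
    ["직물/NNG"],
    ["디자이너/NNG", "로/JKB"],
    ["나서/VV", "었/EP", "다/EF", "./SF"],
]
patterns = count_patterns(sentence)
```

Running the same counter over a growing corpus and plotting `len(patterns)` against the number of words is one way to observe the non-convergence described above.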

Sejong POS                                    Description                    Universal POS
NNG, NNP, NNB, NR, XR                         Noun related                   NOUN
NP                                            Pronoun                        PRON
MAG                                           Adverb                         ADV
MAJ                                           Conjunctive adverb             CONJ
MM                                            Determiner                     DET
VV, VX, VCN, VCP                              Verb related                   VERB
VA                                            Adjective                      ADJ
EP, EF, EC, ETN, ETM                          Verbal endings                 PRT
JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC     Postpositions (case markers)   ADP
XPN, XSN, XSA, XSV                            Suffixes                       PRT
SF, SP, SE, SO, SS                            Punctuation marks              PUNC (.)
SW                                            Special characters             X
SH, SL                                        Foreign characters             X
SN                                            Number                         NUM
NA, NF, NV                                    Unknown words                  X
Figure 3: POS tags in the Sejong corpus and their 1-to-1 mapping to Universal POS tags

4 Syntactic analysis

Statistical parsing trained on an annotated data set has become widespread. However, although several manually annotated Korean Treebank corpora exist, such as the Sejong Treebank (SJTree), only a few works on statistical Korean parsing have been conducted.

4.1 Phrase structure parsing

For previous work on constituent parsing, Sarkar and Han (2002) used an early version of the Korean Penn Treebank (KTB) to train lexicalized Tree Adjoining Grammars (TAG). Chung et al. (2010) used context-free grammars and tree-substitution grammars trained on data from the KTB. Choi et al. (2012) proposed a method to transform the word-based SJTree into an entity-based Treebank to improve the parsing accuracy. There exist several phrase structure parsers, such as the Stanford (Klein and Manning, 2003), Bikel (Bikel, 2004), and Berkeley (Petrov and Klein, 2007) parsers (either lexicalized or unlexicalized), that we can train with the Treebank.

For phrase structure parsing, we provide a parsing model for the Berkeley parser.[2] Choi et al. (2012) tested the Stanford, Bikel, and Berkeley parsers, and the Berkeley parser shows the best results for phrase structure parsing for Korean. The input sentence of phrase structure parsers is generally the tokenized sentence. It can be obtained by performing the segmentation task for each word. Each segmented morpheme becomes a leaf node in the phrase structure. Therefore, we use the tokenization scheme based on POS tagging.
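The tokenization scheme based on POS tagging can be sketched as follows. This is only an illustration under assumptions: words are represented as lists of `morpheme/TAG` strings, and the helper names `split_morphemes`, `to_tokens`, and `to_tagged_tokens` are hypothetical. It is not the implementation of the preprocessing tools distributed with the resources, and the exact file format a parser expects for POS-annotated input is not specified here.

```python
from typing import List, Tuple


def split_morphemes(word: List[str]) -> List[Tuple[str, str]]:
    """Turn ['프랑스/NNP', '의/JKG'] into [('프랑스', 'NNP'), ('의', 'JKG')]."""
    pairs = []
    for item in word:
        morpheme, tag = item.rsplit("/", 1)  # last '/' separates the tag
        pairs.append((morpheme, tag))
    return pairs


def to_tokens(words: List[List[str]]) -> str:
    """Tokenized sentence without POS: one space-separated morpheme per leaf."""
    return " ".join(m for word in words for m, _tag in split_morphemes(word))


def to_tagged_tokens(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Tokenized sentence with POS, kept as (morpheme, tag) pairs."""
    return [pair for word in words for pair in split_morphemes(word)]


example = [
    ["프랑스/NNP", "의/JKG"],
    ["나서/VV", "었/EP", "다/EF", "./SF"],
]
plain = to_tokens(example)
tagged = to_tagged_tokens(example)
```

Each morpheme in `plain` corresponds to one leaf node of the phrase structure tree, which is why the word-internal segmentation has to be fixed before parsing.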
Figure 4 shows the input and output formats for the Berkeley parser. As preprocessing tools, we provide MakeBerkeleyTestIn and MakeBerkeleyTestWithPOSIn. They convert ESPRESSO’s output into the Berkeley parser’s input by tokenizing the Korean sentence without or with POS information, respectively.

Input:
프랑스 의 세계 적 이 ㄴ 의상 디자이너 엠마누엘 웅가로 가 실내 장식 용 직물 디자이너 로 나서 었 다 .
Output:
(S (NP-SBJ (NP (NP-MOD (NNP 프랑스) (JKG 의))
               (NP (VNP-MOD (NNG 세계) (XSN 적) (VCP 이) (ETM ㄴ))
                   (NP (NP (NNG 의상))
                       (NP (NNG 디자이너)))))
           (NP-SBJ (NP (NNP 엠마누엘))
                   (NP-SBJ (NNP 웅가로) (JKS 가))))
   (VP (NP-AJT (NP (NP (NP (NNG 실내))
                           (NP (NNG 장식) (XSN 용)))
                       (NP (NNG 직물)))
               (NP-AJT (NNG 디자이너) (JKB 로)))
       (VP (VV 나서) (EP 었) (EF 다) (SF .))))
Figure 4: Input and output examples for Korean phrase structure parsing

4.2 Dependency parsing

For previous work on dependency parsing for Korean, Chung (2004) presented a model for dependency parsing using surface contextual information. Oh and Cha (2010), Choi and Palmer (2011) and Park et al. (2013) independently developed parsing models from the Korean dependency Treebank. They automatically converted the phrase-structured Sejong Treebank into a dependency Treebank. Park et al. (2013) summarize the conversion into dependency grammars as follows.

We first assign an anchor to nonterminal nodes using bottom-up breadth-first search. An anchor is the lexical terminal node that a nonterminal node takes as its head node. We use the lexical anchor rules described in Park (2006) for the SJTree. Lexical anchor rules distinguish dependency relations. At this stage we assign only the lexical anchor of each nonterminal node; dependencies are found in the next step. Lexical anchor rules give priority to the rightmost child node, which mostly inherits the same phrase tag. Exceptionally, in the case of “VP and VP” (or “S and S”), the leftmost child node is assigned as the anchor. Then, we can find dependency relations between terminal nodes using the anchor information as follows:

1. The head is the anchor of the parent of the parent node of the current node.
2. If the anchor is the current node and
   (a) if the parent of the parent node does not have another right sibling, the head is itself.
   (b) if the parent of the parent node has another right sibling, the head is the anchor of the right sibling.

The results of the conversion can be used to train existing dependency parsers. Figure 5 presents an example of the original Sejong Treebank (above) and its automatically-converted dependency representation.[3] The addresses of terminal nodes (underneath) and the anchors of nonterminal nodes (on their right) are arbitrarily assigned for the dependency conversion algorithm using lexical head rules. The head of the terminal node 1 is the node 4, which is the anchor of the parent of the parent node (NP:4).
The head of the terminal node 4 is the node 6, where the anchor of its ancestor node changes from itself (NP-SBJ:6). The head of the terminal node 11 is itself, because the anchor of the root node and the node itself are the same (S:11).

The parsing model of MaltParser (Nivre et al., 2006) is provided for dependency parsing for Korean.[4] As a preprocessing tool, we provide MakeMaltTestIn. It converts ESPRESSO’s output into MaltParser’s input by generating the features MaltParser requires. Figure 6 shows an example of the input and the output of MaltParser. We use the data format of CoNLL-X dependency parsing, described in Figure 7 (partially presented). See http://ilk.uvt.nl/conll for other information about the CoNLL-X data format that MaltParser requires. From word and POS information, we generate the features that MaltParser requires for Korean dependency parsing.

[3] The figure originally appeared in Park et al. (2013) with minor errors, and we corrected them.
[4] http://www.maltparser.org
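The anchor-based conversion described in Section 4.2 can be approximated with a small sketch. It is a simplified head-percolation reading of the description, not the procedure of Park et al. (2013): only the rightmost-child anchor rule is implemented (the “VP and VP” exception and the rules of Park (2006) are omitted), anchors are assigned by a recursive bottom-up pass, and the head of a terminal is taken to be the anchor of its nearest ancestor whose anchor differs from the terminal itself. The `Node` class, the function names, and the four-word toy tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    index: Optional[int] = None   # address of a terminal node; None otherwise
    anchor: Optional[int] = None  # filled in by assign_anchors


def assign_anchors(node: Node) -> Optional[int]:
    """Bottom-up pass: a terminal anchors itself; a nonterminal takes the
    anchor of its rightmost child (coordination exception omitted)."""
    if not node.children:
        node.anchor = node.index
    else:
        child_anchors = [assign_anchors(child) for child in node.children]
        node.anchor = child_anchors[-1]
    return node.anchor


def find_heads(root: Node) -> Dict[int, int]:
    """Map each terminal address to the address of its head."""
    heads: Dict[int, int] = {}

    def visit(node: Node, ancestors: List[Node]) -> None:
        if node.children:
            for child in node.children:
                visit(child, ancestors + [node])
            return
        head = node.index  # the terminal that anchors the root heads itself
        for ancestor in reversed(ancestors):  # nearest ancestor first
            if ancestor.anchor != node.index:
                head = ancestor.anchor
                break
        heads[node.index] = head

    visit(root, [])
    return heads


# Toy tree with four terminals (not the tree of Figure 5).
tree = Node("S", [
    Node("NP-SBJ", [Node("NP", index=1), Node("NP-SBJ", index=2)]),
    Node("VP", [Node("NP-AJT", index=3), Node("VP", index=4)]),
])
assign_anchors(tree)
heads = find_heads(tree)
```

In this toy tree, terminal 2 anchors its own parent, so its head is found one level higher, mirroring the way the head of terminal node 4 is found at NP-SBJ:6 in the text, and the last terminal is its own head in the same way as terminal node 11.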

Figure 5: Example of the original Sejong Treebank (above) and its automatically-converted dependency representation (below)

4.3 Discussion on parsing for Korean

In previous work on parsing for Korean, whether phrase structure or dependency parsing, Park et al. (2013) proposed the 80-10-10 corpus split for training, development and evaluation, while others often used cross validation (Oh and Cha, 2010; Choi et al., 2012; Oh and Cha, 2013).

For phrase structure parsing, Choi et al. (2012) obtained up to 78.74% F1 score. For dependency parsing, Oh and Cha (2013) obtained 87.03% (10-fold cross validation) and Park et al. (2013) up to 86.43% (corpus split) by using external case frame information.

Currently, we distribute only parsing models, rather than the parsers and the training data themselves, for the following reasons. First, the Sejong Treebank that we use to train and evaluate is not allowed to be distributed by third parties. Corpus users should ask the National Institute of the Korean Language[5] directly for their own usage. Therefore, it is easier for us to make only the current parsing models publicly available instead of the actual training data. Second, multilingualism is becoming more and more important. Many natural language processing (NLP)-related works rely on a single system to deal with multiple languages homogeneously. The Berkeley parser and MaltParser, for which we provide parsing models, have been developed for many other languages, and users can easily obtain their up-to-date parsing systems and models for several other languages.

[5] http://www.korean.go.kr

We provide parsing models trained only on the training data, which can serve as the baseline parsing system for Korean to be compared against in future work.
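The two evaluation settings contrasted in this section, an 80-10-10 corpus split and k-fold cross validation, can be sketched as below. How sentences are assigned to each part in the cited work is not stated here, so the contiguous split and the round-robin fold assignment are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_80_10_10(items: Sequence) -> Tuple[list, list, list]:
    """Contiguous split into training (80%), development (10%), evaluation (10%)."""
    n = len(items)
    train_end = n * 8 // 10
    dev_end = n * 9 // 10
    return (
        list(items[:train_end]),
        list(items[train_end:dev_end]),
        list(items[dev_end:]),
    )


def k_fold(items: Sequence, k: int) -> List[Tuple[list, list]]:
    """Return k (train, test) pairs; item j is tested in fold j % k."""
    folds = []
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        folds.append((train, test))
    return folds


sentences = list(range(100))  # stand-ins for treebank sentences
train, dev, evaluation = split_80_10_10(sentences)
folds = k_fold(sentences, 10)
```

With a corpus split, a single model is trained and reported once; with cross validation, one model is trained per fold and the scores are averaged, so results from the two settings are not directly comparable.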
Table 1 presents the current baseline parsing results using phrase structure grammars by the Berkeley parser. We performed 5-fold and 10-fold cross-validation as well as corpus split evaluation for comparison purposes. We also tested both cases in whic
