Korean Language Resources For Everyone

Jeen-Pyo Hong, NAVER LABS, NAVER Corporation, Republic of Korea (jeenpyo.hong@navercorp.com)
Jungyeul Park, Department of Linguistics, University of Arizona, Tucson, AZ 85721 (jungyeul@email.arizona.edu)
Jeong-Won Cha, Department of Computer Engineering, Changwon National University, Republic of Korea (jcha@changwon.ac.kr)

Abstract

This paper presents open language resources for Korean. It includes several language processing models and systems, covering morphological analysis, part-of-speech tagging, and syntactic parsing for Korean, as well as standard evaluation data for Korean-English machine translation together with a Korean-English statistical machine translation baseline system. We make them publicly available to pave the way for further development in Korean language processing.

1 Introduction

This paper presents open language resources (LRs) for Korean. We provide the data, models, tools, and systems necessary to analyze Korean sentences. They cover the whole working pipeline from part-of-speech (POS) tagging to syntactic parsing for Korean. We also provide a Korean-English statistical machine translation (SMT) baseline system and newly created standard data for MT evaluation. All LRs described in this paper will be publicly available under the MIT License (MIT).

2 Korean Language

Korean is an agglutinative language in which "words typically contain a linear sequence of MORPHS" (Crystal, 2008). Words in Korean (eojeols) can therefore be formed by joining content and functional morphemes. These eojeols can be interpreted as the basic segmentation unit, and they are separated by blank spaces in the Korean sentence. Let us consider the sentence in (1). For example, unggaro is a content morpheme (a proper noun) and the postposition -ga (a nominative case marker) is a functional morpheme. Together they form a single word unggaro-ga ('Ungaro-NOM'). For convenience's sake, we add - at the beginning of functional morphemes, such as -ga for NOM, to distinguish between content and functional morphemes. The nominative case marker -ga or -i varies depending on whether the preceding letter is a vowel or a consonant. A predicate such as naseo-eoss-da likewise consists of the content morpheme naseo ('become') and its functional morphemes (-eoss 'PAST' and -da 'DECL').

(1) a. 프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
    b. peurangseu-ui segyejeok-in uisang dijaineo emmanuel unggaro-ga silnae jangsikyong jikmul dijaineo-ro naseo-eoss-da.
       France-GEN world class-REL fashion designer Emanuel Ungaro-NOM interior decoration textile designer-AJT become-PAST-DECL
       'The world class French fashion designer Emanuel Ungaro became an interior textile designer.'

Figure 1: Example of the Korean sentence

3 Morphological analysis and POS tagging

Numerous studies pertaining to morphological analysis and POS tagging for Korean have been conducted over the past decades (Cha et al., 1998; Lee and Rim, 2004; Kang et al., 2007; Lee, 2011). Most morphological analysis and POS tagging for Korean is conducted at the eojeol level. In Korean POS taggers, morphological analysis is generally followed by a POS tagging step. That is, all possible sequences of morphological segmentation for a given word are generated during morphological analysis, and the possible (or best) correct sequences are then selected during POS tagging.

ESPRESSO, a Korean POS tagger described in Hong (2009), is publicly available. [1] It greatly improves POS tagging accuracy by using POS patterns of words, obtaining up to 95.85% accuracy for Korean.

[1] Note that there is another resource with the same name (Pantel and Pennacchiotti, 2006).
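To make the two-step pipeline concrete, the following is a minimal sketch of the generate-then-select idea in Python, assuming a toy lexicon and invented scores; it is not ESPRESSO's actual algorithm or data, only an illustration of candidate generation followed by best-sequence selection.

# Hypothetical sketch of the two-step pipeline described above:
# (1) generate candidate morpheme segmentations for an eojeol, then
# (2) select the best-scoring candidate.  The lexicon and scores below
# are invented for illustration; this is not ESPRESSO's implementation.

TOY_LEXICON = {
    # surface eojeol -> list of candidate analyses, each a list of
    # (morpheme, Sejong POS tag, toy log score)
    "웅가로가": [
        [("웅가로", "NNP", -1.0), ("가", "JKS", -0.5)],   # Ungaro + NOM marker
        [("웅가로가", "NNG", -4.0)],                       # implausible single noun
    ],
}

def analyze(eojeol):
    """Step 1: return all candidate segmentations for one eojeol."""
    return TOY_LEXICON.get(eojeol, [[(eojeol, "NA", -10.0)]])  # NA = unknown word

def tag(eojeol):
    """Step 2: pick the best candidate by summing per-morpheme scores."""
    candidates = analyze(eojeol)
    best = max(candidates, key=lambda cand: sum(score for _, _, score in cand))
    return [(morph, pos) for morph, pos, _ in best]

print(tag("웅가로가"))   # [('웅가로', 'NNP'), ('가', 'JKS')]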

Input:
프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.

Output:
BOS
프랑스/NNP 의/JKG
세계/NNG 적/XSN 이/VCP ㄴ/ETM
의상/NNG
디자이너/NNG
엠마누엘/NNP
웅가로/NNP 가/JKS
실내/NNG
장식/NNG 용/XSN
직물/NNG
디자이너/NNG 로/JKB
나서/VV 었/EP 다/EF ./SF
EOS

Figure 2: Input and output examples of ESPRESSO for Korean POS tagging

Figure 2 shows the input and output formats of ESPRESSO for Korean POS tagging. Even though ESPRESSO can yield several output formats, we only show the Sejong corpus-like format in this paper, which is the format we use as input for syntactic analysis. While ESPRESSO indicates BOS and EOS (the beginning and the end of a sentence, respectively), the actual Sejong corpus does not contain BOS and EOS labels; the original Sejong morphologically analyzed corpus annotates sentence boundaries using its markup language.

We use the Sejong POS tags, the most widely used POS tag set for Korean. Figure 3 summarizes the Sejong POS tag set and its mapping to the Universal POS tags (Petrov et al., 2012). We convert XR (non-autonomous lexical root) into NOUN because such roots are mostly considered a noun or part of a noun (e.g. minju/XR ('democracy')). The current Universal POS tag mapping for Sejong POS tags is based on a handful of POS patterns of Korean words. However, combinations within Korean words are very productive, and the number of possible forms grows exponentially. Therefore, the number of POS patterns of words does not converge as the number of words increases. For example, the Sejong Treebank contains about 450K words and almost 5K POS patterns. We also tested the Sejong morphologically analyzed corpus, which contains over 10M words; the number of POS patterns does not converge and increases to over 50K. The wide range of POS patterns is mainly due to the fine-grained morphological analysis results, which show all possible segmentations divided into lexical and functional morphemes. These various POS patterns carry useful morpho-syntactic information for Korean. For example, Oh et al. (2011) predicted function labels (phrase-level tags) using POS patterns, which improved dependency parsing results.
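The notion of a POS pattern used above can be illustrated with a short sketch: the pattern of a word is read here as the concatenation of the Sejong tags of its morphemes, and counting distinct patterns over a corpus mirrors the convergence test described above. The "morpheme/TAG + morpheme/TAG" input format assumed below is for illustration only, not the exact corpus file layout.

from collections import Counter

def pos_pattern(analysis):
    """'프랑스/NNP + 의/JKG' -> 'NNP+JKG' (tags only, in order)."""
    return "+".join(m.rsplit("/", 1)[1] for m in analysis.split(" + "))

# Tiny assumed corpus: (eojeol, morphological analysis) pairs.
corpus = [
    ("프랑스의", "프랑스/NNP + 의/JKG"),
    ("웅가로가", "웅가로/NNP + 가/JKS"),
    ("나섰다.", "나서/VV + 었/EP + 다/EF + ./SF"),
]

patterns = Counter(pos_pattern(a) for _, a in corpus)
print(patterns)       # three distinct patterns, each seen once
print(len(patterns))  # 3: the quantity whose growth is discussed above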

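The Sejong-to-Universal mapping summarized in Figure 3 below can be stored as a simple lookup table. The sketch here transcribes the figure (with punctuation tags mapped to the Universal "." tag, written as PUNC (.) in the figure); the fallback for tags not listed, defaulting to X, is our assumption rather than something the paper specifies.

# Lookup table transcribed from Figure 3 below.
SEJONG_TO_UNIVERSAL = {
    **dict.fromkeys(["NNG", "NNP", "NNB", "NR", "XR"], "NOUN"),
    "NP": "PRON",
    "MAG": "ADV",
    "MAJ": "CONJ",
    "MM": "DET",
    **dict.fromkeys(["VV", "VX", "VCN", "VCP"], "VERB"),
    "VA": "ADJ",
    **dict.fromkeys(["EP", "EF", "EC", "ETN", "ETM"], "PRT"),
    **dict.fromkeys(["JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ", "JX", "JC"], "ADP"),
    **dict.fromkeys(["XPN", "XSN", "XSA", "XSV"], "PRT"),
    **dict.fromkeys(["SF", "SP", "SE", "SO", "SS"], "."),   # PUNC (.) in Figure 3
    "SW": "X",
    **dict.fromkeys(["SH", "SL"], "X"),
    "SN": "NUM",
    **dict.fromkeys(["NA", "NF", "NV"], "X"),
}

def to_universal(sejong_tag):
    # Defaulting unknown tags to X is an assumption, not part of Figure 3.
    return SEJONG_TO_UNIVERSAL.get(sejong_tag, "X")

print(to_universal("JKS"))   # ADP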
Sejong POS                                   Description                    Universal POS
NNG, NNP, NNB, NR, XR                        Noun related                   NOUN
NP                                           Pronoun                        PRON
MAG                                          Adverb                         ADV
MAJ                                          Conjunctive adverb             CONJ
MM                                           Determiner                     DET
VV, VX, VCN, VCP                             Verb related                   VERB
VA                                           Adjective                      ADJ
EP, EF, EC, ETN, ETM                         Verbal endings                 PRT
JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC    Postpositions (case markers)   ADP
XPN, XSN, XSA, XSV                           Suffixes                       PRT
SF, SP, SE, SO, SS                           Punctuation marks              PUNC (.)
SW                                           Special characters             X
SH, SL                                       Foreign characters             X
SN                                           Number                         NUM
NA, NF, NV                                   Unknown words                  X

Figure 3: POS tags in the Sejong corpus and their mapping to Universal POS tags

4 Syntactic analysis

Statistical parsing trained on annotated data sets has become widespread. However, while there are several manually annotated Korean Treebank corpora, such as the Sejong Treebank (SJTree), only a few works on statistical Korean parsing have been conducted.

4.1 Phrase structure parsing

As previous work on constituent parsing, Sarkar and Han (2002) used an early version of the Korean Penn Treebank (KTB) to train lexicalized Tree Adjoining Grammars (TAG). Chung et al. (2010) used context-free grammars and tree-substitution grammars trained on data from the KTB. Choi et al. (2012) proposed a method to transform the word-based SJTree into an entity-based Treebank to improve parsing accuracy. There exist several phrase structure parsers, such as the Stanford (Klein and Manning, 2003), Bikel (Bikel, 2004), and Berkeley (Petrov and Klein, 2007) parsers (either lexicalized or unlexicalized), that can be trained on the Treebank.

For phrase structure parsing, we provide a parsing model for the Berkeley parser. [2] Choi et al. (2012) tested the Stanford, Bikel, and Berkeley parsers, and the Berkeley parser showed the best results for phrase structure parsing for Korean. The input to phrase structure parsers is generally a tokenized sentence, which can be obtained by performing the segmentation task on each word. Each segmented morpheme becomes a leaf node in the phrase structure. Therefore, we use a tokenization scheme based on POS tagging. Figure 4 shows the input and output formats for the Berkeley parser. As preprocessing tools, we provide MakeBerkeleyTestIn and MakeBerkeleyTestWithPOSIn. They convert ESPRESSO's output into the Berkeley parser's input by tokenizing the Korean sentence without or with POS information, respectively.

Input:
프랑스 의 세계 적 이 ㄴ 의상 디자이너 엠마누엘 웅가로 가 실내 장식 용 직물 디자이너 로 나서 었 다 .

Output:
(S (NP-SBJ (NP (NP-MOD (NNP 프랑스) (JKG 의))
               (NP (VNP-MOD (NNG 세계) (XSN 적) (VCP 이) (ETM ㄴ))
                   (NP (NP (NNG 의상))
                       (NP (NNG 디자이너)))))
           (NP-SBJ (NP (NNP 엠마누엘))
                   (NP-SBJ (NNP 웅가로) (JKS 가))))
   (VP (NP-AJT (NP (NP (NP (NNG 실내))
                       (NP (NNG 장식) (XSN 용)))
                   (NP (NNG 직물)))
               (NP-AJT (NNG 디자이너) (JKB 로)))
       (VP (VV 나서) (EP 었) (EF 다) (SF .))))

Figure 4: Input and output examples for Korean phrase structure parsing
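As a rough illustration of the morpheme-level tokenization described above, the following sketch flattens tagged eojeols into a single token line. It is not the actual MakeBerkeleyTestIn or MakeBerkeleyTestWithPOSIn tool; the per-eojeol (morpheme, tag) input structure and the "morpheme/TAG" form used when POS information is kept are assumptions for illustration.

# Sketch of the tokenization idea: every morpheme becomes one leaf token.
tagged = [
    [("프랑스", "NNP"), ("의", "JKG")],
    [("웅가로", "NNP"), ("가", "JKS")],
    [("나서", "VV"), ("었", "EP"), ("다", "EF"), (".", "SF")],
]

def berkeley_input(eojeols, with_pos=False):
    """Flatten tagged eojeols into one space-separated token line."""
    tokens = []
    for eojeol in eojeols:
        for morph, tag in eojeol:
            tokens.append(f"{morph}/{tag}" if with_pos else morph)
    return " ".join(tokens)

print(berkeley_input(tagged))         # 프랑스 의 웅가로 가 나서 었 다 .
print(berkeley_input(tagged, True))   # 프랑스/NNP 의/JKG ... ./SF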

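For readers who want to post-process the bracketed output shown in Figure 4 above, a generic reader for such Penn-style bracketed trees might look like the following. This is a from-scratch sketch, not part of the distributed resources, and it does no error handling beyond assuming balanced brackets.

import re

def read_tree(text):
    """Parse '(S (NP ...))' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:                       # leaf word under a preterminal
                children.append(tokens[pos])
                pos += 1
        pos += 1                        # consume the closing ")"
        return (label, children)

    return parse()

tree = read_tree("(VP (VV 나서) (EP 었) (EF 다) (SF .))")
print(tree)   # ('VP', [('VV', ['나서']), ('EP', ['었']), ('EF', ['다']), ('SF', ['.'])])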
4.2 Dependency parsing

As previous work on dependency parsing for Korean, Chung (2004) presented a model for dependency parsing using surface contextual information. Oh and Cha (2010), Choi and Palmer (2011) and Park et al. (2013) independently developed parsing models from the Korean dependency Treebank, which they obtained by automatically converting the phrase-structured Sejong Treebank into a dependency Treebank. Park et al. (2013) summarized the conversion into dependency grammars as follows.

We first assign an anchor to each nonterminal node using a bottom-up breadth-first search. An anchor is the lexical terminal node that a nonterminal node takes as its head. We use the lexical anchor rules described in Park (2006) for the SJTree. Lexical anchor rules distinguish dependency relations. We assign only the lexical anchor to nonterminal nodes and find dependencies in the next step. Lexical anchor rules give priority to the rightmost child node, which mostly inherits the same phrase tag. Exceptionally, in the case of "VP and VP" (or "S and S"), the leftmost child node is assigned as the anchor. Then, we can find dependency relations between terminal nodes using the anchor information as follows:

1. The head is the anchor of the parent of the parent node of the current node.
2. If the anchor is the current node and
   (a) the parent of the parent node does not have another right sibling, the head is the node itself;
   (b) the parent of the parent node has another right sibling, the head is the anchor of that right sibling.

Results from the conversion can be used to train existing dependency parsers. Figure 5 presents an example of the original Sejong Treebank (above) and its automatically-converted dependency representation (below). [3] The addresses of the terminal nodes (shown underneath) and the anchor of each nonterminal node (shown on its right) are assigned for the dependency conversion algorithm using lexical head rules. The head of terminal node 1 is node 4, which is the anchor of the parent of its parent node (NP:4). The head of terminal node 4 is node 6, where the anchor of its ancestor node changes from itself (NP-SBJ:6). The head of terminal node 11 is itself, since the anchor of the root node and the node itself are the same (S:11).

Figure 5: Example of the original Sejong Treebank (above) and its automatically-converted dependency representation (below)

The parsing model of MaltParser (Nivre et al., 2006) is provided for dependency parsing for Korean. [4] As a preprocessing tool, we provide MakeMaltTestIn. It converts ESPRESSO's output into MaltParser's input by generating the features required by MaltParser. Figure 6 shows an example of the input and output of MaltParser. We use the data format of CoNLL-X dependency parsing, described in Figure 7 (partially presented). See http://ilk.uvt.nl/conll for further information about the CoNLL-X data format that MaltParser requires. From word and POS information, we generate the features that MaltParser requires for Korean dependency parsing.

[3] The figure originally appeared in Park et al. (2013) with minor errors, and we corrected them.
[4] http://www.maltparser.org
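The anchor-assignment step described above can be sketched as a small recursive function. Only the two rules quoted in the text are implemented (rightmost child by default; leftmost child for "VP and VP" or "S and S"); the nested-tuple tree encoding and the toy tree are invented for illustration, so this is not the full conversion procedure of Park et al. (2013) or the head rules of Park (2006).

# Trees: nonterminals are (label, [children]); terminals are (address, eojeol).
def is_terminal(node):
    return isinstance(node[0], int)

def anchor(node):
    """Return the address of the terminal chosen as this node's anchor."""
    if is_terminal(node):
        return node[0]
    label, children = node
    labels = [c[0] for c in children]
    if labels == ["VP", "VP"] or labels == ["S", "S"]:
        head_child = children[0]        # exception: leftmost child
    else:
        head_child = children[-1]       # default: rightmost child
    return anchor(head_child)

toy_tree = ("S", [
    ("NP-SBJ", [(1, "웅가로가")]),
    ("VP", [
        ("NP-AJT", [(2, "디자이너로")]),
        ("VP", [(3, "나섰다.")]),
    ]),
])
print(anchor(toy_tree))   # 3: the final predicate anchors the sentence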

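A hedged sketch of writing MaltParser input in the 10-column CoNLL-X format follows. The column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) comes from the CoNLL-X specification, but which word- and POS-derived values the distributed models actually place in each column (Figure 7) is not reproduced here, so the column contents and dependency labels below are assumptions for illustration.

rows = [
    # (id, eojeol form, POS pattern, head id, dependency label) - toy values
    (1, "웅가로가", "NNP+JKS", 2, "NP_SBJ"),
    (2, "나섰다.", "VV+EP+EF+SF", 0, "ROOT"),
]

def to_conllx(rows):
    """Render one sentence as tab-separated CoNLL-X lines ('_' = unused)."""
    lines = []
    for idx, form, pos, head, deprel in rows:
        cols = [str(idx), form, "_", pos, pos, "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n"   # sentences are separated by a blank line

print(to_conllx(rows))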
4.3 Discussion on parsing for Korean

In previous work on parsing for Korean, whether phrase structure or dependency parsing, Park et al. (2013) proposed an 80-10-10 corpus split for training, development and evaluation, while others often used cross validation (Oh and Cha, 2010; Choi et al., 2012; Oh and Cha, 2013). For phrase structure parsing, Choi et al. (2012) obtained up to a 78.74% F1 score. For dependency parsing, Oh and Cha (2013) obtained 87.03% (10-fold cross validation) and Park et al. (2013) up to 86.43% (corpus split) by using external case frame information.

Currently, we distribute only parsing models rather than the parsers and training data themselves, for the following reasons. First, the Sejong Treebank that we use for training and evaluation is not allowed to be distributed by third parties; corpus users should apply directly to the National Institute of the Korean Language [5] for their own usage. Therefore, we only make the current parsing models publicly available instead of the actual training data. Second, multilingualism is becoming more and more important, and many natural language processing (NLP) works rely on a single system to deal with multiple languages homogeneously. The Berkeley parser and MaltParser, for which we provide parsing models, have been developed for many other languages, and users can easily obtain their up-to-date parsing systems and models for several other languages.

We provide parsing models trained only on the training data, which can serve as baseline parsing systems for Korean for comparison in future work. Table 1 presents the current baseline parsing results using phrase structure grammars with the Berkeley parser. We performed 5-fold and 10-fold cross-validation as well as corpus split evaluation for comparison purposes. We also tested both cases in whic

[5] http://www.korean.go.kr
