UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese

Toshinobu Ogiso*†, Mamoru Komachi†, Yasuharu Den‡, Yuji Matsumoto†
*Department of Corpus Studies, National Institute for Japanese Language and Linguistics (NINJAL)
†Graduate School of Information Science, Nara Institute of Science and Technology (NAIST)
‡Faculty of Letters, Chiba University
10-2, Midori-cho, Tachikawa-shi, Tokyo JAPAN 190-8561
E-mail: togiso@ninjal.ac.jp, komachi@is.naist.jp, den@cogsci.l.chiba-u.ac.jp, matsu@is.naist.jp

Abstract
In order to construct an annotated diachronic corpus of Japanese, we propose to create a new dictionary for morphological analysis of Early Middle Japanese (Classical Japanese) based on UniDic, a dictionary for Contemporary Japanese. Differences between Early Middle Japanese and Contemporary Japanese, which prevent a naïve adaptation of UniDic to Early Middle Japanese, are found at the levels of lexicon, morphology, grammar, orthography and pronunciation. To overcome these problems, we extended the dictionary entries and created a training corpus of Early Middle Japanese in order to adapt UniDic for Contemporary Japanese to Early Middle Japanese. Experimental results show that the proposed UniDic-EMJ, a new dictionary for Early Middle Japanese, achieves an accuracy (97%) high enough for linguistic research on lexicon and grammar in the analysis of Japanese classical texts.

Keywords: Morphological Analysis, Classical Japanese, Early Middle Japanese, Historical Corpus of Japanese

1. Background
Recently, the use of corpus linguistics has become popular among Japanese linguists. To facilitate further research on corpus linguistics, the National Institute for Japanese Language and Linguistics (NINJAL) has compiled one of the largest Japanese corpora, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al., 2010). Following the same line of research, a diachronic corpus of Japanese is currently under construction.

Since corpus linguistics relies heavily on word-segmented corpora, it is important to have morphological annotations for the corpus that is the object of study. However, morphological annotations do not come for free, and thus Japanese corpus linguists need an automatic morphological analyzer. To implement highly accurate and effective morphological analyzers, a carefully constructed wide-coverage dictionary is necessary. Such a dictionary is essential for statistical and machine learning-based approaches to succeed. For example, the state-of-the-art Japanese morphological analyzer MeCab (Kudo et al., 2004) is trained with an electronic dictionary called UniDic¹ on a manually annotated BCCWJ. In UniDic, all entries are based on the definition of the short unit word (SUW), which provides word segmentation in a uniform size suited for linguistic research. UniDic also achieves high performance in many text genres, including literature and spoken texts (Den et al., 2007).

However, the original UniDic covers only Contemporary Japanese (CJ). We conducted preliminary experiments on the morphological analysis of literature written in Early Middle Japanese (EMJ) using the state-of-the-art morphological analyzer MeCab with contemporary dictionaries. Its accuracy on EMJ turned out to be considerably lower than the reported accuracy for newswire texts, and completely inadequate for Japanese linguists. One reason is that there was a massive change in writing style in the Meiji era (1868-1912).

Early Middle Japanese is a historical stage of the Japanese language used in the Heian period (A.D. 794 - 1185). In the Heian period, various styles of Japanese literature such as monogatari (tales) and nikki bungaku (diary literature) appeared for the first time in history. Waka (native Japanese poetry) also flourished at this time. Masterpieces such as the Tale of Genji, the Tosa Diary, and the Kokin Waka-shū poetry anthology, to name a few, were written in this era. Therefore, a morphological analysis of EMJ is especially useful for Japanese historical linguists.

As the first step toward rich annotation of linguistic information for historical texts in the diachronic corpus, we propose to start by building an electronic dictionary for morphological analysis adapted for EMJ. Morphological analysis is one of the fundamental annotations for the construction of a full-scale corpus.

The rest of this paper is organized as follows. Section 2 describes the characteristics of Early Middle Japanese. Section 3 explains how we built the UniDic for Early Middle Japanese. Section 4 compares the UniDic for Early Middle Japanese with other dictionaries to show its effectiveness. Section 5 presents conclusions and suggests future directions.

¹ http://download.unidic.org/

2. Linguistic Characteristics of Early Middle Japanese
Early Middle Japanese has various characteristics that distinguish it from CJ in several linguistic fields: lexicon, morphology, syntax, orthography and pronunciation. We
will briefly describe the differences between CJ and EMJ in terms of corpus linguistics.

2.1 Lexical Differences
The Japanese lexicon mainly consists of three types of words: wago, kango and gairaigo. Wago are words of Japanese origin which had existed before kango were introduced from China. Kango are words of Chinese origin which were imported from China or created in Japan using kanji (Chinese characters). Gairaigo are foreign words not originating from Chinese, usually transliterated and written in Katakana. These word types are called “goshu”. In CJ, approximately 18% to 70% of the words used in texts are kango (in SUW). By contrast, in EMJ literary texts only 1% to 5% of the words are kango. This suggests that numerous kango words have been newly imported or created and many wago words have become obsolete, even though most of the basic words in EMJ are wago and remain the same today. Thus, dictionaries for CJ tend to lack outdated but essential words.

2.2 Morphological Differences
Conjugation types have changed throughout the history of the Japanese language. For example, the conjugations of the verb “kuru 来る” (come) and the adjective “akai 赤い” (red) have changed as shown in Table 1.

Table 1. Differences of Conjugation
[Rows, for each of kuru 来る (v. come) and akai 赤い (adj. red): mizen (irrealis), ren'yō (continuative), shūshi (terminal), rentai (attributive), izen (realis) / katei (hypothetical), meirei (imperative). Surviving cell values for akai: akaku, akakat-, akai, (akakaru), akakere, akakere, (akakare), akakare.]

Though most lexical entries of verbs had already been included in the UniDic dictionary and most of the conjugations in EMJ can be formed by derivation, the conjugation table had to be modified for EMJ. Because there are many irregularly changed words and many contemporary words not used in EMJ, we had to check all derived entries.

Moreover, this difference in conjugation type affects word bigram probability, since conjugations of verbs are abundant in texts. Thus, a part-of-speech tagging model learned from CJ cannot practically be applied to EMJ as is.

2.3 Grammatical Differences
Although the word order of EMJ is almost the same as that of CJ, function words such as particles and auxiliary verbs have changed considerably over time. For example, the most frequently used auxiliary verbs in EMJ, such as “mu”, “beshi” and “keri”, are no longer used today. For this reason, corpora of CJ are not appropriate for machine learning-based approaches to the morphological analysis of EMJ.

2.4 Orthographic Differences
There are many orthographic differences between EMJ and CJ texts. The usages of kana and kanji characters are the most significant differences. Table 2 shows examples of these differences.

Table 2. Differences of Kana and Kanji Orthography
Usage             Word (meaning)       CJ      EMJ
Kana usage        koe (n. voice)       こえ    こゑ
                  omou (v. think)      …       …ふ
Kanji usage       kuni (n. country)    …       …
                  kuru (v. come)       …       …
Kana and kanji    au (v. meet)         …       …

In EMJ, words written in the kana orthography were spelled in Rekishi Kanazukai (historical kana usage), based on the pronunciations at the time. Rekishi Kanazukai was the mainstream orthography until the Gendai Kanazukai (modern kana usage) was introduced in 1946. Because most morphological analyzers do not canonicalize these usages, they fail to analyze these characters correctly.

Furthermore, EMJ texts contain old variants of kanji that are not present in CJ. These characters degrade the performance of morphological analysis if the dictionary only includes their newer counterparts.

There is a further complication: old kanji and historical kana usage often occur in combination.

3. Making the UniDic for Early Middle Japanese
In order to overcome the problems stemming from the differences between Contemporary Japanese and Early Middle Japanese mentioned above, we decided to build a new dictionary and a corpus especially for EMJ. We used two approaches: one is to expand the entries of the contemporary UniDic dictionary, and the other is to annotate a new corpus of EMJ as training data for morphological analysis.
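To make the motivation for expanding the dictionary entries concrete, the following is a minimal sketch of how a lexicon that lists only contemporary spellings misses EMJ surface forms. The dictionaries and the functions `lookup` and `coverage` are hypothetical illustrations and do not reflect UniDic's actual file format or entry structure. The spellings are examples given in this paper: koe is written こえ in CJ and こゑ in EMJ, and au 会う (meet) appears in EMJ as あふ or 會ふ.

```python
# Toy illustration only: these dictionaries are hypothetical and do not
# reflect UniDic's actual file format or entry structure.

# Surface forms that a Contemporary Japanese (CJ) dictionary would list.
CJ_ENTRIES = {
    "こえ": "koe (voice)",
    "会う": "au (meet)",
}

# Early Middle Japanese (EMJ) spellings of the same words:
# historical kana usage (こゑ, あふ) and an old kanji variant (會ふ).
EMJ_EXTENSIONS = {
    "こゑ": "koe (voice)",
    "あふ": "au (meet)",
    "會ふ": "au (meet)",
}

# Extended dictionary: CJ entries plus the added EMJ surface forms.
EXTENDED = {**CJ_ENTRIES, **EMJ_EXTENSIONS}


def lookup(surface, dictionary):
    """Return the lemma for a surface form, or None if it is unknown."""
    return dictionary.get(surface)


def coverage(tokens, dictionary):
    """Return the fraction of tokens whose surface form is in the dictionary."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in dictionary)
    return known / len(tokens)


if __name__ == "__main__":
    emj_tokens = ["こゑ", "あふ", "會ふ"]
    print("CJ-only coverage: ", coverage(emj_tokens, CJ_ENTRIES))
    print("Extended coverage:", coverage(emj_tokens, EXTENDED))
```

In this toy setting every EMJ token is unknown to the CJ-only lexicon, while the extended lexicon maps each historical spelling to the same lemma as its contemporary counterpart. A real analyzer additionally needs training data to choose among competing analyses, which is the role of the annotated corpus.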

3.1 Extension of Dictionary Entries
Starting from the existing UniDic, we extended the word entries to cope with the lexical, morphological and orthographic differences.

As mentioned above, UniDic is an electronic dictionary designed for linguistic use. UniDic is structured with layered entries so that words can be treated flexibly depending on the purposes of researchers.

Figure 1 exemplifies the structured word indexes of UniDic. The Lemma layer treats words at an abstract, lemmatized level, like the entries of a general dictionary. The Form layer distinguishes allomorphs and different conjugations. The specification of the conjugation type is held in this layer. The Orthographic layer distinguishes orthographic variants.

Figure 1. Hierarchical Structure of UniDic

This structure helped us to add new entries at each level. For example, morphological differences such as word forms or conjugations are handled at the Form level, and orthographic differences such as kana usage and kanji variants are dealt with at the Orthography level.

Figure 2 shows the extensions of word entries for EMJ. In this figure, the Form “ahu” is added to annotate old conjugation forms in EMJ corresponding to the CJ word “au 会う” (meet). Likewise, old orthographic forms of “ahu” such as “あふ” and “會ふ” are added under this Form.

Figure 2. Extensions of EMJ Word Entries

Each conjugation form is generated automatically by applying the inflection table prepared for EMJ.

We added approximately 20,000 entries to cope with the lexical, morphological, and orthographic differences. The rules for the newly added EMJ entries are summarized in Ogura et al. (2012).

3.2 Training Corpus of Early Middle Japanese
To remedy the issues of morphological and syntactic differences, we manually annotated a corpus of EMJ containing 271,000 words (SUWs) to produce training and test corpora. Table 3 summarizes the texts we selected. This corpus contains major styles of Japanese literature such as monogatari and nikki bungaku, and thus serves as the fundamental resource for EMJ.

Table 3. Annotated Corpus of EMJ
Text                                                       Number of Words
(A part of) The Tale of Genji (Genji Monogatari)           …
The Diary of Lady Murasaki (Murasaki Shikibu Nikki)        …
The Tosa Diary (Tosa Nikki)                                …
As I Crossed a Bridge of Dreams (Sarashina Nikki)          …
The Tales of Ise (Ise Monogatari)                          …
The Tales of Yamato (Yamato Monogatari)                    …
The Tale of the Bamboo Cutter (Taketori Monogatari)        …

3.3 Configuration of Analyzer
MeCab is a morphological analyzer based on CRF (Lafferty et al., 2001) and achieves state-of-the-art performance in Contemporary Japanese morphological analysis. One of the main advantages of the tool is that its feature template can be designed flexibly. We added features for archaic particles, affixes, and auxiliary verbs in order to address the grammatical differences between EMJ and CJ. Furthermore, goshu features were added to deal with the lexical differences. Goshu features have been used for the original UniDic (for CJ) and have been confirmed to be effective (Den et al., 2007). MeCab can automatically learn feature weights for UniDic from an annotated corpus of EMJ to build a morphological analyzer.

As local context, MeCab uses part-of-speech-level bigrams for general words to avoid sparseness, with the only exception of function words such as particles or affixes,
which use word-level bigrams (lexicalization). In the setting of UniDic-EMJ, word-level bigrams are used for archaic particles, auxiliary verbs and affixes, in place of the function words of CJ. All other configurations of the analyzer basically remain the same as those used for CJ.

4. Evaluation of the UniDic for Early Middle Japanese

4.1 Experimental Settings
We evaluated the performance of the UniDic for EMJ version 0.6. The test data contains 27,100 words (SUWs) of randomly sampled sentences (10% of the annotated corpus). Note that although the test data was not used as training data, it contained no words unknown to the dictionary.

The evaluations were carried out at four levels. Level 1 is the accuracy of word segmentation. Level 2 is the accuracy of part-of-speech tagging for items correct at Level 1. Level 3 is the accuracy of lemmatization for items correct at Levels 1 and 2. Level 4 is the accuracy of the distinction of allomorphs for items correct at all other levels. Table 4 shows the number of correct words in the analyzed texts and the corresponding performance for the four levels.

Table 4. Numbers of Correct Words and Accuracy
                 Level 1    Level 2    Level 3    Level 4
Input words      …          …          …          …
Output words     …          …          …          …
Correct words    …          …          …          …
…                …          …          …          0.96572
F-value          0.99340    0.97687    0.96982    0.96551

The accuracy at Level 3, which is the level mainly used by linguists, is approximately 97%. This is only slightly lower than the accuracy of the morphological analysis dictionary for CJ (approximately 98%). Although it depends on the purposes of the research in question, 97% accuracy is sufficient for a variety of historical linguistic studies.

4.2 Comparison with Other UniDics
We compared the performance of UniDic-EMJ with the original UniDic and UniDic-MLJ (Kindai-Bungo UniDic) in the analysis of Japanese classical texts. The original UniDic (UniDic-CJ) does not contain obsolete words. UniDic-MLJ is a morphological dictionary for Modern Literary Japanese (literary style texts of the Meiji era). Although UniDic-MLJ contains almost the same lexicon as UniDic-EMJ, it is trained on a different corpus.

The test data for this comparison is the same as the data used in Table 4. This test corpus is outside the data domain of both UniDic-CJ and UniDic-MLJ, and thus it is no wonder that they do not perform well on this data set. However, these two had been the only available dictionaries for EMJ until UniDic-EMJ was built.

Figure 3. Performance Comparison with Other UniDics
                        UniDic-EMJ 0.6    UniDic-MLJ 1.1    UniDic (-CJ) 1.3.12
Level 1 Segmentation    0.99340           0.94003           0.83265
Level 2 POS Tagging     0.97687           0.89464           0.62586
Level 3 Lemmatize       0.96982           0.85030           0.59499
Level 4 Allomorph       0.96551           0.84554           0.59002

Figure 3 shows the performance of the three UniDic variants using the same criteria as Table 4. As you can see,
UniDic-EMJ achieved the best performance. UniDic-EMJ outperformed UniDic-CJ for POS Tagging, Lemmatization and Allomorph by a large margin. This clearly demonstrates the effectiveness of building a dictionary tailored to the specific period of the historical text at hand.

4.3 Error Analysis
We carried out an error analysis of the morphological analysis using UniDic-EMJ. At Level 1, complex compound words are divided into a set of two or more simple words. For example, “tabikasanaru” (repeat) is divided into “tabi” (time) and “kasanaru” (overlap), and “kataharaitasi” (disgusting) is divided into “katahara” (side) and “itasi” (painful).

At Level 2, there are many mistakes in distinguishing short function words of the same form. One of the most frequent words, “ni”, can be one of three different parts of speech: dative case marker, conjunction particle or a conjugated form of the copula “nari”. Errors also occur in the distinction between a noun derived from a verb and the original verb: for example, the noun “wakare” (parting, separation) and the ren'yō (continuative) form of the verb “wakareru” (part, separate). Some verbs realize two different conjugational forms with the same surface form, and ambiguities such as these also caused a large number of errors. For example, both the shūshi (terminal) conjugation and the rentai (attributive) conjugation of the verb “tatu” (stand) take the form “tatu” and are written in identical ways.

At Level 3, there are many errors in identifying wago words of kindred meaning expressed by the same kanji. For example, “ne” and “oto” (sound) are both written “音”; “sita” and “simo” (under) are both written “下”; “toko” and “yuka” (bed or floor) are both written “床”; and so on. Some errors were failures to recognize the distinction between wago and kango written with the same kanji, such as wago “ama” and kango “ni” (written “尼”). However, such errors were reduced as a result of using goshu features.

At Level 4, a large number of errors were due to variations of forms produced by rendaku (voicing of the initial consonant).

All these cases are hard to distinguish even for humans. Automatic morphological analysis using UniDic-EMJ has already reached a level of accuracy as high as that of ordinary non-experts.

5. Conclusions and Future Work
We have constructed an electronic dictionary for morphological analysis of Early Middle Japanese (Classical Japanese), which can analyze Japanese classical texts with high accuracy. Its accuracy (97%) is considered to be high enough for linguistic research on lexicon and grammar. UniDic-EMJ is now freely available at our webpage². Several reports on the development of UniDic-EMJ, software tools associated with UniDic-EMJ, and linguistic studies using UniDic-EMJ are summarized in Ogiso et al. (2012).

For the compilation of a Japanese diachronic corpus, we must prepare more dictionaries for other types of Japanese: other periods and genres. One highly needed resource is a dictionary for colloquial Early Modern Japanese. We are planning to build a new UniDic to analyze texts of this type.

² …php?UniDic

6. Acknowledgements
This work is partially supported by the collaborative research project “Study of the history of the Japanese language using statistics and machine-learning” carried out at the National Institute for Japanese Language …

7. References
Yasuharu Den, Junichi Nakamura, Toshinobu Ogiso, and Hideki Ogura. (2008). A proper approach to Japanese morphological analysis: Dictionary, model, and evaluation. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech, Morocco, pp. 1019-1024.

Yasuharu Den, Toshinobu Ogiso, Hideki Ogura, Atsushi Yamada, Nobuaki Minematsu, Kiyotaka Uchimoto and Hanae Koiso. (2007). The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics (in Japanese). Japanese Linguistics, 22: pp. 101-123.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. (2004). Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 230-237, Barcelona, Spain.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pp. 282-289, Williamstown, MA.

Kikuo Maekawa, Makoto Yamazaki, Takehiko Maruyama, Masaya Yamaguchi, Hideki Ogura, Wakako Kashino, Toshinobu Ogiso, Hanae Koiso and Yasuharu Den. (2010). Design, compilation, and preliminary analyses of balanced corpus of contemporary written Japanese. In Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta, pp. 1483-1486.

Hideki Ogura, Tetsuya Sunaga, Toshinobu Ogiso. (2012). Rules of Short Unit Words for UniDic-EMJ (in Japanese). Research Report of the Grants-in-Aid for Scientific Research (Project Number: 21520492).

Toshinobu Ogiso, Hideki Ogura, Makiro Tanaka, Asuko Kondo, Yasuharu Den. (2012). Development of an Electronic Dictionary for Morphological Analysis of Classical Japanese (in Japanese). Research Report of the Grants-in-Aid for Scientific Research (Project Number: 21520492).

