A LINK GRAMMAR FOR TURKISH - Bilkent University


A LINK GRAMMAR FOR TURKISH

A THESIS
SUBMITTED TO THE DEPARTMENT OF COMPUTER ENGINEERING
AND THE INSTITUTE OF ENGINEERING AND SCIENCES
OF BILKENT UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE

By
Özlem İstek
August, 2006

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

Asst. Prof. Dr. İlyas Çiçekli (Supervisor)

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

Prof. Dr. H. Altay Güvenir

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

Assoc. Prof. Ferda Nur Alpaslan

Approved for the Institute of Engineering and Sciences:

Prof. Dr. Mehmet Baray
Director of Institute of Engineering and Sciences

ABSTRACT

A LINK GRAMMAR FOR TURKISH

Özlem İstek
M.S. in Computer Engineering
Supervisor: Asst. Prof. Dr. İlyas Çiçekli
August, 2006

Syntactic parsing, or syntactic analysis, is the process of analyzing an input sequence in order to determine its grammatical structure, i.e. the formal relationships between the words of a sentence, with respect to a given grammar. In this thesis, we developed a grammar of the Turkish language in the link grammar formalism. In the grammar, we used the output of a fully described morphological analyzer, which is very important for agglutinative languages like Turkish. The grammar that we developed is lexical: we used the lexemes of only some function words, and for the rest of the word classes we used the morphological feature structures. In addition, we preserved some of the syntactic roles of the intermediate derived forms of words in our system.

Keywords: Natural Language Processing, Turkish grammar, Turkish syntax, Parsing, Link Grammar.

ÖZET

A LINK GRAMMAR FOR TURKISH

Özlem İstek
Department of Computer Engineering, M.S.
Thesis Supervisor: Asst. Prof. Dr. İlyas Çiçekli
August, 2006

Syntactic analysis, or parsing, is the process of examining a sentence with respect to a given grammar in order to reveal its grammatical structure, that is, the relationships among its words. In this work, a link grammar for Turkish has been developed. Our system uses the results of a full-coverage, two-level morphological analyzer, which is very important for languages such as Turkish that have inflectional and agglutinative morphemes. The grammar that we developed is lexical; however, while some function words are used as they are, for the other word classes the morphological features are used instead of the words themselves. In addition, some of the syntactic roles of the intermediate derived forms of words are preserved in our system.

Keywords: Natural Language Processing, Turkish Grammar, Turkish Syntax, Syntactic Parsing, Link Grammar.

Acknowledgement

I would like to express my deep gratitude to my supervisor Asst. Prof. Dr. İlyas Çiçekli for his invaluable guidance, encouragement, and suggestions throughout the development of this thesis.

I would also like to thank Prof. Dr. H. Altay Güvenir and Assoc. Prof. Ferda Nur Alpaslan for reading and commenting on this thesis.

I would like to thank my friends Abdullah Fişne and Serdar Severcan for their help. I am also grateful to my friend Arif Yılmaz for his invaluable help, moral support, encouragement and suggestions.

I am grateful to my family for their infinite moral support and help throughout my life.

To my mother, Fatma İSTEK

Contents

1 Introduction
    1.1 Linguistic Background
    1.2 Thesis Outline
2 Link Grammar
    2.1 Introduction
    2.2 Main Rules of the Grammar
    2.3 Language and Notion of Link Grammars
        2.3.1 Rules for Writing Connector Blocks or Linking Requirements
        2.3.2 The Concept of Disjuncts
    2.4 General Features of the Link Parser
    2.5 Special Features of the Dictionary
    2.6 Coordinating Conjunctions
        2.6.1 Handling Conjunctions
        2.6.2 Some Problematic Conjunctional Structures
    2.7 Post-Processing
        2.7.1 Introduction
        2.7.2 Structures of Domains
        2.7.3 Rules in Post Processing
3 Turkish Morphology and Syntax
    3.1 Distinctive Features of Turkish
    3.2 Turkish Morphotactics
        3.2.1 Inflectional Morphotactics
        3.2.2 Derivational Morphotactics
        3.2.3 Question Morpheme
    3.3 Constituent Order in Turkish
    3.4 Classification of Turkish Sentences
        3.4.1 Classification by Structure
        3.4.2 Classification by Predicate Type
        3.4.3 Classification by Predicate Place
        3.4.4 Classification by Meaning
    3.5 Substantival Sentences
4 Design
    4.1 Morphological Analyzer
        4.1.1 Turkish Morphological Analyzer
        4.1.2 Improvements and Modifications to Turkish Morphological Analyzer
    4.2 System Architecture
5 Turkish Link Grammar
    5.1 Scope of Turkish Link Grammar
    5.2 Linking Requirements Related to All Words
    5.3 Compound Sentences, Nominal Sentences, and the Wall
    5.4 Linking Requirements of Word Classes
        5.4.1 Adverbs
        5.4.2 Postpositions
        5.4.3 Adjectives and Numbers
        5.4.4 Pronouns
        5.4.5 Nouns
        5.4.6 Verbs
        5.4.7 Conjunctions
6 Performance Evaluation
7 Conclusion
BIBLIOGRAPHY
A Turkish Morphological Features
B Summary of Link Types
C Input Document and Statistical Results
D Example Output from Our Test Run

List of Figures

Figure 1 METU-Sabancı Turkish Treebank
Figure 2 Typical Order of Constituents in Turkish
Figure 3 Architecture of a Two Level Morphological Analyzer
Figure 4 System Architecture
Figure 5 Special Preprocessing for Derived Words
Figure 6 Example to Preprocessing for Derived Words
Figure 7 Linking Requirements of Intermediate Forms of a Word, Wx
Figure 8 Change of Linking Requirements of an IDF According to Its Place
Figure 9 Macro for the Derivation Boundary and Question Morpheme
Figure 10 Linking Requirements of the LEFT-WALL
Figure 11 Rules for Adjectives
Figure 12 Suffixless Adjective to Verb Derivation, an Example Illustrative Sentence Structure
Figure 13 Linking Requirements of Adverbs
Figure 14 Linking Requirements of Postpositions
Figure 15 Linking Requirements of Adjectives
Figure 16 Linking Requirements of Numbers
Figure 17 Linking Requirements of Nominative Pronouns
Figure 18 Linking Requirements of Genitive and Accusative Pronouns
Figure 19 Linking Requirements of Locative/Ablative/Dative/Instrumental Pronouns
Figure 20 Left Linking Requirements Common to All Nouns
Figure 21 Right Linking Requirements of Nouns

List of Tables

Table 1 Effects of Causation to Verbs
Table 2 Verb Subcategorization Information
Table 3 Subscript Set for S (Subject) Connector
Table 4 Statistical Results of the Test Run

List of Abbreviations

SOV     Subject object verb
POS     Part of speech tag
LG      Link Grammar
IDF     Intermediate Derived Form
TLG     Turkish Link Grammar
LR      Linking Requirements
DLR     Derivational Linking Requirements
LLR     Left Linking Requirements
RLR     Right Linking Requirements
NDLR    Non-Derivational Linking Requirements
NDLLR   Non-Derivational Left Linking Requirements
NDRLR   Non-Derivational Right Linking Requirements
DC      Dependent Clause
IC      Independent Clause
NLP     Natural Language Processing

Chapter 1

1 Introduction

Syntax is the set of formal relationships between the words of a sentence. It deals with word order and with how words depend on other words in a sentence. Hence, one can write rules for the permissible word order combinations of any natural language, and this set of rules is called a grammar. Syntactic parsing, or syntactic analysis, is the process of analyzing an input sequence in order to determine its grammatical structure with respect to a given grammar.

There are different classes of theories for the natural language syntactic parsing problem and for creating the related grammars. One of these classes of formalisms is categorial grammar, motivated by the principle of compositionality (the principle that the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them). According to this formalism, syntactic constituents combine as functions or in a function-argument relationship. In addition to categorial grammars, there are two other classes of grammars: phrase structure grammars and dependency grammars. Phrase structure grammars are the well-known Type-2, i.e. context-free, grammars of the Chomsky hierarchy. They arrange constituents in a tree-like hierarchy; head-driven phrase structure grammars (HPSG) and lexical functional grammars are some popular types of phrase structure grammars. Dependency grammars, on the other hand, build simple relations between pairs of words. Since dependency grammars are not defined by a specific word order, they are well suited to languages with free word order, such as Czech and Turkish. Link grammar, which is a theory of syntax by Davy Temperley and Daniel Sleator [1], is similar to dependency grammar, but link

grammar includes directionality in the relations between words and lacks a head-dependent relationship.

In this thesis, we study Turkish syntax from a computational perspective. Our aim is to develop a link grammar for Turkish that is as complete as possible. We chose to study Turkish syntax computationally because syntactic analysis underlies most natural language applications. Hence, syntactic analysis is a very important step toward accelerating new research on Turkish, a lesser-studied language. One of the reasons we chose the link grammar formalism is that it is based on the dependency formalism, which is known to be more suitable for free word order languages like Turkish. In addition, link grammar is lexical, and this property makes it an easy development environment for a large, full-coverage grammar.

In addition to our work, there is other research on the computational analysis of Turkish syntax. One example is a lexical functional grammar of Turkish by Güngördü in 1993 [8]. Demir [18] also developed an ATN grammar for Turkish in 1993. Another grammar, based on the HPSG formalism, was developed by Sehitoglu in 1996 [7]. Hoffman in 1995 [19], Çakıcı in 2005 [21], and Bozşahin in 1995 [20] worked on categorial grammars for Turkish.

In addition to these categorial and context-free works, Turkish syntax has been studied from the dependency parsing perspective. Oflazer presents a dependency parsing scheme using an extended finite state approach. The parser augments the input representation with “channels” so that links representing syntactic dependency relations among words can be accommodated, and it iterates on the input a number of times to arrive at a fixed point [13]. During the iterations, crossing links, items that could not be linked to the rest of the sentence, and similar cases are filtered out by finite state filters. This parser was used for building a Turkish

treebank [22], namely the METU-Sabancı Turkish Treebank. The explanatory paragraph in Figure 1 is taken directly from the web site of the treebank.

    METU-Sabanci Turkish Treebank is a morphologically and syntactically annotated treebank corpus of 7262 grammatical sentences. The sentences are taken from METU Turkish Corpus. The percentages of different genres in METU-Sabanci Turkish Treebank and METU Turkish Corpus were kept similar. The structure of METU-Sabanci Turkish Treebank is based on XML. The distribution of the treebank also includes a user guide, a display program, and related publications. Turkish is an agglutinative language with free word order. Therefore, a dependency scheme was chosen to handle such a structure. Dependency links are put from words to inflectional groups of words.

Figure 1 METU-Sabancı Turkish Treebank

The Turkish Dependency Treebank explained above was used by Oflazer and Eryiğit for training and testing a statistical dependency parser for Turkish [12]. In their work, they explored different representational units for the statistical models of parsing.

1.1 Linguistic Background

In this section, the linguistic background necessary for the rest of the thesis is given in detail, together with some terms.

The minimal meaning-bearing unit in a language is defined as a morpheme. For example, the word “books” consists of two morphemes, “book” and “s”. Morphemes can be further categorized into two classes, stems and affixes. Stems supply the main meaning of the words, while affixes supply the additional meanings. Hence, in the previous example, the morpheme “book” is the stem of

the word “books”, and the morpheme “s” is an affix. The study of the way that words are built up from morphemes (stems and affixes) is defined as morphology. New words can be formed from stems by inflection or derivation. The difference between the two is that the word resulting from inflection has the same class as the original stem, whereas the word resulting from derivation has a different class. For example, “books” is formed by inflection from the stem “book” and the suffix “-s”, and the word “books” and the stem “book” have the same class (noun). On the other hand, the noun “preparation” is derived from the verb “prepare”. The Part of Speech (POS) tag of a word represents its class; Noun is the POS tag of the word “book”. Therefore, each stem has a POS tag, and derivational affixes can change the POS tag of the stems to which they are appended. Orthographic rules are the spelling rules or phonetic rules, and they are used to model the changes that occur in a word, usually when two morphemes combine. For example, the “y → ie” spelling rule changes “baby + -s” to “babies” instead of “babys” [16].

Rules specifying the ordering of the morphemes are defined by the term morphotactics. For example, in Turkish the plural suffix “-ler” may follow nouns. Morphological features are the additional information about the stem and affixes. “Book Noun Plural” contains the morphological features of the word “Books”. Morphological features of words are produced through morphological analysis. Hence, the terms morphological features, morphological analysis, and morphological parse of a word can be used interchangeably. Any morphological processor needs the morphotactic rules, the orthographic rules, and the lexicons of its language. A lexicon is the list of stems with their POS tags.

A sentence is a group of words that contains subjects and predicates and expresses assertions, questions, commands, wishes, or exclamations as complete thoughts.
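The morphological notions introduced above (a lexicon of stems with POS tags, morphotactic rules, and orthographic rules) can be illustrated with a minimal Python sketch. It is not the morphological analyzer used in this thesis: the names `LEXICON`, `MORPHOTACTICS`, `apply_orthography`, `generate`, and `analyze` are hypothetical, the lexicon holds only the English examples from the text, and the “y → ie” rule is simplified to fire only after a consonant.

```python
# Toy lexicon: stems with their POS tags (illustrative entries only).
LEXICON = {"book": "Noun", "baby": "Noun", "prepare": "Verb"}

# Toy morphotactics: which suffix feature may follow which POS tag,
# and the surface form of that suffix. Only the noun plural is modeled.
MORPHOTACTICS = {"Noun": {"Plural": "s"}}

VOWELS = set("aeiou")


def apply_orthography(stem, suffix):
    """Join stem and suffix, applying a simplified y -> ie spelling rule."""
    if (suffix == "s" and len(stem) > 1
            and stem.endswith("y") and stem[-2] not in VOWELS):
        return stem[:-1] + "ie" + suffix
    return stem + suffix


def generate(stem, feature):
    """Build the surface form of stem + feature if morphotactics allow it."""
    pos = LEXICON[stem]
    allowed = MORPHOTACTICS.get(pos, {})
    if feature not in allowed:
        raise ValueError(f"{feature} may not follow a {pos} stem")
    return apply_orthography(stem, allowed[feature])


def analyze(word):
    """Return every morphological parse of a surface word as a feature string."""
    parses = []
    for stem, pos in LEXICON.items():
        if word == stem:
            parses.append(f"{stem} {pos}")
        for feature in MORPHOTACTICS.get(pos, {}):
            if generate(stem, feature) == word:
                parses.append(f"{stem} {pos} {feature}")
    return parses


print(analyze("books"))   # ['book Noun Plural']
print(analyze("babies"))  # ['baby Noun Plural']
print(analyze("babys"))   # []
```

Analysis here works by generation and comparison, which is only practical for a toy lexicon; it is chosen to keep the three resources (lexicon, morphotactics, orthography) visibly separate.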
Each sentence is thought to have a subject, an object, and a verb, and one of these can be implied. In a sentence with just one complete thought, the

predicate of the sentence is the group of words that collectively modify the subject. In the following examples, the predicate is underlined.

I. Ali cooks.
II. Özlem is in the cinema.
III. He is attractive.

Subject is defined as the origin of the action or undergoer of the state shown by the predicate in a sentence.

Valence (valency) is the number of arguments that
