Developing An Unsupervised Grammar Checker For Filipino .


PACLIC 30 Proceedings

Developing an Unsupervised Grammar Checker for Filipino Using Hybrid N-grams as Grammar Rules

Matthew Phillip Go
De la Salle University
2401 Taft Avenue, Manila, Philippines
matthew phillip go@dlsu.edu.ph

Allan Borra
De la Salle University
2401 Taft Avenue, Manila, Philippines
allan.borra@dlsu.edu.ph

Abstract

This study focuses on using hybrid n-grams as grammar rules for detecting grammatical errors and providing corrections in Filipino. These grammar rules are derived from grammatically correct, tagged texts and are made up of sequences of part-of-speech (POS) tags, lemmas, and surface words. Because of the structure of these rules, the system presents an opportunity to have an unsupervised grammar checker for Filipino when coupled with existing POS taggers and morphological analyzers. The approach is also customized to cover different error types present in the Filipino language. The system achieved 82% accuracy when tested on checking erroneous and error-free texts.

1. Introduction

According to the philosopher and educator Kevin Browne, poor grammar implies one of two negative things about the writer: either he is not intelligent or he does not care enough to improve his writing. In response to this problem, there have been many studies and advances in the field of computer-aided grammar checking, with tools such as Microsoft Word, Google Docs, Grammarly, LanguageTool, and Ginger. These software solutions can detect errors in spelling, punctuation, word form, and word usage. However, most of these solutions have focused on the English language. There have been very few works for the Filipino language, despite it being a language of at least 100 million people¹. Additionally, it is difficult to take an existing grammar checker built for one language and apply it to another, since each system has specific design functionalities that tackle the unique phenomena of its target language.

The Filipino language, just like any other language, has its own unique phenomena, which pose a challenge in developing a grammar checker for it. It has a ‘large vocabulary of root, borrowed, and derived words’ resulting from the arrival of and/or colonization by foreign countries, including Spain, the USA, and China². It also has a high degree of inflection and uses a variety of affixes to change the part-of-speech of a root word (ex. root: tira ‘live [in a house]’, tira + han → tirahan ‘house’) or to change the focus and aspect of a verb (tirhan ‘live’: neutral aspect/object focus; titira ‘will live’: contemplative aspect/actor focus; tumira ‘lived’: perfective aspect/actor focus).

Another linguistic phenomenon in Filipino is its free word order. Filipino sentences, in their natural form, follow either the predicate-subject format (ex. Masaya ako, translated word for word as ‘Happy I’) or the subject-predicate format (ex. Ako ay masaya, translated word for word as ‘I [none] happy’), where the word ay acts as a lexical marker and is usually placed after the subject and before the predicate. In Filipino, direct objects, adjectives, and adverbs may also be written as phrases and, like prepositional phrases, they follow the free word order and are not limited to just one position in the sentence (Ramos, 1971). For example, the sentence ‘Mark ate an apple.’ can be translated as Si Mark ay kumain ng mansanas., Kumain si Mark ng mansanas., and Kumain ng mansanas si Mark. As seen in the last two translations, the direct object phrase ng mansanas ‘apple’ can be placed directly after the verb or after the subject, yet both produce the exact same […]

¹ […]538653/philippines-population-seen-hit-104m
² […]filipinolanguage-wikang-filipino/

30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30), Seoul, Republic of Korea, October 28-30, 2016

As of this writing, there is still no publicly available grammar-checking software for Filipino that covers a broad range of grammatical errors. This may be associated with the complex structure of the Filipino language, which makes it difficult to construct (error) grammar rules. Among the few existing grammar checkers for Filipino are Panuring Pampanitikan (PanPam) by Jasa et al. (2007) and Language Tool for Filipino (LTF) by Oco & Borra (2011). PanPam is a syntax- and semantics-based grammar checker for Filipino that uses error patterns as rules and lexical functional grammar as its parsing algorithm. LTF, on the other hand, uses a rule file containing error patterns in the form of regular expressions and part-of-speech tags, together with a dictionary file, to detect errors and provide corresponding suggestions. Although these systems, especially LTF, can clearly distinguish grammatical errors from correct text by using error patterns, their main drawback is that the parser rules, dictionaries, affix-to-root-word mappings, word-to-part-of-speech mappings, error patterns, and other files are manually defined. Covering the entire language and all possible errors in it this way is very tedious, especially since the language is ever growing and the number of errors committed by writers grows in direct proportion to it. This concern is evident in the systems’ reported limitations and results, where only a small subset of errors was covered.

For other languages such as English, there are existing works such as Lexbar (Tsao & Wible, 2009), EdIt (Huang et al., 2011), the Google Books n-gram corpus as a grammar checker (Nazar & Renau, 2012), and the chunk-based grammar checker for translated sentences (Lin et al., 2011). These are unsupervised grammar checker systems that take grammatically correct texts, their corresponding part-of-speech (POS) tags, and/or lemmas, convert them into n-gram sequences, and use those sequences as grammar rules.

The Lexbar application (Tsao & Wible, 2009) generates hybrid n-grams, which are n-grams composed of words, POS tags, and lemmas. These hybrid n-grams are generated from actual tagged word sequences. For example, given phrases such as ‘from her point of view’ and ‘from his point of view’, the system is able to generate the hybrid rule ‘from [dps] point of view’³. This rule can be used to flag the phrase ‘from my point of view’ as grammatically correct and the phrase ‘from him point of view’ as incorrect. The Lexbar app was only tested on substitution-correctable errors. The EdIt system (Huang et al., 2011) also uses hybrid n-grams (called pattern rules) as grammar rules, but only generates rules such as ‘play role in [Noun]’, ‘play role in [V-ing]’, and ‘look forward to [V-ing]’⁴ from specific lexical collocations such as ‘play role’ and ‘look forward’. These types of rules tackle much more specific error types in English. The key difference between EdIt and Lexbar is that EdIt limits the number of POS tokens in an n-gram rule to one, while Lexbar can have one or more POS tokens, as in the rule ‘from [dps] [nn0]’⁵ derived from phrases like ‘from his house’ and ‘from her balcony’. EdIt applied its rules in detecting errors correctable by substitution, insertion, and deletion.
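The hybrid rule from the Lexbar example above can be illustrated with a short sketch. This is not the Lexbar or EdIt implementation; the function names `parse_rule` and `matches` are hypothetical, and apart from `dps` the POS tags in the sample phrases are hand-assigned for illustration only. A rule slot written in square brackets matches any word carrying that POS tag, while any other slot must match the surface word exactly.

```python
from typing import List, Tuple

# A tagged token is a (surface word, POS tag) pair.
Token = Tuple[str, str]


def parse_rule(rule: str) -> List[Tuple[str, str]]:
    """Split a rule such as 'from [dps] point of view' into typed slots."""
    slots = []
    for item in rule.split():
        if item.startswith("[") and item.endswith("]"):
            slots.append(("pos", item[1:-1]))      # generalized slot: match by POS tag
        else:
            slots.append(("word", item.lower()))   # literal slot: match by surface word
    return slots


def matches(rule: str, tagged: List[Token]) -> bool:
    """Return True if the tagged phrase satisfies every slot of the rule."""
    slots = parse_rule(rule)
    if len(slots) != len(tagged):
        return False
    for (kind, value), (word, pos) in zip(slots, tagged):
        if kind == "pos" and pos != value:
            return False
        if kind == "word" and word.lower() != value:
            return False
    return True


RULE = "from [dps] point of view"

# Hand-tagged examples (tags other than 'dps' are illustrative assumptions).
correct = [("from", "prp"), ("my", "dps"), ("point", "nn1"), ("of", "prf"), ("view", "nn1")]
incorrect = [("from", "prp"), ("him", "pnp"), ("point", "nn1"), ("of", "prf"), ("view", "nn1")]

print(matches(RULE, correct))    # 'my' is tagged dps, so the rule accepts the phrase
print(matches(RULE, incorrect))  # 'him' is not tagged dps, so the rule rejects it
```

Because only the second slot is generalized, the rule accepts any possessive in that position but still requires the literal words ‘from’, ‘point’, ‘of’, and ‘view’ around it.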
Both Lexbar and EdIt used a weighted Levenshtein edit distance algorithm to prioritize their suggestions.

This research aims to build an unsupervised grammar checker system for Filipino using hybrid n-grams as grammar rules, following a format similar to that of Lexbar’s grammar rules. These rules will be used to detect grammatical errors in Filipino and provide suggestions in the form of substitution, insertion, deletion, merging, and unmerging, extending the suggestion types offered by both Lexbar and EdIt.

³ dps is the part-of-speech (POS) tag for possessive pronouns such as his, her, my, their, etc. in the CLAWS5 tagset.
⁴ V-ing is the POS tag for verbs followed by -ing in the CLAWS5 tagset.
⁵ nn0 is the POS tag for neutral nouns in the CLAWS5 tagset.

2. Filipino Linguistic Phenomena

Aside from its free word order, Filipino exhibits other linguistic phenomena: it is morphologically rich, it has compound words, and it follows the rule “Kung ano ang bigkas, siyang sulat” ‘Spell as you pronounce it’ (Ortograpiyang Pambansa, 2013).

There are at least 50 affixes, along with other morphological processes such as partial reduplication, full reduplication, and compounding, that are used in Filipino. These are categorized into three types: inflectional, changes in word form that ‘accompany case, gender, number, tense, person, mood, or voice that have no effect in the word’s part-of-speech’; derivational, changes in word form that change the word’s part-of-speech category; and compounding, ‘where independent words are concatenated in some way to form a new word’ (Bonus, 2003). See Table 1 for some of the different forms of the root word kain.

Word          Translation
[…]           will just eat
[…]           just eat
[…]           feed
[…]           will feed
[…]           eat (something)
[…]           ate (something)
[…]           eating (something)
[…]           (somebody) eating
[…]           eating/dinner table
[…]           eating place
[…]           eating place (where do-er will go later)
kinakainan    eating place (where do-er is right now)
pagkain       food
Adjective
palakain      loves eating

Table 1: Different forms of kain ‘eat’

Some affixes are separated by a hyphen (-) from the root word or morpheme (ex. mang-akit ‘to entice’ from the root akit ‘entice’). There are also cases in which adding or inserting an affix alters the spelling of the base form (ex. the prefix pang- + palit ‘change’ → pamalit ‘item for changing’). However, not every affix or reduplication can be applied to every word. For instance, the root word luto ‘cook’ can take nag- as a prefix but kain ‘eat’ cannot. It should also be noted that Filipino has assimilated words from English to which affixes are also attached (ex. magce-cellphone ‘will use a cellphone’, i-file ‘to file (a document)’). The Filipino language also has its own set of compound words. There are two ways to combine words: with a hyphen (ex. halo-halo ‘(a type of Filipino dessert)’ from the word halo ‘mix’, and kisap-mata ‘instant’ from the words kisap ‘blink’ & mata ‘eye’) or by joining them as is (ex. kapitbahay ‘neighbor’ from the words kapit ‘hold onto’ & bahay ‘house’, and hanapbuhay ‘livelihood’ from the words hanap ‘find’ & buhay ‘life’) (Paz, 2003).

Another important linguistic phenomenon in Filipino is the rule “Kung ano ang bigkas, siyang sulat” ‘Spell as you pronounce it’ (Ortograpiyang Pambansa, 2013). As the rule states, words in Filipino are usually spelled as they are pronounced, with some exceptions.
This phenomenon simplifies the way Filipino words are spelled (ex. the Filipinized form of ‘computer’ is kompyuter) but also causes some spelling confusion, which is discussed in the next section.

3. Error Types

To understand the error types that exist in Filipino writing, three references were used: the Cambridge Learner Corpus (Nicholls, 1999), Wikapedia (2015), and a parallel corpus of 1252 erroneous-and-correct word and phrase pairs from sentences written by Filipino university students.

The Cambridge Learner Corpus contains 16 million words from English examination scripts written by learners of English, containing different types of errors. The corpus categorizes the error types into general and specific errors. The proponents noticed that some error categories have Filipino counterparts, such as wrong form used, missing word/phrase, word/phrase needs replacing, unnecessary word/phrase, punctuation errors, countability errors, determiner agreement, incorrect verb inflection, and spelling errors, and that other error categories also exist in Filipino.

Wikapedia (2015) is a booklet created by the Presidential Communications Development and Strategic Planning Office of the Philippines containing the correct usage of affixes, words, and phrases in Filipino that people may find confusing. One example described in the booklet is the use of ng, a function word marking possession (ex. aso ng kapitbahay ‘dog of neighbor’) and introducing a direct object phrase (ex. kumain ng mansanas ‘ate an apple’), versus the use of nang, which is commonly used before an adverb (ex. kumain nang mabilis ‘ate fast’). These two words are easily confused because they are pronounced almost exactly the same. Other topics covered in the booklet include the proper usage of affixes and words, morphophonemics, the usage of hyphens and spaces, and others.

After analyzing the parallel corpus of 1252 erroneous-correct word/phrase pairs, it was found that the majority of the errors fall under spelling errors, incorrect usage of affixes/reduplication (mostly caused by the usage of hyphens and spaces), and wrong word usage.

It was observed that one reason the students made spelling errors is the way a word is pronounced, which is usually simplified for conversational use. Some of these simplified words (see Table 2) are still not accepted in formal Filipino writing and are therefore counted as spelling errors. Another cause of spelling errors is confusion over whether to spell a word borrowed from English in its English form or to convert it to its Filipinized spelling.

There were many instances of affix errors where the students were unsure whether an element is an affix of a word or a separate word, or whether there should be a hyphen between the affix and the root word. A few of the affix errors also show the students’ confusion in selecting the appropriate affix for a verb in a certain focus and/or aspect. See Table 3.

The students also made several mistakes in choosing which word to use in certain situations, caused by unfamiliarity with Filipino syntax rules. See Table 4.

Other errors in the parallel corpus include the lack of a space between words (ex. pa rin ‘still’ incorrectly written as parin), compound words separated by a space (ex. araw-araw ‘everyday’ incorrectly written as araw araw), and punctuation errors where some commas or periods are […]

[Table 2: Spelling Errors. Only fragments of the table contents survive, including tingnan ‘look’ and an entry glossed ‘researcher’.]

Correct Word                           Misspelled as       Reason
Pangkain ‘used for eating’             Pang kain           Extra Space
Tagtuyo ‘drought’                      Tag-tuyo            Extra Hyphen
Ikawalo ‘eighth’                       Ika-walo            Extra Hyphen
i-predict ‘to predict’                 ipredict            Missing Hyphen
mas malaki ‘bigger’                    masmalaki           Missing Space
inilagay sa kahon ‘placed in a box’    nilagay sa kahon    Incorrect affix used for a verb focus

Table 3: Affix Errors

Confused between:
ng ‘of’                                                           nang ‘(function word before an adverb)’
may ‘has (used before nouns, verbs, adjectives and adverbs)’      mayroon ‘has (used before grammatical particles, personal pronouns, and adverbs of place)’
na ‘(type of grammatical particle)’                               suffix -ng ‘used in place of na if word preceding it ends in a vowel’

Table 4: Wrong Word Usage

4. Overview of the Grammar Checker

The grammar checker discussed in this paper, named Gramatika, builds on the existing implementation of the Lexbar application by Tsao & Wible (2009) and extends it to cover more error types, some of which are unique to the Filipino language. Its rules are n-grams derived from grammatically correct texts. These n-grams, commonly referred to as hybrid n-grams, consist of words, POS tags, and lemmas, and are used to detect grammatical errors and provide suggestions containing possible corrections. The POS tags and lemmas can be produced by existing POS taggers and morphological analyzers for Filipino⁶, which makes the system unsupervised: new grammatically correct texts can be fed through these tools and into Gramatika to easily increase the number of grammar rules.

⁶ See Rabo & Cheng (2006) and Bonus (2003)

4.1 Rules Learning

Even though Gramatika uses hybrid n-grams similar to Lexbar’s (Tsao & Wible, 2009) and slightly similar to EdIt’s (Huang et al., 2011), its approach to deriving the hybrid n-grams is different. Gramatika uses a clustering approach, as opposed to Lexbar’s pruning approach and EdIt’s collocation-based approach. The n-gram sizes used as rules range from 2 to 7. For example, given the incorrect phrase para sa bata ang laruan ni iyon. ‘?that? toy is for the kid’, if Gramatika has the hybrid 7-gram ‘para sa [NNC] [DTC] [NNC] na [PRO].’⁷, it can immediately suggest changing the word ni ‘(a grammatical particle used before a personal proper noun)’ to na ‘(a grammatical particle used around adjectives, pointing pronouns, and others)’. This produces the corrected version para sa bata ang laruan na iyon ‘that toy is for the kid’, which is a more appropriate suggestion than the one produced by the trigram [NNC] ni [NNP]⁸, namely to change iyon to a proper noun (ex. Mark), producing para sa bata ang laruan ni Mark ‘Mark’s toy is for the kid’. Using larger n-gram sizes increases the context on which a suggestion can be based.

In the clustering approach, all n-gram sequences are retrieved from grammatically correct texts and stored in the database. During the storing process, the frequency of every POS tag sequence is counted. POS tag sequences exceeding the threshold of 2 are retrieved, and the word n-grams sharing each sequence are grouped into a cluster. For each n-gram cluster, the module checks whether any token slot can be generalized to the POS level. For example, if a cluster has the instances nagpunta sa bayan ‘went to the town’ and bumisita sa bahay ‘visited the house’, the first and third tokens can be generalized because each of these slots meets the minimum difference threshold of 2. This produces the hybrid n-gram [VBTS] sa [NNC], which can be used to flag the phrase umupo sa silya ‘sat on the chair’ as grammatically correct or to detect grammatical errors.
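The clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Gramatika's actual module: the function name `learn_rules`, the parameter names, and the placeholder tag `PREP` for sa are hypothetical, lemmas and the database are omitted, and the two thresholds are read here as "at least 2 instances per cluster" and "at least 2 distinct words in a slot", which is only one possible interpretation of the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A tagged token is a (surface word, POS tag) pair; lemmas are left out of this sketch.
Token = Tuple[str, str]


def learn_rules(
    ngrams: List[List[Token]],
    min_cluster_size: int = 2,    # assumed reading of the frequency threshold
    min_distinct_words: int = 2,  # assumed reading of the "minimum difference threshold"
) -> List[str]:
    """Cluster n-grams by POS sequence and generalize slots that vary across a cluster."""
    clusters: Dict[Tuple[str, ...], List[List[Token]]] = defaultdict(list)
    for ngram in ngrams:
        pos_sequence = tuple(pos for _, pos in ngram)
        clusters[pos_sequence].append(ngram)

    rules = []
    for pos_sequence, members in clusters.items():
        if len(members) < min_cluster_size:
            continue  # POS sequence is too rare to form a cluster
        slots = []
        generalized_any = False
        for i, pos in enumerate(pos_sequence):
            words = {member[i][0].lower() for member in members}
            if len(words) >= min_distinct_words:
                slots.append(f"[{pos}]")        # enough variety: generalize to the POS tag
                generalized_any = True
            else:
                slots.append(next(iter(words)))  # single word seen: keep it literal
        if generalized_any:
            rules.append(" ".join(slots))
    return rules


# The two instances from the text; 'PREP' is a placeholder tag, not taken from the tag set.
corpus = [
    [("nagpunta", "VBTS"), ("sa", "PREP"), ("bayan", "NNC")],
    [("bumisita", "VBTS"), ("sa", "PREP"), ("bahay", "NNC")],
]
print(learn_rules(corpus))  # ['[VBTS] sa [NNC]']
```

The second slot stays literal because both instances share the word sa, while the first and third slots each hold two different words and are therefore replaced by their POS tags.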
The n-gram rules are stored in the database as sequences of words, POS tags, and lemmas, together with a Boolean sequence denoting which token slots are generalized. This allows Gramatika to provide word-specific suggestions and to identify the appropriate transformed word for a specific POS-lemma mapping.

⁷ Based on the Rabo & Cheng (2006) tag set: NNC, common noun; DTC, determiner for common nouns; PRO, pronoun pointing to an object.
⁸ NNP, proper noun.

4.2 Error Detection

To detect grammatical errors and produce suggestions based on the hybrid n-grams, a weighted Levenshtein edit distance algorithm is used. This algorithm is commonly used in spell checking to compute how many edits it takes to convert a potentially misspelled word into a correct word in the dictionary. It has also been used by EdIt (Huang et al., 2011) to provide corrections by substitution, insertion, and deletion. In Gramatika, the edit distance algorithm is extended to detect errors and provide suggestions correctable by […], unmerging, and merging. The error types that exist in Filipino are grouped based on the six suggestion types; see Table 5.

Correction             Error Types
Substitution           Affix/Form errors, wrong word/punctuation usage (includes preposition, determiners, and others)
Spelling Correction    Misspelled words, misuse/lack of hyphens
Insertion              Missing words and […]
Unmerging              Incorrectly merged words requiring unmerging of words or removal of hyphens
Merging                Incorrectly unmerged word requiring removal of space or insertion of hyphen between texts

Table 5: Correction and Error Types

To produce suggestions, Gramatika parses the input, which is POS- and lemma-tagged, into n-grams starting from size 7 down to 2. For each input n-gram, it retrieves hybrid n-gram rules “similar” to the input n-gram from the database. A rule is considered “similar” to an input n-gram of size n if at least n - 2 of its POS tokens are equal to the POS tokens in the input n-gram.
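The windowing and similarity test described above can be sketched as below. The names `ngram_windows` and `is_similar` are hypothetical, and the sketch makes two assumptions that the text does not spell out: POS tokens are compared position by position, and only rules of the same length as the input n-gram are compared. It does not implement the weighted edit distance itself.

```python
from typing import Iterator, List


def ngram_windows(tokens: List[str], max_n: int = 7, min_n: int = 2) -> Iterator[List[str]]:
    """Yield every contiguous n-gram of the input, from the largest size down to the smallest."""
    for n in range(min(max_n, len(tokens)), min_n - 1, -1):
        for start in range(len(tokens) - n + 1):
            yield tokens[start:start + n]


def is_similar(rule_pos: List[str], input_pos: List[str]) -> bool:
    """Return True if at least n - 2 POS tokens agree position by position (equal lengths only)."""
    n = len(input_pos)
    if len(rule_pos) != n:
        return False  # assumption: this sketch compares equal-size sequences only
    agreeing = sum(1 for rule_tag, input_tag in zip(rule_pos, input_pos) if rule_tag == input_tag)
    return agreeing >= n - 2


# Illustrative POS sequences built from tags that appear in the text.
rule = ["NNC", "DTC", "NNC", "PRO"]
close_input = ["NNC", "DTC", "NNC", "NNP"]   # 3 of 4 tags agree, and 3 >= 4 - 2
far_input = ["VBTS", "PRO", "VBTS", "PRO"]   # 1 of 4 tags agrees, and 1 < 4 - 2

print(is_similar(rule, close_input))
print(is_similar(rule, far_input))
print(list(ngram_windows(["para", "sa", "bata"])))
```

Allowing up to two mismatched POS tokens keeps rules in play that differ from the input by a small number of edits, which is what lets a later edit distance step turn the differences into suggestions.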
Three sizes of rules are also retrieved for each input n-gram: rules of equal size to the input n-gram, used for substitution and spelling correction suggestions; rules that are one token larger, used to produce insertion and […]
