Comprehensive Stemmer For Morphologically Rich Urdu


The International Arab Journal of Information Technology, Vol. 16, No. 1, January 2019

Comprehensive Stemmer for Morphologically Rich Urdu Language

Mubashir Ali1, Shehzad Khalid2, and Muhammad Saleemi2
1Department of Computer Science & IT, University of Lahore, Pakistan
2Department of Computer Engineering, Bahria University Islamabad, Pakistan

Abstract: Urdu is used by approximately 200 million people for spoken and written communication. A large volume of unstructured Urdu text is available worldwide, and data mining techniques can be employed to extract useful information from it. Many text processing systems are available, but most are language specific and a large proportion apply only to English text. This is primarily due to language-dependent pre-processing, mainly the stemming requirement. Stemming is a vital pre-processing step in text mining; its core aim is to reduce the many grammatical forms of a word (e.g., parts of speech, gender, tense) to their root form. In this work, we have developed a comprehensive rule-based stemming method for Urdu text. The proposed Urdu stemmer can generate the stem of Urdu words as well as loan words (words borrowed from other languages, e.g., Arabic, Persian, Turkish) by removing prefixes, infixes, and suffixes. The technique introduces six novel Urdu infix word classes and a minimum word length rule. To cope with the challenge of Urdu infix stemming, we have developed infix stripping rules for the introduced infix word classes and generic rules for prefix and suffix stemming. The experimental results show the superiority of the proposed stemming approach over the existing technique.

Keywords: Urdu stemmer, infix classes, infix rules, stemming rules, stemming lists.

Received September 5, 2015; accepted June 1, 2016

1. Introduction

Stemming is a fundamental pre-processing step for textual data, preceding the tasks of text mining, information retrieval, and natural language processing. The primary goal of any stemmer is to improve search effectiveness so that an information retrieval system can respond to a user query accurately. In linguistic morphology, stemming is the process of producing the stem or root form of a word by reducing its inflected or derived form. Urdu is the national language of Pakistan and a state language of India. It is an Indo-Aryan language and is written from right to left. Urdu is widely spoken in India; Indian states such as Delhi and Uttar Pradesh use Urdu as an official language. According to an Indian survey in 2011, 5% of the Indian population also speaks Urdu. More than 200 million people use Urdu.

Urdu vocabulary draws on many other languages, e.g., English, Arabic, Persian, Turkish, and Hindi. The word 'Urdu' itself comes from Turkish. Each of these source languages has its own complex morphological structure, and because of the rich morphology of the borrowed languages, Urdu is a morphologically very rich language. Urdu is rich in both inflectional and derivational morphology [2]. Morphology is the study of the internal structure of words [3]. Inflectional morphology concerns the grammatical formation of words. Generating new words from existing words is called derivational morphology. The major element of Urdu morphology is the morpheme, the smallest language unit that carries meaning. Morphemes are of two types: free and bound [8]. An information retrieval (IR) system works on the base or root form of words rather than on their inflected or derived forms. Therefore, to boost the performance of an IR system, it is very important to develop an Urdu stemmer that can generate the stems of this morphologically rich language.
A stemmer is an algorithm that generates the stem or root form of a word. An Urdu stemmer produces the stem of a word by removing the prefix, infix, and postfix attached to it; e.g., the words glossed as (news), (news), (newspapers), (newspapers), and (newspapers) all share the stem glossed as (news).

The rest of the paper is organized as follows. Section 2 briefly reviews the state of the art in stemming. The proposed Urdu stemming approach is detailed in section 3. Experiments are discussed in section 4 to demonstrate the effectiveness of the proposed approach. Finally, the last section presents the conclusion.

2. Related Work

Stemming can be performed using three common approaches: affix stripping, table lookup, and statistical methods [4]. The affix removing approach depends on the morphological structure of the given language. It obtains the stem of a word by removing the attached prefix and postfix. The well-known Porter stemmer is an example of this approach [17]. In the table lookup approach, each word and its associated stem are stored in a structured table. This approach requires a lot of storage space, and its table must be updated manually for each new word. In statistical approaches, word formation rules are developed from a corpus, depending on its size. The methodologies used include frequency count, n-gram [13], Hidden Markov Models [15], and link analysis [5].

Many stemming methods have been proposed for a variety of languages, e.g., English [12, 16, 17], Arabic [10, 20], and Persian [12, 19]. These stemming methods follow a rule-based strategy. The literature also contains many stemming methods [13, 14] developed using the statistical approach. Rule-based approaches depend heavily on deep morphological knowledge of the language, whereas statistical analysis depends on corpus size.

The study [11] developed the first stemming method for English. It is rule based and comprises 260 stemming rules. It generates the stem of an English word in two phases. In the first phase, the longest matching suffix defined in the suffix table is removed and the word is recoded to generate a suitable stem. Spelling exceptions are handled in the second phase. This stemmer is known as the Lovins stemmer. Dawson [6] came up with another rule-based stemming method. It extends the J.B. Lovins stemmer and covers a comprehensive list of 1200 suffixes. The suffixes are stored in reversed order, indexed by their length and last character. This method covers more suffixes than the Lovins stemmer.
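The longest-match suffix removal described above for the Lovins stemmer can be sketched in a few lines. This is only a minimal illustration: the suffix list, the `min_stem_len` guard, and the function name are assumptions made for the example, not the actual Lovins rule set, and the recoding phase is omitted.

```python
# Toy suffix list for illustration only (an assumption, not the Lovins table).
TOY_SUFFIXES = ["ings", "ness", "ing", "s"]


def strip_longest_suffix(word, suffixes=TOY_SUFFIXES, min_stem_len=2):
    """Remove the longest listed suffix that leaves at least min_stem_len characters."""
    # Try longer suffixes first so that "ings" is preferred over "s".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    return word  # no listed suffix could be removed


for w in ["meetings", "kindness", "cats", "is"]:
    print(w, "->", strip_longest_suffix(w))
```

Sorting by length on every call keeps the sketch short; a real implementation would more likely pre-sort or index the suffix table once.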
Porter [17, 18] developed a rule-based stemmer for English, simplifying the rules of the Lovins stemmer to about 60 rules. In this method, suffixes are removed from words using a suffix list, and conditions are enforced to decide which suffixes to detach. This is one of the most popular stemming methods for English text and is known as the Porter stemming algorithm. Porter also designed a stemming framework referred to as "Snowball", whose objective is to allow programmers to develop their own stemmers for other languages. Porter [17, 18] also identified the problems of over-stemming, under-stemming, and mis-stemming. Paice [16] came up with another rule-based stemming method. It is an iterative algorithm based on a table of 120 rules indexed by the last letter of a suffix. On each iteration it tries to find an applicable rule using the last character of the word. Each rule either deletes or replaces an ending. If no rule is found, the algorithm terminates.

Many stemming algorithms have also been developed for South Asian languages. Khoja and Garside [10] developed a superior root-based stemming method for Arabic. It generates the stem of an Arabic word by removing the prefix, infix, and suffix and then applying pattern matching. To improve stemming accuracy, this stemmer uses several linguistic data files, i.e., punctuation characters, diacritic characters, and a list of 168 stop words. For Arabic text, Thabet [20] proposed a light stemming approach. It is rule based and is applied to the classical Arabic of the Quran. This stemmer generates the list of words from each surah. If a word in the list is not found in the stop word list, its prefix is truncated. The stemming accuracy of the algorithm is 99.6% for prefix stemming and 97% for postfix stemming.
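The iterative, table-driven scheme described above for the Paice stemmer can be illustrated with a small sketch. The three-entry rule table, the `keep_going` flag, and the `min_len` guard are hypothetical choices made for this example; they are not the 120 rules of the actual stemmer.

```python
# Hypothetical toy rules: last letter -> [(ending, replacement, keep_going)].
TOY_RULES = {
    "s": [("ies", "y", False), ("s", "", True)],
    "g": [("ing", "", True)],
    "d": [("ed", "", True)],
}


def iterative_stem(word, rules=TOY_RULES, min_len=3):
    """Repeatedly apply the first matching rule found under the word's last letter."""
    while True:
        for ending, replacement, keep_going in rules.get(word[-1:], []):
            candidate = word[: len(word) - len(ending)] + replacement
            if word.endswith(ending) and len(candidate) >= min_len:
                word = candidate  # delete or replace the ending
                break
        else:
            return word  # no rule applied: terminate
        if not keep_going:
            return word  # the applied rule asks to stop here


for w in ["jumpings", "ponies", "walked", "is"]:
    print(w, "->", iterative_stem(w))
```

Indexing the table by the last letter means each iteration inspects only the handful of rules that could possibly match, which is the point of the layout described in the text.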
Tashakori [19] came up with the first Persian stemmer, called Bon, which is rule based. It is an iterative longest-matching algorithm that removes all possible prefixes and suffixes from the word as required. After truncation of prefixes and suffixes, a recoding technique is used to generate a valid stem. With Bon, recall improved by 40%. Mokhtaripour [12] developed another rule-based stemming method for Persian. This stemmer generates the stem of Persian text without using a language dictionary. The performance of a query system improved by up to 46% with this stemmer.

For Urdu, stemming methods [1, 7, 8, 9] have been proposed, i.e., Assas-Band [2], a lightweight stemmer for Urdu text [8], and a novel stemming approach for Urdu. These methods generate the stem by removing the prefix and postfix present in Urdu words. The stemmers of [2, 7] depend heavily on very large rule lists as well as exception lists, and these large lists significantly affect their efficiency. Although Urdu draws on many other languages such as English, Arabic, Persian, and Turkish, the existing stemming approaches [2, 7] are unable to generate the stem of words that belong to borrowed languages. In Urdu morphology, many words contain an infix in addition to a prefix and postfix. Truncating the infix from Urdu words is very important for an effective Urdu stemmer, yet existing Urdu stemmers do not address infix stemming. Our proposed stemming method is the first work capable of generating the stem of Urdu words as well as borrowed words by removing the prefix, infix, and suffix attached to them.

3. Proposed Urdu Stemmer

In this section, we describe our proposed Urdu stemming method. The stemmer is based on the rule-based affix stripping approach and generates the stem of Urdu as well as borrowed words. This Urdu stemming approach comprises various infix word classes, stemming rules, a stemming list, and a stem word dictionary.

3.1. Stemming Rules

In this stemmer, we have developed three kinds of stemming rules: prefix, infix, and postfix rules. The existing state-of-the-art approaches [2, 7] have also developed prefix and postfix stemming rules, but they present a huge set of rules. In this work, we have minimized the existing stemming rules and proposed generic rules that can be applied to any type of Urdu word. Our rules are also capable of producing the stem of borrowed words.

3.1.1. Minimum Word Length Rule

After a detailed analysis of Urdu morphology, it is observed that an Urdu word comprising only two or three characters is already a stem word. For example, the words glossed as (day), (night), and (time) are already stemmed words. Such words are treated as stem words and filtered out to avoid further stemming. This rule is a novel contribution of the proposed Urdu stemmer. Some example words of this rule are given in Table 1.

Table 1. Examples of words handled by the minimum word length rule.

3.1.2. Prefix Removing Rules

A prefix is a morpheme that is attached to the beginning of a word. Urdu morphology has its own term for it. A prefix may consist of one or two characters, and sometimes it is a complete word. In this stemmer we have developed a list of 60 generic prefixes. Some instances of prefix rules are presented in Table 2.

Table 2. Examples of prefix stripping rules.
بر بد غیر ال براے گل در خود غم تنگ نا تا

3.1.3. Infix Removing Rules

Infix stemming is the most prominent part of the proposed stemming method. Most of Urdu grammar is influenced by Arabic grammar, so Urdu morphology has inherited features of this parent language. Handling borrowed words is also a significant contribution of the proposed method.
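The minimum word length rule and prefix stripping described above can be combined in a short sketch. The prefixes are a handful taken from Table 2, while the longest-match order, the requirement that at least two characters remain, the treatment of one-character words, and the function names are assumptions made for illustration. Words are assumed to carry no diacritics, since `len` counts code points.

```python
# A few prefixes from Table 2; the paper's full list of 60 is not reproduced here.
PREFIXES = ["بد", "نا", "در", "خود"]


def is_short_stem(word):
    """Minimum word length rule: two- or three-character words are already stems.

    Words shorter than two characters are also left untouched (an assumption).
    """
    return len(word) <= 3


def strip_prefix(word, prefixes=PREFIXES):
    """Strip one listed prefix unless the minimum word length rule applies."""
    if is_short_stem(word):
        return word
    # Longest prefix first (an assumption; the paper does not state an order).
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            return word[len(prefix):]
    return word


# "بدنام" loses the prefix "بد"; the short words "نام" and "دن" are kept as they are.
for w in ["بدنام", "نام", "دن"]:
    print(w, "->", strip_prefix(w))
```

Note how the three-character word "نام" begins with the listed prefix "نا" but is protected by the length check, which is the kind of over-stemming the minimum word length rule is meant to prevent.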
After a detailed study of Urdu morphology, it is observed that most words containing infixes belong to Arabic. To handle Urdu infix stemming, we have proposed six Urdu infix word classes: Alif Arabic Masdar (infinitive verbs beginning with Alif), Te Arabic Masdar (infinitive verbs beginning with Te), Isam Fiale (active subject), Isam Mafool (passive object), Arabic Jamah (Arabic plural words), and Isam Zarf Makaan (place-showing noun). To remove infixes from words that belong to these classes, we have defined a variety of infix rules. To identify the Arabic words to which the infix rules apply, a specific set of characters is verified in the Urdu word. Infix rules are grouped by the infix class they handle.

1. Alif Arabic Masdar (infinitive verbs beginning with Alif) Class Infix Stripping Rules: To remove the infixes of this class, we have developed the following rules.
Rule 1: If the word starts with Alif (ا) and the length of the word is exactly five, then remove all Alif (ا) from the word. Words handled by this rule are given in Table 3.
Rule 2: If the word starts with Alif (ا) and the length of the word is greater than five, then remove all Alif (ا), Te (ت), Sin (س), Chhoti Yeh (ی), Nun Gunna (ں), Chhoti Yeh Hamza (ئ), Wao Hamza (ؤ), and Hamza (ء) from the word. Words handled by this rule are presented in Table 3.
Rule 3: If the word starts with Alif (ا), the character at index one is Te (ت), and the length of the word is exactly five, then remove all Alif (ا), Chhoti Yeh (ی), Nun Gunna (ں), Chhoti Yeh Hamza (ئ), Wao Hamza (ؤ), Hamza (ء), and Wao (و) from the word. Words stemmed by this rule are shown in Table 3.
Rule 4: If the word starts with Alif (ا), the character at index two is Sin (س), and the length is greater than five, then remove all Alif (ا), Te (ت), Nun Gunna (ں), Chhoti Yeh (ی), Chhoti Yeh Hamza (ئ), Wao Hamza (ؤ), Do Chashmi He (ھ), Badi Yeh (ے), Hamza (ء), and Wao (و) from the word. Words handled by this rule can be seen in Table 3.
Rule 5: If the word starts with Alif (ا), the character at index three is Sin (س), and the length is greater than five, then remove all Alif (ا), Te (ت), Nun Gunna (ں), Chhoti Yeh (ی), Chhoti Yeh Hamza (ئ), Wao Hamza (ؤ), Do Chashmi He (ھ), Badi Yeh (ے), Hamza (ء), and Wao (و) from the word. Words handled by this rule are given in Table 3.

Table 3. Examples of words handled by the Alif (ا) Arabic Masdar infix class.

2. Te Arabic Masdar (infinitive verbs beginning with Te) Class Infix Stripping Rules: To remove the infixes from words that belong to this class, we have proposed the following infix rules.
Rule 1: If the word starts with Te (ت) and also contains Alif (ا), then remove all Alif (ا), Te (ت), Chhoti Yeh (ی), Nun Gunna (ں), Chhoti Yeh Hamza (ئ), and Badi Yeh (ے) from the word. Words stemmed by this rule are presented in Table 4.
Rule 2: If the word starts with Te (ت), the length of the word is exactly five, and the second-last character of the word is Chhoti Yeh (ی), then remove all Te (ت), Chhoti Yeh (ی), Nun Gunna (ں), and Badi Yeh (ے) from the word. Words handled by this rule are given in Table 4.

Table 4. Examples of words handled by the Te (ت) Arabic Masdar infix class.

3. Isam Fiale (Active Subject) Class Infix Stripping Rules: To remove the infixes of this class, we have developed the following rules.
Rule 1: If the word length is exactly four and the word contains Alif (ا), then remove all Alif (ا) from the word. Some example words of this rule are given in Table 5.
Rule 2: If the word length is exactly four and the second-last character of the word is Chhoti Yeh (ی), then remove all Chhoti Yeh (ی) from the word. Words handled by this rule are given in Table 5.

Table 5. Examples of words handled by the Isam Fiale infix class.

4. Isam Mafool (Passive Object) Class Infix Stripping Rules: To remove infixes from the words related to this class, the following rule is developed.
Rule: If the word starts with Meem (م), the length of the word is exactly five, and the second-last character of the word is Wao (و), then remove all Wao (و) and Meem (م) from the word. Words handled by this rule are presented in Table 6.

Table 6. Examples of words handled by the Isam Mafool infix class.

5. Arabic Jamah (Arabic plural words) Class Infix Stripping Rules: To remove the infixes from words that belong to this class, we have proposed the following infix rule.
Rule: If the word length is exactly four and the second-last character of the word is Wao (و), then remove all Wao (و) from the word. Some example words of this rule are given in Table 7.

Table 7. Examples of words handled by the Arabic Jamah infix class.

6. Arabic Jamah and Isam Fiale (Arabic plurals and Active subject) beginning with Meem (م) Infix Stripping Rules: To remove infixes from words of this class, we have proposed the following rules.
Rule 1: If the word starts with Meem (م) and also contains Alif (ا), then remove all Alif (ا), Te (ت), Nun Gunna (ں), Chhoti Yeh (ی), Badi Yeh (ے), and Chhoti He (ہ) from the word.
Rule 2: If the word starts with Meem (م), the character at index two is Te (ت), and the length of the word is exactly five, then remove all Te (ت), Nun Gunna (ں), Chhoti Yeh (ی), and Badi Yeh (ے) from the word.

Table 8. Examples of words handled by the Arabic Jamah and Isam Fiale beginning with Meem (م) infix rules.

On the other hand, we cannot remove the prefix from the prefix rules list, because this prefix generates the stem of many other important words. Therefore, to keep the meaning of such words intact, they should be treated as exceptional cases. In the proposed Urdu stemmer, we have developed an exception list of about 5000 words, which is significantly smaller than the lists of the existing state-of-the-art stemming techniques [7, 8].

3.1.4. Postfix Removing Rules

A postfix is a morpheme that is attached to the end of a word. Urdu morphology has its own term for it. A postfix may consist of one or two characters and may sometimes be a complete word. A list of 140 suffixes was generated after a deep study of Urdu grammar and literature books. Examples of these suffixes are presented in Table 9.

Table 9. Examples of postfix stripping rules.

3.1.5. Rules for Borrowed/Loan Words

Urdu morphology is derived from different borrowed languages i.e
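A few of the infix rules of Section 3.1.3 can be sketched as follows. This is a minimal sketch under stated assumptions: only five of the rules are covered, the order in which they are tried is arbitrary, the check that a word is of Arabic origin is omitted, words are assumed to carry no diacritics, and the function names and example words are illustrative rather than taken from the paper's tables.

```python
ALIF = "\u0627"        # ا
WAO = "\u0648"         # و
MEEM = "\u0645"        # م
CHHOTI_YEH = "\u06cc"  # ی


def remove_letters(word, letters):
    """Drop every occurrence of the given letters from the word."""
    return "".join(ch for ch in word if ch not in letters)


def strip_infix(word):
    """Apply a subset of the infix rules; return the word unchanged if none fires."""
    n = len(word)
    # Alif Arabic Masdar, Rule 1: starts with Alif, length exactly five.
    if n == 5 and word[0] == ALIF:
        return remove_letters(word, {ALIF})
    # Isam Mafool: starts with Meem, length five, second-last character is Wao.
    if n == 5 and word[0] == MEEM and word[-2] == WAO:
        return remove_letters(word, {WAO, MEEM})
    # Isam Fiale, Rule 1: length four and contains Alif.
    if n == 4 and ALIF in word:
        return remove_letters(word, {ALIF})
    # Isam Fiale, Rule 2: length four, second-last character is Chhoti Yeh.
    if n == 4 and word[-2] == CHHOTI_YEH:
        return remove_letters(word, {CHHOTI_YEH})
    # Arabic Jamah: length four, second-last character is Wao.
    if n == 4 and word[-2] == WAO:
        return remove_letters(word, {WAO})
    return word


# Illustrative words chosen for this sketch (not from the paper's tables).
EXAMPLES = [
    "\u0627\u062e\u0628\u0627\u0631",  # اخبار
    "\u0645\u0646\u0638\u0648\u0631",  # منظور
    "\u0639\u0627\u0644\u0645",        # عالم
    "\u0631\u062d\u06cc\u0645",        # رحیم
    "\u0639\u0644\u0648\u0645",        # علوم
]
for w in EXAMPLES:
    print(w, "->", strip_infix(w))
```

Because every rule is a length test plus one or two position tests followed by letter deletion, each class fits in a single conditional; the remaining rules would follow the same shape with longer letter sets.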

