
Syntactic Annotation Guidelines for the Quranic Arabic Dependency Treebank

Kais Dukes, Eric Atwell and Abdul-Baquee M. Sharaf
School of Computing, University of Leeds, LS2 9JT, United Kingdom
E-mail: sckd@leeds.ac.uk, csc6ea@leeds.ac.uk, a.m.sharaf08@leeds.ac.uk

Abstract

The Quranic Arabic Dependency Treebank (QADT) is part of the Quranic Arabic Corpus (http://corpus.quran.com), an online linguistic resource organized by the University of Leeds and developed through online collaborative annotation. The website has become a popular study resource for Arabic and the Quran, and is now used by over 1,500 researchers and students daily. This paper presents the treebank, explains the choice of syntactic representation, and highlights key parts of the annotation guidelines. The text being analyzed is the Quran, the central religious book of Islam, written in classical Quranic Arabic (c. 600 CE). To date, all 77,430 words of the Quran have a manually verified morphological analysis, and syntactic analysis is in progress. 11,000 words of Quranic Arabic have been syntactically annotated as part of a gold standard treebank. Annotation guidelines are especially important to promote consistency for a corpus which is being developed through online collaboration, since often many people will participate from different backgrounds and with different levels of linguistic expertise. The treebank is available online for collaborative correction to improve accuracy, with suggestions reviewed by expert Arabic linguists and compared against existing published books of Quranic syntax.

1. Introduction

Annotating an Arabic corpus presents a set of unique challenges when compared to linguistic annotation for texts in other languages, due to complex orthography and highly inflected morphology (Habash, 2007; Habash, Rambow & Roth, 2008). Annotation guidelines are especially important for a corpus developed through online collaboration. Correct annotation of the Quran requires not only a deep understanding of Arabic linguistics, but also of the source material, the Quran itself. Given the importance of the Quran to the Islamic faith, any syntactic annotation needs to be carefully considered, since alternative parses for a sentence can suggest alternative meanings for the scripture in certain cases. Fortunately, the unique form of Arabic in which the Quran has been inscribed has been studied in detail for over 1,000 years (Jones, 2005; Ansari, 2000). This is far longer than the corresponding grammatical traditions of most other languages, and in fact traditional Arabic grammar is considered to be one of the origins of modern dependency grammar (Kruijff, 2006; Owens, 1988).

In the Arabic-speaking world, there is a long tradition of understanding the Quran through grammatical analysis, and over the centuries this knowledge has accumulated in a grammatical framework known as i'rāb. The key insight in developing the Quranic Arabic Dependency Treebank is that instead of using an alternative theory of Arabic syntax, the treebank should attempt to adopt as much of traditional i'rāb as possible. This contrasts with the approaches used in other recent Arabic treebanks, but has brought many benefits to the project. For example, the Penn Arabic Treebank (Maamouri, Bies & Buckwalter, 2004) follows constituency phrase structure grammar, whereas the Prague Arabic Treebank (Smrz & Hajic, 2006) uses a form of dependency grammar known as Functional Generative Description. Using familiar syntax and terminology for the Quranic Arabic Treebank has attracted volunteer Quranic scholars and expert Arabic linguists to the project. In addition, the many detailed published works on Quranic syntax can be leveraged to verify syntactic annotation for each verse of the Quran.

Figure 1: A hybrid dependency graph.

This paper is organized as follows. Section 2 introduces traditional Arabic grammar and describes the annotation process, including a description of the syntactic relations used to label dependency graphs. Section 3 highlights key parts of the full annotation guidelines¹, and Section 4 concludes.

¹ The treebank and accompanying documentation are available online at http://corpus.quran.com/treebank.jsp.

2. Syntactic Annotation of Quranic Arabic

2.1 Traditional Arabic Grammar

Arabic is a morphologically rich language, and is highly inflected. One motivation for the historic development of traditional Arabic grammar has been to understand functional inflection. Nouns can be found in one of three cases (the nominative, genitive or accusative case). Each of these grammatical cases is realized through a different case ending, which results in the noun being pronounced in a slightly different way, and written using different vowelized diacritics. Similarly, imperfect verbs are found in three main moods (the indicative, subjunctive or jussive). A fundamental aim of historical traditional Arabic grammar is to explain the reason for the inflection of each noun and verb in a sentence based on syntactic function. For example, when a noun is the subject of a verb it is found in the nominative case, yet when it is the object of a verb, it is found in the accusative case and is written using an alternative vowelized case ending (Mace, 2007; Muhammad, 2007).

Relating inflection to syntactic function for the entire Arabic language requires a sophisticated grammatical framework, capable of handling multiple parts-of-speech and a wide variety of linguistic constructions and grammatical dependencies. By adopting traditional Arabic grammar, the Quranic Treebank is more accessible to the wider public as an educational resource, and in addition the project attracts a larger number of volunteers, including experts who have received formal training in i'rāb. Using more familiar terminology also speeds up the syntactic annotation process (Habash, Faraj & Roth, 2009).

However, traditional i'rāb is challenging to represent computationally. Unlike in English, where words are typically assigned a single part-of-speech, the fundamental syntactic unit in i'rāb is not a word but a morphological word segment. Quranic Arabic is morphologically rich, and often a single word will consist of a stem with multiple fused prefixes and suffixes. Each of these morphological segments is assigned a part-of-speech in traditional Arabic grammar, and can take an independent syntactic role in the sentence that influences inflection (Figure 1). Syntactic dependencies between morphological word segments are a unique complexity not found in languages such as English. For example, an Arabic noun with a fused preposition prefix will always be inflected for the genitive case (Akesson, 2001). Together these two morphological segments form a syntactic preposition phrase, even though this is written as a single whitespace-delimited word.

The Quranic Treebank introduces a novel approach to annotating these traditional Arabic grammatical relations. Dependency graphs are used to visualize the syntax of the Quran. This is not only a useful educational resource, but is also a machine-readable representation of Quranic grammar suitable for further research. The syntactic representation adopted in the treebank is a hybrid dependency / constituency phrase structure model. This is motivated by the fact that the Quranic treebank closely follows traditional grammar, and this representation is flexible enough to represent nearly all aspects of traditional syntax. Dependency graphs are used in the treebank to show relations between words, but relations between phrases are also possible by introducing non-terminal nodes.

Figure 1 shows a hybrid dependency graph. Arabic is read from right-to-left and directed edges in the graph point from dependent nodes toward head nodes. The terminal nodes are morphological segments. The graph also makes use of a non-terminal phrase node. This node, marked as S, represents a sentence which fills the role of a predicate. The above analysis could be collapsed into a pure dependency graph without non-terminal nodes, by using a transformation in which a relation that ends at a node is applicable to the entire sub-graph headed by that node. However, by using non-terminal nodes, the treebank more accurately follows historical analysis, since traditional Arabic grammar often describes relations between phrases, as well as between words and word segments. This representation has also been found to be more easily understood by annotators who are native Arabic speakers, who use existing published works of Quranic grammar as a reference to verify syntactic annotation in the Treebank.

2.2 The Syntactic Annotation Process

The annotation methodology used in the Quranic Arabic Dependency Treebank follows an iterative approach, involving different stages of annotation. A rule-based dependency parser developed specifically for Quranic Arabic is used to perform initial syntactic analysis, with an F-measure accuracy of 78% (Dukes & Buckwalter, 2010).

Figure 2: Iteration stages in the annotation process. The stages shown are automatic analysis (dependency parser), manual verification (linguistic expert) supported by trusted publications and the annotation guidelines, and online collaborative annotation (corpus.quran.com).

The manual stages do not involve annotators performing complete syntactic annotation, but rather correction of automatic annotation performed by the dependency parser. Using a parser not only speeds up annotation but encourages greater internal consistency. The same construct should get the same automatic analysis, leaving proofreaders to focus on correcting exceptional cases.
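The hybrid representation described in Section 2.1, and the collapse into a pure dependency graph mentioned there, can be sketched in Python. This is only an illustrative sketch, not the treebank's actual data model: the class names `Node` and `Edge`, the `head` field on phrase nodes, and the toy node names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: a morphological segment (terminal) or a phrase (non-terminal)."""
    name: str
    terminal: bool = True
    members: list = field(default_factory=list)  # nodes covered by a phrase node
    head: str = ""  # head of a phrase node (hypothetical field, empty for terminals)


@dataclass
class Edge:
    dependent: str  # edges point from the dependent node...
    head: str       # ...toward the head node
    label: str


def collapse(nodes, edges):
    """Return pure-dependency edges by replacing each phrase node with its head segment."""
    def resolve(name):
        node = nodes[name]
        # A phrase may be headed by another phrase, so resolve recursively.
        return name if node.terminal else resolve(node.head)

    return [Edge(resolve(e.dependent), resolve(e.head), e.label) for e in edges]


# Toy graph (placeholder names, not a real verse): a sentence node S, headed by
# its verb, fills the role of predicate for a preceding noun.
nodes = {
    "noun": Node("noun"),
    "verb": Node("verb"),
    "pron": Node("pron"),
    "S": Node("S", terminal=False, members=["verb", "pron"], head="verb"),
}
edges = [
    Edge("pron", "verb", "subj"),
    Edge("S", "noun", "pred"),
]
flat = collapse(nodes, edges)
for e in flat:
    print(f"{e.dependent} -> {e.head} ({e.label})")
```

In the collapsed output the `pred` edge starts at the verb rather than at S, which is the sense in which a relation attached to a node applies to the whole sub-graph headed by that node. The hybrid form keeps the phrase node explicit instead.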

The second stage of annotation involves manual verification and correction by an Arabic linguistic expert. Using this approach, a single annotator working part-time was able to produce an accurately annotated syntactic dependency treebank of 11,000 words in three months, amounting to 14% of the total 77,430 words in the Quran. The syntactic parses are initially verified by comparing against both existing trusted publications of Quranic grammar and the full annotation guidelines for the project (see Figure 2).

Given the importance of the Quran as a central religious text, a wide variety of interested volunteers regularly participate in the annotation effort online, effectively turning the project into a community effort through online collaborative annotation. While researchers and students make use of the annotated corpus, they are able to add comments to any annotation that they might disagree with, or that they feel requires further clarification. This leads to discussion with other users through an online message board forum (http://corpus.quran.com/messageboard.jsp). The Quranic grammar message board promotes active discussion, with over 4,000 messages posted over the past 6 months. Some online discussion involves inaccurate suggestions by beginners that are usually resolved through a deeper understanding of Quranic grammar. However, when genuine corrections are presented through online collaborative annotation, these are then referred back to a linguistic expert, who can verify these suggestions against both the annotation guidelines and trusted publications of Quranic syntax, which include books on Quranic grammar as well as Arabic morphological dictionaries (Nadwi, 2006; Omar, 2005; Siddiqui, 2008; Wightwick & Gaafar, 2008). General users are also encouraged to consult these types of additional information before posting suggested corrections.

2.3 Syntactic Dependency Relations

Traditional Arabic grammar defines several syntactic dependency relations, such as an adjective describing a noun, or a subject relation linking a noun to the verb on which it depends. Figure 3 shows a complete list of the syntactic dependency relations currently annotated in the Quranic Arabic Dependency Treebank. The full list of part-of-speech tags used to label word segments is discussed as part of morphological annotation of the Quranic Arabic Corpus (Dukes and Habash, 2010).

Cat*  Rel    Description
1     adj    Adjective
1     poss   Possessive construction
1     pred   Predicate of a subject
1     app    Apposition
1     spec   Specification
1     cpnd   Compound (numbers)
2     subj   Subject of a verb
2     pass   Passive subject
2     obj    Object of a verb
2     subjx  Subject of a special verb
2     predx  Predicate of a special verb
2     impv   Imperative
2     imrs   Imperative result
2     pro    Prohibition
3     gen    Preposition phrase (PP)
3     link   PP attachment
3     conj   Coordinating conjunction
3     sub    Subordinate clause
3     cond   Condition
3     rslt   Result
4     circ   Circumstantial accusative
4     cog    Cognate accusative
4     prp    Accusative of purpose
4     com    Comitative
5     …      …uture …tion
5     eq     Equalization
5     caus   Cause
5     amd    Amendment

*Categories: 1 Nominal dependencies, 2 Verbal dependencies, 3 Phrases and clauses, 4 Adverbial dependencies, 5 Particle dependencies

Figure 3: Edge labels for syntactic dependency relations.

Each of the syntactic relations shown in Figure 3 is used to label edges in dependency graphs in the Quranic Treebank. The list of Arabic dependency tags is taken directly from traditional Arabic grammar, and mapped to equivalent English terms as found in comprehensive publications on Arabic grammatical theory (Haywood & Nahmad, 2005; Ryding, 2008). This approach contrasts with other Arabic treebanks (such as the Penn and Prague treebanks), where existing tagging schemes for other languages such as English are adapted to Arabic.
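As a rough illustration of how the edge labels of Figure 3 might be used programmatically, the sketch below stores the legible tags in a lookup table and flags any edge label that is not in it. The names `TAGS`, `RELATIONS`, `describe` and `unknown_labels` are hypothetical, only the tags that survive in the figure are included, and the grouping of tags under the five categories is inferred from the order of the list, so it should be read as an assumption.

```python
# Category numbers and names as given in the footnote to Figure 3.
CATEGORIES = {
    1: "Nominal dependencies",
    2: "Verbal dependencies",
    3: "Phrases and clauses",
    4: "Adverbial dependencies",
    5: "Particle dependencies",
}

# Tag -> description, grouped by category (grouping is an assumption).
TAGS = {
    1: {"adj": "Adjective", "poss": "Possessive construction",
        "pred": "Predicate of a subject", "app": "Apposition",
        "spec": "Specification", "cpnd": "Compound (numbers)"},
    2: {"subj": "Subject of a verb", "pass": "Passive subject",
        "obj": "Object of a verb", "subjx": "Subject of a special verb",
        "predx": "Predicate of a special verb", "impv": "Imperative",
        "imrs": "Imperative result", "pro": "Prohibition"},
    3: {"gen": "Preposition phrase (PP)", "link": "PP attachment",
        "conj": "Coordinating conjunction", "sub": "Subordinate clause",
        "cond": "Condition", "rslt": "Result"},
    4: {"circ": "Circumstantial accusative", "cog": "Cognate accusative",
        "prp": "Accusative of purpose", "com": "Comitative"},
    5: {"eq": "Equalization", "caus": "Cause", "amd": "Amendment"},
}

# Flat lookup: tag -> (category number, description).
RELATIONS = {tag: (cat, desc)
             for cat, tags in TAGS.items()
             for tag, desc in tags.items()}


def describe(tag):
    """Return a readable description of a relation tag."""
    cat, desc = RELATIONS[tag]
    return f"{tag}: {desc} [{CATEGORIES[cat]}]"


def unknown_labels(labels):
    """Return the labels that are not in the lookup table, in input order."""
    return [label for label in labels if label not in RELATIONS]


print(describe("link"))
print(unknown_labels(["subj", "obj", "xyz"]))
```

A check of this kind could be run over the output of an automatic parser before manual verification, so that mistyped edge labels are caught early.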

3. Annotation Guidelines

The syntactic annotation guidelines for the Quranic Treebank have been built up over time, and developed during the course of the project. The guidelines are extended whenever online collaborative annotation raises a new linguistic construction that requires further clarification in order to enforce consistency in the corpus. This section highlights key parts of the syntactic annotation guidelines which illustrate a variety of different syntactic constructions in Quranic Arabic, and discusses how these are handled in the traditional Arabic grammar of i'rāb. The full set of guidelines covering a wider range of linguistic constructions is available online …sp.

3.1 Verbs, Subjects and Objects

Traditional grammar places linguistic constraints on the possible analysis of a sentence. One such constraint is that every verb must have a subject. This will be either an explicit terminal node of the graph (a word or morphological word segment), or otherwise an implicit hidden node used to fill this syntactic role. A verb may optionally accept an object, and ditransitive verbs will take two objects.

Figure 4: A verb with its dependent subject and object.

Reading Figure 4 from right-to-left, the verb is followed by a subject and then its object. VSO word order is typical in Arabic, although other word orders are also possible and are not ambiguous, since a subject will always be inflected for the nominative case, and objects are always found in the accusative case (Haywood & Nahmad, 2005). Passive verbs do not have subjects associated with them. Instead, traditional Arabic grammar defines a syntactic role named nāib fā'il, which may be translated as the "passive subject representative". As with active verbs, a similar constraint exists so that this role must always be filled either explicitly or else implicitly through a hidden node. Figure 5 shows an example of a passive verb followed by its subject representative.

Figure 5: Syntactic annotation of a passive verb.

The above dependency graph also contains a conditional relation between the first word (99:1:1) and the following phrase. In Arabic, the word idhā appears as a conditional particle when used in a temporal sense, and is usually translated as "when". The clause following this word will be the protasis of a conditional statement, and will often be a clause or sentence beginning with a verb. The other two dependencies in the graph are the cognate accusative and the possessive construction, also known as the genitive construction.

3.2 Hidden and Empty Nodes

Quranic Arabic is a pro-drop language. Certain verbs imply a pronoun subject through inflection which may be dropped from the sentence (Fischer & Rodgers, 2002). Traditional Arabic grammar restores these dropped words, which are known as damīr mustatir. Although this adds no new information to a sentence, the advantage of this approach is that these nodes satisfy constraints and can be referenced later, for example as part of anaphora resolution. Different inflected hidden pronouns are used depending on the verb's person, gender and number. An additional benefit of showing implicit hidden pronouns in the treebank is that an annotator can quickly determine if the verb has been tagged with correct inflection features.

Figure 6 shows two sentences related through conjunction. Each sentence has a verb with an implicit subject pronoun, shown in gray and in brackets in the dependency graph. In addition to hidden nodes, dependency graphs may also include empty nodes used to fill syntactic roles. These are shown in the treebank using the asterisk notation (*). For a discussion of empty nodes, see (Dukes & Buckwalter, 2010).
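The constraint from Sections 3.1 and 3.2, that every verb needs a subject (or, for a passive verb, a passive subject representative) filled either explicitly or by a hidden node, can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions rather than the treebank's own procedure: the `Segment` class, the `"V"` and `"PRON"` tags, the bracketed naming of hidden nodes and the function `fill_subjects` are all hypothetical, and special verbs that take the `subjx` relation are not handled.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    pos: str               # "V" marks a verb (hypothetical tag set)
    passive: bool = False
    hidden: bool = False   # True for restored implicit pronouns


def fill_subjects(segments, edges):
    """Add a hidden pronoun for each verb whose subject role is unfilled.

    edges is a list of (dependent, head, label) triples. Active verbs need a
    "subj" dependent and passive verbs a "pass" dependent.
    """
    segments = list(segments)
    edges = list(edges)
    for seg in list(segments):
        if seg.pos != "V":
            continue
        role = "pass" if seg.passive else "subj"
        if any(head == seg.name and label == role for _, head, label in edges):
            continue  # role already filled by an explicit node
        pron = Segment(f"({seg.name}-pron)", "PRON", hidden=True)
        segments.append(pron)
        edges.append((pron.name, seg.name, role))
    return segments, edges


# Two verbs joined by conjunction, neither with an explicit subject
# (placeholder names, loosely mirroring the situation in Figure 6).
segs = [Segment("v1", "V"), Segment("v2", "V")]
deps = [("v2", "v1", "conj")]
new_segs, new_deps = fill_subjects(segs, deps)
for dep in new_deps:
    print(dep)
```

Because the hidden pronouns become ordinary nodes in the result, later steps such as anaphora resolution can refer to them by name, which is the benefit noted in Section 3.2.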

Figure 6: Implicit hidden pronouns.

3.3 Preposition Phrase Attachment

Prepositions are easily identified in Quranic Arabic since they always modify the following noun, which will be found in the genitive case. The preposition and its object form a phrase in traditional Arabic grammar known as jār wa majrūr. A dependency relation named muta'aliq is used to annotate preposition phrase (PP) attachment. This relation may be translated as "link" or "attachment". A constraint of the grammar is that a preposition phrase must always be linked to another head node, which is usually either a verb or a noun (Ryding, 2008). Deciding the location of attachment depends on context. Most often a preposition phrase will be attached to its preceding verb, as shown in Figure 7.

Figure 7: PP-attachment to a verb.

In Arabic, there is no direct equivalent of the English present tense copula verb, and equational sentences (such as "Mankind are ungrateful to their Lord") are represented by writing two nouns side-by-side, with both in the nominative case. The first noun will be the subject, and the second noun the predicate. When a preposition phrase is used in an equational sentence, it is typically attached to the predicate.

Certain chapters of the Quran begin with a preposition phrase used as an oath (Rafai, 1998). In this case the preposition will be a particle of oath, usually wāw. To satisfy the PP-linking constraint, the preposition phrase will attach to an implicit node such as the hidden verb "I swear by" (see Figure 8). Although a preposition phrase must always be linked to another head node, it is not always through attachment. For example, consecutive sequences of preposition phrases may be related through conjunction or through apposition.

Figure 8: PP-attachment to a hidden node.

4. Conclusion and Future Work

The full annotation guidelines that are presented in this paper are available online at the Quran corpus website (http://corpus.quran.com), to enable online collaborative annotation. The website has attracted a wide variety of visitors including NLP researchers, many non-academics wanting to learn more about the Quran, and interested volunteers who are familiar with the source material and traditional grammar. The aim of traditional Quranic studies is to throw light upon the meanings of the Quranic text. Adopting the grammar framework of i'rāb and traditional analytics expertise will lead to an enriched corpus. The markup is not only machine-readable, but can be an aid to human understanding of the Arabic source for non-Arabic speakers.

For example, particles such as annā in Figure 1 can be difficult to translate faithfully into other languages. Different English translations of the Quran use "that", "how", "for" or some other construct (Awde & Smith, 2004). The dependency analysis will help readers further in uncovering the detailed intended meanings of each verse and sentence.

As well as morphological and syntactic analysis, a third planned phase of annotation in the corpus will be a semantic layer, following completion of the syntactic treebank. It is hoped the resource will become more directly amenable to computational semantic modeling by annotating the text using semantic role labeling, or by representing semantics using first-order predicate logic.

5. References

Joyce Akesson (2001). Arabic Morphology and Phonology: Based on the Marah Al-Arwah by Ahmad b. 'Ali Mas'ud. Brill.
Haq Ansari (2000). Learning the Language of the Quran. MMI Publishers.
Nicholas Awde and Kevin Smith (2004). Arabic-English/English-Arabic Dictionary. Bennett & Bloom.
Kais Dukes and Nizar Habash (2010). Morphological Annotation of Quranic Arabic. Language Resources and Evaluation Conference (LREC). Valletta, Malta.
Kais Dukes and Tim Buckwalter (2010). A Dependency Treebank of the Quran using Traditional Arabic Grammar. 7th International Conference on Informatics and Systems. Cairo, Egypt.
Geert-Jan Kruijff (2006). Dependency grammar. The Encyclopedia of Language and Linguistics, 2nd edition. Elsevier Publishers.
Wolfdietrich Fischer and Jonathan Rodgers (2002). A Grammar of Classical Arabic: Third Revised Edition. Yale University …ntations for Machine Translation.
Nizar Habash, Owen Rambow and Ryan Roth (2008). MADA+TOKAN: Quick Manual.
Nizar Habash, Reem Faraj and Ryan Roth (2009). Syntactic Annotation in the Columbia Arabic Treebank. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt.
John A. Haywood and H. M. Nahmad (2005). A New Arabic Grammar of the Written Language. Lund Humphries Publishers.
Alan Jones (2005). Arabic Through the Qur'an. Islamic Texts Society.
Mohamed Maamouri, Ann Bies and Tim Buckwalter (2004). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. In NEMLAR Conference on Arabic Language Resources and Tools, Cairo, Egypt.
John Mace (2007). Arabic Verbs. Bennett & Bloom.
Ebrahim Muhammad (2007). From the Treasures of Arabic Morphology. Zam Zam Publishers.
Abdullah Abbas Nadwi (2006). Vocabulary of the Holy Quran. Millat Book Centre.
Abdul Mannan Omar (2005). Dictionary of the Holy Quran. Noor Foundation International.
Jamal-Un-Nisa Bint Rafai (1998). Basic Quranic Arabic Grammar. Ta-Ha Publishers Ltd.
Jonathan Owens (1988). The Foundations of Grammar: An Introduction to Medieval Arabic Grammatical Theory. John Benjamins Publishers.
Karin C. Ryding (2008). A Reference Grammar of Modern Standard Arabic. Cambridge University Press.
Abdur Rashid Siddiqui (2008). Quranic Keywords: A Reference Guide. The Islamic Foundation.
Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab and Cynthia Rudin (2008). Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of the Conference of American Association for Computational Linguistics (ACL08).
Otakar Smrz and Jan Hajic (2006). The Other Arabic Treebank: Prague Dependencies and Functions. …tions, CSLI Publications.
Abdelhadi Soudi, Antal van den Bosch and Gunter Neumann (2007). Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.
Jane Wightwick and Mahmoud Gaafar (2008). Arabic Verbs and Essentials of Grammar. McGraw-Hill.
