Transliteration Editors For Arabic, Persian And Urdu


E. Veera Raghavendra, Prahallad Lavanya, Fahmy Mostafa
Carnegie Mellon University
IIIT Hyderabad, India.

Abstract: Transliteration editors are essential for keying language scripts into the computer using a QWERTY keyboard. Applications of transliteration editors in the context of the Universal Digital Library (UDL) include entry of meta-data and dictionaries for many languages, both local and international. In this paper we propose a simple approach for building transliteration editors for international languages such as Arabic, Persian and Urdu, using Unicode and taking advantage of the Unicode rendering engine. We demonstrate the usefulness of the Unicode-based approach for building transliteration editors for international languages and report its advantages: little maintenance, few entries in the mapping table, and ease of adding new features, such as new letters, to the transliteration scheme. We also explain how easy it is to add any language and build a transliteration editor using Unicode and its mapping tables. We demonstrate the transliteration editor for three international languages and explain how this approach can be adapted for any foreign language.

Keywords: Transliteration Editors, IT3, Arabic, Persian and Urdu, UDL.

1. Introduction

Several processes exist in realizing the concept of the Universal Digital Library (UDL). These processes include scanning the books, improving the quality of the scanned images, entering meta-data for the scanned books, storing the data, and retrieving and accessing the data as and when required.

1.1 Need for Transliteration Editors

In the Universal Digital Library, many international-language books are scanned, and it becomes necessary to enter the metadata in their local languages. Entering text in local languages other than English is possible only through transliteration editors. These editors let you enter the language you choose using the QWERTY keyboard.

1.2 How are they useful in digital library?

Transliteration editors are essential for keying language scripts like Arabic, Persian and Urdu into the computer using a QWERTY keyboard. Applications of transliteration editors in the context of the Universal Digital Library (UDL) include entry of meta-data and dictionaries for international languages. The issues in building transliteration editors include the design of a user-friendly and readable transliteration scheme, a user interface to key in the text and have it rendered in native script, and the provision of transliteration codes for the characters in international languages.

1.3 Are there any previous Transliteration Editors

Many transliteration editors have been developed for many languages all over the world. For example, there are Indian-language transliteration editors such as Indian Unitrans and Om (in/ speech/Transliteration/, http://www.cs.cmu.edu/ madhavi/OM). These editors also use IT3; their only drawback is that they support only Indian languages and have not been extended to support any other foreign language. Since the Universal Digital Library includes the scanning and metadata entry of international languages like Arabic, Persian and Urdu, there is a need to develop a stable transliteration editor with which metadata entry is easy and effective. People who speak Arabic, Persian and Urdu can then see their books in their own language.

2. Nature of Arabic, Persian and Urdu Scripts

Arabic, Persian and Urdu, often referred to as Middle East languages, have many features in common. The nature of the script is mostly the same for all three languages and is described as follows:

- The Arabic alphabet contains 28 letters. Some additional letters are used in Persian and Urdu, such as those for /p/ or /g/.
- Words are written in horizontal lines from right to left; numerals are written from left to right.
- Most letters change form depending on whether they appear at the beginning, middle or end of a word, or on their own.
- The Arabic, Persian and Urdu script is cursive, and all primary letters have conditional forms for their glyphs, depending on whether they are at the beginning, middle or end of a word, so they may exhibit four distinct forms (initial, medial, final or isolated).
- Letters that can be joined are always joined in both handwritten and printed Arabic, Persian and Urdu.
- The long vowels /a:/, /i:/ and /u:/ are represented by the letters 'alif, yā' and wāw respectively.
- Vowel diacritics, which are used to mark short vowels and other special symbols, appear consistently only in the Qur'an. They are also used, though with less consistency, in other religious texts, in classical poetry, in textbooks for children and foreign learners, and occasionally in complex texts to avoid ambiguity. Sometimes the diacritics are used for decorative purposes in book titles, letterheads, nameplates, etc.

In normal texts, the diacritics are usually not used.

3. Middle East Script Unicode Support

With the advent of Unicode support in many web browsers, displaying Arabic, Persian and Urdu script is no longer an issue. The Unicode rendering engine in Windows XP takes care of displaying the characters of these languages in the correct form, whether they appear in initial, middle, end or isolated positions. Since we use Unicode, the default fonts that come with XP for Arabic, Persian and Urdu can be used; however, there are many freely downloadable fonts for these languages if we wish to use them.

4. A Transliteration Scheme for Arabic, Persian, Urdu

We use a transliteration scheme referred to as IT3, which was originally developed by IISc Bangalore and Carnegie Mellon University. This transliteration scheme is designed as an improvement over the ITRANS scheme for typing the characters of any language, such as Arabic, using the standard keyboard. The transliteration mapping is meant to add a few more features to enhance usability and readability, and has been designed on the following principles:

- Easy readability.
- Use of case-insensitive mapping: while preserving readability, this feature allows standard natural language processing tools for parsing and information retrieval to be applied directly to the Indian language texts.
- Phonetic mapping, as much as possible. This makes it easier for the user to remember the key combinations for different Indian characters.

Example of the IT3 is shown as: a aa kg h’ etc.

4.1 Mapping Table

The important part of the Arabic editor is mapping each IT3 symbol to the corresponding Arabic Unicode character. Using the IT3 notation and the Unicode characters, one can build a simple transliteration editor in a short amount of time. Any new language can be added to it with minimal effort. To add a language, a mapping table has to be built that maps each IT3 character to the corresponding Unicode character. However, attention should be paid while building the Arabic, Persian and Urdu mapping tables: all the letters in these languages are consonants, and each letter can appear separately or in combination with one another. Unlike in Indian languages, we cannot have combination letters in the mapping of IT3 to Unicode; for example, sh is not equal to s and h. A sample mapping table for Arabic is shown below:

Fig 2: Arabic Mapping Table
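To make the idea of a per-language mapping table concrete, here is a minimal sketch in Python. The paper's actual Arabic table (Fig 2) is not reproduced in this text, so the IT3-style keys below are hypothetical, as are the names `ARABIC_TABLE`, `PERSIAN_EXTRA`, `PERSIAN_TABLE` and `describe`. Only the Unicode code points and their official character names are standard. The Persian table is deliberately simplified: it reuses the Arabic entries and adds three letters, whereas a real table would probably differ in other entries too.

```python
import unicodedata

# Hypothetical IT3-style keys mapped to Arabic Unicode characters.
# The paper's real table is not available here; only the code points are standard.
ARABIC_TABLE = {
    "b": "\u0628",   # ARABIC LETTER BEH
    "k": "\u0643",   # ARABIC LETTER KAF
    "l": "\u0644",   # ARABIC LETTER LAM
    "m": "\u0645",   # ARABIC LETTER MEEM
    "n": "\u0646",   # ARABIC LETTER NOON
    "s": "\u0633",   # ARABIC LETTER SEEN
    "h": "\u0647",   # ARABIC LETTER HEH
    "sh": "\u0634",  # ARABIC LETTER SHEEN: its own entry, not SEEN followed by HEH
}

# Adding a language means supplying another table (assumed, simplified entries).
PERSIAN_EXTRA = {
    "p": "\u067E",   # ARABIC LETTER PEH
    "g": "\u06AF",   # ARABIC LETTER GAF
    "z'": "\u0698",  # ARABIC LETTER JEH (zheh)
}
PERSIAN_TABLE = {**ARABIC_TABLE, **PERSIAN_EXTRA}


def describe(table):
    """Return {key: official Unicode character name} for checking a table."""
    return {key: unicodedata.name(char) for key, char in table.items()}


if __name__ == "__main__":
    for key, name in describe(PERSIAN_TABLE).items():
        print(f"{key:>3} -> {name}")
```

Keeping each language in its own dictionary reflects the point made above: the editor logic stays the same and only the table changes when a language is added.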

4.2 User Interface Design

Now that we have the parameters needed to build the editor, such as the mapping table, we need to build the interface, where we can see both the IT3 and the Arabic, Persian or Urdu characters, depending on the language selected.

To build the Arabic, Persian and Urdu language editor, the following pseudo code is followed. Given an IT3 word, parse it into a sequence of characters by phonifying them. For example, given the sequence nilu, the editor first phonifies it into the individual phones n i l u. These phones are mapped to their corresponding Unicode numbers, and the Arabic, Persian or Urdu script is displayed in the editor (see Fig 2).

The Middle East languages like Arabic and Persian and the Indo-European language Urdu have some common characteristics. All three languages are written from right to left. The Persian and Urdu scripts are derived from the Arabic script.

The characters of Arabic, Persian and Urdu are called alphabets. Each alphabet corresponds to a phone (Library of Congress, 1997).

Arabic:
Each alphabet corresponds to a phone. Arabic has 9 vowels and 28 consonants (Library of Congress, 1997; Qur’an Transliteration, 2005). In Arabic, the vowels do not appear independently, as they do in many Indian languages; they occur only with a consonant. All Arabic characters are written differently depending on where they occur: an alphabet is written one way in the initial position and another way in the middle or at the end of a word. This rule is the same for Persian and Urdu.

Persian:
Persian also has 9 vowels, and it has 32 consonants. Modern Persian uses a modified version of the Arabic alphabet. Persian adds four extra alphabets because four sounds that exist in Persian do not exist in Arabic. The additional four alphabets are shown in Table 1 (Omniglot, 2005).

Urdu:
The Urdu script is derived from the Persian script, which in turn is derived from the Arabic script. Urdu uses the more complex and sinuous Nastaliq script. It is said that the Arabic alphabet is a subset of the Urdu alphabet. Urdu has 11 vowels in addition to the 9 vowels of the Arabic alphabet, and 35 consonants (alphabets) (Hugo’s Website, 2005; U-TRANS, 2002). Urdu has two forms of noon (n), noon and noon gunna, whereas Persian and Arabic have only one noon.
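The pseudo code given in Section 4.2 (phonify the IT3 word, then map each phone to its Unicode number) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The table entries, including the choice of yeh for "i" and waw for "u", are hypothetical, and so are the function names `phonify` and `transliterate`. Greedy longest-match tokenization is also an assumption; the paper does not say how a two-letter code such as sh is told apart from s followed by h.

```python
# Hypothetical mini mapping table (IT3-style key -> Arabic Unicode character).
TABLE = {
    "n": "\u0646",   # NOON
    "i": "\u064A",   # YEH (assumed key for the long vowel)
    "l": "\u0644",   # LAM
    "u": "\u0648",   # WAW (assumed key for the long vowel)
    "s": "\u0633",   # SEEN
    "h": "\u0647",   # HEH
    "sh": "\u0634",  # SHEEN
}


def phonify(word, table):
    """Split an IT3 word into phones using greedy longest-match (an assumption)."""
    max_len = max(len(key) for key in table)
    phones = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so that "sh" wins over "s" + "h".
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                phones.append(chunk)
                i += size
                break
        else:
            raise ValueError(f"no mapping for {word[i]!r} at position {i}")
    return phones


def transliterate(word, table):
    """Map each phone to its Unicode character; IT3 is case-insensitive."""
    return "".join(table[phone] for phone in phonify(word.lower(), table))


if __name__ == "__main__":
    print(phonify("nilu", TABLE))
    print(transliterate("nilu", TABLE))
```

The returned string holds the characters in logical (typing) order. Right-to-left display and the choice of initial, medial, final or isolated glyphs are left to the Unicode rendering engine, which is why no syllabification or shaping logic appears in the sketch.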

Fig 2: Screenshots of the Arabic, Persian and Urdu Unicode editor, (a), (b) and (c) respectively

Fig 3: Screenshot of the metadata of the Arabic books

Fig 4: Screenshot of the Arabic books on the ULIB website

IT3 codes such as ain (e or o) and zheh (z’), and many more, were created and added to accommodate Middle East languages.

The following steps are followed to develop the transliteration:
(1) Once all the alphabets are assigned IT3 codes, each IT3 code is mapped to the corresponding Unicode number.
(2) The above languages have an explicit Unicode number assigned to each alphabet.
(3) Unlike in Indian languages, a consonant alphabet represents a consonant alone and a vowel alphabet represents a vowel. For example, in Indian languages “k” is mapped to the Unicode representing /ka/, but in Middle East languages “k” is explicitly mapped to /k/ (kaf). No syllabification is required in these languages.

CONCLUSION

In this work, we have described the process of building Arabic, Persian and Urdu language editors using a simple scheme based on Unicode. This simple approach has the following advantages:
(1) Fewer entries in the mapping table. There are only 37 entries for Arabic.
(2) Automatic rendering of the Unicode characters by the Unicode rendering engine in Windows XP/Linux.
(3) Using the Unicode-based approach, a single module can render all the languages. The mapping table changes, but the parsing of the IT3 sequence and the syllabification are the same across Arabic, Persian and Urdu.
(4) Easy to adapt for new languages. Unicode-based approaches require minimal knowledge to work in new languages, whereas an ASCII font-based approach requires a better understanding of the language to handle exceptions related to the rendering of consonant clusters.
(5) Our editor is used in the Universal Digital Library to enter the metadata for the Arabic, Persian and Urdu books. A screenshot of the metadata is shown in Fig 3.
(6) Using our editor code, it is possible to display Arabic, Persian and Urdu characters even on a webpage. This is incorporated in the webpage of the Universal Digital Library (www.ulib.org): when searching for Arabic books, all the Arabic books are shown in Arabic script.

ACKNOWLEDGEMENT

We would like to deeply thank Prof. Raj Reddy for guiding us through the project and giving us the initiative. We would also like to thank Dr. Nayel Shafei for his feedback throughout this project, and Mr. S P Kishore for helping us complete it.

References

(1) Alan, W., 2005. Unicode ml.
(2) A Simple Approach for building Transliteration editors, Journal of Zhejiang University, 2005.
(3) Hugo’s Website, 2005. Urdu
(4) Library of Congress, 1997. ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman
(5) Markus, K., 2005. UTF-8 and Unicode FAQ for Unix/Linux. http://www.cl.cam.ac.uk/
(6) Omniglot, 2005. www.omniglot.com/writing/persian.htm
(7) Qur’an Transliteration, 2005. /
(8) The Unicode Consortium, 2003. The Unicode Standard, Version 4.0. Addison-Wesley, Boston,
-trans/
