PronouncUR: An Urdu Pronunciation Lexicon Generator


Haris Bin Zia (1), Agha Ali Raza (1), Awais Athar (2)
(1) Information Technology University, 6th Floor, Arfa Software Technology Park, Ferozepur Road, Lahore, Pakistan
(2) EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
{haris.zia, agha.ali.raza}@itu.edu.pk, awais@ebi.ac.uk

Abstract

State-of-the-art speech recognition systems rely heavily on three basic components: an acoustic model, a pronunciation lexicon and a language model. To build these components, a researcher needs linguistic as well as technical expertise, which is a barrier in low-resource domains. Techniques to construct these three components without expert domain knowledge are in great demand. Urdu, despite having millions of speakers all over the world, is a low-resource language in terms of standard publicly available linguistic resources. In this paper, we present a grapheme-to-phoneme conversion tool for Urdu that generates a pronunciation lexicon, in a form suitable for use with speech recognition systems, from a list of Urdu words. The tool predicts the pronunciation of words using an LSTM-based model trained on a handcrafted expert lexicon of around 39,000 words and shows an accuracy of 64% upon internal evaluation. For external evaluation on a speech recognition task, we obtain a word error rate comparable to one achieved using a fully handcrafted expert lexicon.

Keywords: Pronunciation Lexicon, Pronunciation Modeling, Lexicon Learning, Speech Recognition, Urdu

1. Introduction

Automatic Speech Recognition (ASR) for resource-scarce languages has been an active research area in the past few years (Sherwani, 2009; Qiao et al., 2010; Chan and Rosenfeld, 2012). Modern speech recognition systems usually require three resources: transcribed speech for acoustic modeling, a large text corpus for language modeling and a pronunciation lexicon that maps words to sub-word units known as phonemes.
The pronunciation lexicon acts as a link connecting the language model with the acoustic model.

While it is comparatively easy to gather transcribed speech waveforms and large text datasets, developing a pronunciation dictionary is quite expensive and requires a tremendous amount of manual effort and linguistic expertise. Therefore, development of a pronunciation lexicon is the bottleneck when building ASR systems for low-resource languages. Techniques to reduce the need for expert knowledge in the design and development of pronunciation lexicons are in great demand.

We are interested in developing a pronunciation lexicon generation tool for Urdu, an Indo-Aryan language with over 100 million speakers. Urdu is an official language of Pakistan. Its writing system is segmental, more specifically an abjad, i.e. only consonants are marked while vowels (diacritics) are optional. Urdu uses the Arabic script, written from right to left. A sentence written in Urdu along with its English translation is given below:

اردو پاکستان کی قومی زبان ہے ۔
Urdu is the national language of Pakistan.

Automatic Speech Recognition (ASR) research for Urdu poses a number of challenges, which are discussed in detail in subsequent sections. Despite being spoken by millions of speakers all over the world, Urdu is low-resource in terms of standard publicly available linguistic resources. To the best of our knowledge, our Urdu pronunciation lexicon generation tool is the first of its kind; it makes it easier for researchers to work on Urdu speech recognition systems without prior linguistic knowledge.

The remainder of the paper is structured as follows. Section 2 reviews similar work for different world languages. We then present Urdu orthography and the Urdu phonetic inventory in Section 3. Section 4 briefly discusses challenges in Urdu pronunciation modeling.
We present our tool in Section 5 and conclude in Section 6.

2. Literature Review

There exists a range of research focusing on lexical resources or tools available for different world languages for pronunciation modeling in speech recognition tasks.

CMUdict (Carnegie Mellon pronunciation dictionary) is an open-source pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations (Weide, 1998). There is also a lexicon generation tool available that uses CMUdict.

Tan et al. (2009) proposed a rule-based grapheme-to-phoneme tool that generates a pronunciation dictionary for the Malay language. Their ASR system, trained on a read speech corpus using the tool-generated pronunciation dictionary, achieved a word error rate (WER) of 16.5%.

A Bengali pronunciation dictionary was developed under the Google Internationalization Project (Gutkin et al., 2016). The dictionary contains around 65,000 words that were manually transcribed into their phonemic representation by a team of five [...]

Pronunciation lexicons were developed for the Amharic, Swahili and Wolof languages under the ALFFA Project and were made publicly available (Gauthier et al., 2016).

Mandarin Chinese Phonetic Segmentation and Tone is a publicly available corpus of 7,849 Mandarin Chinese utterances and their phonetic segmentation. The corpus can be used for pronunciation modeling of Mandarin Chinese.

Arabic Speech Recognition Pronunciation Dictionary is a publicly available pronunciation dictionary for Modern Standard Arabic (MSA) that contains 526,000 words and two million pronunciations.

Masmoudi et al. (2014) presented the Tunisian Arabic Phonetic Dictionary, based on a set of phonetic rules and a manually tagged lexicon of exceptions (for words that do not follow the phonetic rules).

Egyptian Colloquial Arabic Lexicon is a publicly available pronunciation dictionary of Egyptian Colloquial Arabic (ECA); it contains 51,202 words and their pronunciations.

The Georgetown dictionary of Iraqi-Arabic is a modern, up-to-date, publicly available dialectal Arabic language resource that can be used for pronunciation modeling of Iraqi-Arabic. It contains 17,500 Iraqi-Arabic entries along with their IPA pronunciations.

Bonaventura et al. (1998) presented a letter-to-phone conversion system for Spanish that can be used to supply phonetic transcriptions to a speech recognizer.

Mendonça et al.
(2014) proposed a hybrid approach based on manual transcription rules and machine learning algorithms to build a machine-readable pronunciation dictionary for Brazilian Portuguese. The dictionary, as well as the algorithms used to build it, were made publicly available (https://github.com/gustavoauma/aeiouado g2p).

Pronunciation dictionaries developed under the GlobalPhone Project (Schultz and Schlippe, 2014) are also available for research and commercial purposes in 20 different languages, including German, French, Russian, Korean, Turkish, Chinese and Thai.

3. Urdu

3.1 Orthography

Urdu is written in Arabic script in a cursive format (Nastaliq style) from right to left using an extended Arabic character set. The character set includes 37 basic and 4 secondary letters, 7 diacritics, punctuation marks and special symbols (Hussain & Afzal, 2001; Afzal & Hussain, 2001; Hussain, 2004) (see Appendix A).

3.2 Phonetics

Urdu has a very rich phonetic inventory. Combinations of Urdu letters and diacritics realize 44 consonants (28 non-aspirated and 16 aspirated), 7 long vowels, 7 nasalized long vowels, 3 half-long vowels, 3 short vowels and 3 nasalized short vowels (Saleem et al., 2002; Hussain, 2007; Hussain, 2004). Since speech recognition systems require sounds to be represented in some phonemic notation such as IPA or SAMPA, we have used CISAMPA (Case Insensitive Speech Assessment Methods Phonetic Alphabet), proposed by Raza et al. (2010), to represent Urdu phonemes (see Appendix B).

4. Challenges in Urdu Pronunciation Modeling

Pronunciation modeling for Urdu poses a number of challenges:

Dialects: Due to the large user base and variety of speakers, there are variations in dialect, leading to large variations in pronunciation and phonetics.

Script: In Urdu, diacritics inform the reader of the short vowels accompanying each written consonant, but commonly used Urdu script generally does not contain diacritics. Speakers can distinguish the words through context and experience, but some constructions may still be ambiguous. For instance, the word اس can mean either ‘this’ (اِس) or ‘that’ (اُس), with IPA representations /ɪs/ and /ʊs/ respectively.

Morphology: Urdu is a morphologically rich language; combinations of affixes and stems result in a large vocabulary.

Dual Behavior: Three Urdu characters show dual behavior, i.e. both consonantal and vocalic, depending on their position of occurrence (Hussain, 2004).

5. PronouncUR

We have developed PronouncUR, an Urdu grapheme-to-phoneme tool based on a model (cf. Section 5.2) that can generate a pronunciation lexicon, in a form suitable for use with speech recognition systems, from a list of Urdu words. PronouncUR is freely available online (lextool.csalt.itu.edu.pk).

5.1 Lexicon

To train our model we have developed a lexicon of approximately 46K words. The lexicon has been tagged by trained transcription experts, carefully considering the letter-to-sound rules for Urdu proposed by [...]

The format of the training lexicon is straightforward. Each line consists of one word form and its pronunciation, separated by a tab. A small portion of the training lexicon is given in Table 1.

Table 1: Training Lexicon

Word        Pronunciation
فولاد        F O L A A D D
علامات       A L A A M A A T D
جائیداد      D Z A A I I D D A A D D
لَڑکِیوں      L A R R K I J O O N
درویشی      D D A R V A Y S H I I
الجھاؤ       U L D Z H A A O O
رکوا        R U K V A A
ایران       I I R A A N
خریدی       X A R I I D D I I
آفات        A A F A A T D
فریاد       F A R J A A D D
عراقی       I R A A Q I I

Out of the 67 phonemes in the Urdu phonetic inventory (see Appendix B), our training lexicon currently covers 64; work is in progress to include the 3 nasalized short vowels. The phonemes M H and J H occur very rarely in Urdu and thus have only one entry each in the training lexicon. For the rest of the phonemes, the frequency of occurrence is given in Table 2.

5.2 G2P Model

Grapheme-to-phoneme (G2P) conversion is the task of translating an input sequence of graphemes (letters) into an output sequence of phonemes.

Table 3: An example of grapheme-to-phoneme translation

Graphemes   Phonemes
ب           B
◌َ           A
ن           N

Given the success of sequence-to-sequence learning (Sutskever et al., 2014) and the power of LSTM for sequence modeling (Hochreiter and Schmidhuber, 1997), we chose LSTM for grapheme-to-phoneme conversion, as proposed by Yao et al. (2015). We used an open-source G2P toolkit to train our G2P model with 2 LSTM layers and 512 hidden units in each layer.

Figure 1: An encoder-decoder LSTM with two layers.

Figure 1 shows a sample of the model, where the encoder LSTM is on the left of the dotted line and the decoder is on the right. The encoder reads a time-reversed sequence “<s> ن َ ب” and produces the last hidden layer activation to initialize the decoder.
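
As an illustration of the lexicon line format and the sequences of Figure 1, the following Python sketch parses one tab-separated lexicon entry and builds the encoder and decoder sequences for it. It is a minimal sketch and not the preprocessing of the G2P toolkit itself: the function names are hypothetical, the angle-bracket spelling of the boundary tokens (<s>, <os>, </os>) is assumed, and phoneme symbols are assumed to be separated by whitespace and to contain no internal spaces.

```python
from typing import List, Tuple


def parse_lexicon_line(line: str) -> Tuple[str, List[str]]:
    """Split one 'word<TAB>pronunciation' line into the word and its phoneme symbols.

    Assumption: phoneme symbols are whitespace-separated within the pronunciation.
    """
    word, pronunciation = line.rstrip("\n").split("\t", 1)
    return word, pronunciation.split()


def encoder_sequence(word: str) -> List[str]:
    """Return the encoder input: a start token followed by the graphemes in reverse order.

    Each Unicode code point (letter or diacritic) is treated as one grapheme.
    """
    graphemes = list(word)
    return ["<s>"] + graphemes[::-1]


def decoder_sequences(phonemes: List[str]) -> Tuple[List[str], List[str]]:
    """Return (decoder input, decoder target) for a phoneme sequence.

    The input is the target shifted right by one position and begins with <os>;
    the target ends with </os>.
    """
    return ["<os>"] + phonemes, phonemes + ["</os>"]


# Example entry: beh + zabar + noon, pronounced "B A N" (the word of Table 3).
example_line = "\u0628\u064e\u0646\tB A N\n"
word, phonemes = parse_lexicon_line(example_line)
encoder_in = encoder_sequence(word)
decoder_in, decoder_out = decoder_sequences(phonemes)
```

For the example entry, encoder_in holds the start token followed by noon, zabar and beh, decoder_in is <os> B A N, and decoder_out is B A N </os>.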
The decoder reads “<os> B A N” as the past phoneme prediction sequence and uses “B A N </os>” as the output sequence to generate. <s> denotes the beginning of the input sequence, while <os> and </os> denote the beginning and end of the output sequence respectively.

5.3 Performance Evaluation

We split our handcrafted lexicon into an 85% training set, a 5% validation set and a 10% test set. In intrinsic evaluation on the unseen test set, our G2P model achieved a word error rate (WER) of 36%. The same G2P model trained on CMUdict has a WER of 28.61% (Yao et al., 2015). The lower word error rate on CMUdict can be attributed to its larger size. Another reason for our comparatively higher WER may be that only about 11% of the words in our corpus have diacritics. As a result, good performance would require overcoming the problem of automatic diacritization, which gets harder when processing a list of isolated words without any context.

To perform extrinsic evaluation of the lexicon tool on a speech recognition task, we trained a Hidden Markov Model (HMM) based speech recognition system on a phonetically rich Urdu speech corpus (Raza et al., 2009) and a spontaneous speech corpus (Raza et al., 2010) using the CMUSphinx speech recognition toolkit (https://cmusphinx.github.io/). The combined data from both corpora contains 3,974 utterances spanning 179 minutes of speech, out of which 157 minutes (3,174 utterances) were used for training and 22 minutes (800 utterances) for testing. A tri-gram language model built from the training data transcripts was applied during decoding. Using the lexicon generated by the lexicon tool, we obtained a word error rate (19%) that approaches the rate achieved using a fully handcrafted expert lexicon. We used the same train/test split as Raza et al. (2010), and thus the results are directly comparable.

Table 2: Frequency Distribution of Phonemes [table contents not recoverable]

6. Conclusion and Future Work

We presented an online pronunciation lexicon generation tool for Urdu that can be used to generate a pronunciation lexicon for use with speech recognition systems. Experimental results showed that the pronunciation lexicon generated by the tool performs about as well as a handcrafted expert lexicon in speech recognition tasks.

As a future direction, we will look into ways to decrease the WER of the lexicon tool, e.g. increasing diacritic coverage in the training lexicon, increasing the size of the training lexicon, adding support for nasalized short vowels and increasing the coverage of rarely occurring phonemes.

7. Acknowledgements

We would like to thank Atique-ur-Rehman for providing us with cloud hosting and Murtaza Azam Khan for his help with the frontend.

8. Bibliographical References

Afzal, M., & Hussain, S. (2001). Urdu computing standards: development of Urdu Zabta Takhti (UZT) 1.01. In Multi Topic Conference, 2001. IEEE INMIC 2001. Technology for the 21st Century. Proceedings. IEEE International (pp. 216-222). IEEE.

Aminzadeh, A. R., & Shen, W. (2008, December). Low resource speech translation of Urdu to English [...]. In Spoken Language Technology Workshop, 2008. SLT 2008. IEEE (pp. 265-268). IEEE.

Bonaventura, P., Giuliani, F., Garrido, J. M., & Ortin, I. (1998, August). Grapheme-to-phoneme transcription rules for Spanish, with application to automatic speech recognition and synthesis. In Proceedings of the Workshop on Partially Automated Techniques for Transcribing Naturally Occurring Continuous Speech (pp. 33-39). Association for Computational Linguistics.

Chan, H. Y., & Rosenfeld, R.
(2012, March). Discriminative pronunciation learning for speech recognition for resource scarce languages. In Proceedings of the 2nd ACM Symposium on Computing for Development (p. 12). ACM.

Gutkin, A., Ha, L., Jansche, M., Pipatsrisawat, K., & Sproat, R. (2016, May). TTS for Low Resource Languages: A Bangla Synthesizer. In LREC.

Gauthier, E., Besacier, L., Voisin, S., Melese, M., & Elingui, U. P. (2016, May). Collecting resources in sub-saharan african languages for automatic speech recognition: a case study of wolof. In 10th Language Resources and Evaluation Conference (LREC 2016).

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

Hussain, S., & Afzal, M. (2001). Urdu computing standards: Urdu zabta takhti (uzt) 1.01. In Multi Topic Conference, 2001. IEEE INMIC 2001. Technology for the 21st Century. Proceedings. IEEE International (pp. 223-228). IEEE.

Hussain, S. (2004, August). Letter-to-sound conversion for Urdu text-to-speech system. In Proceedings of the workshop on computational approaches to Arabic script-based languages (pp. 74-79). Association for Computational Linguistics.

Hussain, S. (2007). Phonetic correlates of lexical stress in Urdu (Doctoral dissertation, UMI Ann Arbor).

Masmoudi, A., Khmekhem, M. E., Esteve, Y., Belguith, L. H., & Habash, N. (2014, May). A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition. In LREC (pp. 306-310).

Mendonça, G., & Aluisio, S. (2014). Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese. In Fifteenth Annual [...]

Qiao, F., Sherwani, J., & Rosenfeld, R. (2010, December). Small-vocabulary speech recognition for resource-scarce languages. In Proceedings of the First ACM Symposium on Computing for Development (p. 3). ACM.

Raza, A. A., Hussain, S., Sarfraz, H., Ullah, I., & Sarfraz, Z. (2009, August). Design and development of phonetically rich Urdu speech corpus. In Speech Database and Assessments, 2009 Oriental COCOSDA International Conference on (pp. 38-43).
IEEE.

Raza, A. A., Hussain, S., Sarfraz, H., Ullah, I., & Sarfraz, Z. (2010). An ASR system for spontaneous Urdu speech. The Proc. of Oriental COCOSDA, 24-25.

Saleem, A. M., Kabir, H., Riaz, M. K., Rafique, M. M., Khalid, N., & Shahid, S. R. (2002). Urdu consonantal and vocalic sounds. CRULP Annual Student Report.

Sherwani, J. (2009). Speech interfaces for information access by low literate users (Doctoral dissertation, Carnegie Mellon University).

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).

Schultz, T., & Schlippe, T. (2014, May). GlobalPhone: Pronunciation Dictionaries in 20 Languages. In LREC (pp. 337-341).

Tan, T. P., & Ranaivo-Malançon, B. (2009). Malay grapheme to phoneme tool for automatic speech recognition. In Proc. Workshop of Malaysia and Indonesia Language Engineering (MALINDO) 2009.

Weide, R. L. (1998). The CMU pronouncing dictionary. [...]

Yao, K., & Zweig, G. (2015). Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. arXiv preprint arXiv:1506.00196.

9. Language Resource References

Ali, Ahmed. Arabic Speech Recognition [...] Philadelphia: Linguistic Data Consortium, 2017.

Kilany, Hanaa, et al. Egyptian Colloquial Arabic Lexicon LDC99L22. Web Download. Philadelphia: Linguistic Data Consortium, 1997.

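Yuan, Jiahong, Neville Ryant, and Mark Liberman. Mandarin Chinese Phonetic Segmentation and Tone LDC2015S05. Web Download. Philadelphia: Linguistic Data Consortium, 2015.

Appendix A

Table A1: Basic Urdu Letters

ا ب پ ت ٹ ث ج چ
ح خ د ڈ ذ ر ڑ ز
ژ س ش ص ض ط ظ ع
غ ف ق ک گ ل م ن
و ہ ء ی ے

Table A2: Secondary Urdu Letters

آ ں ۃ ھ

Table A3: Urdu Diacritics [the individual diacritic marks are not recoverable]

Appendix B

Table B1: Urdu Letters with IPA and CISAMPA. Cells marked [...] did not survive transcription.

Consonants

Sr.   Urdu Letter                     IPA     CISAMPA
1     پ                               p       P
2     پھ                              pʰ      P H
3     ب                               b       B
4     بھ                              bʰ      B H
5     م                               m       M
6     مھ                              mʰ      M H
7     ط ، ت                           t̪       T D
8     تھ                              [...]   T D H
9     د                               [...]   D D
10    دھ                              [...]   D D H
11    ٹ                               [...]   T
12    ٹھ                              [...]   T H
13    ڈ                               [...]   D
14    ڈھ                              [...]   D H
15    ن                               [...]   N
16    نھ                              [...]   N H
17    ک                               [...]   K
18    کھ                              [...]   K H
19    گ                               [...]   G
20    گھ                              [...]   G H
21    ن in نگھ ، نگ ، نکھ ، نک        [...]   N G
22    ق                               [...]   Q
23    ع                               [...]   Y
24    ف                               [...]   F
25    و                               [...]   V
26    س                               [...]   S
27    ظ ، ض ، ز ، ذ                   [...]   Z
28    ش                               [...]   S H
29    ژ                               [...]   Z Z
30    خ                               [...]   X
31    غ                               [...]   G G
32    ہ ، ح                           [...]   H
33    ل                               [...]   L
34    لھ                              lʰ      L H
35    ر                               r       R
36    رھ                              rʰ      R H
37    ڑ                               ɽ       R R
38    ڑھ                              ɽʰ      R R H
39    ی                               j       J
40    یھ                              jʰ      J H
41    چ                               tʃ      T S
42    چھ                              tʃʰ     T S H
43    ج                               dʒ      D Z
44    جھ                              dʒʰ     D Z H

Vowels

Sr.   Urdu Letter   IPA     CISAMPA
45    [...]         [...]   U U
46    [...]         [...]   O O
47    [...]         [...]   O
48    ا ، آ         [...]   A A
49    [...]         [...]   I I
50    [...]         [...]   A Y
51    [...]         [...]   A E
52    [...]         [...]   U U N
53    [...]         [...]   O O N
54    [...]         [...]   O N
55    اں ، آں       [...]   A A N
56    [...]         [...]   I I N
57    [...]         ẽː      A Y N
58    [...]         æ̃ː      A E N
59    [...]         eˑ      A Y H
60    [...]         æˑ      A E H
61    [...]         oˑ      O O H
62    [...]         ɪ       I
63    [...]         ʊ       U
64    [...]         ə       A
65    [...]         ɪ̃       I N
66    [...]         ʊ̃       U N
67    [...]         ə̃       A N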

