

International Journal of Scientific Research in Computer Science, Engineering and Information Technology
2018 IJSRCSEIT, Volume 3, Issue 8, ISSN: 2456-3307
DOI: https://doi.org/10.32628/CSEIT183844
CSEIT183844, Received: 20 Nov 2018, Accepted: 02 Dec 2018, November-December-2018 [3(8): 161-180]

Zone-Wise Segmentation and Lexicon-Driven Recognition for Printed Myanmar Characters

Chit San Lwin1, Xiangqian Wu2
1,2School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, P. R. China
Department of Mathematics, Monywa University, Monywa City, Sagaing Region, Myanmar
1Corresponding Author: chitsanlwin.maths.mm@gmail.com

ABSTRACT

This paper presents new segmentation and recognition algorithms for Myanmar script input from offline printed images. Zone segmentation considers horizontal and vertical zones; it is applied to segment letters according to their roles, such as primary or peripheral characters. In doing so, statistical and structural features of the segmented characters are explored and exploited in the recognition process. A hidden Markov model is used for recognition of primary characters, while a Kohonen self-organization map is used for peripheral characters. The characters recognized by each model are then combined and finally recognized by the k-nearest neighbors algorithm with the help of a lexicon composed of all common Myanmar characters. Our OCR system for Myanmar characters was tested on a dataset that contains approximately 7560 compound characters. According to the results, our system achieves significantly higher results in both segmentation and recognition than other contemporary Myanmar OCR approaches.

Keywords: Character Segmentation, Hidden Markov Model, Self-organization Map, k-nearest Neighbors, Lexicon

I. INTRODUCTION

[…] handwritten documents have already been an intensive research area in recent years. It is a fundamental and essential process in intelligent and machine learning systems for automatic recognition and translation of text images via a robot's eye, mobile phone or other electronic devices. Due to OCR's complex structure and high computational demand, it is still challenging to develop robust, language-independent offline and online recognition systems. In spite of the relatively mature stage of OCR in the most widely used languages like English, Indian, Chinese, Arabic, etc., the Myanmar (also known as Burmese) language is still struggling for a robust OCR system.

The Myanmar script is a cursive language like Arabic, Persian and Urdu scripts; it has a […]r set, and several of these characters are similar in shape but have different meanings. One major challenge here is that the Myanmar OCR system has been greatly under-researched. Another challenge is that the unique features of the Myanmar script stand as one of the main unresolved problems in the literature on Myanmar OCR systems. As such, there is a demand for considerable and significant improvement in Myanmar OCR research in order to keep pace with today's device-oriented technologies without the need for human assistance. This paper therefore considers these challenges of the Myanmar OCR system and proposes a new algorithm for segmentation and recognition of printed Myanmar script.

Chit San Lwin et al. Int J S Res CSE & IT. 2018 November-December-2018; 3(8) : 161-180
Volume 3, Issue 8, November-December-2018 http://ijsrcseit.com

Zone-wise Segmentation: It segments Myanmar script into two groups, namely primary (for consonants) and peripheral (for vowels) characters. It later discriminates each peripheral character by its statistical and structural features and puts it into the corresponding group for the later recognition process.

Lexicon-driven Recognition: It recognizes the characters with two different models. The first model, a hidden Markov model (HMM), is used […] self-organization map (SOM) is trained to classify different peripheral characters. It then combines the partially recognized results from each model and compares the resulting compound character […] the characters from the lexicon with the help of the k-nearest neighbors (k-NN) algorithm. In this comparison, k-NN clusters the groups most similar to the input character into k clusters and finds the most similar character among them for the resulting compound word.

Specific Feature Extraction: Due to the existence of specific character patterns in Myanmar script, the necessary features of each character are extracted, such as the number and size of dots, the place of dots, open loops or closed loops, the number of strokes, end-points, etc. These extracted features are exploited in all stages of the OCR process: segmentation, feature extraction and recognition.

The remainder of this paper is structured as follows. Section 2 presents a literature review regarding OCR. The characteristics and peculiarities of Myanmar script and our OCR system splitting are discussed in Section 3. A detailed explanation of our proposed system is given in Section 4. The development processes and experimental results are presented in Section 5, and the paper is concluded in Section 6 with a description of its limitations and prospective future work.

II. LITERATURE WORK

Segmentation in an OCR system is generally classified into line, word and character segmentation. Line segmentation, which separates the lines of a paragraph of text, has reached a fairly standard level and can be used successfully for several types of languages. For word and character segmentation, a number of research works propose slightly or specifically different approaches, depending on the languages they focus on.

Sahare et al. [1] proposed a character segmentation algorithm for Latin and Devanagari scripts. They mainly considered structural properties of characters in finding the primary segmentation path. The remaining overlapped and joined characters are detected using graph distance theory and individually split into separate characters. Afterwards, they validated the segmentation results with a support vector machine (SVM) to obtain accurate segmentation. With regard to recognition, they recognized the characters using their three new geometrical shape-based features together with a k-NN classifier.

Indic handwritten character segmentation was performed in a study [2] with three horizontal zone segmentations using an HMM and an SVM. While the HMM is used for middle zone segmentation, the SVM is used for the other two zones. The water reservoir feature and a window-based feature called pyramid histogram of oriented gradient (PHOG) are used in middle zone segmentation. The authors then combined the partial recognition results produced by each zone and finally performed word-level recognition.

A Chinese character recognition system was proposed by Tao et al. [3], who introduced a new manifold learning algorithm for characters based on a subspace learning algorithm, discriminative locality alignment (DLA), to find similar character groups for recognition of
input characters. They afterwards proposed a kernel version of their DLA algorithm, KDLA, by conducting principal component analysis (PCA). Another Chinese character recognition system applied convolutional neural networks (CNN) to offline handwritten OCR [4]. The authors proposed a global supervised low-rank expansion method and an adaptive drop weight (ADW) to improve the speed and storage capacity of their nine network layers for recognition.

Zarro et al. [5] proposed an online Kurdish character recognition system using an HMM and harmony search. Their system first split the characters into different sub-groups based on common directional feature vectors. A Markov model was then used to classify each group of characters. The candidate characters, with their associated features, were then classified by a harmony search recognizer. They highlighted in their paper that working with smaller groups reduces the processing time in the later recognition process.

In line with the most popular approach in OCR systems, an HMM is used in an online recognition system for Indic scripts [6]. In this study, the researchers presented two main techniques using HMM, lexicon driven and lexicon free, for two Indic scripts, namely Devanagari and Tamil. The two techniques differ in whether they depend on the handwriting order, but both represent each symbol in the lexicon as a sequence of symbol HMMs.

Lexicon-based text recognition was also proposed in another study [7]. The authors investigated a scene text recognition (STR) system to recognize text from signboards or anything else that displays text. This type of recognition is quite challenging due to variability of font size, position of visible parts, minimal language context, and unexpected and uncontrolled conditions. To solve these problems, they presented a probabilistic model for the STR system that organizes similarity, language properties and lexical decision by using sparse belief propagation, a bottom-up method for shortening messages to decrease the dependency between weakly supported hypotheses.

Premaratne et al. [8] used a lexicon-based Sinhala script recognition system with HMM. They proposed a segmentation-free recognition method using orientation features and linear symmetry. They exploited the advantages of the lexicon to verify and correct false rejections and missing character positions from the recognition stage, and also optimized the accuracy for missing words to an acceptable level.

III. CHARACTERISTICS OF MYANMAR LANGUAGE

3.1 Myanmar Cursive Script Language

The Myanmar language, also known as the Burmese language, is the national language of Myanmar. There are approximately a hundred spoken languages in Myanmar due to the existence of 135 distinct ethnic groups who speak their own languages in their regions. Amongst all of them, the Myanmar language is the official language, spoken by almost 44 million people, primarily by Burma (Burman) people and related ethnic groups in Myanmar and neighboring countries [9].

The Myanmar language belongs to the Sino-Tibetan language group, and its alphabet is derived from Brahmic and Kadamba-Pallava scripts. It is a tonal and syllable-timed language with subject-object-verb order in sentence structure. The Myanmar language is written cursively from left to right without the concept of lower-case and upper-case letters. There are 33 basic consonants, 16 vowels, 10 special characters and only two punctuation marks, which act like the comma (,) and full stop (.), as illustrated in Fig. 1(a-c, e). There are additional glyphs of Myanmar characters called double-layer characters, shown in Fig. 1(d). These double-layer characters are formed by placing similar or different consonants in layers. Although not all consonants can be used in this form, it is applicable to almost half of the consonants. However, they cannot stand alone to represent the meaning of a word without combination with other consonants.

Traditionally, a Myanmar word has one or more consonants with zero or more vowels, written separately or joined together. In this paper, we regard consonants as either primary letters or peripheral letters depending on their position, whereas vowels are always regarded as peripheral letters. All of them are interchangeably termed ligatures or characters in this paper.

Figure 1. Basic characters in Myanmar script

3.2 Peculiarities of Myanmar Language

As mentioned earlier, the Myanmar language is a cursive language like Urdu, Arabic, etc. Although character recognition for those languages has already reached a mature level, Myanmar OCR is still struggling to recognize all combined words because of the large set of characters, the complex combinations of consonants and vowels in one or more layers, other special characters, and characters that are very similar in shape. This section discusses the peculiarities of Myanmar script with the pictorial representation in Fig. 2.

Figure 2. Characteristic of Myanmar ligatures

Cursive Style: Because the letters in Myanmar script are delicately joined, the Myanmar language can be said to have a cursive writing style. Myanmar ligatures are based on closed circles and open circles of different sizes and in different directions, combined with straight lines, rounded corners, sloped lines and dots. A word can be cursively organized with more than one ligature in different ways. A complete sample Myanmar sentence is shown in Fig. 3.

Figure 3. A complete sentence in Myanmar script

Upper/Lower Case: There is no concept of letter casing in the Myanmar language. Moreover, the size of a letter, small or large, carries no meaning.

Space Usage: In English, space plays a key role in separating words, but in the Myanmar language it separates phrases. In formal writing, such as news in a newspaper, spaces are normally used between phrases. However, spacing is optional except in official letters.

Number of Dots and Their Position: Dots play a key role in distinguishing different meanings of a word in Myanmar script. They can be placed at three positions of a letter: the upper, lower and right sides. They can be used alone in one place or together in some of the possible places mentioned. Specifically, a single dot is used at the upper and lower places, whereas double dots are used at the right side of a letter. Although they are generally written as dots, some Myanmar scripts write them as small circles without changing the meaning.

Direction of Writing: There are bi-directional writing styles in the Myanmar language. Almost all words run from left to right. However, a few words run in the opposite direction.

Circles in Different Sizes: Circles are very basic letters in the Myanmar language. They can be used in all layers, horizontal and vertical. Normal circles represent characters, while small circles represent dots, as mentioned above.

Loop: Myanmar script uses some loops inside a circle, as shown in Fig. 2(d).

Size and Cross: The size, that is, the space taken by each primary letter, varies depending on its structure. The size of a character can be determined from its number of crosses, as shown in Fig. 2(b, c).

Layers: There are three horizontal layers in a Myanmar character. The outermost lines in Fig. 2(e) mark the boundary of the characters. The central layer holds primary letters and other peripheral letters, whereas the upper and lower layers hold only peripheral letters. According to the nature of Myanmar script, those three layers have the same height, such that the heights C_w, L_w and U_w are equal. The concept of equal layers is mainly used in the segmentation process discussed in the next section. The combination of primary and peripheral characters of a word, with their left-to-right sequence, is shown in Fig. 4.

Figure 4. Writing sequence of Myanmar script (from left to right)

IV. OFFLINE PRINTED MYANMAR CHARACTERS RECOGNITION TECHNIQUE

This section explains the proposed system in detail, with its overall design and major components as schematically described in Fig. 5. The OCR system accepts scanned image files as input; high-quality, high-speed scanners or other electronic devices such as phones and cameras capture the images. Preprocessing comes as the first phase of the overall process in order to smooth the images so that they are ready for the segmentation and recognition processes.

Figure 5. Overall design of proposed system

Unlike other OCR approaches, in our approach, […]n and separates the detected string of text into corresponding groups with the help of features obtained from the feature extraction phase. There are two steps in the recognition process, for primary and secondary (peripheral) ligatures. HMM is used for recognition of primary ligatures, while SOM is exploited for recognition of peripheral ligatures, which contain complex structures and features. As the final step of our work, a set of accurately recognized characters is produced as a text file.
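The final lexicon lookup, in which the partial results of the two models are combined and matched against the lexicon with k-NN, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `combine_features` and `knn_lexicon_match`, the concatenation of the two feature vectors, the Euclidean distance, and the toy lexicon entries.

```python
import math
from typing import Dict, List, Sequence, Tuple


def combine_features(primary: Sequence[float],
                     peripheral: Sequence[float]) -> List[float]:
    """Join the primary and peripheral feature vectors (assumed: concatenation)."""
    return list(primary) + list(peripheral)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_lexicon_match(query: Sequence[float],
                      lexicon: Dict[str, Sequence[float]],
                      k: int = 3) -> List[Tuple[str, float]]:
    """Return the k lexicon entries nearest to `query`, nearest first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    scored = sorted((euclidean(query, vec), label)
                    for label, vec in lexicon.items())
    return [(label, dist) for dist, label in scored[:k]]


if __name__ == "__main__":
    # Hypothetical lexicon: labels and feature values are placeholders.
    toy_lexicon = {
        "entry_a": [1.0, 0.0, 2.0, 1.0],
        "entry_b": [0.0, 1.0, 0.0, 0.0],
        "entry_c": [1.0, 0.0, 2.0, 0.0],
    }
    query = combine_features([1.0, 0.0], [2.0, 0.9])
    candidates = knn_lexicon_match(query, toy_lexicon, k=2)
    # The first candidate is the most similar lexicon entry.
    print(candidates[0][0], [label for label, _ in candidates])
```

The k nearest entries play the role of the candidate group, and the first element is the most similar character within that group. A real system would replace the toy vectors with the statistical and structural features described above.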

4.1 Pre-processing

The preprocessing step is the first step of the OCR system. It includes binarization, filtering, noise or outlier removal, skew correction, baseline detection, etc. This step smooths the input image for the later recognition steps such as segmentation and feature extraction. For a 2D gray scale input image im(x, y) with intensity function f(x, y), where m and n are the maximum row and column pixel numbers of the image, binarization is executed to decrease the computational complexity of the OCR system. To remove irregular patterns such as disconnected […]med, and a thinning process is afterwards applied. After these steps, all input images have the proper orientation and are free of any skew.

4.2 Segmentation

An offline printed or scanned document usually contains paragraphs composed of line-by-line sentences. Each sentence comprises a group of characters that are partially or totally connected to each other. Segmentation is an essential step of an OCR system. Its task is to divide the character strings into individual characters, in which ligatures may or may not be […].

Before word segmentation, we perform line segmentation of the input scanned text files. Line segmentation uses a horizontal projection profile technique that separates a paragraph into disjoint lines, each bounded by an upper and a lower line. It finds the peaks and valleys between lines and uses them as the separators of the text lines. We therefore obtain individual lines from the input paragraph text files. Both horizontal and vertical projection techniques work well for Myanmar script in word segmentation, as they do for English. Through these processes, we also obtain the word level from individual lines. For English, the vertical projection profile is also used to obtain character strings by searching for the space between characters, which yields completely separated characters. However, the vertical projection profile does not work perfectly for Myanmar script, because there is neither space between the characters nor a definite ending character of a word; that is, there may be different ending characters depending on the combination of different ligatures, as shown in Fig. 3 and Fig. 4. As a result, the character level is incomplete: some segmented characters are disjoint and others are still joined. We must therefore keep dividing the joined characters until they are disjoint before the character recognition step.

Separation of characters into primary or peripheral characters is a major task in cursive language segmentation, unlike in English. Among OCR systems for cursive languages, those for Urdu [10] and Bangla [11] used Freeman chain codes (FCC), while other research [12] used trigram probabilities normalized over the number of ligatures and words in the sequence.

We observed a variety of segmentation methods in the literature for different languages (especially cursive languages) due to their unique shapes and levels of structural complexity. In light of this, we present a novel segmentation algorithm that fits the cursive Myanmar script, as shown in Alg. 1.

Algorithm 1. H-zone segmentation
Input: im(x, y) : input image that includes a set of characters
       f : statistical and structural features of a character
Output: Hzone_i[] : i = 1, 2, 3
L = 3 // for three horizontal layers

Chit San Lwin et al. Int J S Res CSE & IT. 2018 November-December-2018; 3(8) : 161-180mux, y []: middle-upper coordinates of a13.im( x, y)character existing in Hzonei []mlx, y []: middle-lower coordinates of a14.15.lux , y []: left-upper coordinates of a16.character existing in Hzonei []llx, y []: left-lo

