APPEARANCE FEATURE EXTRACTION VERSUS IMAGE TRANSFORM-BASED APPROACH FOR VISUAL SPEECH RECOGNITION


International Journal of Computational Intelligence and Applications, Vol. 6, No. 1 (2006) 101-122
© Imperial College Press

ALAA SAGHEER
Department of Intelligent Systems, Kyushu University, 6-1 Kasuga-Koen, Kasuga, Fukuoka 816-8580, Japan
alaa@limu.is.kyushu-u.ac.jp

NAOYUKI TSURUTA
Department of Electronics Engineering and Computer Science, Fukuoka University, 8-9-1 Nanakuma, Jonan-ku, Fukuoka 814-0180, Japan
tsuruta@tl.media.fukuoka-u.ac.jp

RIN-ICHIRO TANIGUCHI
Department of Intelligent Systems, Kyushu University, 6-1 Kasuga-Koen, Kasuga, Fukuoka 816-8580, Japan
rin@limu.is.kyushu-u.ac.jp

SAKASHI MAEDA
Department of Electronics Engineering and Computer Science, Fukuoka University, 8-9-1 Nanakuma, Jonan-ku, Fukuoka 814-0180, Japan
maeda@tl.media.fukuoka-u.ac.jp

Received 10 September 2005
Revised 15 February 2006

In this paper we propose a new appearance-based system consisting of two stages: visual speech feature extraction and classification, followed by recognition of the extracted features, so that the result is a complete lip-reading system. The system employs our Hyper Column Model (HCM) approach to extract and classify the visual features and uses the Hidden Markov Model (HMM) for recognition. This paper mainly addresses the first stage, i.e. feature extraction and classification. We investigate the performance of HCM for feature extraction and classification and then compare it with the performance obtained when HCM is replaced by the Fast Discrete Cosine Transform (FDCT). Unlike FDCT, HCM extracts the entire feature set without any loss. The experiments also show that HCM is generally better than FDCT and provides a good distribution of the phonemes in the feature space for recognition purposes. For a fair comparison, two databases are used, with three different resolution sets for each database. One of the two databases is designed to include shifted and scaled objects. Experiments reveal that HCM can recover from and deal with such image restrictions, whereas the effectiveness of FDCT drops drastically, especially for new subjects.

Keywords: Visual speech recognition; feature extraction; self-organizing map; hyper column model; discrete cosine transform.

1. Introduction

Recently, visual speech recognition (or automatic lip-reading) systems have been finding their way into many application areas, such as speaker verification, multimedia telephony for the hearing impaired, and interaction with terminals and machines for the handicapped and the elderly in home health-care systems. In principle, the visual speech recognition problem comprises two stages: (1) visual speech feature extraction and classification, and (2) visual speech feature recognition. In other words, the pattern (word or sentence) to be recognized is first converted into a set of features, believed to carry the class identity of the pattern, and this set of features is then classified as one of the possible classes. Although significant advances have been made in visual speech recognition technology, it is still difficult to design a speech recognition system that generalizes well without loss of features and without image/subject restrictions.1,2 In our opinion, this is due to the large appearance variability during lip movements. In addition, differences between the appearance of the subjects, lip sizes, face features and illumination conditions cause extra difficulty.3

This paper is concerned with the first task: feature extraction and classification. Different approaches for performing this task have been reported in the literature. They can be broadly classified into three main categories:

1. Geometric-feature-based.
2. Image-transform-based.
3. Appearance-based.

The geometric-feature-based approach obtains information from geometric features of the lip, such as its height, width, color or shape, or all of them.4,5 In the image-transform-based approach, the original gray-level image containing the lip is transformed into a feature space by some image transform technique.6,7 The appearance-based approach learns the decision boundary among different articulations from training data, without any extraction of geometric features; here the features depend on the intensity values of the image pixels that contain the lip.8 The approach presented in this paper for extracting the visual features falls into the third category.

Due to the data reduction involved in the first and second categories, a considerable amount of feature-related information is lost, which may affect recognition accuracy and result in relatively poor performance.9 In contrast, the last category uses all of the available information about the object, as will be explained shortly, and so gives better recognition accuracy. Another advantage of this approach is that important features can be represented in a low-dimensional space and can often be made invariant to image transformations such as translation, scaling, rotation and lighting, where the second approach fails.7,10 The only disadvantage of the third category is that it needs a large amount of training data so that the system can faithfully extract the features from arbitrary input data.
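To make the image-transform idea (category 2) and the data reduction it entails concrete, the sketch below computes a 2-D DCT of a grayscale lip region and retains only a small block of low-frequency coefficients as the feature vector. This is an illustrative sketch, not the FDCT implementation evaluated later in this paper; it assumes SciPy is available and that the lip ROI has already been located, and the function name and the choice of an 8 × 8 coefficient block are ours.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(roi, keep=8):
    """Illustrative image-transform feature extraction.

    roi  : 2-D grayscale lip region, e.g. a 128 x 128 pixel array.
    keep : side length of the retained low-frequency coefficient block.
    """
    coeffs = dctn(roi.astype(np.float64), norm="ortho")  # 2-D type-II DCT
    return coeffs[:keep, :keep].flatten()                # keep*keep features

# Example with a random stand-in for a cropped 128 x 128 lip image.
roi = np.random.rand(128, 128)
features = dct_features(roi)   # 64-dimensional feature vector
```

Discarding the high-frequency coefficients is precisely the data reduction referred to above; the appearance-based approach studied in this paper avoids it by working on the pixel intensities directly.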

Much effort has been put into proposing lip-reading systems that combine two or all of the above categories, in order to trade off the disadvantages of each individual approach.9

It follows, in general, from a variety of contributions reported in the literature that the performance of the appearance-based approach is better than that of the geometric-based approach.9,11-13 The essential target of this paper is to show that the performance of the appearance-based approach is also better than that of the image-transform-based approach. Additionally, during the development of our system, we focus on four further issues:

1. What is the appropriate set of visual units (or features) around the mouth for representing the visual information?
2. The system should extract the entire set of features without reduction.
3. The system should maintain a parametric feature space of low dimensionality, such that the distribution of each phoneme is simple and can be approximated by normal distributions.
4. How well does the system generalize, and how does it perform if the subject is shifted or scaled?

We believe that these four issues represent fundamental requirements for any visual speech recognition system, and the system proposed in this paper tries to satisfy them. To evaluate our system without bias, we conducted several experiments in which HCM17 was replaced by two different feature extraction approaches: the Self-Organizing Map18 (SOM) and the Fast Discrete Cosine Transform19 (FDCT). In separate experiments, we combined the Hidden Markov Model20 (HMM), as a feature recognizer, with each of the three approaches (an illustrative sketch of such a feature-extractor-plus-HMM pipeline is given at the end of this section). All experiments for each combination were conducted under the same conditions and on the same databases.

1.1. Related works

Deaf and hearing-impaired people can understand speech merely by reading the speaker's lips, without any acoustic information. Motivated by this ability, the problem of automatic lip-reading has been studied and a lot of work has been established in this field. Recently, with the development of computers, there has been much research on enabling computers to perform the components of lip-reading using several approaches. Luettin11,12 used HMM-based active shape models to extract an active speech feature set that includes derivative information, and compared its performance with that of a static feature set. Matthews13,14 compared three image-transform-based methods with the active appearance model (AAM) for extracting features from lip image sequences for recognition with HMMs; he utilized the DCT, the wavelet transform (WT) and principal component analysis (PCA) as the image-transform-based methods. Heckmann7 investigated different tactics for choosing the DCT coefficients to enhance feature extraction. Using an asymmetrically boosted HMM, Yin et al.15 developed an automatic visual speech feature extraction method to deal with their ill-posed multiclass sample distribution problem. Guitarte et al.16 compared the Active Shape Model (ASM) and the DCT for the feature extraction task in an embedded implementation. Hazen8 investigated several visual model structures, each of which provides a different means of defining the units of the visual classifier and the synchrony constraints between the audio and visual streams.
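As a concrete, hypothetical illustration of the two-stage pipeline described earlier in this section (feature extraction and classification followed by HMM recognition), the sketch below pairs a plain SOM, the simpler of the two appearance-based extractors compared in this paper, with one Gaussian HMM per sentence. It is not the authors' HCM/HMM implementation: the MiniSom and hmmlearn packages, the grid size, the number of HMM states and the use of best-matching-unit coordinates as per-frame features are all our assumptions.

```python
import numpy as np
from minisom import MiniSom          # pip install minisom
from hmmlearn import hmm             # pip install hmmlearn

# Stage 1: appearance-based feature extraction with a plain SOM.
# Each frame is a flattened grayscale lip image; the SOM maps it to the
# (row, col) coordinates of its best-matching unit, i.e. a 2-D feature.
def train_som(frames, grid=8):
    som = MiniSom(grid, grid, frames.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(frames, num_iteration=5000)
    return som

def som_features(som, frames):
    return np.array([som.winner(f) for f in frames], dtype=float)

# Stage 2: recognition with one Gaussian HMM per sentence class.
def train_hmms(sequences_per_class):
    """sequences_per_class maps a sentence label to a list of feature sequences."""
    models = {}
    for label, seqs in sequences_per_class.items():
        X = np.vstack(seqs)                  # stack all training sequences
        lengths = [len(s) for s in seqs]     # per-sequence frame counts
        m = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def recognize(models, seq):
    # The sentence whose HMM gives the highest log-likelihood wins.
    return max(models, key=lambda label: models[label].score(seq))
```

Mapping each frame to its best-matching unit gives a very low-dimensional per-frame feature in the appearance-based spirit; HCM, elaborated in Sec. 4, is the model this paper actually uses and is designed to cope with the shifted and scaled subjects described in Sec. 2.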

The rest of this paper is arranged as follows. The two databases employed in our experiments are described in Sec. 2. Section 3 gives an overview of SOM. HCM is elaborated in Sec. 4. Feature recognition by HMM is described in Sec. 5. Section 6 provides an overview of FDCT together with its recognition results. Experimental results and a comparison among the three systems are presented in Sec. 7, together with an analysis of the results. A discussion of the paper's results is given in Sec. 8. Future work and the conclusion are given in Sec. 9.

2. Database

One of the most challenging problems in the visual speech recognition domain is coping with the large variation across speakers in individual appearance and features, where lip sizes vary greatly between speakers. To accommodate this challenge, we designed our database according to a speaker-independent rule, using different speakers during the training and testing phases. This rule enables us to investigate how well the proposed system generalizes to new speakers. Our database consists of two different sets covering two different languages: Japanese and Arabic.

2.1. Sentences database

Both databases include nine sentences; each sentence consists of two words in the Japanese set and, with one exception, of three words in the Arabic set. Table 1 lists the Japanese and Arabic sentences along with their respective English meanings.

Table 1. Sentence database, Japanese (left) and Arabic (right).

  Japanese sentence        English meaning
  1. ATAMA ITAI            A headache in head
  2. SENAKA ITAI           A pain in back
  3. ONAKA SUITA           Feel hungry
  4. MUNE ITAI             A pain in chest
  5. TEACHI ITAI           A pain in limbs
  6. ATAMA OMOI            Heavy head
  7. ONAKA ITAI            A pain in stomach
  8. MUNE KURUSHI          Difficult breath
  9. TEACHI SHIBIRERU      Spasm in hand and leg

  Arabic sentence (English meaning; the Arabic text is not reproduced here)
  1. A pain in my teeth
  2. A headache in head
  3. A swelling in my back
  4. A pain in my gum
  5. The Arabic Salutation
  6. A swelling in my leg
  7. A pain in my back
  8. A swelling in my tooth
  9. A pain in my head

Each of the nine subjects (male and female) uttered all sentences once, without repetition. In order not to miss any part of the uttered sentence, the subject was requested to begin and end each sentence with silence. Each Arabic sentence consists of three words represented by 80 visual frames, whereas each Japanese sentence includes two words in 70 frames.

2.2. Image database

The Japanese database includes 5670 gray-scale images, subdivided into a training group and a test group. The training group consists of 3780 images from 6 different subjects. The test group has 1890 images from 3 Japanese subjects entirely different from those belonging to the training group. Similarly, the Arabic database includes 6480 gray-scale images, of which 4320 are reserved for the training phase and the remaining images are used for the test phase. (These counts follow directly from the per-sentence frame lengths given above; see the short check after the image-set list below.)

Images of both databases were captured in the Laboratory of Spoken Language and Image Processing, Fukuoka University, Japan, using a Sony EVI-G20 camera. Although the capturing process was performed in a natural environment, without special lighting effects, lip markers or coloring, there are some differences between the two sets.

1. Position restriction: In the Japanese set, the subject was asked to centralize his/her mouth as much as possible, as shown in Fig. 1(a). In contrast, the subject in the Arabic set was free to shift his/her mouth or scale his/her face with respect to the camera; in other words, the Arabic subject did not need to center the mouth or put it in a specific position in front of the camera. The only restriction was that the user's lips should lie inside the frame, not outside. Figure 1(b) shows samples of three different subjects, and the shifted and scaled object in each sample is easy to notice.
2. Background: In the Japanese data set the background was simple (plain), while the Arabic set uses a complex, natural environment, as shown in Figs. 1(a) and 1(b).

In order to obtain meaningful experimental results, we conduct the experiments using three different resolution sets for each database. Specifically, we use the original size and two further sizes obtained after cropping the region of interest (ROI) in the original image. The three sizes are as follows:

1. Image set 1: The resolution of each image is 160 × 120 pixels, without any cropping of the mouth area or the background, as shown in Figs. 1(a) and 1(b) for both databases.
2. Image set 2: The resolution of each image is 140 × 140 pixels and includes the ROI only, such that the rest of the image pixels, around the ROI, are white; see Fig. 1(c).
3. Image set 3: The resolution of each image is 128 × 128 pixels and includes the ROI only, such that the rest of the image pixels, around the ROI, are gray; see Fig. 1(d).
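The database sizes quoted above can be cross-checked against the per-sentence frame lengths of Sec. 2.1. The short check below is our own arithmetic, not part of the original experiments, and it assumes the Arabic set uses the same 6 training / 3 test subject split as the Japanese set.

```python
# Consistency check of the database sizes quoted in Secs. 2.1 and 2.2.
sentences = 9

# Japanese: 70 frames per sentence, 6 training subjects and 3 test subjects.
jp_train = 6 * sentences * 70          # 3780 training images
jp_test = 3 * sentences * 70           # 1890 test images
assert jp_train == 3780 and jp_test == 1890
assert jp_train + jp_test == 5670      # total Japanese images

# Arabic: 80 frames per sentence; a 6/3 subject split is assumed here.
ar_train = 6 * sentences * 80          # 4320 training images
ar_total = 9 * sentences * 80          # 6480 images in total
assert ar_train == 4320 and ar_total == 6480
print("database sizes are consistent")
```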

Fig. 1. Snapshots of different subjects for each database: (a) Japanese subjects with a plain background; (b) Arabic subjects, shifted and scaled, with a complex background; (c) 140 × 140 image set 2; (d) 128 × 128 image set 3.

The reason why we chose the latter two resolution sets is to be able to implement the fast DCT; more details are provided in Sec. 6. Also the reason that we use two colors (white a

