Spontaneous Facial Expression Recognition: A Part-Based Approach


Nazil Perveen, Dinesh Singh, and C. Krishna Mohan
Visual Intelligence and Learning Group (VIGIL), Department of Computer Science and Engineering,
Indian Institute of Technology Hyderabad, Kandi, Sangareddy-502285, India.
email: {cs14resch11006, cs14resch11003, ckm}@iith.ac.in

Abstract—A part-based approach for spontaneous expression recognition using audio-visual features and a deep convolution neural network (DCNN) is proposed. The ability of the convolution neural network to handle variations in translation and scale is exploited for extracting visual features. The sub-regions, namely the eye and mouth parts extracted from the video faces, are given as input to the deep CNN (DCNN) in order to extract convnet features. The audio features, namely voice report, voice intensity, and other prosodic features, are used to obtain complementary information useful for classification. The confidence scores of the classifiers trained on the different facial parts and on the audio information are combined using different fusion rules for recognizing expressions. The effectiveness of the proposed approach is demonstrated on the acted facial expressions in the wild (AFEW) dataset.

Keywords—Isotropic smoothing, expression recognition, convolution neural network.

I. INTRODUCTION

Emotion reflects the mental status of the human mind. Mehrabian [1] indicated that the verbal part (i.e., the spoken words) of a message contributes only 7% of the effect of the message, the vocal part (i.e., voice information) contributes 38%, and facial expression contributes 55%. Therefore, facial expression plays an important role in the recognition of human emotions such as anger, disgust, fear, happiness, neutral, sadness, and surprise. Expressions recognized in an unconstrained environment are termed spontaneous expressions, and their recognition becomes a very difficult task due to various real-world issues such as illumination, posed faces, scaling, and occlusion. Handling these issues while maintaining reasonable classification accuracy is one of the biggest challenges today. Being an active research area, spontaneous expression recognition has immense applications: it can be used to make smart devices smarter using emotional intelligence [2], to perform surveys on products and services, and in engagement systems, mood recognition, psychology, real-time gaming, animated movies, etc. [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13].

Spontaneous expression recognition uses data-science technologies such as machine learning, artificial intelligence, big data, and bio-sensors to recognize expressions. Expression analysts and data scientists are trying to synchronize stimuli with expressions, for instance for detecting micro-expressions, to enhance the recognition rate of the primary emotions [14].

In 1978, Paul Ekman defined the human facial expressions that can be classified into seven basic classes, namely anger, disgust, fear, happiness, neutral, sadness, and surprise, also known as the universal expressions [15]. Several exhaustive research works have been carried out in the literature for automatic recognition of expressions in static images with high recognition rates, and recent advances in expression recognition from 2013 to 2015 have changed the perception of such recognition systems. In 2014, vision- and attention-theory-based sampling for continuous facial expression recognition by Bir Bhanu et al. [16] proposed modelling the way in which humans visualize expressions. In their approach, the dataset is divided into two categories based on frame rate, namely low and high frame rate: in the former the person is idle and expresses no emotion, while in the latter the person changes expressions frequently. Their main contribution is a video-based temporal sampling in which an appearance-based methodology is used for feature extraction and the features are then classified with a support vector machine. The recognition rate is 75% on the standard AVEC 2011/2012, CK/CK+, and MMI datasets.

An automatic framework for textured 3-D video-based facial expression recognition by Munawar and Bennamoun [17] hypothesizes a texture-based dynamic approach for recognizing expressions. Initially, small patches are extracted from the sample videos, and these patches are represented as points lying on a Grassmannian manifold; clusters are then formed through Grassmannian kernelization using a graph-based spectral clustering mechanism. All cluster centers are embedded together to reproduce a kernel Hilbert space in which a support vector machine (SVM) is learned for each expression. The recognition accuracy is 93%-94% on BU-4DFE (the Binghamton University 3-D facial expression database).

A different approach, 4-D facial expression recognition by learning geometric deformations by Ben Amor et al. [18], proposed in 2014, represents the face as a combination of radial curves lying on a Riemannian manifold and measures the deformation induced by each facial expression. The features obtained are of very high dimension, and hence a linear discriminant analysis (LDA) transformation is applied to project them into a low dimension. Two approaches are implemented for classification: one is a temporal or dynamic HMM, and the other applies mean deformation patches to random forest classification. The recognition rate is 93% on average across different datasets, namely the BU-4DFE, Bosphorus, D3DFACS, and Hi4D-ADSIP datasets.

Earlier, the topic of spontaneous expression recognition, i.e., expression recognition in an unconstrained environment, received little attention in the literature. J. F. Cohn et al. introduced spontaneous facial expression recognition in the Handbook of Face Recognition and threw some light on expression recognition in the real world, but the experiments performed by Cohn were also carried out under a constrained environment [19]. Later, a deformable 3-D model for dynamic human emotional state recognition was proposed by Yun et al. [20], which detects 26 fiducial points and tracks the displacement of each fiducial point. Depending on the displacements, a mesh model is formed that helps in synthesizing the emotions. The deformation features obtained from the model are mapped into a low-dimensional manifold using discriminative Isomap-based classification, which spans one of the expression spaces, with a result of 80% accuracy. Another approach, simultaneous facial feature tracking and facial expression recognition by Li et al. [21], describes the facial activity levels and applies a probabilistic framework, i.e., Bayesian networks, to all three levels of facial involvement. In general, facial activity analysis is done at only one or two levels; in their proposed methodology, all three levels of facial involvement are explored by applying a Gabor transform, an active shape model, an AdaBoost classifier, facial activity analysis, a Kalman filter, KL-divergence, and a dynamic Bayesian network on the CK and MMI datasets, on which they obtain an 87.43% recognition rate.

Recently, a fully automated system for recognition of spontaneous facial expressions in videos using random forest clusters by Moustafa K. et al. [22] uses the PittPatt face detector for extracting features and other information, such as yaw and roll angles, to predict poses up to 90 degrees, and a novel classifier consisting of a set of random forests paired with SVM labelers is used to detect expressions in the wild; in such an unconstrained environment, an accuracy of 75% is obtained. However, the dataset used in that approach is also not fully unconstrained. Challenges like EmotiW [23] continuously try to overcome the issues in spontaneous recognition by conducting the AFEW/SFEW competition every year. The winners of the EmotiW challenge in 2013-14 [24], [25] proposed combinations of different methodologies to cross the baseline of the challenge and reach an acceptable accuracy. The winner of the EmotiW 2015 challenge, Yao et al. [26], explores the relationship between the facial muscles, known as the latent relationship, by extracting patches that are specific to facial action units and formulating an undirected graph with these patches as vertices and the relationships between them as edges. This undirected graph distinguishes the emotions based on their facial movements.

We propose a simple and novel approach of dividing the face into expression-centered regions and extracting visual features using a deep convolution neural network for the video modality. For the audio modality, we extract prosodic and statistical features. The features extracted from both modalities are classified using support vector machines, and the scores obtained during classification are fused together to take the final decision. The paper is organized as follows: Section 2 describes the complete proposed approach for spontaneous recognition, including the pre-processing, feature extraction, and classification methodologies; Section 3 focuses on the fusion of the features extracted from the different modalities; Section 4 lists the results obtained at each level of the part-based approach; and Section 5 draws the conclusion and future scope of the proposed approach.

II. PROPOSED METHODOLOGY

The current trends pursued for any recognition task are categorized into three stages: pre-processing, feature extraction, and classification. Fig. 1 and Fig. 2 show the complete overview of the proposed part-based approach; the details are explained in the following sections. The main aim of this paper is to implement the simplest algorithm with reasonable accuracy, and one of the best ways to explore this is the part-based approach. In this approach, we divide the whole face into a set of the two most expressive salient regions, i.e., the eyes and the mouth, and all processing is done on these two parts, which are later combined to obtain the optimal result.

Fig. 1: Block diagram of the deep convolution neural network used in our proposed part-based approach for expression recognition in the wild. (a) Deep CNN used for the eye part, (b) deep CNN used for the mouth part.

A. Pre-processing

The main idea of pre-processing is to combat the effect of unwanted transformations, as not every part of a frame is useful for recognition. Pre-processing is also one of the major and most important steps of machine learning, as it leads to the extraction of good features. Pre-processing can be done in two ways:

1) Holistic pre-processing: the complete video frame is considered as input for pre-processing, and
2) Piecemeal pre-processing: meaningful parts of the video frame are considered as input to the pre-processing algorithm.

In the proposed approach, piecemeal pre-processing, shown in Fig. 2, is used: each video frame is divided into the two most salient regions of the face, i.e., the eye part and the mouth part, with the help of the annotations provided by IntraFace [27], and isotropic smoothing [28] is then applied as the pre-processing tool. Isotropic smoothing is a normalization technique that reduces the noise in images without removing their important information/details; it is a variant of the anisotropic diffusion proposed by Gross and Brajovic [29]. The reason for using isotropic smoothing is that it gives a better representation of the video frame under dim lighting conditions, and hence it helps in handling the illumination problem. The performance of isotropic smoothing is evaluated against the two most popular pre-processing normalization techniques, PCA whitening and ZCA whitening, with the optimal parameters mentioned in [30]. Fig. 3 shows the results of ZCA whitening, PCA whitening, and isotropic smoothing on the eye and mouth parts. It can be observed that isotropic smoothing is well suited to our approach, as it reduces the illumination problem in videos and makes the parts more descriptive. It is also helpful for small images or frames in which the face appears at different scales.

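The paper uses the isotropic variant of Gross-Brajovic smoothing [29]; as a rough, illustrative stand-in only, the numpy sketch below applies a plain isotropic (heat-equation) diffusion to a grayscale part image. The iteration count and time step are assumptions, not the authors' parameters.

```python
import numpy as np

def isotropic_diffusion(img, n_iter=20, dt=0.2):
    """Simple isotropic (heat-equation) diffusion of a grayscale image.

    Illustrative stand-in for the isotropic smoothing used in the paper;
    n_iter and dt are assumed values, not the authors' parameters.
    """
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # 4-neighbour Laplacian with replicated borders
        up    = np.pad(u, ((1, 0), (0, 0)), mode='edge')[:-1, :]
        down  = np.pad(u, ((0, 1), (0, 0)), mode='edge')[1:, :]
        left  = np.pad(u, ((0, 0), (1, 0)), mode='edge')[:, :-1]
        right = np.pad(u, ((0, 0), (0, 1)), mode='edge')[:, 1:]
        lap = up + down + left + right - 4.0 * u
        u += dt * lap
    return u

# Example: smooth a cropped eye part before feature extraction
eye_part = np.random.rand(74, 74)          # placeholder for a real eye crop
eye_smoothed = isotropic_diffusion(eye_part)
```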

Fig. 2: The framework of our proposed part-based approach for expression recognition in the wild.

Fig. 3: Different pre-processing techniques and their outputs.

B. Feature Extraction

Feature extraction is the process of converting the pixel information into some higher-level representation of shape, motion, color, texture, structure, or spatial configuration, so that it best conveys the important information in the image or pattern. The performance of any recognition system depends highly on good features. In the proposed approach, features are extracted from two modalities: video and audio.

Video features: In the literature, feature extraction from images or videos is generally performed in two ways:

1) Holistic feature extraction: complete pre-processed frames are used as input for extracting features, e.g., CNN features, statistical features, etc.
2) Piecemeal feature extraction: meaningful parts or patches of the pre-processed frames are used for extracting features, e.g., features extracted from facial components.

In this work, a holistic feature extraction technique, namely a convolution neural network (CNN), is implemented separately for the eye and mouth parts. The reason for choosing the CNN as the feature extraction mechanism is that it can handle translation and scale variances, and therefore scaling and translation issues can be resolved to some extent [31]. Fig. 1 describes the complete feature extraction process through the deep CNNs, and Table I shows the configurations used for the eye and mouth deep CNNs. Most of the frames from the expression videos are given as input during training to enhance the accuracy of the deep CNNs; however, some frames are discarded manually, as certain frames do not contain a face or contain misleading posed faces that may affect the training performance. Next, the two-level deep CNN for our part-based approach is implemented. The following steps describe the complete feature extraction process using the CNNs (more details are mentioned in Table III):

Step 1: Video frames are divided into two parts, eye and mouth.
Step 2: The optimal size of the parts is evaluated by taking the mode of the frame sizes, for input to the deep CNN.
Step 3: Eye parts are input to the eye deep CNN to extract features from four different expressions.
Step 4: Mouth parts are input to the mouth deep CNN to extract features from four different expressions.
Step 5: The output probabilities are used for further extraction of the features of the remaining expressions.

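As a rough illustration of Steps 1 and 2, the sketch below crops the two parts from a frame and resizes every crop to the most frequent (modal) size. In the paper the eye and mouth regions come from IntraFace annotations [27]; the (top, bottom, left, right) box format, the OpenCV resize, and the dummy data here are assumptions for illustration only.

```python
from collections import Counter
import cv2
import numpy as np

def crop_part(frame, box):
    """Crop a facial part given a (top, bottom, left, right) box.

    The box format is hypothetical; in the paper the eye and mouth
    regions come from IntraFace landmark annotations.
    """
    top, bottom, left, right = box
    return frame[top:bottom, left:right]

def resize_to_modal_size(parts):
    """Resize all part crops to the most frequent (modal) square size,
    as in Step 2 of the feature-extraction procedure."""
    sizes = Counter(min(p.shape[:2]) for p in parts)
    side = sizes.most_common(1)[0][0]
    return [cv2.resize(p, (side, side)) for p in parts]

# Example with dummy data (real crops come from the video frames)
frame = np.zeros((240, 320), dtype=np.uint8)
eye_crops = [crop_part(frame, (40, 114, 80, 154)) for _ in range(3)]
eye_crops = resize_to_modal_size(eye_crops)
```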

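The paper gives no code; as an illustration, the following PyTorch sketch builds the two part-specific networks with the layer configuration listed in Table I below, so that the flattened outputs of the last sub-sampling layers are 432-dimensional for the eye part and 160-dimensional for the mouth part. The choice of framework, the ReLU activations, and the single-channel input are assumptions.

```python
import torch
import torch.nn as nn

def make_part_cnn(channels, kernels):
    """Build a three-stage conv + 2x2 max-pool feature extractor.

    channels: filter counts per convolution, e.g. (10, 10, 12) for the eye CNN
    kernels:  kernel sizes per convolution, e.g. (3, 5, 5) for the eye CNN
    (values taken from Table I; everything else is an assumption).
    """
    layers, in_ch = [], 1  # grayscale input assumed
    for out_ch, k in zip(channels, kernels):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=k),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=2)]
        in_ch = out_ch
    layers.append(nn.Flatten())
    return nn.Sequential(*layers)

eye_cnn = make_part_cnn(channels=(10, 10, 12), kernels=(3, 5, 5))    # 74 x 74 input
mouth_cnn = make_part_cnn(channels=(6, 8, 10), kernels=(2, 2, 3))    # 43 x 43 input

print(eye_cnn(torch.zeros(1, 1, 74, 74)).shape)    # torch.Size([1, 432])
print(mouth_cnn(torch.zeros(1, 1, 43, 43)).shape)  # torch.Size([1, 160])
```
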
TABLE I: Deep-CNN configuration of the facial parts, eye and mouth

Parts      | Image Size | Convolution 1 | Sub-Sampling 1        | Convolution 2 | Sub-Sampling 2        | Convolution 3 | Sub-Sampling 3        | Feature Vector Size
Eye part   | 74 x 74    | 3x3 @ 10      | Max pooling @ scale 2 | 5x5 @ 10      | Max pooling @ scale 2 | 5x5 @ 12      | Max pooling @ scale 2 | 432
Mouth part | 43 x 43    | 2x2 @ 6       | Max pooling @ scale 2 | 2x2 @ 8       | Max pooling @ scale 2 | 3x3 @ 10      | Max pooling @ scale 2 | 160

The values from the last sub-sampling layer before the output layer are treated as the features for classification: the eye CNN yields a feature vector of dimension 432 and the mouth CNN a feature vector of dimension 160. The feature vectors obtained from the different regions of the face are then classified using a support vector machine (SVM) to generate scores, based on the highest accuracy obtained with the default SVM kernels.

Audio features: To extract audio features from the audio files, the Praat phonetics software [32] is used, which provides the voice intensity and the voice report of the audio files. Praat provides a huge amount of information about an audio signal, but only the features relevant to the person's voice, such as jitter, shimmer, noise-to-harmonics ratio, harmonics-to-noise ratio, pitch, and the standard deviation of the person's voice, are extracted. In addition, different statistical features such as zero-crossing rate, energy-entropy block, short-time energy, spectral flux, spectral centroid, and roll-off are extracted as suggested in [33]. The reason for selecting prosodic features from Praat is that the voice report of the person in a given video is much more accurate; for example, in a sad expression video where the person is sad or crying while a happy song plays in the background, the strength of the prosodic features lies in capturing the voice report of the person and not the background music. This makes them relevant for our unconstrained, emotional AFEW dataset.

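In the paper the prosodic voice-report features (jitter, shimmer, pitch, etc.) come from Praat [32]; the numpy sketch below only illustrates a few of the statistical descriptors listed above (zero-crossing rate, short-time energy, spectral centroid, and roll-off). The frame length, hop size, and roll-off threshold are assumed values.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a mono signal into overlapping frames (frame_len/hop are assumed)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def zero_crossing_rate(frames):
    # Fraction of consecutive samples whose sign changes, per frame
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def short_time_energy(frames):
    return np.mean(frames ** 2, axis=1)

def spectral_centroid_rolloff(frames, sr, roll=0.85):
    # Magnitude spectrum per frame, then centroid and 85% roll-off frequency
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    cum = np.cumsum(mag, axis=1)
    rolloff_bin = np.argmax(cum >= roll * cum[:, -1:], axis=1)
    return centroid, freqs[rolloff_bin]

# Example on a synthetic tone (a real signal would be decoded from the AFEW audio track)
sr = 16000
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
frames = frame_signal(signal)
zcr = zero_crossing_rate(frames)
ste = short_time_energy(frames)
centroid, rolloff = spectral_centroid_rolloff(frames, sr)
audio_feature = np.array([zcr.mean(), ste.mean(), centroid.mean(), rolloff.mean()])
```
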
C. Classification

Classification is the process of learning a target function f that maps a feature set x to one of the predefined class labels y:

    y = f(x).                                            (1)

Numerous research efforts have been carried out to devise optimal algorithms for classifying datasets with high accuracy, and different methodologies suit different applications; in much of the experimentation, support vector machines (SVMs) outperform the alternatives. The support vector machine [34] is a supervised learning algorithm that learns a discriminative function between the patterns of two classes by mapping them to a high-dimensional space and finding the hyper-plane that maximizes the distance to the closest training samples. Mapping to the high-dimensional (kernel) space helps in transforming a non-linear relation into a linear one (according to Cover's theorem). A kernel function K(x_i^T, x) is used to calculate distances in the kernel space, and the performance of the SVM is highly dependent on this kernel. The SVM mapping is defined as

    X → Φ(X) = Z,    W^T · Z = Y,                        (2)

where X is the training samples, Φ(·) is the non-linear function, W is the weight vector that creates the hyper-plane, Z is the feature space, and Y is the target class.

By default, four kernels are commonly used with SVMs: the linear, polynomial, radial basis function, and sigmoid kernels. Also, since the SVM depends only on the support vectors (a subset of the training samples) for creating the hyper-plane and not on all samples, it works well on smaller datasets. Due to these advantages, we select the SVM as the classifier in our proposed method for classifying the features obtained from the deep convolution neural networks. Table IV reports the accuracy obtained with different kernels on the different parts using the SVM.

III. SCORING AND FUSION

Score-level fusion is required because the number of samples and the dimensions of the feature vectors obtained from the multiple modalities are different. Therefore, to recognize the expression, the scores obtained from the different modalities during classification are fused together using normal and weighted fusion rules. The output score is defined as

    Output of scores = Number of samples / Number of classes.    (3)

Further, at the score level all modalities are of equal size, so it is easier to fuse the different models, i.e., eye, mouth, and audio. Table V describes the different fusion methods used for obtaining a good recognition rate.

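To make the classification and score-level fusion stages concrete, the scikit-learn sketch below trains one SVM per modality (eye, mouth, audio), treats the class-probability outputs as scores, and combines them with a weighted-sum fusion rule. The random feature matrices, the RBF kernel, and the fusion weights are placeholders, not the settings behind Tables IV and V.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_val, n_classes = 200, 50, 7

# Placeholder feature dimensions; real features come from the eye CNN (432-d),
# the mouth CNN (160-d), and the audio descriptors.
modalities = {"eye": 432, "mouth": 160, "audio": 10}
weights = {"eye": 0.4, "mouth": 0.4, "audio": 0.2}   # assumed fusion weights

y_train = rng.integers(0, n_classes, n_train)
y_val = rng.integers(0, n_classes, n_val)

fused = np.zeros((n_val, n_classes))
for name, dim in modalities.items():
    X_train = rng.normal(size=(n_train, dim))        # placeholder features
    X_val = rng.normal(size=(n_val, dim))
    # One SVM per modality; probability=True gives per-class scores to fuse.
    clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
    scores = clf.predict_proba(X_val)                # shape: n_samples x n_classes
    fused += weights[name] * scores

prediction = fused.argmax(axis=1)
accuracy = (prediction == y_val).mean()
```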

IV. EXPERIMENTAL RESULTS

The experimental results of the proposed approach are summarized in this section. Training, validation, and testing are done on the acted facial expressions in the wild (AFEW) dataset, which consists of acted expressions extracted from different movies. The authors of the dataset [23] divide it into seven different expressions, with 723 videos in the training set, 383 videos in the validation set, and 539 videos in the test set.

The proposed methodology divides each video frame into the two expressive salient parts, eye and mouth, which are pre-processed using the isotropic smoothing mechanism to handle variations in illumination. These pre-processed parts are then input to the deep CNNs. The total numbers of eye and mouth parts in the training and validation sets are shown in Table II.

Each part is re-sized to an optimal square image size, as required by the CNN model. This is done by taking the most frequent image size observed during extraction of the facial parts. Thus, the optimal size of an eye frame is 74 x 74 and that of a mouth frame is 43 x 43. As the image size differs between the parts, two deep CNNs are designed separately, namely the eye deep CNN and the mouth deep CNN.

TABLE II: Total number of facial parts

Part  | Number of samples in training set | Number of samples in validation set
Eye   | 19,224                            | 18,332
Mouth | 19,480                            | 18,38

TABLE IV: Classification performance (%) of different parts and audio using the SVM with different kernels.

