IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 7, JULY 2013

Simultaneous Facial Feature Tracking and Facial Expression Recognition

Yongqiang Li, Shangfei Wang, Member, IEEE, Yongping Zhao, and Qiang Ji, Senior Member, IEEE

Abstract— The tracking and recognition of facial activities from images or videos has attracted great attention in the computer vision field. Facial activities are characterized by three levels. First, in the bottom level, facial feature points around each facial component, i.e., eyebrow, mouth, etc., capture the detailed face shape information. Second, in the middle level, facial action units, defined in the facial action coding system, represent the contraction of a specific set of facial muscles, i.e., lid tightener, eyebrow raiser, etc. Finally, in the top level, six prototypical facial expressions represent the global facial muscle movement and are commonly used to describe human emotion states. In contrast to the mainstream approaches, which usually focus on only one or two levels of facial activities and track (or recognize) them separately, this paper introduces a unified probabilistic framework based on the dynamic Bayesian network to simultaneously and coherently represent the facial evolvement at different levels, their interactions, and their observations. Advanced machine learning methods are introduced to learn the model based on both training data and subjective prior knowledge. Given the model and the measurements of facial motions, all three levels of facial activities are simultaneously recognized through a probabilistic inference. Extensive experiments are performed to illustrate the feasibility and effectiveness of the proposed model on all three levels of facial activities.

Index Terms— Bayesian network, expression recognition, facial action unit recognition, facial feature tracking, simultaneous tracking and recognition.

Manuscript received April 25, 2012; revised December 18, 2012; accepted February 26, 2013. Date of publication March 20, 2013; date of current version May 13, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Adrian G. Bors. (Corresponding authors: Y. Li and S. Wang.) Y. Li and Y. Zhao are with the School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin 150001, China (e-mail: yongqiang.li.hit@gmail.com; zhaoyp2590@hit.edu.cn). S. Wang is with the School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China (e-mail: sfwang@ustc.edu.cn). Q. Ji is with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180 USA (e-mail: jiq@rip.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2013.2253477

I. INTRODUCTION

THE recovery of facial activities in image sequences is an important and challenging problem. In recent years, plenty of computer vision techniques have been developed to track or recognize facial activities at three levels. First, in the bottom level, facial feature tracking, which usually detects and tracks prominent facial feature points (i.e., the facial landmarks) surrounding facial components (i.e., mouth, eyebrow, etc.), captures the detailed face shape information. Second, facial action recognition, i.e., recognizing facial Action Units (AUs) defined in the Facial Action Coding System (FACS) [1], tries to recognize meaningful facial activities (i.e., lid tightener, eyebrow raiser, etc.). In the top level, facial expression analysis attempts to recognize facial expressions that represent the human emotional states.

Facial feature tracking, AU recognition, and expression recognition represent the facial activities at three levels from local to global, and they are interdependent problems. For example, facial feature tracking can be used in the feature extraction stage of expression/AU recognition, and expression/AU recognition results can provide a prior distribution for the facial feature points. However, most current methods only track or recognize the facial activities at one or two levels, and track them separately, either ignoring their interactions or limiting the interaction to one way. In addition, the estimates obtained by image-based methods at each level are always uncertain and ambiguous because of noise, occlusion, and the imperfect nature of the vision algorithms.

In this paper, in contrast to the mainstream approaches, we build a probabilistic model based on the Dynamic Bayesian Network (DBN) to capture the facial interactions at different levels. Hence, in the proposed model, the flow of information is two-way, not only bottom-up but also top-down. In particular, not only can facial feature tracking contribute to expression/AU recognition, but expression/AU recognition also helps to further improve the facial feature tracking performance. Given the proposed model, all three levels of facial activities are recovered simultaneously through a probabilistic inference by systematically combining the measurements from multiple sources at different levels of abstraction.

The proposed facial activity recognition system consists of two main stages: offline facial activity model construction, and online facial motion measurement and inference. Specifically, using training data and subjective domain knowledge, the facial activity model is constructed offline. During the online recognition, as shown in Fig. 1, various computer vision techniques are used to track the facial feature points and to obtain the measurements of facial motions, i.e., AUs. These measurements are then used as evidence to infer the true states of the three levels of facial activities simultaneously.

Fig. 1. Flowchart of the online facial activity recognition system: the input image sequence goes through preprocessing (face and eye detection), measurement extraction (facial feature point tracking with an Active Shape Model, plus Gabor transform and AdaBoost classification for individual AU measurements), and inference with the proposed facial activity model, which outputs the expression, the facial action units, and the facial feature points.

The paper is organized as follows. In Sec. II, we present a brief review of the related works on facial activity analysis. Sec. III describes the details of facial activity modeling, i.e., modeling the relationships between facial features and AUs (Sec. III-B), modeling the semantic relationships among AUs (Sec. III-C), and modeling the relationships between AUs and expressions (Sec. III-D). In Sec. IV, we construct the dynamic dependency and present the complete facial action model. Sec. V shows the experimental results on two databases. The paper concludes in Sec. VI with a summary of our work and its future extensions.

II. RELATED WORKS

In this section, we briefly review the related works on facial feature tracking, expression/AU recognition, and simultaneous facial activity tracking/recognition.

A. Facial Feature Tracking

Facial feature points encode critical information about face shape and face shape deformation. Accurate location and tracking of facial feature points are important in applications such as animation, computer graphics, etc. Generally, facial feature point tracking technologies can be classified into two categories: model-free and model-based tracking algorithms. Model-free approaches [47]–[49] are general-purpose point trackers without prior knowledge of the object. Each feature point is usually detected and tracked individually by performing a local search for the best matching position. However, model-free methods are susceptible to tracking errors due to the aperture problem, noise, and occlusion. Model-based methods, such as the Active Shape Model (ASM) [3], the Active Appearance Model (AAM) [4], the Direct Appearance Model (DAM) [5], etc., on the other hand, focus on explicitly modeling the shape of the objects. The ASM proposed by Cootes et al. [3] is a popular statistical model-based approach to represent deformable objects, where shapes are represented by a set of feature points. Feature points are first searched individually, and then Principal Component Analysis (PCA) is applied to analyze the modes of shape variation, so that the object shape can only deform in the specific ways found in the training data. Robust parameter estimation and Gabor wavelets have also been employed in ASM to improve the robustness and accuracy of the feature point search [6], [7]. The AAM [4] and DAM [5] were subsequently proposed to combine constraints of both shape variation and texture variation.

In the conventional statistical models, e.g., ASM, the feature point positions are updated (or projected) simultaneously, which implies that all feature points are treated as mutually interdependent. Intuitively, human faces have a sophisticated structure, and such a simple parallel mechanism may not be adequate to describe the interactions among facial feature points. For example, whether the eye is open or closed will not affect the localization of the mouth or nose. Tong et al. [8] developed an ASM-based two-level hierarchical face shape model, in which they used a multi-state ASM for each face component to capture the local structural details. For example, for the mouth, they used three ASMs to represent three states, i.e., widely open, open, and closed. However, the discrete states still cannot describe the details of each facial component's movement, i.e., only three discrete states are not sufficient to describe all mouth movements. At the same time, facial action units inherently characterize face component movements; therefore, involving AU information during facial feature point tracking may help further improve the tracking performance.
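To make the PCA-based shape constraint behind the ASM concrete, the following is a minimal NumPy sketch. It is illustrative only: the point count, number of modes, and the ±3 standard-deviation limits are assumptions for the example, not the settings of [3].

    import numpy as np

    def train_shape_model(shapes, num_modes=5):
        """shapes: (N, 2K) array, each row is K (x, y) feature points flattened.
        Returns the mean shape and the leading PCA modes of shape variation."""
        mean_shape = shapes.mean(axis=0)
        centered = shapes - mean_shape
        cov = np.cov(centered, rowvar=False)          # shape covariance
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:num_modes] # keep the largest modes
        return mean_shape, eigvecs[:, order], eigvals[order]

    def constrain_shape(shape, mean_shape, modes, eigvals, k=3.0):
        """Project an observed shape onto the PCA subspace and clip each mode
        coefficient to +/- k standard deviations, so the result can only deform
        in ways seen in the training data (the ASM idea)."""
        b = modes.T @ (shape - mean_shape)            # mode coefficients
        limit = k * np.sqrt(eigvals)
        b = np.clip(b, -limit, limit)
        return mean_shape + modes @ b

    # Toy usage: 100 random training shapes of 26 points, then regularize a noisy shape.
    rng = np.random.default_rng(0)
    train = rng.normal(size=(100, 52))
    mean_s, P, lam = train_shape_model(train)
    noisy = rng.normal(size=52)
    plausible = constrain_shape(noisy, mean_s, P, lam)

The clipping step is what makes the search robust: individually located points that violate the learned shape statistics are pulled back toward a plausible face shape.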
B. Expression/AUs Recognition

Facial expression recognition systems usually try to recognize either the six basic expressions or the AUs. Over the past decades, there has been extensive research on facial expression analysis [9], [14], [16], [21], [24]. Current methods in this area can be grouped into two categories: image-based methods and model-based methods.

Image-based approaches, which focus on recognizing facial actions by observing the representative facial appearance changes, usually try to classify expressions or AUs independently and statically. This kind of method usually consists of two key stages. First, various facial features, such as optical flow [9], [10], explicit feature measurements (e.g., length of wrinkles and degree of eye opening) [16], Haar features [11], [37], Local Binary Pattern (LBP) features [31], [32], independent component analysis (ICA) [12], feature points [47], Gabor wavelets [14], etc., are extracted to represent the facial gestures or facial movements. Given the extracted facial features, the expressions/AUs are identified by recognition engines, such as Neural Networks [15], [16], Support Vector Machines (SVMs) [14], [20], rule-based approaches [21], AdaBoost classifiers, Sparse Representation (SR) classifiers [33], [34], etc. A survey of expression recognition can be found in [22]. The common weakness of image-based methods for AU recognition is that they tend to recognize each AU or certain AU combinations individually and statically, directly from the image data, ignoring the semantic and dynamic relationships among AUs, although some of them analyze the temporal properties of facial features, e.g., [17], [45].
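As a rough illustration of the two-stage image-based pipeline just described (feature extraction followed by a recognition engine), the sketch below pairs Gabor-magnitude features with a linear SVM. The filter parameters, patch size, and classifier choice are assumptions made for the example, not the settings of the works cited above.

    import numpy as np
    from scipy.signal import convolve2d
    from sklearn.svm import LinearSVC

    def gabor_kernel(ksize=9, sigma=2.0, theta=0.0, lam=4.0):
        """Real part of a Gabor filter (illustrative parameter choices)."""
        half = ksize // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

    def gabor_features(patch, orientations=4):
        """Stack Gabor response magnitudes at several orientations into one vector."""
        feats = []
        for i in range(orientations):
            k = gabor_kernel(theta=i * np.pi / orientations)
            resp = convolve2d(patch, k, mode="same")
            feats.append(np.abs(resp).ravel())
        return np.concatenate(feats)

    # Toy usage: random 32x32 patches standing in for landmark regions,
    # with binary labels standing in for "AU present / absent".
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(40, 32, 32))
    labels = rng.integers(0, 2, size=40)
    X = np.stack([gabor_features(p) for p in patches])
    clf = LinearSVC(dual=False).fit(X, labels)
    print(clf.predict(X[:5]))

Such a classifier scores each frame and each AU in isolation, which is exactly the static, per-AU treatment whose limitations motivate the model-based methods discussed next.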

Model-based methods overcome this weakness of image-based approaches by making use of the relationships among AUs and recognizing the AUs simultaneously. Lien et al. [23] employed a set of Hidden Markov Models (HMMs) to represent the evolution of facial actions in time. The classification is performed by choosing the AU or AU combination that maximizes the likelihood of the extracted facial features generated by the associated HMM. Valstar et al. [18] used a combination of SVMs and HMMs, and outperformed the SVM method for almost every AU by modeling the temporal evolution of facial actions. Both methods exploit the temporal dependencies among AUs; they, however, fail to exploit the spatial dependencies among AUs. To remedy this problem, Tong and Ji [24], [25] employed a Dynamic Bayesian Network to systematically model the spatiotemporal relationships among AUs, and achieved significant improvement over the image-based method. In this paper, besides modeling the spatial and temporal relationships among AUs, we also make use of the information of expression and facial feature points, and, more importantly, the coupling and interactions among them.

C. Simultaneous Facial Activity Tracking/Recognition

The idea of combining tracking with recognition has been attempted before, such as simultaneous facial feature tracking and expression recognition [47], [50], [51], and integrating face tracking with video coding [27]. However, in most of these works, the interaction between facial feature tracking and facial expression recognition is one-way, i.e., facial feature tracking results are fed to facial expression recognition [47], [51]; there is no feedback from the recognition results to facial feature tracking. Most recently, Dornaika et al. [26] and Chen and Ji [30] improved the facial feature tracking performance by involving the facial expression recognition results. However, in [26], only six expressions are modeled and the model needs to be retrained for a new subject, while in [30], all upper facial action units are represented in one vector node, which ignores the semantic relationships among AUs, a key point for improving AU recognition accuracy.

Compared to the previous related works, this paper has the following features.

1) First, we build a DBN model to explicitly model the two-way interactions between different levels of facial activities. In this way, not only can expression and AU recognition benefit from the facial feature tracking results, but expression recognition can also help improve the facial feature tracking performance.

2) Second, we recognize all three levels of facial activities simultaneously. Given the facial action model and image observations, all three levels of facial activities are estimated simultaneously through a probabilistic inference by systematically integrating visual measurements with the proposed model.

III. FACIAL ACTIVITY MODELING

A. Overview of the Facial Activity Model

1) Single Dynamic Model: The graphical representation of the traditional tracking algorithm, i.e., the Kalman filter, is shown in Fig. 2(a).

Fig. 2. Comparison of different tracking models. (a) Traditional tracking model. (b) Tracking model with switch node. (c) Proposed facial activity tracking model.

X_t is the current hidden state, e.g., the image coordinates of the facial feature points we want to track, and M_t is the current image measurement (hereafter, shaded nodes represent measurements, i.e., estimates, and unshaded nodes denote hidden states). The directed links are quantified by conditional probabilities, e.g., the link from X_t to M_t is captured by the likelihood P(M_t | X_t), and the link from X_{t-1} to X_t by the first-order dynamic P(X_t | X_{t-1}). For online tracking, we want to estimate the posterior probability based on the previous posterior probability and the current measurement:

P(X_t | M_{1:t}) ∝ P(M_t | X_t) Σ_{X_{t-1}} P(X_t | X_{t-1}) P(X_{t-1} | M_{1:t-1})    (1)

where M_{1:t} is the measurement sequence from frame 1 to t. If both X_t and M_t are continuous and all the conditional probabilities are linear Gaussian, this model is a Linear Dynamic System (LDS).
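In the linear-Gaussian case, the recursion in (1) is exactly the Kalman filter predict/update step. The sketch below is a minimal illustration; the random-walk dynamics, identity observation model, and noise levels are assumptions chosen for the example, not parameters learned in this paper.

    import numpy as np

    def kalman_step(x, P, z, F, Q, H, R):
        """One recursion of (1) for a linear dynamic system: predict with
        P(X_t | X_{t-1}), then update with the likelihood P(M_t | X_t)."""
        # Predict: propagate the previous posterior through the dynamics.
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Update: fold in the current measurement.
        S = H @ P_pred @ H.T + R                 # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
        x_post = x_pred + K @ (z - H @ x_pred)
        P_post = (np.eye(len(x)) - K @ H) @ P_pred
        return x_post, P_post

    # Toy usage: 26 feature points (52 coordinates), random-walk dynamics,
    # direct noisy position measurements standing in for the tracker output.
    rng = np.random.default_rng(0)
    d = 52
    F, H = np.eye(d), np.eye(d)
    Q, R = 0.01 * np.eye(d), 0.25 * np.eye(d)
    x, P = np.zeros(d), np.eye(d)
    for _ in range(10):
        z = x + rng.normal(scale=0.5, size=d)    # stand-in measurement M_t
        x, P = kalman_step(x, P, z, F, Q, H, R)

Replacing the fixed transition F with a transition that depends on a discrete state leads to the switching model discussed next.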

2) Dynamic Model With Switching Node: The above tracking model has only one single dynamic P(X_t | X_{t-1}), and this dynamic is fixed for the whole sequence. For many applications, however, we hope that the dynamic can "switch" according to different states. Therefore, researchers have introduced a switch node to control the underlying dynamic system [28], [29]. For the switching dynamic model, the switch node represents different states, and for each state there are particular predominant movement patterns. The works in [26] and [30] also involve multiple dynamics, and their idea can be interpreted as the graphical model in Fig. 2(b). S_t is the switch node, and for each state of S_t there is a specific transition parameter P(X_t | X_{t-1}, S_t) to model the dynamic between X_t and X_{t-1}. Through this model, X_t and S_t can be tracked simultaneously, and their posterior probability is

P(X_t, S_t | M_{1:t}) ∝ P(M_t | X_t) Σ_{X_{t-1}, S_{t-1}} P(X_t | X_{t-1}, S_t) P(S_t | S_{t-1}) P(X_{t-1}, S_{t-1} | M_{1:t-1})    (2)

In [26], particle filtering is used to estimate this posterior probability.

3) Our Facial Activity Model: A Dynamic Bayesian Network is a directed graphical model, and compared to the dynamic models above, a DBN is more general and can capture complex relationships among variables. We propose to employ a DBN to model the spatiotemporal dependencies among all three levels of facial activities (facial feature points, AUs, and expression), as shown in Fig. 2(c) [Fig. 2(c) is not the final DBN model, but a graphical representation of the causal relationships between the different levels of facial activities]. The E_t node in the top level represents the current expression; AU_t represents a set of AUs; X_t denotes the facial feature points we are going to track; M_AU_t and M_X_t are the corresponding measurements of the AUs and the facial feature points, respectively. The three levels are organized hierarchically in a causal manner such that the level above is the cause while the level below is the effect. Specifically, the global facial expression is the main cause that produces certain AU configurations, which in turn cause local muscle movements and hence feature point movements. For example, a global facial expression (e.g., happiness) dictates the AU configuration, which in turn dictates the facial muscle movement and hence the facial feature point positions. For the facial expression in the top level, we focus on recognizing six basic facial expressions, i.e., happiness, surprise, sadness, fear, disgust, and anger. Though psychologists presently agree that there are ten basic emotions [54], most current research in facial expression recognition mainly focuses on the six major emotions, partially because they are the most basic, culturally and ethnically independent expressions, and partially because most current facial expression databases provide the six emotion labels. Given the measurement sequences, all three levels of facial activities are estimated simultaneously through a probabilistic inference via the DBN (Sec. IV-C), and the optimal states are tracked by maximizing this posterior:

(E_t*, AU_t*, X_t*) = argmax_{E_t, AU_t, X_t} P(E_t, AU_t, X_t | M_AU_{1:t}, M_X_{1:t})    (3)
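To give a feel for this kind of joint inference, the toy particle filter below approximates the posterior in (2) for a single coordinate with a binary switch state, and then reads out an approximate MAP state in the spirit of (3). All numbers are invented for illustration; the paper itself performs DBN inference over expression, AUs, and feature points rather than this simplified filter.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy switching dynamic model (illustrative, not learned): two switch states
    # with different motion magnitudes, one scalar "feature point" coordinate.
    N_PART = 500
    TRANS = np.array([[0.95, 0.05],           # P(S_t | S_{t-1})
                      [0.10, 0.90]])
    DYN_STD = np.array([0.05, 0.60])          # P(X_t | X_{t-1}, S_t): state-dependent noise
    MEAS_STD = 0.20                           # P(M_t | X_t)

    def particle_filter_step(xs, ss, ws, z):
        """One step of (2): propagate switch and continuous states,
        reweight by the measurement likelihood, then resample."""
        ss = np.array([rng.choice(2, p=TRANS[s]) for s in ss])   # sample S_t
        xs = xs + rng.normal(scale=DYN_STD[ss])                  # sample X_t given S_t
        ws = ws * np.exp(-0.5 * ((z - xs) / MEAS_STD) ** 2)      # weight by P(M_t | X_t)
        ws = ws / ws.sum()
        idx = rng.choice(len(xs), size=len(xs), p=ws)            # resample
        return xs[idx], ss[idx], np.full(len(xs), 1.0 / len(xs))

    xs = rng.normal(size=N_PART)
    ss = rng.integers(0, 2, size=N_PART)
    ws = np.full(N_PART, 1.0 / N_PART)
    for z in [0.10, 0.15, 0.90, 1.40, 1.50]:  # stand-in measurement sequence
        xs, ss, ws = particle_filter_step(xs, ss, ws, z)
        # Approximate MAP readout of the discrete state plus a mean estimate of X.
        s_map = np.bincount(ss, weights=ws, minlength=2).argmax()
        print(f"z={z:.2f}: switch state {s_map}, x estimate {np.average(xs, weights=ws):.2f}")

In the full model, the discrete layer is not a single switch node but the expression and AU nodes, whose dependencies are specified in the following subsections.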
B. Modeling the Relationships Between Facial Features and AUs

In this paper, we track 26 facial feature points, as shown in Fig. 3, and recognize 15 AUs, i.e., AU1, 2, 4, 5, 6, 7, 9, 12, 15, 17, 23, 24, 25, 26, and 27, as summarized in Table I. The selection of the AUs to be recognized is mainly based on their occurrence frequency, their importance for characterizing the six expressions, and the amount of annotation available. The 15 AUs we propose to recognize are all among the most commonly occurring AUs, they are primary and crucial for describing the six basic expressions, and they are also widely annotated. Though we only investigate 15 AUs in this paper, the proposed framework is not restricted to recognizing these AUs, given an adequate training data set.

Fig. 3. Facial feature points used in the algorithm.

TABLE I. List of AUs and Their Interpretations.

Facial action units control the movement of the face components and therefore control the movement of the facial feature points. For instance, activating AU27 (mouth stretch) results in a widely open mouth, and activating AU4 (brow lowerer) mak…
