Recognizing Human Actions by Learning and Matching Shape-Motion Prototype Trees

Zhuolin Jiang, Member, IEEE, Zhe Lin, Member, IEEE, Larry S. Davis, Fellow, IEEE

This work was funded by the Army Research Laboratory Robotics Collaborative Technology Alliance program (contract number: DAAD 19-012-0012 ARL-CTA-DJH) and the VIRAT program. Zhuolin Jiang and Larry S. Davis are with the Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA. E-mail: {zhuolin, lsd}@umiacs.umd.edu. Zhe Lin is with Advanced Technology Labs, Adobe Systems Incorporated, San Jose, CA 95110 USA. E-mail: zlin@adobe.com.

Abstract— A shape-motion prototype-based approach is introduced for action recognition. The approach represents an action as a sequence of prototypes for efficient and flexible action matching in long video sequences. During training, an action prototype tree is learned in a joint shape and motion space via hierarchical K-means clustering, and each training sequence is represented as a labeled prototype sequence; then a look-up table of prototype-to-prototype distances is generated. During testing, based on a joint probability model of the actor location and action prototype, the actor is tracked while a frame-to-prototype correspondence is established by maximizing the joint probability, which is efficiently performed by searching the learned prototype tree; then actions are recognized using dynamic prototype sequence matching. Distance measures used for sequence matching are rapidly obtained by look-up table indexing, which is an order of magnitude faster than brute-force computation of frame-to-frame distances. Our approach enables robust action matching in challenging situations (such as moving cameras, dynamic backgrounds) and allows automatic alignment of action sequences. Experimental results demonstrate that our approach achieves recognition rates of 92.86% on a large gesture dataset (with dynamic backgrounds), 100% on the Weizmann action dataset, 95.77% on the KTH action dataset, 88% on the UCF sports dataset and 87.27% on the CMU action dataset.

Index Terms— Action recognition, shape-motion prototype tree, hierarchical K-means clustering, joint probability, dynamic time warping.

I. INTRODUCTION

Action recognition is receiving more and more attention in computer vision due to its potential applications such as video surveillance, human-computer interaction, virtual reality and multimedia retrieval. Descriptor matching and classification-based schemes have been common for action recognition. However, for large-scale action retrieval and recognition, where the training database consists of thousands of action videos, such a matching scheme may require tremendous amounts of computation. Recognizing actions viewed against a dynamic, varying background is another important challenge. Many studies have been performed on effective feature extraction and categorization methods for robust action recognition. Detailed surveys were reported in [1]–[3].

Feature extraction methods for activity recognition can be roughly classified into four categories: geometry-based [4]–[6], motion-based [7]–[10], appearance-based [4], [11], [12], and space-time feature-based [13]–[28].
The geometry-based approaches recover information about human body configuration, but they often rely heavily on object segmentation and tracking, which is typically difficult and time consuming. The motion-based approaches extract optical flow features for recognition, but they rely on segmentation of the foreground to reduce the effects of background flows. The appearance-based approaches use shape and contour information to identify actions, but they are vulnerable to cluttered, complex backgrounds. The space-time feature-based approaches either characterize actions using global space-time 3D volumes or, more compactly, using sparse space-time interest points.

Recently, methods have been introduced, e.g. [14], [29]–[35], that combine multiple features to detect and recognize actions. Laptev and Perez [14] used shape and motion cues to detect drinking and smoking actions. Jhuang et al. [29] introduced a biologically inspired action recognition system which used a hierarchy of spatial-temporal feature detectors. Liu et al. [30] combined quantized vocabularies of local spatial-temporal volumes and spin images. Shet et al. [31] combined shape and motion exemplars in a unified probabilistic framework to recognize gestures. Schindler and Gool [32] extracted both form and motion features from an action snippet to model and recognize actions. Niebles and Fei-Fei [33] introduced a hierarchical model and a hybrid use of static shape features and spatial-temporal features for action classification. Ahmad and Lee [34] combined shape and motion flows to classify actions from multi-view image sequences. Mikolajczyk and Uemura [35] extracted a large set of low-dimensional local features to learn many vocabulary trees, allowing efficient action recognition and simultaneous action localization and recognition. Several approaches have also been proposed in recent years for recognizing human actions under view changes. Junejo et al. [36] proposed a self-similarity based descriptor for view-independent human action recognition. Parameswaran and Chellappa [37] modeled actions in terms of view-invariant canonical body poses and trajectories in 2D invariance space, to represent and recognize human actions from a general viewpoint. Souvenir and Babbs [38] learned viewpoint manifolds to provide a compact representation of primitive actions for view-invariant action recognition and viewpoint estimation.

Categorization methods are mostly based on machine learning or pattern recognition techniques. Classifiers commonly used for action recognition include NN/K-NN classifiers [7], [10], [12], [17], [18], [39]–[42], Support Vector Machine (SVM) classifiers [13], [15], [20], [21], [29], [32], [43], boosting-based classifiers [9], [14], [23], [26], Hidden Markov Models (HMM) [11], [31], [44], dynamic time warping (DTW) [45], [46] and Hough-voting-based classifiers [47]. For example, the method in [12] used n-Gram models to represent local temporal context and recognized actions based on histogram comparisons.

Marszalek et al. [43] exploited a context model between scenes and actions, and integrated it with bag-of-features and SVM-based classifiers to improve action recognition. Refs. [11], [31] incorporated temporal constraints between exemplars using HMMs. Additionally, Li et al. [40] presented a weighted directed graph to learn and recognize human actions. Lv and Nevatia [41] modeled the dynamics of an action or between actions using an unweighted directed graph. Fanti et al. [42] presented a hybrid probabilistic model for human motion recognition and modeled human motion as a triangulated graph. Sminchisescu et al. [48] integrated discriminative Conditional Random Fields (CRF) and Maximum Entropy Markov Models (MEMM) to recognize human actions.

Descriptor matching and classification-based schemes such as [7], [17] have been common for action recognition. However, for large-scale action recognition problems, where the training database consists of thousands of labeled action videos, such a matching scheme may require tremendous amounts of time for computing similarities or distances between actions. The complexity increases quadratically with respect to the dimension of the action (frame) descriptors. Reducing the dimensionality of the descriptors can speed up the computation, but it tends to trade off with recognition accuracy. In this regard, an efficient action recognition system capable of rapidly retrieving actions from a large database of action videos is highly desirable.

Many previous approaches relied on static cameras or considered only videos with simple backgrounds. The recognition problem becomes very difficult with dynamic backgrounds, because motion features can be greatly affected by background motion flows. Although some preliminary work [14], [15], [22], [35] has been done on recognizing actions in challenging movie scenarios, robustly recognizing actions viewed against a dynamic, varying background is still an important challenge.

Motivated by these issues, we introduce an efficient, prototype-based approach for action recognition. Our approach extracts rich information from observations but performs recognition efficiently via tree-based prototype matching and look-up table indexing. It captures correlations between different visual cues (i.e. shape and motion) by learning action prototypes in a joint feature space. It also ensures global temporal consistency by dynamic sequence alignment. In addition, it has the advantage of tolerating complex dynamic backgrounds due to median-based background motion compensation and probabilistic frame-to-prototype matching.

A. Overview

Fig. 1. Overview of our approach.

The block diagram of our approach is shown in Figure 1. During training, action interest regions are first localized and shape-motion descriptors are computed from them. Next, action prototypes are learned as the cluster centers of K-means clustering, and each training sequence is mapped to a sequence of learned prototypes. Finally, a binary prototype tree is constructed via hierarchical K-means clustering [49] using the set of learned action prototypes. In the binary tree, each leaf node corresponds to a prototype. During testing, humans are first detected and tracked using appearance information, and a frame-to-prototype correspondence is established by maximizing a joint probability of the actor location and action prototype. Given a rough location of the actor from appearance-based tracking, joint optimization is performed to refine the location of the actor and identify the corresponding prototype. Then, actions are recognized based on dynamic prototype sequence matching. Distances needed for matching are rapidly obtained by look-up table indexing, which is an order of magnitude faster than the brute-force computation of frame-to-frame distances.
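To make the look-up-table claim concrete, the following sketch (illustrative Python, not the authors' implementation; all names and the random stand-in data are invented) aligns two prototype-label sequences with dynamic time warping, reading every frame-to-frame cost from a precomputed K x K prototype-distance table:

```python
import numpy as np

def dtw_prototype_distance(seq_a, seq_b, proto_dist):
    """Align two prototype-label sequences with dynamic time warping.

    seq_a, seq_b : 1-D integer arrays of prototype indices (one per frame).
    proto_dist   : precomputed K x K table of prototype-to-prototype distances,
                   so each frame-to-frame cost is a single table lookup.
    Returns the accumulated alignment cost (smaller = more similar actions).
    """
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = proto_dist[seq_a[i - 1], seq_b[j - 1]]  # O(1) lookup
            acc[i, j] = cost + min(acc[i - 1, j],          # insertion
                                   acc[i, j - 1],          # deletion
                                   acc[i - 1, j - 1])      # match
    return acc[n, m]

# Usage sketch: K = 64 prototypes, distances precomputed once at training time.
K = 64
protos = np.random.rand(K, 512)                        # stand-in prototype descriptors
proto_dist = np.linalg.norm(protos[:, None] - protos[None, :], axis=2)
test_seq = np.random.randint(0, K, size=40)            # prototype labels of a test clip
train_seq = np.random.randint(0, K, size=55)           # labeled training sequence
print(dtw_prototype_distance(test_seq, train_seq, proto_dist))
```

Because each local cost is a single table lookup, the per-frame-pair cost is independent of the descriptor dimensionality, which is the source of the speed-up over recomputing descriptor distances.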
Our main contributions are three-fold:
- A prototype-based approach is introduced for robustly detecting and matching prototypes, and recognizing actions against dynamic backgrounds.
- Actions are modeled by learning a prototype tree in a joint shape-motion space via hierarchical K-means clustering.
- Frame-to-frame distances are rapidly estimated via fast prototype tree search and look-up table indexing.

This paper is organized as follows. In Sec. II, we introduce our action representation and learning methods in detail. In Sec. III, we first describe a tree-based approach for frame-to-prototype matching, and then introduce a prototype-based approach to measure distances between actions. In Sec. IV, we discuss our action localization and tracking methods. In Sec. V, we describe implementation details. In Sec. VI, we present experimental results and analysis. Finally, we conclude the paper and discuss possible future research directions in Sec. VII.

II. LEARNING ACTION REPRESENTATION

For representing and describing actions, an action interest region (a bounding box surrounding a person) is specified around the person in each frame of an action learning sequence. Examples of action interest regions are illustrated in Figure 2.

Fig. 2. Examples of action interest regions illustrated for samples from three datasets: (a) Keck gesture dataset, (b) Weizmann action dataset, (c) KTH action dataset.

A. Shape-Motion Descriptor

A shape descriptor for an action interest region is represented as a feature vector D_s = (s_1, ..., s_{n_s}) ∈ R^{n_s} by dividing the action interest region into n_s square grids (or sub-regions) R_1, ..., R_{n_s}. The shape observations come from background subtraction (when the camera and background are static) or from appearance likelihood maps (for dynamic cameras and backgrounds). In the latter case, a color appearance model is used to assign a probability of person occupancy to each pixel in the bounding box. Then, to form the un-normalized shape feature vector, we simply accumulate the probabilities of foreground pixels (0/1 values in the case of simple background subtraction) in each sub-region. Alternatively, in situations in which it might be difficult to extract binary silhouettes or appearance likelihood maps, a histogram of oriented gradients (HOG) can be used to encode the shape of each sub-region, and all the histograms can then be concatenated to form a raw shape feature vector. The feature vector is L2 normalized to generate the shape descriptor D_s. An appearance likelihood map and the shape descriptor computed from it are shown in Figure 3(c) and 3(f), respectively. Our method for estimating appearance likelihoods is explained in Sec. IV. The distance between two shape descriptors is computed using the Euclidean distance metric.

Fig. 3. An example of computing the shape-motion descriptor of a gesture frame with a dynamic background. (a) Raw optical flow field, (b) compensated optical flow field, (c) combined appearance likelihood map, (d) motion descriptor D_m computed from the raw optical flow field, (e) motion descriptor D_m computed from the compensated optical flow field, (f) shape descriptor D_s. A motion descriptor is visualized by placing its four channels in a 2 x 2 grid.

A motion descriptor for an action interest region is represented as an n_m-dimensional feature vector D_m = (QBMF_x^+, QBMF_x^-, QBMF_y^+, QBMF_y^-) ∈ R^{n_m}, where 'QBMF' refers to quantized, blurred, motion-compensated flow. We compute the motion descriptor D_m based on the robust motion flow feature introduced in [7] as follows. Given an action interest region, its optical flow field is first computed and divided into horizontal and vertical components, F_x and F_y, as in [7]. In contrast to [7], which directly uses F_x and F_y to compute the motion descriptors, we remove background motion components by subtracting from them the medians of the flow fields to obtain median-compensated flow fields. Intuitively, the median flows estimate robust statistics of the dominant background flows caused by camera movement and moving background objects. Figure 3(a) and 3(b) show an example of motion flow compensation for a gesture frame with a dynamic background. We can see from the figure that this approach not only effectively removes background flows but also corrects foreground flows, so that the extracted motion descriptors are more robust against dynamic, varying backgrounds.

The motion-compensated flow fields, MF_x and MF_y, are then half-wave rectified into four non-negative channels MF_x^+, MF_x^-, MF_y^+, MF_y^-, and each of them is blurred with a Gaussian kernel to form the low-level motion observations (BMF_x^+, BMF_x^-, BMF_y^+, BMF_y^-) as in [7]. As in computing shape descriptors, we reduce the resolution of the motion observations by averaging them inside uniform grids overlaid on the interest region. The resulting four channel descriptors are L2 normalized independently as in [7] and concatenated to generate a raw motion descriptor; the independent L2 normalization gives the four channel descriptors equal energy in the raw motion descriptor. Finally, the raw motion descriptor is L2 normalized to form the final motion descriptor D_m and to equalize the energy of the shape and motion components in the final joint descriptor. Figure 3(d) and 3(e) visualize the motion descriptors for an example gesture frame without and with motion compensation, respectively.
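As a concrete illustration of the descriptor computation described above, the following minimal sketch (assuming the optical-flow field and appearance-likelihood map are supplied by upstream modules; the grid sizes follow Fig. 4, the blur width is an arbitrary placeholder, and all function names are invented rather than taken from the authors' code) pools foreground probabilities into a shape descriptor, median-compensates and half-wave rectifies the flow into four blurred motion channels, and concatenates the L2-normalized parts into the joint descriptor D_sm:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def grid_pool(channel, grid):
    """Average a 2-D map inside a uniform grid and return the flattened cell values."""
    gh, gw = grid
    h, w = channel.shape
    channel = channel[:h - h % gh, :w - w % gw]        # crop so cells divide evenly
    cells = channel.reshape(gh, channel.shape[0] // gh, gw, channel.shape[1] // gw)
    return cells.mean(axis=(1, 3)).ravel()

def l2norm(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def shape_motion_descriptor(fg_likelihood, flow_x, flow_y,
                            shape_grid=(16, 16), motion_grid=(8, 8), sigma=3.0):
    """Joint shape-motion descriptor D_sm for one action interest region.

    fg_likelihood  : per-pixel person-occupancy probabilities (or a 0/1 silhouette).
    flow_x, flow_y : horizontal/vertical optical-flow components of the region.
    """
    # Shape part D_s: accumulate foreground probability per grid cell, then L2 normalize.
    D_s = l2norm(grid_pool(fg_likelihood, shape_grid))

    # Motion part: median compensation removes the dominant background flow.
    mf_x = flow_x - np.median(flow_x)
    mf_y = flow_y - np.median(flow_y)

    # Half-wave rectify into four non-negative channels and blur each with a Gaussian.
    channels = [np.maximum(mf_x, 0), np.maximum(-mf_x, 0),
                np.maximum(mf_y, 0), np.maximum(-mf_y, 0)]
    pooled = [grid_pool(gaussian_filter(c, sigma), motion_grid) for c in channels]

    # Channels are L2 normalized independently, concatenated, then normalized again (D_m).
    D_m = l2norm(np.concatenate([l2norm(p) for p in pooled]))

    return np.concatenate([D_s, D_m])                  # D_sm = (D_s, D_m)
```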
We concatenate the shape and motion descriptors D_s and D_m to form the joint shape-motion descriptor D_sm = (D_s, D_m) ∈ R^{n_sm}, where n_sm = n_s + n_m is the dimension of the combined descriptor. The distance between two action frames, i.e. two shape-motion descriptors D_sm^a and D_sm^b, is computed using the Euclidean distance metric. Based on the relative importance of shape and motion cues, we can learn a weighting scheme for the shape and motion components, D_sm = (w D_s, (1 - w) D_m), where the optimal weight w can be estimated using cross-validation by maximizing the recognition rate on the training data.¹

¹ The optimal w was estimated as 0.5 from a leave-one-person-out cross-validation on the Keck gesture dataset, and we then simply set w = 0.5 in all our experiments.

B. Shape-Motion Prototype Tree

Motivated by [11], [12], we represent an action as a set of basic action units. We refer to these action units as action prototypes Θ = (θ_1, θ_2, ..., θ_K). For learning a representative set of action prototypes Θ, we perform clustering on the set of descriptors extracted from the training data. Given the set of shape-motion descriptors for all frames of the training set, we perform K-means clustering in the joint shape-motion space using the Euclidean distance to learn the action prototypes. The cluster centers are then used as the action prototypes. In order to rapidly construct frame-to-prototype correspondences, similar to the online matching of shape exemplars by tree traversal in [50], we build a binary prototype tree over the set of prototypes via hierarchical K-means clustering [49] (K = 2) and traverse the tree to find the nearest-neighbor prototype for any given test frame (i.e. observation).

Fig. 4. An example of learning. (a)(b) Visualization of the shape and motion components of learned prototypes generated by K-means (K = 16). The shape component in (a) is represented by 16 x 16 grids and the motion component in (b) is represented by four (orientation channel) 8 x 8 grids. In the motion component, grid intensity indicates motion strength and the arrow indicates the dominant motion orientation at that grid. (c) The learned binary prototype tree constructed by hierarchical K-means (K = 2). Leaf nodes, represented as yellow ellipses, are action prototypes.
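The following sketch illustrates one plausible reading of the prototype-learning step (scikit-learn's KMeans stands in for the clustering; the tree data structure and function names are assumptions, not the authors' implementation): K-means yields the prototypes, recursive 2-means splits build the binary tree, and a query descriptor is assigned to a prototype by descending the tree.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_prototypes(frame_descriptors, K=16, seed=0):
    """K-means in the joint shape-motion space; cluster centers become prototypes."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(frame_descriptors)
    return km.cluster_centers_                          # shape (K, n_sm)

def build_prototype_tree(proto_ids, prototypes, seed=0):
    """Hierarchical K-means (K = 2) over the prototypes; leaves hold prototype indices."""
    if len(proto_ids) == 1:
        return {"leaf": int(proto_ids[0])}
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(prototypes[proto_ids])
    children = [build_prototype_tree(proto_ids[km.labels_ == c], prototypes, seed)
                for c in range(2)]
    return {"centers": km.cluster_centers_, "children": children}

def nearest_prototype(tree, descriptor):
    """Traverse the binary tree to find the (approximately) nearest prototype."""
    node = tree
    while "leaf" not in node:
        d = np.linalg.norm(node["centers"] - descriptor, axis=1)
        node = node["children"][int(np.argmin(d))]
    return node["leaf"]

# Usage sketch on random stand-in descriptors.
X = np.random.rand(1000, 512)                           # frame descriptors D_sm
prototypes = learn_prototypes(X, K=16)
tree = build_prototype_tree(np.arange(len(prototypes)), prototypes)
print(nearest_prototype(tree, X[0]))
```

Tree traversal compares the query only against the two child centroids at each level, so the lookup cost grows logarithmically with the number of prototypes instead of linearly.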

III. ACTION RECOGNITION

The action recognition process is divided into two steps: frame-to-prototype matching and prototype-based sequence matching.

A. Tree-based Frame-to-prototype Matching

1) Problem Formulation: Let the random variable V be an observation from an image frame, θ be a prototype random variable chosen from the set of K learned shape-motion prototypes Θ = (θ_1, θ_2, ..., θ_K), and α = (x, y, s) denote random variables representing the actor location (image position (x, y) and scale s). Then the frame-to-prototype matching problem is equivalent to maximizing the conditional probability p(θ, α | V). The conditional probability p(θ, α | V) is decomposed into a prototype matching term (the prototype likelihood given the actor location) and an actor localization term:

    p(θ, α | V) = p(θ | V, α) p(α | V).    (1)

For a test action sequence of length T with observations {V_t}, t = 1...T, a track of the actor's location {ᾱ_t}, t = 1...T, and location likelihood maps L(α | V_t), t = 1...T, are provided by an actor tracker (see Sec. IV). Based on the tracking information, the actor localization term p(α | V) is modeled as follows:

    p(α | V) = (L(α | V) - L_min) / (L_max - L_min),    (2)

where α is defined over a 3D neighborhood (i.e. image position (x, y) and scale s) around ᾱ_t, and L_min, L_max are the global minimum and maximum limits of L(α | V), respectively. Details of modeling and computing L(α | V) are provided in Sec. IV. An example of a location likelihood map is shown in Figure 6(c). We model the prototype matching term p(θ | V, α) as:

    p(θ | V, α) = e^{-d(D(V, α), D(θ))},    (3)

where d(D(V, α), D(θ)) is the distance between the shape-motion descriptor D(V, α), computed from observation V at location α, and the descriptor D(θ) of prototype θ.
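The joint maximization in Eqs. (1)-(3) can be sketched as follows (illustrative only; `candidates`, `L`, and `descriptor_fn` are hypothetical inputs standing in for the tracker output and the descriptor computation, and the nearest-prototype search is written as a brute-force scan where the paper uses the learned binary tree):

```python
import numpy as np

def match_frame_to_prototype(candidates, L, prototypes, descriptor_fn):
    """Jointly pick the actor location and prototype for one frame (Eqs. 1-3).

    candidates    : candidate locations alpha = (x, y, s) around the tracked position.
    L             : location-likelihood values L(alpha | V), one per candidate.
    prototypes    : (K, n_sm) array of prototype descriptors D(theta).
    descriptor_fn : callable returning the shape-motion descriptor D(V, alpha)
                    for a candidate location (assumed to be provided upstream).
    """
    L = np.asarray(L, dtype=float)
    p_alpha = (L - L.min()) / (L.max() - L.min() + 1e-12)      # Eq. (2)

    best = (-np.inf, None, None)
    for i, alpha in enumerate(candidates):
        D = descriptor_fn(alpha)                               # D(V, alpha)
        dists = np.linalg.norm(prototypes - D, axis=1)         # d(D(V, alpha), D(theta))
        k = int(np.argmin(dists))                              # best prototype; the paper
        p_theta = np.exp(-dists[k])                            # finds it by tree search, Eq. (3)
        score = p_theta * p_alpha[i]                           # joint probability, Eq. (1)
        if score > best[0]:
            best = (score, alpha, k)
    return best   # (joint probability, refined location, prototype index)
```

Per frame, the returned prototype index and refined location feed the prototype-label sequence that is later aligned by the prototype-sequence DTW sketched earlier.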
