Do not Lose the Details: Reinforced Representation Learning for High Performance Visual Tracking


Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Do not Lose the Details: Reinforced Representation Learning for High Performance Visual Tracking

Qiang Wang1,2, Mengdan Zhang2, Junliang Xing2†, Jin Gao2, Weiming Hu2, Steve Maybank3
1 University of Chinese Academy of Sciences
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
3 Department of Computer Science and Information Systems, Birkbeck College, University of London
{qiang.wang, mengdan.zhang, jlxing, jin.gao, wmhu}@nlpr.ia.ac.cn, sjmaybank@dcs.bbk.ac.uk

Abstract

This work presents a novel end-to-end trainable CNN model for high performance visual object tracking. It learns both low-level fine-grained representations and a high-level semantic embedding space in a mutually reinforced way, and a multi-task learning strategy is proposed to perform the correlation analysis on representations from both levels. In particular, a fully convolutional encoder-decoder network is designed to reconstruct the original visual features from the semantic projections to preserve all the geometric information. Moreover, the correlation filter layer working on the fine-grained representations leverages a global context constraint for accurate object appearance modeling. The correlation filter in this layer is updated online efficiently without network fine-tuning. Therefore, the proposed tracker benefits from two complementary effects: the adaptability of the fine-grained correlation analysis and the generalization capability of the semantic embedding. Extensive experimental evaluations on four popular benchmarks demonstrate its state-of-the-art performance.

1 Introduction

Visual tracking aims to estimate the trajectory of a target in a video sequence. It is widely applied, ranging from human motion analysis and human computer interaction to autonomous driving.
Although much progress [Ross et al., 2008; Kalal et al., 2010; Henriques et al., 2015] has been made in the past decade, it remains very challenging for a tracker to work at a high speed and to be adaptive and robust to complex tracking scenarios including significant object appearance changes, pose variations, severe occlusions, and background clutters.

Recent CNN based trackers [Tao et al., 2016; Held et al., 2016; Bertinetto et al., 2016; Wang et al., 2018] have shown great potential for fast and robust visual tracking. In the off-line network pre-training stage, they learn a semantic embedding space for classification [Bertinetto et al., 2016; Valmadre et al., 2017] or regression [Held et al., 2016] on the external massive video dataset ILSVRC2015 [Russakovsky et al., 2015] using a backbone CNN architecture such as AlexNet [Krizhevsky et al., 2012] or VGGNet [Simonyan and Zisserman, 2015]. Different from hand-crafted features, the representations projected in the learned semantic embedding space contain rich high-level semantic information and are effective for distinguishing objects of different categories. They also have certain generalization capabilities across datasets, which ensure robust tracking. In the online tracking stage, these trackers estimate the target position at a high speed through a single feed-forward network pass without any network fine-tuning.

Despite the convincing design of the above CNN based trackers, they still have some limitations. First, the representations in the semantic embedding space usually have low resolution and lose some instance-specific details and fine-grained localization information. These representations usually serve the discriminative learning of the categories in the training data. Thus, on the one hand, they may be less sensitive to the details and be confused when comparing two objects with the same attributes or semantics, as shown in Fig. 1; on the other hand, the domain shift problem [Nam and Han, 2016] may occur, especially when trackers encounter targets of unseen categories or undergoing abrupt deformations. Second, these models usually do not perform online network updating to improve tracking speed, which inevitably affects the model adaptability and thus hurts the tracking accuracy.

To tackle the above limitations, we develop a novel encoder-decoder paradigm for fast, robust, and adaptive visual tracking. Specifically, the encoder carries out correlation analysis on multi-resolution representations to benefit from both the fine-grained details and high-level semantics. On one hand, we show that the correlation filter (CF) based on the high-level representations from the semantic embedding space has good generalization capabilities for robust tracking, because the semantic embedding is additionally regularized by the reconstruction constraint from the decoder. The decoder imposes a constraint that the representations in the semantic space must be sufficient for the reconstruction of the original image. This domain-independent reconstruction constraint relieves the domain shift problem and ensures that the learned semantic embedding preserves all the geometric and structural information contained in the original fine-grained visual features. This yields a more accurate and robust correlation evaluation. On the other hand, another CF working on the low-level high-resolution representations contributes to fine-grained localization. A global context constraint is incorporated into the appearance modeling process of this filter to further boost its discrimination power. This filter serves as a differentiable CF layer and is updated efficiently on-line for adaptive tracking without network fine-tuning. The main contributions of this work are three-fold:

- A novel convolutional encoder-decoder network is developed for visual tracking. The decoder incorporates a reconstruction constraint to enhance the generalization capability and discriminative power of the tracker.
- A differentiable correlation filter layer regularized by the global context constraint is designed to allow efficient on-line updates for continuous fine-grained localization.
- A multi-task learning strategy is proposed to optimize the correlation analysis and the image reconstruction in a mutually reinforced way. This guarantees tracking robustness and model adaptability.

Based on the above contributions, an end-to-end deep encoder-decoder network for high performance visual tracking is presented. Extensive experimental evaluations on four benchmarks, OTB2013 [Wu et al., 2013], OTB2015 [Wu et al., 2015], VOT2015 [Kristan et al., 2015], and VOT2017 [Kristan et al., 2017], demonstrate its state-of-the-art tracking accuracy and real-time tracking speed.

Figure 1: Response maps learned by different methods for search instances (gymnastics4 and basketball). (a) Search instance, (b) Response map by SiamFC, (c) Response map by our Encoder-Decoder SiamFC (EDSiam), and (d) Response map by our Encoder-Decoder Correlation Filter (EDCF). EDSiam removes many noisy local minima in the response map of SiamFC. EDCF further refines the response map of EDSiam for more accurate tracking.

Equal contribution. † Contact author.

2 Related Work

Correlation filter based tracking. Recent advances of CF have achieved great success by using multiple feature channels [Danelljan et al., 2014b; Ma et al., 2015a], scale estimation [Li and Zhu, 2014; Zhang et al., 2015; Danelljan et al., 2014a], and boundary effect alleviation [Danelljan et al., 2015b; Kiani Galoogahi et al., 2017; Lukezic et al., 2017; Mueller et al., 2017]. However, with increasing accuracy comes a dramatic decrease in speed. Thus, CFNet [Valmadre et al., 2017] and DCFNet [Wang et al., 2017] propose to learn tracking-specific deep features end to end, which improves the tracking accuracy without losing the high speed. Inspired by the above two trackers, we incorporate a global context constraint into the correlation filter learning process while still obtaining a closed-form solution, which ensures a more reliable end-to-end network training process. Instead of using deep features with wide feature channels and low resolution as in [Valmadre et al., 2017], we focus on learning fine-grained features with fewer channels. This approach is more suitable for efficient tracking and accurate localization.

Deep learning based tracking. The excellent performance of deep convolutional networks on several challenging vision tasks [Girshick, 2015; Long et al., 2015] encourages recent works to either exploit existing deep CNN features within CFs [Ma et al., 2015a; Danelljan et al., 2015a] and SVMs [Hong et al., 2015a], or design deep architectures [Wang and Yeung, 2013; Wang et al., 2015; Nam and Han, 2016; Tao et al., 2016] for discriminative visual tracking. Although CNN features have shown high discrimination, extracting CNN features from each frame and training or updating trackers over high-dimensional CNN features are computationally expensive. Online fine-tuning a CNN to account for the target-specific appearance also severely hampers a tracker's speed, as discussed in [Wang et al., 2015; Nam and Han, 2016]. Siamese networks are exploited in [Tao et al., 2016; Held et al., 2016; Bertinetto et al., 2016] to formulate visual tracking as a verification problem without on-line updates. We enhance a Siamese network based tracker by exploiting an encoder-decoder architecture for multi-task learning. The domain-independent reconstruction constraint imposed by the decoder makes the semantic embedding learned in the encoder more robust to avoid domain shifts.

Hybrid multi-tracker methods. Some tracking methods maintain a tracker ensemble [Zhang et al., 2014; Wang et al., 2015], so the failure of a single tracker can be compensated by other trackers. TLD [Kalal et al., 2010] decomposes the tracking task into tracking, learning and detection, where tracking and detection facilitate each other. MUSTer [Hong et al., 2015b], LCT [Ma et al., 2015b] and PTAV [Fan and Ling, 2017] equip short-term correlation filter based tracking with long-term conservative re-detections or verifications. Our online adaptive correlation filter, working on the fine-grained representations, complements the long-term correlation filter based on the high-level generic semantic embedding. They share network architectures and are learned simultaneously in an end-to-end manner.

3 Encoder-Decoder Correlation Filter based Tracking

The proposed framework, named EDCF, is illustrated in Fig. 2. It is an encoder-decoder architecture to fully exploit multi-resolution representations for adaptive and robust tracking. In particular, a generic semantic embedding is learnt for robust

spatial correlation analysis. The embedding benefits from the domain-independent reconstruction constraint imposed by the decoder. Fine-grained target localization is achieved using the correlation filter working on the low-level fine-grained representations. This correlation filter is regularized by a global context constraint and implemented as a differentiable layer. Finally, the whole network is trained end to end based on a multi-task learning strategy to reinforce both the discriminative and generative parts.

Figure 2: Architecture of EDCF, an Encoder-Decoder Correlation Filter. It consists of two fully convolutional encoder-decoders. Convolutional features are extracted from the initial exemplar patch z′ and the search patch x′_t in frame t. The shallow features are exploited by the context-aware correlation filter tracker (CACF). The deep features, which capture a high-level representation of the image, are used in a cross-correlation embedding without online update to avoid the drift problem. The reconstruction loss is used to enrich the detailed representation. The three hybrid losses are jointly trained in a mutually reinforced way.

3.1 Generic Semantic Embedding Learning for Robust Tracking

Different from recent deep trackers [Tao et al., 2016; Bertinetto et al., 2016], whose semantic embedding spaces only serve discriminative learning, we propose to learn a more generic semantic embedding space by equipping traditional discriminative learning with an extra image reconstruction constraint. Since image reconstruction is an unsupervised task and is less sensitive to the characteristics of a training dataset, our learned semantic embedding space has a larger generalization capability, leading to more robust visual tracking. Moreover, the reconstruction constraint ensures that the semantic embedding space preserves all the geometric and structural information contained in the original fine-grained visual features. This increases the accuracy of the tracking.

The generic semantic embedding learning is based on an encoder-decoder architecture. The encoder \phi : R^{M \times N \times 3} \to R^{P \times Q \times D} consists of 5 convolution layers with two max-pooling layers and outputs a latent representation projected from the semantic embedding space. The decoder \psi : R^{P \times Q \times D} \to R^{M \times N \times 3} maps this high-level low-resolution representation back to the image space at the input resolution, achieved by stacking 7 deconvolutional layers. Then, the semantic embedding learning is optimized by minimizing the combination of the reconstruction loss L_{recon} and the tracking loss L_{high}:

    L_{sel} = L_{recon} + L_{high},    (1)

    L_{recon} = \| \psi(\phi(z'; \theta_e); \theta_d) - z' \|_2^2 + \| \psi(\phi(x'; \theta_e); \theta_d) - x' \|_2^2,    (2)

where the parameters of the encoder and the decoder are denoted as \theta_e and \theta_d, z' is the target image, and x' is the search image. The tracking loss is discussed as follows.

The spatial correlation operation in the semantic embedding space is used to measure the similarities between the target image and the search image:

    f_{u,v} = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \langle \phi_{i,j}(z'; \theta_e), \phi_{u+i,v+j}(x'; \theta_e) \rangle,    (3)

where \phi_{i,j}(z'; \theta_e) is a multi-channel entry for position (i, j) \in Z^2 in the latent representation of the target image z', m \times n corresponds to the spatial size for correlation analysis, and f_{u,v} denotes the similarity between the target image and the search image whose center is (u, v) \in Z^2 pixels away from the target center in height and width. Each search image has a label y(u, v) \in \{-1, 1\} indicating whether it is a positive sample or a negative sample. Thus, the tracking problem can be formulated as the minimization of the following logistic loss:

    L_{high} = \frac{1}{|D|} \sum_{(u,v) \in D} \log(1 + \exp(-y(u, v) f_{u,v})),    (4)

where D \subset Z^2 is a finite grid corresponding to the search space and |D| denotes the number of search patches.

3.2 Context-Aware Correlation Filter based Adaptive Tracking

Although the reconstruction constraint reinforces the semantic embedding learning to preserve some useful structural details, it is still necessary to carry out correlation analysis on the low-level fine-grained representations for accurate localization. A global context constraint is incorporated into the correlation analysis to suppress the negative effects of distractors. This is achieved by a differentiable correlation filter layer, which permits end-to-end training and online updates for adaptive tracking. Note that this correlation analysis is implemented in the frequency domain for efficiency.

We begin with an overview of the general correlation filter (CF). A CF is learned efficiently using samples densely extracted around the target. This is achieved by modeling all possible translations of the target within a search window as circulant shifts and concatenating their features to form the feature matrix Z_0. Note that both hand-crafted features and CNN features can be exploited, as long as they preserve the structural or localization information of the image. The circulant structure of this matrix facilitates a very efficient solution to the following ridge regression problem in the Fourier domain:

    \min_{w} \| Z_0 w - y \|_2^2 + \lambda \| w \|_2^2,    (5)

where the learned correlation filter is denoted by the vector w, each row of the square matrix Z_0 contains the features extracted from a certain circulant shift of the vectorized image patch z'_0, and the regression objective y is a vectorized image of a 2D Gaussian.

Inspired by the CACF method [Mueller et al., 2017], our correlation filter is regularized by the global context for larger discrimination power. In each frame, we sample k context image patches z'_i around the target image patch z'_0. Their corresponding circulant feature matrices are Z_i and Z_0, based on the low-level fine-grained CNN features. The context patches can be viewed as hard negative samples which contain various distractors and diverse background. Then, a CF is learned that has a high response for the target patch and a close-to-zero response for the context patches:

    \min_{w} \| Z_0 w - y \|_2^2 + \lambda_1 \| w \|_2^2 + \lambda_2 \sum_{i=1}^{k} \| Z_i w \|_2^2.    (6)

The closed-form solution in the Fourier domain for our CF is:

    \hat{w} = \frac{\hat{z}_0^* \odot \hat{y}}{\hat{z}_0^* \odot \hat{z}_0 + \lambda_1 + \lambda_2 \sum_{i=1}^{k} \hat{z}_i^* \odot \hat{z}_i},    (7)

where z_0 denotes the feature patch of the image patch z'_0, i.e., z_0 = \varphi(z'_0), \varphi(\cdot) is a feature mapping based on the low-level convolutional layers in our encoder, and \hat{z}_0 denotes the discrete Fourier transform of z_0.

We make the low-level fine-grained representations fit a CF by transforming the above correlation filter into a differentiable CF layer, which is cascaded behind a low-level convolutional layer of the encoder. This design permits end-to-end training of the whole encoder-decoder based network. In particular, the representations provided by a low-level convolutional layer of the encoder are designed to be fine-grained (without max-pooling) and with thin feature maps, which are quite sufficient for accurate localization. The representations are denoted as x = \varphi(x'; \theta_e^l), where x' is a search image and \theta_e^l denotes the parameters of these low-level convolutional layers. Then, the representations are learned via the following tracking loss:

    L_{low} = \| g(x') - y \|_2^2 = \| X w - y \|_2^2,    (8)

    g(x') = X w = F^{-1}(\hat{x} \odot \hat{w}),    (9)

where X is the circulant matrix of the representations x for the search image patch, and w is the learned CF based on the representations z_0 = \varphi(z'_0; \theta_e^l) for the target image patch and the representations z_i = \varphi(z'_i; \theta_e^l) for the global context as in Eqn. (7). The derivatives of L_{low} in Eqn. (8) are then obtained:

    \nabla_{\hat{g}} L_{low} = 2(\hat{g}(x) - \hat{y}),    (10)

    \nabla_{x} L_{low} = F^{-1}(\nabla_{\hat{g}} L_{low} \odot \hat{w}^*),    (11)

    \nabla_{\hat{w}} L_{low} = \nabla_{\hat{g}} L_{low} \odot \hat{x}^*,    (12)

    \nabla_{z_0} L_{low} = F^{-1}\big(\nabla_{\hat{w}} L_{low} \odot (\hat{y} - 2\,\mathrm{Re}(\hat{z}_0 \odot \hat{w})) / \hat{D}\big),    (13)

    \nabla_{z_i} L_{low} = -F^{-1}\big(\nabla_{\hat{w}} L_{low} \odot 2\lambda_2\,\mathrm{Re}(\hat{z}_i \odot \hat{w}) / \hat{D}\big),    (14)

where \hat{D} := \hat{z}_0^* \odot \hat{z}_0 + \lambda_1 + \lambda_2 \sum_{i=1}^{k} \hat{z}_i^* \odot \hat{z}_i is the denominator of \hat{w}, and Re(\cdot) is the real part of a complex-valued matrix.

3.3 Multi-task Learning and Efficient Tracking

Considering that the above two differentiable functional components complement each other in fine-grained localization and discriminative tracking based on the multi-resolution representations, we propose to utilize a multi-task learning strategy to train our network end to end and simultaneously reinforce the two components. Our multi-task loss function is:

    L_{all} = L_{low} + L_{high} + L_{recon} + R(\theta),    (15)

where R(\theta) is the \ell_2-norm of the network weights, introduced to regularize the network for better generalization.

In the tracking stage, given an input video frame at time t, we crop some large search patches centered at the previous target position with multiple scales, denoted as x'_s. These search patches are fed into the encoder to get two representations. The fine-grained representation is fed into the context-aware correlation filter layer given in Eqn. (9). The semantic representation is evaluated based on the spatial correlation operation given in Eqn. (3). Then, the target state is estimated by finding the maximum of the fused correlation response:

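Returning to the context-aware filter of Section 3.2: for single-channel 2D features, the closed-form solution of Eqns. (6)-(7) and the response of Eqn. (9) can be sketched with NumPy FFTs. This is a minimal sketch under assumed hyper-parameter values (lam1, lam2, the Gaussian label sigma) and hypothetical function names; the actual EDCF layer operates on multi-channel CNN features inside the network.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """2D Gaussian regression target y, with its peak rolled to index (0, 0)."""
    h, w = shape
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    g = np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def learn_context_aware_cf(z0, contexts, lam1=1e-4, lam2=0.1):
    """Closed-form context-aware CF in the Fourier domain (cf. Eqns. (6)-(7)):
    w_hat = conj(z0_hat) * y_hat / (conj(z0_hat)*z0_hat + lam1
                                    + lam2 * sum_i conj(zi_hat)*zi_hat)."""
    y_hat = np.fft.fft2(gaussian_label(z0.shape))
    z0_hat = np.fft.fft2(z0)
    denom = np.conj(z0_hat) * z0_hat + lam1
    for zi in contexts:                      # context patches act as negatives
        zi_hat = np.fft.fft2(zi)
        denom = denom + lam2 * np.conj(zi_hat) * zi_hat
    return np.conj(z0_hat) * y_hat / denom

def respond(w_hat, x):
    """Correlation response g(x) = F^{-1}(x_hat * w_hat), cf. Eqn. (9)."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * w_hat))
```

Because the filter is learned against circulant shifts, the response to a circularly shifted copy of the target peaks at exactly that shift, which is the property the tracker exploits for localization.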
