Face Recognition Based on MTCNN and FaceNet

Rongrong Jin, Hao Li, Jing Pan, Wenxi Ma, and Jingyu Lin

Abstract

Face recognition performance has improved rapidly with the recent development of deep learning techniques and the accumulation of large underlying training datasets. However, face images in the wild undergo large intra-personal variations, such as poses, illuminations, occlusions, and low resolutions, which pose great challenges to face-related applications. This paper addresses this challenge with a deep learning framework based on MTCNN and FaceNet that can recover the canonical view of face images. In our project, we build our own face recognition system, which achieves high accuracy on the LFW benchmark. We exploit the inherent correlation between detection and alignment to improve their performance under a deep cascaded multi-task framework. In particular, we use a three-stage architecture of carefully designed convolutional neural networks to detect faces and coarsely locate key points. FaceNet directly learns a mapping from a face image to a compact Euclidean space, where distances directly correspond to a measure of facial similarity. Once this space has been generated, face recognition, verification, and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. This approach dramatically reduces intra-person variance while maintaining inter-person discriminativeness. Our experiments are not flawless, but we summarize them and present some of the challenges that lie ahead in face recognition.

Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1. Introduction

With the rapid development of artificial intelligence in recent years, facial recognition has gained more and more attention. Compared with traditional card recognition, fingerprint recognition, and iris recognition, face recognition has many advantages, including but not limited to contactless operation, high concurrency, and user friendliness. It has high potential to be used in government, public facilities, security, e-commerce, retailing, education, and many other fields.

Traditional face recognition methods use feature operators to model faces, which is simple and easy to implement. However, further research has shown that while these algorithms are effective at finding linear structures, they often achieve unsatisfactory recognition results when facing potentially nonlinear structures.

With the development of deep learning and the introduction of deep convolutional neural networks, the accuracy and speed of face recognition have made great strides. However, results differ considerably across networks and models. Previous face recognition approaches based on deep networks use a classification layer (Taigman et al. 2014; Tang 2015); they regard face recognition as a classification task, where the number of softmax outputs equals the number of face identities. Therefore, every time a new identity comes in, the whole model needs to be retrained. FaceNet, in contrast, directly trains its output to be a compact 128-D embedding using a triplet-based loss function motivated by LMNN (Schroff, Kalenichenko, and Philbin 2015). The triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative. The thumbnails are tight crops of the face area; no 2D or 3D alignment other than scale and translation is performed.
The benefit of this approach is much greater representational efficiency: it achieves state-of-the-art face recognition performance using only 128 bytes per face. We therefore use FaceNet's 128-dimensional vector to represent a face and recognize faces by computing vector distances.

To achieve better performance, we first use MTCNN (Zhang et al. 2016) for face detection, then feed the result of MTCNN into FaceNet to perform face recognition. MTCNN is a mainstream detection network with high detection accuracy that is lightweight and real-time.

Our face recognition process is therefore divided into two steps: face detection and face recognition. First, MTCNN is used for face detection to obtain accurate face coordinates. Based on the results of this step, FaceNet is used for face recognition. The processing flow of MTCNN is as follows: the test image is repeatedly resized to build an image pyramid; the pyramid is fed into P-Net, which produces a large number of candidates; the candidates screened by P-Net are fine-tuned by R-Net; after R-Net removes many candidates, the remaining images are passed to O-Net; finally, accurate bounding box coordinates are output. Compared with DeepFace, FaceNet retains face alignment, abandons hand-crafted feature extraction, and directly trains a CNN end-to-end after face alignment.
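To make the recognition step concrete, the following is a minimal sketch of verifying two faces by embedding distance. The `detect_face` and `embed` wrappers mentioned in the comments are hypothetical stand-ins for an MTCNN detector and a FaceNet model, and the threshold is an illustrative placeholder rather than a tuned value.

```python
import numpy as np

def verify(embed_a: np.ndarray, embed_b: np.ndarray, threshold: float = 1.1) -> bool:
    """Decide whether two 128-D FaceNet embeddings belong to the same person.

    FaceNet embeddings lie on the unit hypersphere, so the squared L2
    distance between two embeddings directly measures face similarity.
    The threshold is dataset-dependent and would be tuned on a validation
    set (e.g., LFW); 1.1 is only a placeholder.
    """
    dist = np.sum((embed_a - embed_b) ** 2)  # squared Euclidean distance
    return dist < threshold

# Usage with hypothetical detector/embedder wrappers:
#   box  = detect_face(image)           # MTCNN: face bounding box
#   emb  = embed(crop(image, box))      # FaceNet: 128-D unit vector
#   same = verify(emb_1, emb_2)
```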

2. Related Work

Face detection

Face detection is essential to many face applications, such as face recognition and facial expression analysis. However, the large visual variations of faces, such as occlusions, large pose variations, and extreme lightings, impose great challenges for these tasks in real-world applications.

The cascade face detector proposed by Viola and Jones (Viola and Jones 2004) utilizes Haar-like features and AdaBoost to train cascaded classifiers, which achieves good performance with real-time efficiency. However, quite a few works (Yang et al. 2014; Pham et al. 2010) indicate that this kind of detector may degrade significantly in real-world applications with larger visual variations of human faces, even with more advanced features and classifiers. Besides the cascade structure, Zhu and Ramanan (Zhu and Ramanan 2012) introduce deformable part models (DPM) for face detection and achieve remarkable performance. However, DPMs are computationally expensive and usually require expensive annotation in the training stage. Recently, convolutional neural networks (CNNs) have achieved remarkable progress in a variety of computer vision tasks, such as image classification and face recognition (Sun, Wang, and Tang 2014). Inspired by these significant successes, several studies utilize deep CNNs for face detection. Yang et al. (Yang et al. 2016) train deep convolutional neural networks for facial attribute recognition to obtain high responses in face regions, which further yield candidate windows of faces. However, due to its complex CNN structure, this approach is time-costly in practice. Li et al. (Li et al. 2015) use cascaded CNNs for face detection, but their method requires bounding box calibration from face detection with extra computational expense and ignores the inherent correlation between facial landmark localization and bounding box regression.

Face recognition

Using deep neural networks to learn effective feature representations has become popular in face recognition (Sun, Wang, and Tang 2013). With better deep network architectures and supervisory methods, face recognition accuracy has been boosted rapidly in recent years. Previous face recognition approaches based on deep networks use a classification layer (Taigman et al. 2014) trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces, and by using a bottleneck layer the representation size per face is usually very large (thousands of dimensions). Some recent work has reduced this dimensionality using PCA, but that is a linear transformation that can be easily learnt in one layer of a network.

3. Method

For accurate face recognition, we train two networks, MTCNN and FaceNet. MTCNN is used to detect the face and obtain its exact coordinates.

Figure 1: Pipeline of the MTCNN cascaded framework, which includes three stages of multi-task deep convolutional networks. First, candidate windows are produced by a fast Proposal Network (P-Net). These candidates are then refined in the next stage by a Refinement Network (R-Net). In the third stage, the Output Network (O-Net) produces the final bounding boxes.
Based on the results of face detection, face recognition is performed using FaceNet. FaceNet directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification, and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

3.1. MTCNN

MTCNN is a deep cascaded multi-task framework which exploits the inherent correlation between detection and alignment to boost their performance. The framework leverages a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark locations in a coarse-to-fine manner. In addition, a new online hard sample mining strategy further improves performance in practice.

3.1.1. Overall Framework

The overall pipeline of MTCNN is shown in Figure 1. Given an image, we initially resize it to different scales to build an image pyramid, which is the input to the following three-stage cascaded framework:

Stage 1: We exploit a fully convolutional network, called Proposal Network (P-Net), to obtain candidate facial windows and their bounding box regression vectors. The candidates are then calibrated based on the estimated bounding box regression vectors. After that, we employ non-maximum suppression (NMS) to merge highly overlapped candidates.
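Since NMS is used in every stage, a concrete reference may help. Below is a minimal sketch of greedy NMS over axis-aligned boxes; the 0.5 IoU threshold is illustrative, and a real MTCNN implementation uses different per-stage thresholds.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2] with x2 > x1, y2 > y1
    scores: (N,) detection confidences
    Returns indices of the kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process highest scores first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box at or above the threshold
        order = order[1:][iou < iou_thresh]
    return keep
```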

Figure 2: The architectures of P-Net, R-Net, and O-Net, where "MP" means max pooling and "Conv" means convolution. The step sizes in convolution and pooling are 1 and 2, respectively.

Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects a large number of false candidates, performs calibration with bounding box regression, and conducts NMS.

Stage 3: This stage is similar to the second stage, but here we aim to identify face regions with more supervision. In particular, the network outputs the positions of five facial landmarks.

3.1.2. CNN Architectures

We use 3×3 filters rather than 5×5 filters to reduce computation while increasing the depth to get better performance. With these improvements, compared to the previous architecture in (Li et al. 2015), we get better performance with less runtime. The CNN architectures are shown in Figure 2. We apply PReLU (He et al. 2015) as the nonlinear activation function after the convolution and fully connected layers (except the output layers).

3.1.3. Training

We leverage three tasks to train our CNN detectors: face/non-face classification, bounding box regression, and facial landmark localization.

1) Face classification: The learning objective is formulated as a two-class classification problem. For each sample x_i, we use the cross-entropy loss:

L_i^{det} = -( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) )    (1)

where p_i is the probability produced by the network that sample x_i is a face, and y_i^{det} \in \{0, 1\} denotes the ground-truth label.

2) Bounding box regression: For each candidate window, we predict the offset between it and the nearest ground truth. The learning objective is formulated as a regression problem, and we employ the Euclidean loss for each sample x_i:

L_i^{box} = \| \hat{y}_i^{box} - y_i^{box} \|_2^2    (2)

where \hat{y}_i^{box} is the regression target obtained from the network and y_i^{box} is the ground-truth coordinate.

3) Facial landmark localization: Similar to the bounding box regression task, facial landmark detection is formulated as a regression problem and we minimize the Euclidean loss:

L_i^{landmark} = \| \hat{y}_i^{landmark} - y_i^{landmark} \|_2^2    (3)

where \hat{y}_i^{landmark} is the facial landmark coordinates obtained from the network and y_i^{landmark} is the ground-truth coordinate for the i-th sample.

4) Multi-source training: Since we employ different tasks in each CNN, there are different types of training images in the learning process, such as face, non-face, and partially aligned face, so some of the loss functions (i.e., Eqs. (1)-(3)) are not used for some samples. The overall learning target can be formulated as:

\min \sum_{i=1}^{N} \sum_{j \in U} \alpha_j \beta_i^j L_i^j    (4)

where U = \{det, box, landmark\}, N is the number of training samples, \alpha_j denotes the task importance, and \beta_i^j \in \{0, 1\} indicates whether task j applies to sample x_i.
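As a sketch of how Eqs. (1)-(4) combine during training, the snippet below computes the weighted multi-source loss for one mini-batch in PyTorch. The tensor layout and the default α weights are illustrative assumptions, not values taken from this paper.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(p, y_det, box_pred, box_gt, lmk_pred, lmk_gt,
                    beta_det, beta_box, beta_lmk,
                    alpha=(1.0, 0.5, 0.5)):
    """Weighted multi-source loss of Eq. (4).

    p:       (N,) face probabilities from the network (post-sigmoid)
    y_det:   (N,) 0/1 face labels
    box_*:   (N, 4) predicted / ground-truth box offsets
    lmk_*:   (N, 10) predicted / ground-truth landmark coords (5 points)
    beta_*:  (N,) 0/1 float indicators selecting which loss applies
             to each sample (Eq. (4)'s sample-type indicator)
    alpha:   task importance weights, a per-stage hyperparameter
    """
    # Eq. (1): cross-entropy for face / non-face classification
    l_det = F.binary_cross_entropy(p, y_det.float(), reduction="none")
    # Eq. (2): Euclidean loss on bounding box regression targets
    l_box = ((box_pred - box_gt) ** 2).sum(dim=1)
    # Eq. (3): Euclidean loss on facial landmark positions
    l_lmk = ((lmk_pred - lmk_gt) ** 2).sum(dim=1)
    # Eq. (4): sum over samples and tasks, masking out unused losses
    total = (alpha[0] * beta_det * l_det
             + alpha[1] * beta_box * l_box
             + alpha[2] * beta_lmk * l_lmk).sum()
    return total
```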

Figure 3: FaceNet model structure.

3.2. FaceNet

FaceNet is adopted for the face recognition stage. It directly trains its output to be a compact 128-D embedding using a triplet-based loss function motivated by LMNN. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment other than scale and translation is performed. FaceNet is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.

3.2.1. End-to-end learning

Instead of using the traditional softmax method for classification learning, FaceNet takes a network layer as a feature and learns an encoding from the image directly into Euclidean space, then performs face recognition, face verification, and face clustering based on this encoding. Setting aside the model details and treating it as a black box (see Figure 3), the most important part of the approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss, which directly reflects what we want to achieve in face verification, recognition, and clustering. Namely, we strive for an embedding f(x), from an image x into a feature space R^d, such that the squared distance between all faces of the same identity, independent of imaging conditions, is small, whereas the squared distance between a pair of face images from different identities is large.

3.2.2. Triplet Loss

The triplet loss is well suited for face verification. The motivation is that the loss in (Sun, Wang, and Tang 2014) encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person and all other faces. This allows the faces of one identity to live on a manifold while still enforcing the distance, and thus discriminability, to other identities.

The embedding is represented by f(x) \in R^d. It embeds an image x into a d-dimensional Euclidean space. Additionally, we constrain this embedding to live on the d-dimensional hypersphere, i.e., \|f(x)\|_2 = 1. This loss is motivated in (Weinberger 2009) in the context of nearest-neighbor classification. Here we want to ensure that an image x_i^a (anchor) of a specific person is closer to all other images x_i^p (positive) of the same person than it is to any image x_i^n (negative) of any other person. This is visualized in Figure 4.

Figure 4: The triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Thus we want

\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha < \| f(x_i^a) - f(x_i^n) \|_2^2    (5)

\forall ( f(x_i^a), f(x_i^p), f(x_i^n) ) \in T    (6)

where \alpha is a margin that is enforced between positive and negative pairs, and T is the set of all possible triplets in the training set, with cardinality N. The loss that is being minimized is then

L = \sum_{i}^{N} [ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha ]_+    (7)

Generating all possible triplets would result in many triplets that are easily satisfied. These triplets would not contribute to the training yet would still be passed through the network, resulting in slower convergence.
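A minimal PyTorch sketch of the triplet loss in Eq. (7); the default margin of 0.2 matches the value reported in the FaceNet paper, but it remains a tunable hyperparameter.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha: float = 0.2):
    """Triplet loss of Eq. (7) on L2-normalized embeddings.

    anchor, positive, negative: (N, d) embedding batches, assumed to lie
    on the unit hypersphere (||f(x)||_2 = 1, as required in Sec. 3.2.2).
    """
    d_pos = ((anchor - positive) ** 2).sum(dim=1)  # ||f(a) - f(p)||^2
    d_neg = ((anchor - negative) ** 2).sum(dim=1)  # ||f(a) - f(n)||^2
    # Hinge [.]_+ : triplets already satisfying the margin contribute zero
    return F.relu(d_pos - d_neg + alpha).sum()

# Embeddings straight from a network should be normalized first, e.g.:
#   emb = F.normalize(net(images), p=2, dim=1)
```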
3.2.3. Triplet Selection

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (5). This means that, given x_i^a, we want to select an x_i^p (hard positive) maximizing \| f(x_i^a) - f(x_i^p) \|_2^2 and, similarly, an x_i^n (hard negative) minimizing \| f(x_i^a) - f(x_i^n) \|_2^2.

It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:

- Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
- Generate triplets online, by selecting the hard positive/negative exemplars from within a mini-batch.

Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We do not have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all-anchor-positive method was more stable and converged slightly faster at the beginning of training. Selecting the hardest negatives can in practice lead to bad local minima early in training; specifically, it can result in a collapsed model (i.e., f(x) = 0). To mitigate this, it helps to select x_i^n such that

\| f(x_i^a) - f(x_i^p) \|_2^2 < \| f(x_i^a) - f(x_i^n) \|_2^2    (8)
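The following is a sketch of online semi-hard negative selection within a mini-batch, following Eq. (8). The batch layout (one anchor-positive pair per row plus a pool of candidate negatives) is an illustrative simplification of what a real mini-batch sampler would provide.

```python
import torch

def semi_hard_negatives(anchor, positive, candidates, alpha: float = 0.2):
    """Pick one semi-hard negative per anchor from a pool of candidates.

    anchor, positive: (N, d) embeddings of anchor-positive pairs
    candidates:       (M, d) embeddings of other identities in the batch
    A negative is 'semi-hard' if it is farther from the anchor than the
    positive (Eq. (8)) but still inside the margin, so it yields a
    nonzero, well-behaved gradient without collapsing the model.
    """
    d_pos = ((anchor - positive) ** 2).sum(dim=1, keepdim=True)   # (N, 1)
    d_neg = torch.cdist(anchor, candidates, p=2) ** 2             # (N, M)
    # Eq. (8): farther than the positive, but within the margin alpha
    semi_hard = (d_neg > d_pos) & (d_neg < d_pos + alpha)
    # Mask out non-semi-hard candidates, then take the hardest remaining
    masked = torch.where(semi_hard, d_neg, torch.full_like(d_neg, float("inf")))
    idx = masked.argmin(dim=1)                                    # (N,)
    # Note: rows with no semi-hard candidate fall back to index 0;
    # a real implementation would handle that case explicitly.
    return candidates[idx]
```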

To sum up, correct triplet selection is crucial for fast convergence. On the one hand, we would like to use small mini-batches, as these tend to improve convergence during stochastic gradient descent (SGD); on the other hand, implementation details make batches of tens to hundreds of exemplars more efficient.

3.3. Data Augmentation

Large-scale datasets are a prerequisite for the successful application of deep neural networks. Image augmentation applies a series of random changes to the training images to generate similar but different training samples, thereby expanding the size of the training dataset.

Another way to view image augmentation is that randomly changing the training samples reduces the model's dependence on particular attributes and improves its generalization ability. For example, we can crop the image in different ways to make the objects of interest appear in different positions, thereby reducing the model's dependence on object position. We can also adjust factors such as contrast to reduce the model's sensitivity to brightness.

To enhance the robustness of the model at prediction time, we apply image augmentation with random operations to the datasets during training. The methods we use are random fixed-ratio cropping, mirror flipping, turning left 45°, turning right 45°, etc. Some samples are shown in Figure 5, and a code sketch of these operations follows Table 1 below.

Figure 5: Enhanced images and the original image. The first row is enhanced by four contrast changes. The second row shows images enhanced with random operations alongside the original image.

4. Experiments

4.1. MTCNN Backbone Networks

Before the experiments, we noted that the performance of multiple CNNs might be limited by the following facts: (1) some filters in the convolution layers lack diversity, which may limit their discriminative ability; (2) face detection is a challenging binary classification task, so it may need fewer filters per layer. To this end, we reduce the number of filters and change the 5×5 filters to 3×3 filters to reduce computation while increasing the depth to get better performance. With these improvements, compared to the previous structure in (Li et al. 2015), we get better performance with less runtime, as shown in Table 1.

Table 1: Comparison of speed and validation accuracy of our CNNs and the previous CNNs in (Li et al. 2015).

Group   | CNN          | Forward speed | Validation accuracy
Group 1 | previous CNN | —             | 93.10%
Group 1 | our CNN      | —             | 93.70%
Group 2 | previous CNN | …38s          | 93.80%
Group 2 | our CNN      | 0.466s        | 94.50%
Group 3 | previous CNN | 3.601s        | 92.10%
Group 3 | our CNN      | 1.411s        | 93.50%

So in the MTCNN part, with the cascade structure we obtain better accuracy with less runtime.
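The augmentation recipe in Sec. 3.3 maps naturally onto torchvision's transform pipeline. The following is a minimal sketch under the assumption that torchvision is used; the crop size, rotation bound, and jitter strength are illustrative, and the paper's discrete left/right 45° turns are approximated here by a continuous random rotation.

```python
import torchvision.transforms as T

# A sketch of the Sec. 3.3 augmentations: random fixed-ratio cropping,
# mirror flipping, rotation up to 45 degrees, and contrast changes.
train_transform = T.Compose([
    T.RandomResizedCrop(160, ratio=(1.0, 1.0)),  # fixed-ratio random crop
    T.RandomHorizontalFlip(p=0.5),               # mirror flipping
    T.RandomRotation(degrees=45),                # rotate within [-45°, +45°]
    T.ColorJitter(contrast=0.4),                 # random contrast changes
    T.ToTensor(),
])

# Applied per sample when loading training images, e.g.:
#   img_tensor = train_transform(pil_image)
```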
