Improving Generalization by Controlling Label-Noise Information in Neural Network Weights

Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan

Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292. Correspondence to: Hrayr Harutyunyan <hrayrh@isi.edu>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. Standard regularization techniques such as dropout, weight decay or data augmentation sometimes help, but do not prevent this behavior. If one considers neural network weights as random variables that depend on the data and stochasticity of training, the amount of memorized information can be quantified with the Shannon mutual information between weights and the vector of all training labels given inputs, I(w; y | x). We show that for any training algorithm, low values of this term correspond to reduction in memorization of label-noise and better generalization bounds. To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. We illustrate the effectiveness of our approach on versions of MNIST, CIFAR-10, and CIFAR-100 corrupted with various noise models, and on a large-scale dataset Clothing1M that has noisy labels.

[Figure 1 (plot omitted): train accuracy on noisy labels and test accuracy on clean labels over epochs, for no regularization, dropout 0.5, weight decay 0.003, and the proposed method.] Figure 1: Neural networks tend to memorize labels when trained with noisy labels (80% noise in this case), even when dropout or weight decay are applied. Our training approach limits label-noise information in neural network weights, avoiding memorization of labels and improving generalization. Please refer to Sec. 2.1 for more details.

1. Introduction

Supervised learning with deep neural networks has shown great success in the last decade. Despite having millions of parameters, modern neural networks generalize surprisingly well. However, their training is particularly susceptible to noisy labels, as shown by Zhang et al. (2016) in their analysis of generalization error. In the presence of noisy or incorrect labels, networks start to memorize the training labels, which degrades the generalization performance (Chen et al., 2019). At the extreme, standard architectures have the capacity to achieve 100% classification accuracy on training data, even when labels are assigned at random (Zhang et al., 2016). Furthermore, standard explicit or implicit regularization techniques such as dropout, weight decay or data augmentation do not directly address nor completely prevent label memorization (Zhang et al., 2016; Arpit et al., 2017).

Poor generalization due to label memorization is a significant problem because many large, real-world datasets are imperfectly labeled. Label noise may be introduced when building datasets from unreliable sources of information or using crowd-sourcing resources like Amazon Mechanical Turk. A practical solution to the memorization problem is likely to be algorithmic, as sanitizing labels in large datasets is costly and time-consuming.
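As a concrete reference for this noise model, the uniform corruption used in our experiments (e.g., the 80% setting of Fig. 1) can be simulated as in the following minimal sketch. The corrupt_labels helper and its NumPy implementation are illustrative only, not the paper's released code.

    import numpy as np

    def corrupt_labels(y, num_classes, p, seed=0):
        """Flip each label with probability p to a uniformly random incorrect class."""
        rng = np.random.default_rng(seed)
        y = y.copy()
        flip = rng.random(len(y)) < p
        # Adding an offset in {1, ..., num_classes - 1} modulo num_classes
        # guarantees that every corrupted label differs from the original one.
        offsets = rng.integers(1, num_classes, size=int(flip.sum()))
        y[flip] = (y[flip] + offsets) % num_classes
        return y

    # Example: the 80% uniform-noise setting of Fig. 1 with ten classes.
    y_clean = np.arange(10)
    y_noisy = corrupt_labels(y_clean, num_classes=10, p=0.8)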
Existing approaches for addressing the problem of label-noise and generalization performance include deriving robust loss functions (Natarajan et al., 2013; Ghosh et al., 2017; Zhang & Sabuncu, 2018; Xu et al., 2019), loss correction techniques (Sukhbaatar et al., 2014; Xiao et al., 2015; Goldberger & Ben-Reuven, 2017; Patrini et al., 2017), re-weighting samples (Jiang et al., 2017; Ren et al., 2018), detecting incorrect samples and relabeling them (Reed et al., 2014; Tanaka et al., 2018; Ma et al., 2018), and employing two networks that select training examples for each other (Han et al., 2018; Yu et al., 2019).

We propose an information-theoretic approach that directly addresses the root of the problem. If a classifier is able to correctly predict a training label that is actually random, it must have somehow stored information about this label in the parameters of the model. To quantify this information, Achille & Soatto (2018) consider weights as a random variable, w, that depends on stochasticity in training data and parameter initialization. The entire training dataset is considered a random variable consisting of a vector of inputs, x, and a vector of labels for each input, y. The amount of label memorization is then given by the Shannon mutual information between weights and labels conditioned on inputs, I(w; y | x). Achille & Soatto (2018) show that this term appears in a decomposition of the commonly used expected cross-entropy loss, along with three other individually meaningful terms. Surprisingly, cross-entropy rewards large values of I(w; y | x), which may promote memorization if labels contain information beyond what can be inferred from x. Such a result highlights that, in addition to the network's representational capabilities, the loss function, or more generally the learning algorithm, plays an important role in memorization. To this end, we wish to study the utility of limiting I(w; y | x), and how it can be used to modify training algorithms to reduce memorization.

Our main contributions towards this goal are as follows: 1) We show that low values of I(w; y | x) correspond to reduction in memorization of label-noise, and lead to better generalization gap bounds. 2) We propose training methods that control memorization by regularizing label-noise information in weights. When the training algorithm is a variant of stochastic gradient descent, one can achieve this by controlling label-noise information in gradients. A promising way of doing this is through an additional network that tries to predict the classifier gradients without using label information. We experiment with two training procedures that incorporate gradient prediction in different ways: one which uses the auxiliary network to penalize the classifier, and another which uses predicted gradients to train it. In both approaches, we employ a regularization that penalizes the L2 norm of predicted gradients to control their capacity. The latter approach can be viewed as a search over training algorithms, as it implicitly looks for a loss function that balances training performance with label memorization. 3) Finally, we show that the auxiliary network can be used to detect incorrect or misleading labels. To illustrate the effectiveness of the proposed approaches, we apply them on corrupted versions of MNIST, CIFAR-10, and CIFAR-100 with various label noise models, and on the Clothing1M dataset, which already contains noisy labels. We show that methods based on gradient prediction yield drastic improvements over standard training algorithms (like cross-entropy loss), and outperform competitive approaches designed for learning with noisy labels.

2. Label-Noise Information in Weights

We begin by formally introducing a measure of label-noise information in weights, and discuss its connections to memorization and generalization.
Throughout the paper we use several information-theoretic quantities, such as entropy, H(X) = -E[log p(x)]; mutual information, I(X; Y) = H(X) + H(Y) - H(X, Y); Kullback-Leibler divergence, KL(p(x) || q(x)) = E_{x~p(x)}[log(p(x)/q(x))]; and their conditional variants (Cover & Thomas, 2006).

Consider a setup in which a labeled dataset, S = (x, y), for data x = {x^{(i)}}_{i=1}^n and categorical labels y = {y^{(i)}}_{i=1}^n, is generated from a distribution p(x, y). A training algorithm for learning weights w of a fixed probabilistic classifier f(y | x, w) can be denoted as a conditional distribution A(w | S). Given any training algorithm A, its training performance can be measured using the expected cross-entropy:

    H_{p,f}(y | x, w) = E_S E_{w|S} \left[ -\sum_{i=1}^n \log f(y^{(i)} | x^{(i)}, w) \right].

Achille & Soatto (2018) present a decomposition of this expected cross-entropy, which reduces to the following when the data generating process is fixed:

    H_{p,f}(y | x, w) = H(y | x) - \overbrace{I(w; y | x)}^{\text{memorizing label-noise}} + E_{x,w}\left[ KL\left( p(y | x) \,\|\, f(y | x, w) \right) \right].    (1)

The problem of minimizing this expected cross-entropy is equivalent to selecting an appropriate training algorithm. If the labels contain information beyond what can be inferred from inputs (meaning non-zero H(y | x)), such an algorithm may do well by memorizing the labels through the second term of (1). Indeed, minimizing the empirical cross-entropy loss, A^{ERM}(w | S) = \delta(w - w^*), where w^* \in \arg\min_w \sum_{i=1}^n -\log f(y^{(i)} | x^{(i)}, w), does exactly that (Zhang et al., 2016).

2.1. Decreasing I(w; y | x) Reduces Memorization

To demonstrate that I(w; y | x) is directly linked to memorization, we prove that any algorithm with small I(w; y | x) overfits less to label-noise in the training set.

Theorem 2.1. Consider a dataset S = (x, y) of n i.i.d. samples, x = {x^{(i)}}_{i=1}^n and y = {y^{(i)}}_{i=1}^n, where the domain of labels is a finite set, Y. Let A(w | S) be any training algorithm, producing weights for a possibly stochastic classifier f(y | x, w). Let \hat{y}^{(i)} denote the prediction of the classifier on the i-th example and let e^{(i)} = 1\{\hat{y}^{(i)} \neq y^{(i)}\} be a random variable corresponding to predicting y^{(i)} incorrectly. Then, the following inequality holds:

Improving Generalization by Controlling Label-Noise Information in Neural Network WeightsE"nXi 1e(i)#H(y x)PnI(w; y x)log ( Y 1)i 1H(e(i) ).This result establishes a lower bound on the expected number of prediction errors on the training set, which increasesas I(w; y x) decreases. For example, consider a corrupted version of the MNIST dataset where each label ischanged with probability 0.8 to a uniformly random incorrect label. By the above bound, every algorithm for whichI(w; y x) 0 will make at least 80% prediction errorson the training set in expectation. In contrast, if the weightsretain 1 bit of label-noise information per example, the classifier will make at least 40.5% errors in expectation. Theproof of Thm. 2.1 uses Fano’s inequality and is presented inthe supplementary material (Sec. A.1). Below we discussthe dependence of error probability on I(w; y x). Pn (i) Remark 1. If we let k Y and r n1 Ei 1 edenote the expected training error rate, then by Jensen’sinequality we can simplify Thm. 2.1 as follows:rH(y (1) x(1) )I(w; y x)/nlog(k 1)H(y (1) x(1) )I(w; y x)/nlog(k 1)1nPni 1H(r)H(e(i) ).(2)Solving this inequality for r is challenging. One can simplify the right hand side further by bounding H(e(1) ) 1(assuming that entropies are measured in bits). However,this will loosen the bound. Alternatively, we can find thesmallest r0 for which (2) holds and claim that r r0 .Remark 2. If Y 2, then log( Y 1) 0, puttingwhich in (13) of supplementary leads to:H(r)H(y (1) x(1) )I(w; y x)/n.Remark 3.When we have uniform label noise wherea label is incorrect with probability p (0 p k k 1 ) andI(w; y x) 0, the bound of (2) is tight, i.e., impliesthat rp. To see this, we note that H(y (1) x(1) ) H(p) p log(k 1), putting which in (2) gives us:H(p) H(r).log(k 1)(3)Therefore, when r p, the inequality holds, implying thatr0 p. To show that r0 p, we need to show that for any0 r p, the (3) does not hold. Let r 2 [0, p) and assumerH(p) p log(k 1)log(k 1)H(r) p 0.8r0 (the lower bound of r)on the i-th example and let e(i) 1{by (i) 6 y (i) } be a random variable corresponding to predicting y (i) incorrectly.Then, the following inequality holds:pppp0.60.20.40.60.80.40.20.0012I(w; y x)/n3Figure 2: The lower bound r0 on the rate of training errorsr Thm. 2.1 establishes for varying values of I(w; y x), inthe case when label noise is uniform and probability of alabel being incorrect is p.that (3) holds. ThenrH(p) H(r)log(k 1)H(p) (H(p) (r p)H 0 (p))p log(k 1)(r p) log(k 1)p 2p r.log(k 1)p (4)The second line above follows from concavity of H(x); andthe third line follows from the fact that H 0 (p) log(k1) when 0 p (k 1)/k. Eq. (4) directly contradictswith r p. Therefore, Eq. (3) cannot hold for any r p.When I(w; y x) 0, we can find the smallest r0 by anumerical method. Fig. 2 plots r0 vs I(w; y x) when thelabel noise is uniform. When the label-noise is not uniform,the bound of (2) becomes loose as Fano’s inequality becomes loose. We leave the problem of deriving better lowerbounds in such cases for a future work.Thm. 2.1 provides theoretical guarantees that memorization of noisy labels is prevented when I(w; y x) is small,in contrast to standard regularization techniques – such asdropout, weight decay, and data augmentation – which onlyslow it down (Zhang et al., 2016; Arpit et al., 2017). Todemonstrate this empirically, we compare an algorithm thatcontrols I(w; y x) (presented in Sec. 3) against theseregularization techniques on the aforementioned corruptedMNIST setup. We see in Fig. 
Thm. 2.1 provides theoretical guarantees that memorization of noisy labels is prevented when I(w; y | x) is small, in contrast to standard regularization techniques, such as dropout, weight decay, and data augmentation, which only slow it down (Zhang et al., 2016; Arpit et al., 2017). To demonstrate this empirically, we compare an algorithm that controls I(w; y | x) (presented in Sec. 3) against these regularization techniques on the aforementioned corrupted MNIST setup. We see in Fig. 1 that explicitly preventing memorization of label-noise information leads to optimal training performance (20% training accuracy) and good generalization on a non-corrupted validation set. Other approaches quickly exceed 20% training accuracy by incorporating label-noise information, and generalize poorly as a consequence. The classifier here is a fully connected neural network with 4 hidden layers, each having 512 ReLU units.

The rates of dropout and weight decay were selected according to the performance on a validation set.

2.2. Decreasing I(w; y | x) Improves Generalization

The information that weights contain about a training dataset S has previously been linked to generalization (Xu & Raginsky, 2017). The following bound relates the expected difference between train and test performance to the mutual information I(w; S).

Theorem 2.2. (Xu & Raginsky, 2017) Suppose \ell(\hat{y}, y) is a loss function such that \ell(f_w(x), y) is a \sigma-sub-Gaussian random variable for each w. Let S = (x, y) be the training set, A(w | S) be the training algorithm, and (\bar{x}, \bar{y}) be a test sample independent from S and w. Then the following holds:

    E\left[ \ell(f_w(\bar{x}), \bar{y}) - \frac{1}{n} \sum_{i=1}^n \ell\left( f_w(x^{(i)}), y^{(i)} \right) \right] \leq \sqrt{ \frac{ 2\sigma^2 I(w; S) }{ n } }.    (5)

For good test performance, learning algorithms need to have both a small generalization gap and good training performance. The latter may require retaining more information about the training set, meaning there is a natural conflict between increasing training performance and decreasing the generalization gap bound of (5). Furthermore, information in weights can be decomposed as follows: I(w; S) = I(w; x) + I(w; y | x). We claim that one needs to prioritize reducing I(w; y | x) over I(w; x) for the following reason. When noise is present in the training labels, fitting this noise implies a non-zero value of I(w; y | x), which grows linearly with the number of samples n. In such cases, the generalization gap bound of (5) becomes a constant and does not improve as n increases. To get meaningful generalization bounds via (5), one needs to limit I(w; y | x). We hypothesize that for efficient learning algorithms, this condition might also be sufficient.

3. Methods Limiting Label Information

We now consider how to design training algorithms that control I(w; y | x). We assume f(y | x, w) = Multinoulli(y; s(a)), with a being the output of a neural network h_w(x), and s(\cdot) the softmax function. We consider the case when h_w(x) is trained with a variant of stochastic gradient descent for T iterations. The inputs and labels of a mini-batch at iteration t are denoted by x_t and y_t respectively, and are selected using a deterministic procedure (such as cycling through the dataset, or using pseudo-randomness). Let w_0 denote the weights after initialization, and w_t the weights after iteration t. Let L(w; x, y) be some classification loss function (e.g., cross-entropy loss) and g_t^L := \nabla_w L(w_{t-1}; x_t, y_t) be the gradient at iteration t. Let g_t denote the gradients used to update the weights, possibly different from g_t^L. The update rule computes w_t from w_0 and the gradients g_{1:t}; we denote the final weights w_T simply by w for convenience.

To limit I(w; y | x), the following sections will discuss two approximations which relax the computational difficulty while still providing meaningful bounds: 1) first, we show that the information in weights can be replaced by information in the gradients; 2) we introduce a variational bound on the information in gradients. The bound employs an auxiliary network that predicts gradients of the original loss without label information. We then explore two ways of incorporating predicted gradients: (a) using them in a regularization term for gradients of the original loss, and (b) using them to train the classifier.
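The following toy sketch (ours; a linear classifier on random data) makes this notation concrete: g_t^L is the gradient of the loss at iteration t, while g_t, here simply set equal to g_t^L, is the gradient actually applied, which Secs. 3.1 and 3.2 will replace with a label-noise-limited estimate.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    x_all = torch.randn(64, 8)            # toy inputs
    y_all = torch.randint(0, 4, (64,))    # toy labels, 4 classes
    w = torch.zeros(8, 4, requires_grad=True)
    lr = 0.1

    for t in range(1, 11):                # T = 10 iterations
        start = ((t - 1) * 16) % 64       # deterministic mini-batch selection
        x_t, y_t = x_all[start:start + 16], y_all[start:start + 16]
        loss = F.cross_entropy(x_t @ w, y_t)    # L(w_{t-1}; x_t, y_t)
        (g_L,) = torch.autograd.grad(loss, w)   # g_t^L, the loss gradient
        g_t = g_L   # placeholder; Secs. 3.1-3.2 substitute a label-free estimate
        with torch.no_grad():
            w -= lr * g_t                       # w_t computed from w_{t-1} and g_t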
3.1. Penalizing Information in Gradients

Looking at (1), it is tempting to add I(w; y | x) as a regularization to the H_{p,f}(y | x, w) objective and minimize over all training algorithms:

    \min_{A(w | S)} H_{p,f}(y | x, w) + I(w; y | x).    (6)

This will become equivalent to minimizing E_{x,w}\left[ KL\left( p(y | x) \,\|\, f(y | x, w) \right) \right]. Unfortunately, the optimization problem of (6) is hard to solve for two major reasons. First, the optimization is over training algorithms (rather than over the weights of a classifier, as in the standard machine learning setup). Second, the penalty I(w; y | x) is hard to compute/approximate.

To simplify the problem of (6), we relate information in weights to information in gradients as follows:

    I(w; y | x) \leq I(g_{1:T}; y | x) \leq \sum_{t=1}^T I(g_t; y | x, g_{<t}),    (7)

where g_{1:T} and g_{<t} are shorthands for the sets {g_1, ..., g_T} and {g_1, ..., g_{t-1}} respectively. Hereafter, we focus on constraining I(g_t; y | x, g_{<t}) at each iteration. Our task becomes choosing a loss function such that I(g_t; y | x, g_{<t}) is small and f(y | x, w_t) is a good classifier. One key observation is that if our task is to minimize label-noise information in gradients, it may be helpful to consider gradients with respect to the last layer only and compute the remaining gradients using back-propagation. As these steps of back-propagation do not use labels, by the data processing inequality, subsequent gradients would have at most as much label information as the last-layer gradient.
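To make the last-layer observation concrete, here is a minimal sketch (ours) using the gradient with respect to the logits, which for batch-averaged softmax cross-entropy equals (softmax(a) - onehot(y)) / batch_size; the network below is an arbitrary illustrative choice.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    body = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU())
    head = torch.nn.Linear(32, 4)
    x, y = torch.randn(16, 8), torch.randint(0, 4, (16,))

    a = head(body(x))   # logits
    # Labels enter only through this logits-gradient of the cross-entropy loss.
    g_a = (F.softmax(a, dim=1) - F.one_hot(y, 4).float()) / x.shape[0]
    # Back-propagating g_a to the lower layers uses no label information, so by
    # the data processing inequality the lower-layer gradients carry at most as
    # much label-noise information as g_a itself.
    a.backward(gradient=g_a)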

To simplify information-theoretic quantities, we add a small independent Gaussian noise to the gradients of the original loss: \tilde{g}_t^L := g_t^L + \xi_t, where \xi_t \sim N(0, \sigma^2 I) and \sigma is small enough to have no significant effect on training (less than 10^{-9} is fine). With this convention, we formulate the following regularized objective function:

    \min_w L(w; x_t, y_t) + \lambda I(\tilde{g}_t^L; y | x, g_{<t}),    (8)

where \lambda \geq 0 is a regularization coefficient. The term I(\tilde{g}_t^L; y | x, g_{<t}) is a function of x and g_{<t}, or more explicitly, a function \psi(w_{t-1}; x_t) of x_t and w_{t-1}. Computing this function would allow the optimization of (8) through gradient descent: g_t = g_t^L + \xi_t + \lambda \nabla_w \psi(w_{t-1}; x_t). Importantly, label-noise information is equal in both g_t and \tilde{g}_t^L, as the gradient from the regularization is constant given x and g_{<t}:

    I(g_t; y | x, g_{<t}) = I(g_t^L + \xi_t + \lambda \nabla_w \psi(w_{t-1}; x_t); y | x, g_{<t})
                          = I(g_t^L + \xi_t; y | x, g_{<t})
                          = I(\tilde{g}_t^L; y | x, g_{<t}).

Therefore, by minimizing I(\tilde{g}_t^L; y | x, g_{<t}) in (8) we minimize I(g_t; y | x, g_{<t}), which is used in (7) to upper bound I(w; y | x). We rewrite this regularization in terms of entropy and discard the constant term, H(\xi_t):

    I(\tilde{g}_t^L; y | x, g_{<t}) = H(\tilde{g}_t^L | x, g_{<t}) - H(\tilde{g}_t^L | x, y, g_{<t})
                                    = H(\tilde{g}_t^L | x, g_{<t}) - H(\xi_t).    (9)

3.2. Variational Bounds on Gradient Information

The first term in (9) is still challenging to compute, as we typically have only one sample from the unknown distribution p(y_t | x_t). Nevertheless, we can upper bound it with the cross-entropy H_{p,q} = E_{\tilde{g}_t^L}\left[ -\log q_\phi(\tilde{g}_t^L | x, g_{<t}) \right], where q_\phi(\cdot | x, g_{<t}) is a variational approximation for p(\tilde{g}_t^L | x, g_{<t}):

    H(\tilde{g}_t^L | x, g_{<t}) \leq E\left[ -\log q_\phi(\tilde{g}_t^L | x, g_{<t}) \right].

This bound is correct when \phi is a constant or a random variable that depends only on x. With this upper bound, (8) reduces to:

    \min_{w, \phi} L(w; x_t, y_t) - \lambda E_{\tilde{g}_t^L}\left[ \log q_\phi(\tilde{g}_t^L | x, g_{<t}) \right].    (10)
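One step of (10) can be sketched as follows (ours; it assumes a Gaussian q_\phi with fixed variance sigma_q whose mean is predicted by an auxiliary linear head from detached features, omits the tiny noise \xi_t, and adds the L2 penalty on predicted gradients mentioned in Sec. 1; the architecture and hyperparameters are illustrative).

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    B, D, H, K = 16, 8, 32, 4
    body = torch.nn.Sequential(torch.nn.Linear(D, H), torch.nn.ReLU())
    head = torch.nn.Linear(H, K)
    aux = torch.nn.Linear(H, K)    # predicts the logits-gradient without labels
    params = list(body.parameters()) + list(head.parameters()) + list(aux.parameters())
    opt = torch.optim.SGD(params, lr=0.1)
    lam, sigma_q, beta = 0.1, 1.0, 0.01

    x, y = torch.randn(B, D), torch.randint(0, K, (B,))
    z = body(x)
    a = head(z)
    g = F.softmax(a, dim=1) - F.one_hot(y, K).float()   # differentiable in w
    mu = aux(z.detach())            # label-free prediction of g
    # -log q_phi(g | x) for a fixed-variance Gaussian, up to a constant:
    # minimizing over w makes gradients predictable from x alone (penalizing
    # the classifier), and over phi it fits the variational approximation.
    nll = ((g - mu) ** 2).sum(dim=1).mean() / (2 * sigma_q ** 2)
    loss = F.cross_entropy(a, y) + lam * nll + beta * (mu ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

Jointly updating w and \phi with a single optimizer mirrors the minimization over (w, \phi) in (10); the second variant discussed in Sec. 1 would instead apply mu itself as the logits-gradient used to train the classifier.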
