A Neyman-Pearson Approach To Statistical Learning


A Neyman-Pearson Approach to Statistical Learning

Clayton Scott∗ and Robert Nowak†
Technical Report TREE 0407
Department of Electrical and Computer Engineering, Rice University
Email: cscott@rice.edu, nowak@engr.wisc.edu
September 19, 2004. Revised February 2, 2005.

∗ Department of Statistics, Rice University, 6100 Main St, MS-138, Houston, TX 77005.
† Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engineering Dr, Madison, WI 53706.

Abstract

The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any $\alpha > 0$, the Neyman-Pearson lemma specifies the most powerful test of size $\alpha$, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed training examples from each hypothesis. Building on a "fundamental lemma" of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

1 Introduction

In most approaches to binary classification, classifiers are designed to minimize the probability of error. However, in many applications it is more important to avoid one kind of error than the other. Applications where this situation arises include fraud detection, spam filtering, machine monitoring, target recognition, and disease diagnosis, to name a few. In this paper we investigate one approach to classification in this context inspired by classical Neyman-Pearson (NP) hypothesis testing.

In the spirit of statistical learning theory, we develop the theoretical foundations of a Neyman-Pearson approach to learning classifiers from labeled training data. We show that several results and concepts from standard learning theory have counterparts in the NP setting. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given

under which NP-SRM leads to strong universal consistency. Here consistency involves the (almost sure) convergence of both the learned miss and false alarm probabilities to the optimal probabilities given by the Neyman-Pearson lemma. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

1.1 Motivation

In the Neyman-Pearson (NP) theory of binary hypothesis testing, one must decide between a null hypothesis and an alternative hypothesis. A level of significance $\alpha$ (called the size of the test) is imposed on the false alarm probability (type I error), and one seeks a test that satisfies this constraint while minimizing the miss probability (type II error), or equivalently maximizing the detection probability (power). The NP lemma specifies necessary and sufficient conditions for the most powerful test of size $\alpha$, provided the distributions under the two hypotheses are known, or (in special cases) the likelihood ratio is a monotonic function of an unknown parameter [1, 2, 3] (see [4] for an interesting overview of the history and philosophy of NP testing). We are interested in extending the NP paradigm to the situation where one has no knowledge of the underlying distributions except for independent and identically distributed training examples drawn from each hypothesis. We use the language of classification, whereby a test is a classifier and each hypothesis corresponds to a particular class.

To motivate the Neyman-Pearson approach to learning classifiers, consider the problem of classifying tumors using gene expression microarrays [5], which are typically used as follows: First, identify several patients whose status for a particular form of cancer is known. Next, collect cell samples from the appropriate tissue in each patient. Then, conduct a microarray experiment to assess the relative abundance of various genes in each of the subjects. Finally, use this "training data" to build a classifier that can in principle be used to diagnose future patients based on their gene expression profiles.

The dominant approach to classifier design in microarray studies has been to minimize the probability of error (see, for example, [6] and references therein). Yet it is clear that failing to detect a malignant tumor has drastically different consequences than erroneously flagging a benign tumor. In the Neyman-Pearson approach, the diagnostic procedure involves setting a level of significance $\alpha$, an upper bound on the fraction of healthy patients that may be unnecessarily sent for treatment or further screening, and constructing a classifier to minimize the number of missed true cancer cases.

Further motivation for the NP paradigm comes from a comparison with cost-sensitive classification (CSC). Cost-sensitive classification (also called cost-sensitive learning) is another approach to handling disparate kinds of errors in classification (see [7, 8, 9, 10] and references therein). Following classical Bayesian decision theory, CSC modifies the standard '0-1' loss function to a weighted Bayes cost. Cost-sensitive classifiers assume the relative costs for different classes are known. Moreover, these algorithms assume estimates (either implicit or explicit) of the a priori class probabilities can be obtained from the training data or some other source.

Cost-sensitive and NP classification are fundamentally different approaches that have differing pros and cons. In some situations it may be difficult to (objectively) assign costs. For instance, how much greater is the cost of failing to detect a malignant tumor than the cost of erroneously flagging a benign tumor? The two approaches are similar in that the user essentially has one free parameter to set. In CSC, this free parameter is the ratio of costs of the two class errors, while in NP classification it is the false alarm threshold $\alpha$. The selection of costs does not directly provide control on $\alpha$, and conversely setting $\alpha$ does not directly translate into costs.

The lack of precise knowledge of the underlying distributions makes it impossible to precisely relate costs with $\alpha$. The choice of method will thus depend on which parameter can be set more realistically, which in turn depends on the particular application. For example, consider a network intrusion detector that monitors network activity and flags events that warrant a more careful inspection. The value of $\alpha$ may be set so that the number of events flagged for further scrutiny matches the resources available for processing and analyzing them.

From another perspective, however, the NP approach seems to have a clear advantage over CSC. Namely, Neyman-Pearson classification does not assume knowledge of, or about, a priori class probabilities. Cost-sensitive classification can be misleading if a priori class probabilities are not accurately reflected by their sample-based estimates. Consider, for example, the case where one class has very few representatives in the training set simply because it is very expensive to gather that kind of data. This situation arises, for example, in machine fault detection, where machines must often be induced to fail (usually at high cost) to gather the necessary training data. Here the fraction of "faulty" training examples is not indicative of the true probability of a fault occurring. In fact, it could be argued that in most applications of interest, class probabilities differ between training and test data. Returning to the cancer diagnosis example, class frequencies of diseased and normal patients at a cancer research institution, where training data is likely to be gathered, in all likelihood do not reflect the disease frequency in the population at large.

1.2 Previous Work on Neyman-Pearson Classification

Although Neyman-Pearson classification has been studied previously from an empirical perspective [11], the theoretical underpinnings were apparently first studied by Cannon, Howse, Hush and Scovel [12]. They give an analysis of a constrained form of empirical risk minimization (ERM) that we call NP-ERM. The present work builds on their theoretical foundations in several respects. First, using different bounding techniques, we derive predictive error bounds for NP-ERM that are non-trivial for substantially smaller sample sizes. Second, while Cannon et al. consider only learning from fixed Vapnik-Chervonenkis (VC) classes, we introduce a constrained form of structural risk minimization, NP-SRM, that automatically balances model complexity and training error, a feature necessary for consistency. Third, assuming mild regularity conditions on the underlying distribution, we derive rates of convergence for NP-SRM as realized by a certain family of decision trees called dyadic decision trees. Finally, we present exact and computationally efficient algorithms for implementing NP-SRM for histograms and dyadic decision trees.

In a separate paper, Cannon et al. [13] consider NP-ERM over a data-dependent collection of classifiers and are able to bound the estimation error in this case as well. They also provide an algorithm for NP-ERM in some simple cases involving linear and spherical classifiers, and present an experimental evaluation.
To our knowledge the present work is the third study to consider a NP approach to statistical learning, the second to consider practical algorithms with performance guarantees, and the first to consider model selection, consistency, and rates of convergence.

1.3 Notation

We focus exclusively on binary classification, although extensions to multi-class settings are possible. Let $\mathcal{X}$ be a set and let $Z = (X, Y)$ be a random variable taking values in $\mathcal{Z} = \mathcal{X} \times \{0, 1\}$. The variable $X$ corresponds to the observed signal (pattern, feature vector) and $Y$ is the class label associated with $X$. In classical Neyman-Pearson testing, $Y = 0$ corresponds to the null hypothesis.

A classifier is a Borel measurable function $h : \mathcal{X} \to \{0, 1\}$ mapping signals to class labels. In standard classification, the performance of $h$ is measured by the probability of error, $R(h) = P(h(X) \ne Y)$. Here $P$ denotes the probability measure for $Z$. We will focus instead on the false alarm and miss probabilities, denoted by

$$R_j(h) = P_{X \mid Y = j}(h(X) \ne j)$$

for $j = 0$ and $j = 1$, respectively. Note that $R(h) = \pi_0 R_0(h) + \pi_1 R_1(h)$, where $\pi_j = P_Y(Y = j)$ is the (unknown) a priori probability of class $j$. The false-alarm probability is also known as the size, while one minus the miss probability is known as the power.

Let $\alpha \in [0, 1]$ be a user-specified level of significance or false alarm threshold. In Neyman-Pearson testing one seeks the classifier $g^*$ minimizing $R_1(h)$ over all $h$ such that $R_0(h) \le \alpha$. In words, $g^*$ is the most powerful test (classifier) of size $\alpha$. If $f_0$ and $f_1$ are the conditional densities of $X$ (with respect to a measure $\nu$) corresponding to classes 0 and 1 respectively, then the Neyman-Pearson lemma [1] states that $g^*(x) = I_{\{\Lambda(x) > \eta\}}$. Here $I$ denotes the indicator function, $\Lambda(x) = f_1(x)/f_0(x)$ is the likelihood ratio, and $\eta$ is defined implicitly by the integral equation $\int_{\Lambda > \eta} f_0 \, d\nu = \alpha$. Thus, when $f_0$ and $f_1$ are known (or in certain special cases where the likelihood ratio is a monotonic function of an unknown parameter) the Neyman-Pearson lemma delivers the optimal classifier.

In this paper we are interested in the case where our only information about $f_j$ is a finite training sample. Let $Z^n = \{(X_i, Y_i)\}_{i=1}^n \in \mathcal{Z}^n$ be a collection of $n$ independent and identically distributed (IID) samples of $Z = (X, Y)$. A learning algorithm (or learner for short) is a mapping $\hat{h}_n : \mathcal{Z}^n \to \mathcal{H}(\mathcal{X}, \mathcal{Y})$, where $\mathcal{H}(\mathcal{X}, \mathcal{Y})$ is the set of all classifiers.¹ In words, the learner $\hat{h}_n$ is a rule for selecting a classifier based on a training sample. When a training sample $Z^n$ is given, we also use $\hat{h}_n$ to denote the classifier produced by the learner.

In classical NP theory it is possible to impose absolute bounds on the false alarm and miss probabilities. When learning from training data, however, one cannot make such guarantees because of the dependence on the random training sample. Unfavorable training samples, however unlikely, can lead to arbitrarily poor performance. Therefore, bounds on $R_0$ and $R_1$ can only be shown to hold with high probability or in expectation. Formally, let $P^n$ denote the product measure on $\mathcal{Z}^n$ induced by $P$, and let $E^n$ denote expectation with respect to $P^n$. Bounds on the false alarm and miss probabilities must be stated in terms of $P^n$ and $E^n$ (see Section 1.4 below).

We investigate classifiers defined in terms of the following sample-dependent quantities. For $j = 0, 1$, let

$$n_j = \sum_{i=1}^n I_{\{Y_i = j\}}$$

be the number of samples from class $j$. Let

$$\hat{R}_j(h) = \frac{1}{n_j} \sum_{i : Y_i = j} I_{\{h(X_i) \ne j\}}$$

denote the empirical false alarm and miss probabilities, corresponding to $j = 0$ and $j = 1$, respectively. Given a class of classifiers $H$, define $H_0 = \{h \in H : R_0(h) \le \alpha\}$ and $h^* = \arg\min\{R_1(h) : h \in H_0\}$.

¹When $X$ has a discrete distribution, one can introduce randomized tests to achieve slightly higher power. Otherwise the false alarm constraint may not be satisfied with equality. In this paper we do not explicitly treat randomized tests. Thus, $h^*$ and $g^*$ should be understood as being optimal with respect to classes of non-randomized tests/classifiers. We thus implicitly assume that $\alpha$ is such that the false alarm constraint can be satisfied with equality. Incorporating randomized tests into our theoretical results would be straightforward, requiring only a few minor modifications to our current arguments.
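For concreteness, the following minimal Python sketch (an editorial illustration, not part of the original report; the name empirical_errors and its interface are hypothetical) computes the class counts $n_j$ and the empirical error rates $\hat{R}_0(h)$, $\hat{R}_1(h)$ from a training sample:

```python
import numpy as np

def empirical_errors(h, X, y):
    """Empirical false alarm and miss probabilities (R^_0(h), R^_1(h)).

    h : callable mapping an array of signals to predicted labels in {0, 1}
    X : (n, d) array of observed signals
    y : (n,) array of class labels in {0, 1}
    """
    yhat = h(X)
    rates = []
    for j in (0, 1):
        mask = (y == j)          # picks out the n_j samples from class j
        n_j = mask.sum()
        # fraction of class-j samples that h assigns to the other class
        rates.append(float(np.mean(yhat[mask] != j)) if n_j > 0 else 0.0)
    return rates[0], rates[1]
```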

That is, $h^*$ is the most powerful test/classifier² in $H$ of size $\alpha$. Finally, set $R_1^* = R_1(g^*)$ to be the miss probability of the optimal classifier $g^*$ provided by the Neyman-Pearson lemma.

1.4 Problem Statement

The goal of Neyman-Pearson learning is to design learning algorithms $\hat{h}_n$ producing classifiers that perform almost as well as $h^*$ or $g^*$. In particular, we desire learners with the following properties. (This section is intended as a preview; precise statements are given later.)

PAC bounds: $\hat{h}_n$ is "probably approximately correct" in the sense that given $\delta_0, \delta_1 > 0$, there exist $\epsilon_0 = \epsilon_0(n_0, \delta_0)$ and $\epsilon_1 = \epsilon_1(n_1, \delta_1)$ such that

$$P^n\left(R_0(\hat{h}_n) > \alpha + \epsilon_0(n_0, \delta_0)\right) \le \delta_0$$

and

$$P^n\left(R_1(\hat{h}_n) > R_1(h^*) + \epsilon_1(n_1, \delta_1)\right) \le \delta_1.$$

Moreover, $\delta_0$ and $\delta_1$ decay exponentially fast as functions of increasing $\epsilon_0$ and $\epsilon_1$ (see Section 2 for details).

False alarm probability constraints: In classical hypothesis testing, one is interested in tests/classifiers $h$ satisfying $R_0(h) \le \alpha$. In the learning setting, such constraints can only hold with a certain probability. It is however possible to obtain non-probabilistic guarantees of the form (see Section 2.3 for details)

$$E^n\{R_0(\hat{h}_n) \mid n_0\} \le \alpha'.$$

Oracle inequalities: Given a hierarchy of sets of classifiers $H^1 \subset \cdots \subset H^K$, $\hat{h}_n$ does about as well as an oracle that knows which $H^k$ achieves the proper balance between the estimation and approximation errors. In particular we will show that with high probability, both

$$R_1(\hat{h}_n) - R_1^* \le \min_{1 \le k \le K}\left(\epsilon_1(n_1, k) + \inf_{h \in H_0^k} R_1(h) - R_1^*\right)$$

and

$$R_0(\hat{h}_n) \le \alpha + \epsilon_0(n_0, K)$$

hold, where $\epsilon_j(n_j, k)$ tends to zero at a certain rate depending on the choice of $H^k$ (see Section 3 for details).

Consistency: If $H$ grows (as a function of $n$) in a suitable way, then $\hat{h}_n$ is strongly universally consistent [14, chap. 6] in the sense that

$$\lim_{n \to \infty} R_0(\hat{h}_n) = \alpha \quad \text{with probability 1}$$

and

$$\lim_{n \to \infty} R_1(\hat{h}_n) = R_1^* \quad \text{with probability 1}$$

for all distributions of $Z$ (see Section 4 for details).

²In this paper we assume that a classifier $h^*$ achieving the minimum exists. Although not necessary, this allows us to avoid laborious approximation arguments.

Rates of convergence: Under mild regularity conditions, there exist functions $r_0(n)$ and $r_1(n)$ tending to zero at a polynomial rate such that

$$E^n\{R_0(\hat{h}_n)\} - \alpha \preceq r_0(n)$$

and

$$E^n\{R_1(\hat{h}_n)\} - R_1^* \preceq r_1(n).$$

We write $a_n \preceq b_n$ when $a_n = O(b_n)$, and $a_n \asymp b_n$ if both $a_n \preceq b_n$ and $b_n \preceq a_n$ (see Section 5 for details).

Implementable rules: Finally, we would like to find rules satisfying the above properties that can be implemented efficiently.

2 Neyman-Pearson and Empirical Risk Minimization

In this section we review the work of Cannon et al. [12], who study Neyman-Pearson learning in the context of fixed Vapnik-Chervonenkis (VC) classes. We also apply a different bounding technique that leads to substantially tighter upper bounds. For a review of VC theory see Devroye, Györfi, and Lugosi [14].

For the moment let $H$ be an arbitrary, fixed collection of classifiers and let $\epsilon_0 > 0$. Cannon et al. propose the classifier

$$\hat{h}_n = \arg\min_{h \in H} \hat{R}_1(h) \quad \text{s.t.} \quad \hat{R}_0(h) \le \alpha + \tfrac{1}{2}\epsilon_0. \tag{1}$$

We call this procedure Neyman-Pearson empirical risk minimization (NP-ERM). Cannon et al. demonstrate that NP-ERM enjoys properties similar to standard ERM [14, 15] translated to the Neyman-Pearson context. We now recall their analysis.

To state the theoretical properties of NP-ERM, introduce the following notation.³ Let $\epsilon_1 > 0$. Recall $H_0 = \{h \in H : R_0(h) \le \alpha\}$ and $h^* = \arg\min\{R_1(h) : h \in H_0\}$. Define

$$\Theta_0 = \{Z^n : R_0(\hat{h}_n) > \alpha + \epsilon_0\}$$
$$\Theta_1 = \{Z^n : R_1(\hat{h}_n) > R_1(h^*) + \epsilon_1\}$$
$$\Omega_0 = \{Z^n : \sup_{h \in H} |R_0(h) - \hat{R}_0(h)| > \epsilon_0/2\}$$
$$\Omega_1 = \{Z^n : \sup_{h \in H} |R_1(h) - \hat{R}_1(h)| > \epsilon_1/2\}.$$

A main result of Cannon et al. [12] is the following lemma.

Lemma 1 (Cannon, Howse, Hush and Scovel). With $\Theta_j$ and $\Omega_j$ defined as above and $\hat{h}_n$ as defined in (1), we have

$$\Theta_0 \subset \Omega_0, \qquad \Theta_1 \subset \Omega_0 \cup \Omega_1,$$

and in particular

$$\Theta_0 \cup \Theta_1 \subset \Omega_0 \cup \Omega_1.$$

³We interchange the meanings of the subscripts 0 and 1 used in [12], preferring to associate class 0 with the null hypothesis.
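To illustrate the constrained minimization in (1), here is a minimal brute-force sketch for the case where $H$ is a finite list of candidate classifiers (an editorial illustration with hypothetical names, reusing the empirical_errors sketch from Section 1.3; it does not reflect the efficient algorithms developed later in the report):

```python
import numpy as np

def np_erm(classifiers, X, y, alpha, eps0):
    """Brute-force NP-ERM over a finite class H, following (1):
    minimize the empirical miss probability R^_1 subject to the
    relaxed empirical false alarm constraint R^_0 <= alpha + eps0/2."""
    best_h, best_miss = None, np.inf
    for h in classifiers:
        r0_hat, r1_hat = empirical_errors(h, X, y)
        if r0_hat <= alpha + eps0 / 2 and r1_hat < best_miss:
            best_h, best_miss = h, r1_hat
    return best_h   # None if no candidate satisfies the relaxed constraint
```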

This result is termed a "fundamental lemma" for its pivotal role in relating the performance of $\hat{h}_n$ to bounds on the error deviance. Vapnik and Chervonenkis introduced an analogous result to bound the performance of ERM (for standard classification) in terms of the error deviance [16] (see also [14, chap. 8]). An immediate corollary is the following.

Proposition 1 (Cannon, Howse, Hush and Scovel). Let $\epsilon_0, \epsilon_1 > 0$ and take $\hat{h}_n$ as in (1). Then

$$P^n\left(R_0(\hat{h}_n) > \alpha + \epsilon_0 \text{ or } R_1(\hat{h}_n) > R_1(h^*) + \epsilon_1\right) \le P^n\left(\sup_{h \in H} |R_0(h) - \hat{R}_0(h)| > \epsilon_0/2\right) + P^n\left(\sup_{h \in H} |R_1(h) - \hat{R}_1(h)| > \epsilon_1/2\right).$$

Below this result is used to derive PAC bounds by applying results for convergence of empirical processes such as VC inequalities.

We make an important observation that is not mentioned by Cannon et al. [12]. In both of the above results, the tolerance parameters $\epsilon_0$ and $\epsilon_1$ need not be constants, in the sense that they may depend on the sample or certain other parameters. This will be a key to our improved bound and extension to structural risk minimization. In particular, we will choose $\epsilon_j$ to depend on $n_j$, a specified confidence parameter $\delta_j$, and a measure of the capacity of $H$ such as the cardinality (if $H$ is finite) or VC dimension (if $H$ is infinite).

While we focus our discussion on VC and finite classes for concreteness and comparison with [12], other classes and bounding techniques are applicable. Proposition 1 allows for the use of many of the error deviance bounds that have appeared in the empirical process and machine learning literature in recent years. The tolerances $\epsilon_j$ may even depend on the full sample $Z^n$ or on the individual classifier. Thus, for example, Rademacher averages [17, 18] could be used to define the tolerances in NP-ERM. However, the fact that the tolerances for VC and finite classes (defined below in Equations (2) and (3)) are independent of the classifier, and depend on the sample only through $n_0$ and $n_1$, does simplify our extension to structural risk minimization in Section 3.

2.1 NP-ERM with VC Classes

Suppose $H$ has VC dimension $V$. Cannon et al. consider two viewpoints for NP classification. First they consider retrospective sampling, where $n_0$ and $n_1$ are known before the sample is observed. Applying the VC inequality as stated by Devroye et al. [14], together with Proposition 1, they obtain the following.

Theorem 1 (Cannon, Howse, Hush and Scovel). Let $\epsilon_0, \epsilon_1 > 0$ and take $\hat{h}_n$ as in (1). Then

$$P^n\left(R_0(\hat{h}_n) > \alpha + \epsilon_0 \text{ or } R_1(\hat{h}_n) > R_1(h^*) + \epsilon_1\right) \le 8 n_0^V e^{-n_0 \epsilon_0^2/128} + 8 n_1^V e^{-n_1 \epsilon_1^2/128}.$$

An alternative viewpoint for Neyman-Pearson classification is IID sampling, in which $n_0$ and $n_1$ are unknown until the training sample is observed. Unfortunately, application of the VC inequality is not so straightforward in this setting because $n_0$ and $n_1$ are now random variables. To circumvent this problem, Cannon et al. arrive at the following result using the fact that with high probability $n_0$ and $n_1$ are concentrated near their expected values. Recall that $\pi_j$ is the a priori probability of class $j$.

Theorem 2 (Cannon, Howse, Hush and Scovel). Let $\epsilon_0, \epsilon_1 > 0$ and take $\hat{h}_n$ as in (1). If $n \ge \frac{10\sqrt{5}}{\pi_j^2 \epsilon_j^2}$, $j = 0, 1$, then

$$P^n\left(R_0(\hat{h}_n) > \alpha + \epsilon_0 \text{ or } R_1(\hat{h}_n) > R_1(h^*) + \epsilon_1\right) \le 10(2n)^V\left(e^{-n\pi_0^2\epsilon_0^2/(640\sqrt{5})} + e^{-n\pi_1^2\epsilon_1^2/(640\sqrt{5})}\right).$$

Owing to the larger constants out front and in the exponents, their bound for IID sampling is substantially larger than for retrospective sampling. In addition, the bound does not hold for small $n$, and since the a priori class probabilities are unknown in the NP setting, it is not known for which $n$ the bound does hold.

We propose an alternate VC-based bound for NP classification under IID sampling that improves upon the preceding result. In particular, our bound is as tight as the bound in Theorem 1 for retrospective sampling and it holds for all values of $n$. Thus it is no longer necessary to be concerned with the philosophical differences between retrospective and IID sampling, since the same bound is available for both. As mentioned previously, the key idea is to let the tolerances $\epsilon_0$ and $\epsilon_1$ be variable as follows. Let $\delta_0, \delta_1 > 0$ and define

$$\epsilon_j = \epsilon_j(n_j, \delta_j, H) = \sqrt{128\,\frac{V \log n_j + \log(8/\delta_j)}{n_j}} \tag{2}$$

for $j = 0, 1$. Let $\hat{h}_n$ be defined by (1) as before, but with the new definition of $\epsilon_0$ (which now depends on $n_0$, $\delta_0$ and $H$).

The main difference between the rule of Cannon et al. and the rule proposed here is that in their formulation the term $\alpha + \frac{1}{2}\epsilon_0$ constraining the empirical false alarm probability is independent of the sample. In contrast, our constraint is smaller for larger values of $n_0$. When more training data is available for class 0, a higher level of accuracy is required. We argue that this is a desirable property. Intuitively, when $n_0$ is larger, $\hat{R}_0$ should more accurately estimate $R_0$, and therefore a suitable classifier should be available from among those rules approximating $\alpha$ to within the smaller tolerance. Theoretically, our approach leads to a substantially tighter bound, as the following theorem shows.

Theorem 3. For NP-ERM over a VC class $H$ with tolerances given by (2),

$$P^n\left(R_0(\hat{h}_n) > \alpha + \epsilon_0(n_0, \delta_0, H) \text{ or } R_1(\hat{h}_n) > R_1(h^*) + \epsilon_1(n_1, \delta_1, H)\right) \le \delta_0 + \delta_1.$$

Proof. By Proposition 1, it suffices to show for $j = 0, 1$

$$P^n\left(\sup_{h \in H} |R_j(h) - \hat{R}_j(h)| > \tfrac{1}{2}\epsilon_j(n_j, \delta_j, H)\right) \le \delta_j.$$

Without loss of generality take $j = 0$. Let $P(n_0)$ be the probability that there are $n_0$ examples of class 0 in the training sample. Then

$$P^n\left(\sup_{h \in H} |R_0(h) - \hat{R}_0(h)| > \tfrac{1}{2}\epsilon_0(n_0, \delta_0, H)\right) = \sum_{n_0=0}^{n} P^n\left(\sup_{h \in H} |R_0(h) - \hat{R}_0(h)| > \tfrac{1}{2}\epsilon_0(n_0, \delta_0, H) \,\Big|\, n_0\right) P(n_0) \le \sum_{n_0=0}^{n} \delta_0\, P(n_0) = \delta_0,$$

where the inequality follows from the VC inequality [14, chap. 12]. This completes the proof.

(ε, δ)           π0 = 0.5     π0 = 0.9     π0 = 0.99    π0 = 0.999
(0.2, 0.1)       2.97 × 10^1  1.68 × 10^2  2.04 × 10^3  2.29 × 10^4
(0.1, 0.1)       2.90 × 10^1  1.63 × 10^2  1.95 × 10^3  2.27 × 10^4
(0.1, 0.05)      2.87 × 10^1  1.61 × 10^2  1.91 × 10^3  2.22 × 10^4
(0.05, 0.1)      2.85 × 10^1  1.59 × 10^2  1.88 × 10^3  2.17 × 10^4
(0.001, 0.001)   2.59 × 10^1  1.39 × 10^2  1.56 × 10^3  1.73 × 10^4
(10^-5, 10^-5)   2.47 × 10^1  1.30 × 10^2  1.42 × 10^3  1.53 × 10^4

Figure 1: Assuming $\delta = \delta_0 = \delta_1$ and $\epsilon = \epsilon_0 = \epsilon_1$, this table reports, for various values of $\delta$, $\epsilon$, and $\pi_0$, the ratio of the sample sizes needed to satisfy the two bounds of Theorems 2 and 3, respectively. Clearly the bound of Theorem 3 is tighter, especially for asymmetric a priori probabilities. Furthermore, the bound of Theorem 2 requires knowledge of $\pi_0$, which is typically not available in the Neyman-Pearson setting, while the bound of Theorem 3 does not. See Example 1 for further discussion.

Example 1. For the purpose of comparison, assume $V = 1$, $n_0 = n_1 = \frac{1}{2}n$, and $\pi_0 = \pi_1 = \frac{1}{2}$. How large should $n$ be so that we are guaranteed that with at least 0.8 probability (say $\delta_0 = \delta_1 = 0.1$), both bounds hold with $\epsilon_0 = \epsilon_1 = 0.2$? For the new bound of Theorem 3 to hold we need

$$\sqrt{128\,\frac{\log(\frac{1}{2}n) + \log(8/(0.1))}{\frac{1}{2}n}} \le 0.2,$$

which implies $n \ge 97104$. In contrast, Theorem 2 requires

$$20n\,e^{-n(0.5)^2(0.2)^2/(640\sqrt{5})} \le 0.1,$$

which implies $n \ge 2887079$. To verify that this phenomenon is not a consequence of the particular choice of $\delta_j$, $\epsilon_j$, or $\pi_0$, Fig. 1 reports the ratios of the necessary sample sizes for a range of these parameters.⁴ Indeed, as $\pi_0$ tends to 1, the disparity grows significantly. Finally, we note that the bound of Cannon et al. for retrospective sampling requires as many samples as our bound for IID sampling.

⁴When calculating (2) for this example we assume $n_j = n \cdot \pi_j$.

2.2 NP-ERM with Finite Classes

The VC inequality is so general that in most practical settings it is too loose owing to large constants. Much tighter bounds are possible when $H$ is finite. For example, $H$ could be obtained by quantizing the elements of some VC class to machine precision.

Let $H$ be finite and define the NP-ERM estimator $\hat{h}_n$ as before. Redefine the tolerances $\epsilon_0$ and $\epsilon_1$ by

$$\epsilon_j = \epsilon_j(n_j, \delta_j, H) = \sqrt{2\,\frac{\log|H| + \log(2/\delta_j)}{n_j}}. \tag{3}$$

We have the following analogue of Theorem 3.

Theorem 4. For NP-ERM over a finite class $H$ with tolerances given by (3),

$$P^n\left(R_0(\hat{h}_n) > \alpha + \epsilon_0(n_0, \delta_0, H) \text{ or } R_1(\hat{h}_n) > R_1(h^*) + \epsilon_1(n_1, \delta_1, H)\right) \le \delta_0 + \delta_1.$$

The proof is identical to the proof of Theorem 3, except that the VC inequality is replaced by a bound derived from Hoeffding's inequality and the union bound (see [14, chap. 8]).

Example 2. To illustrate the significance of the new bound, consider the scenario described in Example 1, but assume the VC class $H$ is quantized so that $|H| = 2^{16}$. For the bound of Theorem 4 to hold, we need

$$\sqrt{2\,\frac{16\log 2 + \log(2/(0.1))}{\frac{1}{2}n}} \le 0.2,$$

which implies $n \ge 1409$, a significant improvement.

Another example of the improved bounds available for finite $H$ is given in Section 6.2, where we apply NP-SRM to dyadic decision trees.

2.3 Learner False Alarm Probabilities

In the classical setting, one is interested in classifiers $h$ satisfying $R_0(h) \le \alpha$. When learning from training data, such bounds on the false alarm probability of $\hat{h}_n$ are not possible due to its dependence on the training sample. However, a practitioner may still wish to have an absolute upper bound of some sort. One possibility is to bound the quantity

$$E^n\{R_0(\hat{h}_n) \mid n_0\},$$

which we call the learner false alarm probability.

The learner false alarm probability can be constrained for a certain range of $\alpha$, depending on the class $H$ and the number $n_0$ of class 0 training samples. Recall $\Theta_0 = \{Z^n : R_0(\hat{h}_n) > \alpha + \epsilon_0\}$. Arguing as in the proof of Theorem 3 and using Lemma 1 ($\Theta_0 \subset \Omega_0$), we have

$$E^n\{R_0(\hat{h}_n) \mid n_0\} = P^n(\Theta_0 \mid n_0)\, E^n\{R_0(\hat{h}_n) \mid \Theta_0, n_0\} + P^n(\overline{\Theta}_0 \mid n_0)\, E^n\{R_0(\hat{h}_n) \mid \overline{\Theta}_0, n_0\} \le \alpha + \epsilon_0(n_0, \delta_0, H) + \delta_0.$$

The confidence $\delta_0$ is essentially a free parameter, so let $\alpha_0^*$ be the minimum possible value of $\epsilon_0(n_0, \delta_0, H) + \delta_0$ as $\delta_0$ ranges over $(0, 1]$. Then the learner false alarm probability can be bounded by any desired $\alpha'$, $\alpha_0^* \le \alpha' \le 1$, by choosing $\alpha$ appropriately. In particular, if $\hat{h}_n$ is obtained by NP-ERM with $\alpha = \alpha' - \alpha_0^*$, then

$$E^n\{R_0(\hat{h}_n) \mid n_0\} \le \alpha'.$$

Example 3. Suppose $H$ is finite with $|H| = 2^8$ and $n_0 = 1000$. A simple numerical experiment, using the formula of (3) for $\epsilon_0$, shows that the minimum value of $\delta_0 + \epsilon_0(n_0, \delta_0, H)$ is $\alpha_0^* \approx 0.157$. Thus, if a bound of $\alpha' = 0.2$ on the learner false alarm probability is desired, it suffices to perform NP-ERM with $\alpha = \alpha' - \alpha_0^* = 0.043$.
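The numerical experiment in Example 3 amounts to a one-dimensional minimization over $\delta_0$, sketched below (an editorial illustration, assuming natural logarithms in (3) and a simple grid search; the variable names are ours):

```python
import numpy as np

# Example 3: minimize delta0 + eps0(n0, delta0, H) over delta0 in (0, 1],
# with |H| = 2**8 and n0 = 1000, using the finite-class tolerance (3).
n0 = 1000
log_H = 8 * np.log(2)                     # log|H| for |H| = 2**8
delta0 = np.linspace(1e-4, 1.0, 100000)   # grid over (0, 1]
eps0 = np.sqrt(2 * (log_H + np.log(2 / delta0)) / n0)
alpha0_star = (delta0 + eps0).min()
print(round(alpha0_star, 3))              # approx. 0.157
```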

3 Neyman-Pearson and Structural Risk Minimization

One limitation of NP-ERM over a fixed $H$ is that most possibilities for the optimal rule $g^*$ cannot be approximated arbitrarily well. Such rules will never be universally consistent. A solution to this problem is known as structural risk minimization (SRM) [19], whereby a classifier is selected from a family $H^k$, $k = 1, 2, \ldots$, of increasingly rich classes of classifiers. In this section we present NP-SRM, a version of SRM adapted to the NP setting, in the two cases where all $H^k$ are either VC classes or finite.

3.1 NP-SRM over VC Classes

Let $H^k$, $k = 1, 2, \ldots$, be given, with $H^k$ having VC dimension $V_k$. Assume $V_1 < V_2 < \cdots$. Define the tolerances $\epsilon_0$ and $\epsilon_1$ by

$$\epsilon_j = \epsilon_j(n_j, \delta_j, k) = \sqrt{128\,\frac{V_k \log n_j + k \log 2 + \log(8/\delta_j)}{n_j}}. \tag{4}$$

Remark 1. The new value for $\epsilon_j$ is equal to the value of $\epsilon_j$ in the previous section with $\delta_j$ replaced by $\delta_j 2^{-k}$. The choice of the scaling factor $2^{-k}$ stems from the fact $\sum_{k \ge 1} 2^{-k} \le 1$, which is used to show (by the union bound) that the VC inequalities hold for all $k$ simultaneously with probability at least $1 - \delta_j$ (see the proof of Theorem 5).

NP-SRM produces a classifier $\hat{h}_n$ according to the following two-step process. Let $K(n)$ be a nondecreasing integer-valued function of $n$ with $K(1) = 1$.

1. For each $k = 1, 2, \ldots, K(n)$, set

$$\hat{h}_n^k = \arg\min_{h \in H^k} \hat{R}_1(h) \quad \text{s.t.} \quad \hat{R}_0(h) \le \alpha + \tfrac{1}{2}\epsilon_0(n_0, \delta_0, k). \tag{5}$$

2. Set

$$\hat{h}_n = \arg\min\left\{\hat{R}_1(\hat{h}_n^k) + \tfrac{1}{2}\epsilon_1(n_1, \delta_1, k) \;:\; k = 1, 2, \ldots, K(n)\right\}.$$

The term $\frac{1}{2}\epsilon_1(n_1, \delta_1, k)$ may be viewed as a penalty that measures the complexity of class $H^k$. In words, NP-SRM uses NP-ERM to select a candidate from each VC class, and then selects among these candidates by minimizing the penalized empirical miss probability.
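A minimal sketch of this two-step procedure, under the simplifying assumption that each $H^k$ is given as a finite list of candidate classifiers (an editorial illustration reusing the hypothetical np_erm and empirical_errors helpers sketched earlier; it is not the report's algorithm):

```python
import numpy as np

def tolerance(n_j, delta_j, V_k, k):
    """Tolerance (4) for class H^k with VC dimension V_k.
    Assumes n_j >= 1, i.e., both classes appear in the sample."""
    return np.sqrt(128 * (V_k * np.log(n_j) + k * np.log(2)
                          + np.log(8 / delta_j)) / n_j)

def np_srm(classes, vc_dims, X, y, alpha, delta0, delta1):
    """Two-step NP-SRM: NP-ERM within each class H^k as in (5),
    then selection by penalized empirical miss probability."""
    n0, n1 = int(np.sum(y == 0)), int(np.sum(y == 1))
    best_h, best_score = None, np.inf
    for k, (H_k, V_k) in enumerate(zip(classes, vc_dims), start=1):
        # Step 1: NP-ERM over H^k with the class-dependent tolerance
        h_k = np_erm(H_k, X, y, alpha, tolerance(n0, delta0, V_k, k))
        if h_k is None:       # relaxed constraint unsatisfiable in H^k
            continue
        _, r1_hat = empirical_errors(h_k, X, y)
        # Step 2: penalize R^_1 by half the complexity tolerance
        score = r1_hat + tolerance(n1, delta1, V_k, k) / 2
        if score < best_score:
            best_h, best_score = h_k, score
    return best_h
```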

