Nearest Neighbor Pattern Classification


T. M. COVER, MEMBER, IEEE, AND P. E. HART, MEMBER, IEEE

Abstract: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points. This rule is independent of the underlying joint distribution on the sample points and their classifications, and hence the probability of error R of such a rule must be at least as great as the Bayes probability of error R*, the minimum probability of error over all decision rules taking underlying probability structure into account. However, in a large sample analysis, we will show in the M-category case that R* ≤ R ≤ R*(2 − MR*/(M − 1)), where these bounds are the tightest possible, for all suitably smooth underlying distributions. Thus for any number of categories, the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error. In this sense, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.

Manuscript received February 23, 1966; revised April 29, 1966. This work has been supported at Stanford University by U.S. Army Electronics Command under Contract DA28-043-AMC-01764(E) and by USAF under Contract AF49(638)1517; and at the Stanford Research Institute, Menlo Park, Calif., by RADC under Contract AF30(602)-3945.
T. M. Cover is with the Department of Electrical Engineering, Stanford University, Stanford, Calif. P. E. Hart is with the Stanford Research Institute, Menlo Park, Calif.

I. INTRODUCTION

IN THE CLASSIFICATION problem there are two extremes of knowledge which the statistician may possess. Either he may have complete statistical knowledge of the underlying joint distribution of the observation x and the true category θ, or he may have no knowledge of the underlying distribution except that which can be inferred from samples. In the first extreme, a standard Bayes analysis will yield an optimal decision procedure and the corresponding minimum (Bayes) probability of error of classification R*. In the other extreme, a decision to classify x into category θ is allowed to depend only on a collection of n correctly classified samples (x1, θ1), (x2, θ2), ..., (xn, θn), and the decision procedure is by no means clear. This problem is in the domain of nonparametric statistics, and no optimal classification procedure exists with respect to all underlying statistics.

If it is assumed that the classified samples (xi, θi) are independently identically distributed according to the distribution of (x, θ), certain heuristic arguments may be made about good decision procedures. For example, it is reasonable to assume that observations which are close together (in some appropriate metric) will have the same classification, or at least will have almost the same posterior probability distributions on their respective classifications. Thus to classify the unknown sample x we may wish to weight the evidence of the nearby xi's most heavily. Perhaps the simplest nonparametric decision procedure of this form is the nearest neighbor (NN) rule, which classifies x in the category of its nearest neighbor.

Surprisingly, it will be shown that, in the large sample case, this simple rule has a probability of error which is less than twice the Bayes probability of error, and hence is less than twice the probability of error of any other decision rule, nonparametric or otherwise, based on the infinite sample set.

The first formulation of a rule of the nearest neighbor type, and the primary previous contribution to the analysis of its properties, appears to have been made by Fix and Hodges [1], [2]. They investigated a rule which might be called the k_n-nearest neighbor rule. It assigns to an unclassified point the class most heavily represented among its k_n nearest neighbors. Fix and Hodges established the consistency of this rule for sequences k_n → ∞ such that k_n/n → 0. In reference [2], they investigate numerically the small-sample performance of the k_n-NN rule under the assumption of normal statistics.

The NN rule has been used by Johns [3] as an example of an empirical Bayes rule. Kanal [4], Sebestyen [5] (who calls it the proximity algorithm), and Nilsson [6] have mentioned the intuitive appeal of the NN rule and suggested its use in the pattern recognition problem. Loftsgaarden and Quesenberry [7] have shown that a simple modification of the k_n-NN rule gives a consistent estimate of a probability density function. In the above-mentioned papers, no analytical results in the nonparametric case were obtained either for the finite sample size problem or for the finite number of nearest neighbors problem.

In this paper we shall show that, for any number n of samples, the single-NN rule has strictly lower probability of error than any other k_n-NN rule against certain classes of distributions, and hence is admissible among the k_n-NN rules. We will then establish the extent to which "samples which are close together have categories which are close together" and use this to compare, in Section VI, the probability of error of the NN rule with the minimum possible probability of error.

II. THE NEAREST NEIGHBOR RULE

A set of n pairs (x1, θ1), ..., (xn, θn) is given, where the xi's take values in a metric space X upon which is defined a metric d, and the θi's take values in the set {1, 2, ..., M}. Each θi is considered to be the index of the category to which the ith individual belongs, and each xi is the outcome of the set of measurements made upon that individual. For brevity, we shall frequently say "xi belongs to θi" when we mean precisely that the ith individual, upon which measurements xi have been observed, belongs to category θi.

A new pair (x, θ) is given, where only the measurement x is observable by the statistician, and it is desired to estimate θ by utilizing the information contained in the set of correctly classified points. We shall call

    x'_n ∈ {x1, x2, ..., xn}

a nearest neighbor to x if

    min d(xi, x) = d(x'_n, x),    i = 1, 2, ..., n.    (1)

The nearest neighbor rule decides that x belongs to the category θ'_n of its nearest neighbor x'_n. A mistake is made if θ'_n ≠ θ. Notice that the NN rule utilizes only the classification of the nearest neighbor. The n − 1 remaining classifications θi are ignored.
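The rule is as simple computationally as it is to state. The following sketch is not part of the paper; it is a minimal illustration, assuming Euclidean measurements stored in NumPy arrays and ties broken by lowest index, of the single nearest neighbor decision.

```python
import numpy as np

def nn_classify(train_x, train_y, query):
    """1-NN rule: return the category of the stored sample nearest to `query`.

    train_x : (n, d) array of classified measurements x_1, ..., x_n
    train_y : (n,)   array of category indices theta_1, ..., theta_n
    query   : (d,)   array, the unclassified measurement x

    Euclidean distance is used here; the rule itself only requires some metric d.
    Ties are broken by taking the first minimizing index.
    """
    dists = np.linalg.norm(train_x - query, axis=1)  # d(x_i, x), i = 1, ..., n
    return train_y[np.argmin(dists)]                 # category of the nearest neighbor

# Small usage example with two well-separated categories in the plane.
rng = np.random.default_rng(0)
x1 = rng.normal([-2.0, 0.0], 1.0, size=(50, 2))      # samples from category 1
x2 = rng.normal([+2.0, 0.0], 1.0, size=(50, 2))      # samples from category 2
train_x = np.vstack([x1, x2])
train_y = np.array([1] * 50 + [2] * 50)
print(nn_classify(train_x, train_y, np.array([-1.5, 0.3])))
```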
III. ADMISSIBILITY OF NEAREST NEIGHBOR RULE

If the number of samples is large it makes good sense to use, instead of the single nearest neighbor, the majority vote of the nearest k neighbors. We wish k to be large in order to minimize the probability of a non-Bayes decision for the unclassified point x, but we wish k to be small (in proportion to the number of samples) in order that the points be close enough to x to give an accurate estimate of the posterior probabilities of the true class of x.

The purpose of this section is to show that, among the class of k-NN rules, the single nearest neighbor rule (1-NN) is admissible. That is, for the n-sample problem, there exists no k-NN rule, k ≠ 1, which has lower probability of error against all distributions. We shall show that the single NN rule is undominated by exhibiting a simple distribution for which it has strictly lower probability of error P_e. The example to be given comes from the family of distributions for which simple decision boundaries provide complete separation of the samples into their respective categories. Fortunately, one example will serve for all n.

Consider the two-category problem in which the prior probabilities η1 = η2 = 1/2, the conditional density f1 is uniform on the unit disk D1 centered at (−3, 0), and the conditional density f2 is uniform on the unit disk D2 centered at (3, 0), as shown in Fig. 1. In the n-sample problem, the probability that exactly i of the individuals come from category 1, and hence have measurements lying in D1, is (1/2)^n C(n, i). Without loss of generality, assume that the unclassified x lies in category 1. Then the NN rule will make a classification error only if the nearest neighbor x'_n belongs to category 2, and thus, necessarily, lies in D2. But, from inspection of the distance relationships, if the nearest neighbor to x is in D2, then each of the xi must lie in D2. Thus the probability P_e(1; n) of error of the NN rule in this case is precisely (1/2)^n, the probability that x1, x2, ..., xn all lie in D2. Let k = 2k0 + 1. Then the k-NN rule makes an error if k0 or fewer points lie in D1. This occurs with probability

    P_e(k; n) = (1/2)^n Σ_{i=0}^{k0} C(n, i).    (2)

Thus in this example, the 1-NN rule has strictly lower P_e than does any k-NN rule, k ≠ 1, and hence is admissible in that class.¹

Fig. 1. Admissibility of nearest neighbor rule.

¹ In case of ties for the nearest neighbor, the rule may be modified to decide the most popular category among the ties. However, in those cases in which ties occur with nonzero probability, our results are trivially true.
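The error probabilities in this two-disk example can be evaluated exactly. The sketch below is not from the paper; it assumes the setup above, in which each of the n samples falls in the disk containing the query point independently with probability 1/2, and evaluates P_e(k; n) for odd k.

```python
from math import comb

def p_err_knn(k, n):
    """Exact error probability of the k-NN rule (k = 2*k0 + 1) in the
    two-disk example: an error occurs iff at most k0 of the n samples
    lie in the disk containing the query point."""
    k0 = (k - 1) // 2
    return sum(comb(n, i) for i in range(k0 + 1)) / 2 ** n

n = 20
for k in (1, 3, 5, 7):
    print(f"k={k}: P_e = {p_err_knn(k, n):.3e}")
# p_err_knn(1, n) equals (1/2)**n, and P_e(k; n) grows with k,
# so the 1-NN rule is strictly better than any k-NN rule, k != 1, here.
```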

Indeed, in this example,

    P_e(k; n) increases in k for any n,
    P_e(k; n) decreases to 0 in n for any k,    (3)

and

    P_e(k_n; n) → 0  if 0 ≤ k_n ≤ αn, α < 1/2, for all n.

In general, then, the 1-NN rule is strictly better than the k ≠ 1-NN rule in those cases where the supports of the densities f1, f2, ..., fM are such that each in-class distance is greater than any between-class distance.

IV. BAYES PROCEDURE

In this section we shall present the simplest version of the Bayes decision procedure for minimizing the probability of error in classifying a given observation x into one of M categories. All the statistics will be assumed known. Bear in mind, however, that the NN rule is nonparametric, or distribution free, in the sense that it does not depend on any assumptions about the underlying statistics for its application. The Bayes risk serves merely as a reference, the limit of excellence beyond which it is not possible to go.

Let x denote the measurements on an individual and X the sample space of possible values of x. We shall refer to x as the observation. On the basis of x a decision must be made about the membership of the individual in one of M specified categories.

For the purposes of defining the Bayes risk, we assume f1(x), f2(x), ..., fM(x), probability densities at x (with respect to a σ-finite measure ν), such that an individual in category i gives rise to an observation x according to density fi. Let L(i, j) be the loss incurred by assigning an individual from category i to category j.

Let η1, η2, ..., ηM, ηi ≥ 0, Σ ηi = 1, be the prior probabilities of the M categories. The conditional probability qi(x) of an individual with measurements x belonging to category i is, by the Bayes theorem,

    qi(x) = ηi fi(x) / Σj ηj fj(x),    i = 1, 2, ..., M.    (4)

Thus the random variable x transforms the prior probability vector η into the posterior probability vector q(x).

If the statistician decides to place an individual with measurements x into category j, the conditional loss is

    rj(x) = Σi qi(x) L(i, j).    (5)

For a given x the conditional loss is minimum when the individual is assigned to the category j for which rj(x) is lowest. Minimizing the conditional expected loss obviously minimizes the unconditional expected loss. Thus the minimizing decision rule δ*, called the Bayes decision rule with respect to η, is given by deciding the category j for which rj is lowest. Using δ*, the conditional Bayes risk r*(x) is

    r*(x) = min_j rj(x),    (6)

and the resulting overall minimum expected risk R*, called the Bayes risk, is given by

    R* = E r*(x),    (7)

where the expectation is with respect to the compound density

    f(x) = Σi ηi fi(x).    (8)
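To make the Bayes reference point concrete, here is a small sketch that is not part of the paper; it assumes two categories with one-dimensional Gaussian class-conditional densities and the 0-1 loss, computes the posterior vector of (4), and evaluates the conditional Bayes risk of (6), which for this loss is one minus the largest posterior.

```python
import numpy as np

# Assumed illustration setup: priors eta_i and Gaussian class densities f_i.
eta = np.array([0.5, 0.5])                                 # prior probabilities
means, stds = np.array([-1.0, 1.0]), np.array([1.0, 1.0])  # parameters of f_1, f_2

def class_densities(x):
    """f_i(x) for each category i (Gaussian stand-ins for the densities)."""
    return np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))

def posterior(x):
    """q_i(x) = eta_i f_i(x) / sum_j eta_j f_j(x), as in (4)."""
    w = eta * class_densities(x)
    return w / w.sum()

def conditional_bayes_risk(x):
    """r*(x) for the 0-1 loss: 1 - max_i q_i(x), the smallest conditional loss."""
    return 1.0 - posterior(x).max()

x = 0.3
print(posterior(x), conditional_bayes_risk(x))
# The overall Bayes risk R* of (7) could then be estimated by averaging
# r*(x) over draws from the compound density (8).
```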
V. CONVERGENCE OF NEAREST NEIGHBORS

Most of the properties of the NN rules hinge on the assumption that the conditional distributions of θ'_n and θ approach one another when x'_n → x. In order to put bounds on the NN risk for as wide a class of underlying statistics as possible, it will be necessary to determine the weakest possible conditions on the statistics which guarantee the above convergence.

Lemma (Convergence of the Nearest Neighbor): Let x and x1, x2, ... be independent identically distributed random variables taking values in a separable metric space X. Let x'_n denote the nearest neighbor to x from the set {x1, x2, ..., xn}. Then x'_n → x with probability one.

Remark: In particular, x'_n → x with probability one for any probability measure in Euclidean n-space. We prove the lemma in this generality in order to include in its coverage such standard pathological candidates for counterexamples as the Cantor ternary distribution function defined on X = the real line.

Since the convergence of the nearest neighbor to x is independent of the metric, the bounds on the risks of the NN rule will be independent of the metric on X.

Proof: Let S_x(r) = {z ∈ X : d(z, x) ≤ r} be the sphere of radius r centered at x, where d is the metric defined on X. Consider first a point x ∈ X having the property that every sphere S_x(r), r > 0, has nonzero probability measure. Then, for any δ > 0,

    P{min_{i=1,...,n} d(xi, x) ≥ δ} = (1 − P(S_x(δ)))^n → 0    (9)

and therefore, since d(x'_n, x) is monotonically decreasing in n, the nearest neighbor to x converges to x with probability one.

It remains to argue that the random variable x has this property with probability one. We shall do so by proving that the set N of points failing to have this property has probability measure zero. Accordingly, let N be the set of all x for which there exists some r_x sufficiently small that P(S_x(r_x)) = 0.

By the definition of the separability of X, there exists a countable dense subset A of X. For each x ∈ N there exists, by the denseness of A, an a_x in A for which a_x ∈ S_x(r_x/3). Thus there exists a small sphere S_{a_x}(r_x/2) which is strictly contained in the original sphere S_x(r_x) and which contains x. Thus P(S_{a_x}(r_x/2)) = 0. Then the possibly uncountable set N is contained in the countable union (by the countability of A) of the spheres S_{a_x}(r_x/2). Since N is contained in a countable union of sets of measure zero, P(N) = 0, as was to be shown.
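Before turning to the error bounds, the lemma is easy to observe empirically. The following sketch is not from the paper; it assumes an arbitrary sampling distribution in the plane (a standard normal here) and simply tracks the distance from a fixed point x to its nearest neighbor as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([0.25, -0.40])                     # fixed query point

for n in (10, 100, 1_000, 10_000, 100_000):
    samples = rng.standard_normal((n, 2))       # x_1, ..., x_n i.i.d.
    nn_dist = np.linalg.norm(samples - x, axis=1).min()
    print(f"n={n:>6}  min_i d(x_i, x) = {nn_dist:.4f}")
# The nearest neighbor distance shrinks toward zero, as the lemma asserts
# (with probability one) for any distribution on a separable metric space.
```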
VI. NEAREST NEIGHBOR PROBABILITY OF ERROR

Let x'_n ∈ {x1, x2, ..., xn} be the nearest neighbor to x, and let θ'_n be the category to which the individual having measurement x'_n belongs. If θ is indeed the category of x, the NN rule incurs loss L(θ, θ'_n). If (x, θ), (x1, θ1), ..., (xn, θn) are random variables, we define the n-sample NN risk R(n) by the expectation

    R(n) = E[L(θ, θ'_n)]    (10)

and the (large sample) NN risk R by

    R = lim_{n→∞} R(n).    (11)

Throughout this discussion we shall assume that the pairs (x, θ), (x1, θ1), ..., (xn, θn) are independent identically distributed random variables in X × Θ. Of course, except in trivial cases, there will be some dependence between the elements xi, θi of each pair.

We shall first consider the M = 2 category problem with probability of error criterion given by the 0-1 loss matrix

    L = | 0  1 |
        | 1  0 |    (12)

where L counts an error whenever a mistake in classification is made. The following theorem is the principal result of this discussion.

Theorem: Let X be a separable metric space. Let f1 and f2 be such that, with probability one, x is either 1) a continuity point of f1 and f2, or 2) a point of nonzero probability measure. Then the NN risk R (probability of error) has the bounds

    R* ≤ R ≤ 2R*(1 − R*).    (13)

These bounds are as tight as possible.

Remarks: In particular, the hypotheses of the theorem are satisfied for probability densities which consist of any mixture of δ-functions and piecewise continuous density functions on Euclidean d-space. Observe that 0 ≤ R* ≤ R ≤ 2R*(1 − R*) ≤ 1/2; so R* = 0 if and only if R = 0, and R* = 1/2 if and only if R = 1/2. Thus in the extreme cases of complete certainty and complete uncertainty the NN probability of error equals the Bayes probability of error. Conditions for equality of R and R* for other values of R* will be developed in the proof.

Proof: Let us condition on the random variables x and x'_n in the n-sample NN problem. The conditional NN risk r(x, x'_n) is then given, upon using the conditional independence of θ and θ'_n, by

    r(x, x'_n) = E[L(θ, θ'_n) | x, x'_n] = Pr{θ ≠ θ'_n | x, x'_n}
               = Pr{θ = 1 | x} Pr{θ'_n = 2 | x'_n} + Pr{θ = 2 | x} Pr{θ'_n = 1 | x'_n},    (14)

where the expectation is taken over θ and θ'_n. By the development of (4) the above may be written as

    r(x, x'_n) = q1(x) q2(x'_n) + q2(x) q1(x'_n).    (15)

We wish first to show that r(x, x'_n) converges to the random variable 2 q1(x) q2(x) with probability one.

We have not required that f1, f2 be continuous at the points x of nonzero probability measure ν(x), because these points may be trivially taken into account as follows. Let ν(x0) > 0; then

    Pr{x'_n ≠ x0} = (1 − ν(x0))^n → 0.    (16)

Since x'_n, once equalling x0, equals x0 thereafter,

    r(x, x'_n) → 2 q1(x0) q2(x0)    (17)

with probability one.

For the remaining points, the hypothesized continuity of f1 and f2 is needed. Here x is a continuity point of f1 and f2 with conditional probability one (conditioned on x such that ν(x) = 0). Then, since q is continuous in f1 and f2, x is a continuity point of q with probability one. By the lemma, x'_n converges to the random variable x with probability one. Hence, with probability one,

    q(x'_n) → q(x)    (18)

and, from (15), with probability one,

    r(x, x'_n) → r(x) = 2 q1(x) q2(x),    (19)

where r(x) is the limit of the n-sample conditional NN risk. As shown in (6), the conditional Bayes risk is

    r*(x) = min{q1(x), q2(x)} = min{q1(x), 1 − q1(x)}.    (20)

Now, by the symmetry of r* in q1, we may write

    r(x) = 2 q1(x) q2(x) = 2 q1(x)(1 − q1(x)) = 2 r*(x)(1 − r*(x)).    (21)
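The relation (21) can be checked numerically. The sketch below is not part of the paper; it tabulates the conditional Bayes risk r*(x) and the asymptotic conditional NN risk r(x) over a grid of posterior values q1(x), confirming r*(x) ≤ r(x) = 2 r*(x)(1 − r*(x)) ≤ 2 r*(x) pointwise.

```python
import numpy as np

q1 = np.linspace(0.0, 1.0, 11)           # posterior q_1(x); q_2(x) = 1 - q_1(x)
r_star = np.minimum(q1, 1.0 - q1)        # conditional Bayes risk r*(x), as in (20)
r_nn = 2.0 * q1 * (1.0 - q1)             # asymptotic conditional NN risk, as in (21)

for a, b, c in zip(q1, r_star, r_nn):
    print(f"q1={a:.1f}  r*={b:.2f}  r_NN={c:.2f}")

# r* <= r_NN <= 2*r* holds at every grid point (within floating point tolerance).
assert np.all(r_star <= r_nn + 1e-12) and np.all(r_nn <= 2.0 * r_star + 1e-12)
```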

Thus, as a by-product of the proof, we have shown in the large sample case that, with probability one, a randomly chosen x will be incorrectly classified with conditional probability 2 r*(x)(1 − r*(x)). For the overall NN risk R, we have, by definition,

    R = lim_{n→∞} E[r(x, x'_n)],    (22)

where the expectation is taken over x and x'_n. Now L, and hence r, is bounded by one; so, applying the dominated convergence theorem,

    R = E[lim_{n→∞} r(x, x'_n)].    (23)

Theorem (M categories): Let f1, f2, ..., fM be probability densities with respect to some probability measure ν such that, with probability one, x is either 1) a continuity point of f1, f2, ..., fM, or 2) a point of nonzero probability measure. Then the NN probability of error R has the bounds

    R* ≤ R ≤ R*(2 − MR*/(M − 1)).    (29)

These bounds are as tight as possible.

Proof: Since x'_n → x with probability one, the posterior probability vector q(x'_n) → q(x) with probability one. The conditional n-sample NN risk r(x, x'_n) is

    r(x, x'_n) = E[L(θ, θ'_n) | x, x'_n]
