
CSC 411: Lecture 03: Linear Classification
Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto

Examples of Problems

What digit is this? How can I predict this? What are my input features?

Classification

What do all these problems have in common? Categorical outputs, called labels (e.g., yes/no, dog/cat/person/other).

Assigning each input vector to one of a finite number of labels is called classification.

Binary classification: two possible labels (e.g., yes/no, 0/1, cat/dog). Multi-class classification: multiple possible labels.

We will first look at binary problems, and discuss multi-class problems later in class.

Today

Linear classification (binary). Key concepts:
- Classification as regression
- Decision boundary
- Loss functions
- Metrics to evaluate classification

Classification vs Regression

We are interested in mapping the input x ∈ X to a label t ∈ Y. In regression, typically Y = R (the reals); now Y is categorical.

Classification as Regression

Can we do this task using what we have learned in previous lectures? Simple hack: ignore that the output is categorical!

Suppose we have a binary problem, t ∈ {−1, 1}. Assume the standard model used for (linear) regression:

    y(x) = f(x, w) = w^T x

How can we obtain w? Use least squares: w = (X^T X)^(−1) X^T t. How is X computed? And t?

Which loss are we minimizing? Does it make sense?

    ℓ_square(w, t) = (1/N) Σ_{n=1}^{N} (t^(n) − w^T x^(n))²

How do I compute a label for a new example? Let's see an example.
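The hack above can be sketched in a few lines of numpy: fit w with the normal equations on toy labels in {−1, +1}, then threshold the regression output. The data, class means, and test point are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy 1-D blobs: positive class near +2, negative class near -2
x_pos = rng.normal(+2.0, 0.5, size=20)
x_neg = rng.normal(-2.0, 0.5, size=20)
x = np.concatenate([x_pos, x_neg])
t = np.concatenate([np.ones(20), -np.ones(20)])   # targets in {-1, +1}

# Design matrix X: a column of ones (for the bias) plus the raw input
X = np.column_stack([np.ones_like(x), x])

# Normal equations: w = (X^T X)^{-1} X^T t
w = np.linalg.solve(X.T @ X, X.T @ t)

# Label a new example by thresholding the regression output
y_new = np.sign(w @ np.array([1.0, 3.0]))   # a point deep in the positive class
print(y_new)
```

This treats the labels as real-valued regression targets, which is exactly the "ignore that the output is categorical" hack; the lecture's later slides show why a dedicated decision rule and loss are preferable.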

Classification as Regression

A 1D example (the input x is 1-dimensional). The colors indicate labels: a blue plus denotes that t^(i) is from the first class, a red circle that t^(i) is from the second class. [Figure from G. Shakhnarovich]

Decision Rules

Our classifier has the form

    f(x, w) = w0 + w^T x

A reasonable decision rule is: y = 1 if f(x, w) ≥ 0, and y = −1 otherwise.

How can I write this rule mathematically?

    y(x) = sign(w0 + w^T x)

What does this function look like?
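The decision rule can be written out directly; a minimal sketch, with made-up weights w0 and w chosen only for illustration:

```python
import numpy as np

# Illustrative fixed parameters (not learned here)
w0 = -1.0                 # bias: shifts the boundary away from the origin
w = np.array([2.0, 0.5])  # weight vector, orthogonal to the boundary

def predict(x):
    """Linear decision rule: y(x) = sign(w0 + w^T x), with y = 1 when f >= 0."""
    return 1 if w0 + w @ x >= 0 else -1

print(predict(np.array([1.0, 0.0])))   # w0 + w^T x = 1.0 >= 0, so +1
print(predict(np.array([0.0, 0.0])))   # w0 + w^T x = -1.0 < 0, so -1
```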

Decision Rules

A 1D example: ŷ = sign(w0 + w^T x) jumps from −1 to +1 where w0 + w^T x crosses 0. [Figure from G. Shakhnarovich]

This specifies a linear classifier: it has a linear boundary (a hyperplane),

    w0 + w^T x = 0

which separates the space into two "half-spaces".

Example in 1D

The linear classifier has a linear boundary (hyperplane), w0 + w^T x = 0, which separates the space into two "half-spaces". In 1D this is simply a threshold.

Example in 2D

The linear classifier has a linear boundary (hyperplane), w0 + w^T x = 0, which separates the space into two "half-spaces". In 2D this is a line.

Example in 3D

The linear classifier has a linear boundary (hyperplane), w0 + w^T x = 0, which separates the space into two "half-spaces". In 3D this is a plane.

What about higher-dimensional spaces?

Geometry

w^T x = 0 is a line passing through the origin and orthogonal to w. w^T x + w0 = 0 shifts it by w0. [Figure from G. Shakhnarovich]
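This geometry can be checked numerically: any direction lying along the boundary w^T x + w0 = 0 is orthogonal to w, and the boundary sits at signed distance −w0 / ||w|| from the origin. The particular w and w0 below are made up for the check.

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector
w0 = 5.0                   # illustrative bias

def on_boundary(x1):
    """A point on w^T x + w0 = 0: pick x1 freely, solve for x2."""
    return np.array([x1, -(w0 + w[0] * x1) / w[1]])

a, b = on_boundary(0.0), on_boundary(2.0)

# A direction along the boundary is orthogonal to w
direction = b - a
print(np.dot(w, direction))          # 0 (up to floating point)

# Signed distance of the boundary from the origin
print(-w0 / np.linalg.norm(w))       # -1.0 here, since ||w|| = 5
```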

Learning Linear Classifiers

Learning consists in estimating a "good" decision boundary. We need to find w (direction) and w0 (location) of the boundary.

What does "good" mean? Is this boundary good? We need a criterion that tells us how to select the parameters. Do you know any?

Loss Functions

Classifying using a linear decision boundary reduces the data dimension to 1:

    y(x) = sign(w0 + w^T x)

What is the cost of being wrong? A loss function L(y, t) is the loss incurred for predicting y when the correct answer is t.

For medical diagnosis: for a diabetes screening test, is it better to have false positives or false negatives?

For movie ratings: the "truth" is that Alice thinks E.T. is worthy of a 4. How bad is it to predict a 5? How about a 2?

Loss Functions

A possible loss to minimize is the zero/one loss:

    L(y(x), t) = 0 if y(x) = t,  1 if y(x) ≠ t

Is this minimization easy to do? Why?

Other Loss Functions

Zero/one loss for a classifier:

    L_0-1(y(x), t) = 0 if y(x) = t,  1 if y(x) ≠ t

Asymmetric binary loss:

    L_ABL(y(x), t) = α if y(x) = 1 and t = 0
                     β if y(x) = 0 and t = 1
                     0 if y(x) = t

Squared (quadratic) loss:

    L_squared(y(x), t) = (t − y(x))²

Absolute error:

    L_absolute(y(x), t) = |t − y(x)|
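The four losses above can be written out directly; a minimal sketch, where the α and β values for the asymmetric loss are illustrative (e.g., penalizing false positives ten times more than false negatives):

```python
def zero_one_loss(y, t):
    """0 if the prediction matches the target, 1 otherwise."""
    return 0 if y == t else 1

def asymmetric_loss(y, t, alpha=10.0, beta=1.0):
    """Asymmetric binary loss on labels in {0, 1}:
    alpha for a false positive, beta for a false negative, 0 when correct."""
    if y == 1 and t == 0:
        return alpha
    if y == 0 and t == 1:
        return beta
    return 0.0

def squared_loss(y, t):
    return (t - y) ** 2

def absolute_loss(y, t):
    return abs(t - y)

print(zero_one_loss(1, -1))    # 1
print(asymmetric_loss(1, 0))   # 10.0 (false positive)
print(squared_loss(5, 4))      # 1
print(absolute_loss(2, 4))     # 2
```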

More Complex Loss Functions

What if the movie predictions are used for rankings? Now the predicted ratings don't matter, just the order that they imply. In what order does Alice prefer E.T., Amelie and Titanic? Possibilities:
- 0-1 loss on the winner
- Permutation distance
- Accuracy of the top K movies

Can we always separate the classes?

If we can separate the classes, the problem is linearly separable.

Can we always separate the classes?

Causes of non-perfect separation:
- Model is too simple
- Noise in the inputs (i.e., data attributes)
- Simple features that do not account for all variations
- Errors in data targets (mis-labelings)

Should we make the model complex enough to have perfect separation in the training data?

Metrics

How do we evaluate how good our classifier is? How is it doing on dog vs no-dog?

Metrics

How do we evaluate how good our classifier is?

Recall is the fraction of relevant (ground-truth) instances that are retrieved:

    R = TP / (TP + FN)

Precision is the fraction of retrieved (predicted) instances that are relevant:

    P = TP / (TP + FP)

The F1 score is the harmonic mean of precision and recall:

    F1 = 2 · (P · R) / (P + R)
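The three formulas can be computed directly from confusion-matrix counts; the counts below (8 true positives, 2 false positives, 8 false negatives) are made up for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts of
    true positives, false positives and false negatives."""
    p = tp / (tp + fp)          # fraction of predictions that are correct
    r = tp / (tp + fn)          # fraction of ground-truth instances found
    f1 = 2 * p * r / (p + r)    # harmonic mean of the two
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)   # 0.8, 0.5, and F1 = 8/13 (about 0.615)
```

Note the trade-off the example shows: high precision (0.8) can coexist with low recall (0.5), and the harmonic mean pulls F1 toward the weaker of the two.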

More on Metrics

How do we evaluate how good our classifier is? Precision is the fraction of retrieved instances that are relevant; recall is the fraction of relevant instances that are retrieved.

Precision-recall curve. Average precision (AP): the area under the curve.
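A minimal sketch of how a precision-recall curve and AP can be computed from a ranked list of predictions. The labels below are made up, and AP is approximated here by one common discrete form: the mean of the precision values at the ranks of the relevant items.

```python
import numpy as np

# Ground-truth relevance of predictions, already sorted by decreasing score
labels = np.array([1, 1, 0, 1, 0])   # 1 = relevant, 0 = not

tp = np.cumsum(labels)                           # true positives at each cutoff
precision = tp / np.arange(1, len(labels) + 1)   # P at each cutoff
recall = tp / labels.sum()                       # R at each cutoff

# Discrete AP: average the precision at each relevant item's rank
ap = precision[labels == 1].mean()
print(ap)   # mean of [1, 1, 0.75] = 11/12
```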

Metrics vs Loss

Metrics on a dataset are what we care about (performance). We typically cannot directly optimize for the metrics. Our loss function should reflect the problem we are solving; we then hope it will yield models that do well on our dataset.

