
Experimental Setup, Multi-class vs. Multi-label Classification, and Evaluation
CMSC 678, UMBC

Central Question: How Well Are We Doing?
The task: what kind of problem are you solving?
- Classification: Precision, Recall, F1; Accuracy; Log-loss; ROC-AUC
- Regression: (Root) Mean Square Error; Mean Absolute Error
- Clustering: Mutual Information; V-score
This does not have to be the same thing as the loss function you optimize.

Outline
- Experimental Design: Rule 1
- Multi-class vs. Multi-label classification
- Evaluation
  - Regression Metrics
  - Classification Metrics

Experimenting with Machine Learning Models
Split all your data into Training Data, Dev Data, and Test Data.
What is "correct?" What is working "well?"
- Set hyperparameters, then learn model parameters from the training set.
- Evaluate the learned model on dev with that hyperparameter setting.
- Perform the final evaluation on test, using the hyperparameters that optimized dev performance and retraining the model.
Rule 1: DO NOT ITERATE ON THE TEST DATA
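Rule 1 can be baked into code. Below is a minimal sketch (the helper name `split_data` is invented for illustration, not from the course): carve out train/dev/test once, tune only on dev, and touch test exactly once at the end.

```python
import random

def split_data(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle once, then carve the data into train/dev/test splits.
    Hyperparameters are tuned on dev; test is used exactly once at the end."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(train_frac * n)
    n_dev = int(dev_frac * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

train, dev, test = split_data(list(range(100)))
# 80 / 10 / 10 split; every example lands in exactly one split
```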

On-board Exercise
Produce dev and test tables for a linear regression model with learned weights and a set/fixed (non-learned) bias.


Multi-class Classification
Given input x, predict a discrete label y.
Single output:
- If y ∈ {0, 1} (or y ∈ {True, False}), then it is a binary classification task.
- If y ∈ {0, 1, …, K − 1} (for finite K), then it is a multi-class classification task.
  Q: What are some examples of multi-class classification?
  A: Many possibilities. See A2, Q{1,2,4-7}
Multi output:
- If multiple y_i are predicted, then it is a multi-label classification task. Each y_i could be binary or multi-class.

Multi-label Classification
Given input x, predict multiple discrete labels y = (y_1, …, y_L).

Multi-Label Classification
- Will not be a primary focus of this course
- Many of the single-output classification methods apply to multi-label classification
- Predicting "in the wild" can be trickier
- Evaluation can be trickier
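One common (though not the only) way to apply single-output methods here is to make an independent binary decision per label. A toy sketch, where keyword-matching lambdas stand in for trained per-label classifiers:

```python
def predict_multilabel(x, binary_classifiers):
    """One independent binary decision per label: unlike multi-class
    classification, an instance may receive zero, one, or many labels."""
    return {label: clf(x) for label, clf in binary_classifiers.items()}

# Toy label-specific classifiers (keyword matching stands in for learned models)
classifiers = {
    "sports":   lambda text: "ball" in text,
    "politics": lambda text: "vote" in text,
}
labels = predict_multilabel("the vote on the new ball park", classifiers)
# both labels fire: {"sports": True, "politics": True}
```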

We've only developed binary classifiers so far…
- Option 1: Develop a multi-class version
- Option 2: Build a one-vs-all (OvA) classifier
- Option 3: Build an all-vs-all (AvA) classifier
(there can be others)
For Option 1: the loss function may (or may not) need to be extended, and the model structure may need to change (big or small).
Common change: instead of a single weight vector w, keep a weight vector w^(c) for each class c. Compute class-specific scores, e.g., y^(c) = w^(c) · x + b^(c).
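The "common change" can be sketched directly: keep one weight vector and bias per class, score every class, and predict the argmax (a minimal illustration, not the course's reference implementation):

```python
def predict_class(x, W, b):
    """Score each class c as w^(c) . x + b^(c); predict the highest-scoring class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w_c, x)) + b_c
              for w_c, b_c in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)

# Two classes in two dimensions: class 0 looks at feature 0, class 1 at feature 1
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
pred = predict_class([0.2, 0.9], W, b)   # class 1 scores higher (0.9 > 0.2)
```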

Multi-class Option 1: Linear Regression/Perceptron
y = w · x + b
Output: if y ≥ 0, predict class 1; else predict class 2.

Multi-class Option 1: Linear Regression/Perceptron: A Per-Class View
y_1 = w_1 · x + b_1
y_2 = w_2 · x + b_2
Output: i = argmax{y_1, y_2}; predict class i.
The binary version (y = w · x + b; if y ≥ 0, class 1, else class 2) is a special case.

Multi-class Option 1: Linear Regression/Perceptron: A Per-Class View (alternative)
Concatenate the per-class weights into w = [w_1; w_2]:
y_1 = [w_1; w_2]^T [x; 0] + b_1
y_2 = [w_1; w_2]^T [0; x] + b_2
Output: i = argmax{y_1, y_2}; predict class i.
Q: (For discussion) Why does this work?

We've only developed binary classifiers so far…
Option 2: Build a one-vs-all (OvA) classifier.
With C classes:
- Train C different binary classifiers γ_c(x).
- γ_c(x) predicts 1 if x is likely class c, 0 otherwise.
To test/predict a new instance z:
- Get scores s_c = γ_c(z).
- Output the max of these scores: ŷ = argmax_c s_c.
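The OvA prediction recipe in code form (the per-class scorers below are hypothetical stand-ins for C trained binary models):

```python
def ova_predict(z, scorers):
    """scorers[c] is the binary classifier gamma_c for class c; return
    the class whose one-vs-all score on z is highest."""
    scores = {c: gamma(z) for c, gamma in scorers.items()}
    return max(scores, key=scores.get)

# Toy per-class scorers over a 1-D input: each class has a prototype value,
# and the score is the (negated) distance to that prototype
scorers = {
    "low":  lambda z: -abs(z - 0.0),
    "mid":  lambda z: -abs(z - 5.0),
    "high": lambda z: -abs(z - 10.0),
}
pred = ova_predict(6.0, scorers)   # closest prototype is 5.0, so "mid"
```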

We've only developed binary classifiers so far…
Option 3: Build an all-vs-all (AvA) classifier.
With C classes:
- Train C(C − 1)/2 different binary classifiers γ_{c1,c2}(x), one per pair of classes.
- γ_{c1,c2}(x) predicts 1 if x is likely class c_1, 0 otherwise (likely class c_2).
To test/predict a new instance z:
- Get scores or predictions s_{c1,c2} = γ_{c1,c2}(z).
- Multiple options for the final prediction: (1) count the number of times a class c was predicted; (2) a margin-based approach.
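Prediction option (1), counting pairwise votes, can be sketched as follows (the pairwise classifiers here are hypothetical stand-ins for the C(C − 1)/2 trained models):

```python
from itertools import combinations
from collections import Counter

def ava_predict(z, pairwise, classes):
    """pairwise[(c1, c2)](z) returns 1 if z looks like c1, else 0 (class c2).
    Final prediction: the class that wins the most pairwise votes."""
    votes = Counter()
    for c1, c2 in combinations(classes, 2):
        winner = c1 if pairwise[(c1, c2)](z) == 1 else c2
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# Toy pairwise classifiers (constant votes): "b" beats "a", "a" beats "c",
# and "b" beats "c"
pairwise = {
    ("a", "b"): lambda z: 0,   # b wins
    ("a", "c"): lambda z: 1,   # a wins
    ("b", "c"): lambda z: 1,   # b wins
}
pred = ava_predict(None, pairwise, ["a", "b", "c"])   # "b" with 2 votes
```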

We've only developed binary classifiers so far…
Q: (to discuss)
- Why might you want to use option 1 or options OvA/AvA?
- What are the benefits of OvA vs. AvA?
- What if you start with a balanced dataset, e.g., 100 instances per class?


Regression Metrics
(Root) Mean Square Error:
RMSE = sqrt( (1/N) Σ (y − ŷ)² )
Mean Absolute Error:
MAE = (1/N) Σ |y − ŷ|
Q: How can these reward/punish predictions differently?
A: RMSE punishes outlier predictions more harshly.
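The two metrics, and the answer above, as code (a direct transcription of the formulas):

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the average squared residual."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error: average absolute residual."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n

truth = [0.0, 0.0, 0.0, 0.0]
spread_errors  = [1.0, 1.0, 1.0, 1.0]   # every prediction off by 1
single_outlier = [0.0, 0.0, 0.0, 4.0]   # one prediction off by 4
# Both prediction sets have MAE = 1.0, but the outlier doubles the RMSE:
# rmse(truth, spread_errors) == 1.0, rmse(truth, single_outlier) == 2.0
```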

Training Loss vs. Evaluation Score
In training, compute the loss to update parameters.
Sometimes the loss is a computational compromise: a surrogate loss.
The loss you use might not be as informative as you'd like.
Example (binary classification): 90 of 100 training examples are +1, 10 of 100 are −1.

Some Classification Metrics
- Accuracy
- Precision
- Recall
- AUC (Area Under Curve)
- F1
- Confusion Matrix

Classification Evaluation: the 2-by-2 contingency table
Classes/Choices:
                           Actually Correct       Actually Incorrect
Selected/Guessed           True Positive (TP)     False Positive (FP)
Not selected/not guessed   False Negative (FN)    True Negative (TN)

Classification Evaluation: Accuracy, Precision, and Recall
Accuracy: % of items correct = (TP + TN) / (TP + FP + FN + TN)
Precision: % of selected items that are correct = TP / (TP + FP)
Recall: % of correct items that are selected = TP / (TP + FN)
Each has Min: 0, Max: 1.

                           Actually Correct       Actually Incorrect
Selected/Guessed           True Positive (TP)     False Positive (FP)
Not selected/not guessed   False Negative (FN)    True Negative (TN)
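The three definitions translate directly from the table cells. The example numbers below (invented for illustration) also show why accuracy can look great on imbalanced data, echoing the earlier training-loss caveat:

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# A rare-positive setting: only 20 actually-positive items out of 1000
tp, fp, fn, tn = 10, 10, 10, 970
# accuracy = 0.98 even though precision and recall are both only 0.5
```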

Precision and Recall Present a Tradeoff
Picture each model as a point on a plot with recall on the x-axis (0 to 1) and precision on the y-axis (0 to 1).
Q: Where do you want your ideal model?
Q: You have a model that always identifies correct instances. Where on this graph is it?
Q: You have a model that only makes correct predictions. Where on this graph is it?
Remember those hyperparameters: each point is a differently trained/tuned model.
Idea: measure the tradeoff between precision and recall. Improving the overall model pushes the curve toward the top-right corner.

Measure this Tradeoff: Area Under the Curve (AUC)
AUC measures the area under this tradeoff curve. Min AUC: 0; Max AUC: 1.
1. Computing the curve: you need true labels & predicted labels with some score/confidence estimate. Threshold the scores and, for each threshold, compute precision and recall.
2. Finding the area: how to implement: the trapezoidal rule (& others). In practice: use an external library like the sklearn.metrics module.
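A minimal from-scratch version of both steps (sklearn.metrics provides production versions such as precision_recall_curve and auc; this sketch just illustrates the thresholding idea and ignores edge cases like tied scores):

```python
def pr_points(labels, scores):
    """Step 1: sweep each distinct score as a threshold; everything scored at
    or above the threshold counts as 'selected'. Returns (recall, precision)."""
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        points.append((tp / (tp + fn), tp / (tp + fp)))
    return points

def auc_trapezoid(points):
    """Step 2: trapezoidal rule over the curve, sorted by the x-coordinate."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

labels = [1, 1, 0]
scores = [0.9, 0.8, 0.1]
area = auc_trapezoid(pr_points(labels, scores))
```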

Measure A Slightly Different Tradeoff: ROC-AUC
Main variant: ROC-AUC. Same idea as before, but with some flipped metrics: the curve plots the true positive rate (y-axis) against the false positive rate (x-axis). Min ROC-AUC: 0.5; Max ROC-AUC: 1.
1. Computing the curve: you need true labels & predicted labels with some score/confidence estimate. Threshold the scores and, for each threshold, compute these metrics.
2. Finding the area: how to implement: the trapezoidal rule (& others). In practice: use an external library like the sklearn.metrics module.

A combined measure: F
Weighted (harmonic) average of Precision & Recall:
F = 1 / ( α(1/P) + (1 − α)(1/R) )
After some algebra (not important), with β² = (1 − α)/α:
F = (1 + β²) P R / (β² P + R)
Balanced F1 measure (β = 1):
F1 = 2 P R / (P + R)
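The formula as code; note that β > 1 weights recall more heavily, while β < 1 weights precision more heavily:

```python
def f_beta(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

f1 = f_beta(0.5, 1.0)                         # balanced: 2PR/(P+R) = 2/3
f_recall_heavy = f_beta(0.5, 1.0, beta=2.0)   # pulled toward recall: 5/6
```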

P/R/F in a Multi-class Setting: Micro- vs. Macro-Averaging (Sec. 15.2.4)
If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average:
  macroprecision = (1/C) Σ_c precision_c = (1/C) Σ_c TP_c / (TP_c + FP_c)
- Microaveraging: collect decisions for all classes, compute one contingency table, evaluate:
  microprecision = Σ_c TP_c / (Σ_c TP_c + Σ_c FP_c)
Q: When to prefer the macroaverage? When to prefer the microaverage?

Micro- vs. Macro-Averaging: Example (Sec. 15.2.4)

Class 1:
                  Truth: yes   Truth: no
Classifier: yes       10          10
Classifier: no        10         970

Class 2:
                  Truth: yes   Truth: no
Classifier: yes       90          10
Classifier: no        10         890

Micro Ave. Table:
                  Truth: yes   Truth: no
Classifier: yes      100          20
Classifier: no        20        1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
The microaveraged score is dominated by the score on frequent classes.
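The example above can be checked in a few lines; each class is reduced to its (TP, FP) pair, since only those two cells enter precision:

```python
def macro_precision(per_class):
    """Average of per-class precisions: every class counts equally."""
    return sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)

def micro_precision(per_class):
    """Pool all decisions into one table first: frequent classes dominate."""
    total_tp = sum(tp for tp, _ in per_class)
    total_fp = sum(fp for _, fp in per_class)
    return total_tp / (total_tp + total_fp)

per_class = [(10, 10), (90, 10)]     # (TP, FP) for class 1 and class 2
macro = macro_precision(per_class)   # (0.5 + 0.9) / 2 = 0.7
micro = micro_precision(per_class)   # 100 / 120 ≈ 0.83
```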

Confusion Matrix: Generalizing the 2-by-2 contingency table
Rows are the guessed value, columns the correct value; each of the 3×3 cells holds a count (#).

Confusion Matrix: Generalizing the 2-by-2 contingency table
Guessed (rows) vs. correct (columns):
  80    9   11
   7   86    7
   2    8    9
Q: Is this a good result?

Confusion Matrix: Generalizing the 2-by-2 contingency table
Guessed (rows) vs. correct (columns):
  30   40   30
  25   30   50
  30   35   35
Q: Is this a good result?

Confusion Matrix: Generalizing the 2-by-2 contingency table
Guessed (rows) vs. correct (columns):
  73    9    0
   4   88    8
   3    7   90
Q: Is this a good result?


