ECE595 / STAT598: Machine Learning I Lecture 15 Logistic Regression 2


ECE595 / STAT598: Machine Learning I
Lecture 15: Logistic Regression 2
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering, Purdue University
© Stanley Chan 2020. All Rights Reserved.

Overview
In linear discriminant analysis (LDA), there are generally two types of approaches:
Generative approach: estimate the model, then define the classifier.
Discriminative approach: directly define the classifier.

Outline
Discriminative Approaches
Lecture 14: Logistic Regression 1
Lecture 15: Logistic Regression 2
This lecture: Logistic Regression 2
Gradient Descent: convexity, gradient, regularization.
Connection with Bayes: derivation, interpretation.
Comparison with Linear Regression: is logistic regression better than linear? Case studies.

From Linear to Logistic Regression
Can we replace g(x) by sign(g(x))? How about a soft version of sign(g(x))? This gives logistic regression.

Logistic Regression and Deep Learning
Logistic regression can be considered as the last layer of a deep network:
The inputs are x_n, and the weights are w.
The sigmoid function is the nonlinear activation.
To train the model, you compute the prediction error and minimize the loss by updating the weights.

Training Loss Function
J(θ) = \sum_{n=1}^{N} L(h_θ(x_n), y_n) = -\sum_{n=1}^{N} \left\{ y_n \log h_θ(x_n) + (1 - y_n) \log(1 - h_θ(x_n)) \right\}.
This is called the cross-entropy loss.
Consider the two cases:
-y_n \log h_θ(x_n) = \begin{cases} 0, & \text{if } y_n = 1 \text{ and } h_θ(x_n) = 1, \\ +\infty, & \text{if } y_n = 1 \text{ and } h_θ(x_n) = 0, \end{cases}
-(1 - y_n) \log(1 - h_θ(x_n)) = \begin{cases} 0, & \text{if } y_n = 0 \text{ and } h_θ(x_n) = 0, \\ +\infty, & \text{if } y_n = 0 \text{ and } h_θ(x_n) = 1. \end{cases}
No finite solution if the prediction and the label mismatch.
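The loss is easy to compute directly. Below is a minimal NumPy sketch (not part of the slides); the function names and the clipping constant eps are my own choices, with the clip guarding against the log 0 blow-up described above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy_loss(theta, X, y, eps=1e-12):
        # h_theta(x_n) = sigmoid(theta^T x_n), computed for all n at once
        h = sigmoid(X @ theta)
        # clip away from 0 and 1 so that log() stays finite
        h = np.clip(h, eps, 1.0 - eps)
        return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))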

Convexity of Logistic Training Loss
Recall that
J(θ) = -\sum_{n=1}^{N} \left\{ y_n \log \frac{h_θ(x_n)}{1 - h_θ(x_n)} + \log(1 - h_θ(x_n)) \right\}.
The first term is linear in θ (since \log \frac{h_θ(x)}{1 - h_θ(x)} = θ^T x), so it is convex.
The second term: Gradient:
\nabla_θ [-\log(1 - h_θ(x))] = -\nabla_θ \log \frac{e^{-θ^T x}}{1 + e^{-θ^T x}}
  = -\nabla_θ \left[ \log e^{-θ^T x} - \log(1 + e^{-θ^T x}) \right]
  = \nabla_θ \left[ θ^T x + \log(1 + e^{-θ^T x}) \right]
  = x - \frac{e^{-θ^T x}}{1 + e^{-θ^T x}} x = h_θ(x) x.
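The closed-form gradient h_θ(x) x can be sanity-checked against a finite-difference approximation. This is a hedged sketch with random data and a tolerance of my own, not part of the lecture:

    import numpy as np

    rng = np.random.default_rng(1)
    d = 4
    x, theta = rng.normal(size=d), rng.normal(size=d)

    def f(t):
        # -log(1 - h_theta(x)) = log(1 + exp(t^T x))
        return np.log1p(np.exp(t @ x))

    h = 1.0 / (1.0 + np.exp(-theta @ x))
    analytic = h * x                      # the formula h_theta(x) * x derived above

    eps = 1e-6
    numeric = np.array([(f(theta + eps * np.eye(d)[i]) - f(theta - eps * np.eye(d)[i])) / (2 * eps)
                        for i in range(d)])
    print(np.allclose(analytic, numeric, atol=1e-5))   # expected: True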

Convexity of Logistic Training Loss
The gradient of the second term is \nabla_θ [-\log(1 - h_θ(x))] = h_θ(x) x. The Hessian is:
\nabla_θ^2 [-\log(1 - h_θ(x))] = \nabla_θ [h_θ(x) x] = x \nabla_θ \left[ \frac{1}{1 + e^{-θ^T x}} \right]^T
  = \frac{e^{-θ^T x}}{(1 + e^{-θ^T x})^2} x x^T
  = \frac{1}{1 + e^{-θ^T x}} \left( 1 - \frac{1}{1 + e^{-θ^T x}} \right) x x^T
  = h_θ(x) [1 - h_θ(x)] x x^T.

Convexity of Logistic Training Loss
For any v ∈ R^d, we have
v^T \nabla_θ^2 [-\log(1 - h_θ(x))] v = v^T \left( h_θ(x)[1 - h_θ(x)] x x^T \right) v = h_θ(x)[1 - h_θ(x)] (v^T x)^2 ≥ 0.
Therefore the Hessian is positive semi-definite, and so -\log(1 - h_θ(x)) is convex in θ.
Conclusion: The training loss function
J(θ) = -\sum_{n=1}^{N} \left\{ y_n \log \frac{h_θ(x_n)}{1 - h_θ(x_n)} + \log(1 - h_θ(x_n)) \right\}
is convex in θ, so we can use convex optimization algorithms to find θ.
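The positive semi-definiteness is also easy to confirm numerically: build the Hessian h_θ(x)[1 - h_θ(x)] x x^T at a random point and inspect its eigenvalues. A minimal sketch, with arbitrary random data of my own:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    x = rng.normal(size=d)
    theta = rng.normal(size=d)

    h = 1.0 / (1.0 + np.exp(-theta @ x))
    H = h * (1.0 - h) * np.outer(x, x)      # Hessian of -log(1 - h_theta(x))

    eigvals = np.linalg.eigvalsh(H)
    print(eigvals.min() >= -1e-12)          # expected: True (all eigenvalues are nonnegative)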

Convex Optimization for Logistic Regression
We can use CVX to solve the logistic regression problem, but it requires some re-organization of the equations:
J(θ) = -\sum_{n=1}^{N} \left\{ y_n θ^T x_n + \log(1 - h_θ(x_n)) \right\}
  = -\sum_{n=1}^{N} \left\{ y_n θ^T x_n + \log \frac{1}{1 + e^{θ^T x_n}} \right\}
  = -\sum_{n=1}^{N} \left\{ y_n θ^T x_n - \log(1 + e^{θ^T x_n}) \right\}
  = -\left( \sum_{n=1}^{N} y_n x_n \right)^T θ + \sum_{n=1}^{N} \log(1 + e^{θ^T x_n}).
The last term is a sum of log-sum-exp terms: \log(e^0 + e^{θ^T x}).
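The slides solve this with CVX in MATLAB. A rough Python equivalent, assuming the cvxpy package and toy Gaussian data of my own, is sketched below; cvxpy's logistic atom computes \log(1 + e^z), which is exactly the log-sum-exp term above.

    import numpy as np
    import cvxpy as cp

    # toy 2-D data: class 0 around the origin, class 1 shifted away (illustrative only)
    rng = np.random.default_rng(0)
    N, d = 100, 2
    X = np.vstack([rng.normal(0, 1, (N // 2, d)), rng.normal(3, 1, (N // 2, d))])
    y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

    theta = cp.Variable(d)
    # J(theta) = -(sum_n y_n x_n)^T theta + sum_n log(1 + exp(theta^T x_n))
    J = -(y @ X) @ theta + cp.sum(cp.logistic(X @ theta))
    cp.Problem(cp.Minimize(J)).solve()
    print(theta.value)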

Convex Optimization for Logistic Regression
[Figure: fitted sigmoid versus the true model.]
Black: the true model. You create it.
Blue circles: samples drawn from the true distribution.
Red: trained model from the samples.

Gradient Descent for Logistic Regression
The training loss function is
J(θ) = -\sum_{n=1}^{N} \left\{ y_n θ^T x_n + \log(1 - h_θ(x_n)) \right\}.
Recall that \nabla_θ [-\log(1 - h_θ(x))] = h_θ(x) x.
You can run gradient descent:
θ^{(k+1)} = θ^{(k)} - α_k \nabla_θ J(θ^{(k)}) = θ^{(k)} - α_k \sum_{n=1}^{N} \left( h_{θ^{(k)}}(x_n) - y_n \right) x_n.
Since the loss function is convex, gradient descent is guaranteed to find the global minimum.
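A minimal gradient-descent sketch, assuming a constant step size α in place of the schedule α_k; the function name, step size, and iteration count are my own choices:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gd(X, y, alpha=0.01, iters=5000):
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = X.T @ (sigmoid(X @ theta) - y)   # sum_n (h_theta(x_n) - y_n) x_n
            theta = theta - alpha * grad
        return theta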

Regularization in Logistic Regression
The loss function is
J(θ) = -\sum_{n=1}^{N} \left\{ y_n θ^T x_n + \log(1 - h_θ(x_n)) \right\} = -\sum_{n=1}^{N} \left\{ y_n θ^T x_n + \log \frac{1}{1 + e^{θ^T x_n}} \right\}.
What if h_θ(x_n) = 1? (We need θ^T x_n → ∞.) Then we have \log(1 - 1) = \log 0, which is -∞.
The same thing happens in the equivalent form
J(θ) = -\left( \sum_{n=1}^{N} y_n x_n \right)^T θ + \sum_{n=1}^{N} \log(1 + e^{θ^T x_n}):
when θ^T x_n → ∞, we have \log(1 + e^{θ^T x_n}) → ∞.
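One way to see the problem numerically: when the data are perfectly separable, the objective keeps decreasing as ‖θ‖ grows, so no finite minimizer exists and a solver can return NaN, as in the next example. A toy sketch with 1-D data and scaling values of my own:

    import numpy as np

    # perfectly separable 1-D data; the second column is a constant bias feature
    X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])

    def J(theta):
        z = X @ theta
        # -(sum_n y_n theta^T x_n) + sum_n log(1 + exp(theta^T x_n))
        return -(y * z).sum() + np.log1p(np.exp(z)).sum()

    for c in [1, 10, 100]:
        print(c, J(c * np.array([1.0, 0.0])))   # the loss keeps shrinking as the scale c grows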

Regularization in Logistic Regression
Example: two classes, N(0, 1) and N(10, 1). Run CVX.
[Figure: the fitted result; the estimate is NaN for y_n = 1.]

Regularization in Logistic Regression
Add a small regularization term:
J(θ) = -\left( \sum_{n=1}^{N} y_n x_n \right)^T θ + \sum_{n=1}^{N} \log(1 + e^{θ^T x_n}) + λ\|θ\|^2.
Re-run the same CVX program.
[Figure: the fitted result with regularization.]
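In code, the penalty is a one-line change to the objective. Reusing X, y, and theta from the cvxpy sketch earlier (the value of λ below is purely illustrative):

    lam = 1e-3   # regularization weight, a tuning parameter you choose
    J_reg = -(y @ X) @ theta + cp.sum(cp.logistic(X @ theta)) + lam * cp.sum_squares(theta)
    cp.Problem(cp.Minimize(J_reg)).solve()
    print(theta.value)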

Regularization in Logistic Regression
If you make λ really, really small...
J(θ) = -\left( \sum_{n=1}^{N} y_n x_n \right)^T θ + \sum_{n=1}^{N} \log(1 + e^{θ^T x_n}) + λ\|θ\|^2.
Re-run the same CVX program.
[Figure: the fitted result with a very small λ.]

Try This Online Exercise
Classify two digits in the MNIST dataset:
http://ufldl.stanford.edu/tutorial/supervised/LogisticRegression/
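The linked tutorial uses MATLAB. If you prefer Python, a rough sketch with scikit-learn is below; the fetch_openml call, the choice of digits 0 versus 1, and the train/test split are my own assumptions, not part of the tutorial.

    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # load MNIST and keep only two digit classes, e.g. 0 versus 1
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    mask = (y == "0") | (y == "1")
    X, y = X[mask] / 255.0, (y[mask] == "1").astype(int)

    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print("test accuracy:", clf.score(Xte, yte))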

Outline
Discriminative Approaches
Lecture 14: Logistic Regression 1
Lecture 15: Logistic Regression 2
This lecture: Logistic Regression 2
Gradient Descent: convexity, gradient, regularization.
Connection with Bayes: derivation, interpretation.
Comparison with Linear Regression: is logistic regression better than linear? Case studies.

Connection with Bayes
The likelihood is
p(x | i) = \frac{1}{\sqrt{(2π)^d |Σ|}} \exp\left\{ -\frac{1}{2} (x - µ_i)^T Σ^{-1} (x - µ_i) \right\}.
The prior is p_Y(i) = π_i.
The posterior is
p(1 | x) = \frac{p(x | 1) p_Y(1)}{p(x | 1) p_Y(1) + p(x | 0) p_Y(0)}
  = \frac{1}{1 + \frac{p(x | 0) p_Y(0)}{p(x | 1) p_Y(1)}}
  = \frac{1}{1 + \exp\left\{ -\log \frac{p(x | 1) p_Y(1)}{p(x | 0) p_Y(0)} \right\}}
  = \frac{1}{1 + \exp\left\{ -\left( \log \frac{π_1}{π_0} + \log \frac{p(x | 1)}{p(x | 0)} \right) \right\}}.

Connection with Bayes
We can show that the last term is
\log \frac{p(x | 1)}{p(x | 0)} = \log \frac{ \frac{1}{\sqrt{(2π)^d |Σ|}} \exp\left\{ -\frac{1}{2}(x - µ_1)^T Σ^{-1}(x - µ_1) \right\} }{ \frac{1}{\sqrt{(2π)^d |Σ|}} \exp\left\{ -\frac{1}{2}(x - µ_0)^T Σ^{-1}(x - µ_0) \right\} }
  = -\frac{1}{2} \left[ (x - µ_1)^T Σ^{-1}(x - µ_1) - (x - µ_0)^T Σ^{-1}(x - µ_0) \right]
  = (µ_1 - µ_0)^T Σ^{-1} x - \frac{1}{2} \left( µ_1^T Σ^{-1} µ_1 - µ_0^T Σ^{-1} µ_0 \right).
Let us define
w = Σ^{-1}(µ_1 - µ_0),
w_0 = -\frac{1}{2} \left( µ_1^T Σ^{-1} µ_1 - µ_0^T Σ^{-1} µ_0 \right) + \log \frac{π_1}{π_0}.

Connection with Bayes
Then,
\log \frac{p(x | 1)}{p(x | 0)} = (µ_1 - µ_0)^T Σ^{-1} x - \frac{1}{2} \left( µ_1^T Σ^{-1} µ_1 - µ_0^T Σ^{-1} µ_0 \right) = w^T x + w_0 - \log(π_1 / π_0).
Therefore,
p(1 | x) = \frac{1}{1 + \exp\left\{ -\left( \log \frac{π_1}{π_0} + \log \frac{p(x | 1)}{p(x | 0)} \right) \right\}} = \frac{1}{1 + \exp\{-(w^T x + w_0)\}} = h_θ(x).

Connection with Bayes
The hypothesis function is the posterior distribution:
p_{Y|X}(1 | x) = \frac{1}{1 + \exp\{-(w^T x + w_0)\}} = h_θ(x),
p_{Y|X}(0 | x) = \frac{\exp\{-(w^T x + w_0)\}}{1 + \exp\{-(w^T x + w_0)\}} = 1 - h_θ(x).   (1)
So logistic regression offers probabilistic reasoning, which linear regression does not.
This is not true when the covariances are different.
Remark: if the covariances are different, the Bayes decision rule returns a quadratic classifier.
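The identity can be checked numerically: construct two Gaussians with a shared covariance, form w and w_0 as defined above, and compare the exact posterior with the sigmoid. A minimal sketch with means, covariance, and priors of my own choosing:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    pi0 = pi1 = 0.5

    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    w0 = -0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) + np.log(pi1 / pi0)

    x = np.array([1.0, -0.5])
    p1 = multivariate_normal.pdf(x, mu1, Sigma) * pi1
    p0 = multivariate_normal.pdf(x, mu0, Sigma) * pi0
    posterior = p1 / (p0 + p1)                          # exact Bayes posterior p(1 | x)
    sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + w0)))       # h_theta(x)
    print(np.isclose(posterior, sigmoid))               # expected: True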

Outline
Discriminative Approaches
Lecture 14: Logistic Regression 1
Lecture 15: Logistic Regression 2
This lecture: Logistic Regression 2
Gradient Descent: convexity, gradient, regularization.
Connection with Bayes: derivation, interpretation.
Comparison with Linear Regression: is logistic regression better than linear? Case studies.

Is Logistic Regression Better than Linear?
This is taken from the Internet. Is that true?

Is Logistic Regression Better than Linear?
Scenario 1: Identical covariance. Equal prior. Enough samples.
N(0, 1) with 100 samples and N(10, 1) with 100 samples.
Linear and logistic: not much different.
[Figure: Bayes oracle, Bayes empirical, linear regression fit and decision, logistic regression fit and decision, true samples, training samples.]
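Scenario 1 is easy to reproduce. Below is a hedged sketch that draws the two classes, fits linear regression (thresholded at 1/2) and logistic regression, and compares the resulting decision boundaries; the sample sizes follow the slide, but the seed and the scikit-learn usage are my own.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)
    x0 = rng.normal(0, 1, 100)     # class 0 ~ N(0, 1)
    x1 = rng.normal(10, 1, 100)    # class 1 ~ N(10, 1)
    X = np.concatenate([x0, x1]).reshape(-1, 1)
    y = np.concatenate([np.zeros(100), np.ones(100)])

    lin = LinearRegression().fit(X, y)
    log = LogisticRegression().fit(X, y)

    # decision boundaries: where the linear fit crosses 1/2,
    # and where the logistic posterior crosses 1/2 (i.e., w^T x + w_0 = 0)
    lin_boundary = (0.5 - lin.intercept_) / lin.coef_[0]
    log_boundary = -log.intercept_[0] / log.coef_[0][0]
    print(lin_boundary, log_boundary)   # both land near the midpoint, 5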

The False Sense of Good Fitting
Scenario 2: Identical covariance. Equal prior. Not a lot of samples.
N(0, 2) with 10 samples and N(10, 2) with 10 samples.
Linear and logistic: not much different.
[Figure: same legend as Scenario 1.]

Is Logistic Regression Better than Linear?
Scenario 3: Different covariance. Equal prior.
N(0, 2) with 50 samples and N(10, 0.2) with 50 samples.
Linear and logistic: equally bad.
[Figure: same legend as Scenario 1.]

Is Logistic Regression Better than Linear?
Scenario 4: Identical covariance. Unequal prior.
Training size proportional to prior: 180 samples and 20 samples.
N(0, 1) with π_0 = 0.9 and N(10, 1) with π_1 = 0.1.
Linear and logistic: not much different.
[Figure: same legend as Scenario 1.]

So What Can We Say about Logistic Regression?
Logistic regression empowers a discriminative method with probabilistic reasoning.
The hypothesis function is the posterior probability:
h_θ(x) = \frac{1}{1 + \exp\{-(w^T x + w_0)\}} = p(1 | x),
1 - h_θ(x) = \frac{\exp\{-(w^T x + w_0)\}}{1 + \exp\{-(w^T x + w_0)\}} = p(0 | x).
Logistic regression is yet another special case of the Bayesian classifier.
It has more or less the same performance as linear regression.
Logistic regression can give a lower training error, which looks better on plots, but its generalization is similar to linear regression.

Reading List
Logistic Regression (Machine Learning Perspective):
Chris Bishop, Pattern Recognition and Machine Learning, Chapter 4.3.
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, Chapter 4.4.
Stanford CS 229, discriminant algorithms: http://cs229.stanford.edu/notes/cs229-notes1.pdf
CMU lecture: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
Stanford Speech and Language Processing (Lecture 5): https://web.stanford.edu/~jurafsky/slp3/
Logistic Regression (Statistics Perspective):
Duke lecture: https://www2.stat.duke.edu/courses/Spring13/sta102.001/Lec/Lec20.pdf
Princeton lecture: https://data.princeton.edu/wws509/notes/c3.pdf
