Learning From Data, Lecture 9: Logistic Regression and Gradient Descent


Slide 1/23: Learning From Data, Lecture 9
Logistic Regression and Gradient Descent
- Logistic Regression
- Gradient Descent
M. Magdon-Ismail, CSCI 4100/6100

Slide 2/23: Recap - Linear Classification and Regression
The linear signal: s = w^t x.
Algorithms:
- Linear Classification: the Pocket algorithm can tolerate errors; simple and efficient.
- Linear Regression: single-step learning, w = X†y = (X^t X)^{-1} X^t y; a very efficient O(N d^2) exact algorithm.
Good features are important: before looking at the data, we can reason that symmetry and intensity should be good features, based on our knowledge of the problem.
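A minimal sketch of the single-step linear regression fit recapped above (the function name and the use of NumPy are illustrative choices, not from the slides); X is the N x (d+1) input matrix with a leading column of ones, y the vector of targets:

import numpy as np

def linear_regression_fit(X, y):
    # w = X†y = (X^t X)^{-1} X^t y, assuming X^t X is invertible
    return np.linalg.solve(X.T @ X, X.T @ y)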

Slide 3/23: Predicting a Probability
Will someone have a heart attack over the next year?
    age          62 years
    gender       male
    blood sugar  120 mg/dL
    HDL          50
    LDL          120
    Mass         190 lbs
    Height       5′ 10″
    ...
Classification: yes/no.
Logistic regression: the likelihood of a heart attack,
    h(x) = θ( Σ_{i=0}^{d} w_i x_i ) = θ(w^t x),  with y ∈ [0, 1].

Slide 4/23: What is θ?
Same setup as the previous slide, with the logistic function
    θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s}),
so that
    θ(-s) = e^{-s} / (1 + e^{-s}) = 1 / (1 + e^s) = 1 - θ(s).
(Figure: the logistic curve θ(s), increasing from 0 to 1.)
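A small sketch of θ and of the hypothesis h(x) = θ(w^t x) in NumPy (the function names are illustrative):

import numpy as np

def theta(s):
    # logistic function: theta(s) = 1 / (1 + e^{-s}) = e^s / (1 + e^s)
    return 1.0 / (1.0 + np.exp(-s))

def h(x, w):
    # hypothesis h(x) = theta(w^t x), a value in [0, 1] read as P[y = +1 | x]
    return theta(w @ x)

# sanity check of the identity theta(-s) = 1 - theta(s)
s = 1.7
assert abs(theta(-s) - (1 - theta(s))) < 1e-12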

Slide 5/23: The Data is Still Binary, ±1
D = (x_1, y_1), ..., (x_N, y_N), with y_n = ±1.
x_n = a person's health information; y_n = ±1: did they have a heart attack or not.
We cannot measure a probability. We can only see the occurrence of an event and try to infer a probability.

Slide 6/23: The Target Function is Inherently Noisy
f(x) = P[y = +1 | x].
The data is generated from a noisy target function:
    P(y | x) = f(x)       for y = +1;
    P(y | x) = 1 - f(x)   for y = -1.

Slide 7/23: What Makes an h Good?
'Fitting' the data means finding a good h.
h is good if:
    h(x_n) ≈ 1 whenever y_n = +1;
    h(x_n) ≈ 0 whenever y_n = -1.
A simple error measure that captures this:
    E_in(h) = (1/N) Σ_{n=1}^{N} ( h(x_n) - ½(1 + y_n) )².
Not very convenient (hard to minimize).
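As a quick illustration of this error measure, a sketch (the helper name and NumPy vectorization are assumptions of this sketch, not from the slide):

import numpy as np

def simple_error(h_vals, y):
    # E_in(h) = (1/N) * sum_n ( h(x_n) - (1 + y_n)/2 )^2
    # h_vals: the values h(x_n) in [0, 1]; y: the vector of +/-1 labels
    return np.mean((h_vals - (1 + y) / 2.0) ** 2)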

Slide 8/23: The Cross-Entropy Error Measure
    E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n})
It looks complicated and ugly (ln, e^(·), ...), but:
- it is based on an intuitive probabilistic interpretation of h;
- it is very convenient and mathematically friendly ('easy' to minimize).
Verify: y_n = +1 encourages w^t x_n ≫ 0, so θ(w^t x_n) ≈ 1; y_n = -1 encourages w^t x_n ≪ 0, so θ(w^t x_n) ≈ 0.
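A sketch of this error in NumPy (the function name is an illustrative choice; log1p is used only for numerical stability):

import numpy as np

def cross_entropy_error(w, X, y):
    # E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^t x_n))
    # X: N x (d+1) inputs, y: vector of +/-1 labels
    s = y * (X @ w)                       # the signals y_n w^t x_n
    return np.mean(np.log1p(np.exp(-s)))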

Slide 9/23: The Probabilistic Interpretation
Suppose that h(x) = θ(w^t x) closely captures P[y = +1 | x]:
    P(y | x) = θ(w^t x)       for y = +1;
    P(y | x) = 1 - θ(w^t x)   for y = -1.

Slide 10/23: The Probabilistic Interpretation (continued)
Since 1 - θ(s) = θ(-s), this is
    P(y | x) = θ(w^t x)    for y = +1;
    P(y | x) = θ(-w^t x)   for y = -1.

Slide 11/23: The Probabilistic Interpretation (continued)
...or, more compactly,
    P(y | x) = θ(y · w^t x).

Slide 12/23: The Likelihood
P(y | x) = θ(y · w^t x).
Recall: (x_1, y_1), ..., (x_N, y_N) are independently generated.
Likelihood: the probability of getting the y_1, ..., y_N in D from the corresponding x_1, ..., x_N:
    P(y_1, ..., y_N | x_1, ..., x_N) = Π_{n=1}^{N} P(y_n | x_n).
The likelihood measures the probability that the data were generated if f were h.

Slide 13/23: Maximizing the Likelihood
    max Π_{n=1}^{N} P(y_n | x_n)
  ⇔ max ln Π_{n=1}^{N} P(y_n | x_n)                      (why?)
  ⇔ max Σ_{n=1}^{N} ln P(y_n | x_n)
  ⇔ min -(1/N) Σ_{n=1}^{N} ln P(y_n | x_n)
  ⇔ min (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) )
  ⇔ min (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n · w^t x_n) )     (we specialize to our "model" here)
  ⇔ min (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n})       (since 1/θ(s) = 1 + e^{-s})
So:
    E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n}).
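A quick numerical check of the last step (the synthetic data, the seed, and the variable names are assumptions for illustration only): since 1/θ(s) = 1 + e^{-s}, the negative mean log-likelihood and the cross-entropy error agree.

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # synthetic inputs with a bias column
w = rng.normal(size=3)                               # an arbitrary weight vector
y = rng.choice([-1, 1], size=100)                    # arbitrary +/-1 labels

s = y * (X @ w)
neg_log_lik = -np.mean(np.log(1.0 / (1.0 + np.exp(-s))))  # -(1/N) sum ln theta(y_n w^t x_n)
cross_ent = np.mean(np.log1p(np.exp(-s)))                 # (1/N) sum ln(1 + e^{-y_n w^t x_n})
assert np.isclose(neg_log_lik, cross_ent)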

Slide 14/23: How To Minimize E_in(w)
- Classification: PLA/Pocket (iterative).
- Regression: pseudoinverse (analytic), from solving ∇E_in(w) = 0.
- Logistic regression: analytic won't work. Numerically/iteratively set ∇E_in(w) ≈ 0.

Slide 15/23: Finding the Best Weights - Hill Descent
A ball on a complicated hilly terrain rolls down to a local valley; this is called a local minimum.
Questions:
- How do we get to the bottom of the deepest valley?
- How do we do this when we don't have gravity?

Slide 16/23: Our E_in Has Only One Valley
...because E_in(w) is a convex function of w. (So, who cares if it looks ugly!)
(Figure: in-sample error E_in versus the weights w, a single valley.)

Slide 17/23: How to "Roll Down"?
Assume you are at weights w(t) and you take a step of size η in the direction v̂:
    w(t+1) = w(t) + η v̂.
We get to pick v̂. What's the best direction to take the step?
Pick v̂ to make E_in(w(t+1)) as small as possible.

Slide 18/23: The Gradient is the Fastest Way to Roll Down
Approximating the change in E_in (Taylor's approximation):
    ΔE_in = E_in(w(t+1)) - E_in(w(t))
          = E_in(w(t) + η v̂) - E_in(w(t))
          = η ∇E_in(w(t))^t v̂ + O(η²)
          ≥ -η ‖∇E_in(w(t))‖,
with the minimum attained at v̂ = -∇E_in(w(t)) / ‖∇E_in(w(t))‖.
The best (steepest) direction to move is the negative gradient:
    v̂ = -∇E_in(w(t)) / ‖∇E_in(w(t))‖.
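The slides do not write out ∇E_in for logistic regression; differentiating the cross-entropy error term by term gives ∇E_in(w) = -(1/N) Σ_n y_n x_n θ(-y_n w^t x_n). The sketch below implements that expression and checks it against a finite-difference estimate (the function names and the small random test data are assumptions of this sketch):

import numpy as np

def gradient_Ein(w, X, y):
    # grad E_in(w) = -(1/N) * sum_n y_n x_n / (1 + exp(y_n w^t x_n))
    s = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)

# finite-difference check on small random data
rng = np.random.default_rng(1)
X = np.c_[np.ones(20), rng.normal(size=(20, 2))]
y = rng.choice([-1, 1], size=20)
w = rng.normal(size=3)

def Ein(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

eps = 1e-6
numeric = np.array([(Ein(w + eps * e) - Ein(w - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(gradient_Ein(w, X, y), numeric, atol=1e-6)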

Slide 19/23: "Rolling Down" = Iterating the Negative Gradient
w(0) --negative gradient--> w(1) --negative gradient--> w(2) --negative gradient--> w(3) --> ...
(Figure: the iterates on the error surface; η = 0.5, 15 steps.)

Slide 20/23: The 'Goldilocks' Step Size
- η too small (small η): η = 0.1; 75 steps.
- η too large (large η): η = 2; 10 steps.
- Variable η_t, just right: variable η_t; 10 steps.
(Figures: in-sample error E_in versus the weights w for each of the three cases.)

Slide 21/23: Fixed Learning Rate Gradient Descent
Use a step size that scales with the gradient, η_t = η · ‖∇E_in(w(t))‖, since ‖∇E_in(w(t))‖ → 0 when closer to the minimum. Then
    η_t · v̂ = -η_t ∇E_in(w(t)) / ‖∇E_in(w(t))‖ = -η · ∇E_in(w(t)),
a fixed learning rate η on the raw gradient (Ex. 3.7 in LFD).

1: Initialize at step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:     Compute the gradient g_t = ∇E_in(w(t)).
4:     Move in the direction v_t = -g_t.
5:     Update the weights: w(t+1) = w(t) + η v_t.
6:     Iterate 'until it is time to stop'.
7: end for
8: Return the final weights.

Gradient descent can minimize any smooth function, for example
    E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n})    ← logistic regression.
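A runnable sketch of this algorithm applied to the logistic-regression E_in (the stopping rule, the default η, and the function name are choices of this sketch, not prescribed by the slide):

import numpy as np

def gradient_descent(X, y, eta=0.1, max_steps=10000, tol=1e-6):
    w = np.zeros(X.shape[1])                  # 1: initialize w(0)
    for t in range(max_steps):                # 2: for t = 0, 1, 2, ...
        s = y * (X @ w)
        g = -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)  # 3: g_t = grad E_in(w(t))
        v = -g                                # 4: move in the direction v_t = -g_t
        w = w + eta * v                       # 5: w(t+1) = w(t) + eta * v_t
        if np.linalg.norm(g) < tol:           # 6: one possible 'time to stop'
            break                             # 7: end for
    return w                                  # 8: return the final weights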

Slide 22/23: Stochastic Gradient Descent (SGD)
A variation of GD that considers only the error on one data point:
    E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n}) = (1/N) Σ_{n=1}^{N} e(w, x_n, y_n).
Pick a random data point (x_*, y_*) and run an iteration of GD on e(w, x_*, y_*):
    w(t+1) = w(t) - η ∇_w e(w, x_*, y_*).
1. The 'average' move is the same as GD;
2. Computation: a fraction 1/N cheaper per step;
3. Stochastic: helps escape local minima;
4. Simple;
5. Similar to PLA.
For logistic regression:
    w(t+1) = w(t) + η · y_* x_* / (1 + e^{y_* w^t x_*}).
(Recall PLA: w(t+1) = w(t) + y_* x_*.)
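A sketch of SGD for logistic regression using the single-point update above (the function name, the default η, the fixed step budget, and the random seed are assumptions of this sketch):

import numpy as np

def sgd_logistic(X, y, eta=0.1, num_steps=10000, seed=0):
    # Repeatedly pick a random (x*, y*) and apply
    #   w <- w + eta * y* x* / (1 + exp(y* w^t x*))
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        n = rng.integers(len(y))                        # pick a random data point
        s = y[n] * (X[n] @ w)
        w = w + eta * y[n] * X[n] / (1.0 + np.exp(s))   # one SGD step on e(w, x_n, y_n)
    return w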

Slide 23/23: GD versus SGD, a Picture
(Figures: GD with η = 6, 10 steps, versus SGD with η = 2, 30 steps; N = 10 data points.)

