
Lecture 2: Linear Regression

Feng Li
Shandong University
fli@sdu.edu.cn

September 14, 2020

Lecture 2: Linear Regression

1. Supervised Learning: Regression and Classification
2. Linear Regression
3. Gradient Descent Algorithm
4. Stochastic Gradient Descent
5. Revisiting Least Square
6. A Probabilistic Interpretation to Linear Regression

Supervised Learning

- Regression: predict a continuous value
- Classification: predict a discrete value, the class

Example (house prices):

Living area (feet^2)    Price (1000$s)
2104                    400
1600                    330
2400                    369
1416                    232
3000                    540
...                     ...

Supervised Learning (Contd.)

- Features: input variables, $x$
- Target: output variable, $y$
- Training example: $(x^{(i)}, y^{(i)})$, $i = 1, 2, 3, \ldots, m$
- Hypothesis: $h : X \to Y$

[Diagram: a training set is fed into a learning algorithm, which outputs a hypothesis $h$; a new $x$ (living area of a house) goes into $h$, which outputs the predicted $y$ (predicted price of the house).]

Linear Regression

Linear hypothesis: $h(x) = \theta_1 x + \theta_0$.
$\theta_i$ ($i = 0, 1$ in the 2D case): parameters to estimate.
How to choose the $\theta_i$'s?

Linear Regression (Contd.)

Input: Training set $(x^{(i)}, y^{(i)}) \in \mathbb{R}^2$ ($i = 1, \ldots, m$)
Goal: Model the relationship between $x$ and $y$ such that we can predict the corresponding target for any given new feature.

Linear Regression (Contd.)

The relationship between $x$ and $y$ is modeled as a linear function. A linear function in the 2D plane is a straight line.
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$ (where $\theta_0$ and $\theta_1$ are parameters)

Linear Regression (Contd.)

Given data $x \in \mathbb{R}^n$, we then have $\theta \in \mathbb{R}^{n+1}$.
Thus $h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$, where $x_0 = 1$.
What is the best choice of $\theta$?
$$\min_\theta J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2$$
where $J(\theta)$ is the so-called cost function.

Linear Regression (Contd.)

$$\min_\theta J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2$$

[Figure: the training data with a fitted line, illustrating the least-squares objective.]
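To make the cost function concrete, here is a minimal NumPy sketch (mine, not from the slides; the variable names are illustrative, and the five data points are the housing examples from the earlier table) that evaluates $h_\theta(x) = \theta^T x$ and $J(\theta)$:

    import numpy as np

    # Design matrix with x0 = 1 prepended to each example (m = 5 housing examples).
    living_area = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
    price = np.array([400, 330, 369, 232, 540], dtype=float)  # in 1000$s
    X = np.column_stack([np.ones_like(living_area), living_area])  # shape (m, 2)

    def cost(theta, X, y):
        """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
        residuals = X @ theta - y  # h_theta(x^(i)) - y^(i) for all i at once
        return 0.5 * np.dot(residuals, residuals)

    theta = np.array([0.0, 0.2])  # an arbitrary guess for (theta_0, theta_1)
    print(cost(theta, X, price))  # a lower value means a better fit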

Gradient

Definition (Directional Derivative). The directional derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ in the direction $u \in \mathbb{R}^n$ is
$$\nabla_u f(x) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h}$$
$\nabla_u f(x)$ represents the rate at which $f$ increases in the direction $u$. When $u$ is the $i$-th standard unit vector $e_i$, $\nabla_u f(x) = f_i'(x)$, where
$$f_i'(x) = \frac{\partial f(x)}{\partial x_i}$$
is the partial derivative of $f(x)$ w.r.t. $x_i$.

Gradient (Contd.)

Theorem. For any $n$-dimensional vector $u$, the directional derivative of $f$ in the direction of $u$ can be represented as
$$\nabla_u f(x) = \sum_{i=1}^{n} f_i'(x) \cdot u_i$$

Gradient (Contd.)

Proof. Letting $g(h) = f(x + hu)$, we have
$$g'(0) = \lim_{h \to 0} \frac{g(h) - g(0)}{h} = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h} = \nabla_u f(x) \quad (1)$$
On the other hand, by the chain rule,
$$g'(h) = \sum_{i=1}^{n} f_i'(x + hu) \frac{d(x_i + h u_i)}{dh} = \sum_{i=1}^{n} f_i'(x + hu)\, u_i \quad (2)$$
Letting $h \to 0$ in (2) gives $g'(0) = \sum_{i=1}^{n} f_i'(x) u_i$; substituting this into (1) completes the proof.

Gradient (Contd.)

Definition (Gradient). The gradient of $f$ is the vector function $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$ defined by
$$\nabla f(x) = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} e_i$$
where $e_i$ is the $i$-th standard unit vector. In another simple form,
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_n} \right)^T$$

Gradient (Contd.)

$$\nabla_u f(x) = \nabla f(x) \cdot u = \|\nabla f(x)\| \, \|u\| \cos a$$
where $a$ is the angle between $\nabla f(x)$ and $u$.
Without loss of generality, assume $u$ is a unit vector; then $\nabla_u f(x) = \|\nabla f(x)\| \cos a$.
When $u \propto \nabla f(x)$ such that $a = 0$ (and thus $\cos a = 1$), we have the maximum directional derivative of $f$, which implies that $\nabla f(x)$ is the direction of steepest ascent of $f$.
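The identity $\nabla_u f(x) = \nabla f(x) \cdot u$ is easy to check numerically. A small sketch (my own; the test function $f$ is an arbitrary choice):

    import numpy as np

    def f(x):
        # An arbitrary smooth test function f : R^2 -> R.
        return x[0] ** 2 + 3.0 * x[0] * x[1]

    def grad_f(x):
        # Its analytic gradient.
        return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

    x = np.array([1.0, 2.0])
    u = np.array([0.6, 0.8])  # a unit vector
    h = 1e-6

    numeric = (f(x + h * u) - f(x)) / h  # finite-difference directional derivative
    analytic = grad_f(x) @ u             # gradient dotted with u
    print(numeric, analytic)             # the two values agree closely (both approx 7.2)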

Gradient Descent (GD) Algorithm

If the multi-variable function $J(\theta)$ is differentiable in a neighborhood of a point $\theta$, then $J(\theta)$ decreases fastest if one goes from $\theta$ in the direction of the negative gradient of $J$ at $\theta$. Gradient descent finds a local minimum of a differentiable function.

Algorithm 1 Gradient Descent
1: Given a starting point $\theta \in \operatorname{dom} J$
2: repeat
3:   Calculate the gradient $\nabla J(\theta)$
4:   Update $\theta \leftarrow \theta - \alpha \nabla J(\theta)$
5: until the convergence criterion is satisfied

$\theta$ is usually initialized randomly; $\alpha$ is the so-called learning rate.
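Algorithm 1 translates directly into Python. A minimal sketch (the gradient-norm stopping rule and the iteration cap anticipate the convergence criteria on the next slide; all names are mine):

    import numpy as np

    def gradient_descent(grad, theta0, alpha=0.1, eps=1e-6, max_iters=10000):
        """Repeat theta <- theta - alpha * grad(theta) until convergence."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iters):        # fallback stopping criterion
            g = grad(theta)               # step 3: calculate the gradient
            if np.linalg.norm(g) <= eps:  # stop when the gradient is small
                break
            theta = theta - alpha * g     # step 4: move along the negative gradient
        return theta

    # Example: minimize J(theta) = (theta - 3)^2, whose gradient is 2 (theta - 3).
    print(gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=[0.0]))  # approx [3.]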

GD Algorithm (Contd.)

Stopping criteria (i.e., conditions for convergence):
- The gradient has magnitude less than or equal to a predefined threshold (say $\varepsilon$), i.e.,
$$\|\nabla f(x)\|_2 \le \varepsilon$$
where $\|\cdot\|_2$ is the $\ell_2$ norm, such that the values of the objective function differ only slightly in successive iterations.
- Set a fixed value for the maximum number of iterations, such that the algorithm terminates once the number of iterations exceeds the threshold.

GD Algorithm (Contd.)

In more detail, we update each component of $\theta$ according to the following rule:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta), \quad j = 0, 1, \cdots, n$$
Calculating the gradient for linear regression (writing the inner sum over index $k$ to keep it distinct from $j$):
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{m} \Big( \sum_{k=0}^{n} \theta_k x_k^{(i)} - y^{(i)} \Big)^2 = \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)}) \, x_j^{(i)}$$
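Combining this update rule with the housing data gives batch gradient descent for linear regression. A sketch (my own; the feature standardization and the learning rate are choices I added, since raw living-area values make the problem poorly conditioned for a fixed step size):

    import numpy as np

    living_area = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
    price = np.array([400, 330, 369, 232, 540], dtype=float)

    # Standardize the feature so a simple fixed learning rate behaves well.
    z = (living_area - living_area.mean()) / living_area.std()
    X = np.column_stack([np.ones_like(z), z])  # x0 = 1
    y = price

    theta = np.zeros(2)
    alpha = 0.1
    for _ in range(500):
        # Batch gradient sum_i (theta^T x^(i) - y^(i)) x_j^(i), vectorized as X^T (X theta - y);
        # dividing by m matches the 1/(2m)-scaled objective on a later slide.
        grad = X.T @ (X @ theta - y) / len(y)
        theta -= alpha * grad
    print(theta)  # theta_0 approx mean(price); theta_1 is the slope in standardized units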

GD Algorithm (Contd.)

[Figure: an illustration of the gradient descent algorithm. The objective function decreases fastest along the negative gradient.]

GD Algorithm (Contd.)

Another commonly used form:
$$\min_\theta J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$
What's the difference? The factor $m$ is introduced to scale the objective function, so as to cope with differently sized training sets.

Gradient ascent algorithm:
- Maximize the differentiable function $J(\theta)$.
- The gradient represents the direction along which $J$ increases fastest.
- Therefore, we have
$$\theta_j \leftarrow \theta_j + \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Convergence under Different Step Sizes

[Figure: objective function value vs. number of iterations for step sizes 0.6, 0.06, 0.07, and 0.071.]

Stochastic Gradient Descent (SGD)

What if the training set is huge? In the above batch gradient descent algorithm, we have to run through the entire training set in each iteration, which induces a considerable computation cost!
Stochastic gradient descent (SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method. In each iteration, the parameters are updated according to the gradient of the error with respect to one training sample only.

Stochastic Gradient Descent (Contd.)

Algorithm 2 Stochastic Gradient Descent for Linear Regression
1: Given a starting point $\theta \in \operatorname{dom} J$
2: repeat
3:   Randomly shuffle the training data
4:   for $i = 1, 2, \cdots, m$ do
5:     $\theta \leftarrow \theta - \alpha \nabla J(\theta; x^{(i)}, y^{(i)})$
6:   end for
7: until the convergence criterion is satisfied
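Algorithm 2 in Python, reusing the standardized housing data from the batch sketch (again my own sketch; the fixed epoch count stands in for a real convergence check, and the per-sample gradient is the earlier derivation with a single term):

    import numpy as np

    rng = np.random.default_rng(0)

    living_area = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
    price = np.array([400, 330, 369, 232, 540], dtype=float)
    z = (living_area - living_area.mean()) / living_area.std()
    X = np.column_stack([np.ones_like(z), z])
    y = price

    theta = np.zeros(2)
    alpha = 0.05
    for _ in range(200):                 # epochs, in place of a convergence test
        order = rng.permutation(len(y))  # step 3: randomly shuffle the training data
        for i in order:                  # steps 4-6: update on one sample at a time
            grad_i = (X[i] @ theta - y[i]) * X[i]
            theta -= alpha * grad_i
    print(theta)  # close to the batch GD solution, up to SGD noise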

More About SGD

- The objective does not always decrease at each iteration.
- Usually, SGD brings $\theta$ close to the minimum much faster than batch GD.
- SGD may never converge to the minimum, and oscillation may happen.
- A variant: mini-batch SGD picks a small group of samples and averages their gradients, which may accelerate and smooth the convergence.

Matrix Derivatives [1]

For a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, the derivative of $f$ with respect to $A$ is defined as
$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
For an $n \times n$ matrix, its trace is defined as $\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$.
- $\operatorname{tr} ABCD = \operatorname{tr} DABC = \operatorname{tr} CDAB = \operatorname{tr} BCDA$
- $\operatorname{tr} A = \operatorname{tr} A^T$, $\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B$, $\operatorname{tr} aA = a \operatorname{tr} A$
- $\nabla_A \operatorname{tr} AB = B^T$, $\nabla_{A^T} f(A) = (\nabla_A f(A))^T$
- $\nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T$, $\nabla_A |A| = |A| (A^{-1})^T$
- Funky trace derivative: $\nabla_{A^T} \operatorname{tr} ABA^T C = B^T A^T C^T + B A^T C$

[1] Details can be found in "Properties of the Trace and Matrix Derivatives" by John Duchi.
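These identities are easy to spot-check numerically. A sketch (mine, not from the slides) verifying $\nabla_A \operatorname{tr}(AB) = B^T$ by finite differences:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 4))
    B = rng.standard_normal((4, 3))

    def f(A):
        return np.trace(A @ B)

    # Finite-difference gradient of f with respect to A, entry by entry.
    eps = 1e-6
    num_grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            num_grad[i, j] = (f(A + E) - f(A)) / eps

    print(np.allclose(num_grad, B.T, atol=1e-4))  # True: grad_A tr(AB) = B^T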

Revisiting Least Square

Assume
$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
Therefore, we have
$$X\theta - Y = \begin{bmatrix} (x^{(1)})^T \theta - y^{(1)} \\ \vdots \\ (x^{(m)})^T \theta - y^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$$
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2} (X\theta - Y)^T (X\theta - Y)$$

Revisiting Least Square (Contd.)

Minimize $J(\theta) = \frac{1}{2} (Y - X\theta)^T (Y - X\theta)$. Calculate its derivative with respect to $\theta$:
$$\begin{aligned}
\nabla_\theta J(\theta) &= \frac{1}{2} \nabla_\theta (Y - X\theta)^T (Y - X\theta) \\
&= \frac{1}{2} \nabla_\theta (Y^T - \theta^T X^T)(Y - X\theta) \\
&= \frac{1}{2} \nabla_\theta \operatorname{tr}(Y^T Y - Y^T X\theta - \theta^T X^T Y + \theta^T X^T X \theta) \\
&= \frac{1}{2} \nabla_\theta \operatorname{tr}(\theta^T X^T X \theta) - X^T Y \\
&= \frac{1}{2} (X^T X \theta + X^T X \theta) - X^T Y \\
&= X^T X \theta - X^T Y
\end{aligned}$$
Setting $\nabla_\theta J(\theta) = 0$ yields the normal equations $X^T X \theta = X^T Y$.
Tip: Funky trace derivative $\nabla_{A^T} \operatorname{tr} ABA^T C = B^T A^T C^T + B A^T C$.

Revisiting Least Square (Contd.)

Theorem. The matrix $A^T A$ is invertible if and only if the columns of $A$ are linearly independent. In this case, there exists exactly one least-squares solution:
$$\theta = (X^T X)^{-1} X^T Y$$
Prove the above theorem in Problem Set 1.
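The closed-form solution on the housing data (a sketch; solving the normal equations $X^T X \theta = X^T Y$ with `np.linalg.solve` rather than forming the explicit inverse is my choice, a standard numerical practice):

    import numpy as np

    living_area = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
    price = np.array([400, 330, 369, 232, 540], dtype=float)
    X = np.column_stack([np.ones_like(living_area), living_area])
    Y = price

    theta = np.linalg.solve(X.T @ X, X.T @ Y)  # the unique least-squares solution
    print(theta)      # [theta_0, theta_1]
    print(X @ theta)  # fitted prices h_theta(x^(i))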

Probabilistic Interpretation

The target variables and the inputs are related by
$$y = \theta^T x + \epsilon$$
The $\epsilon$'s denote the errors and are independently and identically distributed (i.i.d.) according to a Gaussian distribution $\mathcal{N}(0, \sigma^2)$. The density of $\epsilon^{(i)}$ is given by
$$f(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right)$$
The conditional probability density function of $y$:
$$y \mid x; \theta \sim \mathcal{N}(\theta^T x, \sigma^2)$$

Probabilistic Interpretation (Contd.)

The training data $\{x^{(i)}, y^{(i)}\}_{i=1,\cdots,m}$ are sampled identically and independently:
$$p(y = y^{(i)} \mid x = x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$
Likelihood function:
$$L(\theta) = \prod_i p(y^{(i)} \mid x^{(i)}; \theta) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

Probabilistic Interpretation (Contd.)

Maximizing the likelihood $L(\theta)$: since $L(\theta)$ is complicated, we maximize an increasing function of $L(\theta)$ instead:
$$\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_i (y^{(i)} - \theta^T x^{(i)})^2
\end{aligned}$$
Apparently, maximizing $L(\theta)$ (and thus $\ell(\theta)$) is equivalent to minimizing
$$\frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2$$
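A numerical illustration of this equivalence (my own sketch; the synthetic data and the value of $\sigma$ are arbitrary): for any two parameter vectors, the one with the smaller squared error always has the larger log-likelihood, since $\ell(\theta)$ is a fixed constant minus the squared error divided by $\sigma^2$.

    import numpy as np

    rng = np.random.default_rng(2)
    m = 50
    X = np.column_stack([np.ones(m), rng.standard_normal(m)])
    true_theta = np.array([1.0, 2.0])
    sigma = 0.5
    y = X @ true_theta + rng.normal(0.0, sigma, size=m)  # y = theta^T x + eps

    def squared_error(theta):
        r = y - X @ theta
        return 0.5 * np.dot(r, r)

    def log_likelihood(theta):
        # l(theta) = m log(1 / (sqrt(2 pi) sigma)) - squared_error(theta) / sigma^2
        return m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - squared_error(theta) / sigma**2

    for theta in (np.array([0.0, 0.0]), np.array([1.0, 2.0])):
        print(theta, squared_error(theta), log_likelihood(theta))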

Thanks!

Q&A
